There's definitely a funky curve of what you can and cannot comfortably do on an iPad. If you're pinned to IDEs and need lots of local graphical tools to support development, an iPad is unusable. If you already have to run all your work remotely since the tools are too heavy even for a laptop (like me with EDA tools), it turns out that the iPad makes for a great little client. I use mine a lot with iSH. I can do work locally in vim and then submit jobs to the compute cluster; it's the exact same workflow I'd use on a laptop.
This is an idea many have had before but it doesn't quite work. When you do this, you tend to lose all the performance gained from speculative execution. It's essentially data-independent-timing as applied to loads and stores, so you have to treat all hits as if they were misses to DRAM, which is not particularly appealing from a performance standpoint.
This is not to mention the fact that you can use transient execution itself (without any side channels) to amplify a single cache line being present/not present into >100ms of latency difference. Unless your plan is to burn 100ms of compute time to hide such an issue (nobody is going to buy your core in that case), you can't solve the problem this way.
Why treat hits as misses to DRAM? Just use the cache for speculated branches. The performance gain from the difference between the length of the speculated branch and the length of the bookkeeping is still there. There are workloads with short branches that would see a performance penalty; in those cases it would be helpful to have a flag in the instruction field to stop speculative execution.
It's not that simple. The problem is not just branches but often the intersection of memory and branches. For example, a really powerful technique for amplification is this:
ldr x2, [x2]      /* the load whose hit/miss status we want to amplify */
cbnz x2, skip     /* predicted not taken, opening a speculative window below */
/* bunch of slow operations */
ldr x1, [x1]      /* probe loads: only issued if the x2 miss keeps the window open */
add x1, x1, CACHE_STRIDE
ldr x1, [x1]
add x1, x1, CACHE_STRIDE
ldr x1, [x1]
add x1, x1, CACHE_STRIDE
ldr x1, [x1]
add x1, x1, CACHE_STRIDE
skip:
Here, if the branch condition is predicted not taken and ldr x2 misses in the cache, the CPU will speculatively execute long enough to launch the four other loads. If x2 is in the cache, the branch condition will resolve before we execute the loads. This gives us a 4x signal amplification using absolutely no external timing, just exploiting the fact that misses lead to longer speculative windows.
After repeating this procedure enough times and amplifying your signal, you can then directly measure how long it takes to load all of these amplified lines (no mispredicted branches required!). Simply start the clock, load each line one by one in a for loop, and then stop the clock.
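As a minimal sketch of that measurement loop in C (NUM_LINES, CACHE_STRIDE, and the buffer layout are illustrative assumptions, not anything from the snippet above):

#include <stdint.h>
#include <time.h>

/* Illustrative constants only. */
#define NUM_LINES    4
#define CACHE_STRIDE 4096

/* Time how long it takes to touch every amplified line. In the scheme
 * above, a small total means the probe loads ran speculatively (the x2
 * load missed); a large total means they were squashed (the x2 load hit). */
static uint64_t measure_lines(volatile uint8_t *base)
{
    struct timespec start, end;
    uint64_t sink = 0;

    clock_gettime(CLOCK_MONOTONIC, &start);      /* start the clock */
    for (int i = 0; i < NUM_LINES; i++)
        sink += base[i * CACHE_STRIDE];          /* load each line one by one */
    clock_gettime(CLOCK_MONOTONIC, &end);        /* stop the clock */

    (void)sink;
    return (uint64_t)((end.tv_sec - start.tv_sec) * 1000000000LL
                      + (end.tv_nsec - start.tv_nsec));
}

With several lines per iteration (and however many repetitions you stack on top), even a coarse timer like this is enough to read the signal, which is the whole point of the amplification.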
As I mentioned earlier, unless your plan is to treat every hit as a miss to DRAM, you can't hide this information.
The current sentiment for Spectre mitigations is that once information has leaked into side channels you can't do anything to stop attackers from extracting it. There are simply too many ways to expose uarch state (and caches are not the only side channel!). Instead, your best and only bet is to prevent important information from leaking in the first place.
Some timing differences are inherent, but whether they are exploitable is the real question. There are papers and tools that can give you high confidence that you are not leaking.
Much of the transient execution research over the years has been invalidated or was completely bogus to begin with. It was extremely easy to get a paper into a conference for a while (and frankly still is) just by throwing in the right words, because most people don't understand the issue well enough to tell which techniques are real and practical and which are totally non-functional.
You have to stop the leak into side channels in the first place; it's simply not practical to try to prevent secrets from escaping out of side channels after the fact. This is, unfortunately, the much harder problem with much worse performance implications (and indeed the reason why Spectre v1 is still almost entirely unmitigated).
Modeling programs as circuits also makes them significantly easier to formally verify! These sorts of synthesis tools are really cool, though writing traditional software in them is extremely painful.
My favorite is when the codebase is so deeply buried in macros and headers that send you on a wild goose chase to find any actual code that it becomes much easier to just dump the binary in IDA/Binja. The source code can lie, but at least the compiled binary does exactly what it says.
That feeling when you finally, finally find the bit of code you’ve been looking for… and you can no longer remember why you were looking for it, because you’ve completely purged your short term memory.
Having to track down macros across several files really annoys me as well. When I write macros in C, I place them just above the code where they are used and undefine them immediately after.
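Concretely, the pattern looks something like this (CLAMP and the function around it are just made-up examples):

/* Defined just above the code that uses it... */
#define CLAMP(x, lo, hi) ((x) < (lo) ? (lo) : ((x) > (hi) ? (hi) : (x)))

static int clamp_brightness(int raw)
{
    return CLAMP(raw, 0, 255);
}

/* ...and undefined immediately after, so it can't silently leak into
 * anything else that happens to include or follow this code. */
#undef CLAMP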
Using #undef is a big one that wasn't mentioned, but oh god it is the ideal way to hide things. If you use it sparingly but in critical places in header files, especially to undef something potentially defined in three other headers, it becomes impossible to find the real substitution without reading the cc output.
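Roughly the shape of the trap, with completely made-up header and macro names:

/* config_a.h */
#ifndef LOG_LEVEL
#define LOG_LEVEL 3
#endif

/* config_b.h */
#ifndef LOG_LEVEL
#define LOG_LEVEL 1
#endif

/* quirks.h -- quietly wins no matter what was set before it */
#undef LOG_LEVEL
#define LOG_LEVEL 0

/* Grepping for LOG_LEVEL now turns up three plausible definitions, and which
 * one is live depends entirely on include order. The only reliable way to see
 * what the compiler actually saw is to read the preprocessed output
 * (e.g. cc -E file.c). */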
My sad related factoid to this is that suicide overtook car accidents as the leading cause of death for teenagers in Colorado. Suicide is a worsening problem, but the bigger factor is that crash safety and driver education have pushed traffic deaths down faster than the suicide rate has been rising, which is what caused the two to flip for the first time.
RISC is not about the number of instructions but rather about what the instructions do. The famous example of CISC taken to its logical extreme is the VAX's polynomial evaluation instruction (POLY), which ended up being almost a full program in a single instruction. RISC tends to go the other way, focusing on things that are easy for hardware to do and leaving everything else to software.
What sort of wiki are you envisioning here? There is some decent tooling and documentation around generating SoCs [1] but, as the article mentions, the most difficult part is not creating a single RISCV core but rather creating a very high performance interconnect. This is still an open and rich research area, so your best source of information is likely to be Google Scholar.
But, for what it's worth, there do seem to be some practical considerations why your idea of a hugely parallel computer would not meaningfully rival the M1 (or any other modern processor). The issue that everyone has struggled with for decades now is that lots of tasks are simply very difficult to parallelize. Hardware people would love to be able to just give software N times more cores and make it go N times faster, but that's not how it works. The most famous enunciation of this is Amdahl's Law [2]. So, for most programs people use today, 1024 tiny slow cores may very well be significantly worse than the eight fast, wide cores you can get on an M1.
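To put rough numbers on Amdahl's Law: speedup is capped at 1 / ((1 - p) + p/N), where p is the parallelizable fraction of the work. The 95% figure below is purely an illustrative assumption:

#include <stdio.h>

/* Amdahl's Law: best-case speedup on N cores when a fraction p of the
 * work can be parallelized. */
static double amdahl(double p, double n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    double p = 0.95;  /* assume 95% of the program parallelizes perfectly */

    printf("8 cores:    %.1fx\n", amdahl(p, 8.0));    /* ~5.9x  */
    printf("1024 cores: %.1fx\n", amdahl(p, 1024.0)); /* ~19.6x */
    /* The limit as N grows is 1/(1-p) = 20x, and that's before accounting
     * for each of the 1024 cores being individually much slower. */
    return 0;
}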
The problem isn't that algorithms are inherently sequential, though, but rather that parallel programming is a separate discipline.
In single-threaded programming you have almost infinite flexibility: you can acquire as many resources as you like, in any arbitrary order. In multithreaded programming you must limit the number of accessible resources, and the acquisition order should be well defined.
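As a concrete instance of "the acquisition order should be well defined", the classic discipline in C is to always take locks in one fixed global order (the two locks here are placeholders):

#include <pthread.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

/* Every code path that needs both resources takes the locks in the same
 * order (a before b). If one thread took a->b while another took b->a,
 * the two could deadlock waiting on each other. */
void update_both(void)
{
    pthread_mutex_lock(&lock_a);
    pthread_mutex_lock(&lock_b);
    /* ... touch both resources ... */
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
}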
In my opinion, expecting people to write parallel algorithms is asking too much, not because it is too difficult but because it has to permeate your entire codebase. That is a nonstarter unless the required changes are reasonable.
The challenge then becomes: how do we let people write single-threaded programs that can run on multiple cores and gracefully degrade toward single-threaded execution the less well the code is optimized?
I don't have the perfect answer, but I think there is an opportunity for a trinity of techniques used in combination: lock hierarchies, STM (software transactional memory) and the actor model.
There is a unit of parallelism, like an actor, that executes code in a sequential, single-threaded fashion. Multiple of these units run in parallel; however, rather than communicating through messages, STM is used to optimistically execute code and obtain a runtime heuristic of which locks were acquired. If there are no conflicts, performance scales linearly. If there are conflicts, then by carefully defining a hierarchy of the resources you can calculate the optimal level in the hierarchy at which to execute the STM transaction. This lets the transaction succeed with a much higher probability, which eliminates the primary downside of STM: performance lost to failed transactions, whose failure rate creeps up the more resources are being acquired.
A lock hierarchy could look like this: object, person, neighborhood, city, state, country, planet.
You write an STM transaction that looks like single-threaded code. It loops over all the people in Times Square and would thereby acquire their locks. If that transaction were executed at the object level, however, it would almost certainly fail, because out of thousands of people only one needs to be changed by another transaction for it to fail. Since the transaction acquired a thousand people, its optimal level in the hierarchy is the neighborhood: only one lock needs to be acquired to process thousands of people. If it turns out the algorithm also needs information like the postal addresses of these people, some of them may be tourists, so you would be acquiring resources from all over the world and might need to execute the transaction at the highest level of the hierarchy for it to finish.
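A very rough sketch of that escalation loop in C, with an entirely hypothetical stm_* API (none of these functions exist in any real library; they are placeholders for the bookkeeping described above):

/* Hypothetical STM primitives, declared but not implemented. */
typedef enum { OBJECT, PERSON, NEIGHBORHOOD, CITY, STATE, COUNTRY, PLANET } level_t;

void    stm_begin(level_t lvl);   /* open a transaction pinned to one hierarchy level */
int     stm_commit(void);         /* 1 on success, 0 on conflict */
level_t stm_suggest_level(void);  /* coarser level inferred from the footprint the
                                     failed attempt recorded */

/* Run a single-threaded-looking transaction body, escalating up the lock
 * hierarchy when fine-grained attempts keep conflicting. */
void run_txn(void (*body)(void *), void *ctx)
{
    level_t lvl = OBJECT;
    for (;;) {
        stm_begin(lvl);
        body(ctx);                     /* e.g. loop over everyone in Times Square */
        if (stm_commit())
            return;                    /* no conflicts: the linear-scaling case */
        if (lvl != PLANET)
            lvl = stm_suggest_level(); /* e.g. thousands of PERSON locks collapse
                                          into one NEIGHBORHOOD lock next time */
    }
}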
The primary objection would be the dependence on STM for obtaining the necessary information about which resources have been acquired. This means that, ideally, all state is immutable, to make software transactional memory as cheap as possible to implement. This is not a killer, but it means that if you were to water this approach down, the acquired resources would have to be static, i.e. known at compile time, removing the need for optimistic execution. That still works, and it lets you get away with a lot more single-threaded code than you might think, especially in legacy codebases.
It is fairly different from a systems programming perspective. The base instructions (I) are essentially everything you'd expect when writing any kind of regular program, and it feels very normal and natural to anyone who has ever written assembly, but once you start needing fancier things like exceptions you'll see a lot of RISCV-specific design choices. This is sort of to be expected; x86 has a different exception architecture from ARMv8-A too. It's just different, not necessarily less capable. Odds are whatever RISCV MCUs come on the market will eventually be supported by the same SDKs you know and love, but they will have a different implementation for all the system-specific functions.
For all those learning (or even those who've learned :P), my favorite cheatsheet that I always pull up while writing ARMv8-A assembly is this one [1] from the University of Washington. ARMv8-A has a lot of fairly complex instructions and sometimes it's hard to remember all the odds and ends.