I think a common blind spot that makes it difficult to take full advantage of modern silicon is drawing distinctions between parallelism and concurrency in code when those distinctions are, in practice, ambiguous.
The canonical, overly reductive examples are parallelism as trivial SIMD loop parallelism and concurrency as multiple threads working on different but related tasks. If those are your only mental models, the opportunities you can exploit will be limited. Parallelism, in the sense of executing multiple physical operations in a single instruction, is not limited to e.g. an arithmetic operation applied across an array of numbers. If you can execute semantically orthogonal threads of logic in the same instruction, you arguably have both concurrency and parallelism at the same time.
For example, one of the most effective models for using AVX-512 is to treat the registers as specialized engines for computing on 64-byte structs, where the struct is an arbitrary collection of unrelated fields of different types and sizes. There are idiomatic techniques for doing many semantically orthogonal computations on the fields of these structs in parallel using the same handful of vector instructions; it essentially turns SIMD into MIMD with clever abstractions. A well-known simple example is searching row-structured data by evaluating a collection of different predicates across any number of columns in parallel with a short sequence of vector intrinsics. For some workloads this makes query performance faster than a columnar layout. For most code there is substantially more parallelism available within a single thread of execution than you will find if you are only looking for trivial array parallelism. It is rare to see code that exploits this, but the gains can be large.
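To make that concrete, here is a minimal sketch of the row-predicate idea in C with AVX-512 intrinsics. It simplifies the struct to sixteen int32 columns (the real technique mixes field sizes), and the column indices, constants, and predicates are assumptions invented for the example, not anyone's production layout:

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical row: sixteen 32-bit columns packed into one cache line. */
typedef struct { int32_t col[16]; } row64;

/* WHERE col[0] == 7 AND col[3] > 100 AND col[9] < 50 */
static inline int row_matches(const row64 *r) {
    __m512i row = _mm512_loadu_si512(r);       /* one load: the whole row */

    /* Per-lane constants: 7 in lane 0, 100 in lane 3, 50 in lane 9. */
    const __m512i consts = _mm512_setr_epi32(
        7, 0, 0, 100, 0, 0, 0, 0, 0, 50, 0, 0, 0, 0, 0, 0);

    /* One masked compare per predicate type; each touches only its lane. */
    __mmask16 eq = _mm512_mask_cmpeq_epi32_mask(0x0001, row, consts); /* lane 0 */
    __mmask16 gt = _mm512_mask_cmpgt_epi32_mask(0x0008, row, consts); /* lane 3 */
    __mmask16 lt = _mm512_mask_cmplt_epi32_mask(0x0200, row, consts); /* lane 9 */

    /* The row matches iff all three lane bits came back set. */
    return (eq | gt | lt) == (__mmask16)0x0209;
}
```

One load plus three masked compares evaluates three unrelated predicates simultaneously, and adding more predicates of the same comparison type is essentially free since a single compare covers every lane that shares its predicate. Compile with -mavx512f.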
On the other hand, I think concurrency is largely a solved problem to the extent that you can use thread-per-core architectures, particularly if you don't rely on shared task queues or work stealing to balance load.
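Concretely, the shape I mean is something like this (a minimal sketch, not a production design; the queue sizing, routing hash, and pinning details are all illustrative assumptions): one pinned thread per core, each owning a private single-producer/single-consumer queue, with work routed by key so no shared queue or stealing is needed.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

#define NCORES 4
#define QCAP   1024               /* power of two; illustrative capacity */

typedef struct {                  /* single-producer/single-consumer ring */
    _Atomic uint32_t head, tail;
    uint64_t items[QCAP];
} ring;

static ring queues[NCORES];

/* Route each task to a fixed core by key: no shared queue, no stealing.
   Full-queue check omitted for brevity. */
static void submit(uint64_t key, uint64_t task) {
    ring *q = &queues[key % NCORES];
    uint32_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    q->items[t & (QCAP - 1)] = task;
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
}

static void *worker(void *arg) {
    long core = (long)arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof set, &set); /* pin to core */

    ring *q = &queues[core];
    for (;;) {
        uint32_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
        if (h == atomic_load_explicit(&q->tail, memory_order_acquire))
            continue;             /* this core's queue is empty; spin */
        uint64_t task = q->items[h & (QCAP - 1)];
        (void)task;               /* ... process task on its home core ... */
        atomic_store_explicit(&q->head, h + 1, memory_order_release);
    }
    return 0;
}
```

The idea is that ownership follows the routing key, so each core's working set stays in its own cache and there is no cross-core contention left to balance away.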
> particularly if you don't rely on shared task queues or work stealing to balance load.
Anywhere I can read up on better load balancing techniques? Or, are we talking about “Know ahead of time how long each task will reliably run so you can statically schedule everything”?