Branch prediction: fundamentals every programmer need not know (mycpu.org)
70 points by mycpuorg on Feb 2, 2020 | 15 comments


Very shallow. I prefer the popular Stack Overflow answer by Mysticial:

https://stackoverflow.com/questions/11227809/why-is-processi...


A relevant question from there: Is everything I learned about big-O performance irrelevant now?

The right answer is: No, just pay attention to what you are counting. The expensive ops today are pipeline stalls, caused by cache misses of all kinds. Branch prediction failures are a kind of cache miss. Count pipeline stalls instead of (e.g.) additions, comparisons, or swaps, and then all your big-O intuitions still work.

Modern CPUs maintain counters of various causes of pipeline stalls, so you don't need to guess how many you have had.
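
For instance, on Linux you can read the branch-miss counter from userspace via perf_event_open. A minimal sketch (not from the article; error handling mostly omitted):

  #include <linux/perf_event.h>
  #include <sys/ioctl.h>
  #include <sys/syscall.h>
  #include <unistd.h>
  #include <cstdint>
  #include <cstdio>
  #include <cstring>

  int main() {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_BRANCH_MISSES;
    attr.disabled = 1;        // start stopped; enable around the region
    attr.exclude_kernel = 1;  // count user-space misses only

    // perf_event_open has no glibc wrapper; invoke it via syscall().
    int fd = (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) return 1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    // ... code under measurement goes here ...
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    std::uint64_t misses = 0;
    read(fd, &misses, sizeof(misses));
    std::printf("branch misses: %llu\n", (unsigned long long)misses);
    close(fd);
    return 0;
  }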

. . .

There are techniques to transform computations to branchless forms, of which cmov is a special case. A useful primitive, for example, is

  #include <utility>  // std::move

  template <typename T>
  bool swap_if(bool c, T& a, T& b) {
    // Stash both values; moves let this work for non-trivial types too.
    T v[2] = { std::move(a), std::move(b) };
    b = std::move(v[1 - c]);  // old a when c is true, old b otherwise
    a = std::move(v[c]);      // old b when c is true, old a otherwise
    return c;
  }
Then the loop body in a partition becomes

  l += swap_if(*r < p, *l, *r);
with no branches. (A compiler could easily translate this to two cmov instructions when T is a machine word, though none do, just yet.) This easily doubles the performance of quicksort on modern CPUs.
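
For context, that loop body sits in a partition like this (my sketch, not the original comment's; shown with plain ints and a Lomuto-style scan):

  // Elements in [first, ret) end up less than the pivot; l advances
  // only when swap_if actually performed a swap.
  int* branchless_partition(int* first, int* last, int pivot) {
    int* l = first;
    for (int* r = first; r != last; ++r)
      l += swap_if(*r < pivot, *l, *r);
    return l;
  }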

Curiously, the time taken when this is compiled with g++ is hugely worse if a is assigned before b, for very non-obvious reasons.


I love the explanation, but why the guy suggested complicated bit work instead of just doing sum += arr[i] > 128 ? arr[i] : 0 is beyond me.
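
For reference, the bit work in question goes roughly like this (reconstructed from the linked answer; it assumes >> on a negative int is an arithmetic shift, which is implementation-defined in C++ but holds on mainstream compilers):

  int t = (arr[i] - 128) >> 31;  // all ones when arr[i] < 128, else 0
  sum += ~t & arr[i];            // adds arr[i] only when arr[i] >= 128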


That would not evoke the behavior he wanted to call attention to.


There's still a conditional branch in that statement.


That depends on the compiler; I'd expect most to optimize it into a cmov.
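
One way to make the branchless form explicit rather than hoping the optimizer picks it (a sketch; the bool comparison promotes to 0 or 1, so the multiply zeroes out rejected values):

  sum += (arr[i] > 128) * arr[i];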


"This results in a loss of a single cycle at the time of instruction fetch." Maybe on that paper CPU but branch mis-predicts on Skylake are 16.5 cycles if there's a μop cache hit and 19-20 cycles if there isn't.

https://www.7-cpu.com/cpu/Skylake.html

That said, I didn't know about using BPM to access the PMC (performance monitoring counter) registers.


Related: Dan Luu's article on branch prediction is pretty good: https://danluu.com/branch-prediction/


Feels very shallow. I was surprised to reach the end of the article so quickly.


Instead of complaining, post a follow-up. The author committed the time to write 1,000 words on the topic; without that limit, the article probably would not have been written.


What is the power consumption hit from branch prediction?


It's not an isolated feature, so that's not an answerable question. Branch prediction (and speculation more generally) exists because processors have long pipelines, and stalling until a branch actually resolves is slow. Processors have long pipelines because that allows the individual circuit paths to be short and to operate at high clock speeds. If processors didn't need to run at high clock speeds they wouldn't have long pipelines and wouldn't need speculation[1]. And they would be very low power. But also very slow.

So you can't really look at a modern architecture and ask how much of its power consumption is because of "branch prediction" vs. other things. They all feed back.

[1] They'd still want branch prediction. Everything wants branch prediction. Even the mid-80's RISC architectures implemented in 30k transistors had a "branch delay slot", which was effectively compiler-managed branch prediction for a single cycle of latency.


You'd need two identical processors but one with branch prediction turned off to find that out.

I don't think we have good models for power consumption to find out in a simulation, either.


> I don't think we have good models for power consumption to find out in a simulation, either.

There are excellent models for predicting power consumption, but you'll need access to the processor RTL (the design of the processor, typically written in a language such as Verilog or VHDL) and/or the outputs of the implementation flow (like the netlist, which describes the actual logic gates and their connectivity) to be able to use them.

Such things are normally not available outside the company building the processor and potentially their customers.


Branch prediction can usually be disabled during processor initialization (as well as other features like cache).

I was bringing up Linux on new hardware one time. It would get partway into booting and then crash at a random spot. The kernel config has an option to disable the D-cache, and I found that it would boot fine with the cache disabled. It turned out the root problem was that the processor voltage wasn't being set high enough; disabling the D-cache must have reduced power consumption just enough to let it boot at the lower voltage.



