Branch prediction: fundamentals every programmer need not know (mycpu.org)
70 points by mycpuorg on Feb 2, 2020 | 15 comments


Very shallow. I prefer the popular Stack Overflow answer by Mysticial:

https://stackoverflow.com/questions/11227809/why-is-processi...


A relevant question from there: Is everything I learned about big-O performance irrelevant now?

The right answer is: No, just pay attention to what you are counting. The expensive ops today are pipeline stalls, caused by cache misses of all kinds. Branch prediction failures are a kind of cache miss. Count pipeline stalls instead of (e.g.) additions, comparisons, or swaps, and then all your big-O intuitions still work.

Modern CPUs maintain counters of various causes of pipeline stalls, so you don't need to guess how many you have had.
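
For instance, on Linux you can read the branch-miss counter from userspace via perf_event_open. A minimal sketch (not from the article; error handling mostly omitted):

  #include <linux/perf_event.h>
  #include <sys/ioctl.h>
  #include <sys/syscall.h>
  #include <unistd.h>
  #include <cstdint>
  #include <cstdio>
  #include <cstring>

  int main() {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_BRANCH_MISSES;
    attr.disabled = 1;        // start stopped; enable around the region
    attr.exclude_kernel = 1;  // count user-space misses only

    // perf_event_open has no glibc wrapper; invoke it via syscall().
    int fd = (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) return 1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    // ... code under measurement goes here ...
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    std::uint64_t misses = 0;
    read(fd, &misses, sizeof(misses));
    std::printf("branch misses: %llu\n", (unsigned long long)misses);
    close(fd);
    return 0;
  }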

. . .

There are techniques to transform computations to branchless forms, of which cmov is a special case. A useful primitive, for example, is

  #include <utility>  // std::move

  template <typename T>
  bool swap_if(bool c, T& a, T& b) {
    // Stash both values; moves let this work for non-trivial types too.
    T v[2] = { std::move(a), std::move(b) };
    b = std::move(v[1 - c]);  // old a when c is true, old b otherwise
    a = std::move(v[c]);      // old b when c is true, old a otherwise
    return c;
  }
Then the loop body in a partition becomes

  l += swap_if(*r < p, *l, *r);
with no branches. (A compiler could easily translate this to two cmov instructions when T is a machine word, though none do, just yet.) This easily doubles the performance of quicksort on modern CPUs.
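
For context, that loop body sits in a partition like this (my sketch, not the original comment's; shown with plain ints and a Lomuto-style scan):

  // Elements in [first, ret) end up less than the pivot; l advances
  // only when swap_if actually performed a swap.
  int* branchless_partition(int* first, int* last, int pivot) {
    int* l = first;
    for (int* r = first; r != last; ++r)
      l += swap_if(*r < pivot, *l, *r);
    return l;
  }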

Curiously, the time taken when this is compiled with g++ is hugely worse if a is assigned before b, for very non-obvious reasons.


I love the explanation, but why the guy suggested complicated bit work instead of just doing sum += arr[i] > 128 ? arr[i] : 0 is beyond me.
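
For reference, the bit work in question goes roughly like this (reconstructed from the linked answer; it assumes >> on a negative int is an arithmetic shift, which is implementation-defined in C++ but holds on mainstream compilers):

  int t = (arr[i] - 128) >> 31;  // all ones when arr[i] < 128, else 0
  sum += ~t & arr[i];            // adds arr[i] only when arr[i] >= 128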


That would not evoke the behavior he wanted to call attention to.


There's still a conditional branch in that statement.


That depends on the compiler; I'd expect most to optimize it into a cmov.
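
One way to make the branchless form explicit rather than hoping the optimizer picks it (a sketch; the bool comparison promotes to 0 or 1, so the multiply zeroes out rejected values):

  sum += (arr[i] > 128) * arr[i];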


"This results in a loss of a single cycle at the time of instruction fetch." Maybe on that paper CPU but branch mis-predicts on Skylake are 16.5 cycles if there's a μop cache hit and 19-20 cycles if there isn't.

https://www.7-cpu.com/cpu/Skylake.html

That said, I didn't know about using BPM to access the PMC (performance monitoring counter) registers.


Related: Dan Luu's article on branch prediction is pretty good: https://danluu.com/branch-prediction/


Feels very shallow. I was surprised to reach the end of the article so quickly.


Instead of complaining, post a follow-up. The author committed the time to write 1,000 words on the topic; without that limit, the article probably would not have been written.


What is the power consumption hit from branch prediction?


It's not an isolated feature, so that's not an answerable question. Branch prediction (and speculation more generally) exists because processors have long pipelines, and stalling until a branch actually resolves is slow. Processors have long pipelines because that allows the individual circuit paths to be short and to operate at high clock speeds. If processors didn't need to run at high clock speeds they wouldn't have long pipelines and wouldn't need speculation[1]. And they would be very low power. But also very slow.

So you can't really look at a modern architecture and ask how much of its power consumption is because of "branch prediction" vs. other things. They all feed back.

[1] They'd still want branch prediction. Everything wants branch prediction. Even the mid-80's RISC architectures implemented in 30k transistors had a "branch delay slot", which was effectively compiler-managed branch prediction for a single cycle of latency.


You'd need two identical processors but one with branch prediction turned off to find that out.

I don't think we have good models for power consumption to find out in a simulation, either.


> I don't think we have good models for power consumption to find out in a simulation, either.

There are excellent models for predicting power consumption, but you'll need access to the processor RTL (the design of the processor, typically written in a language such as Verilog or VHDL) and/or the outputs of the implementation flow (like the netlist, which describes the actual logic gates and their connectivity) to be able to use them.

Such things are normally not available outside the company building the processor and potentially their customers.


Branch prediction can usually be disabled during processor initialization (as well as other features like cache).

I was bringing up Linux on new hardware one time. It would get partway into booting and then crash at a random spot. The kernel config has an option to disable the D-cache, and I found that it would boot fine with the cache disabled. It turned out the root problem was that the processor voltage wasn't being set high enough; disabling the D-cache must have reduced power consumption just enough to let it boot at the lower voltage.



