Why is processing an unsorted array the same speed as processing a sorted array with modern x86-64 clang?

Several of the answers in the question you link talk about rewriting the code to be branchless and thus avoiding any branch prediction issues. That’s what your updated compiler is doing. Specifically, clang++ 10 with -O3 vectorizes the inner loop. See the code on godbolt, lines 36-67 of the assembly. The code is a little … Read more

Branch target prediction in conjunction with branch prediction?

Do read along with the Intel optimization manual, current download location is here. When stale (they move stuff around all the time) then search the Intel site for “Architectures optimization manual”. Keep in mind the info there is fairly generic, they disclose only as much as needed to allow writing efficient code. Branch prediction implementation … Read more

How has CPU architecture evolution affected virtual function call performance?

AMD processor in the early-gigahertz era had a 40 cycle penalty every time you called a function Huh.. so large.. There is an “Indirect branch prediction” method, which helps to predict virtual function jump, IF there was the same indirect jump some time ago. There is still a penalty for first and mispredicted virt. function … Read more

What branch misprediction does the Branch Target Buffer detect?

This is a good question! I think the confusion that it’s causing is due to Intel’s strange naming schemes which often overload terms standard in academia. I will try to both answer your question and also clear up the confusion I see in the comments. First of all. I agree that in standard computer science … Read more

Avoid stalling pipeline by calculating conditional early

Yes, it can be beneficial to allow the the branch condition to calculated as early as possible, so that any misprediction can be resolved early and the front-end part of the pipeline can start re-filling early. In the best case, the mis-prediction can be free if there is enough work already in-flight to totally hide … Read more

Why is a conditional move not vulnerable to Branch Prediction Failure?

Mis-predicted branches are expensive A modern processor generally executes between one and three instructions each cycle if things go well (if it does not stall waiting for data dependencies for these instructions to arrive from previous instructions or from memory). The statement above holds surprisingly well for tight loops, but this shouldn’t blind you to … Read more