how much memory can be accessed by a 32 bit machine?

Yes, a 32-bit architecture is limited to addressing a maximum of 4 gigabytes of memory. Depending on the operating system, this number can be cut down even further due to reserved address space. This limitation can be removed on certain 32-bit architectures via the use of PAE (Physical Address Extension), but it must be supported … Read more

Why is processing an unsorted array the same speed as processing a sorted array with modern x86-64 clang?

Several of the answers in the question you link talk about rewriting the code to be branchless and thus avoiding any branch prediction issues. That’s what your updated compiler is doing. Specifically, clang++ 10 with -O3 vectorizes the inner loop. See the code on godbolt, lines 36-67 of the assembly. The code is a little … Read more

Is processor can do memory and arithmetic operation at the same time?

You’re right, a modern x86 will decode add dword [mem], 1 to 3 uops: a load, an ALU add, and a store. (This is actually a simplification of various things, including Intel’s micro-fusion and how AMD always keeps a load+ALU together in some parts of the pipeline…) Those 3 dependent operations can’t happen at the … Read more

With variable length instructions how does the computer know the length of the instruction being fetched? [duplicate]

First, the processor does not need to know how many bytes to fetch, it can fetch a convenient number of bytes sufficient to provide the targeted throughput for typical or average instruction lengths. Any extra bytes can be place in a buffer to be used in the next group of bytes to be decoded. There … Read more

What specifically marks an x86 cache line as dirty – any write, or is an explicit change required?

Currently no implementation of x86 (or any other ISA, as far as I know) supports optimizing silent stores. There has been academic research on this and there is even a patent on “eliminating silent store invalidation propagation in shared memory cache coherency protocols”. (Googling ‘”silent store” cache’ if you are interested in more.) For x86, … Read more

Branch target prediction in conjunction with branch prediction?

Do read along with the Intel optimization manual, current download location is here. When stale (they move stuff around all the time) then search the Intel site for “Architectures optimization manual”. Keep in mind the info there is fairly generic, they disclose only as much as needed to allow writing efficient code. Branch prediction implementation … Read more

Why is division more expensive than multiplication?

CPU’s ALU (Arithmetic-Logic Unit) executes algorithms, though they are implemented in hardware. Classic multiplications algorithms includes Wallace tree and Dadda tree. More information is available here. More sophisticated techniques are available in newer processors. Generally, processors strive to parallelize bit-pairs operations in order the minimize the clock cycles required. Multiplication algorithms can be parallelized quite … Read more