cpu-architecture – Make Me Engineer

How much memory can be accessed by a 32-bit machine?

June 9, 2023 by Tarik

Yes, a 32-bit architecture is limited to addressing a maximum of 4 gigabytes of memory. Depending on the operating system, this number can be cut down even further due to reserved address space. This limitation can be removed on certain 32-bit architectures via the use of PAE (Physical Address Extension), but it must be supported … Read more
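As a rough check on those figures, the arithmetic can be sketched in Python. The 36-bit width used for PAE is an assumption (it is the physical address width of the original x86 PAE implementation), and the helper name is hypothetical.

```python
GIB = 2**30  # bytes in one gibibyte


def addressable_bytes(address_bits: int) -> int:
    """Bytes reachable with byte-granular addresses of the given width."""
    return 2**address_bits


# Plain 32-bit addressing.
flat_32 = addressable_bytes(32)

# Assumption: 36-bit physical addresses, as in the original x86 PAE.
pae_36 = addressable_bytes(36)

print(flat_32 // GIB, "GiB")  # 4 GiB
print(pae_36 // GIB, "GiB")   # 64 GiB
```

Note that PAE widens physical addresses only; a single process still works with 32-bit virtual addresses.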

Why is processing an unsorted array the same speed as processing a sorted array with modern x86-64 clang?

June 6, 2023 by Tarik

Several of the answers in the question you link talk about rewriting the code to be branchless and thus avoiding any branch prediction issues. That’s what your updated compiler is doing. Specifically, clang++ 10 with -O3 vectorizes the inner loop. See the code on godbolt, lines 36-67 of the assembly. The code is a little … Read more
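The branchless rewrite mentioned here can be illustrated with a small sketch. It assumes the loop in question sums the elements that are at least 128; the threshold and both function names are assumptions for illustration. It shows only the transformation, not the speedup, since Python will not vectorize either version.

```python
def sum_branchy(data):
    total = 0
    for x in data:
        if x >= 128:  # data-dependent branch
            total += x
    return total


def sum_branchless(data):
    total = 0
    for x in data:
        # mask is -1 (all ones) when x < 128 and 0 otherwise.
        # Valid for small non-negative integers such as byte values.
        mask = (x - 128) >> 31
        total += ~mask & x  # adds x when mask == 0, adds 0 when mask == -1
    return total


sample = [0, 127, 128, 255, 5, 200]
print(sum_branchy(sample), sum_branchless(sample))  # 583 583
```

Both functions compute the same result, so the order of the input no longer matters once the condition is folded into arithmetic.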

What is the difference between Trap and Interrupt?

June 6, 2023 by Tarik

A trap is an exception in a user process. It’s caused by events such as division by zero or invalid memory access. It’s also the usual way to invoke a kernel routine (a system call) because those run with a higher privilege level than user code. Handling is synchronous (so the user code is suspended and continues afterwards). In … Read more

Can a processor do memory and arithmetic operations at the same time?

June 3, 2023 by Tarik

You’re right, a modern x86 will decode add dword [mem], 1 to 3 uops: a load, an ALU add, and a store. (This is actually a simplification of various things, including Intel’s micro-fusion and how AMD always keeps a load+ALU together in some parts of the pipeline…) Those 3 dependent operations can’t happen at the … Read more
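The load, add, store breakdown can be modeled with a toy sketch. The memory dictionary, the address, and the function names are all hypothetical; real hardware tracks these uops in the pipeline rather than in software.

```python
memory = {0x1000: 41}  # hypothetical address holding a 32-bit value


def load(addr):
    return memory[addr]


def alu_add(value, imm):
    return (value + imm) & 0xFFFFFFFF  # wrap to 32 bits, like a dword add


def store(addr, value):
    memory[addr] = value


def add_dword_mem_imm(addr, imm):
    tmp = load(addr)         # uop 1: load
    tmp = alu_add(tmp, imm)  # uop 2: needs the loaded value
    store(addr, tmp)         # uop 3: needs the ALU result


add_dword_mem_imm(0x1000, 1)
print(memory[0x1000])  # 42
```

Each step consumes the result of the previous one, which is why the three operations of a single instruction form a dependency chain.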

How does 32-bit address 4GB if 2³² bits = 4 Billion bits not Bytes?

June 3, 2023 by Tarik

It depends on how you address the data. If you use 32 bits to address each bit, you can address 2³² bits, or 4Gb = 512MB. If you address bytes, like most current architectures, it will give you 4GB. But if you address much larger blocks, you will need fewer bits to address 4GB. For … Read more
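A quick sketch of that arithmetic follows. The function names are hypothetical, and the 4 KiB block size is an assumed example chosen only to show how larger units shrink the number of address bits.

```python
GIB = 2**30
MIB = 2**20


def reachable_bytes(address_bits, unit_bits):
    """Total bytes reachable when each address names a unit of unit_bits bits."""
    return (2**address_bits) * unit_bits // 8


def bits_needed(total_bytes, unit_bytes):
    """Address bits needed to cover total_bytes in units of unit_bytes."""
    units = total_bytes // unit_bytes
    return (units - 1).bit_length()


print(reachable_bytes(32, 1) // MIB)  # 512 (MiB, bit-addressed)
print(reachable_bytes(32, 8) // GIB)  # 4 (GiB, byte-addressed)
print(bits_needed(4 * GIB, 4096))     # 20 (bits, with assumed 4 KiB blocks)
```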

With variable length instructions how does the computer know the length of the instruction being fetched? [duplicate]

June 2, 2023 by Tarik

First, the processor does not need to know how many bytes to fetch; it can fetch a convenient number of bytes sufficient to provide the targeted throughput for typical or average instruction lengths. Any extra bytes can be placed in a buffer to be used in the next group of bytes to be decoded. There … Read more

What is instruction fusion in contemporary x86 processors?

June 1, 2023 by Tarik

No, fusion is totally separate from how one complex instruction (like cpuid or lock add [mem], eax) can decode to multiple uops. The way the retirement stage figures out that all the uops for a single instruction have retired, and thus the instruction has retired, has nothing to do with fusion. Macro-fusion decodes cmp/jcc or … Read more
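As a loose illustration of macro-fusion only, the toy decoder below merges a cmp that is immediately followed by a conditional jump into a single uop. The mnemonic list and function name are hypothetical, and real processors apply additional restrictions on which pairs can fuse.

```python
CONDITIONAL_JUMPS = {"je", "jne", "jl", "jge"}  # illustrative subset, not exhaustive


def decode_with_macro_fusion(instructions):
    uops = []
    i = 0
    while i < len(instructions):
        current = instructions[i]
        following = instructions[i + 1] if i + 1 < len(instructions) else None
        if current == "cmp" and following in CONDITIONAL_JUMPS:
            uops.append(f"cmp+{following}")  # two instructions, one uop
            i += 2
        else:
            uops.append(current)
            i += 1
    return uops


print(decode_with_macro_fusion(["mov", "cmp", "jne", "add"]))
# ['mov', 'cmp+jne', 'add']
```

Here two instructions become one uop, which is the opposite direction from a complex instruction decoding into several uops.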

What specifically marks an x86 cache line as dirty – any write, or is an explicit change required?

June 1, 2023 by Tarik

Currently no implementation of x86 (or any other ISA, as far as I know) supports optimizing silent stores. There has been academic research on this and there is even a patent on “eliminating silent store invalidation propagation in shared memory cache coherency protocols”. (Google ‘“silent store” cache’ if you are interested in more.) For x86, … Read more
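A minimal model of that behavior, assuming a hypothetical CacheLine class with a single dirty flag: the write path never compares old and new values, so a store that changes nothing still marks the line dirty.

```python
class CacheLine:
    """Toy cache line: a list of values plus a dirty flag (hypothetical model)."""

    def __init__(self, data):
        self.data = list(data)
        self.dirty = False

    def write(self, offset, value):
        # No comparison against the old value: a silent store still dirties the line.
        self.data[offset] = value
        self.dirty = True


line = CacheLine([7, 7, 7, 7])
line.write(0, 7)   # silent store: the stored value equals the old one
print(line.dirty)  # True
```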

Branch target prediction in conjunction with branch prediction?

May 28, 2023 by Tarik

Do read along with the Intel optimization manual; the current download location is here. If that link is stale (they move stuff around all the time), search the Intel site for “Architectures optimization manual”. Keep in mind the info there is fairly generic; they disclose only as much as needed to allow writing efficient code. Branch prediction implementation … Read more

Why is division more expensive than multiplication?

May 24, 2023 by Tarik

A CPU’s ALU (Arithmetic-Logic Unit) executes algorithms, though they are implemented in hardware. Classic multiplication algorithms include the Wallace tree and the Dadda tree. More information is available here. More sophisticated techniques are available in newer processors. Generally, processors strive to parallelize bit-pair operations in order to minimize the clock cycles required. Multiplication algorithms can be parallelized quite … Read more
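The contrast can be sketched in software under simplifying assumptions. The multiplier below uses plain addition in a pairwise reduction as a stand-in for the carry-save adders of a Wallace or Dadda tree, and the divider is textbook restoring division, chosen here for illustration rather than as what any particular CPU implements. Function names and the 8-bit width are hypothetical.

```python
def multiply_via_partial_products(a, b, bits=8):
    # Each partial product depends only on a and one bit of b,
    # so hardware can generate all of them at once.
    partials = [(a << i) if (b >> i) & 1 else 0 for i in range(bits)]
    # Pairwise reduction: additions within one level are independent of each other.
    levels = 0
    while len(partials) > 1:
        if len(partials) % 2:
            partials.append(0)
        partials = [partials[j] + partials[j + 1] for j in range(0, len(partials), 2)]
        levels += 1
    return partials[0], levels


def restoring_divide(dividend, divisor, bits=8):
    # Each iteration needs the remainder from the previous one: a serial chain.
    remainder, quotient = 0, 0
    for i in reversed(range(bits)):
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        quotient <<= 1
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
    return quotient, remainder


print(multiply_via_partial_products(13, 11))  # (143, 3)
print(restoring_divide(143, 11))              # (13, 0)
```

For 8-bit operands the reduction finishes in 3 levels, while the division loop always takes 8 dependent steps, one per quotient bit.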