Does hardware memory barrier make visibility of atomic operations faster in addition to providing necessary guarantees?

Basically no significant effect on inter-core latency, and definitely never worth using “blindly” without careful profiling, if you suspect there might be any contention from later loads missing in cache. It’s a common misconception that asm barriers are needed to make the store buffer commit to cache. In fact barriers just make this core wait …

What are the semantics of ADRP and ADRL instructions in ARM assembly?

ADR is a simple PC-relative address calculation: you give it an immediate offset, and it stores in the register the address at that offset from the current PC. For example, if the following ADR instruction is placed at address 0x4000 in memory:

    adr x0, #1

then after this instruction is executed, x0 contains the value …
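The address arithmetic behind ADR, and the page-based variant ADRP named in the title, can be modelled in a few lines of C. This is only a sketch of the arithmetic, not of instruction encoding or immediate range checks; the function names adr_target and adrp_target and the example addresses in the note below are hypothetical and chosen for illustration.

```c
#include <stdint.h>

/* ADR: result = PC + signed byte offset. */
uint64_t adr_target(uint64_t pc, int64_t imm)
{
    /* Unsigned arithmetic wraps, so negative offsets work as expected. */
    return pc + (uint64_t)imm;
}

/* ADRP: result = (PC with its low 12 bits cleared) + signed offset in 4 KiB pages. */
uint64_t adrp_target(uint64_t pc, int64_t imm_pages)
{
    uint64_t page_base = pc & ~UINT64_C(0xFFF);   /* start of the current 4 KiB page */
    return page_base + (uint64_t)imm_pages * 4096; /* move by whole pages */
}
```

With these definitions, adr_target(0x8000, 0x10) gives 0x8010, while adrp_target(0x8123, 1) gives 0x9000: ADRP ignores where inside the page the instruction sits and always yields a page-aligned address.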

Why are unsigned types more efficient on ARM CPUs?

Prior to ARMv4, ARM had no native support for loading halfwords and signed bytes. To load a signed byte you had to LDRB, then sign-extend the value (LSL it up, then ASR it back down). This is painful, so char is unsigned by default. In ARMv4, instructions were added to handle halfwords and signed …
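The LSL-then-ASR sequence mentioned above can be mirrored in C to show what the extra instructions do. This is a minimal sketch, assuming a 32-bit register and a compiler that performs an arithmetic right shift on negative signed values (implementation-defined in C, but the behaviour of mainstream compilers); the function name sign_extend_byte is hypothetical.

```c
#include <stdint.h>

/* Sign-extend the low 8 bits of a 32-bit register value,
   mimicking LSL #24 followed by ASR #24. */
int32_t sign_extend_byte(uint32_t reg)
{
    uint32_t shifted = reg << 24;     /* LSL #24: move bit 7 into bit 31 */
    /* ASR #24: assumes the compiler shifts signed values arithmetically,
       so the sign bit is copied into the upper 24 bits. */
    return (int32_t)shifted >> 24;
}
```

A zero-extending byte load such as LDRB needs no such fix-up, which is the cost difference the excerpt describes.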

Methods to vectorise histogram in SIMD?

Histogramming is almost impossible to vectorize, unfortunately. You can probably optimise the scalar code somewhat however – a common trick is to use two histograms and then combine them at the end. This allows you to overlap loads/increments/stores and thereby bury some of the serial dependencies and associated latencies. Pseudo code: init histogram 1 to …
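One way the two-histogram trick might look in C is sketched below. It is not the answer's own pseudo code, which is cut off above; the function name histogram_two_tables, the 8-bit input type, and the 256-bin size are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_BINS 256  /* assumed: one bin per possible 8-bit value */

/* Count byte values using two partial histograms that are merged at the end.
   Adjacent elements update different tables, so two increments of the same
   bin in a row no longer form a store-to-load dependency chain. */
void histogram_two_tables(const uint8_t *data, size_t n, uint32_t hist[NUM_BINS])
{
    uint32_t h0[NUM_BINS] = {0};
    uint32_t h1[NUM_BINS] = {0};
    size_t i = 0;

    /* Main loop: even-indexed elements go to h0, odd-indexed to h1. */
    for (; i + 1 < n; i += 2) {
        h0[data[i]]++;
        h1[data[i + 1]]++;
    }

    /* Handle a leftover element when n is odd. */
    if (i < n) {
        h0[data[i]]++;
    }

    /* Combine the partial histograms. */
    for (size_t b = 0; b < NUM_BINS; b++) {
        hist[b] = h0[b] + h1[b];
    }
}
```

Whether this helps in practice depends on the data and the CPU, so it is worth measuring against the plain single-table loop.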

What is the difference between =label (equals sign) and [label] (brackets) in ARMv6 assembly?

ldr r0,=something … something: means load the address of the label something into the register r0. The assembler then adds a word somewhere within reach of the ldr instruction and replaces it with an ldr r0,[pc,#offset] instruction. So the shortcut ldr r0,=0x12345678 means load 0x12345678 into r0. Because the instructions are mostly fixed length, you can't load …