Where the instruction decoder meets the memory consistency model
Memory consistency models specify how memory operations can be reordered. This is very important for shared memory programs, and a constant source of pain (or joy) for all kinds of people. Familiarity with memory models is assumed; pointers to more information can be found here.
The topic is pretty complex, so this post only aims to illustrate what I consider to be a peculiar interaction between a CPU’s instruction decoder and the memory consistency model.
Decoder optimizations
Some instructions can be optimized at decoding time. The simplest example would be xor a2, a1, a1, which inevitably fills register a2 with a 0 value. This might not seem like a very interesting thing to do, but it is actually very important for CPUs without architectural zero registers (such as x86 or ARMv7). More examples can be found in What optimizations you can expect from CPU?.
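To make the idea concrete, here is a minimal Python sketch of zeroing-idiom recognition. It is a toy model, not how any real frontend is built: the `Instr` tuple, the `decode` function and the `zero` micro-op name are all made up for illustration.

```python
from typing import NamedTuple, Tuple

class Instr(NamedTuple):
    op: str                # mnemonic, e.g. "xor"
    dest: str              # destination register
    srcs: Tuple[str, ...]  # source registers read by the instruction

def is_zeroing_idiom(instr: Instr) -> bool:
    """True when the result is 0 whatever the source register holds."""
    return (
        instr.op in ("xor", "sub")
        and len(instr.srcs) == 2
        and instr.srcs[0] == instr.srcs[1]
    )

def decode(instr: Instr) -> Instr:
    """Toy decoder: rewrite a zeroing idiom into a hypothetical 'zero' micro-op."""
    if is_zeroing_idiom(instr):
        # The rewritten micro-op reads no registers at all, so it no
        # longer has to wait for whoever produces the old sources.
        return Instr("zero", instr.dest, ())
    return instr

print(decode(Instr("xor", "a2", ("a1", "a1"))))  # rewritten, no sources left
print(decode(Instr("xor", "a2", ("a1", "a3"))))  # left untouched
```

The important detail is that the rewritten operation has an empty source list: the whole point of the optimization is to drop the dependency on a1.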
While reading some Chips and Cheese posts (highly recommended) I noticed a trend in reviews of ARM machines:
- Graviton 4 (Neoverse V2)
Neoverse V2 has some move elimination capability, though it’s not as robust as on Intel and AMD’s recent CPUs
- ARM Cortex A73, A72
A73’s renamer covers the basics and doesn’t recognize zeroing idioms like x86 CPUs do
- Qualcomm Oryon
Oryon has move elimination, though it’s not as robust as Intel or AMD’s implementation for chained dependent MOVs. There’s no zeroing idiom recognition for XOR-ing a register with itself, or subtracting a register from itself
- ARM Cortex A57
A57’s renamer doesn’t perform move elimination or break dependencies with zeroing idioms
The simplest explanation for this could be that if you already have a zero register, why would you want to spend extra logic to optimize for “useless” things? This is indeed a good explanation, but perhaps there is more to it. Let’s find out!
Syntactic dependencies
“Syntactic dependency” is the fancy term for “data flow”-like dependencies between instructions. Consider the following snippet from RISC-V’s reference manual:
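ld a1,0(s0)      # first load: a1 = mem[s0]
xor a2,a1,a1     # a2 = a1 ^ a1, always 0, but syntactically derived from a1
add s1,s1,a2     # s1 = s1 + 0, so s1 now syntactically depends on a1
ld a5,0(s1)      # second load: its address register s1 carries the dependency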
The second load syntactically depends on the first load because its source address depends on the result of the first load. Note that this example is not accidental, and it features an xor which could be optimized away by the CPU frontend given what we have discussed so far.
Here is the catch: the xor CAN’T be optimized away (at least naively) because it would make the two loads microarchitecturally independent. The second load wouldn’t have to wait for the first load to complete to determine its address.
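The effect can be sketched with a small register-tracking pass in Python. This is a simplified illustration, not the formal definition from either specification: the `load_dependencies` helper, the tuple encoding of instructions and the `zero` micro-op are assumptions made for the example, and dependencies through memory are ignored.

```python
def load_dependencies(program):
    """Map each load (by index) to the earlier loads its address derives from."""
    origin = {}  # register -> indices of loads whose result flows into it
    deps = {}
    for i, (op, dest, srcs) in enumerate(program):
        flowing = set()
        for reg in srcs:
            flowing |= origin.get(reg, set())
        if op == "ld":
            deps[i] = flowing   # loads feeding this load's address register
            origin[dest] = {i}  # toy simplification: the value originates here
        else:
            origin[dest] = flowing
    return deps

# The snippet from the text, encoded as (op, dest, sources).
original = [
    ("ld",  "a1", ("s0",)),
    ("xor", "a2", ("a1", "a1")),
    ("add", "s1", ("s1", "a2")),
    ("ld",  "a5", ("s1",)),
]

# Same program after a naive frontend replaced the xor with a
# hypothetical source-free "zero" micro-op.
optimized = [
    ("ld",   "a1", ("s0",)),
    ("zero", "a2", ()),
    ("add",  "s1", ("s1", "a2")),
    ("ld",   "a5", ("s1",)),
]

print(load_dependencies(original))   # {0: set(), 3: {0}}
print(load_dependencies(optimized))  # {0: set(), 3: set()}
```

In the original program the second load (index 3) is tied to the first one (index 0). After the rewrite that link is gone, and nothing in the register data flow keeps the two loads in order.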
Memory models such as RISC-V’s or ARMv8’s require a CPU to respect program order for syntactically dependent operations. This means that the aforementioned optimization is not possible unless there are specific measures in place. If the syntactic dependency is broken, the operations are fully independent from the point of view of the microarchitecture; at this point there is nothing we can do to stop the potential reordering (and the resulting consistency model violation).
This only applies to relaxed memory models. x86 CPUs are free to perform this optimization because they maintain load-load ordering anyway.
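To see what such a violation could look like, here is a rough Python sketch of a message-passing scenario. It is not from the original example and it is a very crude model: the `data` and `flag` locations, the registers `r1` and `r2`, and the assumption that the writer keeps its two stores in order are all hypothetical, and load reordering is modelled simply by letting the reader's two loads run in swapped order.

```python
# The writer publishes data, then sets a flag (assumed to stay in this order).
WRITER = [("W", "data", 1), ("W", "flag", 1)]
# The reader loads the flag, then data through an address that
# syntactically depends on the flag value.
READER = [("R", "flag", "r1"), ("R", "data", "r2")]

def interleavings(a, b):
    """Yield every merge of a and b that keeps each list's internal order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def outcomes(reader_orders):
    """Collect every (r1, r2) pair reachable with the given reader orders."""
    results = set()
    for reader in reader_orders:
        for schedule in interleavings(WRITER, reader):
            mem = {"data": 0, "flag": 0}
            regs = {}
            for kind, loc, arg in schedule:
                if kind == "W":
                    mem[loc] = arg
                else:
                    regs[arg] = mem[loc]
            results.add((regs["r1"], regs["r2"]))
    return results

in_order = outcomes([READER])                 # dependency respected
reordered = outcomes([READER, READER[::-1]])  # loads may also swap

print((1, 0) in in_order)   # False
print((1, 0) in reordered)  # True
```

With the dependency respected, the reader never sees the flag set (r1 = 1) together with stale data (r2 = 0). Once the two loads are allowed to swap, that outcome becomes reachable, which is exactly the kind of result the dependency ordering rule is meant to forbid.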
In the past, ARM differentiated between “real” and “false” syntactic dependencies (the example above would feature a “false” dependency) in order to allow some optimizations. This stopped being true some years ago. Simplifying ARM Concurrency explains this change in the ARM memory model:
The third change is concerned with the definition of dependencies (read-to-read address and control+isb/isync dependencies, and read-to-write address, data, and control dependencies). Historically other architectures, e.g. IBM POWER, have explicitly respected all syntactic dependencies. Previous versions of the ARM architecture text introduced notions of “true” and “false” dependencies, aiming to require processors only to preserve “true” dependencies, to allow optimisations in value computations (such as x AND 0 = 0): […] The revised ARMv8 architecture makes no such distinction.
Conclusion
Memory consistency models are so annoying (or great) that they can actually influence CPU instruction decoding optimizations. Memory models are known to be very confusing, but this particular fact surprised me a lot.
Interestingly, this also provides a very strong argument for having an architectural zero register (as ARMv8 and RISC-V do), at least if you are targeting a relaxed memory model.