A Practical Look at C++ Atomics on X86

2026-06-23

Atomics are low-level primitives for coordinating access to memory shared across multiple CPU cores. They are implemented with special instructions that each CPU platform provides.

No Extra Memory Overhead

Atomics rely on CPU-specific instructions only. On most mainstream platforms, including x86-64, atomics have no extra memory overhead compared to their non-atomic counterparts of the same size.

#include <atomic>
#include <cstdint>
#include <cstdio>

int main() {
    // Compare the plain integer with its atomic wrapper: size first, then alignment.
    std::printf("sizeof(uint64_t)               = %zu\n", sizeof(uint64_t));
    std::printf("sizeof(std::atomic<uint64_t>)  = %zu\n", sizeof(std::atomic<uint64_t>));
    std::printf("alignof(uint64_t)              = %zu\n", alignof(uint64_t));
    std::printf("alignof(std::atomic<uint64_t>) = %zu\n", alignof(std::atomic<uint64_t>));
    return 0;
}

Running this produces the following output:

sizeof(uint64_t)               = 8
sizeof(std::atomic<uint64_t>)  = 8
alignof(uint64_t)              = 8
alignof(std::atomic<uint64_t>) = 8

Lock-free

The atomic operations discussed here are lock-free, meaning their implementation does not rely on OS-level mutexes. A contended mutex typically makes waiting threads queue and suspend. A suspended thread yields the CPU to a runnable thread, and its register state must be saved and later restored (a context switch). This is why locking under multi-threaded contention incurs significant overhead.
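Whether a particular std::atomic specialization is really lock-free on the target can be checked rather than assumed. The following is a minimal sketch; the function name u64_atomic_is_lock_free is made up for this illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helper: reports whether std::atomic<uint64_t> is implemented
// without an internal lock on the platform this code is compiled for.
inline bool u64_atomic_is_lock_free() {
    std::atomic<uint64_t> probe{0};
    return probe.is_lock_free();
}
```

Since C++17 the same question can also be answered at compile time through std::atomic<uint64_t>::is_always_lock_free.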

Atomic modifications, on the other hand, typically only require the failing thread to spin and retry. In scenarios where multi-threaded contention is not severe, atomics are a lighter-weight choice.
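The spin-and-retry pattern is usually written as a compare-exchange loop. Here is a minimal sketch, assuming we want an "atomic maximum"; the names atomic_store_max and run_max_demo are hypothetical and exist only for this example.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper: raise `target` to at least `candidate` without a mutex.
// A thread that loses the race does not sleep; it simply retries.
inline void atomic_store_max(std::atomic<uint64_t>& target, uint64_t candidate) {
    uint64_t observed = target.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak writes the current value of `target`
    // into `observed`, so the loop condition is re-checked against fresh data.
    while (observed < candidate &&
           !target.compare_exchange_weak(observed, candidate,
                                         std::memory_order_relaxed)) {
    }
}

// Hypothetical driver: thread t offers the values t*per_thread ... t*per_thread + per_thread - 1.
inline uint64_t run_max_demo(unsigned thread_count, uint64_t per_thread) {
    assert(thread_count > 0 && per_thread > 0);
    std::atomic<uint64_t> maximum{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < thread_count; ++t) {
        workers.emplace_back([&maximum, t, per_thread] {
            for (uint64_t i = 0; i < per_thread; ++i) {
                atomic_store_max(maximum, t * per_thread + i);
            }
        });
    }
    for (auto& w : workers) w.join();
    return maximum.load();
}
```

Calling run_max_demo(4, 1000) should return 3999, the largest value any thread offered, regardless of how the threads interleave.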

Transactional Modification

Atomicity means that modifications to an atomic variable can only result in a success or failure state — no thread can ever observe an intermediate state of the modification.

In other words, if thread A modifies atomic variable X, no other thread will observe bizarre behavior such as the upper bits of X matching the new value while the lower bits still match the old value.
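A small experiment can make this concrete. In the sketch below, a writer flips X between all-zeros and all-ones while a reader counts any value that is neither. The function name count_torn_reads and both constants are assumptions made for this illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Two bit patterns chosen so that any half-updated value is easy to spot.
constexpr uint64_t kOldPattern = 0x0000000000000000ULL;
constexpr uint64_t kNewPattern = 0xFFFFFFFFFFFFFFFFULL;

// Hypothetical experiment: returns how many reads saw a value other than
// the two patterns the writer ever stores.
inline uint64_t count_torn_reads(uint64_t iterations) {
    std::atomic<uint64_t> x{kOldPattern};
    std::atomic<bool> stop{false};

    std::thread writer([&x, &stop, iterations] {
        for (uint64_t i = 0; i < iterations; ++i) {
            x.store((i & 1) ? kOldPattern : kNewPattern, std::memory_order_relaxed);
        }
        stop.store(true, std::memory_order_relaxed);
    });

    uint64_t torn = 0;
    while (!stop.load(std::memory_order_relaxed)) {
        const uint64_t seen = x.load(std::memory_order_relaxed);
        if (seen != kOldPattern && seen != kNewPattern) {
            ++torn;  // would mean a mix of old and new bits was visible
        }
    }
    writer.join();
    return torn;
}
```

Even with the weakest memory order, relaxed, the count stays at zero, because atomicity of the individual load and store does not depend on the memory order.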

Safe Publication of Modifications

Atomics guarantee safe publication of modifications across threads. That is, after thread A modifies atomic variable X, thread B will see that value (or a later one) once the modification becomes visible to it, and will never go back to an older value of X afterwards.

Atomics, with proper memory order settings, can also safely publish modifications to non-atomic variables. For example, if thread A first modifies a non-atomic variable Y before modifying atomic variable X, then once thread B observes the modification to X, it can be guaranteed to also see the modification to Y.
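The X and Y scenario can be sketched with a release store and an acquire load. The function name publish_and_read and the payload value 42 are placeholders chosen for this example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical demo: y is the non-atomic variable Y, x is the atomic variable X.
inline int publish_and_read() {
    int y = 0;
    std::atomic<bool> x{false};
    int observed = -1;

    std::thread thread_a([&y, &x] {
        y = 42;                                     // 1. modify Y first
        x.store(true, std::memory_order_release);   // 2. then publish through X
    });
    std::thread thread_b([&y, &x, &observed] {
        while (!x.load(std::memory_order_acquire)) {  // 3. wait until X is seen
            std::this_thread::yield();
        }
        observed = y;                               // 4. Y's modification is visible here
    });

    thread_a.join();
    thread_b.join();
    return observed;
}
```

If both operations on x used relaxed instead, the read of y in step 4 would be a data race and the result would no longer be guaranteed.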

Memory Order

When using atomics in C++, understanding the concept of memory order is essential.

The default memory order in C++ already provides strong guarantees for the Happens-Before relationship across threads. However, in many scenarios, lighter-weight alternatives are available — "don't pay for what you don't need."

C++ supports several memory orders:

Memory Order	Inter-thread Synchronization	Reordering Constraint
`relaxed`	None — only guarantees atomicity of the modification itself	None
`release`	A subsequent `acquire` on the same atomic by another thread will see this modification	Reads and writes that precede this operation in the current thread cannot be reordered after it
`acquire`	A preceding `release` on the same atomic by another thread becomes visible	Reads and writes that follow this operation in the current thread cannot be reordered before it
`acq_rel`	Combines both `acquire` and `release` semantics (for read-modify-write operations such as `compare_exchange_strong`)	Reads and writes in the current thread cannot be reordered across this operation in either direction
`seq_cst`	Establishes a single total order over all `seq_cst` operations across all threads	Reads and writes in the current thread cannot be reordered across this operation in either direction

release and acquire form a synchronized store-load pair on the same atomic variable. seq_cst is the default and the strongest ordering; relaxed is the lightest.
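In code, the memory order is an optional argument on each atomic operation. The sketch below uses a statistics counter, where only the atomicity of each increment matters, so relaxed is sufficient. The function name count_events is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical counter demo: every worker increments the same atomic.
inline uint64_t count_events(unsigned thread_count, uint64_t per_thread) {
    assert(thread_count > 0);
    std::atomic<uint64_t> hits{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < thread_count; ++t) {
        workers.emplace_back([&hits, per_thread] {
            for (uint64_t i = 0; i < per_thread; ++i) {
                // Explicit order: atomicity only, no ordering of other memory.
                hits.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    // join() synchronizes with the end of each worker, so the final read is safe.
    for (auto& w : workers) w.join();
    return hits.load();  // no argument given: defaults to std::memory_order_seq_cst
}
```

No increment is lost: count_events(4, 10000) should return 40000.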

(Note: memory_order_consume was discouraged in C++17 and is not discussed here.)

Assembly Instructions

Load

On the X86 platform, examining the assembly for atomic load operations under different memory orders always yields the same result:

mov    a(%rip), %eax

The operation is remarkably simple — just a memory read.
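For reference, source of roughly the following shape is the kind of input that produces the listing above. This is a sketch: the declaration of a as a global std::atomic<int> and the function names are assumptions, since only the symbol name a appears in the assembly.

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> a{0};  // assumed declaration; the name matches the a(%rip) operand

// On x86-64, each of these is expected to compile to the same plain mov.
int load_relaxed() { return a.load(std::memory_order_relaxed); }
int load_acquire() { return a.load(std::memory_order_acquire); }
int load_seq_cst() { return a.load(std::memory_order_seq_cst); }
```

Compiling with optimization enabled (for example -O2) and inspecting the output with -S or a disassembler is one way to compare the three functions.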

Store

For store operations, things are a bit more complex:

The relaxed and release memory orders are identical, both producing a single memory write:

mov    %edi, a(%rip)

For seq_cst, however, the compiled instruction becomes XCHG:

xchg   a(%rip), %edi

According to the Intel SDM, when the XCHG instruction involves a memory operand (such as a(%rip) above), the CPU automatically asserts the LOCK signal during execution, making it a relatively expensive instruction.
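The store variants can be reproduced from source along these lines. As before, this is a sketch: the global declaration of a and the function names are assumptions, and the per-function comments restate what the listings above show.

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> a{0};  // assumed declaration; the name matches the a(%rip) operand

void store_relaxed(int v) { a.store(v, std::memory_order_relaxed); }  // plain mov
void store_release(int v) { a.store(v, std::memory_order_release); }  // plain mov
void store_seq_cst(int v) { a.store(v, std::memory_order_seq_cst); }  // xchg
```

The exact instruction for the seq_cst store can vary with the compiler and its version; some are known to emit a mov followed by mfence instead of xchg.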

Fetch Add

On the X86 platform, examining the assembly for the atomic fetch_add operation under different memory orders always yields the same result:

lock xadd	%eax, a(%rip)

Here, the lock prefix is added to the xadd instruction, which explicitly triggers the CPU to assert the LOCK signal.
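A source-level sketch for this case, again assuming a is a global std::atomic<int> and using made-up function names:

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> a{0};  // assumed declaration; the name matches the a(%rip) operand

// fetch_add returns the value held before the addition, which is what xadd
// leaves in the register operand.
int add_relaxed(int v) { return a.fetch_add(v, std::memory_order_relaxed); }
int add_seq_cst(int v) { return a.fetch_add(v, std::memory_order_seq_cst); }
```

When the returned value is discarded, compilers commonly emit lock add instead of lock xadd, so keeping the return value matters when trying to reproduce the listing.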

CAS

On the X86 platform, examining the assembly for the atomic compare_exchange_strong operation under different memory orders always yields the same result:

lock cmpxchg	%esi, a(%rip)

Similarly, the lock prefix is added to the cmpxchg instruction, which explicitly triggers the CPU to assert the LOCK signal.
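A matching source-level sketch for compare_exchange_strong, with the same assumed global a and hypothetical function names:

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> a{0};  // assumed declaration; the name matches the a(%rip) operand

// Returns true if `a` held `expected` and was replaced by `desired`.
// `expected` is a by-value parameter here, so the caller's copy is not updated.
bool cas_relaxed(int expected, int desired) {
    return a.compare_exchange_strong(expected, desired, std::memory_order_relaxed);
}

bool cas_acq_rel(int expected, int desired) {
    return a.compare_exchange_strong(expected, desired, std::memory_order_acq_rel);
}
```

With a starting at 0, cas_relaxed(0, 1) succeeds, and a second cas_relaxed(0, 2) fails because a no longer holds 0.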

About the CPU LOCK Assertion

According to the Intel SDM, in the best case, where the operating system has marked the memory as WB (Write-Back) type and the atomic read/write does not span multiple cache lines, asserting the LOCK signal results in a cache line lock rather than a bus lock. While that lock is held, read and write operations from all cores targeting the same cache line are executed in strict sequential order. In the most restrictive scenario, for example when the operand spans two cache lines or the memory is not cacheable, asserting the LOCK signal triggers a bus lock. During this period, no CPU core can issue read or write requests to system memory.
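The "does not span multiple cache lines" condition is easy to check for a given address. The sketch below assumes 64-byte cache lines, which is typical for current x86-64 processors but not guaranteed; the constant and both function names are made up for this illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines.
constexpr std::size_t kAssumedCacheLine = 64;

// Hypothetical helper: true if [addr, addr + size) touches more than one cache line.
inline bool spans_cache_lines(std::uintptr_t addr, std::size_t size) {
    return (addr / kAssumedCacheLine) != ((addr + size - 1) / kAssumedCacheLine);
}

// A naturally aligned std::atomic<uint64_t> starts on an 8-byte boundary,
// so its 8 bytes always fit inside a single 64-byte line.
inline bool aligned_atomic_spans_lines() {
    std::atomic<uint64_t> x{0};
    return spans_cache_lines(reinterpret_cast<std::uintptr_t>(&x), sizeof(x));
}
```

An 8-byte access starting at offset 60 of a line would cross into the next one, which is why atomics placed in packed or manually misaligned storage are the usual way to end up on the slow path.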

Conclusion

This article provided a brief introduction to the fundamentals of C++ atomics and listed the assembly instructions for common operations on the X86 platform.

Overall, on the X86 platform, the instructions compiled for different memory orders are very similar, because the X86 architecture itself imposes sufficiently strong constraints on read/write reordering. However, this does not mean memory orders can be mixed freely. The memory order also determines how the C++ compiler itself may reorder the surrounding code, so the same source can be optimized differently under different memory orders. Choosing the simpler relaxed memory order or a release-acquire pair where appropriate is still worthwhile.

Note: On ARM and other platforms with weaker memory models, the compiled instructions would be quite different from the X86 cases shown above.