Data Race Bugs and Memory Fences

Probably the most familiar kind of data race bug is the situation in which two threads execute the following code on a shared variable. The remedy for this kind of bug is relatively straightforward; a sketch of two common fixes follows the snippet.

  int shared_count = 0;

  inline void increase()
  {
    ++shared_count;
  }
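
The race can be removed either by making the increment itself atomic or by guarding the shared counter with a lock. Below is a minimal sketch of both fixes, assuming C++11 or later and reusing the shared_count and increase names from the snippet above; the names count_mutex and increase_locked are illustrative and not part of the original snippet.

  #include <atomic>
  #include <mutex>

  // Remedy 1: make the increment itself an atomic read-modify-write.
  std::atomic<int> shared_count{0};

  inline void increase()
  {
    ++shared_count;   // no increments can be lost or torn
  }

  // Remedy 2: keep the plain int but serialize access with a lock.
  int shared_count_locked = 0;
  std::mutex count_mutex;

  inline void increase_locked()
  {
    std::lock_guard<std::mutex> lock(count_mutex);
    ++shared_count_locked;
  }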

However, there are two other ways in which data race bugs can arise: one caused by instruction reordering in the compiler and the CPU pipeline, and another caused by the multilevel memory cache of a multicore machine.

1. Instruction Reordering

Optimizing compilers often reorder instructions in an attempt to minimize pipeline stalls. Likewise, the out-of-order execution logic within the CPU can cause instructions to be executed in an order that differs from program order. Instruction reordering is guaranteed not to alter the observable behavior of a single-threaded program. But it can alter the way in which two or more threads cooperate to share data, thereby introducing bugs into a concurrent program. For example, the following C/C++ code

    A = B + 1;
    B = 0;

might produce the following Intel x86 assembly code.

    mov  eax,[B]
    add  eax,1
    mov  [A],eax
    mov  [B],0

The compiler could very well reorder the instructions as follows, and this is not a problem in single-threaded execution.

    mov  eax,[B]
    mov  [B],0   ;; write to B before A
    add  eax,1
    mov  [A],eax

If a second thread were waiting for B to become zero before reading the value of A, it would read a stale value of A after this reordered code runs. Imagine that B is a flag that the second thread polls while it spins; the reordering alone is enough to introduce a data race bug. (A minimal sketch of this pattern follows the list below.)

  • Polling: involves a thread sitting in a tight loop, waiting for a condition to become true. It is a bit like kids in the back seat on a road trip, repeatedly asking, “Are we there yet? Are we there yet?”. For example, the code would look like while (!checkCondition()) {}. It has the potential to burn CPU cycles unnecessarily, so it is sometimes called a spin-wait or a busy-wait.
  • Blocking: puts a thread to sleep so that it does not waste CPU resources, relying on the kernel to wake it back up when the condition becomes true at some future time.
  • Yielding: falls part-way between polling and blocking. The thread polls the condition in a loop, but on each iteration it relinquishes the remainder of its time slice.
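
To make the hazard concrete, here is a minimal sketch of that polling pattern. It is intentionally broken: B is used as a plain "ready" flag, so if the write to B is moved ahead of the write to A, by the compiler or by the CPU, the consumer can read a stale A. The thread split and function names are illustrative.

    int A = 0;
    int B = 1;   // B == 0 will mean "A is ready"

    // Thread 1 (producer)
    void produce()
    {
        A = B + 1;   // intended to happen first
        B = 0;       // intended to signal afterwards that A is ready
    }

    // Thread 2 (consumer)
    void consume()
    {
        while (B != 0) { /* spin-wait (polling) */ }
        int result = A;   // may observe a stale value of A if the writes were reordered
        (void)result;
    }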

volatile in C/C++ is not a solution

It is tempting to use volatile in C/C++ as a way to prevent instruction reordering, but it will not help to write reliable concurrent software. The volatile type qualifier does guarantee that consecutive reads and writes of a variable cannot be optimized away by the compiler. However, it does not work reliably for concurrency, for a number of reasons.

The volatile qualifier in C/C++ was really designed to make memory-mapped I/O and signal handlers work reliably. As such, the only guarantee it provides is that the contents of a variable marked volatile will not be cached in a register; the variable’s value will be re-read directly from memory every time it is accessed. Accesses to a volatile variable are also not atomic. Some compilers do guarantee that instructions will not be reordered across a read or write of a volatile variable, but not all of them do, and some only provide this guarantee when targeting certain CPUs, or only when a particular command-line option is passed to the compiler. Since the C/C++ standards do not require this behavior, it is not portable, and in any case the volatile keyword does nothing to prevent the CPU’s out-of-order execution logic from reordering the instructions at runtime.

Workaround — Compiler Barriers

One reliable way of preventing the compiler from reordering read and write instructions across critical operation boundaries is to explicitly instruct it not to do so using compiler barriers. Different compilers express barriers with different syntax. With GCC, a compiler barrier can be inserted via the inline assembly statement asm volatile("" ::: "memory"), which asks the compiler not to reorder memory operations across this point. Unfortunately, compiler barriers do not prevent the CPU’s out-of-order execution logic from reordering instructions at runtime.
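
As an illustration, a compiler barrier could be placed between the two writes from the earlier snippet. This constrains only the compiler, not the CPU; the sketch below assumes GCC and reuses the illustrative A and B from above.

    int A = 0;
    int B = 1;

    void produce()
    {
        A = B + 1;
        asm volatile("" ::: "memory");   // compiler barrier: the compiler may not move
                                         // memory accesses across this statement
        B = 0;
    }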

Workaround — Function Calls

Another workaround is to rely on function calls. Most function calls serve as an implicit compiler barrier. This makes sense because the compiler knows nothing about the side effects of a function call. As such, it cannot assume that the state of memory will be the same before and after the call, which means most optimizations are not permitted across a function call. Some optimizing compilers do make an exception to this rule for inline functions. Note also that function calls only serve as implicit barriers when the compiler is unable to see the definition of the function, such as when the function is defined in a separate translation unit. Link-time optimization (LTO) can therefore introduce concurrency bugs by giving the compiler’s optimizer a way to see the definitions of functions it otherwise could not have seen, eliminating these implicit barriers.
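
Here is a sketch of that effect, assuming the called function is defined in a separate translation unit and LTO is not enabled; the file layout and the name opaqueCall are illustrative.

    // barrier.h
    void opaqueCall();   // defined in another translation unit

    // producer.cpp
    #include "barrier.h"

    int A = 0;
    int B = 1;

    void produce()
    {
        A = B + 1;
        opaqueCall();   // the compiler cannot see the body, so it must assume the call
                        // may read or write A and B; it acts as an implicit barrier
        B = 0;
    }

    // barrier.cpp
    void opaqueCall() {}   // empty, but invisible to producer.cpp at compile time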

2. Multicore with a Multilevel Cache

Even when instructions are not reordered, read and write operations can become effectively reordered in a concurrent system. Specifically, in a multicore machine with a multilevel memory cache, two or more cores can disagree about the order in which memory operations became visible, even when the instructions were actually executed in the intended order. Such disagreements can cause subtle bugs in concurrent software. This issue can only occur on a multicore machine with a multilevel cache. Moreover, different CPUs have different memory ordering behavior, meaning that these effects can differ from machine to machine when running the exact same source program.

Memory Caching — Deferred Write-Back

In order to understand how these memory reordering effects can occur, consider the following code.

    constexpr int COUNT = 16;
    alignas(64) float values[COUNT];
    float sum = 0.0f;

    void calculateSum()
    {
        sum = 0.0f;
        for (int i = 0; i < COUNT; ++i) {
            sum += values[i];
        }
    }

The first statement sets sum to zero. Presuming that the contents of sum are not already in the L1 cache, the cache line containing it will be read into L1 at this point. Likewise, on the first iteration of the loop, the cache line containing all of the elements of the values array will be loaded into L1. The 16 floats occupy 64 bytes and the array is aligned to a 64-byte boundary, so they should all fit in a single line, presuming cache lines are at least 64 bytes wide. Subsequent iterations will read the copies of the values that reside in the L1 cache, rather than reading them from main memory.

During each iteration, sum is updated. The compiler might optimize this by keeping the sum in a register until the end of the loop. But whether or not this optimization is performed, the sum variable will be written to at some point during this function. When that happens, again the CPU will write to the copy of sum that exists in the L1 cache, rather than writing directly to main memory.

Eventually, the master copy of sum has to be updated. The memory cache hardware does this automatically by triggering a write-back operation that copies the cache line from L1 back to main memory. But the write-back does not usually happen right away; it is typically deferred until the cache line has to be evicted, or until another core needs to read the modified data.

Cache Coherency

In a multicore machine, memory caching gets a lot more complicated. In general, each core has its own private L1 cache and the cores share an L2 cache and main memory. Assume that the L2 cache is roughly equivalent to main memory for simplicity. Consider the situation where Core2 attempts to read a variable X at some time after Core1 sets it to 1. Core2’s local L1 cache does not contain a copy of X, but Core1’s L1 cache does. So ideally Core2 would like to ask Core1 for its copy because that would still be a lot less expensive than getting the data from main memory. A cache coherency protocol is a communication mechanism that permits cores to share data between their local L1 caches in this way. Most CPUs use either the MESI or MOESI protocol.

Under the MESI protocol, the variable X is transferred to Core2 as follows.

  • Assume that Core1 first tries to read the current value of X for some reason. Presuming that this variable does not already exist in any core’s L1 cache, the cache line that contains it is loaded into Core1’s L1 cache. The cache line is put into the Exclusive state, meaning that no other core has this line.
  • Now Core2 attempts to read X. A Read message is sent over the interconnect bus (ICB), a special bus connecting all cores’ L1 caches. Core1 has this cache line, so it responds with a copy of the data. At this point, the cache line is put into the Shared state on both cores, indicating that both have an identical copy of the line.
  • Next, Core1 writes a 1 into X. This updates the value in Core1’s L1 cache, and puts its copy of the line into the Modified state. An Invalidate message is sent across the ICB causing Core2’s copy of the line to be put into the Invalid state. This indicates that Core2’s copy of the line containing X is no longer up-to-date.
  • The next time that Core2 tries to read X, it finds that its locally-cached copy is Invalid. It sends a Read message across the ICB and obtains the newly-modified line from Core1’s L1 cache. This causes both cores’ cache lines to be put into the Shared state once again. It also triggers a write-back of the line to main memory.
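
The walkthrough boils down to a short sequence of cache-line state transitions. The toy, single-threaded model below simply replays those transitions for the line holding X; it is not a coherency implementation, only the steps above restated in code.

    #include <cstdio>

    enum class LineState { Modified, Exclusive, Shared, Invalid };   // the MESI states

    int main()
    {
        LineState core1 = LineState::Invalid;   // X's line is not cached anywhere yet
        LineState core2 = LineState::Invalid;

        // 1. Core1 reads X: the line is loaded and no other core has it.
        core1 = LineState::Exclusive;

        // 2. Core2 reads X: a Read message goes over the ICB; Core1 supplies the line.
        core1 = core2 = LineState::Shared;

        // 3. Core1 writes 1 into X: an Invalidate message is sent to Core2.
        core1 = LineState::Modified;
        core2 = LineState::Invalid;

        // 4. Core2 reads X again: Core1 supplies the modified line and a
        //    write-back to main memory is triggered.
        core1 = core2 = LineState::Shared;

        std::printf("core1=%d core2=%d\n", static_cast<int>(core1), static_cast<int>(core2));
        return 0;
    }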

So where is the problem?

Based on the MESI protocol, it would seem that the problem of sharing data between L1 caches in a multicore machine has been solved. How, then, can memory ordering bugs still happen?

The answer is optimization. On most hardware, the MESI protocol is highly optimized to minimize latency. This means that some operations are not actually performed immediately when messages are received over the ICB; instead, they are deferred to save time. As with compiler optimizations and CPU out-of-order execution, MESI optimizations are carefully implemented so that they are undetectable by a single thread. No such guarantee exists for concurrent programs.

For example, suppose Core1 writes 1 into X and then immediately writes 2 into Y, for some shared variables X and Y. Under certain circumstances, optimizations in the MESI protocol can cause the new value of Y to become visible to other cores within the cache coherency domain before the updated value of X becomes visible. This can happen, for example, if Core1 already has Y’s cache line in its local L1 cache but does not yet have X’s line. As a result, Core2 can potentially see a value of 2 for Y before it sees a value of 1 in X, resulting in a data race bug.

To sum up, optimizations within a cache coherency protocol can make two read and/or write instructions appear to happen, from the point of view of other cores in the system, in an order that is opposite to the order in which the instructions were actually executed.
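
In code, the scenario described above is the classic message-passing pattern. The sketch below is intentionally left unsynchronized; the names X, Y, writer, and reader are illustrative.

    int X = 0;
    int Y = 0;

    // Runs on Core1
    void writer()
    {
        X = 1;   // executed first
        Y = 2;   // executed second, but may become visible to other cores first
    }

    // Runs on Core2
    void reader()
    {
        if (Y == 2) {
            int x = X;   // may still observe 0 on a weakly ordered machine
            (void)x;
        }
    }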

The real solution is Memory Fences

To prevent the memory effects of a read or write instruction from passing those of other reads and/or writes, modern CPUs provide special machine language instructions known as memory fences, also known as memory barriers. All fence instructions have two very useful side-effects: (1) they serve as compiler barriers, and (2) they prevent the CPU’s out-of-order logic from reordering instructions across the fence.

No matter what the specific fence instructions look like, it is important to reason about their effects in terms of their semantics.

  • Release semantics (forward direction only): This semantic guarantees that a write to shared memory can never be passed by any other read or write that precedes it in program order.
  • Acquire semantics (reverse direction only): This semantic guarantees that a read from shared memory can never be passed by any other read or write that occurs after it in program order.
  • Full fence semantics (bidirectional): This semantic ensures that all memory operations appear to occur in program order across the boundary created by a fence instruction in the code.

The Intel x86 ISA specifies three fence instructions: sfence provides release semantics, lfence provides acquire semantics, and mfence acts as a full fence. The x86 ISA is strongly ordered by default, meaning that fences are not actually required in many cases where they would be on CPUs with weaker default ordering semantics. But there are some cases in which these fence instructions are required.
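
As a rough sketch, a full fence could be issued directly with GCC-style inline assembly on x86, although portable code would normally rely on the C++11 facilities described next; the name fullFence is illustrative.

    inline void fullFence()
    {
        asm volatile("mfence" ::: "memory");   // full fence; also acts as a compiler barrier
    }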

As of C++11, the class template std::atomic<T> allows virtually any data type to be converted into an atomic variable, and a specialized class called std::atomic_flag encapsulates an atomic Boolean variable. In addition to atomicity, the std::atomic family of templates provides its variables with full fence memory ordering semantics by default, although weaker semantics can be specified if desired. However, CPUs do not provide atomic instructions for manipulating arbitrarily sized data objects, even though the std::atomic<T> template can wrap any type imaginable; in other words, std::atomic<T> must be specialized by size.
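
For example, std::atomic_flag is just enough to build a simple spin lock. The sketch below uses the explicit acquire/release orderings discussed in the next paragraph; the class name SpinLock is illustrative.

    #include <atomic>

    class SpinLock
    {
    public:
        void lock()
        {
            // test_and_set() atomically sets the flag and returns its previous value.
            // Acquire ordering keeps later reads/writes from moving above the lock.
            while (m_flag.test_and_set(std::memory_order_acquire)) {
                // spin-wait (polling)
            }
        }

        void unlock()
        {
            // Release ordering keeps earlier reads/writes from moving below the unlock.
            m_flag.clear(std::memory_order_release);
        }

    private:
        std::atomic_flag m_flag = ATOMIC_FLAG_INIT;
    };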

By default, C++ atomic variables make use of full memory barriers, to ensure that they will work correctly in all possible use cases. However, it is possible to relax these guarantees by passing an optional argument of type std::memory_order to the functions that manipulate atomic variables. It is important to realize that using a memory ordering specifier does not guarantee that a particular semantic will actually be used on every platform. All it does is guarantee that the semantics will be at least that strong, which means a stronger semantic might be employed on some target hardware. For example, on Intel x86, a truly relaxed memory order is not possible because the CPU’s default memory ordering semantics are relatively strong; any request for a relaxed read operation will actually end up having acquire semantics on an Intel CPU.
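
For instance, the earlier X/Y scenario can be repaired by publishing X through the flag with release/acquire ordering rather than a full fence. A minimal sketch, with names following the earlier illustration:

    #include <atomic>

    int X = 0;                // ordinary data being published
    std::atomic<int> Y{0};    // the flag itself must be atomic

    void writer()   // e.g. runs on Core1
    {
        X = 1;
        Y.store(2, std::memory_order_release);   // the write to X cannot appear to
                                                 // pass this store
    }

    void reader()   // e.g. runs on Core2
    {
        if (Y.load(std::memory_order_acquire) == 2) {   // pairs with the release store
            int x = X;   // guaranteed to observe X == 1
            (void)x;
        }
    }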

These semantics are easy to get wrong, so it is recommended to use non-default memory ordering with std::atomic only when profiling proves that it is really needed for performance and that it actually produces the expected benefits.


