Things about Architecture
General
- A CPU is like an orchestra. In a CPU, the clock signal is like a conductor’s baton. Each time the clock ticks, every register in the circuit, that is, the state of the entire circuit, is updated.
- The clock rate refers to how many times the baton is waved in one second.
- It is important to realize that one CPU instruction does not necessarily take one clock cycle to execute. CPI (clock cycles per instruction) varies with the type of instruction executed. Therefore, fewer instructions do not necessarily mean faster code, since different instructions take different numbers of clock cycles, as the rough calculation below shows.
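A minimal sketch of this point, using purely hypothetical instruction counts and CPI values; it only applies the usual relation CPU time = instruction count × CPI × clock cycle time.

```c
#include <stdio.h>

/* Hypothetical numbers: program A executes fewer instructions than
   program B but with a higher average CPI, so it is not automatically
   faster. CPU time = instruction count * CPI * clock cycle time. */
int main(void) {
    double clock_rate = 2e9;                     /* 2 GHz */
    double cycle_time = 1.0 / clock_rate;        /* 0.5 ns per cycle */

    double time_a = 1.0e9 * 2.5 * cycle_time;    /* 1.0 G instructions, CPI 2.5 */
    double time_b = 1.2e9 * 1.8 * cycle_time;    /* 1.2 G instructions, CPI 1.8 */

    printf("A: %.2f s, B: %.2f s\n", time_a, time_b);   /* B finishes first */
    return 0;
}
```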
- Static energy consumption occurs because of leakage current that flows even when a transistor is off. In servers, leakage is typically responsible for 40% of the energy consumption.
- A very large number of registers may increase the clock cycle time simply because electronic signals take longer when they must travel farther. However, note that a machine with 31 registers is not necessarily faster than one with 32.
- The segment of the stack containing a procedure’s saved registers and local variables is called a procedure frame or activation record. The stack pointer points to the top of the stack and the frame pointer points to the first word of the frame, often a saved argument register. The frame pointer offers a stable base register within a procedure for local memory-references.
- Specific bits of a machine instruction are used for the memory address. For example, a 16-bit address field is too small to be a realistic option on its own. One solution is to specify a register that is always added to the 16-bit field, which allows the program to be as large as 2^32 bytes. Since the PC (program counter) contains the address of the current instruction, using it as this register makes it possible to branch within about ±2^15 words of the current instruction. This form is called PC-relative addressing. If all instructions are 4 bytes long, branch distances can be stretched by a factor of four by representing them in words instead of bytes.
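A small sketch of the address arithmetic, assuming a MIPS-style encoding where the 16-bit field is a signed word offset added to PC + 4; the function name is illustrative.

```c
#include <stdint.h>
#include <stdio.h>

/* PC-relative branch target: the 16-bit signed field counts words, so
   multiplying by 4 converts it back to a byte distance from PC + 4. */
uint32_t branch_target(uint32_t pc, int16_t offset_in_words) {
    return (pc + 4) + (int32_t)offset_in_words * 4;
}

int main(void) {
    /* Branch 100 words forward from 0x00400000 -> 0x00400194. */
    printf("0x%08x\n", (unsigned)branch_target(0x00400000, 100));
    return 0;
}
```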
- A function’s signature means its name and the sequence of parameter types, but not its return type.
- C offers the programmer a chance to give a hint to the compiler about which variables to keep in registers or memory. However, today’s C compilers generally ignore such hints, because the compiler does a better job at allocation than the programmer does.
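The register keyword is exactly that hint; a minimal sketch is shown below. Modern compilers are free to ignore it, and C forbids taking the address of a register variable.

```c
/* "register" asks the compiler to keep these variables in registers;
   today's optimizers generally make this decision themselves. */
long sum(const long *a, int n) {
    register long total = 0;          /* hint: keep the accumulator in a register */
    for (register int i = 0; i < n; ++i)
        total += a[i];
    return total;
}
```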
- When calling a function, a call stack is required. The operating system also needs a call stack to complete a system call; that stack is the kernel-mode stack.
- When a system call occurs, the CPU switches from user mode to kernel mode and searches for a kernel mode thread corresponding to the user mode thread. At this time, execution status information such as the register information of the user mode thread is stored in the kernel mode stack.
- The operating system generates a timer interrupt at regular intervals, the CPU detects the timer interrupt signal, and executes the interrupt handling program within the operating system.
- The fact that a computer can process tasks such as keyboard key input, mouse movement, and network reception while executing a program is because all of this is done using the interrupt.
- The call stack of the interrupt service routine can be handled in two ways:
- The interrupt service routine uses the kernel mode stack.
- It uses its own call stack, called the interrupt service routine stack, or ISR stack. Since interrupts are handled per CPU, every CPU has its own ISR stack.
- Function calls, whether in user mode or kernel mode, occur only within a single thread and exist within the same execution flow. On the other hand, interrupt handling jumps involve two different execution flows, so interrupt jumps require much more information to be stored than function calls.
- The difference between debug and release builds is often thought of as ‘optimization’, but optimization can also be applied to debug builds. The biggest difference is whether additional information for debugging, such as PDB files, is generated.
- PDB is an abbreviation for Program DataBase. The file contains information that maps data in the code area to source code locations, function names, and variable names; in other words, it is a file of additional information that makes debugging at the source code level possible.
- WOW64 (Windows on Windows 64) supports running x86 executables in an x64 Windows environment.
- Increasing the number of bits that a CPU can process is not a matter of speed. The size of memory that can be used and the types of CPU instructions increase. In other words, it is similar to a chef getting a bigger kitchen.
- There are times when a program developed by one person cannot be run by another person. This is because the versions of the DLLs used during development must also be installed in the environment where the program is run. A package that bundles the system DLLs of the development environment so they can be installed in the execution environment is called a redistributable package.
- Windows has /MT and /MD builds. An /MD build dynamically links the C runtime from a DLL, which makes the executable file smaller. An /MT build statically links the runtime, so the executable can be run anywhere, but its file size is larger.
Pipelining
- The single-cycle design (one instruction per clock cycle) requires the clock cycle to have the same length for every instruction, so the longest possible path in the processor determines the clock cycle. This design is not commonly used today for exactly that reason; pipelining is used instead.
- In the laundry analogy, pipelining improves the throughput of the laundry system. Pipelining does not decrease the time to complete one load of laundry, but when there are many loads to do, the improvement in throughput decreases the total time to complete the work.
- Each stage in a pipelined CPU takes one clock cycle to execute, meaning that a CPU with an N-stage pipeline has a minimum instruction latency of N clock cycles. The five stages of pipelining are as follows:
- Instruction Fetch(IF)
- Instruction Decode(ID)
- Execute(EX)
- Memory Access(MEM)
- Write Back(WB)
- There are situations in pipelining when the next instruction cannot execute in the following clock cycle.
- Structural hazard: It means that the hardware cannot support the combination of instructions that are supposed to be executed in the same clock cycle. In the same clock cycle, an instruction can try to access data from memory while another instruction is fetching an instruction from that same memory. Without two memories, the pipeline could have a structural hazard.
- Data hazard: It occurs when the pipeline must be stalled because one step must wait for another to complete. It arises from the dependence of one instruction on an earlier one that is still in the pipeline. The primary solution is called forwarding or bypassing, where the intermediate result of one instruction is handed over in advance to the next instruction by adding extra hardware. Even with forwarding, the pipeline must still stall for one stage on a load-use data hazard; this stall is called a pipeline stall or bubble (see the sketch after this list).
- Control hazard(branch hazard): It arises from the need to make a decision based on the results of one instruction while others are executing.
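A minimal C sketch of the load-use pattern behind such a stall: the compiled load is immediately followed by an instruction that needs the loaded value, which is exactly the case that forwarding alone cannot fully hide.

```c
/* The classic load-use pattern: the return expression uses the value
   loaded on the previous line, so the generated load instruction is
   immediately followed by a dependent add, forcing a one-cycle bubble
   even with forwarding. */
int load_use(const int *a, int b) {
    int x = a[0];      /* load word from memory */
    return x + b;      /* uses x in the very next instruction */
}
```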
- So that each stage in the pipeline can work on a different instruction in every clock cycle, the result of each stage is stored in pipeline registers and handed to the following stage. Note that there is no pipeline register at the end of the write-back stage.
- A more recent innovation in branch prediction is the use of tournament predictors. A tournament predictor uses multiple predictors, tracking, for each branch, which predictor yields the best results.
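A deliberately simplified, illustrative sketch of the idea: two component predictors built from 2-bit saturating counters plus a per-branch chooser counter that learns which component has been more accurate. Table sizes, indexing, and all names here are assumptions; real tournament predictors are organized differently.

```c
#include <stdbool.h>
#include <stdint.h>

#define TABLE_SIZE 1024

static uint8_t  local_ctr[TABLE_SIZE];    /* indexed by branch address */
static uint8_t  global_ctr[TABLE_SIZE];   /* indexed by global history */
static uint8_t  chooser[TABLE_SIZE];      /* >= 2 means "trust global" */
static uint32_t history;                  /* recent branch outcomes */

static void update_ctr(uint8_t *c, bool taken) {
    if (taken) { if (*c < 3) (*c)++; } else { if (*c > 0) (*c)--; }
}

/* Predict taken when the selected 2-bit counter is 2 or 3. */
bool predict(uint32_t pc) {
    uint32_t li = pc % TABLE_SIZE, gi = history % TABLE_SIZE;
    uint8_t c = (chooser[li] >= 2) ? global_ctr[gi] : local_ctr[li];
    return c >= 2;
}

/* Train both components and move the chooser toward whichever one was right. */
void train(uint32_t pc, bool taken) {
    uint32_t li = pc % TABLE_SIZE, gi = history % TABLE_SIZE;
    bool local_ok  = (local_ctr[li]  >= 2) == taken;
    bool global_ok = (global_ctr[gi] >= 2) == taken;
    if (global_ok && !local_ok && chooser[li] < 3) chooser[li]++;
    if (local_ok && !global_ok && chooser[li] > 0) chooser[li]--;
    update_ctr(&local_ctr[li], taken);
    update_ctr(&global_ctr[gi], taken);
    history = (history << 1) | (taken ? 1u : 0u);
}
```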
- To increase the potential amount of instruction-level parallelism, multiple issue replicates the internal components of the computer so that multiple instructions can be launched in every pipeline stage. Today’s high-end microprocessors attempt to issue 3~6 instructions in every clock cycle. Although a two-issue processor can improve performance by up to a factor of two, it requires that twice as many instructions be overlapped in execution, and this additional overlap increases the relative performance loss from data and control hazards. Furthermore, sustaining such a high issue rate is very difficult.
- Dynamic multiple-issue, also known as superscalar, is an advanced pipelining technique that enables the processor to execute more than one instruction per clock cycle by selecting them during execution. Many superscalars support dynamic pipeline scheduling which is hardware support for reordering the order of instruction execution to avoid stalls.
Memory Hierarchy
- Data in DRAM cannot be kept indefinitely and must periodically be refreshed. That is why DRAM is called dynamic, as opposed to the static storage in an SRAM cell.
- DRAMs added clocks and are properly called Synchronous DRAMs, or SDRAMs. The advantage of SDRAMs is that the use of a clock eliminates the time for the memory and processor to synchronize. The fastest version is called DDR (Double Data Rate) SDRAM, which transfers data on both the rising and falling edges of the clock, thereby getting twice as much bandwidth. The latest version of this technology is called DDR4.
- SSDs are similar to cars in that they can fail after a certain distance driven. The unit for this distance is TBW (TeraBytes Written), the total number of terabytes that can be written to the drive over its lifetime.
- Memory usage usually does not reach 100%, and there is always some space left. The operating system always uses this free memory space as a cache for disks to minimize the need to read data from disks. This is the basic principle of page cache in Linux operating systems.
- As memory becomes more and more affordable, it is becoming popular to replace disks with memory in servers. Now, RAM itself is acting as the new disk.
- However, this does not mean that memory can completely replace disks, because memory does not have the ability to store data permanently.
- Memory-mapped I/O allocates part of the memory address space to a device so that the device can be controlled with ordinary memory reads and writes.
- If the address space is 8 bits, the range of 00000000 ~ 11101111 can be allocated to memory, and the remaining range of 11110000 ~ 11111111 can be allocated to the device. When the CPU executes a load instruction, if the first 4 bits are all 1, the load instruction targets the device.
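A tiny sketch of the address decode in that 8-bit example: addresses whose upper four bits are all 1 (0xF0 through 0xFF) are routed to the device rather than to memory. The function name is illustrative.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Upper four bits all set -> the access targets the device region. */
bool is_device_address(uint8_t addr) {
    return (addr & 0xF0) == 0xF0;
}

int main(void) {
    printf("%d %d\n", is_device_address(0xEF), is_device_address(0xF3));
    return 0;   /* prints: 0 1 */
}
```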
- L1 cache is placed very near to the CPU core (on the same die). Its access latency is almost as low as the CPU’s register file, because it is so close to the CPU core. L2 cache is located further away from the core (usually also on-chip, and often shared between two or more cores on a multicore CPU). It is typical for each core to have its own L1 cache, but multiple cores might share an L2 cache, as well as sharing main RAM.
- It is important to realize that both data and instructions are cached. In an L1 cache, the two caches are always physically distinct, because it is undesirable for an instruction read to cause valid data to be evicted from the cache, or vice versa. Higher-level caches typically do not make this distinction between code and data, because their larger size tends to mitigate the problems of code evicting data or data evicting code.
- When a block can go in exactly one place in the cache, it is called direct mapped because there is a direct mapping from any block address in memory to a single location in the upper level of the hierarchy.
- Fully associative is a scheme where a block can be placed in any location in the cache. To find a given block in a fully associative cache, all the entries in the cache must be searched because a block can be placed in any one.
- The middle range of designs between direct mapped and fully associative is called set associative. A set-associative cache with n locations for a block is called an n-way set-associative cache. Each block in the memory maps to a unique set in the cache given by the index field, and a block can be placed in any element of that set.
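A small sketch of how a block address is split into an index and a tag, with hypothetical parameters (64-byte blocks, 256 sets); direct mapped is simply the one-way case, and fully associative is the single-set case. The function name is illustrative.

```c
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE 64
#define NUM_SETS   256

/* Block address = byte address / block size; the index field selects
   the set and the remaining upper bits form the tag stored for
   comparison on each lookup. */
void map_address(uint32_t addr, uint32_t *set, uint32_t *tag) {
    uint32_t block_addr = addr / BLOCK_SIZE;
    *set = block_addr % NUM_SETS;
    *tag = block_addr / NUM_SETS;
}

int main(void) {
    uint32_t set, tag;
    map_address(0x12345678, &set, &tag);
    printf("set %u, tag 0x%x\n", set, tag);
    return 0;
}
```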
- As a full search is impractical, virtual memory systems use a page table that indexes the memory to locate pages. A page table is indexed with the page number from the virtual address to discover the corresponding physical page number. The page number is looked up in the page table by the CPU’s memory management unit (MMU). To indicate the location of the page table in memory, the hardware includes a register that points to the start of the page table, called the page table register.
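A minimal sketch of a single-level translation, assuming 4 KiB pages and a flat in-memory table; the structure and field names are illustrative, and a real MMU performs this walk in hardware starting from the page table register.

```c
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1u << PAGE_SHIFT)   /* 4 KiB pages */

typedef struct {
    uint32_t valid : 1;     /* page currently in physical memory? */
    uint32_t ppn   : 20;    /* physical page number */
} PageTableEntry;

/* Split the virtual address into page number and offset, look up the
   entry, and rebuild the physical address. Returning 0 stands in for
   raising a page fault when the valid bit is off. */
uint32_t translate(const PageTableEntry *page_table, uint32_t vaddr) {
    uint32_t vpn    = vaddr >> PAGE_SHIFT;
    uint32_t offset = vaddr & (PAGE_SIZE - 1);
    if (!page_table[vpn].valid)
        return 0;   /* a real system would trap to the OS here */
    return ((uint32_t)page_table[vpn].ppn << PAGE_SHIFT) | offset;
}
```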
- If the valid bit for a virtual page is off, a page fault occurs and the operating system must be given control. Because it is not known in advance when a page in memory will be replaced, the operating system usually creates the space on flash memory or disk for all pages of a process when it creates the process. This space is called the swap space. When a page fault occurs, operating systems follow a least recently used (LRU) replacement scheme, or an approximation of it, and the replaced pages are written to swap space.
- Virtual memory systems must use write-back, performing the individual writes into the page in memory, and copying the page back to disk when it is replaced in the memory. The dirty bit in the page table is set when any word in a page is written.
- Since the page tables are stored in main memory, every memory access by a program can take at least twice as long: one memory access to obtain the physical address and a second access to get the data. Accordingly, modern processors include the special address translation cache, which is called translation-lookaside buffer(TLB). TLB is a cache that keeps track of recently used address mappings to try to avoid an access to the page table. Because this buffer is located in close proximity to the MMU, accessing it is very fast.
- On every reference, the virtual page number is searched in the TLB. If hit, the physical page number is used to form the address, and the corresponding reference bit is turned on. If the processor is performing a write, the dirty bit is also turned on. If a miss in the TLB occurs, whether it is a page fault or merely a TLB miss is determined. If the page exists in memory, then the TLB miss indicates only that the translation is missing. In such cases, the processor can handle the TLB miss by loading the translation from the page table into the TLB and then trying the reference again. If the page is not present in memory, then the TLB miss indicates a true page fault. In this case, the processor invokes the operating system using an exception.
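A sketch of that lookup order using a tiny, fully associative, software-modeled TLB; all structures, sizes, and names are illustrative. On a miss the caller would walk the page table (TLB miss) or raise a page fault if the page is not in memory.

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 16
#define PAGE_SHIFT  12

typedef struct {
    bool     valid, reference, dirty;
    uint32_t vpn, ppn;
} TlbEntry;

static TlbEntry tlb[TLB_ENTRIES];

/* Returns true on a TLB hit and fills *ppn; turns on the reference bit
   and, for writes, the dirty bit. Returns false on a TLB miss. */
bool tlb_lookup(uint32_t vaddr, bool is_write, uint32_t *ppn) {
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    for (int i = 0; i < TLB_ENTRIES; ++i) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            tlb[i].reference = true;
            if (is_write) tlb[i].dirty = true;
            *ppn = tlb[i].ppn;
            return true;
        }
    }
    return false;   /* miss: consult the page table (or fault) */
}
```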
- Since the TLB miss rate is small, using write-back for TLB entries is very efficient. Furthermore, because TLB misses are much more frequent than page faults, they must be handled more cheaply than page faults; as a result, many systems provide some support for randomly choosing an entry to replace.
- The view of memory held by two different processors is through their caches, which, without any additional precautions, could end up seeing two different values. This difficulty is generally referred to as the cache coherence problem.
- The protocols to maintain coherence for multiple processors are called cache coherence protocols. Key to implementing a cache coherence protocol is tracking the state of any sharing of a data block. The most popular cache coherence protocol is snooping. This style of protocol is called a write invalidate protocol because it invalidates copies in other caches on a write.
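A deliberately simplified sketch of the write-invalidate idea for a single shared block: it only shows that a write by one cache invalidates the copies held by the others. There is no bus model and no write-back of modified data; all names and the three-state scheme are illustrative assumptions.

```c
#include <stdio.h>

#define NUM_CACHES 4

typedef enum { INVALID, SHARED, MODIFIED } State;

static State state[NUM_CACHES];   /* state of the one tracked block per cache */

/* On a write, the writing cache gains exclusive access and every other
   cache that snoops the write invalidates its copy. */
void write_block(int writer) {
    for (int c = 0; c < NUM_CACHES; ++c)
        state[c] = (c == writer) ? MODIFIED : INVALID;
}

/* On a read miss, the block is fetched and becomes SHARED (a MODIFIED
   copy elsewhere would first be written back; omitted here). */
void read_block(int reader) {
    if (state[reader] == INVALID)
        state[reader] = SHARED;
}

int main(void) {
    read_block(0); read_block(1);   /* caches 0 and 1 share the block */
    write_block(0);                 /* cache 0 writes: cache 1 is invalidated */
    printf("cache 1 state: %d (0 = INVALID)\n", state[1]);
    return 0;
}
```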
- Designers attempting to hide cache miss latency commonly use a nonblocking cache when building out-of-order processors. They implement two flavors of nonblocking: hit under miss allows additional cache hits during a miss, while miss under miss allows multiple outstanding cache misses.
- The Intel Core i7 has a prefetch mechanism for data access. It looks at a pattern of data misses and uses this information to predict the next address, starting to fetch the data before the miss occurs. Such techniques generally work best when accessing arrays in loops. In most cases, the prefetched line is simply the next sequential block.
Parallel Processors
- The disadvantage of a process is that there is only one entry function, main(), so the machine instructions of the process can only be executed by one CPU at a time. But since the PC register can point to any other function, just as it can point to main(), a new execution flow can be formed from it. Therefore, a process can have two or more entry functions, and the machine instructions belonging to a process can be executed simultaneously on multiple CPUs. Each such execution flow is called a thread.
- In a single-core system, the CPU can only execute one thread at a time. Although multiple threads may be executed alternately, this is not true parallel processing. But that does not mean it is meaningless.
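A minimal sketch assuming POSIX threads: pthread_create gives the process a second entry function, that is, a second execution flow that the scheduler can run on another CPU.

```c
#include <pthread.h>
#include <stdio.h>

/* A second "entry function": the PC of the new thread starts here. */
static void *thread_entry(void *arg) {
    (void)arg;
    printf("hello from the new execution flow\n");
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, thread_entry, NULL);  /* new execution flow begins */
    printf("hello from main()\n");
    pthread_join(t, NULL);
    return 0;
}
```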
- In general, the number of threads created should be linearly related to the number of cores. In particular, keep in mind that more threads is not necessarily better.
- Each thread would have a separate copy of the register file and the program counter. In addition, the hardware must support the ability to change to a different thread relatively quickly. In particular, a thread switch should be much more efficient than a process switch, which typically requires hundreds to thousands of processor cycles while a thread switch can be instantaneous.
- The concept of a thread pool is simply to create multiple threads in advance, and when a task arises, hand it to one of those existing threads to process (see the sketch after the analogy below).
- Suppose a factory hires a new worker each time a new order arrives. If the work takes 5 minutes and hiring takes 10 hours, it is inefficient to hire a worker, process one order, and then fire him. It is much better to keep the workers, have them process orders as they come in, and let them rest when there are none.
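A minimal thread-pool sketch assuming POSIX threads: a fixed number of workers is created once and repeatedly pulls task indices from a shared queue. The queue layout, worker count, and sentinel shutdown scheme are illustrative choices, not a production design.

```c
#include <pthread.h>
#include <stdio.h>

#define NUM_WORKERS 4
#define NUM_TASKS   12
#define QUEUE_CAP   (NUM_TASKS + NUM_WORKERS)   /* room for shutdown sentinels */

static int queue[QUEUE_CAP];
static int head = 0, tail = 0;
static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;

/* Each worker is "hired" once and reused: it repeatedly takes a task
   from the shared queue instead of being created per task. */
static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (head == tail)                  /* no work yet: rest */
            pthread_cond_wait(&ready, &lock);
        int task = queue[head++];
        pthread_mutex_unlock(&lock);
        if (task < 0)                         /* sentinel: shut down */
            return NULL;
        printf("processed task %d\n", task);  /* "handle the order" */
    }
}

int main(void) {
    pthread_t workers[NUM_WORKERS];
    for (int i = 0; i < NUM_WORKERS; ++i)
        pthread_create(&workers[i], NULL, worker, NULL);

    pthread_mutex_lock(&lock);
    for (int t = 0; t < NUM_TASKS; ++t)       /* submit the orders */
        queue[tail++] = t;
    for (int i = 0; i < NUM_WORKERS; ++i)     /* one sentinel per worker */
        queue[tail++] = -1;
    pthread_cond_broadcast(&ready);
    pthread_mutex_unlock(&lock);

    for (int i = 0; i < NUM_WORKERS; ++i)
        pthread_join(workers[i], NULL);
    return 0;
}
```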
- Each thread has its own stack area. But, there is no protection mechanism between the stack areas of different threads. Therefore, if one thread can get a pointer to another thread’s stack frame, that thread can directly read and write the stack area of the other thread.
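A small sketch assuming POSIX threads: main() passes a pointer into its own stack frame, and the other thread writes through it. Nothing in hardware or the OS prevents this cross-stack access.

```c
#include <pthread.h>
#include <stdio.h>

/* The worker receives a pointer into main()'s stack frame and writes
   through it, directly modifying another thread's local variable. */
static void *writer(void *arg) {
    int *p = arg;
    *p = 42;
    return NULL;
}

int main(void) {
    int local_on_main_stack = 0;
    pthread_t t;
    pthread_create(&t, NULL, writer, &local_on_main_stack);
    pthread_join(t, NULL);
    printf("%d\n", local_on_main_stack);   /* prints 42 */
    return 0;
}
```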
- Fine-grained multithreading switches between threads on each instruction, resulting in interleaved execution of multiple threads. This interleaving is often done in a round-robin fashion, skipping any threads that are stalled at that clock cycle. One advantage of this is that it can hide the throughput losses that arise from both short and long stalls, since instructions from other threads can be executed when one thread stalls. The primary disadvantage is that it slows down the execution of the individual threads.
- Coarse-grained multithreading was invented as an alternative to fine-grained multithreading. It switches threads only on costly stalls, such as last-level cache misses. This change relieves the need to have thread switching extremely fast and is much less likely to slow down the execution of an individual thread. However, it is limited in its ability to overcome throughput losses, especially from shorter stalls.
- GPUs do not rely on multilevel caches to overcome the long latency to memory, as do CPUs. Instead, GPUs rely on hardware multithreading to hide the latency to memory. That is, between the time of a memory request and the time that data arrives, the GPU executes hundreds or thousands of threads that are independent of that request.
- CUDA threads are grouped into blocks and executed in groups of 32 at a time. A multithreaded processor inside a GPU executes these blocks of threads, and a GPU consists of 8~128 of these multithreaded processors.
- A GPU is a MIMD composed of multithreaded SIMD processors.
- The machine object that the hardware creates, manages, schedules, and executes is a thread of SIMD instructions, which is also called a SIMD thread. It is a traditional thread, but it contains exclusively SIMD instructions. These SIMD threads have their own program counters and they run on a multithreaded SIMD processor. The SIMD Thread Scheduler includes a controller that lets it know which threads of SIMD instructions are ready to run, and then it sends them off to a dispatch unit to be run on the multithreaded SIMD processor.
- GPU hardware has two levels of hardware schedulers:
- The Thread Block Scheduler that assigns blocks of threads to multithreaded SIMD processors
- The SIMD Thread Scheduler within a SIMD processor, which schedules when SIMD threads should run
- While hiding memory latency is the underlying philosophy of GPUs, note that the latest GPUs and vector processors have added caches. They are thought of as either bandwidth filters to reduce demands on GPU memory or as accelerators for the few variables whose latency cannot be hidden by multithreading.