Things about Architecture

  • It is important to realize that one CPU instruction does not necessarily take one clock cycle to execute. CPI(clock cycles per instruction) varies from the type of instructions executed. Therefore, fewer instructions do not imply faster code since the number of clock cycles is different by instruction.
  • Static energy consumption occurs because of leakage current that flows even when a transistor is off. In servers, leakage is typically responsible for 40% of the energy consumption.
  • A very large number of registers may increase the clock cycle time simply because it takes electronic signals longer as they must travel farther. However, note that 31 registers may not be faster than 32.
  • The segment of the stack containing a procedure’s saved registers and local variables is called a procedure frame or activation record. The stack pointer points to the top of the stack and the frame pointer points to the first word of the frame, often a saved argument register. The frame pointer offers a stable base register within a procedure for local memory-references.
  • The specific bits are used for memory address in a machine code. For example, if the 16-bit field is used for memory address, it is too small to be a realistic option. One solution is to specify a register that would always be added to the 16-bit field so that it allows the program to be as large as 2322^{32}. Since PC(program counter) contains the address of the current instruction, it is possible to go ±215\pm 2^{15} words of the current instruction if PC is used for this register. This form is called PC-relative addressing. If all instructions are 4 bytes long, memory distances can be four times stretched by representing them as word instead of byte.
  • A function’s signature means its name and the sequence of parameter types, but not its return type.
  • The compiler transforms the C program into an assembly language program.
  • The assembler turns the assembly language program into an object file. It can also treat common variations of machine language instructions as if they were instructions in their own right, which is called pseudoinstructions.
  • Assemblers keep track of lables used in branches and data transfer instructions in a symbol table.
  • Header files are not seen by the compiler. Instead, the C++ preprocessor replaces each #include statement with the contents of the corresponding header file prior to sending the translation unit to the compiler. Header files exist as distint files from the point of view of the programmer, but thanks to preprocessor’s header file expansion, all the compiler needs to see are translation units, or source files.
    • The preprocessor independently scans source files, and then incorporates the text from each included header file as indicated by #include preprocessor directives. Each included header file is similarly scanned and #include directives nested within these headers are processed recursively. Conceptually, the resulting text forms an intermediate, source-level module called a translation unit, as a file with a .i suffix. Note that this .i file is hypothetical and not likely to be produced by a compiler unless specifically requested. This intermediate representation is then fed to the C++ compiler, which compiles the translation unit and produces a corresponding object file.
  • The linker takes all the independently assembled machine language programs and stitches them together. The linker resolves machine code that can be loaded and run by the operating system, which the compiler left unresolved. The linker produces an executable file. It is said that object files and libraries are linked.
    • If something is relocatable, it means that the memory addresses at which the code resides have not yet been determined.
    • Note that the machine code in an executable file is still relocatable, meaning that the addresses of all instructions and data in the file are still relative to an arbitrary base address, not absolute. The final absolute base address of the program is not known until the program is actually loaded into memory, just prior to running it.
  • The loader places an executable file in main memory so that it is ready to execute.
  • Static approach to linking libraries before the program is run is the fastest way to call library routines, but has disadvantages.
    • The statically linked program keeps using the old version even though a new version is released because the library routines become part of the executable code.
    • It loads all routines in the library that are called in the executable file, even if those calls are not executed.
  • DLL(dynamically linked libraries) do not link and load library routines until the program is run. The drawback of the initial version of DLL was that it still linked all routines of the library that might be called. So lazy procedure linkage version of DLL where each routine is linked only after it is called appears. DLL pay overhead the first time a routine is called, but only a single indirect jump thereafter.
    • Once the first time DLL are needed by a process, the OS loads them into physical memory, and maps a view of them into the process’s virtual address space. The addresses of the functions and global variables provided by DLL are patched into the program’s machine code, allowing it to call them as if they had been statically linked into the executable.
    • The benefit of DLL only becomes evident when a second process is run that uses the same DLL. Rather than loading a copy of DLL’s code and global variables, the already-loaded physical pages are simply mapped into the virtual address space of the new process. This saves memory and speeds up the process of running all but the first process.
  • C offers the programmer a chance to give a hint to the compiler about which variables to keep in registers or memory. However, today’s C compilers generally ignore such hints, because the compiler does a better job at allocation than the programmer does.
  • Signed division must set the sign of the remainder. Rules are as follows.
    • The absolute value of the quotient and remainder is the same as calculating both the dividend and divisor as positive numbers.
    • If the signs of the dividend and divisor are the same, the sign of the quotient is set to be positive(+), and if they are different, it should be negative(-).
    • The sign of the remainder is the same as the sign of the dividend.
  // C++ Test Results.
   7 /  2 =  3,   7 %  2 =  1
  -7 / -2 =  3,  -7 % -2 = -1
  -7 /  2 = -3,  -7 %  2 = -1
   7 / -2 = -3,   7 % -2 =  1
  • Only for unsigned intergers, a left shift instruction can replace an integer multiplication by a power of 2 and a right shift is the same as an integer division by a power of 2. For signed integers, an arithmetic right shift that extends the sign bit is used even though a 2-bit arithmetic shift right of -5(0x1011) produces -2(0x1110) instead of -1.
  • Associativity does not hold for floating-point numbers because floating-point numbers are approximations of real numbers and because computer arithmetic has limited precision. For example, in case a million numbers are summed together, the results would not be the same when 1 processor or 1000 processors are used parallelly.
  • The single-cycle design(one instruction per one clock cycle) requires the same length for every instruction, so the longest possible path in the processor determines the clock cycle. This design has been advanced into pipelining which is nearly universal today.
  • Pipelining improves throughput of the laundry system as an analogy. Hence, pipelining would not decrease the time to complete one load of laundry, but when there are many loads of laundry to do, the improvement in throughput decreases the total time to complete the work.
  • Each stage in a pipelined CPU takes one clock cycle to execute, meaning that a CPU with an N-stage pipeline has a minimum instruction latency of N clock cycles. The five stages of pipelining are as follows:
    • Instruction Fetch(IF)
    • Instruction Decode(ID)
    • Execute(EX)
    • Memory Access(MEM)
    • Write Back(WB)
  • There are situations in pipelining when the next instruction cannot execute in the following clock cycle.
    • Structural hazard: It means that the hardware cannot support the combination of instructions that are supposed to be executed in the same clock cycle. In the same clock cycle, an instruction can try to access data from memory while another instruction is fetching an instruction from that same memory. Without two memories, the pipeline could have a structural hazard.
    • Data hazard: It occurs when the pipeline must be stalled because one step must wait for another to complete. It arises from the dependence of one instruction on an earlier one that is still in the pipeline. The primary solution is called forwarding or bypassing, where the intermediate result of one instruction is handed over in advance to the next instruction by adding extra hardware. Even with forwarding, there needs to stall one stage for a load-use data hazard and this stall is called a pipeline stall or bubble.
    • Control hazard(branch hazard): It arises from the need to make a decision based on the results of one instruction while others are executing.
  • If each stage in the pipeline can work with the next instruction in advance, it can execute the next one and the result can be stored in pipeline registers. Note that there is no pipeline register at the end of the write-back stage.
  • When pipeline bubble occurs, a CPU can look ahead in the instruction stream and select an instruction to issue out-of-order into such a delay slot. With only a single instruction stream, the CPU’s options are somewhat limited when selecting an instruction to issue into a delay slot. But what if the CPU could select its instructions from two separate instruction streams at once? This is the principle behind a hyperthreaded(HT) CPU core.
    • Technically speaking, an HT core consists of two register files and two instruction decode units, but with a single back-end for executing instructions, and a single shared L1 cache. This design enables an HT core to run two independent threads, while requiring fewer transistors than a dual core CPU, thanks to the shared back-end and L1 cache. This sharing of hardware components also results in lower instruction throughput relative to a comparable dual core CPU, because the threads contend for these shared resources.
  • A more recent innovation in branch prediction is the use of tournament predictors. A tournament predictor uses multiple predictors, tracking, for each branch, which predictor yields the best results.
  • To increase the potential amount of instruction-level parallelism, multiple issue replicates the internal components of the computer so that multiple instructions can be launched in every pipeline stage. Today’s high-end microprocessors attempt to issue 3~6 instructions in every clock cycle. Although two-issue processor can improve performance by up to a factor of two, it requires that twice as many instructions be overlapped in execution, and this additional overlap increases the relative performance loss from data and control hazards. Furthermore, sustaining the high issue rate is very difficult.
  • Dynamic multiple-issue, also known as superscalar, is an advanced pipelining technique that enables the processor to execute more than one instruction per clock cycle by selecting them during execution. Many superscalars support dynamic pipeline scheduling which is hardware support for reordering the order of instruction execution to avoid stalls.
  • Block or line is the minimum unit of information that can be either present or not present in the two-level memory hierarchy, particularly a cache.
  • DRAM cannot be kept indefinitely and must periodically be refreshed. That is why DRAM is called dynamic, as opposed to the static storage in an SRAM cell.
  • DRAMs added clocks and are properly called Synchronous DRAMS or SDRAMs. The advantage of SDRAMs is that the use of a clock eliminates the time for the memory and processor to synchronize. The fastest version is called DDR(Double Data Rate) SDRAM which means that data transfers on both the rising and falling edge of the clock, thereby getting twice as much bandwidth. The latest version of this technology is called DDR4.
  • L1 cache is placed very near to the CPU core (on the same die). Its access latency is almost as low as the CPU’s register file, because it is so close to the CPU core. L2 cache is located further away from the core (usually also on-chip, and often shared between two or more cores on a multicore CPU). It is typical for each core to have its own L1 cache, but multiple cores might share an L2 cache, as well as sharing main RAM.
  • It is important to realize that both data and instruction are cached. In a L1 cache, the two caches are always physically distinct, because it is undesirable for an instruction read to cause valid data to be evicted out of the cache or vice versa. Higher-level caches typically do not make this distinction between code and data, because their larger size tends to mitigate the problems of code evicting data or data evicting code.
  • When a block can go in exactly one place in the cache, it is called direct mapped because there is a direct mapping from any block address in memory to a single location in the upper level of the hierarchy.
  • Fully associative is a scheme where a block can be placed in any location in the cache. To find a given block in a fully associative cache, all the entries in the cache must be searched because a block can be placed in any one.
  • The middle range of designs between direct mapped and fully associative is called set associative. A set-associative cache with n locations for a block is called an n-way set-associative cache. Each block in the memory maps to a unique set in the cache given by the index field, and a block can be placed in any element of that set.
  • As a full search is impractical, virtual memory systems use page table that indexes the memory to locate pages. A page table is indexed with the page number from the virtual address to discover the corresponding physical page number. The page index is looked up by the CPU’s memory management unit(MMU) in the page table. To indicate the location of the page table in memory, the hardware includes a register that points to the start of the page table and this register is called page table register.
  • If the valid bit for a virtual page is off, a page fault occurs. The operating system must be given control. Because when a page in memory will be replaced is unknown, the operating system usually creates the space on flash memory or disk for all pages of a process when it creates the process. This space is called the swap space. When a page fault occurs, operating systems follow the least recently used(LRU) replacement scheme. The replaced pages are written to swap space.
  • Virtual memory systems must use write-back, performing the individual writes into the page in memory, and copying the page back to disk when it is replaced in the memory. The dirty bit in the page table is set when any word in a page is written.
  • Since the page tables are stored in main memory, every memory access by a program can take at least twice as long: one memory access to obtain the physical address and a second access to get the data. Accordingly, modern processors include the special address translation cache, which is called translation-lookaside buffer(TLB). TLB is a cache that keeps track of recently used address mappings to try to avoid an access to the page table. Because this buffer is located in close proximity to the MMU, accessing it is very fast.
  • On every reference, the virtual page number is searched in the TLB. If hit, the physical page number is used to form the address, and the corresponding reference bit is turned on. If the processor is performing a write, the dirty bit is also turned on. If a miss in the TLB occurs, whether it is a page fault or merely a TLB miss is determined. If the page exists in memory, then the TLB miss indicates only that the translation is missing. In such cases, the processor can handle the TLB miss by loading the translation from the page table into the TLB and then trying the reference again. If the page is not present in memory, then the TLB miss indicates a true page fault. In this case, the processor invokes the operating system using an exception.
  • Since the TLB miss rate is small, using write-back is very efficient. Furthermore, since TLB misses are much more frequent than page faults and thus must be handled more cheaply than page faults. As a result, many systems provide some support for randomly choosing an entry to replace.
  • The view of memory held by two different processors is through their caches, which, without any additional precautions, could end up seeing two different values. This difficulty is generally referred to as the cache coherence problem.
  • The protocols to maintain coherence for multiple processors are called cache coherence protocols. Key to implementing a cache coherence protocol is tracking the state of any sharing of a data block. The most popular cache coherence protocol is snooping. This style of protocol is called a write invalidate protocol because it invalidates copies in other caches on a write.
  • Designers who are attempting to hide the cache miss latency commonly use a nonblocking cache, when building out-of-order processors. They implement two flavors of nonblocking: Hit under miss allows additional cache hits during a miss, while miss under miss allows multiple outstanding cache misses.
  • The Intel Core i7 has a prefetch mechanism for data access. It looks at a pattern of data misses and uses this information to predict the next address to start fetching the data before the miss occurs. Such techniques generally work best when accessing arrays in loops. In most cases, the prefetched line is simply the next block in the cache.
  • Each thread would have a separate copy of the register file and the program counter. In addition, the hardware must support the ability to change to a different thread relatively quickly. In particular, a thread switch should be much more efficient than a process switch, which typically requires hundreds to thousands of processor cycles while a thread switch can be instantaneous.
  • Fine-grained multithreading switches between threads on each instruction, resulting in interleaved execution of multiple threads. This interleaving is often done in a round-robin fashion, skipping any threads that are stalled at that clock cycle. One advantage of this is that it can hide the throughput losses that arise from both short and long stalls, since instructions from other threads can be executed when one thread stalls. The primary disadvantage is that it slows down the execution of the individual threads.
  • Coarse-grained multithreading was invented as an alternative to fine-grained multithreading. It switches threads only on costly stalls, such as last-level cache misses. This change relieves the need to have thread switching extremely fast and is much less likely to slow down the execution of an individual thread. However, it is limited in its ability to overcome throughput losses, especially from shorter stalls.
  • GPUs do not rely on multilevel caches to overcome the long latency to memory, as do CPUs. Instead, GPUs rely on hardware multithreading to hide the latency to memory. That is, between the time of a memory request and the time that data arrives, the GPU executes hundreds or thousands of threads that are independent of that request.
  • CUDA threads are blocked together and executed in groups of 32 at a time. A multithreaded processor inside a GPU executes these blocks of threads, and a GPU consists of 8~128 of these multithreaded processors.
  • GPU is a MIMD composed of multithreaded SIMD processors.
  • The machine object that the hardware creates, manages, schedules, and executes is a thread of SIMD instructions, which is also called a SIMD thread. It is a traditional thread, but it contains exclusively SIMD instructions. These SIMD threads have their own program counters and they run on a multithreaded SIMD processor. The SIMD Thread Scheduler includes a controller that lets it know which threads of SIMD instructions are ready to run, and then it sends them off to a dispatch unit to be run on the multithreaded SIMD processor.
  • GPU hardware has two levels of hardware schedulers:
    • The Thread Block Scheduler that assigns blocks of threads to multithreaded SIMD processors
    • The SIMD Thread Scheduler within a SIMD processor, which schedules when SIMD threads should run
  • While hiding memory latency is the underlying philosophy of GPUs, note that the latest GPUs and vector processors have added caches. They are thought of as either bandwidth filters to reduce demands on GPU memory or as accelerators for the few variables whose latency cannot be hidden by multithreading.

References

[1] David A. Patterson and John L. Hennessy, Computer Organization and Design MIPS Edition: The Hardware/Software Interface

[2] J. Gregory, Game Engine Architecture, Third Edition, CRC Press

[3] J. Lakos, Large-Scale C++ Volume I: Process and Architecture, Addison-Wesley Professional


© 2025. All rights reserved.