File Processing as Memory Read/Write
- Why disks cannot be used as memory
- How disks can be made to look like they are used as memory
- Zero-copy
- Dynamic library
- strace
- Reference
Why disks cannot be used as memory
Reading and writing files in code is less convenient than reading and writing memory, because the file must be opened, a file descriptor obtained, and the file closed afterwards. This extra complexity reflects the fact that disks are inherently not a substitute for memory.
Beyond the speed difference, disks and memory are addressed differently. Memory is addressed in bytes, while disks are addressed in blocks. Block sizes vary, typically from a few hundred bytes (for example, 512-byte sectors) to tens of kilobytes. Therefore, a file on disk must first be copied into memory before it can be processed byte by byte.
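To make the addressing difference concrete, the sketch below computes which block has to be fetched in order to reach a single byte. The 4096-byte block size and the name `locate_byte` are assumptions chosen for illustration; real devices and file systems use various sizes.

```python
BLOCK_SIZE = 4096  # assumed block size in bytes; real devices differ


def locate_byte(offset, block_size=BLOCK_SIZE):
    """Return (block index, offset inside that block) for a byte offset."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return divmod(offset, block_size)


# Byte 10000 lives in block 2, 1808 bytes into it, so the whole
# block has to be read just to reach that one byte.
block_index, inner_offset = locate_byte(10000)
```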
How disks can be made to look like they are used as memory
Because of these physical limitations, a disk cannot be used as memory directly, but virtual memory and the operating system can create the illusion that it is.
This is done with mmap, which maps the file into the process address space. The file can then be manipulated byte by byte in that address space, as if the program were reading and writing memory directly. The first time a mapped address is accessed, a page fault may occur because that part of the file has not been loaded yet. The CPU then runs the operating system's page fault handler, which issues the disk I/O request. Once the data has been copied into physical memory and the virtual-to-physical mapping is established, the program can use the contents of the disk as if it were ordinary memory.
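As a minimal sketch of this idea, Python's standard mmap module can map a file and expose it like a bytes object. The function name `first_bytes_via_mmap` and the demo file are hypothetical and exist only to keep the example self-contained.

```python
import mmap
import os
import tempfile


def first_bytes_via_mmap(path, count):
    """Map a file read-only and return its first `count` bytes."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the file must not be empty.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Slicing works as on a bytes object. The first access to a
            # page may trigger a page fault that the kernel services by
            # reading the file from disk.
            return mm[:count]


# Hypothetical demo file, created only so the sketch can run anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "wb") as f:
        f.write(b"hello, mmap")
    head = first_bytes_via_mmap(demo_path, 5)  # b"hello"
```

No explicit read call appears in the code that touches the data; the open call is still needed to obtain the descriptor that is handed to mmap.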
Using mmap still requires actual disk reads and writes, but the operating system performs them. Thanks to virtual memory, the programmer does not have to deal with them explicitly.
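The write direction can be sketched the same way: the program assigns to a position in the mapping and leaves the disk write to the operating system. The helper name `overwrite_byte` and the demo file are assumptions made for this example; the call to `flush()` only asks for the write-back to happen now rather than later.

```python
import mmap
import os
import tempfile


def overwrite_byte(path, offset, value):
    """Change one byte of a file by assigning into a shared mapping."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as mm:  # default mapping is read/write
            mm[offset] = value  # an int in 0..255, written like a memory store
            mm.flush()  # ask the OS to write the modified page back to the file


# Hypothetical demo file, created only so the sketch can run anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "wb") as f:
        f.write(b"cat")
    overwrite_byte(demo_path, 0, ord("b"))
    with open(demo_path, "rb") as f:
        result = f.read()  # b"bat"
```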
Zero-copy
As explained in this note, when a file is read with ordinary read calls, the data is first copied into a kernel-space buffer and then copied again into a user-space buffer. With mmap, this second copy is eliminated.
However, mmap is not perfect either. The kernel must maintain data structures to track the mappings, and page faults are unavoidable. Which method is better should therefore be determined by measuring both in the situation at hand. If the file to be processed is larger than physical memory, mmap may be the better choice. It can also simplify code that reads or writes at arbitrary locations.
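The point about arbitrary locations can be illustrated with fixed-size records. In the sketch below, the record layout (one little-endian 32-bit unsigned integer per record), the file name, and the helper `read_record` are all assumptions for illustration. Each lookup is plain offset arithmetic on the mapping, with no seek/read pair per access.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical record layout: one little-endian uint32 per record.
RECORD = struct.Struct("<I")


def read_record(mm, index):
    """Read record `index` directly from the mapping."""
    return RECORD.unpack_from(mm, index * RECORD.size)[0]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.bin")
    # Build a demo file holding the squares of 0..999.
    with open(path, "wb") as f:
        for value in range(1000):
            f.write(RECORD.pack(value * value))
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Jump around the file in any order.
        picked = [read_record(mm, i) for i in (999, 0, 512)]
```

Whether this is faster than explicit seek and read calls depends on the workload, so it is worth measuring both variants.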
Dynamic library
As explained in this note, dynamic libraries can greatly reduce the size of an executable file. Even if the code size is 2MB and the dynamic library size is 100MB, the executable size will only be about 2MB. No matter how many programs use this dynamic library, the executable will not contain the library’s code and data.
Here, mmap can map the library into the address space of every process that uses it. Each process sees the library as loaded into its own address space, but only one copy of it occupies physical memory.
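On Linux, the file mappings of a process can be inspected through /proc/<pid>/maps, where each line has the form "address perms offset dev inode pathname". The sketch below lists the shared objects mapped into the current process. The function names are hypothetical, and the example only shows that libraries appear as file mappings; it does not measure how many physical pages are actually shared.

```python
import os


def mapped_libraries(maps_text):
    """Return the sorted shared-object paths found in maps-style text."""
    libraries = set()
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) == 6 and ".so" in os.path.basename(fields[5]):
            libraries.add(fields[5].strip())
    return sorted(libraries)


def libraries_of_current_process():
    """Linux only: list the shared libraries mapped into this process."""
    maps_path = "/proc/self/maps"
    if not os.path.exists(maps_path):
        return []  # not Linux, or /proc is unavailable
    with open(maps_path) as f:
        return mapped_libraries(f.read())
```

Running `libraries_of_current_process()` in two different processes would typically show the same C library path in both, which is the situation described above.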
strace
The strace command prints all system calls made during the execution of a program, revealing much about how the program runs. It shows that the startup sequence of many programs is similar: almost all of them map some dynamic libraries with mmap. In fact, mmap is what supports this behind the scenes whenever a program is started on Linux.
In other words, whenever many processes refer to the same data for read-only purposes, mmap meets the requirement very well.
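The strace observation can be reproduced with a small wrapper, sketched here under the assumption that strace is installed and that tracing is permitted in the environment. It uses strace's `-e trace=` filter to keep only openat and mmap calls and `-o` to write the log to a file. The helper names are hypothetical.

```python
import os
import shutil
import subprocess
import sys
import tempfile


def strace_command(program_args, log_path):
    """Build an strace command that logs only openat and mmap calls."""
    return ["strace", "-e", "trace=openat,mmap", "-o", log_path, *program_args]


def mmap_calls(program_args):
    """Run a program under strace and return the logged mmap lines.

    Returns an empty list when strace is missing or produced no log.
    """
    if shutil.which("strace") is None:
        return []
    with tempfile.TemporaryDirectory() as tmp:
        log_path = os.path.join(tmp, "trace.log")
        subprocess.run(
            strace_command(program_args, log_path),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
        if not os.path.exists(log_path):
            return []
        with open(log_path, errors="replace") as log:
            return [line.rstrip("\n") for line in log if "mmap(" in line]
```

For example, `mmap_calls([sys.executable, "-c", "pass"])` would collect the mmap calls made while a Python interpreter starts up; the openat lines in the same log show which library files were opened just before being mapped.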
Reference
[1] 루 샤오펑. 2024. 컴퓨터 밑바닥의 비밀, 길벗