File Processing as Memory Read/Write

Why disks cannot be used as memory

Reading and writing a file from code is less convenient than reading and writing memory, because the file must be opened, accessed through a file descriptor, and closed again. This extra complexity also reflects a deeper fact: a disk is inherently not a substitute for memory.
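As a minimal sketch of those steps, the following Python snippet reads a few bytes through a raw file descriptor. The scratch file and its contents are made up for illustration.

```python
import os
import tempfile

# Create a scratch file so the example is self-contained (hypothetical data).
fd, path = tempfile.mkstemp()
os.write(fd, b"hello, disk")
os.close(fd)

# Conventional read: open the file, obtain a descriptor, read, close.
fd = os.open(path, os.O_RDONLY)   # open the file and get a file descriptor
data = os.read(fd, 5)             # copy up to 5 bytes into a new bytes object
os.close(fd)                      # release the descriptor

os.remove(path)
```

Every access goes through an explicit call on the descriptor; there is no way to simply index into the file.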

Besides the speed gap, disks and memory are addressed differently. Memory is byte-addressable, while a disk is addressed in blocks, typically 512 bytes to a few kilobytes depending on the device and file system. A file on disk must therefore be copied into memory first, and only then can it be processed byte by byte.

How disks can be made to look like they are used as memory

Because of these physical limitations, a disk cannot actually serve as memory. The operating system, however, can use virtual memory to create the illusion that it does.

The mechanism is mmap, which maps a file into the address space of a process. The file can then be read and written byte by byte through that address range, as if it were ordinary memory. The first access to a mapped address typically triggers a page fault, because that part of the file has not been loaded yet. The CPU then runs the operating system's page fault handler, which issues the disk I/O request. Once the data has been copied into physical memory and the virtual-to-physical mapping is established, the program continues as if it were simply reading and writing memory.
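A minimal sketch of this idea using Python's standard mmap module follows. The scratch file is hypothetical, and the page faults themselves are handled by the operating system and are not visible in the code.

```python
import mmap
import os
import tempfile

# Create a scratch file so the example is self-contained (hypothetical data).
fd, path = tempfile.mkstemp()
os.write(fd, b"hello, disk")
os.close(fd)

with open(path, "r+b") as f:
    # Length 0 maps the whole file into this process's address space.
    with mmap.mmap(f.fileno(), 0) as mm:
        first = mm[0:5]       # read bytes by indexing, no read() call
        mm[0:5] = b"HELLO"    # write in place; the length must stay the same
        mm.flush()            # ask the OS to write the modified pages back

# Reopen the file normally to confirm the change reached it.
with open(path, "rb") as f:
    result = f.read()

os.remove(path)
```

Here `first` holds the original bytes and `result` holds the file contents after the in-place write.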

Using mmap does not eliminate the disk reads and writes; it moves them into the operating system. Thanks to virtual memory, the programmer does not have to manage them explicitly.

Zero-copy

As explained in this note, when a file is read in the usual way, the data is first copied into a buffer inside the kernel and then copied again into a buffer in user space. With mmap, the second copy is eliminated, because the process accesses the mapped pages directly.
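The kernel-to-user copy cannot be observed from Python directly, so the following is only a rough illustration of the user-space side, under the assumption that a memoryview over the mapping is an acceptable stand-in for "accessing the mapped pages directly". Slicing the mmap object produces a separate bytes copy, while the memoryview refers to the mapping itself.

```python
import mmap
import os
import tempfile

# Create a scratch file so the example is self-contained (hypothetical data).
fd, path = tempfile.mkstemp()
os.write(fd, b"hello, disk")
os.close(fd)

with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as mm:
    copied = mm[:5]                      # slicing builds an independent bytes object
    # Both views are released on exit, before the mapping is closed.
    with memoryview(mm) as whole, whole[:5] as head:
        mm[0:1] = b"J"                   # modify the mapping after taking the view
        seen_by_view = bytes(head)       # the view reflects the change
    seen_by_copy = copied                # the earlier copy does not

os.remove(path)
```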

However, mmap is not free either. The kernel has to maintain data structures that track each mapping, and page faults are unavoidable. Which approach is faster therefore has to be measured in the situation at hand. If a file larger than physical memory has to be processed, mmap may be the better choice. It can also simplify the code when reads and writes land at arbitrary offsets.
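To illustrate the point about arbitrary offsets, here is a small sketch that treats a mapped file as an array of fixed-size records. The record layout (one little-endian 32-bit integer) and the record count are assumptions made up for this example.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical record layout: one little-endian unsigned 32-bit integer.
RECORD = struct.Struct("<I")

# Write 1000 records (0, 10, 20, ...) to a scratch file.
fd, path = tempfile.mkstemp()
os.write(fd, b"".join(RECORD.pack(i * 10) for i in range(1000)))
os.close(fd)

with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as mm:
    # Read record 742 directly by offset, with no seek() or read() calls.
    (value,) = RECORD.unpack_from(mm, 742 * RECORD.size)
    # Overwrite record 3 in place, then read it back.
    RECORD.pack_into(mm, 3 * RECORD.size, 99)
    (updated,) = RECORD.unpack_from(mm, 3 * RECORD.size)

os.remove(path)
```

With ordinary file I/O the same logic would need a seek followed by a read or write for every record touched.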

Dynamic library

As explained in this note, dynamic libraries can greatly reduce the size of an executable file. Even if a program's own code is 2 MB and the dynamic library it uses is 100 MB, the executable is still only about 2 MB. No matter how many programs use the library, none of their executables contain its code and data.

This is where mmap comes in: it can map the library into the address space of every process that uses it. Each process sees the library as loaded into its own address space, yet only one copy of the library occupies physical memory.

strace

The strace command prints the system calls a program makes as it runs, which reveals a great deal about what happens during execution. It shows that many programs start up in a similar way: almost all of them map some dynamic libraries with mmap. In fact, mmap is at work behind the scenes whenever a dynamically linked program is started on Linux.
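strace is a command-line tool, so it is not reproduced here. As a rough complementary view from inside a running process, the sketch below reads /proc/self/maps, which on Linux lists the current memory mappings of the process, and collects the shared libraries found there. It assumes a Linux system; elsewhere the list is simply left empty, and the helper name is made up for this example.

```python
import os

def mapped_shared_libraries(maps_path="/proc/self/maps"):
    """Return the sorted paths of shared libraries mapped into this process.

    Hypothetical helper for illustration; returns [] if the file is absent.
    """
    if not os.path.exists(maps_path):
        return []
    libs = set()
    with open(maps_path) as maps:
        for line in maps:
            # Fields: address, perms, offset, dev, inode, optional pathname.
            fields = line.split(maxsplit=5)
            if len(fields) == 6 and ".so" in fields[5]:
                libs.add(fields[5].strip())
    return sorted(libs)

libs = mapped_shared_libraries()
for lib in libs:
    print(lib)
```

The exact list depends on the system and on how the interpreter was built.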

More generally, whenever many processes need read-only access to the same data, mmap fits the requirement very well.
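A minimal sketch of a read-only mapping follows. For simplicity, two mappings inside one process stand in for two separate processes, and the file contents are hypothetical.

```python
import mmap
import os
import tempfile

# Create a scratch file so the example is self-contained (hypothetical data).
fd, path = tempfile.mkstemp()
os.write(fd, b"shared lookup table")
os.close(fd)

with open(path, "rb") as f:
    # Two read-only mappings of the same file stand in for two processes.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as a, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as b:
        same = a[:6] == b[:6] == b"shared"
        try:
            a[0:1] = b"X"           # writing to a read-only mapping is rejected
            write_rejected = False
        except TypeError:
            write_rejected = True

os.remove(path)
```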

Reference

[1] 루 샤오펑. 2024. 컴퓨터 밑바닥의 비밀, 길벗

