* Reminder on a process address space

Assume a 32-bit CPU: it can address 2^32 bytes, or 4GB, of memory.  When the
OS sets up a process, it sets up segments with different protection levels:

	Addr		Segment
	----		-------
	0		TEXT: the binary executable (R+X protection only)
	after TEXT	STATICS: for const data and "strings" (read-only
			access)
	next		HEAP: malloc'd memory, allocated by asking the OS via
			brk/sbrk(2) to "extend" the breakpoint where the
			valid heap ends.  The heap starts at a fixed
			location, but can grow and shrink as needed.
	after HEAP	a BIG gap in the addr space: the OS maps nothing
			here.  If your prog touches any addr here, you get a
			SEGV (recall that the MMU checks all these addrs).
	near 4GB	STACK: starts at a very high addr (near 4GB) and
			grows downward, toward smaller mem addrs.  The stack
			can grow and shrink as needed; it is used each time
			you declare an automatic variable or call a function
			(which creates a stack frame).  Details of stack
			memory were discussed before.

Note: in many systems the hardware, at boot time, using the BIOS, will "map"
the memory of certain hardware devices to the highest mem addrs.  E.g., if
you have a video card with 512MB of onboard RAM, the BIOS may map the
upper-most 512MB of the addr space to the video card's memory.  That allows
the OS and user processes to "write" to video memory by simply addressing
memory locations.  But this can mask out actual physical DRAM you have in
that addr range (e.g., between 3.5 and 4GB of RAM).

All of the above segments are virtual addresses: each valid virt addr has to
be backed by a valid PHYSICAL page (4KB) in memory that the OS maps and
unmaps automatically.  Recall that if the OS unmaps a phys page from you,
it'll store the contents of that phys page in swap (an actual disk
partition, slow).

Q: are these page-aligned addrs?
A: Yes.  All pages, virt and phys, and all mem mgmt are handled in whole,
   aligned 4KB pages.  Recall that protection bits (PROT_READ, etc.) are
   set only on a per-page basis.
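To see this layout, a minimal sketch like the following prints one addr from
each segment.  (The exact values are hypothetical and vary per system; ASLR
may shift each region, but the relative order TEXT < STATICS < HEAP < STACK
should hold.)

	#include <stdio.h>
	#include <stdlib.h>

	const char *msg = "a read-only string";	/* STATICS (read-only) */

	int main(void)
	{
		int local = 42;			/* STACK */
		char *heapbuf = malloc(64);	/* HEAP */

		printf("TEXT    (main)   : %p\n", (void *)main);
		printf("STATICS (msg)    : %p\n", (void *)msg);
		printf("HEAP    (heapbuf): %p\n", (void *)heapbuf);
		printf("STACK   (local)  : %p\n", (void *)&local);

		free(heapbuf);
		return 0;
	}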
* What about the OS's own memory?

An OS is a "big program".  Its job is to manage all h/w resources and
protections, and in turn to run user programs.  An OS is complex: it has
lots of code, interrupt-driven parts, OO parts, many layers of abstraction,
and lots of concurrency.  But basically, the OS has a "program" (the kernel)
that runs on the CPU; it has a limited "stack", and many memory allocators
of various sorts, as well as a general "kmalloc" HEAP-based allocator.

The OS wants to serve programs/users, so the OS has to consume as few
resources as possible.  The OS kernel itself should be small; the OS stack,
heap, and other caches should also be small.  The more mem is left unused by
the OS, the more can be used by user programs (as backing memory for user
programs' virt addrs).

BUT: there's a big difference in speed b/t devices:

1. CPU speeds are in nanoseconds (e.g., accessing registers and L1/L2
   caches).
2. DRAM speeds are on the order of ~100 nanoseconds (loading chunks of code
   from RAM into CPU caches to execute).  Note: DRAM is roughly 100x
   *slower* than the CPU!  Any program that results in a lot of accesses to
   DRAM, constantly un/loading CPU caches, is called a "memory bound"
   program.
3. I/O device speeds are in milliseconds or even seconds, e.g., hard disks
   and networks.  So I/O devices are roughly 10^4 to 10^7 times slower than
   memory, or 10^6 to 10^9 times slower than the CPU.  IOW, you want to
   avoid using I/O devices as much as possible.  If your program constantly
   has to read/write from I/O devices, the program will be "I/O bound" and
   very slow.

Q: So why use I/O devices at all?
A: B/c they are big and cheap (per gig).  Generally, the faster a "memory
   storage" device is (e.g., registers, RAM, disk), the more expensive it
   is, and hence you can afford only a smaller amount of it.

B/c I/O devices are slower, the OS has to help programs run reasonably fast:
that means, cache as much as you can inside the OS.

CACHE: a copy of a subset of data that came from a "slow" place and is now
stored in a "faster" type of memory.  Caches work well b/c most access has
some sort of "temporal locality": if you accessed something recently, you'll
probably access it again in the near future.  That's why LRU (Least Recently
Used) works too.  Also, it turns out that while users may keep a lot of data
around, much of it is "cold" (hasn't been used in a while), and the most
commonly accessed recent data (i.e., "hot" data) is significantly smaller
than the total amount of data you've collected.

The OS tries to cache as much as it can, but the more (phys) mem it uses for
caches, the less mem is available to user processes.  An OS has multiple
caches, but one of the main and biggest caches the OS has is called the
"page cache".  It caches:

1. 4KB-aligned pages of TEXT segments of running programs
2. "statics" segments of programs
3. HEAP and STACK pages of programs
4. shared libraries that have been accessed by programs (e.g., libc.so)
5. files' data being accessed by you or any program

E.g., the OS loads the pages of libc.so only once into the page cache and
"shares" those pages with every program that needs them.  This saves a lot
of cache mem, and also speeds up execution of libc functions.

Part 5 is about accessing files.  Assume a program calls
open("/etc/passwd", ...) or opens "/var/www/index.html" for reading.  Those
are all files, and files come from slow I/O devices.  So the OS also wants
to cache those files in memory, so that the next time any program accesses
them, it'll be able to get a copy of the file directly from (faster) cache
memory.  The page cache is periodically cleaned using LRU or other
page-reclamation algorithms: the idea is to purge files that have been
cached but not used recently.

Assume a program does this:

1. open("/var/www/index.html", flags)
2. read(fd, buf, len)	// assume buf is malloc'ed and in the HEAP

The moment you open a file and then try to read it, the OS will load the
part of the file that you wanted into the page cache (i.e., part 5 above).
In fact, the OS will assume that you'll want to read the rest of the file,
and will go ahead and "pre-fetch" or "read-ahead" further parts of the file
into page cache mem.  This is a heuristic that assumes sequential access.

NOW: the OS has cached the file in the page cache (part 5), and has to
service your read() call.  You asked the OS to read the data of the file
into your OWN buffer "buf", which is located in your... HEAP (i.e., part 3
of the page cache).  So now the OS internally has to copy bytes from one
part of the page cache (part 5, where the file was cached) into another part
of the page cache (the phys page that backs your process's HEAP, where "buf"
is located).

Problems:

1. We have TWO copies of the same data in the page cache!  This wastes
   precious physical memory: any phys mem wasted is worse than wasting
   virtual mem.
2. We're wasting CPU cycles copying the same data from one buf to the other
   (usually done using the memcpy() function).

E.g., if you open and read() a 1GB file into some HEAP-based buffer, the OS
will have to cache twice as much, or 2GB of data.
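To make the double copy concrete, here is a minimal sketch of that exact
open()+read() sequence (the file name and buffer size are only examples):

	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	int main(void)
	{
		char *buf = malloc(4096);	/* user buffer in the HEAP */
		int fd = open("/etc/passwd", O_RDONLY);
		if (fd < 0 || !buf)
			return 1;
		/* The OS loads the file's pages into the page cache (copy
		 * #1), then copies the bytes into the phys page backing
		 * "buf" (copy #2). */
		ssize_t n = read(fd, buf, 4096);
		if (n > 0)
			printf("read %zd bytes; first byte: %c\n", n, buf[0]);
		close(fd);
		free(buf);
		return 0;
	}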
Q: How do we solve this problem of having two copies and wasting cycles?
A: Use a different interface, namely mmap(2).

Memory mapping (mmap) was created to avoid the above problem.  It keeps only
ONE copy of the file's data in the page cache, but gives a process direct
access to it.  Because mmap saves that extra memcpy(), it is sometimes
called a "zero copy" technique.

* How does mmap work?

Any program can open a file, get an fd, and then call mmap(2) to ask the OS
to "map" the file into the process's address space.  If successful, mmap
returns the start addr of the mapped file in your virtual addr space.
Usually that addr is in the "big" hole in your process's addr space.  Think
of mmap as a memory allocator:

	malloc: you ask to alloc N bytes, and get the addr of the first byte
	mmap:   you ask to map a file of size N, and get the addr of the
		first byte

	void *ptr;
	int fd = open(file, O_RDONLY);
	ptr = mmap(NULL, N, PROT_READ, MAP_PRIVATE, fd, 0);

If successful, you can now read the contents of "file" by merely
dereferencing bytes ptr[0], ptr[1], ..., all the way to ptr[N-1], where N is
the size of the file that you asked to map.

Inside the OS, the file is loaded into the page cache as usual, but then the
OS maps those pages DIRECTLY from the page cache (e.g., from part 5) into
the process's virt addr space.  That way, you don't need to malloc explicit
mem to hold the file: rather, you have direct access to the file's pages in
the page cache.

Often mmap is more convenient, b/c the file shows up as a large byte array.
Reading a byte of the mapping is just like doing read(2), but more efficient
and faster.  Modifying an mmap'd byte is just like changing the content of a
file using write(2): the changed buffers corresponding to the file in the
page cache are marked "dirty" by the OS, and the OS eventually flushes those
changes directly to the actual file on disk (a periodic process called
"dirty page flushing").

mmap(2) is often more efficient than read(2)/write(2), especially for large
files.  For that reason, many programs these days use mmap (some versions of
/bin/cp, for example).
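As a minimal sketch of the mmap path (same example file as above; error
handling kept short, and we assume a non-empty file):

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <sys/stat.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("/etc/passwd", O_RDONLY);
		struct stat st;
		if (fd < 0 || fstat(fd, &st) < 0)
			return 1;
		/* map the whole file, read-only: its bytes come straight
		 * from the page cache, with no second copy into a HEAP buf */
		char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE,
			       fd, 0);
		if (p == MAP_FAILED)
			return 1;
		printf("file is %lld bytes; first byte: %c\n",
		       (long long)st.st_size, p[0]);
		munmap(p, st.st_size);
		close(fd);	/* the mapping stays valid after close */
		return 0;
	}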
With mmap you can choose to do the following:

1. You can map a whole file, or just a part of a file starting at any
   (page-aligned) offset.

2. You can choose the type of protection to set on the mapping: read only,
   write only, read+write, or none (recall the PROT_* flags).  That's
   useful to prevent inadvertent access to a file: say you mmap a file
   read-only; if you try to write to any byte, you will get a SEGV signal
   sent to your program.  BTW, you can "trap" the SEGV signal using
   signal-related syscalls, and then handle such violations as needed.

3. You can let the OS map the file at any free mem addr that the OS chooses
   for you (the most common mode).  But you can also ask to map at a
   specific memory location: the OS will try to honor that request, and
   will return an err if it cannot.  This can be useful if you're creating
   mem segments that are a combination of different files or different
   parts of the same file, as well as for PROT_NONE (to create protection
   "redzones", as discussed further below).  Another reason to ask mmap for
   a specific addr is to use the mapping as "shared memory".

4. You can set various flags for how the mapping will behave.  One useful
   example is whether you ask for MAP_SHARED or MAP_PRIVATE.  This permits
   you to save mem by sharing the mapping w/ other processes, or to have a
   private mapping.  MAP_PRIVATE allows you to map read-only files into
   your addr space; if you try to modify the memory of those files, the OS
   will create a "private" copy of the modified pages (and will not modify
   the backing file) -- using a technique called "copy on write" or CoW.
   You can even map a file that you can only read but not write, say a
   system file: as long as you can read it, you can mmap it into your addr
   space.  You cannot modify the mem of such a mapping directly, b/c you
   don't have permission to write to that file.  But if you combine it with
   MAP_PRIVATE, mmap will use copy-on-write semantics inside the OS.  What
   that means is that the original file (and its pages) remains intact and
   unmodified; but once you modify a read-only, privately mapped page, the
   OS makes a COPY of that physical page in the page cache, and internally
   redirects your virt addr for that page to the copy: so now you can
   modify the page in memory as much as you want.

5. You don't even have to map a FILE to memory.  You can ask for an
   "anonymous mapping" (meaning there's no backing file).  That can
   essentially be used to allocate memory in your addr space, with whatever
   protections you want, and the OS will back it up with phys mem -- but
   note that none of that memory is preserved on disk.

Main disadvantage of mmap: you can only map and access memory in whole
units of page size (4KB), and mappings must be page-aligned.

* How to use mmap to protect buffers

Instead of malloc(), use mmap to create a page of anon memory, then map
PROT_NONE pages around it:

	void *ptr = mmap(NULL, 4096, PROT_READ|PROT_WRITE,
			 MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	// assume mmap returned the addr of page 17 (i.e., 17*4096)
	void *before = mmap((char *)ptr - 4096, 4096, PROT_NONE,
			    MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0);
	// careful: MAP_FIXED silently replaces any existing mapping there

Map a PROT_NONE page right after ptr's page the same way.  Now any access
just below or just above your page triggers a SEGV: a protection "redzone".
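Putting it together, here is a runnable sketch of the redzone idea.  (This
variant is an assumption on my part: it allocates three pages in one mmap
call and uses mprotect(2) to turn the outer two into PROT_NONE guards, which
avoids the MAP_FIXED clobbering risk noted above.)

	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	/* Allocate one usable page surrounded by two PROT_NONE redzones. */
	static void *alloc_guarded_page(size_t pagesz)
	{
		/* one anon R+W region of 3 pages: [guard][usable][guard] */
		char *base = mmap(NULL, 3 * pagesz, PROT_READ | PROT_WRITE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (base == MAP_FAILED)
			return NULL;
		/* revoke all access to the first and last pages */
		if (mprotect(base, pagesz, PROT_NONE) < 0 ||
		    mprotect(base + 2 * pagesz, pagesz, PROT_NONE) < 0) {
			munmap(base, 3 * pagesz);
			return NULL;
		}
		return base + pagesz;	/* the middle, accessible page */
	}

	int main(void)
	{
		size_t pagesz = sysconf(_SC_PAGESIZE);	/* usually 4096 */
		char *buf = alloc_guarded_page(pagesz);
		if (!buf)
			return 1;

		memset(buf, 'A', pagesz);	/* OK: within the usable page */
		printf("buf = %p, first byte = %c\n", (void *)buf, buf[0]);

		/* Either of these would deliver SIGSEGV immediately:
		 * buf[-1] = 'X';		-- underflow into the page before
		 * buf[pagesz] = 'X';	-- overflow into the page after
		 */

		munmap(buf - pagesz, 3 * pagesz);
		return 0;
	}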