* Reminder on a process address space

Assume a 32-bit CPU: it can address 2^32 bytes, or 4GB, of memory.  When the
OS sets up a process, it sets up segments with different protection levels:

	Addr		Segment
	----		-------
	0		TEXT: the binary executable (R+X protection only)
	after TEXT	STATICS: for const data and "strings" (read-only
			access)
	next		HEAP: malloc'd memory, allocated by asking the OS via
			brk/sbrk(2) to "extend" the breakpoint where the
			valid heap ends.  The heap starts at a fixed
			location, but can grow and shrink as needed.
	after HEAP	a BIG gap in the addr space: the OS maps nothing
			here.  If your prog touches any addr here, you get a
			SEGV (recall that the MMU checks all these addrs).
	near 4GB	STACK: starts at a very high addr (near 4GB) and
			grows downward, toward smaller mem addrs.  The stack
			can grow and shrink as needed; it is used each time
			you declare an automatic variable or call a function
			(which creates a stack frame).  Details of stack
			memory were discussed before.

Note: in many systems the hardware, at boot time, using the BIOS, will "map"
the memory of certain hardware devices to the highest mem addrs.  E.g., if
you have a video card with 512MB of onboard RAM, the BIOS may map the
upper-most 512MB of the addr space to the video card's memory.  That allows
the OS and user processes to "write" to video memory by simply addressing
memory locations.  But this can mask out actual physical DRAM you have in
that addr range (e.g., between 3.5 and 4GB of RAM).

All of the above segments are virtual addresses: each valid virt addr has to
be backed by a valid PHYSICAL page (4KB) in memory that the OS maps and
unmaps automatically.  Recall that if the OS unmaps a phys page from you,
it'll store the contents of that phys page in swap (an actual disk
partition, slow).

Q: are these page-aligned addrs?
A: Yes.  All pages, virt and phys, and all mem mgmt are handled in whole,
   aligned 4KB pages.  Recall that protection bits (PROT_READ, etc.) are
   set only on a per-page basis.
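To see this layout, a minimal sketch like the following prints one addr from
each segment.  (The exact values are hypothetical and vary per system; ASLR
may shift each region, but the relative order TEXT < STATICS < HEAP < STACK
should hold.)

	#include <stdio.h>
	#include <stdlib.h>

	const char *msg = "a read-only string";	/* STATICS (read-only) */

	int main(void)
	{
		int local = 42;			/* STACK */
		char *heapbuf = malloc(64);	/* HEAP */

		printf("TEXT    (main)   : %p\n", (void *)main);
		printf("STATICS (msg)    : %p\n", (void *)msg);
		printf("HEAP    (heapbuf): %p\n", (void *)heapbuf);
		printf("STACK   (local)  : %p\n", (void *)&local);

		free(heapbuf);
		return 0;
	}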
* What about the OS's own memory?

An OS is a "big program".  Its job is to manage all h/w resources and
protections, and in turn to run user programs.  An OS is complex: it has
lots of code, interrupt-driven parts, OO parts, many layers of abstraction,
and lots of concurrency.  But basically, the OS has a "program" (the kernel)
that runs on the CPU; it has a limited "stack", and many memory allocators
of various sorts, as well as a general "kmalloc" HEAP-based allocator.

The OS wants to serve programs/users, so the OS has to consume as few
resources as possible.  The OS kernel itself should be small; the OS stack,
heap, and other caches should also be small.  The more mem is left unused by
the OS, the more can be used by user programs (as backing memory for user
programs' virt addrs).

BUT: there's a big difference in speed b/t devices:

1. CPU speeds are in nanoseconds (e.g., accessing registers and L1/L2
   caches).
2. DRAM speeds are on the order of ~100 nanoseconds (loading chunks of code
   from RAM into CPU caches to execute).  Note: DRAM is roughly 100x
   *slower* than the CPU!  Any program that results in a lot of accesses to
   DRAM, constantly un/loading CPU caches, is called a "memory bound"
   program.
3. I/O device speeds are in milliseconds or even seconds, e.g., hard disks
   and networks.  So I/O devices are roughly 10^4 to 10^7 times slower than
   memory, or 10^6 to 10^9 times slower than the CPU.  IOW, you want to
   avoid using I/O devices as much as possible.  If your program constantly
   has to read/write from I/O devices, the program will be "I/O bound" and
   very slow.

Q: So why use I/O devices at all?
A: B/c they are big and cheap (per gig).  Generally, the faster a "memory
   storage" device is (e.g., registers, RAM, disk), the more expensive it
   is, and hence you can afford only a smaller amount of it.

B/c I/O devices are slower, the OS has to help programs run reasonably fast:
that means, cache as much as you can inside the OS.

CACHE: a copy of a subset of data that came from a "slow" place and is now
stored in a "faster" type of memory.  Caches work well b/c most access has
some sort of "temporal locality": if you accessed something recently, you'll
probably access it again in the near future.  That's why LRU (Least Recently
Used) works too.  Also, it turns out that while users may keep a lot of data
around, much of it is "cold" (hasn't been used in a while), and the most
commonly accessed recent data (i.e., "hot" data) is significantly smaller
than the total amount of data you've collected.

The OS tries to cache as much as it can, but the more (phys) mem it uses for
caches, the less mem is available to user processes.  An OS has multiple
caches, but one of the main and biggest caches the OS has is called the
"page cache".  It caches:

1. 4KB-aligned pages of TEXT segments of running programs
2. "statics" segments of programs
3. HEAP and STACK pages of programs
4. shared libraries that have been accessed by programs (e.g., libc.so)
5. files' data being accessed by you or any program

E.g., the OS loads the pages of libc.so only once into the page cache and
"shares" those pages with every program that needs them.  This saves a lot
of cache mem, and also speeds up execution of libc functions.

Part 5 is about accessing files.  Assume a program calls
open("/etc/passwd", ...) or opens "/var/www/index.html" for reading.  Those
are all files, and files come from slow I/O devices.  So the OS also wants
to cache those files in memory, so that the next time any program accesses
them, it'll be able to get a copy of the file directly from (faster) cache
memory.  The page cache is periodically cleaned using LRU or other
page-reclamation algorithms: the idea is to purge files that have been
cached but not used recently.

Assume a program does this:

1. open("/var/www/index.html", flags)
2. read(fd, buf, len)	// assume buf is malloc'ed and in the HEAP

The moment you open a file and then try to read it, the OS will load the
part of the file that you wanted into the page cache (i.e., part 5 above).
In fact, the OS will assume that you'll want to read the rest of the file,
and will go ahead and "pre-fetch" or "read-ahead" further parts of the file
into page cache mem.  This is a heuristic that assumes sequential access.

NOW: the OS has cached the file in the page cache (part 5), and has to
service your read() call.  You asked the OS to read the data of the file
into your OWN buffer "buf", which is located in your... HEAP (i.e., part 3
of the page cache).  So now the OS internally has to copy bytes from one
part of the page cache (part 5, where the file was cached) into another part
of the page cache (the phys page that backs your process's HEAP, where "buf"
is located).

Problems:

1. We have TWO copies of the same data in the page cache!  This wastes
   precious physical memory: any phys mem wasted is worse than wasting
   virtual mem.
2. We're wasting CPU cycles copying the same data from one buf to the other
   (usually done using the memcpy() function).

E.g., if you open and read() a 1GB file into some HEAP-based buffer, the OS
will have to cache twice as much, or 2GB of data.
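To make the double copy concrete, here is a minimal sketch of that exact
open()+read() sequence (the file name and buffer size are only examples):

	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	int main(void)
	{
		char *buf = malloc(4096);	/* user buffer in the HEAP */
		int fd = open("/etc/passwd", O_RDONLY);
		if (fd < 0 || !buf)
			return 1;
		/* The OS loads the file's pages into the page cache (copy
		 * #1), then copies the bytes into the phys page backing
		 * "buf" (copy #2). */
		ssize_t n = read(fd, buf, 4096);
		if (n > 0)
			printf("read %zd bytes; first byte: %c\n", n, buf[0]);
		close(fd);
		free(buf);
		return 0;
	}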
Q: How do we solve this problem of having two copies and wasting cycles?
A: Use a different interface, namely mmap(2).

Memory mapping (mmap) was created to avoid the above problem.  It keeps only
ONE copy of the file's data in the page cache, but gives a process direct
access to it.  Because mmap saves that extra memcpy(), it is sometimes
called a "zero copy" technique.

* How does mmap work?

Any program can open a file, get an fd, and then call mmap(2) to ask the OS
to "map" the file into the process's address space.  If successful, mmap
returns the start addr of the mapped file in your virtual addr space.
Usually that addr is in the "big" hole in your process's addr space.  Think
of mmap as a memory allocator:

	malloc: you ask to alloc N bytes, and get the addr of the first byte
	mmap:   you ask to map a file of size N, and get the addr of the
		first byte

	void *ptr;
	int fd = open(file, O_RDONLY);
	ptr = mmap(NULL, N, PROT_READ, MAP_PRIVATE, fd, 0);

If successful, you can now read the contents of "file" by merely
dereferencing bytes ptr[0], ptr[1], ..., all the way to ptr[N-1], where N is
the size of the file that you asked to map.

Inside the OS, the file is loaded into the page cache as usual, but then the
OS maps those pages DIRECTLY from the page cache (e.g., from part 5) into
the process's virt addr space.  That way, you don't need to malloc explicit
mem to hold the file: rather, you have direct access to the file's pages in
the page cache.

Often mmap is more convenient, b/c the file shows up as a large byte array.
Reading a byte of the mapping is just like doing read(2), but more efficient
and faster.  Modifying an mmap'd byte is just like changing the content of a
file using write(2): the changed buffers corresponding to the file in the
page cache are marked "dirty" by the OS, and the OS eventually flushes those
changes directly to the actual file on disk (a periodic process called
"dirty page flushing").

mmap(2) is often more efficient than read(2)/write(2), especially for large
files.  For that reason, many programs these days use mmap (some versions of
/bin/cp, for example).
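As a minimal sketch of the mmap path (same example file as above; error
handling kept short, and we assume a non-empty file):

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <sys/stat.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("/etc/passwd", O_RDONLY);
		struct stat st;
		if (fd < 0 || fstat(fd, &st) < 0)
			return 1;
		/* map the whole file, read-only: its bytes come straight
		 * from the page cache, with no second copy into a HEAP buf */
		char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE,
			       fd, 0);
		if (p == MAP_FAILED)
			return 1;
		printf("file is %lld bytes; first byte: %c\n",
		       (long long)st.st_size, p[0]);
		munmap(p, st.st_size);
		close(fd);	/* the mapping stays valid after close */
		return 0;
	}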
With mmap you can choose to do the following:

1. You can map a whole file, or just a part of a file starting at any
   (page-aligned) offset.

2. You can choose the type of protection to set on the mapping: read only,
   write only, read+write, or none (recall the PROT_* flags).  That's
   useful to prevent inadvertent access to a file: say you mmap a file
   read-only; if you try to write to any byte, you will get a SEGV signal
   sent to your program.  BTW, you can "trap" the SEGV signal using
   signal-related syscalls, and then handle such violations as needed.

3. You can let the OS map the file at any free mem addr that the OS chooses
   for you (the most common mode).  But you can also ask to map at a
   specific memory location: the OS will try to honor that request, and
   will return an err if it cannot.  This can be useful if you're creating
   mem segments that are a combination of different files or different
   parts of the same file, as well as for PROT_NONE (to create protection
   "redzones", as discussed further below).  Another reason to ask mmap for
   a specific addr is to use the mapping as "shared memory".

4. You can set various flags for how the mapping will behave.  One useful
   example is whether you ask for MAP_SHARED or MAP_PRIVATE.  This permits
   you to save mem by sharing the mapping w/ other processes, or to have a
   private mapping.  MAP_PRIVATE allows you to map read-only files into
   your addr space; if you try to modify the memory of those files, the OS
   will create a "private" copy of the modified pages (and will not modify
   the backing file) -- using a technique called "copy on write" or CoW.
   You can even map a file that you can only read but not write, say a
   system file: as long as you can read it, you can mmap it into your addr
   space.  You cannot modify the mem of such a mapping directly, b/c you
   don't have permission to write to that file.  But if you combine it with
   MAP_PRIVATE, mmap will use copy-on-write semantics inside the OS.  What
   that means is that the original file (and its pages) remains intact and
   unmodified; but once you modify a read-only, privately mapped page, the
   OS makes a COPY of that physical page in the page cache, and internally
   redirects your virt addr for that page to the copy: so now you can
   modify the page in memory as much as you want.

5. You don't even have to map a FILE to memory.  You can ask for an
   "anonymous mapping" (meaning there's no backing file).  That can
   essentially be used to allocate memory in your addr space, with whatever
   protections you want, and the OS will back it up with phys mem -- but
   note that none of that memory is preserved on disk.

Main disadvantage of mmap: you can only map and access memory in whole
units of page size (4KB), and mappings must be page-aligned.

* How to use mmap to protect buffers

Instead of malloc(), use mmap to create a page of anon memory, then map
PROT_NONE pages around it:

	void *ptr = mmap(NULL, 4096, PROT_READ|PROT_WRITE,
			 MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	// assume mmap returned the addr of page 17 (i.e., 17*4096)
	void *before = mmap((char *)ptr - 4096, 4096, PROT_NONE,
			    MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0);
	// careful: MAP_FIXED silently replaces any existing mapping there

Map a PROT_NONE page right after ptr's page the same way.  Now any access
just below or just above your page triggers a SEGV: a protection "redzone".
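Putting it together, here is a runnable sketch of the redzone idea.  (This
variant is an assumption on my part: it allocates three pages in one mmap
call and uses mprotect(2) to turn the outer two into PROT_NONE guards, which
avoids the MAP_FIXED clobbering risk noted above.)

	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	/* Allocate one usable page surrounded by two PROT_NONE redzones. */
	static void *alloc_guarded_page(size_t pagesz)
	{
		/* one anon R+W region of 3 pages: [guard][usable][guard] */
		char *base = mmap(NULL, 3 * pagesz, PROT_READ | PROT_WRITE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (base == MAP_FAILED)
			return NULL;
		/* revoke all access to the first and last pages */
		if (mprotect(base, pagesz, PROT_NONE) < 0 ||
		    mprotect(base + 2 * pagesz, pagesz, PROT_NONE) < 0) {
			munmap(base, 3 * pagesz);
			return NULL;
		}
		return base + pagesz;	/* the middle, accessible page */
	}

	int main(void)
	{
		size_t pagesz = sysconf(_SC_PAGESIZE);	/* usually 4096 */
		char *buf = alloc_guarded_page(pagesz);
		if (!buf)
			return 1;

		memset(buf, 'A', pagesz);	/* OK: within the usable page */
		printf("buf = %p, first byte = %c\n", (void *)buf, buf[0]);

		/* Either of these would deliver SIGSEGV immediately:
		 * buf[-1] = 'X';		-- underflow into the page before
		 * buf[pagesz] = 'X';	-- overflow into the page after
		 */

		munmap(buf - pagesz, 3 * pagesz);
		return 0;
	}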