* FILE SYSTEMS AND APIs

What is a file system?  What is a file?

Abstract model:
1. A file is a container of some data
2. A file has a name (a way to identify the "file")
3. A file has some meta-data (m-d)

File foo.c has a name "foo.c".  It has content you can read with /bin/cat or
read(2).  And it has some m-d about the file: file owner, size, timestamps,
etc.  You can get most m-d using syscalls like l/stat(2).

In most traditional "POSIX" (a standard for OSs) file systems, there are
several "types" of entities or objects a file system holds:

1. a "regular file" (has name, content you can read/write, and m-d)
2. a "directory": an index to names and IDs of other files
3. symbolic link: a reference to another pathname
4. character device, block device, UNIX pipe/FIFO/sockets

Regular files and dirs are most popular.

A file system is a collection of these entities, organized in some manner in
some "memory" or "storage" device.  Memory device can be persistent
(non-volatile), or non-persistent (volatile).  Examples:

1. NV devices preserve their content upon system reboot or power loss: hard
disk drive, USB thumb drive, flash disk, CDROM.

2. Volatile devices lose their content on system reboot: RAM and variants.

NV devices are most popular b/c people want their files to be preserved for
longer time periods.  Volatile devices are often used for temporary, small
file systems, where performance is key (e.g., RAMFS or TMPFS).

Note that two or more devices, of different types, can be combined together
in some manner in order to create a new "virtual" device that can be
configured to store a file system.

Most devices use a "block" interface.  The OS talks to the device through
this block interface; below it sits a bus-specific driver (SCSI, SATA, SAS,
etc.).  The block interface is very simple:

1. Each block has a fixed size.  Historically, 512B; more modern h/w uses
   4KB block sizes.  Blocks are sometimes called "sectors."
2. The OS can talk to the device in terms of block numbers: 0, 1, 2, ..., N-1.
3. A device has a max size of N blocks: that determines the maximum capacity
   of the device.
4. The OS can issue read or write requests to the device, identifying a
   specific block number to read or write.
5. OS can also issue a few special "control" operations to the device (TBD),
   but reads/writes are by far the most popular.

IOW, to the OS, the block device looks just like an "array" of N
sectors, each being of fixed size.
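
The "array of sectors" view can be sketched in C as below.  This is only an
illustration, assuming the device (or a disk-image file standing in for it)
is already open as a file descriptor; read_block() is a hypothetical helper
name, not an OS API.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read block number `lba` from an open device or
 * disk-image file.  Returns 0 on success, -1 on error or short read.
 */
int read_block(int fd, uint64_t lba, void *buf, size_t block_size)
{
	/* block k starts at byte offset k * block_size */
	off_t offset = (off_t)(lba * block_size);
	ssize_t n = pread(fd, buf, block_size, offset);

	return (n == (ssize_t)block_size) ? 0 : -1;
}
```

Asking for a block at or past the end of the device yields a short read
here, so the helper reports -1.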

A file system as discussed above is a "collection of files" on some media
(persistent or non-persistent media).  Another interpretation is that a file
system is some piece of software (an OS layer) that exports an API to users
for reading/writing and managing POSIX files, but internally the file system
manages the data on a specific media device.

How does a file system (f/s) use such a device?  The f/s decides which parts
of the device store which type of information.  Before you can use a
device for some f/s, you have to prepare the device: this is called
"formatting" the device.  In Unix, you use "mkfs":

# mkfs -t ext4 /dev/sdb3

Above says: prepare partition 3, of SCSI device 'b' (meaning the second SCSI
device the system knows about), for use with the Ext4 file system.

Mkfs will format the device and perform certain actions:

1. Decide what parts of the device would hold regular file data.
2. Decide what parts would hold directories (file names).
3. Decide where to store m-d of files (e.g., inodes).
4. Decide where to store high-level, whole file system information, also
   called a "superblock".

Often, the space reserved for each kind of information is fixed in
size, and has a known location on the device.  e.g., the last 10% of the
device would be for m-d, and 50-90% of the device's blocks (by number),
would be for storing data blocks.

The block number, or sector number, is also often called a Logical Block
Address (LBA) or Logical Block Number (LBN).

The superblock often lives in the very start of the device, in a known
location, say LBA 0-9: and it stores the following:

1. the start and end LBAs for each section of the file system: where the
   m-d, directories, and file data starts/ends.
2. Stats and accounting about the entire file system: how many blocks,
   directories, and m-d units have been used; how many are free; WHERE the
   used and free ones are; etc.  This is sometimes called a "free map."
3. Depending on the f/s may also store: copies of the actual superblock, in
   known locations (e.g., every 1/4 of the max LBA numbers).  The superblock
   copy is useful in case the first copy (e.g., LBA 0-9) gets corrupted
   somehow (e.g., media errors, wear and tear).

Q: how does OS know the f/s format for any given device?

A1: at the very start of every device, there are often 1-2 sectors reserved
for a "partition table": it lists how many partitions there are, and for
each one, the start/end LBA, as well as a "type" (0..255).  There are
reserved partition types for various OSs: Windows, Linux, etc.  There are
also some reserved f/s types.

A2: each file system also designates a unique fingerprint, called a "magic
number".  This number is hard-coded in the f/s code of the OS, and each f/s
is expected to use a number that no other f/s uses.  Even different OS
vendors try to coordinate it (Linux, Microsoft, and Apple).  The superblock
of each f/s records this magic number (a 32-64 bit number).  The mkfs
utility writes that
number into the superblock.  When an OS starts and you ask it to use a
previously formatted f/s (a process called "mounting" a file system volume),
the OS f/s code reads the magic number, and ensures it matches its own
number.  This prevents you from, say, mounting a partition formatted with
NTFS as Ext4: that won't work and if it did, would seriously corrupt the
NTFS data.
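
A minimal sketch of the magic-number check, for an imaginary file system.
Everything named here (toyfs, TOYFS_MAGIC, the struct layout) is made up
for illustration and does not match any real on-disk format; it also
assumes the superblock was written in the host's byte order.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical magic value: not the magic number of any real f/s. */
#define TOYFS_MAGIC 0x544F5946u

/* Hypothetical superblock layout for an imaginary "toyfs". */
struct toyfs_superblock {
	uint32_t magic;        /* written once by mkfs */
	uint32_t block_size;   /* bytes per block */
	uint64_t total_blocks; /* capacity of the volume, in blocks */
};

/*
 * Return 1 if the raw bytes of the first block carry our magic number,
 * 0 otherwise.  A mount routine would refuse the volume when this fails.
 */
int toyfs_magic_ok(const void *first_block)
{
	struct toyfs_superblock sb;

	/* copy out of the raw buffer to avoid unaligned access */
	memcpy(&sb, first_block, sizeof(sb));
	return sb.magic == TOYFS_MAGIC;
}
```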

The m-d of a file is also called the "inode" of the file.  Many inodes can
be stored together in a f/s, in the m-d region called the "inode block."
What's in an inode?  You can find out by stat(2)-ing any file:

	struct stat {
	       dev_t	 st_dev;	 /* ID of device containing file */
	       ino_t	 st_ino;	 /* Inode number */
	       mode_t	 st_mode;	 /* File type and mode */
	       nlink_t	 st_nlink;	 /* Number of hard links */
	       uid_t	 st_uid;	 /* User ID of owner */
	       gid_t	 st_gid;	 /* Group ID of owner */
	       dev_t	 st_rdev;	 /* Device ID (if special file) */
	       off_t	 st_size;	 /* Total size, in bytes */
	       blksize_t st_blksize;	 /* Block size for filesystem I/O */
	       blkcnt_t  st_blocks;	 /* Number of 512B blocks allocated */
	       struct timespec st_atim;	 /* Time of last access */
	       struct timespec st_mtim;	 /* Time of last modification */
	       struct timespec st_ctim;	 /* Time of last status change */
	};

st_ino: each entity like a file in a f/s, has a unique inode number.  Must
be unique within that f/s -- not across other file systems.  Usually inums
start at 1, 2, 3, ... up to some MAX_INUM that is pre-determined at mkfs
time.  Note that a file's name is NOT unique (TBD), but the inum must be
unique.

st_mode: first few bits encode the type of entity this inode is (regular
file, directory, symlink, blk/chr device, etc.).  Remaining bits encode the
protection mode of the file.  See man page for stat(2) for how you can use
simple macros to determine if an st_mode field designates a regular file,
directory, etc.  e.g.,

int i = st.st_mode & S_IFMT; // use "mask" to get the "format" bits exposed
if (i == S_IFREG) { /* regular file */ }
else if (i == S_IFDIR) { /* directory type */ }
The POSIX model has 3 kinds of permissions: Read, Write, and eXecute on an
inode.  And also 3 types of "users": the owner of the inode; members of the
group; and "all others".  A total of 9 permission bits.  You can set those
bits using chmod(2) or /bin/chmod -- Change Mode.

st_nlink: for hard links (demoed).  Every time you create a hard link(2) to
a file, st_nlink goes up by one; every time you unlink/delete a name,
st_nlink goes down by one.  This is called a "reference counter".  When
refcount reaches 0, it means the inode has no more "names" (last user
deleted the file name): only NOW (and once no process still has the file
open) can the f/s actually delete the inode struct and all associated file
data.
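
A small sketch for watching the reference counter from user space.
link_count() is a hypothetical helper name; it only wraps stat(2).

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/stat.h>

/* Hypothetical helper: return st_nlink for `path`, or -1 if stat(2) fails. */
long link_count(const char *path)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return -1;
	return (long)st.st_nlink;
}
```

For a freshly created regular file this is expected to return 1; after
link(path, alias) it should return 2 for both names, and after
unlink(alias) it should drop back to 1.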

st_uid: the numeric ID of the user who owns this inode.  Each user has a
name (may not be unique) but must have a unique user ID (uid).

st_gid: the number of the group this inode belongs to.  A user can belong to
multiple groups, defined in /etc/group and /etc/gshadow.

You can find your own unique ID and what groups you belong to using
/usr/bin/id.  The first group listed is also the "primary" or default group
you belong to.

Timestamps:
st_atim(e): last access (read) of the file's content (data)
st_mtim(e): last modification of the file (written data)
st_ctim(e): last status change of the inode itself.

ctime is confusing: it does NOT stand for "(C)reation" time.  It changes
when certain parts of the file's inode change: namely, uid, gid, file size,
mode.  Some OSs (e.g., Mac OS X) have an immutable "Birth time" field in
struct stat: set at file creation time, cannot be changed by users.

You can create regular files using creat(2) and open(2) with O_CREAT flag.
You can delete regular files using unlink(2).  You can get content of those
files or modify content using read(2) and write(2).
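
The four calls above can be strung together as in this sketch.  roundtrip()
is a hypothetical helper name, and error handling is kept to a minimum.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create (or truncate) `path`, write `len` bytes,
 * read them back into `out`, then delete the file.  Returns the number
 * of bytes read back, or -1 on error.
 */
ssize_t roundtrip(const char *path, const void *data, size_t len, void *out)
{
	ssize_t n = -1;
	int fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0600);

	if (fd < 0)
		return -1;
	if (write(fd, data, len) == (ssize_t)len &&
	    lseek(fd, 0, SEEK_SET) == 0)	/* rewind before reading */
		n = read(fd, out, len);
	close(fd);
	unlink(path);	/* removes the only name of this file */
	return n;
}
```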

* What's a directory?

A directory is essentially a table, often only 2 columns:

1. A name of an entity within THIS directory.
2. The inode number (inum) of that entity.
3. optional: sometime the type of entity is also recorded.

What's a file name vs. (full) pathname?

Example of a Pathname is "/home/jdoe/src/foo.c".  A pathname includes a name
delimiter, like "/" in Unix or "\" (backslash) in Windows.  A pathname is
"absolute" if it starts with a "/" (the root directory); else, we call the
pathname "relative".  The above pathname has four directories, each
containing at least one name:

1. Top, root dir is "/"
2. "/" is a f/s entity of type "DIR" (meaning st_mode would record it as
   type "DIR", not "REG" for regular files).
3. The "/" dir has at least one name called "home"
4. "home" is also a directory (type "DIR")
5. "home" contains at least one other name, called "jdoe"
6. "jdoe" is also a dir, contains a name "src"
7. "src" is also a dir, contains a name "foo.c"
8. "foo.c" is a regular file "REG" (can be another dir, symlink, etc.)

A regular file in Unix has no OS-imposed structure: OS just sees it as a
sequence of bytes that can be ANYTHING.  Applications can impose structure:
gcc imposes a structure called "C source file"; a database like MySQL will
expect different content; a mailbox for a user.

A directory in Unix has a specific structure imposed by the OS.  You can
find this structure in "struct dirent" or "struct linux_dirent".  See
manpages for getdents(2) (get directory entries syscall).  Example:

struct dirent {			// simplified; real definitions have more fields
	int d_inum;		// for inode number
	char name[256];		// name of the file: up to 255 bytes + '\0'
	// possibly other stuff
};

On Linux: a file name is limited to 255 bytes (NAME_MAX); a full pathname is
limited to 4096 bytes (PATH_MAX).  POSIX itself only mandates smaller
minimums for these limits.  A file name in Unix cannot include '/' or a \0
(null).
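
These naming rules can be expressed as a small check.  This is a sketch
only: valid_file_name() and TOY_NAME_MAX are hypothetical names, and the
check covers a single file name, not a full pathname.

```c
#include <string.h>

#define TOY_NAME_MAX 255	/* hypothetical constant: the 255-byte limit */

/*
 * Hypothetical helper: return 1 if `name` could be a single directory
 * entry name (non-empty, at most 255 bytes, no '/'), else 0.  A C string
 * cannot hold '\0' before its end, so that rule holds by construction.
 */
int valid_file_name(const char *name)
{
	size_t len = strlen(name);

	if (len == 0 || len > TOY_NAME_MAX)
		return 0;
	return strchr(name, '/') == NULL;
}
```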

So a dir will be a bunch of concatenated struct dirent's one after the
other.  Normally, you cannot read the raw bytes of a directory directly (you
get an error EISDIR): you need to use special system calls to control a
directory's contents:

1. mkdir(2): to create a new directory
- creat(2) and open(2) with O_CREAT will also create a new struct dirent
2. rmdir(2): delete a directory
- unlink(2): remove a regular file, symlink, or other entity
3. other system calls that "modify" a directory indirectly include:
- rename(2), link(2), symlink(2), others

4. getdents(2): to "read" one or more struct dirents.  You can open a
   directory (typically with open(2) with O_DIRECTORY flag), get a fd, then
   pass it to getdents with a buffer to "fill in" with as many struct
   dirents as can fit.  You can keep calling getdents until there's no more
   data left in the directory (EOF).

5. readdir(2): another (older) form of reading directory contents.

* What happens when you say

$ cat /home/jdoe/src/foo.c

Or when your code does open("/home/jdoe/src/foo.c", flags such as O_RDONLY)?

The OS already knows the inode number of "/", init'd when OS starts.  The OS
then performs a function called "lookup" (other names include "pathname
lookup" or the "namei" routine).  The steps are:

1. get inum of "/"
2. issue an internal OS lookup for "home" inside the inode of "/"
- OS has to call the underlying f/s and ask for the content of directory
  "/", given its inode.  Lower level f/s will get this dir's content and
  bring it up to higher layers of the OS (where syscalls are typically
  executed).
- the OS then searches through the blocks of data retrieved from the device
  by the f/s formatted on that device, looking for a match to the string
  "home".  If not found, return ENOENT err.

- if found, then get the dirent struct that matched "home", get the inum
  from it (d_inum above), this is the inum for "home".
- next, OS asks the f/s to get the contents of the directory whose inum
  matches "home". Get that content, search for a string matching "jdoe".

- this sequence of lookup steps repeats for each '/' delimited pathname
  component, until the last one.  At the end...

- OS will lookup the name "foo.c" in the directory corresponding to "src".
  If found, will retrieve the blocks (bytes) of the regular file whose
  inode number corresponds to "foo.c".

- now /bin/cat or read(2) can actually display the content of the file foo.c
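
The lookup steps above can be sketched over a toy, in-memory namespace.
Everything here is hypothetical (the toy_* names, the inode numbers, the
flat table standing in for on-disk directories); a real OS also checks
entity types and permissions at each step, which this sketch skips.

```c
#include <errno.h>
#include <string.h>

/* One (directory, name) -> inum mapping in the toy namespace. */
struct toy_dirent {
	int dir_inum;		/* directory this entry lives in */
	int d_inum;		/* inode number the name refers to */
	const char *name;
};

#define TOY_ROOT_INUM 2		/* hypothetical inum of "/" */

static const struct toy_dirent toy_fs[] = {
	{ 2,  11, "home"  },
	{ 11, 12, "jdoe"  },
	{ 12, 13, "src"   },
	{ 13, 14, "foo.c" },
};

/* Search one directory for one name; return its inum or -ENOENT. */
static int toy_lookup_one(int dir_inum, const char *name, size_t len)
{
	size_t i;

	for (i = 0; i < sizeof(toy_fs) / sizeof(toy_fs[0]); i++) {
		if (toy_fs[i].dir_inum == dir_inum &&
		    strlen(toy_fs[i].name) == len &&
		    memcmp(toy_fs[i].name, name, len) == 0)
			return toy_fs[i].d_inum;
	}
	return -ENOENT;
}

/* Resolve an absolute pathname, one '/'-delimited component at a time. */
int toy_namei(const char *path)
{
	int inum = TOY_ROOT_INUM;

	if (path[0] != '/')
		return -EINVAL;	/* this sketch handles absolute paths only */
	while (*path != '\0') {
		size_t len;

		while (*path == '/')	/* skip the delimiter(s) */
			path++;
		len = strcspn(path, "/");
		if (len == 0)
			break;		/* end of path or trailing '/' */
		inum = toy_lookup_one(inum, path, len);
		if (inum < 0)
			return inum;	/* component not found */
		path += len;
	}
	return inum;
}
```

With the table above, toy_namei("/home/jdoe/src/foo.c") walks through inums
2, 11, 12, 13 and returns 14, while a missing component returns -ENOENT.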

What happens when you try to creat(2) a new file "bar.c" in /home/jdoe/src/?

The OS allocates a new inode to hold the new file bar.c.  But it also adds a
new struct dirent to the content of the inode for the directory "src":
recording the tuple "bar.c" + the new file's inum.

Important: the name of the file isn't inside its inode.  It's somewhat
external to the inodes and data.  We call this "layer" of names that
connects files and dirs together the "namespace" of the file system.

Note: each time you pass a pathname to a syscall, the OS has to "parse" the
pathname based on its delimiter (e.g., "/"), then perform a lookup on each
pathname component, and check permissions on each component.  This can be
slow.  For that reason, OSs cache directory entries and inodes extensively,
in complex data structures (Linux calls it the Dcache, BSD/Solaris call it
the DNLC -- Directory Name Lookup Cache).

Even with caching, the OS still "wastes" CPU and memory to search for
objects corresponding to each pathname component in its own caches.  For
that reason, a whole set of newer system calls lets you resolve a name
relative to an already-open directory fd, so the OS need not repeat the
lookup of the leading pathname components.  Most of these syscalls have an
"at" in their name, e.g., openat(2).
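
A minimal sketch of openat(2) in use.  open_in_dir() is a hypothetical
helper name.  For brevity it opens and closes the directory on every call;
a real program would more likely keep the directory fd open and reuse it
for many openat() calls, which is where the saved lookups come from.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: open `name` relative to directory `dirpath`.
 * `flags` must not include O_CREAT (no mode argument is passed).
 * Returns a file descriptor, or -1 on error.
 */
int open_in_dir(const char *dirpath, const char *name, int flags)
{
	int fd;
	int dirfd = open(dirpath, O_RDONLY | O_DIRECTORY);

	if (dirfd < 0)
		return -1;
	/* lookup of `name` starts at dirfd, not at "/" or the cwd */
	fd = openat(dirfd, name, flags);
	close(dirfd);
	return fd;
}
```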

* CONT open(2) flags

Job control demos: fg, bg, &, nohup, ^z *suspend*
pgid, process groups?