* FILE SYSTEMS AND APIs

What is a file system?  What is a file?

Abstract model:
1. A file is a container of some data
2. A file has a name (a way to identify the "file")
3. A file has some meta-data (m-d)

File foo.c has a name "foo.c".  It has content you can read with /bin/cat
or read(2).  And it has some m-d about the file: file owner, size,
timestamps, etc.  You can get most m-d using syscalls like l/stat(2).

In most traditional "POSIX" (a standard for OSs) file systems, there are
several "types" of entities or objects a file system holds:
1. a "regular file" (has name, content you can read/write, and m-d)
2. a "directory": an index to names and IDs of other files
3. symbolic link: a reference to another pathname
4. character device, block device, UNIX pipe/FIFO/sockets

Regular files and dirs are most popular.

A file system is a collection of these entities, organized in some manner
in some "memory" or "storage" device.

A memory device can be persistent (non-volatile, NV), or non-persistent
(volatile).  Examples:
1. NV devices preserve their content upon system reboot or power loss:
   hard disk drive, USB thumb drive, flash disk, CDROM.
2. Volatile devices lose their content on system reboot: RAM and variants.

NV devices are most popular b/c people want their files to be preserved
for longer time periods.  Volatile devices are often used for temporary,
small file systems, where performance is key (e.g., RAMFS or TMPFS).

Note that two or more devices, of different types, can be combined
together in some manner in order to create a new "virtual" device that can
be configured to store a file system.

Most devices use a "block" interface.  The OS talks to the device using
the block interface; below it there's perhaps a SCSI, SATA, SAS, etc.
driver.  The block interface is very simple:
1. Each block has a fixed size.  Historically, 512B; more modern h/w uses
   4KB block sizes.  Blocks are sometimes called "sectors."
2.
   The OS can talk to the device in terms of block numbers: 0, 1, 2, ...,
   N-1.
3. A device has a max size of N blocks: that determines the maximum
   capacity of the device.
4. The OS can issue read or write requests to the device, identifying a
   specific block number to read or write.
5. OS can also issue a few special "control" operations to the device
   (TBD), but reads/writes are by far the most popular.

IOW, to the OS, the block device looks just like an "array" of N sectors,
each being of fixed size.

A file system as discussed above is a "collection of files" on some media
(persistent or non-persistent media).  Another interpretation is that a
file system is some piece of software (an OS layer) that exports an API to
users for reading/writing and managing POSIX files, but internally the
file system manages the data on a specific media device.

How does a file system (f/s) use such a device?  The f/s decides which
parts of the device store which type of information.  Before you can use a
device for some f/s, you have to prepare the device: this is called
"formatting" the device.  In Unix, you use "mkfs":

    # mkfs -t ext4 /dev/sdb3

Above says: prepare partition 3, of SCSI device 'b' (meaning second scsi
dev system knows about), for use with the Ext4 file system.

Mkfs will format the device and perform certain actions:
1. Decide what parts of the device would hold regular file data.
2. Decide what parts would hold directories (file names).
3. Decide where to store m-d of files (e.g., inodes).
4. Decide where to store high-level, whole file system information, also
   called a "superblock".

Often, the space reserved for different kinds of information is fixed in
size, and has known locations in the device.  E.g., the last 10% of the
device would be for m-d, and 50-90% of the device's blocks (by number)
would be for storing data blocks.

The block number, or sector number, is also often called a Logical Block
Address (LBA) or Logical Block Number (LBN).

The superblock often lives at the very start of the device, in a known
location, say LBA 0-9, and it stores the following:
1. The start and end LBAs for each section of the file system: where the
   m-d, directories, and file data start/end.
2. Stats and accounting about the entire file system: how many blocks,
   directories, and m-d units have been used; how many are free; WHERE the
   used and free ones are; etc.  This is sometimes called a "free map."
3. Depending on the f/s, it may also store: copies of the actual
   superblock, in known locations (e.g., every 1/4 of the max LBA
   numbers).  The superblock copy is useful in case the first copy (e.g.,
   LBA 0-9) gets corrupted somehow (e.g., media errors, wear and tear).

Q: how does the OS know the f/s format for any given device?

A1: at the very start of every device, there are often 1-2 sectors
reserved for a "partition table": it lists how many partitions there are,
and for each one, the start/end LBA, as well as a "type" (0..255).  There
are reserved partition types for various OSs: Windows, Linux, etc.  There
are also some reserved f/s types.

A2: each file system also designates a unique fingerprint, called a "magic
number".  This number is hard-coded in the f/s code and the OS, and each
OS ensures that each of its file systems has a unique number.  Even
different OS vendors try to coordinate it (Linux, Microsoft, and Apple).
The superblock of each f/s records this magic number (a 32-64 bit number).
The mkfs utility writes that number into the superblock.  When an OS
starts and you ask it to use a previously formatted f/s (a process called
"mounting" a file system volume), the OS f/s code reads the magic number
and ensures it matches its own number.  This prevents you from, say,
mounting a partition formatted with NTFS as Ext4: that won't work and, if
it did, would seriously corrupt the NTFS data.

The m-d of a file is also called the "inode" of the file.  Many inodes can
be stored together in a f/s, in the m-d region called the "inode block."

What's in an inode?  You can find out by stat(2)-ing any file:

    struct stat {
        dev_t     st_dev;     /* ID of device containing file */
        ino_t     st_ino;     /* Inode number */
        mode_t    st_mode;    /* File type and mode */
        nlink_t   st_nlink;   /* Number of hard links */
        uid_t     st_uid;     /* User ID of owner */
        gid_t     st_gid;     /* Group ID of owner */
        dev_t     st_rdev;    /* Device ID (if special file) */
        off_t     st_size;    /* Total size, in bytes */
        blksize_t st_blksize; /* Block size for filesystem I/O */
        blkcnt_t  st_blocks;  /* Number of 512B blocks allocated */
        struct timespec st_atim; /* Time of last access */
        struct timespec st_mtim; /* Time of last modification */
        struct timespec st_ctim; /* Time of last status change */
    };

st_ino: each entity like a file in a f/s has a unique inode number.  Must
be unique within that f/s -- not across other file systems.  Usually inums
start at 1, 2, 3, ... up to some MAX_INUM that is pre-determined at mkfs
time.  Note that a file's name is NOT unique (TBD), but the inum must be
unique.

st_mode: first few bits encode the type of entity this inode is (regular
file, directory, symlink, blk/chr device, etc.).  Remaining bits encode
the protection mode of the file.  See man page for stat(2) for how you can
use simple macros to determine if an st_mode field designates a regular
file, directory, etc.  E.g.,

    int i = st.st_mode & S_IFMT; // use "mask" to get the "format" bits exposed
    if (i == S_IFREG) // regular file
    if (i == S_IFDIR) // directory type

The POSIX model has 3 kinds of permissions: Read, Write, and eXecute on an
inode.  And also 3 types of "users": the owner of the inode; members of
the group; and "all others".  A total of 9 permission bits.  You can set
those bits using chmod(2) or /bin/chmod -- Change Mode.

st_nlink: for hard links (demoed).  Every time you create a hard link(2)
to a file, st_nlink goes up by one; every time you unlink/delete a name,
st_nlink goes down by one.  This is called a "reference counter".
When refcount reaches 0, it means the inode has no more "names" (last user
deleted the file name): only NOW can you actually delete the inode struct
and all associated file data.

st_uid: the number representing this user.  Each user has a name (may not
be unique) but must have a unique user ID (uid).

st_gid: the number of the group this inode belongs to.  A user can belong
to multiple groups, defined in /etc/group and /etc/gshadow.  You can find
your own unique ID and what groups you belong to using /usr/bin/id.  The
first group listed is also the "primary" or default group you belong to.

Timestamps:
st_atim(e): last access (read) of the file's content (data)
st_mtim(e): last modification of the file (written data)
st_ctim(e): last status change of the inode itself.

ctime is confusing: it does NOT stand for "(C)reation" time.  It changes
when certain parts of the file's inode change: e.g., uid, gid, mode, link
count, file size.

Some OSs (e.g., Mac OS X) have an immutable "Birth time" field in struct
stat: set at file creation time, cannot be changed by users.

You can create regular files using creat(2) and open(2) with O_CREAT flag.
You can delete regular files using unlink(2).  You can get content of
those files or modify content using read(2) and write(2).

* What's a directory?

A directory is essentially a table, often only 2 columns:
1. A name of an entity within THIS directory.
2. The inode number (inum) of that entity.
3. optional: sometimes the type of entity is also recorded.

What's a file name vs. (full) pathname?

Example of a pathname is "/home/jdoe/src/foo.c".  A pathname often
includes a name delimiter, like "/" in Unix or "\" (backslash) in Windows.
A pathname is "absolute" if it starts with a "/" (the root directory);
else, we call the pathname "relative".

The above pathname has four directories, each containing at least one
name:
1. Top, root dir is "/"
2. "/" is a f/s entity of type "DIR" (meaning st_mode would record it as
   type "DIR", not "REG" for regular files).
3.
   The "/" dir has at least one name called "home"
4. "home" is also a directory (type "DIR")
5. "home" contains at least one other name, called "jdoe"
6. "jdoe" is also a dir, contains a name "src"
7. "src" is also a dir, contains a name "foo.c"
8. "foo.c" is a regular file "REG" (can be another dir, symlink, etc.)

A regular file in Unix has no OS-imposed structure: OS just sees it as a
sequence of bytes that can be ANYTHING.  Applications can impose
structure: gcc imposes a structure called "C source file"; a database like
MySQL will expect different content; a mail program expects a user's
mailbox format.

A directory in Unix has a specific structure imposed by the OS.  You can
find this structure in "struct dirent" or "struct linux_dirent".  See
manpages for getdents(2) (get directory entries syscall).  Example:

    struct dirent {
        int d_inum;      // for inode number
        char name[256];  // name of the file (up to 255 bytes + '\0')
        // possibly other stuff
    };

POSIX defines the limits NAME_MAX and PATH_MAX; on Linux, a file name is
limited to 255 bytes and a full pathname to 4096 bytes.  A file name in
Unix cannot include '/' or a \0 (null).

So a dir will be a bunch of concatenated struct dirent's one after the
other.  Normally, you cannot read the raw bytes of a directory directly
(you get an error EISDIR): you need to use special system calls to control
a directory's contents:

1. mkdir(2): to create a new directory
   - creat(2) and open(2) with O_CREAT will also create a new struct
     dirent
2. rmdir(2): delete a directory
   - unlink(2): remove a regular file, symlink, or other entity
3. other system calls that "modify" a directory indirectly include:
   - rename(2), link(2), symlink(2), others
4. getdents(2): to "read" one or more struct dirents.  You can open a
   directory (typically with open(2) with O_DIRECTORY flag), get a fd,
   then pass it to getdents with a buffer to "fill in" with as many struct
   dirents as can fit.  You can keep calling getdents until there's no
   more data left in the directory (EOF).
5. readdir(2): another (older) form of reading directory contents.

* What happens when you say

    $ cat /home/jdoe/src/foo.c

Or when your code does

    open("/home/jdoe/src/foo.c", flags such as O_RDONLY)

The OS already knows the inode number of "/", init'd when OS starts.  The
OS then performs a function called "lookup" (other names include "pathname
lookup" or the "namei" routine).  The steps are:

1. get inum of "/"
2. issue an internal OS lookup for "home" inside the inode of "/"
   - OS has to call the underlying f/s and ask for the content of
     directory "/", given its inode.  Lower level f/s will get this dir's
     content and bring it up to higher layers of the OS (where syscalls
     are typically executed).
   - the OS then searches through the blocks of data retrieved from the
     device by the f/s formatted on that device, looking for a match to
     the string "home".  If not found, return ENOENT err.
   - if found, then get the dirent struct that matched "home", get the
     inum from it (d_inum above); this is the inum for "home".
   - next, OS asks the f/s to get the contents of the directory whose
     inum matches "home".  Get that content, search for a string matching
     "jdoe".
   - this sequence of lookup steps repeats for each '/' delimited pathname
     component, until the last one.  At the end...
   - OS will lookup the name "foo.c" in the directory corresponding to
     "src".  If found, will retrieve the blocks (bytes) of the regular
     file whose inode number corresponds to "foo.c".
   - now /bin/cat or read(2) can actually display the content of the file
     foo.c

What happens when you try to creat(2) a new file "bar.c" in
/home/jdoe/src/?  The OS allocates a new inode to hold the new file bar.c.
But it also adds a new struct dirent to the content of the inode for the
directory "src": recording the tuple "bar.c" + the new file's inum.

Important: the name of the file isn't inside its inode.  It's somewhat
external to the inodes and data.  We call this "layer" of names that
connects files and dirs together the "namespace" of the file system.

Note: each time you pass a pathname to a syscall, the OS has to "parse"
the pathname based on its delimiter (e.g., "/"), then perform a lookup on
each pathname component, and check permissions on each component.  This
can be slow.  For that reason, OSs cache directory entries and inodes
extensively, in complex data structures (Linux calls it the Dcache,
BSD/Solaris calls it the DNLC -- Directory Name Lookup Cache).  Even with
caching, the OS still "wastes" CPU and memory to search for objects
corresponding to each pathname component in its own caches.  For that
reason, a whole set of new system calls was created to bypass the pathname
lookups.  Most of these syscalls have an "at" in their name, e.g.,
openat(2).

* CONT

open(2) flags
Job control demos: fg, bg, &, nohup, ^z *suspend*
pgid, process groups?