* Lookups, reminder

Every system call that takes a string file/pathname (char *) requires the OS to parse the pathname on a delimiter such as '/', then perform an internal "lookup" for that pathname: stat, open, rmdir, mkdir, unlink, rename, etc. Very important b/c lots of syscalls pass pathnames. This can be slow and requires a good OS cache for directory entries and inodes. For that reason, the *at(2) syscalls were created, to execute more efficiently.

* Absolute vs. Relative Pathnames

Absolute: starts with a "/". Relative pathname: does NOT start with a "/".

Q: relative to what?! What about "cat foo.c" or open("foo.c")? In that case, the OS needs a "reference" directory to know where to look up "foo.c": every lookup happens in the context of a directory in which you are looking up a name. The dir to look inside for "foo.c" defaults to the Current Working Directory (CWD) of the process that issued the syscall. That information is stored inside the OS, as part of a task (or process) structure that the OS maintains on behalf of every running process (including a shell). "struct task" is one of the biggest and most complex structs in the OS. For example, it could have this field:

struct task {
	...
	char *cwd;	// current working dir
};

Note: above, cwd is shown as a "char *": in practice it'll be a pointer to an OS-internal structure representing the CWD. In Linux, it's "struct dentry *cwd".

So, when the OS has to perform a lookup for a relative pathname, it starts from the task struct's ->cwd field inside the OS.

chdir(2): sets the task's cwd field (command is 'cd')
getcwd(2): gets the content of the cwd field from the OS (cmd is 'pwd')

Note: chdir(2) can only change the cwd of the current running process, not another process's. That's why shells must implement 'cd' as an internal (built-in) command, not one you'd fork+exec.
The /bin/login program that lets you log in authenticates your userid+password (e.g., gets it from /etc/passwd or LDAP); it then forks, sets the cwd to your home dir -- chdir("/home/jdoe") -- and then exec's your preferred shell (e.g., /bin/bash).

* other special directory names

If you enumerate all names inside any dir, say using getdents(2), you will always see the following two special names:

".": the "dot" directory: another name for the directory itself (so, for relative lookups, the CWD).
"..": the "dotdot" or parent directory.

They always exist when you create a new directory. You can use them to refer to other pathnames relative to yours:

$ ls ../tmp	# list the contents of the sister dir to where I am now
$ ./a.out	# execute a binary named a.out in the current dir

Pathnames can include as many "." and ".." components as you want: you can ascend the parent hierarchy up to root ("/") and then descend into any other subdir from where you are. Also, multiple "." components are no-ops: "foo/./././bar" is the same as "foo/bar". Also, note that "/.." is the same as "/". That is, while every dir has a parent, the global root "/" dir is its own parent.

* hidden names in Unix

By default "ls" doesn't show any file that starts with a "." such as ".bashrc". This is a convention for user-level tools, not an OS restriction. If you want to see such files in Unix, you say "ls -a" ('a' for all). But note that getdents returns ALL names back to the user process, such as "ls": ls is the one that doesn't show dot-files by default. In Windows, conversely, the "hidden" bit for files is an OS-level feature.

Try the system call tracing tool (strace) to better understand how various programs work:

$ strace /bin/ls
$ strace /bin/ls -a

* What is a hard link?

A hard link is another name in some dir -- another struct dirent -- that points to the same inode. Example:

Dirent table:
NAME	INODE No.
.	10
..	17
foo.c	23
a.out	100
bar.c	23

In the above example, "foo.c" and "bar.c" are hard-linked files, pointing to the same inode number 23.
A hard link is an alias to the same unique inum. Analogy: every person has a unique social security number, but a person can have multiple names and nicknames.

The above example shows a hard link within the same directory. A hard link can exist in any directory, but only within the same file system: that's b/c it points to a specific inode number, and inode numbers are unique only within a given f/s. Most OSs will only allow you to create a hard link to a regular file.

Hard links are useful when you want different names for the same file, without copying the content, thus saving space. In the above example, if you change the content or m-d (metadata) of foo.c, and then looked it up using bar.c, you'd see the same m-d and file content.

To create a hard link:

$ ln foo.c bar.c	# make bar.c a hard link to existing file foo.c

Or use the link(2) syscall. You can rename hard links (mv(1), rename(2)) or delete them (rm(1), unlink(2)) just like any other file.

When you delete a regular file, its name, inode, and any data blocks it uses are removed from the file system and underlying storage. But when you remove a hard-linked file, it depends how many links it has: if you are removing the very last link, then all content goes away; if you're removing just one name and there are still other names left for the hard-linked file, then you ONLY remove that one name (struct dirent) from the one directory in which you deleted the file.

The OS knows when it's about to delete the last ref to a named inode, because it keeps track of the number of links to a file, in the inode's struct stat, in this field:

nlink_t st_nlink;	/* Number of hard links */

For regular files, nlink=1; for hard-linked ones, it's >1. Test: create a file, create hard links to it, then check using stat(2) or /bin/stat to see the #links grow; then unlink/rm to see the #links drop.

Note: a hard link's inode is still the same regular file type.

* What is a soft, or symbolic link?
A symlink is a different type of f/s object: not a REGular file, not a DIRectory, but a LINK file. A symlink has its own unique inode created, with the usual m-d in the inode. The "content" of a symlink object can be anything when you create the symlink.

However, when you look up an object (indirectly, via the OS's lookup, for any syscall that passes a pathname) and that object happens to be a symbolic link, the OS's pathname lookup includes a special procedure. Example: consider the pathname lookup of a path like /a/b/c/d/foo.c. We expect "foo.c" to be a reg file inside directory 'd', which is inside directory 'c', etc. If, when parsing this pathname, ANY component is determined by the OS to be a symlink, then:

1. The OS will read the content of the symlink.
2. The OS will "insert" that content in place of the current component being evaluated (looked up).
3. Then the OS will resume processing the pathname lookup with the symlink's content replaced.

A symlink is another way of aliasing to another file, directory, or whole pathname. While a hard link can only connect one component name to another, a symlink can connect one component name to ANY other pathname. The size of the content of a symlink is limited to 4096 bytes: that's the maximum pathname length (PATH_MAX) allowed.

To create a symlink:

$ ln -s foo bar	# -s says make bar a symlink to foo

Or use the symlink(2) syscall. To read a symlink's contents: use readlink(2) -- looks like a read(2) or getdents(2), but for symlinks. You can delete/rename a symlink w/ the usual commands and syscalls.

You cannot create a hard link to a file that does not exist yet. But you can create a symlink to ANY file or pathname, even a pathname that doesn't exist (at this time), or one that points to another file in another file system. Key: when you create a symlink, the OS doesn't validate the content of the symlink (what you're pointing to). That validation only happens when you try to ACCESS that symlink, say by open, stat, etc.
$ ln -s /foo/bar name
$ ln -s ../../../dir1/dir2/somedir name

Symlinks offer 'delayed evaluation' of what they point to, whereas hard links evaluate what they point to at the time you create the hard link. Symlinks are useful to create aliases to files, dirs, or any other object, at any time. Useful also b/c they cross file systems. They don't take up a lot of space (one inode and at most 4KB of data). They allow you to create a whole "shadow namespace hierarchy" of existing files and directories.

By default, any pathname lookup (e.g., open(2), /bin/cat), when it sees a symlink, will follow that symlink further until the final component is actually found and opened, displayed, etc.

stat(2): will traverse all symlinks in the pathname you're trying to stat.
lstat(2): if the pathname is a symlink, will return inode info about the actual symlink, not what it points to.

A symlink can point to an object that doesn't exist: if you try to traverse it, you get an ENOENT error. This is called an orphan or dangling symlink. Note the target object COULD have existed before: nothing stops a target from being deleted while you still have symlinks pointing to it.

$ ls -l

will show a symlink as "foo -> bar" (the arrow denoting a symlink) and a lowercase 'l' in the object-type position (left side of the ls -l output).

$ ln -s a b
$ ln -s b c
$ ln -s c d

Issue 1: you can break a chain of symlinks. Symlinks are "fragile" b/c any symlink, or any component it points to, can be deleted, renamed, or changed, and you may not find out until the next time you try to traverse the chain. Renaming: suppose "a" is a symlink to "/b/c/d/e/f.c", and I do this:

$ cd /b/c
$ mv d old_d

Now "a" is a dangling symlink: its content still says "/b/c/d/e/f.c", but that path no longer resolves.

Issue 2: you could create an infinite loop of symlinks -- a circular chain. The OS has to be careful, b/c if infinite loops were allowed, the lookup procedure inside the OS would get into an infinite loop (very bad for the OS). Solution: not any fancy graph cycle-detection algorithm, but:

1. Allow any symlink to be created, even if it causes a loop.
2. When the OS starts to evaluate any pathname as part of a lookup, for one syscall, it starts a counter.
3. Each time the evaluation of that same pathname crosses a symlink, increment the counter by 1.
4. If the counter exceeds a max threshold (often set to 20), abort the lookup and return the error ELOOP.

This means you cannot have a valid, non-cyclic chain of more than 20 symlinks. Also, if you have a small loop (a->b and b->a), the OS still has to evaluate those symlinks multiple times until the counter reaches 20. There are fancy cycle-detection algorithms, but OSs don't use them because (1) their complexity is larger than the above algorithm's and (2) their memory footprint is larger. OS designers prefer to keep code/algorithms simple.

* other types of f/s objects

Block and Character devices, often living in /dev: you can create them using mknod(2), and rename/delete them as usual. These objects are unique in that there's special code in the OS that, depending on how you create them, implements different functionality.

int mknod(const char *pathname, mode_t mode, dev_t dev);

- pathname: name you want for the special object
- mode: default permissions
- dev: encodes a major+minor number pair that only the OS knows how to interpret

For example, major number 7 can be a SCSI device; major number 8 can be some GPU; a terminal device; etc. I.e., every "class" of devices has its own major number. The minor usually 'refines' the type: e.g., for a SCSI-type device, minor=1 means the first SCSI device in the chain; minor=2 means the 2nd, etc. The minor number therefore denotes a specific instance of the device on this system.

Block devices: look like one "raw" file. You can open a block device, say /dev/sda1, and you'll be able to seek to any offset within the device; you can read and write, but only in native "blocks" (e.g., aligned 512B sectors). This is how mkfs, for example, formats a file system on a device.
As long as you have the privileges to read the /dev/XXX device file (usual Unix permissions), you can access the 'raw' data of a storage device, even bypassing the file system! Dangerous: don't mess with f/s data structures.

You can back up a whole raw device, block by block, and restore an identical image of the block device if you wanted:

$ dd if=/dev/sda1 of=/some/file.bkp.of.sda1 bs=4k

dd: "Disk Dump" -- used to read/write raw devices, but also any other file.
if=: Input File (the raw device)
of=: Output File (where you want it stored, NOT on /dev/sda1)
bs=: Block Size, the unit of copying

dd is also useful to generate a file of a given size, or to measure the raw performance of an I/O subsystem:

$ dd if=/dev/zero of=/mnt/filesystem1/BIGFILE1 bs=1M count=1000
$ dd if=/dev/zero of=/mnt/filesystem2/BIGFILE2 bs=1M count=1000

The above commands read from /dev/zero (just a sequence of zeros) and write to two different file systems. At the end, dd reports the throughput, so you can compare speeds.

$ dd if=/dev/random of=/tmp/randombits bs=1k count=20

The above will create a file containing 20KB worth of random bits.

Char devices: allow you only to read/write sequentially. You cannot seek back or forward. Examples include keyboard terminals, network sockets, etc.

Special devices:

/dev/null: aka the "Unix bit bucket". A place you can write/redirect any data to when you want it discarded.

/dev/random and /dev/urandom: read random numbers from a generator.
- /dev/urandom is a "pseudo" random number generator: not perfect entropy, but it can generate a lot of random data quickly.
- /dev/random is a "true" random number generator (TRNG). Reading from this device is much slower because it "generates" random numbers from a mix of external events: keyboard and mouse clicks, network packets arriving, noise signals, and more. It generates better randomness, but is slower *and* reading from /dev/random may block the reading process until the OS can gather more entropy.
/dev/console: your login console (you can "echo hi > /dev/console"). In some OSs this is the system console (as if you attached a monitor directly to the computer) and only superusers can write to it. To find out your "terminal" ID, type:

$ tty

/dev/zero: a sequence of zeros.

Type "ls -l /dev/*" to see the permissions, name, and TYPE of each device ('c' or 'b').

* open, creat, openat

A lot of new syscalls introduced a *at variant, e.g., openat:

open(pathname, flags)
openat(dirfd, pathname, flags)

Suppose you want to create files a, b, c, and d in directory /tmp:

open("/tmp/a", ...)
open("/tmp/b", ...)
open("/tmp/c", ...)
open("/tmp/d", ...)

Problem: the OS has to parse the full pathname 4 times, even though it's the same prefix /tmp. Sometimes your code would also have to concatenate the string "/tmp" with the name of each file you want to open/create in /tmp. openat() allows you to do the following:

dirfd = open("/tmp", ...)	// open /tmp with O_DIRECTORY (see flags), which lets the OS keep cached state about the dirfd you just opened
openat(dirfd, "a", ...)
openat(dirfd, "b", ...)
openat(dirfd, "c", ...)
openat(dirfd, "d", ...)

The above is faster and more efficient inside the OS, plus it saves the program the hassle of string concatenation.

Also: there's a serious SECURITY reason. Suppose your program does this:

open("/tmp/a", ...)
open("/tmp/b", ...)
open("/tmp/c", ...)
open("/tmp/d", ...)

Let's say the user succeeded in opening /tmp/a and /tmp/b. Right before /tmp/c is opened, some user (hacker) manages to get inside the computer and change the nature of "/tmp". For example, if they gain root privs, they can do this:

# cd /
# mkdir /.myhiddentmp		(nastier, hidden: mkdir '/. ')
# echo my content > /.myhiddentmp/c
# echo my content > /.myhiddentmp/d
# mv tmp tmp.off
# ln -s /.myhiddentmp tmp

Now the remaining calls

open("/tmp/c", ...)
open("/tmp/d", ...)

open the attacker's files instead. The above kinds of security vulnerabilities are called time-of-check-to-time-of-use bugs (TOCTTOU, TOCTOU).
That is, there's a race condition b/t when you checked or looked something up, and when you use it. Such races can happen b/c many programs do this:

1. stat(somefile), to ensure it doesn't exist
2. open(somefile), assuming it's newly created

But if someone manages to create a file b/t steps 1 and 2, the open in step 2 will open an entirely different file. The above sequence of 4 opens is vulnerable to TOCTOU bugs. But if you first open a dir and hold a handle on it (dirfd), you can be assured that the OS will NOT delete that directory, even if some other user/hacker renames it, or even attempts to "rm -rf" it. Because the OS has the file/dir open, only its name disappears from the namespace: the actual inode and its content still remain on the file system.

In sum: the syscalls ending in 'at' usually take an open directory descriptor instead of a full pathname. Use of such syscalls is more efficient and more secure.

* file modes

9 basic bits, for user, group, and other; the bits are Read, Write, and eXecute. You can set the bits when you create a file, or later using chmod(2). Find out what they are using l/stat(2).

Execute makes sense on files: b/c you can execute them. On directories, the X bit means that the directory is searchable: whether you can look up a name in it (assuming you already know the name). If a directory has the R bit, you can enumerate the files within (perform an "ls" or getdents(2)).

S_ISUID 0004000 set-user-ID bit:
Normally when you execute a program file, the program runs under your login privileges. When you execute a program file that is setuid, the OS first sets the effective userid to the OWNER of the file. This is sometimes needed if the program requires access to restricted services that only root can have, but you want to allow non-root users to access the service by running the program. Note: setuid-root programs can be dangerous (setuid-root scripts are even more dangerous).

S_ISGID 0002000 set-group-ID bit (see inode(7)):
Same as setuid, but sets the group that the program runs under to be the group of the file on disk (not the running user's default group). Setgid on directories usually means that files created inside that dir will inherit the directory's group -- not the creating user's default GID.

S_ISVTX 0001000 sticky bit (see inode(7)):
The sticky bit is used on world-writable directories like /tmp to mean that anyone can create a new file in /tmp, but you cannot delete someone else's file.

* flags to open, creat, mkdir, and other syscalls that "create" objects in the file system

O_DIRECT: part of a growing set of OS interfaces called "Direct I/O". Allows users to access files on the persistent media directly, w/o going through the OS page cache. When you write to an O_DIRECT file, the write returns only AFTER the data has persisted. Slower, but you have more control over when and what gets written. Direct I/O is useful in databases writing to a log/journal, b/c writes to the DB log have to reach stable storage in a controlled order. Other applications that use O_DIRECT perform their own caching in their own memory, to avoid the OS's caching (b/c apps have no control over the OS's cache-flushing algorithms). An alternative to O_DIRECT is to call fsync(2) on a file descriptor when you want all the OS's cached data for it flushed. O_DIRECT controls flushing on a per-write() basis; fsync does it on a per-open-fd basis; sync(2) does it on a per-file-system basis; and you can also mount a whole f/s with a 'sync' flag to force all access to that f/s, by all users, to be synchronous.

O_DIRECTORY: used to open a directory, so you can pass a dirfd to those *at calls. Normally you can't open a directory for I/O, and can't read(2) from it -- use getdents/readdir instead.

O_DSYNC: sync the file's data upon each write (but not all of its m-d). See also O_SYNC, which syncs both data and m-d. Note that O_DIRECT by itself only concerns file data, not m-d.

O_EXCL: usually used w/ O_CREAT to say "only create this file if it does NOT already exist" -- an 'exclusive' create.
O_LARGEFILE: useful when opening very large files (needing 64-bit offsets) on systems where file offsets are normally only 32 bits (32 bits == 4GB max).

O_NOATIME: recall the inode has 3 times (modification/mtime, change/ctime, access/atime). Atime is normally updated each time you read a file, so even a workload that only READS files still causes WRITES to the f/s, just to update the inode's atime field. Atime has historically been of limited use, and yet if you read a file a million times, you'd have to update atime a million times -- lots of unnecessary I/O. So many OSs offer an option NOT to update atime. Some OSs have a hybrid option: update atime only every N seconds, or only after N changes to atime.

O_TMPFILE: tells the OS that this is going to be a short-lived file, so the OS can keep all the file's state (data and m-d) in memory longer, b/c it'll probably be deleted in a short period of time.

O_TRUNC: truncate the (existing) file to 0 bytes before writing to it. Otherwise, writing to the file (which defaults to starting at offset 0) just overwrites whichever bytes you write -- the remaining bytes afterwards stay intact in the file. Note that a successful open with O_TRUNC truncates the file's contents permanently: so if later on you did a write(2) and then wanted to abort and recover the orig file's contents, you can't (unless you created a backup of the file).

openat(int dirfd, const char *pathname, int flags): open/create a file in a previously opened directory. You can pass the special value AT_FDCWD for dirfd, to mean "operate relative to the cwd of this process".

* stat, lstat, fstat, fstatat

We discussed l/stat before. fstat() allows you to stat an already-opened object, even if you don't have its name any longer. Useful to check, e.g., the size of a file you (and perhaps others) are writing to, or to check if permissions have changed. fstatat() is like fstat, but allows you to stat "at" a given directory. It also takes special flags like AT_EMPTY_PATH (tells the OS to operate on the file referred to by dirfd itself).
* lseek

Changes the default read/write "head" on a file to any other offset in that opened file.

SEEK_SET: use the absolute offset given
SEEK_CUR: seek "offset" bytes relative to the current offset
SEEK_END: seek relative to the end of file

Recall that each time you successfully read/write N bytes of a file, the default read/write offset for that file advances in the OS as well. The OS maintains that state for each open file descriptor.

* lseek (sparse files)

How storage space is allocated to a file (on any media). Recall that storage media has a native unit, say a sector/block of 512B or 4KB. That means a device cannot give you any less than the sector size. If you need 1 byte of space, you'll need to consume a whole 512B sector.

1. When you create a new file: the size of the file is 0 bytes, and no blocks need to be allocated on the underlying storage media (the f/s software is what requests allocation of blocks from the underlying storage media).
2. Now you write your first byte: the f/s will request and allocate a whole block of 512B. Then the f/s will start to fill in that block each time you write/append more bytes to the file. The OS and the f/s track that you've allocated 1 whole block, what the native size of these blocks is, and how many bytes out of that block were actually used.
3. Once you have written your 512th byte, you've filled up the first allocated block. If you need ONE more byte, the f/s will have to ask the storage media for yet another (second) block of 512 bytes.

Example: a file has 600 bytes written. That means you need two 512B blocks. The first block will be full. The second block will have only 88 bytes filled. In struct stat:

- st_size: the size in BYTES of the file (e.g., 600)
- st_blksize: native block size = 512
- st_blocks: number of blocks allocated = 2

Note: inode structures inside the OS have some unused/leftover space. When a file's data is small, some file systems store the small number of bytes directly in the inode itself. This is more efficient b/c you don't need to allocate actual disk blocks until the file grows beyond a certain size. This is called a "short file" -- a file whose small data is stored directly in the inode. This is seen in stat(2) as a file with some number of bytes but 0 blocks. For a similar reason, small symlinks can be stored inside the inode too -- called "short symlinks".

It's been observed that some applications write out whole large sequences of zeros to their files. Typically seen in large files, databases, core dumps on large-memory systems, and more. That seems like a waste of disk space. So the idea of a "sparse" file was created. The idea is that IF you need to write a block that is all zeros, the OS and f/s can internally NOT write that block at all. The OS and f/s keep a data structure that knows exactly which blocks of data are allocated to a file, and in what order. So if, say, the 2nd block of data happens to be all zeros -- the OS can just avoid allocating that block at all, and instead leave a special marker like "NULL" where the ID of the block (e.g., the Logical Block Number, or LBN) would have been. If a user process tries to read some bytes at a location where the block for that file has NOT been allocated, the assumption is that this block is a zero-filled block (sparse, or non-existent block), and the OS doesn't even need to perform disk I/O: it just returns a bunch of zeros back to the user process (e.g., memset with 0s). That's how a sparse file behaves: an illusion that you wrote zeros, but you didn't actually consume any disk space.

How do you create a sparse file?

1. Historically: lseek() PAST the end of the file, then start writing any non-zero data. Most OSs would then create a "hole" in the file, b/t the original file EOF and where you started writing. Note: the hole has to be large enough to encompass at least one *aligned* block.

How can you recognize that you have a sparse file?
Check struct stat: if st_size, rounded up to a multiple of the block size, is larger than st_blocks * 512 bytes (st_blocks counts 512B units), it means some blocks were never allocated -- and this is a sparse file.

Problems: if you /bin/cp'd a sparse file, it's possible you actually filled in the zeros, thus turning a sparse file into a non-sparse one (wasting space on all those zeros). Modern cp programs are smarter.

In modern systems, esp. under virtualization, a sparse file is also called a "thin" file, and a non-sparse file is called a "thick" file. E.g., if you create a virtual machine (VM) with a 100GB virtual disk, that VM's disk will be an actual file on your host machine (the one running the hypervisor). If you allocate the VM disk (e.g., called a VMDK in VMware) as "thick", you'll have consumed all 100GB ahead of time; if you allocate it as "thin", it starts empty, and the hypervisor fills in any non-zero blocks as needed, depending on what the OS running in the VM does. E.g., say you install a minimal Ubuntu 18 system with just 20GB worth of binaries (as per "df"): in a thinly allocated 100GB VMDK, only 20GB will be used.

2. Modern OSs: allow you to pre-allocate a sparse file, and also allow you to "punch" a hole in a file that has a bunch of zeros, so that the file system deallocates any all-zero blocks used by the file. There are syscalls for that: e.g., fallocate(2) to "preallocate or deallocate space to a file". fallocate(2) can "punch a hole" in the middle of a file, turning it sparse; it can also pre-allocate a large extent for a file you're about to write.

* truncate, ftruncate

Chop a file at a given offset: the file's size in bytes is set to that offset, and all whole blocks allocated by the f/s to the file past that point get freed and released to the storage media. Most common: truncate a file to 0 bytes (also happens with open and O_TRUNC). Many programs are written inefficiently: they truncate a file and then rewrite all the data.
In some cases, most of the data is the same and only a few bytes at the end (or middle) have changed. More efficient programs avoid the unnecessary file truncation followed by new writes: use lseek() to go to the offset you care about, and write the bytes there.

If you open an existing file and start to write(2) to it, then when you are done you need to call truncate IF the number of bytes you just wrote is LESS than the size of the file. If you don't call truncate(2), you'll have extra bytes at the end left over from the previous version of the file.

Some modern versions of truncate allow you to set the offset BEYOND the current EOF, thus creating a sparse file. But as this is not common or standard behavior, better to use ftruncate(2).

* chown, fchown, lchown

Prototypes:

int chown(const char *pathname, uid_t owner, gid_t group);
int fchown(int fd, uid_t owner, gid_t group);
int lchown(const char *pathname, uid_t owner, gid_t group);

These allow you to change the owner and/or group of a file. lchown operates on a symlink itself instead of what it points to. Only uid 0 (root) can change a file's ownership. Used by /bin/chown and /bin/chgrp: chgrp uses chown only to change the group of a file. A user can only change their file's group to another group they are a member of, not to arbitrary groups. chown takes pathname, owner, and group: if owner or group is -1, that ID is not changed. Some older OSs had a separate chgrp(2) syscall.

* utime, utimes, futimens, utimensat

Example on Mac OS X of "stat logo.txt":

  File: logo.txt
  Size: 1850    Blocks: 8    IO Block: 4096    regular file
Device: 100001eh/16777246d   Inode: 35548818   Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 701/ ezk)   Gid: ( 20/ staff)
Access: 2021-04-07 17:58:51.431463108 -0400
Modify: 2021-04-07 17:58:50.373956409 -0400
Change: 2021-04-07 17:58:50.373956409 -0400
 Birth: 2021-04-07 17:58:50.373864739 -0400

utime: set atime/mtime with 1s resolution.
utimes: same, but with a higher-resolution clock (usec or better).
futimens, utimensat: nanosecond clock resolution.

* rename, renameat, renameat2

Prototypes:

int rename(const char *oldpath, const char *newpath);
int renameat(int olddirfd, const char *oldpath, int newdirfd, const char *newpath);
int renameat2(int olddirfd, const char *oldpath, int newdirfd, const char *newpath, unsigned int flags);

Rename a src file to a dst file; the 'at' versions are, as usual, more efficient. Usually, if dst exists, it gets deleted/overwritten. (More obscure: try to rename(2) a directory onto a destination directory that's empty -- POSIX says the empty dst dir should be deleted and replaced by the renamed one.)

renameat2 takes special flags: the most useful is the RENAME_EXCHANGE flag, which allows you to swap two file/dir names atomically inside the OS. Otherwise, how can you swap 2 file names A and B?

$ mv B tmp	# can do w/ plain rename(2)
$ mv A B
$ mv tmp A

So you need 3 syscalls, and you risk TOCTOU bugs and partial failures.

* setxattr, lsetxattr, fsetxattr, getxattr, lgetxattr, fgetxattr, listxattr,
* llistxattr, flistxattr, removexattr, lremovexattr, fremovexattr

Extended Attributes (EAs or xattrs). Historically a file contains any data you want: the OS does not impose a structure on a file's data; it just looks like a sequence of bytes. Apps can impose whatever structure they want. But there's limited m-d you can store about a file -- whatever's in struct stat. Sometimes you want to store extra info per file, esp. some important m-d. Examples: MP3 "ID3" tags (song title, performer, album, year produced, etc.); same with any other multimedia data. Other examples could include: compression used for a file, encryption algorithms or modes used, hashes and checksums, and more.

Choices:

1. Put the m-d as part of the file's data, which means you need to teach all applications about this m-d.
An example is MP3 files: a combination of a sequence of audio bytes as well as a special "ID3" structure that allows you to set several key-value pairs (e.g., "genre" is "classical", track is 1, title is "my song"). Problem: the structure is specific to the file format, and all apps that want to access MP3 files have to know where and how to read that structure. Plus you need custom tools to view/set the attributes of a file (e.g., id3tag, id3info, id3v2).

2. Add the extra m-d as part of struct stat. Problem: results in non-standard stat structures, and everyone wants their own favorite additions.

3. Extended attrs: allow you to add arbitrary key-value pairs to a file's inode. The KV pairs are stored together with the file, by the OS. Copying the file, archiving, moving, renaming, etc. should all preserve these KV pairs.

With the xattr set of calls you can essentially control a mini DB of KV pairs on a file: you can set a value V for a key K (setxattr); you can list (like /bin/ls) xattrs using listxattr; you can remove a KV pair using removexattr; and you can retrieve the value V of an xattr with key K (getxattr). There are useful user-level utilities as well.

In Unix, when you open a file, you read the file's "main" data. In Windows NTFS, a file can have several "channels" or "streams" of data (other than the regular stat(2)-style m-d). So when you open an NTFS file, you have to say that you want to read the "main channel" to get the file's data; in NTFS, xattrs are a separate channel that you have to ask to read. Example uses of multiple channels: storing previous versions of files (documents, spreadsheets) so you can revert/recover to previous file states.

* Access Control Lists (ACLs)

EAs were created in Linux specifically to solve the ACL problem. In Unix, you can only have one owner, one group, and "other" permissions. Sometimes you want more advanced access permissions on a file:

1. Multiple owners who can operate as if they are the primary owners of the file.
2. Multiple groups.
Or access to a file if a user is a member of group X and group Y; can also be logical "OR", XOR, "and not", etc.
3. Complex lists of permissions: e.g., permission is allowed if
- owner A or B
- group C and D
- group D but not E
- group F XOR G
- etc.

ACLs in Linux are implemented as a layer on top of xattrs. Linux reserves all EAs starting with the string "security.*" (so only root or the OS can set those). Users can set an EA starting with "user.*". Tools exist to set, remove, list, reorder, and shuffle ACLs. See man acl(5) and a lot of acl_*(3) library calls and tools.

* chroot

Recall that every process has a cwd so that the OS will know where to begin evaluating relative pathnames like "foo/bar.c" or "../src/debug.txt". Absolute pathnames start with a "/", and "/" is the "root" directory of the entire system. But that root dir can be changed on a per-process basis: use the chroot(2) system call to do that. If you change the root of a running process to some other path, for example "/my/dir/", then thereafter, every time the process tries to resolve an absolute pathname, it'll use "/my/dir/" as the "/" dir.

A chrooted process cannot "escape" its chrooted directory. This is sometimes called a "chroot jail". Namely, if after chroot-ing to /my/dir/ you try to "cd .." -- you'll wind up still in /my/dir; and if you try to cd to an abs pathname "outside" the jail (e.g., "cd /etc" or "cd /boot"), you'll get an error ENOENT (the file/dir doesn't exist INSIDE that chroot jail, even if it exists outside of the jail).

Even w/o chroot: what happens if you do this?
$ cd /
$ cd ..
Where do you wind up? What's the parent dir of the "/" dir? Yes: you wind up in the same place -- the parent of "/" is "/" itself.

Q: can you 'escape' chroot if you had an open fd on a file outside?
A: it depends. Some OSs prohibit this, chroot may fail, or they might "close" any fd you have that's outside your jail.
Same problem if you have access to other resources, maybe an mmap'ed region on some outside file.

Chroot is useful when you want to confine or restrict the possible access a process could have, in case of vulnerabilities or successful attacks on the process. Recall that many bugs, like stack buffer overflows, if exploitable, can allow an attacker to run ANY arbitrary code with the SAME privileges as the process that's executing.

Assume the Apache httpd Web server is running, and the DocumentRoot where all your http files live is /var/www. If there's a bug in apache, someone may manage to trigger it (esp. b/c httpd is a service "open" widely to outsiders on the Web). If attacked, a user can easily execute code to try and copy vital files like /etc/passwd (outside of /var/www), or any other valuable information they find; they can try to offline-crack the passwords; steal private keys from users' home dirs; grab any other info (credit cards, etc.); they can just fork bash and "browse". Note: if apache was running as UID 0 (root), the hacker can then easily access anything, including installing their own kernel(!) That's why many services try to run with their own custom user: apache, mail servers, databases, all try to have their own user (not root).

With chroot, and assuming the broken-into process runs as non-root, any attempt to access files outside /var/www will fail (ENOENT). Inside /var/www you have anyway mostly public files, and maybe read-only copies of some shared libraries that apache httpd may need. Of course, don't put files in /var/www that include sensitive info (social security numbers, user IDs, emails, credit card numbers, or passwords -- esp. not in cleartext).

Chroot has been shown to have some flaws. Thus some OSs started to design more complex "jail" mechanisms. There are advanced container and resource limits in Linux alone, including more complex "namespace" manipulations (even kernel-enforced namespace limitations on what a root user can see).
* clone, fork, vfork, execve, execveat

Used to start and create new processes.

fork: create a "copy" of a running process, inheriting SOME properties and info from the parent to the child. When you call
int ret = fork();
right after, assuming "ret" is not negative (error), you now have TWO processes running:
1. the child process, which gets ret=0
2. the parent process, which gets ret>0, whose value is the PID of the child that was created.

Q: how does the child find its parent process?
A: getppid(2)
Recall: getpid(2) is for any process to find out its OWN PID number.

All kinds of resources are shared w/ a child, and some are not shared. See the fork(2) man page. Problem: over time, people wanted a variant of fork() that lets them share (or not) other kinds of resources. Eventually, fork() was generalized to a more generic "superset" clone(2) syscall. clone(2) takes more parameters, and is more complex to use, but more flexible. Today, inside the linux kernel, fork(2) is implemented not as a separate syscall, but by just calling clone(2) with the right flags.

clone(2) prototype:
int clone(int (*fn)(void *), void *child_stack, int flags, void *arg, ... /* pid_t *ptid, void *newtls, pid_t *ctid */ );
fn: the function the new child starts executing (when fn returns, the child terminates)
child_stack: the stack the child will run on
arg: passed to 'fn' upon being called
flags: see CLONE_* flags below

Clone flags include:
CLONE_FILES: do you want to share the same open FD table or not?
CLONE_FS: share the "file system" info like umask, chroot, chdir, etc.?
CLONE_SIGHAND: inherit same sighandlers or not?
CLONE_VM: share mmap'd info or not?
and other flags you can OR together. See the man page for clone(2).

The function ptr 'fn' gives the child a well-defined entry point, much like starting a thread; the child's exit status (fn's return value) is still collected by the parent with one of the wait*(2) syscalls (like wait, wait4, waitpid, etc.). 'arg' is passed to 'fn' as a generic void* that can hold anything. 'fn' is called a callback function.
Most callback functions that are designed to be flexible will have a fxn pointer and a void* that you can put anything into.

execve() takes three args:
1. the filename to execute
2. an argv array (which shows up as 'argv' and its count in main())
3. an array of strings containing the environment variables (e.g., "PATH=/foo:/bar")

Prototype:
int execve(const char *filename, char *const argv[], char *const envp[]);

If execve succeeds, it REPLACES the currently running program with a new one, whose executable is 'filename', and it gets its argv and envp from the args of execve:
int main(int argc, char *argv[], char *envp[])

A typical use, when one process like a shell wants to start another process, is to do: clone (or fork), and then the child process calls execve(2). execve does NOT change the PID number, so the parent still knows the child's PID.

* kill

Prototype: int kill(pid_t pid, int sig);

Send a signal "sig" (just a number) to pid P. Signals are just short messages (numbers) that one process (or the kernel itself) can send to another process. Once a signal is received, optional actions can take place (e.g., invoke a signal handler). Signals are a form of Inter-Process Communication (IPC). You can signal yourself, and you can signal any other process. If you don't own or have the right to signal the other process, you'll get an error.

If P is 0, you send the sig to all processes in the same "process group". Every process has a unique PID, but can also belong to a process group (PGID). This is useful to treat a set of processes as one unit. For example, the postfix mail server forks a bunch of processes (several to listen for new mail requests, some for forwarding, some for email filtering, and more).

If sig is 0, no signal is sent (there is no signal number 0). Instead, you get back a status code that tells you if the process is alive or not. This tests a "liveness" property w/o taking actual action.
* signal(7) -- list of signals

There's a number of signals that exist in an OS. A signal is just a "message" sent from process X to process Y. Every signal can have an action or no action. If it has a defined action, then a signal "handler" function will be invoked when the process receives the signal. When process X sends a signal S to process Y, there's no guarantee how long it'd take for the signal to arrive. Signals are not real-time messaging.

A process can define which signals are masked (turned off), and for those not masked, what the default action should be. Default actions can include "dump a core" for debugging, terminate the process, invoke your own custom sighandler function, etc.

Whenever you get a signal and you invoke a non-terminating sighandler function, you interrupt the currently executing code and perform a "non-local" jump to the sighandler code. When the handler returns, you come back to where you were before (just like one function calling another, setting up a stack frame, and returning to the instruction right after the jump to the called function).

Q: what happens if you get a signal delivered while you're inside a signal handler?!
A: It depends. Different OSs implement different capabilities:
1. no other signals allowed inside a sighandler fxn
2. the OS will queue up the new handler invocation and run it right after returning from the current one
3. allow nested signal handlers to be invoked (up to a certain nesting depth). This option tends to be rare to support, complex to implement, and often has little practical use.

SIGHUP: received when the controlling tty is terminated. For example, if you ssh to a remote host, start bash, and your ssh connection terminates (TCP socket close), the remote bash will get SIGHUP.

SIGKILL: terminate the process with extreme prejudice. This signal CANNOT be masked off by any process. Similarly, SIGSTOP cannot be masked off.
E.g., if you (e.g., a sysadmin) are not sure if a "runaway" process is bad or good, best send it SIGSTOP, then ask the owner (user) of the process. If the process is needed, you can send SIGCONT to continue it, and use re/nice(2) to change that process's priority.

SIGALRM: delivered when a timer set with alarm(2) expires (sleep(3) is often implemented on top of it). Sometimes useful for a process to go to sleep for some number of seconds (e.g., if malloc failed and you want to wait and see if a little while later enough mem is freed to retry that malloc).
Q: Why would you want a signal that invokes a sighandler?
A: Useful to schedule an action to happen in the future, e.g., cleaning up, sending periodic reports, logging info, etc.

SIGUSR1/SIGUSR2: custom, user-defined signals. E.g., you can use USR1 to enable debugging, and USR2 to disable it, on a running process. Useful if you have a program running at a customer, and you want them to enable/disable debug for a period of time (then send you the debug logs).

SIGHUP (reused by daemons): often used to tell a program to re-read its configuration files w/o restarting the program. Suppose you have a mysql DB, and you can't afford to shut it down, reconfig it, and restart it (even a few seconds of downtime could be bad for a busy DB server, e.g., e-commerce). Often sysadmins will edit the config file (e.g., /etc/my.cnf for MySQL), and then send a SIGHUP to the mysqld process: mysqld has a sighandler that'll re-read /etc/my.cnf and reconfig the various parameters (e.g., how many concurrent mysql connections are allowed).

SIGSEGV: segmentation violation (when you violate a page protection or try to access a virt addr that's not mapped to the current process).

SIGBUS: on some architectures, the memory bus is designed such that a memory address (e.g., of a pointer) has to be aligned to the "word" boundary of the processor. Through pointer manipulations in C, you can create pointers that point to any addr. But on some CPUs, if the addr is not aligned on a 4B or 8B boundary, you will get a SIGBUS error and a coredump.
Intel CPUs don't care about addr alignment, which is why you don't get SIGBUS on Intel (but unaligned accesses are going to be less efficient). On a SPARC architecture, you can get a SIGBUS.

Look at signal(2), sigaction, sigprocmask, and others to see how to set signal handlers, mask signals, and collect info about signals delivered to you, your child processes, or others in the same process group.

* ioctl, fcntl

Prototype: int ioctl(int fd, unsigned long request, ...);

Unix designers realized early on that they can't predict what services an OS might need and what applications would need. They created a bunch of syscalls, but then added a "catch-all" mechanism: a way for the OS to extend new functionality to user applications, w/o having to create a whole new syscall.

Syscalls have specific numbers: when a user program invokes a system call, it's really invoking a kernel function whose number is N. N is hard-coded in libc, in applications, and in the kernel: they must all agree and match. You can't renumber system calls easily; if you add a new syscall, you have to assign a number to it "permanently". Reason: once apps use that syscall, they're bound to that number. Changing syscall numbers would break Application Binary Interfaces (ABI compatibility). OS designers are very wary of adding new syscalls.

I/O Controls or "ioctls" are a way to experiment with new OS functionality, before you decide if you want to make it into a full-fledged new syscall (with a name and unique number). If you wanted to design a generic interface, via a syscall, that can invoke ANY functionality, you'd need to be able to pass as many parameters as possible: from 0 to some large number P. And be able to pass their values, however long or short the values are. And you need to do this WITHOUT changing the ioctl() interface itself!
If you want to pass P parameters to a function w/o changing its prototype, you have to pack all P params into yet another structure, and then pass a ptr to that structure, as a void*. The caller has to know what structure is being passed and how to create it; the callee (the function that processes the request) has to know how to unpack the structure.

Caller 1:
- struct a mystruct;
- fill in mystruct
- foo((void *)&mystruct)

Caller 2:
- struct b tmp;
- fill in tmp
- foo((void *)&tmp)

Callee (implementation of foo):
foo(void *ptr) {
	// how would it know what the caller passed? or how many bytes were passed?
}

Often, you see a pointer to a buffer that can take any length, together with another integer denoting the length:
foo(void *ptr, u_int len)
- now the implementation can tell how long the buf in ptr is

Problem: ptr is type-less and len doesn't tell foo() how to interpret ptr internally (what orig struct was it?). Solution: you have to pass one more thing -- a "type" descriptor. The type can be anything (e.g., a number) as long as both caller and callee agree on it. Now the code would look like this:

Caller 1:
- struct a mystruct;
- fill in mystruct
- foo(TYPE_A, (void *)&mystruct, sizeof(mystruct))

Caller 2:
- struct b tmp;
- fill in tmp
- foo(TYPE_B, (void *)&tmp, sizeof(tmp))

Callee (implementation of foo):
foo(u_int type, void *ptr, u_int len) {
	if (type == TYPE_A) {
		struct a *a_ptr = (struct a *) ptr; // cast back to the agreed type
		// now you can process "struct a *a_ptr"
	}
	// repeat code for every possible type
}

ioctl(int fd, u_long request, ...)
- typically the 3rd arg is a void* or a ptr cast to u_long
- fd: the file descriptor of an open object to operate on
- request: the type of request to perform
- 3rd arg: same as the "void*" in the foo() example above

Ioctls say: perform operation 'request' with the data in the 3rd arg, on file 'fd'.
Q: Where's the length param?
A: the designers decided to exclude the len param: both caller and callee simply have to agree on the length, or else the ABI would break.

Examples of ioctls:
- many are custom or specific to a particular kernel module
- create a snapshot in a file system like btrfs
- set a terminal to ECHO or not ECHO chars
- change open flags on an open fd w/o having to close/reopen the file
- special controls of devices like disks, GPUs, displays, and more
- turning on/off the inode 'immutable' flag

ioctls were useful initially, and people started adding more and more ioctls, many that only work on specific OSs. But some were deemed useful enough that other vendors implemented the same ioctl for their own OS (e.g., ECHO/NOECHO). Several decades later... we now have thousands of ioctls across dozens of OSs. Many ioctls are old and useful (at least for legacy code), and so OSs are "forced" to support them. IOW, there are more ioctls (aka pseudo syscalls) than there are actual system calls!

The kernel has to handle ioctls as follows:
- look at the fd: is it for a file or a socket or a directory?
- if networking, pass the ioctl args to a networking-ioctl handler
- if a file system, pass the ioctl args to the f/s that the fd belongs to (e.g., btrfs, ext4, msdos, etc.)
- if the fd belongs to terminal services, pass to the tty-handling code
- some ioctl codes are not associated with a network or f/s, and so there's a need for a large switch statement to handle each ioctl, dispatching specific kernel functions for each ioctl code.
- even inside each file system, there is a large switch statement to handle all of ITS ioctls.

In order to know what an ioctl does, you need to know what subsystem or module it belongs to. The ioctl(2) man page is generic. There are some man pages for ioctls of specific subsystems, e.g., ioctl_ns(2), ioctl_tty(2), etc. Many ioctls are not very well documented, if at all.
Many ioctls are obscure and only needed in rare cases; some ioctls are very new and perhaps experimental. If an ioctl isn't documented somewhere, your only choice is to study the kernel source code. Some useful ioctls have become system calls and some have gotten nice libc wrappers. Programmers are encouraged to use the syscall or libc (or any other library) wrappers. Problem: legacy code still remains, and if you turn off an ioctl and force people to use a different syscall, you'll break the ABI and force people to change their code and recompile.

Note: fcntl() is a subset of ioctls for file manipulations.

* readv, writev, pwritev/2, preadv/2

Efficient code: (1) use "large" buffers (but not too large), b/c bulk reading and writing is more I/O efficient; (2) avoid calling syscalls too often. Sometimes, the data you have comes not in one nice large buffer, but broken into various chunks that may not even be the same size. Examples: reading from network sockets or other streaming inputs.

If you had data you needed to, say, write to a file, but the data was in N different buffers, each buffer perhaps a different size: how would you write that?
1. Issue N write(2) syscalls, one for each buffer.
- problem: issuing N syscalls w/ their overhead
- also: possibly writing small bits of data (less efficient than bulk)
- may even have to call lseek(2) before each write(2) to set the write position
2. malloc a buffer large enough to hold all the data, then memcpy the individual buffers into the new buffer, free the old smaller buffers, and then call write(2) only once.
- pro: only calling one syscall and it's "bulk" data
- cons: extra memory needed, and memcpy consumes cpu/mem overhead

Solution: create syscalls that can read/write not one buffer, but a "vector" (or array) of buffers. Recall that a "buffer" is an array of bytes. So these syscalls will be reading/writing an array of arrays of bytes.
This allows you to use a single syscall w/o having to copy all the data into one big buffer.

Prototype: writev(fd, struct iovec *iov, int iovcnt)
- fd: the file to write to
- iovcnt: number of iov structs passed
- iov: an array of struct iovec's

struct iovec {
	void *iov_base;  // ptr to start addr of buf
	size_t iov_len;  // no. of bytes in iov_base to write
};

writev() will do something like this (inside the kernel):
for (i = 0; i < iovcnt; i++)
	write(fd, iov[i].iov_base, iov[i].iov_len);
and returns the total no. of bytes written. readv() does the same for reading; variants of these can take extra flags, and even an offset where reading/writing should begin can be passed.

* pread, pwrite, pread64, pwrite64

Prototypes:
ssize_t pread(int fd, void *buf, size_t count, off_t offset);
ssize_t pwrite(int fd, const void *buf, size_t count, off_t offset);

Same as the read(2) and write(2) syscalls, but you also pass the offset where you should read or write. Normally, read/write() operate at the last read/write offset that you left off at (0 when opening a new file). If you want to read and write data non-sequentially, you have to call lseek(2) and then read/write. For sequential reading/writing, regular read/write() are ok. But IF you have to write a lot of data to a file at random offsets, then it's better to use pread/pwrite (saves you an extra lseek call). Example apps where you read/write randomly: any database, sql, or K-V store, leveldb, etc. Also multi-media apps: skip ahead or back in an audio/video file.

* sendfile

When Web servers started, there wasn't much load on them. As the Web grew, it became critical to run Web services as fast as possible. Web servers are measured by how many queries-per-second they can handle. A Web server is a user application:
1. listen on a socket (waiting for Web browsers to issue an "HTTP GET" request)
2. when it gets the request, read some file (e.g., index.html)
3. then pass the data of that file over to the socket, back to the client (Web browser)
4.
goto 1

How would this be implemented?
1. fd = open(file)
2. read(fd, buf, len) // maybe read the whole file b/c html files are usually small
3. close(fd)
4. write(socketfd, buf, len) // write the file's data to the browser socket

Problem: httpd has to read data from the kernel into user space, and then write it right back out. That wastes syscalls and processing (copying data, buffers, etc.). You can use mmap for some of this, but mmap is a more complex API, and you'd be wasting effort setting up a mapping and destroying it, just for a file that you have to fully copy over a socket.

sendfile() was invented to solve the above problem. It allows the kernel to read data from one open file descriptor and write it out to another fd directly. Original flavors of sendfile() read the whole file and wrote it to a socket fd -- the API looked like old_sendfile(char *htmlfile, int sock_fd). Modern versions are more flexible:

Prototype: ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count)
- out_fd: the fd to copy the data to
- in_fd: the fd to read the data from
- offset+count: read "count" bytes from in_fd at "offset" and write them out to out_fd
- returns the no. of bytes successfully written, a la write(2)

Can be useful even for /bin/cp.

* splice

Related to sendfile; was created a bit after sendfile.

Prototype: ssize_t splice(int fd_in, loff_t *off_in, int fd_out, loff_t *off_out, size_t len, unsigned int flags);

If you want to move data pages from one file to another, not just copy them, you can use splice. It can even allow you to "insert" data in the middle of a file. It typically operates on whole pages. Splice internally (inside the OS) may play with struct page's inside the OS page cache.

Use: you have some data in a file A. You want that data to be copied to file B, and you know that you'll discard/delete file A (maybe file A was a temp file for intermediate processing). Traditionally, you'd have to copy the data from file A to file B (more efficiently if using sendfile).
Inside the page cache, however, each page is associated with one file (inode). Splice allows one to move the "mapping" of a page in the page cache from one file to another, at any offset of the dest fd, allowing one to "insert" a 4KB page in the middle of a file. No copying is required, only manipulation of pointers inside the kernel (e.g., all the ptrs of all pages that belong to a file in the page cache).

* Locking

In general, two kinds: (1) advisory and (2) mandatory.

MANDATORY means: the one who got the lock has exclusive access to the resource. No one else can access the resource (e.g., a file) at the same time. The OS enforces the locking on the resource or file: anyone else who tries to access it at the same time gets an error.

ADVISORY means: everyone who wants to access the file has to coordinate via advisory locks. For example, they have to check if a lock exists (e.g., F_GETLK) and if so, not access the file until the lock is released. If another process accesses a locked file anyway, the OS won't prevent it -- and that could result in data corruption.

History:
1. Older mainframe OSs used mandatory locks (before UNIX).
2. Unix (c. 1970s) decided to make the OS simpler by supporting only advisory locking. Reasons: makes the OS simpler, and a recognition that mandatory locking was not really needed by most applications.
3. Windows decided to implement mandatory locking by default. Meaning you don't need to say explicitly that you want to lock a file: merely opening a file automatically locks it to you -- the process that opened the file. This helps avoid data corruption by other processes inadvertently modifying files. But conversely, it makes software updates more challenging: the only way to update a file that's locked is to terminate the process that locks it, or reboot. That's one major reason why updating Windows systems requires you to reboot, sometimes multiple times.
4.
Later unix systems realized that they do need to add some support for mandatory locking, because some applications really needed it -- for example, databases. Linux added mandatory locking years ago. But you have to call fcntl or ioctls to lock a file or a byte range of a file: that is, just opening a file does NOT automatically lock it.

Issues with locking:
1. Suppose someone locks a resource for "too long", or they lock a whole file when they just need access to part of the file. How do you prevent "hogging" of such a resource (a form of denial-of-service attack)?
2. Worse, what if someone locks a whole file on purpose, to prevent others from accessing the file (e.g., using a read+write lock on the entire file)? That would be a denial-of-service kind of "attack." It could also be just sloppy programming or a bug.
Term: an entity (e.g., user, process) that holds a lock on a resource R is called the "lock-owner".
3. Modern systems allow for three mechanisms to handle this:
(a) priorities: a higher-priority process can "break" the lock of a lower-priority process. Just need to be careful about setting process priorities.
(b) timeouts: don't give out permanent locks that are held indefinitely, but a "lease" that lasts for some period of time. Permit the lock-owner (the one holding the lock) to ask to extend the lease. Otherwise, when the lease ends, the lock is automatically released by the OS (and someone else that may be waiting for the lock can become the new lock-owner).
(c) revocation: some systems even require the lock-owner to register a callback function or a signal handler -- and the lock-owner's code will be called back to inform it that a lock it is holding is needed back (the lock is being revoked).

Over the history of unix, several "competing" interfaces for locking resources were created: ioctl(2), flock(2), fcntl(2), and lockf(3).
Prototype: int flock(int fd, int operation); // advisory lock
op can be LOCK_SH(ared), LOCK_EX(clusive), LOCK_UN(lock)
flock: locks a whole file
lockf: can lock a "byte range" inside a file

* fcntl

Prototype: int fcntl(int fd, int cmd, ... /* arg */ );

Like ioctl but specifically for files. Allows you to un/set and get locking info about files. You pass a struct flock, which can define the operation: lock for reading, for writing, or unlock; and set the portion of the file to lock (a byte range, or the whole file).

You can also change some of the open flags of a file that's already open, using F_SETFL: e.g., turn O_APPEND or O_NONBLOCK on or off. (Note: on Linux, F_SETFL ignores attempts to change the access mode itself, e.g., O_RDONLY vs. O_RDWR.) This is better than closing the file and re-opening it with new flags, b/c closing a file could have other side effects, like flushing data or someone else gaining access to the file. Like ioctl, some of the fcntl abilities are uniformly supported by many OSs, and some are specific to Linux.

Change notification: check also Linux's inotify* and fanotify* syscalls; start with the overviews in inotify(7) and fanotify(7).
1. Users are modifying files on a system.
2. You set up a daily backup of your system, after doing a full backup one time. The daily backup just needs to back up files that have changed since yesterday.
(a) approach one: scan the entire f/s looking for changed files.
$ find / -type f -mtime -1
The above command will list all files that have had their mtime changed in the last day. Once you get a file list, you can back those files up using your backup software. Problem: only a handful of files typically change, but there could be millions of files on your system. You are scanning the entire f/s looking for just a handful of changed files: this produces a LOT of I/O activity.
(b) better: have the OS track which files have changed, with the support of some user application.
- First a user process "registers" an interest in a file, a directory, or a directory tree recursively.
- The registration asks the OS to inform (notify) the user process when a file has changed. Changes include creating, deleting, or modifying files; renaming; etc. Even just reading a file can be registered for notification.
- When the OS has to update a file's inode, it checks the list of processes that registered for notification, and the type of notification.
- Any change to a file that matches a notification request is sent to the user process (usually via some callback function, or a signal telling the process "you have a change, please retrieve it from the OS"; this could also be via a special fd you have to listen/poll/select on).
- The user process then has to retrieve the notification, and add it to a list of files that need to be handled -- for example, backed up that night.

This mechanism is much more efficient than scanning the whole f/s. It is used not just for backup s/w but also for "desktop search"-like technologies (like Apple's Spotlight search indexing). Such desktop s/w indexes all files based on their meta-data, their content, file name, etc., so you can do searches to efficiently find, say, "all .c files that contain a string XXX" or all pictures that have a specific tag on them, etc.

* dup/dup2/dup3/pipe/pipe2, etc.

dup() is used to create an alias (like a hard link) to the same already-open file descriptor. So you get 2 descriptors that can be used to act on the same file. Note that if you read or write via one descriptor, you see the same file content if you access it via the second descriptor. Also, there is only one read/write offset for the file. So if you read the file using descriptor 1, the file's read/write offset will change, and the 2nd descriptor will see the same offset. Same if you change it using lseek. In other words, the kernel just keeps two pointers (one for each fd) to the same data structure the OS keeps on behalf of the open file.
If you want each fd to have its own file offset and even different open modes, then simply open(2) the file twice. The OS will keep two separate data structures to record the status of the open file. However, there is still only one file inode and content on disk.

dup(), dup2/3() are useful when you want to have separate threads or forked processes, each with their own fd to access the same file. When any of those threads/processes is done, they can close their own (dup'ed) fd. Another use is if your program traditionally reads from stdin (fd 0), or writes to stdout (fd 1). If you redirected your stdin/stdout to another location (say a network socket, or a file), and you want to ensure that no one else will access the data via regular stdin/stdout, you can "dup" stdin/stdout to another descriptor, close() stdin/stdout, but continue to read from stdin/stdout via the dup'ed fd. Useful also if you're calling libraries that may try to access your stdin/stdout inadvertently.

pipe() creates a channel b/t two fds that you get back. You pass an array of 2 ints. The first, pipefd[0], is the read end of the pipe, and pipefd[1] is the write end. The OS will automatically "copy" data written to pipefd[1] over to pipefd[0]. Useful if you want to have a communication channel, like a socket. You can even have two separate processes or threads: one is the writer and the other is the reader. The reader just sits and listens on its end of the pipe. The writer just write(2)s to its writing end of the pipe. Once the writer writes anything to its end, the OS copies the data (or makes it available) to the reader. If the reader was blocked in a wait state, it will unblock and will be able to read(2) the data from its end of the pipe.
In unix, pipe(2) is used a lot for shell-level pipes:

$ cat /usr/share/dict/words | grep aa | wc -l

- cat will just print the contents of the file onto stdout
- "grep aa" will read from stdin, and only display on stdout lines of text
  that match the string "aa"
- wc -l: count the number of lines

The '|' symbols indicate that stdout of the left side should be piped over
to stdin of the right side.  Shells implement this using pipe(2).  Many
unix programs read from stdin by default and write to stdout by default:
they are called "filter" programs.

* poll, select, etc.

These calls are usually used in a server-class application.  You set up a
number of file descriptors you want to "listen" on.  You specify whether
you're looking for read activity, write activity, etc.  You can also set a
timeout for how long to wait (or no timeout at all, meaning wait forever).

In poll() you pass an array of descriptors and the length of the array.
In select(2) you use the FD_* macros to say which descriptors to listen
on.  The descriptors are turned into a bitmap that is passed to select().
The bitmap is encoded in the fd_set type.  For example, if you want to
listen on descriptors 2 and 5, you use the FD_* macros to create an fd_set
bitmap that looks like this: 00100100 (basically the 2nd and 5th bits,
counting from 0, are turned on).

FD_CLR: turn off the bit corresponding to an fd
FD_ISSET: check if the bit for an fd is set
FD_SET: turn on the bit corresponding to an fd
FD_ZERO: zero out an entire fd_set

If you have lots of FDs to listen on, it may be more efficient to use
fd_set and select, b/c the list of FDs is packed more compactly in a
bitmap.  You can also listen for "exceptions" on files (e.g., a socket
that was closed by the other end).

Once you set up the FDs to listen on and call select, select will block as
long as there's no activity.  "Block" means it won't return from the call
(at least not until a timeout).
When the OS notices activity on an FD (can be a file or a network socket),
it'll wake up the process that was blocked, and you then return from
select/poll.  The process then has to check the params it passed to see
WHICH FDs have changed.  Check to see which bits are still on: those
indicate the descriptors that have a change.  Next you can go and, say,
read from each descriptor that has a change, and process the incoming
request.

Example: a Web server listens for connections from remote Web browsers.
Each browser connects to the server using a different socket (fd).  After
a connection is initialized, the Web server will listen for "read"
requests on the FD of each browser.  When a browser sends (writes) a
message like "GET index.html", the select loop in the Web server will wake
up; the server can find which FD has data in it and read(2) from it: it'll
read the string "GET index.html" and start to process that HTTP request.

The usual code you see is roughly:

	// initial setup of FDs to listen on
	while (1) {
		// possibly update the set of FDs to listen on
		r = select(....); // assume no error
		// check which FDs have activity
		// read data on those FDs, maybe spawn a different thread
		// or process to handle each request
		// go back to select, listening for more requests
	}

Usually busy services like Web servers don't do the actual processing in
their main select() loop: rather, they have a number of threads or
processes that are considered "workers", and they hand off the actual
processing to a worker.  Alternatively, they fork a new worker and dup the
FDs that the worker will need to respond on (responding to the Web
browser), so that the main server process can go back to its select loop,
listening for more requests.  To prevent spawning too many processes,
there's usually some limit imposed on how many concurrent workers can be
active at a time.

* umask (Unix Mask)

Used to set the default mode that new files/dirs should be created with.
The umask value is usually inherited from parent to child.

Example: if you set your umask to 0077 (octal), it means that all files
you create will have at most r/w/x access for the user (owner), and no
access for group/others.  Umask = 0022: the user can get full r/w/x
access, but group/others can get at most read+execute, never write (so no
world-writeable files created by you).

Recall the prototypes:

int open(const char *pathname, int flags, mode_t mode);
int creat(const char *pathname, mode_t mode);

When you call creat(2) or open(2) to create a new file, or mkdir/mknod/
etc., you pass the default mode you want that file to have.  But that's
NOT the mode the file will actually get.  The real mode will be

	mode_passed_to_syscall & ~umask

meaning: invert the umask value, then bitwise-AND it with the requested
mode.  For example, a requested mode of 0666 with a umask of 0022 yields
0666 & ~0022 = 0644 (rw-r--r--).

You can set the umask in your shell startup file (e.g., .bashrc) to be
more restrictive (a good idea).

umask(2) syscall: pass the new umask you want to set for future file
creations, and it returns the previous umask value (so you can tell what
you perhaps inherited from the parent environment).  You can of course
change a file/dir's permissions with chmod(2) at any time.

* fsync, fdatasync, sync, sync_file_range

There are many ways in which you can control when data and m-d is flushed
to disk (persistent media).  Recall you can also use fcntl(2) to change
the behavior of an opened file.  You can do it with the O_DIRECT and
O_SYNC/O_ASYNC open flags, and you can also explicitly call fsync(fd) to
flush all data and m-d of a file to disk (the old traditional syscall).

fdatasync(fd): same, but only flushes file data and not m-d (i.e., not
inode changes).  Example: if you use a database journal, after you write a
"transaction record" to it, you want to ensure that the data gets to the
DB journal file.  But you don't really care if m/a/ctime are in sync, b/c
after a crash you just replay/apply all DB journal records in order,
regardless of the timestamps on the file.
Not flushing m-d speeds things up: that's often b/c f/s store their files'
data and m-d in different locations on the media/disk.

sync(2): will flush all data and m-d for ALL file systems that are mounted
on the computer.  Useful to do before you reboot a system, to ensure that
all OS-cached data is flushed first.

syncfs(fd): flush all data of the one f/s that the file fd belongs to.

sync_file_range: sync just a range of bytes within a file.

* getdents, getdents64

These are for "reading" the content of a directory, b/c you can't use
read(2) on the fd of an open directory.  With getdents, you read N "whole"
dir records, designated as "struct dirent", that fit inside the buffer you
give getdents().  You keep reading until you get a "0" (EOF).

Note that while you're in the "loop" that calls getdents(2) until EOF,
you're reading new chunks of the directory from the last offset.  However,
the directory itself may change under you due to name changes (creat,
rename, unlink, mkdir, etc.).  There's a whole area of research in OSs
where atomicity guarantees are explored for files, directories, whole
directory trees, etc.

struct dirent: records info about a single directory entry, namely the
inode number + the name of the entry (null-terminated).

Problem: in POSIX, a full pathname can be up to 4096 bytes including "/"
delimiters, and a single name in a dir can be up to 255 bytes.  Most file
names are much shorter than that.  A naive way to declare struct dirent
is:

struct dirent1 {
	u_long	d_ino;
	char	d_name[256];
};

B/c file names are shorter and variable length, we want to save space.  So
we use what's called a "variable length data structure" in C, also called
an "out-of-band" (OOB) data structure.
A simple form looks like this:

struct dirent2 {
	u_long	d_ino;
	char	d_name[];	// there's a field named d_name in the
				// struct, but it has no pre-allocated space
};

sizeof(struct dirent2) == sizeof(u_long) (4B on 32-bit systems, 8B on
64-bit).  Use the above structure as follows:

char *name = "myfile.c";
struct dirent2 *de = malloc(sizeof(struct dirent2) + strlen(name) + 1);
de->d_ino = 17; // fill in actual inode number
strcpy(de->d_name, name);

Often you want to know how long the string in de->d_name is, so it's
common to add the actual strlen to the structure, or the size of the
entire allocated structure including the variable-length OOB data.  Or, in
the case of struct linux_dirent, where you have a sequence of these
variable-length structures, you want to know the offset in bytes to get to
the start of the NEXT struct linux_dirent:

struct linux_dirent {
	unsigned long	d_ino;		/* Inode number */
	unsigned long	d_off;		/* Offset to next linux_dirent */
	unsigned short	d_reclen;	/* Length of this linux_dirent */
	char		d_name[];	/* Filename (null-terminated);
					   length is actually (d_reclen - 2 -
					   offsetof(struct linux_dirent,
					   d_name)) */
};

Historically, some compilers, when you enable optimizations, would
"optimize away" a field that has no storage, like "char d_name[]".  For
that reason, some implementations of this OOB data structure assign at
least one byte to the last field -- "char d_name[1]".

* fallocate

Prototype:

int fallocate(int fd, int mode, off_t offset, off_t len);

Preallocates space for a file, to ensure that if you're going to be
writing a file of known size, you won't run out of space in the middle of
writing (quota limits or disk limits).  You have to be careful not to
over-reserve space, or you get an error of sorts.  And when you "release"
the file (e.g., close(2)), any unused reserved space can be released.
fallocate will try to allocate all the space for the file in close
proximity: such a contiguous unit of space is sometimes called an
"extent".
Traditional f/s don't guarantee that all of a file's blocks are contiguous
on the media, due to allocation strategies and fragmentation.  This
results in poor performance for files whose blocks got spread all over the
media.  An extent is guaranteed to be contiguous, or as contiguous as
possible.  For example, ext4 supports extents: when you fallocate() on
ext4, it'll try to give you one whole contiguous unit of space on the
media; if it can't, it'll try to find 2 or more smaller extents that are
hopefully close to each other.

* uname

Unix Name: gives you info about the running system: arch, cpu, OS name,
OS version, host name, etc.  Can also be used via uname(1).  Useful when
you want to write code or scripts that have to run differently on
different systems, or if you need to distinguish different architectures,
different distros, or even different host names.

* shm* (shmat, shmget, shmctl)

A series of system calls to control "Shared Memory".  Useful for sharing
and synchronizing among multiple processes.  The SHM* API came from the
older System V Release 4 (SVR4) unix, from AT&T Bell Labs.  With this set
of syscalls, you can create a "shared memory" object, get a handle on it,
pass it to other processes, and coordinate sharing of information,
including some basic locking and synchronization.

Unix pipes are another form of sharing info b/t two processes, but are
limited to one writer and one reader.  SHM* is more flexible, in that you
can control even the size of the mem region to share among processes.
More modern systems that want to share info b/t 2+ processes (especially
those derived from the Berkeley Software Distribution (BSD) OS variants)
will use... mmap(2).  Some OSs implement the shm* API using mmap
internally.

* get/setitimer

Interval Timers (itimers).  sleep(3) can cause a process to wait N seconds
(historically implemented with alarm(2), which sends SIGALRM to the
process when the alarm expires).  itimers allow you to set a periodic
timer: each time it reaches its timer value, it'll signal your process.
You can also count different "times" -- elapsed/wall-clock time, user CPU
time, etc.  E.g., set up a timer that interrupts your process every 5
seconds, for up to one hour.  Useful, e.g., if you have a long-running
process, say a database or other server, and you want to invoke a special
function every 60 seconds to, say, flush all important data.

* ptrace

Process tracing: used by interactive debuggers like gdb(1), as well as
syscall tracers like strace(1).  ptrace(2) allows you to trace a process
X, assuming you have permission to access process X.  ptrace is used
together with another process Y (the tracer of X).  Each time a syscall in
X is invoked, the kernel will not run that syscall (yet), but will instead
inform the tracer process Y of the activity.  Y can then inspect that
activity and decide what to do next: strace(1), for example, logs the
syscall being executed, lets it run, and then gets its return status.

The tracer process Y has full access to the state of the traced process X:
Y can read the virtual memory of X, CPU registers, file descriptors, etc.
ptrace(2) allows you to intercept not just syscalls, but any execution.
That's how GDB works: it sets up "breakpoints" at specific memory
addresses of a running process, and X keeps running until it tries to
access or execute the traced mem addr; then the kernel informs tracer Y
that a breakpoint has been reached, allowing Y to inspect other state,
resume running, etc.

* asynchrony related system calls

Prototypes:

aio_read(3): Enqueue a read request.  This is the asynchronous analog of
	read(2).  Actual prototype: int aio_read(struct aiocb *aiocbp);
aio_write(3): Enqueue a write request.  This is the asynchronous analog
	of write(2).
aio_fsync(3): Enqueue a sync request for the I/O operations on a file
	descriptor.  This is the asynchronous analog of fsync(2) and
	fdatasync(2).
aio_error(3): Obtain the error status of an enqueued I/O request.
aio_return(3): Obtain the return status of a completed I/O request.
aio_suspend(3): Suspend the caller until one or more of a specified set
	of I/O requests completes.
aio_cancel(3): Attempt to cancel outstanding I/O requests on a specified
	file descriptor.
lio_listio(3): Enqueue multiple I/O requests using a single function
	call.

Asynchrony permits better interleaving of threads/processes, and provides
higher throughput in general.  Conversely, synchronous activities slow
things down and create bottlenecks: process X has to issue action 1, then
wait for it to finish, then action 2, then wait for it to finish, etc.
See the aio(7) man page on linux, describing its Asynchronous I/O.

Most syscalls are "synchronous": you issue the syscall and then have to
wait for it to finish before you can issue the next one.  E.g., the
read(2) and write(2) APIs are synchronous: you have to issue them in
order.  Some reads/writes can run much faster/slower than others.  If one
of them is slow, it holds up all the others; but if you can issue them
all in parallel, then you only have to wait for the last one to conclude.

You can issue reads in parallel using threads or different processes, but
those can be rather heavyweight for just a single read request.  An
alternative is the POSIX AIO API: here you issue a read or write request,
but the call returns immediately(!) w/o the actual data or result of the
syscall.

The hallmark of any async API is that the calls you issue return
immediately (or very quickly), but you have to have a way of being
informed when the work you asked for has finished (and whether it
succeeded or not).  This means that when you request the original work,
you have to pass some "callback" function or pointer that will be used to
inform you when the action has concluded.  In the AIO API, you pass an
AIO Control Block structure (aiocb), where you pack the usual read/write
syscall args, but also more info: how you want to be notified (signals?
other inter-process communication (IPC)? etc.).
Another possibility is that after you submit a bunch of aio_* calls, you
go into a select- or poll-like call, where you wait for activity to have
concluded on any of the files you've submitted work for.

struct aiocb {
	/* The order of these fields is implementation-dependent */
	int		aio_fildes;	/* File descriptor */
	off_t		aio_offset;	/* File offset */
	volatile void	*aio_buf;	/* Location of buffer */
	size_t		aio_nbytes;	/* Length of transfer */
	int		aio_reqprio;	/* Request priority */
	struct sigevent	aio_sigevent;	/* Notification method */
	int		aio_lio_opcode;	/* Operation to be performed;
					   lio_listio() only */
	/* Various implementation-internal fields not shown */
};

The AIO API also has ways to query all pending I/Os (listing), cancel or
suspend an in-flight AIO request, and get an error or return status on
your own at any point in time.  Study the AIO API in linux (it will give
you good ideas for HW4).

In the AIO system, user processes are the "producers" of jobs to be done,
and the "consumer" of those jobs is the OS kernel, which has to perform
the tasks given to it.