* Lookups, reminder

Every system call that takes a string file/pathname (char *) requires the
OS to parse the pathname on a delimiter such as '/', then perform an
internal "lookup" for that pathname: stat, open, rmdir, mkdir, unlink,
rename, etc.  This matters b/c lots of syscalls pass pathnames.  Lookups can
be slow and require a good OS cache for directory entries and inodes.  For
that reason, the *at(2) syscalls were created, to execute more efficiently.

* Absolute vs. Relative Pathnames

Absolute pathname: starts with a "/".
Relative pathname: does NOT start with a "/".

Q: relative to what?!

What about "cat foo.c" or open("foo.c")?  In that case, the OS needs a
"reference" directory to know where to look up "foo.c": b/c every lookup
happens in the context of a directory in which you are looking up a name.

The dir to look inside for "foo.c", defaults to the Current Working
Directory (CWD) of the process that issued the syscall.  That information is
stored inside the OS, as part of a task (or process) structure that the OS
maintains on behalf of every running process (including a shell).  "struct
task" is one of the biggest and most complex structs in the OS.  For
example, it could have this field:

struct task {
...
	char *cwd; // current working dir
};

Note: above, cwd is shown as a "char *"; in practice it'll be a pointer to an
OS-internal structure representing the CWD.  In Linux, the task's f/s state
holds a dentry pointer for the cwd.

So, when the OS has to perform a lookup for a rel pathname, it starts from
the task struct's ->cwd field inside the OS.

chdir(2): sets the task cwd field (command is 'cd')
getcwd(2): gets the content of the cwd field from the OS (cmd is 'pwd')

Note: chdir(2) can only change the cwd of the current running process, not
for another process.  That's why shells must implement 'cd' as an internal
command, not one you'd fork+exec.
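
A minimal sketch of the pair (the directory argument below is arbitrary):
chdir(2) updates this process's cwd field inside the OS, and getcwd() reads
it back out.

```c
#include <assert.h>
#include <unistd.h>

/* Change this process's cwd (like 'cd'), then read it back (like 'pwd').
 * Returns 0 on success. */
int chdir_and_check(const char *dir)
{
	char buf[4096];

	if (chdir(dir) != 0)		/* updates the task's cwd field */
		return -1;
	if (getcwd(buf, sizeof(buf)) == NULL) /* copies it back out of the OS */
		return -2;
	return (buf[0] == '/') ? 0 : -3; /* cwd is always an absolute path */
}
```

Since chdir(2) affects only the calling process, a child's chdir() never
propagates back to the parent -- exactly why 'cd' has to be a shell
built-in.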

The /bin/login program that lets you log in authenticates your
userid+password (e.g., gets it from /etc/passwd or LDAP); it then forks,
sets the cwd to your home dir -- chdir("/home/jdoe") -- and then exec's your
preferred shell (e.g., /bin/bash).
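
A highly simplified sketch of that fork + chdir + exec sequence (the home
dir argument is made up, and /bin/sh stands in for the user's shell; real
login also sets uid/gid, environment, etc.):

```c
#include <assert.h>
#include <sys/wait.h>
#include <unistd.h>

/* Very simplified login-like sequence: fork, chdir to a "home" dir, exec a
 * shell.  Returns the shell's exit status (0 on success), -1 on error. */
int spawn_shell_in(const char *homedir)
{
	pid_t pid = fork();
	int status;

	if (pid < 0)
		return -1;
	if (pid == 0) {			/* child: will become the shell */
		if (chdir(homedir) != 0) /* like chdir("/home/jdoe") */
			_exit(127);
		execl("/bin/sh", "sh", "-c", "pwd > /dev/null", (char *)NULL);
		_exit(127);		/* only reached if exec failed */
	}
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The exec'd shell inherits the cwd the child set just before, which is how
your login shell starts out in your home directory.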

* other special directory names

If you enumerate all names inside any dir, say using getdents(2), you will
always see the following two special names:

".": the "dot" directory.  Just another name for the directory itself.
"..": the "dotdot" directory.  A name for the parent directory.

They always exist when you create a new directory.  You can use them to refer
to other pathnames relative to yours:

$ ls ../tmp # list the contents of a sibling dir of where I am now
$ ./a.out # execute a binary named a.out in the current dir

Pathnames can include as many "." and ".." components as you want: you can
ascend the parent hierarchy all the way up to the root ("/") and then
descend into any other subdir.  Also, "." components are no-ops:

"foo/./././bar" is the same as "foo/bar"

Also, note that "/.." is the same as "/".  That is, while every dir has a
parent, the global root "/" dir is its own parent.

* hidden names in Unix

By default "ls" doesn't show any file that starts with a "." such as
".bashrc".  This is a convention for user level tools, not an OS
restriction.  If you want to see such files in unix, you say "ls -a" ('a'
for all).  But note that getdents returns ALL names back to the user
process, such as "ls": ls is the one that doesn't show dot-files by default.

In Windows, conversely, the "hidden" bit for files is an OS-level feature.

Try the system call tracing tool (strace) to better understand how various
programs work:

$ strace /bin/ls
$ strace /bin/ls -a

* What is a hard link?

A hard link is another name in some dir (i.e., another struct dirent) that
points to the same inode.  Example:

Dirent table:
NAME	INODE No.
.	10
..	17
foo.c	23
a.out	100
bar.c	23

In the above example, "foo.c" and "bar.c" are hardlinked files, pointing to
the same inode number 23.  A hard link is an alias to the same unique inum.
Analogy: every person has a unique social security number, but can have
multiple names and nicknames.

The above example shows a hardlink within the same directory, but a hard
link can exist in any directory -- only within the same file system, though:
that's b/c it points to a specific inode number, and inode numbers are
unique only within a given f/s.

Most OSs will only allow you to create a hard link to a regular file.

Hard links are useful when you want different names for the same file,
without copying the content, thus saving space.  In the above example, if
you change the content or m-d of foo.c and then look it up as bar.c, you'll
see the same m-d and file content.

To create a hardlink:
$ ln foo.c bar.c # make bar.c a hard link to existing file foo.c
Or use the link(2) syscall.

You can rename hard links (mv(1), rename(2)) or delete them (rm(1),
unlink(2)) just like any other file.

When you delete a regular (un-hardlinked) file, its name, inode, and any
data blocks it uses are removed from the file system and underlying storage.
But when you remove a hard-linked file, it depends on how many links it has:
if you are removing the very last link, then all content goes away; if
you're removing just one name and there are still other names left for the
hardlinked file, then you ONLY remove that one name (struct dirent) from the
one directory in which you deleted the file.

The OS knows when it's about to delete the last ref to a named inode,
because it keeps track of the number of links to a file, in the inode's
struct stat in this field:

	       nlink_t	 st_nlink;	 /* Number of hard links */

For ordinary (un-hardlinked) files, nlink=1; for hardlinked ones, it's >1.
Test: create a file, create hard links to it, then check using stat(2) or
/bin/stat to see the #links grow; then unlink/rm, to see the #links drop.
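
That test can be scripted as a short C program (the /tmp file names here are
made up):

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create a file, hard-link it (like "ln"), and watch st_nlink go 2 -> 1.
 * Returns nlink-after-link * 10 + nlink-after-unlink, i.e., 21 if all works. */
int nlink_demo(void)
{
	const char *f = "/tmp/hl_demo_file", *l = "/tmp/hl_demo_link";
	struct stat st;
	int fd, n, m;

	unlink(f); unlink(l);			/* clean leftovers */
	fd = open(f, O_CREAT | O_WRONLY, 0644);
	if (fd < 0)
		return -1;
	close(fd);
	if (link(f, l) != 0)			/* $ ln f l */
		return -2;
	if (stat(f, &st) != 0)
		return -3;
	n = (int)st.st_nlink;			/* now 2 */
	unlink(l);				/* removes one name only */
	if (stat(f, &st) != 0)
		return -4;
	m = (int)st.st_nlink;			/* back to 1 */
	unlink(f);
	return n * 10 + m;
}
```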

Note: a hard link's inode is still the same regular file type.

* What is a soft, or symbolic link?

A symlink is a different type of f/s object: not a REGular file, not a
DIRectory, but a LINK file.  A symlink has its own unique inode created,
with the usual m-d in the inode.  The "content" of a symlink object can be
anything when you create the symlink.  However, when you try to lookup
(indirectly via the OS's lookup, for any syscall that passes a pathname) an
object, and that object happens to be a symbolic link, then the OS's
pathname lookup includes a special procedure.  For example, consider the
pathname lookup of a path like /a/b/c/d/foo.c.  We expect "foo.c" to be a
reg file inside directory 'd', which is inside directory 'c', etc.

If when parsing this pathname, ANY component is determined by the OS to be a
symlink, then:

1. The OS will read the content of the symlink
2. The OS will "insert" that content in place of the current component being
   evaluated (looked up).
3. Then the OS will resume processing the pathname lookup with the symlink's
   content replaced.

A symlink is another way of aliasing to another file, directory, or whole
pathname.  While a hardlink can only connect one component name to another,
a symlink can connect one component name to ANY other pathname.

The size of the content of a symlink is limited by the maximum pathname
length (PATH_MAX): 4096 bytes on Linux.

To create a symlink:
$ ln -s foo bar # -s says symlink bar to foo
Or syscall symlink(2)

To read the symlink's contents: use readlink(2) -- looks like a read(2) or
getdents(2) but for symlinks.

You can delete/rename a symlink w/ usual commands and syscalls.

You cannot create a hard link to a file that does not exist yet.  But you
can create a symlink to ANY file or pathname, even a pathname that doesn't
exist (at this time), or one that points to another file in another file
system.  Key: when you create a symlink, the OS doesn't validate the content
of the symlink (what you're pointing to).  That validation only happens when
you try to ACCESS that symlink, say by open, stat, etc.

$ ln -s /foo/bar name
$ ln -s ../../../dir1/dir2/somedir name

Symlinks offer 'delayed evaluation' of what they point to, whereas a
hardlink's target is resolved once, at the time you create the hardlink.

Symlinks are useful to create aliases to files, dirs, any other object at
any time.  Useful also b/c they cross file systems.  They don't take up a
lot of space (one inode and at most 4KB of data).  They allow you to create
a whole "shadow namespace hierarchy" to existing files and directories.

By default any pathname lookup (e.g., open(2), /bin/cat), when it sees a
symlink, will follow that symlink further until the final component is
actually found and opened, displayed, etc.

stat(2): will traverse all symlinks in the pathname you're trying to stat
lstat(2): if the pathname is a symlink, will return inode info about the
	actual symlink, not what it points to.

A symlink can point to an object that doesn't exist: if you try to traverse
it, you get ENOENT error: this is called an orphan or dangling symlink.
Note the target object COULD have existed before, but nothing stops a target
from being deleted and you still have symlinks pointing to it.
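
A sketch tying these together -- create a dangling symlink, then compare
stat(2), lstat(2), and readlink(2) on it (the target and link names are made
up):

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create a dangling symlink, then compare stat(2) (follows the link and
 * fails with ENOENT), lstat(2) (sees the link object itself), and
 * readlink(2) (returns the stored content verbatim).  Returns 0 if all
 * behave as described. */
int dangling_symlink_demo(void)
{
	const char *lnk = "/tmp/sl_demo_link";
	struct stat st;
	char buf[256];
	ssize_t n;

	unlink(lnk);
	if (symlink("/no/such/target", lnk) != 0) /* content not validated! */
		return -1;
	if (stat(lnk, &st) == 0 || errno != ENOENT) /* traversal fails */
		return -2;
	if (lstat(lnk, &st) != 0 || !S_ISLNK(st.st_mode)) /* link is fine */
		return -3;
	n = readlink(lnk, buf, sizeof(buf) - 1);
	if (n < 0)
		return -4;
	buf[n] = '\0';			/* readlink doesn't NUL-terminate */
	unlink(lnk);
	return strcmp(buf, "/no/such/target") == 0 ? 0 : -5;
}
```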

$ ls -l

will show a symlink as "foo -> bar" (arrow denoting a symlink) and a lower
case 'l' in the type of object (left side of ls -l output).

$ ln -s a b
$ ln -s b c
$ ln -s c d

Issue 1: you can break a chain of symlinks.  Symlinks are "fragile" b/c any
symlink or component that it points to can be deleted, renamed, changed, and
you may not find out until next time you try to traverse this chain.

Renaming: suppose "a" is a symlink to "/b/c/d/e/f.c", and I do this
$ cd /b/c
$ mv d old_d
Now "a" is dangling: its content still says "/b/c/d/e/f.c", but that
pathname no longer resolves.

Issue 2: you could create an infinite loop of symlinks -- a circular chain.
The OS has to be careful, b/c if such loops were followed blindly, the
lookup procedure inside the OS would get into an infinite loop (very bad for
the OS).  The solution is not some fancy graph cycle-detection algorithm,
but:

1. Allow any symlink to be created, even if it causes a loop.
2. When the OS starts to evaluate any pathname as part of a lookup, for one
   syscall, it starts a counter.
3. Each time the evaluation of the same pathname crosses a symlink,
   increment the counter by 1.
4. If the counter exceeds a max threshold (often set to "20") abort the
   lookup and return the error ELOOP.

This means, you cannot have a valid, non cyclic chain of more than 20
symlinks.  Also, if you have a small loop (a->b and b->a), the OS still has
to evaluate those symlinks multiple times until the counter reaches 20.
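
A small demo of the a->b, b->a case (link names made up): the OS's
traversal counter trips and open(2) fails with ELOOP.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Make a tiny symlink cycle (a -> b -> a) and try to open it: the OS's
 * traversal counter trips and open(2) fails with ELOOP.  Returns 0 if so. */
int eloop_demo(void)
{
	int fd, err;

	unlink("/tmp/loop_a"); unlink("/tmp/loop_b");
	if (symlink("/tmp/loop_b", "/tmp/loop_a") != 0)
		return -1;
	if (symlink("/tmp/loop_a", "/tmp/loop_b") != 0) /* loop allowed! */
		return -2;
	fd = open("/tmp/loop_a", O_RDONLY);	/* must fail */
	err = (fd < 0) ? errno : 0;
	if (fd >= 0)
		close(fd);
	unlink("/tmp/loop_a"); unlink("/tmp/loop_b");
	return (err == ELOOP) ? 0 : -3;
}
```

Note that creating the loop succeeds; only the later lookup is rejected.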

There are fancy cycle detection algs, but OSs don't use them because (1)
their complexity is larger than the above alg and (2) their mem footprint is
larger.  OS designers prefer to keep code/algs simpler.

* other types of f/s objects

Block and Character devices, often living in /dev: you can create them using
mknod(2), and rename/delete them as usual.  These objects are unique in that
there's special code in the OS that, depending on how you created them,
implements different functionality.

int mknod(const char *pathname, mode_t mode, dev_t dev);
- pathname: name you want for the special object
- mode: default permissions
- dev: encodes the major+minor numbers; only the OS knows what they mean

For example, major number 7 could be a SCSI device; major number 8 some GPU;
another a terminal device; etc.  That is, every "class" of devices has its
own major number.  The minor usually 'refines' the type: e.g., for a
SCSI-type device, minor=1 means the first SCSI device in the chain; minor=2
means the 2nd, etc.  The minor number therefore denotes a specific instance
of the device on this system.
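
The major+minor packing can be seen with the makedev/major/minor macros (on
Linux/glibc they live in <sys/sysmacros.h>; the 8,1 values below are
arbitrary):

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/sysmacros.h>	/* glibc home of major()/minor()/makedev() */

/* dev_t packs both numbers into one integer; makedev packs, major/minor
 * unpack.  These helpers just round-trip the encoding. */
unsigned dev_major(unsigned maj, unsigned min)
{
	return major(makedev(maj, min));
}

unsigned dev_minor(unsigned maj, unsigned min)
{
	return minor(makedev(maj, min));
}
```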

Block devices: look like one "raw" file.  You can open a block device, say
/dev/sda1, seek to any offset within the device, and read and write -- but
only in native "blocks" (e.g., aligned 512B sectors).  This is how mkfs, for
example, formats a file system on a device.

As long as you have the privileges to read the /dev/XXX device file (usual
unix permissions), then you can access the 'raw' data of a storage device,
even bypassing the file system!  Dangerous: don't mess with f/s data
structures.

You can backup a whole raw device, block by block, and restore an identical
image of the block device if you wanted:

$ dd if=/dev/sda1 of=/some/file.bkp.of.sda1 bs=4k

DD: Disk Dump -- used to read/write raw devices, but also any other file
if: Input File (the raw device)
of: Output file (where you want it stored, NOT on /dev/sda1)
bs= Block Size, the unit of copying

DD is also useful to generate a file of a given size, or measure raw
performance of an I/O subsystem:

$ dd if=/dev/zero of=/mnt/filesystem1/BIGFILE1 bs=1M count=1000
$ dd if=/dev/zero of=/mnt/filesystem2/BIGFILE2 bs=1M count=1000

Above commands would read from /dev/zero (just a sequence of zeros), and
write to two different file systems.  At the end, dd reports the throughput,
so you can compare speeds.

$ dd if=/dev/random of=/tmp/randombits bs=1k count=20

Above will create a file containing 20K worth of random bits

Char device: allows you only to read/write sequentially.  You cannot seek
back or forward.  Examples include keyboard terminals, network sockets, etc.

Special devices:

/dev/null: aka "unix bit bucket".  A place you can write/redirect any data
to, when you want it discarded.

/dev/random and /dev/urandom: reading random numbers from a generator
- /dev/urandom is "pseudo" random numbers, not perfect entropy, but can
  generate a lot of random data quickly.

- /dev/random is a "true" random number generator (TRNG).  Reading from this
  device is much slower because it "generates" random numbers from a mix of
  external events: kbd and mouse clicks, network packets arriving, noise
  signals, and more.  Generates better randomness, but slower *and* reading
  from /dev/random may block the reading process until the OS can generate
  more random data.

/dev/console: your login console (you can "echo hi > /dev/console").  In
some OSs this is the system console (as if you attach a monitor directly to
the computer) and only superusers can write to the console.

To find out what is your "terminal" ID, type

$ tty

/dev/zero: sequence of zeros

Type ls -l /dev/* to see the permissions, name, and TYPE of device ('c' or
'b').

* open, creat, openat

A lot of new syscalls were introduced with an *at suffix, e.g., openat.

open(pathname, flags)
openat(dirfd, pathname, flags)

Suppose you want to create files a, b, c, and d, in directory /tmp

open("/tmp/a", ...)
open("/tmp/b", ...)
open("/tmp/c", ...)
open("/tmp/d", ...)

Problem: OS has to parse the full pathname 4 times, even if it's the same
prefix /tmp.  Sometimes your code would have to concatenate strings "/tmp"
with the name of the file you want to open/create in /tmp.

openat() allows you to do the following:
dirfd = open("/tmp", ...) // open /tmp with O_DIRECTORY (see flags), which
	lets the OS keep cached state about the dirfd you just opened.
openat(dirfd, "a", ...)
openat(dirfd, "b", ...)
openat(dirfd, "c", ...)
openat(dirfd, "d", ...)

The above is faster and more efficient inside the OS; plus it saves the
program the hassle of string concatenations.

Also: there's a serious SECURITY reason.  Suppose your program does this:

open("/tmp/a", ...)
open("/tmp/b", ...)
<hacker intervenes here>
open("/tmp/c", ...)
open("/tmp/d", ...)

Let's say the user succeeded in opening /tmp/a and /tmp/b.  Right before
opening /tmp/c, some user (hacker), manages to get inside the computer, and
change the nature of "/tmp".  For example, if they gain root privs, they can
do this:

# cd /
# mkdir /.myhiddentmp
	nastier hidden files: mkdir '/. '
# echo my content > /.myhiddentmp/c
# echo my content > /.myhiddentmp/d
# mv tmp tmp.off
# ln -s /.myhiddentmp tmp

open("/tmp/c", ...)
open("/tmp/d", ...)

The above kinds of security vulnerabilities are called
time-of-check-to-time-of-use bugs (TOCTTOU, TOCTOU).  Meaning there's a race
condition b/t when you checked or looked something up, and when you use it.
Such races can happen b/c many programs do this

1. stat(somefile), to ensure it doesn't exist
2. open(somefile) assuming it's newly created

But if someone manages to create a file b/t steps 1 and 2, the open in step
2 will open an entirely different file than expected.

The above sequence of 4 open's is vulnerable to TOCTOU bugs.  But if you
first open a dir and have a handle on it (dirfd), you can be assured that
the OS will NOT delete that directory, even if some other user/hacker,
renames it, or even attempts to "rm -rf" it.  Because the OS has the file/dir
open, only its name disappears from the namespace: the actual inode and its
content still remain on the file system.

In sum: the syscalls ending in 'at' usually take an open directory
descriptor instead of a full pathname.  Using such syscalls is more
efficient and more secure.

* file modes

9 basic bits for user, group, and other: bits are Read, Write, and eXecute

You can set the bits when you create a file, or later using chmod(2).  Find
out what they are using l/stat(2).

Execute makes sense on files: b/c you can execute them.

On directories, the X bit means that the directory is searchable: whether
you can lookup any name (assuming you know what it is).  If a directory has
the "R" bit, you can enumerate the files within (perform an "ls" or
getdents(2)).

S_ISUID 0004000 set-user-ID bit: Normally when you execute a program file,
the program runs under your login privileges.  When you execute a program
file that is setuid, the OS first sets the effective userid to the OWNER of
the file.  This is sometimes needed if the program requires access to
restricted services that only root can have, but you want to allow non-root
users to access the service and run the program.  Note: setuid root programs
can be dangerous (setuid root scripts are even more dangerous).

S_ISGID  0002000 set-group-ID bit (see inode(7)).  Same thing as setuid, but
sets the default group that the program runs under, to be the group of the
file on disk (not the user's default group)

Setgid on directories usually means that files created inside that dir will
inherit the dir's group -- not the running user's default GID.

S_ISVTX  0001000 sticky bit (see inode(7)).  The sticky bit is used on
world-writable directories like /tmp to mean that you can create any new
file in /tmp, but you cannot delete someone else's file.

* flags to open, creat, mkdir, and other syscalls that "create" objects in the file system

O_DIRECT: part of a growing set of OS interfaces called "Direct I/O".
Allows a user to access files on the persistent media directly, w/o going
through the OS page cache.  When you write to an O_DIRECT file, the write
returns only AFTER the data has persisted.  Slower, but you have more
control over when and what gets written.  Direct I/O is useful for databases
writing to their log/journal, b/c writes to the DB log have to be atomic.
Other applications that use O_DIRECT will perform their own caching in their
own memory, to avoid the OS caching (b/c apps have no control over the OS's
cache-flushing algorithms).

An alternative to O_DIRECT is to call fsync(2) on a file descriptor when you
want all cached data in the OS to flush.  O_DIRECT controls flushing on a
per-write() basis; fsync does it on a per-open-fd basis; sync(2) does it on
a per-file-system basis; and you can also mount a whole f/s with a 'sync'
flag to force all access to that f/s, by all users, to be synchronous.

O_DIRECTORY: used to open a directory, so you can pass a dirfd to those *at
calls.  Normally you can't open a directory for data access, and can't
read(2) from it -- use getdents/readdir instead.

O_DSYNC: each write completes only after the file's data (plus any m-d
needed to read it back) has been synced.  O_SYNC is stronger: it also syncs
all other m-d (e.g., timestamps).  Note that O_DIRECT by itself usually only
bypasses caching for file data, not m-d.

O_EXCL: usually used w/ O_CREAT to say "only create this file if it does NOT
exist" -- an 'exclusive' create.

O_LARGEFILE: useful for opening very large files, with 64-bit offsets, on
systems where file offsets would otherwise be limited to 32 bits (32 bits ==
4GB).

O_NOATIME: recall the inode has 3 times (modification/mtime, change/ctime,
access/atime).  Atime is normally updated each time you read a file.  If you
read a file even on a read-only f/s, the OS still has to update the inode's
atime field, causing WRITES to the f/s.  Atime historically has been less
useful, and yet if you read a file a million times, you'd have to update
atime a million times -- lots of unnecessary I/O.  So many OSs offer an
option NOT to update atime.  Some OSs have a hybrid option: update atime
only every N seconds, or only after N changes to atime.

O_TMPFILE: tell the OS that this is going to be a short-lived (unnamed)
temporary file.  The OS can then keep all the file's state (data and m-d) in
memory longer, b/c it'll probably be deleted in a short period of time.

O_TRUNC: truncate the (existing) file to 0 bytes before writing to it.
Otherwise, writing to the file (starting at the default offset 0) just
overwrites the bytes you write -- any remaining bytes after them stay intact
in the file.  Note that a successful open with O_TRUNC truncates the file's
contents permanently: if you later did a write(2) and then wanted to abort
and recover the orig file's contents, you can't (unless you created a backup
of the file).
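
A sketch exercising both O_EXCL and O_TRUNC (file name made up): the second
exclusive create must fail with EEXIST, and reopening with O_TRUNC drops the
old bytes.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* O_CREAT|O_EXCL fails if the file exists; O_TRUNC chops it to 0 bytes.
 * Returns 0 if both behave as described. */
int excl_trunc_demo(void)
{
	const char *f = "/tmp/flags_demo";
	struct stat st;
	int fd;

	unlink(f);
	fd = open(f, O_CREAT | O_EXCL | O_WRONLY, 0644);
	if (fd < 0)
		return -1;
	if (write(fd, "hello", 5) != 5) {
		close(fd);
		return -2;
	}
	close(fd);
	/* exclusive create of an existing file must fail with EEXIST */
	if (open(f, O_CREAT | O_EXCL | O_WRONLY, 0644) >= 0 || errno != EEXIST)
		return -3;
	/* reopening with O_TRUNC discards the old 5 bytes */
	fd = open(f, O_WRONLY | O_TRUNC);
	if (fd < 0)
		return -4;
	close(fd);
	if (stat(f, &st) != 0)
		return -5;
	unlink(f);
	return st.st_size == 0 ? 0 : -6;
}
```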

openat(int dirfd, const char *pathname, int flags): open/create a file in a
previously opened directory.  You can pass special value AT_FDCWD for dirfd,
to mean "operate on the cwd of this process".

* stat, lstat, fstat, fstatat

discussed l/stat before

fstat() allows you to stat an already opened object, even if you no longer
have its name.  Useful to check, e.g., the size of a file you (and perhaps
others) are writing to.  Or to check if permissions have changed.

fstatat() like fstat, but allows you to stat "at" a given directory.  Also
takes special flags like AT_EMPTY_PATH (tells OS to operate on the file
referred to by dirfd).

* lseek

Changes the default read/write "head" on an opened file to any other offset
in that file.

SEEK_SET: use the absolute offset given

SEEK_CUR: seek "offset" bytes relative to the current offset

SEEK_END: seek relative to the end of file

Recall that each time you successfully read/write N bytes from a file, the
default "read/write" offset for that file changes in the OS as well.  The OS
maintains that state for each open file descriptor.
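
A short demo of the three whence values (file name made up); each lseek(2)
returns the resulting absolute offset:

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Write 10 bytes, then move the offset three ways; lseek(2) returns the
 * resulting absolute offset each time.  Returns 259, encoding offsets
 * 2, 5, 9. */
long lseek_demo(void)
{
	int fd = open("/tmp/lseek_demo", O_CREAT | O_RDWR | O_TRUNC, 0644);
	off_t a, b, c;

	if (fd < 0)
		return -1;
	if (write(fd, "0123456789", 10) != 10)	/* offset is now 10 */
		return -2;
	a = lseek(fd, 2, SEEK_SET);	/* absolute: 2 */
	b = lseek(fd, 3, SEEK_CUR);	/* relative: 2 + 3 = 5 */
	c = lseek(fd, -1, SEEK_END);	/* from EOF: 10 - 1 = 9 */
	close(fd);
	unlink("/tmp/lseek_demo");
	return (long)(a * 100 + b * 10 + c);
}
```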

* lseek (sparse files)

How storage space is allocated to a file (on any media): recall that storage
media has a native unit, say a sector/block of 512B or 4KB.  That means a
device cannot give you any less than the sector size.  If you need 1 byte of
space, you'll need to consume a whole 512B sector.

1. When you create a new file: the size of the file is 0 bytes, and no
blocks need to be allocated on the underlying storage media (the f/s
software is what requests allocation of blocks from the underlying storage
media).

2. Now you write your first byte: the f/s will request and allocate a whole
block of 512B. And then the f/s will start to fill in that block each time
you write/append more bytes to the file.  The OS and the f/s will track that
you've allocated 1 whole block, what is the native size of these blocks, and
how many bytes out of that block were actually used.

3. Once you have written your 512th byte, you filled up the first allocated
block.  If you need ONE more byte, the f/s will have to ask the storage
media for yet another (second) block of 512 bytes.

Example: a file has 600 bytes written.  That means you need two x 512B
blocks.  The first block will be full.  The second block will have only 88
bytes filled.  In struct stat:

- st_size: the size in BYTES of the file (e.g., 600)
- st_blksize: native block size = 512
- st_blocks: number of blocks allocated = 2
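
A caveat when trying this yourself: POSIX defines st_blocks in 512-byte
units even when st_blksize is larger (e.g., 4096), so on a 4KB-block f/s a
600-byte file typically shows 8 blocks, not 2.  A sketch (file name made
up):

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Write 600 non-zero bytes and compare logical size vs allocated blocks.
 * Returns 0 if st_size is 600 and at least one 512B block was allocated. */
int size_blocks_demo(void)
{
	const char *f = "/tmp/sb_demo";
	char buf[600];
	struct stat st;
	int i, fd;

	for (i = 0; i < 600; i++)
		buf[i] = 'x';			/* non-zero: nothing sparse */
	fd = open(f, O_CREAT | O_WRONLY | O_TRUNC, 0644);
	if (fd < 0)
		return -1;
	if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
		return -2;
	fsync(fd);				/* force block allocation */
	if (fstat(fd, &st) != 0)
		return -3;
	close(fd);
	unlink(f);
	return (st.st_size == 600 && st.st_blocks >= 1) ? 0 : -4;
}
```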

Note: inode structures inside the OS have some unused/leftover space.  When
files' data is small, some file systems store the small no. of bytes
directly in the inode itself.  This is more efficient b/c you don't need to
alloc actual disk blocks until the file grows beyond a certain size.  This is
called a "short file" -- a file whose small data is stored directly in the
inode.  This is seen in stat(2) as a file with some number of bytes but 0
blocks.  For a similar reason, small symlinks can be stored inside the inode
too -- called "short symlinks".

It's been noted that some applications write out whole large sequences of
zeros to their files.  Typically seen in large files, databases, core dumps
on large memory systems, and more.  That seems like a waste of disk space.
So the idea of a "sparse" file was created.  The idea is that IF you need to
write a block that is all zeros, the OS and f/s can internally NOT write
that block at all.

The OS and f/s keep a data structure that records exactly which blocks of
data are allocated to a file, and in what order.  So if, say, the 2nd block
of data happens to be all zeros -- the OS can just avoid allocating that
block at all, and instead leave a special marker like "NULL" where the ID of
the block would have been (e.g., the Logical Block Number or LBN).

If a user process tries to read some bytes of data, in a location where the
block for that file has NOT been allocated, the assumption is that this
block is a zero-filled block (sparse, or non existent block), and the OS
doesn't even need to perform disk I/O: just return a bunch of zeros back to
the user process (e.g., memset with 0s).  That's how a sparse file behaves:
an illusion that you wrote zeros, but you actually didn't consume any disk
space.

How do you create a sparse file?

1. Historically: lseek() PAST the end of the file, then start writing any
   non-zero data.  Most OSs, would have created a "hole" in the file, b/t
   the original file EOF, and where you started writing.  Note: the hole has
   to be large enough to encompass at least one *aligned* block.

How can you recognize that you have a sparse file?  Check struct stat: if
st_size, rounded up to a multiple of the block size, is larger than
st_blocks * 512 (st_blocks counts 512B units), it means some blocks were
never allocated -- and this is a sparse file.
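
Both the historical lseek-past-EOF trick and the recognition test can be
seen in one sketch (the file name and the 1MB hole size are made up; assumes
/tmp's f/s supports holes):

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* lseek 1MB past EOF, write one byte: the hole consumes no blocks.  Returns
 * 0 if st_size is 1MB+1 yet far fewer bytes were actually allocated. */
int sparse_demo(void)
{
	const char *f = "/tmp/sparse_demo";
	struct stat st;
	int fd = open(f, O_CREAT | O_WRONLY | O_TRUNC, 0644);

	if (fd < 0)
		return -1;
	if (lseek(fd, 1024 * 1024, SEEK_SET) < 0)	/* seek past EOF */
		return -2;
	if (write(fd, "x", 1) != 1)			/* creates the hole */
		return -3;
	fsync(fd);
	if (fstat(fd, &st) != 0)
		return -4;
	close(fd);
	unlink(f);
	/* the recognition test: allocated bytes < logical size */
	return (st.st_size == 1024 * 1024 + 1 &&
		(long long)st.st_blocks * 512 < (long long)st.st_size) ? 0 : -5;
}
```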

Problems: if you /bin/cp'd a sparse file, it's possible you will have
actually filled in the zeros, thus turned a sparse file into a non-sparse
file (wasting space on all those zeros).  Modern cp programs are smarter.

In modern systems, esp. under virtualization, a sparse file is also called a
"thin" file; and a non-sparse file is called a "thick" file.  E.g., if you
create a virtual machine (VM) with 100GB virtual disk: that VM's disk will be
an actual file on your host machine (the one running the hypervisor).  If
you alloc the VM disk (e.g., called VMDK in VMware) as "thick" you'd have
consumed all 100GB ahead of time; if you alloc it as "thin", it starts
empty, and the hypervisor fills in any non-zero blocks as needed, depending
on what the OS running in the VM does; e.g., let's say you install a minimal
ubuntu 18 system, with just 20GB worth of binaries (as per "df"): in a
thinly allocated 100GB VMDK, only 20GB will be used.

2. Modern OSs: allow you to pre-allocate a sparse file and also allow you to
"punch" a hole in a file that has a bunch of zeros, and the file system will
deallocate any all-zero blocks used by the file.  There are syscalls for
that: e.g., fallocate(2) to "preallocate or deallocate space to a file".
fallocate(2) can "punch a hole" in the middle of a file, turning it sparse;
can also pre-allocate a large extent for a file you're about to write.

* truncate, ftruncate

Chop a file at a given offset: the file's size in bytes is set to that
offset, and all whole blocks past it that the f/s allocated to the file get
freed and released to the storage media.

Most common: truncate a file to 0 bytes (also happens with open and O_TRUNC)

Many programs are written inefficiently: they truncate a file and then
rewrite all the data.  In some cases, most of the data is the same and only
a few bytes at the end (or middle) have changed.  More efficient programs
avoid the unnecessary truncate-then-rewrite: use lseek(2) to go to the
offset you care about, and write the bytes there.

If you open an existing file and start to write(2) to it, then when you are
done, you need to call ftruncate(2) IF the number of bytes you just wrote is
LESS than the previous size of the file.  If you don't truncate, you'll have
stale bytes at the end from the previous version of the file.

Some modern versions of truncate allow you to set the offset BEYOND the
current EOF, thus creating a sparse file.  But as this is not a common or
standard behavior, better use ftruncate(2).
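
The stale-tail problem and its ftruncate(2) fix, as a sketch (file name made
up):

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Overwrite a 10-byte file with 3 bytes: the stale 7-byte tail survives
 * until ftruncate(2) chops it.  Returns size-before * 10 + size-after,
 * i.e., 103 if all works. */
int truncate_demo(void)
{
	struct stat st;
	long before, after;
	int fd = open("/tmp/trunc_demo", O_CREAT | O_RDWR | O_TRUNC, 0644);

	if (fd < 0)
		return -1;
	if (write(fd, "0123456789", 10) != 10)
		return -2;
	lseek(fd, 0, SEEK_SET);
	if (write(fd, "abc", 3) != 3)		/* shorter rewrite */
		return -3;
	if (fstat(fd, &st) != 0)
		return -4;
	before = (long)st.st_size;		/* still 10: stale tail */
	ftruncate(fd, 3);			/* chop the leftovers */
	if (fstat(fd, &st) != 0)
		return -5;
	after = (long)st.st_size;		/* now 3 */
	close(fd);
	unlink("/tmp/trunc_demo");
	return (int)(before * 10 + after);
}
```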

* chown, fchown, lchown

Prototypes:
       int chown(const char *pathname, uid_t owner, gid_t group);
       int fchown(int fd, uid_t owner, gid_t group);
       int lchown(const char *pathname, uid_t owner, gid_t group);

Allows you to change the owner and/or group of a file.  lchown operates on
the symlink instead of what it points to.

Only uid 0 (root) can change a file's ownership.

Related commands: /bin/chown and /bin/chgrp; chgrp uses chown(2) to change
only the group of a file.  A user can only change their file's group to
another group they are a member of, not to arbitrary groups.

chown takes pathname, owner, and group: if owner or group are -1, that ID is
not changed.

Some older OSs had a separate chgrp(2) syscall.

* utime, utimes, futimens, utimensat

Example on Mac OS X of "stat logo.txt":
  File: logo.txt
  Size: 1850      	Blocks: 8          IO Block: 4096   regular file
Device: 100001eh/16777246d	Inode: 35548818    Links: 1
Access: (0644/-rw-r--r--)  Uid: (  701/     ezk)   Gid: (   20/   staff)
Access: 2021-04-07 17:58:51.431463108 -0400
Modify: 2021-04-07 17:58:50.373956409 -0400
Change: 2021-04-07 17:58:50.373956409 -0400
 Birth: 2021-04-07 17:58:50.373864739 -0400

utime: set atime/mtime in 1s res.
utimes: same but a higher-res clock (usec or better)

futimens, utimensat: nanosec clock resolution
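
A sketch using futimens(3), the fd-based variant (the file name and the
timestamp value are made up): set atime/mtime explicitly, then read mtime
back with fstat(2).

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/* Set atime/mtime with nanosecond-resolution structs, then read mtime
 * back.  Returns the mtime seconds we set (1000000000), or <0 on error. */
long set_mtime_demo(void)
{
	const char *f = "/tmp/utime_demo";
	struct timespec times[2] = {
		{ .tv_sec = 1000000000, .tv_nsec = 0 },	/* [0] = atime */
		{ .tv_sec = 1000000000, .tv_nsec = 0 },	/* [1] = mtime */
	};
	struct stat st;
	int fd = open(f, O_CREAT | O_WRONLY | O_TRUNC, 0644);

	if (fd < 0)
		return -1;
	if (futimens(fd, times) != 0)		/* fd-based utimensat */
		return -2;
	if (fstat(fd, &st) != 0)
		return -3;
	close(fd);
	unlink(f);
	return (long)st.st_mtime;
}
```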

* rename, renameat, renameat2

Prototypes:
       int rename(const char *oldpath, const char *newpath);
       int renameat(int olddirfd, const char *oldpath,
                    int newdirfd, const char *newpath);
       int renameat2(int olddirfd, const char *oldpath,
                     int newdirfd, const char *newpath, unsigned int flags);

rename src file to dst file; the 'at' versions are, as usual, more
efficient.  Usually if dst exists, it gets deleted/overwritten.  (More
obscure: try to rename(2) a directory onto a destination directory that's
empty -- POSIX says the empty dst dir should be deleted and replaced by the
source dir.)

renameat2 takes special flags: a useful one is the RENAME_EXCHANGE flag,
which lets you swap two file/dir names atomically inside the OS.  Otherwise,
how can you swap 2 file names A and B?

$ mv B tmp	# can do w/ plain rename(2)
$ mv A B
$ mv tmp A

So you need 3 syscalls, and you risk TOCTOU bugs, and partial failures.
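
With RENAME_EXCHANGE the swap is one atomic call.  A Linux-specific sketch
(needs _GNU_SOURCE and a reasonably recent glibc; the file names and
contents are made up):

```c
#define _GNU_SOURCE		/* for renameat2() and RENAME_EXCHANGE */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Atomically swap two files' names in ONE syscall.  Returns 0 if the
 * content formerly under ren_b now appears under ren_a. */
int exchange_demo(void)
{
	FILE *fa = fopen("/tmp/ren_a", "w"), *fb = fopen("/tmp/ren_b", "w");
	char buf[8] = "";
	FILE *f;

	if (!fa || !fb)
		return -1;
	fputs("AAA", fa); fputs("BBB", fb);
	fclose(fa); fclose(fb);
	if (renameat2(AT_FDCWD, "/tmp/ren_a",
		      AT_FDCWD, "/tmp/ren_b", RENAME_EXCHANGE) != 0)
		return -2;
	f = fopen("/tmp/ren_a", "r");
	if (!f)
		return -3;
	if (fgets(buf, sizeof(buf), f) == NULL)
		buf[0] = '\0';
	fclose(f);
	unlink("/tmp/ren_a"); unlink("/tmp/ren_b");
	return strcmp(buf, "BBB") == 0 ? 0 : -4;
}
```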

* setxattr, lsetxattr, fsetxattr, getxattr, lgetxattr, fgetxattr, listxattr,
* llistxattr, flistxattr, removexattr, lremovexattr, fremovexattr

Extended Attributes (EAs or xattrs).

Historically a file contains any data you want: the OS does not impose a
structure on a file's data; it just looks like a sequence of bytes.  Apps
can impose whatever structure they want.

But there's limited m-d you can store about a file, whatever's in struct
stat.  Sometimes you want to store extra info per file, esp. some important
m-d.  Examples: MP3 "ID3" tags (song title, performer, album, year produced,
etc.); same with any other multimedia data.  Other examples could include:
compression used for a file, encryption algorithms used or modes, hashes and
checksums, and more.  Choices:

1. Put the m-d as part of the file's data, which means you need to teach all
applications about this m-d.  Examples are MP3 files: a combination of a
sequence of audio bytes as well as a special "ID3" structure, that allows
you to set several <key,value> pairs (e.g., "genre" is "classical", track is
1, title is "my song").  Problem: the structure is specific to the file
format, and all apps that want to access MP3 files have to know where and
how to read that structure.  Plus you need custom tools to view/set the
attributes of a file (e.g., id3tag, id3info, id3v2).

2. Add the extra m-d as part of struct stat.  Problem: results in
non-standard stat structures, and everyone wants their own favorite
additions.

3. Extended attrs, allow you to add arbitrary <key,value> pairs to a file's
inode.  The KV pairs will be stored together with the file, by the OS.
Copying the file, archiving, moving, renaming, etc. should all preserve
these KV pairs.

With the xattr set of calls you can essentially control a mini DB of
<key,value> pairs on a file: you can set a value V for a key K (setxattr);
you can list (like /bin/ls) xattrs using listxattr; you can remove a KV pair
using removexattr; and you can retrieve the value V of an xattr with key K
(with getxattr).

There are useful user-level utilities as well (e.g., getfattr(1) and
setfattr(1) on Linux).

In UNIX, when you open a file, you read the file's "main" data.  In Windows
NTFS, a file can have several "channels" or "streams" of data (other than
regular stat(2) m-d).  So when you open an NTFS file, you have to say that
you want to read the "main channel" to get the file's data; in NTFS, xattrs
are a separate channel that you have to ask to read.  Example uses of
multiple channels: storing previous versions of files (documents,
spreadsheets) so you can revert/recover to previous file states.

* Access Control Lists (ACLs)

EAs were created in Linux specifically to solve the ACL problem.  In Unix,
you can only have one owner, one group, and "other" permissions.  Sometimes
you want more advanced access permissions on a file:

1. Multiple owners who can operate as if they are the primary owners of the
file.

2. Multiple groups.  Or access to a file if a user is a member of group X
and group Y; can also be logical "OR", XOR, "and not", etc.

3. Complex lists of permissions: e.g., permission is allowed if

- owner A or B
- group C and D
- group D but not E
- group F XOR G
- etc.

ACLs in Linux are implemented as a layer on top of xattrs.  Linux reserves
EA namespaces such as "system.*" (where ACLs are actually stored) and
"security.*" -- only root or the OS can set those.  Users can set EAs
starting with "user.*".

Tools exist to set, remove, list, reorder, shuffle, ACLs.  See man acl(5)
and a lot of acl_*(3) library calls and tools.

* chroot

Recall that every process has a cwd so that the OS will know where to begin
evaluating relative pathnames like "foo/bar.c" or "../src/debug.txt".
Absolute pathnames start with a "/" and "/" is the "root" directory of the
entire system.  But that root dir can be changed on a per-process basis:
use the chroot(2) system call to do that.

If you change the root of a running process, to some other path, for example
"/my/dir/", then thereafter, every time the process tries to resolve an
absolute pathname, it'll use "/my/dir/" as the "/" dir.  A chrooted process
cannot "escape" its chrooted directory.  This is sometimes called a "chroot
jail".  Namely, if, after chroot-ing to /my/dir/, you try to "cd .." --
you'll wind up still in /my/dir; and if you try to cd to an abs pathname
"outside" the jail (e.g., "cd /etc" or "cd /boot"), you'll get an error
ENOENT (the file/dir doesn't exist INSIDE that chroot jail, even if it
exists outside of the jail).

Even w/o chroot: what happens if you do this

$ cd /
$ cd ..

Where do you wind up?  What's the parent dir of the "/" dir?  Yes: you wind
up in the same place -- the parent of "/" is "/" itself.

Q: can you 'escape' chroot if you had an open fd on a file outside?
A: it depends.  Some OSs prohibit this: chroot may fail, or they might
  "close" any fd you have that's outside your jail.  Same problem if you
  have access to other resources, maybe an mmap'ed region on some outside
  file.

Chroot is useful when you want to confine or restrict the possible access a
process could have, in case of vulnerabilities or successful attacks on the
process.  Recall that many bugs like stack overflow, if exploitable, can
allow an attacker to run ANY arbitrary code with the SAME privileges of the
process that's executing.

Assume the Apache httpd Web server is running, and that the DocumentRoot
where all your http files live is /var/www.  Suppose there's a bug in apache
and someone manages to trigger it (quite possible, esp. b/c httpd is a
service "open" widely to outsiders on the Web).  If attacked, a user can
easily execute code to try and copy vital files like /etc/passwd (outside of
/var/www), or any other valuable information they find; they can try to
crack the passwords offline; steal private keys from users' home dirs; grab
any other info (credit cards, etc.); or just fork bash and "browse".

Note: if apache was running as UID 0 (root), the hacker can then easily
access anything, including installing their own kernel(!)  That's why many
services try to run with their own custom user: apache, mail servers,
database, all try to have their own user (not root).

With chroot, and assuming the broken-into process runs as non-root, any
attempts to access files outside /var/www will fail (ENOENT).  Inside
/var/www you have anyway mostly public files, and maybe readonly copies to
some shared libraries that apache httpd may need.  Of course, don't put
files in /var/www that include sensitive info (social security, user IDs,
emails, credit card numbers, or passwords -- esp. not in cleartext).

Chroot has been shown to have some flaws.  Thus some OSs started to design
more complex "jail" mechanisms.  There are advanced container and resource
limits in Linux alone, including more complex "namespace" manipulations
possible (even kernel-enforced namespace limitation on what a root user can
see).

* clone, fork, vfork, execve, execveat

Used to start and create new processes.

fork: create a "copy" of a running process, inheriting SOME properties and
info from the parent to the child.  When you call

int ret = fork();

Right after, assuming "ret" is not negative (error), you now have TWO
processes running:

1. the child process, will get ret=0
2. the parent process, will get ret>0, whose number is the PID of the child
   that was created.

Q: how does the child find out its parent's PID?
A: getppid(2)

Recall: getpid(2) is for any process to find out their OWN PID number.

All kinds of resources are shared w/ a child, and some are not shared.  See
fork(2) man page.  Problem: over time, people wanted a variant of fork()
that lets them share (or not) other kinds of resources.  Eventually, fork()
was generalized to a more generic "superset" clone(2) syscall.

clone(2) takes more parameters, and is more complex to use, but more
flexible.  Today, inside the linux kernel, fork(2) is implemented not as a
separate code path, but just by calling the clone(2) machinery with the
right flags.

clone(2) prototype:

	int clone(int (*fn)(void *), void *child_stack,
		int flags, void *arg, ...
		/* pid_t *ptid, void *newtls, pid_t *ctid */ );

fn: the function the child starts executing; the child terminates when fn
    returns, and fn's return value becomes the child's exit status
child_stack: the stack the child will use (allocated by the caller)
arg: passed to 'fn' upon being called
flags: see CLONE_* flags below

Clone flags include:

CLONE_FILES: do you want to share the same open FD table or not?
CLONE_FS: share the "file system" info like umask, chroot, chdir, etc.?
CLONE_SIGHAND: inherit same sighandlers or not?
CLONE_VM: share mmap'd info or not?
and other flags you can OR together.  See man page for clone(2).

The function ptr 'fn' is the child's entry point: unlike fork(), which
"returns twice," a clone()'d child begins life by calling 'fn'.  'arg' is
passed to 'fn' as a generic void* that can hold anything.  'fn' acts like a
callback function: most callback interfaces that are designed to be flexible
take a fxn pointer plus a void* that you can put anything into.  (The parent
still reaps the child with one of the wait*(2) syscalls -- wait, wait4,
waitpid, etc.)

execve() takes three args:
1. filename to execute
2. an argv array (which would show up as 'argv', with its count in 'argc',
   in main())
3. an array of strings containing the environment variables (e.g.,
   "PATH=/foo:/bar")

Prototype: int execve(const char *filename, char *const argv[],
		      char *const envp[]);


If execve succeeds, it will REPLACE the current process image with a new
one, whose executable is 'filename', and it gets the argv and envp from the
args of execve:

int main(int argc, char *argv[], char *envp[])

A typical use when one process, like a shell, wants to start another process
is to do: clone (or fork), and then the child process calls execve(2).
execve does NOT change the PID number, so the parent still knows the child's
PID.

* kill

Prototype: int kill(pid_t pid, int sig);

Send a signal "sig" (just a number) to process pid.  Signals are just short
messages (numbers) that one process (or the kernel itself) can send to
another process.  Once a signal is received, optional actions can take place
(e.g., invoke a signal handler).  Signals are a form of Inter-Process
Communication (IPC).

Can signal yourself, can signal any other process.  If you don't own or have
the right to signal the other process, you'll get an error.

If P is 0, you send the sig to all processes in the same "process group".
Every process has a unique PID, but can also belong to a process group
(PGID).  This is useful to treat a set of processes as one unit.  For
example, the postfix mail server forks a bunch of processes (several to
listen for new mail requests, some for forwarding, some for email filtering,
and more).

If sig is 0, no signal sent (there is no signal number 0).  Instead, you get
back a status code that tells you if the process is alive or not.  This
tests a "liveness" property w/o taking actual action.

* signal(7) -- list of signals

There's a number of signals that exist in an OS.  A signal is just a
"message" sent from process X to process Y.  Every signal can have an action
or no action.  If it has a defined action, then a signal "handler" function
will be invoked when the process receives the signal.

When process X sends a signal S to process Y, there's no guarantee how long
it'd take for the signal to arrive.  Signals are not real-time messaging.

A process can define which signals are masked (turned off), and for those
not-masked, what the default action should be.  Default actions can include
"dump a core" for debugging, terminate the process, invoke your own custom
sighandler function, etc.

Whenever you get a signal and you invoke a not-terminating sighandler
function, you interrupt the currently executing code, perform a "non-local"
jump to the sighandler code.  When the handler returns, you come back to
where you were before (just like function calling another, setting up a
stack frame, and returning to the instruction right after the jump to the
called function).

Q: what happens if you get a signal delivered while you're inside a signal
handler?!

A: It depends.  Different OSs implement different capabilities:

1. no other signals allowed inside a sighandler fxn

2. OS will queue up the new handler function and invoke it right after
   returning from the current one.

3. Allow nested signal handlers to be invoked (up to a certain nesting
   depth).  This option tends to be rare to support, complex to implement,
   and often has little practical use.


SIGHUP: received when the controlling tty is terminated.  For example, if
you ssh to a remote host, start bash, and your ssh connection terminates
(TCP socket close), the remote bash will get SIGHUP.

SIGKILL: terminate the process with extreme prejudice.  This signal CANNOT
be masked off by any process.  Similarly, SIGSTOP cannot be masked off.

e.g., if you (say, a sysadmin) are not sure if a "runaway" process is bad
or good, best send it SIGSTOP, then ask the owner (user) of the process.  If
the process is needed, you can send SIGCONT to continue it, and use
re/nice(2) to change that process's priority.

SIGALRM: delivered when a timer set with alarm(2) expires (sleep(3) is
traditionally implemented on top of it).  Sometimes useful for a process to
go to sleep for some number of seconds (e.g., if malloc failed and you want
to wait and see if, a little while later, enough mem is freed to retry that
malloc).

Q: Why would you want to get a signal that invokes a sighandler?
A: Useful to schedule an action to happen in the future, e.g., cleaning up,
   sending periodic reports, logging info, etc.

SIGUSR1/SIGUSR2: custom, user defined signals.  E.g., you can use USR1 to
enable debugging; and USR2 to disable; on a running process.  Useful if you
have a program running at a customer, and you want them to enable/disable
debug for a period of time (then send you the debug logs).

SIGHUP (again): daemons have no controlling tty, so by convention many reuse
SIGHUP to tell a program to re-read its configuration files w/o restarting
the program.  Suppose you have a mysql DB, and you can't afford to shut it
down, reconfig it, and restart it (even a few seconds of downtime could be
bad for a busy DB server, e.g., e-commerce).  Often sysadmins will edit the
config file (e.g., /etc/my.cnf for MySQL), and then send a SIGHUP to the
mysqld process: mysqld has a sighandler that'll re-read /etc/my.cnf and
reconfig the various parameters (e.g., how many concurrent mysql
connections are allowed).

SIGSEGV: segmentation violation (when you violate a page protection or try
to access a virt addr that's not mapped to the current process).

SIGBUS: on some architectures, the memory bus is designed such that a memory
address (e.g., of a pointer) has to be aligned to the "word" boundary of the
processor.  Through pointer manipulations in C, you can create pointers that
point to any addr.  But on some CPUs, if the addr is not aligned on a 4B
or 8B boundary, you will get a SIGBUS error and a coredump.  Intel CPUs
don't care about addr alignment, which is why you don't get SIGBUS on Intel
(but unaligned accesses are less efficient).  On a SPARC
architecture, you can get a SIGBUS.

Look at signal(2), sigaction, sigprocmask, and others to see how to set
signal handlers, mask signals, collect info about signals delivered to you,
your child processes, or others in the same process group.

* ioctl, fcntl

Prototype: int ioctl(int fd, unsigned long request, ...);

Unix designers realized early on that they can't predict what services an OS
might need to offer and what applications would need.  They created a bunch
of
syscalls, but then added a "catch-all" mechanism: a way for the OS to extend
new functionality to user applications, w/o having to create a whole new
syscall.

Syscalls have specific numbers: when a user program invokes a system call,
it's really invoking a kernel function whose number is N.  N is hard-coded
in libc, in applications, and in the kernel: they must all agree and match.
You can't renumber system calls easily; if you add a new syscall, you have
to assign a number to it "permanently".  Reason: once apps use that syscall,
they're bound to that number.  Changing syscall numbers would break
Application Binary Interfaces (ABI compatibility).  OS designers are very
wary about adding new syscalls.

I/O Controls or "ioctls" are a way to experiment with new OS functionality,
before you decide if you wanted to make it into a full fledged new syscall
(with a name and unique number).  If you wanted to design a generic
interface, via a syscall, that can invoke ANY functionality, you'd need to
be able to pass as many parameters as possible: from 0 to some large number
P.  And
be able to pass their values, however long or short the values are.  And you
need to do this WITHOUT changing the ioctl() interface itself!

If you want to pass P parameters to a function, w/o changing its prototype,
you have to pack all P params into yet another structure, and then pass a
ptr to the structure, as a void*.  The caller has to know what is the
structure that is being passed and how to create it; the callee (the
function that processes the request) has to know how to unpack the
structure.

Caller 1:
- struct a mystruct;
- fill in mystruct
- foo((void *)&mystruct)

Caller 2:
- struct b tmp;
- fill in tmp
- foo((void *)&tmp)

Callee (implementation of foo):

foo(void *ptr) {
// how would it know what caller passed? or how many bytes were passed?
}

Often, you see a pointer to a buffer that can take any length, together with
another integer denoting the length:

foo(void *ptr, u_int len):
- now implementation can tell how long is the buf in ptr

Problem: ptr is type-less and len doesn't tell foo() how to interpret the ptr
internally (what orig struct was it?).

Solution: you have to pass one more thing -- a "type" descriptor.  The type
can be anything (e.g., a number) as long as both caller and callee agree on
it.  Now the code would look like this:

Caller 1:
- struct a mystruct;
- fill in mystruct
- foo(TYPE_A, (void *)&mystruct, sizeof(mystruct))

Caller 2:
- struct b tmp;
- fill in tmp
- foo(TYPE_B, (void *)&tmp, sizeof(tmp))

Callee (implementation of foo):

foo(u_int type, void *ptr, u_int len) {

  if (type == TYPE_A) {
    struct a *a_ptr = (struct a *) ptr; // just cast: no malloc/copy needed
    // now you can process "a_ptr" (len should equal sizeof(struct a))
  }
  // repeat code for every possible type
}

ioctl(int fd, u_long request, ...)
- typically 3rd arg is a void* or a ptr cast to u_long
- fd: the file descriptor of an open object to operate on
- request: is the type of request to perform
- 3rd arg -- same as "void*" in above foo() example

Ioctl say: perform operation 'request' with data in 3rd arg, on file 'fd'.

Q: Where's the length param?
A: designers decided to exclude the need for len: both caller and callee
   would have to agree on length, or else ABI would break.

Examples of ioctls:
- many are custom or specific to a specific kernel module
- create a snapshot in a file system like btrfs
- set terminal to ECHO or not ECHO chars
- change open flags on an open fd w/o having to close/reopen the file
- special controls of devices like disks, GPUs, displays, and more.
- turning on/off the inode 'immutable' flags

ioctls were useful initially, and people started adding more and more ioctls.
More and more were being added, many of which only work on specific OSs.
But some were deemed useful enough that other vendors implemented the same
ioctl for their own OS (e.g., ECHO/NOECHO).

several decades later... we now have thousands of ioctls across dozens of
OSs.  Many ioctls are old and useful (at least for legacy code), and so OSs
are "forced" to support them.  IOW, there are more ioctls (aka pseudo
syscalls) than there are actual system calls!

The kernel has to handle ioctls as follows:
- look at the fd: is it for a file or a socket or a directory?
- if networking, pass ioctl args to a networking-ioctl handler
- if a file system, pass ioctl args to the f/s that the fd belongs to (e.g.,
  btrfs, ext4, msdos, etc.)
- if fd belongs to terminal services, pass to tty-handling code
- some ioctl codes are not associated with a network or f/s, and so there's a
  need for a large switch statement to handle each ioctl, dispatching
  specific kernel functions for each ioctl code.
- even inside each file system, there is a large switch statement to handle
  all of ITS ioctls.

In order to know what an ioctl does, you need to know what subsystem or
module it belongs to.  ioctl(2) man page is generic.  There are some man
pages for ioctls of specific subsystems, e.g., ioctl_ns(2), ioctl_tty(2),
etc.  Many ioctls are not very well documented, if at all.  Many ioctls are
obscure and only needed in rare cases; some ioctls are very new and
perhaps experimental.  If an ioctl isn't documented somewhere, your only
choice is to study the kernel source code.

Some useful ioctls have become system calls and some have gotten nice libc
wrappers.  Programmers are encouraged to use the syscalls or libc (or any
other library) wrappers.   Problem: legacy code still remains and if you
turn off an ioctl and force people to use a different syscall, you'll break
ABI and force people to change their code and recompile.

Note: fcntl() is a subset of ioctl's for file manipulations.

* readv, writev, pwritev/2, preadv/2

Efficient code: (1) use "large" buffers (but not too large), b/c bulk
reading and writing is more I/O efficient; (2) avoid calling syscalls too
often.

Sometimes, the data you have comes not in one nice large buffer, but broken
into various chunks that may not even be the same size.  Examples: reading
from network sockets or other streaming inputs.  If you had data you needed
to, say, write to a file, but the data was in a N different buffers, each
buffer perhaps a different size: how would you write that?

1. issue N write(2) syscall, one for each buffer
- problem: issuing N syscalls w/ their overhead
- also: possibly writing small bits of data (less efficient than bulk)
- may even have to call lseek(2) before each write(2) to set the write position

2. malloc a buffer large enough to hold all data, then memcpy individual
   buffers to the new buffer, then free old smaller buffers, and then call
   write(2) only once.
- pro: only calling one syscall and it's "bulk" data
- cons: extra memory needed, and memcpy consumes cpu/mem overhead

Solution: create syscalls that can read/write not one buffer, but a "vector"
(or array) of buffers.  Recall that a "buffer" is an array of bytes.  So
these syscalls will be reading/writing an array of arrays of bytes.  This
allows you to have a single syscall w/o having to copy all the data into one
big buffer.

Prototype: writev(fd, struct iovec *iov, int iovcnt)
- fd: the file to write to
- iovcnt: number of iov structs passed
- iov: an array of struct iovec's

struct iovec {
  void *iov_base; // ptr to start addr of buf
  size_t iov_len; // len of bytes in iov_base to write
};

writev() will do something like this (inside the kernel):

for (i = 0; i < iovcnt; i++)
  write(fd, iov[i].iov_base, iov[i].iov_len);

returns the total no. of bytes written.

readv() does the same for reading; variants of these can take extra flags
and even an offset where reading/writing should begin can be passed.

* pread, pwrite, pread64, pwrite64

Prototypes:
       ssize_t pread(int fd, void *buf, size_t count, off_t offset);
       ssize_t pwrite(int fd, const void *buf, size_t count, off_t offset);

Same as read(2) and write(2) syscalls, but also pass the offset where you
should read or write.

Normally, read/write() will operate at the last "read/write" offset where
you left off (0 when opening a new file).  If you want to read and write
data non-sequentially, you have to call lseek(2) and then read/write.  For
sequential reading/writing, regular read/write() are ok.  But IF you have to
read/write a lot of data at random offsets in a file, then it's better to
use pread/pwrite (saves you an extra lseek call each time).

Example apps where you read/write randomly: any database, sql, or K-V store,
leveldb, etc.  Also multi-media apps: skip ahead or back in an audio/video
file.

* sendfile

When Web servers started, there wasn't much load on them.  As the Web grew,
Web services became critical to run as fast as possible.  Web servers are
measured by how many queries-per-second they can handle.

A Web server is a user application:
1. listen on a socket (waiting for Web browsers to issue a "HTTP GET" request)
2. when it gets the request, it has to read some file (e.g., index.html)
3. then pass the data of that file over to the socket, back to the client
   (Web browser).
4. goto 1

How would this be implemented:
1. fd = open(file)
2. read(fd, buf, len) // maybe read whole file b/c html files are usually
   small
3. close(fd)
4. write(socketfd, buf, len) // write file's data to the browser socket

Problem: httpd has to read data from kernel, to user space, and then write
it right back out.  Wastes syscalls and processing (copying data, buffers,
etc.).

You can use mmap for some of these, but mmap is a more complex API, and
you'd be wasting effort setting up a mapping and destroying it, just for a
file that you have to fully copy over a socket.

sendfile() was invented to solve the above problem.  It allows the kernel to
read data from one open file descriptor and write it out to another fd
directly.  Original flavors of sendfile() read whole file and wrote it to a
socket fd -- API looked like old_sendfile(char *htmlfile, int sock_fd).
Modern versions are more flexible:

Prototype:

ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count)
- out_fd: the fd to copy the data to
- in_fd: the fd to read the data from
- offset+count: read "count" bytes from in_fd at "offset" and write out to
  out_fd.
- returns no. of bytes successfully written, ala write(2)

Can be useful even for /bin/cp.

* splice

related to sendfile, was created a bit after sendfile

Prototype:

ssize_t splice(int fd_in, loff_t *off_in, int fd_out,
		loff_t *off_out, size_t len, unsigned int flags);

If you want to move data pages from one file to another, not just copy, you
can use splice.  It can even allow you to "insert" data in a middle of a
file.  Typically operates on whole pages.  Splice internally (inside OS) may
play with struct page's inside the OS page cache.

Use: you have some data in a file A.  You want that data to be copied to
file B; and you know that you'll discard/delete file A (maybe file A was a
temp file for intermediate processing).

Traditionally, you'd have to copy the data from file A to file B (more
efficiently if using sendfile).  Inside the page cache, however, each
cached page is associated with one file (its inode).

Splice allows me to move the "mapping" of a page in the page cache from one
file to another, at any offset of the dest. file, allowing one to "insert" a 4KB
page in the middle of a file.  No copying is required, only manipulation of
pointers inside the kernel (e.g., all the ptrs of all pages that belong to a
file in the page cache).

* Locking

in general: two kinds -- (1) advisory and (2) mandatory.

MANDATORY means: the one who got the lock has exclusive access on the
resource.  No one else can access the resource at the same time (e.g., a
file).  The OS enforces the locking on the resource or file: no one else can
access it at the same time, or they get an error.

ADVISORY means: everyone who wants to access the file has to coordinate via
advisory locks.  For example, they have to check if a lock exists (e.g.,
F_GETLK) and if so, not access the file until the lock is released.  If
another process tries to access a file that's locked, the OS won't prevent
that -- and it could result in data corruption.

History:

1. Older mainframe OSs used mandatory locks (before UNIX).

2. Unix (c. 1970s) decided to make the OS simpler by supporting only
advisory locking.  Reasons: makes the OS simpler, and a recognition that
mandatory locking was not really needed by most applications.

3. Windows decided to implement mandatory locking by default.  Meaning you
don't need to say explicitly that you want to lock a file: merely opening a
file, automatically locks it to you -- the process that opened the file.
This helps avoid data corruption by other processes inadvertently modifying
files.  But conversely, it makes software updates more challenging: the only
way to update a file that's locked, is to terminate a process that locks it,
or reboot.  That's one major reason why updating Windows systems requires
you to reboot, sometimes multiple times.

4. Later unix systems realized that they do need to add some support for
mandatory locking, because some applications really needed it -- for example
databases.  Linux added mandatory locking years ago.  But, you have to call
the fcntl or ioctls to lock a file or byte range of a file: that is, just
opening a file does NOT automatically lock it.

Issues with locking:

1. Suppose someone locks a resource for "too long" or they lock a whole file
and they just need access to part of the file.  How do you prevent "hogging"
of such a resource (a form of denial-of-service attack)?

2. Worse, what if someone is locking a whole file on purpose, to prevent
others from accessing the file (e.g., taking a read+write lock on the entire
file)?  That would be a denial-of-service kind of "attack."  It could also
be just sloppy programming or a bug.

Term: an entity (e.g., user, process) that holds a lock on a resource R is
called the "lock-owner".

3. Modern systems allow for three mechanisms to handle this:

(a) priorities: a higher priority process can "break" the lock of a lower
priority process.  Just need to be careful about setting process priorities.

(b) timeouts: don't give out permanent locks that are held indefinitely, but
a "lease" that lasts for some period of time.  Permit lock-owner (the one
holding the lock) to ask to extend the lease.  Otherwise, when lease ends,
the lock is automatically released by the OS (and someone else that may be
waiting for the lock, can become the new lock-owner).

(c) revocation: some systems even require the lock owner to register a
callback function or a signal handler -- and the lock owner's code will be
called back to inform it that a lock they are holding is needed back (the
lock is being revoked).

Over the history of unix, several "competing" interfaces for locking
resources were created: ioctl(2), flock(2), fcntl(2), and lockf(3).

Prototype:
       int flock(int fd, int operation); // advisory lock
		op can be LOCK_SH(ared), LOCK_EX(clusive), LOCK_UN(lock)

flock: locks whole file
lockf: can lock a "byte range" inside a file

* fcntl

Prototype:
       int fcntl(int fd, int cmd, ... /* arg */ );

Like ioctl but specifically for files.  Allows you to un/set and get locking
info about files.  You pass a struct flock, which can define the operation:
lock for reading, for writing, or unlock; and set the portion of the file to
lock (a byte range, or whole file).

You can also change some of the open flags of a file that's already open:
fcntl w/ F_SETFL lets you turn flags like O_APPEND or O_NONBLOCK on and off.
(Note: on Linux, F_SETFL ignores the access-mode bits themselves -- you
cannot "downgrade" an O_RDWR fd to O_RDONLY this way.)  Changing flags via
fcntl is better than closing the file and re-opening it with new flags, b/c
closing a file could have other side effects like flushing data or someone
else gaining access to the file.

Like ioctl, some of the fcntl abilities are uniformly supported by many OSs,
and some are specific to Linux.

Change notification: check also Linux's inotify* and fanotify* syscalls;
start with the overviews in inotify(7) and fanotify(7).  Example scenario:

1. Users are modifying files on a system.

2. You set up a daily backup of your system, after doing a full backup one
time.  The daily backup just needs to back up files that have changed since
yesterday.

(a) approach one: scan entire f/s looking for changed files.

$ find / -type f -mtime -1

Above command will list all files that have had their mtime changed in the
last day.  Once you get a file list, you can back those files up using your
backup software.  Problem: only a handful of files typically change, but
there could be millions of files on your system.  You are scanning the
entire f/s looking for just a handful of changed files: this produces a LOT
of I/O activity.

(b) better: have the OS track somehow files that have changed, with support
of some user application.

- first a user process "registers" an interest in a file, a directory, or
  a whole directory tree recursively.

- the registration asks the OS to inform (notify) the user process when a
  file has changed.  Changes include creating, deleting, or modifying files;
  renaming, etc.  Even just reading a file can be registered for
  notification.

- when the OS has to update a file's inode, it checks a list of processes
  that registered for notification, and the type of notification.

- any change to a file that matches a notification request, is sent to the
  user process (usually via some callback function, or a signal telling the
  process "you have a change, please retrieve it from the OS", this could be
  also via a special fd you have to listen/poll/select on).

- the user process then has to retrieve the notification, and add it to a
  list of files that need to be handled -- for example, backups that night.

This mechanism is much more efficient than scanning the whole f/s.  Used not
just for backup s/w but also for "desktop search" like technologies (like
Apple's Spotlight search indexing).  Such desktop s/w indexes all files
based on their meta-data, their content, file name, etc. so you can do
searches to efficiently find, say, "all .c files that contain a string XXX"
or all pictures that have a specific tag to them, etc.

* dup/dup2/dup3/pipe/pipe2, etc.

dup() is used to create an alias (like a hard link) to an already open
file descriptor.

So you get 2 descriptors that can be used to act on the same file: if you
write through one descriptor, you will see the same file content when you
read it through the second descriptor.  Also, there is only one read/write
offset for the file, shared by both.  So if you read the file using
descriptor 1, the file offset advances and the 2nd descriptor will see the
same (new) offset; same if you move the offset using lseek.  In other
words, the kernel just keeps two pointers (one for each fd) to the same
data structure the OS keeps on behalf of the open file.

If you want each fd to have its own file offset and even different open
modes, then simply open(2) the file twice.  The OS will keep two separate
data structures to record the status of the open file.  However, there is
still only one file inode and content on disk.

dup(), dup2/3() are useful when you want to have separate threads or forked
processes, each with their own fd to access the same file.  When any of those
threads/processes is done, they can close their own (dup'ed) fd.

Another use is if your program traditionally reads from stdin (fd 0), or
writes to stdout (fd 1).  If you redirected your stdin/stdout to another
location (say a network socket, or a file), and you want to ensure that no
one else will access the data via regular stdin/stdout, you can "dup"
stdin/stdout to another descriptor, close() stdin/stdout, but continue to
read from stdin/stdout via the dup'ed fd.  Useful also if you're calling
libraries that may try to access your stdin/stdout inadvertently.

pipe() creates a one-way channel and gives you two fds.  You pass an array
of 2 ints: pipefd[0] is the read end of the pipe, and pipefd[1] is the
write end.  The OS will automatically "copy" data written to pipefd[1]
to pipefd[0].  Useful if you want to have a communication channel, like a
socket.  You can even have two separate processes or threads: one is the
writer and the other is the reader.  The reader just sits and listens on
its end of the pipe.  The writer just write(2)s to its writing end of the
pipe.
Once the writer writes anything to its end, the OS copies the data (or makes
it available) to the reader.  If the reader was blocked in a wait state, it
will unblock and will be able to read(2) the data from its end of the pipe.

In unix, pipe(2) is used a lot for shell level pipes:

$ cat /usr/share/dict/words | grep aa | wc -l

- cat will just print the contents of the file onto stdout
- "grep aa" will read from stdin, and only display on stdout lines of text
  that match the string "aa"
- "wc -l" will count the number of lines

the '|' symbols indicate that stdout of the left side, should be piped over
to stdin of the right side.  Shells implement this using pipe(2).

Many unix programs read from stdin by default and write to stdout by
default.  They are called "filter" programs.

* poll, select, etc.

These calls are usually used in a server-class application.  You set up a
number of file descriptors you want to "listen" on.  You specify whether
you're looking for read activity, write activity, etc.  You can also set a
timeout for how long to wait (a NULL timeout in select, or -1 in poll,
means wait forever; a zero timeout means return immediately).

In poll() you send an array of descriptors and the length of the array.

In select(2) you use FD_* macros to tell which descriptors to listen on.
The descriptors are turned into a bitmap that is passed to select().  The
bitmap is encoded in the fd_set type.  For example if you want to listen on
descriptors 2 and 5, you will use FD_* macro to create an fd_set bitmap that
looks like this 00100100 (basically the 2nd and 5th bits, starting from 0,
are turned on).

FD_CLR: turn off a bit corresponding to an fd
FD_ISSET: check if it is set
FD_SET: turn on a bit corresponding to an fd
FD_ZERO: zero out an entire fd_set

If you have lots of FDs to listen on, it may be more efficient to use fd_set
and select, b/c the list of FDs to listen on is packed more compactly in a
bitmap.

You can also listen on "exceptions" on files (e.g., a socket that was closed
by the other end).

Once you set up the FDs to listen on, and go into select, select will block,
as long as there's no activity.  Block means it won't return from the call
(at least not until a timeout).  When the OS notices an activity on an FD
(can be a file or a network socket), it'll wake up the process that was
blocked, and you then return from select/poll.  The process then has to
check the params it passed to see WHICH FDs have changed.  Check to see
which bits are still on: that indicates that THOSE descriptors have a
change.  Next, you can go and, say, read from each descriptor that has a
change, and process the incoming request.

Example: a Web server listens for connections from remote Web browsers.
Each browser connects to the server using a different socket (fd).  After a
connection is initialized, the Web server will listen for "read" requests on
the FDs of each browser.  When a browser, sends (writes) the message like
"GET index.html", the select loop in the Web server will wake up, the server
can find which FD has data in it, and read(2) from it; it'll read the string
"GET index.html", and start to process that HTTP request.  Usual code you
see is roughly:

// initial setup FDs to listen on
while (1) {
	// possibly update setup FDs to listen on
	r = select(....);
	// assume no error
	// check which FDs have activity
	// read data on those FDs, maybe spawn a different thread or process
	//   to handle each request.
	// go back to select, listening on for more requests
}

Usually busy services like Web servers don't do actual processing in their
main select() loop: rather, they have a number of threads or processes that
are considered "workers", and they hand off the actual processing to the
worker; alternatively, they fork a new worker, dup the FDs that the worker
will need to respond to (responding to the Web browser), so that the main
server process can go back to its select loop, listening in for more
requests.

To prevent spawning too many processes, there's usually some limit imposed
on how many concurrent workers can be active at a time.

* umask (Unix Mask)

Used to set the default mode that new files/dirs should be created with.
The umask value is usually inherited from parent to child.

Example: if you set your umask to 077 (octal), it means that all group and
other permission bits are cleared, so files you create can have at most
r/w/x access by the owner (user).

Umask = 0022: the user keeps whatever r/w/x access was requested, but the
write bit is cleared for group and others (so no world-writeable files
created by you).

Recall prototypes:
       int open(const char *pathname, int flags, mode_t mode);
       int creat(const char *pathname, mode_t mode);

when you call creat(2) or open(2) to create a new file, or
mkdir/mknod/etc., you pass the mode you want that file to have in those
syscalls.  But that's NOT the mode the file will actually get.  The real
mode would be

mode_passed_to_syscall (mode_t mode) & ~umask

meaning: invert the umask value, then logically AND it with the requested
mode.

You can set umask in your default shell startup like .bashrc, to be more
restrictive (a good idea).

umask() syscall: pass the new umask you want to set for future file
creations, and it returns the previous umask value (so you can tell what
perhaps you inherited from the parent environment).

You can of course change a file/dir's permissions with chmod(2) at any time.

* fsync, fdatasync, sync, sync_file_range

Many ways in which you can control when data and m-d is flushed to disk
(persistent media).  Recall you also use fcntl(2) to change behavior of an
opened file.  You can do it with O_DIRECT, O_SYNC/O_ASYNC open flags, and
you can also explicitly call

fsync(fd) to flush all data and m-d on a file to disk (old traditional
syscall)

fdatasync(fd): same, but only flushes file data and not m-d (i.e., not inode
changes).  Example: if you use a database journal, after you write a
"transaction record" to it, you want to ensure that the data gets to the
DB journal file.  But you don't really care if m/a/ctime are in sync, b/c
after a crash, you just replay/apply all DB journal records in order,
regardless of the timestamps on the file.  Not flushing m-d speeds things
up: that's often b/c f/s store their files' data and m-d in different
locations on the media/disk.

sync(2): will flush all data and m-d for ALL file systems that are mounted
on the computer.  Useful to do it before you reboot a system, to ensure that
all OS-cached data is flushed first.

syncfs(fd): flush all data and m-d of the one f/s that fd's file belongs to.

sync_file_range: sync just a range of bytes within a file.

* getdents, getdents64

"reading" content of a directory, b/c you can't use read(2) on an fd of an
open directory.  With getdents, you get to read N "whole" dir records,
designated as "struct dirent", that fit inside the buffer you give
getdents().  You keep reading until you get a "0" (EOF).

Note that when you're in the "loop" that does getdents(2) until EOF, you're
reading new chunks of a directory from the last offset.  However, the
directory itself may change due to name changes (creat, rename, unlink,
mkdir, etc.).

There's a whole new area of research in OSs where atomicity guarantees are
explored for files, directories, whole directory trees, etc.

struct dirent: records info about a single directory entry, namely inode
number + the name of the entry (null terminated).

Problem: in POSIX, a full pathname can be up to 4096 bytes including "/"
delimiters.  Also, a single name in a dir can be up to 255 bytes long (256
with the terminating NUL).  Most file names are much shorter than 256B.
dirent is

struct dirent1 {
 u_long d_ino;
 char d_name[256];
};

B/c file names are shorter and variable length, we want to save space.  So we
use what's called a "variable length data structure" in C, also called an
"out-of-band" (OOB) data structure.  A simple form looks like this:

struct dirent2 {
 u_long d_ino;
 char d_name[]; // there's a field named d_name in struct, but it has no
		// pre-allocated space
};

sizeof(struct dirent2) == sizeof(u_long) (4B on 32-bit, 8B on 64-bit)

Use above structure as follows:

char *name = "myfile.c";
struct dirent2 *de = malloc(sizeof(struct dirent2) + strlen(name) + 1);
de->d_ino = 17; // fill in actual inode number
strcpy(de->d_name, name);

Often, you want to know how long the string in de->d_name is, so it's
common to add the actual strlen into the structure, or the size of the entire
allocated structure including the variable length OOB data.  Or, in the case
of struct linux_dirent, where you have a sequence of these variable length
structures, you want to know what's the offset in bytes, to get to the start
of the NEXT struct linux_dirent.

struct linux_dirent {
 unsigned long  d_ino;     /* Inode number */
 unsigned long  d_off;     /* Offset to next linux_dirent */
 unsigned short d_reclen;  /* Length of this linux_dirent */
 char           d_name[];  /* Filename (null-terminated) */
 /* length is actually (d_reclen - 2 - offsetof(struct linux_dirent,
 d_name)) */
};

Some compilers, historically, when you enable optimizations, will "optimize
away" a field that has no storage to it, like "char name[]".  For that
reason, some implementations of this OOB data structure, will assign at
least one byte to the last field -- "char name[1]".

* fallocate

Prototype: int fallocate(int fd, int mode, off_t offset, off_t len);

Preallocates space for a file, to ensure that if you're going to be writing
a file of known size, you won't run out of space in the middle of writing
(quota limits or a full disk).

You have to be careful not to over-reserve space, or you'll get an error
(e.g., ENOSPC or a quota error).  Preallocated space you never write stays
allocated to the file; you can release it explicitly (e.g., with
ftruncate(2) or fallocate's hole-punching modes).

fallocate will try to allocate all the space in the file in close proximity
to each other: this is sometimes called an "extent".  Traditional f/s don't
guarantee that all file's blocks are contiguous on the media, due to
allocation strategies and fragmentation.  This results in poor performance
for files whose blocks got spread all over the media.  An extent is
guaranteed to be contiguous, or as contiguous as possible.  For example, ext4
supports extents: so when you fallocate() on ext4, it'll try to give you one
whole contiguous unit of space on the media; if it can't it'll try and find
2 or more smaller extents, that are hopefully close to each other.

* uname

Unix Name: gives you info about the running system: arch, cpu, OS name, OS
version, host name, etc.  Also available as the uname(1) command.  Useful
when you want to
write code or scripts that have to run differently on different systems, or
if you need to distinguish different architectures, different distros, and
even different host names.

* shm* (shmat, shmget, shmctl)

A series of system calls to control "Shared Memory".  Useful for
synchronizing among multiple processes.  The SHM* API came from the older
System V Release 4 (SVr4) unix, from AT&T Bell Labs.  With this set of
syscalls, you
can "create a shared memory" object, get a handle on it, pass it to other
processes, and coordinate sharing of information, including some basic
locking and synchronization.

Unix pipes are another form of sharing info b/t two processes, but limited
to one writer and one reader.  SHM* is more flexible, in that you can
control even the size of the mem region to share among processes.

More modern systems that want to share info b/t 2+ processes, (especially
those derived from the Berkeley System Design (BSD) OS variants) will
use... mmap(2).  Some OSs implement the shm* API using mmap internally in
the OS.

* get/setitimer

Interval Timers (itimers).  alarm(2) can make the OS send SIGALRM to the
process after N seconds.  itimers allow you to set a periodic timer: when
it reaches its timer value, it'll signal your process.  You
can also count different "times" -- elapsed/clock time, or user CPU time,
etc.

e.g., setup a timer that interrupts your process every 5 seconds, up to one
hour.  Useful, e.g., if you have a long-running process, say a database or
other server, and you want to invoke a special function every 60 seconds to,
say, flush all important data.

* ptrace

Process tracing: used by interactive debuggers like gdb(1), as well as
syscall tracers like strace(1).

ptrace(2) allows you to trace a process X, assuming you have permission to
access process X.  ptrace is used together with another process Y (the
tracer of X).  Each time a syscall in X is invoked, the kernel will not
run that syscall (yet) but will instead inform the
tracer process Y of some activity.  Y can then inspect that activity, and
decide what to do next: for strace(1), it logs the syscall being executed,
lets it run, and then gets its return status.

The tracer process Y has full access to the state of traced process X: Y can
read the virtual memory of X, CPU registers, file descriptors, etc.

ptrace(2) allows you to intercept not just syscalls, but any execution.
That's how GDB works: it sets up "breakpoints" at specific memory addresses
of a running process, and ptrace will keep running until X tries to access
or execute in the traced mem addr: then it'll inform tracer Y that a
break-point has been reached, allowing Y to inspect other state, resume
running, etc.

* asynchrony related system calls

Prototypes:

aio_read(3)     Enqueue a read request.  This is the asynchronous analog of
read(2).
	Actual prototype: int aio_read(struct aiocb *aiocbp);

aio_write(3)    Enqueue a write request.  This is the asynchronous analog of
write(2).

aio_fsync(3)    Enqueue a sync request for the I/O operations on a file
descriptor.   This is the asynchronous analog of fsync(2) and fdatasync(2).

aio_error(3)    Obtain the error status of an enqueued I/O request.

aio_return(3)   Obtain the return status of a completed I/O request.

aio_suspend(3)  Suspend  the  caller  until one or more of a specified set
of I/O requests completes.

aio_cancel(3)   Attempt to cancel outstanding I/O requests on a specified
file descriptor.

lio_listio(3)   Enqueue multiple I/O requests using a single function call.

Asynchrony permits better interleaving of threads/processes.  It
provides higher throughput in general.  Conversely, synchronous activities
slow things down and create bottlenecks: process X has to issue action 1,
then wait for it to finish, then action 2, then wait for it to finish, etc.

See aio(7) man page on linux, describing its Asynchronous I/O.

Most syscalls are "synchronous": meaning you issue the syscall and then you
have to wait for it to finish, before you can issue the next one.  E.g., the
read(2) and write(2) APIs are all synchronous: you have to issue them in
order.  Some reads/writes can run much faster/slower than others.  If one of
them is slow, it holds up all the others; but if you can issue them all in
parallel, then you'd only have to wait for the last one to conclude.

I can issue reads in parallel using threads or different processes: but
those can be rather heavyweight for just a single read request.

An alternative is the AIO POSIX API: here you issue a read or write request,
but the syscall returns immediately(!) w/o the actual data or result of the
syscall.  The hallmark of any async API, is that the calls you issue return
immediately (or very quickly), but you have to have a way of being informed
when the work you've asked to be performed has finished (and whether it
succeeded or not).  This means, when you requested the original work to be
done, you have to pass some "callback" function or pointer, that you will
use to be informed when actions are concluded.

In the AIO API, you pass an AIO Control Block structure (aiocb), where you
pack the usual read/write syscall args, but also more info: how do you want
to be notified (signals? other inter-process messaging (IPC)? etc.).
Another possibility is that after you submit a bunch of aio_* calls, you go
into a "select" or poll-like call, where you are waiting for activity to
have concluded on any of the files you've submitted work for.

struct aiocb {
    /* The order of these fields is implementation-dependent */

    int             aio_fildes;     /* File descriptor */
    off_t           aio_offset;     /* File offset */
    volatile void  *aio_buf;        /* Location of buffer */
    size_t          aio_nbytes;     /* Length of transfer */
    int             aio_reqprio;    /* Request priority */
    struct sigevent aio_sigevent;   /* Notification method */
    int             aio_lio_opcode; /* Operation to be performed;
                                       lio_listio() only */

    /* Various implementation-internal fields not shown */
};

AIO API also has a way to query all pending I/Os (listing), cancel or
suspend an in-flight AIO request, get an error or return status on your own
at any point in time.

Study AIO API in linux (will give you good ideas for HW4).

In the AIO system: user processes are the "producers" of jobs to be done;
and the "consumer" of those jobs is the OS kernel, who has to perform the
tasks given to it.