ODF: On Disk Format for Unionfs 2.x (Updated May 30, 2007)
http://unionfs.filesystems.org/

Note: instructions on how to use Unionfs's ODF features (mount options) are
found in Documentation/filesystems/unionfs/usage.txt, in the same directory
as this file.

* INTRODUCTION

Stackable file systems, and especially Unionfs, often need to keep
additional information persistently, so as to provide a file system that is
fully POSIX compliant and that interoperates with other file systems
(especially network file systems).

In addition, stackable file systems are "virtual" in the sense that they
operate on top of one or more other file systems.  This means that a lot of
state often has to be produced and kept in memory.  Such state can sometimes
be discarded, but it costs a lot of CPU cycles to reproduce.  Therefore, it
can be beneficial to cache such data in another place, so it can be
retrieved easily.

Both of the above problems can be solved if Unionfs can store a certain
amount of data persistently.  This is accomplished using the On Disk Format
(ODF) support in Unionfs.  Briefly stated, Unionfs uses another small file
system partition or loop device, formatted with another file system (e.g.,
ext2, XFS, etc.), to store persistent data such as whiteout information,
cached directories, and more.  This, as you will see, solves *many* of the
problems that stackable file systems such as Unionfs have to deal with.

* ODF SOLVES IMPORTANT ISSUES

Some of the problems that ODF solves are the following:

1. Whiteouts and opaques pollute the namespace.

Currently Unionfs (pre-ODF) uses a file such as ".wh.F" to indicate that
the file "F" should be whited out; Unionfs also uses a file named
".wh.__dir_opaque" in a directory D to indicate that lookups should not
continue past the branch where D is.  Both of these special names pollute
the namespace, and they prevent anyone from actually creating a file whose
name starts with ".wh".
Also, writing those special files over various file systems, especially
network file systems, introduces various races.

Instead, the new ODF code stores whiteouts as hard links to a special
(regular) zero-length file in the ODF (/odf/whiteout), and it stores
opaqueness information for directories in the inode GID bits of an ODF
file system (e.g., ext2, XFS, etc.) on the local machine.  This avoids
the namespace pollution and avoids races with network file systems,
while minimizing inode consumption in /odf.

2. Inode numbers must be persistent indefinitely.

Network file systems (e.g., NFS) and user tools such as /bin/tar and
/bin/find require that the inode numbers of files be persistent.
Unionfs, however, creates dynamic inode numbers on the fly because it
cannot inherit the actual inode numbers from the lower branches (the
latter can be any numbers, and there is no unique way to combine them).
To keep inode numbers persistent, our old Unionfs 1.x supported a
special option called 'imap', which was a simple file that recorded the
persistent inode numbers per branch.  Alas, the imap code was racy,
polluted the namespace, didn't handle branch management, and added
significant performance overheads.

Instead, the new ODF code allocates inodes on a real but small ext2
file system and uses those inode numbers as the persistent inode
numbers for Unionfs to use.  By using ext2, we leverage a small and
very efficient file system; we get persistency easily; and concurrent
accesses are permitted, further improving performance.  Moreover,
Unionfs can be exported across network file systems reliably.

3. Layering multiple unions.

Stacking unionfs on top of unionfs was not possible before.  The reason
is mainly the namespace pollution: a whiteout file could belong to any
logical union, and it is impossible to tell which one.  With the new
ODF code, each union has its own ODF state, which separates out
whiteouts and opaque files.  With ODF, stacking unions in any arbitrary
manner is simple.

4. Resuming unions with complex branch configurations.

Sometimes users build complex unions with many branches, and then
reconfigure their unions using branch-management commands (adding,
deleting, and modifying branches and their states).  The end result is
a union that may be hard to reproduce or easy to forget: it may require
a long command line to build the same union again, so users often have
to record the final branch configuration in some other location (a
shell script, for example).

With the new ODF code, each union records its branch configuration
precisely and persistently.  If you unmount a union, you can resume it
easily by just pointing Unionfs at the correct ODF location; Unionfs
will then resume the exact branch configuration that was there before.

5. Directory reading

When a user lists a directory, Unionfs has to build the contents of
that directory by merging the contents of the corresponding directory
on each of the lower branches.  Unionfs then has to eliminate all
duplicates, remove any file for which there is a whiteout, and handle
opaque directories.  This process consumes memory and CPU cycles, and
has to be repeated each time Unionfs re-reads the same directory.
Worse, if the merged directories are very large, the kernel could run
out of memory.  One solution was to build the unified directory
contents incrementally, piece by piece, conserving memory.  However,
incremental building of a union directory's contents prevents
lseek()-ing into a directory that has not been completely constructed
in memory (NFS/SMB use directory lseek-ing extensively).

With the new ODF code, Unionfs still constructs each unified
directory's contents incrementally in memory, but stores the result
persistently in the ODF location.  This has the benefit of supporting
lseek() perfectly while avoiding large memory consumption.
Essentially, ODF becomes a cache of the directory contents, and since
directories are read more often than they are written, this cache also
improves performance.
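The duplicate-elimination merge described in item 5 can be sketched in
user space.  The following Python model is illustrative only; the
function and parameter names are made up, and the real work happens in
the kernel:

```python
def union_readdir(branches, whiteouts, opaque_at=None):
    """Build the unified contents of one directory: scan branches left
    to right, stop after the opaque branch, skip whited-out names, and
    keep only the leftmost occurrence of each name."""
    seen, merged = set(), []
    for i, names in enumerate(branches):
        for name in names:
            if name not in seen and name not in whiteouts:
                seen.add(name)
                merged.append(name)
        if opaque_at is not None and i == opaque_at:
            break  # lookups must not continue past the opaque branch
    return merged

# Three branches; "c" is whited out, and branch 1 is opaque, so the
# contents of branch 2 never appear in the union.
print(union_readdir([["a", "b"], ["b", "c"], ["d"]],
                    whiteouts={"c"}, opaque_at=1))  # ['a', 'b']
```

Caching this merged list in /odf (as described below) then amounts to
writing it to a per-directory file and comparing mtimes to detect
staleness.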
6. Hardlinks

Supporting hard links in any file system is a challenge.  This is
primarily because FFS-like file systems and Unix OSs don't support an
efficient mechanism to find all of the names and precise locations of a
hard-linked file given just one inode number or one name.  It is even
more of a challenge in Unionfs, because Unionfs supports arbitrary ways
of creating branches, which could contain hard-linked files only some
of whose names are exposed within a given branch.  The challenge during
copy-up is how to recognize all of the hard-linked names and properly
treat a copied-up hard-linked file as one file, not just a collection
of arbitrary names.

With the new ODF code, Unionfs incrementally builds its knowledge about
hard-linked files across branches, as it finds them in the lower
branches.  Essentially, Unionfs tries to discover all of the names of a
hard-linked file "lazily," so that Unionfs can properly treat the file
when it gets copied up.

* DESIGN

** ODF STRUCTURE

ODF is just a file (or raw partition) formatted to our needs, not a
full-blown file system developed just for Unionfs, and its namespace is
*not* exposed to users.  To make debugging easier, however, and to
leverage existing tools (e.g., debugfs), we decided to use an existing
simple file system, ext2 (several other disk-based file systems could
work just as well).  The advantages of using an existing file system
are:

- we remain format-compatible with an existing file system.
- there is no need to hide the namespace for the ODF prototype (you can
  mount the /odf partition and inspect it manually).
- we can use existing tools: mkfs, dump/restore, debugfs, etc.
- we can get ODF working quicker and into the hands of users.

Of course, regardless of what we use for ODF, we still need to take
care of locking our objects, to prevent unsafe concurrent accesses to
/odf by multiple users.

In the rest of this document, we assume that an ODF partition for a
single union is mounted under /odf.
In practice, there can be many such ODFs, mounted on any paths the user
desires.  Note also that normally this mount is performed inside the
kernel, and hence the /odf namespace is hidden from a normal user's
view.

An /odf is a hierarchy of directories and files with specific purposes,
as follows:

1. /odf/sb is a plain file which records the branch configuration,
   UUIDs, etc.  Each branch has a unique UUID so it can be identified
   consistently.

2. /odf/ns/ is used for storing information about whiteouts, opaque
   directories, and persistent inode numbers.  We use the actual inum
   of the created zero-length file as our persistent inum.  The term
   'ns' here stands for Name-Space.

3. /odf/ic/##/##/ is for caching readdir() results, where '##' is the
   64-bit inum, broken up two bytes at a time, for up to a 4-deep
   hierarchy.  (It might be more efficient for '##' to go MSB->LSB, to
   help inum locality.)  The term 'ic' stands for icache.

4. We use /odf/ic also to store hardlink info, in
   /odf/ic/##/##/<inum>.<uuid>: a file whose contents record a mapping
   of inodes to hard-linked names; here <inum> is the inode number of
   the source hard-linked file, and <uuid> is the UUID of the source
   branch.  The first thing in the file is the UUID of the destination
   branch we copied the hard-linked file up to, and the inum of the
   copied-up file we created (this would allow us to hardlink to it).
   If we don't know this yet, because a hard-linked file was just
   looked up one or more times without resulting in a copyup, then we
   leave both entries zero.  (In practice, we don't really need to
   store the UUID itself, but just an index to the branch UUID's
   location in /odf/sb, as needed.)  The rest of the contents of this
   file is a list of struct dirent's, which record the name of each
   hardlink, but with the inum of the *parent* directory (e.g.,
   <parent-inum, name>).  We need to store the parent inum so that the
   hard-linked file's name can be found by searching the parent dir.
(Note: a modification to path_lookup is needed to allow us to look up a
file using only its name and its directory's inum.)  We do not store
the number of links here, because that can change (a user can delete
one link while we're still trying to discover the rest of the link
names); so we always look it up from one of the known hardlink names.

5. /odf/reclaim/ is used for storing hierarchies that need to be
   deleted.  This is useful after a mkdir on top of an opaque dir,
   because the entire opaque directory's hierarchy has to be deleted
   (recursively, no less).  On rmdir we may also use /odf/reclaim.
   This allows us to speed up operations which need to delete items
   from the ODF, because we just move those items to the reclaim area
   (which acts as a "trash bin" of sorts).  In addition, we have an
   asynchronous kernel thread which wakes up every few seconds
   (configurable) and purges some items in /odf/reclaim, to keep space
   usage in /odf low.  Finally, if we notice that space in /odf has
   become very short (especially if we get an ENOSPC), then we
   immediately suspend file system activity in Unionfs and go clean up
   some space in /odf, until usage drops below a (configurable)
   threshold.

6. /odf/whiteout is a zero-length plain file which we hard link to,
   from /odf/ns/<F>, whenever we need to indicate that <F> is a
   whiteout file.

7. /odf/sr is for storing silly-renamed files.  Inodes that are still
   in use when unlinked will have their entry in the ODF moved to
   /odf/sr temporarily, to keep the ODF from recycling that inode while
   the unionfs inode is still in use.  When the unionfs inode is
   dropped, the entry in /odf/sr is unlinked.

** The ODF superblock

The ODF superblock (/odf/sb) stores the following:

- the ODF format version, in case it changes in the future
- the UUID of each branch
- the number of branches
- the pathnames and options for each branch.  This is so we can mount
  an older union by its ODF info, without specifying full mount options
  again.
- (Optional: we can record portable information such as a file
  system's LABEL.)
- we might also need to record extra info to handle hardlinks (e.g.,
  the original branch pathname, from the root of that branch's file
  system up to the lower-branch dir we stack on)

Actions:

1. To prepare a new ODF, simply run "mkfs" on the /odf partition.  Or,
   if you already have a formatted /odf, you can mount it and then
   "rm -rf" everything inside it directly.

2. On mount, check if /odf/sb exists; get the number and list of
   branches, and their UUIDs, and store them in this file.  If /odf/sb
   does not exist, then we know that we're mounting a new /odf for the
   first time; in that case, create the /odf hierarchy: /sb, /whiteout,
   /ic, /ns, /reclaim, and /sr.  If we find that /odf/sb does not
   exist, but some of the others do, then we assume that the /odf is
   corrupt, and we fail the mount with an error message.

3. On remount, if we reconfigure the branches, then
   (a) /odf/sb must exist (else BUG_ON);
   (b) update the /odf/sb file;
   (c) if the "odf_keepcache" flag was not specified, clean the dir
       cache in /odf/ic;
   (d) otherwise, if you really want to use an existing /odf cache
       (with possibly some stale info), use the mount options
       "remount,odf_keepcache".

** Whiteouts and Opaque Dir actions

Whiteout (WH) semantics:

- create a wh only if necessary, of course

> lookup():

- check for the main file in each branch, from the leftmost branch
  until the opaque branch (so we have to find out the opaque limit)
- upon a successful lookup of a dir, we mark in the unionfs dir inode
  the values for the start branch, end branch, and opaque branch.

# XXX: do we return pointers to the first matching dentry below, or all?
# XXX: does this affect rename()?

1. if the file is not found in any branch, return ENOENT
   - optional (possibly deferred to cron):
     - check if a wh is in /odf
     - if so, it was left over from an old wh (someone may have removed
       the main file), so clean up the wh in /odf

2. if the file is found, then check /odf
   - if a wh is found in /odf, return ENOENT
   - if there is no wh in /odf, return the dentry normally

The above lookup procedure is further modified to handle hard links.

# XXX: we may have to change the above to handle hardlinks (keep track
# of which branch the wh came from)

3. when looking up a directory, we need to set the link count to 2 if
   the dir is empty, else to whatever it should be.  This could be
   useful when renaming a dir over an empty dir.

> create(), mknod(), and symlink():

- after you create the new file, delete the wh if it exists
- we create only in the leftmost branch; if that failed, return an
  error (with this, there is no need to worry about opaque dirs)

> unlink():

- go L->R (left to right on branches) and delete all files with the
  same name.
- if we get *any* error (EIO, EROFS, ...), then stop trying to unlink
  the remaining ones, and create a wh; return "OK" (but we may want to
  printk on EIO or some other serious errors); if the wh creation
  failed, return an error.

# XXX: need to handle hardlinks and unlinks

> is_empty(char *dir) utility function:

- checks if dir is logically empty or not
- handles opaque dirs
- handles whiteouts
- returns TRUE/FALSE depending on whether dir is empty

Sample operation (partial):

- first we need to find out what the directory's "emptiness" status
  would have been before, so we know whether to return ENOTEMPTY.
- so we go L->R (up to and including an OPAQUE dir, but not beyond).
- if all those dirs have link-count==2, then the directory would have
  been empty, so we can return TRUE from is_empty(); else return FALSE.

> rmdir() is similar to unlink, but:

- first call is_empty(); if the dir would not have been empty, we're
  done, just return ENOTEMPTY; else
- now handle the actual rmdir'ing: go L->R and rmdir all dirs with the
  same name.
- if you get *any* error (EIO, EROFS, ...), then stop trying to rmdir
  the remaining ones, and create a wh; return "OK" (but we may want to
  printk on EIO or some other serious errors).
- when we have to rmdir a non-empty physical dir, we mv it to
  /odf/reclaim instead, to avoid in-kernel recursive removal.  We can
  use a combination of a user-level cron job and an in-kernel async
  thread to clean up stuff in /odf/reclaim.

> mkdir():

- create the dir first; if that failed, return an error
- delete the wh if it exists.  To delete the wh from /odf, we need to
  effectively remove an entire dir of whiteouts; so mv the whole dir
  into /odf/reclaim, which gets cleaned up asynchronously.
- create the new dir in /odf, and mark it OPAQUE

> readdir() and filldir():

- we create in /odf/ic (see above) a file containing the
  unionfs-visible content of the given, say, "foo/" dir.  We use the
  same duplicate-elimination algorithm that we have now.
- we update the mtime on the /odf/ic file as well as on the
  unionfs-visible "foo/" dir, so we know whether the /odf/ic readdir
  content is stale; if it is stale, unlink the old /odf/ic file and
  recreate it from the current branch contents.
- The benefit: /odf caches readdir contents for all users, until/unless
  the "foo/" dir changes in Unionfs; this survives reboots, works with
  all forms of l/seek-ing, and is even more efficient because dir pages
  get cached; this also works with NFS exports (no need for cookie
  ugliness any more).

> permission():

- just check the first one you find, same as now.

** Copyup actions

Note: Copyup actions don't really relate to ODF.  However, we put them
here because the ODF work allowed us to consider simplifying the
Unionfs code (especially with respect to rename semantics).

> write(): needs to copyup

- if you write to a read-only branch or a ROFS file system, then copyup
- use SEEK_HOLE if available; otherwise check each page for zeros and
  skip it.
- optional: we can also copyup on ENOSPC/EDQUOT
- if the copyup failed, return an error
- copyup may need to create a parent hierarchy of dirs (but don't make
  the newly created hierarchy opaque)

# XXX: fix hardlinks if copyup succeeded.
# XXX: fix copyup for mmap(), probably in ->writepage and
# commit/prepare_write

- we also need to copyup on some ->setattr operations (chown, chgrp,
  ...)

> truncate(): needs to copyup

- handle the same as copyup for write
- it is useful to have a shared copyup function that can copy part or
  all of a file.

> rename() has always been challenging in Unionfs:

case 1: regular file within the same dir
case 2: regular file to a separate dir
case 3: dir within the same dir
case 4: dir to a separate dir

The older semantics of rename were that we tried hard to preserve the
location of a file being renamed.  But this was too complex to maintain
without races, and offered few benefits to users.  Some Unionfs users
proposed that on rename we always copy up the renamed file to the
leftmost branch.  That would indeed be much simpler to implement, but
it seems too excessive and may clutter the leftmost branch too much.
So our new design with ODF provides a trade-off between the old and
proposed implementations.

Our new, simplified rename idea is as follows:

- By the time we're called with rename(src,dst), both dentries have
  been looked up: src must exist, else our lookup would have failed;
  and dst may or may not exist (we'll get a dentry which could be a
  negative dentry).
- If src is a non-empty dir that lies on a read-only branch, we let
  userspace recurse and ask us to copy up each file separately, by
  returning EXDEV (just as in Unionfs 1.x).
- If dst exists on the top branch and is a whiteout or a logically
  empty dir, we remove it before renaming, by:
  - rmdir if it's a physically empty dir
  - silly rename if it's a non-empty dir
  - unlink if it's not a dir
- Start renaming from the leftmost branch of src toward the rightmost,
  stopping if any one rename fails.
- We then copy up to the top branch and rename, if the src's leftmost
  branch was not the top branch and:
  - the rename at the src's leftmost branch failed,
  - dst already existed at a branch to the left of src, or
  - dst was in an opaque dir whose opaqueness branch is to the left of
    src.
- We create a whiteout if any of the renames failed.
- For directories: the same applies to directories as well.  If we need
  to copy up a directory, then that directory is guaranteed to be empty
  because of the EXDEV handling (above), so we just mkdir at the top
  branch.
- We need to purge the readdir cache (/odf/ic) for all successful
  renames, in both the src and dst dirs.  Also, the creation of a
  whiteout should purge the readdir cache of that dir (this can be
  optimized later).

# XXX: need to handle hardlinks and renames

** Persistent Inodes actions

Easy: just create the file in /odf/ns/<F>, using a flexible copyup()
routine that works for whole-file copyup, truncated files, and
zero-length files in /odf/ns/.  On lookup, we need to create the
hierarchy in /odf/ns if it doesn't exist; if/when it does exist, read
the inum from /odf/ns and put it in unionfs's inode.  Let the
dcache/icache algorithms figure out when to purge stuff.  This means
we'll need lots of inodes in /odf, so mkfs it appropriately.  Our
current estimate is to create around 3000 to 3500 inodes per MB in the
ODF, using "mkfs -t FOOFS -i NNN".

* Hardlinks

When looking up a hard-linked file for the first time, create a new
entry in /odf/ic/##/##/<inum>.<uuid>, as described above.  If that
hard-linked file is truncated or written to, we need to copyup; so we
copy up normally, because it's the first time we copy up that
hard-linked file; then we permit the write to proceed.  If we copied up
and created a new inode on the destination branch, then we update the
/odf/ic entry to record the UUID+inum (which were all zero initially).

When looking up a hard-linked file, we check /odf/ic, and let's say we
find it there.  That means the hard-linked file was looked up before,
and possibly copied up.
If no one modifies the hard-linked file, even on the second lookup,
there is no need to copy up the file or its parent hierarchy, so we
just add another entry to /odf/ic.  But if a user modifies the file,
then we have two cases.  (1) If the file is modified for the first
time, then we copy up several parent hierarchies, for all known names
of this hardlink in /odf/ic, and then link the file in those several
places.  (Copyup will have to look up the dentry and then do a d_path
on it to get the full hierarchy to recreate.)  (2) If the file was
already copied up, then the entire known hardlink hierarchy already
exists, so we just create the parent hierarchy for this newly
discovered link and link the file to the already copied-up version.
Then we can allow the write/truncate to proceed.

When we copy up, for hardlinks or otherwise, we also hardlink in
/odf/ns, so we retain the persistent inodes.

We need to record the actual number of links to report for a file, in
case one of the hard-linked names was removed.  We can store this in
/odf/ns/ by stealing the gid field (or some other field we can take
over safely).

> link():

If you create a new hardlink, then we copy up the src as described
above, update /odf/ns as described, and then link the dst to the src.
If someone creates a new hardlink to an already copied-up file, we will
see that as existing info in /odf/ns.

* CREDITS

Our group of stackable file system developers at Stony Brook had
considered something like ODF for several years, and for several
projects (not just Unionfs).  But it wasn't until OLS 2006 that we
began seriously designing and developing this support.  One major
reason was that at OLS '06 we met and discussed the topic with Ted
Ts'o, who was quite supportive of the idea.

The initial design for ODF was done by Yiannis Pericleous, Josef
"Jeff" Sipek, and Erez Zadok.  The bulk of the initial development was
done by Yiannis Pericleous.  Currently, all Unionfs developers help
maintain the ODF code.
This file can be found in a Linux kernel source tree that contains
Unionfs with ODF support (perhaps as a patch), in
linux-X.Y.Z/Documentation/filesystems/unionfs/odf.txt

For more information on Unionfs, see http://unionfs.filesystems.org/.