ODF: On Disk Format for Unionfs 2.x (Updated May 30, 2007)
http://unionfs.filesystems.org/

Note: instructions on how to use Unionfs's ODF features (mount options) are
found in Documentation/filesystems/unionfs/usage.txt, in the same directory
as this file.

* INTRODUCTION

Stackable file systems, and especially Unionfs, often need to keep
additional information persistently, so as to provide a file system that is
fully POSIX compliant and that interoperates with other file systems
(especially network file systems).

In addition, stackable file systems are "virtual" in the sense that they
operate on top of one or more other file systems.  This means that a lot of
state often has to be produced and kept in memory.  Such state can sometimes
be discarded, but it costs a lot of CPU cycles to reproduce.  Therefore, it
can be beneficial to cache such data in another place, so it can be
retrieved easily.

Both of the above problems can be solved if Unionfs can store a certain
amount of data persistently.  This is accomplished using the On Disk Format
(ODF) support in Unionfs.  Briefly stated, Unionfs uses another small file
system partition or loop device, formatted with another file system (e.g.,
ext2, XFS, etc.), to store persistent data such as whiteout information,
cached directories, and more.  This, as you will see, solves *many* of the
problems that stackable file systems such as Unionfs have to deal with.

* ODF SOLVES IMPORTANT ISSUES

Some of the problems that ODF solves are the following:

1. Whiteouts and opaques pollute the namespace.

Currently Unionfs (pre-ODF) uses a file such as ".wh.F" to indicate that
the file "F" should be whited out; Unionfs also uses a file named
".wh.__dir_opaque" in a directory D to indicate that lookups should not
continue past the branch where D is.  Both of these special names pollute
the namespace, and they prevent anyone from actually creating a file whose
name starts with ".wh".
Also, writing those special files over various file systems, especially
network file systems, introduces various races.

Instead, the new ODF code stores whiteouts as hard links to a special
(regular) zero-length file in the ODF (/odf/whiteout), and it stores
opaqueness information for directories in the inode GID bits of an ODF
file system (e.g., ext2, XFS, etc.) on the local machine.  This avoids
the namespace pollution and avoids races with network file systems,
while minimizing inode consumption in /odf.

2. Inode numbers must be persistent indefinitely.

Network file systems (e.g., NFS) and user tools such as /bin/tar and
/bin/find require that the inode numbers of files be persistent.
Unionfs, however, creates dynamic inode numbers on the fly because it
cannot inherit the actual inode numbers from the lower branches (the
latter can be any numbers, and there is no unique way to combine them).
To keep inode numbers persistent, our old Unionfs 1.x supported a
special option called 'imap', which was a simple file that recorded the
persistent inode numbers per branch.  Alas, the imap code was racy,
polluted the namespace, didn't handle branch management, and added
significant performance overheads.

Instead, the new ODF code allocates inodes on a real but small ext2
file system and uses those inode numbers as the persistent inode
numbers for Unionfs to use.  By using ext2, we leverage a small and
very efficient file system; we get persistency easily; and concurrent
accesses are permitted, further improving performance.  Moreover,
Unionfs can be exported across network file systems reliably.

3. Layering multiple unions.

Stacking unionfs on top of unionfs was not possible before.  The reason
is mainly the namespace pollution: a whiteout file could belong to any
logical union, and it is impossible to tell which one.  With the new
ODF code, each union has its own ODF state, which separates out
whiteouts and opaque files.  With ODF, stacking unions in any arbitrary
manner is simple.

4. Resuming unions with complex branch configurations.

Sometimes users build complex unions with many branches, and then
reconfigure their unions using branch-management commands (adding,
deleting, and modifying branches and their states).  The end result is
a union that may be hard to reproduce or easy to forget: it may require
a long command line to build the same union again, so users often have
to record the final branch configuration in some other location (a
shell script, for example).

With the new ODF code, each union records its branch configuration
precisely and persistently.  If you unmount a union, you can resume it
easily by just pointing Unionfs at the correct ODF location; Unionfs
will then resume the exact branch configuration that was there before.

5. Directory reading

When a user lists a directory, Unionfs has to build the contents of
that directory by merging the contents of the corresponding directory
on each of the lower branches.  Unionfs then has to eliminate all
duplicates, remove any file for which there is a whiteout, and handle
opaque directories.  This process consumes memory and CPU cycles, and
has to be repeated each time Unionfs re-reads the same directory.
Worse, if the merged directories are very large, the kernel could run
out of memory.  One solution was to build the unified directory
contents incrementally, piece by piece, conserving memory.  However,
incremental building of a union directory's contents prevents
lseek()-ing into a directory that has not been completely constructed
in memory (NFS/SMB use directory lseek-ing extensively).

With the new ODF code, Unionfs still constructs each unified
directory's contents incrementally in memory, but stores the result
persistently in the ODF location.  This has the benefit of supporting
lseek() perfectly while avoiding large memory consumption.
Essentially, ODF becomes a cache of the directory contents, and since
directories are read more often than they are written, this cache also
improves performance.
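The duplicate-elimination merge described in item 5 can be sketched in
user space.  The following Python model is illustrative only; the
function and parameter names are made up, and the real work happens in
the kernel:

```python
def union_readdir(branches, whiteouts, opaque_at=None):
    """Build the unified contents of one directory: scan branches left
    to right, stop after the opaque branch, skip whited-out names, and
    keep only the leftmost occurrence of each name."""
    seen, merged = set(), []
    for i, names in enumerate(branches):
        for name in names:
            if name not in seen and name not in whiteouts:
                seen.add(name)
                merged.append(name)
        if opaque_at is not None and i == opaque_at:
            break  # lookups must not continue past the opaque branch
    return merged

# Three branches; "c" is whited out, and branch 1 is opaque, so the
# contents of branch 2 never appear in the union.
print(union_readdir([["a", "b"], ["b", "c"], ["d"]],
                    whiteouts={"c"}, opaque_at=1))  # ['a', 'b']
```

Caching this merged list in /odf (as described below) then amounts to
writing it to a per-directory file and comparing mtimes to detect
staleness.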
6. Hardlinks

Supporting hard links in any file system is a challenge.  This is
primarily because FFS-like file systems and Unix OSs don't support an
efficient mechanism to find all of the names and precise locations of a
hard-linked file given just one inode number or one name.  It is even
more of a challenge in Unionfs, because Unionfs supports arbitrary ways
of creating branches, which could contain hard-linked files only some
of whose names are exposed within a given branch.  The challenge during
copy-up is how to recognize all of the hard-linked names and properly
treat a copied-up hard-linked file as one file, not just a collection
of arbitrary names.

With the new ODF code, Unionfs incrementally builds its knowledge about
hard-linked files across branches, as it finds them in the lower
branches.  Essentially, Unionfs tries to discover all of the names of a
hard-linked file "lazily," so that Unionfs can properly treat the file
when it gets copied up.

* DESIGN

** ODF STRUCTURE

ODF is just a file (or raw partition) formatted to our needs, not a
full-blown file system developed just for Unionfs, and its namespace is
*not* exposed to users.  To make debugging easier, however, and to
leverage existing tools (e.g., debugfs), we decided to use an existing
simple file system, ext2 (several other disk-based file systems could
work just as well).  The advantages of using an existing file system
are:

- we remain format-compatible with an existing file system.
- there is no need to hide the namespace for the ODF prototype (you can
  mount the /odf partition and inspect it manually).
- we can use existing tools: mkfs, dump/restore, debugfs, etc.
- we can get ODF working quicker and into the hands of users.

Of course, regardless of what we use for ODF, we still need to take
care of locking our objects, to prevent unsafe concurrent accesses to
/odf by multiple users.

In the rest of this document, we assume that an ODF partition for a
single union is mounted under /odf.
In practice, there can be many such ODFs, mounted on any paths the user
desires.  Note also that normally this mount is performed inside the
kernel, and hence the /odf namespace is hidden from a normal user's
view.

An /odf is a hierarchy of directories and files with specific purposes,
as follows:

1. /odf/sb is a plain file which records the branch configuration,
   UUIDs, etc.  Each branch has a unique UUID so it can be identified
   consistently.

2. /odf/ns/ is used for storing information about whiteouts, opaque
   directories, and persistent inode numbers.  We use the actual inum
   of the created zero-length file as our persistent inum.  The term
   'ns' here stands for Name-Space.

3. /odf/ic/##/##/ is for caching readdir() results, where '##' is the
   64-bit inum, broken up two bytes at a time, for up to a 4-deep
   hierarchy.  (It might be more efficient for '##' to go MSB->LSB, to
   help inum locality.)  The term 'ic' stands for icache.

4. We use /odf/ic also to store hardlink info, in
   /odf/ic/##/##/<inum>.<uuid>: a file whose contents record a mapping
   of inodes to hard-linked names; here <inum> is the inode number of
   the source hard-linked file, and <uuid> is the UUID of the source
   branch.  The first thing in the file is the UUID of the destination
   branch we copied the hard-linked file up to, and the inum of the
   copied-up file we created (this would allow us to hardlink to it).
   If we don't know this yet, because a hard-linked file was just
   looked up one or more times without resulting in a copyup, then we
   leave both entries zero.  (In practice, we don't really need to
   store the UUID itself, but just an index to the branch UUID's
   location in /odf/sb, as needed.)  The rest of the contents of this
   file is a list of struct dirent's, which record the name of each
   hardlink, but with the inum of the *parent* directory (e.g.,
   <parent-inum, name>).  We need to store the parent inum so that the
   hard-linked file's name can be found by searching the parent dir.
(Note: a modification to path_lookup is needed to allow us to look up a
file using only its name and its directory's inum.)  We do not store
the number of links here, because that can change (a user can delete
one link while we're still trying to discover the rest of the link
names); so we always look it up from one of the known hardlink names.

5. /odf/reclaim/ is used for storing hierarchies that need to be
   deleted.  This is useful after a mkdir on top of an opaque dir,
   because the entire opaque directory's hierarchy has to be deleted
   (recursively, no less).  On rmdir we may also use /odf/reclaim.
   This allows us to speed up operations which need to delete items
   from the ODF, because we just move those items to the reclaim area
   (which acts as a "trash bin" of sorts).  In addition, we have an
   asynchronous kernel thread which wakes up every few seconds
   (configurable) and purges some items in /odf/reclaim, to keep space
   usage in /odf low.  Finally, if we notice that space in /odf has
   become very short (especially if we get an ENOSPC), then we
   immediately suspend file system activity in Unionfs and go clean up
   some space in /odf, until usage drops below a (configurable)
   threshold.

6. /odf/whiteout is a zero-length plain file which we hard link to,
   from /odf/ns/<F>, whenever we need to indicate that <F> is a
   whiteout file.

7. /odf/sr is for storing silly-renamed files.  Inodes that are still
   in use when unlinked will have their entry in the ODF moved to
   /odf/sr temporarily, to keep the ODF from recycling that inode while
   the unionfs inode is still in use.  When the unionfs inode is
   dropped, the entry in /odf/sr is unlinked.

** The ODF superblock

The ODF superblock (/odf/sb) stores the following:

- the ODF format version, in case it changes in the future
- the UUID of each branch
- the number of branches
- the pathnames and options for each branch.  This is so we can mount
  an older union by its ODF info, without specifying full mount options
  again.
- (Optional: we can record portable information such as a file
  system's LABEL.)
- we might also need to record extra info to handle hardlinks (e.g.,
  the original branch pathname, from the root of that branch's file
  system up to the lower-branch dir we stack on)

Actions:

1. To prepare a new ODF, simply run "mkfs" on the /odf partition.  Or,
   if you already have a formatted /odf, you can mount it and then
   "rm -rf" everything inside it directly.

2. On mount, check if /odf/sb exists; get the number and list of
   branches, and their UUIDs, and store them in this file.  If /odf/sb
   does not exist, then we know that we're mounting a new /odf for the
   first time; in that case, create the /odf hierarchy: /sb, /whiteout,
   /ic, /ns, /reclaim, and /sr.  If we find that /odf/sb does not
   exist, but some of the others do, then we assume that the /odf is
   corrupt, and we fail the mount with an error message.

3. On remount, if we reconfigure the branches, then
   (a) /odf/sb must exist (else BUG_ON);
   (b) update the /odf/sb file;
   (c) if the "odf_keepcache" flag was not specified, clean the dir
       cache in /odf/ic;
   (d) otherwise, if you really want to use an existing /odf cache
       (with possibly some stale info), use the mount options
       "remount,odf_keepcache".

** Whiteouts and Opaque Dir actions

Whiteout (WH) semantics:

- create a wh only if necessary, of course

> lookup():

- check for the main file in each branch, from the leftmost branch
  until the opaque branch (so we have to find out the opaque limit)
- upon a successful lookup of a dir, we mark in the unionfs dir inode
  the values for the start branch, end branch, and opaque branch.

# XXX: do we return pointers to the first matching dentry below, or all?
# XXX: does this affect rename()?

1. if the file is not found in any branch, return ENOENT
   - optional (possibly deferred to cron):
     - check if a wh is in /odf
     - if so, it was left over from an old wh (someone may have removed
       the main file), so clean up the wh in /odf

2. if the file is found, then check /odf
   - if a wh is found in /odf, return ENOENT
   - if there is no wh in /odf, return the dentry normally

The above lookup procedure is further modified to handle hard links.

# XXX: we may have to change the above to handle hardlinks (keep track
# of which branch the wh came from)

3. when looking up a directory, we need to set the link count to 2 if
   the dir is empty, else to whatever it should be.  This could be
   useful when renaming a dir over an empty dir.

> create(), mknod(), and symlink():

- after you create the new file, delete the wh if it exists
- we create only in the leftmost branch; if that failed, return an
  error (with this, there is no need to worry about opaque dirs)

> unlink():

- go L->R (left to right on branches) and delete all files with the
  same name.
- if we get *any* error (EIO, EROFS, ...), then stop trying to unlink
  the remaining ones, and create a wh; return "OK" (but we may want to
  printk on EIO or some other serious errors); if the wh creation
  failed, return an error.

# XXX: need to handle hardlinks and unlinks

> is_empty(char *dir) utility function:

- checks if dir is logically empty or not
- handles opaque dirs
- handles whiteouts
- returns TRUE/FALSE depending on whether dir is empty

Sample operation (partial):

- first we need to find out what the directory's "emptiness" status
  would have been before, so we know whether to return ENOTEMPTY.
- so we go L->R (up to and including an OPAQUE dir, but not beyond).
- if all those dirs have link-count==2, then the directory would have
  been empty, so we can return TRUE from is_empty(); else return FALSE.

> rmdir() is similar to unlink, but:

- first call is_empty(); if the dir would not have been empty, we're
  done, just return ENOTEMPTY; else
- now handle the actual rmdir'ing: go L->R and rmdir all dirs with the
  same name.
- if you get *any* error (EIO, EROFS, ...), then stop trying to rmdir
  the remaining ones, and create a wh; return "OK" (but we may want to
  printk on EIO or some other serious errors).
- when we have to rmdir a non-empty physical dir, we mv it to
  /odf/reclaim instead, to avoid in-kernel recursive removal.  We can
  use a combination of a user-level cron job and an in-kernel async
  thread to clean up stuff in /odf/reclaim.

> mkdir():

- create the dir first; if that failed, return an error
- delete the wh if it exists.  To delete the wh from /odf, we need to
  effectively remove an entire dir of whiteouts; so mv the whole dir
  into /odf/reclaim, which gets cleaned up asynchronously.
- create the new dir in /odf, and mark it OPAQUE

> readdir() and filldir():

- we create in /odf/ic (see above) a file containing the
  unionfs-visible content of the given, say, "foo/" dir.  We use the
  same duplicate-elimination algorithm that we have now.
- we update the mtime on the /odf/ic file as well as on the
  unionfs-visible "foo/" dir, so we know whether the /odf/ic readdir
  content is stale; if it is stale, unlink the old /odf/ic file and
  recreate it from the current branch contents.
- The benefit: /odf caches readdir contents for all users, until/unless
  the "foo/" dir changes in Unionfs; this survives reboots, works with
  all forms of l/seek-ing, and is even more efficient because dir pages
  get cached; this also works with NFS exports (no need for cookie
  ugliness any more).

> permission():

- just check the first one you find, same as now.

** Copyup actions

Note: Copyup actions don't really relate to ODF.  However, we put them
here because the ODF work allowed us to consider simplifying the
Unionfs code (especially with respect to rename semantics).

> write(): needs to copyup

- if you write to a read-only branch or a ROFS file system, then copyup
- use SEEK_HOLE if available; otherwise check each page for zeros and
  skip it.
- optional: we can also copyup on ENOSPC/EDQUOT
- if the copyup failed, return an error
- copyup may need to create a parent hierarchy of dirs (but don't make
  the newly created hierarchy opaque)

# XXX: fix hardlinks if copyup succeeded.
# XXX: fix copyup for mmap(), probably in ->writepage and
# commit/prepare_write

- we also need to copyup on some ->setattr operations (chown, chgrp,
  ...)

> truncate(): needs to copyup

- handle the same as copyup for write
- it is useful to have a shared copyup function that can copy part or
  all of a file.

> rename() has always been challenging in Unionfs:

case 1: regular file within the same dir
case 2: regular file to a separate dir
case 3: dir within the same dir
case 4: dir to a separate dir

The older semantics of rename were that we tried hard to preserve the
location of a file being renamed.  But this was too complex to maintain
without races, and offered few benefits to users.  Some Unionfs users
proposed that on rename we always copy up the renamed file to the
leftmost branch.  That would indeed be much simpler to implement, but
it seems too excessive and may clutter the leftmost branch too much.
So our new design with ODF provides a trade-off between the old and
proposed implementations.

Our new, simplified rename idea is as follows:

- By the time we're called with rename(src,dst), both dentries have
  been looked up: src must exist, else our lookup would have failed;
  and dst may or may not exist (we'll get a dentry which could be a
  negative dentry).
- If src is a non-empty dir that lies on a read-only branch, we let
  userspace recurse and ask us to copy up each file separately, by
  returning EXDEV (just as in Unionfs 1.x).
- If dst exists on the top branch and is a whiteout or a logically
  empty dir, we remove it before renaming, by:
  - rmdir if it's a physically empty dir
  - silly rename if it's a non-empty dir
  - unlink if it's not a dir
- Start renaming from the leftmost branch of src toward the rightmost,
  stopping if any one rename fails.
- We then copy up to the top branch and rename, if the src's leftmost
  branch was not the top branch and:
  - the rename at the src's leftmost branch failed,
  - dst already existed at a branch to the left of src, or
  - dst was in an opaque dir whose opaqueness branch is to the left of
    src.
- We create a whiteout if any of the renames failed.
- For directories: the same applies to directories as well.  If we need
  to copy up a directory, then that directory is guaranteed to be empty
  because of the EXDEV handling (above), so we just mkdir at the top
  branch.
- We need to purge the readdir cache (/odf/ic) for all successful
  renames, in both the src and dst dirs.  Also, the creation of a
  whiteout should purge the readdir cache of that dir (this can be
  optimized later).

# XXX: need to handle hardlinks and renames

** Persistent Inodes actions

Easy: just create the file in /odf/ns/<F>, using a flexible copyup()
routine that works for whole-file copyup, truncated files, and
zero-length files in /odf/ns/.  On lookup, we need to create the
hierarchy in /odf/ns if it doesn't exist; if/when it does exist, read
the inum from /odf/ns and put it in unionfs's inode.  Let the
dcache/icache algorithms figure out when to purge stuff.  This means
we'll need lots of inodes in /odf, so mkfs it appropriately.  Our
current estimate is to create around 3000 to 3500 inodes per MB in the
ODF, using "mkfs -t FOOFS -i NNN".

* Hardlinks

When looking up a hard-linked file for the first time, create a new
entry in /odf/ic/##/##/<inum>.<uuid>, as described above.  If that
hard-linked file is truncated or written to, we need to copyup; so we
copy up normally, because it's the first time we copy up that
hard-linked file; then we permit the write to proceed.  If we copied up
and created a new inode on the destination branch, then we update the
/odf/ic entry to record the UUID+inum (which were all zero initially).

When looking up a hard-linked file, we check /odf/ic, and let's say we
find it there.  That means the hard-linked file was looked up before,
and possibly copied up.
If no one modifies the hard-linked file, even on the second lookup,
there is no need to copy up the file or its parent hierarchy, so we
just add another entry to /odf/ic.  But if a user modifies the file,
then we have two cases.  (1) If the file is modified for the first
time, then we copy up several parent hierarchies, for all known names
of this hardlink in /odf/ic, and then link the file in those several
places.  (Copyup will have to look up the dentry and then do a d_path
on it to get the full hierarchy to recreate.)  (2) If the file was
already copied up, then the entire known hardlink hierarchy already
exists, so we just create the parent hierarchy for this newly
discovered link and link the file to the already copied-up version.
Then we can allow the write/truncate to proceed.

When we copy up, for hardlinks or otherwise, we also hardlink in
/odf/ns, so we retain the persistent inodes.

We need to record the actual number of links to report for a file, in
case one of the hard-linked names was removed.  We can store this in
/odf/ns/ by stealing the gid field (or some other field we can take
over safely).

> link():

If you create a new hardlink, then we copy up the src as described
above, update /odf/ns as described, and then link the dst to the src.
If someone creates a new hardlink to an already copied-up file, we will
see that as existing info in /odf/ns.

* CREDITS

Our group of stackable file system developers at Stony Brook had
considered something like ODF for several years, and for several
projects (not just Unionfs).  But it wasn't until OLS 2006 that we
began seriously designing and developing this support.  One major
reason was that at OLS '06 we met and discussed the topic with Ted
Ts'o, who was quite supportive of the idea.

The initial design for ODF was done by Yiannis Pericleous, Josef
"Jeff" Sipek, and Erez Zadok.  The bulk of the initial development was
done by Yiannis Pericleous.  Currently, all Unionfs developers help
maintain the ODF code.
This file can be found in a Linux kernel source tree that contains
Unionfs with ODF support (perhaps as a patch), in
linux-X.Y.Z/Documentation/filesystems/unionfs/odf.txt

For more information on Unionfs, see http://unionfs.filesystems.org/.