DO NOT MAIL: xv6 web pages

2008-09-03 04:50:04 +00:00 · 2008-09-03 04:50:04 +00:00 · f53494c28e
commit f53494c28e
parent ee3f75f229
37 changed files with 9034 additions and 0 deletions
--- a/web/l13.html
+++ b/web/l13.html
@ -0,0 +1,245 @@
+<title>High-performance File Systems</title>
+<html>
+<head>
+</head>
+<body>
+
+<h1>High-performance File Systems</h1>
+
+<p>Required reading: soft updates.
+
+<h2>Overview</h2>
+
+<p>A key problem in designing file systems is how to obtain
+performance on file system operations while providing consistency.
+With consistency, we mean, that file system invariants are maintained
+is on disk.  These invariants include that if a file is created, it
+appears in its directory, etc.  If the file system data structures are
+consistent, then it is possible to rebuild the file system to a
+correct state after a failure.
+
+<p>To ensure consistency of on-disk file system data structures,
+  modifications to the file system must respect certain rules:
+<ul>
+
+<li>Never point to a structure before it is initialized. An inode must
+be initialized before a directory entry references it.  An block must
+be initialized before an inode references it.
+
+<li>Never reuse a structure before nullifying all pointers to it.  An
+inode pointer to a disk block must be reset before the file system can
+reallocate the disk block.
+
+<li>Never reset the last point to a live structure before a new
+pointer is set.  When renaming a file, the file system should not
+remove the old name for an inode until after the new name has been
+written.
+</ul>
+The paper calls these dependencies update dependencies.  
+
+<p>xv6 ensures these rules by writing every block synchronously, and
+  by ordering the writes appropriately.  With synchronous, we mean
+  that a process waits until the current disk write has been
+  completed before continuing with execution.
+
+<ul>
+
+<li>What happens if power fails after 4776 in mknod1?  Did we lose the
+  inode for ever? No, we have a separate program (called fsck), which
+  can rebuild the disk structures correctly and can mark the inode on
+  the free list.
+
+<li>Does the order of writes in mknod1 matter?  Say, what if we wrote
+  directory entry first and then wrote the allocated inode to disk?
+  This violates the update rules and it is not a good plan. If a
+  failure happens after the directory write, then on recovery we have
+  an directory pointing to an unallocated inode, which now may be
+  allocated by another process for another file!
+
+<li>Can we turn the writes (i.e., the ones invoked by iupdate and
+  wdir) into delayed writes without creating problems?  No, because
+  the cause might write them back to the disk in an incorrect order.
+  It has no information to decide in what order to write them.
+
+</ul>
+
+<p>xv6 is a nice example of the tension between consistency and
+  performance.  To get consistency, xv6 uses synchronous writes,
+  but these writes are slow, because they perform at the rate of a
+  seek instead of the rate of the maximum data transfer rate. The
+  bandwidth to a disk is reasonable high for large transfer (around
+  50Mbyte/s), but latency is low, because of the cost of moving the
+  disk arm(s) (the seek latency is about 10msec).
+
+<p>This tension is an implementation-dependent one.  The Unix API
+  doesn't require that writes are synchronous.  Updates don't have to
+  appear on disk until a sync, fsync, or open with O_SYNC.  Thus, in
+  principle, the UNIX API allows delayed writes, which are good for
+  performance:
+<ul>
+<li>Batch many writes together in a big one, written at the disk data
+  rate.
+<li>Absorp writes to the same block.
+<li>Schedule writes to avoid seeks.
+</ul>
+
+<p>Thus the question: how to delay writes and achieve consistency?
+  The paper provides an answer.
+
+<h2>This paper</h2>
+
+<p>The paper surveys some of the existing techniques and introduces a
+new to achieve the goal of performance and consistency.
+
+<p>
+
+<p>Techniques possible:
+<ul>
+
+<li>Equip system with NVRAM, and put buffer cache in NVRAM.
+
+<li>Logging.  Often used in UNIX file systems for metadata updates.
+LFS is an extreme version of this strategy.
+
+<li>Flusher-enforced ordering.  All writes are delayed. This flusher
+is aware of dependencies between blocks, but doesn't work because
+circular dependencies need to be broken by writing blocks out.
+
+</ul>
+
+<p>Soft updates is the solution explored in this paper.  It doesn't
+require NVRAM, and performs as well as the naive strategy of keep all
+dirty block in main memory.  Compared to logging, it is unclear if
+soft updates is better.  The default BSD file systems uses soft
+  updates, but most Linux file systems use logging.
+
+<p>Soft updates is a sophisticated variant of flusher-enforced
+ordering.  Instead of maintaining dependencies on the block-level, it
+maintains dependencies on file structure level (per inode, per
+directory, etc.), reducing circular dependencies. Furthermore, it
+breaks any remaining circular dependencies by undo changes before
+writing the block and then redoing them to the block after writing.
+
+<p>Pseudocode for create:
+<pre>
+create (f) {
+   allocate inode in block i  (assuming inode is available)
+   add i to directory data block d  (assuming d has space)
+   mark d has dependent on i, and create undo/redo record
+   update directory inode in block di
+   mark di has dependent on d
+}
+</pre>
+
+<p>Pseudocode for the flusher:
+<pre>
+flushblock (b)
+{
+  lock b;
+  for all dependencies that b is relying on
+    "remove" that dependency by undoing the change to b
+    mark the dependency as "unrolled"
+  write b 
+}
+
+write_completed (b) {
+  remove dependencies that depend on b
+  reapply "unrolled" dependencies that b depended on
+  unlock b
+}
+</pre>
+
+<p>Apply flush algorithm to example:
+<ul>
+<li>A list of two dependencies: directory->inode, inode->directory.
+<li>Lets say syncer picks directory first
+<li>Undo directory->inode changes (i.e., unroll <A,#4>)
+<li>Write directory block
+<li>Remove met dependencies (i.e., remove inode->directory dependency)
+<li>Perform redo operation (i.e., redo <A,#4>)
+<li>Select inode block and write it
+<li>Remove met dependencies (i.e., remove directory->inode dependency)
+<li>Select directory block (it is dirty again!)
+<li>Write it.
+</ul>
+
+<p>An file operation that is important for file-system consistency 
+is rename.  Rename conceptually works as follows:
+<pre>
+rename (from, to)
+   unlink (to);
+   link (from, to);
+   unlink (from);
+</pre>
+
+<p>Rename it often used by programs to make a new version of a file
+the current version.  Committing to a new version must happen
+atomically.  Unfortunately, without a transaction-like support
+atomicity is impossible to guarantee, so a typical file systems
+provides weaker semantics for rename: if to already exists, an
+instance of to will always exist, even if the system should crash in
+the middle of the operation.  Does the above implementation of rename
+guarantee this semantics? (Answer: no).
+
+<p>If rename is implemented as unlink, link, unlink, then it is
+difficult to guarantee even the weak semantics. Modern UNIXes provide
+rename as a file system call:
+<pre>
+   update dir block for to point to from's inode // write block
+   update dir block for from to free entry // write block
+</pre>
+<p>fsck may need to correct refcounts in the inode if the file
+system fails during rename.  for example, a crash after the first
+write followed by fsck should set refcount to 2, since both from
+and to are pointing at the inode.
+
+<p>This semantics is sufficient, however, for an application to ensure
+atomicity. Before the call, there is a from and perhaps a to.  If the
+call is successful, following the call there is only a to.  If there
+is a crash, there may be both a from and a to, in which case the
+caller knows the previous attempt failed, and must retry.  The
+subtlety is that if you now follow the two links, the "to" name may
+link to either the old file or the new file.  If it links to the new
+file, that means that there was a crash and you just detected that the
+rename operation was composite.  On the other hand, the retry
+procedure can be the same for either case (do the rename again), so it
+isn't necessary to discover how it failed.  The function follows the
+golden rule of recoverability, and it is idempotent, so it lays all
+the needed groundwork for use as part of a true atomic action.
+
+<p>With soft updates renames becomes:
+<pre>
+rename (from, to) {
+   i = namei(from);
+   add "to" directory data block td a reference to inode i
+   mark td dependent on block i
+   update directory inode "to" tdi
+   mark tdi as dependent on td
+   remove "from" directory data block fd a reference to inode i
+   mark fd as dependent on tdi
+   update directory inode in block fdi
+   mark fdi as dependent on fd
+}
+</pre>
+<p>No synchronous writes!
+
+<p>What needs to be done on recovery?  (Inspect every statement in
+rename and see what inconsistencies could exist on the disk; e.g.,
+refcnt inode could be too high.)  None of these inconsitencies require
+fixing before the file system can operate; they can be fixed by a
+background file system repairer. 
+
+<h2>Paper discussion</h2>
+
+<p>Do soft updates perform any useless writes? (A useless write is a
+write that will be immediately overwritten.)  (Answer: yes.) Fix
+syncer to becareful with what block to start.  Fix cache replacement
+to selecting LRU block with no pendending dependencies.
+
+<p>Can a log-structured file system implement rename better? (Answer:
+yes, since it can get the refcnts right).
+
+<p>Discuss all graphs.
+
+</body>
+