-Xv6, a simple Unix-like teaching operating system
-
-
-
-
-
Xv6, a simple Unix-like teaching operating system
-
-
Introduction
-
-Xv6 is a teaching operating system developed in the summer of 2006 for
-MIT's operating systems
-course, 6.828: Operating
-Systems Engineering. We hope that xv6 will be useful in other
-courses too. This page collects resources to aid the use of xv6 in
-other courses, including a commentary on the source code itself.
-
-
History and Background
-
-
For many years, MIT had no operating systems course. In the fall of 2002,
-one was created to teach operating systems engineering. In the course lectures,
-the class worked through Sixth Edition Unix (aka V6) using
-John Lions's famous commentary. In the lab assignments, students wrote most of
-an exokernel operating system, eventually named JOS, for the Intel x86.
-Exposing students to multiple systems (V6 and JOS) helped develop a
-sense of the spectrum of operating system designs.
-
-
-V6 presented pedagogic challenges from the start.
-Students doubted the relevance of an obsolete 30-year-old operating system
-written in an obsolete programming language (pre-K&R C)
-running on obsolete hardware (the PDP-11).
-Students also struggled to learn the low-level details of two different
-architectures (the PDP-11 and the Intel x86) at the same time.
-By the summer of 2006, we had decided to replace V6
-with a new operating system, xv6, modeled on V6
-but written in ANSI C and running on multiprocessor
-Intel x86 machines.
-Xv6's use of the x86 makes it more relevant to
-students' experience than V6 was
-and unifies the course around a single architecture.
-Adding multiprocessor support requires handling concurrency head-on with
-locks and threads (instead of using special-case solutions for
-uniprocessors such as
-enabling/disabling interrupts) and increases relevance.
-Finally, writing a new system allowed us to write cleaner versions
-of the rougher parts of V6, like the scheduler and file system.
-6.828 substituted xv6 for V6 in the fall of 2006.
-
-
Xv6 sources and text
-
-The latest xv6 source is available via
-
git clone git://pdos.csail.mit.edu/xv6/xv6.git
-We also distribute the sources as a printed booklet with line numbers
-that keep everyone together during lectures. The booklet is available as xv6-rev6.pdf. To get the version
-corresponding to this booklet, run
-
git checkout -b xv6-rev6 xv6-rev6
-
-
-The xv6 source code is licensed under
-the traditional MIT
-license; see the LICENSE file in the source distribution. To help students
-read through xv6 and learn about the main ideas in operating systems we also
-distribute a textbook/commentary for the latest xv6.
-The line numbers in this book refer to the above source booklet.
-
-
-xv6 compiles using the GNU C compiler,
-targeted at the x86 using ELF binaries.
-On BSD and Linux systems, you can use the native compilers;
-on OS X, which doesn't use ELF binaries,
-you must use a cross-compiler.
-Xv6 does boot on real hardware, but typically
-we run it using the QEMU emulator.
-Both the GCC cross compiler and QEMU
-can be found on the 6.828 tools page.
-
-
Xv6 lecture material
-
-In 6.828, the lectures in the first half of the course cover the xv6 sources and
-text. The lectures in the second half consider advanced topics using research
-papers; for some, xv6 serves as a useful base for making discussions concrete.
-The lecture notes are available from the 6.828 schedule page.
-
-
-
Unix Version 6
-
-
6.828's xv6 is inspired by Unix V6 and by:
-
-
-
-
Lions' Commentary on UNIX 6th Edition, John Lions, Peer to
-Peer Communications; ISBN: 1-57398-013-7; 1st edition (June 14, 2000).
-
-If you are interested in using xv6 or have used xv6 in a course,
-we would love to hear from you.
-If there's anything that we can do to make xv6 easier
-to adopt, we'd like to hear about it.
-We'd also be interested to hear what worked well and what didn't.
-
-You can reach all of us at 6.828-staff@pdos.csail.mit.edu.
-
diff --git a/web/l-bugs.html b/web/l-bugs.html
deleted file mode 100644
index 493372d..0000000
--- a/web/l-bugs.html
+++ /dev/null
@@ -1,187 +0,0 @@
-
OS Bugs
-
-
-
-
-
-
OS Bugs
-
-
Required reading: Bugs as Deviant Behavior
-
-
Overview
-
-
Operating systems must obey many rules for correctness and
-performance. Example rules:
-
-
Do not call blocking functions with interrupts disabled or a spin
-lock held
-
Check for NULL results
-
Do not allocate large stack variables
-
Do not re-use already-allocated memory
-
Check user pointers before using them in kernel mode
-
Release acquired locks
-
-
-
In addition, there are standard software engineering rules, like
-using function results in consistent ways.
-
-
These rules are typically not checked by a compiler, even though
-they could be checked by a compiler, in principle. The goal of the
-meta-level compilation project is to allow system implementors to
-write system-specific compiler extensions that check the source code
-for rule violations.
-
-
The results are good: many new bugs (500-1000) have been found in Linux
-alone. The paper for today studies these bugs and attempts to draw
-lessons from them.
-
-
Are kernel errors worse than user-level errors? That is, if we get
-the kernel correct, then we won't have system crashes?
-
-
Errors in JOS kernel
-
-
What are unstated invariants in the JOS?
-
-
Interrupts are disabled in kernel mode
-
Only env 1 has access to disk
-
All registers are saved & restored on context switch
-
Application code is never executed with CPL 0
-
Don't allocate an already-allocated physical page
-
Propagate error messages to user applications (e.g., out of
-resources)
-
Map pipe before fd
-
Unmap fd before pipe
-
A spawned program should have open only file descriptors 0, 1, and 2.
-
Sometimes pass a size in bytes and sometimes a block number to a
-given file system function.
-
User pointers should be run through TRUP before being used by the kernel
-
-
-
Could these errors have been caught by metacompilation? Would
-metacompilation have caught the pipe race condition? (Probably not,
-it happens in only one place.)
-
-
How confident are you that your code is correct? For example,
-are you sure interrupts are always disabled in kernel mode? How would
-you test?
-
-
Metacompilation
-
-
A system programmer writes the rule checkers in a high-level,
-state-machine language (metal). These checkers are dynamically linked
-into an extensible version of g++, xg++. Xg++ applies the rule
-checkers to every possible execution path of a function that is being
-compiled.
-
-
Some checkers produce false positives, because of limitations of
-both static analysis and the checkers, which mostly use local
-analysis.
-
-
How does the block checker work? The first pass is a rule
-that marks functions as potentially blocking. After processing a
-function, the checker emits the function's flow graph to a file
-(including, annotations and functions called). The second pass takes
-the merged flow graph of all function calls, and produces a file with
-all functions that have a path in the control-flow-graph to a blocking
-function call. For the Linux kernel this results in 3,000 functions
-that potentially could call sleep. Yet another checker like
-check_interrupts checks if a function calls any of the 3,000 functions
-with interrupts disabled. Etc.
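The two passes described above amount to a reachability computation over the call graph. Here is a hedged sketch of that idea (the function names, array encoding, and sizes are illustrative assumptions, not the paper's xg++ implementation):

```c
#define NFUNC 8

int calls[NFUNC][NFUNC];   /* calls[f][g] = 1 if function f calls g */
int blocks[NFUNC];         /* blocks[f]   = 1 if f itself may block */

/* The second pass described above: does f have a path in the
   call graph to a blocking function? */
int canblock(int f, int visited[NFUNC])
{
    if (blocks[f])
        return 1;
    visited[f] = 1;
    for (int g = 0; g < NFUNC; g++)
        if (calls[f][g] && !visited[g] && canblock(g, visited))
            return 1;
    return 0;
}
```

A checker like check_interrupts would then flag any call made with interrupts disabled to a function for which canblock returns 1.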
-
-
This paper
-
-
Writing rules is painful. First, you have to write them. Second,
-how do you decide what to check? Was it easy to enumerate all
-conventions for JOS?
-
-
Insight: infer programmer "beliefs" from code and cross-check
-for contradictions. If cli is always followed by sti,
-except in one case, perhaps something is wrong. This simplifies
-life because we can write generic checkers instead of checkers
-that specifically check for sti, and perhaps we get lucky
-and find other temporal ordering conventions.
-
-
Do we know which case is wrong? The 999 times or the 1 time that
-sti is absent? (No, this method cannot figure out what the correct
-sequence is, but it can flag that something is weird, which is useful
-in practice.) The method just detects inconsistencies.
-
-
Is every inconsistency an error? No, some inconsistencies don't
-indicate an error. If a call to function f is often followed
-by call to function g, does that imply that f should always be
-followed by g? (No!)
-
-
Solution: MUST beliefs and MAYBE beliefs. MUST beliefs are
-invariants that must hold; any inconsistency indicates an error. If a
-pointer is dereferenced, then the programmer MUST believe that the
-pointer is pointing to something that can be dereferenced (i.e., the
-pointer is definitely not zero). MUST beliefs can be checked using
-"internal inconsistencies".
-
-
-An aside: can null-pointer dereferences be detected at runtime?
-(Sure, unmap the page at address zero.) Why is metacompilation still
-valuable? (At runtime you will find only the null pointers that your
-test code dereferenced; not all possible dereferences of null
-pointers.) An even more convincing example for Metacompilation is
-tracking user pointers that the kernel dereferences. (Is this a MUST
-belief?)
-
-
MAYBE beliefs are invariants that are suggested by the code, but
-they may be coincidences. MAYBE beliefs are ranked by statistical
-analysis, and perhaps augmented with input about function names
-(e.g., alloc and free are important). Is it computationally feasible
-to check every MAYBE belief? Could there be much noise?
-
-
What errors won't this approach catch?
-
-
Paper discussion
-
-
This paper is best discussed by studying every code fragment. Most
-code fragments are pieces of code from Linux distributions; these
-mistakes are real!
-
-
Section 3.1. what is the error? how does metacompilation catch
-it?
-
-
Figure 1. what is the error? is there one?
-
-
Code fragments from 6.1. what is the error? how does metacompilation catch
-it?
-
-
Figure 3. what is the error? how does metacompilation catch
-it?
-
-
Section 8.3. what is the error? how does metacompilation catch
-it?
-
-
-
diff --git a/web/l-coordination.html b/web/l-coordination.html
deleted file mode 100644
index 79b578b..0000000
--- a/web/l-coordination.html
+++ /dev/null
@@ -1,354 +0,0 @@
-
L9
-
-
-
-
-
-
Coordination and more processes
-
-
Required reading: remainder of proc.c, sys_exec, sys_sbrk,
- sys_wait, sys_exit, and sys_kill.
-
-
Overview
-
-
Big picture: more programs than processors. How to share the
- limited number of processors among the programs? Last lecture
- covered the basic mechanism: threads and the distinction between process
- and thread. Today we expand on that: how to coordinate the interactions
- between threads explicitly, and some operations on processes.
-
-
Sequence coordination. This is a different type of coordination
- than mutual-exclusion coordination (whose goal is to make
- actions atomic so that threads don't interfere). The goal of
- sequence coordination is for threads to coordinate the sequences in
- which they run.
-
-
For example, a thread may want to wait until another thread
- terminates. One way to do so is to have the thread run periodically,
- let it check if the other thread terminated, and if not give up the
- processor again. This is wasteful, especially if there are many
- threads.
-
-
With primitives for sequence coordination one can do better. The
- thread could tell the thread manager that it is waiting for an event
- (e.g., another thread terminating). When the other thread
- terminates, it explicitly wakes up the waiting thread. This is more
- work for the programmer, but more efficient.
-
-
Sequence coordination often interacts with mutual-exclusion
- coordination, as we will see below.
-
-
The operating system literature has a rich set of primitives for
- sequence coordination. We study a very simple version of condition
- variables in xv6: sleep and wakeup, with a single lock.
-
-
xv6 code examples
-
-
Sleep and wakeup - usage
-
-Let's consider implementing a producer/consumer queue
-(like a pipe) that can be used to hold a single non-null pointer:
-
-
Easy and correct, at least assuming there is at most one
-reader and at most one writer at a time.
-
-
Unfortunately, the while loops are inefficient.
-Instead of polling, it would be great if there were
-primitives saying ``wait for some event to happen''
-and ``this event happened''.
-That's what sleep and wakeup do.
-
-
-
-This is okay, and now safer for multiple readers and writers,
-except that wakeup wakes up everyone who is asleep on chan,
-not just one guy.
-So some of the guys who wake up from sleep might not
-be cleared to read or write from the queue. Have to go back to looping:
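That fragment is also not preserved here. As a user-space stand-in for xv6's sleep and wakeup (an assumption for illustration only; the following discussion is about xv6's own sleep(chan, lk)), POSIX condition variables show the same looping shape, re-checking the condition after every wakeup:

```c
#include <pthread.h>

struct pcq {
    void *ptr;
    pthread_mutex_t lock;
    pthread_cond_t cond;
};

void *pcqread(struct pcq *q)
{
    void *p;
    pthread_mutex_lock(&q->lock);
    while (q->ptr == 0)                        /* loop: we may be woken */
        pthread_cond_wait(&q->cond, &q->lock); /* yet not be the one    */
    p = q->ptr;                                /* cleared to proceed    */
    q->ptr = 0;
    pthread_cond_broadcast(&q->cond);          /* wake waiting writers */
    pthread_mutex_unlock(&q->lock);
    return p;
}

void pcqwrite(struct pcq *q, void *p)
{
    pthread_mutex_lock(&q->lock);
    while (q->ptr != 0)
        pthread_cond_wait(&q->cond, &q->lock);
    q->ptr = p;
    pthread_cond_broadcast(&q->cond);          /* wake waiting readers */
    pthread_mutex_unlock(&q->lock);
}
```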
-
-
The problem is that now we're using lk to protect
-access to the p->chan and p->state variables
-but other routines besides sleep and wakeup
-(in particular, proc_kill) will need to use them and won't
-know which lock protects them.
-So instead of protecting them with lk, let's use proc_table_lock:
-
-
One could probably make things work with lk as above,
-but the relationship between data and locks would be
-more complicated with no real benefit. Xv6 takes the easy way out
-and says that elements in the proc structure are always protected
-by proc_table_lock.
-
-
Use example: exit and wait
-
-
If proc_wait decides there are children to be waited for,
-it calls sleep at line 2462.
-When a process exits, proc_exit scans the process table
-to find the parent and wakes it at 2408.
-
-
Which lock protects sleep and wakeup from missing each other?
-Proc_table_lock. Have to tweak sleep again to avoid double-acquire:
-
-
Proc_kill marks a process as killed (line 2371).
-When the process finally exits the kernel to user space,
-or if a clock interrupt happens while it is in user space,
-it will be destroyed (lines 2886, 2890, 2912).
-
-
Why wait until the process ends up in user space?
-
-
What if the process is stuck in sleep? It might take a long
-time to get back to user space.
-Don't want to have to wait for it, so make sleep wake up early
-(line 2373).
-
-
This means all callers of sleep should check
-whether they have been killed, but none do.
-Bug in xv6.
-
-
System call handlers
-
-
Sheet 32
-
-
Fork: discussed copyproc in earlier lectures.
-Sys_fork (line 3218) just calls copyproc
-and marks the new proc runnable.
-Does fork create a new process or a new thread?
-Is there any shared context?
-
-
Exec: we'll talk about exec later, when we talk about file systems.
-
-
Required reading: iread, iwrite, and wdir, and code related to
- these calls in fs.c, bio.c, ide.c, file.c, and sysfile.c
-
-
Overview
-
-
The next 3 lectures are about file systems:
-
-
Basic file system implementation
-
Naming
-
Performance
-
-
-
Users desire to store their data durably, so that it survives when
-the user turns off the computer. The primary media for doing so are:
-magnetic disks, flash memory, and tapes. We focus on magnetic disks
-(e.g., through the IDE interface in xv6).
-
-
To allow users to remember where they stored a file, they can
-assign a symbolic name to a file, which appears in a directory.
-
-
The data in a file can be organized in a structured way or not.
-The structured variant is often called a database. UNIX uses the
-unstructured variant: files are streams of bytes. Any particular
-structure is likely to be useful to only a small class of
-applications, and other applications will have to work hard to fit
-their data into one of the pre-defined structures. Besides, if you
-want structure, you can easily write a user-mode library program that
-imposes that format on any file. The end-to-end argument in action.
-(Databases have special requirements and support an important class of
-applications, and thus have a specialized plan.)
-
-
The API for a minimal file system consists of: open, read, write,
-seek, close, and stat. Dup duplicates a file descriptor. For example:
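The lecture's example is not preserved; a hedged POSIX illustration (the path /tmp/fsdemo.txt and the function name demo are arbitrary choices) of the calls, including the shared offset behind dup:

```c
#include <fcntl.h>
#include <unistd.h>

/* Write "hello world\n" to path through two dup'ed descriptors
   that share a single file offset. */
void demo(const char *path)
{
    int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    int fd2 = dup(fd);            /* fd2 shares fd's file offset */

    write(fd, "hello ", 6);
    write(fd2, "world\n", 6);     /* continues where fd left off */

    close(fd);
    close(fd2);
}
```

Because the two descriptors share one offset, the second write lands after the first instead of overwriting it.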
-
Maintaining the file offset behind the read/write interface is an
- interesting design decision. The alternative is that the state of a
- read operation should be maintained by the process doing the reading
- (i.e., that the pointer should be passed as an argument to read).
- This argument is compelling in view of the UNIX fork() semantics,
- which clone a process that shares the file descriptors of its
- parent. A read by the parent through a shared file descriptor (e.g.,
- stdin) changes the read pointer seen by the child. On the other
- hand, the alternative would make it difficult to get "(data; ls) > x"
- right.
-
-
The Unix API doesn't specify that the effects of a write reach
- the disk before the write returns. It is up to the implementation
- of the file system, within certain bounds. Choices include (and they
- aren't mutually exclusive):
-
-
At some point in the future, if the system stays up (e.g., after
- 30 seconds);
-
Before the write returns;
-
Before close returns;
-
User specified (e.g., before fsync returns).
-
-
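The last choice corresponds to POSIX fsync; a small sketch (the function name and path are arbitrary):

```c
#include <fcntl.h>
#include <unistd.h>

/* Append a record and force it to stable storage before returning:
   the "user specified" choice above. */
int logwrite(const char *path, const char *rec, int n)
{
    int fd = open(path, O_CREAT | O_WRONLY | O_APPEND, 0644);
    if (fd < 0)
        return -1;
    write(fd, rec, n);
    int r = fsync(fd);   /* returns only once the data is on disk */
    close(fd);
    return r;
}
```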
-
A design issue is the semantics of a file system operation that
- requires multiple disk writes. In particular, what happens if the
- logical update requires writing multiple disk blocks and the power
- fails during the update? For example, creating a new file
- requires allocating an inode (which requires updating the list of
- free inodes on disk) and writing a directory entry to record the
- allocated i-node under the name of the new file (which may require
- allocating a new block and updating the directory inode). If the
- power fails during the operation, the list of free inodes and blocks
- may be inconsistent with the blocks and inodes in use. Again, it is
- up to the implementation of the file system to keep on-disk data
- structures consistent:
-
-
Don't worry about it too much, but use a recovery program to bring the
- file system back into a consistent state.
-
Journaling file system. Never let the file system get into an
- inconsistent state.
-
-
-
Another design issue is the semantics of concurrent writes to
-the same data item. What is the order of two updates that happen at
-the same time? For example, two processes open the same file and write
-to it. Modern Unix operating systems allow the application to lock a
-file to get exclusive access. If file locking is not used and if the
-file descriptor is shared, then the bytes of the two writes will get
-into the file in some order (this happens often for log files). If
-the file descriptor is not shared, the end result is not defined. For
-example, one write may overwrite the other one (e.g., if they are
-writing to the same part of the file.)
-
-
An implementation issue is performance, because writing to magnetic
-disk is relatively expensive compared to computing. Three primary ways
-to improve performance are: careful file system layout that induces
-few seeks, an in-memory cache of frequently-accessed blocks, and
-overlap I/O with computation so that file operations don't have to
-wait until their completion and so that the disk driver has more
-data to write, which allows disk scheduling. (We will talk about
-performance in detail later.)
-
-
xv6 code examples
-
-
xv6 implements a minimal Unix file system interface. xv6 doesn't
-pay attention to file system layout. It overlaps computation and I/O,
-but doesn't do any disk scheduling. Its cache is write-through, which
-simplifies keeping on-disk data structures consistent, but is bad for
-performance.
-
-
On disk, files are represented by an inode (struct dinode in fs.h),
-and blocks. Small files have up to 12 block addresses in their inode;
-large files use the last address in the inode as the disk address of
-a block holding 128 further disk addresses (512/4). The size of a file is
-thus limited to 12 * 512 + 128*512 bytes. What would you change to
-support larger files? (Ans: e.g., double indirect blocks.)
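That limit can be stated directly in code; a small sketch (the constant names are assumptions modeled on the description above, not quoted from fs.h):

```c
#define BSIZE 512                 /* block size in bytes */
#define NDIRECT 12                /* direct block addresses in the inode */
#define NINDIRECT (BSIZE / 4)     /* 128 four-byte addresses per indirect block */

/* maximum file size, in bytes, under the layout described above */
unsigned int maxfilesize(void)
{
    return (NDIRECT + NINDIRECT) * BSIZE;   /* 12*512 + 128*512 = 71680 */
}
```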
-
-
Directories are files with a bit of structure to them. The file
-consists of records of the type struct dirent. Each entry contains the
-name for a file (or directory) and its corresponding inode number.
-How many files can appear in a directory?
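A hypothetical dirent layout matching this description (the 14-byte name field is an assumption modeled on V6-style directories, not quoted from fs.h):

```c
#define DIRSIZ 14

struct dirent {
    unsigned short inum;    /* inode number; 0 marks a free entry */
    char name[DIRSIZ];      /* file name, NUL-padded */
};

/* With 16-byte entries, a 512-byte directory block holds 32 of them,
   so the number of files is bounded by the directory file's maximum
   size, not by a fixed per-directory count. */
```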
-
-
In memory, files are represented by struct inode in fsvar.h. What is
-the role of the additional fields in struct inode?
-
-
What is xv6's disk layout? How does xv6 keep track of free blocks
- and inodes? See balloc()/bfree() and ialloc()/ifree(). Is this
- layout a good one for performance? What are other options?
-
-
Let's assume that an application created a file x that
- contains 512 bytes, and that the application now calls read(fd, buf,
- 100), that is, it is requesting to read 100 bytes into buf.
- Furthermore, let's assume that the inode for x is i. Let's pick
- up what happens by investigating readi(), line 4483.
-
-
4488-4492: can iread be called on other objects than files? (Yes.
- For example, read from the keyboard.) Everything is a file in Unix.
-
4495: what does bmap do?
-
-
4384: what block is being read?
-
-
4483: what does bread do? does bread always cause a read to disk?
-
-
4006: what does bget do? it implements a simple cache of
- recently-read disk blocks.
-
-
How big is the cache? (see param.h)
-
3972: look if the requested block is in the cache by walking down
- a circular list.
-
3977: we had a match.
-
3979: some other process has "locked" the block; wait until it
- releases. The other process releases the block using brelse().
-Why lock a block?
-
-
Atomic read and update. For example, allocating an inode: read
- block containing inode, mark it allocated, and write it back. This
- operation must be atomic.
-
-
3982: it is ours now.
-
3987: it is not in the cache; we need to find a cache entry to
- hold the block.
-
3987: what is the cache replacement strategy? (see also brelse())
-
3988: found an entry that we are going to use.
-
3989: mark it ours but don't mark it valid (there is no valid data
- in the entry yet).
-
-
4007: if the block was in the cache and the entry has the block's
- data, return.
-
4010: if the block wasn't in the cache, read it from disk. are
- read's synchronous or asynchronous?
-
-
3836: a bounded buffer of outstanding disk requests.
-
3809: tell the disk to move arm and generate an interrupt.
-
3851: go to sleep and let some other process run. time sharing
- in action.
-
3792: interrupt: arm is in the right position; wakeup requester.
-
3856: read block from disk.
-
3860: remove request from bounded buffer. wakeup processes that
- are waiting for a slot.
-
3864: start next disk request, if any. xv6 can overlap I/O with
-computation.
-
-
4011: mark the cache entry as holding the data.
-
-
4498: To where is the block copied? is dst a valid user address?
-
-
-
Now let's suppose that the process is writing 512 bytes at the end
- of the file x. How many disk writes will happen?
-
-
4567: allocate a new block
-
-
4518: allocate a block: scan block map, and write entry
-
4523: How many disk operations if the process would have been appending
- to a large file? (Answer: read indirect block, scan block map, write
- block map.)
-
-
4572: read the block that the process will be writing, in case the
- process writes only part of the block.
-
4574: write it. is it synchronous or asynchronous? (Ans:
- synchronous but with timesharing.)
-
-
-
Lots of code to implement reading and writing of files. How about
- directories?
-
-
4722: scan the directory, reading directory blocks to see if a
- directory entry is unused (inum == 0).
-
4729: use it and update it.
-
4735: write the modified block.
-
-
Reading and writing of directories is trivial.
-
-
diff --git a/web/l-interrupt.html b/web/l-interrupt.html
deleted file mode 100644
index 363af5e..0000000
--- a/web/l-interrupt.html
+++ /dev/null
@@ -1,174 +0,0 @@
-
-
Lecture 6: Interrupts & Exceptions
-
-
-
Interrupts & Exceptions
-
-
-Required reading: xv6 trapasm.S, trap.c, syscall.c, usys.S.
-
-You will need to consult
-IA32 System
-Programming Guide chapter 5 (skip 5.7.1, 5.8.2, 5.12.2).
-
-
Overview
-
-
-Big picture: kernel is trusted third-party that runs the machine.
-Only the kernel can execute privileged instructions (e.g.,
-changing MMU state).
-The processor enforces this protection through the ring bits
-in the code segment.
-If a user application needs to carry out a privileged operation
-or other kernel-only service,
-it must ask the kernel nicely.
-How can a user program change to the kernel address space?
-How can the kernel transfer to a user address space?
-What happens when a device attached to the computer
-needs attention?
-These are the topics for today's lecture.
-
-
-There are three kinds of events that must be handled
-by the kernel, not user programs:
-(1) a system call invoked by a user program,
-(2) an illegal instruction or other kind of bad processor state (memory fault, etc.),
-and
-(3) an interrupt from a hardware device.
-
-
-Although these three events are different, they all use the same
-mechanism to transfer control to the kernel.
-This mechanism consists of three steps that execute as one atomic unit.
-(a) change the processor to kernel mode;
-(b) save the old processor state somewhere (usually the kernel stack);
-and (c) change the processor state to the values set up as
-the “official kernel entry values.”
-The exact implementation of this mechanism differs
-from processor to processor, but the idea is the same.
-
-
-We'll work through examples of these today in lecture.
-You'll see all three in great detail in the labs as well.
-
-
-A note on terminology: sometimes we'll
-use interrupt (or trap) to mean both interrupts and exceptions.
-
-
-xv6 Sheet 28: tvinit and idtinit.
-Note setting of gate for T_SYSCALL
-
-
-xv6 Sheet 29: vectors.pl (also see generated vectors.S).
-
-
-System calls
-
-
-
-xv6 Sheet 16: init.c calls open("console").
-How is that implemented?
-
-
-xv6 usys.S (not in book).
-(No saving of registers. Why?)
-
-
-Breakpoint 0x1b:"open",
-step past int instruction into kernel.
-
-
-See handout Figure 9-4 [sic].
-
-
-xv6 Sheet 28: in vectors.S briefly, then in alltraps.
-Step through to call trap, examine registers and stack.
-How will the kernel find the argument to open?
-
-
-What happens if a user program divides by zero
-or accesses unmapped memory?
-Exception. Same path as system call until trap.
-
-
-What happens if kernel divides by zero or accesses unmapped memory?
-
-
-Interrupts
-
-
-
-Like system calls, except:
-devices generate them at any time,
-there are no arguments in CPU registers,
-nothing to return to,
-usually can't ignore them.
-
-
-How do they get generated?
-Device essentially phones up the
-interrupt controller and asks to talk to the CPU.
-Interrupt controller then buzzes the CPU and
-tells it, “keyboard on line 1.”
-Interrupt controller is essentially the CPU's
-administrative assistant,
-managing the phone lines on the CPU's behalf.
-
-
-Have to set up interrupt controller.
-
-
-(Briefly) xv6 Sheet 63: pic_init sets up the interrupt controller,
-irq_enable tells the interrupt controller to let the given
-interrupt through.
-
-
-(Briefly) xv6 Sheet 68: pit8253_init sets up the clock chip,
-telling it to interrupt on IRQ_TIMER 100 times/second.
-console_init sets up the keyboard, enabling IRQ_KBD.
-
-
-In Bochs, set breakpoint at 0x8:"vector0"
-and continue, loading kernel.
-Step through clock interrupt, look at
-stack, registers.
-
-
-Was the processor executing in kernel or user mode
-at the time of the clock interrupt?
-Why? (Have any user-space instructions executed at all?)
-
-
-Can the kernel get an interrupt at any time?
-Why or why not? cli and sti,
-irq_enable.
-
-
-
diff --git a/web/l-lock.html b/web/l-lock.html
deleted file mode 100644
index eea8217..0000000
--- a/web/l-lock.html
+++ /dev/null
@@ -1,322 +0,0 @@
-
L7
-
-
-
-
-
-
Locking
-
-
Required reading: spinlock.c
-
-
Why coordinate?
-
-
Mutual-exclusion coordination is an important topic in operating
-systems, because many operating systems run on
-multiprocessors. Coordination techniques protect variables that are
-shared among multiple threads and updated concurrently. These
-techniques allow programmers to implement atomic sections so that one
-thread can safely update the shared variables without having to worry
-about another thread intervening. For example, processes in xv6 may
-run concurrently on different processors and, in kernel mode, share
-kernel data structures. We must ensure that these updates happen
-correctly.
-
-
List and insert example:
-
-
-struct List {
- int data;
- struct List *next;
-};
-
-struct List *list = 0;
-
-void insert(int data) {
-  struct List *l = malloc(sizeof(struct List));
- l->data = data;
- l->next = list; // A
- list = l; // B
-}
-
-
-
What needs to be atomic? The two statements labeled A and B should
-always be executed together, as an indivisible fragment of code. If
-two processors execute A and B interleaved, then we end up with an
-incorrect list. To see that this is the case, draw out the list after
-the sequence A1 (statement A executed by processor 1), A2 (statement A
-executed by processor 2), B2, and B1.
-
-
-How could this erroneous sequence happen? The variable list
-lives in physical memory shared among multiple processors, connected
-by a bus. The accesses to the shared memory will be ordered in some
-total order by the bus/memory system. If the programmer doesn't
-coordinate the execution of the statements A and B, any order can
-happen, including the erroneous one.
-
-
The erroneous case is called a race condition. The problem with
-races is that they are difficult to reproduce. For example, if you
-put print statements in to debug the incorrect behavior, you might
-change the timing and the race might not happen anymore.
-
-
Atomic instructions
-
-
The programmer must be able to express that A and B should be executed
-as a single atomic unit. We generally use a concept like locks
-to mark an atomic region, acquiring the lock at the beginning of the
-section and releasing it at the end:
-
-
Acquire and release, of course, need to be atomic too, which can,
-for example, be done with a hardware atomic test-and-set (TSL)
-instruction:
-
-
The semantics of TSL are:
-
- R <- [mem] // load content of mem into register R
- [mem] <- 1 // store 1 in mem.
-
-
-
In a hardware implementation, the bus arbiter guarantees that both
-the load and store are executed without any other load/stores coming
-in between.
-
-
We can use locks to implement an atomic insert, or we can use
-TSL directly:
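The original fragment is not preserved here; a sketch of the lock-protected insert, repeating the list declaration for completeness and using C11 atomic_exchange as a stand-in for the TSL instruction above (an assumption for illustration):

```c
#include <stdatomic.h>
#include <stdlib.h>

struct List {
    int data;
    struct List *next;
};

struct List *list = 0;
atomic_int listlock = 0;        /* 0 = free, 1 = held */

void acquire(atomic_int *l)
{
    while (atomic_exchange(l, 1) != 0)
        ;                       /* spin: TSL returned 1, lock was held */
}

void release(atomic_int *l)
{
    atomic_store(l, 0);
}

void insert(int data)
{
    struct List *l = malloc(sizeof(*l));
    l->data = data;
    acquire(&listlock);
    l->next = list;             /* A */
    list = l;                   /* B: A and B can no longer interleave */
    release(&listlock);
}
```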
-
It is the programmer's job to make sure that locks are respected. If
-a programmer writes another function that manipulates the list, the
-programmer must make sure that the new function acquires and
-releases the appropriate locks. If the programmer doesn't, race
-conditions occur.
-
-
This code assumes that stores commit to memory in program order and
-that all stores by other processors started before insert got the lock
-are observable by this processor. That is, after the other processor
-released a lock, all the previous stores are committed to memory. If
-a processor executes instructions out of order, this assumption won't
-hold and we must, for example, insert a barrier instruction that makes the
-assumption true.
-
-
-
Example: Locking on x86
-
-
Here is one way we can implement acquire and release using the x86
-xchgl instruction:
-
-
-struct Lock {
- unsigned int locked;
-};
-
-acquire(Lock *lck) {
- while(TSL(&(lck->locked)) != 0)
- ;
-}
-
-release(Lock *lck) {
- lck->locked = 0;
-}
-
-int
-TSL(int *addr)
-{
- register int content = 1;
- // xchgl content, *addr
- // xchgl exchanges the values of its two operands, while
- // locking the memory bus to exclude other operations.
- asm volatile ("xchgl %0,%1" :
- "=r" (content),
- "=m" (*addr) :
- "0" (content),
- "m" (*addr));
- return(content);
-}
-
-
-
the instruction "XCHG %eax, (content)" works as follows:
-
-
freeze other CPUs' memory activity
-
temp := content
-
content := %eax
-
%eax := temp
-
un-freeze other CPUs
-
-
-
steps 1 and 5 make XCHG special: it is "locked"; special signal
- lines on the inter-CPU bus handle bus arbitration
-
-
This implementation doesn't scale to a large number of processors;
- in a later lecture we will see how we could do better.
-
-
Lock granularity
-
-
Acquire/release is ideal for short atomic sections: incrementing a
-counter, searching the i-node cache, allocating a free buffer.
-
-
What are spin locks not so great for? Long atomic sections may
- waste waiters' CPU time, and it is dangerous to sleep while holding
- locks. In xv6 we try to avoid long atomic sections by careful coding
- (can you find an example?). xv6 doesn't release the processor when
- holding a lock, but has an additional set of coordination primitives
- (sleep and wakeup), which we will study later.
-
-
My list_lock protects all lists; inserts to different lists are
- serialized. A lock per list would waste less time spinning, so you
- might want "fine-grained" locks, one for every object. BUT
- acquire/release are expensive (500 cycles on my 3 GHz machine)
- because they need to talk off-chip.
-
-
Also, "correctness" is not that simple with fine-grained locks if
- need to maintain global invariants; e.g., "every buffer must be on
- exactly one of free list and device list". Per-list locks are
- irrelevant for this invariant. So you might want "large-grained",
- which reduces overhead but reduces concurrency.
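As a sketch of the trade-off (illustrative names, with POSIX mutexes standing in for spinlocks): a per-list lock lets inserts to different lists run in parallel, but a cross-list invariant such as "a node is on exactly one list" immediately needs two locks at once.

```c
#include <assert.h>
#include <pthread.h>

struct node {
    int item;
    struct node *next;
};

/* Fine-grained variant: each list carries its own lock, so inserts to
 * different lists proceed in parallel (a coarse-grained design would
 * use one global lock serializing them all). */
struct locked_list {
    pthread_mutex_t lock;   /* protects head and the next pointers */
    struct node *head;
};

void list_insert(struct locked_list *l, struct node *n)
{
    pthread_mutex_lock(&l->lock);
    n->next = l->head;
    l->head = n;
    pthread_mutex_unlock(&l->lock);
}

/* A cross-list invariant needs both locks at once; acquiring them in
 * a fixed (address) order avoids deadlock between two concurrent
 * movers going in opposite directions. */
void list_move_head(struct locked_list *from, struct locked_list *to)
{
    struct locked_list *first = from < to ? from : to;
    struct locked_list *second = from < to ? to : from;
    pthread_mutex_lock(&first->lock);
    pthread_mutex_lock(&second->lock);
    struct node *n = from->head;
    if (n) {
        from->head = n->next;
        n->next = to->head;
        to->head = n;
    }
    pthread_mutex_unlock(&second->lock);
    pthread_mutex_unlock(&first->lock);
}
```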
-
-
This tension is hard to get right. One often starts out with
- "large-grained locks" and measures the performance of the system on
- some workloads. When more concurrency is desired (to get better
- performance), an implementor may switch to a more fine-grained
- scheme. Operating system designers fiddle with this all the time.
-
-
Recursive locks and modularity
-
-
When designing a system we desire clean abstractions and good
- modularity. We like a caller not to have to know how a callee
- implements a particular function. Locks make achieving modularity
- more complicated. For example, what do we do when the caller holds a
- lock and then calls a function that also needs the lock to perform
- its job?
-
-
There are no transparent solutions that allow the caller and callee
- to be unaware of which locks they use. One transparent, but
- unsatisfactory, option is recursive locks: if a callee asks for a
- lock that its caller holds, then we allow the callee to proceed.
- Unfortunately, this solution is not ideal either.
-
-
Consider the following. If lock x protects the internals of some
- struct foo, then if the caller acquires lock x, it knows that the
- internals of foo are in a sane state and it can fiddle with them.
- The caller must restore them to a sane state before releasing
- lock x, but until then anything goes.
-
-
This assumption doesn't hold with recursive locking. After
- acquiring lock x, the acquirer knows that either it is the first to
- get this lock, in which case the internals are in a sane state, or
- one of its callers holds the lock, has messed up the internals, and
- didn't realize when calling the callee that it was going to
- look at them too. So the fact that a function acquired lock x
- doesn't guarantee anything at all. In short, locks protect against
- callers and callees just as much as they protect against other
- threads.
-
-
Since transparent solutions aren't ideal, it is better to consider
- locks part of the function specification. The programmer must
- arrange that a caller doesn't invoke another function while holding
- a lock that the callee also needs.
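One common way to put the lock into the function specification is to provide a separate `_locked` variant whose contract says the caller already holds the lock. This is a sketch with made-up names, not xv6 code, using a POSIX mutex:

```c
#include <assert.h>
#include <pthread.h>

pthread_mutex_t foo_lock = PTHREAD_MUTEX_INITIALIZER;
int foo_count;

/* Specification: caller must NOT hold foo_lock. */
void foo_incr(void)
{
    pthread_mutex_lock(&foo_lock);
    foo_count++;
    pthread_mutex_unlock(&foo_lock);
}

/* Specification: caller MUST hold foo_lock. Splitting the interface
 * this way lets a lock holder perform several updates atomically
 * without re-acquiring, which would self-deadlock on a non-recursive
 * lock. */
void foo_incr_locked(void)
{
    foo_count++;
}

void foo_incr_twice(void)
{
    pthread_mutex_lock(&foo_lock);
    foo_incr_locked();   /* calling foo_incr() here would deadlock */
    foo_incr_locked();
    pthread_mutex_unlock(&foo_lock);
}
```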
-
-
Locking in xv6
-
-
xv6 runs on a multiprocessor and is programmed to allow multiple
-threads of computation to run concurrently. In xv6 an interrupt might
-run on one processor and a process in kernel mode may run on another
-processor, sharing a kernel data structure with the interrupt routine.
-xv6 uses locks, implemented using an atomic instruction, to coordinate
-concurrent activities.
-
-
Let's check out why xv6 needs locks by following what happens when
-we start a second processor:
-
-
1516: mp_init (called from main0)
-
1606: mp_startthem (called from main0)
-
1302: mpmain
-
2208: scheduler.
- Now we have several processors invoking the scheduler
- function. xv6 had better ensure that multiple processors don't run
- the same process! Does it?
- Yes: if multiple schedulers run concurrently, only one will
- acquire proc_table_lock and proceed looking for a runnable
- process. If it finds a process, it will mark it running, longjmp to
- it, and the process will release proc_table_lock. The next instance
- of scheduler will skip this entry, because it is marked running, and
- look for another runnable process.
-
-
-
Why hold proc_table_lock during a context switch? It protects
-p->state; the process has to hold some lock to avoid a race with
-wakeup() and yield(), as we will see in the next lectures.
-
-
Why not a lock per proc entry? It might be expensive in whole-table
-scans (in wait, wakeup, scheduler). proc_table_lock also
-protects some larger invariants; for example, it might be hard to get
-proc_wait() right with just per-entry locks. Right now the check to
-see if there are any exited children and the sleep are atomic -- but
-that would be hard with per-entry locks. One could have both, but
-that would probably be neither clean nor fast.
-
-
Of course, there is only one processor searching the proc table at a
-time, provided acquire is implemented correctly. Let's check out
-acquire in spinlock.c:
-
-
1807: no recursive locks!
-
1811: why disable interrupts on the current processor? (if
-interrupt code itself tries to take a held lock, xv6 will deadlock;
-the panic will fire on 1808.)
-
-
can a process on a processor hold multiple locks?
-
-
1814: the (hopefully) atomic instruction.
-
-
see sheet 4, line 0468.
-
-
1819: make sure that stores issued on other processors before we
-got the lock are observed by this processor. these may be stores to
-the shared data structure that is protected by the lock.
-
-
-
-
-
Locking in JOS
-
-
JOS is meant to run on single-CPU machines, so its plan can be
-simple: disable/enable interrupts in the kernel (the IF flag in the
-EFLAGS register). Thus, in the kernel,
-threads release the processor only when they want to and can ensure
-that they don't release the processor during a critical section.
-
-
In user mode, JOS runs with interrupts enabled, but Unix user
-applications don't share data structures. The data structures that
-must be protected, however, are the ones shared in the library
-operating system (e.g., pipes). In JOS we will use special-case
-solutions, as you will find out in lab 6. For example, to implement
-pipe we will assume there is one reader and one writer. The reader
-and writer never update each other's variables; they only read each
-other's variables. By programming carefully with this rule, we can
-avoid races.
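A sketch of such a one-reader/one-writer pipe (illustrative names; this assumes a single CPU or hardware that preserves the program order of the index updates, as the paragraph above requires):

```c
#include <assert.h>

#define PIPEBUF 8

/* One writer, one reader. The writer only writes w and reads r; the
 * reader only writes r and reads w. Since neither updates the other's
 * variable, no lock is needed. Indices grow without bound; w - r is
 * the number of buffered bytes. */
struct pipe {
    char buf[PIPEBUF];
    unsigned r;   /* written only by the reader */
    unsigned w;   /* written only by the writer */
};

/* Returns 1 on success, 0 if the pipe is full. */
int pipe_put(struct pipe *p, char c)
{
    if (p->w - p->r == PIPEBUF)
        return 0;
    p->buf[p->w % PIPEBUF] = c;
    p->w++;              /* publish the byte after writing it */
    return 1;
}

/* Returns the next byte, or -1 if the pipe is empty. */
int pipe_get(struct pipe *p)
{
    if (p->r == p->w)
        return -1;
    char c = p->buf[p->r % PIPEBUF];
    p->r++;
    return c;
}
```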
diff --git a/web/l-mkernel.html b/web/l-mkernel.html
deleted file mode 100644
index 2984796..0000000
--- a/web/l-mkernel.html
+++ /dev/null
@@ -1,262 +0,0 @@
-
Microkernel lecture
-
-
-
-
-
-
Microkernels
-
-
Required reading: Improving IPC by kernel design
-
-
Overview
-
-
This lecture looks at the microkernel organization. In a
-microkernel, services that a monolithic kernel implements in the
-kernel are running as user-level programs. For example, the file
-system, UNIX process management, pager, and network protocols each run
-in a separate user-level address space. The microkernel itself
-supports only the services that are necessary to allow system services
-to run well in user space; a typical microkernel has at least support
-for creating address spaces, threads, and interprocess communication.
-
-
The potential advantages of a microkernel are simplicity of the
-kernel (it is small), isolation of operating system components (each
-runs in its own user-level address space), and flexibility (we can
-have a file server and a database server). One potential disadvantage
-is performance loss, because what in a monolithic kernel requires a
-single system call may require in a microkernel multiple system calls
-and context switches.
-
-
One way in which microkernels differ from each other is the exact
-kernel API they implement. For example, Mach (a system developed at
-CMU, which influenced a number of commercial operating systems) has
-the following system calls: processes (create, terminate, suspend,
-resume, priority, assign, info, threads), threads (fork, exit, join,
-detach, yield, self), ports and messages (a port is a unidirectional
-communication channel with a message queue and supporting primitives
-to send, destroy, etc.), and regions/memory objects (allocate,
-deallocate, map, copy, inherit, read, write).
-
-
Some microkernels are more "microkernel" than others. For example,
-some microkernels implement the pager in user space but the basic
-virtual memory abstractions in the kernel (e.g., Mach); others are
-more extreme and implement most of the virtual memory in user space
-(L4). Yet others are less extreme: many servers run in their own
-address space, but in kernel mode (Chorus).
-
-
All microkernels support multiple threads per address space. xv6
-and Unix until recently didn't; why? Because in Unix system services
-are typically implemented in the kernel, and those are the primary
-programs that need multiple threads to handle events concurrently
-(waiting for disk and processing new I/O requests). In microkernels,
-these services are implemented in user-level address spaces, and so
-they need a mechanism to deal with handling operations concurrently.
-(Of course, one can argue that if fork is efficient enough, there is
-no need for threads.)
-
-
L3/L4
-
-
L3 is a predecessor to L4. L3 provides data persistence, DOS
-emulation, and an ELAN runtime system. L4 is a reimplementation of
-L3, but without the data persistence. L4KA is a project at
-sourceforge.net, and you can download the code for the latest
-incarnation of L4 from there.
-
-
L4 is a "second-generation" microkernel, with 7 calls: IPC (of
-which there are several types), id_nearest (find a thread with an ID
-closest to the given ID), fpage_unmap (unmap pages; mapping is done
-as a side effect of IPC), thread_switch (hand the processor to a
-specified thread), lthread_ex_regs (manipulate thread registers),
-thread_schedule (set scheduling policies), and task_new (create a new
-address space with some default number of threads). These calls
-provide address spaces, tasks, threads, interprocess communication,
-and unique identifiers. An address space is a set of mappings.
-Multiple threads may share mappings, and a thread may grant mappings
-to another thread (through IPC). A task is the set of threads sharing
-an address space.
-
-
A thread is the execution abstraction; it belongs to an address
-space and has a UID, a register set, a page fault handler, and an
-exception handler. A thread's UID is its task number plus the number
-of the thread within that task.
-
-
IPC passes data by value or by reference to another address space.
-It also provides for sequence coordination. It is used for
-communication between clients and servers, to pass interrupts to a
-user-level exception handler, and to pass page faults to an external
-pager. In L4, device drivers are implemented as user-level
-processes with the device mapped into their address space.
-Linux runs as a user-level process.
-
-
L4 provides quite a range of message types: inline-by-value,
-strings, and virtual memory mappings. The send and receive
-descriptors specify how many of each, if any.
-
-
In addition, there is a system call for timeouts and controlling
-thread scheduling.
-
-
L3/L4 paper discussion
-
-
-
-
This paper is about performance. What is a microsecond? Is 100
-usec bad? Is 5 usec so much better that we care? How many
-instructions does a 50-MHz x86 execute in 100 usec? What can we
-compute with that number of instructions? How many disk operations in
-that time? How many interrupts can we take? (The livelock paper,
-which we cover in a few lectures, mentions 5,000 network pkts per
-second, and each packet generates two interrupts.)
-
-
In performance calculations, what is the appropriate/better metric?
-Microseconds or cycles?
-
-
Goal: improve IPC performance by a factor 10 by careful kernel
-design that is fully aware of the hardware it is running on.
-Principle: performance rules! Optimize for the common case. Because
-in L3 interrupts are propagated to user-level using IPC, the system
-may have to be able to support many IPCs per second (as many as the
-device can generate interrupts).
-
-
IPC consists of transferring control and transferring data. The
-minimal cost for transferring control is 127 cycles, plus 45 cycles
-for TLB misses (see table 3). What are the x86 instructions to enter
-and leave the kernel? (int, iret) Why do they consume so much time?
-(They flush the pipeline.) Do modern processors perform these
-operations more efficiently? No, worse: faster processors are
-optimized for straight-line code, traps/exceptions flush a deeper
-pipeline, and cache misses cost more cycles.
-
-
What are the 5 TLB misses? 1) B's thread control block; loading %cr3
-flushes the TLB, so 2) the kernel text causes a miss; iret accesses
-both 3) the stack and 4+5) the user text -- two pages when B's user
-code looks at the message.
-
-
New system call: reply_and_receive. Effect: 2 system calls per
-RPC.
-
-
Complex messages: direct string, indirect strings, and memory
-objects.
-
-
Direct transfer by temporary mapping through a communication
-window. The communication window is mapped in B's address space and
-in A's kernel address space; why is this better than just mapping a
-page shared between A's and B's address spaces? 1) Multi-level
-security: sharing makes it hard to reason about information flow;
-2) the receiver can't check message legality (it might change after
-the check); 3) when a server has many clients, it could run out of
-virtual address space, and a shared memory region must be established
-ahead of time; 4) it is not application friendly, since the data may
-already be at another address, i.e., applications would have to copy
-anyway -- possibly more copies.
-
-
Why not use the following approach: map the region copy-on-write
-(or read-only) in A's address space after send and read-only in B's
-address space? Now B may have to copy data or cannot receive data in
-its final destination.
-
-
On the x86 this is implemented by copying B's PDE into A's address
-space. Why two PDEs? (The maximum message size is 4 MB, so the copy
-is guaranteed to work if the message starts in the bottom 4 MB of an
-8 MB mapped region.) Why not just copy PTEs? That would be much more
-expensive.
-
-
What does it mean for the TLB to be "window clean"? Why do we
-care? It means the TLB contains no mappings within the communication
-window. We care because mapping is cheap (copy a PDE), but
-invalidation is not; the x86 only lets you invalidate one page at a
-time, or the whole TLB. Does TLB invalidation of the communication
-window turn out to be a problem? Not usually, because we have to load
-%cr3 during IPC anyway.
-
-
Thread control block: registers, links to various doubly-linked
- lists, pgdir, uid, etc. The lower part of a thread UID contains the
- TCB number. One can also deduce the TCB address from the stack by
- taking SP AND bitmask (the SP comes out of the TSS when just
- switching to kernel).
-
-
The kernel stack is on the same page as the TCB. Why? 1) It
-minimizes TLB misses (since accessing the kernel stack will bring in
-the TCB); 2) it allows very efficient access to the TCB -- just mask
-off the lower 12 bits of %esp; 3) with VM, the lower 32 bits of the
-thread id can indicate which TCB; using one page per TCB means no
-need to check whether a thread is swapped out (simply don't map that
-TCB if it shouldn't be accessed).
-
-
Invariant on queues: queues always hold in-memory TCBs.
-
-
Wakeup queue: a set of 8 unordered wakeup lists (wakeup time mod 8),
-and a smart representation of time so that 32-bit integers can be
-used in the common case (base + offset in msec; bump the base and
-recompute all offsets every ~4 hours; maximum timeout is ~24 days,
-2^31 msec).
-
-
What is the problem addressed by lazy scheduling?
-Conventional approach to scheduling:
-
- A sends message to B:
- Move A from ready queue to waiting queue
- Move B from waiting queue to ready queue
- This requires 58 cycles, including 4 TLB misses. What are the TLB misses?
- One each for the head of the ready queue and the waiting queue
- One each for the previous queue element during the remove
-
-
Lazy scheduling:
-
- Ready queue must contain all ready threads except current one
- Might contain other threads that aren't actually ready, though
- Each wakeup queue contains all threads waiting in that queue
- Again, might contain other threads, too
- Scheduler removes inappropriate queue entries when scanning
- queue
-
-
-
Why does this help performance? There are only three situations in
-which a thread gives up the CPU but stays ready: the send syscall (as
-opposed to call), preemption, and hardware interrupts. So very often
-the kernel can IPC into a thread without putting it on the ready list.
-
-
Direct process switch. This section just says you should use
-kernel threads instead of continuations.
-
-
Short messages via registers.
-
-
Avoiding unnecessary copies. Basically, one can send and receive
- messages with the same vector. This makes forwarding efficient,
- which is important for the Clans/Chiefs model.
-
-
Segment register optimization. Loading segment registers is
- slow: you have to access the GDT, etc. But the common case is that
- users don't change their segment registers. Observation: it is
- faster to check a segment descriptor than to load it. So just check
- that the segment registers are okay; only load them if user code
- changed them.
-
-
Registers for parameter passing wherever possible: system calls
-and IPC.
-
-
Minimizing TLB misses. Try to cram as many things as possible onto
-the same page: IPC kernel code, GDT, IDT, TSS, all on one page.
-Actually the whole tables may not fit, but the important parts can be
-put on the same page (perhaps the beginnings of the TSS, IDT, and GDT
-only).
-
-
Coding tricks: short offsets, avoid jumps, avoid checks, pack
- often-used data on same cache lines, lazily save/restore CPU state
- like debug and FPU registers. Much of the kernel is written in
- assembly!
-
-
What are the results? figure 7 and 8 look good.
-
-
Is fast IPC enough to get good overall system performance? This
-paper doesn't make a statement either way; we have to read their 1997
-paper to find the answer to that question.
-
-
Is the principle of optimizing for performance right? In general,
-it is wrong to optimize for performance; other things matter more. Is
-IPC the one exception? Maybe, perhaps not. Was Liedtke fighting a
-losing battle against CPU makers? Should fast IPC be a hardware
-issue, or just an OS issue?
-
-
Required reading: namei(), and all other file system code.
-
-
Overview
-
-
To help users remember where they stored their data, most
-systems allow users to assign their own names to their data.
-Typically the data is organized in files and users assign names to
-files. To deal with many files, users can organize their files in
-directories, in a hierarchical manner. Each name is a pathname, with
-the components separated by "/".
-
-
To avoid users having to type long absolute names (i.e., names
-starting with "/" in Unix), users can change their working directory
-and use relative names (i.e., names that don't start with "/").
-
-
User file namespace operations include create, mkdir, mv, ln
-(link), unlink, and chdir. (How is "mv a b" implemented in xv6?
-Answer: "link a b"; "unlink a".) To be able to name the current
-directory and the parent directory, every directory includes two
-entries "." and "..". Files and directories can be reclaimed if users
-cannot name them anymore (i.e., after the last unlink).
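A user-level sketch of "mv a b" built from link and unlink, as in the answer above (the path names in the usage are illustrative):

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* "mv a b" the xv6 way: make a second directory entry for the same
 * inode, then remove the first. In between, both names refer to one
 * inode (nlink is 2), so the file's data is never copied or lost. */
int rename_by_link(const char *a, const char *b)
{
    if (link(a, b) < 0)      /* b now names a's inode */
        return -1;
    if (unlink(a) < 0)       /* drop the old name */
        return -1;
    return 0;
}
```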
-
-
Recall from last lecture that all directory entries contain a name
-followed by an inode number. The inode number names an inode of the
-file system. How can we merge file systems from different disks into
-a single name space?
-
-
A user grafts new file systems on a name space using mount. Umount
-removes a file system from the name space. (In DOS, a file system is
-named by its device letter.) Mount takes the root inode of the
-to-be-mounted file system and grafts it on the inode of the name space
-entry where the file system is mounted (e.g., /mnt/disk1). The
-in-memory inode of /mnt/disk1 records the major and minor number of
-the file system mounted on it. When namei sees an inode on which a
-file system is mounted, it looks up the root inode of the mounted file
-system, and proceeds with that inode.
-
-
Mount is not a durable operation; it doesn't survive power failures.
-After a power failure, the system administrator must remount the file
-system (i.e., often in a startup script that is run from init).
-
-
Links are convenient, because with them users can create synonyms
- for file names. But they create the potential for cycles in
- the naming tree. For example, consider link("a/b/c", "a"). This
- makes c a synonym for a. This cycle can complicate matters; for
- example:
-
-
If a user subsequently calls unlink ("a"), then the user cannot
- name the directory "b" and the link "c" anymore, but how can the
- file system decide that?
-
-
-
This problem can be solved by detecting cycles. The second problem
- can be solved by computing which files are reachable from "/" and
- reclaiming all the ones that aren't reachable. Unix takes a simpler
- approach: avoid cycles by disallowing users from creating links to
- directories. If there are no cycles, then reference counts can be
- used to see if a file is still referenced. The inode maintains a
- field for counting references (nlink in xv6's dinode). link
- increases the reference count, and unlink decreases it; if
- the count reaches zero, the inode and disk blocks can be reclaimed.
-
-
How to handle symbolic links across file systems (i.e., from one
- mounted file system to another)? Since inodes are not unique across
- file systems, we cannot create a link across file systems; the
- directory entry only contains an inode number, not the inode number
- and the name of the disk on which the inode is located. To handle
- this case, Unix provides a second type of link, which are called
- soft links.
-
-
Soft links are a special file type (e.g., T_SYMLINK). If namei
- encounters an inode of type T_SYMLINK, it resolves the name in
- the symlink file to an inode, and continues from there. With
- symlinks one can create cycles, and they can point to non-existing
- files.
-
-
The design of the name system can have security implications. For
- example, if you test whether a name exists and then use the name, an
- adversary can change the binding from name to object between the
- test and the use. Such problems are called TOCTTOU
- (time-of-check-to-time-of-use) bugs.
-
-
An example of TOCTTOU is as follows. Say root runs a script
- every night to remove files in /tmp. This gets rid of the files
- that editors might have left behind but that will never be used
- again. An adversary can exploit this script as follows:
-
-lstat checks that /tmp/etc is not a symbolic link, but by the time the
-script runs unlink, the attacker has had time to create a symbolic
-link in its place, with a password file of the adversary's choice.
-
-
This problem could have been avoided if every user or process group
- had its own private /tmp, or if access to the shared one was
- mediated.
-
-
V6 code examples
-
-
namei (sheet 46) is the core of the Unix naming system. namei can
- be called in several ways: NAMEI_LOOKUP (resolve a name to an inode
- and lock the inode), NAMEI_CREATE (resolve a name, but lock the
- parent inode), and NAMEI_DELETE (resolve a name, lock the parent
- inode, and return the offset in the directory). The reason namei is
- complicated is that we want to atomically test whether a name exists
- and remove/create it if it does; otherwise, two concurrent processes
- could interfere with each other and the directory could end up in an
- inconsistent state.
-
-
Let's trace open("a", O_RDWR), focusing on namei:
-
-
5263: we will look at creating a file in a bit.
-
5277: call namei with NAMEI_LOOKUP
-
4629: if the path name starts with "/", look up the root inode (1).
-
4632: otherwise, use inode for current working directory.
-
4638: consume a run of "/"s, for example in "/////a////b"
-
4641: if we are done with NAMEI_LOOKUP, return inode (e.g.,
- namei("/")).
-
4652: if the inode in which we are searching for the name isn't of
- type directory, give up.
-
4657-4661: determine length of the current component of the
- pathname we are resolving.
-
4663-4681: scan the directory for the component.
-
4682-4696: the entry wasn't found. If we are at the end of the
- pathname and NAMEI_CREATE is set, lock the parent directory and
- return a pointer to the start of the component. In all other cases,
- unlock the inode of the directory and return 0.
-
4701: if NAMEI_DELETE is set, return locked parent inode and the
- offset of the to-be-deleted component in the directory.
-
4707: look up the inode of the component, and go to the top of the loop.
-
-
-
Now let's look at creating a file in a directory:
-
-
5264: if the last component doesn't exist, but the first part of the
- pathname resolved to a directory, then dp will be 0, last will point
- to the beginning of the last component, and ip will be the locked
- parent directory.
-
5266: create an entry for last in the directory.
-
4772: mknod1 allocates a new named inode and adds it to an
- existing directory.
-
4776: ialloc. Scan the inode blocks, find an unused entry, and write
- it. (If lucky, 1 read and 1 write.)
-
4784: fill out the inode entry, and write it. (another write)
-
4786: write the entry into the directory (if lucky, 1 write)
-
-
-
-Why must the parent directory be locked? If two processes try to
-create the same name in the same directory, only one should succeed,
-and the other should receive an error (file exists).
-
-
Link, unlink, chdir, mount, and umount could have taken file
-descriptors instead of their path arguments. In fact, this would get
-rid of some possible race conditions (some of which have security
-implications, TOCTTOU). However, this would require that the current
-working directory be remembered by the process, and UNIX didn't have
-good ways of maintaining static state shared among all processes
-belonging to a given user. The easiest way to create shared state
-is to place it in the kernel.
-
-
We have one piece of code in xv6 that we haven't studied: exec.
- With all the groundwork we have done, this code can be easily
- understood (see sheet 54).
-
-
diff --git a/web/l-okws.txt b/web/l-okws.txt
deleted file mode 100644
index fa940d0..0000000
--- a/web/l-okws.txt
+++ /dev/null
@@ -1,249 +0,0 @@
-
-Security
--------------------
-I. 2 Intro Examples
-II. Security Overview
-III. Server Security: Offense + Defense
-IV. Unix Security + POLP
-V. Example: OKWS
-VI. How to Build a Website
-
-I. Intro Examples
---------------------
-1. Apache + OpenSSL 0.9.6a (CAN 2002-0656)
- - SSL = More security!
-
- unsigned int j;
- p=(unsigned char *)s->init_buf->data;
- j= *(p++);
- s->session->session_id_length=j;
- memcpy(s->session->session_id,p,j);
-
- - the result: an Apache worm
-
-2. SparkNotes.com 2000:
- - New profile feature that displays "public" information about users
- but bug that made e-mail addresses "public" by default.
- - New program for getting that data:
-
- http://www.sparknotes.com/getprofile.cgi?id=1343
-
-II. Security Overview
-----------------------
-
-What Is Security?
- - Protecting your system from attack.
-
- What's an attack?
- - Stealing data
- - Corrupting data
- - Controlling resources
- - DOS
-
- Why attack?
- - Money
- - Blackmail / extortion
- - Vendetta
- - intellectual curiosity
- - fame
-
-Security is a Big topic
-
- - Server security -- today's focus. There's some machine sitting on the
- Internet somewhere, with a certain interface exposed, and attackers
- want to circumvent it.
- - Why should you trust your software?
-
- - Client security
- - Clients are usually servers, so they have many of the same issues.
- - Slight simplification: people across the network cannot typically
- initiate connections.
- - Has a "fallible operator":
- - Spyware
- - Drive-by-Downloads
-
- - Client security turns out to be much harder -- GUI considerations,
- look inside the browser and the applications.
- - Systems community can more easily handle server security.
- - We think mainly of servers.
-
-III. Server Security: Offense and Defense
------------------------------------------
- - Show picture of a Web site.
-
- Attacks | Defense
-----------------------------------------------------------------------------
- 1. Break into DB from net | 1. FW it off
- 2. Break into WS on telnet | 2. FW it off
- 3. Buffer overrun in Apache | 3. Patch apache / use better lang?
- 4. Buffer overrun in our code | 4. Use better lang / isolate it
- 5. SQL injection | 5. Better escaping / don't interpret code.
- 6. Data scraping. | 6. Use a sparse UID space.
- 7. PW sniffing | 7. ???
- 8. Fetch /etc/passwd and crack | 8. Don't expose /etc/passwd
- PW |
- 9. Root escalation from apache | 9. No setuid programs available to Apache
-10. XSS |10. Filter JS and input HTML code.
-11. Keystroke recorded on sys- |11. Client security
- admin's desktop (planetlab) |
-12. DDOS |12. ???
-
-Summary:
- That we want private data to be available to the right people is
what makes this problem hard in the first place. Internet servers
are there for a reason.
- - Security != "just encrypt your data;" this in fact can sometimes
- make the problem worse.
- - Best to prevent break-ins from happening in the first place.
- - If they do happen, want to limit their damage (POLP).
- - Security policies are difficult to express / package up neatly.
-
-IV. Design According to POLP (in Unix)
----------------------------------------
- - Assume any piece of a system can be compromised, by either bad
- programming or malicious attack.
- - Try to limit the damage done by such a compromise (along the lines
- of the 4 attack goals).
-
-
-
-What's the goal on Unix?
- - Keep processes from communicating that don't have to:
- - limit FS, IPC, signals, ptrace
- - Strip away unneeded privilege
- - with respect to network, FS.
- - Strip away FS access.
-
-How on Unix?
- - setuid/setgid
- - system call interposition
- - chroot (away from setuid executables, /etc/passwd, /etc/ssh/..)
-
-
-
-How do you write chroot'ed programs?
- - What about shared libraries?
- - /etc/resolv.conf?
- - Can chroot'ed programs access the FS at all? What if they need
- to write to the FS or read from the FS?
- - Fd's are *capabilities*; can pass them to chroot'ed services,
- thereby opening new files on its behalf.
- - Unforgeable - can only get them from the kernel via open/socket, etc.
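On Unix the fd passing mentioned above is done with SCM_RIGHTS control messages over a Unix-domain socket; the kernel installs a fresh descriptor for the same open file in the receiver, even across a chroot boundary. A sketch with illustrative helper names:

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one file descriptor over a Unix-domain socket as an
 * SCM_RIGHTS ancillary message; the fd acts as a capability. */
int send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { &byte, 1 };     /* must send at least one byte */
    union { struct cmsghdr h; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg = {0};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;           /* "this message carries fds" */
    c->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a descriptor sent by send_fd; returns the new fd or -1. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { &byte, 1 };
    union { struct cmsghdr h; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg = {0};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);
    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    if (!c || c->cmsg_type != SCM_RIGHTS)
        return -1;
    int fd;
    memcpy(&fd, CMSG_DATA(c), sizeof(int));
    return fd;
}
```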
-
-Unix Shortcomings (round 1)
- - It's bad to run as root!
- - Yet, need root for:
- - chroot
- - setuid/setgid to a lower-privileged user
- - create a new user ID
- - Still no guarantee that we've cut off all channels
- - 200 syscalls!
- - Default is to give most/all privileges.
- - Can "break out" of chroot jails?
- - Can still exploit race conditions in the kernel to escalate privileges.
-
-Sidebar
- - setuid / setuid misunderstanding
- - root / root misunderstanding
- - effective vs. real vs. saved set-user-ID
-
-V. OKWS
--------
-- Taking these principles as far as possible.
-- C.f. Figure 1 From the paper..
-- Discussion of which privileges are in which processes
-
-
-
-- Technical details: how to launch a new service
-- Within the launcher (running as root):
-
-
-
- // receive FDs from logger, pubd, demux
- fork ();
- chroot ("/var/okws/run");
- chdir ("/coredumps/51001");
- setgid (51001);
- setuid (51001);
- exec ("login", fds ... );
-
-- Note no chroot -- why not?
-- Once launched, how does a service get new connections?
-- Note the goal - minimum tampering with each other in the
- case of a compromise.
-
-Shortcoming of Unix (2)
-- A lot of plumbing involved with this system. FDs flying everywhere.
-- Isolation still not fine enough. If a service gets taken over,
- can compromise all users of that service.
-
-VI. Reflections on Building Websites
----------------------------------
-- OKWS interesting "experiment"
-- Need for speed; also, good gzip support.
-- If you need compiled code, it's a good way to go.
-- RPC-like system a must for backend communication
-- Connection-pooling for free
-
-Biggest difficulties:
-- Finding good C++ programmers.
-- Compile times.
-- The DB is still always the problem.
-
-Hard to Find good Alternatives
-- Python / Perl - you might spend a lot of time writing C code /
- integrating with lower level languages.
-- Have to worry about DB pooling.
-- Java -- most viable, and is getting better. Scary that you can't
- peer inside.
-- .Net / C#-based system might be the way to go.
-
-
-=======================================================================
-
-Extra Material:
-
-Capabilities (From the Eros Paper in SOSP 1999)
-
- - "Unforgeable pair made up of an object ID and a set of authorized
- operations (an interface) on that object."
- - c.f. Dennis and van Horn. "Programming semantics for multiprogrammed
- computations," Communications of the ACM 9(3):143-154, Mar 1966.
- - Thus:
-