diff --git a/web/Makefile b/web/Makefile new file mode 100644 index 0000000..7b49773 --- /dev/null +++ b/web/Makefile @@ -0,0 +1,3 @@ +index.html: index.txt mkhtml + mkhtml index.txt >_$@ && mv _$@ $@ + diff --git a/web/index.html b/web/index.html new file mode 100644 index 0000000..d5f940c --- /dev/null +++ b/web/index.html @@ -0,0 +1,353 @@ + + + +Xv6, a simple Unix-like teaching operating system + + + +

Xv6, a simple Unix-like teaching operating system

+

+Xv6 is a teaching operating system developed +in the summer of 2006 for MIT's operating systems course, +“6.828: Operating Systems Engineering.” +We used it for 6.828 in Fall 2006 and Fall 2007 +and are using it this semester (Fall 2008). +We hope that xv6 will be useful in other courses too. +This page collects resources to aid the use of xv6 +in other courses. + +

History and Background

+For many years, MIT had no operating systems course. +In the fall of 2002, Frans Kaashoek, Josh Cates, and Emil Sit +created a new, experimental course (6.097) +to teach operating systems engineering. +In the course lectures, the class worked through Sixth Edition Unix (aka V6) +using John Lions's famous commentary. +In the lab assignments, students wrote most of an exokernel operating +system, eventually named Jos, for the Intel x86. +Exposing students to multiple systems–V6 and Jos–helped +develop a sense of the spectrum of operating system designs. +In the fall of 2003, the experimental 6.097 became the +official course 6.828; the course has been offered each fall since then. +

+V6 presented pedagogic challenges from the start. +Students doubted the relevance of an obsolete 30-year-old operating system +written in an obsolete programming language (pre-K&R C) +running on obsolete hardware (the PDP-11). +Students also struggled to learn the low-level details of two different +architectures (the PDP-11 and the Intel x86) at the same time. +By the summer of 2006, we had decided to replace V6 +with a new operating system, xv6, modeled on V6 +but written in ANSI C and running on multiprocessor +Intel x86 machines. +Xv6's use of the x86 makes it more relevant to +students' experience than V6 was +and unifies the course around a single architecture. +Adding multiprocessor support also helps relevance +and makes it easier to discuss threads and concurrency. +(In a single processor operating system, concurrency–which only +happens because of interrupts–is too easy to view as a special case. +A multiprocessor operating system must attack the problem head on.) +Finally, writing a new system allowed us to write cleaner versions +of the rougher parts of V6, like the scheduler and file system. +

+6.828 substituted xv6 for V6 in the fall of 2006. +Based on that experience, we cleaned up rough patches +of xv6 for the course in the fall of 2007. +Since then, xv6 has stabilized, so we are making it +available in the hopes that others will find it useful too. +

+6.828 uses both xv6 and Jos. +Courses taught at UCLA, NYU, and Stanford have used +Jos without xv6; we believe other courses could use +xv6 without Jos, though we are not aware of any that have. + +

Xv6 sources

+The latest xv6 is xv6-rev2.tar.gz. +We distribute the sources in electronic form but also as +a printed booklet with line numbers that keep everyone +together during lectures. The booklet is available as +xv6-rev2.pdf. +

+xv6 compiles using the GNU C compiler,
+targeted at the x86 using ELF binaries.
+On BSD and Linux systems, you can use the native compilers;
+on OS X, which doesn't use ELF binaries,
+you must use a cross-compiler.
+Xv6 does boot on real hardware, but typically
+we run it using the Bochs emulator.
+Both the GCC cross compiler and Bochs
+can be found on the 6.828 tools page.


Lectures

+In 6.828, the lectures in the first half of the course
+introduce the PC hardware, the Intel x86, and then xv6.
+The lectures in the second half consider advanced topics
+using research papers; for some, xv6 serves as a useful
+base for making discussions concrete.
+This section describes a typical 6.828 lecture schedule,
+linking to lecture notes and homework.
+A course using only xv6 (not Jos) will need to adapt
+a few of the lectures, but we hope these are a useful
+starting point.


Lecture 1. Operating systems +

+The first lecture introduces both the general topic of
+operating systems and the specific approach of 6.828.
+After defining “operating system,” the lecture
+examines the implementation of a Unix shell
+to look at the details of the traditional Unix system call interface.
+This is relevant to both xv6 and Jos: in the final
+Jos labs, students implement a Unix-like interface,
+culminating in a Unix shell.

+lecture notes + +

Lecture 2. PC hardware and x86 programming +

+This lecture introduces the PC architecture, the 16- and 32-bit x86, +the stack, and the GCC x86 calling conventions. +It also introduces the pieces of a typical C tool chain–compiler, +assembler, linker, loader–and the Bochs emulator. +

+Reading: PC Assembly Language +

+Homework: familiarize yourself with Bochs

+lecture notes +homework + +

Lecture 3. Operating system organization +

+This lecture continues Lecture 1's discussion of what +an operating system does. +An operating system provides a “virtual computer” +interface to user space programs. +At a high level, the main job of the operating system +is to implement that interface +using the physical computer it runs on. +

+The lecture discusses four approaches to that job: +monolithic operating systems, microkernels, +virtual machines, and exokernels. +Exokernels might not be worth mentioning +except that the Jos labs are built around one. +

+Reading: Engler et al., Exokernel: An Operating System Architecture +for Application-Level Resource Management +

+lecture notes + +

Lecture 4. Address spaces using segmentation +

+This is the first lecture that uses xv6. +It introduces the idea of address spaces and the +details of the x86 segmentation hardware. +It makes the discussion concrete by reading the xv6 +source code and watching xv6 execute using the Bochs simulator. +

+Reading: x86 MMU handout, +xv6: bootasm.S, bootother.S, bootmain.c, main.c, init.c, and setupsegs in proc.c. +

+Homework: Bochs stack introduction +

+lecture notes +homework + +

Lecture 5. Address spaces using page tables +

+This lecture continues the discussion of address spaces, +examining the other x86 virtual memory mechanism: page tables. +Xv6 does not use page tables, so there is no xv6 here. +Instead, the lecture uses Jos as a concrete example. +An xv6-only course might skip or shorten this discussion. +

+Reading: x86 manual excerpts +

+Homework: stuff about gdt +XXX not appropriate; should be in Lecture 4 +

+lecture notes + +

Lecture 6. Interrupts and exceptions +

+How does a user program invoke the operating system kernel?
+How does the kernel return to the user program?
+What happens when a hardware device needs attention?
+This lecture explains the answers to these questions:
+interrupt and exception handling.

+It explains the x86 trap setup mechanisms and then +examines their use in xv6's SETGATE (mmu.h), +tvinit (trap.c), idtinit (trap.c), vectors.pl, and vectors.S. +

+It then traces through a call to the system call open:
+init.c, usys.S, vector48 and alltraps (vectors.S), trap (trap.c),
+syscall (syscall.c),
+sys_open (sysfile.c), fetcharg, fetchint, argint, argptr, argstr (syscall.c).

+The interrupt controller, briefly: +pic_init and pic_enable (picirq.c). +The timer and keyboard, briefly: +timer_init (timer.c), console_init (console.c). +Enabling and disabling of interrupts. +

+Reading: x86 manual excerpts, +xv6: trapasm.S, trap.c, syscall.c, and usys.S. +Skim lapic.c, ioapic.c, picirq.c. +

+Homework: Explain the 35 words on the top of the +stack at first invocation of syscall. +

+lecture notes +homework + +

Lecture 7. Multiprocessors and locking +

+This lecture introduces the problems of +coordination and synchronization on a +multiprocessor +and then the solution of mutual exclusion locks. +Atomic instructions, test-and-set locks, +lock granularity, (the mistake of) recursive locks. +

+Although xv6 user programs cannot share memory, +the xv6 kernel itself is a program with multiple threads +executing concurrently and sharing memory. +Illustration: the xv6 scheduler's proc_table_lock (proc.c) +and the spin lock implementation (spinlock.c). +

+Reading: xv6: spinlock.c. Skim mp.c. +

+Homework: Interaction between locking and interrupts. +Try not disabling interrupts in the disk driver and watch xv6 break. +

+lecture notes +homework + +

Lecture 8. Threads, processes and context switching +

+The last lecture introduced some of the issues +in writing threaded programs, using xv6's processes +as an example. +This lecture introduces the issues in implementing +threads, continuing to use xv6 as the example. +

+The lecture defines a thread of computation as a register +set and a stack. A process is an address space plus one +or more threads of computation sharing that address space. +Thus the xv6 kernel can be viewed as a single process +with many threads (each user process) executing concurrently. +

+Illustrations: thread switching (swtch.S), scheduler (proc.c), sys_fork (sysproc.c) +

+Reading: proc.c, swtch.S, sys_fork (sysproc.c) +

+Homework: trace through stack switching. +

+lecture notes (need to be updated to use swtch) +homework + +

Lecture 9. Processes and coordination +

+This lecture introduces the idea of sequence coordination +and then examines the particular solution illustrated by +sleep and wakeup (proc.c). +It introduces and refines a simple +producer/consumer queue to illustrate the +need for sleep and wakeup +and then the sleep and wakeup +implementations themselves. +

+Reading: proc.c, sys_exec, sys_sbrk, sys_wait, sys_exit, sys_kill (sysproc.c).

+Homework: Explain how sleep and wakeup would break +without proc_table_lock. Explain how devices would break +without second lock argument to sleep. +

+lecture notes +homework + +

Lecture 10. Files and disk I/O +

+This is the first of three file system lectures. +This lecture introduces the basic file system interface +and then considers the on-disk layout of individual files +and the free block bitmap. +

+Reading: iread, iwrite, fileread, filewrite, wdir, mknod1, and + code related to these calls in fs.c, bio.c, ide.c, and file.c. +

+Homework: Add print to bwrite to trace every disk write. +Explain the disk writes caused by some simple shell commands. +

+lecture notes +homework + +

Lecture 11. Naming +

+The last lecture discussed on-disk file system representation. +This lecture covers the implementation of +file system paths (namei in fs.c) +and also discusses the security problems of a shared /tmp +and symbolic links. +

+Understanding exec (exec.c) is left as an exercise. +

+Reading: namei in fs.c, sysfile.c, file.c. +

+Homework: Explain how to implement symbolic links in xv6. +

+lecture notes +homework + +

Lecture 12. High-performance file systems +

+This lecture is the first of the research paper-based lectures. +It discusses the “soft updates” paper, +using xv6 as a concrete example. + +

Feedback

+If you are interested in using xv6 or have used xv6 in a course, +we would love to hear from you. +If there's anything that we can do to make xv6 easier +to adopt, we'd like to hear about it. +We'd also be interested to hear what worked well and what didn't. +

+Russ Cox (rsc@swtch.com)
+Frans Kaashoek (kaashoek@mit.edu)
+Robert Morris (rtm@mit.edu) +

+You can reach all of us at 6.828-staff@pdos.csail.mit.edu. +

+

+ + diff --git a/web/index.txt b/web/index.txt new file mode 100644 index 0000000..41d42a4 --- /dev/null +++ b/web/index.txt @@ -0,0 +1,335 @@ +** Xv6, a simple Unix-like teaching operating system +Xv6 is a teaching operating system developed +in the summer of 2006 for MIT's operating systems course, +``6.828: Operating Systems Engineering.'' +We used it for 6.828 in Fall 2006 and Fall 2007 +and are using it this semester (Fall 2008). +We hope that xv6 will be useful in other courses too. +This page collects resources to aid the use of xv6 +in other courses. + +* History and Background + +For many years, MIT had no operating systems course. +In the fall of 2002, Frans Kaashoek, Josh Cates, and Emil Sit +created a new, experimental course (6.097) +to teach operating systems engineering. +In the course lectures, the class worked through Sixth Edition Unix (aka V6) +using John Lions's famous commentary. +In the lab assignments, students wrote most of an exokernel operating +system, eventually named Jos, for the Intel x86. +Exposing students to multiple systems--V6 and Jos--helped +develop a sense of the spectrum of operating system designs. +In the fall of 2003, the experimental 6.097 became the +official course 6.828; the course has been offered each fall since then. + +V6 presented pedagogic challenges from the start. +Students doubted the relevance of an obsolete 30-year-old operating system +written in an obsolete programming language (pre-K&R C) +running on obsolete hardware (the PDP-11). +Students also struggled to learn the low-level details of two different +architectures (the PDP-11 and the Intel x86) at the same time. +By the summer of 2006, we had decided to replace V6 +with a new operating system, xv6, modeled on V6 +but written in ANSI C and running on multiprocessor +Intel x86 machines. +Xv6's use of the x86 makes it more relevant to +students' experience than V6 was +and unifies the course around a single architecture. 
+Adding multiprocessor support also helps relevance
+and makes it easier to discuss threads and concurrency.
+(In a single processor operating system, concurrency--which only
+happens because of interrupts--is too easy to view as a special case.
+A multiprocessor operating system must attack the problem head on.)
+Finally, writing a new system allowed us to write cleaner versions
+of the rougher parts of V6, like the scheduler and file system.
+
+6.828 substituted xv6 for V6 in the fall of 2006.
+Based on that experience, we cleaned up rough patches
+of xv6 for the course in the fall of 2007.
+Since then, xv6 has stabilized, so we are making it
+available in the hopes that others will find it useful too.
+
+6.828 uses both xv6 and Jos.
+Courses taught at UCLA, NYU, and Stanford have used
+Jos without xv6; we believe other courses could use
+xv6 without Jos, though we are not aware of any that have.
+
+
+* Xv6 sources
+
+The latest xv6 is [xv6-rev2.tar.gz].
+We distribute the sources in electronic form but also as
+a printed booklet with line numbers that keep everyone
+together during lectures. The booklet is available as
+[xv6-rev2.pdf].
+
+xv6 compiles using the GNU C compiler,
+targeted at the x86 using ELF binaries.
+On BSD and Linux systems, you can use the native compilers;
+on OS X, which doesn't use ELF binaries,
+you must use a cross-compiler.
+Xv6 does boot on real hardware, but typically
+we run it using the Bochs emulator.
+Both the GCC cross compiler and Bochs
+can be found on the [../../2007/tools.html | 6.828 tools page].
+
+
+* Lectures
+
+In 6.828, the lectures in the first half of the course
+introduce the PC hardware, the Intel x86, and then xv6.
+The lectures in the second half consider advanced topics
+using research papers; for some, xv6 serves as a useful
+base for making discussions concrete.
+This section describes a typical 6.828 lecture schedule,
+linking to lecture notes and homework. 
+A course using only xv6 (not Jos) will need to adapt
+a few of the lectures, but we hope these are a useful
+starting point.
+
+
+Lecture 1. Operating systems
+
+The first lecture introduces both the general topic of
+operating systems and the specific approach of 6.828.
+After defining ``operating system,'' the lecture
+examines the implementation of a Unix shell
+to look at the details of the traditional Unix system call interface.
+This is relevant to both xv6 and Jos: in the final
+Jos labs, students implement a Unix-like interface,
+culminating in a Unix shell.
+
+[l1.html | lecture notes]
+
+
+Lecture 2. PC hardware and x86 programming
+
+This lecture introduces the PC architecture, the 16- and 32-bit x86,
+the stack, and the GCC x86 calling conventions.
+It also introduces the pieces of a typical C tool chain--compiler,
+assembler, linker, loader--and the Bochs emulator.
+
+Reading: PC Assembly Language
+
+Homework: familiarize yourself with Bochs
+
+[l2.html | lecture notes]
+[x86-intro.html | homework]
+
+
+Lecture 3. Operating system organization
+
+This lecture continues Lecture 1's discussion of what
+an operating system does.
+An operating system provides a ``virtual computer''
+interface to user space programs.
+At a high level, the main job of the operating system
+is to implement that interface
+using the physical computer it runs on.
+
+The lecture discusses four approaches to that job:
+monolithic operating systems, microkernels,
+virtual machines, and exokernels.
+Exokernels might not be worth mentioning
+except that the Jos labs are built around one.
+
+Reading: Engler et al., Exokernel: An Operating System Architecture
+for Application-Level Resource Management
+
+[l3.html | lecture notes]
+
+
+Lecture 4. Address spaces using segmentation
+
+This is the first lecture that uses xv6.
+It introduces the idea of address spaces and the
+details of the x86 segmentation hardware. 
+It makes the discussion concrete by reading the xv6
+source code and watching xv6 execute using the Bochs simulator.
+
+Reading: x86 MMU handout,
+xv6: bootasm.S, bootother.S, bootmain.c, main.c, init.c, and setupsegs in proc.c.
+
+Homework: Bochs stack introduction
+
+[l4.html | lecture notes]
+[xv6-intro.html | homework]
+
+
+Lecture 5. Address spaces using page tables
+
+This lecture continues the discussion of address spaces,
+examining the other x86 virtual memory mechanism: page tables.
+Xv6 does not use page tables, so there is no xv6 here.
+Instead, the lecture uses Jos as a concrete example.
+An xv6-only course might skip or shorten this discussion.
+
+Reading: x86 manual excerpts
+
+Homework: stuff about gdt
+XXX not appropriate; should be in Lecture 4
+
+[l5.html | lecture notes]
+
+
+Lecture 6. Interrupts and exceptions
+
+How does a user program invoke the operating system kernel?
+How does the kernel return to the user program?
+What happens when a hardware device needs attention?
+This lecture explains the answers to these questions:
+interrupt and exception handling.
+
+It explains the x86 trap setup mechanisms and then
+examines their use in xv6's SETGATE (mmu.h),
+tvinit (trap.c), idtinit (trap.c), vectors.pl, and vectors.S.
+
+It then traces through a call to the system call open:
+init.c, usys.S, vector48 and alltraps (vectors.S), trap (trap.c),
+syscall (syscall.c),
+sys_open (sysfile.c), fetcharg, fetchint, argint, argptr, argstr (syscall.c).
+
+The interrupt controller, briefly:
+pic_init and pic_enable (picirq.c).
+The timer and keyboard, briefly:
+timer_init (timer.c), console_init (console.c).
+Enabling and disabling of interrupts.
+
+Reading: x86 manual excerpts,
+xv6: trapasm.S, trap.c, syscall.c, and usys.S.
+Skim lapic.c, ioapic.c, picirq.c.
+
+Homework: Explain the 35 words on the top of the
+stack at first invocation of syscall.
+
+[l-interrupt.html | lecture notes]
+[x86-intr.html | homework]
+
+
+Lecture 7. 
Multiprocessors and locking + +This lecture introduces the problems of +coordination and synchronization on a +multiprocessor +and then the solution of mutual exclusion locks. +Atomic instructions, test-and-set locks, +lock granularity, (the mistake of) recursive locks. + +Although xv6 user programs cannot share memory, +the xv6 kernel itself is a program with multiple threads +executing concurrently and sharing memory. +Illustration: the xv6 scheduler's proc_table_lock (proc.c) +and the spin lock implementation (spinlock.c). + +Reading: xv6: spinlock.c. Skim mp.c. + +Homework: Interaction between locking and interrupts. +Try not disabling interrupts in the disk driver and watch xv6 break. + +[l-lock.html | lecture notes] +[xv6-lock.html | homework] + + +Lecture 8. Threads, processes and context switching + +The last lecture introduced some of the issues +in writing threaded programs, using xv6's processes +as an example. +This lecture introduces the issues in implementing +threads, continuing to use xv6 as the example. + +The lecture defines a thread of computation as a register +set and a stack. A process is an address space plus one +or more threads of computation sharing that address space. +Thus the xv6 kernel can be viewed as a single process +with many threads (each user process) executing concurrently. + +Illustrations: thread switching (swtch.S), scheduler (proc.c), sys_fork (sysproc.c) + +Reading: proc.c, swtch.S, sys_fork (sysproc.c) + +Homework: trace through stack switching. + +[l-threads.html | lecture notes (need to be updated to use swtch)] +[xv6-sched.html | homework] + + +Lecture 9. Processes and coordination + +This lecture introduces the idea of sequence coordination +and then examines the particular solution illustrated by +sleep and wakeup (proc.c). +It introduces and refines a simple +producer/consumer queue to illustrate the +need for sleep and wakeup +and then the sleep and wakeup +implementations themselves. 
+
+Reading: proc.c, sys_exec, sys_sbrk, sys_wait, sys_exit, sys_kill (sysproc.c).
+
+Homework: Explain how sleep and wakeup would break
+without proc_table_lock. Explain how devices would break
+without second lock argument to sleep.
+
+[l-coordination.html | lecture notes]
+[xv6-sleep.html | homework]
+
+
+Lecture 10. Files and disk I/O
+
+This is the first of three file system lectures.
+This lecture introduces the basic file system interface
+and then considers the on-disk layout of individual files
+and the free block bitmap.
+
+Reading: iread, iwrite, fileread, filewrite, wdir, mknod1, and
+ code related to these calls in fs.c, bio.c, ide.c, and file.c.
+
+Homework: Add print to bwrite to trace every disk write.
+Explain the disk writes caused by some simple shell commands.
+
+[l-fs.html | lecture notes]
+[xv6-disk.html | homework]
+
+
+Lecture 11. Naming
+
+The last lecture discussed on-disk file system representation.
+This lecture covers the implementation of
+file system paths (namei in fs.c)
+and also discusses the security problems of a shared /tmp
+and symbolic links.
+
+Understanding exec (exec.c) is left as an exercise.
+
+Reading: namei in fs.c, sysfile.c, file.c.
+
+Homework: Explain how to implement symbolic links in xv6.
+
+[l-name.html | lecture notes]
+[xv6-names.html | homework]
+
+
+Lecture 12. High-performance file systems
+
+This lecture is the first of the research paper-based lectures.
+It discusses the ``soft updates'' paper,
+using xv6 as a concrete example.
+
+
+* Feedback
+
+If you are interested in using xv6 or have used xv6 in a course,
+we would love to hear from you.
+If there's anything that we can do to make xv6 easier
+to adopt, we'd like to hear about it.
+We'd also be interested to hear what worked well and what didn't.
+
+Russ Cox (rsc@swtch.com)
+Frans Kaashoek (kaashoek@mit.edu)
+Robert Morris (rtm@mit.edu) + +You can reach all of us at 6.828-staff@pdos.csail.mit.edu. + + diff --git a/web/l-bugs.html b/web/l-bugs.html new file mode 100644 index 0000000..493372d --- /dev/null +++ b/web/l-bugs.html @@ -0,0 +1,187 @@ +OS Bugs + + + + + +

OS Bugs

+ +

Required reading: Bugs as deviant behavior + +

Overview

+ +

Operating systems must obey many rules for correctness and
+performance. Example rules:

+ +

In addition, there are standard software engineering rules, such as
+using function results in consistent ways.

These rules are typically not checked by a compiler, even though,
+in principle, they could be. The goal of the
+meta-level compilation project is to allow system implementors to
+write system-specific compiler extensions that check the source code
+for rule violations.

The results are good: many new bugs found (500-1000) in Linux +alone. The paper for today studies these bugs and attempts to draw +lessons from these bugs. + +

Are kernel errors worse than user-level errors? That is, if we get
+the kernel correct, will we have no system crashes?

Errors in JOS kernel

+ +

What are unstated invariants in the JOS? +

+ +

Could these errors have been caught by metacompilation? Would +metacompilation have caught the pipe race condition? (Probably not, +it happens in only one place.) + +

How confident are you that your code is correct? For example, +are you sure interrupts are always disabled in kernel mode? How would +you test? + +

Metacompilation

+ +

A system programmer writes the rule checkers in a high-level, +state-machine language (metal). These checkers are dynamically linked +into an extensible version of g++, xg++. Xg++ applies the rule +checkers to every possible execution path of a function that is being +compiled. + +

An example rule from +the OSDI +paper: +

+sm check_interrupts {
+   decl { unsigned} flags;
+   pat enable = { sti(); } | {restore_flags(flags);} ;
+   pat disable = { cli(); };
+   
+   is_enabled: disable ==> is_disabled | enable ==> { err("double
+      enable")};
+   ...
+
+A more complete version found 82 errors in the Linux 2.3.99 kernel. + +

Common mistake: +

+get_free_buffer ( ... ) {
+   ....
+   save_flags (flags);
+   cli ();
+   if ((bh = sh->buffer_pool) == NULL)
+      return NULL;   /* BUG: returns with interrupts still disabled */
+   ....
+}
+
+

(Figure 2 also lists a simple metarule.) + +

Some checkers produce false positives, because of limitations of +both static analysis and the checkers, which mostly use local +analysis. + +

How does the block checker work? The first pass is a rule
+that marks functions as potentially blocking. After processing a
+function, the checker emits the function's flow graph to a file
+(including annotations and functions called). The second pass takes
+the merged flow graph of all function calls, and produces a file with
+all functions that have a path in the control-flow-graph to a blocking
+function call. For the Linux kernel this results in 3,000 functions
+that potentially could call sleep. Yet another checker like
+check_interrupts checks if a function calls any of the 3,000 functions
+with interrupts disabled. Etc.

This paper

+ +

Writing rules is painful. First, you have to write them. Second, +how do you decide what to check? Was it easy to enumerate all +conventions for JOS? + +

Insight: infer programmer "beliefs" from code and cross-check +for contradictions. If cli is always followed by sti, +except in one case, perhaps something is wrong. This simplifies +life because we can write generic checkers instead of checkers +that specifically check for sti, and perhaps we get lucky +and find other temporal ordering conventions. + +

Do we know which case is wrong? The 999 times or the 1 time that
+sti is absent? (No, this method cannot figure out what the correct
+sequence is, but it can flag that something is weird, which in practice
+is useful.) The method just detects inconsistencies.

Is every inconsistency an error? No, some inconsistencies don't
+indicate an error. If a call to function f is often followed
+by a call to function g, does that imply that f should always be
+followed by g? (No!)

Solution: MUST beliefs and MAYBE beliefs. MUST beliefs are
+invariants that must hold; any inconsistency indicates an error. If a
+pointer is dereferenced, then the programmer MUST believe that the
+pointer is pointing to something that can be dereferenced (i.e., the
+pointer is definitely not zero). MUST beliefs can be checked using
+"internal inconsistencies".

As an aside, can zero pointers be detected at runtime?
+(Sure, unmap the page at address zero.) Why is metacompilation still
+valuable? (At runtime you will find only the null pointers that your
+test code dereferenced; not all possible dereferences of null
+pointers.) An even more convincing example for metacompilation is
+tracking user pointers that the kernel dereferences. (Is this a MUST
+belief?)

MAYBE beliefs are invariants that are suggested by the code, but
+they may be coincidences. MAYBE beliefs are ranked by statistical
+analysis, and perhaps augmented with input about function names
+(e.g., alloc and free are important). Is it computationally feasible
+to check every MAYBE belief? Could there be much noise?

What errors won't this approach catch? + +

Paper discussion

+ +

This paper is best discussed by studying every code fragment. Most +code fragments are pieces of code from Linux distributions; these +mistakes are real! + +

Section 3.1. what is the error? how does metacompilation catch +it? + +

Figure 1. what is the error? is there one? + +

Code fragments from 6.1. what is the error? how does metacompilation catch +it? + +

Figure 3. what is the error? how does metacompilation catch +it? + +

Section 8.3. what is the error? how does metacompilation catch +it? + + + diff --git a/web/l-coordination.html b/web/l-coordination.html new file mode 100644 index 0000000..b2f9f0d --- /dev/null +++ b/web/l-coordination.html @@ -0,0 +1,354 @@ +L9 + + + + + +

Coordination and more processes

+ +

Required reading: remainder of proc.c, sys_exec, sys_sbrk, + sys_wait, sys_exit, and sys_kill. + +

Overview

+ +

Big picture: more programs than processors. How to share the + limited number of processors among the programs? Last lecture + covered basic mechanism: threads and the distinction between process + and thread. Today expand: how to coordinate the interactions + between threads explicitly, and some operations on processes. + +

Sequence coordination. This is a different type of coordination
+  than mutual-exclusion coordination (whose goal is to make
+  actions atomic so that threads don't interfere). The goal of
+  sequence coordination is for threads to coordinate the sequences in
+  which they run.

For example, a thread may want to wait until another thread + terminates. One way to do so is to have the thread run periodically, + let it check if the other thread terminated, and if not give up the + processor again. This is wasteful, especially if there are many + threads. + +

With primitives for sequence coordination one can do better. The + thread could tell the thread manager that it is waiting for an event + (e.g., another thread terminating). When the other thread + terminates, it explicitly wakes up the waiting thread. This is more + work for the programmer, but more efficient. + +

Sequence coordination often interacts with mutual-exclusion + coordination, as we will see below. + +

The operating system literature has a rich set of primitives for
+  sequence coordination. We study a very simple version of condition
+  variables in xv6: sleep and wakeup, with a single lock.

xv6 code examples

+ +

Sleep and wakeup - usage

+ +Let's consider implementing a producer/consumer queue +(like a pipe) that can be used to hold a single non-null char pointer: + +
+struct pcq {
+    void *ptr;
+};
+
+void*
+pcqread(struct pcq *q)
+{
+    void *p;
+
+    while((p = q->ptr) == 0)
+        ;
+    q->ptr = 0;
+    return p;
+}
+
+void
+pcqwrite(struct pcq *q, void *p)
+{
+    while(q->ptr != 0)
+        ;
+    q->ptr = p;
+}
+
+ +

Easy and correct, at least assuming there is at most one +reader and at most one writer at a time. + +

Unfortunately, the while loops are inefficient. +Instead of polling, it would be great if there were +primitives saying ``wait for some event to happen'' +and ``this event happened''. +That's what sleep and wakeup do. + +

Second try: + +

+void*
+pcqread(struct pcq *q)
+{
+    void *p;
+
+    if(q->ptr == 0)
+        sleep(q);
+    p = q->ptr;
+    q->ptr = 0;
+    wakeup(q);  /* wake pcqwrite */
+    return p;
+}
+
+void
+pcqwrite(struct pcq *q, void *p)
+{
+    if(q->ptr != 0)
+        sleep(q);
+    q->ptr = p;
+    wakeup(q);  /* wake pcqread */
+}
+
+ +That's better, but there is still a problem. +What if the wakeup happens between the check in the if +and the call to sleep? + +

Add locks: + +

+struct pcq {
+    void *ptr;
+    struct spinlock lock;
+};
+
+void*
+pcqread(struct pcq *q)
+{
+    void *p;
+
+    acquire(&q->lock);
+    if(q->ptr == 0)
+        sleep(q, &q->lock);
+    p = q->ptr;
+    q->ptr = 0;
+    wakeup(q);  /* wake pcqwrite */
+    release(&q->lock);
+    return p;
+}
+
+void
+pcqwrite(struct pcq *q, void *p)
+{
+    acquire(&q->lock);
+    if(q->ptr != 0)
+        sleep(q, &q->lock);
+    q->ptr = p;
+    wakeup(q);  /* wake pcqread */
+    release(&q->lock);
+}
+
+ +This is okay, and now safer for multiple readers and writers, except that wakeup wakes up everyone who is asleep on chan, not just one process. So some of the processes that wake up from sleep might find that they still cannot read from or write to the queue. Have to go back to looping: +
+struct pcq {
+    void *ptr;
+    struct spinlock lock;
+};
+
+void*
+pcqread(struct pcq *q)
+{
+    void *p;
+
+    acquire(&q->lock);
+    while(q->ptr == 0)
+        sleep(q, &q->lock);
+    p = q->ptr;
+    q->ptr = 0;
+    wakeup(q);  /* wake pcqwrite */
+    release(&q->lock);
+    return p;
+}
+
+void
+pcqwrite(struct pcq *q, void *p)
+{
+    acquire(&q->lock);
+    while(q->ptr != 0)
+        sleep(q, &q->lock);
+    q->ptr = p;
+    wakeup(q);  /* wake pcqread */
+    release(&q->lock);
+}
+
+ +The difference between this and our original version is that the body of the while loop is a much more efficient way to pause. + +
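For comparison outside xv6: POSIX threads package exactly this pattern as condition variables, where pthread_cond_wait atomically releases the mutex and blocks, much as sleep(chan, lk) does. A sketch (not xv6 code):

```c
#include <assert.h>
#include <pthread.h>

struct pcq {
    void *ptr;
    pthread_mutex_t lock;
    pthread_cond_t cond;
};

void *
pcqread(struct pcq *q)
{
    void *p;

    pthread_mutex_lock(&q->lock);
    while (q->ptr == 0)
        pthread_cond_wait(&q->cond, &q->lock);  /* releases lock while waiting */
    p = q->ptr;
    q->ptr = 0;
    pthread_cond_broadcast(&q->cond);  /* wake pcqwrite */
    pthread_mutex_unlock(&q->lock);
    return p;
}

void
pcqwrite(struct pcq *q, void *p)
{
    pthread_mutex_lock(&q->lock);
    while (q->ptr != 0)
        pthread_cond_wait(&q->cond, &q->lock);
    q->ptr = p;
    pthread_cond_broadcast(&q->cond);  /* wake pcqread */
    pthread_mutex_unlock(&q->lock);
}
```

As in the xv6 version, the wait sits in a while loop because a broadcast may wake a thread that finds the slot already claimed by another.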

Now we've figured out how to use it, but we +still need to figure out how to implement it. + +

Sleep and wakeup - implementation

+

+Simple implementation: + +

+void
+sleep(void *chan, struct spinlock *lk)
+{
+    struct proc *p = curproc[cpu()];
+    
+    release(lk);
+    p->chan = chan;
+    p->state = SLEEPING;
+    sched();
+}
+
+void
+wakeup(void *chan)
+{
+    for(each proc p) {
+        if(p->state == SLEEPING && p->chan == chan)
+            p->state = RUNNABLE;
+    }	
+}
+
+ +

What's wrong? What if the wakeup runs right after the release(lk) in sleep, before p->state has been set to SLEEPING? The wakeup still misses the sleeper. + +

Move the lock down: +

+void
+sleep(void *chan, struct spinlock *lk)
+{
+    struct proc *p = curproc[cpu()];
+    
+    p->chan = chan;
+    p->state = SLEEPING;
+    release(lk);
+    sched();
+}
+
+void
+wakeup(void *chan)
+{
+    for(each proc p) {
+        if(p->state == SLEEPING && p->chan == chan)
+            p->state = RUNNABLE;
+    }	
+}
+
+ +

This almost works. Recall from last lecture that we also need +to acquire the proc_table_lock before calling sched, to +protect p->jmpbuf. + +

+void
+sleep(void *chan, struct spinlock *lk)
+{
+    struct proc *p = curproc[cpu()];
+    
+    p->chan = chan;
+    p->state = SLEEPING;
+    acquire(&proc_table_lock);
+    release(lk);
+    sched();
+}
+
+ +

The problem is that now we're using lk to protect access to the p->chan and p->state variables, but other routines besides sleep and wakeup (in particular, proc_kill) will need to use them and won't know which lock protects them. So instead of protecting them with lk, let's use proc_table_lock: + +

+void
+sleep(void *chan, struct spinlock *lk)
+{
+    struct proc *p = curproc[cpu()];
+    
+    acquire(&proc_table_lock);
+    release(lk);
+    p->chan = chan;
+    p->state = SLEEPING;
+    sched();
+}
+void
+wakeup(void *chan)
+{
+    acquire(&proc_table_lock);
+    for(each proc p) {
+        if(p->state == SLEEPING && p->chan == chan)
+            p->state = RUNNABLE;
+    }
+    release(&proc_table_lock);
+}
+
+ +

One could probably make things work with lk as above, +but the relationship between data and locks would be +more complicated with no real benefit. Xv6 takes the easy way out +and says that elements in the proc structure are always protected +by proc_table_lock. + +

Use example: exit and wait

+ +

If proc_wait decides there are children to be waited for, it calls sleep at line 2462. When a process exits, proc_exit scans the process table to find the parent and wakes it at line 2408. + +

Which lock protects sleep and wakeup from missing each other? +Proc_table_lock. Have to tweak sleep again to avoid double-acquire: + +

+if(lk != &proc_table_lock) {
+    acquire(&proc_table_lock);
+    release(lk);
+}
+
+ +

New feature: kill

+ +

Proc_kill marks a process as killed (line 2371). When the process finally exits the kernel to user space, or if a clock interrupt happens while it is in user space, it will be destroyed (lines 2886, 2890, 2912). + +

Why wait until the process ends up in user space? + +

What if the process is stuck in sleep? It might take a long +time to get back to user space. +Don't want to have to wait for it, so make sleep wake up early +(line 2373). + +

This means all callers of sleep should check +whether they have been killed, but none do. +Bug in xv6. + +

System call handlers

+ +

Sheet 32 + +

Fork: discussed copyproc in earlier lectures. +Sys_fork (line 3218) just calls copyproc +and marks the new proc runnable. +Does fork create a new process or a new thread? +Is there any shared context? + +

Exec: we'll talk about exec later, when we talk about file systems. + +

Sbrk: Saw growproc earlier. Why setupsegs before returning? diff --git a/web/l-fs.html b/web/l-fs.html new file mode 100644 index 0000000..ed911fc --- /dev/null +++ b/web/l-fs.html @@ -0,0 +1,222 @@ +L10 + + + + + +

File systems

+ +

Required reading: iread, iwrite, and wdir, and code related to + these calls in fs.c, bio.c, ide.c, file.c, and sysfile.c + +

Overview

+ +

The next 3 lectures are about file systems: +

+ +

Users want to store their data durably, so that the data survives when the user turns off the computer. The primary media for doing so are magnetic disks, flash memory, and tapes. We focus on magnetic disks (e.g., through the IDE interface in xv6). + +

To allow users to remember where they stored a file, they can assign a symbolic name to the file; the name appears in a directory. + +

The data in a file can be organized in a structured way or not. +The structured variant is often called a database. UNIX uses the +unstructured variant: files are streams of bytes. Any particular +structure is likely to be useful to only a small class of +applications, and other applications will have to work hard to fit +their data into one of the pre-defined structures. Besides, if you +want structure, you can easily write a user-mode library program that +imposes that format on any file. The end-to-end argument in action. +(Databases have special requirements and support an important class of +applications, and thus have a specialized plan.) + +

The API for a minimal file system consists of: open, read, write, +seek, close, and stat. Dup duplicates a file descriptor. For example: +

+  fd = open("x", O_RDWR);
+  read (fd, buf, 100);
+  write (fd, buf, 512);
+  close(fd);
+
+ +

Maintaining the file offset behind the read/write interface is an interesting design decision. The alternative is that the state of a read operation should be maintained by the process doing the reading (i.e., that the offset should be passed as an argument to read). That alternative is compelling in view of the UNIX fork() semantics, which clone a process that shares the file descriptors of its parent: a read by the parent of a shared file descriptor (e.g., stdin) changes the read pointer seen by the child. On the other hand, the alternative would make it difficult to get "(date; ls) > x" right. + +
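The shared-offset semantics can be observed directly with a small POSIX program (a sketch; the file path passed in is an arbitrary illustration): after fork, parent and child share one open file description, so a read in the child advances the offset seen by the parent.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the byte the parent reads after the child has already read
   one byte from the same (shared) file descriptor. */
char
shared_offset_demo(const char *path)
{
    char c = 0;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    write(fd, "ab", 2);
    lseek(fd, 0, SEEK_SET);

    if (fork() == 0) {          /* child inherits fd and shares its offset */
        read(fd, &c, 1);        /* consumes 'a', advancing the shared offset */
        _exit(0);
    }
    wait(0);
    read(fd, &c, 1);            /* parent sees 'b', not 'a' */
    close(fd);
    unlink(path);
    return c;
}
```

If the offset lived in each process instead of in the open file description, the parent would read 'a' here, and "(date; ls) > x" would have the two commands overwrite each other's output.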

The Unix API doesn't specify that the effects of a write are on the disk before the write returns. That is up to the implementation of the file system, within certain bounds. Choices include (and they aren't mutually exclusive): + +

+ +

A design issue is the semantics of a file system operation that requires multiple disk writes. In particular, what happens if the logical update requires writing multiple disk blocks and the power fails during the update? For example, creating a new file requires allocating an inode (which requires updating the list of free inodes on disk) and writing a directory entry to record the allocated inode under the name of the new file (which may require allocating a new block and updating the directory inode). If the power fails during the operation, the list of free inodes and blocks may be inconsistent with the blocks and inodes in use. Again, it is up to the implementation of the file system to keep the on-disk data structures consistent: + +

+ +

Another design issue is the semantics of concurrent writes to the same data item. What is the order of two updates that happen at the same time? For example, two processes open the same file and write to it. Modern Unix operating systems allow the application to lock a file to get exclusive access. If file locking is not used and the file descriptor is shared, then the bytes of the two writes will get into the file in some order (this happens often for log files). If the file descriptor is not shared, the end result is not defined. For example, one write may overwrite the other (e.g., if they are writing to the same part of the file). + +

An implementation issue is performance, because writing to magnetic disk is relatively expensive compared to computing. Three primary ways to improve performance are: a careful file system layout that induces few seeks, an in-memory cache of frequently accessed blocks, and overlapping I/O with computation, so that file operations don't have to wait for completion and the disk driver has more data to write, which allows disk scheduling. (We will talk about performance in detail later.) + +

xv6 code examples

+ +

xv6 implements a minimal Unix file system interface. xv6 doesn't pay attention to file system layout. It overlaps computation and I/O, but doesn't do any disk scheduling. Its cache is write-through, which simplifies keeping on-disk data structures consistent, but is bad for performance. + +

On disk, files are represented by an inode (struct dinode in fs.h) and blocks. Small files have up to 12 block addresses in their inode; large files use the last address in the inode as the disk address of a block holding 128 further disk addresses (512/4). The size of a file is thus limited to 12 * 512 + 128*512 bytes. What would you change to support larger files? (Ans: e.g., double indirect blocks.) + +
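A quick sanity check of that limit, using the constants above (512-byte blocks, 12 direct addresses, and one indirect block of 512/4 = 128 addresses); the helper names here are made up for illustration:

```c
enum {
    BSIZE = 512,            /* block size in bytes */
    NDIRECT = 12,           /* direct block addresses in the inode */
    NINDIRECT = BSIZE / 4   /* addresses in the indirect block: 128 */
};

/* Maximum file size in bytes under this layout. */
unsigned int
maxfilesize(void)
{
    return (NDIRECT + NINDIRECT) * BSIZE;   /* 12*512 + 128*512 */
}

/* Is file block number bn reached through the indirect block? */
int
isindirect(unsigned int bn)
{
    return bn >= NDIRECT;
}
```

So the largest possible file is 70 KB; a double indirect block would multiply the reachable range by another factor of 128.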

Directories are files with a bit of structure to them. The file consists of records of type struct dirent. Each entry contains the name of a file (or directory) and its corresponding inode number. How many files can appear in a directory? + +

In memory, files are represented by struct inode in fsvar.h. What is the role of the additional fields in struct inode? + +

What is xv6's disk layout? How does xv6 keep track of free blocks + and inodes? See balloc()/bfree() and ialloc()/ifree(). Is this + layout a good one for performance? What are other options? + +

Let's assume that an application created a file x that contains 512 bytes, and that the application now calls read(fd, buf, 100); that is, it is requesting to read 100 bytes into buf. Furthermore, let's assume that the inode for x is i. Let's trace what happens by investigating readi(), line 4483. +

+ +

Now let's suppose that the process is writing 512 bytes at the end of the file x. How many disk writes will happen? +

+ +

Lots of code to implement reading and writing of files. How about + directories? +

+

Reading and writing of directories is trivial. + + diff --git a/web/l-interrupt.html b/web/l-interrupt.html new file mode 100644 index 0000000..363af5e --- /dev/null +++ b/web/l-interrupt.html @@ -0,0 +1,174 @@ + +Lecture 6: Interrupts & Exceptions + + +

Interrupts & Exceptions

+ +

+Required reading: xv6 trapasm.S, trap.c, syscall.c, usys.S. +
+You will need to consult +IA32 System +Programming Guide chapter 5 (skip 5.7.1, 5.8.2, 5.12.2). + +

Overview

+ +

+Big picture: kernel is trusted third-party that runs the machine. +Only the kernel can execute privileged instructions (e.g., +changing MMU state). +The processor enforces this protection through the ring bits +in the code segment. +If a user application needs to carry out a privileged operation +or other kernel-only service, +it must ask the kernel nicely. +How can a user program change to the kernel address space? +How can the kernel transfer to a user address space? +What happens when a device attached to the computer +needs attention? +These are the topics for today's lecture. + +

+There are three kinds of events that must be handled by the kernel, not user programs: (1) a system call invoked by a user program, (2) an illegal instruction or other kind of bad processor state (memory fault, etc.), and (3) an interrupt from a hardware device. + +

+Although these three events are different, they all use the same mechanism to transfer control to the kernel. This mechanism consists of three steps that execute as one atomic unit: (a) change the processor to kernel mode; (b) save the old processor state somewhere (usually the kernel stack); and (c) change the processor state to the values set up as the “official kernel entry values.” The exact implementation of this mechanism differs from processor to processor, but the idea is the same. + +

+We'll work through examples of these today in lecture. +You'll see all three in great detail in the labs as well. + +

+A note on terminology: sometimes we'll +use interrupt (or trap) to mean both interrupts and exceptions. + +

+Setting up traps on the x86 +

+ +

+See handout Table 5-1, Figure 5-1, Figure 5-2. + +

+xv6 Sheet 07: struct gatedesc and SETGATE. + +

+xv6 Sheet 28: tvinit and idtinit. +Note setting of gate for T_SYSCALL + +

+xv6 Sheet 29: vectors.pl (also see generated vectors.S). + +

+System calls +

+ +

+xv6 Sheet 16: init.c calls open("console"). +How is that implemented? + +

+xv6 usys.S (not in book). +(No saving of registers. Why?) + +

+Breakpoint 0x1b:"open", +step past int instruction into kernel. + +

+See handout Figure 9-4 [sic]. + +

+xv6 Sheet 28: in vectors.S briefly, then in alltraps. +Step through to call trap, examine registers and stack. +How will the kernel find the argument to open? + +

+xv6 Sheet 29: trap, on to syscall. + +

+xv6 Sheet 31: syscall looks at eax, +calls sys_open. + +

+(Briefly) +xv6 Sheet 52: sys_open uses argstr and argint +to get its arguments. How do they work? + +

+xv6 Sheet 30: fetchint, fetcharg, argint, +argptr, argstr. + +

+What happens if a user program divides by zero +or accesses unmapped memory? +Exception. Same path as system call until trap. + +

+What happens if kernel divides by zero or accesses unmapped memory? + +

+Interrupts +

+ +

+Like system calls, except: +devices generate them at any time, +there are no arguments in CPU registers, +nothing to return to, +usually can't ignore them. + +

+How do they get generated? A device essentially phones up the interrupt controller and asks to talk to the CPU. The interrupt controller then buzzes the CPU and tells it, “keyboard on line 1.” The interrupt controller is essentially the CPU's administrative assistant, managing the phone lines on the CPU's behalf. + +

+Have to set up interrupt controller. + +

+(Briefly) xv6 Sheet 63: pic_init sets up the interrupt controller, +irq_enable tells the interrupt controller to let the given +interrupt through. + +

+(Briefly) xv6 Sheet 68: pit8253_init sets up the clock chip, +telling it to interrupt on IRQ_TIMER 100 times/second. +console_init sets up the keyboard, enabling IRQ_KBD. + +

+In Bochs, set breakpoint at 0x8:"vector0" +and continue, loading kernel. +Step through clock interrupt, look at +stack, registers. + +

+Was the processor executing in kernel or user mode +at the time of the clock interrupt? +Why? (Have any user-space instructions executed at all?) + +

+Can the kernel get an interrupt at any time? +Why or why not? cli and sti, +irq_enable. + + + diff --git a/web/l-lock.html b/web/l-lock.html new file mode 100644 index 0000000..eea8217 --- /dev/null +++ b/web/l-lock.html @@ -0,0 +1,322 @@ +L7 + + + + + +

Locking

+ +

Required reading: spinlock.c + +

Why coordinate?

+ +

Mutual-exclusion coordination is an important topic in operating systems, because many operating systems run on multiprocessors. Coordination techniques protect variables that are shared among multiple threads and updated concurrently. These techniques allow programmers to implement atomic sections so that one thread can safely update the shared variables without having to worry about another thread intervening. For example, processes in xv6 may run concurrently on different processors and, in kernel mode, share kernel data structures. We must ensure that these updates happen correctly. + +

List and insert example: +

+
+struct List {
+  int data;
+  struct List *next;
+};
+
+List *list = 0;
+
+insert(int data) {
+  List *l = new List;
+  l->data = data;
+  l->next = list;  // A
+  list = l;        // B
+}
+
+ +

What needs to be atomic? The two statements labeled A and B should always be executed together, as an indivisible fragment of code. If two processors execute A and B interleaved, then we end up with an incorrect list. To see that this is the case, draw out the list after the sequence A1 (statement A executed by processor 1), A2 (statement A executed by processor 2), B2, and B1. + +
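To make the lost update concrete, one can replay the interleaving A1, A2, B2, B1 by hand in a single thread (a simulation sketch; the race() helper is made up for illustration, not part of the example above):

```c
#include <stdlib.h>

struct List {
    int data;
    struct List *next;
};

struct List *list = 0;

/* Run two inserts with their statements interleaved as A1, A2, B2, B1.
   Returns the node that ends up lost. */
struct List *
race(void)
{
    struct List *l1 = malloc(sizeof(struct List));  /* processor 1's node */
    struct List *l2 = malloc(sizeof(struct List));  /* processor 2's node */
    l1->data = 1;
    l2->data = 2;

    l1->next = list;   /* A1: processor 1 reads the empty list */
    l2->next = list;   /* A2: processor 2 also reads the empty list */
    list = l2;         /* B2: processor 2 publishes its node */
    list = l1;         /* B1: processor 1 overwrites the head */

    return l2;         /* l2 is no longer reachable from list */
}
```

After this interleaving the list holds only one node even though insert ran twice; processor 2's node was silently dropped.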

How could this erroneous sequence happen? The variable list lives in physical memory shared among multiple processors, connected by a bus. The accesses to the shared memory will be put in some total order by the bus/memory system. If the programmer doesn't coordinate the execution of statements A and B, any order can happen, including the erroneous one. + +

The erroneous case is called a race condition. The problem with +races is that they are difficult to reproduce. For example, if you +put print statements in to debug the incorrect behavior, you might +change the time and the race might not happen anymore. + +

Atomic instructions

+ +

The programmer must be able to express that A and B should be executed as a single atomic unit. We generally use a concept like locks to mark an atomic region, acquiring the lock at the beginning of the section and releasing it at the end: + +

 
+void acquire(int *lock) {
+   while (TSL(lock) != 0) ; 
+}
+
+void release (int *lock) {
+  *lock = 0;
+}
+
+ +

Acquire and release, of course, need to be atomic too, which can, for example, be done with a hardware atomic TSL (test-and-set-lock) instruction: + +

The semantics of TSL are: +

+   R <- [mem]   // load content of mem into register R
+   [mem] <- 1   // store 1 in mem.
+
+ +

In a hardware implementation, the bus arbiter guarantees that both the load and the store are executed without any other loads/stores coming in between. + +

We can use locks to implement an atomic insert, or we can use +TSL directly: +

+int insert_lock = 0;
+
+insert(int data) {
+
+  /* acquire the lock: */
+  while(TSL(&insert_lock) != 0)
+    ;
+
+  /* critical section: */
+  List *l = new List;
+  l->data = data;
+  l->next = list;
+  list = l;
+
+  /* release the lock: */
+  insert_lock = 0;
+}
+
+ +

It is the programmer's job to make sure that locks are respected. If a programmer writes another function that manipulates the list, the programmer must make sure that the new function acquires and releases the appropriate locks. If the programmer doesn't, race conditions occur. + +

This code assumes that stores commit to memory in program order and that all stores by other processors started before insert got the lock are observable by this processor. That is, after the other processor released the lock, all its previous stores are committed to memory. If a processor executes instructions out of order, this assumption won't hold, and we must, for example, insert a barrier instruction that makes the assumption true. + +

Example: Locking on x86

+ +

Here is one way we can implement acquire and release using the x86 +xchgl instruction: + +

+struct Lock {
+  unsigned int locked;
+};
+
+acquire(Lock *lck) {
+  while(TSL(&(lck->locked)) != 0)
+    ;
+}
+
+release(Lock *lck) {
+  lck->locked = 0;
+}
+
+int
+TSL(int *addr)
+{
+  register int content = 1;
+  // xchgl content, *addr
+  // xchgl exchanges the values of its two operands, while
+  // locking the memory bus to exclude other operations.
+  asm volatile ("xchgl %0,%1" :
+                "=r" (content),
+                "=m" (*addr) :
+                "0" (content),
+                "m" (*addr));
+  return(content);
+}
+
+ +

the instruction "XCHG %eax, (content)" works as follows: +

    +
  1. freeze other CPUs' memory activity +
  2. temp := content +
  3. content := %eax +
  4. %eax := temp +
  5. un-freeze other CPUs +
+ +

Steps 1 and 5 make XCHG special: it is "locked," using special signal lines on the inter-CPU bus for bus arbitration. + +

This implementation doesn't scale to a large number of processors; + in a later lecture we will see how we could do better. + +
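For comparison, the same construction can be written portably with C11 atomics, where atomic_exchange plays the role of TSL/xchgl. This is a sketch of the technique, not how xv6 (which predates C11) does it:

```c
#include <stdatomic.h>

struct Lock {
    atomic_int locked;
};

void
acquire(struct Lock *lck)
{
    /* atomic_exchange stores 1 and returns the old value atomically;
       spin until we are the one that observed 0. */
    while (atomic_exchange(&lck->locked, 1) != 0)
        ;
}

void
release(struct Lock *lck)
{
    /* The atomic store also orders the critical section's writes
       before the lock release (sequential consistency by default). */
    atomic_store(&lck->locked, 0);
}
```

The compiler emits a locked exchange (xchg or lock cmpxchg on x86) for atomic_exchange, so the semantics match the hand-written assembly version above.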

Lock granularity

+ +

Release/acquire is ideal for short atomic sections: increment a +counter, search in i-node cache, allocate a free buffer. + +

What are spin locks not so great for? Long atomic sections may waste waiters' CPU time, and it is a bad idea to sleep while holding locks. In xv6 we try to avoid long atomic sections by careful coding (can you find an example?). xv6 doesn't release the processor when holding a lock, but has an additional set of coordination primitives (sleep and wakeup), which we will study later. + +

My list_lock protects all lists; inserts to different lists are needlessly serialized. A lock per list would waste less time spinning, so you might want "fine-grained" locks, one for every object. BUT acquire/release are expensive (500 cycles on my 3 GHz machine) because they need to talk off-chip. + +

Also, "correctness" is not that simple with fine-grained locks if + need to maintain global invariants; e.g., "every buffer must be on + exactly one of free list and device list". Per-list locks are + irrelevant for this invariant. So you might want "large-grained", + which reduces overhead but reduces concurrency. + +

This tension is hard to get right. One often starts out with + "large-grained locks" and measures the performance of the system on + some workloads. When more concurrency is desired (to get better + performance), an implementor may switch to a more fine-grained + scheme. Operating system designers fiddle with this all the time. + +

Recursive locks and modularity

+ +

When designing a system we desire clean abstractions and good modularity. We would like a caller not to have to know how a callee implements a particular function. Locks make achieving modularity more complicated. For example, what should happen when the caller holds a lock, then calls a function that also needs the lock to perform its job? + +

There are no transparent solutions that allow the caller and callee to be unaware of which locks they use. One transparent, but unsatisfactory, option is recursive locks: if a callee asks for a lock that its caller holds, then we allow the callee to proceed. Unfortunately, this solution is not ideal either. + +

Consider the following. If lock x protects the internals of some struct foo, then a caller that acquires lock x knows that the internals of foo are in a sane state and it can fiddle with them. The caller must then restore the internals to a sane state before releasing lock x, but until then anything goes. + +

This assumption doesn't hold with recursive locking. After + acquiring lock x, the acquirer knows that either it is the first to + get this lock, in which case the internals are in a sane state, or + maybe some caller holds the lock and has messed up the internals and + didn't realize when calling the callee that it was going to try to + look at them too. So the fact that a function acquired the lock x + doesn't guarantee anything at all. In short, locks protect against + callers and callees just as much as they protect against other + threads. + +

Since transparent solutions aren't ideal, it is better to consider + locks part of the function specification. The programmer must + arrange that a caller doesn't invoke another function while holding + a lock that the callee also needs. + +

Locking in xv6

+ +

xv6 runs on a multiprocessor and is programmed to allow multiple threads of computation to run concurrently. In xv6 an interrupt might run on one processor while a process in kernel mode runs on another, sharing a kernel data structure with the interrupt routine. xv6 uses locks, implemented using an atomic instruction, to coordinate these concurrent activities. + +

Let's check out why xv6 needs locks by following what happens when +we start a second processor: +

+ +

Why hold proc_table_lock during a context switch? It protects +p->state; the process has to hold some lock to avoid a race with +wakeup() and yield(), as we will see in the next lectures. + +

Why not a lock per proc entry? It might be expensive for whole-table scans (in wait, wakeup, scheduler). proc_table_lock also protects some larger invariants; for example, it might be hard to get proc_wait() right with just per-entry locks. Right now the check to see if there are any exited children and the sleep are atomic, but that would be hard with per-entry locks. One could have both, but that would probably be neither clean nor fast. + +

Of course, there is only one processor at a time searching the proc table if acquire is implemented correctly. Let's check out acquire in spinlock.c: +

+ +

+ +

Locking in JOS

+ +

JOS is meant to run on single-CPU machines, so the plan can be simple. The simple plan is disabling/enabling interrupts in the kernel (the IF flag in the EFLAGS register). Thus, in the kernel, threads release the processor only when they want to, and can ensure that they don't release the processor during a critical section. + +

In user mode, JOS runs with interrupts enabled, but Unix user +applications don't share data structures. The data structures that +must be protected, however, are the ones shared in the library +operating system (e.g., pipes). In JOS we will use special-case +solutions, as you will find out in lab 6. For example, to implement +pipe we will assume there is one reader and one writer. The reader +and writer never update each other's variables; they only read each +other's variables. Carefully programming using this rule we can avoid +races. diff --git a/web/l-mkernel.html b/web/l-mkernel.html new file mode 100644 index 0000000..2984796 --- /dev/null +++ b/web/l-mkernel.html @@ -0,0 +1,262 @@ +Microkernel lecture + + + + + +

Microkernels

+ +

Required reading: Improving IPC by kernel design + +

Overview

+ +

This lecture looks at the microkernel organization. In a microkernel, services that a monolithic kernel implements in the kernel run as user-level programs. For example, the file system, UNIX process management, pager, and network protocols each run in a separate user-level address space. The microkernel itself supports only the services that are necessary to allow system services to run well in user space; a typical microkernel has at least support for creating address spaces, threads, and interprocess communication. + +

The potential advantages of a microkernel are simplicity of the kernel (small), isolation of operating system components (each runs in its own user-level address space), and flexibility (we can have a file server and a database server). One potential disadvantage is performance loss, because what requires a single system call in a monolithic kernel may require multiple system calls and context switches in a microkernel. + +

One way in which microkernels differ from each other is the exact kernel API they implement. For example, Mach (a system developed at CMU, which influenced a number of commercial operating systems) has the following system calls: processes (create, terminate, suspend, resume, priority, assign, info, threads), threads (fork, exit, join, detach, yield, self), ports and messages (a port is a unidirectional communication channel with a message queue and supporting primitives to send, destroy, etc.), and regions/memory objects (allocate, deallocate, map, copy, inherit, read, write). + +

Some microkernels are more "microkernel" than others. For example, some implement the pager in user space but the basic virtual memory abstractions in the kernel (e.g., Mach); others are more extreme and implement most of the virtual memory in user space (L4). Yet others are less extreme: many servers run in their own address spaces, but in kernel mode (Chorus). + +

All microkernels support multiple threads per address space. xv6 and, until recently, Unix didn't; why? Because in Unix system services are typically implemented in the kernel, and those are the primary programs that need multiple threads to handle events concurrently (waiting for disk while processing new I/O requests). In microkernels, these services are implemented in user-level address spaces, and so they need a mechanism to handle operations concurrently. (Of course, one can argue that if fork is efficient enough, there is no need to have threads.) + +

L3/L4

+ +

L3 is a predecessor to L4. L3 provides data persistence, DOS emulation, and an ELAN runtime system. L4 is a reimplementation of L3, but without the data persistence. L4KA is a project at sourceforge.net, and you can download the code for the latest incarnation of L4 from there. + +

L4 is a "second-generation" microkernel, with 7 calls: IPC (of +which there are several types), id_nearest (find a thread with an ID +close the given ID), fpage_unmap (unmap pages, mapping is done as a +side-effect of IPC), thread_switch (hand processor to specified +thread), lthread_ex_regs (manipulate thread registers), +thread_schedule (set scheduling policies), task_new (create a new +address space with some default number of threads). These calls +provide address spaces, tasks, threads, interprocess communication, +and unique identifiers. An address space is a set of mappings. +Multiple threads may share mappings, a thread may grants mappings to +another thread (through IPC). Task is the set of threads sharing an +address space. + +

A thread is the execution abstraction; it belongs to an address space and has a UID, a register set, a page fault handler, and an exception handler. The UID of a thread is its task number plus the number of the thread within that task. + +

IPC passes data by value or by reference to another address space. It also provides for sequence coordination. It is used for communication between clients and servers, to pass interrupts to a user-level exception handler, and to pass page faults to an external pager. In L4, device drivers are implemented as user-level processes with the device mapped into their address space. Linux runs as a user-level process. + +

L4 provides quite a range of message types: inline-by-value, strings, and virtual memory mappings. The send and receive descriptors specify how many of each, if any. + +

In addition, there is a system call for timeouts and controlling thread scheduling. + +

L3/L4 paper discussion

+ + +Why must the parent directory be locked? If two processes try to +create the same name in the same directory, only one should succeed +and the other one, should receive an error (file exist). + +

Link, unlink, chdir, mount, and umount could have taken file descriptors instead of their path arguments. In fact, this would get rid of some possible race conditions (some of which have security implications, e.g., TOCTTOU). However, this would require that the current working directory be remembered by the process, and UNIX didn't have good ways of maintaining static state shared among all processes belonging to a given user. The easiest way to create shared state is to place it in the kernel. + +

We have one piece of code in xv6 that we haven't studied: exec. + With all the ground work we have done this code can be easily + understood (see sheet 54). + + diff --git a/web/l-okws.txt b/web/l-okws.txt new file mode 100644 index 0000000..fa940d0 --- /dev/null +++ b/web/l-okws.txt @@ -0,0 +1,249 @@ + +Security +------------------- +I. 2 Intro Examples +II. Security Overview +III. Server Security: Offense + Defense +IV. Unix Security + POLP +V. Example: OKWS +VI. How to Build a Website + +I. Intro Examples +-------------------- +1. Apache + OpenSSL 0.9.6a (CAN 2002-0656) + - SSL = More security! + + unsigned int j; + p=(unsigned char *)s->init_buf->data; + j= *(p++); + s->session->session_id_length=j; + memcpy(s->session->session_id,p,j); + + - the result: an Apache worm + +2. SparkNotes.com 2000: + - New profile feature that displays "public" information about users + but bug that made e-mail addresses "public" by default. + - New program for getting that data: + + http://www.sparknotes.com/getprofile.cgi?id=1343 + +II. Security Overview +---------------------- + +What Is Security? + - Protecting your system from attack. + + What's an attack? + - Stealing data + - Corrupting data + - Controlling resources + - DOS + + Why attack? + - Money + - Blackmail / extortion + - Vendetta + - intellectual curiosity + - fame + +Security is a Big topic + + - Server security -- today's focus. There's some machine sitting on the + Internet somewhere, with a certain interface exposed, and attackers + want to circumvent it. + - Why should you trust your software? + + - Client security + - Clients are usually servers, so they have many of the same issues. + - Slight simplification: people across the network cannot typically + initiate connections. + - Has a "fallible operator": + - Spyware + - Drive-by-Downloads + + - Client security turns out to be much harder -- GUI considerations, + look inside the browser and the applications. 
+ - Systems community can more easily handle server security. + - We think mainly of servers. + +III. Server Security: Offense and Defense +----------------------------------------- + - Show picture of a Web site. + + Attacks | Defense +---------------------------------------------------------------------------- + 1. Break into DB from net | 1. FW it off + 2. Break into WS on telnet | 2. FW it off + 3. Buffer overrun in Apache | 3. Patch apache / use better lang? + 4. Buffer overrun in our code | 4. Use better lang / isolate it + 5. SQL injection | 5. Better escaping / don't interpret code. + 6. Data scraping. | 6. Use a sparse UID space. + 7. PW sniffing | 7. ??? + 8. Fetch /etc/passwd and crack | 8. Don't expose /etc/passwd + PW | + 9. Root escalation from apache | 9. No setuid programs available to Apache +10. XSS |10. Filter JS and input HTML code. +11. Keystroke recorded on sys- |11. Client security + admin's desktop (planetlab) | +12. DDOS |12. ??? + +Summary: + - That we want private data to be available to right people makes + this problem hard in the first place. Internet servers are there + for a reason. + - Security != "just encrypt your data;" this in fact can sometimes + make the problem worse. + - Best to prevent break-ins from happening in the first place. + - If they do happen, want to limit their damage (POLP). + - Security policies are difficult to express / package up neatly. + +IV. Design According to POLP (in Unix) +--------------------------------------- + - Assume any piece of a system can be compromised, by either bad + programming or malicious attack. + - Try to limit the damage done by such a compromise (along the lines + of the 4 attack goals). + + + +What's the goal on Unix? + - Keep processes from communicating that don't have to: + - limit FS, IPC, signals, ptrace + - Strip away unneeded privilege + - with respect to network, FS. + - Strip away FS access. + +How on Unix? 
+ - setuid/setgid + - system call interposition + - chroot (away from setuid executables, /etc/passwd, /etc/ssh/..) + + + +How do you write chroot'ed programs? + - What about shared libraries? + - /etc/resolv.conf? + - Can chroot'ed programs access the FS at all? What if they need + to write to the FS or read from the FS? + - Fd's are *capabilities*; can pass them to chroot'ed services, + thereby opening new files on its behalf. + - Unforgeable - can only get them from the kernel via open/socket, etc. + +Unix Shortcomings (round 1) + - It's bad to run as root! + - Yet, need root for: + - chroot + - setuid/setgid to a lower-privileged user + - create a new user ID + - Still no guarantee that we've cut off all channels + - 200 syscalls! + - Default is to give most/all privileges. + - Can "break out" of chroot jails? + - Can still exploit race conditions in the kernel to escalate privileges. + +Sidebar + - setuid / setuid misunderstanding + - root / root misunderstanding + - effective vs. real vs. saved set-user-ID + +V. OKWS +------- +- Taking these principles as far as possible. +- C.f. Figure 1 From the paper.. +- Discussion of which privileges are in which processes + + + +- Technical details: how to launch a new service +- Within the launcher (running as root): + + + + // receive FDs from logger, pubd, demux + fork (); + chroot ("/var/okws/run"); + chdir ("/coredumps/51001"); + setgid (51001); + setuid (51001); + exec ("login", fds ... ); + +- Note no chroot -- why not? +- Once launched, how does a service get new connections? +- Note the goal - minimum tampering with each other in the + case of a compromise. + +Shortcoming of Unix (2) +- A lot of plumbing involved with this system. FDs flying everywhere. +- Isolation still not fine enough. If a service gets taken over, + can compromise all users of that service. + +VI. Reflections on Building Websites +--------------------------------- +- OKWS interesting "experiment" +- Need for speed; also, good gzip support. 
- If you need compiled code, it's a good way to go.
+- RPC-like system a must for backend communication
+- Connection-pooling for free
+
+Biggest difficulties:
+- Finding good C++ programmers.
+- Compile times.
+- The DB is still always the problem.
+
+Hard to find good alternatives
+- Python / Perl - you might spend a lot of time writing C code /
+  integrating with lower level languages.
+- Have to worry about DB pooling.
+- Java -- most viable, and is getting better. Scary that you can't peer
+  inside.
+- .Net / C#-based system might be the way to go.
+
+
+=======================================================================
+
+Extra Material:
+
+Capabilities (From the Eros Paper in SOSP 1999)
+
+ - "Unforgeable pair made up of an object ID and a set of authorized
+   operations (an interface) on that object."
+ - Cf. Dennis and van Horn. "Programming semantics for multiprogrammed
+   computations," Communications of the ACM 9(3):143-154, Mar 1966.
+ - Thus:
+
+ - Examples:
+     "Process X can write to file at inode Y"
+     "Process P can read from file at inode Z"
+ - Familiar example: Unix file descriptors
+
+ - Why are they secure?
+   - Capabilities are "unforgeable"
+   - Processes can get them only through authorized interfaces
+   - Capabilities are only given to processes authorized to hold them
+
+ - How do you get them?
+   - From the kernel (e.g., open)
+   - From other applications (e.g., FD passing)
+
+ - How do you use them?
+   - read (fd), write(fd).
+
+ - How do you revoke them once granted?
+   - In Unix, you do not.
+   - In some systems, a central authority ("reference monitor") can revoke.
+
+ - How do you store them persistently?
+   - Can have circular dependencies (unlike an FS).
+   - What happens when the system starts up?
+   - Revert to checkpointed state.
+   - Often capability systems chose a single-level store.
+
+
+ - Capability systems, a historical perspective:
+   - KeyKOS, Eros, Coyotos (UP research)
+     - Never saw any applications
+   - IBM Systems (System 38, later AS/400, later 'i Series')
+     - Commercially viable
+ - Problems:
+   - All bets are off when a capability is sent to the wrong place.
+   - Firewall analogy?
diff --git a/web/l-plan9.html b/web/l-plan9.html
new file mode 100644
index 0000000..a3af3d5
--- /dev/null
+++ b/web/l-plan9.html
@@ -0,0 +1,249 @@
+
+
+Plan 9
+
+
+

Plan 9

+ +

Required reading: Plan 9 from Bell Labs

+ +

Background

+ +

Computing had moved away from the ``one computing system'' model of
+Multics and Unix.

+ +

Many computers (`workstations'), self-maintained, not a coherent whole.

+ +

Pike and Thompson had been batting around ideas about a system glued together +by a single protocol as early as 1984. +Various small experiments involving individual pieces (file server, OS, computer) +tried throughout 1980s.

+ +

Ordered the hardware for the ``real thing'' in beginning of 1989, +built up WORM file server, kernel, throughout that year.

+ +

Some time in early fall 1989, Pike and Thompson were
+trying to figure out a way to fit the window system in.
+On the way home from dinner, both independently realized that
+they needed to be able to mount a user-space file descriptor,
+not just a network address.

+ +

Around Thanksgiving 1989, spent a few days rethinking the whole +thing, added bind, new mount, flush, and spent a weekend +making everything work again. The protocol at that point was +essentially identical to the 9P in the paper.

+ +

In May 1990, tried to make the system self-hosting.
+The file server kept breaking; had to keep rewriting the window system.
+A dozen or so users by then, mostly using terminal windows to
+connect to Unix.

+ +

Paper written and submitted to UKUUG in July 1990.

+ +

Because it was an entirely new system, could take the +time to fix problems as they arose, in the right place.

+ + +

Design Principles

+ +

Three design principles:

+ +

+1. Everything is a file.
+2. There is a standard protocol for accessing files.
+3. Private, malleable name spaces (bind, mount). +

+ +

Everything is a file.

+ +

Everything is a file (more everything than Unix: networks, graphics).

+ +
+% ls -l /net
+% lp /dev/screen
+% cat /mnt/wsys/1/text
+
+ +

Standard protocol for accessing files

+ +

9P is the only protocol the kernel knows: other protocols +(NFS, disk file systems, etc.) are provided by user-level translators.

+ +

Only one protocol, so easy to write filters and other +converters. Iostats puts itself between the kernel +and a command.

+ +
+% iostats -xvdfdf /bin/ls
+
+ +

Private, malleable name spaces

+ +

Each process has its own private name space that it +can customize at will. +(Full disclosure: can arrange groups of +processes to run in a shared name space. Otherwise how do +you implement mount and bind?)

+ +

Iostats remounts the root of the name space +with its own filter service.

+ +

The window system mounts a file system that it serves +on /mnt/wsys.

+ +

The network is actually a kernel device (no 9P involved) +but it still serves a file interface that other programs +use to access the network. +Easy to move out to user space (or replace) if necessary: +import network from another machine.

+ +

Implications

+ +

Everything is a file + can share files => can share everything.

+ +

Per-process name spaces help move toward ``each process has its own +private machine.''

+ +

One protocol: easy to build custom filters to add functionality +(e.g., reestablishing broken network connections). + +

File representation for networks, graphics, etc.

+ +

Unix sockets are file descriptors, but you can't use the +usual file operations on them. Also far too much detail that +the user doesn't care about.

+ +

In Plan 9: +

dial("tcp!plan9.bell-labs.com!http");
+
+(Protocol-independent!)

+ +

Dial more or less does:
+write to /net/cs: tcp!plan9.bell-labs.com!http +read back: /net/tcp/clone 204.178.31.2!80 +write to /net/tcp/clone: connect 204.178.31.2!80 +read connection number: 4 +open /net/tcp/4/data +

+ +

Details don't really matter. Two important points: +protocol-independent, and ordinary file operations +(open, read, write).
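The sequence can be sketched in C; the /net paths come from the listing
above, and the message-formatting helper is illustrative, not a real
Plan 9 library routine:

```c
#include <stdio.h>
#include <string.h>

// Sketch of dial()'s steps on Plan 9, where the connection server and
// protocol stacks are just files:
//   write "tcp!plan9.bell-labs.com!http" to /net/cs
//   read back "/net/tcp/clone 204.178.31.2!80"
//   write "connect 204.178.31.2!80" to the clone file
//   read the connection number N, then open /net/tcp/N/data
// and use ordinary read/write on the data file from then on.
//
// The helper below only formats the control message for step 3.
int make_connect_msg(char *buf, size_t n, const char *addr) {
    return snprintf(buf, n, "connect %s", addr);
}
```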

+ +

Networks can be shared just like any other files.

+ +

Similar story for graphics, other resources.

+ +

Conventions

+ +

Per-process name spaces mean that even full path names are ambiguous +(/bin/cat means different things on different machines, +or even for different users).

+ +

Convention binds everything together. +On a 386, bind /386/bin /bin. + +

In Plan 9, always know where the resource should be +(e.g., /net, /dev, /proc, etc.), +but not which one is there.

+ +

Can break conventions: on a 386, bind /alpha/bin /bin, just won't +have usable binaries in /bin anymore.

+ +

Object-oriented in the sense of having objects (files) that all +present the same interface and can be substituted for one another +to arrange the system in different ways.

+ +

Very little ``type-checking'': bind /net /proc; ps. +Great benefit (generality) but must be careful (no safety nets).

+ + +

Other Contributions

+ +

Portability

+ +

Plan 9 still is the most portable operating system. +Not much machine-dependent code, no fancy features +tied to one machine's MMU, multiprocessor from the start (1989).

+ +

Many other systems are still struggling with converting to SMPs.

+ +

Has run on MIPS, Motorola 68000, Nextstation, Sparc, x86, PowerPC, Alpha, others.

+ +

All the world is not an x86.

+ +

Alef

+ +

New programming language: convenient, but difficult to maintain. +Retired when author (Winterbottom) stopped working on Plan 9.

+ +

Good ideas transferred to C library plus conventions.

+ +

All the world is not C.

+ +

UTF-8

+ +

Thompson invented UTF-8. Pike and Thompson +converted Plan 9 to use it over the first weekend of September 1992, +in time for X/Open to choose it as the Unicode standard byte format +at a meeting the next week.

+ +

UTF-8 is now the standard character encoding for Unicode on +all systems and interoperating between systems.

+ +

Simple, easy to modify base for experiments

+ +

Whole system source code is available, simple, easy to +understand and change. +There's a reason it only took a couple days to convert to UTF-8.

+ +
+  49343  file server kernel
+
+ 181611  main kernel
+  78521    ipaq port (small kernel)
+  20027      TCP/IP stack
+  15365      ipaq-specific code
+  43129      portable code
+
+1326778  total lines of source code
+
+ +

Dump file system

+ +

The snapshot idea might well have been ``in the air'' at the time.
+(OldFiles in AFS appears to be independently derived,
+and use of WORM media was a common research topic.)

+ +

Generalized Fork

+ +

Picked up by other systems: FreeBSD, Linux.

+ +

Authentication

+ +

No global super-user. +Newer, more Plan 9-like authentication described in later paper.

+ +

New Compilers

+ +

Much faster than gcc, simpler.

+ +

8s to build acme for Linux using gcc; 1s to build acme for Plan 9 using 8c (but running on Linux)

+ +

IL Protocol

+ +

Now retired. +For better or worse, TCP has all the installed base. +IL didn't work very well on asymmetric or high-latency links +(e.g., cable modems).

+ +

Idea propagation

+ +

Many ideas have propagated out to varying degrees.

+ +

Linux even has bind and user-level file servers now (FUSE), +but still not per-process name spaces.

+ + + diff --git a/web/l-scalablecoord.html b/web/l-scalablecoord.html new file mode 100644 index 0000000..da72c37 --- /dev/null +++ b/web/l-scalablecoord.html @@ -0,0 +1,202 @@ +Scalable coordination + + + + + +

Scalable coordination

+ +

Required reading: Mellor-Crummey and Scott, Algorithms for Scalable + Synchronization on Shared-Memory Multiprocessors, TOCS, Feb 1991. + +

Overview

+ +

Shared memory machines are a bunch of CPUs sharing physical memory.
+Typically each processor also maintains a cache (for performance),
+which introduces the problem of keeping caches coherent. If processor 1
+writes a memory location whose value processor 2 has cached, then
+processor 2's cache must be updated in some way. How?

    + +
  • Bus-based schemes. Any CPU can access any memory
+equally ("dance hall" architecture). Use "snoopy" protocols: each CPU's cache
+listens to the memory bus. With a write-through architecture, invalidate the
+copy when a write is seen. Or can have an "ownership" scheme with a write-back
+cache (e.g., Pentium caches have MESI bits---modified, exclusive,
+shared, invalid). If the E bit is set, the CPU caches the line exclusively
+and can do write back. But the bus places limits on scalability.

  • More scalability w. NUMA schemes (non-uniform memory access). Each +CPU comes with fast "close" memory. Slower to access memory that is +stored with another processor. Use a directory to keep track of who is +caching what. For example, processor 0 is responsible for all memory +starting with address "000", processor 1 is responsible for all memory +starting with "001", etc. + +
  • COMA - cache-only memory architecture. Each CPU has local RAM, +treated as cache. Cache lines migrate around to different nodes based +on access pattern. Data only lives in cache, no permanent memory +location. (These machines aren't too popular any more.) + +
+ + +

Scalable locks

+ +

This paper is about cost and scalability of locking; what if you +have 10 CPUs waiting for the same lock? For example, what would +happen if xv6 runs on an SMP with many processors? + +

What's the cost of a simple spinning acquire/release? Algorithm 1 +*without* the delays, which is like xv6's implementation of acquire +and release (xv6 uses XCHG instead of test_and_set): +

+  each of the 10 CPUs gets the lock in turn
+  meanwhile, remaining CPUs in XCHG on lock
+  lock must be X in cache to run XCHG
+    otherwise all might read, then all might write
+  so bus is busy all the time with XCHGs!
+  can we avoid constant XCHGs while lock is held?
+
+ +
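The simple spinning acquire/release just described can be sketched with
C11 atomics (a simplification of xv6's XCHG-based code, not its actual
implementation):

```c
#include <stdatomic.h>

struct spinlock { atomic_int locked; };

// Simple spinning acquire: every iteration is an atomic
// read-modify-write, so each waiting CPU keeps the lock's cache line
// bouncing on the bus, even while the lock is held.
void acquire(struct spinlock *l) {
    while (atomic_exchange(&l->locked, 1) != 0)
        ;  // spin: exchange until we observe the lock was free
}

void release(struct spinlock *l) {
    atomic_store(&l->locked, 0);
}
```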

test-and-test-and-set +

+  only run expensive TSL if not locked
+  spin on ordinary load instruction, so cache line is S
+  acquire(l)
+    while(1){
+      while(l->locked != 0) { }
+      if(TSL(&l->locked) == 0)
+        return;
+    }
+
+ +

suppose 10 CPUs are waiting, let's count cost in total bus + transactions +

+  CPU1 gets lock in one cycle
+    sets lock's cache line to I in other CPUs
+  9 CPUs each use bus once in XCHG
+    then everyone has the line S, so they spin locally
+  CPU1 releases the lock
+  CPU2 gets the lock in one cycle
+  8 CPUs each use bus once...
+  So 10 + 9 + 8 + ... = 50 transactions, O(n^2) in # of CPUs!
+  Look at "test-and-test-and-set" in Figure 6
+
+

Can we have n CPUs acquire a lock in O(n) time? + +

What is the point of the exponential backoff in Algorithm 1? +

+  Does it buy us O(n) time for n acquires?
+  Is there anything wrong with it?
+  may not be fair
+  exponential backoff may increase delay after release
+
+ +
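Algorithm 1's delay can be sketched like this (MAX_DELAY and the
busy-wait loop are illustrative, not the paper's code):

```c
#include <stdatomic.h>

#define MAX_DELAY 1024

struct spinlock { atomic_int locked; };

// Test-and-set with exponential backoff. A failed attempt doubles the
// private wait before retrying, keeping waiters off the bus -- at the
// cost of fairness and of possible overshoot after a release.
void acquire_backoff(struct spinlock *l) {
    unsigned delay = 1;
    while (atomic_exchange(&l->locked, 1) != 0) {
        for (volatile unsigned i = 0; i < delay; i++)
            ;                       // back off: spin privately, off the bus
        if (delay < MAX_DELAY)
            delay *= 2;             // exponential growth, capped
    }
}
```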

What's the point of the ticket locks, Algorithm 2? +

+  one interlocked instruction to get my ticket number
+  then I spin on now_serving with ordinary load
+  release() just increments now_serving
+
+ +
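A ticket lock in miniature, sketched with C11 atomics (names are
illustrative, not the paper's code):

```c
#include <stdatomic.h>

struct ticketlock {
    atomic_uint next_ticket;   // next ticket to hand out
    atomic_uint now_serving;   // ticket currently allowed in
};

// One interlocked instruction to take a ticket, then spin with
// ordinary loads until our number comes up.
void ticket_acquire(struct ticketlock *l) {
    unsigned me = atomic_fetch_add(&l->next_ticket, 1);
    while (atomic_load(&l->now_serving) != me)
        ;  // spin reading a shared location (stays S in our cache)
}

void ticket_release(struct ticketlock *l) {
    atomic_fetch_add(&l->now_serving, 1);  // admit exactly the next ticket
}
```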

why is that good? +

+  + fair
+  + no exponential backoff overshoot
+  + no spinning on 
+
+ +

but what's the cost, in bus transactions? +

+  while lock is held, now_serving is S in all caches
+  release makes it I in all caches
+  then each waiter uses a bus transaction to get the new value
+  so still O(n^2)
+
+ +

What's the point of the array-based queuing locks, Algorithm 3? +

+    a lock has an array of "slots"
+    waiter allocates a slot, spins on that slot
+    release wakes up just next slot
+  so O(n) bus transactions to get through n waiters: good!
+  anderson lines in Figure 4 and 6 are flat-ish
+    they only go up because lock data structures protected by simpler lock
+  but O(n) space *per lock*!
+
+ +

Algorithm 5 (MCS), the new algorithm of the paper, uses +compare_and_swap: +

+int compare_and_swap(addr, v1, v2) {
+  int ret = 0;
+  // stop all memory activity and ignore interrupts
+  if (*addr == v1) {
+    *addr = v2;
+    ret = 1;
+  }
+  // resume other memory activity and take interrupts
+  return ret;
+}
+
+ +

What's the point of the MCS lock, Algorithm 5? +

+  constant space per lock, rather than O(n)
+  one "qnode" per thread, used for whatever lock it's waiting for
+  lock holder's qnode points to start of list
+  lock variable points to end of list
+  acquire adds your qnode to end of list
+    then you spin on your own qnode
+  release wakes up next qnode
+
+ +
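An MCS sketch in C11 atomics, simplified from the paper's pseudocode;
it assumes each thread passes in its own qnode:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

// Each thread supplies a qnode and spins only on that node; the lock
// word is a pointer to the tail of the queue of waiters.
struct qnode {
    struct qnode *_Atomic next;
    atomic_bool locked;
};

struct mcslock {
    struct qnode *_Atomic tail;
};

void mcs_acquire(struct mcslock *l, struct qnode *me) {
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    struct qnode *prev = atomic_exchange(&l->tail, me);  // join end of queue
    if (prev != NULL) {
        atomic_store(&prev->next, me);   // link in so release can find us
        while (atomic_load(&me->locked))
            ;                            // spin on our own qnode only
    }
}

void mcs_release(struct mcslock *l, struct qnode *me) {
    struct qnode *succ = atomic_load(&me->next);
    if (succ == NULL) {
        struct qnode *expected = me;
        if (atomic_compare_exchange_strong(&l->tail, &expected, NULL))
            return;                      // no waiters: lock is now free
        while ((succ = atomic_load(&me->next)) == NULL)
            ;                            // a waiter is mid-enqueue; wait for link
    }
    atomic_store(&succ->locked, false);  // hand the lock to the next waiter
}
```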

Wait-free or non-blocking data structures

+ +

The previous implementations all block threads when there is
+ contention for a lock. Other atomic hardware operations allow one
+ to build wait-free data structures. For example, one can implement
+ an insert of an element into a shared list that doesn't block a
+ thread. Such versions are called wait-free.

A linked list with locks is as follows: +

+Lock list_lock;
+
+insert(int x) {
+  element *n = new Element;
+  n->x = x;
+
+  acquire(&list_lock);
+  n->next = list;
+  list = n;
+  release(&list_lock);
+}
+
+ +

A wait-free implementation is as follows: +

+insert (int x) {
+  element *n = new Element;
+  n->x = x;
+  do {
+     n->next = list;
+  } while (compare_and_swap (&list, n->next, n) == 0);
+}
+
+
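A runnable version of the sketch above, using C11's compare-and-exchange
in place of the hypothetical compare_and_swap():

```c
#include <stdatomic.h>
#include <stdlib.h>

struct element {
    int x;
    struct element *next;
};

struct element *_Atomic list = NULL;

// Wait-free-style push onto a shared list head: snapshot the head,
// then publish the new node only if the head hasn't changed.
void insert(int x) {
    struct element *n = malloc(sizeof *n);
    n->x = x;
    do {
        n->next = atomic_load(&list);  // snapshot the current head
        // retry if another CPU changed the head since our snapshot
    } while (!atomic_compare_exchange_weak(&list, &n->next, n));
}
```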

How many bus transactions with 10 CPUs inserting one element in the +list? Could you do better? + +

This + paper by Fraser and Harris compares lock-based implementations + versus corresponding non-blocking implementations of a number of data + structures. + +

It is not possible to make every operation wait-free, and there are
+ times we will need an implementation of acquire and release.
+ Research on non-blocking data structures is active; the last word
+ on this topic hasn't been said yet.

Scheduling

+ +

Required reading: Eliminating receive livelock + +

Notes based on Prof. Morris's lecture on scheduling (6.824, Fall '02).

Overview

+ +
    + +
  • What is scheduling? The OS policies and mechanisms to allocate
+resources to entities. A good scheduling policy ensures that the most
+important entity gets the resources it needs. This topic was
+popular in the days of time sharing, when there was a shortage of
+resources. It seemed irrelevant in the era of PCs and workstations, when
+resources were plentiful. Now the topic is back from the dead to handle
+massive Internet servers with paying customers. The Internet exposes
+web sites to international abuse and overload, which can lead to
+resource shortages. Furthermore, some customers are more important
+than others (e.g., the ones that buy a lot).

  • Key problems: +
      +
    • Gap between desired policy and available mechanism. The desired
+policies often include elements that are not implementable with the
+mechanisms available to the operating system. Furthermore, often
+there are many conflicting goals (low latency, high throughput, and
+fairness), and the scheduler must make a trade-off between the goals.

    • Interaction between different schedulers. One has to take a
+systems view. Just optimizing the CPU scheduler may do little for
+the overall desired policy.
    + +
  • Resources you might want to schedule: CPU time, physical memory, +disk and network I/O, and I/O bus bandwidth. + +
  • Entities that you might want to give resources to: users, +processes, threads, web requests, or MIT accounts. + +
  • Many polices for resource to entity allocation are possible: +strict priority, divide equally, shortest job first, minimum guarantee +combined with admission control. + +
  • General plan for scheduling mechanisms +
      +
    1. Understand where scheduling is occurring.
    2. Expose scheduling decisions, allow control. +
    3. Account for resource consumption, to allow intelligent control. +
    + +
  • Simple example from 6.828 kernel. The policy for scheduling +environments is to give each one equal CPU time. The mechanism used to +implement this policy is a clock interrupt every 10 msec and then +selecting the next environment in a round-robin fashion. + +

    But this only works if processes are compute-bound. What if a +process gives up some of its 10 ms to wait for input? Do we have to +keep track of that and give it back? + +

    How long should the quantum be? Is 10 msec the right answer?
+A shorter quantum will lead to better interactive performance, but
+lowers overall system throughput because we will reschedule more,
+which has overhead.

    What if the environment computes for 1 msec and sends an IPC to +the file server environment? Shouldn't the file server get more CPU +time because it operates on behalf of all other functions? + +

    Potential improvements for the 6.828 kernel: track "recent" CPU use
+(e.g., over the last second) and always run the environment with the least
+recent CPU use. (Still, if you sleep long enough you lose.) Another
+solution: directed yield; specify on the yield to which environment
+you are donating the remainder of the quantum (e.g., to the file
+server so that it can compute on the environment's behalf).

  • Pitfall: Priority Inversion +
    +  Assume policy is strict priority.
    +  Thread T1: low priority.
    +  Thread T2: medium priority.
    +  Thread T3: high priority.
    +  T1: acquire(l)
    +  context switch to T3
    +  T3: acquire(l)... must wait for T1 to release(l)...
    +  context switch to T2
    +  T2 computes for a while
    +  T3 is indefinitely delayed despite high priority.
    +  Can solve if T3 lends its priority to holder of lock it is waiting for.
    +    So T1 runs, not T2.
    +  [this is really a multiple scheduler problem.]
    +  [since locks schedule access to locked resource.]
    +
    + +
  • Pitfall: Efficiency. Efficiency often conflicts with fairness (or +any other policy). Long time quantum for efficiency in CPU scheduling +versus low delay. Shortest seek versus FIFO disk scheduling. +Contiguous read-ahead vs data needed now. For example, scheduler +swaps out my idle emacs to let gcc run faster with more phys mem. +What happens when I type a key? These don't fit well into a "who gets +to go next" scheduler framework. Inefficient scheduling may make +everybody slower, including high priority users. + +
  • Pitfall: Multiple Interacting Schedulers. Suppose you want your +emacs to have priority over everything else. Give it high CPU +priority. Does that mean nothing else will run if emacs wants to run? +Disk scheduler might not know to favor emacs's disk I/Os. Typical +UNIX disk scheduler favors disk efficiency, not process prio. Suppose +emacs needs more memory. Other processes have dirty pages; emacs must +wait. Does disk scheduler know these other processes' writes are high +prio? + +
  • Pitfall: Server Processes. Suppose emacs uses X windows to +display. The X server must serve requests from many clients. Does it +know that emacs' requests should be given priority? Does the OS know +to raise X's priority when it is serving emacs? Similarly for DNS, +and NFS. Does the network know to give emacs' NFS requests priority? + +
+ +

In short, scheduling is a system problem. There are many +schedulers; they interact. The CPU scheduler is usually the easy +part. The hardest part is system structure. For example, the +existence of interrupts is bad for scheduling. Conflicting +goals may limit effectiveness. + +

Case study: modern UNIX

+ +

Goals: +

    +
  • Simplicity (e.g. avoid complex locking regimes). +
  • Quick response to device interrupts. +
  • Favor interactive response. +
+ +

UNIX has a number of execution environments. We care about
+scheduling transitions among them. Some transitions aren't possible,
+and some can't be controlled. The execution environments are:

    +
  • Process, user half +
  • Process, kernel half +
  • Soft interrupts: timer, network +
  • Device interrupts +
+ +

The rules are: +

    +
  • User is pre-emptible. +
  • Kernel half and software interrupts are not pre-emptible. +
  • Device handlers may not make blocking calls (e.g., sleep) +
  • Effective priorities: intr > soft intr > kernel half > user +
+ + + +

Rules are implemented as follows: + +

    + +
  • UNIX: Process User Half. Runs in process address space, on +per-process stack. Interruptible. Pre-emptible: interrupt may cause +context switch. We don't trust user processes to yield CPU. +Voluntarily enters kernel half via system calls and faults. + +
  • UNIX: Process Kernel Half. Runs in kernel address space, on a
+per-process kernel stack. Executes system calls and faults for its
+process. Interruptible (but can defer interrupts in critical
+sections). Not pre-emptible. Only yields voluntarily, when waiting
+for an event, e.g., disk I/O completion. This simplifies concurrency
+control; locks often are not required. No user process runs if any kernel
+half wants to run. Many processes' kernel halves may be sleeping in the
+kernel.

  • UNIX: Device Interrupts. Hardware raises an interrupt to ask the CPU
+for attention: a disk read/write completed, or a network packet arrived.
+Runs in kernel space, on a special interrupt stack. An interrupt routine
+cannot block; it must return. Interrupts are interruptible. They nest
+on the one interrupt stack. Interrupts are not pre-emptible, and
+cannot really yield. The real-time clock is a device and interrupts
+every 10ms (or whatever). Process scheduling decisions can be made
+when an interrupt returns (e.g., wake up the process waiting for this
+event). You want interrupt processing to be fast, since it has
+priority. Don't do any more work than you have to; you're blocking
+processes and other interrupts. Typically, an interrupt does the
+minimal work necessary to keep the device happy, and then calls wakeup
+on a thread.

  • UNIX: Soft Interrupts. (Didn't exist in xv6.) Used when device
+handling is expensive but there is no obvious process context in which to
+run. Examples include IP forwarding and TCP input processing. Runs in
+kernel space, on the interrupt stack. Interruptible. Not pre-emptible;
+can't really yield. Triggered by a hardware interrupt, and called when the
+outermost hardware interrupt returns. Periodic scheduling decisions
+are made in the timer soft interrupt, scheduled by the hardware timer
+interrupt (i.e., if the current process has run long enough, switch).
+ +

Is this good software structure? Let's talk about receive +livelock. + +

Paper discussion

+ +
    + +
  • What is the application that the paper is addressing? IP forwarding.
+What functionality does a network interface offer to the driver?
      +
    • Read packets +
    • Poke hardware to send packets +
    • Interrupts when packet received/transmit complete +
    • Buffer many input packets +
    + +
  • Which devices in the 6.828 kernel are interrupt driven? Which ones
+are polling? Is this ideal?
  • Explain Figure 6-1. Why does it go up? What determines how high
+the peak is? Why does it go down? What determines how fast it goes
+down? Answer:
    +(fraction of packets discarded)(work invested in discarded packets)
    +           -------------------------------------------
    +              (total work CPU is capable of)
    +
    + +
  • Suppose I wanted to test an NFS server for livelock. +
    +  Run client with this loop:
    +    while(1){
    +      send NFS READ RPC;
    +      wait for response;
    +    }
    +
    +What would I see? Is the NFS server probably subject to livelock? +(No--offered load subject to feedback). + +
  • What other problems are we trying to address? +
      +
    • Increased latency for packet delivery and forwarding (e.g., start +disk head moving when first NFS read request comes) +
    • Transmit starvation +
    • User-level CPU starvation +
    + +
  • Why not tell the O/S scheduler to give interrupts lower priority? +Non-preemptible. +Could you fix this by making interrupts faster? (Maybe, if coupled +with some limit on input rate.) + +
  • Why not completely process each packet in the interrupt handler
+(i.e., forward it)? Other parts of the kernel don't expect to run at high
+interrupt level (e.g., some packet processing code might invoke a function
+that sleeps). Still might want an output queue.
  • What about using polling instead of interrupts? Solves overload +problem, but killer for latency. + +
  • What's the paper's solution? +
      +
    • No IP input queue. +
    • Input processing and device input polling in kernel thread. +
    • Device receive interrupt just wakes up thread. And leaves +interrupts *disabled* for that device. +
    • Thread does all input processing, then re-enables interrupts. +
    +

    Why does this work? What happens when packets arrive too fast? +What happens when packets arrive slowly? + +

  • Explain Figure 6-3. +
      +
    • Why does "Polling (no quota)" work badly? (Input still starves +xmit complete processing.) +
    • Why does it immediately fall to zero, rather than gradually decreasing? +(xmit complete processing must be very cheap compared to input.) +
    + +
  • Explain Figure 6-4. +
      + +
    • Why does "Polling, no feedback" behave badly? There's a queue in +front of screend. We can still give 100% to input thread, 0% to +screend. + +
    • Why does "Polling w/ feedback" behave well? Input thread yields +when queue to screend fills. + +
    • What if screend hangs? What about other consumers of packets? (E.g., can you ssh to the machine to fix screend?) Fortunately screend is typically the only application. Also, re-enable input after a timeout.
    + +
  • Why are the two solutions different? +
      +
    1. Polling thread with quotas. +
    2. Feedback from full queue. +
    +(I believe they should have used #2 for both.) + +
  • If we apply the proposed fixes, does the phenomenon totally go away? (E.g., for a web server that waits for disk, &c.)
      +
    • Can the net device throw away packets without slowing down host? +
    • Problem: we want to drop packets for applications with big queues, but that requires work to determine which application a packet belongs to. Solution: NI-LRP (have the network interface sort packets).
    + +
  • What about latency question? (Look at figure 14 p. 243.) +
      +
    • The 1st packet looks like an improvement over non-polling. But the 2nd packet is transmitted later with polling. Why? (No new packets are added to the xmit buffer until the xmit interrupt.)
    • Why is it done that way? In traditional BSD, to amortize the cost of poking the device. Maybe better to poke a second time anyway.
    + +
  • What if processing has more complex structure? +
      +
    • Chain of processing stages with queues? Does feedback work? + What happens when a late stage is slow? +
    • Split at some point, multiple parallel paths? Not so great; one slow path blocks all paths.
    + +
  • Can we formulate any general principles from paper? +
      +
    • Don't spend time on new work before completing existing work. +
    • Or give new work lower priority than partially-completed work. +
    + +
diff --git a/web/l-threads.html b/web/l-threads.html new file mode 100644 index 0000000..8587abb --- /dev/null +++ b/web/l-threads.html @@ -0,0 +1,316 @@ + + + + + +

Threads, processes, and context switching

+ +

Required reading: proc.c (focus on scheduler() and sched()), +setjmp.S, and sys_fork (in sysproc.c) + +

Overview

+ + +

Big picture: more programs than processors. How to share the +limited number of processors among the programs? + +

Observation: most programs don't need the processor continuously, +because they frequently have to wait for input (from user, disk, +network, etc.) + +

Idea: when one program must wait, it releases the processor, and +gives it to another program. + +

Mechanism: a thread of computation, an active computation. A thread is an abstraction that contains the minimal state necessary to stop an active computation and resume it at some point later. What that state is depends on the processor. On the x86, it is the processor registers (see setjmp.S).

Address spaces and threads: address spaces and threads are in principle independent concepts. One can switch from one thread to another thread in the same address space, or to a thread in another address space. Example: in xv6, one switches address spaces by switching segmentation registers (see setupsegs). Does xv6 ever switch from one thread to another in the same address space? (Answer: yes; xv6 switches, for example, from the scheduler, proc[0], to the kernel part of init, proc[1].) In the JOS kernel we switch from the kernel thread to a user thread, but we don't necessarily switch the kernel address space.

Process: one address space plus one or more threads of computation. +In xv6 all user programs contain one thread of computation and +one address space, and the concepts of address space and threads of +computation are not separated but bundled together in the concept of a +process. When switching from the kernel program (which has multiple +threads) to a user program, xv6 switches threads (switching from a +kernel stack to a user stack) and address spaces (the hardware uses +the kernel segment registers and the user segment registers). + +

xv6 supports the following operations on processes: +

    +
  • fork: create a new process, which is a copy of the parent.
  • exec: execute a program.
  • exit: terminate a process.
  • wait: wait for a child process to terminate.
  • kill: kill a process.
  • sbrk: grow the address space of a process.
+This interface doesn't separate threads and address spaces. For example, with this interface one cannot create additional threads in the same address space. Modern Unixes provide additional primitives (pthreads, POSIX threads) to create additional threads in a process and coordinate their activities.

Scheduling. The thread manager needs a method for deciding which +thread to run if multiple threads are runnable. The xv6 policy is to +run the processes round robin. Why round robin? What other methods +can you imagine? + +

Preemptive scheduling. To force a thread to release the processor periodically (in case the thread never calls sleep), a thread manager can use preemptive scheduling. The thread manager uses the clock chip to generate a periodic hardware interrupt, which transfers control to the thread manager, which can then decide to run another thread (e.g., see trap.c).

xv6 code examples

+ +

Thread switching is implemented in xv6 using setjmp and longjmp, which take a jmpbuf as an argument. setjmp saves its context in the jmpbuf for later use by longjmp. longjmp restores the context saved by the last setjmp, and then causes execution to continue as if the call to setjmp had just returned 1.

    +
  • setjmp saves: ebx, ecx, edx, esi, edi, esp, ebp, and eip.
  • longjmp restores them, and puts 1 in eax! +
+ +

Example of thread switching: proc[0] switches to scheduler: +

    +
  • 1359: proc[0] calls iget, which calls sleep, which calls sched. +
  • 2261: The stack before the call to setjmp in sched is: +
    +CPU 0:
    +eax: 0x10a144   1089860
    +ecx: 0x6c65746e 1818588270
    +edx: 0x0        0
    +ebx: 0x10a0e0   1089760
    +esp: 0x210ea8   2166440
    +ebp: 0x210ebc   2166460
    +esi: 0x107f20   1081120
    +edi: 0x107740   1079104
    +eip: 0x1023c9  
    +eflags 0x12      
    +cs:  0x8       
    +ss:  0x10      
    +ds:  0x10      
    +es:  0x10      
    +fs:  0x10      
    +gs:  0x10      
    +   00210ea8 [00210ea8]  10111e
    +   00210eac [00210eac]  210ebc
    +   00210eb0 [00210eb0]  10239e
    +   00210eb4 [00210eb4]  0001
    +   00210eb8 [00210eb8]  10a0e0
    +   00210ebc [00210ebc]  210edc
    +   00210ec0 [00210ec0]  1024ce
    +   00210ec4 [00210ec4]  1010101
    +   00210ec8 [00210ec8]  1010101
    +   00210ecc [00210ecc]  1010101
    +   00210ed0 [00210ed0]  107740
    +   00210ed4 [00210ed4]  0001
    +   00210ed8 [00210ed8]  10cd74
    +   00210edc [00210edc]  210f1c
    +   00210ee0 [00210ee0]  100bbc
    +   00210ee4 [00210ee4]  107740
    +
    +
  • 2517: stack at beginning of setjmp: +
    +CPU 0:
    +eax: 0x10a144   1089860
    +ecx: 0x6c65746e 1818588270
    +edx: 0x0        0
    +ebx: 0x10a0e0   1089760
    +esp: 0x210ea0   2166432
    +ebp: 0x210ebc   2166460
    +esi: 0x107f20   1081120
    +edi: 0x107740   1079104
    +eip: 0x102848  
    +eflags 0x12      
    +cs:  0x8       
    +ss:  0x10      
    +ds:  0x10      
    +es:  0x10      
    +fs:  0x10      
    +gs:  0x10      
    +   00210ea0 [00210ea0]  1023cf   <--- return address (sched)
    +   00210ea4 [00210ea4]  10a144
    +   00210ea8 [00210ea8]  10111e
    +   00210eac [00210eac]  210ebc
    +   00210eb0 [00210eb0]  10239e
    +   00210eb4 [00210eb4]  0001
    +   00210eb8 [00210eb8]  10a0e0
    +   00210ebc [00210ebc]  210edc
    +   00210ec0 [00210ec0]  1024ce
    +   00210ec4 [00210ec4]  1010101
    +   00210ec8 [00210ec8]  1010101
    +   00210ecc [00210ecc]  1010101
    +   00210ed0 [00210ed0]  107740
    +   00210ed4 [00210ed4]  0001
    +   00210ed8 [00210ed8]  10cd74
    +   00210edc [00210edc]  210f1c
    +
    +
  • 2519: What is saved in jmpbuf of proc[0]? +
  • 2529: return 0! +
  • 2534: What is in jmpbuf of cpu 0? The stack is as follows: +
    +CPU 0:
    +eax: 0x0        0
    +ecx: 0x6c65746e 1818588270
    +edx: 0x108aa4   1084068
    +ebx: 0x10a0e0   1089760
    +esp: 0x210ea0   2166432
    +ebp: 0x210ebc   2166460
    +esi: 0x107f20   1081120
    +edi: 0x107740   1079104
    +eip: 0x10286e  
    +eflags 0x46      
    +cs:  0x8       
    +ss:  0x10      
    +ds:  0x10      
    +es:  0x10      
    +fs:  0x10      
    +gs:  0x10      
    +   00210ea0 [00210ea0]  1023fe
    +   00210ea4 [00210ea4]  108aa4
    +   00210ea8 [00210ea8]  10111e
    +   00210eac [00210eac]  210ebc
    +   00210eb0 [00210eb0]  10239e
    +   00210eb4 [00210eb4]  0001
    +   00210eb8 [00210eb8]  10a0e0
    +   00210ebc [00210ebc]  210edc
    +   00210ec0 [00210ec0]  1024ce
    +   00210ec4 [00210ec4]  1010101
    +   00210ec8 [00210ec8]  1010101
    +   00210ecc [00210ecc]  1010101
    +   00210ed0 [00210ed0]  107740
    +   00210ed4 [00210ed4]  0001
    +   00210ed8 [00210ed8]  10cd74
    +   00210edc [00210edc]  210f1c
    +
    +
  • 2547: return 1! stack looks as follows: +
    +CPU 0:
    +eax: 0x1        1
    +ecx: 0x108aa0   1084064
    +edx: 0x108aa4   1084068
    +ebx: 0x10074    65652
    +esp: 0x108d40   1084736
    +ebp: 0x108d5c   1084764
    +esi: 0x10074    65652
    +edi: 0xffde     65502
    +eip: 0x102892  
    +eflags 0x6       
    +cs:  0x8       
    +ss:  0x10      
    +ds:  0x10      
    +es:  0x10      
    +fs:  0x10      
    +gs:  0x10      
    +   00108d40 [00108d40]  10231c
    +   00108d44 [00108d44]  10a144
    +   00108d48 [00108d48]  0010
    +   00108d4c [00108d4c]  0021
    +   00108d50 [00108d50]  0000
    +   00108d54 [00108d54]  0000
    +   00108d58 [00108d58]  10a0e0
    +   00108d5c [00108d5c]  0000
    +   00108d60 [00108d60]  0001
    +   00108d64 [00108d64]  0000
    +   00108d68 [00108d68]  0000
    +   00108d6c [00108d6c]  0000
    +   00108d70 [00108d70]  0000
    +   00108d74 [00108d74]  0000
    +   00108d78 [00108d78]  0000
    +   00108d7c [00108d7c]  0000
    +
    +
  • 2548: where will longjmp return? (answer: 10231c, in scheduler) +
  • 2233: The scheduler on each processor selects, in round-robin fashion, the first runnable process. Which process will that be (if we are running with one processor)? (Ans: proc[0].)
  • 2229: what will be saved in cpu's jmpbuf? +
  • What is in proc[0]'s jmpbuf? +
  • 2548: return 1. Stack looks as follows: +
    +CPU 0:
    +eax: 0x1        1
    +ecx: 0x6c65746e 1818588270
    +edx: 0x0        0
    +ebx: 0x10a0e0   1089760
    +esp: 0x210ea0   2166432
    +ebp: 0x210ebc   2166460
    +esi: 0x107f20   1081120
    +edi: 0x107740   1079104
    +eip: 0x102892  
    +eflags 0x2       
    +cs:  0x8       
    +ss:  0x10      
    +ds:  0x10      
    +es:  0x10      
    +fs:  0x10      
    +gs:  0x10      
    +   00210ea0 [00210ea0]  1023cf   <--- return to sleep
    +   00210ea4 [00210ea4]  108aa4
    +   00210ea8 [00210ea8]  10111e
    +   00210eac [00210eac]  210ebc
    +   00210eb0 [00210eb0]  10239e
    +   00210eb4 [00210eb4]  0001
    +   00210eb8 [00210eb8]  10a0e0
    +   00210ebc [00210ebc]  210edc
    +   00210ec0 [00210ec0]  1024ce
    +   00210ec4 [00210ec4]  1010101
    +   00210ec8 [00210ec8]  1010101
    +   00210ecc [00210ecc]  1010101
    +   00210ed0 [00210ed0]  107740
    +   00210ed4 [00210ed4]  0001
    +   00210ed8 [00210ed8]  10cd74
    +   00210edc [00210edc]  210f1c
    +
    +
+ +

Why switch from proc[0] to the processor stack, and then back to proc[0]'s stack? Why not instead run the scheduler on the kernel stack of the last process that ran on that CPU?

    + +
  • If the scheduler wanted to use the process stack, then it couldn't + have any stack variables live across process scheduling, since + they'd be different depending on which process just stopped running. + +
  • Suppose process p goes to sleep on CPU1, so CPU1 is idling in + scheduler() on p's stack. Someone wakes up p. CPU2 decides to run + p. Now p is running on its stack, and CPU1 is also running on the + same stack. They will likely scribble on each others' local + variables, return pointers, etc. + +
  • The same thing happens if CPU1 tries to reuse the process's page +tables to avoid a TLB flush. If the process gets killed and cleaned +up by the other CPU, now the page tables are wrong. I think some OSes +actually do this (with appropriate ref counting). + +
+ +

How is preemptive scheduling implemented in xv6? Answer: see trap.c lines 2905 through 2917, and the implementation of yield() on sheet 22.

How long is a timeslice for a user process? (possibly very short; + very important lock is held across context switch!) + + + + + diff --git a/web/l-vm.html b/web/l-vm.html new file mode 100644 index 0000000..ffce13e --- /dev/null +++ b/web/l-vm.html @@ -0,0 +1,462 @@ + + +Virtual Machines + + + + +

Virtual Machines

+ +

Required reading: Disco

+ +

Overview

+ +

What is a virtual machine? IBM definition: a fully protected and +isolated copy of the underlying machine's hardware.

+ +

Another view is that it provides another example of a kernel API. In contrast to other kernel APIs (Unix, microkernel, and exokernel), a virtual machine operating system exports the processor API (e.g., the x86 interface) as the kernel API. Thus, each program running in user space sees the services offered by a processor, and each program sees its own processor. Of course, we don't want to make a system call for each instruction, and in fact one of the main challenges in virtual machine operating systems is to design the system in such a way that the physical processor executes the virtual processor API directly, at processor speed.

+Virtual machines can be useful for a number of reasons: +

    + +
  1. Run multiple operating systems on a single piece of hardware. For example, in one process you run Linux, and in another you run Windows/XP. If the kernel API is identical to the x86 (and faithfully emulates x86 instructions, state, protection levels, and page tables), then the virtual machine operating system can run guest operating systems such as Linux and Windows/XP without modifications.
      +
    • Run "older" programs on the same hardware (e.g., run one x86 +virtual machine in real mode to execute old DOS apps). + +
    • Or run applications that require different operating system. +
    + +
  2. Fault isolation: like processes on UNIX but more complete, because the guest operating system runs on the virtual machine in user space. Thus, faults in the guest OS cannot affect any other software.
  3. Customizing the apparent hardware: virtual machine may have +different view of hardware than is physically present. + +
  4. Simplify deployment/development of software for scalable +processors (e.g., Disco). + +
+

+ +

If your operating system isn't a virtual machine operating system, +what are the alternatives? Processor simulation (e.g., bochs) or +binary emulation (WINE). Simulation runs instructions purely in +software and is slow (e.g., 100x slow down for bochs); virtualization +gets out of the way whenever possible and can be efficient. + +

Simulation gives portability, whereas virtualization focuses on performance. However, this means that you need to model your hardware very carefully in software. Binary emulation focuses on just getting the system-call interface of a particular operating system right. Binary emulation can be hard because it is targeted at a particular operating system (and even that can change between revisions).

+ +

To provide each process with its own virtual processor that exports +the same API as the physical processor, what features must +the virtual machine operating system virtualize? +

    +
  1. CPU: instructions -- trap all privileged instructions.
  2. Memory: address spaces -- map "physical" pages managed by the guest OS to machine pages, handle translation, etc.
  3. Devices: any I/O communication needs to be trapped and passed through/handled appropriately.
+

+The software that implements the virtualization is typically called +the monitor, instead of the virtual machine operating system. + +

Virtual machine monitors (VMM) can be implemented in two ways: +

    +
  1. Run the VMM directly on hardware, like Disco.
  2. Run the VMM as an application (though still running as root, with integration into the OS) on top of a host OS, like VMware. This provides additional hardware support at low development cost in the VMM; CPU-level I/O requests are intercepted and translated into system calls (e.g., read()).
+

+ +

The three primary functions of a virtual machine monitor are: +

    +
  • virtualize processor (CPU, memory, and devices) +
  • dispatch events (e.g., forward page fault trap to guest OS). +
  • allocate resources (e.g., divide real memory in some way between +the physical memory of each guest OS). +
+ +

Virtualization in detail

+ +

Memory virtualization

+ +

+Understanding memory virtualization. Let's consider the MIPS example +from the paper. Ideally, we'd be able to intercept and rewrite all +memory address references. (e.g., by intercepting virtual memory +calls). Why can't we do this on the MIPS? (There are addresses that +don't go through address translation --- but we don't want the virtual +machine to directly access memory!) What does Disco do to get around +this problem? (Relink the kernel outside this address space.) +

+ +

+Having gotten around that problem, how do we handle things in general? +

+
+// Disco's tlb miss handler.
+// Called when a memory reference for virtual adddress
+// 'VA' is made, but there is not VA->MA (virtual -> machine)
+// mapping in the cpu's TLB.
+void tlb_miss_handler (VA)
+{
+  // see if we have a mapping in our "shadow" tlb (which includes
+  // "main" tlb)
+  tlb_entry *t = tlb_lookup (thiscpu->l2tlb, va);
+  if (t && defined (thiscpu->pmap[t->pa]))   // is there a MA for this PA?
+    tlbwrite (va, thiscpu->pmap[t->pa], t->otherdata);
+  else if (t)
+    // get a machine page, copy physical page into, and tlbwrite
+  else
+    // trap to the virtual CPU/OS's handler
+}
+
+// Disco's procedure which emulates the MIPS
+// instruction which writes to the tlb.
+//
+// VA -- virtual addresss
+// PA -- physical address (NOT MA machine address!)
+// otherdata -- perms and stuff
+void emulate_tlbwrite_instruction (VA, PA, otherdata)
+{
+  tlb_insert (thiscpu->l2tlb, VA, PA, otherdata); // cache
+  if (!defined (thiscpu->pmap[PA])) { // fill in pmap dynamically
+    MA = allocate_machine_page ();
+    thiscpu->pmap[PA] = MA; // See 4.2.2
+    thiscpu->pmapbackmap[MA] = PA;
+    thiscpu->memmap[MA] = VA; // See 4.2.3 (for TLB shootdowns)
+  }
+  tlbwrite (va, thiscpu->pmap[PA], otherdata);
+}
+
+// Disco's procedure which emulates the MIPS
+// instruction which read the tlb.
+tlb_entry *emulate_tlbread_instruction (VA)
+{
+  // Must return a TLB entry that has a "Physical" address;
+  // This is recorded in our secondary TLB cache.
+  // (We don't have to read from the hardware TLB since
+  // all writes to the hardware TLB are mediated by Disco.
+  // Thus we can always keep the l2tlb up to date.)
+  return tlb_lookup (thiscpu->l2tlb, va);
+}
+
+ +

CPU virtualization

+ +

Requirements: +

    +
  1. Results of executing non-privileged instructions in privileged and + user mode must be equivalent. (Why? B/c the virtual "privileged" + system will not be running in true "privileged" mode.) +
  2. There must be a way to protect the VM from the real machine. (Some + sort of memory protection/address translation. For fault isolation.)
  3. There must be a way to detect and transfer control to the VMM when the VM tries to execute a sensitive instruction (e.g., a privileged instruction, or one that could expose the "virtualness" of the VM). It must be possible to emulate these instructions in software. Architectures can be classified as completely virtualizable (there are protection mechanisms that cause traps for all such instructions), partly virtualizable (insufficient or incomplete trap mechanisms), or not at all virtualizable (e.g., no MMU).
+

+ +

The MIPS didn't quite meet the second criterion, as discussed above. But it does have a supervisor mode, between user mode and kernel mode, in which any privileged instruction will trap.

+ +

What might the VMM trap handler look like?

+
+void privilege_trap_handler (addr) {
+  instruction, args = decode_instruction (addr)
+  switch (instruction) {
+  case foo:
+    emulate_foo (thiscpu, args, ...);
+    break;
+  case bar:
+    emulate_bar (thiscpu, args, ...);
+    break;
+  case ...:
+    ...
+  }
+}
+
+

The emulate_foo bits will have to evaluate the state of the virtual CPU and compute the appropriate "fake" answer.

+ +

What sort of state is needed in order to appropriately emulate all +of these things? +

+- all user registers
+- CPU specific regs (e.g. on x86, %crN, debugging, FP...)
+- page tables (or tlb)
+- interrupt tables
+
+This is needed for each virtual processor. +

+ +

Device I/O virtualization

+ +

We intercept all communication to the I/O devices: read/writes to +reserved memory addresses cause page faults into special handlers +which will emulate or pass through I/O as appropriate. +

+ +

+In a system like Disco, the sequence would look something like: +

    +
  1. The VM executes an instruction to access I/O.
  2. A trap generated by the CPU (based on memory or privilege protection) transfers control to the VMM.
  3. The VMM emulates the I/O instruction, saving information about where it came from (for demultiplexing the asynchronous reply from the hardware later).
  4. The VMM reschedules a VM.
+

+ +

+Interrupts will require some additional work: +

    +
  1. An interrupt occurs on the real machine, transferring control to the VMM handler.
  2. The VMM determines the VM that ought to receive this interrupt.
  3. The VMM causes a simulated interrupt to occur in the VM, and reschedules a VM.
  4. The VM runs its interrupt handler, which may involve other I/O instructions that need to be trapped.
+

+ +

+The above can be slow! So sometimes you want the guest operating system to be aware that it is a guest, and allow it to avoid the slow path: special device drivers, or changing instructions that would cause traps into memory read/write instructions.

+ +

Intel x86/vmware

+ +

VMware, unlike Disco, runs as an application on a host OS and cannot modify the host OS. Furthermore, it must virtualize the x86 instead of the MIPS processor. Both of these differences make for good design challenges.

The first challenge is that the monitor runs in user space, yet it must dispatch traps and execute privileged instructions, both of which require kernel privileges. To address this challenge, the monitor loads a piece of code, a kernel module, into the host OS. Most modern operating systems are constructed as a core kernel, extended with loadable kernel modules; privileged users can insert kernel modules at run-time.

The monitor's kernel module reads the IDT, copies it, and overwrites the hard-wired entries with addresses of stubs in the just-loaded kernel module. When a trap happens, the kernel module inspects the PC, and either forwards the trap to the monitor running in user space or to the host OS. If the trap was caused by a guest OS executing a privileged instruction, the monitor can emulate that privileged instruction by asking the kernel module to perform it (perhaps after modifying the arguments to the instruction).

The second challenge is virtualizing the x86 instructions. Unfortunately, the x86 doesn't meet the requirements for CPU virtualization: some instructions that must be virtualized do not trap. If you run the CPU in ring 3, most x86 instructions will be fine, because most privileged instructions will result in a trap, which can then be forwarded to VMware for emulation. For example, consider a guest OS loading the root of a page table into CR3. This results in a trap (the guest OS runs in user space), which is forwarded to the monitor, which can emulate the load to CR3 as follows:

+// addr is a physical address
+void emulate_lcr3 (thiscpu, addr)
+{
+  thiscpu->cr3 = addr;
+  Pte *fakepdir = lookup (addr, oldcr3cache);
+  if (!fakepdir) {
+    fakedir = ppage_alloc ();
+    store (oldcr3cache, addr, fakedir);
+    // May wish to scan through supplied page directory to see if
+    // we have to fix up anything in particular.
+    // Exact settings will depend on how we want to handle
+    // problem cases below and our own MM.
+  }
+  asm ("movl fakepdir,%cr3");
+  // Must make sure our page fault handler is in sync with what we do here.
+}
+
+ +

To virtualize the x86, the monitor must intercept any modifications to the page table and substitute appropriate responses, and update things like the accessed/dirty bits. The monitor can arrange for this to happen by making all page-table pages inaccessible, so that it can emulate loads and stores to page-table pages. This setup allows the monitor to virtualize the memory interface of the x86.

+ +

Unfortunately, not all instructions that must be virtualized result +in traps: +

    +
  • pushf/popf: FL_IF is handled differently, for example. In user mode, setting FL_IF is just ignored.
  • Anything (push, pop, mov) that reads or writes %cs, which contains the privilege level.
  • Setting the interrupt-enable bit in EFLAGS has different semantics in user space and kernel space. In user space, it is ignored; in kernel space, the bit is set.
  • And some others... (17 instructions in total).
+These are unprivileged instructions (i.e., they don't cause a trap when executed by a guest OS), but they expose physical processor state. They could reveal details of the virtualization that should not be revealed. For example, if the guest OS sets the interrupt-enable bit for its virtual x86, the virtualized EFLAGS should reflect that the bit is set, even though the guest OS is running in user space.

How can we virtualize these instructions? An approach is to decode +the instruction stream that is provided by the user and look for bad +instructions. When we find them, replace them with an interrupt +(INT 3) that will allow the VMM to handle it +correctly. This might look something like: +

+ +
+void initcode () {
+  scan_for_nonvirtualizable (thiscpu, 0x7c00);
+}
+
+void scan_for_nonvirtualizable (thiscpu, startaddr) {
+  addr  = startaddr;
+  instr = disassemble (addr);
+  while (instr is not branch or bad) {
+    addr += len (instr);
+    instr = disassemble (addr);
+  }
+  // remember that we wanted to execute this instruction.
+  replace (addr, "int 3");
+  record (thiscpu->rewrites, addr, instr);
+}
+
+void breakpoint_handler (tf) {
+  oldinstr = lookup (thiscpu->rewrites, tf->eip);
+  if (oldinstr is branch) {
+    newcs:neweip = evaluate branch
+    scan_for_nonvirtualizable (thiscpu, newcs:neweip)
+    return;
+  } else { // something non virtualizable
+    // dispatch to appropriate emulation
+  }
+}
+
+

All pages must be scanned in this way. Fortunately, most pages +probably are okay and don't really need any special handling so after +scanning them once, we can just remember that the page is okay and let +it run natively. +

+ +

What if a guest OS generates instructions, writes them to memory, and then wants to execute them? We must detect self-modifying code (e.g., we must simulate buffer overflow attacks correctly). When a write occurs to a physical page that happens to be in a code segment, we must trap the write and then rescan the affected portions of the page.

+ +

What about self-examining code? We need to protect it somehow, possibly by playing tricks with instruction/data TLB caches, or by introducing a private segment for code (%cs) that is different from the segment used for reads/writes (%ds).

+ +

Some Disco paper notes

+ +

+Disco has some I/O specific optimizations. +

+
    +
  • Disk reads only need to happen once and can be shared between virtual machines via copy-on-write virtual memory tricks.
  • Network cards do not need to be fully virtualized --- intra-VM communication doesn't need a real network card backing it.
  • Special handling for NFS so that all VMs "share" a buffer cache.
+ +

+Disco developers clearly had access to IRIX source code. +

+
    +
  • Need to deal with the KSEG0 segment of MIPS memory by relinking the kernel at a different address.
  • Ensuring page alignment of network writes (for the purposes of doing memory-map tricks).
+ +

Performance?

+
    +
  • Evaluated in simulation.
  • Where are the overheads? Where do they come from?
  • Does it run better than NUMA IRIX?
+ +

Premise. Are virtual machines the preferred approach to extending operating systems? Have scalable multiprocessors materialized?

+ +

Related papers

+ +

John Scott Robin, Cynthia E. Irvine. Analysis of the +Intel Pentium's Ability to Support a Secure Virtual Machine +Monitor.

+ +

Jeremy Sugerman, Ganesh Venkitachalam, Beng-Hong Lim. Virtualizing +I/O Devices on VMware Workstation's Hosted Virtual Machine +Monitor. In Proceedings of the 2001 Usenix Technical Conference.

+ +

Kevin Lawton, Drew Northup. Plex86 Virtual +Machine.

+ +

Xen +and the Art of Virtualization, Paul Barham, Boris +Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf +Neugebauer, Ian Pratt, Andrew Warfield, SOSP 2003

+ +

A comparison of software and hardware techniques for x86 virtualization, Keith Adams and Ole Agesen, ASPLOS 2006

+ + + + + diff --git a/web/l-xfi.html b/web/l-xfi.html new file mode 100644 index 0000000..41ba434 --- /dev/null +++ b/web/l-xfi.html @@ -0,0 +1,246 @@ + + +XFI + + + +

XFI

+ +

Required reading: XFI: software guards for system address spaces. + +

Introduction

+ +

Problem: how to use untrusted code (an "extension") in a trusted +program? +

    +
  • Use untrusted jpeg codec in Web browser +
  • Use an untrusted driver in the kernel +
+ +

What are the dangers? +

    +
  • No fault isolation: the extension modifies trusted code unintentionally
  • No protection: extension causes a security hole +
      +
    • Extension has a buffer overrun problem +
    • Extension calls trusted program's functions +
    • Extension calls a trusted program's function that it is allowed to call, but supplies "bad" arguments
    • Extension executes privileged hardware instructions (when extending the kernel)
    • Extension reads data out of the trusted program that it shouldn't.
    +
+ +

Possible solutions approaches: +

    + +
  • Run extension in its own address space with minimal + privileges. Rely on hardware and operating system protection + mechanism. + +
  • Restrict the language in which the extension is written: +
      + +
    • Packet filter language. Language is limited in its capabilities, + and it is easy to guarantee "safe" execution. + +
    • Type-safe language. Language runtime and compiler guarantee "safe" +execution. +
    + +
  • Software-based sandboxing. + +
+ +

Software-based sandboxing

+ +

Sandboxer. A compiler or binary-rewriter sandboxes all unsafe + instructions in an extension by inserting additional instructions. + For example, every indirect store is preceded by a few instructions + that compute and check the target of the store at runtime. + +

Verifier. When the extension is loaded into the trusted program, the + verifier checks if the extension is appropriately sandboxed (e.g., + are all indirect stores sandboxed? does it call any privileged + instructions?). If not, the extension is rejected. If yes, the + extension is loaded and can run. When the extension runs, the + instructions that sandbox unsafe instructions check that each unsafe + instruction is used in a safe way. + 

The verifier must be trusted, but the sandboxer need not be. We can do + without the verifier if the trusted program can establish that the + extension has been sandboxed by a trusted sandboxer. + 

The paper refers to this setup as an instance of proof-carrying code. + 

Software fault isolation

+ +

SFI +by Wahbe et al. explored how to use sandboxing for fault-isolating +extensions; that is, using sandboxing to ensure that stores and jumps +stay within a specified memory range (i.e., they don't overwrite or +jump into addresses in the trusted program unchecked). They +implemented SFI for a RISC processor, which simplifies things since +memory can be written only by store instructions (other instructions +modify registers). In addition, they assumed that there were plenty +of registers, so that a few could be dedicated to sandboxing code. + 

The extension is loaded into a specific range (called a segment) + within the trusted application's address space. The segment is + identified by the upper bits of the addresses in the + segment. Separate code and data segments are necessary to prevent an + extension from overwriting its code. + 

An unsafe instruction on the MIPS is an instruction that jumps or + stores to an address that cannot be statically verified to be within + the correct segment. Most control-transfer operations, such as + program-counter-relative branches, can be statically verified. Stores to + static variables often use an immediate addressing mode and can be + statically verified. Indirect jumps and indirect stores are unsafe. + 

To sandbox those instructions the sandboxer could generate the + following code for each unsafe instruction: +

+  DR0 <- target address
+  R0 <- DR0 >> shift-register;  // load in R0 segment id of target
+  CMP R0, segment-register;     // compare target's segment ID to this segment's ID
+  BNE fault-isolation-error     // if not equal, branch to trusted error code
+  STORE using DR0
+
+In this code, DR0, shift-register, and segment-register +are dedicated: they cannot be used by the extension code. The +verifier must check that the extension doesn't use these registers. R0 +is a scratch register, but doesn't have to be dedicated. The +dedicated registers are necessary because otherwise the extension could +load DR0 and jump to the STORE instruction directly, skipping the +check. + 

This implementation costs 4 registers, and 4 additional instructions + for each unsafe instruction. One could do better, however: +

+  DR0 <- target address & and-mask-register // mask segment ID from target
+  DR0 <- DR0 | segment register // insert this segment's ID
+  STORE using DR0
+
+This code just sets the right segment ID bits. It doesn't catch +illegal addresses; it just forces illegal addresses to be within +the segment, harming the extension but no other code. Even if the +extension jumps to the second instruction of this sandbox sequence, +nothing bad will happen (because DR0 will already contain the correct +segment ID). + 

Optimizations include: +

    +
  • use guard zones for store value, offset(reg) +
  • treat SP as dedicated register (sandbox code that initializes it) +
  • etc. +
+ +

XFI

+ +

XFI extends SFI in several ways: +

    +
  • Handles fault isolation and protection +
  • Uses control-flow integrity (CFI) to get good performance +
  • Doesn't use dedicated registers +
  • Uses two stacks (a scoped stack and an allocation stack); only the + allocation stack can be corrupted by buffer-overrun attacks. The + scoped stack cannot be corrupted via computed memory references. +
  • Uses a binary rewriter. +
  • Works for the x86 +
+ +

The x86 is challenging because of its limited registers and variable-length + instructions. The SFI technique won't work with the x86 instruction + set. For example, if the binary contains: +

+  25 CD 80 00 00   # AND eax, 0x80CD
+
+and an adversary can arrange to jump to the second byte, then the +adversary executes a system call on Linux, because INT 0x80 has the binary +representation CD 80. Thus, XFI must control execution flow. + 

XFI policy goals: +

    +
  • Memory-access constraints (like SFI) +
  • Interface restrictions (extension has fixed entry and exit points) +
  • Scoped-stack integrity (calling stack is well formed) +
  • Simplified instructions semantics (remove dangerous instructions) +
  • System-environment integrity (ensure certain machine model + invariants, such as x86 flags register cannot be modified) +
  • Control-flow integrity: execution must follow a static, expected + control-flow graph. (enter at beginning of basic blocks) +
  • Program-data integrity (certain global variables in extension + cannot be accessed via computed memory addresses) +
+ +

The binary rewriter inserts guards to ensure these properties. The + verifier checks that the appropriate guards are in place. The primary + mechanisms used are: +

    +
  • CFI guards on computed control-flow transfers (see figure 2) +
  • Two stacks +
  • Guards on computed memory accesses (see figure 3) +
  • Module header has a section that contains access permissions for + regions +
  • Binary rewriter, which performs intra-procedure analysis, and + generates guards, code for stack use, and verification hints +
  • Verifier checks specific conditions per basic block. Hints specify + the verification state at the entry to each basic block, and at the + exit of a basic block the verifier checks that the final state implies + the verification state at entry to all possible successor basic + blocks. (see figure 4) +
+ +

Can XFI protect against the attack discussed in last lecture? +

+  unsigned int j;
+  p=(unsigned char *)s->init_buf->data;
+  j= *(p++);
+  s->session->session_id_length=j;
+  memcpy(s->session->session_id,p,j);
+
+Where will j be located? + +

How about the following one from the paper Beyond stack smashing: + recent advances in exploiting buffer overruns? +

+void f2b(void * arg, size_t len) {
+  char buf[100];
+  long val = ..;
+  long *ptr = ..;
+  extern void (*f)();
+  
+  memcpy(buf, arg, len);
+  *ptr = val;
+  f();
+  ...
+  return;
+}
+
+What code can (*f)() call? Code that the attacker inserted? +Code in libc? + +

How about an attack that uses ptr in the above code to + overwrite a method's address in a class's dispatch table with the + address of a support function? + 

How about data-only attacks? For example, attacker + overwrites pw_uid in the heap with 0 before the following + code executes (when downloading /etc/passwd and then uploading it with a + modified entry). +

+FILE *getdatasock( ... ) {
+  seteuid(0);
+  setsockopt( ... );
+  ...
+  seteuid(pw->pw_uid);
+  ...
+}
+
+ +

How much does XFI slow down applications? How many more + instructions are executed? (see Tables 1-4) + + diff --git a/web/l1.html b/web/l1.html new file mode 100644 index 0000000..9865601 --- /dev/null +++ b/web/l1.html @@ -0,0 +1,288 @@ +L1 + + + + + +

OS overview

+ +

Overview

+ +
    +
  • Goal of course: + +
      +
    • Understand operating systems in detail by designing and +implementing a minimal OS +
    • Hands-on experience with building systems ("Applying 6.033") +
    + +
  • What is an operating system? +
      +
    • a piece of software that turns the hardware into something useful +
    • layered picture: hardware, OS, applications +
    • Three main functions: fault isolate applications, abstract hardware, +manage hardware +
    + +
  • Examples: +
      +
    • OS-X, Windows, Linux, *BSD, ... (desktop, server) +
    • PalmOS Windows/CE (PDA) +
    • Symbian, JavaOS (Cell phones) +
    • VxWorks, pSOS (real-time) +
    • ... +
    + +
  • OS Abstractions +
      +
    • processes: fork, wait, exec, exit, kill, getpid, brk, nice, sleep, +trace +
    • files: open, close, read, write, lseek, stat, sync +
    • directories: mkdir, rmdir, link, unlink, mount, umount +
    • users + security: chown, chmod, getuid, setuid +
    • interprocess communication: signals, pipe +
    • networking: socket, accept, snd, recv, connect +
    • time: gettimeofday +
    • terminal: +
    + +
  • Sample Unix System calls (mostly POSIX) +
      +
    • int read(int fd, void*, int) +
    • int write(int fd, void*, int) +
    • off_t lseek(int fd, off_t, int [012]) +
    • int close(int fd) +
    • int fsync(int fd) +
    • int open(const char*, int flags [, int mode]) +
        +
      • O_RDONLY, O_WRONLY, O_RDWR, O_CREAT +
      +
    • mode_t umask(mode_t cmask) +
    • int mkdir(char *path, mode_t mode); +
    • DIR *opendir(char *dirname) +
    • struct dirent *readdir(DIR *dirp) +
    • int closedir(DIR *dirp) +
    • int chdir(char *path) +
    • int link(char *existing, char *new) +
    • int unlink(char *path) +
    • int rename(const char*, const char*) +
    • int rmdir(char *path) +
    • int stat(char *path, struct stat *buf) +
    • int mknod(char *path, mode_t mode, dev_t dev) +
    • int fork() +
        +
      • returns child PID in parent, 0 in child; the return value is the only + difference +
      +
    • int getpid() +
    • int waitpid(int pid, int* stat, int opt) +
        +
      • pid==-1: any; opt==0||WNOHANG +
      • returns pid or error +
      +
    • void _exit(int status) +
    • int kill(int pid, int signal) +
    • int sigaction(int sig, struct sigaction *, struct sigaction *) +
    • int sleep (int sec) +
    • int execve(char* prog, char** argv, char** envp) +
    • void *sbrk(int incr) +
    • int dup2(int oldfd, int newfd) +
    • int fcntl(int fd, F_SETFD, int val) +
    • int pipe(int fds[2]) +
        +
      • writes on fds[1] will be read on fds[0] +
      • when last fds[1] closed, read on fds[0] returns EOF +
      • when last fds[0] closed, write on fds[1] raises SIGPIPE/fails with + EPIPE +
      +
    • int fchown(int fd, uid_t owner, gid_t group) +
    • int fchmod(int fd, mode_t mode) +
    • int socket(int domain, int type, int protocol) +
    • int accept(int socket_fd, struct sockaddr*, int* namelen) +
        +
      • returns new fd +
      +
    • int listen(int fd, int backlog) +
    • int connect(int fd, const struct sockaddr*, int namelen) +
    • void* mmap(void* addr, size_t len, int prot, int flags, int fd, + off_t offset) +
    • int munmap(void* addr, size_t len) +
    • int gettimeofday(struct timeval*) +
    +
+ +

See the reference page for links to +the early Unix papers. + +

Class structure

+ +
    +
  • Lab: minimal OS for x86 in an exokernel style (50%) +
      +
    • kernel interface: hardware + protection +
    • libOS implements fork, exec, pipe, ... +
    • applications: file system, shell, .. +
    • development environment: gcc, bochs +
    • lab 1 is out +
    + +
  • Lecture structure (20%) +
      +
    • homework +
    • 45min lecture +
    • 45min case study +
    + +
  • Two quizzes (30%) +
      +
    • mid-term +
    • final's exam week +
    + +
+ +

Case study: the shell (simplified)

+ +
    +
  • interactive command execution and a programming language +
  • Nice example that uses various OS abstractions. See Unix +paper if you are unfamiliar with the shell. +
  • Final lab is a simple shell. +
  • Basic structure: +
    +      
    +       while (1) {
    +	    printf ("$");
    +	    readcommand (command, args);   // parse user input
    +	    if ((pid = fork ()) == 0) {  // child?
    +	       exec (command, args, 0);
    +	    } else if (pid > 0) {   // parent?
    +	       wait (0);   // wait for child to terminate
    +	    } else {
    +	       perror ("Failed to fork\n");
    +            }
    +        }
    +
    +

    The split of creating a process with a new program in fork and exec +is mostly a historical accident. See the assigned paper for today. +

  • Example: +
    +        $ ls
    +
    +
  • why call "wait"? to wait for the child to terminate and collect +its exit status. (if child finishes, child becomes a zombie until +parent calls wait.) +
  • I/O: file descriptors. Child inherits open file descriptors +from parent. By convention: +
      +
    • file descriptor 0 for input (e.g., keyboard). read_command: +
      +     read (0, buf, bufsize)
      +
      +
    • file descriptor 1 for output (e.g., terminal) +
      +     write (1, "hello\n", strlen("hello\n"))
      +
      +
    • file descriptor 2 for error (e.g., terminal) +
    +
  • How does the shell implement: +
    +     $ls > tmp1
    +
    +just before exec insert: +
    +    	   close (1);
    +	   fd = open ("tmp1", O_CREAT|O_WRONLY);   // fd will be 1!
    +
    +

    The kernel will return the first free file descriptor, 1 in this case. +

  • How does the shell implement sharing an output file: +
    +     $ls 2> tmp1 > tmp1
    +
    +replace last code with: +
    +
    +	   close (1);
    +	   close (2);
    +	   fd1 = open ("tmp1", O_CREAT|O_WRONLY);   // fd will be 1!
    +	   fd2 = dup (fd1);
    +
    +both file descriptors share offset +
  • how do programs communicate? +
    +        $ sort file.txt | uniq | wc
    +
    +or +
    +	$ sort file.txt > tmp1
    +	$ uniq tmp1 > tmp2
    +	$ wc tmp2
    +	$ rm tmp1 tmp2
    +
    +or +
    +        $ kill -9
    +
    +
  • A pipe is an one-way communication channel. Here is an example +where the parent is the writer and the child is the reader: +
    +
    +	int fdarray[2];
    +	
    +	if (pipe(fdarray) < 0) panic ("error");
    +	if ((pid = fork()) < 0) panic ("error");
    +	else if (pid > 0) {
    +	  close(fdarray[0]);
    +	  write(fdarray[1], "hello world\n", 12);
    +        } else {
    +	  close(fdarray[1]);
    +	  n = read (fdarray[0], buf, MAXBUF);
    +	  write (1, buf, n);
    +        }
    +
    +
  • How does the shell implement pipelines (i.e., cmd 1 | cmd 2 |..)? +We want to arrange that the output of cmd 1 is the input of cmd 2. +The way to achieve this goal is to manipulate stdout and stdin. +
  • The shell creates a process for each command in +the pipeline, hooks up their stdin and stdout correctly, and waits for the last process of the +pipeline to exit. A sketch of the core modifications to our shell for +setting up a pipe is: +
    	    
    +	    int fdarray[2];
    +
    +  	    if (pipe(fdarray) < 0) panic ("error");
    +	    if ((pid = fork ()) == 0) {  // child (left end of pipe)
    +	       close (1);
    +	       tmp = dup (fdarray[1]);   // fdarray[1] is the write end, tmp will be 1
    +	       close (fdarray[0]);       // close read end
    +	       close (fdarray[1]);       // close fdarray[1]
    +	       exec (command1, args1, 0);
    +	    } else if (pid > 0) {        // parent (right end of pipe)
    +	       close (0);
    +	       tmp = dup (fdarray[0]);   // fdarray[0] is the read end, tmp will be 0
    +	       close (fdarray[0]);
    +	       close (fdarray[1]);       // close write end
    +	       exec (command2, args2, 0);
    +	    } else {
    +	       printf ("Unable to fork\n");
    +            }
    +
    +
  • Why close the read end and write end? Multiple reasons: to maintain the convention that +every process starts with 3 file descriptors, and because reading from an empty +pipe blocks the reader, while reading from a pipe whose write end is closed returns end of +file. +
  • How do you background jobs? +
    +        $ compute &
    +
    +
  • How does the shell implement "&", backgrounding? (Don't call wait +immediately). +
  • More details in the shell lecture later in the term. + + + + diff --git a/web/l13.html b/web/l13.html new file mode 100644 index 0000000..af0f405 --- /dev/null +++ b/web/l13.html @@ -0,0 +1,245 @@ +High-performance File Systems + + + + + +

    High-performance File Systems

    + +

    Required reading: soft updates. + +

    Overview

    + +

    A key problem in designing file systems is how to obtain good +performance on file system operations while providing consistency. +By consistency, we mean that file system invariants are maintained +on disk. These invariants include that if a file is created, it +appears in its directory, etc. If the file system data structures are +consistent, then it is possible to rebuild the file system to a +correct state after a failure. + 

    To ensure consistency of on-disk file system data structures, + modifications to the file system must respect certain rules: +

      + +
    • Never point to a structure before it is initialized. An inode must +be initialized before a directory entry references it. A block must +be initialized before an inode references it. + 
    • Never reuse a structure before nullifying all pointers to it. An +inode pointer to a disk block must be reset before the file system can +reallocate the disk block. + +
    • Never reset the last pointer to a live structure before a new +pointer is set. When renaming a file, the file system should not +remove the old name for an inode until after the new name has been +written. +
    +The paper calls these dependencies update dependencies. + +

    xv6 ensures these rules by writing every block synchronously, and + by ordering the writes appropriately. By synchronous, we mean + that a process waits until the current disk write has been + completed before continuing with execution. + 

      + +
    • What happens if power fails after 4776 in mknod1? Did we lose the + inode forever? No, we have a separate program (called fsck), which + can rebuild the disk structures correctly and can put the inode back on + the free list. + 
    • Does the order of writes in mknod1 matter? Say, what if we wrote the + directory entry first and then wrote the allocated inode to disk? + This violates the update rules and is not a good plan. If a + failure happens after the directory write, then on recovery we have + a directory pointing to an unallocated inode, which now may be + allocated by another process for another file! + 
    • Can we turn the writes (i.e., the ones invoked by iupdate and + wdir) into delayed writes without creating problems? No, because + the cache might write them back to the disk in an incorrect order. + It has no information to decide in what order to write them. + 
    + +

    xv6 is a nice example of the tension between consistency and + performance. To get consistency, xv6 uses synchronous writes, + but these writes are slow, because they perform at the rate of a + seek instead of the maximum data transfer rate. The + bandwidth to a disk is reasonably high for large transfers (around + 50 Mbyte/s), but latency is high, because of the cost of moving the + disk arm(s) (the seek latency is about 10 msec). + 

    This tension is an implementation-dependent one. The Unix API + doesn't require that writes are synchronous. Updates don't have to + appear on disk until a sync, fsync, or open with O_SYNC. Thus, in + principle, the UNIX API allows delayed writes, which are good for + performance: +

      +
    • Batch many writes together in a big one, written at the disk data + rate. +
    • Absorb writes to the same block. +
    • Schedule writes to avoid seeks. +
    + +

    Thus the question: how to delay writes and achieve consistency? + The paper provides an answer. + +

    This paper

    + +

    The paper surveys some of the existing techniques and introduces a +new one to achieve the goal of performance and consistency. + 

    + +

    Techniques possible: +

      + +
    • Equip system with NVRAM, and put buffer cache in NVRAM. + +
    • Logging. Often used in UNIX file systems for metadata updates. +LFS is an extreme version of this strategy. + +
    • Flusher-enforced ordering. All writes are delayed. The flusher +is aware of dependencies between blocks, but ordering alone doesn't work because +circular dependencies must be broken by writing blocks out. + 
    + +

    Soft updates is the solution explored in this paper. It doesn't +require NVRAM, and performs as well as the naive strategy of keeping all +dirty blocks in main memory. Compared to logging, it is unclear whether +soft updates is better. The default BSD file system uses soft + updates, but most Linux file systems use logging. + 

    Soft updates is a sophisticated variant of flusher-enforced +ordering. Instead of maintaining dependencies at the block level, it +maintains dependencies at the file-structure level (per inode, per +directory, etc.), reducing circular dependencies. Furthermore, it +breaks any remaining circular dependencies by undoing changes before +writing a block and then redoing them after the write completes. + 

    Pseudocode for create: +

    +create (f) {
    +   allocate inode in block i  (assuming inode is available)
    +   add i to directory data block d  (assuming d has space)
    +   mark d as dependent on i, and create undo/redo record
    +   update directory inode in block di
    +   mark di as dependent on d
    +}
    +
    + +

    Pseudocode for the flusher: +

    +flushblock (b)
    +{
    +  lock b;
    +  for all dependencies that b is relying on
    +    "remove" that dependency by undoing the change to b
    +    mark the dependency as "unrolled"
    +  write b 
    +}
    +
    +write_completed (b) {
    +  remove dependencies that depend on b
    +  reapply "unrolled" dependencies that b depended on
    +  unlock b
    +}
    +
    + +

    Apply flush algorithm to example: +

      +
    • A list of two dependencies: directory->inode, inode->directory. +
    • Lets say syncer picks directory first +
    • Undo directory->inode changes (i.e., unroll ) +
    • Write directory block +
    • Remove met dependencies (i.e., remove inode->directory dependency) +
    • Perform redo operation (i.e., redo ) +
    • Select inode block and write it +
    • Remove met dependencies (i.e., remove directory->inode dependency) +
    • Select directory block (it is dirty again!) +
    • Write it. +
    + +

    A file operation that is important for file-system consistency +is rename. Rename conceptually works as follows: +

    +rename (from, to)
    +   unlink (to);
    +   link (from, to);
    +   unlink (from);
    +
    + +

    Rename is often used by programs to make a new version of a file +the current version. Committing to a new version must happen +atomically. Unfortunately, without transaction-like support, +atomicity is impossible to guarantee, so a typical file system +provides weaker semantics for rename: if to already exists, an +instance of to will always exist, even if the system should crash in +the middle of the operation. Does the above implementation of rename +guarantee these semantics? (Answer: no.) + 

    If rename is implemented as unlink, link, unlink, then it is +difficult to guarantee even the weak semantics. Modern UNIXes provide +rename as a file system call: +

    +   update dir block for to point to from's inode // write block
    +   update dir block for from to free entry // write block
    +
    +

    fsck may need to correct refcounts in the inode if the file +system fails during rename. For example, after a crash after the first +write, fsck should set the refcount to 2, since both from +and to are pointing at the inode. + 

    This semantics is sufficient, however, for an application to ensure +atomicity. Before the call, there is a from and perhaps a to. If the +call is successful, following the call there is only a to. If there +is a crash, there may be both a from and a to, in which case the +caller knows the previous attempt failed, and must retry. The +subtlety is that if you now follow the two links, the "to" name may +link to either the old file or the new file. If it links to the new +file, that means that there was a crash and you just detected that the +rename operation was composite. On the other hand, the retry +procedure can be the same for either case (do the rename again), so it +isn't necessary to discover how it failed. The function follows the +golden rule of recoverability, and it is idempotent, so it lays all +the needed groundwork for use as part of a true atomic action. + +

    With soft updates renames becomes: +

    +rename (from, to) {
    +   i = namei(from);
    +   add "to" directory data block td a reference to inode i
    +   mark td dependent on block i
    +   update directory inode "to" tdi
    +   mark tdi as dependent on td
    +   remove "from" directory data block fd a reference to inode i
    +   mark fd as dependent on tdi
    +   update directory inode in block fdi
    +   mark fdi as dependent on fd
    +}
    +
    +

    No synchronous writes! + +

    What needs to be done on recovery? (Inspect every statement in +rename and see what inconsistencies could exist on the disk; e.g., +the refcnt of the inode could be too high.) None of these inconsistencies require +fixing before the file system can operate; they can be fixed by a +background file system repairer. + 

    Paper discussion

    + +

    Do soft updates perform any useless writes? (A useless write is a +write that will be immediately overwritten.) (Answer: yes.) Fix: make the +syncer careful about which block it starts with. Fix: make cache replacement +select the LRU block with no pending dependencies. + 

    Can a log-structured file system implement rename better? (Answer: +yes, since it can get the refcnts right). + +

    Discuss all graphs. + + + diff --git a/web/l14.txt b/web/l14.txt new file mode 100644 index 0000000..d121dff --- /dev/null +++ b/web/l14.txt @@ -0,0 +1,247 @@ +Why am I lecturing about Multics? + Origin of many ideas in today's OSes + Motivated UNIX design (often in opposition) + Motivated x86 VM design + This lecture is really "how Intel intended x86 segments to be used" + +Multics background + design started in 1965 + very few interactive time-shared systems then: CTSS + design first, then implementation + system stable by 1969 + so pre-dates UNIX, which started in 1969 + ambitious, many years, many programmers, MIT+GE+BTL + +Multics high-level goals + many users on same machine: "time sharing" + perhaps commercial services sharing the machine too + remote terminal access (but no recognizable data networks: wired or phone) + persistent reliable file system + encourage interaction between users + support joint projects that share data &c + control access to data that should not be shared + +Most interesting aspect of design: memory system + idea: eliminate memory / file distinction + file i/o uses LD / ST instructions + no difference between memory and disk files + just jump to start of file to run program + enhances sharing: no more copying files to private memory + this seems like a really neat simplification! + +GE 645 physical memory system + 24-bit phys addresses + 36-bit words + so up to 75 megabytes of physical memory!!! 
+ but no-one could afford more than about a megabyte + +[per-process state] + DBR + DS, SDW (== address space) + KST + stack segment + per-segment linkage segments + +[global state] + segment content pages + per-segment page tables + per-segment branch in directory segment + AST + +645 segments (simplified for now, no paging or rings) + descriptor base register (DBR) holds phy addr of descriptor segment (DS) + DS is an array of segment descriptor words (SDW) + SDW: phys addr, length, r/w/x, present + CPU has pairs of registers: 18 bit offset, 18 bit segment # + five pairs (PC, arguments, base, linkage, stack) + early Multics limited each segment to 2^16 words + thus there are lots of them, intended to correspond to program modules + note: cannot directly address phys mem (18 vs 24) + 645 segments are a lot like the x86! + +645 paging + DBR and SDW actually contain phy addr of 64-entry page table + each page is 1024 words + PTE holds phys addr and present flag + no permission bits, so you really need to use the segments, not like JOS + no per-process page table, only per-segment + so all processes using a segment share its page table and phys storage + makes sense assuming segments tend to be shared + paging environment doesn't change on process switch + +Multics processes + each process has its own DS + Multics switches DBR on context switch + different processes typically have different number for same segment + +how to use segments to unify memory and file system? 
+ don't want to have to use 18-bit seg numbers as file names + we want to write programs using symbolic names + names should be hierarchical (for users) + so users can have directories and sub-directories + and path names + +Multics file system + tree structure, directories and files + each file and directory is a segment + dir seg holds array of "branches" + name, length, ACL, array of block #s, "active" + unique ROOT directory + path names: ROOT > A > B + note there are no inodes, thus no i-numbers + so "real name" for a file is the complete path name + o/s tables have path name where unix would have i-number + presumably makes renaming and removing active files awkward + no hard links + +how does a program refer to a different segment? + inter-segment variables contain symbolic segment name + A$E refers to segment A, variable/function E + what happens when segment B calls function A$E(1, 2, 3)? + +when compiling B: + compiler actually generates *two* segments + one holds B's instructions + one holds B's linkage information + initial linkage entry: + name of segment e.g. "A" + name of symbol e.g. 
"E" + valid flag + CALL instruction is indirect through entry i of linkage segment + compiler marks entry i invalid + [storage for strings "A" and "E" really in segment B, not linkage seg] + +when a process is executing B: + two segments in DS: B and a *copy* of B's linkage segment + CPU linkage register always points to current segment's linkage segment + call A$E is really call indirect via linkage[i] + faults because linkage[i] is invalid + o/s fault handler + looks up segment name for i ("A") + search path in file system for segment "A" (cwd, library dirs) + if not already in use by some process (branch active flag and AST knows): + allocate page table and pages + read segment A into memory + if not already in use by *this* process (KST knows): + find free SDW j in process DS, make it refer to A's page table + set up r/w/x based on process's user and file ACL + also set up copy of A's linkage segment + search A's symbol table for "E" + linkage[i] := j / address(E) + restart B + now the CALL works via linkage[i] + and subsequent calls are fast + +how does A get the correct linkage register? + the right value cannot be embedded in A, since shared among processes + so CALL actually goes to instructions in A's linkage segment + load current seg# into linkage register, jump into A + one set of these per procedure in A + +all memory / file references work this way + as if pointers were really symbolic names + segment # is really a transparent optimization + linking is "dynamic" + programs contain symbolic references + resolved only as needed -- if/when executed + code is shared among processes + was program data shared? 
+ probably most variables not shared (on stack, in private segments) + maybe a DB would share a data segment, w/ synchronization + file data: + probably one at a time (locks) for read/write + read-only is easy to share + +filesystem / segment implications + programs start slowly due to dynamic linking + creat(), unlink(), &c are outside of this model + store beyond end extends a segment (== appends to a file) + no need for buffer cache! no need to copy into user space! + but no buffer cache => ad-hoc caches e.g. active segment table + when are dirty segments written back to disk? + only in page eviction algorithm, when free pages are low + database careful ordered writes? e.g. log before data blocks? + I don't know, probably separate flush system calls + +how does shell work? + you type a program name + the shell just CALLs that program, as a segment! + dynamic linking finds program segment and any library segments it needs + the program eventually returns, e.g. with RET + all this happened inside the shell process's address space + no fork, no exec + buggy program can crash the shell! e.g. scribble on stack + process creation was too slow to give each program its own process + +how valuable is the sharing provided by segment machinery? + is it critical to users sharing information? + or is it just there to save memory and copying? + +how does the kernel fit into all this? + kernel is a bunch of code modules in segments (in file system) + a process dynamically loads in the kernel segments that it uses + so kernel segments have different numbers in different processes + a little different from separate kernel "program" in JOS or xv6 + kernel shares process's segment# address space + thus easy to interpret seg #s in system call arguments + kernel segment ACLs in file system restrict write + so mapped non-writeable into processes + +how to call the kernel? + very similar to the Intel x86 + 8 rings. users at 4. core kernel at 0. 
+ CPU knows current execution level + SDW has max read/write/execute levels + call gate: lowers ring level, but only at designated entry + stack per ring, incoming call switches stacks + inner ring can always read arguments, write results + problem: checking validity of arguments to system calls + don't want user to trick kernel into reading/writing the wrong segment + you have this problem in JOS too + later Multics CPUs had hardware to check argument references + +are Multics rings a general-purpose protected subsystem facility? + example: protected game implementation + protected so that users cannot cheat + put game's code and data in ring 3 + BUT what if I don't trust the author? + or if i've already put some other subsystem in ring 3? + a ring has full power over itself and outer rings: you must trust + today: user/kernel, server processes and IPC + pro: protection among mutually suspicious subsystems + con: no convenient sharing of address spaces + +UNIX vs Multics + UNIX was less ambitious (e.g. no unified mem/FS) + UNIX hardware was small + just a few programmers, all in the same room + evolved rather than pre-planned + quickly self-hosted, so they got experience earlier + +What did UNIX inherit from MULTICS? + a shell at user level (not built into kernel) + a single hierarchical file system, with subdirectories + controlled sharing of files + written in high level language, self-hosted development + +What did UNIX reject from MULTICS? 
+ files look like memory + instead, unifying idea is file descriptor and read()/write() + memory is a totally separate resource + dynamic linking + instead, static linking at compile time, every binary had copy of libraries + segments and sharing + instead, single linear address space per process, like xv6 + (but shared libraries brought these back, just for efficiency, in 1980s) + Hierarchical rings of protection + simpler user/kernel + for subsystems, setuid, then client/server and IPC + +The most useful sources I found for late-1960s Multics VM: + 1. Bensoussan, Clingen, Daley, "The Multics Virtual Memory: Concepts + and Design," CACM 1972 (segments, paging, naming segments, dynamic + linking). + 2. Daley and Dennis, "Virtual Memory, Processes, and Sharing in Multics," + SOSP 1967 (more details about dynamic linking and CPU). + 3. Graham, "Protection in an Information Processing Utility," + CACM 1968 (brief account of rings and gates). diff --git a/web/l19.txt b/web/l19.txt new file mode 100644 index 0000000..af9d0bb --- /dev/null +++ b/web/l19.txt @@ -0,0 +1,1412 @@ +-- front +6.828 Shells Lecture + +Hello. + +-- intro +Bourne shell + +Simplest shell: run cmd arg arg ... + fork + exec in child + wait in parent + +More functionality: + file redirection: cmd >file + open file as fd 1 in child before exec + +Still more functionality: + pipes: cmd | cmd | cmd ... + create pipe, + run first cmd with pipe on fd 1, + run second cmd with other end of pipe on fd 0 + +More Bourne arcana: + $* - command args + "$@" - unexpanded command args + environment variables + macro substitution + if, while, for + || + && + "foo $x" + 'foo $x' + `cat foo` + +-- rc +Rc Shell + + +No reparsing of input (except explicit eval). + +Variables as explicit lists. + +Explicit concatenation. + +Multiple input pipes <{cmd} - pass /dev/fd/4 as file name. + +Syntax more like C, less like Algol. 
+ +diff <{echo hi} <{echo bye} + +-- es +Es shell + + +rc++ + +Goal is to override functionality cleanly. + +Rewrite input like cmd | cmd2 as %pipe {cmd} {cmd2}. + +Users can redefine %pipe, etc. + +Need lexical scoping and let to allow new %pipe refer to old %pipe. + +Need garbage collection to collect unreachable code. + +Design principle: + minimal functionality + good defaults + allow users to customize implementations + + emacs, exokernel + +-- apps +Applications + +Shell scripts are only as good as the programs they use. + (What good are pipes without cat, grep, sort, wc, etc.?) + +The more the scripts can access, the more powerful they become. + +-- acme +Acme, Plan 9 text editor + +Make window system control files available to +everything, including shell. + +Can write shell scripts to script interactions. + +/home/rsc/bin/Slide +/home/rsc/bin/Slide- +/home/rsc/bin/Slide+ + +/usr/local/plan9/bin/adict + +win + +-- javascript +JavaScript + +Very powerful + - not because it's a great language + - because it has a great data set + - Google Maps + - Gmail + - Ymail + - etc. + +-- greasemonkey +GreaseMonkey + +// ==UserScript== +// @name Google Ring +// @namespace http://swtch.com/greasemonkey/ +// @description Changes Google Logo +// @include http://*.google.*/* +// ==/UserScript== + +(function() { + for(var i=0; i[2=1] | sed 1d | winwrite body + case 2 + dict=$2 + case 3 + dict=$2 + dict -d $dict $3 >[2=1] | winwrite body + } + winctl clean + wineventloop +} + +dict=NONE +if(~ $1 -d){ + shift + dict=$2 + shift +} +if(~ $1 -d*){ + dict=`{echo $1 | sed 's/-d//'} + shift +} +if(~ $1 -*){ + echo 'usage: adict [-d dict] [word...]' >[1=2] + exit usage +} + +switch($#*){ +case 0 + if(~ $dict NONE) + dictwin /adict/ + if not + dictwin /adict/$dict/ $dict +case * + if(~ $dict NONE){ + dict=`{dict -d'?' 
| 9 sed -n 's/^ ([^\[ ]+).*/\1/p' | sed 1q} + if(~ $#dict 0){ + echo 'no dictionaries present on this system' >[1=2] + exit nodict + } + } + for(i) + dictwin /adict/$dict/$i $dict $i +} + +-- /usr/local/plan9/lib/acme.rc +fn newwindow { + winctl=`{9p read acme/new/ctl} + winid=$winctl(1) + winctl noscroll +} + +fn winctl { + echo $* | 9p write acme/acme/$winid/ctl +} + +fn winread { + 9p read acme/acme/$winid/$1 +} + +fn winwrite { + 9p write acme/acme/$winid/$1 +} + +fn windump { + if(! ~ $1 - '') + winctl dumpdir $1 + if(! ~ $2 - '') + winctl dump $2 +} + +fn winname { + winctl name $1 +} + +fn winwriteevent { + echo $1$2$3 $4 | winwrite event +} + +fn windel { + if(~ $1 sure) + winctl delete + if not + winctl del +} + +fn wineventloop { + . <{winread event >[2]/dev/null | acmeevent} +} +-- /home/rsc/plan9/rc/bin/fedex +#!/bin/rc + +if(! ~ $#* 1) { + echo usage: fedex 123456789012 >[1=2] + exit usage +} + +rfork e + +fn bgrep{ +pattern=`{echo $1 | sed 's;/;\\&;'} +shift + +@{ echo 'X { +$ +a + +. +} +X ,x/(.+\n)+\n/ g/'$pattern'/p' | +sam -d $* >[2]/dev/null +} +} + +fn awk2 { + awk 'NR%2==1 { a=$0; } + NR%2==0 { b=$0; printf("%-30s %s\n", a, b); } + ' $* +} + +fn awk3 { + awk '{line[NR] = $0} + END{ + i = 4; + while(i < NR){ + what=line[i++]; + when=line[i]; + comment=""; + if(!(when ~ /..\/..\/.... ..:../)){ + # out of sync + printf("%s\n", what); + continue; + } + i++; + if(!(line[i+1] ~ /..\/..\/.... ..:../) && + (i+2 > NR || line[i+2] ~ /..\/..\/.... 
..:../)){ + what = what ", " line[i++]; + } + printf("%s %s\n", when, what); + } + }' $* +} + +# hget 'http://www.fedex.com/cgi-bin/track_it?airbill_list='$1'&kurrent_airbill='$1'&language=english&cntry_code=us&state=0' | +hget 'http://www.fedex.com/cgi-bin/tracking?action=track&language=english&cntry_code=us&initial=x&mps=y&tracknumbers='$1 | + htmlfmt >/tmp/fedex.$pid +sed -n '/Tracking number/,/^$/p' /tmp/fedex.$pid | awk2 +echo +sed -n '/Reference number/,/^$/p' /tmp/fedex.$pid | awk2 +echo +sed -n '/Date.time/,/^$/p' /tmp/fedex.$pid | sed 1,4d | fmt -l 4000 | sed 's/ [A-Z][A-Z] /&\n/g' +rm /tmp/fedex.$pid +-- /home/rsc/src/webscript/a3 +#!./o.webscript + +load "http://www.ups.com/WebTracking/track?loc=en_US" +find textbox "InquiryNumber1" +input "1z30557w0340175623" +find next checkbox +input "yes" +find prev form +submit +if(find "Delivery Information"){ + find outer table + print +}else if(find "One or more"){ + print +}else{ + print "Unexpected results." + find page + print +} +-- /home/rsc/src/webscript/a2 +#load "http://apc-reset/outlets.htm" +load "apc.html" +print +print "\n=============\n" +find "yoshimi" +find outer row +find next select +input "Immediate Reboot" +submit +print +-- /usr/local/plan9/acid/port +// portable acid for all architectures + +defn pfl(addr) +{ + print(pcfile(addr), ":", pcline(addr), "\n"); +} + +defn +notestk(addr) +{ + local pc, sp; + complex Ureg addr; + + pc = addr.pc\X; + sp = addr.sp\X; + + print("Note pc:", pc, " sp:", sp, " ", fmt(pc, 'a'), " "); + pfl(pc); + _stk({"PC", pc, "SP", sp, linkreg(addr)}, 1); +} + +defn +notelstk(addr) +{ + local pc, sp; + complex Ureg addr; + + pc = addr.pc\X; + sp = addr.sp\X; + + print("Note pc:", pc, " sp:", sp, " ", fmt(pc, 'a'), " "); + pfl(pc); + _stk({"PC", pc, "SP", sp, linkreg(addr)}, 1); +} + +defn params(param) +{ + while param do { + sym = head param; + print(sym[0], "=", itoa(sym[1], "%#ux")); + param = tail param; + if param then + print (","); + } +} + +stkprefix = ""; 
+stkignore = {}; +stkend = 0; + +defn locals(l) +{ + local sym; + + while l do { + sym = head l; + print(stkprefix, "\t", sym[0], "=", itoa(sym[1], "%#ux"), "\n"); + l = tail l; + } +} + +defn _stkign(frame) +{ + local file; + + file = pcfile(frame[0]); + s = stkignore; + while s do { + if regexp(head s, file) then + return 1; + s = tail s; + } + return 0; +} + +// print a stack trace +// +// in a run of leading frames in files matched by regexps in stkignore, +// only print the last one. +defn _stk(regs, dolocals) +{ + local stk, frame, pc, fn, done, callerpc, paramlist, locallist; + + stk = strace(regs); + if stkignore then { + while stk && tail stk && _stkign(head tail stk) do + stk = tail stk; + } + + callerpc = 0; + done = 0; + while stk && !done do { + frame = head stk; + stk = tail stk; + fn = frame[0]; + pc = frame[1]; + callerpc = frame[2]; + paramlist = frame[3]; + locallist = frame[4]; + + print(stkprefix, fmt(fn, 'a'), "("); + params(paramlist); + print(")"); + if pc != fn then + print("+", itoa(pc-fn, "%#ux")); + print(" "); + pfl(pc); + if dolocals then + locals(locallist); + if fn == var("threadmain") || fn == var("p9main") then + done=1; + if fn == var("threadstart") || fn == var("scheduler") then + done=1; + if callerpc == 0 then + done=1; + } + if callerpc && !done then { + print(stkprefix, fmt(callerpc, 'a'), " "); + pfl(callerpc); + } +} + +defn findsrc(file) +{ + local lst, src; + + if file[0] == '/' then { + src = file(file); + if src != {} then { + srcfiles = append srcfiles, file; + srctext = append srctext, src; + return src; + } + return {}; + } + + lst = srcpath; + while head lst do { + src = file(head lst+file); + if src != {} then { + srcfiles = append srcfiles, file; + srctext = append srctext, src; + return src; + } + lst = tail lst; + } +} + +defn line(addr) +{ + local src, file; + + file = pcfile(addr); + src = match(file, srcfiles); + + if src >= 0 then + src = srctext[src]; + else + src = findsrc(file); + + if src == {} then { + 
print("no source for ", file, "\n"); + return {}; + } + line = pcline(addr)-1; + print(file, ":", src[line], "\n"); +} + +defn addsrcdir(dir) +{ + dir = dir+"/"; + + if match(dir, srcpath) >= 0 then { + print("already in srcpath\n"); + return {}; + } + + srcpath = {dir}+srcpath; +} + +defn source() +{ + local l; + + l = srcpath; + while l do { + print(head l, "\n"); + l = tail l; + } + l = srcfiles; + + while l do { + print("\t", head l, "\n"); + l = tail l; + } +} + +defn Bsrc(addr) +{ + local lst; + + lst = srcpath; + file = pcfile(addr); + if file[0] == '/' && access(file) then { + rc("B "+file+":"+itoa(pcline(addr))); + return {}; + } + while head lst do { + name = head lst+file; + if access(name) then { + rc("B "+name+":"+itoa(pcline(addr))); + return {}; + } + lst = tail lst; + } + print("no source for ", file, "\n"); +} + +defn srcline(addr) +{ + local text, cline, line, file, src; + file = pcfile(addr); + src = match(file,srcfiles); + if (src>=0) then + src = srctext[src]; + else + src = findsrc(file); + if (src=={}) then + { + return "(no source)"; + } + return src[pcline(addr)-1]; +} + +defn src(addr) +{ + local src, file, line, cline, text; + + file = pcfile(addr); + src = match(file, srcfiles); + + if src >= 0 then + src = srctext[src]; + else + src = findsrc(file); + + if src == {} then { + print("no source for ", file, "\n"); + return {}; + } + + cline = pcline(addr)-1; + print(file, ":", cline+1, "\n"); + line = cline-5; + loop 0,10 do { + if line >= 0 then { + if line == cline then + print(">"); + else + print(" "); + text = src[line]; + if text == {} then + return {}; + print(line+1, "\t", text, "\n"); + } + line = line+1; + } +} + +defn step() // single step the process +{ + local lst, lpl, addr, bput; + + bput = 0; + if match(*PC, bplist) >= 0 then { // Sitting on a breakpoint + bput = fmt(*PC, bpfmt); + *bput = @bput; + } + + lst = follow(*PC); + + lpl = lst; + while lpl do { // place break points + *(head lpl) = bpinst; + lpl = tail lpl; + } + 
+ startstop(pid); // do the step + + while lst do { // remove the breakpoints + addr = fmt(head lst, bpfmt); + *addr = @addr; + lst = tail lst; + } + if bput != 0 then + *bput = bpinst; +} + +defn bpset(addr) // set a breakpoint +{ + if status(pid) != "Stopped" then { + print("Waiting...\n"); + stop(pid); + } + if match(addr, bplist) >= 0 then + print("breakpoint already set at ", fmt(addr, 'a'), "\n"); + else { + *fmt(addr, bpfmt) = bpinst; + bplist = append bplist, addr; + } +} + +defn bptab() // print a table of breakpoints +{ + local lst, addr; + + lst = bplist; + while lst do { + addr = head lst; + print("\t", fmt(addr, 'X'), " ", fmt(addr, 'a'), " ", fmt(addr, 'i'), "\n"); + lst = tail lst; + } +} + +defn bpdel(addr) // delete a breakpoint +{ + local n, pc, nbplist; + + if addr == 0 then { + while bplist do { + pc = head bplist; + pc = fmt(pc, bpfmt); + *pc = @pc; + bplist = tail bplist; + } + return {}; + } + + n = match(addr, bplist); + if n < 0 then { + print("no breakpoint at ", fmt(addr, 'a'), "\n"); + return {}; + } + + addr = fmt(addr, bpfmt); + *addr = @addr; + + nbplist = {}; // delete from list + while bplist do { + pc = head bplist; + if pc != addr then + nbplist = append nbplist, pc; + bplist = tail bplist; + } + bplist = nbplist; // delete from memory +} + +defn cont() // continue execution +{ + local addr; + + addr = fmt(*PC, bpfmt); + if match(addr, bplist) >= 0 then { // Sitting on a breakpoint + *addr = @addr; + step(); // Step over + *addr = bpinst; + } + startstop(pid); // Run +} + +defn stopped(pid) // called from acid when a process changes state +{ + pfixstop(pid); + pstop(pid); // stub so this is easy to replace +} + +defn procs() // print status of processes +{ + local c, lst, cpid; + + cpid = pid; + lst = proclist; + while lst do { + np = head lst; + setproc(np); + if np == cpid then + c = '>'; + else + c = ' '; + print(fmt(c, 'c'), np, ": ", status(np), " at ", fmt(*PC, 'a'), " setproc(", np, ")\n"); + lst = tail lst; + } + pid = 
cpid; + if pid != 0 then + setproc(pid); +} + +_asmlines = 30; + +defn asm(addr) +{ + local bound; + + bound = fnbound(addr); + + addr = fmt(addr, 'i'); + loop 1,_asmlines do { + print(fmt(addr, 'a'), " ", fmt(addr, 'X')); + print("\t", @addr++, "\n"); + if bound != {} && addr > bound[1] then { + lasmaddr = addr; + return {}; + } + } + lasmaddr = addr; +} + +defn casm() +{ + asm(lasmaddr); +} + +defn xasm(addr) +{ + local bound; + + bound = fnbound(addr); + + addr = fmt(addr, 'i'); + loop 1,_asmlines do { + print(fmt(addr, 'a'), " ", fmt(addr, 'X')); + print("\t", *addr++, "\n"); + if bound != {} && addr > bound[1] then { + lasmaddr = addr; + return {}; + } + } + lasmaddr = addr; +} + +defn xcasm() +{ + xasm(lasmaddr); +} + +defn win() +{ + local npid, estr; + + bplist = {}; + notes = {}; + + estr = "/sys/lib/acid/window '0 0 600 400' "+textfile; + if progargs != "" then + estr = estr+" "+progargs; + + npid = rc(estr); + npid = atoi(npid); + if npid == 0 then + error("win failed to create process"); + + setproc(npid); + stopped(npid); +} + +defn win2() +{ + local npid, estr; + + bplist = {}; + notes = {}; + + estr = "/sys/lib/acid/transcript '0 0 600 400' '100 100 700 500' "+textfile; + if progargs != "" then + estr = estr+" "+progargs; + + npid = rc(estr); + npid = atoi(npid); + if npid == 0 then + error("win failed to create process"); + + setproc(npid); + stopped(npid); +} + +printstopped = 1; +defn new() +{ + local a; + + bplist = {}; + newproc(progargs); + a = var("p9main"); + if a == {} then + a = var("main"); + if a == {} then + return {}; + bpset(a); + while *PC != a do + cont(); + bpdel(a); +} + +defn stmnt() // step one statement +{ + local line; + + line = pcline(*PC); + while 1 do { + step(); + if line != pcline(*PC) then { + src(*PC); + return {}; + } + } +} + +defn func() // step until we leave the current function +{ + local bound, end, start, pc; + + bound = fnbound(*PC); + if bound == {} then { + print("cannot locate text symbol\n"); + return {}; + 
} + + pc = *PC; + start = bound[0]; + end = bound[1]; + while pc >= start && pc < end do { + step(); + pc = *PC; + } +} + +defn next() +{ + local sp, bound, pc; + + sp = *SP; + bound = fnbound(*PC); + if bound == {} then { + print("cannot locate text symbol\n"); + return {}; + } + stmnt(); + pc = *PC; + if pc >= bound[0] && pc < bound[1] then + return {}; + + while (pc < bound[0] || pc > bound[1]) && sp >= *SP do { + step(); + pc = *PC; + } + src(*PC); +} + +defn maps() +{ + local m, mm; + + m = map(); + while m != {} do { + mm = head m; + m = tail m; + print(mm[2]\X, " ", mm[3]\X, " ", mm[4]\X, " ", mm[0], " ", mm[1], "\n"); + } +} + +defn dump(addr, n, fmt) +{ + loop 0, n do { + print(fmt(addr, 'X'), ": "); + addr = mem(addr, fmt); + } +} + +defn mem(addr, fmt) +{ + + local i, c, n; + + i = 0; + while fmt[i] != 0 do { + c = fmt[i]; + n = 0; + while '0' <= fmt[i] && fmt[i] <= '9' do { + n = 10*n + fmt[i]-'0'; + i = i+1; + } + if n <= 0 then n = 1; + addr = fmt(addr, fmt[i]); + while n > 0 do { + print(*addr++, " "); + n = n-1; + } + i = i+1; + } + print("\n"); + return addr; +} + +defn symbols(pattern) +{ + local l, s; + + l = symbols; + while l do { + s = head l; + if regexp(pattern, s[0]) then + print(s[0], "\t", s[1], "\t", s[2], "\t", s[3], "\n"); + l = tail l; + } +} + +defn havesymbol(name) +{ + local l, s; + + l = symbols; + while l do { + s = head l; + l = tail l; + if s[0] == name then + return 1; + } + return 0; +} + +defn spsrch(len) +{ + local addr, a, s, e; + + addr = *SP; + s = origin & 0x7fffffff; + e = etext & 0x7fffffff; + loop 1, len do { + a = *addr++; + c = a & 0x7fffffff; + if c > s && c < e then { + print("src(", a, ")\n"); + pfl(a); + } + } +} + +defn acidtypes() +{ + local syms; + local l; + + l = textfile(); + if l != {} then { + syms = "acidtypes"; + while l != {} do { + syms = syms + " " + ((head l)[0]); + l = tail l; + } + includepipe(syms); + } +} + +defn getregs() +{ + local regs, l; + + regs = {}; + l = registers; + while l != {} do 
{ + regs = append regs, var(l[0]); + l = tail l; + } + return regs; +} + +defn setregs(regs) +{ + local l; + + l = registers; + while l != {} do { + var(l[0]) = regs[0]; + l = tail l; + regs = tail regs; + } + return regs; +} + +defn resetregs() +{ + local l; + + l = registers; + while l != {} do { + var(l[0]) = register(l[0]); + l = tail l; + } +} + +defn clearregs() +{ + local l; + + l = registers; + while l != {} do { + var(l[0]) = refconst(~0); + l = tail l; + } +} + +progargs=""; +print(acidfile); + +-- /usr/local/plan9/acid/386 +// 386 support + +defn acidinit() // Called after all the init modules are loaded +{ + bplist = {}; + bpfmt = 'b'; + + srcpath = { + "./", + "/sys/src/libc/port/", + "/sys/src/libc/9sys/", + "/sys/src/libc/386/" + }; + + srcfiles = {}; // list of loaded files + srctext = {}; // the text of the files +} + +defn linkreg(addr) +{ + return {}; +} + +defn stk() // trace +{ + _stk({"PC", *PC, "SP", *SP}, 0); +} + +defn lstk() // trace with locals +{ + _stk({"PC", *PC, "SP", *SP}, 1); +} + +defn gpr() // print general(hah hah!) 
purpose registers +{ + print("AX\t", *AX, " BX\t", *BX, " CX\t", *CX, " DX\t", *DX, "\n"); + print("DI\t", *DI, " SI\t", *SI, " BP\t", *BP, "\n"); +} + +defn spr() // print special processor registers +{ + local pc; + local cause; + + pc = *PC; + print("PC\t", pc, " ", fmt(pc, 'a'), " "); + pfl(pc); + print("SP\t", *SP, " ECODE ", *ECODE, " EFLAG ", *EFLAGS, "\n"); + print("CS\t", *CS, " DS\t ", *DS, " SS\t", *SS, "\n"); + print("GS\t", *GS, " FS\t ", *FS, " ES\t", *ES, "\n"); + + cause = *TRAP; + print("TRAP\t", cause, " ", reason(cause), "\n"); +} + +defn regs() // print all registers +{ + spr(); + gpr(); +} + +defn mmregs() +{ + print("MM0\t", *MM0, " MM1\t", *MM1, "\n"); + print("MM2\t", *MM2, " MM3\t", *MM3, "\n"); + print("MM4\t", *MM4, " MM5\t", *MM5, "\n"); + print("MM6\t", *MM6, " MM7\t", *MM7, "\n"); +} + +defn pfixstop(pid) +{ + if *fmt(*PC-1, 'b') == 0xCC then { + // Linux stops us after the breakpoint, not at it + *PC = *PC-1; + } +} + + +defn pstop(pid) +{ + local l; + local pc; + local why; + + pc = *PC; + + // FIgure out why we stopped. 
+ if *fmt(pc, 'b') == 0xCC then { + why = "breakpoint"; + + // fix up instruction for print; will put back later + *pc = @pc; + } else if *(pc-2\x) == 0x80CD then { + pc = pc-2; + why = "system call"; + } else + why = "stopped"; + + if printstopped then { + print(pid,": ", why, "\t"); + print(fmt(pc, 'a'), "\t", *fmt(pc, 'i'), "\n"); + } + + if why == "breakpoint" then + *fmt(pc, bpfmt) = bpinst; + + if printstopped && notes then { + if notes[0] != "sys: breakpoint" then { + print("Notes pending:\n"); + l = notes; + while l do { + print("\t", head l, "\n"); + l = tail l; + } + } + } +} + +aggr Ureg +{ + 'U' 0 di; + 'U' 4 si; + 'U' 8 bp; + 'U' 12 nsp; + 'U' 16 bx; + 'U' 20 dx; + 'U' 24 cx; + 'U' 28 ax; + 'U' 32 gs; + 'U' 36 fs; + 'U' 40 es; + 'U' 44 ds; + 'U' 48 trap; + 'U' 52 ecode; + 'U' 56 pc; + 'U' 60 cs; + 'U' 64 flags; + { + 'U' 68 usp; + 'U' 68 sp; + }; + 'U' 72 ss; +}; + +defn +Ureg(addr) { + complex Ureg addr; + print(" di ", addr.di, "\n"); + print(" si ", addr.si, "\n"); + print(" bp ", addr.bp, "\n"); + print(" nsp ", addr.nsp, "\n"); + print(" bx ", addr.bx, "\n"); + print(" dx ", addr.dx, "\n"); + print(" cx ", addr.cx, "\n"); + print(" ax ", addr.ax, "\n"); + print(" gs ", addr.gs, "\n"); + print(" fs ", addr.fs, "\n"); + print(" es ", addr.es, "\n"); + print(" ds ", addr.ds, "\n"); + print(" trap ", addr.trap, "\n"); + print(" ecode ", addr.ecode, "\n"); + print(" pc ", addr.pc, "\n"); + print(" cs ", addr.cs, "\n"); + print(" flags ", addr.flags, "\n"); + print(" sp ", addr.sp, "\n"); + print(" ss ", addr.ss, "\n"); +}; +sizeofUreg = 76; + +aggr Linkdebug +{ + 'X' 0 version; + 'X' 4 map; +}; + +aggr Linkmap +{ + 'X' 0 addr; + 'X' 4 name; + 'X' 8 dynsect; + 'X' 12 next; + 'X' 16 prev; +}; + +defn +linkdebug() +{ + local a; + + if !havesymbol("_DYNAMIC") then + return 0; + + a = _DYNAMIC; + while *a != 0 do { + if *a == 21 then // 21 == DT_DEBUG + return *(a+4); + a = a+8; + } + return 0; +} + +defn +dynamicmap() +{ + if systype == "linux" || systype 
== "freebsd" then { + local r, m, n; + + r = linkdebug(); + if r then { + complex Linkdebug r; + m = r.map; + n = 0; + while m != 0 && n < 100 do { + complex Linkmap m; + if m.name && *(m.name\b) && access(*(m.name\s)) then + print("textfile({\"", *(m.name\s), "\", ", m.addr\X, "});\n"); + m = m.next; + n = n+1; + } + } + } +} + +defn +acidmap() +{ +// dynamicmap(); + acidtypes(); +} + +print(acidfile); diff --git a/web/l2.html b/web/l2.html new file mode 100644 index 0000000..e183d5a --- /dev/null +++ b/web/l2.html @@ -0,0 +1,494 @@ + + +L2 + + + +

    6.828 Lecture Notes: x86 and PC architecture

    + +

    Outline

    +
      +
    • PC architecture +
    • x86 instruction set +
    • gcc calling conventions +
    • PC emulation +
    + +

    PC architecture

    + +
      +
    • A full PC has: +
        +
      • an x86 CPU with registers, execution unit, and memory management +
      • CPU chip pins include address and data signals +
      • memory +
      • disk +
      • keyboard +
      • display +
      • other resources: BIOS ROM, clock, ... +
      + +
    • We will start with the original 16-bit 8086 CPU (1978) +
    • CPU runs instructions: +
      +for(;;){
      +	run next instruction
      +}
      +
      + +
    • Needs work space: registers +
        +
      • four 16-bit data registers: AX, CX, DX, BX +
      • each in two 8-bit halves, e.g. AH and AL +
      • very fast, very few +
      +
    • More work space: memory +
        +
      • CPU sends out address on address lines (wires, one bit per wire) +
      • Data comes back on data lines +
      • or data is written to data lines +
      + +
    • Add address registers: pointers into memory +
        +
      • SP - stack pointer +
      • BP - frame base pointer +
      • SI - source index +
      • DI - destination index +
      + +
    • Instructions are in memory too! +
        +
      • IP - instruction pointer (PC on PDP-11, everything else) +
      • increment after running each instruction +
      • can be modified by CALL, RET, JMP, conditional jumps +
      + +
    • Want conditional jumps +
        +
      • FLAGS - various condition codes +
          +
        • whether last arithmetic operation overflowed +
        • ... was positive/negative +
        • ... was [not] zero +
        • ... carry/borrow on add/subtract +
        • ... overflow +
        • ... etc. +
        • whether interrupts are enabled +
        • direction of data copy instructions +
        +
      • JP, JN, J[N]Z, J[N]C, J[N]O ... +
      + +
    • Still not interesting - need I/O to interact with outside world +
        +
      • Original PC architecture: use dedicated I/O space +
          +
        • Works same as memory accesses but set I/O signal +
        • Only 1024 I/O addresses +
        • Example: write a byte to line printer: +
          +#define DATA_PORT    0x378
          +#define STATUS_PORT  0x379
          +#define   BUSY 0x80
          +#define CONTROL_PORT 0x37A
          +#define   STROBE 0x01
          +void
          +lpt_putc(int c)
          +{
          +  /* wait for printer to consume previous byte */
          +  while((inb(STATUS_PORT) & BUSY) == 0)
          +    ;
          +
          +  /* put the byte on the parallel lines */
          +  outb(DATA_PORT, c);
          +
          +  /* tell the printer to look at the data */
          +  outb(CONTROL_PORT, STROBE);
          +  outb(CONTROL_PORT, 0);
          +}
          +
          +		
        + +
      • Memory-Mapped I/O +
          +
        • Use normal physical memory addresses +
            +
          • Gets around limited size of I/O address space +
          • No need for special instructions +
          • System controller routes to appropriate device +
          +
        • Works like ``magic'' memory: +
            +
          • Addressed and accessed like memory, + but ... +
          • ... does not behave like memory! +
          • Reads and writes can have ``side effects'' +
          • Read results can change due to external events +
          +
        +
      + + +
    • What if we want to use more than 2^16 bytes of memory? +
        +
      • 8086 has 20-bit physical addresses, can have 1 Meg RAM +
      • each segment is a 2^16 byte window into physical memory +
      • virtual to physical translation: pa = va + seg*16 +
      • the segment is usually implicit, from a segment register +
      • CS - code segment (for fetches via IP) +
      • SS - stack segment (for load/store via SP and BP) +
      • DS - data segment (for load/store via other registers) +
      • ES - another data segment (destination for string operations) +
      • tricky: can't use the 16-bit address of a stack variable as a pointer +
      • but a far pointer includes full segment:offset (16 + 16 bits) +
      + +
    • But 8086's 16-bit addresses and data were still painfully small +
        +
      • 80386 added support for 32-bit data and addresses (1985) +
      • boots in 16-bit mode, boot.S switches to 32-bit mode +
      • registers are 32 bits wide, called EAX rather than AX +
      • operands and addresses are also 32 bits, e.g. ADD does 32-bit arithmetic +
      • prefix 0x66 gets you 16-bit mode: MOVW is really MOVL with a 0x66 prefix +
      • the .code32 in boot.S tells assembler to generate 0x66 for e.g. MOVW +
      • 80386 also changed segments and added paged memory... +
      + +
    + +
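The real-mode translation rule above (pa = va + seg*16) is easy to check in C. This is our own illustration; the function name is invented, not part of any BIOS or toolchain:

```c
#include <assert.h>
#include <stdint.h>

/* 8086 real-mode address translation: the 16-bit segment register is
 * shifted left 4 bits and added to the 16-bit offset, producing a
 * 20-bit physical address.  The 8086 has only 20 address lines, so the
 * sum wraps at 1 MB.  Note that many seg:off pairs name the same byte. */
uint32_t real_mode_pa(uint16_t seg, uint16_t off)
{
    return (((uint32_t)seg << 4) + off) & 0xFFFFF;  /* pa = seg*16 + off */
}
```

For example, the reset vector F000:FFF0 translates to 0xFFFF0, the ROM address in the memory map below, and FFFF:0010 wraps around to 0.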

    x86 Physical Memory Map

    + +
      +
    • The physical address space mostly looks like ordinary RAM +
    • Except some low-memory addresses actually refer to other things +
    • Writes to VGA memory appear on the screen +
    • Reset or power-on jumps to ROM at 0x000ffff0 +
    + +
    ++------------------+  <- 0xFFFFFFFF (4GB)
    +|      32-bit      |
    +|  memory mapped   |
    +|     devices      |
    +|                  |
    +/\/\/\/\/\/\/\/\/\/\
    +
    +/\/\/\/\/\/\/\/\/\/\
    +|                  |
    +|      Unused      |
    +|                  |
    ++------------------+  <- depends on amount of RAM
    +|                  |
    +|                  |
    +| Extended Memory  |
    +|                  |
    +|                  |
    ++------------------+  <- 0x00100000 (1MB)
    +|     BIOS ROM     |
    ++------------------+  <- 0x000F0000 (960KB)
    +|  16-bit devices, |
    +|  expansion ROMs  |
    ++------------------+  <- 0x000C0000 (768KB)
    +|   VGA Display    |
    ++------------------+  <- 0x000A0000 (640KB)
    +|                  |
    +|    Low Memory    |
    +|                  |
    ++------------------+  <- 0x00000000
    +
    + +

    x86 Instruction Set

    + +
      +
    • Two-operand instruction set +
        +
      • Intel syntax: op dst, src +
      • AT&T (gcc/gas) syntax: op src, dst +
          +
        • uses b, w, l suffix on instructions to specify size of operands +
        +
      • Operands are registers, constant, memory via register, memory via constant +
      • Examples: +
+
AT&T syntax "C"-ish equivalent +
movl %eax, %edx edx = eax; register mode +
movl $0x123, %edx edx = 0x123; immediate +
movl 0x123, %edx edx = *(int32_t*)0x123; direct +
movl (%ebx), %edx edx = *(int32_t*)ebx; indirect +
movl 4(%ebx), %edx edx = *(int32_t*)(ebx+4); displaced +
+ + +
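The "C"-ish equivalents in the table can be exercised directly; the names below (mem, demo) are ours, purely for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Walk through the addressing modes above using C pointers:
 * immediate, indirect, and displaced. */
static int32_t mem[4] = { 10, 20, 30, 40 };

int32_t demo(void)
{
    int32_t edx;
    int32_t *ebx = mem;

    edx = 0x123;        /* movl $0x123, %edx   -- immediate          */
    edx = *ebx;         /* movl (%ebx), %edx   -- indirect, gets 10  */
    edx = *(ebx + 1);   /* movl 4(%ebx), %edx  -- displaced, gets 20 */
    return edx;
}
```

Note that the displacement 4 in 4(%ebx) is a byte offset; in C, the +1 is scaled by sizeof(int32_t) automatically.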

  • Instruction classes + + +
  • Intel architecture manual Volume 2 is the reference + + + +

    gcc x86 calling conventions

    + + + + +
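As a reminder of the classic 32-bit gcc (cdecl) convention: arguments are pushed on the stack right to left, the caller pops them, and integer results come back in %eax. A small C sketch (the function name is ours):

```c
#include <assert.h>

/* For a call like add(1, 2), 32-bit gcc emits roughly:
 *   pushl $2          ; arguments pushed right to left
 *   pushl $1
 *   call  add         ; pushes the return %eip
 *   addl  $8, %esp    ; caller pops its own arguments
 * add's prologue saves the caller's %ebp, and the sum comes
 * back in %eax. */
int add(int a, int b)
{
    return a + b;   /* result returned in %eax */
}
```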

    PC emulation

    + + diff --git a/web/l3.html b/web/l3.html new file mode 100644 index 0000000..7d6ca0d --- /dev/null +++ b/web/l3.html @@ -0,0 +1,334 @@ +L3 + + + + + +

    Operating system organization

    + +

    Required reading: Exokernel paper. + +

    Intro: virtualizing

    + +

    One way to think about an operating system interface is that it +extends the hardware instructions with a set of "instructions" that +are implemented in software. These instructions are invoked using a +system call instruction (int on the x86). In this view, a task of the +operating system is to provide each application with a virtual +version of the interface; that is, it provides each application with a +virtual computer. + +

    One of the challenges in an operating system is multiplexing the +physical resources between the potentially many virtual computers. +What typically complicates the multiplexing is an additional +constraint: the virtual computers must be well isolated from each +other. That is, +

    + +

    In this lecture, we will explore at a high level how to build +virtual computers that meet these goals. In the rest of the term we +work out the details. + 

    Virtual processors

    + +

    To give each application its own virtual processor, we need +to virtualize the physical processors. One way to do this is to multiplex +the physical processor over time: the operating system runs one +application for a while, then runs another application for a while, etc. +We can implement this solution as follows: when an application has run +for its share of the processor, unload the state of the physical +processor, save that state to be able to resume the application later, +load in the state for the next application, and resume it. + 

    What needs to be saved and restored? That depends on the +processor, but for the x86: +

    + +

To ensure that a virtual processor cannot keep the physical processor
+indefinitely, the operating system can arrange for a periodic interrupt, and
+switch to another virtual processor in the interrupt routine.

    To separate the memories of the applications, we may also need to save +and restore the registers that define the (virtual) memory of the +application (e.g., segment and MMU registers on the x86), which is +explained next. + + + +

    Separating memories

    + +

Approaches to separating memories:

    +The approaches can be combined. + +

Let's assume unlimited physical memory for a little while. We can
+then enforce separation as follows:

+Why does this work? Loads, stores, and jumps cannot touch or enter other
+applications' domains.

To allow for controlled sharing and separation within an application,
+extend the domain registers with protection bits: read (R), write (W),
+execute-only (X).

How to protect the domain registers? Extend the protection bits
+with a kernel-only one. When in kernel mode, the processor can change the
+domain registers. As we will see in lecture 4, the x86 stores the U/K
+information in the CPL (current privilege level) in the CS segment
+register.

To change from user to kernel, extend the hardware with special
+instructions for entering a "supervisor" or "system" call, and
+returning from it. On the x86, these are int and iret. The int instruction takes as
+argument the system call number. We can then think of the kernel
+interface as the set of "instructions" that augment the instructions
+implemented in hardware.

    Memory management

    + +

We assumed unlimited physical memory and big addresses. In
+practice, an operating system must support creating, shrinking, and
+growing domains, while still allowing the addresses of an
+application to be contiguous (for programming convenience). What if
+we want to grow the domain of application 1 but the memory right below
+and above it is in use by application 2?

    How? Virtual addresses and spaces. Virtualize addresses and let +the kernel control the mapping from virtual to physical. + +

Address spaces provide each application with the illusion that it has
+a complete memory to itself. All the addresses it issues are its own
+addresses (e.g., each application has an address 0).

  • How do you give each application its own address space? + + +
  • What if two applications want to share real memory? Map the pages +into multiple address spaces and have protection bits per page. + +
• How do you give an application access to a memory-mapped I/O
+device? Map the physical address for the device into the application's
+address space.
  • How do you get off the ground? + + +

    Operating system organizations

    + +

    A central theme in operating system design is how to organize the +operating system. It is helpful to define a couple of terms: +

    + +

    Example: trace a call to printf made by an application. + +

    There are roughly 4 operating system designs: +

    + +

Although monolithic operating systems are the dominant operating
+system architecture for desktop and server machines, it is worthwhile
+to consider alternative architectures, even if it is just to understand
+operating systems better. This lecture looks at exokernels, because
+that is what you will be building in the lab. xv6 is organized as a
+monolithic system, which we will study in the next lectures. Later in
+the term we will read papers about microkernel and virtual machine
+operating systems.

    Exokernels

    + +

The exokernel architecture takes an end-to-end approach to
+operating system design. In this design, the kernel just securely
+multiplexes physical resources; any programmer can decide what the
+operating system interface and its implementation are for their
+application. One would expect a couple of popular APIs (e.g., UNIX)
+that most applications will link against, but a programmer is always
+free to replace that API, partially or completely. (Draw picture of
+JOS.)

    Compare UNIX interface (v6 or OSX) with the JOS exokernel-like interface: +

    +enum
    +{
    +	SYS_cputs = 0,
    +	SYS_cgetc,
    +	SYS_getenvid,
    +	SYS_env_destroy,
    +	SYS_page_alloc,
    +	SYS_page_map,
    +	SYS_page_unmap,
    +	SYS_exofork,
    +	SYS_env_set_status,
    +	SYS_env_set_trapframe,
    +	SYS_env_set_pgfault_upcall,
    +	SYS_yield,
    +	SYS_ipc_try_send,
    +	SYS_ipc_recv,
    +};
    +
    + +

To illustrate the differences between these interfaces in more
+detail, consider implementing the following:

    + +

    How well can each kernel interface implement the above examples? +(Start with UNIX interface and see where you run into problems.) (The +JOS kernel interface is not flexible enough: for example, +ipc_receive is blocking.) + +

    Exokernel paper discussion

    + + +

The central challenge in an exokernel design is to provide
+extensibility while preserving fault isolation. This challenge breaks
+down into three problems:

    + + + + diff --git a/web/l4.html b/web/l4.html new file mode 100644 index 0000000..342af32 --- /dev/null +++ b/web/l4.html @@ -0,0 +1,518 @@ +L4 + + + + + +

    Address translation and sharing using segments

    + +

This lecture is about virtual memory, focusing on address
+spaces. It is the first in a series of lectures that use
+xv6 as a case study.

    Address spaces

    + + + +

    Two main approaches to implementing address spaces: using segments + and using page tables. Often when one uses segments, one also uses + page tables. But not the other way around; i.e., paging without + segmentation is common. + +

    Example support for address spaces: x86

    + +

For an operating system to provide address spaces and address
+translation typically requires support from hardware. The translation
+and checking of permissions typically must happen on each address used
+by a program, and it would be too slow to check that in software (if
+even possible). The division of labor is that the operating system manages
+address spaces, while the hardware translates addresses and checks
+permissions.

    PC block diagram without virtual memory support: +

    + +

    The x86 starts out in real mode and translation is as follows: +

    + +

    The operating system can switch the x86 to protected mode, which +allows the operating system to create address spaces. Translation in +protected mode is as follows: +

    + +

    Next lecture covers paging; now we focus on segmentation. + +

    Protected-mode segmentation works as follows: +

    + +

    Case study (xv6)

    + +

    xv6 is a reimplementation of Unix 6th edition. +

    + +

Newer Unixes have inherited many of the conceptual ideas even though
+they added paging, networking, graphics, improved performance, etc.

    You will need to read most of the source code multiple times. Your +goal is to explain every line to yourself. + +

    Overview of address spaces in xv6

    + +

In today's lecture we see how xv6 creates the kernel address
+ space and the first user address space, and how it switches to them. To understand
+ how this happens, we need to understand in detail the state on the
+ stack too---this may be surprising, but a thread of control and an
+ address space are tightly bundled in xv6, in a concept
+ called a process. The kernel address space is the only address
+ space with multiple threads of control. We will study context
+ switching and process management in detail in the next weeks; the creation of
+ the first user process (init) will give you a first flavor.

xv6 uses only the segmentation hardware of the x86, and in a limited
+ way. (In JOS you will use the page-table hardware too, which we cover in
+ the next lecture.) The address space layouts are as follows:

    + +

    xv6 makes minimal use of the segmentation hardware available on the +x86. What other plans could you envision? + +

In xv6, each program has a user and a kernel stack; when the
+user program switches to the kernel, it switches to its kernel stack.
+Its kernel stack is stored in the process's proc structure. (This is
+arranged through the descriptors in the IDT, which is covered later.)

    xv6 assumes that there is a lot of physical memory. It assumes that + segments can be stored contiguously in physical memory and has + therefore no need for page tables. + +

    xv6 kernel address space

    + +

Let's see how xv6 creates the kernel address space by tracing xv6
+ from when it boots, focusing on address space management:

    + +

    xv6 user address spaces

    + + + +

    Managing physical memory

    + +

To create an address space we must allocate physical memory, which
+ will be freed when an address space is deleted (e.g., when a user
+ program terminates). xv6 implements a first-fit memory allocator
+ (see kalloc.c).

    It maintains a list of ranges of free memory. The allocator finds + the first range that is larger than the amount of requested memory. + It splits that range in two: one range of the size requested and one + of the remainder. It returns the first range. When memory is + freed, kfree will merge ranges that are adjacent in memory. + +

    Under what scenarios is a first-fit memory allocator undesirable? + +

    Growing an address space

    + +

    How can a user process grow its address space? growproc. +

    +

We could do a lot better if segments didn't have to be contiguous in
+ physical memory. How could we arrange that? Using page tables, which
+ is our next topic. This is one place where page tables would be
+ useful, but there are others too (e.g., in fork).

+
+
diff --git a/web/l5.html b/web/l5.html
new file mode 100644
index 0000000..61b55e4
--- /dev/null
+++ b/web/l5.html
@@ -0,0 +1,210 @@
+<title>Lecture 5</title>
+<html>
+<head>
+</head>
+<body>
+
+<h2>Address translation and sharing using page tables</h2>
+
+<p> Reading: <a href="../readings/i386/toc.htm">80386</a> chapters 5 and 6<br>
+
+<p> Handout: <b> x86 address translation diagram</b> -
+<a href="x86_translation.ps">PS</a> -
+<a href="x86_translation.eps">EPS</a> -
+<a href="x86_translation.fig">xfig</a>
+<br>
+
+<p>Why do we care about x86 address translation?
+<ul>
+<li>It can simplify s/w structure by placing data at fixed known addresses.
+<li>It can implement tricks like demand paging and copy-on-write.
+<li>It can isolate programs to contain bugs.
+<li>It can isolate programs to increase security.
+<li>JOS uses paging a lot, and segments more than you might think.
+</ul>
+
+<p>Why aren't protected-mode segments enough?
+<ul>
+<li>Why did the 386 add translation using page tables as well?
+<li>Isn't it enough to give each process its own segments?
+</ul>
+
+<p>Translation using page tables on x86:
+<ul>
+<li>paging hardware maps linear address (la) to physical address (pa)
+<li>(we will often interchange "linear" and "virtual")
+<li>page size is 4096 bytes, so there are 1,048,576 pages in 2^32
+<li>why not just have a big array with each page #'s translation?
+<ul>
+<li>table[20-bit linear page #] => 20-bit phys page #
+</ul>
+<li>386 uses 2-level mapping structure
+<li>one page directory page, with 1024 page directory entries (PDEs)
+<li>up to 1024 page table pages, each with 1024 page table entries (PTEs)
+<li>so la has 10 bits of directory index, 10 bits table index, 12 bits offset
+<li>What's in a PDE or PTE?
+<ul>
+<li>20-bit phys page number, present, read/write, user/supervisor
+</ul>
+<li>cr3 register holds physical address of current page directory
+<li>puzzle: what do PDE read/write and user/supervisor flags mean?
+<li>puzzle: can supervisor read/write user pages?

+<li>Here's how the MMU translates an la to a pa:

+  <pre>
+	uint
+	translate (uint la, bool user, bool write)
+	{
+	  uint pde, pte;
+	  pde = read_mem (%CR3 + 4*(la >> 22));
+	  access (pde, user, write);
+	  pte = read_mem ( (pde & 0xfffff000) + 4*((la >> 12) & 0x3ff));
+	  access (pte, user, write);
+	  return (pte & 0xfffff000) + (la & 0xfff);
+	}
+
+	// check protection. pxe is a pte or pde.
+	// user is true if CPL==3
+	void
+	access (uint pxe, bool user, bool write)
+	{
+	  if (!(pxe & PG_P))
+	    => page fault -- page not present
+	  if (!(pxe & PG_U) && user)
+	    => page fault -- no access for user
+
+	  if (write && !(pxe & PG_W))
+	    if (user)
+	      => page fault -- not writable
+	    else if (!(pxe & PG_U))
+	      => page fault -- not writable
+	    else if (%CR0 & CR0_WP)
+	      => page fault -- not writable
+	}
+  </pre>

+<li>CPU's TLB caches vpn => ppn mappings
+<li>if you change a PDE or PTE, you must flush the TLB!
+<ul>
+  <li>by re-loading cr3
+</ul>
+<li>turn on paging by setting CR0_PG bit of %cr0
+</ul>

Can we use paging to limit what memory an app can read/write?
+<ul>
+<li>user can't modify cr3 (requires privilege)
+<li>is that enough?
+<li>could user modify page tables? after all, they are in memory.
+</ul> + +<p>How we will use paging (and segments) in JOS: +<ul> +<li>use segments only to switch privilege level into/out of kernel +<li>use paging to structure process address space +<li>use paging to limit process memory access to its own address space +<li>below is the JOS virtual memory map +<li>why map both kernel and current process? why not 4GB for each? +<li>why is the kernel at the top? +<li>why map all of phys mem at the top? i.e. why multiple mappings? +<li>why map page table a second time at VPT? +<li>why map page table a third time at UVPT? +<li>how do we switch mappings for a different process? +</ul> + +<pre> + 4 Gig --------> +------------------------------+ + | | RW/-- + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + : . : + : . : + : . : + |~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~| RW/-- + | | RW/-- + | Remapped Physical Memory | RW/-- + | | RW/-- + KERNBASE -----> +------------------------------+ 0xf0000000 + | Cur. Page Table (Kern. RW) | RW/-- PTSIZE + VPT,KSTACKTOP--> +------------------------------+ 0xefc00000 --+ + | Kernel Stack | RW/-- KSTKSIZE | + | - - - - - - - - - - - - - - -| PTSIZE + | Invalid Memory | --/-- | + ULIM ------> +------------------------------+ 0xef800000 --+ + | Cur. Page Table (User R-) | R-/R- PTSIZE + UVPT ----> +------------------------------+ 0xef400000 + | RO PAGES | R-/R- PTSIZE + UPAGES ----> +------------------------------+ 0xef000000 + | RO ENVS | R-/R- PTSIZE + UTOP,UENVS ------> +------------------------------+ 0xeec00000 + UXSTACKTOP -/ | User Exception Stack | RW/RW PGSIZE + +------------------------------+ 0xeebff000 + | Empty Memory | --/-- PGSIZE + USTACKTOP ---> +------------------------------+ 0xeebfe000 + | Normal User Stack | RW/RW PGSIZE + +------------------------------+ 0xeebfd000 + | | + | | + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + . . + . . + . . 
+ |~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~| + | Program Data & Heap | + UTEXT --------> +------------------------------+ 0x00800000 + PFTEMP -------> | Empty Memory | PTSIZE + | | + UTEMP --------> +------------------------------+ 0x00400000 + | Empty Memory | PTSIZE + 0 ------------> +------------------------------+ +</pre> + +<h3>The VPT </h3> + +<p>Remember how the X86 translates virtual addresses into physical ones: + +<p><img src=pagetables.png> + +<p>CR3 points at the page directory. The PDX part of the address +indexes into the page directory to give you a page table. The +PTX part indexes into the page table to give you a page, and then +you add the low bits in. + +<p>But the processor has no concept of page directories, page tables, +and pages being anything other than plain memory. So there's nothing +that says a particular page in memory can't serve as two or three of +these at once. The processor just follows pointers: + +pd = lcr3(); +pt = *(pd+4*PDX); +page = *(pt+4*PTX); + +<p>Diagramatically, it starts at CR3, follows three arrows, and then stops. + +<p>If we put a pointer into the page directory that points back to itself at +index Z, as in + +<p><img src=vpt.png> + +<p>then when we try to translate a virtual address with PDX and PTX +equal to V, following three arrows leaves us at the page directory. +So that virtual page translates to the page holding the page directory. +In Jos, V is 0x3BD, so the virtual address of the VPD is +(0x3BD<<22)|(0x3BD<<12). + + +<p>Now, if we try to translate a virtual address with PDX = V but an +arbitrary PTX != V, then following three arrows from CR3 ends +one level up from usual (instead of two as in the last case), +which is to say in the page tables. So the set of virtual pages +with PDX=V form a 4MB region whose page contents, as far +as the processor is concerned, are the page tables themselves. +In Jos, V is 0x3BD so the virtual address of the VPT is (0x3BD<<22). 
+ +<p>So because of the "no-op" arrow we've cleverly inserted into +the page directory, we've mapped the pages being used as +the page directory and page table (which are normally virtually +invisible) into the virtual address space. + + +</body> diff --git a/web/mkhtml b/web/mkhtml new file mode 100755 index 0000000..74987e6 --- /dev/null +++ b/web/mkhtml @@ -0,0 +1,70 @@ +#!/usr/bin/perl + +my @lines = <>; +my $text = join('', @lines); +my $title; +if($text =~ /^\*\* (.*?)\n/m){ + $title = $1; + $text = $` . $'; +}else{ + $title = "Untitled"; +} + +$text =~ s/[ \t]+$//mg; +$text =~ s/^$/<br><br>/mg; +$text =~ s!\b([a-z0-9]+\.(c|s|pl|h))\b!<a href="src/$1.html">$1</a>!g; +$text =~ s!^(Lecture [0-9]+\. .*?)$!<b><i>$1</i></b>!mg; +$text =~ s!^\* (.*?)$!<h2>$1</h2>!mg; +$text =~ s!((<br>)+\n)+<h2>!\n<h2>!g; +$text =~ s!</h2>\n?((<br>)+\n)+!</h2>\n!g; +$text =~ s!((<br>)+\n)+<b>!\n<br><br><b>!g; +$text =~ s!\b\s*--\s*\b!\–!g; +$text =~ s!\[([^\[\]|]+) \| ([^\[\]]+)\]!<a href="$1">$2</a>!g; +$text =~ s!\[([^ \t]+)\]!<a href="$1">$1</a>!g; + +$text =~ s!``!\“!g; +$text =~ s!''!\”!g; + +print <<EOF; +<!-- AUTOMATICALLY GENERATED: EDIT the .txt version, not the .html version --> +<html> +<head> +<title>$title + + + +

    $title

    +

    +EOF +print $text; +print < + +EOF diff --git a/web/x86-intr.html b/web/x86-intr.html new file mode 100644 index 0000000..0369e25 --- /dev/null +++ b/web/x86-intr.html @@ -0,0 +1,53 @@ +Homework: xv6 and Interrupts and Exceptions + + + + + +

    Homework: xv6 and Interrupts and Exceptions

    + +

    +Read: xv6's trapasm.S, trap.c, syscall.c, vectors.S, and usys.S. Skim +lapic.c, ioapic.c, and picirq.c + +

    +Hand-In Procedure +

    +You are to turn in this homework during lecture. Please +write up your answers to the exercises below and hand them in to a +6.828 staff member at the beginning of the lecture. +

    + +Introduction + +

+Try to understand
+xv6's trapasm.S, trap.c, syscall.c, vectors.S, and usys.S.
+ You will need to consult:

    Chapter 5 of IA-32 Intel +Architecture Software Developer's Manual, Volume 3: System programming +guide; you can skip sections 5.7.1, 5.8.2, and 5.12.2. Be aware +that terms such as exceptions, traps, interrupts, faults and aborts +have no standard meaning. + +

    Chapter 9 of the 1987 i386 +Programmer's Reference Manual also covers exception and interrupt +handling in IA32 processors. + +

    Assignment: + +In xv6, set a breakpoint at the beginning of syscall() to +catch the very first system call. What values are on the stack at +this point? Turn in the output of print-stack 35 at that +breakpoint with each value labeled as to what it is (e.g., +saved %ebp for trap, +trapframe.eip, etc.). +

    +This completes the homework. + + + + + + + diff --git a/web/x86-intro.html b/web/x86-intro.html new file mode 100644 index 0000000..323d92e --- /dev/null +++ b/web/x86-intro.html @@ -0,0 +1,18 @@ +Homework: Intro to x86 and PC + + + + + +

    Homework: Intro to x86 and PC

    + +

Today's lecture is an introduction to the x86 and the PC, the
+platform for which you will write an operating system. The assigned
+book is a reference for x86 assembly programming, of which you will do
+some.

    Assignment Make sure to do exercise 1 of lab 1 before +coming to lecture. + + + diff --git a/web/x86-mmu.html b/web/x86-mmu.html new file mode 100644 index 0000000..a83ff26 --- /dev/null +++ b/web/x86-mmu.html @@ -0,0 +1,33 @@ +Homework: x86 MMU + + + + + +

    Homework: x86 MMU

    + +

    Read chapters 5 and 6 of +Intel 80386 Reference Manual. +These chapters explain +the x86 Memory Management Unit (MMU), +which we will cover in lecture today and which you need +to understand in order to do lab 2. + +

    +Read: bootasm.S and setupsegs() in proc.c + +

    +Hand-In Procedure +

    +You are to turn in this homework during lecture. Please +write up your answers to the exercises below and hand them in to a +6.828 staff member by the beginning of lecture. +

    + +

    Assignment: Try to understand setupsegs() in proc.c. + What values are written into gdt[SEG_UCODE] + and gdt[SEG_UDATA] for init, the first user-space + process? + (You can use Bochs to answer this question.) + + diff --git a/web/x86-mmu1.pdf b/web/x86-mmu1.pdf new file mode 100644 index 0000000..e7103e7 Binary files /dev/null and b/web/x86-mmu1.pdf differ diff --git a/web/x86-mmu2.pdf b/web/x86-mmu2.pdf new file mode 100644 index 0000000..e548148 --- /dev/null +++ b/web/x86-mmu2.pdf @@ -0,0 +1,55 @@ +%PDF-1.4 1 0 obj <> >> stream KFf<dp|n'G2tyJ\#菑‘Fa7Ll1њ"^+ aࠎG垼JY w&EBQ 2(09dQfԣjA.zX0r2Aw&9']Y*AɎ[ xr hEd'`_{Z D#:;IVb.D!U~PL 3L<\ๆ  @֝hqiմӭtOQ;wICړ|wA xzzpc/ [*">DQiu"4 {W T7ӹzAӚI_5lw3$XjZE[wluիwjdW_M5ShAB\ e`H"""""""#k%Ph#B#-"@He0AAaMS}'T§Y'D /(&ALɬ *H8TOr~};VK|W (Ku %JxPK %_ I/o($e.J~%jKL~AsfHk?_at04"""#Q|Dr/8-#Hk|h)(r `= Z&9>.ʕ s.!wxvL sK:Ȼ*o;@8z _-Zǵ~!{qr9fcf4'܆戎 48M?afPx_wM7q .({Z]/I_z .aIے{I + JD~>quZtx?__Y#_z_s:%s7_o}p׵`iXKm.kظ⽍0+jjWQV"{ojg4vA8@WM5U"""?21)Kz?w}c(E.Έ8""""#4r c!.ɹCr9F]Uy!͢%,n[o_^jzo __W%0@[;'3`1H*'SQd%JF~4?4vҼfh_Mzg _Kv~C %IL*-Hu P<7ZN$ +]$xY!ܸ({/Zt]RTگz T}M^ kzoҹg}ʙm?߫kJ"8׶ >v0G:8v=4-^*X hWqO3HGn`DE!WA=#w~솔VA7<\@ySYmpX^/W?~{/VlG.>97"8 c Aۣzok_~{]&3Gp3A|GM 5Y _?V_fq'3͑#5E#lˌ֎/DDDDDDDDD}m}:'zGe"G~Bio 2ЌSd]QoPv}^ cs1Z^a =pj]={p ͢78A_df<|NXg>EGNt ot}XCCh4 t?[?zAvkfǬHz&w=Hhc BUa<:WzPA7/<d_vqW A>D)vBw? 8)/isH.Ճ&Oȱ`ő)"^!Џ_:D6#!O??Rv}i U_o 7U91VҴl+,y ~g [68H p]>Q$q;iz+Wg~^AA; +a~[mm;>:_aA # ),0CkV +Q텆 H3KDGxzVLTS,4?w Pi;[4[ „""!(Mqkk^P֒^o~}TpL%|%:(W/[lXRl%Gz 㐎Bh% D%Ŏ]g"g f@a' 4yTEvs˛xdb \3L8JǹpC +! 0'TLz#H98_]KZ}/׿}_wG5DᠹuQ{^믻OUO^KuOG3:5",08atUnɃU"H0D:;C jB? dxqǔqx!07гp^`gxOՐ#_%] "#)YyEN3kW.N暈N_B9; ܇_@E{UՄ"",TsRׂ>&/N8K""#Boz+ A}ճϿ"էwk믯׾! 
ߠ'6P2Af a{zM;kJ/\,$Z=O`6?ɧk#aS.2l5M \C˜3G b 9{"aL0I`r'aυHr2&‡+ +VynP'l""""""""""""""#"i=3Bc!G 9}ߙDDDDLr!̐)aˣ4`S#h p<4O~|>M}V~ ToU 9 \saܭ9CB+%O$ܪW'aI2:}r\CC.r=?|ZwzZ 쪟_( |+#(J8??(KQUo o{ӯKYةAIym?7kUwK" pZ$?}BEP zPUr ״W6?U,8׵8;}uA'ϙqٲ_qvq׽ ?c+{l{`_V + &߿vޫ{~7kv?{Hq;"A|0Bh05^; kkui{4p 0Av  {L-A{[Z_M+oA""""""""" q P2 2Q2p7ȯu ?p +Z"*""""#B""!4 A-T 4@ͤ4Pp8dq\C-& +%~""""""""""8#C + ֎Zڌ+8a5T-dXiZ7+L i#@DT(#GeCD ЈaDdAJ)τxנi޷-ߴ=+dJ~Zb:ⓞ=4 B5' }63< Ayf4Dvt gS5Ašҽ0zC3 d~Z[O'h4uHx5䞫%jv}WX^GG{D h2W_ zz +=x +]6BUP]5]-_7@׾ǹ%_}Xz'[~Xp<ԆXK ֈQl6/GHt6I{kB7|`|}:tIG[U j߸ l`~69+:{pޯQc CJ*?]׮k魯i#2O B}V~AQ> stream q 612.00 0 0 792.00 0.00 0.00 cm 0 g /Obj1 Do Q endstream endobj 3 0 obj << /Type /Pages /Kids [ 4 0 R ] /Count 1 >> endobj 4 0 obj << /Type /Page /MediaBox [ 0 0 612 792 ] /Parent 3 0 R /Rotate 0 /Resources << /ProcSet [/PDF /ImageC /ImageB /ImageI] /XObject << /Obj1 1 0 R >> >> /Contents [2 0 R ] >> endobj 5 0 obj << /Type /Catalog /Pages 3 0 R >> endobj 6 0 obj << /Creator (HP Digital Sending Device) /CreationDate () /Author () /Producer (HP Digital Sending Device) /Title () /Subject() >> endobj xref 0 7 0000000000 65535 f 0000000009 00000 n 0000025495 00000 n 0000025597 00000 n 0000025656 00000 n 0000025843 00000 n 0000025892 00000 n trailer << /Size 7 /Root 5 0 R /Info 6 0 R >> startxref 26037 %%EOF \ No newline at end of file diff --git a/web/xv6-disk.html b/web/xv6-disk.html new file mode 100644 index 0000000..65bcf8f --- /dev/null +++ b/web/xv6-disk.html @@ -0,0 +1,63 @@ + + +Homework: Files and Disk I/O + + + +

    Homework: Files and Disk I/O

    + +

    +Read: bio.c, fd.c, fs.c, and ide.c + +

    +This homework should be turned in at the beginning of lecture. + +

    +File and Disk I/O + +

    Insert a print statement in bwrite so that you get a +print every time a block is written to disk: + +

    +  cprintf("bwrite sector %d\n", sector);
    +
    + +

    Build and boot a new kernel and run these three commands at the shell: +

    +  echo >a
    +  echo >a
    +  rm a
    +  mkdir d
    +
    + +(You can try rm d if you are curious, but it should look +almost identical to rm a.) + +

    You should see a sequence of bwrite prints after running each command. +Record the list and annotate it with the calling function and +what block is being written. +For example, this is the second echo >a: + +

    +$ echo >a
    +bwrite sector 121  # writei  (data block)
    +bwrite sector 3    # iupdate (inode block)
    +$ 
    +
    + +

    Hint: the easiest way to get the name of the +calling function is to add a string argument to bwrite, +edit all the calls to bwrite to pass the name of the +calling function, and just print it. +You should be able to reason about what kind of +block is being written just from the calling function. + +

    You need not write the following up, but try to +understand why each write is happening. This will +help your understanding of the file system layout +and the code. + +

    +This completes the homework. + + diff --git a/web/xv6-intro.html b/web/xv6-intro.html new file mode 100644 index 0000000..3669866 --- /dev/null +++ b/web/xv6-intro.html @@ -0,0 +1,163 @@ +Homework: intro to xv6 + + + + + +

    Homework: intro to xv6

    + +

    This lecture is the introduction to xv6, our re-implementation of + Unix v6. Read the source code in the assigned files. You won't have + to understand the details yet; we will focus on how the first + user-level process comes into existence after the computer is turned + on. +

    + +Hand-In Procedure +

    +You are to turn in this homework during lecture. Please +write up your answers to the exercises below and hand them in to a +6.828 staff member at the beginning of lecture. +

    + +

    Assignment: +
    +Fetch and un-tar the xv6 source: + +

    +sh-3.00$ wget http://pdos.csail.mit.edu/6.828/2007/src/xv6-rev1.tar.gz 
    +sh-3.00$ tar xzvf xv6-rev1.tar.gz
    +xv6/
    +xv6/asm.h
    +xv6/bio.c
    +xv6/bootasm.S
    +xv6/bootmain.c
    +...
    +$
    +
    + +Build xv6: +
    +$ cd xv6
    +$ make
    +gcc -O -nostdinc -I. -c bootmain.c
    +gcc -nostdinc -I. -c bootasm.S
    +ld -N -e start -Ttext 0x7C00 -o bootblock.o bootasm.o bootmain.o
    +objdump -S bootblock.o > bootblock.asm
    +objcopy -S -O binary bootblock.o bootblock
    +...
    +$ 
    +
    + +Find the address of the main function by +looking in kernel.asm: +
    +% grep main kernel.asm
    +...
    +00102454 <mpmain>:
    +mpmain(void)
    +001024d0 <main>:
    +  10250d:       79 f1                   jns    102500 <main+0x30>
    +  1025f3:       76 6f                   jbe    102664 <main+0x194>
    +  102611:       74 2f                   je     102642 <main+0x172>
    +
    +In this case, the address is 001024d0. +

    + +Run the kernel inside Bochs, setting a breakpoint +at the beginning of main (i.e., the address +you just found). +

    +$ make bochs
    +if [ ! -e .bochsrc ]; then ln -s dot-bochsrc .bochsrc; fi
    +bochs -q
    +========================================================================
    +                       Bochs x86 Emulator 2.2.6
    +                    (6.828 distribution release 1)
    +========================================================================
    +00000000000i[     ] reading configuration from .bochsrc
    +00000000000i[     ] installing x module as the Bochs GUI
    +00000000000i[     ] Warning: no rc file specified.
    +00000000000i[     ] using log file bochsout.txt
    +Next at t=0
    +(0) [0xfffffff0] f000:fff0 (unk. ctxt): jmp far f000:e05b         ; ea5be000f0
    +(1) [0xfffffff0] f000:fff0 (unk. ctxt): jmp far f000:e05b         ; ea5be000f0
    +<bochs> 
    +
    + +Look at the registers and the stack contents: + +
    +<bochs> info reg
    +...
    +<bochs> print-stack
    +...
    +<bochs>
    +
    + +Which part of the stack printout is actually the stack? +(Hint: not all of it.) Identify all the non-zero values +on the stack.

    + +Turn in: the output of print-stack with +the valid part of the stack marked. Write a short (3-5 word) +comment next to each non-zero value explaining what it is. +

    + +Now look at kernel.asm for the instructions in main that read: +

    +  10251e:       8b 15 00 78 10 00       mov    0x107800,%edx
    +  102524:       8d 04 92                lea    (%edx,%edx,4),%eax
    +  102527:       8d 04 42                lea    (%edx,%eax,2),%eax
    +  10252a:       c1 e0 04                shl    $0x4,%eax
    +  10252d:       01 d0                   add    %edx,%eax
    +  10252f:       8d 04 85 1c ad 10 00    lea    0x10ad1c(,%eax,4),%eax
    +  102536:       89 c4                   mov    %eax,%esp
    +
+(The addresses and constants might be different on your system, +and the compiler might use imul instead of the lea,lea,shl,add,lea sequence. +Look for the move into %esp.) +

    + +Which lines in main.c do these instructions correspond to? +

    + +Set a breakpoint at the first of those instructions +and let the program run until the breakpoint: +

    +<bochs> vb 0x8:0x10251e
    +<bochs> s
    +...
    +<bochs> c
    +(0) Breakpoint 2, 0x0010251e (0x0008:0x0010251e)
    +Next at t=1157430
    +(0) [0x0010251e] 0008:0x0010251e (unk. ctxt): mov edx, dword ptr ds:0x107800 ; 8b1500781000
    +(1) [0xfffffff0] f000:fff0 (unk. ctxt): jmp far f000:e05b         ; ea5be000f0
    +<bochs> 
    +
    +(The first s command is necessary +to single-step past the breakpoint at main, otherwise c +will not make any progress.) +

    + +Inspect the registers and stack again +(info reg and print-stack). +Then step past those seven instructions +(s 7) +and inspect them again. +Convince yourself that the stack has changed correctly. +

+Turn in: answers to the following questions. +Look at the assembly for the call to +lapic_init that occurs after the +stack switch. Where does the +bcpu argument come from? +What would have happened if main +stored bcpu +on the stack before those assembly instructions? +Would the code still work? Why or why not? +

    + + + diff --git a/web/xv6-lock.html b/web/xv6-lock.html new file mode 100644 index 0000000..887022a --- /dev/null +++ b/web/xv6-lock.html @@ -0,0 +1,100 @@ +Homework: Locking + + + + + +

    Homework: Locking

    + + +

    +Read: spinlock.c + +

    +Hand-In Procedure +

    +You are to turn in this homework at the beginning of lecture. Please +write up your answers to the exercises below and hand them in to a +6.828 staff member at the beginning of lecture. +

    + +Assignment: +In this assignment we will explore some of the interaction +between interrupts and locking. +

    + +Make sure you understand what would happen if the kernel executed +the following code snippet: +

    +  struct spinlock lk;
    +  initlock(&lk, "test lock");
    +  acquire(&lk);
    +  acquire(&lk);
    +
    +(Feel free to use Bochs to find out. acquire is in spinlock.c.) +

    + +An acquire ensures interrupts are off +on the local processor using cli, +and interrupts remain off until the release +of the last lock held by that processor +(at which point they are enabled using sti). +

    + +Let's see what happens if we turn on interrupts while +holding the ide lock. +In ide_rw in ide.c, add a call +to sti() after the acquire(). +Rebuild the kernel and boot it in Bochs. +Chances are the kernel will panic soon after boot; try booting Bochs a few times +if it doesn't. +

    + +Turn in: explain in a few sentences why the kernel panicked. +You may find it useful to look up the stack trace +(the sequence of %eip values printed by panic) +in the kernel.asm listing. +

    + +Remove the sti() you added, +rebuild the kernel, and make sure it works again. +

    + +Now let's see what happens if we turn on interrupts +while holding the kalloc_lock. +In kalloc() in kalloc.c, add +a call to sti() after the call to acquire(). +You will also need to add +#include "x86.h" at the top of the file after +the other #include lines. +Rebuild the kernel and boot it in Bochs. +It will not panic. +

    + +Turn in: explain in a few sentences why the kernel didn't panic. +What is different about kalloc_lock +as compared to ide_lock? +

    +You do not need to understand anything about the details of the IDE hardware +to answer this question, but you may find it helpful to look +at which functions acquire each lock, and then at when those +functions get called. +

+(There is a very small but non-zero chance that the kernel will panic +with the extra sti() in kalloc. +If the kernel does panic, make doubly sure that +you removed the sti() call from +ide_rw. If it continues to panic and the +only extra sti() is in kalloc.c, +then mail 6.828-staff@pdos.csail.mit.edu +and think about buying a lottery ticket.) +

    + +Turn in: Why does release() clear +lock->pcs[0] and lock->cpu +before clearing lock->locked? +Why not wait until after? + + + diff --git a/web/xv6-names.html b/web/xv6-names.html new file mode 100644 index 0000000..926be3a --- /dev/null +++ b/web/xv6-names.html @@ -0,0 +1,78 @@ + + +Homework: Naming + + + +

    Homework: Naming

    + +

    +Read: namei in fs.c, fd.c, sysfile.c + +

    +This homework should be turned in at the beginning of lecture. + +

    +Symbolic Links + +

    +As you read namei and explore its varied uses throughout xv6, +think about what steps would be required to add symbolic links +to xv6. +A symbolic link is simply a file with a special type (e.g., T_SYMLINK +instead of T_FILE or T_DIR) whose contents contain the path being +linked to. + +

    +Turn in a short writeup of how you would change xv6 to support +symlinks. List the functions that would have to be added or changed, +with short descriptions of the new functionality or changes. + +

    +This completes the homework. + +

    +The following is not required. If you want to try implementing +symbolic links in xv6, here are the files that the course staff +had to change to implement them: + +

    +fs.c: 20 lines added, 4 modified
    +syscall.c: 2 lines added
    +syscall.h: 1 line added
    +sysfile.c: 15 lines added
    +user.h: 1 line added
    +usys.S: 1 line added
    +
    + +Also, here is an ln program: + +
    +#include "types.h"
    +#include "user.h"
    +
    +int
    +main(int argc, char *argv[])
    +{
    +  int (*ln)(char*, char*);
    +  
    +  ln = link;
    +  if(argc > 1 && strcmp(argv[1], "-s") == 0){
    +    ln = symlink;
    +    argc--;
    +    argv++;
    +  }
    +  
    +  if(argc != 3){
    +    printf(2, "usage: ln [-s] old new (%d)\n", argc);
    +    exit();
    +  }
    +  if(ln(argv[1], argv[2]) < 0){
    +    printf(2, "%s failed\n", ln == symlink ? "symlink" : "link");
    +    exit();
    +  }
    +  exit();
    +}
    +
    + + diff --git a/web/xv6-sched.html b/web/xv6-sched.html new file mode 100644 index 0000000..f8b8b31 --- /dev/null +++ b/web/xv6-sched.html @@ -0,0 +1,96 @@ +Homework: Threads and Context Switching + + + + + +

    Homework: Threads and Context Switching

    + +

    +Read: swtch.S and proc.c (focus on the code that switches +between processes, specifically scheduler and sched). + +

    +Hand-In Procedure +

    +You are to turn in this homework during lecture. Please +write up your answers to the exercises below and hand them in to a +6.828 staff member at the beginning of lecture. +

    +Introduction + +

    +In this homework you will investigate how the kernel switches between +two processes. + +

    +Assignment: +

    + +Suppose a process that is running in the kernel +calls sched(), which ends up jumping +into scheduler(). + +

    +Turn in: +Where is the stack that sched() executes on? + +

    +Turn in: +Where is the stack that scheduler() executes on? + +

    +Turn in: +When sched() calls swtch(), +does that call to swtch() ever return? If so, when? + +

    +Turn in: +Why does swtch() copy %eip from the stack into the +context structure, only to copy it from the context +structure to the same place on the stack +when the process is re-activated? +What would go wrong if swtch() just left the +%eip on the stack and didn't store it in the context structure? + +

+Surround the call to swtch() in scheduler() with calls +to cons_putc() like this: +

    +      cons_putc('a');
    +      swtch(&cpus[cpu()].context, &p->context);
    +      cons_putc('b');
    +
    +

    +Similarly, +surround the call to swtch() in sched() with calls +to cons_putc() like this: + +

    +  cons_putc('c');
    +  swtch(&cp->context, &cpus[cpu()].context);
    +  cons_putc('d');
    +
    +

+Rebuild your kernel and boot it in Bochs. +With a few exceptions +you should see a regular four-character pattern repeated over and over. +

    +Turn in: What is the four-character pattern? +

    +Turn in: The very first characters are ac. Why does +this happen? +

    +Turn in: Near the start of the last line you should see +bc. How could this happen? + +

    +This completes the homework. + + + + + + + + diff --git a/web/xv6-sleep.html b/web/xv6-sleep.html new file mode 100644 index 0000000..e712a40 --- /dev/null +++ b/web/xv6-sleep.html @@ -0,0 +1,100 @@ +Homework: sleep and wakeup + + + + + +

    Homework: sleep and wakeup

    + +

    +Read: pipe.c + +

    +Hand-In Procedure +

    +You are to turn in this homework at the beginning of lecture. Please +write up your answers to the questions below and hand them in to a +6.828 staff member at the beginning of lecture. +

    +Introduction +

    + +Remember in lecture 7 we discussed locking a linked list implementation. +The insert code was: + +

    +        struct list *l;
    +        l = list_alloc();
    +        l->next = list_head;
    +        list_head = l;
    +
    + +and if we run the insert on multiple processors simultaneously with no locking, +this ordering of instructions can cause one of the inserts to be lost: + +
    +        CPU1                           CPU2
    +       
    +        struct list *l;
    +        l = list_alloc();
    +        l->next = list_head;
    +                                       struct list *l;
    +                                       l = list_alloc();
    +                                       l->next = list_head;
    +                                       list_head = l;
    +        list_head = l;
    +
    + +(Even though the instructions can happen simultaneously, we +write out orderings where only one CPU is "executing" at a time, +to avoid complicating things more than necessary.) +

    + +In this case, the list element allocated by CPU2 is lost from +the list by CPU1's update of list_head. +Adding a lock that protects the final two instructions makes +the read and write of list_head atomic, so that this +ordering is impossible. +

    + +The reading for this lecture is the implementation of sleep and wakeup, +which are used for coordination between different processes executing +in the kernel, perhaps simultaneously. +

+If there were no locking at all in sleep and wakeup, it would be +possible for a sleep and its corresponding wakeup, executing +simultaneously on different processors, to miss each other: +the wakeup would find no process to wake up, and yet the +process calling sleep would go to sleep anyway, never to wake. +Obviously this is something we'd like to avoid. +

    + +Read the code with this in mind. + +

    +

    +Questions +

    +(Answer and hand in.) +

    + +1. How does the proc_table_lock help avoid this problem? Give an +ordering of instructions (like the above example for linked list +insertion) +that could result in a wakeup being missed if the proc_table_lock were not used. +You need only include the relevant lines of code. +

+2. sleep is also protected by a second lock, its second argument, +which need not be the proc_table_lock. Look at the example in ide.c, +which uses the ide_lock. Give an ordering of instructions that could +result in a wakeup being missed if the ide_lock were not being used. +(Hint: this should not be the same as your answer to question 1. The +two locks serve different purposes.)

    + +

    +This completes the homework. + + +