DO NOT MAIL: xv6 web pages
This commit is contained in:
		
parent ee3f75f229
commit f53494c28e
37 changed files with 9034 additions and 0 deletions
				
			
		
							
								
								
									
web/Makefile | 3 | Normal file
@@ -0,0 +1,3 @@
| index.html: index.txt mkhtml | ||||
| 	mkhtml index.txt >_$@ && mv _$@ $@ | ||||
| 
 | ||||
							
								
								
									
web/index.html | 353 | Normal file
@@ -0,0 +1,353 @@
| <!-- AUTOMATICALLY GENERATED: EDIT the .txt version, not the .html version --> | ||||
| <html> | ||||
| <head> | ||||
| <title>Xv6, a simple Unix-like teaching operating system</title> | ||||
| <style type="text/css"><!-- | ||||
| body { | ||||
| 	background-color: white; | ||||
| 	color: black; | ||||
| 	font-size: medium; | ||||
| 	line-height: 1.2em; | ||||
| 	margin-left: 0.5in; | ||||
| 	margin-right: 0.5in; | ||||
| 	margin-top: 0; | ||||
| 	margin-bottom: 0; | ||||
| } | ||||
| 
 | ||||
| h1 { | ||||
| 	text-indent: 0in; | ||||
| 	text-align: left; | ||||
| 	margin-top: 2em; | ||||
| 	font-weight: bold; | ||||
| 	font-size: 1.4em; | ||||
| } | ||||
| 
 | ||||
| h2 { | ||||
| 	text-indent: 0in; | ||||
| 	text-align: left; | ||||
| 	margin-top: 2em; | ||||
| 	font-weight: bold; | ||||
| 	font-size: 1.2em; | ||||
| } | ||||
| --></style> | ||||
| </head> | ||||
| <body bgcolor=#ffffff> | ||||
| <h1>Xv6, a simple Unix-like teaching operating system</h1> | ||||
| <br><br> | ||||
| Xv6 is a teaching operating system developed | ||||
| in the summer of 2006 for MIT's operating systems course, | ||||
| “6.828: Operating Systems Engineering.” | ||||
| We used it for 6.828 in Fall 2006 and Fall 2007 | ||||
| and are using it this semester (Fall 2008). | ||||
| We hope that xv6 will be useful in other courses too. | ||||
| This page collects resources to aid the use of xv6 | ||||
| in other courses. | ||||
| 
 | ||||
| <h2>History and Background</h2> | ||||
| For many years, MIT had no operating systems course. | ||||
| In the fall of 2002, Frans Kaashoek, Josh Cates, and Emil Sit | ||||
| created a new, experimental course (6.097) | ||||
| to teach operating systems engineering. | ||||
| In the course lectures, the class worked through Sixth Edition Unix (aka V6) | ||||
| using John Lions's famous commentary. | ||||
| In the lab assignments, students wrote most of an exokernel operating | ||||
| system, eventually named Jos, for the Intel x86. | ||||
| Exposing students to multiple systems–V6 and Jos–helped | ||||
| develop a sense of the spectrum of operating system designs. | ||||
| In the fall of 2003, the experimental 6.097 became the | ||||
| official course 6.828; the course has been offered each fall since then. | ||||
| <br><br> | ||||
| V6 presented pedagogic challenges from the start. | ||||
| Students doubted the relevance of an obsolete 30-year-old operating system | ||||
| written in an obsolete programming language (pre-K&R C) | ||||
| running on obsolete hardware (the PDP-11). | ||||
| Students also struggled to learn the low-level details of two different | ||||
| architectures (the PDP-11 and the Intel x86) at the same time. | ||||
| By the summer of 2006, we had decided to replace V6 | ||||
| with a new operating system, xv6, modeled on V6 | ||||
| but written in ANSI C and running on multiprocessor | ||||
| Intel x86 machines. | ||||
| Xv6's use of the x86 makes it more relevant to | ||||
| students' experience than V6 was | ||||
| and unifies the course around a single architecture. | ||||
| Adding multiprocessor support also helps relevance | ||||
| and makes it easier to discuss threads and concurrency. | ||||
| (In a single processor operating system, concurrency–which only | ||||
| happens because of interrupts–is too easy to view as a special case. | ||||
| A multiprocessor operating system must attack the problem head on.) | ||||
| Finally, writing a new system allowed us to write cleaner versions | ||||
| of the rougher parts of V6, like the scheduler and file system. | ||||
| <br><br> | ||||
| 6.828 substituted xv6 for V6 in the fall of 2006. | ||||
| Based on that experience, we cleaned up rough patches | ||||
| of xv6 for the course in the fall of 2007. | ||||
| Since then, xv6 has stabilized, so we are making it | ||||
| available in the hopes that others will find it useful too. | ||||
| <br><br> | ||||
| 6.828 uses both xv6 and Jos. | ||||
| Courses taught at UCLA, NYU, and Stanford have used | ||||
| Jos without xv6; we believe other courses could use | ||||
| xv6 without Jos, though we are not aware of any that have. | ||||
| 
 | ||||
| <h2>Xv6 sources</h2> | ||||
| The latest xv6 is <a href="xv6-rev2.tar.gz">xv6-rev2.tar.gz</a>. | ||||
| We distribute the sources in electronic form but also as | ||||
| a printed booklet with line numbers that keep everyone | ||||
| together during lectures.  The booklet is available as | ||||
| <a href="xv6-rev2.pdf">xv6-rev2.pdf</a>. | ||||
| <br><br> | ||||
| Xv6 compiles using the GNU C compiler, | ||||
| targeted at the x86 using ELF binaries. | ||||
| On BSD and Linux systems, you can use the native compilers; | ||||
| on OS X, which doesn't use ELF binaries, | ||||
| you must use a cross-compiler. | ||||
| Xv6 does boot on real hardware, but typically | ||||
| we run it using the Bochs emulator. | ||||
| Both the GCC cross compiler and Bochs | ||||
| can be found on the <a href="../../2007/tools.html">6.828 tools page</a>. | ||||
| 
 | ||||
| <h2>Lectures</h2> | ||||
| In 6.828, the lectures in the first half of the course | ||||
| introduce the PC hardware, the Intel x86, and then xv6. | ||||
| The lectures in the second half consider advanced topics | ||||
| using research papers; for some, xv6 serves as a useful | ||||
| base for making discussions concrete. | ||||
| This section describes a typical 6.828 lecture schedule, | ||||
| linking to lecture notes and homework. | ||||
| A course using only xv6 (not Jos) will need to adapt | ||||
| a few of the lectures, but we hope these are a useful | ||||
| starting point. | ||||
| 
 | ||||
| <br><br><b><i>Lecture 1. Operating systems</i></b> | ||||
| <br><br> | ||||
| The first lecture introduces both the general topic of | ||||
| operating systems and the specific approach of 6.828. | ||||
| After defining “operating system,” the lecture | ||||
| examines the implementation of a Unix shell | ||||
| to look at the details of the traditional Unix system call interface. | ||||
| This is relevant to both xv6 and Jos: in the final | ||||
| Jos labs, students implement a Unix-like interface | ||||
| culminating in a Unix shell. | ||||
| <br><br> | ||||
| <a href="l1.html">lecture notes</a> | ||||
| 
 | ||||
| <br><br><b><i>Lecture 2.  PC hardware and x86 programming</i></b> | ||||
| <br><br> | ||||
| This lecture introduces the PC architecture, the 16- and 32-bit x86, | ||||
| the stack, and the GCC x86 calling conventions. | ||||
| It also introduces the pieces of a typical C tool chain–compiler, | ||||
| assembler, linker, loader–and the Bochs emulator. | ||||
| <br><br> | ||||
| Reading: PC Assembly Language | ||||
| <br><br> | ||||
| Homework: familiarize with Bochs | ||||
| <br><br> | ||||
| <a href="l2.html">lecture notes</a> | ||||
| <a href="x86-intro.html">homework</a> | ||||
| 
 | ||||
| <br><br><b><i>Lecture 3.  Operating system organization</i></b> | ||||
| <br><br> | ||||
| This lecture continues Lecture 1's discussion of what | ||||
| an operating system does. | ||||
| An operating system provides a “virtual computer” | ||||
| interface to user space programs. | ||||
| At a high level, the main job of the operating system | ||||
| is to implement that interface | ||||
| using the physical computer it runs on. | ||||
| <br><br> | ||||
| The lecture discusses four approaches to that job: | ||||
| monolithic operating systems, microkernels, | ||||
| virtual machines, and exokernels. | ||||
| Exokernels might not be worth mentioning | ||||
| except that the Jos labs are built around one. | ||||
| <br><br> | ||||
| Reading: Engler et al., Exokernel: An Operating System Architecture | ||||
| for Application-Level Resource Management | ||||
| <br><br> | ||||
| <a href="l3.html">lecture notes</a> | ||||
| 
 | ||||
| <br><br><b><i>Lecture 4.  Address spaces using segmentation</i></b> | ||||
| <br><br> | ||||
| This is the first lecture that uses xv6. | ||||
| It introduces the idea of address spaces and the | ||||
| details of the x86 segmentation hardware. | ||||
| It makes the discussion concrete by reading the xv6 | ||||
| source code and watching xv6 execute using the Bochs simulator. | ||||
| <br><br> | ||||
| Reading: x86 MMU handout, | ||||
| xv6: bootasm.S, bootother.S, <a href="src/bootmain.c.html">bootmain.c</a>, <a href="src/main.c.html">main.c</a>, <a href="src/init.c.html">init.c</a>, and setupsegs in <a href="src/proc.c.html">proc.c</a>. | ||||
| <br><br> | ||||
| Homework: Bochs stack introduction | ||||
| <br><br> | ||||
| <a href="l4.html">lecture notes</a> | ||||
| <a href="xv6-intro.html">homework</a> | ||||
| 
 | ||||
| <br><br><b><i>Lecture 5.  Address spaces using page tables</i></b> | ||||
| <br><br> | ||||
| This lecture continues the discussion of address spaces, | ||||
| examining the other x86 virtual memory mechanism: page tables. | ||||
| Xv6 does not use page tables, so there is no xv6 here. | ||||
| Instead, the lecture uses Jos as a concrete example. | ||||
| An xv6-only course might skip or shorten this discussion. | ||||
| <br><br> | ||||
| Reading: x86 manual excerpts | ||||
| <br><br> | ||||
| Homework: stuff about gdt | ||||
| XXX not appropriate; should be in Lecture 4 | ||||
| <br><br> | ||||
| <a href="l5.html">lecture notes</a> | ||||
| 
 | ||||
| <br><br><b><i>Lecture 6.  Interrupts and exceptions</i></b> | ||||
| <br><br> | ||||
| How does a user program invoke the operating system kernel? | ||||
| How does the kernel return to the user program? | ||||
| What happens when a hardware device needs attention? | ||||
| This lecture explains the answer to these questions: | ||||
| interrupt and exception handling. | ||||
| <br><br> | ||||
| It explains the x86 trap setup mechanisms and then | ||||
| examines their use in xv6's SETGATE (<a href="src/mmu.h.html">mmu.h</a>), | ||||
| tvinit (<a href="src/trap.c.html">trap.c</a>), idtinit (<a href="src/trap.c.html">trap.c</a>), <a href="src/vectors.pl.html">vectors.pl</a>, and vectors.S. | ||||
| <br><br> | ||||
| It then traces through a call to the system call open: | ||||
| <a href="src/init.c.html">init.c</a>, usys.S, vector48 and alltraps (vectors.S), trap (<a href="src/trap.c.html">trap.c</a>), | ||||
| syscall (<a href="src/syscall.c.html">syscall.c</a>), | ||||
| sys_open (<a href="src/sysfile.c.html">sysfile.c</a>), fetcharg, fetchint, argint, argptr, argstr (<a href="src/syscall.c.html">syscall.c</a>). | ||||
| <br><br> | ||||
| The interrupt controller, briefly: | ||||
| pic_init and pic_enable (<a href="src/picirq.c.html">picirq.c</a>). | ||||
| The timer and keyboard, briefly: | ||||
| timer_init (<a href="src/timer.c.html">timer.c</a>), console_init (<a href="src/console.c.html">console.c</a>). | ||||
| Enabling and disabling of interrupts. | ||||
| <br><br> | ||||
| Reading: x86 manual excerpts, | ||||
| xv6: trapasm.S, <a href="src/trap.c.html">trap.c</a>, <a href="src/syscall.c.html">syscall.c</a>, and usys.S. | ||||
| Skim <a href="src/lapic.c.html">lapic.c</a>, <a href="src/ioapic.c.html">ioapic.c</a>, <a href="src/picirq.c.html">picirq.c</a>. | ||||
| <br><br> | ||||
| Homework: Explain the 35 words on the top of the | ||||
| stack at first invocation of <code>syscall</code>. | ||||
| <br><br> | ||||
| <a href="l-interrupt.html">lecture notes</a> | ||||
| <a href="x86-intr.html">homework</a> | ||||
| 
 | ||||
| <br><br><b><i>Lecture 7.  Multiprocessors and locking</i></b> | ||||
| <br><br> | ||||
| This lecture introduces the problems of | ||||
| coordination and synchronization on a | ||||
| multiprocessor | ||||
| and then the solution of mutual exclusion locks. | ||||
| Atomic instructions, test-and-set locks, | ||||
| lock granularity, (the mistake of) recursive locks. | ||||
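The test-and-set idea can be sketched with GCC's atomic builtins; xv6's real version (spinlock.c) uses the x86 xchg instruction directly and also manages interrupts, which this sketch omits:

```c
// A minimal test-and-set spin lock using GCC atomic builtins.
typedef struct {
    volatile int locked;   // 0 = free, 1 = held
} spinlock_t;

void acquire_lock(spinlock_t *l) {
    // Atomically set locked to 1 and get its old value;
    // keep spinning until the old value was 0 (lock was free).
    while (__sync_lock_test_and_set(&l->locked, 1) != 0)
        ;  // spin
}

void release_lock(spinlock_t *l) {
    __sync_lock_release(&l->locked);   // atomically clear the flag
}
```

The atomicity of the test-and-set is what makes this correct on a multiprocessor: two CPUs cannot both observe the lock as free and claim it.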
| <br><br> | ||||
| Although xv6 user programs cannot share memory, | ||||
| the xv6 kernel itself is a program with multiple threads | ||||
| executing concurrently and sharing memory. | ||||
| Illustration: the xv6 scheduler's proc_table_lock (<a href="src/proc.c.html">proc.c</a>) | ||||
| and the spin lock implementation (<a href="src/spinlock.c.html">spinlock.c</a>). | ||||
| <br><br> | ||||
| Reading: xv6: <a href="src/spinlock.c.html">spinlock.c</a>.  Skim <a href="src/mp.c.html">mp.c</a>. | ||||
| <br><br> | ||||
| Homework: Interaction between locking and interrupts. | ||||
| Try not disabling interrupts in the disk driver and watch xv6 break. | ||||
| <br><br> | ||||
| <a href="l-lock.html">lecture notes</a> | ||||
| <a href="xv6-lock.html">homework</a> | ||||
| 
 | ||||
| <br><br><b><i>Lecture 8.  Threads, processes and context switching</i></b> | ||||
| <br><br> | ||||
| The last lecture introduced some of the issues | ||||
| in writing threaded programs, using xv6's processes | ||||
| as an example. | ||||
| This lecture introduces the issues in implementing | ||||
| threads, continuing to use xv6 as the example. | ||||
| <br><br> | ||||
| The lecture defines a thread of computation as a register | ||||
| set and a stack.  A process is an address space plus one | ||||
| or more threads of computation sharing that address space. | ||||
| Thus the xv6 kernel can be viewed as a single process | ||||
| with many threads (each user process) executing concurrently. | ||||
| <br><br> | ||||
| Illustrations: thread switching (swtch.S), scheduler (<a href="src/proc.c.html">proc.c</a>), sys_fork (<a href="src/sysproc.c.html">sysproc.c</a>) | ||||
| <br><br> | ||||
| Reading: <a href="src/proc.c.html">proc.c</a>, swtch.S, sys_fork (<a href="src/sysproc.c.html">sysproc.c</a>) | ||||
| <br><br> | ||||
| Homework: trace through stack switching. | ||||
| <br><br> | ||||
| <a href="l-threads.html">lecture notes (need to be updated to use swtch)</a> | ||||
| <a href="xv6-sched.html">homework</a> | ||||
| 
 | ||||
| <br><br><b><i>Lecture 9.  Processes and coordination</i></b> | ||||
| <br><br> | ||||
| This lecture introduces the idea of sequence coordination | ||||
| and then examines the particular solution illustrated by | ||||
| sleep and wakeup (<a href="src/proc.c.html">proc.c</a>). | ||||
| It introduces and refines a simple | ||||
| producer/consumer queue to illustrate the | ||||
| need for sleep and wakeup | ||||
| and then the sleep and wakeup | ||||
| implementations themselves. | ||||
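The refined queue can be sketched with POSIX condition variables standing in for xv6's sleep and wakeup; like sleep's second argument, pthread_cond_wait releases the lock while waiting and reacquires it before returning:

```c
#include <pthread.h>

#define QSIZE 16

// Bounded producer/consumer queue in the style discussed in lecture.
typedef struct {
    int buf[QSIZE];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t nonempty, nonfull;
} queue_t;

void q_init(queue_t *q) {
    q->head = q->tail = q->count = 0;
    pthread_mutex_init(&q->lock, 0);
    pthread_cond_init(&q->nonempty, 0);
    pthread_cond_init(&q->nonfull, 0);
}

void q_put(queue_t *q, int v) {          // producer
    pthread_mutex_lock(&q->lock);
    while (q->count == QSIZE)            // full: "sleep"
        pthread_cond_wait(&q->nonfull, &q->lock);
    q->buf[q->tail] = v;
    q->tail = (q->tail + 1) % QSIZE;
    q->count++;
    pthread_cond_signal(&q->nonempty);   // "wakeup" a consumer
    pthread_mutex_unlock(&q->lock);
}

int q_get(queue_t *q) {                  // consumer
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)                // empty: "sleep"
        pthread_cond_wait(&q->nonempty, &q->lock);
    int v = q->buf[q->head];
    q->head = (q->head + 1) % QSIZE;
    q->count--;
    pthread_cond_signal(&q->nonfull);    // "wakeup" a producer
    pthread_mutex_unlock(&q->lock);
    return v;
}
```

Note the `while` (not `if`) around each wait: a woken thread must re-check its condition, exactly the point the lecture makes about lost wakeups and spurious ones.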
| <br><br> | ||||
| Reading: <a href="src/proc.c.html">proc.c</a>, sys_exec, sys_sbrk, sys_wait, sys_kill (<a href="src/sysproc.c.html">sysproc.c</a>). | ||||
| <br><br> | ||||
| Homework: Explain how sleep and wakeup would break | ||||
| without proc_table_lock.  Explain how devices would break | ||||
| without second lock argument to sleep. | ||||
| <br><br> | ||||
| <a href="l-coordination.html">lecture notes</a> | ||||
| <a href="xv6-sleep.html">homework</a> | ||||
| 
 | ||||
| <br><br><b><i>Lecture 10.  Files and disk I/O</i></b> | ||||
| <br><br> | ||||
| This is the first of three file system lectures. | ||||
| This lecture introduces the basic file system interface | ||||
| and then considers the on-disk layout of individual files | ||||
| and the free block bitmap. | ||||
| <br><br> | ||||
| Reading: iread, iwrite, fileread, filewrite, wdir, mknod1, and | ||||
| 	code related to these calls in <a href="src/fs.c.html">fs.c</a>, <a href="src/bio.c.html">bio.c</a>, <a href="src/ide.c.html">ide.c</a>, and <a href="src/file.c.html">file.c</a>. | ||||
| <br><br> | ||||
| Homework: Add print to bwrite to trace every disk write. | ||||
| Explain the disk writes caused by some simple shell commands. | ||||
| <br><br> | ||||
| <a href="l-fs.html">lecture notes</a> | ||||
| <a href="xv6-disk.html">homework</a> | ||||
| 
 | ||||
| <br><br><b><i>Lecture 11.  Naming</i></b> | ||||
| <br><br> | ||||
| The last lecture discussed on-disk file system representation. | ||||
| This lecture covers the implementation of | ||||
| file system paths (namei in <a href="src/fs.c.html">fs.c</a>) | ||||
| and also discusses the security problems of a shared /tmp | ||||
| and symbolic links. | ||||
| <br><br> | ||||
| Understanding exec (<a href="src/exec.c.html">exec.c</a>) is left as an exercise. | ||||
| <br><br> | ||||
| Reading: namei in <a href="src/fs.c.html">fs.c</a>, <a href="src/sysfile.c.html">sysfile.c</a>, <a href="src/file.c.html">file.c</a>. | ||||
| <br><br> | ||||
| Homework: Explain how to implement symbolic links in xv6. | ||||
| <br><br> | ||||
| <a href="l-name.html">lecture notes</a> | ||||
| <a href="xv6-names.html">homework</a> | ||||
| 
 | ||||
| <br><br><b><i>Lecture 12.  High-performance file systems</i></b> | ||||
| <br><br> | ||||
| This lecture is the first of the research paper-based lectures. | ||||
| It discusses the “soft updates” paper, | ||||
| using xv6 as a concrete example. | ||||
| 
 | ||||
| <h2>Feedback</h2> | ||||
| If you are interested in using xv6 or have used xv6 in a course, | ||||
| we would love to hear from you. | ||||
| If there's anything that we can do to make xv6 easier | ||||
| to adopt, we'd like to hear about it. | ||||
| We'd also be interested to hear what worked well and what didn't. | ||||
| <br><br> | ||||
| Russ Cox (rsc@swtch.com)<br> | ||||
| Frans Kaashoek (kaashoek@mit.edu)<br> | ||||
| Robert Morris (rtm@mit.edu) | ||||
| <br><br> | ||||
| You can reach all of us at 6.828-staff@pdos.csail.mit.edu. | ||||
| <br><br> | ||||
| <br><br> | ||||
| </body> | ||||
| </html> | ||||
							
								
								
									
web/index.txt | 335 | Normal file
@@ -0,0 +1,335 @@
| ** Xv6, a simple Unix-like teaching operating system | ||||
| Xv6 is a teaching operating system developed | ||||
| in the summer of 2006 for MIT's operating systems course,  | ||||
| ``6.828: Operating Systems Engineering.'' | ||||
| We used it for 6.828 in Fall 2006 and Fall 2007 | ||||
| and are using it this semester (Fall 2008). | ||||
| We hope that xv6 will be useful in other courses too. | ||||
| This page collects resources to aid the use of xv6 | ||||
| in other courses. | ||||
| 
 | ||||
| * History and Background | ||||
| 
 | ||||
| For many years, MIT had no operating systems course. | ||||
| In the fall of 2002, Frans Kaashoek, Josh Cates, and Emil Sit | ||||
| created a new, experimental course (6.097) | ||||
| to teach operating systems engineering. | ||||
| In the course lectures, the class worked through Sixth Edition Unix (aka V6) | ||||
| using John Lions's famous commentary. | ||||
| In the lab assignments, students wrote most of an exokernel operating | ||||
| system, eventually named Jos, for the Intel x86. | ||||
| Exposing students to multiple systems--V6 and Jos--helped | ||||
| develop a sense of the spectrum of operating system designs. | ||||
| In the fall of 2003, the experimental 6.097 became the | ||||
| official course 6.828; the course has been offered each fall since then. | ||||
| 
 | ||||
| V6 presented pedagogic challenges from the start. | ||||
| Students doubted the relevance of an obsolete 30-year-old operating system | ||||
| written in an obsolete programming language (pre-K&R C) | ||||
| running on obsolete hardware (the PDP-11). | ||||
| Students also struggled to learn the low-level details of two different | ||||
| architectures (the PDP-11 and the Intel x86) at the same time. | ||||
| By the summer of 2006, we had decided to replace V6 | ||||
| with a new operating system, xv6, modeled on V6 | ||||
| but written in ANSI C and running on multiprocessor | ||||
| Intel x86 machines. | ||||
| Xv6's use of the x86 makes it more relevant to | ||||
| students' experience than V6 was | ||||
| and unifies the course around a single architecture. | ||||
| Adding multiprocessor support also helps relevance | ||||
| and makes it easier to discuss threads and concurrency. | ||||
| (In a single processor operating system, concurrency--which only | ||||
| happens because of interrupts--is too easy to view as a special case. | ||||
| A multiprocessor operating system must attack the problem head on.) | ||||
| Finally, writing a new system allowed us to write cleaner versions | ||||
| of the rougher parts of V6, like the scheduler and file system. | ||||
| 
 | ||||
| 6.828 substituted xv6 for V6 in the fall of 2006. | ||||
| Based on that experience, we cleaned up rough patches | ||||
| of xv6 for the course in the fall of 2007. | ||||
| Since then, xv6 has stabilized, so we are making it | ||||
| available in the hopes that others will find it useful too. | ||||
| 
 | ||||
| 6.828 uses both xv6 and Jos.   | ||||
| Courses taught at UCLA, NYU, and Stanford have used | ||||
| Jos without xv6; we believe other courses could use | ||||
| xv6 without Jos, though we are not aware of any that have. | ||||
| 
 | ||||
| 
 | ||||
| * Xv6 sources | ||||
| 
 | ||||
| The latest xv6 is [xv6-rev2.tar.gz]. | ||||
| We distribute the sources in electronic form but also as  | ||||
| a printed booklet with line numbers that keep everyone | ||||
| together during lectures.  The booklet is available as | ||||
| [xv6-rev2.pdf]. | ||||
| 
 | ||||
| Xv6 compiles using the GNU C compiler, | ||||
| targeted at the x86 using ELF binaries. | ||||
| On BSD and Linux systems, you can use the native compilers; | ||||
| on OS X, which doesn't use ELF binaries, | ||||
| you must use a cross-compiler. | ||||
| Xv6 does boot on real hardware, but typically | ||||
| we run it using the Bochs emulator. | ||||
| Both the GCC cross compiler and Bochs | ||||
| can be found on the [../../2007/tools.html | 6.828 tools page]. | ||||
| 
 | ||||
| 
 | ||||
| * Lectures | ||||
| 
 | ||||
| In 6.828, the lectures in the first half of the course | ||||
| introduce the PC hardware, the Intel x86, and then xv6. | ||||
| The lectures in the second half consider advanced topics | ||||
| using research papers; for some, xv6 serves as a useful | ||||
| base for making discussions concrete. | ||||
| This section describes a typical 6.828 lecture schedule, | ||||
| linking to lecture notes and homework. | ||||
| A course using only xv6 (not Jos) will need to adapt | ||||
| a few of the lectures, but we hope these are a useful | ||||
| starting point. | ||||
| 
 | ||||
| 
 | ||||
| Lecture 1. Operating systems | ||||
| 
 | ||||
| The first lecture introduces both the general topic of | ||||
| operating systems and the specific approach of 6.828. | ||||
| After defining ``operating system,'' the lecture | ||||
| examines the implementation of a Unix shell | ||||
| to look at the details of the traditional Unix system call interface. | ||||
| This is relevant to both xv6 and Jos: in the final | ||||
| Jos labs, students implement a Unix-like interface | ||||
| culminating in a Unix shell. | ||||
| 
 | ||||
| [l1.html | lecture notes] | ||||
| 
 | ||||
| 
 | ||||
| Lecture 2.  PC hardware and x86 programming | ||||
| 
 | ||||
| This lecture introduces the PC architecture, the 16- and 32-bit x86, | ||||
| the stack, and the GCC x86 calling conventions. | ||||
| It also introduces the pieces of a typical C tool chain--compiler, | ||||
| assembler, linker, loader--and the Bochs emulator. | ||||
| 
 | ||||
| Reading: PC Assembly Language | ||||
| 
 | ||||
| Homework: familiarize with Bochs | ||||
| 
 | ||||
| [l2.html | lecture notes] | ||||
| [x86-intro.html | homework] | ||||
| 
 | ||||
| 
 | ||||
| Lecture 3.  Operating system organization | ||||
| 
 | ||||
| This lecture continues Lecture 1's discussion of what | ||||
| an operating system does. | ||||
| An operating system provides a ``virtual computer'' | ||||
| interface to user space programs. | ||||
| At a high level, the main job of the operating system  | ||||
| is to implement that interface | ||||
| using the physical computer it runs on. | ||||
| 
 | ||||
| The lecture discusses four approaches to that job: | ||||
| monolithic operating systems, microkernels, | ||||
| virtual machines, and exokernels. | ||||
| Exokernels might not be worth mentioning | ||||
| except that the Jos labs are built around one. | ||||
| 
 | ||||
| Reading: Engler et al., Exokernel: An Operating System Architecture | ||||
| for Application-Level Resource Management | ||||
| 
 | ||||
| [l3.html | lecture notes] | ||||
| 
 | ||||
| 
 | ||||
| Lecture 4.  Address spaces using segmentation | ||||
| 
 | ||||
| This is the first lecture that uses xv6. | ||||
| It introduces the idea of address spaces and the | ||||
| details of the x86 segmentation hardware. | ||||
| It makes the discussion concrete by reading the xv6 | ||||
| source code and watching xv6 execute using the Bochs simulator. | ||||
| 
 | ||||
| Reading: x86 MMU handout, | ||||
| xv6: bootasm.S, bootother.S, bootmain.c, main.c, init.c, and setupsegs in proc.c. | ||||
| 
 | ||||
| Homework: Bochs stack introduction | ||||
| 
 | ||||
| [l4.html | lecture notes] | ||||
| [xv6-intro.html | homework] | ||||
|   | ||||
| 
 | ||||
| Lecture 5.  Address spaces using page tables | ||||
| 
 | ||||
| This lecture continues the discussion of address spaces, | ||||
| examining the other x86 virtual memory mechanism: page tables. | ||||
| Xv6 does not use page tables, so there is no xv6 here. | ||||
| Instead, the lecture uses Jos as a concrete example. | ||||
| An xv6-only course might skip or shorten this discussion. | ||||
| 
 | ||||
| Reading: x86 manual excerpts | ||||
| 
 | ||||
| Homework: stuff about gdt | ||||
| XXX not appropriate; should be in Lecture 4 | ||||
| 
 | ||||
| [l5.html | lecture notes] | ||||
| 
 | ||||
| 
 | ||||
| Lecture 6.  Interrupts and exceptions | ||||
| 
 | ||||
| How does a user program invoke the operating system kernel? | ||||
| How does the kernel return to the user program? | ||||
| What happens when a hardware device needs attention? | ||||
| This lecture explains the answer to these questions: | ||||
| interrupt and exception handling. | ||||
| 
 | ||||
| It explains the x86 trap setup mechanisms and then | ||||
| examines their use in xv6's SETGATE (mmu.h), | ||||
| tvinit (trap.c), idtinit (trap.c), vectors.pl, and vectors.S. | ||||
| 
 | ||||
| It then traces through a call to the system call open: | ||||
| init.c, usys.S, vector48 and alltraps (vectors.S), trap (trap.c), | ||||
| syscall (syscall.c), | ||||
| sys_open (sysfile.c), fetcharg, fetchint, argint, argptr, argstr (syscall.c). | ||||
| 
 | ||||
| The interrupt controller, briefly: | ||||
| pic_init and pic_enable (picirq.c). | ||||
| The timer and keyboard, briefly: | ||||
| timer_init (timer.c), console_init (console.c). | ||||
| Enabling and disabling of interrupts. | ||||
| 
 | ||||
| Reading: x86 manual excerpts, | ||||
| xv6: trapasm.S, trap.c, syscall.c, and usys.S. | ||||
| Skim lapic.c, ioapic.c, picirq.c. | ||||
| 
 | ||||
| Homework: Explain the 35 words on the top of the | ||||
| stack at first invocation of <code>syscall</code>. | ||||
| 
 | ||||
| [l-interrupt.html | lecture notes] | ||||
| [x86-intr.html | homework] | ||||
| 
 | ||||
| 
 | ||||
| Lecture 7.  Multiprocessors and locking | ||||
| 
 | ||||
| This lecture introduces the problems of  | ||||
| coordination and synchronization on a  | ||||
| multiprocessor | ||||
| and then the solution of mutual exclusion locks. | ||||
| Atomic instructions, test-and-set locks, | ||||
| lock granularity, (the mistake of) recursive locks. | ||||
| 
 | ||||
| Although xv6 user programs cannot share memory, | ||||
| the xv6 kernel itself is a program with multiple threads | ||||
| executing concurrently and sharing memory. | ||||
| Illustration: the xv6 scheduler's proc_table_lock (proc.c) | ||||
| and the spin lock implementation (spinlock.c). | ||||
| 
 | ||||
| Reading: xv6: spinlock.c.  Skim mp.c. | ||||
| 
 | ||||
| Homework: Interaction between locking and interrupts. | ||||
| Try not disabling interrupts in the disk driver and watch xv6 break. | ||||
| 
 | ||||
| [l-lock.html | lecture notes] | ||||
| [xv6-lock.html | homework] | ||||
| 
 | ||||
| 
 | ||||
| Lecture 8.  Threads, processes and context switching | ||||
| 
 | ||||
| The last lecture introduced some of the issues  | ||||
| in writing threaded programs, using xv6's processes | ||||
| as an example. | ||||
| This lecture introduces the issues in implementing | ||||
| threads, continuing to use xv6 as the example. | ||||
| 
 | ||||
| The lecture defines a thread of computation as a register | ||||
| set and a stack.  A process is an address space plus one  | ||||
| or more threads of computation sharing that address space. | ||||
| Thus the xv6 kernel can be viewed as a single process | ||||
| with many threads (each user process) executing concurrently. | ||||
| 
 | ||||
| Illustrations: thread switching (swtch.S), scheduler (proc.c), sys_fork (sysproc.c) | ||||
| 
 | ||||
| Reading: proc.c, swtch.S, sys_fork (sysproc.c) | ||||
| 
 | ||||
| Homework: trace through stack switching. | ||||
| 
 | ||||
| [l-threads.html | lecture notes (need to be updated to use swtch)] | ||||
| [xv6-sched.html | homework] | ||||
| 
 | ||||
| 
 | ||||
| Lecture 9.  Processes and coordination | ||||
| 
 | ||||
| This lecture introduces the idea of sequence coordination | ||||
| and then examines the particular solution illustrated by | ||||
| sleep and wakeup (proc.c). | ||||
| It introduces and refines a simple | ||||
| producer/consumer queue to illustrate the  | ||||
| need for sleep and wakeup | ||||
| and then the sleep and wakeup | ||||
| implementations themselves. | ||||
| 
 | ||||
| Reading: proc.c, sys_exec, sys_sbrk, sys_wait, sys_kill (sysproc.c). | ||||
| 
 | ||||
| Homework: Explain how sleep and wakeup would break | ||||
| without proc_table_lock.  Explain how devices would break | ||||
| without second lock argument to sleep. | ||||
| 
 | ||||
| [l-coordination.html | lecture notes] | ||||
| [xv6-sleep.html | homework] | ||||
| 
 | ||||
| 
 | ||||
| Lecture 10.  Files and disk I/O | ||||
| 
 | ||||
| This is the first of three file system lectures. | ||||
| This lecture introduces the basic file system interface | ||||
| and then considers the on-disk layout of individual files | ||||
| and the free block bitmap. | ||||
| 
 | ||||
| Reading: iread, iwrite, fileread, filewrite, wdir, mknod1, and  | ||||
| 	code related to these calls in fs.c, bio.c, ide.c, and file.c. | ||||
| 
 | ||||
| Homework: Add print to bwrite to trace every disk write. | ||||
| Explain the disk writes caused by some simple shell commands. | ||||
| 
 | ||||
| [l-fs.html | lecture notes] | ||||
| [xv6-disk.html | homework] | ||||
| 
 | ||||
| 
 | ||||
| Lecture 11.  Naming | ||||
| 
 | ||||
| The last lecture discussed on-disk file system representation. | ||||
| This lecture covers the implementation of | ||||
| file system paths (namei in fs.c) | ||||
| and also discusses the security problems of a shared /tmp | ||||
| and symbolic links. | ||||
| 
 | ||||
| Understanding exec (exec.c) is left as an exercise. | ||||
| 
 | ||||
| Reading: namei in fs.c, sysfile.c, file.c. | ||||
| 
 | ||||
| Homework: Explain how to implement symbolic links in xv6. | ||||
| 
 | ||||
| [l-name.html | lecture notes] | ||||
| [xv6-names.html | homework] | ||||
| 
 | ||||
| 
 | ||||
| Lecture 12.  High-performance file systems | ||||
| 
 | ||||
| This lecture is the first of the research paper-based lectures. | ||||
| It discusses the ``soft updates'' paper, | ||||
| using xv6 as a concrete example. | ||||
| 
 | ||||
| 
 | ||||
| * Feedback | ||||
| 
 | ||||
| If you are interested in using xv6 or have used xv6 in a course, | ||||
| we would love to hear from you. | ||||
| If there's anything that we can do to make xv6 easier | ||||
| to adopt, we'd like to hear about it. | ||||
| We'd also be interested to hear what worked well and what didn't. | ||||
| 
 | ||||
| Russ Cox (rsc@swtch.com)<br> | ||||
| Frans Kaashoek (kaashoek@mit.edu)<br> | ||||
| Robert Morris (rtm@mit.edu) | ||||
| 
 | ||||
| You can reach all of us at 6.828-staff@pdos.csail.mit.edu. | ||||
| 
 | ||||
| 
 | ||||
							
								
								
									
web/l-bugs.html (new file, 187 lines)
<html>
<head>
<title>OS Bugs</title>
</head>
| <body> | ||||
| 
 | ||||
| <h1>OS Bugs</h1> | ||||
| 
 | ||||
<p>Required reading: Bugs as Deviant Behavior
| 
 | ||||
| <h2>Overview</h2> | ||||
| 
 | ||||
<p>Operating systems must obey many rules for correctness and
performance.  Example rules:
<ul>
<li>Do not call blocking functions with interrupts disabled or a spin
lock held
<li>Check for NULL results
<li>Do not allocate large stack variables
<li>Do not re-use already-allocated memory
<li>Check user pointers before using them in kernel mode
<li>Release acquired locks
</ul>
| 
 | ||||
<p>In addition, there are standard software engineering rules, such as
using function results in consistent ways.
| 
 | ||||
| <p>These rules are typically not checked by a compiler, even though | ||||
| they could be checked by a compiler, in principle.  The goal of the | ||||
| meta-level compilation project is to allow system implementors to | ||||
| write system-specific compiler extensions that check the source code | ||||
| for rule violations. | ||||
| 
 | ||||
| <p>The results are good: many new bugs found (500-1000) in Linux | ||||
| alone.  The paper for today studies these bugs and attempts to draw | ||||
| lessons from these bugs. | ||||
| 
 | ||||
<p>Are kernel errors worse than user-level errors?  That is, if we get
the kernel correct, will we have no more system crashes?
| 
 | ||||
| <h2>Errors in JOS kernel</h2> | ||||
| 
 | ||||
| <p>What are unstated invariants in the JOS? | ||||
| <ul> | ||||
| <li>Interrupts are disabled in kernel mode | ||||
| <li>Only env 1 has access to disk | ||||
| <li>All registers are saved & restored on context switch | ||||
| <li>Application code is never executed with CPL 0 | ||||
| <li>Don't allocate an already-allocated physical page | ||||
| <li>Propagate error messages to user applications (e.g., out of | ||||
| resources) | ||||
| <li>Map pipe before fd | ||||
| <li>Unmap fd before pipe | ||||
| <li>A spawned program should have open only file descriptors 0, 1, and 2. | ||||
<li>Pass size sometimes in bytes and sometimes in block numbers to a
given file system function.
<li>User pointers should be run through TRUP before being used by the kernel
| </ul> | ||||
| 
 | ||||
| <p>Could these errors have been caught by metacompilation?  Would | ||||
| metacompilation have caught the pipe race condition? (Probably not, | ||||
| it happens in only one place.) | ||||
| 
 | ||||
| <p>How confident are you that your code is correct?  For example, | ||||
| are you sure interrupts are always disabled in kernel mode?  How would | ||||
| you test? | ||||
| 
 | ||||
| <h2>Metacompilation</h2> | ||||
| 
 | ||||
| <p>A system programmer writes the rule checkers in a high-level, | ||||
| state-machine language (metal).  These checkers are dynamically linked | ||||
| into an extensible version of g++, xg++.  Xg++ applies the rule | ||||
| checkers to every possible execution path of a function that is being | ||||
| compiled. | ||||
| 
 | ||||
| <p>An example rule from | ||||
| the <a | ||||
| href="http://www.stanford.edu/~engler/exe-ccs-06.pdf">OSDI | ||||
| paper</a>: | ||||
| <pre> | ||||
| sm check_interrupts { | ||||
|    decl { unsigned} flags; | ||||
|    pat enable = { sti(); } | {restore_flags(flags);} ; | ||||
|    pat disable = { cli(); }; | ||||
|     | ||||
|    is_enabled: disable ==> is_disabled | enable ==> { err("double | ||||
|       enable")}; | ||||
|    ... | ||||
| </pre> | ||||
| A more complete version found 82 errors in the Linux 2.3.99 kernel. | ||||
| 
 | ||||
| <p>Common mistake: | ||||
| <pre> | ||||
get_free_buffer ( ... ) {
   ....
   save_flags (flags);
   cli ();
   if ((bh = sh->buffer_pool) == NULL)
      return NULL;    /* BUG: returns with interrupts still disabled */
   ....
}
| </pre> | ||||
| <p>(Figure 2 also lists a simple metarule.) | ||||
| 
 | ||||
| <p>Some checkers produce false positives, because of limitations of | ||||
| both static analysis and the checkers, which mostly use local | ||||
| analysis. | ||||
| 
 | ||||
| <p>How does the <b>block</b> checker work?  The first pass is a rule | ||||
| that marks functions as potentially blocking.  After processing a | ||||
function, the checker emits the function's flow graph to a file
(including annotations and functions called). The second pass takes
| the merged flow graph of all function calls, and produces a file with | ||||
| all functions that have a path in the control-flow-graph to a blocking | ||||
| function call.  For the Linux kernel this results in 3,000 functions | ||||
| that potentially could call sleep.  Yet another checker like | ||||
| check_interrupts checks if a function calls any of the 3,000 functions | ||||
| with interrupts disabled. Etc. | ||||
| 
 | ||||
| <h2>This paper</h2> | ||||
| 
 | ||||
| <p>Writing rules is painful. First, you have to write them.  Second, | ||||
| how do you decide what to check?  Was it easy to enumerate all | ||||
| conventions for JOS? | ||||
| 
 | ||||
| <p>Insight: infer programmer "beliefs" from code and cross-check | ||||
| for contradictions.  If <i>cli</i> is always followed by <i>sti</i>, | ||||
| except in one case, perhaps something is wrong.  This simplifies | ||||
| life because we can write generic checkers instead of checkers | ||||
| that specifically check for <i>sti</i>, and perhaps we get lucky | ||||
| and find other temporal ordering conventions. | ||||
| 
 | ||||
<p>Do we know which case is wrong?  The 999 times or the 1 time that
<i>sti</i> is absent?  (No, this method cannot figure out what the correct
sequence is, but it can flag that something is weird, which is useful
in practice.)  The method just detects inconsistencies.
| 
 | ||||
<p>Is every inconsistency an error?  No, some inconsistencies don't
indicate an error.  If a call to function <i>f</i> is often followed
by a call to function <i>g</i>, does that imply that <i>f</i> should always be
followed by <i>g</i>?  (No!)
| 
 | ||||
| <p>Solution: MUST beliefs and MAYBE beliefs.  MUST beliefs are | ||||
| invariants that must hold; any inconsistency indicates an error.  If a | ||||
pointer is dereferenced, then the programmer MUST believe that the
| pointer is pointing to something that can be dereferenced (i.e., the | ||||
| pointer is definitely not zero).  MUST beliefs can be checked using | ||||
| "internal inconsistencies". | ||||
| 
 | ||||
<p>An aside: can null pointer dereferences be detected at runtime?
(Sure, unmap the page at address zero.)  Why is metacompilation still
| valuable?  (At runtime you will find only the null pointers that your | ||||
| test code dereferenced; not all possible dereferences of null | ||||
| pointers.)  An even more convincing example for Metacompilation is | ||||
| tracking user pointers that the kernel dereferences.  (Is this a MUST | ||||
| belief?) | ||||
| 
 | ||||
<p>MAYBE beliefs are invariants that are suggested by the code, but
they may be coincidences.  MAYBE beliefs are ranked by statistical
analysis, and perhaps augmented with input about function names
| (e.g., alloc and free are important).  Is it computationally feasible | ||||
| to check every MAYBE belief?  Could there be much noise? | ||||
| 
 | ||||
| <p>What errors won't this approach catch? | ||||
| 
 | ||||
| <h2>Paper discussion</h2> | ||||
| 
 | ||||
| <p>This paper is best discussed by studying every code fragment.  Most | ||||
| code fragments are pieces of code from Linux distributions; these | ||||
| mistakes are real! | ||||
| 
 | ||||
| <p>Section 3.1.  what is the error?  how does metacompilation catch | ||||
| it? | ||||
| 
 | ||||
| <p>Figure 1.  what is the error? is there one? | ||||
| 
 | ||||
| <p>Code fragments from 6.1.  what is the error?  how does metacompilation catch | ||||
| it? | ||||
| 
 | ||||
| <p>Figure 3.  what is the error?  how does metacompilation catch | ||||
| it? | ||||
| 
 | ||||
| <p>Section 8.3.  what is the error?  how does metacompilation catch | ||||
| it? | ||||
| 
 | ||||
| </body> | ||||
| 
 | ||||
							
								
								
									
web/l-coordination.html (new file, 354 lines)
<html>
<head>
<title>L9</title>
</head>
| <body> | ||||
| 
 | ||||
| <h1>Coordination and more processes</h1> | ||||
| 
 | ||||
| <p>Required reading: remainder of proc.c, sys_exec, sys_sbrk, | ||||
|   sys_wait, sys_exit, and sys_kill. | ||||
| 
 | ||||
| <h2>Overview</h2> | ||||
| 
 | ||||
| <p>Big picture: more programs than processors.  How to share the | ||||
|   limited number of processors among the programs?  Last lecture | ||||
|   covered basic mechanism: threads and the distinction between process | ||||
|   and thread.  Today expand: how to coordinate the interactions | ||||
|   between threads explicitly, and some operations on processes. | ||||
| 
 | ||||
<p>Sequence coordination.  This is a different type of coordination
  than mutual-exclusion coordination (whose goal is to make
  actions atomic so that threads don't interfere).  The goal of
  sequence coordination is for threads to coordinate the sequences in
  which they run.
| 
 | ||||
| <p>For example, a thread may want to wait until another thread | ||||
|   terminates. One way to do so is to have the thread run periodically, | ||||
|   let it check if the other thread terminated, and if not give up the | ||||
|   processor again.  This is wasteful, especially if there are many | ||||
|   threads.  | ||||
| 
 | ||||
| <p>With primitives for sequence coordination one can do better.  The | ||||
|   thread could tell the thread manager that it is waiting for an event | ||||
|   (e.g., another thread terminating).  When the other thread | ||||
|   terminates, it explicitly wakes up the waiting thread.  This is more | ||||
|   work for the programmer, but more efficient. | ||||
| 
 | ||||
| <p>Sequence coordination often interacts with mutual-exclusion | ||||
|   coordination, as we will see below. | ||||
| 
 | ||||
<p>The operating system literature has a rich set of primitives for
|   sequence coordination.  We study a very simple version of condition | ||||
|   variables in xv6: sleep and wakeup, with a single lock. | ||||
| 
 | ||||
| <h2>xv6 code examples</h2> | ||||
| 
 | ||||
| <h3>Sleep and wakeup - usage</h3> | ||||
| 
 | ||||
| Let's consider implementing a producer/consumer queue | ||||
| (like a pipe) that can be used to hold a single non-null char pointer: | ||||
| 
 | ||||
| <pre> | ||||
| struct pcq { | ||||
|     void *ptr; | ||||
| }; | ||||
| 
 | ||||
| void* | ||||
| pcqread(struct pcq *q) | ||||
| { | ||||
|     void *p; | ||||
| 
 | ||||
|     while((p = q->ptr) == 0) | ||||
|         ; | ||||
|     q->ptr = 0; | ||||
|     return p; | ||||
| } | ||||
| 
 | ||||
| void | ||||
| pcqwrite(struct pcq *q, void *p) | ||||
| { | ||||
|     while(q->ptr != 0) | ||||
|         ; | ||||
|     q->ptr = p; | ||||
| } | ||||
| </pre> | ||||
| 
 | ||||
| <p>Easy and correct, at least assuming there is at most one | ||||
| reader and at most one writer at a time. | ||||
| 
 | ||||
| <p>Unfortunately, the while loops are inefficient. | ||||
| Instead of polling, it would be great if there were | ||||
| primitives saying ``wait for some event to happen'' | ||||
| and ``this event happened''. | ||||
| That's what sleep and wakeup do. | ||||
| 
 | ||||
| <p>Second try: | ||||
| 
 | ||||
| <pre> | ||||
| void* | ||||
| pcqread(struct pcq *q) | ||||
| { | ||||
|     void *p; | ||||
| 
 | ||||
|     if(q->ptr == 0) | ||||
|         sleep(q); | ||||
|     p = q->ptr; | ||||
|     q->ptr = 0; | ||||
|     wakeup(q);  /* wake pcqwrite */ | ||||
|     return p; | ||||
| } | ||||
| 
 | ||||
| void | ||||
| pcqwrite(struct pcq *q, void *p) | ||||
| { | ||||
|     if(q->ptr != 0) | ||||
|         sleep(q); | ||||
|     q->ptr = p; | ||||
|     wakeup(q);  /* wake pcqread */ | ||||
| } | ||||
| </pre> | ||||
| 
 | ||||
| That's better, but there is still a problem. | ||||
| What if the wakeup happens between the check in the if | ||||
| and the call to sleep? | ||||
| 
 | ||||
| <p>Add locks: | ||||
| 
 | ||||
| <pre> | ||||
| struct pcq { | ||||
|     void *ptr; | ||||
|     struct spinlock lock; | ||||
| }; | ||||
| 
 | ||||
| void* | ||||
| pcqread(struct pcq *q) | ||||
| { | ||||
|     void *p; | ||||
| 
 | ||||
|     acquire(&q->lock); | ||||
|     if(q->ptr == 0) | ||||
|         sleep(q, &q->lock); | ||||
|     p = q->ptr; | ||||
|     q->ptr = 0; | ||||
|     wakeup(q);  /* wake pcqwrite */ | ||||
|     release(&q->lock); | ||||
|     return p; | ||||
| } | ||||
| 
 | ||||
| void | ||||
| pcqwrite(struct pcq *q, void *p) | ||||
| { | ||||
|     acquire(&q->lock); | ||||
|     if(q->ptr != 0) | ||||
|         sleep(q, &q->lock); | ||||
|     q->ptr = p; | ||||
|     wakeup(q);  /* wake pcqread */ | ||||
|     release(&q->lock); | ||||
| } | ||||
| </pre> | ||||
| 
 | ||||
| This is okay, and now safer for multiple readers and writers, | ||||
| except that wakeup wakes up everyone who is asleep on chan, | ||||
| not just one guy. | ||||
| So some of the guys who wake up from sleep might not | ||||
| be cleared to read or write from the queue.  Have to go back to looping: | ||||
| 
 | ||||
| <pre> | ||||
| struct pcq { | ||||
|     void *ptr; | ||||
|     struct spinlock lock; | ||||
| }; | ||||
| 
 | ||||
| void* | ||||
| pcqread(struct pcq *q) | ||||
| { | ||||
|     void *p; | ||||
| 
 | ||||
|     acquire(&q->lock); | ||||
|     while(q->ptr == 0) | ||||
|         sleep(q, &q->lock); | ||||
|     p = q->ptr; | ||||
|     q->ptr = 0; | ||||
|     wakeup(q);  /* wake pcqwrite */ | ||||
|     release(&q->lock); | ||||
|     return p; | ||||
| } | ||||
| 
 | ||||
| void | ||||
| pcqwrite(struct pcq *q, void *p) | ||||
| { | ||||
|     acquire(&q->lock); | ||||
|     while(q->ptr != 0) | ||||
|         sleep(q, &q->lock); | ||||
|     q->ptr = p; | ||||
|     wakeup(q);  /* wake pcqread */ | ||||
|     release(&q->lock); | ||||
| } | ||||
| </pre> | ||||
| 
 | ||||
The difference between this and our original version is that
the body of the while loop is a much more efficient way to pause.
| 
 | ||||
| <p>Now we've figured out how to use it, but we | ||||
| still need to figure out how to implement it. | ||||
| 
 | ||||
| <h3>Sleep and wakeup - implementation</h3> | ||||
| <p> | ||||
| Simple implementation: | ||||
| 
 | ||||
| <pre> | ||||
| void | ||||
| sleep(void *chan, struct spinlock *lk) | ||||
| { | ||||
|     struct proc *p = curproc[cpu()]; | ||||
|      | ||||
|     release(lk); | ||||
|     p->chan = chan; | ||||
|     p->state = SLEEPING; | ||||
|     sched(); | ||||
| } | ||||
| 
 | ||||
| void | ||||
| wakeup(void *chan) | ||||
| { | ||||
|     for(each proc p) { | ||||
|         if(p->state == SLEEPING && p->chan == chan) | ||||
|             p->state = RUNNABLE; | ||||
|     }	 | ||||
| } | ||||
| </pre> | ||||
| 
 | ||||
| <p>What's wrong?  What if the wakeup runs right after | ||||
| the release(lk) in sleep? | ||||
| It still misses the sleep. | ||||
| 
 | ||||
| <p>Move the lock down: | ||||
| <pre> | ||||
| void | ||||
| sleep(void *chan, struct spinlock *lk) | ||||
| { | ||||
|     struct proc *p = curproc[cpu()]; | ||||
|      | ||||
|     p->chan = chan; | ||||
|     p->state = SLEEPING; | ||||
|     release(lk); | ||||
|     sched(); | ||||
| } | ||||
| 
 | ||||
| void | ||||
| wakeup(void *chan) | ||||
| { | ||||
|     for(each proc p) { | ||||
|         if(p->state == SLEEPING && p->chan == chan) | ||||
|             p->state = RUNNABLE; | ||||
|     }	 | ||||
| } | ||||
| </pre> | ||||
| 
 | ||||
| <p>This almost works.  Recall from last lecture that we also need | ||||
| to acquire the proc_table_lock before calling sched, to | ||||
| protect p->jmpbuf. | ||||
| 
 | ||||
| <pre> | ||||
| void | ||||
| sleep(void *chan, struct spinlock *lk) | ||||
| { | ||||
|     struct proc *p = curproc[cpu()]; | ||||
|      | ||||
|     p->chan = chan; | ||||
|     p->state = SLEEPING; | ||||
|     acquire(&proc_table_lock); | ||||
|     release(lk); | ||||
|     sched(); | ||||
| } | ||||
| </pre> | ||||
| 
 | ||||
| <p>The problem is that now we're using lk to protect | ||||
| access to the p->chan and p->state variables | ||||
| but other routines besides sleep and wakeup  | ||||
| (in particular, proc_kill) will need to use them and won't | ||||
| know which lock protects them. | ||||
| So instead of protecting them with lk, let's use proc_table_lock: | ||||
| 
 | ||||
| <pre> | ||||
| void | ||||
| sleep(void *chan, struct spinlock *lk) | ||||
| { | ||||
|     struct proc *p = curproc[cpu()]; | ||||
|      | ||||
|     acquire(&proc_table_lock); | ||||
|     release(lk); | ||||
|     p->chan = chan; | ||||
|     p->state = SLEEPING; | ||||
|     sched(); | ||||
| } | ||||
| void | ||||
| wakeup(void *chan) | ||||
| { | ||||
|     acquire(&proc_table_lock); | ||||
|     for(each proc p) { | ||||
|         if(p->state == SLEEPING && p->chan == chan) | ||||
|             p->state = RUNNABLE; | ||||
|     } | ||||
|     release(&proc_table_lock); | ||||
| } | ||||
| </pre> | ||||
| 
 | ||||
| <p>One could probably make things work with lk as above, | ||||
| but the relationship between data and locks would be  | ||||
| more complicated with no real benefit.  Xv6 takes the easy way out  | ||||
| and says that elements in the proc structure are always protected | ||||
| by proc_table_lock. | ||||
| 
 | ||||
| <h3>Use example: exit and wait</h3> | ||||
| 
 | ||||
| <p>If proc_wait decides there are children to be waited for, | ||||
| it calls sleep at line 2462. | ||||
When a process exits, proc_exit scans the process table
to find the parent and wakes it at line 2408.
| 
 | ||||
| <p>Which lock protects sleep and wakeup from missing each other? | ||||
| Proc_table_lock.  Have to tweak sleep again to avoid double-acquire: | ||||
| 
 | ||||
| <pre> | ||||
| if(lk != &proc_table_lock) { | ||||
|     acquire(&proc_table_lock); | ||||
|     release(lk); | ||||
| } | ||||
| </pre> | ||||
| 
 | ||||
| <h3>New feature: kill</h3> | ||||
| 
 | ||||
| <p>Proc_kill marks a process as killed (line 2371). | ||||
| When the process finally exits the kernel to user space, | ||||
| or if a clock interrupt happens while it is in user space, | ||||
| it will be destroyed (line 2886, 2890, 2912). | ||||
| 
 | ||||
| <p>Why wait until the process ends up in user space? | ||||
| 
 | ||||
| <p>What if the process is stuck in sleep?  It might take a long | ||||
| time to get back to user space. | ||||
| Don't want to have to wait for it, so make sleep wake up early | ||||
| (line 2373). | ||||
| 
 | ||||
| <p>This means all callers of sleep should check | ||||
| whether they have been killed, but none do. | ||||
| Bug in xv6. | ||||
| 
 | ||||
| <h3>System call handlers</h3> | ||||
| 
 | ||||
| <p>Sheet 32 | ||||
| 
 | ||||
| <p>Fork: discussed copyproc in earlier lectures. | ||||
| Sys_fork (line 3218) just calls copyproc | ||||
| and marks the new proc runnable. | ||||
| Does fork create a new process or a new thread? | ||||
| Is there any shared context? | ||||
| 
 | ||||
| <p>Exec: we'll talk about exec later, when we talk about file systems. | ||||
| 
 | ||||
| <p>Sbrk: Saw growproc earlier.  Why setupsegs before returning? | ||||
							
								
								
									
web/l-fs.html (new file, 222 lines)
<html>
<head>
<title>L10</title>
</head>
| <body> | ||||
| 
 | ||||
| <h1>File systems</h1> | ||||
| 
 | ||||
| <p>Required reading: iread, iwrite, and wdir, and code related to | ||||
|   these calls in fs.c, bio.c, ide.c, file.c, and sysfile.c | ||||
| 
 | ||||
| <h2>Overview</h2> | ||||
| 
 | ||||
| <p>The next 3 lectures are about file systems: | ||||
| <ul> | ||||
| <li>Basic file system implementation | ||||
| <li>Naming | ||||
| <li>Performance | ||||
| </ul> | ||||
| 
 | ||||
<p>Users want to store their data durably, so that the data survives when
the user turns off the computer.  The primary media for doing so are:
| magnetic disks, flash memory, and tapes.  We focus on magnetic disks | ||||
| (e.g., through the IDE interface in xv6). | ||||
| 
 | ||||
| <p>To allow users to remember where they stored a file, they can | ||||
| assign a symbolic name to a file, which appears in a directory. | ||||
| 
 | ||||
| <p>The data in a file can be organized in a structured way or not. | ||||
| The structured variant is often called a database.  UNIX uses the | ||||
| unstructured variant: files are streams of bytes.  Any particular | ||||
| structure is likely to be useful to only a small class of | ||||
| applications, and other applications will have to work hard to fit | ||||
| their data into one of the pre-defined structures. Besides, if you | ||||
| want structure, you can easily write a user-mode library program that | ||||
| imposes that format on any file.  The end-to-end argument in action. | ||||
| (Databases have special requirements and support an important class of | ||||
| applications, and thus have a specialized plan.) | ||||
| 
 | ||||
| <p>The API for a minimal file system consists of: open, read, write, | ||||
| seek, close, and stat.  Dup duplicates a file descriptor. For example: | ||||
| <pre> | ||||
|   fd = open("x", O_RDWR); | ||||
|   read (fd, buf, 100); | ||||
|   write (fd, buf, 512); | ||||
|   close (fd) | ||||
| </pre> | ||||
| 
 | ||||
<p>Maintaining the file offset behind the read/write interface is an
  interesting design decision.  The alternative is that the state of a
  read operation should be maintained by the process doing the reading
  (i.e., that the pointer should be passed as an argument to read).
  This argument is compelling in view of the UNIX fork() semantics,
  which clone a process that shares the file descriptors of its
  parent.  A read by the parent on a shared file descriptor (e.g.,
  stdin) changes the read pointer seen by the child.  On the other
  hand, the alternative would make it difficult to get "(data; ls) > x"
  right.
| 
 | ||||
<p>The Unix API doesn't specify that the effects of a write are on
  the disk before the write returns. It is up to the implementation
  of the file system, within certain bounds. Choices include (and they
  aren't mutually exclusive):
| <ul> | ||||
| <li>At some point in the future, if the system stays up (e.g., after | ||||
|   30 seconds); | ||||
| <li>Before the write returns; | ||||
| <li>Before close returns; | ||||
| <li>User specified (e.g., before fsync returns). | ||||
| </ul> | ||||
| 
 | ||||
<p>A design issue is the semantics of a file system operation that
  requires multiple disk writes.  In particular, what happens if the
  logical update requires writing multiple disk blocks and the power
  fails during the update?  For example, creating a new file
  requires allocating an inode (which requires updating the list of
  free inodes on disk) and writing a directory entry to record the
  allocated i-node under the name of the new file (which may require
  allocating a new block and updating the directory inode).  If the
  power fails during the operation, the list of free inodes and blocks
  may be inconsistent with the blocks and inodes in use. Again, it is
  up to the implementation of the file system to keep on-disk data
  structures consistent:
| <ul> | ||||
| <li>Don't worry about it much, but use a recovery program to bring | ||||
|   file system back into a consistent state. | ||||
| <li>Journaling file system.  Never let the file system get into an | ||||
|   inconsistent state. | ||||
| </ul> | ||||
| 
 | ||||
<p>Another design issue is the semantics of concurrent writes to
the same data item.  What is the order of two updates that happen at
| the same time? For example, two processes open the same file and write | ||||
| to it.  Modern Unix operating systems allow the application to lock a | ||||
| file to get exclusive access.  If file locking is not used and if the | ||||
| file descriptor is shared, then the bytes of the two writes will get | ||||
| into the file in some order (this happens often for log files).  If | ||||
| the file descriptor is not shared, the end result is not defined. For | ||||
| example, one write may overwrite the other one (e.g., if they are | ||||
| writing to the same part of the file.) | ||||
| 
 | ||||
| <p>An implementation issue is performance, because writing to magnetic | ||||
| disk is relatively expensive compared to computing. Three primary ways | ||||
| to improve performance are: careful file system layout that induces | ||||
| few seeks, an in-memory cache of frequently-accessed blocks, and | ||||
| overlap I/O with computation so that file operations don't have to | ||||
wait until their completion, and so that the disk driver has more
| data to write, which allows disk scheduling. (We will talk about | ||||
| performance in detail later.) | ||||
| 
 | ||||
| <h2>xv6 code examples</h2> | ||||
| 
 | ||||
| <p>xv6 implements a minimal Unix file system interface. xv6 doesn't | ||||
| pay attention to file system layout. It overlaps computation and I/O, | ||||
| but doesn't do any disk scheduling.  Its cache is write-through, which | ||||
simplifies keeping on-disk data structures consistent, but is bad for
| performance. | ||||
| 
 | ||||
| <p>On disk files are represented by an inode (struct dinode in fs.h), | ||||
| and blocks.  Small files have up to 12 block addresses in their inode; | ||||
large files use the last address in the inode as a disk address
| for a block with 128 disk addresses (512/4).  The size of a file is | ||||
| thus limited to 12 * 512 + 128*512 bytes.  What would you change to | ||||
| support larger files? (Ans: e.g., double indirect blocks.) | ||||
| 
 | ||||
| <p>Directories are files with a bit of structure to them. The file | ||||
contains records of type struct dirent.  Each entry contains the
| name for a file (or directory) and its corresponding inode number. | ||||
| How many files can appear in a directory? | ||||
| 
 | ||||
<p>In-memory files are represented by struct inode in fsvar.h. What is
| the role of the additional fields in struct inode? | ||||
| 
 | ||||
| <p>What is xv6's disk layout?  How does xv6 keep track of free blocks | ||||
|   and inodes? See balloc()/bfree() and ialloc()/ifree().  Is this | ||||
|   layout a good one for performance?  What are other options? | ||||
| 
 | ||||
<p>Let's assume that an application created a file x that
  contains 512 bytes, and that the application now calls read(fd, buf,
  100), that is, it is requesting to read 100 bytes into buf.
  Furthermore, let's assume that the inode for x is i. Let's pick
  up what happens by investigating readi(), line 4483.
| <ul> | ||||
<li>4488-4492: can iread be called on objects other than files?  (Yes.
|   For example, read from the keyboard.)  Everything is a file in Unix. | ||||
| <li>4495: what does bmap do? | ||||
| <ul> | ||||
| <li>4384: what block is being read? | ||||
| </ul> | ||||
| <li>4483: what does bread do?  does bread always cause a read to disk? | ||||
| <ul> | ||||
| <li>4006: what does bget do?  it implements a simple cache of | ||||
|   recently-read disk blocks. | ||||
| <ul> | ||||
| <li>How big is the cache?  (see param.h) | ||||
| <li>3972: look if the requested block is in the cache by walking down | ||||
|   a circular list. | ||||
| <li>3977: we had a match. | ||||
| <li>3979: some other process has "locked" the block, wait until it | ||||
|   releases.  the other processes releases the block using brelse(). | ||||
| Why lock a block? | ||||
| <ul> | ||||
| <li>Atomic read and update.  For example, allocating an inode: read | ||||
|   block containing inode, mark it allocated, and write it back.  This | ||||
|   operation must be atomic. | ||||
| </ul> | ||||
| <li>3982: it is ours now. | ||||
| <li>3987: it is not in the cache; we need to find a cache entry to | ||||
|   hold the block. | ||||
| <li>3987: what is the cache replacement strategy? (see also brelse()) | ||||
| <li>3988: found an entry that we are going to use. | ||||
| <li>3989: mark it ours but don't mark it valid (there is no valid data | ||||
|   in the entry yet). | ||||
| </ul> | ||||
| <li>4007: if the block was in the cache and the entry has the block's | ||||
|   data, return. | ||||
| <li>4010: if the block wasn't in the cache, read it from disk. are | ||||
|   read's synchronous or asynchronous? | ||||
| <ul> | ||||
| <li>3836: a bounded buffer of outstanding disk requests. | ||||
| <li>3809: tell the disk to move arm and generate an interrupt. | ||||
<li>3851: go to sleep and let some other process run. time sharing
|   in action. | ||||
| <li>3792: interrupt: arm is in the right position; wakeup requester. | ||||
| <li>3856: read block from disk. | ||||
| <li>3860: remove request from bounded buffer.  wakeup processes that | ||||
|   are waiting for a slot. | ||||
| <li>3864: start next disk request, if any. xv6 can overlap I/O with | ||||
| computation. | ||||
| </ul> | ||||
| <li>4011: mark the cache entry as holding the data. | ||||
| </ul> | ||||
| <li>4498: To where is the block copied?  Is dst a valid user address? | ||||
| </ul> | ||||
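<p>The bget() walk above can be sketched in C. This is an illustrative
miniature, not xv6's code: the buffer count, flag names, and the missing
sleep/wakeup and replacement logic are all simplifications.

```c
/* Illustrative miniature of xv6's bget(): a fixed pool of buffers,
 * each tagged with (dev, blockno) and two flags.  Real xv6 sleeps
 * when a wanted buffer is locked and panics when no buffer is free;
 * this sketch just returns 0 in that case. */
#define NBUF    4
#define B_BUSY  0x1   /* some process holds ("locked") the buffer */
#define B_VALID 0x2   /* buffer holds the block's data */

struct buf {
  int flags;
  unsigned dev, blockno;
  unsigned char data[512];
};

static struct buf bcache[NBUF];

struct buf *
bget(unsigned dev, unsigned blockno)
{
  struct buf *b;
  int i;

  /* Is the block already cached? */
  for (i = 0; i < NBUF; i++) {
    b = &bcache[i];
    if ((b->flags & B_VALID) && b->dev == dev && b->blockno == blockno) {
      b->flags |= B_BUSY;          /* xv6 would sleep here if busy */
      return b;
    }
  }
  /* Not cached: claim a non-busy entry; mark it ours but NOT valid,
   * since it holds no data for this block yet. */
  for (i = 0; i < NBUF; i++) {
    b = &bcache[i];
    if (!(b->flags & B_BUSY)) {
      b->dev = dev;
      b->blockno = blockno;
      b->flags = B_BUSY;           /* busy, not valid */
      return b;
    }
  }
  return 0;                        /* xv6: panic("bget: no buffers") */
}

void
brelse(struct buf *b)
{
  b->flags &= ~B_BUSY;             /* xv6 also wakes up any waiters */
}
```

After brelse(), a second bget() for the same block is a cache hit and
returns the same buffer with B_VALID still set.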
| 
 | ||||
| <p>Now let's suppose that the process is writing 512 bytes at the end | ||||
|   of the file a. How many disk writes will happen? | ||||
| <ul> | ||||
| <li>4567: allocate a new block | ||||
| <ul> | ||||
| <li>4518: allocate a block: scan block map, and write entry | ||||
| <li>4523: How many disk operations if the process would have been appending | ||||
|   to a large file? (Answer: read indirect block, scan block map, write | ||||
|   block map.) | ||||
| </ul> | ||||
| <li>4572: read the block that the process will be writing, in case the | ||||
|   process writes only part of the block. | ||||
| <li>4574: write it.  Is it synchronous or asynchronous? (Ans: | ||||
|   synchronous but with timesharing.) | ||||
| </ul> | ||||
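<p>The disk-operation counts above fall out of bmap()'s arithmetic. Here
is a hypothetical sketch of that mapping, assuming xv6-style sizes
(512-byte blocks, 12 direct slots, 4-byte block numbers in the indirect
block); the function name is illustrative.

```c
/* Illustrative arithmetic behind bmap(): map a byte offset in a file
 * to either a direct slot in the inode or a slot in the indirect
 * block.  Sizes mirror xv6's layout but are assumptions here. */
#define BSIZE     512
#define NDIRECT   12
#define NINDIRECT (BSIZE / 4)    /* 4-byte block numbers */

/* Returns 0 and sets *slot for a direct block, 1 for a slot in the
 * indirect block; -1 if the offset is beyond the maximum file size. */
int
bmap_kind(unsigned off, unsigned *slot)
{
  unsigned bn = off / BSIZE;

  if (bn < NDIRECT) {
    *slot = bn;                  /* inode's direct array; no extra read */
    return 0;
  }
  bn -= NDIRECT;
  if (bn >= NINDIRECT)
    return -1;                   /* file too large */
  *slot = bn;                    /* must read the indirect block first */
  return 1;
}
```

Appending past the twelfth block is what forces the extra read of the
indirect block in the large-file case discussed above.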
| 
 | ||||
| <p>Lots of code to implement reading and writing of files. How about | ||||
|   directories? | ||||
| <ul> | ||||
| <li>4722: look for a free slot in the directory: read each directory | ||||
|   block and see whether a directory entry is unused (inum == 0). | ||||
| <li>4729: use it and update it. | ||||
| <li>4735: write the modified block. | ||||
| </ul> | ||||
| <p>Reading and writing of directories is trivial. | ||||
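<p>The scan at 4722 can be sketched in a few lines. Entry sizes and the
function name are assumptions in the spirit of xv6's struct dirent, not
its exact code.

```c
#include <string.h>

#define DIRSIZ 14
struct dirent {
  unsigned short inum;          /* 0 means "this slot is unused" */
  char name[DIRSIZ];
};

/* Find a free slot in a block of n entries and claim it for
 * (inum, name).  Returns the slot index, or -1 if the block is full;
 * the caller would then write the modified block back to disk.
 * Note: strncpy does not NUL-terminate a full 14-byte name, matching
 * the fixed-width on-disk format. */
int
dir_claim(struct dirent *de, int n, unsigned short inum, const char *name)
{
  int i;

  for (i = 0; i < n; i++) {
    if (de[i].inum == 0) {
      de[i].inum = inum;
      strncpy(de[i].name, name, DIRSIZ);
      return i;
    }
  }
  return -1;
}
```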
| 
 | ||||
| </body> | ||||
							
								
								
									
										174
								web/l-interrupt.html
										Normal file
							|  | @ -0,0 +1,174 @@ | |||
| <html> | ||||
| <head><title>Lecture 6: Interrupts & Exceptions</title></head> | ||||
| <body> | ||||
| 
 | ||||
| <h1>Interrupts & Exceptions</h1> | ||||
| 
 | ||||
| <p> | ||||
| Required reading: xv6 <code>trapasm.S</code>, <code>trap.c</code>, <code>syscall.c</code>, <code>usys.S</code>. | ||||
| <br> | ||||
| You will need to consult | ||||
| <a href="../readings/ia32/IA32-3.pdf">IA32 System | ||||
| Programming Guide</a> chapter 5 (skip 5.7.1, 5.8.2, 5.12.2). | ||||
| 
 | ||||
| <h2>Overview</h2> | ||||
| 
 | ||||
| <p> | ||||
| Big picture: kernel is trusted third-party that runs the machine. | ||||
| Only the kernel can execute privileged instructions (e.g., | ||||
| changing MMU state). | ||||
| The processor enforces this protection through the ring bits | ||||
| in the code segment. | ||||
| If a user application needs to carry out a privileged operation | ||||
| or other kernel-only service, | ||||
| it must ask the kernel nicely. | ||||
| How can a user program change to the kernel address space? | ||||
| How can the kernel transfer to a user address space? | ||||
| What happens when a device attached to the computer  | ||||
| needs attention? | ||||
| These are the topics for today's lecture. | ||||
| 
 | ||||
| <p> | ||||
| There are three kinds of events that must be handled | ||||
| by the kernel, not user programs: | ||||
| (1) a system call invoked by a user program, | ||||
| (2) an illegal instruction or other kind of bad processor state (memory fault, etc.), | ||||
| and  | ||||
| (3) an interrupt from a hardware device. | ||||
| 
 | ||||
| <p> | ||||
| Although these three events are different, they all use the same | ||||
| mechanism to transfer control to the kernel. | ||||
| This mechanism consists of three steps that execute as one atomic unit. | ||||
| (a) change the processor to kernel mode; | ||||
| (b) save the old processor state somewhere (usually the kernel stack); | ||||
| and (c) change the processor state to the values set up as | ||||
| the “official kernel entry values.” | ||||
| The exact implementation of this mechanism differs | ||||
| from processor to processor, but the idea is the same. | ||||
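<p>The three steps can be modeled as a toy simulation. This is purely
illustrative (no real hardware state; all names invented), but it makes
the atomic unit concrete.

```c
/* Toy model of the three-step trap entry: (a) switch to kernel mode,
 * (b) save the old processor state on the kernel stack, (c) load the
 * "official kernel entry values".  On real hardware these happen as
 * one atomic unit; here they are just sequential C. */
enum { USER = 3, KERNEL = 0 };

struct cpu {
  int mode;            /* current privilege level */
  unsigned eip, esp;   /* a tiny slice of processor state */
  unsigned kstack[16];
  int ktop;
};

void
trap_entry(struct cpu *c, unsigned kernel_eip, unsigned kernel_esp)
{
  c->mode = KERNEL;                 /* (a) enter kernel mode */
  c->kstack[c->ktop++] = c->eip;    /* (b) save old state on kernel stack */
  c->kstack[c->ktop++] = c->esp;
  c->eip = kernel_eip;              /* (c) load official entry values */
  c->esp = kernel_esp;
}
```

On the x86 the saving in step (b) is done by the int/interrupt
microcode, and the entry values in step (c) come from the IDT gate and
the TSS.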
| 
 | ||||
| <p> | ||||
| We'll work through examples of these today in lecture. | ||||
| You'll see all three in great detail in the labs as well. | ||||
| 
 | ||||
| <p> | ||||
| A note on terminology: sometimes we'll | ||||
| use interrupt (or trap) to mean both interrupts and exceptions. | ||||
| 
 | ||||
| <h2> | ||||
| Setting up traps on the x86 | ||||
| </h2> | ||||
| 
 | ||||
| <p> | ||||
| See handout Table 5-1, Figure 5-1, Figure 5-2. | ||||
| 
 | ||||
| <p> | ||||
| xv6 Sheet 07: <code>struct gatedesc</code> and <code>SETGATE</code>. | ||||
| 
 | ||||
| <p> | ||||
| xv6 Sheet 28: <code>tvinit</code> and <code>idtinit</code>. | ||||
| Note the setting of the gate for <code>T_SYSCALL</code>. | ||||
| 
 | ||||
| <p> | ||||
| xv6 Sheet 29: <code>vectors.pl</code> (also see generated <code>vectors.S</code>). | ||||
| 
 | ||||
| <h2> | ||||
| System calls | ||||
| </h2> | ||||
| 
 | ||||
| <p> | ||||
| xv6 Sheet 16: <code>init.c</code> calls <code>open("console")</code>. | ||||
| How is that implemented? | ||||
| 
 | ||||
| <p> | ||||
| xv6 <code>usys.S</code> (not in book). | ||||
| (No saving of registers.  Why?) | ||||
| 
 | ||||
| <p> | ||||
| Breakpoint <code>0x1b:"open"</code>, | ||||
| step past <code>int</code> instruction into kernel. | ||||
| 
 | ||||
| <p> | ||||
| See handout Figure 9-4 [sic]. | ||||
| 
 | ||||
| <p> | ||||
| xv6 Sheet 28: in <code>vectors.S</code> briefly, then in <code>alltraps</code>. | ||||
| Step through to <code>call trap</code>, examine registers and stack. | ||||
| How will the kernel find the argument to <code>open</code>? | ||||
| 
 | ||||
| <p> | ||||
| xv6 Sheet 29: <code>trap</code>, on to <code>syscall</code>. | ||||
| 
 | ||||
| <p> | ||||
| xv6 Sheet 31: <code>syscall</code> looks at <code>eax</code>, | ||||
| calls <code>sys_open</code>. | ||||
| 
 | ||||
| <p> | ||||
| (Briefly) | ||||
| xv6 Sheet 52: <code>sys_open</code> uses <code>argstr</code> and <code>argint</code> | ||||
| to get its arguments.  How do they work? | ||||
| 
 | ||||
| <p> | ||||
| xv6 Sheet 30: <code>fetchint</code>, <code>fetcharg</code>, <code>argint</code>, | ||||
| <code>argptr</code>, <code>argstr</code>. | ||||
| 
 | ||||
| <p> | ||||
| What happens if a user program divides by zero | ||||
| or accesses unmapped memory? | ||||
| Exception.  Same path as system call until <code>trap</code>. | ||||
| 
 | ||||
| <p> | ||||
| What happens if kernel divides by zero or accesses unmapped memory? | ||||
| 
 | ||||
| <h2> | ||||
| Interrupts | ||||
| </h2> | ||||
| 
 | ||||
| <p> | ||||
| Like system calls, except: | ||||
| devices generate them at any time, | ||||
| there are no arguments in CPU registers, | ||||
| nothing to return to, | ||||
| usually can't ignore them. | ||||
| 
 | ||||
| <p> | ||||
| How do they get generated? | ||||
| Device essentially phones up the  | ||||
| interrupt controller and asks to talk to the CPU. | ||||
| Interrupt controller then buzzes the CPU and | ||||
| tells it, “keyboard on line 1.” | ||||
| Interrupt controller is essentially the CPU's | ||||
| <strike>secretary</strike> administrative assistant, | ||||
| managing the phone lines on the CPU's behalf. | ||||
| 
 | ||||
| <p> | ||||
| Have to set up interrupt controller. | ||||
| 
 | ||||
| <p> | ||||
| (Briefly) xv6 Sheet 63: <code>pic_init</code> sets up the interrupt controller, | ||||
| <code>irq_enable</code> tells the interrupt controller to let the given | ||||
| interrupt through.  | ||||
| 
 | ||||
| <p> | ||||
| (Briefly) xv6 Sheet 68: <code>pit8253_init</code> sets up the clock chip, | ||||
| telling it to interrupt on <code>IRQ_TIMER</code> 100 times/second. | ||||
| <code>console_init</code> sets up the keyboard, enabling <code>IRQ_KBD</code>. | ||||
| 
 | ||||
| <p> | ||||
| In Bochs, set breakpoint at 0x8:"vector0" | ||||
| and continue, loading kernel. | ||||
| Step through clock interrupt, look at  | ||||
| stack, registers. | ||||
| 
 | ||||
| <p> | ||||
| Was the processor executing in kernel or user mode | ||||
| at the time of the clock interrupt? | ||||
| Why?  (Have any user-space instructions executed at all?) | ||||
| 
 | ||||
| <p> | ||||
| Can the kernel get an interrupt at any time? | ||||
| Why or why not?  <code>cli</code> and <code>sti</code>, | ||||
| <code>irq_enable</code>. | ||||
| 
 | ||||
| </body> | ||||
| </html> | ||||
							
								
								
									
										322
								web/l-lock.html
										Normal file
							|  | @ -0,0 +1,322 @@ | |||
| <html> | ||||
| <head> | ||||
| <title>L7</title> | ||||
| </head> | ||||
| <body> | ||||
| 
 | ||||
| <h1>Locking</h1> | ||||
| 
 | ||||
| <p>Required reading: spinlock.c | ||||
| 
 | ||||
| <h2>Why coordinate?</h2> | ||||
| 
 | ||||
| <p>Mutual-exclusion coordination is an important topic in operating | ||||
| systems, because many operating systems run on | ||||
| multiprocessors. Coordination techniques protect variables that are | ||||
| shared among multiple threads and updated concurrently.  These | ||||
| techniques allow programmers to implement atomic sections so that one | ||||
| thread can safely update the shared variables without having to worry | ||||
| about another thread intervening.  For example, processes in xv6 may | ||||
| run concurrently on different processors and, in kernel mode, share | ||||
| kernel data structures.  We must ensure that these updates happen | ||||
| correctly. | ||||
| 
 | ||||
| <p>List and insert example: | ||||
| <pre> | ||||
| 
 | ||||
| struct List { | ||||
|   int data; | ||||
|   struct List *next; | ||||
| }; | ||||
| 
 | ||||
| List *list = 0; | ||||
| 
 | ||||
| insert(int data) { | ||||
|   List *l = new List; | ||||
|   l->data = data; | ||||
|   l->next = list;  // A | ||||
|   list = l;        // B | ||||
| } | ||||
| </pre> | ||||
| 
 | ||||
| <p>What needs to be atomic? The two statements labeled A and B should | ||||
| always be executed together, as an indivisible fragment of code.  If | ||||
| two processors execute A and B interleaved, then we end up with an | ||||
| incorrect list.  To see that this is the case, draw out the list after | ||||
| the sequence A1 (statement A executed by processor 1), A2 (statement A | ||||
| executed by processor 2), B2, and B1. | ||||
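<p>Replaying that interleaving by hand shows the damage: the two nodes
both read the empty list, and whichever store to <i>list</i> lands last
wins, losing the other insert. A sketch (function names invented for
illustration):

```c
#include <stddef.h>

/* Replay the A1, A2, B2, B1 interleaving from the text.  Two
 * "processors" each run insert() on the same shared list; the four
 * statements are written out in the bad order. */
struct List {
  int data;
  struct List *next;
};

/* Returns the list head after the bad interleaving of two inserts
 * into an initially empty list. */
struct List *
simulate_race(struct List *n1, struct List *n2)
{
  struct List *list = NULL;

  n1->data = 1;
  n2->data = 2;
  n1->next = list;   /* A1: processor 1 reads list (NULL) */
  n2->next = list;   /* A2: processor 2 also reads NULL   */
  list = n2;         /* B2: processor 2 publishes node 2  */
  list = n1;         /* B1: processor 1 overwrites B2's update */
  return list;
}

int
list_length(struct List *l)
{
  int n = 0;
  for (; l != NULL; l = l->next)
    n++;
  return n;
}
```

The final list has length 1: node 2 is unreachable even though both
inserts "completed".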
| 
 | ||||
| <p>How could this erroneous sequence happen? The variable <i>list</i> | ||||
| lives in physical memory shared among multiple processors, connected | ||||
| by a bus.  The accesses to the shared memory will be ordered in some | ||||
| total order by the bus/memory system.  If the programmer doesn't | ||||
| coordinate the execution of the statements A and B, any order can | ||||
| happen, including the erroneous one. | ||||
| 
 | ||||
| <p>The erroneous case is called a race condition.  The problem with | ||||
| races is that they are difficult to reproduce.  For example, if you | ||||
| put print statements in to debug the incorrect behavior, you might | ||||
| change the timing and the race might not happen anymore. | ||||
| 
 | ||||
| <h2>Atomic instructions</h2> | ||||
| 
 | ||||
| <p>The programmer must be able to express that A and B should be executed | ||||
| as a single atomic unit.  We generally use a concept like locks | ||||
| to mark an atomic region, acquiring the lock at the beginning of the | ||||
| section and releasing it at the end:  | ||||
| 
 | ||||
| <pre>  | ||||
| void acquire(int *lock) { | ||||
|    while (TSL(lock) != 0) ;  | ||||
| } | ||||
| 
 | ||||
| void release (int *lock) { | ||||
|   *lock = 0; | ||||
| } | ||||
| </pre> | ||||
| 
 | ||||
| <p>Acquire and release, of course, need to be atomic too, which can, | ||||
| for example, be done with a hardware atomic test-and-set-lock (TSL) | ||||
| instruction: | ||||
| 
 | ||||
| <p>The semantics of TSL are: | ||||
| <pre> | ||||
|    R <- [mem]   // load content of mem into register R | ||||
|    [mem] <- 1   // store 1 in mem. | ||||
| </pre> | ||||
| 
 | ||||
| <p>In a hardware implementation, the bus arbiter guarantees that both | ||||
| the load and store are executed without any other load/stores coming | ||||
| in between. | ||||
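<p>On a modern compiler the same semantics are available portably: C11's
atomic_exchange is lowered to an instruction like the x86 xchgl. A
sketch of TSL and the acquire/release pair above in those terms (names
follow the text, not any real kernel):

```c
#include <stdatomic.h>

typedef atomic_int lock_t;

/* TSL: atomically { old = *l; *l = 1; return old; } */
int
TSL(lock_t *l)
{
  return atomic_exchange(l, 1);
}

void
acquire(lock_t *l)
{
  while (TSL(l) != 0)
    ;                  /* spin until the old value was 0 (lock was free) */
}

void
release(lock_t *l)
{
  atomic_store(l, 0);
}
```

atomic_exchange also gives the ordering guarantees discussed below, so
no separate barrier is needed here.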
| 
 | ||||
| <p>We can use locks to implement an atomic insert, or we can use | ||||
| TSL directly: | ||||
| <pre> | ||||
| int insert_lock = 0; | ||||
| 
 | ||||
| insert(int data) { | ||||
| 
 | ||||
|   /* acquire the lock: */ | ||||
|   while(TSL(&insert_lock) != 0) | ||||
|     ; | ||||
| 
 | ||||
|   /* critical section: */ | ||||
|   List *l = new List; | ||||
|   l->data = data; | ||||
|   l->next = list; | ||||
|   list = l; | ||||
| 
 | ||||
|   /* release the lock: */ | ||||
|   insert_lock = 0; | ||||
| } | ||||
| </pre> | ||||
| 
 | ||||
| <p>It is the programmer's job to make sure that locks are respected.  If | ||||
| a programmer writes another function that manipulates the list, the | ||||
| programmer must make sure that the new function acquires and | ||||
| releases the appropriate locks.  If the programmer doesn't, race | ||||
| conditions occur. | ||||
| 
 | ||||
| <p>This code assumes that stores commit to memory in program order and | ||||
| that all stores by other processors started before insert got the lock | ||||
| are observable by this processor.  That is, after the other processor | ||||
| released a lock, all the previous stores are committed to memory.  If | ||||
| a processor executes instructions out of order, this assumption won't | ||||
| hold and we must, for example, insert a barrier instruction that makes the | ||||
| assumption true. | ||||
| 
 | ||||
| 
 | ||||
| <h2>Example: Locking on x86</h2> | ||||
| 
 | ||||
| <p>Here is one way we can implement acquire and release using the x86 | ||||
| xchgl instruction: | ||||
| 
 | ||||
| <pre> | ||||
| struct Lock { | ||||
|   unsigned int locked; | ||||
| }; | ||||
| 
 | ||||
| acquire(Lock *lck) { | ||||
|   while(TSL(&(lck->locked)) != 0) | ||||
|     ; | ||||
| } | ||||
| 
 | ||||
| release(Lock *lck) { | ||||
|   lck->locked = 0; | ||||
| } | ||||
| 
 | ||||
| int | ||||
| TSL(int *addr) | ||||
| { | ||||
|   register int content = 1; | ||||
|   // xchgl content, *addr | ||||
|   // xchgl exchanges the values of its two operands, while | ||||
|   // locking the memory bus to exclude other operations. | ||||
|   asm volatile ("xchgl %0,%1" : | ||||
|                 "=r" (content), | ||||
|                 "=m" (*addr) : | ||||
|                 "0" (content), | ||||
|                 "m" (*addr)); | ||||
|   return(content); | ||||
| } | ||||
| </pre> | ||||
| 
 | ||||
| <p>The instruction "XCHG %eax, (content)" works as follows: | ||||
| <ol> | ||||
| <li> freeze other CPUs' memory activity | ||||
| <li> temp := content | ||||
| <li> content := %eax | ||||
| <li> %eax := temp | ||||
| <li> un-freeze other CPUs | ||||
| </ol> | ||||
| 
 | ||||
| <p>Steps 1 and 5 make XCHG special: it is a "locked" instruction, using | ||||
|   special signal lines on the inter-CPU bus for bus arbitration. | ||||
| 
 | ||||
| <p>This implementation doesn't scale to a large number of processors; | ||||
|   in a later lecture we will see how we could do better. | ||||
| 
 | ||||
| <h2>Lock granularity</h2> | ||||
| 
 | ||||
| <p>Release/acquire is ideal for short atomic sections: increment a | ||||
| counter, search in i-node cache, allocate a free buffer. | ||||
| 
 | ||||
| <p>What are spin locks not so great for?  Long atomic sections may | ||||
|   waste waiters' CPU time, and it is a bad idea to sleep while holding locks. In | ||||
|   xv6 we try to avoid long atomic sections by carefully coding (can | ||||
|   you find an example?).  xv6 doesn't release the processor when | ||||
|   holding a lock, but has an additional set of coordination primitives | ||||
|   (sleep and wakeup), which we will study later. | ||||
| 
 | ||||
| <p>My list_lock protects all lists; inserts to different lists are | ||||
|   blocked. A lock per list would waste less time spinning, so you might | ||||
|   want "fine-grained" locks, one for every object.  But acquire/release | ||||
|   are expensive (500 cycles on my 3 GHz machine) because they need to | ||||
|   talk off-chip. | ||||
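<p>The fine-grained alternative just moves the lock into each list, so
inserts to different lists never contend. A sketch with C11 atomics
(all names illustrative; the spin-lock shape follows acquire/release
above):

```c
#include <stdatomic.h>

struct List {
  int data;
  struct List *next;
};

/* Fine-grained locking: each list carries its own lock, so inserts to
 * different lists need not wait on one global list_lock. */
struct LockedList {
  atomic_int lock;       /* 0 = free, 1 = held */
  struct List *head;
};

void
llist_insert(struct LockedList *ll, struct List *node)
{
  while (atomic_exchange(&ll->lock, 1) != 0)
    ;                    /* acquire this list's lock only */
  node->next = ll->head;
  ll->head = node;
  atomic_store(&ll->lock, 0);   /* release */
}
```

The cost noted above remains: even an uncontended atomic_exchange must
own the cache line, so every insert still pays for the lock.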
| 
 | ||||
| <p>Also, "correctness" is not that simple with fine-grained locks if | ||||
|   need to maintain global invariants; e.g., "every buffer must be on | ||||
|   exactly one of free list and device list".  Per-list locks are | ||||
|   irrelevant for this invariant.  So you might want "large-grained", | ||||
|   which reduces overhead but reduces concurrency. | ||||
| 
 | ||||
| <p>This tension is hard to get right. One often starts out with | ||||
|   "large-grained locks" and measures the performance of the system on | ||||
|   some workloads. When more concurrency is desired (to get better | ||||
|   performance), an implementor may switch to a more fine-grained | ||||
|   scheme. Operating system designers fiddle with this all the time. | ||||
| 
 | ||||
| <h2>Recursive locks and modularity</h2> | ||||
| 
 | ||||
| <p>When designing a system we desire clean abstractions and good | ||||
|   modularity. We would like a caller not to have to know how a callee | ||||
|   implements a particular function.  Locks make achieving modularity | ||||
|   more complicated.  For example, what should we do when the caller holds a | ||||
|   lock and then calls a function that also needs the lock to perform | ||||
|   its job? | ||||
| 
 | ||||
| <p>There are no transparent solutions that allow the caller and callee | ||||
|   to be unaware of which locks they use. One transparent, but | ||||
|   unsatisfactory option is recursive locks: If a callee asks for a | ||||
|   lock that its caller has, then we allow the callee to proceed. | ||||
|   Unfortunately, this solution is not ideal either. | ||||
| 
 | ||||
| <p>Consider the following. If lock x protects the internals of some | ||||
|   struct foo, then if the caller acquires lock x, it knows that the | ||||
|   internals of foo are in a sane state and it can fiddle with them. | ||||
|   The caller must then restore them to a sane state before releasing | ||||
|   lock x, but until then anything goes. | ||||
| 
 | ||||
| <p>This assumption doesn't hold with recursive locking.  After | ||||
|   acquiring lock x, the acquirer knows that either it is the first to | ||||
|   get this lock, in which case the internals are in a sane state, or | ||||
|   maybe some caller holds the lock and has messed up the internals and | ||||
|   didn't realize when calling the callee that it was going to try to | ||||
|   look at them too.  So the fact that a function acquired the lock x | ||||
|   doesn't guarantee anything at all. In short, locks protect against | ||||
|   callers and callees just as much as they protect against other | ||||
|   threads. | ||||
| 
 | ||||
| <p>Since transparent solutions aren't ideal, it is better to consider | ||||
|   locks part of the function specification. The programmer must | ||||
|   arrange that a caller doesn't invoke another function while holding | ||||
|   a lock that the callee also needs. | ||||
| 
 | ||||
| <h2>Locking in xv6</h2> | ||||
| 
 | ||||
| <p>xv6 runs on a multiprocessor and is programmed to allow multiple | ||||
| threads of computation to run concurrently.  In xv6 an interrupt might | ||||
| run on one processor and a process in kernel mode may run on another | ||||
| processor, sharing a kernel data structure with the interrupt routine. | ||||
| xv6 uses locks, implemented using an atomic instruction, to coordinate | ||||
| concurrent activities. | ||||
| 
 | ||||
| <p>Let's check out why xv6 needs locks by following what happens when | ||||
| we start a second processor: | ||||
| <ul> | ||||
| <li>1516: mp_init (called from main0) | ||||
| <li>1606: mp_startthem (called from main0) | ||||
| <li>1302: mpmain | ||||
| <li>2208: scheduler. | ||||
|   <br>Now we have several processors invoking the scheduler | ||||
|   function. xv6 better ensure that multiple processors don't run the | ||||
|   same process!  Does it? | ||||
|   <br>Yes: if multiple schedulers run concurrently, only one will | ||||
|   acquire proc_table_lock and proceed to look for a runnable | ||||
|   process. If it finds a process, it will mark it running, longjmp to | ||||
|   it, and the process will release proc_table_lock.  The next instance | ||||
|   of scheduler will skip this entry, because it is marked running, and | ||||
|   look for another runnable process. | ||||
| </ul> | ||||
| 
 | ||||
| <p>Why hold proc_table_lock during a context switch?  It protects | ||||
| p->state; the process has to hold some lock to avoid a race with | ||||
| wakeup() and yield(), as we will see in the next lectures. | ||||
| 
 | ||||
| <p>Why not a lock per proc entry?  It might be expensive in whole | ||||
| table scans (in wait, wakeup, scheduler). proc_table_lock also | ||||
| protects some larger invariants, for example it might be hard to get | ||||
| proc_wait() right with just per entry locks.  Right now the check to | ||||
| see if there are any exited children and the sleep are atomic -- but | ||||
| that would be hard with per entry locks.  One could have both, but | ||||
| that would probably be neither clean nor fast. | ||||
| 
 | ||||
| <p>Of course, there is only one processor searching the proc table if | ||||
| acquire is implemented correctly. Let's check out acquire in | ||||
| spinlock.c: | ||||
| <ul> | ||||
| <li>1807: no recursive locks! | ||||
| <li>1811: why disable interrupts on the current processor? (if | ||||
| interrupt code itself tries to take a held lock, xv6 will deadlock; | ||||
| the panic will fire on 1808.) | ||||
| <ul> | ||||
| <li>can a process on a processor hold multiple locks? | ||||
| </ul> | ||||
| <li>1814: the (hopefully) atomic instruction. | ||||
| <ul> | ||||
| <li>see sheet 4, line 0468. | ||||
| </ul> | ||||
| <li>1819: make sure that stores issued on other processors before we | ||||
| got the lock are observed by this processor.  these may be stores to | ||||
| the shared data structure that is protected by the lock. | ||||
| </ul> | ||||
| 
 | ||||
| <p> | ||||
| 
 | ||||
| <h2>Locking in JOS</h2> | ||||
| 
 | ||||
| <p>JOS is meant to run on single-CPU machines, and the plan can be | ||||
| simple.  The simple plan is disabling/enabling interrupts in the | ||||
| kernel (the IF flag in the EFLAGS register).  Thus, in the kernel, | ||||
| threads release the processor only when they want to and can ensure | ||||
| that they don't release the processor during a critical section. | ||||
| 
 | ||||
| <p>In user mode, JOS runs with interrupts enabled, but Unix user | ||||
| applications don't share data structures.  The data structures that | ||||
| must be protected, however, are the ones shared in the library | ||||
| operating system (e.g., pipes). In JOS we will use special-case | ||||
| solutions, as you will find out in lab 6.  For example, to implement | ||||
| pipe we will assume there is one reader and one writer.  The reader | ||||
| and writer never update each other's variables; they only read each | ||||
| other's variables.  Carefully programming using this rule we can avoid | ||||
| races. | ||||
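<p>The one-reader/one-writer pipe discipline can be sketched as a ring
buffer in which the writer only advances its own counter and the reader
only advances its own; each merely reads the other's. This is a
single-CPU sketch with invented names; across CPUs, memory barriers
would additionally be needed so the counter updates are observed in
order.

```c
#define PIPESZ 8

struct pipe {
  unsigned char buf[PIPESZ];
  unsigned rpos;   /* advanced only by the reader */
  unsigned wpos;   /* advanced only by the writer */
};

/* Writer: returns 0 on success, -1 if full (real code would sleep). */
int
pipe_write(struct pipe *p, unsigned char c)
{
  if (p->wpos - p->rpos == PIPESZ)
    return -1;                       /* full */
  p->buf[p->wpos % PIPESZ] = c;
  p->wpos++;                         /* publish only after data is in place */
  return 0;
}

/* Reader: returns the byte, or -1 if empty. */
int
pipe_read(struct pipe *p)
{
  unsigned char c;

  if (p->rpos == p->wpos)
    return -1;                       /* empty */
  c = p->buf[p->rpos % PIPESZ];
  p->rpos++;
  return c;
}
```

Because neither side ever writes the other's counter, the carefully
programmed rule above holds and no lock is needed.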
							
								
								
									
										262
								web/l-mkernel.html
										Normal file
							|  | @ -0,0 +1,262 @@ | |||
| <html> | ||||
| <head> | ||||
| <title>Microkernel lecture</title> | ||||
| </head> | ||||
| <body> | ||||
| 
 | ||||
| <h1>Microkernels</h1> | ||||
| 
 | ||||
| <p>Required reading: Improving IPC by kernel design | ||||
| 
 | ||||
| <h2>Overview</h2> | ||||
| 
 | ||||
| <p>This lecture looks at the microkernel organization.  In a | ||||
| microkernel, services that a monolithic kernel implements in the | ||||
| kernel are running as user-level programs.  For example, the file | ||||
| system, UNIX process management, pager, and network protocols each run | ||||
| in a separate user-level address space.  The microkernel itself | ||||
| supports only the services that are necessary to allow system services | ||||
| to run well in user space; a typical microkernel has at least support | ||||
| for creating address spaces, threads, and interprocess communication. | ||||
| 
 | ||||
| <p>The potential advantages of a microkernel are simplicity of the | ||||
| kernel (small), isolation of operating system components (each runs in | ||||
| its own user-level address space), and flexibility (we can have a file | ||||
| server and a database server).  One potential disadvantage is | ||||
| performance loss: what requires a single system call in a monolithic | ||||
| kernel may require multiple system calls and context switches | ||||
| in a microkernel. | ||||
| 
 | ||||
| <p>One way in which microkernels differ from each other is the exact | ||||
| kernel API they implement.  For example, Mach (a system developed at | ||||
| CMU, which influenced a number of commercial operating systems) has | ||||
| the following system calls: processes (create, terminate, suspend, | ||||
| resume, priority, assign, info, threads), threads (fork, exit, join, | ||||
| detach, yield, self), ports and messages (a port is a unidirectional | ||||
| communication channel with a message queue and supporting primitives | ||||
| to send, destroy, etc), and regions/memory objects (allocate, | ||||
| deallocate, map, copy, inherit, read, write). | ||||
| 
 | ||||
| <p>Some microkernels are more "microkernel" than others.  For example, | ||||
| some microkernels implement the pager in user space but the basic | ||||
| virtual memory abstractions in the kernel (e.g., Mach); others are | ||||
| more extreme, and implement most of the virtual memory in user space | ||||
| (L4).  Yet others are less extreme: many servers run in their own | ||||
| address space, but in kernel mode (Chorus). | ||||
| 
 | ||||
| <p>All microkernels support multiple threads per address space. xv6 | ||||
| and Unix until recently didn't; why? Because, in Unix, system services | ||||
| are typically implemented in the kernel, and those are the primary | ||||
| programs that need multiple threads to handle events concurrently | ||||
| (waiting for disk and processing new I/O requests).  In microkernels, | ||||
| these services are implemented in user-level address spaces and so | ||||
| they need a mechanism to deal with handling operations concurrently. | ||||
| (Of course, one can argue that if fork is efficient enough, there is no need | ||||
| to have threads.) | ||||
| 
 | ||||
| <h2>L3/L4</h2> | ||||
| 
 | ||||
| <p>L3 is a predecessor to L4.  L3 provides data persistence, DOS | ||||
| emulation, and an ELAN runtime system.  L4 is a reimplementation of L3, | ||||
| but without the data persistence.  L4KA is a project at | ||||
| sourceforge.net, and you can download the code for the latest | ||||
| incarnation of L4 from there. | ||||
| 
 | ||||
| <p>L4 is a "second-generation" microkernel, with 7 calls: IPC (of | ||||
| which there are several types), id_nearest (find a thread with an ID | ||||
| close to the given ID), fpage_unmap (unmap pages, mapping is done as a | ||||
| side-effect of IPC), thread_switch (hand processor to specified | ||||
| thread), lthread_ex_regs (manipulate thread registers), | ||||
| thread_schedule (set scheduling policies), task_new (create a new | ||||
| address space with some default number of threads).  These calls | ||||
| provide address spaces, tasks, threads, interprocess communication, | ||||
| and unique identifiers.  An address space is a set of mappings. | ||||
| Multiple threads may share mappings, and a thread may grant mappings to | ||||
| another thread (through IPC).  A task is the set of threads sharing an | ||||
| address space. | ||||
| 
 | ||||
| <p>A thread is the execution abstraction; it belongs to an address | ||||
| space and has a UID, a register set, a page-fault handler, and an exception | ||||
| handler. The UID of a thread is its task number plus the number of the | ||||
| thread within that task. | ||||
| 
 | ||||
| <p>IPC passes data by value or by reference to another address space. | ||||
| It also provides for sequence coordination.  It is used for | ||||
| communication between clients and servers, to pass interrupts to a | ||||
| user-level exception handler, and to pass page faults to an external | ||||
| pager.  In L4, device drivers are implemented as user-level | ||||
| processes with the device mapped into their address space. | ||||
| Linux runs as a user-level process. | ||||
| 
 | ||||
| <p>L4 provides quite a range of message types: inline-by-value, | ||||
| strings, and virtual memory mappings.  The send and receive descriptors | ||||
| specify how many of each, if any. | ||||
| 
 | ||||
| <p>In addition, there is a system call for timeouts and controlling | ||||
| thread scheduling. | ||||
| 
 | ||||
| <h2>L3/L4 paper discussion</h2> | ||||
| 
 | ||||
| <ul> | ||||
| 
 | ||||
| <li>This paper is about performance.  What is a microsecond?  Is 100 | ||||
| usec bad?  Is 5 usec so much better we care? How many instructions | ||||
| does a 50-MHz x86 execute in 100 usec?  What can we compute with that | ||||
| number of instructions?  How many disk operations in that time?  How | ||||
| many interrupts can we take? (The livelock paper, which we cover in a | ||||
| few lectures, mentions 5,000 network pkts per second, and each packet | ||||
| generates two interrupts.) | ||||
| 
 | ||||
| <li>In performance calculations, what is the appropriate/better metric? | ||||
| Microseconds or cycles? | ||||
| 
 | ||||
| <li>Goal: improve IPC performance by a factor 10 by careful kernel | ||||
| design that is fully aware of the hardware it is running on. | ||||
| Principle: performance rules!  Optimize for the common case.  Because | ||||
| in L3 interrupts are propagated to user-level using IPC, the system | ||||
| may have to be able to support many IPCs per second (as many as the | ||||
| device can generate interrupts). | ||||
| 
 | ||||
| <li>IPC consists of transferring control and transferring data.  The | ||||
| minimal cost for transferring control is 127 cycles, plus 45 cycles for | ||||
| TLB misses (see table 3).  What are the x86 instructions to enter and | ||||
| leave the kernel? (int, iret) Why do they consume so much time? | ||||
| (Flush pipeline) Do modern processors perform these operations more | ||||
| efficiently?  No, worse: faster processors are optimized for straight-line | ||||
| code; traps/exceptions flush a deeper pipeline, and cache misses cost more | ||||
| cycles. | ||||
| 
 | ||||
| <li>What are the 5 TLB misses? 1) B's thread control block; loading %cr3 | ||||
| flushes the TLB, so 2) kernel text causes a miss; iret accesses both 3) the stack and | ||||
| 4+5) user text: two pages of B's user code that look at the message. | ||||
| 
 | ||||
| <li>Interface: | ||||
| <ul> | ||||
| <li>call (threadID, send-message, receive-message, timeout); | ||||
| <li>reply_and_receive (reply-message, receive-message, timeout); | ||||
| </ul> | ||||
| 
 | ||||
| <li>Optimizations: | ||||
| <ul> | ||||
| 
 | ||||
| <li>New system call: reply_and_receive.  Effect: 2 system calls per | ||||
| RPC. | ||||
| 
 | ||||
| <li>Complex messages: direct string, indirect strings, and memory | ||||
| objects. | ||||
| 
 | ||||
<li>Direct transfer by temporary mapping through a communication
window.  The communication window is mapped in B's address space and in
A's kernel address space; why is this better than just mapping a page
shared between A's and B's address spaces?  1) Multi-level security: a
shared page makes it hard to reason about information flow; 2) the
receiver can't check message legality (the message might change after
the check); 3) when a server has many clients, it could run out of
virtual address space, and the shared memory region would have to be
established ahead of time; 4) it is not application friendly, since the
data may already be at another address, i.e.,
applications would have to copy anyway--possibly more copies.
| 
 | ||||
<li>Why not use the following approach: map the region copy-on-write
(or read-only) in A's address space after the send, and read-only in B's
address space?  Then B may have to copy the data, or cannot receive the
data at its final destination.
| 
 | ||||
<li>On the x86 this is implemented by copying B's PDE into A's address
space.  Why two PDEs?  (The maximum message size is 4 Mbytes, so the
transfer is guaranteed to work if the message starts in the bottom
4 Mbytes of an 8-Mbyte mapped region.)  Why not just copy PTEs?  That
would be much more expensive.
| 
 | ||||
<li> What does it mean for the TLB to be "window clean"?  Why do we
care?  It means the TLB contains no mappings within the communication
window.  We care because mapping is cheap (copy a PDE), but invalidation
is not: the x86 only lets you invalidate one page at a time, or the
whole TLB.  Does TLB invalidation of the communication window turn out
to be a problem?  Not usually, because %cr3 has to be loaded during IPC
anyway, which flushes the TLB.
| 
 | ||||
<li>Thread control block: registers, links for various doubly-linked
  lists, pgdir, uid, etc.  The lower part of a thread UID contains the
  TCB number.  Can also deduce the TCB address from the stack by ANDing
  the SP with a bitmask (the SP comes out of the TSS when just
  switching to the kernel).
| 
 | ||||
<li> The kernel stack is on the same page as the TCB.  Why?  1) Minimizes
TLB misses (since accessing the kernel stack will bring in the TCB);
2) allows very efficient access to the TCB: just mask off the lower 12
bits of %esp; 3) with VM, can use the lower 32 bits of the thread id to
indicate which TCB; using one page per TCB means there is no need to
check whether a thread is swapped out (simply don't map that TCB if it
shouldn't be accessed).
| 
 | ||||
| <li>Invariant on queues: queues always hold in-memory TCBs. | ||||
| 
 | ||||
<li>Wakeup queue: a set of 8 unordered wakeup lists (wakeup time mod 8),
and a smart representation of time so that 32-bit integers can be used
in the common case (base + offset in msec; bump the base and recompute
all offsets every ~4 hours; the maximum timeout is ~24 days, 2^31 msec).
| 
 | ||||
| <li>What is the problem addressed by lazy scheduling? | ||||
| Conventional approach to scheduling: | ||||
| <pre> | ||||
|     A sends message to B: | ||||
|       Move A from ready queue to waiting queue | ||||
|       Move B from waiting queue to ready queue | ||||
|     This requires 58 cycles, including 4 TLB misses.  What are TLB misses? | ||||
|       One each for head of ready and waiting queues | ||||
|       One each for previous queue element during the remove | ||||
| </pre> | ||||
| <li> Lazy scheduling: | ||||
| <pre> | ||||
|     Ready queue must contain all ready threads except current one | ||||
|       Might contain other threads that aren't actually ready, though | ||||
|     Each wakeup queue contains all threads waiting in that queue | ||||
|       Again, might contain other threads, too | ||||
|       Scheduler removes inappropriate queue entries when scanning | ||||
|       queue | ||||
| </pre> | ||||
|     | ||||
<li>Why does this help performance?  There are only three situations in
which a thread gives up the CPU but stays ready: the send syscall (as
opposed to call), preemption, and hardware interrupts.  So very often
the kernel can IPC into a thread without ever putting it on the ready
list.
| 
 | ||||
| <li>Direct process switch.  This section just says you should use | ||||
| kernel threads instead of continuations. | ||||
| 
 | ||||
| <li>Short messages via registers. | ||||
| 
 | ||||
<li>Avoiding unnecessary copies.  Basically, a thread can send and
  receive messages with the same message vector.  This makes forwarding
  efficient, which is important for the Clans/Chiefs model.
| 
 | ||||
<li>Segment register optimization.  Loading segment registers is
  slow (have to access the GDT, etc.), but the common case is that user
  code doesn't change its segment registers.  Observation: it is faster
  to check a segment descriptor than to load it.  So just check that
  the segment registers are okay, and only load them if user code
  changed them.
| 
 | ||||
<li>Registers for parameter passing wherever possible: system calls
and IPC.
| 
 | ||||
<li>Minimizing TLB misses. Try to cram as many things as possible onto
the same page: the IPC kernel code, GDT, IDT, and TSS all on one page.
The whole tables may not actually fit, but the important parts of the
tables (perhaps only the beginning of the TSS, IDT, or GDT?) can go on
the same page.
| 
 | ||||
| <li>Coding tricks: short offsets, avoid jumps, avoid checks, pack | ||||
|   often-used data on same cache lines, lazily save/restore CPU state | ||||
|   like debug and FPU registers.  Much of the kernel is written in | ||||
|   assembly! | ||||
| 
 | ||||
<li>What are the results?  Figures 7 and 8 look good.
| 
 | ||||
<li>Is fast IPC enough to get good overall system performance?  This
paper doesn't make a statement either way; we have to read their 1997
paper to find the answer to that question.
| 
 | ||||
<li>Is the principle of optimizing for performance right?  In general,
it is wrong to optimize for performance; other things matter more.  Is
IPC the one exception?  Maybe; maybe not.  Was Liedtke fighting a
losing battle against CPU makers?  Should fast IPC be a hardware issue,
or just an OS issue?
| 
 | ||||
| </ul> | ||||
| 
 | ||||
| </body> | ||||
							
								
								
									
web/l-name.html (new file, 181 lines)
| <title>L11</title> | ||||
| <html> | ||||
| <head> | ||||
| </head> | ||||
| <body> | ||||
| 
 | ||||
| <h1>Naming in file systems</h1> | ||||
| 
 | ||||
<p>Required reading: namei(), and all other file system code.
| 
 | ||||
| <h2>Overview</h2> | ||||
| 
 | ||||
<p>To help users remember where they stored their data, most
systems allow users to assign their own names to their data.
Typically the data is organized in files, and users assign names to
files.  To deal with many files, users can organize their files
hierarchically in directories.  Each name is then a pathname, with
the components separated by "/".
| 
 | ||||
<p>To avoid having to type long absolute names (i.e., names
starting with "/" in Unix), users can change their working directory
and use relative names (i.e., names that don't start with "/").
| 
 | ||||
<p>User file namespace operations include create, mkdir, mv, ln
(link), unlink, and chdir.  (How is "mv a b" implemented in xv6?
Answer: "link a b"; "unlink a".)  To be able to name the current
directory and the parent directory, every directory includes two
entries, "." and "..".  Files and directories can be reclaimed when
users cannot name them anymore (i.e., after the last unlink).
| 
 | ||||
<p>Recall from last lecture that all directory entries contain a name
followed by an inode number.  The inode number names an inode of the
file system.  How can we merge file systems from different disks into
a single name space?
| 
 | ||||
| <p>A user grafts new file systems on a name space using mount.  Umount | ||||
| removes a file system from the name space.  (In DOS, a file system is | ||||
| named by its device letter.)  Mount takes the root inode of the | ||||
| to-be-mounted file system and grafts it on the inode of the name space | ||||
| entry where the file system is mounted (e.g., /mnt/disk1). The | ||||
| in-memory inode of /mnt/disk1 records the major and minor number of | ||||
| the file system mounted on it.  When namei sees an inode on which a | ||||
| file system is mounted, it looks up the root inode of the mounted file | ||||
| system, and proceeds with that inode. | ||||
| 
 | ||||
<p>Mount is not a durable operation; it doesn't survive power failures.
After a power failure, the system administrator must remount the file
system (often in a startup script run from init).
| 
 | ||||
<p>Links are convenient because with them users can create synonyms
  for file names.  But they create the potential for cycles in
  the naming tree.  For example, consider link("a/b/c", "a").  This
  makes c a synonym for a.  Such a cycle can complicate matters; for
  example:
| <ul> | ||||
<li>If a user subsequently calls unlink("a"), then the user cannot
  name the directory "b" and the link "c" anymore; but how can the
  file system decide that?
| </ul> | ||||
| 
 | ||||
<p>This problem can be solved by detecting cycles.  The second problem
  can be solved by computing which files are reachable from "/" and
  reclaiming all the ones that aren't reachable.  Unix takes a simpler
  approach: avoid cycles by disallowing users to create links to
  directories.  If there are no cycles, then reference counts can be
  used to see whether a file is still referenced.  The inode maintains
  a field counting references (nlink in xv6's dinode): link increases
  the count and unlink decreases it; when the count reaches zero, the
  inode and disk blocks can be reclaimed.
| 
 | ||||
<p>How to handle links across file systems (i.e., from one
  mounted file system to another)?  Since inode numbers are not unique
  across file systems, we cannot create a hard link across file
  systems: a directory entry contains only an inode number, not the
  inode number plus the name of the disk on which the inode is located.
  To handle this case, Unix provides a second type of link, called a
  soft (symbolic) link.
| 
 | ||||
<p>Soft links are a special file type (e.g., T_SYMLINK).  If namei
  encounters an inode of type T_SYMLINK, it resolves the name in
  the symlink file to an inode and continues from there.  With
  symlinks one can create cycles, and symlinks can point to
  non-existing files.
| 
 | ||||
<p>The design of the name system can have security implications.  For
  example, if you test whether a name exists and then use the name, an
  adversary can change the binding from name to object between the test
  and the use.  Such problems are called time-of-check-to-time-of-use
  (TOCTTOU) races.
| 
 | ||||
<p>An example of TOCTTOU follows.  Say root runs a script
  every night to remove files in /tmp.  This gets rid of files that
  editors might have left behind but that will never be used again.  An
  adversary can exploit this script as follows:
| <pre> | ||||
|     Root                         Attacker | ||||
|                                  mkdir ("/tmp/etc") | ||||
                                 creat ("/tmp/etc/passwd")
    readdir ("/tmp");
    lstat ("/tmp/etc");
    readdir ("/tmp/etc");
                                 rename ("/tmp/etc", "/tmp/x");
                                 symlink ("/etc", "/tmp/etc");
    unlink ("/tmp/etc/passwd");
| </pre> | ||||
Lstat checks that /tmp/etc is not a symbolic link, but by the time root
runs unlink the attacker has had time to put a symbolic link in the
place of /tmp/etc, so the unlink removes a password file of the
adversary's choosing.
| 
 | ||||
| <p>This problem could have been avoided if every user or process group | ||||
|   had its own private /tmp, or if access to the shared one was | ||||
|   mediated. | ||||
| 
 | ||||
| <h2>V6 code examples</h2> | ||||
| 
 | ||||
<p> namei (sheet 46) is the core of the Unix naming system.  namei can
  be called in several ways: NAMEI_LOOKUP (resolve a name to an inode
  and lock the inode), NAMEI_CREATE (resolve a name, but lock the
  parent inode), and NAMEI_DELETE (resolve a name, lock the parent
  inode, and return the offset in the directory).  The reason namei is
  complicated is that we want to atomically test whether a name exists
  and remove/create it if it does; otherwise, two concurrent processes
  could interfere with each other and the directory could end up in an
  inconsistent state.
| 
 | ||||
<p>Let's trace open("a", O_RDWR), focusing on namei:
| <ul> | ||||
| <li>5263: we will look at creating a file in a bit. | ||||
| <li>5277: call namei with NAMEI_LOOKUP | ||||
| <li>4629: if path name start with "/", lookup root inode (1). | ||||
| <li>4632: otherwise, use inode for current working directory. | ||||
<li>4638: consume a run of "/"s, for example in "/////a////b"
| <li>4641: if we are done with NAMEI_LOOKUP, return inode (e.g., | ||||
|   namei("/")). | ||||
<li>4652: if the inode in which we are searching for the name isn't of
  type directory, give up.
| <li>4657-4661: determine length of the current component of the | ||||
|   pathname we are resolving. | ||||
| <li>4663-4681: scan the directory for the component. | ||||
<li>4682-4696: the entry wasn't found.  If we are at the end of the
  pathname and NAMEI_CREATE is set, lock the parent directory and
  return a pointer to the start of the component.  In all other cases,
  unlock the inode of the directory and return 0.
| <li>4701: if NAMEI_DELETE is set, return locked parent inode and the | ||||
|   offset of the to-be-deleted component in the directory. | ||||
<li>4707: look up the inode of the component, and go to the top of the loop.
| </ul> | ||||
| 
 | ||||
| <p>Now let's look at creating a file in a directory: | ||||
| <ul> | ||||
| <li>5264: if the last component doesn't exist, but first part of the | ||||
|   pathname resolved to a directory, then dp will be 0, last will point | ||||
|   to the beginning of the last component, and ip will be the locked | ||||
|   parent directory. | ||||
| <li>5266: create an entry for last in the directory. | ||||
| <li>4772: mknod1 allocates a new named inode and adds it to an | ||||
|   existing directory. | ||||
<li>4776: ialloc: scan the inode blocks, find an unused entry, and
  write it.  (If lucky, 1 read and 1 write.)
<li>4784: fill out the inode entry and write it.  (Another write.)
<li>4786: write the entry into the directory.  (If lucky, 1 write.)
| </ul> | ||||
| 
 | ||||
<p>Why must the parent directory be locked?  If two processes try to
create the same name in the same directory, only one should succeed
and the other should receive an error (file exists).
| 
 | ||||
<p>Link, unlink, chdir, mount, and umount could have taken file
descriptors instead of their path arguments.  In fact, this would
eliminate some possible race conditions (some of which have security
implications, e.g., TOCTTOU).  However, this would require that the
current working directory be remembered by the process, and Unix didn't
have good ways of maintaining static state shared among all processes
belonging to a given user.  The easiest way to create shared state is
to place it in the kernel.
|     | ||||
<p>We have one piece of code in xv6 that we haven't studied: exec.
  With all the groundwork we have done, this code can be easily
  understood (see sheet 54).
| 
 | ||||
| </body> | ||||
							
								
								
									
web/l-okws.txt (new file, 249 lines)
| 
 | ||||
| Security | ||||
| ------------------- | ||||
| I. 2 Intro Examples | ||||
| II. Security Overview | ||||
| III. Server Security: Offense + Defense | ||||
| IV. Unix Security + POLP | ||||
| V. Example: OKWS | ||||
| VI. How to Build a Website | ||||
| 
 | ||||
| I. Intro Examples | ||||
| -------------------- | ||||
| 1. Apache + OpenSSL 0.9.6a (CAN 2002-0656) | ||||
|   - SSL = More security! | ||||
| 
 | ||||
  unsigned int j;
  p = (unsigned char *)s->init_buf->data;
  j = *(p++);                         /* attacker-controlled length byte */
  s->session->session_id_length = j;
  memcpy(s->session->session_id, p, j);  /* no bounds check: can overflow
                                            the fixed-size session_id */
| 
 | ||||
|  - the result: an Apache worm | ||||
| 
 | ||||
| 2. SparkNotes.com 2000: | ||||
|  - New profile feature that displays "public" information about users | ||||
|    but bug that made e-mail addresses "public" by default. | ||||
|  - New program for getting that data: | ||||
|   | ||||
|      http://www.sparknotes.com/getprofile.cgi?id=1343 | ||||
| 
 | ||||
| II. Security Overview | ||||
| ---------------------- | ||||
| 
 | ||||
| What Is Security? | ||||
|  - Protecting your system from attack. | ||||
| 
 | ||||
|  What's an attack? | ||||
|  - Stealing data | ||||
|  - Corrupting data | ||||
|  - Controlling resources | ||||
|  - DOS | ||||
| 
 | ||||
|  Why attack? | ||||
|  - Money | ||||
|  - Blackmail / extortion | ||||
|  - Vendetta | ||||
|  - intellectual curiosity | ||||
|  - fame | ||||
| 
 | ||||
| Security is a Big topic | ||||
| 
 | ||||
|  - Server security -- today's focus.  There's some machine sitting on the | ||||
|    Internet somewhere, with a certain interface exposed, and attackers | ||||
|    want to circumvent it. | ||||
|     - Why should you trust your software? | ||||
| 
 | ||||
|  - Client security | ||||
|     - Clients are usually servers, so they have many of the same issues. | ||||
|     - Slight simplification: people across the network cannot typically | ||||
|       initiate connections. | ||||
|     - Has a "fallible operator": | ||||
|         - Spyware | ||||
|         - Drive-by-Downloads | ||||
| 
 | ||||
|  - Client security turns out to be much harder -- GUI considerations, | ||||
|    look inside the browser and the applications. | ||||
|  - Systems community can more easily handle server security. | ||||
|  - We think mainly of servers. | ||||
| 
 | ||||
| III. Server Security: Offense and Defense | ||||
| ----------------------------------------- | ||||
|  - Show picture of a Web site. | ||||
| 
 | ||||
|  Attacks			|	Defense | ||||
| ---------------------------------------------------------------------------- | ||||
|  1. Break into DB from net	| 1. FW it off | ||||
|  2. Break into WS on telnet	| 2. FW it off | ||||
|  3. Buffer overrun in Apache	| 3. Patch apache / use better lang? | ||||
|  4. Buffer overrun in our code  | 4. Use better lang / isolate it | ||||
|  5. SQL injection		| 5. Better escaping / don't interpret code. | ||||
|  6. Data scraping.              | 6. Use a sparse UID space. | ||||
|  7. PW sniffing			| 7.  ??? | ||||
|  8. Fetch /etc/passwd and crack | 8. Don't expose /etc/passwd | ||||
|      PW				| | ||||
|  9. Root escalation from apache | 9. No setuid programs available to Apache | ||||
| 10. XSS                         |10. Filter JS and input HTML code. | ||||
| 11. Keystroke recorded on sys-  |11. Client security | ||||
|     admin's desktop (planetlab) | | ||||
| 12. DDOS                        |12.  ??? | ||||
| 
 | ||||
| Summary: | ||||
 - The fact that we want private data to be available to the right
   people is what makes this problem hard in the first place. Internet
   servers are there for a reason.
 - Security != "just encrypt your data"; in fact, that can sometimes
   make the problem worse.
 - Best to prevent break-ins from happening in the first place.
 - If they do happen, want to limit their damage (POLP).
 - Security policies are difficult to express / package up neatly.
| 
 | ||||
| IV. Design According to POLP (in Unix) | ||||
| --------------------------------------- | ||||
|  - Assume any piece of a system can be compromised, by either bad | ||||
|    programming or malicious attack. | ||||
|  - Try to limit the damage done by such a compromise (along the lines | ||||
|    of the 4 attack goals). | ||||
| 
 | ||||
|  <Draw a picture of a server process on Unix, w/ other processes> | ||||
| 
 | ||||
| What's the goal on Unix? | ||||
|  - Keep processes from communicating that don't have to: | ||||
|     - limit FS, IPC, signals, ptrace | ||||
|  - Strip away unneeded privilege  | ||||
|     - with respect to network, FS. | ||||
|  - Strip away FS access. | ||||
| 
 | ||||
| How on Unix? | ||||
|  - setuid/setgid | ||||
|  - system call interposition | ||||
|  - chroot (away from setuid executables, /etc/passwd, /etc/ssh/..) | ||||
| 
 | ||||
|  <show Code snippet> | ||||
| 
 | ||||
| How do you write chroot'ed programs? | ||||
|  - What about shared libraries? | ||||
|  - /etc/resolv.conf? | ||||
|  - Can chroot'ed programs access the FS at all? What if they need | ||||
|    to write to the FS or read from the FS? | ||||
|  - Fd's are *capabilities*; can pass them to chroot'ed services, | ||||
|    thereby opening new files on its behalf. | ||||
|  - Unforgeable - can only get them from the kernel via open/socket, etc. | ||||
| 
 | ||||
| Unix Shortcomings (round 1) | ||||
|  - It's bad to run as root! | ||||
|  - Yet, need root for: | ||||
|     - chroot | ||||
|     - setuid/setgid to a lower-privileged user | ||||
|     - create a new user ID | ||||
|  - Still no guarantee that we've cut off all channels | ||||
|     - 200 syscalls! | ||||
|     - Default is to give most/all privileges. | ||||
|  - Can "break out" of chroot jails? | ||||
|  - Can still exploit race conditions in the kernel to escalate privileges. | ||||
| 
 | ||||
| Sidebar | ||||
|  - setuid / setuid misunderstanding | ||||
|  - root / root misunderstanding | ||||
|  - effective vs. real vs. saved set-user-ID | ||||
| 
 | ||||
| V. OKWS | ||||
| ------- | ||||
| - Taking these principles as far as possible. | ||||
| - C.f. Figure 1 From the paper.. | ||||
| - Discussion of which privileges are in which processes | ||||
| 
 | ||||
| <Table of how to hack, what you get, etc...> | ||||
| 
 | ||||
| - Technical details: how to launch a new service | ||||
| - Within the launcher (running as root): | ||||
| 
 | ||||
| <on board:> | ||||
| 
 | ||||
|     // receive FDs from logger, pubd, demux | ||||
|     fork (); | ||||
|     chroot ("/var/okws/run"); | ||||
|     chdir ("/coredumps/51001"); | ||||
|     setgid (51001); | ||||
|     setuid (51001); | ||||
|     exec ("login", fds ... ); | ||||
| 
 | ||||
| - Note no chroot -- why not? | ||||
| - Once launched, how does a service get new connections? | ||||
| - Note the goal - minimum tampering with each other in the  | ||||
|   case of a compromise. | ||||
| 
 | ||||
| Shortcoming of Unix (2) | ||||
| - A lot of plumbing involved with this system.  FDs flying everywhere. | ||||
| - Isolation still not fine enough.  If a service gets taken over, | ||||
|   can compromise all users of that service. | ||||
| 
 | ||||
| VI. Reflections on Building Websites | ||||
| --------------------------------- | ||||
| - OKWS interesting "experiment" | ||||
| - Need for speed; also, good gzip support. | ||||
| - If you need compiled code, it's a good way to go. | ||||
| - RPC-like system a must for backend communication | ||||
| - Connection-pooling for free | ||||
| 
 | ||||
| Biggest difficulties: | ||||
| - Finding good C++ programmers. | ||||
| - Compile times. | ||||
| - The DB is still always the problem. | ||||
| 
 | ||||
| Hard to Find good Alternatives | ||||
| - Python / Perl - you might spend a lot of time writing C code /  | ||||
|   integrating with lower level languages. | ||||
| - Have to worry about DB pooling. | ||||
 - Java -- most viable, and is getting better.  Scary that you can't
   peer inside.
| - .Net / C#-based system might be the way to go. | ||||
| 
 | ||||
| 
 | ||||
| ======================================================================= | ||||
| 
 | ||||
| Extra Material: | ||||
| 
 | ||||
| Capabilities (From the Eros Paper in SOSP 1999) | ||||
| 
 | ||||
|  - "Unforgeable pair made up of an object ID and a set of authorized  | ||||
|    operations (an interface) on that object." | ||||
|    - c.f. Dennis and van Horn. "Programming semantics for multiprogrammed | ||||
|      computations," Communications of the ACM 9(3):143-154, Mar 1966. | ||||
|  - Thus: | ||||
|       <object ID, set of authorized OPs on that object> | ||||
|  - Examples: | ||||
|       "Process X can write to file at inode Y" | ||||
|       "Process P can read from file at inode Z" | ||||
|  - Familiar example: Unix file descriptors | ||||
|   | ||||
|  - Why are they secure? | ||||
|     - Capabilities are "unforgeable" | ||||
|     - Processes can get them only through authorized interfaces | ||||
|     - Capabilities are only given to processes authorized to hold them | ||||
| 
 | ||||
|  - How do you get them? | ||||
|     - From the kernel (e.g., open) | ||||
|     - From other applications (e.g., FD passing) | ||||
| 
 | ||||
|  - How do you use them? | ||||
|     - read (fd), write(fd). | ||||
| 
 | ||||
|  - How do you revoke them once granted? | ||||
|    - In Unix, you do not. | ||||
|    - In some systems, a central authority ("reference monitor") can revoke. | ||||
|   | ||||
|  - How do you store them persistently? | ||||
|     - Can have circular dependencies (unlike an FS). | ||||
|     - What happens when the system starts up?  | ||||
|     - Revert to checkpointed state. | ||||
|     - Often capability systems chose a single-level store. | ||||
|   | ||||
 - Capability systems, a historical perspective:
    - KeyKOS, EROS, Coyotos (research systems)
        - Never saw any applications
    - IBM systems (System/38, later AS/400, later 'i Series')
        - Commercially viable
|  - Problems: | ||||
|     - All bets are off when a capability is sent to the wrong place. | ||||
|     - Firewall analogy? | ||||
							
								
								
									
web/l-plan9.html (new file, 249 lines)
| <html> | ||||
| <head> | ||||
| <title>Plan 9</title> | ||||
| </head> | ||||
| <body> | ||||
| 
 | ||||
| <h1>Plan 9</h1> | ||||
| 
 | ||||
| <p>Required reading: Plan 9 from Bell Labs</p> | ||||
| 
 | ||||
| <h2>Background</h2> | ||||
| 
 | ||||
<p>Computing had moved away from the ``one computing system'' model of
Multics and Unix.</p>
| 
 | ||||
| <p>Many computers (`workstations'), self-maintained, not a coherent whole.</p> | ||||
| 
 | ||||
| <p>Pike and Thompson had been batting around ideas about a system glued together  | ||||
| by a single protocol as early as 1984. | ||||
| Various small experiments involving individual pieces (file server, OS, computer) | ||||
| tried throughout 1980s.</p> | ||||
| 
 | ||||
| <p>Ordered the hardware for the ``real thing'' in beginning of 1989, | ||||
| built up WORM file server, kernel, throughout that year.</p> | ||||
| 
 | ||||
| <p>Some time in early fall 1989, Pike and Thompson were | ||||
| trying to figure out a way to fit the window system in. | ||||
| On way home from dinner, both independently realized that | ||||
| needed to be able to mount a user-space file descriptor, | ||||
| not just a network address.</p> | ||||
| 
 | ||||
| <p>Around Thanksgiving 1989, spent a few days rethinking the whole | ||||
| thing, added bind, new mount, flush, and spent a weekend  | ||||
| making everything work again.  The protocol at that point was | ||||
| essentially identical to the 9P in the paper.</p> | ||||
| 
 | ||||
| <p>In May 1990, tried to use system as self-hosting. | ||||
| File server kept breaking, had to keep rewriting window system. | ||||
| Dozen or so users by then, mostly using terminal windows to | ||||
| connect to Unix.</p> | ||||
| 
 | ||||
| <p>Paper written and submitted to UKUUG in July 1990.</p> | ||||
| 
 | ||||
| <p>Because it was an entirely new system, could take the | ||||
| time to fix problems as they arose, <i>in the right place</i>.</p> | ||||
| 
 | ||||
| 
 | ||||
| <h2>Design Principles</h2> | ||||
| 
 | ||||
| <p>Three design principles:</p> | ||||
| 
 | ||||
| <p> | ||||
| 1. Everything is a file.<br> | ||||
| 2. There is a standard protocol for accessing files.<br> | ||||
| 3. Private, malleable name spaces (bind, mount). | ||||
| </p> | ||||
| 
 | ||||
| <h3>Everything is a file.</h3> | ||||
| 
 | ||||
| <p>Everything is a file (more everything than Unix: networks, graphics).</p> | ||||
| 
 | ||||
| <pre> | ||||
| % ls -l /net | ||||
| % lp /dev/screen | ||||
| % cat /mnt/wsys/1/text | ||||
| </pre> | ||||
| 
 | ||||
| <h3>Standard protocol for accessing files</h3> | ||||
| 
 | ||||
| <p>9P is the only protocol the kernel knows: other protocols | ||||
| (NFS, disk file systems, etc.) are provided by user-level translators.</p> | ||||
| 
 | ||||
| <p>Only one protocol, so easy to write filters and other | ||||
| converters.  <i>Iostats</i> puts itself between the kernel | ||||
| and a command.</p> | ||||
| 
 | ||||
| <pre> | ||||
| % iostats -xvdfdf /bin/ls | ||||
| </pre> | ||||
| 
 | ||||
| <h3>Private, malleable name spaces</h3> | ||||
| 
 | ||||
| <p>Each process has its own private name space that it | ||||
| can customize at will.   | ||||
| (Full disclosure: can arrange groups of | ||||
| processes to run in a shared name space.  Otherwise how do | ||||
| you implement <i>mount</i> and <i>bind</i>?)</p> | ||||
| 
 | ||||
| <p><i>Iostats</i> remounts the root of the name space | ||||
| with its own filter service.</p> | ||||
| 
 | ||||
| <p>The window system mounts a file system that it serves | ||||
| on <tt>/mnt/wsys</tt>.</p> | ||||
| 
 | ||||
| <p>The network is actually a kernel device (no 9P involved) | ||||
| but it still serves a file interface that other programs | ||||
| use to access the network. | ||||
| Easy to move out to user space (or replace) if necessary: | ||||
| <i>import</i> network from another machine.</p> | ||||
| 
 | ||||
| <h3>Implications</h3> | ||||
| 
 | ||||
| <p>Everything is a file + can share files => can share everything.</p> | ||||
| 
 | ||||
| <p>Per-process name spaces help move toward ``each process has its own | ||||
| private machine.''</p> | ||||
| 
 | ||||
<p>One protocol: easy to build custom filters to add functionality
(e.g., reestablishing broken network connections).</p>
| 
 | ||||
| <h3>File representation for networks, graphics, etc.</h3> | ||||
| 
 | ||||
| <p>Unix sockets are file descriptors, but you can't use the | ||||
| usual file operations on them.  Also far too much detail that | ||||
| the user doesn't care about.</p> | ||||
| 
 | ||||
| <p>In Plan 9:  | ||||
| <pre>dial("tcp!plan9.bell-labs.com!http"); | ||||
| </pre> | ||||
| (Protocol-independent!)</p> | ||||
| 
 | ||||
<p>Dial more or less does:</p>
<pre>
write to /net/cs: tcp!plan9.bell-labs.com!http
read back: /net/tcp/clone 204.178.31.2!80
write to /net/tcp/clone: connect 204.178.31.2!80
read connection number: 4
open /net/tcp/4/data
</pre>
| 
 | ||||
| <p>Details don't really matter.  Two important points: | ||||
| protocol-independent, and ordinary file operations | ||||
| (open, read, write).</p> | ||||
| 
 | ||||
| <p>Networks can be shared just like any other files.</p> | ||||
| 
 | ||||
| <p>Similar story for graphics, other resources.</p> | ||||
| 
 | ||||
| <h2>Conventions</h2> | ||||
| 
 | ||||
| <p>Per-process name spaces mean that even full path names are ambiguous | ||||
| (<tt>/bin/cat</tt> means different things on different machines, | ||||
| or even for different users).</p> | ||||
| 
 | ||||
<p><i>Convention</i> binds everything together.
On a 386, <tt>bind /386/bin /bin</tt>.</p>
| 
 | ||||
| <p>In Plan 9, always know where the resource <i>should</i> be | ||||
| (e.g., <tt>/net</tt>, <tt>/dev</tt>, <tt>/proc</tt>, etc.), | ||||
| but not which one is there.</p> | ||||
| 
 | ||||
| <p>Can break conventions: on a 386, <tt>bind /alpha/bin /bin</tt>, just won't | ||||
| have usable binaries in <tt>/bin</tt> anymore.</p> | ||||
| 
 | ||||
| <p>Object-oriented in the sense of having objects (files) that all | ||||
| present the same interface and can be substituted for one another | ||||
| to arrange the system in different ways.</p> | ||||
| 
 | ||||
| <p>Very little ``type-checking'': <tt>bind /net /proc; ps</tt>. | ||||
| Great benefit (generality) but must be careful (no safety nets).</p> | ||||
| 
 | ||||
| 
 | ||||
| <h2>Other Contributions</h2> | ||||
| 
 | ||||
| <h3>Portability</h3> | ||||
| 
 | ||||
| <p>Plan 9 still is the most portable operating system. | ||||
| Not much machine-dependent code, no fancy features | ||||
| tied to one machine's MMU, multiprocessor from the start (1989).</p> | ||||
| 
 | ||||
| <p>Many other systems are still struggling with converting to SMPs.</p> | ||||
| 
 | ||||
| <p>Has run on MIPS, Motorola 68000, Nextstation, Sparc, x86, PowerPC, Alpha, others.</p> | ||||
| 
 | ||||
| <p>All the world is not an x86.</p> | ||||
| 
 | ||||
| <h3>Alef</h3> | ||||
| 
 | ||||
| <p>New programming language: convenient, but difficult to maintain. | ||||
| Retired when author (Winterbottom) stopped working on Plan 9.</p> | ||||
| 
 | ||||
| <p>Good ideas transferred to C library plus conventions.</p> | ||||
| 
 | ||||
| <p>All the world is not C.</p> | ||||
| 
 | ||||
| <h3>UTF-8</h3> | ||||
| 
 | ||||
| <p>Thompson invented UTF-8.  Pike and Thompson | ||||
| converted Plan 9 to use it over the first weekend of September 1992, | ||||
| in time for X/Open to choose it as the Unicode standard byte format | ||||
| at a meeting the next week.</p> | ||||
| 
 | ||||
| <p>UTF-8 is now the standard character encoding for Unicode on | ||||
| all systems and interoperating between systems.</p> | ||||
| 
 | ||||
| <h3>Simple, easy to modify base for experiments</h3> | ||||
| 
 | ||||
| <p>Whole system source code is available, simple, easy to  | ||||
| understand and change. | ||||
| There's a reason it only took a couple days to convert to UTF-8.</p> | ||||
| 
 | ||||
| <pre> | ||||
|   49343  file server kernel | ||||
| 
 | ||||
|  181611  main kernel | ||||
|   78521    ipaq port (small kernel) | ||||
|   20027      TCP/IP stack | ||||
|   15365      ipaq-specific code | ||||
|   43129      portable code | ||||
| 
 | ||||
| 1326778  total lines of source code | ||||
| </pre> | ||||
| 
 | ||||
| <h3>Dump file system</h3> | ||||
| 
 | ||||
<p>Snapshot idea might well have been ``in the air'' at the time.
(<tt>OldFiles</tt> in AFS appears to be independently derived;
use of WORM media was a common research topic.)</p>
| 
 | ||||
| <h3>Generalized Fork</h3> | ||||
| 
 | ||||
| <p>Picked up by other systems: FreeBSD, Linux.</p> | ||||
| 
 | ||||
| <h3>Authentication</h3> | ||||
| 
 | ||||
| <p>No global super-user. | ||||
| Newer, more Plan 9-like authentication described in later paper.</p> | ||||
| 
 | ||||
| <h3>New Compilers</h3> | ||||
| 
 | ||||
| <p>Much faster than gcc, simpler.</p> | ||||
| 
 | ||||
| <p>8s to build acme for Linux using gcc; 1s to build acme for Plan 9 using 8c (but running on Linux)</p> | ||||
| 
 | ||||
| <h3>IL Protocol</h3> | ||||
| 
 | ||||
| <p>Now retired.   | ||||
| For better or worse, TCP has all the installed base. | ||||
| IL didn't work very well on asymmetric or high-latency links | ||||
| (e.g., cable modems).</p> | ||||
| 
 | ||||
| <h2>Idea propagation</h2> | ||||
| 
 | ||||
| <p>Many ideas have propagated out to varying degrees.</p> | ||||
| 
 | ||||
| <p>Linux even has bind and user-level file servers now (FUSE), | ||||
| but still not per-process name spaces.</p> | ||||
| 
 | ||||
| 
 | ||||
| </body> | ||||
							
								
								
									
web/l-scalablecoord.html (new file, 202 lines)
							|  | @ -0,0 +1,202 @@ | |||
<html>
<head>
<title>Scalable coordination</title>
</head>
| <body> | ||||
| 
 | ||||
| <h1>Scalable coordination</h1> | ||||
| 
 | ||||
| <p>Required reading: Mellor-Crummey and Scott, Algorithms for Scalable | ||||
|   Synchronization on Shared-Memory Multiprocessors, TOCS, Feb 1991. | ||||
| 
 | ||||
| <h2>Overview</h2> | ||||
| 
 | ||||
<p>Shared memory machines are a bunch of CPUs sharing physical memory.
Typically each processor also maintains a cache (for performance),
which introduces the problem of keeping the caches coherent.  If processor 1
writes a memory location whose value processor 2 has cached, then
processor 2's cache must be updated in some way.  How?
| <ul> | ||||
| 
 | ||||
<li>Bus-based schemes.  Any CPU can access any memory
equally ("dance hall" architecture). Use "snoopy" protocols: each CPU's cache
listens to the memory bus. With a write-through architecture, invalidate the
cached copy when a write is seen. Or can have an "ownership" scheme with a
write-back cache (e.g., Pentium caches have MESI bits---modified, exclusive,
shared, invalid). If the E bit is set, the CPU caches the line exclusively
and can write it back. But the bus places limits on scalability.
| 
 | ||||
<li>More scalability with NUMA schemes (non-uniform memory access). Each
| CPU comes with fast "close" memory. Slower to access memory that is | ||||
| stored with another processor. Use a directory to keep track of who is | ||||
| caching what.  For example, processor 0 is responsible for all memory | ||||
| starting with address "000", processor 1 is responsible for all memory | ||||
| starting with "001", etc. | ||||
| 
 | ||||
| <li>COMA - cache-only memory architecture.  Each CPU has local RAM, | ||||
| treated as cache. Cache lines migrate around to different nodes based | ||||
| on access pattern. Data only lives in cache, no permanent memory | ||||
| location. (These machines aren't too popular any more.) | ||||
| 
 | ||||
| </ul> | ||||
| 
 | ||||
| 
 | ||||
| <h2>Scalable locks</h2> | ||||
| 
 | ||||
| <p>This paper is about cost and scalability of locking; what if you | ||||
| have 10 CPUs waiting for the same lock?  For example, what would | ||||
| happen if xv6 runs on an SMP with many processors? | ||||
| 
 | ||||
| <p>What's the cost of a simple spinning acquire/release?  Algorithm 1 | ||||
| *without* the delays, which is like xv6's implementation of acquire | ||||
| and release (xv6 uses XCHG instead of test_and_set): | ||||
| <pre> | ||||
|   each of the 10 CPUs gets the lock in turn | ||||
|   meanwhile, remaining CPUs in XCHG on lock | ||||
|   lock must be X in cache to run XCHG | ||||
|     otherwise all might read, then all might write | ||||
|   so bus is busy all the time with XCHGs! | ||||
|   can we avoid constant XCHGs while lock is held? | ||||
| </pre> | ||||
| 
 | ||||
| <p>test-and-test-and-set | ||||
| <pre> | ||||
|   only run expensive TSL if not locked | ||||
|   spin on ordinary load instruction, so cache line is S | ||||
|   acquire(l) | ||||
|     while(1){ | ||||
|       while(l->locked != 0) { } | ||||
|       if(TSL(&l->locked) == 0) | ||||
|         return; | ||||
|     } | ||||
| </pre> | ||||
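<p>As a concrete sketch of the pseudocode above (our names, not xv6's actual code), test-and-test-and-set can be written with C11 atomics; <tt>atomic_exchange</tt> plays the role of TSL, where xv6 would use the x86 XCHG instruction:</p>

```c
#include <stdatomic.h>

/* Sketch of test-and-test-and-set with C11 atomics (names are ours). */
typedef struct { atomic_int locked; } spinlock;

void ttas_acquire(spinlock *l) {
    for (;;) {
        /* spin with ordinary loads while the lock is held, so the
         * cache line stays in state S and the bus stays quiet */
        while (atomic_load(&l->locked) != 0)
            ;
        /* lock looked free: now try the expensive atomic exchange */
        if (atomic_exchange(&l->locked, 1) == 0)
            return;
    }
}

void ttas_release(spinlock *l) {
    atomic_store(&l->locked, 0);
}
```

<p>The inner loop is the "test" that runs out of the local cache; only when the lock looks free does a CPU issue the interlocked "test-and-set."</p>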
| 
 | ||||
| <p>suppose 10 CPUs are waiting, let's count cost in total bus | ||||
|   transactions | ||||
| <pre> | ||||
|   CPU1 gets lock in one cycle | ||||
|     sets lock's cache line to I in other CPUs | ||||
|   9 CPUs each use bus once in XCHG | ||||
|     then everyone has the line S, so they spin locally | ||||
  CPU1 releases the lock
  CPU2 gets the lock in one cycle
  8 CPUs each use bus once...
  So 10 + 9 + 8 + ... + 1 = 55 transactions, O(n^2) in # of CPUs!
|   Look at "test-and-test-and-set" in Figure 6 | ||||
| </pre> | ||||
| <p>  Can we have <i>n</i> CPUs acquire a lock in O(<i>n</i>) time? | ||||
| 
 | ||||
| <p>What is the point of the exponential backoff in Algorithm 1? | ||||
| <pre> | ||||
|   Does it buy us O(n) time for n acquires? | ||||
|   Is there anything wrong with it? | ||||
|   may not be fair | ||||
|   exponential backoff may increase delay after release | ||||
| </pre> | ||||
| 
 | ||||
| <p>What's the point of the ticket locks, Algorithm 2? | ||||
| <pre> | ||||
|   one interlocked instruction to get my ticket number | ||||
|   then I spin on now_serving with ordinary load | ||||
|   release() just increments now_serving | ||||
| </pre> | ||||
| 
 | ||||
| <p>why is that good? | ||||
| <pre> | ||||
|   + fair | ||||
|   + no exponential backoff overshoot | ||||
  + spin with ordinary loads, not interlocked instructions
| </pre> | ||||
| 
 | ||||
| <p>but what's the cost, in bus transactions? | ||||
| <pre> | ||||
|   while lock is held, now_serving is S in all caches | ||||
|   release makes it I in all caches | ||||
  then each waiter uses a bus transaction to get the new value
|   so still O(n^2) | ||||
| </pre> | ||||
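<p>A minimal sketch of the ticket lock in C11 atomics (field names assumed to match the paper's <tt>now_serving</tt> idea; not the paper's verbatim code):</p>

```c
#include <stdatomic.h>

/* Sketch of Algorithm 2: one interlocked fetch-and-add to take a
 * ticket, then spin on now_serving with ordinary loads. */
typedef struct {
    atomic_uint next_ticket;   /* next ticket to hand out */
    atomic_uint now_serving;   /* ticket currently holding the lock */
} ticket_lock;

void ticket_acquire(ticket_lock *l) {
    unsigned me = atomic_fetch_add(&l->next_ticket, 1);  /* my ticket */
    while (atomic_load(&l->now_serving) != me)
        ;   /* ordinary load; no bus traffic while the line is S */
}

void ticket_release(ticket_lock *l) {
    atomic_fetch_add(&l->now_serving, 1);  /* admit the next ticket */
}
```

<p>Fairness falls out of the FIFO ticket order, but as noted above every waiter still reloads <tt>now_serving</tt> after each release.</p>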
| 
 | ||||
| <p>What's the point of the array-based queuing locks, Algorithm 3? | ||||
| <pre> | ||||
|     a lock has an array of "slots" | ||||
|     waiter allocates a slot, spins on that slot | ||||
|     release wakes up just next slot | ||||
|   so O(n) bus transactions to get through n waiters: good! | ||||
|   anderson lines in Figure 4 and 6 are flat-ish | ||||
|     they only go up because lock data structures protected by simpler lock | ||||
|   but O(n) space *per lock*! | ||||
| </pre> | ||||
| 
 | ||||
| <p>Algorithm 5 (MCS), the new algorithm of the paper, uses | ||||
| compare_and_swap: | ||||
| <pre> | ||||
| int compare_and_swap(addr, v1, v2) { | ||||
|   int ret = 0; | ||||
|   // stop all memory activity and ignore interrupts | ||||
|   if (*addr == v1) { | ||||
|     *addr = v2; | ||||
|     ret = 1; | ||||
|   } | ||||
|   // resume other memory activity and take interrupts | ||||
|   return ret; | ||||
| } | ||||
| </pre> | ||||
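<p>Real hardware doesn't stop all memory activity; the compare and the conditional store are a single indivisible instruction (e.g., LOCK CMPXCHG on x86). The same operation, written with C11's <tt>atomic_compare_exchange_strong</tt>:</p>

```c
#include <stdatomic.h>

/* The pseudocode's compare_and_swap via C11 atomics: returns 1 and
 * stores v2 if *addr == v1, else returns 0.  (On failure the observed
 * value is written back into v1, which we simply discard.) */
int compare_and_swap(atomic_int *addr, int v1, int v2) {
    return atomic_compare_exchange_strong(addr, &v1, v2);
}
```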
| 
 | ||||
| <p>What's the point of the MCS lock, Algorithm 5? | ||||
| <pre> | ||||
|   constant space per lock, rather than O(n) | ||||
|   one "qnode" per thread, used for whatever lock it's waiting for | ||||
|   lock holder's qnode points to start of list | ||||
|   lock variable points to end of list | ||||
|   acquire adds your qnode to end of list | ||||
|     then you spin on your own qnode | ||||
|   release wakes up next qnode | ||||
| </pre> | ||||
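<p>A sketch of MCS acquire/release in C11 atomics. This is simplified relative to the paper: names are ours, and we use sequentially consistent operations throughout rather than the minimal memory orderings.</p>

```c
#include <stdatomic.h>
#include <stddef.h>

/* Sketch of the MCS list-based queuing lock (Algorithm 5).  The lock
 * is a single tail pointer; each waiter supplies its own qnode and
 * spins only on that qnode's locked flag. */
typedef struct qnode {
    _Atomic(struct qnode *) next;
    atomic_int locked;
} qnode;

typedef struct {
    _Atomic(qnode *) tail;    /* end of the waiter list (NULL = free) */
} mcs_lock;

void mcs_acquire(mcs_lock *l, qnode *me) {
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, 1);
    qnode *prev = atomic_exchange(&l->tail, me);   /* join end of list */
    if (prev != NULL) {
        atomic_store(&prev->next, me);             /* link in behind prev */
        while (atomic_load(&me->locked))
            ;                                      /* spin on my own qnode */
    }
}

void mcs_release(mcs_lock *l, qnode *me) {
    qnode *next = atomic_load(&me->next);
    if (next == NULL) {
        qnode *expect = me;
        /* no known successor: if tail still points at me, lock is free */
        if (atomic_compare_exchange_strong(&l->tail, &expect, NULL))
            return;
        /* a successor swapped into tail but hasn't linked in yet */
        while ((next = atomic_load(&me->next)) == NULL)
            ;
    }
    atomic_store(&next->locked, 0);   /* hand the lock to the next waiter */
}
```

<p>Because each waiter spins on its own qnode, a release invalidates one cache line in one waiter's cache: O(n) bus transactions for n acquires, with constant space per lock.</p>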
| 
 | ||||
| <h2>Wait-free or non-blocking data structures</h2> | ||||
| 
 | ||||
<p>The previous implementations all block threads when there is
  contention for a lock.  Other atomic hardware operations allow one
  to build wait-free data structures.  For example, one
  can implement an insert of an element into a shared list that doesn't
  block any thread.  Such versions are called wait free.
| 
 | ||||
| <p>A linked list with locks is as follows: | ||||
| <pre> | ||||
| Lock list_lock; | ||||
| 
 | ||||
| insert(int x) { | ||||
|   element *n = new Element; | ||||
|   n->x = x; | ||||
| 
 | ||||
|   acquire(&list_lock); | ||||
|   n->next = list; | ||||
|   list = n; | ||||
|   release(&list_lock); | ||||
| } | ||||
| </pre> | ||||
| 
 | ||||
| <p>A wait-free implementation is as follows: | ||||
| <pre> | ||||
| insert (int x) { | ||||
|   element *n = new Element; | ||||
|   n->x = x; | ||||
|   do { | ||||
|      n->next = list; | ||||
|   } while (compare_and_swap (&list, n->next, n) == 0); | ||||
| } | ||||
| </pre> | ||||
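<p>The same insert, made runnable with C11 atomics (list layout and names are ours); the CAS retries whenever another thread changed the head between our snapshot and our swap:</p>

```c
#include <stdatomic.h>
#include <stdlib.h>

typedef struct element {
    int x;
    struct element *next;
} element;

_Atomic(element *) list = NULL;   /* shared head pointer */

void insert(int x) {
    element *n = malloc(sizeof(element));
    n->x = x;
    do {
        n->next = atomic_load(&list);   /* snapshot the current head */
    } while (!atomic_compare_exchange_weak(&list, &n->next, n));
}
```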
| <p>How many bus transactions with 10 CPUs inserting one element in the | ||||
| list? Could you do better? | ||||
| 
 | ||||
| <p><a href="http://www.cl.cam.ac.uk/netos/papers/2007-cpwl.pdf">This | ||||
|  paper by Fraser and Harris</a> compares lock-based implementations | ||||
|  versus corresponding non-blocking implementations of a number of data | ||||
|  structures. | ||||
| 
 | ||||
<p>It is not possible to make every operation wait-free, and there are
  times we will need an implementation of acquire and release.
  Research on non-blocking data structures is active; the last word
  on this topic hasn't been said yet.
| 
 | ||||
| </body> | ||||
							
								
								
									
web/l-schedule.html (new file, 340 lines)
							|  | @ -0,0 +1,340 @@ | |||
<html>
<head>
<title>Scheduling</title>
</head>
| <body> | ||||
| 
 | ||||
| <h1>Scheduling</h1> | ||||
| 
 | ||||
| <p>Required reading: Eliminating receive livelock | ||||
| 
 | ||||
| <p>Notes based on prof. Morris's lecture on scheduling (6.824, fall'02). | ||||
| 
 | ||||
| <h2>Overview</h2> | ||||
| 
 | ||||
| <ul> | ||||
| 
 | ||||
<li>What is scheduling?  The OS policies and mechanisms to allocate
resources to entities.  A good scheduling policy ensures that the most
important entity gets the resources it needs.  This topic was
popular in the days of time sharing, when there was a shortage of
resources.  It seemed irrelevant in the era of PCs and workstations, when
resources were plentiful. Now the topic is back from the dead to handle
massive Internet servers with paying customers.  The Internet exposes
web sites to international abuse and overload, which can lead to
resource shortages.  Furthermore, some customers are more important
than others (e.g., the ones that buy a lot).
| 
 | ||||
| <li>Key problems: | ||||
| <ul> | ||||
<li>Gap between desired policy and available mechanism.  The desired
policies often include elements that are not implementable with the
mechanisms available to the operating system.  Furthermore, often
there are many conflicting goals (low latency, high throughput, and
fairness), and the scheduler must make trade-offs between them.

<li>Interaction between different schedulers.  One has to take a
systems view.  Just optimizing the CPU scheduler may do little for
the overall desired policy.
| </ul> | ||||
| 
 | ||||
| <li>Resources you might want to schedule: CPU time, physical memory, | ||||
| disk and network I/O, and I/O bus bandwidth. | ||||
| 
 | ||||
| <li>Entities that you might want to give resources to: users, | ||||
| processes, threads, web requests, or MIT accounts. | ||||
| 
 | ||||
| <li>Many polices for resource to entity allocation are possible: | ||||
| strict priority, divide equally, shortest job first, minimum guarantee | ||||
| combined with admission control. | ||||
| 
 | ||||
| <li>General plan for scheduling mechanisms | ||||
| <ol> | ||||
<li> Understand where scheduling is occurring.
| <li> Expose scheduling decisions, allow control. | ||||
| <li> Account for resource consumption, to allow intelligent control. | ||||
| </ol> | ||||
| 
 | ||||
| <li>Simple example from 6.828 kernel. The policy for scheduling | ||||
| environments is to give each one equal CPU time. The mechanism used to | ||||
| implement this policy is a clock interrupt every 10 msec and then | ||||
| selecting the next environment in a round-robin fashion.   | ||||
| 
 | ||||
| <p>But this only works if processes are compute-bound.  What if a | ||||
| process gives up some of its 10 ms to wait for input?  Do we have to | ||||
| keep track of that and give it back?  | ||||
| 
 | ||||
| <p>How long should the quantum be?  is 10 msec the right answer? | ||||
| Shorter quantum will lead to better interactive performance, but | ||||
| lowers overall system throughput because we will reschedule more, | ||||
| which has overhead. | ||||
| 
 | ||||
| <p>What if the environment computes for 1 msec and sends an IPC to | ||||
| the file server environment?  Shouldn't the file server get more CPU | ||||
| time because it operates on behalf of all other functions? | ||||
| 
 | ||||
<p>Potential improvements for the 6.828 kernel: track "recent" CPU use
(e.g., over the last second) and always run the environment with the least
recent CPU use.  (Still, if you sleep long enough you lose.) Another
solution: directed yield; specify on the yield to which environment
you are donating the remainder of the quantum (e.g., to the file
server so that it can compute on the environment's behalf).
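<p>The round-robin mechanism described above can be sketched as a tiny selection function (a toy model with assumed names, not the actual 6.828 kernel code): on each clock tick, scan forward from the last-run slot and pick the next runnable environment, wrapping around.</p>

```c
/* Toy round-robin pick over a fixed table of environments. */
#define NPROC 4

struct proc { int runnable; };   /* 1 = runnable, 0 = not */

/* returns the index of the next runnable proc after 'last', or -1 */
int rr_next(struct proc p[], int last) {
    for (int i = 1; i <= NPROC; i++) {
        int idx = (last + i) % NPROC;   /* wrap around the table */
        if (p[idx].runnable)
            return idx;
    }
    return -1;   /* nothing to run */
}
```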
| 
 | ||||
| <li>Pitfall: Priority Inversion | ||||
| <pre> | ||||
|   Assume policy is strict priority. | ||||
|   Thread T1: low priority. | ||||
|   Thread T2: medium priority. | ||||
|   Thread T3: high priority. | ||||
|   T1: acquire(l) | ||||
|   context switch to T3 | ||||
|   T3: acquire(l)... must wait for T1 to release(l)... | ||||
|   context switch to T2 | ||||
|   T2 computes for a while | ||||
|   T3 is indefinitely delayed despite high priority. | ||||
|   Can solve if T3 lends its priority to holder of lock it is waiting for. | ||||
|     So T1 runs, not T2. | ||||
|   [this is really a multiple scheduler problem.] | ||||
|   [since locks schedule access to locked resource.] | ||||
| </pre> | ||||
| 
 | ||||
| <li>Pitfall: Efficiency.  Efficiency often conflicts with fairness (or | ||||
| any other policy).  Long time quantum for efficiency in CPU scheduling | ||||
| versus low delay.  Shortest seek versus FIFO disk scheduling. | ||||
| Contiguous read-ahead vs data needed now.  For example, scheduler | ||||
| swaps out my idle emacs to let gcc run faster with more phys mem. | ||||
| What happens when I type a key?  These don't fit well into a "who gets | ||||
| to go next" scheduler framework.  Inefficient scheduling may make | ||||
| <i>everybody</i> slower, including high priority users. | ||||
| 
 | ||||
| <li>Pitfall: Multiple Interacting Schedulers. Suppose you want your | ||||
| emacs to have priority over everything else.  Give it high CPU | ||||
| priority.  Does that mean nothing else will run if emacs wants to run? | ||||
| Disk scheduler might not know to favor emacs's disk I/Os.  Typical | ||||
| UNIX disk scheduler favors disk efficiency, not process prio.  Suppose | ||||
| emacs needs more memory.  Other processes have dirty pages; emacs must | ||||
| wait.  Does disk scheduler know these other processes' writes are high | ||||
| prio? | ||||
| 
 | ||||
| <li>Pitfall: Server Processes.  Suppose emacs uses X windows to | ||||
| display.  The X server must serve requests from many clients.  Does it | ||||
| know that emacs' requests should be given priority?  Does the OS know | ||||
| to raise X's priority when it is serving emacs?  Similarly for DNS, | ||||
| and NFS.  Does the network know to give emacs' NFS requests priority? | ||||
| 
 | ||||
| </ul> | ||||
| 
 | ||||
| <p>In short, scheduling is a system problem. There are many | ||||
| schedulers; they interact.  The CPU scheduler is usually the easy | ||||
| part.  The hardest part is system structure.  For example, the | ||||
| <i>existence</i> of interrupts is bad for scheduling.  Conflicting | ||||
| goals may limit effectiveness. | ||||
| 
 | ||||
| <h2>Case study: modern UNIX</h2> | ||||
| 
 | ||||
| <p>Goals:  | ||||
| <ul> | ||||
| <li>Simplicity (e.g. avoid complex locking regimes).   | ||||
| <li>Quick response to device interrupts.   | ||||
| <li> Favor interactive response.   | ||||
| </ul>  | ||||
| 
 | ||||
| <p>UNIX has a number of execution environments.  We care about | ||||
| scheduling transitions among them.  Some transitions aren't possible, | ||||
some can't be controlled. The execution environments are:
| 
 | ||||
| <ul> | ||||
| <li>Process, user half | ||||
| <li>Process, kernel half | ||||
| <li>Soft interrupts: timer, network | ||||
| <li>Device interrupts | ||||
| </ul> | ||||
| 
 | ||||
| <p>The rules are: | ||||
| <ul> | ||||
| <li>User is pre-emptible. | ||||
| <li>Kernel half and software interrupts are not pre-emptible. | ||||
| <li>Device handlers may not make blocking calls (e.g., sleep) | ||||
| <li>Effective priorities: intr > soft intr > kernel half > user | ||||
| </ul>   | ||||
| 
 | ||||
| 
 | ||||
| <p>Rules are implemented as follows: | ||||
| 
 | ||||
| <ul> | ||||
| 
 | ||||
| <li>UNIX: Process User Half.  Runs in process address space, on | ||||
| per-process stack.  Interruptible.  Pre-emptible: interrupt may cause | ||||
| context switch.  We don't trust user processes to yield CPU. | ||||
| Voluntarily enters kernel half via system calls and faults. | ||||
| 
 | ||||
| <li>UNIX: Process Kernel Half. Runs in kernel address space, on | ||||
| per-process kernel stack.  Executes system calls and faults for its | ||||
| process.  Interruptible (but can defer interrupts in critical | ||||
| sections).  Not pre-emptible.  Only yields voluntarily, when waiting | ||||
| for an event. E.g. disk I/O done.  This simplifies concurrency | ||||
| control; locks often not required.  No user process runs if any kernel | ||||
half wants to run.  Many processes' kernel halves may be sleeping in the
kernel.
| 
 | ||||
| <li>UNIX: Device Interrupts. Hardware asks CPU for an interrupt to ask | ||||
| for attention.  Disk read/write completed, or network packet received. | ||||
| Runs in kernel space, on special interrupt stack.  Interrupt routine | ||||
| cannot block; must return.  Interrupts are interruptible.  They nest | ||||
| on the one interrupt stack.  Interrupts are not pre-emptible, and | ||||
| cannot really yield.  The real-time clock is a device and interrupts | ||||
| every 10ms (or whatever).  Process scheduling decisions can be made | ||||
| when interrupt returns (e.g. wake up the process waiting for this | ||||
| event).  You want interrupt processing to be fast, since it has | ||||
| priority.  Don't do any more work than you have to.  You're blocking | ||||
processes and other interrupts.  Typically, an interrupt does the
minimal work necessary to keep the device happy, and then calls wakeup
on a thread.
| 
 | ||||
<li>UNIX: Soft Interrupts.  (Didn't exist in xv6.) Used when device
handling is expensive but there is no obvious process context in which to
run it.  Examples include IP forwarding and TCP input processing.  Runs in
kernel space, on the interrupt stack.  Interruptible.  Not pre-emptible,
can't really yield.  Triggered by a hardware interrupt.  Called when the
outermost hardware interrupt returns.  Periodic scheduling decisions
are made in the timer s/w interrupt.  Scheduled by the hardware timer
interrupt (i.e., if the current process has run long enough, switch).
| </ul> | ||||
| 
 | ||||
| <p>Is this good software structure?  Let's talk about receive | ||||
| livelock. | ||||
| 
 | ||||
| <h2>Paper discussion</h2> | ||||
| 
 | ||||
| <ul> | ||||
| 
 | ||||
<li>What is the application that the paper is addressing? IP forwarding.
What functionality does a network interface offer to the driver?
| <ul> | ||||
| <li> Read packets | ||||
| <li> Poke hardware to send packets | ||||
| <li> Interrupts when packet received/transmit complete | ||||
| <li> Buffer many input packets | ||||
| </ul> | ||||
| 
 | ||||
<li>What devices in the 6.828 kernel are interrupt driven?  Which ones
use polling?  Is this ideal?
| 
 | ||||
<li>Explain Figure 6-1.  Why does it go up?  What determines how high
the peak is?  Why does it go down?  What determines how fast it goes
down? Answer:
| <pre> | ||||
| (fraction of packets discarded)(work invested in discarded packets) | ||||
|            ------------------------------------------- | ||||
|               (total work CPU is capable of) | ||||
| </pre> | ||||
| 
 | ||||
| <li>Suppose I wanted to test an NFS server for livelock. | ||||
| <pre> | ||||
|   Run client with this loop: | ||||
|     while(1){ | ||||
|       send NFS READ RPC; | ||||
|       wait for response; | ||||
|     } | ||||
| </pre> | ||||
| What would I see?  Is the NFS server probably subject to livelock? | ||||
| (No--offered load subject to feedback). | ||||
| 
 | ||||
| <li>What other problems are we trying to address? | ||||
| <ul>  | ||||
| <li>Increased latency for packet delivery and forwarding (e.g., start | ||||
| disk head moving  when first NFS read request comes) | ||||
| <li>Transmit starvation  | ||||
| <li>User-level CPU starvation | ||||
| </ul> | ||||
| 
 | ||||
| <li>Why not tell the O/S scheduler to give interrupts lower priority? | ||||
| Non-preemptible. | ||||
| Could you fix this by making interrupts faster? (Maybe, if coupled | ||||
| with some limit on input rate.)   | ||||
| 
 | ||||
<li>Why not completely process each packet in the interrupt handler?
(I.e., forward it?) Other parts of the kernel don't expect to run at high
interrupt level (e.g., some packet processing code might invoke a function
that sleeps). Still might want an output queue.
| 
 | ||||
| <li>What about using polling instead of interrupts?  Solves overload | ||||
| problem, but killer for latency. | ||||
| 
 | ||||
| <li>What's the paper's solution? | ||||
| <ul> | ||||
| <li>No IP input queue. | ||||
| <li>Input processing and device input polling in kernel thread. | ||||
| <li>Device receive interrupt just wakes up thread. And leaves | ||||
| interrupts *disabled* for that device. | ||||
| <li>Thread does all input processing, then re-enables interrupts. | ||||
| </ul> | ||||
| <p>Why does this work?  What happens when packets arrive too fast? | ||||
| What happens when packets arrive slowly? | ||||
| 
 | ||||
| <li>Explain Figure 6-3. | ||||
| <ul> | ||||
| <li>Why does "Polling (no quota)" work badly? (Input still starves | ||||
| xmit complete processing.) | ||||
| <li>Why does it immediately fall to zero, rather than gradually decreasing? | ||||
| (xmit complete processing must be very cheap compared to input.) | ||||
| </ul> | ||||
| 
 | ||||
| <li>Explain Figure 6-4. | ||||
| <ul> | ||||
| 
 | ||||
| <li>Why does "Polling, no feedback" behave badly? There's a queue in | ||||
| front of screend.  We can still give 100% to input thread, 0% to | ||||
| screend. | ||||
| 
 | ||||
| <li>Why does "Polling w/ feedback" behave well? Input thread yields | ||||
| when queue to screend fills. | ||||
| 
 | ||||
| <li>What if screend hangs, what about other consumers of packets? | ||||
| (e.g., can you ssh to machine to fix screend?)  Fortunately screend | ||||
| typically is only application. Also, re-enable input after timeout. | ||||
| 
 | ||||
| </ul> | ||||
| 
 | ||||
| <li>Why are the two solutions different? | ||||
| <ol> | ||||
| <li> Polling thread <i>with quotas</i>. | ||||
| <li> Feedback from full queue. | ||||
| </ol> | ||||
| (I believe they should have used #2 for both.) | ||||
| 
 | ||||
<li>If we apply the proposed fixes, does the phenomenon totally go
 away? (e.g., for a web server that waits for disk, &c.)
| <ul> | ||||
| <li>Can the net device throw away packets without slowing down host? | ||||
<li>Problem: We want to drop packets for applications with big queues.
But it requires work to determine which application a packet belongs to.
Solution: NI-LRP (have the network interface sort packets).
| </ul> | ||||
| 
 | ||||
| <li>What about latency question?  (Look at figure 14 p. 243.) | ||||
| <ul> | ||||
<li>1st packet looks like an improvement over non-polling. But the 2nd
packet is transmitted later with polling.  Why? (No new packets are added
to the xmit buffer until the xmit interrupt.)
<li>Why?  In traditional BSD, to
amortize the cost of poking the device. Maybe better to poke a second time
anyway.
| </ul> | ||||
| 
 | ||||
| <li>What if processing has more complex structure? | ||||
| <ul> | ||||
| <li>Chain of processing stages with queues? Does feedback work? | ||||
|     What happens when a late stage is slow? | ||||
<li>Split at some point, multiple parallel paths? Not so great; one
    slow path blocks all paths.
| </ul> | ||||
| 
 | ||||
| <li>Can we formulate any general principles from paper? | ||||
| <ul> | ||||
| <li>Don't spend time on new work before completing existing work. | ||||
| <li>Or give new work lower priority than partially-completed work. | ||||
| </ul> | ||||
| 
 | ||||
| </ul> | ||||
							
								
								
									
web/l-threads.html (new file, 316 lines)
							|  | @ -0,0 +1,316 @@ | |||
<html>
<head>
<title>Threads, processes, and context switching</title>
</head>
| <body> | ||||
| 
 | ||||
| <h1>Threads, processes, and context switching</h1> | ||||
| 
 | ||||
| <p>Required reading: proc.c (focus on scheduler() and sched()), | ||||
| setjmp.S, and sys_fork (in sysproc.c) | ||||
| 
 | ||||
| <h2>Overview</h2> | ||||
| 
 | ||||
| 
 | ||||
| <p>Big picture: more programs than processors.  How to share the | ||||
| limited number of processors among the programs? | ||||
| 
 | ||||
| <p>Observation: most programs don't need the processor continuously, | ||||
| because they frequently have to wait for input (from user, disk, | ||||
| network, etc.) | ||||
| 
 | ||||
| <p>Idea: when one program must wait, it releases the processor, and | ||||
| gives it to another program. | ||||
| 
 | ||||
<p>Mechanism: thread of computation, an active computation.  A
thread is an abstraction that contains the minimal state that is
necessary to stop an active computation and resume it at some point later.
What that state is depends on the processor.  On x86, it is the
processor registers (see setjmp.S).
| 
 | ||||
| <p>Address spaces and threads: address spaces and threads are in | ||||
| principle independent concepts.  One can switch from one thread to | ||||
| another thread in the same address space, or one can switch from one | ||||
| thread to another thread in another address space.  Example: in xv6, | ||||
| one switches address spaces by switching segmentation registers (see | ||||
| setupsegs).  Does xv6 ever switch from one thread to another in the | ||||
| same address space? (Answer: yes, xv6 switches, for example, from the | ||||
| scheduler, proc[0], to the kernel part of init, proc[1].)  In the JOS | ||||
| kernel we switch from the kernel thread to a user thread, but we don't | ||||
| necessarily switch the kernel address space. | ||||
| 
 | ||||
| <p>Process: one address space plus one or more threads of computation. | ||||
| In xv6 all <i>user</i> programs contain one thread of computation and | ||||
| one address space, and the concepts of address space and threads of | ||||
| computation are not separated but bundled together in the concept of a | ||||
| process.  When switching from the kernel program (which has multiple | ||||
| threads) to a user program, xv6 switches threads (switching from a | ||||
| kernel stack to a user stack) and address spaces (the hardware switches | ||||
| from the kernel segment registers to the user segment registers). | ||||
| 
 | ||||
| <p>xv6 supports the following operations on processes: | ||||
| <ul> | ||||
| <li>fork: create a new process, which is a copy of the parent. | ||||
| <li>exec: execute a program. | ||||
| <li>exit: terminate the calling process. | ||||
| <li>wait: wait for a child process to terminate. | ||||
| <li>kill: kill a process. | ||||
| <li>sbrk: grow the address space of a process. | ||||
| </ul> | ||||
| This interface doesn't separate threads and address spaces. For | ||||
| example, with this interface one cannot create additional threads in | ||||
| the same address space.  Modern Unixes provide additional primitives | ||||
| (called pthreads, POSIX threads) to create additional threads in a | ||||
| process and coordinate their activities. | ||||
| 
 | ||||
| <p>Scheduling.  The thread manager needs a method for deciding which | ||||
| thread to run if multiple threads are runnable.  The xv6 policy is to | ||||
| run the processes round robin. Why round robin?  What other methods | ||||
| can you imagine? | ||||
| 
 | ||||
| <p>Preemptive scheduling.  To force a thread to release the processor | ||||
| periodically (in case the thread never calls sleep), a thread manager | ||||
| can use preemptive scheduling.  The thread manager uses the clock chip | ||||
| to generate periodically a hardware interrupt, which will cause | ||||
| control to transfer to the thread manager, which then can decide to | ||||
| run another thread (e.g., see trap.c). | ||||
| 
 | ||||
| <h2>xv6 code examples</h2> | ||||
| 
 | ||||
| <p>Thread switching is implemented in xv6 using setjmp and longjmp, | ||||
| which take a jumpbuf as an argument.  setjmp saves its context in a | ||||
| jumpbuf for later use by longjmp.  longjmp restores the context saved | ||||
| by the last setjmp.  It then causes execution to continue as if the | ||||
| call of setjmp had just returned 1. | ||||
| <ul> | ||||
| <li>setjmp saves: ebx, ecx, edx, esi, edi, esp, ebp, and eip. | ||||
| <li>longjmp restores them, and puts 1 in eax! | ||||
| </ul> | ||||
| 
 | ||||
| <p> Example of thread switching: proc[0] switches to scheduler: | ||||
| <ul> | ||||
| <li>1359: proc[0] calls iget, which calls sleep, which calls sched. | ||||
| <li>2261: The stack before the call to setjmp in sched is: | ||||
| <pre> | ||||
| CPU 0: | ||||
| eax: 0x10a144   1089860 | ||||
| ecx: 0x6c65746e 1818588270 | ||||
| edx: 0x0        0 | ||||
| ebx: 0x10a0e0   1089760 | ||||
| esp: 0x210ea8   2166440 | ||||
| ebp: 0x210ebc   2166460 | ||||
| esi: 0x107f20   1081120 | ||||
| edi: 0x107740   1079104 | ||||
| eip: 0x1023c9   | ||||
| eflags 0x12       | ||||
| cs:  0x8        | ||||
| ss:  0x10       | ||||
| ds:  0x10       | ||||
| es:  0x10       | ||||
| fs:  0x10       | ||||
| gs:  0x10       | ||||
|    00210ea8 [00210ea8]  10111e | ||||
|    00210eac [00210eac]  210ebc | ||||
|    00210eb0 [00210eb0]  10239e | ||||
|    00210eb4 [00210eb4]  0001 | ||||
|    00210eb8 [00210eb8]  10a0e0 | ||||
|    00210ebc [00210ebc]  210edc | ||||
|    00210ec0 [00210ec0]  1024ce | ||||
|    00210ec4 [00210ec4]  1010101 | ||||
|    00210ec8 [00210ec8]  1010101 | ||||
|    00210ecc [00210ecc]  1010101 | ||||
|    00210ed0 [00210ed0]  107740 | ||||
|    00210ed4 [00210ed4]  0001 | ||||
|    00210ed8 [00210ed8]  10cd74 | ||||
|    00210edc [00210edc]  210f1c | ||||
|    00210ee0 [00210ee0]  100bbc | ||||
|    00210ee4 [00210ee4]  107740 | ||||
| </pre> | ||||
| <li>2517: stack at beginning of setjmp: | ||||
| <pre> | ||||
| CPU 0: | ||||
| eax: 0x10a144   1089860 | ||||
| ecx: 0x6c65746e 1818588270 | ||||
| edx: 0x0        0 | ||||
| ebx: 0x10a0e0   1089760 | ||||
| esp: 0x210ea0   2166432 | ||||
| ebp: 0x210ebc   2166460 | ||||
| esi: 0x107f20   1081120 | ||||
| edi: 0x107740   1079104 | ||||
| eip: 0x102848   | ||||
| eflags 0x12       | ||||
| cs:  0x8        | ||||
| ss:  0x10       | ||||
| ds:  0x10       | ||||
| es:  0x10       | ||||
| fs:  0x10       | ||||
| gs:  0x10       | ||||
|    00210ea0 [00210ea0]  1023cf   <--- return address (sched) | ||||
|    00210ea4 [00210ea4]  10a144 | ||||
|    00210ea8 [00210ea8]  10111e | ||||
|    00210eac [00210eac]  210ebc | ||||
|    00210eb0 [00210eb0]  10239e | ||||
|    00210eb4 [00210eb4]  0001 | ||||
|    00210eb8 [00210eb8]  10a0e0 | ||||
|    00210ebc [00210ebc]  210edc | ||||
|    00210ec0 [00210ec0]  1024ce | ||||
|    00210ec4 [00210ec4]  1010101 | ||||
|    00210ec8 [00210ec8]  1010101 | ||||
|    00210ecc [00210ecc]  1010101 | ||||
|    00210ed0 [00210ed0]  107740 | ||||
|    00210ed4 [00210ed4]  0001 | ||||
|    00210ed8 [00210ed8]  10cd74 | ||||
|    00210edc [00210edc]  210f1c | ||||
| </pre> | ||||
| <li>2519: What is saved in jmpbuf of proc[0]? | ||||
| <li>2529: return 0! | ||||
| <li>2534: What is in jmpbuf of cpu 0?  The stack is as follows: | ||||
| <pre> | ||||
| CPU 0: | ||||
| eax: 0x0        0 | ||||
| ecx: 0x6c65746e 1818588270 | ||||
| edx: 0x108aa4   1084068 | ||||
| ebx: 0x10a0e0   1089760 | ||||
| esp: 0x210ea0   2166432 | ||||
| ebp: 0x210ebc   2166460 | ||||
| esi: 0x107f20   1081120 | ||||
| edi: 0x107740   1079104 | ||||
| eip: 0x10286e   | ||||
| eflags 0x46       | ||||
| cs:  0x8        | ||||
| ss:  0x10       | ||||
| ds:  0x10       | ||||
| es:  0x10       | ||||
| fs:  0x10       | ||||
| gs:  0x10       | ||||
|    00210ea0 [00210ea0]  1023fe | ||||
|    00210ea4 [00210ea4]  108aa4 | ||||
|    00210ea8 [00210ea8]  10111e | ||||
|    00210eac [00210eac]  210ebc | ||||
|    00210eb0 [00210eb0]  10239e | ||||
|    00210eb4 [00210eb4]  0001 | ||||
|    00210eb8 [00210eb8]  10a0e0 | ||||
|    00210ebc [00210ebc]  210edc | ||||
|    00210ec0 [00210ec0]  1024ce | ||||
|    00210ec4 [00210ec4]  1010101 | ||||
|    00210ec8 [00210ec8]  1010101 | ||||
|    00210ecc [00210ecc]  1010101 | ||||
|    00210ed0 [00210ed0]  107740 | ||||
|    00210ed4 [00210ed4]  0001 | ||||
|    00210ed8 [00210ed8]  10cd74 | ||||
|    00210edc [00210edc]  210f1c | ||||
| </pre> | ||||
| <li>2547: return 1! stack looks as follows: | ||||
| <pre> | ||||
| CPU 0: | ||||
| eax: 0x1        1 | ||||
| ecx: 0x108aa0   1084064 | ||||
| edx: 0x108aa4   1084068 | ||||
| ebx: 0x10074    65652 | ||||
| esp: 0x108d40   1084736 | ||||
| ebp: 0x108d5c   1084764 | ||||
| esi: 0x10074    65652 | ||||
| edi: 0xffde     65502 | ||||
| eip: 0x102892   | ||||
| eflags 0x6        | ||||
| cs:  0x8        | ||||
| ss:  0x10       | ||||
| ds:  0x10       | ||||
| es:  0x10       | ||||
| fs:  0x10       | ||||
| gs:  0x10       | ||||
|    00108d40 [00108d40]  10231c | ||||
|    00108d44 [00108d44]  10a144 | ||||
|    00108d48 [00108d48]  0010 | ||||
|    00108d4c [00108d4c]  0021 | ||||
|    00108d50 [00108d50]  0000 | ||||
|    00108d54 [00108d54]  0000 | ||||
|    00108d58 [00108d58]  10a0e0 | ||||
|    00108d5c [00108d5c]  0000 | ||||
|    00108d60 [00108d60]  0001 | ||||
|    00108d64 [00108d64]  0000 | ||||
|    00108d68 [00108d68]  0000 | ||||
|    00108d6c [00108d6c]  0000 | ||||
|    00108d70 [00108d70]  0000 | ||||
|    00108d74 [00108d74]  0000 | ||||
|    00108d78 [00108d78]  0000 | ||||
|    00108d7c [00108d7c]  0000 | ||||
| </pre> | ||||
| <li>2548: where will longjmp return? (answer: 10231c, in scheduler) | ||||
| <li>2233: Scheduler on each processor selects, in round-robin fashion, the | ||||
|   first runnable process.  Which process will that be? (If we are | ||||
|   running with one processor.)  (Ans: proc[0].) | ||||
| <li>2229: what will be saved in cpu's jmpbuf? | ||||
| <li>What is in proc[0]'s jmpbuf?  | ||||
| <li>2548: return 1. Stack looks as follows: | ||||
| <pre> | ||||
| CPU 0: | ||||
| eax: 0x1        1 | ||||
| ecx: 0x6c65746e 1818588270 | ||||
| edx: 0x0        0 | ||||
| ebx: 0x10a0e0   1089760 | ||||
| esp: 0x210ea0   2166432 | ||||
| ebp: 0x210ebc   2166460 | ||||
| esi: 0x107f20   1081120 | ||||
| edi: 0x107740   1079104 | ||||
| eip: 0x102892   | ||||
| eflags 0x2        | ||||
| cs:  0x8        | ||||
| ss:  0x10       | ||||
| ds:  0x10       | ||||
| es:  0x10       | ||||
| fs:  0x10       | ||||
| gs:  0x10       | ||||
|    00210ea0 [00210ea0]  1023cf   <--- return to sleep | ||||
|    00210ea4 [00210ea4]  108aa4 | ||||
|    00210ea8 [00210ea8]  10111e | ||||
|    00210eac [00210eac]  210ebc | ||||
|    00210eb0 [00210eb0]  10239e | ||||
|    00210eb4 [00210eb4]  0001 | ||||
|    00210eb8 [00210eb8]  10a0e0 | ||||
|    00210ebc [00210ebc]  210edc | ||||
|    00210ec0 [00210ec0]  1024ce | ||||
|    00210ec4 [00210ec4]  1010101 | ||||
|    00210ec8 [00210ec8]  1010101 | ||||
|    00210ecc [00210ecc]  1010101 | ||||
|    00210ed0 [00210ed0]  107740 | ||||
|    00210ed4 [00210ed4]  0001 | ||||
|    00210ed8 [00210ed8]  10cd74 | ||||
|    00210edc [00210edc]  210f1c | ||||
| </pre> | ||||
| </ul> | ||||
| 
 | ||||
| <p>Why switch from proc[0] to the processor stack, and then to | ||||
|   proc[0]'s stack?  Why not instead run the scheduler on the kernel | ||||
|   stack of the last process that ran on that cpu? | ||||
| 
 | ||||
| <ul> | ||||
| 
 | ||||
| <li>If the scheduler wanted to use the process stack, then it couldn't | ||||
|   have any stack variables live across process scheduling, since | ||||
|   they'd be different depending on which process just stopped running. | ||||
| 
 | ||||
| <li>Suppose process p goes to sleep on CPU1, so CPU1 is idling in | ||||
|   scheduler() on p's stack. Someone wakes up p. CPU2 decides to run | ||||
|   p. Now p is running on its stack, and CPU1 is also running on the | ||||
|   same stack. They will likely scribble on each others' local | ||||
|   variables, return pointers, etc. | ||||
| 
 | ||||
| <li>The same thing happens if CPU1 tries to reuse the process's page | ||||
| tables to avoid a TLB flush.  If the process gets killed and cleaned | ||||
| up by the other CPU, now the page tables are wrong.  I think some OSes | ||||
| actually do this (with appropriate ref counting). | ||||
| 
 | ||||
| </ul> | ||||
| 
 | ||||
| <p>How is preemptive scheduling implemented in xv6?  Answer: see trap.c | ||||
|   lines 2905 through 2917, and the implementation of yield() on sheet | ||||
|   22. | ||||
| 
 | ||||
| <p>How long is a timeslice for a user process?  (Possibly very short; | ||||
|   a very important lock is held across the context switch!) | ||||
| 
 | ||||
| </body> | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
							
								
								
									
462  web/l-vm.html  Normal file
							|  | @ -0,0 +1,462 @@ | |||
| <html> | ||||
| <head> | ||||
| <title>Virtual Machines</title> | ||||
| </head> | ||||
| 
 | ||||
| <body> | ||||
| 
 | ||||
| <h1>Virtual Machines</h1> | ||||
| 
 | ||||
| <p>Required reading: Disco</p> | ||||
| 
 | ||||
| <h2>Overview</h2> | ||||
| 
 | ||||
| <p>What is a virtual machine? IBM definition: a fully protected and | ||||
| isolated copy of the underlying machine's hardware.</p> | ||||
| 
 | ||||
| <p>Another view is that it provides another example of a kernel API. | ||||
| In contrast to other kernel APIs (unix, microkernel, and exokernel), | ||||
| the virtual machine operating system exports as the kernel API the | ||||
| processor API (e.g., the x86 interface).  Thus, each program running | ||||
| in user space sees the services offered by a processor, and each | ||||
| program sees its own processor.  Of course, we don't want to make a | ||||
| system call for each instruction, and in fact one of the main | ||||
| challenges in virtual machine operating systems is to design the | ||||
| system in such a way that the physical processor executes the virtual | ||||
| processor API directly, at processor speed. | ||||
| 
 | ||||
| <p> | ||||
| Virtual machines can be useful for a number of reasons: | ||||
| <ol> | ||||
| 
 | ||||
| <li>Run multiple operating systems on a single piece of hardware.  For | ||||
| example, in one process you run Linux, and in another you run | ||||
| Windows/XP.  If the kernel API is identical to the x86 (and faithfully | ||||
| emulates x86 instructions, state, protection levels, and page tables), | ||||
| then the virtual machine operating system can run these <i>guest</i> | ||||
| operating systems (e.g., Linux and Windows/XP) without modifications. | ||||
| 
 | ||||
| <ul> | ||||
| <li>Run "older" programs on the same hardware (e.g., run one x86 | ||||
| virtual machine in real mode to execute old DOS apps). | ||||
| 
 | ||||
| <li>Or run applications that require a different operating system. | ||||
| </ul> | ||||
| 
 | ||||
| <li>Fault isolation: like processes on UNIX but more complete, because | ||||
| the guest operating system runs on the virtual machine in user space. | ||||
| Thus, faults in the guest OS cannot affect any other software. | ||||
| 
 | ||||
| <li>Customizing the apparent hardware: a virtual machine may have a | ||||
| different view of the hardware than is physically present. | ||||
| 
 | ||||
| <li>Simplify deployment/development of software for scalable | ||||
| processors (e.g., Disco). | ||||
| 
 | ||||
| </ol> | ||||
| </p> | ||||
| 
 | ||||
| <p>If your operating system isn't a virtual machine operating system, | ||||
| what are the alternatives? Processor simulation (e.g., bochs) or | ||||
| binary emulation (WINE).  Simulation runs instructions purely in | ||||
| software and is slow (e.g., a 100x slowdown for bochs); virtualization | ||||
| gets out of the way whenever possible and can be efficient.  | ||||
| 
 | ||||
| <p>Simulation gives portability whereas virtualization focuses on | ||||
| performance. However, this means that you need to model your hardware | ||||
| very carefully in software. Binary emulation focuses on just emulating | ||||
| the system-call interface of a particular operating system. Binary | ||||
| emulation can be hard because it is targeted towards a particular | ||||
| operating system (and even that can change between revisions). | ||||
| </p> | ||||
| 
 | ||||
| <p>To provide each process with its own virtual processor that exports | ||||
| the same API as the physical processor, what features must | ||||
| the virtual machine operating system virtualize? | ||||
| <ol> | ||||
| <li>CPU: instructions -- trap all privileged instructions</li> | ||||
| <li>Memory: address spaces -- map "physical" pages managed | ||||
| by the guest OS to <i>machine</i> pages, handle translation, etc.</li> | ||||
| <li>Devices: any I/O communication needs to be trapped and passed | ||||
|     through/handled appropriately.</li> | ||||
| </ol> | ||||
| </p> | ||||
| The software that implements the virtualization is typically called | ||||
| the monitor, instead of the virtual machine operating system. | ||||
| 
 | ||||
| <p>Virtual machine monitors (VMM) can be implemented in two ways: | ||||
| <ol> | ||||
| <li>Run VMM directly on hardware: like Disco.</li> | ||||
| <li>Run VMM as an application (though still running as root, with | ||||
|     integration into OS) on top of a <i>host</i> OS: like VMware. Provides | ||||
|     additional hardware support at low development cost in | ||||
|     VMM. Intercept CPU-level I/O requests and translate them into | ||||
|     system calls (e.g. <code>read()</code>).</li> | ||||
| </ol> | ||||
| </p> | ||||
| 
 | ||||
| <p>The three primary functions of a virtual machine monitor are: | ||||
| <ul> | ||||
| <li>virtualize processor (CPU, memory, and devices) | ||||
| <li>dispatch events (e.g., forward page fault trap to guest OS). | ||||
| <li>allocate resources (e.g., divide real memory in some way between | ||||
| the physical memory of each guest OS). | ||||
| </ul> | ||||
| 
 | ||||
| <h2>Virtualization in detail</h2> | ||||
| 
 | ||||
| <h3>Memory virtualization</h3> | ||||
| 
 | ||||
| <p> | ||||
| Understanding memory virtualization. Let's consider the MIPS example | ||||
| from the paper. Ideally, we'd be able to intercept and rewrite all | ||||
| memory address references. (e.g., by intercepting virtual memory | ||||
| calls). Why can't we do this on the MIPS? (There are addresses that | ||||
| don't go through address translation --- but we don't want the virtual | ||||
| machine to directly access memory!)  What does Disco do to get around | ||||
| this problem?  (Relink the kernel outside this address space.) | ||||
| </p> | ||||
| 
 | ||||
| <p> | ||||
| Having gotten around that problem, how do we handle things in general? | ||||
| </p> | ||||
| <pre> | ||||
| // Disco's tlb miss handler. | ||||
| // Called when a memory reference for virtual address | ||||
| // 'VA' is made, but there is no VA->MA (virtual -> machine) | ||||
| // mapping in the cpu's TLB. | ||||
| void tlb_miss_handler (VA) | ||||
| { | ||||
|   // see if we have a mapping in our "shadow" tlb (which includes | ||||
|   // "main" tlb) | ||||
|   tlb_entry *t = tlb_lookup (thiscpu->l2tlb, VA); | ||||
|   if (t && defined (thiscpu->pmap[t->pa]))   // is there a MA for this PA? | ||||
|     tlbwrite (VA, thiscpu->pmap[t->pa], t->otherdata); | ||||
|   else if (t) | ||||
|     // get a machine page, copy physical page into, and tlbwrite | ||||
|   else | ||||
|     // trap to the virtual CPU/OS's handler | ||||
| } | ||||
| 
 | ||||
| // Disco's procedure which emulates the MIPS | ||||
| // instruction which writes to the tlb. | ||||
| // | ||||
| // VA -- virtual address | ||||
| // PA -- physical address (NOT MA machine address!) | ||||
| // otherdata -- perms and stuff | ||||
| void emulate_tlbwrite_instruction (VA, PA, otherdata) | ||||
| { | ||||
|   tlb_insert (thiscpu->l2tlb, VA, PA, otherdata); // cache | ||||
|   if (!defined (thiscpu->pmap[PA])) { // fill in pmap dynamically | ||||
|     MA = allocate_machine_page (); | ||||
|     thiscpu->pmap[PA] = MA; // See 4.2.2 | ||||
|     thiscpu->pmapbackmap[MA] = PA; | ||||
|     thiscpu->memmap[MA] = VA; // See 4.2.3 (for TLB shootdowns) | ||||
|   } | ||||
|   tlbwrite (VA, thiscpu->pmap[PA], otherdata); | ||||
| } | ||||
| 
 | ||||
| // Disco's procedure which emulates the MIPS | ||||
| // instruction which read the tlb. | ||||
| tlb_entry *emulate_tlbread_instruction (VA) | ||||
| { | ||||
|   // Must return a TLB entry that has a "Physical" address; | ||||
|   // This is recorded in our secondary TLB cache. | ||||
|   // (We don't have to read from the hardware TLB since | ||||
|   // all writes to the hardware TLB are mediated by Disco. | ||||
|   // Thus we can always keep the l2tlb up to date.) | ||||
|   return tlb_lookup (thiscpu->l2tlb, VA); | ||||
| } | ||||
| </pre> | ||||
| 
 | ||||
| <h3>CPU virtualization</h3> | ||||
| 
 | ||||
| <p>Requirements: | ||||
| <ol> | ||||
| <li>Results of executing non-privileged instructions in privileged and | ||||
|     user mode must be equivalent. (Why? Because the virtual "privileged" | ||||
|     system will not be running in true "privileged" mode.) | ||||
| <li>There must be a way to protect the VM from the real machine. (Some | ||||
|     sort of memory protection/address translation. For fault isolation.)</li> | ||||
| <li>There must be a way to detect and transfer control to the VMM when | ||||
|     the VM tries to execute a sensitive instruction (e.g. a privileged | ||||
|     instruction, or one that could expose the "virtualness" of the | ||||
|     VM.) It must be possible to emulate these instructions in | ||||
|     software. Can be classified into completely virtualizable | ||||
|     (i.e. there are protection mechanisms that cause traps for all | ||||
|     instructions), partly (insufficient or incomplete trap | ||||
|     mechanisms), or not at all (e.g. no MMU). | ||||
| </ol> | ||||
| </p> | ||||
| 
 | ||||
| <p>The MIPS didn't quite meet the second criterion, as discussed | ||||
| above. But, it does have a supervisor mode that is between user mode and | ||||
| kernel mode where any privileged instruction will trap.</p> | ||||
| 
 | ||||
| <p>What might the VMM trap handler look like?</p> | ||||
| <pre> | ||||
| void privilege_trap_handler (addr) { | ||||
|   instruction, args = decode_instruction (addr) | ||||
|   switch (instruction) { | ||||
|   case foo: | ||||
|     emulate_foo (thiscpu, args, ...); | ||||
|     break; | ||||
|   case bar: | ||||
|     emulate_bar (thiscpu, args, ...); | ||||
|     break; | ||||
|   case ...: | ||||
|     ... | ||||
|   } | ||||
| } | ||||
| </pre> | ||||
| <p>The <code>emulate_foo</code> bits will have to evaluate the | ||||
| state of the virtual CPU and compute the appropriate "fake" answer. | ||||
| </p> | ||||
| 
 | ||||
| <p>What sort of state is needed in order to appropriately emulate all | ||||
| of these things? | ||||
| <pre> | ||||
| - all user registers | ||||
| - CPU specific regs (e.g. on x86, %crN, debugging, FP...) | ||||
| - page tables (or tlb) | ||||
| - interrupt tables | ||||
| </pre> | ||||
| This is needed for each virtual processor. | ||||
| </p> | ||||
| 
 | ||||
| <h3>Device I/O virtualization</h3> | ||||
| 
 | ||||
| <p>We intercept all communication to the I/O devices: read/writes to | ||||
| reserved memory addresses cause page faults into special handlers | ||||
| which will emulate or pass through I/O as appropriate. | ||||
| </p> | ||||
| 
 | ||||
| <p> | ||||
| In a system like Disco, the sequence would look something like: | ||||
| <ol> | ||||
| <li>VM executes instruction to access I/O</li> | ||||
| <li>Trap generated by CPU (based on memory or privilege protection) | ||||
|     transfers control to VMM.</li> | ||||
| <li>VMM emulates I/O instruction, saving information about where this | ||||
|     came from (for demultiplexing the async reply from hardware later).</li> | ||||
| <li>VMM reschedules a VM.</li> | ||||
| </ol> | ||||
| </p> | ||||
| 
 | ||||
| <p> | ||||
| Interrupts will require some additional work: | ||||
| <ol> | ||||
| <li>Interrupt occurs on real machine, transferring control to VMM | ||||
|     handler.</li> | ||||
| <li>VMM determines the VM that ought to receive this interrupt.</li> | ||||
| <li>VMM causes a simulated interrupt to occur in the VM, and reschedules a | ||||
|     VM.</li> | ||||
| <li>VM runs its interrupt handler, which may involve other I/O | ||||
|     instructions that need to be trapped.</li> | ||||
| </ol> | ||||
| </p> | ||||
| 
 | ||||
| <p> | ||||
| The above can be slow! So sometimes you want the guest operating | ||||
| system to be aware that it is a guest and allow it to avoid the slow | ||||
| path, e.g., via special device drivers, or by replacing instructions | ||||
| that would cause traps with ordinary memory read/write instructions. | ||||
| </p> | ||||
| 
 | ||||
| <h2>Intel x86/vmware</h2> | ||||
| 
 | ||||
| <p>VMware, unlike Disco, runs as an application on a host OS and | ||||
| cannot modify the guest OS.  Furthermore, it must virtualize the x86 | ||||
| instead of the MIPS processor.  Both of these differences pose good | ||||
| design challenges. | ||||
| 
 | ||||
| <p>The first challenge is that the monitor runs in user space, yet it | ||||
| must dispatch traps and it must execute privilege instructions, which | ||||
| both require kernel privileges.  To address this challenge, the | ||||
| monitor downloads a piece of code, a kernel module, into the host | ||||
| OS. Most modern operating systems are constructed as a core kernel, | ||||
| extended with downloadable kernel modules. | ||||
| Privileged users can insert kernel modules at run-time. | ||||
| 
 | ||||
| <p>The monitor downloads a kernel module that reads the IDT, copies | ||||
| it, and overwrites the hard-wired entries with addresses for stubs in | ||||
| the just downloaded kernel module.  When a trap happens, the kernel | ||||
| module inspects the PC, and either forwards the trap to the monitor | ||||
| running in user space or to the guest OS.  If the trap is caused | ||||
| because a guest OS executes a privileged instruction, the monitor can | ||||
| emulate that privileged instruction by asking the kernel module to | ||||
| perform that instruction (perhaps after modifying the arguments to | ||||
| the instruction). | ||||
| 
 | ||||
| <p>The second challenge is virtualizing the x86 | ||||
|   instructions. Unfortunately, the x86 doesn't meet the three | ||||
|   requirements for CPU virtualization listed above.  If you run | ||||
|   the CPU in ring 3, <i>most</i> x86 instructions will be fine, | ||||
|   because most privileged instructions will result in a trap, which | ||||
|   can then be forwarded to vmware for emulation.  For example, | ||||
|   consider a guest OS loading the root of a page table in CR3. This | ||||
|   results in a trap (the guest OS runs in user space), which is | ||||
|   forwarded to the monitor, which can emulate the load to CR3 as | ||||
|   follows: | ||||
| 
 | ||||
| <pre> | ||||
| // addr is a physical address | ||||
| void emulate_lcr3 (thiscpu, addr) | ||||
| { | ||||
|   thiscpu->cr3 = addr; | ||||
|   Pte *fakepdir = lookup (addr, oldcr3cache); | ||||
|   if (!fakepdir) { | ||||
|     fakepdir = ppage_alloc (); | ||||
|     store (oldcr3cache, addr, fakepdir); | ||||
|     // May wish to scan through supplied page directory to see if | ||||
|     // we have to fix up anything in particular. | ||||
|     // Exact settings will depend on how we want to handle | ||||
|     // problem cases below and our own MM. | ||||
|   } | ||||
|   asm ("movl fakepdir,%cr3"); | ||||
|   // Must make sure our page fault handler is in sync with what we do here. | ||||
| } | ||||
| </pre> | ||||
| 
 | ||||
| <p>To virtualize the x86, the monitor must intercept any modifications | ||||
| to the page table and substitute appropriate responses. And update | ||||
| things like the accessed/dirty bits.  The monitor can arrange for this | ||||
| to happen by making all page table pages inaccessible so that it can | ||||
| emulate loads and stores to page table pages.  This setup allows the | ||||
| monitor to virtualize the memory interface of the x86.</p> | ||||
| 
 | ||||
| <p>Unfortunately, not all instructions that must be virtualized result | ||||
| in traps: | ||||
| <ul> | ||||
| <li><code>pushf/popf</code>: <code>FL_IF</code> is handled differently, | ||||
|     for example.  In user mode, setting FL_IF is just ignored.</li> | ||||
| <li>Anything (<code>push</code>, <code>pop</code>, <code>mov</code>) | ||||
|     that reads or writes from <code>%cs</code>, which contains the | ||||
|     privilege level. | ||||
| <li>Setting the interrupt enable bit in EFLAGS has different | ||||
| semantics in user space and kernel space.  In user space, it | ||||
| is ignored; in kernel space, the bit is set. | ||||
| <li>And some others... (total, 17 instructions). | ||||
| </ul> | ||||
| These instructions are unprivileged (i.e., they don't cause a | ||||
| trap when executed by a guest OS) but expose physical processor state. | ||||
| These could reveal details of virtualization that should not be | ||||
| revealed.  For example, if a guest OS sets the interrupt enable bit for | ||||
| its virtual x86, the virtualized EFLAGS should reflect that the bit is | ||||
| set, even though the guest OS is running in user space. | ||||
| 
 | ||||
| <p>How can we virtualize these instructions? An approach is to decode | ||||
| the instruction stream that is provided by the user and look for bad | ||||
| instructions. When we find them, replace them with an interrupt | ||||
| (<code>INT 3</code>) that will allow the VMM to handle it | ||||
| correctly. This might look something like: | ||||
| </p> | ||||
| 
 | ||||
| <pre> | ||||
| void initcode () { | ||||
|   scan_for_nonvirtualizable (thiscpu, 0x7c00); | ||||
| } | ||||
| 
 | ||||
| void scan_for_nonvirtualizable (thiscpu, startaddr) { | ||||
|   addr  = startaddr; | ||||
|   instr = disassemble (addr); | ||||
|   while (instr is not a branch and not bad) { | ||||
|     addr += len (instr); | ||||
|     instr = disassemble (addr); | ||||
|   } | ||||
|   // remember that we wanted to execute this instruction. | ||||
|   replace (addr, "int 3"); | ||||
|   record (thiscpu->rewrites, addr, instr); | ||||
| } | ||||
| 
 | ||||
| void breakpoint_handler (tf) { | ||||
|   oldinstr = lookup (thiscpu->rewrites, tf->eip); | ||||
|   if (oldinstr is branch) { | ||||
|     newcs:neweip = evaluate branch | ||||
|     scan_for_nonvirtualizable (thiscpu, newcs:neweip) | ||||
|     return; | ||||
|   } else { // something non virtualizable | ||||
|     // dispatch to appropriate emulation | ||||
|   } | ||||
| } | ||||
| </pre> | ||||
| <p>All pages must be scanned in this way. Fortunately, most pages | ||||
| probably are okay and don't really need any special handling so after | ||||
| scanning them once, we can just remember that the page is okay and let | ||||
| it run natively. | ||||
| </p> | ||||
| 
 | ||||
| <p>What if a guest OS generates instructions, writes them to memory, | ||||
| and then wants to execute them? We must detect self-modifying code | ||||
| (e.g. must simulate buffer overflow attacks correctly.) When a write | ||||
| to a physical page that happens to be in code segment happens, must | ||||
| trap the write and then rescan the affected portions of the page.</p> | ||||
| 
 | ||||
| <p>What about self-examining code? Need to protect it | ||||
| somehow---possibly by playing tricks with instruction/data TLB caches, or | ||||
| introducing a private segment for code (%cs) that is different than | ||||
| the segment used for reads/writes (%ds). | ||||
| </p> | ||||
| 
 | ||||
| <h2>Some Disco paper notes</h2> | ||||
| 
 | ||||
| <p> | ||||
| Disco has some I/O specific optimizations. | ||||
| </p> | ||||
| <ul> | ||||
| <li>Disk reads only need to happen once and can be shared between | ||||
|     virtual machines via copy-on-write virtual memory tricks.</li> | ||||
| <li>Network cards do not need to be fully virtualized --- intra | ||||
|     VM communication doesn't need a real network card backing it.</li> | ||||
| <li>Special handling for NFS so that all VMs "share" a buffer cache.</li> | ||||
| </ul> | ||||
| 
 | ||||
| <p> | ||||
| Disco developers clearly had access to IRIX source code. | ||||
| </p> | ||||
| <ul> | ||||
| <li>Need to deal with KSEG0 segment of MIPS memory by relinking kernel | ||||
|     at different address space.</li> | ||||
| <li>Ensuring page-alignment of network writes (for the purposes of | ||||
|     doing memory map tricks.)</li> | ||||
| </ul> | ||||
| 
 | ||||
| <p>Performance?</p> | ||||
| <ul> | ||||
| <li>Evaluated in simulation.</li> | ||||
| <li>Where are the overheads? Where do they come from?</li> | ||||
| <li>Does it run better than NUMA IRIX?</li> | ||||
| </ul> | ||||
| 
 | ||||
<p>Premise.  Are virtual machines the preferred approach to extending
operating systems?  Have scalable multiprocessors materialized?</p>
| 
 | ||||
| <h2>Related papers</h2> | ||||
| 
 | ||||
| <p>John Scott Robin, Cynthia E. Irvine. <a | ||||
| href="http://www.cs.nps.navy.mil/people/faculty/irvine/publications/2000/VMM-usenix00-0611.pdf">Analysis of the | ||||
| Intel Pentium's Ability to Support a Secure Virtual Machine | ||||
| Monitor</a>.</p> | ||||
| 
 | ||||
| <p>Jeremy Sugerman,  Ganesh Venkitachalam, Beng-Hong Lim. <a | ||||
| href="http://www.vmware.com/resources/techresources/530">Virtualizing | ||||
| I/O Devices on VMware Workstation's Hosted Virtual Machine | ||||
| Monitor</a>. In Proceedings of the 2001 Usenix Technical Conference.</p> | ||||
| 
 | ||||
| <p>Kevin Lawton, Drew Northup. <a | ||||
| href="http://savannah.nongnu.org/projects/plex86">Plex86 Virtual | ||||
| Machine</a>.</p> | ||||
| 
 | ||||
| <p><a href="http://www.cl.cam.ac.uk/netos/papers/2003-xensosp.pdf">Xen | ||||
| and the Art of Virtualization</a>, Paul Barham, Boris | ||||
| Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf | ||||
| Neugebauer, Ian Pratt, Andrew Warfield, SOSP 2003</p> | ||||
| 
 | ||||
<p><a href="http://www.vmware.com/pdf/asplos235_adams.pdf">A comparison of
software and hardware techniques for x86 virtualization</a>, Keith Adams
and Ole Agesen, ASPLOS 2006</p>
| 
 | ||||
| </body> | ||||
| 
 | ||||
| </html> | ||||
| 
 | ||||
							
								
								
									
<!-- web/l-xfi.html (new file, 246 lines) -->
| <html> | ||||
| <head> | ||||
| <title>XFI</title> | ||||
| </head> | ||||
| <body> | ||||
| 
 | ||||
| <h1>XFI</h1> | ||||
| 
 | ||||
| <p>Required reading: XFI: software guards for system address spaces. | ||||
| 
 | ||||
| <h2>Introduction</h2> | ||||
| 
 | ||||
| <p>Problem: how to use untrusted code (an "extension") in a trusted | ||||
| program? | ||||
| <ul> | ||||
| <li>Use untrusted jpeg codec in Web browser | ||||
| <li>Use an untrusted driver in the kernel | ||||
| </ul> | ||||
| 
 | ||||
| <p>What are the dangers? | ||||
| <ul> | ||||
<li>No fault isolation: the extension modifies trusted code unintentionally
<li>No protection: the extension causes a security hole
<ul>
<li>Extension has a buffer overrun problem
<li>Extension calls trusted program's functions it shouldn't be allowed to call
<li>Extension calls a trusted program's function that it is allowed to
  call, but supplies "bad" arguments
<li>Extension executes privileged hardware instructions (when extending
  the kernel)
<li>Extension reads data out of the trusted program that it shouldn't.
| </ul> | ||||
| </ul> | ||||
| 
 | ||||
<p>Possible solution approaches:
| <ul> | ||||
| 
 | ||||
<li>Run the extension in its own address space with minimal
  privileges. Rely on hardware and operating system protection
  mechanisms.
| 
 | ||||
| <li>Restrict the language in which the extension is written: | ||||
| <ul> | ||||
| 
 | ||||
<li>Packet filter language.  The language is limited in its capabilities,
  and it is easy to guarantee "safe" execution.
| 
 | ||||
| <li>Type-safe language. Language runtime and compiler guarantee "safe" | ||||
| execution. | ||||
| </ul> | ||||
| 
 | ||||
| <li>Software-based sandboxing. | ||||
| 
 | ||||
| </ul> | ||||
| 
 | ||||
| <h2>Software-based sandboxing</h2> | ||||
| 
 | ||||
| <p>Sandboxer. A compiler or binary-rewriter sandboxes all unsafe | ||||
|   instructions in an extension by inserting additional instructions. | ||||
|   For example, every indirect store is preceded by a few instructions | ||||
|   that compute and check the target of the store at runtime. | ||||
| 
 | ||||
| <p>Verifier. When the extension is loaded in the trusted program, the | ||||
|   verifier checks if the extension is appropriately sandboxed (e.g., | ||||
|   are all indirect stores sandboxed? does it call any privileged | ||||
|   instructions?).  If not, the extension is rejected. If yes, the | ||||
  extension is loaded and can run.  When the extension runs, the
  instructions that sandbox unsafe instructions check if each unsafe
  instruction is used in a safe way.
| 
 | ||||
<p>The verifier must be trusted, but the sandboxer need not be.  We can do
  without the verifier if the trusted program can establish that the
  extension has been sandboxed by a trusted sandboxer.
| 
 | ||||
<p>The paper refers to this setup as an instance of proof-carrying code.
| 
 | ||||
| <h2>Software fault isolation</h2> | ||||
| 
 | ||||
<p><a href="http://citeseer.ist.psu.edu/wahbe93efficient.html">SFI</a>
by Wahbe et al. explored how to use sandboxing for fault-isolating
extensions; that is, using sandboxing to ensure that stores and jumps
stay within a specified memory range (i.e., they don't overwrite or
jump into addresses in the trusted program unchecked).  They
implemented SFI for a RISC processor, which simplifies things since
memory can be written only by store instructions (other instructions
modify registers).  In addition, they assumed that there were plenty
of registers, so that a few can be dedicated to sandboxing code.
| 
 | ||||
| <p>The extension is loaded into a specific range (called a segment) | ||||
|   within the trusted application's address space.  The segment is | ||||
|   identified by the upper bits of the addresses in the | ||||
  segment. Separate code and data segments are necessary to prevent an
  extension from overwriting its own code.
| 
 | ||||
<p>An unsafe instruction on the MIPS is an instruction that jumps or
  stores to an address that cannot be statically verified to be within
  the correct segment.  Most control transfer operations, such as
  program-counter-relative branches, can be statically verified.  Stores to
  static variables often use an immediate addressing mode and can be
  statically verified.  Indirect jumps and indirect stores are unsafe.
| 
 | ||||
| <p>To sandbox those instructions the sandboxer could generate the | ||||
|   following code for each unsafe instruction: | ||||
| <pre> | ||||
|   DR0 <- target address | ||||
  R0 <- DR0 >> shift-register;  // load segment id of target into R0
  CMP R0, segment-register;     // compare it to this segment's ID
|   BNE fault-isolation-error     // if not equal, branch to trusted error code | ||||
|   STORE using DR0 | ||||
| </pre> | ||||
In this code, DR0, the shift register, and the segment register
are <i>dedicated</i>: they cannot be used by the extension code.  The
verifier must check that the extension doesn't use these registers.  R0
is a scratch register, but doesn't have to be dedicated.  The
dedicated registers are necessary because otherwise the extension could
load DR0 and jump to the STORE instruction directly, skipping the
check.
| 
 | ||||
| <p>This implementation costs 4 registers, and 4 additional instructions | ||||
|   for each unsafe instruction. One could do better, however: | ||||
| <pre> | ||||
|   DR0 <- target address & and-mask-register // mask segment ID from target | ||||
|   DR0 <- DR0 | segment register // insert this segment's ID | ||||
|   STORE using DR0 | ||||
| </pre> | ||||
This code just forces the segment-ID bits to the right value.  It doesn't
catch illegal addresses; it just ensures that illegal addresses stay within
the segment, harming the extension but no other code.  Even if the
extension jumps to the second instruction of this sandbox sequence,
nothing bad will happen (because DR0 will already contain the correct
segment ID).
| 
 | ||||
| <p>Optimizations include:  | ||||
| <ul> | ||||
| <li>use guard zones for <i>store value, offset(reg)</i> | ||||
| <li>treat SP as dedicated register (sandbox code that initializes it) | ||||
| <li>etc. | ||||
| </ul> | ||||
| 
 | ||||
| <h2>XFI</h2> | ||||
| 
 | ||||
<p>XFI extends SFI in several ways:
<ul>
<li>Handles both fault isolation and protection
<li>Uses control-flow integrity (CFI) to get good performance
<li>Doesn't use dedicated registers
<li>Uses two stacks (a scoped stack and an allocation stack); only the
  allocation stack can be corrupted by buffer-overrun attacks, because the
  scoped stack cannot be addressed via computed memory references.
<li>Uses a binary rewriter.
<li>Works for the x86
</ul>
| 
 | ||||
<p>The x86 is challenging because of its limited number of registers
  and its variable-length instructions. The SFI technique won't work
  with the x86 instruction set. For example, if the binary contains:
<pre>
  25 CD 80 00 00   # AND eax, 0x80CD
</pre>
and an adversary can arrange to jump to the second byte, then the
adversary executes a system call on Linux, which has the binary
representation CD 80.  Thus, XFI must control the execution flow.
| 
 | ||||
| <p>XFI policy goals: | ||||
| <ul> | ||||
| <li>Memory-access constraints (like SFI) | ||||
| <li>Interface restrictions  (extension has fixed entry and exit points) | ||||
| <li>Scoped-stack integrity (calling stack is well formed) | ||||
<li>Simplified instruction semantics (remove dangerous instructions)
<li>System-environment integrity (ensure certain machine-model
  invariants, such as that the x86 flags register cannot be modified)
| <li>Control-flow integrity: execution must follow a static, expected | ||||
|   control-flow graph. (enter at beginning of basic blocks) | ||||
| <li>Program-data integrity (certain global variables in extension | ||||
|   cannot be accessed via computed memory addresses) | ||||
| </ul> | ||||
| 
 | ||||
<p>The binary rewriter inserts guards to ensure these properties.  The
  verifier checks that the appropriate guards are in place.  The primary
  mechanisms used are:
| <ul> | ||||
| <li>CFI guards on computed control-flow transfers (see figure 2) | ||||
| <li>Two stacks | ||||
<li>Guards on computed memory accesses (see figure 3)
<li>A module header section that contains access permissions for each
  region
<li>A binary rewriter, which performs intra-procedure analysis and
  generates guards, code for stack use, and verification hints
<li>A verifier that checks specific conditions per basic block. Hints specify
  the verification state at the entry to each basic block, and at the
  exit of a basic block the verifier checks that the final state implies
  the verification state at entry to all possible successor basic
  blocks. (see figure 4)
| </ul> | ||||
| 
 | ||||
<p>Can XFI protect against the attack discussed in the last lecture?
| <pre> | ||||
|   unsigned int j; | ||||
|   p=(unsigned char *)s->init_buf->data; | ||||
|   j= *(p++); | ||||
|   s->session->session_id_length=j; | ||||
|   memcpy(s->session->session_id,p,j); | ||||
| </pre> | ||||
| Where will <i>j</i> be located? | ||||
| 
 | ||||
| <p>How about the following one from the paper <a href="http://research.microsoft.com/users/jpincus/beyond-stack-smashing.pdf"><i>Beyond stack smashing: | ||||
|   recent advances in exploiting buffer overruns</i></a>? | ||||
| <pre> | ||||
| void f2b(void * arg, size_t len) { | ||||
|   char buf[100]; | ||||
|   long val = ..; | ||||
|   long *ptr = ..; | ||||
|   extern void (*f)(); | ||||
|    | ||||
  memcpy(buf, arg, len);
|   *ptr = val; | ||||
|   f(); | ||||
|   ... | ||||
|   return; | ||||
| } | ||||
| </pre> | ||||
| What code can <i>(*f)()</i> call?  Code that the attacker inserted? | ||||
| Code in libc? | ||||
| 
 | ||||
<p>How about an attack that uses <i>ptr</i> in the above code to
  overwrite a method's address in a class's dispatch table with the
  address of a support function?
| 
 | ||||
<p>How about <a href="http://research.microsoft.com/~shuochen/papers/usenix05data_attack.pdf">data-only attacks</a>?  For example, the attacker
  overwrites <i>pw_uid</i> in the heap with 0 before the following
  code executes (when downloading /etc/passwd and then uploading it with a
  modified entry).
| <pre> | ||||
| FILE *getdatasock( ... ) { | ||||
|   seteuid(0); | ||||
  setsockopt( ... );
|   ... | ||||
|   seteuid(pw->pw_uid); | ||||
|   ... | ||||
| } | ||||
| </pre> | ||||
| 
 | ||||
| <p>How much does XFI slow down applications? How many more | ||||
|   instructions are executed?  (see Tables 1-4) | ||||
| 
 | ||||
| </body> | ||||
							
								
								
									
</html>
<!-- web/l1.html (new file, 288 lines) -->
<html>
<head>
<title>L1</title>
</head>
| <body> | ||||
| 
 | ||||
| <h1>OS overview</h1> | ||||
| 
 | ||||
| <h2>Overview</h2> | ||||
| 
 | ||||
| <ul> | ||||
| <li>Goal of course: | ||||
| 
 | ||||
| <ul> | ||||
<li>Understand operating systems in detail by designing and
implementing a minimal OS
| <li>Hands-on experience with building systems  ("Applying 6.033") | ||||
| </ul> | ||||
| 
 | ||||
| <li>What is an operating system? | ||||
| <ul> | ||||
| <li>a piece of software that turns the hardware into something useful | ||||
| <li>layered picture: hardware, OS, applications | ||||
| <li>Three main functions: fault isolate applications, abstract hardware,  | ||||
| manage hardware | ||||
| </ul> | ||||
| 
 | ||||
| <li>Examples: | ||||
| <ul> | ||||
| <li>OS-X, Windows, Linux, *BSD, ... (desktop, server) | ||||
| <li>PalmOS Windows/CE (PDA) | ||||
| <li>Symbian, JavaOS (Cell phones) | ||||
| <li>VxWorks, pSOS (real-time) | ||||
| <li> ... | ||||
| </ul> | ||||
| 
 | ||||
| <li>OS Abstractions | ||||
| <ul> | ||||
| <li>processes: fork, wait, exec, exit, kill, getpid, brk, nice, sleep, | ||||
| trace | ||||
| <li>files:  open, close, read, write, lseek, stat, sync | ||||
| <li>directories: mkdir, rmdir, link, unlink, mount, umount | ||||
| <li>users + security: chown, chmod, getuid, setuid | ||||
| <li>interprocess communication: signals, pipe | ||||
| <li>networking: socket, accept, snd, recv, connect | ||||
| <li>time: gettimeofday | ||||
| <li>terminal: | ||||
| </ul> | ||||
| 
 | ||||
| <li>Sample Unix System calls (mostly POSIX) | ||||
| <ul> | ||||
| 	<li> int read(int fd, void*, int) | ||||
| 	<li> int write(int fd, void*, int) | ||||
| 	<li> off_t lseek(int fd, off_t, int [012]) | ||||
| 	<li> int close(int fd) | ||||
| 	<li> int fsync(int fd) | ||||
| 	<li> int open(const char*, int flags [, int mode]) | ||||
| 	  <ul> | ||||
| 		<li> O_RDONLY, O_WRONLY, O_RDWR, O_CREAT | ||||
| 	  </ul> | ||||
| 	<li> mode_t umask(mode_t cmask) | ||||
| 	<li> int mkdir(char *path, mode_t mode); | ||||
| 	<li> DIR *opendir(char *dirname) | ||||
| 	<li> struct dirent *readdir(DIR *dirp) | ||||
| 	<li> int closedir(DIR *dirp) | ||||
| 	<li> int chdir(char *path) | ||||
| 	<li> int link(char *existing, char *new) | ||||
| 	<li> int unlink(char *path) | ||||
| 	<li> int rename(const char*, const char*) | ||||
| 	<li> int rmdir(char *path) | ||||
| 	<li> int stat(char *path, struct stat *buf) | ||||
| 	<li> int mknod(char *path, mode_t mode, dev_t dev) | ||||
| 	<li> int fork() | ||||
|           <ul> | ||||
| 		<li> returns childPID in parent, 0 in child; only | ||||
| 		difference | ||||
|            </ul> | ||||
| 	<li>int getpid() | ||||
| 	<li> int waitpid(int pid, int* stat, int opt) | ||||
|            <ul> | ||||
| 		<li> pid==-1: any; opt==0||WNOHANG | ||||
| 		<li> returns pid or error | ||||
| 	   </ul> | ||||
| 	<li> void _exit(int status) | ||||
| 	<li> int kill(int pid, int signal) | ||||
| 	<li> int sigaction(int sig, struct sigaction *, struct sigaction *) | ||||
| 	<li> int sleep (int sec) | ||||
| 	<li> int execve(char* prog, char** argv, char** envp) | ||||
| 	<li> void *sbrk(int incr) | ||||
| 	<li> int dup2(int oldfd, int newfd) | ||||
| 	<li> int fcntl(int fd, F_SETFD, int val) | ||||
| 	<li> int pipe(int fds[2]) | ||||
| 	  <ul> | ||||
| 		<li> writes on fds[1] will be read on fds[0] | ||||
		<li> when the last fds[1] is closed, read on fds[0] returns EOF
		<li> when the last fds[0] is closed, write on fds[1] raises SIGPIPE/fails with
		EPIPE
|            </ul> | ||||
	<li> int fchown(int fd, uid_t owner, gid_t group)
| 	<li> int fchmod(int fd, mode_t mode) | ||||
| 	<li> int socket(int domain, int type, int protocol) | ||||
| 	<li> int accept(int socket_fd, struct sockaddr*, int* namelen) | ||||
| 	  <ul> | ||||
| 		<li> returns new fd | ||||
|           </ul> | ||||
| 	<li> int listen(int fd, int backlog) | ||||
| 	<li> int connect(int fd, const struct sockaddr*, int namelen) | ||||
| 	<li> void* mmap(void* addr, size_t len, int prot, int flags, int fd, | ||||
| 	off_t offset) | ||||
| 	<li> int munmap(void* addr, size_t len) | ||||
| 	<li> int gettimeofday(struct timeval*) | ||||
| </ul> | ||||
| </ul> | ||||
| 
 | ||||
| <p>See the <a href="../reference.html">reference page</a> for links to | ||||
| the early Unix papers. | ||||
| 
 | ||||
| <h2>Class structure</h2> | ||||
| 
 | ||||
| <ul> | ||||
| <li>Lab: minimal OS for x86 in an exokernel style (50%) | ||||
| <ul> | ||||
| <li>kernel interface: hardware + protection | ||||
| <li>libOS implements fork, exec, pipe, ... | ||||
| <li>applications: file system, shell, .. | ||||
| <li>development environment: gcc, bochs | ||||
| <li>lab 1 is out | ||||
| </ul> | ||||
| 
 | ||||
| <li>Lecture structure (20%) | ||||
| <ul> | ||||
| <li>homework | ||||
| <li>45min lecture | ||||
| <li>45min case study | ||||
| </ul> | ||||
| 
 | ||||
| <li>Two quizzes (30%) | ||||
| <ul> | ||||
| <li>mid-term | ||||
| <li>final's exam week | ||||
| </ul> | ||||
| 
 | ||||
| </ul> | ||||
| 
 | ||||
| <h2>Case study: the shell (simplified)</h2> | ||||
| 
 | ||||
| <ul> | ||||
| <li>interactive command execution and a programming language | ||||
| <li>Nice example that uses various OS abstractions. See  <a | ||||
| href="../readings/ritchie74unix.pdf">Unix | ||||
| paper</a> if you are unfamiliar with the shell. | ||||
| <li>Final lab is a simple shell. | ||||
| <li>Basic structure: | ||||
| <pre> | ||||
|        | ||||
|        while (1) { | ||||
| 	    printf ("$"); | ||||
| 	    readcommand (command, args);   // parse user input | ||||
| 	    if ((pid = fork ()) == 0) {  // child? | ||||
| 	       exec (command, args, 0); | ||||
| 	    } else if (pid > 0) {   // parent? | ||||
| 	       wait (0);   // wait for child to terminate | ||||
| 	    } else { | ||||
| 	       perror ("Failed to fork\n"); | ||||
|             } | ||||
|         } | ||||
| </pre> | ||||
| <p>The split of creating a process with a new program in fork and exec | ||||
| is mostly a historical accident.  See the  <a | ||||
| href="../readings/ritchie79evolution.html">assigned paper</a> for today. | ||||
| <li>Example: | ||||
| <pre> | ||||
|         $ ls | ||||
| </pre> | ||||
| <li>why call "wait"?  to wait for the child to terminate and collect | ||||
| its exit status.  (if child finishes, child becomes a zombie until | ||||
| parent calls wait.) | ||||
| <li>I/O: file descriptors.  Child inherits open file descriptors | ||||
| from parent. By convention: | ||||
| <ul> | ||||
<li>file descriptor 0 for input (e.g., keyboard). read_command:
<pre>
     read (0, buf, bufsize)
</pre>
| <li>file descriptor 1 for output (e.g., terminal) | ||||
| <pre> | ||||
     write (1, "hello\n", strlen("hello\n"))
| </pre> | ||||
| <li>file descriptor 2 for error (e.g., terminal) | ||||
| </ul> | ||||
| <li>How does the shell implement: | ||||
| <pre> | ||||
|      $ls > tmp1 | ||||
| </pre> | ||||
| just before exec insert: | ||||
| <pre> | ||||
|     	   close (1); | ||||
| 	   fd = open ("tmp1", O_CREAT|O_WRONLY);   // fd will be 1! | ||||
| </pre> | ||||
| <p>The kernel will return the first free file descriptor, 1 in this case. | ||||
| <li>How does the shell implement sharing an output file: | ||||
| <pre> | ||||
|      $ls 2> tmp1 > tmp1 | ||||
| </pre> | ||||
| replace last code with: | ||||
| <pre> | ||||
| 
 | ||||
| 	   close (1); | ||||
| 	   close (2); | ||||
| 	   fd1 = open ("tmp1", O_CREAT|O_WRONLY);   // fd will be 1! | ||||
| 	   fd2 = dup (fd1); | ||||
| </pre> | ||||
| both file descriptors share offset | ||||
| <li>how do programs communicate? | ||||
| <pre> | ||||
|         $ sort file.txt | uniq | wc | ||||
| </pre> | ||||
| or | ||||
| <pre> | ||||
| 	$ sort file.txt > tmp1 | ||||
| 	$ uniq tmp1 > tmp2 | ||||
| 	$ wc tmp2 | ||||
| 	$ rm tmp1 tmp2 | ||||
| </pre> | ||||
| or  | ||||
| <pre> | ||||
|         $ kill -9 | ||||
| </pre> | ||||
<li>A pipe is a one-way communication channel.  Here is an example
where the parent is the writer and the child is the reader:
| <pre> | ||||
| 
 | ||||
| 	int fdarray[2]; | ||||
| 	 | ||||
| 	if (pipe(fdarray) < 0) panic ("error"); | ||||
| 	if ((pid = fork()) < 0) panic ("error"); | ||||
| 	else if (pid > 0) { | ||||
| 	  close(fdarray[0]); | ||||
| 	  write(fdarray[1], "hello world\n", 12); | ||||
|         } else { | ||||
| 	  close(fdarray[1]); | ||||
| 	  n = read (fdarray[0], buf, MAXBUF); | ||||
| 	  write (1, buf, n); | ||||
|         } | ||||
| </pre> | ||||
| <li>How does the shell implement pipelines (i.e., cmd 1 | cmd 2 |..)? | ||||
| We want to arrange that the output of cmd 1 is the input of cmd 2. | ||||
| The way to achieve this goal is to manipulate stdout and stdin. | ||||
<li>The shell creates a process for each command in
the pipeline, hooks up their stdin and stdout correctly, and
waits for the last process of the
pipeline to exit.  A sketch of the core modifications to our shell for
setting up a pipe is:
| <pre>	     | ||||
| 	    int fdarray[2]; | ||||
| 
 | ||||
|   	    if (pipe(fdarray) < 0) panic ("error"); | ||||
	    if ((pid = fork ()) == 0) {  // child (left end of pipe)
| 	       close (1); | ||||
| 	       tmp = dup (fdarray[1]);   // fdarray[1] is the write end, tmp will be 1 | ||||
| 	       close (fdarray[0]);       // close read end | ||||
| 	       close (fdarray[1]);       // close fdarray[1] | ||||
| 	       exec (command1, args1, 0); | ||||
| 	    } else if (pid > 0) {        // parent (right end of pipe) | ||||
| 	       close (0); | ||||
| 	       tmp = dup (fdarray[0]);   // fdarray[0] is the read end, tmp will be 0 | ||||
| 	       close (fdarray[0]); | ||||
| 	       close (fdarray[1]);       // close write end | ||||
| 	       exec (command2, args2, 0); | ||||
| 	    } else { | ||||
| 	       printf ("Unable to fork\n"); | ||||
|             } | ||||
| </pre> | ||||
<li>Why close the read end and the write end? Multiple reasons: to maintain
the invariant that every process starts with 3 file descriptors, and so
that reading from an empty pipe blocks the reader, while reading from a
closed pipe returns end of
file.
<li>How do you run background jobs?
| <pre> | ||||
|         $ compute & | ||||
| </pre> | ||||
| <li>How does the shell implement "&", backgrounding?  (Don't call wait | ||||
| immediately). | ||||
| <li>More details in the shell lecture later in the term. | ||||
| 
 | ||||
| </body> | ||||
| 
 | ||||
| 
 | ||||
							
								
								
									
</html>
<!-- web/l13.html (new file, 245 lines) -->
<html>
<head>
<title>High-performance File Systems</title>
</head>
| <body> | ||||
| 
 | ||||
| <h1>High-performance File Systems</h1> | ||||
| 
 | ||||
| <p>Required reading: soft updates. | ||||
| 
 | ||||
| <h2>Overview</h2> | ||||
| 
 | ||||
<p>A key problem in designing file systems is how to obtain good
performance on file system operations while providing consistency.
By consistency, we mean that file system invariants are maintained
on disk.  These invariants include that if a file is created, it
appears in its directory, etc.  If the file system data structures are
consistent, then it is possible to rebuild the file system to a
correct state after a failure.
| 
 | ||||
| <p>To ensure consistency of on-disk file system data structures, | ||||
|   modifications to the file system must respect certain rules: | ||||
| <ul> | ||||
| 
 | ||||
<li>Never point to a structure before it is initialized. An inode must
be initialized before a directory entry references it.  A block must
be initialized before an inode references it.
| 
 | ||||
| <li>Never reuse a structure before nullifying all pointers to it.  An | ||||
| inode pointer to a disk block must be reset before the file system can | ||||
| reallocate the disk block. | ||||
| 
 | ||||
<li>Never reset the last pointer to a live structure before a new
pointer is set.  When renaming a file, the file system should not
remove the old name for an inode until after the new name has been
written.
| </ul> | ||||
| The paper calls these dependencies update dependencies.   | ||||
| 
 | ||||
<p>xv6 ensures these rules by writing every block synchronously, and
  by ordering the writes appropriately.  By synchronously, we mean
  that a process waits until the current disk write has
  completed before continuing with execution.
| 
 | ||||
| <ul> | ||||
| 
 | ||||
<li>What happens if power fails after line 4776 in mknod1?  Did we lose the
  inode forever? No: we have a separate program (called fsck) that
  can rebuild the disk structures correctly and can return the inode to
  the free list.
| 
 | ||||
<li>Does the order of writes in mknod1 matter?  Say, what if we wrote the
  directory entry first and then wrote the allocated inode to disk?
  That violates the update rules and is not a good plan: if a
  failure happens after the directory write, then on recovery we have
  a directory entry pointing to an unallocated inode, which may now be
  allocated by another process for another file!
| 
 | ||||
<li>Can we turn the writes (i.e., the ones invoked by iupdate and
  wdir) into delayed writes without creating problems?  No, because
  the cache might write them back to the disk in an incorrect order.
  It has no information to decide in what order to write them.
| 
 | ||||
| </ul> | ||||
| 
 | ||||
<p>xv6 is a nice example of the tension between consistency and
  performance.  To get consistency, xv6 uses synchronous writes,
  but these writes are slow, because they proceed at the rate of a
  seek instead of the maximum data transfer rate. The
  bandwidth to a disk is reasonably high for large transfers (around
  50 Mbyte/s), but latency is high, because of the cost of moving the
  disk arm(s) (the seek latency is about 10 msec).
| 
 | ||||
| <p>This tension is an implementation-dependent one.  The Unix API | ||||
|   doesn't require that writes are synchronous.  Updates don't have to | ||||
|   appear on disk until a sync, fsync, or open with O_SYNC.  Thus, in | ||||
|   principle, the UNIX API allows delayed writes, which are good for | ||||
|   performance: | ||||
| <ul> | ||||
| <li>Batch many writes together in a big one, written at the disk data | ||||
|   rate. | ||||
<li>Absorb writes to the same block.
| <li>Schedule writes to avoid seeks. | ||||
| </ul> | ||||
| 
 | ||||
| <p>Thus the question: how to delay writes and achieve consistency? | ||||
|   The paper provides an answer. | ||||
| 
 | ||||
| <h2>This paper</h2> | ||||
| 
 | ||||
<p>The paper surveys some of the existing techniques and introduces a
new one to achieve the goal of performance with consistency.
| 
 | ||||
| 
 | ||||
| <p>Techniques possible: | ||||
| <ul> | ||||
| 
 | ||||
| <li>Equip system with NVRAM, and put buffer cache in NVRAM. | ||||
| 
 | ||||
| <li>Logging.  Often used in UNIX file systems for metadata updates. | ||||
| LFS is an extreme version of this strategy. | ||||
| 
 | ||||
<li>Flusher-enforced ordering.  All writes are delayed. The flusher
is aware of dependencies between blocks, but this doesn't work at block
granularity because circular dependencies arise that can only be
broken by writing blocks out.
| 
 | ||||
| </ul> | ||||
| 
 | ||||
<p>Soft updates is the solution explored in this paper.  It doesn't
require NVRAM, and performs as well as the naive strategy of keeping all
dirty blocks in main memory.  Compared to logging, it is unclear whether
soft updates is better.  The default BSD file system uses soft
  updates, but most Linux file systems use logging.
| 
 | ||||
<p>Soft updates is a sophisticated variant of flusher-enforced
ordering.  Instead of maintaining dependencies at the block level, it
maintains dependencies at the file-structure level (per inode, per
directory, etc.), reducing circular dependencies. Furthermore, it
breaks any remaining circular dependencies by undoing changes before
writing the block and then redoing them after the write.
| 
 | ||||
| <p>Pseudocode for create: | ||||
| <pre> | ||||
| create (f) { | ||||
|    allocate inode in block i  (assuming inode is available) | ||||
|    add i to directory data block d  (assuming d has space) | ||||
   mark d as dependent on i, and create an undo/redo record
   update directory inode in block di
   mark di as dependent on d
| } | ||||
| </pre> | ||||
| 
 | ||||
| <p>Pseudocode for the flusher: | ||||
| <pre> | ||||
| flushblock (b) | ||||
| { | ||||
|   lock b; | ||||
|   for all dependencies that b is relying on | ||||
|     "remove" that dependency by undoing the change to b | ||||
|     mark the dependency as "unrolled" | ||||
|   write b  | ||||
| } | ||||
| 
 | ||||
| write_completed (b) { | ||||
|   remove dependencies that depend on b | ||||
|   reapply "unrolled" dependencies that b depended on | ||||
|   unlock b | ||||
| } | ||||
| </pre> | ||||
| 
 | ||||
| <p>Apply flush algorithm to example: | ||||
| <ul> | ||||
| <li>A list of two dependencies: directory->inode, inode->directory. | ||||
<li>Let's say the syncer picks the directory block first
| <li>Undo directory->inode changes (i.e., unroll <A,#4>) | ||||
| <li>Write directory block | ||||
| <li>Remove met dependencies (i.e., remove inode->directory dependency) | ||||
| <li>Perform redo operation (i.e., redo <A,#4>) | ||||
| <li>Select inode block and write it | ||||
| <li>Remove met dependencies (i.e., remove directory->inode dependency) | ||||
| <li>Select directory block (it is dirty again!) | ||||
| <li>Write it. | ||||
| </ul> | ||||
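<p>The flusher's undo/redo bookkeeping can be sketched in C.  This is a
toy model for illustration only: the dep structure, the block arrays,
and the single global dependency list are invented here, not FFS's
actual data structures.

```c
#include <assert.h>
#include <string.h>

#define BLKSZ 16

/* "block b must not reach disk before block req has" */
struct dep {
    int b, req;            /* dependent block, required block */
    char undo[BLKSZ];      /* old bytes: restore if b is written early */
    char redo[BLKSZ];      /* new bytes: reapply after the write */
    int unrolled;
    struct dep *next;
};

static char disk[8][BLKSZ], cache[8][BLKSZ];
static struct dep *deps;          /* all outstanding dependencies */

static void flushblock(int b)
{
    /* undo changes that would expose an unwritten required block */
    for (struct dep *d = deps; d; d = d->next)
        if (d->b == b && !d->unrolled) {
            memcpy(cache[b], d->undo, BLKSZ);
            d->unrolled = 1;
        }

    memcpy(disk[b], cache[b], BLKSZ);         /* "write b" */

    /* write completed: drop met dependencies, reapply unrolled ones */
    for (struct dep **pp = &deps; *pp; ) {
        struct dep *d = *pp;
        if (d->req == b) {                    /* dependency now met */
            *pp = d->next;
            continue;
        }
        if (d->b == b && d->unrolled) {       /* block is dirty again */
            memcpy(cache[b], d->redo, BLKSZ);
            d->unrolled = 0;
        }
        pp = &d->next;
    }
}
```

<p>Running the worked example through this model gives the same write
sequence: the directory block goes to disk once rolled back, then again
with the new entry after the inode block has been written.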
| 
 | ||||
| <p>A file operation that is important for file-system consistency | ||||
| is rename.  Rename conceptually works as follows: | ||||
| <pre> | ||||
| rename (from, to) | ||||
|    unlink (to); | ||||
|    link (from, to); | ||||
|    unlink (from); | ||||
| </pre> | ||||
| 
 | ||||
| <p>Rename is often used by programs to make a new version of a file | ||||
| the current version.  Committing to a new version must happen | ||||
| atomically.  Unfortunately, without transaction-like support, | ||||
| atomicity is impossible to guarantee, so a typical file system | ||||
| provides weaker semantics for rename: if to already exists, an | ||||
| instance of to will always exist, even if the system should crash in | ||||
| the middle of the operation.  Does the above implementation of rename | ||||
| guarantee this semantics? (Answer: no.) | ||||
| 
 | ||||
| <p>If rename is implemented as unlink, link, unlink, then it is | ||||
| difficult to guarantee even the weak semantics. Modern UNIXes provide | ||||
| rename as a file system call: | ||||
| <pre> | ||||
|    update dir block for to so it points to from's inode // write block | ||||
|    update dir block for from to free the entry // write block | ||||
| </pre> | ||||
| <p>fsck may need to correct refcounts in the inode if the file | ||||
| system fails during rename.  For example, a crash after the first | ||||
| write followed by fsck should set the refcount to 2, since both from | ||||
| and to point at the inode. | ||||
| 
 | ||||
| <p>This semantics is sufficient, however, for an application to ensure | ||||
| atomicity. Before the call, there is a from and perhaps a to.  If the | ||||
| call is successful, following the call there is only a to.  If there | ||||
| is a crash, there may be both a from and a to, in which case the | ||||
| caller knows the previous attempt failed, and must retry.  The | ||||
| subtlety is that if you now follow the two links, the "to" name may | ||||
| link to either the old file or the new file.  If it links to the new | ||||
| file, that means that there was a crash and you just detected that the | ||||
| rename operation was composite.  On the other hand, the retry | ||||
| procedure can be the same for either case (do the rename again), so it | ||||
| isn't necessary to discover how it failed.  The function follows the | ||||
| golden rule of recoverability, and it is idempotent, so it lays all | ||||
| the needed groundwork for use as part of a true atomic action. | ||||
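<p>The retry-friendly update pattern this enables looks like the
following in C.  A sketch only: the file names are made up, error
handling is minimal, and a careful program would also fsync the
containing directory.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Atomically replace path's contents: write a temporary file, force it
 * to disk, then rename it over the old name.  rename's weak guarantee
 * is enough: path always names either the old or the new contents. */
static int
update_file(const char *path, const char *tmp, const char *data, size_t len)
{
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t) len || fsync(fd) < 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    close(fd);
    return rename(tmp, path);    /* the commit point */
}
```

<p>If a crash leaves both the temporary file and path behind, recovery
is simply to run update_file again; the procedure is idempotent.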
| 
 | ||||
| <p>With soft updates, rename becomes: | ||||
| <pre> | ||||
| rename (from, to) { | ||||
|    i = namei(from); | ||||
|    add to "to"'s directory data block td a reference to inode i | ||||
|    mark td as dependent on block i | ||||
|    update "to"'s directory inode in block tdi | ||||
|    mark tdi as dependent on td | ||||
|    remove from "from"'s directory data block fd the reference to inode i | ||||
|    mark fd as dependent on tdi | ||||
|    update "from"'s directory inode in block fdi | ||||
|    mark fdi as dependent on fd | ||||
| } | ||||
| </pre> | ||||
| <p>No synchronous writes! | ||||
| 
 | ||||
| <p>What needs to be done on recovery?  (Inspect every statement in | ||||
| rename and see what inconsistencies could exist on the disk; e.g., an | ||||
| inode refcnt could be too high.)  None of these inconsistencies require | ||||
| fixing before the file system can operate; they can be fixed by a | ||||
| background file-system repairer. | ||||
| 
 | ||||
| <h2>Paper discussion</h2> | ||||
| 
 | ||||
| <p>Do soft updates perform any useless writes? (A useless write is a | ||||
| write that will be immediately overwritten.)  (Answer: yes.) Fix the | ||||
| syncer to be careful about which block it starts with.  Fix cache | ||||
| replacement to select the LRU block with no pending dependencies. | ||||
| 
 | ||||
| <p>Can a log-structured file system implement rename better? (Answer: | ||||
| yes, since it can get the refcnts right). | ||||
| 
 | ||||
| <p>Discuss all graphs. | ||||
| 
 | ||||
| </body> | ||||
| 
 | ||||
							
								
								
									
247 web/l14.txt Normal file
							|  | @ -0,0 +1,247 @@ | |||
| Why am I lecturing about Multics? | ||||
|   Origin of many ideas in today's OSes | ||||
|   Motivated UNIX design (often in opposition) | ||||
|   Motivated x86 VM design | ||||
|   This lecture is really "how Intel intended x86 segments to be used" | ||||
| 
 | ||||
| Multics background | ||||
|   design started in 1965 | ||||
|     very few interactive time-shared systems then: CTSS | ||||
|   design first, then implementation | ||||
|   system stable by 1969 | ||||
|     so pre-dates UNIX, which started in 1969 | ||||
|   ambitious, many years, many programmers, MIT+GE+BTL | ||||
| 
 | ||||
| Multics high-level goals | ||||
|   many users on same machine: "time sharing" | ||||
|   perhaps commercial services sharing the machine too | ||||
|   remote terminal access (but no recognizable data networks: wired or phone) | ||||
|   persistent reliable file system | ||||
|   encourage interaction between users | ||||
|     support joint projects that share data &c | ||||
|   control access to data that should not be shared | ||||
| 
 | ||||
| Most interesting aspect of design: memory system | ||||
|   idea: eliminate memory / file distinction | ||||
|   file i/o uses LD / ST instructions | ||||
|   no difference between memory and disk files | ||||
|   just jump to start of file to run program | ||||
|   enhances sharing: no more copying files to private memory | ||||
|   this seems like a really neat simplification! | ||||
| 
 | ||||
| GE 645 physical memory system | ||||
|   24-bit phys addresses | ||||
|   36-bit words | ||||
|   so up to 75 megabytes of physical memory!!! | ||||
|     but no-one could afford more than about a megabyte | ||||
| 
 | ||||
| [per-process state] | ||||
|   DBR | ||||
|   DS, SDW (== address space) | ||||
|   KST | ||||
|   stack segment | ||||
|   per-segment linkage segments | ||||
| 
 | ||||
| [global state] | ||||
|   segment content pages | ||||
|   per-segment page tables | ||||
|   per-segment branch in directory segment | ||||
|   AST | ||||
| 
 | ||||
| 645 segments (simplified for now, no paging or rings) | ||||
|   descriptor base register (DBR) holds phy addr of descriptor segment (DS) | ||||
|   DS is an array of segment descriptor words (SDW) | ||||
|   SDW: phys addr, length, r/w/x, present | ||||
|   CPU has pairs of registers: 18 bit offset, 18 bit segment # | ||||
|     five pairs (PC, arguments, base, linkage, stack) | ||||
|   early Multics limited each segment to 2^16 words | ||||
|     thus there are lots of them, intended to correspond to program modules | ||||
|   note: cannot directly address phys mem (18 vs 24) | ||||
|   645 segments are a lot like the x86! | ||||
| 
 | ||||
| 645 paging | ||||
|   DBR and SDW actually contain phy addr of 64-entry page table | ||||
|   each page is 1024 words | ||||
|   PTE holds phys addr and present flag | ||||
|   no permission bits, so you really need to use the segments, not like JOS | ||||
|   no per-process page table, only per-segment | ||||
|     so all processes using a segment share its page table and phys storage | ||||
|     makes sense assuming segments tend to be shared | ||||
|     paging environment doesn't change on process switch | ||||
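The two-level translation can be sketched in C (structure and function
names invented; field widths per the notes):

```c
#include <assert.h>
#include <stdint.h>

/* Toy GE-645 translation: DBR -> descriptor segment of SDWs,
 * SDW -> per-segment page table, PTE -> physical page.
 * 18-bit offsets, 1024-word pages, up to 64 pages per segment. */

#define PGSZ 1024

struct pte { uint32_t phys_page; int present; };
struct sdw { struct pte *page_table; uint32_t len; int present; };

struct sdw *dbr;   /* per-process: points at the descriptor segment */

/* map (segment #, 18-bit offset) to a 24-bit physical word address;
 * -1 stands in for a segment or page fault */
long translate(uint32_t seg, uint32_t off)
{
    struct sdw *s = &dbr[seg];
    if (!s->present || off >= s->len)
        return -1;                          /* segment fault / bounds */
    struct pte *p = &s->page_table[off / PGSZ];
    if (!p->present)
        return -1;                          /* page fault */
    return (long) p->phys_page * PGSZ + off % PGSZ;
}
```

Because the page table hangs off the segment rather than the process,
every process sharing a segment shares this mapping.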
| 
 | ||||
| Multics processes | ||||
|   each process has its own DS | ||||
|   Multics switches DBR on context switch | ||||
|   different processes typically have different number for same segment | ||||
| 
 | ||||
| how to use segments to unify memory and file system? | ||||
|   don't want to have to use 18-bit seg numbers as file names | ||||
|   we want to write programs using symbolic names | ||||
|   names should be hierarchical (for users) | ||||
|     so users can have directories and sub-directories | ||||
|     and path names | ||||
| 
 | ||||
| Multics file system | ||||
|   tree structure, directories and files | ||||
|   each file and directory is a segment | ||||
|   dir seg holds array of "branches" | ||||
|     name, length, ACL, array of block #s, "active" | ||||
|   unique ROOT directory | ||||
|   path names: ROOT > A > B | ||||
|   note there are no inodes, thus no i-numbers | ||||
|   so "real name" for a file is the complete path name | ||||
|     o/s tables have path name where unix would have i-number | ||||
|     presumably makes renaming and removing active files awkward | ||||
|     no hard links | ||||
| 
 | ||||
| how does a program refer to a different segment? | ||||
|   inter-segment variables contain symbolic segment name | ||||
|   A$E refers to segment A, variable/function E | ||||
|   what happens when segment B calls function A$E(1, 2, 3)? | ||||
| 
 | ||||
| when compiling B: | ||||
|     compiler actually generates *two* segments | ||||
|     one holds B's instructions | ||||
|     one holds B's linkage information | ||||
|     initial linkage entry: | ||||
|       name of segment e.g. "A" | ||||
|       name of symbol e.g. "E" | ||||
|       valid flag | ||||
|     CALL instruction is indirect through entry i of linkage segment | ||||
|     compiler marks entry i invalid | ||||
|     [storage for strings "A" and "E" really in segment B, not linkage seg] | ||||
| 
 | ||||
| when a process is executing B: | ||||
|     two segments in DS: B and a *copy* of B's linkage segment | ||||
|     CPU linkage register always points to current segment's linkage segment | ||||
|     call A$E is really call indirect via linkage[i] | ||||
|     faults because linkage[i] is invalid | ||||
|     o/s fault handler | ||||
|       looks up segment name for i ("A") | ||||
|       search path in file system for segment "A" (cwd, library dirs) | ||||
|       if not already in use by some process (branch active flag and AST knows): | ||||
|         allocate page table and pages | ||||
|         read segment A into memory | ||||
|       if not already in use by *this* process (KST knows): | ||||
|         find free SDW j in process DS, make it refer to A's page table | ||||
|         set up r/w/x based on process's user and file ACL | ||||
|         also set up copy of A's linkage segment | ||||
|       search A's symbol table for "E" | ||||
|       linkage[i] := j / address(E) | ||||
|       restart B | ||||
|     now the CALL works via linkage[i] | ||||
|       and subsequent calls are fast | ||||
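In spirit this is lazy binding through an indirection table, much like
a modern PLT.  A toy C version (every name here is invented):

```c
#include <assert.h>
#include <string.h>

/* Toy Multics-style dynamic linking: calls go indirect through a
 * per-process linkage table; an unresolved entry "faults" to a handler
 * that looks the symbol up and patches the entry, so only the first
 * call is slow. */

typedef int (*func_t)(int);

struct linkent {
    const char *seg, *sym;   /* e.g. "A", "E" */
    func_t target;           /* valid once resolved */
};

static int A_E(int x) { return x + 1; }   /* stands in for segment A's E */
static int resolve_count;

/* stands in for searching the file system + A's symbol table */
static func_t lookup(const char *seg, const char *sym)
{
    resolve_count++;
    if (strcmp(seg, "A") == 0 && strcmp(sym, "E") == 0)
        return A_E;
    return 0;
}

static int call_via(struct linkent *e, int arg)
{
    if (!e->target)                  /* the "linkage fault" */
        e->target = lookup(e->seg, e->sym);
    return e->target(arg);           /* subsequent calls are fast */
}
```

Only the first call pays for the lookup; afterwards a call is one
indirect jump, which is the whole point of patching linkage[i].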
| 
 | ||||
| how does A get the correct linkage register? | ||||
|   the right value cannot be embedded in A, since shared among processes | ||||
|   so CALL actually goes to instructions in A's linkage segment | ||||
|     load current seg# into linkage register, jump into A | ||||
|     one set of these per procedure in A | ||||
| 
 | ||||
| all memory / file references work this way | ||||
|   as if pointers were really symbolic names | ||||
|   segment # is really a transparent optimization | ||||
|   linking is "dynamic" | ||||
|     programs contain symbolic references | ||||
|     resolved only as needed -- if/when executed | ||||
|   code is shared among processes | ||||
|   was program data shared? | ||||
|     probably most variables not shared (on stack, in private segments) | ||||
|     maybe a DB would share a data segment, w/ synchronization | ||||
|   file data: | ||||
|     probably one at a time (locks) for read/write | ||||
|     read-only is easy to share | ||||
| 
 | ||||
| filesystem / segment implications | ||||
|   programs start slowly due to dynamic linking | ||||
|   creat(), unlink(), &c are outside of this model | ||||
|   store beyond end extends a segment (== appends to a file) | ||||
|   no need for buffer cache! no need to copy into user space! | ||||
|     but no buffer cache => ad-hoc caches e.g. active segment table | ||||
|   when are dirty segments written back to disk? | ||||
|     only in page eviction algorithm, when free pages are low | ||||
|   database careful ordered writes? e.g. log before data blocks? | ||||
|     I don't know, probably separate flush system calls | ||||
| 
 | ||||
| how does shell work? | ||||
|   you type a program name | ||||
|   the shell just CALLs that program, as a segment! | ||||
|   dynamic linking finds program segment and any library segments it needs | ||||
|   the program eventually returns, e.g. with RET | ||||
|   all this happened inside the shell process's address space | ||||
|   no fork, no exec | ||||
|   buggy program can crash the shell! e.g. scribble on stack | ||||
|   process creation was too slow to give each program its own process | ||||
| 
 | ||||
| how valuable is the sharing provided by segment machinery? | ||||
|   is it critical to users sharing information? | ||||
|   or is it just there to save memory and copying? | ||||
| 
 | ||||
| how does the kernel fit into all this? | ||||
|   kernel is a bunch of code modules in segments (in file system) | ||||
|   a process dynamically loads in the kernel segments that it uses | ||||
|   so kernel segments have different numbers in different processes | ||||
|     a little different from separate kernel "program" in JOS or xv6 | ||||
|   kernel shares process's segment# address space   | ||||
|     thus easy to interpret seg #s in system call arguments | ||||
|   kernel segment ACLs in file system restrict write | ||||
|     so mapped non-writeable into processes | ||||
| 
 | ||||
| how to call the kernel? | ||||
|   very similar to the Intel x86 | ||||
|   8 rings. users at 4. core kernel at 0. | ||||
|   CPU knows current execution level | ||||
|   SDW has max read/write/execute levels | ||||
|   call gate: lowers ring level, but only at designated entry | ||||
|   stack per ring, incoming call switches stacks | ||||
|   inner ring can always read arguments, write results | ||||
|   problem: checking validity of arguments to system calls | ||||
|     don't want user to trick kernel into reading/writing the wrong segment | ||||
|     you have this problem in JOS too | ||||
|     later Multics CPUs had hardware to check argument references | ||||
| 
 | ||||
| are Multics rings a general-purpose protected subsystem facility? | ||||
|   example: protected game implementation | ||||
|     protected so that users cannot cheat | ||||
|     put game's code and data in ring 3 | ||||
|     BUT what if I don't trust the author? | ||||
|       or if I've already put some other subsystem in ring 3? | ||||
|     a ring has full power over itself and outer rings: you must trust | ||||
|   today: user/kernel, server processes and IPC | ||||
|     pro: protection among mutually suspicious subsystems | ||||
|     con: no convenient sharing of address spaces | ||||
| 
 | ||||
| UNIX vs Multics | ||||
|   UNIX was less ambitious (e.g. no unified mem/FS) | ||||
|   UNIX hardware was small | ||||
|   just a few programmers, all in the same room | ||||
|   evolved rather than pre-planned | ||||
|   quickly self-hosted, so they got experience earlier | ||||
| 
 | ||||
| What did UNIX inherit from MULTICS? | ||||
|   a shell at user level (not built into kernel) | ||||
|   a single hierarchical file system, with subdirectories | ||||
|   controlled sharing of files | ||||
|   written in high level language, self-hosted development | ||||
| 
 | ||||
| What did UNIX reject from MULTICS? | ||||
|   files look like memory | ||||
|     instead, unifying idea is file descriptor and read()/write() | ||||
|     memory is a totally separate resource | ||||
|   dynamic linking | ||||
|     instead, static linking at compile time, every binary had copy of libraries | ||||
|   segments and sharing | ||||
|     instead, single linear address space per process, like xv6 | ||||
|   (but shared libraries brought these back, just for efficiency, in 1980s) | ||||
|   Hierarchical rings of protection | ||||
|     simpler user/kernel | ||||
|     for subsystems, setuid, then client/server and IPC | ||||
| 
 | ||||
| The most useful sources I found for late-1960s Multics VM: | ||||
|   1. Bensoussan, Clingen, Daley, "The Multics Virtual Memory: Concepts | ||||
|      and Design," CACM 1972 (segments, paging, naming segments, dynamic | ||||
|      linking). | ||||
|   2. Daley and Dennis, "Virtual Memory, Processes, and Sharing in Multics," | ||||
|      SOSP 1967 (more details about dynamic linking and CPU).  | ||||
|   3. Graham, "Protection in an Information Processing Utility," | ||||
|      CACM 1968 (brief account of rings and gates). | ||||
							
								
								
									
1412 web/l19.txt Normal file
File diff suppressed because it is too large
							
								
								
									
494 web/l2.html Normal file
							|  | @ -0,0 +1,494 @@ | |||
| <html> | ||||
| <head> | ||||
| <title>L2</title> | ||||
| </head> | ||||
| <body> | ||||
| 
 | ||||
| <h1>6.828 Lecture Notes: x86 and PC architecture</h1> | ||||
| 
 | ||||
| <h2>Outline</h2> | ||||
| <ul> | ||||
| <li>PC architecture | ||||
| <li>x86 instruction set | ||||
| <li>gcc calling conventions | ||||
| <li>PC emulation | ||||
| </ul> | ||||
| 
 | ||||
| <h2>PC architecture</h2> | ||||
| 
 | ||||
| <ul> | ||||
| <li>A full PC has: | ||||
|   <ul> | ||||
|   <li>an x86 CPU with registers, execution unit, and memory management | ||||
|   <li>CPU chip pins include address and data signals | ||||
|   <li>memory | ||||
|   <li>disk | ||||
|   <li>keyboard | ||||
|   <li>display | ||||
|   <li>other resources: BIOS ROM, clock, ... | ||||
|   </ul> | ||||
| 
 | ||||
| <li>We will start with the original 16-bit 8086 CPU (1978) | ||||
| <li>CPU runs instructions: | ||||
| <pre> | ||||
| for(;;){ | ||||
| 	run next instruction | ||||
| } | ||||
| </pre> | ||||
| 
 | ||||
| <li>Needs work space: registers | ||||
| 	<ul> | ||||
| 	<li>four 16-bit data registers: AX, CX, DX, BX | ||||
|         <li>each in two 8-bit halves, e.g. AH and AL | ||||
| 	<li>very fast, very few | ||||
| 	</ul> | ||||
| <li>More work space: memory | ||||
| 	<ul> | ||||
| 	<li>CPU sends out address on address lines (wires, one bit per wire) | ||||
| 	<li>Data comes back on data lines | ||||
| 	<li><i>or</i> data is written to data lines | ||||
| 	</ul> | ||||
| 
 | ||||
| <li>Add address registers: pointers into memory | ||||
| 	<ul> | ||||
| 	<li>SP - stack pointer | ||||
| 	<li>BP - frame base pointer | ||||
| 	<li>SI - source index | ||||
| 	<li>DI - destination index | ||||
| 	</ul> | ||||
| 
 | ||||
| <li>Instructions are in memory too! | ||||
| 	<ul> | ||||
| 	<li>IP - instruction pointer (PC on PDP-11, everything else) | ||||
| 	<li>increment after running each instruction | ||||
| 	<li>can be modified by CALL, RET, JMP, conditional jumps | ||||
| 	</ul> | ||||
| 
 | ||||
| <li>Want conditional jumps | ||||
| 	<ul> | ||||
| 	<li>FLAGS - various condition codes | ||||
| 		<ul> | ||||
| 		<li>whether last arithmetic operation overflowed | ||||
| 		<li> ... was positive/negative | ||||
| 		<li> ... was [not] zero | ||||
| 		<li> ... carry/borrow on add/subtract | ||||
| 		<li> ... overflow | ||||
| 		<li> ... etc. | ||||
| 		<li>whether interrupts are enabled | ||||
| 		<li>direction of data copy instructions | ||||
| 		</ul> | ||||
| 	<li>JP, JN, J[N]Z, J[N]C, J[N]O ... | ||||
| 	</ul> | ||||
| 
 | ||||
| <li>Still not interesting - need I/O to interact with outside world | ||||
| 	<ul> | ||||
| 	<li>Original PC architecture: use dedicated <i>I/O space</i> | ||||
| 		<ul> | ||||
| 		<li>Works same as memory accesses but set I/O signal | ||||
| 		<li>Only 1024 I/O addresses | ||||
| 		<li>Example: write a byte to line printer: | ||||
| <pre> | ||||
| #define DATA_PORT    0x378 | ||||
| #define STATUS_PORT  0x379 | ||||
| #define   BUSY 0x80 | ||||
| #define CONTROL_PORT 0x37A | ||||
| #define   STROBE 0x01 | ||||
| void | ||||
| lpt_putc(int c) | ||||
| { | ||||
|   /* wait for printer to consume previous byte */ | ||||
|   while((inb(STATUS_PORT) & BUSY) == 0) | ||||
|     ; | ||||
| 
 | ||||
|   /* put the byte on the parallel lines */ | ||||
|   outb(DATA_PORT, c); | ||||
| 
 | ||||
|   /* tell the printer to look at the data */ | ||||
|   outb(CONTROL_PORT, STROBE); | ||||
|   outb(CONTROL_PORT, 0); | ||||
| } | ||||
| </pre> | ||||
| 		</ul> | ||||
| 
 | ||||
| <li>Memory-Mapped I/O | ||||
| 	<ul> | ||||
| 	<li>Use normal physical memory addresses | ||||
| 		<ul> | ||||
| 		<li>Gets around limited size of I/O address space | ||||
| 		<li>No need for special instructions | ||||
| 		<li>System controller routes to appropriate device | ||||
| 		</ul> | ||||
| 	<li>Works like ``magic'' memory: | ||||
| 		<ul> | ||||
| 		<li>	<i>Addressed</i> and <i>accessed</i> like memory, | ||||
| 			but ... | ||||
| 		<li>	... does not <i>behave</i> like memory! | ||||
| 		<li>	Reads and writes can have ``side effects'' | ||||
| 		<li>	Read results can change due to external events | ||||
| 		</ul> | ||||
| 	</ul> | ||||
| </ul> | ||||
| 
 | ||||
| 
 | ||||
| <li>What if we want to use more than 2^16 bytes of memory? | ||||
| 	<ul> | ||||
|         <li>8086 has 20-bit physical addresses, can have 1 Meg RAM | ||||
|         <li>each segment is a 2^16 byte window into physical memory | ||||
|         <li>virtual to physical translation: pa = va + seg*16 | ||||
|         <li>the segment is usually implicit, from a segment register | ||||
| 	<li>CS - code segment (for fetches via IP) | ||||
| 	<li>SS - stack segment (for load/store via SP and BP) | ||||
| 	<li>DS - data segment (for load/store via other registers) | ||||
| 	<li>ES - another data segment (destination for string operations) | ||||
|         <li>tricky: can't use the 16-bit address of a stack variable as a pointer | ||||
|         <li>but a <i>far pointer</i> includes full segment:offset (16 + 16 bits) | ||||
| 	</ul> | ||||
| 
 | ||||
| <li>But 8086's 16-bit addresses and data were still painfully small | ||||
|   <ul> | ||||
|   <li>80386 added support for 32-bit data and addresses (1985) | ||||
|   <li>boots in 16-bit mode, boot.S switches to 32-bit mode | ||||
|   <li>registers are 32 bits wide, called EAX rather than AX | ||||
|   <li>operands and addresses are also 32 bits, e.g. ADD does 32-bit arithmetic | ||||
|   <li>prefix 0x66 gets you 16-bit mode: MOVW is really 0x66 MOVW | ||||
|   <li>the .code32 in boot.S tells assembler to generate 0x66 for e.g. MOVW | ||||
|   <li>80386 also changed segments and added paged memory... | ||||
|   </ul> | ||||
| 
 | ||||
| </ul> | ||||
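<p>The real-mode translation rule (pa = va + seg*16) is easy to check
in C.  A sketch, not emulator code:

```c
#include <assert.h>
#include <stdint.h>

/* 8086 real mode: a 16-bit segment register and a 16-bit offset
 * combine into a 20-bit physical address. */
uint32_t realmode_pa(uint16_t seg, uint16_t off)
{
    return ((uint32_t) seg << 4) + off;    /* seg*16 + off */
}
```

<p>Note that many seg:offset pairs alias the same physical address, and
that 0xFFFF:0x0010 sums to 0x100000, exactly 1 MB, which the 8086's
20-bit bus wraps back to 0.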
| 
 | ||||
| <h2>x86 Physical Memory Map</h2> | ||||
| 
 | ||||
| <ul> | ||||
| <li>The physical address space mostly looks like ordinary RAM | ||||
| <li>Except some low-memory addresses actually refer to other things | ||||
| <li>Writes to VGA memory appear on the screen | ||||
| <li>Reset or power-on jumps to ROM at 0x000ffff0 | ||||
| </ul> | ||||
| 
 | ||||
| <pre> | ||||
| +------------------+  <- 0xFFFFFFFF (4GB) | ||||
| |      32-bit      | | ||||
| |  memory mapped   | | ||||
| |     devices      | | ||||
| |                  | | ||||
| /\/\/\/\/\/\/\/\/\/\ | ||||
| 
 | ||||
| /\/\/\/\/\/\/\/\/\/\ | ||||
| |                  | | ||||
| |      Unused      | | ||||
| |                  | | ||||
| +------------------+  <- depends on amount of RAM | ||||
| |                  | | ||||
| |                  | | ||||
| | Extended Memory  | | ||||
| |                  | | ||||
| |                  | | ||||
| +------------------+  <- 0x00100000 (1MB) | ||||
| |     BIOS ROM     | | ||||
| +------------------+  <- 0x000F0000 (960KB) | ||||
| |  16-bit devices, | | ||||
| |  expansion ROMs  | | ||||
| +------------------+  <- 0x000C0000 (768KB) | ||||
| |   VGA Display    | | ||||
| +------------------+  <- 0x000A0000 (640KB) | ||||
| |                  | | ||||
| |    Low Memory    | | ||||
| |                  | | ||||
| +------------------+  <- 0x00000000 | ||||
| </pre> | ||||
| 
 | ||||
| <h2>x86 Instruction Set</h2> | ||||
| 
 | ||||
| <ul> | ||||
| <li>Two-operand instruction set | ||||
| 	<ul> | ||||
| 	<li>Intel syntax: <tt>op dst, src</tt> | ||||
| 	<li>AT&T (gcc/gas) syntax: <tt>op src, dst</tt> | ||||
| 		<ul> | ||||
| 		<li>uses b, w, l suffix on instructions to specify size of operands | ||||
| 		</ul> | ||||
| 	<li>Operands are registers, constant, memory via register, memory via constant | ||||
| 	<li>	Examples: | ||||
| 		<table cellspacing=5> | ||||
| 		<tr><td><u>AT&T syntax</u> <td><u>"C"-ish equivalent</u> | ||||
| 		<tr><td>movl %eax, %edx <td>edx = eax; <td><i>register mode</i> | ||||
| 		<tr><td>movl $0x123, %edx <td>edx = 0x123; <td><i>immediate</i> | ||||
| 		<tr><td>movl 0x123, %edx <td>edx = *(int32_t*)0x123; <td><i>direct</i> | ||||
| 		<tr><td>movl (%ebx), %edx <td>edx = *(int32_t*)ebx; <td><i>indirect</i> | ||||
| 		<tr><td>movl 4(%ebx), %edx <td>edx = *(int32_t*)(ebx+4); <td><i>displaced</i> | ||||
| 		</table> | ||||
| 	</ul> | ||||
| 
 | ||||
| <li>Instruction classes | ||||
| 	<ul> | ||||
| 	<li>data movement: MOV, PUSH, POP, ... | ||||
| 	<li>arithmetic: TEST, SHL, ADD, AND, ... | ||||
| 	<li>i/o: IN, OUT, ... | ||||
| 	<li>control: JMP, JZ, JNZ, CALL, RET | ||||
| 	<li>string: REP MOVSB, ... | ||||
| 	<li>system: IRET, INT | ||||
| 	</ul> | ||||
| 
 | ||||
| <li>Intel architecture manual Volume 2 is <i>the</i> reference | ||||
| 
 | ||||
| </ul> | ||||
| 
 | ||||
| <h2>gcc x86 calling conventions</h2> | ||||
| 
 | ||||
| <ul> | ||||
| <li>x86 dictates that stack grows down: | ||||
| 	<table cellspacing=5> | ||||
| 	<tr><td><u>Example instruction</u> <td><u>What it does</u> | ||||
| 	<tr><td>pushl %eax | ||||
| 		<td> | ||||
| 		subl $4, %esp <br> | ||||
| 		movl %eax, (%esp) <br> | ||||
| 	<tr><td>popl %eax | ||||
| 		<td> | ||||
| 		movl (%esp), %eax <br> | ||||
| 		addl $4, %esp <br> | ||||
| 	<tr><td>call $0x12345 | ||||
| 		<td> | ||||
| 		pushl %eip <sup>(*)</sup> <br> | ||||
| 		movl $0x12345, %eip <sup>(*)</sup> <br> | ||||
| 	<tr><td>ret | ||||
| 		<td> | ||||
| 		popl %eip <sup>(*)</sup> | ||||
| 	</table> | ||||
| 	(*) <i>Not real instructions</i> | ||||
| 
 | ||||
| <li>GCC dictates how the stack is used. | ||||
| 	Contract between caller and callee on x86: | ||||
| 	<ul> | ||||
| 	<li>after call instruction: | ||||
| 		<ul> | ||||
| 		<li>%eip points at first instruction of function | ||||
| 		<li>%esp+4 points at first argument | ||||
| 		<li>%esp points at return address | ||||
| 		</ul> | ||||
| 	<li>after ret instruction: | ||||
| 		<ul> | ||||
| 		<li>%eip contains return address | ||||
| 		<li>%esp points at arguments pushed by caller | ||||
| 		<li>called function may have trashed arguments | ||||
| 		<li>%eax contains return value | ||||
| 			(or trash if function is <tt>void</tt>) | ||||
| 		<li>%ecx, %edx may be trashed | ||||
| 		<li>%ebp, %ebx, %esi, %edi must contain contents from time of <tt>call</tt> | ||||
| 		</ul> | ||||
| 	<li>Terminology: | ||||
| 		<ul> | ||||
| 		<li>%eax, %ecx, %edx are "caller save" registers | ||||
| 		<li>%ebp, %ebx, %esi, %edi are "callee save" registers | ||||
| 		</ul> | ||||
| 	</ul> | ||||
| 
 | ||||
| <li>Functions can do anything that doesn't violate contract. | ||||
| 	By convention, GCC does more: | ||||
| 	<ul> | ||||
| 	<li>each function has a stack frame marked by %ebp, %esp | ||||
| 		<pre> | ||||
| 		       +------------+   | | ||||
| 		       | arg 2      |   \ | ||||
| 		       +------------+    >- previous function's stack frame | ||||
| 		       | arg 1      |   / | ||||
| 		       +------------+   | | ||||
| 		       | ret %eip   |   / | ||||
| 		       +============+    | ||||
| 		       | saved %ebp |   \ | ||||
| 		%ebp-> +------------+   | | ||||
| 		       |            |   | | ||||
| 		       |   local    |   \ | ||||
| 		       | variables, |    >- current function's stack frame | ||||
| 		       |    etc.    |   / | ||||
| 		       |            |   | | ||||
| 		       |            |   | | ||||
| 		%esp-> +------------+   / | ||||
| 		</pre> | ||||
| 	<li>%esp can move to make stack frame bigger, smaller | ||||
| 	<li>%ebp points at saved %ebp from previous function, | ||||
| 		chain to walk stack | ||||
| 	<li>function prologue: | ||||
| 		<pre> | ||||
| 			pushl %ebp | ||||
| 			movl %esp, %ebp | ||||
| 		</pre> | ||||
| 	<li>function epilogue: | ||||
| 		<pre> | ||||
| 			movl %ebp, %esp | ||||
| 			popl %ebp | ||||
| 		</pre> | ||||
| 		or | ||||
| 		<pre> | ||||
| 			leave | ||||
| 		</pre> | ||||
| 	</ul> | ||||
| 
 | ||||
| <li>Big example: | ||||
| 	<ul> | ||||
| 	<li>C code | ||||
| 		<pre> | ||||
| 		int main(void) { return f(8)+1; } | ||||
| 		int f(int x) { return g(x); } | ||||
| 		int g(int x) { return x+3; } | ||||
| 		</pre> | ||||
| 	<li>assembler | ||||
| 		<pre> | ||||
| 		_main: | ||||
| 					<i>prologue</i> | ||||
| 			pushl %ebp | ||||
| 			movl %esp, %ebp | ||||
| 					<i>body</i> | ||||
| 			pushl $8 | ||||
| 			call _f | ||||
| 			addl $1, %eax | ||||
| 					<i>epilogue</i> | ||||
| 			movl %ebp, %esp | ||||
| 			popl %ebp | ||||
| 			ret | ||||
| 		_f: | ||||
| 					<i>prologue</i> | ||||
| 			pushl %ebp | ||||
| 			movl %esp, %ebp | ||||
| 					<i>body</i> | ||||
| 			pushl 8(%esp) | ||||
| 			call _g | ||||
| 					<i>epilogue</i> | ||||
| 			movl %ebp, %esp | ||||
| 			popl %ebp | ||||
| 			ret | ||||
| 
 | ||||
| 		_g: | ||||
| 					<i>prologue</i> | ||||
| 			pushl %ebp | ||||
| 			movl %esp, %ebp | ||||
| 					<i>save %ebx</i> | ||||
| 			pushl %ebx | ||||
| 					<i>body</i> | ||||
| 			movl 8(%ebp), %ebx | ||||
| 			addl $3, %ebx | ||||
| 			movl %ebx, %eax | ||||
| 					<i>restore %ebx</i> | ||||
| 			popl %ebx | ||||
| 					<i>epilogue</i> | ||||
| 			movl %ebp, %esp | ||||
| 			popl %ebp | ||||
| 			ret | ||||
| 		</pre> | ||||
| 	</ul> | ||||
| 
 | ||||
| <li>Super-small <tt>_g</tt>: | ||||
| 	<pre> | ||||
| 		_g: | ||||
| 			movl 4(%esp), %eax | ||||
| 			addl $3, %eax | ||||
| 			ret | ||||
| 	</pre> | ||||
| 
 | ||||
| <li>Compiling, linking, loading: | ||||
| 	<ul> | ||||
| 	<li>	<i>Compiler</i> takes C source code (ASCII text), | ||||
| 		produces assembly language (also ASCII text) | ||||
| 	<li>	<i>Assembler</i> takes assembly language (ASCII text), | ||||
| 		produces <tt>.o</tt> file (binary, machine-readable!) | ||||
| 	<li>	<i>Linker</i> takes multiple '<tt>.o</tt>'s, | ||||
| 		produces a single <i>program image</i> (binary) | ||||
| 	<li>	<i>Loader</i> loads the program image into memory | ||||
| 		at run-time and starts it executing | ||||
| 	</ul> | ||||
| </ul> | ||||
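<p>The saved-%ebp chain above is what makes backtraces possible: each
frame's base points at the previous frame's base.  A simulated walk in
C (a real walker would start from the %ebp register itself; here the
frames are built by hand so the sketch stays portable):

```c
#include <assert.h>
#include <stdint.h>

/* Each stack frame begins with the saved %ebp of the caller; the word
 * just above it is the return %eip pushed by the call instruction.
 * Walking the chain yields the list of return addresses. */

struct frame {
    struct frame *saved_ebp;   /* what "pushl %ebp" stored */
    uint32_t ret_eip;          /* pushed by the call instruction */
};

/* collect up to max return addresses, starting from the current %ebp */
int backtrace(struct frame *ebp, uint32_t *pcs, int max)
{
    int n = 0;
    while (ebp && n < max) {
        pcs[n++] = ebp->ret_eip;
        ebp = ebp->saved_ebp;    /* follow the chain */
    }
    return n;
}
```

<p>This is what a debugger's "bt" command does when code is compiled
with frame pointers.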
| 
 | ||||
| 
 | ||||
| <h2>PC emulation</h2> | ||||
| 
 | ||||
| <ul> | ||||
| <li>	Emulator like Bochs works by | ||||
| 	<ul> | ||||
| 	<li>	doing exactly what a real PC would do, | ||||
| 	<li>	only implemented in software rather than hardware! | ||||
| 	</ul> | ||||
| <li>	Runs as a normal process in a "host" operating system (e.g., Linux) | ||||
| <li>	Uses normal process storage to hold emulated hardware state: | ||||
| 	e.g., | ||||
| 	<ul> | ||||
| 	<li>	Hold emulated CPU registers in global variables | ||||
| 		<pre> | ||||
| 		int32_t regs[8]; | ||||
| 		#define REG_EAX 1 | ||||
| 		#define REG_EBX 2 | ||||
| 		#define REG_ECX 3 | ||||
| 		... | ||||
| 		int32_t eip; | ||||
| 		int16_t segregs[4]; | ||||
| 		... | ||||
| 		</pre> | ||||
| 	<li>	<tt>malloc</tt> a big chunk of (virtual) process memory | ||||
| 		to hold emulated PC's (physical) memory | ||||
| 	</ul> | ||||
| <li>	Execute instructions by simulating them in a loop: | ||||
| 	<pre> | ||||
| 	for (;;) { | ||||
| 		read_instruction(); | ||||
| 		switch (decode_instruction_opcode()) { | ||||
| 		case OPCODE_ADD: { | ||||
| 			int src = decode_src_reg(); | ||||
| 			int dst = decode_dst_reg(); | ||||
| 			regs[dst] = regs[dst] + regs[src]; | ||||
| 			break; | ||||
| 		} | ||||
| 		case OPCODE_SUB: { | ||||
| 			int src = decode_src_reg(); | ||||
| 			int dst = decode_dst_reg(); | ||||
| 			regs[dst] = regs[dst] - regs[src]; | ||||
| 			break; | ||||
| 		} | ||||
| 		... | ||||
| 		} | ||||
| 		eip += instruction_length; | ||||
| 	} | ||||
| 	</pre> | ||||
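<p>As a concrete (and heavily simplified) illustration of the loop above, here is a runnable sketch. The 3-byte fixed-length instruction encoding and the opcode values are assumptions for illustration only, not real x86:

```c
#include <assert.h>
#include <stdint.h>

/* Toy instruction format (an assumption, NOT real x86 encoding):
 * byte 0 = opcode, byte 1 = dst register, byte 2 = src register. */
enum { OPCODE_ADD = 0, OPCODE_SUB = 1, OPCODE_HALT = 2 };

int32_t regs[8];        /* emulated CPU registers */
uint32_t eip;           /* emulated instruction pointer */
uint8_t mem[64];        /* emulated "physical" memory */

void run(void)
{
    for (;;) {
        uint8_t op  = mem[eip];
        uint8_t dst = mem[eip + 1];
        uint8_t src = mem[eip + 2];
        switch (op) {
        case OPCODE_ADD:
            regs[dst] += regs[src];
            break;
        case OPCODE_SUB:
            regs[dst] -= regs[src];
            break;
        case OPCODE_HALT:
            return;
        }
        eip += 3;       /* every toy instruction is 3 bytes long */
    }
}
```

Each iteration fetches, decodes, executes, and advances <tt>eip</tt>, exactly the structure of the loop above, just with the decoding spelled out.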
| 
 | ||||
| <li>	Simulate PC's physical memory map | ||||
| 	by decoding emulated "physical" addresses just like a PC would: | ||||
| 	<pre> | ||||
| 	#define KB		1024 | ||||
| 	#define MB		1024*1024 | ||||
| 
 | ||||
| 	#define LOW_MEMORY	640*KB | ||||
| 	#define EXT_MEMORY	10*MB | ||||
| 
 | ||||
| 	uint8_t low_mem[LOW_MEMORY]; | ||||
| 	uint8_t ext_mem[EXT_MEMORY]; | ||||
| 	uint8_t bios_rom[64*KB]; | ||||
| 
 | ||||
| 	uint8_t read_byte(uint32_t phys_addr) { | ||||
| 		if (phys_addr < LOW_MEMORY) | ||||
| 			return low_mem[phys_addr]; | ||||
| 		else if (phys_addr >= 960*KB && phys_addr < 1*MB) | ||||
| 			return bios_rom[phys_addr - 960*KB]; | ||||
| 		else if (phys_addr >= 1*MB && phys_addr < 1*MB+EXT_MEMORY) | ||||
| 			return ext_mem[phys_addr-1*MB]; | ||||
| 		else ... | ||||
| 	} | ||||
| 
 | ||||
| 	void write_byte(uint32_t phys_addr, uint8_t val) { | ||||
| 		if (phys_addr < LOW_MEMORY) | ||||
| 			low_mem[phys_addr] = val; | ||||
| 		else if (phys_addr >= 960*KB && phys_addr < 1*MB) | ||||
| 			; /* ignore attempted write to ROM! */ | ||||
| 		else if (phys_addr >= 1*MB && phys_addr < 1*MB+EXT_MEMORY) | ||||
| 			ext_mem[phys_addr-1*MB] = val; | ||||
| 		else ... | ||||
| 	} | ||||
| 	</pre> | ||||
| <li>	Simulate I/O devices, etc., by detecting accesses to | ||||
| 	"special" memory and I/O space and emulating the correct behavior: | ||||
| 	e.g., | ||||
| 	<ul> | ||||
| 	<li>	Reads/writes to emulated hard disk | ||||
| 		transformed into reads/writes of a file on the host system | ||||
| 	<li>	Writes to emulated VGA display hardware | ||||
| 		transformed into drawing into an X window | ||||
| 	<li>	Reads from emulated PC keyboard | ||||
| 		transformed into reads from X input event queue | ||||
| 	</ul> | ||||
| </ul> | ||||
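<p>Device emulation can be sketched as a dispatch on the accessed I/O port: here writes to the PC's COM1 serial data port (0x3f8) are captured into a host-side buffer, standing in for the host file or window a real emulator would use. The function and buffer names are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PORT_COM1_DATA 0x3f8   /* the PC's COM1 serial data port */

static char serial_log[128];   /* host-side stand-in for real output */
static int serial_len;

/* Emulate the "out byte" instruction hitting a device port. */
void emulate_outb(uint16_t port, uint8_t val)
{
    switch (port) {
    case PORT_COM1_DATA:
        /* A real emulator would also model baud rate, IRQs, etc. */
        if (serial_len < (int)sizeof(serial_log) - 1)
            serial_log[serial_len++] = (char)val;
        break;
    default:
        break;  /* unmodeled port: drop the write, as hardware often does */
    }
}
```

The same dispatch structure handles emulated disk and keyboard accesses, just with different port numbers and host-side actions.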
							
								
								
									
										334
								web/l3.html
										Normal file
							|  | @ -0,0 +1,334 @@ | |||
| <title>L3</title> | ||||
| <html> | ||||
| <head> | ||||
| </head> | ||||
| <body> | ||||
| 
 | ||||
| <h1>Operating system organization</h1> | ||||
| 
 | ||||
| <p>Required reading: Exokernel paper. | ||||
| 
 | ||||
| <h2>Intro: virtualizing</h2> | ||||
| 
 | ||||
| <p>One way to think about an operating system interface is that it | ||||
| extends the hardware instructions with a set of "instructions" that | ||||
| are implemented in software. These instructions are invoked using a | ||||
| system call instruction (int on the x86).  In this view, a task of the | ||||
| operating system is to provide each application with a <i>virtual</i> | ||||
| version of the interface; that is, it provides each application with a | ||||
| virtual computer.   | ||||
| 
 | ||||
| <p>One of the challenges in an operating system is multiplexing the | ||||
| physical resources between the potentially many virtual computers. | ||||
| What makes the multiplexing typically complicated is an additional | ||||
| constraint: isolate the virtual computers well from each other. That | ||||
| is, | ||||
| <ul> | ||||
| <li> stores shouldn't be able to overwrite other applications' data | ||||
| <li> jmp shouldn't be able to enter another application | ||||
| <li> one virtual computer cannot hog the processor | ||||
| </ul> | ||||
| 
 | ||||
| <p>In this lecture, we will explore at a high level how to build | ||||
| virtual computers that meet these goals.  In the rest of the term we | ||||
| work out the details. | ||||
| 
 | ||||
| <h2>Virtual processors</h2> | ||||
| 
 | ||||
| <p>To give each application its own virtual processor, we need | ||||
| to virtualize the physical processors.  One way to do so is to multiplex | ||||
| the physical processor over time: the operating system runs one | ||||
| application for a while, then runs another application for a while, etc. | ||||
| We can implement this solution as follows: when an application has run | ||||
| for its share of the processor, unload the state of the physical | ||||
| processor, save that state to be able to resume the application later, | ||||
| load in the state for the next application, and resume it. | ||||
| 
 | ||||
| <p>What needs to be saved and restored?  That depends on the | ||||
| processor, but for the x86: | ||||
| <ul> | ||||
| <li>IP | ||||
| <li>SP | ||||
| <li>The other processor registers (eax, etc.) | ||||
| </ul> | ||||
| 
 | ||||
| <p>To enforce that a virtual processor doesn't hog the physical | ||||
| processor, the operating system can arrange for a periodic interrupt, | ||||
| and switch the processor in the interrupt routine. | ||||
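<p>The unload/save/load/resume steps above can be sketched in C. The struct layout and names are illustrative assumptions; a real kernel does this in assembly on the interrupt path, but the data movement is the same:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* What needs to be saved and restored on the x86 (per the list above):
 * IP, SP, and the other processor registers. */
struct context {
    int32_t regs[8];   /* eax, ebx, ... */
    uint32_t eip;      /* instruction pointer */
    uint32_t esp;      /* stack pointer */
};

struct context hw;     /* stand-in for the state on the physical CPU */

/* Switch the processor from one virtual processor to another:
 * unload the physical state into *old, load *next in its place. */
void context_switch(struct context *old, struct context *next)
{
    *old = hw;         /* save state to resume this application later */
    hw = *next;        /* resume the next application */
}
```
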
| 
 | ||||
| <p>To separate the memories of the applications, we may also need to save | ||||
| and restore the registers that define the (virtual) memory of the | ||||
| application (e.g., segment and MMU registers on the x86), which is | ||||
| explained next. | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| <h2>Separating memories</h2> | ||||
|    | ||||
| <p>Approach to separating memories: | ||||
| <ul> | ||||
| <li>Force programs to be written in high-level, type-safe language | ||||
| <li>Enforce separation using hardware support | ||||
| </ul> | ||||
| The approaches can be combined. | ||||
| 
 | ||||
| <p>Let's assume unlimited physical memory for a little while. We can | ||||
| then enforce separation as follows: | ||||
| <ul> | ||||
| 
 | ||||
| <li>Put a device (a memory management unit) between processor and memory, | ||||
|   which checks each memory access against a set of domain registers. | ||||
|   (The domain registers are like segment registers on the x86, except | ||||
|   that no address computation is performed.) | ||||
| <li>The domain register specifies a range of addresses that the | ||||
|   processor is allowed to access. | ||||
| <li>When switching applications, switch domain registers. | ||||
| </ul> | ||||
| Why does this work? Loads/stores/jmps cannot touch/enter other | ||||
| applications' domains. | ||||
| 
 | ||||
| <p>To allow for controlled sharing between applications while keeping | ||||
| them separated, extend the domain registers with protection bits: | ||||
| read (R), write (W), execute-only (X). | ||||
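<p>A minimal sketch of such a check, assuming an illustrative base/limit representation of a domain register plus the R/W/X bits (the names are made up; real hardware does this check on every access, in parallel with the memory reference):

```c
#include <assert.h>
#include <stdint.h>

#define PROT_R 1   /* read */
#define PROT_W 2   /* write */
#define PROT_X 4   /* execute-only */

struct domain {
    uint32_t base;    /* first address the program may touch */
    uint32_t limit;   /* one past the last address it may touch */
    int prot;         /* PROT_* bits */
};

/* Return 1 if the access is allowed, 0 if the MMU should fault. */
int domain_check(struct domain *d, uint32_t addr, int want)
{
    if (addr < d->base || addr >= d->limit)
        return 0;                         /* outside the domain */
    return (d->prot & want) == want;      /* must hold every requested right */
}
```
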
| 
 | ||||
| <p>How to protect the domain registers? Extend the protection bits | ||||
| with a kernel-only bit.  When in kernel mode, the processor can change | ||||
| the domain registers. As we will see in lecture 4, the x86 stores the U/K | ||||
| information in the CPL (current privilege level) in the CS segment | ||||
| register. | ||||
| 
 | ||||
| <p>To change from user to kernel mode, extend the hardware with special | ||||
| instructions for entering a "supervisor" or "system" call, and | ||||
| returning from it.  On the x86, int and iret. The int instruction takes as | ||||
| argument the system call number. We can then think of the kernel | ||||
| interface as the set of "instructions" that augment the instructions | ||||
| implemented in hardware. | ||||
| 
 | ||||
| <h2>Memory management</h2> | ||||
| 
 | ||||
| <p>We assumed unlimited physical memory and big addresses.  In | ||||
| practice, operating system must support creating, shrinking, and | ||||
| growing of domains, while still allowing the addresses of an | ||||
| application to be contiguous (for programming convenience).  What if | ||||
| we want to grow the domain of application 1 but the memory right below | ||||
| and above it is in use by application 2? | ||||
| 
 | ||||
| <p>How? Virtual addresses and spaces.  Virtualize addresses and let | ||||
| the kernel control the mapping from virtual to physical. | ||||
| 
 | ||||
| <p> Address spaces provide each application with the illusion that it | ||||
| has a complete memory to itself.  All the addresses it issues are its | ||||
| own addresses (e.g., each application has an address 0). | ||||
| 
 | ||||
| <li> How do you give each application its own address space? | ||||
| <ul> | ||||
|       <li> MMU translates <i>virtual</i> address to <i>physical</i> | ||||
|         addresses using a translation table | ||||
|       <li> Implementation approaches for translation table: | ||||
| <ol> | ||||
| 
 | ||||
| <li> for each virtual address, store the physical address; too costly. | ||||
| 
 | ||||
| <li> translate a set of contiguous virtual addresses at a time using | ||||
| segments (segment #, base address, length) | ||||
| 
 | ||||
| <li> translate a fixed-size set of addresses (a page) at a time using a | ||||
| page map (page # -> block #) (draw hardware page table picture). | ||||
| Data structures for the page map: array, n-level tree, superpages, etc. | ||||
| 
 | ||||
| </ol> | ||||
| <br>Some processors have both 2+3: x86!  (see lecture 4) | ||||
| </ul> | ||||
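<p>Option 3 can be sketched with a one-level array as the page map. A real x86 uses a two-level tree (see lecture 4); the 4 KB page size is real, but the array representation and sizes here are assumptions for illustration:

```c
#include <assert.h>
#include <stdint.h>

#define PGSIZE 4096   /* fixed page size */
#define NPAGES 16     /* size of this toy virtual address space */

/* page # -> physical block #, or (uint32_t)-1 if unmapped */
uint32_t page_map[NPAGES];

/* MMU translation: split va into (page #, offset), look the page #
 * up in the page map, and glue the offset back on. */
uint32_t translate(uint32_t va)
{
    uint32_t page = va / PGSIZE;
    uint32_t off  = va % PGSIZE;
    /* a real MMU would raise a page fault here instead */
    assert(page < NPAGES && page_map[page] != (uint32_t)-1);
    return page_map[page] * PGSIZE + off;
}
```

Note that contiguous virtual pages need not map to contiguous physical blocks, which is exactly what makes growing a domain easy.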
| 
 | ||||
| <li> What if two applications want to share real memory? Map the pages | ||||
| into multiple address spaces and have protection bits per page. | ||||
| 
 | ||||
| <li> How do you give an application access to a memory-mapped I/O | ||||
| device? Map the physical address for the device into the application's | ||||
| address space. | ||||
| 
 | ||||
| <li> How do you get off the ground? | ||||
| <ul> | ||||
|      <li> when computer starts, MMU is disabled. | ||||
|      <li> computer starts in kernel mode, with no | ||||
|        translation (i.e., virtual address 0 is physical address 0, and | ||||
|        so on) | ||||
|      <li> kernel program sets up MMU to translate kernel addresses to physical | ||||
|      addresses. Often the kernel virtual address translates to physical address 0. | ||||
|      <li> enable MMU | ||||
| <br><p>Lab 2 explores this topic in detail. | ||||
| </ul> | ||||
| 
 | ||||
| <h2>Operating system organizations</h2> | ||||
| 
 | ||||
| <p>A central theme in operating system design is how to organize the | ||||
| operating system. It is helpful to define a couple of terms: | ||||
| <ul> | ||||
| 
 | ||||
| <li>Kernel: the program that runs in kernel mode, in a kernel | ||||
| address space. | ||||
| 
 | ||||
| <li>Library: code against which applications link (e.g., libc). | ||||
| 
 | ||||
| <li>Application: code that runs in a user-level address space. | ||||
| 
 | ||||
| <li>Operating system: kernel plus all user-level system code (e.g., | ||||
| servers, libraries, etc.) | ||||
| 
 | ||||
| </ul> | ||||
| 
 | ||||
| <p>Example: trace a call to printf made by an application. | ||||
| 
 | ||||
| <p>There are roughly 4 operating system designs: | ||||
| <ul> | ||||
| 
 | ||||
| <li>Monolithic design. The OS interface is the kernel interface (i.e., | ||||
| the complete operating system runs in kernel mode).  This has limited | ||||
| flexibility (other than downloadable kernel modules) and doesn't | ||||
| fault-isolate individual OS modules (e.g., the file system and process | ||||
| module are both in the kernel address space).  xv6 has this | ||||
| organization. | ||||
| 
 | ||||
| <li>Microkernel design.  The kernel interface provides a minimal set of | ||||
| abstractions (e.g., virtual memory, IPC, and threads), and the rest of | ||||
| the operating system is implemented by user applications (often called | ||||
| servers). | ||||
| 
 | ||||
| <li>Virtual machine design.  The kernel implements a virtual machine | ||||
| monitor.  The monitor multiplexes multiple virtual machines, which | ||||
| each provide as the kernel programming interface the machine platform | ||||
| (the instruction set, devices, etc.).  Each virtual machine runs its | ||||
| own, perhaps simple, operating system. | ||||
| 
 | ||||
| <li>Exokernel design.  Only used in this class and discussed below. | ||||
| 
 | ||||
| </ul> | ||||
| 
 | ||||
| <p>Although monolithic operating systems are the dominant operating | ||||
| system architecture for desktop and server machines, it is worthwhile | ||||
| to consider alternative architectures, even if just to understand | ||||
| operating systems better.  This lecture looks at exokernels, because | ||||
| that is what you will be building in the lab.  xv6 is organized as a | ||||
| monolithic system, which we will study in the next lectures.  Later in | ||||
| the term we will read papers about microkernel and virtual machine | ||||
| operating systems. | ||||
| 
 | ||||
| <h2>Exokernels</h2> | ||||
| 
 | ||||
| <p>The exokernel architecture takes an end-to-end approach to | ||||
| operating system design.  In this design, the kernel just securely | ||||
| multiplexes physical resources; any programmer can decide what the | ||||
| operating system interface and its implementation are for their | ||||
| application.  One would expect a couple of popular APIs (e.g., UNIX) | ||||
| that most applications will link against, but a programmer is always | ||||
| free to replace that API, partially or completely.  (Draw picture of | ||||
| JOS.) | ||||
| 
 | ||||
| <p>Compare UNIX interface (<a href="v6.c">v6</a> or <a | ||||
| href="os10.h">OSX</a>) with the JOS exokernel-like interface: | ||||
| <pre> | ||||
| enum | ||||
| { | ||||
| 	SYS_cputs = 0, | ||||
| 	SYS_cgetc, | ||||
| 	SYS_getenvid, | ||||
| 	SYS_env_destroy, | ||||
| 	SYS_page_alloc, | ||||
| 	SYS_page_map, | ||||
| 	SYS_page_unmap, | ||||
| 	SYS_exofork, | ||||
| 	SYS_env_set_status, | ||||
| 	SYS_env_set_trapframe, | ||||
| 	SYS_env_set_pgfault_upcall, | ||||
| 	SYS_yield, | ||||
| 	SYS_ipc_try_send, | ||||
| 	SYS_ipc_recv, | ||||
| }; | ||||
| </pre> | ||||
| 
 | ||||
| <p>To illustrate the differences between these interfaces in more | ||||
| detail consider implementing the following: | ||||
| <ul> | ||||
| 
 | ||||
| <li>User-level thread package that deals well with blocking system | ||||
| calls, page faults, etc. | ||||
| 
 | ||||
| <li>High-performance web server performing optimizations across module | ||||
| boundaries (e.g., file system and network stack). | ||||
| 
 | ||||
| </ul>  | ||||
| 
 | ||||
| <p>How well can each kernel interface implement the above examples? | ||||
| (Start with UNIX interface and see where you run into problems.)  (The | ||||
| JOS kernel interface is not flexible enough: for example, | ||||
| <i>ipc_receive</i> is blocking.) | ||||
| 
 | ||||
| <h2>Exokernel paper discussion</h2> | ||||
| 
 | ||||
| 
 | ||||
| <p>The central challenge in an exokernel design is to provide | ||||
| extensibility while still providing fault isolation.  This challenge | ||||
| breaks down into three problems: | ||||
| 
 | ||||
| <ul> | ||||
| 
 | ||||
| <li>tracking ownership of resources; | ||||
| 
 | ||||
| <li>ensuring fault isolation between applications; | ||||
| 
 | ||||
| <li>revoking access to resources. | ||||
| 
 | ||||
| </ul> | ||||
| 
 | ||||
| <ul> | ||||
| 
 | ||||
| <li>How is physical memory multiplexed?  Kernel tracks for each | ||||
| physical page who has it. | ||||
| 
 | ||||
| <li>How is the processor multiplexed?  Time slices. | ||||
| 
 | ||||
| <li>How is the network multiplexed?  Packet filters. | ||||
| 
 | ||||
| <li>What is the plan for revoking resources? | ||||
| <ul> | ||||
| 
 | ||||
| <li>Expose information so that application can do the right thing. | ||||
| 
 | ||||
| <li>Ask applications politely to release resources of a given type. | ||||
| 
 | ||||
| <li>Forcibly take resources back from applications that don't comply. | ||||
| 
 | ||||
| </ul> | ||||
| 
 | ||||
| <li>What is an environment?  The processor environment: it stores | ||||
| sufficient information to deliver events to applications: exception | ||||
| context, interrupt context, protected entry context, and addressing | ||||
| context.  This structure is processor specific. | ||||
| 
 | ||||
| <li>How does one implement a minimal protected control transfer on the | ||||
| x86?  Lab 4's approach to IPC has some shortcomings: what are they? | ||||
| (It is essentially a polling-based solution, and the one you implement | ||||
| is unfair.) What is a better way?  Set up a specific handler to be | ||||
| called when an environment wants to call this environment.  How does | ||||
| this impact scheduling of environments? (i.e., give up time slice or | ||||
| not?) | ||||
| 
 | ||||
| <li>How does one dispatch exceptions (e.g., page fault) to user space | ||||
| on the x86?  Give each environment a separate exception stack in user | ||||
| space, and propagate exceptions on that stack. See page-fault handling | ||||
| in lab 4. | ||||
| 
 | ||||
| <li>How does one implement processes in user space? The thread part of | ||||
| a process is easy.  The difficult part is to perform the copy of the | ||||
| address space efficiently; one would like to share memory between | ||||
| parent and child.  This property can be achieved using copy-on-write. | ||||
| The child should, however, have its own exception stack.  Again, | ||||
| see lab 4.  <i>sfork</i> is a trivial extension of user-level <i>fork</i>. | ||||
| 
 | ||||
| <li>What are the examples of extensibility in this paper? (RPC system | ||||
| in which server saves and restores registers, different page table, | ||||
| and stride scheduler.) | ||||
| 
 | ||||
| </ul> | ||||
| 
 | ||||
| </body> | ||||
							
								
								
									
										518
								web/l4.html
										Normal file
							|  | @ -0,0 +1,518 @@ | |||
| <title>L4</title> | ||||
| <html> | ||||
| <head> | ||||
| </head> | ||||
| <body> | ||||
| 
 | ||||
| <h1>Address translation and sharing using segments</h1> | ||||
| 
 | ||||
| <p>This lecture is about virtual memory, focusing on address | ||||
| spaces. It is the first lecture out of series of lectures that uses | ||||
| xv6 as a case study. | ||||
| 
 | ||||
| <h2>Address spaces</h2> | ||||
| 
 | ||||
| <ul> | ||||
| 
 | ||||
| <li>OS: kernel program and user-level programs. For fault isolation | ||||
| each program runs in a separate address space. The kernel address | ||||
| space is like a user address space, except that it runs in kernel mode. | ||||
| A program in kernel mode can execute privileged instructions (e.g., | ||||
| writing the kernel's code segment registers). | ||||
| 
 | ||||
| <li>One job of kernel is to manage address spaces (creating, growing, | ||||
| deleting, and switching between them) | ||||
| 
 | ||||
| <ul> | ||||
| 
 | ||||
| <li>Each address space (including the kernel's) consists of the binary | ||||
|     representation for the text of the program, the data part | ||||
|     of the program, and the stack area. | ||||
| 
 | ||||
| <li>The kernel address space runs the kernel program. In a monolithic | ||||
|     organization the kernel manages all hardware and provides an API | ||||
|     to user programs. | ||||
| 
 | ||||
| <li>Each user address space contains a program.  A user program may ask | ||||
|   to shrink or grow its address space. | ||||
| 
 | ||||
| </ul> | ||||
| 
 | ||||
| <li>The main operations: | ||||
| <ul> | ||||
| <li>Creation.  Allocate physical memory to store the program. Load the | ||||
| program into physical memory. Fill the address space with references to | ||||
| physical memory.  | ||||
| <li>Growing. Allocate physical memory and add it to address space. | ||||
| <li>Shrinking. Free some of the memory in an address space. | ||||
| <li>Deletion. Free all memory in an address space. | ||||
| <li>Switching. Switch the processor to use another address space. | ||||
| <li>Sharing.  Share a part of an address space with another program. | ||||
| </ul> | ||||
| </ul> | ||||
| 
 | ||||
| <p>Two main approaches to implementing address spaces: using segments | ||||
|   and using page tables. Often when one uses segments, one also uses | ||||
|   page tables.  But not the other way around; i.e., paging without | ||||
|   segmentation is common. | ||||
| 
 | ||||
| <h2>Example support for address spaces: x86</h2> | ||||
| 
 | ||||
| <p>For an operating system to provide address spaces and address | ||||
| translation typically requires support from hardware.  The translation | ||||
| and checking of permissions typically must happen on each address used | ||||
| by a program, and it would be too slow to check that in software (if | ||||
| even possible).  The division of labor is: the operating system manages | ||||
| address spaces, and the hardware translates addresses and checks | ||||
| permissions. | ||||
| 
 | ||||
| <p>PC block diagram without virtual memory support: | ||||
| <ul> | ||||
| <li>physical address | ||||
| <li>base, IO hole, extended memory | ||||
| <li>Physical address == what is on CPU's address pins | ||||
| </ul> | ||||
| 
 | ||||
| <p>The x86 starts out in real mode and translation is as follows: | ||||
| 	<ul> | ||||
| 	<li>segment*16+offset ==> physical address | ||||
|         <li>no protection: program can load anything into seg reg | ||||
| 	</ul> | ||||
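<p>Real-mode address formation is a one-liner; note how the boot loader's load address 0x7c00 can be named by several different segment:offset pairs, since there is no protection and no uniqueness:

```c
#include <assert.h>
#include <stdint.h>

/* Real mode: physical address = segment*16 + offset, no protection. */
uint32_t real_mode_addr(uint16_t seg, uint16_t off)
{
    return ((uint32_t)seg << 4) + off;
}
```
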
| 
 | ||||
| <p>The operating system can switch the x86 to protected mode, which | ||||
| allows the operating system to create address spaces. Translation in | ||||
| protected mode is as follows: | ||||
| 	<ul> | ||||
| 	<li>selector:offset (logical addr) <br> | ||||
| 	     ==SEGMENTATION==>  | ||||
| 	<li>linear address <br> | ||||
| 	     ==PAGING ==> | ||||
| 	<li>physical address | ||||
| 	</ul> | ||||
| 
 | ||||
| <p>Next lecture covers paging; now we focus on segmentation.  | ||||
| 
 | ||||
| <p>Protected-mode segmentation works as follows: | ||||
| <ul> | ||||
| <li>protected-mode segments add 32-bit addresses and protection | ||||
| <ul> | ||||
| <li>wait: what's the point? the point of segments in real mode was | ||||
|   bigger addresses, but 32-bit mode fixes that! | ||||
| </ul> | ||||
| <li>segment register holds segment selector | ||||
| <li>selector indexes into global descriptor table (GDT) | ||||
| <li>segment descriptor holds 32-bit base, limit, type, protection | ||||
| <li>la = va + base ; assert(va < limit); | ||||
| <li>seg register usually implicit in instruction | ||||
| 	<ul> | ||||
| 	<li>DS:REG | ||||
| 		<ul> | ||||
| 		<li><tt>movl $0x1, _flag</tt> | ||||
| 		</ul> | ||||
| 	<li>SS:ESP, SS:EBP | ||||
| 		<ul> | ||||
| 		<li><tt>pushl %ecx, pushl $_i</tt> | ||||
| 		<li><tt>popl %ecx</tt> | ||||
| 		<li><tt>movl 4(%ebp),%eax</tt> | ||||
| 		</ul> | ||||
| 	<li>CS:EIP | ||||
| 		<ul> | ||||
| 		<li>instruction fetch | ||||
| 		</ul> | ||||
| 	<li>String instructions: read from DS:ESI, write to ES:EDI | ||||
| 		<ul> | ||||
| 		<li><tt>rep movsb</tt> | ||||
| 		</ul> | ||||
| 	<li>Exception: far addresses | ||||
| 		<ul> | ||||
| 		<li><tt>ljmp $selector, $offset</tt> | ||||
| 		</ul> | ||||
| 	</ul> | ||||
| <li>LGDT instruction loads CPU's GDT register | ||||
| <li>you turn on protected mode by setting PE bit in CR0 register | ||||
| <li>what happens with the next instruction? CS now has different | ||||
|   meaning... | ||||
| 
 | ||||
| <li>How to transfer from one segment to another, perhaps with different | ||||
| privileges. | ||||
| <ul> | ||||
| <li>Current privilege level (CPL) is in the low 2 bits of CS | ||||
| <li>CPL=0 is privileged O/S, CPL=3 is user | ||||
| <li>Within the same privilege level: ljmp. | ||||
| <li>Transfer to a segment with more privilege: call gates. | ||||
| <ul> | ||||
| <li>a way for app to jump into a segment and acquire privs | ||||
| <li>CPL must be <= descriptor's DPL in order to read or write segment | ||||
| <li>call gates can change privilege <b>and</b> switch CS and SS | ||||
|   segments | ||||
| <li>call gates are implemented using a special type of segment descriptor | ||||
|   in the GDT. | ||||
| <li>interrupts are conceptually the same as call gates, but their | ||||
|   descriptor is stored in the IDT.  We will use interrupts to transfer | ||||
|   control between user and kernel mode, both in JOS and xv6.  We will | ||||
|   return to this in the lecture about interrupts and exceptions. | ||||
| </ul> | ||||
| </ul> | ||||
| 
 | ||||
| <li>What about protection? | ||||
| <ul> | ||||
|   <li>can o/s limit what memory an application can read or write? | ||||
|   <li>app can load any selector into a seg reg... | ||||
|   <li>but can only mention indices into GDT | ||||
|   <li>app can't change GDT register (requires privilege) | ||||
|   <li>why can't app write the descriptors in the GDT? | ||||
|   <li>what about system calls? how do they transfer to the kernel? | ||||
|   <li>app cannot <b>just</b> lower the CPL | ||||
| </ul> | ||||
| </ul> | ||||
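<p>The segmentation step above (la = va + base; assert(va &lt; limit)) can be sketched as follows; the struct is an illustrative simplification of a descriptor, keeping only the base and limit fields:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified segment descriptor: just base and limit. */
struct seg {
    uint32_t base;
    uint32_t limit;   /* highest valid offset within the segment */
};

/* The hardware's segmentation step: check va against the limit,
 * then add the base to form the linear address. */
int seg_translate(struct seg *d, uint32_t va, uint32_t *la)
{
    if (va > d->limit)
        return -1;        /* hardware would raise a protection fault */
    *la = d->base + va;
    return 0;
}
```

With paging disabled (as in xv6), the linear address produced here is the physical address.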
| 
 | ||||
| <h2>Case study (xv6)</h2> | ||||
| 
 | ||||
| <p>xv6 is a reimplementation of <a href="../v6.html">Unix 6th edition</a>. | ||||
| <ul> | ||||
| <li>v6 is a version of the original Unix operating system for the <a href="http://www.pdp11.org/">DEC PDP11</a> | ||||
| <ul> | ||||
|        <li>PDP-11 (1972):  | ||||
| 	 <li>16-bit processor, 18-bit physical (40) | ||||
| 	 <li>UNIBUS | ||||
| 	 <li>memory-mapped I/O | ||||
| 	 <li>performance: less than 1MIPS | ||||
| 	 <li>register-to-register transfer: 0.9 usec | ||||
| 	 <li>56k-228k (40) | ||||
| 	 <li>no paging, but some segmentation support | ||||
| 	 <li>interrupts, traps | ||||
| 	 <li>about $10K | ||||
|          <li>rk disk with 2MByte of storage | ||||
| 	 <li>with cabinet 11/40 is 400lbs | ||||
| </ul> | ||||
|        <li>Unix v6 | ||||
| <ul> | ||||
|          <li><a href="../reference.html">Unix papers</a>. | ||||
|          <li>1976; first widely available Unix outside Bell labs | ||||
|          <li>Thompson and Ritchie | ||||
|          <li>Influenced by Multics but simpler. | ||||
| 	 <li>complete (used for real work) | ||||
| 	 <li>Multi-user, time-sharing | ||||
| 	 <li>small (43 system calls) | ||||
| 	 <li>modular (composition through pipes; one had to split programs!!) | ||||
| 	 <li>compactly written (2 programmers, 9,000 lines of code) | ||||
| 	 <li>advanced UI (shell) | ||||
| 	 <li>introduced C (derived from B) | ||||
| 	 <li>distributed with source | ||||
| 	 <li>V7 was sold by Microsoft for a couple years under the name Xenix | ||||
| </ul> | ||||
|        <li>Lion's commentary | ||||
| <ul> | ||||
|          <li>suppressed because of copyright issues | ||||
| 	 <li>resurfaced in 1996 | ||||
| </ul> | ||||
| 
 | ||||
| <li>xv6 written for 6.828: | ||||
| <ul> | ||||
|          <li>v6 reimplementation for x86 | ||||
| 	 <li>doesn't include all features of v6 (e.g., xv6 has 20 of 43 | ||||
| 	 system calls). | ||||
| 	 <li>runs on symmetric multiprocessing PCs (SMPs). | ||||
| </ul> | ||||
| </ul> | ||||
| 
 | ||||
| <p>Newer Unixes have inherited many of the conceptual ideas even though | ||||
| they added paging, networking, graphics, improved performance, etc. | ||||
| 
 | ||||
| <p>You will need to read most of the source code multiple times. Your | ||||
| goal is to explain every line to yourself. | ||||
| 
 | ||||
| <h3>Overview of address spaces in xv6</h3> | ||||
| 
 | ||||
| <p>In today's lecture we see how xv6 creates the kernel address | ||||
|  space and the first user address space, and switches to it. To understand | ||||
|  how this happens, we need to understand in detail the state on the | ||||
|  stack too---this may be surprising, but a thread of control and an | ||||
|  address space are tightly bundled in xv6, in a concept | ||||
|  called <i>process</i>.  The kernel address space is the only address | ||||
|  space with multiple threads of control. We will study context | ||||
|  switching and process management in detail in the next weeks; creation of | ||||
|  the first user process (init) will give you a first flavor. | ||||
| 
 | ||||
| <p>xv6 uses only the segmentation hardware of the x86, and only in a limited | ||||
|   way. (In JOS you will use the page-table hardware too, which we cover in the | ||||
|   next lecture.)  The address space layouts are as follows: | ||||
| <ul> | ||||
| <li>The kernel address space is set up as follows: | ||||
|   <pre> | ||||
|   the code segment runs from 0 to 2^32 and is mapped X and R | ||||
|   the data segment runs from 0 to 2^32 but is mapped W (read and write). | ||||
|   </pre> | ||||
| <li>For each process, the layout is as follows:  | ||||
| <pre> | ||||
|   text | ||||
|   original data and bss | ||||
|   fixed-size stack | ||||
|   expandable heap | ||||
| </pre> | ||||
| The text of a process is stored in its own segment and the rest in a | ||||
| data segment.   | ||||
| </ul> | ||||
| 
 | ||||
| <p>xv6 makes minimal use of the segmentation hardware available on the | ||||
| x86. What other plans could you envision? | ||||
| 
 | ||||
| <p>In xv6, each program has a user and a kernel stack; when the | ||||
| user program switches to the kernel, it switches to its kernel stack. | ||||
| Its kernel stack is stored in the process's proc structure. (This is | ||||
| arranged through the descriptors in the IDT, which is covered later.) | ||||
| 
 | ||||
| <p>xv6 assumes that there is a lot of physical memory. It assumes that | ||||
|   segments can be stored contiguously in physical memory and has | ||||
|   therefore no need for page tables. | ||||
| 
 | ||||
| <h3>xv6 kernel address space</h3> | ||||
| 
 | ||||
| <p>Let's see how xv6 creates the kernel address space by tracing xv6 | ||||
|   from when it boots, focusing on address space management: | ||||
| <ul> | ||||
| <li>Where does xv6 start after the PC is powered on: start (which is | ||||
|   loaded at physical address 0x7c00; see lab 1). | ||||
| <li>1025-1033: are we in real mode? | ||||
| <ul> | ||||
| <li>how big are logical addresses? | ||||
| <li>how big are physical addresses? | ||||
| <li>how are physical addresses calculated? | ||||
| <li>what segment is being used in subsequent code? | ||||
| <li>what values are in that segment? | ||||
| </ul> | ||||
| <li>1068: what values are loaded in the GDT? | ||||
| <ul> | ||||
| <li>1097: gdtr points to gdt | ||||
| <li>1094: entry 0 unused | ||||
| <li>1095: entry 1 (X + R, base = 0, limit = 0xffffffff, DPL = 0) | ||||
| <li>1096: entry 2 (W, base = 0, limit = 0xffffffff, DPL = 0) | ||||
| <li>are we using segments in a sophisticated way? (i.e., controlled sharing) | ||||
| <li>are P and S set? | ||||
| <li>are addresses translated as in protected mode when lgdt completes? | ||||
| </ul> | ||||
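<p>To see why entries 1 and 2 of the GDT look the way they do, here is a sketch that packs base/limit/access/flags into the x86's scattered 8-byte descriptor layout (the helper name is made up; access 0x9a = present, DPL 0, code, X+R; 0x92 = present, DPL 0, data, W; flags 0xc = 4KB granularity, 32-bit):

```c
#include <assert.h>
#include <stdint.h>

/* Pack a segment descriptor the way the x86 hardware expects it:
 * the base and limit fields are split across the 8 bytes. */
uint64_t make_descriptor(uint32_t base, uint32_t limit,
                         uint8_t access, uint8_t flags)
{
    uint64_t d = 0;
    d |= limit & 0xffff;                         /* limit bits 15:0  */
    d |= (uint64_t)(base & 0xffffff) << 16;      /* base bits 23:0   */
    d |= (uint64_t)access << 40;                 /* type/DPL/present */
    d |= (uint64_t)((limit >> 16) & 0xf) << 48;  /* limit bits 19:16 */
    d |= (uint64_t)(flags & 0xf) << 52;          /* granularity, D/B */
    d |= (uint64_t)(base >> 24) << 56;           /* base bits 31:24  */
    return d;
}
```

With 4KB granularity, limit 0xfffff covers 0..0xffffffff, i.e., the whole 32-bit address space from base 0.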
| <li>1071: no, and not even here. | ||||
| <li>1075: far jump, load 8 in CS. from now on we use segment-based translation. | ||||
| <li>1081-1086: set up other segment registers | ||||
| <li>1087: where is the stack which is used for procedure calls? | ||||
| <li>1087: cmain in the bootloader (see lab 1), which calls main0 | ||||
| <li>1222: main0.   | ||||
| <ul> | ||||
| <li>job of main0 is to set everything up so that all xv6 conventions work | ||||
| <li>where is the stack?  (sp = 0x7bec) | ||||
| <li>what is on it? | ||||
| <pre> | ||||
|    00007bec [00007bec]  7cda  // return address in cmain | ||||
|    00007bf0 [00007bf0]  0080  // callee-saved ebx | ||||
|    00007bf4 [00007bf4]  7369  // callee-saved esi | ||||
|    00007bf8 [00007bf8]  0000  // callee-saved ebp | ||||
|    00007bfc [00007bfc]  7c49  // return address for cmain: spin | ||||
|    00007c00 [00007c00]  c031fcfa  // the instructions from 7c00 (start) | ||||
| </pre> | ||||
</ul>
<li>1239-1240: switch to the cpu stack (important for the scheduler)
<ul>
<li>why -32?
<li>what values are in ebp and esp?
<pre>
esp: 0x108d30   1084720
ebp: 0x108d5c   1084764
</pre>
<li>what is on the stack?
<pre>
   00108d30 [00108d30]  0000
   00108d34 [00108d34]  0000
   00108d38 [00108d38]  0000
   00108d3c [00108d3c]  0000
   00108d40 [00108d40]  0000
   00108d44 [00108d44]  0000
   00108d48 [00108d48]  0000
   00108d4c [00108d4c]  0000
   00108d50 [00108d50]  0000
   00108d54 [00108d54]  0000
   00108d58 [00108d58]  0000
   00108d5c [00108d5c]  0000
   00108d60 [00108d60]  0001
   00108d64 [00108d64]  0001
   00108d68 [00108d68]  0000
   00108d6c [00108d6c]  0000
</pre>

<li>what is the 1 at 0x108d60?  is it on the stack?

</ul>

<li>1242: is it safe to reference bcpu?  where is it allocated?

<li>1260-1270: set up proc[0]

<ul>
<li>each process has its own stack (see struct proc).

<li>where is its stack?  (see the section on physical memory
  management below).

<li>what is the jmpbuf?  (we will discuss this in detail later)

<li>1267: why -4?

</ul>

<li>1270: necessary to be able to take interrupts (we will discuss this
  in detail later)

<li>1292: what process do you think scheduler() will run?  we will
  study later how that happens, but let's assume it runs process0 on
  process0's stack.
</ul>
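<p>To make the GDT entries at 1094-1096 concrete, here is a hedged
sketch of how the fields of a 386 segment descriptor are packed into
8 bytes (the helper name is ours; the layout is the one described in
the 80386 manual — with the granularity flag set, a limit of 0xfffff
covers the full 4GB):

```c
#include <stdint.h>

/* Pack base/limit/access/flags into an 8-byte 386 segment descriptor.
   access holds P, DPL, S and the type bits; flags holds G, D/B, etc. */
uint64_t gdt_entry(uint32_t base, uint32_t limit, uint8_t access, uint8_t flags)
{
    uint64_t d = 0;
    d |= (uint64_t)(limit & 0xffff);             /* limit 15:0  */
    d |= (uint64_t)(base & 0xffffff) << 16;      /* base 23:0   */
    d |= (uint64_t)access << 40;                 /* P, DPL, S, type */
    d |= (uint64_t)((limit >> 16) & 0xf) << 48;  /* limit 19:16 */
    d |= (uint64_t)(flags & 0xf) << 52;          /* G, D/B, ... */
    d |= (uint64_t)((base >> 24) & 0xff) << 56;  /* base 31:24  */
    return d;
}
```

With access = 0x9a (present, DPL 0, executable + readable) and
flags = 0xc (4KB granularity, 32-bit), this produces the classic flat
code descriptor 0x00cf9a000000ffff.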

<h3>xv6 user address spaces</h3>

<ul>
<li>1327: process0
<ul>
<li>process 0 sets up everything to make the process conventions work out

<li>which stack is process0 running on?  see 1260.

<li>1334: is the convention to release the proc_table_lock after being
  scheduled? (we will discuss locks later; assume there are no other
  processors for now.)

<li>1336: cwd is the current working directory.

<li>1348: first step in initializing a template trap frame: set
  everything to zero. we are setting up process 0 as if it just
  entered the kernel from user space and wants to go back to user
  space.  (see x86.h to see which fields have the value 0.)

<li>1349: why "|3" instead of 0?

<li>1351: why set the interrupt flag in the template trapframe?

<li>1352: where will the user stack be in proc[0]'s address space?

<li>1353: makes a copy of proc[0].  fork() calls copyproc() to implement
  forking a process.  This statement is in essence calling fork inside
  proc0, making proc[1] a duplicate of proc[0].  proc[0], however,
  has little in its address space: one page (see 1341).
<ul>
<li>2221: grab a lock on the proc table so that we are the only one
  updating it.
<li>2116: allocate the next pid.
<li>2228: we got our entry; release the lock. from now on we are only
  modifying our entry.
<li>2120-2127: copy proc[0]'s memory.  proc[1]'s memory will be identical
  to proc[0]'s.
<li>2130-2136: allocate a kernel stack. this stack is different from
  the stack that proc[1] uses when running in user mode.
<li>2139-2140: copy the template trapframe that xv6 had set up in
  proc[0].
<li>2147: where will proc[1] start running when the scheduler selects
  it?
<li>2151-2155: Unix semantics: the child inherits open file descriptors
  from the parent.
<li>2158: same for cwd.
</ul>

<li>1356: load a program into proc[1]'s address space.  the program
  loaded is the binary version of init.c (sheet 16).

<li>1374: where will proc[1] start?

<li>1377-1388: copy the binary into proc[1]'s address space.  (you
  will learn about the ELF format in the labs.)
<ul>
<li>can the binary for init be any size for proc[1] to work correctly?

<li>what is the layout of proc[1]'s address space? is it consistent
  with the layout described on lines 1950-1954?

</ul>

<li>1357: make proc[1] runnable so that the scheduler will select it
  to run.  everything is set up now for proc[1] to run, "return" to
  user space, and execute init.

<li>1359: proc[0] gives up the processor, which calls sleep, which
  calls sched, which setjmps back to scheduler. let's peek a bit into
  scheduler to see what happens next.  (we will return to the
  scheduler in more detail later.)
</ul>
<li>2219: this test will fail for proc[1]
<li>2226: setupsegs(p) sets up the segments for proc[1].  this call is
  more interesting than the previous one, so let's see what happens:
<ul>
<li>2032-37: this is for traps and interrupts, which we will cover later.
<li>2039-49: set up the new gdt.
<li>2040: why 0x100000 + 64*1024?
<li>2045: why 3?  why is the base p->mem? is p->mem physical or logical?
<li>2045-2046: how must the program for proc[1] be compiled if proc[1]
  is to run successfully in user space?
<li>2052: we are still running in the kernel, but we are loading the gdt.
  is this ok?
<li>why have so few user-level segments?  why not separate out code,
  data, stack, bss, etc.?
</ul>
<li>2227: record that proc[1] is running on the cpu
<li>2228: record that it is running instead of just runnable
<li>2229: setjmp to fork_ret.
<li>2282: which stack is proc[1] running on?
<li>2284: when scheduled, first release the proc_table_lock.
<li>2287: back into assembly.
<li>2782: where is the stack pointer pointing?
<pre>
   0020dfbc [0020dfbc]  0000
   0020dfc0 [0020dfc0]  0000
   0020dfc4 [0020dfc4]  0000
   0020dfc8 [0020dfc8]  0000
   0020dfcc [0020dfcc]  0000
   0020dfd0 [0020dfd0]  0000
   0020dfd4 [0020dfd4]  0000
   0020dfd8 [0020dfd8]  0000
   0020dfdc [0020dfdc]  0023
   0020dfe0 [0020dfe0]  0023
   0020dfe4 [0020dfe4]  0000
   0020dfe8 [0020dfe8]  0000
   0020dfec [0020dfec]  0000
   0020dff0 [0020dff0]  001b
   0020dff4 [0020dff4]  0200
   0020dff8 [0020dff8]  1000
</pre>
<li>2783: why jmp instead of call?
<li>what will iret put in eip?
<li>what is 0x1b?  what will iret put in cs?
<li>after iret, what will the processor be executing?
</ul>
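<p>The 0x1b and 0x23 values in the dump above are segment selectors. A
small sketch (struct and function names are ours) of how a selector
splits into its fields:

```c
#include <stdint.h>

/* An x86 segment selector: descriptor index, table indicator, RPL. */
struct selector { unsigned index, ti, rpl; };

struct selector decode_selector(uint16_t sel)
{
    struct selector s;
    s.rpl = sel & 3;          /* requested privilege level (3 = user) */
    s.ti = (sel >> 2) & 1;    /* 0 = GDT, 1 = LDT */
    s.index = sel >> 3;       /* index into the descriptor table */
    return s;
}
```

So 0x1b names GDT entry 3 with RPL 3 and 0x23 names GDT entry 4 with
RPL 3: user code and user data, at user privilege.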

<h3>Managing physical memory</h3>

<p>To create an address space we must allocate physical memory, which
  will be freed when an address space is deleted (e.g., when a user
  program terminates).  xv6 implements a first-fit memory allocator
  (see kalloc.c).

<p>It maintains a list of ranges of free memory.  The allocator finds
  the first range that is larger than the amount of requested memory.
  It splits that range in two: one range of the requested size and one
  of the remainder.  It returns the first range.  When memory is
  freed, kfree will merge ranges that are adjacent in memory.

<p>Under what scenarios is a first-fit memory allocator undesirable?

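<p>The split-on-allocate policy can be sketched in a few lines of C.
This is a toy illustration, not xv6's kalloc.c: it uses a fixed array
of ranges instead of a linked list, and it omits kfree's merging:

```c
#include <stddef.h>

/* A free range of physical memory: start address and length in bytes. */
struct range { size_t start, len; };

static struct range freelist[16] = { { 0x1000, 0x10000 } };
static int nranges = 1;

/* First fit: take the first range with at least n bytes and split it
   into the n bytes we return plus the remainder, which stays free. */
size_t ff_alloc(size_t n)
{
    for (int i = 0; i < nranges; i++) {
        if (freelist[i].len >= n) {
            size_t a = freelist[i].start;
            freelist[i].start += n;
            freelist[i].len -= n;
            return a;
        }
    }
    return 0;   /* out of memory */
}
```

Successive allocations carve addresses off the front of the first big
enough range, which is where first fit's fragmentation behavior comes
from.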
<h3>Growing an address space</h3>

<p>How can a user process grow its address space?  growproc.
<ul>
<li>2064: allocate a new segment of the old size plus n
<li>2067: copy the old segment into the new one (ouch!)
<li>2068: and zero the rest.
<li>2071: free the old physical memory
</ul>
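<p>The four steps above are essentially a copy-based realloc. A hedged
sketch (hypothetical helper using malloc/free, not xv6's growproc):

```c
#include <stdlib.h>
#include <string.h>

/* Grow a contiguous segment by n bytes: allocate old+n, copy, zero,
   free the old copy.  The whole-segment copy is the "ouch!" above. */
char *grow_segment(char *old, size_t oldsz, size_t n)
{
    char *new = malloc(oldsz + n);   /* 2064: allocate old size plus n */
    if (new == NULL)
        return NULL;
    memcpy(new, old, oldsz);         /* 2067: copy the old segment */
    memset(new + oldsz, 0, n);       /* 2068: zero the rest */
    free(old);                       /* 2071: free the old memory */
    return new;
}
```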
<p>We could do a lot better if segments didn't have to be contiguous in
  physical memory. How could we arrange that? Using page tables, which
  is our next topic.  This is one place where page tables would be
  useful, but there are others too (e.g., in fork).
</body>
							
								
								
									
web/l5.html (new file, 210 lines)
<title>Lecture 5</title>
<html>
<head>
</head>
<body>

<h2>Address translation and sharing using page tables</h2>

<p> Reading: <a href="../readings/i386/toc.htm">80386</a> chapters 5 and 6<br>

<p> Handout: <b> x86 address translation diagram</b> -
<a href="x86_translation.ps">PS</a> -
<a href="x86_translation.eps">EPS</a> -
<a href="x86_translation.fig">xfig</a>
<br>

<p>Why do we care about x86 address translation?
<ul>
<li>It can simplify s/w structure by placing data at fixed known addresses.
<li>It can implement tricks like demand paging and copy-on-write.
<li>It can isolate programs to contain bugs.
<li>It can isolate programs to increase security.
<li>JOS uses paging a lot, and segments more than you might think.
</ul>

<p>Why aren't protected-mode segments enough?
<ul>
<li>Why did the 386 add translation using page tables as well?
<li>Isn't it enough to give each process its own segments?
</ul>

<p>Translation using page tables on x86:
<ul>
<li>paging hardware maps a linear address (la) to a physical address (pa)
<li>(we will often interchange "linear" and "virtual")
<li>page size is 4096 bytes, so there are 1,048,576 pages in 2^32
<li>why not just have a big array with each page #'s translation?
<ul>
<li>table[20-bit linear page #] => 20-bit phys page #
</ul>
<li>the 386 uses a 2-level mapping structure
<li>one page directory page, with 1024 page directory entries (PDEs)
<li>up to 1024 page table pages, each with 1024 page table entries (PTEs)
<li>so la has 10 bits of directory index, 10 bits of table index, and 12 bits of offset
<li>What's in a PDE or PTE?
<ul>
<li>20-bit phys page number, present, read/write, user/supervisor
</ul>
<li>the cr3 register holds the physical address of the current page directory
<li>puzzle: what do the PDE read/write and user/supervisor flags mean?
<li>puzzle: can the supervisor read/write user pages?

<li>Here's how the MMU translates an la to a pa:

   <pre>
   uint
   translate (uint la, bool user, bool write)
   {
     uint pde, pte;
     pde = read_mem (%CR3 + 4*(la >> 22));
     access (pde, user, write);
     pte = read_mem ((pde & 0xfffff000) + 4*((la >> 12) & 0x3ff));
     access (pte, user, write);
     return (pte & 0xfffff000) + (la & 0xfff);
   }

   // check protection. pxe is a pte or pde.
   // user is true if CPL==3
   void
   access (uint pxe, bool user, bool write)
   {
     if (!(pxe & PG_P))
        => page fault -- page not present
     if (!(pxe & PG_U) && user)
        => page fault -- no access for user

     if (write && !(pxe & PG_W)) {
       if (user)
          => page fault -- not writable
       else if (!(pxe & PG_U))
         => page fault -- not writable
       else if (%CR0 & CR0_WP)
         => page fault -- not writable
     }
   }
   </pre>

<li>the CPU's TLB caches vpn => ppn mappings
<li>if you change a PDE or PTE, you must flush the TLB!
<ul>
  <li>by re-loading cr3
</ul>
<li>turn on paging by setting the CR0_PG bit of %cr0
</ul>
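<p>The 10/10/12 split of a linear address can be written down as C
macros (the names echo JOS's PDX/PTX, but treat this as a sketch):

```c
#include <stdint.h>

/* Split a 32-bit linear address into directory index, table index,
   and the offset within the page. */
#define PDX(la)   (((uint32_t)(la) >> 22) & 0x3ff)   /* top 10 bits    */
#define PTX(la)   (((uint32_t)(la) >> 12) & 0x3ff)   /* middle 10 bits */
#define PGOFF(la) ((uint32_t)(la) & 0xfff)           /* low 12 bits    */
```

For example, la = 0xf0123456 has PDX 0x3c0, PTX 0x123, and offset
0x456.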

<p>Can we use paging to limit what memory an app can read/write?
<ul>
<li>the user can't modify cr3 (requires privilege)
<li>is that enough?
<li>could the user modify the page tables? after all, they are in memory.
</ul>

<p>How we will use paging (and segments) in JOS:
<ul>
<li>use segments only to switch privilege level into/out of the kernel
<li>use paging to structure the process address space
<li>use paging to limit process memory access to its own address space
<li>below is the JOS virtual memory map
<li>why map both the kernel and the current process? why not 4GB for each?
<li>why is the kernel at the top?
<li>why map all of phys mem at the top? i.e. why multiple mappings?
<li>why map the page table a second time at VPT?
<li>why map the page table a third time at UVPT?
<li>how do we switch mappings for a different process?
</ul>

<pre>
    4 Gig -------->  +------------------------------+
                     |                              | RW/--
                     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                     :              .               :
                     :              .               :
                     :              .               :
                     |~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~| RW/--
                     |                              | RW/--
                     |   Remapped Physical Memory   | RW/--
                     |                              | RW/--
    KERNBASE ----->  +------------------------------+ 0xf0000000
                     |  Cur. Page Table (Kern. RW)  | RW/--  PTSIZE
    VPT,KSTACKTOP--> +------------------------------+ 0xefc00000      --+
                     |         Kernel Stack         | RW/--  KSTKSIZE   |
                     | - - - - - - - - - - - - - - -|                 PTSIZE
                     |      Invalid Memory          | --/--             |
    ULIM     ------> +------------------------------+ 0xef800000      --+
                     |  Cur. Page Table (User R-)   | R-/R-  PTSIZE
    UVPT      ---->  +------------------------------+ 0xef400000
                     |          RO PAGES            | R-/R-  PTSIZE
    UPAGES    ---->  +------------------------------+ 0xef000000
                     |           RO ENVS            | R-/R-  PTSIZE
 UTOP,UENVS ------>  +------------------------------+ 0xeec00000
 UXSTACKTOP -/       |     User Exception Stack     | RW/RW  PGSIZE
                     +------------------------------+ 0xeebff000
                     |       Empty Memory           | --/--  PGSIZE
    USTACKTOP  --->  +------------------------------+ 0xeebfe000
                     |      Normal User Stack       | RW/RW  PGSIZE
                     +------------------------------+ 0xeebfd000
                     |                              |
                     |                              |
                     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                     .                              .
                     .                              .
                     .                              .
                     |~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|
                     |     Program Data & Heap      |
    UTEXT -------->  +------------------------------+ 0x00800000
    PFTEMP ------->  |       Empty Memory           |        PTSIZE
                     |                              |
    UTEMP -------->  +------------------------------+ 0x00400000
                     |       Empty Memory           |        PTSIZE
    0 ------------>  +------------------------------+
</pre>

<h3>The VPT </h3>

<p>Remember how the x86 translates virtual addresses into physical ones:

<p><img src=pagetables.png>

<p>CR3 points at the page directory.  The PDX part of the address
indexes into the page directory to give you a page table.  The
PTX part indexes into the page table to give you a page, and then
you add the low bits in.

<p>But the processor has no concept of page directories, page tables,
and pages being anything other than plain memory.  So there's nothing
that says a particular page in memory can't serve as two or three of
these at once.  The processor just follows pointers:

<pre>
pd = rcr3();
pt = *(pd + 4*PDX);
page = *(pt + 4*PTX);
</pre>

<p>Diagrammatically, it starts at CR3, follows three arrows, and then stops.

<p>If we put a pointer into the page directory that points back to itself at
index V, as in

<p><img src=vpt.png>

<p>then when we try to translate a virtual address with PDX and PTX
equal to V, following three arrows leaves us at the page directory.
So that virtual page translates to the page holding the page directory.
In JOS, V is 0x3BD, so the virtual address of the VPD is
(0x3BD&lt;&lt;22)|(0x3BD&lt;&lt;12).

<p>Now, if we try to translate a virtual address with PDX = V but an
arbitrary PTX != V, then following three arrows from CR3 ends
one level up from usual (instead of two as in the last case),
which is to say in the page tables.  So the set of virtual pages
with PDX=V form a 4MB region whose page contents, as far
as the processor is concerned, are the page tables themselves.
In JOS, V is 0x3BD, so the virtual address of the VPT is (0x3BD&lt;&lt;22).

<p>So because of the "no-op" arrow we've cleverly inserted into
the page directory, we've mapped the pages being used as
the page directory and page table (which are normally virtually
invisible) into the virtual address space.
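<p>Plugging V = 0x3BD into these two observations gives the addresses
quoted above (a quick sketch; the function names are ours):

```c
#include <stdint.h>

enum { V = 0x3bd };   /* the self-mapping slot in the page directory */

/* PDX = V, PTX arbitrary: a 4MB window onto the page tables. */
uint32_t vpt_base(void)
{
    return (uint32_t)V << 22;
}

/* PDX = PTX = V: the single page holding the page directory. */
uint32_t vpd_base(void)
{
    return ((uint32_t)V << 22) | ((uint32_t)V << 12);
}
```

That is, the page tables appear at 0xef400000 and the page directory
itself at 0xef7bd000.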

</body>
							
								
								
									
web/mkhtml (new executable file, 70 lines)
#!/usr/bin/perl

my @lines = <>;
my $text = join('', @lines);
my $title;
if($text =~ /^\*\* (.*?)\n/m){
	$title = $1;
	$text = $` . $';
}else{
	$title = "Untitled";
}

$text =~ s/[ \t]+$//mg;
$text =~ s/^$/<br><br>/mg;
$text =~ s!\b([a-z0-9]+\.(c|s|pl|h))\b!<a href="src/$1.html">$1</a>!g;
$text =~ s!^(Lecture [0-9]+\. .*?)$!<b><i>$1</i></b>!mg;
$text =~ s!^\* (.*?)$!<h2>$1</h2>!mg;
$text =~ s!((<br>)+\n)+<h2>!\n<h2>!g;
$text =~ s!</h2>\n?((<br>)+\n)+!</h2>\n!g;
$text =~ s!((<br>)+\n)+<b>!\n<br><br><b>!g;
$text =~ s!\b\s*--\s*\b!\–!g;
$text =~ s!\[([^\[\]|]+) \| ([^\[\]]+)\]!<a href="$1">$2</a>!g;
$text =~ s!\[([^ \t]+)\]!<a href="$1">$1</a>!g;

$text =~ s!``!\“!g;
$text =~ s!''!\”!g;

print <<EOF;
<!-- AUTOMATICALLY GENERATED: EDIT the .txt version, not the .html version -->
<html>
<head>
<title>$title</title>
<style type="text/css"><!--
body {
	background-color: white;
	color: black;
	font-size: medium;
	line-height: 1.2em;
	margin-left: 0.5in;
	margin-right: 0.5in;
	margin-top: 0;
	margin-bottom: 0;
}

h1 {
	text-indent: 0in;
	text-align: left;
	margin-top: 2em;
	font-weight: bold;
	font-size: 1.4em;
}

h2 {
	text-indent: 0in;
	text-align: left;
	margin-top: 2em;
	font-weight: bold;
	font-size: 1.2em;
}
--></style>
</head>
<body bgcolor=#ffffff>
<h1>$title</h1>
<br><br>
EOF
print $text;
print <<EOF;
</body>
</html>
EOF
							
								
								
									
web/x86-intr.html (new file, 53 lines)
<title>Homework: xv6 and Interrupts and Exceptions</title>
<html>
<head>
</head>
<body>

<h1>Homework: xv6 and Interrupts and Exceptions</h1>

<p>
<b>Read</b>: xv6's trapasm.S, trap.c, syscall.c, vectors.S, and usys.S. Skim
lapic.c, ioapic.c, and picirq.c.

<p>
<b>Hand-In Procedure</b>
<p>
You are to turn in this homework during lecture. Please
write up your answers to the exercises below and hand them in to a
6.828 staff member at the beginning of the lecture.
<p>

<b>Introduction</b>

<p>Try to understand
xv6's trapasm.S, trap.c, syscall.c, vectors.S, and usys.S.
You will need to consult:

<p>Chapter 5 of <a href="../readings/ia32/IA32-3.pdf">IA-32 Intel
Architecture Software Developer's Manual, Volume 3: System programming
guide</a>; you can skip sections 5.7.1, 5.8.2, and 5.12.2. Be aware
that terms such as exceptions, traps, interrupts, faults and aborts
have no standard meaning.

<p>Chapter 9 of the 1987 <a href="../readings/i386/toc.htm">i386
Programmer's Reference Manual</a> also covers exception and interrupt
handling in IA32 processors.

<p><b>Assignment</b>:

In xv6, set a breakpoint at the beginning of <code>syscall()</code> to
catch the very first system call.  What values are on the stack at
this point?  Turn in the output of <code>print-stack 35</code> at that
breakpoint with each value labeled as to what it is (e.g.,
saved <code>%ebp</code> for <code>trap</code>,
<code>trapframe.eip</code>, etc.).
<p>
<b>This completes the homework.</b>

</body>
							
								
								
									
web/x86-intro.html (new file, 18 lines)
<title>Homework: Intro to x86 and PC</title>
<html>
<head>
</head>
<body>

<h1>Homework: Intro to x86 and PC</h1>

<p>Today's lecture is an introduction to the x86 and the PC, the
platform for which you will write an operating system.  The assigned
book is a reference for the x86 assembly programming you will be
doing.

<p><b>Assignment</b>: Make sure to do exercise 1 of lab 1 before
coming to lecture.

</body>
							
								
								
									
web/x86-mmu.html (new file, 33 lines)
<title>Homework: x86 MMU</title>
<html>
<head>
</head>
<body>

<h1>Homework: x86 MMU</h1>

<p>Read chapters 5 and 6 of
<a href="../readings/i386/toc.htm">Intel 80386 Reference Manual</a>.
These chapters explain
the x86 Memory Management Unit (MMU),
which we will cover in lecture today and which you need
to understand in order to do lab 2.

<p>
<b>Read</b>: bootasm.S and setupsegs() in proc.c

<p>
<b>Hand-In Procedure</b>
<p>
You are to turn in this homework during lecture. Please
write up your answers to the exercises below and hand them in to a
6.828 staff member by the beginning of lecture.
<p>

<p><b>Assignment</b>: Try to understand setupsegs() in proc.c.
  What values are written into <code>gdt[SEG_UCODE]</code>
  and <code>gdt[SEG_UDATA]</code> for init, the first user-space
  process?
  (You can use Bochs to answer this question.)

</body>
							
								
								
									
										
web/x86-mmu1.pdf (binary file, not shown)
web/x86-mmu2.pdf (binary file, not shown)
÷¯všjÚ¶¾Ø]w9_é+K»õöÕöÒ†µû†“÷³bwÅGñÄ>“û™¦>ØâÇÜllWÛ_ÜSVõþï¿ø¨ƒKº~ßVŸý¯Þ½…M?»"ëøu¯iú~š×»28^ðšÂft ÉÚ¦<C39A>–è&[_a;0êœ0ƒ/“Xg§´¢|@Á4Ðh!H~$„^DDDDGþ"vu׆<C397>(%~5_µEÕQ6-ûT-
ãÇ÷¯ÿõ¿úÿýÿ׿ÿ×ÿÿÿå¬c!£xù£8Á›Í„#Šr#äp¤p Žƒ˜ŽÆyÆÆa£‚™„3À…ÁAq˜dá¶a—€^Z<0C>!ä`Ëäxø0G\ˆàLŽ[â9˜Ês0þ]‘ÁPŽàsÈá’{8<10>Ã9ˆàw™Á¸ë | ||||
| @˜ðMÉY †häpERX$ä$9ÛÐÌë5 f®C=‘ŽDó¹C˜çY2<59>dÄB§ Yì<>˨ªÌ4Rs€ƒQ<1F>ê2ÖS\µ¨–h¾¤§ˆòÐË)J2rÅq)£ òÕ¼ì£"H§gdÁ¹Þ¤EˆíÌ‹ ¦8 giY¶28†®¢*†šå#ÔœzÃ&îi<>‹‘2º¯ø^—I»¿áÂïÿ××í¿…Tÿÿú^וî#_ÓTíÿi}))Ó‘ý·ûV{ | ||||
| üìl^Ú×Kx{“³T}‘4úyåq[#ˆTDŠùÆj3 Î„Ñ;1;8d<38>sAƒ4œŒK¥æ|>'ž0S1 | ||||
| Û‡>´"FÙÃ4g£q=OÂz5™Ã02@…ÌŽVP.|]ü=`˜Aà<41>¨Aá?Â_áè40ƒ	ãù¬ŒÃ | ||||
| &8ú}bL 㓆ðƒ4õN-=:Óˆi"cú_T®šnWW„qÖCå<43>Ñ¼Žˆùx_Œ'_Mßi¦«ë¦©î<C2A9>§ß§õÅ:OízÙ]úŒLöëCþ¿Ýd!iW§¿ªëwëú†~½FÜ”<04>ªªr7ÿN$gÄ Q~¾ê)cïz®ëö‰FJ?#<23>'<27>’»Z"»ô»¯…#޲XÑ2;ÂŽ‚–EµB@Œþ¶D×Y"Ã’Í%r,dw…%œ=?O#Î×Oð<4F>äyÿô<C3BF>*º@Š<>µi‰G | ||||
| ù?ˆïZý<'®E/t¤	^Á«k‚®´Ç¥õzzK?—;¿ýEkQÐcÂ#®$kúA?ÁSw¥ôë<C3B4>KÓ×UÁŒ„IÄ/ÈTDëõ¬!Ý¿ÅÑ ?Ql‡#}G]ÿ^Ì×þBŽÁj¯¯âäÁ=ƒÏƒÿ×ÿÉ~Òú¿Ž]Dè·¯ê0ë„¿â<C2BF>ÿß_
ð¿ýþ„ÈÝáÿáü7™-«ã_ÿƒ~‚úì.‡ýøl7ëᇯý÷¾Þïü޲íà<C3AD>Ë"š‘Õÿÿ×ï׿ÿÑû‡Ñ?×ú×ÝÖ‚ikY:™.‡uÿ¾ußÂþAŽÿÿÈÁÃüÒ<>Ä࿾ºÏ=‡ñò4˜_ä˜ÿ:úßü<C39F>[Á®ü‘ð·Nûú5dtaŸ-æÑé_ñý_r<5F>ӽϙrz pA	#‰:Û×ÕGKÿA®½…µìù¡m ßô!ÿû-Û†h[}WÀ·«,,?¡ydšý¿^Ïú×H äãì0]tD³<44>l+Ø-Ö¿a~öŠûÂVãàØøãccìÓø‘eö
b¹-Çüqð…¸ß<C2B8><C39F><EFBFBD>øý<C3B8><C3BD>Ú¯÷¦®½ëÒl„c±ÿìoþþ×ÖþÓµýw{WþÃKß}¦±²ŸNY–>þÓê¶¿·®ï»ZþínÖþÿþÖáÃA¤Šð±ëD[Õþ¡ý†öï°š'vD¨kdX¾Â®ýØNàÁPdò×0Zb1HzdAá‘á Â÷Þ¨4^œ®¸Ð0@Â:2:jލ4!àÐ0MavšÁ hÕšˆˆ¨<CB86>BŽàÂÃÐh"’‰øh<C3B8>‚ÎÅOQˆŽ"""""6¨5©fŠ?V5 Õ°•x§Ý/tÂJï@Âì ÔíKF?ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿü²™¨ÿÿ“s7…ùcÖï`ý¸fi†ÿÿäñRUÓ<55>Gÿÿÿÿÿÿÿÿå‘T
bàælØËƒrñt\ËŠb9gÐËäto#£qDmPˆˆˆˆˆˆˆˆˆˆˆ‰<0C>w!´9‡ ¸äÇ,rrÉÁxQƒ6„¢"""""C$‡,r-‘0Ðh9Ëk	ÄDDD†I9
ƒ“‹²1Íg™NSHDµë*„ô,!ö<>ÞG(äDDDDHc<>m‚nE¡‹ŸŽå4â"""%¹ò`!€¤tl	€ªG½ˆ•ÍQÄ^)ÑÌò.ŒÙ‚0ÐMˆˆˆˆ‰]TÌG™!À¯ˆˆ™z(ÍÅÈàAs#<23>áµÄDDDì¡÷×ÿÿÿÿÿÿÿõÿÿÿÿÿÿÿå‘X`ì€Á;;3?pÓ³½Y#1˜ÎÇv„p9,KáÃÊÃÊMM¡$äße9„ÿ}.]êÊ‚¿Óÿ¿ôºM÷ûþK²\Ç’:_ÒD ÿ¿»™‘6ðç#ùă¡Ò]( üŽá¯wë„ Ôø<C394>i„Á3š™ñ™™9ãÍF\‚G3<1C>éSžFŒ éÎÈ/¿øL! ÷¸Â
ac׈|Eé.¯‚
6.Ùá	ÈŽfcÉÅË–hß¿ªj½iÝÿ»ä(ÿE<C3BF>ˆ_áá…„8¼;ïý'þ«ßÔ‚ùýz´´H|?P¾œ=ÿ’·$ù(|<7C>ùì0IÓkv^’ׯÝ4½^Óõ~H:O½Hùh'ä³ü–.úÕ×°¤‡êˆâ·È£ïOõB\%~G¯KÓ]u×ïuUûõŽE“Ìžx'‘Ç’½þ<C2BD>-ãê<u÷^›÷õÎ	iéiô#î¾×O×û ŸëÓÿÓÕÔ<C395>ZZÒÒÕ6Gøó@«Qj%ÂS}ÿy¤Aÿ¿ÿ÷ÿé×þƒäQôÚ÷Awó0н}A%0šÕyf‹ý<E280B9>ÿ<EFBFBD>‹v›§ëä9úÓKÿÿK¿Ç¯’
›-×þ½ü5¶¼|Š9ÝIjGõëÿú¶Á…í³eý¥µõ¨þ¿`¸Oò^	Q,þÉUÿqÇñ±Ü ƒ—¿Ãíºµn~üý¯GïÿÕuU
¨þùжãáÚ~6Á|-¯a?®î×ÛMmzõ}7]Óñ\|oWdQáÚ®ž<C2AE>¯íÿz«v“_íèa‚`ŸiÃÿj«½¿Ø[aÚÞ¶¿ˆŽ""""4
œOÛUÁPh0žE{&ðÕ2(úØIw,£’â"#¸ˆŽ""#ƒ98‡fƒ!üRIyÚƒÄDDGû…VÒ°«×ÂXXëáªÛV’ðˆè²Ìê¶µñ‹ÓT|DD[2_ÿþÿºþ¿ÿÿÿÿÿ+‰‘øA§ôòήþ³Ò×DÊ2#Ñx¦šØ e Ê€r¬É€é”‚<E2809D>¾F¯Óª®©„íÕ2¬TÂÿpJ]°¡S½?M©þ’4zN,»¦oYb”HvªŸÅÿyw¥Û¥„<11> | ||||
| ‰;ø%ƒ×ôôô“_¼maýÐ:2s_ê‡Z×ÿÿŠ[]Ú®ˆ†Jù0ëù\¡Þûì$”…VX„šõ·ê»«'~¾þW¿KãÖí-WÿßáÛ§É=›)aµúÿú§Dá¥A±oÁ°—ߨa/KÃT§â ½h<17>ŽéëôÈ‹/þ×ý•‹]|—{Ý »¯†ª‘×ÞÚ[k©H>’ßø<C39F>Ûö	X*Ú¶a0Èâ×i}/â˜éŠ<C3A9>ŠSþÇöµÃÖÓ[Q{üDZ ÂÚ…L/a|DZé˜uñÿÿÿÿÿ+<2B><0B>á¨\!¶_0þDDHƒ<03>îDr+™ÈÙ¦_ Ò91ÉŽCn‡eøˆ‰<0C>aÊ‚ràŽ | ||||
| óauœ‚†¹GË\‚(2 | ||||
| ÞA©ÄDDDDAÿ¹s$ƒA
<0A>£øˆ<C3B8>•3²†TFÁ´<C381><0C>Wþî,Ãa	VÈ”Fâ5&@Ì<>ŸÿÝÿ¦Ÿÿ‚)ëÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿúÿÿÿÿÿÿÿ¿ÿÿÿþ¿ýÿÿïÿÿÿÿÿÿ×ÿÿþ¿ÿÿÿׯÿÿ»ÿýÿÚ_ùy8ÿÿÿø z!Ï'ŒÞ|ÿÛÿúzýÿý}ùnÿþÿõÿô}òýÿÍ?ÿÿýÿÿÝÿÿÿï_¿ûÿÿÿ_×ÿÿë¯ÿÿÿýãâ¾»	ªßׯÿ<C2AF>ÿëÿ]ÿÿÿæ“O{ÿ÷ÿ¿õÿ4ë×ÿ×ÿÒÿÿÿÿýÿoßÿß¿X`¿¯ÿ¬œ/õïÿÿý=ð×ÿþ¯½÷V¸K¿Ö²¾¬Âë–?Ô\hqÿ•ÐM | ||||
 rsc
						rsc