<html>
<head>
<title>Microkernel lecture</title>
</head>
<body>

<h1>Microkernels</h1>

<p>Required reading: Improving IPC by kernel design

<h2>Overview</h2>

<p>This lecture looks at the microkernel organization.  In a
microkernel, services that a monolithic kernel implements in the
kernel run as user-level programs.  For example, the file system, UNIX
process management, the pager, and the network protocols each run in a
separate user-level address space.  The microkernel itself supports
only the services that are necessary to allow system services to run
well in user space; a typical microkernel has at least support for
creating address spaces, threads, and interprocess communication.

<p>The potential advantages of a microkernel are simplicity of the
kernel (small), isolation of operating system components (each runs in
its own user-level address space), and flexibility (we can have both a
file server and a database server).  One potential disadvantage is
performance loss, because what requires a single system call in a
monolithic kernel may require multiple system calls and context
switches in a microkernel.

<p>One way in which microkernels differ from each other is the exact
kernel API they implement.  For example, Mach (a system developed at
CMU, which influenced a number of commercial operating systems) has
the following system calls: processes (create, terminate, suspend,
resume, priority, assign, info, threads), threads (fork, exit, join,
detach, yield, self), ports and messages (a port is a unidirectional
communication channel with a message queue and supporting primitives
to send, destroy, etc.), and regions/memory objects (allocate,
deallocate, map, copy, inherit, read, write).

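<p>To make the port abstraction concrete, here is a sketch of a
client-side RPC over a Mach-style port.  The names and signatures
(port_t, msg_send, msg_receive) are hypothetical placeholders, not the
actual Mach interface; the point is that one RPC costs at least two
kernel traps.

<pre>
/* Hypothetical sketch of Mach-style port IPC; the names and
 * signatures are placeholders, not the real Mach interface. */
typedef int port_t;

struct message {
    port_t reply_port;    /* where the server should send the answer */
    int    op;            /* requested operation */
    char   data[64];      /* inline payload */
};

int msg_send(port_t dst, struct message *m);     /* trap: enqueue on dst */
int msg_receive(port_t src, struct message *m);  /* trap: dequeue from src */

int rpc(port_t server, struct message *req, struct message *resp)
{
    if (msg_send(server, req) != 0)             /* first trap */
        return -1;
    return msg_receive(req->reply_port, resp);  /* second trap */
}
</pre>
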
<p>Some microkernels are more "microkernel" than others.  For example,
some microkernels implement the pager in user space but the basic
virtual memory abstractions in the kernel (e.g., Mach); others are
more extreme and implement most of the virtual memory in user space
(L4).  Yet others are less extreme: in Chorus, many servers run in
their own address space, but in kernel mode.

<p>All microkernels support multiple threads per address space.  xv6
and, until recently, Unix didn't; why?  Because in Unix system
services are typically implemented in the kernel, and those are the
primary programs that need multiple threads to handle events
concurrently (e.g., waiting for the disk while processing new I/O
requests).  In a microkernel, these services are implemented in
user-level address spaces, so they need a mechanism for handling
operations concurrently.  (Of course, one can argue that if fork is
efficient enough, there is no need for threads.)

<h2>L3/L4</h2>

<p>L3 is a predecessor to L4.  L3 provides data persistence, DOS
emulation, and an ELAN runtime system.  L4 is a reimplementation of
L3, but without the data persistence.  L4KA is a project at
sourceforge.net, and you can download the code for the latest
incarnation of L4 from there.

| <p>L4 is a "second-generation" microkernel, with 7 calls: IPC (of
 | |
| which there are several types), id_nearest (find a thread with an ID
 | |
| close the given ID), fpage_unmap (unmap pages, mapping is done as a
 | |
| side-effect of IPC), thread_switch (hand processor to specified
 | |
| thread), lthread_ex_regs (manipulate thread registers),
 | |
| thread_schedule (set scheduling policies), task_new (create a new
 | |
| address space with some default number of threads).  These calls
 | |
| provide address spaces, tasks, threads, interprocess communication,
 | |
| and unique identifiers.  An address space is a set of mappings.
 | |
| Multiple threads may share mappings, a thread may grants mappings to
 | |
| another thread (through IPC).  Task is the set of threads sharing an
 | |
| address space.
 | |
| 
 | |
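<p>As a rough picture of how small this interface is, here are
hypothetical C-style prototypes for the seven calls.  The types and
argument lists are simplified assumptions for illustration, not the
real L4 ABI (which passes most arguments in registers).

<pre>
/* Hypothetical prototypes for the seven L4 calls; simplified, not
 * the actual L4 ABI. */
typedef unsigned long l4_threadid_t;
typedef unsigned long l4_fpage_t;     /* region to map/unmap */

int  l4_ipc(l4_threadid_t dst, void *send_msg, void *recv_msg, int timeout);
l4_threadid_t l4_id_nearest(l4_threadid_t id);    /* thread with nearby ID */
void l4_fpage_unmap(l4_fpage_t f, int flags);     /* revoke mappings granted via IPC */
void l4_thread_switch(l4_threadid_t dst);         /* donate the CPU to dst */
void l4_lthread_ex_regs(l4_threadid_t t, unsigned long ip, unsigned long sp);
void l4_thread_schedule(l4_threadid_t t, int prio, int timeslice);
l4_threadid_t l4_task_new(l4_threadid_t pager, unsigned long ip, unsigned long sp);
</pre>
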
<p>A thread is the execution abstraction; it belongs to an address
space and has a UID, a register set, a page-fault handler, and an
exception handler.  The UID of a thread is its task number plus the
number of the thread within that task.

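<p>A minimal sketch of that UID encoding; the field width is an
assumption for illustration, not the actual L4 layout.

<pre>
/* Thread UID = task number in the high bits, thread-within-task in
 * the low bits.  THREAD_BITS is an assumed width. */
#define THREAD_BITS 7

unsigned long uid_of(unsigned long task, unsigned long thread)
{
    return (task << THREAD_BITS) | thread;
}

unsigned long task_of(unsigned long uid)   { return uid >> THREAD_BITS; }
unsigned long thread_of(unsigned long uid) { return uid & ((1UL << THREAD_BITS) - 1); }
</pre>
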
<p>IPC passes data by value or by reference to another address space.
It also provides for sequence coordination.  It is used for
communication between clients and servers, to pass interrupts to a
user-level exception handler, and to pass page faults to an external
pager.  In L4, device drivers are implemented as user-level processes
with the device mapped into their address space.
Linux runs as a user-level process.

<p>L4 provides quite a range of message types: inline-by-value,
strings, and virtual memory mappings.  The send and receive
descriptors specify how many of each, if any.

<p>In addition, there is a system call for timeouts and controlling
thread scheduling.

<h2>L3/L4 paper discussion</h2>

<ul>

<li>This paper is about performance.  What is a microsecond?  Is 100
usec bad?  Is 5 usec so much better that we care?  How many
instructions does a 50-MHz x86 execute in 100 usec?  (100 usec is
5,000 cycles, so on the order of a few thousand instructions.)  What
can we compute with that number of instructions?  How many disk
operations in that time?  How many interrupts can we take?  (The
livelock paper, which we cover in a few lectures, mentions 5,000
network packets per second, and each packet generates two interrupts.)

<li>In performance calculations, what is the appropriate/better
metric?  Microseconds or cycles?

<li>Goal: improve IPC performance by a factor of 10 through careful
kernel design that is fully aware of the hardware it is running on.
Principle: performance rules!  Optimize for the common case.  Because
in L3 interrupts are propagated to user level using IPC, the system
may have to support many IPCs per second (as many as the device can
generate interrupts).

<li>IPC consists of transferring control and transferring data.  The
minimal cost for transferring control is 127 cycles, plus 45 cycles
for TLB misses (see Table 3).  What are the x86 instructions to enter
and leave the kernel?  (int, iret)  Why do they consume so much time?
(They flush the pipeline.)  Do modern processors perform these
operations more efficiently?  No, worse now: faster processors are
optimized for straight-line code; traps/exceptions flush a deeper
pipeline, and cache misses cost more cycles.

<li>What are the 5 TLB misses?  1) B's thread control block; loading
%cr3 flushes the TLB, so 2) the kernel text causes a miss; the iret
accesses both 3) the stack and 4+5) two user pages: B's user text and
the message that B's user code looks at.

<li>Interface (see the sketch below):
<ul>
<li>call (threadID, send-message, receive-message, timeout);
<li>reply_and_receive (reply-message, receive-message, timeout);
</ul>

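<p>A sketch of how a client and a server might use these two calls;
the wrappers, the msg type, and handle() are hypothetical, not the
real interface.

<pre>
/* Hypothetical client/server skeleton over call and
 * reply_and_receive; one trap per side per RPC. */
struct msg { int op; char data[64]; };

int  call(int threadID, struct msg *snd, struct msg *rcv, int timeout);
int  receive(struct msg *rcv, int timeout);
int  reply_and_receive(struct msg *rpl, struct msg *rcv, int timeout);
void handle(struct msg *req, struct msg *resp);

void client(int server_tid)
{
    struct msg req, resp;
    call(server_tid, &req, &resp, -1);  /* send request, block for reply */
}

void server_loop(void)
{
    struct msg req, resp;
    receive(&req, -1);                  /* first request: plain receive */
    for (;;) {
        handle(&req, &resp);
        reply_and_receive(&resp, &req, -1);  /* reply, block for next */
    }
}
</pre>
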
<li>Optimizations:
<ul>

<li>New system call: reply_and_receive.  Effect: 2 system calls per
RPC instead of 4.

<li>Complex messages: direct string, indirect strings, and memory
objects.

<li>Direct transfer by a temporary mapping through a communication
window.  The communication window is mapped in B's address space and
in A's kernel address space; why is this better than just mapping a
page shared between A's and B's address spaces?  1) Multi-level
security: a permanently shared page makes it hard to reason about
information flow; 2) the receiver can't check message legality (it
might change after the check); 3) when a server has many clients, it
could run out of virtual address space, and the shared memory regions
would have to be established ahead of time; 4) it is not application
friendly, since the data may already be at another address, i.e.,
applications would have to copy anyway--possibly more copies.

<li>Why not use the following approach: map the region copy-on-write
(or read-only) in A's address space after the send, and read-only in
B's address space?  Then B may have to copy the data, or cannot
receive data at its final destination.

<li>On the x86 this is implemented by copying B's PDEs into A's page
directory.  Why two PDEs?  (The maximum message size is 4 MB, so the
copy is guaranteed to work if the message starts in the bottom 4 MB of
an 8 MB mapped region.)  Why not just copy PTEs?  That would be much
more expensive: one PDE covers 4 MB, while the equivalent PTEs are
1024 entries per 4 MB.

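<p>A minimal sketch of the PDE-copying trick on 32-bit x86; the slot
number and names are assumptions for illustration, not L4's actual
code.

<pre>
/* Map the communication window by copying the two PDEs that cover
 * B's receive area into reserved slots of A's page directory.
 * Illustrative only; WINDOW_SLOT is a hypothetical slot in the
 * kernel half of A's address space. */
#include <stdint.h>

#define PDE_SHIFT   22      /* each PDE maps 4 MB */
#define WINDOW_SLOT 0x301   /* hypothetical window slot in A's pgdir */

void map_window(uint32_t *a_pgdir, uint32_t *b_pgdir, uintptr_t b_dest)
{
    unsigned i = b_dest >> PDE_SHIFT;
    a_pgdir[WINDOW_SLOT]     = b_pgdir[i];      /* message starts here */
    a_pgdir[WINDOW_SLOT + 1] = b_pgdir[i + 1];  /* may spill into next 4 MB */
    /* No TLB work is needed here if the TLB is "window clean" (next item). */
}
</pre>
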
<li>What does it mean for the TLB to be "window clean"?  Why do we
care?  It means the TLB contains no mappings within the communication
window.  We care because mapping is cheap (copy a PDE), but
invalidation is not; the x86 only lets you invalidate one page at a
time, or the whole TLB.  Does TLB invalidation of the communication
window turn out to be a problem?  Not usually, because the kernel has
to load %cr3 during IPC anyway, which flushes the whole TLB.

<li>Thread control block: registers, links to various doubly-linked
lists, pgdir, UID, etc.  The lower part of the thread UID contains the
TCB number.  One can also deduce the TCB address from the stack by
taking SP AND bitmask (the SP comes out of the TSS when just switching
to the kernel).

<li>The kernel stack is on the same page as the TCB.  Why?  1) It
minimizes TLB misses (accessing the kernel stack brings in the TCB);
2) it allows very efficient access to the TCB -- just mask off the
lower 12 bits of %esp; 3) with VM, the lower 32 bits of the thread id
can indicate which TCB; using one page per TCB means there is no need
to check whether a thread is swapped out (simply don't map that TCB if
it shouldn't be accessed).

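<p>A sketch of the %esp-masking trick on 32-bit x86, assuming a 4 KiB
page shared by the TCB (at the bottom) and the kernel stack (growing
down from the top).

<pre>
/* Current TCB = current kernel stack pointer rounded down to a page
 * boundary.  Illustrative 32-bit x86 code. */
#include <stdint.h>

#define PAGE_SIZE 4096

struct tcb;   /* registers, queue links, pgdir, uid, ... */

static inline struct tcb *current_tcb(void)
{
    uintptr_t esp;
    __asm__ volatile ("mov %%esp, %0" : "=r" (esp));
    return (struct tcb *)(esp & ~(uintptr_t)(PAGE_SIZE - 1));
}
</pre>
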
<li>Invariant on queues: queues always hold in-memory TCBs.

<li>Wakeup queue: a set of 8 unordered wakeup lists (indexed by wakeup
time mod 8), and a smart representation of time so that 32-bit
integers can be used in the common case (base + offset in msec; bump
the base and recompute all offsets every ~4 hours; the maximum timeout
is ~24 days, 2^31 msec).

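<p>A sketch of that wakeup scheme; the struct layout and the choice of
units are assumptions for illustration.

<pre>
/* Eight unordered wakeup lists indexed by wakeup time mod 8; each
 * entry stores a 32-bit offset from a periodically bumped base.
 * Illustrative only. */
#include <stdint.h>

#define NLISTS 8

struct wtcb {
    struct wtcb *next;
    uint32_t     offset_msec;   /* wakeup time minus base_msec */
};

static struct wtcb *wakeup_list[NLISTS];
static uint64_t     base_msec;  /* bumped (offsets recomputed) periodically */

void enqueue_wakeup(struct wtcb *t, uint64_t wakeup_msec)
{
    unsigned i = wakeup_msec % NLISTS;
    t->offset_msec = (uint32_t)(wakeup_msec - base_msec);
    t->next = wakeup_list[i];
    wakeup_list[i] = t;
}
</pre>
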
<li>What is the problem addressed by lazy scheduling?
The conventional approach to scheduling:
<pre>
    A sends message to B:
      Move A from ready queue to waiting queue
      Move B from waiting queue to ready queue
    This requires 58 cycles, including 4 TLB misses.  What are the TLB misses?
      One each for the head of the ready and waiting queues
      One each for the previous queue element during the remove
</pre>
<li>Lazy scheduling (see the sketch below):
<pre>
    Ready queue must contain all ready threads except the current one
      Might contain other threads that aren't actually ready, though
    Each wakeup queue contains all threads waiting in that queue
      Again, might contain other threads, too
    The scheduler removes inappropriate queue entries when scanning
      the queue
</pre>

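<p>A sketch of the lazy ready-queue scan; the data structures are
assumptions for illustration (the real queues are doubly linked
through the TCBs).

<pre>
/* Lazy scheduling: blocked threads may linger on the ready queue;
 * the scheduler unlinks them only when it next scans the queue. */
#include <stddef.h>

enum state { READY, BLOCKED };

struct rtcb {
    struct rtcb *ready_next;
    enum state   state;
};

static struct rtcb *ready_head;
static struct rtcb *idle_tcb;

struct rtcb *pick_next(void)
{
    struct rtcb **pp = &ready_head;
    while (*pp != NULL) {
        struct rtcb *t = *pp;
        if (t->state == READY)
            return t;            /* genuinely ready: run it */
        *pp = t->ready_next;     /* stale entry: unlink lazily */
        t->ready_next = NULL;
    }
    return idle_tcb;             /* nothing ready */
}
</pre>
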
<li>Why does this help performance?  There are only three situations
in which a thread gives up the CPU but stays ready: a send syscall (as
opposed to call), preemption, and hardware interrupts.  So very often
the kernel can IPC into a thread without ever putting it on the ready
list.

<li>Direct process switch.  This section just says you should use
kernel threads instead of continuations.

<li>Short messages via registers.

<li>Avoiding unnecessary copies.  Basically, messages can be sent and
received with the same vector.  This makes forwarding efficient, which
is important for the Clans/Chiefs model.

<li>Segment register optimization.  Loading segment registers is slow
(the processor has to access the GDT, etc.), but the common case is
that users don't change their segment registers.  Observation: it is
faster to check a segment register than to load it.  So just check
that the segment registers are okay, and only load them if user code
changed them.

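<p>A sketch of the check-before-load idea; the selector value and the
inline assembly are illustrative 32-bit x86 assumptions, not L4's
actual code.

<pre>
/* Reading and comparing a segment register is cheap; reloading one
 * walks the GDT.  Only reload when user code changed it. */
#define USER_DS 0x23   /* hypothetical flat user data selector */

static inline void restore_user_ds(void)
{
    unsigned short ds;
    __asm__ volatile ("mov %%ds, %0" : "=r" (ds));
    if (ds != USER_DS)   /* common case: unchanged, no reload needed */
        __asm__ volatile ("mov %0, %%ds" : : "r" ((unsigned short)USER_DS));
}
</pre>
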
<li>Registers for parameter passing wherever possible: system calls
and IPC.

<li>Minimizing TLB misses.  Try to cram as many things as possible
onto the same page: IPC kernel code, GDT, IDT, TSS, all on the same
page.  Actually, the whole tables may not fit, but the important parts
of the tables can be put on the same page (maybe the beginning of the
TSS, IDT, or GDT only?).

<li>Coding tricks: short offsets, avoid jumps, avoid checks, pack
often-used data on the same cache lines, lazily save/restore CPU state
like debug and FPU registers.  Much of the kernel is written in
assembly!

<li>What are the results?  Figures 7 and 8 look good.

<li>Is fast IPC enough to get good overall system performance?  This
paper doesn't make a statement either way; we have to read their 1997
paper to find the answer to that question.

<li>Is the principle of optimizing for performance right?  In general,
it is wrong to optimize for performance; other things matter more.  Is
IPC the one exception?  Maybe, maybe not.  Was Liedtke fighting a
losing battle against CPU makers?  Should fast IPC be a hardware
issue, or just an OS issue?

</ul>

</body>
</html>