<html>
<head>
<title>Microkernel lecture</title>
</head>
<body>

<h1>Microkernels</h1>

<p>Required reading: Improving IPC by kernel design

<h2>Overview</h2>

<p>This lecture looks at the microkernel organization.  In a
microkernel, services that a monolithic kernel implements inside the
kernel run as user-level programs.  For example, the file
system, UNIX process management, pager, and network protocols each run
in a separate user-level address space.  The microkernel itself
supports only the services that are necessary to allow system services
to run well in user space; a typical microkernel has at least support
for creating address spaces, threads, and inter-process communication.

<p>The potential advantages of a microkernel are simplicity of the
kernel (it is small), isolation of operating system components (each runs in
its own user-level address space), and flexibility (we can have both a file
server and a database server).  One potential disadvantage is
performance loss, because what requires a single system call in a
monolithic kernel may require multiple system calls and context
switches in a microkernel.

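<p>To make the cost difference concrete, here is a minimal sketch (hypothetical
names: FS_READ, fs_server, and ipc_call() are made up for illustration, not any
particular microkernel's API) of what a file read involves in the two organizations:

<pre>
/* Monolithic kernel: one trap into the kernel, which runs the file
 * system code directly and returns. */
n = read(fd, buf, sizeof(buf));

/* Microkernel: the file system is a user-level server, so the C
 * library turns read() into an IPC round trip -- at least two kernel
 * entries plus two address-space switches (to the server and back). */
struct { int op, fd, len; } req = { FS_READ, fd, sizeof(buf) };
ipc_call(fs_server, &req, sizeof(req), buf, sizeof(buf));
</pre>
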
<p>One way in which microkernels differ from each other is the exact
kernel API they implement.  For example, Mach (a system developed at
CMU, which influenced a number of commercial operating systems) has
the following system calls: processes (create, terminate, suspend,
resume, priority, assign, info, threads), threads (fork, exit, join,
detach, yield, self), ports and messages (a port is a unidirectional
communication channel with a message queue and supporting primitives
to send, destroy, etc.), and regions/memory objects (allocate,
deallocate, map, copy, inherit, read, write).

<p>Some microkernels are more "microkernel" than others.  For example,
some microkernels implement the pager in user space but the basic
virtual memory abstractions in the kernel (e.g., Mach); others are
more extreme, and implement most of the virtual memory in user space
(L4).  Yet others are less extreme: many servers run in their own
address space, but in kernel mode (Chorus).

<p>All microkernels support multiple threads per address space. xv6
and Unix until recently didn't; why?  Because in Unix, system services
are typically implemented in the kernel, and those are the primary
programs that need multiple threads to handle events concurrently
(waiting for disk and processing new I/O requests).  In microkernels,
these services are implemented in user-level address spaces, and so
they need a mechanism for handling operations concurrently.
(Of course, one can argue that if fork is efficient enough, there is
no need for threads.)

<h2>L3/L4</h2>

<p>L3 is a predecessor to L4.  L3 provides data persistence, DOS
emulation, and an ELAN runtime system.  L4 is a reimplementation of L3,
but without the data persistence.  L4KA is a project at
sourceforge.net, and you can download the code for the latest
incarnation of L4 from there.

<p>L4 is a "second-generation" microkernel, with 7 calls: IPC (of
which there are several types), id_nearest (find a thread with an ID
close to the given ID), fpage_unmap (unmap pages; mapping is done as a
side-effect of IPC), thread_switch (hand the processor to a specified
thread), lthread_ex_regs (manipulate thread registers),
thread_schedule (set scheduling policies), and task_new (create a new
address space with some default number of threads).  These calls
provide address spaces, tasks, threads, interprocess communication,
and unique identifiers.  An address space is a set of mappings.
Multiple threads may share mappings, and a thread may grant mappings to
another thread (through IPC).  A task is the set of threads sharing an
address space.

<p>A thread is the execution abstraction; it belongs to an address
space and has a UID, a register set, a page-fault handler, and an
exception handler.  The UID of a thread is its task number plus the
number of the thread within that task.

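<p>A minimal sketch of such an encoding (the 8-bit thread field below is an
assumption for illustration, not L4's actual UID layout):

<pre>
/* UID = task number in the high bits, thread-within-task in the low bits. */
#define THREAD_BITS 8

unsigned make_uid(unsigned task, unsigned thread) {
    return (task << THREAD_BITS) | thread;
}
unsigned uid_task(unsigned uid)   { return uid >> THREAD_BITS; }
unsigned uid_thread(unsigned uid) { return uid & ((1u << THREAD_BITS) - 1); }
</pre>
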
<p>IPC passes data by value or by reference to another address space.
It also provides for sequence coordination.  It is used for
communication between clients and servers, to pass interrupts to a
user-level exception handler, and to pass page faults to an external
pager.  In L4, device drivers are implemented as user-level
processes with the device mapped into their address space.
Linux runs as a user-level process.

<p>L4 provides quite a range of message types: inline-by-value,
strings, and virtual memory mappings.  The send and receive descriptors
specify how many of each, if any.

<p>In addition, there is a system call for timeouts and controlling
thread scheduling.

<h2>L3/L4 paper discussion</h2>

<ul>

<li>This paper is about performance.  What is a microsecond?  Is 100
usec bad?  Is 5 usec so much better that we care?  How many instructions
does a 50-MHz x86 execute in 100 usec?  What can we compute with that
number of instructions?  How many disk operations in that time?  How
many interrupts can we take?  (The livelock paper, which we cover in a
few lectures, mentions 5,000 network pkts per second, and each packet
generates two interrupts.)

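A rough back-of-the-envelope calculation, assuming the 50-MHz CPU retires
very roughly one instruction per cycle:
<pre>
    50 MHz                   = 50 cycles per usec
    100 usec                 = 5,000 cycles  (roughly 5,000 instructions at ~1/cycle)
    5 usec                   = 250 cycles
    5,000 pkts/sec x 2 interrupts/pkt = 10,000 interrupts/sec,
                               i.e. one interrupt every 100 usec on average
</pre>
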
<li>In performance calculations, what is the appropriate/better metric?
Microseconds or cycles?

<li>Goal: improve IPC performance by a factor of 10 through careful kernel
design that is fully aware of the hardware it is running on.
Principle: performance rules!  Optimize for the common case.  Because
in L3 interrupts are propagated to user level using IPC, the system
may have to be able to support many IPCs per second (as many as the
device can generate interrupts).

<li>IPC consists of transferring control and transferring data.  The
minimal cost for transferring control is 127 cycles, plus 45 cycles for
TLB misses (see table 3).  What are the x86 instructions to enter and
leave the kernel?  (int, iret)  Why do they consume so much time?
(They flush the pipeline.)  Do modern processors perform these operations
more efficiently?  No, worse now: faster processors are optimized for
straight-line code; traps/exceptions flush a deeper pipeline, and cache
misses cost more cycles.

<li>What are the 5 TLB misses?  1) B's thread control block; loading %cr3
flushes the TLB, so 2) kernel text causes a miss; iret accesses both 3) the
stack and 4+5) user text -- two pages; B's user code looks at the message.

<li>Interface:
<ul>
<li>call (threadID, send-message, receive-message, timeout);
<li>reply_and_receive (reply-message, receive-message, timeout);
</ul>

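As a sketch of how a client and a server might use this interface (hypothetical
C bindings: l4_call(), l4_receive(), and l4_reply_and_receive() are made-up
names for illustration):
<pre>
/* Client: one call() covers send-request plus block-for-reply. */
msg_t req, rep;
build_request(&req);
l4_call(server_tid, &req, &rep, TIMEOUT_NEVER);

/* Server loop: after an initial receive, each iteration needs only one
 * reply_and_receive(), so a full RPC costs two system calls
 * (the client's call plus the server's reply_and_receive). */
msg_t in, out;
l4_receive(ANY_SENDER, &in, TIMEOUT_NEVER);
for (;;) {
    handle_request(&in, &out);
    l4_reply_and_receive(&out, &in, TIMEOUT_NEVER);
}
</pre>
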
<li>Optimizations:
<ul>

<li>New system call: reply_and_receive.  Effect: 2 system calls per
RPC.

<li>Complex messages: direct string, indirect strings, and memory
objects.

<li>Direct transfer by temporary mapping through a communication
window.  The communication window is mapped in B's address space and in
A's kernel address space; why is this better than just mapping a page
shared between A's and B's address spaces?  1) With multi-level security,
a shared page makes it hard to reason about information flow; 2) the
receiver can't check message legality (it might change after the check);
3) when a server has many clients, it could run out of virtual address
space, and the shared memory region has to be established ahead of time;
4) it is not application friendly, since the data may already be at
another address, i.e. applications would have to copy anyway--possibly
more copies.

<li>Why not use the following approach: map the region copy-on-write
(or read-only) in A's address space after the send, and read-only in B's
address space?  Now B may have to copy the data, or cannot receive data in
its final destination.

<li>On the x86 this is implemented by copying B's PDEs into A's address
space.  Why two PDEs?  (The maximum message size is 4 Meg, so the copy is
guaranteed to work if the message starts in the bottom 4 Mbyte of an
8-Mbyte mapped region.)  Why not just copy PTEs?  That would be much more
expensive.

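A minimal sketch of the PDE trick (assuming 32-bit x86 with a two-level page
table; pde_t, PDX(), and COMM_WINDOW are made-up names, not the paper's code):
<pre>
/* Map the part of B's address space around the message into A's
 * communication window by copying two consecutive page-directory
 * entries (each PDE covers 4 MB of virtual address space). */
void map_comm_window(pde_t *a_pgdir, pde_t *b_pgdir, unsigned msg_va)
{
    int src = PDX(msg_va);            /* PDE covering the message start */
    int dst = PDX(COMM_WINDOW);       /* window slot in A's kernel area  */

    a_pgdir[dst]     = b_pgdir[src];      /* first 4 MB                   */
    a_pgdir[dst + 1] = b_pgdir[src + 1];  /* next 4 MB: a message of up   */
                                          /* to 4 MB may spill into it,   */
                                          /* but never past it            */
}
</pre>
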
<li>What does it mean for the TLB to be "window clean"?  Why do we
care?  It means the TLB contains no mappings within the communication
window.  We care because mapping is cheap (copy a PDE), but invalidation
is not: the x86 only lets you invalidate one page at a time, or the whole
TLB.  Does TLB invalidation of the communication window turn out to be a
problem?  Not usually, because we have to load %cr3 during IPC anyway.

<li>The thread control block (TCB) holds registers, links to various
  doubly-linked lists, the pgdir, the uid, etc.  The lower part of a
  thread's UID contains the TCB number.  One can also deduce the TCB
  address from the stack pointer by ANDing SP with a bitmask (the SP comes
  out of the TSS when just switching into the kernel).

<li>The kernel stack is on the same page as the TCB.  Why?  1) It minimizes
TLB misses (since accessing the kernel stack will bring in the TCB); 2) it
allows very efficient access to the TCB -- just mask off the lower 12 bits
of %esp; 3) with VM, the lower 32 bits of the thread id can indicate which
TCB; using one page per TCB means there is no need to check whether a thread
is swapped out (simply don't map that TCB if it shouldn't be accessed).

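A sketch of the %esp trick (assuming 4-KByte pages and a hypothetical
struct tcb; not the paper's actual code):
<pre>
struct tcb;   /* thread control block; fields omitted */

/* The kernel stack shares a 4-KB page with the TCB, so masking off the
 * low 12 bits of the current kernel stack pointer yields the TCB. */
static inline struct tcb *current_tcb(unsigned long esp)
{
    return (struct tcb *)(esp & ~0xfffUL);
}
</pre>
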
<li>Invariant on queues: queues always hold in-memory TCBs.

<li>Wakeup queue: a set of 8 unordered wakeup lists (indexed by wakeup time
mod 8), and a smart representation of time so that 32-bit integers can be
used in the common case (base + offset in msec; bump the base and recompute
all offsets every ~4 hours; the maximum timeout is ~24 days, 2^31 msec).

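A sketch of the time representation (the field names, and the unit used for
the mod-8 index, are assumptions for illustration):
<pre>
static unsigned long long time_base_msec;   /* bumped every few hours */

#define NLISTS 8
struct wakeup {
    unsigned offset_msec;        /* wakeup time minus time_base_msec; fits 32 bits */
    struct wakeup *next;
};
static struct wakeup *wakeup_list[NLISTS];

/* Insert into one of the 8 unordered lists, chosen by wakeup time mod 8. */
void enqueue_wakeup(struct wakeup *w, unsigned long long wakeup_msec)
{
    w->offset_msec = (unsigned)(wakeup_msec - time_base_msec);
    int i = wakeup_msec % NLISTS;
    w->next = wakeup_list[i];
    wakeup_list[i] = w;
}
</pre>
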
<li>What is the problem addressed by lazy scheduling?
Conventional approach to scheduling:
<pre>
    A sends message to B:
      Move A from ready queue to waiting queue
      Move B from waiting queue to ready queue
    This requires 58 cycles, including 4 TLB misses.  What are the TLB misses?
      One each for the head of the ready and waiting queues
      One each for the previous queue element during the remove
</pre>

<li>Lazy scheduling:
<pre>
    Ready queue must contain all ready threads except current one
      Might contain other threads that aren't actually ready, though
    Each wakeup queue contains all threads waiting in that queue
      Again, might contain other threads, too
      Scheduler removes inappropriate queue entries when scanning queue
</pre>

<li>Why does this help performance?  There are only three situations in
which a thread gives up the CPU but stays ready: a send syscall (as opposed
to call), preemption, and hardware interrupts.  So very often the kernel can
IPC into a thread without ever putting it on the ready list.

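A sketch of the lazy variant (hypothetical names; the queue layout is an
assumption): on the IPC fast path nothing is moved between queues, and the
scheduler prunes stale entries only when it actually scans the ready queue:
<pre>
struct tcb {
    int ready;                   /* is the thread actually runnable? */
    struct tcb *next;            /* ready-queue link                 */
};

struct tcb *pick_next(struct tcb **readyq)
{
    struct tcb **pp = readyq;
    while (*pp) {
        if ((*pp)->ready)
            return *pp;          /* first genuinely ready thread */
        *pp = (*pp)->next;       /* lazily unlink a stale entry  */
    }
    return 0;                    /* nothing ready: idle */
}
</pre>
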
<li>Direct process switch.  This section just says you should use
kernel threads instead of continuations.

<li>Short messages via registers.

<li>Avoiding unnecessary copies.  Basically, a thread can send and receive
  messages with the same vector.  This makes forwarding efficient, which is
  important for the Clans/Chiefs model.

<li>Segment register optimization.  Loading segment registers is
  slow: the CPU has to access the GDT, etc.  But the common case is that
  users don't change their segment registers.  Observation: it is faster to
  check a segment register than to load it.  So just check that the segment
  registers are okay, and only load them if user code changed them.

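A sketch of the check-before-load idea (read_ds() and load_ds() stand in for
the one-instruction moves to/from %ds; USER_DS is a made-up selector value):
<pre>
#define USER_DS 0x23    /* example selector; the real value depends on the GDT layout */

static inline unsigned short read_ds(void) {
    unsigned short sel;
    __asm__ volatile("mov %%ds, %0" : "=r"(sel));
    return sel;
}
static inline void load_ds(unsigned short sel) {
    __asm__ volatile("mov %0, %%ds" : : "r"(sel));
}

void restore_user_segments(void)
{
    /* Reading a segment register is cheap; loading one forces a GDT
     * descriptor fetch.  Reload only if user code changed it. */
    if (read_ds() != USER_DS)
        load_ds(USER_DS);
}
</pre>
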
<li>Registers are used for parameter passing wherever possible: system calls
and IPC.

<li>Minimizing TLB misses.  Try to cram as many things as possible onto the
same page: IPC kernel code, GDT, IDT, TSS, all on the same page.  Actually,
maybe the whole tables can't fit, but put the important parts of the tables
on the same page (maybe the beginning of the TSS, IDT, or GDT only?).

<li>Coding tricks: short offsets, avoid jumps, avoid checks, pack
  often-used data on same cache lines, lazily save/restore CPU state
  like debug and FPU registers.  Much of the kernel is written in
  assembly!

<li>What are the results?  Figures 7 and 8 look good.

<li>Is fast IPC enough to get good overall system performance?  This
paper doesn't make a statement either way; we have to read their 1997
paper to find the answer to that question.

<li>Is the principle of optimizing for performance right?  In general,
it is wrong to optimize for performance; other things matter more.  Is
IPC the one exception?  Maybe, perhaps not.  Was Liedtke fighting a
losing battle against CPU makers?  Should fast IPC be a hardware
issue, or just an OS issue?

</ul>

</body>
</html>