| <title>Scheduling</title>
 | |
| <html>
 | |
| <head>
 | |
| </head>
 | |
| <body>
 | |
| 
 | |
| <h1>Scheduling</h1>
 | |
| 
 | |
| <p>Required reading: Eliminating receive livelock
 | |
| 
 | |
| <p>Notes based on Prof. Morris's lecture on scheduling (6.824, fall'02).
 | |
| 
 | |
| <h2>Overview</h2>
 | |
| 
 | |
| <ul>
 | |
| 
 | |
| <li>What is scheduling?  The OS policies and mechanisms that allocate
 | |
| resources to entities.  A good scheduling policy ensures that the most
 | |
| important entity gets the resources it needs.  This topic was
 | |
| popular in the days of time sharing, when there was a shortage of
 | |
| resources.  It seemed irrelevant in the era of PCs and workstations, when
 | |
| resources were plentiful. Now the topic is back from the dead to handle
 | |
| massive Internet servers with paying customers.  The Internet exposes
 | |
| web sites to international abuse and overload, which can lead to
 | |
| resource shortages.  Furthermore, some customers are more important
 | |
| than others (e.g., the ones that buy a lot).
 | |
| 
 | |
| <li>Key problems:
 | |
| <ul>
 | |
| <li>Gap between desired policy and available mechanism.  The desired
 | |
| policies often include elements that are not implementable with the
 | |
| mechanisms available to the operating system.  Furthermore, often
 | |
| there are many conflicting goals (low latency, high throughput, and
 | |
| fairness), and the scheduler must make a trade-off between the goals.
 | |
| 
 | |
| <li>Interaction between different schedulers.  One has to take a
 | |
| systems view.  Just optimizing the CPU scheduler may do little for
 | |
| the overall desired policy.
 | |
| </ul>
 | |
| 
 | |
| <li>Resources you might want to schedule: CPU time, physical memory,
 | |
| disk and network I/O, and I/O bus bandwidth.
 | |
| 
 | |
| <li>Entities that you might want to give resources to: users,
 | |
| processes, threads, web requests, or MIT accounts.
 | |
| 
 | |
| <li>Many policies for resource-to-entity allocation are possible:
 | |
| strict priority, divide equally, shortest job first, minimum guarantee
 | |
| combined with admission control.
 | |
| 
 | |
| <li>General plan for scheduling mechanisms
 | |
| <ol>
 | |
| <li> Understand where scheduling is occurring.
 | |
| <li> Expose scheduling decisions, allow control.
 | |
| <li> Account for resource consumption, to allow intelligent control.
 | |
| </ol>
 | |
| 
 | |
| <li>Simple example from 6.828 kernel. The policy for scheduling
 | |
| environments is to give each one equal CPU time. The mechanism used to
 | |
| implement this policy is a clock interrupt every 10 msec and then
 | |
| selecting the next environment in a round-robin fashion.  
 | |
| 
 | |
| <p>But this only works if processes are compute-bound.  What if a
 | |
| process gives up some of its 10 ms to wait for input?  Do we have to
 | |
| keep track of that and give it back? 
 | |
| 
 | |
| <p>How long should the quantum be?  Is 10 msec the right answer?
 | |
| A shorter quantum leads to better interactive performance, but
 | |
| lowers overall system throughput because we will reschedule more,
 | |
| which has overhead.
 | |
| 
 | |
| <p>What if the environment computes for 1 msec and sends an IPC to
 | |
| the file server environment?  Shouldn't the file server get more CPU
 | |
| time because it operates on behalf of all other functions?
 | |
| 
 | |
| <p>Potential improvements for the 6.828 kernel: track "recent" CPU use
 | |
| (e.g., over the last second) and always run environment with least
 | |
| recent CPU use.  (Still, if you sleep long enough you lose.) Other
 | |
| solution: directed yield; specify on the yield to which environment
 | |
| you are donating the remainder of the quantum (e.g., to the file
 | |
| server so that it can compute on the environment's behalf).
 | |
| 
 | |
| <li>Pitfall: Priority Inversion
 | |
| <pre>
 | |
|   Assume policy is strict priority.
 | |
|   Thread T1: low priority.
 | |
|   Thread T2: medium priority.
 | |
|   Thread T3: high priority.
 | |
|   T1: acquire(l)
 | |
|   context switch to T3
 | |
|   T3: acquire(l)... must wait for T1 to release(l)...
 | |
|   context switch to T2
 | |
|   T2 computes for a while
 | |
|   T3 is indefinitely delayed despite high priority.
 | |
|   Can solve if T3 lends its priority to holder of lock it is waiting for.
 | |
|     So T1 runs, not T2.
 | |
|   [this is really a multiple scheduler problem.]
 | |
|   [since locks schedule access to locked resource.]
 | |
| </pre>
 | |
| 
 | |
| <li>Pitfall: Efficiency.  Efficiency often conflicts with fairness (or
 | |
| any other policy).  Long time quantum for efficiency in CPU scheduling
 | |
| versus low delay.  Shortest seek versus FIFO disk scheduling.
 | |
| Contiguous read-ahead vs data needed now.  For example, scheduler
 | |
| swaps out my idle emacs to let gcc run faster with more phys mem.
 | |
| What happens when I type a key?  These don't fit well into a "who gets
 | |
| to go next" scheduler framework.  Inefficient scheduling may make
 | |
| <i>everybody</i> slower, including high priority users.
 | |
| 
 | |
| <li>Pitfall: Multiple Interacting Schedulers. Suppose you want your
 | |
| emacs to have priority over everything else.  Give it high CPU
 | |
| priority.  Does that mean nothing else will run if emacs wants to run?
 | |
| Disk scheduler might not know to favor emacs's disk I/Os.  Typical
 | |
| UNIX disk scheduler favors disk efficiency, not process prio.  Suppose
 | |
| emacs needs more memory.  Other processes have dirty pages; emacs must
 | |
| wait.  Does disk scheduler know these other processes' writes are high
 | |
| prio?
 | |
| 
 | |
| <li>Pitfall: Server Processes.  Suppose emacs uses X windows to
 | |
| display.  The X server must serve requests from many clients.  Does it
 | |
| know that emacs' requests should be given priority?  Does the OS know
 | |
| to raise X's priority when it is serving emacs?  Similarly for DNS,
 | |
| and NFS.  Does the network know to give emacs' NFS requests priority?
 | |
| 
 | |
| </ul>
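<p>The priority-inversion pitfall above can be sketched as a toy strict-priority scheduler. This is only an illustration under stated assumptions: the <code>Thread</code>, <code>Lock</code>, <code>effective_priority</code>, <code>pick_next</code> and <code>scenario</code> names are hypothetical and not taken from any kernel, and larger numbers are assumed to mean higher priority. With donation off the scheduler picks T2; with donation on, T3 lends its priority to the lock holder, so T1 runs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(eq=False)
class Lock:
    holder: Optional["Thread"] = None       # thread currently holding the lock


@dataclass(eq=False)
class Thread:
    name: str
    priority: int                           # assumption: larger = more important
    waiting_for: Optional[Lock] = None      # lock this thread is blocked on, if any


def effective_priority(t: Thread, threads: List[Thread], donate: bool) -> int:
    """Base priority, raised to that of any thread blocked on a lock t holds."""
    prio = t.priority
    if donate:
        for other in threads:
            if other.waiting_for is not None and other.waiting_for.holder is t:
                prio = max(prio, effective_priority(other, threads, donate))
    return prio


def pick_next(threads: List[Thread], donate: bool) -> Thread:
    """Strict priority: run the runnable thread with the highest effective priority."""
    runnable = [t for t in threads if t.waiting_for is None]
    return max(runnable, key=lambda t: effective_priority(t, threads, donate))


def scenario() -> List[Thread]:
    """T1 (low) holds lock l, T3 (high) waits for it, T2 (medium) is runnable."""
    l = Lock()
    t1, t2, t3 = Thread("T1", 1), Thread("T2", 2), Thread("T3", 3)
    l.holder = t1
    t3.waiting_for = l
    return [t1, t2, t3]


print(pick_next(scenario(), donate=False).name)   # T2: T3 is delayed by medium-priority work
print(pick_next(scenario(), donate=True).name)    # T1: runs with T3's donated priority
```

<p>The sketch ignores deadlock (a cycle of waiters would recurse forever) and only models the choice of the next thread, not the release of the lock.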
 | |
| 
 | |
| <p>In short, scheduling is a system problem. There are many
 | |
| schedulers; they interact.  The CPU scheduler is usually the easy
 | |
| part.  The hardest part is system structure.  For example, the
 | |
| <i>existence</i> of interrupts is bad for scheduling.  Conflicting
 | |
| goals may limit effectiveness.
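<p>To make the "CPU scheduler is the easy part" claim concrete, here is a minimal sketch of the "least recent CPU use" improvement suggested earlier for the 6.828 kernel. It is a simulation, not kernel code: the function name <code>run</code>, the decay factor of 0.9 per tick, and the two example environments are assumptions chosen for illustration.

```python
from typing import Callable, Dict


def run(envs: Dict[str, Callable[[int], bool]], ticks: int, decay: float = 0.9) -> Dict[str, int]:
    """Each tick, run the runnable environment with the least recent CPU use.

    envs maps a name to a function that says whether the environment
    wants the CPU at a given tick (hypothetical interface).
    """
    recent = {name: 0.0 for name in envs}   # decayed count of recent ticks used
    total = {name: 0 for name in envs}      # ticks actually received
    for tick in range(ticks):
        # Age everybody's usage so that old CPU time gradually stops counting.
        for name in recent:
            recent[name] *= decay
        runnable = [name for name, wants in envs.items() if wants(tick)]
        if not runnable:
            continue
        chosen = min(runnable, key=lambda name: recent[name])
        recent[chosen] += 1.0
        total[chosen] += 1
    return total


envs = {
    "hog": lambda tick: True,                   # compute-bound, always runnable
    "interactive": lambda tick: tick % 10 == 5, # wakes briefly once every 10 ticks
}
print(run(envs, ticks=100))
```

<p>In this run the interactive environment gets the CPU every time it wakes up, because its recent use is always lower than the hog's. Note that recent use never drops below zero, so an environment that sleeps for a very long time gains no extra credit, which matches the "if you sleep long enough you lose" remark.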
 | |
| 
 | |
| <h2>Case study: modern UNIX</h2>
 | |
| 
 | |
| <p>Goals: 
 | |
| <ul>
 | |
| <li>Simplicity (e.g. avoid complex locking regimes).  
 | |
| <li>Quick response to device interrupts.  
 | |
| <li> Favor interactive response.  
 | |
| </ul> 
 | |
| 
 | |
| <p>UNIX has a number of execution environments.  We care about
 | |
| scheduling transitions among them.  Some transitions aren't possible,
 | |
| some can't be controlled. The execution environments are:
 | |
| 
 | |
| <ul>
 | |
| <li>Process, user half
 | |
| <li>Process, kernel half
 | |
| <li>Soft interrupts: timer, network
 | |
| <li>Device interrupts
 | |
| </ul>
 | |
| 
 | |
| <p>The rules are:
 | |
| <ul>
 | |
| <li>User is pre-emptible.
 | |
| <li>Kernel half and software interrupts are not pre-emptible.
 | |
| <li>Device handlers may not make blocking calls (e.g., sleep)
 | |
| <li>Effective priorities: intr > soft intr > kernel half > user
 | |
| </ul>  
 | |
| 
 | |
| 
 | |
| 
 | |
| <p>Rules are implemented as follows:
 | |
| 
 | |
| <ul>
 | |
| 
 | |
| <li>UNIX: Process User Half.  Runs in process address space, on
 | |
| per-process stack.  Interruptible.  Pre-emptible: interrupt may cause
 | |
| context switch.  We don't trust user processes to yield CPU.
 | |
| Voluntarily enters kernel half via system calls and faults.
 | |
| 
 | |
| <li>UNIX: Process Kernel Half. Runs in kernel address space, on
 | |
| per-process kernel stack.  Executes system calls and faults for its
 | |
| process.  Interruptible (but can defer interrupts in critical
 | |
| sections).  Not pre-emptible.  Only yields voluntarily, when waiting
 | |
| for an event. E.g. disk I/O done.  This simplifies concurrency
 | |
| control; locks often not required.  No user process runs if any kernel
 | |
| half wants to run.  Many processes' kernel halves may be sleeping in the
 | |
| kernel.
 | |
| 
 | |
| <li>UNIX: Device Interrupts. Hardware asks CPU for an interrupt to ask
 | |
| for attention.  Disk read/write completed, or network packet received.
 | |
| Runs in kernel space, on special interrupt stack.  Interrupt routine
 | |
| cannot block; must return.  Interrupts are interruptible.  They nest
 | |
| on the one interrupt stack.  Interrupts are not pre-emptible, and
 | |
| cannot really yield.  The real-time clock is a device and interrupts
 | |
| every 10ms (or whatever).  Process scheduling decisions can be made
 | |
| when interrupt returns (e.g. wake up the process waiting for this
 | |
| event).  You want interrupt processing to be fast, since it has
 | |
| priority.  Don't do any more work than you have to.  You're blocking
 | |
| processes and other interrupts.  Typically, an interrupt does the
 | |
| minimal work necessary to keep the device happy, and then calls wakeup
 | |
| on a thread.
 | |
| 
 | |
| <li>UNIX: Soft Interrupts.  (Didn't exist in xv6) Used when device
 | |
| handling is expensive.  But no obvious process context in which to
 | |
| run.  Examples include IP forwarding, TCP input processing.  Runs in
 | |
| kernel space, on interrupt stack.  Interruptible.  Not pre-emptible,
 | |
| can't really yield.  Triggered by hardware interrupt.  Called when
 | |
| outermost hardware interrupt returns.  Periodic scheduling decisions
 | |
| are made in timer s/w interrupt.  Scheduled by hardware timer
 | |
| interrupt (i.e., if current process has run long enough, switch).
 | |
| </ul>
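<p>The effective priority order (intr &gt; soft intr &gt; kernel half &gt; user) can be explored with a small simulation. This is a sketch under simplifying assumptions, not a model of any real UNIX: the kernel half is left out, every packet needs exactly one unit of work at each of three levels, the CPU does one unit per tick, and queues are unbounded. The function name <code>simulate</code> and its parameters are hypothetical.

```python
def simulate(arrival_every: int, ticks: int = 1000) -> int:
    """Return how many packets reach completion at user level.

    A packet arrives every `arrival_every` ticks. Each tick the CPU runs
    the highest-priority level that has pending work.
    """
    pending = {"intr": 0, "soft": 0, "user": 0}
    # Work finished at one level becomes pending work at the next level down.
    pipeline = (("intr", "soft"), ("soft", "user"), ("user", None))
    delivered = 0
    for tick in range(ticks):
        if tick % arrival_every == 0:
            pending["intr"] += 1            # device raises an interrupt
        for level, next_level in pipeline:
            if pending[level] > 0:
                pending[level] -= 1
                if next_level is None:
                    delivered += 1          # user-level processing finished
                else:
                    pending[next_level] += 1
                break                       # only one unit of work per tick
    return delivered


for gap in (4, 3, 2, 1):
    print(f"packet every {gap} ticks -> {simulate(gap)} delivered")
```

<p>With a packet every 3 or 4 ticks every level keeps up. With a packet every 1 or 2 ticks the interrupt and soft-interrupt levels consume every tick, and user-level work never runs even though the CPU is fully busy.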
 | |
| 
 | |
| <p>Is this good software structure?  Let's talk about receive
 | |
| livelock.
 | |
| 
 | |
| <h2>Paper discussion</h2>
 | |
| 
 | |
| <ul>
 | |
| 
 | |
| <li>What application is the paper addressing?  IP forwarding.
 | |
| What functionality does a network interface offer to driver?
 | |
| <ul>
 | |
| <li> Read packets
 | |
| <li> Poke hardware to send packets
 | |
| <li> Interrupts when packet received/transmit complete
 | |
| <li> Buffer many input packets
 | |
| </ul>
 | |
| 
 | |
| <li>What devices in the 6.828 kernel are interrupt driven?  Which ones
 | |
| are polling?  Is this ideal?
 | |
| 
 | |
| <li>Explain Figure 6-1.  Why does it go up?  What determines how high
 | |
| the peak is?  Why does it go down?  What determines how fast it goes
 | |
| down? Answer:
 | |
| <pre>
 | |
| (fraction of packets discarded)(work invested in discarded packets)
 | |
|            -------------------------------------------
 | |
|               (total work CPU is capable of)
 | |
| </pre>
 | |
| 
 | |
| <li>Suppose I wanted to test an NFS server for livelock.
 | |
| <pre>
 | |
|   Run client with this loop:
 | |
|     while(1){
 | |
|       send NFS READ RPC;
 | |
|       wait for response;
 | |
|     }
 | |
| </pre>
 | |
| What would I see?  Is the NFS server probably subject to livelock?
 | |
| (No--offered load subject to feedback).
 | |
| 
 | |
| <li>What other problems are we trying to address?
 | |
| <ul> 
 | |
| <li>Increased latency for packet delivery and forwarding (e.g., start
 | |
| disk head moving  when first NFS read request comes)
 | |
| <li>Transmit starvation 
 | |
| <li>User-level CPU starvation
 | |
| </ul>
 | |
| 
 | |
| <li>Why not tell the O/S scheduler to give interrupts lower priority?
 | |
| Non-preemptible.
 | |
| Could you fix this by making interrupts faster? (Maybe, if coupled
 | |
| with some limit on input rate.)  
 | |
| 
 | |
| <li>Why not completely process each packet in the interrupt handler?
 | |
| (I.e. forward it?) Other parts of kernel don't expect to run at high
 | |
| interrupt-level (e.g., some packet processing code might invoke a function
 | |
| that sleeps). Still might want an output queue.
 | |
| 
 | |
| <li>What about using polling instead of interrupts?  Solves overload
 | |
| problem, but killer for latency.
 | |
| 
 | |
| <li>What's the paper's solution?
 | |
| <ul>
 | |
| <li>No IP input queue.
 | |
| <li>Input processing and device input polling in kernel thread.
 | |
| <li>Device receive interrupt just wakes up thread. And leaves
 | |
| interrupts *disabled* for that device.
 | |
| <li>Thread does all input processing, then re-enables interrupts.
 | |
| </ul>
 | |
| <p>Why does this work?  What happens when packets arrive too fast?
 | |
| What happens when packets arrive slowly?
 | |
| 
 | |
| <li>Explain Figure 6-3.
 | |
| <ul>
 | |
| <li>Why does "Polling (no quota)" work badly? (Input still starves
 | |
| xmit complete processing.)
 | |
| <li>Why does it immediately fall to zero, rather than gradually decreasing?
 | |
| (xmit complete processing must be very cheap compared to input.)
 | |
| </ul>
 | |
| 
 | |
| <li>Explain Figure 6-4.
 | |
| <ul>
 | |
| 
 | |
| <li>Why does "Polling, no feedback" behave badly? There's a queue in
 | |
| front of screend.  We can still give 100% to input thread, 0% to
 | |
| screend.
 | |
| 
 | |
| <li>Why does "Polling w/ feedback" behave well? Input thread yields
 | |
| when queue to screend fills.
 | |
| 
 | |
| <li>What if screend hangs, what about other consumers of packets?
 | |
| (e.g., can you ssh to machine to fix screend?)  Fortunately screend
 | |
| is typically the only application. Also, re-enable input after timeout.
 | |
| 
 | |
| </ul>
 | |
| 
 | |
| <li>Why are the two solutions different?
 | |
| <ol>
 | |
| <li> Polling thread <i>with quotas</i>.
 | |
| <li> Feedback from full queue.
 | |
| </ol>
 | |
| (I believe they should have used #2 for both.)
 | |
| 
 | |
| <li>If we apply the proposed fixes, does the phenomenon totally go
 | |
|  away? (e.g. for web server, waits for disk, &c.)  
 | |
| <ul>
 | |
| <li>Can the net device throw away packets without slowing down host?
 | |
| <li>Problem: We want to drop packets for applications with big queues.
 | |
| But it requires work to determine which application a packet belongs to.
 | |
| Solution: NI-LRP (have network interface sort packets)
 | |
| </ul>
 | |
| 
 | |
| <li>What about latency question?  (Look at figure 14 p. 243.)
 | |
| <ul>
 | |
| <li>1st packet looks like an improvement over non-polling. But 2nd
 | |
| packet transmitted later with polling.  Why? (No new packets added to
 | |
| xmit buffer until xmit interrupt) 
 | |
| <li>Why?  In traditional BSD, to
 | |
| amortize cost of poking device. Maybe better to poke a second time
 | |
| anyway.
 | |
| </ul>
 | |
| 
 | |
| <li>What if processing has more complex structure?
 | |
| <ul>
 | |
| <li>Chain of processing stages with queues? Does feedback work?
 | |
|     What happens when a late stage is slow?
 | |
| <li>Split at some point, multiple parallel paths? Not so great; one
 | |
|     slow path blocks all paths.
 | |
| </ul>
 | |
| 
 | |
| <li>Can we formulate any general principles from paper?
 | |
| <ul>
 | |
| <li>Don't spend time on new work before completing existing work.
 | |
| <li>Or give new work lower priority than partially-completed work.
 | |
| </ul>
 | |
| 
 | |
| </ul>
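<p>The shape of the throughput curve, and why moving input processing into a polling thread helps, can be reproduced with a toy model. Everything numeric here is made up for illustration (CPU budget per tick, per-packet costs, queue capacity), the function name <code>forwarded_per_tick</code> is hypothetical, and transmit processing and quotas are left out. The "interrupt" design spends receive work on every arriving packet before anything else; the "polled" design leaves packets in the device until the thread can process them to completion, so overflow is dropped before any CPU work is invested.

```python
def forwarded_per_tick(design: str, offered: int, ticks: int = 1000) -> float:
    """Average packets forwarded per tick at a given offered load (packets/tick)."""
    BUDGET = 10      # CPU work units per tick (assumed)
    RX_COST = 1      # units to take one packet from the device (assumed)
    FWD_COST = 1     # units to finish forwarding one packet (assumed)
    QUEUE_CAP = 32   # capacity of the IP input queue or device ring (assumed)

    queued = 0
    forwarded = 0
    for _ in range(ticks):
        if design == "interrupt":
            # Receive interrupts run first and may eat the whole budget.
            received = min(offered, BUDGET // RX_COST)
            left = BUDGET - received * RX_COST
            # Packets that overflow the queue are dropped after RX work was spent.
            queued = min(QUEUE_CAP, queued + received)
            done = min(queued, left // FWD_COST)
        elif design == "polled":
            # The device drops overflow itself, at no CPU cost to the host.
            queued = min(QUEUE_CAP, queued + offered)
            # The thread does all the work for a packet before taking the next.
            done = min(queued, BUDGET // (RX_COST + FWD_COST))
        else:
            raise ValueError(f"unknown design: {design}")
        queued -= done
        forwarded += done
    return forwarded / ticks


for load in (3, 5, 8, 10):
    print(load, forwarded_per_tick("interrupt", load), forwarded_per_tick("polled", load))
```

<p>In this model the interrupt-driven design peaks at 5 packets per tick and then falls to zero as the load rises to 10, while the polled design stays flat at its peak. This is an instance of the principle above: no time is spent on new work that cannot be completed.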
 | 
