<html>
<head>
<title>Scalable coordination</title>
</head>
<body>

<h1>Scalable coordination</h1>

<p>Required reading: Mellor-Crummey and Scott, Algorithms for Scalable
  Synchronization on Shared-Memory Multiprocessors, TOCS, Feb 1991.

<h2>Overview</h2>

<p>Shared-memory machines are a bunch of CPUs sharing physical memory.
Typically each processor also maintains a cache (for performance),
which introduces the problem of keeping caches coherent.  If processor 1
writes a memory location whose value processor 2 has cached, then
processor 2's cache must be updated in some way.  How?
<ul>

<li>Bus-based schemes.  Any CPU can access ("dance with") any memory
equally ("dance hall arch"). Use "snoopy" protocols: each CPU's cache
listens to the memory bus. With a write-through architecture, invalidate
your copy when you see a write. Or can have an "ownership" scheme with a
write-back cache (e.g., Pentium caches have MESI bits---modified,
exclusive, shared, invalid). If the E bit is set, the CPU caches the line
exclusively and can write it back. But the bus places limits on scalability.

<li>More scalability w. NUMA schemes (non-uniform memory access). Each
CPU comes with fast "close" memory. Slower to access memory that is
stored with another processor. Use a directory to keep track of who is
caching what.  For example, processor 0 is responsible for all memory
starting with address "000", processor 1 is responsible for all memory
starting with "001", etc.

<li>COMA - cache-only memory architecture.  Each CPU has local RAM,
treated as a cache. Cache lines migrate around to different nodes based
on access patterns. Data only lives in caches, with no permanent memory
location. (These machines aren't too popular any more.)

</ul>


<h2>Scalable locks</h2>

<p>This paper is about the cost and scalability of locking; what if you
have 10 CPUs waiting for the same lock?  For example, what would
happen if xv6 ran on an SMP with many processors?

<p>What's the cost of a simple spinning acquire/release?  Algorithm 1
*without* the delays, which is like xv6's implementation of acquire
and release (xv6 uses XCHG instead of test_and_set):
<pre>
  each of the 10 CPUs gets the lock in turn
  meanwhile, the remaining CPUs do XCHG on the lock
  the lock must be X (exclusive) in your cache to run XCHG
    otherwise all might read, then all might write
  so the bus is busy all the time with XCHGs!
  can we avoid constant XCHGs while the lock is held?
</pre>
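
<p>For concreteness, here is a minimal C sketch of that style of spin
lock, using GCC's __sync builtins in place of xv6's inline XCHG
assembly (the names are illustrative, not xv6's actual code):
<pre>
// Test-and-set spinlock in the style of xv6's acquire/release.
// __sync_lock_test_and_set atomically exchanges the value (an XCHG on
// x86) and returns the old one; __sync_lock_release stores 0.
struct spinlock { volatile unsigned int locked; };

void acquire(struct spinlock *l) {
  // every iteration does an atomic exchange; with many waiters this is
  // exactly what keeps the bus busy
  while (__sync_lock_test_and_set(&l->locked, 1) != 0)
    ;
}

void release(struct spinlock *l) {
  __sync_lock_release(&l->locked);   // write 0, with release semantics
}
</pre>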

<p>test-and-test-and-set
<pre>
  only run the expensive TSL if not locked
  spin on an ordinary load instruction, so the cache line stays S
  acquire(l)
    while(1){
      while(l->locked != 0) { }
      if(TSL(&l->locked) == 0)
        return;
    }
</pre>
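
<p>The same idea as C, again with __sync_lock_test_and_set standing in
for TSL (a sketch, not the paper's code):
<pre>
// test-and-test-and-set: spin with plain loads, and only do the atomic
// exchange once the lock looks free
void ttas_acquire(volatile unsigned int *lock) {
  while (1) {
    while (*lock != 0)
      ;                                        // cache line stays S while spinning
    if (__sync_lock_test_and_set(lock, 1) == 0)
      return;                                  // exchange returned 0: we hold the lock
  }
}
</pre>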

<p>suppose 10 CPUs are waiting; let's count the cost in total bus
  transactions
<pre>
  CPU1 gets the lock in one transaction
    sets the lock's cache line to I in the other CPUs
  9 CPUs each use the bus once in XCHG
    then everyone has the line S, so they spin locally
  CPU1 releases the lock
  CPU2 gets the lock in one transaction
  8 CPUs each use the bus once...
  So 10 + 9 + 8 + ... + 1 = 55 transactions, O(n^2) in # of CPUs!
  Look at "test-and-test-and-set" in Figure 6
</pre>
<p>  Can we have <i>n</i> CPUs acquire a lock in O(<i>n</i>) time?

<p>What is the point of the exponential backoff in Algorithm 1?
<pre>
  Does it buy us O(n) time for n acquires?
  Is there anything wrong with it?
    may not be fair
    exponential backoff may increase delay after release
</pre>
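
<p>A minimal sketch of test-and-set with exponential backoff (the
initial delay and the cap are arbitrary here, not the paper's tuned
constants):
<pre>
// test-and-set with capped exponential backoff (Algorithm 1 with delays)
void acquire_backoff(volatile unsigned int *lock) {
  unsigned int delay = 1;
  while (__sync_lock_test_and_set(lock, 1) != 0) {
    for (volatile unsigned int i = 0; i < delay; i++)
      ;                          // pause instead of hammering the bus
    if (delay < 1024)
      delay *= 2;                // back off further each time, up to a cap
  }
}
</pre>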

<p>What's the point of the ticket locks, Algorithm 2?
<pre>
  one interlocked instruction to get my ticket number
  then I spin on now_serving with an ordinary load
  release() just increments now_serving
</pre>
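
<p>A rough C sketch of a ticket lock (field names are illustrative;
__sync_fetch_and_add is the GCC builtin for an atomic
fetch-and-increment):
<pre>
struct ticketlock {
  volatile unsigned int next_ticket;    // next ticket to hand out
  volatile unsigned int now_serving;    // ticket currently holding the lock
};

void ticket_acquire(struct ticketlock *l) {
  // one interlocked instruction to take a ticket...
  unsigned int my_ticket = __sync_fetch_and_add(&l->next_ticket, 1);
  // ...then spin with ordinary loads until it comes up
  while (l->now_serving != my_ticket)
    ;
}

void ticket_release(struct ticketlock *l) {
  l->now_serving = l->now_serving + 1;  // only the holder writes this
}
</pre>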

<p>why is that good?
<pre>
  + fair
  + no exponential backoff overshoot
  + no spinning on an atomic instruction (waiters just load now_serving)
</pre>

<p>but what's the cost, in bus transactions?
<pre>
  while the lock is held, now_serving is S in all caches
  release makes it I in all caches
  then each waiter uses a bus transaction to get the new value
  so still O(n^2)
</pre>

<p>What's the point of the array-based queuing locks, Algorithm 3?
<pre>
    a lock has an array of "slots"
    waiter allocates a slot, spins on that slot
    release wakes up just the next slot
  so O(n) bus transactions to get through n waiters: good!
  the anderson lines in Figures 4 and 6 are flat-ish
    they only go up because the lock data structures are protected by a simpler lock
  but O(n) space *per lock*!
</pre>
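
<p>A rough C sketch of an array-based (Anderson) queueing lock; the
array size is an assumed bound on the number of contenders, and a real
implementation would pad each slot onto its own cache line:
<pre>
#define NSLOTS 16                          // assumed max number of waiters

struct alock {
  volatile unsigned int has_lock[NSLOTS];  // one flag per slot
  volatile unsigned int next_slot;
};

void alock_init(struct alock *l) {
  for (int i = 0; i < NSLOTS; i++)
    l->has_lock[i] = 0;
  l->has_lock[0] = 1;                      // slot 0 starts out holding the lock
  l->next_slot = 0;
}

unsigned int alock_acquire(struct alock *l) {
  unsigned int me = __sync_fetch_and_add(&l->next_slot, 1) % NSLOTS;
  while (!l->has_lock[me])
    ;                                      // spin on my own slot only
  return me;                               // caller passes this to release
}

void alock_release(struct alock *l, unsigned int me) {
  l->has_lock[me] = 0;
  l->has_lock[(me + 1) % NSLOTS] = 1;      // wake up just the next slot
}
</pre>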

<p>Algorithm 5 (MCS), the new algorithm of the paper, uses
compare_and_swap:
<pre>
int compare_and_swap(addr, v1, v2) {
  int ret = 0;
  // stop all memory activity and ignore interrupts
  if (*addr == v1) {
    *addr = v2;
    ret = 1;
  }
  // resume other memory activity and take interrupts
  return ret;
}
</pre>

<p>What's the point of the MCS lock, Algorithm 5?
<pre>
  constant space per lock, rather than O(n)
  one "qnode" per thread, used for whatever lock it's waiting for
  the lock holder's qnode is at the head of the list
  the lock variable points to the tail of the list
  acquire adds your qnode to the tail of the list
    then you spin on your own qnode
  release wakes up the next qnode
</pre>
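
<p>A rough C sketch of MCS acquire/release, using GCC's __atomic
builtins for the paper's fetch_and_store (atomic exchange) and
compare_and_swap; the memory-order choices here are a simplification:
<pre>
struct qnode {
  struct qnode *volatile next;
  volatile int locked;                 // 1 while this thread must wait
};

struct mcs_lock {
  struct qnode *volatile tail;         // NULL when the lock is free
};

void mcs_acquire(struct mcs_lock *l, struct qnode *me) {
  me->next = 0;
  // fetch_and_store: atomically make myself the tail of the queue
  struct qnode *prev = __atomic_exchange_n(&l->tail, me, __ATOMIC_ACQ_REL);
  if (prev) {                          // someone is ahead of me
    me->locked = 1;
    prev->next = me;                   // link in so my predecessor can find me
    while (me->locked)
      ;                                // spin only on my own qnode
  }
}

void mcs_release(struct mcs_lock *l, struct qnode *me) {
  if (me->next == 0) {
    // no visible successor: try to swing tail from me back to NULL
    struct qnode *expected = me;
    if (__atomic_compare_exchange_n(&l->tail, &expected, 0, 0,
                                    __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE))
      return;                          // really no successor; lock is free
    while (me->next == 0)
      ;                                // a successor is mid-link; wait for it
  }
  me->next->locked = 0;                // hand the lock to the next qnode
}
</pre>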

<h2>Wait-free or non-blocking data structures</h2>

<p>The previous implementations all block threads when there is
  contention for a lock.  Other atomic hardware operations allow one
  to build wait-free implementations of data structures.  For example,
  one can implement an insert of an element into a shared list that
  doesn't block a thread.  Such versions are called wait-free.

<p>A linked list with locks is as follows:
<pre>
Lock list_lock;

insert(int x) {
  element *n = new Element;
  n->x = x;

  acquire(&list_lock);
  n->next = list;
  list = n;
  release(&list_lock);
}
</pre>

<p>A wait-free implementation is as follows:
<pre>
insert (int x) {
  element *n = new Element;
  n->x = x;
  do {
     n->next = list;
  } while (compare_and_swap (&list, n->next, n) == 0);
}
</pre>
<p>How many bus transactions with 10 CPUs inserting one element in the
list? Could you do better?

<p><a href="http://www.cl.cam.ac.uk/netos/papers/2007-cpwl.pdf">This
 paper by Fraser and Harris</a> compares lock-based implementations
 with corresponding non-blocking implementations of a number of data
 structures.

<p>It is not possible to make every operation wait-free, and there are
  times when we will need an implementation of acquire and release.
  Research on non-blocking data structures is active; the last word
  on this topic hasn't been said yet.

</body>
</html>