**Main Learning Goal:** To understand the OS structure for parallel machines, including the basic algorithms for synchronization, communication and scheduling for parallel machines.
# Key Terms
- **Dance Hall Architecture:** a multiprocessor architecture in which the CPUs access the memory via an interconnection network
![[Pasted image 20240924183149.png]]
- **Symmetric Multiprocessor (SMP) Architecture:** a multiprocessor architecture in which the CPUs communicate with main memory via a shared system bus. It is called symmetric because each CPU can access memory with the same latency
![[Pasted image 20240924183226.png]]
- **Distributed Shared Memory (DSM) Architecture:** a multiprocessor architecture where a piece of memory is assigned to each CPU (and is therefore fast to access) and each CPU is able to access all of the other CPUs' memory via the interconnection network (which is slower to access)
![[Pasted image 20240924183339.png]]
- **Memory Consistency Model:** a model that gives guarantees to the programmer regarding the behavior of reads and writes in shared memory machines
- **Sequential Consistency Model:** ensures that each individual process obeys its own program order, while instructions from different processes can be interleaved in any order that respects each process's program order. For example, an arbitrary number of memory accesses (reads and writes) from program 2 can occur between memory access 1 and memory access 2 of program 1, *but* every process will observe memory access 1 from program 1 as happening before memory access 2 from program 1 (even though the amount of time between them is not guaranteed)
![[Pasted image 20240924184335.png]]
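A minimal C sketch of the classic litmus test for this guarantee (the thread functions and variable names are illustrative, and the pthread harness is only there to make it runnable):
```c
#include <pthread.h>
#include <stdio.h>

/* Litmus test for sequential consistency (sketch).
 * Under SC, every execution is an interleaving that preserves each
 * thread's program order, so at least one of the writes must precede
 * both reads: the outcome r1 == 0 && r2 == 0 can never be observed.
 * (Real hardware with a weaker memory model *can* produce 0/0 unless
 * the accesses are made sequentially consistent, e.g. C11 seq_cst atomics.) */
int x = 0, y = 0;
int r1, r2;

void *t1(void *arg) { x = 1; r1 = y; return NULL; }
void *t2(void *arg) { y = 1; r2 = x; return NULL; }

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, t1, NULL);
    pthread_create(&b, NULL, t2, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("r1 = %d, r2 = %d\n", r1, r2);  /* SC forbids "r1 = 0, r2 = 0" */
    return 0;
}
```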
- **Cache Coherence:** describes how the system implements the memory consistency model in the presence of private caches. The hardware (or, on non-cache-coherent machines, the software) must keep the caches coherent as the shared memory address space is updated, since a write to a memory location must be propagated in some way (e.g., by invalidating or updating stale copies) to the caches holding that location.
- **Lock:** a data structure that allows a thread (or threads) to access a piece of shared memory without interference from other threads
- **Mutual Exclusion Lock:** a lock that can be held by only one thread at a time. This is useful for when a thread needs to modify data, as only one thread should be updating a memory location at any given time
- **Shared Lock:** a lock that allows multiple threads to access a piece of shared data at once. This is useful if we have multiple threads that want to read a piece of shared memory.
- **Barrier Synchronization:** a synchronization primitive that forces each thread to wait until all threads have reached the barrier in the code. Once all threads have reached this point, the barrier is released and the threads can continue making progress
- **Spin Locks**
- **Naive Spin Lock:** an exclusive lock implementation that remains inside a while loop and continually runs an atomic `test_and_set` on the lock while it is locked, until it acquires the lock once some other thread unlocks it
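A minimal sketch of the naive spin lock, assuming C11 atomics; the `spinlock_t` type and function names are illustrative:
```c
#include <stdatomic.h>

/* Naive spin lock: every waiter hammers the lock word with atomic
 * test_and_set until it wins. atomic_flag_test_and_set atomically sets
 * the flag and returns its previous value (false means we acquired it). */
typedef struct { atomic_flag locked; } spinlock_t;  /* init with ATOMIC_FLAG_INIT */

void spin_lock(spinlock_t *l) {
    while (atomic_flag_test_and_set(&l->locked))
        ;  /* keep issuing test_and_set until it returns "was clear" */
}

void spin_unlock(spinlock_t *l) {
    atomic_flag_clear(&l->locked);
}
```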
- **Caching Spin Lock:** an exclusive lock implementation that spins on the cached value of the lock using an atomic read rather than a `test_and_set` operation. Once the lock is unlocked, the contending threads will execute an atomic `test_and_set` to try to acquire the lock. If the lock is not acquired, the thread will return to waiting by spinning on the locally-cached lock value
- **Spin Lock with Delays:** an exclusive lock implementation that forces threads to wait between `test_and_set` operations. In the exponential backoff implementation, we double the length of the delay after each unsuccessful lock acquisition attempt for a thread
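A sketch combining the spin-on-read (caching) idea with exponential backoff, assuming C11 atomics; `spin_lock_backoff`, `cpu_delay`, and the delay cap are illustrative choices:
```c
#include <stdatomic.h>
#include <stdbool.h>

/* Test-and-test-and-set with exponential backoff (sketch).
 * Waiters spin on an ordinary atomic read of the (cached) lock word and
 * only attempt the expensive test_and_set (exchange) when the lock looks
 * free; after each failed attempt the delay doubles. */
typedef struct { atomic_bool locked; } spinlock_t;   /* initialize to false */

static void cpu_delay(unsigned iters) {
    for (volatile unsigned i = 0; i < iters; i++)
        ;                                   /* crude busy-wait delay */
}

void spin_lock_backoff(spinlock_t *l) {
    unsigned delay = 1;
    for (;;) {
        while (atomic_load(&l->locked))
            ;                               /* spin on the cached copy (reads only) */
        if (!atomic_exchange(&l->locked, true))
            return;                         /* test_and_set succeeded: acquired */
        cpu_delay(delay);                   /* lost the race: back off */
        if (delay < (1u << 16))
            delay *= 2;                     /* exponential backoff */
    }
}

void spin_unlock(spinlock_t *l) {
    atomic_store(&l->locked, false);
}
```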
- **Ticket Lock:** a spin lock implementation that uses a ticketing system, where threads are assigned a number in the order in which they start trying to acquire the lock. The lock data structure includes a value showing the thread number that is currently allowed to acquire the lock, allowing the thread to know when it is its turn to acquire the lock. The thread will spin with delays while waiting for its turn, with the delays proportional to the difference between the thread's ticket number and the ticket number currently being served
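A ticket lock sketch using C11 `atomic_fetch_add` as the `fetch_and_increment`; the field names and the backoff multiplier are illustrative:
```c
#include <stdatomic.h>

/* Ticket lock (sketch). fetch_and_increment hands each arriving thread a
 * ticket; the thread spins, with a delay proportional to its distance from
 * the ticket currently being served, until now_serving reaches its ticket. */
typedef struct {
    atomic_uint next_ticket;   /* next ticket to hand out           */
    atomic_uint now_serving;   /* ticket currently holding the lock */
} ticket_lock_t;               /* both fields initialized to 0      */

void ticket_lock(ticket_lock_t *l) {
    unsigned my_ticket = atomic_fetch_add(&l->next_ticket, 1);
    for (;;) {
        unsigned serving = atomic_load(&l->now_serving);
        if (serving == my_ticket)
            return;                          /* our turn */
        /* proportional backoff: the further back in line, the longer we wait */
        for (volatile unsigned i = 0; i < (my_ticket - serving) * 100; i++)
            ;
    }
}

void ticket_unlock(ticket_lock_t *l) {
    atomic_fetch_add(&l->now_serving, 1);    /* pass the lock to the next ticket */
}
```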
- **Queue Locks**
- **Array-based Queuing Lock:** for each lock, creates an array of length $N$ (where $N$ is the number of processors in the system) which implements a circular queue. The array contains flags which can take on one of two values: `has-lock` or `must-wait`. Only 1 thread at a time will have the `has-lock` flag, while all others will be in the `must-wait` state. The lock data structure also contains a variable called `queue_last`, which indicates the position in the array that a thread can use to add itself to the queue (i.e. threads are inserted into the array at `queue_last`, which is then incremented after the thread is added to the array). We take `queue_last % N` in order to loop back to the beginning of the array once we have moved past the last slot. When a thread unlocks the mutex at position `i`, it finds position `(i+1) % N` in the queue and sets its flag to `has-lock`. This allows its successor thread to proceed.
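A sketch of the array-based queueing lock, assuming C11 atomics and an illustrative `NPROC` processors; a real implementation would also pad each flag to its own cache line so waiters spin on distinct lines:
```c
#include <stdatomic.h>

#define NPROC 64   /* number of processors (illustrative; power of two) */

/* Array-based queueing lock (sketch). Each waiter claims a unique slot
 * with fetch_and_increment and then spins only on its own flag. */
typedef enum { MUST_WAIT, HAS_LOCK } flag_t;

typedef struct {
    _Atomic flag_t flags[NPROC];   /* flags[0] = HAS_LOCK, all others = MUST_WAIT */
    atomic_uint    queue_last;     /* next free slot in the circular queue */
} array_lock_t;

void array_lock(array_lock_t *l, unsigned *my_slot) {
    unsigned slot = atomic_fetch_add(&l->queue_last, 1) % NPROC;
    while (atomic_load(&l->flags[slot]) == MUST_WAIT)
        ;                                            /* spin on our own slot */
    *my_slot = slot;                                 /* remember it for unlock */
}

void array_unlock(array_lock_t *l, unsigned my_slot) {
    atomic_store(&l->flags[my_slot], MUST_WAIT);               /* recycle our slot */
    atomic_store(&l->flags[(my_slot + 1) % NPROC], HAS_LOCK);  /* hand off to successor */
}
```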
- **Linked-List-based Queuing Lock:** implements the queue as a linked list with a dummy node that points to the last node in the list. Each node has a flag `got_it` that determines whether it has the lock and a pointer `next` that points to the node's successor. When a node finishes using the lock, it sets `got_it` to true on its successor and removes itself from the queue. If it has no successor, it sets the pointer in the dummy node to null.
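A sketch of the linked-list (MCS-style) queueing lock, assuming C11 atomics, with `atomic_exchange` standing in for `fetch_and_store` and `atomic_compare_exchange_strong` for `compare_and_swap`:
```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Linked-list queueing lock (sketch). Each thread brings its own queue node
 * and spins only on its own got_it flag; the lock itself is just a tail
 * pointer to the last waiter (the "dummy node"). */
typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool                got_it;
} mcs_node_t;

typedef struct { _Atomic(mcs_node_t *) tail; } mcs_lock_t;   /* init: NULL */

void mcs_lock(mcs_lock_t *l, mcs_node_t *me) {
    atomic_store(&me->next, NULL);
    atomic_store(&me->got_it, false);
    mcs_node_t *pred = atomic_exchange(&l->tail, me);   /* fetch_and_store */
    if (pred == NULL)
        return;                            /* queue was empty: we hold the lock */
    atomic_store(&pred->next, me);         /* link ourselves behind our predecessor */
    while (!atomic_load(&me->got_it))
        ;                                  /* spin on our own flag */
}

void mcs_unlock(mcs_lock_t *l, mcs_node_t *me) {
    mcs_node_t *succ = atomic_load(&me->next);
    if (succ == NULL) {
        mcs_node_t *expected = me;
        /* no visible successor: try to swing the tail back to NULL */
        if (atomic_compare_exchange_strong(&l->tail, &expected, NULL))
            return;                        /* truly nobody waiting */
        while ((succ = atomic_load(&me->next)) == NULL)
            ;                              /* a waiter is mid-enqueue: wait for the link */
    }
    atomic_store(&succ->got_it, true);     /* hand the lock to the successor */
}
```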
- **Types of Barriers:**
- **Counting Barrier:** uses a shared counter initialized to `N`, the number of threads that will reach the barrier. When a thread arrives, it atomically decrements the counter. If the counter is not yet 0, the thread spins, repeatedly reading the counter until it reaches 0, at which point the threads can proceed. The last thread to arrive (the one whose decrement brings the counter to 0) resets the counter back to `N` so the barrier can be reused.
- **Sense-reversing Barrier:** the barrier includes a count variable (the number of threads that have not yet arrived) and a sense variable (a boolean flag that determines whether the threads can proceed). Every thread decrements count when it arrives; all but the last then spin on sense. The last thread to reach the barrier resets count to `N` and flips the sense value (either from True to False or from False to True), at which point the spinning threads can proceed. Each thread keeps its own local copy of the sense so that consecutive uses of the barrier do not interfere.
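A sketch of the sense-reversing barrier, assuming C11 atomics and an illustrative `NTHREADS` threads; each thread passes in its own thread-local sense flag:
```c
#include <stdatomic.h>
#include <stdbool.h>

#define NTHREADS 8   /* number of threads (illustrative) */

/* Sense-reversing centralized barrier (sketch). Each thread keeps a local
 * sense that flips every barrier episode; the last arriving thread resets
 * count and flips the shared sense, releasing all spinning threads. */
atomic_int  count = NTHREADS;
atomic_bool sense = false;

void barrier(bool *local_sense) {
    *local_sense = !*local_sense;             /* the sense for this episode */
    if (atomic_fetch_sub(&count, 1) == 1) {   /* we are the last to arrive */
        atomic_store(&count, NTHREADS);       /* reset for the next episode */
        atomic_store(&sense, *local_sense);   /* flip shared sense: release */
    } else {
        while (atomic_load(&sense) != *local_sense)
            ;                                 /* spin until the sense flips */
    }
}
```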
- **Tree Barrier:** uses a hierarchical implementation of the sense-reversing barrier, where groups of `k` threads share their own sense-reversing barrier. The last thread to arrive at a leaf-level barrier moves "up" to the next level of the tree and waits there until the last thread at that level arrives (and flips `locksense`). This process continues until the sense flag at the root of the tree is flipped, at which point the threads waiting at the lower levels of the tree are "woken up" so they can proceed.
![[Screenshot from 2024-09-30 14-21-31.png]]
- **MCS Tree Barrier (4-ary arrival):** every node in the arrival tree has four children, which notify their parent as soon as they arrive at the barrier (fan-in of four). Once the root node (usually the thread with id 1) and all of its children have arrived, the release phase begins. In the release phase every parent notifies its two children (fan-out of two), propagating the departure signal down the tree, as shown in the figure below.
![[Pasted image 20240930154216.png]]
- **Tournament Barrier:** this barrier is organized like a tournament bracket. Threads are paired off, and in each round the (statically chosen) winner of each pair waits for its partner to arrive and then advances to the next round, while the loser waits to be released. This continues until a single champion thread remains. The champion thread then reverses the process, releasing the waiting threads round by round until all threads are released. This creates a tree-like structure where threads wait at different levels based on how far they progress in the tournament.
- **Dissemination Barrier:** in this barrier, threads communicate in a series of rounds. In round $k$, every thread $i$ signals the thread $2^k$ positions away (thread $(i + 2^k) \bmod N$) and waits for the corresponding signal addressed to it. There are $\lceil \log_2 N \rceil$ rounds; after the last round every thread has heard, directly or transitively, from every other thread, ensuring that all threads have reached the barrier. This method distributes the communication load evenly across all threads and uses no centralized shared state, potentially improving performance on certain architectures.
![[Screenshot from 2024-09-30 16-46-09.png]]
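A single-episode sketch of the dissemination barrier, assuming C11 atomics and an illustrative 8 threads; reusing the barrier would additionally require sense reversal or parity on the flags, omitted here for brevity:
```c
#include <stdatomic.h>
#include <stdbool.h>

#define NTHREADS 8   /* number of threads (illustrative) */
#define ROUNDS   3   /* ceil(log2(NTHREADS)) */

/* Dissemination barrier (sketch, single episode). In round k, thread i
 * signals thread (i + 2^k) mod NTHREADS and waits for the signal addressed
 * to it; after all rounds, every thread has heard (directly or transitively)
 * from every other thread, so all threads must have arrived. */
atomic_bool arrived[ROUNDS][NTHREADS];   /* all initialized to false */

void dissemination_barrier(int me) {
    for (int k = 0; k < ROUNDS; k++) {
        int partner = (me + (1 << k)) % NTHREADS;
        atomic_store(&arrived[k][partner], true);   /* signal our partner */
        while (!atomic_load(&arrived[k][me]))
            ;                                       /* wait for our own signal */
    }
}
```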
# Notes
- What does cache coherence overhead depend on?
- Number of processors
- Type of interconnection network
- Amount of sharing that is happening for a particular shared memory location
- What is the implication for scalability as a result of cache coherence overhead?
- Performance scales more slowly with the number of CPUs than expected because more CPUs leads to more cache coherence overhead.
- What is a rule of thumb to get good performance on shared memory machines?
- Try to minimize sharing memory across threads as much as possible (which in turn minimizes the number of cache coherence updates required)
- What are two common atomic instructions used to implement mutexes and semaphores? And how are these atomic instructions implemented at a high level?
- `test_and_set` and `fetch_and_increment`
![[Pasted image 20240924191342.png]]
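At a high level, both are atomic read-modify-write operations. A pseudo-C sketch of their semantics (the hardware guarantees each function body executes atomically):
```c
/* High-level semantics of the two atomic read-modify-write instructions
 * (pseudo-C sketch): no other memory operation can slip between the read
 * and the write of each body. */

int test_and_set(int *lock) {
    int old = *lock;      /* read the current value               */
    *lock = 1;            /* unconditionally mark it as "locked"  */
    return old;           /* 0 means the caller acquired the lock */
}

int fetch_and_increment(int *counter) {
    int old = *counter;   /* read the current value          */
    *counter = old + 1;   /* increment it                    */
    return old;           /* return the pre-increment value  */
}
```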
- What are the three important metrics for measuring scalability of lock & barrier implementations?
- **Latency:** the amount of time that a thread takes to acquire the lock assuming it is free
- **Wait Time:** how long does a thread need to wait to acquire the lock? (we have less control over this in the lock implementation because this is dependent on what the threads are doing once they acquire the lock)
- **Contention:** how long does it take in the presence of contending threads for one of them to "win" and acquire the lock?
- What are the limitations of the naive spin lock implementation?
- Too much contention - all threads waiting for the lock start trying to acquire it the moment it is freed
- Does not exploit caches - the `test_and_set` instruction cannot use cache values because it needs to make sure that it is retrieving the most up-to-date version of the value stored in the lock
- Disrupts useful work - threads that are contending for the lock use valuable CPU cycles just running `test_and_set` over and over on the locked lock
- What is the motivation for ticket lock?
- We want to make sure that our lock is fair in the sense that a thread which has been waiting longest should have the highest priority when acquiring the lock. Other spin lock implementations make no accommodations for this
- What is the motivation for queuing locks?
- In spin locks, when a thread releases the lock, *all* other threads waiting for the mutex contend for it. This causes lots of contention and memory bus traffic, as each thread repeatedly checks the lock value (or has its cache invalidated if it is spinning on the locally-cached value of the lock). Since we know only 1 thread can successfully acquire the lock, we can instead use a queue data structure and have the releasing thread signal only the next thread in the queue that it is now that thread's turn to use the lock.
- What are the atomic operations required for the linked-list-based queuing lock and how can we simulate them if they are not available on the hardware?
- They are `fetch_and_store` (atomic exchange) and `compare_and_swap`. We can simulate them using `test_and_set` if they are not available on the hardware
- What is an advantage of the linked-list-based queuing lock over the array-based queuing lock?
- The linked-list-based queuing lock does not have a fixed size, and so it can shrink when there are not many threads waiting on a particular resource. By contrast, the array-based queuing lock has size fixed at the number of processors in the system
- If the architecture doesn't have either of the atomic instructions needed for the array-based & linked-list-based queuing locks, what is the impact on performance?
- Both will perform slower since we need to simulate the functionality using `test_and_set`
- What is a good choice of lock for scalability if we only have access to basic atomic instructions (i.e. read, write, and `test_and_set`)?
- The spin lock with delays using the exponential backoff implementation
- What is a centralized barrier?
- A barrier in which all threads share a single, centralized counter (and possibly a shared sense flag) that every thread updates and spins on; the counting barrier and sense-reversing barrier are both centralized barriers.
- What is the main problem with using a centralized barrier and how can we mitigate it?
- If the last thread does not reset the counter to its initial value before the other threads start proceeding, they may reach the next barrier and race past it. The fix is to have each thread spin on `count != initial_value` *after* spinning on `count > 0`, so that no thread leaves the barrier until the last thread has reset the counter (see the sketch below).
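A sketch of the centralized counting barrier with the extra spin described above, assuming C11 atomics and an illustrative `NTHREADS` threads:
```c
#include <stdatomic.h>

#define NTHREADS 8   /* number of threads (illustrative) */

/* Centralized counting barrier with the extra spin (sketch). Without the
 * second spin loop, a thread could race ahead to the next use of the
 * barrier and decrement count before the last thread has reset it. */
atomic_int count = NTHREADS;

void barrier(void) {
    if (atomic_fetch_sub(&count, 1) == 1) {
        atomic_store(&count, NTHREADS);   /* last arriver resets the count */
    } else {
        while (atomic_load(&count) > 0)
            ;                             /* wait for everyone to arrive */
        while (atomic_load(&count) != NTHREADS)
            ;                             /* wait for the reset before leaving */
    }
}
```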
- In an RPC (client call - server execution - return results to the client), how many times does the kernel copy stuff from the user address spaces into the kernel and vice versa?
- Four times (client call -> kernel, kernel -> server, server results -> kernel, kernel -> client)
- How many copies happen in total during an RPC call?
- 8 copies in total: the client -> server call and the server -> client return each involve 4 copies (two user-space copies, shown in orange below, and two kernel copies, shown in red below)
![[Screenshot from 2024-09-30 17-09-21.png]]
- How can we make RPC cheaper?
- We make the setup step more expensive by spending time creating a shared memory region which the client and server can write to and read from. Once this is set up, the two can communicate without going through the kernel, removing half of the required copy operations (the two kernel copies in each direction).