# Lamport's Logical Clock

For all questions in this section, explain your reasoning in one or two sentences. Correct justification is required to receive credit.

Consider the diagram below. The vertical direction represents time, later times being higher than earlier ones. The dots denote events, the vertical lines denote processes, and the wavy lines denote messages. (Lamport, 1978)

1. [3 points] Consider a partial ordering of the events shown.

a. Did p1 happen before q3? Explain your reasoning.
b. Did p1 happen before r2? Explain your reasoning.
c. Did p1 happen before r3? Explain your reasoning.

![[Pasted image 20241110234807.png]]

A. Yes: p1 -> q2 and q2 -> q3, thus p1 -> q3 by transitivity.
B. No: there is no causal path between p1 and r2 in either direction, so the two events are concurrent (p1 || r2).
C. Yes: p1 -> q2, q2 -> q4, and q4 -> r3, thus p1 -> r3.

2. [3 points] Consider logical clocks Cp, Cq, and Cr associated with processes p, q, and r. All clock values are integers. All clocks begin counting at 0:

Cp(p1) = Cq(q1) = Cr(r1) = 0

When a clock is incremented, it is incremented by the smallest positive integer that satisfies the logical clock conditions from (Lamport, 1978).

a. What is the value of Cr at event r2? Explain your reasoning.
b. What is the value of Cr at event r3? Explain your reasoning.
c. What is the value of Cr at event r4? Explain your reasoning.

![[Pasted image 20241110234840.png]]

A.
- Cr(r1) = 0, and r2 is the next local event on r, so Cr(r2) = 1

B.
- Cr(r2) = 1
- But: Cp(p1) = 0, Cq(q2) = 1, Cq(q3) = 2, Cq(q4) = 3
- Since r3 receives the message sent at q4, Cr(r3) must be greater than 3, so Cr(r3) = 4

C.
- As stated in part b, Cr(r3) = 4
- Cr(r4) must be greater, so Cr(r4) = 5

![[Pasted image 20241110234919.png]]

3. [2 points] Is there a unique total ordering of the events that occur in a distributed system? Explain your reasoning.

No, there is not a unique total ordering. We have a partial order (defined by events that are causally connected). We can then create a total order by selecting a well-known arbitrary condition to break ties among concurrent events (for example, process IDs). Many such conditions are valid, and thus there are many possible total orderings.
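The clock values in question 2 follow mechanically from Lamport's two implementation rules: tick on every local event (IR1), and on receipt advance past the message's timestamp (IR2). Below is a minimal Python sketch of those rules replaying the scenario; the `LamportClock` class and the message pattern (p1 -> q2, q4 -> r3) are taken from the answers above, not from the paper itself.

```python
class LamportClock:
    """Lamport logical clock: IR1 (tick on each event) and IR2
    (on receive, jump past both our own clock and the timestamp)."""

    def __init__(self):
        self.time = -1  # so the first event gets value 0, as in the problem

    def local_event(self):
        self.time += 1  # IR1: smallest positive increment
        return self.time

    def send(self):
        return self.local_event()  # the message carries this timestamp

    def receive(self, sent_time):
        # IR2: the new value must exceed the previous value and the timestamp
        self.time = max(self.time, sent_time) + 1
        return self.time

# Replaying question 2:
cp, cq, cr = LamportClock(), LamportClock(), LamportClock()
tp = cp.send()                 # p1 sends to q, timestamp 0
assert cq.local_event() == 0   # q1
assert cq.receive(tp) == 1     # q2 = max(0, 0) + 1
assert cq.local_event() == 2   # q3
tq = cq.send()                 # q4 sends to r, timestamp 3
assert cr.local_event() == 0   # r1
assert cr.local_event() == 1   # r2
assert cr.receive(tq) == 4     # r3 = max(1, 3) + 1
assert cr.local_event() == 5   # r4
```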
# Lamport's Mutual Exclusion Lock

1. Consider a distributed system with 3 nodes (P1, P2, P3) implementing Lamport's mutual exclusion algorithm, as shown in the figure. P1 and P3 request locks at timestamps 1 and 2 respectively. The figure illustrates message exchanges (request, acknowledge, and release) along timelines for each node. The message complexity for this algorithm is 3(N-1) per lock request, where N is the number of nodes (3 in this example).

![[Pasted image 20241110235106.png]]

(a) [3 points] What refinement to the algorithm would reduce the number of messages without affecting the correctness of the algorithm? Clearly indicate which messages can be eliminated by your refinement and why.

(b) [3 points] In Lamport's M.E. algorithm, the LOCK release message is sent to ALL the peers. What refinement to the algorithm could reduce the message complexity even further? Clearly indicate which messages can be eliminated by your refinement and why.

(a) **m6** & **m7** (ACK messages) can be eliminated. An explicit ACK is redundant whenever the acknowledging node has already sent, or is about to send, some other message to the requester bearing a later timestamp (its own lock request, or its release): with FIFO channels and monotonically increasing timestamps, that message implicitly acknowledges the earlier request.

(b) **m9, m11, and m12** can be eliminated. **m9** is unnecessary because node P2 hasn't requested the lock, so no coherence action is needed at P2. Similarly, **m11 and m12** can be eliminated since P3 is the last node that requested the lock.

2. [2 points] Your co-worker claims that for Lamport's Mutual Exclusion algorithm to work correctly, when a process P1 sends messages to multiple other processes, these messages must arrive at all destinations in the same order they were sent. Is this claim correct? Justify your answer.

**No,** the claim is incorrect. The ordering requirement is weaker than that: it applies only between messages that have the same sender and the same recipient. If P1 sends multiple messages to P2, they must arrive at P2 in the order they were sent. **The reason for the requirement is** so a process can be sure, when it has received ACKs from all peers, that it has already received all prior LOCK requests. The simple pairwise requirement is enough for that; there is no need for a stronger requirement involving multiple recipients.

# RPC Latency Limits

1. Consider the following overhead around marshaling/unmarshaling data in an RPC call:

![[Pasted image 20241110235454.png]]

(a) [2 points] There are three copies required on the client side for marshaling as shown above. How will you convince your friend that this can be reduced to two without compromising kernel safety?

(b) [2 points] How would you reduce the number of copies on the server side for unmarshaling?

a) Use a shared descriptor between the RPC client stub and the kernel. Instead of copying the arguments into a contiguous RPC message, the stub fills the descriptor with the address and length of each argument, and the kernel uses the descriptor to gather the arguments directly into its buffer. This eliminates the intermediate marshaling copy without compromising kernel safety.

b) I would either use a shared descriptor between the kernel and the RPC server stub, or download the server stub into the kernel so that it can unmarshal the arguments directly out of the kernel buffer.

2. [4 points] You and your friend are brainstorming ways to minimize the RPC latency costs associated with control transfers while maximizing the CPU utilization on the client and server. You have an oracle-like system where you can have the following information available:

(a) For every RPC call, before you make the call, you can predict exactly the round-trip time (in CPU cycles) associated with every RPC request, and you know exactly the cost of control transfer (in CPU cycles). How would you optimize the control transfer decisions for minimizing RPC latency while also trying to maximize the CPU utilization?

(b) You have a system where you know the response time for 90% of RPC messages is within 2T (the cost of a context switch is 3T), while the response time for 10% of RPC messages is 20T. How would you optimize the control transfer decisions for minimum RPC latency while also trying to maximize the CPU utilization using the details above?

a) On the client side, there would be two control transfers after a client sends a request: the first switches to another process to ensure no wasted CPU cycles, and the second switches back to the requesting client once its RPC response comes in. If the time for these two control transfers is GREATER THAN OR EQUAL TO the RTT of the RPC call, then the original client process should spin: no meaningful work would get done before the response returns, so performing the control transfers would be wasteful.

b) There would be two context switches on the client side (from the requesting process to another process, and then back to the requesting process), costing 6T, which is longer than the 2T response time of the majority of the RPC calls. So the client should spin for 2T and, if the response has not come in by then, incur the cost of a context switch. The other work can then run for ~12T before the switch back to the original process, which lines up with the stragglers (2T spin + 3T switch + 12T work + 3T switch = 20T).
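The break-even arithmetic in (a) and (b) is simple enough to write down. Here is a minimal Python sketch of the two decisions; the function names and the returned tuple shape are illustrative assumptions, not part of the course material.

```python
def should_spin(predicted_rtt: float, switch_cost: float) -> bool:
    """Part (a): spin if switching out and back costs at least
    as much as simply waiting out the predicted round trip."""
    return 2 * switch_cost >= predicted_rtt

def client_wait_policy(T: float):
    """Part (b): spin for 2T (covers 90% of replies); otherwise pay a
    3T switch, let other work run, and switch back for the 20T tail."""
    spin_budget = 2 * T
    switch_cost = 3 * T
    straggler_rtt = 20 * T
    work_slice = straggler_rtt - spin_budget - 2 * switch_cost  # = 12T
    return spin_budget, switch_cost, work_slice

print(should_spin(predicted_rtt=2, switch_cost=3))  # True -> spin (6 >= 2)
print(client_wait_policy(T=1.0))                    # (2.0, 3.0, 12.0)
```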
3. [2 points] Your RPC system will be running on a LAN which is quite reliable but does suffer from occasional packet loss. What would be your design decision regarding buffering in the kernel and why?

Overlap server-side buffering with the transmission of the result. This optimization eliminates the need for the server to re-execute the procedure call: the reply is buffered as it is transmitted back to the client, so if any packets are lost during transmission, the server still has the data in the buffer and can simply retransmit.

# Active Networks

1. [3 points] Give two reasons as to why it is a reasonable design decision for a node to drop a capsule if the code corresponding to the "type" field is not available in the "prev" node.

First, the prev node not having the code indicates that the associated network flow is quite old, and the capsule itself may be the result of misrouting in a network flow that has already terminated. Second, even if that is not the case, the higher-level layers (such as transport) in the protocol stack will take the necessary action of retransmitting the "lost" packet.

2. [4 points] Suppose a capsule is received by a node. The "type" field does not match any capsule previously processed by this node. List the steps taken by the node in processing the capsule. (A code sketch of these steps follows the list.)

- The current node uses the **type field** to check whether it has the corresponding capsule code in its own soft store; by assumption it does not.
- The current node uses the **prev field** of the capsule to send a **request** message to the previous node asking if it has the code. (In the special case where there is no previous node, the code comes from the end-user software, per this [Active Networks code origin post](https://edstem.org/us/courses/60905/discussion/5672575).)
- Since the **previous node** just processed this capsule, it is highly likely to have the corresponding capsule code in its **soft store**, so the current node receives the capsule code from the previous node.
- To verify that the code is genuine, the current node computes a **cryptographically strong fingerprint** of it and checks that the fingerprint matches the type field of the capsule; if so, it knows the code is genuine and stores it in its own local soft store.
- The current node can now **execute the capsule code** to properly route this capsule and any future capsules that reference the same code.
- If the capsule code cannot be retrieved, the packet is **dropped**, for the reasons noted in the previous question ([Active Networks Question 1](https://edstem.org/us/courses/60905/discussion/5676198)).
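A minimal Python sketch of this processing path. The `Node` class, the dictionary soft store, and the choice of SHA-256 are illustrative assumptions (the answer above only says the type field is a cryptographically strong fingerprint of the code); the network hop to the prev node is elided to a direct call, and the prev field is modeled as the `prev` argument.

```python
import hashlib

def fingerprint(code: bytes) -> str:
    # The type field is a cryptographically strong fingerprint of the
    # capsule code; SHA-256 stands in for whatever hash the system uses.
    return hashlib.sha256(code).hexdigest()

class Node:
    def __init__(self):
        self.soft_store = {}  # type field -> capsule code

    def request_code(self, prev, type_field):
        """Ask the prev node for the code (network round trip elided)."""
        return prev.soft_store.get(type_field) if prev else None

    def process_capsule(self, capsule, prev):
        code = self.soft_store.get(capsule["type"])
        if code is None:
            code = self.request_code(prev, capsule["type"])
            # Drop if the code is unavailable or fails verification.
            if code is None or fingerprint(code) != capsule["type"]:
                return "dropped"
            self.soft_store[capsule["type"]] = code  # cache for future capsules
        exec(compile(code, "<capsule>", "exec"))  # run the routing code
        return "forwarded"

code = b"pass  # capsule routing logic would go here"
capsule = {"type": fingerprint(code)}
a, b = Node(), Node()
a.soft_store[capsule["type"]] = code          # prev node has the code
print(b.process_capsule(capsule, prev=a))     # "forwarded" (fetched from prev)
print(b.process_capsule(capsule, prev=None))  # "forwarded" (now cached locally)
```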
# Distributed Objects and Middleware

1. Your team is discussing the possibility of transitioning your servers from a particular monolithic Unix distribution to Spring OS. Answer the following questions that result from these discussions.

a. [1 point] You have users that run scripts containing the standard Unix commands such as ls, cp and mv. What changes will have to be made to these scripts once the transition to Spring OS is complete, and why?

b. [4 points] The Spring kernel introduces the following abstractions: "domain", which is a protection domain, and "door", which is an entry point into a domain. Having read the Spring paper, your teammate is concerned about the performance impact of your application software, modeled as a set of communicating Unix processes, on top of Spring running on a single CPU. How would you convince your team that Spring OS stays performant while bringing the goodness of a microkernel-based OS?

c. [3 points] Your team takes the next leap and distributes your application software to run on the Spring kernel on a LAN, using RPC for communication among the processes of your application. Each node in the LAN is a shared-memory multiprocessor. Your teammates are worried that there is now additional effort needed, since you explicitly have to deal with the physical location of the processes (i.e., on the same multiprocessor or on different nodes of the LAN) to make the application performant. Are their worries well-founded? Explain your answer.

a) No changes need to be made. Spring is designed to innovate in the underlying system while keeping the Unix interface intact.

b) Spring's door abstraction provides entry points into target domains. On a call to a target domain, the nucleus rapidly deactivates the client thread and allocates a thread in the target domain to execute the entry-point procedure, deactivating that thread on return and reactivating the client thread. This results in performant cross-domain calls, akin to Lightweight RPC.

c) No, their worries are not well-founded. Spring provides subcontracts to hide and handle exchanges with other processes, regardless of their physical locations; subcontracts may even be dynamically loaded at runtime for the desired behavior. Object invocations over the network are transparently handled by Spring's network proxy service.

# Java RMI

1. [4 points] Discuss two concepts from any of the learnings in this class that share similarities with Java RMI and describe the similarities.

RPC:
- Remote method invocation using client-server network communication
- The implementation and complexity of the underlying remote call are hidden from the developer
- Allows parameters and return values to be passed by value between processes

Spring Distributed Objects:
- The networking layer is handled behind the scenes, so that remote method invocation is transparent to the developer
- Proxy objects stand in for remote objects, handle marshaling/unmarshaling, and can be used as if they were local
- Remote objects are registered so that clients can look them up by name or identifier
- Garbage collection releases resources once an object is no longer in use

# Distributed Subsystems

1. [9 points] Consider the following sequence of actions, in the given time order, happening in a TreadMarks DSM system. Assume a clean copy of pages M and N is with their owner node A at the start of the program, and the system starts execution at time T1. T2, T3 and T4 represent increasing time order on distinct nodes of the distributed system. Assume that the pristine copies of M and N remain unchanged at node A for the duration of the execution shown below. Assume that none of the nodes have copies of M and N initially.

T1: Process P1: acq(L1) modify M rel(L1)
T2: Process P2: acq(L2) modify M rel(L2)
T3: Process P3: acq(L1) read M rel(L1)
T3: Process P4: acq(L2) modify M, modify N rel(L2)
T4: Process P5: acq(L2) read M rel(L2)

Describe the actions at each node in executing the critical sections (lock acquisition, page accesses, and lock release). (A sketch of the twin/diff mechanics follows the answers.)

**Process 1:**
- **Lock**
- **Modify M**: Fetch original M from node A, make a twin, remove the original's write protection, and modify
- **Release**: Create a diff, associate it with L1, discard the twin, and write-protect the modified M

**Process 2:**
- **Lock**
- **Modify M**: Fetch original M from node A, make a twin, remove the original's write protection, and modify
- **Release**: Create a diff, associate it with L2, discard the twin, and write-protect the modified M

**Process 3:**
- **Lock**: Invalidate local M, if it exists (it doesn't)
- **Read M**: Fetch original M from node A and the diff from P1. Remove the original's write protection, apply the diff, and read it
- **Release**: Restore write protection

**Process 4:**
- **Lock**: Invalidate local M, if it exists (it doesn't)
- **Modify M and N**: Fetch original M and N from node A, remove the originals' write protection, then retrieve and apply P2's diff to M. Make twins, then modify M and N
- **Release**: Create diffs, associate them with L2, discard the twins, and write-protect the modified M and N

**Process 5:**
- **Lock**: Invalidate local M and N, if they exist (they don't)
- **Read M**: Fetch original M from node A and the diffs from P2 and P4. Remove the original's write protection, apply the diffs in order, and read it
- **Release**: Restore write protection
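The twin/diff mechanics that recur in every answer above can be sketched in a few lines. This is a toy model under stated assumptions: a "page" is a Python list of words, and the VM write-protection traps that real TreadMarks uses to create twins lazily are replaced by explicit calls.

```python
def make_twin(page):
    """Pristine copy taken at the first write to a protected page."""
    return list(page)

def make_diff(twin, page):
    """Record only the (index, new_value) pairs that changed vs. the twin."""
    return [(i, new) for i, (old, new) in enumerate(zip(twin, page)) if old != new]

def apply_diff(page, diff):
    for i, value in diff:
        page[i] = value

M_at_A = [0, 0, 0, 0]          # pristine copy at owner node A

# P1 under L1: fetch M, twin it at the first write, modify, diff at release.
M = list(M_at_A)
twin = make_twin(M)            # created when the first write traps
M[2] = 7                       # the modification inside the critical section
diff_L1 = make_diff(twin, M)   # [(2, 7)]: associated with L1; twin discarded

# P3 after acq(L1): fetch pristine M from A, apply P1's diff, then read.
M_at_P3 = list(M_at_A)
apply_diff(M_at_P3, diff_L1)
assert M_at_P3 == [0, 0, 7, 0]
```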
# GMS

1. [8 points] Node P faults on page X, which is present in node Q's local cache. Node R houses the globally oldest page Y. Explain how GMS would strive to handle this page fault and how the sizes of the local and global parts of memory change. Note that this question is NEITHER about the actual implementation of GMS NOR about the geriatric algorithm. (A toy sketch of this flow appears at the end of this section.)

- Node P copies page X from node Q into the local part of its memory.
- To make room, GMS on node P has to evict the oldest page Z on node P:
    - Case 1: Z is in the local part; write it to disk if dirty, and send it to node R
    - Case 2: Z is in the global part; send it to node R
- Action at node R:
    - Place page Z in its global part
    - To make room for page Z, node R has to evict page Y:
        - Case 1: page Y is in the local part; write it to disk if dirty, or simply drop it if it is clean
        - Case 2: page Y is in the global part; it is guaranteed clean, so simply drop it
- Local/global memory changes:
    - Node P:
        - Case 1: no change to node P's local and global parts
        - Case 2: local +1, global -1
    - No change to node Q's local and global parts
    - Node R:
        - Case 1: local -1, global +1
        - Case 2: no change to node R's local and global parts

2. [4 points] In the geriatric algorithm, what parameters are sent from the initiator to each node? How are they used?

The initiator sends the following values to each node:
- **Minimum age** - the minimum age of the oldest set of pages that are going to be replaced in the upcoming epoch
- **Weights of all nodes** - a node's weight is the fraction of those over-age pages (pages older than the minimum age) that reside on it, i.e., the fraction of the epoch's replacements it will absorb

**Minimum age:** When a page needs to be evicted from a node, it is sent to a peer node for storage only if its age is less than the minimum age. Otherwise the page is discarded, since pages older than the minimum age would be discarded in the upcoming epoch anyway.

**Weights of all nodes:** The weights are used to pick the initiator for the next epoch: the node with the highest weight (i.e., the one holding the most pages to be replaced in the upcoming epoch, and therefore likely the least active) is chosen as the manager/initiator. A node also uses the weights to decide which peer node to send an evicted page to for storage.
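The fault-handling flow in question 1 can be captured in a toy sketch. The `GmsNode` class, the age bookkeeping, and the simplified eviction are illustrative assumptions for exposition only, not GMS's actual implementation (which the question explicitly sets aside); dirty-page disk writes are left as comments.

```python
class GmsNode:
    def __init__(self, name):
        self.name = name
        self.local = {}   # page -> age, pages the node is actively using
        self.glob = {}    # page -> age, pages cached on behalf of the cluster

    def oldest_page(self):
        candidates = [(age, part, page)
                      for part in (self.local, self.glob)
                      for page, age in part.items()]
        _, part, page = max(candidates, key=lambda c: c[0])
        return part, page

def handle_fault(P, Q, R, X):
    # 1. P fetches the faulting page X from Q into its LOCAL part
    #    (Q keeps its copy, so Q's local/global sizes do not change).
    P.local[X] = 0
    # 2. To make room, P evicts its oldest page Z and sends it to R
    #    (a dirty local page would be written to disk first).
    part, Z = P.oldest_page()
    age = part.pop(Z)
    # 3. R places Z in its GLOBAL part and, to make room, drops the
    #    globally oldest page Y (flushed to disk first if dirty and local).
    R.glob[Z] = age
    y_part, Y = R.oldest_page()
    y_part.pop(Y)

P, Q, R = GmsNode("P"), GmsNode("Q"), GmsNode("R")
P.glob["Z"] = 5     # P's oldest page sits in its global part (case 2)
Q.local["X"] = 1    # X is in Q's local cache
R.local["Y"] = 9    # Y, the globally oldest page, is in R's local part (case 1)
handle_fault(P, Q, R, "X")
print(P.local, P.glob)  # {'X': 0} {}  -> P: local +1, global -1
print(R.local, R.glob)  # {} {'Z': 5} -> R: local -1, global +1
```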
# DFS

1. A traditional centralized Unix file system has an array data structure called the i_node_table. Every file name at creation time is assigned an i_number, which is a unique index into this table. The content of that entry is the disk block address of the inode corresponding to that file. xFS being a distributed file system, the functionality of the i_node_table is fulfilled by the mmap and imap data structures.

(a) [2 points] Explain each of these data structures to show how they fulfill the functionality of the i_node_table.
(b) [1 point] What determines the size of the mmap data structure?
(c) [1 point] When does the content of the mmap data structure change?
(d) [1 point] The mmap data structure is (choose one of the following): O partitioned O centralized O replicated partially O replicated globally
(e) [1 point] The imap data structure is (choose one of the following): O partitioned O centralized O replicated partially O replicated globally
(f) [1 point] How big is the imap data structure in each node?

(a) **mmap structure**: a globally replicated table that maps index-number spaces to managers
- a client locates a file's manager by using part of the file's index number to index the mmap
- managers maintain the imap entries for the files they manage

**imap structure**: a partitioned table that maps a file's index number to the log address of the file's index node
- this is analogous to how the Unix i_node_table maps an i_number to the disk block address containing the inode

(b) The size of the mmap structure is determined by the size of the index-number space, the load on the system, and the number of managers in the system.

(c) The content of the mmap changes to correct a load imbalance or when the number of machines in the system changes. (For instance, "xFS can modify the mmap to assign some index number space to a new manager by having the original managers send the corresponding part of their manager state to the new manager" (section 3.1.1).)

(d) replicated globally

(e) partitioned

(f) The size of the imap in each node depends on the number of files in the file system whose index numbers fall within the index-number space managed by that node.

2. [2 points] In xFS, what does an inode contain?

The log addresses of the log segments that contain the data blocks of the file.
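A sketch of the two-step lookup that replaces the centralized i_node_table. Everything here (the bucket count, the manager naming, the tuple used as a log address) is an illustrative assumption; the point is only that the globally replicated mmap routes an index number to a manager, and that manager's partition of the imap yields the index node's log address.

```python
MMAP_BUCKETS = 16  # assumed granularity of the index-number space

# mmap: replicated at every node, maps index-number buckets to managers.
mmap = {bucket: f"manager-{bucket % 4}" for bucket in range(MMAP_BUCKETS)}

# imap: partitioned, each manager holds entries only for its own files.
imaps = {manager: {} for manager in set(mmap.values())}

def manager_for(index_number: int) -> str:
    # A client uses part of the index number to index the mmap.
    return mmap[index_number % MMAP_BUCKETS]

def inode_log_address(index_number: int):
    # The manager's imap entry plays the role of the i_node_table entry:
    # index number -> log address of the file's index node.
    return imaps[manager_for(index_number)].get(index_number)

# The manager records a file's index node location when the file is written...
imaps[manager_for(42)][42] = ("log-segment-7", 0x1A0)
# ...and any client can later resolve it with the two table lookups.
print(manager_for(42), inode_log_address(42))
```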