Lesson 3 - Virtualization - Notes by Chris Hayduk

**Main Learning Goal:** To explore how virtualization took OS extensibility to another level, allowing entire operating systems to coexist on the same hardware.

# Key Terms
- **Virtualization:** extensibility applied at the granularity of *entire* operating systems, rather than at the granularity of services within an operating system (as we saw in lesson 2)
- **Hypervisor (also known as Virtual Machine Monitor \[VMM\]):** an operating system of operating systems. It acts as the layer underneath the virtual machines running on a single hardware platform.
- **Native (Bare Metal) Hypervisor:** a hypervisor that runs directly on the hardware, with guest OSes running on top of the hypervisor
- **Hosted Hypervisor:** a hypervisor that runs on top of a host operating system, with guest OSes running on top of the hypervisor.
  ![[Screenshot from 2024-09-22 20-48-57.png]]
- **Full Virtualization:** the guest OSes are left entirely the same (i.e. no code is changed) and run directly on top of the hypervisor. Guest OSes run at the user privilege level (rather than the kernel privilege level), so privileged instructions coming from the guest OSes must trap into the hypervisor to be handled.
- **Para Virtualization:** the source code of guest OSes is modified to provide optimizations and alter privileged instructions. Crucially, none of the guest OS interfaces are changed, so applications running inside the guest OS do not see any difference after the guest OS code is modified
- **Ballooning:** a technique in which a "balloon" driver is installed in each guest OS by the hypervisor. It can be used in both para virtualized and fully virtualized systems. When the hypervisor needs more machine memory (for other guest OSes), it asks the balloon driver to "inflate" the balloon, which makes the driver request more and more memory from its guest OS.
  Since the guest has a finite amount of physical memory visible to it, the guest OS eventually has to page some of its own memory out to disk to satisfy the balloon. The balloon driver hands the pages it acquired back to the hypervisor, which can reassign that machine memory to another guest OS. Similarly, if the hypervisor has memory available, it can deflate balloons so that the guest OSes have more memory to work with.
- **Pure share based policy for memory virtualization:** "you pay less, you get less". For example, if you pay the data center more, you get access to more memory.
- **Working-set based policy for memory virtualization:** if the working set of a virtual machine grows, you give it more memory. If the working set of a virtual machine shrinks, you take memory away.
- **Dynamic idle-adjusted policy for memory virtualization:** balances the pure share and working-set approaches by taxing idle pages more than active pages. If the tax rate is 0%, this reduces to the pure share approach (if you paid for it, you keep it). If the tax rate is 100%, any memory that is not being used will be taken away, which reduces to the working-set approach. Typically an intermediate tax rate is used, such as 50%. With a tax rate of 50%, up to about half of a VM's idle pages can be taken away.

# Notes
- How do the lessons from the SPIN and Exokernel papers relate to virtualization?
    - In SPIN and Exokernel, we examined ways that we could customize operating system functionality for a variety of applications by defining new OS services that utilized underlying microkernel abstractions. The goal was to create an extensible OS that maintains protection and performance.
    - In virtualization, we now need to customize system functionality for a variety of *operating systems* that are sharing the hardware resources. As such, we need something to act as the microkernel (in this case, the hypervisor), whose role is to create and expose a set of abstractions to the guest operating systems.
      These guest operating systems can then use these abstractions to implement core microkernel and OS service operations as needed.
- What are examples of hosted hypervisors?
    - VMware (e.g. VMware Workstation) and VirtualBox
- What are examples of bare metal hypervisors?
    - The Linux Kernel-based Virtual Machine (KVM)
- How much of the guest OS code typically needs to be modified in para virtualization?
    - Usually less than 2%
- What needs to be done for virtualization (both full virtualization and para virtualization)?
    - Virtualize hardware
        - Memory hierarchy
        - CPU
        - Devices
    - Effect data and control transfer between guest OSes and hypervisor
- How does the hypervisor manage page tables for applications running inside of guest OSes?
    - It doesn't care about that at all. The hypervisor just views each guest OS as a protection domain, without any concern for how the guest OS manages memory between its own applications
    - Guest OSes are responsible for mapping application virtual memory to physical memory. The catch is that this physical memory is not *really* physical memory in this case. The guest OSes think they have access to their own contiguous blocks of physical memory, but, in reality, this is just another layer of virtual memory. The hypervisor is responsible for mapping this physical memory to machine memory.
      ![[Screenshot from 2024-09-22 21-34-30.png]]
- Who keeps the physical memory -> machine memory mapping in a para virtualized system (guest OS or hypervisor)?
    - Either can, but it's usually the guest OS. This is possible because the source code of a para virtualized guest OS has been modified so that it knows it is running as a guest. The guest OS can therefore be made aware of the real memory hierarchy and do this mapping itself without passing control to the hypervisor
- Who keeps the physical memory -> machine memory mapping in a fully virtualized system (guest OS or hypervisor)?
    - The hypervisor does, because guest OSes do not have any knowledge of the underlying hardware.
- How can memory be shared across virtual machines if we have access to the guest OSes?
    - If two virtual machines have the same content (e.g. running the same OS or running the same application), then the physical-to-machine mappings for those memory pages can point to the same location in machine memory. The hypervisor just makes sure that the pages are write protected in each guest OS. When a write is attempted on a shared page, the page is copied to a new machine memory location and the mapping is updated for the guest OS that attempted the write (this approach is called copy on write)
- How can memory be shared across virtual machines if we are oblivious to the guest OSes?
    - We can use a hash function to hash the contents of a page. We can then consult a hash table that maps content hashes to the corresponding hint frame, which contains the VM that owns this page, the physical page number corresponding to that page (inside the guest OS), and the machine page number corresponding to that page. A hash match alone does not let us *know* that the contents are identical, because the matched page may have been modified since its content hash was generated. So once we find a match, we have to do a full comparison of the pages to make sure that they are exactly the same.
    - Once we have a full match, we can make the requesting OS's (VM1) physical page number map to the matched VM's (VM2) machine page number. We mark this page as copy on write and indicate in the hash table that two VMs are now using this machine page.
      ![[Screenshot from 2024-09-22 22-03-06.png]]
- Why is it a bad idea to set the tax rate to be 100% in dynamic idle-adjusted memory virtualization? (equivalently, why is the working-set based approach a bad idea?)
    - Because, if we take away all of the idle pages from a VM, then even a small burst in memory usage will require a lot of paging back and forth with the disk, significantly reducing performance. By setting the tax rate lower (e.g. 50%), we can reclaim a large amount of idle memory while still leaving some slack for guest OSes to handle bursts in workloads
- What are the two core parts of CPU virtualization?
    - We need to give each guest OS the illusion of owning the entire CPU (each guest OS should not even be aware that there are any other guest OSes that it's sharing resources with)
    - We need to handle program discontinuities. That is, the hypervisor should be able to field events that arise while a process is executing and deliver them to that process's parent guest OS
- How are CPU interrupts handled in CPU virtualization?
    - The hypervisor passes all program discontinuities for the running thread (e.g. system calls, external interrupts, page faults, etc.) to the guest OS that is the parent of the thread. It must also track which thread each interrupt belongs to, as that thread may have been moved off of the CPU by the time the interrupt arrives. The events are delivered by the hypervisor to the parent guest OS as software interrupts.
- How are privileged instructions handled in a fully virtualized environment?
    - When the guest OS tries to execute a privileged instruction, we trap into the hypervisor for handling, since the guest OS cannot execute privileged instructions itself. However, on some hardware, certain privileged instructions fail silently when executed in user mode. In that case there is no trap, so the hypervisor is never notified. To mitigate this, the hypervisor must know the quirks of the hardware it is running on and scan the binary of the guest OS to identify any of these silently-failing instructions in its code.
      If those instructions are found, the hypervisor will rewrite the guest OS's binary so that they are handled by the hypervisor instead of failing silently.
- How does the guest OS communicate with the hypervisor in the fully virtualized setting?
    - It is done implicitly through traps. Privileged operations in the guest, including system calls made by guest applications, trap into the hypervisor
- How does the guest OS communicate with the hypervisor in the para virtualized setting?
    - There are APIs (hypercalls) that allow the guest OS to communicate directly with the hypervisor
- How is device virtualization handled in the fully virtualized setting?
    - Each guest OS thinks that it has exclusive access to the devices. When the guest OS tries to access any of these devices, the access traps into the hypervisor, and the hypervisor emulates the behavior that the guest OS expects for that device. Thus, the control transfer from the guest OS to the hypervisor is implicit. The hypervisor determines the legality of the operation.
- How is device virtualization handled in the para virtualized setting?
    - The same devices visible to the hypervisor are also visible to the guest OSes, so there are many possible ways to implement device access in this setting. The guest OSes can transfer control to the hypervisor explicitly by making hypercalls, which allows the guest OS to decide when event notifications are delivered.
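
The guest-oblivious page sharing steps from the notes above (hash the page, look up the hint frame, do a full comparison, then map copy on write) can be sketched roughly as follows. This is a toy model, not how any real hypervisor is implemented: the `Hypervisor` and `HintFrame` classes, the `try_share` and `write` methods, the dictionaries used as mapping tables, and the choice of SHA-256 are all assumptions made for illustration, and a write is simplified to replacing the whole page.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class HintFrame:
    vm: str   # VM that owned the page when it was hashed
    ppn: int  # physical page number inside that guest
    mpn: int  # machine page number backing it


class Hypervisor:
    """Toy model of content-based page sharing (all names hypothetical)."""

    def __init__(self):
        self.machine = {}   # mpn -> page contents (bytes)
        self.p2m = {}       # (vm, ppn) -> mpn
        self.refs = {}      # mpn -> number of guest pages mapped to it
        self.cow = set()    # machine pages marked copy on write
        self.hints = {}     # content hash -> HintFrame
        self._next_mpn = 0

    def _alloc(self, data):
        mpn = self._next_mpn
        self._next_mpn += 1
        self.machine[mpn] = data
        self.refs[mpn] = 1
        return mpn

    def add_page(self, vm, ppn, data):
        self.p2m[(vm, ppn)] = self._alloc(data)

    def try_share(self, vm, ppn):
        """Scan one guest page; return True if it was merged with another."""
        mpn = self.p2m[(vm, ppn)]
        if self.refs[mpn] > 1:
            return False                      # already shared
        data = self.machine[mpn]
        key = hashlib.sha256(data).hexdigest()
        hint = self.hints.get(key)
        if hint is None or hint.mpn == mpn:
            self.hints[key] = HintFrame(vm, ppn, mpn)
            return False                      # nothing to share with yet
        # The hash is only a hint: the hinted page may have changed since
        # it was hashed, so compare the full contents before sharing.
        if self.machine.get(hint.mpn) != data:
            self.hints[key] = HintFrame(vm, ppn, mpn)  # replace stale hint
            return False
        # Full match: point this guest page at the existing machine page,
        # mark it copy on write, and free the duplicate machine page.
        self.p2m[(vm, ppn)] = hint.mpn
        self.refs[hint.mpn] += 1
        self.cow.add(hint.mpn)
        del self.machine[mpn]
        del self.refs[mpn]
        return True

    def write(self, vm, ppn, data):
        """Replace a whole guest page, copying first if it is shared."""
        mpn = self.p2m[(vm, ppn)]
        if mpn in self.cow and self.refs[mpn] > 1:
            self.refs[mpn] -= 1
            self.p2m[(vm, ppn)] = self._alloc(data)   # private copy
        else:
            self.cow.discard(mpn)
            self.machine[mpn] = data                  # sole owner


hv = Hypervisor()
hv.add_page("VM1", 0, b"A" * 4096)
hv.add_page("VM2", 7, b"A" * 4096)
hv.try_share("VM2", 7)   # first scan only records a hint frame
hv.try_share("VM1", 0)   # second scan finds the hint and merges
```

After the two scans, both guest pages map to one machine page. A later `hv.write("VM1", 0, ...)` gives VM1 a private copy and leaves VM2's page untouched, which is the copy on write behavior described above.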