Guest memory provisioning in a disaggregated system 

VM memory provisioning and balancing

This section discusses different ways of resizing guest memory at runtime. For illustration, they are collated in Figure 2-2, together with their most important characteristics and example positions from recent literature that leverage each approach. Based on the dReDBox prototype, which is capable of allocating memory from the disaggregated pool, one of our main objectives was to enhance the virtualization framework so that hosted VMs could take full advantage of non-local resources. The idea is that a compute node may have only a very limited amount of RAM attached locally, just enough to boot the host OS (serving as a hypervisor) and to execute auxiliary software components.

As mentioned at the end of Section 1.7, such an approach may be considered semi-disaggregated for two reasons. Firstly, the hypervisor itself relies on a certain amount of memory being available locally on the compute node, which bends somewhat the principle of complete independence between CPUs and RAM in a disaggregated system. Secondly, although the attached memory comes from remote nodes, a VM process is bound to the HPA of a single compute node, as is the hypervisor. A VM can therefore be migrated to a different compute node, but it cannot be distributed over multiple nodes at the same time; different nodes run different hypervisor instances. This differs substantially from the LegoOS design, where the hypervisor is built on top of a splitkernel architecture [53]. Thus, although our approach, like similar ones, is often described in the literature as disaggregated, its level of disaggregation is lower than that of a system like LegoOS.
Additional processes executed by the hypervisor alongside the VMs are required to integrate the compute node with the rest of the system. For example, the host OS is notified by the system manager when a VM process should be started or migrated. Moreover, a compute node also needs a way to request the attachment of a portion of disaggregated memory and to receive the corresponding parameters in return, in order to properly configure its local interconnect adapter logic and establish a link to the new resources. These tasks are handled by software components running in local RAM, for the obvious reason that disaggregated memory cannot be used before it is attached. In contrast, all the buffers that make up the guest's RAM come exclusively from the disaggregated pool.
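The control path just described can be summarized with a minimal C sketch. All structure fields, function names and the overall interface are illustrative assumptions made for this discussion, not the actual dReDBox protocol; the point is only that the attachment parameters returned by the system manager are needed to program the local interconnect adapter before any remote memory becomes usable.

/* Sketch of the attachment handshake between a compute node and the
 * system manager. Names and fields are illustrative assumptions. */
#include <stdint.h>
#include <stddef.h>

struct dmem_attach_params {
    uint64_t hpa_base;    /* host physical window the region is mapped to */
    uint64_t size;        /* size of the attached region in bytes         */
    uint32_t remote_node; /* memory node that serves the region           */
    uint32_t link_id;     /* interconnect circuit set up for this link    */
};

/* Hypothetical manager client and adapter configuration hooks; these run
 * from local RAM, since remote memory is unusable until attached. */
int manager_request_attach(size_t size, struct dmem_attach_params *out);
int adapter_program_window(const struct dmem_attach_params *p);

/* Attach 'size' bytes of disaggregated memory and make them reachable
 * through the local adapter before any guest buffer is placed there. */
int dmem_attach(size_t size, struct dmem_attach_params *out)
{
    int err = manager_request_attach(size, out);
    if (err)
        return err;
    return adapter_program_window(out);
}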
Building on these assumptions, the virtualization framework is expected to obtain buffers of disaggregated RAM and attach them to a VM at boot time. Correspondingly, they are released when the VM is terminated. Moreover, VMs are expected to change the amount of provided RAM dynamically, so that a running guest may increase or decrease its memory capacity without a reboot. This functionality underlies the memory balancing logic at the host level: the host can shuffle already reserved buffers between the hosted VMs and, when these are not sufficient, request the attachment of additional ones, as sketched below.
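The following is a minimal sketch of that balancing idea, under the assumption of a simple per-VM current/target bookkeeping: surplus buffers already attached to the host are moved between VMs first, and only the remaining deficit triggers a new attachment request. The helper functions are hypothetical placeholders, not part of the actual framework.

/* Host-level balancing sketch; types and helpers are assumptions. */
#include <stddef.h>

struct vm {
    size_t current;  /* RAM currently provided to the guest (bytes) */
    size_t target;   /* RAM the balancing policy wants it to have   */
};

/* Hypothetical helpers: move an already attached buffer between VMs,
 * or attach a fresh one from the disaggregated pool. */
size_t reclaim_surplus(struct vm *donor, size_t want); /* returns <= want */
int    assign_buffer(struct vm *vm, size_t size);
int    attach_new_buffer(struct vm *vm, size_t size);

/* Grow 'vm' by 'need' bytes, preferring buffers reclaimed from donors. */
int balance_grow(struct vm *vms, int nvms, struct vm *vm, size_t need)
{
    for (int i = 0; i < nvms && need > 0; i++) {
        struct vm *d = &vms[i];
        if (d == vm || d->current <= d->target)
            continue;                       /* no surplus to give */
        size_t got = reclaim_surplus(d, need);
        if (got) {
            assign_buffer(vm, got);
            need -= got;
        }
    }
    /* Reserved buffers were not enough: ask the pool for more. */
    return need ? attach_new_buffer(vm, need) : 0;
}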

Inter-VM memory sharing and migration

In Chapter 4 we will discuss inter-VM memory sharing and VM migration on a disaggregated system. Both are presented from the virtualization framework perspective, building on top of custom libraries available at the host level. These libraries are responsible for operating the parts of the system-wide infrastructure involved in memory sharing or migration.

Inter-VM memory sharing is mostly mentioned in the literature in the sense of communication and data exchange between VMs co-located on the same host. On a traditional architecture this allows zero-copy sharing, as the data stays in the same physical location, mapped into all participating VMs. In this context the main challenge is to ensure proper serialization of concurrent accesses in order to avoid data corruption. The related signaling is usually handled by a server running in the hypervisor, which dispatches virtual interrupts between the participating VMs. Except for dReDBox, all presented examples use zero-copy memory sharing only when all participating VMs run on the same system node, that is, within the same HPA space. In one case this approach is used to provide efficient socket-based communication, unaffected by the overhead of a networking stack [31]. Other works integrate it with HPC programming models (such as MPI) in order to optimize inter-VM communication between co-located VMs [33, 63]. Nevertheless, because of the limitations of the clustered architecture, once VMs are hosted by different system nodes they fall back to mechanisms based on data copying, namely TCP/IP or Remote Direct Memory Access (RDMA). The latter can run over PCIe, as in [30, 57], or over InfiniBand, which is very popular in the HPC domain [63]. Among these, the most efficient are solutions where copying is initiated by software but the actual data transfer is performed entirely by hardware.
The same applies, for example, when a VM initially capable of zero-copy data sharing with others is migrated to another node. The sharing scheme then has to be switched to a copy-based one, and data exchange performance decreases as a result. In such a scenario VM migration is considered harmful, whereas ideally it would be almost transparent to the deployed workloads.
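The zero-copy principle discussed above can be illustrated with a small host-side sketch: a single backing object mapped into two co-located processes, so that data written by one is immediately visible to the other without any copy. Hypervisors expose an analogous region to their guests (for instance as a PCI BAR in QEMU's ivshmem device), while serialization of concurrent accesses and interrupt-based signaling remain the hypervisor's responsibility. The names below are illustrative and not taken from the dReDBox code.

/* Map one shared region into the calling process; a second VM process
 * doing the same mapping sees the very same physical data: zero copy. */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHARE_NAME "/ivshm_demo"        /* hypothetical object name */
#define SHARE_SIZE (4 * 1024 * 1024)

void *map_shared_region(void)
{
    int fd = shm_open(SHARE_NAME, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return MAP_FAILED;
    if (ftruncate(fd, SHARE_SIZE) < 0) {
        close(fd);
        return MAP_FAILED;
    }
    /* MAP_SHARED: all participants map the same backing object, so the
     * data never has to be copied between them. */
    void *p = mmap(NULL, SHARE_SIZE, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    close(fd);
    return p;
}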

Proposed system architecture

In the design presented in this dissertation, when a VM starts, the guest's virtual RAM buffers are not reserved using the default host memory allocator (e.g. the malloc() function). Instead, they are obtained from a custom host driver managing isolated ranges of the Host Physical Address space (HPA). On a disaggregated system, resources coming from remote memory nodes are mapped to these ranges at the host level. As a consequence, each guest memory section is contiguous in the HPA space, and each isolated range is used by at most one VM (except for explicit memory sharing). Eventually, the guest's virtual RAM is constructed from one or more such isolated chunks, whose number can be adjusted at runtime. This approach makes the virtualization layer easy to integrate with the disaggregated architecture.
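A minimal sketch of how one such chunk could be obtained is given below. The device path, ioctl command and argument structure are assumptions introduced for illustration only; the real interface belongs to the custom driver described in this dissertation.

/* Reserve one isolated, contiguous HPA chunk from a hypothetical
 * character device and map it into the VM process; the returned pointer
 * would then be registered with the hypervisor as a guest RAM region. */
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

struct hpa_chunk {
    uint64_t hpa_base;   /* start of the isolated, contiguous HPA range  */
    uint64_t size;       /* chunk size requested by the caller (bytes)   */
};

#define DMEM_IOC_ALLOC _IOWR('d', 1, struct hpa_chunk)  /* hypothetical */

void *alloc_guest_chunk(uint64_t size, struct hpa_chunk *out)
{
    int fd = open("/dev/dmem", O_RDWR);        /* hypothetical device */
    if (fd < 0)
        return MAP_FAILED;

    out->size = size;
    if (ioctl(fd, DMEM_IOC_ALLOC, out) < 0) {
        close(fd);
        return MAP_FAILED;
    }
    /* The mapping covers exactly one isolated HPA range, used by at
     * most one VM unless memory is shared explicitly. */
    void *p = mmap(NULL, out->size, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    close(fd);
    return p;
}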

Table of contents:

List of Figures
List of Tables
List of Terms
1 Introduction 
1.1 Data centers — current state
1.2 Clustered architecture
1.3 Virtualization role
1.4 Clustering drawbacks
1.5 Disaggregated architecture
1.6 Disaggregated systems and virtualization
1.7 Focus and scope of this work
2 Related work 
2.1 Memory disaggregation
2.2 VM memory provisioning and balancing
2.3 Uniform address space
2.4 Inter-VM memory sharing and migration
2.5 Devices disaggregation
3 Guest memory provisioning in a disaggregated system 
3.1 Chapter introduction
3.2 Proposed system architecture
3.3 Resize volume
3.4 Live VM balancing: guest parameters visibility
3.5 Explicit resize requests
3.6 Request path
3.7 Resize granularity
3.8 Disaggregation context
3.9 Guest memory isolation
3.10 Chapter conclusion
4 VM memory sharing and migration 
4.1 Chapter introduction
4.2 Memory sharing — overview
4.3 VM migration — overview
4.4 Software modifications
4.5 Proposed system architecture
4.6 Sharing disaggregated memory
4.7 VM migration
4.8 Chapter conclusion
5 Disaggregated peripherals attachment 
5.1 Chapter introduction
5.2 Device emulation and direct attachment
5.3 Disaggregated passthrough design
5.4 Chapter conclusion
6 Implementation and evaluation 
6.1 Memory provisioning
7 Conclusion
7.1 Perspectives and future works
Appendices
