There are still TODOs and things to improve to make virtio-mem support even more use cases.
For simplicity, the minimum granularity at which memory can be added/removed by virtio-mem in Linux corresponds to the maximum allocation granularity in Linux:
- x86-64 allows hot(un)plug in 4 MiB chunks
- s390x could allow hot(un)plug in 4 MiB chunks
- ppc64 could allow hot(un)plug in 16 MiB chunks
- arm64 with 4k base pages could allow hot(un)plug in 4 MiB chunks
For some architectures, it will be comparatively easy to support hot(un)plug of even smaller memory chunks in the future, corresponding to the THP size (pageblock granularity). On x86-64, 2 MiB is possible; on s390x, 1 MiB. Going even smaller than that will be tricky, and might not be desirable due to possible fragmentation.
If the maximum allocation size and THP size are abnormally large, such as on arm64 with 64k base pages where it's effectively 512 MiB, this is not sufficient. While hotplugging smaller chunks is easy, hotunplugging smaller chunks is more challenging. Handling such large THPs in the QEMU implementation will also need some thought. Most probably this corner case is not worth optimizing.
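The relationship between page size, buddy-allocator order, and the resulting (un)plug granularity can be sketched as follows. This is an illustrative calculation, not code queried from a live system; the MAX_ORDER values are assumptions chosen to match the figures in the text.

```python
# Illustrative sketch: virtio-mem's current minimum (un)plug granularity
# in Linux is the maximum buddy-allocator chunk, PAGE_SIZE << (MAX_ORDER - 1).
# The MAX_ORDER values below are assumptions matching the text above.

KIB = 1024
MIB = 1024 * 1024

def max_alloc_granularity(page_size: int, max_order: int) -> int:
    """Largest chunk the buddy allocator can hand out."""
    return page_size << (max_order - 1)

# x86-64 with 4 KiB base pages (assumed MAX_ORDER = 11) -> 4 MiB chunks
assert max_alloc_granularity(4 * KIB, 11) == 4 * MIB

# arm64 with 64 KiB base pages (assumed MAX_ORDER = 14) -> 512 MiB chunks,
# the problematic corner case discussed above
assert max_alloc_granularity(64 * KIB, 14) == 512 * MIB
```

With pageblock-granularity support, the lower bound would instead become the THP size (2 MiB on x86-64, 1 MiB on s390x).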
Linux guest driver TODOs
- Support the memmap_on_memory kernel command line parameter for memory hotplugged via virtio-mem.
- alloc_contig_range() improvements
- Support allocation of individual pageblocks cleanly, especially on ZONE_NORMAL; support 2 MiB granularity on x86-64.
- Increase reliability and improve performance.
- Defragmentation: Try to minimize the number of memory blocks after unplug over time.
Linux host TODOs
- Reclaiming empty page tables, optimizing for sparse memory mappings (WIP)
- Reduce KVM memory slot overhead, optimizing for sparse memory slots (WIP)
- "memslots=X" option for virtio-mem devices (WIP)
- Expose device memory via multiple memslots that are dynamically mapped and dynamically create KVM memory slots.
- "iothread=X" option for virtio-mem devices
- Process guest requests (plug/unplug memory blocks), including discarding memory, preallocating memory and updating VFIO mappings, via a separate thread; avoid holding the BQL (Big QEMU Lock) for a long time during expensive operations.
- "managed-size" option for memory backends (TBD)
- Adjust the size of the memory region / RAMBlock ... dynamically to cover all mapped memslots. The goal is to reduce the overhead of QEMU bitmaps and increase performance (e.g., in migration code) when having large virtio-mem devices that only expose little memory to the VM.
- "prot=uffd" option for virtio-mem devices (TBD)
- Catch access to logically unplugged memory by catching access via userfaultfd - requires indicating to the guest that unplugged memory must not be read.
virtio-mem adds "System RAM (virtio_mem)" resources as child resources under the corresponding virtio device resource. This is visible via /proc/iomem to user space. For example:
00000000-00000fff : Reserved
00001000-0009fbff : System RAM
[...]
fffc0000-ffffffff : Reserved
100000000-13fffffff : System RAM
140000000-33fffffff : virtio0
  140000000-147ffffff : System RAM (virtio_mem)
340000000-53fffffff : virtio1
  340000000-34fffffff : System RAM (virtio_mem)
540000000-5bfffffff : PCI Bus 0000:00
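A tool that must treat virtio-mem memory specially (as kexec-tools does, see below) can recognize these ranges by their resource name. The sketch below parses /proc/iomem-style output; `parse_virtio_mem_ram` is a hypothetical helper for illustration, not an existing kexec-tools function, and the sample string mirrors the example above.

```python
import re

# Sample /proc/iomem excerpt, matching the example above: virtio-mem
# provided RAM shows up as a child resource of the virtio device.
SAMPLE = """\
140000000-33fffffff : virtio0
  140000000-147ffffff : System RAM (virtio_mem)
340000000-53fffffff : virtio1
  340000000-34fffffff : System RAM (virtio_mem)
"""

def parse_virtio_mem_ram(iomem: str):
    """Return (start, end) address tuples of all virtio-mem RAM ranges."""
    ranges = []
    for line in iomem.splitlines():
        m = re.match(r"\s*([0-9a-f]+)-([0-9a-f]+) : System RAM \(virtio_mem\)",
                     line)
        if m:
            ranges.append((int(m.group(1), 16), int(m.group(2), 16)))
    return ranges

for start, end in parse_virtio_mem_ram(SAMPLE):
    print(f"{start:x}-{end:x}")
# prints:
# 140000000-147ffffff
# 340000000-34fffffff
```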
kexec-tools must not use this memory for placing kexec images, or use this memory when building the initial memory map (e.g., e820 map) for the kexec kernel. This is already done implicitly due to the way the resources show up as child resources of virtio devices.
However, kexec-tools should add this memory to the list of memory (via the elfcorehdr) to dump via kdump.
In general, virtio-mem can live without ACPI. However, there has to be a way to expose the maximum possible pfn that can be used for hotplugged memory to the guest OS, for example, to properly set up swiotlb during boot in Linux. Currently, virtio-mem on x86-64 relies on ACPI SRAT tables to communicate the memory range that can be used for memory hotplug ("max_possible_pfn").
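Conceptually, "max_possible_pfn" is just the highest page frame number covered by any hotpluggable range announced via SRAT. The sketch below illustrates that calculation; the range values are made-up examples, not real SRAT contents.

```python
# Sketch: deriving "max_possible_pfn" from hotpluggable memory ranges,
# as announced e.g. via ACPI SRAT. Ranges are (start_address, size) pairs;
# the example values are assumptions for illustration only.

PAGE_SHIFT = 12  # 4 KiB base pages

def max_possible_pfn(hotplug_ranges):
    """Highest page frame number any hotpluggable range can reach."""
    end = max(start + size for start, size in hotplug_ranges)
    return (end >> PAGE_SHIFT) - 1

# e.g., an 8 GiB hotpluggable region starting at 4 GiB ends at 12 GiB,
# so the guest must prepare (e.g., size swiotlb) for PFNs up to there
assert max_possible_pfn([(0x100000000, 8 << 30)]) == (0x300000000 >> 12) - 1
```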
- Indicate the guest status / guest errors via virtio-mem devices.
- OOM handling / early detection of low-memory situations.
- Guest-triggered shrinking of the usable region.