User Guide - QEMU
Important Current Limitations
- virtio-mem devices that start out small and grow large over time will consume metadata in QEMU/KVM for the maximum size right from the start. See below for more details. In the works: resizeable memory backends / memory regions, using multiple memslots.
- x86-64 KVM in Linux v5.14 and newer lazily allocates the rmap when the TDP MMU is used, for example, when running with EPT/NPT in the hypervisor and not running nested VMs; not allocating the rmap drastically reduces the memory overhead for large virtio-mem devices. Note that the rmap will still be allocated (lazily) when starting nested virtual machines.
- [PATCH v1 00/12] virtio-mem: Expose device memory via multiple memslots
- There is no protection of unplugged memory (similar to virtio-balloon) in QEMU yet. A malicious guest might make use of more memory than requested - although such guests will break when migrating.
- Some features are not compatible with virtio-mem (yet) and are blocked:
- RDMA migration
- mlock'ing memory
- Only anonymous memory is fully supported. Memfd-based shared memory works reliably in controlled environments. Huge pages are not supported yet. There are two main things to fix to support any kind of memory backing that supports sparse memory mappings:
- Tmpfs/shmem/hugetlbfs do not have a shared zeropage. We have to indicate to the guest that reading unplugged memory is not supported, and block legacy guests that cannot guarantee that. In the works:
- Preallocation, required to make use of scarce resources, such as huge pages, is not supported yet. In the works:
- QEMU now cleanly supports guest memory dumps with virtio-mem devices.
- QEMU now cleanly supports migration/background snapshots with virtio-mem devices.
- virtio-mem now supports vfio/mdev. See the example below for granularity considerations.
- Memory backends now support the reserve=off option, required for clean hugetlbfs support.
We can define multiple virtio-mem devices for a virtual machine. Each virtio-mem device belongs to exactly one vNUMA node and is assigned exactly one memory backend.
While virtio-mem devices can be hotplugged, they cannot be hotunplugged for now. Memory hot(un)plug is triggered by requesting to resize a virtio-mem device.
On x86-64 and arm64, the actual virtio-mem device is hidden inside a virtio-mem-pci device. The user defines virtio-mem-pci devices and uses them as if they were virtio-mem devices. The memory backend defines the maximum size of a virtio-mem device and the source+type of the memory that will be provided via the virtio-mem device to the virtual machine.
The id (default: NULL) property of a virtio-mem device is an identifier that allows for resizing and monitoring a specific virtio-mem device after creation.
The memdev (default: NULL) virtio-mem device property specifies the id of the memory backend to assign to a virtio-mem device.
The memaddr (default: 0) virtio-mem device property specifies the start address of the virtio-mem device memory region in guest physical address space. If it is 0, it will be auto-assigned.
The requested-size (default: 0) virtio-mem device property specifies how much memory we would like the virtual machine to consume via a specific virtio-mem device; it is a request towards the virtual machine, and to which degree the virtual machine is able to fulfill that request is visible via the size property. Note that there is a delay between changing the requested-size property and observing a change of the size property.
Note that in some cases, the virtual machine might not be able to fulfill the request at all, or only partially. Especially shrinking virtio-mem devices can easily fail if no proper care has been taken inside the virtual machine to make memory hotunplug more reliable, such as using ZONE_MOVABLE under Linux.
The requested-size property has to be a multiple of the block-size property and cannot exceed the maximum size defined by the memory backend. In general, a virtual machine cannot consume more than requested-size via the virtio-mem device, except when reducing it and the virtual machine cannot fulfill the request (completely).
Note that usually, the virtual machine will regularly retry to eventually fulfill the requests as well as possible -- for example, retrying to unplug memory until the resize request has been fully handled.
Changing the requested-size property for a running virtual machine corresponds to a hot(un)plug request.
The block-size (default with a memory backend using huge pages: the memory backend page size; otherwise: the THP size) virtio-mem device property specifies the size of the memory blocks that are part of the device memory region and can get hot(un)plugged individually; it corresponds to the hot(un)plug granularity on the hypervisor side.
The block-size has to be at least 1 MiB, has to be a power of two, and has to be at least as big as the page size of the assigned memory backend.
Note that when vfio/mdev is used, the block-size might have to be increased due to limited vfio/mdev mappings: see the block size discussion below for details.
Do not specify block-size values smaller than the THP size unless using huge pages: it is not supported and QEMU will print a warning. In the future, clean support that properly disables THP for the virtio-mem device might be added.
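As a quick sanity check, the constraints above can be encoded in a small helper (a sketch only; QEMU itself enforces these limits and prints an error when they are violated):

```shell
#!/usr/bin/env bash
# Sketch: validate a virtio-mem block-size (in bytes) against the
# documented constraints: at least 1 MiB, a power of two, and at
# least as big as the page size of the assigned memory backend.
valid_block_size() {
    local size=$1 page_size=$2
    (( size >= 1024 * 1024 ))      || return 1  # at least 1 MiB
    (( (size & (size - 1)) == 0 )) || return 1  # power of two
    (( size >= page_size ))        || return 1  # >= backend page size
}

# A 2 MiB block-size with 4 KiB base pages is fine ...
valid_block_size $((2 * 1024 * 1024)) 4096 && echo "2M: ok"
# ... but 3 MiB is not a power of two.
valid_block_size $((3 * 1024 * 1024)) 4096 || echo "3M: rejected"
```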
The size (read-only) virtio-mem device property shows how much memory the virtio-mem device currently provides to the virtual machine ("plugged memory").
The size changes based on resize requests as the virtual machine tries to fulfill them and hot(un)plugs device blocks; however, it also changes due to other events, for example, when rebooting the virtual machine.
The node (default: 0) virtio-mem device property specifies the vNUMA node assignment for a virtio-mem device.
The prealloc virtio-mem device property specifies whether to preallocate memory when processing plug requests from the virtual machine; if preallocation fails, the plug request is rejected and the virtual machine continues running unharmed.
The virtual machine will retry processing the memory hotplug request later. Consequently, user errors when handling scarce memory resources, such as running out of huge pages on a specific NUMA node, can be caught and handled gracefully.
Note that preallocation cannot protect from the Linux OOM (Out Of Memory) handler triggering and killing the QEMU process. Special care has to be taken with ordinary anonymous RAM.
When using scarce memory resources, such as huge pages, and in every other setup where we would use prealloc=on for the memory backend with other memory devices like DIMMs, specify prealloc=on for the virtio-mem device instead and prealloc=off for the memory backend. Note that this feature is still WIP and not upstream yet.
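Once device-side preallocation is available, the intended combination looks roughly like this (a sketch of the relevant command-line fragments only; since the feature is WIP, option spellings may change before it lands upstream):

```shell
# Sketch: preallocate per plugged block on the virtio-mem device,
# not for the whole (sparse) memory backend. Option names may
# differ in the final upstream version.
qemu-system-x86_64 ... \
  -object memory-backend-file,id=vmem0,mem-path=/dev/hugepages,size=8G,share=on,prealloc=off,reserve=off \
  -device virtio-mem-pci,id=vm0,memdev=vmem0,prealloc=on,requested-size=2G
```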
PCI Device Properties
For virtio-mem-pci devices, the same properties as for other PCI devices apply. Examples include the bus and addr properties. See the QEMU documentation for details.
Virtio Device Properties
For virtio-mem devices, the same properties as for other virtio devices apply.
Similarly, for virtio-mem-pci devices, the same properties as for other PCI-based virtio devices apply. Examples include the disable-modern and disable-legacy properties.
The memory backend defines the maximum size of a virtio-mem device and the source+type of the memory that will be provided via the virtio-mem device to the virtual machine. virtio-mem relies on a sparse memory backend, exposing a dynamic amount of memory from the memory backend.
Memory backends applicable to virtio-mem are:
- memory-backend-ram: for anonymous memory; usually with share=off
- memory-backend-file: for file-backed memory, including hugetlbfs and tmpfs; usually with share=on
- memory-backend-memfd: for shmem and hugetlb; usually with share=on
The id (default: NULL) property of a memory backend is an identifier that allows for assigning a specific memory backend to a specific virtio-mem device.
The size (default: 0) property of a memory backend defines the backend memory size. For a virtio-mem device, the memory backend size corresponds to the maximum size the virtio-mem device can provide to the VM.
The share (default for memory-backend-memfd: on; otherwise: off) property of a memory backend defines whether we want process-private or shared memory.
Note: Don't use memory-backend-memfd with share=off,hugetlb=off; it can result in double memory consumption.
Note: Use memory-backend-ram with share=on with care; there are only very limited use cases for shared anonymous memory in QEMU.
The reserve (default: on) property of a memory backend defines whether we want to reserve, depending on the memory backend and if applicable, swap space or huge pages. For example, reservation of swap space is not applicable for ordinary shared file-backed memory, but it is always applicable for huge pages.
QEMU will only bail out if reserve=off is specified but reservation cannot be disabled: this can only fail for anonymous and private file-backed memory if the memory overcommit configuration in Linux does not allow for it -- which conflicts with virtio-mem anyway. Disabling reservation for huge pages cannot fail.
Specify reserve=off for memory backends assigned to virtio-mem devices.
The prealloc (default: off) property of a memory backend defines whether to preallocate memory for the whole memory backend when creating it. As virtio-mem relies on sparse memory backends, we don't want to preallocate memory for the whole memory backend: QEMU will discard all that memory again when initializing the virtio-mem device, but would temporarily allocate memory for the whole memory backend, which can result in undesired side effects.
Specify prealloc=off for memory backends assigned to virtio-mem devices. Specify prealloc=on for the virtio-mem device instead (WIP).
The dump (default: off; with -machine dump-guest-core=on: on) property of a memory backend defines whether that memory will be part of a core dump of the QEMU process. Consequently, a core dump will read all memory of the memory backend, which can have negative effects for sparse memory backends as used by virtio-mem.
Avoid enabling core dumping via -machine dump-guest-core=on; if it is enabled, specify dump=off for memory backends assigned to virtio-mem devices.
The merge (default: off) property of a memory backend defines whether that memory should be marked as mergeable for KSM. This currently only applies to private anonymous RAM, including private file-backed memory.
No special virtio-mem considerations apply.
The policy (default: default) property of a memory backend defines the NUMA policy used for that memory. The host-nodes property specifies the host NUMA nodes to use, encoded as a bitmap; for example, node 0 corresponds to the value 1 and node 1 to the value 2.
Supported policies are default, preferred, bind and interleave. Details about the policies can be found in the mbind() documentation.
No special virtio-mem considerations apply.
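The bitmap encoding means that each node n contributes the value 2^n. A hypothetical shell helper (not a QEMU tool, just illustrating the encoding) computes the value for a set of node IDs:

```shell
#!/usr/bin/env bash
# Hypothetical helper: compute the host-nodes bitmap value for a
# list of NUMA node IDs (node 0 -> bit 0, node 1 -> bit 1, ...).
host_nodes_bitmap() {
    local bitmap=0 node
    for node in "$@"; do
        (( bitmap |= 1 << node ))
    done
    echo "$bitmap"
}

host_nodes_bitmap 0     # node 0          -> 1
host_nodes_bitmap 1     # node 1          -> 2
host_nodes_bitmap 0 1   # nodes 0 and 1   -> 3
```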
The file memory backend property mem-path (default: NULL) defines the file path.
The file memory backend property discard-data (default: off) can be used for shared file mappings to make QEMU essentially empty the file on exit; it should be used with care.
The file memory backend properties readonly (default: off), align (default: 0) and pmem (default: off) don't apply to virtio-mem and should not be enabled.
The memfd memory backend property hugetlb (default: off) specifies whether to use huge pages, as an easy alternative to memory-backend-file with files on hugetlbfs mount points. The memfd memory backend property hugetlbsize (default: 0) selects the huge page size.
The memfd memory backend property seal (default: on) defines whether to disallow growing and shrinking the memfd after creation, which is a reasonable thing to have for virtio-mem as well.
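Putting the memfd properties together, a shared shmem-backed setup might look like this (a sketch of the relevant fragments only; as noted above, memfd-based shared memory works reliably only in controlled environments):

```shell
# Sketch: shmem-backed (memfd) memory backend for a virtio-mem device.
qemu-system-x86_64 ... \
  -object memory-backend-memfd,id=vmem0,size=8G,share=on,seal=on,reserve=off \
  -device virtio-mem-pci,id=vm0,memdev=vmem0,requested-size=1G
```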
The HMP/QMP interface of QEMU can be used to query the current size of a virtio-mem device and to trigger a resize request.
The qom-get command can be used to query the size of a virtio-mem device. Similarly, it can be used to query other device properties, although most virtio-mem properties are static at runtime. Instead of a device id, a qom-path can also be supplied.

```
(qemu) qom-get vmem0 size
0
```
The qom-set command can be used to update the requested-size of a virtio-mem device, corresponding to a resize request.

```
(qemu) qom-set vmem0 requested-size 1G
```
info memory-devices / query-memory-devices
The info memory-devices HMP command and the query-memory-devices QMP command can be used to list all defined memory devices, including hotplugged ones. They list various properties of the defined memory devices.

```
(qemu) info memory-devices
Memory device [virtio-mem]: "vmem0"
  memaddr: 0x240000000
  node: 0
  requested-size: 1073741824
  size: 1073741824
  max-size: 8589934592
  block-size: 2097152
  memdev: /objects/mem2
Memory device [virtio-mem]: "vmem1"
  memaddr: 0x440000000
  node: 1
  requested-size: 0
  size: 0
  max-size: 8589934592
  block-size: 2097152
  memdev: /objects/mem3
```
info memory_size_summary / query-memory-size-summary
The info memory_size_summary HMP command and the query-memory-size-summary QMP command can be used to identify how much initial/boot memory ("base") and how much hotplugged memory ("plugged") the virtual machine is currently able to use.

```
(qemu) info memory_size_summary
base memory: 8589934592
plugged memory: 1073741824
```
Note that all virtio-mem memory is always indicated as "plugged" and never as "base" memory. Further, only the actually plugged memory, corresponding to the device size property, is included in the summary.
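The summary values are plain byte counts; converting the numbers from the example above is simple shell arithmetic (a trivial sketch):

```shell
# The summary reports byte counts; e.g. from the output above:
base=8589934592       # "base memory"
plugged=1073741824    # "plugged memory"
echo "base:    $(( base    / 1024 / 1024 / 1024 )) GiB"   # -> base:    8 GiB
echo "plugged: $(( plugged / 1024 / 1024 / 1024 )) GiB"   # -> plugged: 1 GiB
```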
info numa / query-numa
The info numa HMP command and the query-numa QMP command can be used to identify how much initial memory ("base") and how much hotplugged memory ("plugged") the virtual machine is currently able to use per NUMA node.

```
(qemu) info numa
2 nodes
node 0 cpus: 0 1 2 3
node 0 size: 5120 MB
node 0 plugged: 1024 MB
node 1 cpus: 4 5 6 7
node 1 size: 4096 MB
node 1 plugged: 0 MB
```
info balloon / query-balloon
The info balloon HMP command and the query-balloon QMP command can be used to query the logical virtual machine size, corresponding to the virtual machine size minus the balloon size. In the context of memory ballooning, the logical virtual machine size only includes initial memory and DIMMs, not memory provided by virtio-mem devices. Consequently, we cannot really inflate the balloon fully on virtio-mem memory.
Note that this is intended: virtio-mem is not fully compatible with balloon inflation/deflation, because having two mechanisms active to resize virtual machine memory at the same time is not a sane use case. virtio-mem is compatible with free page reporting as implemented by virtio-balloon, to optimize memory overcommit in the hypervisor, though.
MEMORY_DEVICE_SIZE_CHANGE QAPI Event
When the size property of a virtio-mem device changes, QEMU issues a rate-limited QAPI event (MEMORY_DEVICE_SIZE_CHANGE). The event contains:
- The device id, if set.
- The new value of the size property.
- The path to the device object in the QOM tree (since QEMU v6.2).
Let's create a VM with two NUMA nodes and one virtio-mem-pci device each (here, vm0 and vm1). Each virtio-mem-pci device has to be assigned a memory backend (here, vmem0 and vmem1). The size of the memory backend determines the maximum size of a virtio-mem device (here, 8 GiB each). The size of the memory backends has to be accounted for in the maxmem declaration (here, 20 GiB). Setting requested-size to something > 0 tells the guest to directly consume a specific amount of memory via a virtio-mem device (here, 300 MiB and 1 GiB).
```
qemu-kvm \
  -m 4G,maxmem=20G \
  -smp sockets=2,cores=2 \
  -object memory-backend-ram,id=mem0,size=2G \
  -object memory-backend-ram,id=mem1,size=2G \
  -numa node,nodeid=0,cpus=0-1,memdev=mem0 \
  -numa node,nodeid=1,cpus=2-3,memdev=mem1 \
  -machine pc \
  -nographic \
  -nodefaults \
  -chardev stdio,nosignal,id=serial \
  -device isa-serial,chardev=serial \
  -chardev socket,id=monitor,path=/var/tmp/monitor \
  -mon chardev=monitor,mode=readline \
  ... \
  -object memory-backend-ram,id=vmem0,size=8G \
  -device virtio-mem-pci,id=vm0,memdev=vmem0,node=0,requested-size=300M \
  -object memory-backend-ram,id=vmem1,size=8G \
  -device virtio-mem-pci,id=vm1,memdev=vmem1,node=1,requested-size=1G
```
Via the QEMU monitor (HMP) and via QMP, we can query the current size of virtio-mem devices and change the requested size.
$ echo "info memory-devices" | sudo nc -U /var/tmp/monitor QEMU 5.1.92 monitor - type 'help' for more informatio (qemu) info memory-devices Memory device [virtio-mem]: "vm0" memaddr: 0x140000000 node: 0 requested-size: 314572800 size: 314572800 max-size: 8589934592 block-size: 2097152 memdev: /objects/vmem0 Memory device [virtio-mem]: "vm1" memaddr: 0x340000000 node: 1 requested-size: 1073741824 size: 1073741824 max-size: 8589934592 block-size: 2097152 memdev: /objects/vmem1
As the size of both virtio-mem devices is > 0, we know that the guest driver is alive and is making use of virtio-mem provided memory. We can now request to resize virtio-mem devices. Let's grow vm0 to 4GB and shrink vm1 to 256M.
$ echo "qom-set vm0 requested-size 4G" | sudo nc -U /var/tmp/monitor QEMU 5.1.92 monitor - type 'help' for more information (qemu) qom-set vm0 requested-size 4G $ echo "qom-set vm1 requested-size 256M" | sudo nc -U /var/tmp/monitor QEMU 5.1.92 monitor - type 'help' for more information (qemu) qom-set vm1 requested-size 256M $ echo "info memory-devices" | sudo nc -U /var/tmp/monitor QEMU 5.1.92 monitor - type 'help' for more information (qemu) info memory-devices Memory device [virtio-mem]: "vm0" memaddr: 0x140000000 node: 0 requested-size: 4294967296 size: 4294967296 max-size: 8589934592 block-size: 2097152 memdev: /objects/vmem0 Memory device [virtio-mem]: "vm1" memaddr: 0x340000000 node: 1 requested-size: 268435456 size: 268435456 max-size: 8589934592 block-size: 2097152 memdev: /objects/vmem1
If the guest cannot completely fulfill a request (especially, unplug enough memory), it will retry for a while. The current state (size) is always kept up to date. We can also check the current logical size of the VM and its NUMA nodes.
$ echo "info memory_size_summary" | sudo nc -U /var/tmp/monitor QEMU 5.1.92 monitor - type 'help' for more information (qemu) info memory_size_summary base memory: 4294967296 plugged memory: 4563402752 $ echo "info numa" | sudo nc -U /var/tmp/mon_src QEMU 5.1.92 monitor - type 'help' for more information (qemu) info numa 2 nodes node 0 cpus: 0 1 node 0 size: 6144 MB node 0 plugged: 4096 MB node 1 cpus: 2 3 node 1 size: 2304 MB node 1 plugged: 256 MB
It is also possible to hotplug virtio-mem devices later. However, usually, one wants to have a single virtio-mem device per NUMA node. virtio-mem and DIMMs can be mixed, although it's not recommended.
Preparing IOMMU + KVM
Enable IOMMU support on your host. On my AMD CPU, this involved enabling support in the BIOS and adding "amd_iommu=on iommu=pt kvm_amd.npt=1 kvm_amd.avic=1" to my kernel cmdline.
Identifying the PCI devices to pass-through
Identify the device(s) to pass through - e.g., using lspci. In this example, we are using a very old and simple GPU, along with an audio controller. Make sure that all devices belonging to the IOMMU group are used for passthrough.
```
# Identify the device(s) to pass through (here: 05:00.0 and 05:00.1)
$ lspci
...
05:00.0 VGA compatible controller: NVIDIA Corporation GK208B [GeForce GT 710] (rev a1)
05:00.1 Audio device: NVIDIA Corporation GK208 HDMI/DP Audio Controller (rev a1)
...
# Identify the IOMMU group (here: 23)
$ find /sys/kernel/iommu_groups/ -name "*05:00*"
/sys/kernel/iommu_groups/23/devices/0000:05:00.1
/sys/kernel/iommu_groups/23/devices/0000:05:00.0
# Verify that we caught all devices in the IOMMU group
$ ls /sys/kernel/iommu_groups/23/devices/
0000:05:00.0  0000:05:00.1
```
We have to bind all devices we want to pass through to the vfio-pci driver. There are ways to do that on the kernel cmdline when booting up; the critical part is stopping other drivers from binding to the device (e.g., "module_blacklist=nouveau"). In this example, we'll rip the device out of the old driver forcefully (use with care ...) and bind it to vfio-pci.
```
# Unload vfio-pci first; can usually be skipped
$ sudo rmmod vfio-pci
# Force unbinding from the old driver (use with care ...)
$ echo "0000:05:00.0" | sudo tee -a "/sys/bus/pci/devices/0000:05:00.0/driver/unbind"
$ echo "0000:05:00.1" | sudo tee -a "/sys/bus/pci/devices/0000:05:00.1/driver/unbind"
# Load vfio-pci
$ sudo modprobe vfio-pci
# Configure "vfio-pci" for the devices
$ echo "vfio-pci" | sudo tee -a "/sys/bus/pci/devices/0000:05:00.0/driver_override"
$ echo "vfio-pci" | sudo tee -a "/sys/bus/pci/devices/0000:05:00.1/driver_override"
# Trigger driver probing, binding the devices to vfio-pci
$ echo "0000:05:00.0" | sudo tee -a /sys/bus/pci/drivers_probe
$ echo "0000:05:00.1" | sudo tee -a /sys/bus/pci/drivers_probe
```
(Optional) Identify USB keyboard and mouse
We'll be using a Logitech mouse and keyboard by forwarding a Logitech Unifying Receiver. Identify the vendorid (here: 046d) and productid (here: c52b).
```
$ lsusb
...
Bus 003 Device 002: ID 046d:c52b Logitech, Inc. Unifying Receiver
...
```
Example: vfio-pci + virtio-mem (no vIOMMU)
```
qemu-kvm \
  -accel kvm \
  -m 4G,maxmem=20G \
  -smp sockets=2,cores=2 \
  -machine q35 \
  -nographic \
  -nodefaults \
  ... \
  -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
  -device vfio-pci,host=05:00.0,x-vga=on,bus=pci.1,addr=1.0,multifunction=on \
  -device vfio-pci,host=05:00.1,bus=pci.1,addr=1.1 \
  -usb -device usb-host,vendorid=0x046d,productid=0xc52b \
  -object memory-backend-ram,id=vmem0,size=16G \
  -device virtio-mem-pci,id=vm0,memdev=vmem0,requested-size=2G
```
When resizing virtio-mem devices (see the other example), the memory consumption of the VM will adjust accordingly. "x-vga=on" seems to be required for the GPU.
Example: vfio-pci + virtio-mem (vIOMMU)
```
qemu-kvm \
  -accel kvm,kernel-irqchip=split \
  -m 4G,maxmem=20G \
  -smp sockets=2,cores=2 \
  -machine q35 \
  -nographic \
  -nodefaults \
  ... \
  -device intel-iommu,caching-mode=on,intremap=on,device-iotlb=on \
  -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
  -device vfio-pci,host=05:00.0,x-vga=on,bus=pci.1,addr=1.0,multifunction=on \
  -device vfio-pci,host=05:00.1,bus=pci.1,addr=1.1 \
  -usb -device usb-host,vendorid=0x046d,productid=0xc52b \
  -object memory-backend-ram,id=vmem0,size=16G \
  -device virtio-mem-pci,disable-legacy=on,disable-modern=off,iommu_platform=on,ats=on,id=vm0,memdev=vmem0,requested-size=2G
```
More details can be found in the QEMU wiki. Make sure to:
- Enable IOMMU support in your guest ("intel_iommu=on" on the kernel cmdline of your Linux guest). Note that you can use the intel-iommu device in QEMU independently of your CPU vendor.
- Define the intel-iommu device before specifying any other devices.
- Take special care with all virtio devices. (disable-legacy=on,disable-modern=off might not actually be required for virtio-mem-pci, but is for older virtio devices.)
Block Size Limitations
VFIO usually allows for ~64k distinct mappings per VFIO container (here: our two devices, consisting of one IOMMU group). Each memory block of a virtio-mem device (determined via the block-size) requires a distinct mapping. As mappings are shared with other users, let's assume we can use half of that (~32k) for virtio-mem purposes (across all virtio-mem devices). Actual numbers differ per setup.
With a maximum size of 16 GiB for our virtio-mem device in our example and the default block-size of 2 MiB, we'll need up to 8192 mappings.
- The bigger the block-size, the harder it becomes to unplug a lot of memory reliably (ZONE_NORMAL under Linux, Windows TBD). So we want small block sizes (except with ZONE_MOVABLE under Linux).
- The smaller the block-size, the smaller the maximum amount of memory that can be provided via virtio-mem. 32k mappings with a 2 MiB block-size allows for a maximum amount of virtio-mem memory of 64 GiB.
- The smaller the block-size, the more VFIO kernel calls, eventually resulting in a slowdown when hot(un)plugging memory or rebooting (not benchmarked - pinning guest memory is already expensive).
Assume we want to eventually hotplug 256 GiB via virtio-mem. To not exceed 32k mappings, we would have to manually configure the block-size of virtio-mem devices to 8 MiB.
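The arithmetic above can be checked quickly (a sketch; the ~32k mapping budget is the assumption stated above, and actual numbers differ per setup):

```shell
#!/usr/bin/env bash
# Sketch: given a mapping budget and a desired maximum virtio-mem size,
# compute the smallest power-of-two block-size that stays within budget.
min_block_size() {
    local max_bytes=$1 budget=$2
    local bs=$((2 * 1024 * 1024))  # start at the 2 MiB default
    while (( max_bytes / bs > budget )); do
        (( bs *= 2 ))
    done
    echo "$bs"
}

GiB=$((1024 * 1024 * 1024))
min_block_size $((16 * GiB))  32768   # 16 GiB fits with 2 MiB blocks
min_block_size $((256 * GiB)) 32768   # 256 GiB needs 8 MiB blocks
```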
Note: With a vIOMMU, these limits theoretically don't apply, because once the guest is running and using the vIOMMU, we don't have to map all memory blocks of a virtio-mem device -- only what is mapped into the vIOMMU has to be mapped by vfio. However, there are times (e.g., boot, reboot, guest not using the vIOMMU) when the vIOMMU isn't active, and then all VM memory (including all plugged virtio-mem memory) has to be mapped by vfio.
- QEMU does not yet protect unplugged memory. A malicious guest might make use of more memory than desired (however, e.g., migration will break such guests as unplugged memory is never migrated). To be implemented.
- QEMU does not yet implement resizeable memory regions. Specifying huge virtio-mem devices that start out small will result in memory overhead in the hypervisor (page tables, KVM memory slot tracking data, ...) and might require setting the sysctl "vm.overcommit_memory" to 1 ("always overcommit"). The "reserve=off" option for memory backends can be used to avoid the latter.
- virtio-mem devices cannot get unplugged; they can only be requested to resize.
- virtio-mem devices only operate on their assigned memory region/memory backend. All memory one might eventually unplug again (via virtio-mem) later has to be provided to the VM via a virtio-mem device.