User Guide - QEMU

Important Current Limitations

  • virtio-mem devices that start out small and grow large over time will consume metadata in QEMU/KVM for the maximum size right from the start. See below for more details. A fix is in the works (resizeable memory backends / memory regions, using multiple memslots).
  • There is no protection of unplugged memory (similar to virtio-balloon) in QEMU yet. A malicious guest might make use of more memory than requested - although such guests will break when migrating.
  • Some features are not compatible with virtio-mem (yet) and are blocked:
    • vdpa
    • RDMA migration
    • vfio-nvme
    • mlock'ing memory
  • Only anonymous memory is fully supported. Memfd-based shared memory works reliably in controlled environments. Huge pages are not supported yet. There are two main things to fix to support any kind of memory backing that supports sparse memory mappings:

Updates

v6.2

v6.1

Documentation

We can define multiple virtio-mem devices for a virtual machine. Each virtio-mem device belongs to exactly one vNUMA node and is assigned exactly one memory backend.

While virtio-mem devices can be hotplugged, they cannot be hotunplugged for now. Memory hot(un)plug is triggered by requesting to resize a virtio-mem device.

virtio-mem Device

On x86-64 and arm64, the actual virtio-mem device is hidden inside a virtio-mem-pci device. The user defines virtio-mem-pci devices and uses them as if they were virtio-mem devices. The memory backend defines the maximum size of a virtio-mem device and the source+type of the memory that will be provided via the virtio-mem device to the virtual machine.

id

The id (default: NULL) property of a virtio-mem device is an identifier that allows for resizing and monitoring a specific virtio-mem device after creation.

memdev

The memdev (default: NULL) virtio-mem device property specifies the id of the memory backend to assign to a virtio-mem device.

memaddr

The memaddr (default: 0) virtio-mem device property specifies the start address of the virtio-mem device memory region in guest physical address space. If it is 0, it will be auto-assigned.

requested-size

The requested-size (default: 0) virtio-mem device property specifies how much memory we would like the virtual machine to consume via a specific virtio-mem device; it is a request towards the virtual machine, and to which degree the domain is able to fulfill that request is visible via via the size property.

Note that there is a delay between changing the requested-size property and observing a change of the size property.

Note that in some cases, the virtual machine might not be able to fulfill the request at all, or only partially. Shrinking virtio-mem devices in particular can easily fail if no proper care has been taken inside the virtual machine to make memory hotunplug more reliable, such as using ZONE_MOVABLE under Linux.

The requested-size property has to be a multiple of the block-size property and cannot exceed the maximum size as defined by the memory backend. In general, a virtual machine cannot consume more than the requested-size via the virtio-mem device, except when the requested-size is reduced and the virtual machine cannot fulfill the request (completely).
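The two constraints on requested-size can be checked before sending a resize request. The following is only a minimal sketch: the helper names check_requested_size and round_down_to_block are made up for illustration and are not part of QEMU.

```python
MiB = 1024 * 1024
GiB = 1024 * MiB


def check_requested_size(requested_size: int, block_size: int, max_size: int) -> None:
    """Raise ValueError if requested_size violates the documented constraints.

    Hypothetical helper: QEMU performs its own validation when the
    property is set; this only mirrors the two rules from the text.
    """
    if requested_size % block_size != 0:
        raise ValueError("requested-size has to be a multiple of block-size")
    if requested_size > max_size:
        raise ValueError("requested-size exceeds the memory backend size")


def round_down_to_block(size: int, block_size: int) -> int:
    """Return the largest multiple of block_size that does not exceed size."""
    return size - (size % block_size)


# 300 MiB with 2 MiB blocks and an 8 GiB backend passes the check.
check_requested_size(300 * MiB, 2 * MiB, 8 * GiB)

# 301 MiB is not a multiple of 2 MiB; round it down first.
print(round_down_to_block(301 * MiB, 2 * MiB) // MiB)  # 300
```

Rounding down rather than up is a design choice of this sketch: it never asks the virtual machine to consume more than intended.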

Note that usually, the virtual machine will retry regularly to eventually fulfill the requests as well as possible -- like retrying to unplug memory until the resize request has been fully handled.

Changing the requested-size property for a running virtual machine corresponds to a hot(un)plug request.

block-size

The block-size (default: the memory backend page size if the memory backend uses huge pages; otherwise: the THP size) virtio-mem device property specifies the size of the memory blocks within the device memory region that can get hot(un)plugged individually; it corresponds to the hot(un)plug granularity on the hypervisor side.

The block-size must be at least 1 MiB, has to be a power of two, and has to be at least as big as the page size of the assigned memory backend.
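As a rough illustration, the block-size rules can be expressed as a small predicate. This is a sketch, not QEMU code: is_valid_block_size is a hypothetical name, and treating 1 MiB itself as an allowed value is an assumption of this sketch.

```python
MiB = 1024 * 1024


def is_valid_block_size(block_size: int, backend_page_size: int) -> bool:
    """Check a candidate block-size against the three documented rules.

    Assumption: the 1 MiB lower bound is inclusive.
    """
    at_least_minimum = block_size >= 1 * MiB
    # A positive power of two has exactly one bit set.
    power_of_two = block_size > 0 and (block_size & (block_size - 1)) == 0
    covers_backend_page = block_size >= backend_page_size
    return at_least_minimum and power_of_two and covers_backend_page


print(is_valid_block_size(2 * MiB, 4096))      # True
print(is_valid_block_size(3 * MiB, 4096))      # False: not a power of two
print(is_valid_block_size(1 * MiB, 2 * MiB))   # False: smaller than the backend page size
```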

Note that when vfio/mdev is used, the block-size might have to be increased due to limited vfio/mdev mappings: see the block size discussion for details.

Do not specify block-size values smaller than the THP size, unless using huge pages: it is not supported and QEMU will print a warning. In the future, clean support that properly disables THP for the virtio-mem device might be added.

size (read-only)

The size (read-only) virtio-mem device property shows how much memory the virtio-mem device currently provides to the virtual machine ("plugged memory").

The size changes based on resize requests as the virtual machine tries to fulfill a resize request and hot(un)plugs device blocks; however, it also changes due to other events, for example, when rebooting the virtual machine.

node

The node (default: 0) virtio-mem device property specifies the vNUMA node assignment for a virtio-mem device.

prealloc (WIP)

The prealloc virtio-mem device property specifies whether to preallocate memory when processing plug requests from the virtual machine; if preallocation fails, the plug request will be rejected and the virtual machine will continue running unharmed.

The virtual machine will retry processing the memory hotplug request later. Consequently, user errors when handling scarce memory resources, such as running out of huge pages on a specific NUMA node, can be caught and handled gracefully.

Note that preallocation cannot protect from the OOM (Out Of Memory) handler under Linux triggering and killing the process. Special care has to be taken with ordinary anonymous RAM.

When using scarce memory resources, such as huge pages, and in every other setup where we would use prealloc=on for the memory backend with other memory devices like DIMMs, specify prealloc=on for the virtio-mem device instead and specify prealloc=off for the memory backend. Note that this feature is still WIP and not upstream yet.

PCI Device Properties

For virtio-mem-pci devices, the same properties as for other PCI devices apply. Examples include the bus property and the addr property. See the QEMU documentation for details.

Virtio Device Properties

For virtio-mem devices, the same properties as for other virtio devices apply. Examples include the iommu_platform property.

Similarly, for virtio-mem-pci devices, the same properties as for other PCI-based virtio devices apply. Examples include the disable-legacy, the disable-modern and the ats property.

Memory Backend

The memory backend defines the maximum size of a virtio-mem device and the source+type of the memory that will be provided via the virtio-mem device to the virtual machine. virtio-mem relies on a sparse memory backend, exposing a dynamic amount of memory from the memory backend.

Memory backends applicable to virtio-mem are:

  • memory-backend-ram: for anonymous memory; usually with share=off
  • memory-backend-file: for file-backed memory, including hugetlbfs and tmpfs; usually with share=on
  • memory-backend-memfd: for shmem and hugetlb; usually with share=on

id

The id (default: NULL) property of a memory backend is an identifier that allows for assigning a specific memory backend to a specific virtio-mem device.

size

The size (default: 0) property of the memory backend defines the backend memory size. For a virtio-mem device, the memory backend size corresponds to the maximum size the virtio-mem device can provide to the VM.

share

The share (default for memory-backend-memfd: on; otherwise: off) property of a memory backend defines whether we want process-private or shared memory.

Note: Don't use memory-backend-memfd with share=off,hugetlb=off; it can result in double memory consumption.

Note: Use memory-backend-ram with share=on with care; there are only very limited use cases for shared anonymous memory in QEMU.

reserve

The reserve (default: on) property of a memory backend defines whether we want to reserve, depending on the memory backend and if applicable, swap space or huge pages. For example, reservation of swap space is not applicable for ordinary shared file-backed memory but it's always applicable for huge pages.

QEMU will only bail out if reserve=off is specified but reservation cannot be disabled: this can only happen for anonymous and private file-backed memory if the memory overcommit configuration in Linux does not allow for it -- which already conflicts with virtio-mem. Disabling reservation for huge pages cannot fail.

Always specify reserve=off for memory backends assigned to virtio-mem devices.

prealloc

The prealloc (default: off) property of a memory backend defines whether we want to preallocate memory for the whole memory backend when creating it. As virtio-mem relies on sparse memory backends, we don't want to preallocate memory for the whole memory backend. QEMU will discard all memory again when initializing the virtio-mem device, but it will temporarily allocate memory for the whole memory backend, which can result in undesired side effects.

Always specify prealloc=off for memory backends assigned to virtio-mem devices. Specify prealloc=on for the virtio-mem device instead (WIP).

dump

The dump property (default: off; with dump-guest-core=on: on) of a memory backend defines whether that memory will be part of a core dump of the QEMU process. Consequently, a core dump will read all memory of the memory backend, which can have negative effects for sparse memory backends as used by virtio-mem.

Avoid enabling core dumping via -machine dump-guest-core=on, and if enabled, specify dump=off for memory backends assigned to virtio-mem devices. Avoid specifying dump=on.

merge

The merge (default: off) property of a memory backend defines whether that memory should be marked as mergeable for KSM. This currently only applies to private anonymous RAM, including private file-backed memory.

No special virtio-mem considerations apply.

policy and host-nodes

The policy (default: default) property of a memory backend defines the NUMA policy used for that memory. The host-nodes property specifies the NUMA nodes using a bitmap; for example, node 0 corresponds to the value 1 and node 1 to the value 2.

Supported policies are default, preferred, bind and interleave. Details about the policies can be found in the mbind() documentation.

No special virtio-mem considerations apply.

memory-backend-file Properties

The file memory backend property mem-path (default: NULL) defines the file path. The file memory backend property discard-data (default: off) can be used for shared file mappings to make QEMU essentially empty the file on exit; however, it should be used with care.

The file memory backend properties readonly (default: off), align (default: 0) and pmem (default: off) don't apply to virtio-mem and should not be enabled.

memory-backend-memfd Properties

The memfd memory backend property hugetlb (default: off) specifies whether to use huge pages as an easy alternative to memory-backend-file with files on hugetlbfs mount points. The memfd memory backend property hugetlbsize (default: 0) selects the huge page size.

The memfd memory backend property seal (default: on) defines whether to disallow growing and shrinking of the memfd after creation, which is a reasonable thing to have for virtio-mem as well.

HMP/QMP Interface

The HMP/QMP interface of QEMU can be used to query the current size of a virtio-mem device and to trigger a resize request.

qom-get and qom-set

The qom-get command can be used to query the size of a virtio-mem device. Similarly, it can be used to query other device properties, although most virtio-mem properties are static at runtime. A QOM path can be supplied instead of a device id.

(qemu) qom-get vmem0 size
0

The qom-set command can be used to update the requested-size of a virtio-mem device, corresponding to a resize request.

(qemu) qom-set vmem0 requested-size 1G
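The same requests can be sent via QMP as JSON messages. The sketch below only builds the messages and does not talk to a QEMU instance. The helper names are made up for illustration, and the QOM path /machine/peripheral/vmem0 assumes that the device was created with id=vmem0; note that QMP expects the size as a plain number of bytes rather than "1G".

```python
import json

GiB = 1024 ** 3


def qom_get(path: str, prop: str) -> str:
    """Build a QMP qom-get message (hypothetical helper)."""
    return json.dumps({
        "execute": "qom-get",
        "arguments": {"path": path, "property": prop},
    })


def qom_set(path: str, prop: str, value) -> str:
    """Build a QMP qom-set message (hypothetical helper)."""
    return json.dumps({
        "execute": "qom-set",
        "arguments": {"path": path, "property": prop, "value": value},
    })


# Assumed QOM path for a device defined with id=vmem0.
VMEM0 = "/machine/peripheral/vmem0"

print(qom_get(VMEM0, "size"))
print(qom_set(VMEM0, "requested-size", 1 * GiB))
```

A real client additionally has to complete the QMP capabilities negotiation (qmp_capabilities) before sending any of these messages.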

info memory-devices / query-memory-devices

The info memory-devices / query-memory-devices command can be used to list all defined memory devices, including hotplugged ones. It lists various properties of the defined memory devices.

(qemu) info memory-devices
Memory device [virtio-mem]: "vmem0"
  memaddr: 0x240000000
  node: 0
  requested-size: 1073741824
  size: 1073741824
  max-size: 8589934592
  block-size: 2097152
  memdev: /objects/mem2
Memory device [virtio-mem]: "vmem1"
  memaddr: 0x440000000
  node: 1
  requested-size: 0
  size: 0
  max-size: 8589934592
  block-size: 2097152
  memdev: /objects/mem3
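For scripting, the HMP output above can be turned into a dictionary. This is a small sketch that assumes the output format shown above stays stable; parse_memory_devices is a hypothetical helper, and QMP (query-memory-devices) is the better choice for anything long-lived.

```python
import re

SAMPLE = '''\
Memory device [virtio-mem]: "vmem0"
  memaddr: 0x240000000
  node: 0
  requested-size: 1073741824
  size: 1073741824
  max-size: 8589934592
  block-size: 2097152
  memdev: /objects/mem2
Memory device [virtio-mem]: "vmem1"
  memaddr: 0x440000000
  node: 1
  requested-size: 0
  size: 0
  max-size: 8589934592
  block-size: 2097152
  memdev: /objects/mem3
'''

HEADER = re.compile(r'Memory device \[(?P<type>[^\]]+)\]: "(?P<id>[^"]*)"')


def parse_memory_devices(text: str) -> dict:
    """Map device id -> property dict, parsed from 'info memory-devices' output."""
    devices = {}
    current = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            current = {"type": header.group("type")}
            devices[header.group("id")] = current
            continue
        if current is None or ":" not in line:
            continue  # banner, prompt or empty line
        key, value = line.strip().split(":", 1)
        value = value.strip()
        try:
            # Base 0 accepts both decimal and 0x-prefixed numbers.
            current[key] = int(value, 0)
        except ValueError:
            current[key] = value
    return devices


devices = parse_memory_devices(SAMPLE)
for dev_id, props in devices.items():
    print(dev_id, props["size"], "of", props["requested-size"], "bytes plugged")
```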

info memory_size_summary / query-memory-size-summary

The info memory_size_summary / query-memory-size-summary command can be used to identify how much initial/boot memory ("base") and how much hotplugged memory ("plugged") the virtual machine is currently able to use.

(qemu) info memory_size_summary
base memory: 8589934592
plugged memory: 1073741824

Note that all virtio-mem memory is always indicated as "plugged" and not as "base" memory. Further, only the actually plugged memory, corresponding to the device size property, is included in the summary.

info numa / x-query-numa

The info numa / x-query-numa command can be used to identify how much initial memory ("base") and how much hotplugged memory ("plugged") the virtual machine is currently able to use per NUMA node.

(qemu) info numa
2 nodes
node 0 cpus: 0 1 2 3
node 0 size: 5120 MB
node 0 plugged: 1024 MB
node 1 cpus: 4 5 6 7
node 1 size: 4096 MB
node 1 plugged: 0 MB

info balloon / query-balloon

The info balloon / query-balloon command can be used to query the logical virtual machine size, corresponding to the virtual machine size minus the balloon size. In the context of memory ballooning, the logical virtual machine size only includes initial memory and DIMMs, not memory provided by virtio-mem devices. Consequently, we cannot really inflate the balloon fully on virtio-mem memory.

Note that this is intended: virtio-mem is not fully compatible with balloon inflation/deflation, because having two mechanisms active to resize virtual machine memory at the same time is not a sane use case. virtio-mem is compatible with free page reporting as implemented by virtio-balloon, to optimize memory overcommit in the hypervisor, though.

MEMORY_DEVICE_SIZE_CHANGE QAPI Event

Whenever the size property of a virtio-mem device changes, QEMU issues a rate-limited QAPI event. The event contains:

  • The device id, if set.
  • The new value of the size property.
  • The path to the device object in the QOM tree (since QEMU v6.2).
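A management application can react to this event instead of polling the size property. The following sketch parses a single event message; the sample message (including the QOM path and the timestamp) is made up for illustration, and handle_event is a hypothetical helper. The field names "id", "size" and "qom-path" are assumed to match the three items listed above; please verify them against the QAPI schema of your QEMU version.

```python
import json

# Made-up sample event; values are for illustration only.
SAMPLE_EVENT = json.dumps({
    "event": "MEMORY_DEVICE_SIZE_CHANGE",
    "data": {
        "id": "vm0",
        "size": 1073741824,
        "qom-path": "/machine/peripheral/vm0",
    },
    "timestamp": {"seconds": 0, "microseconds": 0},
})


def handle_event(raw: str):
    """Return (id, qom_path, size) for a size-change event, else None."""
    msg = json.loads(raw)
    if msg.get("event") != "MEMORY_DEVICE_SIZE_CHANGE":
        return None
    data = msg["data"]
    # "id" is only present if the device has one; "qom-path" only
    # exists since QEMU v6.2, so both are read with .get().
    return data.get("id"), data.get("qom-path"), data["size"]


print(handle_event(SAMPLE_EVENT))
```

As the event is rate-limited, a consumer should treat the reported size as the latest known value, not as a complete history of all changes.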

NUMA Example

Let's create a VM with two NUMA nodes and one virtio-mem-pci device per node (here, vm0 and vm1). Each virtio-mem-pci device has to be assigned a memory backend (here, vmem0 and vmem1). The size of the memory backend determines the maximum size of a virtio-mem device (here, 8GB each). The sizes of the memory backends have to be accounted for in the maxmem declaration (here, 20GB: 4GB of boot memory plus two times 8GB). Setting requested-size to something > 0 tells the guest to directly consume a specific amount of memory via a virtio-mem device (here, 300M and 1G).

qemu-kvm \
    -m 4G,maxmem=20G \
    -smp sockets=2,cores=2 \
    -object memory-backend-ram,id=mem0,size=2G \
    -object memory-backend-ram,id=mem1,size=2G \
    -numa node,nodeid=0,cpus=0-1,memdev=mem0 \
    -numa node,nodeid=1,cpus=2-3,memdev=mem1 \
    -machine pc \
    -nographic \
    -nodefaults \
    -chardev stdio,nosignal,id=serial \
    -device isa-serial,chardev=serial \
    -chardev socket,id=monitor,path=/var/tmp/monitor \
    -mon chardev=monitor,mode=readline \
    ...
    -object memory-backend-ram,id=vmem0,size=8G \
    -device virtio-mem-pci,id=vm0,memdev=vmem0,node=0,requested-size=300M \
    -object memory-backend-ram,id=vmem1,size=8G \
    -device virtio-mem-pci,id=vm1,memdev=vmem1,node=1,requested-size=1G
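The maxmem value used in this command line follows from simple arithmetic: boot memory plus the maximum size of every virtio-mem device. A minimal sketch, assuming no additional alignment padding or other memory devices are needed (required_maxmem is a hypothetical helper):

```python
GiB = 1024 ** 3


def required_maxmem(boot_memory: int, backend_sizes: list) -> int:
    """Lower bound for maxmem: boot memory plus all virtio-mem backend sizes.

    Assumption: alignment requirements and other memory devices
    (e.g., DIMMs) are ignored here and may require a larger value.
    """
    return boot_memory + sum(backend_sizes)


# -m 4G with two 8G virtio-mem backends (vmem0 and vmem1)
print(required_maxmem(4 * GiB, [8 * GiB, 8 * GiB]) // GiB)  # 20
```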

Via the QEMU monitor (HMP) and via QMP, we can query the current size of virtio-mem devices and change the requested size.

$ echo "info memory-devices" | sudo nc -U /var/tmp/monitor
QEMU 5.1.92 monitor - type 'help' for more information
(qemu) info memory-devices
Memory device [virtio-mem]: "vm0"
  memaddr: 0x140000000
  node: 0
  requested-size: 314572800
  size: 314572800
  max-size: 8589934592
  block-size: 2097152
  memdev: /objects/vmem0
Memory device [virtio-mem]: "vm1"
  memaddr: 0x340000000
  node: 1
  requested-size: 1073741824
  size: 1073741824
  max-size: 8589934592
  block-size: 2097152
  memdev: /objects/vmem1

As the size of both virtio-mem devices is > 0, we know that the guest driver is alive and is making use of virtio-mem provided memory. We can now request to resize virtio-mem devices. Let's grow vm0 to 4GB and shrink vm1 to 256M.

$ echo "qom-set vm0 requested-size 4G" | sudo nc -U /var/tmp/monitor
QEMU 5.1.92 monitor - type 'help' for more information
(qemu) qom-set vm0 requested-size 4G

$ echo "qom-set vm1 requested-size 256M" | sudo nc -U /var/tmp/monitor
QEMU 5.1.92 monitor - type 'help' for more information
(qemu) qom-set vm1 requested-size 256M

$ echo "info memory-devices" | sudo nc -U /var/tmp/monitor
QEMU 5.1.92 monitor - type 'help' for more information
(qemu) info memory-devices
Memory device [virtio-mem]: "vm0"
  memaddr: 0x140000000
  node: 0
  requested-size: 4294967296
  size: 4294967296
  max-size: 8589934592
  block-size: 2097152
  memdev: /objects/vmem0
Memory device [virtio-mem]: "vm1"
  memaddr: 0x340000000
  node: 1
  requested-size: 268435456
  size: 268435456
  max-size: 8589934592
  block-size: 2097152
  memdev: /objects/vmem1

If the guest cannot completely fulfill a request (esp., unplug enough memory), it will retry for a while. The current state (size) is always updated. We can also check the current logical size of the VM / NUMA nodes.

$ echo "info memory_size_summary" | sudo nc -U /var/tmp/monitor
QEMU 5.1.92 monitor - type 'help' for more information
(qemu) info memory_size_summary
base memory: 4294967296
plugged memory: 4563402752

$ echo "info numa" | sudo nc -U /var/tmp/monitor
QEMU 5.1.92 monitor - type 'help' for more information
(qemu) info numa
2 nodes
node 0 cpus: 0 1
node 0 size: 6144 MB
node 0 plugged: 4096 MB
node 1 cpus: 2 3
node 1 size: 2304 MB
node 1 plugged: 256 MB
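The numbers in the two outputs above can be cross-checked against the device sizes reported by info memory-devices. The sketch below just redoes that arithmetic with the values from this example (2 GiB of boot memory per node, vm0 on node 0, vm1 on node 1):

```python
MiB = 1024 ** 2

# Values taken from the example outputs above.
device_size = {"vm0": 4294967296, "vm1": 268435456}   # size property, in bytes
device_node = {"vm0": 0, "vm1": 1}
node_base = {0: 2048 * MiB, 1: 2048 * MiB}            # mem0 and mem1

# "plugged memory" in the summary is the sum of all device sizes.
plugged_total = sum(device_size.values())
print(plugged_total)  # 4563402752

# Per node: "size" is boot memory plus plugged memory of that node.
for node, base in sorted(node_base.items()):
    plugged = sum(size for dev, size in device_size.items()
                  if device_node[dev] == node)
    print(f"node {node} size: {(base + plugged) // MiB} MB")
    print(f"node {node} plugged: {plugged // MiB} MB")
```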

It is also possible to hotplug virtio-mem devices later. However, usually, one wants to have a single virtio-mem device per NUMA node. virtio-mem and DIMMs can be mixed, although it's not recommended.

VFIO (vfio-pci)

Preparing IOMMU + KVM

Enable IOMMU support for your host. On my AMD CPU, this involved enabling support in the BIOS and adding "amd_iommu=on iommu=pt kvm_amd.npt=1 kvm_amd.avic=1" to my kernel cmdline.

Identifying the PCI devices to pass-through

Identify the device(s) to pass through - e.g., using lspci. In this example, we are using a very old and simple GPU, along with an audio controller. Make sure that all devices belonging to the IOMMU group are used for passthrough.

# Identify the device(s) to pass through (here: 05:00.0 and 05:00.1)
$ lspci
...
05:00.0 VGA compatible controller: NVIDIA Corporation GK208B [GeForce GT 710] (rev a1)
05:00.1 Audio device: NVIDIA Corporation GK208 HDMI/DP Audio Controller (rev a1)
...

# Identify the IOMMU group (here: 23)
$ find /sys/kernel/iommu_groups/ -name "*05:00*" 
/sys/kernel/iommu_groups/23/devices/0000:05:00.1
/sys/kernel/iommu_groups/23/devices/0000:05:00.0

# Verify that we caught all devices in the IOMMU group
$ ls /sys/kernel/iommu_groups/23/devices/
0000:05:00.0  0000:05:00.1
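The same lookup can be scripted. The following sketch walks an iommu_groups directory and returns the group of a device together with all devices in that group; find_iommu_group is a hypothetical helper, and the demo runs against a fake directory tree so that it does not depend on the host it is executed on.

```python
import os
import tempfile


def find_iommu_group(bdf: str, root: str = "/sys/kernel/iommu_groups"):
    """Return (group, devices) for the IOMMU group containing bdf, or None.

    bdf is the full PCI address as used in sysfs, e.g. "0000:05:00.0".
    """
    for group in sorted(os.listdir(root)):
        devdir = os.path.join(root, group, "devices")
        if not os.path.isdir(devdir):
            continue
        devices = sorted(os.listdir(devdir))
        if bdf in devices:
            return group, devices
    return None


# Demo on a fake tree that mimics the layout shown above.
with tempfile.TemporaryDirectory() as fake_root:
    for dev in ("0000:05:00.0", "0000:05:00.1"):
        os.makedirs(os.path.join(fake_root, "23", "devices", dev))
    print(find_iommu_group("0000:05:00.0", fake_root))
```

On a real host, call find_iommu_group("0000:05:00.0") without the second argument and make sure every device in the returned list is passed through.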

Preparing vfio-pci

We have to bind all devices we want to pass through to the vfio-pci driver. There are ways to do that on the kernel cmdline when booting up: the critical part is stopping other drivers from binding to the device (e.g., "module_blacklist=nouveau"). In this example, we'll rip out the device from the old driver forcefully (use with care ...) and bind it to vfio-pci.

# Unload vfio-pci first, can usually be skipped
$ sudo rmmod vfio-pci

# Force unbinding from the old driver (use with care ...)
$ echo "0000:05:00.0" | sudo tee -a "/sys/bus/pci/devices/0000:05:00.0/driver/unbind"
$ echo "0000:05:00.1" | sudo tee -a "/sys/bus/pci/devices/0000:05:00.1/driver/unbind"

# Load vfio-pci
$ sudo modprobe vfio-pci

# Configure "vfio-pci" for the devices
$ echo "vfio-pci" | sudo tee -a "/sys/bus/pci/devices/0000:05:00.0/driver_override"
$ echo "vfio-pci" | sudo tee -a "/sys/bus/pci/devices/0000:05:00.1/driver_override"

# Trigger driver-probing, binding the devices to vfio-pci
$ echo "0000:05:00.0" | sudo tee -a /sys/bus/pci/drivers_probe
$ echo "0000:05:00.1" | sudo tee -a /sys/bus/pci/drivers_probe

(Optional) Identify USB keyboard and mouse

We'll be using a Logitech mouse and keyboard by forwarding a Logitech Unifying Receiver. Identify the vendorid (here: 046d) and productid (here: c52b).

$ lsusb
...
Bus 003 Device 002: ID 046d:c52b Logitech, Inc. Unifying Receiver
...

Example: vfio-pci + virtio-mem (no vIOMMU)

qemu-kvm \
    -accel kvm \
    -m 4G,maxmem=20G \
    -smp sockets=2,cores=2 \
    -machine q35 \
    -nographic \
    -nodefaults \
    ...
    -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
    -device vfio-pci,host=05:00.0,x-vga=on,bus=pci.1,addr=1.0,multifunction=on \
    -device vfio-pci,host=05:00.1,bus=pci.1,addr=1.1 \
    -usb -device usb-host,vendorid=0x046d,productid=0xc52b \
    -object memory-backend-ram,id=vmem0,size=16G \
    -device virtio-mem-pci,id=vm0,memdev=vmem0,requested-size=2G

When resizing virtio-mem devices (see the other example), the memory consumption of the VM will adjust accordingly. "x-vga=on" seems to be required for the GPU.

Example: vfio-pci + virtio-mem (vIOMMU)

qemu-kvm \
    -accel kvm,kernel-irqchip=split \
    -m 4G,maxmem=20G \
    -smp sockets=2,cores=2 \
    -machine q35 \
    -nographic \
    -nodefaults \
    ...
    -device intel-iommu,caching-mode=on,intremap=on,device-iotlb=on \
    -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
    -device vfio-pci,host=05:00.0,x-vga=on,bus=pci.1,addr=1.0,multifunction=on \
    -device vfio-pci,host=05:00.1,bus=pci.1,addr=1.1 \
    -usb -device usb-host,vendorid=0x046d,productid=0xc52b \
    -object memory-backend-ram,id=vmem0,size=16G \
    -device virtio-mem-pci,disable-legacy=on,disable-modern=off,iommu_platform=on,ats=on,id=vm0,memdev=vmem0,requested-size=2G

More details can be found in the QEMU wiki. Make sure to:

  • Enable IOMMU support in your guest ("intel_iommu=on" on the kernel cmdline of your Linux guest). Note that you can use the intel-iommu device in QEMU independently of your CPU vendor.
  • Define the intel-iommu device before specifying any other device
  • Take special care of all virtio devices. (disable-legacy=on,disable-modern=off might not actually be required for virtio-mem-pci, but for older virtio devices)

Block Size Limitations

VFIO usually allows for ~64k distinct mappings per VFIO container (here: our two devices, consisting of one IOMMU group). Each memory block of a virtio-mem device (determined via the block-size) requires a distinct mapping. As mappings are shared with other users, let's assume we can use half of that (~32k) for virtio-mem purposes (across all virtio-mem devices). Actual numbers differ per setup.

With a maximum size of 16 GiB for our virtio-mem device in our example and the default block-size of 2 MiB, we'll need up to 8192 mappings.

  • The bigger the block-size, the less likely it gets to unplug a lot of memory reliably (ZONE_NORMAL under Linux, Windows TBD). So we want small block sizes (except with ZONE_MOVABLE under Linux).
  • The smaller the block-size, the smaller the maximum amount of memory that can be provided via virtio-mem. 32k mappings with a 2 MiB block-size allows for a maximum amount of virtio-mem memory of 64 GiB.
  • The smaller the block-size, the more VFIO kernel calls, eventually resulting in a slowdown when hot(un)plugging memory or rebooting (not benchmarked - pinning guest memory is already expensive).

Assume we want to eventually hotplug 256 GiB via virtio-mem. To not exceed 32k mappings, we would have to manually configure the block-size of virtio-mem devices to 8 MiB.
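The numbers in this section can be reproduced with a few lines of arithmetic. The sketch below assumes the ~32k mapping budget from above and doubles the block-size, starting at 2 MiB, until the budget is no longer exceeded; mappings_needed and min_block_size are hypothetical helper names.

```python
MiB = 1024 ** 2
GiB = 1024 ** 3

MAPPING_BUDGET = 32 * 1024  # assumed share of the ~64k VFIO mappings


def mappings_needed(max_size: int, block_size: int) -> int:
    """Worst case: one mapping per block (rounded up)."""
    return -(-max_size // block_size)


def min_block_size(total_size: int, budget: int = MAPPING_BUDGET,
                   start: int = 2 * MiB) -> int:
    """Smallest power-of-two block-size (>= start) that fits the budget."""
    block_size = start
    while mappings_needed(total_size, block_size) > budget:
        block_size *= 2
    return block_size


# 16 GiB device with the default 2 MiB block-size
print(mappings_needed(16 * GiB, 2 * MiB))        # 8192

# Maximum virtio-mem memory with 32k mappings and 2 MiB blocks
print(MAPPING_BUDGET * 2 * MiB // GiB)           # 64

# Block-size required to eventually hotplug 256 GiB
print(min_block_size(256 * GiB) // MiB)          # 8
```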

Note: With a vIOMMU these limits theoretically don't apply, because we don't have to map all memory blocks of a virtio-mem device once the guest is running and using the vIOMMU - only what's mapped into the vIOMMU has to be mapped by vfio. However, there are times (e.g., boot, reboot, guest not using the vIOMMU), when the vIOMMU isn't active - all VM memory (including all plugged virtio-mem memory) has to be mapped by vfio.

QEMU Facts

  • QEMU does not yet protect unplugged memory. A malicious guest might make use of more memory than desired (however, e.g., migration will break such guests as unplugged memory is never migrated). To be implemented.
  • QEMU does not yet implement resizeable memory regions. Specifying huge virtio-mem devices that start out small will result in memory overhead in the hypervisor (page tables, KVM memory slot tracking data, ...) and might require setting the sysctl "vm.overcommit_memory" to 1 ("Always overcommit"). The "reserve=off" option for memory backends can be used to avoid the latter.
  • virtio-mem devices cannot get unplugged, they can only be requested to be resized.
  • virtio-mem devices only operate on their assigned memory region/memory backend. All memory one might eventually unplug again (via virtio-mem) later has to be provided to the VM via a virtio-mem device.
