Over the past months I have participated in discussions about out-of-process
device emulation. This post describes the requirements that have become
apparent. I hope it will be a useful guide to the big picture of
out-of-process device emulation.
What is out-of-process device emulation?
Device emulation is traditionally implemented in the program that executes
guest code. This approach is natural because accesses to device registers are
trapped as part of the CPU run loop that sits at the core of an emulator or
virtual machine monitor (VMM).
In some use cases it is advantageous to perform device emulation in separate
processes. For example, software-defined network switches can minimize data
copies by emulating network cards directly in the switch process.
Out-of-process device emulation also enables privilege separation and tighter
sandboxing for security.
Why are these requirements important?
When emulated devices are implemented in the VMM they use common VMM APIs.
Adding new devices is relatively easy because the APIs are already there and
the developer can focus on the device specifics. Out-of-process device
emulation potentially leaves developers without APIs since the device emulation
program is a separate program that literally starts from main().
Developers want to focus on implementing their specific device, not on solving
general problems related to out-of-process device emulation infrastructure.
It is not only a lot of work to implement an out-of-process device
completely from scratch, but there is also a risk of developing the wrong
solution because some subtleties of device emulation are not obvious at first
glance.
I hope sharing these requirements will help in the creation of common
infrastructure so it's easy to implement high-quality out-of-process
devices.
Not all use cases have the full set of requirements. Therefore it's best if
requirements are addressed in separate, reusable libraries so that device
implementors can pick the ones that are relevant to them.
Device emulation
Device resources
Devices provide resources that drivers interact with such as hardware
registers, memory, or interrupts. The fundamental requirement of out-of-process
device emulation is exposing device resources.
The following types of device resources are needed:
Synchronous MMIO/PIO accesses
The most basic device emulation operation is the hardware register access.
This is a memory-mapped I/O (MMIO) or programmed I/O (PIO) access to the
device. A read loads a value from a device register. A write stores a value to
a device register. These operations are synchronous because the vCPU is paused
until completion.
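
To make the synchronous model concrete, here is a minimal sketch of a register dispatch table. The `RegisterBank` class, its method names, and the convention that unmapped reads return all ones are assumptions for illustration, not part of any existing protocol or library.

```python
from typing import Callable, Dict, Tuple

ReadFn = Callable[[], int]
WriteFn = Callable[[int], None]


class RegisterBank:
    """Hypothetical 32-bit register file for one MMIO/PIO region."""

    def __init__(self) -> None:
        # Maps a register offset to its (read handler, write handler) pair.
        self._regs: Dict[int, Tuple[ReadFn, WriteFn]] = {}

    def add_register(self, offset: int, read: ReadFn, write: WriteFn) -> None:
        self._regs[offset] = (read, write)

    def read(self, offset: int) -> int:
        # The vCPU stays paused until this value is returned to the VMM.
        if offset not in self._regs:
            return 0xFFFFFFFF  # assumption: unmapped reads return all ones
        return self._regs[offset][0]() & 0xFFFFFFFF

    def write(self, offset: int, value: int) -> None:
        # Also synchronous: the VMM waits for the handler to finish.
        if offset in self._regs:
            self._regs[offset][1](value & 0xFFFFFFFF)


# Usage: a single control register at offset 0.
state = {"ctrl": 0}
bank = RegisterBank()
bank.add_register(0x0, lambda: state["ctrl"], lambda v: state.update(ctrl=v))
bank.write(0x0, 0x1)
ctrl_value = bank.read(0x0)
```

Because every access blocks the vCPU, handlers like these should return quickly and leave long-running work to an asynchronous path.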
Asynchronous doorbells
Devices often have doorbell registers, allowing the driver to inform the
device that new requests are ready for processing. The vCPU does not need to
wait since the access is a posted write.
The kvm.ko ioeventfd mechanism can be used to implement asynchronous
doorbells.
Shared device memory
Devices may have memory-like regions that the CPU can access (such as PCI
Memory BARs). The device emulation process therefore needs to share a region of
its memory space with the VMM so the guest can access it. This mechanism also
allows device emulation to busy wait (poll) instead of using synchronous
MMIO/PIO accesses or asynchronous doorbells for notifications.
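
As a rough illustration of sharing and polling, the sketch below maps one file twice. A temporary file stands in for memory that would really be passed between processes as a file descriptor, and the layout (a 32-bit producer index at offset 0) is a made-up example.

```python
import mmap
import struct
import tempfile

REGION_SIZE = 4096

# Assumption: a temporary file stands in for the shared memory object.
backing = tempfile.TemporaryFile()
backing.truncate(REGION_SIZE)

vmm_view = mmap.mmap(backing.fileno(), REGION_SIZE)     # side mapped into the guest
device_view = mmap.mmap(backing.fileno(), REGION_SIZE)  # device emulation side

# The "guest" publishes a new producer index with an ordinary memory store.
struct.pack_into("<I", vmm_view, 0, 7)


def poll_producer_index(view: mmap.mmap, last_seen: int, max_spins: int = 1000) -> int:
    """Busy wait until the index changes or the spin budget runs out."""
    for _ in range(max_spins):
        (index,) = struct.unpack_from("<I", view, 0)
        if index != last_seen:
            return index
    return last_seen


new_index = poll_producer_index(device_view, last_seen=0)
```

Polling trades CPU time for latency, so a real device would typically bound the spin and fall back to a doorbell when idle.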
Direct Memory Access (DMA)
Devices often require read and write access to a memory address space
belonging to the CPU. This allows network cards to transmit packet payloads
that are located in guest RAM, for example.
Early out-of-process device emulation interfaces simply shared guest RAM. This allowed DMA to any guest physical memory address. More advanced IOMMU and address space identifier mechanisms are now becoming ubiquitous. Therefore, new out-of-process device emulation interfaces should incorporate IOMMU functionality.
The key requirement for IOMMU mechanisms is allowing the VMM to grant access to a region of memory so the device emulation process can read from and/or write to it.
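
The grant model can be sketched as a table of regions that the device consults before every DMA access. The `DmaRegion` and `DmaMap` names and fields are hypothetical and only illustrate the idea of translating a device-visible address while enforcing read/write permissions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DmaRegion:
    iova: int       # start address as seen by the device
    size: int       # length in bytes
    offset: int     # offset into the shared memory backing the region
    readable: bool
    writable: bool


class DmaMap:
    """Hypothetical table of regions the VMM has granted to the device."""

    def __init__(self) -> None:
        self._regions: List[DmaRegion] = []

    def grant(self, region: DmaRegion) -> None:
        self._regions.append(region)

    def revoke(self, iova: int) -> None:
        self._regions = [r for r in self._regions if r.iova != iova]

    def translate(self, iova: int, length: int, write: bool) -> int:
        """Return the backing offset, or raise if the access was not granted."""
        for r in self._regions:
            if r.iova <= iova and iova + length <= r.iova + r.size:
                if (write and not r.writable) or (not write and not r.readable):
                    raise PermissionError("DMA access violates region permissions")
                return r.offset + (iova - r.iova)
        raise LookupError("DMA to unmapped address")


dma = DmaMap()
dma.grant(DmaRegion(iova=0x1000, size=0x1000, offset=0, readable=True, writable=False))
read_offset = dma.translate(0x1800, 16, write=False)
```

Once a region is revoked, any later access to it fails the lookup, which is what lets the VMM withdraw access at runtime.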
Interrupts
Devices notify the CPU using interrupts. An interrupt is simply a message
sent by the device emulation process to the VMM. Interrupt configuration is
flexible on modern devices, meaning the driver may be able to select the number
of interrupts and how event sources map to them (for example, sharing one
interrupt between multiple event sources).
This can be implemented using the Linux eventfd mechanism or via in-band device
emulation protocol messages, for example.
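
Here is a small sketch of driver-configurable interrupt routing. The `InterruptRouter` class is hypothetical, and the `notify` callback is a stand-in for whatever actually delivers the message to the VMM (an eventfd write or a protocol message).

```python
from typing import Callable, Dict


class InterruptRouter:
    """Hypothetical mapping from device event sources to interrupt vectors."""

    def __init__(self, num_vectors: int, notify: Callable[[int], None]) -> None:
        self._num_vectors = num_vectors
        self._notify = notify  # stand-in for the message sent to the VMM
        self._routes: Dict[str, int] = {}

    def route(self, source: str, vector: int) -> None:
        """Called when the driver programs the interrupt mapping."""
        if not 0 <= vector < self._num_vectors:
            raise ValueError("vector out of range")
        self._routes[source] = vector

    def raise_event(self, source: str) -> bool:
        vector = self._routes.get(source)
        if vector is None:
            return False  # assumption: unrouted events are dropped
        self._notify(vector)
        return True


sent = []
router = InterruptRouter(num_vectors=2, notify=sent.append)
router.route("rx", 0)
router.route("tx", 0)      # two event sources share vector 0
router.route("config", 1)
for source in ("rx", "tx", "config"):
    router.raise_event(source)
```

When sources share a vector, the driver has to inspect device state to find out which source fired.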
Extensibility for new bus types
It should be possible to support multiple bus types. vhost-user only
supports vhost devices. VFIO is more extensible but currently focussed on PCI
devices. It is likely that QEMU SysBus devices will be desirable for
implementing ad-hoc out-of-process devices (especially for System-on-Chip
target platforms).
Bus-level APIs, not protocol bindings
Developers should not need to learn the out-of-process device emulation
protocol (vfio-user, etc). APIs should focus on bus-level concepts such as
defining VIRTIO or PCI devices rather than protocol bindings for dealing with
protocol messages, file descriptor passing, and shared memory.
In other words, developers should be thinking in terms of the problem
domain, not worrying about how out-of-process device emulation is implemented.
The protocol should be hidden behind bus-level APIs.
Multi-threading support from the beginning
Threading issues arise often in device emulation because asynchronous
requests or multi-queue devices can be implemented using threads. Therefore it
is necessary to clearly document what threading models are supported and how
device lifecycle operations like reset interact with in-flight requests.
Live migration, live upgrade, and crash recovery
There are several related issues around device state and restarting the
device emulation program without disrupting the guest.
Live migration
Live migration transfers the state of a device from one device emulation process to another (typically running on another host). This requires the following functionality:
Quiescing the device
Some devices can be live migrated at any point in time without any
preparation, while others must be put into a quiescent state to avoid
issues. An example is a storage controller that has a write request in
flight. It is not safe to live migrate until the write request has completed
or been canceled. Failure to wait might result in data corruption if the write
takes effect after the destination has resumed execution.
Therefore it is necessary to quiesce a device. After this point there is no
further device activity and no guest-visible changes will be made by the
device.
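
One way to implement quiescing is to count in-flight requests and refuse new ones once quiescing has started. The sketch below uses a hypothetical `RequestTracker` class. It waits for requests to complete, and canceling them would be an alternative that is not shown.

```python
import threading
from typing import Optional


class RequestTracker:
    """Hypothetical in-flight request counter used to quiesce a device."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._in_flight = 0
        self._quiesced = False

    def begin(self) -> bool:
        """Start a request; returns False once the device is quiescing."""
        with self._cond:
            if self._quiesced:
                return False
            self._in_flight += 1
            return True

    def end(self) -> None:
        with self._cond:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._cond.notify_all()

    def quiesce(self, timeout: Optional[float] = None) -> bool:
        """Stop accepting requests and wait for in-flight ones to finish."""
        with self._cond:
            self._quiesced = True
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout)

    def resume(self) -> None:
        with self._cond:
            self._quiesced = False


tracker = RequestTracker()
tracker.begin()                              # a write request is in flight
timer = threading.Timer(0.05, tracker.end)   # it completes a little later
timer.start()
drained = tracker.quiesce(timeout=5.0)
accepted_after = tracker.begin()
timer.join()
```

After `quiesce()` returns True, nothing is in flight and new requests are rejected until `resume()` is called, which matches the "no further device activity" requirement.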
Saving/loading device state
It must be possible to save and load device state. Device state includes the
contents of hardware registers as well as device-internal state necessary for
resuming operation.
It is typically necessary to determine whether the device emulation
processes on the migration source and destination are compatible before
attempting migration. This avoids migration failure when the destination tries
to load the device state and discovers it doesn't support it. It may be
desirable to support loading device state that was generated by a different
implementation of the same device type (for example, two virtio-net
implementations).
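
A minimal sketch of saving and loading state with a compatibility check follows. The JSON encoding, the `example-nic` device type, and the format numbering are invented for illustration. The point is only that the destination can inspect a small header and refuse incompatible state before migration is attempted.

```python
import json
from typing import Dict, Tuple

DEVICE_TYPE = "example-nic"   # hypothetical device type name
STATE_FORMAT = 1              # format written by this implementation
SUPPORTED_FORMATS = {1}       # formats this implementation can load


def save_state(regs: Dict[str, int], internal: Dict[str, int]) -> bytes:
    """Serialize register contents plus device-internal state."""
    return json.dumps({
        "device_type": DEVICE_TYPE,
        "format": STATE_FORMAT,
        "regs": regs,
        "internal": internal,
    }).encode()


def check_compatible(blob: bytes) -> bool:
    """Decide whether this implementation can load the given state."""
    header = json.loads(blob)
    return (header.get("device_type") == DEVICE_TYPE
            and header.get("format") in SUPPORTED_FORMATS)


def load_state(blob: bytes) -> Tuple[Dict[str, int], Dict[str, int]]:
    if not check_compatible(blob):
        raise ValueError("incompatible device state")
    state = json.loads(blob)
    return state["regs"], state["internal"]


blob = save_state({"ctrl": 1}, {"pending_irq": 0})
regs, internal = load_state(blob)
```

Keying compatibility on a device type and format number rather than on the program name is what would allow two different implementations of the same device type to exchange state.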
Dirty memory logging
Pre-copy live migration starts with an iterative phase where dirty memory
pages are copied from the migration source to the destination host. Devices
need to participate in dirty memory logging so that all written pages are
transferred to the destination and no pages are "missed".
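
Dirty logging on the device side amounts to recording every page touched by a DMA write. Here is a sketch using a bitmap with one bit per 4 KiB page. The `DirtyLog` class and the page size are assumptions, and a real interface would also define how the bitmap reaches the VMM.

```python
from typing import List

PAGE_SIZE = 4096  # assumption: 4 KiB pages


class DirtyLog:
    """Hypothetical dirty page bitmap, one bit per guest page."""

    def __init__(self, mem_size: int) -> None:
        self._mem_size = mem_size
        self._num_pages = (mem_size + PAGE_SIZE - 1) // PAGE_SIZE
        self._bitmap = bytearray((self._num_pages + 7) // 8)

    def mark(self, addr: int, length: int) -> None:
        """Record a DMA write of `length` bytes starting at `addr`."""
        if length <= 0:
            return
        if addr < 0 or addr + length > self._mem_size:
            raise ValueError("write outside logged memory")
        first = addr // PAGE_SIZE
        last = (addr + length - 1) // PAGE_SIZE
        for page in range(first, last + 1):
            self._bitmap[page // 8] |= 1 << (page % 8)

    def fetch_and_clear(self) -> List[int]:
        """Return dirty page numbers and reset the log for the next round."""
        dirty = [p for p in range(self._num_pages)
                 if self._bitmap[p // 8] & (1 << (p % 8))]
        self._bitmap = bytearray(len(self._bitmap))
        return dirty


log = DirtyLog(16 * PAGE_SIZE)
log.mark(4090, 10)   # a 10-byte write straddling pages 0 and 1
first_round = log.fetch_and_clear()
second_round = log.fetch_and_clear()
```

A write that crosses a page boundary must dirty both pages, which is the kind of detail that causes "missed" pages when each device implements logging by hand.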
Crash recovery
If the device emulation process crashes it should be possible to restart it
and resume device emulation without disrupting the guest (aside from a possible
pause during reconnection).
Doing this requires maintaining device state (contents of hardware
registers, etc) outside the device emulation process. This way the state
remains even if the process crashes and can be resumed when a new process
starts.
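
To illustrate keeping state outside the process, the sketch below stores two registers in a memory-mapped file. The file is only a stand-in: in practice the state would more likely live in memory owned by the VMM or another long-lived process. The `PersistentRegs` class and the register layout are hypothetical.

```python
import mmap
import os
import struct
import tempfile
from typing import Tuple

REG_FORMAT = "<II"                       # hypothetical layout: ctrl, status
REG_SIZE = struct.calcsize(REG_FORMAT)


class PersistentRegs:
    """Register contents kept in a mapping that outlives the process."""

    def __init__(self, path: str) -> None:
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        try:
            if os.fstat(fd).st_size < REG_SIZE:
                os.ftruncate(fd, REG_SIZE)   # first start: zeroed registers
            self._map = mmap.mmap(fd, REG_SIZE)
        finally:
            os.close(fd)                     # the mapping keeps its own reference

    def store(self, ctrl: int, status: int) -> None:
        struct.pack_into(REG_FORMAT, self._map, 0, ctrl, status)

    def load(self) -> Tuple[int, int]:
        return struct.unpack_from(REG_FORMAT, self._map, 0)

    def close(self) -> None:
        self._map.close()


path = os.path.join(tempfile.mkdtemp(), "regs.state")
first = PersistentRegs(path)
first.store(0x1, 0x80)
first.close()                    # simulate the process going away

second = PersistentRegs(path)    # the restarted process picks up the state
recovered = second.load()
second.close()
```

The sketch ignores the hard part, which is keeping the external state consistent if the crash happens halfway through an update.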
Live upgrade
It must be possible to upgrade the device emulation process and the VMM
without disrupting the guest. Upgrading the device emulation process is similar
to crash recovery in that the process terminates and a new one resumes with the
previous state.
Device versioning
The guest-visible aspects of the device must be versioned. In the simplest
case the device emulation program would have a --compat-version=N
command-line option that controls which version of the device the guest
sees. When guest-visible changes are made to the program the version number
must be increased.
Giving the user control over guest-visible device behavior makes it possible to
save/load and live migrate reliably. Otherwise loading device state in a newer
device emulation program could affect the running guest. Guest drivers
typically are not prepared for the device to change underneath them and doing
so could result in guest crashes or data corruption.
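
A sketch of how a device emulation program might honor `--compat-version=N` follows. The version numbers and feature names are made up. The idea is simply that the guest-visible feature set is a pure function of the selected version.

```python
import argparse
from typing import FrozenSet, List

# Hypothetical: guest-visible features exposed at each device version.
FEATURES_BY_VERSION = {
    1: frozenset({"basic"}),
    2: frozenset({"basic", "multiqueue"}),
}
LATEST_VERSION = max(FEATURES_BY_VERSION)


def parse_args(argv: List[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="example device emulation program")
    parser.add_argument(
        "--compat-version",
        type=int,
        choices=sorted(FEATURES_BY_VERSION),
        default=LATEST_VERSION,
        help="guest-visible device version to present",
    )
    return parser.parse_args(argv)


def guest_visible_features(version: int) -> FrozenSet[str]:
    """Only features of the requested version are shown to the guest."""
    return FEATURES_BY_VERSION[version]


args = parse_args(["--compat-version=1"])
features = guest_visible_features(args.compat_version)
```

A newer program started with `--compat-version=1` then looks identical to the old one from the guest's point of view.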
Security
The trust model
The VMM must not trust the device emulation program. This is key to
implementing privilege separation and the principle of least privilege. If a
compromised device emulation program is able to gain control of the VMM then
out-of-process device emulation has failed to provide isolation between
devices.
The device emulation program must not trust the VMM to the extent that this
is possible. For example, it must validate inputs so that the VMM cannot gain
control of the device emulation process through memory corruptions or other
bugs. This way, even if the VMM has been compromised, access to
device resources and associated system calls still requires further
compromising the device emulation process.
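
Input validation could look like the following sketch, which parses a message from the peer and rejects anything malformed before acting on it. The header layout, command numbers, and size limit are invented and do not correspond to any real protocol.

```python
import struct
from typing import Tuple

HEADER = struct.Struct("<HHI")   # hypothetical: command, flags, payload length
MAX_PAYLOAD = 4096               # assumption: upper bound on payload size
KNOWN_COMMANDS = {1, 2}          # hypothetical command numbers


def parse_message(buf: bytes) -> Tuple[int, int, bytes]:
    """Validate an untrusted message and return (command, flags, payload)."""
    if len(buf) < HEADER.size:
        raise ValueError("short header")
    command, flags, length = HEADER.unpack_from(buf)
    if command not in KNOWN_COMMANDS:
        raise ValueError("unknown command")
    if length > MAX_PAYLOAD:
        raise ValueError("payload too large")
    if len(buf) - HEADER.size != length:
        raise ValueError("length field does not match payload")
    return command, flags, buf[HEADER.size:]


message = HEADER.pack(1, 0, 4) + b"abcd"
command, flags, payload = parse_message(message)
```

The same checks apply in both directions, since neither side should trust length fields or command numbers supplied by the other.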
Unprivileged operation
The device emulation program should run unprivileged to the extent that this
is possible. If special permissions are required to access hardware resources
then these resources can sometimes be provided via file descriptor passing by a
more privileged parent process.
Sandboxing
Operating system sandboxing mechanisms can be applied to device emulation
processes more effectively than to monolithic VMMs. Seccomp can limit the Linux
system calls that may be invoked. SELinux can restrict access to system
resources.
Sandboxing is a common task that most device emulation programs need.
Therefore it is a good candidate for a library or launcher tool that is shared
by device emulation programs.
Management
Command-line interface
A common command-line interface should be defined where possible. For
example, vhost-user's standard --socket-path=PATH argument makes it
easy to launch any vhost-user device backend. Protocol-specific options (e.g.
socket path) and device type-specific options (e.g. virtio-net) can be
standardized.
Some options are necessarily specific to the device emulation program and
therefore cannot be standardized.
The advantage of standard options is that management tools like libvirt can
launch the device emulation programs without further user configuration.
RPC interface
It may be necessary to issue commands at runtime. Examples include adjusting
throttling limits, enabling/disabling logging, etc. These operations can be
performed over an RPC interface.
Various RPC interfaces are used throughout open source virtualization
software. Adopting a widely-used RPC protocol and standardizing commands is
beneficial because it makes the software easy to communicate with and lets
management tools support it with little effort.
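
As a sketch of what runtime commands might look like, the handler below dispatches JSON requests for the two examples mentioned above. The wire format, the command names, and the `DeviceControl` class are hypothetical. A real implementation would adopt an existing RPC protocol rather than invent one.

```python
import json
from typing import Any, Callable, Dict


class DeviceControl:
    """Hypothetical runtime control interface for a device emulation program."""

    def __init__(self) -> None:
        self.iops_limit = 0      # 0 means unlimited in this sketch
        self.logging = False
        self._commands: Dict[str, Callable[..., Dict[str, Any]]] = {
            "set-throttle": self._set_throttle,
            "set-logging": self._set_logging,
        }

    def _set_throttle(self, iops: int) -> Dict[str, Any]:
        if not isinstance(iops, int) or isinstance(iops, bool) or iops < 0:
            raise ValueError("iops must be a non-negative integer")
        self.iops_limit = iops
        return {}

    def _set_logging(self, enabled: bool) -> Dict[str, Any]:
        if not isinstance(enabled, bool):
            raise ValueError("enabled must be a boolean")
        self.logging = enabled
        return {}

    def handle(self, line: str) -> str:
        """Process one JSON request and return a JSON reply."""
        try:
            request = json.loads(line)
            handler = self._commands[request["command"]]
            result = handler(**request.get("arguments", {}))
            return json.dumps({"return": result})
        except (ValueError, KeyError, TypeError) as exc:
            # Malformed JSON, unknown commands, and bad arguments all land here.
            return json.dumps({"error": str(exc)})


control = DeviceControl()
reply = control.handle('{"command": "set-throttle", "arguments": {"iops": 500}}')
```

Standardizing the command names and argument shapes across device emulation programs matters more than the specific encoding.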
Conclusion
This was largely a brain dump but I hope it is useful food for thought as
out-of-process device emulation interfaces are designed and developed. There is
a lot more to it than simply implementing a protocol for device register
accesses and guest RAM DMA. Developing open source libraries in Rust and C that
can be used as needed will ensure that out-of-process devices are high-quality
and easy for users to deploy.