Stefan Hajnoczi: internals

Showing posts with label internals. Show all posts

Wednesday, March 22, 2023

How to debug stuck VIRTIO devices in QEMU

Every once in a while a bug comes along where a guest hangs while communicating with a QEMU VIRTIO device. In this blog post I'll share some debugging approaches that can help QEMU developers who are trying to understand why a VIRTIO device is stuck.

There are a number of reasons why communication with a VIRTIO device might cease, so it helps to identify the nature of the hang:

Did the QEMU device see the requests that the guest driver submitted?
Did the QEMU device complete the request?
Did the guest driver see the requests that the device completed?

The case I will talk about is when QEMU itself is still responsive (the QMP/HMP monitor works) and the guest may or may not be responsive.

Finding requests that are stuck

There is a QEMU monitor command to inspect virtqueues called x-query-virtio-queue-status (QMP) and info virtio-queue-status (HMP). This is a quick way to extract information about a virtqueue from QEMU.

This command allows us to answer the question of whether the QEMU device completed its requests. The shadow_avail_idx and used_idx values in the output are the Available Ring index and Used Ring index, respectively. When they are equal the device has completed all requests. When they are not equal there are still requests in flight and the request must be stuck inside QEMU.

Here is a little more background on the index values. Remember that VIRTIO Split Virtqueues have an Available Ring index and a Used Ring index. The Available Ring index is incremented by the driver whenever it submits a request. The Used Ring index is incremented by the device whenever it completes a request. If the Available Ring index is equal to the Used Ring index then all requests have been completed.

Note that shadow_avail_idx is not the vring Available Ring index in guest RAM but just the last cached copy that the device saw. That means we cannot tell if there are new requests that the device hasn't seen yet. We need to take another approach to figure that out.

Finding requests that the device has not seen yet

Maybe the device has not seen new requests recently and this is why the guest is stuck. That can happen if the device is not receiving Buffer Available Notifications properly (normally this is done by reading a virtqueue kick ioeventfd, also known as a host notifier in QEMU).

We cannot use QEMU monitor commands here, but attaching the GDB debugger to QEMU will allow us to peak at the Available Ring index in guest RAM. The following GDB Python script loads the Available Ring index for a given VirtQueue:

$ cat avail-idx.py
import gdb

# ADDRESS is the address of a VirtQueue struct
vq = gdb.Value(ADDRESS).cast(gdb.lookup_type('VirtQueue').pointer())
avail_idx = vq['vring']['caches']['avail']['ptr'].cast(uint16_type.pointer())[1]
if avail_idx != vq['shadow_avail_idx']:
  print('Device has not seen all available buffers: avail_idx {} shadow_avail_idx {} in {}'.format(avail_idx, vq['shadow_avail_idx'], vq.dereference()))

You can run the script using the source avail-idx.py GDB command. Finding the address of the virtqueue depends on the type of device that you are debugging.

Finding completions that the guest has not seen

If requests are not stuck inside QEMU and the device has seen the latest request, then the guest driver might have missed the Used Buffer Notification from the device (normally an interrupt handler or polling loop inside the guest detects completed requests).

In VIRTIO the driver's current index in the Used Ring is not visible to the device. This means we have no general way of knowing whether the driver has seen completions. However, there is a cool trick for modern devices that have the VIRTIO_RING_F_EVENT_IDX feature enabled.

The trick is that the Linux VIRTIO driver code updates the Used Event Index every time a completed request is popped from the virtqueue. So if we look at the Used Event Index we know the driver's index into the Used Ring and can find out whether it has seen request completions.

The following GDB Python script loads the Used Event Index for a given VirtQueue:

$ cat used-event-idx.py
import gdb

# ADDRESS is the address of a VirtQueue struct
vq = gdb.Value(ADDRESS).cast(gdb.lookup_type('VirtQueue').pointer())
used_event = vq['vring']['caches']['avail']['ptr'].cast(uint16_type.pointer())[2 + vq['vring']['num']]
if used_event != vq['used_idx']:
  print('Driver has not seen all used buffers: used_event {} used_idx {} in {}'.format(used_event, vq['used_idx'], vq.dereference()))

You can run the script using the source avail-idx.py GDB command. Finding the address of the virtqueue depends on the type of device that you are debugging.

Conclusion

I hope this helps anyone who has to debug a VIRTIO device that seems to have gotten stuck.

Friday, October 9, 2020

Requirements for out-of-process device emulation

Over the past months I have participated in discussions about out-of-process device emulation. This post describes the requirements that have become apparent. I hope this will be a useful guide to understanding the big picture about out-of-process device emulation.

What is out-of-process device emulation?

Device emulation is traditionally implemented in the program that executes guest code. This approach is natural because accesses to device registers are trapped as part of the CPU run loop that sits at the core of an emulator or virtual machine monitor (VMM).

In some use cases it is advantageous to perform device emulation in separate processes. For example, software-defined network switches can minimize data copies by emulating network cards directly in the switch process. Out-of-process device emulation also enables privilege separation and tighter sandboxing for security.

Why are these requirements important?

When emulated devices are implemented in the VMM they use common VMM APIs. Adding new devices is relatively easy because the APIs are already there and the developer can focus on the device specifics. Out-of-process device emulation potentially leaves developers without APIs since the device emulation program is a separate program that literally starts from main(). Developers want to focus on implementing their specific device, not on solving general problems related to out-of-process device emulation infrastructure.

It is not only a lot of work to implement an out-of-process device completely from scratch, but there is also a risk of developing the wrong solution because some subtleties of device emulation are not obvious at first glance.

I hope sharing these requirements will help in the creation of common infrastructure so it's easy to implement high-quality out-of-process devices.

Not all use cases have the full set of requirements. Therefore it's best if requirements are addressed in separate, reusable libraries so that device implementors can pick the ones that are relevant to them.

Device emulation

Device resources

Devices provide resources that drivers interact with such as hardware registers, memory, or interrupts. The fundamental requirement of out-of-process device emulation is exposing device resources.

The following types of device resources are needed:

Synchronous MMIO/PIO accesses

The most basic device emulation operation is the hardware register access. This is a memory-mapped I/O (MMIO) or programmed I/O (PIO) access to the device. A read loads a value from a device register. A write stores a value to a device register. These operations are synchronous because the vCPU is paused until completion.

Asynchronous doorbells

Devices often have doorbell registers, allowing the driver to inform the device that new requests are ready for processing. The vCPU does not need to wait since the access is a posted write.

The kvm.ko ioeventfd mechanism can be used to implement asynchronous doorbells.

Shared device memory

Devices may have memory-like regions that the CPU can access (such as PCI Memory BARs). The device emulation process therefore needs to share a region of its memory space with the VMM so the guest can access it. This mechanism also allows device emulation to busy wait (poll) instead of using synchronous MMIO/PIO accesses or asynchronous doorbells for notifications.

Direct Memory Access (DMA)

Devices often require read and write access to a memory address space belonging to the CPU. This allows network cards to transmit packet payloads that are located in guest RAM, for example.

Early out-of-process device emulation interfaces simply shared guest RAM. The allowed DMA to any guest physical memory address. More advanced IOMMU and address space identifier mechanisms are now becoming ubiquitous. Therefore, new out-of-process device emulation interfaces should incorporate IOMMU functionality.

The key requirement for IOMMU mechanisms is allowing the VMM to grant access to a region of memory so the device emulation process can read from and/or write to it.

Interrupts

Devices notify the CPU using interrupts. An interrupt is simply a message sent by the device emulation process to the VMM. Interrupt configuration is flexible on modern devices, meaning the driver may be able to select the number of interrupts and a mapping (using one interrupt with multiple event sources). This can be implemented using the Linux eventfd mechanism or via in-band device emulation protocol messages, for example.

Extensibility for new bus types

It should be possible to support multiple bus types. vhost-user only supports vhost devices. VFIO is more extensible but currently focussed on PCI devices. It is likely that QEMU SysBus devices will be desirable for implementing ad-hoc out-of-process devices (especially for System-on-Chip target platforms).

Bus-level APIs, not protocol bindings

Developers should not need to learn the out-of-process device emulation protocol (vfio-user, etc). APIs should focus on bus-level concepts such as defining VIRTIO or PCI devices rather than protocol bindings for dealing with protocol messages, file descriptor passing, and shared memory.

In other words, developers should be thinking in terms of the problem domain, not worrying about how out-of-process device emulation is implemented. The protocol should be hidden behind bus-level APIs.

Multi-threading support from the beginning

Threading issues arise often in device emulation because asynchronous requests or multi-queue devices can be implemented using threads. Therefore it is necessary to clearly document what threading models are supported and how device lifecycle operations like reset interact with in-flight requests.

Live migration, live upgrade, and crash recovery

There are several related issues around device state and restarting the device emulation program without disrupting the guest.

Live migration

Live migration transfers the state of a device from one device emulation process to another (typically running on another host). This requires the following functionality:

Quiescing the device

Some devices can be live migrated at any point in time without any preparation, while others must be put into a quiescent state to avoid issues. An example is a storage controller that has a write request in flight. It is not safe to live migration until the write request has completed or been canceled. Failure to wait might result in data corruption if the write takes effect after the destination has resumed execution.

Therefore it is necessary to quiesce a device. After this point there is no further device activity and no guest-visible changes will be made by the device.

Saving/loading device state

It must be possible to save and load device state. Device state includes the contents of hardware registers as well as device-internal state necessary for resuming operation.

It is typically necessary to determine whether the device emulation processes on the migration source and destination are compatible before attempting migration. This avoids migration failure when the destination tries to load the device state and discovers it doesn't support it. It may be desirable to support loading device state that was generated by a different implementation of the same device type (for example, two virtio-net implementations).

Dirty memory logging

Pre-copy live migration starts with an iterative phase where dirty memory pages are copied from the migration source to the destination host. Devices need to participate in dirty memory logging so that all written pages are transferred to the destination and no pages are "missed".

Crash recovery

If the device emulation process crashes it should be possible to restart it and resume device emulation without disrupting the guest (aside from a possible pause during reconnection).

Doing this requires maintaining device state (contents of hardware registers, etc) outside the device emulation process. This way the state remains even if the process crashes and it can be resume when a new process starts.

Live upgrade

It must be possible to upgrade the device emulation process and the VMM without disrupting the guest. Upgrading the device emulation process is similar to crash recovery in that the process terminates and a new one resumes with the previous state.

Device versioning

The guest-visible aspects of the device must be versioned. In the simplest case the device emulation program would have a --compat-version=N command-line option that controls which version of the device the guest sees. When guest-visible changes are made to the program the version number must be increased.

By giving control of the guest-visible device behavior it is possible to save/load and live migrate reliably. Otherwise loading device state in a newer device emulation program could affect the running guest. Guest drivers typically are not prepared for the device to change underneath them and doing so could result in guest crashes or data corruption.

Security

The trust model

The VMM must not trust the device emulation program. This is key to implementing privilege separation and the principle of least privilege. If a compromised device emulation program is able to gain control of the VMM then out-of-process device emulation has failed to provide isolation between devices.

The device emulation program must not trust the VMM to the extent that this is possible. For example, it must validate inputs so that the VMM cannot gain control of the device emulation process through memory corruptions or other bugs. This makes it so that even if the VMM has been compromised, access to device resources and associated system calls still requires further compromising the device emulation process.

Unprivileged operation

The device emulation program should run unprivileged to the extent that this is possible. If special permissions are required to access hardware resources then these resources can sometimes be provided via file descriptor passing by a more privileged parent process.

Sandboxing

Operating system sandboxing mechanisms can be applied to device emulation processes more effectively than monolithic VMMs. Seccomp can limit the Linux system calls that may be invoked. SELinux can restrict access to system resources.

Sandboxing is a common task that most device emulation programs need. Therefore it is a good candidate for a library or launcher tool that is shared by device emulation programs.

Management

Command-line interface

A common command-line interface should be defined where possible. For example, vhost-user's standard --socket-path=PATH argument makes it easy to launch any vhost-user device backend. Protocol-specific options (e.g. socket path) and device type-specific options (e.g. virtio-net) can be standardized.

Some options are necessarily specific to the device emulation program and therefore cannot be standardized.

The advantage of standard options is that management tools like libvirt can launch the device emulation programs without further user configuration.

RPC interface

It may be necessary to issue commands at runtime. Examples include adjusting throttling limits, enabling/disabling logging, etc. These operations can be performed over an RPC interface.

Various RPC interfaces are used throughout open source virtualization software. Adopting a widely-used RPC protocol and standardizing commands is beneficial because it makes it easy to communicate with the software and management tools can support them relatively easily.

Conclusion

This was largely a brain dump but I hope it is useful food for thought as out-of-process device emulation interfaces are designed and developed. There is a lot more to it than simply implementing a protocol for device register accesses and guest RAM DMA. Developing open source libraries in Rust and C that can be used as needed will ensure that out-of-process devices are high-quality and easy for users to deploy.

Sunday, September 27, 2020

On unifying vhost-user and VIRTIO

The recent development of a Linux driver framework called VIRTIO Data Path Acceleration (vDPA) has laid the groundwork for exciting new vhost-user features. The implications of vDPA have not yet rippled through the community so I want to share my thoughts on how the vhost-user protocol can take advantage of new ideas from vDPA.

This post is aimed at developers and assumes familiarity with the vhost-user protocol and VIRTIO. No knowledge of vDPA is required.

vDPA helps with writing drivers for hardware that supports VIRTIO offload. Its goal is to enable hybrid hardware/software VIRTIO devices, but as a nice side-effect it has overcome limitations in the kernel vhost interface. It turns out that applying ideas from vDPA to the vhost-user protocol solves the same issues there. In this article I'll show how extending the vhost-user protocol with vDPA has the following benefits:

Allows existing VIRTIO device emulation code to act as a vhost-user device backend.
Removes the need for shim devices in the virtual machine monitor (VMM).
Replaces undocumented conventions with a well-defined device model.

These things can be done while reusing existing vhost-user and VIRTIO code. In fact, this is especially good news for existing codebases like QEMU because they already have a wealth of vhost-user and VIRTIO code that can now finally be connected together!

Let's look at the advantages of extending vhost-user with vDPA first and then discuss how to do it.

Why extend vhost-user with vDPA?

Reusing VIRTIO emulation code for vhost-user backends

It is a common misconception that a vhost device is a VIRTIO device. VIRTIO devices are defined in the VIRTIO specification and consist of a configuration space, virtqueues, and a device lifecycle that includes feature negotiation. A vhost device is a subset of the corresponding VIRTIO device. The exact subset depends on the device type, and some vhost devices are closer to the full functionality of their corresponding VIRTIO device than others. The most well-known example is that vhost-net devices have rx/tx virtqueues and but lack the virtio-net control virtqueue. Also, the configuration space and device lifecycle are only partially available to vhost devices.

This difference makes it impossible to use a VIRTIO device as a vhost-user device and vice versa. There is an impedance mismatch and missing functionality. That's a shame because existing VIRTIO device emulation code is mature and duplicating it to provide vhost-user backends creates additional work.

If there was a way to reuse existing VIRTIO device emulation code it would be easier to move to a multi-process architecture in QEMU. Want to run --netdev user,id=netdev0 --device virtio-net-pci,netdev=netdev0 in a separate, sandboxed process? Easy, run it as a vhost-user-net device instead of as virtio-net.

Making VMM device shims optional

Today each new vhost device type requires a shim device in the VMM. QEMU has --device vhost-user-blk-pci, --device vhost-user-input-pci, and so on. Why can't there be a single --device vhost-user device?

This limitation is a consequence of the fact that vhost devices are not full VIRTIO devices. In fact, a vhost device does not even have a way to report its device type (net, blk, scsi, etc). Therefore it is impossible for today's VMMs to offer a generic device. Each vhost device type requires a shim device.

In some cases a shim device is desirable because it allows the VMM to handle some aspects of the device instead of passing everything through to the vhost device backend. But requiring shims by default involves lots of tedious boilerplate code and prevents new device types from being used by older VMMs.

Providing a well-defined device model in vhost-user

Although vhost-user works well for users, it is difficult for developers to learn and extend. The protocol does not have a well-defined device model. Each device type has its own undocumented set of protocol messages that are used. For example, the vhost-user-blk device uses the configuration space whereas most other device types do not use the configuration space at all.

Since protocol use is not fully documented in the specification, developers might resort to reading Linux, QEMU, and DPDK code in order to figure out how to make their devices work. They typically have to debug vhost-user protocol messages and adjust their code until it appears to work. Hopefully the critical bugs are caught before the code ships. This is problematic because it's hard to produce high-quality vhost-user implementations.

Although the protocol specification can certainly be cleaned up, the problem is more fundamental. vhost-user badly needs a well-defined device model so that protocol usage is clear and uniform for all device types. The best way to do that is to simply adopt the VIRTIO device model. The VIRTIO specification already defines the device lifecycle and details of the device types. By making vhost-user devices full VIRTIO devices there is no need for additional vhost device specifications. The vhost-user specification just becomes a transport for the established VIRTIO device model. Luckily that is effectively what vDPA has done for kernel vhost ioctls.

How to do this in QEMU

The following QEMU changes are needed to implement vhost-user vDPA support. Below I will focus on vhost-user-net but most of the work is generic and benefits all device types.

Import vDPA ioctls into vhost-user

vDPA extends the Linux vhost ioctl interface. It uses a subset of vhost ioctls and adds new vDPA-specific ioctls that are implemented in the vhost_vdpa.ko kernel module. These new ioctls enable the full VIRTIO device model, including device IDs, the status register, configuration space, and so on.

In theory vhost-user could be fixed without vDPA, but it would involve effectively adding the same set of functionality that vDPA has already added onto kernel vhost. Reusing the vDPA ioctls allows VMMs to support both kernel vDPA and vhost-user with minimal code duplication.

This can be done by adding a VHOST_USER_PROTOCOL_F_VDPA feature bit to the vhost-user protocol. If both the vhost-user frontend and backend support vDPA then all vDPA messages are available. Otherwise they can either fall back on legacy vhost-user behavior or drop the connection.

The vhost-user specification could be split into a legacy section and a modern vDPA-enabled section. The modern protocol will remove vhost-user messages that are not needed by vDPA, simplifying the protocol for new developers while allowing existing implementations to support both with minimal changes.

One detail is that vDPA does not use the memory table mechanism for sharing memory. Instead it relies on the richer IOMMU message family that is optional in vhost-user today. This approach can be adopted in vhost-user too, making the IOMMU code path standard for all implementations and dropping the memory table mechanism.

Add vhost-user vDPA to the vhost code

QEMU's hw/virtio/vhost*.c code supports kernel vhost, vhost-user, and kernel vDPA. A vhost-user vDPA mode must be added to implement the new protocol. It can be implemented as a combination of the vhost-user and kernel vDPA code already in QEMU. Most likely the existing vhost-user code can simply be extended to enable vDPA support if the backend supports it.

Only small changes to hw/net/virtio-net.c and net/vhost-user.c are needed to use vhost-user vDPA with net devices. At that point QEMU can connect to a vhost-user-net device backend and use vDPA extensions.

Add a vhost-user vDPA VIRTIO transport

Next a vhost-user-net device backend can be put together using QEMU's virtio-net emulation. A translation layer is needed between the vhost-user protocol and the VirtIODevice type in QEMU. This can be done by implementing a new VIRTIO transport alongside the existing pci, mmio, and ccw transports. The transport processes vhost-user protocol messages from the UNIX domain socket and invokes VIRTIO device emulation code inside QEMU. It acts as a VIRTIO bus so that virtio-net-device, virtio-blk-device, and other device types can be plugged in.

This piece is the most involved but the vhost-user protocol communication part was already implemented in the virtio-vhost-user prototype that I worked on previously. Most of the communication code can be reused and the remainder is implementing the VirtioBusClass interface.

To summarize, a new transport acts as the vhost-user device backend and invokes QEMU VirtIODevice methods in response to vhost-user protocol messages. The syntax would be something like --device virtio-net-device,id=virtio-net0 --device vhost-user-backend,device=virtio-net0,addr.type=unix,addr.path=/tmp/vhost-user-net.sock.

Where this converges with multi-process QEMU

At this point QEMU can run ad-hoc vhost-user backends using existing VIRTIO device models. It is possible to go further by creating a qemu-dev launcher executable that implements the vhost-user spec's "Backend program conventions". This way a minimal device emulator executable hosts the device instead of a full system emulator.

The requirements for this are similar to the multi-process QEMU effort, which aims to run QEMU devices as separate processes. One of the main open questions is how to design build system and Kconfig support for building minimal device emulator executables.

In the case of vhost-user-net the qemu-dev-vhost-user-net executable would contain virtio-net-device, vhost-user-backend, any netdevs the user wishes to include, a QMP monitor, and a vhost-user backend command-line interface.

Where does this leave us? QEMU's existing VIRTIO device models can be used as vhost-user devices and run in a separate processes from the VMM. It's a great way of reusing code and having the flexibility to deploy it in the way that makes most sense for the intended use case.

Conclusion

The vhost-user protocol can be simplified by adopting the vhost vDPA ioctls that have recently been introduced in Linux. This turns vhost-user into a VIRTIO transport and vhost-user devices become full VIRTIO devices. Existing VIRTIO device emulation code can then be reused in vhost-user device backends.

Monday, August 24, 2020

QEMU Internals: Event loops

This post explains event loops in QEMU v5.1.0 and their unique features compared to other event loop implementations. The APIs are not covered in detail here since they are explained in doc comments. Instead, the focus is on the big picture and why things work the way they do.

Event loops are central to many I/O-bound applications like network services and graphical desktop applications. QEMU also has I/O-bound work that fits well into an event loop. Examples include the QMP monitor, disk I/O, and timers.

An event loop monitors event sources for activity and invokes a callback function when an event occurs. This makes it possible to process multiple event sources within a single CPU thread. The application can appear to do multiple things at once without multithreading because it switches between handling different event sources. This architecture is common in Javascript, Python Twisted/asyncio, and many other environments. Sometimes the event loop is hidden underneath coroutines or async/await language features (QEMU has coroutines but often the event loop is still used directly).

The most important event sources in QEMU are:

File descriptors such as sockets and character devices.
Event notifiers (implemented as eventfds on Linux).
Timers for delayed function execution.
Bottom-halves (BHs) for invoking a function in another thread or deferring a function call to avoid reentrancy.

Event loops and threads

QEMU has several different types of threads:

vCPU threads that execute guest code and perform device emulation synchronously with respect to the vCPU.
The main loop that runs the event loops (yes, there is more than one!) used by many QEMU components.
IOThreads that run event loops for device emulation concurrently with vCPUs and "out-of-band" QMP monitor commands.

It's worth explaining how device emulation interacts with threads. When guest code accesses a device register the vCPU thread traps the access and dispatches it to an emulated device. The device's read/write function runs in the vCPU thread. The vCPU thread cannot resume guest code execution until the device's read/write function returns. This means long-running operations like emulating a timer chip or disk I/O cannot be performed synchronously in the vCPU thread since they would block the vCPU. Most devices solve this problem using the main loop thread's event loops. They add timer or file descriptor monitoring callbacks to the main loop and return back to guest code execution. When the timer expires or the file descriptor becomes ready the callback function runs in the main loop thread. The final part of emulating a guest timer or disk access therefore runs in the main loop thread and not a vCPU thread.

Some devices perform the guest device register access in the main loop thread or an IOThread thanks to ioeventfd. ioeventfd is a Linux KVM API and also has a userspace fallback implementation for TCG that traps vCPU device accesses and writes to a file descriptor so another thread can handle the access.

The key point is that vCPU threads do not run an event loop. The main loop thread and IOThreads run event loops. vCPU threads can add event sources to the main loop or IOThread event loops. Callbacks run in the main loop thread or IOThreads.

How the main loop and IOThreads differ

The main loop and IOThreads share some code but are fundamentally different. The common code is called AioContext and is QEMU's native event loop API. Commonly-used functions include aio_set_fd_handler(), aio_set_event_handler(), aio_timer_init(), and aio_bh_new().

The main loop actually has a glib GMainContext and two AioContext event loops. QEMU components can use any of these event loop APIs and the main loop combines them all into a single event loop function os_host_main_loop_wait() that calls qemu_poll_ns() to wait for event sources. This makes it possible to combine glib-based code with code using the native QEMU AioContext APIs.

The reason why the main loop has two AioContexts is because one, called iohandler_ctx, is used to implement older qemu_set_fd_handler() APIs whose handlers should not run when the other AioContext, called qemu_aio_context, is run using aio_poll(). The QEMU block layer and newer code uses qemu_aio_context while older code uses iohandler_ctx. Over time it may be possible to unify the two by converting iohandler_ctx handlers to safely execute in qemu_aio_context.

IOThreads have an AioContext and a glib GMainContext. The AioContext is run using the aio_poll() API, which enables the advanced features of the event loop. If a glib event loop is needed then the GMainContext can be run using g_main_loop_run() and the AioContext event sources will be included.

Code that relies on the AioContext aio_*() APIs will work with both the main loop and IOThreads. Older code using qemu_*() APIs only works with the main loop. glib code works with both the main loop and IOThreads.

The key difference between the main loop and IOThreads is that the main loop uses a traditional event loop that calls qemu_poll_ns() while IOThreads AioContext aio_poll() has advanced features that result in better performance.

AioContext features

AioContext has the following event loop features that traditional event loops do not have:

Adaptive polling support for lower latency but slightly higher CPU consumption. AioContext event sources can have a userspace polling function that detects events without performing syscalls (e.g. peeking at a memory location). This allows the event loop to avoid block syscalls that might lead the host kernel scheduler to yield the thread and put the physical CPU into a low power state. Keeping the CPU busy and avoiding entering the kernel minimizes latency.
O(1) time complexity with respect to the number of monitored file descriptors. When there are thousands of file descriptors O(n) APIs like poll(2) spend time scanning over all file descriptors, even those that have no activity. This scalability bottleneck can be avoided with Linux io_uring and epoll APIs, both of which are supported by AioContext aio_poll(2).
Nanosecond timers. glib's event loop only has millisecond timers, which is not sufficient for emulating hardware timers.

These features are required for performance reasons. Unfortunately glib's event loop does not support them, otherwise QEMU could use GMainContext as its only event loop.

Conclusion

QEMU uses both its native AioContext event loop and glib's GMainContext. The QEMU main loop and IOThreads work differently, with IOThreads offering the best performance thanks to its AioContext aio_poll() event loop. Modern QEMU code should use AioContext APIs for optimal performance and so that the code can be used in both the main loop and IOThreads.

Wednesday, September 7, 2011

QEMU Internals: vhost architecture

This post explains how vhost provides in-kernel virtio devices for KVM. I have been hacking on vhost-scsi and have answered questions about ioeventfd, irqfd, and vhost recently, so I thought this would be a useful QEMU Internals post.

Vhost overview

The vhost drivers in Linux provide in-kernel virtio device emulation. Normally the QEMU userspace process emulates I/O accesses from the guest. Vhost puts virtio emulation code into the kernel, taking QEMU userspace out of the picture. This allows device emulation code to directly call into kernel subsystems instead of performing system calls from userspace.

The vhost-net driver emulates the virtio-net network card in the host kernel. Vhost-net is the oldest vhost device and the only one which is available in mainline Linux. Experimental vhost-blk and vhost-scsi devices have also been developed.

In Linux 3.0 the vhost code lives in drivers/vhost/. Common code that is used by all devices is in drivers/vhost/vhost.c. This includes the virtio vring access functions which all virtio devices need in order to communicate with the guest. The vhost-net code lives in drivers/vhost/net.c.

The vhost driver model

The vhost-net driver creates a /dev/vhost-net character device on the host. This character device serves as the interface for configuring the vhost-net instance.

When QEMU is launched with -netdev tap,vhost=on it opens /dev/vhost-net and initializes the vhost-net instance with several ioctl(2) calls. These are necessary to associate the QEMU process with the vhost-net instance, prepare for virtio feature negotiation, and pass the guest physical memory mapping to the vhost-net driver.

During initialization the vhost driver creates a kernel thread called vhost-$pid, where $pid is the QEMU process pid. This thread is called the "vhost worker thread". The job of the worker thread is to handle I/O events and perform the device emulation.

In-kernel virtio emulation

Vhost does not emulate a complete virtio PCI adapter. Instead it restricts itself to virtqueue operations only. QEMU is still used to perform virtio feature negotiation and live migration, for example. This means a vhost driver is not a self-contained virtio device implementation, it depends on userspace to handle the control plane while the data plane is done in-kernel.

The vhost worker thread waits for virtqueue kicks and then handles buffers that have been placed on the virtqueue. In vhost-net this means taking packets from the tx virtqueue and transmitting them over the tap file descriptor.

File descriptor polling is also done by the vhost worker thread. In vhost-net the worker thread wakes up when packets come in over the tap file descriptor and it places them into the rx virtqueue so the guest can receive them.

Vhost as a userspace interface

One surprising aspect of the vhost architecture is that it is not tied to KVM in any way. Vhost is a userspace interface and has no dependency on the KVM kernel module. This means other userspace code, like libpcap, could in theory use vhost devices if they find them convenient high-performance I/O interfaces.

When a guest kicks the host because it has placed buffers onto a virtqueue, there needs to be a way to signal the vhost worker thread that there is work to do. Since vhost does not depend on the KVM kernel module they cannot communicate directly. Instead vhost instances are set up with an eventfd file descriptor which the vhost worker thread watches for activity. The KVM kernel module has a feature known as ioeventfd for taking an eventfd and hooking it up to a particular guest I/O exit. QEMU userspace registers an ioeventfd for the VIRTIO_PCI_QUEUE_NOTIFY hardware register access which kicks the virtqueue. This is how the vhost worker thread gets notified by the KVM kernel module when the guest kicks the virtqueue.

On the return trip from the vhost worker thread to interrupting the guest a similar approach is used. Vhost takes a "call" file descriptor which it will write to in order to kick the guest. The KVM kernel module has a feature called irqfd which allows an eventfd to trigger guest interrupts. QEMU userspace registers an irqfd for the virtio PCI device interrupt and hands it to the vhost instance. This is how the vhost worker thread can interrupt the guest.

In the end the vhost instance only knows about the guest memory mapping, a kick eventfd, and a call eventfd.

Where to find out more

Here are the main points to begin exploring the code:

drivers/vhost/vhost.c - common vhost driver code
drivers/vhost/net.c - vhost-net driver
virt/kvm/eventfd.c - ioeventfd and irqfd

The QEMU userspace code shows how to initialize the vhost instance:

hw/vhost.c - common vhost initialization code
hw/vhost_net.c - vhost-net initialization

Wednesday, March 9, 2011

QEMU Internals: Big picture overview

Last week I started the QEMU Internals series to share knowledge of how QEMU works. I dove straight in to the threading model without a high-level overview. I want to go back and provide the big picture so that the details of the threading model can be understood more easily.

The story of a guest

A guest is created by running the qemu program, also known as qemu-kvm or just kvm. On a host that is running 3 virtual machines there are 3 qemu processes:

When a guest shuts down the qemu process exits. Reboot can be performed without restarting the qemu process for convenience although it would be fine to shut down and then start qemu again.

Guest RAM

Guest RAM is simply allocated when qemu starts up. It is possible to pass in file-backed memory with -mem-path such that hugetlbfs can be used. Either way, the RAM is mapped in to the qemu process' address space and acts as the "physical" memory as seen by the guest:

QEMU supports both big-endian and little-endian target architectures so guest memory needs to be accessed with care from QEMU code. Endian conversion is performed by helper functions instead of accessing guest RAM directly. This makes it possible to run a target with a different endianness from the host.

KVM virtualization

KVM is a virtualization feature in the Linux kernel that lets a program like qemu safely execute guest code directly on the host CPU. This is only possible when the target architecture is supported by the host CPU. Today KVM is available on x86, ARMv8, ppc, s390, and MIPS CPUs.

In order to execute guest code using KVM, the qemu process opens /dev/kvm and issues the KVM_RUN ioctl. The KVM kernel module uses hardware virtualization extensions found on modern Intel and AMD CPUs to directly execute guest code. When the guest accesses a hardware device register, halts the guest CPU, or performs other special operations, KVM exits back to qemu. At that point qemu can emulate the desired outcome of the operation or simply wait for the next guest interrupt in the case of a halted guest CPU.

The basic flow of a guest CPU is as follows:

open("/dev/kvm")
ioctl(KVM_CREATE_VM)
ioctl(KVM_CREATE_VCPU)
for (;;) {
     ioctl(KVM_RUN)
     switch (exit_reason) {
     case KVM_EXIT_IO:  /* ... */
     case KVM_EXIT_HLT: /* ... */
     }
}

The host's view of a running guest

The host kernel schedules qemu like a regular process. Multiple guests run alongside without knowledge of each other. Applications like Firefox or Apache also compete for the same host resources as qemu although resource controls can be used to isolate and prioritize qemu.

Since qemu system emulation provides a full virtual machine inside the qemu userspace process, the details of what processes are running inside the guest are not directly visible from the host. One way of understanding this is that qemu provides a slab of guest RAM, the ability to execute guest code, and emulated hardware devices; therefore any operating system (or no operating system at all) can run inside the guest. There is no ability for the host to peek inside an arbitrary guest.

Guests have a so-called vcpu thread per virtual CPU. A dedicated iothread runs a select(2) event loop to process I/O such as network packets and disk I/O completion. For more details and possible alternate configuration, see the threading model post.

The following diagram illustrates the qemu process as seen from the host:

Further information

Hopefully this gives you an overview of QEMU and KVM architecture. Feel free to leave questions in the comments and check out other QEMU Internals posts for details on these aspects of QEMU.

Here are two presentations on KVM architecture that cover similar areas if you are interested in reading more:

Jan Kiszka's Linux Kongress 2010 presentation on the Architecture of the Kernel-based Virtual Machine (KVM). Very good material.
My own attempt at presenting a KVM Architecture Overview from 2010.

Saturday, March 5, 2011

QEMU Internals: Overall architecture and threading model

This is the first post in a series on QEMU Internals aimed at developers. It is designed to share knowledge of how QEMU works and make it easier for new contributors to learn about the QEMU codebase.

Running a guest involves executing guest code, handling timers, processing I/O, and responding to monitor commands. Doing all these things at once requires an architecture capable of mediating resources in a safe way without pausing guest execution if a disk I/O or monitor command takes a long time to complete. There are two popular architectures for programs that need to respond to events from multiple sources:

Parallel architecture splits work into processes or threads that can execute simultaneously. I will call this threaded architecture.
Event-driven architecture reacts to events by running a main loop that dispatches to event handlers. This is commonly implemented using the select(2) or poll(2) family of system calls to wait on multiple file descriptors.

QEMU actually uses a hybrid architecture that combines event-driven programming with threads. It makes sense to do this because an event loop cannot take advantage of multiple cores since it only has a single thread of execution. In addition, sometimes it is simpler to write a dedicated thread to offload one specific task rather than integrate it into an event-driven architecture. Nevertheless, the core of QEMU is event-driven and most code executes in that environment.

The event-driven core of QEMU

An event-driven architecture is centered around the event loop which dispatches events to handler functions. QEMU's main event loop is main_loop_wait() and it performs the following tasks:

Waits for file descriptors to become readable or writable. File descriptors play a critical role because files, sockets, pipes, and various other resources are all file descriptors. File descriptors can be added using qemu_set_fd_handler().
Runs expired timers. Timers can be added using qemu_mod_timer().
Runs bottom-halves (BHs), which are like timers that expire immediately. BHs are used to avoid reentrancy and overflowing the call stack. BHs can be added using qemu_bh_schedule().

When a file descriptor becomes ready, a timer expires, or a BH is scheduled, the event loop invokes a callback that responds to the event. Callbacks have two simple rules about their environment:

No other core code is executing at the same time so synchronization is not necessary. Callbacks execute sequentially and atomically with respect to other core code. There is only one thread of control executing core code at any given time.
No blocking system calls or long-running computations should be performed. Since the event loop waits for the callback to return before continuing with other events, it is important to avoid spending an unbounded amount of time in a callback. Breaking this rule causes the guest to pause and the monitor to become unresponsive.

This second rule is sometimes hard to honor and there is code in QEMU which blocks. In fact there is even a nested event loop in qemu_aio_wait() that waits on a subset of the events that the top-level event loop handles. Hopefully these violations will be removed in the future by restructuring the code. New code almost never has a legitimate reason to block and one solution is to use dedicated worker threads to offload long-running or blocking code.

Offloading specific tasks to worker threads

Although many I/O operations can be performed in a non-blocking fashion, there are system calls which have no non-blocking equivalent. Furthermore, sometimes long-running computations simply hog the CPU and are difficult to break up into callbacks. In these cases dedicated worker threads can be used to carefully move these tasks out of core QEMU.

One example user of worker threads is posix-aio-compat.c, an asynchronous file I/O implementation. When core QEMU issues an aio request it is placed on a queue. Worker threads take requests off the queue and execute them outside of core QEMU. They may perform blocking operations since they execute in their own threads and do not block the rest of QEMU. The implementation takes care to perform necessary synchronization and communication between worker threads and core QEMU.

Another example is ui/vnc-jobs-async.c which performs compute-intensive image compression and encoding in worker threads.

Since the majority of core QEMU code is not thread-safe, worker threads cannot call into core QEMU code. Simple utilities like qemu_malloc() are thread-safe but that is the exception rather than the rule. This poses a problem for communicating worker thread events back to core QEMU.

When a worker thread needs to notify core QEMU, a pipe or a qemu_eventfd() file descriptor is added to the event loop. The worker thread can write to the file descriptor and the callback will be invoked by the event loop when the file descriptor becomes readable. In addition, a signal must be used to ensure that the event loop is able to run under all circumstances. This approach is used by posix-aio-compat.c and makes more sense (especially the use of signals) after understanding how guest code is executed.

Executing guest code

So far we have mainly looked at the event loop and its central role in QEMU. Equally as important is the ability to execute guest code, without which QEMU could respond to events but would not be very useful.

There are two mechanism for executing guest code: Tiny Code Generator (TCG) and KVM. TCG emulates the guest using dynamic binary translation, also known as Just-in-Time (JIT) compilation. KVM takes advantage of hardware virtualization extensions present in modern Intel and AMD CPUs for safely executing guest code directly on the host CPU. For the purposes of this post the actual techniques do not matter but what matters is that both TCG and KVM allow us to jump into guest code and execute it.

Jumping into guest code takes away our control of execution and gives control to the guest. While a thread is running guest code it cannot simultaneously be in the event loop because the guest has (safe) control of the CPU. Typically the amount of time spent in guest code is limited because reads and writes to emulated device registers and other exceptions cause us to leave the guest and give control back to QEMU. In extreme cases a guest can spend an unbounded amount of time without giving up control and this would make QEMU unresponsive.

In order to solve the problem of guest code hogging QEMU's thread of control signals are used to break out of the guest. A UNIX signal yanks control away from the current flow of execution and invokes a signal handler function. This allows QEMU to take steps to leave guest code and return to its main loop where the event loop can get a chance to process pending events.

The upshot of this is that new events may not be detected immediately if QEMU is currently in guest code. Most of the time QEMU eventually gets around to processing events but this additional latency is a performance problem in itself. For this reason timers, I/O completion, and notifications from worker threads to core QEMU use signals to ensure that the event loop will be run immediately.

You might be wondering what the overall picture between the event loop and an SMP guest with multiple vcpus looks like. Now that the threading model and guest code has been covered we can discuss the overall architecture.

iothread and non-iothread architecture

The traditional architecture is a single QEMU thread that executes guest code and the event loop. This model is also known as non-iothread or !CONFIG_IOTHREAD and is the default when QEMU is built with ./configure && make. The QEMU thread executes guest code until an exception or signal yields back control. Then it runs one iteration of the event loop without blocking in select(2). Afterwards it dives back into guest code and repeats until QEMU is shut down.

If the guest is started with multiple vcpus using -smp 2, for example, no additional QEMU threads will be created. Instead the single QEMU thread multiplexes between two vcpus executing guest code and the event loop. Therefore non-iothread fails to exploit multicore hosts and can result in poor performance for SMP guests.

Note that despite there being only one QEMU thread there may be zero or more worker threads. These threads may be temporarily or permanent. Remember that they perform specialized tasks and do not execute guest code or process events. I wanted to emphasise this because it is easy to be confused by worker threads when monitoring the host and interpret them as vcpu threads. Remember that non-iothread only ever has one QEMU thread.

The newer architecture is one QEMU thread per vcpu plus a dedicated event loop thread. This model is known as iothread or CONFIG_IOTHREAD and can be enabled with ./configure --enable-io-thread at build time. Each vcpu thread can execute guest code in parallel, offering true SMP support, while the iothread runs the event loop. The rule that core QEMU code never runs simultaneously is maintained through a global mutex that synchronizes core QEMU code across the vcpus and iothread. Most of the time vcpus will be executing guest code and do not need to hold the global mutex. Most of the time the iothread is blocked in select(2) and does not need to hold the global mutex.

Note that TCG is not thread-safe so even under the iothread model it multiplexes vcpus across a single QEMU thread. Only KVM can take advantage of per-vcpu threads.

Conclusion and words about the future

Hopefully this helps communicate the overall architecture of QEMU (which KVM inherits). Feel free to leave questions in the comments below.

In the future the details are likely to change and I hope we will see a move to CONFIG_IOTHREAD by default and maybe even a removal of !CONFIG_IOTHREAD.

I will try to update this post as qemu.git changes.