Wednesday, June 25, 2025

Profiling tools I use for QEMU storage performance optimization

A fair amount of the development work I do is related to storage performance in QEMU/KVM. Although I have written about disk I/O benchmarking and my performance analysis workflow in the past, I haven't covered the performance tools that I frequently rely on. In this post I'll go over what's in my toolbox and hopefully this will be helpful to others.

Performance analysis is hard when the system is slow but there is no clear bottleneck. If a CPU profile shows that a function is consuming a significant amount of time, then that function is a good target for optimization. On the other hand, if the profile is uniform and each function only consumes a small fraction of time, then it is difficult to gain much by optimizing just one function (although taking function call nesting into account may point towards parent functions that can be optimized).

If you are measuring just one metric, then eventually the profile will become uniform and there isn't much left to optimize. It helps to measure at multiple layers of the system in order to increase the chance of finding bottlenecks.

Here are the tools I like to use when hunting for QEMU storage performance optimizations.

kvm_stat

kvm_stat is a tool that runs on the host and counts events from the kvm.ko kernel module, including device register accesses (MMIO), interrupt injections, and more. These are often associated with vmentry/vmexit events where control passes between the guest and the hypervisor. Each time this happens there is a cost and it is preferable to minimize the number of vmentry/vmexit events.

kvm_stat lets you identify inefficiencies in how guest drivers access devices, as well as low-level activity (MSRs, nested page tables, etc) that can be optimized.

Here is output from an I/O intensive workload:

 Event                                         Total %Total CurAvg/s
 kvm_entry                                   1066255   20.3    43513
 kvm_exit                                    1066266   20.3    43513
 kvm_hv_timer_state                           954944   18.2    38754
 kvm_msr                                      487702    9.3    19878
 kvm_apicv_accept_irq                         278926    5.3    11430
 kvm_apic_accept_irq                          278920    5.3    11430
 kvm_vcpu_wakeup                              250055    4.8    10128
 kvm_pv_tlb_flush                             250000    4.8    10123
 kvm_msi_set_irq                              229439    4.4     9345
 kvm_ple_window_update                        213309    4.1     8836
 kvm_fast_mmio                                123433    2.3     5000
 kvm_wait_lapic_expire                         39855    0.8     1628
 kvm_apic_ipi                                   9614    0.2      457
 kvm_apic                                       9614    0.2      457
 kvm_unmap_hva_range                               9    0.0        1
 kvm_fpu                                          42    0.0        0
 kvm_mmio                                         28    0.0        0
 kvm_userspace_exit                               21    0.0        0
 vcpu_match_mmio                                  19    0.0        0
 kvm_emulate_insn                                 19    0.0        0
 kvm_pio                                           2    0.0        0
 Total                                       5258472          214493

Here I'm looking for high CurAvg/s rates. Any counters at 100k/s are well worth investigating.

Important events:

  • kvm_entry/kvm_exit: vmentries and vmexits are when the CPU transfers control between guest mode and the hypervisor.
  • kvm_msr: Model Specific Register accesses
  • kvm_msi_set_irq: Interrupt injections (Message Signaled Interrupts)
  • kvm_fast_mmio/kvm_mmio/kvm_pio: Device register accesses

sysstat

sysstat is a venerable suite of performance monitoring tools covering CPU, network, disk, and memory activity. It can be used equally within guests and on the host. It is like an extended version of the classic vmstat(8) tool.

The mpstat, pidstat, and iostat tools are the ones I use most often:

  • mpstat reports CPU consumption, including %usr (userspace), %sys (kernel), %guest (guest mode), %steal (time stolen by the hypervisor), and more. Use this to find CPUs that are maxed out or to spot poor use of parallelism/multi-threading.
  • pidstat reports per-process and per-thread statistics. This is useful for identifying specific threads that are consuming resources.
  • iostat reports disk statistics. This is useful for comparing I/O statistics between bare metal and guests.
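
As a rough sketch, typical host-side invocations look like this (the 1-second interval, the QEMU process name, and the device names are placeholders):

$ mpstat -P ALL 1                               # per-CPU utilization, updated every second
$ pidstat -t -p $(pidof qemu-system-x86_64) 1   # per-thread statistics for the QEMU process
$ iostat -dxm 1                                 # extended per-device I/O statistics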

blktrace

blktrace monitors Linux block I/O activity. This can be used equally within guests and on the host. Often it's interesting to capture traces in both the guest and on the host so they can be compared. If the I/O pattern is different then something in the I/O stack is modifying requests. That can be a sign of a misconfiguration.

The blktrace data can be analyzed and plotted with the btt command. For example, the latencies from driver submission to completion can be summarized to find the overhead compared to bare metal.
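
As a rough sketch, a capture and analysis session looks like this (the device and trace file names are placeholders):

$ blktrace -d /dev/vdb -o trace -w 30   # capture 30 seconds of block I/O events
$ blkparse -i trace -d trace.bin        # merge the per-CPU trace files into a binary dump
$ btt -i trace.bin                      # summarize latencies (D2C, Q2C, etc)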

perf-top

In its default mode, perf-top is a sampling CPU profiler. It periodically collects the CPU's program counter value so that a profile can be generated showing hot functions. It supports call graph recording with the -g option if you want to aggregate nested function calls and find out which parent functions are responsible for the most CPU usage.

perf-top (and its non-interactive perf-record/perf-report cousin) is good at identifying inner loops of programs, excessive memcpy/memset, expensive locking instructions, instructions with poor cache hit rates, etc. When I use it to profile QEMU it shows the host kernel and QEMU userspace activity. It does not show guest activity.
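
A minimal sketch of attaching it to a running QEMU process (the process name is a placeholder):

$ perf top -p $(pidof qemu-system-x86_64)      # flat profile of QEMU userspace and the host kernel
$ perf top -g -p $(pidof qemu-system-x86_64)   # the same, but with call graph recording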

Here is the output without call graph recording where we can see vmexit activity, QEMU virtqueue processing, and excessive memset at the top of the profile:

   3.95%  [kvm_intel]                 [k] vmx_vmexit
   3.37%  qemu-system-x86_64          [.] virtqueue_split_pop
   2.64%  libc.so.6                   [.] __memset_evex_unaligned_erms
   2.50%  [kvm_intel]                 [k] vmx_spec_ctrl_restore_host
   1.57%  [kernel]                    [k] native_irq_return_iret
   1.13%  qemu-system-x86_64          [.] bdrv_co_preadv_part
   1.13%  [kernel]                    [k] sync_regs
   1.09%  [kernel]                    [k] native_write_msr
   1.05%  [nvme]                      [k] nvme_poll_cq

The goal is to find functions or families of functions that consume significant amounts of CPU time so they can be optimized. If the profile is uniform with most functions taking less than 1%, then the bottlenecks are more likely to be found with other tools.

perf-trace

perf-trace is an strace-like tool for monitoring system call activity. For performance monitoring it has a --summary option that shows time spent in system calls and the counts. This is how you can identify system calls that block for a long time or that are called too often.
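
A minimal sketch of summarizing a running QEMU process (the process name is a placeholder; press Ctrl-C to stop tracing and print the summary):

$ perf trace --summary -p $(pidof qemu-system-x86_64)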

Here is an example from the host showing a summary of a QEMU IOThread that uses io_uring for disk I/O and ioeventfd/irqfd for VIRTIO device activity:

 IO iothread1 (332737), 1763284 events, 99.9%

   syscall            calls  errors  total       min       avg       max       stddev
                                     (msec)    (msec)    (msec)    (msec)        (%)
   --------------- --------  ------ -------- --------- --------- ---------     ------
   io_uring_enter    351680      0  8016.308     0.003     0.023     0.209      0.11%
   write             390189      0  1238.501     0.002     0.003     0.098      0.08%
   read              121057      0   305.355     0.002     0.003     0.019      0.14%
   ppoll              18707      0    62.228     0.002     0.003     0.023      0.32%

Conclusion

We looked at kvm_stat, sysstat, blktrace, perf-top, and perf-trace. They provide performance metrics from different layers of the system. Another worthy mention is the bcc collection of eBPF-based tools that offers a huge array of performance monitoring and troubleshooting tools. Let me know which tools you like to use!

Tuesday, October 1, 2024

Video and slides available for "IOThread Virtqueue Mapping" talk at KVM Forum 2024

My KVM Forum 2024 talk "IOThread Virtqueue Mapping: Improving virtio-blk SMP scalability in QEMU" is now available on YouTube. The slides are also available here.

IOThread Virtqueue Mapping is a new QEMU feature for configuring multiple IOThreads that will handle a virtio-blk device's virtqueues. This means QEMU can take advantage of the host's Linux multi-queue block layer and assign CPUs to I/O processing. Giving additional resources to virtio-blk emulation allows QEMU to achieve higher IOPS and saturate fast local NVMe drives. This is especially important for applications that submit I/O on many vCPUs simultaneously - a workload that QEMU had trouble keeping up with in the past.

You can read more about IOThread Virtqueue Mapping in this Red Hat blog post.

Wednesday, March 22, 2023

How to debug stuck VIRTIO devices in QEMU

Every once in a while a bug comes along where a guest hangs while communicating with a QEMU VIRTIO device. In this blog post I'll share some debugging approaches that can help QEMU developers who are trying to understand why a VIRTIO device is stuck.

There are a number of reasons why communication with a VIRTIO device might cease, so it helps to identify the nature of the hang:

  • Did the QEMU device see the requests that the guest driver submitted?
  • Did the QEMU device complete the request?
  • Did the guest driver see the requests that the device completed?

The case I will talk about is when QEMU itself is still responsive (the QMP/HMP monitor works) and the guest may or may not be responsive.

Finding requests that are stuck

There is a QEMU monitor command to inspect virtqueues called x-query-virtio-queue-status (QMP) and info virtio-queue-status (HMP). This is a quick way to extract information about a virtqueue from QEMU.
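
As a sketch, the HMP variant takes the device's QOM path and the virtqueue index (the path below is only an example; the info virtio command lists the VIRTIO devices in the VM):

(qemu) info virtio-queue-status /machine/peripheral-anon/device[1]/virtio-backend 0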

This command allows us to answer the question of whether the QEMU device completed its requests. The shadow_avail_idx and used_idx values in the output are the Available Ring index and Used Ring index, respectively. When they are equal the device has completed all requests. When they are not equal there are still requests in flight and they must be stuck inside QEMU.

Here is a little more background on the index values. Remember that VIRTIO Split Virtqueues have an Available Ring index and a Used Ring index. The Available Ring index is incremented by the driver whenever it submits a request. The Used Ring index is incremented by the device whenever it completes a request. If the Available Ring index is equal to the Used Ring index then all requests have been completed.

Note that shadow_avail_idx is not the vring Available Ring index in guest RAM but just the last cached copy that the device saw. That means we cannot tell if there are new requests that the device hasn't seen yet. We need to take another approach to figure that out.

Finding requests that the device has not seen yet

Maybe the device has not seen new requests recently and this is why the guest is stuck. That can happen if the device is not receiving Buffer Available Notifications properly (normally this is done by reading a virtqueue kick ioeventfd, also known as a host notifier in QEMU).

We cannot use QEMU monitor commands here, but attaching the GDB debugger to QEMU will allow us to peek at the Available Ring index in guest RAM. The following GDB Python script loads the Available Ring index for a given VirtQueue:

$ cat avail-idx.py
import gdb

# ADDRESS is the address of a VirtQueue struct
vq = gdb.Value(ADDRESS).cast(gdb.lookup_type('VirtQueue').pointer())

# The Available Ring index is the second uint16_t field (after flags) in the
# vring_avail structure that the avail ring cache points to.
uint16_type = gdb.lookup_type('uint16_t')
avail_idx = vq['vring']['caches']['avail']['ptr'].cast(uint16_type.pointer())[1]
if avail_idx != vq['shadow_avail_idx']:
    print('Device has not seen all available buffers: avail_idx {} shadow_avail_idx {} in {}'.format(avail_idx, vq['shadow_avail_idx'], vq.dereference()))

You can run the script using the source avail-idx.py GDB command. Finding the address of the virtqueue depends on the type of device that you are debugging.
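
As one hedged example, QEMU's VirtIODevice struct keeps its virtqueues in the vq array, so if you already have a VirtIODevice pointer you can take the address of queue N like this (the pointer value below is made up):

(gdb) p &((VirtIODevice *)0x5555567d9010)->vq[0]

If you don't have a pointer handy, a breakpoint on a function like virtio_queue_notify() will reveal one the next time the guest kicks the device (assuming it still does).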

Finding completions that the guest has not seen

If requests are not stuck inside QEMU and the device has seen the latest request, then the guest driver might have missed the Used Buffer Notification from the device (normally an interrupt handler or polling loop inside the guest detects completed requests).

In VIRTIO the driver's current index in the Used Ring is not visible to the device. This means we have no general way of knowing whether the driver has seen completions. However, there is a cool trick for modern devices that have the VIRTIO_RING_F_EVENT_IDX feature enabled.

The trick is that the Linux VIRTIO driver code updates the Used Event Index every time a completed request is popped from the virtqueue. So if we look at the Used Event Index we know the driver's index into the Used Ring and can find out whether it has seen request completions.

The following GDB Python script loads the Used Event Index for a given VirtQueue:

$ cat used-event-idx.py
import gdb

# ADDRESS is the address of a VirtQueue struct
vq = gdb.Value(ADDRESS).cast(gdb.lookup_type('VirtQueue').pointer())

# With VIRTIO_RING_F_EVENT_IDX the Used Event Index is stored right after the
# Available Ring entries, i.e. at uint16_t offset 2 + num (flags, idx, ring[num]).
uint16_type = gdb.lookup_type('uint16_t')
used_event = vq['vring']['caches']['avail']['ptr'].cast(uint16_type.pointer())[2 + vq['vring']['num']]
if used_event != vq['used_idx']:
    print('Driver has not seen all used buffers: used_event {} used_idx {} in {}'.format(used_event, vq['used_idx'], vq.dereference()))

You can run the script using the source used-event-idx.py GDB command. As before, finding the address of the virtqueue depends on the type of device that you are debugging.

Conclusion

I hope this helps anyone who has to debug a VIRTIO device that seems to have gotten stuck.

Monday, February 20, 2023

Video and slides available for "vhost-user-blk: a fast userspace block I/O interface"

At FOSDEM '23 I gave a talk about vhost-user-blk and its use as a userspace block I/O interface. The video and slides are now available here. Enjoy!

Wednesday, January 25, 2023

Speaking at FOSDEM '23 about "vhost-user-blk: A fast userspace block I/O interface"

vhost-user-blk has connected hypervisors to software-defined storage since around 2017, but it was mainly seen as virtualization technology. Did you know that vhost-user-blk is not specific to virtual machines? I think it's time to use it more generally as a userspace block I/O interface because it's fast, unprivileged, and avoids exposing kernel attack surfaces.

My LWN.net article about Accessing QEMU storage features without a VM already hinted at this, but now it's time to focus on what vhost-user-blk is and why it's easy to integrate into your applications. libblkio is a simple and familiar block I/O API with vhost-user-blk support. You can connect to existing SPDK-based software-defined storage applications, qemu-storage-daemon, and other vhost-user-blk back-ends.

Come see my FOSDEM '23 talk about vhost-user-blk as a fast userspace block I/O interface live on Saturday Feb 4 2023, 11:15 CET. It will be streamed on the FOSDEM website and recordings will be available later. Slides are available here.

Friday, November 18, 2022

LWN article on "Accessing QEMU storage features without a VM"

At KVM Forum 2022 Kevin Wolf and Stefano Garzarella gave a talk on qemu-storage-daemon, a way to get QEMU's storage functionality without running a VM. It's great for accessing disk images, basically taking the older qemu-nbd to the next level. The cool thing is this makes QEMU's software-defined storage functionality - block devices with snapshots, incremental backup, image file formats, etc - available to other programs. Backup and forensics tools as well as other types of programs can take advantage of qemu-storage-daemon.

Here is the full article about Accessing QEMU storage features without a VM. Enjoy!

Thursday, November 10, 2022

Using qemu-img to access vhost-user-blk storage

vhost-user-blk is a high-performance storage protocol that connects virtual machines to software-defined storage like SPDK or qemu-storage-daemon. Until now, tool support for vhost-user-blk has been lacking. Accessing vhost-user-blk devices involved running a virtual machine, which requires more setup than one would like.

QEMU 7.2 adds vhost-user-blk support to the qemu-img tool. This is possible thanks to libblkio, a library that other programs besides QEMU can use too.

Check for vhost-user-blk support in your installed qemu-img version like this (if it says 0 then you need to update qemu-img or compile it from source with libblkio enabled):

$ qemu-img --help | grep virtio-blk-vhost-user | wc -l
1

You can copy a raw disk image file into a vhost-user-blk device like this:

$ qemu-img convert \
      --target-image-opts \
      -n \
      test.img \
      driver=virtio-blk-vhost-user,path=/tmp/vhost-user-blk.sock,cache.direct=on

The contents of the vhost-user-blk device can be saved as a qcow2 image file like this:

$ qemu-img convert \
      --image-opts \
      -O qcow2 \
      driver=virtio-blk-vhost-user,path=/tmp/vhost-user-blk.sock,cache.direct=on out.qcow2

The size of the virtual disk can be read:

$ qemu-img info \
      --image-opts \
      driver=virtio-blk-vhost-user,path=/tmp/vhost-user-blk.sock,cache.direct=on
image: json:{"driver": "virtio-blk-vhost-user"}
file format: virtio-blk-vhost-user
virtual size: 4 GiB (4294967296 bytes)
disk size: unavailable

Other qemu-img sub-commands like bench and dd are also available for quickly accessing the vhost-user-blk device without running a virtual machine:

$ qemu-img bench \
      --image-opts \
      driver=virtio-blk-vhost-user,path=/tmp/vhost-user-blk.sock,cache.direct=on
Sending 75000 read requests, 4096 bytes each, 64 in parallel (starting at offset 0, step size 4096)
Run completed in 1.443 seconds.

Being able to access vhost-user-blk devices from qemu-img makes vhost-user-blk a little easier to work with.

Friday, October 15, 2021

A new approach to usermode networking with passt

There is a new project called passt that Stefano Brivio has been working on to implement usermode networking, the magic that forwards network packets between QEMU guests and the host network.

passt is designed as a replacement for QEMU's --netdev user (also known as slirp), a feature that is commonly used but not really considered production-ready. What passt improves on is security and performance, finally making usermode networking production-ready. That's all you need to know to try it out but I thought the internals of how passt works are interesting, so this article explains the approach.

Why usermode networking is necessary

Guests send and receive Ethernet frames through emulated network interface cards like virtio-net. Those packets need to be injected into the host network but popular operating systems don't provide an API for sending and receiving Ethernet frames because that poses a security risk (spoofing) or could simply interfere with other applications on the host.

Actually that's not quite true, operating systems do provide specialized APIs for injecting Ethernet frames but they come with limitations. For example, the Linux tun/tap driver requires additional network configuration steps as well as administrator privileges. Sometimes it's not possible to take advantage of tap due to these limitations and we really need a solution for unprivileged users. That's what usermode networking is about.

Transmuting Ethernet frames to Socket API calls

Since an unprivileged user cannot inject Ethernet frames into the host network, we have to make do with the POSIX Sockets API that is available to unprivileged users. Each Ethernet frame sent by the guest needs to be converted into equivalent Sockets API calls on the host so that the desired effect is achieved even though we weren't able to transmit the original Ethernet frame byte-for-byte. Incoming packets from the external network need to be received via the Sockets API and repackaged into Ethernet frames that the guest network interface card can receive.

In networking parlance this conversion between Ethernet frames and Sockets API calls is a Layer 2 (Data Link Layer)/Layer 4 (Transport Layer) conversion. The Ethernet frames have additional packet headers including the Ethernet header, IP header, and the TCP/UDP header that the Sockets API calls don't include. Careful use of the Sockets API makes it possible to synthesize Ethernet frames that are similar enough to the original ones that the guest can communicate successfully.

For the most part this conversion requires parsing and building, respectively, packet headers in a straightforward way. The TCP protocol makes things more interesting though because a TCP connection involves non-trivial state that is normally handled by the TCP/IP stack. For example, data sent over a TCP connection might arrive out of order or some chunks may have been dropped. There is also a state machine for the TCP connection lifecycle including its famous three-way handshake. This means TCP connections must be carefully tracked so that these low-level protocol features can be simulated correctly.

How passt works

Passt runs as an unprivileged host userspace process that is connected to QEMU through --netdev socket, a way to transfer Ethernet frames from QEMU's network interface card emulation to another process like passt. When passt reads an Ethernet frame containing, for example, a UDP message from the guest, it sends the payload onwards through an equivalent AF_INET SOCK_DGRAM socket on the host. It also keeps the socket open so replies can be read on the host and then packaged into Ethernet frames that are written to the guest. The effect of this is that guest network communication appears to come from the passt process on the host and integrates nicely into host networking.

How TCP works is a bit more interesting. Since TCP connections require acknowledgement messages for reliable delivery, passt uses the recvmmsg(2) MSG_PEEK flag to fetch data while keeping it queued in the host network stack's rcvbuf until the guest acknowledges it. This avoids extra buffer management code in passt and is part of its strategy of implementing only a subset of TCP. There is no need to duplicate the full TCP/IP stack since the host and guest already have them, but achieving this requires detailed knowledge of TCP so that passt can maintain just enough state.
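
This is not passt's actual code, but a minimal C sketch of the MSG_PEEK technique shows the idea: peeked data stays queued in the kernel's receive buffer, so it can be forwarded to the guest now and only consumed once the guest acknowledges it:

#include <sys/types.h>
#include <sys/socket.h>

/* Peek at data pending on a connected TCP socket without consuming it. */
ssize_t peek_pending(int sockfd, void *buf, size_t len)
{
    return recv(sockfd, buf, len, MSG_PEEK);
}

/* Discard 'acked' bytes once the guest has acknowledged them. */
ssize_t consume_acked(int sockfd, size_t acked)
{
    char scratch[4096];
    size_t left = acked;

    while (left > 0) {
        ssize_t n = recv(sockfd, scratch,
                         left < sizeof(scratch) ? left : sizeof(scratch), 0);
        if (n <= 0) {
            return n; /* error or connection closed */
        }
        left -= (size_t)n;
    }
    return (ssize_t)acked;
}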

Incoming connections are handled by port forwarding. This means passt can bind to port 2222 on the host and forward connections to port 22 inside the guest. This is very useful for usermode networking since the user may not have permission to bind to low-numbered ports on the host or there might already be host services listening on those ports. If you don't want to mess with port forwarding you can use passt's all mode, which simply listens on all non-ephemeral ports (basically a brute force approach).

A few basic network protocols are necessary for network communication: ARP, ICMP, DHCP, DNS, and IPv6 services. Passt offers these because the guest cannot talk to those services on the host network directly. They can be disabled when the guest has knowledge of the network configuration and doesn't need them.

Why passt is unique

Thanks to running in a separate process from QEMU and by taking a minimalist approach, passt is able to tighten security. Its seccomp filters are stronger than anything the larger QEMU process could do. The code is clean and specifically designed for security and simplicity. I think writing passt in C was a missed opportunity. Some users may rule it out entirely for this reason. Userspace parsing of untrusted network packets should be done in a memory-safe programming language nowadays. Nevertheless, it's a step up from slirp, which has a poor track record of security issues and runs as part of the QEMU process.

I'm excited to see how passt performs in comparison to slirp. passt uses techniques like recvmmsg(2)/sendmmsg(2) to batch message transfer and reads multiple Ethernet frames from the guest in a single syscall to amortize the cost of syscalls over multiple packets. There is no dynamic memory allocation and packet headers are pre-populated to minimize the number of CPU cycles spent in the data path. While this is promising, QEMU's --netdev socket isn't the fastest (packets first take a trip through QEMU's net subsystem queues), but still a good trade-off between performance and simplicity/security. Based on reading the code, I think passt will be faster than slirp but I haven't benchmarked it myself.

There is another mode that passt supports for containers instead of virtualization. Although it's not relevant to QEMU, this so-called pasta mode is a cool feature for container networking. In this mode pasta connects a network namespace with the outside world (init namespace) through a tap device. This might become passt's killer feature, because the same software can be used for both virtualization and containers, so why bother investing in two separate solutions?

Conclusion

Passt is a promising replacement for slirp (on Linux hosts at least). It looks like there will finally be a production-ready usermode networking feature for QEMU that is fast and secure. Passt's functionality is generic enough that other projects besides QEMU will be able to use it, which is great because this kind of networking code is non-trivial to develop. I look forward to passt becoming available for use with QEMU in Linux distributions soon!

Friday, July 2, 2021

Slides available for "Bring Your Own Virtual Devices: Frameworks for Software and Hardware Device Virtualization"

The PDF slides for my "Bring Your Own Virtual Devices: Frameworks for Software and Hardware Device Virtualization" talk from the 16th Workshop on Virtualization in High-Performance Cloud Computing are now available.

This talk covers out-of-process device interfaces including vhost (kernel), vhost-user, Linux VFIO, mdev, vfio-user, vDPA, and VDUSE. It gives a brief overview of each interface, how it works, and how to develop your own devices.

The growing number of out-of-process device interfaces available in QEMU/KVM can make it hard to understand and compare them. Each of these interfaces is designed for different use cases. For example, whether you want to pass through hardware or implement the device in software, if you want to implement your device in the host kernel or in host userspace, etc. This talk will give you the necessary knowledge to compare these interfaces yourself so you can decide which one is most appropriate for your use case.

For more information about the design of out-of-process device interfaces, see also my previous blog post about requirements for out-of-process devices.

Friday, March 12, 2021

Overcoming fear of public communication in open source

When given the choice between communicating in private or in public, many people opt for private communication. They send questions or unfinished patches to individuals instead of posting on mailing lists, forums, or chat rooms. This post explains why public communication is a faster and more efficient way for developers to communicate in open source communities than private communication. A big factor in the private vs public decision is psychological and I've provided a checklist to give you confidence when communicating in public.

Time and time again I find people initiating discussions through private channels when I know it would be advantageous to have them in public instead. I even keep an email reply template handy asking the sender to reach out to the public mailing list instead of communicating with me in private. Communicating in public is a good default unless you need confidentiality (security bugs, business reasons, etc). So why do people prefer to ask questions or send patches off-list?

Why we fear public communication

I'm interested in understanding why people avoid public communications channels in open source communities. The purpose of mailing lists, chat rooms, and forums is to engage in discussions and share knowledge. Members of these communities are interested in the topic and want to engage in discussion. But there are some common reasons I've found why people don't make use of public communications channels:

  • I don't want to create noise. Mailing lists are home to important discussions between experienced members of the community. Sending a relatively simple question that no expert would need to ask, or an unfinished patch, can make you doubt whether it deserves to be sent at all. This is a fallacy. As long as the question or patches are relevant to the community and you have done your homework (see the checklist below), there is no need to worry about creating noise.
  • I will look dumb or seem like a bad programmer. When you need help it's likely that your understanding is incomplete. We often hold ourselves to artificially high standards when asking for help in public, yet when we observe others interacting in public we don't hold their questions or unfinished code against them. Only if they show a repeated pattern of sloppy work does it damage their reputation. Asking for help in public won't harm your image.
  • Too much traffic. Big open source communities are often so active that no single person can keep track of everything that is going on. Even core members of the community rely on filtering only topics relevant to them. Since filtering becomes necessary at scale anyway, there's no need to worry about sending too many messages.

A note about tone: in the past people were more likely to be put off by the unfriendly tone in chat rooms and mailing lists. Over time the tone seems to have improved in general, probably due to multiple factors like open source becoming more professional, codes of conduct making participants aware of their behavior, etc. I don't want to cover the pros and cons of these factors, but suffice to say that nowadays tone is less of a problem and that's a good thing for everyone.

Why public communication is faster and more efficient

Next let's look at the reasons why public communication is beneficial:

Including the community from the start avoids waste

If you ask for help in private it's possible that the answer you get will not end up being consensus in the community. When you eventually send patches to the community they might disagree with the approach you settled on in private and you'll have to redo your work to get the patches merged. This can be avoided by including the community from the start.

Similarly, you can avoid duplicating work when multiple people are investigating the same problem in private without knowing about each other. If you discuss what you are doing in public then you can collaborate and avoid spending time creating multiple solutions, only one of which can be merged.

Public discussions create a searchable knowledge base

If you find a solution to your problem in private, that doesn't help others who have the same question. Public discussions are typically archived and searchable on the web. When the next person has the same question they will find the answer online and won't need to ask at all!

Additionally, everyone benefits and learns from each other when questions are asked in public. We soak up knowledge by following these discussions. If they are private then we don't have this opportunity.

Get a reply even if the person you asked is unavailable

The person you intended to reach may be away due to timezones, holidays, etc. A private discussion is blocked until they respond to your message. Public discussions, on the other hand, can progress even when the original participants are unavailable. You can get an answer to your question faster by asking in public.

Public activity gives visibility to your work

Private discussions are invisible to your teammates, managers, and other people affected by your progress. When you communicate in public they will be able to plan better, help out when needed, and give you credit for the effort you are putting in.

A checklist before you press Send

Hopefully I have encouraged you to go ahead and ask questions or send unfinished patches in public. Here is a checklist that will help you feel confident about communicating in public:

Before asking questions...

  1. Have you searched the web, documentation, and code?
  2. Have you added printfs, GDB breakpoints, or enabled tracing to understand the behavior of the system?

Before sending patches...

  1. Is the code formatted according to the coding standard?
  2. Are the error cases handled, memory allocation/ownership correctly implemented, and thread safety addressed? Most of the time you can figure these out yourself and forgetting to do so may distract from the topics you want help with.
  3. Did you add todo comments pointing out unfinished aspects of the code? Being upfront about what is missing helps readers understand the status of the code and saves them time trying to distinguish requirements you forgot from things that are simply not yet implemented.
  4. Do the commit descriptions explain the purpose of the code changes and does the cover letter give an overview of the patch series?
  5. Are you using git-format-patch --subject-prefix RFC (same for git-publish) to mark your patches as a "request for comment"? This tells people you are seeking input on unfinished code.

Finally, do you know who to CC on emails or mention in comments/chats? Look up the relevant maintainers or active developers using git-log(1), scripts/get_maintainer.pl, etc.

Conclusion

Private communication can be slower, less efficient, and adds less value to an open source community. For many people public communication feels a little scary and they prefer to avoid it. I hope that by understanding the advantages of public communication you will be motivated to use public communication channels more.

Friday, October 9, 2020

Requirements for out-of-process device emulation

Over the past months I have participated in discussions about out-of-process device emulation. This post describes the requirements that have become apparent. I hope this will be a useful guide to understanding the big picture about out-of-process device emulation.

What is out-of-process device emulation?

Device emulation is traditionally implemented in the program that executes guest code. This approach is natural because accesses to device registers are trapped as part of the CPU run loop that sits at the core of an emulator or virtual machine monitor (VMM).

In some use cases it is advantageous to perform device emulation in separate processes. For example, software-defined network switches can minimize data copies by emulating network cards directly in the switch process. Out-of-process device emulation also enables privilege separation and tighter sandboxing for security.

Why are these requirements important?

When emulated devices are implemented in the VMM they use common VMM APIs. Adding new devices is relatively easy because the APIs are already there and the developer can focus on the device specifics. Out-of-process device emulation potentially leaves developers without APIs since the device emulation program is a separate program that literally starts from main(). Developers want to focus on implementing their specific device, not on solving general problems related to out-of-process device emulation infrastructure.

It is not only a lot of work to implement an out-of-process device completely from scratch, but there is also a risk of developing the wrong solution because some subtleties of device emulation are not obvious at first glance.

I hope sharing these requirements will help in the creation of common infrastructure so it's easy to implement high-quality out-of-process devices.

Not all use cases have the full set of requirements. Therefore it's best if requirements are addressed in separate, reusable libraries so that device implementors can pick the ones that are relevant to them.

Device emulation

Device resources

Devices provide resources that drivers interact with such as hardware registers, memory, or interrupts. The fundamental requirement of out-of-process device emulation is exposing device resources.

The following types of device resources are needed:

Synchronous MMIO/PIO accesses

The most basic device emulation operation is the hardware register access. This is a memory-mapped I/O (MMIO) or programmed I/O (PIO) access to the device. A read loads a value from a device register. A write stores a value to a device register. These operations are synchronous because the vCPU is paused until completion.

Asynchronous doorbells

Devices often have doorbell registers, allowing the driver to inform the device that new requests are ready for processing. The vCPU does not need to wait since the access is a posted write.

The kvm.ko ioeventfd mechanism can be used to implement asynchronous doorbells.
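
As a minimal sketch (error handling omitted), registering an ioeventfd with KVM looks roughly like this. Guest writes to the doorbell address are then handled inside kvm.ko, which signals the eventfd so the device emulation process can pick up the work without the vCPU waiting for userspace. In an out-of-process setup the VMM typically performs this registration and passes the eventfd to the device emulation process over the control channel:

#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* vm_fd is the KVM VM file descriptor; doorbell_gpa is the guest physical
 * address of the device's doorbell register. */
int register_doorbell(int vm_fd, __u64 doorbell_gpa)
{
    int efd = eventfd(0, EFD_CLOEXEC | EFD_NONBLOCK);
    struct kvm_ioeventfd ioeventfd = {
        .addr  = doorbell_gpa,
        .len   = 4,    /* match 4-byte writes */
        .fd    = efd,
        .flags = 0,    /* no datamatch: any written value triggers the eventfd */
    };

    if (ioctl(vm_fd, KVM_IOEVENTFD, &ioeventfd) < 0) {
        return -1;
    }
    return efd;        /* monitor this fd in an event loop to receive doorbell kicks */
}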

Shared device memory

Devices may have memory-like regions that the CPU can access (such as PCI Memory BARs). The device emulation process therefore needs to share a region of its memory space with the VMM so the guest can access it. This mechanism also allows device emulation to busy wait (poll) instead of using synchronous MMIO/PIO accesses or asynchronous doorbells for notifications.

Direct Memory Access (DMA)

Devices often require read and write access to a memory address space belonging to the CPU. This allows network cards to transmit packet payloads that are located in guest RAM, for example.

Early out-of-process device emulation interfaces simply shared guest RAM. They allowed DMA to any guest physical memory address. More advanced IOMMU and address space identifier mechanisms are now becoming ubiquitous. Therefore, new out-of-process device emulation interfaces should incorporate IOMMU functionality.

The key requirement for IOMMU mechanisms is allowing the VMM to grant access to a region of memory so the device emulation process can read from and/or write to it.

Interrupts

Devices notify the CPU using interrupts. An interrupt is simply a message sent by the device emulation process to the VMM. Interrupt configuration is flexible on modern devices, meaning the driver may be able to select the number of interrupts and a mapping (using one interrupt with multiple event sources). This can be implemented using the Linux eventfd mechanism or via in-band device emulation protocol messages, for example.
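
As a minimal sketch of the eventfd approach (error handling omitted), the VMM can attach an irqfd to a guest interrupt line and hand the eventfd to the device emulation process, which then injects an interrupt simply by writing to the file descriptor:

#include <stdint.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* VMM side: signals on 'efd' are delivered to the guest as interrupts on 'gsi'. */
int attach_irqfd(int vm_fd, int efd, uint32_t gsi)
{
    struct kvm_irqfd irqfd = {
        .fd  = efd,
        .gsi = gsi,
    };
    return ioctl(vm_fd, KVM_IRQFD, &irqfd);
}

/* Device emulation process side: raise an interrupt. */
void raise_interrupt(int efd)
{
    uint64_t one = 1;
    if (write(efd, &one, sizeof(one)) != sizeof(one)) {
        /* nothing sensible to do in a sketch */
    }
}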

Extensibility for new bus types

It should be possible to support multiple bus types. vhost-user only supports vhost devices. VFIO is more extensible but currently focussed on PCI devices. It is likely that QEMU SysBus devices will be desirable for implementing ad-hoc out-of-process devices (especially for System-on-Chip target platforms).

Bus-level APIs, not protocol bindings

Developers should not need to learn the out-of-process device emulation protocol (vfio-user, etc). APIs should focus on bus-level concepts such as defining VIRTIO or PCI devices rather than protocol bindings for dealing with protocol messages, file descriptor passing, and shared memory.

In other words, developers should be thinking in terms of the problem domain, not worrying about how out-of-process device emulation is implemented. The protocol should be hidden behind bus-level APIs.

Multi-threading support from the beginning

Threading issues arise often in device emulation because asynchronous requests or multi-queue devices can be implemented using threads. Therefore it is necessary to clearly document what threading models are supported and how device lifecycle operations like reset interact with in-flight requests.

Live migration, live upgrade, and crash recovery

There are several related issues around device state and restarting the device emulation program without disrupting the guest.

Live migration

Live migration transfers the state of a device from one device emulation process to another (typically running on another host). This requires the following functionality:

Quiescing the device

Some devices can be live migrated at any point in time without any preparation, while others must be put into a quiescent state to avoid issues. An example is a storage controller that has a write request in flight. It is not safe to live migrate until the write request has completed or been canceled. Failure to wait might result in data corruption if the write takes effect after the destination has resumed execution.

Therefore it is necessary to quiesce a device. After this point there is no further device activity and no guest-visible changes will be made by the device.

Saving/loading device state

It must be possible to save and load device state. Device state includes the contents of hardware registers as well as device-internal state necessary for resuming operation.

It is typically necessary to determine whether the device emulation processes on the migration source and destination are compatible before attempting migration. This avoids migration failure when the destination tries to load the device state and discovers it doesn't support it. It may be desirable to support loading device state that was generated by a different implementation of the same device type (for example, two virtio-net implementations).

Dirty memory logging

Pre-copy live migration starts with an iterative phase where dirty memory pages are copied from the migration source to the destination host. Devices need to participate in dirty memory logging so that all written pages are transferred to the destination and no pages are "missed".

Crash recovery

If the device emulation process crashes it should be possible to restart it and resume device emulation without disrupting the guest (aside from a possible pause during reconnection).

Doing this requires maintaining device state (contents of hardware registers, etc) outside the device emulation process. This way the state remains even if the process crashes and it can be resumed when a new process starts.

Live upgrade

It must be possible to upgrade the device emulation process and the VMM without disrupting the guest. Upgrading the device emulation process is similar to crash recovery in that the process terminates and a new one resumes with the previous state.

Device versioning

The guest-visible aspects of the device must be versioned. In the simplest case the device emulation program would have a --compat-version=N command-line option that controls which version of the device the guest sees. When guest-visible changes are made to the program the version number must be increased.

Giving control over the guest-visible device behavior makes it possible to save/load and live migrate reliably. Otherwise loading device state in a newer device emulation program could affect the running guest. Guest drivers typically are not prepared for the device to change underneath them and doing so could result in guest crashes or data corruption.

Security

The trust model

The VMM must not trust the device emulation program. This is key to implementing privilege separation and the principle of least privilege. If a compromised device emulation program is able to gain control of the VMM then out-of-process device emulation has failed to provide isolation between devices.

The device emulation program must not trust the VMM to the extent that this is possible. For example, it must validate inputs so that the VMM cannot gain control of the device emulation process through memory corruptions or other bugs. This makes it so that even if the VMM has been compromised, access to device resources and associated system calls still requires further compromising the device emulation process.

Unprivileged operation

The device emulation program should run unprivileged to the extent that this is possible. If special permissions are required to access hardware resources then these resources can sometimes be provided via file descriptor passing by a more privileged parent process.

Sandboxing

Operating system sandboxing mechanisms can be applied to device emulation processes more effectively than monolithic VMMs. Seccomp can limit the Linux system calls that may be invoked. SELinux can restrict access to system resources.

Sandboxing is a common task that most device emulation programs need. Therefore it is a good candidate for a library or launcher tool that is shared by device emulation programs.

Management

Command-line interface

A common command-line interface should be defined where possible. For example, vhost-user's standard --socket-path=PATH argument makes it easy to launch any vhost-user device backend. Protocol-specific options (e.g. socket path) and device type-specific options (e.g. virtio-net) can be standardized.

Some options are necessarily specific to the device emulation program and therefore cannot be standardized.

The advantage of standard options is that management tools like libvirt can launch the device emulation programs without further user configuration.

RPC interface

It may be necessary to issue commands at runtime. Examples include adjusting throttling limits, enabling/disabling logging, etc. These operations can be performed over an RPC interface.

Various RPC interfaces are used throughout open source virtualization software. Adopting a widely-used RPC protocol and standardizing commands is beneficial because it makes it easy to communicate with the software and management tools can support them relatively easily.

Conclusion

This was largely a brain dump but I hope it is useful food for thought as out-of-process device emulation interfaces are designed and developed. There is a lot more to it than simply implementing a protocol for device register accesses and guest RAM DMA. Developing open source libraries in Rust and C that can be used as needed will ensure that out-of-process devices are high-quality and easy for users to deploy.

Sunday, September 27, 2020

On unifying vhost-user and VIRTIO

The recent development of a Linux driver framework called VIRTIO Data Path Acceleration (vDPA) has laid the groundwork for exciting new vhost-user features. The implications of vDPA have not yet rippled through the community so I want to share my thoughts on how the vhost-user protocol can take advantage of new ideas from vDPA.

This post is aimed at developers and assumes familiarity with the vhost-user protocol and VIRTIO. No knowledge of vDPA is required.

vDPA helps with writing drivers for hardware that supports VIRTIO offload. Its goal is to enable hybrid hardware/software VIRTIO devices, but as a nice side-effect it has overcome limitations in the kernel vhost interface. It turns out that applying ideas from vDPA to the vhost-user protocol solves the same issues there. In this article I'll show how extending the vhost-user protocol with vDPA has the following benefits:

  • Allows existing VIRTIO device emulation code to act as a vhost-user device backend.
  • Removes the need for shim devices in the virtual machine monitor (VMM).
  • Replaces undocumented conventions with a well-defined device model.

These things can be done while reusing existing vhost-user and VIRTIO code. In fact, this is especially good news for existing codebases like QEMU because they already have a wealth of vhost-user and VIRTIO code that can now finally be connected together!

Let's look at the advantages of extending vhost-user with vDPA first and then discuss how to do it.

Why extend vhost-user with vDPA?

Reusing VIRTIO emulation code for vhost-user backends

It is a common misconception that a vhost device is a VIRTIO device. VIRTIO devices are defined in the VIRTIO specification and consist of a configuration space, virtqueues, and a device lifecycle that includes feature negotiation. A vhost device is a subset of the corresponding VIRTIO device. The exact subset depends on the device type, and some vhost devices are closer to the full functionality of their corresponding VIRTIO device than others. The most well-known example is that vhost-net devices have rx/tx virtqueues but lack the virtio-net control virtqueue. Also, the configuration space and device lifecycle are only partially available to vhost devices.

This difference makes it impossible to use a VIRTIO device as a vhost-user device and vice versa. There is an impedance mismatch and missing functionality. That's a shame because existing VIRTIO device emulation code is mature and duplicating it to provide vhost-user backends creates additional work.

If there was a way to reuse existing VIRTIO device emulation code it would be easier to move to a multi-process architecture in QEMU. Want to run --netdev user,id=netdev0 --device virtio-net-pci,netdev=netdev0 in a separate, sandboxed process? Easy, run it as a vhost-user-net device instead of as virtio-net.

Making VMM device shims optional

Today each new vhost device type requires a shim device in the VMM. QEMU has --device vhost-user-blk-pci, --device vhost-user-input-pci, and so on. Why can't there be a single --device vhost-user device?

This limitation is a consequence of the fact that vhost devices are not full VIRTIO devices. In fact, a vhost device does not even have a way to report its device type (net, blk, scsi, etc). Therefore it is impossible for today's VMMs to offer a generic device. Each vhost device type requires a shim device.

In some cases a shim device is desirable because it allows the VMM to handle some aspects of the device instead of passing everything through to the vhost device backend. But requiring shims by default involves lots of tedious boilerplate code and prevents new device types from being used by older VMMs.

Providing a well-defined device model in vhost-user

Although vhost-user works well for users, it is difficult for developers to learn and extend. The protocol does not have a well-defined device model. Each device type has its own undocumented set of protocol messages that are used. For example, the vhost-user-blk device uses the configuration space whereas most other device types do not use the configuration space at all.

Since protocol use is not fully documented in the specification, developers might resort to reading Linux, QEMU, and DPDK code in order to figure out how to make their devices work. They typically have to debug vhost-user protocol messages and adjust their code until it appears to work. Hopefully the critical bugs are caught before the code ships. This is problematic because it's hard to produce high-quality vhost-user implementations.

Although the protocol specification can certainly be cleaned up, the problem is more fundamental. vhost-user badly needs a well-defined device model so that protocol usage is clear and uniform for all device types. The best way to do that is to simply adopt the VIRTIO device model. The VIRTIO specification already defines the device lifecycle and details of the device types. By making vhost-user devices full VIRTIO devices there is no need for additional vhost device specifications. The vhost-user specification just becomes a transport for the established VIRTIO device model. Luckily that is effectively what vDPA has done for kernel vhost ioctls.

How to do this in QEMU

The following QEMU changes are needed to implement vhost-user vDPA support. Below I will focus on vhost-user-net but most of the work is generic and benefits all device types.

Import vDPA ioctls into vhost-user

vDPA extends the Linux vhost ioctl interface. It uses a subset of vhost ioctls and adds new vDPA-specific ioctls that are implemented in the vhost_vdpa.ko kernel module. These new ioctls enable the full VIRTIO device model, including device IDs, the status register, configuration space, and so on.

In theory vhost-user could be fixed without vDPA, but it would involve effectively adding the same set of functionality that vDPA has already added onto kernel vhost. Reusing the vDPA ioctls allows VMMs to support both kernel vDPA and vhost-user with minimal code duplication.

This can be done by adding a VHOST_USER_PROTOCOL_F_VDPA feature bit to the vhost-user protocol. If both the vhost-user frontend and backend support vDPA then all vDPA messages are available. Otherwise they can either fall back on legacy vhost-user behavior or drop the connection.

The vhost-user specification could be split into a legacy section and a modern vDPA-enabled section. The modern protocol will remove vhost-user messages that are not needed by vDPA, simplifying the protocol for new developers while allowing existing implementations to support both with minimal changes.

One detail is that vDPA does not use the memory table mechanism for sharing memory. Instead it relies on the richer IOMMU message family that is optional in vhost-user today. This approach can be adopted in vhost-user too, making the IOMMU code path standard for all implementations and dropping the memory table mechanism.

Add vhost-user vDPA to the vhost code

QEMU's hw/virtio/vhost*.c code supports kernel vhost, vhost-user, and kernel vDPA. A vhost-user vDPA mode must be added to implement the new protocol. It can be implemented as a combination of the vhost-user and kernel vDPA code already in QEMU. Most likely the existing vhost-user code can simply be extended to enable vDPA support if the backend supports it.

Only small changes to hw/net/virtio-net.c and net/vhost-user.c are needed to use vhost-user vDPA with net devices. At that point QEMU can connect to a vhost-user-net device backend and use vDPA extensions.

Add a vhost-user vDPA VIRTIO transport

Next a vhost-user-net device backend can be put together using QEMU's virtio-net emulation. A translation layer is needed between the vhost-user protocol and the VirtIODevice type in QEMU. This can be done by implementing a new VIRTIO transport alongside the existing pci, mmio, and ccw transports. The transport processes vhost-user protocol messages from the UNIX domain socket and invokes VIRTIO device emulation code inside QEMU. It acts as a VIRTIO bus so that virtio-net-device, virtio-blk-device, and other device types can be plugged in.

This piece is the most involved but the vhost-user protocol communication part was already implemented in the virtio-vhost-user prototype that I worked on previously. Most of the communication code can be reused and the remainder is implementing the VirtioBusClass interface.

To summarize, a new transport acts as the vhost-user device backend and invokes QEMU VirtIODevice methods in response to vhost-user protocol messages. The syntax would be something like --device virtio-net-device,id=virtio-net0 --device vhost-user-backend,device=virtio-net0,addr.type=unix,addr.path=/tmp/vhost-user-net.sock.

Where this converges with multi-process QEMU

At this point QEMU can run ad-hoc vhost-user backends using existing VIRTIO device models. It is possible to go further by creating a qemu-dev launcher executable that implements the vhost-user spec's "Backend program conventions". This way a minimal device emulator executable hosts the device instead of a full system emulator.

The requirements for this are similar to the multi-process QEMU effort, which aims to run QEMU devices as separate processes. One of the main open questions is how to design build system and Kconfig support for building minimal device emulator executables.

In the case of vhost-user-net the qemu-dev-vhost-user-net executable would contain virtio-net-device, vhost-user-backend, any netdevs the user wishes to include, a QMP monitor, and a vhost-user backend command-line interface.

Where does this leave us? QEMU's existing VIRTIO device models can be used as vhost-user devices and run in a separate processes from the VMM. It's a great way of reusing code and having the flexibility to deploy it in the way that makes most sense for the intended use case.

Conclusion

The vhost-user protocol can be simplified by adopting the vhost vDPA ioctls that have recently been introduced in Linux. This turns vhost-user into a VIRTIO transport and vhost-user devices become full VIRTIO devices. Existing VIRTIO device emulation code can then be reused in vhost-user device backends.

Monday, August 24, 2020

QEMU Internals: Event loops

This post explains event loops in QEMU v5.1.0 and their unique features compared to other event loop implementations. The APIs are not covered in detail here since they are explained in doc comments. Instead, the focus is on the big picture and why things work the way they do.

Event loops are central to many I/O-bound applications like network services and graphical desktop applications. QEMU also has I/O-bound work that fits well into an event loop. Examples include the QMP monitor, disk I/O, and timers.

An event loop monitors event sources for activity and invokes a callback function when an event occurs. This makes it possible to process multiple event sources within a single CPU thread. The application can appear to do multiple things at once without multithreading because it switches between handling different event sources. This architecture is common in JavaScript, Python Twisted/asyncio, and many other environments. Sometimes the event loop is hidden underneath coroutines or async/await language features (QEMU has coroutines but often the event loop is still used directly).
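To make the pattern concrete, here is a minimal sketch of a generic event loop built on poll(2). It is not QEMU code, just an illustration of monitoring file descriptors and dispatching callbacks from a single thread:

/* Minimal event loop sketch (not QEMU code): monitor file descriptors with
 * poll(2) and invoke a callback when one becomes readable. Assumes at most
 * 64 event sources for brevity. */
#include <poll.h>
#include <stddef.h>

typedef void (*event_cb)(int fd, void *opaque);

struct event_source {
    int fd;
    event_cb cb;
    void *opaque;
};

void event_loop_run(struct event_source *sources, size_t n)
{
    struct pollfd fds[64];

    for (;;) {
        for (size_t i = 0; i < n; i++) {
            fds[i].fd = sources[i].fd;
            fds[i].events = POLLIN;
        }
        if (poll(fds, n, -1) < 0) {
            return; /* error handling omitted */
        }
        for (size_t i = 0; i < n; i++) {
            if (fds[i].revents & POLLIN) {
                sources[i].cb(sources[i].fd, sources[i].opaque);
            }
        }
    }
}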

The most important event sources in QEMU are:

  • File descriptors such as sockets and character devices.
  • Event notifiers (implemented as eventfds on Linux).
  • Timers for delayed function execution.
  • Bottom-halves (BHs) for invoking a function in another thread or deferring a function call to avoid reentrancy.

Event loops and threads

QEMU has several different types of threads:

  • vCPU threads that execute guest code and perform device emulation synchronously with respect to the vCPU.
  • The main loop thread that runs the event loops (yes, there is more than one!) used by many QEMU components.
  • IOThreads that run event loops for device emulation concurrently with vCPUs and "out-of-band" QMP monitor commands.

It's worth explaining how device emulation interacts with threads. When guest code accesses a device register the vCPU thread traps the access and dispatches it to an emulated device. The device's read/write function runs in the vCPU thread. The vCPU thread cannot resume guest code execution until the device's read/write function returns. This means long-running operations like emulating a timer chip or disk I/O cannot be performed synchronously in the vCPU thread since they would block the vCPU. Most devices solve this problem using the main loop thread's event loops. They add timer or file descriptor monitoring callbacks to the main loop and return back to guest code execution. When the timer expires or the file descriptor becomes ready the callback function runs in the main loop thread. The final part of emulating a guest timer or disk access therefore runs in the main loop thread and not a vCPU thread.
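As a rough illustration, here is a hedged sketch of that timer pattern. The device and callback names are hypothetical; only timer_new_ns(), timer_mod(), and qemu_clock_get_ns() are real QEMU functions:

/* Hedged sketch: a device register write arms a QEMU timer so the expensive
 * part runs later in the main loop thread, not the vCPU thread.
 * MyDeviceState and the callbacks are hypothetical names. */
#include "qemu/osdep.h"
#include "qemu/timer.h"

typedef struct MyDeviceState {
    QEMUTimer *timer;
} MyDeviceState;

static void my_timer_cb(void *opaque)
{
    /* Runs in the main loop thread when the timer expires, e.g. raise the
     * emulated timer chip's interrupt. */
}

static void my_device_realize(MyDeviceState *s)
{
    s->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, my_timer_cb, s);
}

static void my_device_write(MyDeviceState *s, uint64_t delay_ns)
{
    /* Called from the vCPU thread; just arm the timer and return so the
     * vCPU can resume guest code execution immediately. */
    timer_mod(s->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + delay_ns);
}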

Some devices perform the guest device register access in the main loop thread or an IOThread thanks to ioeventfd. ioeventfd is a Linux KVM API and also has a userspace fallback implementation for TCG that traps vCPU device accesses and writes to a file descriptor so another thread can handle the access.

The key point is that vCPU threads do not run an event loop. The main loop thread and IOThreads run event loops. vCPU threads can add event sources to the main loop or IOThread event loops. Callbacks run in the main loop thread or IOThreads.

How the main loop and IOThreads differ

The main loop and IOThreads share some code but are fundamentally different. The common code is called AioContext and is QEMU's native event loop API. Commonly used functions include aio_set_fd_handler(), aio_set_event_notifier(), aio_timer_init(), and aio_bh_new().
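Here is a hedged sketch of what using these APIs might look like. The callback names are hypothetical and the aio_set_fd_handler() argument list shown roughly matches QEMU v5.1; the exact signature changes between releases:

/* Hedged sketch of AioContext usage: schedule a bottom-half and monitor a
 * file descriptor in the main loop's qemu_aio_context. */
#include "qemu/osdep.h"
#include "block/aio.h"
#include "qemu/main-loop.h"

static void my_bh_cb(void *opaque)
{
    /* Deferred work runs here in the event loop thread. */
}

static void my_fd_read_cb(void *opaque)
{
    /* The file descriptor became readable. */
}

static void add_event_sources(int fd, void *opaque)
{
    AioContext *ctx = qemu_get_aio_context(); /* the main loop's qemu_aio_context */

    QEMUBH *bh = aio_bh_new(ctx, my_bh_cb, opaque);
    qemu_bh_schedule(bh);

    aio_set_fd_handler(ctx, fd, false /* is_external */,
                       my_fd_read_cb, NULL /* io_write */,
                       NULL /* io_poll */, opaque);
}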

The main loop actually has a glib GMainContext and two AioContext event loops. QEMU components can use any of these event loop APIs and the main loop combines them all into a single event loop function os_host_main_loop_wait() that calls qemu_poll_ns() to wait for event sources. This makes it possible to combine glib-based code with code using the native QEMU AioContext APIs.

The reason why the main loop has two AioContexts is that one, called iohandler_ctx, is used to implement the older qemu_set_fd_handler() APIs whose handlers should not run when the other AioContext, called qemu_aio_context, is run using aio_poll(). The QEMU block layer and newer code use qemu_aio_context while older code uses iohandler_ctx. Over time it may be possible to unify the two by converting iohandler_ctx handlers to safely execute in qemu_aio_context.

IOThreads have an AioContext and a glib GMainContext. The AioContext is run using the aio_poll() API, which enables the advanced features of the event loop. If a glib event loop is needed then the GMainContext can be run using g_main_loop_run() and the AioContext event sources will be included.
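As a hedged sketch, a component might create an IOThread and attach an event source to its AioContext roughly like this (names other than iothread_create(), iothread_get_aio_context(), and aio_set_fd_handler() are hypothetical, error handling is omitted, and the signatures again match roughly QEMU v5.1):

/* Hedged sketch: create an IOThread and attach a file descriptor handler to
 * its AioContext so the callback runs in the IOThread instead of the main
 * loop thread. */
#include "qemu/osdep.h"
#include "sysemu/iothread.h"
#include "block/aio.h"
#include "qapi/error.h"

static void my_fd_read_cb(void *opaque)
{
    /* Runs in the IOThread when the file descriptor becomes readable. */
}

static void attach_to_iothread(int fd, void *opaque)
{
    IOThread *iothread = iothread_create("my-iothread", &error_abort);
    AioContext *ctx = iothread_get_aio_context(iothread);

    aio_set_fd_handler(ctx, fd, false, my_fd_read_cb, NULL, NULL, opaque);
}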

Code that relies on the AioContext aio_*() APIs will work with both the main loop and IOThreads. Older code using qemu_*() APIs only works with the main loop. glib code works with both the main loop and IOThreads.

The key difference between the main loop and IOThreads is that the main loop uses a traditional event loop that calls qemu_poll_ns() while the IOThread AioContext's aio_poll() has advanced features that result in better performance.

AioContext features

AioContext has the following event loop features that traditional event loops do not have:

  • Adaptive polling support for lower latency but slightly higher CPU consumption. AioContext event sources can have a userspace polling function that detects events without performing syscalls (e.g. peeking at a memory location); a sketch of such a polling function follows this list. This allows the event loop to avoid blocking syscalls that might lead the host kernel scheduler to yield the thread and put the physical CPU into a low power state. Keeping the CPU busy and avoiding entering the kernel minimizes latency.
  • O(1) time complexity with respect to the number of monitored file descriptors. When there are thousands of file descriptors O(n) APIs like poll(2) spend time scanning over all file descriptors, even those that have no activity. This scalability bottleneck can be avoided with the Linux io_uring and epoll APIs, both of which are supported by AioContext's aio_poll().
  • Nanosecond timers. glib's event loop only has millisecond timers, which is not sufficient for emulating hardware timers.
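Here is a hedged sketch of the adaptive polling callback mentioned in the first item above (the AioPollFn passed as the io_poll argument of aio_set_fd_handler()); the queue structure is made up for illustration:

/* Hedged sketch of an AioContext io_poll callback: detect pending work by
 * peeking at memory instead of blocking in a syscall. MyQueueState is a
 * made-up illustration, not a real QEMU type. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint16_t shadow_avail_idx;     /* last index published by the producer */
    uint16_t last_seen_avail_idx;  /* last index this handler consumed */
} MyQueueState;

static bool my_queue_poll(void *opaque)
{
    MyQueueState *q = opaque;

    /* Return true if new buffers have appeared since we last looked, i.e.
     * the event loop has work to do without making any syscall. */
    return q->shadow_avail_idx != q->last_seen_avail_idx;
}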

These features are required for performance reasons. Unfortunately glib's event loop does not support them, otherwise QEMU could use GMainContext as its only event loop.

Conclusion

QEMU uses both its native AioContext event loop and glib's GMainContext. The QEMU main loop and IOThreads work differently, with IOThreads offering the best performance thanks to their AioContext aio_poll() event loops. Modern QEMU code should use AioContext APIs for optimal performance and so that the code can be used in both the main loop and IOThreads.

Thursday, December 22, 2011

QEMU 2011 Year in Review

As 2011 comes to an end I want to look back at the highlights from the QEMU community this year. Development progress feels good, the mailing list is very active, and QEMU's future looks bright. I only started contributing in 2010 but the growth since QEMU's early days must be enormous. Perhaps someone will make a source history visualization that shows the commit history and clusters of activity.

Here is the recap of the milestones that QEMU reached in 2011.

QEMU 0.14

In February the 0.14 release came out with a bunch of exciting new features.

For full details see the changelog.

QEMU 0.15

In August the 0.15 release brought yet more cool improvements.

For full details see the changelog.

Google Summer of Code

QEMU participated in Google Summer of Code 2011 and received funding for students to contribute to QEMU during the summer. Behind the scenes this takes an awful lot of work, not only from the students themselves but also from the mentors. These four projects were successfully completed:

  • Boot Mac OS >=8.5 on PowerPC system emulation
  • QED <-> QCOW2 image conversion utility
  • Improved VMDK image format compatibility
  • Adding NeXT emulation support

Hopefully we can continue to participate in GSoC and give students an opportunity to get involved with open source emulation and virtualization.

QEMU 1.0

The final QEMU release for 2011 was in December. The release announcement was picked up quite widely, and after it hit Hacker News and Reddit it took some effort to keep the QEMU website up. I think that's a good sign, although QEMU 1.0 is kind of like Linux 3.0 in that the version number change does not represent a fundamentally new codebase or architecture. Here are some of the changes:

  • Xtensa target architecture
  • TCG Interpreter interprets portable bytecode instead of translating to native machine code

For full details see the changelog.

Ongoing engineering efforts

There is a lot of change in motion as the year ends. Here are long-term efforts that are unfolding right now:

  • Jan Kiszka has made a lot of progress in the quest to merge qemu-kvm back into QEMU. In a way this is similar to Xen's QEMU fork, which was merged back earlier this year. This is a great effort because once the trees have been unified there will be no more confusion over qemu-kvm vs qemu.
  • Avi Kivity took on the interfaces for guest memory management and is in the process of revamping them. This touches not only the core concept of how QEMU registers and tracks guest memory but also every single emulated device.
  • Anthony Liguori is working on an object model that will make emulated devices and all resources managed by QEMU consistent and accessible via APIs. I think of this like introducing sysfs so that there is a single hierarchy and a way to explore and manipulate everything QEMU knows about.

Looking forward to 2012

It is hard to pick the highlights to mention but I hope this summary has given you a few links to click and brought back cool features you forgot about :). Have a great new year!

Thursday, November 17, 2011

Pictures from the Computer History Museum

I visited the Computer History Museum in Mountain View, CA when attending Google Summer of Code Mentor Summit 2011. The museum is fantastic for anyone with an interest in computers because they have the actual machines there - PDP-11, VAX, System 360, and much more. I'd like to go back again because I didn't finish seeing everything before closing time :).

Here are some highlights from the museum, selected more for fun than historic value:

IBM flowcharting tools: This is how I design all my QEMU patches.

Fortran, the next Ruby on Rails?

The neat thing is that I was actually on the same flight back home as the GNU Fortran hackers who attended the Mentor Summit. I think they liked the badges :).

System 360 is not as big as you imagine

Alexander Graf probably has one in his bedroom :).

Go and visit if you get a chance. They have the actual teapot that the famous OpenGL teapot is modelled after!

Monday, September 19, 2011

Enhanced VMDK support now in QEMU

QEMU now has greatly enhanced VMDK VMware disk image file support, thanks to Fam Zheng's hard work during Google Summer of Code 2011. Previously QEMU was only able to handle older VMDK files because it did not support the entire VMDK file format specification. This resulted in qemu-img convert and other tools being unable to open certain VMDK files. As of now, qemu.git has merged code that handles the VMDK specification and works well with modern image files.

If you had trouble in the past manipulating VMDK files with qemu-img, it may be worth another look soon. You can already build the latest and greatest qemu-img from the qemu.git repository and distros will provide packages with full VMDK support in the future:

$ git clone git://git.qemu.org/qemu.git
$ cd qemu
$ ./configure
$ make qemu-img
$ ./qemu-img convert ...

It is still recommended to convert VMDK files to QEMU's native formats (raw, qcow2, or qed) in order to get optimal performance for running VMs.

At the end of Fam's Summer of Code project, he put together an article on gotchas and undocumented behavior in the VMDK specification. This will be of great interest to developers writing their own code to manipulate VMDK image files. His work this summer involved testing a wide range of VMware software and real-world image files, as well as studying existing open-source VMDK code. The VMDK specification is ambiguous in places and does not cover several essential details, so Fam had to figure them out himself and then document them.

So with that, happy VMDK-ing...

Sunday, September 11, 2011

How to share files instantly between virtual machines and host

It's pretty common to need to copy files between a virtual machine and the host. This might mean copying drivers or installers from the host into the virtual machine, or getting some data out of the virtual machine and onto the host.

There are several solutions to sharing files, like network file systems, third-party file hosting, or even the virtfs paravirtualized file system supported by KVM. But my favorite ad-hoc file sharing tool is Python's built-in webserver.

The one-liner that shares files over HTTP

To share some files, simply change into the directory subtree you want to export and start the web server:

$ cd path/to/shared/files && python -m SimpleHTTPServer
Serving HTTP on 0.0.0.0 port 8000 ...

The directory hierarchy at path/to/shared/files is now available over HTTP on all IPs for the machine on port 8000. The web server generates index listings for directories so it is easy to browse.

To access the host from a virtual machine:

  • User networking (slirp): http://10.0.2.2:8000/
  • NAT networking: the default gateway from ip route show | grep ^default, for example http://192.168.122.1:8000/
  • Bridged or routed networking: the IP address of the host

To access the virtual machine from the host:

  • NAT, bridged, or routed networking: the virtual machine's IP address
  • User networking (slirp): forward a port with the hostfwd_add QEMU monitor command or -net user,hostfwd= option documented in the QEMU man page

Advantages

There are a couple of reasons why I like Python's built-in web server:

  • No need to install software since Linux and Mac typically already have Python installed.
  • No privileges are required to run the web server.
  • Works for both physical and virtual machines.
  • HTTP is supported by desktop file managers, browsers, and the command-line with wget or curl.
  • Network booting and installation is possible straight from the web server. Be warned that RHEL installs over HTTP have been known to fail with SimpleHTTPServer but I have not encountered problems with other software.

Security tips

This handy HTTP server is only suitable for trusted networks for a couple of reasons:

  • Encryption is not supported so data travels in plain text and can be tampered with in flight.
  • The directory listing feature means you should only export directory subtrees that contain no sensitive data.
  • SimpleHTTPServer is not a production web server and it may have bugs.

Conclusion

Python's built-in web server is one of my favorite tricks that not many people seem to know. I hope this handy command comes in useful to you!

Wednesday, September 7, 2011

QEMU Internals: vhost architecture

This post explains how vhost provides in-kernel virtio devices for KVM. I have been hacking on vhost-scsi and have answered questions about ioeventfd, irqfd, and vhost recently, so I thought this would be a useful QEMU Internals post.

Vhost overview

The vhost drivers in Linux provide in-kernel virtio device emulation. Normally the QEMU userspace process emulates I/O accesses from the guest. Vhost puts virtio emulation code into the kernel, taking QEMU userspace out of the picture. This allows device emulation code to directly call into kernel subsystems instead of performing system calls from userspace.

The vhost-net driver emulates the virtio-net network card in the host kernel. Vhost-net is the oldest vhost device and the only one which is available in mainline Linux. Experimental vhost-blk and vhost-scsi devices have also been developed.

In Linux 3.0 the vhost code lives in drivers/vhost/. Common code that is used by all devices is in drivers/vhost/vhost.c. This includes the virtio vring access functions which all virtio devices need in order to communicate with the guest. The vhost-net code lives in drivers/vhost/net.c.

The vhost driver model

The vhost-net driver creates a /dev/vhost-net character device on the host. This character device serves as the interface for configuring the vhost-net instance.

When QEMU is launched with -netdev tap,vhost=on it opens /dev/vhost-net and initializes the vhost-net instance with several ioctl(2) calls. These are necessary to associate the QEMU process with the vhost-net instance, prepare for virtio feature negotiation, and pass the guest physical memory mapping to the vhost-net driver.
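For illustration, here is a hedged sketch of that initialization sequence using the vhost ioctls from linux/vhost.h. It is heavily simplified - error handling and the real memory table contents are omitted, and QEMU's hw/vhost.c is the authoritative implementation:

/* Hedged sketch of vhost-net setup from userspace (simplified, no error
 * handling, memory table contents omitted). */
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

int vhost_net_setup(void)
{
    int vhost_fd = open("/dev/vhost-net", O_RDWR);
    uint64_t features;

    /* Claim this vhost instance for the calling process. */
    ioctl(vhost_fd, VHOST_SET_OWNER, NULL);

    /* Negotiate virtio feature bits. */
    ioctl(vhost_fd, VHOST_GET_FEATURES, &features);
    ioctl(vhost_fd, VHOST_SET_FEATURES, &features);

    /* Tell the driver where guest RAM lives so it can translate guest
     * physical addresses found in the vrings. */
    struct vhost_memory *mem = calloc(1, sizeof(*mem) +
                                      sizeof(struct vhost_memory_region));
    mem->nregions = 1;
    /* mem->regions[0] = ... guest RAM mapping ... */
    ioctl(vhost_fd, VHOST_SET_MEM_TABLE, mem);

    /* Per-virtqueue setup follows with VHOST_SET_VRING_ADDR,
     * VHOST_SET_VRING_KICK, and VHOST_SET_VRING_CALL. */
    return vhost_fd;
}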

During initialization the vhost driver creates a kernel thread called vhost-$pid, where $pid is the QEMU process pid. This thread is called the "vhost worker thread". The job of the worker thread is to handle I/O events and perform the device emulation.

In-kernel virtio emulation

Vhost does not emulate a complete virtio PCI adapter. Instead it restricts itself to virtqueue operations only. QEMU is still used to perform virtio feature negotiation and live migration, for example. This means a vhost driver is not a self-contained virtio device implementation; it depends on userspace to handle the control plane while the data plane is done in-kernel.

The vhost worker thread waits for virtqueue kicks and then handles buffers that have been placed on the virtqueue. In vhost-net this means taking packets from the tx virtqueue and transmitting them over the tap file descriptor.

File descriptor polling is also done by the vhost worker thread. In vhost-net the worker thread wakes up when packets come in over the tap file descriptor and it places them into the rx virtqueue so the guest can receive them.

Vhost as a userspace interface

One surprising aspect of the vhost architecture is that it is not tied to KVM in any way. Vhost is a userspace interface and has no dependency on the KVM kernel module. This means other userspace code, like libpcap, could in theory use vhost devices if it finds them to be convenient high-performance I/O interfaces.

When a guest kicks the host because it has placed buffers onto a virtqueue, there needs to be a way to signal the vhost worker thread that there is work to do. Since vhost does not depend on the KVM kernel module they cannot communicate directly. Instead vhost instances are set up with an eventfd file descriptor which the vhost worker thread watches for activity. The KVM kernel module has a feature known as ioeventfd for taking an eventfd and hooking it up to a particular guest I/O exit. QEMU userspace registers an ioeventfd for the VIRTIO_PCI_QUEUE_NOTIFY hardware register access which kicks the virtqueue. This is how the vhost worker thread gets notified by the KVM kernel module when the guest kicks the virtqueue.

On the return trip from the vhost worker thread to interrupting the guest a similar approach is used. Vhost takes a "call" file descriptor which it will write to in order to kick the guest. The KVM kernel module has a feature called irqfd which allows an eventfd to trigger guest interrupts. QEMU userspace registers an irqfd for the virtio PCI device interrupt and hands it to the vhost instance. This is how the vhost worker thread can interrupt the guest.
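Here is a hedged sketch of how the kick and call eventfds might be wired up between KVM and vhost using the KVM_IOEVENTFD, KVM_IRQFD, VHOST_SET_VRING_KICK, and VHOST_SET_VRING_CALL ioctls. It is simplified for illustration (no error handling, an MMIO notify register is assumed) and is not QEMU's actual code:

/* Hedged sketch: connect guest->host kicks and host->guest interrupts
 * between the KVM kernel module and a vhost instance via two eventfds. */
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>
#include <linux/vhost.h>

void wire_up_notifications(int vm_fd, int vhost_fd, uint64_t notify_addr,
                           unsigned int queue_index, unsigned int gsi)
{
    int kick_fd = eventfd(0, 0);
    int call_fd = eventfd(0, 0);

    /* Guest -> host: a write to the virtqueue notify register signals
     * kick_fd directly in the KVM kernel module (ioeventfd). An MMIO
     * register is assumed; PIO would need KVM_IOEVENTFD_FLAG_PIO. */
    struct kvm_ioeventfd kick = {
        .addr = notify_addr,
        .len = 2,
        .fd = kick_fd,
    };
    ioctl(vm_fd, KVM_IOEVENTFD, &kick);

    /* The vhost worker thread waits on the same eventfd for work. */
    struct vhost_vring_file kick_file = { .index = queue_index, .fd = kick_fd };
    ioctl(vhost_fd, VHOST_SET_VRING_KICK, &kick_file);

    /* Host -> guest: vhost signals call_fd and KVM's irqfd turns that into
     * a guest interrupt on the given GSI. */
    struct vhost_vring_file call_file = { .index = queue_index, .fd = call_fd };
    ioctl(vhost_fd, VHOST_SET_VRING_CALL, &call_file);

    struct kvm_irqfd irqfd = { .fd = call_fd, .gsi = gsi };
    ioctl(vm_fd, KVM_IRQFD, &irqfd);
}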

In the end the vhost instance only knows about the guest memory mapping, a kick eventfd, and a call eventfd.

Where to find out more

Here are the main points to begin exploring the code:
  • drivers/vhost/vhost.c - common vhost driver code
  • drivers/vhost/net.c - vhost-net driver
  • virt/kvm/eventfd.c - ioeventfd and irqfd
The QEMU userspace code shows how to initialize the vhost instance:
  • hw/vhost.c - common vhost initialization code
  • hw/vhost_net.c - vhost-net initialization

Wednesday, June 8, 2011

LinuxCon Japan 2011 KVM highlights

I had the opportunity to attend LinuxCon Japan 2011 in Yokohama. The conference ran many interesting talks, with a virtualization mini-summit taking place over the first two days. Here are highlights from the conference and some of the latest KVM-related presentation slides.

KVM Ecosystem

Jes Sorensen from Red Hat presented the KVM Weather Report, a status update and look at recent KVM work. Check out his slides if you want an overview of where KVM is today. The main points that I identify are:
  • KVM performance is excellent and continues to improve.
  • In the future, expect advanced features that round out KVM's strengths.
Jes' presentation ends on a nice note with a weather forecast that reads "Cloudy with a chance of total world domination" :).

Virtualization End User Panel

The end user panel brought together three commercial users of open virtualization software to discuss their deployments and answer questions from the audience. I hope a video of this panel will be made available because it is interesting to understand how virtualization is put to use.

All three users were in the hosting or cloud market. They were split across KVM, Xen, and one provider running KVM for its standard offering and VMware for its premium offering. It is clear that KVM is becoming the hypervisor of choice for hosting due to its low cost and good Linux integration - the Xen user had started several years ago but is now evaluating KVM. My personal experience with USA and UK hosting is that Xen is still widely deployed although KVM is growing quickly.

All three users rely on custom management tools and web interfaces. Although libvirt is being used the management layers above it are seen as an opportunity to differentiate. The current breed of open cloud and virtualization management tools weren't seen as mature or scalable enough. I expect this area of the virtualization stack to solidify with several of the open source efforts consolidating in order to reach critical mass.

Storage and Networking

Jes Sorensen from Red Hat covered the KVM Live Snapshot Support work that will allow disk snapshots to be taken for backup and other purposes without stopping the VM. My own talk gave An Updated Overview of the QEMU Storage Stack and covered other current work in the storage area, as well as explaining the most important storage configuration settings when using QEMU.

Stephen Hemminger from Vyatta presented an overview of Virtual Networking Performance. KVM is looking pretty good relative to other hypervisors, no doubt thanks to all the work that has gone into network performance under KVM. Michael Tsirkin and many others are still optimizing virtio-net and vhost_net to reduce latency, improve throughput, and reduce CPU consumption and I think the results justify virtio and paravirtualized I/O. KVM is able to continue improving network performance by extending its virtio host<->guest interface in order to work more efficiently - something that is impossible when emulating existing real-world hardware.

Other KVM-related talks

Isaku Yamahata from VA Linux gave his Status Update on QEMU PCI Express Support. His goal is PCI device assignment of PCI Express adapters. I think his work is important for QEMU in the long term since eventually hardware emulation needs to be able to present modern devices in order for guest operating systems to function.

Guangrong Xiao from Fujitsu presented KVM Memory Virtualization Progress, which describes how guest memory access works. It covers both hardware MMU virtualization in modern processors as well as the software solution used on older hardware. Kernel Samepage Merging (KSM) and Transparent Hugepages (THP) are also touched upon. This is a good technical overview if you want to understand how memory management virtualization works.

More slides and videos

The full slides and videos for talks should be available within the next few weeks. Here are the links to the LinuxCon Japan 2011 main schedule and virtualization mini-summit schedule.

Saturday, April 23, 2011

How to capture VM network traffic using qemu -net dump

This post describes how to save a packet capture of the network traffic a QEMU
virtual machine sees. This feature is built into QEMU and works with any
emulated network card and any host network device except vhost-net.

It's relatively easy to use tcpdump(8) with tap networking. First the
tap device for the particular VM needs to be identified and then packets can be
captured:
# tcpdump -i vnet0 -s0 -w /tmp/vm0.pcap

The tcpdump(8) approach cannot be easily used with non-tap host network devices, including slirp and socket.

Using the dump net client

Packet capture is built into QEMU and can be done without tcpdump(8). There are some restrictions:
  1. The vhost-net host network device is not supported because traffic does not cross QEMU so interception is not possible.
  2. The old-style -net command-line option must be used instead of -netdev because the dump net client depends on the mis-named "vlan" feature (essentially a virtual network hub).

Without further ado, here is an example invocation:
$ qemu -net nic,model=e1000 -net dump,file=/tmp/vm0.pcap -net user
This presents the VM with an Intel e1000 network card using QEMU's userspace network stack (slirp). The packet capture will be written to /tmp/vm0.pcap. After shutting down the VM, either inspect the packet capture on the command-line:
$ /usr/sbin/tcpdump -nr /tmp/vm0.pcap

Or open the pcap file with Wireshark.