
Wednesday, June 25, 2025

Profiling tools I use for QEMU storage performance optimization

A fair amount of the development work I do is related to storage performance in QEMU/KVM. Although I have written about disk I/O benchmarking and my performance analysis workflow in the past, I haven't covered the performance tools that I frequently rely on. In this post I'll go over what's in my toolbox and hopefully this will be helpful to others.

Performance analysis is hard when the system is slow but there is no clear bottleneck. If a CPU profile shows that a function is consuming a significant amount of time, then that's a good target for optimization. On the other hand, if the profile is uniform and each function only consumes a small fraction of time, then it is difficult to gain much by optimizing just one function (although taking function call nesting into account may point towards parent functions that can be optimized).

If you are measuring just one metric, then eventually the profile will become uniform and there isn't much left to optimize. It helps to measure at multiple layers of the system in order to increase the chance of finding bottlenecks.

Here are the tools I like to use when hunting for QEMU storage performance optimizations.

kvm_stat

kvm_stat is a tool that runs on the host and counts events from the kvm.ko kernel module, including device register accesses (MMIO), interrupt injections, and more. These are often associated with vmentry/vmexit events where control passes between the guest and the hypervisor. Each time this happens there is a cost and it is preferable to minimize the number of vmentry/vmexit events.

kvm_stat will let you identify inefficiencies in how guest drivers access devices, as well as low-level activity (MSRs, nested page tables, etc.) that can be optimized.
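
Running kvm_stat is straightforward. As a sketch (it ships with the kernel source tree's tools and needs root access to the KVM debugfs/tracepoint counters; option availability may vary between versions):

sudo kvm_stat        # interactive view, refreshed periodically
sudo kvm_stat -l     # vmstat-style logging mode, if your version supports it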

Here is output from an I/O intensive workload:

 Event                                         Total %Total CurAvg/s
 kvm_entry                                   1066255   20.3    43513
 kvm_exit                                    1066266   20.3    43513
 kvm_hv_timer_state                           954944   18.2    38754
 kvm_msr                                      487702    9.3    19878
 kvm_apicv_accept_irq                         278926    5.3    11430
 kvm_apic_accept_irq                          278920    5.3    11430
 kvm_vcpu_wakeup                              250055    4.8    10128
 kvm_pv_tlb_flush                             250000    4.8    10123
 kvm_msi_set_irq                              229439    4.4     9345
 kvm_ple_window_update                        213309    4.1     8836
 kvm_fast_mmio                                123433    2.3     5000
 kvm_wait_lapic_expire                         39855    0.8     1628
 kvm_apic_ipi                                   9614    0.2      457
 kvm_apic                                       9614    0.2      457
 kvm_unmap_hva_range                               9    0.0        1
 kvm_fpu                                          42    0.0        0
 kvm_mmio                                         28    0.0        0
 kvm_userspace_exit                               21    0.0        0
 vcpu_match_mmio                                  19    0.0        0
 kvm_emulate_insn                                 19    0.0        0
 kvm_pio                                           2    0.0        0
 Total                                       5258472          214493

Here I'm looking for high CurAvg/s rates. Any counters at 100k/s are well worth investigating.

Important events:

  • kvm_entry/kvm_exit: vmentries and vmexits are when the CPU transfers control between guest mode and the hypervisor.
  • kvm_msr: Model Specific Register accesses
  • kvm_msi_set_irq: Interrupt injections (Message Signalled Interrupts)
  • kvm_fast_mmio/kvm_mmio/kvm_pio: Device register accesses

sysstat

sysstat is a venerable suite of performance monitoring tools covering CPU, network, disk, and memory activity. It can be used equally within guests and on the host. It is like an extended version of the classic vmstat(8) tool.

The mpstat, pidstat, and iostat tools are the ones I use most often:

  • mpstat reports CPU consumption, including %usr (userspace), %sys (kernel), %guest (guest mode), %steal (hypervisor), and more. Use this to find CPUs that are maxed out or cases of poor parallelism/multi-threading.
  • pidstat reports per-process and per-thread statistics. This is useful for identifying specific threads that are consuming resources.
  • iostat reports disk statistics. This is useful for comparing I/O statistics between bare metal and guests.
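
As a rough sketch, these are typical invocations (the QEMU process name is assumed to be qemu-system-x86_64):

mpstat -P ALL 1                                # per-CPU utilization every second
pidstat -t -p $(pidof qemu-system-x86_64) 1    # per-thread statistics for the QEMU process
iostat -dx 1                                   # extended per-disk statistics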

blktrace

blktrace monitors Linux block I/O activity. This can be used equally within guests and on the host. Often it's interesting to capture traces in both the guest and on the host so they can be compared. If the I/O pattern is different then something in the I/O stack is modifying requests. That can be a sign of a misconfiguration.

The blktrace data can be analyzed and plotted with the btt command. For example, the latencies from driver submission to completion can be summarized to find the overhead compared to bare metal.
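
A sketch of the workflow I have in mind, assuming /dev/sdb is the disk under test (run the equivalent inside the guest against its virtual disk):

blktrace -d /dev/sdb -o trace     # capture block layer events (stop with Ctrl-C)
blkparse -i trace -d trace.bin    # merge the per-CPU files into a single binary trace
btt -i trace.bin                  # latency summaries, e.g. D2C (driver submission to completion)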

perf-top

In its default mode, perf-top is a sampling CPU profiler. It periodically collects the CPU's program counter value so that a profile can be generated showing hot functions. It supports call graph recording with the -g option if you want to aggregate nested function calls and find out which parent functions are responsible for the most CPU usage.

perf-top (and its non-interactive perf-record/perf-report cousin) is good at identifying inner loops of programs, excessive memcpy/memset, expensive locking instructions, instructions with poor cache hit rates, etc. When I use it to profile QEMU it shows the host kernel and QEMU userspace activity. It does not show guest activity.
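
For QEMU I usually attach to the running process, assuming the standard binary name; add -g to record call graphs:

perf top -p $(pidof qemu-system-x86_64)
perf top -g -p $(pidof qemu-system-x86_64)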

Here is the output without call graph recording where we can see vmexit activity, QEMU virtqueue processing, and excessive memset at the top of the profile:

   3.95%  [kvm_intel]                 [k] vmx_vmexit
   3.37%  qemu-system-x86_64          [.] virtqueue_split_pop
   2.64%  libc.so.6                   [.] __memset_evex_unaligned_erms
   2.50%  [kvm_intel]                 [k] vmx_spec_ctrl_restore_host
   1.57%  [kernel]                    [k] native_irq_return_iret
   1.13%  qemu-system-x86_64          [.] bdrv_co_preadv_part
   1.13%  [kernel]                    [k] sync_regs
   1.09%  [kernel]                    [k] native_write_msr
   1.05%  [nvme]                      [k] nvme_poll_cq

The goal is to find functions or families of functions that consume significant amounts of CPU time so they can be optimized. If the profile is uniform with most functions taking less than 1%, then the bottlenecks are more likely to be found with other tools.

perf-trace

perf-trace is an strace-like tool for monitoring system call activity. For performance monitoring it has a --summary option that shows the time spent in each system call along with call counts. This is how you can identify system calls that block for a long time or that are called too often.
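
As a sketch, attach to the QEMU process (or a single IOThread with -t <tid>) and press Ctrl-C after a while to print the summary:

perf trace --summary -p $(pidof qemu-system-x86_64)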

Here is an example from the host showing a summary of a QEMU IOThread that uses io_uring for disk I/O and ioeventfd/irqfd for VIRTIO device activity:

 IO iothread1 (332737), 1763284 events, 99.9%

   syscall            calls  errors  total       min       avg       max       stddev
                                     (msec)    (msec)    (msec)    (msec)        (%)
   --------------- --------  ------ -------- --------- --------- ---------     ------
   io_uring_enter    351680      0  8016.308     0.003     0.023     0.209      0.11%
   write             390189      0  1238.501     0.002     0.003     0.098      0.08%
   read              121057      0   305.355     0.002     0.003     0.019      0.14%
   ppoll              18707      0    62.228     0.002     0.003     0.023      0.32%

Conclusion

We looked at kvm_stat, sysstat, blktrace, perf-top, and perf-trace. They provide performance metrics from different layers of the system. Another worthy mention is bcc, a collection of eBPF-based tools covering a huge range of performance monitoring and troubleshooting tasks. Let me know which tools you like to use!

Thursday, June 30, 2022

Comparing VIRTIO, NVMe, and io_uring queue designs

Queues and their implementation using shared memory ring buffers are a standard tool for communicating with I/O devices and between CPUs. Although ring buffers are widely used, there is no standard memory layout and it's interesting to compare the differences between designs. When defining libblkio's APIs, I surveyed the ring buffer designs in VIRTIO, NVMe, and io_uring. This article examines some of the differences between their ring buffers and queue semantics.

Ring buffer basics

A ring buffer is a circular array where new elements are written or produced on one side and read or consumed on the other side. Often terms such as head and tail or reader and writer are used to describe the array indices at which the next element is accessed. When the end of the array is reached, one moves back to the start of the array. The empty and full conditions are special states that must be checked to avoid underflow and overflow.

VIRTIO, NVMe, and io_uring all use single producer, single consumer shared memory ring buffers. This allows a CPU and an I/O device or two CPUs to communicate across a region of memory to which both sides have access.

Embedding data in descriptors

At a minimum a ring buffer element, or descriptor, contains the memory address and size of a data buffer:

 Offset  Type  Name
 0x0     u64   buf
 0x8     u64   len

In a storage device the data buffer contains a request structure with information about the I/O request (logical block address, number of sectors, etc). In order to process a request, the device first loads the descriptor and then loads the request structure described by the descriptor. Performing two loads is sub-optimal and it would be faster to fetch the request structure in a single load.

Embedding the data buffer in the descriptor is a technique that reduces the number of loads. The descriptor layout looks like this:

 Offset  Type  Name
 0x0     u64   remainder_buf
 0x8     u64   remainder_len
 0x10    ...   request structure

The descriptor is extended to make room for the data. If the size of the data varies and is sometimes too large for a descriptor, then the remainder is put into an external buffer. The common case will only require a single load but larger variable-sized buffers can still be handled with 2 loads as before.

VIRTIO does not embed data in descriptors due to its layered design. The data buffers are defined by the device type (net, blk, etc) and virtqueue descriptors are one layer below device types. They have no knowledge of the data buffer layout and therefore cannot embed data.

NVMe embeds the request structure into the Submission Queue Entry. The Command Dword 10, 11, 12, 13, 14, and 15 fields contain the request data and their meaning depends on the Opcode (request type). I/O buffers are still external and described by Physical Region Pages (PRPs) or Scatter Gather Lists (SGLs).

io_uring's struct io_uring_sqe embeds the request structure. Only the I/O buffers need to be external: their size varies, they would be too large for the ring buffer, and zero-copy is typically desired due to the size of the data.

It seems that VIRTIO could learn from NVMe and io_uring. Instead of having small 16-byte descriptors, it could embed part of the data buffer into the descriptor so that devices need to perform fewer loads during request processing. The 12-byte struct virtio_net_hdr and 16-byte struct virtio_blk_req request headers would fit into a new 32-byte descriptor layout. I have not prototyped and benchmarked this optimization, so I don't know how effective it is.
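
To make the idea concrete, here is one purely hypothetical 32-byte layout (my sketch, not part of any VIRTIO specification) that keeps the existing split virtqueue descriptor fields and uses the remaining 16 bytes to embed the start of the data buffer:

 Offset  Type  Name
 0x0     u64   addr (remainder of the data buffer)
 0x8     u32   len
 0xc     u16   flags
 0xe     u16   next
 0x10    ...   first 16 bytes of the data buffer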

Descriptor chaining vs external descriptors

I/O requests often include variable size I/O buffers that require scatter-gather lists similar to POSIX struct iovec arrays. Long arrays don't fit into a descriptor so descriptors have fields that point to an external array of descriptors.

Another technique for scatter-gather lists is to chain descriptors together within the ring buffer instead of relying on memory external to the ring buffer. When descriptor chaining is used, I/O requests that don't fit into a single descriptor can occupy multiple descriptors.

Advantages of chaining are better cache locality when a sequence of descriptors is used and no need to allocate separate per-request external descriptor memory.

A consequence of descriptor chaining is that the maximum queue size, or queue depth, becomes variable. It is not possible to guarantee space for a specific number of I/O requests because the number of available descriptors depends on the chain sizes of the requests placed into the ring buffer.

VIRTIO supports descriptor chaining although drivers usually forego it when VIRTIO_F_RING_INDIRECT_DESC is available.

NVMe and io_uring do not support descriptor chaining, instead relying on embedded and external descriptors.

Limits on in-flight requests

The maximum number of in-flight requests depends on the ring buffer design. Designs where descriptors are occupied from submission until completion prevent descriptor reuse for other requests while the current request is in flight.

An alternative design is where the device processes submitted descriptors and they are considered free again as soon as the device has looked at them. This approach is natural when separate submission and completion queues are used and there is no relationship between the two descriptor rings.

VIRTIO requests occupy descriptors for the duration of their lifetime, at least in the Split Virtqueue format. Therefore the number of in-flight requests is influenced by the descriptor table size.

NVMe has separate Submission Queues and Completion Queues, but its design still limits the number of in-flight requests to the queue size. The Completion Queue Entry's SQ Head Pointer (SQHD) field precludes having more requests in flight than the Submission Queue size because the field would no longer be unique. Additionally, the driver has no way of detecting Submission Queue Head changes, so it only knows there is space for more submissions when completions occur.

io_uring has independent submission (SQ) and completion (CQ) queues with support for more in-flight requests than the ring buffer size. When there are more in-flight requests than CQ capacity, it's possible to overflow the CQ. io_uring has a backlog mechanism for this case, although the intention is for applications to size queues properly so they rarely hit the backlog.

Conclusion

VIRTIO, NVMe, and io_uring have slightly different takes on queue design. The semantics and performance vary due to these differences. VIRTIO lacks data embedding inside descriptors. io_uring supports more in-flight requests than the queue size. NVMe and io_uring rely on external descriptors with no ability to chain descriptors.

Friday, July 2, 2021

Slides available for "Bring Your Own Virtual Devices: Frameworks for Software and Hardware Device Virtualization"

The PDF slides for my "Bring Your Own Virtual Devices: Frameworks for Software and Hardware Device Virtualization" talk from the 16th Workshop on Virtualization in High-Performance Cloud Computing are now available.

This talk covers out-of-process device interfaces including vhost (kernel), vhost-user, Linux VFIO, mdev, vfio-user, vDPA, and VDUSE. It gives a brief overview of each interface, how it works, and how to develop your own devices.

The growing number of out-of-process device interfaces available in QEMU/KVM can make it hard to understand and compare them. Each of these interfaces is designed for different use cases: for example, whether you want to pass through hardware or implement the device in software, whether you want to implement your device in the host kernel or in host userspace, etc. This talk will give you the necessary knowledge to compare these interfaces yourself so you can decide which one is most appropriate for your use case.

For more information about the design of out-of-process device interfaces, see also my previous blog post about requirements for out-of-process devices.

Friday, March 12, 2021

Overcoming fear of public communication in open source

When given the choice between communicating in private or in public, many people opt for private communication. They send questions or unfinished patches to individuals instead of posting on mailing lists, forums, or chat rooms. This post explains why public communication is a faster and more efficient way for developers to communicate in open source communities than private communication. A big factor in the private vs public decision is psychological and I've provided a checklist to give you confidence when communicating in public.

Time and time again I find people initiating discussions through private channels when I know it would be advantageous to have them in public instead. I even keep an email reply template handy asking the sender to reach out to the public mailing list instead of communicating with me in private. Communicating in public is a good default unless you need confidentiality (security bugs, business reasons, etc). So why do people prefer to ask questions or send patches off-list?

Why we fear public communication

I'm interested in understanding why people avoid public communications channels in open source communities. The purpose of mailing lists, chat rooms, and forums is to engage in discussions and share knowledge. Members of these communities are interested in the topic and want to engage in discussion. But there are some common reasons I've found why people don't make use of public communications channels:

  • I don't want to create noise. Mailing lists are home to important discussions between experienced members of the community. Sending a relatively simple question that no expert would need to ask, or an unfinished patch, can make you doubt whether it deserves to be sent at all. This is a fallacy. As long as the question or patches are relevant to the community and you have done your homework (see the checklist below), there is no need to worry about creating noise.
  • I will look dumb or seem like a bad programmer. When you need help it's likely that your understanding is incomplete. We often hold ourselves to artificially high standards when asking for help in public, yet when we observe others interacting in public we don't hold their questions or unfinished code against them. Only if they show a repeated pattern of sloppy work does it damage their reputation. Asking for help in public won't harm your image.
  • Too much traffic. Big open source communities are often so active that no single person can keep track of everything that is going on. Even core members of the community rely on filtering only topics relevant to them. Since filtering becomes necessary at scale anyway, there's no need to worry about sending too many messages.

A note about tone: in the past people were more likely to be put off by the unfriendly tone in chat rooms and mailing lists. Over time the tone seems to have improved in general, probably due to multiple factors like open source becoming more professional, codes of conduct making participants aware of their behavior, etc. I don't want to cover the pros and cons of these factors, but suffice to say that nowadays tone is less of a problem and that's a good thing for everyone.

Why public communication is faster and more efficient

Next let's look at the reasons why public communication is beneficial:

Including the community from the start avoids waste

If you ask for help in private it's possible that the answer you get will not end up being consensus in the community. When you eventually send patches to the community they might disagree with the approach you settled on in private and you'll have to redo your work to get the patches merged. This can be avoided by including the community from the start.

Similarly, you can avoid duplicating work when multiple people are investigating the same problem in private without knowing about each other. If you discuss what you are doing in public then you can collaborate and avoid spending time creating multiple solutions, only one of which can be merged.

Public discussions create a searchable knowledge base

If you find a solution to your problem in private, that doesn't help others who have the same question. Public discussions are typically archived and searchable on the web. When the next person has the same question they will find the answer online and won't need to ask at all!

Additionally, everyone benefits and learns from each other when questions are asked in public. We soak up knowledge by following these discussions. If they are private then we don't have this opportunity.

Get a reply even if the person you asked is unavailable

The person you intended to reach may be away due to timezones, holidays, etc. A private discussion is blocked until they respond to your message. Public discussions, on the other hand, can progress even when the original participants are unavailable. You can get an answer to your question faster by asking in public.

Public activity gives visibility to your work

Private discussions are invisible to your teammates, managers, and other people affected by your progress. When you communicate in public they will be able to plan better, help out when needed, and give you credit for the effort you are putting in.

A checklist before you press Send

Hopefully I have encouraged you to go ahead and ask questions or send unfinished patches in public. Here is a checklist that will help you feel confident about communicating in public:

Before asking questions...

  1. Have you searched the web, documentation, and code?
  2. Have you added printfs, GDB breakpoints, or enabled tracing to understand the behavior of the system?

Before sending patches...

  1. Is the code formatted according to the coding standard?
  2. Are the error cases handled, memory allocation/ownership correctly implemented, and thread safety addressed? Most of the time you can figure these out yourself and forgetting to do so may distract from the topics you want help with.
  3. Did you add todo comments pointing out unfinished aspects of the code? Being upfront about what is missing helps readers understand the status of the code and saves them time trying to distinguish requirements you forgot from things that are simply not yet implemented.
  4. Do the commit descriptions explain the purpose of the code changes and does the cover letter give an overview of the patch series?
  5. Are you using git-format-patch --subject-prefix RFC (same for git-publish) to mark your patches as a "request for comment"? This tells people you are seeking input on unfinished code.

Finally, do you know who to CC on emails or mention in comments/chats? Look up the relevant maintainers or active developers using git-log(1), scripts/get_maintainer.pl, etc.
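
For kernel- and QEMU-style trees, something along these lines works (the file and patch names are only examples):

git log --oneline -5 -- path/to/file.c            # recent commit authors for a file
./scripts/get_maintainer.pl -f path/to/file.c     # maintainers and mailing lists for a file
./scripts/get_maintainer.pl 0001-my-change.patch  # maintainers and mailing lists for a patch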

Conclusion

Private communication can be slower, less efficient, and adds less value to an open source community. For many people public communication feels a little scary and they prefer to avoid it. I hope that by understanding the advantages of public communication you will be motivated to use public communication channels more.

Tuesday, March 29, 2011

How to use perf-probe

Dynamic tracepoints are an observability feature that allows functions
and even arbitrary lines of source code to be instrumented without recompiling
the program. Armed with a copy of the program's source code, tracepoints can
be placed anywhere at runtime and the value of variables can be dumped each
time execution passes the tracepoint. This is an extremely powerful technique
for instrumenting complex systems or code written by someone else.

I recently investigated issues with CD-ROM media change
inside a KVM guest. Using QEMU tracing I was able to
instrument the ATAPI CD-ROM emulation inside QEMU and observe the commands
being sent to the CD-ROM. I complemented this with perf-probe(1) to
instrument the sr and cdrom drivers in the Linux host and
guest kernels. The power of perf-probe(1) was key to understanding the
state of the CD-ROM drivers without recompiling a custom debugging kernel.

Overview of perf-probe(1)

The probe subcommand of perf(1) allows dynamic tracepoints to be added
or removed inside the Linux kernel. Both core kernel and modules can be
instrumented.

Besides instrumenting locations in the code, a tracepoint can also fetch
values from local variables, globals, registers, the stack, or memory. You can
combine this with perf-record -g callgraph recording to get stack
traces when execution passes probes. This makes perf-probe(1)
extremely powerful.

Requirements

A word of warning: use a recent kernel and perf(1) tool. I was unable
to use function return probes with Fedora 14's 2.6.35-based kernel. It
works fine with a 2.6.38 perf(1) tool. In general, I have found the
perf(1) tool to be somewhat unstable but this has improved a lot with
recent versions.

Plain function tracing can be done without installing kernel debuginfo
packages, but arbitrary source line probes and variable dumping require
the kernel debuginfo packages. On Debian-based distros the kernel
debuginfo packages are called linux-image-*-dbg and on Red Hat-based
distros they are called kernel-debuginfo.

Tracing function entry

The following command-line syntax adds a new function entry probe:
sudo perf probe <function-name> [<local-variable> ...]
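For example, to probe the cdrom driver's cdrom_open() function (an
illustration only; depending on your kernel you may need the -m
<module> option to resolve module symbols):
sudo perf probe cdrom_open
sudo perf record -e probe:cdrom_open -aR sleep 10
sudo perf script
sudo perf probe -d cdrom_open
Local variable names can be appended to the perf probe command to dump
their values, and adding -g to perf-record captures call graphs each
time the probe fires.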

Tracing function return

The following command-line syntax adds a new function return probe and dumps the return value:
sudo perf probe <function-name>%return '$retval'
Notice the arg1 return value in this example output for scsi_test_unit_ready():
$ sudo perf record -e probe:scsi_test_unit_ready -aR sleep 10
[...]
$ sudo perf script
     kworker/0:0-3383  [000]  3546.003235: scsi_test_unit_ready: (ffffffffa0208c6d <- ffffffffa0069306) arg1=8000002
 hald-addon-stor-3025  [001]  3546.004431: scsi_test_unit_ready: (ffffffffa0208c6d <- ffffffffa006a143) arg1=8000002
     kworker/0:0-3383  [000]  3548.004531: scsi_test_unit_ready: (ffffffffa0208c6d <- ffffffffa0069306) arg1=8000002
 hald-addon-stor-3025  [001]  3548.005531: scsi_test_unit_ready: (ffffffffa0208c6d <- ffffffffa006a143) arg1=8000002

Listing probes

Existing probes can be listed:
sudo perf probe -l

Removing probes

A probe can be deleted with:
sudo perf probe -d <probe-name>
All probes can be deleted with:
sudo perf probe -d '*'

Further information

See the perf-probe(1) man page for details on command-line syntax and additional features.

The underlying mechanism of perf-probe(1) is kprobe-based event tracing, which is documented in the Linux kernel's tracing documentation.

I hope that future Linux kernels will add perf-probe(1) support for
userspace processes. Nowadays GDB might already include this feature in
its tracing commands but I haven't had a chance to try them out yet.

Monday, February 21, 2011

Near instant kernel development cycle with KVM

I want to share my setup for rapid kernel development using KVM.

A fast development cycle makes a huge difference to productivity. For firmware and kernel development many areas can be efficiently tested inside virtual machines.

Traditionally physical test machines were used but virtualization lets you take the lab with you. This means working offline without giving up on testing.

In the past I used QEMU when working on the gPXE network bootloader. Now I am using KVM to test Linux kernel changes in less than 30 seconds and it's a really pleasant setup.

What can't be tested under KVM?


A lot of code can be tested in a virtual machine but device drivers or hardware-specific code often require physical machines. But with PCI device assignment, which passes physical PCI devices through to the virtual machine, it is becoming possible to test device drivers in a virtual machine too.

Testing kernels without disk images


Most virtual machines are booted from a disk image or an ISO file, but KVM can directly load a Linux kernel into memory skipping the bootloader. This means you don't need an image file containing the kernel and boot files. Instead, you can run a kernel directly like this:

qemu-kvm -kernel arch/x86/boot/bzImage -initrd initramfs.gz -append "console=ttyS0" -nographic

These flags directly load a kernel and initramfs from the host filesystem without the need to generate a disk image or configure a bootloader.

The optional -initrd flag loads an initramfs for the kernel to use as the root filesystem.

The -append flag adds kernel parameters and can be used to enable the serial console.

The -nographic option restricts the virtual machine to just a serial console and therefore keeps all test kernel output in your terminal rather than in a graphical window.

Building an initramfs


I don't use a distro initramfs generation utility because I like to control which files get included and the init script. Instead I use the linux-2.6/usr/gen_init_cpio utility to build an initramfs cpio archive from a specification file. A neat feature of gen_init_cpio is that you don't need to be root in order to create device files or set ownership inside the initramfs. The specification file syntax looks like this:

# a comment
file <name> <location> <mode> <uid> <gid> [<hard links>]
dir <name> <mode> <uid> <gid>
nod <name> <mode> <uid> <gid> <dev_type> <maj> <min>
slink <name> <target> <mode> <uid> <gid>
pipe <name> <mode> <uid> <gid>
sock <name> <mode> <uid> <gid>
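
As a sketch, a minimal specification file could look like this (the host-side paths /path/to/busybox and /path/to/init.sh are placeholders, and any kernel modules you need would be added as further file entries):

# hypothetical minimal initramfs spec
dir /bin 0755 0 0
dir /dev 0755 0 0
dir /proc 0755 0 0
dir /sys 0755 0 0
dir /tmp 0755 0 0
nod /dev/console 0600 0 0 c 5 1
file /bin/busybox /path/to/busybox 0755 0 0
slink /bin/sh busybox 0777 0 0
file /init /path/to/init.sh 0755 0 0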

The kernel will execute the file at /init. I include busybox in the initramfs and have the following script:

#!/bin/sh
mount -t proc none /proc
mount -t sysfs none /sys
mount -t configfs none /sys/kernel/config
mount -t debugfs none /sys/kernel/debug
mount -t tmpfs none /tmp

# Test setup commands here:
insmod /lib/modules/$(uname -r)/kernel/...

exec /bin/sh -i

Instead of building out a full /lib/modules directory tree I just include those kernel module dependencies that I need. This means I use insmod(8) instead of modprobe(8) because I skip generating depmod(8) dependency metadata.

Tying it all together


Here are the steps I take to build and test a kernel:

cd linux-2.6
[...make some changes...]
make
usr/gen_init_cpio initramfs | gzip >initramfs.gz
qemu-kvm -kernel arch/x86/boot/bzImage -initrd initramfs.gz -append "console=ttyS0" -nographic

It takes about 28 seconds to reach the shell prompt inside the virtual machine with ccache and a hot page cache on this laptop. This keeps development fun :)!