Friday, August 14, 2015

Asynchronous file I/O on Linux: Plus ça change

In 2009 Anthony Liguori gave a presentation at Linux Plumbers Conference about the state of asynchronous file I/O on Linux. He talked about what was missing from the POSIX AIO and Linux AIO APIs. I recently got thinking about this again after reading the source code for the io_submit(2) system call.

Over half a decade has passed and plus ça change, plus c'est la même chose. Sure, there are new file systems, device-mapper targets, the multiqueue block layer, and high IOPS PCI SSDs. There's DAX for storage devices accessible via memory load/store instructions - radically different from the block device model.

However, the io_submit(2) system call remains a treacherous ally in the quest for asynchronous file I/O. I don't think much has changed since 2009 that would make Linux AIO the best asynchronous file I/O mechanism.

The main problem is that io_submit(2) waits for I/O in some cases. It can block! This defeats the purpose of asynchronous file I/O because the caller is stuck until the system call completes. If called from a program's event loop, the program becomes unresponsive until the system call returns. But even if io_submit(2) is invoked from a dedicated thread where blocking doesn't matter, latency is introduced to any further I/O requests submitted in the same io_submit(2) call.

Sources of blocking in io_submit(2) depend on the file system and block devices being used. There are many different cases but in general they occur because file I/O code paths contain synchronous I/O (for metadata I/O or page cache write-out) as well as locks/waiting (for serializing operations). This is why the io_submit(2) system call can be held up while submitting a request.

This means io_submit(2) works best on fully-allocated files, volumes, or block devices. Anything else is likely to result in blocking behavior and cause poor performance.
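
As a rough sketch of what "fully-allocated" can mean for a regular file, the shell snippet below writes zeroes over the whole file so that every block is allocated before any asynchronous I/O is issued. The helper name prealloc_file, the path, and the size are made up for illustration. Whether this removes every source of blocking still depends on the file system and block devices underneath.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): create a file whose
# blocks have all been written in advance.
#   $1 = path of the file to create (illustrative)
#   $2 = size in MiB (illustrative)
prealloc_file() {
    path=$1
    size_mib=$2
    # Writing real zeroes, instead of creating a sparse file, makes the
    # file system allocate every block now. conv=fsync asks dd to flush
    # the data to disk before it exits.
    dd if=/dev/zero of="$path" bs=1M count="$size_mib" conv=fsync 2>/dev/null
}

# Example: a 4 MiB file in a temporary directory.
tmpdir=$(mktemp -d)
prealloc_file "$tmpdir/disk.img" 4

# Report the resulting size in bytes.
size_bytes=$(wc -c <"$tmpdir/disk.img" | tr -d ' ')
echo "size: $size_bytes bytes"

# Clean up the example file.
rm -rf "$tmpdir"
```

Writing zeroes is slower than creating a sparse file, but the allocation cost is paid once up front instead of during later I/O submission.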

Since these conditions don't apply in many cases, QEMU has its own userspace thread-pool with worker threads that call preadv(2)/pwritev(2). It would be nice to default to Linux AIO but the limitations are too serious.

Have there been new developments or did I get something wrong? Let me know in the comments.

Wednesday, April 1, 2015

Tracing Linux kernel function entries/returns

Here is a neat ftrace recipe for tracing execution while the Linux kernel is inside a particular function.  This helps when a kernel function or its children are failing but you don't know where or why.

ftrace will trigger on particular functions if you give it set_graph_function values.  That way you only see traces from the functions you are interested in.  This eliminates the noise you get when tracing all function entries/returns without a filter.

Let's trace virtio_dev_probe() and all its children:

echo virtio_dev_probe >/sys/kernel/debug/tracing/set_graph_function
echo function_graph >/sys/kernel/debug/tracing/current_tracer
echo 1 >/sys/kernel/debug/tracing/tracing_on

modprobe transport_virtio

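echo 0 >/sys/kernel/debug/tracing/tracing_on
# Read the trace before switching tracers, since changing current_tracer clears the buffer
cat /sys/kernel/debug/tracing/trace
# "nop" is the tracer name that disables tracing; an empty write is rejected
echo nop >/sys/kernel/debug/tracing/current_tracer
echo >/sys/kernel/debug/tracing/set_graph_function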

Here is some example output:

...
 0)               |        virtqueue_kick [virtio_ring]() {
 0) + 30.207 us   |          virtqueue_kick_prepare [virtio_ring]();
 0) + 13.342 us   |          vp_notify [virtio_pci]();
 0) + 90.315 us   |        }
 0) # 61946.45 us |      }
 0)   1.046 us    |      mutex_unlock();
 0) # 102833.9 us |    }
 0)   2.411 us    |    vp_get_status [virtio_pci]();
 0)   0.826 us    |    vp_get_status [virtio_pci]();
 0) ! 130.773 us  |    vp_set_status [virtio_pci]();
 0)               |    virtio_config_enable [virtio]() {
 0)   0.689 us    |      _raw_spin_lock_irq();
 0) + 33.796 us   |    }
 0) # 105349.9 us |  }

I haven't figured out whether set_graph_function can be used on functions whose kernel module has not been loaded yet.  I think the answer is no, but please let me know in the comments if there is a way to do it.
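
This does not answer the question, but a quick way to see whether a function is visible to ftrace right now is to look it up in available_filter_functions, which lists one traceable function per line, with the module name in brackets for module code. The sketch below wraps that lookup in a hypothetical helper, ftrace_can_trace. The optional second argument and the stand-in list are assumptions added only so the snippet can be tried without root.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): succeed if a function
# name appears in ftrace's list of traceable functions.
#   $1 = function name
#   $2 = list file (defaults to the usual debugfs location)
ftrace_can_trace() {
    func=$1
    list=${2:-/sys/kernel/debug/tracing/available_filter_functions}
    # Entries look like "name" or "name [module]", one per line, so match
    # the name followed by whitespace or end of line.
    grep -q -E "^${func}([[:space:]]|\$)" "$list"
}

# Try it against a small stand-in list instead of the real file.
sample=$(mktemp)
printf '%s\n' 'vp_notify [virtio_pci]' 'virtio_dev_probe [virtio]' 'mutex_unlock' >"$sample"

if ftrace_can_trace virtio_dev_probe "$sample"; then found=yes; else found=no; fi
if ftrace_can_trace no_such_function "$sample"; then missing=yes; else missing=no; fi
echo "virtio_dev_probe: $found, no_such_function: $missing"

rm -f "$sample"
```

On a real system, call ftrace_can_trace virtio_dev_probe as root, with no second argument, before and after modprobe to compare.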

Wednesday, March 4, 2015

QEMU participating in Outreachy

I'm delighted that QEMU is able to participate in Outreachy May-August 2015.

Outreachy (formerly known as Outreach Program for Women) provides internships to people from groups underrepresented in open source.  The internship is a 12-week full-time paid software development project working on open source software.

QEMU is sharing project ideas between Outreachy and Google Summer of Code.  We encourage applicants to apply to both if they are eligible.

You can join the QEMU Outreachy IRC channel at #qemu-outreachy on irc.oftc.net.

Monday, March 2, 2015

QEMU accepted in Google Summer of Code 2015!

QEMU is participating in Google Summer of Code 2015.  I'm very excited that we are back for another great summer of students contributing to open source (with generous funding from Google).

QEMU's project ideas list is available here:
http://qemu-project.org/Google_Summer_of_Code_2015

Students, you may be interested in my advice for applying.

Good luck, students of 2015!

Tuesday, February 17, 2015

Slides posted for "KVM Architecture Overview: 2015 Edition"

I recently gave a talk on KVM's architecture.  It covers how hardware assisted virtualization works with KVM and explains key features of QEMU's architecture.

Check out the presentation to learn the basics of how KVM runs virtual machines and QEMU emulates devices.

Slides are available here (pdf).  There is no audio or video recording of this talk.

Sunday, February 1, 2015

Slides posted for "Observability in KVM: Troubleshooting virtual machines"

In my FOSDEM 2015 talk on Observability in KVM, I covered the basic tools and troubleshooting techniques for CPU, networking, and disk I/O problems in virtual machines.

My slides are now available here (PDF).

If you would like to learn the basics or get new ideas for troubleshooting with KVM, check them out.

Enjoy!

Wednesday, December 24, 2014

QEMU Advent Calendar 2014 retrospective

This year I ran http://qemu-advent-calendar.org/, an online advent calendar that features a QEMU disk image for download each day from December 1st to 24th.

Pitching the idea

The idea for a QEMU advent calendar is something I had in 2012 or 2013 but there is only one chance to do it per year and I missed the boat previously.  This year the stars were aligned: I was able to pitch the idea at KVM Forum/LinuxCon Europe to people who I thought might be game.

When I saw the reactions from people in the QEMU community on hearing the idea, I thought it had a chance.  Most people were amused and found it slightly weird, but they were positive and had ideas for disk images.

So I had a sense that I could collect disk image contributions from enough people to make the advent calendar work...

How it worked

Each advent calendar entry consists of a tarball with a disk image and "run" shell script, a brief description of the disk image, a screenshot, and a sources tarball (for GPL compliance).

Going into this I didn't demand a specific format of these artifacts from contributors.  Some people sent me a bare disk image and QEMU command-line to launch the thing.  Then I had to come up with the remaining artifacts and create the tarballs.

Digging up the GPL sources for various Linux distributions was time-consuming but I worked hard on this after a request was submitted for sources (not just a link or name/version of the distribution).

This process could have been much easier if I had asked each contributor to follow a checklist and provide artifacts in a specific format.  Instead, I scrambled to put polish on contributions in various states of completeness.

Just-in-time calendar making

I launched the advent calendar with promises for around 10 disk images from potential contributors.  We needed 24 disk images so there was still quite a bit of ground to cover.

The risk was worth it because once the website went live, new contributions started to pour in.  The idea spread successfully on Google+, Hacker News, Reddit, and other communities so that additional people became inspired to recommend or build full disk images from scratch.

There were one or two days where a late cancelation or schedule slip meant someone who had promised an image couldn't deliver.  In those cases I had a list of half-baked ideas that I chose from, and I would scramble to put together an image in about 2 hours.

Companies contributed too

As the word spread about QEMU Advent Calendar 2014, I got emails where companies wanted to contribute disk images.  These were the Ubuntu Core and Pebble smartwatch disk images.

These images fit the scope of the calendar nicely and were "exclusive" in some form.  Both the Ubuntu Core and Pebble smartwatch images were brand new releases that had never seen the light of day before.  It was cool to feature not just nostalgic emulated software on the calendar but also cutting-edge products that are being developed right now with QEMU.

Canonical and Pebble were very proactive here but also tasteful.  They didn't try to push crass advertising; instead they had something appropriate to contribute.  It was easy to accept their contributions since they were in the spirit of the project.  (The whole calendar was ad-free and neither I nor the contributors made money from it.)

The impact

I wanted to do QEMU Advent Calendar 2014 for two reasons:
1. To spread the word about QEMU and cool open source software
2. To celebrate the QEMU community with a fun activity

Here we are, 480 GB of web traffic later.  41,000 unique visitors and over 1,000,000 hits!

(These numbers don't include all of Day 24 because I collected statistics and wrote up this post before the day was over.)

Top disk image by downloads: Day 1 - Slacker's time travel by Gerd Hoffmann.  Congratulations Gerd!

I'm very happy with the way things went.  The goals have been achieved!

Thank you for all the fun!

Thanks to everyone who contributed disk images.  There were a few disk images which we couldn't fit on the calendar for various reasons (file size too large, demo not quite working, etc).  All of them were appreciated though!

Special thanks to Alex Bennee for providing web traffic allowance way beyond my server's monthly quota.  We didn't know if this thing would take off but he monitored the situation and allowed it to stay online.

Happy holidays and New Year 2014/2015!