Saturday, April 13, 2013

QEMU.org is in Google Summer of Code 2013!

As recently announced on Google+, QEMU.org has been accepted to Google Summer of Code 2013.

We have an exciting list of project ideas for QEMU, libvirt, and the KVM kernel module. Students should choose a project idea and contact the mentor to discuss the requirements. The easiest way to get in touch is via the #qemu-gsoc IRC channel on irc.oftc.net.

Student applications formally open on April 22, but it's best to get in touch with the mentor now. See the timeline for details.

I've shared my advice on applying to Summer of Code on this blog. Check it out if you're looking for a guide to a successful application from someone who has been both a student and a mentor.

Tuesday, April 9, 2013

QEMU Code Overview slides available

I recently gave a high-level overview of QEMU aimed at new contributors or people working with QEMU. The slides are now available here:

QEMU Code Overview (pdf)

Topics covered include:

  • External interfaces (command-line, QMP monitor, HMP monitor, UI, logging)
  • Architecture (process model, main loop, threads)
  • Device emulation (KVM accelerator, guest/host device split, hardware emulation)
  • Development (build process, contributing)

It is a short presentation and stays at a high level, but it can be useful for getting your bearings before digging into QEMU source code, debugging, or performance analysis.
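
If you want to poke at one of those external interfaces right away, the QMP monitor can be exposed on a UNIX socket and driven with line-based JSON. A minimal sketch (the socket path and the use of socat are just illustrative; any JSON-capable client works):

$ qemu-system-x86_64 -display none -qmp unix:/tmp/qmp.sock,server,nowait &
$ socat - UNIX-CONNECT:/tmp/qmp.sock
{"execute": "qmp_capabilities"}
{"execute": "query-status"}

QEMU sends a JSON greeting when the client connects; once the qmp_capabilities handshake is done, commands like query-status return JSON replies.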

Enjoy!

Wednesday, March 13, 2013

New in QEMU 1.4: high performance virtio-blk data plane implementation

QEMU 1.4 includes an experimental feature for improved high IOPS disk I/O scalability called virtio-blk data plane. It extends QEMU to perform disk I/O in a dedicated thread that is optimized for scalability with high IOPS devices and many disks. IBM and Red Hat have published a whitepaper presenting the highest IOPS achieved to date under virtualization using virtio-blk data plane:

KVM Virtualized I/O Performance [PDF]

Update

Much of this post is now obsolete! The virtio-blk dataplane feature was integrated with QEMU's block layer (live migration and block layer features are now supported), virtio-scsi dataplane support was added, and libvirt XML syntax was added.

If you have a RHEL 7.2 or later host, please use the following:

QEMU syntax:

$ qemu-system-x86_64 -object iothread,id=iothread0 \
                     -drive if=none,id=drive0,file=vm.img,format=raw,cache=none,aio=native \
                     -device virtio-blk-pci,iothread=iothread0,drive=drive0

Libvirt domain XML syntax:

<domain>
    <iothreads>1</iothreads>
    <cputune>  <!-- optional -->
        <iothreadpin iothread="1" cpuset="5,6"/>
    </cputune>
    <devices>
        <disk type="file">
            <driver iothread="1" ... />
        </disk>
    </devices>
</domain>
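
If the libvirt version is recent enough, virsh can confirm that the iothread exists and show its CPU affinity (the domain name below is only an example):

$ virsh iothreadinfo mydomain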

When can virtio-blk data plane be used?

Data plane is suitable for LVM or raw image file configurations where live migration and advanced block features are not needed. This covers many configurations where performance is the top priority.

Data plane is still an experimental feature because it only supports a subset of QEMU configurations. The QEMU 1.4 feature has the following limitations:

  • Image formats (qcow2, qed, etc.) are not supported; only raw images work.
  • Live migration is not supported.
  • QEMU I/O throttling is not supported, but the cgroups blk-io controller can be used (see the sketch after this list).
  • Only the default "report" I/O error policy is supported (-drive werror=,rerror=).
  • Hot unplug is not supported.
  • Block jobs (block-stream, drive-mirror, block-commit) are not supported.
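
As noted in the list above, I/O limits can still be applied from the host with the blk-io cgroup controller. A minimal sketch using the legacy cgroup filesystem (the cgroup name, the QEMU pid, and the 8:0 major:minor numbers are illustrative; mount points vary by distro):

# create a cgroup for the VM and cap its writes to /dev/sda (8:0) at 10 MB/s
sudo mkdir /sys/fs/cgroup/blkio/vm0
echo "8:0 10485760" | sudo tee /sys/fs/cgroup/blkio/vm0/blkio.throttle.write_bps_device
echo "$QEMU_PID"    | sudo tee /sys/fs/cgroup/blkio/vm0/tasks   # pid of the QEMU process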

How to use virtio-blk data plane

The following libvirt domain XML enables virtio-blk data plane:

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
...
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native'/>
      <source file='path/to/disk.img'/>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </disk>
...
  </devices>
  <qemu:commandline>
    <qemu:arg value='-set'/>
    <qemu:arg value='device.virtio-disk0.scsi=off'/>
  </qemu:commandline>
  <!-- config-wce=off is not needed in RHEL 6.4 -->
  <qemu:commandline>
    <qemu:arg value='-set'/>
    <qemu:arg value='device.virtio-disk0.config-wce=off'/>
  </qemu:commandline>
  <qemu:commandline>
    <qemu:arg value='-set'/>
    <qemu:arg value='device.virtio-disk0.x-data-plane=on'/>
  </qemu:commandline>
</domain>

Note that <qemu:commandline> must be added directly inside <domain> and not inside a child tag like <devices>.

If you do not use libvirt the QEMU command-line is:

qemu -drive if=none,id=drive0,cache=none,aio=native,format=raw,file=path/to/disk.img \
     -device virtio-blk,drive=drive0,scsi=off,config-wce=off,x-data-plane=on

What is the roadmap for virtio-blk data plane?

The limitations of virtio-blk data plane in QEMU 1.4 will be lifted in future releases. The goal I intend to reach is for QEMU virtio-blk to simply use the data plane approach behind the scenes, so that the x-data-plane option can be dropped.

Reaching the point where data plane becomes the default requires teaching the QEMU event loop and all the core infrastructure to be thread-safe. In the past there has been a big lock that allows a lot of code to simply ignore multi-threading. This creates scalability problems that data plane avoids by using a dedicated thread. Work is underway to reduce the scope of the big lock and allow the data plane thread to work with live migration and other QEMU features that are not yet supported.

Patches have also been posted upstream to convert the QEMU net subsystem and virtio-net to data plane. This demonstrates the possibility of converting other performance-critical devices.

With these developments happening, 2013 will be an exciting year for QEMU I/O performance.

Tuesday, December 4, 2012

qemu-kvm.git has unforked back into qemu.git!

With the QEMU 1.3.0 release the qemu-kvm.git fork has now been merged back. The qemu.git source tree now contains equivalent code - it is no longer necessary to use qemu-kvm.git.

This is great news and has taken a lot of work from folks in the community. The qemu-kvm.git source tree had a number of differences compared to qemu.git. Over time, these changes were cleaned up and merged into qemu.git so that there is no longer a need to maintain a separate qemu-kvm.git source tree.

Many distros had both qemu and qemu-kvm or kvm packages. This sometimes led to confusion when people were unsure which package to install - both packages supported KVM to some degree. Now they are equivalent and distros will be able to simplify QEMU packaging.

For full details of the QEMU 1.3.0 release, see the announcement.

Friday, November 9, 2012

GlusterFS for KVM Users and Developers at KVM Forum 2012

At KVM Forum 2012 I gave an overview of GlusterFS for KVM users and developers. If you're looking for an introduction to the GlusterFS distributed storage system, check out the slides:

GlusterFS for KVM Users and Developers [PDF]

EDIT: Vijay Bellur provided a link to his presentation on extending GlusterFS. Check it out for deeper information on writing xlators:

Extending GlusterFS - The Translator Way [PDF]

Hopefully slides from other talks will become available online at the KVM Forum 2012 website.

Friday, September 21, 2012

Thoughts on Linux Containers (LXC)

I have been using Linux Containers (LXC) to run lightweight virtual machines for about a year now. I revisited LXC this week because I temporarily lack access to a machine with virtualization extensions for KVM. So I need an alternative for running Linux virtual machines with good performance. Here are my thoughts on LXC and its current status.

What is Linux Containers (LXC)?

The Linux Containers (LXC) project provides container virtualization for Linux, which has been described as "chroot(8) on steroids". The idea of containers is that resources like process identifiers (pids), user identifiers (uids), or memory are no longer managed globally by the kernel. Instead, the kernel accounts each resource against a container and gives them separate namespaces. This way multiple containers can run an init(8) process with pid 1 without conflicting, for example.
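
You can get a feel for one of these namespaces without LXC at all: with a reasonably recent util-linux, unshare(1) can start a shell in a fresh PID namespace where it sees itself as pid 1 (a sketch, not an LXC command):

$ sudo unshare --pid --fork --mount-proc /bin/bash
# ps aux   # only this shell and ps itself are visible; bash has pid 1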

LXC can run multiple Linux systems simultaneously on one kernel. Containers are isolated from each other and do not conflict because each resource belongs to a container. The lxc 0.8.0~rc1-8+deb7u1 package I used on Debian comes with "templates" for Fedora, OpenSuSE, Ubuntu, Debian, Arch Linux, and altlinux. It takes about 10 minutes to create a new container pre-installed with one of these Linux distributions.

It's also possible to share resources, like file systems, between containers. It's even possible to run just a single process in a container without booting up a full Linux distribution - this really is "chroot(8) on steroids". But I think these features are useful to fewer users because they require understanding the exact resource dependencies of the application being containerized. It's easier to boot up a full Linux distribution to run the application.
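
The tool intended for that single-process case is lxc-execute(1), which starts one command inside the container's namespaces instead of the distro's init. A sketch only; the option syntax and the need for lxc-init inside the container vary between LXC versions:

sudo lxc-execute -n mydebian -- /bin/bash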

In case you're wondering, Linux Containers (LXC) is purely a software kernel feature and therefore works fine inside KVM, Xen, or VMware. I use it to slice up my virtual server hosted on KVM and it works fine on Amazon EC2, Linode, etc.

How to create a container

First you need to install the lxc package:

sudo aptitude install lxc

There are now a bunch of utilities available, such as lxc(1), lxc-start(1), and lxc-create(1). The lxc(1) utility is a wrapper for the others, similar to git(1) versus git-branch(1).

Create a Debian container from its template:

sudo lxc create mydebian -t debian

There is a configuration file in /var/lib/lxc/mydebian/config that defines resources available to the container. You need to edit this file to set memory limits, set up networking, change the hostname, and more. The specific configuration options are out of scope for this post. Refer to your lxc package documentation, for example, /usr/share/doc/lxc/ on Debian.
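
As a rough idea of what the file contains, here is a sketch using the key names from the 0.8-era lxc packages (all values are examples, not recommendations):

lxc.utsname = mydebian
lxc.network.type = veth
lxc.network.link = br0
lxc.network.flags = up
lxc.cgroup.memory.limit_in_bytes = 512M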

Once installation is complete, launch the container:

sudo lxc start mydebian

To shut down the container:

sudo lxc shutdown mydebian

The rough edges

Use virt-manager, if possible

Getting started with LXC involves a few hurdles, and they remind me of the issues first-time qemu users face. The answer to configuration complexity is to use a high-level tool like virt-manager. Here are the advantages of using virt-manager:

  • Configure the container from a graphical interface and skip learning configuration file syntax for something you will rarely edit.
  • Have the environment dependencies for cgroups and networking set up for you. If you're not familiar with network bridges or IP forwarding this can save time.
  • Get libvirt API support if you wish to automate or write custom tooling. This is actually a libvirt benefit and not specific to virt-manager itself.

Unfortunately, I wasn't able to use virt-manager because libvirt requires cgroups controllers that are not available on my machine (without recompiling the kernel). Libvirt assumes that you want to fully isolate the container and relies on several cgroups controllers. In my case I don't care about limiting memory usage, and it just so happens that the distro kernel I run doesn't have that controller built in. I decided to use lxc(1) manually and deal with the configuration file and networking setup myself.

Templates should indicate what to do after installation

After installing the Debian template, the container booted with no networking and no login console. The boot process also spat out a number of errors that are not common on bare-metal systems, because the unmodified Debian distro is trying to do things that are not allowed inside the container. Without much in the way of getting-started documentation, it took a little while to figure out what was missing.

I ensured the ttys were running getty in /etc/inittab and created /dev/ttyX devices in the container. That took care of login consoles and made lxc console mydebian happy.
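
Roughly, that means getty entries in the container's /etc/inittab plus matching device nodes in its root filesystem. A sketch (the exact getty lines depend on the distro template):

sudo mknod -m 620 /var/lib/lxc/mydebian/rootfs/dev/tty1 c 4 1
sudo mknod -m 620 /var/lib/lxc/mydebian/rootfs/dev/tty2 c 4 2
# and /etc/inittab inside the container needs entries along the lines of:
#   1:2345:respawn:/sbin/getty 38400 tty1
#   2:2345:respawn:/sbin/getty 38400 tty2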

For networking I chose to use the veth type together with IP forwarding for external network access. This is fairly standard stuff if you've been playing with KVM or Xen.
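
Concretely that means a bridge on the host, NAT for outbound traffic, and pointing the container's veth interface at the bridge via lxc.network.link. A sketch of the host side (the bridge name and addresses are illustrative):

sudo brctl addbr br0
sudo ip addr add 192.168.100.1/24 dev br0
sudo ip link set br0 up
sudo sysctl -w net.ipv4.ip_forward=1
sudo iptables -t nat -A POSTROUTING -s 192.168.100.0/24 -j MASQUERADE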

Anyway, what is missing from LXC is some guidance between lxc create and lxc start that hints at what to do in order to successfully boot and access the container.

Summary

Linux Containers (LXC) allows you to run lightweight Linux virtual machines with good performance. Once they are set up they boot quickly and feel like bare metal. As of September 2012, the container setup experience requires patience and some experience with troubleshooting Linux boot and networking issues. Hopefully the libvirt LXC support will reach the level where it's possible to create containers in virt-manager's wizard without getting familiar with low-level details.

Are you using LXC or have you tried it? Share your comments below.

Thursday, December 22, 2011

QEMU 2011 Year in Review

As 2011 comes to an end I want to look back at the highlights from the QEMU community this year. Development progress feels good, the mailing list is very active, and QEMU's future looks bright. I only started contributing in 2010 but the growth since QEMU's early days must be enormous. Perhaps someone will make a source history visualization that shows the commit history and clusters of activity.

Here is the recap of the milestones that QEMU reached in 2011.

QEMU 0.14

In February the 0.14 release came out with a bunch of exciting new features; for full details see the changelog.

QEMU 0.15

In August the 0.15 release brought yet more cool improvements; for full details see the changelog.

Google Summer of Code

QEMU participated in Google Summer of Code 2011 and received funding for students to contribute to QEMU during the summer. Behind the scenes this takes an awful lot of work, not only from the students themselves but also from the mentors. These four projects were successfully completed:

  • Boot Mac OS >=8.5 on PowerPC system emulation
  • QED <-> QCOW2 image conversion utility
  • Improved VMDK image format compatibility
  • Adding NeXT emulation support

Hopefully we can continue to participate in GSoC and give students an opportunity to get involved with open source emulation and virtualization.

QEMU 1.0

The final QEMU release for 2011 came in December. The release announcement was picked up quite widely, and after it hit Hacker News and Reddit it took some effort to keep the QEMU website up. I think that's a good sign, although QEMU 1.0 is kind of like Linux 3.0 in that the version number change does not represent a fundamentally new codebase or architecture. Here are some of the changes:

  • Xtensa target architecture
  • TCG Interpreter interprets portable bytecode instead of translating to native machine code

For full details see the changelog.

Ongoing engineering efforts

There is a lot of change in motion as the year ends. Here are long-term efforts that are unfolding right now:

  • Jan Kiszka has made a lot of progress in the quest to merge qemu-kvm back into QEMU. In a way this is similar to Xen's QEMU fork, which was merged back earlier this year. This is a great effort because, once the two are unified, there will be no more confusion over qemu-kvm vs qemu.
  • Avi Kivity took on the interfaces for guest memory management and is in the process of revamping them. This touches not only the core concept of how QEMU registers and tracks guest memory but also every single emulated device.
  • Anthony Liguori is working on an object model that will make emulated devices and all resources managed by QEMU consistent and accessible via APIs. I think of this like introducing sysfs, so that there is one hierarchy and ways to explore and manipulate everything QEMU knows about.

Looking forward to 2012

It is hard to pick the highlights to mention but I hope this summary has given you a few links to click and brought back cool features you forgot about :). Have a great new year!