Monday, May 6, 2013

Pictures from CERN

I went on a tour of CERN, the European nuclear research center that is home to the Large Hadron Collider (LHC). The facilities are split over multiple sites because the LHC is 27 kilometers in circumference and 100 meters underground. I had a chance to see some of the smaller particle accelerators as well as the CMS experiment site on the LHC. Here are the best pictures from the tour.

LHC is currently offline for hardware upgrades

There are screens around the campus, even in the cafeteria, showing particle accelerator activity. The LHC is currently offline for hardware upgrades and is expected to come back around 2014-2015 with higher particle energy. It's actually a good time to visit since it's possible to see experiment sites that would be inaccessible during operation.

CERN houses particle accelerators of different sizes

CERN is home to several different particle accelerators. Some are linear accelerators while others, like the LHC, are ring-shaped so the particles can loop continuously. Particle beams are built up in "packets" or bursts; the low-energy accelerators may be used to spin them up before injecting them into larger accelerators.

Linac 2 and LEIR: Smaller particle accelerators

Linac 2 and the Low Energy Ion Ring (LEIR) are smaller particle accelerators. They are only tens of meters long, which means you can see the whole thing. The principle seems to be similar to that of the big accelerators: force particles to collide by sending them through a vacuum guided by magnetic fields. The point of collision is equipped with sensors that measure the particles produced by the collision.

Inside the LHC ring pipe

The LHC operates at a much larger scale than Linac 2 or LEIR, so it has unique tricks up its sleeve. The electromagnets that keep the particle beam on course require so much energy that superconductivity is used to eliminate resistance. This means the pipe is cooled close to absolute zero and surrounded by insulation and a vacuum to shield it from the external environment.

The collisions are produced by accelerating two particle beams in opposite directions - clockwise and counterclockwise. You can see the two particle beam pipes in the picture above. The beams are kept separate for most of the ring; only at the experiment sites, which contain the detectors, do the beams cross to create collisions.

The CMS experiment site

The Compact Muon Solenoid is one of the experiment sites on the LHC ring. It has a huge chamber filled with sensors that measure particle collisions. It is 15 meters in diameter and hard to get a picture of due to its size. There is also some datacenter space above with machines that process the data generated by the experiments.

Tux makes an appearance

The experiments produce a huge amount of data - only a tiny fraction of the collisions produce interesting events, like a Higgs boson. The incoming data is processed and discarded unless an interesting event is detected. This is called "triggering" and avoids storing huge amounts of unnecessary data. When walking past the racks I saw custom hardware, circuit boards dangling from machines here and there, which is probably used for filtering or classifying the data.

Finally, I spotted the Linux mascot, Tux, on a monitor. Nice to see :-).

Saturday, April 13, 2013

QEMU.org is in Google Summer of Code 2013!

As recently announced on Google+, QEMU.org has been accepted to Google Summer of Code 2013.

We have an exciting list of project ideas for QEMU, libvirt, and the KVM kernel module. Students should choose a project idea and contact the mentor to discuss the requirements. The easiest way to get in touch is via the #qemu-gsoc IRC channel on irc.oftc.net.

Student applications formally open on April 22 but it's best to get in touch with the mentor now. See the timeline for details.

I've shared my advice on applying to Summer of Code on this blog. Check it out if you're looking for a guide to a successful application from someone who has been both a student and a mentor.

Tuesday, April 9, 2013

QEMU Code Overview slides available

I recently gave a high-level overview of QEMU aimed at new contributors or people working with QEMU. The slides are now available here:

QEMU Code Overview (pdf)

Topics covered include:

  • External interfaces (command-line, QMP monitor, HMP monitor, UI, logging)
  • Architecture (process model, main loop, threads)
  • Device emulation (KVM accelerator, guest/host device split, hardware emulation)
  • Development (build process, contributing)

It is a short presentation and stays at a high level, but it can be useful for getting your bearings before digging into QEMU source code, debugging, or performance analysis.

Enjoy!

Wednesday, March 13, 2013

New in QEMU 1.4: high performance virtio-blk data plane implementation

QEMU 1.4 includes an experimental feature called virtio-blk data plane that improves disk I/O scalability at high IOPS. It extends QEMU to perform disk I/O in a dedicated thread that is optimized for scalability with high IOPS devices and many disks. IBM and Red Hat have published a whitepaper presenting the highest IOPS achieved to date under virtualization using virtio-blk data plane:

KVM Virtualized I/O Performance [PDF]

Update

Much of this post is now obsolete! The virtio-blk dataplane feature was integrated with QEMU's block layer (live migration and block layer features are now supported), virtio-scsi dataplane support was added, and libvirt XML syntax was added.

If you have a RHEL 7.2 or later host please use the following:

QEMU syntax:

$ qemu-system-x86_64 -object iothread,id=iothread0 \
                     -drive if=none,id=drive0,file=vm.img,format=raw,cache=none,aio=native \
                     -device virtio-blk-pci,iothread=iothread0,drive=drive0

Libvirt domain XML syntax:

<domain>
    <iothreads>1</iothreads>
    <cputune>  <!-- optional -->
        <iothreadpin iothread="1" cpuset="5,6"/>
    </cputune>
    <devices>
        <disk type="file">
            <driver iothread="1" ... />
        </disk>
    </devices>
</domain>

When can virtio-blk data plane be used?

Data plane is suitable for LVM or raw image file configurations where live migration and advanced block features are not needed. This covers many configurations where performance is the top priority.

Data plane is still an experimental feature because it only supports a subset of QEMU configurations. The QEMU 1.4 feature has the following limitations:

  • Image formats (qcow2, qed, etc.) are not supported; only raw images are.
  • Live migration is not supported.
  • QEMU I/O throttling is not supported, but the cgroups blkio controller can be used instead (see the sketch after this list).
  • Only the default "report" I/O error policy is supported (-drive werror=,rerror=).
  • Hot unplug is not supported.
  • Block jobs (block-stream, drive-mirror, block-commit) are not supported.
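
For reference, here is a rough sketch of throttling a guest's disk with the cgroups blkio controller instead. The cgroup mount point, device numbers, and limit are examples to adapt to your system, not recommendations:

# create a cgroup for the guest and move the QEMU process into it
sudo mkdir /sys/fs/cgroup/blkio/vm0
echo $QEMU_PID | sudo tee /sys/fs/cgroup/blkio/vm0/tasks

# limit reads on the backing device to 10 MB/s ("major:minor bytes-per-second")
echo "253:0 10485760" | sudo tee /sys/fs/cgroup/blkio/vm0/blkio.throttle.read_bps_device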

How to use virtio-blk data plane

The following libvirt domain XML enables virtio-blk data plane:

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
...
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native'/>
      <source file='path/to/disk.img'/>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </disk>
...
  </devices>
  <qemu:commandline>
    <qemu:arg value='-set'/>
    <qemu:arg value='device.virtio-disk0.scsi=off'/>
  </qemu:commandline>
  <!-- config-wce=off is not needed in RHEL 6.4 -->
  <qemu:commandline>
    <qemu:arg value='-set'/>
    <qemu:arg value='device.virtio-disk0.config-wce=off'/>
  </qemu:commandline>
  <qemu:commandline>
    <qemu:arg value='-set'/>
    <qemu:arg value='device.virtio-disk0.x-data-plane=on'/>
  </qemu:commandline>
</domain>

Note that <qemu:commandline> must be added directly inside <domain> and not inside a child tag like <devices>.

If you do not use libvirt the QEMU command-line is:

qemu -drive if=none,id=drive0,cache=none,aio=native,format=raw,file=path/to/disk.img \
     -device virtio-blk,drive=drive0,scsi=off,config-wce=off,x-data-plane=on
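
If you want to double-check that the option took effect, one way (assuming you have access to the QEMU monitor) is to dump the device tree and inspect the virtio-blk device's properties:

(qemu) info qtree

The virtio-blk-pci device should list its properties, including x-data-plane, in the output.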

What is the roadmap for virtio-blk data plane?

The limitations of virtio-blk data plane in QEMU 1.4 will be lifted in future releases. The goal I intend to reach is that QEMU virtio-blk simply uses the data plane approach behind the scenes and the x-data-plane option can be dropped.

Reaching the point where data plane becomes the default requires teaching the QEMU event loop and all the core infrastructure to be thread-safe. In the past there has been a big lock that allows a lot of code to simply ignore multi-threading. This creates scalability problems that data plane avoids by using a dedicated thread. Work is underway to reduce the scope of the big lock and allow the data plane thread to work with live migration and other QEMU features that are not yet supported.

Patches have also been posted upstream to convert the QEMU net subsystem and virtio-net to data plane. This demonstrates the possibility of converting other performance-critical devices.

With these developments happening, 2013 will be an exciting year for QEMU I/O performance.

Tuesday, December 4, 2012

qemu-kvm.git has unforked back into qemu.git!

With the QEMU 1.3.0 release the qemu-kvm.git fork has now been merged back. The qemu.git source tree now contains equivalent code - it is no longer necessary to use qemu-kvm.git.

This is great news and has taken a lot of work from folks in the community. The qemu-kvm.git source tree had a number of differences compared to qemu.git. Over time, these changes were cleaned up and merged into qemu.git so that there is no longer a need to maintain a separate qemu-kvm.git source tree.

Many distros had both qemu and qemu-kvm or kvm packages. This sometimes led to confusion when people were unsure which package to install - both packages supported KVM to some degree. Now they are equivalent and distros will be able to simplify QEMU packaging.

For full details of the QEMU 1.3.0 release, see the announcement.

Friday, November 9, 2012

GlusterFS for KVM Users and Developers at KVM Forum 2012

At KVM Forum 2012 I gave an overview of GlusterFS for KVM users and developers. If you're looking for an introduction to the GlusterFS distributed storage system, check out the slides:

GlusterFS for KVM Users and Developers [PDF]

EDIT: Vijay Bellur provided a link to his presentation on extending GlusterFS. Check it out for deeper information on writing xlators:

Extending GlusterFS - The Translator Way [PDF]

Hopefully slides from other talks will become available online at the KVM Forum 2012 website.

Friday, September 21, 2012

Thoughts on Linux Containers (LXC)

I have been using Linux Containers (LXC) to run lightweight virtual machines for about a year now. I revisited LXC this week because I temporarily lack access to a machine with virtualization extensions for KVM. So I need an alternative for running Linux virtual machines with good performance. Here are my thoughts on LXC and its current status.

What is Linux Containers (LXC)?

The Linux Containers (LXC) project provides container virtualization for Linux, which has been described as "chroot(8) on steroids". The idea of containers is that resources like process identifiers (pids), user identifiers (uids), and memory are no longer managed globally by the kernel. Instead, the kernel accounts each resource against a container and gives each container its own namespaces. This way multiple containers can run an init(8) process with pid 1 without conflicting, for example.

LXC can run multiple Linux systems simultaneously on one kernel. Containers are isolated from each other and do not conflict because each resource belongs to a container. The lxc 0.8.0~rc1-8+deb7u1 package I used on Debian comes with "templates" for Fedora, OpenSuSE, Ubuntu, Debian, Arch Linux, and altlinux. It takes about 10 minutes to create a new container pre-installed with one of these Linux distributions.

It's also possible to share resources, like file systems, between containers. It's even possible to run just a single process in a container without booting up a full Linux distribution - this really is "chroot(8) on steroids". But I think these features are useful to fewer users because they require understanding the exact resource dependencies of the application being containerized. It's easier to boot up a full Linux distribution to run the application.
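
As a small illustration, a single command can be run in a container with lxc-execute(1). This is a sketch that assumes a container named mydebian already exists (creating one is shown below):

# run one process in the container instead of booting a full distro
sudo lxc-execute -n mydebian /bin/sh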

In case you're wondering, Linux Containers (LXC) is purely a software kernel feature and therefore works fine inside KVM, Xen, or VMware. I use it to slice up my virtual server hosted on KVM and it works fine on Amazon EC2, Linode, etc.

How to create a container

First you need to install the lxc package:

sudo aptitude install lxc

There is now a set of utilities available: lxc(1), lxc-start(1), lxc-create(1), and so on. The lxc(1) utility is a wrapper for the others, similar to git(1) versus git-branch(1).

Create a Debian container from its template:

sudo lxc create mydebian -t debian

There is a configuration file in /var/lib/lxc/mydebian/config that defines resources available to the container. You need to edit this file to set memory limits, set up networking, change the hostname, and more. The specific configuration options are out of scope for this post. Refer to your lxc package documentation, for example, /usr/share/doc/lxc/ on Debian.
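
To give a rough idea of what the file contains, here is a minimal sketch; the key names come from the lxc 0.8 series and the values are placeholders, not recommendations (networking is covered in more detail later in this post):

# /var/lib/lxc/mydebian/config (excerpt)
lxc.utsname = mydebian
lxc.rootfs = /var/lib/lxc/mydebian/rootfs
lxc.cgroup.memory.limit_in_bytes = 256M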

Once installation is complete, launch the container:

sudo lxc start mydebian

To shut down the container:

sudo lxc shutdown mydebian

The rough edges

Use virt-manager, if possible

Getting started with LXC has a few hurdles, and they remind me of the issues first-time qemu users face. As with qemu, the answer to configuration complexity is to use a high-level tool like virt-manager. Here are the advantages of using virt-manager:

  • Configure the container from a graphical interface. Skip learning configuration file syntax for something you will rarely edit.
  • Sets up environment dependencies for cgroups and networking. If you're not familiar with network bridges or IP forwarding this can save time.
  • Libvirt API support if you wish to automate or write custom tooling. This is actually a libvirt benefit and not specific to virt-manager itself.

Unfortunately, I wasn't able to use virt-manager because libvirt requires cgroups controllers that are not available on my machine (without recompiling the kernel). Libvirt assumes that you want to fully isolate the container and relies on several cgroups controllers. In my case I don't care about limiting memory usage, and it just so happens that the distro kernel I run doesn't have that controller built in. I decided to use lxc(1) manually and deal with the configuration file and networking setup.
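
If you are in the same situation, a quick way to see which controllers your kernel actually provides is:

cat /proc/cgroups

Each line lists a controller and whether it is enabled; if the memory controller is missing or disabled there, libvirt's LXC driver is likely to hit the same problem.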

Templates should indicate what to do after installation

After installing the Debian template, the container booted without networking and without a login console. The boot process also spat out a number of errors that are not common on baremetal systems, because the unmodified Debian distro is trying to do things that are not allowed inside the container. Without much in the way of getting-started documentation, it took a little while to figure out what was missing.

I ensured the ttys were running getty in /etc/inittab and created /dev/ttyX devices in the container. That took care of login consoles and made lxc console mydebian happy.
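
For reference, this is roughly what that looked like; the paths and device numbers assume the Debian template's root filesystem under /var/lib/lxc/mydebian/rootfs:

# tty1 is character device major 4, minor 1; tty2 is minor 2
sudo mknod -m 620 /var/lib/lxc/mydebian/rootfs/dev/tty1 c 4 1
sudo mknod -m 620 /var/lib/lxc/mydebian/rootfs/dev/tty2 c 4 2

Then check that /etc/inittab inside the container spawns getty on those ttys with lines like:

1:2345:respawn:/sbin/getty 38400 tty1
2:2345:respawn:/sbin/getty 38400 tty2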

For networking I chose to use the veth type together with IP forwarding for external network access. This is fairly standard stuff if you've been playing with KVM or Xen.
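
Concretely, the container side is a few lines in /var/lib/lxc/mydebian/config and the host side is standard IP forwarding plus NAT. The bridge and interface names (br0, eth0) are assumptions for illustration:

# container config: attach a veth pair to a host bridge
lxc.network.type = veth
lxc.network.link = br0
lxc.network.flags = up

# on the host: enable forwarding and NAT traffic leaving via eth0
sudo sysctl -w net.ipv4.ip_forward=1
sudo iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE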

Anyway, what is missing from LXC is some information between lxc create and lxc start that gives hints about what to do in order to successfully boot and access the container.

Summary

Linux Containers (LXC) allows you to run lightweight Linux virtual machines with good performance. Once they are set up they boot quickly and feel like baremetal. As of September 2012, the container setup experience requires patience and some experience with troubleshooting Linux boot and networking issues. Hopefully the libvirt lxc support will reach the level where it's possible to create containers in virt-manager's wizard without getting familiar with low-level details.

Are you using LXC or have you tried it? Share your comments below.