Wednesday, September 28, 2011

Preparing and storing patch revisions as git tags using git-publish

This weekend I got down to solving a workflow problem that has been bugging me for some time: preparing and storing patch revisions. Manually managing patch revisions is painful; I often find myself switching between git and my inbox several times to put together a consistent patch series.

git-publish is a script that numbers patch revisions, optionally stores a cover letter, and submits the patches via git-send-email(1). When your tree is in a state that you wish to publish, you say:

$ git publish --to=qemu-devel@nongnu.org

It creates a git tag that you can refer back to in the future and sends out the patch series emails.

No more numbering revisions, copying & pasting cover letters, or running several steps to format and send patch series.
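Here is a rough sketch of what the workflow looks like across revisions (the branch and tag names below are just examples; git-publish numbers the revisions for you):

$ git checkout my-feature                  # hypothetical topic branch
$ git publish --to=qemu-devel@nongnu.org   # sends v1 and stores a tag
# ...rework the patches after review...
$ git publish --to=qemu-devel@nongnu.org   # sends v2, numbered automatically
$ git tag
my-feature-v1
my-feature-v2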

Give it a try if you are tired of manually managing patch revisions with git. git-publish is released under the MIT License at http://github.com/stefanha/git-publish. I have provided documentation but you can set it up in just two lines:

$ git clone git://github.com/stefanha/git-publish.git
$ git-publish/git-publish --setup # make available via git alias

Be sure to check out the README - it explains how to install and run it in more detail.

Happy git-publishing!

Monday, September 19, 2011

Enhanced VMDK support now in QEMU

QEMU now has greatly enhanced support for VMDK, the VMware disk image file format, thanks to Fam Zheng's hard work during Google Summer of Code 2011. Previously QEMU could only handle older VMDK files because it did not support the entire VMDK file format specification. This resulted in qemu-img convert and other tools being unable to open certain VMDK files. As of now, qemu.git has merged code that handles the VMDK specification and works well with modern image files.

If you had trouble in the past manipulating VMDK files with qemu-img, it may be worth another look soon. You can already build the latest and greatest qemu-img from the qemu.git repository, and distros will provide packages with full VMDK support in the future:

$ git clone git://git.qemu.org/qemu.git
$ cd qemu
$ ./configure
$ make qemu-img
$ ./qemu-img convert ...

It is still recommended to convert VMDK files to QEMU's native formats (raw, qcow2, or qed) in order to get optimal performance for running VMs.
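For example, converting a VMDK image to qcow2 looks something like this (the file names are placeholders):

$ qemu-img convert -f vmdk -O qcow2 disk.vmdk disk.qcow2

qemu-img probes the input format if you leave out -f vmdk, but specifying it makes the intent explicit.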

At the end of Fam's Summer of Code project, he put together an article on gotchas and undocumented behavior in the VMDK specification. This will be of great interest to developers writing their own code to manipulate VMDK image files. His experience this summer involved testing a wide range of VMware software and real-world image files, as well as studying existing open-source VMDK code. The VMDK specification is ambiguous in places and does not cover several essential details, so Fam had to figure them out himself and then document them.

So with that, happy VMDK-ing...

Sunday, September 11, 2011

How to share files instantly between virtual machines and host

It's pretty common to need to copy files between a virtual machine and the host. This might mean copying drivers or installers from the host into the virtual machine, or getting some data out of the virtual machine and onto the host.

There are several solutions for sharing files, like network file systems, third-party file hosting, or even the virtfs paravirtualized file system supported by KVM. But my favorite ad-hoc file sharing tool is Python's built-in web server.

The one-liner that shares files over HTTP

To share some files, simply change into the directory subtree you want to export and start the web server:

$ cd path/to/shared/files && python -m SimpleHTTPServer
Serving HTTP on 0.0.0.0 port 8000 ...

The directory hierarchy at path/to/shared/files is now available over HTTP on port 8000 on all of the machine's IP addresses. The web server generates index listings for directories, so it is easy to browse.
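If port 8000 is already in use, SimpleHTTPServer accepts an alternative port number as an argument:

$ python -m SimpleHTTPServer 8080
Serving HTTP on 0.0.0.0 port 8080 ...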

To access the host from a virtual machine:

  • User networking (slirp): http://10.0.2.2:8000/ (see the example after this list)
  • NAT networking: the default gateway from ip route show | grep ^default, for example http://192.168.122.1:8000/
  • Bridged or routed networking: the IP address of the host
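For example, with user networking a guest can pull a file down from the host like this (the file name is a placeholder):

$ wget http://10.0.2.2:8000/disk-image.iso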

To access the virtual machine from the host:

  • NAT, bridged, or routed networking: the virtual machine's IP address
  • User networking (slirp): forward a port with the hostfwd_add QEMU monitor command or the -net user,hostfwd= option documented in the QEMU man page (see the sketch below)
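As a sketch of the hostfwd= option, the following forwards host port 8000 to port 8000 in the guest so the host can reach the guest's web server (the binary name and the rest of the QEMU command line depend on your setup):

$ qemu -net nic -net user,hostfwd=tcp::8000-:8000 ...
$ curl http://localhost:8000/    # run on the host, reaches the guest's web server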

Advantages

There are a few reasons why I like Python's built-in web server:

  • No need to install software since Linux and Mac typically already have Python installed.
  • No privileges are required to run the web server.
  • Works for both physical and virtual machines.
  • HTTP is supported by desktop file managers, browsers, and the command-line with wget or curl.
  • Network booting and installation is possible straight from the web server. Be warned that RHEL installs over HTTP have been known to fail with SimpleHTTPServer, but I have not encountered problems with other software.

Security tips

This handy HTTP server is only suitable for trusted networks for a few reasons:

  • Encryption is not supported so data travels in plain text and can be tampered with in flight.
  • The directory listing feature means you should only export directory subtrees that contain no sensitive data.
  • SimpleHTTPServer is not a production web server and may have bugs that a hardened web server would not.

Conclusion

Python's built-in web server is one of my favorite tricks that not many people seem to know. I hope this handy command comes in useful to you!

Wednesday, September 7, 2011

QEMU Internals: vhost architecture

This post explains how vhost provides in-kernel virtio devices for KVM. I have been hacking on vhost-scsi and have answered questions about ioeventfd, irqfd, and vhost recently, so I thought this would be a useful QEMU Internals post.

Vhost overview

The vhost drivers in Linux provide in-kernel virtio device emulation. Normally the QEMU userspace process emulates I/O accesses from the guest. Vhost puts virtio emulation code into the kernel, taking QEMU userspace out of the picture. This allows device emulation code to directly call into kernel subsystems instead of performing system calls from userspace.

The vhost-net driver emulates the virtio-net network card in the host kernel. Vhost-net is the oldest vhost device and the only one which is available in mainline Linux. Experimental vhost-blk and vhost-scsi devices have also been developed.

In Linux 3.0 the vhost code lives in drivers/vhost/. Common code that is used by all devices is in drivers/vhost/vhost.c. This includes the virtio vring access functions which all virtio devices need in order to communicate with the guest. The vhost-net code lives in drivers/vhost/net.c.

The vhost driver model

The vhost-net driver creates a /dev/vhost-net character device on the host. This character device serves as the interface for configuring the vhost-net instance.
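If the vhost-net module is loaded you can see the character device on the host (permissions and device numbers vary by distro; the output below is illustrative):

$ ls -l /dev/vhost-net
crw------- 1 root root 10, 238 Sep  7 09:00 /dev/vhost-net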

When QEMU is launched with -netdev tap,vhost=on it opens /dev/vhost-net and initializes the vhost-net instance with several ioctl(2) calls. These are necessary to associate the QEMU process with the vhost-net instance, prepare for virtio feature negotiation, and pass the guest physical memory mapping to the vhost-net driver.
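A minimal sketch of enabling vhost-net on the QEMU command line might look like this (tap device setup and the rest of the options are omitted):

$ qemu -netdev tap,id=net0,vhost=on -device virtio-net-pci,netdev=net0 ...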

During initialization the vhost driver creates a kernel thread called vhost-$pid, where $pid is the QEMU process pid. This thread is called the "vhost worker thread". The job of the worker thread is to handle I/O events and perform the device emulation.
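Once a guest is running with vhost=on, the worker thread is visible on the host (the pids shown here are made up):

$ ps -ef | grep vhost-
root      5012     2  0 09:05 ?        00:00:00 [vhost-5011]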

In-kernel virtio emulation

Vhost does not emulate a complete virtio PCI adapter. Instead it restricts itself to virtqueue operations only. QEMU is still used to perform virtio feature negotiation and live migration, for example. This means a vhost driver is not a self-contained virtio device implementation; it depends on userspace to handle the control plane while the data plane is done in-kernel.

The vhost worker thread waits for virtqueue kicks and then handles buffers that have been placed on the virtqueue. In vhost-net this means taking packets from the tx virtqueue and transmitting them over the tap file descriptor.

File descriptor polling is also done by the vhost worker thread. In vhost-net the worker thread wakes up when packets come in over the tap file descriptor and it places them into the rx virtqueue so the guest can receive them.

Vhost as a userspace interface

One surprising aspect of the vhost architecture is that it is not tied to KVM in any way. Vhost is a userspace interface and has no dependency on the KVM kernel module. This means other userspace code, like libpcap, could in theory use vhost devices if it finds them to be convenient high-performance I/O interfaces.

When a guest kicks the host because it has placed buffers onto a virtqueue, there needs to be a way to signal the vhost worker thread that there is work to do. Since vhost does not depend on the KVM kernel module, they cannot communicate directly. Instead vhost instances are set up with an eventfd file descriptor which the vhost worker thread watches for activity. The KVM kernel module has a feature known as ioeventfd for taking an eventfd and hooking it up to a particular guest I/O exit. QEMU userspace registers an ioeventfd for the VIRTIO_PCI_QUEUE_NOTIFY hardware register access which kicks the virtqueue. This is how the vhost worker thread gets notified by the KVM kernel module when the guest kicks the virtqueue.

A similar approach is used on the return trip, when the vhost worker thread needs to interrupt the guest. Vhost takes a "call" file descriptor which it will write to in order to kick the guest. The KVM kernel module has a feature called irqfd which allows an eventfd to trigger guest interrupts. QEMU userspace registers an irqfd for the virtio PCI device interrupt and hands it to the vhost instance. This is how the vhost worker thread can interrupt the guest.

In the end the vhost instance only knows about the guest memory mapping, a kick eventfd, and a call eventfd.

Where to find out more

Here are the best places to begin exploring the code:
  • drivers/vhost/vhost.c - common vhost driver code
  • drivers/vhost/net.c - vhost-net driver
  • virt/kvm/eventfd.c - ioeventfd and irqfd
The QEMU userspace code shows how to initialize the vhost instance:
  • hw/vhost.c - common vhost initialization code
  • hw/vhost_net.c - vhost-net initialization

Sunday, August 21, 2011

KVM Forum 2011 Highlights

KVM Forum 2011 was co-located with LinuxCon North America in Vancouver, Canada. KVM Forum ran on Monday and Tuesday, 16 & 17 August, and featured two tracks packed with developer-oriented talks on KVM and related open source projects.

Here is a summary with links to some of the most interesting talks:

Big picture and roadmaps


Daniel Berrange gave an excellent overview of libvirt and libguestfs. These two tools, along with other virt-tools like virt-v2v, form the user-visible toolkit and APIs around KVM. Anyone who needs to automate KVM or develop custom applications that integrate with KVM should learn about the work being done by the libvirt community to provide both APIs and tools for managing virtualized guests.

Alon Levy's SPICE Roadmap talk explains how remote graphics, input, sound, and USB are being added to KVM. SPICE goes far beyond today's VNC server, which can only scrape screen updates and send them to the client. SPICE has a deeper insight into the graphics pipeline and is able to provide efficient remote displays. In addition, channels for input, sound, and USB pass-through promise to bring good desktop integration to KVM.

Subsystem status and plumbing


Fixing the USB Disaster explains the state of USB, where Gerd Hoffmann has been working to remove limitations and add support for the USB 2.0 standard. These much-needed improvements will allow USB pass-through to actually work across devices. In the past I've had mixed results when passing through VoIP handsets and consumer electronics devices. Thanks to Gerd's work, I'm hoping that USB pass-through will work more consistently in future releases.

Migration: One year later tells the story of live migration in KVM. Juan Quintela has been working on this area of QEMU. In part due to how live migration support is implemented in the device model today, it has been a struggle to provide working live migration as device emulation is extended or fixed. In particular, it is a challenging problem to migrate from an old qemu-kvm to a new one, and vice versa.

New features


AMD IOMMU Version 2 covers the enhanced I/O Memory Management Unit from AMD. Joerg Roedel, who has been working on nested virtualization on AMD CPUs, presents the key features of this new IOMMU. It adds PCI ATS-based support for demand paging. This eliminates the need to lock guest memory when doing PCI pass-through, since it's now possible to swap in a page and resume the I/O when a fault occurs.

Along the lines of Kemari, Kei Ohmura presents Rapid VM Synchronization with I/O Emulation Logging-Replay. The new trick is that I/O logging enables shared-nothing configurations where the primary and secondary host do not have access to shared storage. Instead the primary sends I/O logs to the secondary, where they are replayed to bring the secondary disk image up to date.

Performance


For a tour of performance tweaks across memory, networking, and storage, check out Mark Wagner's talk on KVM Performance Improvements and Optimizations. He covers Transparent Huge Pages, vhost-net, SR-IOV, block I/O schedulers, Linux AIO, NUMA tuning, and more.

...and more


Each talk was only 30 minutes long, and with two tracks that meant lots of talks. To see all presentations, go to the KVM Forum 2011 website.

This post only covered non-IBM presentations. There were so many IBMers working on KVM in attendance that I'd like to bring their talks together and show all the areas they touch on. In my next post I will give an overview of the KVM presentations given by IBM Linux Technology Center people at KVM Forum and LinuxCon North America.

Wednesday, August 3, 2011

My KVM Architecture guest post is up at Virtualization@IBM

Update: These links no longer work. For a more recent overview of KVM's architecture, check out the slides available here.

Earlier this year IBM launched the Virtualization@IBM blog and I'm pleased to have contributed a guest post on KVM Architecture!

KVM Architecture: The Key Components of Open Virtualization with KVM explains the open virtualization stack built around KVM. It highlights the performance, security, and management characteristics of the architecture. I hope it is a good overview if you want a quick idea of how KVM works.

You can follow @Linux_at_IBM and @OpenKVM on Twitter for more official updates, for example on the Open Virtualization Alliance that HP, IBM, Intel, Red Hat, and many others are coming together around.

Wednesday, June 8, 2011

LinuxCon Japan 2011 KVM highlights

I had the opportunity to attend LinuxCon Japan 2011 in Yokohama. The conference ran many interesting talks, with a virtualization mini-summit taking place over the first two days. Here are highlights from the conference and some of the latest KVM-related presentation slides.

KVM Ecosystem

Jes Sorensen from Red Hat presented the KVM Weather Report, a status update and look at recent KVM work. Check out his slides if you want an overview of where KVM is today. The main points I took away are:
  • KVM performance is excellent and continues to improve.
  • In the future, expect advanced features that round out KVM's strengths.
Jes' presentation ends on a nice note with a weather forecast that reads "Cloudy with a chance of total world domination" :).

Virtualization End User Panel

The end user panel brought together three commercial users of open virtualization software to discuss their deployments and answer questions from the audience. I hope a video of this panel will be made available because it is interesting to understand how virtualization is put to use.

All three users were in the hosting or cloud market. They were split across KVM, Xen, and a mix of KVM for standard offerings with VMware for premium offerings. It is clear that KVM is becoming the hypervisor of choice for hosting due to its low cost and good Linux integration - the Xen user had started several years ago but is now evaluating KVM. My personal experience with USA and UK hosting is that Xen is still widely deployed although KVM is growing quickly.

All three users rely on custom management tools and web interfaces. Although libvirt is being used, the management layers above it are seen as an opportunity to differentiate. The current breed of open cloud and virtualization management tools weren't seen as mature or scalable enough. I expect this area of the virtualization stack to solidify, with several of the open source efforts consolidating in order to reach critical mass.

Storage and Networking

Jes Sorensen from Red Hat covered the KVM Live Snapshot Support work that will allow disk snapshots to be taken for backup and other purposes without stopping the VM. My own talk gave An Updated Overview of the QEMU Storage Stack and covered other current work in the storage area, as well as explaining the most important storage configuration settings when using QEMU.

Stephen Hemminger from Vyatta presented an overview of Virtual Networking Performance. KVM is looking pretty good relative to other hypervisors, no doubt thanks to all the work that has gone into network performance under KVM. Michael Tsirkin and many others are still optimizing virtio-net and vhost_net to reduce latency, improve throughput, and reduce CPU consumption, and I think the results justify virtio and paravirtualized I/O. KVM is able to continue improving network performance by extending its virtio host<->guest interface to work more efficiently - something that is impossible when emulating existing real-world hardware.

Other KVM-related talks

Isaku Yamahata from VA Linux gave his Status Update on QEMU PCI Express Support. His goal is PCI device assignment of PCI Express adapters. I think his work is important for QEMU in the long term since eventually hardware emulation needs to be able to present modern devices in order for guest operating systems to function.

Guangrong Xiao from Fujitsu presented KVM Memory Virtualization Progress, which describes how guest memory access works. It covers both hardware MMU virtualization in modern processors as well as the software solution used on older hardware. Kernel Samepage Merging (KSM) and Transparent Hugepages (THP) are also touched upon. This is a good technical overview if you want to understand how memory management virtualization works.

More slides and videos

The full slides and videos for talks should be available within the next few weeks. Here are the links to the LinuxCon Japan 2011 main schedule and virtualization mini-summit schedule.