VFIO tips and tricks

Wednesday, August 24, 2016

KVM Forum 2016 - An Introduction to PCI Device Assignment with VFIO

Slides available here:

http://awilliam.github.io/presentations/KVM-Forum-2016

Video to come

Friday, July 15, 2016

Intel Graphics assignment

Hey folks, it feels like it's time to mention that assignment of Intel graphics devices (IGD) is currently available in qemu.git and will be part of the upcoming QEMU 2.7 release. There's already pretty thorough documentation of the modes available in the source tree, please give it a read. There are two modes described there, "legacy" and "Universal Passthrough" (UPT), each have their pros and cons. Which ones are available to you depends on your hardware. UPT mode is only available for Broadwell and newer processors while legacy mode is available all the way back through SandyBridge. If you have a processor older than SandyBridge, stop now, this is not going to work for you. If you don't know what any of these strange names mean, head to Wikipedia and Ark to figure it out.

The high level overview is that "legacy" mode is much like our GeForce support, the IGD is meant to be the primary and exclusive graphics in the VM. Additionally the IGD address in the VM must be at PCI 00:02.0, only Seabios is currently supported, only the 440FX chipset model is supported (no Q35), the IGD device must be the primary host graphics device, and the host needs to be running kernel v4.6 or newer. Clearly assigning the host primary graphics is a bit of an about-face for our GPU assignment strategy, but we depend on running the IGD video ROM, which depends on VGA and imposes most of the above requirements as well (oh add CONFIG_VFIO_PCI_VGA to the requirements list). I have yet to see an IGD ROM with UEFI support, which is why OVMF is not yet supported, but seems possible to support with a CSM and some additional code in OVMF.

Legacy mode should work with both Linux and Windows guests (and hopefully others if you're so inclined). The i915 driver does suffer from the typical video driver problem that sometimes the whole system explodes (not literally) when unbinding or re-binding the IGD to the driver. Personally I avoid this by blacklisting the i915 driver. Of course as some have found out trying to do this with discrete GPUs, there are plenty of other drivers ready to jump on the device to keep the console working. The primary ones I've seen are vesafb and efifb, which one is used on your system depends on your host firmware settings, legacy BIOS vs UEFI respectively. To disable these, simply add video=vesafb:off or video=efifb:off to the kernel command line (not sure which to use? try both, video=vesafb:off,efifb:off). The first thing you'll notice when you boot an IGD system with i915 blacklisted and the more basic framebuffer drivers disabled is that you don't get anything on the graphics head after grub. Plan for this. I use a serial console, but perhaps you're more comfortable running blind and willing to hope the system boots and you can ssh into it remotely.

If you've followed along with this procedure, you should be able to simply create a <hostdev> entry in your libvirt XML, which ought to look something like this:

</source>

</hostdev>

Again, assigning the IGD device (which is always 00:02.0) to address 00:02.0 in the VM is required. Delete the <video> and <graphics> sections and everything should just magically work. Caveat emptor, my newest CPU is Broadwell, I've been told this works with Skylake, but IGD is hardly standardized and each new implementation seems to tweak things just a bit.

Some of you are probably also curious why this doesn't work on Q35, which leads into the discussion of UPT mode; IGD clearly is not a discrete GPU, but "integrated" not only means that the GPU is embedded in the system, in this case it means that the GPU is kind of smeared across the system. This is why IGD assignment hasn't "just worked" and why you need a host kernel with support for exposing certain regions through vfio and a BIOS that's aware of IGD, and it needs to be at a specific address, etc, etc, etc. One of those requirements is that the video ROM actually also cares about a few properties of the device at PCI address 00:1f.0, the ISA/LPC bridge. Q35 includes its own bridge at that location and we cannot simply modify the IDs of that bridge for compatibility reasons. Therefore that bridge being an implicit part of Q35 means that IGD assignment doesn't work on Q35. This also means that PCI address 00:1f.0 is not available for use in a 440FX machine.

Ok, so UPT. Intel has known for a while that the sprawl of IGD has made it difficult to deal with for device assignment. To combat this, both software and hardware changes have been made that help to consolidate IGD to be more assignment-friendly. Great news, right? Well sort of. First off, in UPT mode the IGD is meant to be a secondary graphics device in the VM, there's no VGA mode support (oh, BTW, x-vga=on is automatically added by QEMU in legacy mode). In fact, um, there's no output support of any kind by default in UPT mode. How's this useful you ask, well between the emulated graphics and IGD you can setup mirroring so you actually have a remote-capable, hardware accelerated graphics VM. Plus, if you add the option x-igd-opregion=on to the vfio-pci device, you can get output to a physical display, but there again you're going to need the host running kernel v4.6 or newer and the upcoming QEMU 2.7 support, while no-output UPT has probably actually worked for quite a while. UPT mode has no requirements for the IGD PCI address, but note that most VM firmare, SeaBIOS or OVMF, will define the primary graphics as the one having the lowest PCI address. Usually not a problem, but some of you create some crazy configs. You'll also still need to do all the blacklisting and video disabling above, or just risk binding and unbinding i915 from the host, gambling each time whether it'll explode.

So UPT sounds great except why is this opregion thing optional? Well, it turns out that if you want to do that cool mirroring thing I mention above and a physical output is enabled with the opregion, you actually need to have a monitor attached to the device or else your apps don't get any hardware acceleration love. Whereas if IGD doesn't know about any outputs, it's happy to apply hardware acceleration regardless of what's physically connected. Sucks, but readers here should already know how to create wrapper scrips to add this extra option if they want it (similar to x-vga=on). I don't think Intel really wants to support this hacky hybrid mode either, thus the experimental x- option prefix tag.

Oh, one more gotcha for UPT mode, Intel seems to expect otherwise, but I've had zero success trying to run Linux guests with UPT. Just go ahead and assume this is for your Windows guests only at this point.

What else... laptop displays should work, I believe switching outputs even works, but working on laptops is rather inconvenient since you're unlikely to have a serial console available. Also note that while you can use input-linux to attach a laptop keyboard and mouse (not trackpad IME), I don't know how to make the hotkeys work, so that's a bummer. Some IGD devices will generate DMAR error spew on the host when assigned, particularly the first time per host boot. Don't be too alarmed by this, especially if it stops before the display is initialized. This seems to be caused by resetting the IGD in an IOMMU context where it can't access its page tables setup by the BIOS/host. Unless you have an ongoing spew of these, they can probably be ignored. If you have something older than SandyBridge that you wish you could use this with and continued reading even after told to stop, sorry, there was a hardware change at SandyBridge and I don't have anything older to test with and don't really want to support additional code for such outdated hardware. Besides, those are pretty old and you need an excuse for an upgrade anyway.

With this support I've switched my desktop system so that the host actually runs from a USB stick and the previous bare-metal Fedora install is virtualized with IGD, running alongside my existing GeForce VM. Give it a try and good luck.

Sunday, January 3, 2016

Comments on the 7 Gamers, 1 CPU video

In case you've seen this video:

And you're thinking to yourself that the R9 Nano they used looks like a great choice for your own GPU assignment build, think again. This GPU is known to have reset issues, so while it's impressive to see a build with this degree of consolidation, you should be suspicious about the problems Linus alludes to and the limited functionality of the overall system shown in the video. For instance, does rebooting a VM require a host reboot, or perhaps a manual soft eject of the GPU from the VM? We see this a lot with newer AMD cards, well beyond the partial workarounds we have for Bonaire and Hawaii based GPUs.

Personally I would have preferred to see an NVIDIA based solution, but due to the scale of this build, and unique slot and power restrictions, the compatibility with GPU assignment was mostly an afterthought. NVIDIA is not without issues, but for the time being we understand those issues and have workarounds for them, and even a path for supported configurations with Quadro cards.

On the plus side, yes, this is using KVM and VFIO and it's an impressive example of what this technology can do. However, when you're spec'ing your own build, do your own research and don't rely on videos like this to choose your components. My 2 cents...

Friday, October 23, 2015

Intel processors with ACS support

If you've been keeping up with this blog then you understand a bit about IOMMU groups and device isolation. In my howto series I describe the limitations of the Xeon E3 processor that I use in my example system and recommend Xeon E5 or higher processors to provide the best case device isolation for those looking to build a system. Well, thanks to the vfio-users mailing list, it has come to my attention that there are in fact Core i7 processors with PCIe Access Control Services (ACS) support on the processor root ports.

Intel lists these processors as High End Desktop Processors, they include Socket 2011-v3 Haswell E processors, Socket 2011 Ivy Bridge E processors, and Socket 2011 Sandy Bridge E processors. The linked datasheets for each family clearly lists ACS register capabilities. Current listings for these processors include:

Haswell-E (LGA2011-v3)
i7-5960X (8-core, 3/3.5GHz)
i7-5930K (6-core, 3.2/3.8GHz)
i7-5820K (6-core, 3.3/3.6GHz)

Ivy Bridge-E (LGA2011)
i7-4960X (6-core, 3.6/4GHz)
i7-4930K (6-core, 3.4/3.6GHz)
i7-4820K (4-core, 3.7/3.9GHz)

Sandy Bridge-E (LGA2011)
i7-3960X (6-core, 3.3/3.9GHz)
i7-3970X (6-core, 3.5/4GHz)
i7-3930K (6-core, 3.2/3.8GHz)
i7-3820 (4-core, 3.6/3.8GHz)

These also appear to be the only Intel Core processors compatible with Socket 2011 and 2011-v3 found on X79 and X99 motherboards, so basing your platform around these chipsets will hopefully lead to success. My recommendation is based only on published specs, not first hand experience though, so your mileage may vary.

Unfortunately there are not yet any Skylake based "High End Desktop Processors" and from what we've seen on the mailing list, Skylake does not implement ACS on processor root ports, nor do we have quirks to enable isolation on the Z170 PCH root ports (which include a read-only ACS capability, effectively confirming lack of isolation), and integrated I/O devices on the motherboard exhibit poor grouping with system management components (aside from the onboard I219 LOM, which we do have quirked). This makes the currently available Skylake platforms a really bad choice for doing device assignment.

Based on this new data, I'll revise my recommendation for Intel platforms to include Xeon E5 and higher processors or Core i7 High End Desktop Processors (as listed by Intel). Of course there are combinations where regular Core i5, i7 and Xeon E3 processors will work well, we simply need to be aware of their limitations and factor that into our system design.

EDIT (Oct 30 04:18 UTC 2015): A more subtle feature also found in these E series processors is support for IOMMU super pages. The datasheets for the E5 Xeons and these High End Desktop Processors indicate support for 2MB and 1GB IOMMU pages while the standard Core i5 and i7 only support 4KB pages. This means less space wasted for the IOMMU page tables, more efficient table walks by the hardware, and less thrashing of the I/O TLB under I/O load resulting in I/O stalls. Will you notice it? Maybe. VFIO will take advantage of IOMMU super pages any time we find a sufficiently sized range of contiguous pages. To help insure this happens, make use of hugepages in the VM.

EDIT (Oct 15 18:20 UTC 2016): Intel Broadwell-E processors have been out for some time and as we'd expect, the datasheets do indicate that ACS is supported. So add to the list above:

Broadwell-E (LGA2011-v3)
i7-6950X (10-core, 3.0/3.5GHz)
i7-6900K (8-core, 3.2/3.7GHz)
i7-6850K (6-core, 3.6/3.8GHz)
i7-6800K (6-core, 3.4/3.6GHz)

Wednesday, September 2, 2015

libvirt 1.2.19 - session mode device assignment

Just a quick note to point out a feature, actually bug fix, in the new libvirt 1.2.19 release:

hostdev: skip ACS check when using VFIO for device assignment (Laine Stump)

Why is this noteworthy? Well, that ACS checking required access to PCI config space beyond the standard header, which is privileged. That means that session (ie. user) mode libvirt couldn't do it and failed trying to support <hostdev> entries. Now that libvirt recognizes that vfio enforces device isolation and a userspace ACS test is unnecessary, session mode libvirt can support device assignment! Thanks Laine!

Note that a user still can't just pluck a device from the host and start using it, that's still privileged. There's also the problem that a VM making use of device assignment needs to lock all of the VM memory into RAM, which is typically quite a lot more than the standard user locked memory limit of 64kB. But these can be resolved by enabling the (trusted) user to lock memory sufficient for their VM and preparing the device for the user. The keys to doing this are:

Use /etc/security/limits.conf to increase memlock for the desired user
Pre-bind the desired device to vfio-pci, either by the various mechanisms provided in other posts or simply using virsh nodedev-detach.
Change the ownership of the vfio group to that of the (trusted) user. To determine the group, follow the links in sysfs or use virsh nodedev-dumpxml, for example:

      $ virsh nodedev-dumpxml pci_0000_00_19_0
      <device>
        <name>pci_0000_00_19_0</name>
        <path>/sys/devices/pci0000:00/0000:00:19.0</path>
        <parent>computer</parent>
        <driver>
          <name>e1000e</name>
        </driver>
        <capability type='pci'>
          <domain>0</domain>
          <bus>0</bus>
          <slot>25</slot>
          <function>0</function>
          <product id='0x1502'>82579LM Gigabit Network Connection</product>
          <vendor id='0x8086'>Intel Corporation</vendor>
          <iommuGroup number='4'>
            <address domain='0x0000' bus='0x00' slot='0x19' function='0x0'/>
          </iommuGroup>
        </capability>
      </device>

The iommuGroup sections tells us that this is group number 4, so permissions need to be set on /dev/vfio/4. As always, also note the set of devices within this group and ensure that all endpoints listed are either bound to vfio-pci or pci-stub, the former will allow the user access to the device, the latter will allow the group to be usable without explicitly allowing the user access.

Enjoy!

Tuesday, August 11, 2015

vfio-users mailing list

The Arch Linux VGA thread was recently closed leaving a number of users looking for a place to continue the conversation. To facilitate that, I've created a vfio-users mailing list for discussion of all topics related to vfio. That includes QEMU device assignment use cases, such as VGA/GPU, as well as userspace drivers. Though hosted by Red Hat, this is a distribution independent forum intended to further vfio and its use cases as a technology. Please be respectful of each other and keep topics relevant to the forum. Sign up here

Tuesday, May 5, 2015

VFIO GPU How To series, part 5 - A VGA-mode, SeaBIOS VM

For this example I'm show how to setup a VGA-mode VM using a Windows 7 guest and SeaBIOS. In this case we are not dependent on the guest or the graphics card supporting UEFI, but with Intel host graphics, we do need to work around the Linux i915 driver's broken participation in VGA arbitration. If you intend to use IGD for the host, the typical solution for this is to apply the i915 patch to the host kernel and use the enable_hd_vgaarb=1 option for the i915 driver to make it correctly participate in VGA arbitration. The side-effect of doing this is the loss of DRI support in the host Xorg server.

However, since this is just a demo, I'm going to be lazy and modify my script from part 4 of this series to comment out the test for boot_vga being non-zero such that vfio-pci will be set as the driver override for all VGA devices in the system. This limits my system to text-mode on the primary head for this example. For a long term solution, I would either be looking for a UEFI/OVMF setup as outlined in part 4, or disabling IGD and using discrete graphics for the host. Continuing to patch kernels for i915 VGA arbitration support is of course an option too, if you enjoy that sort of thing. If you're not using IGD on the host, or like my example, avoiding the i915 driver, you should be ok.

Windows 7 VM installation is largely the same as Windows 8 installation in part 4. The difference is that we're not going to modify the VM firmware to use UEFI, we'll leave that set to BIOS. The other difference is that BIOS won't drop to a shell and allow us to manually boot from the install CD after we've broken the boot order by adding the virtio drivers. My brute force solution to this is to attempt to boot the VM, it will error and say no OS found, force the VM off, then modify the boot options to push the Windows CDROM above the virtio disk and start the VM again. I expect that you could also enable the boot menu on the first pass and try to catch the F12 prompt to select the correct boot device.

After installation, update, and installing TightVNC, as in part 4, shut down the VM, prune and tune the VM using virsh edit, just as we did in the previous setup. If you're using Nvidia also add the KVM hidden section to features, removing the Hyper-V section and disable the Hyper-V clock source. Also return to virt-manager and add the GPU and audio function to the VM. All of this is documented in part 4.

Before you start the VM with the GPU, we first need to make a wrapper script around our qemu-kvm binary to insert the x-vga=on option. To do this, create a file named /usr/libexec/qemu-kvm.vga with the following:

#!/bin/sh

exec /usr/libexec/qemu-kvm \

`echo "\$@" | \

sed 's|01:00.0|01:00.0,x-vga=on|g' | \

sed 's|02:00.0|02:00.0,x-vga=on|g'`

We'll use this as the executable libvirt uses in place of qemu-kvm directly. Any time the qemu-kvm command line contains 01:00.0 or 02:00.0, which are the PCI addresses of the graphics cards in my system, we'll add the option ",x-vga=on". This makes it transparent to libvirt that this is happening. Be sure to chmod 755 the file before proceeding.

If your system is using selinux, libvirt will get an audit error trying to execute this script, so we'll need to add a new selinux module to allow for this. Red Hat documentation here provides instructions for doing this. In summary:

Set the selinux permissions for the file:

# restorecon /usr/libexec/qemu-kvm.vga

Create an selinux module

# cat > qemukvmvga.te << EOF

policy_module(qemukvmvga, 1.0)

gen_require(\`

 attribute virt_domain;

 type qemu_exec_t;

')

can_exec(virt_domain, qemu_exec_t)

EOF

Build the selinux module

# make -f /usr/share/selinux/devel/Makefile

Install selinux module

# semodule -i qemukvmvga.pp

With selinux happy, we next need to run virsh edit on the domain again. Find the <emulator> tag and update the executable to point to our new script. Save and exit the configuration and you should now be able to start the VM from virt-manager with VGA-mode enabled. The same driver installation guidelines from part 4 apply for the GeForce/Catalyst drivers. Enjoy.