Tuesday, March 29, 2011

How to use perf-probe

Dynamic tracepoints are an observability feature that allows functions
and even arbitrary lines of source code to be instrumented without recompiling
the program. Armed with a copy of the program's source code, you can place
tracepoints anywhere at runtime and dump variable values each time execution
passes a tracepoint. This is an extremely powerful technique
for instrumenting complex systems or code written by someone else.

I recently investigated issues with CD-ROM media change
inside a KVM guest. Using QEMU tracing I was able to
instrument the ATAPI CD-ROM emulation inside QEMU and observe the commands
being sent to the CD-ROM. I complemented this with perf-probe(1) to
instrument the sr and cdrom drivers in the Linux host and
guest kernels. The power of perf-probe(1) was key to understanding the
state of the CD-ROM drivers without recompiling a custom debugging kernel.

Overview of perf-probe(1)

The probe subcommand of perf(1) allows dynamic tracepoints to be added or
removed inside the Linux kernel. Both the core kernel and modules can be
instrumented.

Besides instrumenting locations in the code, a tracepoint can also fetch
values from local variables, globals, registers, the stack, or memory. You can
combine this with perf record -g call-graph recording to get stack
traces when execution passes probes. This makes perf-probe(1)
extremely powerful.
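For example, once a probe exists (added with the syntax shown below), stack traces can be captured each time it fires. This is a sketch; probe:cdrom_open is an illustrative probe name on the cdrom driver:
$ sudo perf record -g -e probe:cdrom_open -aR sleep 10
$ sudo perf report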

Requirements

A word of warning: use a recent kernel and perf(1) tool. I was unable
to use function return probes with Fedora 14's 2.6.35-based kernel. They work
fine with a 2.6.38 perf(1) tool. In general, I have found the
perf(1) tool to be somewhat unstable, but this has improved a lot with
recent versions.

Plain function tracing can be done without installing kernel debuginfo
packages, but arbitrary source line probes and variable dumping require the
kernel debuginfo packages. On Debian-based distros the kernel debuginfo
packages are called linux-image-*-dbg and on Red Hat-based distros
they are called kernel-debuginfo.
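Installation might look like this (package names vary by distro and kernel version, so treat these commands as a sketch):
$ sudo apt-get install linux-image-$(uname -r)-dbg   # Debian-based
$ sudo debuginfo-install kernel                      # Red Hat-based (yum-utils)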

Tracing function entry

The following command-line syntax adds a new function entry probe:
sudo perf probe <function-name> [<local-variable> ...]
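For example, this adds an entry probe to the cdrom driver's cdrom_open() function (an illustrative choice; any local variable names you append must exist in the function's source):
$ sudo perf probe cdrom_open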

Tracing function return

The following command-line syntax adds a new function return probe and dumps the return value:
sudo perf probe <function-name>%return '$retval'
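The probe used in the example below was created along these lines (reconstructed from the output, so treat it as a sketch):
$ sudo perf probe scsi_test_unit_ready%return '$retval'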
Notice the arg1 return value in this example output for scsi_test_unit_ready():
$ sudo perf record -e probe:scsi_test_unit_ready -aR sleep 10
[...]
$ sudo perf script
     kworker/0:0-3383  [000]  3546.003235: scsi_test_unit_ready: (ffffffffa0208c6d <- ffffffffa0069306) arg1=8000002
 hald-addon-stor-3025  [001]  3546.004431: scsi_test_unit_ready: (ffffffffa0208c6d <- ffffffffa006a143) arg1=8000002
     kworker/0:0-3383  [000]  3548.004531: scsi_test_unit_ready: (ffffffffa0208c6d <- ffffffffa0069306) arg1=8000002
 hald-addon-stor-3025  [001]  3548.005531: scsi_test_unit_ready: (ffffffffa0208c6d <- ffffffffa006a143) arg1=8000002

Listing probes

Existing probes can be listed:
sudo perf probe -l
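For the return probe above, the output looks roughly like this (exact formatting varies between perf versions):
  probe:scsi_test_unit_ready (on scsi_test_unit_ready%return with $retval)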

Removing probes

A probe can be deleted with:
sudo perf probe -d <probe-name>
All probes can be deleted with:
sudo perf probe -d '*'

Further information

See the perf-probe(1) man page for details on command-line syntax and additional features.

The underlying mechanism of perf-probe(1) is kprobe-based event tracing, which is documented in Documentation/trace/kprobetrace.txt in the Linux kernel source tree.

I hope that future Linux kernels will add perf-probe(1) support for
userspace processes. GDB may already offer something similar in its tracing
commands, but I haven't had a chance to try them out yet.

Saturday, March 12, 2011

How to write trace analysis scripts for QEMU

This post shows how to write a Python script that finds overlapping disk writes in a QEMU simple trace file.

Several trace backends, including SystemTap and LTTng Userspace Tracer, are supported by QEMU. The built-in "simple" trace backend often gives the best bang for the buck because it does not require installing additional software and is easy for developers to use. See my earlier post for an overview of QEMU tracing.

The simple trace backend has recently been enhanced with a Python module for analyzing trace files. This makes it easy to write scripts that post-process trace files and extract useful information. Take this example from the commit that introduced the simpletrace module:

#!/usr/bin/env python
# Print virtqueue elements that were never returned to the guest.

import simpletrace

class VirtqueueRequestTracker(simpletrace.Analyzer):
    def __init__(self):
        self.elems = set()

    def virtqueue_pop(self, vq, elem, in_num, out_num):
        self.elems.add(elem)

    def virtqueue_fill(self, vq, elem, length, idx):
        self.elems.remove(elem)

    def end(self):
        for elem in self.elems:
            print hex(elem)

simpletrace.run(VirtqueueRequestTracker())

This script tracks virtqueue_pop and virtqueue_fill operations and prints the elements that were popped but never filled back, which indicates that they have leaked.

The model of an analysis script is similar to awk. Trace records are processed from the input file by invoking methods on the user's simpletrace.Analyzer object. The analyzer object does not have to supply methods for all possible trace events; it can implement just those it wants to know about. Trace events that have no dedicated method cause the catchall() method to be invoked, if provided.
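To illustrate the catchall() fallback, here is a minimal sketch that counts every record in a trace file (I accept arbitrary arguments with *args since the exact catchall() signature may vary):

#!/usr/bin/env python
# Count all trace records, regardless of event type.
import simpletrace

class EventCounter(simpletrace.Analyzer):
    def __init__(self):
        self.total = 0

    def catchall(self, *args):
        # Invoked for every trace event without a dedicated method
        self.total += 1

    def end(self):
        print '%d records' % self.total

simpletrace.run(EventCounter())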

Tracing disk write operations


Let's write a slightly fancier script that finds disk writes that overlap a given range. I've needed to perform disk write overlap queries in the past when debugging image formats, so this is a useful script to have. QEMU's block layer write function looks like this:

BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
                                  QEMUIOVector *iov, int nb_sectors,
                                  BlockDriverCompletionFunc *cb, void *opaque);

The block device is called bs and sectors are 512 bytes. Conveniently there is already a trace event for bdrv_aio_writev, so we just need to enable it in the trace-events file:

disable bdrv_aio_writev(void *bs, int64_t sector_num, int nb_sectors, void *opaque) "bs %p sector_num %"PRId64" nb_sectors %d opaque %p"

Open the trace-events file and remove the disable keyword from the bdrv_aio_writev trace event. Then rebuild QEMU like this:

$ ./configure --enable-trace-backend=simple # ... plus your usual options
$ make

Next time you run QEMU, a trace file named trace-$PID will be created in the current working directory. The file is in binary format and can be parsed using the simpletrace Python module or pretty-printed like this:

$ scripts/simpletrace.py trace-events trace-12345 # replace "trace-12345" with the actual filename

Finding overlapping disk writes


Here is the usage information for a script to find disk writes that overlap a given range:

usage: find_overlapping_writes.py <trace-events> <trace-file> <bs> <sector-num> <nb-sectors>

The script only considers writes to a specific block device, bs. That means all disk I/O to other block devices is ignored.

For example, let's find writes to a 1 MB (2048-sector) region at offset 2 MB (sector 4096) of the BlockDriverState 0x29c8180:
$ ./find_overlapping_writes.py trace-events trace-19129 0x29c8180 4096 2048
0xe40+664 opaque=0x2eb4190
0x10d8+3032 opaque=0x2eb41d0

Here is the code:
#!/usr/bin/env python
import sys
import simpletrace

def intersects(a_start, a_len, b_start, b_len):
    return not (a_start + a_len <= b_start or \
                b_start + b_len <= a_start)

class OverlappingWritesAnalyzer(simpletrace.Analyzer):
    def __init__(self, bs, sector_num, nb_sectors):
        self.bs = bs
        self.sector_num = sector_num
        self.nb_sectors = nb_sectors

    def bdrv_aio_writev(self, bs, sector_num, nb_sectors, opaque):
        if bs != self.bs:
            return
        if intersects(self.sector_num, self.nb_sectors, sector_num, nb_sectors):
            print '%#x+%d opaque=%#x' % (sector_num, nb_sectors, opaque)

if len(sys.argv) != 6:
    sys.stderr.write('usage: %s <trace-events> <trace-file> <bs> <sector-num> <nb-sectors>\n' %
                     sys.argv[0])
    sys.exit(1)

trace_events, trace_file, bs, sector_num, nb_sectors = sys.argv[1:]
bs = int(bs, 0)
sector_num = int(sector_num, 0)
nb_sectors = int(nb_sectors, 0)

analyzer = OverlappingWritesAnalyzer(bs, sector_num, nb_sectors)
simpletrace.process(trace_events, trace_file, analyzer)

The core of the script is the OverlappingWritesAnalyzer that checks bdrv_aio_writev events for intersection with the given range.

This script is longer than the virtqueue leak detector example above because it parses command-line arguments. The simpletrace.run() function used by the leak detector handles the default trace events and trace file arguments for you. So scripts that take no special command-line arguments can use simpletrace.run(), which also prints usage information automatically. For overlapping writes we really need our own command-line arguments so the slightly lower-level simpletrace.process() function is used.

Where to find out more


There is more information about the simpletrace module in the doc comments, so the simplest way to get started is:
$ cd qemu
$ PYTHONPATH=scripts python
>>> import simpletrace
>>> help(simpletrace)

Another example of how to use the simpletrace module is the trace file pretty-printer, which is included as part of the scripts/simpletrace.py source code itself!

Feel free to leave questions or comments about the simple trace backend and the simpletrace Python module.

Wednesday, February 23, 2011

Observability using QEMU tracing

I am going to describe the tracing feature in QEMU and KVM.

Overview of QEMU tracing


Tracing is available for the first time in QEMU 0.14.0 and qemu-kvm 0.14.0. It's an optional feature and may not be enabled in distro packages yet, but it's there if you are willing to build from source.

QEMU tracing is geared towards answering questions about running virtual machines:
  • What I/O accesses are being made to emulated devices?
  • How long are disk writes taking to complete inside QEMU?
  • Is QEMU leaking memory or other resources by not freeing them?
  • Are network packets being received but filtered at the QEMU level?

In order to find answers to these questions, we place trace events into the QEMU source code at strategic points. For example, every qemu_malloc() and qemu_free() call can be traced so we know what heap memory allocations are going on.
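The declarations for such events would follow the same format as other entries in the trace-events file, along these lines (illustrative; check your QEMU tree for the exact definitions):

disable qemu_malloc(size_t size, void *ptr) "size %zu ptr %p"
disable qemu_free(void *ptr) "ptr %p"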

Current status


Today QEMU tracing is useful to developers and anyone troubleshooting or investigating bugs.

The set of trace events that comes with QEMU is limited but already useful for observing the block layer and certain emulated hardware. Developers are adding trace events to new code and converting some existing debug printfs to trace events. I expect the default set of trace events to grow and become more useful in the future.

Trace events are currently not a stable API so scripts that work with one version of QEMU are not guaranteed to work with another version. There is also no documentation on the semantics of particular trace events, so it is necessary to understand the code which contains the trace event to know its meaning. In the future we can make stable trace events with explicit semantics like "packet received from host".

QEMU tracing cross-platform support


You have a choice of trace backends: SystemTap, LTTng Userspace Tracer, and a built-in "simple" tracer are supported. DTrace could be added with little effort on Solaris, Mac OS X, and FreeBSD host platforms.
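The backend is chosen when building QEMU, for example (backend names here are from QEMU 0.14 and may change in later versions):

$ ./configure --enable-trace-backend=simple   # built-in tracer
$ ./configure --enable-trace-backend=ust      # LTTng Userspace Tracer
$ ./configure --enable-trace-backend=dtrace   # SystemTap/DTrace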

The available set of trace events is the same no matter which trace backend you choose.

Where to find out more


If you want to get started, check out the documentation that comes as part of QEMU.

Also check out the excellent QEMU 0.14.0 changelog for pointers related to tracing.

I am looking forward to writing more about tracing in the future and sharing trace analysis scripts. In fact, I just submitted a patch to provide a Python API for processing trace files generated by the "simple" trace backend. It makes analyzing trace files quick and fun :).