Randomly sub-setting test suites

Sunday 14 January 2024

I needed to run random subsets of my test suite to narrow down the cause of some mysterious behavior. I didn’t find an existing tool that worked the way I wanted to, so I cobbled something together.

I wanted to run 10 random tests (out of 1368), and keep choosing randomly until I saw the bad behavior. Once I had a selection of 10, I wanted to be able to whittle it down to try to reduce it further.

I tried a few different approaches, and here’s what I came up with: two tools in the coverage.py repo that combine to do what I want:

  • A pytest plugin (select_plugin.py) that lets me run a command to output the names of the exact tests I want to run.
  • A command-line tool (pick.py) that selects random lines of text from a file. For convenience, blank and commented-out lines are ignored.
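
The selection part of pick.py can be sketched roughly like this (an illustration of the idea, not the actual pick.py code — names and the command-line shape here are assumptions):

```python
# Hypothetical sketch of a pick-style "sample" command, not the real pick.py.
# Usage: python pick_sketch.py sample N [SEED] < tests.txt
import random
import sys

def sample_lines(lines, n, seed=None):
    """Return n randomly chosen lines, ignoring blank and #-commented lines."""
    candidates = [
        line for line in lines
        if line.strip() and not line.strip().startswith("#")
    ]
    rng = random.Random(seed)  # seeded, so a selection can be reproduced
    return rng.sample(candidates, min(n, len(candidates)))

if __name__ == "__main__" and len(sys.argv) > 2:
    assert sys.argv[1] == "sample"
    n = int(sys.argv[2])
    seed = int(sys.argv[3]) if len(sys.argv) > 3 else None
    for line in sample_lines(sys.stdin.read().splitlines(), n, seed):
        print(line)
```

The seed argument is what makes step 4 below work: the same seed always reproduces the same batch.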

More details are in the comment at the top of pick.py, but here’s a quick example:

  1. Get all the test names in tests.txt. These are pytest “node” specifications:
    pytest --collect-only | grep :: > tests.txt
  2. Now tests.txt has a line per test node. Some are straightforward:
    tests/test_cmdline.py::CmdLineStdoutTest::test_version
    tests/test_html.py::HtmlDeltaTest::test_file_becomes_100
    tests/test_report_common.py::ReportMapsPathsTest::test_map_paths_during_html_report
    but with parameterization they can be complicated:
    tests/test_files.py::test_invalid_globs[bar/***/foo.py-***]
    tests/test_files.py::FilesTest::test_source_exists[a/b/c/foo.py-a/b/c/bar.py-False]
    tests/test_config.py::ConfigTest::test_toml_parse_errors[[tool.coverage.run]\nconcurrency="foo"-not a list]
  3. Run a random bunch of 10 tests:
    pytest --select-cmd="python pick.py sample 10 < tests.txt"
    We’re using --select-cmd to specify the shell command that will output the names of tests. Our command uses pick.py to select 10 random lines from tests.txt.
  4. Run many random bunches of 10, announcing the seed each time:
    for seed in $(seq 1 100); do
        echo seed=$seed
        pytest --select-cmd="python pick.py sample 10 $seed < tests.txt"
    done
  5. Once you find a seed that produces the small batch you want, save that batch:
    python pick.py sample 10 17 < tests.txt > bad.txt
  6. Now you can run that bad batch repeatedly:
    pytest --select-cmd="cat bad.txt"
  7. To reduce the bad batch, comment out lines in bad.txt with a hash character, and the tests will be excluded. Keep editing until you find the small set of tests you want.
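
The real plugin is select_plugin.py in the coverage.py repo; the core of a --select-cmd style plugin can be sketched like this (illustrative only — the actual implementation may differ in its details):

```python
# Illustrative sketch of a --select-cmd style pytest plugin; the real
# implementation is select_plugin.py in the coverage.py repo.
import subprocess

def selected_nodeids(cmd):
    """Run a shell command and return the set of test node IDs it prints."""
    out = subprocess.run(
        cmd, shell=True, check=True, capture_output=True, text=True
    ).stdout
    return {line.strip() for line in out.splitlines() if line.strip()}

def pytest_addoption(parser):
    parser.addoption(
        "--select-cmd",
        help="Shell command that outputs the test node IDs to run",
    )

def pytest_collection_modifyitems(config, items):
    cmd = config.getoption("--select-cmd")
    if not cmd:
        return
    wanted = selected_nodeids(cmd)
    # Keep only the collected items whose node ID the command printed.
    items[:] = [item for item in items if item.nodeid in wanted]
```

pytest_collection_modifyitems is a standard pytest hook, so the plugin never has to know how the test names were chosen — any shell command that prints node IDs will do.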

I like that this works and I understand it. I like that it’s based on the bedrock of text files and shell commands. I like that there’s room for different behavior in the future by adding to how pick.py works. For example, it doesn’t do any bisecting now, but it could be adapted to do it.
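
If bisecting were added, the core loop might look something like this sketch, where still_fails is a hypothetical callable that runs a batch and reports whether the bad behavior reproduced:

```python
# Hypothetical sketch of a bisecting mode: repeatedly split the failing
# batch and keep whichever half still fails.
def bisect_failing(tests, still_fails):
    """Shrink `tests` to a smaller list for which still_fails() is True.

    `still_fails` takes a list of test names and returns True if the
    bad behavior reproduces when only those tests are run.
    """
    while len(tests) > 1:
        mid = len(tests) // 2
        first, second = tests[:mid], tests[mid:]
        if still_fails(first):
            tests = first
        elif still_fails(second):
            tests = second
        else:
            # The failure needs tests from both halves; plain bisection
            # stops here. (Full delta debugging would try more subsets.)
            break
    return tests
```

Each call to still_fails would be a pytest run, so this converges in roughly log2(N) runs when a single test is to blame, but falls back to the whole batch when the failure is an interaction across halves.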

As usual, there might be a better way to do this, but this works for me.

Comments

Why random, though? Wouldn’t you want to do this in a more controlled fashion, like maybe a binary search, to get to the problem faster?

(I guess because it’s not finding a specific broken test, but a combination of tests which break. I’m still having a hard time coupling “random” with “identify the problem mechanically”, though - maybe permuting the set of tests in combination with binary search?)

In my case, there was an interaction between tests. I didn’t know what combination would produce the smallest reproducer, and this was quick to implement. Bisecting might have also worked, or some other form of more disciplined subsetting.

We used pytest-random (which I now see hasn’t been updated in over a decade). I didn’t go with a subset of 10; I just re-ran the entire suite 50 times with a fast fail until I found an order that caused the flaky test to fail.

My command line ended up being:
    poetry run pytest path/to/tests/file.py --timeout=10 --count=50 --random -x

Random subsets have a chance of repeatedly picking an already-tried test; dividing a single random permutation of the tests into disjoint subsets would be more efficient.

You could write files with lists of 10 tests all at once and try them sequentially, or pick.py could have an additional parameter: permute M items according to random seed S and give the Kth set of N items, or fewer items in the last set.

There are ceil(M/N) sets, and if K > ceil(M/N) it can be considered an error, or the permutation could be changed (using e.g. S+(K*N)/M instead of S).
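
This scheme could be sketched as follows (a sketch of the commenter’s idea, not part of pick.py; the parameter names follow the comment above):

```python
# Sketch of the commenter's idea: one seeded permutation of all tests,
# split into disjoint chunks, so no test is tried twice across batches.
import random

def kth_chunk(lines, n, seed, k):
    """Permute `lines` with `seed`, return the k-th (0-based) chunk of n."""
    rng = random.Random(seed)
    permuted = lines[:]
    rng.shuffle(permuted)
    chunks = [permuted[i:i + n] for i in range(0, len(permuted), n)]
    if k >= len(chunks):
        raise IndexError(f"only {len(chunks)} chunks of size {n}")
    return chunks[k]
```

Because every chunk comes from the same seeded shuffle, the ceil(M/N) chunks are disjoint and together cover the whole suite.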

To avoid picking already-tried tests*, you could also add a verb to subtract the contents of one file from another. Then, each time you try a batch and get no failures, remove it from the list of ones to try. Basically “choose without replacement”. It’s possible that this small algebra on lists of lines in files would have other uses.

* But because there is an interaction between tests, the fact that a given test was already run without error doesn’t mean that we may conclude it is not implicated in the failure mode.
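
Such a subtract verb (hypothetical — pick.py doesn’t have it) would be a one-liner over lines of text:

```python
# Sketch of a hypothetical "subtract" verb for pick.py: print the lines
# of one file that don't appear in another ("choose without replacement").
def subtract(lines, tried_lines):
    """Return the lines not present in tried_lines, preserving order."""
    tried = set(tried_lines)
    return [line for line in lines if line not in tried]
```

Wired up as a shell command it would look like python pick.py subtract tried.txt < tests.txt, in the same text-files-and-pipes spirit as the rest of the tooling.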
