What's New for the CPU95 Benchmarks
Goals
There were two goals for the '95 suites:
- Better benchmarks (i.e. less trivial).
  The benchmarks have all been greatly upgraded, with many of the
  weaknesses in the '92 material either being eliminated or drastically
  reduced.
- Greater reproducibility of results.
  The rules have been significantly tightened through the creation and
  enforcement of a standardized build and run environment. Under this new
  environment, the entire process has been automated, all the way from a
  given configuration definition right through a finalized PostScript
  reporting page. Not only is manual intervention unnecessary, it is
  explicitly prohibited.
For the end user of the results, this means that the new '95 benchmarks
will produce more applicable results.
Differences
There are a number of differences between the '92 and the '95
benchmarks...
- Benchmarks
  There are now 8 INT and 10 FP benchmarks in the suites. All of the
  benchmarks are bigger than the '92 benchmarks: larger code size, more
  memory activity. To increase comparability, the time spent in libraries
  has been minimized.
- Measurement
  All measurements must be made in the new tools environment. This ensures
  that all results have been compiled and executed in a similar
  environment; there are no more hand-crafted binaries or special-case
  invocations. The new tools also calculate the result from the median
  timing of a series of runs, so there is no more quoting of that
  one-in-a-hundred case where everything lines up just right.
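  As an illustration (the timings and reference value below are made up,
  not drawn from any actual benchmark): if three runs of a benchmark
  finish in 131, 118, and 120 seconds, the median timing of 120 seconds is
  the one that counts, and the lucky 118-second run is ignored. With a
  reference time of 2400 seconds, the reported ratio would be
  2400 / 120 = 20.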
- Tools
  The old menu interface has been replaced with a flexible command-line
  interface. Once installed and set up, configuration file(s) control most
  aspects of building and running the benchmarks: which arguments are
  passed to the compiler (including support for feedback-directed and
  other two-pass compilation systems) and which arguments are passed to a
  run-time job loader (including support for job queues). Simple options
  on the command line control which of several configurations to use as
  well as selection choices: how many iterations, which binaries to use,
  what metric to compute, etc. Additionally, there is built-in support for
  a wide variety of monitoring hooks so that one can automate the
  collection of sar(1), gprof(1), and other such data.
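  For illustration only (the keywords, file name, and command-line options
  shown here are assumptions meant to convey the flavor of the interface,
  not the definitive tool syntax), a configuration file might contain
  entries such as

      # which compilers to invoke and which optimization flags to pass
      CC        = cc
      FC        = f77
      OPTIMIZE  = -O2
      # hypothetical monitoring hook: collect sar(1) data around each run
      monitor   = sar -o sar.out 5 100 &

  and a complete build-and-run cycle might then be started with a single
  command along the lines of

      runspec -c myconfig -n 3 int

  which would build and run the integer suite three times under that
  configuration.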
- Result Pages
  Result pages are generated automatically for each run by the tools. There
  is no need to invoke Excel or get out a calculator. The tools will select
  the appropriate timings, calculate the correct metrics, and create a full
  reporting page in either ASCII or PostScript form. Thus, once the
  configuration files have been written, no further manual intervention is
  required to generate fully compliant results that are ready for
  submission to SPEC.
New Rules
Most of the rules are the same as for '92. But there are some
changes...
- Mandatory Baseline
  Baseline is now mandatory: each report must include a baseline
  configuration, and a peak result may also be reported.
- Baseline Definition
  The definition of what qualifies as baseline has also been tightened.
  Baseline has always been taken to mean that which the sponsor would
  recommend to any customer interested in performance, implying that the
  features used are safe and supported in virtually all cases. Now, in
  addition, a baseline result may use no more than 4 compiler options
  (beyond any portability options that might be required to compile a
  benchmark on a particular system). This cut-off is admittedly arbitrary,
  but it was chosen as a means of discouraging unmanageably long and
  arcane compilation specifications.
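  For illustration only (these flags are hypothetical and not drawn from
  any particular compiler), a baseline compilation specification such as

      cc -fast -unroll -inline -noalias -DSYSV

  would be within the limit: the four optimization options count against
  the cap, while the portability define does not.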
- Tools
  The use of the provided SPEC tool environment is mandatory for any
  result. This means that it is no longer possible to handcraft a binary
  or to select the best timings from a variety of runs. The new tools read
  one configuration, make one set of runs, calculate the resulting metric,
  and generate a PostScript or ASCII result page, all automatically. Any
  manual "assistance" is prohibited.