What's New for the CPU95 Benchmarks
Goals
There were two goals for the '95 suites:
- Better benchmarks (i.e. less trivial).
  The benchmarks have all been greatly upgraded, with many of the
  weaknesses in the '92 material either being eliminated or drastically
  reduced.
- Greater reproducibility of results.
  The rules have been significantly tightened through the creation and
  enforcement of a standardized build and run environment. Under this new
  environment, the entire process has been automated, all the way from a
  given configuration definition right through a finalized PostScript
  reporting page. Not only is manual intervention unnecessary, it is
  explicitly prohibited.
For the end user of the results, this means that the new '95 benchmarks
will produce more applicable results.
Differences
There are a number of differences between the '92 and the '95
benchmarks...
- Benchmarks
  There are now 8 INT and 10 FP benchmarks in the suites. All of the
  benchmarks are bigger than the '92 benchmarks: larger code size, more
  memory activity. To increase comparability, the time spent in libraries
  has been minimized.
- Measurement
  All measurements must be made in the new tools environment. This ensures
  that all results have been compiled and executed in a similar
  environment; there are no more hand-crafted binaries or special-case
  invocations. The new tools also calculate the result from the median
  timing of a series of runs, so there is no more quoting of that
  one-in-a-hundred case where everything lines up just right.
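  As an illustration (the timings and reference value below are made up,
  not drawn from any actual benchmark): if three runs of a benchmark
  finish in 131, 118, and 120 seconds, the median timing of 120 seconds is
  the one that counts, and the lucky 118-second run is ignored. With a
  reference time of 2400 seconds, the reported ratio would be
  2400 / 120 = 20.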
- Tools
  The old menu interface has been replaced with a flexible command-line
  interface. Once installed and set up, configuration file(s) control most
  aspects of building and running the benchmarks: which arguments are
  passed to the compiler (including support for feedback-directed and
  other two-pass compilation systems) and which arguments are passed to a
  run-time job loader (including support for job queues). Simple options
  on the command line control which of several configurations to use as
  well as selection choices: how many iterations, which binaries to use,
  what metric to compute, etc. Additionally, there is built-in support for
  a wide variety of monitoring hooks so that one can automate the
  collection of sar(1), gprof(1), and other such data.
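  For illustration only (the keywords, file name, and command-line options
  shown here are assumptions meant to convey the flavor of the interface,
  not the definitive tool syntax), a configuration file might contain
  entries such as

      # which compilers to invoke and which optimization flags to pass
      CC        = cc
      FC        = f77
      OPTIMIZE  = -O2
      # hypothetical monitoring hook: collect sar(1) data around each run
      monitor   = sar -o sar.out 5 100 &

  and a complete build-and-run cycle might then be started with a single
  command along the lines of

      runspec -c myconfig -n 3 int

  which would build and run the integer suite three times under that
  configuration.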
- Result Pages
  Result pages are generated automatically for each run by the tools. There
  is no need to invoke Excel or get out a calculator. The tools will select
  the appropriate timings, calculate the correct metrics, and create a full
  reporting page in either ASCII or PostScript form. Thus, once the
  configuration files have been written, no further manual intervention is
  required to generate fully compliant results that are ready for
  submission to SPEC.
New Rules
Most of the rules are the same as for '92. But there are some
changes...
- Mandatory Baseline
  Baseline is now mandatory: each report must include a baseline
  configuration, and a peak result may also be reported.
- Baseline Definition
  The definition of what qualifies as baseline has also been tightened.
  Baseline has always been taken to mean that which the sponsor would
  recommend to any customer interested in performance, implying that the
  features used are safe and supported in virtually all cases. Now, in
  addition, a baseline result may use no more than 4 compiler options
  (beyond any portability options that might be required to compile a
  benchmark on a particular system). This cut-off is admittedly arbitrary,
  but it was chosen as a means of discouraging unmanageably long and
  arcane compilation specifications.
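  For illustration only (these flags are hypothetical and not drawn from
  any particular compiler), a baseline compilation specification such as

      cc -fast -unroll -inline -noalias -DSYSV

  would be within the limit: the four optimization options count against
  the cap, while the portability define does not.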
- Tools
  The use of the provided SPEC tool environment is mandatory for any
  result. This means that it is no longer possible to handcraft a binary
  or to select the best timings from a variety of runs. The new tools read
  one configuration, make one set of runs, calculate the resulting metric,
  and generate a PostScript or ASCII result page, all automatically. Any
  manual "assistance" is prohibited.