SPECmail2001 Release 1.0 Run and Reporting Rules
Version 1.00
This document specifies how SPECmail2001 is to be run for measuring and publicly reporting performance results. These rules have been established by the SPEC Mail Server Subcommittee and approved by the SPEC Open Systems Steering Committee. The rules ensure that results generated with this suite are meaningful, comparable to other generated results, and repeatable (with documentation covering factors pertinent to duplicating the results).
Per the SPEC license agreement, all results publicly disclosed must adhere to these Run and Reporting Rules.
SPEC believes the user community will benefit from an objective series of tests, which can serve as common reference and be considered as part of an evaluation process.
SPEC is aware of the importance of optimizations in producing the best system performance. SPEC is also aware that it is sometimes hard to draw an exact line between legitimate optimizations that happen to benefit SPEC benchmarks and optimizations that specifically target the SPEC benchmarks. SPEC wants to increase awareness among implementers and end users of unwanted benchmark-specific optimizations that would be incompatible with SPEC's goal of fair benchmarking.
SPEC expects that any public use of results from this benchmark suite shall be for Systems Under Test (SUTs) and configurations that are appropriate for public consumption and comparison. Thus, it is required that:
To ensure that results are relevant to end users, SPEC expects that the hardware and software implementations used for running the SPEC benchmarks adhere to the following conventions:
SPEC reserves the right to investigate any case where it appears that these guidelines and the associated benchmark run and reporting rules have not been followed for a published SPEC benchmark result. SPEC may request that the result be withdrawn from the public forum in which it appears and that the benchmarker correct any deficiency in product or process before submitting or publishing future results.
SPEC reserves the right to adapt the benchmark codes, workloads, and rules of SPECmail2001 as deemed necessary to preserve the goal of fair benchmarking. SPEC will notify members and licensees if changes are made to the benchmark and will rename the metrics (e.g. from SPECmail2001 to SPECmail2001a).
Relevant standards are cited in these run rules as URL references, and are current as of the date of publication. Changes or updates to these referenced documents or URLs may necessitate repairs to the links and/or amendment of the run rules. The most current run rules will be available at the SPEC web site at http://www.spec.org. SPEC will notify members and licensees whenever it makes changes to the documentation.
The production of compliant SPECmail2001 test results requires that the tests be run in accordance with these run rules. These rules relate to the requirements for the System Under Test (SUT) and the testbed (i.e. SUT, clients, and network), including protocols, operation, configuration, test staging, optimizations, and measurement.
As Internet email is defined by its protocol definitions, SPECmail2001 requires adherence to the relevant protocol standards:
RFC 821: Simple Mail Transfer Protocol (SMTP)
RFC 1939: Post Office Protocol - Version 3 (POP3)
The SMTP and POP3 protocols imply the following:
RFC 791: Internet Protocol (IPv4)
RFC 2460: Internet Protocol, Version 6 (IPv6) [may be used in place of IPv4]
RFC 792: Internet Control Message Protocol (ICMP)
RFC 793: Transmission Control Protocol (TCP)
RFC 950: Internet Standard Subnetting Procedure
RFC 1122: Requirements for Internet Hosts - Communication Layers
Internet standards are evolving standards. Adherence to related RFCs (e.g. RFC 1191 Path MTU Discovery) is also acceptable, provided the implementation retains the characteristic of interoperability with other implementations.
The entire testbed (SUT, clients, and network) must consist of components that are generally available, or that will be generally available within three months of the first publication of the results.
Products are considered generally available if they are orderable by ordinary customers and ship within a reasonable time frame. This time frame is a function of the product size and classification, and common practice. Some limited quantity of the product must have shipped on or before the close of the stated availability window. Shipped products do not have to match the tested configuration in terms of CPU count, memory size, and disk count or size, but the tested configuration must be available to ordinary customers. The availability of support and documentation for the products must coincide with the release of the products.
Hardware products that are still supported by their original or primary vendor may be used if their original general availability date was within the last five years. The five-year limit is waived for hardware used in client systems.
Software products that are still supported by their original or primary vendor may be used if their original general availability date was within the last three years.
In the disclosure, the benchmarker must identify any component that is no longer orderable by ordinary customers.
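The availability windows above can be sketched as a small date check. This is only an illustration: the function and constant names are hypothetical, the windows are measured against the first publication date (an assumption, since the rules do not state the reference date for "within the last five years"), and the vendor-support condition is not modeled.

```python
import calendar
from datetime import date

# Age limits in months, taken from the rules above.
MAX_AGE_MONTHS = {"hardware": 60, "software": 36}
# Components not yet shipping must be available within three months.
FUTURE_WINDOW_MONTHS = 3


def shift_months(d, months):
    """Shift a date by whole calendar months, clamping the day of month."""
    index = d.month - 1 + months
    year, month = d.year + index // 12, index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def is_eligible(kind, ga_date, publication, is_client=False):
    """Check one component's general availability date against the windows.

    kind is "hardware" or "software"; publication is the date of first
    publication of the result (assumed reference point for both windows).
    """
    if ga_date > publication:
        # Not yet generally available: must ship within three months.
        return ga_date <= shift_months(publication, FUTURE_WINDOW_MONTHS)
    if kind == "hardware" and is_client:
        # The five-year limit is waived for hardware used in client systems.
        return True
    return ga_date >= shift_months(publication, -MAX_AGE_MONTHS[kind])
```

For a result first published on 1 June 2001, hardware first shipped in January 1997 would pass this check, while software first shipped on the same date would not.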
The SUT must utilize stable storage for the mail store. Mail servers are expected to safely store any email they have accepted until the recipient has disposed of it. To do this, mail servers must be able to recover the mail store without loss from multiple power failures (including cascading power failures), operating system failures, and hardware failures of components (e.g. CPU) other than the storage medium. At any point where the data can be cached, after the server has accepted the message and acknowledged its receipt, there must be a mechanism to ensure any cached message survives the server failure.
If a UPS is required by the SUT to meet the stable storage requirement, the benchmarker is not required to perform the test with a UPS in place. The benchmarker must state in the disclosure that a UPS is required. Supplying a model number for an appropriate UPS is encouraged but not required.
If a battery-backed component is used to meet the stable storage requirement, that battery must have sufficient power to maintain the data for at least 48 hours to allow any cached data to be committed to media and the system to be gracefully shut down. The system or component must also be able to detect a low battery condition and prevent the use of the component or provide for a graceful system shutdown.
The SUT must present to mail clients the appearance and behavior of a single logical server for each protocol. Specifically, the SUT must present a single system view, in that the results of any mail transaction from a client that change the state on the SUT must be visible to all other clients on any subsequent mail transaction. For example, if User_1 has 10 mail messages in his mailbox on the SUT, then that user could read those 10 messages from any client system.
For a run to be valid, the following attributes related to logging must hold true:
For a run to be valid, the following attributes that relate to TCP/IP network configuration must hold true:
Note: SPEC intends to follow relevant standards wherever practical, but with respect to this performance-sensitive parameter that is difficult because the standards are ambiguous. RFC 1122 requires that TIME_WAIT be 2 times the maximum segment lifetime (MSL), and RFC 793 suggests a value of 2 minutes for MSL. So TIME_WAIT itself is effectively not limited by the standards. However, current TCP/IP implementations define a de facto lower limit for TIME_WAIT of 60 seconds, which is the value used in most BSD-derived UNIX implementations.
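The arithmetic in this note can be written out as a short sketch. The names below are hypothetical, and treating the 60-second de facto lower limit as a pass/fail threshold is an assumption made for illustration only.

```python
# RFC 793 suggests an MSL of 2 minutes.
RFC793_SUGGESTED_MSL_SECONDS = 120
# De facto lower limit cited in the note (most BSD-derived implementations).
DE_FACTO_MIN_TIME_WAIT_SECONDS = 60


def time_wait_from_msl(msl_seconds):
    """RFC 1122 defines TIME_WAIT as 2 times the MSL."""
    return 2 * msl_seconds


def meets_de_facto_minimum(time_wait_seconds):
    """Assumed check: is the configured TIME_WAIT at least 60 seconds?"""
    return time_wait_seconds >= DE_FACTO_MIN_TIME_WAIT_SECONDS
```

With the MSL suggested by RFC 793, `time_wait_from_msl(RFC793_SUGGESTED_MSL_SECONDS)` gives 240 seconds, four times the de facto lower limit.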
To make an official SPECmail2001 test run, the benchmarker must perform the following steps:
Benchmark specific optimization is not allowed. Any optimization of either the configuration or software used on the SUT must improve performance for a larger class of workloads than that defined by this benchmark and must be supported and recommended by the provider. Optimizations that take advantage of the benchmark's specific features are forbidden. Examples of inappropriate optimization include, but are not limited to, taking advantage of specially formed test user account names, the fixed set of message sizes in the workload, or the workload's mailbox sizes.
The provided SPECmail2001 tools must be used to run and produce measured SPECmail2001 results. The SPECmail2001 metric is a function of the SPECmail2001 workload, the associated mail store and the defined Quality of Service criteria. SPECmail2001 results are not comparable to any other mail server performance metric.
SPECmail2001 expresses performance in terms of SPECmail2001 Messages per Minute (MPM). The benchmarker specifies the number of users for which the benchmark tools will generate a workload. The load generators will generate a mix of SMTP and POP3 transactions that are presented to the mail server such that 1 MPM is representative of the load expected during the peak hour for 200 POP consumer users. In addition to the MPM metric, the benchmark will also report the configured number of SPECmail2001 users.
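As a rough sizing aid, the stated ratio of 200 users per MPM can be expressed as a pair of conversions. This is a sketch with hypothetical names; it gives only the nominal relationship, and the values it produces are planning estimates that must not be reported as SPECmail2001 results.

```python
# Ratio stated in the rules: 1 MPM represents the peak-hour load of 200 users.
USERS_PER_MPM = 200


def nominal_users(mpm):
    """Number of users whose peak-hour load a given MPM figure represents."""
    return USERS_PER_MPM * mpm


def nominal_mpm(users):
    """Nominal MPM corresponding to a configured number of users."""
    return users / USERS_PER_MPM
```

For example, configuring 10,000 users corresponds to a nominal load of 50 MPM at the 100% load level.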
SPECmail2001 requires that for each SMTP incoming message received by the SUT, the SUT must also handle a selection of POP transactions. The POP transactions include AUTH, STAT, RETR, and DELE. SMTP incoming messages that are not intended for local users are relayed as outgoing SMTP messages. The workload parameters required for a valid run are contained in the default workload parameter file supplied with the benchmark. A detailed explanation of the workload is included in the SPECmail2001 Architecture White Paper.
It is the responsibility of the benchmarker to ensure that the messages that make up the mail store are placed on the SUT so that they can be accessed properly by the benchmark. These messages and only these messages shall be used as the target working set. The benchmark performs internal validations to verify the expected results. No modification or bypassing of this validation is allowed.
The benchmark determines the initial working set size for the test based on a function of the number of POP3 users specified for the test, the message size distribution, and mailbox size distribution. An estimate of the raw byte count for the working set can be calculated as follows:
The actual size of the mail store and the amount of disk space to contain it will be a function of the mail server products in use and any additional storage overhead needed or configured. It is recommended that an additional 10% of storage space be available to accommodate the fluctuations in the workload.
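The 10% recommendation can be turned into a simple capacity estimate. In this sketch the function name is hypothetical, and the headroom is assumed to be 10% of the measured on-disk size of the populated mail store.

```python
def recommended_capacity_bytes(mail_store_bytes):
    """Mail store size plus 10% headroom, rounded up to a whole byte."""
    headroom = -(-mail_store_bytes // 10)  # ceiling of mail_store_bytes / 10
    return mail_store_bytes + headroom
```

A mail store that occupies 50 GB after initial population would, under this assumption, call for at least 55 GB of available storage.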
The benchmarker is responsible for configuring the SUT with the corresponding number of user accounts and mailboxes required for the test. The benchmark suite provides tools for the initial population of the mail store.
Since the working set is not static and changes over the course of the test as messages are added or deleted, it is allowable for the benchmarker to capture the mail store image after the tools have created the initial population (see section 2.7).
The SPECmail2001 benchmark has specific Quality of Service (QoS) criteria for response times, delivery times and error rates. The QoS criteria are checked by the benchmark tools.
According to the POP3 RFC 1939:
A POP3 server MAY have an inactivity autologout timer. Such a timer MUST be of at least 10 minutes duration. The receipt of any command from the client during that interval should suffice to reset the autologout timer. When the timer expires, the session does NOT enter the UPDATE state--the server should close the TCP connection without removing any messages or sending any response to the client.
If the mail server includes an inactivity autologout timer, it must be set to at least 10 minutes. It is recommended that the timer not be set to longer than 10 minutes as this could cause a slight increase in POP3 lock conflicts particularly at the 120% load level.
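The timer rule and the accompanying recommendation can be summarized in a small check. The function name and the returned strings are hypothetical; only the 10-minute threshold comes from the text above.

```python
# RFC 1939 minimum for a POP3 inactivity autologout timer.
MIN_AUTOLOGOUT_SECONDS = 10 * 60


def check_autologout(timer_seconds):
    """Classify a POP3 autologout setting; None means no timer is configured."""
    if timer_seconds is None:
        return "ok: no autologout timer"
    if timer_seconds < MIN_AUTOLOGOUT_SECONDS:
        return "invalid: timer shorter than 10 minutes"
    if timer_seconds > MIN_AUTOLOGOUT_SECONDS:
        return "ok: but longer than the recommended 10 minutes"
    return "ok"
```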
The SPECmail2001 benchmark requires the use of one or more client systems. One client system is designated the prime client and will run the benchmark manager. One or more client systems act as load generators. One client system is designated as the smtpsink to handle the mail to remote addresses. Please refer to the User Guide for more detail on these roles.
A server component of the SUT must not be used as a load generator or a smtpsink when testing to produce valid SPECmail2001 results. A server component may be used as the prime client, but this is not recommended.
The client systems must have a Java Runtime Environment (JRE) version 1.1.8 or higher installed in order to run the benchmark tools.
The SPECmail2001 benchmark provides two parameter files that contain the testbed configuration and workload parameters. The file SPECmail_config.rc contains the testbed (clients and SUT) configuration information that appears in the final report and must be modified to contain the site-specific information.
The file SPECmail_fixed.rc contains the default workload parameters used to produce a compliant test result. This file must not be altered. Modifying the SPECmail_fixed.rc will not prevent the benchmark from running, but the results generated using the modified SPECmail_fixed.rc file will always be marked non-compliant.
To help ensure that the content of the parameter files is correct and can be used to produce a compliant test run, benchmarkers are encouraged to invoke the java specmail command with the -compliant switch. Then if there are problems in the rc files, the benchmark will generate appropriate warning messages and immediately discontinue the test.
The SPECmail2001 User Guide provides detailed documentation on the parameters in the SPECmail_config.rc and SPECmail_fixed.rc files.
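Because SPECmail_fixed.rc must not be altered, a benchmarker may want an independent sanity check before starting a run. The sketch below compares the working copy against an untouched copy from the kit. It is not part of the SPEC tools, the function names are hypothetical, and the benchmark's own compliance checks remain authoritative.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def fixed_rc_is_unmodified(working_copy, pristine_copy):
    """True if the working SPECmail_fixed.rc matches the pristine kit copy."""
    return sha256_of(working_copy) == sha256_of(pristine_copy)
```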
In order to publicly disclose SPECmail2001 results, the benchmarker must adhere to these reporting rules in addition to having followed the run rules above. The goal of the reporting rules is to ensure the SUT and testbed are sufficiently documented such that someone could understand the results and reproduce the test.
The benchmark's single figure of merit, SPECmail2001 messages per minute, is the throughput measured during the run at the 100% load level. A complete benchmark result is comprised of three separate measurements for the 80%, 100%, and 120% load levels, shown on the results reporting page. A detailed breakdown of each test is included on the reporting page.
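The relationship between the three measurements and the single figure of merit can be sketched as follows. The data layout (a mapping from load level in percent to measured throughput) and the function name are assumptions for illustration and do not reflect the format used by the SPEC tools.

```python
# Load levels, in percent, that a complete result must cover.
REQUIRED_LOAD_LEVELS = (80, 100, 120)


def headline_mpm(throughput_by_load_level):
    """Return the 100% load level throughput if all three levels are present."""
    missing = [lvl for lvl in REQUIRED_LOAD_LEVELS
               if lvl not in throughput_by_load_level]
    if missing:
        raise ValueError(f"incomplete result, missing load levels: {missing}")
    return throughput_by_load_level[100]
```

A result with only the 100% measurement is rejected by this sketch, mirroring the requirement that a complete result includes all three load levels.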
The report of results for the SPECmail2001 benchmark is generated in HTML by the provided SPEC tools. These tools may not be changed, except for portability reasons with prior SPEC approval. The tools perform error checking and will flag some error conditions resulting in an "invalid result". However, these automatic checks are only there for debugging convenience and do not relieve the benchmarker of the responsibility to check the results and follow the run and reporting rules.
The section of the output.raw file that contains actual test measurements must not be altered. Corrections to the SUT descriptions may be made as needed to produce a properly documented disclosure.
Any SPECmail2001 result produced in compliance with these run and reporting rules may be publicly disclosed and represented as a valid SPECmail2001 result.
Any test result not in full compliance with the run and reporting rules must not be represented using the SPECmail2001 metric name.
The metric SPECmail2001 messages per minute must not be associated with any estimated results. This includes adding, multiplying or dividing measured results to create a derived metric.
When competitive comparisons are made using SPECmail2001 benchmark results available from the SPEC web site, SPEC requires that the following template be used:
SPECmail2001 is a trademark of the Standard Performance Evaluation Corp. (SPEC). Competitive numbers shown reflect results published on www.spec.org from <date> to <date>. [The comparison presented is based on <basis for comparison>.] For the latest SPECmail2001 results visit http://www.spec.org/osg/mail2001.
Notes:
Example:
SPECmail2001 is a trademark of the Standard Performance Evaluation Corp. (SPEC). Competitive numbers shown reflect results published on www.spec.org from Jan 12 to Mar 31, 2001. The comparison presented is based on best performing 4-cpu servers currently shipping by Vendor 1, Vendor 2 and Vendor 3. For the latest SPECmail2001 results visit http://www.spec.org/osg/mail2001.
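Filling in the template can be automated to avoid wording drift. The following is a minimal sketch with a hypothetical function name; the dates are passed as already-formatted strings, and the bracketed basis sentence is treated as optional.

```python
def comparison_notice(start, end, basis=None):
    """Build the notice required for competitive comparisons."""
    parts = [
        "SPECmail2001 is a trademark of the Standard Performance "
        "Evaluation Corp. (SPEC).",
        "Competitive numbers shown reflect results published on "
        f"www.spec.org from {start} to {end}.",
    ]
    if basis is not None:
        # Optional sentence, shown in square brackets in the template.
        parts.append(f"The comparison presented is based on {basis}.")
    parts.append(
        "For the latest SPECmail2001 results visit "
        "http://www.spec.org/osg/mail2001."
    )
    return " ".join(parts)
```

Calling `comparison_notice("Jan 12", "Mar 31, 2001", "best performing 4-cpu servers currently shipping by Vendor 1, Vendor 2 and Vendor 3")` reproduces the example above.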
The rationale for the template is to provide fair comparisons by ensuring that:
SPEC encourages use of the SPECmail2001 benchmark in academic and research environments. It is understood that experiments in such environments may be conducted in a less formal fashion than that required of licensees submitting to the SPEC web site or otherwise disclosing valid SPECmail2001 results.
For example, a research environment may use early prototype hardware that simply cannot be expected to stay up for the length of time required to run the entire benchmark, or may use research software that is unsupported and not generally available. Nevertheless, SPEC encourages researchers to obey as many of the run rules as practical, even for informal research. SPEC suggests that following the rules will improve the clarity, reproducibility, and comparability of research results. Where the rules cannot be followed, SPEC requires the results be clearly distinguished from fully compliant results such as those officially submitted to SPEC, by disclosing the deviations from the rules and avoiding the use of the SPECmail2001 metric name.
The system configuration information that is required to duplicate published performance results must be reported. This list is not intended to be all-inclusive, nor is each performance neutral feature in the list required to be described. The rule is: If it affects performance or the feature is required to duplicate the results, then it must be described.
Any deviations from the standard default configuration for the SUT must be documented, so an independent party would be able to reproduce the result without further assistance.
For most of the following configuration details, there is an entry in the configuration file, and a corresponding entry in the tool-generated HTML result page. If information needs to be included that does not fit into these entries, the Notes sections must be used.
The following SUT hardware components must be reported:
The following SUT software components must be reported:
A brief description of the network configuration used to achieve the benchmark result is required. The minimum information to be supplied is:
The following client system properties must be reported:
A configuration diagram of the SUT must be provided in a common graphics format (e.g. PNG, JPEG, GIF). This will be included in the HTML formatted results page. An example would be a line drawing that provides a pictorial representation of the SUT including the network connections between clients, server nodes, switches and the storage hierarchy and any other complexities of the SUT that can best be described graphically.
The dates of general customer availability must be listed for the major components: hardware, mail server software, and operating system, by month and year. All the system, hardware and software features are required to be available within three months of the first publication of the result. The overall hardware availability date must be the latest of the hardware availability dates. The overall software availability date must be the latest of the software availability dates.
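Deriving the overall availability dates is a matter of taking the latest date in each group. The sketch below assumes dates are held as (year, month) pairs, matching the month-and-year granularity required above; the names and the output format are hypothetical.

```python
MONTH_NAMES = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def overall_availability(component_dates):
    """Return the latest (year, month) among the given component dates.

    component_dates maps a component name to a (year, month) tuple.
    """
    return max(component_dates.values())


def format_month_year(year_month):
    """Render a (year, month) tuple as, for example, 'May-2001'."""
    year, month = year_month
    return f"{MONTH_NAMES[month - 1]}-{year}"
```

The same function applies separately to the hardware group and to the software group (mail server software and operating system).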
If pre-release hardware or software is used, then the test sponsor represents that the performance measured is the performance to be expected on the same configuration of the release system. If the test sponsor later finds the performance has dropped by more than 5% of that reported for the pre-release system, then the test sponsor must resubmit a corrected test result.
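The 5% threshold can be checked with integer arithmetic to avoid rounding surprises at the boundary. This is a sketch with a hypothetical function name; it assumes both values are throughput figures for the same configuration.

```python
def must_resubmit(reported_mpm, release_mpm):
    """True if release performance dropped by more than 5% of the reported value."""
    drop = reported_mpm - release_mpm
    # Equivalent to drop > 0.05 * reported_mpm, without a float constant.
    return drop * 100 > 5 * reported_mpm
```

A drop of exactly 5% (for example, from 1000 to 950) does not trigger resubmission under this reading, since the rule says "more than 5%".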
For additional information on general availability requirements, please refer to section 2.2 above.
The reporting page must list:
The Notes section is used to document information such as:
The following additional information must be provided if requested for SPEC's results review:
In order to minimize disk space requirements, the submitter is only required to keep the section of the log that covers the 100% load level phase. However, having the log files available in their entirety during the review is preferred.
Once the test sponsor has a compliant run and wishes to submit it to SPEC for review, they will need to provide the following:
Once the submission is ready, please e-mail it to submail2001@spec.org.
Retain the following for possible request during the review:
SPEC encourages the submission of results for review by the relevant subcommittee and subsequent publication on SPEC's web site. Vendors may publish compliant results independently; however, any SPEC member may request a full disclosure report for that result and the test sponsor must comply within 10 business days. Issues raised concerning a result's compliance to the run and reporting rules will be taken up by the relevant subcommittee regardless of whether or not the result was formally submitted to SPEC.
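The 10-business-day response window can be estimated as follows. This sketch counts Monday through Friday only, starts counting on the day after the request, and ignores public holidays; all three choices are assumptions, as the rules do not define how business days are counted.

```python
from datetime import date, timedelta


def add_business_days(start, days):
    """Return the date that falls the given number of business days after start."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current
```

Under these assumptions, a request received on Monday, 4 June 2001 would be due by Monday, 18 June 2001.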
SPEC provides client driver software, which includes the tools for running the benchmark and reporting its results. This software includes a number of checks for conformance with these run and reporting rules.
The client driver software is provided as Java bytecode. SPEC also includes the Java source in the distribution. Only the supplied Java bytecode may be used to produce publishable SPECmail2001 results. SPEC requires the user to provide any other software needed to run the benchmark, e.g. OS and JRE.
The kit also includes the SPECmail_config.rc and SPECmail_fixed.rc files described above and a copy of the benchmark documentation (User Guide, Architecture White Paper, FAQ, and Run and Reporting Rules).
Licensees will be notified of any significant updates to the benchmark tools or documentation. Updated versions of the documentation will be available at http://www.spec.org.
Copyright © 2001 Standard Performance Evaluation Corporation