SPECmail2001 Mail Server Benchmark

Contents

1. Introduction
   1.1. Overview
   1.2. Organization of this Paper
   1.3. Related Documents
   1.4. Run and Reporting Rules
2. Design Principles
   2.1. Requirements and Goals
   2.2. Excluded Goals
3. Benchmark Metric: SPECmail2001 Messages per Minute
4. Benchmark Workload
   4.1. Basis of Workload Definition
   4.2. Transaction Workload
      4.2.1. User Profile
         4.2.1.1. Message Size (MSG_SIZE_DISTRIBUTION)
         4.2.1.2. Recipient Count (MSG_RECP_DISTRIBUTION)
      4.2.2. Transactions and Parameters per POP/SMTP Session
         4.2.2.1. SMTP Session
         4.2.2.2. POP3 Session
      4.2.3. Transaction Workload Calculation
   4.3. Pre-population of Mail Store
      4.3.1. Pre-populated Mail Store Calculation
      4.3.2. Analytical Background
   4.4. Benchmark Metric Calculation
   4.5. Mail Server Housekeeping Functions
5. Detail Aspects of Workload Simulation
   5.1. Arrival Rate and Interarrival Time Distribution
      5.1.1. Arrival Rate
      5.1.2. Interarrival Time
   5.2. Traffic to/from Remote Mail Servers
   5.3. POP Retries
6. Quality of Service
   6.1. Response Times
   6.2. Delivery Times
   6.3. Remote Delivery Times
   6.4. Error Rates
   6.5. Overall QoS Requirement
7. Performance Characterization
8. Components of a Benchmark Setup
   8.1. System under Test
   8.2. Benchmark Test Client Systems
9. Future Plans
Appendix: Questions and Answers
1. Introduction

1.1. Overview

SPECmail2001 is a software benchmark designed to measure a system's ability to act as a mail server servicing email requests, based on the Internet standard protocols SMTP and POP3. The first version concentrates on the Internet Service Provider (ISP) class of mail servers, with an overall user count in the range of 10,000 to 1,000,000 users. It models POP consumer users. A future version will address business users and the IMAP4 protocol.
SPECmail2001 has been developed by the Standard Performance Evaluation Corporation (SPEC), a non-profit group of computer vendors, system integrators, universities, research organizations, publishers, and consultants.
This paper discusses the benchmark principles and architecture, and the rationale behind the key design decisions. It also outlines the workload used in the benchmark, and the general steps needed to run a benchmark. However, those aspects are covered in more detail in other documents.
1.2. Organization of this Paper

Chapter 2 discusses the basic goals and non-goals of the benchmark.
Chapter 3 introduces the performance metric of SPECmail2001 - messages per minute - and how it relates to the transaction mix imposed on the system under test.
Chapter 4 explains the benchmark workload - how it was derived, how it translates into configuration parameters for the benchmark tool and size calculations for planning a benchmark, and how it relates to the benchmark metric.
Chapter 5 discusses some detailed aspects of the workload generation, namely the exact workload put on the server, and how the benchmark simulates communication with remote mail servers.
Chapter 6 defines the quality of service requirements of this benchmark.
Chapter 7 presents the 3-point performance characterization provided by this benchmark - showing server behavior under target load (100%), as well as under lower load (80%) and overload (120%).
Chapter 8 sketches the hardware and software elements used in a SPECmail2001 benchmark.
Chapter 9 concludes with an outlook into future plans.
An appendix with questions and answers is provided at the end of the paper, reflecting discussions that the workgroup has had in the design and implementation of this benchmark. This material was deemed relevant enough to be captured in this white paper; yet it was eliminated from the main body so as to streamline the presentation of what the benchmark is (and not what it is not). Throughout the preceding chapters, there are links provided into the Q&A section, to indicate how the "off-topic" discussions relate to aspects of the benchmark.
1.3. Related Documents

All documents can be obtained from the mail server home page.
1.4. Run and Reporting Rules

The Run and Reporting Rules for the SPECmail2001 benchmark are spelled out in a separate document. They ensure execution of the benchmark in a controlled environment. The goal is repeatability by third parties with reasonable resources and knowledge. The rules maximize the comparability of results, leveling the playing field as much as possible. They also define which information needs to be included in published results, and which supporting information needs to be submitted to the SPEC community for potential review.
Under the terms of the SPECmail2001 license, SPECmail2001 results may not be publicly reported unless they are run in compliance with the Run and Reporting Rules. Results published at the SPEC web site have been reviewed and approved by the SPECmail committee. For more information on publishing results at the SPEC web site, please send e-mail to: info@spec.org. The Run and Reporting Rules may be found on the SPEC web site; they are also part of the SPECmail2001 distribution kit.
2. Design Principles

2.1. Requirements and Goals

SPECmail2001 is a benchmark testing the capacity of a system acting as a mail server, serving requests according to the Internet standard mail protocols SMTP (RFC821) and POP3 (RFC1939). SMTP is the standard for sending email from clients to servers, and between servers. POP3 is a protocol for mail access and retrieval by the user.
SPECmail2001's model for a POP user is a consumer customer of an ISP. The details of a consumer-type user's behavior will be discussed in a later chapter in this paper. Mail server user behavior varies greatly. It is therefore very important to define the type of assumed user, before discussing how many users a mail server can handle.
SPECmail2001 simulates the load which would be generated by a certain user population, and observes the mail server behavior under that load. It enforces the adherence to a required level of quality of service. The goal is to simulate realistic mail server operation, to maximize the usefulness of the benchmark results as guidelines for actual sizing decisions.
The key goal of SPECmail2001 is to show mail server performance in a realistic context.
3. Benchmark Metric: SPECmail2001 Messages per Minute

Capacity at acceptable quality of service (QoS) is the basis of the performance metric of this benchmark. Acceptable QoS is measured by interactive response time of the mail server to each protocol step (see chapter 6). This metric is therefore
SPECmail2001 messages per minute
One SPECmail2001 message per minute includes a mix of transactions that a mail server needs to perform on a single message, from first receiving it to finally deleting it. This transaction mix depends on the workload, which will be defined later on in this document. For the following workload
- each user sends two messages per day, to two recipients
- each user receives four messages per day
- each user checks his/her mailbox four times a day
- all messages get retrieved and deleted
the transaction mix for 1 SPECmail2001 message per minute consists
of the following elementary mail server operations:
Transaction Type                         | Transaction Count per Minute
-----------------------------------------+-----------------------------
message sent by user                     | 1
message sent to remote mail server       | 0.9
message received from remote mail server | 0.9
mailbox check                            | 2
retrieval and deletion of message        | 2
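To make the scaling concrete, here is a minimal sketch (hypothetical code, not part of the benchmark tool; the mix values are taken from the table above) that converts a rating into per-minute transaction counts:

```python
# Transaction counts per 1 SPECmail2001 message per minute, from the table above.
TRANSACTION_MIX = {
    "message sent by user": 1.0,
    "message sent to remote mail server": 0.9,
    "message received from remote mail server": 0.9,
    "mailbox check": 2.0,
    "retrieval and deletion of message": 2.0,
}

def transactions_per_minute(rating):
    """Per-minute transaction counts implied by a rating in messages/minute."""
    return {name: count * rating for name, count in TRANSACTION_MIX.items()}

# Example: a rating of 5,000 messages per minute (see section 4.4: 1M users).
for name, rate in transactions_per_minute(5000).items():
    print(f"{name}: {rate:,.0f} per minute")
```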
This metric cannot be compared to any other similarly named benchmark metric
which does not follow exactly this workload definition and the same execution
rules. Every aspect of either may affect the benchmark outcome.
See also Q&A section: Why isn't the benchmark metric "Number of Users"?
See also Q&A section: Why isn't the benchmark metric "Parallel Sessions"?
4. Benchmark Workload

The SPECmail2001 workload has two parts: the pre-population of the mail store and the transaction workload during runtime. Both depend on the targeted benchmark rating.
It may be helpful at this point to list the basic steps of running a SPECmail2001 benchmark.
The definition of a pre-population of the mail server ensures that a system claiming to handle the transaction load for 1 million users is also capable of handling the mail data related to 1 million users. The pre-population consists of a message distribution across all mailboxes. The runtime load defines the number and kind of transactions that will be issued to the server during the benchmark run.
The SPECmail2001 benchmark is designed so that it generates a steady mail server state: over time, it adds as many messages to the mail store as it removes. Note, however, that insertion and deletion happen on mailboxes which are independently (and randomly) selected - i.e. "steady state" does not mean that the contents of the mail store, or any part of it, will be static. Only the overall volume should fluctuate, within certain limits. This behavior is not quite realistic - a production mail server will "breathe" over time, and it may actually grow, due to increasing user demands. However, this behavior is deemed close enough to reality, and it greatly simplifies running a series of benchmarks on the same mail store setup.
A stable mail store's contents can be determined fully by the transaction workload - we'll therefore discuss the transaction workload first, and determine the resulting mail store pre-population after that.
4.1. Basis of Workload Definition

The workload profile has been determined based on observing actual systems in production. At the time of this writing, it appears difficult, if not impossible, to provide the background data of this workload definition for public review, due to business considerations of the companies involved. The workload definition is therefore based on a two-pronged approach:
The workload definition is believed to be currently realistic for the indicated user type, consumer POP user. The average user profile is, however, a moving target, affected by changing user behavior, evolving technology, etc.
Note: The SPEC consortium is interested in extending its base of field data for this workload characterization. If you are an ISP willing to share traffic data with SPEC, then please contact the work group.
4.2. Transaction Workload

The Transaction Workload defines the number and type of transactions issued against the system under test during the measurement phase of the benchmark. It scales linearly with the number of users and with the SPECmail2001 messages per minute performance metric. Its parameters are:
4.2.1. User Profile

The definition of the transaction workload starts with an assessment of the per-user, per-day load profile. The following tables show assumptions for that profile, as well as the semantics of the elements in that profile. A third column shows the related configuration parameter of the benchmark tool - names in ALL_UPPER_CASE are actual configuration parameters of the benchmark tool, names using UpperAndLower case are inserted only for editorial purposes.
Parameter                                         | Workgroup Estimate | Related Benchmark Tool Parameters
--------------------------------------------------+--------------------+--------------------------------------------------
number of users (defined by test sponsor)         | example: 1 million | USER_START, USER_END;
                                                  |                    | UserCount := USER_END - USER_START + 1
messages sent per day                             | 2                  | MSG_SENT_PER_DAY
average recipients per message                    | 2                  | MSG_RECP_DISTRIBUTION
messages received per day                         | 4                  | (MsgsRcvd)
mailbox checks per day                            | 4                  | POP_CHECKS_PER_DAY
% of mailbox checks which don't download messages | 75%                | (REPEATED_POP_CHECKS_PER_DAY)
average message size in KB                        | 25                 | MsgSz, MSG_SIZE_DISTRIBUTION
% of users using modems (56Kbit)                  | 90%                | DATA_RATE_DISTRIBUTION
% of daily activities during busiest hour         | 15%                | PEAK_LOAD_PERCENT
% of mail to/from remote addresses                | 90%                | MSG_DESTINATION_LOCAL_PERCENT
Explanation of Parameters:

UserCount | This is not a predefined load parameter; rather, it is the user-defined goal of a benchmark run. It scales linearly with the benchmark metric.
MSG_SENT_PER_DAY | The number of messages sent per day per user.
MSG_RECP_DISTRIBUTION | Number of recipients per message: every SMTP message can be directed to multiple users. This parameter affects how many messages are actually found in the mail store.
MsgsRcvd | Number of messages received, per user and day. While this is a favorite parameter to monitor, it is not really an active part of the user behavior: every user will receive whatever is in the mailbox. In the case of SPECmail2001, it is assumed that the amount of sent and received messages is identical.
POP_CHECKS_PER_DAY | Number of mailbox checks per day and user.
REPEATED_POP_CHECKS_PER_DAY | Number of "repeated" mailbox checks per day and user. Workload studies regularly show that POP email users tend to generate a high percentage of POP sessions which operate on previously emptied mailboxes and hence do not download any messages. The workgroup discussed the fraction of these (i.e. PctEmpty), while the benchmark tool implements the actual count of each type. REPEATED_POP_CHECKS_PER_DAY must be less than POP_CHECKS_PER_DAY to make sense.
MSG_SIZE_DISTRIBUTION | Average message size. It should be noted that the actual size of messages varies greatly - the vast majority is very small, and a large part of this average size results from a minority of large messages. Accordingly, SPECmail2001 actually defines a mail size distribution, of which this is the average.
DATA_RATE_DISTRIBUTION | Fraction of users using slow modem connections - the rest is assumed to be connected via low-latency, high-bandwidth links. The main effect of slow connections is to stretch mail sessions in length, which in turn affects the dynamic memory footprint of the mail server, as well as CPU usage efficiency.
PEAK_LOAD_PERCENT | SPECmail2001 simulates the peak hour of mail traffic. This factor is key to translating the per-day user profile into the per-second transaction load definition.
MSG_DESTINATION_LOCAL_PERCENT | Fraction of the mail traffic going to / coming in from the Internet. SPECmail2001 assumes the two are the same - overall, the mail server under test produces as much external mail as it consumes.
4.2.1.1. Message Size (MSG_SIZE_DISTRIBUTION)
The benchmark tool generates test messages on the fly. The size of each message is determined according to a distribution resulting from user data evaluation, which counted messages in buckets of certain sizes. The characteristic of the distribution is that the vast majority of messages are small, while there are also a few very large messages in the mix. The average message size is 24.5KB.
Message Size (KB) | Percentage in Mix
------------------+------------------
1                 | 10.20%
2                 | 30.71%
3                 | 20.59%
4                 | 10.78%
5                 |  5.88%
6                 |  3.74%
10                |  7.44%
100               |  9.79%
1,000             |  0.69%
2,675             |  0.18%
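As an illustration, the following sketch (hypothetical code, not the benchmark tool itself) samples message sizes from this distribution and verifies that its weighted mean reproduces the 24.5KB average quoted above:

```python
import random

# Message size distribution from the table above: (size in KB, percentage in mix).
MSG_SIZE_DISTRIBUTION = [
    (1, 10.20), (2, 30.71), (3, 20.59), (4, 10.78), (5, 5.88),
    (6, 3.74), (10, 7.44), (100, 9.79), (1000, 0.69), (2675, 0.18),
]
SIZES, WEIGHTS = zip(*MSG_SIZE_DISTRIBUTION)

def sample_message_size_kb(rng=random):
    """Draw one message size (KB) according to the distribution."""
    return rng.choices(SIZES, weights=WEIGHTS, k=1)[0]

# Weighted mean: sum(size * percent) / 100 -> 24.53 KB, matching the text.
mean_kb = sum(size * pct for size, pct in MSG_SIZE_DISTRIBUTION) / 100.0
print(f"average message size: {mean_kb:.2f} KB")
print(f"one sampled size: {sample_message_size_kb()} KB")
```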
4.2.1.2. Recipient Count (MSG_RECP_DISTRIBUTION)

Each message gets sent to one or more recipients. The vast majority of messages gets sent to a single recipient, while others get sent to up to 20 recipients. The distribution is controlled by the configuration parameter MSG_RECP_DISTRIBUTION. The fixed setting is given in the table below.
Recipient Count | Percentage
----------------+-----------
1               | 85.73%
2               |  1.66%
3               |  1.49%
4               |  1.34%
5               |  1.20%
6               |  1.08%
7               |  0.97%
8               |  0.88%
9               |  0.79%
10              |  0.71%
11              |  0.64%
12              |  0.58%
13              |  0.52%
14              |  0.47%
15              |  0.42%
16              |  0.38%
17              |  0.34%
18              |  0.31%
19              |  0.28%
20              |  0.21%
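A quick check (hypothetical code) confirms that the weighted mean of this distribution is exactly the 2 recipients per message assumed in the user profile:

```python
# Recipient count distribution from the table above (recipients -> percentage).
MSG_RECP_DISTRIBUTION = {
    1: 85.73, 2: 1.66, 3: 1.49, 4: 1.34, 5: 1.20,
    6: 1.08, 7: 0.97, 8: 0.88, 9: 0.79, 10: 0.71,
    11: 0.64, 12: 0.58, 13: 0.52, 14: 0.47, 15: 0.42,
    16: 0.38, 17: 0.34, 18: 0.31, 19: 0.28, 20: 0.21,
}

# Weighted mean: sum(count * percent) / 100 -> 2.00 recipients per message.
mean_recipients = sum(n * pct for n, pct in MSG_RECP_DISTRIBUTION.items()) / 100.0
print(f"average recipients per message: {mean_recipients:.2f}")
```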
4.2.2. Transactions and Parameters per POP/SMTP Session

4.2.2.1. SMTP Session

An SMTP session connects to the server, receives the banner screen, transmits sender, recipients and contents of a message, and disconnects. There are two types of SMTP sessions, and there is a slight difference in how the configuration parameters affect each one.
For an SMTP session simulating a message sent by a local user, the configuration parameters have the following influence.
For an SMTP session simulating incoming traffic from another mail server, the configuration parameters are applied as follows:
The parameters which determine how many SMTP sessions of these types are started each second are UserCount, MSG_SENT_PER_DAY, PEAK_LOAD_PERCENT and MSG_DESTINATION_LOCAL_PERCENT:

SmtpArrivalsFromUsersPerSec = UserCount * MSG_SENT_PER_DAY * PEAK_LOAD_PERCENT / 3600
For the handling of traffic to/from remote mail servers, i.e. the role of the parameter MSG_DESTINATION_LOCAL_PERCENT in this context, please refer to section 5.2.
4.2.2.2. POP3 Session

A POP session establishes a connection, receives the banner screen, authenticates as a local user, checks a mailbox, downloads and deletes all messages found, and disconnects. The benchmark intentionally generates two types of POP sessions (see 5.3 - POP Retries), but the sequence of activities is the same in each case.
The following parameters affect the simulation of a POP session
How many messages are downloaded and deleted, as well as the size distribution of these messages, is determined by what is found in the mail store, which in turn is a result of earlier message-sending activity.
The parameters which determine how many POP sessions are started each second are UserCount, POP_CHECKS_PER_DAY, REPEATED_POP_CHECKS_PER_DAY, and PEAK_LOAD_PERCENT:

PopArrivalsPerSec = UserCount * POP_CHECKS_PER_DAY * PEAK_LOAD_PERCENT / 3600
4.2.3. Transaction Workload Calculation

The workload definition discussed in section 4.2.2 leads to the following overall load definition. For convenience, we are including the related benchmark metric (SPECmail2001 messages per minute) in the first column - for the relationship between user count and benchmark metric, please refer to section 4.4.
[Table: overall load definition - SPECmail2001 messages per minute; user count; SMTP sessions from local users per second; SMTP sessions from remote servers per second; initial POP checks per second; repeated POP checks per second. All rates scale linearly with the user count; the sketch below shows how they are derived.]
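The rates themselves follow directly from the user profile parameters of section 4.2.1. Here is a minimal sketch (hypothetical code; the constants are the conforming workload values) that reproduces the per-second rates for a given user count:

```python
# Conforming workload values from section 4.2.1.
MSG_SENT_PER_DAY = 2
POP_CHECKS_PER_DAY = 4
REPEATED_POP_CHECKS_PER_DAY = 3      # 75% of all mailbox checks are repeats
PEAK_LOAD_PERCENT = 0.15             # 15% of daily activity in the busiest hour
REMOTE_FRACTION = 0.90               # fraction of mail to/from remote servers
SECONDS_PER_HOUR = 3600

def load_rates(user_count):
    """Per-second session rates during the simulated peak hour."""
    smtp_local = user_count * MSG_SENT_PER_DAY * PEAK_LOAD_PERCENT / SECONDS_PER_HOUR
    pop_total = user_count * POP_CHECKS_PER_DAY * PEAK_LOAD_PERCENT / SECONDS_PER_HOUR
    pop_repeat = pop_total * REPEATED_POP_CHECKS_PER_DAY / POP_CHECKS_PER_DAY
    return {
        "SPECmail2001 messages/minute": user_count / 200,
        "SMTP sessions from local users/sec": smtp_local,
        "SMTP sessions from remote servers/sec": REMOTE_FRACTION * smtp_local,
        "initial POP checks/sec": pop_total - pop_repeat,
        "repeated POP checks/sec": pop_repeat,
    }

# 1,000,000 users -> 83.3 local and 75.0 remote SMTP sessions/sec,
# 41.7 initial and 125.0 repeated POP checks/sec.
for users in (10_000, 100_000, 1_000_000):
    print(users, {k: round(v, 1) for k, v in load_rates(users).items()})
```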
4.3. Pre-population of Mail Store

The benchmark tool pre-populates the mail store before the first benchmark run. This initialization step uses SMTP sessions to deposit messages in the test accounts. The distribution is based on the workload parameters of the benchmark (see section 4.2). The benchmark performs some analytic computation (see section 4.3.2) to determine which distribution will be the steady state of the mail server under the defined load. The benchmark tool provides the initialization functionality as a separable step (-initonly), and performs a verification of the message distribution before each test run.
An important feature of the benchmark's workload definition is that the mail server, when exposed to repeated test runs, will enter into a steady state. While messages get added and removed, the overall count, as well as the distribution across test accounts, will fluctuate only statistically. Overall, the mail store will neither fill up and overflow (assuming safe sizing), nor will it empty out.
This allows the tester to run repeated benchmark runs back-to-back on the same setup without the need for re-initialization. It is also explicitly acceptable for the tester to backup and restore an initialized mail store.
The number of test accounts is a minimum requirement - the tester is free to establish a larger mail store with surplus test accounts, e.g. to experiment with supportable load levels, or to benchmark different hardware configurations on the same setup.
The rationale for these decisions is:
4.3.1. Pre-populated Mail Store Calculation

The transaction workload leads to a mail server steady state. The resulting distribution of messages over mailboxes has been computed as follows (please refer to the subsequent section for the analytical background):
Message Count | Percent | Message Count | Percent | Message Count | Percent
--------------+---------+---------------+---------+---------------+--------
0             | 26.50%  | 10            | 1.80%   | 20            | 0.16%
1             | 16.11%  | 11            | 1.42%   | 21            | 0.13%
2             | 12.25%  | 12            | 1.12%   | 22            | 0.10%
3             |  9.61%  | 13            | 0.88%   | 23            | 0.08%
4             |  7.56%  | 14            | 0.69%   | 24            | 0.06%
5             |  5.95%  | 15            | 0.54%   | 25            | 0.05%
6             |  4.68%  | 16            | 0.43%   | 26            | 0.04%
7             |  3.69%  | 17            | 0.34%   | 27            | 0.03%
8             |  2.90%  | 18            | 0.27%   | 28            | 0.02%
9             |  2.29%  | 19            | 0.21%   | 29            | 0.02%
This distribution reflects an average message count of 3.43 messages per mailbox - which is in line with expectations (PopMsgsRcvd).
The average message count per mailbox is a base figure for estimating the size of the mail store.
For a test setup for 1 million users, the total message count in the mail store (average over time) is

1,000,000 mailboxes * 3.43 messages = 3,430,000 messages

and the overall raw message data volume is

3,430,000 messages * 24.5 KB = approximately 84 GB
It is important to note that this is just the raw amount of message data to be expected, on average, in the mail store. "On average over time" means that the benchmark does not guarantee the volume will stay at exactly that level. Adding a margin of at least 25% is recommended before basing any further assumptions on these numbers.
Furthermore, there are further factors with potentially significant impact to be taken into account before determining a safe storage size for the intended benchmark. That is a server-specific planning process, therefore we can't give generally applicable sizing rules here. It is strongly recommended to take not just this overall raw size, but at least the file count (plus 25% headroom) and the message size distribution into the capacity planning process of the target mail server. Further aspects of the user profile may be relevant, too. The following factors may play a role:
Last but not least, disk configuration for a mail server is not strictly a size issue; performance (IO throughput) plays a significant role, too.
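As a rough aid, a sizing sketch under these caveats (hypothetical code; it covers only raw message data plus the 25% margin, none of the server-specific factors just mentioned) could look like this:

```python
# Averages derived in sections 4.2.1.1 and 4.3.1.
AVG_MSGS_PER_MAILBOX = 3.43    # steady-state average messages per mailbox
AVG_MSG_SIZE_KB = 24.5         # average message size
HEADROOM = 1.25                # at least 25% margin, as recommended above

def raw_store_estimate(user_count):
    """Message count and raw data volume (GB, 1 GB = 10^6 KB) with headroom."""
    messages = user_count * AVG_MSGS_PER_MAILBOX
    raw_gb = messages * AVG_MSG_SIZE_KB / 1_000_000
    return messages * HEADROOM, raw_gb * HEADROOM

messages, gigabytes = raw_store_estimate(1_000_000)
print(f"plan for ~{messages:,.0f} message files and ~{gigabytes:,.0f} GB raw data")
```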
4.3.2. Analytical Background

This section describes how the steady-state mail store definition has been determined. It provides purely the mathematical background and is not needed to understand and run the benchmark.
The benchmark separates mailboxes into two pools: a pool for initial POP checks and a pool for simulating repeated POP checks. The following calculation looks only into the pool for initial POP checks.
Let
PopRate be the rate at which a mailbox gets emptied out
SmtpRate be the rate at which additional messages hit a mailbox
Note that the time basis does not play a role here, as long as it is the same in both cases.
First, compute
p[0] = the percentage of mailboxes that is empty
(= the probability that a mailbox is empty)
If one observes p[0] along the time axis, then a "new" p[0]' can be computed from an "old" p[0] as follows:
p[0]' = the old percentage of empty mailboxes
MINUS the portion that received mail
PLUS the portion of previously non-empty mailboxes that got POP-checked
in mathematical terms
p[0]' = p[0] - p[0] * SmtpRate + (1 - p[0]) * PopRate
Since we assume the mail store has reached steady state, that probability does not change over time,
p[0]' = p[0]
and this becomes
p[0] = p[0] - p[0] * SmtpRate + (1 - p[0]) * PopRate
Subtract p[0] on both sides, eliminate the parenthesis:
0 = - p[0] * SmtpRate + PopRate - p[0] * PopRate
Isolate terms with p[0]:
p[0] * (SmtpRate + PopRate) = PopRate
Isolate p[0] by dividing by its factor (which is non-zero):
p[0] = PopRate / (SmtpRate + PopRate)
Now, compute
p[i] = the percentage of mailboxes with i messages (i > 0)
(= the probability that a mailbox contains i messages)
A "new" p[i]' can be computed from an "old" p[i] as follows:
p[i]' = the old percentage of mailboxes with i messages
MINUS the portion that received mail (and now has i+1 messages)
MINUS the portion that got POP-checked
PLUS the portion of mailboxes with i-1 messages that received mail
in mathematical terms
p[i]' = p[i] - p[i] * SmtpRate - p[i] * PopRate + p[i-1] * SmtpRate
Since we assume the mail store has reached steady state, that probability does not change over time,
p[i]' = p[i]
thus
p[i] = p[i] - p[i] * SmtpRate - p[i] * PopRate + p[i-1] * SmtpRate
Isolate terms with p[i]:
p[i] * (1 - 1 + SmtpRate + PopRate) = p[i-1] * SmtpRate
simplified:
p[i] * (SmtpRate + PopRate) = p[i-1] * SmtpRate
Dividing by the factor of p[i], which is non-zero:
p[i] = p[i-1] * SmtpRate / (SmtpRate + PopRate)
In order to apply these formulae to the SPECmail2001 workload parameters, we need to make a forward reference to section 5.3, where we'll discuss how the benchmark implements the repeated POP attempts against the same account. The repeated POP attempts introduce an asymmetry into an otherwise completely random process which could be nicely analyzed with a single application of the above rules.
Suffice it to say that the implementation of the POP Retries splits the set of test accounts into two parts: the retry pool and the random pool. Each pool follows the aforementioned scheme, but with different parameters.
For the random pool, we have:

percentage of accounts in random pool: 92.5%
SmtpRate: 0.925 * 4.00 = 3.70
PopRate: 1.00

and therefore:

p[0] = PopRate / (SmtpRate + PopRate)
     = 1.00 / (3.70 + 1.00)
     = 21.28%

p[1] = p[0] * SmtpRate / (SmtpRate + PopRate)
     = 0.21276 * 3.70 / (3.70 + 1.00)
     = 16.75%

p[2] = p[1] * 3.70 / 4.70
     = 13.19%

...
For the retry pool, we have:

percentage of accounts in retry pool: 7.5%
SmtpRate: 0.075 * 4.00 = 0.30
PopRate: 3.00

and therefore:

p[0] = PopRate / (SmtpRate + PopRate)
     = 3.00 / (0.30 + 3.00)
     = 90.91%

p[1] = p[0] * SmtpRate / (SmtpRate + PopRate)
     = 0.90909 * 0.30 / (0.30 + 3.00)
     = 8.26%

p[2] = p[1] * 0.30 / 3.30
     = 0.75%

...
In order to compute the combined distribution, these two series need to be combined into a weighted average (92.5% of the first series, 7.5% from the second series). The result is reflected in the table given at the beginning of section 4.3.1.
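The whole computation fits in a few lines. The following sketch (hypothetical code) reproduces the per-pool geometric series and their weighted combination, matching the table in section 4.3.1:

```python
def pool_distribution(smtp_rate, pop_rate, max_msgs=29):
    """Steady-state series: p[0] = Pop/(Smtp+Pop), p[i] = p[i-1]*Smtp/(Smtp+Pop)."""
    p = [pop_rate / (smtp_rate + pop_rate)]
    for _ in range(max_msgs):
        p.append(p[-1] * smtp_rate / (smtp_rate + pop_rate))
    return p

random_pool = pool_distribution(smtp_rate=3.70, pop_rate=1.00)
retry_pool = pool_distribution(smtp_rate=0.30, pop_rate=3.00)

# Weighted average: 92.5% random pool, 7.5% retry pool.
combined = [0.925 * r + 0.075 * t for r, t in zip(random_pool, retry_pool)]

for i, p in enumerate(combined[:4]):
    print(f"p[{i}] = {p:.2%}")   # 26.50%, 16.11%, 12.25%, 9.61% ...
```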
4.4. Benchmark Metric Calculation

The transaction mix resulting from the profile, relative to a single SMTP message sent by a POP consumer user, is:
Transaction Type                         | Transaction Count per Minute
-----------------------------------------+-----------------------------
message sent by user                     | 1
message sent to remote mail server       | 0.9
message received from remote mail server | 0.9
mailbox check                            | 2
retrieval and deletion of message        | 2
With the user profile given above, it was already shown that 1 million users
will generate 83.3 SMTP messages per second, which equals 5,000 SMTP messages
per minute. Since the transaction mix contains exactly 1 SMTP message, we have
the basic equation
5,000 SPECmail2001 messages per minute = 1,000,000 supported POP consumer users

which is equivalent to

1 SPECmail2001 message per minute = 200 supported POP consumer users
The following table lists a number of translations between the two quantities
(note that SPECmail2001 ratings are not limited to entries in this table.)
consumer users | messages per minute | consumer users | messages per minute | consumer users | messages per minute
---------------+---------------------+----------------+---------------------+----------------+---------------------
10,000         | 50                  | 100,000        | 500                 | 600,000        | 3,000
20,000         | 100                 | 200,000        | 1,000               | 800,000        | 4,000
50,000         | 250                 | 400,000        | 2,000               | 1,000,000      | 5,000
See also Q&A section: Does SPECmail2001 consider mailing lists?
See also Q&A section: Why does the SPECmail2001 benchmark emphasize putting as much mail into the mail server as it is taking out?
4.5. Mail Server Housekeeping Functions

Besides the actual workload, the system under test (SUT) is expected to run under normal operating conditions, assuming the production environment of a real-life ISP. The only required function identified to fall into this category is logging, basically:
5. Detail Aspects of Workload Simulation

This chapter discusses select details of how the workload is generated. The material presented here comes close to a description of the benchmark tool implementation, but it is closely related to the workload definition.
5.1. Arrival Rate and Interarrival Time Distribution

5.1.1. Arrival Rate

SPECmail2001 is designed to generate the workload as close to the real-world situation as possible. In the real world, the load results from many independent clients issuing requests, with no coordination with, or knowledge of, each other. A key aspect of this scenario is that even a high load on the mail server, which typically leads to slower responses, won't discourage new clients from showing up. Each of the many clients produces only a tiny fraction of the total load.
SPECmail2001 generates the workload at a defined average rate of requests per unit of time - the arrival rate. In order to understand the significance of this method, one needs to look at the alternative. A benchmark could also put load on the server "as fast as it will go". While that might lead to a faster determination of a peak rate, there are some disadvantages. Under such a load, the mail server could conveniently "throttle" the load dynamically - whenever things got critical, its response time would go up, and the load would therefore go down. It would also be more difficult to control a correct load mix - a server might serve more requests of the types which are convenient for it. Finally, response times might become unacceptable, and the reported rate therefore meaningless. SPECmail2001 avoids all of these disadvantages.
5.1.2. Interarrival Time

A second aspect of generating the workload is the interarrival time distribution. A standard assumption for this distribution, in a system with many independent arrivals, is the exponential distribution. A study of log data from a mail server showed that a Weibull distribution fitted the data better. For the reader's convenience, here are two graphs showing the shape of these distributions.
A distinguishing aspect of the Weibull distribution is that it shows a lower
probability for very short interarrival times. Since this distinction depends
a lot on how much the timestamps in the logs can be trusted, at the very
low end, discussion ensued whether the Weibull distribution was really the
correct answer here. The suspicion was expressed that artifacts of the conditions
of the workload study may have been at play here.
As a practical matter, and certainly a contributing factor in the discussion, the Weibull distribution is very difficult to implement correctly in a distributed load generator environment. The exponential distribution has the convenient feature that the superposition of several exponentially distributed event streams is once again exponentially distributed. The load can therefore be split up between N load generator systems, and each can do its work independently. The same is not true for the Weibull distribution: a correct implementation would need to pay special attention to the very small interarrival times, which distributed load generators could achieve only through near-perfect clock synchronization. This would have added another level of complexity to the execution of this benchmark: time synchronization, potentially across different platforms.
The workgroup therefore decided to accept the theory "exponential distribution in = Weibull distribution out" (meaning, if a mail server gets triggered with events according to an exponential distribution, it may very well log events with a Weibull distribution), and to base the benchmark on the exponential distribution.
See also Q&A section: Why does the benchmark emphasize generating requests at a certain arrival rate and pattern?
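To illustrate why the exponential distribution simplifies distributed load generation, here is a sketch (hypothetical code): the total rate is split across independent generators, each drawing exponentially distributed interarrival times, and the merged stream is again a Poisson stream at the full rate:

```python
import random

def arrival_times(rate_per_sec, duration_sec, rng):
    """Timestamps of a Poisson process: exponential interarrival times."""
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate_per_sec)
        if t >= duration_sec:
            return times
        times.append(t)

TOTAL_RATE = 158.3   # e.g. total SMTP sessions/sec for 1 million users
GENERATORS = 4       # independent load generator processes

# Each generator runs at 1/4 of the total rate, with no clock coordination.
streams = [arrival_times(TOTAL_RATE / GENERATORS, 60.0, random.Random(seed))
           for seed in range(GENERATORS)]
merged = sorted(t for stream in streams for t in stream)
print(f"{len(merged)} arrivals in 60s (expected ~{TOTAL_RATE * 60:.0f})")
```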
5.2. Traffic to/from Remote Mail Servers

SPECmail2001 assumes that 90% of the messages sent are going to external users (i.e. are forwarded to remote mail servers), and that 90% of the mail received by users originated from external users (i.e. is received from remote mail servers). The following diagram shows the real mail flow that would result from this assumption.
SPECmail2001 simulates this external mail exchange by
The effective mail flow looks as follows:
SPECmail2001 "integrates" the incoming mail from remote servers with the mail sent by local users. Accordingly, a few factors get adjusted, compared to what one would expect from the user profile:
This adjustment is done according to the following formulae. The variable names do not reflect tool configuration parameter names or tool-internal variable names; they have been chosen to be self-explanatory for the purpose of this discussion. Given
SmtpArrivalsFromUsersPerSec
SmtpFromUserToRemotePercentage
SmtpFromUserModemPercentage

then

SmtpTotalArrivalsPerSec =
    (1 + SmtpFromUserToRemotePercentage) * SmtpArrivalsFromUsersPerSec

and

SmtpEffectiveRemotePercentage =
    SmtpFromUserToRemotePercentage / (1 + SmtpFromUserToRemotePercentage)

and

SmtpEffectiveModemPercentage =
    SmtpFromUserModemPercentage / (1 + SmtpFromUserToRemotePercentage)
Using the same numbers as in the diagrams (83.3 SMTP messages per second from 1 million local users, 90% remote traffic, 90% modem users), here is an example:

SmtpTotalArrivalsPerSec = (1 + 0.90) * 83.3 = 158.3
SmtpEffectiveRemotePercentage = 0.90 / 1.90 = 47.4%
SmtpEffectiveModemPercentage = 0.90 / 1.90 = 47.4%

In this example case, the benchmark tool would therefore start 158.3 SMTP sessions per second, direct 47.4% of the generated messages to remote recipients, and apply modem bandwidth simulation to 47.4% of the sessions.
Note that all remote communication of the mail server is assumed to be at full network speed, hence the Mail Sink does not simulate any bandwidth limitation.
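A compact sketch of the adjustment (hypothetical code, plugging in the example numbers above):

```python
def adjust_for_remote_traffic(from_users_per_sec, to_remote_pct, modem_pct):
    """Fold simulated remote-server traffic into the SMTP arrival stream."""
    total = (1 + to_remote_pct) * from_users_per_sec
    effective_remote = to_remote_pct / (1 + to_remote_pct)
    effective_modem = modem_pct / (1 + to_remote_pct)
    return total, effective_remote, effective_modem

# 83.3 SMTP msgs/sec from 1M local users, 90% remote traffic, 90% modem users.
total, remote, modem = adjust_for_remote_traffic(83.3, 0.90, 0.90)
print(f"{total:.1f} SMTP sessions/sec, {remote:.1%} remote, {modem:.1%} modem")
# -> 158.3 SMTP sessions/sec, 47.4% remote, 47.4% modem
```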
5.3. POP Retries

POP users typically have set up their email client to perform periodic mailbox checks at a defined interval. From the point of view of the mail server, this generates a load on mailboxes which have recently been checked and which in all likelihood are empty. Overall, the vast majority of POP sessions is logged as "empty", i.e. no messages are downloaded.
The main aspects of this load are that the mailboxes are typically empty, and that they have recently been checked. Both aspects may affect the mail server's efficiency in handling requests on these mailboxes. The benchmark simulates this special load by setting aside a portion of the test accounts (the "retry pool") for use by frequent repeat checks, while performing exclusively "initial" POP checks against the remainder (the "random pool"). While not exactly in line with what happens in reality, this does implement the main aspects (typically empty accounts, recently checked), and it simplifies the implementation by avoiding complex state management of test accounts.
We mentioned potential benefits, i.e. aspects which make accessing these test accounts easier for the mail server. Caching of repeatedly used information is a good strategy; that is why the benchmark addresses this specific aspect of a POP3 workload.
However, there is also a potential downside to be taken into account here - according to the POP3 specification, the mail server may keep an account locked for a certain period of time (5 minutes). Therefore the benchmark needs to avoid overly aggressive repeated accesses to the same test account. The configuration parameter REPEATED_POP_CHECK_INTERVAL_SECONDS specifies the minimum time interval after which a mailbox may be checked again. The conforming setting of this parameter is 300 seconds. In practice, a re-check of a mailbox in the retry pool will happen between this time and twice that duration (under the conforming setting, between 300 and 600 seconds).
The size of the retry pool is determined so that the minimum re-check interval is safely observed AND so that the size of the pool is appropriate for all three load levels (80/100/120%). Changing the retry pool size would mean disturbing the steady state of the mail server, therefore it needs to be avoided. For the conforming parameter setting, 7.5% of the mailboxes will be assigned to the retry pool, leaving 92.5% for the mainline operation.
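One possible way to honor the minimum interval, shown as a hypothetical sketch (the actual tool implementation may differ): keep retry-pool mailboxes in a priority queue keyed by the earliest time they may be checked again:

```python
import heapq

REPEATED_POP_CHECK_INTERVAL_SECONDS = 300   # conforming setting

class RetryPool:
    """Retry-pool mailboxes become eligible again one interval after each check."""
    def __init__(self, mailboxes):
        self.heap = [(0.0, m) for m in sorted(mailboxes)]  # eligible immediately
        heapq.heapify(self.heap)

    def next_mailbox(self, now):
        """Return an eligible mailbox, or None if all were checked too recently."""
        if self.heap and self.heap[0][0] <= now:
            return heapq.heappop(self.heap)[1]
        return None

    def checked(self, mailbox, now):
        heapq.heappush(self.heap,
                       (now + REPEATED_POP_CHECK_INTERVAL_SECONDS, mailbox))
```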
On the SMTP side, there is no special consideration for the retry pool: messages are sent randomly to all test accounts, without regard to the random vs. retry pool split. While a Delivery Time check is in progress on a mailbox, that mailbox is protected from any unrelated POP activity, be it initial POPs or POP retries.
The mailboxes for the retry pool are being selected at random by the benchmark tool on each run, in a way that is compliant with the steady state.
If the tester changes the overall user count, in order to verify another load level, then the steady state of the mail server is maintained - since the different or additional set of mailboxes will contain its own portion of candidates for the retry pool.
6. Quality of Service

SPECmail2001 accepts only throughput claims which are proven while maintaining a minimal Quality of Service (QoS). There are two aspects to quality of service - the interactive response time of the mail server to each protocol step, as directly or indirectly observed by the user, and the delivery time that it takes to make a message available to a (local) recipient. The SPECmail2001 QoS standards are meant to support high-quality mail service, rather than promote barely acceptable service levels.
6.1. Response Times

Response times of the server, in reaction to each request, are measured by the benchmark tool, with an accuracy of approx. 0.1 seconds. The clock starts before sending the request, and it stops when the response, or the first packet thereof, is received.
Among the requests issued by SPECmail2001, there are two - SMTP-data and POP-retr - whose response time depends on the amount of data being transmitted. The response time criteria are therefore split into two groups.
For all requests other than SMTP-data and POP-retr, SPECmail2001 requires that, for each request type,
95% of all response times
must be under
5 seconds.
This may appear to be a very lax standard - the rationale was to capture complex as well as trivial mail server requests in one number.
For the SMTP-data and POP-retr requests, the size of the test message, as well as the artificial delays inserted for modem bandwidth simulation, need to be taken into account. The maximally allowed response time for these requests has two components.

The first component is a fixed 5 seconds, identical to the QoS requirement for the first group. The second component is computed based on the message size: an allowance for transferring the message data over the simulated modem link.

Note that, for simplicity of implementation, the allowance for modem bandwidth is granted whether or not the message is actually subject to modem delay simulation. The rationale is that a mail server which is slowing down due to overload will likely do so across the board.

The complete formula for the QoS cutoff time for this group is therefore

QoSCutoffTime = 5.000 sec + MessageSize / EffectiveTransferRate

and, again, 95% of all response times need to meet that condition.
Examples

The modem bandwidth simulates 56.6Kbaud; we'll assume in the following math that this equates to 6,000 bytes/sec, with the allowance granting twice the nominal transfer time. The formula therefore becomes

QoSCutoffTime = 5.000 sec + MessageSize / 3,000 bytes/sec
The following table illustrates how the constant component dominates the effective QoS cutoff time for small messages, and how it becomes less and less relevant for larger message sizes.

Message Size    | QoS Cutoff Time
----------------+-----------------------------
1,000 bytes     | 5.000 + 0.333 = 5.3 sec
10,000 bytes    | 5.000 + 3.333 = 8.3 sec
100,000 bytes   | 5.000 + 33.333 = 38.3 sec
1,000,000 bytes | 5.000 + 333.333 = 338.3 sec
2,000,000 bytes | 5.000 + 666.666 = 671.7 sec
The benchmark tool determines the allowed response time on the fly, on a message-by-message basis, and maintains passed/fail counts just like for other QoS criteria.
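A sketch of that bookkeeping (hypothetical code; the 3,000 bytes/sec divisor is the effective allowance used in the example table above):

```python
BASE_CUTOFF_SEC = 5.0
EFFECTIVE_BYTES_PER_SEC = 3_000

def qos_cutoff_sec(message_size_bytes):
    """Per-message QoS cutoff for SMTP-data and POP-retr responses."""
    return BASE_CUTOFF_SEC + message_size_bytes / EFFECTIVE_BYTES_PER_SEC

class QosCounter:
    """Passed/failed counts per request type; 95% must pass."""
    def __init__(self):
        self.passed = self.failed = 0

    def record(self, message_size_bytes, response_time_sec):
        if response_time_sec <= qos_cutoff_sec(message_size_bytes):
            self.passed += 1
        else:
            self.failed += 1

    def compliant(self):
        total = self.passed + self.failed
        return total > 0 and self.passed / total >= 0.95

print(f"{qos_cutoff_sec(100_000):.1f} sec")   # -> 38.3 sec, matching the table
```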
6.2. Delivery Times

Delivery time is defined as the time it takes from sending a message via SMTP to it becoming available to a user attempting retrieval. SPECmail2001 performs the check for availability to the receiving user via POP attempts. By adding this QoS requirement, the mail server is forced to actually deliver the mail to the target mailboxes in a reasonable time, rather than just spooling it and processing it later (e.g. after the benchmark is over).
SPECmail2001 requires that
95% of all messages to local users
get delivered to the target mailbox within
60 seconds.
Given that email service is traditionally more of a store-and-forward mechanism than a real-time application, this may appear to be a pretty strict standard. However, the motivation for this requirement was that email exchange is increasingly used as part of real-time interaction between users.
Implementation note: Probing for the message is the only portable way to find out when it gets delivered. However, this probing is overhead, and the resulting insert/remove patterns on the mail store are not very natural. Therefore, SPECmail2001 restricts the probing of submitted messages to a random sample of 1.00% of the overall mail traffic.
6.3. Remote Delivery Times

Remote Delivery is only checked for count - 95% of messages sent to a remote address need to be reported as received by the Mail Sink within the test phase.
There is no measurement or verification of Remote Delivery Time. This relaxed measurement method eliminates the need for a synchronized system time among the client systems. Relying just on count still ensures that most of the necessary work gets done by the mail server within the test timeframe and does not get pushed to the time after the benchmark.
The requirement is
95% of all messages to the remote mail server
are received by the remote server within the test timeframe.
6.4. Error Rates

Mail protocols and email clients are prepared to deal with error situations, like a temporarily unavailable mail server, quite gracefully; failed transactions often get transparently retried. The SPECmail2001 benchmark therefore allows

an error rate of up to 1%

in a valid benchmark run. It is counted as an error if the mail server returns an unexpected response, and also if it does not respond at all within 60 seconds.
6.5. Overall QoS Requirement

SPECmail2001 requires that a mail server passes these requirements in ALL individual categories, i.e. each protocol step by itself, as well as the delivery time evaluations, must meet the requirement. As an example, 94% compliance in one protocol step (a failure) cannot be made up by 98% compliance in another protocol step. The maximum error rate condition needs to be met in addition.
7. Performance Characterization

SPECmail2001 characterizes mail server performance by giving three data points - the target load (100%), a lower load (80%), and an overload (120%) - and the degree of adherence to the QoS requirements at each of these points.
Below are examples for how better and worse mail servers, with the same claimed
capacity (the 100% level), might fare in this characterization.
Better Server 80% |XXXXXXXXXXXXXXXXXXXXXXXXXXX
100% |XXXXXXXXXXXXXXXXXXXXXX
120% |XXXXXXXXXXXXXXXX
+-----------+----+----+----+
| | | |
85 90 95 100
Worse Server 80% |XXXXXXXXXXXXXXXXXXXXXX
100% |XXXXXXXXXXXXXXXXXXXXXX
120% |XXXXXXXXXXX
+-----------+----+----+----+
| | | |
85 90 95 100
Accordingly, a SPECmail2001 benchmark run consists of a series of warm-up, measurement and cool-down phases:
Only the transaction load is varied between these three points. The pre-populated mail store needs to match the 100% load level (larger is OK), and does not vary for the 80% or the 120% test case.
See also Q&A section: If the performance characterization shows full conformance of the server even at 120% - why not submit that level as the performance metric?
8. Components of a Benchmark Setup

8.1. System under Test

The mail server system under test is the collection of hardware and software that handles the requests issued by the benchmark test client systems, in all its aspects. Mail servers for large user communities may include a whole set (cluster or similar) of systems - the system under test therefore includes not only all processor, memory and disk hardware, but also all networking hardware, including any switches or routers needed on the access side or inside of the cluster.
8.2. Benchmark Test Client Systems

The benchmark test client systems are the systems running the components of the SPECmail2001 benchmark software. The software components are one primary client and one or more load generators. They can be run on one or more client systems. The software components work together to issue the benchmark load against the system under test while monitoring the server's response and delivery times.
9. Future Plans

This concludes the description of the SPECmail2001 benchmark architecture. As with all SPEC benchmarks, it is anticipated that SPECmail2001 will evolve over time and will be released in updated versions, matching changing observations in the field and evolving technologies.
Appendix: Questions and Answers

These are discussions which have been held during the design phase of this benchmark. There are hyperlinks throughout the text into this section, where the questions mattered. These considerations have been kept out of the main body of the document, to streamline the description of the benchmark. It was felt, however, that these thoughts should not simply be lost, but may help the understanding of the design decisions.
Why isn't the benchmark metric "Number of Users"?
Answer: Since the press often discusses ISPs in terms of the size of their subscriber base, this would have been an intuitive choice for the benchmark metric. However, user profiles vary greatly from case to case, and will vary even more as technologies evolve (e.g. multi-media contents) and as email moves into more areas. There is easily a factor of 50 or more between "light" and "heavy" user profiles.
A benchmark metric "number of users" could easily be quoted out of context, and could be misunderstood greatly. While the metric "messages per minute" is not completely fail-safe either, at least it identifies an actual transaction load more directly. A translation between "SPECmail2001 messages per minute" and supported user count, according to the assumed user profile, is given in chapter 4.
Why isn't the benchmark metric "Parallel Sessions"?
Answer: While it is true that the number of parallel established sessions has been used in some proprietary mail server benchmarks, it is not really a good measure for the service a mail server provides.
With the exception of the online operation mode of IMAP, a mail server really "sees" many short, independent requests from email clients and other mail servers. Each of these is in fact a session consisting of several interrelated requests, but these sessions are short-lived. Their duration is not determined by anything like user think time, but rather by the time it takes to move the data (esp. over slow links) and - counter-intuitively - by the server response time. Ironically, the slower mail server will produce longer sessions, and will therefore show a higher count of parallel sessions.
The point isn't to have many people in your shop ... the point is to serve many customers.
Why doesn't the benchmark push as hard as possible and report the achieved throughput?
Answer: This method would be applicable if it led to peak mail server throughput under acceptable quality of service conditions. That is not the case with a mail server. A mail server has the option to spool mail to a queueing area, for processing at an arbitrary later time. This would deteriorate quality of service (delivery time). It would also mean that the work imposed by the benchmark does not necessarily get done during the measurement period: the server might continue to work on de-queueing the spooled messages after the test is over. This method might also lead to unacceptably high response times.
Why does the benchmark emphasize generating requests at a certain arrival rate and pattern?
Answer: This is regarded as a key element in generating a real-world load scenario. A mail server needs to serve many independent clients; there is no way for it to "push back" by responding more slowly to previous requests - therefore SPECmail2001 emphasizes a constant arrival rate. The arrival pattern, modeled after real-world observation, eliminates any chance that a mail server might perform unrealistically well on request sequences which are evenly spaced over time.
If the performance characterization shows full conformance of the server even at 120% - why did the sponsor not submit that 120% level as the actual performance metric?
Answer: A condition for the 100% level is that there is a matching prepopulated mail store. The mail store in question may not have been big enough to support the 120% load as the actual benchmark metric.
Another motivation could be that the QoS drop beyond the 120% mark was found to be too sharp by the test sponsor - the rules of SPECmail2001 allow the sponsor to resort to a lower performance claim and show good 120% behavior instead. The price for the sponsor is the lowered benchmark metric.
Why does the SPECmail2001 benchmark emphasize putting as much mail into the mail server as it is taking out?
Answer: It makes it easier to run several benchmark runs right after each other, including the required 80% / 100% / 120% series. It also gives the mail server a chance to stabilize over the period of the test. If the store were constantly growing/shrinking, the mail server might keep changing its speed, and it might reach one of the boundary conditions (empty mail store or disk overflow).
Does the mail server have to actually physically delete messages while it is performing the benchmark?
Answer: No. The standard mail server protocols do not require that deleted mail is actually removed from the physical media at that same instant. So there is no way for the benchmark tool to verify that mail is physically deleted from the server's storage area -- and depending on the underlying storage technology, "deleted" may mean different things to different implementations. Of course, any deleted message must be "gone" from that mailbox, i.e. a subsequent listing of the same mailbox, from the same or another client, must not list deleted messages anymore -- in compliance with the POP and IMAP protocols. The benchmark tool does not currently verify that. If a mail server delays physical removal of messages (several available server products choose to do that), then of course the tester must configure disk space sufficient to cover the excess storage needs.
Does SPECmail2001 consider mailing lists?
Answer: No. Mailing list expansion is not an inherent part of handling the IMAP, POP or SMTP protocol. This task is often not performed by the main mail server, but by dedicated systems. ISPs typically do not offer mailing lists as a service to their users. Our workload studies, which we used as the basis for defining the workload of the benchmark, were performed on mail servers which saw the result of mailing list expansion - so in a way the net effect, the duplication of messages, was implicitly taken into account.
Why does SPECmail2001 not include IMAP4 users?
Answer: This was originally intended, as the near-complete coverage of IMAP in the architecture paper, and the existing coverage of IMAP testing options in the benchmark tool show. Mainly due to the inability to identify practical workload patterns for IMAP, but also to reduce risk in completing the first release of this benchmark, the design team decided to drop IMAP4 from the workload mix. There was also concern about the practical value of a mixed POP/IMAP benchmark: actual mail server installations are often dedicated to (mainly) one or the other protocol. The initial workload (the amount of data on disks) per user is also considerably different ... even a minority of IMAP users in the mix tends to dominate the disk usage. These issues will be reviewed and should lead to coverage of IMAP4 by a "SPECmail200x" benchmark soon.
Does SPECmail2001 consider WebMail?
Answer: No. WebMail is outside the scope of the SPECmail2001 benchmark.
What constitutes a mail server, in the sense of SPECmail2001?
Answer: A mail server is defined as the collection of all systems, including storage, internal and front-end networking hardware, required to handle the specified mail server load. Hardware to provide directory service as far as needed to handle the mail load is part of the system under test. The mail server can be made up of multiple nodes, as long as it appears to the outside as a single mail server. The list of requirements to qualify as a single mail server includes that all accounts are within one domain, and every change to the mail server (insertion or removal of a message) is visible consistently across the complete system. It is legal for the mail server to provide different IP addresses for the different protocols, but only one IP address per protocol. The mail server must not require that user load traffic can be coming in, separated in any way (e.g. by protocol) onto separate networks. Any router hardware achieving such separation needs to be listed as part of the system under test. For details refer to the Run Rules.
Doesn't a real mail server require much more disk space?
Answer: Yes. While SPECmail2001 makes a serious attempt at requiring a mail store appropriate for the number of users the system under test claims to support, it is still a compromise between reality and the cost of a benchmark. The increased disk space needed in real installations results largely from two factors. First, there is the spool area. The spool area tends to fill up if external mail servers are not available for receiving mail - a scenario which is not included in this benchmark. Second, there is a class of users which are not reflected in the SPECmail2001 profile - users who leave their mail on the server. Their disk space usage isn't really determined by the amount of mail they receive/retrieve on a regular basis, but by the quota and cleanup policy the ISP imposes on them. In a way, one can think of the disk space absorbed by such long-term storage as "dead space" - it contributes to size of occupied disks, size of file systems and/or database indices, but it does not contribute to the actual transaction load.
Copyright © 2001 Standard Performance Evaluation Corporation