A benchmark is a point of reference for a measurement. The term presumably originates from the practice of making dimensional height measurements of an object on a workbench using a graduated scale or similar tool, and using the surface of the workbench as the origin for the measurements.
In surveying, benchmarks are landmarks of reliable, precisely known altitude, and are often man-made objects, such as features of permanent structures that are unlikely to change, or special-purpose "monuments", which are typically small concrete obelisks, approximately 3 feet tall and 1 foot wide at the base, set permanently into the earth.
In computing, a benchmark is the result of running a computer program, or a set of programs, in order to assess the relative performance of an object by running a number of standard tests and trials against it. The term is also commonly used for the specially designed benchmarking programs themselves. Benchmarking is usually associated with assessing the performance characteristics of computer hardware, e.g., the floating-point performance of a CPU, but there are circumstances in which the technique is also applicable to software. Software benchmarks are, for example, run against compilers or database management systems.
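The idea of running standard, repeatable trials can be illustrated with a minimal timing harness. This is only a sketch, not any standard benchmark suite; the function names, the trial count, and the floating-point workload are all illustrative choices.

```python
import time

def benchmark(workload, trials=5):
    """Run `workload` several times; return the best wall-clock time in seconds."""
    times = []
    for _ in range(trials):
        start = time.perf_counter()
        workload()
        times.append(time.perf_counter() - start)
    # Taking the best of N runs reduces noise from scheduling and caching.
    return min(times)

def workload():
    # A simple floating-point loop standing in for a real test program.
    total = 0.0
    for i in range(1, 100_000):
        total += 1.0 / i
    return total

print(f"best of 5 runs: {benchmark(workload):.6f} s")
```

Because the same harness and workload can be run unchanged on different machines, the resulting times can be compared across systems, which is the essence of benchmarking.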
Benchmarks provide a method of comparing the performance of various subsystems across different chip/system architectures.
As computer architecture advanced, it became more and more difficult to compare the performance of various computer systems simply by looking at their specifications. Therefore, tests were developed that could be performed on different systems, allowing the results from these tests to be compared across different architectures. For example, Intel processors usually have a higher clock rate (measured in hertz) than AMD processors, but AMD processors often prove as fast or faster on benchmark tests.
Benchmarks are designed to mimic a particular type of workload on a component or system. "Synthetic" benchmarks do this with specially created programs that impose an artificial workload on the component. "Application" benchmarks, instead, run actual real-world programs on the system. Whilst application benchmarks usually give a much better measure of real-world performance on a given system, synthetic benchmarks still have their use for testing individual components, such as a hard disk or networking device.
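A synthetic benchmark for a single component might look like the following sketch, which imposes an artificial sequential-write workload on the disk and reports throughput. The block size and byte counts are arbitrary illustrative choices, and a real disk benchmark would also control for caching and measure other access patterns.

```python
import os
import tempfile
import time

def synthetic_disk_write(total_bytes=16 * 1024 * 1024, block=64 * 1024):
    """Write `total_bytes` of zeros in `block`-sized chunks; return MB/s."""
    chunk = b"\0" * block
    with tempfile.NamedTemporaryFile(delete=False) as f:
        start = time.perf_counter()
        written = 0
        while written < total_bytes:
            f.write(chunk)
            written += block
        f.flush()
        os.fsync(f.fileno())  # force the data out to the device
        elapsed = time.perf_counter() - start
        path = f.name
    os.remove(path)
    return (written / (1024 * 1024)) / elapsed

print(f"sequential write: {synthetic_disk_write():.1f} MB/s")
```

An application benchmark, by contrast, would time an actual program the user cares about (a compile, a database query) rather than a contrived workload like this one.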
Computer manufacturers have a long history of trying to set up their systems to give unrealistically high performance on benchmark tests that is not replicated in real usage. For instance, during the 1980s some compilers could detect a specific mathematical operation used in a well-known floating-point benchmark and replace the operation with a mathematically-equivalent operation that was much faster. However, such a transformation was rarely useful outside the benchmark.
More generally, users are advised to take benchmarks, particularly those provided by manufacturers themselves, with ample quantities of salt. If performance is really critical, the only benchmark that matters is the actual workload that the system is to be used for. If that is not possible, benchmarks that resemble real workloads as closely as possible should be used, and even then interpreted with scepticism. It is quite possible for system A to outperform system B when running program "furble" on workload X (the workload in the benchmark), and for the order to be reversed with the same program on your own workload.
Some common benchmarks are: