Tandem Computers was an early manufacturer of fault-tolerant computer systems, marketed to the growing number of transaction processing customers who used them for ATMs, banks, stock exchanges and other similar needs. Tandem systems used a number of redundant processors and storage devices to provide high-speed "failover" in the case of a hardware failure, an architecture that they called NonStop. Over the two decades from the 1970s into the mid-1990s, Tandem systems evolved into software-only solutions running on other platforms. The company was purchased by Compaq in 1997 to provide that company with more robust server offerings. Today their software is still known as NonStop, a separate product line offered by Hewlett-Packard.

History

Tandem Computers was founded in 1974 by a group of engineers from Hewlett-Packard, led by James Treybig. Their business plan called for systems that were proof against "single point failures" yet only slightly more expensive than competing non-fault-tolerant systems. Tandem considered the small price premium to be very important to their business model, as customers invariably developed procedural solutions to downtime when the price was too high.

Design of their NonStop I system was complete in 1975, and the first example was sold to Citibank in 1976. The NonStop consisted of between 2 and 16 processor modules, each capable of about 0.7 MIPS, with its own memory, I/O controllers, and dual connections to Tandem's custom inter-CPU computer bus, Dynabus. The modules were constructed so that a failure would always leave at least one of the buses (both I/O and Dynabus) free for use by the other modules. The CPUs themselves were fairly simple. The basic design was patterned on the HP3000 CPU, a 16-bit stack-based machine with 32-bit addressing. In practice the full 32-bit address space could not be addressed, because a number of the bits acted as status flags. Like the HP3000, the NonStop CPU added a number of registers for fast access, in this case to programmer-specified global variables.

The NonStop I ran a custom operating system called Guardian that was key to the system's failover modes. A number of other companies had introduced failover that operated by restarting programs on other CPUs, but in Guardian all operations used message passing and every operation was checkpointed. That is, Guardian could restart from any instruction in the program, a key feature that the stack-based processor made fairly easy to construct because it had little "state" to move from machine to machine. All instructions consisted of taking data from the stack and putting the result back on when they completed; if the latter failed, the stack could be copied to another processor and the program restarted at that instruction.
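The checkpoint-and-restart idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy instruction set, the Checkpoint and Processor classes, and the run_with_failover function are all hypothetical and do not reflect Guardian's actual interfaces or Tandem's real instruction set.

```python
from dataclasses import dataclass, field

# Hypothetical toy stack program; not Tandem's instruction set.
PROGRAM = [("push", 2), ("push", 3), ("add", None), ("push", 4), ("mul", None)]


@dataclass
class Checkpoint:
    pc: int        # index of the next instruction to run
    stack: tuple   # immutable copy of the operand stack


@dataclass
class Processor:
    name: str
    inbox: list = field(default_factory=list)  # checkpoint messages received

    def run(self, program, start, fail_at=None, backup=None):
        """Run from a checkpoint; return (finished, stack)."""
        pc, stack = start.pc, list(start.stack)
        while pc < len(program):
            if backup is not None:
                # Message passing: send the state before every operation.
                backup.inbox.append(Checkpoint(pc, tuple(stack)))
            if pc == fail_at:
                return False, None  # simulated failure during this instruction
            op, arg = program[pc]
            if op == "push":
                stack.append(arg)
            else:
                b, a = stack.pop(), stack.pop()
                stack.append(a + b if op == "add" else a * b)
            pc += 1
        return True, stack


def run_with_failover(program, fail_at=None):
    """Run on a primary CPU; on failure, resume on the backup."""
    primary, backup = Processor("cpu0"), Processor("cpu1")
    ok, stack = primary.run(program, Checkpoint(0, ()), fail_at, backup)
    if not ok:
        # The backup restarts at the instruction that was in flight.
        ok, stack = backup.run(program, backup.inbox[-1])
    return stack


print(run_with_failover(PROGRAM, fail_at=2))
```

Because the only state is the stack and a program counter, the checkpoint message stays small, which is the property the paragraph above attributes to the stack-based design.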

While conventional systems of the era, including mainframes, typically ran for only a few days between failures, the NonStop system was designed to fail 100 times less often, with "uptimes" measured in years. Nevertheless the NonStop was deliberately designed to be price-competitive with conventional systems, with a simple 2-CPU system priced at just over twice that of a competing single-processor mainframe, as opposed to four or more times for most competing solutions.
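As a rough check on these figures, the arithmetic can be sketched in Python. The 4-day interval standing in for "a few days" and the function name are assumptions made for illustration; the text gives only the factor of 100.

```python
DAYS_PER_YEAR = 365.25


def improved_interval_years(baseline_days, improvement=100):
    """Time between failures, in years, after the stated improvement."""
    return baseline_days * improvement / DAYS_PER_YEAR


# Assumed baseline: "a few days" taken as 4 days between failures.
years = improved_interval_years(4)
print(f"{years:.1f} years between failures")
```

Under that assumption the interval between failures comes out at a little over a year, which is consistent with uptimes being measured in years.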

NonStop I was followed by the NonStop II in 1981, which offered a slight improvement in speed to 0.8 MIPS but a more significant upgrade in memory, from a maximum of 384 kB per CPU in the I to 2 MB in the II, along with a complete virtual memory system allowing for considerably larger address spaces. The same basic system, including the physical packaging, was used in 1983's NonStop TXP system, which more than doubled the speed to 2.0 MIPS and raised the physical memory to 8 MB. All of these machines used the same Dynabus system, which had been overdesigned in the NonStop I so that Tandem could avoid changing it in the future.

Introduced along with the TXP was a new fibre optic bus system, FOX. FOX allowed a number of TXP and NonStop II systems to be connected together to form a larger system with up to 14 nodes. Just as it did with the CPU modules within a single computer, Guardian could fail over entire task sets to other machines in the network.

In 1986 a major upgrade to the system was introduced, the NonStop VLX. VLX used a new Dynabus, increasing speed from 13 MBps to 40 MBps (total, 20 MBps per independent bus). Tandem also introduced FOX II, increasing the maximum size of the networks from 1 km to 4 km. Using the original FOX, VLX systems could be connected to the older NonStop II and TXP systems, but these older systems were not supported on FOX II.

VLX was partnered with the NonStop CLX, a minicomputer-sized machine for smaller installations. The CLX had roughly the same performance as the earlier TXP, but was much smaller and less expensive. By the end of its lifetime the CLX had increased in speed considerably and competed with the VLX; 1991's CLX 800 was only about 20% slower than the VLX, with the main difference being more limited expansion abilities.

In 1986 Tandem also introduced the first fault-tolerant SQL database, NonStop SQL. Developed on the famous Ingres code base, NonStop SQL added a number of features based on Guardian to ensure data validity across nodes. NonStop SQL was famous for scaling linearly in performance with the number of nodes added to the system, whereas most databases of the era had performance that plateaued quite quickly, often after two CPUs. A later version released in 1989 added transactions that could be spread over nodes, a feature that remained unique for some time.

The NonStop Cyclone was introduced in 1989, bringing a new superscalar CPU design. It was otherwise similar to earlier systems, although much faster. In general terms the Cyclone was about four times as fast as the CLX 800, which Tandem used as their benchmark. On the downside the new CPU was complex and expensive, requiring four circuit boards to implement a single CPU.

In 1991 Tandem followed this with RISC implementations of Guardian, running on MIPS R3000-based CPU modules in the Cyclone/R and CLX/R. Programs written for the earlier stack-based CPU design were automatically translated on the fly into R3000 code by an interpreter, although they ran considerably slower than on earlier machines. Tandem also provided a number of tools to easily port existing object code to the new systems, resulting in code that was some 25% slower than on the original Cyclone. Source code compilers were also available. While slower, the new system was considerably less expensive, and it was clear that RISC performance was outpacing CISC. By making the move when they did, Tandem was banking on increases in MIPS performance quickly wiping out any performance disadvantages the system had at the time. Using the same basic hardware, Tandem also shipped NonStop Integrity, replacing Guardian with a modified Unix.

In 1993 Tandem introduced the NonStop Himalaya, also known as the S-Series. The Himalaya was the first system that changed the underlying architecture of the NonStop system, basing both the I/O and inter-CPU buses on their new ServerNet system. Whereas Dynabus and FOX linked the CPUs together into a ring, ServerNet was a true peer-to-peer network replacing both, and ran at much higher speeds. Another addition was the use of "lockstep processors"; each processor in the system had two MIPS CPUs running the same code, and if the results coming out ever disagreed, the processor was considered to be faulting and instantly stopped. At that point Guardian would move that task to another processor as in earlier systems, with lockstep guaranteeing that bad data was never written out.
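The lockstep comparison described above can be sketched as follows. This is a minimal illustration, not Tandem's implementation: the function names, the LockstepFault exception, and the use of plain Python functions to stand in for CPUs are all assumptions made for the example.

```python
class LockstepFault(Exception):
    """Raised when the two CPUs of a lockstep pair disagree."""


def lockstep(cpu_a, cpu_b, value):
    """Run the same step on both CPUs; release the result only if they agree."""
    result_a, result_b = cpu_a(value), cpu_b(value)
    if result_a != result_b:
        raise LockstepFault(f"{result_a!r} != {result_b!r}")
    return result_a


def run_task(processors, value):
    """Try each lockstep processor in turn, moving the task on a fault."""
    for index, (cpu_a, cpu_b) in enumerate(processors):
        try:
            return index, lockstep(cpu_a, cpu_b, value)
        except LockstepFault:
            continue  # processor stopped; its result was never released
    raise RuntimeError("no healthy processor left")


def healthy(x):
    return x * x


def flaky(x):
    return x * x + 1  # simulated hardware fault: wrong answer


# Processor 0 has one faulty CPU; processor 1 is healthy.
processors = [(healthy, flaky), (healthy, healthy)]
print(run_task(processors, 7))
```

The point of the design is that a disagreement is detected before any result leaves the processor, so the task is retried elsewhere rather than continuing with possibly corrupt data.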

Tandem was acquired by Compaq in 1997. In an ironic full circle, Compaq was later acquired by HP in 2002, bringing Tandem back to its roots. As of 2003, the NonStop product line continues to be produced under the HP name.

Description

External links:

Tandem Technical Reports
- a page at HP with a number of Tandem white papers
Redundant Fault Tolerant Systems
- a PowerPoint presentation on Tandem's fault tolerance strategies