The R10000, code-named "T5", is a microprocessor implementation of the MIPS IV instruction set architecture (ISA) developed by MIPS Technologies, Inc. (MTI), then a division of Silicon Graphics, Inc. (SGI). The chief designers were Chris Rowen and Kenneth C. Yeager. The R10000 microarchitecture was known as ANDES, an abbreviation for Architecture with Non-sequential Dynamic Execution Scheduling.[1] The R10000 largely replaced the R8000 in the high-end and the R4400 elsewhere.

It was originally intended to be the last high-performance, non-embedded MIPS microprocessor developed for SGI, which had opted to replace MIPS with the Itanium. Because the Itanium was repeatedly delayed, the R10000's basic microarchitecture became the basis for successive derivatives that kept the design competitive. As MTI was a fabless semiconductor company, the R10000 was fabricated by NEC and Toshiba. Previous fabricators of MIPS microprocessors, such as Integrated Device Technology (IDT) and three others, did not fabricate the R10000 because it was more expensive to produce than the R4000 and R4400.

History

The R10000 was introduced in January 1996 at clock frequencies ranging from 150 MHz to 200 MHz, but was not available in large volumes until later in the year due to fabrication problems at MIPS's foundries. The 200 MHz version was in short supply throughout 1996, and was priced at US$3,000 as a result.[2]

On 25 September 1996, SGI announced that R10000s fabricated by NEC between March and the end of July were faulty, drawing too much current and causing systems to shut down while in operation. As a result, SGI recalled 10,000 R10000s that had shipped in systems, which reduced the company's earnings.[3][4]

Later, the R10000 was fabricated in a 0.25 µm process, which enabled it to reach 250 MHz.

R10000 users were SGI, NEC, Siemens Nixdorf, Tandem Computers and others. SGI used the R10000 in their workstations, servers and supercomputers. NEC built supercomputers utilizing the R10000, and other manufacturers built both workstations and servers.

Description

MIPS IV is a 64-bit architecture, but the R10000 did not implement the entire physical or virtual address to reduce cost. Instead, it has a 40-bit physical address and a 44-bit virtual address, thus it is capable of addressing 1 TB of physical memory and 16 TB of virtual memory.
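The address-space figures follow directly from the address widths. The short check below is only an arithmetic illustration; the function name is chosen here for the example, and "TB" is taken to mean 2^40 bytes.

```python
def addressable_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses for a given address width."""
    return 1 << address_bits

TB = 1 << 40  # assumption: "TB" in the text means 2**40 bytes

physical_tb = addressable_bytes(40) // TB  # 40-bit physical address
virtual_tb = addressable_bytes(44) // TB   # 44-bit virtual address

print(physical_tb, virtual_tb)
```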

Integer unit

The integer unit consists of three pipelines: two integer and one load/store. The integer register file was 64 bits wide and contained 64 entries, of which 32 were architectural registers and 32 were rename registers used to implement register renaming. The register file had seven read ports and three write ports.
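Register renaming with 32 architectural and 64 physical registers can be sketched generically as a mapping table plus a free list. The class and method names below are hypothetical, and the sketch does not claim to reproduce the R10000's actual renaming hardware, which the text does not describe in detail.

```python
from collections import deque

class RenameTable:
    """Toy rename stage: 32 architectural registers mapped onto 64 physical ones."""

    def __init__(self, n_arch: int = 32, n_phys: int = 64) -> None:
        # Initially, architectural register i lives in physical register i.
        self.mapping = list(range(n_arch))
        # The remaining physical registers are free for renaming.
        self.free = deque(range(n_arch, n_phys))

    def rename(self, dest: int, srcs: tuple) -> tuple:
        """Rename one instruction; returns (new_dest, renamed_srcs, old_dest)."""
        # Sources are looked up before the destination mapping is changed.
        renamed_srcs = tuple(self.mapping[s] for s in srcs)
        if not self.free:
            raise RuntimeError("no free physical register; the front end would stall")
        new_dest = self.free.popleft()
        old_dest = self.mapping[dest]
        self.mapping[dest] = new_dest
        return new_dest, renamed_srcs, old_dest

    def release(self, old_dest: int) -> None:
        """Recycle the previous mapping once the instruction has completed in order."""
        self.free.append(old_dest)

table = RenameTable()
# r1 = r2 + r3, then r1 = r1 + r4: each write to r1 gets a fresh physical register.
first = table.rename(dest=1, srcs=(2, 3))
second = table.rename(dest=1, srcs=(1, 4))
print(first)
print(second)
```

Because the second instruction reads r1 through the mapping created by the first, the true dependency is preserved while the two writes to r1 no longer conflict.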

Floating-point unit

The floating-point unit consisted of four functional units: an adder, a multiplier, a divide unit and a square root unit. The adder and multiplier are pipelined, but the divide and square root units are not. Adds and multiplies have a latency of three cycles, and the adder and multiplier can each accept a new instruction every cycle. The divide unit has a 12- or 19-cycle latency, depending on whether the divide is single precision or double precision, respectively.

The square root unit executes square root and reciprocal square root instructions. Square root instructions have an 18- or 33-cycle latency for single precision or double precision, respectively. A new square root instruction can be issued to the divide unit every 20 or 35 cycles for single precision and double precision, respectively. Reciprocal square roots have longer latencies: 30 or 52 cycles for single precision and double precision, respectively.
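The difference between latency and issue interval can be illustrated with a simple timing model. The function below is a hypothetical back-of-the-envelope estimate, assuming independent operations issued back to back with no other stalls; it uses only the cycle counts quoted above.

```python
def total_cycles(n_ops: int, latency: int, issue_interval: int) -> int:
    """Cycles until the last of n_ops independent operations completes.

    Simplified model: one operation starts every issue_interval cycles,
    and each one finishes latency cycles after it starts.
    """
    if n_ops <= 0:
        return 0
    return (n_ops - 1) * issue_interval + latency

# Pipelined adder: three-cycle latency, a new instruction every cycle.
adds = total_cycles(4, latency=3, issue_interval=1)

# Non-pipelined square root: 18/33-cycle latency, issue every 20/35 cycles.
sqrt_single = total_cycles(4, latency=18, issue_interval=20)
sqrt_double = total_cycles(4, latency=33, issue_interval=35)

print(adds, sqrt_single, sqrt_double)
```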

The floating-point register file contains sixty-four 64-bit registers, of which thirty-two are architectural and the remaining thirty-two are rename registers. The adder has its own dedicated read and write ports, whereas the multiplier shares its ports with the divider and square root unit.

The divide and square root units use the SRT algorithm. The MIPS IV ISA has a multiply-add instruction. The R10000 implements this instruction with a bypass: the result of the multiply can bypass the register file and be delivered to the add pipeline as an operand. It is therefore not a fused multiply-add, and it has a four-cycle latency.
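The numerical difference between a fused multiply-add and an unfused one, where the product is rounded before the add, can be shown in double precision. This is an illustration of the rounding behaviour only, not a model of the R10000 datapath; the function names are hypothetical, and exact rational arithmetic from Python's `fractions` module stands in for a fused unit's single rounding.

```python
from fractions import Fraction

def unfused_madd(a: float, b: float, c: float) -> float:
    """Multiply, round, then add and round again (two roundings)."""
    product = a * b          # rounded to double precision here
    return product + c       # rounded a second time

def fused_madd(a: float, b: float, c: float) -> float:
    """Compute a*b + c exactly, then round once (emulated with Fraction)."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)      # the only rounding step

a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

# The exact product is 1 - 2**-60, which rounds to 1.0 in double precision,
# so the unfused result loses the small term entirely.
print(unfused_madd(a, b, c))
print(fused_madd(a, b, c))
```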

Caches

The R10000 has a 32 KB instruction cache and a 32 KB data cache, which were large for 1996. The instruction cache is two-way set associative and has a 64-byte line size. Before instructions are placed in the cache, they are partially decoded: four bits are appended to each 32-bit instruction to identify the execution unit in which it will be executed. The 32 KB data cache is two-way interleaved, consisting of two 16 KB banks that are each two-way set associative. It is virtually indexed and physically tagged, which allows the cache to be indexed in the same clock cycle and helps maintain coherency with the secondary cache.
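The instruction cache parameters given above determine how an address would be split into tag, set index and line offset. The sketch below assumes a conventional set-associative layout with power-of-two sizes; the function names are hypothetical, and the data cache is left out because its line size is not stated here.

```python
def cache_geometry(size_bytes: int, ways: int, line_bytes: int) -> tuple:
    """Return (sets, offset_bits, index_bits) for a set-associative cache."""
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2 for powers of two
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits

def split_address(addr: int, offset_bits: int, index_bits: int) -> tuple:
    """Split an address into (tag, set index, line offset)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

# 32 KB, two-way set associative, 64-byte lines.
sets, offset_bits, index_bits = cache_geometry(32 * 1024, ways=2, line_bytes=64)
print(sets, offset_bits, index_bits)
print(split_address(0x12345, offset_bits, index_bits))
```

With these figures, the index and offset together cover 14 address bits, which corresponds to the 16 KB held in each way.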

The secondary cache capacity was 512 KB to 16 MB, built from synchronous static random access memory (SSRAM). It was accessed via a dedicated 128-bit bus with 9 bits for ECC. The cache and bus operated at the same clock frequency as the R10000, whose maximum was 200 MHz. At 200 MHz, the bus yielded a peak bandwidth of 3.2 GB/s.
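The 3.2 GB/s figure can be reproduced from the bus width and clock frequency. The helper below is a hypothetical illustration that assumes one transfer per clock and decimal units (1 GB = 10^9 bytes), and counts only the 128 data bits, not the ECC bits.

```python
def peak_bandwidth_bytes_per_s(data_bits: int, clock_hz: int) -> int:
    """Peak bandwidth assuming one full-width data transfer per clock cycle."""
    return (data_bits // 8) * clock_hz

l2_bandwidth = peak_bandwidth_bytes_per_s(data_bits=128, clock_hz=200_000_000)
print(l2_bandwidth / 1e9, "GB/s")
```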

Avalanche system bus

The R10000 used the Avalanche bus, a 64-bit bus that operated at frequencies up to 100 MHz. Avalanche was a multiplexed address and data bus, so although its maximum theoretical bandwidth at 100 MHz was 800 MB/s, its achievable peak bandwidth was 640 MB/s because some cycles were needed to transmit addresses.
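The two Avalanche bandwidth figures imply what share of bus cycles carries data. The calculation below uses only the numbers quoted above and decimal megabytes; reading the remainder as address overhead is an averaging interpretation, since the transaction format is not described here.

```python
from fractions import Fraction

BUS_BITS = 64
CLOCK_HZ = 100_000_000

# Every cycle carrying 8 bytes of data gives the theoretical maximum.
theoretical = (BUS_BITS // 8) * CLOCK_HZ
quoted_peak = 640_000_000  # 640 MB/s, as quoted in the text

data_fraction = Fraction(quoted_peak, theoretical)
overhead_fraction = 1 - data_fraction

print(theoretical, data_fraction, overhead_fraction)
```

On this reading, about four cycles in five carry data at peak, with the rest spent on addresses.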

The system interface controller supported glue-less symmetrical multiprocessing (SMP) of up to four microprocessors. Systems using the R10000 with external logic could scale to hundreds of processors, such as the Origin 2000.

Fabrication

The R10000 consisted of approximately 6.7 million transistors, of which approximately 4.4 million are contained in the primary caches. The die measured 16.64 mm by 17.934 mm, for a die area of 298 mm². It was fabricated in a 0.35 µm process and packaged in a 599-pad ceramic land grid array (LGA). Before the R10000 was introduced, the Microprocessor Report, covering the 1994 Microprocessor Forum, reported that it was to be packaged in a 527-pin ceramic pin grid array (CPGA), and that vendors had also investigated the possibility of using a 339-pin multi-chip module (MCM) containing the microprocessor die and 1 MB of cache.[5]

Derivatives

R12000

The R12000 was a further development of the R10000 introduced in November 1998. The R12000 was developed as a stop-gap solution following the cancellation of the "Beast" project, which was intended to deliver a successor to the R10000. R12000 users included SGI, Tandem Computers (later Compaq, which had acquired Tandem) and Siemens-Nixdorf.

The microarchitecture was improved by: inserting an extra pipeline stage to improve clock frequency by resolving a critical path; increasing the number of entries in the branch history table, improving prediction; and modifying the instruction queues so they take into account the age of a queued instruction, enabling older instructions to be executed before newer ones where possible. The R12000 was fabricated in a 0.25 µm CMOS process with four levels of interconnect. The move to a new process did not mean that the R12000 was a simple die shrink with a tweaked microarchitecture: the layout of the die was optimized to take advantage of the 0.25 µm process and the extra level of interconnect.
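Age-aware selection from an instruction queue can be sketched as follows. The text states only that the R12000's queues take the age of an instruction into account; the data structure, field names and "oldest ready first" policy below are assumptions made for illustration and are not the documented hardware behaviour.

```python
from dataclasses import dataclass

@dataclass
class QueuedInstruction:
    name: str
    age: int      # hypothetical counter: larger means queued for longer
    ready: bool   # True when all source operands are available

def select_oldest_ready(queue: list, issue_width: int) -> list:
    """Pick up to issue_width ready instructions, preferring the oldest."""
    ready = [insn for insn in queue if insn.ready]
    ready.sort(key=lambda insn: insn.age, reverse=True)
    return ready[:issue_width]

queue = [
    QueuedInstruction("add", age=1, ready=True),
    QueuedInstruction("sub", age=5, ready=True),
    QueuedInstruction("mul", age=9, ready=False),  # oldest, but operands not ready
    QueuedInstruction("or", age=3, ready=True),
]

issued = [insn.name for insn in select_oldest_ready(queue, issue_width=2)]
print(issued)
```

The oldest instruction is skipped because it is not ready, which matches the "where possible" qualification: age acts as a priority among ready instructions rather than a strict ordering.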

R12000A

The R12000A was an R12000 fabricated in a 0.18 µm process. It operated at up to 400 MHz and was introduced in July 2000.

R14000

The R14000 was a further development of the R12000 announced in July 2001. The R14000 operated at 500 MHz, a frequency enabled by its fabrication in a 0.13 µm CMOS process with five levels of copper interconnect. It improved on the microarchitecture of the R12000 by supporting double data rate (DDR) SSRAMs for the secondary cache and a 200 MHz system bus.[6]

R14000A

The R14000A was a further development of the R14000 announced in February 2002. It operated at 600 MHz, dissipated approximately 17 W, and was fabricated by NEC Corporation in a 0.13 µm CMOS process with seven levels of copper interconnect.[6]

R16000

The R16000 was the last MIPS microprocessor for general-purpose computing. Improvements included higher clock frequencies and 64 KB instruction and data caches. Originally, the fastest R16000 publicly known operated at 800 MHz, but SGI later revealed there were 1.0 GHz R16000s shipped to selected customers. R16000 users included SGI for their Tezro workstations and Origin 3000 servers and supercomputers; and HP for their NonStop Himalaya S-Series fault-tolerant servers inherited from Compaq via Tandem.

R18000

The R18000 was a cancelled further development of the R10000 microarchitecture with major improvements, described by Silicon Graphics, Inc. at the Hot Chips symposium in 2001. The R18000 was designed specifically for SGI's ccNUMA servers and supercomputers. Each node would have two R18000s connected via a multiplexed bus to a system controller, which interfaced the microprocessors to their local memory and the rest of the system via a hypercube network.

The R18000 improved the floating-point instruction queues and revised the floating-point unit to feature two multiply-add units, quadrupling the peak FLOPS count. Division and square root were performed in separate non-pipelined units in parallel with the multiply-add units. The system interface and memory hierarchy were also significantly reworked. It would have a 52-bit virtual address and a 48-bit physical address. The bidirectional multiplexed address and data system bus of the R10000 would be replaced by two unidirectional DDR links: a 64-bit multiplexed address and write path and a 128-bit read path. Although the paths are unidirectional, each could be shared with a second R18000 through multiplexing. The bus could also be configured in the SysAD or Avalanche configuration for backwards compatibility with R10000 systems.

The R18000 would have included a 1 MB four-way set-associative secondary cache on-die, supplemented by an optional tertiary cache built from SDR or DDR SSRAM, or from DDR SDRAM, with capacities of 2 to 64 MB. The L3 cache had its cache tags, equivalent to 400 KB, located on-die to reduce latency. The L3 cache would be accessed via a 144-bit bus, of which 128 bits are for data and the remaining 16 bits for ECC. The clock rate was to have been programmable.

The R18000 was to be fabricated in NEC's UX5 process, a 0.13 µm CMOS process with nine levels of copper interconnect. It would have used a 1.2 V power supply and dissipated less heat than contemporary server microprocessors, so that it could be densely packed into systems.

Notes

  1. ^ "MIPS Claims Floating-Point Record With R10000, The Hottest Chip At The Microprocessor Forum".
  2. ^ Gwennap, "Alpha Sails, PowerPC Flails", p. 8.
  3. ^ "Defects Revealed In SGI R10000 MIPS Systems, Revenues Hit".
  4. ^ "SGI To Recall 10,000 R10000s".
  5. ^ "MIPS R10000 Uses Decoupled Architecture", p. 4.
  6. ^ a b "SGI to develop MIPS chips for Origin, Onyx".
