The MIT Multi-ALU Processor
I. Introduction
The Multi-ALU Processor (MAP) chip is a component of the M-Machine
multicomputer, an experimental machine being developed at MIT to test
architectural concepts motivated by the constraints of semiconductor
technology and the demands of programming systems, such as faster
execution of fixed-size problems and easier use of parallel
computers. Each six-chip M-Machine node consists of a MAP chip and 8
MBytes of synchronous DRAM (SDRAM) with ECC. The MAP chip employs a
novel architecture for exploiting instruction level parallelism as
well as mechanisms to enable large scale multiprocessing. These
include an on-chip integrated network interface and router as well as
mechanisms for enabling data sharing among processors. In addition,
the MAP employs an efficient capability-based addressing scheme to
provide protection and ensure data integrity.
The MAP chip is implemented using 7.5 million transistors in a
five-metal, 0.5-micron process. All of the datapath layout is complete and
we are currently performing place and route of the standard-cell
control logic. Tapeout is scheduled for June 1997.
II. MAP Architecture
The MAP chip contains three 64-bit execution clusters, a unified cache
divided into four banks, an external memory interface, and a
communication subsystem consisting of a network interface and a
router. Two of the clusters have two integer units, a floating-point
multiply-add unit, and a floating-point divide/square-root unit.
The third cluster has only two integer units.
Two crossbar switches interconnect these components. Clusters make
memory requests to the appropriate bank of the interleaved cache over
the 142-bit wide 3x4 crossbar M-Switch. The 88-bit wide 9x3 crossbar
C-Switch is used for inter-cluster communication and to return data
from the memory system. Both switches support up to three transfers
per cycle; each cluster may send and receive one transfer per cycle.
The 64KB unified on-chip cache is organized as four 16KB banks that
are word-interleaved to permit accesses to consecutive addresses to
proceed in parallel. The cache banks are pipelined with a three-cycle
read latency, including switch traversal. Each cluster has its own
8KB instruction cache which fetches instructions from the unified
cache when instruction cache misses occur. A 128-entry TLB is used to
implement virtual memory.
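To make the interleaving concrete, the sketch below shows how a bank
could be selected from a byte address when four banks are interleaved
on 64-bit words; the exact bit positions are an assumption for
illustration and are not specified in this description.

    #include <stdint.h>

    /* Hypothetical word-interleaved bank selection: with four banks and
     * 8-byte words, the two address bits just above the byte offset pick
     * the bank, so consecutive word addresses land in consecutive banks. */
    static inline unsigned map_bank_select(uint64_t byte_addr)
    {
        return (unsigned)((byte_addr >> 3) & 0x3);   /* assumed bits [4:3] */
    }

Under this assumed mapping, byte addresses 0x00, 0x08, 0x10, and 0x18
select banks 0 through 3, so four consecutive words can be accessed in
parallel over the M-Switch.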
The MAP employs a form of Processor Coupling to control the multiple
ALUs. Each of the three clusters has its own independent instruction
stream. However, threads running on those clusters may communicate
and synchronize with one another very quickly by writing into each
other's register files via the C-Switch. Scoreboard bits on the
registers are used to synchronize these register-register transfers,
as well as to indicate when the data from a load has returned from the
non-blocking memory system. Clusters may also communicate and
synchronize through globally broadcast condition codes, a cluster
barrier instruction (CBAR), and the on-chip cache.
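As a rough illustration of the register-scoreboard synchronization, the
C model below treats each register as a value plus a full bit: a
cross-cluster write over the C-Switch sets the bit, and a reader whose
source register is empty waits until it is filled. The types and
function names are invented for illustration; the real mechanism is a
hardware stall in the issue logic, not a software loop.

    #include <stdint.h>
    #include <stdbool.h>

    /* Illustrative model of a scoreboarded register (not MAP RTL). */
    typedef struct {
        uint64_t      value;
        volatile bool full;   /* scoreboard bit */
    } sb_reg;

    /* Producer side: deliver a result into another cluster's register
     * file via the C-Switch. */
    void csw_write(sb_reg *dst, uint64_t v)
    {
        dst->value = v;
        dst->full  = true;
    }

    /* Consumer side: an instruction reading an empty register stalls;
     * modeled here as a wait on the full bit. */
    uint64_t sb_read(sb_reg *src)
    {
        while (!src->full)
            ;                 /* hardware stall, shown as a wait */
        return src->value;
    }

The same scoreboard bit marks the destination register of a load as
empty until the non-blocking memory system returns the data.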
Each cluster's execution units are shared among six threads that are
concurrently resident in the cluster's pipeline registers. These
threads are interleaved on a cycle-by-cycle basis by a
synchronization (SZ) pipeline stage. Some of the hardware thread
slots are reserved for exception, event, and message handlers, so that
they can run in parallel with the user threads and not incur
invocation overhead when they start up.
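One way to picture the per-cycle interleaving is the selection loop
below, which chooses among the six resident thread slots, skipping any
slot whose operands are still marked empty. The round-robin policy and
all names are assumptions made for illustration only.

    #include <stdbool.h>

    #define NUM_SLOTS 6

    /* Illustrative state for one hardware thread slot. */
    typedef struct {
        bool valid;           /* slot holds a runnable thread        */
        bool operands_ready;  /* scoreboard bits of all sources set  */
    } thread_slot;

    /* Pick the next slot to issue from this cycle, or -1 for a bubble. */
    int sz_select(const thread_slot slots[NUM_SLOTS], int last_issued)
    {
        for (int i = 1; i <= NUM_SLOTS; i++) {
            int s = (last_issued + i) % NUM_SLOTS;
            if (slots[s].valid && slots[s].operands_ready)
                return s;
        }
        return -1;
    }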
The integrated communication subsystem includes hardware support for
message injection, routing, and extraction. A message is formatted in
the general purpose registers and delivered atomically to the network
interface with a SEND instruction. An 8-entry Global Translation
Lookaside Buffer (GTLB) translates global virtual addresses to
physical node identifiers, allowing application-independent address
mapping similar to the way a traditional TLB maps virtual addresses to
physical memory. The two-dimensional network consists of the on-chip
routers which connect directly to the routers on adjacent nodes
through the pads. The routers implement two message priorities,
sharing four virtual channels. A message that arrives at its final
destination is placed into the incoming message queue which is mapped
to a register name in a dedicated thread slot. Software can then
extract the message and perform the required action. The scoreboard
bit on the register indicates whether there are any words left in the
incoming message queue.
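The sketch below outlines this send/receive path in C-like pseudocode.
The function names (map_send, msgq_read, msgq_empty) and the message
format are invented for illustration; in the actual hardware the body
is composed in general purpose registers, the SEND instruction injects
it atomically, and the handler reads it from a register name mapped to
the incoming queue.

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical primitives standing in for the hardware mechanisms. */
    extern void     map_send(uint64_t gva, const uint64_t *body, size_t nwords);
    extern int      msgq_empty(void);   /* scoreboard bit on the queue register */
    extern uint64_t msgq_read(void);    /* next word of the incoming message    */

    /* Handler-thread loop: drain a message and dispatch on its first
     * word, which by software convention names the required action. */
    void message_handler(void)
    {
        while (!msgq_empty()) {
            uint64_t op = msgq_read();
            /* ... decode op, read the remaining words, perform the action ... */
            (void)op;
        }
    }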
The MAP memory system implements protection between threads in a
globally shared address space. Guarded pointers encode a capability
in the top 10 bits of a 64-bit MAP word, and a tag bit prevents
pointers from being forged. Hardware checking of a pointer's
permissions and segment bounds prevents unauthorized memory access on
load, store, and jump instructions.
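A hedged sketch of this check is shown below. The description above
states only that the capability occupies the top 10 bits and that a
separate tag bit marks valid pointers; the split into a 4-bit
permission field and a 6-bit segment-length field, and the bitmask
permission test, are assumptions made for illustration.

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        uint64_t word;   /* 64-bit MAP word                */
        bool     tag;    /* pointer tag bit (unforgeable)  */
    } guarded_ptr;

    #define PERM(p)     ((unsigned)((p).word >> 60))          /* assumed top 4 bits  */
    #define SEG_LOG2(p) ((unsigned)(((p).word >> 54) & 0x3F)) /* assumed next 6 bits */
    #define ADDR(p)     ((p).word & ((1ULL << 54) - 1))       /* low address bits    */

    /* A load, store, or jump would trap unless the word is a tagged
     * pointer, the permission allows the access, and the target stays
     * inside the power-of-two segment implied by the length field. */
    bool access_allowed(guarded_ptr p, uint64_t target, unsigned needed_perm)
    {
        if (!p.tag)
            return false;
        if ((PERM(p) & needed_perm) != needed_perm)   /* simplified permission test */
            return false;
        uint64_t seg_mask = ~((1ULL << SEG_LOG2(p)) - 1);
        return (target & seg_mask) == (ADDR(p) & seg_mask);
    }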
A combination of hardware and software mechanisms on the MAP chip is
used to implement fast and flexible data sharing across M-Machine
nodes. In addition to the virtual and physical page numbers, each
page table entry also includes two state bits for each cache line (a
total of 128 bits with 512-word pages and 8-word cache lines). These
bits encode the states READ-ONLY, READ-WRITE, DIRTY, and INVALID,
allowing sharing of cache-line-sized items across processors. When a
load encounters an INVALID line in the memory system, an event is
invoked in a dedicated thread slot. The software event handler can
then send a message to the home node of the data, using the automatic
translation of the GTLB, requesting a copy of the line. The remote
message handler that is automatically invoked by hardware upon message
arrival at the home node retrieves the data and returns it with another
SEND instruction. When the line arrives, another software handler is
invoked in a dedicated thread slot, which installs the line and
delivers the requested word directly to the destination register of
the original load instruction.
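The outline below sketches the three software handlers in this remote
read path. All function names, the message format, and the fixed 8-word
reply body are hypothetical; only the overall flow follows the
description above.

    #include <stdint.h>
    #include <stddef.h>

    enum { MSG_READ_REQ, MSG_READ_REPLY };

    /* Hypothetical primitives standing in for hardware and runtime support. */
    extern void map_send(uint64_t gva, const uint64_t *body, size_t nwords);
    extern void install_line(uint64_t va, const uint64_t *data);
    extern void write_dest_register(int thread, int reg, uint64_t value);

    /* 1. Requesting node: event handler for a load that hits an INVALID
     *    line; the GTLB routes the request to the data's home node. */
    void invalid_line_event(uint64_t faulting_va, int thread, int dest_reg)
    {
        uint64_t req[4] = { MSG_READ_REQ, faulting_va,
                            (uint64_t)thread, (uint64_t)dest_reg };
        map_send(faulting_va, req, 4);
    }

    /* 2. Home node: message handler retrieves the line and returns it
     *    with another SEND. */
    void read_req_handler(uint64_t requester_gva, uint64_t va,
                          uint64_t thread, uint64_t dest_reg)
    {
        uint64_t reply[12] = { MSG_READ_REPLY, va, thread, dest_reg };
        /* ... copy the 8-word cache line into reply[4..11] ... */
        map_send(requester_gva, reply, 12);
    }

    /* 3. Requesting node: reply handler installs the line and completes
     *    the original load by writing its destination register. */
    void read_reply_handler(uint64_t va, int thread, int dest_reg,
                            const uint64_t line[8])
    {
        install_line(va, line);
        write_dest_register(thread, dest_reg, line[(va >> 3) & 0x7]);
    }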
III. MAP Chip Implementation
A unique set of constraints dictated the MAP implementation of the
M-Machine architecture. The M-Machine team had the advantage of a
"clean-sheet" design with no requirement to support a pre-existing
ISA. The corresponding disadvantage was the lack of pre-existing
infrastructure, most notably the absence of pre-existing cell
libraries, datapath components, and experience with the target
fabrication process.
Throughout the project, the limited size of the M-Machine project team
was a fundamental constraint. From the project's inception in 1992,
through architecture specification, RTL modelling, circuit design,
floorplanning, and physical design, to the planned tapeout in June
1997, an average of nine engineers, with a peak of twenty, worked
regularly on the MAP implementation.
Despite limited manpower, during the course of the project the
M-Machine team designed, developed, and characterized a full standard
cell library; a composable datapath cell library; five RAM arrays; and a
broad family of datapath components including a 64b adder/subtractor,
a 64b barrel shifter, a radix-8 multiplier array, and a 7-ported
register file. In addition, a set of CMOS and low-voltage simultaneous
bidirectional I/O pads was designed for the project.
The critical components of the execution datapath, such as the
multiplier array, adder, and register files, were implemented in a
full custom methodology. The majority of the latches, multiplexors,
and buffers in the execution datapaths were implemented by explicit
placement (tiling) of the composable datapath cells and standard
cells. The control logic was implemented via standard cell place and
route. The majority of the circuits used in the MAP design are
implemented in static CMOS logic, both for design simplicity and to
minimize the effort required to fully characterize their functionality
and performance. The most notable exceptions are the domino multiplier
array, the SRAM arrays, and the simultaneous bidirectional pad
drivers.
While all of the circuit and logic design was performed at MIT, a fundamental
project decision was to collaborate with an industrial design center
for the physical design of the chip. The M-Machine team eventually
selected Cadence Spectrum Design (CSD) as its partner for the MAP
implementation. The extensive chip building experience of the CSD
engineers was critical to the success of the project.
The original design of the MAP chip contained over 13 million
transistors and consisted of four clusters, each with an IU, MU, and FPU,
1 Mbit of on-chip unified cache, and a 3-D mesh router. As the project
progressed, it became apparent that this design was too large for the
target die size. Some of the key lessons of the MAP implementation
experience resulted from the process of paring the original design to
its final form.
Several factors caused the size of the original design to be
underestimated: the design methodology, an under-appreciation of the
complexity of the control logic, and overly optimistic expectations of
the process technology. The use of the semi-custom datapath layout
methodology resulted in an average 40% area growth in the datapaths;
however, the additional flexibility significantly reduced the time and
effort required to make engineering changes and fix errors. Somewhat
surprisingly, the quantity and complexity of the random control logic,
rather than the density of the datapaths, have dictated both circuit
performance and chip area.
The resulting MAP chip is a 7.5 million transistor microprocessor with
a die size of 18mm x 18mm. It is implemented in a 5-level-metal 0.7um
drawn (0.5um effective) CMOS process. IBM Corporation is manufacturing
the MAP chip described in this presentation for MIT. Each MAP die
will be packaged in an MCM-L chip carrier with five 16-Mb SDRAM TSOPs.