Nintendo’s goal was to give players the best graphics possible. To achieve this, it partnered with one of the biggest players in computer graphics to produce the ultimate graphics chip.
The result was a nice-looking console for the family… and a 500-page manual for the developer.
Don’t worry, I promise you this article will not be that long… Enjoy!
The main processor is a NEC VR4300 running at 93.75 MHz. It’s a binary-compatible version of Silicon Graphics’ MIPS R4300i that features:
An internal 64-bit FPU is also included in this package. The CPU identifies it as a co-processor (COP1), although the unit is fitted next to the ALU and is only accessed through the ALU pipeline, meaning there’s no co-processing per se.
RAM follows a unified-memory architecture or ‘UMA’, where all available RAM is centralised in one place and every component that needs to use it accesses this same location. The component arbitrating access is, in this case, the GPU.
The reason for choosing this design comes down to the fact that it saves a considerable amount in production costs while, on the other hand, it increases access latency.
Due to the unified memory architecture, the CPU no longer has direct access to RAM, so the GPU will be providing DMA functionality as well.
Apart from the UMA, the way RAM is addressed is a little bit complicated, so I’ll try to keep it simple. Here it goes…
The system physically contains 4.5 MB of RAM. However, it’s connected using a 9-bit bus where the 9th bit can only be accessed by the GPU (more details later). As a consequence, every component except the GPU will only find up to 4 MB.
The type of RAM fitted on the board is called Rambus DRAM or ‘RDRAM’ for short. This was just another design that competed against SDRAM to become the next standard. RDRAM is connected in serial (where transfers are done one bit at a time) while SDRAM uses a parallel connection (transferring multiple bits at a time).
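To see where the two figures come from, here is a quick back-of-the-envelope sketch. It assumes that every 9-bit word carries 8 ordinary data bits plus the extra GPU-only bit; the function name is made up for illustration.

```python
def visible_and_physical_mb(visible_mb, bus_bits=9, data_bits=8):
    """Return (visible, physical) capacity in MB.

    Assumption: only `data_bits` out of every `bus_bits` are seen by
    most components, while the remaining bit is reserved for the GPU.
    """
    physical_mb = visible_mb * bus_bits / data_bits
    return visible_mb, physical_mb


visible, physical = visible_and_physical_mb(4)
print(f"Visible to the CPU: {visible} MB, physically installed: {physical} MB")
```

In other words, the ‘missing’ 0.5 MB is simply the sum of all the ninth bits.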
RDRAM’s latency is directly proportional to the number of banks installed and as a consequence, with the amount of RAM this system has, the resulting latency is significant.
On the other hand, the amount of available RAM on this console can be expanded by installing the Expansion Pak accessory: a fancy-looking small box that includes 4.5 MB of RAM. Curiously enough, the RAM bus must be terminated, so the console always shipped with a terminator (called Jumper Pak) fitted in place of the Expansion Pak. Now, you may ask, what would happen if you switched on the console without any Pak installed? Literally nothing: you get a blank screen!
The core of the graphics sub-system resides on a huge chip designed by Silicon Graphics called the Reality Co-Processor, running at 62.5 MHz. This package contains a lot of circuitry, so don’t worry if you find it difficult to follow: the graphics sub-system has a very complex architecture!
Anyway, this chip is divided into three main modules, two of which are used for graphics processing:
Also known as RSP, it’s just another CPU package composed of:
In order to operate this module, the CPU stores in RAM a series of commands called a Display list, along with the data that will be manipulated. The RSP then reads the list and applies the required operations to it. The available features include:
This seems straightforward, but how does it perform these operations? Well, here’s the interesting part: Unlike its competitors (PS1 and Saturn), the geometry engine is not hard-wired. Instead, the RSP contains some memory (4 KB for instructions and 4 KB for data) to store microcode: a small program, with no more than 1000 instructions, that implements the graphics pipeline. In other words, it directs the Scalar Unit on how it should operate on our graphics data. The microcode is fed by the CPU at runtime.
Nintendo provided different microcodes to choose from and, similarly to the SNES’ background modes, each one balances the resources differently.
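The instruction limit quoted for the microcode can be sanity-checked with a one-line calculation. This is only a sketch: it assumes every instruction occupies a fixed 32 bits, and the constant and function names are mine.

```python
IMEM_BYTES = 4 * 1024      # instruction memory size stated in the text
INSTRUCTION_BYTES = 4      # assumption: fixed-width 32-bit instructions


def max_instructions(imem_bytes=IMEM_BYTES, instruction_bytes=INSTRUCTION_BYTES):
    """Upper bound on how many instructions fit in instruction memory."""
    return imem_bytes // instruction_bytes


print(max_instructions())
```

Under that assumption the ceiling is 1024 instructions, which matches the ‘no more than 1000’ figure.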
Once the RSP has finished processing our polygon data, it will start sending rasterisation commands to the next module, the RDP. These commands are either sent using a dedicated bus called XBUS or through main RAM.
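To make the flow a bit more tangible, here is a toy model of it: the CPU writes a Display list, a swappable ‘microcode’ table tells the RSP how to interpret each command, and the output is a queue of commands for the RDP. Every opcode, handler and data layout below is hypothetical and bears no relation to the real command formats.

```python
# Hypothetical "microcode": one handler per opcode. Each handler may
# update the RSP state and returns the commands to forward to the RDP.
def op_translate(state, args):
    dx, dy = args
    ox, oy = state["offset"]
    state["offset"] = (ox + dx, oy + dy)
    return []  # only changes state, nothing to rasterise yet


def op_triangle(state, args):
    ox, oy = state["offset"]
    moved = tuple((x + ox, y + oy) for x, y in args)
    return [("RASTERISE_TRI", moved)]


MICROCODE = {"TRANSLATE": op_translate, "TRIANGLE": op_triangle}


def run_rsp(display_list, microcode=MICROCODE):
    """Walk the Display list and collect the commands meant for the RDP."""
    state = {"offset": (0, 0)}
    rdp_commands = []
    for opcode, args in display_list:
        rdp_commands.extend(microcode[opcode](state, args))
    return rdp_commands


# The CPU's job: place the commands (and their data) in RAM.
display_list = [
    ("TRANSLATE", (10, 5)),
    ("TRIANGLE", ((0, 0), (4, 0), (0, 3))),
]
rdp_queue = run_rsp(display_list)
print(rdp_queue)
```

Passing a different table to `run_rsp` changes how the same Display list is processed, which is roughly the idea behind choosing between microcodes.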
The RDP is just another processor (this time with fixed functionality) that includes multiple engines used to apply textures over our polygons and project them on a 2D bitmap.
It can process either triangles or rectangles as primitives; the latter are useful for drawing sprites. The RDP’s rasterisation pipeline contains the following blocks:
The RDP provides four modes of operation; each mode combines these blocks differently in order to optimise specific operations.
Since this module is constantly updating the frame-buffer in RAM, it uses a special addressing mode. Remember the unusual 9-bit bus? The ninth bit is reserved for special calculations (like z-buffering) and can only be accessed through the memory interface.
The resulting frame-buffer will be captured by the video encoder and sent through the video signal. The theoretical maximum capabilities are 24-bit colour depth (16.8 million colours) and a 640x480 resolution (or 720x576 in the PAL region).
I describe these as ‘theoretical’ since using the maximum capabilities can be resource-hungry, so programmers tend to use lower settings in order to free up enough resources for other services.
Let’s put all the previous explanations into perspective. For that, I’ll borrow Nintendo’s Super Mario 64 to show, in a nutshell, how a frame is composed:
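A rough calculation shows why the maximum settings are so demanding. The sketch below assumes tightly packed pixels and a single frame-buffer; the 320x240 16-bit mode is just an example of a lower setting that I picked for comparison, and the helper name is mine.

```python
def framebuffer_bytes(width, height, bits_per_pixel):
    """Memory taken by one frame-buffer, assuming tightly packed pixels."""
    return width * height * bits_per_pixel // 8


VISIBLE_RAM_BYTES = 4 * 1024 * 1024           # the 4 MB every component can see

max_mode = framebuffer_bytes(640, 480, 24)    # maximum figures from the text
low_mode = framebuffer_bytes(320, 240, 16)    # hypothetical lower setting

for name, size in (("640x480, 24-bit", max_mode), ("320x240, 16-bit", low_mode)):
    share = 100 * size / VISIBLE_RAM_BYTES
    print(f"{name}: {size // 1024} KB ({share:.1f}% of RAM)")
```

With these assumptions, the maximum mode needs six times the memory of the lower one, and that is before counting a second buffer or a z-buffer.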
To start with, our 3D models are located in the cartridge ROM, but in order to keep a steady bandwidth, we need to copy them to RAM first.
Then it’s time to build a scene using our models. The CPU could do it by itself, but that may take ages, so the task is delegated to the RCP. The CPU will instead send orders to the RCP by carrying out these tasks:
Afterwards, the RSP will start performing the first batch of tasks and the result will be sent to the RDP in the form of rasterisation commands.
So far we have managed to process our data and apply some effects to it, but we still need to:
As you may guess, these tasks will be performed by the RDP. The CPU will provide the data (such as tiles) by placing it in RAM. This module has a fixed pipeline, but we can select an optimal mode of operation to improve the frame-rate.
Once the RDP has finished, it will write the final bitmap to the frame-buffer area in RAM. The CPU will then transfer the frame-buffer to the Video Interface (preferably using DMA), which in turn sends it to the Video Encoder for display.
Here are some examples of previous 2D characters for the Super Nintendo that have been redesigned for the new 3D era. They are interactive, so I encourage you to check them out!
SGI clearly invested a lot of technology into this system. However, this was a console meant for the household and, as such, it had to keep its cost down. Some hard decisions resulted in difficult challenges for programmers:
Due to the huge number of components and operations in the graphics pipeline, the RCP ended up being very susceptible to stalls: An undesirable situation where sub-components keep idling for considerable periods of time because the required data is delayed at the back of the pipeline.
This will always result in performance degradation, and it is up to the programmer to avoid it. To make things easier, though, some CPUs such as the Scalar Unit implement a feature called Bypassing, which allows similar instructions to be executed at a faster rate by skipping execution stages that aren’t needed. For example, if we have to compute sequential ‘add’ instructions, there’s no need to write the result back to a register and then read it back every time an ‘add’ finishes; we can instead feed each result straight into the next addition and do the write-back once the last ‘add’ is completed.
Inside the RDP there are 4 KB of texture memory available to be used as a texture cache; its main goal is to avoid stalling on read cycles from RAM. Unfortunately, in practice 4 KB turned out to be insufficient for high-resolution textures.
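The saving from Bypassing can be illustrated with a deliberately simplified cycle count. This is a toy model, not the real pipeline: it assumes an ‘add’ spends one cycle executing and that a result takes two further cycles to reach the register file (both numbers are made up).

```python
def cycles_for_dependent_adds(count, bypassing, writeback_delay=2):
    """Cycles to run `count` adds where each one needs the previous result.

    Toy model: every add spends 1 cycle in the execute stage. Without
    bypassing, the next add also waits `writeback_delay` cycles for the
    result to reach the register file. The final result is written back
    in both cases.
    """
    if count == 0:
        return 0
    wait_between = 0 if bypassing else writeback_delay
    return count + (count - 1) * wait_between + writeback_delay


print("with bypassing:   ", cycles_for_dependent_adds(4, bypassing=True))
print("without bypassing:", cycles_for_dependent_adds(4, bypassing=False))
```

In this model, four chained additions take half as many cycles once the intermediate write-backs no longer hold up the next instruction.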
As a result, some games used solid colours with Gouraud shading (like Super Mario 64) and others relied on pre-computed textures (for example, where multiple layers had to be mixed).
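To get a feeling for how tight 4 KB is, the following sketch checks which uncompressed textures would fit. It only compares raw sizes and ignores any other restriction the hardware may impose; the candidate sizes and colour depths are examples I picked, and the helper names are hypothetical.

```python
TMEM_BYTES = 4 * 1024  # texture memory size stated in the text


def texture_bytes(width, height, bits_per_texel):
    """Storage needed for one uncompressed texture."""
    return width * height * bits_per_texel // 8


def fits_in_tmem(width, height, bits_per_texel):
    return texture_bytes(width, height, bits_per_texel) <= TMEM_BYTES


for w, h, bpt in [(32, 32, 16), (64, 32, 16), (64, 64, 16), (64, 64, 32)]:
    verdict = "fits" if fits_in_tmem(w, h, bpt) else "too big"
    print(f"{w}x{h} @ {bpt} bits: {texture_bytes(w, h, bpt)} bytes -> {verdict}")
```

Even a modest 64x64 texture at 16 bits per texel is already twice the available space, which helps explain the workarounds mentioned above.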
Before we go into the details, let’s define the two endpoints of the audio sub-system:
Now, how do we connect both ends? Consoles normally include a dedicated audio chip that does the work for us. Unfortunately, the Nintendo 64 doesn’t have such a dedicated chip, so this task is distributed across these components:
The resulting data is, as expected, waveform data. This is then sent to the Audio Interface or ‘AI’ block which will then transfer it to the digital-to-analog converter. The resulting waveform contains two channels (since our system is stereo) with 16-bit resolution each.
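As a rough idea of how much waveform data has to be produced, here is a small calculation for two 16-bit channels. The sampling rate is not given above, so the 44.1 kHz used here is purely an assumed example, as is the function name.

```python
def pcm_bytes_per_second(sample_rate_hz, channels=2, bits_per_sample=16):
    """Raw waveform data generated per second of audio."""
    return sample_rate_hz * channels * bits_per_sample // 8


rate = pcm_bytes_per_second(44_100)  # assumed example rate, not from the text
print(f"{rate} bytes of waveform data per second")
```

Whatever rate a game settles on, all of that data has to be generated on time by components that are also busy with other work.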
Because of this design, the constraints will depend on the implementation:
Similar to the PS1 and Saturn, N64 games are written for bare metal. However, there are no BIOS routines available to simplify some operations. As a substitute, games embed a small OS that provides a fair amount of abstraction to efficiently handle the CPU, GPU and I/O.
This is not the conventional desktop OS that we may imagine at first; it’s just a micro-kernel with the smallest footprint possible that provides the following functionality:
The kernel is automatically embedded by using Nintendo’s libraries. Additionally, if programmers decide not to include one of the libraries, the respective portion of the kernel is left out to avoid wasting cartridge space.
As you know by now, I/O is not directly connected to the CPU, so the RCP’s third module (which I haven’t mentioned until now) serves as an I/O interface. It basically communicates with the CPU, controllers, game cartridge and Audio/Video DACs.
Nintendo held on to the cartridge as its storage medium and, as a consequence, games enjoyed higher bandwidths (between 5 and 50 MB/s depending on the ROM’s speed) while being more expensive to produce. The biggest cartridge found on the market has 64 MB.
Inside cartridges, manufacturers may include extra memory (in the form of EEPROM, flash or SRAM with a battery) to hold saves. However, this is no longer a strict requirement since certain accessories could be used to store saves as well.
The Nintendo 64 controller included a connector used to plug in accessories. Some of them are:
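To put the bandwidth range into perspective, this sketch estimates how long it would take to read an entire 64 MB cartridge at both ends of the range. It assumes a constant transfer rate with no overhead, which is an idealisation, and the function name is mine.

```python
def transfer_seconds(size_mb, bandwidth_mb_s):
    """Time to read `size_mb` at a constant `bandwidth_mb_s`."""
    return size_mb / bandwidth_mb_s


for bandwidth in (5, 50):
    seconds = transfer_seconds(64, bandwidth)
    print(f"64 MB at {bandwidth} MB/s: {seconds:.2f} s")
```

In practice games only copy the portions they need, so individual reads are far shorter than these worst-case figures.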
All accessories connected to the controller are managed by the Peripheral Interface.
Apart from that, this console included a special connector at the bottom of its motherboard, meant to be used by the yet-unreleased Disk Drive: a sort of ‘extra floor’ that contained a proprietary disk reader. The drive was only released in Japan, however, and was eventually cancelled for the rest of the world.
In general, development was mainly done in C, with assembly also used to achieve better performance. While this system had a 64-bit instruction set, 64-bit instructions were rarely used since, in practice, 32-bit instructions were faster to execute and required half the storage.
Libraries contained several layers of abstraction in order to command the RCP. For example, structs like the Graphics Binary Interface or ‘GBI’ were designed to assemble the necessary Display lists more easily. The same applied to audio functions (its struct was called the Audio Binary Interface or ‘ABI’).
In terms of microcode development, Nintendo already provided a set of microcode programs to choose from. However, if developers wanted to customise them, that would indeed be a challenging task: the Scalar Unit instruction set wasn’t initially documented (at the request of Nintendo, of course). Later on, the company changed its position and SGI finally released some documentation for microcode programming.
Hardware used for development included workstations supplied by SGI, like the Indy machine which came with an extra daughterboard called ‘U64’ that emulates the retail console. Tools were supplied for Windows computers as well.
Other third-party tools consisted of cartridges with wide cables that connected to the workstation. The cartridge was fitted into a normal Nintendo 64 and included internal circuitry to redirect read requests from the console to the workstation’s RAM. Debugging was carried out by transferring a copy of the game to RAM; then, when the console was switched on, it would start reading from there.
The anti-piracy system is a continuation of the SNES’ CIC. As you know, bootleg detection and region locking are possible thanks to the CIC chip (which must be present in every authorised game cartridge). The Nintendo 64 improved this system by requiring different games to carry a specific variant of the CIC chip, in order to make sure the cartridge was not a counterfeit and did not contain a CIC clone. The Peripheral Interface or ‘PIF’ performs checksum checks at start-up and during gameplay to supervise the CIC installed in the cartridge.
If for any reason the PIF considers the current cartridge invalid, it will put the console into a permanent freeze.
Region-locking was done by slightly altering the shape of the cartridge between regions so the user can’t physically insert a game into an N64 from a different region.
Overall, there was not too much concern regarding piracy thanks to the use of the cartridge medium, although game prices were three times higher than those of CD-based ones.
As silly as it may seem, Nintendo left one door open: The Disk Drive port.
A few companies reverse-engineered the interface in order to develop their own hardware, and some of the resulting products became a concern for piracy.
I guess the one worth mentioning is the Doctor v64. This device attaches through the Disk Drive port and includes a CD-ROM drive that’s used to clone the contents of a cartridge to a CD; the opposite (reading ROMs from a CD) is also possible.
When I was a kid I used to play some N64 games on a Pentium II machine using an emulator. It wasn’t that bad, but I now wonder how the heck it was able to happily emulate a complex 64-bit machine since, among other things, my PC barely had enough RAM to keep the integrated video alive.
The truth is, while reproducing the architecture of this console can be complex, things like microcode will give a hint of what the console is trying to do, and since emulators don’t have to be cycle-accurate, they can apply enough optimisations to provide more performance in exchange for accuracy.
Another example is the 64-bit instructions: since games barely used them, emulation speed would hardly take a hit when running on a 32-bit host machine.
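For the few cases where a game does use them, a 32-bit host has to split each 64-bit operation into several steps. The sketch below shows one common way of doing this for an addition, using only 32-bit-sized values and a manual carry; it is an illustration of the general idea rather than code from any particular emulator, and all names are mine.

```python
MASK32 = 0xFFFFFFFF


def add64_with_32bit_ops(a_hi, a_lo, b_hi, b_lo):
    """Add two 64-bit values given as (high, low) 32-bit halves."""
    lo = (a_lo + b_lo) & MASK32           # low half, wrapped to 32 bits
    carry = 1 if lo < a_lo else 0         # the low half wrapped around
    hi = (a_hi + b_hi + carry) & MASK32   # high half plus the carry
    return hi, lo


def split64(value):
    return (value >> 32) & MASK32, value & MASK32


def join64(hi, lo):
    return (hi << 32) | lo


a, b = 0x00000001_FFFFFFFF, 0x00000000_00000001
result = join64(*add64_with_32bit_ops(*split64(a), *split64(b)))
print(hex(result))
```

One guest instruction turns into three host operations here, so the fewer 64-bit instructions a game executes, the less of this overhead the emulator pays.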
I have to say, this article may be the longest one I’ve ever written, but hopefully you found it a nice read!
I’ll probably take the following days to tidy up some things on the website instead of starting to write the next article.
Until next time!
This article is part of the Architecture of Consoles series.