Emulating the Original Gameboy’s CPU — Part 2 of a Series

Raphael Stäbler
11 min readDec 16, 2019

--

The original DMG CPU clocked at 4.19 MHz

In the first part of this series I’ve been introducing the whole project of building my own version of this iconic handheld gaming console while also discussing some of the requirements to the hardware. Now I’m going to dig in and start working on the heart of the console: the CPU.

How does a CPU work?

In principal, a CPU does the same three steps over and over again: fetch, decode, execute.

To build an emulation of a CPU you basically just need to implement these three steps. First the emulated CPU fetches an operation, then it decodes the operation and finally it executes the operation. After that it’s back to step one.

But wait, what does it even mean to fetch an operation? How do you know where to fetch the operation from? And what would such an operation even look like in the first place?

Instruction Set

What an operation looks like is defined by the CPU’s instruction set. Since I’m dealing with an 8 bit CPU here, a typical instruction for the Gameboy consists of eight bits:

01110110

The above instruction is also known as HALT and brings all operations on the CPU to a stop until an interrupt occurs (I’ll be talking about what interrupts are later on in this series).

There are a lot more possible instructions and some of them even take arguments. In order for that to make sense and also to understand how a CPU knows where to fetch operations, it needs some internal memory called registers.

Registers

The Gameboy’s CPU is known as the DMG CPU. It is equipped with 6 registers of 16 bit each and they’re called AF, BC, DE, HL, SP and PC. That sounds fairly cryptical, so let’s look at each of them individually.

AF
The AF register is actually more like two 8 bit registers A and F.
A stands for the accumulator and it’s where arithmetic operations usually take place.
F stands for flag register and it’s to be interpreted as 8 distinct bits each with a different meaning. I’ll be looking into this in more detail later.

BC
The BC register is also to be seen as two 8 bit registers B and C. Besides that there’s nothing special about this register.

DE
The DE register is exactly the same as the BC register, just 2x8 bits more to store some data.

HL
The HL register can be used as two 8 bit registers H and L, just like BC and DE. On top of that, it functions as a 16 bit register and can be used to point to addresses in memory.

SP
SP stands for stack pointer. It points to a special area in memory called the stack. More on this later.

PC
The PC register is what I’m interested in right now. It stands for program counter and it’s actually a pointer to the next instruction in memory. And that’s how the CPU knows where to fetch the next operation to be executed.

Fetch

Now I know what an operation looks like and I know where to find one, or do I? Where do I find the very first instruction? Execution simply starts at location 0, that’s where the Gameboy’s boot ROM is located at. My Gameboy won’t incorporate the original boot ROM and start directly with the game. The first instruction for cartridges is at the memory address 0x0100 (the hexadecimal notation for 0000000100000000). That’s the initial value of my program counter (PC register) and that’s where the CPU is going to look for the first 8 bits of the first operation to be executed.

I’m talking about fetching instructions from memory but I haven’t even talked about memory itself. That’s okay, though. For now let’s just assume we have some addressable memory and that there’s an actual executable program stored in it. I’ll be talking about memory in more detail in the next post.

The following section is going to include short snippets of C code. I want you to keep in mind that you don’t need to understand the code in order to understand the concepts discussed in this post. You may just skip the code blocks if you’re not used to reading it.

Decode

I fetched my first 8 bits from the memory address 0x0100, next I need to decode the operation. While I’ve seen pretty sophisticated ways the decode CPU instructions, it’s actually very often done using just a single switch/case statement where each operation is its own case.

So let’s say I’ve got some function to fetch 8 bits from memory and store them in a variable called op:

uint16_t PC = 0x0100;
uint8_t op = fetchOp(PC);

I then construct a switch/case to handle each possible operation:

switch (op) {
case 0:
// NOP aka no operation
break;
case 0x76:
// HALT
break;

}

Some operations take arguments, for example the operation 0xC6 takes another 8 bits and adds them to the value stored in register A. So, in case of a 0xC6 instruction the CPU needs to also fetch the next 8 bits. I can reuse the same fetchOp function for that since its only job is to fetch 8 bits from memory.


case 0xC6:
uint8_t value = fetchOp(PC + 1);
// add value to register A

As you can see, I will have to advance the program counter (PC) by one in order to read the next 8 bits. This is usually done inside the fetchOp function, so that every read increments the counter automatically:

uint8_t fetchOp() {
return memory[PC++];
}

With fetchOp implemented like that it’s no longer necessary to advance the counter outside of the function.

Execute

Now that instructions can be fetched and decoded, all that’s left to do is to execute them. All a CPU needs for that is access to some memory as well as its internal registers. Let’s have a closer look at the instruction set for the DMG CPU.

8 Bit Load Commands

These are operations where 8 bits of data are going to be copied from one place to another. Some examples:

  • Load the contents of one register into another register (e.g. A to B)
  • Load an arbitrary value of 8 bits into a given register
  • Load the contents of a location in memory into a register
  • Load the contents of a register into a location in memory

16 Bit Load Commands

Although the Gameboy’s CPU is an 8 bit CPU it has some limited ability to work with junks of 16 bit data. All memory addresses are 16 bit long and so is the program counter (PC) and the stack pointer (SP) since they point to locations in memory. In fact, as already mentioned all registers are described as 16 bit registers AF, BC, DE, HL, SP and PC and can be fed directly with 16 bit values. In total there are four 16 bit load commands available:

  • Load an arbitrary value of 16 bits into a register (e.g. BC)
  • Load the contents of the HL register into the stack pointer (SP)
  • Push the contents of a register onto the stack
  • Pop the contents of the stack into a register

The last two commands may require further explanation. I already introduced the stack pointer register (SP) by mentioning it’s pointing to a special location in memory called the stack. A stack is basically exactly what it sounds like: A place to put some data aside for later use. It’s often used together with branching or subroutines. Those are the times when a program needs to save its registers on the stack before doing something else and popping them off the stack when it’s done.

8 Bit Arithmetical and Logical Operations

Besides copying and moving around data, a CPU usually does a lot of arithmetical and logical calculations. It’s no surprise that the DMG CPU has quite a few instructions dedicated to those kinds of tasks:

  • Adding the value of a given register to the value stored in register A (the accumulator)
  • Adding an arbitrary value of 8 bits to the value stored in register A
  • Add the contents of a location in memory to the value stored in register A
  • The same operations as above but for subtraction instead of addition
  • Increment or decrement a value stored in a register
  • Increment or decrement the contents of a location in memory
  • Logical operations AND, OR, and XOR (exclusive or) of a register with register A
  • Logical operations AND, OR, and XOR of an arbitrary value with register A
  • Logical operations AND, OR, and XOR of the contents of a location in memory with register A
  • Addition and subtraction with respect to certain flags stored in register F

The last point is kind of important and prone to errors, at least in my experience. The flag register (F) holds four flags:

  • The Zero flag
  • The Subtraction flag
  • The Half Carry flag
  • The Carry flag

When dealing with arithmetical and logical operations, these flags are often being set, reset and even incorporated into the calculation itself. The Zero flag for example is being set to 1 when the result of an operation has been 0 and reset otherwise. The Subtraction flag is set to 1 when the preceding operation was a subtraction and reset otherwise. The Carry and Half Carry flags are used to indicate that a certain type of carrying has occurred during the operation. Remember when you did addition and subtraction by hand in school? You had to carry numbers from one column to the next. That’s exactly what carry means here, too.

16 Bit Arithmetical Operations

There’s a limited amount of 16 bit arithmetical operations available on the DMG CPU. These not that different from their 8 bit counterparts except they operate on the 16 bit registers BC, DE, HL and SP — and they disregard all the flags completely which makes them much easier to implement.

Rotate and Shift Operations

Another useful operation with binary data is bit shifting and rotation. Let’s take the 8 bit expression00000001 (1 in decimal) and shift it to the left by one. The result is 00000010 which is 2 in decimal. Shift it another time to the left and you get 00000100 (4 in decimal). Each shift left is equivalent to a times 2 operation. It’s apparent how this can be useful.

Rotation is basically the same as shift with the only difference being that the the data is wrapped around at the edges of the 8 bit values. For example, shifting 11000000 left would result in 10000000 while rotating 11000000 left results in 10000001.

The DMG CPU supports the following rotate and shift commands:

  • Rotate the contents of a given register left or right
  • Rotate the contents of a given register left or right through the carry flag (in the example above the leftmost bit would be rotated into the carry flag and whatever value was stored in the carry flag would be rotated into the rightmost bit: 11000000 010000000 1)
  • Rotate the contents of a location in memory left or right
  • Rotate the contents of a location in memory left or right through the carry flag
  • Shift the contents of a given register left or right into the carry flag
  • Shift the contents of a given register right into the carry flag without changing the leftmost bit
  • Shift the contents of a given memory location left or right into the carry flag
  • Shift the contents of a given memory location right into the carry flag without changing the leftmost bit
  • Swap the leftmost 4 bits and the rightmost 4 bits of the contents of a given register (1111000000001111)
  • Swap the leftmost 4 bits and the rightmost 4 bits of the contents of a location in memory

Single Bit Operations

The Gameboy’s CPU also supports a few operations on single bits instead of whole bytes:

  • Set a single bit of a register to 1
  • Reset a single bit of a register (set it to 0)
  • Set a single bit of a byte value at a location in memory
  • Reset a single bit of a byte value at a location in memory
  • Test a single bit of a register (the Zero flag will be set to 1 if the bit is 0)
  • Test a single bit of a value at a location in memory

CPU Control Operations

There are a couple of commands to control the CPU itself:

  • HALT. Halt all operations on the CPU until an interrupt occurs.
  • NOP. Don’t do anything. This is a useful command to control timing — a topic that will be discussed further down.
  • Complement the carry flag: 1 becomes 0 and 0 becomes 1
  • Reset the carry flag
  • Enable or disable interrupts. Interrupts will be discussed in a future post.

Jump Operations

I already mentioned branching and subroutines in connection to the stack. Branching and subroutines are implemented using jump, call and return operations. The DMG CPU supports the following operations:

  • Jump to an arbitrary location in memory. This sets the program counter (PC) to that location.
  • Jump to the location stored within the register HL
  • Jump to an arbitrary location if a certain condition is met (i.e. Zero flag set, Zero flag reset, Carry flag set, Carry flag reset)
  • Jump to a location relative to the current program counter (PC)
  • Jump to a location relative to the current program counter (PC) if a certain flag-related condition is met
  • CALL. Push the address of the next instruction onto the stack and execute a jump to an arbitrary location
  • Push the address of the next instruction onto the stack and execute a jump to an arbitrary location if a certain condition is met
  • RETURN. Pop the location of the next instruction from the stack and jump there
  • Pop the location of the next instruction from the stack and jump there if a certain condition is met
  • Pop the location of the next instruction from the stack and jump there while also enabling all interrupts

That’s it for the DMG CPU’s instruction set. With these operations implemented, every Gameboy game can be executed — in theory. There’s a few things missing like an actual memory implementation and some necessary interrupts.

Timing

I briefly mentioned timing when discussing the NOP operation. Getting the timing right is an important aspect when building an emulator. I want to make sure my emulated CPU runs as closely as possible to the correct clock speed.

The DMG CPU is clocked at 4.19 MHz which is a bit misleading since every operation takes at least 4 clock cycles to complete. This means that, at best, you can only get one operation executed every 4 ticks. This is also why some consider the Gameboy’s CPU to only run at an actual speed of 1.05 MHz.

Regardless of whether you want to consider the CPU to be clocked at 1 MHz or 4 MHz, it’s important to count clock cycles for each operation. Some operations take considerably longer to execute than others and you want to have this reflected by an emulator as precisely as possible. This will become apparent when implementing graphics and sound output. For example: a horizontal line to the display is generated once every 116 ticks considering the CPU to be clocked at 1 MHz, or once every 464 ticks considering the clock speed to be 4 MHz. I will be discussing this in more detail when talking about the graphics implementation.

Final Thoughts

This post has been fairly technical but in order to complete this project, it’s important for me to understand the actual core of the Gameboy.

If you’d like to see my current implementation of the DMG CPU you can do so here. Keep in mind that I wouldn’t consider this a particularly elegant implementation, though. I’d probably do a few things differently next time. Most and foremost I’d strongly consider making use of C’s union type for the register implementation:

union {
struct {
uint8_t A;
uint8_t F;
};
uint16_t AF;
}

Anyway, the implementation works and that’s the most important thing for now.

Thanks for reading, I hope you enjoyed it. If you have any questions or would like to know more about the project, feel free to leave a comment or contact me on Instagram or Twitter!

--

--

Raphael Stäbler
Raphael Stäbler

Responses (2)