Five Stage RISC-V RV32i Microprocessor

Introduction

The goal of the project is to design a five-stage RISC-V RV32i microprocessor implemented on a MAX10 FPGA. The design included a full microprocessor solution implemented on the FPGA having 128 KiB of on-chip memory using M9K blocks with streaming of programs over the JTAG interface and clocked by the onboard 50MHz clock.

Hardware

The project was implemented on a Terasic DE10-Lite FPGA board based on the Altera MAX 10 [1] that includes 50,000 Logic Elements, on-board ADC, USB Blaster, Accelerometer, GPIOs, and Arduino expanders. However, the scope of this project only required the usage of the GPIOs, the Logic Elements, and the onboard clocks. Additionally, the board also includes a 64MB SDRAM, x16 bits data bus.

The implementation of the project used the following peripherals available on board

10 LEDs
10 Slide Switches
2 Push buttons
7-Segment displays

Figure 1: Block Diagram of the DE10-Lite Board

Figure 2: The DE10-Lite board with values in the LEDs and the 7 segment display

Software

The project was implemented using the principles of FPGA programming using the System Verilog Hardware Description Language and its paradigms. The Visual Studio Code IDE was primarily used in the development and writing of code for the project. Compilation of the design, including the Analysis and Synthesis, Fitting, Assembly, Timing Analysis, and Programming was done using the Quartus Prime - Lite IDE software.

Figure 3: Quartus Prime - Lite IDE specifications

Implementation

Overview

The Five stages of the 5-stage RISC-V pipeline include:

Instruction Fetch (IF): To fetch instruction from the memory and forward to the next stage
Instruction Decode (ID): To decode the instruction and identify the operands
Execution (EX): To perform the actual operation or computation of the instruction
Memory Access (MEM): To access data from the memory, if required
Write Back (WB): To write the result back to the register file system

In summary, the IF stage fetches the instruction, the ID stage decodes it, the EX stage executes it, the MEM stage accesses the memory (if needed), and the WB stage writes the result back to the register file. These stages are repeated for each instruction in the program. In addition to the five standard stages, each of which are programmed using a separate module, there are three additional modules in the project, they are

DualPortRam module: The memory module
IO: The module to handle the IO operations
Register File System: The module to handle register file system

The pipeline stages work in parallel, allowing multiple instructions to be executed simultaneously. As one instruction is being executed in one stage, the next instruction can be fetched in the previous stage, resulting in a significant improvement in instruction throughput. However, the pipeline introduces the risk of hazards such as data hazards, control hazards, and structural hazards, which need to be addressed to ensure correct operation of the processor. The implementation can simply be summarized as in Figure 4.

Figure 4: Implementation block diagram

As seen in Figure 4, each interconnection between the blocks are representative of multiple data lines connected of widths ranging from 1 bit wide to 32 bit wide.

Supported Instructions

Following are the supported instructions by the current implementation

Instruction	Type	Ex Operation
ADD	R	rs1 + rs2
SUB	R	rs1 – rs2
SLL	R	rs1 << rs2[4:0]
SLT	R	1 if rs1 < rs2 (signed) 0 else
SLTU	R	1 if rs1 < rs2 (unsigned) 0 else
XOR	R	rs1 XOR rs2
SRL	R	rs1 >> rs2[4:0] (logical)
SRA	R	rs1 >> rs2[4:0] (arithmetic)
OR	R	rs1 v rs2
AND	R	rs1 ^ rs2
JALR	I	pc+4
LB	I	rs1+signex(i[11:0])
LH	I	rs1+signex(i[11:0])
LW	I	rs1+signex(i[11:0])
LBU	I	rs1+signex(i[11:0])
LHU	I	rs1+signex(i[11:0])
ADDI	I	rs1+signex(i[11:0])
SLTI	I	1 if rs1 < signex(i[11:0]) (signed) 0 else
SLTIU	I	1 if rs1 < signex(i[11:0]) (unsigned) 0 else (2 disjoint ranges)
XORI	I	rs1 XOR signex(i[11:0])
ORI	I	rs1 v signex(i[11:0])
ANDI	I	rs1 ^ signex(i[11:0])
SLLI	I	rs1 << shamt
SRLI	I	rs1 >> shamt (logical)
SRAI	I	rs1 >> shamt (arith)
FENCE	I	-
ECALL	I	-
EBREAK	I	-
SB	S	rs1+signex(i[11:0])
SH	S	rs1+signex(i[11:0])
SW	S	rs1+signex(i[11:0])
BEQ	B	-
BNE	B	-
BLT	B	-
BGE	B	-
BLTU	B	-
BGEU	B	-
LUI	U	i[31:12]:12’b0
AUIPC	U	i[31:12]:12’b0+pc
JAL	J	pc+4

Table 1: List of supported instructions

Instruction Fetch

In this stage, the instruction is fetched from memory, the Dual Port RAM module and loaded into the iw_out instruction register. The program counter is incremented to point to the next instruction and is stored in the pc_out register and passed to the next stage. The signals originating from and arriving at the IF stage are summarised in Table 2.

Signal	Type	Synchronicity Status	Width (bits)	Source/Destination (module)
clk	Input	-	-	Board
reset	Input	Synchronous	1	Board
memIfData	Input	Asynchronous	32	Dual Port RAM / Memory Interface
jump_en_in	Input	Synchronous	1	Instruction Decode
jump_addr	Input	Synchronous	32	Instruction Decode
memIfAddr	Output	Asynchronous	32	Dual Port RAM / Memory Interface
iw_out	Output	Synchronous	32	Instruction Decode
pc_out	Output	Synchronous	32	Instruction Decode
jump_en_out	Output	Synchronous	1	Instruction Decode

Table 2: Signals in the Instruction Fetch Stage

In a synchronized manner, depending on the state of the reset signal, the PC is either incremented by 4 or reset to 0. Additionally, another check is performed to ensure if the instruction fetched in the previous cycle and decoded by the IF stage requires a jump. The signal indicating this is checked and the appropriate jump address is loaded into the iw_out register. Communication between the Dual Port RAM module is asynchronous so as to not delay the instruction fetch cycle and meet the timing constraints of the design.

Figure 5: Implementation of the IF stage

Instruction Decode

In this stage, the instruction word received from the Instruction Fetch stage is decoded to determine the operation to be performed and the operands to be used. Depending on the type of the instruction, the register file is read to fetch the values of the operands. Figure 6 summarizes the signals in the ID stage.

Figure 6: Signals in the ID stage

Data Hazards

Additionally, the stage also implements Data Forwarding to handle Data Hazards by comparing the write-back register and the write-back signals forwarded from the EX, MEM and the WB stages of the previous three instructions with the rs1 and rs2 of the current instructions. If there is a match and if the write-back signal in the forwarded signals is high, then the rs1 or rs2 is updated to be the forwarded value instead of the one fetched from the Register File system. The control signals from each of the three subsequent stages are forwarded asynchronously and also checked asynchronously in the ID stage.

Branch Hazards

Instruction Decode stage also handles Branch Hazards by decoding the opcode first and determining if a jump is necessary. It must be noted that the rs1 and rs2 values used in determining the status of various jump instructions could possibly be the forwarded values from the three subsequent stages or from the ones fetched from the memory.

Upon detecting that a jump is required, a jump_enable flag is set and a jump address is calculated from the values in the opcode and both these signals are asynchronously passed back to the Instruction Fetch stage which takes appropriate action and performs the jump. The signals originating from and coming into the Instruction Decode stage are summarized in Table 3.

Signal	Type	Synchronicity Status	Width (bits)	Source/Destination (module)
clk	Input	-	-	Board
reset	Input	Synchronous	1	Board
iw_in	Input	Synchronous	32	Instruction Fetch
rs1_data_in	Input	Synchronous	32	Register Interface
rs2_data_in	Input	Synchronous	32	Register Interface
jump_en_in	Input	Synchronous	1	Instruction Fetch
df_ex_enable	Input	Asynchronous	1	Execution
df_ex_reg	Input	Asynchronous	5	Execution
df_ex_data	Input	Asynchronous	32	Execution
df_mem_enable	Input	Asynchronous	1	Memory Access
df_mem_reg	Input	Asynchronous	5	Memory Access
df_mem_data	Input	Asynchronous	32	Memory Access
df_wb_enable	Input	Asynchronous	1	Write-back
df_wb_reg	Input	Asynchronous	5	Write-back
df_wb_data	Input	Asynchronous	32	Write-back
wb_en_out	Output	Synchronous	1	Execution
rs1_reg	Output	Synchronous	5	Execution
rs2_reg	Output	Synchronous	5	Execution
wb_reg	Output	Synchronous	5	Execution
pc_out	Output	Synchronous	32	Execution
iw_out	Output	Synchronous	32	Execution
rs1_data_out	Output	Synchronous	32	Execution
rs2_data_out	Output	Synchronous	32	Execution
w_en_out	Output	Synchronous	1	Execution
jump_en_out	Output	Asynchronous	1	Instruction Fetch
jump_addr	Output	Asynchronous	32	Instruction Fetch

Table 3: Signals in the Instruction Decode Stage

Execution

In this stage, the actual operation is performed, using the operands obtained in the previous stage. The result of the operation is then stored in a temporary register, alu_temp. The value of the temporary register is latched on and passed to the next stage and forwarded back to the ID stage (data forwarding). The task of actually executing the received instruction is done asynchronously to ensure timing constraints are met and no delays are introduced.

Figure 7: Signals in the EX stage

An appropriate action is performed depending on the opcode, function-7 and the function-3 values received in the iw_in instruction word and the correct operands (rs1 and rs2) from the ID stage. Depending on the type of operation, the execution stage acts either on the operands rs1, rs2, manipulates the values of the Program Counter or calculates a register address as necessary.

If the reset button is pressed on the board, the reset signal will be synchronously passed through the stages and at the EX stage, all output signals will be cleared and set to default values (0 in most cases). If the reset signal is not active at the clock, the output signals will appropriately be updated with the values received from previous stages and the results of the EX stage. Three signals are asynchronously forwarded to the ID module for the purpose of Data Forwarding to handle Data Hazards. The signals originating from and coming into the Execution stage are summarized in Table 4.

Signal	Type	Synchronicity Status	Width (bits)	Source/Destination (module)
clk	Input	-	-	Board
reset	Input	Synchronous	1	Board
wb_en_in	Input	Synchronous	1	Instruction Fetch
wb_reg_in	Input	Synchronous	5	Instruction Decode
pc_in	Input	Synchronous	32	Instruction Decode
iw_in	Input	Synchronous	32	Instruction Decode
rs1_data_in	Input	Synchronous	32	Instruction Decode
rs2_data_in	Input	Synchronous	32	Instruction Decode
w_en_in	Input	Synchronous	1	Instruction Decode
alu_out	Output	Synchronous	32	Memory Access
iw_out	Output	Synchronous	32	Memory Access
pc_out	Output	Synchronous	32	Memory Access
wb_reg_out	Output	Synchronous	5	Memory Access
wk_en_out	Output	Synchronous	1	Memory Access
w_en_out	Output	Synchronous	1	Memory Access
rs2_data_out	Output	Synchronous	32	Memory Access
df_ex_enable	Output	Asynchronous	1	Instruction Decode
df_ex_reg	Output	Asynchronous	5	Instruction Decode
df_ex_data	Output	Asynchronous	32	Instruction Decode

Table 4: Signals in the Execution Stage

Memory

In this stage, the result obtained from the execution stage is written to memory or data cache if the instruction is a store instruction. If the instruction is a load instruction, the data is read from memory or data cache and written to a temporary register.

This is done asynchronously depending on the w_en_in signal and the 31st bit of the alu_output. alu_in[31] is set only if the instruction word was that of a Load instruction. This will cause the processor to read/write from/to the IO space, (the IO module in the design). If the two signal conditions are not met, the module will read/write from/to the memory space (the Dual Port RAM module).

Figure 8: Signals in the MEM stage

This module additionally generates another signal called the source_select. This signal will be synchronously passed on to the Write-back stage and will indicate the source of the data that must be written into the Register File System. The source can be one of the following three:

IO space
Memory
ALU output Signals are passed onto the WB stage synchronously to ensure the timing constraints are accurately met. Three signals are asynchronously forwarded to the ID module for the purpose of Data Forwarding to handle Data Hazards.

Signal	Type	Synchronicity Status	Width (bits)	Source/Destination (module)
clk	Input	-	-	Board
reset	Input	Synchronous	1	Board
wb_en_in	Input	Synchronous	1	Execution
pc_in	Input	Synchronous	5	Execution
iw_in	Input	Synchronous	32	Execution
alu_in	Input	Synchronous	32	Execution
wb_reg_in	Input	Synchronous	5	Execution
wb_en_out	Output	Synchronous	1	Write-back
pc_out	Output	Synchronous	32	Write-back
iw_out	Output	Synchronous	32	Write-back
alu_out	Output	Synchronous	32	Write-back
wb_reg_out	Output	Synchronous	5	Write-back
memif_rdata	Input	Asynchronous	32	Dual Port RAM
io_rdata	Input	Asynchronous	32	Dual Port RAM
rs2_data_in	Input	Synchronous	32	Execution
w_en_in	Input	Synchronous	1	Execution
memif_we	Output	Asynchronous	1	Dual Port RAM
memif_be	Output	Asynchronous	4	Dual Port RAM
memif_addr	Output	Asynchronous	30	Dual Port RAM
memif_wdata	Output	Asynchronous	32	Dual Port RAM
memif_rdata_out	Output	Asynchronous	32	Write-back
io_we	Output	Asynchronous	1	IO
io_addr	Output	Asynchronous	30	IO
io_wdata	Output	Asynchronous	32	IO
src_sel_out	Output	Synchronous	2	Write-back
df_mem_enable	Output	Asynchronous	1	Instruction Decode
df_mem_reg	Output	Asynchronous	5	Instruction Decode
df_mem_data	Output	Asynchronous	32	Instruction Decode

Table 5: Signals in the Memory Access Stage

Write-back

In this stage, the final result of the instruction is written back to the register file. This is the final stage of the pipeline before the next instruction is fetched. In the Write-back stage, the data to be written to the Register File System is determined by the src_sel_in signal received from the MEM stage. If a reset is received, the Write-back enable signal is cleared ensuring no data may be written to the Register File System. Table 6 summarizes the signals originating from the WB stage and coming into it.

Signal	Type	Synchronicity Status	Width (bits)	Source/Destination (module)
clk	Input	-	-	Board
reset	Input	Synchronous	1	Board
wb_en_in	Input	Synchronous	1	Write-back
wb_reg_in	Input	Synchronous	5	Write-back
pc_in	Input	Synchronous	32	Write-back
iw_in	Input	Synchronous	32	Write-back
alu_in	Input	Synchronous	32	Write-back
memif_rdata	Input	Asynchronous	32	Dual Port RAM
io_rdata	Input	Asynchronous	32	IO
src_sel_in	Input	Synchronous	2	Write-back
wb_en_out	Output	Synchronous	1	Register File System
wb_data	Output	Synchronous	32	Register File System
wb_reg_out	Output	Synchronous	5	Register File System
df_wb_enable	Output	Asynchronous	1	Instruction Decode
df_wb_reg	Output	Asynchronous	5	Instruction Decode
df_wb_data	Output	Asynchronous	32	Instruction Decode

Table 6: Signals in the Write-back Stage

Additional Modules

In addition to the five modules for the five stages of the RISC-V pipelined architecture, there are three additional modules implemented to support their operations. These are:

Register File System: Register storage system
IO: Handle IO operations
Dual Port RAM: Memory system

Register File System

The design supports 32 general purpose registers including one always-zero register as outlined in the RISC-V architecture [4]. Values written may be written to a register only if the writeback_enable signal from the WB stage is a valid 1 and in synchronicity with the data and the register address. Additionally, the write-back register address points to the zero register, no data will be written to it. Writes to the register are synchronous while reads from it are asynchronous. Table 7 summarizes the register hierarchy.

Register Name	Description
zero	Hardwired zero
ra	Return address
sp	Stack pointer
gp	Global pointer
tp	Thread pointer
t0	Temporary/alternate link register
t1–t2	Temporaries
s0	Saved register/frame pointer
s1	Saved register
a0–a1	Function arguments/return values
a2–a7	Function arguments
s2–s11	Saved registers
t3–t6	Temporaries

Table 7: Register Hierarchy

IO Module

The IO module maps addresses in the memory to physical IO ports that are present on the Board. The IO ports in particular that are available on the DE10-Lite MAX 10 FPGA board include 10 toggle switches, 10 monochrome LEDs and 2 push buttons. Of these, the operations and working of the LEDs and the push buttons have been extensively tested in this implementation.

In particular, the address 0x8000 0000h has been mapped to push button 1 (input) and address 0x8000 0004h has been mapped to the 10 LEDs as output. The implementation assumes that Inputs and Outputs are independent and unique and may not behave interchangeably, and therefore limits the operations at the addresses to only read or write respectively. Figure 9 shows the implementation block diagram of the IO module.

Figure 9: IO module implementation

Dual Port RAM module

This module is used to convert the endianness of data during reads and writes to a dual-port RAM in a RISC-V processor. It takes in a clock input and two address inputs, one for instruction port (RO) and the other for data port (RW). It also takes in a data input, which includes a write-enable signal (d_we), a byte-enable signal (d_be), and a 32-bit data input (d_wdata). The module outputs two 32-bit data values, one for instruction read data (i_rdata) and the other for data read data (d_rdata). The multi-dimensional packed array "ram" is initialized using a bitstream from the "ram.hex" file, which is used to initialize the M9K RAM modules. The always block with a posedge clock is used for instruction fetch, and the always block for data read/write is triggered by a posedge clock as well. The module uses bitwise operations to convert the endianness of data during writes to RAM and during reads from RAM.

Figure 10: Implementation block diagram of the Dual Port RAM module

Known Issues and Limitations

The implementation has certain difficulties in matching timing when using the 50MHz clock and specifically the IO module. If the requirement is to run the processor at 50MHz, the present implementation will need to be modified to not support IO operations. To overcome this, the 10MHz clock is used instead.
The Dual Port RAM module requires two clock cycles at the start of a program execution before the right Program Counter (PC) and Instruction Word (IW) are fetched and passed to the Instruction Fetch (IF) stage. By observation, this is due to the time taken to sync the io_we signal between the Memory Access stage and the Dual Port RAM stage. As a work around, the first two instructions MUST be NOP (No-operation) stalls.
Data hazards between Memory Read/Write operations are not handled in the implementation at present. To ensure this does not lead to any unexpected issues during execution two solutions may be proposed -
- Three NOPs must be inserted following each Store and Load instruction in the Assembly code itself (this will be implemented as a part of the architecture of the processor in the near term).
- Alternatively, care can be taken to ensure that a location manipulated by a Store or Load instruction is not accessed by other instructions for at least three clock cycles. This can be achieved by inserting instruction for other operations between two successive Memory/IO instructions.

References

[1] https://www.terasic.com.tw/cgi-bin/page/archive.pl?Language=English&CategoryNo=218&No=1021&PartNo=2#heading [2] https://www.terasic.com.tw/attachment/archive/1021/image/DE10-Lite_Blackdiagram.jpg [3] Figure 5, Figure 6, Figure 7, Figure 8, Figure 9, Figure 10 are generated images from the Quartus Prime - Lite NetList RTL Viewer tool [4] https://riscv.org/technical/specifications/

prithvi-narayan-bhat / Custom_RISC_Implementation