The ZAP Processor (ARM® V5TE Compatible)

ZAP : An Open Source High Performance ARM® Processor for FPGA (ARM® V5TE Compatible)

By Revanth Kamaraj <revanth91kamaraj@gmail.com>

1. Introduction

The ZAP is a high performance ARM® V5TE compliant processor. It is intended to be used in FPGA projects that need a high performance ARM® V5TE soft processor core. Most aspects of the processor can be configured through HDL parameters. The default processor specification is as follows (based on default parameters):

Property	Value
Slow Corner Operating Frequency @ FPGA Part Number	143MHz @ xc7a75tcsg324-3
Pipeline Depth	17
Issue and Execution Width	Single issue, in order core, with out-of-order completion for some loads/stores that miss in cache.
Data Width	32
Address Width	32
Virtual Address Width	32
Instruction Set	ARM® V5TE
L1 I-Cache	16KB Direct Mapped VIVT Cache. 64 Byte Cache Line
L1 D-Cache	16KB Direct Mapped VIVT Cache 64 Byte Cache Line
I-TLB Structure	Direct mapped. 512 entries divided into - 128 entry section TLB - 128 entry large page TLB - 128 entry small page TLB - 128 entry tiny page TLB Usually, section TLB size can be smaller, for example. Please override parameters as required.
D-TLB Structure	Direct mapped. 512 entries divided into - 128 entry section TLB - 128 entry large page TLB - 128 entry small page TLB - 128 entry tiny page TLB Usually, section TLB size can be smaller, for example. Please override parameters as required.
Branch Prediction	Bimodal Predictor + BTB. Direct Mapped. 1K entries in Thumb® state. 512 entries in ARM® state.
RAS Depth	4 deep return address stack.
Branch latency	12 cycles (wrong prediction or unrecognized branch) 3 cycles (taken, correctly predicted) 1 cycle (not-taken, correctly predicted) 12 cycles (32-bit/16-bit switch) 18 cycles (Exception/Interrupt Entry/Exit)
Store Buffer	FIFO, 16 x 32-bit.
Fetch Buffer	FIFO, 16 x 32-bit.
DMIPS/MHz Rating	0.9 DMIPS/MHz
Bus Interface	Unified 32-Bit Wishbone B3 bus with CTI and BTE signals.
FPGA Resource Utilization	23K LUTs 116 LUTRAMs 15.3K FFs 29 BRAMs 4 DSP Blocks

A simplified block diagram of the ZAP pipeline is shown below:

ZAP includes several microarchitectural enhancements to improve instruction throughput, hide external bus and memory latency and boost performance:

The ability to continue instruction execution even when the data cache is being filled. The data cache features hit under miss capability. The processor stalls when an instruction that depends on the cache access is decoded.
Direct mapped instruction and data caches. These caches are virtually indexed and virtually tagged. Individual caches allow code and data to be accessed at the same time. The sizes of these caches can be set during synthesis. Cache size is parameterizable. Cache line width may be set as well.
The D-cache also stores the physical address of the cache line on write as this allows subsequent cache clean operations to avoid having to walk the page table again. This feature does increase resource usage but can significantly reduce cache clean latency.
Direct mapped instruction and data memory TLBs. Having separate translation buffers allows data and code translation to happen in parallel. The sizes of these TLBs can be set during synthesis. Six different TLB memories are provides, each providing direct mapped buffering for sections, large page and small page, each for instruction and data (3 x 2 = 6). The sizes of these 6 memories is parameterizable.
A parameterizable store buffer that helps buffer stores when the cache is disabled or if the data access is uncacheable. When the cache is enabled and data is cacheable, the store buffer helps buffer cache clean operations. This is slightly different from a write buffer.
A 4-state bimodal branch predictor that predicts the outcome of immediate branches and branch-and-link instructions. ZAP employs a BTB (Branch Target Buffer) to predict branch outcomes early.
A 4 deep return address stack that stores the predicted return address of branch and link instructions function return. When a BX LR, MOV PC,LR or a block load with PC in register list, the processor pops off the return address. Note that switching between ARM® and Thumb® state has a penalty of 12 cycles.
The ability to execute most ARM instructions in a single clock cycle. The only instructions that take multiple cycles include branch-and-link, 64-bit loads and stores, block loads and stores, swap instructions and BLX/BLX2.
A highly efficient superpipeline with dual feedback networks to minimize pipeline stalls as much as possible while allowing for high clock frequencies. A deep 17 stage superpipelined architecture that allows the CPU to run at relatively high FPGA speeds.
Support for single cycle saturating additions and subtractions with the result being available for immediate use in the next instruction itself.
- NOTE: Multiplication/MAC operations takes 3 cycles per operation (+1 is result is immediately used). Note that the multiplier inside the ZAP processor is not pipelined.
The abort model is base restored. This allows for the implementation of a demand based paging system if supporting software is available.

1.1. Superpipelined Microarchitecture

ZAP uses a 17 stage execution pipeline to increase the speed of the flow of instructions to the processor. The 17 stage pipeline consists of Address Generator, TLB Check, Cache Access, Memory, Fetch, Instruction Buffer, Thumb Decoder, Pre-Decoder, Decoder, Issue, Shift, Execute, TLB Check, Cache Access, Memory and Writeback.

To maintain compatibility with the ARM® v5TE standard, reading the program counter (PC) will return PC + 8 when read.

During normal operation:

One instruction is writing out one or two results to the register bank.
- In case of LDR/STR with writeback, two results are being written to the register bank in a single cycle.
- All other instructions write out one result to the register bank per cycle.
- A pending load might write another value. The RF has 3 write ports.
The instruction before that is accessing the cache/TLB RAM.
The instruction before that is accessing the cache/TLB RAM.
The instruction before that is accessing the cache/TLB RAM.
The instruction before that is accessing the cache/TLB RAM.
The instruction before that is performing arithmetic, logic or memory address generation operations. This stage also confirms or rejects the branch predictor's decisions. The ALU performs saturation in the same cycle.
The instruction before that is performing a correction, shift or multiply/MAC operation.
The instruction before that is performing a register or pipeline read.
The instruction before that is being decoded.
The instruction before that is being sequenced to micro-ops (possibly).
- Most of the time, 1 ARM® instruction = 1 micro-op.
- The only ARM instructions requiring more than 1 micro-op generation are BLX, BL, LDM, STM, SWAP, LDRD, STRD, LDR loading to PC and long multiply variants (They generate a 32-bit result per micro-op).
- All other ARM®/Thumb® instructions are decode to just a single micro-op.
- Most micro-ops can be execute in a single cycle.
- This stage also causes branches predicted as taken to be actually executed. The latency for a successfully predicted taken branch is 3 cycles.
The instruction before that is being being decompressed. This is only required in the Thumb state, else the stage simply passes the instructions on.
The instruction before that is popped off the instruction buffer.
The instruction before that is pushed onto the instruction buffer. Branches
are predicted using a bimodal predictor (if applicable).
The instruction before that is accessing the cache/TLB RAM.
The instruction before that is accessing the cache/TLB RAM.
The instruction before that is accessing the cache/TLB RAM.
The instruction before that is accessing the cache/TLB RAM.

The deep pipeline, although uses more resources, allows the ZAP to run at high clock speeds.

1.1.1. Automatic Dual Forwarding Network

The ZAP pipeline has an efficient automatic dual forwarding network with interlock detection hardware. This is done automatically and no software intervention is required. This complex feedback logic guarantees that almost all micro-ops/instructions execute at a rate of 1 every cycle.

Most of the time, the pipeline is able to process 1 instruction per clock cycle. The only times a pipeline stalls (bubbles inserted) is when (assume 100% cache hit rate and 100% branch prediction rate):

An instruction uses a register that is a data (not pointer) destination for an LDR/LDM instruction within 7 cycles.
Two back to back instructions that require non-zero shift and the second instruction's operand overlaps with the first instruction's destination.
When a previous instruction modifies the flags and the current instruction requires a non-trivial shift operation. Note that LSL #0 is considered a trivial shift operation.
The pipeline is executing any multiply/MAC instruction:
- Introduces a 3 cycle bubble into the pipeline if a single register is written out. (+1 if two registers are written out)
- An instruction that uses a register that is/are destination(s) for multiply/MAC adds +1 to the multiply/MAC operation's latency.
B is executed as it adds a 3 cycle bubble due to branch prediction. Takes +1 cycle if link bit is set.
MOV/ADD instructions with pc/r15 as destination are executed. They will insert a 3 cycle bubble in the pipeline.
MSR when writing to CPSR. This will insert a 12 cycle bubble into the pipeline.
LDx with pc/r15 in the register list/data register is executed. This will insert a 9 cycle bubble in the pipeline.
MCR/MRC / SWI are executed. These will insert an 18 cycle bubble into the pipeline.

This snippet of ARM® code takes 6 cycles to execute:

    ADD  R1, R2, R2 LSL #10 (1 cycle)
    ADD  R1, R1, R1 LSL #20 (2 cycles)
    ADD  R3, R4, R5, LSR #3 (1 cycle)
    QADD R3, R3, R3         (1 cycle)
    MOV  R4, R3             (1 cycle)

This snippet of ARM code takes only 5 cycles (Possible because of dual feedback network):

    ADD  R1, R2, R2, R2     (1 cycle)
    ADD  R1, R1, R1 LSL #2  (1 cycle)
    ADD  R3, R4, R5 LSR #3  (1 cycle)
    QADD R3, R3, R3         (1 cycle)
    MOV  R4, R3             (1 cycle)

1.1.2. Single Cycle Load with Writeback

The ZAP can execute LDR/STR with writeback in a single cycle. It will perform a parallel write to the register file with the pointer register and the data register in the same cycle.

1.1.3. Hit Under Miss/Execute Under Miss

Data cache accesses that are performing line fills will not block subsequent instructions from executing. In addition, the data cache supports hit under miss functionality i.e., the cache can service the next memory access (hit) while handing the current line fill (miss). Thus, the ZAP can change the order of completion of memory accesses with respect to other instructions, when possible, in a relatively simple way.

If a store misses and is in the process of a line fill, a subsequent load at the same address will report as a hit during the line fill.

Some examples are shown below.

The unoptimized code below takes 15 cycles due to improperly optimized load latency (The code still works fine):

LDR R1, [R2]
// Load latency since value is used immediately = 7 cycles
ADD R3, R2, R1 
ADD R4, R4, R5
ADD R5, R8, R7
ADD R12, R4, R5
ADD R13, R8, R7
ADD R11, R4, R5
ADD R10, R8, R7
ADD R10, R10, R11

This can be reduced to 9 cycles by reordering the code:

LDR R1, [R2]
// Load latency is hidden.
ADD R4, R4, R5
ADD R5, R8, R7
ADD R12, R4, R5
ADD R13, R8, R7
ADD R11, R4, R5
ADD R10, R8, R7
ADD R10, R10, R11
ADD R3, R2, R1

1.1.4. Multi Port Register File

ZAP implements the register file in flip-flops. The register file provides 4 read ports and 3 write ports (A, B and C). The operation is as follows:

Operation	Port A	Port B	Port C
Load with Writeback	Used	Used	Free to be used by background load
Other	Used	Unused	Free to be used by background load

1.1.5. Branch Predictor and Return Address Stack

To improve performance, the ZAP processor uses a bimodal branch predictor. A branch memory is maintained which stores the state of each branch and the target address and branch tag. Note that the predictor can only predict the listed ARM instructions (and equivalent 16-bit instructions in Thumb state):

Bcc[L]

BX LR that does not switch ARM®/Thumb® state.

ADD/MOV instruction with destination as PC.

LDM with PC in register list

LDR that loads to PC.

MOV PC, LR

instructions. Some of these utilize the RAS for better prediction. Using an unlisted instruction to branch will result in 12 cycles of penalty.

The processor implements a 4 deep return address stack. The RAS and the predictor cannot be disabled. They are transparent to software and self clearing if they predict a non branch instruction as a branch.

Upon calls to

BL offset

the potential return address is pushed to a stack in the processor.

On encountering these ARM instructions (or equivalent 16-bit instructions):

BX LR,
MOV PC, LR
LDM with PC in register list.
LDR to PC with SP as base address register.

the CPU treats them as function returns and will pop return address of the stack much earlier.

This results in some performance improvement and reduced branch latency. Correctly predicted return takes 2 cycles (first two in the above list) or 9 cycles(last two in the above list), while incorrectly or unpredicted returns takes 12 cycles.

Returns that result in change from ARM® to Thumb® state or vice versa are unpredicted, and take 12 cycles. Performance optimization of returns is available only when no instruction set state change occurs i.e., for faster returns: ARM® code should return to ARM® code, Thumb® code should return to Thumb® code.

1.2. External Bus Interface

ZAP features a common 32-bit Wishbone B3 bus to access external resources (like DRAM/SRAM/IO etc). The processor can generate byte, halfword or word accesses. The processor uses CTI and BTE signals to allow the bus to function more efficiently. Registered feedback mode is supported for higher performance. Note that multiprocessing is not readily supported and hence, SWAP instructions do not actually perform locked transfers.

The 32-bit standard Wishbone bus makes it easy to interface to other components over a typical 32-bit FPGA SoC bus, without the need for up/down converters.

The bus interface is efficient for burst transfers and hence, cache and MMU must be enabled as soon as possible for good performance.

1.3. System Control

Please refer to ref [1] for CP15 CSR architectural requirements. The ZAP implements the following software accessible registers within its CP15 coprocessor.

NOTE: Cleaning and flushing cache and TLB is only supported for the entire memory.

Selective flushing of TLB will result in the entire TLB being flushed.

Selective cleaning of TLB will result in the entire TLB being cleaned.

The above rules are permitted as per the architecture spec [1].

The arch spec allows for a subset of the functions to be implemented for register 7.

These below are valid value supported in ZAP for register 7. Using other operations will result in UNDEFINED operation.

Cache Operation	Opcode2	CRM
Flush instruction and data cache	0b000	0b0111
Flush instruction cache	0b000	0b0101
Flush data cache	0b000	0b0110
Clean data cache	0b000	0b101x
Clean and flush data cache. Flush instruction cache.	0b000	0b1111
Clean and flush data cache	0b000	0b1110

1.4. CPU Ports and Parameters

1.4.1. Parameters

Note that all parameters should be 2^n. Cache size should be multiple of line size and at least 16 x line width. Caches/TLBs consume majority of the resources so should be tuned as required. The default parameters give you quite large caches.

Parameter	Default	Description
BP_ENTRIES	1024	Predictor RAM depth. Each RAM row also contains the branch target address.
FIFO_DEPTH	16	Command FIFO depth.
STORE_BUFFER_DEPTH	16	Depth of the store buffer. Keep multiple of cache line size in bytes / 4.
DATA_SECTION_TLB_ENTRIES	128	Section TLB entries (Data).
DATA_LPAGE_TLB_ENTRIES	128	Large page TLB entries (Data).
DATA_SPAGE_TLB_ENTRIES	128	Small page TLB entries (Data).
DATA_FPAGE_TLB_ENTRIES	128	Tiny page TLB entries (Data).
DATA_CACHE_SIZE	16384	Cache size in bytes. Should be at least 16 x line size.
CODE_SECTION_TLB_ENTRIES	128	Section TLB entries.
CODE_LPAGE_TLB_ENTRIES	128	Large page TLB entries.
CODE_SPAGE_TLB_ENTRIES	128	Small page TLB entries.
CODE_FPAGE_TLB_ENTRIES	128	Tiny page TLB entries.
CODE_CACHE_SIZE	16384	Cache size in bytes. Should be at least 16 x line size.
DATA_CACHE_LINE	64	Cache Line for Data (Byte). Keep > 8
CODE_CACHE_LINE	64	Cache Line for Code (Byte). Keep > 8
RAS_DEPTH	4	Depth of Return Address Stack

1.4.2. IO

Port	Description
i_clk	Clock. All logic is clocked on the rising edge of this signal.
i_reset	Active high global reset signal. Assert for a duration >= 1 clock cycle (not necessarily synchronously to clock edge). Signal is internally synced.
i_irq	Interrupt. Level Sensitive. Signal is internally synced.
i_fiq	Fast Interrupt. Level Sensitive. Signal is internally synced.
o_wb_cyc	Wishbone CYC signal.
o_wb_stb	Wishbone STB signal.
o_wb_adr	Wishbone address signal. (32)
o_wb_we	Wishbone write enable signal.
o_wb_dat	Wishbone data output signal. (32)
o_wb_sel	Wishbone byte select signal. (4)
o_wb_cti	Wishbone CTI (Classic, Incrementing Burst, EOB) (3)
o_wb_bte	Wishbone BTE (Linear) (2)
i_wb_ack	Wishbone acknowledge signal. Wishbone registered cycles recommended.
i_wb_dat	Wishbone data input signal. (32)

1.4.3. Integration

To use the ZAP processor in your project:
- Get the project files:
  
  git clone https://github.com/krevanth/ZAP.git
- Add all the files in src/rtl/*.sv to your project.
- Add src/rtl/ to your tool's search path to allow it to pick up SV headers.
- Instantiate the ZAP processor in your project using this template:

       zap_top #(.FIFO_DEPTH              (),
                 .BP_ENTRIES              (),
                 .STORE_BUFFER_DEPTH      (),
                 .DATA_SECTION_TLB_ENTRIES(),
                 .DATA_LPAGE_TLB_ENTRIES  (),
                 .DATA_SPAGE_TLB_ENTRIES  (),
                 .DATA_FPAGE_TLB_ENTRIES  (),
                 .DATA_CACHE_SIZE         (),
                 .CODE_SECTION_TLB_ENTRIES(),
                 .CODE_LPAGE_TLB_ENTRIES  (),
                 .CODE_SPAGE_TLB_ENTRIES  (),
                 .CODE_FPAGE_TLB_ENTRIES  (),
                 .CODE_CACHE_SIZE         ()) u_zap_top (
                 .i_clk                   (),
                 .i_reset                 (),
                 .i_irq                   (),
                 .i_fiq                   (),
                 .o_wb_cyc                (),
                 .o_wb_stb                (),
                 .o_wb_adr                (),
                 .o_wb_we                 (),
                 .o_wb_cti                (),
                 .i_wb_dat                (),
                 .o_wb_dat                (),
                 .i_wb_ack                (),
                 .o_wb_sel                (),
                 .o_wb_bte                ()
       );

The processor provides a Wishbone B3 bus. It is recommended that you use it in registered feedback cycle mode.
Interrupts are level sensitive and are internally synced to clock.

1.5. ARM Implementation Options

ZAP implements the integer instruction set specified in ARM® V5TE. T refers to the Thumb instruction set and E refers to the enhanced DSP extensions. ZAP does not implement the optional floating point extension specified in Part C of [1].

1.5.1. Big and Little Endian

ZAP only supports little endian byte ordering.

1.5.2. 26-Bit Architecture

ZAP does not support the legacy 26-bit mode.

1.5.3. Thumb

ZAP has support for the thumb (ARM® V5T) instruction set.

1.5.4. ARM DSP Enhanced Instruction Set

The ZAP implements the ARM DSP-enhanced instruction set (ARM® V5E). There are new multiply instructions that operate on 16-bit data values and new saturation instructions. Some of the new instructions are:

SMLAxy 32<=16x16+32
SMLAWy 32<=32x16+32
SMLALxy 64<=16x16+64
SMULxy 32<=16x16
SMULWy 32<=32x16
QADD adds two registers and saturates the result if an overflow occurred.
QDADD doubles and saturates one of the input registers then add and saturate.
QSUB subtracts two registers and saturates the result if an overflow occurred.
QDSUB doubles and saturates one of the input registers then subtract and saturate.

NOTE: All of the multiplication and MAC operations in ZAP take a fixed 3 clock cycles (irrespective of 32x32=32, 32x32=64, 16x16+32 etc). Additional latency of 1 cycle is incurred if the result is required in the immediate next instruction. A further additional latency of 1 cycle is incurred if two registers must be updated (long multiply/MAC operations).

The ZAP also implements LDRD, STRD and PLD instructions with the following implementation notes:

PLD is interpreted as a NOP.
MCRR and MRRC are not intended to be used on coprocessor 15 (see [1]). Since ZAP does not have an external coprocessor bus, these instructions should not be used.

1.5.5. Base Register Update

If a data abort is signaled on a memory instruction that specifies writeback, the contents of the base register will not be updated. This holds for all load and store instructions. This behavior is referred to in the ARM® V5TE architecture as the Base Restored Abort Model.

1.5.6. Cache and TLB Lockdown

ZAP does not support lockdown of cache and TLB entries.

1.5.7. TLB Flush

ZAP implements global TLB flushing. If software tries to invalidate selective TLB entries, the entire TLB will be invalidated. This behavior is acceptable as per the arch specification.

1.5.8. Cache Clean and Flush

ZAP implements global cache cleaning and flushing. Cleaning and/or flushing specific cache lines/VA is not supported.

1.5.9. Cache and TLB Structure

ZAP implements a direct mapped cache and TLB. Separate caches and TLBs exist for instruction and data paths. Each MMU (I and D) has 4 TLBs, one each for sections, large pages, small pages and tiny pages. Each one is direct mapped.

Thus, each cache uses 2 block RAMs (Tag and Data) and each MMU uses 4 RAMs. Functionally, the processor's memory subsystem requires 12 block RAMs. In practice, FPGA synthesis implements this using groups of smaller block RAMs (same overall function) so the BRAM count would be higher.

1.5.10. FCSE

ZAP implements the FCSE.

1.5.11. Cache/MMU Enabling

ZAP allows the cache and MMU to have these combinations:

MMU	Cache	Behavior
ON	OFF	All pages are treated as uncacheable. C bit is IGNORED.
ON	ON	Caching and paging enabled. Recommended configuration.
OFF	OFF	VA = PA. All addresses are treated as uncacheable.
OFF	ON	All pages are treated as uncacheable. C bit is IGNORED.

1.5.12. Coprocessor Interface

ZAP internally implements an internal CP15 coprocessor using its internal bus mechanism. The coprocessor interface is internal to the ZAP processor and is not exposed. Thus, ZAP only has access to its internal coprocessor 15. External coprocessors cannot be attached to the processor.

2. Project Environment (Docker®)

The project environment requires Docker® to be installed at your site. Click here for instructions on how to install Docker®. The steps here assume that the user is a part of the docker group.

2.1. Running TCs

To run all/a specific TC, do:

make [TC=test_name]

See src/ts for a list of test names. Not providing a testname will run all tests.

To remove existing object/simulation/synthesis files, do:

make clean

2.2. Adding TCs

Create a folder src/ts/<test_name>
Please note that these will be run on the sample TB SOC platform.
- See src/testbench/testbench.v for more information.
Tests will produce wave files in the obj/src/ts/<test_name>/zap.vcd.
Add a C file (.c), an assembly file (.s) and a linker script (.ld).
Create a Config.cfg. This is a Perl hash that must be edited to meet requirements. Note that the registers in the REG_CHECK are indexed registers. To find those, please do:

cat src/rtl/zap_localparams.svh | grep PHY

For example, if a check requires a certain value of R13 in IRQ mode, the hash will mention the register number as r25.
Here is a sample Config.cfg:

       %Config = ( 
               # CPU configuration.
               DATA_CACHE_SIZE             => 4096,    
               CODE_CACHE_SIZE             => 4096,    
               CODE_SECTION_TLB_ENTRIES    => 8,       
               CODE_SPAGE_TLB_ENTRIES      => 32,      
               CODE_LPAGE_TLB_ENTRIES      => 16,      
               DATA_SECTION_TLB_ENTRIES    => 8,       
               DATA_SPAGE_TLB_ENTRIES      => 32,      
               DATA_LPAGE_TLB_ENTRIES      => 16,      
               BP_DEPTH                    => 1024,    
               INSTR_FIFO_DEPTH            => 4,       
               STORE_BUFFER_DEPTH          => 8,      
      
               # Debug helpers. 
               DEBUG_EN                    => 1,       
                                              # Enables debug print messages. 
                                              # Set DEBUG_EN=0 for synthesis.        
      
               # Testbench configuration.
               MAX_CLOCK_CYCLES            => 100000,  
                                              # Clock cycles to run the 
                                              # simulation for.
      
               REG_CHECK                   => {"r1" => "32'h4", 
                                               "r2" => "32'd3"},      
                                              # Make this an anonymous has with 
                                              # entries like "r10" => "32'h0". 
      
               FINAL_CHECK                 => {"32'h100" => "32'd4", 
                                               "32'h104" => "32'h12345678",
                                               "32'h66" =>  "32'h4"}   
                                              # Make this an anonymous hash
                                              # with entries like 
                                              # Base address of 32-bit word => 
                                              # 32-bit verilog_value. The 
                                              # script compares 4 bytes at 
                                              # once.
        );

2.3. Running RTL Lint

To run RTL lint, simply do:

make lint

2.4. Running Xilinx® Vivado® Synthesis

Synthesis scripts can be found here: src/syn/

Assuming you have Vivado® installed, please do (in project root directory):

make syn

Timing report will be available in obj/syn/syn_timing.rpt

2.4.1. XDC Setup (Vivado® FPGA Synthesis)

The XDC assumes a 200MHz clock for an Artix 7 FPGA part with -3 speed grade.
Input assume they receive data from a flop with Tcq = 50% of clock period.
Outputs assume they are driving a flop with Tsu = 2ns Th=1ns.
Setting FPGA synthesis clock to an unreasonably high FPGA design frequency may result in better timing closure (but will result in a larger FPGA design).

3. References

[1] ARM® Architecture Specification (ARM® DDI 0100E)

4. Mentions

The ZAP project was mentioned in a survey conducted here.

5. Acknowledgements

Thanks to Erez Binyamin for adding Docker infrastructure support.

Thanks to Bharath Mulagondla and Akhil Raj Baranwal for pointing out bugs in the design.

The testbench UART core in src/testbench/uart.v is taken from the UART-16550 project.

The testbench assembly code in src/ts/arm_test*/arm_test*.s is based on this external assembly file.

6. Trademarks

ARM is a registered trademark of ARM Ltd.

Xilinx, Artix and Vivado are trademarks of Xilinx Inc.

7. License

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.