g0mb4 / tas

Toy two pass assembler in C.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Toy Two Pass Assembler

The project is based on Yonatan Zilpa's excersie. A brief explanation can be found here. The majority of the following is just a direct copy of that site. Some differences:

  • Hexadecimal (base 16) numeric system is used instead of octal.
  • .entry MAIN needed to be defined explicitly.
  • No relative addressing.
  • The generated object code file uses a different format (it includes the entries and the externals).

Documentation can be found here.

A working virtual machine is created for this project, in orded to run the assembled programs. The machine can be found here: tvm.

"Hardware"

Our computer architecture consists from Central Processing Unit (CPU), registers and Random Access Memory (RAM), where part of the memory is being used as a stack. The size of each word in memory is 16 bits. Arithmetics is to be carried by the '2's complement' method. Our computer machine can only handle integers (Positives or negatives), it doesn't handle real numbers.

Registers

Our computer machine includes the following list of registers:

  • Eight general registers (r0, r1, r2, r3, r4, r5, r6, r7)
  • One Program Counter register (pc).
  • One Stack Pointer register (sp).
  • One Status register (psw - Program Status Word) which has two flags: carry flag and zero flag.

All registers are 16 bits in size. The two first bits of the PSW register are C and Z in correspondence Characters are coded in ASCII.

Memory

The size of memory is 2000 words (each word is 16 bits in size).

Stack

The stack is in the end of the main memory, starts at memory address 1999 (07cf hex)(in words) and it can grow downwards. The size of the stack is 16 words.

Initialization

On startup the all regsiters have a value of zero, including the flags. The contents of the memory is also zero.

Instructions

In our computer machine, instruction is a word (16 bits in size) that carries information about the operator and operands. Although instruction is a string of 16 bits, it can be divided into fields. The following table provides further information about the instruction. The bits are in decimal number system.

Fields Operation Source Operand Destination Operand
Addressing Mode Register Addressing Mode Register
Bits 15-12 11-9 8-6 5-3 2-0

The following table maps operator's name to its corresponding instruction code (opcode).

Operator Opcode
mov 0
cmp 1
add 2
sub 3
mul 4
div 5
lea 6
inc 7
dec 8
jnz 9
jnc a
shl b
prn c
jsr d
rts e
hlt f

All operators are written in lower case letters, details on the meaning of these operators will be specified later.

  • Bits 9-11: This field refers to the addressing mode of the source operand. Depending on the value of this field (numeric values of bits 9-11) , the instruction may refer to additional word (first additional word)

  • Bits 6-8: This field refers to the register of the source operand. The field (bits 6-8) maps its numeric value n to register rn.

    Notice: If the addressing mode in the source operand does not require the source register, then the source register field are not in use. In such a case the numeric value of the field (bits 6-8) is equal to zero.

  • Bits 3-5: This field refers to the addressing mode of the destination operand. Depending on the numeric value of this field (bits 3-5) , the instruction may refer to additional word (second additional word)

  • Bits 0-2: This field refers to the register of the destination operand. The field (bits 0-2) maps its numeric value n to register rn.

    Notice: If the addressing mode in the destination operand does not require the destination register, then the source register field are not in use. In such a case the numeric value of the field (bits 6-8) is equal to zero.

There are six types of addressing modes in our assembly language, some of these modes require additional information, i.e. additional word. The following table provides information on all types of addressing mode.

First Word Additional Word Operand Way of Writing   Example  
Field Value Name Register
0 Instant addressing zero (not in use) yes The numeric value of the operand is determined by the numeric value of the additional word. The operand is a number preceded by the '#' sign. mov #-1,r2
1 Direct addressing zero (not in use) yes The additional word contains memory address. The numeric value of the operand is the value of this address. The operand is a label, either declared or expected to be declared later in the file. mov x,r2
2 Indirect addressing zero (not in use) yes The numeric value of the additional word contains memory address. The value of this address is also a memory address. The value of the second address is the numeric value of the operand. Indirect addressing is indicated by the '@' sign which appeared just before the label. The label is declared in the same way as in the direct addressing mode. mov @x,r2
3 Direct register addressing n (positive integer) no Register rn contains the value of the operand. The operand is a legal register name. mov r1,r2
4 Indirect register addressing n (positive integer)) no Register rn contains information on memory address. This memory address contains the operand. The operand is a legal register name indicated by the '@' sign. mov @r1,r2

Machine Instruction Characterization

Machine instruction may be classified into three different classes (according to the number of operands appeared in each instruction).

First Class of Operators

The first class contains all machine instructions that get two operands. Any machine instruction that belongs to this class may contain one of the following operators:

        mov, cmp, add, sub, mul, div, lea, shl

The following table provides further explanation on the operational aspects of these operators:

Numeric Code Operator Description   Example   Example Description
0 mov Copies the value of the source operand (the first operand) to the destination operand (the second operand). mov A, r1 Copy the value of A to register r1.
1 cmp Compare between two operands. The cmp operator subtracts the destination operand from the source operand, without saving the subtraction result, it then updates the zero flag, flag z, in the status register, PSW. cmp A, r1 If the values of A and r1 are equal, then the zero flag A, in the status register PSW, is turned on. Else the zero flag is turned off.
2 add The destination operand is assigned with the value of the source operand plus the value of the destination operand. add A, r0 Register r0 gets the sum of r0 and A.
3 sub The destination operand is assigned with the value of the destination operand minus the value of the source operand. sub #3, r1 Register r1 is assigned with the value of r1 minus 3.
4 mul Destination operand assigned with the value of the source operand times the value of destination operand mul A, r2 Register r2 assigned with A times r2.
5 div Destination operand is assigned with the value of destination operand divided by the source operand. div A, r2 Register r2 assigned with r2/A.
6 lea Acronym for 'load effective address'. This operation loads memory address, marked with the label appeared in the first operand to the destination operand. lea ABC, r1 The memory address of label ABC is assigned to register r1.
b shl Shift bits to the left in the source operand. The number of shifts is determined by the value of the destination operand. shl r1, #1 Register r1 is shifted 1 bit to the left.

Second Class of Operators

The second class contains all machine instructions that gets one operand. In such cases there is no source operand, thus bits 6-11 are meaningless (their values is zero). Any machine instruction in this class may contain one of the following instruction:

        inc, dec, jnz, jnc, prn, jsr

The following table provides further explanation on the operational aspects of these operators:

Numeric Code Operator Description   Example   Example Description
7 inc The operand is increased by one. inc r2 Register r2 is assigned with r2 plus 1.
8 dec The operand is decreased by one. dec r2 Register r2 is assigned with r2 minus 1.
9 jnz Acronym: jump if not zero. The Program Counter register PC is assigned with the source operand if the Z flag, in the Program Status Word register PSW is not zero. jnz LINE If the Z flag (in the PSW register) is not zero, then PC register is assigned with LINE.
a jnc Acronym: jump if not carry. The Program Counter register PC is assigned with zero if the C flag, in the Program Status Word register PSW is not 0. jnc LINE If the C flag (in the PSW register) is not zero, then PC register is assigned with LINE.
c prn Prints the ASCII equivalent of the operand to the standard output file (stdout). prn r1 The ASCII equivalent character of the value stored in r1 is printed to standard file.
d jsr Calls a subroutine that pushes register PC to the running time stack and assign the operand to the Program Counter register PC. jsr FUNC stack[SP] = PC
SP = SP-1
PC = FUNC

Third Class of Operators

The third class contains all machine instructions that gets no operands. In such cases bits 0-11 are meaningless (their values is zero). Any machine instruction in this class may contain one of the following instruction:

        rts, hlt

The following table provides further explanation on the operational aspects of these operators:

Numeric Code Operator Description   Example   Example Description
e rts Pops a value from the running time stack and move this value to the Program Counter register. rts SP = SP+1
PC = stack[SP]
f hlt Halts the program. hlt Halting the program.

Legal addressing modes

The following table contains information on legal addressing mode for the source and destination operands.

Operator Legal Addressing Modes for the Source Operand Legal Addressing Modes for the Destination Operand
mov 0,1,2,3,4 1,2,3,4
cmp 0,1,2,3,4 0,1,2,3,4
add 0,1,2,3,4 1,2,3,4
sub 0,1,2,3,4 1,2,3,4
mul 0,1,2,3,4 1,2,3,4
div 0,1,2,3,4 1,2,3,4
lea 1 1,2,3,4
inc No source operand 1,2,3,4
dec No source operand 1,2,3,4
jnz No source operand 1,2,4
jnc No source operand 1,2,4
shl 1,2,3,4 0,1,2,3,4
prn No source operand 0,1,2,3,4
jsr No source operand 1,2,4
rts No source operand No source operand
hlt No source operand No source operand

Flags

The following table contains information on the flags modified by the instructions.

Operator Zero Flag Modified Carry Flag Modified
mov No No
cmp Yes No
add Yes Yes
sub Yes Yes
mul Yes Yes
div No No
lea No No
inc Yes No
dec Yes No
jnz No No
jnc No No
shl Yes Yes
prn No No
jsr No No
rts No No
hlt No No

Statements

Our assembly language is consisted of statements separated by the new line character '\n'. When we look into a file it appeared to be made out of lines of statements, each statement appeared in its own line. Our assembly language has four types of statements. These statements described in the following table.

Type of statement General Explanation
Empty Statement Line with this kind of statement may contains only white spaces: tab character '\t' or space character ' '
Comment Statement The first character in a line with this statement is the semicolon ';' character. This line should be completely ignored by the assembler.
Declarative Statement This statement is a directive to the assembler program. It does not generate machine instruction.
Operation Statement This statement generates machine instruction that needs to be executed by the CPU. The statement represent machine instruction in symbolic form.

Directive Statement

Directive statement is of the following form: Directive statement may optionally start with a label, the label has to follow certain syntax rules (to be described later). Directive can start with or without a label, in any case a directive name, preceded by a dot '.' character, must be included. NO whitespace allowed between the '.' character and the directive name. If the directive does include a label, then at least one whitespace character is separating between the label and the '.' character. Following the directive name, whitespace-separated, appearing, in the same line, the directive parameters (the number of parameters is determined by the type of the directive). As mentioned, directive statement may include four types of directive:

  1. .data

    The parameter(s) of data is a list of legal numbers separated by a comma ',' character. For example:

.data    +7,-57 ,17   ,    9
Notice that any number of whitespace characters may appear between the number(s) and the comma character(s). However, the comma character must separate between two numeric values.
The '.data' directive statement directs the assembler to allocate space in its data image where the appropriate numeric parameters is to be stored. It also direct the assembler to advance the data counter by the number of parameters (of the '.data' directive). If the '.data' directive has a label name, then this label name is assigned with the value in the data image (before it was advanced) and get inserted to the symbols table. This way we can refer to certain place in the data image using the label name. For instance, if we write
XYZ:    .data   +7,-57,17,9
    mov 	XYZ, r1
then register r1 is assigned with the value +7. If we continue to write
lea    XYZ, r1
then r1 would have been assigned with the address (in the data image) that stores the +7 value.
  1. .string

    The '.string' directive statement gets only one legal string as parameter. The meaning of '.string' directive statement is similar to the '.data' directive statement. The ASCII characters composed the string are coded to their appropriate numeric ASCII values) and get inserted to the data image by their order. At the end a zero value is being inserted, to mark the end of the string. The value of the data counter is to be increase, according to the length of the string + one. If the line includes a label name, then the value of the label name is going to point to the location in memory that stores the ASCII code of the first character of the string, at the same way as it was done for the '.data' string. For instance the directive statement

ABC:    .string    "abcdef"
is going to allocate an array of characters of length 7 starting from the address stored in the ABC label name. This "array" is initialized to the ASCII value of characters 'a', 'b', 'c', 'd', 'e', 'f' in correspondence, the array is to be ended with the zero value concatenate to the end of the array.
  1. .entry

    The '.entry' directive statement gets one parameter only. This parameter is a label name, declared by other directive statement in the very same file where the The purpose of the '.entry' directive statement is to deal handle cases where a label name defined in an assembly source file A needs to be referred by other assembly source file(s) B, C, D, etc. In this case the '.entry' directive statement, written in the file A, gets the label name as its parameter (the '.entry' directive statement has to have a single parameter). For instance, if an assembly source file A contains the following lines

.entry	HELLO
HELLO:  add		#1, r1
then other assembly source file(s), may refer to HELLO label name. Notice that a label at the beginning of the '.entry' directive is meaningless.
  1. .extern

    The '.extern' directive statement gets one parameter this parameter is the name of a label name defined in other assembly source file. The purpose of this directive statement is to declare that the label has been defined in other source file and that this assembly source file (the one that contains the '.extern' directive statement) is using it. The correspondence between the value of the label, as appeared in the source file where it was defined, and the operation instruction(s) that are using it as an argument is to be done at linking time.

.extern HELLO
Notice that a label at the beginning of the '.extern' directive is meaningless.

Operation Statement

Operation statement is composed from the following:

  1. Optional label.

  2. Operation name.

  3. Operands (the number of operands may be 0, 1 or 2 depending on the operation).

The length of a statement (of any type) cannot exceed 80 characters. The name of the operation is to be written in lower case letter, operation name can be one of the 16 operations mentioned above. After the operation name, separated with whitespace character(s), one or two operands may appear. In the case of two operands, the operands are separated with a comma ',' character. As mentioned before, whitespace character(s) may separate the comma and the operands. Operation statement with two operands has the following form:

Label Operation Operands
Source Destination
HELLO: add r7, B
JUMP: jnc XYZ
END: hlt

Formal Definitions

Label

Every label must begin with an upper or lower case letter, the rest of the label may contain letters or numbers. The length of the label cannot exceed 30 characters. The label ends with a column ':' character. The column character is not part of the label name it is just a sign representing the end of the character. The label must begin with the first column of the line. Label name cannot have more than one definition. The following labels are written correctly.

        hEllo:
        x:
        He78940:

Label name cannot be the same as register or operation name. The label derived its value from the syntax. Label written at the beginning of '.data' or '.string' directive gets the value of the appropriate data counter. Label written at the beginning of an operation statement gets the value of the appropriate operation counter.

Number

Number is a string of decimal digits (0-9) that may optionally be preceded by either '-' or '+' sign. The number gets its value from its decimal representation represented by the string of digits. For instance the numbers

        76, -5, +123

can be accepted as numbers. As mentioned, we do not handle rational or real numbers, only integers.

String

String is a sequence of visible ASCII characters surrounded by double quotation marks. The quotation marks are not part of the string. The string

        "Hello World"

is an example for legal string.

Two Pass Assembler

When the assembler is starting to translate code it needs to carry two major assignments. Its first assignment is to identify and translate the operation code and its second assignment is to determine addresses for all data and variables appeared in the source file(s). For instance, when the assembler reads the following code:

.entry MAIN
MAIN:   mov LENGTH, r1
	    lea STR, r2
LOOP:   prn @r2
        inc r2
        sub #1, r1
        jnz LOOP
END:    hlt
STR:    .string "abcdef"
LENGTH: .data 6

it has to replace the operation names mov, lea, jnz, prn, sub, inc, jnc, hlt with their equivalent binary codes, in addition, the assembler has to replace the symbols STR, LEN, MAIN, LOOP, END with their appropriate addresses that have been allocated for the directive statements. Assuming that the code in example I has being translated by the assembler and has been stored (operations and directives) in a memory block that starts from address 0000, then this translation can be described as follow:

Label Address Command Operand(s) Machine Code
.entry MAIN
MAIN: 0000 mov LEN, r1 0219
0001 0012
0002 lea STR, r2 621a
0003 000b
LOOP: 0004 prn @r2 c022
0005 inc r2 701a
0006 sub #1, r1 3019
0007 0001
0008 jnz LOOP 9008
0009 0004
END: 000a hlt f000
STR: 000b .string "abcdef" 0061
000c 0062
000d 0063
000e 0064
000f 0065
0010 0066
0011 0000
LEN: 0012 .data 6 0006

If the assembler maintains a table of all the operation names and their corresponding binary codes, then all operation names can be easily converted. Whenever the assembler reads an operation name it can simply use the table to find its equivalent binary code. In order to carry the same conversion for the addresses of symbols the assembler has to build similar table. For instance, in example I, prior to reading the source file(s) the assembler has no way to know that the LOOP symbol relates to address 0004. Thus, in regards to all symbols that have been defined by the programmer, the assembler has to accomplish two separate tasks. The first task is to build a table of all symbols and their related numeric values, and the second is to replace all the symbols, appeared in the source file(s) with the numeric values of the address fields. This two assignments can be achieved by performing two separate scans (passes) on the source file(s). In the first pass the assembler builds a table of symbols, this table correspond address to each symbol. In the second pass the assembler translate the source file(s) into binary machine code. Notice that the two passes are done by the assembler, during translation (in the assembly time), before the linking process. After the translation process, the program may be linked and load to memory for execution.

First pass

In the first pass, each instruction is being substituted with its appropriate code and the table of symbols is being built. The rest of the code are left untouched. The code should be loaded at address zero. After applying the first pass on example I, we should get the following result

The table of symbols:

Name Value Image
MAIN 0000 instruction
LOOP 0004 instruction
END 000a instruction
STR 0000 data
LEN 0007 data

List of entries:

Name Value
MAIN ????

Data image:

Address Value
0000 0061
0001 0062
0002 0063
0003 0064
0004 0065
0005 0066
0006 0000
0007 0006

Instruction image:

Address Value
0000 0219
0001 ????
0002 621a
0003 ????
0004 c022
0005 701a
0006 3019
0007 ????
0008 9008
0009 ????
000a f000

Second pass

Applying the second pass on the code of example I yields the following final results:

Name Value Image
MAIN 0000 object code
LOOP 0004 object code
END 000a object code
STR 000b object code
LEN 0012 object code

List of entries:

Name Value
MAIN 0000

Object code:

Address Machine Word
0000 0219
0001 0012
0002 621a
0003 000b
0004 c022
0005 701a
0006 3019
0007 0001
0008 9008
0009 0004
000a f000
000b 0061
000c 0062
000d 0063
000e 0064
000f 0065
0010 0066
0011 0000
0012 0006

When the assembler program is done an object code is generated this object code is to be sent to a linker program. The purpose of the linker program is described as follows:

  1. To allocate the program with place in memory (allocation).
  2. To link the object file into one executable file (linking)
  3. To change addresses according to the loading place (relocation)
  4. To physically load the code into memory.

After the linker program is done the program can be loaded to memory and is ready to run. We are not going to make further discussion on how the linker program works.

The format of output files

The object file written by the assembler provides informations about machine's memory. The first instruction is to be inserted to memory address 0, the second instruction is to be inserted to be inserted to memory address 2,3 or 4 (depending on the length of the first instruction) and so fourth until the translation of the last instruction. The next memory address, after the last translated instruction, contains the data that were built by the '.data' and '.string' instructions, their order of appearance in memory depends on their precedence of appearance in the source file (first instruction occupies first free memory in a rising order).

The object code file (.oc)

The object file is composed out of lines of text and contains 3 sections: code, entries, externals.

code

The code section starts with '.cbegin' and ends with '.cend'. The first line contains (in hex) the length of the code and the length of data, both are in terms of memory words. Those two numbers must be separated by white space. Each of the next lines provides information on the content of memory address (in hex form) starting from memory address 0. In addition, for each memory address, occupied by instruction (not data), there appear additional information for the linker. This additional information could be one of the following three characters: 'e' 'a' or 'r'. The character 'a' designates the fact that the content of the memory address is absolute and does not depend on where the file is to be loaded (the assembler assumes it to start from memory address 0). The character 'r' designates the fact that memory address is relocatable and should be added with the appropriate offset, in regards to where the file is to be loaded. The offset is the first memory address from which the first instruction of the program is to be loaded. The letter 'd' designates the fact that the content of the file depends on external variable, the linker program is to take care on the insertion of the appropriate value.

entries

The entries section starts with '.lbegin' and ends with '.lend'. The entries section is composed out of lines of text. Each line contains the entry name and value, as it was computed for this file.

externals

The entries section starts with '.ebegin' and ends with '.eend'. The externals section is composed out of lines of text. Each line contains the name and memory address of the external variable.

Binary file (.bin)

The binary file contains the object code in binary (non-text) format. It can't be created, if the source code contains .extern directives.

Example files

test

Prints the string "abcdef".

test.as

; test.as
; Prints the string "abcdef".

        .entry MAIN      ; file contains the definition of MAIN
MAIN:   mov LEN, r1	     ; move LEN(=6) to r1
        lea STR, r2	     ; load the address of STR to r2
LOOP:   prn @r2          ; print the character at the memory location that r2 holds
        inc r2           ; r2 = r2 + 1
        sub #1, r1       ; r1 = r1 - 1
        jnz LOOP         ; jump to LOOP if the zero flag is not set (sub sets it)
END:    hlt              ; end of the program
STR:    .string "abcdef" ; string to print
LEN:    .data 6          ; length of the string

test.oc

.cbegin
b 8
0000 0219 a
0001 0012 r
0002 621a a
0003 000b r
0004 c022 a
0005 701a a
0006 3019 a
0007 0001 a
0008 9008 a
0009 0004 r
000a f000 a
000b 0061  
000c 0062  
000d 0063  
000e 0064  
000f 0065  
0010 0066  
0011 0000  
0012 0006  
.cend
.lbegin
MAIN 0000
.lend
.ebegin
.eend

Usage of tas

tas <options> source-file

where the options are:

-l : prints debugging lists after each pass
-n : creates NO output files
-b : creates binary output file
-h : shows this text

Compilation of tas

Windows

cd tas
mkdir build
cd build
cmake ..
tas.sln

Linux

cd tas
mkdir build
cd build
cmake ..
make

About

Toy two pass assembler in C.


Languages

Language:C 98.9%Language:CMake 0.7%Language:Shell 0.5%