Compiler and build toolchain for a toy C-like language. Based on Nora Sandler's tutorial series.
This toolchain is written in C# for .NET Core 3.0, so it should theoretically run on Windows, Linux, and Mac OS.
Work in progress - it can't actually build programs yet.
I have never written a compiler before, so it's a learning exercise. I also often write CTF challenges and crackmes that utilise VMs/emulators for custom toy architectures, and having my own compile toolchain that I can use for those architectures makes it much easier to write code for them.
Nope! The syntax is very similar and many trivial C programs are also valid ChavLang programs, but this is absolutely not a standards-compliant C compiler. Many modern C features will be missing and some syntax will be different.
There are three main projects:
- ChavLang: A library that contains the implementation of the compiler.
- ChavLangTest: Contains tests for ChavLang.
- ChavBuild: Command line tool for compiling programs.
The ChavLang library is split into a number of parts:
- Lexer: Takes input program code as a string and converts it into a sequence of tokens. See
Lexer.cs
and theTokens
directory. - Parser: Takes a sequence of tokens from the lexer and converts it into a syntax tree (AST). See
Parser.cs
and theNodes
directory.
The general approach is to use a set of regular expressions (regexes) to do the bulk of the work.
The lexer loops through a set of regexes to find valid tokens at the current file pointer, then moves along and repeats the process until it hits the end of the file.
The parser turns the given sequence of tokens into a string representation in order to use regex in a similar manner to the lexer. The major difference is that the parser maintains state information about where it is in the syntax tree in order to select which syntax regexes are valid - for example, a function definition is only valid at the program scope (because ChavLang does not support nested functions).
Adding a new lexer token manually is fairly easy:
- Create a new class in the
Tokens
directory for your token type, using theFooBarToken
naming convention. The class must derive fromBaseToken
and must implement thebase(contents)
constructor. - In
Lexer.cs
, add a newRegex
instance to the token regexes at the top of the code for your new token, preferably keeping the declarations in precedence order. - In
Lexer.cs
, add a new entry to the_tokenPatternMap
dictionary in the constructor in order to map the regex you created in step 2 to the new token class you created in step 1.
To help speed up the process of creating a lot of new lexer tokens, there's a LinqPad script called TokenMapGenerator.linq
which automatically generates default token class files for a given list of names, and generates the token pattern map for placing in the Lexer
class constructor. If you're adding a new token I generally suggest starting by updating the list in this script and then running it. You just need to change the path at the top to match the location of the repository on your disk. Make sure replaceFiles
is set to false, as setting it to true destroys all changes in the existing *Token.cs
files.
Steps as follows:
- Create a new node class inside of the
Nodes
directory, and inherit fromNodeBase
. Override theToString
method to provide proper output for the human-readable tree view. - In
Parser.cs
, add a newRegex
instance to the syntax regexes at the top of the file. - In
Parser.cs
, inside the constructor, add your new syntax regex variable to the_validSyntaxForLocation
dictionary in order to specify where the syntax can be used. - In
Parser.cs
, add a new function to parse your new syntax. Refer to the other parse functions for more information. - In
Parser.cs
, inside the constructor, add a new entry to the_syntaxHandlers
dictionary that maps the syntax regex variable you created in step 2 to the function you created in step 4.
There is not currently a helper script for this because there's a lot of manual work involved in writing new parse functions.
Code is MIT licensed. See the LICENSE
file for full text.