sunjay / brain

A high level programming language that compiles into the brainfuck esoteric programming language

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Refactoring

sunjay opened this issue · comments

The first pass through implementing the compiler was good enough, but the approach isn't very robust. This prevents us from growing the codebase in a reasonable way without a lot of issues.

In particular, the following is hard right now:

  • implementing other types
    • everything was done with arrays of bytes in mind, so UTF-8 strings, numbers, custom data types, etc. are difficult to implement
  • error handling
    • the nom parsing framework doesn't generate the best error messages on its own, we also don't pass position information very well between stages of compilation
    • this prevents us from providing valuable error messages.
  • code generation
    • code generation is a very naive process
    • we are basically appending large amounts of Instruction enums to a Vec and then converting that to a string
    • we need something higher level to avoid using too much memory and to be able to generate smarter code more robustly (without so many silly bugs)
  • optimization
    • Complex optimizations are almost impossible because of the problems outlined above
    • Without context, it's difficult to make more interesting optimization decisions

As you can see, there are problems in nearly every stage of compilation.

This issue will outline a plan to address each of these issues and improve the overall quality of the code.

Plan

New Compilation Stages

  1. Source - the brain source code to be compiled
  2. Parsing --> AST
    • Position information is embedded into every node of the AST denoting the exact index in the source (and possibly the filename) where that node was found
    • This is the closest representation to the original syntax
    • Deguaring:
      • continue and break
      • operators like +, -, /, *
  3. Static Checking - "Complete" AST (AST, HashMap<Name, Item>)
    // Forces us to enforce that variables are initialized before used
    enum Item {
        Defined {type_def},
        Initialized {type_def},
    }
    • Desugared AST with all names resolved and complete type information
    • Looks up names and determines the type of each
    • Any type inference is resolved here
    • Any size inference is resolved here
    • We check for type compatibility and make sure everything makes sense
    • Types are associated with some Rust struct which knows how to perform operations
    • The AST is immutable from this point on -- we don't want to make any changes
    • Complains if names defined in a non-global scope are used where they are not accessible
  4. Operation IR --> Vec<Operation>
    • AST is transformed into a tree of operations
    • We define our own set of primitive operations which (for the most part) we can later translate into brainfuck
    • We use these operations to transparently handle memory without explicitly building the static memory layout here
    • These operations are easier to work with and more sophisticated than the default brainfuck instructions
    • A lot of the pitfalls are avoided like forgetting to close loops, dealing with relative movement and memory allocation
    • Rust/pseudocode Operation enum:
    enum Operation {
        /// Allocates the given size in bytes on the tape so that it is not used by any other code accidentally
        /// The id is generated automatically and represents the position which will eventually be determined when the memory layout is generated
        /// Nothing is guaranteed about the position other than that `size` consecutive cells including the position will be available for use without conflicts
        /// This is also used to ensure memory is automatically dropped at the end of its scope (stack frame)
        Allocate {id, size},
    
        /// While most allocations can be dropped at the end of their scope, temporary cells
        /// should be dropped as soon as possible so that they are available in the memory
        /// layout again as soon as possible
        /// This way we can avoid a lot of unnecessary move operations over cells
        /// that aren't being used anymore
        /// While this could be an optimization as well, temporary cells in particular are
        /// *known* to have this property since we generate temporary cells in the compiler itself
        /// The temporary cells are guaranteed to last for the duration of the given body
        /// They are then freed immediately afterwards
        TempAllocate {id, size, body: Vec<Operation>),
    
        /// Frees the given memory id and all cells associated with it
        /// Typically not used unless an explicit free is necessary before the end of the scope
        /// Freeing means both marking those cells available and zeroing their values
        Free {id},
    
        /// Move to the position associated with a particular memory id
        MoveTo {id},
    
        /// Perform a movement that returns to the reference position (usually cell 0)
        ReturnToReference,
    
        /// Increment the value of the current cell by a certain amount (relative to whatever the current amount in the cell is)
        Increment {amount},
    
        /// Decrement the value of the current cell by a certain amount (relative to whatever the current amount in the cell is)
        Decrement {amount},
    
        /// Read bytes into `size` consecutive cells
        /// Note: this generates both read instructions and move right instructions
        Read {size},
    
        /// Write bytes into `size` consecutive cells
        /// Note: this generates both write instructions and move right instructions
        Write {size},
    
        /// Set the value of `size` consecutive cells including the current cell to zero
        /// Note: this generates instructions to zero the value and move to the right
        Zero {size},
    
        /// Loop with the given operations as the loop body
        /// A loose restriction on loops is that they should return to the reference position whenever that makes sense
        /// In well behaved code, this can usually be done by inserting a ReturnToReference operation at the end of the loop body
        Loop {body: Vec<Operation>},
    
        /// Copy the cells at the position of sourceId to the position of targetId using a single temporary
        /// cell at the position of tempId
        /// Only copies up to the size of sourceId
        /// sourceId and targetId must have the same size
        Copy {sourceId, targetId, tempId},
    
        /// Relocate the value at the position of sourceId to the position of targetId
        /// Leaving zeros where the source data was originally
        /// sourceId and targetId must have the same size
        Relocate {sourceId, targetId},
    }
  5. Static Memory Layout --> (Vec<Operation>, HashMap<id, Cell {position, size}>)
    • We use the operations to figure out a memory layout and map the allocated ids to actual positions on the tape
    • This is a separate stage so we have the most information possible when doing this
    • Drops allocated memory at the end of its scope
      • The end of scope is the end of a loop or the end of the main program
    • Complains if memory is used after it is freed
  6. Optimization --> (Vec, HashMap<id, Cell {position, size}>)
    • Operations are performed on the operations tree
    • Each optimizer walks the tree and returns a new one with the operations performed
  7. Compilation --> Vec
    • Operations are transformed into Instruction instances representing the actual brainfuck instructions
    • Compilation can accurately keep track of position information and reliably generate code because it has the entire operation tree to interpret
    • Drops all values declared in loop bodies
    • Operations that can be composed of other operations should be converted first during generation
      • In other words, generation is a pipeline operation of first converting operations to lower level operations and then converting those lower level operations directly to instructions
  8. Brainfuck Optimizations (to be added later)
    • Further brainfuck optimizations specific to streams of brainfuck instructions are performed
  9. Brainfuck
    • The Instruction struct is transformed into a string representation which can then be written into a file

Representing Types

We need to do several things here:

  1. Associate type names like "u8" or "[u8; 7]" with internal structs which represent those types
  2. Be able to lookup method names like "write" in internal structs and type signatures
  3. Be able to apply functions to internal type structs

Approach:

/// Operators are things like Add, Subtract, Concatenate, Call, Write, Read, etc.
#[derive(Clone, Copy)]
enum Operator {
    /// Read values into the type using Read operations
    Read,

    /// Write the type using Write operations
    Write,

    /// Decrement the given type by 1
    ///TODO: This is essentially a placeholder for use before we get proper numeric support
    ///TODO: Eventually this will be replaced by a Subtract operator or something
    Decrement,

    /// Call a specific method name (e.g. len) on the type
    /// The method name to call is the first argument in the args Vec
    /// Return ApplyError::UnsupportedOperator if the given method name is not supported
    /// Return ApplyError::UnsupportedArgumentTypes if the types of the given arguments are not supported
    CallMethod,
}

trait BType {
    /// Returns the name that can be used in brain code to refer to this type
    fn type_name() -> &'static str;
    /// Apply the given function or operator
    fn apply(f: Operator, args: Vec<&BType>) -> Result<Vec<Operation>, ApplyError>;
}

enum ApplyError {
    /// When the given operator isn't supported by this type
    /// Example: Use this when CallMethod is passed with "len" but the type doesn't support len()
    UnsupportedOperator,
    /// When the types of the given arguments are not supported
    UnsupportedArgumentTypes,
}

To Do

  • Tests
  • Travis CI build

This is still representative of roughly what the compiler does. Some of the stages are different, moved or changed, but the overall structure and process is still the same.