Open Ethereum Virtual Machine
Warning.
This repo is a hodgepodge of collected documents; some may be out of date or incorrect!
[TOC]
- The Contract constructor function
- Limitation
- Address layout
- Limitations
- The function dispatcher (meta function)
- Moving the function dispatcher to front of the LLVM IR function list
- Functionalities
- Optimizations
- Support more than 16 local variables
- Instruction scheduling
- Improve EVM calling conventions
- Re-materialization of constants
- EVM LLVM Wiki
- SHA3 (Keccak)
- Address
- Balance
- Origin
- Caller
- CallValue
- CallDataLoad
- CallDataSize
- CallDataCopy
- CodeSize
- CodeCopy
- GasPrice
- ExtCodeSize
- ExtCodeCopy
- ReturnDataSize
- ReturnDataCopy
- BlockHash
- Coinbase
- Timestamp
- Number
- Difficulty
- GasLimit
- StorageLoad
- StorageStore
- MemSize
- Gas
- Log0
- Log1
- Log2
- Log3
- Log4
- Create
- Create2
- Call
- CallCode
- DelegateCall
- StaticCall
- Return
- Revert
- SelfDestruct
- Frontend is expected to emit 256bit values LLVM IR
- Frontend needs to generate compatible LLVM IR
- Constructor
- Testing utilities
- How to run testings
- How to add new tests
- TODO lists
- Types of calls
- Internal call conventions
- Procedure of a subroutine call
- EIP2315 Support: Subroutine calls
- External calls
- Contract Input Argument Types -- The Solidity convention
Introduction
The EVM is a stack machine. Instructions may pop values from the stack as arguments and push their results back onto it. The `ADD` operation, for example, pops two values and pushes their sum.
Elements on the stack are 32-byte words, and all keys and values in storage are 32 bytes. There are over 100 opcodes, divided into categories delineated in multiples of 16.
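To make the stack behaviour concrete, here is a minimal Python sketch of a toy stack machine that supports only `PUSH1` and `ADD`. It is an illustration under simplifying assumptions (no gas, no memory, instructions given as tuples rather than bytes), not a model of any real client.

```python
WORD_MASK = (1 << 256) - 1  # EVM stack elements are 256-bit (32-byte) words


def run(program):
    """Execute a list of (mnemonic, operand) pairs and return the final stack."""
    stack = []
    for op, arg in program:
        if op == "PUSH1":
            # PUSH1 pushes a single-byte immediate onto the stack.
            stack.append(arg & 0xFF)
        elif op == "ADD":
            # ADD pops two words and pushes their sum modulo 2**256.
            a = stack.pop()
            b = stack.pop()
            stack.append((a + b) & WORD_MASK)
        else:
            raise ValueError(f"unsupported op: {op}")
    return stack


program = [("PUSH1", 0x10), ("PUSH1", 0x20), ("ADD", None)]
print([hex(v) for v in run(program)])
```

After the two pushes the stack holds two words; `ADD` replaces them with a single word containing their sum.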
The EVM is a security oriented virtual machine, designed to permit untrusted code to be executed by a global network of computers.
To do so securely, it imposes the following restrictions:
- Every computational step taken in a program's execution must be paid for up front, thereby preventing Denial-of-Service attacks.
- Programs may only interact with each other by transmitting a single arbitrary-length byte array; they do not have access to each other's state.
- Program execution is sandboxed; an EVM program may access and modify its own internal state and may trigger the execution of other EVM programs, but nothing else.
- Program execution is fully deterministic and produces identical state transitions for any conforming implementation beginning in an identical state.
The Ethereum ABI (Application Binary Interface) is the standard way to interact with contracts in the Ethereum ecosystem, both from outside the blockchain and for contract-to-contract interaction. It is a specification for how arguments and function calls should be encoded in the call data.
An EVM compiler doesn’t exactly optimize for bytecode size, speed or memory efficiency. Instead, it optimizes for gas usage, which is a layer of indirection that incentivizes the sort of calculation that the Ethereum blockchain can do efficiently.
Gas costs are set somewhat arbitrarily, and could well change in the future. As costs change, compilers would make different choices.
EVM Implementations (Ethereum clients) (Execution Layer)
- go-ethereum (Go)
- Akula (Rust)
- Erigon (Go)
- Hyperledger Besu (Java)
Some interesting implementations:
- eth-isabelle: theorem provers
- sputter: implementation in Clojure
Programming Languages that Compile into EVM
- Solidity
- LLLL: An LLL-like compiler being implemented in Isabelle/HOL
- HAseembly-evm: An EVM assembly implemented as a Haskell DSL
main :: IO ()
main = putStrLn $ codegen prog1

prog1 :: EvmAsm
prog1 = do
    push1 0x10
    push1 0x20
    add
Context ops (0x30-range)
The following ops have dynamic gas, and are not charted:
`CALLDATALOAD`, `CALLDATACOPY`, `CODECOPY`, `EXTCODECOPY`, `RETURNDATACOPY`.
Note: `BALANCE` should really be called `EXTBALANCE`, since it fetches the balance for (potentially) external accounts.
SHA3 ops (0x20-range)
The `SHA3` operation has its own range since it is a dynamically priced operation.
Storage and execution (0x50 range)
These are the ops
`POP` `MLOAD` `MSTORE` `MSTORE8` `SLOAD` `SSTORE` `JUMP` `JUMPI` `PC` `MSIZE` `GAS` `JUMPDEST`
However:
- `MSTORE` and `MSTORE8` have an additional cost for expanding memory.
- `SLOAD` varies depending on previous value.
- `JUMP`/`JUMPI` have a hidden cost: they require jumpdest analysis.
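The memory expansion cost mentioned above can be sketched in Python as follows. This assumes the memory-cost formula from the Ethereum Yellow Paper (3 gas per 32-byte word plus a quadratic term); since gas costs can change, treat the constants as assumptions rather than as something this project defines.

```python
def memory_cost(size_in_bytes):
    """Total gas charged for having `size_in_bytes` of active memory."""
    words = (size_in_bytes + 31) // 32  # memory is accounted in 32-byte words
    return 3 * words + (words * words) // 512


def expansion_cost(old_size, new_size):
    """Extra gas an op such as MSTORE pays when it grows memory."""
    if new_size <= old_size:
        return 0  # no growth, no additional charge
    return memory_cost(new_size) - memory_cost(old_size)


# Growing memory from empty to one word, and from empty to 32 words.
print(expansion_cost(0, 32), expansion_cost(0, 1024))
```

The quadratic term is what makes large memory footprints disproportionately expensive.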
0xf0 - calling
Remaining ops are 'special' - both dynamic costs and non-trivial effects, such as starting new call contexts or exiting from call contexts.
`CREATE` `CALL` `CALLCODE` `RETURN` `DELEGATECALL` `CREATE2` `STATICCALL` `REVERT` `SELFDESTRUCT`
0xa0 - Logging
The LOG opcodes are dynamically priced, depending on the memory size, so we don't have gas/time charts, but here's a time spent-chart:
0x60 range
These are `DUPX`, `SWAPX` and `PUSHX`.
The Contract constructor function
Because EVM execution always starts from the beginning of the code (`pc = 0`), there must be a way to handle more complicated contract behaviours. In EVM LLVM, a dedicated function describes this entry-point handling. It is called the contract constructor function. When implementing it, developers are expected to respect the following contract constructor properties:
- The constructor should be the first function in the generated LLVM IR.
- The constructor should be named `solidity.main` or `main` (could change in the future). The backend recognizes these specific names and will generate different call codes.
- The constructor should not take any arguments.
- The constructor should initialize the `free memory pointer`, which is located at address `0x40`. The `free memory pointer` is like the usual frame pointer, used to calculate function frames and stack allocations. Because it is located at `0x40`, you cannot initialize it to a smaller number.
Skeleton example of a very small constructor function
Here is an illustration of the skeleton of a small smart contract:
declare i256 @llvm.evm.calldataload(i256)
declare void @llvm.evm.return(i256, i256)
declare void @llvm.evm.mstore(i256, i256)

define void @main() {
entry:
  call void @llvm.evm.mstore(i256 64, i256 128)
  %0 = call i256 @llvm.evm.calldataload(i256 0)
  %1 = call i256 @llvm.evm.calldataload(i256 32)
  %2 = call i256 @add(i256 %0, i256 %1)
  call void @llvm.evm.mstore(i256 0, i256 %2)
  call void @llvm.evm.return(i256 0, i256 32)
  unreachable
}

define i256 @add(i256, i256) {
  %3 = alloca i256, align 4
  %4 = alloca i256, align 4
  store i256 %0, i256* %3, align 4
  store i256 %1, i256* %4, align 4
  %5 = load i256, i256* %3, align 4
  %6 = load i256, i256* %4, align 4
  %7 = add nsw i256 %5, %6
  ret i256 %7
}
**Usually, it is the frontend's responsibility to do the smart contract's plumbing, including the contract's constructor function.** We need the language frontends to generate the corresponding LLVM IR code.
This smart contract does the following things:
- Initialize the `free memory pointer` to 128.
- Parse the first two 32-byte inputs.
- Call the `@add` function and supply it with the two parsed arguments.
- In the function `@add`, add the two arguments and return the sum.
- In the `@main` function, return the retrieved value using the `llvm.evm.return` intrinsic.
Compiling the smart contract
Let's put the above smart contract code into a file named `test.ll`, and use `llc` to generate the EVM binary:
llc -mtriple=evm -filetype=obj test.ll -o test.o
Running the contract
A generated `.o` file is in binary format. To see its content in hex, use `xxd`, for example:
xxd -p -cols 65536 test.o
`xxd` emits a hex string representation of the binary and breaks lines that are too long. Here we specify `-cols 65536` to avoid line breaking. After calling `xxd`, you should see output such as:
5b600135600080803561003d909192939091604051806108200152604051610840016040526004580192565b60405160209003516040529052602090f35b80826040519190915260206040510152019056
That is what we need to execute using an EVM engine. Let's try to do it using Geth's EVM. Remember that we need to supply two input arguments, so the command line should be like:
evm --input 1234567890123456789012345678901234567890123456789012345678901234 --code 5b600135600080803561003d909192939091604051806108200152604051610840016040526004580192565b60405160209003516040529052602090f35b80826040519190915260206040510152019056 run
`evm` will print the result of adding the two values:
0x468acf08a2468acf08a2468acf08a2468acf08a2468acf08a2468acf08a24634
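To see where the printed value comes from, here is a minimal Python model of `CALLDATALOAD`: it reads 32 bytes starting at an offset and zero-pads past the end of the call data. This is a sketch, not geth's implementation. Note that the bytecode passed to `--code` begins with `60 01 35` (`PUSH1 1`, `CALLDATALOAD`), so under this reading the two loaded words come from offsets 0 and 1; that interpretation of the bytecode is an assumption.

```python
def calldataload(calldata: bytes, offset: int) -> int:
    """Model of CALLDATALOAD: 32 bytes from `offset`, zero-padded on the right."""
    word = calldata[offset:offset + 32]
    word = word + b"\x00" * (32 - len(word))
    return int.from_bytes(word, "big")


# The same 32-byte input as in the command line above.
calldata = bytes.fromhex("1234567890" * 6 + "1234")

a = calldataload(calldata, 0)
b = calldataload(calldata, 1)
result = (a + b) % (1 << 256)  # ADD wraps around at 256 bits
print(hex(result))
```

Because the input is only 32 bytes long, a load at offset 32 would return zero; supplying a second argument requires appending another 32-byte word to `--input`.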
EVM LLVM provides a way to emit a program's metadata for various purposes. For example, a symbol table that records the jump destinations can be emitted along with the generated binary.
Developers can use this utility to emit more program information.
Existing implementation
When compiling a contract, a file named `EVMMeta.txt` will be generated along with the binary code. The file contains the function symbol table of the compiled program, along with the offset of each function. The metadata file can be used for various purposes, such as debugging, manual linking, analysis, and so on.
To specify a custom metadata file name instead of `EVMMeta.txt`, use the `-evm_md_file` option.
Limitation
The existing implementation of EVM metadata emission is limited to the `MachineCode` module/level, which means that any transformation at a higher level, such as the IR level, will not be shown in the result.
Let's try to use a simple C file to test our compiler:
cat <<EOF > test.c
unsigned x;
int abc(unsigned a, unsigned b, unsigned c) {
  if (c > 0) {
    return a + x;
  } else {
    return a + b;
  }
}
EOF
Prerequisite: you have to install `clang` and use it to generate LLVM IR first:
clang -S -emit-llvm test.c
This will generate a `test.ll` file, which should be the LLVM IR equivalent of our `test.c` file. Then we can generate EVM binary or assembly from it. In order to use the backend to generate EVM assembly, you have to specify `-mtriple=evm` when calling `llc`. An example is as follows:
./build/bin/llc -mtriple=evm test.ll -o test.s
The generated `test.s` file contains the compiled EVM assembly code. Note that the generated code is the function body itself. In order to generate a complete smart contract, we need a smart contract creator function, which is discussed on another page.
Notice that you can also get the binary code of the function body by emitting an object file:
./build/bin/llc -mtriple=evm -filetype=obj test.ll -o test.o
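To inspect what such a binary contains, a small disassembler is handy. The Python sketch below decodes a hex string into mnemonics; it handles the `PUSHn`, `DUPn` and `SWAPn` ranges generically and only names a handful of other opcodes (the `NAMED` table is deliberately partial, and anything else is printed as `UNKNOWN_0x..`). It is an illustration, not a tool shipped with EVM LLVM.

```python
# Partial opcode table; extend as needed.
NAMED = {
    0x00: "STOP", 0x01: "ADD", 0x03: "SUB", 0x35: "CALLDATALOAD",
    0x51: "MLOAD", 0x52: "MSTORE", 0x56: "JUMP", 0x58: "PC",
    0x5B: "JUMPDEST", 0xF3: "RETURN",
}


def disassemble(hex_code):
    """Turn a hex string of EVM bytecode into a list of mnemonics."""
    code = bytes.fromhex(hex_code)
    out = []
    pc = 0
    while pc < len(code):
        op = code[pc]
        pc += 1
        if 0x60 <= op <= 0x7F:
            # PUSH1..PUSH32 carry 1..32 immediate bytes after the opcode.
            n = op - 0x5F
            imm = code[pc:pc + n]
            pc += n
            out.append(f"PUSH{n} 0x{imm.hex()}")
        elif 0x80 <= op <= 0x8F:
            out.append(f"DUP{op - 0x7F}")
        elif 0x90 <= op <= 0x9F:
            out.append(f"SWAP{op - 0x8F}")
        else:
            out.append(NAMED.get(op, f"UNKNOWN_0x{op:02x}"))
    return out


# A short bytecode prefix used as sample input.
for line in disassemble("5b600135600080803561003d"):
    print(line)
```

Handling the `PUSH` immediates correctly matters: without skipping them, data bytes would be misread as opcodes.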
Address layout
EVM bytecode has a flat structure. It has neither explicit function entries nor symbol tables. Execution always starts from address `0x00`.
Limitations
Notice that at this moment this backend can only generate correct code for a single compilation unit.
In order to link more than one compilation unit, one should inline the existing compilation units in the frontend so that the frontend can generate a correct `main` (the `function dispatcher` function) for the whole smart contract.
The function dispatcher (meta function)
The `function dispatcher` function (called the `main` function in some contexts) is always placed at the beginning of the generated binary bytecode. The dispatcher is responsible for:
- parsing the call data and finding the called function's address in the jump table using the hash value provided in the call data.
- extracting the call arguments and pushing them onto the stack.
- calling the function address specified in the jump table.
Start of address
+----------------> +-------------------------+
| Function dispatcher |
| Jump Table |
| (Func1, |
| Func2, |
| Func3) |
+-------------------------+
| |
| Func1 |
| |
+-------------------------+
| |
| Func2 |
| |
+-------------------------+
| |
| Func3 |
| |
+-------------------------+
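The dispatcher's three steps can be sketched in Python as follows. The sketch assumes the Solidity convention that the first 4 bytes of the call data select the function and that each argument is a 32-byte word; the selector value `aabbccdd` and the `jump_table` dictionary are hypothetical stand-ins for the real jump table.

```python
def dispatch(calldata: bytes, jump_table: dict):
    """Look up the callee by selector, decode its arguments, and call it."""
    # Step 1: the leading 4 bytes identify the called function.
    selector = calldata[:4]
    handler = jump_table.get(selector)
    if handler is None:
        raise ValueError("unknown function selector")
    # Step 2: the remaining call data is a sequence of 32-byte arguments.
    body = calldata[4:]
    args = [int.from_bytes(body[i:i + 32], "big") for i in range(0, len(body), 32)]
    # Step 3: transfer control to the selected function.
    return handler(*args)


# Hypothetical jump table with a single entry.
jump_table = {bytes.fromhex("aabbccdd"): lambda a, b: (a + b) % (1 << 256)}

calldata = bytes.fromhex("aabbccdd") + (1).to_bytes(32, "big") + (2).to_bytes(32, "big")
print(dispatch(calldata, jump_table))
```

In the generated bytecode the table maps selectors to code offsets rather than Python callables, and the "call" is a jump to that offset.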
Moving the function dispatcher to front of the LLVM IR function list
At this moment it is up to the frontend developer to move the LLVM IR function to the beginning of the function list. You can do something like this when creating function dispatcher:
// Let's say you have a dispatcher function named "dispatcher"
// You should include "llvm/IR/SymbolTableListTraits.h" here
using FunctionListType = SymbolTableList<Function>;
FunctionListType &FuncList = TheModule->getFunctionList();
FuncList.remove(dispatcher);
FuncList.insert(FuncList.begin(), dispatcher);
Functionalities
Experimental support of landing pad
Landingpad is used to support exception handling.
Experimental support of simulating heap allocations
EVM does not have a heap space, so we cannot use heap allocations. We might be able to work around it.
Constant table support
Having a constant table in the smart contract could potentially save some code size if the elements in the table are reused.
Metadata export
We could export more metadata for debugging, analyzing, and so on.
Optimizations
Support more than 16 local variables
EVM can only retrieve an element up to a depth of 16 from the stack top, using the instructions `SWAP1` to `SWAP16`. This results in a limitation in the Solidity compiler, which can only support 16 local variables. At this moment, EVM LLVM will also face a `stack too deep` issue if a single basic block has more than 16 variables.
But in LLVM we can work around this issue and do a much better job. With dataflow analysis and a register allocation algorithm, we can have near-optimal variable assignment (on the stack or on the memory stack) in linear time.
Instruction scheduling
Arranging the order of the opcodes in an EVM binary is critical to its performance. Instructions have to be arranged so that stack manipulation overhead is minimal (that is, the opcodes that do no actual computation but only reorder the stack operands' positions relative to the top of the stack).
EVM LLVM backend is designed in such a way that a scheduler before register allocation can be implemented to reduce the stack operation overhead.
Improve EVM calling conventions
When calling a subroutine, the return address is the first argument and resides at the top of the stack. This is non-optimal because the return address will not be used until the very end of the subroutine, and taking up a visible slot is expensive. We can re-arrange the return address to be at the end of the arguments so it does not have to be reached until we want to return from the subroutine.
Re-materialization of constants
Small constants should not stay on the stack; they should be rematerialized whenever they are needed.
Ethereum Virtual Machine specific operations, such as accessing storage or retrieving block information, are performed through EVM-specific instructions. The Solidity language automatically generates the necessary EVM-specific instructions under the hood so as to hide the details from Solidity developers. However, as a compiler backend, EVM LLVM takes LLVM IR as input, which cannot hold any language-specific semantics higher than the C language level. So it is up to compiler frontends to lower language-specific semantics to the LLVM IR level.
Intrinsic functions are used to represent EVM-specific semantics in the input LLVM IR. Intrinsic functions are usually higher level representations of architecture-specific instructions. In EVM LLVM, we allow users to leverage EVM-specific instructions that are used to interact with the chain or storage by exposing those EVM instructions in the form of intrinsic functions.
- This page lists the intrinsic functions that frontend developers can use.
- Intrinsics are defined here.
- And here are examples on how to leverage intrinsics
EVM LLVM Wiki
Welcome to the `evm_llvm` wiki! This project aims at bringing LLVM infrastructure to the EVM world where smart contracts are widely deployed.
EVM LLVM is an EVM architecture backend for LLVM. With EVM LLVM you can generate EVM binary code with LLVM-based compilers. The backend does not assume a language frontend, so you should be able to plug in a new smart contract language frontend to generate EVM binary.
The goal of this project is to enable various platforms, tools and smart contract programming language projects to quickly adopt a high-performance EVM backend.
In order to produce EVM-specific instructions, a set of intrinsics must be specified for the frontend to emit so that the EVM backend knows to generate the corresponding instruction. Many of these instructions are represented by built-in functions in Solidity.
NOTE: The argument ordering in these intrinsics may not fully match the order in which they must be pushed onto the EVM stack. Someone should double check.
TODO: Double-check argument and return types as well.
SHA3 (Keccak)
- Name: `llvm.evm_builtin_sha3`
- Arguments:
  - `data: u256be`: byte pointer to buffer to hash
  - `len: u256be`: length of buffer
- Return:
  - `hash: bytes32`: resulting hash
- Codegen details:
  - SHA3 opcode: `0x20`
Address
- Name: `llvm.evm_builtin_address`
- Arguments: None
- Return:
  - `address: bytes20`: Currently executing account
- Codegen details:
  - ADDRESS opcode: `0x30`
Balance
- Name: `llvm.evm_builtin_balance`
- Arguments:
  - `address: bytes20`: Account to check balance of
- Return:
  - `balance: u256be`: Balance of given account
- Codegen details:
  - BALANCE opcode: `0x31`
Origin
- Name: `llvm.evm_builtin_origin`
- Arguments: None
- Return:
  - `address: bytes20`: Address of TX origin
- Codegen details:
  - ORIGIN opcode: `0x32`
Caller
- Name: `llvm.evm_builtin_caller`
- Arguments: None
- Return:
  - `address: bytes20`: Address of caller
- Codegen details:
  - CALLER opcode: `0x33`
CallValue
- Name: `llvm.evm_builtin_callvalue`
- Arguments: None
- Return:
  - `value: u256be`: TX value sent
- Codegen details:
  - CALLVALUE opcode: `0x34`
CallDataLoad
- Name: `llvm.evm_builtin_calldataload`
- Arguments:
  - `offset: u256be`: Offset within call data buffer to load
- Return:
  - `data: bytes32`: 32 bytes of calldata beginning at `offset`
- Codegen details:
  - CALLDATALOAD opcode: `0x35`
CallDataSize
- Name: `llvm.evm_builtin_calldatasize`
- Arguments: None
- Return:
  - `size: u256be`: Length of call data buffer
- Codegen details:
  - CALLDATASIZE opcode: `0x36`
CallDataCopy
- Name: `llvm.evm_builtin_calldatacopy`
- Arguments:
  - `dst: u256be`: Pointer to destination buffer
  - `offset: u256be`: Offset within call data to copy
  - `len: u256be`: Number of bytes to copy beginning at `offset`
- Return: None
- Codegen details:
  - CALLDATACOPY opcode: `0x37`
CodeSize
- Name: `llvm.evm_builtin_codesize`
- Arguments: None
- Return:
  - `size: u256be`: Size of code at currently executing account
- Codegen details:
  - CODESIZE opcode: `0x38`
CodeCopy
- Name: `llvm.evm_builtin_codecopy`
- Arguments:
  - `dst: u256be`: Pointer to destination buffer
  - `offset: u256be`: Offset within code to copy
  - `len: u256be`: Number of bytes to copy beginning at `offset`
- Return: None
- Codegen details:
  - CODECOPY opcode: `0x39`
GasPrice
- Name: `llvm.evm_builtin_gasprice`
- Arguments: None
- Return:
  - `price: u256`: Cost of gas in current environment
- Codegen details:
  - GASPRICE opcode: `0x3a`
ExtCodeSize
- Name: `llvm.evm_builtin_extcodesize`
- Arguments:
  - `address: bytes20`: Address whose code size to check
- Return:
  - `size: u256be`: Size of code at given account
- Codegen details:
  - EXTCODESIZE opcode: `0x3b`
ExtCodeCopy
- Name: `llvm.evm_builtin_extcodecopy`
- Arguments:
  - `address: bytes20`: Address whose code to copy
  - `dst: u256be`: Pointer to destination buffer
  - `offset: u256be`: Offset within code to copy
  - `len: u256be`: Number of bytes to copy beginning at `offset`
- Return: None
- Codegen details:
  - EXTCODECOPY opcode: `0x3c`
ReturnDataSize
- Name: `llvm.evm_builtin_returndatasize`
- Arguments: None
- Return:
  - `size: u256be`: Size of return data buffer from last call
- Codegen details:
  - RETURNDATASIZE opcode: `0x3d`
ReturnDataCopy
- Name: `llvm.evm_builtin_returndatacopy`
- Arguments:
  - `dst: u256be`: Pointer to destination buffer
  - `offset: u256be`: Offset within return buffer to copy
  - `len: u256be`: Number of bytes to copy beginning at `offset`
- Return: None
- Codegen details:
  - RETURNDATACOPY opcode: `0x3e`
BlockHash
- Name: `llvm.evm_builtin_blockhash`
- Arguments:
  - `age: u256be`: The age of the block, between 0 and 256 blocks old
- Return:
  - `hash: bytes32`: The hash of the requested block
- Codegen details:
  - BLOCKHASH opcode: `0x40`
Coinbase
- Name: `llvm.evm_builtin_coinbase`
- Arguments: None
- Return:
  - `beneficiary: bytes20`: Current mining beneficiary
- Codegen details:
  - COINBASE opcode: `0x41`
Timestamp
- Name: `llvm.evm_builtin_timestamp`
- Arguments: None
- Return:
  - `stamp: u256be`: Timestamp of last block
- Codegen details:
  - TIMESTAMP opcode: `0x42`
Number
- Name: `llvm.evm_builtin_blocknumber`
- Arguments: None
- Return:
  - `blknum: u256be`: Current block number
- Codegen details:
  - NUMBER opcode: `0x43`
Difficulty
- Name: `llvm.evm_builtin_difficulty`
- Arguments: None
- Return:
  - `difficulty: u256be`: Current block difficulty
- Codegen details:
  - DIFFICULTY opcode: `0x44`
GasLimit
- Name: `llvm.evm_builtin_gaslimit`
- Arguments: None
- Return:
  - `limit: u256be`: Block gas limit (NOTE: check type again)
- Codegen details:
  - GASLIMIT opcode: `0x45`
StorageLoad
- Name: `llvm.evm_builtin_sload`
- Arguments:
  - `key: u256be`: Storage key to access
- Return:
  - `value: u256be`: Storage value at key
- Codegen details:
  - SLOAD opcode: `0x54`
StorageStore
- Name: `llvm.evm_builtin_sstore`
- Arguments:
  - `key: u256be`: Storage key to write
  - `value: u256be`: Value to write
- Return: None
- Codegen details:
  - SSTORE opcode: `0x55`
MemSize
- Name: `llvm.evm_builtin_msize`
- Arguments: None
- Return:
  - `size: u256be`: Size of active memory in bytes
- Codegen details:
  - MSIZE opcode: `0x59`
Gas
- Name: `llvm.evm_builtin_gas`
- Arguments: None
- Return:
  - `gasleft: u256be`: Amount of gas left in current execution
- Codegen details:
  - GAS opcode: `0x5a`
Log0
- Name: `llvm.evm_builtin_log0`
- Arguments:
  - `data: u256be`: Pointer to log data buffer
  - `len: u256be`: Size of log data buffer in bytes
- Return: None
- Codegen details:
  - LOG0 opcode: `0xa0`
Log1
- Name: `llvm.evm_builtin_log1`
- Arguments:
  - `data: u256be`: Pointer to log data buffer
  - `len: u256be`: Size of log data buffer in bytes
  - `topic1: u256be`: Log topic 1
- Return: None
- Codegen details:
  - LOG1 opcode: `0xa1`
Log2
- Name: `llvm.evm_builtin_log2`
- Arguments:
  - `data: u256be`: Pointer to log data buffer
  - `len: u256be`: Size of log data buffer in bytes
  - `topic1: u256be`: Log topic 1
  - `topic2: u256be`: Log topic 2
- Return: None
- Codegen details:
  - LOG2 opcode: `0xa2`
Log3
- Name: `llvm.evm_builtin_log3`
- Arguments:
  - `data: u256be`: Pointer to log data buffer
  - `len: u256be`: Size of log data buffer in bytes
  - `topic1: u256be`: Log topic 1
  - `topic2: u256be`: Log topic 2
  - `topic3: u256be`: Log topic 3
- Return: None
- Codegen details:
  - LOG3 opcode: `0xa3`
Log4
- Name: `llvm.evm_builtin_log4`
- Arguments:
  - `data: u256be`: Pointer to log data buffer
  - `len: u256be`: Size of log data buffer in bytes
  - `topic1: u256be`: Log topic 1
  - `topic2: u256be`: Log topic 2
  - `topic3: u256be`: Log topic 3
  - `topic4: u256be`: Log topic 4
- Return: None
- Codegen details:
  - LOG4 opcode: `0xa4`
Create
- Name: `llvm.evm_builtin_create`
- Arguments:
  - `value: u256be`: Value in wei sent to new contract
  - `code: u256be`: Pointer to code buffer for new contract
  - `len: u256be`: Code buffer length
- Return:
  - `addr: bytes20`: Address of newly created contract
- Codegen details:
  - CREATE opcode: `0xf0`
Create2
- Name: `llvm.evm_builtin_create2`
- Arguments:
  - `value: u256be`: Value in wei sent to new contract
  - `code: u256be`: Pointer to code buffer for new contract
  - `len: u256be`: Code buffer length
  - `salt: bytes32`: Salt for address creation
- Return:
  - `addr: bytes20`: Address of newly created contract
- Codegen details:
  - CREATE2 opcode: `0xf5`
Call
- Name: `llvm.evm_builtin_call`
- Arguments:
  - `gas: u256be`: Gas allowance for call
  - `address: bytes20`: Call destination address
  - `value: u256be`: Wei value sent with call
  - `input: u256be`: Input data pointer
  - `inputlen: u256be`: Input data buffer size
  - `output: u256be`: Output buffer pointer
  - `outputlen: u256be`: Output data buffer size
- Return:
  - `return: u256be`: Exit code
- Codegen details:
  - CALL opcode: `0xf1`
CallCode
- Name: `llvm.evm_builtin_callcode`
- Arguments:
  - `gas: u256be`: Gas allowance for call
  - `address: bytes20`: Address of code to use
  - `value: u256be`: Wei value sent with call
  - `input: u256be`: Input data pointer
  - `inputlen: u256be`: Input data buffer size
  - `output: u256be`: Output buffer pointer
  - `outputlen: u256be`: Output data buffer size
- Return:
  - `return: u256be`: Exit code
- Codegen details:
  - CALLCODE opcode: `0xf2`
DelegateCall
- Name: `llvm.evm_builtin_calldelegate`
- Arguments:
  - `gas: u256be`: Gas allowance for call
  - `address: bytes20`: Address of code to use
  - `input: u256be`: Input data pointer
  - `inputlen: u256be`: Input data buffer size
  - `output: u256be`: Output buffer pointer
  - `outputlen: u256be`: Output data buffer size
- Return:
  - `return: u256be`: Exit code
- Codegen details:
  - DELEGATECALL opcode: `0xf4`
StaticCall
- Name: `llvm.evm_builtin_staticcall`
- Arguments:
  - `gas: u256be`: Gas allowance for call
  - `address: bytes20`: Call destination address
  - `input: u256be`: Input data pointer
  - `inputlen: u256be`: Input data buffer size
  - `output: u256be`: Output buffer pointer
  - `outputlen: u256be`: Output data buffer size
- Codegen details:
  - STATICCALL opcode: `0xfa`
Return
- Name: `llvm.evm_builtin_return`
- Arguments:
  - `data: u256be`: Return data pointer
  - `len: u256be`: Return buffer size
- Return: Ends execution
- Codegen details:
  - RETURN opcode: `0xf3`
Revert
- Name: `llvm.evm_builtin_revert`
- Arguments:
  - `data: u256be`: Return data pointer
  - `len: u256be`: Return buffer size
- Return: Ends execution
- Codegen details:
  - REVERT opcode: `0xfd`
SelfDestruct
- Name: `llvm.evm_builtin_selfdestruct`
- Arguments:
  - `beneficiary: bytes20`: Address to send funds to
- Return: Ends execution
- Codegen details:
  - SELFDESTRUCT opcode: `0xff`
## EVM target specific changes
Frontend is expected to emit 256bit values LLVM IR
The EVM architecture is the only 256-bit machine out there in the market, and so far it has not received support from the LLVM community. We added 256-bit and 160-bit support at the LLVM IR level.
In order to utilize 256-bit and 160-bit operands, developers are expected to emit the `i256` and `i160` data types in their IR code generation. Include `evm_llvm`'s header files from the `include/llvm` folder so that these two pre-defined data types can be properly generated.
Frontend needs to generate compatible LLVM IR
Notice that development of this backend is based on LLVM 10, which was released in March 2020. We also have an LLVM 8 branch to support those who create their frontends with LLVM 8.
We could backport to other, lower versions such as LLVM 9 at the request of developers for better stability or compatibility. Please let me know if you have such needs.
EVM differs from other execution platforms in that it runs on a blockchain. The result of executing a smart contract depends on the state of the blockchain as well. So we have to integrate an EVM execution environment (at this early stage, `geth`) into our tests.
Constructor
Unit tests focus only on small test functions. But a function cannot be executed independently on the blockchain: we need a contract constructor and dispatcher as the first function in the file. The reason is that the EVM always starts its execution from address `0x00`, where the contract header / constructor / dispatcher resides. The header then sets up the contract: allocating memory/storage, parsing incoming parameters, et cetera.
Here is the commented constructor code we use for handling unit tests:
define void @main() {
entry:
  %0 = call i256 @llvm.evm.calldataload(i256 0)  ; extract first 32-byte argument
  %1 = call i256 @llvm.evm.calldataload(i256 32) ; extract second 32-byte argument
  %2 = call i256 @test(i256 %0, i256 %1)         ; execute the unit test function
  call void @llvm.evm.mstore(i256 0, i256 %2)    ; store the returned value to memory address `0x00`
  call void @llvm.evm.return(i256 0, i256 32)    ; call "return" to return the value returned by @test
  unreachable
}
Notice that the `@test` function takes 2 parameters, so we have two calls to `@llvm.evm.calldataload`.
The unit test is compiled using `llc` with the options `-mtriple=evm -filetype=obj`. Then the code is executed using `geth`'s `evm` command.
Testing utilities
A Python script handles the testing: `evm_llvm/tools/evm-test/evm_test.py` is the script we created to test the functionality of the LLVM backend. Here is what it does:
- Call the evm_llvm backend to compile an LLVM IR file (`.ll` file) into an object file (`.o` file). The file should contain the function we are going to verify along with a smart contract constructor header which is used to handle input arguments. The function should be at the beginning of the IR file (the first function).
- Extract the contract opcodes from the `.o` file and prepare the input arguments (by padding each argument to 32 bytes and concatenating everything into one long string).
- Run the executable binary using geth's `evm`, get the result from the printed output, and compare it with the expected value.
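The argument preparation step can be sketched in Python as below. This is not the code from `evm_test.py`; the helper names `encode_args` and `evm_command` are hypothetical, and the command layout simply mirrors the `evm --input ... --code ... run` invocation used elsewhere in this document.

```python
def encode_args(args):
    """Pad each integer argument to 32 bytes (64 hex digits) and concatenate."""
    words = []
    for value in args:
        value %= 1 << 256  # keep every argument within one 256-bit word
        words.append(format(value, "064x"))
    return "".join(words)


def evm_command(code_hex, args):
    """Build the argument list for geth's `evm` tool (not executed here)."""
    return ["evm", "--input", encode_args(args), "--code", code_hex, "run"]


# Two small arguments become a 128-character hex string.
print(encode_args([1, 2]))
print(evm_command("00", [1, 2]))
```

The resulting list could be handed to `subprocess.run`, with the printed output compared against the expected value.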
How to run testings
- Install Python3.
- Run `evm_llvm/tools/evm-test/evm_test.py`; you should then see the results.
How to add new tests
Please take a look at the `evm_llvm/tools/evm-test/evm_testsuit.py` file. It organizes tests by categorizing them into different `OrderedList`s. Each element of a list contains the following information:
- the name of the test
- the array of input arguments
- the path of the unit test source code file (in LLVM IR form)
- the expected result value
When adding new tests, you should:
- put your test files into the `evm_llvm/test/CodeGen/EVM` folder.
- add the test file path and expected results to the `evm_testsuit.py` file. (We might change it when the file gets too large.)
TODO lists
- add blockchain state related tests
- add re-entrance tests (which are also related to changes of blockchain states)
Please help improve the test utility!
## Variables
In the context of a stack machine, a variable refers to an operand that will be consumed by an opcode. In EVM LLVM, variables are treated as virtual registers until they are stackified (converting register-based code to stack-based code) right before lowering to machine code.
In LLVM's internal SSA representation, it is fairly easy to compute a register's live range (the range from its assignment to its last use). Variables are treated differently depending on their live range. Local variables (variables whose liveness extends only within a single basic block) will live entirely on the stack, while non-local variables (variables that live across basic blocks) will be spilled to a memory slot allocated by the compiler.
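The local versus non-local rule can be illustrated with a small Python sketch. It assumes each variable is described by the set of basic blocks in which it is live; the function name `classify` and the input shape are hypothetical and much simpler than the backend's actual liveness data.

```python
def classify(live_blocks):
    """Map each variable to "stack" (local) or "memory" (live across blocks).

    `live_blocks` maps a variable name to the set of basic blocks in which
    the variable is defined or used.
    """
    placement = {}
    for name, blocks in live_blocks.items():
        # A variable confined to one basic block can stay on the EVM stack;
        # anything that crosses a block boundary gets a memory slot.
        placement[name] = "stack" if len(blocks) == 1 else "memory"
    return placement


example = {
    "tmp": {"entry"},             # defined and consumed in one block
    "acc": {"entry", "loop"},     # live across two blocks
}
print(classify(example))
```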
Frame Objects
Frame objects are allocated either on the stack or in memory. Since each stack element is 256 bits wide, frame objects must be 256 bits in length as well. Frame objects of smaller length are not supported.
A frame object may be allocated in memory if we are consuming too much stack space. The stack allocation pass tries to find an efficient way to decide which objects go to memory and which stay on the stack.
Frame Pointer (or Free Memory Pointer)
Stack pointers and frame pointers are essential to support subroutine calls. The frame pointer is used to record the structure of stack frames. Because we do not have registers in EVM, we have to store the stack frame pointer in a memory location. Usually, we put the stack frame pointer at location `0x40`, and we follow the Solidity compiler's convention of initializing it to the value `128`. So the stack frame of the first function starts at that location. The value of the frame pointer changes as the contract calls a subroutine or exits from a subroutine. Whenever we need access to the frame pointer, we retrieve its value from that specific location.
Memory stack
Part of the memory is used as a stack for function calls and variable spills. The structure is described as follows:
- The stack grows from lower addresses to higher addresses, unlike typical hardware implementations.
- The frame is arranged into 3 parts:
  - frame object locations. Each frame object has its own frame slot. Frame object `x` has a 32-byte space starting from `$fp + (x * 32)`, where `$fp` is the frame pointer, which is stored at location `0x40`.
  - spilled variables. Variables that cannot be fully stackified reside on the memory stack. In codegen, each spilled variable has an index, and each index refers to a memory slot. A spilled variable with index `y` resides at location `$fp + (number_of_frame_objects * 32) + (y * 32)`.
  - subroutine context. Like a regular register machine, the memory stack is used to store subroutine context so as to support function calls. Two slots are allocated at the end of the current frame for a) the existing frame pointer, and b) the return `PC` address.
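The slot addresses described above can be computed with a few lines of arithmetic. The following is a minimal sketch using the formulas from the list; the function and constant names are hypothetical and chosen only for illustration.

```python
SLOT_SIZE = 32      # every slot holds one 256-bit word
INITIAL_FP = 128    # initial frame pointer value, following Solidity's convention

def frame_object_address(fp, x):
    """Address of frame object `x`: $fp + (x * 32)."""
    return fp + x * SLOT_SIZE

def spill_address(fp, number_of_frame_objects, y):
    """Address of spilled variable `y`, placed after all frame objects."""
    return fp + number_of_frame_objects * SLOT_SIZE + y * SLOT_SIZE

# A frame with two frame objects and two spilled variables (illustrative sizes).
object_addresses = [frame_object_address(INITIAL_FP, x) for x in range(2)]
spill_addresses = [spill_address(INITIAL_FP, 2, y) for y in range(2)]
```

With the frame pointer at its initial value, the two frame objects occupy the slots at 128 and 160, and the spilled variables follow directly at 192 and 224.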
Here is an example showing a stack frame right before we jump into a subroutine:
                      Stack top                             Higher address
        +-----------> +----------------------------+ <--------------+
                      |                            |
                      |       Return Address       |
                      |                            |
                      +----------------------------+
                      |                            |
                      |     Function argument      |
        new FP        |                            |
        +-----------> +----------------------------+
                      |                            |
                      |    Saved frame pointer     |
                      |      (Start of frame)      |
                      +----------------------------+
                      |                            |
                      |       Stack Object 1       |
                      |                            |
                      +----------------------------+
                      |                            |
                      |       Frame Object 2       |
                      |                            |
                      +----------------------------+
                      |                            |
                      |       Frame Object 1       |
       Start of frame |                            |        Lower address
       +------------> +----------------------------+ <----------------+
The EVM architecture is simple, but it has everything we need for ordinary program computation.
Types of calls
There are two types of calls in an EVM smart contract:
- Internal calls. Internal calls are function calls within a smart contract. For example, we have two defined functions `A` and `B`, and somewhere in `A` we save our context and change our execution flow to the beginning of `B`.
- External calls, or cross-contract calls. `A` and `B` are defined in different deployed EVM contracts, and `A` calls `B` in its context.
Internal call conventions
Up to ETH 1.5, there is no link-and-jump EVM opcode for easy handling of subroutines (even though some discussions are ongoing), so we have to handle subroutine calls manually. Here are the calling conventions for internal calls:
- The current subroutine's frame pointer is saved on the memory stack, at memory location `$fp - 32`, where `$fp` is the subroutine call's frame pointer.
- Arguments are all pushed on the stack, along with the return address. An argument with a smaller index occupies a stack slot on top of an argument with a larger index. For example, when we want to do a function call `func abc(x, y, z)`, here is the arrangement of the arguments:
                       +-----------+
                       |Return Addr|
                       +-----------+
                       |     X     |
                       +-----------+
                       |     Y     |
                       +-----------+
        Current FP     |     Z     |
        +------------> +-----------+
                       |  Old FP   |
                       +-----------+
                       |   .....   |
                       +-----------+
Note: The return address is placed on top of the stack because its location is easier to compute, but this results in more stack manipulation overhead for subroutine calls. We will improve this design in a later version.
- A subroutine's return value is stored on the stack top. Note: currently we only support one return value. In the future we will support multiple return values.
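The argument arrangement in the convention above can be modeled with a plain list acting as the operand stack. This is a hedged sketch rather than the backend's code: `arrange_call_stack` is a hypothetical helper, and the symbolic strings stand in for real 256-bit values.

```python
def arrange_call_stack(args, return_address):
    """Return the operand stack (bottom first) right before the jump."""
    stack = []
    # Push the last argument first so that arguments with a smaller index
    # end up closer to the top of the stack.
    for arg in reversed(args):
        stack.append(arg)
    # The return address goes on top of all arguments.
    stack.append(return_address)
    return stack

# func abc(x, y, z)
stack = arrange_call_stack(["x", "y", "z"], "ret")
```

Read from bottom to top, the resulting stack is `z`, `y`, `x`, `ret`, matching the diagram for `func abc(x, y, z)`.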
Procedure of a subroutine call
To make a subroutine call, we do the following to save the context of the current function's execution:
- calculate the current frame size. The frame size is the sum of: a) slots occupied by frame objects, b) slots occupied by spilled variables, and c) one more slot for storing the current frame pointer. Let's assume the frame size is calculated to be `%frame_size`.
- bump the frame pointer to `$fp = $fp + %frame_size`. After that, we can easily restore the old frame pointer by looking at location `$fp - 32`.
- push all subroutine arguments in order onto the stack.
- push the return address onto the stack. (At this moment, the return address is `PC + 6`.)
- push the beginning address of the subroutine and jump.
Right before we return from a subroutine, the stack should contain nothing but the return address, which sits at the top of the stack. When returning from a subroutine call, we should do the following:
- push the return value onto the top of the stack.
- do a `swap1` to move the return address to the top of the stack.
- jump to the return address and resume execution in the caller function. If the function returns nothing, simply jump to the return address.
After jumping back to the caller, we have to resume the execution:
- restore the caller's frame pointer by storing the value at location `$fp - 32` to `0x40`.
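The call and return steps above can be simulated with a dictionary standing in for EVM memory and a list standing in for the operand stack. This is a sketch under stated assumptions: the helper names are hypothetical, and the old frame pointer is assumed to be written to `$fp - 32` (relative to the new frame pointer) before the jump, which is where the convention says it is read back from.

```python
SLOT_SIZE = 32
FP_LOCATION = 0x40  # memory location that holds the frame pointer

def enter_call(memory, num_frame_objects, num_spills):
    """Bump the frame pointer before a call and return its new value."""
    fp = memory[FP_LOCATION]
    # frame objects + spilled variables + one slot for the saved frame pointer
    frame_size = (num_frame_objects + num_spills + 1) * SLOT_SIZE
    new_fp = fp + frame_size
    memory[new_fp - SLOT_SIZE] = fp  # assumption: old frame pointer saved here
    memory[FP_LOCATION] = new_fp
    return new_fp

def return_sequence(stack, return_value):
    """Callee side: push the result, swap1, then pop the jump target."""
    stack.append(return_value)
    stack[-1], stack[-2] = stack[-2], stack[-1]  # swap1
    return stack.pop()

def leave_call(memory):
    """Caller side: restore the frame pointer from location $fp - 32."""
    fp = memory[FP_LOCATION]
    memory[FP_LOCATION] = memory[fp - SLOT_SIZE]
    return memory[FP_LOCATION]

memory = {FP_LOCATION: 128}
callee_fp = enter_call(memory, num_frame_objects=2, num_spills=1)
stack = ["ret"]
jump_target = return_sequence(stack, 42)
restored_fp = leave_call(memory)
```

After the round trip the frame pointer is back at its original value and the operand stack holds only the return value, which is what the caller expects when it resumes.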
EIP2315 Support: Subroutine calls
Support for subroutines inside the EVM enables the compiler to generate better-performing code. More specifically, with EIP2315 it is up to the EVM to maintain the return address stack:
- the return address stack is only accessible to the VM
- the stack is invisible to users and compilers
A better calling convention is made possible by EIP2315:
To generate a call procedure
- calculate the current frame size. The frame size is the sum of: a) slots occupied by frame objects, b) slots occupied by spilled variables, and c) one more slot for storing the current frame pointer. Let's assume the frame size is calculated to be `%frame_size`.
- save the existing frame pointer at memory location `$fp + %frame_size - 32`. The frame pointer is maintained at `0x40`.
- bump the frame pointer to `$fp = $fp + %frame_size`. After that, we can easily restore the old frame pointer by looking at location `$fp - 32`.
- push all subroutine arguments in order onto the stack.
- push the beginning address of the subroutine and call `JUMPSUB`.
To generate the return
- push the return value onto the top of the stack.
- call `RETURNSUB` to resume execution of the caller function.
External calls
External calls are implemented using intrinsic calls.
## Newly supported Types
So far the open-source LLVM trunk has not implemented support for bit sizes larger than 128 bits. We have implemented 256-bit support in our own backend and are considering contributing it back to the main trunk.
Users are allowed to use the `i256` and `i160` data types in their generated LLVM IR, which represent 256-bit and 160-bit integer types respectively.
Even though all EVM data types are 256 bits long internally, we are still able to support smaller data types. However, users are encouraged to use 256-bit data types because they come at no extra cost.
Contract Input Argument Types -- The Solidity convention
Contract arguments are passed to the EVM via the call data field. The function dispatcher is responsible for extracting input arguments from call data.
In Solidity's convention, each argument in call data is padded to 32 bytes if its data type is shorter. So, in order to maintain the convention, the function dispatcher needs to truncate each input argument to the size defined in the function that is going to be called.
This is undoubtedly inefficient, so users are discouraged from using smaller data types.
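The truncation step can be illustrated with a small sketch. It is not the dispatcher's actual code: `load_argument` is a hypothetical helper, it assumes the Solidity ABI layout where a 4-byte function selector precedes the 32-byte argument words, and it only covers unsigned integer types (signed types are sign-extended rather than zero-padded).

```python
WORD_BYTES = 32  # each argument occupies one 32-byte word in call data

def load_argument(call_data, index, bit_width, selector_bytes=4):
    """Read the `index`-th argument word and truncate it to `bit_width` bits."""
    start = selector_bytes + index * WORD_BYTES
    word = call_data[start:start + WORD_BYTES]
    if len(word) != WORD_BYTES:
        raise ValueError("call data too short for requested argument")
    value = int.from_bytes(word, "big")
    # Keep only the low `bit_width` bits, as the callee's parameter type demands.
    return value & ((1 << bit_width) - 1)

# Illustrative call data: a zeroed selector followed by two argument words.
call_data = (
    bytes(4)
    + (0x1234).to_bytes(WORD_BYTES, "big")
    + ((1 << 256) - 1).to_bytes(WORD_BYTES, "big")
)
small = load_argument(call_data, 0, 8)     # truncated to 8 bits
address = load_argument(call_data, 1, 160)  # truncated to 160 bits
```

The extra mask per argument is the overhead the text refers to; a parameter declared as a full 256-bit type needs no mask at all.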