
Ithemal Extension

This repository contains the code for our extension to Ithemal, developed for the EECS 583 class at the University of Michigan.

EECS 583 Group 29

Installation Instructions

Clone the repository:

git clone https://github.com/thomasehuang/Ithemal-Extension.git
cd Ithemal-Extension

Install all the necessary packages:

pip install -r requirements.txt

Install common_libs:

cd common
pip install -e .

Install DynamoRIO for the tokenizer:

mkdir -p /path/to/dynamorio
curl -sL https://github.com/DynamoRIO/dynamorio/releases/download/release_7_0_0_rc1/DynamoRIO-Linux-7.0.0-RC1.tar.gz | tar xz -C /path/to/dynamorio

Build data tools (requires DynamoRIO):

cd Ithemal-Extension
mkdir build; cd build
cmake -DDynamoRIO_DIR=/path/to/dynamorio/cmake/folder ..
make

Dataset

We implemented our own data collection pipeline to build the dataset for our project, which can be found here: https://github.com/zhaola/583DataCollection.

The structure of the dataset is as follows:

dataset
├── function1                   # contains binary files, one per basic block in the function
│   ├── basicblock1.out         # binary file for a single basic block
│   ├── basicblock2.out
│   ├── ...
│   ├── basicblockM.out
│   └── CFG.json/xml            # profile information about the function
├── function2
├── ...
├── functionN
└── labels.csv                  # list of function -> throughput mappings
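The layout above can be traversed programmatically. Below is a minimal sketch; it assumes labels.csv holds header-less `function,throughput` rows, which this README does not specify, so adjust the parsing to your actual labels file:

```python
import csv
import os

def load_dataset(root):
    """Walk the dataset layout described above.

    Returns a dict mapping function name -> (list of basic-block binary
    paths, throughput label). ASSUMPTION: labels.csv contains
    `function,throughput` rows with no header.
    """
    # Read the function -> throughput mapping.
    labels = {}
    with open(os.path.join(root, "labels.csv")) as f:
        for row in csv.reader(f):
            if len(row) >= 2:
                labels[row[0]] = float(row[1])

    dataset = {}
    for func, throughput in labels.items():
        func_dir = os.path.join(root, func)
        # One .out binary per basic block in the function.
        blocks = sorted(
            os.path.join(func_dir, name)
            for name in os.listdir(func_dir)
            if name.endswith(".out")
        )
        dataset[func] = (blocks, throughput)
    return dataset
```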

Save Basic Block Embeddings

To save the final embedding of a basic block, use the --save-embed flag when running the predict.py script. This will automatically save the embedding to the same directory as the input binary. For usage on our dataset, set --files to point to the dataset in the above format and use the --extend flag:

python learning/pytorch/predict.py --verbose --save-embed --extend \
    --model ithemal-models/hsw.dump \
    --model-data ithemal-models/hsw.mdl \
    --files learning/pytorch/examples/test_dataset/

To execute on a batch of binary files, pass multiple paths to the --files argument:

python learning/pytorch/predict.py --verbose --save-embed \
    --model ithemal-models/hsw.dump \
    --model-data ithemal-models/hsw.mdl \
    --files learning/pytorch/examples/example1.out \
            learning/pytorch/examples/example2.out \
            learning/pytorch/examples/example3.out
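Listing every path by hand gets tedious for large batches; the file list can be assembled with a glob instead. A sketch that builds the equivalent invocation (the flags mirror the examples above; run the result with subprocess.run):

```python
import glob
import sys

def build_predict_command(dataset_dir, model, model_data):
    """Assemble the predict.py command line for every .out file under
    dataset_dir, mirroring the multi-file --files usage shown above."""
    files = sorted(glob.glob(f"{dataset_dir}/**/*.out", recursive=True))
    if not files:
        raise ValueError(f"no .out files found under {dataset_dir}")
    return [
        sys.executable, "learning/pytorch/predict.py",
        "--verbose", "--save-embed",
        "--model", model,
        "--model-data", model_data,
        "--files", *files,
    ]
```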

Training

To train our extended version of Ithemal, you can run the following:

python learning/pytorch/run_ithemal_extend.py \
    --data learning/pytorch/examples/test_dataset/ --use-rnn \
    train --experiment-name TEST_RUN --experiment-time TODAY \
    --sgd --trainers 1 --weird-lr --decay-lr --epochs 20

See all available config parameters in learning/pytorch/run_ithemal_extend.py. Checkpoints of the trained model will be saved in learning/pytorch/saved/TEST_RUN/TODAY/checkpoints/ under the corresponding timestamp for the run.
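When evaluating with --load-file, you usually want the newest checkpoint from that directory. The checkpoint filename convention is not documented here, so this helper selects purely by modification time:

```python
import os

def latest_checkpoint(checkpoint_dir):
    """Return the most recently modified file in checkpoint_dir,
    e.g. learning/pytorch/saved/TEST_RUN/TODAY/checkpoints/."""
    entries = [
        os.path.join(checkpoint_dir, name)
        for name in os.listdir(checkpoint_dir)
    ]
    files = [p for p in entries if os.path.isfile(p)]
    if not files:
        raise FileNotFoundError(f"no checkpoints in {checkpoint_dir}")
    return max(files, key=os.path.getmtime)
```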

Evaluation

To test the model, use the --test flag and specify the path to the model checkpoint file with --load-file:

python learning/pytorch/run_ithemal_extend.py \
    --data learning/pytorch/examples/test_dataset/ --use-rnn \
    train --experiment-name TEST_RUN --experiment-time TODAY \
    --test --load-file /path/to/model/checkpoint

Below is the README for the original Ithemal repository, which can be found here: https://github.com/ithemal/Ithemal.


Overview

Ithemal is a data-driven model for predicting the throughput of a basic block of x86-64 instructions.

More details about Ithemal's approach can be found in our paper.

A demo of Ithemal's Skylake model can be found here. It takes in a basic block in any standard syntax, and generates a prediction for the throughput of that block. Ithemal does not currently handle UNK tokens, so this model will likely not generate a prediction for any blocks with esoteric instructions or operands.

Usage

Environment Setup

You first need to install docker and docker-compose.

It is easiest to run Ithemal within the provided docker environment. To build the docker environment, run docker/docker_build.sh. No user interaction is required during the build, despite the various prompts seemingly asking for input.

Once the docker environment is built, connect to it with docker/docker_connect.sh. This will drop you into a tmux shell in the container. It will also start a Jupyter notebook exposed on port 8888 with the password ithemal; nothing depends on this except your own convenience, so feel free to disable the notebook by removing the port forwarding on lines 37 and 38 in docker/docker-compose.yml.

The file system in the container is mounted from the local file system, so changes to the file system on the host will propagate to the docker instance, and vice versa.

The container will continue running in the background, even if you exit, and can be stopped with docker/docker_stop.sh from the host machine. To detach from the container while keeping jobs running, use the normal tmux detach command (Control-b d); running docker/docker_connect.sh again will drop you back into the same session.

Prediction

Models can be downloaded from the Ithemal models repository. Models are split into two parts: the model architecture and the model data (this is an unfortunate historical artifact more than a good design decision). The model architecture contains the code and the token embedding map for the model, and the model data contains the learned model tensors. The versions of the model reported in the paper are:

Command Line

Ithemal can be used as a drop-in replacement for the throughput prediction capabilities of IACA or llvm-mca via the learning/pytorch/ithemal/predict.py script. Ithemal uses the same convention as IACA to denote what code should be predicted; this can be achieved with any of the files in learning/pytorch/examples, or by consulting the IACA documentation.

Once you have downloaded one of the models from the previous section, and you have compiled a piece of code to some file, you can generate a prediction for this code with something like:

python learning/pytorch/ithemal/predict.py --verbose --model predictor.dump --model-data trained.mdl --file a.out

Training

Data

Representation

Raw data is represented as a list of tuples containing (code_id, timing, code_intel, code_xml), where code_id is a unique identifier for the code, timing is the float representing the timing of the block, code_intel is the newline-separated human-readable block string (this is only for debugging and can be empty or None), and code_xml is the result of the tokenizer on that block (detailed in the Canonicalization section). To store datasets, we torch.save and torch.load this list of tuples. The first 80% of a dataset is loaded as the train set, and the last 20% is the test set. For an example of one of these datasets, look at a small sample of our training data.
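The tuple layout and the 80/20 split can be sketched as follows. The repo persists these lists with torch.save/torch.load; plain tuples are used here so the sketch carries no torch dependency:

```python
# Each raw-data item is (code_id, timing, code_intel, code_xml);
# code_intel may be None. In the repo the full list is written with
# torch.save and read back with torch.load.
def make_example(code_id, timing, code_intel, code_xml):
    return (code_id, float(timing), code_intel, code_xml)

def train_test_split(data):
    """First 80% of the list is the train set, last 20% the test set,
    matching the loading convention described above."""
    cut = int(len(data) * 0.8)
    return data[:cut], data[cut:]
```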

Canonicalization

To canonicalize a basic block so that it can be used as input for Ithemal, first get the hex representation of the basic block you want to predict (e.g. via xxd, objdump, or equivalent). For instance, the instruction push 0x000000e3 is represented in hex as 68e3000000. Next, run the tokenizer as follows:

data_collection/build/bin/tokenizer {HEX_CODE} --token

which will output an XML representation of the basic block, with all implicit operands enumerated.
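That example byte sequence can be reproduced by hand: push imm32 is opcode 0x68 followed by the 32-bit immediate in little-endian byte order:

```python
def encode_push_imm32(imm):
    """Encode `push imm32`: opcode 0x68, then the immediate as four
    little-endian bytes."""
    return (bytes([0x68]) + imm.to_bytes(4, "little")).hex()

# push 0x000000e3 -> "68e3000000", matching the example above.
```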

Model Training

To train a model, pick a suitable EXPERIMENT_NAME and EXPERIMENT_TIME, and run the following command:

python learning/pytorch/ithemal/run_ithemal.py --data {DATA_FILE} --use-rnn train --experiment-name {EXPERIMENT_NAME} --experiment-time {EXPERIMENT_TIME} --sgd --threads 4 --trainers 6 --weird-lr --decay-lr --epochs 100

which will train a model with the parameters reported in the paper. The results are saved into learning/pytorch/saved/EXPERIMENT_NAME/EXPERIMENT_TIME/, which we will refer to as RESULT_DIR from now on. The training loss is printed live as the model is trained, and is also saved into RESULT_DIR/loss_report.log, a tab-separated list of epoch, elapsed time, training loss, and number of active parallel trainers. The results of the trained model on the test set are stored in RESULT_DIR/validation_results.txt, which consists of a list of the predicted,actual values of each item in the test set, followed by the overall loss of the trained model on the test set. Finally, the resulting trained models and predictor dumps (for use in the command line API above) are saved in RESULT_DIR/trained.mdl and RESULT_DIR/predictor.dump respectively.
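The two result files can be post-processed with a few lines of Python. The loss-report format below follows the description above; the exact trailer of validation_results.txt is not specified here, so non-pair lines are simply returned untouched:

```python
import csv

def read_loss_report(path):
    """Parse loss_report.log: tab-separated rows of
    (epoch, elapsed time, training loss, active trainers)."""
    rows = []
    with open(path) as f:
        for epoch, elapsed, loss, trainers in csv.reader(f, delimiter="\t"):
            rows.append((int(epoch), float(elapsed), float(loss), int(trainers)))
    return rows

def read_validation_results(path):
    """Parse validation_results.txt: `predicted,actual` lines, followed
    by the overall-loss trailer. ASSUMPTION: any line that is not a
    numeric pair belongs to the trailer."""
    pairs, trailer = [], []
    with open(path) as f:
        for line in f:
            parts = line.strip().split(",")
            if len(parts) == 2:
                try:
                    pairs.append((float(parts[0]), float(parts[1])))
                    continue
                except ValueError:
                    pass
            if line.strip():
                trailer.append(line.strip())
    return pairs, trailer
```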

Code Structure

  • The command-line and training infrastructure is in learning/pytorch/ithemal/.
  • The models and optimizer are in learning/pytorch/models/graph_models.py and learning/pytorch/models/train.py.
  • The data infrastructure is primarily in learning/pytorch/data/data_cost.py.
  • The representation of instructions and basic blocks is in common/common_libs/utilities.py.
  • The canonicalization and tokenization are in data_collection/tokenizer/tokenizer.c and data_collection/common/.

Data Collection

This repo contains a preliminary version of our data collection infrastructure; we plan on releasing a more detailed description and infrastructure soon.

About

License: MIT

