This repository is a modification of the original by Yu et al.
I have made, or plan to make, the following changes to make the code a benchmark for real-time applications:
- Removed the ability to train; the code only runs inference on a pretrained network.
- Added more command-line flags, making it easier for other scripts to configure parameters.
- Changed the way the data is loaded to avoid reading from disk.
- Added code for measuring times and rate-limiting jobs.
- Various other changes to make it more suitable to my personal preferences.
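To illustrate what measuring times and rate-limiting jobs can look like, here is a minimal sketch using only the Python standard library. It is not the code used in rtbenchmark.py; the function name run_rate_limited and its parameters are hypothetical, and job stands in for whatever a single inference call would be.

```python
import time


def run_rate_limited(job, iterations, max_rate_hz):
    """Call job() `iterations` times, releasing one job per period.

    Hypothetical helper for illustration only. Returns a list with the
    measured duration of each job, in seconds.
    """
    period = 1.0 / max_rate_hz
    durations = []
    next_release = time.perf_counter()
    for _ in range(iterations):
        # Wait until the release time so a job never starts early.
        delay = next_release - time.perf_counter()
        if delay > 0:
            time.sleep(delay)
        start = time.perf_counter()
        job()
        durations.append(time.perf_counter() - start)
        # Advance by one period, but never schedule a release in the past:
        # after an overrun the next job starts right away instead of
        # triggering a burst of catch-up jobs.
        next_release = max(next_release + period, time.perf_counter())
    return durations


if __name__ == "__main__":
    # Stand-in workload; a real benchmark would run inference here.
    times = run_rate_limited(lambda: sum(range(10000)), 5, 100.0)
    print("max job time: %.6f s" % max(times))
```

Computing each release time from the previous one, rather than sleeping for a fixed interval after each job, keeps the job's own running time from stretching the period.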
Code in this repository is still in a state of flux as I remove experimental, unused data and work to reduce the number of difficult dependencies.
This project, being primarily used for my own research, requires many specific
modified dependencies. You will likely need to install all of the following
projects on your system in order to use full functionality. (The minimal
rtbenchmark.py may be able to work without locking or other management if
some of these aren't present.) They should be installed in the following order:
- My modifications to ROCm. Note that this requires a specific ROCm version to be installed. It adds userspace support for per-kernel GPU locking and subdividing large memory-transfer requests.
- The rocm_helper Python library. This is part of the same rocm_mega_repo, but I'm listing it separately so it won't be overlooked. Follow the instructions in the rocm_helper_python directory in the above repository. It's used to create streams with CU masks within PyTorch scripts.
- My modified version of PyTorch. Install this from source, using the same instructions you would when installing vanilla PyTorch from source. (It has been run through hipify already, so you don't need to do so again.) This fixes support for using external CUDA/HIP streams, which ought to be present in vanilla PyTorch but was apparently never fully integrated. Maybe I'll submit a patch to upstream about this some day.
- My GPU-locking kernel module. While not used for locking any more (I use my newer KFMLP module instead), this still provides a user-accessible interface for evicting running tasks off of an AMD GPU.
- My KFMLP locking support. This consists of a Linux kernel module and a Python library for interacting with it. The kernel module provides a k-exclusion lock and a few other convenience features, such as a barrier allowing a task system to be released only when all tasks are ready, and an API for switching a process to use the SCHED_FIFO scheduler, bypassing kernel permission checks.
- The pre-computed data blobs. This will, in turn, require the ImageNet dataset on disk in the same layout expected by the original repo. Once you have the ImageNet dataset in the correct layout, change the dataset_dir field in config.py to point to your dataset location. (On my system, the dataset_dir contains three directories: test, train, and val. The test directory is full of JPG images, while the other two are full of directories containing JPG images. I'm not sure specifically what is required by the original slimmable_networks code responsible for parsing this structure.) Once you have this directory set up, run python generate_data_blobs.py to generate input_data_raw.bin and result_data_raw.bin. Note that input_data_raw.bin is about 6 GB. This entire file is buffered into memory at runtime to simplify loading logic.
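As a rough illustration of buffering an entire blob into memory, the sketch below reads a file in one call and exposes it as fixed-size records without copying. The function name load_blob and the record_size parameter are hypothetical; the actual layout of input_data_raw.bin is determined by generate_data_blobs.py and may not consist of fixed-size records at all.

```python
def load_blob(path, record_size):
    """Read the whole file into memory and split it into fixed-size records.

    Hypothetical helper for illustration only. Each returned record is a
    memoryview slice over one shared buffer, so no data is copied.
    """
    if record_size <= 0:
        raise ValueError("record_size must be positive")
    with open(path, "rb") as f:
        data = f.read()  # One read; the whole file now lives in memory.
    if len(data) % record_size != 0:
        raise ValueError("file size is not a multiple of record_size")
    view = memoryview(data)
    return [view[i:i + record_size] for i in range(0, len(data), record_size)]
```

Reading everything up front trades memory for simplicity: after the initial load, fetching a record is a slice of an in-memory buffer, so no disk I/O happens while jobs are being timed.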
After setting up all of the above dependencies, you should be able to run
rtbenchmark.py using Python 3. Run python rtbenchmark.py --help for usage
information.