This project contains submodules, please run:
git clone --recursive git@github.com:sunxfancy/IPRA-exp.git
or, if you have already cloned the repository, run:
git submodule update --init --recursive
Required tools: SingularityCE. Install it by following the documentation: https://docs.sylabs.io/guides/3.11/user-guide/
- Run `make singularity/image` to build the image; you may need root privileges.
- Run `./make` to build the clang compiler, autofdo, and some small tools.
- Run `./make benchmarks/${bench}/${target}` to build a specific target of one of the benchmarks.
- Run `./make benchmarks/${bench}/${target}.bench` to run the benchmark for this target.
- Run `./make benchmarks/${bench}/${target}.regprof3` to run the register and spill code profiling for this target.
E.g., run the register and spill code profiling of the clang benchmark for the thin-lto build with all features enabled (hot function threshold = 3, Callsite Cold Ratio = 20):
./make benchmarks/clang/pgo-full-bfdoipra6.3-20.regprof3
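The target path in the example above is assembled from a benchmark name, a target variant, the two thresholds, and an action suffix. The sketch below shows that composition; `make_target` is a hypothetical helper and not part of the repository, and the command is only printed, not executed.

```shell
# Compose a ./make target path from its parts.
# make_target is a hypothetical helper used for illustration only.
make_target() {
  # $1 = benchmark, $2 = target variant, $3 = hot function threshold,
  # $4 = Callsite Cold Ratio, $5 = action suffix (bench or regprof3)
  printf 'benchmarks/%s/%s.%s-%s.%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Print the command instead of running it, so the sketch is safe to try anywhere.
echo "./make $(make_target clang pgo-full-bfdoipra6 3 20 regprof3)"
```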
Run `./make benchmarks` to run all benchmarks.
Two folders will be generated: `build` and `tmp`. In the `build/benchmarks` folder, you will find the benchmark results:
- `<target>.bench`: the result of `perf stat` after running 5 times
- `<target>.regprof3`: the push/pop counts and the spill code size
You can use the Python notebook `benchmarks/report.ipynb` to view this data quickly; run `./make jupyter` to start a Jupyter session.
After building the compiler, the folder `install/llvm` contains the clang compiler. The FDOIPRA-specific compiler flags are:
- `-mllvm -fdo-ipra`: enable FDOIPRA.
- `-mllvm -fdoipra-both-hot=false`: control whether the optimization applies to functions whose entry is hot or that have a hot loop in their body. When it is `false`, the optimization applies only to functions whose entry is hot. (Default: true)
- `-fdoipra-cc=1`: enable the optimization for Cold Callsite and Cold Callee. (Default: true)
- `-fdoipra-ch=1`: enable the optimization for Cold Callsite and Hot Callee. (Default: false)
- `-fdoipra-hc=1`: enable the optimization for Hot Callsite and Cold Callee. (Default: false)
- `-fdoipra-ccr=10`: set the Callsite Cold Ratio. (Default: 10.0)
The build has three steps:
- Build the pgo-full version: instrument with `-fprofile-generate`, collect the profile, and then rebuild with PGO and FullLTO.
- Use perf to collect sampling data and convert the profile with `hot-list-creator` to generate a hot function list.
- Build the final binary.

You can check out the small example here: how-to-use/build.mk
- clang
- gcc
- mysql
- leveldb
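A loop over the benchmark names listed above can queue the same target for each of them. The sketch below only prints the commands rather than running them; `pgo-full` is used as the example target, and `bench_commands` is a hypothetical helper.

```shell
BENCHMARKS="clang gcc mysql leveldb"
TARGET=pgo-full   # example target; other target names follow the same pattern

# Print one ./make command per benchmark (dry run only).
bench_commands() {
  for bench in $BENCHMARKS; do
    echo "./make benchmarks/$bench/$TARGET.bench"
  done
}

bench_commands
```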
- `pgo-full`: PGO + FullLTO build, the baseline
- `pgo-full-ipra`: PGO + IPRA + FullLTO build, the existing solution
The following 3 apply the optimization to functions whose entry is hot:
- `pgo-full-fdoipra`: PGO + FullLTO with ColdCallSite-ColdCallee using no_caller_saved_registers attributes
- `pgo-full-fdoipra2`: PGO + FullLTO with ColdCallSite-ColdCallee and ColdCallsite-HotCallee using no_caller_saved_registers attributes and a proxy call
- `pgo-full-fdoipra3`: PGO + FullLTO with ColdCallSite-ColdCallee, ColdCallsite-HotCallee and indirect calls using no_caller_saved_registers attributes and a proxy call
The following 3 apply the optimization to functions whose entry is hot or that have a hot loop in their body:
- `pgo-full-bfdoipra`: PGO + FullLTO with ColdCallSite-ColdCallee using no_caller_saved_registers attributes
- `pgo-full-bfdoipra2`: PGO + FullLTO with ColdCallSite-ColdCallee and ColdCallsite-HotCallee using no_caller_saved_registers attributes and a proxy call
- `pgo-full-bfdoipra3`: PGO + FullLTO with ColdCallSite-ColdCallee, ColdCallsite-HotCallee and indirect calls using no_caller_saved_registers attributes and a proxy call
The following 6 apply no_callee_saved_registers to cold caller and hot callee functions, and correspond to the previous 6 items:
- `pgo-full-fdoipra4`: PGO + FullLTO with ColdCallSite-ColdCallee using no_caller_saved_registers attributes
- `pgo-full-fdoipra5`: PGO + FullLTO with ColdCallSite-ColdCallee and ColdCallsite-HotCallee using no_caller_saved_registers attributes and a proxy call
- `pgo-full-fdoipra6`: PGO + FullLTO with ColdCallSite-ColdCallee, ColdCallsite-HotCallee and indirect calls using no_caller_saved_registers attributes and a proxy call
- `pgo-full-bfdoipra4`: PGO + FullLTO with ColdCallSite-ColdCallee using no_caller_saved_registers attributes
- `pgo-full-bfdoipra5`: PGO + FullLTO with ColdCallSite-ColdCallee and ColdCallsite-HotCallee using no_caller_saved_registers attributes and a proxy call
- `pgo-full-bfdoipra6`: PGO + FullLTO with ColdCallSite-ColdCallee, ColdCallsite-HotCallee and indirect calls using no_caller_saved_registers attributes and a proxy call
Each target has variants that configure its two thresholds. Format: `pgo-full-fdoipra.A-B`
- Number A: the number of times a function must appear in the sampling data to be considered hot. E.g., 3 means that a function seen 3 times in the perf sampling data is treated as a hot function.
- Number B: the Callsite Cold Ratio, which decides from the PGO profiling data which callsites are considered cold. E.g., 10 means that a callsite is cold if the hit frequency at the entry of the caller function is 10 times larger than the hit frequency at the callsite.

Available A: 1 3 5 10. Available B: 10 20.
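The variant naming and the meaning of B can be sketched as follows. `is_cold_callsite` is a hypothetical helper that only restates the rule above with integer arithmetic; the real check lives in the FDOIPRA pass and may differ in detail (for example, a strict versus a non-strict comparison, or a fractional ratio).

```shell
# Hypothetical restatement of the Callsite Cold Ratio rule (integer inputs only).
# $1 = hit frequency at the caller's entry, $2 = hit frequency at the callsite,
# $3 = Callsite Cold Ratio (B). Assumes ">=" is the intended comparison.
is_cold_callsite() {
  [ "$1" -ge $(( $2 * $3 )) ]
}

# Enumerate every available A-B variant of one target.
for a in 1 3 5 10; do
  for b in 10 20; do
    echo "pgo-full-fdoipra.$a-$b"
  done
done

# Example: entry hit 1000 times, callsite hit 50 times, B = 10.
if is_cold_callsite 1000 50 10; then
  echo "cold callsite"
fi
```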
IPRA-exp/fdoipra.patch at main · sunxfancy/IPRA-exp (github.com)
IPRA-exp/fix.patch at main · sunxfancy/IPRA-exp (github.com)
Build LLVM and Clang
Build modified version of autofdo (dev branch) sunxfancy/autofdo at dev (github.com)
a. Build instrumented version:
-fprofile-generate=<output_path>
b. Run tests and profile merge:
… // run tests
cd <output_path> && llvm-profdata merge -output=instrumented.profdata *
c. Build optimized version:
-flto=thin -fprofile-use=instrumented.profdata -Wl,--lto-basic-block-sections=labels -Wl,--build-id
Other flags to avoid bugs:
-fno-optimize-sibling-calls -Wl,-mllvm -Wl,-fast-isel=false -Wl,-Bsymbolic-non-weak-functions
Other flags for best performance:
-fsplit-machine-functions
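Steps a to c above can be strung together roughly as follows. This is a dry-run sketch: by default it only prints each command. The compiler path, the source file `app.c`, the binary names, and the profile directory are all hypothetical, and `run` is a helper introduced here for illustration.

```shell
DRY_RUN=${DRY_RUN:-1}           # 1 = print commands only, 0 = execute them
run() {
  if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

CC=install/llvm/bin/clang       # assumed location of the built compiler
PROFILE_DIR=/tmp/pgo-profile    # hypothetical <output_path>

# a. Build the instrumented version.
run "$CC" -O2 -fprofile-generate="$PROFILE_DIR" app.c -o app.instr
# b. Run the tests, then merge the raw profiles.
run ./app.instr
run llvm-profdata merge -output="$PROFILE_DIR/instrumented.profdata" "$PROFILE_DIR"/*.profraw
# c. Build the optimized version with the merged profile.
run "$CC" -O2 -flto=thin -fprofile-use="$PROFILE_DIR/instrumented.profdata" \
  -Wl,--lto-basic-block-sections=labels -Wl,--build-id app.c -o app.pgo
```

Set `DRY_RUN=0` to execute the commands once the placeholder names have been replaced with real ones.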
a. Run PGO Optimized Version with perf record
perf record -e cycles:u -j any -o samples -- <PGO VERSION> <arguments>
b. Generate hotlist
# --detail is for debugging
# --hot_threshold=3 marks a function as hot once it has 3 counts in the samples
hot_list_creator \
--binary="<PGO VERSION>" \
--profile="<Sample>" \
--output="hot_list" \
--detail="detail" \
--hot_threshold=3
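The effect of the hot threshold can be illustrated with a toy model. In the sketch below, each input line stands in for one sample and holds a function name; this input format is made up for illustration, and the real tool reads the perf data and the binary instead. `hot_functions` is a hypothetical helper.

```shell
# Toy model of the thresholding step: print every function that appears in at
# least <threshold> samples. $1 = threshold; samples are read from stdin.
hot_functions() {
  awk -v threshold="$1" '
    { count[$0]++ }
    END { for (name in count) if (count[name] >= threshold) print name }
  ' | sort
}

# foo appears 3 times, bar twice, baz once; with a threshold of 3 only foo is hot.
printf '%s\n' foo bar foo baz foo bar | hot_functions 3
```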
The last step is to build the final binary using two different kinds of information:
a. The PGO profile data
b. The hot list
Build FDOIPRA
-Wl,-mllvm -Wl,-fdo-ipra -Wl,-mllvm -Wl,-fdoipra-new-impl -Wl,-mllvm -Wl,-fdoipra-both-hot=false -Wl,-mllvm -Wl,-fdoipra-ch=1 -Wl,-mllvm -Wl,-fdoipra-hc=1 -Wl,-mllvm -Wl,-fdoipra-use-caller-reg=1
Flags for the hot list:
-Wl,-mllvm -Wl,-fdoipra-hot-list=hot_list
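Because every LLVM option has to be forwarded through the linker, the flag lists get long and are easy to mistype. The sketch below assembles them with a small helper; `wl_mllvm` is hypothetical, and it assumes each option needs its own `-Wl,-mllvm -Wl,` pair. The profile and hot list file names are the ones used in the earlier steps.

```shell
# Wrap each LLVM option so the compiler driver forwards it to the linker,
# which in turn hands it to LLVM during LTO code generation.
wl_mllvm() {
  out=""
  for opt in "$@"; do
    out="$out -Wl,-mllvm -Wl,$opt"
  done
  printf '%s\n' "${out# }"   # drop the leading space
}

FDOIPRA_FLAGS=$(wl_mllvm -fdo-ipra -fdoipra-new-impl -fdoipra-hot-list=hot_list)
LDFLAGS="-flto=thin -fprofile-use=instrumented.profdata $FDOIPRA_FLAGS"
echo "$LDFLAGS"
```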
Flag | Description |
---|---|
fdo-ipra | Enable the FDOIPRA pass. |
fdoipra-new-impl | Select the newer of the two implementations; it supports ThinLTO and is the recommended one. |
fdoipra-both-hot | False: only functions that are hot at entry are considered. True: functions that are hot at entry or in their body are both considered as candidates for optimization. |
fdoipra-ch | Enable the optimization for cold-callsite-hot-callee. |
fdoipra-hc | Enable the optimization for hot-callsite-cold-callee. |
fdoipra-use-caller-reg | Enable the optimization that converts callee-saved registers to caller-saved. |