shinezyy / DirtyStuff

A personal repo containing all the tools that I created for arch research.

Distributed gem5 checkpoint runner

yqszxx opened this issue · comments

Doc WIP.

  • 4 servers, s{1-4}, 32 cores, 128G RAM
  • /home mounted as nfs and shared across servers
  • build in local podman container
  • parallel is GNU Parallel
branch=dep-check && \
podman exec --workdir /ff-reshape 02 git fetch --all && \
podman exec --workdir /ff-reshape 02 git checkout $branch -f && \
podman exec --workdir /ff-reshape 02 git pull -f && \
gitid=$(podman exec --workdir /ff-reshape 02 git describe --always --dirty) && \
podman exec --workdir /ff-reshape 02 sed -i -e '/opt/ s/-g//' src/SConscript && \
podman exec --workdir /ff-reshape 02 scons build/RISCV/gem5.opt -j22 && \
rm -rf gem5.opt configs && \
ssh s1 "rm -rf /home/yqszxx/gem5.opt /home/yqszxx/configs" && \
podman cp 02:/ff-reshape/build/RISCV/gem5.opt . && \
scp gem5.opt s1:/home/yqszxx/gem5.opt && \
podman cp 02:/ff-reshape/configs . && \
find configs -name \*.pyc -delete && \
scp -r configs s1:/home/yqszxx/configs && \
starttime=$(date +%y%m%d-%H%M%S) && \
parallel -S 28/s1,28/s2,28/s3,28/s4 -a /home/yqszxx/checkpoints/chkpt50.list \
/home/yqszxx/gem5.opt \
--outdir /home/yqszxx/result/$gitid-$starttime/{} \
/home/yqszxx/configs/example/fs.py \
--cpu-type NonCachingSimpleCPU \
--mem-size=8GB \
--maxinsts 50000000 \
--gcpt-restorer /home/yqszxx/gcpt-restorer \
--generic-rv-cpt /home/yqszxx/checkpoints/{}/0/*.gz \
--depcheck

For SPEC17:

branch=dep-check && \
podman exec --workdir /ff-reshape 02 git fetch --all && \
podman exec --workdir /ff-reshape 02 git checkout $branch -f && \
podman exec --workdir /ff-reshape 02 git pull -f && \
gitid=$(podman exec --workdir /ff-reshape 02 git describe --always --dirty) && \
podman exec --workdir /ff-reshape 02 sed -i -e '/opt/ s/-g//' src/SConscript && \
podman exec --workdir /ff-reshape 02 scons build/RISCV/gem5.opt -j22 && \
rm -rf gem5.opt configs && \
ssh s2 "rm -rf /home/yqszxx/gem5.opt /home/yqszxx/configs" && \
podman cp 02:/ff-reshape/build/RISCV/gem5.opt . && \
scp gem5.opt s2:/home/yqszxx/gem5.opt && \
podman cp 02:/ff-reshape/configs . && \
find configs -name \*.pyc -delete && \
scp -r configs s2:/home/yqszxx/configs && \
starttime=$(date +%y%m%d-%H%M%S) && \
parallel -S 28/s2,28/s3,28/s4 -a /home/yqszxx/checkpoints17/chkpt50.list \
/home/yqszxx/gem5.opt \
--outdir /home/yqszxx/result17/$gitid-$starttime/{} \
/home/yqszxx/configs/example/fs.py \
--cpu-type NonCachingSimpleCPU \
--mem-size=8GB \
--maxinsts 20000000 \
--gcpt-restorer /home/yqszxx/gcpt-restorer \
--generic-rv-cpt /home/yqszxx/checkpoints17/{}/0/*.gz \
--depcheck

Result process:

for f in */*.txt; do
testcase=$(echo $f | cut -d/ -f1);
dep=$(grep dependent $f | sed -E 's/\s+/,/g' | cut -d, -f2);
inter=$(grep interGroup $f | sed -E 's/\s+/,/g' | cut -d, -f2);
intra=$(grep intraGroup $f | sed -E 's/\s+/,/g' | cut -d, -f2);
echo $testcase,$dep,$inter,$intra >> summary.csv;
done
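
The per-field extraction in the loop above could be factored into a small helper. This is only a sketch: `extract_stat` is a hypothetical name, and it assumes every stat line starts with the stat name followed by whitespace and the value, which is what the `sed`/`cut` pipeline implies. Like `grep`, it prints one value per matching line.

```shell
# Hypothetical helper: print the second whitespace-separated field of every
# line in the file ($2) that contains the given substring ($1).
# Assumes lines look like "<stat name>   <value>   ..." with no leading space.
extract_stat() {
    awk -v pat="$1" 'index($0, pat) { print $2 }' "$2"
}
```

With it, the loop body would read `dep=$(extract_stat dependent "$f")`, and likewise for `interGroup` and `intraGroup`.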

List the checkpoints whose intra-rate is below 0.8, and print the weighted fraction of checkpoints whose intra-rate is at least 0.8:

cat summary.csv | sed -E 's/(.*)_(.*)/\1,\2/' |
awk -F, '{print $0 "," $5/$3}' |
sort -t, -k6,6gr |
awk -F, 'BEGIN{print "< 0.8: chkpt weight intra-rate"} {total+=$2; if ($6 >= 0.8) meet+= $2; else print $1,$2,$6;} END{print "weighted (>=0.8)/total = " meet/total}'
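
The weighted figure printed at the end can also be computed on its own. The sketch below assumes the `summary.csv` layout produced earlier (`<name>_<weight>,<dependent>,<interGroup>,<intraGroup>`) and takes the intra-rate to be `intraGroup / dependent`, as in the pipeline above. `weighted_share` is a hypothetical name, and unlike the original it skips lines whose `dependent` count is zero instead of dividing by zero.

```shell
# Sketch: weighted share of checkpoints whose intra-rate meets a threshold.
# Reads summary.csv-style lines on stdin; $1 is the threshold (default 0.8).
weighted_share() {
    awk -F, -v thr="${1:-0.8}" '
        {
            # The weight is the text after the last underscore of field 1.
            weight = $1
            sub(/.*_/, "", weight)
            total += weight
            # intra-rate = intraGroup / dependent; guard against dependent == 0.
            if ($2 > 0 && $4 / $2 >= thr) meet += weight
        }
        END { if (total > 0) print meet / total }
    '
}
```

For example, `weighted_share 0.8 < summary.csv` prints a single number between 0 and 1.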

One failed checkpoint.

shinezyy/ff-reshape@9686aea50

Global frequency set at 1000000000000 ticks per second
gem5 Simulator System.  http://gem5.org
gem5 is copyrighted software; use the --copyright option for details.

gem5 version 21.2.0.0
gem5 compiled Jan 22 2022 12:51:07
gem5 started Jan 22 2022 20:51:25
gem5 executing on mcdonalds.shic.lan, pid 182806
command line: /home/yqszxx/gem5.opt --outdir /home/yqszxx/result/9686aea50-220122-205117/gcc_expr_200000000_0.184808 /home/yqszxx/configs/example/fs.py --cpu-type NonCachingSimpleCPU --mem-size=8GB --maxinsts 50000000 --gcpt-restorer /home/yqszxx/gcpt-restorer --generic-rv-cpt /home/yqszxx/checkpoints/gcc_expr_200000000_0.184808/0/_200001000_.gz --depcheck

info: Standard input is not a terminal, disabling listeners.
**** REAL SIMULATION ****
build/RISCV/arch/riscv/bare_metal/fs_workload.cc:49: info: No bootload provided
build/RISCV/base/remote_gdb.cc:381: warn: Sockets disabled, not accepting gdb connections
build/RISCV/mem/physical.cc:498: warn: Overriding Gcpt restorer
build/RISCV/sim/system.cc:601: info: Restoring from Generic Checkpoint
warn: No dot file generated. Please install pydot to generate the dot file and pdf.
build/RISCV/sim/simulate.cc:194: info: Entering event queue @ 0.  Starting simulation...
build/RISCV/mem/port.cc:200: warn: Port system.lint.pio doesn't support requesting a back door.
build/RISCV/dev/riscv/lint.cc:21: warn: Lint device doesn't support writes
build/RISCV/dev/riscv/lint.cc:21: warn: Lint device doesn't support writes
build/RISCV/mem/port.cc:200: warn: Port system.membus.badaddr_responder.pio doesn't support requesting a back door.
gem5.opt: build/RISCV/cpu/simple/atomic.cc:511: virtual gem5::Fault gem5::AtomicSimpleCPU::writeMem(uint8_t*, unsigned int, gem5::Addr, gem5::Request::Flags, uint64_t*, const std::vector<bool>&): Assertion `!pkt.isError()' failed.
Program aborted at tick 9472968500
--- BEGIN LIBC BACKTRACE ---
/home/yqszxx/gem5.opt(+0xc242c0)[0x556949dd72c0]
/home/yqszxx/gem5.opt(+0xc414fe)[0x556949df44fe]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7f48c72823c0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f48c642918b]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f48c6408859]
/lib/x86_64-linux-gnu/libc.so.6(+0x25729)[0x7f48c6408729]
/lib/x86_64-linux-gnu/libc.so.6(+0x36f36)[0x7f48c6419f36]
/home/yqszxx/gem5.opt(+0x6d6916)[0x556949889916]
/home/yqszxx/gem5.opt(+0x6ee651)[0x5569498a1651]
/home/yqszxx/gem5.opt(+0x4593fe)[0x55694960c3fe]
/home/yqszxx/gem5.opt(+0x448c3c)[0x5569495fbc3c]
/home/yqszxx/gem5.opt(+0x6d7459)[0x55694988a459]
/home/yqszxx/gem5.opt(+0xc30862)[0x556949de3862]
/home/yqszxx/gem5.opt(+0xc5d8e4)[0x556949e108e4]
/home/yqszxx/gem5.opt(+0xc5e63e)[0x556949e1163e]
/home/yqszxx/gem5.opt(+0xc121a2)[0x556949dc51a2]
/home/yqszxx/gem5.opt(+0x749b97)[0x5569498fcb97]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x2a8718)[0x7f48c7538718]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x8dd8)[0x7f48c730df48]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyEval_EvalCodeWithName+0x8fb)[0x7f48c745aecb]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyFunction_Vectorcall+0x94)[0x7f48c75380f4]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x74d6d)[0x7f48c7304d6d]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x7d86)[0x7f48c730cef6]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x8006b)[0x7f48c731006b]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x74d6d)[0x7f48c7304d6d]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x12fd)[0x7f48c730646d]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyEval_EvalCodeWithName+0x8fb)[0x7f48c745aecb]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyFunction_Vectorcall+0x94)[0x7f48c75380f4]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x74d6d)[0x7f48c7304d6d]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x7d86)[0x7f48c730cef6]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyEval_EvalCodeWithName+0x8fb)[0x7f48c745aecb]
/lib/x86_64-linux-gnu/libpython3.8.so.1.0(PyEval_EvalCodeEx+0x42)[0x7f48c745b252]
--- END LIBC BACKTRACE ---

This might be an existing bug: I have not implemented some devices in GEM5, for example SDCARD. You can print the address of the error packet and check which range it falls in.
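
Once the address is printed, the range check itself is simple. A minimal sketch follows; `in_range` and `find_range` are hypothetical helpers, and the address map fed to `find_range` has to be supplied by hand (one `name base size` triple per line), since the real device ranges are not listed in this thread.

```shell
# Sketch: succeed if address $1 lies in [base, base + size), with base = $2
# and size = $3. Hex (0x...) and decimal values are both accepted.
in_range() {
    [ $(( $1 >= $2 && $1 < $2 + $3 )) -eq 1 ]
}

# Read "name base size" lines on stdin and print the name of the first range
# that contains address $1, or "unmapped" if none does.
find_range() {
    while read -r name base size; do
        if in_range "$1" "$base" "$size"; then
            echo "$name"
            return 0
        fi
    done
    echo "unmapped"
}
```

An address that comes back as `unmapped`, or that lands in the range of an unimplemented device, would be consistent with the `!pkt.isError()` assertion above.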

Because GCPT functionality was added to GEM5 while I was rushing the Omegaflow paper, I just put the checkpoints that use SDCARD on a blacklist and skipped them.

I think it's okay to just ignore these checkpoints 😂
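
Skipping blacklisted checkpoints can be done by filtering the checkpoint list before handing it to `parallel`. A minimal sketch, assuming a blacklist file with one checkpoint name per line in the same form as the list entries; `filter_blacklist` and the file names in the usage note are hypothetical.

```shell
# Sketch: print the entries of the list file ($1) that do not appear in the
# blacklist file ($2).
# -F: fixed strings, -x: whole-line match, -v: invert, -f: patterns from file.
filter_blacklist() {
    grep -v -x -F -f "$2" "$1"
}
```

For example, `filter_blacklist chkpt50.list blacklist.txt > chkpt50.filtered.list`, then pass the filtered file to `parallel -a`.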