josehu07 / CS839-MLSys-AS2

CS839 MLSys SP2022 Collective Communication Assignment

CS839 MLSys Collective Communication Assignment

CS839 MLSys @ UW-Madison, SP2022. Assignment 2 on collective communication.

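Install the parallel-ssh tool, matplotlib, and scikit-learn (the `sklearn` name on PyPI is a deprecated alias, so install `scikit-learn` instead):

pip3 install parallel-ssh matplotlib scikit-learn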

Machine Setup

CloudLab connection information is hardcoded in pssh_common.py. Set up a passwordless sudoer account named torchuser on all nodes:

cd setup
python3 setup_user.py

Install and test CPU-based PyTorch on all nodes:

python3 torch_inst.py
cd ..

If the installation succeeds, the script prints a tensor result on every node at the end. The steps above have already been completed on the current nodes.

Running Tasks

The core source code for the two AllReduce algorithms is in:

  • tasks/allreduce_ring.py: "Ring" algorithm
  • tasks/allreduce_recur_hd.py: Recursive Halving and Doubling algorithm
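For readers unfamiliar with the two algorithms, the sketch below simulates both in a single process on plain Python lists. It is not the code in tasks/allreduce_ring.py or tasks/allreduce_recur_hd.py, which run across nodes; the function names `ring_allreduce` and `recursive_hd_allreduce` are hypothetical, the reduction is assumed to be a sum, and the halving-and-doubling variant assumes a power-of-two number of ranks.

```python
def ring_allreduce(inputs):
    """Simulate ring AllReduce (sum). inputs[r] is rank r's vector."""
    n, length = len(inputs), len(inputs[0])
    bounds = [(i * length) // n for i in range(n + 1)]  # chunk boundaries
    bufs = [list(v) for v in inputs]

    # Phase 1 (reduce-scatter): in step s, rank r sends chunk (r - s) mod n
    # to rank r + 1, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        msgs = [((r - s) % n, list(bufs[r])) for r in range(n)]  # snapshot
        for r, (c, data) in enumerate(msgs):
            dst = (r + 1) % n
            for i in range(bounds[c], bounds[c + 1]):
                bufs[dst][i] += data[i]

    # Phase 2 (all-gather): rank r now owns the reduced chunk (r + 1) mod n.
    # In step s, it forwards chunk (r + 1 - s) mod n, which overwrites.
    for s in range(n - 1):
        msgs = [((r + 1 - s) % n, list(bufs[r])) for r in range(n)]
        for r, (c, data) in enumerate(msgs):
            dst = (r + 1) % n
            lo, hi = bounds[c], bounds[c + 1]
            bufs[dst][lo:hi] = data[lo:hi]
    return bufs


def recursive_hd_allreduce(inputs):
    """Simulate recursive halving (reduce-scatter) + doubling (all-gather)."""
    n, length = len(inputs), len(inputs[0])
    assert n > 0 and (n & (n - 1)) == 0, "sketch assumes power-of-two ranks"
    bufs = [list(v) for v in inputs]
    ranges = [(0, length)] * n  # [lo, hi) each rank is still reducing

    # Recursive halving: partners at distance d split their shared range;
    # each keeps one half and adds the partner's data for that half.
    d = n // 2
    while d >= 1:
        snap = [list(b) for b in bufs]
        new_ranges = []
        for r in range(n):
            lo, hi = ranges[r]
            mid = (lo + hi) // 2
            keep = (lo, mid) if (r & d) == 0 else (mid, hi)
            for i in range(*keep):
                bufs[r][i] = snap[r][i] + snap[r ^ d][i]
            new_ranges.append(keep)
        ranges = new_ranges
        d //= 2

    # Recursive doubling: partners exchange their reduced ranges, so the
    # range each rank holds doubles every step until it covers the vector.
    d = 1
    while d < n:
        snap = [list(b) for b in bufs]
        new_ranges = []
        for r in range(n):
            plo, phi = ranges[r ^ d]
            bufs[r][plo:phi] = snap[r ^ d][plo:phi]
            lo, hi = ranges[r]
            new_ranges.append((min(lo, plo), max(hi, phi)))
        ranges = new_ranges
        d *= 2
    return bufs
```

Both functions return one result vector per rank, and every rank ends with the element-wise sum of all inputs. The ring variant takes 2(n - 1) steps with small messages, while halving and doubling takes 2 log2(n) steps, which is the trade-off the tasks are presumably meant to measure.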

To run all four tasks, execute the command below from the root of this repository:

./run.sh

The results are written to task*.csv and task*.png files.
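The column layout of the CSV files is not documented here, so the following is only a minimal sketch for taking a first look at them. It assumes the first row of each file is a header, and the helper name `summarize_task_csvs` is hypothetical rather than part of this repository.

```python
import csv
from pathlib import Path


def summarize_task_csvs(root="."):
    """Map each task*.csv under root to (header row, number of data rows)."""
    summary = {}
    for path in sorted(Path(root).glob("task*.csv")):
        with path.open(newline="") as f:
            rows = list(csv.reader(f))
        # Assumption: the first row is a header; the rest are data rows.
        header = rows[0] if rows else []
        summary[path.name] = (header, max(len(rows) - 1, 0))
    return summary


if __name__ == "__main__":
    for name, (header, n_rows) in summarize_task_csvs().items():
        print(f"{name}: columns={header}, data rows={n_rows}")
```

Run it from the repository root after ./run.sh finishes to list the columns and row count of each result file.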
