Calculon - Co-design for large scale parallel applications

Running

Run Calculon like this:

$> PYTHONPATH=. ./bin/ <args>

Calculon has a hierarchical command-line interface. To see the commands it accepts, use --help or -h:

$> PYTHONPATH=. ./bin/ -h

To see how to use a specific command, pass --help or -h to that command:

$> PYTHONPATH=. ./bin/ llm -h
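A hierarchical command line is typically built from a top-level parser with one subparser per command. That structure is why -h works both globally and per command. Below is a minimal Python argparse sketch of it. The subcommand names come from this README. The positional argument names (model, execution, system, output) are hypothetical and are not taken from Calculon's own parser.

```python
import argparse


# Hypothetical sketch of a hierarchical (command + subcommand) CLI.
# Subcommand names come from the README; argument names are assumptions.
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="calculon")
    sub = parser.add_subparsers(dest="command", required=True)

    # `llm`: one calculation from model, execution, and system files.
    llm = sub.add_parser("llm", help="single LLM calculation")
    llm.add_argument("model")
    llm.add_argument("execution")
    llm.add_argument("system")
    # A lone "-" is parsed by argparse as a positional value, not an option.
    llm.add_argument("output", help="output file ('-' assumed to mean stdout)")

    # `llm-validation`: no positional arguments in this sketch.
    sub.add_parser("llm-validation", help="validate against published runs")
    return parser
```

With this structure, build_parser().parse_args(["llm", "-h"]) prints help for the llm subcommand only and then exits. This mirrors the per-command help shown above.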

LLM Example

Run a single LLM calculation (~1 sec):

$> PYTHONPATH=. ./bin/ llm models/megatron-1T.json examples/3072_t4_p64_d12_mbs4_full.json systems/a100_80g.json -
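The example execution file name appears to encode its configuration. Read that way, 3072 processors are split as tensor parallelism 4, pipeline parallelism 64, and data parallelism 12 (4 × 64 × 12 = 3072), with micro-batch size 4. The trailing field ("full") is assumed to name an activation recomputation setting. This whole reading is an assumption inferred from the name, not a documented convention. The sketch below decodes such names and checks that the three parallelism degrees multiply to the processor count. The pattern and the helper name decode_execution_name are hypothetical.

```python
import re

# Assumed naming convention (inferred, not documented):
# <procs>_t<tensor>_p<pipeline>_d<data>_mbs<microbatch>_<recompute>.json
PATTERN = re.compile(
    r"(?P<procs>\d+)_t(?P<t>\d+)_p(?P<p>\d+)_d(?P<d>\d+)"
    r"_mbs(?P<mbs>\d+)_(?P<recompute>\w+)\.json$"
)


def decode_execution_name(path: str) -> dict:
    """Split an execution file name into its (assumed) fields."""
    m = PATTERN.search(path)
    if m is None:
        raise ValueError(f"unrecognized execution file name: {path}")
    # Convert numeric fields to int; keep the recompute label as a string.
    fields = {k: (int(v) if v.isdigit() else v) for k, v in m.groupdict().items()}
    # Tensor x pipeline x data parallelism should cover every processor.
    if fields["t"] * fields["p"] * fields["d"] != fields["procs"]:
        raise ValueError("t * p * d does not equal the processor count")
    return fields


info = decode_execution_name("examples/3072_t4_p64_d12_mbs4_full.json")
```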

Run the system execution optimizer for an LLM (~1 min):

$> PYTHONPATH=. ./bin/ llm-optimal-execution models/turing-530B.json 5128 2520 float16 systems/a100_80g.json output.json -m

output.json will contain the optimal way to run Turing-530B across 5128 A100 GPUs.
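An execution optimizer has to search over the ways of splitting the processors across parallelism dimensions. The minimal sketch below assumes the search covers every ordered split of N processors into tensor (t), pipeline (p), and data (d) parallelism with t × p × d = N. It only illustrates that search space. It does not model how Calculon scores or filters candidates using the batch size, datatype, or system description passed on the command line. The helper names are hypothetical.

```python
def divisors(n: int) -> list[int]:
    """All positive divisors of n, in increasing order."""
    return [k for k in range(1, n + 1) if n % k == 0]


def parallel_splits(n: int) -> list[tuple[int, int, int]]:
    """Every ordered (t, p, d) with t * p * d == n."""
    splits = []
    for t in divisors(n):
        # Once t and p are fixed, d is determined by the remaining factor.
        for p in divisors(n // t):
            splits.append((t, p, n // (t * p)))
    return splits
```

For 5128 processors this yields only 30 splits, because 5128 = 2^3 × 641 and 641 is prime. A count of 3072 (2^10 × 3) yields 198.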

To store the results of all successful runs from the same experiment, run the llm-all-executions optimizer, which writes every successful run to the given CSV file (~1 min):

$> PYTHONPATH=. ./bin/ llm-all-executions models/turing-530B.json 5128 2520 float16 systems/a100_80g.json all_output.csv
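Once all runs are in a CSV file, a natural next step is to rank them by some metric. This README does not describe the column layout of all_output.csv. The following Python sketch therefore takes the metric column name as a parameter instead of assuming one. The helper names load_runs and best_run are hypothetical.

```python
import csv


def load_runs(path: str) -> list[dict]:
    """Read every row of a CSV file (e.g. all_output.csv) as a dict."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def best_run(runs: list[dict], column: str, maximize: bool = True) -> dict:
    """Return the row with the largest (or smallest) numeric value in `column`.

    `column` must match a header in the actual file; none is assumed here.
    """
    if not runs:
        raise ValueError("no runs to compare")

    def metric(row: dict) -> float:
        return float(row[column])

    return max(runs, key=metric) if maximize else min(runs, key=metric)
```

Usage: best_run(load_runs("all_output.csv"), "<metric column>"), where the column name is whichever header in your file holds the metric you want to optimize.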

Testing and validation (optional)

To verify that the current build works, run:

$> make test

To validate Calculon's performance modeling against Megatron runs on NVIDIA's Selene A100-based supercomputer, with results published in the "Sequence parallelism" paper, run:

$> PYTHONPATH=. ./bin/calculon llm-validation
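Validation compares modeled performance with measured results. This README does not state which metric llm-validation reports. One common way to quantify agreement is the relative error of each modeled value against its measurement. The sketch below shows that calculation with hypothetical helper names.

```python
def relative_error(modeled: float, measured: float) -> float:
    """Signed relative error: positive when the model overestimates."""
    if measured == 0:
        raise ValueError("measured value must be non-zero")
    return (modeled - measured) / measured


def max_abs_relative_error(pairs: list[tuple[float, float]]) -> float:
    """Largest |relative error| over (modeled, measured) pairs."""
    return max(abs(relative_error(m, r)) for m, r in pairs)
```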

Publications

Calculon: A Methodology and Tool for High-Level Co-Design of Systems and Large Language Models
Mikhail Isaev, Nic McDonald, Larry Dennison, Richard Vuduc
Scaling Infrastructure to Support Multi-Trillion Parameter LLM Training
Mikhail Isaev, Nic McDonald, Richard Vuduc
