Simple stats on SGE accounting data.
python3 -m pip install -r requirements.txt
The accounting file must be in this directory (or modify the path in the Python scripts).
sbatch prune.sh
Either directly:
./sgestats.py
or submit to cluster:
sbatch sgestats.sh
- Modin parallelizes, and is noticeably faster.
- However, it does not provide 100% of Pandas features, and sgestats.py uses some Pandas functionality not found in Modin.
- Feather files are much faster to write and much smaller. Comparable JSON is about 4x larger, and much, much slower to read.
Using standard Pandas and doing I/O to CSV files, the sgestats.py
job runs in about 3.5 to 4 minutes, as it has to munge the category
column into separate columns. Using the Feather file, and Modin with Ray, the job runs in about 45 seconds.
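As a rough illustration of the category munging mentioned above, here is a minimal sketch. It assumes the category field holds qsub-style options such as "-u jdoe -l h_rt=3600,h_vmem=4G -pe smp 8". The function name split_category and the "l_" column prefix are hypothetical and are not taken from sgestats.py, which may do this differently.

```python
import shlex


def split_category(category: str) -> dict[str, str]:
    """Split a qsub-style category string into one entry per column.

    Assumption: options look like "-flag value ..." and "-l" carries a
    comma-separated list of name=value resource requests.
    """
    columns: dict[str, str] = {}
    flag = None
    for token in shlex.split(category):
        if token.startswith("-") and len(token) > 1:
            # A new option starts; remember it for the tokens that follow.
            flag = token[1:]
            if flag != "l":
                columns.setdefault(flag, "")
        elif flag == "l":
            # Each resource request becomes its own "l_<name>" column.
            for item in token.split(","):
                name, _, value = item.partition("=")
                if name:
                    columns[f"l_{name}"] = value
        elif flag is not None:
            # Options with several values (e.g. "-pe smp 8") are space-joined.
            columns[flag] = f"{columns[flag]} {token}".strip()
    return columns
```

Applied to every row, the resulting dictionaries could then be expanded into DataFrame columns. Doing this once and saving the result would avoid repeating the work on each run.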
N.B. was unable to pip install modin[ray] on Ubuntu 23.10 using Homebrew-installed Python 3.12.2.
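On systems where Modin cannot be installed, one possible workaround is to fall back to standard Pandas at import time. The sketch below is only a suggestion. The helper first_importable is hypothetical and is not part of sgestats.py.

```python
import importlib
from types import ModuleType


def first_importable(names: tuple[str, ...]) -> ModuleType:
    """Return the first module in `names` that can be imported."""
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Not installed (or failed to import); try the next candidate.
            continue
    raise ImportError(f"none of {names} could be imported")


# Possible use in a script (assumes at least pandas is installed):
# pd = first_importable(("modin.pandas", "pandas"))
```

This only covers the case where the import itself fails. Any Pandas functionality missing from Modin still has to be handled separately.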