YooSungHyun / pytorch-trainer

torch 2.0 and cpu, ddp, deepspeed trainer template

pytorch-trainer (The best of both worlds, HuggingFace and Torch-Lightning.)

The CPU, DDP, FSDP, and DeepSpeed (DS) trainer files are each organized as independent examples.

Use these examples as a guide to write your own train.py!

I'm aiming for a template that is somewhat opinionated. Torch-fabric doesn't support as many features as I expected, so I wrote my own trainer in pure native torch.

Each trainer will be written in its own python file.

torch >= 2.1.2
cuda 11.8
As of 2023-12-30, I am experimenting with DeepSpeed on this codebase.

To install DeepSpeed, first install Rust:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

After the command finishes, restart your terminal. Then, with CUDA 11.8 and the matching torch build installed, run

sh scripts/install_deepspeed.sh

to install DeepSpeed.

The vs_code_launch_json directory contains a launch.json you can use for debugging in VS Code.

Usage

  1. Download the raw data and put it in the raw_data directory
  2. Copy your model network (just an nn.Module) into the networks directory
  3. Write preprocess.py and preprocess the data yourself
  4. If you need to write or copy a dataset, put it in utils/data/ and check the samples there
  5. If you need a custom sampler or loader, put it in utils/data and check the samples there
    • Some useful samplers are already provided in custom_sampler.py (referencing HF's transformers):
    • DistributedBucketSampler: builds random batches while keeping lengths as similar as possible.
    • LengthGroupedSampler: sorts by the length column in descending order and picks random index batches (dynamic batching).
    • DistributedLengthGroupedSampler: distributed dynamic batching.
  6. Adapt [cpu|ddp|deepspeed]_train.py
    1. Inherit from the Trainer base class, which is already implemented. Since the training pipeline varies from model to model and dataset to dataset, @abstractmethod is used to give you direct control, and if you need to implement further inheritance, it is easy to rip out the parts you need and use them as a reference.
    2. Basically, you modify training_step and eval_loop, which perform the loss computation, and do whatever you want using batches, labels, criterion, and eval_metric. Examples using all_gather and more are implemented in each of the ddp, deepspeed, and fsdp trainers, so check them out and write your own code effectively!
    3. chk_addr_dict makes heavy use of in-place dictionary updates so you can check address values while debugging. Always be careful that your own implementations don't blow up memory against each other!
  7. In the main function, do some simple operations on the data at the start and prepare the ingredients for your model, optimizer, and scheduler (the scheduler dict format is sketched right after this list).
    • The learning rate scheduler must be wrapped as {"scheduler": scheduler, "interval": "step", "frequency": 1, "monitor": None}
    • frequency is the step accumulation: if it is 2, the scheduler takes 1 step for every 2 training steps.
    • monitor is used only for ReduceLROnPlateau's loss value.
  8. Run! cd {your-workspace}/pytorch-trainer && sh scripts/run_train_[cpu|ddp|deepspeed].sh
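
As a rough guide for step 7, here is a minimal sketch of that scheduler dict. The model and optimizer below are placeholders, not this repo's actual main(); only the dict keys come from the list above.

    # sketch only: the model/optimizer are stand-ins, the dict keys are what matters
    import torch
    from torch import nn

    model = nn.Linear(16, 1)  # stand-in for a network from the networks dir
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.9)

    lr_scheduler = {
        "scheduler": scheduler,  # any torch LR scheduler instance
        "interval": "step",      # the scheduler is stepped per training step
        "frequency": 1,          # if 2, take 1 scheduler step for every 2 train steps
        "monitor": None,         # only meaningful for ReduceLROnPlateau's loss value
    }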

TODO LIST

The wandb runs for each test are here: Link

  • cpu_trainer - LSTM example (but its training is weird)
  • cpu_trainer - wandb
  • cpu_trainer - continuous learning
  • cpu_trainer - weird lstm training fix
  • ddp_trainer - lstm or mnist example
  • ddp_trainer - sampler and dataloader
  • ddp_trainer - does the training loop need extra handling for distributed learning?
  • ddp_trainer - Reliable wandb logging for distributed learning
  • ddp_trainer - does wandb logging need gather (or something similar)?
  • ddp_trainer - add fp16 and bf16 use
  • deepspeed_trainer - lstm or mnist example
  • deepspeed_trainer - sampler and dataloader
  • deepspeed_trainer - does the training loop need extra handling for distributed learning?
  • deepspeed_trainer - does wandb logging need gather (or something similar)?
  • deepspeed_trainer - Reliable wandb logging for distributed learning
  • fsdp_trainer - change deepspeed to fsdp
  • fsdp_trainer - test (wandb compare this link)
  • eval epoch end all_gather on cpu, eval on cpu (?)
  • Implement customizable training and eval step inheritance
  • inference - py3
  • huggingface - if a model is float16, is the actual model dtype float16? check and apply
    • In HuggingFace transformers, training a float16 or bfloat16 model actually changes the model's dtype, so if you want to reproduce this, cast it via model.to(dtype) (see the snippet below).
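
A minimal illustration of that cast in plain torch (the Linear module and the bfloat16 choice are just placeholders):

    import torch
    from torch import nn

    model = nn.Linear(10, 10)               # stand-in for any network
    model = model.to(dtype=torch.bfloat16)  # cast the weights themselves, as HF does for torch_dtype
    print(next(model.parameters()).dtype)   # torch.bfloat16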

DeepSpeed ZeRO test result (lstm1 and lstm2, n_layer = 1000 each)

The ZeRO test is not accurate because the model was an LSTM.

Since an LSTM requires its parameters to be fully contiguous, the partitioning (CPU-GPU) may not have worked well.

(LSTM contiguous-parameters warning screenshot)

Also, I have some doubts about the CPU offload performance because I used torch's optimized AdamW, not DeepSpeed's optimized AdamW.

Still, here are the results with the 2 LSTM layers set to n_layer = 1000.

test set               RTX3090 GPU Mem (MiB)
zero2 optim offload    2016
zero2 optim offload    1964
zero3 full offload     2044
zero3 optim offload    2010
zero3 param offload    2044
zero3 not offload      2054

I think optimizer offload works well, but the param offload numbers look strange...
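
For reference, a rough sketch of the kind of ZeRO-3 offload settings tested above, written as a Python dict for deepspeed.initialize. This is an assumption about the shape of such a config, not this repo's actual DeepSpeed config file; letting DeepSpeed build its own CPU-optimized AdamW (instead of torch's AdamW) is what the optimizer note above refers to.

    # assumption: mirrors the offload variants discussed above, not this repo's exact config
    ds_config = {
        "train_micro_batch_size_per_gpu": 8,
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {"device": "cpu", "pin_memory": True},  # "optim offload"
            "offload_param": {"device": "cpu", "pin_memory": True},      # "param offload"
        },
        # have DeepSpeed construct its own (CPU-capable) AdamW rather than torch AdamW
        "optimizer": {"type": "AdamW", "params": {"lr": 1e-3}},
        "bf16": {"enabled": True},
    }

    # model_engine, optimizer, _, _ = deepspeed.initialize(model=model, config=ds_config)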

Infer Result

Distributed training shuffles the data differently on each GPU,
so you won't be able to trace the exact sources shown here, even accounting for the scaler.

Time Series Task

(result images for label, CPU, DDP, and DeepSpeed; see the repository README)

IMDB (Binary Classification) Task

(result image for FSDP; see the repository README)

Unsupported list

tensorboard - I personally find it too inconvenient.

Gradient checkpointing has to be implemented inside the nn.Module network itself!! So the trainer cannot handle it generically...

useful link: https://github.com/prigoyal/pytorch_memonger/blob/master/tutorial/Checkpointing_for_PyTorch_models.ipynb
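Since the checkpointing call has to live inside the network's forward, here is a minimal sketch with torch.utils.checkpoint; the CheckpointedMLP module is a made-up example, not one of the networks in this repo.

    import torch
    from torch import nn
    from torch.utils.checkpoint import checkpoint


    class CheckpointedMLP(nn.Module):
        """Toy network: block1's activations are recomputed during backward to save memory."""

        def __init__(self):
            super().__init__()
            self.block1 = nn.Sequential(nn.Linear(128, 128), nn.ReLU())
            self.block2 = nn.Linear(128, 10)

        def forward(self, x):
            # use_reentrant=False is the recommended mode on recent torch versions
            x = checkpoint(self.block1, x, use_reentrant=False)
            return self.block2(x)


    model = CheckpointedMLP()
    out = model(torch.randn(4, 128))
    out.sum().backward()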

plz help!!!

I don't have much understanding of distributed learning, so I'm looking for people to help me out. PRs are always welcome.

Bugfixes and improvements are always welcome.

If you can recommend any accelerator-related blogs or videos for me to study, I would be grateful (in an issue or something).

Special Thanks!

@jp1924 @ddobokki @Master_yang

License: MIT License