Pipeline Parallelism for PyTorch

pytorch/PiPPy Issues

Issue with optimizer instantiation
Updated 4 months ago (2)
Check if remap_qualname still works after refactorization
Closed 4 months ago (1)
Check if stage-wise checkpoint loading still works after refactorization
Updated 4 months ago
Check if meta device tracing still works after refactorization
Updated 4 months ago
ResNet example always underfitting when pippy training
Updated 4 months ago (5)
PyTorch renaming submod indices leading to assert break
Updated 4 months ago
Pipeline Schedule confused
Updated 4 months ago (1)
Decouple graph interpretation from pipeline executor
Updated 5 months ago
[H100] local test C10D forward does not have tensor result equivalency (16% mismatch)
Updated 5 months ago
Incompatible with pytorch 2.0?
Closed 6 months ago
Failed to run fine-tuning (freezing some layers) of hf model with pippy
Updated 7 months ago
split_into_equal_size returns submodules with non-optimizable parameters
Updated 7 months ago
[spmd] spmd api tracing warning need to investigate
Closed 8 months ago (2)
Any plan to support PEFT LoRA models?
Updated 8 months ago (2)
Why does parallel pipeline require a master
Updated 8 months ago (1)
tp+pp and gspmd examples not running
Closed 10 months ago (1)
[spmd] spmd logging doesn't work with logging level
Closed 10 months ago
How did this error happen when i run example about resnet?
Updated 10 months ago
Split each layer in multiple gpu
Updated a year ago
Request for Examples of Pipeline Parallelism with Multiple Machines in PiPPy
Updated a year ago (1)
TP+PiPPy failing on HF examples.
Updated a year ago (4)
How to run the gpt2 example on a single node with four GPU?
Updated a year ago
Could pippy be coexisted with deepspeed?
Updated a year ago (1)
Incorrect loss value of huggingface bert example
Updated a year ago
init_empty_weights only works with torchrun and is very slow
Closed a year ago (6)
How to reduce memory costs when running on CPU
Closed a year ago
Move DTensor from tau to PyTorch
Closed a year ago
Pippy ddp2pipe example doesn't work for pipeline
Updated a year ago (4)
Problem reproducing minimal example
Closed a year ago (2)
[SPMD] Missing DT support NotImplementedError: Operator aten.amax.default does not have a DistributedTensor rule registered.
Updated a year ago
[SPMD] Add support for convolution ops to DTensor sharding prop
Updated a year ago
[DTensor] missing rule for aten.fill.Scalar causing unit tests to fail for SPMD
Updated a year ago
Issue with FX tracing of HF seq2seq models
Updated a year ago
Remove checkpoint files moved to PT
Closed a year ago
Fix test failure in test/spmd/checkpoint/test_dt_planner.py
Closed a year ago
Fix test failure in test/spmd/checkpoint/test_pg_planner.py
Closed a year ago
[SPMD][Fusion] add bucket size/ num_bytes policy for fusion
Updated a year ago
[SPMD][Fusion] - ensure matching ProcessGroups for fused comm calls
Updated a year ago
[SPMD][Fusion] - ensure buffer dtype matches gradient tensor dtype
Updated a year ago
[SPMD][Fusion] Add unit tests for fusion
Updated a year ago
[SPMD][Fusion] tracking - move global buffer to just before first fusion
Updated a year ago
[spmd] incorrect aten.expand call with nn.linear (expanded size must match existing size at dim 0)
Updated a year ago (1)
'CLIPVisionConfig' object has no attribute 'vocab_size'
Updated a year ago
[SPMD] Remove Gradient tensor clones added during DTensor comm collective insertion
Updated a year ago
pytests_test_gpu(0) will fail if allocated a non-4 gpu server - add guard/skip?
Updated a year ago
Support Segformer models in HF tests
Updated a year ago
[spmd] torch.cat (aten.cat.default) not implemented for Distributed Tensor (tracking)
Updated a year ago
[spmd] self-attention not converging
Updated 2 years ago (1)
[spmd] self-attention module's proj.bias isn't properly updated on all ranks but rank 0
Updated 2 years ago (1)
Buck run device error
Closed 2 years ago