The PyTorch version is incorrect.
Doraemonzzz opened this issue · comments
Thank you for your work, this is a great project. However, I encountered some environment issues while running it. I have tried 2.4.0.dev20240419+cu121, 2.4.0.dev20240612+cu121, and 2.5.0.dev20240617+cu121, but all of these resulted in errors. Could you please provide the correct torch version that works with the main branch? Thank you.
I didn't get the code up and running either. The test model using debug_model.toml ran successfully. But when trying to train llama3 using llama3_8b.toml, I had several ImportErrors:
cannot import name 'Partial' from 'torch.distributed._tensor'
cannot import name 'CheckpointPolicy' from 'torch.utils.checkpoint'
etc.
I tried torch-2.4.0.dev20240412 (which torchtitan is verified on according to README) from https://download.pytorch.org/whl/nightly/cu118/torch-2.4.0.dev20240412%2Bcu118-cp310-cp310-linux_x86_64.whl and several other pytorch nightly builds with no luck.
It seems we need a specific version of pytorch nightly.
This is because a PyTorch PR was reverted:
pytorch/pytorch#125795
You can manually patch your code based on the changes in these two PRs: #397 #401
Thank you for your response. I am currently encountering the following issue:
ModuleNotFoundError: No module named 'torch.distributed.pipelining'
The torch version is "2.4.0.dev20240419+cu121".
As with the previous issue, this is caused by APIs actively changing on the PyTorch side during the release cycle. Since PyTorch is preparing its next minor release, the APIs have been changing frequently. One suggestion is to wait for those changes to land in a nightly build and use nightly PyTorch, or to compile PyTorch from the latest GitHub branch. We understand the inconvenience, but right now we couldn't come up with a better solution... :-(
update: I believe 2.5.0.dev20240617+cu121 should be new enough to include the API change.
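For anyone unsure whether their installed build is new enough, here is a quick standalone probe (a sketch; the module and attribute names are the ones quoted in the errors in this thread) that reports which of the required APIs are present:

```python
import importlib


def has_api(module_name, attr=None):
    """Return True if module_name imports and (optionally) exposes attr."""
    try:
        mod = importlib.import_module(module_name)
    except ImportError:  # also catches ModuleNotFoundError
        return False
    return attr is None or hasattr(mod, attr)


if __name__ == "__main__":
    # APIs mentioned in this thread; all should report OK on a
    # new-enough nightly (e.g. 2.5.0.dev20240617+cu121).
    checks = [
        ("torch.distributed.pipelining", None),
        ("torch.distributed._tensor", "Partial"),
        ("torch.utils.checkpoint", "CheckpointPolicy"),
    ]
    for name, attr in checks:
        label = f"{name}.{attr}" if attr else name
        print(f"{label}: {'OK' if has_api(name, attr) else 'MISSING'}")
```

If any line prints MISSING, the installed build predates the relevant API change and a newer nightly is needed.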
Thank you for the version you provided. We have successfully run the code on version 2.5.0.dev20240617+cu121. By the way, could you please explain the purpose of the following code in norms.py? We couldn't find any corresponding documentation for it.
@partial(
local_map,
out_placements=[Shard(1)],
in_placements=(None, [Shard(1)], [Replicate()], None),
)
@Doraemonzzz Thanks for your interest in our new experimental feature local_map (this is why there's no official doc about it). In short, this decorator allows the user to call the decorated function on DTensors with a user-specified sharding specification (i.e., placements). See pytorch/pytorch#123676 for details.
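For readers puzzled by the syntax itself: `@partial(local_map, ...)` is ordinary functools.partial pre-binding the placement keyword arguments to local_map, which then acts as a plain one-argument decorator. A toy sketch of the pattern (the `tagged` decorator and all names here are illustrative, not torchtitan or PyTorch code):

```python
from functools import partial


# Toy stand-in for local_map: a decorator whose first positional
# argument is the function and whose behavior is configured via kwargs.
def tagged(func, *, tag):
    def wrapper(*args, **kwargs):
        return f"[{tag}] {func(*args, **kwargs)}"
    return wrapper


# functools.partial pre-binds the keyword arguments, leaving a
# one-argument callable usable as a plain decorator -- the same shape as
# @partial(local_map, out_placements=..., in_placements=...).
@partial(tagged, tag="Shard(1)")
def rmsnorm_fwd(x):
    return f"normed({x})"


print(rmsnorm_fwd("h"))  # → [Shard(1)] normed(h)
```

In norms.py the pre-bound kwargs are the placements, so the decorated function can be called directly on DTensors while its body sees local shards.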
Since pytorch/pytorch#125795 is re-landed, this issue should be gone in recent PyTorch nightly soon.
Hi, I also encountered this issue. I installed all the dependencies as recommended:
git clone https://github.com/pytorch/torchtitan
cd torchtitan
pip install -r requirements.txt
pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121 # or cu118
pip3 install --pre torchdata --index-url https://download.pytorch.org/whl/nightly
and my resulting torch version is 2.3.1+cu121. Any suggestions? How do I switch to 2.5.0.dev20240617+cu121? I cannot find it online.
Update: problem solved by using
pip3 install --pre torch==2.5.0.dev20240617 --index-url https://download.pytorch.org/whl/nightly/cu121
Encountering the same issue.
[rank0]: File "/opt/ml/code/torchtitan/parallelisms/parallelize_llama.py", line 415, in apply_ac
[rank0]: transformer_block = checkpoint_wrapper(transformer_block, ac_config)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/ml/code/torchtitan/parallelisms/parallelize_llama.py", line 50, in checkpoint_wrapper
[rank0]: from torch.utils.checkpoint import (
[rank0]: ImportError: cannot import name 'CheckpointPolicy' from 'torch.utils.checkpoint' (/opt/conda/lib/python3.11/site-packages/torch/utils/checkpoint.py)
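Until you are on a new-enough nightly, one possible local workaround (a sketch only, not the actual torchtitan fix; the flag name is made up here) is to guard the import and fall back when the selective-checkpointing API is absent:

```python
# Sketch: detect whether the installed torch exposes CheckpointPolicy.
# ModuleNotFoundError is a subclass of ImportError, so this also handles
# the case where torch (or the submodule) is missing entirely.
try:
    from torch.utils.checkpoint import CheckpointPolicy  # newer nightly API
    HAS_CHECKPOINT_POLICY = True
except ImportError:
    CheckpointPolicy = None  # hypothetical fallback; disable selective AC
    HAS_CHECKPOINT_POLICY = False

print(f"selective activation checkpointing available: {HAS_CHECKPOINT_POLICY}")
```

Code that needs selective activation checkpointing could then check `HAS_CHECKPOINT_POLICY` and fall back to full checkpointing (or raise a clearer error) instead of crashing on import.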