SHI-Labs / Neighborhood-Attention-Transformer

Neighborhood Attention Transformer, arxiv 2022 / CVPR 2023. Dilated Neighborhood Attention Transformer, arxiv 2022

While running the code, I got this type of problem. Could you please tell me the solution?

Mehulk43 opened this issue · comments

python -m torch.distributed.launch --nproc_per_node=1 train.py -c configs/nat_mini.yml /dataset/Imagenet

/home/user/anaconda3/envs/nat/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
warnings.warn(
Training with a single process on 1 GPUs.
WARNING: Unsupported operator aten::mul encountered 52 time(s)
WARNING: Unsupported operator aten::softmax encountered 18 time(s)
WARNING: Unsupported operator aten::add encountered 70 time(s)
WARNING: Unsupported operator aten::gelu encountered 18 time(s)
WARNING: Unsupported operator aten::rand encountered 34 time(s)
WARNING: Unsupported operator aten::floor_ encountered 34 time(s)
WARNING: Unsupported operator aten::div encountered 34 time(s)
WARNING: Unsupported operator aten::adaptive_avg_pool1d encountered 1 time(s)
Model nat_mini created.
19.984M Params and 2.713GFLOPs

Data processing configuration for current model + dataset:
input_size: (3, 224, 224)
interpolation: bicubic
mean: (0.485, 0.456, 0.406)
std: (0.229, 0.224, 0.225)
crop_pct: 0.875
Using native Torch AMP. Training in mixed precision.
Traceback (most recent call last):
  File "train.py", line 1020, in <module>
    main(args)
  File "train.py", line 517, in main
    dataset_train = create_dataset(
  File "/home/user/anaconda3/envs/nat/lib/python3.8/site-packages/timm/data/dataset_factory.py", line 138, in create_dataset
    ds = ImageDataset(root, parser=name, class_map=class_map, load_bytes=load_bytes, **kwargs)
  File "/home/user/anaconda3/envs/nat/lib/python3.8/site-packages/timm/data/dataset.py", line 32, in __init__
    parser = create_parser(parser or '', root=root, class_map=class_map)
  File "/home/user/anaconda3/envs/nat/lib/python3.8/site-packages/timm/data/parsers/parser_factory.py", line 22, in create_parser
    assert os.path.exists(root)
AssertionError
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 48603) of binary: /home/user/anaconda3/envs/nat/bin/python

Traceback (most recent call last):
  File "/home/user/anaconda3/envs/nat/lib/python3.8/runpy.py", line 192, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/user/anaconda3/envs/nat/lib/python3.8/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/user/anaconda3/envs/nat/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/user/anaconda3/envs/nat/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/user/anaconda3/envs/nat/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/user/anaconda3/envs/nat/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/user/anaconda3/envs/nat/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/user/anaconda3/envs/nat/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

============================================================
train.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2022-11-04_13:29:38
host : user
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 48603)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

============================================================
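For reference, the first traceback ends in timm's parser factory at assert os.path.exists(root), where root is the dataset path given on the command line. A minimal sketch that reproduces the same check outside train.py, assuming the same /dataset/Imagenet argument as in the launch command above:

```python
import os

# The positional data path passed to train.py (assumed here to be the
# same /dataset/Imagenet used in the launch command above).
root = "/dataset/Imagenet"

# timm's parser factory asserts os.path.exists(root) before building the
# dataset; if this prints False, train.py fails with the same AssertionError.
print(os.path.exists(root))
```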

Can you confirm the path /dataset/Imagenet exists?

Yes, it exists.

I'm pretty sure that's the problem; it's literally failing at checking whether the dataset path exists.

I am getting this: "ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1)", not an error about the dataset.

I have already tried giving the full path of the dataset. The same thing happened.

It is literally failing here:

assert os.path.exists(root)
AssertionError

Also, could you clarify what the "full path of the dataset" is?
Can you please ls /dataset/ImageNet and share the output?

I have created a folder named "dataset" in the classification folder and put ImageNet in the dataset folder.

In that case it should be dataset/ImageNet and not /dataset/ImageNet (no forward slash at the beginning).
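To illustrate the difference, here is a minimal sketch; the working directory below is an assumption, not the asker's actual setup. A path without a leading slash is resolved against the directory train.py is launched from, while a leading slash anchors it at the filesystem root.

```python
import os

# Assuming the script is launched from the repository's classification/ directory:
print(os.path.abspath("dataset/ImageNet"))
# -> <current working directory>/dataset/ImageNet

print(os.path.abspath("/dataset/ImageNet"))
# -> /dataset/ImageNet, i.e. a directory directly under the filesystem root
```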

Thank you for replying so fast.

Yeah I know that, I have tried that too.

And I have also given the full path, like
~/Downloads/MyProject/Neighborhood-Attention-Transformer/classification/dataset/ImageNet

And I have also tried
./dataset/ImageNet

@Mehulk43 I can confirm that this is a path issue. It is an assertion error in timm, in the create_dataset function. You may be confused because we left /dataset/ImageNet in as an example of where the dataset might be; it's pretty unlikely that's where you actually have ImageNet. I suggest running readlink -f <insert ImageNet folder path here> and pasting the output into the path argument.

Also note that a path starting with ~/ is not a literal full path; the shell expands ~ to $HOME. Full paths start with /, which is the root directory.
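A small sketch of resolving such a path in Python before passing it to train.py; the folder location below is hypothetical, for illustration only. os.path.expanduser handles the ~, and os.path.realpath then gives an absolute path, roughly what readlink -f prints.

```python
import os

# Hypothetical ImageNet location; substitute your actual folder.
p = "~/Downloads/MyProject/Neighborhood-Attention-Transformer/classification/dataset/ImageNet"

# os.path.exists() does not expand "~" by itself, so a literal "~/..." string
# would fail timm's check even if the directory exists.
full = os.path.realpath(os.path.expanduser(p))
print(full, os.path.exists(full))
```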

Thank you. I will try it and upload a screenshot if I get the error again.

Closing this due to inactivity. If you still have questions feel free to open it back up.