Configuration issues with HardImageNet and problems loading LLaMA model `.pt` file

XiangLiSky opened this issue 8 months ago

Thank you for publishing this repository.
I've been trying to work with HardImageNet and the LLaMA model, but I ran into a couple of issues that I'm hoping to get some help with.

Firstly, while setting up HardImageNet, I got confused about the directory structure and the specific paths for the various files and folders within the dataset. Could you please clarify, or provide documentation detailing, the expected folder hierarchy and path relationships for HardImageNet?
Secondly, I am encountering an error when trying to load the .pt file for the LLaMA model. The error message I receive is as follows:

2023-12-11 16:42:50.125 | INFO     | lance.edit_captions:load_model:97 - [Loading fine-tuned LLaMA model]
Traceback (most recent call last):
  File "main.py", line 292, in <module>
    main(args)
  File "main.py", line 115, in main
    caption_editor = CaptionEditor(
  File "/home/user/Experiement/Lance/lance/edit_captions.py", line 70, in __init__
    self.model = self.load_model()
  File "/home/user/Experiement/Lance/lance/edit_captions.py", line 101, in load_model
    with lazy_load(pretrained_path) as pretrained_checkpoint, lazy_load(
  File "/home/user/Experiement/Lance/lit-llama/lit_llama/utils.py", line 331, in __init__
    self.zf = torch._C.PyTorchFileReader(str(fn))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
Traceback (most recent call last):
  File "/home/user/.local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/user/.local/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/user/.local/lib/python3.8/site-packages/accelerate/commands/launch.py", line 994, in launch_command
    simple_launcher(args)
  File "/home/user/.local/lib/python3.8/site-packages/accelerate/commands/launch.py", line 636, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)

The traceback indicates a failure while reading the zip archive behind the .pt file, which I suspect is related to how the model checkpoint was packaged.

Any advice or troubleshooting steps would be much appreciated.

Sriram Yenamandra · Answer 1 · Thu Dec 28 2023 02:00:33 GMT+0800 (China Standard Time)

Hi, thanks for creating the issue!

To run on HardImageNet, you can follow the commands here. We point to the ImageNet val set here, and the dataset will include only the HardImageNet images if you specify --dset_name HardImageNet.
For example: if we pass --img_dir /path/to/ImageNet/val, it should have images listed here at paths that look like /path/to/ImageNet/val/n03218198/ILSVRC2012_val_00002266.JPEG.
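A quick way to sanity-check the layout described above is sketched below. This is only an illustration, not part of the repository: `summarize_val_layout` is a hypothetical helper, and the two filename patterns are inferred from the single example path given above (a synset folder such as n03218198 containing files such as ILSVRC2012_val_00002266.JPEG).

```python
import re
from pathlib import Path

# Patterns inferred from the example path above (an assumption, not a spec).
WNID_DIR = re.compile(r"^n\d{8}$")
VAL_IMAGE = re.compile(r"^ILSVRC2012_val_\d{8}\.JPEG$")


def summarize_val_layout(img_dir):
    """Return {wnid: image_count} for class folders found directly under img_dir."""
    counts = {}
    for class_dir in sorted(Path(img_dir).iterdir()):
        # Only consider sub-folders named like a WordNet ID, e.g. n03218198.
        if class_dir.is_dir() and WNID_DIR.match(class_dir.name):
            counts[class_dir.name] = sum(
                1 for f in class_dir.iterdir() if VAL_IMAGE.match(f.name)
            )
    return counts
```

Calling `summarize_val_layout("/path/to/ImageNet/val")` should list one entry per class folder. An empty result would suggest that the images are not arranged in per-class sub-folders under the directory passed as --img_dir.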
Can you please check if the checkpoint files are downloaded properly under checkpoints/caption_editing? What are the sizes of files you see? The three files should have sizes of 26G, 8.1M, and 489K.
If you see that the files are not downloaded properly, please run git lfs pull inside checkpoints/caption_editing.
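The check above can also be scripted. The sketch below is a hypothetical helper (`diagnose_checkpoint` is not part of the repository) that relies on two assumptions: a file that was never fetched by Git LFS is a small text pointer beginning with the Git LFS spec line, and a checkpoint readable by the zip-based loader in the traceback must be a complete zip archive. A truncated download lacks the zip central directory, which matches the "failed finding central directory" error.

```python
import zipfile
from pathlib import Path

# First line of a Git LFS pointer file, per the Git LFS pointer format.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def diagnose_checkpoint(path):
    """Classify a file as 'missing', 'lfs_pointer', 'zip_ok', or 'not_zip'."""
    path = Path(path)
    if not path.is_file():
        return "missing"
    with path.open("rb") as f:
        head = f.read(len(LFS_POINTER_PREFIX))
    if head == LFS_POINTER_PREFIX:
        # The real weights were never downloaded; `git lfs pull` should fetch them.
        return "lfs_pointer"
    if zipfile.is_zipfile(str(path)):
        # The end-of-central-directory record was found, so the archive is complete.
        return "zip_ok"
    # Neither a pointer nor a complete zip: likely truncated or in another format.
    return "not_zip"
```

Running `diagnose_checkpoint` on each file under checkpoints/caption_editing and comparing `Path(...).stat().st_size` with the sizes listed above should show whether any file is still a pointer or was only partially downloaded. Only the `.pt` checkpoint is expected to be a zip archive, so a "not_zip" result for the smaller files is not necessarily a problem.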

Please let us know if the above doesn't work.

Xiang · Answer 2 · Thu Jan 04 2024 11:54:54 GMT+0800 (China Standard Time)

Hi, thank you for your advice.

There are still some small errors with the first problem, but I think I can work through them.

As for the second problem, the three checkpoint files had not been fully downloaded when I cloned the project. Running git lfs pull solved it.