ahmdtaha / simsiam

PyTorch implementation of "Exploring Simple Siamese Representation Learning"


How to prepare my own dataset?

zhujilin1995 opened this issue · comments

Hello, in your code, the CIFAR-10 dataset is used, but the CIFAR-10 Python files have a specialized format. I want to train on my own images; how should I prepare them? Thank you very much.

Hi Zhu,
It has been a while since I worked on this code.
As far as I remember, you should create a new Dataset module (e.g., imagenet) inside simsiam/data, following the CIFAR module.
Import the new dataset class inside the package init, e.g., from data.imagenet import ImageNet.

Then provide the set argument to indicate your new Dataset (e.g., ImageNet).
I hope this helps
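The steps above can be sketched roughly as follows. This is a hypothetical illustration, not the repo's actual CIFAR module interface: the class name ImageFolderDataset and the root/<class>/<image>.jpg layout are assumptions. In the real code the class would subclass torch.utils.data.Dataset and decode images with PIL:

```python
# Minimal sketch of a folder-based dataset module (hypothetical names),
# following the usual PyTorch Dataset pattern. In the repo this would go in
# simsiam/data/imagenet.py and subclass torch.utils.data.Dataset.
import os

class ImageFolderDataset:
    """Enumerates images laid out as root/<class_name>/<image>.jpg."""

    def __init__(self, root, transform=None):
        self.transform = transform
        # Each subdirectory of root is treated as one class.
        self.classes = sorted(
            d for d in os.listdir(root)
            if os.path.isdir(os.path.join(root, d))
        )
        self.class_to_idx = {c: i for i, c in enumerate(self.classes)}
        self.samples = []
        for c in self.classes:
            cdir = os.path.join(root, c)
            for fname in sorted(os.listdir(cdir)):
                if fname.lower().endswith((".jpg", ".jpeg", ".png")):
                    self.samples.append(
                        (os.path.join(cdir, fname), self.class_to_idx[c])
                    )

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, label = self.samples[idx]
        # In a real implementation: img = PIL.Image.open(path).convert("RGB")
        img = path  # placeholder so this sketch has no PIL/torch dependency
        if self.transform is not None:
            img = self.transform(img)
        return img, label
```

After that, importing the class in the package init and pointing the set argument at it should make the new dataset selectable, assuming the CIFAR module is wired up the same way.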

Thank you for your reply. I came across another problem when I ran pretrain_main.py; the following errors are shown:

```
Train Running basic DDP example on rank 0.
Process Process-2:
Traceback (most recent call last):
  File "D:\public_program\miniconda\lib\multiprocessing\process.py", line 315, in _bootstrap
    self.run()
  File "D:\public_program\miniconda\lib\multiprocessing\process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "F:\Zhujilin\SimSiam\pretrain_main.py", line 66, in train_ddp
    setup(rank, cfg.world_size, start_port)
  File "F:\Zhujilin\SimSiam\pretrain_main.py", line 28, in setup
    dist.init_process_group(
  File "C:\Users\Zhujilin.conda\envs\simsiam\lib\site-packages\torch\distributed\distributed_c10d.py", line 503, in init_process_group
    _update_default_pg(_new_process_group_helper(
  File "C:\Users\Zhujilin.conda\envs\simsiam\lib\site-packages\torch\distributed\distributed_c10d.py", line 588, in _new_process_group_helper
    pg = ProcessGroupGloo(
RuntimeError
```

I have tried looking up solutions on the Internet, but I still cannot find an appropriate one. Could you please give me some advice on dealing with this error? I would really appreciate it.

Seems like a DDP initialization error.
Try running the code on a single GPU first and see whether the error persists. You can do so by either

  1. Setting world_size = 1, or
  2. Skipping the DDP setup altogether and going directly to train_ddp. Basically, call train_ddp instead of spawn_train. Make sure to pass the right parameters to train_ddp.
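The dispatch logic behind option 2 can be sketched as below. This is a hedged illustration: the names train_ddp, spawn_train, and cfg.world_size come from the traceback above, but the exact signatures in the repo may differ:

```python
# Hypothetical sketch of the single-GPU fallback: instead of spawning
# world_size worker processes (each of which calls dist.init_process_group),
# call the training function directly with rank 0 so no process group is
# ever created.
def run(cfg, train_ddp, spawn_train):
    if cfg.world_size > 1:
        # Multi-GPU path: spawn_train would wrap torch.multiprocessing.spawn
        spawn_train(train_ddp, cfg)
    else:
        # Single-GPU path: bypass DDP setup entirely
        train_ddp(rank=0, cfg=cfg)
```

If the error disappears on a single GPU, the problem is in the Gloo/DDP initialization (often the port or backend choice on Windows), not in the training code itself.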

Thank you, I have settled this issue.

However, I have another problem.
When I finished epoch 799, an error occurred. I fixed the code and used the 799-epoch.state file to resume training, but the training accuracy dropped from 89% to 50%. This is different from what I expected; I thought training would continue from the previous accuracy.

So, why did this happen, and how should I handle this situation? I'm a beginner in coding, so my questions might be bothersome, and I apologize for that. Thank you very much for your answers.

The training information is as follows:
[screenshot of the training log]

I am not sure whether you are referring to the pre-training (pretrain_main.py) or the fine-tuning (classifier_main.py) stage.

If you are referring to the pre-training stage, you should set either resume or pretrained.

I don't fully remember the difference between resume and pretrained, but at least one difference is that pretrained assumes you are training from scratch, i.e., start_epoch=0. In contrast, resume sets start_epoch correctly to resume from where the failure happened. In that case, the lr scheduler should resume with the correct learning rate, i.e., the one at the crash point and not the initial lr.

If I were you, I would make sure the resume code is executed correctly and both the start_epoch and the lr reflect the state where failure happened.
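The resume logic being described can be sketched as follows. The checkpoint key names ("epoch", "state_dict", "optimizer") are assumptions, not the repo's actual format, and the dict updates stand in for what would be torch.load and load_state_dict calls in real code:

```python
# Hedged sketch of resume logic: restore model/optimizer state and set both
# start_epoch and the scheduler position so the learning rate continues from
# the crash point instead of restarting from the initial lr.
def load_resume_state(checkpoint, model, optimizer, scheduler):
    model.update(checkpoint["state_dict"])      # stand-in for load_state_dict
    optimizer.update(checkpoint["optimizer"])   # stand-in for load_state_dict
    scheduler["last_epoch"] = checkpoint["epoch"]  # lr picks up mid-schedule
    start_epoch = checkpoint["epoch"] + 1       # resume after the saved epoch
    return start_epoch
```

The accuracy drop from 89% to 50% is what one would expect if only the weights were restored but the scheduler restarted at the initial (large) learning rate, or if start_epoch silently reset to 0; checking both values at startup should reveal which happened.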

A similar logic applies to the fine-tuning (classifier_main.py) stage. Basically, make sure your code executes this line. Also double-check both the start_epoch and the lr.

I hope this helps

This helps a lot, thank you