NVlabs / few-shot-vid2vid

PyTorch implementation for few-shot photorealistic video-to-video translation.

Pitfalls and solutions when training the model on Colab

jjandnn opened this issue · comments

English (machine translation, forgive me):
1. CUDA errors: "no kernel image" and "no memory". Colab mainly offers four kinds of GPU: K80, P4, P100, and T4.
On the P100 and T4, installing and running flownet2_pytorch (it has to be installed again in every session) works without problems: follow the official README, or go into few-shot-vid2vid/models/networks/flownet2_pytorch/ and run bash install.sh.
On the K80 and P4, however, it fails with a CUDA error: "no kernel image ..." (a CUDA problem that has gone unsolved for ages).
The best solution is to reset the runtime so that Colab assigns a different host, and repeat until you get a P100 or T4. This takes the least effort and is the most efficient.
The second solution is to edit these three files:
/content/drive/My Drive/few-shot-vid2vid/models/networks/flownet2_pytorch/networks/channelnorm_package/setup.py
/content/drive/My Drive/few-shot-vid2vid/models/networks/flownet2_pytorch/networks/correlation_package/setup.py
/content/drive/My Drive/few-shot-vid2vid/models/networks/flownet2_pytorch/networks/resample2d_package/setup.py
In each of the three files, add the target architectures to nvcc_args:
nvcc_args = [
    '-gencode', 'arch=compute_30,code=sm_30',
    '-gencode', 'arch=compute_35,code=sm_35',
    '-gencode', 'arch=compute_37,code=sm_37',
    '-gencode', 'arch=compute_50,code=sm_50',
    '-gencode', 'arch=compute_52,code=sm_52',
    '-gencode', 'arch=compute_60,code=sm_60',
    '-gencode', 'arch=compute_61,code=sm_61',
    '-gencode', 'arch=compute_70,code=sm_70',
    '-gencode', 'arch=compute_70,code=compute_70'
]
On the K80, also specify pytorch == 0.41.
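For anyone who prefers not to type the flag list by hand, here is a minimal sketch of how the '-gencode' pairs could be generated from compute capabilities. The helper name gencode_flags and the COLAB_GPUS table are my own illustration and are not part of flownet2_pytorch; the capability values are NVIDIA's published ones for these four cards. It does not check whether the installed CUDA toolkit still supports each architecture, so treat the output as a starting point.

```python
# Compute capabilities of the four Colab GPUs named in this issue
# (values as published by NVIDIA; the table itself is illustrative).
COLAB_GPUS = {"K80": (3, 7), "P4": (6, 1), "P100": (6, 0), "T4": (7, 5)}


def gencode_flags(capabilities, ptx_fallback=None):
    """Build '-gencode' argument pairs suitable for an nvcc_args list.

    capabilities: iterable of (major, minor) tuples, e.g. [(3, 7), (6, 1)].
    ptx_fallback: optional (major, minor); appends a PTX entry
                  (code=compute_XX) that newer GPUs can JIT-compile.
    """
    flags = []
    for major, minor in sorted(set(capabilities)):
        cc = f"{major}{minor}"
        flags += ["-gencode", f"arch=compute_{cc},code=sm_{cc}"]
    if ptx_fallback is not None:
        cc = f"{ptx_fallback[0]}{ptx_fallback[1]}"
        flags += ["-gencode", f"arch=compute_{cc},code=compute_{cc}"]
    return flags


# Flags for the two GPUs reported to fail here (K80 and P4).
nvcc_args = gencode_flags([COLAB_GPUS["K80"], COLAB_GPUS["P4"]])
print(nvcc_args)
```

The resulting list could be pasted into the nvcc_args of the three setup.py files before running install.sh again.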

2. Error reading the web preview images when resuming training: input/output: epoch ...
This error does not occur locally, only on Colab with Google Drive.
Cause: there are too many files in the Google Drive folder, and Colab cannot read them (also an old problem).
Solution: go to /content/drive/My Drive/few-shot-vid2vid/checkpoints/face/web/, delete the entire images folder, and create an empty images folder in its place.
Alternatively, add the '--no_html' parameter when training (I haven't tested this because I need the preview).

Note: when you resume training, iters may not be a whole number; this does not affect the result.

3. After 50 epochs ("seq length to XX"): out of memory.
Solution: switch to a P100 instead of a T4.

After 70 epochs ("seq length to 16"):
Haven't figured out a solution yet.

Note: the problem in item 3 has since been fixed in an update by the program's author, @tcwang0509, and training now runs smoothly. After 70 epochs the Colab host is relatively slow, but nothing can be done about that; it's free! (2020.2.13)

The program is great. Thanks to the developers and thanks to NVlabs (I went a bit crazy and shorted NV stock, thinking that few-v2v and stylegan2 wouldn't make it by the end of the year ... oh, my God!)
I wish you all the best!

train_options.py line 9
parser.add_argument('--display_freq', type=int, default=100, help='frequency of showing training results on screen')
remove iter below util/visualizer.py line 110 if self.use_html:

@AaronWong approximately how much training time does it take on Colab to achieve the same results as the GIFs in this repo?

Hi @ssaleth

I didn't use Colab; you may want to ask jjandnn.
I'm still training on my own server and haven't achieved a good result yet.