NVlabs / few-shot-vid2vid

PyTorch implementation for few-shot photorealistic video-to-video translation.

Pitfalls and solutions when training the model on Colab

jjandnn opened this issue

1. CUDA errors: no kernel image, and no memory
Colab mainly hands out four kinds of GPUs: K80, P4, T4, and P100.
With a P100 or a T4, installing and running flownet2_pytorch (it has to be rebuilt in every session) causes no problems: just follow the official README, or cd into few-shot-vid2vid/models/networks/flownet2_pytorch/ and run !bash install.sh.
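A minimal sketch of that per-session install in a Colab cell, assuming the repo sits on the mounted Drive under /content/drive/My Drive/ as in the paths later in this post:

# The CUDA extensions have to be rebuilt in every new Colab session.
%cd "/content/drive/My Drive/few-shot-vid2vid/models/networks/flownet2_pytorch"
!bash install.sh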
On a K80 or a P4, however, the build fails with the CUDA error "no kernel image ..." (an ancient CUDA problem that never quite gets solved).
The best solution:
simply factory-reset the runtime and keep switching hosts until you land on a T4 or a P100. Least effort, most effective, heh.
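To see which GPU the current session was assigned before wasting time on a build, a quick check (plain PyTorch, nothing repo-specific):

import torch
print(torch.cuda.get_device_name(0))  # e.g. 'Tesla K80', 'Tesla P4', 'Tesla T4', 'Tesla P100-PCIE-16GB'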
The second option is to edit these three files:
/content/drive/My Drive/few-shot-vid2vid/models/networks/flownet2_pytorch/networks/channelnorm_package/setup.py;
/content/drive/My Drive/few-shot-vid2vid/models/networks/flownet2_pytorch/networks/correlation_package/setup.py;
/content/drive/My Drive/few-shot-vid2vid/models/networks/flownet2_pytorch/networks/resample2d_package/setup.py;
and add the -gencode flags matching your GPU's compute capability to nvcc_args in each of them (a setup.py sketch follows after the list):
nvcc_args = [
'-gencode', 'arch=compute_30,code=sm_30',
'-gencode', 'arch=compute_35,code=sm_35',
'-gencode', 'arch=compute_37,code=sm_37',
'-gencode', 'arch=compute_50,code=sm_50',
'-gencode', 'arch=compute_52,code=sm_52',
'-gencode', 'arch=compute_60,code=sm_60',
'-gencode', 'arch=compute_61,code=sm_61',
'-gencode', 'arch=compute_70,code=sm_70',
'-gencode', 'arch=compute_70,code=compute_70'
]
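For orientation, this is roughly where nvcc_args plugs in, sketched for the channelnorm package under the assumption that the stock flownet2_pytorch file layout is unchanged (the correlation and resample2d packages follow the same pattern):

from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

cxx_args = ['-std=c++11']

# -gencode pairs for the GPUs Colab hands out.
nvcc_args = [
    '-gencode', 'arch=compute_37,code=sm_37',      # K80
    '-gencode', 'arch=compute_60,code=sm_60',      # P100
    '-gencode', 'arch=compute_61,code=sm_61',      # P4
    '-gencode', 'arch=compute_70,code=sm_70',      # V100
    '-gencode', 'arch=compute_70,code=compute_70', # PTX fallback, JIT-compiled on newer GPUs such as the T4
]

setup(
    name='channelnorm_cuda',
    ext_modules=[
        CUDAExtension('channelnorm_cuda',
                      ['channelnorm_cuda.cc', 'channelnorm_kernel.cu'],
                      extra_compile_args={'cxx': cxx_args, 'nvcc': nvcc_args})
    ],
    cmdclass={'build_ext': BuildExtension}
)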

For the K80, force the PyTorch version to 0.4.1.
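A sketch of that pin in a Colab cell (assuming the package index still serves a 0.4.1 wheel for the session's Python version; otherwise grab a wheel from the PyTorch previous-versions page):

!pip install torch==0.4.1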

2. Web preview image read error when resuming training: input/output: epoch ...
This error does not happen locally; only with Colab plus Google Drive.
Cause: too many files accumulate in the Google Drive folder and Colab can no longer read it (another long-standing problem).
Solution:
Go to /content/drive/My Drive/few-shot-vid2vid/checkpoints/face/web/,
delete the entire images folder, and recreate an empty images folder in its place; that is enough (see the Colab sketch below).
Alternatively, add the --no_html flag when training (I have not tested this, because I need the previews).
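A minimal sketch of that cleanup in a Colab cell, assuming the default face checkpoint path used above (change "face" to whatever you passed as --name):

# Drop the bloated preview folder and recreate it empty so the visualizer can keep writing.
!rm -rf "/content/drive/My Drive/few-shot-vid2vid/checkpoints/face/web/images"
!mkdir -p "/content/drive/My Drive/few-shot-vid2vid/checkpoints/face/web/images"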

Note: when resuming training, the iters counter is not necessarily an integer; this does not affect the result.

3. After 50 epochs, when the sequence length steps up (seq length to XX), training runs out of memory.
Solution: only switching to a P100 works; a T4 is not enough.
After 70 epochs, seq length goes to 16.
No workaround for that yet; upgrading and re-validating flownet2 would be too big a job.

Note: the issue in item 3 has since been fixed by the repo owner @tcwang0509, and training now runs smoothly. It is just that after 70 epochs the Colab host is fairly slow; nothing can be done about that, it's free! (2020.2.13)

The code is great. Thanks to the developers and thanks to NVlabs (I had just gone and shorted NVIDIA stock, betting that few-shot-vid2vid and stylegan2 would not make it before the end of the year ... oh my god!).
Wishing everyone smooth and happy training!

Q2:
Either adjust --display_freq in train_options.py, line 9:
parser.add_argument('--display_freq', type=int, default=100, help='frequency of showing training results on screen')
or remove the iter index below util/visualizer.py line 110 (if self.use_html:); a command-line sketch follows below.
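A hedged sketch of the first option: raising --display_freq so preview images are written far less often, combined with --no_html from item 2 above. The base flags follow the face training example in the repo README and may need adjusting for your dataset:

!python train.py --name face --dataset_mode fewshot_face --adaptive_spade --display_freq 1000 --no_html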

@AaronWong approximately how much training time would it take on Colab to achieve the same results as the GIFs in this repo?

Hi @ssaleth

@AaronWong approximately how much training time would it take on Colab to achieve the same results as the GIFs in this repo?

I didn't use Colab; you may want to ask jjandnn.
I still train on my own server and haven't achieved a good result yet.