NVlabs / few-shot-vid2vid

PyTorch implementation for few-shot photorealistic video-to-video translation.

Pitfalls and solutions when training the model on Colab

jjandnn opened this issue

1. CUDA errors: no kernel image, and no memory
Colab mainly hands out four kinds of GPUs: K80, P4, T4, and P100.
With a P100 or a T4, installing and running flownet2_pytorch (it has to be rebuilt in every session) causes no problems: just follow the official README, or cd into few-shot-vid2vid/models/networks/flownet2_pytorch/ and run !bash install.sh.
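A minimal sketch of that per-session install in a Colab cell, assuming the repo sits on the mounted Drive under /content/drive/My Drive/ as in the paths later in this post:

# The CUDA extensions have to be rebuilt in every new Colab session.
%cd "/content/drive/My Drive/few-shot-vid2vid/models/networks/flownet2_pytorch"
!bash install.sh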
On a K80 or a P4, however, the build fails with the CUDA error "no kernel image ..." (an ancient CUDA problem that never quite gets solved).
The best solution:
simply factory-reset the runtime and keep switching hosts until you land on a T4 or a P100. Least effort, most effective, heh.
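To see which GPU the current session was assigned before wasting time on a build, a quick check (plain PyTorch, nothing repo-specific):

import torch
print(torch.cuda.get_device_name(0))  # e.g. 'Tesla K80', 'Tesla P4', 'Tesla T4', 'Tesla P100-PCIE-16GB'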
The second option is to edit these three files:
/content/drive/My Drive/few-shot-vid2vid/models/networks/flownet2_pytorch/networks/channelnorm_package/setup.py;
/content/drive/My Drive/few-shot-vid2vid/models/networks/flownet2_pytorch/networks/correlation_package/setup.py;
/content/drive/My Drive/few-shot-vid2vid/models/networks/flownet2_pytorch/networks/resample2d_package/setup.py;
and add the -gencode flags matching your GPU's compute capability to nvcc_args in each of them (a setup.py sketch follows after the list):
nvcc_args = [
'-gencode', 'arch=compute_30,code=sm_30',
'-gencode', 'arch=compute_35,code=sm_35',
'-gencode', 'arch=compute_37,code=sm_37',
'-gencode', 'arch=compute_50,code=sm_50',
'-gencode', 'arch=compute_52,code=sm_52',
'-gencode', 'arch=compute_60,code=sm_60',
'-gencode', 'arch=compute_61,code=sm_61',
'-gencode', 'arch=compute_70,code=sm_70',
'-gencode', 'arch=compute_70,code=compute_70'
]
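For orientation, this is roughly where nvcc_args plugs in, sketched for the channelnorm package under the assumption that the stock flownet2_pytorch file layout is unchanged (the correlation and resample2d packages follow the same pattern):

from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

cxx_args = ['-std=c++11']

# -gencode pairs for the GPUs Colab hands out.
nvcc_args = [
    '-gencode', 'arch=compute_37,code=sm_37',      # K80
    '-gencode', 'arch=compute_60,code=sm_60',      # P100
    '-gencode', 'arch=compute_61,code=sm_61',      # P4
    '-gencode', 'arch=compute_70,code=sm_70',      # V100
    '-gencode', 'arch=compute_70,code=compute_70', # PTX fallback, JIT-compiled on newer GPUs such as the T4
]

setup(
    name='channelnorm_cuda',
    ext_modules=[
        CUDAExtension('channelnorm_cuda',
                      ['channelnorm_cuda.cc', 'channelnorm_kernel.cu'],
                      extra_compile_args={'cxx': cxx_args, 'nvcc': nvcc_args})
    ],
    cmdclass={'build_ext': BuildExtension}
)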

For the K80, force the PyTorch version to 0.4.1.
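A sketch of that pin in a Colab cell (assuming the package index still serves a 0.4.1 wheel for the session's Python version; otherwise grab a wheel from the PyTorch previous-versions page):

!pip install torch==0.4.1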

2. Web preview image read error when resuming training: input/output: epoch ...
This error does not happen locally; only with Colab plus Google Drive.
Cause: too many files accumulate in the Google Drive folder and Colab can no longer read it (another long-standing problem).
Solution:
Go to /content/drive/My Drive/few-shot-vid2vid/checkpoints/face/web/,
delete the entire images folder, and recreate an empty images folder in its place; that is enough (see the Colab sketch below).
Alternatively, add the --no_html flag when training (I have not tested this, because I need the previews).
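A minimal sketch of that cleanup in a Colab cell, assuming the default face checkpoint path used above (change "face" to whatever you passed as --name):

# Drop the bloated preview folder and recreate it empty so the visualizer can keep writing.
!rm -rf "/content/drive/My Drive/few-shot-vid2vid/checkpoints/face/web/images"
!mkdir -p "/content/drive/My Drive/few-shot-vid2vid/checkpoints/face/web/images"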

Note: when resuming training, the iters counter is not necessarily an integer; this does not affect the result.

3. After 50 epochs, when the sequence length steps up (seq length to XX), training runs out of memory.
Solution: only switching to a P100 works; a T4 is not enough.
After 70 epochs, seq length goes to 16.
No workaround for that yet; upgrading and re-validating flownet2 would be too big a job.

Note: the issue in item 3 has since been fixed by the repo owner @tcwang0509, and training now runs smoothly. It is just that after 70 epochs the Colab host is fairly slow; nothing can be done about that, it's free! (2020.2.13)

The code is great. Thanks to the developers and thanks to NVlabs (I had just gone and shorted NVIDIA stock, betting that few-shot-vid2vid and stylegan2 would not make it before the end of the year ... oh my god!).
Wishing everyone smooth and happy training!

Q2:
Either adjust --display_freq in train_options.py, line 9:
parser.add_argument('--display_freq', type=int, default=100, help='frequency of showing training results on screen')
or remove the iter index below util/visualizer.py line 110 (if self.use_html:); a command-line sketch follows below.
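A hedged sketch of the first option: raising --display_freq so preview images are written far less often, combined with --no_html from item 2 above. The base flags follow the face training example in the repo README and may need adjusting for your dataset:

!python train.py --name face --dataset_mode fewshot_face --adaptive_spade --display_freq 1000 --no_html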

@AaronWong approximately how much training time would it take on Colab to achieve the same results as the GIFs in this repo?

Hi @ssaleth

@AaronWong approximately how much training time would it take on Colab to achieve the same results as the GIFs in this repo?

I didn't use Colab; you may want to ask jjandnn.
I still train on my own server and haven't achieved a good result yet.