pku-liang / FlexTensor

Automatic Schedule Exploration and Optimization Framework for Tensor Computations

Some problem while running on GPU

Light-of-Hers opened this issue

There seem to be some problems when running the second case on GPU:

$ python3 optimize_block_circulant_matrix.py --target cuda --parallel 20 --timeout 120
Optimize block_circulant_matrix shape (1024, 256, 8)
......
######################################
op schedules:
----------------------------------
fuse [[1, 2, 2]]
spatial [[8, 1, 4, 4], [16, 1, 2, 8]]
reduce [[1, 1, 8]]
reorder [[0]]
unroll [[512, 1]]
----------------------------------
fuse [[1, 2, 2]]
spatial [[32, 1, 2, 16], [2, 4, 32, 1]]
reorder [[0]]
unroll [[1, 1]]
graph schedules:
inline [[0, 0]]
merge [[0, 1]]
block_circulant_matrix_block_circulant_matrix_(1024, 256, 8)_cuda(0):[[{"fuse": [[1, 2, 2]], "spatial": [[8, 1, 4, 4], [16, 1, 2, 8]], "reduce": [[1, 1, 8]], "reorder": [[0]], "inline": [], "unroll": [[512, 1]], "merge": []}, {"fuse": [[1, 2, 2]], "spatial": [[32, 1, 2, 16], [2, 4, 32, 1]], "reduce": [], "reorder": [[0]], "inline": [], "unroll": [[1, 1]], "merge": []}], {"fuse": [], "spatial": [], "reduce": [], "reorder": [], "inline": [[0, 0]], "unroll": [], "merge": [[0, 1]]}]
Use 0.24673579999999998 ms
Cost 1717.7187826633453 s

Optimize block_circulant_matrix shape (1024, 256, 16)
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
Warning: No valid schedule found in warm up process, please use more trials
Now automatically use more trials, increase 20
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
Warning: No valid schedule found in warm up process, please use more trials
Now automatically use more trials, increase 20
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
Warning: No valid schedule found in warm up process, please use more trials
Now automatically use more trials, increase 20
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
Warning: No valid schedule found in warm up process, please use more trials
Now automatically use more trials, increase 20
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
Warning: No valid schedule found in warm up process, please use more trials
Now automatically use more trials, increase 20
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
Warning: No valid schedule found in warm up process, please use more trials
Now automatically use more trials, increase 20
.......

But if I run only the second case, everything seems OK:

$ python3 optimize_block_circulant_matrix.py --target cuda --parallel 20 --timeout 120 -f 1 -t 2
Optimize block_circulant_matrix shape (1024, 256, 16)
......
######################################
op schedules:
----------------------------------
fuse [[1, 2, 2]]
spatial [[32, 1, 2, 1], [4, 1, 4, 16]]
reduce [[1, 2, 8]]
reorder [[1]]
unroll [[512, 0]]
----------------------------------
fuse [[1, 2, 2]]
spatial [[128, 4, 1, 2], [4, 4, 16, 1]]
reorder [[2]]
unroll [[512, 1]]
graph schedules:
inline [[0, 0]]
merge [[1, 0]]
block_circulant_matrix_block_circulant_matrix_(1024, 256, 16)_cuda(0):[[{"fuse": [[1, 2, 2]], "spatial": [[32, 1, 2, 1], [4, 1, 4, 16]], "reduce": [[1, 2, 8]], "reorder": [[1]], "inline": [], "unroll": [[512, 0]], "merge": []}, {"fuse": [[1, 2, 2]], "spatial": [[128, 4, 1, 2], [4, 4, 16, 1]], "reduce": [], "reorder": [[2]], "inline": [], "unroll": [[512, 1]], "merge": []}], {"fuse": [], "spatial": [], "reduce": [], "reorder": [], "inline": [[0, 0]], "unroll": [], "merge": [[1, 0]]}]
Use 0.0080564 ms
Cost 1806.2457213401794 s

And if I run cases 3 to 6, case 3 is OK but case 4 fails:

$ python3 optimize_block_circulant_matrix.py --target cuda --parallel 20 --timeout 120 -f 2 -t 6
Optimize block_circulant_matrix shape (1024, 512, 8)
......

Optimize block_circulant_matrix shape (1024, 512, 16)
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
Warning: No valid schedule found in warm up process, please use more trials
Now automatically use more trials, increase 20
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
Warning: No valid schedule found in warm up process, please use more trials
Now automatically use more trials, increase 20
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
Warning: No valid schedule found in warm up process, please use more trials
Now automatically use more trials, increase 20
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
Warning: No valid schedule found in warm up process, please use more trials
Now automatically use more trials, increase 20
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
Warning: No valid schedule found in warm up process, please use more trials
Now automatically use more trials, increase 20
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
Warning: No valid schedule found in warm up process, please use more trials
Now automatically use more trials, increase 20
......

The same problem also appears when I run other scripts, such as 'optimize_conv1d.py' and 'optimize_gemm.py'.

commented

This issue was solved by using 'spawn' rather than 'fork' as the multiprocessing start method. The drawback is that scheduling becomes slower, because launching a process is expensive under 'spawn', and resource consumption becomes heavier, because the processes do not share memory contents.
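
For anyone landing here later, this is roughly what switching the start method looks like with Python's standard `multiprocessing` module. It is only a minimal sketch, not FlexTensor's actual code: `evaluate_candidate` and `run_trials` are hypothetical names, and the squaring is a placeholder for building and timing a schedule.

```python
import multiprocessing as mp


def evaluate_candidate(x):
    # Hypothetical stand-in for building and timing one schedule candidate.
    return x * x


def run_trials(candidates, start_method="spawn", workers=2):
    # get_context() returns a context bound to one start method without
    # changing the process-wide default.
    ctx = mp.get_context(start_method)
    with ctx.Pool(processes=workers) as pool:
        return pool.map(evaluate_candidate, candidates)


if __name__ == "__main__":
    # The guard is required with 'spawn': each child re-imports this module,
    # and unguarded pool creation would run again in every child.
    print(run_trials([1, 2, 3]))
```

With 'spawn', every worker starts a fresh interpreter and receives its arguments by pickling, which is consistent with the slower launch and higher resource use described above.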