pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration

Home Page: https://pytorch.org

roi_crop (from Detectron.pytorch) building consistently fails

phalexo opened this issue · comments

Python has no problem with importing pytorch, but building the extension fails.

gcc -pthread -B /home/developer/anaconda3/envs/pytorch/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DWITH_CUDA -I/home/developer/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/ffi/../../lib/include -I/home/developer/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/ffi/../../lib/include/TH -I/home/developer/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/ffi/../../lib/include/THC -I/usr/local/cuda/include -I/home/developer/anaconda3/envs/pytorch/include/python3.6m -c /home/developer/Detectron.pytorch/lib/model/roi_crop/src/roi_crop_cuda.c -o ./home/developer/Detectron.pytorch/lib/model/roi_crop/src/roi_crop_cuda.o -std=c99
/home/developer/Detectron.pytorch/lib/model/roi_crop/src/roi_crop_cuda.c: In function 'BilinearSamplerBHWD_updateOutput_cuda':
/home/developer/Detectron.pytorch/lib/model/roi_crop/src/roi_crop_cuda.c:22:64: error: dereferencing pointer to incomplete type 'THCTensor {aka struct THCTensor}'
success = BilinearSamplerBHWD_updateOutput_cuda_kernel(output->size[1],
^
Traceback (most recent call last):
File "/home/developer/anaconda3/envs/pytorch/lib/python3.6/distutils/unixccompiler.py", line 118, in _compile
extra_postargs)
File "/home/developer/anaconda3/envs/pytorch/lib/python3.6/distutils/ccompiler.py", line 909, in spawn
spawn(cmd, dry_run=self.dry_run)
File "/home/developer/anaconda3/envs/pytorch/lib/python3.6/distutils/spawn.py", line 36, in spawn
_spawn_posix(cmd, search_path, dry_run=dry_run)
File "/home/developer/anaconda3/envs/pytorch/lib/python3.6/distutils/spawn.py", line 159, in _spawn_posix
% (cmd, exit_status))
distutils.errors.DistutilsExecError: command 'gcc' failed with exit status 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/developer/anaconda3/envs/pytorch/lib/python3.6/site-packages/cffi/ffiplatform.py", line 51, in _build
dist.run_command('build_ext')
File "/home/developer/anaconda3/envs/pytorch/lib/python3.6/distutils/dist.py", line 974, in run_command
cmd_obj.run()
File "/home/developer/anaconda3/envs/pytorch/lib/python3.6/distutils/command/build_ext.py", line 339, in run
self.build_extensions()
File "/home/developer/anaconda3/envs/pytorch/lib/python3.6/distutils/command/build_ext.py", line 448, in build_extensions
self._build_extensions_serial()
File "/home/developer/anaconda3/envs/pytorch/lib/python3.6/distutils/command/build_ext.py", line 473, in _build_extensions_serial
self.build_extension(ext)
File "/home/developer/anaconda3/envs/pytorch/lib/python3.6/distutils/command/build_ext.py", line 533, in build_extension
depends=ext.depends)
File "/home/developer/anaconda3/envs/pytorch/lib/python3.6/distutils/ccompiler.py", line 574, in compile
self._compile(obj, src, ext, cc_args, extra_postargs, pp_opts)
File "/home/developer/anaconda3/envs/pytorch/lib/python3.6/distutils/unixccompiler.py", line 120, in _compile

I believe this issue should be opened in the Detectron.pytorch repo.

FYI, this is because we made some structs in THC abstract in HEAD. Any sites which accessed members directly have to use a function instead now.
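
For example, code that used to read the struct members directly now has to go through the accessor functions. A minimal sketch (output here stands for any THCudaTensor*, and state is the extern THCState* these extensions already declare):

/* before: direct member access, which no longer compiles once THCTensor is opaque */
long batchSize = output->size[0];

/* after: go through the accessor, passing the THCState* */
long batchSize = THCudaTensor_size(state, output, 0);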

Is there a reference for what changes have to be made to match the changes in PyTorch? The Detectron repo is not the only one affected; there are at least two others.

Not yet, but you can get some guidance by looking at 4caea64; look at the changes to files in torch/csrc.

I've been looking at the changes and unfortunately I am not seeing how everything is connected.

The first error occurs in this line:
success = BilinearSamplerBHWD_updateOutput_cuda_kernel(output->size[1],
output is the pointer with the now-incomplete ("undefined") type. It does not appear to be a stream.

#include <THC/THC.h>
#include <stdbool.h>
#include <stdio.h>
#include "roi_crop_cuda_kernel.h"

#define real float

// this symbol will be resolved automatically from PyTorch libs
extern THCState *state;

// Bilinear sampling is done in BHWD (coalescing is not obvious in BDHW)
// we assume BHWD format in inputImages
// we assume BHW(YX) format on grids

int BilinearSamplerBHWD_updateOutput_cuda(THCudaTensor *inputImages, THCudaTensor *grids, THCudaTensor *output){
    // THCState *state = getCutorchState(L);
    // THCudaTensor *inputImages = (THCudaTensor *)luaT_checkudata(L, 2, "torch.CudaTensor");
    // THCudaTensor *grids = (THCudaTensor *)luaT_checkudata(L, 3, "torch.CudaTensor");
    // THCudaTensor *output = (THCudaTensor *)luaT_checkudata(L, 4, "torch.CudaTensor");

    int success = 0;
    success = BilinearSamplerBHWD_updateOutput_cuda_kernel(output->size[1],
                                                           output->size[3],
                                                           output->size[2],
                                                           output->size[0],
                                                           THCudaTensor_size(state, inputImages, 1),
                                                           THCudaTensor_size(state, inputImages, 2),
                                                           THCudaTensor_size(state, inputImages, 3),
                                                           THCudaTensor_size(state, inputImages, 0),
                                                           THCudaTensor_data(state, inputImages),
                                                           THCudaTensor_stride(state, inputImages, 0),
                                                           THCudaTensor_stride(state, inputImages, 1),
                                                           THCudaTensor_stride(state, inputImages, 2),
                                                           THCudaTensor_stride(state, inputImages, 3),
                                                           THCudaTensor_data(state, grids),
                                                           THCudaTensor_stride(state, grids, 0),
                                                           THCudaTensor_stride(state, grids, 3),
                                                           THCudaTensor_stride(state, grids, 1),
                                                           THCudaTensor_stride(state, grids, 2),
                                                           THCudaTensor_data(state, output),
                                                           THCudaTensor_stride(state, output, 0),
                                                           THCudaTensor_stride(state, output, 1),
                                                           THCudaTensor_stride(state, output, 2),
                                                           THCudaTensor_stride(state, output, 3),
                                                           THCState_getCurrentStream(state));

    // check for errors
    if (!success) {
        THError("aborting");
    }
    return 1;
}
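
If it helps, this is a minimal, untested sketch of the corresponding fix: the same call with the four output->size[...] accesses replaced by the THCudaTensor_size accessor that the other arguments already use; everything else is unchanged.

    // Sketch of a possible fix (untested): read the output sizes through the
    // THC accessor functions instead of the now-opaque struct members.
    success = BilinearSamplerBHWD_updateOutput_cuda_kernel(THCudaTensor_size(state, output, 1),
                                                           THCudaTensor_size(state, output, 3),
                                                           THCudaTensor_size(state, output, 2),
                                                           THCudaTensor_size(state, output, 0),
                                                           THCudaTensor_size(state, inputImages, 1),
                                                           THCudaTensor_size(state, inputImages, 2),
                                                           THCudaTensor_size(state, inputImages, 3),
                                                           THCudaTensor_size(state, inputImages, 0),
                                                           THCudaTensor_data(state, inputImages),
                                                           THCudaTensor_stride(state, inputImages, 0),
                                                           THCudaTensor_stride(state, inputImages, 1),
                                                           THCudaTensor_stride(state, inputImages, 2),
                                                           THCudaTensor_stride(state, inputImages, 3),
                                                           THCudaTensor_data(state, grids),
                                                           THCudaTensor_stride(state, grids, 0),
                                                           THCudaTensor_stride(state, grids, 3),
                                                           THCudaTensor_stride(state, grids, 1),
                                                           THCudaTensor_stride(state, grids, 2),
                                                           THCudaTensor_data(state, output),
                                                           THCudaTensor_stride(state, output, 0),
                                                           THCudaTensor_stride(state, output, 1),
                                                           THCudaTensor_stride(state, output, 2),
                                                           THCudaTensor_stride(state, output, 3),
                                                           THCState_getCurrentStream(state));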

Along those lines, I made the following mods to one file and similar mods to two others. I'd appreciate a comment on whether this makes sense.

#include <THC/THC.h>
#include <stdio.h>
#include "nms_cuda_kernel.h"

// this symbol will be resolved automatically from PyTorch libs
extern THCState *state;

int nms_cuda(THCudaIntTensor *keep_out, THCudaTensor *boxes_host,
THCudaIntTensor *num_out, float nms_overlap_thresh) {

    int sz0 = THCudaTensor_size(state, boxes_host, 0);
    int sz1 = THCudaTensor_size(state, boxes_host, 1);
    nms_cuda_compute(THCudaIntTensor_data(state, keep_out),
                     THCudaIntTensor_data(state, num_out),
                     THCudaTensor_data(state, boxes_host),
                     sz0, sz1,
                     //boxes_host->size[0],
                     //boxes_host->size[1],
                     nms_overlap_thresh);

    return 1;

}

@phalexo You could try installing pytorch 0.4.0 and putting CFLAGS="-std=c99" before sh make.sh.

I want to follow up on this issue. The commit mentioned earlier, 4caea64, breaks a lot of user-defined cpp/cuda extensions on 0.4.1; examples include roi_align in Detectron.pytorch, correlation in flownet2-pytorch, and many more.

I think a tutorial, or at least a detailed comment, should be provided illustrating how to migrate these extensions to the newer at::Tensor format suggested in the cpp_extension tutorial. Otherwise, most open-source implementations with self-defined operations cannot benefit from the other updates in 0.4.1 and later versions.

Thanks for your time.

Edit:
I found some references here:
https://github.com/pytorch/pytorch/tree/master/aten
https://github.com/zdevito/ATen/tree/master/aten/doc
https://github.com/pytorch/pytorch/tree/master/aten/src/ATen/test
but I am still a little confused about how to start the migration.
Is there any other reference? Thanks.

@JiamingSuen The world is small LOL

Struggling to get flownet2-pytorch built

Facing the same issue. Currently I am using CUDA 9.2. I can't downgrade pytorch 0.4.1 to a lower version because CUDA 9.2 doesn't support them, and I don't want to downgrade CUDA.
Any insights? Thanks

@ezyang Any more comments to address this issue would be appreciated.
@GuoleiSun You may compile pytorch==0.4.0 yourself as a temporary solution; however, rewriting the self-defined operators is necessary to use 0.4.1 and above.

@JiamingSuen your references are all spot-on. Also look at how we migrated torchaudio from a cffi extension that used the TH* API to a cpp_extension that uses ATen: pytorch/audio@18c01be

We're happy to answer any questions.
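
In case it helps to see the shape of that migration, below is a minimal, hypothetical sketch of the nms_cuda wrapper from earlier in this thread rewritten as a cpp_extension against at::Tensor. The declaration of nms_cuda_compute is assumed from the call in the earlier snippet, and the extension header name varies by version, so treat this as an illustration rather than a drop-in replacement.

// Hypothetical cpp_extension version of the nms_cuda binding shown earlier.
// Sizes and data pointers come from at::Tensor methods instead of the
// THCudaTensor_* accessor functions, and there is no THCState* to pass around.
#include <torch/extension.h>  // <torch/torch.h> on 0.4.x-era cpp_extensions

// Assumed declaration of the existing CUDA launcher (from nms_cuda_kernel.h),
// matching the call in the snippet above.
void nms_cuda_compute(int *keep_out, int *num_out, float *boxes_host,
                      int boxes_num, int boxes_dim, float nms_overlap_thresh);

int nms_cuda(at::Tensor keep_out, at::Tensor boxes_host,
             at::Tensor num_out, float nms_overlap_thresh) {
    nms_cuda_compute(keep_out.data<int>(),
                     num_out.data<int>(),
                     boxes_host.data<float>(),
                     boxes_host.size(0),
                     boxes_host.size(1),
                     nms_overlap_thresh);
    return 1;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("nms_cuda", &nms_cuda, "NMS (CUDA), cpp_extension sketch");
}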