szymonmaszke / torchlambda

Lightweight tool to deploy PyTorch models to AWS Lambda

Non mobile version of LibTorch

mmaybeno opened this issue · comments

Thanks for the awesome library. I was trying to use the vanilla fasterrcnn_resnet50_fpn but ran into some errors while using it. It turns out the mobile build of LibTorch is missing some custom ops, which prevents the model from running. Would it be beneficial to have Docker images that use the non-mobile build of LibTorch?

@mmaybeno Yes, I would love to have that feature. Unfortunately, AFAIK, a static build of PyTorch is pretty cumbersome (see this Stack Overflow answer of mine and this PyTorch issue specifically for more info).

If you are up for the challenge, I can give you some guidance on how things work here and help with this feature, but I don't think I will tackle it myself in the near future (unless their static build gets sorted out).

Also, did you see the torchlambda build --operations flag? Wouldn't that solve your issue?

Thanks for the info. I am currently trying to build with the --operations flag but am having some issues. I think when you install torchlambda via pip, it is missing the Dockerfile and related scripts. Working on it and will post back with the results :).

I was able to successfully build a custom torchlambda version after adding some necessary files from the repo to the pip-installed version. Not sure if this was intentional? The pip version was missing build.sh, CMakeLists.txt, and Dockerfile. Unfortunately, I still encountered the runtime error about the missing operation. I think the mobile version of LibTorch just does not include it. pytorch/vision#1407 (comment)

```
terminate called after throwing an instance of 'torch::jit::ErrorReport'
  what():
Unknown builtin op: torchvision::nms.
Could not find any similar ops to torchvision::nms. This op may not exist or may not be currently supported in TorchScript.
:
  File "/Users/maybeno/workspace/torch-lambda/.venv/lib/python3.8/site-packages/torchvision/ops/boxes.py", line 40
        by NMS, sorted in decreasing order of scores
    """
    return torch.ops.torchvision.nms(boxes, scores, iou_threshold)
           ~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
Serialized   File "code/__torch__/torchvision/ops/boxes.py", line 91
    scores: Tensor,
    iou_threshold: float) -> Tensor:
  _42 = ops.torchvision.nms(boxes, scores, iou_threshold)
        ~~~~~~~~~~~~~~~~~~~ <--- HERE
  return _42
```

@mmaybeno Fixed MANIFEST.in in the last commit; indeed, it didn't include the files necessary for a custom build. Not sure how I missed that, but thanks.

Still, torchvision.models.detection.fasterrcnn_resnet50_fpn is too large to be deployed (around 100 MB zipped), while AWS Lambda layers are limited to 50 MB. You would need a workaround (keeping the model in S3) and custom C++ deployment code anyway.
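A rough sketch of that S3 workaround in Python (the real torchlambda deployment code is C++, so this only illustrates the idea; the bucket and key names are placeholders):

```python
# Sketch: download the serialized model from S3 to /tmp (Lambda's only
# writable disk) on cold start, then load it. Bucket/key are placeholders.
import os

import torch

MODEL_PATH = "/tmp/model.pt"


def load_model():
    # Warm invocations reuse /tmp, so the download only happens on cold start
    if not os.path.exists(MODEL_PATH):
        import boto3  # available by default in the AWS Lambda Python runtime

        boto3.client("s3").download_file("my-bucket", "model.pt", MODEL_PATH)
    return torch.jit.load(MODEL_PATH)
```

The same pattern in C++ would use the AWS SDK for C++ to fetch the object, then torch::jit::load on the downloaded file.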

I think the only way is to use a non-mobile static build of PyTorch, though I was unsuccessful with it (maybe something has changed and the process is better documented now).

Also @mmaybeno, you could research PyTorch's static build steps; if you find a working recipe, I will include it in torchlambda as a build flag and custom Docker images.

Thanks @szymonmaszke! You are right about the size limit as well. I think I was so focused on getting everything working I forgot about the most important limitation :). I will probably look at some alternatives to getting it working, and if I have time research the static build. Thank you again for all the work!

@mmaybeno you could make large models work with torchlambda, but that would require some custom C++ and splitting them. Say you divide your model into 4 sequential parts, each occupying a different AWS Lambda Layer (5 is AWS Lambda's limit), load them in your application, and pass the output of one part as the input to the next. So the actual limitation without S3 is models up to 200 MB (or even 215 MB).
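That splitting idea can be sketched for a plain nn.Sequential (a real Faster R-CNN has a more complex forward pass, so this is only illustrative; the helper name is mine):

```python
# Sketch: split a sequential model into N consecutive parts, one per
# AWS Lambda Layer, such that chaining the parts reproduces the original.
import torch
import torch.nn as nn


def split_sequential(model: nn.Sequential, n_parts: int):
    """Split an nn.Sequential into n_parts consecutive chunks."""
    layers = list(model.children())
    size, rem = divmod(len(layers), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        # Spread any remainder layers over the first chunks
        end = start + size + (1 if i < rem else 0)
        parts.append(nn.Sequential(*layers[start:end]))
        start = end
    return parts


model = nn.Sequential(
    nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4)
)
parts = split_sequential(model, 2)

# Chaining the parts gives the same result as the original forward pass
x = torch.randn(1, 8)
assert torch.allclose(model(x), parts[1](parts[0](x)))
```

Each part would then be serialized separately (e.g. via torch.jit.script(part).save(...)) so it fits within a single layer's size limit.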

That gives me some good starting points then. I'll see how far I get. Thanks!

To give an update, I've been able to compile the static library fairly consistently, but I've run into all sorts of linking errors. Did you have any of these issues in your initial development? They range from issues with c10 to other undefined references like onnxifi_load. It makes me think the build itself succeeds but the libraries aren't linked or set up correctly.

@mmaybeno yes, I had the same issues with the static non-mobile build and wasn't able to solve them, unfortunately.

With the recent announcement of container support for AWS Lambda, this is probably a non-issue now. I haven't had a moment to try it yet, but it seems easy enough.
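For reference, a minimal container-image sketch using the public AWS Lambda Python base image (the image tag, pip packages, and app.handler entry point are illustrative assumptions, not torchlambda's actual setup):

```dockerfile
# Illustrative only: a Lambda container image with full (non-mobile) PyTorch.
# Container images sidestep the 50 MB layer limit (images can be up to 10 GB).
FROM public.ecr.aws/lambda/python:3.8

# The full torch wheel ships non-mobile LibTorch, custom torchvision ops included
RUN pip install torch torchvision

# Handler code; LAMBDA_TASK_ROOT is set by the base image
COPY app.py ${LAMBDA_TASK_ROOT}
CMD ["app.handler"]
```

This would make the missing-op problem moot for models like fasterrcnn_resnet50_fpn, at the cost of a much larger deployment artifact and slower cold starts.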

I mainly stopped trying to fit the Faster R-CNN image recognition model on Lambda because even with JIT it is very slow. From what I understand, there is still some work to be done to fully convert a Faster R-CNN model to TorchScript.

Please reopen if needed.