szymonmaszke / torchlambda

Lightweight tool to deploy PyTorch models to AWS Lambda

Non mobile version of LibTorch

mmaybeno opened this issue · comments

Thanks for the awesome library. I was trying to use the vanilla fasterrcnn_resnet50_fpn but ran into some errors while using it. It turns out the mobile build of LibTorch is missing some custom ops, which prevents the model from running. Would it be beneficial to have Docker images that use the non-mobile build of LibTorch?

@mmaybeno Yes, I would love to have that feature. Unfortunately, AFAIK, a static build of PyTorch is pretty cumbersome (see this Stack Overflow answer of mine and this PyTorch issue specifically for more info).

If you are up for the challenge, I can give you some guidance on how things work here and help with this feature, but I don't think I will tackle it myself in the near future (unless their static build gets sorted out).

Also, did you see the torchlambda build --operations flag? Wouldn't that solve your issue?

Thanks for the info. I am currently trying to build with the --operations flag but am having some issues. I think when you install torchlambda via pip, it is missing the Dockerfile and related scripts. Working on it and will post back with the results :).

I was able to successfully build a custom torchlambda version after adding some necessary files from the repo to the pip-installed version. Not sure if this was intentional? The pip version was missing build.sh, CMakeLists.txt, and Dockerfile. Unfortunately, I still encountered the runtime error about the missing operation. I think the mobile version of LibTorch just does not include it. pytorch/vision#1407 (comment)

```
terminate called after throwing an instance of 'torch::jit::ErrorReport'
  what():
Unknown builtin op: torchvision::nms.
Could not find any similar ops to torchvision::nms. This op may not exist or may not be currently supported in TorchScript.
:
  File "/Users/maybeno/workspace/torch-lambda/.venv/lib/python3.8/site-packages/torchvision/ops/boxes.py", line 40
        by NMS, sorted in decreasing order of scores
    """
    return torch.ops.torchvision.nms(boxes, scores, iou_threshold)
           ~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
Serialized   File "code/__torch__/torchvision/ops/boxes.py", line 91
    scores: Tensor,
    iou_threshold: float) -> Tensor:
  _42 = ops.torchvision.nms(boxes, scores, iou_threshold)
        ~~~~~~~~~~~~~~~~~~~ <--- HERE
  return _42
```

@mmaybeno Fixed MANIFEST.in in the last commit; indeed, it didn't include the files necessary for a custom build. Not sure how I missed that, but thanks.

Still, torchvision.models.detection.fasterrcnn_resnet50_fpn is too large to be deployed (around 100 MB zipped), while AWS Lambda layers are limited to 50 MB. You would need a workaround (keeping the model in S3) and custom C++ deployment code anyway.
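A rough sketch of that S3 workaround in Python (the real torchlambda deployment code is C++, so this only illustrates the idea; the bucket and key names are placeholders):

```python
# Sketch: download the serialized model from S3 to /tmp (Lambda's only
# writable disk) on cold start, then load it. Bucket/key are placeholders.
import os

import torch

MODEL_PATH = "/tmp/model.pt"


def load_model():
    # Warm invocations reuse /tmp, so the download only happens on cold start
    if not os.path.exists(MODEL_PATH):
        import boto3  # available by default in the AWS Lambda Python runtime

        boto3.client("s3").download_file("my-bucket", "model.pt", MODEL_PATH)
    return torch.jit.load(MODEL_PATH)
```

The same pattern in C++ would use the AWS SDK for C++ to fetch the object, then torch::jit::load on the downloaded file.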

I think the only way is to use a non-mobile static build of PyTorch, though I was unsuccessful with it (maybe something has changed and the process is better documented now).

Also @mmaybeno, you could research PyTorch's static build steps; if you find a working recipe, I will include it in torchlambda as a build flag and custom Docker images.

Thanks @szymonmaszke! You are right about the size limit as well. I think I was so focused on getting everything working I forgot about the most important limitation :). I will probably look at some alternatives to getting it working, and if I have time research the static build. Thank you again for all the work!

@mmaybeno you could make large models work with torchlambda, but that would require some custom C++ and splitting them. Say you divide your model into 4 sequential parts, each occupying a different AWS Lambda Layer (5 is AWS Lambda's limit), load them in your application, and pass the output of one part as the input to the next. So the actual limitation without S3 is models up to 200 MB (or even 215 MB).
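That splitting idea can be sketched for a plain nn.Sequential (a real Faster R-CNN has a more complex forward pass, so this is only illustrative; the helper name is mine):

```python
# Sketch: split a sequential model into N consecutive parts, one per
# AWS Lambda Layer, such that chaining the parts reproduces the original.
import torch
import torch.nn as nn


def split_sequential(model: nn.Sequential, n_parts: int):
    """Split an nn.Sequential into n_parts consecutive chunks."""
    layers = list(model.children())
    size, rem = divmod(len(layers), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        # Spread any remainder layers over the first chunks
        end = start + size + (1 if i < rem else 0)
        parts.append(nn.Sequential(*layers[start:end]))
        start = end
    return parts


model = nn.Sequential(
    nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4)
)
parts = split_sequential(model, 2)

# Chaining the parts gives the same result as the original forward pass
x = torch.randn(1, 8)
assert torch.allclose(model(x), parts[1](parts[0](x)))
```

Each part would then be serialized separately (e.g. via torch.jit.script(part).save(...)) so it fits within a single layer's size limit.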

That gives me some good starting points then. I'll see how far I get. Thanks!

To give an update, I've been able to compile the static library fairly consistently, but I've run into all sorts of linking errors. Did you have any of these issues in your initial development? They range from issues with c10 to other undefined references like onnxifi_load. It makes me think the build itself succeeds but the libraries aren't linked or set up correctly.

@mmaybeno yes, I had the same issues with the static non-mobile build and wasn't able to solve them, unfortunately.

With the recent announcement of container support for AWS Lambda, this is probably a non-issue now. I haven't had a moment to try it yet, but it seems easy enough.
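For reference, a minimal container-image sketch using the public AWS Lambda Python base image (the image tag, pip packages, and app.handler entry point are illustrative assumptions, not torchlambda's actual setup):

```dockerfile
# Illustrative only: a Lambda container image with full (non-mobile) PyTorch.
# Container images sidestep the 50 MB layer limit (images can be up to 10 GB).
FROM public.ecr.aws/lambda/python:3.8

# The full torch wheel ships non-mobile LibTorch, custom torchvision ops included
RUN pip install torch torchvision

# Handler code; LAMBDA_TASK_ROOT is set by the base image
COPY app.py ${LAMBDA_TASK_ROOT}
CMD ["app.handler"]
```

This would make the missing-op problem moot for models like fasterrcnn_resnet50_fpn, at the cost of a much larger deployment artifact and slower cold starts.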

I mainly stopped trying to fit the Faster R-CNN image recognition model on Lambda because even with JIT it is very slow. From what I understand, there is still some work to be done to fully convert a Faster R-CNN model to TorchScript.

Please reopen if needed.