ELS-RD / transformer-deploy

Efficient, scalable and enterprise-grade CPU/GPU inference server for πŸ€— Hugging Face transformer models πŸš€

Home Page: https://els-rd.github.io/transformer-deploy/


GPT-J 6B model

timofeev1995 opened this issue Β· comments

Hello! Thank you for your framework!
I have a question about converting and serving very large (6B+) models with your framework.
I tried converting with the tips for large models (the --fast option, etc.), but I get a CUDA OOM even on an NVIDIA A100 40GB card.
Is this expected behaviour? Are there any tips for converting models of that size?
Thank you in advance.
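(Not from the thread, but a rough back-of-the-envelope estimate of why a 40 GB card can plausibly OOM here: the parameter count below is approximate, and the overhead factors vary by export path.)

```python
# Back-of-the-envelope memory estimate for exporting GPT-J 6B.
# The parameter count is approximate; exact overhead depends on the
# export path and runtime.
N_PARAMS = 6_050_000_000   # roughly 6B parameters
BYTES_FP32 = 4
BYTES_FP16 = 2

weights_fp32_gb = N_PARAMS * BYTES_FP32 / 1024**3
weights_fp16_gb = N_PARAMS * BYTES_FP16 / 1024**3

# torch.onnx.export traces the model, so on top of the weights you pay
# for activations and, depending on the path, additional copies of the
# graph -- which is why even a 40 GB A100 can OOM on an fp32 export.
print(f"fp32 weights alone: {weights_fp32_gb:.1f} GB")
print(f"fp16 weights alone: {weights_fp16_gb:.1f} GB")
```

So the fp32 weights alone eat more than half the card before tracing overhead is counted, which makes an OOM during export unsurprising.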

Sorry for the delay. Do you use ONNX Runtime or TensorRT?

When converting the model to ONNX, this happens:

β”‚ /home/james/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1 β”‚
β”‚ 182 in _slow_forward                                                         β”‚
β”‚                                                                              β”‚
β”‚   1179 β”‚   β”‚   β”‚   else:                                                     β”‚
β”‚   1180 β”‚   β”‚   β”‚   β”‚   recording_scopes = False                              β”‚
β”‚   1181 β”‚   β”‚   try:                                                          β”‚
β”‚ ❱ 1182 β”‚   β”‚   β”‚   result = self.forward(*input, **kwargs)                   β”‚
β”‚   1183 β”‚   β”‚   finally:                                                      β”‚
β”‚   1184 β”‚   β”‚   β”‚   if recording_scopes:                                      β”‚
β”‚   1185 β”‚   β”‚   β”‚   β”‚   tracing_state.pop_scope()                             β”‚
β”‚                                                                              β”‚
β”‚ /home/james/.local/lib/python3.10/site-packages/transformers/models/gptj/mod β”‚
β”‚ eling_gptj.py:589 in forward                                                 β”‚
β”‚                                                                              β”‚
β”‚    586 β”‚   β”‚   β”‚   past_length = 0                                           β”‚
β”‚    587 β”‚   β”‚   β”‚   past_key_values = tuple([None] * len(self.h))             β”‚
β”‚    588 β”‚   β”‚   else:                                                         β”‚
β”‚ ❱  589 β”‚   β”‚   β”‚   past_length = past_key_values[0][0].size(-2)              β”‚
β”‚    590 β”‚   β”‚                                                                 β”‚
β”‚    591 β”‚   β”‚   if position_ids is None:                                      β”‚
β”‚    592 β”‚   β”‚   β”‚   position_ids = torch.arange(past_length, input_shape[-1]  β”‚
╰──────────────────────────────────────────────────────────────────────────────╯
IndexError: Dimension specified as -2 but tensor has no dimensions
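(The trace above shows `past_key_values[0][0].size(-2)` failing, which means the first cache entry reaching GPT-J's forward is a 0-dimensional tensor rather than the expected `[batch, num_heads, past_seq_len, head_dim]` key tensor. A minimal sketch of the failure mode, with illustrative shapes:)

```python
import torch

# GPT-J reads the cached sequence length with past_key_values[0][0].size(-2),
# so the first cache entry must have at least two dimensions.

scalar = torch.tensor(0.0)  # 0-dim tensor, e.g. a malformed dummy input
try:
    scalar.size(-2)
except IndexError as e:
    print(e)  # same IndexError as in the traceback above

# A cache entry with the expected layout works. Shapes are illustrative:
# GPT-J 6B has 16 heads with head_dim 256.
good = torch.zeros(1, 16, 128, 256)  # [batch, heads, past_len, head_dim]
print(good.size(-2))  # 128 -> the past_length GPT-J reads
```

This suggests the dummy `past_key_values` built for tracing did not match the nested tuple-of-4D-tensors structure the model expects.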

When converting the model to TensorRT, I get the same error.

I tried --seq-len 1 128 128, 1 128 2047, 1 2048 2048, and 1 2047 2047 with both ONNX and TensorRT; always the same error. I tested on an A100 and on a CPU machine with 128 GB RAM.