huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.

nvidia/Llama3-ChatQA-1.5-70B failing to start

mariokostelac opened this issue

System Info

I used the code suggested on https://huggingface.co/nvidia/Llama3-ChatQA-1.5-70B to run inference on AWS Inferentia chips.


Specifically:

import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

# Hub Model configuration. https://huggingface.co/models
hub = {
    "HF_MODEL_ID": "nvidia/Llama3-ChatQA-1.5-70B",
    "HF_NUM_CORES": "24",
    "HF_BATCH_SIZE": "4",
    "HF_SEQUENCE_LENGTH": "4096",
    "HF_AUTO_CAST_TYPE": "bf16",  
    "MAX_BATCH_SIZE": "4",
    "MAX_INPUT_LENGTH": "3686",
    "MAX_TOTAL_TOKENS": "4096",
}


# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface-neuronx", version="0.0.21"),
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.48xlarge",
    container_startup_health_check_timeout=3600,
    volume_size=512,
)

# send request
predictor.predict(
    {
        "inputs": "What is is the capital of France?",
        "parameters": {
            "do_sample": True,
            "max_new_tokens": 128,
            "temperature": 0.7,
            "top_k": 50,
            "top_p": 0.95,
        }
    }
)

I saw many warnings like:

2024-05-16T10:25:30.983525Z  WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.15.2/src/tokenizer/serialization.rs:159: Warning: Token '<|end_of_text|>' was expected to have ID '128001' but was given ID 'None'

It failed to start with the following error:

 File "/usr/local/lib/python3.10/dist-packages/text_generation_server/interceptor.py", line 20, in intercept
    return await response
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 39, in Warmup
    max_tokens = self.generator.warmup(request.batch)
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/generator.py", line 336, in warmup
    self.prefill(batch)
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/generator.py", line 404, in prefill
    selector = TokenSelector.create(
  File "/usr/local/lib/python3.10/dist-packages/optimum/neuron/generation/token_selector.py", line 136, in create
    assert eos_token_id is not None and not isinstance(eos_token_id, list)

Is there some preparation needed to run the model on Inferentia with this library?
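
For reference, the failing assertion in token_selector.py checks the model's eos_token_id. A minimal way to inspect what it sees (a sketch, assuming the checkpoint exposes a standard generation_config.json; the exact value for this model is not verified here):

from transformers import GenerationConfig

gen_config = GenerationConfig.from_pretrained("nvidia/Llama3-ChatQA-1.5-70B")
# Llama 3 checkpoints typically declare eos_token_id as a list
# (e.g. [128001, 128009]); the assert rejects both None and list
# values, either of which would match the failure above.
print(gen_config.eos_token_id)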

Who can help?

@JingyaHuang @dacorvo

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

Available above.

Expected behavior

The endpoint should start.

I can confirm that meta-llama/Meta-Llama-3-70B-Instruct fails the same way.

This issue is fixed with version 0.0.22.

@dacorvo trying it out with 0.0.22 🙇

The corresponding pull request is #580. The sagemaker Python package might not have been updated yet to support 0.0.22 (the update was due later today).

Update:
It is actually available (great!). FYI, the image_uri should be something like:
763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.2-optimum0.0.22-neuronx-py310-ubuntu22.04
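
If the sagemaker SDK in a given environment still resolves an older image, a workaround (a sketch reusing the us-east-1 URI quoted above; other regions publish under a different registry account) is to pass the URI to HuggingFaceModel directly instead of resolving it with get_huggingface_llm_image_uri:

huggingface_model = HuggingFaceModel(
    # explicit image pin; replace with the URI for your region
    image_uri=(
        "763104351884.dkr.ecr.us-east-1.amazonaws.com/"
        "huggingface-pytorch-tgi-inference:2.1.2-optimum0.0.22-neuronx-py310-ubuntu22.04"
    ),
    env=hub,
    role=role,
)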

Yes, I figured it's available, but it's still creating the endpoint 😁.

Thanks a lot @dacorvo, I can confirm that it worked for me by just changing the version to 0.0.22 in the snippet above. Do you know who'd be responsible for fixing that in the HF UI?
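
For anyone landing here later, the confirmed fix is a one-line change in the snippet above:

image_uri=get_huggingface_llm_image_uri("huggingface-neuronx", version="0.0.22"),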

@mariokostelac thank you for the feedback. I'll take care of it. We were actually waiting for the sagemaker update, and I had not realized it was ready.

The update was done this morning, but it has not been refreshed yet. It should be fixed soon.

Thanks a lot for the quick support on this issue. I'm now running the original model (the nvidia one) to verify that it works there too. Given that the tokenizer configs are the same, I'd be very surprised if it didn't.

Feel free to report any issues you get: feedback on such new features/models is very valuable.