huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.

nvidia/Llama3-ChatQA-1.5-70B failing to start

mariokostelac opened this issue

System Info

I used the code suggested on https://huggingface.co/nvidia/Llama3-ChatQA-1.5-70B to run inference on AWS Inferentia chips.


Specifically:

import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

# Hub Model configuration. https://huggingface.co/models
hub = {
    "HF_MODEL_ID": "nvidia/Llama3-ChatQA-1.5-70B",
    "HF_NUM_CORES": "24",
    "HF_BATCH_SIZE": "4",
    "HF_SEQUENCE_LENGTH": "4096",
    "HF_AUTO_CAST_TYPE": "bf16",  
    "MAX_BATCH_SIZE": "4",
    "MAX_INPUT_LENGTH": "3686",
    "MAX_TOTAL_TOKENS": "4096",
}


# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface-neuronx", version="0.0.21"),
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.48xlarge",
    container_startup_health_check_timeout=3600,
    volume_size=512,
)

# send request
predictor.predict(
    {
        "inputs": "What is is the capital of France?",
        "parameters": {
            "do_sample": True,
            "max_new_tokens": 128,
            "temperature": 0.7,
            "top_k": 50,
            "top_p": 0.95,
        }
    }
)

I saw many warnings like:

2024-05-16T10:25:30.983525Z  WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.15.2/src/tokenizer/serialization.rs:159: Warning: Token '<|end_of_text|>' was expected to have ID '128001' but was given ID 'None'

It failed to start with the following error:

 File "/usr/local/lib/python3.10/dist-packages/text_generation_server/interceptor.py", line 20, in intercept
    return await response
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 39, in Warmup
    max_tokens = self.generator.warmup(request.batch)
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/generator.py", line 336, in warmup
    self.prefill(batch)
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/generator.py", line 404, in prefill
    selector = TokenSelector.create(
  File "/usr/local/lib/python3.10/dist-packages/optimum/neuron/generation/token_selector.py", line 136, in create
    assert eos_token_id is not None and not isinstance(eos_token_id, list)

Is there some preparation needed to run the model on Inferentia with this library?
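
For reference, the failing assertion in token_selector.py checks the model's eos_token_id. A minimal way to inspect what it sees (a sketch, assuming the checkpoint exposes a standard generation_config.json; the exact value for this model is not verified here):

from transformers import GenerationConfig

gen_config = GenerationConfig.from_pretrained("nvidia/Llama3-ChatQA-1.5-70B")
# Llama 3 checkpoints typically declare eos_token_id as a list
# (e.g. [128001, 128009]); the assert rejects both None and list
# values, either of which would match the failure above.
print(gen_config.eos_token_id)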

Who can help?

@JingyaHuang @dacorvo

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

Available above.

Expected behavior

The endpoint should start.

I can confirm that meta-llama/Meta-Llama-3-70B-Instruct fails the same way.

This issue is fixed with version 0.0.22.

@dacorvo trying it out with 0.0.22 🙇

The corresponding pull request is #580. The sagemaker Python package might not have been updated yet to support 0.0.22 (the update was due later today).

Update:
It is actually available (great!). FYI, the image_uri should be something like:
763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.2-optimum0.0.22-neuronx-py310-ubuntu22.04
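
If the sagemaker SDK in a given environment still resolves an older image, a workaround (a sketch reusing the us-east-1 URI quoted above; other regions publish under a different registry account) is to pass the URI to HuggingFaceModel directly instead of resolving it with get_huggingface_llm_image_uri:

huggingface_model = HuggingFaceModel(
    # explicit image pin; replace with the URI for your region
    image_uri=(
        "763104351884.dkr.ecr.us-east-1.amazonaws.com/"
        "huggingface-pytorch-tgi-inference:2.1.2-optimum0.0.22-neuronx-py310-ubuntu22.04"
    ),
    env=hub,
    role=role,
)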

Yes, I figured it's available, but it's still creating the endpoint 😁.

Thanks a lot @dacorvo, I can confirm that it worked for me by just changing the version to 0.0.22 in the snippet above. Do you know who'd be responsible for fixing that in the HF UI?
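
For anyone landing here later, the confirmed fix is a one-line change in the snippet above:

image_uri=get_huggingface_llm_image_uri("huggingface-neuronx", version="0.0.22"),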

@mariokostelac thank you for the feedback. I'll take care of it. We were actually waiting for the sagemaker update, and I had not realized it was ready.

The update was done this morning, but it has not been refreshed yet. It should be fixed soon.

Thanks a lot for the quick support on this issue. I'm now running the original model (the nvidia one) to verify that it works there too. Given that the tokenizer configs are the same, I'd be very surprised if it didn't.

Feel free to report any issues you get: feedback on such new features/models is very valuable.