failed to load 'stable_diffusion' version 1:
whatsondoc opened this issue
Hello,
I'm following the instructions to deploy this project, and I'm observing that Triton is unable to load the stable_diffusion model. This is seen in the Triton Server logs printed to stdout:
```
1028 08:21:03.012132 581 pb_stub.cc:309] Failed to initialize Python stub: AttributeError: 'LMSDiscreteScheduler' object has no attribute 'set_format'
At:
  /models/stable_diffusion/1/model.py(58): initialize
I1028 08:21:03.465850 1 onnxruntime.cc:2606] TRITONBACKEND_ModelInstanceInitialize: encoder (GPU device 1)
E1028 08:21:03.470367 1 model_lifecycle.cc:596] failed to load 'stable_diffusion' version 1: Internal: AttributeError: 'LMSDiscreteScheduler' object has no attribute 'set_format'
At:
  /models/stable_diffusion/1/model.py(58): initialize
```
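A quick way to confirm that this is a diffusers version mismatch is to check, from inside the container, whether the scheduler class still exposes the method (a minimal diagnostic sketch; the constructor arguments are the ones from model.py):

```python
# Diagnostic sketch: print the installed diffusers version and check whether
# LMSDiscreteScheduler still has set_format (absent in the versions that fail here).
import diffusers
from diffusers import LMSDiscreteScheduler

print(diffusers.__version__)
scheduler = LMSDiscreteScheduler(
    beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear"
)
print(hasattr(scheduler, "set_format"))
```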
The specific function referenced in model.py is here (line 58 is marked below):
```python
def initialize(self, args: Dict[str, str]) -> None:
    """
    Initialize the tokenization process
    :param args: arguments from Triton config file
    """
    current_name: str = str(Path(args["model_repository"]).parent.absolute())
    self.device = "cpu" if args["model_instance_kind"] == "CPU" else "cuda"
    self.tokenizer = CLIPTokenizer.from_pretrained(current_name + "/stable_diffusion/1/")
    self.scheduler = LMSDiscreteScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear")
    self.scheduler = self.scheduler.set_format("pt")  # <-- line 58
    self.height = 512
    self.width = 512
    self.num_inference_steps = 50
    self.guidance_scale = 7.5
    self.eta = 0.0
```
I tried commenting this line out so that self.scheduler is only defined on the previous line; Triton Server then starts, all models (including stable_diffusion) load successfully, and they are reported by Triton as online and ready.
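For reference, this is what that change amounts to (a minimal sketch of the affected lines; if set_format is gone from the installed diffusers, the scheduler presumably already operates on PyTorch tensors by default, which is what set_format("pt") used to arrange):

```python
# Scheduler construction without the removed set_format("pt") call;
# newer diffusers schedulers return PyTorch tensors by default.
self.scheduler = LMSDiscreteScheduler(
    beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear"
)
```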
Leaving this change in place, an error is (somewhat expectedly) raised when subsequently working through the Jupyter notebook:

```
InferenceServerException: Failed to process the request(s) for model instance 'stable_diffusion', message: Stub process is not healthy.
```
So I'm forced back to the original issue: have you seen this before, or any ideas on a fix?
Same issue on my side, testing on a V100 with `pip install --upgrade diffusers` (0.7.2):

```
failed to load 'stable_diffusion' version 1: Internal: AttributeError: 'LMSDiscreteScheduler' object has no attribute 'set_format'
```
Which version of the Diffusers library are you using for this demo?
EDIT: with a previous version of diffusers (0.3.0), the model loads without error.
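For anyone else hitting this in the meantime, pinning the older release is just:

```bash
pip install diffusers==0.3.0
```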
But then there is still the other error:

```
E1107 15:43:31.899870 116 python_be.cc:1818] Stub process is unhealthy and it will be restarted.
```

Where do you think this comes from? Thanks for your help.
Try now
Unfortunately it's still not working when launching an inference. The server load is, however, fixed with this new diffusers version.

```
E1108 12:53:16.125181 95 python_be.cc:1818] Stub process is unhealthy and it will be restarted.
ftfy or spacy is not installed using BERT BasicTokenizer instead of ftfy.
```
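As a side note, the second line looks like the usual transformers CLIPTokenizer warning rather than a Triton error; if so, it is harmless and can be silenced by installing the optional dependency:

```bash
pip install ftfy
```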
What hardware are you testing on?
Thanks for the support.
I'm seeing the same, I'm afraid. The Triton container build and server launch run smoothly; however, when whizzing through the inference notebook I get to stage #7, which produces:

```
InferenceServerException: Failed to process the request(s) for model stable_diffusion, message: stub process is not healthy
```

(This is also reported in the Triton Server log.) For reference, I'm trying this on a DGX-2 with 16 x V100-SXM3-32GB.
Hardware tested: 1080 Ti.
@whatsondoc @lolagiscard
Could you please share screenshots/logs?
Try running the Docker container with the command below:

```bash
docker run -it --rm --gpus device=0 -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 16384m \
  -v $PWD/stable-diffusion-v1-4-onnx/models:/models tritonserver \
  tritonserver --model-repository /models/
```
I'm already testing on 1 GPU only, if this is the change you want us to try.
There might be something wrong in the NVIDIA Triton Docker image itself; it might not work with some GPU architectures.
Okay. I will test it on a V100 and let you know.
@lolagiscard running on a V100?
Great, thanks, let us know.
Yes, I'm also on a V100; sorry, I should have said that above :)
I was able to reproduce the issue with torch version 1.13.
I have pinned torch to 1.12.1 in the Docker image, which fixed the issue: 666e148
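(I haven't inspected 666e148 itself, but the pin presumably amounts to something like the following in the image's Python environment:)

```bash
pip install torch==1.12.1
```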
All good now, indeed!
Thanks a lot :)
@whozwhat
Multi-GPU is not fixed yet. Running the Docker container with the command below should work around the issue for now:

```bash
docker run -it --rm --gpus device=0 -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 16384m \
  -v $PWD/models:/models sd_trt bash
```
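For what it's worth, `--gpus device=0` is the part that restricts the container to a single GPU, which is what sidesteps the multi-GPU problem; running on a different card should just be a matter of changing the index:

```bash
# Same workaround, but pinning the container to GPU 1 instead of GPU 0.
docker run -it --rm --gpus device=1 -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 16384m \
  -v $PWD/models:/models sd_trt bash
```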
Thanks for the reply, this command works.
It would be awesome if the multi-GPU issue could be fixed.