predibase / lorax

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

Home Page: https://loraexchange.ai


mistralai/Mistral-7B-Instruct-v0.2 error with total tokens > 8192 and setting --compile

magdyksaleh opened this issue · comments

System Info

ghcr.io/predibase/lorax:8ff0bf5

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Run the v0.2 Mistral model with --compile and try to query it with max new tokens above 8192.
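
For reference, a minimal sketch of such a query using the Python client shown later in this thread (the server address, prompt, and token count here are placeholders, not the exact values from the report):

from lorax import Client

# Assumes a LoRAX server for mistralai/Mistral-7B-Instruct-v0.2 launched with --compile,
# listening on localhost:8080 (placeholder address).
client = Client("http://127.0.0.1:8080")

prompt = "..."  # any prompt; prompt tokens + max_new_tokens must exceed 8192
print(client.generate(prompt, max_new_tokens=9000).generated_text)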

Expected behavior

The request should complete successfully instead of erroring.

I am running the Mixtral model across 4 shards and get a transport error when the prompt size is 1731 tokens.

sudo docker run --gpus='"device=4,5,6,7"' --shm-size 1g -p 8080:80 -v $PWD/data:/data \
    ghcr.io/predibase/lorax:latest --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --num-shard 4 --sharded true \
    --max-input-length 4095 \
    --max-total-tokens 4096 \
    --max-batch-prefill-tokens 65536 \
    --waiting-served-ratio 1.2 \
    --max-waiting-tokens 20 \
    --max-stop-sequences 10 \
    --cuda-memory-fraction 0.99

Client

from lorax import Client
client = Client("http://127.0.0.1:8080")
prompt="""some string with 1731 tokens"""
print(client.generate(prompt, max_new_tokens=20, stop_sequences=["\n\n"]).generated_text)

Errors

  File "/home/hayley/lorax/.venv/lib/python3.8/site-packages/lorax/client.py", line 192, in generate
    raise parse_error(resp.status_code, payload)
lorax.errors.GenerationError: Request failed during generation: Server error: transport error


2024-03-27T23:59:34.118606Z ERROR lorax_launcher: interceptor.py:41 Method Prefill encountered an error.
Traceback (most recent call last):
  File "/opt/conda/bin/lorax-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 89, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 324, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/lorax_server/interceptor.py", line 38, in intercept
    return await response
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 96, in Prefill
    generations, next_batch = self.model.generate_token(batch)
  File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 927, in generate_token
    raise e
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 924, in generate_token
    out = self.forward(batch, adapter_data)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_mixtral.py", line 430, in forward
    logits = model.forward(
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_mixtral_modeling.py", line 979, in forward
    hidden_states = self.model(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_mixtral_modeling.py", line 922, in forward
    hidden_states, residual = layer(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_mixtral_modeling.py", line 868, in forward
    moe_output = self.moe(normed_attn_res_output)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_mixtral_modeling.py", line 718, in forward
    return self.sparse_forward(x)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_mixtral_modeling.py", line 616, in sparse_forward
    x = ops.padded_gather(x, indices, bin_ids, bins, padded_bins,
  File "/opt/conda/lib/python3.10/site-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/opt/conda/lib/python3.10/site-packages/stk/backend/autocast.py", line 28, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/megablocks/ops/padded_gather.py", line 14, in forward
    return kernels.padded_gather(
  File "/opt/conda/lib/python3.10/site-packages/megablocks/backend/kernels.py", line 118, in padded_gather
    output_rows = padded_bins[-1].cpu().item()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


2024-03-27T23:59:34.119496Z ERROR batch{batch_size=1}:prefill:prefill{id=0 size=1}:prefill{id=0 size=1}: lorax_client: router/client/src/lib.rs:34: Server error: Unexpected <class 'RuntimeError'>: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

2024-03-27T23:59:34.629248Z ERROR batch{batch_size=1}:prefill:prefill{id=0 size=1}:prefill{id=0 size=1}: lorax_client: router/client/src/lib.rs:34: Server error: transport error
2024-03-27T23:59:34.939148Z ERROR shard-manager: lorax_launcher: Shard complete standard error output:

[W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
Warmup to max_total_tokens: 100%|██████████| 1/1 [00:11<00:00, 11.96s/it]
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc433581d87 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fc43353275f in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fc433cfe8a8 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1d40e (0x7fc433cc940e in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x1f744 (0x7fc433ccb744 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x1fb6d (0x7fc433ccbb6d in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x540210 (0x7fc4324f7210 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x649bf (0x7fc4335669bf in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7fc43355fc8b in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7fc43355fe39 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x802b98 (0x7fc4327b9b98 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2f6 (0x7fc4327b9f16 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0x13d5a7 (0x55923e51c5a7 in /opt/conda/bin/python3.10)
frame #13: <unknown function> + 0x14db76 (0x55923e52cb76 in /opt/conda/bin/python3.10)
frame #14: <unknown function> + 0x14dbd3 (0x55923e52cbd3 in /opt/conda/bin/python3.10)
frame #15: <unknown function> + 0x14dbd3 (0x55923e52cbd3 in /opt/conda/bin/python3.10)
frame #16: <unknown function> + 0x14dbd3 (0x55923e52cbd3 in /opt/conda/bin/python3.10)
frame #17: <unknown function> + 0x14dbd3 (0x55923e52cbd3 in /opt/conda/bin/python3.10)
frame #18: <unknown function> + 0x14dbd3 (0x55923e52cbd3 in /opt/conda/bin/python3.10)
frame #19: <unknown function> + 0x14dbd3 (0x55923e52cbd3 in /opt/conda/bin/python3.10)
frame #20: <unknown function> + 0x14dbd3 (0x55923e52cbd3 in /opt/conda/bin/python3.10)
frame #21: <unknown function> + 0x14dbd3 (0x55923e52cbd3 in /opt/conda/bin/python3.10)
frame #22: <unknown function> + 0x14dbd3 (0x55923e52cbd3 in /opt/conda/bin/python3.10)
frame #23: <unknown function> + 0x14dbd3 (0x55923e52cbd3 in /opt/conda/bin/python3.10)
frame #24: <unknown function> + 0x14dbd3 (0x55923e52cbd3 in /opt/conda/bin/python3.10)
frame #25: <unknown function> + 0x14dbd3 (0x55923e52cbd3 in /opt/conda/bin/python3.10)
frame #26: <unknown function> + 0x14dbd3 (0x55923e52cbd3 in /opt/conda/bin/python3.10)
frame #27: <unknown function> + 0x14dbd3 (0x55923e52cbd3 in /opt/conda/bin/python3.10)
frame #28: <unknown function> + 0x15262b (0x55923e53162b in /opt/conda/bin/python3.10)
frame #29: <unknown function> + 0x1525e7 (0x55923e5315e7 in /opt/conda/bin/python3.10)
frame #30: <unknown function> + 0x563095 (0x7fc21caff095 in /opt/conda/lib/python3.10/site-packages/grpc/_cython/cygrpc.cpython-310-x86_64-linux-gnu.so)
frame #31: <unknown function> + 0x56b815 (0x7fc21cb07815 in /opt/conda/lib/python3.10/site-packages/grpc/_cython/cygrpc.cpython-310-x86_64-linux-gnu.so)
frame #32: <unknown function> + 0x60ae0f (0x7fc21cba6e0f in /opt/conda/lib/python3.10/site-packages/grpc/_cython/cygrpc.cpython-310-x86_64-linux-gnu.so)
frame #33: <unknown function> + 0x56a20b (0x7fc21cb0620b in /opt/conda/lib/python3.10/site-packages/grpc/_cython/cygrpc.cpython-310-x86_64-linux-gnu.so)
frame #34: <unknown function> + 0x5cbc29 (0x7fc21cb67c29 in /opt/conda/lib/python3.10/site-packages/grpc/_cython/cygrpc.cpython-310-x86_64-linux-gnu.so)
frame #35: <unknown function> + 0x14f3bd (0x55923e52e3bd in /opt/conda/bin/python3.10)
frame #36: PyObject_VectorcallMethod + 0x85 (0x55923e53dc85 in /opt/conda/bin/python3.10)
frame #37: <unknown function> + 0xae1eb (0x55923e48d1eb in /opt/conda/bin/python3.10)
frame #38: <unknown function> + 0x7bf6 (0x7fc434080bf6 in /opt/conda/lib/python3.10/lib-dynload/_asyncio.cpython-310-x86_64-linux-gnu.so)
frame #39: <unknown function> + 0x143d2a (0x55923e522d2a in /opt/conda/bin/python3.10)
frame #40: <unknown function> + 0x25f22c (0x55923e63e22c in /opt/conda/bin/python3.10)
frame #41: <unknown function> + 0xfda7b (0x55923e4dca7b in /opt/conda/bin/python3.10)
frame #42: <unknown function> + 0x13c1b3 (0x55923e51b1b3 in /opt/conda/bin/python3.10)
frame #43: _PyEval_EvalFrameDefault + 0x5d5d (0x55923e51916d in /opt/conda/bin/python3.10)
frame #44: _PyFunction_Vectorcall + 0x6c (0x55923e5238cc in /opt/conda/bin/python3.10)
frame #45: _PyEval_EvalFrameDefault + 0x72c (0x55923e513b3c in /opt/conda/bin/python3.10)
frame #46: _PyFunction_Vectorcall + 0x6c (0x55923e5238cc in /opt/conda/bin/python3.10)
frame #47: _PyEval_EvalFrameDefault + 0x72c (0x55923e513b3c in /opt/conda/bin/python3.10)
frame #48: _PyFunction_Vectorcall + 0x6c (0x55923e5238cc in /opt/conda/bin/python3.10)
frame #49: _PyEval_EvalFrameDefault + 0x72c (0x55923e513b3c in /opt/conda/bin/python3.10)
frame #50: _PyFunction_Vectorcall + 0x6c (0x55923e5238cc in /opt/conda/bin/python3.10)
frame #51: _PyEval_EvalFrameDefault + 0x72c (0x55923e513b3c in /opt/conda/bin/python3.10)
frame #52: _PyFunction_Vectorcall + 0x6c (0x55923e5238cc in /opt/conda/bin/python3.10)
frame #53: _PyEval_EvalFrameDefault + 0x4c12 (0x55923e518022 in /opt/conda/bin/python3.10)
frame #54: _PyFunction_Vectorcall + 0x6c (0x55923e5238cc in /opt/conda/bin/python3.10)
frame #55: _PyEval_EvalFrameDefault + 0x4c12 (0x55923e518022 in /opt/conda/bin/python3.10)
frame #56: _PyFunction_Vectorcall + 0x6c (0x55923e5238cc in /opt/conda/bin/python3.10)
frame #57: PyObject_Call + 0xbc (0x55923e52fd9c in /opt/conda/bin/python3.10)
frame #58: _PyEval_EvalFrameDefault + 0x2d84 (0x55923e516194 in /opt/conda/bin/python3.10)
frame #59: _PyFunction_Vectorcall + 0x6c (0x55923e5238cc in /opt/conda/bin/python3.10)
frame #60: PyObject_Call + 0xbc (0x55923e52fd9c in /opt/conda/bin/python3.10)
frame #61: _PyEval_EvalFrameDefault + 0x2d84 (0x55923e516194 in /opt/conda/bin/python3.10)
frame #62: <unknown function> + 0x150402 (0x55923e52f402 in /opt/conda/bin/python3.10)
frame #63: PyObject_Call + 0xbc (0x55923e52fd9c in /opt/conda/bin/python3.10)
 rank=3
2024-03-27T23:59:34.939198Z ERROR shard-manager: lorax_launcher: Shard process was signaled to shutdown with signal 6 rank=3
2024-03-27T23:59:35.003043Z ERROR lorax_launcher: Shard 3 crashed
2024-03-27T23:59:35.003070Z  INFO lorax_launcher: Terminating webserver
2024-03-27T23:59:35.003088Z  INFO lorax_launcher: Waiting for webserver to gracefully shutdown
2024-03-27T23:59:35.003172Z  INFO lorax_router::server: router/src/server.rs:1187: signal received, starting graceful shutdown
2024-03-27T23:59:35.100045Z ERROR shard-manager: lorax_launcher: Shard complete standard error output:

[W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
Warmup to max_total_tokens: 100%|██████████| 1/1 [00:11<00:00, 11.94s/it]
[rank2]:[E ProcessGroupNCCL.cpp:1182] [Rank 2] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc387781d87 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fc38773275f in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fc387fb28a8 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x6c (0x7fc33d7fe3ac in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7fc33d8024c8 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x15a (0x7fc33d805bfa in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fc33d806839 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd3e95 (0x7fc387af0e95 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7fc389319ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: clone + 0x44 (0x7fc3893aaa04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [Rank 2] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc387781d87 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fc38773275f in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fc387fb28a8 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x6c (0x7fc33d7fe3ac in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7fc33d8024c8 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x15a (0x7fc33d805bfa in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fc33d806839 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd3e95 (0x7fc387af0e95 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7fc389319ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: clone + 0x44 (0x7fc3893aaa04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc387781d87 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xdf6b11 (0x7fc33d55cb11 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd3e95 (0x7fc387af0e95 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #3: <unknown function> + 0x94ac3 (0x7fc389319ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #4: clone + 0x44 (0x7fc3893aaa04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
 rank=2
2024-03-27T23:59:35.100095Z ERROR shard-manager: lorax_launcher: Shard process was signaled to shutdown with signal 6 rank=2
2024-03-27T23:59:35.322172Z ERROR batch{batch_size=1}:prefill:prefill{id=0 size=1}:prefill{id=0 size=1}: lorax_client: router/client/src/lib.rs:34: Server error: transport error
2024-03-27T23:59:35.343506Z ERROR batch{batch_size=1}:prefill:prefill{id=0 size=1}:prefill{id=0 size=1}: lorax_client: router/client/src/lib.rs:34: Server error: transport error
2024-03-27T23:59:35.343683Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(0)}:clear_cache{batch_id=Some(0)}: lorax_client: router/client/src/lib.rs:34: Server error: error trying to connect: Connection refused (os error 111)
2024-03-27T23:59:35.343698Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(0)}:clear_cache{batch_id=Some(0)}: lorax_client: router/client/src/lib.rs:34: Server error: error trying to connect: Connection refused (os error 111)
2024-03-27T23:59:35.343707Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(0)}:clear_cache{batch_id=Some(0)}: lorax_client: router/client/src/lib.rs:34: Server error: error trying to connect: Connection refused (os error 111)
2024-03-27T23:59:35.343720Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(0)}:clear_cache{batch_id=Some(0)}: lorax_client: router/client/src/lib.rs:34: Server error: error trying to connect: Connection refused (os error 111)

@magdyksaleh I need more details to repro the issue. I ran a test with the following params:

--max-input-length 32767 --max-total-tokens 32768 --max-batch-prefill-tokens 60000

And had no issues prompting the model with various values of max_new_tokens (including leaving it empty). This was with 1x A100 (40GB).
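
Something along these lines, for anyone who wants to re-run the check (a sketch only; the prompt and the specific max_new_tokens values are placeholders, not the actual test script):

from lorax import Client

client = Client("http://127.0.0.1:8080")
prompt = "..."  # placeholder prompt

# Sweep a few max_new_tokens values, including omitting the argument entirely
# so the client default applies.
for max_new in (None, 512, 4096, 16384):
    kwargs = {} if max_new is None else {"max_new_tokens": max_new}
    response = client.generate(prompt, **kwargs)
    print(max_new, len(response.generated_text))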

What was the exact error message you were seeing?

Okay, it seems the issue here is specific to the use of --compile with long contexts. Will take a look.

Hey @hayleyhu, the transport error is interesting, I have not seen that one before. Would you mind creating a separate issue to track that?