predibase / lorax

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

Home Page: https://loraexchange.ai


mistralai/Mistral-7B-Instruct-v0.2 error with total tokens > 8192 and setting --compile

magdyksaleh opened this issue · comments

System Info

ghcr.io/predibase/lorax:8ff0bf5

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Run the v0.2 Mistral model with --compile and try to query it with max new tokens above 8192.
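
For reference, a minimal sketch of such a query using the Python client shown later in this thread (the server address, prompt, and token count here are placeholders, not the exact values from the report):

from lorax import Client

# Assumes a LoRAX server for mistralai/Mistral-7B-Instruct-v0.2 launched with --compile,
# listening on localhost:8080 (placeholder address).
client = Client("http://127.0.0.1:8080")

prompt = "..."  # any prompt; prompt tokens + max_new_tokens must exceed 8192
print(client.generate(prompt, max_new_tokens=9000).generated_text)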

Expected behavior

The request should complete successfully instead of erroring.

I am running the Mixtral model across 4 shards and get a transport error when the prompt size is 1731 tokens.

sudo docker run --gpus='"device=4,5,6,7"' --shm-size 1g -p 8080:80 -v $PWD/data:/data \
    ghcr.io/predibase/lorax:latest --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --num-shard 4 --sharded true \
    --max-input-length 4095 \
    --max-total-tokens 4096 \
    --max-batch-prefill-tokens 65536 \
    --waiting-served-ratio 1.2 \
    --max-waiting-tokens 20 \
    --max-stop-sequences 10 \
    --cuda-memory-fraction 0.99

Client

from lorax import Client
client = Client("http://127.0.0.1:8080")
prompt="""some string with 1731 tokens"""
print(client.generate(prompt, max_new_tokens=20, stop_sequences=["\n\n"]).generated_text)

Errors

  File "/home/hayley/lorax/.venv/lib/python3.8/site-packages/lorax/client.py", line 192, in generate
    raise parse_error(resp.status_code, payload)
lorax.errors.GenerationError: Request failed during generation: Server error: transport error


2024-03-27T23:59:34.118606Z ERROR lorax_launcher: interceptor.py:41 Method Prefill encountered an error.
Traceback (most recent call last):
  File "/opt/conda/bin/lorax-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 89, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 324, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/lorax_server/interceptor.py", line 38, in intercept
    return await response
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 96, in Prefill
    generations, next_batch = self.model.generate_token(batch)
  File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 927, in generate_token
    raise e
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 924, in generate_token
    out = self.forward(batch, adapter_data)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_mixtral.py", line 430, in forward
    logits = model.forward(
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_mixtral_modeling.py", line 979, in forward
    hidden_states = self.model(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_mixtral_modeling.py", line 922, in forward
    hidden_states, residual = layer(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_mixtral_modeling.py", line 868, in forward
    moe_output = self.moe(normed_attn_res_output)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_mixtral_modeling.py", line 718, in forward
    return self.sparse_forward(x)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_mixtral_modeling.py", line 616, in sparse_forward
    x = ops.padded_gather(x, indices, bin_ids, bins, padded_bins,
  File "/opt/conda/lib/python3.10/site-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/opt/conda/lib/python3.10/site-packages/stk/backend/autocast.py", line 28, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/megablocks/ops/padded_gather.py", line 14, in forward
    return kernels.padded_gather(
  File "/opt/conda/lib/python3.10/site-packages/megablocks/backend/kernels.py", line 118, in padded_gather
    output_rows = padded_bins[-1].cpu().item()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


2024-03-27T23:59:34.119496Z ERROR batch{batch_size=1}:prefill:prefill{id=0 size=1}:prefill{id=0 size=1}: lorax_client: router/client/src/lib.rs:34: Server error: Unexpected <class 'RuntimeError'>: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

2024-03-27T23:59:34.629248Z ERROR batch{batch_size=1}:prefill:prefill{id=0 size=1}:prefill{id=0 size=1}: lorax_client: router/client/src/lib.rs:34: Server error: transport error
2024-03-27T23:59:34.939148Z ERROR shard-manager: lorax_launcher: Shard complete standard error output:

[W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
Warmup to max_total_tokens: 100%|██████████| 1/1 [00:11<00:00, 11.96s/it]
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc433581d87 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fc43353275f in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fc433cfe8a8 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1d40e (0x7fc433cc940e in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x1f744 (0x7fc433ccb744 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x1fb6d (0x7fc433ccbb6d in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x540210 (0x7fc4324f7210 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x649bf (0x7fc4335669bf in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7fc43355fc8b in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7fc43355fe39 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x802b98 (0x7fc4327b9b98 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2f6 (0x7fc4327b9f16 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0x13d5a7 (0x55923e51c5a7 in /opt/conda/bin/python3.10)
frame #13: <unknown function> + 0x14db76 (0x55923e52cb76 in /opt/conda/bin/python3.10)
frame #14: <unknown function> + 0x14dbd3 (0x55923e52cbd3 in /opt/conda/bin/python3.10)
frame #15: <unknown function> + 0x14dbd3 (0x55923e52cbd3 in /opt/conda/bin/python3.10)
frame #16: <unknown function> + 0x14dbd3 (0x55923e52cbd3 in /opt/conda/bin/python3.10)
frame #17: <unknown function> + 0x14dbd3 (0x55923e52cbd3 in /opt/conda/bin/python3.10)
frame #18: <unknown function> + 0x14dbd3 (0x55923e52cbd3 in /opt/conda/bin/python3.10)
frame #19: <unknown function> + 0x14dbd3 (0x55923e52cbd3 in /opt/conda/bin/python3.10)
frame #20: <unknown function> + 0x14dbd3 (0x55923e52cbd3 in /opt/conda/bin/python3.10)
frame #21: <unknown function> + 0x14dbd3 (0x55923e52cbd3 in /opt/conda/bin/python3.10)
frame #22: <unknown function> + 0x14dbd3 (0x55923e52cbd3 in /opt/conda/bin/python3.10)
frame #23: <unknown function> + 0x14dbd3 (0x55923e52cbd3 in /opt/conda/bin/python3.10)
frame #24: <unknown function> + 0x14dbd3 (0x55923e52cbd3 in /opt/conda/bin/python3.10)
frame #25: <unknown function> + 0x14dbd3 (0x55923e52cbd3 in /opt/conda/bin/python3.10)
frame #26: <unknown function> + 0x14dbd3 (0x55923e52cbd3 in /opt/conda/bin/python3.10)
frame #27: <unknown function> + 0x14dbd3 (0x55923e52cbd3 in /opt/conda/bin/python3.10)
frame #28: <unknown function> + 0x15262b (0x55923e53162b in /opt/conda/bin/python3.10)
frame #29: <unknown function> + 0x1525e7 (0x55923e5315e7 in /opt/conda/bin/python3.10)
frame #30: <unknown function> + 0x563095 (0x7fc21caff095 in /opt/conda/lib/python3.10/site-packages/grpc/_cython/cygrpc.cpython-310-x86_64-linux-gnu.so)
frame #31: <unknown function> + 0x56b815 (0x7fc21cb07815 in /opt/conda/lib/python3.10/site-packages/grpc/_cython/cygrpc.cpython-310-x86_64-linux-gnu.so)
frame #32: <unknown function> + 0x60ae0f (0x7fc21cba6e0f in /opt/conda/lib/python3.10/site-packages/grpc/_cython/cygrpc.cpython-310-x86_64-linux-gnu.so)
frame #33: <unknown function> + 0x56a20b (0x7fc21cb0620b in /opt/conda/lib/python3.10/site-packages/grpc/_cython/cygrpc.cpython-310-x86_64-linux-gnu.so)
frame #34: <unknown function> + 0x5cbc29 (0x7fc21cb67c29 in /opt/conda/lib/python3.10/site-packages/grpc/_cython/cygrpc.cpython-310-x86_64-linux-gnu.so)
frame #35: <unknown function> + 0x14f3bd (0x55923e52e3bd in /opt/conda/bin/python3.10)
frame #36: PyObject_VectorcallMethod + 0x85 (0x55923e53dc85 in /opt/conda/bin/python3.10)
frame #37: <unknown function> + 0xae1eb (0x55923e48d1eb in /opt/conda/bin/python3.10)
frame #38: <unknown function> + 0x7bf6 (0x7fc434080bf6 in /opt/conda/lib/python3.10/lib-dynload/_asyncio.cpython-310-x86_64-linux-gnu.so)
frame #39: <unknown function> + 0x143d2a (0x55923e522d2a in /opt/conda/bin/python3.10)
frame #40: <unknown function> + 0x25f22c (0x55923e63e22c in /opt/conda/bin/python3.10)
frame #41: <unknown function> + 0xfda7b (0x55923e4dca7b in /opt/conda/bin/python3.10)
frame #42: <unknown function> + 0x13c1b3 (0x55923e51b1b3 in /opt/conda/bin/python3.10)
frame #43: _PyEval_EvalFrameDefault + 0x5d5d (0x55923e51916d in /opt/conda/bin/python3.10)
frame #44: _PyFunction_Vectorcall + 0x6c (0x55923e5238cc in /opt/conda/bin/python3.10)
frame #45: _PyEval_EvalFrameDefault + 0x72c (0x55923e513b3c in /opt/conda/bin/python3.10)
frame #46: _PyFunction_Vectorcall + 0x6c (0x55923e5238cc in /opt/conda/bin/python3.10)
frame #47: _PyEval_EvalFrameDefault + 0x72c (0x55923e513b3c in /opt/conda/bin/python3.10)
frame #48: _PyFunction_Vectorcall + 0x6c (0x55923e5238cc in /opt/conda/bin/python3.10)
frame #49: _PyEval_EvalFrameDefault + 0x72c (0x55923e513b3c in /opt/conda/bin/python3.10)
frame #50: _PyFunction_Vectorcall + 0x6c (0x55923e5238cc in /opt/conda/bin/python3.10)
frame #51: _PyEval_EvalFrameDefault + 0x72c (0x55923e513b3c in /opt/conda/bin/python3.10)
frame #52: _PyFunction_Vectorcall + 0x6c (0x55923e5238cc in /opt/conda/bin/python3.10)
frame #53: _PyEval_EvalFrameDefault + 0x4c12 (0x55923e518022 in /opt/conda/bin/python3.10)
frame #54: _PyFunction_Vectorcall + 0x6c (0x55923e5238cc in /opt/conda/bin/python3.10)
frame #55: _PyEval_EvalFrameDefault + 0x4c12 (0x55923e518022 in /opt/conda/bin/python3.10)
frame #56: _PyFunction_Vectorcall + 0x6c (0x55923e5238cc in /opt/conda/bin/python3.10)
frame #57: PyObject_Call + 0xbc (0x55923e52fd9c in /opt/conda/bin/python3.10)
frame #58: _PyEval_EvalFrameDefault + 0x2d84 (0x55923e516194 in /opt/conda/bin/python3.10)
frame #59: _PyFunction_Vectorcall + 0x6c (0x55923e5238cc in /opt/conda/bin/python3.10)
frame #60: PyObject_Call + 0xbc (0x55923e52fd9c in /opt/conda/bin/python3.10)
frame #61: _PyEval_EvalFrameDefault + 0x2d84 (0x55923e516194 in /opt/conda/bin/python3.10)
frame #62: <unknown function> + 0x150402 (0x55923e52f402 in /opt/conda/bin/python3.10)
frame #63: PyObject_Call + 0xbc (0x55923e52fd9c in /opt/conda/bin/python3.10)
 rank=3
2024-03-27T23:59:34.939198Z ERROR shard-manager: lorax_launcher: Shard process was signaled to shutdown with signal 6 rank=3
2024-03-27T23:59:35.003043Z ERROR lorax_launcher: Shard 3 crashed
2024-03-27T23:59:35.003070Z  INFO lorax_launcher: Terminating webserver
2024-03-27T23:59:35.003088Z  INFO lorax_launcher: Waiting for webserver to gracefully shutdown
2024-03-27T23:59:35.003172Z  INFO lorax_router::server: router/src/server.rs:1187: signal received, starting graceful shutdown
2024-03-27T23:59:35.100045Z ERROR shard-manager: lorax_launcher: Shard complete standard error output:

[W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
Warmup to max_total_tokens: 100%|██████████| 1/1 [00:11<00:00, 11.94s/it]
[rank2]:[E ProcessGroupNCCL.cpp:1182] [Rank 2] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc387781d87 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fc38773275f in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fc387fb28a8 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x6c (0x7fc33d7fe3ac in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7fc33d8024c8 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x15a (0x7fc33d805bfa in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fc33d806839 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd3e95 (0x7fc387af0e95 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7fc389319ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: clone + 0x44 (0x7fc3893aaa04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [Rank 2] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc387781d87 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fc38773275f in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fc387fb28a8 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x6c (0x7fc33d7fe3ac in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7fc33d8024c8 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x15a (0x7fc33d805bfa in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fc33d806839 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd3e95 (0x7fc387af0e95 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7fc389319ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: clone + 0x44 (0x7fc3893aaa04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc387781d87 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xdf6b11 (0x7fc33d55cb11 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd3e95 (0x7fc387af0e95 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #3: <unknown function> + 0x94ac3 (0x7fc389319ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #4: clone + 0x44 (0x7fc3893aaa04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
 rank=2
2024-03-27T23:59:35.100095Z ERROR shard-manager: lorax_launcher: Shard process was signaled to shutdown with signal 6 rank=2
2024-03-27T23:59:35.322172Z ERROR batch{batch_size=1}:prefill:prefill{id=0 size=1}:prefill{id=0 size=1}: lorax_client: router/client/src/lib.rs:34: Server error: transport error
2024-03-27T23:59:35.343506Z ERROR batch{batch_size=1}:prefill:prefill{id=0 size=1}:prefill{id=0 size=1}: lorax_client: router/client/src/lib.rs:34: Server error: transport error
2024-03-27T23:59:35.343683Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(0)}:clear_cache{batch_id=Some(0)}: lorax_client: router/client/src/lib.rs:34: Server error: error trying to connect: Connection refused (os error 111)
2024-03-27T23:59:35.343698Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(0)}:clear_cache{batch_id=Some(0)}: lorax_client: router/client/src/lib.rs:34: Server error: error trying to connect: Connection refused (os error 111)
2024-03-27T23:59:35.343707Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(0)}:clear_cache{batch_id=Some(0)}: lorax_client: router/client/src/lib.rs:34: Server error: error trying to connect: Connection refused (os error 111)
2024-03-27T23:59:35.343720Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(0)}:clear_cache{batch_id=Some(0)}: lorax_client: router/client/src/lib.rs:34: Server error: error trying to connect: Connection refused (os error 111)

@magdyksaleh I need more details to repro the issue. I ran a test with the following params:

--max-input-length 32767 --max-total-tokens 32768 --max-batch-prefill-tokens 60000

And had no issues prompting the model with various values of max_new_tokens (including leaving it empty). This was with 1x A100 (40GB).
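
Something along these lines, for anyone who wants to re-run the check (a sketch only; the prompt and the specific max_new_tokens values are placeholders, not the actual test script):

from lorax import Client

client = Client("http://127.0.0.1:8080")
prompt = "..."  # placeholder prompt

# Sweep a few max_new_tokens values, including omitting the argument entirely
# so the client default applies.
for max_new in (None, 512, 4096, 16384):
    kwargs = {} if max_new is None else {"max_new_tokens": max_new}
    response = client.generate(prompt, **kwargs)
    print(max_new, len(response.generated_text))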

What was the exact error message you were seeing?

Okay, it seems the issue here is specific to the use of --compile with long contexts. Will take a look.

Hey @hayleyhu, the transport error is interesting, I have not seen that one before. Would you mind creating a separate issue to track that?