ggerganov / llama.cpp

LLM inference in C/C++


Embedding server crashes when used with langchain openai embeddings

voorhs opened this issue

Please include information about your system, the steps to reproduce the bug, and the version of llama.cpp that you are using. If possible, please provide a minimal code example that reproduces the bug.

If the bug concerns the server, please try to reproduce it first using the server test scenario framework.

The snippet that triggers the crash:

from langchain_openai import OpenAIEmbeddings

embedding = OpenAIEmbeddings(model="-", api_key="sk-no-key-required", base_url="http://localhost:8666")
embedding.embed_documents(["hello there"])
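
A possible client-side workaround, assuming the crash comes from LangChain pre-tokenizing the input with tiktoken and sending token-ID arrays instead of plain strings. Setting check_embedding_ctx_length=False (a real OpenAIEmbeddings parameter) makes LangChain send the raw text through; a sketch, not verified against this server build:

from langchain_openai import OpenAIEmbeddings

# check_embedding_ctx_length=False stops LangChain from tokenizing the input
# before sending it, so the server receives plain strings instead of lists
# of token IDs.
embedding = OpenAIEmbeddings(
    model="-",
    api_key="sk-no-key-required",
    base_url="http://localhost:8666",
    check_embedding_ctx_length=False,
)
embedding.embed_documents(["hello there"])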

Logs from server:

{"tid":"140695133081600","timestamp":1715435598,"level":"INFO","function":"update_slots","line":1807,"msg":"all slots are idle"}
{"tid":"140695133081600","timestamp":1715435604,"level":"INFO","function":"launch_slot_with_task","line":1036,"msg":"slot is processing task","id_slot":0,"id_task":0}
terminate called after throwing an instance of 'nlohmann::json_abi_v3_11_3::detail::type_error'
  what():  [json.exception.type_error.302] type must be number, but is array

After that, the server stops.
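The type_error.302 message ("type must be number, but is array") suggests the server found a JSON array where it expected a number, which would be consistent with the request body carrying nested token-ID arrays. If that is the cause, the crash should be reproducible with the plain openai client too, since the OpenAI embeddings API also accepts pre-tokenized input. A sketch (the token IDs are illustrative placeholders, not real tiktoken output):

import openai

client = openai.OpenAI(base_url="http://localhost:8666", api_key="sk-no-key-required")

# Sending lists of token IDs instead of strings mimics what LangChain's
# OpenAIEmbeddings does by default. The IDs below are made up for illustration.
client.embeddings.create(input=[[15339, 1070]], model="-")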

The server is launched in the server-cuda container:

docker run --gpus all -v ./llm-gguf:/models -p 8666:8000 -e "CUDA_VISIBLE_DEVICES=2" local/llama.cpp:server-cuda -m /models/GritLM-7B-Q4_K_M.gguf --port 8000 --host 0.0.0.0 --n-gpu-layers 32 --embeddings

When used with the openai Python API directly, everything works fine:

import openai

client = openai.OpenAI(
    base_url="http://localhost:8666",
    api_key="sk-no-key-required",
)

client.embeddings.create(input=["hello mister"], model="-").data[0].embedding
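
The likely difference is the request body each client produces: the openai client forwards the strings untouched, while OpenAIEmbeddings pre-tokenizes them by default. Roughly (illustrative payloads; the token IDs are made up):

# What the openai client sends (works): plain strings
{"model": "-", "input": ["hello mister"]}
# What OpenAIEmbeddings sends by default (crashes): token-ID arrays
{"model": "-", "input": [[15339, 21051]]}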

System: Ubuntu 20.04.6 LTS, NVIDIA A100-40GB

I have exactly the same bug on a Mac Studio M1 Max (latest OS). I'm using the Hermes 2 Pro Llama 8B model.