kagent-dev / kagent

Cloud Native Agentic AI | Discord: https://bit.ly/kagentdiscord

Home Page: https://kagent.dev

Repository from Github: https://github.com/kagent-dev/kagent

[BUG] Observability Agent: model cannot determine what arguments to use for the tool call

sadieleob opened this issue · comments

📋 Prerequisites

  • I have searched the existing issues to avoid creating a duplicate
  • By submitting this issue, you agree to follow our Code of Conduct
  • I am using the latest version of the software
  • I have tried to clear cache/cookies or used incognito mode (if ui-related)
  • I can consistently reproduce this issue

🎯 Affected Service(s)

App Service

🚦 Impact/Severity

Blocker

πŸ› Bug Description

The Observability Agent in kagent fails when asked to analyze pod resource consumption. The agent hits an OpenAI/LiteLLM error because assistant tool calls are never answered with matching tool responses, which fails the query and breaks the chat session.

how much memory the httpbin pod in the default namespace is consuming?
{"contextId":"ctx-4fd05d66-2589-45fa-9090-6e7f36df4613","final":false,"kind":"status-update","status":{"state":"submitted","message":{"contextId":"ctx-4fd05d66-2589-45fa-9090-6e7f36df4613","kind":"message","messageId":"msg-1ca9a726-2e53-4d1f-8bda-867955b17bd9","parts":[{"kind":"text","text":"how much memory the httpbin pod in the default namespace is consuming?"}],"role":"user","taskId":"05a92297-d14b-4d4f-8ca7-1e21490ff740"},"timestamp":"2025-08-22T05:10:19.859440+00:00"},"taskId":"05a92297-d14b-4d4f-8ca7-1e21490ff740"}
{"contextId":"ctx-4fd05d66-2589-45fa-9090-6e7f36df4613","final":false,"kind":"status-update","metadata":{"adk_app_name":"kagent__NS__observability_agent","adk_session_id":"ctx-4fd05d66-2589-45fa-9090-6e7f36df4613","adk_user_id":"admin@kagent.dev"},"status":{"state":"working","timestamp":"2025-08-22T05:10:19.868998+00:00"},"taskId":"05a92297-d14b-4d4f-8ca7-1e21490ff740"}
{"contextId":"ctx-4fd05d66-2589-45fa-9090-6e7f36df4613","final":true,"kind":"status-update","status":{"state":"failed","message":{"kind":"message","messageId":"8517bc35-354d-4ef9-84f4-002d8acba573","parts":[{"kind":"text","text":"litellm.BadRequestError: OpenAIException - An assistant message with 'tool_calls' must be followed by tool messages responding to each 'tool_call_id'. The following tool_call_ids did not have response messages: call_YksEGtVWMvu6udpAyPNlC035, call_Dl90RWYGFAOK9PHyhUiThoU1"}],"role":"agent"},"timestamp":"2025-08-22T05:10:20.801804+00:00"},"taskId":"05a92297-d14b-4d4f-8ca7-1e21490ff740"}

🔄 Steps To Reproduce

  • Deploy kagent with observability-agent (version 0.6.3)
  • Configure Grafana MCP server with custom Grafana URL and API key
  • Ask the observability agent a question about pod resource consumption, such as:
    "can you give me the pods consuming more cpu?"
    "how much memory the httpbin pod in the default namespace is consuming?"
The agent then fails with the litellm.BadRequestError status-update stream shown in the Bug Description above.

mcpserver:

apiVersion: kagent.dev/v1alpha1
kind: MCPServer
metadata:
  annotations:
    meta.helm.sh/release-name: kagent
    meta.helm.sh/release-namespace: kagent
  creationTimestamp: "2025-08-22T04:52:17Z"
  generation: 2
  labels:
    app.kubernetes.io/instance: kagent
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: observability-agent
    app.kubernetes.io/part-of: kagent
    app.kubernetes.io/version: 0.6.3
    helm.sh/chart: observability-agent-0.6.3
  name: grafana
  namespace: kagent
  resourceVersion: "12582"
  uid: 1f256cda-6b65-4f9b-b0af-ba905eeff1ec
spec:
  deployment:
    args:
    - --transport
    - stdio
    cmd: /app/mcp-grafana
    env:
      GRAFANA_URL: kube-prometheus-stack-grafana.monitoring.svc.cluster.local:3000/api
    image: mcp/grafana:latest
    port: 3000
    secretRefs:
    - name: grafana-api-key
  transportType: stdio
status:
  conditions:
  - lastTransitionTime: "2025-08-22T04:52:18Z"
    message: MCPServer configuration is valid
    observedGeneration: 2
    reason: Accepted
    status: "True"
    type: Accepted
  - lastTransitionTime: "2025-08-22T04:52:18Z"
    message: All references resolved successfully
    observedGeneration: 2
    reason: ResolvedRefs
    status: "True"
    type: ResolvedRefs
  - lastTransitionTime: "2025-08-22T04:52:18Z"
    message: All resources created successfully
    observedGeneration: 2
    reason: Programmed
    status: "True"
    type: Programmed
  - lastTransitionTime: "2025-08-22T04:58:40Z"
    message: Deployment is ready and all pods are running
    observedGeneration: 2
    reason: Ready
    status: "True"
    type: Ready
  observedGeneration: 2

grafana/kagent pod logs:

mcp-server 2025-08-22T05:11:26.026076Z    info    request    gateway=bind/3000 listener=default route=mcp src.addr=10.244.1.70:44518 http.method=DELETE http.host=grafana.kagent http.path=/mcp http.version=HTTP/1.1 http.status=202 duration=0ms
mcp-server time=2025-08-22T05:12:19.276Z level=INFO msg="Starting Grafana MCP server using stdio transport" version=(devel)
mcp-server time=2025-08-22T05:12:19.276Z level=INFO msg="Using Grafana configuration" url=http://localhost:3000/ api_key_set=true
mcp-server 2025-08-22T05:12:19.276944Z    info    request    gateway=bind/3000 listener=default route=mcp src.addr=10.244.1.56:37738 http.method=POST http.host=grafana.kagent http.path=/mcp http.version=HTTP/1.1 http.status=200 duration=9ms
mcp-server 2025-08-22T05:12:19.277366Z    info    request    gateway=bind/3000 listener=default route=mcp src.addr=10.244.1.56:37738 http.method=POST http.host=grafana.kagent http.path=/mcp http.version=HTTP/1.1 http.status=202 duration=0ms
mcp-server 2025-08-22T05:12:19.280237Z    info    request    gateway=bind/3000 listener=default route=mcp src.addr=10.244.1.56:37738 http.method=POST http.host=grafana.kagent http.path=/mcp http.version=HTTP/1.1 http.status=200 duration=2ms
mcp-server 2025-08-22T05:12:19.281060Z    info    request    gateway=bind/3000 listener=default route=mcp src.addr=10.244.1.56:37738 http.method=DELETE http.host=grafana.kagent http.path=/mcp http.version=HTTP/1.1 http.status=202 duration=0ms

🤔 Expected Behavior

The observability agent should successfully query Grafana for pod resource metrics and return the requested information about CPU/memory consumption.
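
For reference, the data path being exercised can be sketched with a direct call to Grafana's Prometheus datasource proxy. Everything here is an assumption for illustration (datasource uid, service URL, and metric selector), not taken from this cluster:

# Hypothetical query showing the expected path: agent -> Grafana -> Prometheus
curl -s -H "Authorization: Bearer $GRAFANA_API_KEY" \
  "http://grafana.grafana:3000/api/datasources/proxy/uid/prometheus/api/v1/query" \
  --data-urlencode 'query=container_memory_working_set_bytes{namespace="default",pod=~"httpbin.*"}'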

📱 Actual Behavior

The observability agent fails with the following error:

litellm.BadRequestError: OpenAIException - An assistant message with 'tool_calls' must be followed by tool messages responding to each 'tool_call_id'. The following tool_call_ids did not have response messages: call_YksEGtVWMvu6udpAyPNlC035, call_Dl90RWYGFAOK9PHyhUiThoU1

(The full status-update stream is reproduced in the Bug Description above.)

💻 Environment

🔧 CLI Bug Report

kagent-bug-report-20250822-001548.tar.gz

πŸ” Additional Context

No response

📋 Logs

📷 Screenshots

No response

🙋 Are you willing to contribute?

  • I am willing to submit a PR to fix this issue

Hello - I was able to reproduce this consistently, and the issue comes down to this log line:

mcp-server time=2025-08-22T05:12:19.276Z level=INFO msg="Using Grafana configuration" url=http://localhost:3000/ api_key_set=true

When the observability-agent was created, the GRAFANA_URL environment variable never made it into the running container, so the MCP server fell back to its default of http://localhost:3000, causing connection timeouts.
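
A quick way to confirm what the controller actually rendered (a sketch; the printenv check only works if the image ships a usable shell/userland):

# Inspect the env array rendered into the Deployment
kubectl get deployment grafana -n kagent \
  -o jsonpath='{.spec.template.spec.containers[0].env}'

# Or check inside the running pod
kubectl exec -n kagent deploy/grafana -- printenv GRAFANA_URL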

This was the case even when my MCPServer was correctly configured:

apiVersion: kagent.dev/v1alpha1
kind: MCPServer
metadata:
  annotations:
    meta.helm.sh/release-name: kagent
    meta.helm.sh/release-namespace: kagent
  creationTimestamp: "2025-08-26T11:31:33Z"
  generation: 4
  labels:
    app.kubernetes.io/instance: kagent
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: observability-agent
    app.kubernetes.io/part-of: kagent
    app.kubernetes.io/version: v0.6.3-3-g986a84e
    helm.sh/chart: observability-agent-v0.6.3-3-g986a84e
  name: grafana
  namespace: kagent
  resourceVersion: "53598"
  uid: 17881eb0-a59e-4ca0-be7a-631d70ae9914
spec:
  deployment:
    args:
    - --transport
    - stdio
    cmd: /app/mcp-grafana
    env:
      GRAFANA_URL: http://grafana.grafana:3000
    image: mcp/grafana:latest
    port: 3000
    secretRefs:
    - name: grafana-api-key
  transportType: stdio

The only way I was able to fix it was by running:

# Add the missing env array (my Grafana instance URL was http://grafana.grafana)
kubectl patch deployment grafana -n kagent --type='json' -p='[
  {
    "op": "add",
    "path": "/spec/template/spec/containers/0/env",
    "value": [{"name": "GRAFANA_URL", "value": "http://grafana.grafana"}]
  }
]'
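
After the patch, the Deployment rolls out a new pod. To confirm the fix took, the startup log should now report the configured URL instead of the localhost default (a sketch, matching the log format quoted earlier):

kubectl rollout status deployment/grafana -n kagent
kubectl logs -n kagent deploy/grafana | grep "Using Grafana configuration"
# the url field should now show the configured address, not http://localhost:3000/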

It seems like the kmcp MCPServer controller failed to apply the env section to the container?

This was fixed in: https://github.com/kagent-dev/kmcp/pull/56/files

I'm also hitting this issue, related to tool calling when using OpenAI as the LLM.

I believe I have fixed this issue in #872, which was released in 0.6.10.