influxdata / kapacitor

Open source framework for processing, monitoring, and alerting on time series data

[Bug]: Receiving opentsdb data (plaintext) causes panic

elofu17 opened this issue

The latest version of kapacitord, v1.6.5-1, seems to have a bug in its opentsdb handling.

To reproduce:
On a Debian 11 machine I have a netdata process that exports its metrics (opentsdb) to localhost:4242, where kapacitord is listening.

In your repo, there are currently two versions of kapacitor available:

  • 1.6.5-1
  • 1.6.4-1

I did an apt full-upgrade which gave me v1.6.5-1, and kapacitord now constantly fails. :(
Every time a chunk of opentsdb metrics (plaintext) is received on port 4242 it says:

Dec 14 15:25:58 netdatacentral kapacitord[1041]: ts=2022-12-14T15:25:58.592+01:00 lvl=info msg="http request" service=http host=::1 username=- start=2022-12-14T15:25:58.592460338+01:00 method=POST uri=/write?consistency=&db=_internal&precision=ns&rp=monitor protocol=HTTP/1.1 status=204 referer=- user-agent=InfluxDBClient request-id=3a524601-7bbb-11ed-800a-0666a6579300 duration=290.345µs
Dec 14 15:26:00 netdatacentral kapacitord[1041]: panic: not implemented
Dec 14 15:26:00 netdatacentral kapacitord[1041]: goroutine 109 [running]:
Dec 14 15:26:00 netdatacentral kapacitord[1041]: github.com/influxdata/kapacitor.(*TaskMaster).WritePointsPrivileged(0x0?, {{0x4?, 0x203001?}}, {0xc001d89e80?, 0x4?}, {0x0?, 0x2000100000060?}, 0x0?, {0xc00200a000, 0x5b, ...})
Dec 14 15:26:00 netdatacentral kapacitord[1041]: #011/root/kapacitor/task_master.go:273 +0x27
Dec 14 15:26:00 netdatacentral kapacitord[1041]: github.com/influxdata/influxdb/services/opentsdb.(*Service).processBatches(0xc000124900, 0xc00235eea0)
Dec 14 15:26:00 netdatacentral kapacitord[1041]: #011/go/pkg/mod/github.com/influxdata/influxdb@v1.9.6/services/opentsdb/service.go:483 +0x3ae
Dec 14 15:26:00 netdatacentral kapacitord[1041]: github.com/influxdata/influxdb/services/opentsdb.(*Service).Open.func1()
Dec 14 15:26:00 netdatacentral kapacitord[1041]: #011/go/pkg/mod/github.com/influxdata/influxdb@v1.9.6/services/opentsdb/service.go:127 +0x65
Dec 14 15:26:00 netdatacentral kapacitord[1041]: created by github.com/influxdata/influxdb/services/opentsdb.(*Service).Open
Dec 14 15:26:00 netdatacentral kapacitord[1041]: #011/go/pkg/mod/github.com/influxdata/influxdb@v1.9.6/services/opentsdb/service.go:127 +0x2df
Dec 14 15:26:00 netdatacentral systemd[1]: kapacitor.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Dec 14 15:26:00 netdatacentral systemd[1]: kapacitor.service: Failed with result 'exit-code'.
Dec 14 15:26:00 netdatacentral systemd[1]: kapacitor.service: Service RestartSec=100ms expired, scheduling restart.

(netdata also logs that it lost its connection when kapacitord restarted:
Dec 14 15:25:59 netdatacentral netdata-error.log: 2022-12-14 15:25:59: netdata ERROR : MAIN : EXPORTING: 'localhost:4242' closed the socket
)

Every time a new chunk of metrics is received, kapacitord panics and restarts. No data is actually processed; kapacitord just panics and dies.
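
For readability, the journal lines above can be stripped of their syslog prefix, and the `#011` sequences (rsyslog's octal escape for a tab character) decoded back into tabs. A minimal Python sketch follows; the helper name `clean_line` and the prefix pattern are assumptions made for this illustration and are not part of kapacitor or netdata:

```python
import re

# Matches a classic syslog prefix such as
# "Dec 14 15:26:00 netdatacentral kapacitord[1041]: "
# (assumed layout: month, day, time, host, program[pid]).
SYSLOG_PREFIX = re.compile(r"^\w{3} +\d+ [\d:]{8} \S+ \S+\[\d+\]: ")

# rsyslog writes control characters as '#' followed by three octal digits.
OCTAL_ESCAPE = re.compile(r"#([0-7]{3})")


def clean_line(line):
    """Drop the syslog prefix and decode #ooo escapes (e.g. #011 -> tab)."""
    body = SYSLOG_PREFIX.sub("", line)
    return OCTAL_ESCAPE.sub(lambda m: chr(int(m.group(1), 8)), body)
```

Applied to the trace above, this leaves the usual Go panic layout, with `task_master.go:273` (inside `WritePointsPrivileged`) as the top frame under `panic: not implemented`.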

I then downgraded to the other, older version available:

apt install kapacitor=1.6.4-1
reboot

Now it works again. The plaintext opentsdb metrics are received, processed and sent to our InfluxDB as they should be.

I have made no changes to the configuration or the TICK script, so the bug must be in the kapacitor v1.6.5-1 package.
The regression was introduced after v1.6.4-1.

I have also tried changing the netdata export to use [opentsdb:http:opentsdb_POST_to_kapacitor] (in case the new version of kapacitor expects HTTP-formatted metric data instead of plaintext), but that didn't work either.


Additional info:

A tcpdump shows that the format of the plaintext metrics is the same (i.e. it is not netdata that has changed its output format).

16:01:59.480522 IP 127.0.0.1.32932 > 127.0.0.1.4242: Flags [S], seq 2855994911, win 65495, options [mss 65495,sackOK,TS val 2211832732 ecr 0,nop,wscale 7], length 0
E..<.Y@.@..`.............;...........0.........
............
16:01:59.480537 IP 127.0.0.1.4242 > 127.0.0.1.32932: Flags [S.], seq 861833801, ack 2855994912, win 65483, options [mss 65495,sackOK,TS val 2211832732 ecr 2211832732,nop,wscale 7], length 0
E..<..@.@.<.............3^.I.;. .....0.........
............
16:01:59.480551 IP 127.0.0.1.32932 > 127.0.0.1.4242: Flags [.], ack 1, win 512, options [nop,nop,TS val 2211832733 ecr 2211832732], length 0
E..4.Z@.@..g.............;. 3^.J.....(.....
........
16:02:09.484044 IP 127.0.0.1.32932 > 127.0.0.1.4242: Flags [.], seq 1:32742, ack 1, win 512, options [nop,nop,TS val 2211842736 ecr 2211832732], length 32741
E....[@.@.:..............;. 3^.J....~......
..
.....put netdata.disk_svctm.nvme0n1.svctm 1670857326 1.0000000 host=netdatacentral
put netdata.disk_ext_avgsz.nvme0n1.discards 1670857326 0.0000000 host=netdatacentral
put netdata.disk_avgsz.nvme0n1.reads 1670857326 0.0000000 host=netdatacentral
put netdata.disk_avgsz.nvme0n1.writes 1670857326 -26.7857143 host=netdatacentral
...and so on. A few large packets are sent and received before the server sends a FIN, and the next packet from the client gets a RST (since nothing is listening on tcp/4242 while kapacitord is restarting).
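
To reproduce without netdata, the same kind of payload can be sent by hand. A minimal Python sketch follows, assuming kapacitord is listening on 127.0.0.1:4242 as in the `[opentsdb]` section of kapacitor.conf; `format_put` and `send_puts` are hypothetical helper names, and the line layout simply mirrors the `put` lines in the capture above:

```python
import socket


def format_put(metric, timestamp, value, tags):
    """Build one OpenTSDB telnet-style `put` line, terminated by a newline."""
    tag_str = " ".join(f"{key}={val}" for key, val in tags.items())
    # netdata prints values with seven decimals in the capture above.
    return f"put {metric} {timestamp} {value:.7f} {tag_str}\n"


def send_puts(lines, host="127.0.0.1", port=4242, timeout=5.0):
    """Send the given `put` lines over one TCP connection; return bytes sent.

    Host and port are assumptions taken from the [opentsdb] bind-address.
    """
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
    return len(payload)
```

For example, `send_puts([format_put("netdata.disk_svctm.nvme0n1.svctm", 1670857326, 1.0, {"host": "netdatacentral"})])` sends the first line from the capture. Whether a single line is enough to trigger the panic on v1.6.5-1 is untested.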

Let me know if you need more config files. Here is what I guess is the relevant stuff:

# cat /etc/kapacitor/kapacitor.conf
hostname = "localhost"
data_dir = "/var/lib/kapacitor/.kapacitor"
skip-config-overrides = false
default-retention-policy = ""

[http]
  bind-address = ":9092"
  auth-enabled = false
  log-enabled = true
  write-tracing = false
  pprof-enabled = false
  https-enabled = false
  https-certificate = "/etc/ssl/kapacitor.pem"
  https-private-key = ""
  shutdown-timeout = "10s"
  shared-secret = ""

[replay]
  dir = "/var/lib/kapacitor/.kapacitor/replay"

[storage]
  boltdb = "/var/lib/kapacitor/.kapacitor/kapacitor.db"

[task]
  dir = "/var/lib/kapacitor/.kapacitor/tasks"
  snapshot-interval = "1m0s"

[load]
  enabled = true
  dir = "/etc/kapacitor/load"

[[influxdb]]
  enabled = true
  default = true
  name = "default"
  urls = ["http://localhost:8086"]
  username = ""
  password = ""
  ssl-ca = ""
  ssl-cert = ""
  ssl-key = ""
  insecure-skip-verify = false
  timeout = "0s"
  disable-subscriptions = false
  subscription-protocol = "http"
  subscription-mode = "cluster"
  kapacitor-hostname = ""
  http-port = 0
  udp-bind = ""
  udp-buffer = 1000
  udp-read-buffer = 0
  startup-timeout = "5m0s"
  subscriptions-sync-interval = "1m0s"
  [influxdb.excluded-subscriptions]
    _kapacitor = ["autogen"]

[logging]
  file = "STDERR"
  level = "DEBUG"

[config-override]
  enabled = true

[opentsdb]
  enabled = true
  bind-address = "127.0.0.1:4242"
  database = "opentsdb"
  retention-policy = "autogen"
  consistency-level = "one"
  tls-enabled = false
  certificate = "/etc/ssl/influxdb.pem"
  batch-size = 1000
  batch-pending = 5
  batch-timeout = "1s"
  log-point-errors = true

[reporting]
  enabled = false
  url = "https://usage.influxdata.com"

[stats]
  enabled = true
  stats-interval = "10s"
  database = "_kapacitor"
  retention-policy = "autogen"
  timing-sample-rate = 0.1
  timing-movavg-size = 1000

# Connect to a second InfluxDB
[[influxdb]]
  enabled = true
  default = false
  name = "InfluxCloud"
  urls = ["https://blahblahblah.influxcloud.net:8086"]
  username = "blahblah"
  password = "blahblah"
  timeout = 0

# cat /etc/netdata/exporting.conf
[exporting:global]
    enabled = yes

[opentsdb:opentsdb_plaintext_to_kapacitor]
    enabled = yes
    destination = localhost:4242
    data source = average
    update every = 60
    send hosts matching = *
    send charts matching = system.cpu system.uptime system.load system.entropy disk_space.* system.ram system.swap disk_ops.*

# cat /etc/kapacitor/load/tasks/stream_netdata_to_influxdb.tick
// Stream data from Netdata to remote InfluxDB
dbrp "opentsdb"."autogen"

var data = stream
    |from()
        .database('opentsdb')
        .retentionPolicy('autogen')
        .groupByMeasurement()
    |window()
        .period(1m)
        .every(1m)

data
    |influxDBOut()
        .database('opentsdb')
        .retentionPolicy('autogen')
        .cluster('InfluxCloud')