grafana / carbon-relay-ng

Fast carbon relay+aggregator with admin interfaces for making changes online - production ready

Leaks memory

mmimica opened this issue · comments

The relay seems to be leaking memory. It consumes all 2 GB of RAM we gave it until the OOM killer kicks in.
Routing rules are pretty simple:

init = [
     "addBlack .net",
     "addRoute sendAllMatch core prefix=org.company  12.56.176.41:2014 spool=false pickle=true  12.56.176.42:2014 spool=false pickle=true",
     "addRoute sendAllMatch other prefix=stats  12.56.176.51:2014 spool=false pickle=true  12.56.176.52:2014 spool=false pickle=true"
]

I'm also seeing this issue.

instance = "default"
listen_addr = "0.0.0.0:2103"
admin_addr = "0.0.0.0:2104"
http_addr = "0.0.0.0:8089"

spool_dir = "/log/graphite/spool"
log_level = "error"

init = [
"addRoute sendAllMatch carbonrelay-cluster 10.198.34.168:2013 spool=false pickle=false 10.198.34.169:2013 spool=false pickle=false 10.198.35.228:2013 spool=false pickle=false 10.199.223.31:10005 spool=false pickle=false"
]

[instrumentation]
graphite_addr = "localhost:2103"
graphite_interval = 15000

Slowly eats memory over time.

It appears that my configuration was seeing lots of destination metric drops. I tracked this down to my last destination (10005), an influxdb destination that was having issues. The carbon-relay-ng metrics showed lots of errors and drops for it. Once I removed the influxdb destination, we've been stable. I don't think the issue is influxdb being a destination per se - rather, it's the performance of my influxdb backend. The back-up in processing seems to have negatively impacted carbon-relay-ng.

Any news on this?

weird. i have a 2GB machine and never even needed to look at memory. never been an issue for me.
any of you interested in doing some memory profiling? this can tell you where a lot of memory is being allocated.
see http://blog.golang.org/profiling-go-programs
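For reference, here's a minimal standalone sketch of dumping a heap profile with the standard runtime/pprof package (the output file name is arbitrary, and this is not carbon-relay-ng's built-in profiling hook):

    // standalone sketch, not part of carbon-relay-ng: write a heap snapshot
    // that go tool pprof can read.
    package main

    import (
        "log"
        "os"
        "runtime"
        "runtime/pprof"
    )

    func main() {
        f, err := os.Create("heap.out")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()
        runtime.GC() // update allocation statistics before writing the profile
        if err := pprof.WriteHeapProfile(f); err != nil {
            log.Fatal(err)
        }
    }

The resulting file can then be inspected with: go tool pprof /path/to/carbon-relay-ng heap.out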

I'm also getting OOM kills. I'm also pushing to influxdb. Here's a grafana graph of memory usage (sorry for the missing units; the graph is ~2 weeks wide and ~7.5 GB tall). I can try to do some memory profiling soon.

[graph: memory usage]

what's the green vs the yellow line? (which is memory and what's the other?)

Sorry. They are both memory. The yellow line is memory usage on a different server running carbon-relay-ng at a lower throughput.

The green line does represent total memory usage on a server running carbon-relay-ng and some other processes, but that other usage is normally pretty constant; I think the graph shows memory growth up to the point where the kernel OOM-kills the process. I also have OOM messages in dmesg.

I have a core file I'm analyzing too.

My core file doesn't show anything particularly interesting. Approximately 90% of it is NULLs, which would make sense if we had lots of structures initialized to their zero value and never used. I'll make a build with some memory allocation debugging next.

I'm seeing the same issue. Here's a memory graph of two servers, both running carbon-relay-ng with an identical config. carbon-relay-ng was installed between May 28-29. Decreases in cached memory are server reboots.
[graph: carbon-relay-ng memory leak]

The difference between both servers is that the first one is getting all the incoming carbon-relay-ng traffic right now.

The config is pretty trivial. This is running on 3 servers, each also running carbon-cache. Clients can talk to carbon-relay-ng on any of the 3 servers, which then forwards the metrics to all 3.

instance = "default"
listen_addr = "192.168.80.11:2103"
admin_addr = "127.0.0.1:2004"
http_addr = "127.0.0.1:8081"
spool_dir = "/var/spool/carbon-relay-ng"
log_level = "notice"
bad_metrics_max_age = "24h"

init = [
     'addBlack collectd.localhost',
     'addRoute sendAllMatch mon-carbons  192.168.80.11:2004 spool=true pickle=true  192.168.80.12:2004 spool=true pickle=true  192.168.80.13:2004 spool=true pickle=true',
]

[instrumentation]
graphite_addr = ""
graphite_interval = 1000

It hasn't OOMed yet, though it looks like it will in a few days at this rate. I'll give a shot at profiling it.

Well, looks like carbon-relay-ng already imports pprof, so that was easy.

(pprof) top10
5.41GB of 5.45GB total (99.16%)
Dropped 109 nodes (cum <= 0.03GB)
      flat  flat%   sum%        cum   cum%
    5.41GB 99.16% 99.16%     5.41GB 99.16%  github.com/graphite-ng/carbon-relay-ng/_third_party/github.com/Dieterbe/go-metrics.(*WindowSample).Update
         0     0% 99.16%     5.41GB 99.16%  github.com/graphite-ng/carbon-relay-ng/_third_party/github.com/Dieterbe/go-metrics.(*StandardHistogram).Update
         0     0% 99.16%     5.40GB 99.01%  github.com/graphite-ng/carbon-relay-ng/_third_party/github.com/Dieterbe/go-metrics.(*StandardTimer).Update
         0     0% 99.16%     5.40GB 99.09%  main.(*Conn).HandleData
         0     0% 99.16%     5.45GB   100%  runtime.goexit

That would be where all the memory is going. Sounds like this is broken in metrics_wrapper.go:

    //histogram: NewHistogram(NewExpDecaySample(1028, 0.015)),
    histogram := metrics.NewHistogram(metrics.NewWindowSample())

WindowSample just accumulates values forever, never freeing them, until it is explicitly cleared. ExpDecaySample is limited to 1028 samples, in this case. I'm changing both instances of WindowSample to ExpDecaySample in metrics_wrapper.go and seeing how that goes over the next few days.
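For reference, the change I'm testing looks roughly like this (a sketch; the 1028/0.015 parameters come from the commented-out line above):

    // metrics_wrapper.go (sketch): bound the sample instead of letting it grow.
    // ExpDecaySample keeps at most 1028 samples with exponential decay (alpha 0.015),
    // whereas WindowSample accumulates until explicitly cleared.
    histogram := metrics.NewHistogram(metrics.NewExpDecaySample(1028, 0.015))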

Hi marcan,

I haven't dug through the code but just one thing.... you aren't sending internal metrics from carbon-relay-ng anywhere....

[instrumentation]
graphite_addr = ""
graphite_interval = 1000

Maybe if you were, it would be flushing the WindowSamples?

That's possible, though I'm not currently interested in the instrumentation and just left that section untouched from the sample config file :-)

hmm i have to check the go-metrics code again. it is possible that certain metric types need to be collected for them to be aggregated or for their data structures to be trimmed/reset.

btw i created a little lib that automatically creates a memory profile when a certain memory usage is reached. see https://github.com/Dieterbe/profiletrigger
anyone feel like building this in and PR'ing? :) it should help all of you in tracking this down.

Any news about this issue? I am running into it right now.

Is it linked to a bad 'instrumentation' section?


@Dieterbe I did the same as @marcan (changing metrics.NewWindowSample() to metrics.NewExpDecaySample(1028, 0.015)) and carbon-relay-ng stopped leaking memory. I can make a PR with that, but I lack an understanding of how the metrics histogram should work in this case.

@olivierHa nope, it has nothing to do with the instrumentation section.

On my side, it was the instrumentation :(
If you put a "wrong" or empty address to flush internal metrics to, I guess the metrics just 'stack up' in memory until the OOM occurs.
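For reference, a working instrumentation section just needs a reachable carbon listener (the address and interval below are examples only):

[instrumentation]
graphite_addr = "localhost:2003"
graphite_interval = 1000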


I see. Well, in my case there is no instrumentation section at all and it's leaking.

In my case, the OOM stopped when the instrumentation section was configured.

So I've got it set up to flush instrumentation, and I also tweaked it as suggested above to use NewExpDecaySample, but neither made any difference. It constantly receives a low level of incoming metrics (5-15/s) in my test configuration, and it eventually gets killed by the kernel.

I tried to get it to use the referenced profiletrigger, but I haven't had any luck (nothing ever gets written). Basically I did:

    if *memprofile != "" {
        log.Warning("Triggering memory profiling at 1GB to '%s'\n", *memprofile)
        errors := make(chan error)
        trigger, _ := heap.New(*memprofile, 1000000000, 60, time.Duration(1)*time.Second, errors)
        go trigger.Run()
        go func() {
            for e := range errors {
                log.Fatal("profiletrigger heap saw error:", e)
            }
        }()
        log.Warning("Started memprofile error thread")
    }

I don't know too much about Go, so I may be doing something simple wrong.

the profiletrigger should be unrelated to the memprofile flag, so instead of using *memprofile you should use a path to a writeable directory. also make sure your threshold makes sense for your system, i believe it's in bytes.
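something like this should work (a sketch; the directory path and the 1GB threshold are just examples):

    // assumes: import "github.com/Dieterbe/profiletrigger/heap"
    // profiletrigger writes heap profiles into the given directory once the
    // threshold (in bytes) is crossed.
    errors := make(chan error)
    trigger, _ := heap.New("/var/tmp/carbon-relay-ng-profiles", 1000000000, 60, time.Second, errors)
    go trigger.Run()
    go func() {
        for e := range errors {
            log.Fatal("profiletrigger heap saw error:", e)
        }
    }()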

Ah, wasn't clear to me it was supposed to be a directory.

Here's what I get, from a couple of profiles about 3 minutes apart. My threshold is low at the moment to make it happen quickly, but it will use up pretty much all the system's memory in a few hours. I've got an older version of the program running on an entirely different system (different environment, OS), handling about 10K metrics/second, and it hasn't been restarted since 6/10. The system I'm testing is only receiving 6-10 metrics/second, but it will run out of memory and the process manager will restart the relay service in 4-6 hours.

go tool pprof /usr/local/bin/carbon-relay-ng /appdata/carbon-relay-ng-profiles/1470066792.profile-heap
(pprof) top 10
821.70MB of 822.21MB total (99.94%)
Dropped 54 nodes (cum <= 4.11MB)
      flat  flat%   sum%        cum   cum%
  478.75MB 58.23% 58.23%   478.75MB 58.23%  main.(*keepSafe).keepClean
  219.14MB 26.65% 84.88%   219.14MB 26.65%  main.NewWriter
  103.06MB 12.53% 97.42%   336.58MB 40.94%  main.NewConn
   13.88MB  1.69% 99.10%    13.88MB  1.69%  main.NewKeepSafe
    6.87MB  0.84% 99.94%     6.87MB  0.84%  github.com/graphite-ng/carbon-relay-ng/badmetrics.New
         0     0% 99.94%   336.58MB 40.94%  main.(*Destination).updateConn
         0     0% 99.94%     6.87MB  0.84%  main.main
         0     0% 99.94%   822.21MB   100%  runtime.goexit
         0     0% 99.94%     6.87MB  0.84%  runtime.main
(pprof) top10 -cum
821.70MB of 822.21MB total (99.94%)
Dropped 54 nodes (cum <= 4.11MB)
      flat  flat%   sum%        cum   cum%
         0     0%     0%   822.21MB   100%  runtime.goexit
  478.75MB 58.23% 58.23%   478.75MB 58.23%  main.(*keepSafe).keepClean
         0     0% 58.23%   336.58MB 40.94%  main.(*Destination).updateConn
  103.06MB 12.53% 70.76%   336.58MB 40.94%  main.NewConn
  219.14MB 26.65% 97.42%   219.14MB 26.65%  main.NewWriter
   13.88MB  1.69% 99.10%    13.88MB  1.69%  main.NewKeepSafe
    6.87MB  0.84% 99.94%     6.87MB  0.84%  github.com/graphite-ng/carbon-relay-ng/badmetrics.New
         0     0% 99.94%     6.87MB  0.84%  main.main
         0     0% 99.94%     6.87MB  0.84%  runtime.main

go tool pprof /usr/local/bin/carbon-relay-ng /appdata/carbon-relay-ng-profiles/1470066982.profile-heap
Entering interactive mode (type "help" for commands)
(pprof) top 10
849.39MB of 849.90MB total (99.94%)
Dropped 58 nodes (cum <= 4.25MB)
      flat  flat%   sum%        cum   cum%
  508.82MB 59.87% 59.87%   508.82MB 59.87%  main.(*keepSafe).keepClean
  226.97MB 26.71% 86.57%   226.97MB 26.71%  main.NewWriter
  106.74MB 12.56% 99.13%   334.21MB 39.32%  main.NewConn
    6.87MB  0.81% 99.94%     6.87MB  0.81%  github.com/graphite-ng/carbon-relay-ng/badmetrics.New
         0     0% 99.94%   334.21MB 39.32%  main.(*Destination).updateConn
         0     0% 99.94%     6.87MB  0.81%  main.main
         0     0% 99.94%   849.90MB   100%  runtime.goexit
         0     0% 99.94%     6.87MB  0.81%  runtime.main
(pprof) top10 -cum
849.39MB of 849.90MB total (99.94%)
Dropped 58 nodes (cum <= 4.25MB)
      flat  flat%   sum%        cum   cum%
         0     0%     0%   849.90MB   100%  runtime.goexit
  508.82MB 59.87% 59.87%   508.82MB 59.87%  main.(*keepSafe).keepClean
         0     0% 59.87%   334.21MB 39.32%  main.(*Destination).updateConn
  106.74MB 12.56% 72.43%   334.21MB 39.32%  main.NewConn
  226.97MB 26.71% 99.13%   226.97MB 26.71%  main.NewWriter
    6.87MB  0.81% 99.94%     6.87MB  0.81%  github.com/graphite-ng/carbon-relay-ng/badmetrics.New
         0     0% 99.94%     6.87MB  0.81%  main.main
         0     0% 99.94%     6.87MB  0.81%  runtime.main

the profile above is not very conclusive, but i suspect it was taken when carbon-relay-ng hadn't been running long enough.

From the reports from various people in this ticket, as well as the code, it's clear that the problem is carbon-relay-ng's internal metrics piling up when the relay isn't set up to send them anywhere.
we need to do two things:

A) make this issue extra clear in the readme and sample configs, and set up the default config to work around it, because too many folks are tripping over it. I'll do this now.
B) actually solve it, maybe by using a different library, or by flushing the metrics to /dev/null if graphite_addr is empty. Unfortunately i have no time to do this in the short term.
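For (B), here's a rough sketch of what "flushing to nowhere" could look like, assuming the Registry/Each/Clear semantics of the upstream rcrowley/go-metrics API that the vendored fork is based on (nothing like this exists in the current code, and it hasn't been tested against carbon-relay-ng's wrapper):

    // sketch: if graphite_addr is empty, periodically reset accumulating samples
    // so that WindowSample-backed histograms cannot grow without bound.
    package instrumentation

    import (
        "time"

        metrics "github.com/rcrowley/go-metrics" // stand-in for the vendored fork
    )

    func drainMetrics(reg metrics.Registry, every time.Duration) {
        for range time.Tick(every) {
            reg.Each(func(name string, m interface{}) {
                if h, ok := m.(metrics.Histogram); ok {
                    h.Clear() // drop the accumulated sample values
                }
            })
        }
    }

It would be kicked off with something like go drainMetrics(metrics.DefaultRegistry, time.Minute) whenever graphite_addr is left empty.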

@Dieterbe is there a way to discard these internal metrics?

i don't have time right now to dig into the code (anyone else, feel free to do so). from what i recall no / not with our current implementation

Hey, can we add a flag just to disable the instrumentation if it's not needed? If you can direct me to the relevant parts of the code so I can better understand the situation, I'll be happy to try to contribute a fix. :)

@shlok007 , did you make any progress on this?

It seems that we start collecting memory stats before we know whether we even want them.

go func() {
    sys := stats.Gauge("what=virtual_memory.unit=Byte")
    alloc := stats.Gauge("what=memory_allocated.unit=Byte")
    ticker := time.NewTicker(time.Second)
    var memstats runtime.MemStats
    for range ticker.C {
        runtime.ReadMemStats(&memstats)
        sys.Update(int64(memstats.Sys))
        alloc.Update(int64(memstats.Alloc))
    }
}()

I'm going to take a poke at a PR for this.
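A sketch of the shape that fix could take (the config field and helper names here are hypothetical, just to illustrate the guard):

    // only start the memstats-reporting goroutine when the internal metrics
    // actually go somewhere; cfg.Instrumentation.GraphiteAddr is a made-up name
    // for whatever the real config field turns out to be.
    if cfg.Instrumentation.GraphiteAddr != "" {
        go reportMemStats() // the goroutine quoted above, moved into a helper
    }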