dgraph-io / ristretto

A high performance memory-bound Go cache

Home Page: https://dgraph.io/blog/post/introducing-ristretto-high-perf-go-cache/

MaxCost sometimes ignored

erikdubbelboer opened this issue

I have written a simple program that Gets and Sets entries in Ristretto. The program sets a MaxCost (in bytes) and passes a value for the cost to Set based on the size in bytes of the value.

Code can be found here: https://gist.github.com/erikdubbelboer/bd904c51f00079421fd3eb2a061a50c0

As you can see from the source, Ristretto should be limited to 1 GB of memory. The program doesn't do anything strange.
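For reference, here is a minimal sketch of the kind of program described above, assuming the structure from this description rather than the gist itself (the key space, value sizes, and config values other than MaxCost are illustrative, not taken from the gist):

package main

import (
	"fmt"
	"math/rand"
	"strings"

	"github.com/dgraph-io/ristretto"
)

// entry mirrors the "struct pointer containing key and data" described later in the thread.
type entry struct {
	key  string
	data string
}

func main() {
	cache, err := ristretto.NewCache(&ristretto.Config{
		NumCounters: 1e7,
		MaxCost:     1 << 30, // limit the cache to 1 GB worth of cost
		BufferItems: 64,
	})
	if err != nil {
		panic(err)
	}
	for {
		k := fmt.Sprint(rand.Intn(100000))
		if _, ok := cache.Get(k); !ok {
			// On a miss, Set a different random key with random data;
			// the cost passed to Set is the size of the data in bytes.
			nk := fmt.Sprint(rand.Intn(100000))
			e := &entry{key: nk, data: strings.Repeat("x", 10000+rand.Intn(10000))}
			cache.Set(nk, e, int64(len(e.data)))
		}
	}
}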

In most cases a program like this works fine and Ristretto limits the memory usage correctly. But with the exact code from the gist, the memory usage keeps growing.

Also keep in mind that runtime.GC() is run every second, so it's not just uncollected allocations we're seeing here. Not running runtime.GC() has the same result, just with more variation in the amount allocated.
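Presumably the reported alloc numbers come from something like the following (an assumption about the gist's measurement loop, not the exact code); the hits and misses columns would come from cache.Metrics.Hits() and cache.Metrics.Misses():

package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	for i := 1; ; i++ {
		time.Sleep(time.Second)
		runtime.GC() // force a collection so alloc reflects live memory only
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		fmt.Printf("time: %ds alloc: %.2f MiB\n", i, float64(m.HeapAlloc)/(1<<20))
	}
}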

This is the output when run for a couple of seconds (just go run ristretto-bug.go):

time: 1s alloc: 1033.72 MiB hits: 43 misses: 2614
time: 2s alloc: 1035.74 MiB hits: 71 misses: 3230
time: 3s alloc: 1038.76 MiB hits: 98 misses: 3210
time: 4s alloc: 1048.77 MiB hits: 87 misses: 3236
time: 5s alloc: 1065.79 MiB hits: 109 misses: 3229
...
time: 45s alloc: 5893.52 MiB hits: 642 misses: 2773
time: 46s alloc: 6049.54 MiB hits: 704 misses: 2689

The output of http://localhost:6060/debug/pprof/heap?debug=1 shows that all memory is allocated inside strings.Repeat.
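For anyone reproducing this: the pprof endpoint is presumably exposed the standard way (an assumption, the gist has the actual code), by blank-importing net/http/pprof and serving on localhost:6060:

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/ handlers on the default mux
)

func main() {
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}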

My guess is that in some cases Ristretto keeps references to memory for entries that aren't part of the cache anymore.

Thanks for finding this! From what I can tell, this is what's happening:

  1. Attempt to Get random key
  2. If miss, create random data and Set with a different random key (important for some reason)
  3. Value is a struct pointer containing key and data as fields

Definitely a confusing one. I'm going to dig around.

I believe the issue is in here.

Edit: Never mind, using another Bloom filter implementation didn't fix anything. The issue has something to do with Get increments.

Spent a few hours looking at this, going to continue tomorrow. Some notes for future reference (from policy.go):

  1. No memory leakage:
func (p *tinyLFU) Increment(key uint64) {
	// flip doorkeeper bit if not already
	//if added := p.door.AddIfNotHas(key); !added {
		// increment count-min counter if doorkeeper bit is already set.
		//p.freq.Increment(key)
	//}
	p.incrs++
	if p.incrs >= p.resetAt {
		p.reset()
	}
}
  2. Memory leakage:
func (p *tinyLFU) Increment(key uint64) {
	// flip doorkeeper bit if not already
	//if added := p.door.AddIfNotHas(key); !added {
		// increment count-min counter if doorkeeper bit is already set.
		p.freq.Increment(key)
	//}
	p.incrs++
	if p.incrs >= p.resetAt {
		p.reset()
	}
}
  3. Memory leakage:
func (p *tinyLFU) Increment(key uint64) {
	// flip doorkeeper bit if not already
	if added := p.door.AddIfNotHas(key); !added {
		// increment count-min counter if doorkeeper bit is already set.
		//p.freq.Increment(key)
	}
	p.incrs++
	if p.incrs >= p.resetAt {
		p.reset()
	}
}

Maybe this is caused by updating a key with a large value without recalculating its cost, because of the c.setBuf contention in Set?

	if c.store.Update(i.keyHash, i.key, i.value) {
		i.flag = itemUpdate
	}
	// attempt to send item to policy
	select {
	case c.setBuf <- i:
		return true
	default:
		c.Metrics.add(dropSets, i.keyHash, 1)
		return false
	}

@brk0v While we do have an eventually-consistent cost mechanism, the problem seems to be within the interaction between TinyLFU counter increments and Sets...

For example, forcing the value + cost update does nothing:

func (c *Cache) Set(key, value interface{}, cost int64) bool {
	if c == nil || key == nil {
		return false
	}
	i := &item{
		flag:    itemNew,
		key:     key,
		keyHash: z.KeyToHash(key, 0),
		value:   value,
		cost:    cost,
	}
	// attempt to immediately update hashmap value and set flag to update so the
	// cost is eventually updated
	if c.store.Update(i.keyHash, i.key, i.value) {
		i.flag = itemUpdate
	}

	c.setBuf <- i
	return true

	/*
	// attempt to send item to policy
	select {
	case c.setBuf <- i:
		return true
	default:
		c.Metrics.add(dropSets, i.keyHash, 1)
		return false
	}
	*/
}

And the cost is updated here.

When I print out cache.Metrics.CostAdded() - cache.Metrics.CostEvicted(), it starts out at around 1 GB but keeps going lower and lower.
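A minimal sketch of that metrics check, assuming the cache is created with Metrics: true (CostAdded and CostEvicted are only tracked then); the key and value here are placeholders:

package main

import (
	"fmt"
	"time"

	"github.com/dgraph-io/ristretto"
)

func main() {
	cache, err := ristretto.NewCache(&ristretto.Config{
		NumCounters: 1e7,
		MaxCost:     1 << 30, // 1 GB
		BufferItems: 64,
		Metrics:     true,
	})
	if err != nil {
		panic(err)
	}
	cache.Set("k", make([]byte, 1<<20), 1<<20)
	time.Sleep(10 * time.Millisecond) // Sets are applied asynchronously

	used := cache.Metrics.CostAdded() - cache.Metrics.CostEvicted()
	fmt.Printf("cost currently in cache: %.2f MiB\n", float64(used)/(1<<20))
}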

Interestingly, if you change the Get to cache.Get("x"+k) it also works fine, so it has something to do with the Gets and Sets on overlapping keys.

Scratch that, it still happens, but weirdly it seems to happen a lot more slowly. If you remove the Get, it doesn't happen at all.

I'm seeing this behavior in real life as well. I'm using this ristretto test for microcache: https://github.com/erikdubbelboer/microcache/tree/ristretto
I'm using snappy, so most allocated bytes come from the snappy allocation for the body buffer.
I have given ristretto 1 GB, but if I look at the pprof allocations I see the snappy allocations taking up 2.5 GB after one day. There is no other place holding on to these allocations, so they all come from ristretto. Looking at the currently allocated cost, ristretto reports 1 GB, so it thinks it's within its limit, but it isn't 😞

pprof output:

154016: 2693161616 [591105: 11179214896] @ 0x8fc5bc 0x905179 0x907665 0x906c8f 0x90a772 0x6e06d4 0xbfdc9f 0x90e76e 0x93a5fc 0x6e06d4 0xbfe540 0x6e06d4 0x6e3b24 0x6df4c5 0x45dc71
# 154016: 2.51GB [591105: 10.41GB]
#	0x8fc5bc	github.com/golang/snappy.Encode+0x22c						/workspace/vendor/github.com/golang/snappy/encode.go:22
#	0x905179	github.com/erikdubbelboer/microcache.CompressorSnappy.Compress+0x1b9		/workspace/vendor/github.com/erikdubbelboer/microcache/compressor_snappy.go:16
#	0x907665	github.com/erikdubbelboer/microcache.(*microcache).store+0xb5			/workspace/vendor/github.com/erikdubbelboer/microcache/microcache.go:447
#	0x906c8f	github.com/erikdubbelboer/microcache.(*microcache).handleBackendResponse+0x44f	/workspace/vendor/github.com/erikdubbelboer/microcache/microcache.go:396
#	0x90a772	github.com/erikdubbelboer/microcache.(*microcache).Middleware.func1+0xe12	/workspace/vendor/github.com/erikdubbelboer/microcache/microcache.go:329
#	0x6e06d4	net/http.HandlerFunc.ServeHTTP+0x44						/usr/local/go/src/net/http/server.go:2007
#	0xbfdc9f	main.main.func2.1+0x15f								/workspace/cmd/content/main.go:108
#	0x90e76e	github.com/julienschmidt/httprouter.(*Router).ServeHTTP+0x95e			/workspace/vendor/github.com/julienschmidt/httprouter/router.go:334
#	0x93a5fc	github.com/poki/mothership/pkg/logging.Middleware.func1+0x5ac			/workspace/pkg/logging/middleware.go:31
#	0x6e06d4	net/http.HandlerFunc.ServeHTTP+0x44						/usr/local/go/src/net/http/server.go:2007
#	0xbfe540	main.cors.func1+0x2e0								/workspace/cmd/content/middleware.go:18
#	0x6e06d4	net/http.HandlerFunc.ServeHTTP+0x44						/usr/local/go/src/net/http/server.go:2007
#	0x6e3b24	net/http.serverHandler.ServeHTTP+0xa4						/usr/local/go/src/net/http/server.go:2802
#	0x6df4c5	net/http.(*conn).serve+0x875							/usr/local/go/src/net/http/server.go:1890

I tried to reproduce this locally and couldn't, so I tried on another machine, where I was able to reproduce the error. I compared the versions and realized I had an older version checked out locally, which made me think this was a regression. I tracked it down to #75 on Oct 1st, read the diff, and spotted the error.

// Before Oct 1st commit
victims, added := c.policy.Add(item.key, item.cost)
if added {
	// ...
}
for _, victim := range victims {
	// ...
}

policy.Add sometimes returns victims even when added is false, and the old code handled that correctly.
https://github.com/dgraph-io/ristretto/blob/master/policy.go#L168

// After Oct 1st commit
if victims, added := c.policy.Add(item.key, item.cost); added {
	// ...
	for _, victim := range victims {
		// ...
	}
}

The new code only iterates over victims when added is true, so victims sometimes don't get deleted.

Submitted PR #99

With the PR it appears to be fixed on my machine. Let me know if you run into any more issues @erikdubbelboer and I'll re-open.

Seems to be fixed 👍

Shouldn't a test be added for this so it doesn't happen again in the future? Ristretto not limiting its cache to the configured size seems like a major bug to me that should never happen.

@erikdubbelboer I'm working on a Test PR right now.
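For what it's worth, here is a rough sketch of the kind of regression check such a test could make (this is not the actual Test PR, and the key space, value sizes, and bound are arbitrary): fill the cache far beyond MaxCost with a Get-miss/Set workload and assert that the live heap stays within a sane multiple of the configured limit.

package ristretto_test

import (
	"fmt"
	"math/rand"
	"runtime"
	"strings"
	"testing"

	"github.com/dgraph-io/ristretto"
)

func TestMaxCostRespected(t *testing.T) {
	const maxCost = 1 << 28 // 256 MiB to keep the test quick
	cache, err := ristretto.NewCache(&ristretto.Config{
		NumCounters: 1e6,
		MaxCost:     maxCost,
		BufferItems: 64,
	})
	if err != nil {
		t.Fatal(err)
	}
	// Get-miss/Set workload: roughly 2 GB of values pass through the cache.
	for i := 0; i < 200000; i++ {
		k := fmt.Sprint(rand.Intn(100000))
		if _, ok := cache.Get(k); !ok {
			v := strings.Repeat("a", 10000) // ~10 KB value, cost = size in bytes
			cache.Set(fmt.Sprint(rand.Intn(100000)), v, int64(len(v)))
		}
	}
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	if m.HeapAlloc > 3*maxCost { // generous bound; tune as needed
		t.Fatalf("heap alloc %d greatly exceeds MaxCost %d", m.HeapAlloc, maxCost)
	}
}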