dark-panda / ffi-geos

ffi-geos is an implementation of the GEOS Ruby bindings in Ruby via FFI.

Potential memory leak

ppawel opened this issue · comments

I'm trying to diagnose a memory leak in my Ruby code (see ppawel/openstreetmap-watch-list#20 if interested) and right now my suspect is GEOS (or my wrong usage of the library).

Basically I have a long-running process implemented in Ruby that processes a massive amount of geometry. It is supposed to run for at least a week, but it exhausts system memory after ~12 hours and the OOM killer proceeds to terminate it.

The offending code is somewhere here: https://github.com/ppawel/openstreetmap-watch-list/blob/resurrection/tiler/lib/tiler.rb if you care to look at it.

So far I have partly traced it to the GC not being able to collect the result of Geos::create_polygon(cs) after it's no longer needed. I see that in utils.rb in ffi-geos, the references to the inner and outer linear rings are marked with autorelease = false. Is this on purpose? Is GEOS going to free this memory?

In general, do you have any pointers on how to deal with memory allocation and GC when using ffi-geos? I know it's supposed to work automagically, but for me it doesn't, and I'm not sure whether I'm doing something wrong or there's a bug in the library somewhere.

I think I have a minimal test case for you:

require 'ffi-geos'

def memory_usage
  `ps -o rss= -p #{$$}`.to_i
end

GC.start
puts memory_usage

polys = []
10_000.times do
  cs = Geos::CoordinateSequence.new(5, 2)
  y1, x1 = rand(100), rand(100)
  y2, x2 = rand(100), rand(100)
  cs.y[0], cs.x[0] = y1, x1
  cs.y[1], cs.x[1] = y1, x2
  cs.y[2], cs.x[2] = y2, x2
  cs.y[3], cs.x[3] = y2, x1
  cs.y[4], cs.x[4] = y1, x1
  polys << Geos::create_polygon(cs, :srid => 4326)
end

puts memory_usage

# GC should pick it up after that, right?
polys = nil

GC.start
puts memory_usage

This prints out for me (RSS in kilobytes):

9828
30584
30596

So the question is: why is the GC not able to collect those polygons after the array has already been discarded?

commented

G'day Pawel. I think the problem in this case is that the CoordinateSequences aren't getting GC'd properly, rather than the Polygons. It may be a misunderstanding of GEOS' memory management on my part in this case, which can be tricky sometimes when dealing with CSes. (I believe that sometimes GEOS expects you to clean up after yourself, while at other times it seems to do so on its own.)

I'll take a look at this more closely later on tonight.

@dark-panda Thanks for the response. When I remove the call to create_polygon, the test case above prints out:

9832
9912
9916
commented

I've taken a look at how GEOS handles creating polygons in this manner via its GEOSGeom_createPolygon_r function, followed it through GeometryFactory::createPolygon, set up some logging in the GEOS source code, and gave the whole thing a run to see if the memory is being freed properly.

The references to autorelease = false are intentional, as if you take a look at the destructor for geos::geom::Polygon in the GEOS source code you'll see the following:

Polygon::~Polygon()
{
  delete shell;
  for(size_t i=0, n=holes->size(); i<n; ++i)
  {
    delete (*holes)[i];
  }
  delete holes;
}

In this case, disabling autorelease on the ruby side of things lets the GEOS Polygon destructor take care of that. That's also why we're dup'ing them in our force_to_linear_ring method in Geos::Tools before we hand them off in Geos.create_polygon -- to ensure that it's the resulting Polygon that cleans up in the GEOS destructor rather than the originating geometry. Basically, we want the outer and inner rings to be destroyed when the resulting Polygon is GC'd rather than the originating LinearRings, which might be destroyed at another time during GC.
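To illustrate the idea, here's a pure-Ruby sketch of that ownership hand-off. The Ring and Polygon classes and the FREED log are invented for illustration only; this is not ffi-geos' actual code. Disabling a ring's self-cleanup once a polygon takes ownership ensures the underlying pointer is released exactly once, by its owner:

```ruby
# Pure-Ruby sketch of the autorelease = false hand-off described above.
# Ring, Polygon, and FREED are invented for illustration; they are not
# part of ffi-geos.
FREED = []

class Ring
  attr_reader :ptr

  def initialize(ptr)
    @ptr = ptr
    @owns = true           # autorelease = true: the ring frees its own ptr
  end

  def disown!              # the autorelease = false analogue
    @owns = false
  end

  def free
    FREED << @ptr if @owns
  end
end

class Polygon
  def initialize(shell)
    @shell = shell
    shell.disown!          # the polygon now owns the shell's pointer
  end

  def free                 # mirrors GEOS' Polygon::~Polygon deleting shell
    FREED << @shell.ptr
  end
end
```

Freeing both objects then releases the shell pointer only once, via the polygon, just as the GEOS destructor above deletes its shell and holes.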

So, it looks like we are cleaning up properly; the question is why we aren't being GC'd in the first place. I've set up a gist to help us debug the problem a bit. Take a look at https://gist.github.com/41150331db9200e20820 for some additions to your test script that I've been playing with.

Part of the problem is that GC doesn't always run when you expect it to, as the GC algorithm may not actually perform a collection if it has enough free slots available and hasn't hit certain limits and such. I've been flipping the values of USE_ARRAY and USE_STRINGS around in the gist to show that the behaviour occurs with regular ol' ruby strings, too -- on my system, the finalizers will run at the very end of the script if you enable both USE_ARRAY and USE_STRINGS, but will run during the explicit call to GC.start if USE_ARRAY is disabled.

I think this may be a problem of ruby's GC not acting exactly as we'd like it to, or we've made some false assumptions on when GC will run. There may also be references to some objects inside of ffi-geos that are gumming things up by keeping references alive that shouldn't be, but I haven't been able to track them down yet. At any rate, it looks to me like the objects are being destroyed properly from ruby on back to GEOS, but the GC isn't initiating that process quite as we'd expect.

Thoughts?

commented

One thing that occurred to me -- perhaps you could try playing with the tweakable GC environment variables to try and get GC to run more often, or otherwise force GC to run more often by forcing GC.start periodically. Part of the problem could be that the underlying GEOS objects aren't actually included in any sort of calculations that the GC algorithm makes, and perhaps because of this they are leading to increased overall memory usage in long-running processes?
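As a minimal sketch of the periodic-GC idea (BATCH is an arbitrary tuning value and the loop body is a placeholder, not an ffi-geos or Ruby setting):

```ruby
# Force a full collection every BATCH iterations of a long-running loop.
# BATCH is an arbitrary tuning value, not a library setting.
BATCH = 1_000
gc_runs = 0

10_000.times do |i|
  # ... create and process geometry here ...
  if (i + 1) % BATCH == 0
    GC.start          # explicit full collection
    gc_runs += 1
  end
end
```

Whether this actually caps memory usage depends on whether the GC can see the objects as collectable at all, which is the question at hand.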

I tried forcing GC by explicitly calling GC.start in my process, but it has no effect. I also played a bit with the environment variables, but they shouldn't matter anyway: the process eats over 8 GB of RAM within 12 hours, so GC should run at some point during that period regardless of the settings (unless the settings are crazy, and they are not).

You can reproduce that behavior easily by increasing the number of objects in the test script.

For 100000:

9828
147008
146236

For 1000000:

9836
1236896
1228588

It'd be kind of ridiculous if the GC let that happen and it was "by design".

commented

Alright, so now it seems like a matter of hunting down all of our references to the geometries being produced and so forth. I'll take another look tonight and see what I can find. Haven't had a good memory leak hunt in a long time, so this ought to be a treat.

I will do what I can to help. I have already looked at the ffi-geos code and haven't found anything too obvious (except the autorelease = false stuff with create_polygon, which you explained). Anyway, good luck, and let me know what you find, as this issue is of great interest to me (and probably to other ffi-geos users as well, though I agree that my usage is rather extreme...).

commented

Yeah, memory leaks are fun, eh? One thing I might try is to run this through JRuby to see if it produces the same symptoms, as its overall implementation is obviously different from MRI's, as is its FFI implementation. ffi itself has had memory leaks in the past, but I don't think they were GC-related.

Anyways, fun times ahead.

Indeed, during my pursuit of this problem I was quite astonished that Ruby doesn't have any tools for profiling memory that actually work. Some of them are specific to a single Ruby version (at a specific patch level!), while others segfault or just don't do anything. Finally I was enlightened by a suggestion to use the good old ps command, and I basically bisected my way down to this specific ffi-geos problem using debug statements in my code...

commented

Alright, here's a workaround for the time being. I tried basically doing the exact same thing as our regular Geometry code but instead used a simple empty Foo class that looked like this:

class Foo
  def initialize
    # The finalizer proc is built in a class method so it doesn't capture
    # `self`, which would keep the instance alive and prevent GC entirely.
    ObjectSpace.define_finalizer(self, self.class.finalize)
  end

  def self.finalize
    proc { |*args|
      puts "Finalizer for #{args.inspect}"
    }
  end
end

When run through the test script, the Foo objects wouldn't be GC'd until the end of the script if you set USE_ARRAY to true, even if you nil out the array that stores them. So the answer appears to be to actually loop through the array, setting each index to nil, and then nil the entire array, like so:

objs.length.times do |i|
  objs[i] = nil
end
objs = nil

I've updated the gist at https://gist.github.com/41150331db9200e20820 to reflect all of this so you can play around with it, but that appears to work for me. It doesn't make any difference whether it's a Foo or a Polygon: it still hangs around for some reason if you're using the array here, even if you nil out the array itself but not its contents. I guess this is a gotcha with GC in MRI?

commented

This does appear to be a gotcha in ruby's garbage collection mechanism and it's really quite twitchy.

Here are a few links where the problem is discussed:

http://ruby.11.n6.nabble.com/ruby-core-19342-Bug-649-Memory-leak-in-a-array-assignment-td3586037.html
http://www.ruby-forum.com/topic/109001

I've seen it mentioned that you can use Array#clear in lieu of setting the values to nil to wipe the array and let its elements be GC'd, and that does seem to work and is faster than looping through the array.
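For example, with throwaway objects standing in for geometries:

```ruby
objs = Array.new(3) { Object.new }

objs.clear    # drops every element reference in place, in one call
# objs is now empty; the former elements are eligible for collection
GC.start
```

This is equivalent to the index-by-index nil loop above, but runs in a single call.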

In my tests, it also seems dependent on a few factors. All of these tests used a loop of 1..100.

  • 1.9.3-p327 with funny-falcon's patches exhibited the problem with a loop and no Array#clear forced, while vanilla 1.9.3-p327 was okay. In both cases, everything was fixed by calling Array#clear.
  • 1.9.3-p374 worked fine with the vanilla build as well as the patches listed at https://github.com/skaes/rvm-patchsets/tree/master/patches/ruby/1.9.3/p374/railsexpress. Both of these builds worked with and without Array#clear.
  • 1.8.7-p371 exhibited the problem when skipping Array#clear, but was fine when Array#clear was used.

So, yeah, kind of a twitchy thing.

All that being said, could you try using Array#clear in the sections that you're experiencing the problem and see if that helps?

Hmm, I still get 9MB before and 30MB after with the following settings:

USE_FINALIZERS = true
USE_ARRAY = true
USE_FOOS = false
WIPE_ARRAY = true

That's on Ruby 1.9.3p374 (2013-01-15 revision 38858) [x86_64-linux]. I will play more with the test script...

commented

That's not entirely unexpected and probably within the realm of sanity for this script. What we want to see is that the memory usage doesn't continuously increase as we create and destroy geometry objects. You can check this by modifying the script to set USE_ARRAY to false and then adding a call to GC.start in the loop. If you watch your memory usage, it should remain essentially constant (barring any internal objects that ruby creates as it churns along). If you set USE_ARRAY to true and watch the memory usage, it should steadily increase as the array is filled. If there were an actual memory leak, both cases would cause the memory usage to steadily increase, but that does not appear to be the case, so as far as I can tell, there's no leak.

Yeah, well, from my point of view that's just renaming the problem; it doesn't really help in my use case :( I guess the conclusion is that Ruby's GC sucks. It seems very unusual that it cannot collect a simple array like the one in the test script...

I guess I will close this issue since it's not really related to ffi-geos.

Thanks for your help.

commented

Yeah, ruby 2.0 is supposed to have a revamped GC of some sort, so maybe that will help. I only got a chance to test on 2.0-preview1 and I think there may have been changes since then.

Does the Array#clear help with your problem in this case?

Results on Rubinius-head in case you're looking to jump ship to a better VM - https://gist.github.com/sempervictus/4f77f857d568e9cabe0c