libvips / lua-vips

Lua binding for the libvips image processing library

Possible memory leak

kleisauke opened this issue · comments

Hi John,

We are currently investigating a memory leak; it could be caused by lua-vips (or libvips 8.7 RC1), by another dependency, by a leak in OpenResty, or by a combination of these factors.

To rule things out, I started checking lua-vips for possible memory leaks. I've found a number of things:

  1. While checking for possible memory leaks, I ran valgrind on our test environment, using the openresty-valgrind RPM package provided by OpenResty. With this Lua script:

    local vips = require "vips"

    and this command:

    valgrind --leak-check=full /usr/local/openresty-valgrind/luajit/bin/luajit-2.1.0-beta3 vips-init.lua &> output.txt

    it generates this log: https://gist.github.com/kleisauke/57558977a2e31be0c809424078885196
    Note: this was run without any GLib / libvips suppression file, so it may be a false positive.

  2. In addition to valgrind, we've captured some stack traces of potential memory leaks with memleax on the production server. See the memleax.txt attachment. We used memleax -e 400 to report all memory allocations that hadn't been freed after 400 seconds.

  3. While attempting to fix this (see: kleisauke@bb4971f), I found something odd:
    Test image: https://images2.alphacoders.com/651/651450.jpg

    pyvips:

    python3.6 soak-test.py /home/651450.jpg
    memory: high-water mark 80.83 MB
    

    lua-vips master:

    luajit-2.1.0-beta3 soak-test.lua -- /home/651450.jpg
    memory: high-water mark 1.02 GB
    

    This may be caused by LuaJIT's garbage collector, but it seems odd that the high-water mark is so much higher than that of pyvips / Python. I thought this was due to caching, but the Python soak test also disables that: https://github.com/jcupitt/pyvips/blob/master/examples/soak-test.py#L9 (the equivalent lua-vips calls are sketched just after this list).
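For reference, the corresponding lua-vips calls (the same two calls appear in the scripts further down this thread) are:

local vips = require "vips"

-- report libvips memory use on exit and disable the operation cache
vips.leak_set(true)
vips.cache_set_max(0)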

I will further debug this (in combination with OpenResty) in the next few days, but I thought I'd let you know in advance. Any help would be greatly appreciated.

memleax.txt

Hi Kleis,

I added vips.leak_set(true) to lua-vips and added a soak example:

https://github.com/jcupitt/lua-vips/blob/master/example/soak.lua
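The core of it is roughly this (a sketch; see the file in the repo for the exact version):

local vips = require "vips"

vips.leak_set(true)
vips.cache_set_max(0)

if #arg ~= 2 then
    print("usage: luajit soak.lua image-file iterations")
    error()
end

for i = 0, tonumber(arg[2]) do
    print("loop ", i)

    -- load, embed with mirrored edges, then write to a JPEG buffer
    local im = vips.Image.new_from_file(arg[1])
    im = im:embed(100, 100, 3000, 3000, { extend = "mirror" })
    local buf = im:write_to_buffer(".jpg")
end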

I can run like this:

$ luajit soak.lua ~/pics/k2.jpg 10
loop 	0
loop 	1
...
loop 	10
memory: high-water mark 13.40 MB

And like this:

$ luajit soak.lua ~/pics/k2.jpg 1000
loop 	0
loop 	1
loop 	2
...
loop 	999
loop 	1000
memory: high-water mark 13.42 MB

I watched the second case in top and memuse crept to ~150mb by loop 200, and then stabilised. So I think, in this case anyway, there's no leak.

It's higher than Python's 50mb, but luajit has a very different GC -- I think luajit 2.1 uses an incremental mark-sweep collector, whereas Python has reference counting plus a mark-sweep pass to break cycles. This means Python can often release memory as soon as it is unused, but luajit has to wait for the next GC cycle.

People usually say that GC roughly doubles memory consumption, so this seems in line with that.

Obviously there are lots of paths this soak tester is not testing :(

Oh, I just saw your improve-memory branch. I wonder why you see >1gb? git master lua-vips seems to run in less memory than that.

Sorry I keep forgetting things to add.

I have soak-tested lua-vips before and it seemed OK (or the same as the other bindings), so I doubt there are any problems directly in the code (plus it's rather simple).

I don't really understand how openresty manages resources or the life-cycle of a request, so I can easily imagine something strange with fork() messing up lua-vips's reference counts.

I'm sure you saw this old issue:

#19

it has some more memory benchmarks with openresty.

Indeed, I think this has something to do with GC. Explanation of >1gb:

It seems that I can't reproduce this with write_to_buffer:

Details: k2.jpg
luajit-2.1.0-beta3 soak.lua /home/k2.jpg 1
loop    0
loop    1
memory: high-water mark 24.48 MB
luajit-2.1.0-beta3 soak.lua /home/k2.jpg 10
loop    0
loop    1
loop    2
loop    3
loop    4
loop    5
loop    6
loop    7
loop    8
loop    9
loop    10
memory: high-water mark 24.48 MB
luajit-2.1.0-beta3 soak.lua /home/k2.jpg 50
loop    0
loop    1
loop    2
loop    3
loop    4
loop    5
loop    6
loop    7
loop    8
loop    9
loop    10
...
loop    47
loop    48
loop    49
loop    50
memory: high-water mark 24.63 MB
luajit-2.1.0-beta3 soak.lua /home/k2.jpg 100
loop    0
loop    1
loop    2
loop    3
loop    4
loop    5
loop    6
loop    7
loop    8
loop    9
loop    10
...
loop    97
loop    98
loop    99
loop    100
memory: high-water mark 24.63 MB
luajit-2.1.0-beta3 soak.lua /home/k2.jpg 500
loop    0
loop    1
loop    2
loop    3
loop    4
loop    5
loop    6
loop    7
loop    8
loop    9
loop    10
...
loop    497
loop    498
loop    499
loop    500
memory: high-water mark 24.63 MB
luajit-2.1.0-beta3 soak.lua /home/k2.jpg 1000
loop    0
loop    1
loop    2
loop    3
loop    4
loop    5
loop    6
loop    7
loop    8
loop    9
loop    10
...
loop    997
loop    998
loop    999
loop    1000
memory: high-water mark 24.63 MB

(651450.jpg omitted because I think that one goes fine too)

But as soon as I test with write_to_file:

diff --git a/example/soak.lua b/example/soak.lua
index 1111111..2222222 100644
--- a/example/soak.lua
+++ b/example/soak.lua
@@ -17,5 +17,5 @@
 
     local im = vips.Image.new_from_file(arg[1])
     im = im:embed(100, 100, 3000, 3000, { extend = "mirror" })
-    local buf = im:write_to_buffer(".jpg")
+    im:write_to_file("x.v")
 end
Details: k2.jpg
luajit-2.1.0-beta3 soak.lua /home/k2.jpg 1
loop    0
loop    1
memory: high-water mark 41.42 MB
luajit-2.1.0-beta3 soak.lua /home/k2.jpg 10
loop    0
loop    1
loop    2
loop    3
loop    4
loop    5
loop    6
loop    7
loop    8
loop    9
loop    10
memory: high-water mark 161.99 MB
luajit-2.1.0-beta3 soak.lua /home/k2.jpg 50
loop    0
loop    1
loop    2
loop    3
loop    4
loop    5
loop    6
loop    7
loop    8
loop    9
loop    10
...
loop    47
loop    48
loop    49
loop    50
memory: high-water mark 265.18 MB
luajit-2.1.0-beta3 soak.lua /home/k2.jpg 100
loop    0
loop    1
loop    2
loop    3
loop    4
loop    5
loop    6
loop    7
loop    8
loop    9
loop    10
...
loop    97
loop    98
loop    99
loop    100
memory: high-water mark 265.26 MB
luajit-2.1.0-beta3 soak.lua /home/k2.jpg 500
loop    0
loop    1
loop    2
loop    3
loop    4
loop    5
loop    6
loop    7
loop    8
loop    9
loop    10
...
loop    497
loop    498
loop    499
loop    500
memory: high-water mark 265.26 MB
luajit-2.1.0-beta3 soak.lua /home/k2.jpg 1000
loop    0
loop    1
loop    2
loop    3
loop    4
loop    5
loop    6
loop    7
loop    8
loop    9
loop    10
...
loop    997
loop    998
loop    999
loop    1000
memory: high-water mark 265.41 MB

Notice that once it reaches ±265 MB, it does not grow any further.

With 651450.jpg the effect is more pronounced:

Details: 651450.jpg
luajit-2.1.0-beta3 soak.lua /home/651450.jpg 1
loop    0
loop    1
memory: high-water mark 149.34 MB
luajit-2.1.0-beta3 soak.lua /home/651450.jpg 10
loop    0
loop    1
loop    2
loop    3
loop    4
loop    5
loop    6
loop    7
loop    8
loop    9
loop    10
memory: high-water mark 630.00 MB
luajit-2.1.0-beta3 soak.lua /home/651450.jpg 50
loop    0
loop    1
loop    2
loop    3
loop    4
loop    5
loop    6
loop    7
loop    8
loop    9
loop    10
...
loop    47
loop    48
loop    49
loop    50
memory: high-water mark 973.32 MB
luajit-2.1.0-beta3 soak.lua /home/651450.jpg 100
loop    0
loop    1
loop    2
loop    3
loop    4
loop    5
loop    6
loop    7
loop    8
loop    9
loop    10
...
loop    97
loop    98
loop    99
loop    100
memory: high-water mark 1.02 GB
luajit-2.1.0-beta3 soak.lua /home/651450.jpg 500
loop    0
loop    1
loop    2
loop    3
loop    4
loop    5
loop    6
loop    7
loop    8
loop    9
loop    10
...
loop    497
loop    498
loop    499
loop    500
memory: high-water mark 1.02 GB
luajit-2.1.0-beta3 soak.lua /home/651450.jpg 1000
loop    0
loop    1
loop    2
loop    3
loop    4
loop    5
loop    6
loop    7
loop    8
loop    9
loop    10
...
loop    997
loop    998
loop    999
loop    1000
memory: high-water mark 1.08 GB

Notice that once it reaches ±1 GB, it does not grow any further.

Where k2.jpg is: https://commons.wikimedia.org/wiki/File:Picture_of_K2.jpg and 651450.jpg is: https://images2.alphacoders.com/651/651450.jpg

I'll make a list of all libvips operations that we use, as we don't use write_to_file (we do use a lot of image operations + write_to_buffer).

Huh, curious. I'm not sure why write_to_file should use more memory. I'll investigate.

Here's another data point -- a soak in C:

/* compile with
 *
 * gcc -g -Wall soak4.c `pkg-config vips --cflags --libs`
 */

#include <stdio.h>
#include <stdlib.h>
#include <vips/vips.h>

int
main( int argc, char **argv )
{       
        int iterations;
        int i;

        if( VIPS_INIT( argv[0] ) )
                vips_error_exit( NULL );
        vips_leak_set( TRUE );
        vips_cache_set_max( 0 );

        if( argc != 3 )
                vips_error_exit( "usage: soak4 input-file iterations" );
 
        iterations = atoi( argv[2] );

        for( i = 0; i < iterations; i++ ) {
                VipsImage *image;
                VipsImage *x;

                printf( "loop %d ...\n", i );

                if( !(image = vips_image_new_from_file( argv[1], NULL )) )
                        vips_error_exit( NULL );
                if( vips_embed( image, &x, 100, 100, 3000, 3000,
                        "extend", VIPS_EXTEND_MIRROR,
                        NULL ) )
                        vips_error_exit( NULL );
                g_object_unref( image );
                image = x;

                if( vips_image_write_to_file( image, "x.v", NULL ) )
                        vips_error_exit( NULL );
                g_object_unref( image );
        }

        return( 0 );
}       

Running it, I see:

$ ./a.out ~/pics/k2.jpg 1
loop 0 ...
memory: high-water mark 17.61 MB
$ ./a.out ~/pics/k2.jpg 1000
loop 0 ...
loop 1 ...
...
loop 999 ...
memory: high-water mark 17.64 MB

Watching top, it creeps up to 200mb of memory and stays there. I tried 10 iterations in valgrind and no leaks were reported.

I tried luajit soak again on lua-vips master:

jpg buffer output: 250mb peak in top
jpg file output: 500mb peak in top
vips file output: 500mb peak in top

I added this to the end of the loop:

    im = nil
    collectgarbage()

And with vips output it runs in 250mb of memory.

So I think it's clearly the lua GC here. I doubt if you can insert many collectgarbage() calls without hurting performance.

Could you have something to trigger it if the server is idle for more than 100ms? Or perhaps once every 10 requests and no more than once per second?
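Something like this, perhaps, run from OpenResty's log_by_lua phase -- just a sketch; the module name, the per-worker state, and the thresholds are assumptions on my part:

-- hypothetical gc_throttle.lua: run a full collection at most once per
-- second, and only after at least 10 requests have been handled
local request_count = 0
local last_collect = 0

local _M = {}

function _M.maybe_collect()
    request_count = request_count + 1

    local now = ngx.now()
    if request_count >= 10 and now - last_collect >= 1 then
        request_count = 0
        last_collect = now
        collectgarbage()
    end
end

return _M

It could then be called with require("gc_throttle").maybe_collect() from a log_by_lua_block, so it runs after the response has been sent.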

Hi John,

I think that I've found the culprit, see:

local ffi = require "ffi"

local vips_lib
local glib_lib

if ffi.os == "Windows" then
    vips_lib = ffi.load("libvips-42.dll")
    glib_lib = ffi.load("libglib-2.0-0.dll")
else
    vips_lib = ffi.load("vips")
    glib_lib = vips_lib
end

ffi.cdef [[
    char* vips_filename_get_filename (const char* vips_filename);
    char* vips_filename_get_options (const char* vips_filename);

    void g_free(void* data);
]]

local vips_filename = 'document.pdf[page=1]'

local i = 1

while true do
    print("iteration ", i)

    local filename = vips_lib.vips_filename_get_filename(vips_filename)
    local options = vips_lib.vips_filename_get_options(vips_filename)

    local filename_str = ffi.string(filename)
    local options_str = ffi.string(options)

    print(filename_str)
    print(options_str)

    -- lua-vips isn't doing this:
    -- glib_lib.g_free(filename)
    -- glib_lib.g_free(options)

    i = i + 1
end

The strings returned by vips_filename_get_filename and vips_filename_get_options need to be freed with g_free, because they are duplicated (with g_strdup), see:
https://github.com/jcupitt/libvips/blob/0b3565c04d7b3f491126433cd42edeb0618824b6/libvips/iofuncs/image.c#L1803
https://github.com/jcupitt/libvips/blob/0b3565c04d7b3f491126433cd42edeb0618824b6/libvips/iofuncs/image.c#L1827

Other libvips bindings may also need to be checked.
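For completeness, the fix inside that loop is either an explicit free after copying, or attaching g_free as a finalizer with LuaJIT's ffi.gc (both just illustrate the pattern):

-- explicit fix: copy into a Lua string, then release the g_strdup()'d copy
local filename = vips_lib.vips_filename_get_filename(vips_filename)
local filename_str = ffi.string(filename)
glib_lib.g_free(filename)

-- or: attach the free as a finalizer, so the copy is released when the
-- cdata pointer is garbage collected
local options = ffi.gc(vips_lib.vips_filename_get_options(vips_filename),
    glib_lib.g_free)
local options_str = ffi.string(options)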

Oh yes! Well done!! I pushed a fix, what do you think?

soak probably isn't sensitive enough to spot this, but it seems to run OK.

I'll check the other bindings.

And pyvips (forgot to paste in the link to this issue)

jcupitt/pyvips-experiment@52e2184

OK, all done, I hope. Well done again for finding this dumb thing, Kleis!

Thanks for the fix! It does solve part of the problem, but we still see high memory usage (the GC seems to be the biggest factor). It can easily be reproduced with this:

Test image: https://t0.nl/undefined.jpg
lua-vips soak script (adapted for thumbnail processing):

local vips = require "vips"

vips.leak_set(true)
vips.cache_set_max(0)

if #arg ~= 2 then
    print("usage: luajit soak.lua image-file iterations")
    error()
end

local im

for i = 0, tonumber(arg[2]) do
    print("loop ", i)

    im = vips.Image.thumbnail(arg[1], 10, {
        height = 10000000,
        auto_rotate = false,
        linear = false,
        size = "down",
    })
    local buf = im:write_to_buffer(".jpg")
    im = nil

    --collectgarbage()
end

This time I've made a flame graph (with sample-bt-leaks) which can be found here: https://t0.nl/memleak.svg

libvips reports a high-water mark of ±603.71 KB, but after 200 iterations the RSS easily reaches ±1 GB. This is solved with collectgarbage(), but it feels a bit unnatural, and indeed I doubt you can insert many collectgarbage() calls without hurting performance (I still have to test that in production).

Could there be another bug that we have overlooked? The flame graph looks suspicious with those leaks(?) in vips_region_*.

Note that I didn't see this high memory usage in pyvips. Maybe I should wait for the New Garbage Collector in LuaJIT 3.0.

It seems that I've found another strange image. When I first process it with vipsthumbnail:

vipsthumbnail undefined.jpg \
    --size "4032<" \
    -o %s_2.jpg[optimize_coding,strip] \
    --eprofile sRGB.icm

(which preserves the image size and moves it to the sRGB colourspace; sRGB.icm can be found here)

Afterwards, when I process the normalized image (undefined_2.jpg) with the identical script above (so without collectgarbage()), the memory does not continue to rise. So I doubt that it's the LuaJIT GC that causes this high memory usage. It looks like there is a leak somewhere in vips_icc_* or LittleCMS.

Your test image is interlaced, so the first time it's read the jpeg reader has to allocate a huge buffer to unpack it. The second time, it's a regular JPG, so memuse is low.

Python doesn't really have a GC -- it's reference-counted. This means it can free objects as soon as they go out of scope, so memuse will typically be much lower than LuaJIT. It has a GC as well, but it's just for breaking cycles and only runs occasionally.

Ruby has a generational GC, so it can GC very frequently without hurting performance. I think this is what LuaJIT is switching to.

Ruby used to have a simple mark-sweep GC, like LuaJIT. Back then, ruby-vips had this thing:

https://github.com/jcupitt/ruby-vips/blob/master/lib/vips/image.rb#L162

It triggered a GC after every 100 writes. The performance impact was small, and it did stop pathological mem growth.
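In Lua the same idea would look something like this (the function name is mine, not lua-vips API):

-- hypothetical: force a full collection after every 100th image write,
-- mirroring what ruby-vips used to do
local write_count = 0

local function after_write()
    write_count = write_count + 1
    if write_count % 100 == 0 then
        collectgarbage()
    end
end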

I doubt if there can be a serious memory leak in libvips (though of course it's always possible!).

All large memory areas are linked to GObjects, and leak_set will find any object reference leaks. As long as you are not leaking references, memory leaks can't be very significant.

Oh, and interlaced JPG images can't be processed with shrink-on-load either. They really are awful.

We will try to experiment with collectgarbage() on the production server, I'll let you know if this solves our problem.

In the meantime, I'll close this issue because this doesn't seem to be related to lua-vips / libvips. Thanks again for fixing the _get_filename() and _get_options() leaks, I've just fixed this thing in NetVips.

By the way,
When I tried to make further improvements to lua-vips (see this commit: kleisauke@3cacde0), I noticed this line: https://github.com/jcupitt/lua-vips/blob/master/src/vips/Image.lua#L9

Hopefully that doesn't cause leaks, because usually this is done with two separate variables. I tried to bypass it but had no luck; it seems to be required to work around the recursive requires. A neater way would be to extract Image.is_Image(x) into a utils file, because that seems to be the only thing that needs to be shared across the Image_methods and voperation files (https://stackoverflow.com/a/13969886/1480019 is relevant here). A sketch of that idea follows.
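Something like this is what I have in mind (the file name and the exact check are illustrative, not the current lua-vips code):

-- hypothetical src/vips/Image_base.lua: holds the one thing that both
-- Image_methods.lua and voperation.lua need, so neither has to
-- require the other
local Image_base = {}

-- metatable shared by all Image wrappers
Image_base.mt = {}

-- the exact check is only to show the shape; the point is that this
-- module has no further requires of its own
function Image_base.is_Image(thing)
    return type(thing) == "table" and getmetatable(thing) == Image_base.mt
end

return Image_base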

Yes, I remember that being a bit ugly. If you could cook up a PR with a better solution, I'd be very happy!

Using an alternative memory allocator (jemalloc) has mitigated the consequences of our memory leak. Now only virtual memory is being leaked (sigh), as opposed to real RAM.

As a long-term solution, we are experimenting with rewriting the implementation (again) as a C++ nginx module, so we can keep track of when memory is allocated and deallocated, assuming that the leak is somewhere in Lua's GC. This will take some time.

The good news is that, as long as we restart the service every 12 hours, one should be able to run the current codebase on a 128 GB RAM server with normal malloc! 😅

Oh dear, sorry about another rewrite :(

Perhaps you could make a tiny module that just does thumbnail and run that from nginx? If you see mem growth there too, perhaps the problem is something to do with openresty?

Don't worry, another rewrite is also good for learning a new programming language.

A tiny module that invokes thumbnail is indeed the first step. I'm going to experiment with this and will let you know whether it leaks. Thanks!

I came across an interesting thread on the luajit email list:

https://www.freelists.org/post/luajit/LuaJIT-GC-Problem

tl;dr: LuaJIT can keep compiled traces around longer than you'd think, and they can fill memory. Moreover, a trace keeps the objects it references alive. Calling jit.flush() will dump them all.

@jcupitt: We've tried this solution in production for a few days, together with collectgarbage() in the log_by_lua_file directive so that it is executed at the end of every request, but it didn't seem to make any difference. Additional suggestions are welcome!

Update:
On December 11, 2018, ImageMagick (v6.9.10-16) was temporarily unavailable on our production server (due to an incorrect update). Our users were not able to load / save (magickload / magicksave) images through ImageMagick (e.g. .bmp, .ico and .jp2 images).

At the same moment we saw that our memory leak had disappeared:
(memory usage graph, 1544439600-1544698800)

It is therefore possible that the memory leak is located in magickload and/or magicksave or that it isn't located in libvips at all. I'll do some preliminary investigation with valgrind in February.

Wow, that's dramatic!

Gotcha! It seems that there is a leak in GetExceptionInfo (see valgrind output). Tested with this C code:

/* compile with:
 *      gcc -g -Wall test.c `pkg-config vips --cflags --libs` -o test
 */

#include <stdio.h>
#include <vips/vips.h>

int
main(int argc, char **argv) {
    VipsImage *in;

    if (VIPS_INIT(argv[0]))
        vips_error_exit(NULL);

    if (argc != 3)
        vips_error_exit("usage: %s infile outfile", argv[0]);

    if (!(in = vips_image_new_from_file(argv[1], NULL)))
        vips_error_exit(NULL);

    if (vips_image_write_to_file(in, argv[2], NULL))
        vips_error_exit(NULL);

    g_object_unref(in);

    vips_shutdown();

    return (0);
}

And this valgrind command:

valgrind --suppressions=libvips.supp \
         --leak-check=yes \
         --log-file=valgrind-out.txt \
         ./test favicon.ico[page=1,access=sequential] favicon.gif[format=gif]

(favicon.ico is located here)

For some reason the ExceptionInfo structure is not freed correctly. I've tried to fix this (see this commit), and the memory leak seems to be gone (see new valgrind output).

I'm not sure if this is the underlying cause of our memory leak. Also, AcquireExceptionInfo doesn't seem to be available in GraphicsMagick, so this may not be the most appropriate solution.

Oh nice! Looks like magick7load and magicksave already do this, so it's just magick2vips.

I'll add a configure test and a little wrapper to magick.c.

OK, there's a branch of 8.7 with a fix, could you test, Kleis?

Also, good job on finding this!

Seems to work, tested with ImageMagick 6.9.10-23. Thanks for fixing this! I can't wait to test this on our production environment.