aaronsw / html2text

Convert HTML to Markdown-formatted text.

Home Page: http://www.aaronsw.com/2002/html2text/

Memory leaks

Hi guys,

I couldn't figure out the exact cause of the problem, but the fact is that I'm using your code to process around 5,000 HTML documents and my RAM fills up quickly. I'm 100% sure it's your code, because when I replaced it with simple HTML tag removal the leak was gone.

Sorry for not being more informative, but I guess it's pretty easy to set up an experiment yourselves.

How large are those documents? Is it the number of documents (which shouldn't make a difference if you are processing one at a time) or their size?

Hi,

Once again, sorry for not being more helpful, but here is the deal:

I had a set of around 4,000 EPUBs. I opened them, extracted all the HTML files, and used your software to transform them to text, which I processed later. Each EPUB had around 10 HTML files inside, so we are talking about 40,000 files that I was processing in parallel on 6 processors. The processing was for indexing the documents: for each HTML file I transformed it, ran through the text once, and then discarded it. You'll have to trust me that I made extra sure to close all the files and free everything after processing them, so I'm 100% sure the memory leak was not on my side of the code.

The deal is that with your software the whole thing was eating my RAM (8 GB), so I decided to stop using your code and just strip the HTML markup and process the text like that (with the downside that I was then indexing some CSS, but that was not a big deal for my program's logic). After I did that the memory leak disappeared and the whole job ran in something like 1 GB of RAM.
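
The fallback described here (dropping markup instead of converting it) can be as crude as a single regex. This is only a sketch of that kind of workaround, not the reporter's actual code:

```python
import re

def strip_tags(html):
    # Delete anything that looks like a tag and keep the rest. This is
    # also why CSS got indexed: the contents of <style> blocks are plain
    # text to this regex and survive the substitution.
    return re.sub(r"<[^>]+>", " ", html)
```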

Once again, trust me: I double- and triple-checked that the memory leak was not on my side of the code, and that I was not calling your code improperly. My final and definite conclusion is that your code has a memory leak.

I don't think it would be hard for you to set up an experiment to test this yourselves: get a bunch of HTML docs, transform them to text, and then discard them. By the way, I'm also sure it was not a problem with the garbage collector.

Cheers
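
A minimal sketch of the suggested experiment, assuming the classic module-level `html2text.html2text()` entry point; the corpus path is hypothetical, and the `resource` module makes this Unix-only:

```python
import glob
import resource

import html2text  # the converter under discussion

def peak_rss_kb():
    # Peak resident set size of this process (kilobytes on Linux).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

for i, path in enumerate(sorted(glob.glob("corpus/*.html"))):
    with open(path) as f:
        html2text.html2text(f.read())  # convert, then discard the result
    if i % 500 == 0:
        print("%6d files converted, peak RSS %d kB" % (i, peak_rss_kb()))
```

If peak RSS keeps climbing even though every result is discarded, the converter is retaining state between calls.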

BTW, it is not "my code" ... that was just a random drive-by comment. Is the script you have created (or at least a substantial part of it) available somewhere?

Oh man, sorry, I thought you were the code's developer.

And regarding the code, I can't publish it, sorry. Besides, after I decided to stop using this library my program changed a lot, so even if I gave it to you it would be hard to find the part where I used to call it. However, as I said before, it's really not hard to set up an experiment to check for the memory leak: just run an infinite loop that parses the same document many times, and you'll see your RAM slowly get consumed.
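
That single-document loop might look like this (`sample.html` is a placeholder):

```python
import itertools
import resource

import html2text

html = open("sample.html").read()  # placeholder document
for i in itertools.count():
    html2text.html2text(html)  # parse and throw the result away
    if i % 1000 == 0:
        print("iteration %d, peak RSS %d kB"
              % (i, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss))
```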

Well, the problem with this project is that upstream is dead (quite literally in this case, unfortunately), so we are all waiting for the succession to be resolved. I am trying to salvage bits and pieces of further development in my own repo, but I don't feel like making any deep changes before the new maintainer arrives.