aaronsw / html2text

Convert HTML to Markdown-formatted text.

Home Page: http://www.aaronsw.com/2002/html2text/

Memory leaks

Hi guys,

I couldn't figure out the exact cause of the problem, but the fact is that I'm using your code to process around 5,000 HTML documents and my RAM fills up quickly. I'm 100% sure it's your code, because when I replaced it with simple HTML tag removal the leak was gone.

Sorry for not being more informative, but I guess it's pretty easy to set up an experiment yourselves.

How large are those documents? Is it the number of documents (which shouldn't make a difference if you are processing one at a time) or their size?

Hi,

Once again, sorry for not being more helpful, but here is the deal:

I had a set of around 4,000 EPUBs. I opened them, extracted all the HTML files, and used your software to transform them to text, which I processed later. Each EPUB had around 10 HTML files inside, so we are talking about 40,000 files that I was processing in parallel on 6 processors. The processing was for indexing the documents: for each HTML file I transformed it, ran through the text once, and then discarded it. You'll have to trust me that I made extra sure to close all the files and free everything after processing them, so I'm 100% sure the memory leak was not on my side of the code.

The deal is that with your software the whole thing was eating my RAM (8 GB), so I decided to stop using your code and just strip the HTML markup and process the text like that (with the downside that I was then indexing some CSS, but that was not a big deal for my program's logic). After I did that the memory leak disappeared and the whole job ran in something like 1 GB of RAM.
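
The fallback described here (dropping markup instead of converting it) can be as crude as a single regex. This is only a sketch of that kind of workaround, not the reporter's actual code:

```python
import re

def strip_tags(html):
    # Delete anything that looks like a tag and keep the rest. This is
    # also why CSS got indexed: the contents of <style> blocks are plain
    # text to this regex and survive the substitution.
    return re.sub(r"<[^>]+>", " ", html)
```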

Once again, trust me: I double- and triple-checked that the memory leak was not on my side of the code, and that I was not calling your code improperly. My final and definite conclusion is that your code has a memory leak.

I don't think it would be hard for you to set up an experiment to test this yourselves: get a bunch of HTML docs, transform them to text, and then discard them. By the way, I'm also sure it was not a problem with the garbage collector.

Cheers
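
A minimal sketch of the suggested experiment, assuming the classic module-level `html2text.html2text()` entry point; the corpus path is hypothetical, and the `resource` module makes this Unix-only:

```python
import glob
import resource

import html2text  # the converter under discussion

def peak_rss_kb():
    # Peak resident set size of this process (kilobytes on Linux).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

for i, path in enumerate(sorted(glob.glob("corpus/*.html"))):
    with open(path) as f:
        html2text.html2text(f.read())  # convert, then discard the result
    if i % 500 == 0:
        print("%6d files converted, peak RSS %d kB" % (i, peak_rss_kb()))
```

If peak RSS keeps climbing even though every result is discarded, the converter is retaining state between calls.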

BTW, it is not "my code" ... that was just a random drive-by comment. Is the script you have created (or at least a substantial part of it) available somewhere?

Oh man, sorry, I thought you were the code's developer.

And regarding the code, I can't publish it, sorry. Besides, after I decided to stop using this library my program changed a lot, so even if I gave it to you it would be hard to find the part where I used to call it. However, as I said before, it's really not hard to set up an experiment to check for the memory leak: just run an infinite loop that parses the same document many times, and you'll see your RAM slowly get consumed.
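
That single-document loop might look like this (`sample.html` is a placeholder):

```python
import itertools
import resource

import html2text

html = open("sample.html").read()  # placeholder document
for i in itertools.count():
    html2text.html2text(html)  # parse and throw the result away
    if i % 1000 == 0:
        print("iteration %d, peak RSS %d kB"
              % (i, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss))
```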

Well, the problem with this project is that upstream is dead (quite literally in this case, unfortunately), so we are all waiting for the succession to be resolved. I am trying to salvage bits and pieces of further development in my own repo, but I don't feel like making any deep changes before the new maintainer arrives.