kevinboone / epub2txt2

A simple command-line utility for Linux, for extracting text from EPUB documents.

epub2txt stuck on file - [epub2txt TRACE Entering wstring_length]

teamcoltra opened this issue · comments

I have no idea what the problem is with this file; none of the thousands of other files I have converted have this problem (I have emailed you the EPUB in case you would like to look at it yourself). I ran the command and it just sat there, so I ran it again with --log=4 and noticed an infinite loop of "Entering wstring_length" (I have left it running for 10+ minutes to see whether anything changes; it doesn't).

root@files:/var/www/static/upload# epub2txt --log=4 '/var/www/static/upload/toprocess/Richard Baker - [Breaker of Empires 02] - Restless Lightning (epub).epub' > /var/www/static/upload/toprocess/ebb-5c51f7603eb82.txt
epub2txt TRACE Entering epub2txt_do_file
epub2txt DEBUG epub2txt_do_file: /var/www/static/upload/toprocess/Richard Baker - [Breaker of Empires 02] - Restless Lightning (epub).epub
epub2txt DEBUG File access OK
epub2txt DEBUG tempbase is: /tmp
epub2txt DEBUG tempdir is: /tmp/epub2txt26002
epub2txt DEBUG Running unzip command; unzip -o -qq "/var/www/static/upload/toprocess/Richard Baker - [Breaker of Empires 02] - Restless Lightning (epub).epub" -d "/tmp/epub2txt26002"
epub2txt DEBUG Unzip finished
epub2txt DEBUG Fix permissions: chmod -R 744 "/tmp/epub2txt26002"
epub2txt DEBUG Permissions fixed
epub2txt DEBUG OPF path is: /tmp/epub2txt26002/META-INF/container.xml
epub2txt TRACE Entering epub2txt_get_root_file
epub2txt DEBUG Read OPF, size 233
epub2txt TRACE Leaving epub2txt_get_root_file
epub2txt DEBUG OPF rootfile is: content.opf
epub2txt DEBUG Content directory is: /tmp/epub2txt26002
epub2txt TRACE Entering epub2txt_get_items
epub2txt DEBUG Read OPF, size 10295
epub2txt TRACE Leaving epub2txt_get_items
epub2txt DEBUG EPUB spine has 37 items
epub2txt TRACE Entering xhtml_file_to_stdout
epub2txt DEBUG Process XHTML file /tmp/epub2txt26002/cover.xhtml
epub2txt TRACE Entering wstring_create_from_utf8_file
epub2txt TRACE Entering wstring_convert_utf8_to_utf32
epub2txt TRACE Leaving wstring_convert_utf8_to_utf32
epub2txt TRACE Leaving wstring_create_from_utf8_file
epub2txt TRACE Entering xhtml_to_stdout
epub2txt DEBUG Process XHTML string
epub2txt TRACE Entering wstring_length
epub2txt TRACE Entering wstring_length
epub2txt TRACE Entering wstring_length
epub2txt TRACE Entering wstring_length
epub2txt TRACE Entering wstring_length
epub2txt TRACE Entering wstring_length
epub2txt TRACE Entering wstring_length
epub2txt TRACE Entering wstring_length
epub2txt TRACE Entering wstring_length
epub2txt TRACE Entering wstring_length
epub2txt TRACE Entering wstring_length
epub2txt TRACE Entering wstring_length
epub2txt TRACE Entering wstring_length
epub2txt TRACE Entering wstring_length
epub2txt TRACE Entering wstring_length
epub2txt TRACE Entering wstring_length
epub2txt TRACE Entering wstring_length
epub2txt TRACE Entering wstring_length
(forever)

This is with the latest version, freshly compiled :)

Yeah -- the fact is that a single XHTML tag 4Mb long breaks the memory management in epub2txt2. I wrote the UTF32 string handler on the basis that no particular string would ever be more than a book chapter long, and probably only a line long. The implementation is pretty dumb, to be frank -- it doesn't even store the current length of the string, it just counts it from the start every time :/ What I really need to do is rewrite the string handler completely, so that memory is allocated in blocks rather than character-by-character, and so that the running length is stored and kept up to date by each method call. That's going to take some time -- not to do the work, but to test that it works on a large range of books.
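To illustrate the kind of rewrite described above, here is a minimal sketch of a UTF-32 buffer that stores its length and grows in blocks. It is not the actual epub2txt2 string handler: the `U32Buf` type, the `u32buf_*` function names, the initial capacity of 64, and the doubling growth policy are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical growable UTF-32 string; NOT the real epub2txt2 type. */
typedef struct {
    uint32_t *data;     /* code points, always 0-terminated */
    size_t    length;   /* code points stored, excluding the terminator */
    size_t    capacity; /* code points allocated, including the terminator */
} U32Buf;

#define U32BUF_INITIAL_CAPACITY 64  /* assumed starting block size */

/* Returns 0 on success, -1 if allocation fails. */
int u32buf_init(U32Buf *s)
{
    s->data = malloc(U32BUF_INITIAL_CAPACITY * sizeof *s->data);
    if (s->data == NULL)
        return -1;
    s->data[0] = 0;
    s->length = 0;
    s->capacity = U32BUF_INITIAL_CAPACITY;
    return 0;
}

/* O(1): the length is stored, not counted from the start each time. */
size_t u32buf_length(const U32Buf *s)
{
    return s->length;
}

/* Appends one code point; the allocation doubles only when it is full,
   so memory grows in blocks rather than one character at a time.
   Returns 0 on success, -1 if reallocation fails (buffer unchanged). */
int u32buf_append(U32Buf *s, uint32_t c)
{
    assert(s->data != NULL);
    if (s->length + 2 > s->capacity) {  /* room for c plus terminator */
        size_t new_capacity = s->capacity * 2;
        uint32_t *p = realloc(s->data, new_capacity * sizeof *p);
        if (p == NULL)
            return -1;
        s->data = p;
        s->capacity = new_capacity;
    }
    s->data[s->length++] = c;
    s->data[s->length] = 0;
    return 0;
}

void u32buf_free(U32Buf *s)
{
    free(s->data);
    s->data = NULL;
    s->length = 0;
    s->capacity = 0;
}
```

With a layout like this, appending n characters costs amortized O(n) in total, whereas re-counting the length on every call makes the same loop quadratic, which would plausibly explain a multi-megabyte tag appearing to hang.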

What I can do in the meantime -- and have done -- is to modify the XHTML parser so that if any specific tag is more than, say, 1000 characters long, processing of that tag is aborted and the parser skips to the close of the tag. This is a bit ugly but, to be honest, I can't think of any XHTML tag that epub2txt2 could make any sense of that would be more than about 200 characters long.
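As a rough illustration of this kind of workaround (not the actual change made to the epub2txt2 parser), a tag scanner can count characters as it goes and flag the tag once it passes a cap, while still advancing to the closing `>`. The function name `scan_tag`, the `MAX_TAG_LENGTH` constant, and the `too_long` flag are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TAG_LENGTH 1000  /* illustrative cap, per "say, 1000" above */

/* Hypothetical helper: text[start] must be '<'. Scans to the matching
   '>' and returns the index of the character after it, or len if the
   tag is never closed. Sets *too_long to 1 if more than MAX_TAG_LENGTH
   characters lie between '<' and '>', so that the caller can ignore
   the tag instead of accumulating it into a string.
   Simplification: a '>' inside a quoted attribute value ends the tag. */
size_t scan_tag(const uint32_t *text, size_t len, size_t start,
                int *too_long)
{
    assert(start < len && text[start] == '<');
    size_t i = start + 1;   /* first character after '<' */
    size_t tag_chars = 0;
    *too_long = 0;
    while (i < len && text[i] != '>') {
        if (++tag_chars > MAX_TAG_LENGTH)
            *too_long = 1;  /* over the cap: keep scanning, store nothing */
        i++;
    }
    return (i < len) ? i + 1 : len;
}
```

A caller would interpret the tag only when `too_long` is 0 and otherwise resume normal processing at the returned index, so an oversized tag costs a single linear pass.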

Your comments are welcome.

While it is perhaps a hacky fix, I think it works well. It is better to have it either fail gracefully or simply abort with an error.

I have used your code on over 200,000 books and this is the first time this has happened, so it's certainly not the usual case.

This seemed to work! 💯

If you want this open as a reminder to further fix it, that's great, otherwise you can feel free to close it as my problem has been solved.

I'll keep it on my to-do list; unless there's a strong interest from other users in improving the memory management (and I use the same code in a dozen other projects, so there might be) I'll leave the workaround in place for now.