kevinboone / epub2txt2

A simple command-line utility for Linux, for extracting text from EPUB documents.

Segfault when opening META-INF/container.xml with BOM

tholin opened this issue

I have a few epub files that cause segfaults. The common factor is that they all have a UTF-8 byte order mark (BOM) at the start of META-INF/container.xml in the archive.

The segfault is a null pointer deref here https://github.com/kevinboone/epub2txt2/blob/master/src/epub2txt.c#L345 because the call to XMLDoc_parse_buffer_DOM fails, leaving doc uninitialized.

The XMLDoc_parse_buffer_DOM call fails here https://github.com/kevinboone/epub2txt2/blob/master/src/sxmlc.c#L1621 because it detects the BOM as TEXT_OUTSIDE_NODE.
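
For illustration, the UTF-8 BOM is the three-byte sequence EF BB BF, and a buffer can be checked for it before it is handed to the parser. The following is only a minimal sketch in standard C; the function name `utf8_bom_length` is hypothetical and is not part of epub2txt or sxmlc.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of epub2txt or sxmlc).
   Returns 3 if buf starts with the UTF-8 byte order mark (EF BB BF),
   otherwise 0. */
size_t utf8_bom_length (const char *buf, size_t len)
  {
  static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };
  if (buf != NULL && len >= 3 && memcmp (buf, bom, 3) == 0)
    return 3;
  return 0;
  }
```

A caller could then pass `buf + utf8_bom_length (buf, len)` to the parser instead of `buf`, so that the parser never sees the BOM as text outside a node.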

I worked around the problem by putting a call to freadBOM() in string_create_from_utf8_file() to remove the BOM if it exists. Since it's just a hack, I didn't bother with a pull request. All my problematic epub files can be read with this hack.

diff --git a/src/string.c b/src/string.c
index 9b68c8f..ed4e7e1 100644
--- a/src/string.c
+++ b/src/string.c
@@ -25,6 +25,7 @@
 #include "string.h" 
 #include "defs.h" 
 #include "log.h" 
+#include "sxmlc.h"
 
 struct _String
   {
@@ -242,16 +243,17 @@ BOOL string_create_from_utf8_file (const char *filename,
   {
   String *self = NULL;
   BOOL ok = FALSE; 
-  int f = open (filename, O_RDONLY);
+  FILE* f = fopen (filename, "r");
   if (f > 0)
     {
     self = malloc (sizeof (String));
-    struct stat sb;
-    fstat (f, &sb);
-    int64_t size = sb.st_size;
+    fseek (f, 0L, SEEK_END);
+    int64_t size = ftell (f);
+    fseek (f, 0L, SEEK_SET);
     char *buff = malloc (size + 2);
-    read (f, buff, size);
-    self->str = buff; 
+    freadBOM (f, NULL, NULL);
+    size = fread (buff, sizeof(unsigned char), size, f);
+    self->str = buff;
     self->str[size] = 0;
     *result = self;
     ok = TRUE;
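
The same idea can be sketched without depending on sxmlc's freadBOM(). The version below is an assumption-laden illustration in standard C only, not the code that went into epub2txt; the name `read_text_skip_bom` is hypothetical. It reads the whole file, drops a leading UTF-8 BOM if there is one, and checks each call that can fail.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: reads a whole file into a NUL-terminated buffer,
   dropping a leading UTF-8 BOM (EF BB BF) if present.
   Returns NULL on failure; the caller frees the result. */
char *read_text_skip_bom (const char *filename)
  {
  FILE *f = fopen (filename, "rb");
  if (f == NULL)               /* FILE * is a pointer: test against NULL */
    return NULL;

  char *buff = NULL;
  if (fseek (f, 0L, SEEK_END) == 0)
    {
    long size = ftell (f);
    if (size >= 0 && fseek (f, 0L, SEEK_SET) == 0)
      buff = malloc ((size_t) size + 1);
    if (buff != NULL)
      {
      /* n is the number of bytes actually read */
      size_t n = fread (buff, 1, (size_t) size, f);
      if (n >= 3 && memcmp (buff, "\xEF\xBB\xBF", 3) == 0)
        {
        memmove (buff, buff + 3, n - 3);   /* shift content over the BOM */
        n -= 3;
        }
      buff[n] = 0;
      }
    }
  fclose (f);
  return buff;
  }
```

Stripping the BOM after reading, rather than before, keeps the size computed with fseek/ftell consistent with the number of bytes requested from fread.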

I'm not seeing this problem in my own tests. If anybody has an EPUB that reproduces this problem, please be kind enough to link me to it so I can follow this up.

The problematic epubs I have are all copyrighted. To work around that, I unzipped one, removed all copyrighted material in OPS/images and OPS/xhtml, and zipped it up again. The file segfaults with the latest master, but it decodes fine and displays a table of contents when using the patch from my previous comment.

book.zip

Hi @tholin, I guess book.zip is not an epub file; correct me if I am wrong. I tried running with some epub files and they work fine, but I need to replicate the segmentation fault. Which part of the epub should I change, or can you send me an epub that causes the crash when epub2txt is run?

@animesh0904071 The book.zip file becomes an epub if you rename it. Epub files are just zip files with a different extension. GitHub filters which file types you are allowed to upload, so I had to keep the .zip extension.

Hi. I feel I ought to be doing something here, but I'm not sure what. If anybody has an EPUB that can make epub2txt crash, by all means send it along, and I'll try to fix it.

If anybody has an EPUB that can make epub2txt crash, by all means send it along, and I'll try to fix it.

Can't you reproduce the segfault using the book.zip file I already uploaded? Rename it to book.epub if you want, the name doesn't really matter.

Sorry, my bad. I'm travelling at present, but I'll deal with this as soon as I get back. By all means remind me if I forget.

OK. Sorry about the delay. I fixed this by simply skipping the BOM if it is present. I changed the code mentioned by @tholin and also another place in wstring.c where files are converted to UTF-32. Although the book.zip file no longer crashes epub2txt, it still does not print correctly, because there seem to be files mentioned in package.opf that are not packaged. I don't know if that is expected or not.

I tested the fix and it works fine for the problematic files I have. Thanks for fixing it.

The book.zip file doesn't decode because I removed all copyrighted files from it, so getting an error is expected.