Segfault when opening META-INF/container.xml with BOM

Question

tholin opened this issue 3 years ago

I have a few epub files that cause segfaults. The common factor is that they all have a UTF-8 byte order mark (BOM) at the start of META-INF/container.xml in the archive.

The segfault is a null pointer dereference here https://github.com/kevinboone/epub2txt2/blob/master/src/epub2txt.c#L345 because the call to XMLDoc_parse_buffer_DOM fails, leaving doc uninitialized.

The XMLDoc_parse_buffer_DOM call fails here https://github.com/kevinboone/epub2txt2/blob/master/src/sxmlc.c#L1621 because it detects the BOM as TEXT_OUTSIDE_NODE.

I worked around the problem by putting a call to freadBOM() in string_create_from_utf8_file() to remove the BOM if it exists. Since it's just a hack, I didn't bother with a pull request. All my problematic epub files can be read with this hack.

diff --git a/src/string.c b/src/string.c
index 9b68c8f..ed4e7e1 100644
--- a/src/string.c
+++ b/src/string.c
@@ -25,6 +25,7 @@
 #include "string.h" 
 #include "defs.h" 
 #include "log.h" 
+#include "sxmlc.h"
 
 struct _String
   {
@@ -242,16 +243,17 @@ BOOL string_create_from_utf8_file (const char *filename,
   {
   String *self = NULL;
   BOOL ok = FALSE; 
-  int f = open (filename, O_RDONLY);
+  FILE* f = fopen (filename, "r");
   if (f > 0)
     {
     self = malloc (sizeof (String));
-    struct stat sb;
-    fstat (f, &sb);
-    int64_t size = sb.st_size;
+    fseek (f, 0L, SEEK_END);
+    int64_t size = ftell (f);
+    fseek (f, 0L, SEEK_SET);
     char *buff = malloc (size + 2);
-    read (f, buff, size);
-    self->str = buff; 
+    freadBOM (f, NULL, NULL);
+    size = fread (buff, sizeof(unsigned char), size, f);
+    self->str = buff;
     self->str[size] = 0;
     *result = self;
     ok = TRUE;
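
For readers who want the same idea without depending on sxmlc's freadBOM(), here is a minimal standalone sketch. The function name read_file_skip_utf8_bom is hypothetical and is not part of epub2txt2 or sxmlc. It reads the whole file, then drops the three bytes EF BB BF if they lead the buffer. It also checks the fopen() result against NULL, since a FILE* should not be compared to 0 with `>`.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not from epub2txt2 or sxmlc): read a whole file into
   a NUL-terminated buffer, dropping a leading UTF-8 BOM (EF BB BF) if one
   is present. Returns NULL on failure; the caller frees the result. */
char *read_file_skip_utf8_bom (const char *filename)
  {
  FILE *f = fopen (filename, "rb");
  if (f == NULL) return NULL;

  /* Find the file size so the buffer can be allocated in one go. */
  if (fseek (f, 0L, SEEK_END) != 0) { fclose (f); return NULL; }
  long size = ftell (f);
  if (size < 0) { fclose (f); return NULL; }
  rewind (f);

  char *buff = malloc ((size_t) size + 1);
  if (buff == NULL) { fclose (f); return NULL; }

  size_t n = fread (buff, 1, (size_t) size, f);
  fclose (f);

  /* Strip the BOM in place so the XML parser sees '<' first. */
  if (n >= 3 && memcmp (buff, "\xEF\xBB\xBF", 3) == 0)
    {
    memmove (buff, buff + 3, n - 3);
    n -= 3;
    }

  buff[n] = 0;
  return buff;
  }
```

Stripping after the read keeps the size arithmetic simple: the terminator always goes at the number of bytes actually kept, regardless of whether a BOM was found.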

Kevin Boone · Answer 1 · Mon Jan 24 2022 19:42:54 GMT+0800 (China Standard Time)

I'm not seeing this problem in my own tests. If anybody has an EPUB that reproduces this problem, please be kind enough to link me to it so I can follow this up.

tholin · Answer 2 · Wed Jan 26 2022 18:30:22 GMT+0800 (China Standard Time)

The problematic epubs I have are all copyrighted. To work around that, I unzipped one, removed all copyrighted material in OPS/images and OPS/xhtml, and zipped it up again. The file segfaults when using the latest master, but it decodes fine and displays a table of contents when using the patch from my previous comment.

book.zip

Animesh Kar · Answer 3 · Sun Apr 03 2022 11:42:23 GMT+0800 (China Standard Time)

Hi @tholin, I guess book.zip is not an epub file; correct me if I am wrong. I tried running with some epub files and they work fine, but I need to reproduce the segmentation fault. Which part of the epub should I change, or can you send me an epub that causes the crash when epub2txt is run?

tholin · Answer 4 · Tue Apr 05 2022 16:36:58 GMT+0800 (China Standard Time)

@animesh0904071 The book.zip file becomes an epub if you rename it. Epub files are just zip files with a different extension. GitHub filters which file types you are allowed to upload, so I had to keep the zip extension.

Kevin Boone · Answer 5 · Tue Apr 05 2022 19:24:57 GMT+0800 (China Standard Time)

Hi. I feel I ought to be doing something here, but I'm not sure what. If anybody has an EPUB that can make epub2txt crash, by all means send it along, and I'll try to fix it.

tholin · Answer 6 · Wed Apr 06 2022 01:07:16 GMT+0800 (China Standard Time)

"If anybody has an EPUB that can make epub2txt crash, by all means send it along, and I'll try to fix it."

Can't you reproduce the segfault using the book.zip file I already uploaded? Rename it to book.epub if you want, the name doesn't really matter.

Kevin Boone · Answer 7 · Wed Apr 06 2022 01:37:25 GMT+0800 (China Standard Time)

Sorry, my bad. I'm travelling at present, but I'll deal with this as soon as I get back. By all means remind me if I forget.

Kevin Boone · Answer 8 · Wed Apr 13 2022 16:55:38 GMT+0800 (China Standard Time)

OK. Sorry about the delay. I fixed this by simply skipping the BOM if it is present. I changed the code mentioned by @tholin and also another place in wstring.c where files are converted to UTF-32. Although the book.zip file no longer crashes epub2txt, it still does not print correctly, because there seem to be files mentioned in package.opf that are not packaged. I don't know if that is expected or not.
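
As a rough illustration of what skipping the BOM can look like on the UTF-32 side: once text has been decoded to UTF-32, a BOM shows up as a single leading code point U+FEFF. The helper below is a hypothetical sketch under that assumption. It is not the code committed to wstring.c, and the name skip_utf32_bom is invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not the actual wstring.c change): given a
   zero-terminated UTF-32 string, return a pointer past a leading
   U+FEFF byte order mark, or the original pointer if there is none. */
const uint32_t *skip_utf32_bom (const uint32_t *s)
  {
  if (s != NULL && s[0] == 0xFEFF)
    return s + 1;
  return s;
  }
```

Returning a pointer into the same buffer avoids a copy, but the caller must still free the original pointer, not the adjusted one.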

tholin · Answer 9 · Sat Apr 16 2022 05:20:44 GMT+0800 (China Standard Time)

I tested the fix and it works fine for the problematic files I have. Thanks for fixing it.

The book.zip file doesn't decode because I removed all copyrighted files from it so getting an error is expected.