Kludex / python-multipart

A streaming multipart parser for Python.

Home Page: https://kludex.github.io/python-multipart/

IndexError in multipart.QueryStringParser._internal_write

brimcfadden opened this issue · comments

Hi,

This looks like a great project, and I'm trying to use it with a Tornado server to handle files uploaded as multipart/form-data. It seems like a perfect fit. However, I've run into a show-stopping crash that I've spent some time trying to solve, hoping the cause would be obvious, to no avail.

Tornado fairly recently introduced a new feature: a decorator for its RequestHandler class called @stream_request_body. If you're not familiar with Tornado, just know that the decorator streams received body chunks to a method where you can write the code to handle them. In my case, I immediately write each chunk to a MultipartParser object. However, if there is more than one chunk (which seems to happen for request bodies larger than 64 KiB), there is a "random chance" that the following error occurs:

Traceback (most recent call last):
  [ ... Tornado & my stuff ...]

  File "...src/multipart/multipart/multipart.py", line 1055, in write
    l = self._internal_write(data, data_len)
  File "...src/multipart/multipart/multipart.py", line 1314, in _internal_write
    c = data[i]
IndexError: string index out of range

In the above traceback, i always seems to be equal to len(data), so it is an off-by-one error that occurs sometimes. Here is a link to the line where this occurs. If I catch this exception, the same error may occur an arbitrary number of additional times, and the final file will have lost data (I haven't examined exactly how much).

Remember that I said there is a "random chance" of it occurring. Sometimes, it doesn't occur at all, for the same file that has been seen to fail another time.

I also occasionally get logger WARNING messages that look like this, though this is rarer:

Did not find boundary character 'i' at index 2

I get a message like the above for each character in the boundary, and always at index 2. I think this is just a different manifestation of the same error.

I have taken a top-down approach to debugging this, but I haven't found anything yet. When I write the chunks of data directly to a file (instead of to the parser object), the end result is consistent and correct: the only differences between uploads are at the head and tail of the file, where the boundaries are, and no data is missing from the original file. This rules out a Tornado bug. I'm also fairly sure I'm not using the parser incorrectly, and I can furnish example code.

At this point, I'm fairly sure the error is in the _internal_write function, specifically in the Boyer-Moore-Horspool implementation, and that the bug only occurs for certain sequences of chunk sizes, which differ between requests. In my tests, I've been using a ~6 MiB text file, and I've confirmed that the data is chunked differently each time: the first few chunks are different sizes, then the sizes even out, and the last chunk is probabilistically smaller. This chunking happens when uploading locally to the Tornado server with either Firefox or cURL.

I plan to dig into the algorithm code, but I figured I would get your input first, in the hope that you can think of or implement a solution more quickly than I can. As I said, I can also provide a working example once I clean up the code a little. My next step is to emulate different-sized chunking from a file, just to confirm my suspicion (a sketch of what I mean is below). Since your consistently chunked example code never hits this issue with any file I try, I'm leaning strongly toward chunk consistency being a factor here.
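Concretely, the kind of offline harness I have in mind looks roughly like this (just a sketch; raw_body would be a captured, unparsed request body and boundary the value pulled from the Content-Type header, so both names here are placeholders):

import random

import multipart


def replay_with_random_chunks(raw_body, boundary, max_chunk=64 * 1024):
    """Feed a captured multipart body to the parser in randomly sized chunks."""
    received = []
    callbacks = {
        'on_part_data': lambda data, start, end: received.append(data[start:end]),
    }
    parser = multipart.MultipartParser(boundary, callbacks)

    i = 0
    while i < len(raw_body):
        size = random.randint(1, max_chunk)  # emulate uneven chunk sizes
        parser.write(raw_body[i:i + size])
        i += size

    # The concatenated part data should match the original file exactly.
    return b''.join(received)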

Here is an example Tornado server:

"""Demonstrate a streaming multipart parser usage in a Tornado server.
Currently buggy."""

import hashlib
import os
import uuid

# import ipdb
import multipart
from multipart.multipart import parse_options_header
import tornado.ioloop
import tornado.log
import tornado.httputil
import tornado.web

__author__ = "Brian McFadden <brimcfadden@gmail.com>"


MiB = 2 ** 20
GiB = 2 ** 30

upload_form_html = """<h1>Upload Form Test</h1>
<form name="upload_form" action="upload" enctype="multipart/form-data" method="post">
    <input name="test_file" type="file" /><br />
    <input type="submit" value="Submit" />
</form>"""

logger = tornado.log.app_log


class MainHandler(tornado.web.RequestHandler):
    def get(self):
        self.redirect('upload', permanent=False)


@tornado.web.stream_request_body
class UploadHandler(tornado.web.RequestHandler):
    def prepare(self):
        """Called upon every request."""

        self.request_id = str(uuid.uuid4())[:7]

        if self.request.method == 'POST':
            self.prepare_post()

    def prepare_post(self):
        """Only called for POST requests, which require special processing.
        Not a Tornado feature--this is called from self.prepare()."""

        try:
            content_type_header = self.request.headers['content-type']
            content_type, options = parse_options_header(content_type_header)
            boundary = options['boundary']
        except KeyError:
            # Bad request for this handler
            raise tornado.web.HTTPError(415)

        self.file_hasher = None
        self.file_size = None
        self._fo = None

        mpp_callbacks = {
            'on_part_begin': self._on_part_begin,
            'on_part_data': self._on_part_data,
            'on_part_end': self._on_part_end
        }
        self.mp_parser = multipart.MultipartParser(boundary, mpp_callbacks)
        self.save_as = "./uploads/{}".format(self.request_id)

        # DEBUGGING BOOKKEEPING
        self.num_chunks = 0
        self.chunk_lengths = {}
        # self.unparsed_fo = open("uploads/{}_unparsed".format(self.request_id),
        #                         'wb')
        self.had_error = False

    def get(self):
        """Serve the upload HTML form on GET requests."""

        self.write(upload_form_html)

    def post(self):
        """Handle the uploaded file after it has been processed."""

        logger.debug("Headers:\n%s", self.request.headers)
        logger.debug("Arguments:\n%s", self.request.arguments)
        self.write("File size: {} bytes ({:.3f} MiB)"
                   .format(self.file_size, float(self.file_size) / MiB))
        self.write("<br />SHA-256 digest: {}".format(self.file_hash))
        self.write('<br /><a href="upload">Upload another</a>')

    def data_received(self, data):
        """Called by Tornado when it receives a chunk of data on a POST.
        Tornado only does this if the RequestHandler is decorated with
        tornado.web.stream_request_body. However, since it streams the body
        directly to this function, the multipart/form-data-encoded file needs
        to be parsed. This function simply writes directly to the parser, which
        has its own callback system in place."""

        self.num_chunks += 1

        # FOR DEBUGGING:
        # This writes each chunk to disk, in a consolidated directory named
        # after the established request ID. The parsed file is a sibling to
        # the directory, not a child of it.
        try:
            # with ipdb.launch_ipdb_on_exception():
            self.mp_parser.write(data)
        except Exception:
            # Swallow the error, log it, and continue with the parsing, for the
            # sake of debugging. This means the server will still return status
            # 200 OK and the results HTML page.
            word = 'error'
            logger.error("Error parsing chunk %d (len: %d)",
                         self.num_chunks, len(data))
            self.had_error = True
        else:
            word = 'success'

        chunk_dir = "uploads/{}_chunks".format(self.request_id)
        try:
            os.mkdir(chunk_dir)
        except OSError:
            pass
        chunk_file_name = "{}/chunk_{:0>3}.{}".format(chunk_dir,
                                                      self.num_chunks, word)
        with open(chunk_file_name, 'wb') as fo:
            fo.write(data)

        try:
            self.chunk_lengths[len(data)].append(self.num_chunks)
        except KeyError:
            self.chunk_lengths[len(data)] = [self.num_chunks]

        # self.unparsed_fo.write(data)  # Compare unparsed files

    def _on_part_begin(self):
        """Called by the multipart parser when a part is found.
        This is used to establish file bookkeeping constructs."""

        self._fo = open(self.save_as, 'wb')  # binary mode: part data is raw bytes
        self.file_size = 0
        self.file_hasher = hashlib.sha256()

    def _on_part_data(self, data, start, end):
        """Called by the multipart parser on each chunk written to it.
        This is used to write the parsed data to disk and update the bookkeeping
        constructs."""

        data_bytes = data[start:end]
        self.file_size += abs(end - start)
        self.file_hasher.update(data_bytes)
        self._fo.write(data_bytes)

    def _on_part_end(self):
        """Called by the multipart parser after it finds the end of the data.
        The end is detected using the boundary provided at object creation.
        This function is used to clean up the operation before the typical
        RequestHandler.post function is called."""

        self._fo.close()
        self.file_hash = self.file_hasher.hexdigest()
        logger.info("Upload saved to %s: %d bytes (%.3f MiB) over %d chunks",
                    self.save_as, self.file_size, float(self.file_size) / MiB,
                    self.num_chunks)
        logger.info("Upload saved to %s: SHA-256 digest: %s",
                    self.save_as, self.file_hash)
        if self.had_error:
            logger.error("Chunk dict with chunk lengths, for errors:\n%s",
                         self.chunk_lengths)


application = tornado.web.Application([
    (r"/", MainHandler),
    (r"/upload", UploadHandler)
])


def main():
    print('\n** Press Ctrl+C to exit. **')
    tornado.log.enable_pretty_logging()
    logger.setLevel(tornado.log.logging.INFO)
    port = 8888
    application.listen(port, max_buffer_size=(2 * GiB))
    logger.info("Now listening on %s", port)
    try:
        tornado.ioloop.IOLoop.instance().start()
    except KeyboardInterrupt:
        print('\r** Goodbye. **')

if __name__ == "__main__":
    main()

I recommend testing with a 1 MiB file or so:

$ dd count=1024 bs=1024 if=/dev/urandom of=1M
[ ... ]
$ curl -ik -F file=@1M http://localhost:8888/upload

I save the chunks to disk in the example code, so be wary of automating this if you're low on disk space. 1 MiB usually turns into about 20 chunks.

You may need to hurl a 1 MiB file at the server a few times before you see the error logs appear. Example output (blank lines added to separate requests):

[I 140924 18:12:52 main:165] Upload saved to ./uploads/5c91888: 1048576 bytes (1.000 MiB) over 19 chunks
[I 140924 18:12:52 main:167] Upload saved to ./uploads/5c91888: SHA-256 digest: 26f773601ea256d71aee82c6c9da924a9a357882830eac80f269c99b1bb1734d
[W 140924 18:12:52 multipart:1418] Consuming a byte in the end state
[W 140924 18:12:52 multipart:1418] Consuming a byte in the end state
[I 140924 18:12:52 web:1811] 200 POST /upload (127.0.0.1) 15.57ms

[I 140924 18:12:53 main:165] Upload saved to ./uploads/0baa62d: 1048576 bytes (1.000 MiB) over 19 chunks
[I 140924 18:12:53 main:167] Upload saved to ./uploads/0baa62d: SHA-256 digest: 26f773601ea256d71aee82c6c9da924a9a357882830eac80f269c99b1bb1734d
[W 140924 18:12:53 multipart:1418] Consuming a byte in the end state
[W 140924 18:12:53 multipart:1418] Consuming a byte in the end state
[I 140924 18:12:53 web:1811] 200 POST /upload (127.0.0.1) 18.36ms

[E 140924 18:12:54 main:115] Error parsing chunk 4 (len: 65536)
[I 140924 18:12:54 main:165] Upload saved to ./uploads/d2abc40: 983040 bytes (0.938 MiB) over 18 chunks
[I 140924 18:12:54 main:167] Upload saved to ./uploads/d2abc40: SHA-256 digest: 71f72e3197d3428aaa66de773024163544dd1d3fac0f3745defd658e1b594f2c
[E 140924 18:12:54 main:170] Chunk dict with chunk lengths, for errors:
    {49152: [2], 32768: [3], 146: [1], 65536: [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]}
[W 140924 18:12:54 multipart:1418] Consuming a byte in the end state
[W 140924 18:12:54 multipart:1418] Consuming a byte in the end state

I don't actually know why "Consuming a byte in the end state" is logged twice, but it's the least of my worries.

I recently discovered something new: the issue is also present with consistent chunk sizes if the chunks are smaller than in your example. When I change your example code to to_read = min(size, 64 * 1024) (64 KiB instead of 1 MiB), I start getting the same IndexError intermittently. I haven't worked out why this causes the crash.
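For reference, the shape of the read loop I mean is roughly this (a sketch, not your actual example code; the file path, boundary, and callbacks are placeholders):

import multipart


def feed_in_fixed_chunks(path, boundary, callbacks):
    """Replay a captured multipart body in fixed 64 KiB chunks."""
    parser = multipart.MultipartParser(boundary, callbacks)
    with open(path, 'rb') as f:
        while True:
            to_read = 64 * 1024  # 64 KiB instead of 1 MiB
            chunk = f.read(to_read)
            if not chunk:
                break
            parser.write(chunk)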

I may be able to work around the problem in the meantime by buffering in my code (roughly as sketched below), but I think it's still worth fixing in the parser. I'm worried that an odd combination of parsed bytes and chunk size, even a large chunk size, could still cause a crash.
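The workaround I have in mind is something like this (just a sketch, and I haven't verified that it actually avoids the bug; it simply re-chunks the incoming data into a fixed block size before it reaches the parser):

class RechunkingWriter(object):
    """Buffer incoming chunks and pass them to the parser in fixed-size blocks."""

    def __init__(self, parser, block_size=1024 * 1024):
        self.parser = parser
        self.block_size = block_size
        self.buffer = b''

    def write(self, data):
        self.buffer += data
        while len(self.buffer) >= self.block_size:
            self.parser.write(self.buffer[:self.block_size])
            self.buffer = self.buffer[self.block_size:]

    def flush(self):
        # Call this once the whole body has arrived.
        if self.buffer:
            self.parser.write(self.buffer)
            self.buffer = b''

In data_received I would write to this wrapper instead of self.mp_parser, and call flush() from post() once the body is complete.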

@brimcfadden that's because of line 1306 (the offending line is quoted below):

while i < data_length - 1 and data[i] not in boundary_chars:

In the case where i == len(data) - 1 when the loop exits, we jump forward to i = len(data) + boundary_length - 1; at line 1313 we then jump back to i = len(data) + boundary_length - 1 - boundary_end, and since boundary_end == boundary_length - 1, that is exactly len(data), one past the end of the data.
A test for it is simple:

for length in range(data_size):
    post_multipart(data[:length])
    assert_result_is_correct
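If I read the suggested test correctly, a concrete version (my interpretation, with placeholder data, boundary, and expected payload) could split a known-good body at every offset into two writes and check that the reassembled part is intact:

import multipart


def parse_in_two_writes(raw_body, boundary, split_at):
    """Parse raw_body with a single chunk split at the given offset."""
    received = []
    callbacks = {
        'on_part_data': lambda data, start, end: received.append(data[start:end]),
    }
    parser = multipart.MultipartParser(boundary, callbacks)
    parser.write(raw_body[:split_at])
    parser.write(raw_body[split_at:])
    return b''.join(received)


def test_every_split_point(raw_body, boundary, expected_payload):
    # Exercise every possible chunk boundary inside the body.
    for split_at in range(len(raw_body) + 1):
        assert parse_in_two_writes(raw_body, boundary, split_at) == expected_payload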