piskvorky / smart_open

Utils for streaming large files (S3, HDFS, gzip, bz2...)

Slow performance due to lack of buffering for GzipFile.write

svisser opened this issue · comments

Problem description

When writing a .csv.gz file to S3, performance was slow because GzipFile.write() is called for every single line in the CSV file. The per-line calls are not the issue in themselves; the problem is that GzipFile.write() does no buffering (see: python/cpython#89550). I was able to improve performance by implementing a solution similar to python/cpython#101251, i.e., registering a custom compression handler for the '.gz' file extension in which writes to the GzipFile are buffered.

I'm opening this issue for the smart_open library to discuss:

  • Should the smart_open library make it easier to enable and/or configure buffering for GzipFile.write() for Python versions that don't yet have the above fix?
  • If not, should we document this, or otherwise make it clear to users of the smart_open library that performance can be improved by adding buffering themselves (see the sketch below)?
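
For reference, a user-land workaround along the lines described above could look like the following sketch, which uses smart_open's register_compressor hook to install a buffered handler for the '.gz' extension. The handler name and buffer size here are illustrative choices, not recommendations:

import gzip
import io

from smart_open import register_compressor

def _buffered_gz_handler(file_obj, mode):
    # file_obj is the underlying binary stream smart_open has already opened
    # (e.g. its S3 writer); mode is a binary mode such as 'rb' or 'wb'.
    gz = gzip.GzipFile(fileobj=file_obj, mode=mode)
    if "w" in mode or "a" in mode:
        # Coalesce many small write() calls into larger chunks before they
        # reach the unbuffered GzipFile.write().
        return io.BufferedWriter(gz, buffer_size=128 * 1024)
    return gz

# Replace the default handler for the '.gz' extension.
register_compressor(".gz", _buffered_gz_handler)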

Steps/code to reproduce the problem

Use smart_open.open in write mode for a .csv.gz file to S3:

import csv

import smart_open

column_names = ("a", "b")
my_data = [.......]
with smart_open.open("s3://.../example.csv.gz", "wb", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=column_names)
    writer.writerows(my_data)

(writerows, implemented in C, calls writerow for each row, which in turn calls GzipFile.write() for each line)

Versions

  • Python 3.8
  • smart_open 6.2.0

What we do in xopen is simply wrap the GzipFile in an io.BufferedWriter and return that to the user. It should be a one- or two-line fix for the smart_open people.
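
Concretely, that wrapping would look something like the following minimal sketch (not smart_open's actual code; io.BytesIO stands in for the real S3 writer):

import gzip
import io

sink = io.BytesIO()                          # stand-in for smart_open's S3 writer
gz = gzip.GzipFile(fileobj=sink, mode="wb")
f = io.BufferedWriter(gz)                    # hand this back instead of the bare GzipFile

f.write(b"a,b\r\n")                          # many small writes are coalesced in the buffer
f.close()                                    # flushes and closes gz; sink itself stays open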

I'd like to plug xopen here: it has a much narrower focus than smart_open, but as a result it is very fast, especially for gzip-related work. I also notice that smart_open does not use any accelerated libraries such as python-isal; those can have a great impact on performance.

@svisser Are you interested in making a PR to implement @rhpvorderman's suggestion?

Is smart_open core the right place for such buffering? IMO it should either go directly into Python (as your CPython ticket suggests), or live in user land as a custom registered extension handler (as you say).

But I can also see how this is a pretty common use case, and smart_open does focus on performance, so it's not a bad fit. I'm on the fence here. How far do we want to take this? Implement buffered / parallelized .bz2 and .xz too? Support custom random-access seeking via igzip? Then we may end up maintaining a lot of compatibility / third-party API stuff.

Not an issue if we stick to just buffered writes to .gz; I'm just thinking ahead about the core mission and scope.

I think we should do just the bare minimum to address the issue: add buffering for gzip as a special case. Once the CPython folks fix this on their side, we can remove the buffering on our side.

This fix could be really important for my similar use case: streaming large gzipped tar files from an HTTPS endpoint to an S3 bucket. The throughput I've been seeing with smart_open (6.4.0 on Python 3.10) is inadequate: around 130 seconds per ~50 MB part, which means over 8 hours to transfer an 11 GB file!

Previously we'd streamed files to temporary storage before performing a multipart upload to S3. That was very quick, given sufficient disk space, as it has a lot less overhead than the current smart_open implementation. Tackling a new, larger dataset prompted the move to chaining the streams, because the cloud service's disk space is not sufficient for the largest files.