webrecorder / browsertrix-crawler

Run a high-fidelity browser-based crawler in a single Docker container

Home Page: https://crawler.docs.browsertrix.com

Failure uploading large files (handling slowDown)

wvengen opened this issue

During a large crawl (2GB+), I get stuck in the "Uploading WACZ" stage (using OpenStack SWIFT S3 for storage). The log shows

{"timestamp":"2024-03-04T12:07:55.250Z","logLevel":"debug","context":"general","message":"WACZ successfully generated and saved to: /crawls/collections/thecrawl/thecrawl.wacz","details":{}}
{"timestamp":"2024-03-04T12:07:55.255Z","logLevel":"info","context":"s3Upload","message":"S3 file upload information","details":{"bucket":"x","crawlId":"x","prefix":"x/"}}
{"timestamp":"2024-03-04T12:08:03.027Z","logLevel":"error","context":"general","message":"Crawl failed","details":{"type":"exception","message":"","stack":"S3Error\n    at Object.parseError (/app/node_modules/minio/dist/main/xml-parsers.js:79:11)\n    at /app/node_modules/minio/dist/main/transformers.js:165:22\n    at DestroyableTransform._flush (/app/node_modules/minio/dist/main/transformers.js:89:10)\n    at DestroyableTransform.prefinish (/app/node_modules/readable-stream/lib/_stream_transform.js:123:10)\n    at DestroyableTransform.emit (node:events:514:28)\n    at prefinish (/app/node_modules/readable-stream/lib/_stream_writable.js:569:14)\n    at finishMaybe (/app/node_modules/readable-stream/lib/_stream_writable.js:576:5)\n    at endWritable (/app/node_modules/readable-stream/lib/_stream_writable.js:594:3)\n    at Writable.end (/app/node_modules/readable-stream/lib/_stream_writable.js:535:22)\n    at IncomingMessage.onend (node:internal/streams/readable:705:10)"}}
{"timestamp":"2024-03-04T12:08:03.036Z","logLevel":"info","context":"general","message":"Exiting, Crawl status: failing","details":{}}

and the error message mentioned is

S3Error
    at Object.parseError (/app/node_modules/minio/dist/main/xml-parsers.js:79:11)
    at /app/node_modules/minio/dist/main/transformers.js:165:22
    at DestroyableTransform._flush (/app/node_modules/minio/dist/main/transformers.js:89:10)
    at DestroyableTransform.prefinish (/app/node_modules/readable-stream/lib/_stream_transform.js:123:10)
    at DestroyableTransform.emit (node:events:514:28)
    at prefinish (/app/node_modules/readable-stream/lib/_stream_writable.js:569:14)
    at finishMaybe (/app/node_modules/readable-stream/lib/_stream_writable.js:576:5)
    at endWritable (/app/node_modules/readable-stream/lib/_stream_writable.js:594:3)
    at Writable.end (/app/node_modules/readable-stream/lib/_stream_writable.js:535:22)
    at IncomingMessage.onend (node:internal/streams/readable:705:10)

When trying to reproduce this, it appears that uploading large files triggers a slowDown response from the S3 server, which the MinIO client does not seem to handle automatically. For example:

# shell: create a 2 GiB test file of zeros
dd if=/dev/zero of=/tmp/foo bs=1M count=2k

// JavaScript (Node.js REPL)
var Minio = require('minio')
// a 100 MiB partSize forces a multipart upload for the 2 GiB file
var s3Client = new Minio.Client({ endPoint: 's3.example.com', accessKey: 'xx', secretKey: 'xx', partSize: 100 * 1024 * 1024 })
await s3Client.fPutObject('x', 'foo', '/tmp/foo')

eventually gives the error

Uncaught S3Error
    at Object.parseError (/app/node_modules/minio/dist/main/xml-parsers.js:79:11)
    at /app/node_modules/minio/dist/main/transformers.js:165:22 {
  code: 'SlowDown',
  bucketname: 'x',
  requestid: 'x',
  hostid: 'x',
  amzRequestid: null,
  amzId2: null,
  amzBucketRegion: null
}

Amazon mentions that 503 Slow Down responses can occur; see also its best practices, which recommend reducing the request rate.

Do we need support for handling slowDown responses from the S3 endpoint?

I did not find anything about minio-js handling slowDown responses, so I don't think it is supported. Either this needs to be handled here, or perhaps the AWS S3 client supports it. In any case, the request would need to be retried after some timeout (probably with an increasing delay factor in case the server is not yet ready to proceed).
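
For illustration, a minimal sketch of such retry handling around the MinIO client (the function name, the error codes checked, and the backoff values are assumptions, not existing browsertrix-crawler code):

// Hypothetical sketch: retry an upload with exponential backoff when the
// server throttles us. Error codes and delay values are illustrative.
async function uploadWithBackoff(s3Client, bucket, key, filePath, maxRetries = 5) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await s3Client.fPutObject(bucket, key, filePath)
    } catch (err) {
      // 'SlowDown' matches the error code seen in the logs above;
      // 'TooManyRequests' (429) handling is an assumption
      const throttled = err.code === 'SlowDown' || err.code === 'TooManyRequests'
      if (!throttled || attempt >= maxRetries) throw err
      const delayMs = 1000 * 2 ** attempt // 1s, 2s, 4s, ...
      await new Promise((resolve) => setTimeout(resolve, delayMs))
    }
  }
}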

@wvengen From minio/minio#11147, it seems one way of approaching this would be to raise the Minio server's MINIO_API_REQUESTS_DEADLINE to a higher value.

In Browsertrix Cloud, we should be able to set this env var to a higher value if needed in chart/templates/minio.yaml (sketched below).

Otherwise, would need to set that however Minio is being deployed.
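
For illustration, a hypothetical excerpt of what that could look like in chart/templates/minio.yaml (the value shown is an arbitrary example; the deadline would need to be long enough for the largest expected upload):

env:
  - name: MINIO_API_REQUESTS_DEADLINE
    value: "2m"  # illustrative value, not a recommendation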

Thanks for your response!
Yes, if I were running my own Minio server, that would be true. But this is an S3 service from a cloud provider (OpenStack SWIFT) that I have no control over, and there are probably reasons why it is configured this way (e.g. to avoid overloading the server, or to wait until resources for the bucket are scaled up, as can happen with AWS S3).

Hm, good point. I don't think we've tested with OpenStack SWIFT, so we haven't seen this issue, but you're right that some general exception handling that backs off on a 503 Slow Down (and perhaps 429 Too Many Requests) response might not be a bad idea.

We can also see if we're able to enable debug logging via the minio-js client. I'm marking this issue for investigation in the coming sprint and will report back.

Another thing to keep in mind: in the past, when working with other applications, SWIFT has proved problematic for files larger than 5 GB, as SWIFT expects large files to be segmented in a particular way. Not sure if that might be an issue with the crawler/minio-js client/SWIFT S3 endpoint as well.

For context: https://docs.openstack.org/swift/latest/overview_large_objects.html

Thank you, I didn't know about SWIFT's large object support. (The files I had issues with were under 5 GB, but I might run into this limit later.) It looks like SWIFT's S3 layer converts multipart uploads into large object segments, so large objects should be supported when using S3. I also see references to multipart delete in the source code, so I suppose that would be supported as well.
All in all, handling Slow Down responses might be enough here.

Experimenting with using the AWS S3 SDK instead of the Minio client in this forked branch.
Update: I am able to upload 2 GB files with the client from the AWS S3 SDK, so it's slightly better, but now I get EPIPE on 4 GB files, so it doesn't solve the problem entirely. Note that the AWS S3 SDK uses smithy's retry strategy.
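
For reference, a rough sketch of what a multipart upload via the AWS SDK for JavaScript v3 could look like (endpoint, credentials, and part size are placeholders; this is not the exact code in the branch):

// JavaScript (AWS SDK v3) — illustrative sketch only
const fs = require('fs')
const { S3Client } = require('@aws-sdk/client-s3')
const { Upload } = require('@aws-sdk/lib-storage')

const client = new S3Client({
  endpoint: 'https://s3.example.com',
  region: 'us-east-1',
  credentials: { accessKeyId: 'xx', secretAccessKey: 'xx' },
  forcePathStyle: true, // many non-AWS S3 endpoints need path-style URLs
  maxAttempts: 5, // smithy's retry strategy retries throttled/transient errors
})

// Upload from @aws-sdk/lib-storage performs multipart uploads for large bodies
const upload = new Upload({
  client,
  params: { Bucket: 'x', Key: 'foo', Body: fs.createReadStream('/tmp/foo') },
  partSize: 100 * 1024 * 1024,
})
await upload.done()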

Thanks for looking into this! Yes, happy to switch to the AWS S3 client instead of Minio if that works better, but I think we're generally limited to using an existing S3 client for this. I suppose you could always limit uploads to smaller file sizes, but that may be less than ideal.

Thanks, @ikreymer. I'm investigating this more with our storage provider. In any case, I already see that the AWS S3 SDK handles Slow Down, whereas I did not see the MinIO client doing that. Also, the AWS S3 SDK first asks the server to confirm that it is ready to accept the data before sending it (by means of HTTP 100 Continue). So I think switching to the AWS S3 SDK has several benefits.

Would you like me to prepare a pull request? (There are some things to clean up.)

Thanks, would definitely appreciate it! There was also a request for region support in #515, and it looks like you were addressing that as well.