webrecorder / browsertrix-crawler

Run a high-fidelity browser-based crawler in a single Docker container

Home Page: https://crawler.docs.browsertrix.com

Failure uploading large files (handling slowDown)

wvengen opened this issue

During a large crawl (2GB+), I get stuck in the "Uploading WACZ" stage (using OpenStack SWIFT S3 for storage). The log shows

{"timestamp":"2024-03-04T12:07:55.250Z","logLevel":"debug","context":"general","message":"WACZ successfully generated and saved to: /crawls/collections/thecrawl/thecrawl.wacz","details":{}}
{"timestamp":"2024-03-04T12:07:55.255Z","logLevel":"info","context":"s3Upload","message":"S3 file upload information","details":{"bucket":"x","crawlId":"x","prefix":"x/"}}
{"timestamp":"2024-03-04T12:08:03.027Z","logLevel":"error","context":"general","message":"Crawl failed","details":{"type":"exception","message":"","stack":"S3Error\n    at Object.parseError (/app/node_modules/minio/dist/main/xml-parsers.js:79:11)\n    at /app/node_modules/minio/dist/main/transformers.js:165:22\n    at DestroyableTransform._flush (/app/node_modules/minio/dist/main/transformers.js:89:10)\n    at DestroyableTransform.prefinish (/app/node_modules/readable-stream/lib/_stream_transform.js:123:10)\n    at DestroyableTransform.emit (node:events:514:28)\n    at prefinish (/app/node_modules/readable-stream/lib/_stream_writable.js:569:14)\n    at finishMaybe (/app/node_modules/readable-stream/lib/_stream_writable.js:576:5)\n    at endWritable (/app/node_modules/readable-stream/lib/_stream_writable.js:594:3)\n    at Writable.end (/app/node_modules/readable-stream/lib/_stream_writable.js:535:22)\n    at IncomingMessage.onend (node:internal/streams/readable:705:10)"}}
{"timestamp":"2024-03-04T12:08:03.036Z","logLevel":"info","context":"general","message":"Exiting, Crawl status: failing","details":{}}

and the error message mentioned is

S3Error
    at Object.parseError (/app/node_modules/minio/dist/main/xml-parsers.js:79:11)
    at /app/node_modules/minio/dist/main/transformers.js:165:22
    at DestroyableTransform._flush (/app/node_modules/minio/dist/main/transformers.js:89:10)
    at DestroyableTransform.prefinish (/app/node_modules/readable-stream/lib/_stream_transform.js:123:10)
    at DestroyableTransform.emit (node:events:514:28)
    at prefinish (/app/node_modules/readable-stream/lib/_stream_writable.js:569:14)
    at finishMaybe (/app/node_modules/readable-stream/lib/_stream_writable.js:576:5)
    at endWritable (/app/node_modules/readable-stream/lib/_stream_writable.js:594:3)
    at Writable.end (/app/node_modules/readable-stream/lib/_stream_writable.js:535:22)
    at IncomingMessage.onend (node:internal/streams/readable:705:10)

When trying to reproduce this, it appears that uploading large files triggers a slowDown response from the S3 server, which the MinIO client does not seem to handle automatically. For example:

# shell: create a 2 GiB test file of zeros
dd if=/dev/zero of=/tmp/foo bs=1M count=2k

// JavaScript (Node.js REPL)
var Minio = require('minio')
// a 100 MiB partSize forces a multipart upload for the 2 GiB file
var s3Client = new Minio.Client({ endPoint: 's3.example.com', accessKey: 'xx', secretKey: 'xx', partSize: 100 * 1024 * 1024 })
await s3Client.fPutObject('x', 'foo', '/tmp/foo')

eventually gives the error

Uncaught S3Error
    at Object.parseError (/app/node_modules/minio/dist/main/xml-parsers.js:79:11)
    at /app/node_modules/minio/dist/main/transformers.js:165:22 {
  code: 'SlowDown',
  bucketname: 'x',
  requestid: 'x',
  hostid: 'x',
  amzRequestid: null,
  amzId2: null,
  amzBucketRegion: null
}

Amazon mentions that 503 Slow Down responses can occur; see also its best practices, which recommend reducing the request rate.

Do we need support for handling slowDown responses from the S3 endpoint?

I did not find anything about minio-js handling slowDown responses, so I don't think it is supported. Either this needs to be handled here, or perhaps the AWS S3 client supports it. In any case, the request would need to be retried after some timeout (probably with an increasing delay factor in case the server is not yet ready to proceed).
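
For illustration, a minimal sketch of such retry handling around the MinIO client (the function name, the error codes checked, and the backoff values are assumptions, not existing browsertrix-crawler code):

// Hypothetical sketch: retry an upload with exponential backoff when the
// server throttles us. Error codes and delay values are illustrative.
async function uploadWithBackoff(s3Client, bucket, key, filePath, maxRetries = 5) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await s3Client.fPutObject(bucket, key, filePath)
    } catch (err) {
      // 'SlowDown' matches the error code seen in the logs above;
      // 'TooManyRequests' (429) handling is an assumption
      const throttled = err.code === 'SlowDown' || err.code === 'TooManyRequests'
      if (!throttled || attempt >= maxRetries) throw err
      const delayMs = 1000 * 2 ** attempt // 1s, 2s, 4s, ...
      await new Promise((resolve) => setTimeout(resolve, delayMs))
    }
  }
}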

@wvengen From minio/minio#11147, it seems one way of approaching this would be to raise the Minio server's MINIO_API_REQUESTS_DEADLINE to a higher value.

In Browsertrix Cloud, we should be able to set this env var to a higher value if needed in chart/templates/minio.yaml (sketched below).

Otherwise, would need to set that however Minio is being deployed.
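
For illustration, a hypothetical excerpt of what that could look like in chart/templates/minio.yaml (the value shown is an arbitrary example; the deadline would need to be long enough for the largest expected upload):

env:
  - name: MINIO_API_REQUESTS_DEADLINE
    value: "2m"  # illustrative value, not a recommendation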

Thanks for your response!
Yes, if I were running my own Minio server, that would be true. But this is an S3 service from a cloud provider (OpenStack SWIFT) that I have no control over, and there are probably reasons why it is configured this way (e.g. to avoid overloading the server, or to wait until resources for the bucket are scaled up, as can happen with AWS S3).

Hm, good point. I don't think we've tested with OpenStack SWIFT, so we haven't seen this issue, but you're right that some general exception handling that backs off on a 503 Slow Down (and perhaps 429 Too Many Requests) response might not be a bad idea.

We can also see if we're able to enable debug logging via the minio-js client. I'm marking this issue for investigation in the coming sprint and will report back.

Another thing to keep in mind: in the past, when working with other applications, SWIFT has proved problematic for files larger than 5 GB, as SWIFT expects large files to be segmented in a particular way. Not sure if that might be an issue with the crawler/minio-js client/SWIFT S3 endpoint as well.

For context: https://docs.openstack.org/swift/latest/overview_large_objects.html

Thank you, I didn't know about SWIFT's large object support. (The files I had issues with were under 5 GB, but I might run into this limit later.) It looks like SWIFT's S3 layer converts multipart uploads into large object segments, so large objects should be supported when using S3. I also see references to multipart delete in the source code, so I suppose that would be supported as well.
All in all, handling Slow Down responses might be enough here.

Experimenting with using the AWS S3 SDK instead of the Minio client in this forked branch.
Update: I am able to upload 2 GB files with the client from the AWS S3 SDK, so it's slightly better, but now I get EPIPE on 4 GB files, so it doesn't solve the problem entirely. Note that the AWS S3 SDK uses smithy's retry strategy.
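
For reference, a rough sketch of what a multipart upload via the AWS SDK for JavaScript v3 could look like (endpoint, credentials, and part size are placeholders; this is not the exact code in the branch):

// JavaScript (AWS SDK v3) — illustrative sketch only
const fs = require('fs')
const { S3Client } = require('@aws-sdk/client-s3')
const { Upload } = require('@aws-sdk/lib-storage')

const client = new S3Client({
  endpoint: 'https://s3.example.com',
  region: 'us-east-1',
  credentials: { accessKeyId: 'xx', secretAccessKey: 'xx' },
  forcePathStyle: true, // many non-AWS S3 endpoints need path-style URLs
  maxAttempts: 5, // smithy's retry strategy retries throttled/transient errors
})

// Upload from @aws-sdk/lib-storage performs multipart uploads for large bodies
const upload = new Upload({
  client,
  params: { Bucket: 'x', Key: 'foo', Body: fs.createReadStream('/tmp/foo') },
  partSize: 100 * 1024 * 1024,
})
await upload.done()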

Thanks for looking into this! Yes, happy to switch to the AWS S3 client instead of Minio if that works better, but I think we're generally limited to using an existing S3 client for this. I suppose you could always limit uploads to smaller file sizes, but that may be less than ideal.

Thanks, @ikreymer. I'm investigating this more with our storage provider. In any case, I already see that the AWS S3 SDK handles Slow Down, whereas I did not see the MinIO client doing that. Also, the AWS S3 SDK first asks the server to confirm that it is ready to accept the data before sending it (by means of HTTP 100 Continue). So I think switching to the AWS S3 SDK has several benefits.

Would you like me to prepare a pull request? (There are some things to clean up.)

Thanks, would definitely appreciate it! There was also a request for region support in #515, and it looks like you were addressing that as well.