wagtail / wagtail-transfer

Content transfer for Wagtail

Home Page: https://wagtail.github.io/wagtail-transfer/


Large transfers (especially involving media) can cause server-level timeouts

stevejalim opened this issue

(I've discussed this loosely with Matt and Jacob in Slack, but am writing it up here)

When a site is hosted on a platform with a hard, non-configurable limit on how long an HTTP request can take (eg 30 seconds), a transfer that involves a sizeable video, or a number of other media files, can easily exceed that limit. This kills the transfer: pages are rolled back, but third-party models (eg wagtailmedia) can be left in an indeterminate state, with files partially written to disk or storage somewhere.

The timeout happens because the overall WT import takes place over a single HTTP request, and transferring an asset file as part of the request-response cycle adds the time taken to copy the file.

This problem is exacerbated when media files are stored in cloud storage, which is common for many PaaS setups.

eg:

Destination Server -> asks Source Server -> asks Source's Storage for file -> Source's Storage returns file to Source Server -> Source Server sends file to Destination Server -> Destination Server stores file in Destination's Storage.

So that's the same file being processed (read or written) 3 or 4 times, depending on upload spooling.
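
To put rough, purely illustrative numbers on it: a 1 GB video moved at 50 MB/s takes about 20 seconds per hop, so a path with three or four read/write hops blows through a 30-second request budget on the very first file.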

Possible solutions

  • Temporary workaround: identify the problematic model types by field using WAGTAILTRANSFER_LOOKUP_FIELDS and manually pre-copy the data to the Destination server, so that the Source server does not have to send it. This works for wagtailmedia.media and its MTI subclasses; see the settings sketch after this list

  • A solution: move the transfer process to multiple AJAX calls (as suggested by Matt), so that we reduce the risk of a timeout. However, a single large file could still exceed the threshold on its own

  • Alternative solution: Detect if files are in cloud storage and support direct cloud-to-cloud syncing of those files, if possible. (However this could also get complex, especially if a cloud-based function is needed to do the copy)

    • Boto3 appears to support CopyObject [docs], which is promising; see the sketch after this list
  • More to come - and more welcome
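
For the first workaround above, the settings entry is roughly as follows. This is a minimal sketch: "title" is purely illustrative, and you should pick whichever field(s) are genuinely unique on both servers.

```python
# settings.py: tell wagtail-transfer to match existing media objects by
# field value rather than transferring the file. The lookup likely needs
# to be configured consistently on both Source and Destination servers.
# "title" is illustrative; use field(s) that uniquely identify the object.
WAGTAILTRANSFER_LOOKUP_FIELDS = {
    "wagtailmedia.media": ["title"],
}
```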
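
On the CopyObject point: a minimal sketch of a server-side copy with boto3, assuming one set of credentials that can read the source bucket and write the destination bucket (in cross-account setups, a public-read policy on the source is one way to grant that read access). Bucket names and the key are illustrative.

```python
import boto3

s3 = boto3.client("s3")

def copy_media_object(source_bucket: str, dest_bucket: str, key: str) -> None:
    # The copy happens inside S3 itself: no file bytes pass through either
    # the Source or the Destination Django server.
    s3.copy_object(
        CopySource={"Bucket": source_bucket, "Key": key},
        Bucket=dest_bucket,
        Key=key,
    )

copy_media_object("source-media-bucket", "dest-media-bucket", "media/video.mp4")
```

Note that CopyObject is limited to 5 GB per object; boto3's managed copy() transfer falls back to a multipart copy above that size.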


(Separate from all the above, it would be nice to have a pre-flight check before a transfer to warn about large files that will be sent over)
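
A minimal sketch of what such a check might look like; nothing like this exists in wagtail-transfer today, Document merely stands in for whichever models hold the media being transferred, and the 50 MB threshold is arbitrary:

```python
from wagtail.documents.models import Document

LARGE_FILE_BYTES = 50 * 1024 * 1024  # arbitrary warning threshold

def warn_about_large_files():
    # One storage round-trip per file, so this is itself slow on big sites;
    # fine as an explicit pre-flight step, not something to run per request.
    for doc in Document.objects.all():
        size = doc.file.size
        if size > LARGE_FILE_BYTES:
            print(f"Large file: {doc.file.name} ({size / 1024 / 1024:.0f} MB)")
```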

I've done some work on direct S3-to-S3 copying using a custom field adapter, which - while not yet in production - seems to be pretty reliable within some known constraints (eg it only works for data with a public-read policy). If anyone's interested, the code is open source and I can point you at the relevant bits of the implementation.
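
For anyone who wants the rough shape without digging through great-cms: a heavily simplified sketch, assuming wagtail-transfer's register_field_adapters hook and FieldAdapter base class. S3FileAdapter and its internals are placeholders, and the linked hooks file is the real implementation.

```python
# Sketch only; the real wiring is in the linked wagtail_hooks.py.
from django.db import models
from wagtail import hooks  # wagtail.core.hooks on older Wagtail versions
from wagtail_transfer.field_adapters import FieldAdapter


class S3FileAdapter(FieldAdapter):
    """Placeholder adapter: rather than streaming file bytes through the
    import request, resolve the source object's bucket/key and issue a
    server-side S3 copy_object into the destination bucket."""
    # ... override the file serialisation/population behaviour here ...


@hooks.register("register_field_adapters")
def register_s3_file_adapter():
    # Map Django's FileField to the custom adapter.
    return {models.FileField: S3FileAdapter}
```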

If there's appetite for making this part of WT, @jacobtoppm, I'd be happy to do that when I have time.

@stevejalim I'd be interested in taking a look if it still works and is up somewhere!

Hi @easherma-truth, I've moved on from the org where I was using it, but it looks like the code is still there: https://github.com/uktrade/great-cms/blob/develop/core/wagtail_hooks.py#L135