Conditional derivatives and eager original downloading
jrochkind opened this issue
The File Processing Guide suggests a usage pattern of "conditional derivatives", e.g.:
```ruby
Attacher.derivatives do |original|
  result = {}
  if file.mime_type == "image/svg+xml"
    result[:png] = something
  end
  result
end
```
This pattern works well for me.
But one thing I noticed is that calling `model.image_attacher.create_derivatives` will still download the entire original, even when the processor conditionally decides not to use it for anything.
I have very large originals (100MB+), so this download is a non-trivial operation. And I have some use cases where it ends up downloading a bunch of files it doesn't need to, which is a problem for me.
But I'm not sure if you would consider this a bug or design problem/need-of-improvement, or if it's working as intended and is fine? And so I bring it up to ask, and to also document my further investigations and thoughts.
Why is it downloading the file? Shrine IO objects are usually "lazy": not read until they are accessed. The reason that is not true here is that `process_derivatives` downloads the source before the registered processor is ever called (shrine/lib/shrine/plugins/derivatives.rb, lines 272 to 275 at commit 00968da).
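To illustrate the shape of that check, here is a minimal, self-contained sketch. The `UploadedFile` struct and its `download` method are simplified stand-ins for illustration, not Shrine's real classes:

```ruby
# Simplified stand-in for Shrine::UploadedFile; `download` simulates
# eagerly fetching the whole original to a local copy.
UploadedFile = Struct.new(:id) do
  def download
    yield "local-copy-of-#{id}" # pretend this is a Tempfile on disk
  end
end

# Mirrors the dispatch inside process_derivatives: an UploadedFile is
# downloaded *before* the processor block runs; any other IO passes through.
def call_processor(source, &processor)
  if source.is_a?(UploadedFile)
    source.download { |file| processor.call(file) }
  else
    processor.call(source)
  end
end

call_processor(UploadedFile.new("abc")) { |io| io } # => "local-copy-of-abc"
call_processor(:some_io) { |io| io }                # => :some_io
```

The point is that the download happens in the dispatch, before the processor block has any chance to decide it doesn't need the file.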
Since I use direct uploads and do derivatives processing in the background, my source is always an `UploadedFile`. The same goes for use cases that involve re-processing derivatives, which is actually what I'm doing (calling `create_derivatives` or `process_derivatives` a subsequent time for an existing record).
OK, that's why the file is getting downloaded. But what is the motivation for this line?
I am guessing, to ensure that derivatives processors get actual files with local file paths (since it is not uncommon for a processor to need that); and maybe also to ensure the file only gets downloaded once? Not sure about that.
This logic looks very much like `with_file`; I am not sure why it doesn't just use `with_file`. It does not actually guarantee that the arg passed to processors will be a local file, like `with_file` would; it only does so in the case of an `UploadedFile`. If you pass it some other kind of IO (say, a `Down` object), it will be passed on through unaltered.
However, I don't know if the unusual implementation is necessary to make the example for "call multiple processors in a row with the same source file, to avoid re-downloading the same source file each time" from the Derivatives doc work.
Regardless, this odd implementation, which allows pass-through of a non-local-file so long as it isn't an `UploadedFile`, lets me write a workaround where the file is not downloaded unless a conditional derivative actually decides it needs it:
```ruby
image_attacher.file.open do |source_io|
  image_attacher.process_derivatives(:default, source_io)
end
```
Now the file is not downloaded eagerly: the IO object is passed through to the processor, and will be downloaded only if the processor asks for it. But what if the processor does need a local file? Well, you could write conditional derivatives something like this:
```ruby
Attacher.derivatives do |original|
  result = {}
  if record.is_a?(Photo) || file.mime_type == "image/svg+xml"
    shrine_class.with_file(original) do |original_as_file|
      magick = ImageProcessing::MiniMagick.source(original_as_file)
      if record.is_a?(Photo)
        result[:jpg]  = magick.convert!("jpeg")
        result[:gray] = magick.colorspace!("grayscale")
      end
      if file.mime_type == "image/svg+xml"
        result[:png] = magick.loader(transparent: "white").convert!("png")
      end
    end
  end
  result
end
```
Phew! Now the processing implementation itself calls `with_file`, and only if it actually decides it wants to do something, to make sure the IO passed in is actually a file.
This works... but it seems pretty fragile/hacky, especially because it relies on `process_derivatives` allowing pass-through of this non-file IO, which may itself be a bug; maybe it should not be doing that? So maybe my code will break if you fix that?
I wonder if derivatives processing ought not to guarantee a file object, and ought to just pass an IO (e.g. from `UploadedFile#open`). Processors can use `with_file` (and possibly the tempfile plugin) if they want to make sure it becomes a file. Although that would be a breaking change at this point... maybe an option? Not sure what the option would be called.
That got confusing! Interested in any thoughts or feedback.
Sounds like an option to let the user download the file themselves could solve the problem. E.g. if the user does

```ruby
plugin :derivatives, download: false
```

the file passed would be a `Shrine::UploadedFile` object instead of a raw file. The `derivation_endpoint` plugin already has a similar option.
I thought the ability to pass the file manually would be enough, but I hadn't considered the case where no processing might happen.
I'm not sure if it should be a plugin option, or a method argument that can be controlled per-call?
Some processors might want Shrine's default behavior (including ensuring the source is a file, although I'm not sure it actually succeeds at that at present), while other processors might be fine with a not-yet-downloaded file.

Especially if the option changes the type of the processor block argument, which is what I think you're saying, you really might want some processors one way and some another. So maybe it should be an argument when you register the processor? Or an argument when you call `create_derivatives`/`process_derivatives`?
But whether it's a plugin option or an argument somewhere: if it means the file passed is guaranteed to be a `Shrine::UploadedFile` for predictability, does that mean `process_derivatives` has to convert any other IO objects it gets to a `Shrine::UploadedFile`? I don't think that's actually possible in general, since a random IO you supply as a custom `source` isn't necessarily in any storage, so it can't have an id/storage_key. Hmm. Maybe it's not "the file passed would be a `Shrine::UploadedFile`"; it's just "the file passed, if say `download: false`, is not guaranteed to be anything beyond an IO-like object; if you want to turn it into a file, use `Shrine.with_file`". Does that make sense?
As far as `download: false` goes: I never like boolean options whose default is true; I think the default should always be false. It avoids common bugs and is more predictable. But I'm not sure what you'd call this option in that case... `raw_io`, maybe.
PS: if I change this line in derivatives.rb:

```ruby
if source.is_a?(UploadedFile)
  source.download do |file|
```

to:

```ruby
if source.is_a?(UploadedFile)
  shrine_class.with_file(source) do |file|
```

all existing tests pass, I think. I wonder if it should really be that, to use consistent DRY code. Right now it doesn't really guarantee the processor block arg is a file; it just turns it into a file if it's an `UploadedFile`. If you give it an IO that is not a file or an `UploadedFile` (say, a `Down::ChunkedIO`, which is valid in most other Shrine places taking IO input), it does NOT get turned into a file. Using `with_file`, it would.
So maybe we add a `raw_io` option, and also change that to `with_file` instead of the custom variant?

I think the `raw_io` option should probably be on the processor registration, if it's only going to be in one place; it affects the logic the processor needs to apply, since it changes the input to the processor.

```ruby
Attacher.derivative(:thumbnail, raw_io: true) do |raw_io|
  # ...
end
```
Yeah, a per-processor option could work too.
I think it's better not to use `Shrine.with_file`, because it wouldn't allow people to pass IO objects that aren't files, which could be a valid use case.

Hm, I think it's the opposite with `with_file`: it would allow people to pass IO objects that aren't files.
Current behavior:
- If they pass an `UploadedFile`, it gets converted to an actual file (via `Tempfile`) before being passed to the processor
- If they pass an actual `File`, it gets passed to the processor as-is
- If they pass an IO object that's not a file, it gets passed to the processor as-is

So right now, since passing a non-file IO object is allowed, a processor has to be prepared to accept arbitrary IO, not just files. An `UploadedFile` is converted to a file before being passed, but other IO isn't.

It looks to me like the point of the `UploadedFile` conversion in case 1 is to make a "contract": you will always get a file. But it fails in case 3. If you aren't going to convert other IO objects before passing them, what's the point of converting only `UploadedFile`s?
Converting the implementation to `with_file` would make it a consistent contract: you'll always get a file. Looking at the implementation of `with_file`, cases 1 and 2 would be almost identical, but in case 3 the IO object would be downloaded to a file, just like the `UploadedFile` is, before being passed to the processor. An additional benefit is that using `with_file` keeps the implementation consistent and DRY.

Am I misunderstanding something?
After the proposed change it would be: by default, everything (including plain IO) gets converted to a file before being passed to the processor, so the processor can be confident it will always get a file. But if you say `raw_io: true`, then everything (including `UploadedFile`) gets passed on as-is, so the processor knows nothing except that it gets an IO-like object, which may or may not be a file; it's up to the processor to convert it to a file (probably using `with_file`) if desired.
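As a rough sketch of that proposal (stand-in classes and a hypothetical `raw_io` flag for illustration, not Shrine's actual implementation):

```ruby
require "tempfile"

# Stand-ins for illustration only.
UploadedFile = Struct.new(:id)
ChunkedIO    = Struct.new(:data) # stands in for e.g. Down::ChunkedIO

# Hypothetical with_file: anything that isn't already a File gets
# materialized to a local tempfile before being yielded.
def with_file(io)
  if io.is_a?(File)
    yield io
  else
    Tempfile.create("source") { |tmp| yield tmp }
  end
end

# Proposed dispatch: the default converts *every* source to a file
# (a consistent contract); raw_io: true passes everything through untouched.
def call_processor(source, raw_io: false, &processor)
  return processor.call(source) if raw_io

  with_file(source) { |file| processor.call(file) }
end

call_processor(ChunkedIO.new("data")) { |io| io.class }               # => File
call_processor(ChunkedIO.new("data"), raw_io: true) { |io| io.class } # => ChunkedIO
```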
Hmm, I see what you mean, it does sound more consistent 🤔
One obvious problem is backwards compatibility. Let me think about it more and get back to you.
Do you have any examples of code relying on the present behavior? I wonder if it's not a problem in practice.
I have found the present behavior challenging -- if I want to have a processor that can be used with input that's an UploadedFile or a non-file IO, I already need to write logic that can handle both cases -- in which case it would probably work fine even after change?
The only real thing that would trigger backwards incompat is if I had written processor logic that only expected a non-file IO, that would be delivered to it un-altered, it never expected to be called with an UploadedFile -- is this even feasible?
So I think arguably this could be considered a bug fix, and the earlier behavior a bug. The docs say:

> Typically you'd pass a local file on disk. If you pass a Shrine::UploadedFile object, it will be automatically downloaded to disk.

They don't mention what happens if you pass something that is neither of those things.
BUT if we do need to maintain strict backwards compat, I see only one option. The argument has to be called something like `io_style`. The default value, which could be called something like `:original` or `:default`, would be the current behavior: `UploadedFile`s are downloaded, everything else is passed through unaltered. The two other values are `:file` (a contract that the processor will get a file no matter what was passed in) and `:raw` (whatever is passed in is passed on unaltered; it is expected to be IO-like).

This maintains backwards compat, but at the cost of a more complicated API, more tests, and more confusing docs. Possibly the `:original` value could trigger a deprecation warning, but that's pretty annoying for a default value, so.
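A rough sketch of how the three `io_style` values could dispatch (hypothetical names and stand-in classes, not Shrine's implementation):

```ruby
require "tempfile"

# Stand-in for Shrine::UploadedFile; download simulates fetching to disk.
UploadedFile = Struct.new(:id) do
  def download
    Tempfile.create("dl") { |tmp| yield tmp }
  end
end

# Stand-in with_file: materialize any non-File IO to a local tempfile.
def with_file(io)
  if io.is_a?(File)
    yield io
  else
    Tempfile.create("src") { |tmp| yield tmp }
  end
end

# Hypothetical io_style dispatch:
#   :original - current behavior: download UploadedFiles, pass anything else through
#   :file     - always hand the processor a local file
#   :raw      - always pass the source through unaltered
def call_processor(source, io_style: :original, &processor)
  case io_style
  when :original
    if source.is_a?(UploadedFile)
      source.download { |f| processor.call(f) }
    else
      processor.call(source)
    end
  when :file
    with_file(source) { |f| processor.call(f) }
  when :raw
    processor.call(source)
  end
end

call_processor(:chunked_io) { |io| io }                               # => :chunked_io
call_processor(:chunked_io, io_style: :file) { |io| io.class }        # => File
call_processor(UploadedFile.new(1), io_style: :raw) { |io| io.class } # => UploadedFile
```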
I also think this should possibly be exposed as a plugin option, as well as an argument on registration (`derivative(:name, io_style: :raw) do`) that can override the plugin option. But if I had to pick only one place, it'd be the argument on processor registration: this is really a property of the processor, what contract it wants for the argument it is given.
(Also, yes, I am willing/can find time to PR any desired change that comes out of this discussion! I can also PR it now as a proof of concept/concrete code for us to look at if you like, just let me know)
Also just recording my progress. My earlier workaround using the present release does not work well; it turns out `open` is itself a fairly expensive operation, especially on S3 storage. So I really want to skip even the eager `open` on the `UploadedFile`, and leave it to the processor logic to do.
(This also reminds me that passing on an unopened UploadedFile isn't quite the same thing as passing an opened File or other IO object. :( This is a bit messy, but I still think where we have arrived is a reasonable API -- also worry that we're spending too much time bike shedding this fairly minor feature, but what can you do).
So here is my present workaround that does work for me. Instead of calling `process_derivatives` (which would force download of `UploadedFile`s), I look into its innards and call only the parts I want:

```ruby
# instead of: result = attacher.process_derivatives(processor, source, **options)
processor = attacher.class.derivatives_processor(processor)
result = attacher.instance_exec(source, **options, &processor)
```
This works for me, although I lose the existing instrumentation, and it requires me to kind of copy paste shrine internal implementation, so may be fragile with future changes.
#477 has been merged, thanks!