shrinerb / shrine

File Attachment toolkit for Ruby applications

Home Page: https://shrinerb.com

Conditional derivatives and eager original downloading

jrochkind opened this issue · comments

The File Processing Guide suggests a usage pattern of "Conditional derivatives", e.g.:

Attacher.derivatives do |original|
  result = {}

  if file.mime_type == "image/svg+xml"
    result[:png] = something
  end

  result
end

This pattern works well for me.

But one thing I noticed is that calling model.image_attacher.create_derivatives in this case still downloads the entire original, even when the conditional logic decides not to use it for anything.

I have very large originals (100MB+), so this download is a non-trivial operation. And I have some use cases where it ends up downloading a bunch of files it doesn't need to, which is a problem for me.

But I'm not sure if you would consider this a bug or design problem/need-of-improvement, or if it's working as intended and is fine? And so I bring it up to ask, and to also document my further investigations and thoughts.

Why is it downloading the file? Often Shrine IO objects are basically "lazy", not read until they are accessed. The reason that is not true here is that in process_derivatives, before the registered processor is ever called:

if source.is_a?(UploadedFile)
  source.download do |file|
    _process_derivatives(processor_name, file, **options)
  end
end

Since I use direct uploads and do derivatives processing in the background, my source is always an UploadedFile. The same is true for use cases that involve re-processing derivatives, which is actually what I'm doing (calling create_derivatives or process_derivatives a subsequent time for an existing record).

OK, that's why the file is getting downloaded. But what is the motivation for this line?

I am guessing, to ensure that derivatives processors get actual files with local file paths (since it is not uncommon for a processor to need that); and maybe also to ensure the file only gets downloaded once? Not sure about that.

This logic looks very much like with_file; I am not sure why it doesn't just use with_file. It does not actually guarantee that the argument passed to processors will be a local file, as with_file would; it only does so when the source is an UploadedFile. If you pass some other kind of IO (say, a Down object), it will be passed through unaltered.

However, I don't know if the unusual implementation is necessary to make the example for "call multiple processors in a row with the same source file, to avoid re-downloading the same source file each time" from the Derivatives doc work.

Regardless, this odd implementation that does allow pass-through of a non-local-file so long as it isn't an UploadedFile lets me write a workaround, where the file is not downloaded unless a conditional derivative actually decides it needs it....

image_attacher.file.open do |source_io|
  image_attacher.process_derivatives(:default, source_io)
end

Now the file is not downloaded, the IO object is passed through the processor, and will be downloaded if the processor asks for it. But what if the processor does need a local file? Well, you could do conditional derivatives something like this:

Attacher.derivatives do |original|
  result = {}

  if record.is_a?(Photo) || file.mime_type == "image/svg+xml"
    shrine_class.with_file(original) do |original_as_file|
      magick = ImageProcessing::MiniMagick.source(original_as_file)
 
      if record.is_a?(Photo)
        result[:jpg]  = magick.convert!("jpeg")
        result[:gray] = magick.colorspace!("grayscale")
      end
 
      if file.mime_type == "image/svg+xml"
        result[:png] = magick.loader(transparent: "white").convert!("png")
      end
    end
  end
  result
end

Phew! Now only if the processing implementation decides it actually wants to do something, does it itself call with_file to make sure the IO passed in is actually a File.

This works.... but seems pretty fragile/hacky. Especially because it relies on process_derivatives allowing pass-through of this non-File IO... which may itself be a bug, maybe it should not be doing that? So maybe my code will then break if you fix that?

I wonder if derivatives processing ought not to guarantee a file object, ought to just pass an IO (eg from UploadedFile#open). Processors can use with_file (and possibly the tempfile plugin) if they want to make sure it becomes a file. Although that would be a breaking change at this point... maybe an option? Not sure what the option would be called.

That got confusing! Interested in any thoughts or feedback.

Sounds like an option to let the user download the file themselves could solve the problem. E.g. if the user does

plugin :derivatives, download: false

The file passed would be a Shrine::UploadedFile object instead of a raw file. The derivation_endpoint plugin already has a similar option.
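To illustrate the payoff, here is a plain-Ruby sketch (FakeUploadedFile is a made-up stand-in for Shrine::UploadedFile, and the conditional processor mirrors the pattern from the guide; this is not Shrine's actual code). Under a hypothetical download: false, the expensive download only happens when the processor decides it needs the bytes:

```ruby
require "tempfile"

# Stand-in for Shrine::UploadedFile: downloading is the expensive
# operation, so we count how many times it actually happens.
class FakeUploadedFile
  attr_reader :mime_type, :downloads

  def initialize(mime_type, data)
    @mime_type, @data, @downloads = mime_type, data, 0
  end

  def download
    @downloads += 1
    Tempfile.create("original") do |file|
      file.write(@data)
      file.rewind
      yield file
    end
  end
end

# Conditional processor: only SVG originals are ever downloaded.
def process(source)
  result = {}
  if source.mime_type == "image/svg+xml"
    source.download { |file| result[:png] = file.read.bytesize }
  end
  result
end

jpeg = FakeUploadedFile.new("image/jpeg", "x" * 100)
process(jpeg)
jpeg.downloads # => 0, the large original was never fetched
```

With the current eager-download behavior, the jpeg would have been fully downloaded before the processor even ran.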

I thought the ability to pass the file manually would be enough, but I hadn't considered the case where no processing might happen.

I'm not sure if it should be a plugin option, or a method argument that can be controlled per-call?

Some processors might want Shrine's default behavior (including ensuring the source is a file... although I'm not sure it actually succeeds at that at present), while other processors might be fine with a not-yet-downloaded file.

Especially if the argument is going to change the type of the processor block arg, which is what I think you're saying? Then you really might want some processors to use one and some the other -- maybe it should be an arg when you register the processor?

Or an arg when you call create_derivatives/process_derivatives?

But whether a plugin option or an argument somewhere... if it means the file passed is guaranteed to be a Shrine::UploadedFile for predictability... does that mean process_derivatives has to convert any other IO objects it gets to a Shrine::UploadedFile? I don't think that's actually possible in general, since a random IO you are supplying as a custom source isn't necessarily in any storage, so it can't have an id/storage_key. Hmm. Maybe it's not "the file passed would be a Shrine::UploadedFile"; it's just "the file passed, with say download: false, is not guaranteed to be anything beyond an IO-like object -- if you want to turn it into a file, use Shrine.with_file". Does that make sense?

As far as download: false -- I never like boolean options whose default is "true"; I think the default should always be false. It avoids common bugs and is more predictable. But I'm not sure what you'd call this in that case... raw_io, maybe.

PS: if I change the line in derivatives.rb

if source.is_a?(UploadedFile)
  source.download do |file|

to:

if source.is_a?(UploadedFile)
  shrine_class.with_file(source) do |file|

All existing tests pass, I think. I wonder if it should really be that, to use consistent DRY code... and right now, it doesn't really guarantee the processor block arg is a file; it just turns it into a file if it's an UploadedFile -- if you give it an IO that is not a file or an UploadedFile (say, a Down::ChunkedIO, which is valid in most other Shrine places taking IO input), it does NOT get turned into a file.

Using with_file it would.

So maybe we add a raw_io option, and also change that to with_file instead of the custom variant?

I think the raw_io option should probably be on the processor registration, if it's only going to be one place; it affects the logic the processor needs to apply, since it changes the input to the processor.

Attacher.derivative(:thumbnail, raw_io: true) do |raw_io|
  # ...
end

Yeah, a per-processor option could work too.

I think it's better not to use Shrine.with_file, because it wouldn't allow people to pass IO objects that aren't files, which could be a valid use case.

Hm, I think it's the opposite with with_file, that it would allow people to pass IO objects that aren't files.

Current behavior:

  1. If they pass an UploadedFile, it gets converted to an actual File (via TempFile) before being passed to the processor
  2. If they pass an actual File, it gets passed to the processor as is
  3. If they pass an IO object that's not a file, it gets passed to the processor as is

So right now, if passing an IO object that's not a file is allowed -- a processor has to be prepared to accept arbitrary IO, not just files. UploadedFile is converted to a file before being passed, but other IO isn't.

It looks to me like the point of the UploadedFile conversion in 1 is to make a "contract" -- you will always get a file. But it fails in case 3. If you aren't going to convert IO objects before passing them, what's the point of converting UploadedFiles, why convert just them?

Converting the implementation to with_file, it would now be a consistent contract -- you'll always get a file. Looking at the implementation of with_file, cases 1 and 2 would be almost identical, but in case 3 the IO object would be downloaded to a file just like an UploadedFile is, before being passed to the processor. In addition, a consistent with_file implementation has the benefit of being DRY.
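As a plain-Ruby sketch of that consistent contract (normalize_source is a made-up name, not Shrine code; Shrine's real with_file also handles UploadedFile specially), anything without a real path gets copied to a tempfile before the block runs:

```ruby
require "tempfile"
require "stringio"

# Illustrative helper (hypothetical name): guarantee the block
# receives an object backed by a real file on disk, copying plain
# IO objects to a Tempfile first.
def normalize_source(source)
  if source.respond_to?(:path)   # already a File/Tempfile
    yield source
  else                           # plain IO: materialize to disk first
    Tempfile.create("source") do |tempfile|
      IO.copy_stream(source, tempfile)
      tempfile.rewind
      yield tempfile
    end
  end
end

# A non-file IO (case 3) now also arrives with a real #path:
normalize_source(StringIO.new("svg data")) do |file|
  file.respond_to?(:path) # => true
  file.read               # => "svg data"
end
```

Under the current behavior, the StringIO in this example would have been passed through unaltered, so the processor could not assume a file path.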

Am I misunderstanding something?

After proposed change, it would be like: default, everything (including IO) gets converted to a file before being passed to processor. So the processor can be confident it will always get a file. But if you say raw_io: true, then everything (including UploadedFile) gets passed on as-is, so the processor doesn't know anything except it gets an IO-like object, may or may not be a file, up to the processor to convert to file (probably using with_file) if desired.

Hmm, I see what you mean, it does sound more consistent 🤔

One obvious problem is backwards compatibility. Let me think about it more and get back to you.

Do you have any examples of code relying on the present behavior? I wonder if it's not a problem in practice.

I have found the present behavior challenging -- if I want a processor that can be used with input that's an UploadedFile or a non-file IO, I already need to write logic that handles both cases -- in which case it would probably work fine even after the change?

The only real thing that would trigger backwards incompatibility is if I had written processor logic that only expected a non-file IO, delivered to it unaltered, and never expected to be called with an UploadedFile -- is that even likely?

So I think arguably this could be considered a bug fix, the earlier behavior a bug. The docs say:

Typically you'd pass a local file on disk. If you pass a Shrine::UploadedFile object, it will be automatically downloaded to disk.

They don't mention what happens if you pass something that is neither of those things.

BUT if we do need to maintain strict backwards compat, I see only one option. The argument has to be called something like io_style. The default value, which could be called something like :original or :default, would be the current behavior -- UploadedFiles are downloaded, everything else is passed through unaltered. Two other options are :file (contract that processor will get a file no matter what was passed in), and :raw (whatever is passed in is passed on unaltered; it is expected to be IO-like).
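A minimal sketch of that three-way dispatch in plain Ruby (open_source and the uploaded: flag are made-up names standing in for Shrine's internals; the uploaded: flag stands in for the source.is_a?(UploadedFile) check):

```ruby
require "tempfile"
require "stringio"

# Hypothetical sketch of the proposed :io_style contract:
#   :raw      -> pass the source through untouched
#   :file     -> guarantee a real file on disk, no matter what came in
#   :original -> current behavior: only uploaded sources are downloaded
def open_source(source, io_style:, uploaded: false)
  case io_style
  when :raw
    yield source
  when :file
    if source.respond_to?(:path)
      yield source
    else
      Tempfile.create("source") do |tempfile|
        IO.copy_stream(source, tempfile)
        tempfile.rewind
        yield tempfile
      end
    end
  when :original
    if uploaded
      open_source(source, io_style: :file) { |file| yield file }
    else
      yield source  # non-UploadedFile IOs pass through unaltered
    end
  end
end
```

With :raw a StringIO stays a StringIO; with :file it arrives as a tempfile; with :original (the backwards-compatible default) only an "uploaded" source is materialized to disk.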

This maintains backwards compat, but at the cost of a more complicated API, including more tests, and more confusing docs. Possibly the :original value could trigger a deprecation warning -- but that's pretty annoying for a default value, so.

I also think this should possibly be exposed as a plugin option, as well as an argument on derivative(:name, io_style: :raw) do registration that can override the plugin option. But if I had to pick only one place, it'd be the argument on processor registration -- this is really a property of the processor, what contract it wants for the argument given to it.

(Also, yes, I am willing/can find time to PR any desired change that comes out of this discussion! I can also PR it now as a proof of concept/concrete code for us to look at if you like, just let me know)

Also just recording my progress: my earlier workaround on the present release does not work well -- it turns out open is itself a fairly expensive operation, especially on S3 storage. So I really want to skip even the "eager" open on the UploadedFile, and leave it to the processor logic to do.

(This also reminds me that passing on an unopened UploadedFile isn't quite the same thing as passing an opened File or other IO object. :( This is a bit messy, but I still think where we have arrived is a reasonable API -- also worry that we're spending too much time bike shedding this fairly minor feature, but what can you do).

So here is my present workaround that does work for me. Instead of calling process_derivatives (which would force download of UploadedFiles), I look into its innards and call only the parts I want:

# instead of result = attacher.process_derivatives(processor, source, **options)
processor = attacher.class.derivatives_processor(processor)
result = attacher.instance_exec(source, **options, &processor)

This works for me, although I lose the existing instrumentation, and it requires me to kind of copy paste shrine internal implementation, so may be fragile with future changes.

#477 has been merged, thanks!