stumpapp / stump

A free and open source comics, manga and digital book server with OPDS support (WIP)

Home Page: https://stumpapp.dev

[FEATURE] Minimal file reads during initial scan

Strontium opened this issue · comments

Is your feature request related to a problem? Please describe.
When the library files live on a file system remote to the Stump host, such as a cloud storage provider, there may be both bandwidth and egress limits when accessing the data.
Currently, when doing an initial scan of a book, Stump appears to read the entire file. When using cloud storage via an rclone mount, this scan causes the file to be downloaded in its entirety, which is inefficient and likely slow for large files.
When opening a book for reading, Stump appears to request only the pages it requires, so rclone downloads only those parts of the file, reducing traffic and loading times.

Describe the solution you'd like
Allow an option to reduce file access during the initial scan.
The scan should be limited to:

  • reading the first page for thumbnail creation
  • reading metadata

This could be offered as a 'minimal scan' option if there are other benefits to reading the entire file. Note that it may only be possible for file types that allow reading data without fetching the whole file (e.g. a ZIP with no compression); a rough sketch of what such a scan would touch is below.
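To make the idea concrete, here is a minimal sketch (not Stump code) of what such a scan might read from a CBZ, assuming the Rust `zip` crate. The `minimal_scan` function name, the `ComicInfo.xml` entry name, and the image-extension list are illustrative assumptions. Opening the archive only parses the central directory, so on a seekable remote mount only the entries that are actually read need to be fetched.

```rust
use std::fs::File;
use std::io::Read;

use zip::ZipArchive;

/// Sketch: read only what a "minimal scan" needs from a CBZ — the metadata
/// entry and the bytes of the first page (for a thumbnail). `ZipArchive::new`
/// parses just the central directory, so only the ranges read here are fetched.
fn minimal_scan(path: &str) -> zip::result::ZipResult<(Option<String>, Option<Vec<u8>>)> {
    let mut archive = ZipArchive::new(File::open(path)?)?;

    // Metadata: ComicInfo.xml is a common (assumed) metadata entry in CBZ files.
    let metadata = match archive.by_name("ComicInfo.xml") {
        Ok(mut entry) => {
            let mut xml = String::new();
            entry.read_to_string(&mut xml)?;
            Some(xml)
        }
        Err(_) => None,
    };

    // Thumbnail source: the lexicographically first entry that looks like an image.
    let mut image_names: Vec<String> = archive
        .file_names()
        .filter(|n| {
            let lower = n.to_ascii_lowercase();
            lower.ends_with(".jpg") || lower.ends_with(".jpeg") || lower.ends_with(".png")
        })
        .map(str::to_string)
        .collect();
    image_names.sort();

    let first_page = match image_names.first() {
        Some(name) => {
            let mut entry = archive.by_name(name)?;
            let mut buf = Vec::new();
            entry.read_to_end(&mut buf)?;
            Some(buf)
        }
        None => None,
    };

    Ok((metadata, first_page))
}
```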

Describe alternatives you've considered
Stump being aware that the storage is remote and automatically limiting its file access accordingly.

Additional context
In comparison to other applications in my testing with an rclone mount:

  • Komga: Reads full file on SCAN and full file on OPEN
  • Kavita: Reads partial file on SCAN and full file on OPEN
  • Stump: Reads full file on SCAN and partial file on OPEN

Stump: Reads full file on SCAN and partial file on OPEN

The main reason Stump reads the full file on scan is to determine the actual page count. That operation involves iterating through each file in an archive, for example, to determine whether it is a valid page (i.e. an image file). The validity check uses the actual byte content and falls back to the extension, as a way of being more accurate about what is truly a valid page. The only feasible ways to allow for partial reads would be:

  • Drop the accuracy and operate only on file extensions, trusting that the extension is always correct (the sketch after this list contrasts the two checks)
  • Don't count the pages at all and rely on metadata instead, accepting that missing or inaccurate metadata would then cause problems
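For illustration, a rough sketch (not Stump's actual implementation) of the difference between the two checks, assuming the Rust `zip` crate; the function names and the magic-number/extension lists are assumptions for the example. The extension-only path never touches entry data, while the content check forces every entry to be decompressed.

```rust
use std::io::{Read, Seek};
use std::path::Path;

use zip::ZipArchive;

/// Cheap check: trust the entry's file extension. Only the central
/// directory (entry names) is needed, so no entry data has to be fetched.
fn looks_like_page(name: &str) -> bool {
    matches!(
        Path::new(name)
            .extension()
            .and_then(|e| e.to_str())
            .map(|e| e.to_ascii_lowercase())
            .as_deref(),
        Some("jpg" | "jpeg" | "png" | "webp" | "gif")
    )
}

/// Accurate check: sniff the first bytes of the entry, which forces the
/// entry's data to be downloaded and decompressed.
fn is_image_content(buf: &[u8]) -> bool {
    buf.starts_with(&[0xFF, 0xD8, 0xFF])                // JPEG
        || buf.starts_with(b"\x89PNG\r\n\x1a\n")        // PNG
        || buf.starts_with(b"GIF8")                     // GIF
        || (buf.len() >= 12 && &buf[0..4] == b"RIFF" && &buf[8..12] == b"WEBP")
}

/// Count "valid pages" in a ZIP/CBZ either by content (reads every entry,
/// like the current full scan) or by extension only (the proposed minimal scan).
fn count_pages<R: Read + Seek>(reader: R, minimal: bool) -> zip::result::ZipResult<usize> {
    let mut archive = ZipArchive::new(reader)?;
    let mut pages = 0;

    for i in 0..archive.len() {
        let mut entry = archive.by_index(i)?;
        let valid = if minimal {
            looks_like_page(entry.name())
        } else {
            // Read just enough bytes to sniff the magic number, falling back
            // to the extension when the sniff is inconclusive.
            let mut magic = Vec::with_capacity(16);
            (&mut entry).take(16).read_to_end(&mut magic)?;
            is_image_content(&magic) || looks_like_page(entry.name())
        };
        if valid {
            pages += 1;
        }
    }

    Ok(pages)
}
```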

This could be a configuration option. I'll have to think on implementation details, though.
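Purely as a hypothetical shape for such an option (this is not Stump's actual configuration, and all names here are made up), it might look something like:

```rust
use serde::{Deserialize, Serialize};

/// Hypothetical per-library scan mode. `Accurate` keeps the current
/// byte-sniffing behavior; `Minimal` trusts file extensions (and metadata)
/// so the scanner never has to read entry contents.
#[derive(Debug, Clone, Copy, Default, Serialize, Deserialize)]
#[serde(rename_all = "snake_case")]
pub enum ScanMode {
    #[default]
    Accurate,
    Minimal,
}
```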

Is it often that you would have an invalid file with an incorrect extension?
What is the downside to having a potentially inaccurate page count?
I think that dropping that level of accuracy is a reasonable trade-off to improve initial scan performance and, in my case, reduce remote traffic.
If it is easier to implement, I'd still be happy even if it were an optional feature and/or there were some prerequisites to making it work (e.g. having the required metadata in the file).
Thanks for considering it.

I can't speak to how often you would have a file inside an archive with the wrong/invalid extension. I'd hope it isn't often, and FWIW I haven't encountered the situation personally 😅

What is the downside to having a potentially inaccurate page count?

  1. If there are more pages than what was observed, you likely won't be able to access any of the missing content. For example, if there are actually 30 pages but for some reason Stump only observed 28 valid pages, API validation would essentially prevent you from even trying to query past the 28th page.
  2. If there are fewer pages than what was observed, internal server errors will start to be thrown as Stump attempts to extract nonexistent pages from the file (both cases are sketched below).
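A toy sketch of those two failure modes, with made-up names and status codes chosen only to illustrate the point:

```rust
/// Illustrative only: `stored_pages` is the count recorded at scan time,
/// `actual_pages` is what the archive really contains, `requested` is 1-based.
fn get_page(stored_pages: usize, actual_pages: usize, requested: usize) -> Result<Vec<u8>, String> {
    if requested == 0 || requested > stored_pages {
        // Case 1 (undercount): real pages beyond the recorded count are
        // rejected by validation before the file is ever opened.
        return Err(format!("400: page {requested} is outside 1..={stored_pages}"));
    }
    if requested > actual_pages {
        // Case 2 (overcount): validation passes, but extracting the entry
        // fails at read time and surfaces as an internal server error.
        return Err(format!("500: page {requested} not found in archive"));
    }
    Ok(vec![]) // the page bytes would be returned here
}
```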

Not the end of the world, just things to consider as part of the trade-off. I'll try to see what the general consensus is for this change in behavior before committing to it.