Touffy / client-zip

A client-side streaming ZIP generator

[FEATURE] Announce total length in HTTP header 'Content-Length'

philenius opened this issue · comments

Thank you for your library. I was wondering if it is possible to calculate the download progress as a percentage. To do this, the response object of downloadZip() would have to contain the HTTP header Content-Length, which it does not at the moment.

I have already found that browsers hide HTTP headers from scripts on cross-origin requests (see here) unless the HTTP header Access-Control-Expose-Headers says otherwise. So, I adapted my server config to expose the Content-Length header. As a result, the request to fetch my test file returns the HTTP header Content-Length (see (1) in code snippet). But, because the HTTP header Content-Length is still missing on the response of downloadZip() (see (2) in code snippet), I cannot calculate a progress in percent (as the absolute value is unknown).
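
For reference, the server-side change amounts to sending a response header like the one below; the exact configuration syntax depends on your server, and this raw header line is shown only as an assumed example of the setup described above:

Access-Control-Expose-Headers: Content-Length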

So my question is: is it even possible to include the HTTP header Content-Length and have it contain the sum of all content lengths of the files to be downloaded?

The following code snippet demonstrates how to observe the download progress:

import { downloadZip } from 'https://cdn.jsdelivr.net/npm/client-zip/index.js';

async function download() {

    /// (1) Original HTTP request where the Content-Length header is present
    const file = await fetch('https://4sc35swexhjkk3fe-public.s3-eu-west-1.amazonaws.com/elevator-music.mp3');

    const response = await downloadZip([file]);
    const reader = response.body.getReader();

    /// (2) Content-Length header is missing -> always 0 -> cannot calculate a relative progress
    const contentLength = +response.headers.get('Content-Length');

    let receivedLength = 0;
    const chunks = [];
    while (true) {
        const { done, value } = await reader.read();

        if (done) {
            break;
        }

        chunks.push(value);
        receivedLength += value.length;

        console.log(`Received ${receivedLength} of ${contentLength}`);
    }

    const blob = new Blob(chunks);

    const link = document.createElement('a');
    link.href = URL.createObjectURL(blob);
    link.download = 'test.zip';
    link.click();
    link.remove();
}

download();

Hello. As I have explained in CONTRIBUTING.md, downloadZip does not set the content-length because it doesn't know it when it returns the Response (at that point, it hasn't even started to read the inputs). That is a consequence of the streaming design of the library.

It would be possible to set content-length without sacrificing memory efficiency if

  • you pass all your input in an array (not a generator), which forces you to fetch every input first (without reading the data)
  • every input has a known length (always true for Blobs, ArrayBuffers, and strings; true for Responses if they have a content-length themselves; for other types, i.e. streams and iterators, it would require adding a new optional property to the input object so you can announce the length manually)

Is that OK for you?

Then of course downloadZip needs some new code to detect that every length is known in advance and to take advantage of it.

I have another suggestion that won't be as accurate as monitoring the download stream, but it should be close enough, and it works with the current version of client-zip. If all your input is in Streams (I assume it's Response bodies in your case), you could instead monitor those streams.

downloadZip will consume them one after the other, in the order you passed them, and you could either just listen to the "end" event on each stream to count how many files have been copied (good enough if it's a lot of small files, I think), or pass the streams through your reader to log fine-grained progress for each file. There should never be more than a few milliseconds in downloadZip between consuming a chunk of data from an input stream and writing it to the output stream. When all the input has been processed, it again takes only a millisecond or so to finish the archive.

Thank you for your quick and detailed response. I understand that due to the streaming design of client-zip it is not possible to know the final content length in advance.

If I understood you correctly, this is the solution you proposed for calculating the total content length? Using this approach, I can calculate the download progress as a percentage even with the current version of client-zip. Thank you for your hint!

import { downloadZip } from 'https://cdn.jsdelivr.net/npm/client-zip/index.js';

async function download() {

    const file1 = await fetch('https://4sc35swexhjkk3fe-public.s3-eu-west-1.amazonaws.com/elevator-music.mp3');
    const file2 = await fetch('https://4sc35swexhjkk3fe-public.s3-eu-west-1.amazonaws.com/elevator-music.mp3');
    const file3 = await fetch('https://4sc35swexhjkk3fe-public.s3-eu-west-1.amazonaws.com/elevator-music.mp3');
    const file4 = await fetch('https://4sc35swexhjkk3fe-public.s3-eu-west-1.amazonaws.com/elevator-music.mp3');

    const files = [file1, file2, file3, file4];
    
    // Sum up the content lengths
    let contentLength = 0;
    files.forEach(file => {
        contentLength += +file.headers.get('Content-Length');
    });

    const response = await downloadZip(files);
    const reader = response.body.getReader();

    let receivedLength = 0;
    const chunks = [];
    while (true) {
        const { done, value } = await reader.read();

        if (done) {
            break;
        }

        chunks.push(value);
        receivedLength += value.length;

        console.log(`Received ${receivedLength} of ${contentLength} Bytes -> ${(receivedLength/contentLength*100).toFixed()} %`);
    }

    const blob = new Blob(chunks);

    const link = document.createElement('a');
    link.href = URL.createObjectURL(blob);
    link.download = 'test.zip';
    link.click();
    link.remove();
}

download();

Regarding your second suggestion, could you please explain how to listen to the "end" event of each stream? And yes, simply counting how many files have already been downloaded would be sufficient for my use case, since I download dozens to hundreds of small files (< 1MB). So knowing the count of already downloaded files would be a sufficient indicator of the download progress.

OK, so first about your code: that's not exactly what I had in mind, but close enough, I suppose. You are comparing the progress of the output generation against the total size of the input files, whereas my suggestion was to monitor the progress of the input consumption using a passthrough TransformStream.

In fact, the overhead from the Zip file format isn't very big, and it's predictable, so you can make your method more accurate: without Zip64 (for archives smaller than 4GB), the exact overhead is 92 bytes per file, plus twice the length of each filename in UTF-8 (the name is written in both the local file header and the central directory record), plus 22 bytes at the end. You could approximate that to roughly 100 to 130 bytes per file, depending on your average filename.
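
To illustrate the formula: an archive of four files, each named elevator-music.mp3 (18 bytes in UTF-8), would carry 4 × (92 + 2 × 18) + 22 = 534 bytes of overhead on top of the file sizes themselves.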

By the way, you don't need to await the return value of downloadZip; it's not a Promise.

About the second option: sorry, I thought there were events on WHATWG streams, and I was mistaken. The easiest solution I can think of is using pipeTo (instead of pipeThrough) with a passthrough TransformStream's writable side as the target, and making a new input out of the readable side, because pipeTo returns a Promise that resolves when the stream is closed. So if you had a list of those Promises, one for each of your files, you could make a little async generator out of them (they will resolve in sequence) and measure progress that way.
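
A minimal sketch of that idea, assuming responses is an array of fetch Responses; the file names and the onDone callback are illustrative, not part of client-zip:

function monitorBody(response, onDone) {
    // identity TransformStream: chunks pass through unchanged
    const { readable, writable } = new TransformStream();
    // pipeTo returns a Promise that resolves once downloadZip has consumed the whole stream
    response.body.pipeTo(writable).then(onDone);
    return readable;
}

let finished = 0;
const inputs = responses.map((response, i) => ({
    name: `file-${i}.mp3`, // supply your real file names here
    input: monitorBody(response, () =>
        console.log(`${++finished} of ${responses.length} files zipped`)),
}));
const zipResponse = downloadZip(inputs);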

Honestly, it sounds like more trouble than it's worth, especially since you already have a working solution. The TransformStream solution would be more efficient if you were generating large archives, but I guess that's not your use case.

Thank you for your explanations, @Touffy. You're right, I'm satisfied with this working solution. There's no need to extend client-zip, so this issue can be closed.

For others who might also want to calculate the download progress, here is a potential solution:

import { downloadZip } from 'https://cdn.jsdelivr.net/npm/client-zip/index.js';

async function download() {

    const file1 = await fetch('https://4sc35swexhjkk3fe-public.s3-eu-west-1.amazonaws.com/elevator-music.mp3');
    const file2 = await fetch('https://4sc35swexhjkk3fe-public.s3-eu-west-1.amazonaws.com/elevator-music.mp3');
    const file3 = await fetch('https://4sc35swexhjkk3fe-public.s3-eu-west-1.amazonaws.com/elevator-music.mp3');
    const file4 = await fetch('https://4sc35swexhjkk3fe-public.s3-eu-west-1.amazonaws.com/elevator-music.mp3');

    const files = [file1, file2, file3, file4];

    // Calculate the total download size: for each file, sum the HTTP header `Content-Length`
    // + twice the length of the UTF-8 encoded file name (it appears in the local header
    // and in the central directory) + 92 bytes of fixed per-file overhead
    const fileNameSizeInBytes = new TextEncoder().encode('elevator-music.mp3').length;
    let totalBytes = 0;
    files.forEach(file => {
        const contentLength = +file.headers.get('Content-Length');
        totalBytes += contentLength + 2 * fileNameSizeInBytes + 92;
    });
    // plus 22 bytes once, for the end-of-central-directory record
    totalBytes += 22;
    
    const response = downloadZip(files);
    const reader = response.body.getReader();

    let receivedBytes = 0;
    const chunks = [];
    while (true) {
        const { done, value } = await reader.read();

        if (done) {
            break;
        }

        chunks.push(value);
        receivedBytes += value.length;

        console.log(`Received ${receivedBytes} of ${totalBytes} Bytes -> ${(receivedBytes/totalBytes*100).toFixed()} %`);
    }

    const blob = new Blob(chunks);

    const link = document.createElement('a');
    link.href = URL.createObjectURL(blob);
    link.download = 'test.zip';
    link.click();
    link.remove();
}

download();

Fetch doesn't have a progress event, so could we use XMLHttpRequest, which does have one?
I guess we could... same with axios, no?

Jake Archibald did a pretty good write-up in 2016: https://jakearchibald.com/2016/streams-ftw/

Yes, you could use an XHR to make a custom input for downloadZip, but you don't actually need it to monitor progress. As I suggested earlier in this thread, you can do that by piping the fetch's response.body through a monitoring Transform. Granted, it's not quite as simple as the progress event, but it is simpler than Jake's solution from that 2016 article, since Transforms were not really a thing on the client side back then (you don't need to get a Reader anymore). I'm sure Jake himself has posted amazing things about Transforms since then.
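
For instance, a minimal sketch of such a monitoring Transform (the onProgress callback is an illustrative assumption):

function bodyWithProgress(response, onProgress) {
    let received = 0;
    const counter = new TransformStream({
        transform(chunk, controller) {
            received += chunk.byteLength; // chunks are Uint8Arrays
            onProgress(received);
            controller.enqueue(chunk); // pass the data through unchanged
        }
    });
    return response.body.pipeThrough(counter);
}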

@Touffy Thanks for your feedback. I must say you have been an amazing open-source contributor for this package. I hope you will one day do a paid course that teaches us about JS streams/transforms and other browser-related APIs; I will happily hand over my money. Most if not all other JS courses teach common, already well-covered concepts, but none of them ever go into the nitty-gritty of enterprise tech such as streams, piping, blobs, readers, iterables, generators, etc. Thank you. I am sure I will have more questions to come.

Oh, and jake@google posted an updated 2020 video on streams: https://www.youtube.com/watch?v=G9PpImUEeUA

Sorry for bringing up a closed issue, but I did not want to create a duplicate one.

For my particular use case, I need to know the final zip size before even starting the process, because I need to pipe the Response to another process that requires knowing the input size.

Because I know the sizes of the inputs to downloadZip in advance, I can pass them to it. It would be nice if client-zip could set the Content-Length header immediately, after some quick calculation. Otherwise, if any of the inputs has an unknown size, client-zip can skip the Content-Length header.

I could do the calculation manually, but it's cleaner to keep it within client-zip. Are you open to adding that functionality?

Hi Sơn. You are right to re-use this issue.

Like I said before, it's possible for client-zip to compute the total length in advance if all file lengths and names are known at the start and if the files will be stored without compression (encryption overhead should be predictable, though I haven't looked into it yet).

Ideally, to preserve the general streaming design for large archives, you'd want to avoid reading the metadata from the actual file Responses. Instead, you should use a dedicated metadata endpoint (or, at least, HEAD requests) before starting the zipping process. Once you have collected all the metadata, there are two ways to proceed (and both could be implemented):

  1. Give downloadZip a configuration object as a second argument, containing that metadata. Doing so would allow client-zip to not only set the Content-Length header, but also avoid the streaming flag in the Zip entries.
  2. Call another method than downloadZip, exposed by client-zip, whose sole purpose would be to predict the length of a Zip file based on metadata. This would allow you to use the information without actually creating the archive (for example, if you want to display it in your UI or check it against an upload quota).

The first argument of downloadZip can already take objects that define a file with a name and a body, so why don't we just add a size field to those objects? That should give client-zip enough info to compute the final size (without compression) and set the Content-Length header in the Response.

I suppose that if the input field of such an object is not set, we can skip the streaming flag.

downloadZip consumes entries from the input iterable lazily, allowing them to be generated lazily too. So if we add metadata to those objects, downloadZip won't have access to it at the beginning. That's why the metadata has to be passed separately.

I know it's a bit more complicated, but this way, you can let downloadZip predict the correct size for everything, while still being able to create very large archives (in a ServiceWorker) without running out of memory or saturating the browser's HTTP request pool.
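
To make that constraint concrete, here is a sketch of a lazy input (the generator and urls are illustrative); downloadZip starts streaming the archive before the later entries even exist:

async function* lazyFiles(urls) {
    for (const url of urls) {
        // one request at a time, made only when downloadZip asks for the next entry
        yield await fetch(url);
    }
}
const response = downloadZip(lazyFiles(urls));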

That actually sounds great for my use case! Thanks for building this with streaming in mind! I'm looking forward to having this functionality implemented!

Actually, I was hasty when I said we could get rid of the streaming flag. That would also require knowing each file's CRC-32 in advance. I know some cloud storage services can give you that information in a metadata endpoint, but that's too situational, and the benefit is tiny (it would just save 16 to 28 bytes of overhead per file).

So, streaming is still always on after all. The new feature will simply let you predict the total length of the zip file by passing a (synchronous-only, this time) iterable of file metadata to a new predictLength function, and then giving the result to downloadZip in a new length option.

I think predictLength will take polymorphic input types like downloadZip does. Length prediction only needs two things: the file size (of course) and the file name (because the name also contributes to the length of the Zip file). In some cases (like Response and File objects), both can be extracted from the input. The thing is, though, that fetching a Response just to get the size of each file is rather inefficient (even with a HEAD request), and I don't want to encourage that when there is a better way (typically, a metadata endpoint that can list a whole directory in one request).

In the case of Zip64 archives, the list of file metadata given to predictLength will have to be in exactly the same order as the actual files given to downloadZip; otherwise, there could be inaccuracies if the offset overflows 32 bits at a different file than predicted.

Should be solved with the newly published versions 1.5 (still distributed as ES2018 and without Zip64) and 2.2.
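
Based on the description in this thread, usage should look roughly like this (a sketch only; check the client-zip README for the exact metadata shapes predictLength accepts):

import { downloadZip, predictLength } from 'https://cdn.jsdelivr.net/npm/client-zip/index.js';

// metadata in the same order as the files that will later be passed to downloadZip
const metadata = [
    { name: 'elevator-music.mp3', size: 1_000_000 },
    { name: 'cover.jpg', size: 200_000 },
];
const length = predictLength(metadata);

// later, with the actual files:
const response = downloadZip(files, { length });
console.log(response.headers.get('Content-Length')); // now known in advance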

I must say the feature came at a significant cost in kilobytes (well, only about 1kB, but client-zip being so small to begin with, that was a big increase).

This is awesome! Sorry for asking you to add an extra 1kB to the library, but I think it adds more possibilities, such as piping data from cloud providers (which already have APIs that return the size) into a zip file.

Once we have basic password encryption, client-zip will be perfect!