w3c / webcodecs

WebCodecs is a flexible web API for encoding and decoding audio and video.

Home Page: https://w3c.github.io/webcodecs/

Flushing the output queue WITHOUT invalidating the decode pipeline

RogerSanders opened this issue

First up, let me say thanks to everyone who's worked on this very important spec. I appreciate the time and effort that's gone into this, and I've found it absolutely invaluable in my work. That said, I have hit a snag, which I consider to be a significant issue in the standard.

Some background first so you can understand what I'm trying to do.

  • I'm doing 3D rendering on a cloud server in real time, using hardware H.265 encoding to compress frames, and sending those frames over a websocket to the client browser, where I use the WebCodecs API (specifically VideoDecoder) to decode them and display them to the user.
  • My goal is low latency, which I achieve by balancing rendering time, encoding time, compressed frame size, and decoding time.
  • My application is an event driven, CAD-style application. Unlike video encoding from a real camera, or 3D games, where a constant or target framerate exists, my application has no such concept. The last image generated remains valid for an indefinite period of time. The next frame I send over the wire may be displayed for milliseconds, or for an hour, it all depends on what state the program is in and what the user does next in terms of input.

So then, based on the above, hopefully my requirements become clearer. When I get a video frame at the client side, I need to ensure that frame gets displayed to the user, without any subsequent frame being required to arrive in order to "flush" it out of the pipeline. My frames come down at an irregular and unpredictable rate. If the pipeline has a frame or two of latency, I need to be able to manually flush that pipeline on demand, to ensure that all frames which have been sent to the codec for decoding are made visible.

At this point, I find myself fighting a few parts of the WebCodecs spec. As stated in the spec, when I call the decode method (https://www.w3.org/TR/webcodecs/#dom-videodecoder-decode), it "enqueues a control message to decode the given chunk". That decode request ends up on the "codec work queue" (https://www.w3.org/TR/webcodecs/#dom-videodecoder-codec-work-queue-slot), and its output can ultimately end up held in the "internal pending output" queue (https://www.w3.org/TR/webcodecs/#internal-pending-output). As stated in the spec: "Codec outputs such as VideoFrames that currently reside in the internal pipeline of the underlying codec implementation. The underlying codec implementation may emit new outputs only when new inputs are provided. The underlying codec implementation must emit all outputs in response to a flush." So frames may be held up in this queue, as per the spec. As also stated, however, all pending outputs MUST be emitted in response to a flush. There is, of course, an explicit flush method: https://www.w3.org/TR/webcodecs/#dom-videodecoder-flush
The spec states that this method "Completes all control messages in the control message queue and emits all outputs". Fantastic, that's what I need. Unfortunately, the spec also specifically states that in response to a flush call, the implementation MUST "Set [[key chunk required]] to true." This means that after a flush, I can only provide a key frame. Not so good. In my scenario, without knowing when, or if, a subsequent frame is going to arrive, I end up having to flush after every frame, and now due to this requirement that a key frame must follow a flush, every frame must be a keyframe. This increases my payload size significantly, and can cause noticeable delays and stuttering on slower connections.
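
To make the pattern concrete, here's a minimal sketch of the decode-then-flush flow I'm describing. The codec string and output handling are placeholders for illustration, not my actual client code:

    // Sketch of the "flush after every frame" pattern described above.
    // The codec string and output handling are illustrative placeholders.
    const decoder = new VideoDecoder({
      output: (frame: VideoFrame) => {
        // Paint the frame, then release it.
        frame.close();
      },
      error: (e: DOMException) => console.error(e),
    });

    decoder.configure({
      codec: "hvc1.1.6.L93.B0", // placeholder H.265 codec string
      optimizeForLatency: true,
    });

    async function decodeAndFlush(data: BufferSource, timestamp: number, isKey: boolean) {
      decoder.decode(new EncodedVideoChunk({
        type: isKey ? "key" : "delta",
        timestamp, // microseconds
        data,
      }));
      // flush() forces all pending outputs out of the pipeline, but per the
      // spec it also sets [[key chunk required]], so the next chunk decoded
      // after this must be a key frame. That is exactly the problem above.
      await decoder.flush();
    }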

When I use a dedicated desktop client, and have full control of the decoding hardware, I can perform a "flush" without invalidating the pipeline state, so I can, for example, process a sequence of five frames such as "IPPPP", flushing the pipeline after each one, and this works without issue. I'd like to be able to achieve the same thing under the WebCodecs API. Currently, this seems impossible, as following a call to flush, the subsequent frame must be an I frame, not a P frame.

My question now is, how can this be overcome? It seems to me, I'd need one of two things:

  1. The ability to guarantee no frames will be held up in the internal pending output queue waiting for another input, OR:
  2. The ability to flush the decoder, or the codec internal pending output queue, without invalidating the decoder pipeline.

At the hardware encoding/decoding API level, my only experience is with NVENC/NVDEC, but I know what I want is possible under this implementation at least. Are there known hardware implementations where what I'm asking for isn't possible? Can anyone see a possible way around this situation?

I can tell you right now, I have two workarounds. One is to encode every frame as a keyframe. This is clearly sub-optimal for bandwidth, and not required for a standalone client. The second workaround is ugly, but functions. I can measure the "queue depth" of the output queue, and send the same frame for decoding multiple times. This works with I or P frames. With a queue depth of 1 for example, which is what I see on Google Chrome, for each frame I receive at the client end, I send it for decoding twice. The second copy of the frame "pushes" the first one out of the pipeline. A hack, for sure, and sub-optimal use of the decoding hardware, but it keeps my bandwidth at the expected level, and I'm able to implement it on the client side alone.
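
For reference, a rough sketch of that second workaround, assuming the queue depth of 1 I observe on Chrome (an observation, not something the spec guarantees):

    // Sketch of the "decode the same chunk twice" workaround. The queue
    // depth of 1 is an observation on Chrome, not a spec guarantee.
    const OBSERVED_QUEUE_DEPTH = 1;
    let lastOutputTimestamp = -1;

    function submitFrame(decoder: VideoDecoder, data: BufferSource,
                         timestamp: number, isKey: boolean) {
      const init: EncodedVideoChunkInit = {
        type: isKey ? "key" : "delta",
        timestamp,
        data,
      };
      // First copy: the frame we actually want decoded and displayed.
      decoder.decode(new EncodedVideoChunk(init));
      // Extra copies: push the previous submission out of the internal
      // pending output queue without calling flush().
      for (let i = 0; i < OBSERVED_QUEUE_DEPTH; i++) {
        decoder.decode(new EncodedVideoChunk(init));
      }
    }

    // In the output callback, drop the duplicate outputs (same timestamp).
    function onOutput(frame: VideoFrame) {
      if (frame.timestamp === lastOutputTimestamp) {
        frame.close(); // duplicate produced by the extra decode() call
        return;
      }
      lastOutputTimestamp = frame.timestamp;
      // ... display the frame ...
      frame.close();
    }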

What I would like to see, ideally, is some extra control in the WebCodecs API. Perhaps a boolean flag in the codec configuration? We currently have the "optimizeForLatency" flag. I'd like to see a "forceImmediateOutput" flag or the like, which guarantees that every frame that is sent for decoding WILL be passed to the VideoFrameOutputCallback without the need to flush or force it through the pipeline with extra input. Failing that, an alternate method of flushing that doesn't invalidate the decode pipeline would work. Without either of these solutions though, it seems to me that WebCodecs as it stands is unsuitable for use with variable rate streams, as you have no guarantees about the depth of the internal pending output queue, and no way to flush it without breaking the stream.
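
Purely to illustrate the proposal, a hypothetical configuration might look like the following. To be clear, "forceImmediateOutput" does not exist in the spec today; it's just the flag I'm suggesting:

    // HYPOTHETICAL: "forceImmediateOutput" is the flag proposed above; it
    // is not part of the WebCodecs spec. Shown only to illustrate the idea.
    const proposedConfig = {
      codec: "hvc1.1.6.L93.B0",   // placeholder codec string
      optimizeForLatency: true,    // existing hint in the current spec
      forceImmediateOutput: true,  // proposed: every decode() must yield an output
    };
    // A decoder would then be configured as usual:
    // decoder.configure(proposedConfig as VideoDecoderConfig);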

Pulling in a reference to #220, as the discussion on that issue directly informed the current spec requirements that a key frame must follow a flush.

Whether an H.264 or H.265 stream can be decoded precisely 1-in-1-out depends on how the bitstream is set up. Is your bitstream set up correctly? If you have a sample, we can check it. https://crbug.com/1352442 discusses this for H.264.

If the bitstream is set up correctly and this isn't working, that's a bug in that UA. If it's Chrome, you can file an issue at https://crbug.com/new with the chrome://gpu information and a sample bitstream.

If the bitstream is not set up properly, there's nothing WebCodecs can do to help, since the behavior would be up to the hardware decoder.

@sandersdan

First I should note that there is VideoDecoderConfig.optimizeForLatency, which can affect the choices made in setting up a codec. (In Chrome it affects how background threads are set up for software decoding.)
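
For reference, the flag is supplied at configure() time; a minimal sketch (the codec string here is just a placeholder):

    // Minimal sketch showing where optimizeForLatency is supplied.
    // Use the codec string that matches your actual bitstream.
    const decoder = new VideoDecoder({
      output: (frame) => { /* display the frame */ frame.close(); },
      error: (e) => console.error(e),
    });
    decoder.configure({
      codec: "hvc1.1.6.L93.B0", // placeholder
      optimizeForLatency: true, // hint: minimize inputs buffered before output
    });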

I expect that WebCodecs implementations will output frames as soon as possible, subject to the limitations of the codec implementation. This means that I predict 1-in-1-out behavior unless the bitstream prevents it. The most likely bitstream limitation is when frame reordering is enabled, as for example discussed in the crbug linked above.

Some codec implementations do allow us to "flush without resetting", which can be used to force 1-in-1-out behavior, but I am reluctant to specify this feature knowing that not all implementations can support it (e.g. Android MediaCodec).

If you can post a sample of your bitstream, I'll take a look and let you know if there is a change you can make to improve decode latency.

Thanks, your replies have given me some things to check on. I'll dig into the bitstream myself as well and see if I can spot any issues with the configuration.

As requested, I've thrown together a minimal sample in a single html page, with some individual frames base64 encoded in:
index.zip
There's a small animation of a coloured triangle which changes shape over the course of 5 frames. I have five buttons F1_I through F5_I, which are IDR frames. As you press each button, that frame will be sent to the decoder. The first frame includes the SPS header. I have also alternatively encoded frames 2-5 as P frames, under buttons F2_P through F5_P, and there's a button to flush.

What I'd like to see is each frame appear as soon as you hit the corresponding button. Right now you'll observe the 1 frame of latency, where I have to send the frame twice, send a new frame in, or flush the decoder to get it to appear.
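
If you don't want to open the zip, the page drives the decoder roughly like this (a simplified sketch; the codec string, canvas lookup, and timestamps are illustrative, and the real page embeds the encoded frames as base64 strings):

    // Simplified sketch of what the sample page does on each button press.
    const canvas = document.querySelector("canvas")!;
    const canvasCtx = canvas.getContext("2d")!;

    const decoder = new VideoDecoder({
      output: (frame) => {
        canvasCtx.drawImage(frame, 0, 0); // paint as soon as the frame is output
        frame.close();
      },
      error: (e) => console.error(e),
    });
    decoder.configure({
      codec: "hvc1.1.6.L93.B0", // placeholder; the sample supplies its own
      optimizeForLatency: true,
    });

    function onButtonPress(base64Frame: string, isKey: boolean, timestamp: number) {
      const bytes = Uint8Array.from(atob(base64Frame), c => c.charCodeAt(0));
      decoder.decode(new EncodedVideoChunk({
        type: isKey ? "key" : "delta",
        timestamp,
        data: bytes,
      }));
      // Expected: the frame appears now. Observed: it only appears after the
      // next decode() or an explicit flush(), i.e. one frame of latency.
    }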

I took a quick look at the bitstream you have provided, and at first glance it looks correctly configured to me (in particular, max_num_reorder_frames is set to zero). I also ran the sample on Chrome Canary on macOS, and while I was not able to decode the P frames, the I frames did display immediately.

You may be experiencing a browser bug, which should be filed with the appropriate bug tracker (Chrome: https://bugs.chromium.org/p/chromium/issues/entry).

Thanks, I also couldn't find problems in the bitstream. I did however diagnose the problem in Chrome. In the chromium project, under "src\media\gpu\h265_decoder.cc", inside the H265Decoder::Decode() method, there's a bit of code that currently looks like this:

      if (par_res == H265Parser::kEOStream) {
        curr_nalu_.reset();
        // We receive one frame per buffer, so we can output the frame now.
        CHECK_ACCELERATOR_RESULT(FinishPrevFrameIfPresent());
        return kRanOutOfStreamData;
      }

As per the comment, the intent is to output the frame when the end of the bitstream is reached, but the FinishPrevFrameIfPresent() method doesn't actually output the frame; it just submits it for decoding. The fix is simple:

      if (par_res == H265Parser::kEOStream) {
        curr_nalu_.reset();
        // We receive one frame per buffer, so we can output the frame now.
        CHECK_ACCELERATOR_RESULT(FinishPrevFrameIfPresent());
        OutputAllRemainingPics();
        return kRanOutOfStreamData;
      }

A corresponding change also needs to occur in h264_decoder.cc, which shares the same logic. I tested this change, and it fixes the issue. I'll prepare a submission for the Chromium project, see what they think.

On the issue of the spec itself though, I'm still concerned by an apparent lack of guarantees around when frames submitted for decoding will be made visible. You mentioned that "I expect that WebCodecs implementations will output frames as soon as possible, subject to the limitations of the codec implementation. This means that I predict 1-in-1-out behavior unless the bitstream prevents it." This is reassuring, and it's what I'd like to see myself, but I couldn't actually find any part of the standard that requires it. Couldn't a perfectly conforming implementation hold onto 1000 frames right now and claim to meet the spec? This could lead to inconsistent behaviour between implementations, and to problems getting bug reports actioned, as they may be closed as "by design".

From what you said before, I gather the intent of the "optimizeForLatency" flag is essentially the same as what I was asking for with "forceImmediateOutput": both are intended to instruct the decoder not to hold onto any frames beyond what the underlying bitstream requires. The wording in the spec right now doesn't really guarantee it does much of anything, though. The spec describes the "optimizeForLatency" flag as a:

"Hint that the selected decoder SHOULD be configured to minimize the number of EncodedVideoChunks that have to be decoded before a VideoFrame is output.
NOTE: In addition to User Agent and hardware limitations, some codec bitstreams require a minimum number of inputs before any output can be produced."

Making it a "hint" that the decoder "SHOULD" do something doesn't really sound reassuring. Can't we use stronger language here, such as "MUST"? It would also be helpful to elaborate on this flag a bit, maybe tie it back to the kind of use case I've outlined in this issue. I'd like to see the standard read more like this under the "optimizeForLatency" flag:

"Instructs the selected decoder that it MUST ensure that the absolute minimum number of EncodedVideoChunks have to be decoded before a VideoFrame is output, subject to the limitations of the codec and the supplied bitstream. Where the codec and supplied bitstream allows frames to be produced immediately upon decoding, setting this flag guarantees that each decoded frame will be produced for output without further interaction, such as requiring a call to flush(), or submitting further frames for decoding."

If I saw that in the spec as a developer making use of VideoDecoder, it would give me confidence that setting that flag would be sufficient to achieve 1-in-1-out behaviour, with appropriately prepared input data. A lack of clarity on that is what led me here to create this issue. It would also give stronger guidance to developers implementing this spec that they really needed to achieve this result to be conforming. Right now it reads more as a suggestion or a nice-to-have, but not required.

For completeness, I'll add that this issue has been reported to the Chromium project under https://crbug.com/1462868, and I've submitted a proposed fix under https://chromium-review.googlesource.com/c/chromium/src/+/4672706

Thanks, we can continue discussion of the Chromium issue there.

In terms of the spec language, I think the current text accurately reflects that UAs are beholden to the underlying hardware decoders (and other OS constraints). I don't think we can use SHOULD/MUST here due to that. We could probably instead add some more non-normative text giving a concrete example (1-in-1-out) next to the optimizeForLatency flag.

Ultimately it seems like you're looking for confidence that you can always rely on 1-in-1-out behavior. I don't think you'll find that to be true; e.g., many Android decoders won't work that way. It's probably best to submit a test decode on your target platforms before relying on that functionality.
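
As a rough sketch of such a test: decode a single key frame and see whether the output callback fires without a flush() within some deadline (the timeout value, config, and key frame bytes are whatever suits your application):

    // Sketch of a runtime capability test: does one decode() produce one
    // output without a flush()? The timeout value is arbitrary.
    function testOneInOneOut(config: VideoDecoderConfig,
                             keyFrame: BufferSource,
                             timeoutMs = 500): Promise<boolean> {
      return new Promise((resolve) => {
        let settled = false;
        const finish = (result: boolean) => {
          if (settled) return;
          settled = true;
          if (decoder.state !== "closed") decoder.close();
          resolve(result);
        };
        const decoder = new VideoDecoder({
          output: (frame) => { frame.close(); finish(true); }, // output arrived without a flush
          error: () => finish(false),
        });
        decoder.configure(config);
        decoder.decode(new EncodedVideoChunk({ type: "key", timestamp: 0, data: keyFrame }));
        // Deliberately no flush(): we are testing whether output arrives anyway.
        setTimeout(() => finish(false), timeoutMs);
      });
    }

    // Usage sketch: testOneInOneOut(myConfig, firstKeyFrameBytes).then(ok => { ... });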

@Djuffin @padenot It looks like this issue was due to a bug, rather than a spec issue. Can we close it?