Consider marking an I-frame with Recovery Point SEI message as h264 key frame

Question

reinhrst opened this issue a year ago · comments

At start of decode (and after a flush), WebCodecs VideoDecoder demands a keyframe which at the moment is defined as an IDR frame.

H264 has the concept of a Recovery Point SEI Message (D.2.8 in the (08.21) h264 spec): "The recovery point SEI message assists a decoder in determining when the decoding process will produce acceptable pictures for display after the decoder initiates random access or after the encoder indicates a broken link in the coded video sequence.".

So (afaict) an I-frame with such an SEI message is meant to be usable as the start frame for a decoding operation.

ffprobe also marks these frames as key-frames.

I don't have enough data to comment on how often this happens in real-life video streams; personally I have 1000s of hours of video taken with different JVC / Sony camcorders (time-lapse recordings, used in animal conservation projects), which have the following properties:

Stream starts (when record button is pressed) with IDR frame
IBBPBBPBBPBBI GOPs, where every I-frame has Recovery Point SEI message with exact_match_flag=1 and recovery_frame_cnt=0
IDR frames repeat every 300 frames (every 25 GOPs)
Streams get "cut" into a new file after 4GB of recording; the new file starts with an I-frame, but not necessarily an IDR frame.

Not being able to start decoding on I-frame + SEI means that:

In the worst case, the first 24 GOPs of a stream cannot be decoded without access to the previous file
When random access is needed in the decoder, in the worst case 299 frames need to be decoded before the requested frame can be shown (this takes about 0.25s on my M1 MacBook; not the end of the world, but not a smooth drag-playhead-and-find experience for users either). Note that the video files are generally 4GB large, so decoding all frames up-front is also not a solution.

A client-side workaround (short of re-encoding, which results in unacceptable quality loss) that kind of seems to work, but is probably a very bad idea, is to offer the decoder a dummy IDR frame before feeding it the real stream, and then to drop the first frame of the output.

orange4glace · Answer 1 · Mon May 08 2023 17:02:09 GMT+0800 (China Standard Time)

I have a similar question,
I'm trying to decode an h264 stream from an mp4 file. The STSS box says a given sample is a sync sample, but when I inspect the actual sample data, it consists of 2 NALUs: one is a 5-byte SEI (0x06) and the other is a non-IDR slice (0x41) with picture data.
When I inspect it with ffprobe, it says it is an I-frame (and also a key frame), although I don't know why (I'm a newbie to media processing).
I want to start decoding with such a sample, but it errors out saying that VideoDecoder needs a key frame.
Is it also related with this issue?
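
For anyone who wants to check what a sample actually contains, here is a minimal sketch that splits a sample into NAL units and reads each nal_unit_type. It assumes the usual MP4 layout where every NAL unit is preceded by a 4-byte big-endian length (the real length size is given by the avcC box, so 4 is an assumption), and the function names are made up for illustration.

```javascript
// NAL unit types from the H.264 spec that matter in this thread.
const NAL_NON_IDR_SLICE = 1;
const NAL_IDR_SLICE = 5;
const NAL_SEI = 6;

// Split a sample (Uint8Array) into NAL units, assuming each NAL unit is
// preceded by a 4-byte big-endian length. Stops at the first malformed length.
function splitNalus(sample) {
  const view = new DataView(sample.buffer, sample.byteOffset, sample.byteLength);
  const nalus = [];
  let offset = 0;
  while (offset + 4 <= sample.byteLength) {
    const size = view.getUint32(offset);
    const start = offset + 4;
    if (size === 0 || start + size > sample.byteLength) break;
    nalus.push(sample.subarray(start, start + size));
    offset = start + size;
  }
  return nalus;
}

// nal_unit_type is the low 5 bits of the first byte of each NAL unit.
function nalTypes(sample) {
  return splitNalus(sample).map((nalu) => nalu[0] & 0x1f);
}

// Summarise which of the interesting NAL unit types a sample contains.
function describeSample(sample) {
  const types = nalTypes(sample);
  return {
    hasIdr: types.includes(NAL_IDR_SLICE),
    hasSei: types.includes(NAL_SEI),
    hasNonIdrSlice: types.includes(NAL_NON_IDR_SLICE),
  };
}
```

A sample like the one described above (an SEI starting with 0x06 followed by a slice starting with 0x41) would come back with hasSei and hasNonIdrSlice set and hasIdr unset, since 0x41 & 0x1f is 1.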

Claude · Answer 2 · Mon May 08 2023 17:09:18 GMT+0800 (China Standard Time)

Very likely. ffprobe reports the frames that carry recovery point SEI messages as I-frames (~~whereas technically they are not, and the VideoDecoder spec does not~~ edit: sorry, it's been a while since I dove into the details here. They ARE I-frames, just not IDR frames. IIRC, ffprobe labels them as key frames, whereas for VideoDecoder they are not enough of a keyframe).

I had limited success with rewriting the first frame to identify as an IDR-Frame; the decoder will show a green screen, but after a couple of frames I got an image (using the software decoder in Chrome). Although, this is obviously very hacky and should (probably) not be tried in production.

I do feel the first frame should have enough data to actually be an IDR-Frame, so it should (in theory) be possible to reencode only the first frame to be an IDR-Frame, but no idea how complex this is (without external tools like ffmpeg).

Francois Laberge · Answer 3 · Mon May 08 2023 21:48:55 GMT+0800 (China Standard Time)

@reinhrst Do you have an example video file that exhibits this? I'd love to test if our video playback system handles it. Would be much appreciated.

Claude · Answer 4 · Mon May 08 2023 22:03:33 GMT+0800 (China Standard Time)

@seflless I have a whole bunch of 4GB video files with this behaviour, however I can see if I can convince ffmpeg to cut out the first couple of minutes :). Will send them to you in a PM, since I'm not 100% sure the copyright owner would agree with me making them public.

Out of interest, when you say "our video playback system", what system are you talking about?

Dan Sanders · Answer 5 · Tue May 09 2023 01:58:19 GMT+0800 (China Standard Time)

Not all decoders support starting at SEI recovery points, so if this feature were to be added it would likely need to be an optional extension. I'm not immediately sure what such an API would look like, it could be as simple as allowing feeding a non-keyframe and you take your chances as to whether the decode will fail.

That said, there is little difference between recovery_frame_cnt=0 and an IDR, so I'm a little confused as to why the camera wouldn't just make a real IDR here. It's plausible that almost all decoders would support decoding from such an I frame.

Claude · Answer 6 · Tue May 09 2023 17:04:44 GMT+0800 (China Standard Time)

That said, there is little difference between recovery_frame_cnt=0 and an IDR, so I'm a little confused as to why the camera wouldn't just make a real IDR here. It's plausible that almost all decoders would support decoding from such an I frame.

@sandersdan I was struggling with the same question, and tried to ask it on stack-overflow, did not get a conclusive answer...

My hunch right now is this:

an IDR frame means no frames in decode or presentation order can reference frames before the IDR frame.
an I-frame with SEI Recovery Point and recovery_frame_cnt=0 and exact_match_flag=1 I expect (but I really need to do more research before I can say for sure) can have frames later in decode order (but earlier in presentation order) that reference earlier frames.

Hence, playback can start at the SEI recovery point (and all frames that come after in presentation order can be decoded), however there may be frames with earlier presentation order that need to be dropped by the decoder (in other words, a decoder can not drop the decoded frame cache on SEI recovery point).

This means that you can (usually) have 2 additional B frames in your GOP (also see the "updated" section in the linked stackoverflow question), meaning you can get better compression for the same quality.

I would be more than happy for someone with more knowledge on the subject to confirm/reject my theory.

Francois Laberge · Answer 7 · Tue May 09 2023 22:32:57 GMT+0800 (China Standard Time)

@reinhrst That'd be awesome if you could send over a smaller version, big versions are fine if you are strapped for time. I can't say what I'm building just yet, will be public soon enough, definitely not in a public comment at least.

Marcello Bastéa-Forte · Answer 8 · Sun May 21 2023 03:09:23 GMT+0800 (China Standard Time)

We have come across a file that seems to have this issue (I believe it was downloaded off YouTube).

Because we're demuxing using libav.js, it considers the frames keyframes and I don't see a way to figure out this "keyframe but not really a keyframe" distinction from it.

If we seek to start decoding from one of them, VideoDecoder.decode synchronously throws DOMException: Failed to execute 'decode' on 'VideoDecoder': A key frame is required after configure() or flush(). (I confirmed that the decoder state is 'configured' and EncodedVideoChunk.type is 'key').

try {
    this.decoder.decode(chunk);
} catch (e) {
    console.error(`[${id}] error decoding chunk (decoder state = ${this.decoder.state})`, chunk, e);
    throw e;
}

Claude · Answer 9 · Tue May 23 2023 02:46:04 GMT+0800 (China Standard Time)

@seflless I emailed you a video last week that I now can confirm indeed starts with 216 frames before the first IDR frame (18 of those 216 frames were I-Frames with Recovery Point).

In the links below I share the first 10 seconds (250 frames) of this video:

Considering that the first IDR frame only appears at second 8.5, if you see anything more than 1.5 seconds of video, your video player starts decoding at the first I-frame with a Recovery Point (all desktop players I have tried do so, but I'm sure I did not test extensively). The first timestamp you see (burned into the video) is around 5.8.2022 10:21:52.

Note that the video is a time-lapse (it was recorded at 1 frame per second, shown at 25 frames per second), and it's an interlaced format (which is why ffprobe sometimes claims the mp4 file is 50 fps).

The original files from the camcorder are .MTS, however the h264 frames have been copied 1:1 into these new files.

Francois Laberge · Answer 10 · Tue May 23 2023 03:49:12 GMT+0800 (China Standard Time)

@reinhrst I missed that email, very helpful, thank you. I'll dig into this more when I'm back on the task, busy with some other priorities at the moment. These are some scary files, engineering wise :)

Dan Sanders · Answer 11 · Thu Jun 22 2023 04:32:54 GMT+0800 (China Standard Time)

The interpretation in #650 (comment) makes sense to me. I'm not sure if we would want such a frame to be called "key", but if not we could also make a new type, perhaps "recovery". This makes UA support detectable and lets us specify extra rules if we need to.

I don't know whether we need per-codec feature detection for this, but if we do then we can make it a configuration flag, eg. {codec: 'avc1.420034', recoveryChunks: true}.

Claude · Answer 12 · Wed Nov 22 2023 18:40:02 GMT+0800 (China Standard Time)

After a discussion with the maintainer of libav.js, I understand that there is no way in libav to distinguish between a proper IDR frame and an I-frame with a Recovery Point message. Both have the AV_PKT_FLAG_KEY flag set.

This means that right now the solution seems to be to either manually decode the packet, or just feed it to WebCodecs, and try another packet in case of an error until you find a packet that works. Afterwards you can then feed the original packet. Quite messy....
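
A rough sketch of the "just try and see" approach: walk through the chunks the demuxer marked as key and return the first one the decoder accepts. This assumes the rejection is thrown synchronously from decode() (as in the report above) and that the decoder stays usable after such a throw; I have not verified the latter across browsers. `findFirstAcceptedChunk` and `tryDecode` are hypothetical names, not part of WebCodecs.

```javascript
// Return the index of the first chunk the decoder accepts as a start point,
// or -1 if none is accepted.
// `tryDecode` is a caller-supplied callback that should call
// decoder.decode(chunk) and let any synchronous exception propagate.
function findFirstAcceptedChunk(chunks, tryDecode) {
  for (let i = 0; i < chunks.length; i++) {
    // Chunks the demuxer did not mark as key cannot start a decode at all.
    if (chunks[i].type !== 'key') continue;
    try {
      tryDecode(chunks[i]);
      return i;
    } catch (e) {
      // Rejected (for example "A key frame is required after configure()
      // or flush()"); move on to the next candidate.
    }
  }
  return -1;
}

// Possible usage with a real decoder (not run here):
//   const start = findFirstAcceptedChunk(chunks, (c) => decoder.decode(c));
//   for (const c of chunks.slice(start + 1)) decoder.decode(c);
```

Note that the accepted chunk has already been submitted once by the probe, so the caller continues from the chunk after it. Errors reported asynchronously through the decoder's error callback would not be caught by this loop.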

The interpretation in #650 (comment) makes sense to me. I'm not sure if we would want such a frame to be called "key", but if not we could also make a new type, perhaps "recovery". This makes UA support detectable and lets us specify extra rules if we need to.

I don't know whether we need per-codec feature detection for this, but if we do then we can make it a configuration flag, eg. {codec: 'avc1.420034', recoveryChunks: true}.

I see how it makes sense to have a different type for this if e.g. you output them (through VideoEncoder). It would be great though if during input to VideoDecoder, we would not have to set these types (or maybe have something different like type: "key_or_recovery" or type: "auto"), since demuxers (like libav) may not share this information.

Marcello Bastéa-Forte · Answer 13 · Thu Nov 23 2023 06:56:43 GMT+0800 (China Standard Time)

Not all decoders support starting at SEI recovery points, so if this feature were to be added it would likely need to be an optional extension.

Is it possible this isn't true? If I download any YouTube video into a mp4 it seems to have these special keyframes, how does the browser handle them in the <video> tag?

Dan Sanders · Answer 14 · Tue Nov 28 2023 02:23:53 GMT+0800 (China Standard Time)

Is it possible this isn't true?

Any H.264 decoder can decode them, what isn't guaranteed is starting playback at (ie. seeking to) them.

It is possible that ~every decoder in active use on desktop/mobile can support this. I suspect that is not true for embedded, but I am not certain.

When recovery_frame_cnt = 0, it should be sufficient for a decoder to simply allow decoding to start at a non-keyframe (it must handle gaps_in_frame_num for the first frame). When recovery_frame_cnt > 0, a decoder must additionally support some form of error resiliency (it must be able to track or recover from missing reference frames).
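
To tell the recovery_frame_cnt = 0 case from the recovery_frame_cnt > 0 case on the client, the SEI NAL unit has to be parsed. Below is a minimal sketch of that, based on my reading of the recovery point SEI syntax (payloadType 6, then recovery_frame_cnt as ue(v), exact_match_flag, broken_link_flag and a 2-bit changing_slice_group_idc). The function names are made up, and the input is assumed to be one complete SEI NAL unit without its length prefix.

```javascript
// Remove emulation prevention bytes (00 00 03 -> 00 00) from NAL unit data.
function unescapeRbsp(bytes) {
  const out = [];
  let zeros = 0;
  for (const b of bytes) {
    if (zeros >= 2 && b === 3) { zeros = 0; continue; }
    out.push(b);
    zeros = b === 0 ? zeros + 1 : 0;
  }
  return Uint8Array.from(out);
}

// Tiny MSB-first bit reader with unsigned Exp-Golomb (ue(v)) support.
function makeBitReader(bytes) {
  let pos = 0; // current bit position
  const readBit = () => {
    const bit = (bytes[pos >> 3] >> (7 - (pos & 7))) & 1;
    pos++;
    return bit;
  };
  const readBits = (n) => {
    let value = 0;
    for (let i = 0; i < n; i++) value = value * 2 + readBit();
    return value;
  };
  const readUe = () => {
    let leadingZeros = 0;
    while (readBit() === 0 && leadingZeros < 32) leadingZeros++;
    return 2 ** leadingZeros - 1 + readBits(leadingZeros);
  };
  return { readBits, readUe };
}

// Return the recovery point fields of an SEI NAL unit, or null if the
// NAL unit carries no recovery point message.
function parseRecoveryPoint(seiNalu) {
  const rbsp = unescapeRbsp(seiNalu.subarray(1)); // skip the 1-byte NAL header
  let offset = 0;
  // One SEI NAL unit may hold several messages; the last byte is trailing bits.
  while (offset + 2 <= rbsp.length) {
    let payloadType = 0;
    while (rbsp[offset] === 0xff) { payloadType += 255; offset++; }
    payloadType += rbsp[offset++];
    let payloadSize = 0;
    while (rbsp[offset] === 0xff) { payloadSize += 255; offset++; }
    payloadSize += rbsp[offset++];
    if (payloadType === 6) {
      const bits = makeBitReader(rbsp.subarray(offset, offset + payloadSize));
      return {
        recoveryFrameCnt: bits.readUe(),
        exactMatchFlag: bits.readBits(1),
        brokenLinkFlag: bits.readBits(1),
        changingSliceGroupIdc: bits.readBits(2),
      };
    }
    offset += payloadSize;
  }
  return null;
}
```

With this, a client could treat `recoveryFrameCnt === 0` as the "probably safe to start here" case and fall back to an earlier IDR otherwise. Whether a given decoder actually accepts such a start point is a separate question.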

how does the browser handle them in the <video> tag?

For quite some time Chrome's implementation of hardware decoding did not support SEI recovery on any platform, and would fall back to software decoding when it was detected. There are still cases where recovery_frame_cnt > 0 is not supported.

It's also possible to start playback at an earlier true IDR; by spec there must be one (although I have seen media that violates this requirement). This wastes resources decoding additional unused frames when seeking.

Claude · Answer 15 · Wed Nov 29 2023 14:07:04 GMT+0800 (China Standard Time)

When recovery_frame_cnt = 0, it should be sufficient for a decoder to simply allow decoding to start at a non-keyframe (it must handle gaps_in_frame_num for the first frame). When recovery_frame_cnt > 0, a decoder must additionally support some form of error resiliency (it must be able to track or recover from missing reference frames).

Would it be too simple to think that this could be solved easily by handing the decoder a completely green key frame (or multiple if recovery_frame_cnt > 0) and then ignore this frame in the output?

It's also possible to start playback at an earlier true IDR; by spec there must be one (although I have seen media that violates this requirement). This wastes resources decoding additional unused frames when seeking.

Both true. The files that originally sparked this topic come from one large M2TS video stream that the camcorder cuts (on an I-frame, but not necessarily an IDR frame) so that files don't grow larger than 4GB. I guess in theory (if you stop recording right after the cut is made), you may even end up with a file without any IDR frames.

Dan Sanders · Answer 16 · Thu Nov 30 2023 01:35:37 GMT+0800 (China Standard Time)

Would it be too simple to think that this could be solved easily by handing the decoder a completely green key frame (or multiple if recovery_frame_cnt > 0) and then ignore this frame in the output?

I believe this can be made to work, but you have to create frames with very specific headers, and the complexity of doing that is substantial (similar to writing an unoptimized H.264 encoder that supports all possible profiles). It is frustratingly a lot easier to implement this sort of recovery inside of a decoder.

Jean-Yves Avenard · Answer 17 · Thu Nov 30 2023 06:08:04 GMT+0800 (China Standard Time)

It's also possible to start playback at an earlier true IDR; by spec there must be one (although I have seen media that violates this requirement). This wastes resources decoding additional unused frames when seeking.

Streams with no IDR but only SEI recovery are not uncommon in the broadcasting world. I've seen plenty of HLS content / MPEG-TS from some broadcaster (particularly satellite broadcast ones) with them. It allows for consistent bitrate.

Yahweasel · Answer 18 · Fri Dec 01 2023 05:26:21 GMT+0800 (China Standard Time)

Just to be clear, the reason why libav.js marks these frames as keyframes, aside from the fact that they are, is that it's just taking that data from the demuxer. It does not invoke any packet parser to determine keyframe status. This choice in WebCodecs's definition makes it exceedingly difficult to actually use WebCodecs with any real file formats, because when I read a packet from an MP4 file, or a Matroska file, or a MPEG-TS file, or anything else, those chunks are marked in the format header as being keyframes. Munging them into WebCodecs's definition would require changing the entire stack and all preexisting files, or parsing frames twice (once to determine if they're a keyframe or a super-ultra keyframe, and once to actually decode them).

Dan Sanders · Answer 19 · Sat Dec 02 2023 01:31:23 GMT+0800 (China Standard Time)

Just to be clear, the reason why libav.js marks these frames as keyframes, aside from the fact that they are, is that it's just taking that data from the demuxer.

Terminology is perhaps ambiguous here. SEI recovery frames are I frames but not IDR frames (H.264 terminology). This makes them recovery points (roll=0) but not sync samples (ISO BMFF terminology). Right now, WebCodecs is equating "key" to "sync sample".

WebCodecs is strict about this because MSE was originally not, and that led to content with intentionally mis-marked keyframes. I'm confident that we'll eventually find the right set of tradeoffs, but it's going to be by cautiously removing restrictions.

Based on experience with MSE, I would not expect all muxers to correctly mark recovery points, but at least some do, and in that case a demuxer can distinguish them without parsing the bitstream.

Dan Sanders · Answer 20 · Sat Dec 02 2023 01:37:38 GMT+0800 (China Standard Time)

Streams with no IDR but only SEI recovery are not uncommon in the broadcasting world. [..] It allows for consistent bitrate.

Recovery frames with recovery_frame_cnt = 0 shouldn't affect bitrate much compared to full IDR. There is also rolling intra where recovery_frame_cnt > 0; that's much more consistent but does require compatible decoders. (Conveniently for cable providers they do get to control the decode hardware. I don't know if similar applies to OTA, but it would make sense to standardize support.)

Claude · Answer 21 · Thu Dec 14 2023 01:54:51 GMT+0800 (China Standard Time)

That said, there is little difference between recovery_frame_cnt=0 and an IDR, so I'm a little confused as to why the camera wouldn't just make a real IDR here. It's plausible that almost all decoders would support decoding from such an I frame.

Much later, and I have a better answer (also posted in detail here on stackoverflow).

In short (unless noted otherwise, all orders are presentation order):

The stream is BBIBBPBBPBBPBBiBBPBBPBBPBB.... So there is an IDR frame on frameindex 300 * i + 2, an I frame every 12 * i + 2 (unless IDR frame), a P frame every 3 * i + 2 (unless IDR or I frame), and everything else is B frame
In decode order, all IDR/I and P frames are 2 frames earlier.
After an I frame with a recovery message, any frame that follows it in presentation order is decodable from there. However, in decode order the I frame is followed by two B frames that cannot be decoded
An IDR frame on the other hand means that any frames in both presentation and decode order are decodable from there (because IDR means that all internal frame buffers may be emptied). So the B frames preceding the IDR frame in presentation order (and following it in decode order) can only refer to the IDR frame. So these will have slightly worse compression.
So fewer IDR frames (and more I frames with recovery messages) leads to a better bitrate / quality.

Note that this is also the reason that the stream is completely happy starting with two B frames (in PO).
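
The layout described above can be written down as a small sketch (presentation order only; the function names are just for illustration). It also shows how far back a seek has to go, depending on whether only IDR frames or also I frames with a recovery point are allowed as start points.

```javascript
// Frame type at a given presentation index for the stream described above:
// IDR at 300*i + 2, I at 12*i + 2, P at 3*i + 2, everything else B.
function frameTypeAt(presentationIndex) {
  if (presentationIndex % 3 !== 2) return 'B';
  if ((presentationIndex - 2) % 300 === 0) return 'IDR';
  if ((presentationIndex - 2) % 12 === 0) return 'I';
  return 'P';
}

// Render the first `count` frames using the notation from the post
// ('I' for IDR, 'i' for I frame with recovery point).
function gopPattern(count) {
  const letters = { IDR: 'I', I: 'i', P: 'P', B: 'B' };
  let pattern = '';
  for (let i = 0; i < count; i++) pattern += letters[frameTypeAt(i)];
  return pattern;
}

// Presentation index of the latest usable start point at or before `index`,
// or -1 if there is none (the two leading B frames).
function latestStartPoint(index, allowRecoveryPoints) {
  if (index < 2) return -1;
  const period = allowRecoveryPoints ? 12 : 300;
  return index - ((index - 2) % period);
}

console.log(gopPattern(26)); // BBIBBPBBPBBPBBiBBPBBPBBPBB
```

For a seek to frame 301, `latestStartPoint(301, false)` gives 2 (299 frames back, the worst case mentioned in the original post), while `latestStartPoint(301, true)` gives 290.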

zhaopeng · Answer 22 · Thu Mar 07 2024 19:10:37 GMT+0800 (China Standard Time)

I think an open GOP should start with an I frame with a recovery point SEI instead of an IDR frame.

风痕 · Answer 23 · Wed Jul 03 2024 16:25:01 GMT+0800 (China Standard Time)

I encountered the same issue. When an SEI is included before an IDR frame, decoding errors occur:
DOMException: Failed to execute 'decode' on 'VideoDecoder': A key frame is required after configure() or flush().

Its key binary data is as follows:

// The first four bytes indicate the size
0 0 0 62     | 6 ...     // SEI data
0 0 0 2      | 9 16      // access_unit_delimiter_rbsp
0 2 149 240  | 101 ...   // 101 binary: '1100101', the last 5 bits value is 5 (NALU IDR type).

When I remove the SEI from the binary data and reconstruct the EncodedVideoChunk object, the VideoDecoder can decode it correctly.

const ab = new ArrayBuffer(chunk.byteLength);
chunk.copyTo(ab);

const fixedChunk = new EncodedVideoChunk({
  type: chunk.type,
  timestamp: chunk.timestamp,
  duration: chunk.duration ?? 0,
  // 66 is SEI data size
  data: ab.slice(66),
});

I expect the VideoDecoder to proactively ignore SEI information in the chunk without throwing an error.
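
The hard-coded 66 above only works for this one chunk. A slightly more general sketch of the same workaround walks the length-prefixed NAL units and drops every SEI (nal_unit_type 6). It assumes 4-byte big-endian length prefixes as in the dump above, and `stripSeiNalus` is a made-up helper name. Note that this also throws away any recovery point information, which may or may not matter for your use case.

```javascript
// Return a copy of `data` (Uint8Array of length-prefixed NAL units) with all
// SEI NAL units removed. If a length prefix looks malformed, the input is
// returned unchanged.
function stripSeiNalus(data) {
  const view = new DataView(data.buffer, data.byteOffset, data.byteLength);
  const kept = [];
  let offset = 0;
  while (offset < data.byteLength) {
    if (offset + 4 > data.byteLength) return data; // truncated length prefix
    const size = view.getUint32(offset);
    const end = offset + 4 + size;
    if (size === 0 || end > data.byteLength) return data; // malformed NAL unit
    const nalType = data[offset + 4] & 0x1f;
    // Keep the length prefix together with the NAL unit it describes.
    if (nalType !== 6) kept.push(data.subarray(offset, end));
    offset = end;
  }
  const total = kept.reduce((sum, part) => sum + part.byteLength, 0);
  const out = new Uint8Array(total);
  let pos = 0;
  for (const part of kept) {
    out.set(part, pos);
    pos += part.byteLength;
  }
  return out;
}
```

In the snippet above this would replace `ab.slice(66)` with `stripSeiNalus(new Uint8Array(ab))` when building the new EncodedVideoChunk.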