dukebw / lintel

A Python module to decode video frames directly, using the FFmpeg C API.

Random seek point can overshoot

corneliusboehm opened this issue · comments

Hi there. I was getting lots of "Ran out of frames. Looping." messages in my application, even though I never requested more frames than were available. The following test showed that the errors were nondeterministic: even for the same video and the same number of frames, the message sometimes appeared and sometimes did not:

import lintel

def test_ran_out_of_frames(width, height, duration, framerate):
    for idx in range(50):
        print('Run {}'.format(idx))

        with open('test_video.mp4', 'rb') as f:
            encoded_video = f.read()

        num_frames = int(duration * framerate)
        video, _ = lintel.loadvid(encoded_video,
                                  height=height,
                                  width=width,
                                  num_frames=num_frames)

Output:

Run 0
Run 1
Run 2
Ran out of frames. Looping.
Run 3
Run 4
Run 5
Run 6
Ran out of frames. Looping.
Run 7
Run 8
Run 9
.
.
.

Doing the same with should_random_seek=False solved the problem, so it very much looks like the random seek point does not truly leave enough room for the requested frames.
Can you reproduce the issue and, if so, take a look into it?
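For illustration, here is a pure-Python sketch (not lintel's actual code) of the invariant a clamped random seek should maintain; the behaviour reported above is consistent with drawing the seek point without the clamp:

```python
import random

def choose_seek_frame(total_frames, num_frames, clamp=True):
    """Pick a random first-frame index for a num_frames-long clip.

    With clamp=True the seek point always leaves room for the requested
    frames; with clamp=False (modelling the suspected bug) the clip can
    run past the end of the video.
    """
    upper = total_frames - num_frames if clamp else total_frames - 1
    return random.randint(0, max(upper, 0))

random.seed(0)
total, requested = 300, 100  # e.g. a 10 s video at 30 fps
for _ in range(1000):
    start = choose_seek_frame(total, requested, clamp=True)
    assert start + requested <= total  # never "runs out of frames"
```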

Hi! I think respecting the number of frames in the video is tricky: because of keyframe placement, for example, the seek can overshoot. But I can take a look, since I agree this is less than satisfactory. Perhaps a "strict" mode is called for.

What is the length of video you are using? I will try to reproduce with a similar length video.

Thanks for your quick reply! The length of my video is about 10 seconds.

Hello again, I was wondering if you might be able to give the seek-point-overshoot branch a try (https://github.com/dukebw/lintel/tree/seek-point-overshoot).

I changed the seeking in loadvid to work based on AVStream.nb_frames, rather than on rough approximations in seconds. This is closer to how the loadvid_frame_nums seeking works now.

I have still found at least one video where AVStream.nb_frames is wrong: AVStream.nb_frames reported 168 frames, but using receive_frame() I could only decode 166. So the method will still overshoot by however much AVStream.nb_frames overestimates the number of frames in the video. If that level of accuracy is required, it may be better to preprocess all the videos by counting frames with receive_frame(), store metadata about how many frames can really be decoded, and then use loadvid_frame_nums to fetch only frames within those bounds. I'm also not sure whether the estimate here:

if ((video_stream->duration <= 0) || (video_stream->nb_frames <= 0)) {
        /**
         * Some video containers (e.g., webm) contain indices of only
         * frames-of-interest, e.g., keyframes, and therefore the whole
         * file must be parsed to get the number of frames (nb_frames
         * will be zero).
         *
         * Also, for webm only the duration of the entire file is
         * specified in the header (as opposed to the stream duration),
         * so the duration must be taken from the AVFormatContext, not
         * the AVStream.
         *
         * See this SO answer: https://stackoverflow.com/a/32538549
         */
        /**
         * Compute nb_frames from fmt ctx duration (microseconds) and
         * stream FPS (frames/second).
         */
        assert(video_stream->avg_frame_rate.den > 0);
        enum AVRounding rnd = (enum AVRounding)(AV_ROUND_DOWN |
                                                AV_ROUND_PASS_MINMAX);
        int64_t fps_num = video_stream->avg_frame_rate.num;
        int64_t fps_den =
                video_stream->avg_frame_rate.den*(int64_t)AV_TIME_BASE;
        vid_ctx->nb_frames =
                av_rescale_rnd(vid_ctx->format_context->duration,
                               fps_num,
                               fps_den,
                               rnd);
        /**
         * NOTE(brendan): fmt ctx duration in microseconds =>
         *
         * fmt ctx duration == (stream duration)*(stream timebase)*1e6
         *
         * since stream timebase is in units of
         * seconds / (stream timestamp). The rest of the code expects
         * the duration in stream timestamps, so do the conversion
         * here.
         *
         * Multiply the timebase numerator by AV_TIME_BASE to get a
         * more accurate rounded duration by doing the rounding in the
         * higher precision units.
         */
        int64_t tb_num = video_stream->time_base.num*(int64_t)AV_TIME_BASE;
        int64_t tb_den = video_stream->time_base.den;
        vid_ctx->duration =
                av_rescale_rnd(vid_ctx->format_context->duration,
                               tb_den,
                               tb_num,
                               rnd);
}
already accounts for cases where AVStream.nb_frames from the container is wrong, in which case we should perhaps just use this estimate always.
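To make the arithmetic in that snippet concrete, here is a pure-Python rework of the two av_rescale_rnd calls, using made-up numbers for a 10-second NTSC-rate video (rescale_down models AV_ROUND_DOWN for positive values):

```python
AV_TIME_BASE = 1_000_000  # FFmpeg format-context durations are in microseconds

def rescale_down(a, num, den):
    """Pure-Python stand-in for av_rescale_rnd(a, num, den, AV_ROUND_DOWN)
    for non-negative inputs: a * num / den, rounded down."""
    return (a * num) // den

# Hypothetical 10 s video at 30000/1001 fps with a 1/90000 stream time base.
duration_us = 10 * AV_TIME_BASE
fps_num, fps_den = 30000, 1001
tb_num, tb_den = 1, 90000

# nb_frames = duration (seconds) * fps, rounded down:
nb_frames = rescale_down(duration_us, fps_num, fps_den * AV_TIME_BASE)

# duration converted into stream time-base units: duration (s) / time_base:
duration_ts = rescale_down(duration_us, tb_den, tb_num * AV_TIME_BASE)

print(nb_frames)    # 299 (299.7 frames would fit, rounded down)
print(duration_ts)  # 900000 (10 s in 1/90000 s units)
```

Multiplying the denominator by AV_TIME_BASE before dividing, as the C code does, keeps the rounding in the highest-precision units available.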

Anyway, please let me know what you think, and if you get a chance to try the fix. Thank you for pointing out the bug!

Thanks for your work! I gave the branch a try, but as you have already noticed, there are some problems concerning nb_frames. For the videos in my database I need to reduce the nb_frames reported by ffprobe by exactly 3 to get the number of frames that can actually be decoded. That is also why I am not able to check whether the modifications improved the random seek point placement.

Do you think it is possible that the nb_frames offset somehow comes from lintel itself? I find it hard to imagine that so many of our videos have wrong metadata. The reported nb_frames also matches duration*avg_frame_rate, so the information at least seems to be internally consistent.

My preferred solution to all of this would be to allow an argument such as num_frames=-1. Lintel could then return as many frames as it is able to decode. The downside, of course, is that the output buffer cannot easily be preallocated; that would require the preprocessing you mentioned.
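A rough sketch of what num_frames=-1 would imply internally (pure Python with a simulated frame source, not lintel's implementation): without knowing the frame count up front, the output buffer has to grow as frames arrive instead of being preallocated once.

```python
def decode_all(frames_iter, frame_size):
    """One-pass decode with the frame count unknown up front: append
    each decoded frame to a growable buffer and count as we go."""
    buf = bytearray()
    count = 0
    for frame in frames_iter:
        assert len(frame) == frame_size  # all frames share one size
        buf += frame
        count += 1
    return bytes(buf), count

# Simulated stream of 5 tiny 4-byte "frames":
frames = [bytes([i]) * 4 for i in range(5)]
video, n = decode_all(iter(frames), 4)
print(n)           # 5
print(len(video))  # 20
```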

Side note: thanks for removing fps_cap. I had been disabling it by setting it to an extremely high number. If you want to reintroduce it, I would similarly propose adding the option -1 to disable it.

Okay great, thank you for the feedback about both APIs. I agree those would be improvements; at the very least it would be good to have some interface that reports how many frames were decoded successfully when decoding fails.

I will think about how to incorporate a solution that matches the frame count reported by ffprobe. I'm pretty sure that receive_frame(struct video_stream_context *vid_ctx) is correctly using the send/receive packet API, so I will have to dig into the ffprobe code to see what the heck it is doing to count frames, and this may take some time.

Hi again! It appears you were right, and there was a bug in receive_frame. I was neglecting to "drain" the codec, as described here:

https://github.com/FFmpeg/FFmpeg/blob/fe06ed22e6e0a8c2995818c4532eb6f4ec9320b9/libavcodec/avcodec.h#L122-L133
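For anyone following along: the draining pattern those FFmpeg docs describe is to send a NULL packet to put the codec into draining mode, then keep receiving until EOF, so frames still buffered inside the codec (e.g., for B-frame reordering) are not dropped. A toy pure-Python model of that buffering, not the FFmpeg API itself:

```python
class ToyDecoder:
    """Models a codec with internal delay: each frame comes out a few
    send_packet() calls after it went in."""
    def __init__(self, delay=2):
        self.delay = delay
        self.pending = []
        self.draining = False

    def send_packet(self, pkt):
        if pkt is None:          # NULL packet => enter draining mode
            self.draining = True
        else:
            self.pending.append(pkt)

    def receive_frame(self):
        # Before draining, the codec holds back the last `delay` frames.
        if self.draining:
            return self.pending.pop(0) if self.pending else "EOF"
        if len(self.pending) > self.delay:
            return self.pending.pop(0)
        return "AGAIN"

dec = ToyDecoder(delay=2)
decoded = []
for pkt in range(5):
    dec.send_packet(pkt)
    frame = dec.receive_frame()
    if frame not in ("AGAIN", "EOF"):
        decoded.append(frame)
# Without draining, only 3 of the 5 frames come out:
assert decoded == [0, 1, 2]
# Drain: send NULL, then receive until EOF to recover the rest.
dec.send_packet(None)
while (frame := dec.receive_frame()) != "EOF":
    decoded.append(frame)
assert decoded == [0, 1, 2, 3, 4]
```

Forgetting the drain step would undercount exactly by the codec's delay, which is consistent with a small constant offset like the "3 frames" reported above.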

I was wondering if you might be able to give commit ca3e1de a try.