RaivoKoot / Video-Dataset-Loading-Pytorch

Generic PyTorch dataset implementation to load and augment VIDEOS for deep learning training loops.

Home Page: https://video-dataset-loading-pytorch.readthedocs.io/


Why subtract 'frames_per_segment' to calculate 'segment_duration' ?

Gateway2745 opened this issue · comments

Hi. Why do you subtract 'frames_per_segment' from 'num_frames' and then divide by 'num_segments' to calculate 'segment_duration' ? Can we not directly divide 'num_frames' by 'num_segments' to get the 'segment_duration' ? Thanks!

segment_duration = (record.num_frames - self.frames_per_segment + 1) // self.num_segments

Hi. Good question.
So, the equation is:
segment_duration = (record.num_frames - self.frames_per_segment + 1) // self.num_segments
If you don't subtract frames_per_segment and add 1, an index-out-of-bounds error can occur later, because a sampled start index may leave fewer than frames_per_segment frames after it.

In the case where frames_per_segment=1, - self.frames_per_segment + 1 obviously makes no difference, because we are subtracting 1 and adding 1. So, this only matters in the case where frames_per_segment > 1.

Some Context

When you use frames_per_segment > 1, a random start index is sampled for each segment, and then, starting from each start index, frames_per_segment consecutive frames are loaded and returned. The function _sample_indices does not return the indices of all frames to be loaded, only the start index of each segment's frames_per_segment frames.
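As a rough illustration of the two steps described above, here is a minimal pure-Python sketch. It is not the repository's actual code: the function names `sample_start_indices` and `expand_to_frames` are hypothetical, and the sampling scheme (segment `i` starts at `i * segment_duration` plus a random offset below `segment_duration`) is an assumption that matches the behavior discussed in this thread.

```python
import random


def sample_start_indices(num_frames, num_segments, frames_per_segment):
    """Hypothetical stand-in for _sample_indices: one start index per segment."""
    segment_duration = (num_frames - frames_per_segment + 1) // num_segments
    if segment_duration <= 0:
        raise ValueError("video is too short for this configuration")
    # Assumed scheme: segment i starts at i * segment_duration plus a
    # random offset in [0, segment_duration).
    return [
        i * segment_duration + random.randrange(segment_duration)
        for i in range(num_segments)
    ]


def expand_to_frames(start_indices, frames_per_segment):
    """Turn each start index into frames_per_segment consecutive frame indices."""
    return [list(range(s, s + frames_per_segment)) for s in start_indices]


starts = sample_start_indices(num_frames=10, num_segments=3, frames_per_segment=2)
print(starts)                       # three start indices, one per segment
print(expand_to_frames(starts, 2))  # the frames that would actually be loaded
```

Only the start indices are random; the frames that follow each start index are fully determined by frames_per_segment.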

An Example

num_segments = 3
frames_per_segment = 2
num_frames = 6
frame_indices = [0, 1, 2, 3, 4, 5]

We cannot use 5 as a start index, because starting from 5 we would need to take frames_per_segment=2 frames, which would be the frames at indices 5 and 6, and index 6 is out of bounds. If you do not do - self.frames_per_segment + 1, then the function will sometimes return index 5 as the start index for the third segment. Doing - self.frames_per_segment + 1 is not a perfect solution, but it works.
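The example can be checked with a small sketch that lists every start index each segment could receive, with and without the correction. The helper `possible_start_indices` is hypothetical, and it assumes segment `i` draws its start index from `[i * segment_duration, (i + 1) * segment_duration)`.

```python
def possible_start_indices(num_frames, num_segments, frames_per_segment, corrected=True):
    """List, per segment, every start index that could be sampled (assumed scheme)."""
    usable = num_frames - frames_per_segment + 1 if corrected else num_frames
    segment_duration = usable // num_segments
    return [
        list(range(i * segment_duration, (i + 1) * segment_duration))
        for i in range(num_segments)
    ]


num_frames, num_segments, frames_per_segment = 6, 3, 2

for corrected in (False, True):
    starts = possible_start_indices(num_frames, num_segments, frames_per_segment, corrected)
    # Last frame index that the largest possible start index would touch.
    last_frame = max(starts[-1]) + frames_per_segment - 1
    print(corrected, starts, "out of bounds" if last_frame >= num_frames else "ok")

# False [[0, 1], [2, 3], [4, 5]] out of bounds
# True [[0], [1], [2]] ok
```

Without the correction, start index 5 is possible for the third segment and would require frame 6. With it, the largest start index is 2, so the last frame touched is 3.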

Thank you for the detailed explanation! I understood the purpose of it now but my doubt persists.

In the example you have given, the segments would be [0,1], [2,3] and [4,5]. So, the starting index can be either 0, 2 or 4. However, the result of segment_duration = (record.num_frames - self.frames_per_segment + 1) // self.num_segments gives segment_duration=1. Should this not be 2 (each segment [0,1], [2,3],[4,5] has 2 frames) ?

Let's say segment_duration=1. Now, _sample_indices always returns offsets=[0,1,2]. With frames_per_segment = 2, this spans frames [0,1,2,3]. So frames 4 and 5 will never be used. Is this supposed to be how it works?

Thanks again!
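The observation about unused frames can be reproduced with a short sketch that collects every frame index reachable from any possible start index. The helper `reachable_frames` is hypothetical and assumes the same sampling scheme as above: segment `i` starts at `i * segment_duration` plus an offset below `segment_duration`.

```python
def reachable_frames(num_frames, num_segments, frames_per_segment):
    """Collect every frame index that any possible start index could load."""
    segment_duration = (num_frames - frames_per_segment + 1) // num_segments
    reachable = set()
    for i in range(num_segments):
        for offset in range(segment_duration):  # empty if segment_duration <= 0
            start = i * segment_duration + offset
            reachable.update(range(start, start + frames_per_segment))
    return sorted(reachable)


used = reachable_frames(num_frames=6, num_segments=3, frames_per_segment=2)
unused = [f for f in range(6) if f not in used]
print(used)    # [0, 1, 2, 3]
print(unused)  # [4, 5]
```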

Yes, you are right that this is how it works, even though it is not the best behavior. Because most people only use a single frame per segment, I did not pay much attention to improving this when I adapted this repository from the original code repository, which implemented the same sub-optimal behavior. In my own experiments, I also only ever use a single frame per segment.

However, you are very welcome to create a pull request and suggest an improvement for this. If it is suitable, I am happy to merge it!!

Hi. Thanks for the confirmation! Oh I see, I was not aware that the norm was to use only a single frame per segment. Sure, although it is difficult for me right now, if I come up with a way to improve the current strategy, I'll definitely raise a PR. Thanks again.