facebookresearch / co-tracker

CoTracker is a model for tracking any point (pixel) on a video.

Home Page: https://co-tracker.github.io/

GPU out of memory when trying to evaluate model on kinetics_first

AssafSinger94 opened this issue · comments

Hi,
When trying to evaluate the model on TAP-Vid Kinetics with 'first' sampling, my GPU reaches its memory limit and the run crashes.
The error occurs when trying to aggregate the TAP-Vid Kinetics pickle files during the instantiation of the TapVidDataset object.
I was able to evaluate the model on TAP-Vid DAVIS properly.
I am running the following command on an NVIDIA A100 GPU (the GPU with the most memory that I have access to).

python ./cotracker/evaluation/evaluate.py \
--config-name eval_tapvid_kinetics_first \
exp_dir=./eval_outputs_kinetics_first \
dataset_root=<path_to_tapvid_kinetics_dir>

Could you please assist me with this? Could you share the code you used to evaluate the model on Kinetics, and which GPU did you use?
The TAP-Vid repo provides create_kinetics_dataset, which returns an iterable that yields one video example at a time, but I couldn't properly adjust the evaluation code to use this iterable instead.
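This is roughly the direction I tried, as an untested sketch: wrapping that iterable in a torch IterableDataset (the import path and the query_mode argument are my guesses from the TAP-Vid code):

import torch

# NOTE: the import path below is a guess and may need adjusting
from tapnet.evaluation_datasets import create_kinetics_dataset

class TapVidKineticsIterable(torch.utils.data.IterableDataset):
    """Streams Kinetics examples one by one instead of loading all pickles."""

    def __init__(self, kinetics_path, query_mode="first"):
        self.kinetics_path = kinetics_path
        self.query_mode = query_mode

    def __iter__(self):
        # the TAP-Vid generator yields one example dict per video
        yield from create_kinetics_dataset(self.kinetics_path, self.query_mode)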

Thank you for your help!
Assaf

Hi @AssafSinger94,

I also trained this model on A100 GPUs. I think the problem with the pickle files is RAM, not GPU memory. You should be able to replace the following code:

class TapVidDataset(torch.utils.data.Dataset):
    def __init__(self, ...):
        ...
        if self.dataset_type == "kinetics":
            # aggregates all ten pickle shards in memory at once,
            # which is what exhausts RAM
            all_paths = glob.glob(os.path.join(data_root, "*_of_0010.pkl"))
            points_dataset = []
            for pickle_path in all_paths:
                with open(pickle_path, "rb") as f:
                    data = pickle.load(f)
                    points_dataset = points_dataset + data
            self.points_dataset = points_dataset

    def __getitem__(self, index):
        if self.dataset_type == "davis":
            ...
        else:
            video_name = index
        video = self.points_dataset[video_name]

with something like the following, so that each pickle file is loaded only when it is needed:

class TapVidDataset(torch.utils.data.Dataset):
    def __init__(self, ...):
        ...
        if self.dataset_type == "kinetics":
            self.all_paths = glob.glob(os.path.join(data_root, "*_of_0010.pkl"))
            # index of the pickle file currently held in memory
            self.curr_path_idx = -1
            # global index of the first clip in the current pickle file
            self.global_file_idx = 0

    def load_pickle_file(self):
        with open(self.all_paths[self.curr_path_idx], "rb") as f:
            data = pickle.load(f)
        return data

    def __getitem__(self, index):
        if self.dataset_type == "davis":
            ...
        else:
            # the global index has moved past the clips in the current
            # pickle file, so load the next one
            if index >= len(self.points_dataset):
                self.global_file_idx += len(self.points_dataset)
                self.curr_path_idx += 1
                self.points_dataset = self.load_pickle_file()
            # convert the global index into an index within the current file
            index -= self.global_file_idx
            video_name = index
        video = self.points_dataset[video_name]
        ...

    def __len__(self):
        ...

I haven't tested it though. Please let me know if this solution works! You'll also need to implement def __len__(self), maybe just by hardcoding the dataset length or loading all the files one by one in the same manner.
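For the second option, an untested sketch could look something like this (the _num_examples cache attribute is just illustrative):

class TapVidDataset(torch.utils.data.Dataset):
    ...

    def _count_examples(self):
        # load each pickle shard once, count its clips, then discard it,
        # so only one shard is ever held in memory
        total = 0
        for pickle_path in self.all_paths:
            with open(pickle_path, "rb") as f:
                total += len(pickle.load(f))
        return total

    def __len__(self):
        if self.dataset_type == "kinetics":
            # cache the count so the shards are only scanned once
            if not hasattr(self, "_num_examples"):
                self._num_examples = self._count_examples()
            return self._num_examples
        ...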

Hi @nikitakaraevv, thank you for your help! It really helped.
I made a few small adjustments (code added below), and the evaluation ran properly.

However, my overall average metrics are much lower than those reported in the paper:
{'occlusion_accuracy': 0.8116, 'pts_within_1': 0.2009, 'jaccard_1': 0.1205, 'pts_within_2': 0.3077, 'jaccard_2': 0.1987, 'pts_within_4': 0.4183, 'jaccard_4': 0.2826, 'pts_within_8': 0.5321, 'jaccard_8': 0.3684, 'pts_within_16': 0.6535, 'jaccard_16': 0.4621, 'average_jaccard': 0.2865, 'average_pts_within_thresh': 0.4225}
Could you please assist me with this? Am I missing something?

In addition, inference takes 1-2 minutes per video on a GPU, and evaluation on the entire dataset takes over 14 hours. Is there any way to speed up inference? Does it make sense that inference takes this long? For now, I adjusted the code to run on a specific video_ind, submitted many separate jobs, and averaged their results. The prediction visualizations show that the per-video-index separation works well, and the predicted trajectories seem to "make sense".
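The averaging step itself is straightforward. This is roughly what my aggregation script does, assuming every job dumps its per-video metrics dict to a JSON file in one output directory (this layout is my own setup, not CoTracker's output format):

import glob
import json
import os

def average_job_metrics(out_dir):
    # one JSON file per video, each holding a flat {metric_name: value} dict
    dicts = []
    for path in glob.glob(os.path.join(out_dir, "*.json")):
        with open(path) as f:
            dicts.append(json.load(f))
    # unweighted mean over videos for every metric key
    return {key: sum(d[key] for d in dicts) / len(dicts) for key in dicts[0]}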

Thank you for your help!
Assaf

Adjusted code:

class TapVidDataset(torch.utils.data.Dataset):
    def __init__(self, ...):
        ...
        if self.dataset_type == "kinetics":
            self.all_paths = glob.glob(os.path.join(data_root, "*_of_0010.pkl"))
            self.curr_path_idx = -1
            self.global_file_idx = 0
            self.points_dataset = []  # initialize to an empty list

    def __getitem__(self, index):
        if self.dataset_type == "davis":
            ...
        else:  # kinetics
            pkl_index = index - self.global_file_idx  # index within the current pickle file
            if pkl_index >= len(self.points_dataset):
                self.global_file_idx += len(self.points_dataset)
                self.curr_path_idx += 1
                self.points_dataset = self.load_pickle_file()
                pkl_index = index - self.global_file_idx  # index within the new pickle file
            video_name = pkl_index
        ...

    def __len__(self):
        if self.dataset_type == "kinetics":
            return 1144

Hi @AssafSinger94, did you evaluate the model on TAP-Vid DAVIS to ensure that the numbers match?
Also, what are the average metrics after the first sequence and after the first five sequences on Kinetics?
Mine after the first sequence are:

'occlusion_accuracy': 0.8893333333333333, 
'average_jaccard': 0.3283059618231899, 
'average_pts_within_thresh': 0.4515646635281086

and after the first five sequences:

'occlusion_accuracy': 0.8990976751255385, 
'average_jaccard': 0.44782753486523796,
'average_pts_within_thresh': 0.5723796905496629

Hey, the low metrics were caused by an issue in how I created the dataset (incorrect frame-rate sampling). After I fixed it, the results made much more sense. Thanks!