Request for Clarification on Dataset Preprocessing Process
Kenny-K opened this issue · comments
Thanks for your great work. I have a question about how the deft-data was processed. According to the official Ego4D metadata, only around 15,000 annotated clips are available, and some of them may not include human hands. However, your paper mentions using over 60,000 clips from Ego4D.
Was the deft-data derived from the raw Ego4D videos rather than from the official annotated clips? Additionally, if raw full-length videos were used, could you share how the task descriptions were extracted from these videos (e.g., using some type of video captioning model)?