Questions About the Dataset
jk4011 opened this issue · comments
Thank you for your excellent work!
I'm really amazed by the quality of your paper and code.
I have a few questions regarding the dataset:
- I noticed that training with GeoWizard is more expensive compared to Marigold. Is this due to the size of the dataset?
- Considering the costs associated with data collection, what do you think the ideal size would be for a high-quality dataset?
- How did you filter out high-quality meshes from Objaverse?
- Are there any plans to make the training data available, such as a filtered list from Objaverse or the urban rendered data?
Again, thank you very much for your nice work.
Thanks for asking, these are certainly good questions!
- Yes. We believe that training data diversity (scaling law) is important for the model's generalization ability. Thus we utilize as many high-quality datasets as possible (2 indoor, 2 outdoor, 1 object), whereas Marigold only uses 2 (1 indoor and 1 outdoor).
- If you want to train the model from scratch, the more high-quality datasets you have, the better the performance (see Metric3D, DepthAnything). In contrast, if you utilize the prior of a pre-trained model (e.g., SD), the requirement for data coverage is lower. In general, we believe there is no limit on the ideal size. We encourage you to start with a mini dataset and then scale up with more diverse, high-quality data, as long as your model can also scale up (e.g., a diffusion model).
- We filter the 3D objects based on data diversity; you can also see here for a similar high-quality subset.
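A diversity-based filter like the one mentioned above can be sketched in a few lines of Python. Note this is only an illustrative sketch, not the authors' actual pipeline: the `category` metadata field, the `uid` key, and the per-category cap are all hypothetical assumptions.

```python
from collections import defaultdict

def filter_by_diversity(assets, per_category_cap=100):
    """Keep at most `per_category_cap` meshes per semantic category,
    so no single object type dominates the training set.
    `assets` is a list of metadata dicts (hypothetical schema)."""
    kept, counts = [], defaultdict(int)
    for asset in assets:
        category = asset.get("category", "unknown")
        if counts[category] < per_category_cap:
            kept.append(asset["uid"])
            counts[category] += 1
    return kept

# Example: three chairs, but a cap of 2 drops the third one.
assets = [
    {"uid": "a1", "category": "chair"},
    {"uid": "a2", "category": "chair"},
    {"uid": "a3", "category": "chair"},
    {"uid": "b1", "category": "lamp"},
]
print(filter_by_diversity(assets, per_category_cap=2))  # ['a1', 'a2', 'b1']
```

In practice one would combine such a cap with per-asset quality checks (e.g., texture resolution or mesh integrity) before rendering.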
- Yes, we do have a plan. However, the schedule has not been settled, since the release of the rendered urban scene data is subject to further regulation and review.
Thank you for your kind comment! I appreciate it :D