Questions About the Dataset
jk4011 opened this issue · comments
Thank you for your excellent work!
I'm really amazed by the quality of your paper and code.
I have a few questions regarding the dataset:
- I noticed that training with GeoWizard is more expensive compared to Marigold. Is this due to the size of the dataset?
- Considering the costs associated with data collection, what do you think the ideal size would be for a high-quality dataset?
- How did you filter out high-quality meshes from Objaverse?
- Are there any plans to make the training data available, such as a filtered list from Objaverse or the urban rendered data?
Again, thank you very much for your nice work.
Thanks for asking, these are certainly good questions!
- Yes. We believe that training data diversity (scaling law) is important for the model's generalization ability. Thus we utilize as many high-quality datasets as possible (2 indoor, 2 outdoor, 1 object), whereas Marigold only uses 2 (1 indoor and 1 outdoor).
- If you want to train the model from scratch, the more high-quality datasets you have, the better the performance (see Metric3D, DepthAnything). In contrast, if you utilize the prior of a pre-trained model (e.g., SD), the requirement for data coverage is lower. In general, we believe there is no limit on the ideal size. We encourage you to start with a mini dataset and then scale up with more diverse, high-quality data, as long as your model can also scale up (e.g., a diffusion model).
- We filter the 3D objects based on data diversity; you can also see here for a similar high-quality subset.
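A diversity-based filter like the one mentioned above can be sketched in a few lines of Python. Note this is only an illustrative sketch, not the authors' actual pipeline: the `category` metadata field, the `uid` key, and the per-category cap are all hypothetical assumptions.

```python
from collections import defaultdict

def filter_by_diversity(assets, per_category_cap=100):
    """Keep at most `per_category_cap` meshes per semantic category,
    so no single object type dominates the training set.
    `assets` is a list of metadata dicts (hypothetical schema)."""
    kept, counts = [], defaultdict(int)
    for asset in assets:
        category = asset.get("category", "unknown")
        if counts[category] < per_category_cap:
            kept.append(asset["uid"])
            counts[category] += 1
    return kept

# Example: three chairs, but a cap of 2 drops the third one.
assets = [
    {"uid": "a1", "category": "chair"},
    {"uid": "a2", "category": "chair"},
    {"uid": "a3", "category": "chair"},
    {"uid": "b1", "category": "lamp"},
]
print(filter_by_diversity(assets, per_category_cap=2))  # ['a1', 'a2', 'b1']
```

In practice one would combine such a cap with per-asset quality checks (e.g., texture resolution or mesh integrity) before rendering.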
- Yes, we do have a plan. However, the schedule has not been settled, since the release of the rendered urban scene data is subject to further regulation and review.
Thank you for your kind comment! I appreciate it :D