Deduplicated Pile dataset with Domain Attribution
michaelduan8 commented
Hi there!
I was wondering whether there is a way to reproduce this dataset with domain attribution (that is, a record of which Pile subdomain each document comes from), or whether the existing dataset at that link could be updated with domain metadata.
Thanks!