EleutherAI / pythia

The hub for EleutherAI's work on interpretability and learning dynamics

Deduplicated Pile dataset with Domain Attribution

michaelduan8 opened this issue

Hi there!

I was wondering whether there is a way to reproduce this dataset with domain attribution (that is, metadata identifying which Pile subdomain each document comes from), or whether the existing dataset at that link could be updated with domain metadata.

Thanks!