allenai / dolma

Data and tools for generating and inspecting OLMo pre-training data.

Home Page: https://allenai.github.io/dolma/

Need help accessing the raw Reddit data

Jianxin-MNM opened this issue

Hi,

Dolma is really fantastic work. I am currently trying to extend the data pipeline to more languages using the Reddit data. Could anyone help with the following:

  1. Share a working link or access method for the raw Reddit dataset?
  2. I have found some torrent links with .zst files from multiple archives. Could anyone share a sha256sum so that I can validate that my download is correct?

Cheers!
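
For the checksum question in item 2, here is a minimal sketch of how a SHA-256 digest could be computed for a downloaded .zst archive and compared against a reference value. It assumes Python 3 and uses only the standard-library `hashlib` module. The helper names `sha256_of_file` and `matches_checksum` are hypothetical, and no reference checksum is published in this thread, so the expected value would have to come from a trusted source.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks.

    Streaming keeps memory use flat even for multi-gigabyte archives.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare a file's digest with an expected hex string (hypothetical helper)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

The digest produced this way should match what the `sha256sum` command-line tool prints for the same file, so either can be used for the comparison. Note that a matching digest only shows the download is identical to the copy the checksum came from, not that the archive is the same data used to build Dolma.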

Apologies, but we are not planning to share the raw Reddit dataset.