Cerebras / modelzoo

Extending the SlimPajama pipeline

chris-ha458 opened this issue · comments

I am interested in building multilingual foundational datasets.

It would take data from mC4 and/or OSCAR and apply language-specific filters and processing.

In short, it would be a multilingual SlimPajama (or RefinedWeb).

The SlimPajama pipeline seems to be a good base to extend from.

Is the SlimPajama pipeline meant to be part of cerebras/modelzoo? Are there any plans to spin it off?

Hey @chris-ha458, I believe we talked over Discord. Let me know if you want to continue the discussion over here or there. I am super excited to hear about the multilingual SlimPajama 💪!

I just saw this, and had left a message beforehand in the Discord (it's quite long lol).
I'm okay either way, but if I had to choose I'd love it if it were done here so it can be referenceable and searchable (especially by other projects or developers not involved in the Discord).

If you want I'll copy and paste it here!

sure, go ahead!

I read your most recent message and I appreciate it!

As I've shared before, the two versions we were looking into were
multilingual-SlimPajama or a multilingual open-source RefinedWeb.
The latter would be much more ambitious, involving pseudo-crawling raw WARC files.

Fortunately, we are in talks with an organisation that is planning to build and fully open-source a project at that scale, and we are leaning towards cooperating with them to bring it to fruition.

At the moment, that project does not seem to encompass curated data in any significant quantity as a data source, and it might be difficult to later incorporate such datasets into it. As such, projects like this one, which process manually prepared and curated data, would still be valid IMO.

In the meantime, if there is anything Cerebras requires to adapt this pipeline for multilingual data sources, (pseudo-)crawled or curated, feel free to let me know.
I will do my best to contribute directly or bring to attention any endeavors that enhance the open-source multilingual data ecosystem.

@chris-ha458 I had a conversation internally about this. Do you have a timeline in mind for when you are going to start contributing, and what changes are you most likely to provide? We can go from there.

Ah, as I've mentioned before, we might focus on the Common Crawl pseudo-crawling pipeline. However, depending on project direction, we might still contribute here. In that case it would be to extend the filtering processes to include more languages.

There would need to be a way to inject a language classifier stage and, downstream of that, a stage to apply per-language filters (a rough sketch is below).
Depending on resource considerations, these could happen before or after global dedup, but likely before initial tokenization.
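For concreteness, here is a minimal sketch of the kind of stage injection I have in mind, assuming fastText's lid.176 language-ID model; the function and variable names, thresholds, and filter rules are purely illustrative and not part of the existing SlimPajama code:

```python
# Hypothetical sketch: a language-ID stage after cleaning, followed by
# per-language filters, with surviving docs passed on to global dedup.
from typing import Callable, Dict, Iterable, Iterator

import fasttext  # assumes the lid.176.bin model has been downloaded locally

lid_model = fasttext.load_model("lid.176.bin")

def tag_language(doc: dict, threshold: float = 0.65) -> dict:
    """Attach a language code to a document using fastText LID."""
    text = doc["text"].replace("\n", " ")
    (label,), (prob,) = lid_model.predict(text, k=1)
    doc["lang"] = label.replace("__label__", "") if prob >= threshold else "und"
    return doc

# Per-language filters; each returns True if the document should be kept.
LANG_FILTERS: Dict[str, Callable[[dict], bool]] = {
    "en": lambda d: len(d["text"]) >= 200,
    "ko": lambda d: len(d["text"]) >= 100,  # thresholds differ per language
}

def language_filter_stage(docs: Iterable[dict]) -> Iterator[dict]:
    """Classify each doc, then apply the matching per-language filter."""
    for doc in docs:
        doc = tag_language(doc)
        keep = LANG_FILTERS.get(doc["lang"])
        if keep is not None and keep(doc):
            yield doc  # these continue on to global dedup / tokenization
```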

It would depend on project goal and scope, and it would be premature to commit to any kind of architecture beforehand.

@chris-ha458 gotcha, thanks for providing these details. Do you have a timeline in mind for when you would like to start updating our pipeline? Right now I don't see any specific limitations. You can extend the Common Crawl pipeline and we can include scripts to download it from scratch. You can also add language detectors and change the way we filter out examples. I suggest doing that before dedup.

Right now we are speaking to potential partners and sponsors who may or may not have their own internal pipelines, either in development or readying for open source. This makes it difficult for us to commit to a specific codebase. Due to the sparse nature of multilingual data sources, we have been in talks to secure enough compute/storage grants to replicate something akin to RefinedWeb (pseudo-crawling the whole of CC) while applying good language filters.

This is, as it may sound, a colossal effort that is conditional on securing such resources, which is not a given.

However, it seems like something like that has been tried internally at Cerebras before; if it is compatible with or easy to integrate into this code, it would be awesome if you could release it as part of this codebase.

I felt that the language filter part deserved its own comment so here it is.

We have been trying out different language filters. For instance, mC4 uses Google's gcld3 and OSCAR uses Meta's (then Facebook's) fastText. They have different false-positive and false-negative behaviors that we might capitalize on to enhance them.
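To make the comparison concrete, here is a small sketch of querying both detectors on the same text; it assumes the gcld3 and fasttext Python packages and the lid.176.bin model are available locally:

```python
# Illustrative side-by-side of the two language detectors mentioned above.
import gcld3
import fasttext

cld3 = gcld3.NNetLanguageIdentifier(min_num_bytes=0, max_num_bytes=1000)
ft = fasttext.load_model("lid.176.bin")

def compare(text: str) -> dict:
    r = cld3.FindLanguage(text=text)
    (ft_label,), (ft_prob,) = ft.predict(text.replace("\n", " "), k=1)
    return {
        "gcld3": (r.language, round(r.probability, 3), r.is_reliable),
        "fasttext": (ft_label.replace("__label__", ""), round(float(ft_prob), 3)),
    }

# Short or code-mixed snippets are where the two tend to disagree,
# i.e. the differing false-positive / false-negative behavior.
print(compare("Nämä ovat lyhyitä esimerkkilauseita."))
```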

However, we have not discovered a good solution that minimizes human involvement (right now we are inspecting individual samples for quality after filtering).

As to the location for the language filters, I assume you meant before the global deduplication and after the "Clean" step? This might be part of an ongoing discussion / trial-and-error cycle, since language filtering is relevant for downstream filters (for instance, ROOTS used per-language filters).

It might be useful to have the language filters and per-language filters as far downstream as possible, to see how the filter hyperparameters (different languages require different filters and/or different settings) change the quantity and quality of the resulting corpora (again, how ROOTS did it).
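As a toy illustration of what per-language settings could look like (the field names and values are hypothetical, not taken from ROOTS or SlimPajama):

```python
# Hypothetical per-language filter hyperparameters, in the spirit of ROOTS'
# per-language settings; values are placeholders for illustration only.
PER_LANG_PARAMS = {
    "en": {"min_chars": 200, "max_symbol_ratio": 0.10, "min_stopword_frac": 0.30},
    "ko": {"min_chars": 100, "max_symbol_ratio": 0.15, "min_stopword_frac": 0.15},
    "fi": {"min_chars": 150, "max_symbol_ratio": 0.10, "min_stopword_frac": 0.20},
}

def passes_filters(doc: dict) -> bool:
    """Apply the settings for the document's detected language."""
    params = PER_LANG_PARAMS.get(doc.get("lang"))
    if params is None:
        return False
    # Only the length check is shown; ratio checks would follow the same pattern.
    return len(doc["text"]) >= params["min_chars"]
```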

These aspects would have to be balanced against what Cerebras envisions and can facilitate.