BigScience Workshop's repositories
promptsource
Toolkit for creating, sharing and using natural language prompts.
Megatron-DeepSpeed
Ongoing research on training transformer language models at scale, including BERT and GPT-2.
bigscience
Central place for the engineering/scaling WG: documentation, SLURM scripts and logs, compute environment and data.
biomedical
Tools for curating biomedical training data for large-scale language modeling
data-preparation
Code used for sourcing and cleaning the BigScience ROOTS corpus
lm-evaluation-harness
A framework for few-shot evaluation of autoregressive language models.
data_tooling
Tools for managing datasets for governance and training.
multilingual-modeling
BLOOM+1: adapting the BLOOM model to support a new, unseen language
evaluation
Code and Data for Evaluation WG
carbon-footprint
A repository for `codecarbon` logs.
bloom-dechonk
A repo for running model-shrinking experiments
catalogue_data
Scripts to prepare catalogue data
historical_texts
BigScience working group on language models for historical texts
pii_processing
Code to detect and remediate PII in BigScience datasets; reference implementation for the PII Hackathon
massive-probing-framework
A framework for probing the BLOOM model
transformers
🤗 Transformers: State-of-the-art Natural Language Processing for PyTorch, TensorFlow, and JAX.
bibliography
A list of BigScience publications
datasets_stats
Generate statistics over datasets used in the context of BigScience
evaluation-robustness-consistency
Tools for evaluating model robustness and consistency