BigScience Workshop's repositories
promptsource
Toolkit for creating, sharing and using natural language prompts.
Megatron-DeepSpeed
Ongoing research on training transformer language models at scale, including BERT and GPT-2.
bigscience
Central place for the engineering/scaling WG: documentation, SLURM scripts and logs, compute environment and data.
biomedical
Tools for curating biomedical training data for large-scale language modeling
data-preparation
Code used for sourcing and cleaning the BigScience ROOTS corpus
lm-evaluation-harness
A framework for few-shot evaluation of autoregressive language models.
data_tooling
Tools for managing datasets for governance and training.
multilingual-modeling
BLOOM+1: adapting the BLOOM model to support a new, unseen language
evaluation
Code and Data for Evaluation WG
carbon-footprint
A repository for `codecarbon` logs.
bloom-dechonk
A repo for running model-shrinking experiments
catalogue_data
Scripts to prepare catalogue data
historical_texts
BigScience working group on language models for historical texts
pii_processing
Code to detect and remediate PII in BigScience datasets; reference implementation for the PII Hackathon
massive-probing-framework
A framework for probing the BLOOM model
transformers
🤗 Transformers: State-of-the-art Natural Language Processing for PyTorch, TensorFlow, and JAX.
bibliography
A list of BigScience publications
datasets_stats
Generate statistics over datasets used in the context of BigScience
evaluation-robustness-consistency
Tools for evaluating model robustness and consistency