Stefan Schweter's starred repositories
torchtitan
A native PyTorch library for large model training
recurrentgemma
Open weights language model from Google DeepMind, based on Griffin.
community-content
Hetzner Online Community Project
improved-t5
Experiments and efforts to train a new and improved T5
transformer-smaller-training-vocab
Temporarily remove unused tokens during training to save RAM and speed up training.
fundus-evaluation
Evaluation of the Fundus News Scraper https://github.com/flairNLP/fundus
eacl24-german-legal-questions
Data and code: "Answering legal questions from laymen in German civil law system", Büttner & Habernal, EACL'24
tech-report
Raw data, scripts, etc. to produce the tables and figures of our technical report
ChroniclingAmericaQA
ChroniclingAmericaQA: A Large-scale Question Answering Dataset based on Historical American Newspaper Pages
Multi-Level-Training-Framework
Official implementation of "A Multi-Level Framework for Accelerating Training Transformer Models"
umLabeller
Inspection tool for characterizing the semantic compositionality of subword tokenization in English
newsagency-classification
Recognition of news agency mentions in historical news articles (BERT-based token classification).
maibaam-code
Code for preprocessing data for UD annotation and for the tagging/parsing experiments of MaiBaam
turkish-lm-bias
Investigating Gender Bias in Turkish Language Models