looput's starred repositories

micrograd

A tiny scalar-valued autograd engine and a neural net library on top of it with PyTorch-like API

Language: Jupyter Notebook | License: MIT | 970300

BIG-bench

Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models

Language: Python | License: Apache-2.0 | 278000

Crawlee: a web scraping and browser automation library for Python for building reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Supports both headful and headless modes, with proxy rotation.

Language: Python | License: Apache-2.0 | 344500

reader

Convert any URL to an LLM-friendly input with a simple prefix https://r.jina.ai/

Language: TypeScript | License: Apache-2.0 | 594400
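
The prefix usage described for reader can be sketched as follows. This is a minimal illustration, not code from the repository: the helper name `to_reader_url` and the example target URL are hypothetical, and the only detail taken from the description is the `https://r.jina.ai/` prefix. The sketch only builds the prefixed URL and makes no network request.

```python
from urllib.parse import urlsplit

# Prefix taken from the repository description above.
READER_PREFIX = "https://r.jina.ai/"


def to_reader_url(url: str) -> str:
    """Prepend the reader prefix to an absolute http(s) URL.

    Hypothetical helper for illustration; it is not part of the reader project.
    """
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"expected an absolute http(s) URL, got {url!r}")
    return READER_PREFIX + url


if __name__ == "__main__":
    # example.com is a placeholder target, not a URL from the original listing.
    print(to_reader_url("https://example.com/docs"))
```

Fetching the resulting URL with any HTTP client would then be a separate step; it is left out here so the sketch stays free of network access.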

edna

Note taking for developers and power users

Language: JavaScript | License: NOASSERTION | 35400

LiveBench

LiveBench: A Challenging, Contamination-Free LLM Benchmark

Language: Python | License: NOASSERTION | 15900

FlagEmbedding

Retrieval and Retrieval-augmented LLMs

Language: Python | License: MIT | 631000

MiniCPM-V

MiniCPM-Llama3-V 2.5: A GPT-4V Level Multimodal LLM on Your Phone

Language: Python | License: Apache-2.0 | 815900

llama-moe

⛷️ LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training

Language: Python | License: Apache-2.0 | 81500

LLaMA-Factory

A WebUI for Efficient Fine-Tuning of 100+ LLMs (ACL 2024)

Language: Python | License: Apache-2.0 | 2832600

llm.c

LLM training in simple, raw C/CUDA

Language: Cuda | License: MIT | 2244300

dbrx

Code examples and resources for DBRX, a large language model developed by Databricks

Language: Python | License: NOASSERTION | 248700

LLM-Agent-Paper-List

The paper list of the 86-page paper "The Rise and Potential of Large Language Model Based Agents: A Survey" by Zhiheng Xi et al.

590400

opencompass

OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.

Language: Python | License: Apache-2.0 | 345300

Efficient-LLMs-Survey

[TMLR 2024] Efficient Large Language Models: A Survey

88500

deita

Deita: Data-Efficient Instruction Tuning for Alignment [ICLR2024]

Language: Python | License: Apache-2.0 | 43900

nlpaug

Data augmentation for NLP

Language: Jupyter Notebook | License: MIT | 437100

data-juicer

A one-stop data processing system to make data higher-quality, juicier, and more digestible for (multimodal) LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷

Language: Python | License: Apache-2.0 | 189100

Pai-Megatron-Patch

The official repo of Pai-Megatron-Patch for LLM & VLM large scale training developed by Alibaba Cloud.

Language: Python | License: Apache-2.0 | 58800

SemDeDup

Code for "SemDeDup", a simple method for identifying and removing semantic duplicates from a dataset (data pairs which are semantically similar, but not exactly identical).

Language: Python | License: NOASSERTION | 9000

Conditional-Pretraining-of-Large-Language-Models

Language: Python | License: Apache-2.0 | 3700

finngen-tools

Tools for training causal language models for Finnish

Language: Python | License: MIT | 2500

pdf2htmlEX

Convert PDF to HTML without losing text or format.

Language: HTML | License: NOASSERTION | 1031500

pretraining-with-human-feedback

Code accompanying the paper Pretraining Language Models with Human Preferences

Language: Python | License: MIT | 17100

HanLP

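Natural language processing for Chinese: word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyword and phrase extraction, automatic summarization, text classification and clustering, pinyin conversion, and simplified-traditional Chinese conversion.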

Language: Python | License: Apache-2.0 | 3320500

MNBVC

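MNBVC (Massive Never-ending BT Vast Chinese corpus): a very large-scale Chinese corpus, benchmarked against the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also niche subcultures, and even "Martian script" (火星文) text. It includes plain-text Chinese data in every form: news, essays, novels, books, magazines, papers, scripts, forum posts, wiki pages, classical poetry, song lyrics, product descriptions, jokes, embarrassing anecdotes, chat logs, and more.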

License: MIT | 327200

LLMSurvey

The official GitHub page for the survey paper "A Survey of Large Language Models".

Language: Python | 972700

AutoChain

AutoChain: Build lightweight, extensible, and testable LLM Agents

Language: Python | License: MIT | 176400

WebShop

[NeurIPS 2022] 🛒WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents

Language: Python | License: MIT | 23400