Jiaxin-Wen

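Chinese LLM capability evaluation leaderboard: currently covers 115 large models, including commercial models such as chatgpt, gpt4o, Baidu ERNIE Bot (文心一言), Alibaba Tongyi Qianwen, iFlytek Spark, SenseTime senseChat, and minimax, as well as open-source models such as Baichuan, qwen2, glm4, yi, InternLM2, and llama3, evaluated across multiple capability dimensions. It provides not only a ranked leaderboard of capability scores but also the raw outputs of all models!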

2440 32 44

Minigrid

Simple and easily configurable grid world environments for reinforcement learning

Language: Python · NOASSERTION · 2086 39 188

DeepSeek-Coder-V2

DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

MIT · 1997 22 50

magicoder

[ICML'24] Magicoder: Empowering Code Generation with OSS-Instruct

Language: Python · MIT · 1965 26 40

code2prompt

A CLI tool to convert your codebase into a single LLM prompt with source tree, prompt templating, and token counting.

Language: Rust · MIT · 1665 11 29

DeepLearing-Interview-Awesome-2024

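A collection of interview questions and answers for AIGC-interview/CV-interview/LLMs-interview, together with new ideas, new problems, new resources, and new projects from work and research.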

1580 250

alpaca_eval

An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.

Language: Jupyter Notebook · Apache-2.0 · 1456 7 142

HALOs

A library with extensible implementations of DPO, KTO, PPO, ORPO, and other human-aware loss functions (HALOs).

Language: Python · Apache-2.0 · 707 7 20

babyai

BabyAI platform. A testbed for training agents to understand and execute language commands.

Language: Python · BSD-3-Clause · 689 36 46

quiet-star

Code for Quiet-STaR

Language: Python · Apache-2.0 · 540 13 8

HarmBench

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Language: Jupyter Notebook · MIT · 285 5 47

Thought-Cloning

[NeurIPS '23 Spotlight] Thought Cloning: Learning to Think while Acting by Imitating Human Thinking

Language: Python · MIT · 249 20

FineGrainedRLHF

Language: Python · Apache-2.0 · 248 8 13

openlogprobs

Extract full next-token probabilities via language model APIs

Language: Python · 227 3 1

bigcodebench

BigCodeBench: Benchmarking Code Generation Towards AGI

Language: Python · Apache-2.0 · 187 5 35

intercode

[NeurIPS 2023 D&B] Code repository for InterCode benchmark https://arxiv.org/abs/2306.14898

Language: Python · MIT · 184 7 17

Humback

🐋 An unofficial implementation of Self-Alignment with Instruction Backtranslation.

Language: Python · Apache-2.0 · 130 3 9

quality

Language: Python · 115 12 7

PPOCoder

Code for the TMLR 2023 paper "PPOCoder: Execution-based Code Generation using Deep Reinforcement Learning"

Language: Python · MIT · 94 3 10

llm_debate

Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers"

Language: Python · MIT · 74 4 2

fneval

Functional Benchmarks and the Reasoning Gap

Language: TeX · GPL-3.0 · 73 1 8

CiteME

CiteME is a benchmark designed to test the abilities of language models in finding papers that are cited in scientific texts.

Language: Python · NOASSERTION · 35 100