liupeng's starred repositories

dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.

Language: TypeScript | License: NOASSERTION | Stargazers: 46795 | Issues: 349 | Issues: 3968

lobe-chat

🤯 Lobe Chat - an open-source, modern-design AI chat framework. Supports multiple AI providers (OpenAI / Claude 3 / Gemini / Ollama / Azure / DeepSeek), Knowledge Base (file upload / knowledge management / RAG), Multi-Modals (Vision/TTS) and a plugin system. One-click FREE deployment of your private ChatGPT/Claude application.

Language: TypeScript | License: NOASSERTION | Stargazers: 41908 | Issues: 203 | Issues: 2121

Umi-OCR

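Free, open-source, offline OCR software. Supports screenshot capture and batch image import, PDF document recognition, exclusion of watermarks and headers/footers, and scanning and generating QR codes. Includes built-in multi-language libraries.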

Language: Python | License: MIT | Stargazers: 25939 | Issues: 139 | Issues: 557

audiocraft

Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable music generation LM with textual and melodic conditioning.

Language: Python | License: MIT | Stargazers: 20688 | Issues: 204 | Issues: 374

elasticsearch-analysis-ik

The IK Analysis plugin integrates the Lucene IK analyzer into Elasticsearch and supports customized dictionaries.

Language: Java | License: Apache-2.0 | Stargazers: 15916 | Issues: 599 | Issues: 933

LaTeX-OCR

pix2tex: Using a ViT to convert images of equations into LaTeX code.

Language: Python | License: MIT | Stargazers: 12101 | Issues: 73 | Issues: 268

seamless_communication

Foundational Models for State-of-the-Art Speech and Text Translation

Language: Jupyter Notebook | License: NOASSERTION | Stargazers: 10791 | Issues: 141 | Issues: 349

lm-evaluation-harness

A framework for few-shot evaluation of language models.

Language: Python | License: MIT | Stargazers: 6554 | Issues: 37 | Issues: 1086

iceberg

Apache Iceberg

Language: Java | License: Apache-2.0 | Stargazers: 6224 | Issues: 159 | Issues: 3393

pycorrector

pycorrector is a toolkit for text error correction. It applies models such as Kenlm, T5, MacBERT, ChatGLM3, and LLaMA to error-correction scenarios and works out of the box.

Language: Python | License: Apache-2.0 | Stargazers: 5511 | Issues: 83 | Issues: 470

chatgpt-web-share

ChatGPT Plus / OpenAI API sharing solution.

Language: Vue | License: GPL-3.0 | Stargazers: 4265 | Issues: 38 | Issues: 339

img2dataset

Easily turn large sets of image URLs into an image dataset. Can download, resize and package 100M URLs in 20h on one machine.

Language: Python | License: MIT | Stargazers: 3621 | Issues: 31 | Issues: 255

trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

Language: Python | License: Apache-2.0 | Stargazers: 3491 | Issues: 31 | Issues: 363

MNBVC

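MNBVC (Massive Never-ending BT Vast Chinese corpus) is a very-large-scale Chinese corpus, benchmarked against the 40T of data used to train chatGPT. The MNBVC dataset covers not only mainstream culture but also niche subcultures and even "Martian script" (火星文) data. It includes plain-text Chinese data in every form: news, essays, novels, books, magazines, papers, scripts, forum posts, wiki, classical poetry, lyrics, product descriptions, jokes, embarrassing anecdotes, chat logs, and more.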

Linly

Chinese-LLaMA 1&2 and Chinese-Falcon base models; ChatFlow Chinese dialogue model; Chinese OpenLLaMA model; NLP pre-training / instruction fine-tuning datasets

List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words

List of Dirty, Naughty, Obscene, and Otherwise Bad Words

LLMDataHub

A quick guide (especially) for trending instruction finetuning datasets

Chinese-Llama-2-7b

The open-source community's first downloadable, runnable Chinese LLaMA2 model!

Language: Python | License: Apache-2.0 | Stargazers: 2226 | Issues: 21 | Issues: 39

prm800k

800,000 step-level correctness labels on LLM solutions to MATH problems

Language: Python | License: MIT | Stargazers: 1493 | Issues: 115 | Issues: 16

LLMTest_NeedleInAHaystack

Doing simple retrieval from LLM models at various context lengths to measure accuracy

Language: Jupyter Notebook | License: NOASSERTION | Stargazers: 1475 | Issues: 15 | Issues: 25

LeanCopilot

LLMs as Copilots for Theorem Proving in Lean

Language: C++ | License: MIT | Stargazers: 959 | Issues: 12 | Issues: 38

autofaiss

Automatically create Faiss knn indices with optimal similarity search parameters.

Language: Python | License: Apache-2.0 | Stargazers: 803 | Issues: 18 | Issues: 78

Llama2-Code-Interpreter

Make Llama2 use Code Execution, Debug, Save Code, Reuse it, Access to Internet

LifeReloaded

A life simulation game powered by GPT-4's “Advanced Data Analysis” function, offering you a second chance at life.

Language: Python | License: Apache-2.0 | Stargazers: 634 | Issues: 8 | Issues: 10

Megatron-LLaMA

Best practice for training LLaMA models in Megatron-LM

Language: Python | License: NOASSERTION | Stargazers: 611 | Issues: 6 | Issues: 62

BetterOCR

🔍 Better text detection by combining multiple OCR engines (EasyOCR, Tesseract, and Pororo) with 🧠 LLM.

Language: Python | License: MIT | Stargazers: 469 | Issues: 4 | Issues: 13

data_management_LLM

Collection of training data management explorations for large language models

Open-Instruction-Generalist

Open Instruction Generalist is an assistant trained on massive synthetic instructions to perform many millions of tasks

Language: Python | License: Apache-2.0 | Stargazers: 204 | Issues: 13 | Issues: 9

wikitextprocessor

Python package for WikiMedia dump processing (Wiktionary, Wikipedia etc). Wikitext parsing, template expansion, Lua module execution. For data extraction, bulk syntax checking, error detection, and offline formatting.

Language: Python | License: NOASSERTION | Stargazers: 93 | Issues: 5 | Issues: 71

SUS-Chat

SUS-Chat: Instruction tuning done right

License: NOASSERTION | Stargazers: 47 | Issues: 0 | Issues: 11