berooo's repositories
beir
A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
MRAG
Official Implementation of "Multi-Head RAG: Solving Multi-Aspect Problems with LLMs"
self-rag
This includes the original implementation of SELF-RAG: Learning to Retrieve, Generate and Critique through self-reflection by Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi.
FlagEmbedding
Retrieval and Retrieval-augmented LLMs
GitHub520
:kissing_heart: Makes you "love" GitHub by fixing broken images and slow loading when accessing it. (No installation required)
DeepSpeed
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Qwen
The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.
AutoGPTQ
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
llm-rankers
Zero-shot Document Ranking with Large Language Models.
vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
TabRecSet
A large scale camera-taken table detection and recognition dataset.
EasyOCR
Ready-to-use OCR with 80+ supported languages and all popular writing scripts, including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.
LLM-Agent-Paper-List
The paper list of the 86-page paper "The Rise and Potential of Large Language Model Based Agents: A Survey" by Zhiheng Xi et al.
tabula
Tabula is a tool for liberating data tables trapped inside PDF files
MNBVC
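MNBVC (Massive Never-ending BT Vast Chinese corpus): an ultra-large-scale Chinese corpus, benchmarked against the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also niche subcultures and even "Martian script" (火星文) data. It includes plain-text Chinese data in every form: news, essays, novels, books, magazines, papers, scripts, forum posts, wiki, classical poetry, lyrics, product descriptions, jokes, embarrassing anecdotes, chat logs, and more.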
WanJuan1.0
WanJuan 1.0 multimodal corpus
open-llms
📋 A list of open LLMs available for commercial use.
nougat
Implementation of Nougat: Neural Optical Understanding for Academic Documents
DocBank
DocBank: A Benchmark Dataset for Document Layout Analysis
baichuan-7B
A large-scale 7B pretraining language model developed by BaiChuan-Inc.
awesome-document-understanding
A curated list of resources on the topic of Document Understanding (DU)
ERNIE-Layout-Pytorch
An unofficial PyTorch implementation of ERNIE-Layout, which was originally released through PaddleNLP.
LaTeX-OCR
pix2tex: Using a ViT to convert images of equations into LaTeX code.
open-mllms
Open multimodal LLMs
ChineseNLPCorpus
Chinese natural language processing datasets, material for everyday experiments. Additions and pull requests are welcome.
GPT2-Chinese
Chinese version of GPT2 training code, using BERT tokenizer.
layout-parser
A Unified Toolkit for Deep Learning Based Document Image Analysis
CAN
When Counting Meets HMER: Counting-Aware Network for Handwritten Mathematical Expression Recognition (ECCV’2022 Poster).
UIE
Unified Structure Generation for Universal Information Extraction