ccx06

Followers: 0 | Following: 0

ccx06's starred repositories

Cherry_LLM

[NAACL'24] Self-data filtering of LLM instruction-tuning data using a novel perplexity-based difficulty score, without using any other models

Language: Python | Stargazers: 280 | Issues: 0

datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

Language: Python | License: Apache-2.0 | Stargazers: 1927 | Issues: 0

cc_net

Tools to download and cleanup Common Crawl data

Language: Python | License: MIT | Stargazers: 958 | Issues: 0

RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.

Language: Python | License: Apache-2.0 | Stargazers: 4518 | Issues: 0

Language: XSLT | License: Apache-2.0 | Stargazers: 109 | Issues: 0

MathPile

Generative AI for Math: MathPile

Language: Python | License: Apache-2.0 | Stargazers: 372 | Issues: 0

awesome-LLM-resourses

🧑‍🚀 Summary of the world's best LLM resources.

Stargazers: 976 | Issues: 0

Awesome-Tabular-LLMs

We collect papers about "large language models (LLM) for table-related tasks", e.g., using LLMs for the Table QA task.

Stargazers: 166 | Issues: 0

agentscope

Start building LLM-empowered multi-agent applications in an easier way.

Language: Python | License: Apache-2.0 | Stargazers: 4769 | Issues: 0

Language: Python | License: NOASSERTION | Stargazers: 197 | Issues: 0

doremi

PyTorch implementation of DoReMi, a method for optimizing the data mixture weights in language modeling datasets.

Language: HTML | License: MIT | Stargazers: 289 | Issues: 0

awesome-pretrained-chinese-nlp-models

Awesome Pretrained Chinese NLP Models: a collection of high-quality Chinese pretrained models, large models, multimodal models, and large language models.

Language: Python | License: MIT | Stargazers: 4711 | Issues: 0

MNBVC

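MNBVC (Massive Never-ending BT Vast Chinese corpus) is a very large-scale Chinese corpus that aims to match the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also niche subcultures and even "Martian script" (火星文) text. It includes plain-text Chinese data in every form: news, essays, novels, books, magazines, papers, scripts, forum posts, wiki, classical poetry, lyrics, product descriptions, jokes, embarrassing anecdotes, chat logs, and more.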

License: MIT | Stargazers: 3371 | Issues: 0

zero_nlp

Chinese NLP solutions (large models, data, models, training, inference).

Language: Jupyter Notebook | License: MIT | Stargazers: 2838 | Issues: 0

GPT2-Chinese

Chinese version of GPT2 training code, using BERT tokenizer.

Language: Python | License: MIT | Stargazers: 7441 | Issues: 0

Open-GPT

GPT is a belief. This project provides a code library for efficiently training Chinese GPT models. It uses the nanoGPT framework and the Sophia optimizer.

Language: Jupyter Notebook | Stargazers: 5 | Issues: 0

nlp_chinese_corpus

Large Scale Chinese Corpus for NLP

License: MIT | Stargazers: 9402 | Issues: 0

gpt-neo

An implementation of model parallel GPT-2 and GPT-3-style models using the mesh-tensorflow library.

Language: Python | License: MIT | Stargazers: 8204 | Issues: 0

gpt-2

Code for the paper "Language Models are Unsupervised Multitask Learners"

Language: Python | License: NOASSERTION | Stargazers: 22232 | Issues: 0

Language: Python | Stargazers: 1746 | Issues: 0

HPT

This repository implements a prompt tuning model for hierarchical text classification. The work was accepted as the long paper "HPT: Hierarchy-aware Prompt Tuning for Hierarchical Text Classification" at EMNLP 2022.

Language: Python | License: MIT | Stargazers: 61 | Issues: 0

FlagEmbedding

Retrieval and Retrieval-augmented LLMs

Language: Python | License: MIT | Stargazers: 6764 | Issues: 0

Chinese-Keyphrase-Extraction

Unsupervised Chinese keyphrase extraction with statistics-based, graph-based (LDA and PageRank: TextRank, TPR, Salience Rank, Single TPR, etc.), and embedding-based (SIFRank, etc.) methods, ready to use out of the box.

Language: Python | License: MIT | Stargazers: 100 | Issues: 0

Baichuan2

A series of large language models developed by Baichuan Intelligent Technology

Language: Python | License: Apache-2.0 | Stargazers: 4070 | Issues: 0

multi-label-classification

Multi-label, multi-class classification models based on tf.keras.

Language: Python | License: MIT | Stargazers: 83 | Issues: 0

LLaMA-Factory

Efficiently Fine-Tune 100+ LLMs in WebUI (ACL 2024)

Language: Python | License: Apache-2.0 | Stargazers: 30604 | Issues: 0

albert_pytorch

A Lite Bert For Self-Supervised Learning Language Representations

Language: Python | License: Apache-2.0 | Stargazers: 708 | Issues: 0

Pytorch-NLU

Pytorch-NLU, a Chinese text classification and sequence labeling toolkit. It supports multi-class and multi-label classification of long and short Chinese text, as well as sequence labeling tasks such as Chinese named entity recognition, part-of-speech tagging, word segmentation, and extractive text summarization.

Language: Python | License: Apache-2.0 | Stargazers: 322 | Issues: 0

ChatRWKV

ChatRWKV is like ChatGPT but powered by RWKV (100% RNN) language model, and open source.

Language: Python | License: Apache-2.0 | Stargazers: 9381 | Issues: 0