李熙's starred repositories
IDP-system
Intelligent Document Processing System
bucket-based_farthest-point-sampling_CPU
The CPU implementation of bucket-based farthest point sampling, achieving a 7-81x speedup over the conventional implementation
bucket-based_farthest-point-sampling_GPU
The GPU implementation of bucket-based farthest point sampling, achieving a 3-4x speedup over the conventional implementation
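For context on what these two repositories accelerate: conventional farthest point sampling greedily picks k points from a point cloud, each time choosing the point whose minimum distance to the already-selected set is largest, which costs O(n·k). The bucket-based variants above speed up this baseline; the sketch below is only the conventional algorithm (function name and NumPy implementation are my own, not taken from either repository).

```python
import numpy as np

def farthest_point_sampling(points, k, start=0):
    """Conventional O(n*k) farthest point sampling.

    points: (n, d) array of coordinates
    k:      number of points to select
    start:  index of the initial seed point
    """
    n = points.shape[0]
    selected = [start]
    # Minimum squared distance from every point to the selected set.
    min_d2 = np.sum((points - points[start]) ** 2, axis=1)
    for _ in range(k - 1):
        # Greedy step: take the point farthest from the current set.
        nxt = int(np.argmax(min_d2))
        selected.append(nxt)
        # Update each point's distance to the nearest selected point.
        d2 = np.sum((points - points[nxt]) ** 2, axis=1)
        min_d2 = np.minimum(min_d2, d2)
    return np.array(selected)
```

The inner distance update is what the bucketed versions optimize: by spatially partitioning points into buckets, most distance recomputations can be skipped, which is where the reported speedups come from.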
DecryptPrompt
A summary of Prompt & LLM papers, open-source data & models, and AIGC applications
awesome-sentence-embedding
A curated list of pretrained sentence and word embedding models
sentence-transformers
Multilingual Sentence & Image Embeddings with BERT
pycorrector
pycorrector is a toolkit for text error correction. It applies models such as Kenlm, T5, MacBERT, ChatGLM3, and LLaMA to error-correction scenarios, ready to use out of the box.
lihang-code
Code implementations for the book *Statistical Learning Methods* (《统计学习方法》)
text-classification-surveys
A collection of text classification resources, covering: deep learning models such as SpanBERT, ALBERT, RoBERTa, XLNet, MT-DNN, BERT, TextGCN, MGAN, TextCapsule, SGNN, SGM, LEAM, ULMFiT, DGCNN, ELMo, RAM, DeepMoji, IAN, DPCNN, TopicRNN, LSTMN, Multi-Task, HAN, CharCNN, Tree-LSTM, DAN, TextRCNN, Paragraph-Vec, TextCNN, DCNN, RNTN, MV-RNN, and RAE; shallow learning models such as LightGBM, SVM, XGBoost, Random Forest, C4.5, CART, KNN, NB, and HMM; text classification datasets such as MR, SST, MPQA, IMDB, Yelp, 20NG, AG, R8, DBpedia, Ohsumed, SQuAD, SNLI, MNLI, MSRP, MRDA, RCV1, and AAPD; evaluation metrics such as accuracy, Precision, Recall, F1, EM, MRR, HL, Micro-F1, Macro-F1, and P@K; and technical challenges, including multi-label text classification.
Conference-Acceptance-Rate
Acceptance rates for the major AI conferences
langdetect
Port of Google's language-detection library to Python.
ekphrasis
Ekphrasis is a text processing tool geared towards text from social networks such as Twitter or Facebook. It performs tokenization, word normalization, word segmentation (for splitting hashtags), and spell correction, using word statistics from two large corpora (English Wikipedia and 330 million English tweets).
Summarization-Papers
Summarization Papers
heideltime
A multilingual, cross-domain temporal tagger developed at the Database Systems Research Group at Heidelberg University.
spacy-models
💫 Models for the spaCy Natural Language Processing (NLP) library
pumpkin-book
Detailed derivations of the formulas in the book *Machine Learning* (the "Watermelon Book", 《机器学习》)
COVID-19-tracker
The big-data research team at Beihang University organized and collected data sources, and used natural language processing and related techniques to extract structured information from the publicly released trajectories of 4,626 confirmed patients nationwide: basic information (gender, age, place of residence, occupation, Wuhan/Hubei contact history, etc.), trajectories (time, location, means of transport, events), and patient relationships.
git-for-win
Git for Windows. Downloading directly from the official site is difficult within mainland China, usually requiring a VPN; this repository provides a domestic mirror for convenient downloads.