Chenghao Mou's repositories
text-dedup
All-in-one text de-duplication
touchbar-lyric
Show synced lyrics in the Touch Bar with BetterTouchTool and NetEase APIs
pytorch-pQRNN
Implementation of pQRNN in PyTorch
embeddings
zero-vocab or low-vocab embeddings
awesome-data-deduplication
An awesome list of data deduplication use cases, papers, tools, and methods.
chenghaomou.github.io
Personal Blog
deduplicate-text-datasets
A modified version of Google's tool for plain-text files
lightning-grid-template
A minimal template for pytorch-lightning and grid.ai
ai.robots.txt
A list of AI agents and robots to block.
awesome-nlp
:book: A curated list of resources dedicated to Natural Language Processing (NLP)
bender-ruler
Bender Rule analysis for NLP papers
bigcode-analysis
Repository for analysis notebooks and experiments of the BigCode project.
blog
Public repo for HF blog posts
data_tooling
Tools for managing datasets for governance and training.
edgar-crawler
SEC EDGAR Exhibit Downloader
file-explorer-markdown-titles
Obsidian plugin that adds the Markdown title within your notes to the file explorer
go-wordninja
Probabilistically split concatenated words using NLP based on English Wikipedia unigram frequencies.
open-source-mac-os-apps
🚀 Awesome list of open source applications for macOS. https://t.me/s/opensourcemacosapps
paper2audio
Convert research papers to audio files.
presidio
Context aware, pluggable and customizable data protection and de-identification SDK for text and images
pytorch-dice-loss
Dice loss for data-imbalanced NLP tasks
quartz
🌱 a fast, batteries-included static-site generator that transforms Markdown content into fully functional websites
star-classification
A tool for the projects you starred on GitHub
table-transformer-doclaynet
Table Transformer Fine-tuned with DocLayNet Dataset