Lirong Jian's repositories
alpa
Auto parallelization for large-scale neural networks
antlr4
ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.
arrow
Apache Arrow is a columnar in-memory analytics layer designed to accelerate big data. It houses a set of canonical in-memory representations of flat and hierarchical data along with multiple language-bindings for structure manipulation. It also provides IPC and common algorithm implementations.
BQconvert
BigQuery Schema Conversion Tool
c-store
C-Store: A column-oriented DBMS prototype (frozen)
ClickBench
ClickBench: a Benchmark For Analytical Databases
cylon
Cylon is a fast, scalable distributed memory data parallel library for processing structured data
diagrams
Diagram as Code for prototyping cloud system architectures
dsb
The DSB benchmark is designed for evaluating both workload-driven and traditional database systems on modern decision support workloads. DSB is adapted from the widely-used industrial-standard TPC-DS benchmark. It enhances the TPC-DS benchmark with complex data distribution and challenging yet semantically meaningful query templates. DSB also introduces configurable and dynamic workloads to assess the adaptability of database systems. Since workload-driven and traditional database systems have different performance dimensions, including the additional resources required for tuning and maintaining the systems, we provide guidelines on evaluation methodology and metrics to report.
juicefs
A distributed POSIX file system built on top of Redis and S3.
Jungle
An embedded key-value store library specialized for building state machines and log stores
llama2_aided_tesseract
Enhance Tesseract OCR output for scanned PDFs by applying Large Language Model (LLM) corrections, complete with options for text validation and hallucination filtering.
llmperf
LLMPerf is a library for validating and benchmarking LLMs
lux
Automatically visualize your pandas dataframe via a single print! 📊 💡
magika
Detect file content types with deep learning
MediaCrawler
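Crawlers for Xiaohongshu notes and comments, Douyin videos and comments, Kuaishou videos and comments, Bilibili videos and comments, and Weibo posts and comments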
modin
Modin: Speed up your Pandas workflows by changing a single line of code
neon
Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, branching, and bottomless storage.
OpenLineage
An Open Standard for lineage metadata collection
orioledb
OrioleDB – building a modern cloud-native storage engine (... and solving some PostgreSQL wicked problems)
proton
A unified streaming and historical data processing engine in one single binary, powered by ClickHouse
queryparser
Parsing and analysis of Vertica, Hive, and Presto SQL.
sqlancer
Detecting Logic Bugs in DBMS
sqlsmith
A random SQL query generator
system-design-resources
These are the best resources for System Design on the Internet
timely-dataflow
A modular implementation of timely dataflow in Rust
ucx
Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
velox
A new C++ vectorized database acceleration library aimed at optimizing query engines and data processing systems.