AI+BigData+Cloud Made Easy

A list of frameworks, libraries, resources, and shiny things. Inspired by awesome-... stuff. The most frequently used or well-known items are not listed here; for those, see the awesome series: Awesome Big Data by Onur Akpolat and The Big-Data Ecosystem Table by Andrea Mostosi.

Projects

Storage Design and Data Structures
Distributed Infrastructure for Cloud---Database and Storage
Distributed Infrastructure for Cloud---Application
Distributed Infrastructure for Cloud---A(AI)B(BigData)C(Cloud)
Concurrency
System Performance and Profiling
Search Engine and Information Retrieval

Storage Design and Data Structures

Db-readings - Readings in Databases .
Bitvector - A C++ container-like data structure for storing a vector of bits with fast appending on both sides and fast insertion in the middle, all in succinct space .
BitSliceIndex - Experiments on bit-slice indexing .
RoaringBitmap - Roaring Bitmap .
Pilosa - High performance OLAP based on roaring bitmap .
Cpp-btree - C++ in-memory containers based on a B-tree data structure.
Graphillion - Fast, lightweight graphset operation library .
Emphf - An efficient external-memory algorithm for the construction of minimal perfect hash functions .
Skipgraph - Implementation of skipgraph on messagepack-rpc .
Splay Map - STL map implemented with splay tree .
Cedar - C++ implementation of efficiently-updatable double-array trie .
WikiSort - Fast and stable sort algorithm that uses O(1) memory. Public domain .
Annoy - Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk .
Expgram - An ngram toolkit with succinct storage .
Cuckoofilter - A Bloom filter replacement for approximated set-membership queries .
DCF - Dynamic Cuckoo Filter .
PackedArray - Random access array of tightly packed unsigned integers .
FrameOfReference - C++ library to pack and unpack vectors of integers having a small range of values using a technique called Frame of Reference .
FFBF - Feed-forward Bloom filters .
Concurrent Trees - C++ implementation of concurrent Binary Search Trees .
Concurrent B-Tree - A working project for High-concurrency B-tree source code in C .
Palmtree - An implementation of Intel's concurrent B+Tree (Palm Tree) .
BwTree - An open-source implementation of the Bw-Tree used in SQL Server Hekaton.
W-TinyLFU - C++11 header-only implementation for Window-TinyLFU Cache .
Block-graph - A succinct implementation of a block-graph data structure .
RePair-WaveletTree-Graph - Graph Implementation with repair bitmap compressed WaveletTree .
RLZ - Contains the RLZ compression and self-index source code .
Serangequerying - Space-Efficient Structures for Range Querying .
Succinct - Experimentation with various succinct data structures. Combines previous doc-counter and wavelet-tree repos .
Sdsl-lite - Succinct Data Structure Library 2.0 .
Relative-FMIndex - Relative FM-index which is smaller but slower than plain FMIndex.
GCSA - Generalized Compressed Suffix Array.
Succinct - A collection of succinct data structures .
DYNAMIC - Dynamic succinct/compressed data structures .
DPT - Distributed Patricia Trie .
Rmq - Implementations of LCA and RMQ data structures from "The LCA Problem Revisited" .
YuNomi - Compressed Array Library .
DACs - Directly Addressable Codes (DACs) consist in a variable-length encoding scheme for integers that enables direct access to any element of the encoded sequence and obtains compact spaces .
Cpi00 - The compressed permuterm index .
Smbt - Succinct Multibit Tree for similarity search .
Gwt - Graph-indexing wavelet tree for graph similarity search .
Webgraphs - Fast and Compact Web Graph Representations .
Erika-trie - Erika-trie: succinct trie library .
Path_decomposed_tries - Implementation of the data structures described in the paper "Fast Compressed Tries using Path Decomposition" .
Sumire-tries - A variety of succinct tries .
Trie4j - (Succinct) trie implementation in Java .
SuDS - Succinct Data Structures (SuDS) www.cs.helsinki.fi .
Marisa-trie - Marisa succinct trie .
LibCDS - Compact Data Structures Library .
HSDS - Succinct Data Structure Library Collection including bit-vector/wavelet-matrix/trie .
BWTIL - BWT Text Indexing Library: a set of tools to work with BWT-based text indexes .
Bwt-Merge - A tool for merging large BWTs .
PWT - Parallel Wavelet Tree and Wavelet Matrix Construction .
PSAC - Parallel Suffix Array, LCP Array, and Suffix Tree Construction .
R-Index - Optimal space run-length Burrows-Wheeler transform full-text index .
Fbcsa - Fixed Block based Compact Suffix Array .
Quantile-Index - Code for "The Quantile Index -- Succinct Self-Index for Top-k Document Retrieval" .
Gonzalo Navarro - Publications of Gonzalo Navarro .
Kvtx - Transactions over CAS; see https://docs.google.com/open?id=0B04zCRiCIQGGZDcyNTEwZGQtODk4Yy00NjEwLWI1MjQtYjc3NzJhN2RlNzk0 .
MemC3 - An in-memory key-value cache based on concurrent cuckoo hashing.
Libart - Adaptive Radix Trees implemented in C .
Masstree - Masstree, a fast, multi-core key-value store .
HyPer - A hybrid online transactional processing (OLTP) and online analytical processing (OLAP) high-performance main memory database system that is optimized for modern hardware .
HERD - A Highly Efficient key-value system for RDMA .
Nldb - Nanolat Database supporting 1M transactions per second .
Sophia - Modern embeddable key-value database designed for a high load environment .
FOEDUS - Transactional fast optimistic engine optimized for a large number of CPU cores and NVRAM storage (or fast SSD) .
FastBit_UDF - MySQL UDF for creating, manipulating and querying FastBit indexes .
Jump Consistent Hash - A Go implementation of the jump consistent hash .
Content Defined Chunking - High Performance Content Defined Chunking .
SSD optimizations - Optimizing SSD random IOPS: noop/tpps scheduler, rotational=0, add_random=0.
Article-SSD - Coding for SSDs - What every programmer should know about solid-state drives .
Article-Key-Value - Implementing a Key-Value Store .
Article-MVCC - Implementation of MVCC Transactions for Key-Value Stores .
Article-SSD - Solid-state revolution: in-depth on how SSDs really work .
DB Redbook - Readings in Database Systems .
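
As a concrete illustration of one technique named in this section, here is a minimal Python sketch of Frame of Reference packing as described in the FrameOfReference entry: store the minimum once, then store each value as a small fixed-width offset from it. The function names `for_pack` and `for_unpack` and the tuple layout are assumptions made for this example, not the API of the linked C++ library.

```python
def for_pack(values):
    """Pack integers as (reference, bit_width, count, packed_bits).

    Hypothetical layout for illustration: all offsets from the minimum
    are concatenated into one Python int, `bit_width` bits each.
    """
    if not values:
        return (0, 0, 0, 0)
    ref = min(values)
    # Bits needed for the largest offset; 0 when all values are equal.
    width = (max(values) - ref).bit_length()
    packed = 0
    for i, v in enumerate(values):
        packed |= (v - ref) << (i * width)
    return (ref, width, len(values), packed)


def for_unpack(block):
    """Invert for_pack and return the original list of integers."""
    ref, width, count, packed = block
    mask = (1 << width) - 1
    return [ref + ((packed >> (i * width)) & mask) for i in range(count)]


if __name__ == "__main__":
    data = [1000, 1003, 1001, 1007]
    block = for_pack(data)
    print(block[1], "bits per value")
    print(for_unpack(block) == data)
```

The saving comes from the narrow range: values near 1000 would otherwise need at least 10 bits each, while their offsets from the minimum fit in far fewer.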

Distributed Infrastructure for Cloud---Database and Storage

Cockroach - A Scalable, Geo-Replicated, Transactional Datastore .
TiDB - Distributed NewSQL database compatible with MySQL protocol .
ElastiCell - Cloud native key-value store with strong consistency and reliability .
Yugabyte - Cloud native database store with strong consistency and reliability .
FBase - Cloud native database store with strong consistency and reliability by JD.
Paxosstore - Cloud native key value store with strong consistency and reliability by WeChat.
Phxqueue - A high-availability, high-throughput and highly reliable distributed queue based on the Paxos algorithm.
Youzan-nsq - Youzan's modification of nsq to provide cloud native capability from reliability to auto rebalancing.
Baidu-Elasticsearch - Baidu's modification of elasticsearch to provide strong data consistency and full SQL.
ClickHouse - Yandex's distributed column store OLAP.
Palo - Baidu's distributed OLAP based on Google's Mesa paper.
MapD - MapD OLAP based on GPU.
ContainerFS - Cloud native distributed filesystem for Kubernetes.
OpenEBS - Cloud native filesystem for Kubernetes (non-distributed).
Seaweed-FS - Distributed filesystem for small blob files.
Ambry - Distributed filesystem for small and large blob files.
DistributedLog - High performance replicated log service.
Jepsen - Techniques: Jepsen occupies a particular niche of the correctness testing landscape.
Namazu - Programmable fuzzy scheduler for testing distributed systems.
GPaxos - Golang Paxos implementation based on Phxpaxos .
Consensus-Yaraft - C++ Raft implementation based on Etcd's golang raft .
NOPaxos - Network-Ordered Paxos .
TAPIR - Building Consistent Transactions with Inconsistent Replication .
Phat - An implementation of the Chubby lock service protocol in Msgpack RPC .
Hydra - A distributed data processing and storage system originally developed at AddThis .
Summingbird - Streaming MapReduce with Scalding and Storm https://twitter.com/summingbird .
Hustle - A column oriented, embarrassingly distributed relational event database .
MDCC - Multi-DataCenter Consistency protocol .
URingPaxos - High throughput atomic multicast protocol .
Course-CS6452 - Datacenter Networks and Services .

Distributed Infrastructure for Cloud---Application

Pinpoint - Non-intrusive Dapper-like APM solution .
CAT - APM solution at Dianping Inc .
Brave - Java version of OpenZipkin .
Appdash - Golang version of Dapper .
Jaeger - Golang version of Dapper at Uber.
Cadence - Microservice workflow orchestrator .
Zeebe - Microservice workflow orchestrator .
F-Stack - Network framework with high performance based on DPDK .
DPVS - High performance Layer-4 load balancer based on DPDK .

Distributed Infrastructure for Cloud---A(AI)B(BigData)C(Cloud)

Galaxy - Naive scheduler for Baidu search cluster .
Cook - Fair job scheduler on Mesos for batch workloads and Spark .
Kube-arbitrator - Cluster colocation scheduler for Kubernetes .
BigFlow - Baidu dataflow operator .
Pulsar - Business level monitor and analysis .
Cubert - A fast and efficient batch computation engine for complex analysis and reporting of massive datasets on Hadoop .
Embulk - A plugin-based parallel bulk data loader that makes painful data integration work relaxed.
Gobblin - Data ingestion as a service .
Magpie - Deploying and managing a Hadoop Yarn cluster with Docker Swarm .
Horovod - Uber's extension of TensorFlow to provide ring-allreduce based on MPI.
Angel - Tencent's parameter server infrastructure to support machine learning.
Ytk-Learn - Yuantiku's distributed machine learning platform.
Libble - LIBBLE from NJU to provide faster convergence than SGD.
Gloo - Facebook's communications library with various primitives for multi-machine training.
xLearn - High Performance, Easy-to-use, and Scalable Machine Learning Package (C++, Python, R).
LASER - A Scalable Response Prediction Platform For Online Advertising .
Hivemall - Scalable machine learning library for Hive/Hadoop .
Ml-ease - ADMM based large scale logistic regression .
Jubatus - Distributed Online Machine Learning Framework .

Concurrency

Concurrent Queue - A fast multi-producer, multi-consumer lock-free concurrent queue for C++11.
CAF - An Open Source Implementation of the Actor Model in C++ .
TAMER - C++ extensions for readable event-driven programming .
C++React - A reactive programming library for C++11 .
Libslock - Cross-platform atomic operations and lock algorithm library http://lpd.epfl.ch/site/ssync .
CDS - Header only C++ Concurrent Data Structures library .
Libcds - A C++ template library of lock-free and fine-grained algorithms .
Locksmith - A library for debugging locking in C, C++, or Objective C programs .
Concurrency-concepts - A guide to concurrency, multi-threading and parallel programming concepts. Explains the differences between every concept, their advantages and disadvantages in detail .
Concurrency Kit - Concurrency primitives, safe memory reclamation mechanisms and non-blocking data structures for the research, design and implementation of high performance concurrent systems .
Nanahan - An implementation of Hopscotch hashing for single thread .
Scalex - Code snippets for the workshop on concurrent data structure implementation .
CBB - Provides a set of concurrent building blocks (Java & C/C++) that can be used to develop parallel/multi-threaded applications .
Thrust - A parallel algorithms library which resembles the C++ Standard Template Library (STL) .
Varon-t - A C implementation of Disruptor queues http://varon-t.readthedocs.org/ .
Lockfree Queue - Lock-free Condition Wait for Lock-free Multi-producer Multi-consumer Queue, see http://natsys-lab.blogspot.ru/2013/08/lock-free-condition-wait-for-lock-free.html .
Ssmem - A simple object-based memory allocator with epoch-based garbage collection; see the publication "Asynchronized Concurrency: The Secret to Scaling Concurrent Search Data Structures".
CLHT - A very fast and scalable (lock-based and lock-free) hash table that uses cache-line sized buckets .
Comsat - Comsat lets your application enjoy the scalability of asynchronous web-frameworks, serving many thousands of concurrent long-lived connections, or issuing hundreds of web-service calls for each request, all while maintaining the simple “thread per request” model .
Quasar-thrift - Quasar fiber based Thrift RPC .
Seastar - Concurrency library in user space .
Article-TM - Transactional Memory: History and Development .

System Performance and Profiling

Vmmlib - A templatized C++ vector and matrix math library .
Blaze-lib - A high performance C++ math library .
Light-matrix - A Light-weight and Fast Template Matrix Library .
Light-simd - A lightweight library for SIMD-based computation.
MathSimd - SIMD-optimized math library in C++ .
Opti - Experiment of x86/x64 optimization .
Fmath - Fast log and exp functions for x86/x64 SSE http://homepage1.nifty.com/herumi/soft/fmath.html .
Mie - Fast string library with SSE4.2 .
Libsimdpp - Header-only zero-overhead C++ wrapper for SIMD intrinsics of multiple instruction sets .
Smart - SMT-aware Real-time scheduler for Linux from Yandex.
Simple Binary Encoding - Serialization with ultra low latency .
Farmhash - FarmHash is a successor to CityHash, and includes many of the same tricks and techniques, several of them taken from Austin Appleby’s MurmurHash .
Proxygen - A collection of C++ HTTP libraries including an easy to use HTTP server .
Yamail - YMail General Purpose Library .
WDT - Warp speed Data Transfer (WDT) is an embeddable library (and command line tool) aiming to transfer data between 2 systems as fast as possible over multiple TCP paths.
UNetStack - Userspace TCP/IP stack .
CamIO - Userspace IO abstraction .
Ktap - A lightweight script-based dynamic tracing tool for Linux http://ktap.org .
Perfbook - Is Parallel Programming Hard, And, If So, What Can You Do About It?
Article-GC-Java - Garbage Collection Optimization for High-Throughput and Low-Latency Java Applications | LinkedIn Engineering .
Article-Memory Management - Optimizing Linux Memory Management for Low-latency / High-throughput Databases | LinkedIn Engineering .
Article-Modern Microprocessors - Modern Microprocessors: A 90-Minute Guide!
Article-Cache Oblivious Array - Cache oblivious array operations .
Article-Understanding Memory - Understanding Memory .
Article-1975 Programming - So what's wrong with 1975 programming? .
Article-Database Research - Database Research on Modern Computing Architecture .
Article-Linux Learn From Solaris - What Linux can learn from Solaris performance and vice-versa .
Brendan D. Gregg - Blog of Brendan D. Gregg .
Course-CMU 18-645 - How to Write Fast Code .
ParallelismBook - A book about parallel computing & code optimization .

Search Engine and Information Retrieval

Vespa - Production ready search engine to support web-scale data .
SF1R - A distributed massive data engine for enterprise/vertical search written in C++ .
BitFunnel - Signature file based search engine from Bing .
Trinity - Trinity IR toolkit .
IResearch - IR toolkit to be used for ArangoDB .
Partitioned_elias_fano - Code used for the experiments in the paper "Partitioned Elias-Fano Indexes" .
Clustered_Partitioned_elias_fano - Code used for the paper "Clustered Elias-Fano Indexes".
Data Structures for Inverted Indexes - Optimal Space-Time Tradeoffs for Inverted Indexes .
Surf - SUccinct Retrieval Framework .
FastPFor - Fast integer compression .
Indexing - Experimenting with indexing on GPUs .
Genie - Generic Inverted Index on GPU .
Simdcomp - A simple C library for compressing lists of integers .
SIMDCompressionAndIntersection - A C++ library to compress and intersect sorted lists of integers using SIMD instructions .
TurboPFor - Fastest Integer Compression .
Pos-cmp - Comparison framework for positional inverted indexes and self-index supporting phrase queries .
MaskedVByte - SIMD-accelerated VByte Compression, Publication "Vectorized VByte Decoding" .
Wavelet - Information Retrieval based on Wavelet Tree .
Shuffla - Search engine using kd-tree .
RoSA - Large-Scale Pattern Search Using Reduced-Space On-Disk Suffix Arrays .
Dualsorted - Dual sorted inverted index based on Wavelet Tree .
Treap - Faster and Smaller Inverted Indices with Treaps .
Gigablast - A distributed open source search engine and spider written in C/C++ for Linux .
SIMD-Based-Posting-lists - Implementation of Alexander A. Stepanov's inverted index compression algorithms.
Groonga - Open-source fulltext search engine and column store .
Atire - A search engine built using the most effective recent research techniques discovered by Information Retrieval researchers around the world .
Mg4j - Academic search engine with a succinct design (e.g., quasi-succinct indices).
Argos - A structural data search engine .
MFRetrieval - Tools for maximum inner product retrieval in recommender systems .
Faiss - A library for efficient similarity search and clustering of dense vectors .
Lopq - Training of Locally Optimized Product Quantization (LOPQ) models for approximate nearest neighbor search of high dimensional data in Python and Spark .
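
Several entries in this section (FastPFor, Simdcomp, MaskedVByte, TurboPFor) compress the sorted integer lists that make up inverted-index postings. The following is a minimal scalar Python sketch of the basic idea, delta gaps followed by a VByte-style encoding; it is not the code or byte layout of any listed library. The function names are assumptions for this example, and the convention used here (high bit set on the last byte of each integer) is only one of the variants in use.

```python
def vbyte_encode(sorted_ids):
    """Encode a sorted list of non-negative ints as delta gaps in VByte form."""
    out = bytearray()
    prev = 0
    for doc_id in sorted_ids:
        gap = doc_id - prev          # gaps are small when ids are dense
        prev = doc_id
        while gap >= 128:
            out.append(gap & 0x7F)   # low 7 bits, more bytes follow
            gap >>= 7
        out.append(gap | 0x80)       # high bit marks the final byte
    return bytes(out)


def vbyte_decode(data):
    """Invert vbyte_encode and return the original sorted ids."""
    ids, cur, shift, prev = [], 0, 0, 0
    for byte in data:
        if byte & 0x80:              # final byte of this gap
            cur |= (byte & 0x7F) << shift
            prev += cur              # undo the delta coding
            ids.append(prev)
            cur, shift = 0, 0
        else:
            cur |= byte << shift
            shift += 7
    return ids


if __name__ == "__main__":
    postings = [3, 130, 131, 100000]
    encoded = vbyte_encode(postings)
    print(len(encoded), "bytes")
    print(vbyte_decode(encoded) == postings)
```

The SIMD libraries above speed up this kind of decoding by processing many bytes per instruction; the scalar loop here only shows what is being encoded.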