izenecloud / big-data-made-easy

Big Data Made Easy

AI+BigData+Cloud Made Easy

A list of frameworks, libraries, resources, and shiny things, inspired by the awesome-... lists. The most frequently used or well-known items are not listed here; for those, see the awesome series: Awesome Big Data by Onur Akpolat and The Big-Data Ecosystem Table by Andrea Mostosi.

Projects

Storage Design and Data Structures

  • Db-readings - Readings in Databases .
  • Bitvector - A C++ container-like data structure for storing a vector of bits with fast appending on both sides and fast insertion in the middle, all in succinct space .
  • BitSliceIndex - Experiments on bit-slice indexing .
  • RoaringBitmap - Roaring Bitmap .
  • Pilosa - High performance OLAP based on roaring bitmap .
  • Cpp-btree - C++ in-memory containers based on a B-tree data structure.
  • Graphillion - Fast, lightweight graphset operation library .
  • Emphf - An efficient external-memory algorithm for the construction of minimal perfect hash functions .
  • Skipgraph - Implementation of skipgraph on messagepack-rpc .
  • Splay Map - STL map implemented with splay tree .
  • Cedar - C++ implementation of efficiently-updatable double-array trie .
  • WikiSort - Fast and stable sort algorithm that uses O(1) memory. Public domain .
  • Annoy - Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk .
  • Expgram - An ngram toolkit with succinct storage .
  • Cuckoofilter - A Bloom filter replacement for approximated set-membership queries .
  • DCF - Dynamic Cuckoo Filter .
  • PackedArray - Random access array of tightly packed unsigned integers .
  • FrameOfReference - C++ library to pack and unpack vectors of integers having a small range of values using a technique called Frame of Reference .
  • FFBF - Feed-forward Bloom filters .
  • Concurrent Trees - C++ implementation of concurrent Binary Search Trees .
  • Concurrent B-Tree - A work-in-progress high-concurrency B-tree implementation in C.
  • Palmtree - An implementation of Intel's concurrent B+Tree (Palm Tree) .
  • BwTree - An open-source implementation of the Bw-Tree used in SQL Server Hekaton.
  • W-TinyLFU - C++11 header-only implementation for Window-TinyLFU Cache .
  • Block-graph - A succinct implementation of a block-graph data structure .
  • RePair-WaveletTree-Graph - Graph Implementation with repair bitmap compressed WaveletTree .
  • RLZ - Contains the RLZ compression and self-index source code .
  • Serangequerying - Space-Efficient Structures for Range Querying .
  • Succinct - Experimentation with various succinct data structures. Combines previous doc-counter and wavelet-tree repos .
  • Sdsl-lite - Succinct Data Structure Library 2.0 .
  • Relative-FMIndex - Relative FM-index which is smaller but slower than plain FMIndex.
  • GCSA - Generalized Compressed Suffix Array.
  • Succinct - A collection of succinct data structures .
  • DYNAMIC - Dynamic succinct/compressed data structures .
  • DPT - Distributed Patricia Trie .
  • Rmq - Implementations of LCA and RMQ data structures from "The LCA Problem Revisited" .
  • YuNomi - Compressed Array Library .
  • DACs - Directly Addressable Codes (DACs) are a variable-length encoding scheme for integers that enables direct access to any element of the encoded sequence while keeping the representation compact.
  • Cpi00 - The compressed permuterm index .
  • Smbt - Succinct Multibit Tree for similarity search .
  • Gwt - Graph-indexing wavelet tree for graph similarity search .
  • Webgraphs - Fast and Compact Web Graph Representations .
  • Erika-trie - Erika-trie: succinct trie library .
  • Path_decomposed_tries - Implementation of the data structures described in the paper "Fast Compressed Tries using Path Decomposition" .
  • Sumire-tries - A variety of succinct tries .
  • Trie4j - (Succinct) trie implementation in Java .
  • SuDS - Succinct Data Structures (SuDS) www.cs.helsinki.fi .
  • Marisa-trie - Marisa succinct trie .
  • LibCDS - Compact Data Structures Library .
  • HSDS - Succinct Data Structure Library Collection including bit-vector/wavelet-matrix/trie .
  • BWTIL - BWT Text Indexing Library: a set of tools to work with BWT-based text indexes .
  • Bwt-Merge - A tool for merging large BWTs .
  • PWT - Parallel Wavelet Tree and Wavelet Matrix Construction .
  • PSAC - Parallel Suffix Array, LCP Array, and Suffix Tree Construction .
  • R-Index - Optimal space run-length Burrows-Wheeler transform full-text index .
  • Fbcsa - Fixed Block based Compact Suffix Array .
  • Quantile-Index - Code for "The Quantile Index -- Succinct Self-Index for Top-k Document Retrieval" .
  • Gonzalo Navarro - Publications of Gonzalo Navarro .
  • Kvtx - Transactions over CAS; see https://docs.google.com/open?id=0B04zCRiCIQGGZDcyNTEwZGQtODk4Yy00NjEwLWI1MjQtYjc3NzJhN2RlNzk0 .
  • MemC3 - An in-memory key-value cache based on concurrent cuckoo hashing.
  • Libart - Adaptive Radix Trees implemented in C .
  • Masstree - Masstree, a fast, multi-core key-value store .
  • HyPer - A hybrid online transactional processing (OLTP) and online analytical processing (OLAP) high-performance main memory database system that is optimized for modern hardware .
  • HERD - A Highly Efficient key-value system for RDMA .
  • Nldb - Nanolat Database supporting 1M transactions per second .
  • Sophia - Modern embeddable key-value database designed for a high load environment .
  • FOEDUS - Transactional fast optimistic engine optimized for a large number of CPU cores and NVRAM storage (or fast SSD) .
  • FastBit_UDF - MySQL UDF for creating, manipulating and querying FastBit indexes .
  • Jump Consistent Hash - A Go implementation of the jump consistent hash .
  • Content Defined Chunking - High Performance Content Defined Chunking .
  • SSD optimizations - Optimizing SSD random IOPS: noop/tpps scheduler, rotational=0, add_random=0.
  • Article-SSD - Coding for SSDs - What every programmer should know about solid-state drives .
  • Article-Key-Value - Implementing a Key-Value Store .
  • Article-MVCC - Implementation of MVCC Transactions for Key-Value Stores .
  • Article-SSD - Solid-state revolution: in-depth on how SSDs really work .
  • DB Redbook - Readings in Database Systems .
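
The Jump Consistent Hash entry above links to a Go implementation. As a rough illustration of the idea, here is a minimal Python sketch that follows the algorithm published by Lamping and Veach. The function name `jump_consistent_hash` and the 64-bit masking are choices made for this example and are not taken from the linked repository.

```python
MASK64 = (1 << 64) - 1  # emulate unsigned 64-bit wraparound in Python


def jump_consistent_hash(key: int, num_buckets: int) -> int:
    """Map a 64-bit integer key to a bucket in [0, num_buckets).

    Sketch of the Lamping/Veach jump consistent hash; the name and
    signature are illustrative assumptions, not a library API.
    """
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    key &= MASK64
    b, j = -1, 0
    while j < num_buckets:
        b = j
        # Linear congruential step acts as a per-key pseudo-random generator.
        key = (key * 2862933555777941757 + 1) & MASK64
        # Jump forward to the next bucket count at which this key would move.
        j = int((b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b


if __name__ == "__main__":
    for buckets in (1, 5, 6):
        print(buckets, [jump_consistent_hash(k, buckets) for k in range(8)])
```

When the bucket count grows from n to n+1, each key either stays in its current bucket or moves to the new bucket n, so only a small share of keys is remapped. That property is what makes the scheme attractive for sharding across a numbered set of servers.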

Distributed Infrastructure for Cloud---Database and Storage

  • Cockroach - A Scalable, Geo-Replicated, Transactional Datastore .
  • TiDB - Distributed NewSQL database compatible with MySQL protocol .
  • ElastiCell - Cloud native key-value store with strong consistency and reliability .
  • Yugabyte - Cloud native database store with strong consistency and reliability .
  • FBase - Cloud native database store with strong consistency and reliability by JD.
  • Paxosstore - Cloud native key value store with strong consistency and reliability by WeChat.
  • Phxqueue - A high-availability, high-throughput and highly reliable distributed queue based on the Paxos algorithm.
  • Youzan-nsq - Youzan's modification of nsq to provide cloud native capability from reliability to auto rebalancing.
  • Baidu-Elasticsearch - Baidu's modification of elasticsearch to provide strong data consistency and full SQL.
  • ClickHouse - Yandex's distributed column store OLAP.
  • Palo - Baidu's distributed OLAP based on Google's Mesa paper.
  • MapD - MapD OLAP based on GPU.
  • ContainerFS - Cloud native distributed filesystem for Kubernetes.
  • OpenEBS - Cloud native filesystem for Kubernetes (non-distributed).
  • Seaweed-FS - Distributed filesystem for small blob files.
  • Ambry - Distributed filesystem for small and large blob files.
  • DistributedLog - High performance replicated log service.
  • Jepsen - Correctness testing for distributed systems; Jepsen occupies a particular niche of the correctness testing landscape.
  • Namazu - Programmable fuzzy scheduler for testing distributed systems.
  • GPaxos - Golang Paxos implementation based on Phxpaxos .
  • Consensus-Yaraft - C++ Raft implementation based on Etcd's golang raft .
  • NOPaxos - Network-Ordered Paxos .
  • TAPIR - Building Consistent Transactions with Inconsistent Replication .
  • Phat - An implementation of the Chubby lock service protocol in Msgpack RPC .
  • Hydra - A distributed data processing and storage system originally developed at AddThis .
  • Summingbird - Streaming MapReduce with Scalding and Storm https://twitter.com/summingbird .
  • Hustle - A column oriented, embarrassingly distributed relational event database .
  • MDCC - Multi-DataCenter Consistency protocol .
  • URingPaxos - High throughput atomic multicast protocol .
  • Course-CS6452 - Datacenter Networks and Services .

Distributed Infrastructure for Cloud---Application

  • Pinpoint - Non-intrusive Dapper-like APM solution .
  • CAT - APM solution at Dianping Inc .
  • Brave - Java version of OpenZipkin .
  • Appdash - Golang version of Dapper .
  • Jaeger - Golang version of Dapper in Uber.
  • Cadence - Microservice workflow orchestrator .
  • Zeebe - Microservice workflow orchestrator .
  • F-Stack - Network framework with high performance based on DPDK .
  • DPVS - High performance Layer-4 load balancer based on DPDK .

Distributed Infrastructure for Cloud---A(AI)B(BigData)C(Cloud)

  • Galaxy - Naive scheduler for Baidu search cluster .
  • Cook - Fair job scheduler on Mesos for batch workloads and Spark .
  • Kube-arbitrator - Cluster colocation scheduler for Kubernetes .
  • BigFlow - Baidu dataflow operator .
  • Pulsar - Business level monitor and analysis .
  • Cubert - A fast and efficient batch computation engine for complex analysis and reporting of massive datasets on Hadoop .
  • Embulk - A plugin-based parallel bulk data loader that makes painful data integration work easier.
  • Gobblin - Data ingestion as a service .
  • Magpie - Deploying and managing a Hadoop Yarn cluster with Docker Swarm .
  • Horovod - Uber's distributed training framework for TensorFlow, providing ring-allreduce based on MPI.
  • Angel - Tencent's parameter server infrastructure to support machine learning.
  • Ytk-Learn - Yuantiku's distributed machine learning platform.
  • Libble - LIBBLE from NJU to provide faster convergence than SGD.
  • Gloo - Facebook's communications library with various primitives for multi-machine training.
  • xLearn - High Performance, Easy-to-use, and Scalable Machine Learning Package (C++, Python, R).
  • LASER - A Scalable Response Prediction Platform For Online Advertising .
  • Hivemall - Scalable machine learning library for Hive/Hadoop .
  • Ml-ease - ADMM based large scale logistic regression .
  • Jubatus - Distributed Online Machine Learning Framework .

Concurrency

  • Concurrent Queue - A fast multi-producer, multi-consumer lock-free concurrent queue for C++11.
  • CAF - An Open Source Implementation of the Actor Model in C++ .
  • TAMER - C++ extensions for readable event-driven programming .
  • C++React - A reactive programming library for C++11 .
  • Libslock - Cross-platform atomic operations and lock algorithm library http://lpd.epfl.ch/site/ssync .
  • CDS - Header only C++ Concurrent Data Structures library .
  • Libcds - A C++ template library of lock-free and fine-grained algorithms .
  • Locksmith - A library for debugging locking in C, C++, or Objective C programs .
  • Concurrency-concepts - A guide to concurrency, multi-threading and parallel programming concepts. Explains the differences between every concept, their advantages and disadvantages in detail .
  • Concurrency Kit - Concurrency primitives, safe memory reclamation mechanisms and non-blocking data structures for the research, design and implementation of high performance concurrent systems .
  • Nanahan - An implementation of Hopscotch hashing for single-threaded use.
  • Scalex - Code snippets for the workshop on concurrent data structure implementation .
  • CBB - Provides a set of concurrent building blocks (Java & C/C++) that can be used to develop parallel/multi-threaded applications .
  • Thrust - A parallel algorithms library which resembles the C++ Standard Template Library (STL) .
  • Varon-t - A C implementation of Disruptor queues http://varon-t.readthedocs.org/ .
  • Lockfree Queue - Lock-free Condition Wait for Lock-free Multi-producer Multi-consumer Queue, see http://natsys-lab.blogspot.ru/2013/08/lock-free-condition-wait-for-lock-free.html .
  • Ssmem - A simple object-based memory allocator with epoch-based garbage collection, from the publication "Asynchronized Concurrency: The Secret to Scaling Concurrent Search Data Structures".
  • CLHT - A very fast and scalable (lock-based and lock-free) hash table that uses cache-line sized buckets .
  • Comsat - Comsat lets your application enjoy the scalability of asynchronous web-frameworks, serving many thousands of concurrent long-lived connections, or issuing hundreds of web-service calls for each request, all while maintaining the simple “thread per request” model .
  • Quasar-thrift - Quasar fiber based Thrift RPC .
  • Seastar - Concurrency library in user space .
  • Article-TM - Transactional Memory: History and Development .

System Performance And Profiling

Search Engine and Information Retrieval

  • Vespa - Production ready search engine to support web-scale data .
  • SF1R - A distributed massive data engine for enterprise/vertical search written in C++ .
  • BitFunnel - Signature file based search engine from Bing .
  • Trinity - Trinity IR toolkit .
  • IResearch - IR toolkit to be used for ArangoDB .
  • Partitioned_elias_fano - Code used for the experiments in the paper "Partitioned Elias-Fano Indexes" .
  • Clustered_Partitioned_elias_fano - Code used for the paper "Clustered Elias-Fano Indexes".
  • Data Structures for Inverted Indexes - Optimal Space-Time Tradeoffs for Inverted Indexes .
  • Surf - SUccinct Retrieval Framework .
  • FastPFor - Fast integer compression .
  • Indexing - Experimenting with indexing on GPUs .
  • Genie - Generic Inverted Index on GPU .
  • Simdcomp - A simple C library for compressing lists of integers .
  • SIMDCompressionAndIntersection - A C++ library to compress and intersect sorted lists of integers using SIMD instructions .
  • TurboPFor - Fastest Integer Compression .
  • Pos-cmp - Comparison framework for positional inverted indexes and self-index supporting phrase queries .
  • MaskedVByte - SIMD-accelerated VByte Compression, Publication "Vectorized VByte Decoding" .
  • Wavelet - Information Retrieval based on Wavelet Tree .
  • Shuffla - Search engine using kd-tree .
  • RoSA - Large-Scale Pattern Search Using Reduced-Space On-Disk Suffix Arrays .
  • Dualsorted - Dual sorted inverted index based on Wavelet Tree .
  • Treap - Faster and Smaller Inverted Indices with Treaps .
  • Gigablast - A distributed open source search engine and spider written in C/C++ for Linux .
  • SIMD-Based-Posting-lists - Implementation of Alexander A. Stepanov's inverted index compression algorithms.
  • Groonga - Open-source fulltext search engine and column store .
  • Atire - A search engine built using the most effective recent research techniques discovered by Information Retrieval researchers around the world .
  • Mg4j - Academic search engine with a succinct design (e.g., quasi-succinct indices).
  • Argos - A structural data search engine .
  • MFRetrieval - Tools for maximum inner product retrieval in recommender systems .
  • Faiss - A library for efficient similarity search and clustering of dense vectors .
  • Lopq - Training of Locally Optimized Product Quantization (LOPQ) models for approximate nearest neighbor search of high dimensional data in Python and Spark .
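
Several entries above (Simdcomp, MaskedVByte, FastPFor, TurboPFor) deal with compressing sorted integer lists such as posting lists. As a rough, scalar illustration of the underlying idea, the sketch below delta-encodes a sorted list of document IDs and stores each gap with a VByte code. It is not SIMD-accelerated and is not taken from any of the listed libraries; the function names are hypothetical, and the byte layout (low-order 7-bit group first, high bit set when more bytes follow) is one common convention that individual implementations may not share.

```python
def vbyte_encode(values):
    """Encode non-negative integers, 7 payload bits per byte."""
    out = bytearray()
    for v in values:
        if v < 0:
            raise ValueError("VByte encodes non-negative integers only")
        while v >= 0x80:
            out.append((v & 0x7F) | 0x80)  # high bit set: more bytes follow
            v >>= 7
        out.append(v)  # final byte of this value has the high bit clear
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    if shift:
        raise ValueError("truncated VByte stream")
    return values


def compress_postings(doc_ids):
    """Delta-encode a non-decreasing list of IDs, then VByte the gaps."""
    gaps, prev = [], 0
    for d in doc_ids:
        if d < prev:
            raise ValueError("doc_ids must be sorted in non-decreasing order")
        gaps.append(d - prev)
        prev = d
    return vbyte_encode(gaps)


def decompress_postings(data):
    """Rebuild the original IDs by prefix-summing the decoded gaps."""
    doc_ids, total = [], 0
    for gap in vbyte_decode(data):
        total += gap
        doc_ids.append(total)
    return doc_ids


if __name__ == "__main__":
    postings = [3, 7, 130, 131, 100000]
    packed = compress_postings(postings)
    print(len(packed), decompress_postings(packed) == postings)
```

Small gaps between neighbouring IDs fit in a single byte, which is why delta encoding is usually applied before the byte-level code.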

License: MIT License