mengjiahao / OpenMLDB

OpenMLDB is an open-source database that is designed and optimized to enable data integrity & efficiency for machine learning driven applications. In addition to 10x faster ML application landing experience, OpenMLDB provides unified computing & storage engines to reduce the complexity and cost of development and operation.


What is OpenMLDB

OpenMLDB is an open-source database designed and optimized to ensure data correctness and efficiency for machine-learning-driven applications. Besides making it 10x faster to land ML applications, OpenMLDB provides unified computing and storage engines that reduce the complexity and cost of development and operation.

Who Uses OpenMLDB

The OpenMLDB project originated from several enterprise AI data products (RTiDB, SparkFE and FeDB) at 4Paradigm. OpenMLDB is now used in production to serve machine learning scenarios in many leading companies, with more than 120 deployed industry use cases, including content recommender systems, ads CTR prediction, AIOps, anti-money laundering, anti-fraud detection, and intelligent marketing.

Features

  • Consistency

    OpenMLDB ensures consistency between online and offline processing. Data scientists can use OpenMLDB for feature extraction, which avoids crossing data. Online and offline computation are consistent because both are compiled to the same LLVM IR. To ensure storage consistency, OpenMLDB synchronizes data between offline and online storage. Users do not need to manage multiple data sources for online and offline, which avoids inconsistencies in features or data.

  • High Performance

    OpenMLDB implements a native SQL compiler with C++ and LLVM. It contains tens of optimization passes for physical plans and expressions. It can generate optimized binary code for different hardware and optimize the memory layout for feature storage. Feature storage and cost can be 9x lower than in similar databases. Real-time execution can be 9x faster, and batch processing can be 6x faster.

  • High Availability

    OpenMLDB supports distributed, massively parallel processing and database storage. It supports automatic failover to avoid a single point of failure.

  • ANSI SQL Support

    OpenMLDB provides a user-friendly SQL interface that is compatible with most of ANSI SQL and extends the syntax for AI scenarios. Taking time-series features as an example, OpenMLDB supports not only the Over Window syntax but also new syntax for sliding windows with an instance table and for real-time window aggregation that includes the current row's data.

  • AI Optimization

    OpenMLDB is designed to optimize AI scenarios. For storage, we designed efficient data structures for storing features, which achieve better space utilization and performance than similar products. For computation, we provide the usual table join methods and UDFs/UDAFs for most machine learning scenarios.

  • Easy To Use

    OpenMLDB is as easy to use as any standalone database. Both data scientists and application developers can use SQL to develop machine learning applications that involve massively parallel processing and real-time feature extraction. With this database, AI applications can be landed easily and at low cost.
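The consistency and window-aggregation ideas in the feature list can be illustrated with a minimal pure-Python sketch. It does not use OpenMLDB or its SQL syntax; the row layout (`key`, `ts`, `amount`), the 60-second window, and the function names are all assumptions made for illustration. The point is that a single feature definition, here a sum over a time window that includes the current row, is reused by both an offline batch pass and an online single-request pass, so the two cannot drift apart.

```python
# Illustrative only: plain Python, not the OpenMLDB API.
WINDOW_SECONDS = 60  # hypothetical window length


def window_sum(history, current, window=WINDOW_SECONDS):
    """Sum `amount` for rows sharing the current row's key whose timestamp
    falls within `window` seconds before the current row, plus the current
    row itself."""
    lower = current["ts"] - window
    total = current["amount"]  # the window includes the current row
    for row in history:
        if row["key"] == current["key"] and lower <= row["ts"] <= current["ts"]:
            total += row["amount"]
    return total


def batch_features(rows):
    """Offline mode: compute the feature for every row of a table."""
    features = []
    seen = []
    for row in sorted(rows, key=lambda r: r["ts"]):
        features.append(window_sum(seen, row))
        seen.append(row)
    return features


def request_feature(stored_rows, request_row):
    """Online mode: compute the feature for one incoming request row."""
    return window_sum(stored_rows, request_row)


rows = [
    {"key": "a", "ts": 0, "amount": 1},
    {"key": "a", "ts": 30, "amount": 2},
    {"key": "b", "ts": 40, "amount": 5},
    {"key": "a", "ts": 80, "amount": 4},
]

offline = batch_features(rows)
online = request_feature(rows[:3], rows[3])
print(offline)                 # [1, 3, 5, 6]
print(online == offline[-1])   # True
```

Because both modes call the same `window_sum`, the value served online for the last row matches the value computed offline for training, which is the property the shared LLVM IR is described as providing.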

Performance

Compared with mainstream databases, OpenMLDB achieves better performance across different data sizes and levels of computational complexity.

Online Benchmark

Compared with the popular Spark computation framework, using OpenMLDB for batch data processing can achieve better performance and lower TCO, especially with its optimization for skewed window data.

Offline Benchmark

QuickStart

Taking Predict Taxi Tour Duration as an example, we can use OpenMLDB to develop and deploy ML applications easily.

# Start docker image
docker run -it 4pdosc/openmldb:0.1.0 bash
 
# Initialize the environment
sh init.sh
 
# Import the data to OpenMLDB
python3 import.py
 
# Run feature extraction and model training
python3 train.py ./fe.sql /tmp/model.txt
 
# Start the HTTP service for inference with OpenMLDB
sh start_predict_server.sh ./fe.sql 8887 /tmp/model.txt
 
# Run inference with HTTP request
python3 predict.py
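For readers who want to see roughly what the final step involves, here is a hedged sketch of building an HTTP inference request with the Python standard library. Only port 8887 comes from the commands above. The `/predict` path, the JSON body, and the `passenger_count` field are hypothetical placeholders; the real endpoint and payload are whatever `predict.py` and `start_predict_server.sh` in the image define.

```python
import json
import urllib.request


def build_predict_request(features, host="127.0.0.1", port=8887, path="/predict"):
    """Build (but do not send) a JSON POST request for the prediction server.

    `path` and the layout of `features` are assumptions for illustration;
    check predict.py in the image for the actual request format.
    """
    body = json.dumps(features).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}{path}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical feature row; field name and value are placeholders.
req = build_predict_request({"passenger_count": 1})
print(req.get_method(), req.full_url)

# To actually send it once the server from the previous step is running:
# with urllib.request.urlopen(req, timeout=5) as resp:
#     print(resp.read().decode("utf-8"))
```

The request is only constructed here, so the snippet runs without a server; uncomment the last lines to send it.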

Architecture

Status and Roadmap

Status of Project

  • SQL compiler and optimizer[Complete]
    • Support ANSI SQL compiler[Complete]
    • Support optimizing physical plans and expressions[Complete]
    • Support code generation for functions[Complete]
  • Front-end programming interfaces[In Progress]
    • Support JDBC protocol[Complete]
    • Support C++ and Python SDKs[Complete]
    • Support RESTful API[In Progress]
  • Online/offline computation engine[Complete]
    • Support online database computation engine[Complete]
    • Support offline batch processing computation engine[Complete]
  • Unified storage engine[In Progress]
    • Support distributed memory storage[Complete]
    • Support synchronization for online and offline data[In Progress]

Roadmap

  • SQL Compatibility
    • Support more Window types and Where, GroupBy with complex expressions[2021H2]
    • Support more SQL syntax and UDF/UDAF functions for AI scenarios[2021H2]
  • Performance Improvement
    • Logical and physical plan optimization for batch mode and request mode data processing[2021H2]
    • High-performance, distributed execution plan generation and codegen[2021H2]
    • More classic SQL expression pass support[2022H1]
    • Integrate the optimization passes for Native LastJoin which is used in AI scenarios[2021H2]
    • Provide a new strategy of memory allocation to reduce memory fragmentation[2022H1]
  • Ecosystem Integration
    • Adapt to various row and column encoding formats and be compatible with Apache Arrow[2021H2]
    • Adapt to open-source SQL compute frameworks like FlinkSQL[2022H1]
    • Support popular programming languages, including C++, Java, Python, Go, Rust, etc.[2021H2]
    • Support PMEM-based storage engine[2022H1]
    • Support Flink/Kafka/Spark connector[2022H1]

License

Apache License 2.0
