lotabout / OpenMLDB

OpenMLDB is an open-source database that is designed and optimized to enable data integrity and efficiency for machine learning driven applications. In addition to 10x faster ML application landing experience, OpenMLDB provides unified computing and storage engines to reduce the complexity and cost of development and operation.

Home Page: https://openmldb.ai/

What is OpenMLDB

OpenMLDB is an open-source database designed and optimized for data correctness and efficiency in machine-learning-driven applications. Besides making it up to 10x faster to bring ML applications into production, OpenMLDB provides unified computing and storage engines that reduce the complexity and cost of development and operation.

Who Uses OpenMLDB

The OpenMLDB project originated from several enterprise AI data products (RTiDB, SparkFE, and FeDB) at 4Paradigm. OpenMLDB is now used in production to serve machine learning scenarios at many leading companies, with more than 120 production use cases across industries, including content recommendation, ad CTR prediction, AIOps, anti-money laundering, anti-fraud detection, and intelligent marketing.

Features

  • Consistency

    OpenMLDB keeps online and offline processing consistent. Data scientists can use OpenMLDB for feature extraction, which helps avoid data leakage. Online and offline computations are consistent because both are compiled to the same LLVM IR. To keep storage consistent, OpenMLDB synchronizes data between offline and online. Users do not need to manage separate data sources for online and offline, which avoids inconsistencies in features or data.

  • High Performance

    OpenMLDB implements a native SQL compiler in C++ and LLVM. It contains tens of optimization passes for physical plans and expressions. It can generate optimized binary code for different hardware and optimize the memory layout for feature storage. Feature storage footprint and cost can be 9x lower than in similar databases. Real-time execution can be 9x faster, and batch processing can be 6x faster.

  • High Availability

    OpenMLDB supports distributed, massively parallel processing and database storage. It supports automatic failover to avoid a single point of failure.

  • ANSI SQL Support

    OpenMLDB provides a user-friendly SQL interface that is compatible with most of ANSI SQL and extends the syntax for AI scenarios. Taking time-series features as an example, OpenMLDB supports not only the OVER window syntax but also new syntax for sliding windows over an instance table and for real-time window aggregation that includes the current row's data.

  • AI Optimization

    OpenMLDB is designed to optimize AI scenarios. For storage, we designed an efficient data structure for storing features, which achieves better space utilization and performance than similar products. For computation, we provide the usual table join methods and the UDFs/UDAFs needed for most machine learning scenarios.

  • Easy To Use

    OpenMLDB is as easy to use as any standalone database. Both data scientists and application developers can use SQL to develop machine learning applications that involve massively parallel processing and real-time feature extraction. This makes it easy to bring AI applications into production at low cost.
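
To make the OVER window idea from the ANSI SQL Support item concrete, here is a minimal sketch of a time-series window feature written in standard SQL. It runs against Python's built-in SQLite rather than OpenMLDB, so it shows only the standard OVER syntax and not OpenMLDB's extended window syntax. The table name, column names, and sample rows are hypothetical.

```python
import sqlite3

# Hypothetical trip records: (vendor_id, pickup_ts, trip_duration_s).
rows = [
    ("v1", 1, 100),
    ("v1", 2, 200),
    ("v1", 3, 300),
    ("v1", 4, 400),
    ("v2", 1, 50),
    ("v2", 2, 150),
]

# In-memory SQLite stands in for a real database (assumption: SQLite >= 3.25
# for window function support).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trips (vendor_id TEXT, pickup_ts INTEGER, trip_duration_s INTEGER)"
)
conn.executemany("INSERT INTO trips VALUES (?, ?, ?)", rows)

# For each row, sum the duration over the current row and the two preceding
# rows of the same vendor, ordered by pickup time.
query = """
SELECT vendor_id, pickup_ts,
       SUM(trip_duration_s) OVER w AS dur_sum_last3
FROM trips
WINDOW w AS (PARTITION BY vendor_id ORDER BY pickup_ts
             ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
ORDER BY vendor_id, pickup_ts
"""
features = conn.execute(query).fetchall()
for row in features:
    print(row)
```

Each output row carries a feature computed from its own value plus recent history for the same key, which is the typical shape of a time-series feature for ML.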

Performance

Compared with mainstream databases, OpenMLDB achieves better performance across different data sizes and levels of computational complexity.

Online Benchmark

Compared with popular MPP systems, OpenMLDB achieves better performance and lower TCO for batch data processing, especially with its optimization for skewed window data.

Offline Benchmark

QuickStart

Taking the Predict Taxi Tour Duration task as an example, we can use OpenMLDB to develop and deploy ML applications easily. The demo code is here.

# Start the Docker image (linux/amd64 and linux/arm64 images are supported).
# If the image pulls slowly, try one of these mirror images instead:
#  - mirror.baidubce.com/4pdosc/openmldb:0.2.1
#  - ghcr.io/4paradigm/openmldb:0.2.1
docker run -it 4pdosc/openmldb:0.2.1 bash

# Initialize the environment
sh init.sh

# Import the data into OpenMLDB
python3 import.py

# Run feature extraction and model training
python3 train.py ./fe.sql /tmp/model.txt

# Start the HTTP service for inference with OpenMLDB
sh start_predict_server.sh ./fe.sql 8887 /tmp/model.txt
 
# Run inference with HTTP request
python3 predict.py
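
Note that the same fe.sql file is passed to both the training step and the prediction server. The following is a minimal sketch of why sharing one feature definition between the offline and online paths matters. It is plain Python, not the demo's actual code: the function names, the `duration` field, and the window size are all hypothetical.

```python
from typing import Dict, List

Row = Dict[str, float]


def window_features(history: List[Row], current: Row, size: int = 3) -> Dict[str, float]:
    """Single feature definition: stats over the current row plus up to size-1 earlier rows."""
    window = (history + [current])[-size:]
    durations = [r["duration"] for r in window]
    return {"dur_sum": sum(durations), "dur_max": max(durations)}


def offline_batch(rows: List[Row]) -> List[Dict[str, float]]:
    """Batch (training) path: compute features for every historical row."""
    return [window_features(rows[:i], rows[i]) for i in range(len(rows))]


def online_request(stored: List[Row], request: Row) -> Dict[str, float]:
    """Request (serving) path: compute features for one incoming row."""
    return window_features(stored, request)


# Hypothetical data: four trips, ordered by time.
rows = [{"duration": d} for d in (100.0, 200.0, 300.0, 400.0)]

batch = offline_batch(rows)                   # features as seen at training time
served = online_request(rows[:3], rows[3])    # same row arriving as a live request

# Both paths call the same definition, so the features agree.
print(batch[-1] == served)
```

Because both paths go through one definition, a feature computed for a row during training matches the feature computed when that row arrives as a live request. Maintaining two separate implementations would require keeping them in sync by hand.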

Build

See here

Architecture

Status and Roadmap

Status of Project

  • SQL compiler and optimizer [Complete]
    • Support ANSI SQL compiler [Complete]
    • Support optimizing physical plans and expressions [Complete]
    • Support code generation for functions [Complete]
  • Front-end programming interfaces [In Progress]
    • Support JDBC protocol [Complete]
    • Support C++ and Python SDKs [Complete]
    • Support RESTful API [In Progress]
  • Online/offline computation engine [Complete]
    • Support online database computation engine [Complete]
    • Support offline batch processing computation engine [Complete]
  • Unified storage engine [In Progress]
    • Support distributed memory storage [Complete]
    • Support synchronization of online and offline data [In Progress]

Roadmap

  • SQL Compatibility
    • Support more Window types, and Where and GroupBy with complex expressions [2021H2]
    • Support more SQL syntax and UDF/UDAF functions for AI scenarios [2021H2]
  • Performance Improvement
    • Logical and physical plan optimization for batch mode and request mode data processing [2021H2]
    • High-performance, distributed execution plan generation and codegen [2021H2]
    • Support more classic SQL expression passes [2022H1]
    • Integrate the optimization passes for Native LastJoin, which is used in AI scenarios [2021H2]
    • Provide a new memory allocation strategy to reduce memory fragmentation [2022H1]
  • Ecosystem Integration
    • Adapt to various row and column encoding formats and be compatible with Apache Arrow [2021H2]
    • Adapt to open-source SQL compute frameworks such as FlinkSQL [2022H1]
    • Support popular programming languages, including C++, Java, Python, Go, and Rust [2021H2]
    • Support PMEM-based storage engine [2022H1]
    • Support Flink/Kafka/Spark connectors [2022H1]

Publication

Some Readings

In-Depth Interpretation of the Latest VLDB 2021 Paper: Artificial Intelligence Driven Real-Time Decision System Database and Optimization Based on Persistent Memory

Predictive maintenance — 5minutes demo of an end to end machine learning project

Compared to Native Spark 3.0, We Have Achieved Significant Optimization Effects in the AI Application Field

MLOp Practice: Using OpenMLDB in the Real-Time Anti-Fraud Model for the Bank’s Online Transaction

License

Apache License 2.0
