wei20024 / OpenMLDB

OpenMLDB is an open-source database particularly designed to efficiently provide consistent data for machine learning driven applications.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

build status docker pulls slack discuss codecov release license gitee maven central maven central pypi

English version | 中文版

1. Introduction

OpenMLDB is an open-source database particularly designed to efficiently provide consistent data for machine learning. A database for machine learning consists of two major tasks: feature extraction and feature access, which are served as data provisioning for offline training and online inference. Without OpenMLDB, there are two separate systems for online and offline data provisioning, which cost significant effort to verify the online-offline consistency. On the contrary, OpenMLDB supports the unified SQL programming and its execution engine for both online and offline data provisioning. As a result, the online-offline consistency is inherently guaranteed. Moreover, the system is carefully designed and optimized to ensure the efficiency. By taking advantages of OpenMLDB, database engineers are now able to write SQL scripts only to efficiently provide consistent data to machine learning, and an offline model can be immediately deployed for online serving with little cost involved.

image-20211103103052252

The above figure illustrates the OpenMLDB workflow. SQL engineers first write SQL scripts for offline feature extraction, which provides data for offline model training. When the model quality is satisfied, the online feature extraction and access can be enabled immediately for online serving without additional efforts involved. Thanks to the unified SQL programming and execution engine, the online-offline consistency verification is eliminated, which is inherently guaranteed by OpenMLDB. Furthermore, certain optimization techniques (e.g., data skew optimization and in-memory indexing for offline and online feature extraction, respectively) are adopted to ensure that the performance requirement can be met for both offline training and online inference. In summary, OpenMLDB enables SQL as the only programming interface for consistent and efficient data provisioning for both offline model training and online inference serving.

2. Highlight Features

2.1. SQL Programming APIs

We believe SQL is the most suitable programming APIs for feature engineering because of its elegant design and popularity. OpenMLDB enables SQL as the programming APIs for developers for both offline and online feature extraction. Besides, we extend the capability of standard SQL and make it more powerful for feature extraction.

2.2 Online-Offline Consistency

Based on the SQL programming APIs, we design an unified execution engine for both online and offline feature extraction. As a result, the online-offline consistency is inherently guaranteed by OpenMLDB with no other cost.

2.3. Efficiency

We propose a few techniques to improve the performance for both offline and online feature extraction. As a result, our offline feature extraction can be significantly faster than existing opensource bigdata processing frameworks. Moreover, our online service can provide low latency (tens of milliseconds) to meet the performance requirement of online inference.

You can read our below section (7. Publications & Blogs) for more technical detail.

2.4 Integrated CLI

We provide a powerful integrated CLI for SQL programming, job management, online and offline deployment, and database administration. Developers who are familiar with database's CLIs should be very comfortable with our tool.

Note that, the CLI of current release 0.3.0 supports the cluster mode partially. It will be fully supported in the next release of 0.4.0

3. Build & Install

👉 Read more

4. Demo & QuickStart

Since OpenMLDB v0.3.0, we have introduced two operating modes, which are cluster mode and standalone mode. The cluster mode is suitable for large-scale datasets and real-world applications, which provides the scalability and high-availability. On the other hand, the lightweight standalone mode running on a single node is ideal for small businesses and demonstration.

We demonstrate the workflow using the cluster and standalone modes:

5. Roadmap

We list a few highlight features that we have planned in the future releases. Please join our community to understand more about our planning and discuss your ideas.

Version Est. release date Highlight features
0.4.0 End of 2021 - Full support of standalone and cluster modes in the integrated CLI
0.5.0 2022 Q1 - Monitoring APIs and tools for online serving
- Efficient queries over a fairly long period of time by window functions
- Kafka/Pulsar connector support for online data source

6. Community

You may join our community for feedback and discussion

  • Email: contact@openmldb.ai

  • Slack Workspace: You may find useful information of release notes, user support, development discussion and even more from our various Slack channels.

  • GitHub Issues and Discussions: If you are a serious developer, you are most welcome to join our discussion on GitHub. GitHub Issues are used to report bugs and collect new requirements. GitHub Discussions are mostly used by our project maintainers to publish and comment RFCs.

  • Blogs (Chinese)

  • WeChat Groups (Chinese):

    img

7. Publications & Blogs

About

OpenMLDB is an open-source database particularly designed to efficiently provide consistent data for machine learning driven applications.

License:Apache License 2.0


Languages

Language:C++ 71.9%Language:Java 15.1%Language:Python 7.6%Language:Scala 3.5%Language:Shell 0.7%Language:CMake 0.7%Language:SWIG 0.3%Language:LLVM 0.1%Language:Makefile 0.1%Language:Dockerfile 0.0%Language:JavaScript 0.0%