Venice

Derived Data Platform for Planet-Scale Workloads

Venice is a derived data storage platform, providing the following characteristics:

High throughput asynchronous ingestion from batch and streaming sources (e.g. Hadoop and Samza).
Low latency online reads via remote queries or in-process caching.
Active-active replication between regions with CRDT-based conflict resolution.
Multi-cluster support within each region with operator-driven cluster assignment.
Multi-tenancy, horizontal scalability and elasticity within each cluster.

The above makes Venice particularly suitable as the stateful component backing a Feature Store, such as Feathr. AI applications feed the output of their ML training jobs into Venice and then query the data for use during online inference workloads.

Write Path

The Venice write path can be broken down into three granularities: full dataset swap, insertion of many rows into an existing dataset, and updates of some columns of some rows. All three granularities are supported by Hadoop and Samza, thus leading to the below full matrix of supported operations:

	Hadoop	Samza
Full dataset swap	Full Push Job	Reprocessing Job
Insertion of some rows into an existing dataset	Incremental Push Job	Real-Time Job
Updates to some columns of some rows	Incremental Push Job doing Write Compute	Real-Time Job doing Write Compute

Hybrid Stores

Moreover, the three granularities of write operations can all be mixed within a single dataset. A dataset which gets full dataset swaps in addition to row insertion or row updates is called hybrid.

As part of configuring a store to be hybrid, an important concept is the rewind time, which defines how far back should recent real-time writes be rewound and applied on top of the new generation of the dataset getting swapped in.

Leveraging this mechanism, it is possible to overlay the output of a stream processing job on top of that of a batch job. If using partial updates, then it is possible to have some of the columns be updated in real-time and some in batch, and these two sets of columns can either overlap or be disjoint, as desired.

Write Compute

Write Compute includes two kinds of operations, which can be performed on the value associated with a given key:

Partial update: set the content of a field within the value.
Collection merging: add or remove entries in a set or map.

N.B.: Currently, write compute is only supported in conjunction with active-passive replication. Support for active-actice replication is under development.

Read Path

Venice supports the following read APIs:

Single get: get the value associated with a single key
Batch get: get the values associated with a set of keys
Read compute: project some fields and/or compute some function on the fields of values associated with a set of keys.

Read Compute

When using the read compute DSL, the following functions are currently supported:

Dot product: perform a dot product on the float vector stored in a given field, against another float vector provided as query param, and return the resulting scalar.
Cosine similarity: perform a cosine similarity on the float vector stored in a given field, against another float vector provided as query param, and return the resulting scalar.
Hadamard product: perform a Hadamard product on the float vector stored in a given field, against another float vector provided as query param, and return the resulting vector.
Collection count: return the number of items in the collection stored in a given field.

Client Modes

There are two main client modes for accessing Venice data:

Classical Venice: perform remote queries against Venice's distributed backend service. In this mode, read compute queries are pushed down to the backend and only the computation results are returned to the client.
Da Vinci: eagerly load some or all partitions of the dataset and perform queries against the resulting local cache. Future updates to the data continue to be streamed in and applied to the local cache.

Getting Started

Refer to the Venice quickstart to create your own Venice cluster and play around with some features like creating a data store, batch push, incremental push, and single get.

Previously Published Content

The following blog posts have previously been published about Venice:

2015: Prototyping Venice: Derived Data Platform
2017: Building Venice with Apache Helix
2017: Building Venice: A Production Software Case Study
2017: Venice Hybrid: Doing Lambda Better
2018: Venice Performance Optimization
2021: Taming memory fragmentation in Venice with Jemalloc
2022: Supporting large fanout use cases at scale in Venice
2022: Open Sourcing Venice – LinkedIn’s Derived Data Platform

The following talks have been given about Venice:

2018: Venice with Apache Kafka & Samza
2019: People You May Know: Fast Recommendations over Massive Data
2019: Enabling next generation models for PYMK Scale
2022: Open Sourcing Venice

Keep in mind that older content reflects an earlier phase of the project and may not be entirely correct anymore.

Community Resources

Feel free to engage with the community using our:

cdbartholomew / venice