tshauck / lance

An open-source tool for blazing fast exploration and analysis of visual data using SQL and Python, currently supports DuckDB

Home Page:https://eto-ai.github.io/lance/

Lance

Blazing fast exploration and analysis of machine learning visual data using SQL

SELECT predict('resnet', image) FROM dataset

Lance makes machine learning workflows with visual data (images, videos, point clouds, audio, and more) easy by allowing developers, analysts, and operations teams to:

  • Use arbitrary ML functions in SQL for common use cases such as similarity search using embeddings, model inference, and computing evaluation metrics like F1, IoU, and more.

  • [Coming soon] Visualize, slice, and drill into visual datasets to inspect embeddings, labels/annotations, metrics, and more.

  • [Coming soon] Version, compare and diff visual datasets easily.
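As an illustration of the first point, here is a minimal sketch of calling a Python evaluation metric from SQL. It uses Python's built-in sqlite3 module as a stand-in for DuckDB so that it runs without extra dependencies; it is not Lance's API. The `detections` table, its column names, and the `iou` function are hypothetical.

```python
import sqlite3


def iou(ax1, ay1, ax2, ay2, bx1, by1, bx2, by2):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL under the name "iou" (8 arguments).
conn.create_function("iou", 8, iou)

# Hypothetical table: one predicted box and one ground-truth box per row.
conn.execute(
    "CREATE TABLE detections ("
    "image_id TEXT, px1 REAL, py1 REAL, px2 REAL, py2 REAL, "
    "gx1 REAL, gy1 REAL, gx2 REAL, gy2 REAL)"
)
conn.executemany(
    "INSERT INTO detections VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("a.jpg", 0, 0, 10, 10, 0, 0, 10, 10),    # identical boxes
        ("b.jpg", 0, 0, 10, 10, 5, 0, 15, 10),    # partial overlap
        ("c.jpg", 0, 0, 10, 10, 20, 20, 30, 30),  # no overlap
    ],
)

# The metric is evaluated per row, inside the SQL query.
rows = conn.execute(
    "SELECT image_id, iou(px1, py1, px2, py2, gx1, gy1, gx2, gy2) AS iou "
    "FROM detections ORDER BY image_id"
).fetchall()
```

The same pattern (register a function once, then call it per row in a query) is what makes queries such as `SELECT predict('resnet', image) FROM dataset` possible.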

Lance is powered by Lance Format, an Apache Arrow-compatible columnar data format that is an alternative to Parquet, Iceberg and Delta. For visual data use cases, Lance queries are 50-100x faster in preliminary benchmarks (see Benchmarks).

Lance currently supports DuckDB.

Quick Start

pip install pylance

Thanks to its Apache Arrow-first APIs, Lance can be used as a native Arrow extension. For example, users can analyze a Lance dataset directly with DuckDB through DuckDB's Arrow integration.

# pip install pylance duckdb
import lance
import duckdb

# Compute the label distribution of the Oxford Pet dataset
ds = lance.dataset("s3://eto-public/datasets/oxford_pet/oxford_pet.lance")
# DuckDB resolves the table name "ds" to the local Python variable of the same name
duckdb.query('select class, count(1) from ds group by 1').to_arrow_table()

Important directories

| Directory        | Description            |
|------------------|------------------------|
| cpp              | Core Lance Format      |
| python           | Python SDK (Pylance)   |
| notebooks        | Jupyter Notebooks      |
| duckdb extension | Lance DuckDB extension |

What makes Lance different

Here we will highlight a few aspects of Lance’s design. For more details, see the full Lance design document.

Encodings: to achieve both fast columnar scan and sub-linear point queries, Lance uses custom encodings and layouts.
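To illustrate the trade-off, here is a toy layout for a variable-length binary column. It is not Lance's actual encoding, which this document does not describe; the class and method names are hypothetical. All values are concatenated into one buffer for sequential scans, and an offsets array lets a point lookup find any value without scanning the values before it.

```python
from array import array


class OffsetEncodedColumn:
    """Toy variable-length binary column: one data buffer plus an offsets array."""

    def __init__(self, values):
        # offsets[i] and offsets[i + 1] bound value i inside the data buffer.
        self.offsets = array("Q", [0])
        chunks = []
        for v in values:
            chunks.append(v)
            self.offsets.append(self.offsets[-1] + len(v))
        self.data = b"".join(chunks)

    def __len__(self):
        return len(self.offsets) - 1

    def scan(self):
        """Sequential scan: yield every value in storage order."""
        for i in range(len(self)):
            yield self.data[self.offsets[i]:self.offsets[i + 1]]

    def take(self, indices):
        """Point lookups: each value is located from two offsets, with no scan."""
        return [self.data[self.offsets[i]:self.offsets[i + 1]] for i in indices]


col = OffsetEncodedColumn([b"cat.jpg-bytes", b"", b"dog.jpg-bytes", b"bird"])
picked = col.take([2, 0])
```

In this sketch a lookup costs the same regardless of how many rows precede it, which is the property that matters for batched random access during training.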

Nested fields: Lance stores each subfield as a separate column to support efficient filters like “find images where detected objects include cats”.
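A minimal sketch of the idea, assuming records are plain Python dicts: each nested subfield is split into its own column keyed by a dotted path, so a filter on object labels never reads the other subfields. The `shred` helper and the record layout are hypothetical and are not part of the Lance API.

```python
def shred(records):
    """Split nested dict records into flat columns keyed by dotted path.

    A list of dicts becomes one list-valued column per subfield.
    """
    columns = {}
    for row, record in enumerate(records):
        for path, value in _leaves(record):
            columns.setdefault(path, [None] * len(records))[row] = value
    return columns


def _leaves(value, prefix=""):
    if isinstance(value, dict):
        for key, child in value.items():
            yield from _leaves(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list) and value and all(isinstance(v, dict) for v in value):
        # A list of structs becomes parallel lists, one per subfield.
        keys = sorted({k for item in value for k in item})
        for key in keys:
            yield f"{prefix}.{key}", [item.get(key) for item in value]
    else:
        yield prefix, value


records = [
    {"image": "1.jpg", "objects": [{"label": "cat", "score": 0.9},
                                   {"label": "dog", "score": 0.8}]},
    {"image": "2.jpg", "objects": [{"label": "dog", "score": 0.7}]},
    {"image": "3.jpg", "objects": [{"label": "cat", "score": 0.6}]},
]
columns = shred(records)

# The filter touches only the "objects.label" column; scores are never read.
cat_rows = [i for i, labels in enumerate(columns["objects.label"])
            if labels and "cat" in labels]
cat_images = [columns["image"][i] for i in cat_rows]
```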

Versioning / updates (ROADMAP): a Manifest can be used to record snapshots. Updates are supported via write-ahead logs.
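Since this is a roadmap item, the following is only a sketch of the general snapshot idea, not Lance's design: each manifest is an immutable record of the data files that make up one version, and an update adds a new manifest rather than modifying an old one. Write-ahead logs are not covered. All class, method, and file names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Manifest:
    """Immutable snapshot: a version number and the data files it contains."""
    version: int
    files: tuple


@dataclass
class VersionedDataset:
    manifests: list = field(default_factory=lambda: [Manifest(0, ())])

    def append(self, *new_files):
        """Record a new snapshot that adds files; older snapshots stay readable."""
        latest = self.manifests[-1]
        manifest = Manifest(latest.version + 1, latest.files + tuple(new_files))
        self.manifests.append(manifest)
        return manifest

    def checkout(self, version):
        """Return the file list exactly as it was at the given version."""
        return self.manifests[version].files

    def diff(self, old, new):
        """Files added between two versions."""
        return sorted(set(self.checkout(new)) - set(self.checkout(old)))


ds = VersionedDataset()
ds.append("part-0.lance")
ds.append("part-1.lance", "part-2.lance")
added = ds.diff(1, 2)
```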

Secondary Indices (ROADMAP):

  • Vector index for similarity search over embedding space
  • Inverted index for fuzzy search over many label / annotation fields
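To make the two index types concrete, here is a small sketch of what each one answers. It is not Lance's implementation: the inverted index below supports exact label matches only (no fuzzy matching), and the similarity search is an exact linear scan, which is the baseline a vector index would speed up. The function names and row layout are hypothetical.

```python
import math
from collections import defaultdict


def build_inverted_index(rows, field):
    """Map each label to the ascending list of row ids that contain it."""
    index = defaultdict(list)
    for row_id, row in enumerate(rows):
        for label in set(row[field]):
            index[label].append(row_id)
    return dict(index)


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, embeddings, k):
    """Exact nearest neighbours by cosine similarity (linear scan)."""
    scored = sorted(
        enumerate(embeddings), key=lambda item: cosine(query, item[1]), reverse=True
    )
    return [row_id for row_id, _ in scored[:k]]


rows = [
    {"labels": ["cat", "sofa"], "embedding": [1.0, 0.0]},
    {"labels": ["dog"], "embedding": [0.0, 1.0]},
    {"labels": ["cat", "dog"], "embedding": [0.9, 0.1]},
]

label_index = build_inverted_index(rows, "labels")
cat_rows = label_index["cat"]  # rows whose labels include "cat"
nearest = top_k([1.0, 0.0], [r["embedding"] for r in rows], k=2)
```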

Benchmarks

We created a Lance dataset from the Oxford Pet dataset to run preliminary performance tests comparing Lance with Parquet and with the raw image / XML files. For analytics queries, Lance is 50-100x faster than reading the raw metadata. For batched random access, Lance is 100x faster than both Parquet and raw files.

Why are you building yet another data format?!

The machine learning development cycle involves the following steps:

graph LR
    A[Collection] --> B[Exploration];
    B --> C[Analytics];
    C --> D[Feature Engineering];
    D --> E[Training];
    E --> F[Evaluation];
    F --> C;
    E --> G[Deployment];
    G --> H[Monitoring];
    H --> A;

People use different data representations at different stages, either for performance or because of the limits of the available tooling. Academia mainly uses XML / JSON for annotations and zipped images / sensor data for deep learning, which are difficult to integrate into data infrastructure and slow to train on over cloud storage. Industry uses data lakes (Parquet-based techniques, e.g., Delta Lake, Iceberg) or data warehouses (AWS Redshift or Google BigQuery) to collect and analyze data, but then has to convert the data into training-friendly formats such as Rikai/Petastorm or TFRecord. Multiple single-purpose data transforms, as well as syncing copies between cloud storage and local training instances, have become common practice among ML practitioners.

While each existing data format excels at the workload it was originally designed for, we need a new data format tailored to the multistage ML development cycle to reduce the friction between tools and the number of data silos.

A comparison of different data formats at each stage of the ML development cycle:

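|                     | Lance | Parquet & ORC | JSON & XML | TFRecord | Database | Warehouse |
|---------------------|-------|---------------|------------|----------|----------|-----------|
| Analytics           | Fast  | Fast          | Slow       | Slow     | Decent   | Fast      |
| Feature Engineering | Fast  | Fast          | Decent     | Slow     | Decent   | Good      |
| Training            | Fast  | Decent        | Slow       | Fast     | N/A      | N/A       |
| Exploration         | Fast  | Slow          | Fast       | Slow     | Fast     | Decent    |
| Infra Support       | Rich  | Rich          | Decent     | Limited  | Rich     | Rich      |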

Presentations and Talks

License:Apache License 2.0


Languages

C++ 66.1%, Python 28.9%, CMake 2.9%, Cython 1.6%, Shell 0.4%, Dockerfile 0.1%, Makefile 0.0%