Labelbox / labelspark

This library makes it easy to take unstructured data in your Data Lake and prepare it for analysis and AI work in Databricks. The Labelbox Connector for Apache Spark takes in a Spark DataFrame to create a dataset in Labelbox, and brings labeled, structured data back into Databricks, also as a Spark DataFrame.

The Official Labelbox <> Databricks Python Integration

Labelbox enables teams to maximize the value of their unstructured data with its enterprise-grade training data platform. For ML use cases, Labelbox has tools to deploy labelers to annotate data at massive scale, diagnose model performance to prioritize labeling, and plug in existing ML models to speed up labeling. For non-ML use cases, Labelbox has a powerful catalog with auto-computed similarity scores that users can leverage to label large amounts of data with a couple of clicks.

This library was designed to run in a Databricks environment, although it will function in any Spark environment with some modification.

We strongly encourage collaboration. Please feel free to fork this repo and tweak the code base to work for your own data, and make pull requests if you have suggestions on how to enhance the overall experience, add new features, or improve general performance.

Please report any issues or bugs via GitHub Issues.

Table of Contents

Requirements

Setup

Set up LabelSpark with the following lines of code:

%pip install labelspark -q
import labelspark as ls

api_key = "" # Insert your Labelbox API key here
client = ls.Client(api_key)
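
To avoid pasting the API key directly into a notebook cell, one option is to read it from an environment variable. The sketch below is only an illustration: the helper `load_api_key` and the variable name `LABELBOX_API_KEY` are assumptions made for this example and are not part of LabelSpark.

```python
import os


def load_api_key(env_var: str = "LABELBOX_API_KEY") -> str:
    """Return the API key stored in `env_var`, or raise if it is missing or blank.

    Both this helper and the default variable name are hypothetical;
    LabelSpark itself only needs the key string passed to ls.Client().
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Labelbox API key"
        )
    return key


# Usage in a notebook, after `import labelspark as ls`:
# client = ls.Client(load_api_key())
```

Failing early with a clear message keeps an empty key from surfacing later as an authentication error.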

Once set up, you can run the following core functions:

  • client.create_data_rows_from_table() : Creates Labelbox data rows (and metadata) given a Spark Table DataFrame

  • client.export_to_table() : Exports labels (and metadata) from a given Labelbox project and creates a Spark DataFrame
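
The order in which these two functions are typically used can be sketched without Spark or an API key by substituting a stand-in client. Everything below other than the two method names is an assumption: `StubClient`, `import_rows`, `export_labels`, and the keyword names `table` and `project_id` are placeholders for illustration, so take the real arguments from the example notebooks.

```python
from typing import Any, Dict, List, Tuple


class StubClient:
    """Stand-in for ls.Client so this sketch runs without Spark or an API key.

    It only records which method was called and with which keyword arguments.
    """

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Dict[str, Any]]] = []

    def create_data_rows_from_table(self, **kwargs: Any) -> str:
        self.calls.append(("create_data_rows_from_table", kwargs))
        return "upload-result-placeholder"

    def export_to_table(self, **kwargs: Any) -> str:
        self.calls.append(("export_to_table", kwargs))
        return "labels-table-placeholder"


def import_rows(client: Any, **upload_args: Any) -> Any:
    """Step 1: create Labelbox data rows (and metadata) from a Spark table."""
    return client.create_data_rows_from_table(**upload_args)


def export_labels(client: Any, **export_args: Any) -> Any:
    """Step 2, once labeling is done in Labelbox: export labels as a Spark DataFrame."""
    return client.export_to_table(**export_args)


client = StubClient()
# `table` and `project_id` are placeholder keyword names, not confirmed parameters.
import_rows(client, table="my_spark_table")
export_labels(client, project_id="my_project_id")
print([name for name, _ in client.calls])
```

With a real `ls.Client`, the same two helpers would forward whatever keyword arguments the library expects, with labeling in Labelbox happening between the two steps.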

Example Notebooks

Importing Data

Notebook Github
Basics: Data Rows from URLs Github
Data Rows with Metadata Github
Data Rows with Attachments Github
Data Rows with Annotations Github
Putting it all Together Github

Exporting Data

Notebook Github
Exporting Data to a Spark Table Github

While using LabelSpark, you will likely also use the Labelbox SDK (e.g. for programmatic ontology creation). These resources will help familiarize you with the Labelbox Python SDK:

License: Apache License 2.0