HazyResearch / meerkat

Creative interactive views of any dataset.

Repository from GitHub: https://github.com/HazyResearch/meerkat

Create interactive views of any dataset.

⚡️ Quickstart

pip install meerkat-ml

Next Steps. Check out our Getting Started page and our documentation to start building with Meerkat.

Why Meerkat?

Meerkat is an open-source Python library that helps users visualize, explore, and annotate any dataset. It is especially useful when processing unstructured data types (e.g. free text, PDFs, images, video) with machine learning models.

✏️ Features and Design Principles

Here are four principles that inform Meerkat's design.

(1) Low overhead. With four lines of Python, start interacting with any dataset.

  • Zero-copy integrations with your preferred data abstractions: Pandas, Arrow, HF Datasets, Ibis, SQL.
  • Limited data movement. With Meerkat, you interact with your data where it already lives: no uploads to external databases and no reformatting.
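
import meerkat as mk
df = mk.from_csv("paintings.csv")  # load the CSV into a Meerkat DataFrame
# Build an image column from the URLs stored in the "image_url" column.
df["image"] = mk.files(df["image_url"])
df  # in a notebook, the last expression renders the DataFrame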

(2) Diverse data types. Visualize and annotate almost any data type in Meerkat interfaces: text, images, audio, video, MRI scans, PDFs, HTML, JSON.

(3) "Intelligent" user interfaces. Meerkat makes it easy to embed machine learning models (e.g. LLMs) within user interfaces to enable intelligent functionality such as searching, grouping and autocomplete.

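# Embed the image column created above with CLIP.
df["embedding"] = mk.embed(df["image"], engine="clip")
# Search component that matches a query against the embeddings.
match = mk.gui.Match(df,
    against="embedding",
    engine="clip"
)
# Sort by the match criterion so the best matches come first.
sorted_df = mk.sort(df,
    by=match.criterion.name,
    ascending=False
)
gallery = mk.gui.Gallery(sorted_df)
# Combine the search component and the gallery in one view.
mk.gui.html.div([match, gallery])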
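
To make the search idea above concrete, here is a minimal, library-free sketch of ranking rows by how closely their embeddings match a query. It is an illustration only, not Meerkat's implementation: the helper names (`cosine_similarity`, `rank_by_match`), the `match_score` field, and the toy 3-dimensional vectors are all assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_match(rows, query_embedding, embedding_key="embedding"):
    """Return copies of rows sorted by similarity to the query, best first."""
    scored = [
        {**row, "match_score": cosine_similarity(row[embedding_key], query_embedding)}
        for row in rows
    ]
    return sorted(scored, key=lambda row: row["match_score"], reverse=True)


# Toy embeddings (hypothetical values; real CLIP vectors are much longer).
rows = [
    {"title": "sunflowers", "embedding": [0.9, 0.1, 0.0]},
    {"title": "night sky", "embedding": [0.0, 0.2, 0.9]},
    {"title": "wheat field", "embedding": [0.7, 0.6, 0.1]},
]
query = [1.0, 0.2, 0.0]  # stands in for the embedded search text

ranked = rank_by_match(rows, query)
titles = [row["title"] for row in ranked]
print(titles)
```

Sorting in descending order of the score mirrors the `ascending=False` argument in the snippet above: the closest matches appear first in the gallery.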

(4) Declarative (think: Seaborn), but also infinitely customizable and composable. Meerkat visualization components can be composed and customized to create new interfaces.

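# plot_df is assumed to be a mk.DataFrame with "umap_1" and "umap_2" columns.
plot = mk.gui.plotly.Scatter(df=plot_df, x="umap_1", y="umap_2")

# Reactive function: re-runs when the selection in the plot changes.
# Named filter_selected to avoid shadowing Python's built-in filter().
@mk.gui.reactive
def filter_selected(selected: list, df: mk.DataFrame):
    return df[df.primary_key.isin(selected)]

filtered_df = filter_selected(plot.selected, plot_df)
table = mk.gui.Table(filtered_df, classes="h-full")

# Lay out the plot and the table side by side.
mk.gui.html.flex([plot, table], classes="h-[600px]")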
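
The reactive pattern used above (a function whose output is recomputed when its inputs change) can be sketched in plain Python. This is a toy illustration under stated assumptions, not how Meerkat implements `mk.gui.reactive`: the `Store` class, the `reactive` decorator, and `filter_rows` are hypothetical names defined only for this example.

```python
class Store:
    """A value that notifies subscribers whenever it is reassigned."""

    def __init__(self, value):
        self._value = value
        self._subscribers = []

    def get(self):
        return self._value

    def set(self, value):
        self._value = value
        for callback in list(self._subscribers):
            callback()

    def subscribe(self, callback):
        self._subscribers.append(callback)


def reactive(fn):
    """Wrap fn so its result is a Store recomputed when a Store argument changes."""

    def wrapper(*args):
        def compute():
            # Unwrap Store arguments to their current values before calling fn.
            return fn(*[a.get() if isinstance(a, Store) else a for a in args])

        result = Store(compute())
        for arg in args:
            if isinstance(arg, Store):
                arg.subscribe(lambda: result.set(compute()))
        return result

    return wrapper


@reactive
def filter_rows(selected, rows):
    return [row for row in rows if row["id"] in selected]


rows = [{"id": 1, "label": "cat"}, {"id": 2, "label": "dog"}, {"id": 3, "label": "owl"}]
selected = Store([])  # stands in for plot.selected
filtered = filter_rows(selected, rows)

selected.set([2, 3])  # simulates the user selecting two points in the plot
labels = [row["label"] for row in filtered.get()]
print(labels)
```

The caller never re-invokes `filter_rows` by hand; updating `selected` is enough, which is the property that lets a table stay in sync with a plot selection.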

✨ Use cases where Meerkat shines

  • Exploratory analysis over unstructured data types. Demo
  • Spot-checking the behavior of large language models (e.g. GPT-3). Demo
  • Identifying systematic errors made by machine learning models. Demo
  • Rapid labeling of validation data.

🤔 Use cases where Meerkat may not be the right fit

  • Are you only working with structured data (e.g. numerical and categorical variables)? Popular data visualization libraries (e.g. Seaborn, Matplotlib) are often sufficient. If you're looking for interactivity, Plotly and Streamlit work well with structured data. Meerkat is differentiated in how it visualizes unstructured data types: long-form text, PDFs, HTML, images, video, audio...
  • Are you trying to make a straightforward demo of a machine learning model (single input/output, chatbot) and share with the world? Gradio is likely a better fit! Though, if your demo involves visualizing lots of data, you may find Meerkat useful.
  • Are you trying to manually label tens of thousands of data points? If you are looking for a data labeling tool to use with a labeling team, there are great open source labeling solutions designed for this (e.g. LabelStudio). In contrast, Meerkat is a great fit for teams/individuals without access to a large labeling workforce who are using pretrained models (e.g. GPT-3) and need to label validation data or in-context examples.

✉️ About

Meerkat is being built by Machine Learning PhD students in the Hazy Research lab at Stanford. We're excited to build for a future where models will make it easier for teams to sift and reason through large volumes of unstructured data effortlessly.

Please reach out to kgoel [at] cs [dot] stanford [dot] edu, eyuboglu [at] stanford [dot] edu, and arjundd [at] stanford [dot] edu if you would like to use Meerkat for a project or at your company, or if you have any questions.

License: Apache License 2.0


Languages

Python 69.8%, Svelte 27.3%, JavaScript 1.6%, TypeScript 0.9%, Jinja 0.1%, Makefile 0.1%, HTML 0.0%, Batchfile 0.0%, CSS 0.0%