danielgross / embedland

A collection of text embedding experiments

embedland

Theoretically this is a universe of code for playing with embeddings. In reality it contains one file. More to come, I hope.

bench.py

This file benchmarks various embeddings using the Enron email corpus. Once you install the various libraries it needs, you can run it with python bench.py. It will:

  • Download the Enron email dataset.
  • Unzip it.
  • Attempt to compute embeddings for it. OpenAI's embedder is the default; you can change that at the end of the file to T5 or some other engine.
  • Cluster the embeddings.
  • Label the clusters by sampling the subject lines from the clusters and sending them to GPT-3.
  • Show you a pretty chart, like the one you see above.
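
As a rough illustration of the clustering and labeling steps above, here is a minimal sketch. It is not the code in bench.py: the k-means routine, the farthest-point initialization, the function names (kmeans, labeling_prompt) and the toy vectors and subject lines are all assumptions made for the example. It stops at building the prompt text and does not call GPT-3.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=20):
    """Plain k-means; returns one cluster index per vector.

    Assumption for this sketch: centroids start from a deterministic
    farthest-point pick rather than a random one.
    """
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(farthest))
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each vector to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments

def labeling_prompt(subjects, assignments, cluster, sample_size=3):
    """Build a prompt from a few subject lines of one cluster (hypothetical wording)."""
    members = [s for s, a in zip(subjects, assignments) if a == cluster]
    sample = members[:sample_size]  # a real run would likely sample at random
    lines = "\n".join(f"- {s}" for s in sample)
    return f"Suggest a short topic label for emails with these subject lines:\n{lines}"

# Toy 2-D "embeddings" and made-up subject lines, for illustration only.
vectors = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05], [5.0, 5.1], [5.1, 5.0], [4.9, 5.0]]
subjects = ["Gas prices", "Pipeline capacity", "Gas storage",
            "Lunch Friday?", "Happy hour", "Team dinner"]

assignments = kmeans(vectors, k=2)
print(labeling_prompt(subjects, assignments, cluster=0))
```

The prompt string is what would be sent to a language model; the returned label would then be attached to the cluster in the chart.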

viz.py

Visualization helper. This file helps you go from "a list of embeddings" to "something pretty to look at".
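
To make the "list of embeddings" to "something to look at" step concrete, here is a small sketch that projects vectors down to 2-D points. How viz.py actually does this is not described here; the sketch uses PCA via power iteration as a stand-in, and the names project_2d and top_component, as well as the toy data, are assumptions.

```python
import math

def top_component(rows, iterations=200):
    """Leading principal direction of already-centered rows, by power iteration."""
    dim = len(rows[0])
    vec = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        # Apply X^T X to vec without forming the matrix: X^T (X vec).
        scores = [sum(a * b for a, b in zip(row, vec)) for row in rows]
        nxt = [sum(s * row[j] for s, row in zip(scores, rows)) for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break
        vec = [x / norm for x in nxt]
    return vec

def project_2d(embeddings):
    """Return one (x, y) pair per embedding using the top two principal directions."""
    means = [sum(col) / len(embeddings) for col in zip(*embeddings)]
    rows = [[x - m for x, m in zip(row, means)] for row in embeddings]
    coords = []
    for _ in range(2):
        comp = top_component(rows)
        scores = [sum(a * b for a, b in zip(row, comp)) for row in rows]
        coords.append(scores)
        # Deflate: subtract this direction so the next pass finds the next one.
        rows = [[x - s * c for x, c in zip(row, comp)] for row, s in zip(rows, scores)]
    return list(zip(coords[0], coords[1]))

# Toy 3-D "embeddings" forming two loose groups.
embeddings = [
    [1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [1.0, 0.1, 0.1],
    [0.0, 1.0, 0.0], [0.1, 0.9, 0.0], [0.0, 1.0, 0.1],
]
points = project_2d(embeddings)
for x, y in points:
    print(f"{x:+.3f} {y:+.3f}")
```

The resulting pairs can be handed to any plotting library as scatter coordinates, colored by cluster.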

TODO:

  • Make longer texts work by chunking them and averaging the resulting embeddings.
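
One possible shape for this TODO item, as a sketch only: split the text into fixed-size chunks, embed each chunk, and take the element-wise mean. The helper names (chunk_words, embed_long_text), the word-based chunking and the toy_embed stand-in are assumptions; a real embedder has a token limit rather than a word limit, and weighting chunks by length is a reasonable variation.

```python
def chunk_words(text, max_words):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed_long_text(text, embed, max_words=200):
    """Embed each chunk with the supplied embed callable and average the vectors."""
    chunks = chunk_words(text, max_words) or [""]  # keep one chunk for empty input
    vectors = [embed(chunk) for chunk in chunks]
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def toy_embed(text):
    """Stand-in for a real embedder: [word count, character count]."""
    return [float(len(text.split())), float(len(text))]

result = embed_long_text("one two three four five six", toy_embed, max_words=2)
print(result)
```

Passing the embedder in as a callable keeps the chunking logic independent of whether OpenAI, T5 or another engine produces the vectors.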
