wjbmattingly / spacy-annoy

A package for doing semantic search with spaCy docs.

Spacy Annoy

SpacyAnnoy is a Python class that combines spaCy's natural language processing with Annoy's approximate nearest-neighbor search, so that large text corpora can be indexed and queried by semantic similarity.

Features

  • Text Processing with spaCy: Uses spaCy's NLP pipeline to process text.
  • Efficient Similarity Search: Uses Annoy (Approximate Nearest Neighbors Oh Yeah) to find similar text chunks quickly.
  • Contextual Window Chunking: Splits text into chunks based on sentence context, so matches are made on context rather than isolated sentences.
  • Original Context Preservation: Keeps references to the original document spans, so every result still exposes its original spaCy properties.
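
The contextual window chunking idea can be sketched in plain Python. This is a minimal illustration and not the package's implementation: the function name `window_chunks`, the `window` parameter, and the use of plain strings in place of spaCy sentence spans are all assumptions made for the example.

```python
def window_chunks(sentences, window=1):
    """Group each sentence with up to `window` neighbours on either side.

    Hypothetical helper for illustration only; the chunking logic in
    spacy-annoy itself may differ.
    """
    chunks = []
    for i in range(len(sentences)):
        start = max(0, i - window)                  # clamp at the first sentence
        end = min(len(sentences), i + window + 1)   # clamp at the last sentence
        chunks.append(" ".join(sentences[start:end]))
    return chunks


sentences = ["Cats purr.", "Dogs bark.", "Birds sing.", "Fish swim."]
chunks = window_chunks(sentences, window=1)
for chunk in chunks:
    print(chunk)
```

Each chunk overlaps its neighbours, so a query that matches one sentence can still be returned together with the sentences around it.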

Installation

Make sure Python is installed on your machine, then install the package with pip:

pip install spacy-annoy

spaCy models are installed separately from the spaCy library. The examples below use en_core_web_sm; if it is not already available in your environment, it can be downloaded with python -m spacy download en_core_web_sm.

Usage

Initialization

from SpacyAnnoy import SpacyAnnoy

# Initialize with the name of an installed spaCy model
sa = SpacyAnnoy("en_core_web_sm")

Loading and Processing Documents

texts = ["Your text data.", "Another document."]
sa.load_docs(texts)

Building the Index

sa.build_index(n_trees=10, metric="euclidean")
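
The metric argument decides how the distance between two vectors is measured, and in Annoy a larger n_trees generally gives more accurate results at the cost of a larger index. The sketch below computes two of Annoy's metrics by hand on toy 2-D vectors. The helper names are invented for illustration, and it assumes that build_index passes metric through to Annoy unchanged, which the source does not state. Annoy documents its angular distance as sqrt(2(1 - cos(u, v))).

```python
import math


def euclidean(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def angular(u, v):
    """Annoy-style angular distance: sqrt(2 * (1 - cosine similarity))."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = dot / (norm_u * norm_v)
    cos = max(-1.0, min(1.0, cos))  # guard against floating-point drift
    return math.sqrt(2.0 * (1.0 - cos))


u = [1.0, 0.0]
v = [3.0, 0.0]  # same direction as u, different length
w = [0.0, 1.0]  # orthogonal to u

print(euclidean(u, v), angular(u, v))
print(euclidean(u, w), angular(u, w))
```

Euclidean distance is sensitive to vector length, while angular distance only looks at direction, so the two metrics can rank the same chunks differently.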

Querying

# Query the index
results = sa.query_index("Query text", depth=5)

# Pretty print results
sa.pretty_print(results)
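
As a point of comparison, the exact search that an approximate index stands in for can be written as a brute-force scan. The sketch assumes that depth is the number of neighbours to return, which the example above suggests but the source does not state. The function `exact_query` and the toy vectors are hypothetical stand-ins for the embeddings spaCy would produce.

```python
import heapq
import math


def exact_query(query_vec, item_vecs, depth=5):
    """Return (index, distance) pairs for the `depth` closest vectors, nearest first."""
    distances = ((i, math.dist(query_vec, vec)) for i, vec in enumerate(item_vecs))
    return heapq.nsmallest(depth, distances, key=lambda pair: pair[1])


item_vecs = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.5, 0.0]]
hits = exact_query([0.0, 0.1], item_vecs, depth=2)
print(hits)
```

A brute-force scan compares the query against every vector, which becomes slow on large corpora. Annoy trades a small amount of accuracy for speed, so its results may occasionally differ from this exact ranking.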

Accessing Results

# Access the spaCy span of the first result
first_result_span = results[0][0]

About

License: MIT License


Languages

Python 58.5%, Jupyter Notebook 41.5%