nemhods / ingest

Helper script to import documents into elasticsearch

Overview

ingest is a simple helper script that loads documents into an Elasticsearch cluster.

Quick How To. You need:

  • Python 3 with the elasticsearch module installed
  • access to an Elasticsearch cluster via its plain HTTP API (no special authentication is supported currently)
  • a set of documents (filenames, URLs, or raw Python strings)
  • a function that extracts features from each of these documents

ingest then runs the extractor function over the documents using multiprocessing and automatically forwards the extracted features to Elasticsearch.
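
To make the idea concrete, here is a minimal sketch of the pattern, not the actual code inside ingest. The names extract_features, extract_and_forward, forward and pool_factory are hypothetical; forward stands in for whatever call sends one feature dict to Elasticsearch.

```python
from multiprocessing import Pool


def extract_features(document):
    """Hypothetical extractor: one document in, one feature dict out."""
    return {"length": len(document), "has_hello": "Hello" in document}


def extract_and_forward(documents, extractor, forward, processes=2, pool_factory=Pool):
    """Run `extractor` over `documents` in a worker pool and pass each
    resulting feature dict to `forward` in the parent process.

    `forward` is a stand-in for the Elasticsearch indexing call (assumption).
    `pool_factory` defaults to multiprocessing.Pool; any class with the same
    interface (for example multiprocessing.pool.ThreadPool) also works.
    Returns the number of feature dicts forwarded.
    """
    forwarded = 0
    with pool_factory(processes=processes) as pool:
        # imap_unordered yields each result as soon as a worker finishes it,
        # so slow documents do not hold up the forwarding of fast ones.
        for features in pool.imap_unordered(extractor, documents):
            forward(features)
            forwarded += 1
    return forwarded
```

With the default process pool, the extractor has to be a top-level function so it can be pickled, and the call should sit under an `if __name__ == "__main__":` guard, just like the usage example in this README.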

Possible improvements:

  • make the forwarding step modular, so that features can be sent to other databases or even written to files.
  • the multiprocessing code has not been tuned for performance yet.
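
One way the modular forwarding could look is sketched below. This is only an illustration of the idea and not part of ingest: the classes JsonLinesSink and ListSink, the write method, and the function forward_all are all hypothetical names. Each sink only needs a write(doctype, features) method, so an Elasticsearch sink, a file sink, or another database could be swapped in freely.

```python
import io
import json


class JsonLinesSink:
    """Hypothetical sink: writes each feature dict as one JSON line."""

    def __init__(self, stream):
        self.stream = stream  # any object with a write() method, e.g. an open file

    def write(self, doctype, features):
        record = {"doctype": doctype, "features": features}
        self.stream.write(json.dumps(record, sort_keys=True) + "\n")


class ListSink:
    """Hypothetical in-memory sink, useful for debugging and tests."""

    def __init__(self):
        self.records = []

    def write(self, doctype, features):
        self.records.append((doctype, features))


def forward_all(sinks, doctype, features_iter):
    """Send every feature dict to every sink; return how many were forwarded."""
    count = 0
    for features in features_iter:
        for sink in sinks:
            sink.write(doctype, features)
        count += 1
    return count


if __name__ == "__main__":
    buffer = io.StringIO()
    forward_all([JsonLinesSink(buffer)], "my_doctype", [{"version": 3}])
    print(buffer.getvalue(), end="")
```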

Usage

from ingest import Ingester

def example_feature_extractor(document):
    features = dict() # will have fields that correspond to my_doctype
    features['version'] = 3
    if "Hello" in document:
        features['another_field'] = True
    return features

if __name__ == "__main__":
    # always re-create testindex, good for debugging as you'll start with a fresh index
    i = Ingester('http://elastic_host:9200', 'testindex', delete_index=True)

    # Define which fields your doctype will have.
    # These are the features that you want to extract from the documents.
    # see https://www.elastic.co/guide/en/elasticsearch/guide/current/mapping-intro.html
    # Defining a doctype is optional: without one, elasticsearch will guess the field types.
    # Note: 'string' with 'not_analyzed' is the pre-5.0 mapping syntax;
    # Elasticsearch 5.0 and later use {'type': 'keyword'} for the same purpose.
    my_doctype = dict()
    my_doctype['version'] = {'type': 'string', 'index': 'not_analyzed'}
    my_doctype['another_field'] = {'type': 'boolean'}
    i.create_doctype("my_doctype", my_doctype)

    # this could be a list of filenames, urls, or even raw data.
    # Elements of this list will be passed as the only argument to your extractor
    document_list = ["Document_1","Document_2","Document_3"]

    # start ingesting
    i.ingest_documents("my_doctype", document_list, example_feature_extractor)
    print("it's not blocking :)")
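
As noted in the comments above, the elements of document_list can also be filenames. Below is a minimal sketch of an extractor for that case. The function name file_feature_extractor and the fields filename, size_bytes and sha256 are made up for illustration; they would need matching entries in your doctype, or Elasticsearch will guess their types.

```python
import hashlib
import os


def file_feature_extractor(path):
    """Hypothetical extractor: `path` is one element of document_list."""
    # Read the file as bytes so that binary documents work as well.
    with open(path, "rb") as handle:
        data = handle.read()

    features = {}
    features["filename"] = os.path.basename(path)
    features["size_bytes"] = len(data)
    features["sha256"] = hashlib.sha256(data).hexdigest()
    # Same check as in example_feature_extractor, applied to raw bytes.
    if b"Hello" in data:
        features["another_field"] = True
    return features
```

It can be passed to ingest_documents in place of example_feature_extractor, with document_list holding the paths to process.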

About

License: MIT License


Languages

Language: Python 100.0%