RediSearch / redisearch-py

RediSearch python client

Home Page: https://redisearch.io


Can't find any instruction on batch indexing

rednafi opened this issue · comments

Currently, it seems like the only way to add a document is one at a time, via the following command:

client.add_document('doc1', title='RediSearch', body='Redisearch implements a search engine on top of redis')

The above command adds documents one by one. However, I'd like to insert data in batches. There is no instruction anywhere in the docs on how to do that using the Python API. There is a BatchIndexer class in the client.py file, but how do I utilize it? Thanks!

Hello @rednafi

Documentation can always be improved (I'd be the first to admit that), but where it's lacking you can usually look at the test scripts (i.e. https://github.com/RediSearch/redisearch-py/blob/master/test/test.py#L19) for hints... A PR to the docs is always welcome btw ;)

Thank you for your quick response. Checking!!

For reference, the linked test is an implementation showing how the BatchIndexer class can be used to perform batch indexing.

Solved it like this.

Notice the _build_index() method to see how the index instance that supports batch insertion is created. Here, chunk_size=25000 means that data will be committed automatically after each batch of 25,000 documents has been inserted. This index instance is then used in the insert_data() method to insert the data batch by batch, with a final commit() to flush the last partial batch.
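Under the hood, a batch indexer just buffers documents and flushes every chunk_size additions. Before the full class below, here is a minimal pure-Python sketch of that commit pattern (no Redis needed; the function name is my own):

```python
def batched(docs, chunk_size):
    """Yield successive chunks of `docs`, mimicking how a batch
    indexer groups documents before each automatic commit."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == chunk_size:
            yield batch  # a real indexer would commit here
            batch = []
    if batch:
        yield batch  # final partial chunk, flushed by index.commit()

chunks = list(batched(range(10), 4))
print(chunks)  # → [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

This is why the explicit commit() at the end of insert_data() matters: without it, any documents in the last, not-yet-full batch would never reach the index.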

# Imports assumed by this snippet:
from redis import DataError, ResponseError
from redisearch import NumericField, TextField
from tqdm import tqdm


class IndexData:
    def __init__(self, client):
        self.client = client
        self.file_path = "./index-data/area.csv"
        self.fields = (
            NumericField("index"),
            NumericField("areaId"),
            TextField("areaTitle"),
            TextField("areaBody"),
        )

    def _build_index(self):
        """Build the index schema and return a batch indexer."""

        try:
            print("Building index....")
            self.client.create_index(self.fields)
            # Commit automatically after every 25000 added documents.
            index = self.client.batch_indexer(chunk_size=25000)
            return index

        except ResponseError:
            print("Index already exists. Dropping and rebuilding...")
            self.client.drop_index()
            return self._build_index()

    def insert_data(self):
        index = self._build_index()
        # prepare_data() is a helper that loads the CSV into a DataFrame.
        df = prepare_data(self.file_path)
        for row in tqdm(df.iterrows()):
            doc = row[1].to_dict()
            # Each row becomes a dict, e.g.:
            # {"index": 1, "areaId": 2, "areaTitle": "motijheel",
            #  "areaBody": "Motijheel, dhaka 1209"}

            try:
                index.add_document(doc["index"], **doc)
            except ResponseError:
                raise
            except DataError:
                print("Badly formatted data")
                raise

        # Flush the final partial batch.
        index.commit()
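As a side note, the row[1].to_dict() call is what turns each DataFrame row into the keyword arguments passed to add_document. A small standalone illustration of that step (the column values here are made up):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "index": [0, 1],
        "areaId": [2, 3],
        "areaTitle": ["motijheel", "gulshan"],
        "areaBody": ["Motijheel, dhaka 1209", "Gulshan, dhaka"],
    }
)

# iterrows() yields (label, Series) pairs; to_dict() maps
# column names to that row's values.
docs = [row.to_dict() for _, row in df.iterrows()]
print(docs[0]["areaTitle"])  # → motijheel
```

Each resulting dict can then be splatted straight into the indexer, e.g. index.add_document(doc["index"], **doc).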