Can't find any instruction on batch indexing
rednafi opened this issue · comments
Currently, it seems like the only way of adding a document is via the following command:
client.add_document('doc1', title = 'RediSearch', body = 'Redisearch implements a search engine on top of redis')
The above command adds documents one at a time, but I'd like to insert data in batches. There are no instructions (nowhere in the docs, at least) on how to do that using the Python API. There is a BatchIndexer class in the client.py file, but how do I use it? Thanks!
Hello @rednafi
Documentation can always be improved upon (I'd be the first to admit that), but in the case of it lacking you can usually look at the test scripts (i.e. https://github.com/RediSearch/redisearch-py/blob/master/test/test.py#L19) for hints... A PR to the docs is always welcome btw ;)
Thank you for your quick response. Checking!!
For reference, this is a test implementation of how the BatchIndexer class can be used to perform batch indexing.
Line 39 in d7558a6
Solved it like this. Notice the _build_index() method to see how the index instance that supports batch insertion is created. Here, chunk_size=25000 means that data is committed after each batch of 25000 documents has been inserted. This index instance is later used in the insert_data() method to insert the data batch by batch.
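from redis.exceptions import DataError, ResponseError
from redisearch import NumericField, TextField
from tqdm import tqdm


class IndexData:
    def __init__(self, client):
        self.client = client
        self.file_path = "./index-data/area.csv"
        self.fields = (
            NumericField("index"),
            NumericField("areaId"),
            TextField("areaTitle"),
            TextField("areaBody"),
        )

    def _build_index(self):
        """Build the index schema and return a batch indexer."""
        try:
            print("Building index....")
            self.client.create_index(self.fields)
            # Buffered indexer that commits every chunk_size documents.
            index = self.client.batch_indexer(chunk_size=25000)
            return index
        except ResponseError:
            # The index already exists: drop it and rebuild from scratch.
            print("Index already exists. Proceeding...")
            self.client.drop_index()
            return self._build_index()

    def insert_data(self):
        index = self._build_index()
        # prepare_data() is a helper defined elsewhere (not shown here);
        # it presumably loads the CSV into a pandas DataFrame.
        df = prepare_data(self.file_path)
        for row in tqdm(df.iterrows()):
            doc = row[1].to_dict()
            # Example doc:
            # {
            #     "areaId": "2",
            #     "areaTitle": "motijheel",
            #     "areaBody": "Motijheel, dhaka 1209",
            # }
            try:
                index.add_document(doc["index"], **doc)
            except ResponseError:
                raise
            except DataError:
                print("Badly formatted data")
                raise
        # Flush whatever is left in the final, partially filled batch.
        index.commit()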
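To make the role of chunk_size and the final commit() a bit more concrete, here is a small sketch that runs without a Redis server. FakeBatchIndexer and index_rows are hypothetical names made up for illustration; the class only imitates the buffering behaviour described above (commit once chunk_size documents have accumulated) and is not the actual redisearch-py implementation.

```python
class FakeBatchIndexer:
    """Hypothetical stand-in that imitates chunked commits (not the real class)."""

    def __init__(self, chunk_size=100):
        self.chunk_size = chunk_size
        self.pending = []    # doc ids buffered since the last commit
        self.committed = []  # one list of doc ids per commit

    def add_document(self, doc_id, **fields):
        # Buffer the document and commit as soon as the chunk is full.
        self.pending.append(doc_id)
        if len(self.pending) >= self.chunk_size:
            self.commit()

    def commit(self):
        # Flush the buffered documents as one batch, if there are any.
        if self.pending:
            self.committed.append(self.pending)
            self.pending = []


def index_rows(indexer, rows, id_field="index"):
    """Add every row through the indexer, then flush the last partial batch."""
    for row in rows:
        indexer.add_document(row[id_field], **row)
    indexer.commit()
    return indexer


rows = [{"index": i, "areaId": i, "areaTitle": f"area {i}"} for i in range(7)]
indexer = index_rows(FakeBatchIndexer(chunk_size=3), rows)
batch_sizes = [len(batch) for batch in indexer.committed]
print(batch_sizes)  # [3, 3, 1]
```

With 7 rows and chunk_size=3, two full batches are committed automatically and the last document only goes through because of the explicit commit() at the end. That is why the trailing index.commit() call matters whenever the number of documents is not an exact multiple of chunk_size.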