[Question] how to save an updatable index to disk?
gcalabria opened this issue · comments
Hi everyone 👋🏽
I am trying to create a system where users can add new documents to the collection on the fly. I would therefore like to update my index instead of re-indexing the whole collection every time. I've seen that two indexes can be combined by adding them together:
index1 = pt.IndexFactory.of("./index1")
index2 = pt.IndexFactory.of("./index2")
comb_index = index1 + index2
br = pt.BatchRetrieve(comb_index)
However, as far as I know, the combined object exists only in memory. Is there a way of saving it onto the disk?
Thanks in advance 😄
Hi,
I have just seen:
modules/batch-indexers/src/main/java/org/terrier/structures/indexing/DiskIndexWriter.java
So the following Python should work:
writer = pt.autoclass("org.terrier.structures.indexing.DiskIndexWriter")("/path/to/dir", "data")
writer.write(comb_index)
That looks great! I will check whether it works and come back here to confirm it. Thanks a lot for your help :)
So, I made some progress.
After adding these lines to my code, I first got the error: Cannot document-wise merge indices with and without positions (False vs True). I've fixed this by instantiating my new index with the argument blocks=True.
Here is my code:
import pyterrier as pt
import pandas as pd
if not pt.started():
    pt.init()
documents = [
    {'text': 'Creates a Function that returns a Function or returns the value of the given property .'},
    {'text': 'Returns the URL of the occupants .'},
    {'text': 'Exit the timer with the default values .'},
]
# Create new index from pandas dataframe
pd_indexer = pt.DFIndexer('/path/to/temp_dir', blocks=True)
df = pd.DataFrame(documents)
df['docno'] = df.index.astype(str)
indexref = pd_indexer.index(df['text'], df['docno'])
new_index = pt.IndexFactory.of(indexref)
# Load old index
old_index = pt.IndexFactory.of("/path/to/data.properties")
# Merge indexes
comb_index = new_index + old_index
# Instantiate writer object and write merged index to disk
writer = pt.autoclass("org.terrier.structures.indexing.DiskIndexWriter")(
    str(
        "/path/to/comb_index_dir"  # created at runtime
    ),
    "data",
)
writer.write(comb_index)
Now, I am getting a new error message:
---------------------------------------------------------------------------
JavaException Traceback (most recent call last)
Cell In[9], line 1
----> 1 writer.write(comb_index)
File jnius/jnius_export_class.pxi:878, in jnius.JavaMethod.__call__()
File jnius/jnius_export_class.pxi:955, in jnius.JavaMethod.call_method()
File jnius/jnius_utils.pxi:79, in jnius.check_exception()
JavaException: JVM exception occurred: java.lang.NullPointerException
Any ideas about what is going on?
Hi. Firstly, you can see the underlying exception by using the following construct:
from jnius import JavaException
try:
    writer2.write(jindex2)
except JavaException as je:
    print('\n\t'.join(je.stacktrace))
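For anyone who wants to reuse that construct, here is a minimal sketch that wraps it in two small helpers. The names `format_java_stacktrace` and `call_and_report` are hypothetical, not part of PyTerrier or pyjnius. The sketch assumes only that the exception may carry a `stacktrace` attribute holding a list of strings, as pyjnius' `JavaException` does, and it catches `Exception` broadly so the snippet does not need to import `jnius` itself.

```python
def format_java_stacktrace(exc):
    """Return the exception message followed by its Java stack frames, one per line."""
    # pyjnius' JavaException exposes `stacktrace` (a list of strings, or None).
    # Exceptions without that attribute are rendered as their message only.
    frames = getattr(exc, "stacktrace", None) or []
    return "\n\t".join([str(exc), *frames])


def call_and_report(fn, *args):
    """Call fn(*args); if it fails, print the formatted stack trace and re-raise."""
    try:
        return fn(*args)
    except Exception as exc:  # broad on purpose: avoids importing jnius here
        print(format_java_stacktrace(exc))
        raise
```

With the objects from the earlier code, this would be called as `call_and_report(writer.write, comb_index)`.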
I figured out the problem here is that MultiIndex does not expose structure input streams.
You can get what you want using the StructureMerger:
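# Create an empty on-disk index to receive the merged structures
dest = pt.autoclass("org.terrier.structures.IndexOnDisk").createNewIndex("/tmp/", "data3")
# src1 and src2 are the two source indexes, cast to IndexOnDisk
src1 = pt.cast("org.terrier.structures.IndexOnDisk", src1)
src2 = pt.cast("org.terrier.structures.IndexOnDisk", src2)
merger = pt.autoclass("org.terrier.structures.merging.StructureMerger")(src1, src2, dest)
# Run the merge; constructing the merger alone does not write anything
merger.mergeStructures()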
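Putting the pieces of this thread together, an end-to-end version of the StructureMerger approach might look like the sketch below. It is untested against any particular PyTerrier release. The function names, the `data` default prefix and the existence check are illustrative assumptions. The `mergeStructures()` call is the method on Terrier's `StructureMerger` that performs the merge. PyTerrier is imported lazily inside the function because it needs a working Java installation.

```python
import os


def index_properties_path(index_dir, prefix="data"):
    """Path of the <prefix>.properties file that identifies an on-disk Terrier index."""
    return os.path.join(index_dir, prefix + ".properties")


def merge_indexes_on_disk(src1_path, src2_path, dest_dir, prefix="data"):
    """Merge two on-disk Terrier indexes into a new on-disk index (sketch)."""
    import pyterrier as pt  # lazy import: requires Java and PyTerrier

    if not pt.started():
        pt.init()

    # Assumption: refuse to write over an existing index with the same prefix.
    if os.path.exists(index_properties_path(dest_dir, prefix)):
        raise FileExistsError("an index already exists at " + dest_dir)
    os.makedirs(dest_dir, exist_ok=True)

    on_disk = "org.terrier.structures.IndexOnDisk"
    src1 = pt.cast(on_disk, pt.IndexFactory.of(src1_path))
    src2 = pt.cast(on_disk, pt.IndexFactory.of(src2_path))
    dest = pt.autoclass(on_disk).createNewIndex(dest_dir, prefix)

    merger = pt.autoclass("org.terrier.structures.merging.StructureMerger")(src1, src2, dest)
    merger.mergeStructures()  # performs the actual merge into dest

    # Release file handles before the merged index is reopened elsewhere.
    for index in (src1, src2, dest):
        index.close()
    return index_properties_path(dest_dir, prefix)
```

The returned path can then be passed to `pt.IndexFactory.of(...)` to load the merged index. If both source indexes were built with `blocks=True`, Terrier also ships a `BlockStructureMerger` in the same package; whether it is needed to preserve positions here is worth checking before relying on this sketch.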
Longer term, we can try to create some examples that use/expose UpdatableIndices in PyTerrier, but it's not a primary use case for PyTerrier.
Thank you very much for your help 😄 I will try this
I added a test case for #390 (comment), which now passes. Hence closing this issue. Thanks for the report @g-lopes
Thank you for your contribution!