[Question] how to save an updatable index to disk?

gcalabria opened this issue a year ago

Guilherme Calábria Lopes commented a year ago

Hi everyone 👋🏽

I am trying to create a system where users can add new documents to the collection on-the-fly. Thus, I would like to update my index instead of re-indexing the whole collection every time. I've seen that two indices can be combined by adding them together:

index1 = pt.IndexFactory.of("./index1")
index2 = pt.IndexFactory.of("./index2")
comb_index = index1 + index2
br = pt.BatchRetrieve(comb_index)

However, as far as I know, the combined object exists only in memory. Is there a way of saving it onto the disk?

Thanks in advance 😄

Craig Macdonald · Answer 1 · Tue Jun 27 2023 17:08:58 GMT+0800 (China Standard Time)

Hi,

I have just seen:
modules/batch-indexers/src/main/java/org/terrier/structures/indexing/DiskIndexWriter.java

So the following Python should work:

writer = pt.autoclass("org.terrier.structures.indexing.DiskIndexWriter")("/path/to/dir", "data")
writer.write(comb_index)

Guilherme Calábria Lopes · Answer 2 · Tue Jun 27 2023 18:09:33 GMT+0800 (China Standard Time)

That looks great! I will check whether it works and come back here to confirm it. Thanks a lot for your help :)

Guilherme Calábria Lopes · Answer 3 · Tue Jun 27 2023 20:46:53 GMT+0800 (China Standard Time)

So, I made some progress.
After adding these lines to my code, I first got the error "Cannot document-wise merge indices with and without positions (False vs True)". I fixed this by creating my new index with the argument blocks=True, so that both indices record positions.

Here is my code:

import pyterrier as pt
import pandas as pd

if not pt.started():
  pt.init()

documents = [
  {'text': 'Creates a Function that returns a Function or returns the value of the given property .'},
  {'text': 'Returns the URL of the occupants .'},
  {'text': 'Exit the timer with the default values .'},
]

# Create new index from pandas dataframe
pd_indexer = pt.DFIndexer('/path/to/temp_dir', blocks=True)
df = pd.DataFrame(documents)
df['docno'] = df.index.astype(str)
indexref = pd_indexer.index(df['text'], df['docno'])
new_index = pt.IndexFactory.of(indexref)

# Load old index
old_index = pt.IndexFactory.of(
    "/path/to/data.properties"
)

# Merge indexes
comb_index = new_index + old_index

# Instantiate the writer object and write the merged index to disk
writer = pt.autoclass("org.terrier.structures.indexing.DiskIndexWriter")(
    "/path/to/comb_index_dir",  # created at runtime
    "data",
)
writer.write(comb_index)

Now, I am getting a new error message:

---------------------------------------------------------------------------
JavaException                             Traceback (most recent call last)
Cell In[9], line 1
----> 1 writer.write(comb_index)

File jnius/jnius_export_class.pxi:878, in jnius.JavaMethod.__call__()

File jnius/jnius_export_class.pxi:955, in jnius.JavaMethod.call_method()

File jnius/jnius_utils.pxi:79, in jnius.check_exception()

JavaException: JVM exception occurred: java.lang.NullPointerException

Any ideas about what is going on?

Craig Macdonald · Answer 4 · Wed Jun 28 2023 18:49:47 GMT+0800 (China Standard Time)

Hi. Firstly, you can see the underlying exception by using the following construct:

from jnius import JavaException
try:
    writer.write(comb_index)
except JavaException as je:
    # je.stacktrace holds the Java-side stack frames as strings
    print('\n\t'.join(je.stacktrace))
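
The same construct can be wrapped in a small helper so that any failing JVM call reports its Java stack trace. This is only a minimal sketch: `format_java_stacktrace` is a hypothetical name, not part of PyTerrier or pyjnius, and it assumes nothing beyond the `stacktrace` attribute used above, which it treats as possibly missing or None.

```python
def format_java_stacktrace(exc):
    """Return the exception message followed by its Java stack frames.

    Assumes `exc` may carry a `stacktrace` attribute holding a list of
    strings (as used in the snippet above); if the attribute is missing
    or None, only the message is returned.
    """
    frames = getattr(exc, "stacktrace", None) or []
    return "\n\t".join([str(exc), *frames])


# Usage (hypothetical, mirrors the construct above):
#
#   from jnius import JavaException
#   try:
#       writer.write(comb_index)
#   except JavaException as je:
#       print(format_java_stacktrace(je))
```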

I figured out the problem here is that MultiIndex does not expose structure input streams.

You can get what you want using the StructureMerger:

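# src1 and src2 are the two on-disk indices to merge (e.g. old_index and new_index above)
dest = pt.autoclass("org.terrier.structures.IndexOnDisk").createNewIndex("/tmp/", "data3")
src1 = pt.cast("org.terrier.structures.IndexOnDisk", src1)
src2 = pt.cast("org.terrier.structures.IndexOnDisk", src2)
merger = pt.autoclass("org.terrier.structures.merging.StructureMerger")(src1, src2, dest)
merger.mergeStructures()  # run the merge; the result is written to dest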
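
For reuse, the StructureMerger steps can be collected into one function. The following is a sketch under stated assumptions, not an official PyTerrier API: `merge_indices_on_disk` is a hypothetical name, `pt` is the already-initialised pyterrier module passed in by the caller, and `mergeStructures()` is assumed to be the StructureMerger method that actually performs the merge. Indices built with blocks=True may need a block-aware merger, which this sketch does not cover.

```python
import os


def merge_indices_on_disk(pt, src1, src2, dest_dir, dest_prefix="data"):
    """Merge two on-disk Terrier indices into a new index under dest_dir.

    `pt` is the initialised pyterrier module; `src1` and `src2` are index
    objects such as those returned by pt.IndexFactory.of(...).
    Returns the path of the new index's .properties file.
    """
    disk_cls = "org.terrier.structures.IndexOnDisk"
    # Create an empty destination index with the given prefix.
    dest = pt.autoclass(disk_cls).createNewIndex(dest_dir, dest_prefix)
    # StructureMerger works on IndexOnDisk objects, so cast both sources.
    merger = pt.autoclass("org.terrier.structures.merging.StructureMerger")(
        pt.cast(disk_cls, src1),
        pt.cast(disk_cls, src2),
        dest,
    )
    # Assumption: mergeStructures() runs the merge and writes into dest.
    merger.mergeStructures()
    return os.path.join(dest_dir, dest_prefix + ".properties")
```

The returned path has the same form as the "/path/to/data.properties" argument used earlier, so it can be passed to pt.IndexFactory.of to load the merged index.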

Longer term, we can try to create some examples that use/expose UpdatableIndices in PyTerrier, but it's not a primary use case for PyTerrier.

Guilherme Calábria Lopes · Answer 5 · Wed Jun 28 2023 20:47:10 GMT+0800 (China Standard Time)

Thank you very much for your help 😄 I will try this

Craig Macdonald · Answer 6 · Thu Nov 02 2023 20:42:25 GMT+0800 (China Standard Time)

I added a test case for #390 (comment), which now passes. Hence closing this issue. Thanks for the report @g-lopes

Guilherme Calábria Lopes · Answer 7 · Fri Nov 24 2023 20:16:24 GMT+0800 (China Standard Time)

Thank you for your contribution!