bartdegoede / python-searchengine


Cache the Index

NoisyWool opened this issue · comments

The index can easily be saved to disk using pickle, which significantly reduces the running time of subsequent runs.

Is this something that is within the project's scope?

I am closing this because the changes needed to make this useful are, I believe, outside the project's goals.

Pickling the index is not very practical, since the index has to be rebuilt every time the data or the code changes.

Also, the pickle file is about half the size of the uncompressed data.

For those interested, here are some details on how to cache the index using pickle.


The Index class already supports pickling, so you only need to wire in the save/load logic.

In run.py, after the index is built, pickle it to a file; I used data/index.pickle.
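A minimal sketch of the save step, assuming `index` is the in-memory Index object that run.py builds (the `save_index` helper name and path constant here are illustrative, not part of the repo):

```python
import os
import pickle

INDEX_PATH = "data/index.pickle"  # illustrative location for the cache file


def save_index(index, path=INDEX_PATH):
    """Serialize the built index to disk with pickle."""
    # Make sure the data/ directory exists before writing.
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(index, f, protocol=pickle.HIGHEST_PROTOCOL)
```

`pickle.HIGHEST_PROTOCOL` keeps the file smaller and faster to load than the default protocol on older Python versions.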

The download and index-building code then needs to be wrapped in a check for the pickle file: if it exists, load it into index; otherwise, build the index from scratch.
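Putting the two steps together, the load-or-build check might look like the sketch below. The `build_index` argument stands in for the project's existing download-and-indexing code; its name, and `load_or_build_index` itself, are illustrative:

```python
import os
import pickle

INDEX_PATH = "data/index.pickle"  # illustrative cache location


def load_or_build_index(build_index, path=INDEX_PATH):
    """Return a cached index if one exists; otherwise build and cache it."""
    if os.path.exists(path):
        # Cache hit: skip the download and indexing entirely.
        with open(path, "rb") as f:
            return pickle.load(f)

    # Cache miss: run the (slow) existing download + indexing step.
    index = build_index()
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(index, f, protocol=pickle.HIGHEST_PROTOCOL)
    return index
```

On the first run this builds and saves the index; every later run loads it from disk instead, until you delete the pickle file to force a rebuild.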