Add custom stopwordlist to indexer
Trojan13 opened this issue · comments
Question
I am trying to replicate an old study that uses the SMART stopword list. This list is not included in PyTerrier by default. Does PyTerrier support custom stopword lists? If so, how?
Idea
I know Terrier has the class Stopwords, which takes a 'filename' property. But how do I access this via PyTerrier?
I want to do indexing like:
smartStopword = pt.TerrierStopwords(filename='en.txt')
indexer = pt.index.IterDictIndexer(pt_index_path,
blocks=True,
stopwords=smartStopword)
Is this possible?
I hope this is the right place to ask. If not, I am sorry.
Yes, we wanted to support custom lists, but haven't integrated it yet. Meanwhile, you should be able to set the property on the indexer:
indxr = pt.IterDictIndexer(pt_index_path)
indxr.setProperty("stopwords.filename", "/path/to/smart-list.txt")
But be very careful when comparing transformers and indices that use different stopwords within the same Python process, as the property is global.
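For anyone starting from a Python list rather than a file, here is a minimal sketch of the workaround above. It assumes Terrier reads the stopword file as plain text with one word per line; the helper names `write_stopword_file` and `build_indexer` are made up for illustration and are not part of PyTerrier. Depending on your PyTerrier version, you may also need to call `pt.init()` before creating the indexer.

```python
from pathlib import Path


def write_stopword_file(words, path):
    """Write one lowercased stopword per line and return the absolute path.

    Assumption: the stopword file is plain text, one word per line.
    """
    # Strip whitespace, drop empty entries, de-duplicate and sort for a stable file.
    unique = sorted({w.strip().lower() for w in words if w.strip()})
    target = Path(path)
    target.write_text("\n".join(unique) + "\n", encoding="utf-8")
    return str(target.resolve())


def build_indexer(index_path, stopword_path):
    """Hypothetical helper: create an indexer that uses a custom stopword file."""
    # Imported lazily so write_stopword_file works without PyTerrier installed.
    import pyterrier as pt

    indexer = pt.IterDictIndexer(index_path)
    # Same call as in the reply above. The property is global to the process,
    # so every index built afterwards in this process sees the same list.
    indexer.setProperty("stopwords.filename", stopword_path)
    return indexer
```

Passing the absolute path returned by `write_stopword_file` avoids surprises if the working directory differs from where the file was written. Because the property is global, building each index in its own Python process is the safest way to compare different stopword lists.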
I'm working on an implementation that will expose a stopword indexing API as follows:
indexer = pt.IterDictIndexer(pt_index_path, stopwords=['a', 'an', 'the'])
The stopword list is saved inside the index's data.properties file. However, I'll need to make changes upstream in Terrier for the retrieval side.
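With an API like the one proposed above, a list kept in a file (such as the SMART list in `en.txt` from the question) would first need to be read into a Python list. A minimal sketch, assuming one stopword per line; skipping lines that start with `#` is an assumption made here for convenience, and `load_stopword_file` is a hypothetical helper, not a PyTerrier function.

```python
from pathlib import Path


def load_stopword_file(path):
    """Return the stopwords in a text file as a list, preserving file order.

    Assumptions: one word per line; blank lines and lines starting
    with '#' are skipped; duplicates are dropped.
    """
    words = []
    seen = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        word = line.strip().lower()
        if not word or word.startswith("#"):
            continue
        if word not in seen:
            seen.add(word)
            words.append(word)
    return words


# Usage with the API proposed above (commented out because it needs PyTerrier
# and a version that accepts a list for the stopwords argument):
# indexer = pt.IterDictIndexer(pt_index_path, stopwords=load_stopword_file("en.txt"))
```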