terrier-org / pyterrier

A Python framework for performing information retrieval experiments, building on http://terrier.org/

Home Page: https://pyterrier.readthedocs.io/

Add custom stopwordlist to indexer

Trojan13 opened this issue

Question

I am trying to replicate an old study that uses the SMART stopword list. This list is not included with pyterrier by default. Does pyterrier support custom stopword lists? If so, how?

Idea

I know Terrier has a Stopwords class that takes a 'filename' property. But how do I access it via pyterrier?
I want to do indexing like this:

smartStopword = pt.TerrierStopwords(filename='en.txt')
indexer = pt.index.IterDictIndexer(pt_index_path,
   blocks=True,
   stopwords=smartStopword)

Is this possible?

I hope this is the right place to ask. If not, I am sorry.

Yes, we want to support custom lists, but haven't integrated that yet. Meanwhile, you should be able to set the property on the indexer:

indxr = pt.IterDictIndexer(pt_index_path) 
indxr.setProperty("stopwords.filename", "/path/to/smart-list.txt")

But be very careful when comparing transformers and indices that use different stopword lists within the same Python process, as the property is global.
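For anyone preparing the list file itself, here is a minimal sketch of writing a stopword file that could then be passed to the `stopwords.filename` property shown above. It assumes Terrier expects a plain text file with one term per line; `write_stopword_file` is a hypothetical helper, not part of pyterrier, and the tiny term list only stands in for the real SMART list.

```python
import os
import tempfile


def write_stopword_file(terms, path):
    """Write stopwords to `path`, one lowercased term per line.

    Hypothetical helper (not a pyterrier function). Returns the absolute path.
    """
    seen = set()
    cleaned = []
    for term in terms:
        term = term.strip().lower()
        # Skip blank entries and duplicates, keeping first-seen order.
        if term and term not in seen:
            seen.add(term)
            cleaned.append(term)
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(cleaned) + "\n")
    return os.path.abspath(path)


# Tiny stand-in list; a real run would use the full SMART list instead.
stopword_path = write_stopword_file(
    ["A", "an", "the", "", "The "],
    os.path.join(tempfile.mkdtemp(), "smart-list.txt"),
)

# The returned path is what would be handed to the indexer, as in the reply above:
# indxr.setProperty("stopwords.filename", stopword_path)
```

Returning an absolute path is a deliberate choice here: it avoids any ambiguity about how a relative file name would be resolved on the Terrier side.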

I'm working on an implementation that will expose a stopword indexing API as follows:

indexer = pt.IterDictIndexer(pt_index_path, stopwords=['a', 'an', 'the'])

The stopword list is saved inside the index's data.properties file. However, I'll need to make changes upstream in Terrier for the retrieval side.
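With a list-based argument like the one proposed above, an existing stopword file (such as the SMART list) would first need to be read into a Python list. A rough sketch follows, assuming the file holds one term per line; `load_stopwords` is a hypothetical helper and the demo file is a stand-in for the real list.

```python
import os
import tempfile


def load_stopwords(path):
    """Read a one-term-per-line stopword file into a list of unique, lowercased terms.

    Hypothetical helper (not a pyterrier function).
    """
    terms = []
    seen = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            term = line.strip().lower()
            # Ignore blank lines and repeated terms.
            if term and term not in seen:
                seen.add(term)
                terms.append(term)
    return terms


# Tiny stand-in file; replace with the path to the real stopword list.
demo_path = os.path.join(tempfile.mkdtemp(), "en.txt")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("a\nan\n\nThe\nthe\n")

stopwords = load_stopwords(demo_path)

# Under the proposed API, the list would then be passed at indexing time:
# indexer = pt.IterDictIndexer(pt_index_path, stopwords=stopwords)
```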