What is the correct way of applying the TermPipeline to a set of questions?

Question

T-Almeida opened this issue 2 years ago · comments

Hi, I was trying out PyTerrier and came across the issue described in the title. I have a set of questions and I would like to apply the same text processing pipeline that was used in my index (tokenizer, stopwords, stemmer).

I could not find any issue covering this; the closest was #70, which suggests the existence of a transformer called pt.rewrite.ApplyTermPipeline, but I think it is no longer available in the most recent version. Note that I am looking for a standalone transformation that does just this, because, from what I understood, some transformations already do it internally.

After looking around the source code, I found that QueryExpansion uses something close to what I am looking for, so I borrowed that code. This is my current workaround:

import pyterrier as pt
from jnius import cast

def applyTermPipeline(index):
    indexref = pt.batchretrieve._parse_index_like(index)
    applytp = pt.autoclass("org.terrier.querying.ApplyTermPipeline")() # uses the default termpipelines... this is wrong should use the one defined in data.properties
    manager = pt.autoclass("org.terrier.querying.ManagerFactory")._from_(indexref)
    TerrierQLParser = pt.autoclass("org.terrier.querying.TerrierQLParser")()
    TerrierQLToMatchingQueryTerms = pt.autoclass("org.terrier.querying.TerrierQLToMatchingQueryTerms")()
    
    def apply_func(row):
        rq = cast("org.terrier.querying.Request", manager.newSearchRequest(row.qid, row.query))
        TerrierQLParser.process(None, rq)
        TerrierQLToMatchingQueryTerms.process(None, rq)
        applytp.process(None, rq)
        return " ".join(map(lambda term:term.getKey().toString(), rq.getMatchingQueryTerms()))
    
    return apply_func

pipe = pt.rewrite.tokenise() >> pt.apply.query(applyTermPipeline(index))
pipe(questions_dataframe)

My main question is: is there a better way to do this? (I am new here, so it is highly likely that I am missing something.)

Another question: how can I get the 'termpipelines' property defined in the 'data.properties' of my index, so that I can pass it to org.terrier.querying.ApplyTermPipeline?

Craig Macdonald · Answer 1 · Fri Nov 25 2022 18:39:44 GMT+0800 (China Standard Time)

Hi @T-Almeida, thanks for the excellent, detailed question.

By default, a BatchRetrieve instantiated from an index should retain the stemming and stopword settings of the indexing configuration. If you want to change the stemming and stopword handling, you can configure it through the controls:
pt.BatchRetrieve(index, controls={"termpipelines" : "ClassA,ClassB"})

However, I suspect you want to recover the query as used internally by Terrier. Your solution is technically correct in every detail. I think you could probably get something simpler, like this:

def tp_func():
  stops = pt.autoclass("org.terrier.terms.Stopwords")(None)
  stemmer = pt.autoclass("org.terrier.terms.PorterStemmer")(None)
  def _apply_func(row):
    words = row["query"].split(" ") # this is safe following pt.rewrite.tokenise()
    words = [stemmer.stem(w) for w in words if not stops.isStopword(w) ]
    return " ".join(words)
  return _apply_func 

pipe = pt.rewrite.tokenise() >> pt.apply.query(tp_func())

Tiago Almeida · Answer 2 · Fri Nov 25 2022 20:04:22 GMT+0800 (China Standard Time)

Thanks for the detailed answer @cmacdonald, that was exactly what I was looking for. One follow-up question: what if I want to instantiate the stemmer and stopwords based on a specific index, i.e., given an index, instantiate the stemmer and stopwords that are defined in its data.properties file? As a simple example, imagine that I have several indexes created for different collections (in English, Spanish, and French, so different stemmers were defined) and I want to apply the above function to questions (text) for each of the indexes. It would be beneficial if tp_func could get the correct stemmer and stopwords for each specific index.

The code should be something like this:

def tp_func(index):
  stops = get_index_stopwords(index) # function that returns the stopwords instance used by the index
  stemmer = get_index_stemmer(index) # function that returns the stemmer instance used by the index
  # same code as the above

I looked at the Javadoc of org.terrier.querying.IndexRef and org.terrier.structures.Index, but I wasn't able to figure out how to get the termpipelines information, which I assumed would be stored somewhere in the index (maybe my assumption is wrong). So, do you know of any simple way to get this information? If not, you can close the issue, because the above answer already suffices.

Craig Macdonald · Answer 3 · Fri Nov 25 2022 20:27:47 GMT+0800 (China Standard Time)

Briefly, it's in the "termpipelines" index property, which you can access like:

pindex = pt.cast("org.terrier.structures.PropertiesIndex", index)
classlist = pindex.getIndexProperty("termpipelines", "")  # "" is the default if the property is unset
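
For anyone landing here later, below is a rough sketch of how the comma-separated classlist string might be turned into a per-index term function. It is plain Python and does not call Terrier. The helper names (parse_termpipelines, make_term_function, rewrite_query) are hypothetical, and the "org.terrier.terms." prefix for short class names is an assumption about Terrier's usual convention, so check it against your Terrier version.

```python
from typing import Callable, List, Optional

# Assumption: short names such as "Stopwords" live in this package.
TERMS_PACKAGE = "org.terrier.terms."

# A stage maps a term to a (possibly rewritten) term, or to None to drop it.
Stage = Callable[[str], Optional[str]]


def parse_termpipelines(classlist: str) -> List[str]:
    """Split a 'termpipelines' value into fully qualified class names."""
    names = [name.strip() for name in classlist.split(",")]
    return [n if "." in n else TERMS_PACKAGE + n for n in names if n]


def make_term_function(stages: List[Stage]) -> Stage:
    """Chain stages in order; stop as soon as one of them drops the term."""
    def apply(term: str) -> Optional[str]:
        current: Optional[str] = term
        for stage in stages:
            current = stage(current)
            if current is None:
                return None
        return current
    return apply


def rewrite_query(query: str, term_function: Stage) -> str:
    """Apply term_function to each space-separated token, skipping dropped ones."""
    processed = (term_function(word) for word in query.split(" "))
    return " ".join(term for term in processed if term)
```

Each class name returned by parse_termpipelines could then be instantiated with pt.autoclass and wrapped as a stage (for example, a stopword stage that returns None when isStopword is true, and a stemmer stage that returns stem(term)), with the resulting function passed to pt.apply.query as in tp_func above. How a given class is wrapped depends on which methods it exposes, so that part is left out here.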

Tiago Almeida · Answer 4 · Fri Nov 25 2022 21:59:28 GMT+0800 (China Standard Time)

Thanks @cmacdonald, it was exactly what I needed.