deepset-ai / FARM

:house_with_garden: Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.

Home Page:https://farm.deepset.ai

Getting a ton of WARNING messages: "Currently no support in Processor for returning problematic ids"

johann-petrak opened this issue · comments

Using the latest FARM, installed with pip install farm, I am getting a large number of WARNING messages in the log:

" WARNING - farm.data_handler.processor - Currently no support in Processor for returning problematic ids"

What does this mean and is there anything I can do about it?

Hey @johann-petrak cool to see you using FARM again 😄

So this warning does not mean that there are problematic input samples per se. It means that some processors do not yet have the functionality to report them.

For the QA processor we can return problematic samples during preprocessing, but for TextClassificationProcessor and its derivatives we cannot. See https://github.com/deepset-ai/FARM/blob/master/farm/data_handler/processor.py#L675
If you want to improve FARM in this respect I see two options:

  1. [quick win] Change the message to Info level or reduce the number of times it is displayed.
  2. [correct but difficult] Implement the problematic id check for the processor you need.

Sorry, TBH what I do not understand so far is much more basic: what is actually meant by "problematic input sample", i.e. which error conditions make a sample problematic? And which error conditions can actually occur already when converting samples in the TextClassificationProcessor?
Apparently the only way these ids can bubble up is through an exception in self._sample_to_features(sample=sample)?

An exception in _sample_to_features could be a good start. In general, a problematic input sample is one that cannot be converted to a PyTorch tensor in the correct way, so the term is rather general.
We would like input processing to be stable, so catching exceptions caused by input specifics is a way forward. Of course we want to return the IDs of those problematic samples later.
For an example please have a look at how Question Answering is converted, e.g. here.
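To make the idea concrete, here is a minimal sketch of catching per-sample conversion errors and collecting the ids of the failing samples. It is not FARM code: `convert_samples`, `toy_sample_to_features` and the dict layout with `"id"` and `"text"` keys are hypothetical names chosen for illustration only.

```python
from typing import Callable, Dict, List, Set, Tuple


def convert_samples(
    samples: List[Dict[str, str]],
    sample_to_features: Callable[[Dict[str, str]], List[int]],
) -> Tuple[List[List[int]], Set[str]]:
    """Convert each sample and collect the ids of samples that fail.

    Hypothetical helper: it only illustrates the pattern of catching
    exceptions per sample instead of letting one bad input abort the run.
    """
    features: List[List[int]] = []
    problematic_ids: Set[str] = set()
    for sample in samples:
        try:
            features.append(sample_to_features(sample))
        except Exception:
            # Remember which sample failed so the caller can report it later.
            problematic_ids.add(sample["id"])
    return features, problematic_ids


def toy_sample_to_features(sample: Dict[str, str]) -> List[int]:
    """Stand-in featurizer (assumption): token lengths, rejecting empty text."""
    text = sample["text"]
    if not text.strip():
        raise ValueError("empty text cannot be converted")
    return [len(token) for token in text.split()]


samples = [
    {"id": "a", "text": "hello world"},
    {"id": "b", "text": "   "},
    {"id": "c", "text": "ok"},
]
features, problematic_ids = convert_samples(samples, toy_sample_to_features)
```

In this sketch the sample with id "b" ends up in `problematic_ids` while the other two are converted normally, so a single bad input does not stop preprocessing.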

I think the message is currently not really informative, and it is also emitted once per process. So option 1 would already improve FARM.

Hey @johann-petrak would you be interested in contributing the quick improvement I proposed as method 1?

Method 1: Change the message to Info and/or reduce the number of times it is displayed.

I think the only reasonable way to do this is to move the warning into the constructor.

Since the processor instance is pickled and replicated in many other processes, there is no practical and easy way for those processes to figure out whether any of them is the first to emit the warning.

Since the warning is really about the TextClassificationProcessor implementation, emitting it from the constructor makes sense as well, I think.
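A rough sketch of why the constructor is a convenient place: unpickling an ordinary Python object does not call `__init__` again, so a warning logged there fires once when the instance is created and not once per replicated copy. `ToyProcessor`, `CountingHandler` and the logger name below are hypothetical stand-ins, not FARM classes, and the pickle round trip only simulates what happens when an instance is shipped to worker processes.

```python
import logging
import pickle

logger = logging.getLogger("toy.processor")


class CountingHandler(logging.Handler):
    """Collects log records so we can count how often the warning fired."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


class ToyProcessor:
    """Hypothetical stand-in for a processor without problematic-id support."""

    def __init__(self):
        # Emitted once, at construction time in the parent process.
        logger.warning(
            "Currently no support in Processor for returning problematic ids"
        )

    def dataset_from_dicts(self, dicts):
        # Per-chunk work that runs in every worker; no warning here.
        return [len(d["text"]) for d in dicts]


handler = CountingHandler()
logger.addHandler(handler)

processor = ToyProcessor()

# Simulate replication to four workers via a pickle round trip.
copies = [pickle.loads(pickle.dumps(processor)) for _ in range(4)]
results = [copy.dataset_from_dicts([{"text": "abc"}]) for copy in copies]

warning_count = len(handler.records)
```

Here `warning_count` stays at one even though four copies did the work, which is the behaviour the move into the constructor is meant to achieve.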

Makes sense to put it there! Would you like to create a PR?