Getting a ton of WARNING messages: "Currently no support in Processor for returning problematic ids"

Question

johann-petrak opened this issue 3 years ago

Johann Petrak commented 3 years ago

Using the latest FARM, installed with pip install farm, I am getting a large number of WARNING messages in the log:

" WARNING - farm.data_handler.processor - Currently no support in Processor for returning problematic ids"

What does this mean and is there anything I can do about it?

Timo Moeller · Answer 1 · Fri May 07 2021 23:35:11 GMT+0800 (China Standard Time)

Hey @johann-petrak cool to see you using FARM again 😄

So this warning does not mean that there are problematic input samples per se, only that some processors have no functionality in place for reporting them.

For the QA processor we can return problematic samples during preprocessing, but for others, e.g. TextClassificationProcessor and its derivatives, we cannot. See https://github.com/deepset-ai/FARM/blob/master/farm/data_handler/processor.py#L675
If you want to improve FARM in this respect I see two options:

[quick win] Change the message to Info or reduce the number of times it is displayed.
[correct but difficult] Implement the problematic id check for the processor you need.
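
Until one of these lands, a user-side stopgap is to filter the message out with Python's standard logging module. This is only a minimal sketch: it assumes the logger name is farm.data_handler.processor, as shown in the log line quoted above, and the filter class name is made up for illustration. If worker processes are started with spawn rather than fork, the filter may need to be installed in each worker as well.

```python
import logging

# Logger name taken from the log line quoted in the question (assumption:
# the warning is emitted directly on this logger).
farm_processor_logger = logging.getLogger("farm.data_handler.processor")

NOISY_MESSAGE = "Currently no support in Processor for returning problematic ids"


class DropProblematicIdsWarning(logging.Filter):
    """Hypothetical helper: drop only the one noisy message, keep the rest."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False discards the record; True lets it through.
        return NOISY_MESSAGE not in record.getMessage()


farm_processor_logger.addFilter(DropProblematicIdsWarning())
```

Filtering on the message text, rather than raising the logger level to ERROR, keeps any other warnings from the same module visible.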

Johann Petrak · Answer 2 · Sat May 08 2021 01:08:43 GMT+0800 (China Standard Time)

Sorry, TBH what I do not understand so far is much more basic: what is actually meant by "problematic input sample", i.e. which error conditions make a sample problematic? And which error conditions could actually occur already when converting samples in the TextClassificationProcessor?
Apparently the only way these ids can bubble up is through an exception in self._sample_to_features(sample=sample)?

Timo Moeller · Answer 3 · Sat May 08 2021 22:36:08 GMT+0800 (China Standard Time)

An exception in _sample_to_features could be a good start. In general, a problematic input sample is one that cannot be converted to a PyTorch tensor in the correct way, so the term is rather general.
We would like input processing to be stable, so catching exceptions caused by input specifics is a way forward. Of course we want to return the IDs of those problematic samples later.
For an example, please have a look at how Question Answering is converted, e.g. here.
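
The idea of catching conversion errors and reporting the affected ids could look roughly like the sketch below. This is not FARM code: convert_samples, toy_sample_to_features and the "id" / "text" keys are hypothetical stand-ins for whatever the real processor uses.

```python
from typing import Any, Callable, Dict, List, Tuple

Sample = Dict[str, Any]


def convert_samples(
    samples: List[Sample],
    sample_to_features: Callable[[Sample], Any],
) -> Tuple[List[Any], List[Any]]:
    """Convert each sample; collect ids of samples whose conversion raises."""
    features: List[Any] = []
    problematic_ids: List[Any] = []
    for sample in samples:
        try:
            features.append(sample_to_features(sample))
        except Exception:
            # Keep processing stable: remember the id instead of crashing.
            problematic_ids.append(sample["id"])
    return features, problematic_ids


def toy_sample_to_features(sample: Sample) -> List[int]:
    """Hypothetical converter: fails on samples without usable text."""
    text = sample["text"]
    if not text:
        raise ValueError("empty text")
    return [len(token) for token in text.split()]


features, problematic_ids = convert_samples(
    [
        {"id": "a", "text": "hello world"},
        {"id": "b", "text": ""},
        {"id": "c", "text": "ok"},
    ],
    toy_sample_to_features,
)
```

Here the sample with id "b" ends up in problematic_ids, while the other two are converted normally.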

I think currently the message is not really informative and also pops up per process. So option 1 would already improve FARM.

Timo Moeller · Answer 4 · Thu May 20 2021 00:13:07 GMT+0800 (China Standard Time)

Hey @johann-petrak would you be interested in contributing the quick improvement I proposed as method 1?

Method 1: Change the message to Info and/or reduce the number of times it is displayed.

Johann Petrak · Answer 5 · Thu May 20 2021 00:39:07 GMT+0800 (China Standard Time)

I think the only reasonable way to do this is to move the warning into the constructor.

Since the processor instance is pickled and replicated in many other processes, there is no practical and easy way for those processes to figure out whether any of them is the first to emit the warning.

Since the warning is really about the TextClassificationProcessor implementation, emitting it from the constructor makes sense as well, I think.
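
A minimal sketch of why the constructor works as the place for the warning: by default, unpickling a plain Python object restores its state without calling __init__ again, so copies created in worker processes stay silent. ExampleProcessor, its logger name and clone_like_worker are hypothetical and only stand in for the real processor and the multiprocessing machinery.

```python
import logging
import pickle

logger = logging.getLogger("example.processor")  # hypothetical logger name


class ExampleProcessor:
    """Hypothetical stand-in for a processor that cannot report problematic ids."""

    def __init__(self) -> None:
        # Emitted once, when the processor is constructed in the parent process.
        logger.warning(
            "Currently no support in Processor for returning problematic ids"
        )

    def dataset_from_dicts(self, dicts):
        # Called many times, possibly in many processes; no warning here.
        return [d["text"].lower() for d in dicts]


def clone_like_worker(processor: ExampleProcessor) -> ExampleProcessor:
    """Simulate shipping the processor to a worker via a pickle round trip."""
    return pickle.loads(pickle.dumps(processor))
```

Constructing one ExampleProcessor and then calling dataset_from_dicts on any number of clones logs the warning a single time.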

Timo Moeller · Answer 6 · Thu May 20 2021 03:51:15 GMT+0800 (China Standard Time)

Makes sense to put it there! Would you like to create a PR?