deepset-ai / FARM

:house_with_garden: Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.

Home Page: https://farm.deepset.ai


Recommended way to evaluate a loaded model on a file?

johann-petrak opened this issue

I am training a model in one Python file/process and saving the processor and the model to the same directory:

# initialize processor for the tasks
processor.save("mymodel")
# setup and carry out training on the training file defined for the processor earlier (test file is None)
model.save("mymodel")

In a different program, I want to load that model and use it for inference and evaluation. For this I restore the model into an inferencer:

inferencer = Inferencer.load("mymodel", ....)

Now I would also like to use that Inferencer for evaluation on some data file.

I am doing:

processor = ...  # get a Processor from the Inferencer and modify it, or build a new one

# create the silo
data_silo = DataSilo(
    processor=processor,
    batch_size=BATCH_SIZE)
evaluator = Evaluator(
    data_loader=data_silo.get_data_loader("test"),
    tasks=processor.tasks,
    device=device)
result = evaluator.eval(inferencer.model, return_preds_and_labels=True)
evaluator.log_results(result, "Test", steps=len(data_silo.get_data_loader("test")))

Is this the recommended way for how to do it or is there a better way?

When I run this I get the following message very often:

To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

What am I doing wrong to cause this?

I think it's fine to load the model via the Inferencer. You could also use the AdaptiveModel.load function directly.
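As a rough, untested sketch of that alternative (the exact AdaptiveModel.load signature may differ between FARM versions; "mymodel" is the save directory from above):

```python
import torch
from farm.modeling.adaptive_model import AdaptiveModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the saved model directory directly, without the Inferencer wrapper
model = AdaptiveModel.load("mymodel", device=device)
```

This gives you the bare model to pass to an Evaluator, but you then have to build the Processor and DataSilo yourself, which the Inferencer would otherwise help with.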

Avoid using tokenizers before the fork if possible

This warning happens when fast tokenizers (which use Rust multithreading) and FARM's Python multiprocessing run at the same time. If the code runs, I would ignore the warning. Otherwise, you can set max_processes=1 in the DataSilo constructor to disable FARM multiprocessing.
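As a minimal sketch of the other route the warning message itself suggests, you can set the environment variable in Python before the first fast tokenizer is created (whether this fully silences the message may depend on the tokenizers version):

```python
import os

# Must be set before the first fast tokenizer is constructed,
# otherwise forked worker processes may still emit the warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

Setting it at the very top of the script (or in the shell before launching Python) is the safest place, since imports elsewhere may construct tokenizers early.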

OK, so if in doubt, switch off FARM MP rather than Rust MP?
(I am a bit scared of deadlocks when using this in production eventually)

Exactly, when in doubt, switch off FARM MP.

Rust multithreading is an incredible speed boost on the tokenization side, especially for large texts. FARM's multiprocessing is not really needed any more with the fast tokenizers. We haven't seen any deadlocks with the combination yet, which is why we kept both methods turned on. If you do encounter problems, we can think about disabling FARM MP by default.

Seems resolved, closing now. Feel free to reopen.