deepset-ai / FARM

:house_with_garden: Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.

Home Page: https://farm.deepset.ai


Recommended way to evaluate a loaded model on a file?

johann-petrak opened this issue

I am training a model in one Python file/process and saving the processor and the model to the same directory:

# initialize processor for the tasks
processor.save("mymodel")
# setup and carry out training on the training file defined for the processor earlier (test file is None)
model.save("mymodel")

In a different program, I want to load that model and use it for inference and evaluation. For this I restore the model into an inferencer:

inferencer = Inferencer.load("mymodel", ....)

Now I would also like to use that Inferencer for evaluation on some data file.

I am doing:

processor = ...  # get a Processor from the Inferencer and modify it, or build a new one

# create the silo
data_silo = DataSilo(
    processor=processor,
    batch_size=BATCH_SIZE)
evaluator = Evaluator(
    data_loader=data_silo.get_data_loader("test"),
    tasks=processor.tasks,
    device=device)
result = evaluator.eval(inferencer.model, return_preds_and_labels=True)
evaluator.log_results(result, "Test", steps=len(data_silo.get_data_loader("test")))

Is this the recommended way for how to do it or is there a better way?

When I run this I get the following message very often:

To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

What am I doing wrong to cause this?

I think it's fine to load the model via the Inferencer. You could also use the AdaptiveModel.load function directly.
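As a rough, untested sketch of that alternative (the exact AdaptiveModel.load signature may differ between FARM versions; "mymodel" is the save directory from above):

```python
import torch
from farm.modeling.adaptive_model import AdaptiveModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the saved model directory directly, without the Inferencer wrapper
model = AdaptiveModel.load("mymodel", device=device)
```

This gives you the bare model to pass to an Evaluator, but you then have to build the Processor and DataSilo yourself, which the Inferencer would otherwise help with.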

Avoid using tokenizers before the fork if possible

This warning happens when fast tokenizers (which use Rust multithreading) and FARM's Python multiprocessing run at the same time. If the code runs, I would ignore the warning. Otherwise, you can set max_processes=1 in the DataSilo constructor to disable FARM multiprocessing.
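As a minimal sketch of the other route the warning message itself suggests, you can set the environment variable in Python before the first fast tokenizer is created (whether this fully silences the message may depend on the tokenizers version):

```python
import os

# Must be set before the first fast tokenizer is constructed,
# otherwise forked worker processes may still emit the warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

Setting it at the very top of the script (or in the shell before launching Python) is the safest place, since imports elsewhere may construct tokenizers early.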

OK, so if in doubt, switch off FARM MP rather than Rust MP?
(I am a bit scared of deadlocks when using this in production eventually)

Exactly, when in doubt, switch off FARM MP.

Rust multithreading is an incredible speed boost on the tokenization side, especially for large texts. FARM's multiprocessing is not really needed any more with the fast tokenizers. We haven't seen any deadlocks with the combination yet, which is why we kept both methods turned on. If you do encounter problems, we can think about disabling FARM MP by default.

Seems resolved, closing now. Feel free to reopen.