CogComp / saul

Saul : Declarative Learning-Based Programming


training a classifier should overwrite the .lex

kordjamshidi opened this issue · comments

It seems that if a classifier's .lex file was created in an earlier run and still exists at the default path, retraining the classifier adds new features to that same lexicon; that is, the lexicon is not overwritten.
(We need tests for load, save and when classifiers are created from scratch. related to #411 )

@danyaljj do you have any comments on this?

Just to clarify: are you saying that training a model writes to disk (the lexicon file) before/without calling save()?

No, save is not the issue either way. The problem is that when a .lex file already exists from a previous run, train() just reuses it and appends new features to it, which leads to the lexicon exploding in size as we run the app and call train() repeatedly (in separate, independent runs).

I see. So you think we should always remove the lexicon file at the beginning of train?

I expected it to be overwritten by default; we should have a way to indicate whether we want to continue training or train from scratch. Simply removing those files at the beginning of train would be problematic when we want to initialize models from an existing .lex and .lc.

Right, I agree it's tricky.
We could prompt the user at the beginning of training:

Do you want to remove existing model files? [Y/N]

What do you think?

Sounds good to me. @Rahgooy might have comments.

I think that works for training a single model, but when we want to train multiple models, say in a loop, the user would have to wait for the first model to finish training and then enter [Y/N]. IMO, the better option is to have it as a parameter or something similar.

In fact, for joint training we already have the init parameter: here
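A parameter-based approach could look roughly like the sketch below. This is not Saul's actual API; `TrainerSketch`, `train`, `fromScratch`, and the file names are hypothetical, and it only illustrates deleting stale model files before training unless the caller asks to continue from them:

```scala
import java.nio.file.{Files, Paths}

// Hypothetical sketch (not Saul's real trainer): control via a parameter
// whether existing model files are removed before training starts.
object TrainerSketch {
  // File names are illustrative stand-ins for a classifier's .lex/.lc files.
  private val modelFiles = Seq("classifier.lex", "classifier.lc")

  def train(modelDir: String, fromScratch: Boolean = true): Unit = {
    if (fromScratch) {
      // Delete any stale lexicon/classifier files so train() writes a
      // fresh lexicon instead of appending features to the old one.
      modelFiles.foreach { name =>
        Files.deleteIfExists(Paths.get(modelDir, name))
      }
    }
    // ... proceed with training, producing new .lex/.lc files ...
  }
}
```

With `fromScratch = false` the existing files are left in place, covering the case where models should be initialized from a previously saved lexicon, and no interactive [Y/N] prompt is needed when training many models in a loop.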