OscarKjell / text

Using Transformers from HuggingFace in R

Home Page: https://r-text.org

fatal error preventing a trained model from being saved

lilchow opened this issue · comments

I tried to train a classification model to assign short passages to one of six classes based on their 1536-dimensional embeddings obtained from the OpenAI API. I invoked the textTrainRandomForest function with all of its default settings.

The training process seemed to go fine, as I saw the following output in the console:

Fold: bal_accuracy 0.846 (duration: 3.59 mins). 
Fold: bal_accuracy 0.828 (duration: 3.593 mins). 
Fold: bal_accuracy 0.856 (duration: 3.595 mins). 
Fold: bal_accuracy 0.848 (duration: 3.597 mins). 
Fold: bal_accuracy 0.856 (duration: 3.637 mins). 
Fold: bal_accuracy 0.842 (duration: 3.587 mins). 
Fold: bal_accuracy 0.841 (duration: 3.618 mins). 
Fold: bal_accuracy 0.842 (duration: 3.58 mins). 
Fold: bal_accuracy 0.846 (duration: 3.605 mins). 
Fold: bal_accuracy 0.849 (duration: 3.648 mins).

However, I got an error message at the end and no trained model was saved as a result. Here is the error message:

Error in stats::fisher.test(predy_y$truth, predy_y$estimate) : 
  FEXACT error 5.
The hash table key cannot be computed because the largest key
is larger than the largest representable int.
The algorithm cannot proceed.
Reduce the workspace, consider using 'simulate.p.value=TRUE' or another algorithm.

From what I was able to gather, there is no way for me to change the settings for fisher.test, as the call is hardcoded inside the textTrainRandomForest function. Is there any way to circumvent this problem? I need the trained model to make predictions on an unlabelled dataset.

Thank you very much!

Thanks for reporting this.
I have now added a simulate.p.value argument to both textTrainRegression and textTrainRandomForest. Please see if setting it to TRUE solves this problem.
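A minimal sketch of how this would be used, assuming the new argument is simply passed through to stats::fisher.test (the variable names x and y below are placeholders, not from the thread):

```r
library(text)

# With simulate.p.value = TRUE, the underlying stats::fisher.test()
# estimates the p-value by Monte Carlo simulation instead of the exact
# FEXACT network algorithm, which avoids the "largest key is larger
# than the largest representable int" workspace limit on large tables.
trained <- textTrainRandomForest(
  x = word_embeddings,   # placeholder: the 1536-dimensional embeddings
  y = passage_classes,   # placeholder: a factor with six levels
  simulate.p.value = TRUE
)
```

The same setting can be tried directly in base R on a contingency table of predicted vs. true classes, e.g. stats::fisher.test(tbl, simulate.p.value = TRUE, B = 2000), where B controls the number of simulation replicates.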

Thanks for the quick fix! However, I just ran into a different error when performing the same training on the same training set (i.e., assigning short passages to one of six classes based on their 1536-dimensional embeddings). The training itself seemed to go fine, with fold-wise bal_accuracy of around 0.83, but I got the following error message:

Error in `yardstick::roc_auc()`:
! The number of levels in `truth` (6) must match the number of columns supplied in `...` (1).
Backtrace:
 1. text::textTrainRandomForest(...)
 2. text:::classification_results(...)
 4. yardstick:::roc_auc.data.frame(predy_y, truth, colnames(predy_y[3]))
 5. yardstick::prob_metric_summarizer(...)
 8. yardstick (local) fn(...)
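For context: yardstick's roc_auc() requires one predicted-probability column per class when truth is multiclass, whereas the traceback above shows only a single column being passed. A minimal sketch using the hpc_cv example data bundled with yardstick (four classes, one probability column each):

```r
library(yardstick)

# hpc_cv ships with yardstick: a four-level `obs` truth column and one
# predicted-probability column per class (VF, F, M, L).
# Multiclass ROC AUC needs ALL class probability columns:
roc_auc(hpc_cv, truth = obs, VF, F, M, L)

# Supplying only one probability column for a multiclass truth, as in
# the traceback above, raises the "number of levels in `truth` (6)
# must match the number of columns supplied" error.
```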

Thanks for the feedback. This should be fixed in GitHub version 1.2.02.