Add new classifier: need of probability output?

Question

mattvan83 opened this issue 5 years ago · comments

I would like to add the LinearSVC classifier, which is based on the liblinear implementation. Does the current implementation of neuropredict require predictions to be based on probability values? I ask because LinearSVC does not support probability prediction.
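For readers unfamiliar with the limitation raised here, the following is a minimal sketch using scikit-learn directly, not neuropredict's code. It shows that LinearSVC has no `predict_proba` but does expose `decision_function`, whose signed scores are enough for a ranking metric such as ROC AUC. The synthetic data and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.svm import LinearSVC

# Synthetic two-class data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = LinearSVC(dual=False, random_state=0, max_iter=10000).fit(X, y)

# LinearSVC exposes signed distances to the hyperplane, not probabilities.
has_proba = hasattr(clf, "predict_proba")
scores = clf.decision_function(X)

# roc_auc_score only needs a score that ranks samples, so decision values work.
auc = roc_auc_score(y, scores)
print(has_proba, scores.shape, round(auc, 3))
```

If calibrated probabilities are really needed, scikit-learn's `CalibratedClassifierCV` can wrap a LinearSVC, at the cost of extra cross-validated fitting.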

Pradeep Reddy Raamana · Answer 1 · Fri Nov 08 2019 22:14:23 GMT+0800 (China Standard Time)

I've been wanting to add Linear SVM and Naive Bayes to the supported classifiers; adding this to the todo list :).

scikit-learn does provide probabilities for some classifiers - I could try to output them as well.

Thanks for your suggestions and bug reports.

mattvan83 · Answer 2 · Fri Nov 08 2019 23:14:52 GMT+0800 (China Standard Time)

I tried to add Linear SVM in the algorithms.py code, then updated config_neuropredict.py to add the new classifier. However, when launching the command line I still got this error:

usage: neuropredict [-h] [-m META_FILE] [-o OUT_DIR] [-f FS_SUBJECT_DIR]
                    [-y PYRADIGM_PATHS [PYRADIGM_PATHS ...]]
                    [-u USER_FEATURE_PATHS [USER_FEATURE_PATHS ...]]
                    [-d DATA_MATRIX_PATHS [DATA_MATRIX_PATHS ...]]
                    [-a ARFF_PATHS [ARFF_PATHS ...]] [-p POSITIVE_CLASS]
                    [-t TRAIN_PERC] [-n NUM_REP_CV]
                    [-k NUM_FEATURES_TO_SELECT]
                    [-sg [SUB_GROUPS [SUB_GROUPS ...]]]
                    [-g {none,light,exhaustive}]
                    [-is {median,mean,most_frequent,raise}]
                    [-fs {selectkbest_mutual_info_classif,selectkbest_f_classif,variancethreshold}]
                    [-e {randomforestclassifier,extratreesclassifier,decisiontreeclassifier,svm,xgboost}]
                    [-z MAKE_VIS] [-c NUM_PROCS] [--po PRINT_OPT_DIR] [-v]
neuropredict: error: argument -e/--classifier: invalid choice: 'linearsvc' (choose from 'randomforestclassifier', 'extratreesclassifier', 'decisiontreeclassifier', 'svm', 'xgboost')

I have certainly missed a place, but where?
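The "invalid choice" message above is the standard argparse error for a value that is not in an argument's `choices` list, so the missing place is probably wherever that list is built for `-e/--classifier`. The sketch below reproduces the behaviour with the standard library only. `CLASSIFIER_CHOICES`, `build_parser` and `accepts` are hypothetical names, not neuropredict's actual ones.

```python
import argparse

# Hypothetical list standing in for wherever the tool keeps its classifier names.
CLASSIFIER_CHOICES = ["randomforestclassifier", "extratreesclassifier",
                      "decisiontreeclassifier", "svm", "xgboost"]


def build_parser(choices):
    """Build a tiny parser whose -e option is restricted to `choices`."""
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("-e", "--classifier", choices=choices,
                        default=choices[0])
    return parser


def accepts(parser, name):
    """Return True if the parser accepts `name` as a classifier."""
    try:
        parser.parse_args(["-e", name])
        return True
    except SystemExit:
        # argparse reports an invalid choice by printing usage and exiting.
        return False


before = accepts(build_parser(CLASSIFIER_CHOICES), "linearsvc")
after = accepts(build_parser(CLASSIFIER_CHOICES + ["linearsvc"]), "linearsvc")
print(before, after)
```

Searching the code base for the existing names (for example `extratreesclassifier`) is a quick way to find every list that needs the new entry.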

mattvan83 · Answer 3 · Sat Nov 09 2019 00:00:23 GMT+0800 (China Standard Time)

I have got further since the previous message. I managed to launch my own neuropredict with LinearSVC, but got the following error message, apparently linked to multiprocessing:

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/homes_unix/mvanhoutte/Soft/anaconda3/envs/neuropredMV/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/homes_unix/mvanhoutte/Soft/anaconda3/envs/neuropredMV/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/netapp/vol1_homeunix/mvanhoutte/Soft/neuropredict/neuropredict/rhst.py", line 690, in holdout_trial_compare_datasets
    average='weighted')
  File "/homes_unix/mvanhoutte/Soft/anaconda3/envs/neuropredMV/lib/python3.6/site-packages/sklearn/metrics/ranking.py", line 355, in roc_auc_score
    sample_weight=sample_weight)
  File "/homes_unix/mvanhoutte/Soft/anaconda3/envs/neuropredMV/lib/python3.6/site-packages/sklearn/metrics/base.py", line 76, in _average_binary_score
    return binary_metric(y_true, y_score, sample_weight=sample_weight)
  File "/homes_unix/mvanhoutte/Soft/anaconda3/envs/neuropredMV/lib/python3.6/site-packages/sklearn/metrics/ranking.py", line 327, in _binary_roc_auc_score
    sample_weight=sample_weight)
  File "/homes_unix/mvanhoutte/Soft/anaconda3/envs/neuropredMV/lib/python3.6/site-packages/sklearn/metrics/ranking.py", line 622, in roc_curve
    y_true, y_score, pos_label=pos_label, sample_weight=sample_weight)
  File "/homes_unix/mvanhoutte/Soft/anaconda3/envs/neuropredMV/lib/python3.6/site-packages/sklearn/metrics/ranking.py", line 402, in _binary_clf_curve
    assert_all_finite(y_score)
  File "/homes_unix/mvanhoutte/Soft/anaconda3/envs/neuropredMV/lib/python3.6/site-packages/sklearn/utils/validation.py", line 72, in assert_all_finite
    _assert_all_finite(X.data if sp.issparse(X) else X, allow_nan)
  File "/homes_unix/mvanhoutte/Soft/anaconda3/envs/neuropredMV/lib/python3.6/site-packages/sklearn/utils/validation.py", line 56, in _assert_all_finite
    raise ValueError(msg_err.format(type_err, X.dtype))
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/homes_unix/mvanhoutte/Soft/anaconda3/envs/neuropredMV/bin/neuropredict", line 11, in <module>
    load_entry_point('neuropredict', 'console_scripts', 'neuropredict')()
  File "/netapp/vol1_homeunix/mvanhoutte/Soft/neuropredict/neuropredict/__main__.py", line 11, in main
    run_workflow.cli()
  File "/netapp/vol1_homeunix/mvanhoutte/Soft/neuropredict/neuropredict/run_workflow.py", line 1049, in cli
    grid_search_level, classifier, feat_select_method)
  File "/netapp/vol1_homeunix/mvanhoutte/Soft/neuropredict/neuropredict/run_workflow.py", line 1024, in prepare_and_run
    options_path=options_path)
  File "/netapp/vol1_homeunix/mvanhoutte/Soft/neuropredict/neuropredict/rhst.py", line 422, in run
    cv_results = pool.map(partial_func_holdout, range(num_repetitions))
  File "/homes_unix/mvanhoutte/Soft/anaconda3/envs/neuropredMV/lib/python3.6/multiprocessing/pool.py", line 288, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/homes_unix/mvanhoutte/Soft/anaconda3/envs/neuropredMV/lib/python3.6/multiprocessing/pool.py", line 670, in get
    raise self._value
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

The output was stuck at the parallelizing step:


Python 3.6.7
SGE recognized, job set up with 35 slots.
Running neuropredict 0.5+34.g220af55.dirty

Requested features for analysis:
get_data_matrix from /netapp/vol2_agewell/pro/IMAP/imap_mvh/CAT12/pet/Analyses/ML/All/CN_vs_MCI_vs_AD/fdg/pons/metaROI.csv
get_data_matrix from /netapp/vol2_agewell/pro/IMAP/imap_mvh/CAT12/pet/Analyses/ML/All/CN_vs_MCI_vs_AD/fdg/pons/metaROI_split.csv
get_data_matrix from /netapp/vol2_agewell/pro/IMAP/imap_mvh/CAT12/pet/Analyses/ML/All/CN_vs_MCI_vs_AD/fdg/pons/HCP_parcellation.csv
Ignoring imputation strategy chosen, as no missing data were found!

Data import is done.


Requested processing for the following subgroups:
CN,MCI
CN,AD
MCI,AD

--------------------------------------------------------------------------------
Processing subgroup : CN,MCI (1/3)
--------------------------------------------------------------------------------
SGE recognized, job set up with 35 slots.
Training percentage      : 0.8
Number of CV repetitions : 250
Classifier chosen        : linearsvc
Feature selection chosen : variancethreshold
Level of grid search     : exhaustive
Number of processors     : 35
Saving the results to 
  /netapp/vol2_agewell/pro/IMAP/imap_mvh/CAT12/pet/Analyses/ML/All/CN_vs_MCI_vs_AD/fdg/pons/linearsvc/binary/CN_MCI

-------------------------
All datasets contain:
 
86 samples, 2 classes, 2 features
Class  CN : 71 samples
Class MCI : 15 samples
-------------------------

Estimated chance accuracy : 0.500

Different classes in the training set are stratified to match the smallest class!
Parallelizing the repetitions of CV with 35 processes ...

Do you have any idea what could cause this?

mattvan83 · Answer 4 · Sat Nov 09 2019 00:41:13 GMT+0800 (China Standard Time)

The error is linked to the LinearSVC classifier from liblinear. If I use SVC(kernel='linear') from libsvm, it works!

Pradeep Reddy Raamana · Answer 5 · Sat Nov 09 2019 02:42:29 GMT+0800 (China Standard Time)

Congrats on being able to customize your own version of neuropredict. This is awesome! Great job.

It appears the error has to do with: ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Can you check your input CSV files to ensure there are no NaNs, Infs, missing values, etc.?
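One way to run the suggested check, sketched with the standard library only. It assumes every column after the header is numeric, which may not match the actual layout of the data matrix files; `find_non_finite` is a hypothetical helper. Note also that the traceback raises inside `roc_auc_score` while validating `y_score`, so clean inputs would not by themselves rule out non-finite scores produced later in the pipeline.

```python
import csv
import io
import math


def find_non_finite(csv_text, has_header=True):
    """Return (row, column, raw_value) for each cell that is empty,
    non-numeric, NaN or infinite. Rows and columns are 1-based and the
    header counts as row 1."""
    problems = []
    reader = csv.reader(io.StringIO(csv_text))
    for row_num, row in enumerate(reader, start=1):
        if has_header and row_num == 1:
            continue
        for col_num, cell in enumerate(row, start=1):
            try:
                value = float(cell)
            except ValueError:
                # Empty or non-numeric cell.
                problems.append((row_num, col_num, cell))
                continue
            if not math.isfinite(value):
                problems.append((row_num, col_num, cell))
    return problems


# Small made-up sample containing one NaN, one empty cell and one infinity.
sample = "id,f1,f2\n1,0.5,1.2\n2,nan,0.7\n3,0.1,\n4,inf,0.3\n"
print(find_non_finite(sample))
```

For a file on disk, the text can be read with `open(path).read()` and passed to the same function.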

Pradeep Reddy Raamana · Answer 6 · Sat Nov 09 2019 02:43:36 GMT+0800 (China Standard Time)

Also, can you push the changes you made to your fork, so I can take a look at it to see if there are any potential mistakes there?

mattvan83 · Answer 7 · Tue Nov 12 2019 19:02:59 GMT+0800 (China Standard Time)

I have checked my input CSV files and there are no NaNs, Infs or missing values.

I will push the changes I've made to my fork. I made the following changes to algorithms.py and config_neuropredict.py:

Added LinearSVC (liblinear or libsvm) and LogisticRegression
Added train_class_sizes as an argument of clf_builder, in order to choose whether to enable dual optimization according to the number of features relative to the number of training samples
Added a random_state parameter to each clf_builder to solve the reproducibility problem
Added these new classifiers to the defaults in the neuropredict configuration file
Added these new classifiers to the list in the feature importance function
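The dual/primal choice and the fixed seed described in the list above could look roughly like this. It is a sketch, not the code in the fork: `make_linear_svc` and its arguments are hypothetical, and the rule follows the scikit-learn recommendation to prefer `dual=False` when there are more samples than features.

```python
from sklearn.svm import LinearSVC


def make_linear_svc(n_train_samples, n_features, random_state=0):
    """Hypothetical builder: solve the primal problem when training
    samples outnumber features, the dual problem otherwise."""
    use_dual = n_train_samples <= n_features
    # A fixed random_state makes repeated runs reproducible.
    return LinearSVC(dual=use_dual, random_state=random_state, max_iter=10000)


# 30 training samples with 2 features: primal formulation.
clf_small = make_linear_svc(n_train_samples=30, n_features=2)
# 30 training samples with 500 features: dual formulation.
clf_wide = make_linear_svc(n_train_samples=30, n_features=500)
print(clf_small.dual, clf_wide.dual)
```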

You can check the introduction of LinearSVC based on the liblinear implementation, which failed, at lines 451 to 455.

The libsvm LinearSVC and LogisticRegression worked, but the feature importance plots were empty. Any suggestions?
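One possible cause, stated here as an assumption since the thread does not confirm it: tree ensembles in scikit-learn expose `feature_importances_`, while linear models expose `coef_` instead, so code that only reads the former finds nothing for linear classifiers. The helper below, `importance_vector`, is a hypothetical sketch of handling both cases.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression


def importance_vector(fitted_clf):
    """Return one non-negative importance value per feature, or None."""
    if hasattr(fitted_clf, "feature_importances_"):
        return np.asarray(fitted_clf.feature_importances_)
    if hasattr(fitted_clf, "coef_"):
        # coef_ has shape (1, n_features) for binary problems and
        # (n_classes, n_features) otherwise; average magnitudes over rows.
        return np.abs(np.atleast_2d(fitted_clf.coef_)).mean(axis=0)
    return None


# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
logreg = LogisticRegression().fit(X, y)
print(importance_vector(forest).shape, importance_vector(logreg).shape)
```

Coefficient magnitudes are only comparable across features when the features are on a similar scale.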

Pradeep Reddy Raamana · Answer 8 · Tue Dec 17 2019 01:34:43 GMT+0800 (China Standard Time)

Hi Matt,

The upcoming version #51 would solve many of the issues you identified. Thanks for the feedback, testing and usage; I appreciate it.

Happy holidays! :)

mattvan83 · Answer 9 · Tue Dec 17 2019 03:06:11 GMT+0800 (China Standard Time)

Hi Pradeep,

That is great news! Thanks for the update. Happy holidays :)

Matthieu

mattvan83 · Answer 10 · Tue Feb 11 2020 01:53:59 GMT+0800 (China Standard Time)

Hi Pradeep,

After running linear SVC with neuropredict, is there a way to use the saved .pickle files to recover the vector orthogonal to the optimal margin hyperplane (the weights associated with each feature)?

Thanks for your help.

Best regards,
Matthieu
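For a fitted scikit-learn linear SVM, the vector asked about here is the `coef_` attribute, and it survives pickling. The sketch below pickles a LinearSVC directly; how neuropredict lays out its own .pickle files is not shown in this thread, so locating the fitted classifier inside them is left as an assumption. `SVC(kernel="linear")` exposes `coef_` as well.

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
clf = LinearSVC(dual=False, random_state=0, max_iter=10000).fit(X, y)

# Round-trip through pickle, standing in for a saved results file.
restored = pickle.loads(pickle.dumps(clf))

w = restored.coef_.ravel()           # one weight per feature
b = restored.intercept_[0]           # offset of the hyperplane
unit_normal = w / np.linalg.norm(w)  # direction orthogonal to the hyperplane

# The decision value of a sample x is w . x + b.
manual = X @ w + b
print(np.allclose(manual, restored.decision_function(X)))
```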

Pradeep Reddy Raamana · Answer 11 · Tue Feb 11 2020 06:50:47 GMT+0800 (China Standard Time)

The feature importance data saved by neuropredict is very similar to that (if I understand you correctly) - take a look at the CSV output files and the PDF plot.

Pradeep Reddy Raamana · Answer 12 · Tue Jun 09 2020 10:20:40 GMT+0800 (China Standard Time)

Hi @mattvan83, if you are still working on this, give the latest version a try and let me know if your problems haven't been resolved. I'll close this for now, and let's start a new issue if that doesn't work.