Welcome to our repository for feature selection with deep tabular models! This repository contains code for our paper "A Performance-Driven Benchmark for Feature Selection in Tabular Deep Learning".
Academic tabular benchmarks often contain small sets of curated features. In contrast, data scientists typically collect as many features as possible into their datasets, and even engineer new features from existing ones. To prevent over-fitting in subsequent downstream modeling, practitioners commonly use automated feature selection methods that identify a reduced subset of informative features. Existing benchmarks for tabular feature selection consider classical downstream models, toy synthetic datasets, or do not evaluate feature selectors on the basis of downstream performance. We construct a challenging feature selection benchmark evaluated on downstream neural networks including transformers, using real datasets and multiple methods for generating extraneous features. We also propose an input-gradient-based analogue of LASSO for neural networks, called Deep Lasso, that outperforms classical feature selection methods on challenging problems such as selecting from corrupted or second-order features.
Please consider citing our work if you find our benchmark and the Deep Lasso method helpful:
@inproceedings{cherepanova2023performance,
title={A Performance-Driven Benchmark for Feature Selection in Tabular Deep Learning},
author={Cherepanova, Valeriia and Levin, Roman and Somepalli, Gowthami and Geiping, Jonas and Bruss, C Bayan and Wilson, Andrew Gordon and Goldstein, Tom and Goldblum, Micah},
booktitle={Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2023}
}
Our benchmark for feature selection builds upon the datasets from two papers:
- Revisiting Deep Learning Models for Tabular Data [1]
- On Embeddings for Numerical Features in Tabular Deep Learning [2]
Please follow the download instructions provided with [1] and [2], and put the datasets in the /data folder.
[1] Gorishniy, Y., Rubachev, I., Khrulkov, V. and Babenko, A., 2021. Revisiting deep learning models for tabular data. Advances in Neural Information Processing Systems, 34, pp.18932-18943.
[2] Gorishniy, Y., Rubachev, I. and Babenko, A., 2022. On embeddings for numerical features in tabular deep learning. Advances in Neural Information Processing Systems, 35, pp.24991-25004.
We include the environment requirements in requirements.txt.
- config contains Hydra configs for the main scripts, model parameters, training parameters, and datasets.
- deep_tabular
  - utils contains utilities for loading and preprocessing data, training and testing loops, and feature importance calculation
  - models contains implementations of deep tabular models
- launch contains examples of training jobs
- train_deep_model.py uses the config/train_model.yaml config for training a specified deep tabular model (MLP or FT-Transformer) on a specified dataset (see arguments in config/train_model.yaml)
- train_classical.py uses the config/train_model.yaml config for training a specified classical feature selection model (Random Forest, XGBoost, Linear/Logistic Regression, or Univariate Statistical test)
- tune_baseline.py uses the config/tune_config.yaml config for tuning the hyperparameters of a model on a specified dataset (see arguments in config/train_model.yaml)
- tune_full_pipeline.py uses the config/tune_full_pipeline_config.yaml config for tuning the hyperparameters of the upstream feature selection model and the downstream model simultaneously with respect to downstream performance (see arguments in config/tune_full_pipeline_config.yaml)
The benchmark construction utilities are in deep_tabular/utils/data_tools, including the get_data_locally function for reading and augmenting datasets with extraneous features.
- The add_noise argument controls the type of extraneous features and can be random_feats, corrupted_feats, or secondorder_feats.
- The noise_percent argument controls the proportion of added extraneous features.
Note: Our current implementation is tailored for numerical features only; applying it to categorical features may result in errors.
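To make the three extraneous-feature types concrete, here is a minimal NumPy sketch of how such an augmentation could look. It is not the repository's get_data_locally function: the helper name add_extraneous_features, the Gaussian noise model, and the random choice of source columns are all assumptions made for illustration. It also assumes that noise_percent is the share of extraneous columns in the final dataset, which matches the "50% of the features being extraneous" wording used for the examples in this README.

```python
import numpy as np


def add_extraneous_features(X, add_noise, noise_percent, seed=0):
    """Hypothetical helper: append extraneous numerical columns to X.

    Assumption: noise_percent is the fraction of extraneous columns in the
    augmented matrix, i.e. n_extra / (n_orig + n_extra) == noise_percent.
    """
    if not 0.0 <= noise_percent < 1.0:
        raise ValueError("noise_percent must be in [0, 1)")
    rng = np.random.default_rng(seed)
    n_samples, n_orig = X.shape
    n_extra = int(round(n_orig * noise_percent / (1.0 - noise_percent)))

    if add_noise == "random_feats":
        # Pure noise columns, unrelated to the original features.
        extra = rng.standard_normal((n_samples, n_extra))
    elif add_noise == "corrupted_feats":
        # Copies of randomly chosen original columns with added Gaussian noise
        # scaled by each column's standard deviation (assumed noise model).
        cols = rng.integers(0, n_orig, size=n_extra)
        source = X[:, cols]
        extra = source + rng.standard_normal(source.shape) * source.std(axis=0)
    elif add_noise == "secondorder_feats":
        # Products of randomly chosen pairs of original columns.
        a = rng.integers(0, n_orig, size=n_extra)
        b = rng.integers(0, n_orig, size=n_extra)
        extra = X[:, a] * X[:, b]
    else:
        raise ValueError(f"unknown add_noise value: {add_noise!r}")

    # Original columns come first, extraneous columns are appended after them.
    return np.hstack([X, extra])


X = np.random.default_rng(1).standard_normal((100, 8))
X_aug = add_extraneous_features(X, "secondorder_feats", noise_percent=0.5)
print(X_aug.shape)
```

With 8 original columns and noise_percent=0.5, the sketch appends 8 second-order columns, so half of the 16 resulting features are extraneous.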
Find the implementation of our novel Deep Lasso regularizer in deep_lasso.py, along with the feature selection functionality.
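For readers who want the idea before opening the file, the following PyTorch sketch shows one way an input-gradient-based analogue of Lasso can be written. It is an illustration under assumptions, not the contents of deep_lasso.py: it assumes a group-lasso-style penalty that sums, over features, the L2 norm (taken across the batch) of the gradient of the loss with respect to that input feature. The function name deep_lasso_penalty, the toy data, and the training loop are hypothetical; reg_weight plays the role of the hyp.reg_weight option used in the commands in this README.

```python
import torch
from torch import nn


def deep_lasso_penalty(model, loss_fn, x, y):
    """Hypothetical sketch of an input-gradient group-lasso penalty.

    Returns the task loss, the penalty, and a detached per-feature
    gradient norm that can serve as a feature importance score.
    """
    x = x.clone().requires_grad_(True)
    loss = loss_fn(model(x), y)
    # create_graph=True keeps the graph so the penalty itself can be
    # backpropagated to the model parameters.
    (grad,) = torch.autograd.grad(loss, x, create_graph=True)
    per_feature = grad.norm(p=2, dim=0)  # one norm per input feature
    return loss, per_feature.sum(), per_feature.detach()


torch.manual_seed(0)
x = torch.randn(64, 5)
y = 2.0 * x[:, :1]  # toy target: only feature 0 is informative

model = nn.Sequential(nn.Linear(5, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
reg_weight = 0.1  # assumed to correspond to hyp.reg_weight

for _ in range(200):
    optimizer.zero_grad()
    loss, penalty, importance = deep_lasso_penalty(
        model, nn.functional.mse_loss, x, y
    )
    (loss + reg_weight * penalty).backward()
    optimizer.step()

print(importance)
```

The penalty pushes the model toward small input gradients for features it does not need, so the per-feature gradient norms can then be ranked to select features.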
To train a deep tabular model, such as an MLP or FT-Transformer, on a dataset containing extraneous features, use the train_deep_model.py script. For instance, to train an MLP on the California Housing dataset with 50% of the features being extraneous second-order features, execute the following command:
python3 train_deep_model.py mode=downstream dataset=california_housing name=no_fs model=mlp hyp=hyp_for_neural_network dataset.add_noise=secondorder_feats dataset.noise_percent=0.5
Results of this job will be saved in a stats.json file, located in the directory specified in the config/train_model.yaml file.
For feature selection with classical algorithms (Lasso, XGBoost, Random Forest, etc.), use the train_classical.py script and specify mode=feature_selection. For example, to calculate feature importance using the XGBoost model on the California Housing dataset with 50% extraneous second-order features, run:
python3 train_classical.py mode=feature_selection dataset=california_housing name=fs_xgboost model=xgboost hyp=hyp_for_xgboost dataset.add_noise=secondorder_feats dataset.noise_percent=0.5
To perform feature selection with deep learning-based algorithms (such as Deep Lasso, First-Layer Lasso, or Attention Map Importance), use the train_deep_model.py script and specify mode=feature_selection. For instance, to determine feature importance using Deep Lasso with the FT-Transformer model on the California Housing dataset with 50% extraneous second-order features, execute:
python3 train_deep_model.py mode=feature_selection dataset=california_housing name=fs_deep_lasso model=ft_transformer hyp=hyp_for_neural_network dataset.add_noise=secondorder_feats dataset.noise_percent=0.5 hyp.regularization=deep_lasso hyp.reg_weight=0.1
Choose the feature selection method with:
- hyp.regularization=deep_lasso for Deep Lasso feature importance
- hyp.regularization=first_lasso for First-Layer Lasso feature importance
- model=ft_transformer_attention for Attention Map feature importance
Computed feature importances will be saved in feature_importances.pt.
To use pre-computed feature importances, specify the path with importance_path=feature_importances.pt and set the proportion of the most important features to keep in the dataset with the topk argument:
python3 train_deep_model.py mode=downstream dataset=california_housing name=ft_transformer_fs_xgboost model=ft_transformer hyp=hyp_for_neural_network dataset.add_noise=secondorder_feats dataset.noise_percent=0.5 importance_path=feature_importances.pt topk=0.5
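The effect of topk can be pictured with a small NumPy sketch: rank the features by importance and keep the top fraction. This is only an illustration. The helper name select_top_features is hypothetical, the importance values are made up, and the rounding rule and tie-breaking used by the repository may differ.

```python
import numpy as np


def select_top_features(importances, topk):
    """Hypothetical helper: indices of the top `topk` fraction of features."""
    importances = np.asarray(importances, dtype=float)
    # Assumed rounding rule: nearest integer, keeping at least one feature.
    n_keep = max(1, int(round(topk * importances.size)))
    # Sort by descending importance; a stable sort keeps ties in column order.
    order = np.argsort(-importances, kind="stable")
    # Return the kept indices in their original column order.
    return np.sort(order[:n_keep])


# Made-up importance scores for six features, for illustration only.
importances = np.array([0.05, 0.40, 0.01, 0.30, 0.02, 0.22])
keep = select_top_features(importances, topk=0.5)
print(keep)  # [1 3 5]

X = np.arange(12, dtype=float).reshape(2, 6)
X_selected = X[:, keep]  # dataset restricted to the selected columns
```

Keeping the selected indices in column order means the reduced dataset preserves the original feature ordering.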
To tune the hyperparameters of the downstream deep tabular models, use the tune_baseline.py script. For example, to tune the hyperparameters of FT-Transformer on the California Housing dataset with 50% corrupted features and no feature selection, run:
python3 tune_baseline.py mode=downstream model=ft_transformer dataset=california_housing name=tune_ft_ch hyp=hyp_for_neural_network dataset.add_noise=corrupted_feats dataset.noise_percent=0.5
This job will save the best-performing hyperparameters in best_config.json, the results for those hyperparameters in best_stats.json, and stats from all trials in all_stats.json and trials.csv.
To tune the hyperparameters of the feature selection algorithm and the downstream model simultaneously, use the tune_full_pipeline.py script. For example, to tune an MLP-based Deep Lasso feature selector together with the downstream MLP model, run:
python3 tune_full_pipeline.py model=mlp model_downstream=mlp dataset=california_housing name=tune_ft_ch_full hyp=hyp_for_neural_network hyp_downstream=hyp_for_neural_network dataset.add_noise=corrupted_feats dataset.noise_percent=0.5 hyp.regularization='deep_lasso' topk=0.5
This job will save the best performing hyperparameters of both upstream feature selection and downstream models as well as performance stats of their combination.
More examples can be found in the launch folder.
First, tune the hyperparameters of both feature selection and downstream models for each fs_method-model-dataset configuration:
python3 tune_full_pipeline.py model={FS MODEL} model_downstream={DOWNSTREAM MODEL} dataset={DATASET NAME} name={NAME OF EXPERIMENT} hyp={CONFIG FOR FS MODEL} hyp_downstream={CONFIG FOR DOWNSTREAM MODEL} dataset.add_noise={NOISE SETUP} dataset.noise_percent={% OF NOISE IN DATASET} hyp.regularization={FS REGULARIZATION} topk={% OF FEATURES TO SELECT}
For example, for XGBoost feature selection and a downstream MLP model, run:
python3 tune_full_pipeline.py model=xgboost model_downstream=mlp dataset=california_housing name=xgboost_mlp hyp=hyp_for_xgboost hyp_downstream=hyp_for_neural_network dataset.add_noise=corrupted_feats dataset.noise_percent=0.5 topk=0.5
Then run the training job with the best hyperparameters for 10 different seeds:
python3 run_full_pipeline.py --multirun model=xgboost model_downstream=mlp dataset=california_housing name=xgboost_mlp hyp=hyp_for_xgboost hyp_downstream=hyp_for_neural_network dataset.add_noise=corrupted_feats dataset.noise_percent=0.5 topk=0.5 hyp.seed=0,1,2,3,4,5,6,7,8,9
This script loads the best_config.json file and runs the feature selection and downstream models with the specified hyperparameters for 10 seeds. Results are saved in final_stats.json files, in folders corresponding to the seed number within the same directory.
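Once the per-seed runs finish, the results are usually summarized across seeds. The sketch below shows one way to do that using only the standard library. It rests on assumptions: that each final_stats.json is a flat JSON object of numeric metrics, and that the seed folders sit under a common run directory. The function name aggregate_seed_stats and the test_score key are hypothetical, and the numbers in the demo are made up.

```python
import json
import statistics
import tempfile
from pathlib import Path


def aggregate_seed_stats(run_dir, filename="final_stats.json"):
    """Hypothetical helper: mean and std of each numeric metric over seeds."""
    per_metric = {}
    for path in sorted(Path(run_dir).rglob(filename)):
        with open(path) as f:
            stats = json.load(f)
        for key, value in stats.items():
            # Keep only plain numbers; skip nested objects, strings, booleans.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                per_metric.setdefault(key, []).append(float(value))
    return {
        key: {
            "mean": statistics.fmean(values),
            "std": statistics.stdev(values) if len(values) > 1 else 0.0,
            "n": len(values),
        }
        for key, values in per_metric.items()
    }


# Demo on a temporary directory with made-up scores for three seeds.
with tempfile.TemporaryDirectory() as tmp:
    for seed, score in enumerate([0.50, 0.52, 0.48]):
        seed_dir = Path(tmp) / str(seed)
        seed_dir.mkdir()
        (seed_dir / "final_stats.json").write_text(
            json.dumps({"test_score": score})
        )
    summary = aggregate_seed_stats(tmp)

print(summary)
```

If the real files nest metrics under other keys, the inner loop would need to be adapted to that layout.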
Please find more examples in launch/feature_selection_California_Housing.sh.