koc-lab / MTSIT

Codes for: "Multivariate Time Series Imputation with Transformers"

MTSIT

This is the GitHub repository for the paper: A. Y. Yıldız, E. Koç, A. Koç, “Multivariate Time Series Imputation with Transformers”, IEEE Signal Processing Letters, 2022. The work builds on the Multivariate Time Series Transformer Framework and extends it to imputation tasks.

Datasets

The Physionet Healthcare dataset and the Beijing Air Quality dataset are used for the imputation task.

All datasets, including Healthcare and Air Quality, use the Pandas Time Series Data (ptsd) format. We preprocess the datasets following BRITS. Each dataset is first pre-processed into NumPy arrays of shape (number of samples × features × time points). The create_df function in create_df.py then saves the data with the to_pickle function as .pickle files named train_inputs, train_labels, test_inputs, and test_labels in a folder of your choice. The --data_dir option specifies the path to that folder.
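The conversion from a 3-D array to a pickled DataFrame can be sketched as follows. This is only an illustration: array_to_frame, the (sample, time) row index, and the feat_i column names are hypothetical, and the repository's create_df function may lay the data out differently.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def array_to_frame(data: np.ndarray) -> pd.DataFrame:
    """Flatten a (samples, features, time points) array into one DataFrame.

    Hypothetical layout: rows are indexed by (sample, time) and each column
    holds one feature. The repository's create_df may use a different layout.
    """
    n_samples, n_features, n_steps = data.shape
    # Put the time axis before the feature axis, then stack samples row-wise.
    rows = data.transpose(0, 2, 1).reshape(n_samples * n_steps, n_features)
    index = pd.MultiIndex.from_product(
        [range(n_samples), range(n_steps)], names=["sample", "time"]
    )
    columns = [f"feat_{i}" for i in range(n_features)]
    return pd.DataFrame(rows, index=index, columns=columns)


# Toy data standing in for a preprocessed dataset: 4 samples, 3 features, 5 steps.
toy = np.arange(4 * 3 * 5, dtype=float).reshape(4, 3, 5)
frame = array_to_frame(toy)

# Save and reload with the pandas pickle helpers; the directory is a placeholder.
data_dir = tempfile.mkdtemp()
path = os.path.join(data_dir, "train_inputs.pickle")
frame.to_pickle(path)
restored = pd.read_pickle(path)
```

In a real run, the temporary directory would be replaced by the folder later passed to --data_dir.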

Requirements

The code was developed on Linux-based systems, e.g. Ubuntu. The packages used, with their versions, are listed in requirements.txt. For conda users, a venv.yml file is also included.

Experiments

Models are trained and saved in the experiments folder. Create this folder beforehand with mkdir experiments. A trained model can be tested by loading its best checkpoint saved under experiments. For any implemented task (imputation in our case), results are recorded in the file given by --records_file, in the row named by --name. Sample command-line options are shown below.

Training

For Air Quality experiment:

python src/main.py --output_dir experiments --name imputation_air_quality --records_file imputation_air_quality.xls --data_dir air_quality_data/ --data_class ptsd --pattern train --val_ratio 0.2 --epochs 400 --lr 0.001 --optimizer RAdam --pos_encoding learnable --task imputation

For Healthcare experiment:

python src/main.py --output_dir experiments --name imputation_healthcare --records_file imputation_healthcare.xls --data_dir healthcare_data/ --data_class ptsd --pattern train --val_ratio 0.2 --epochs 400 --lr 0.001 --optimizer RAdam --pos_encoding learnable --task imputation

Test

In --load_model, $experiment_name is the folder of the trained model to be tested. The --masking_ratio and --mask_distribution parameters are specific to the test setup and may be omitted; their default values are given in options.py.
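To give an intuition for what a Bernoulli mask distribution with a given masking ratio means, the sketch below draws each entry independently with probability equal to the ratio. The bernoulli_mask helper and the True-means-masked convention are assumptions made for illustration; this is not the repository's masking code.

```python
import numpy as np


def bernoulli_mask(shape, masking_ratio, seed=None):
    """Return a boolean array where True marks an entry chosen for masking.

    Each entry is drawn independently with probability masking_ratio.
    The True-means-masked convention is an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    return rng.random(shape) < masking_ratio


# Toy shape: 100 samples, 8 features, 48 time points.
mask = bernoulli_mask((100, 8, 48), masking_ratio=0.1, seed=0)
masked_fraction = mask.mean()  # close to 0.1 for a large array
```

Because the draws are independent, the masked fraction of any single sample fluctuates around the ratio rather than matching it exactly.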

For Air Quality experiment:

python src/main.py --output_dir experiments --name imputation_air_quality --records_file imputation_air_quality.xls --data_dir air_quality_data/ --data_class ptsd --pattern train --val_ratio 0.2 --epochs 400 --lr 0.001 --optimizer RAdam --pos_encoding learnable --task imputation --test_only testset --test_pattern test --load_model experiments/$experiment_name/checkpoints/model_best.pth

For Healthcare experiment:

python src/main.py --output_dir experiments --name imputation_healthcare --records_file imputation_healthcare.xls --data_dir healthcare_data/ --data_class ptsd --pattern train --val_ratio 0.2 --epochs 400 --lr 0.001 --optimizer RAdam --pos_encoding learnable --task imputation --test_only testset --test_pattern test --load_model experiments/$experiment_name/checkpoints/model_best.pth --masking_ratio 0.1 --mask_distribution bernoulli

After testing, three NumPy array files are saved under the visualize_data folder: target.npy, target_mask.npy, and predictions.npy, each with shape (number of samples × features × time points). These files contain the ground-truth values, the masked indices, and the imputed values of the test data, respectively. They can be used to visualize the test data over time by selecting any sample index and any feature index.
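A minimal sketch of inspecting these outputs is given below. It uses small toy arrays so that it runs on its own; the commented np.load lines show where the saved files would be read instead. The helpers masked_mae and select_series are hypothetical, and the sketch assumes that nonzero entries of target_mask mark the masked positions.

```python
import numpy as np


def masked_mae(target, predictions, mask):
    """Mean absolute error over masked positions only.

    Assumes nonzero mask entries mark the positions that were masked.
    """
    mask = mask.astype(bool)
    return float(np.abs(target[mask] - predictions[mask]).mean())


def select_series(target, predictions, mask, sample, feature):
    """Return the 1-D series of one sample and one feature for plotting."""
    return (
        target[sample, feature],
        predictions[sample, feature],
        mask[sample, feature].astype(bool),
    )


# With real outputs, load the saved files instead of building toy arrays:
# target = np.load("visualize_data/target.npy")
# mask = np.load("visualize_data/target_mask.npy")
# predictions = np.load("visualize_data/predictions.npy")
target = np.zeros((2, 3, 4))
predictions = target.copy()
mask = np.zeros((2, 3, 4), dtype=bool)

# Mark two positions as masked and give them imperfect imputations.
mask[0, 1, 2] = True
mask[1, 0, 3] = True
predictions[0, 1, 2] = 1.0
predictions[1, 0, 3] = 3.0

error = masked_mae(target, predictions, mask)
truth, imputed, where = select_series(target, predictions, mask, sample=0, feature=1)
```

The three returned series share the time axis, so they can be passed directly to a plotting library to overlay ground truth and imputations and highlight the masked time points.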
