This repo hosts a toolset aimed at symplifying data scientists' daily life, by automating a lot of tedious and repetitive tasks around dataset wrangling, training, testing, deploying etc.
It is currently focused on supervised learning tasks.
It notably unifies in a unique interface 3 famous ML libraries:
-
Scikit Learn
-
H2o
-
AWS Sagemaker
Meaning that you can use seamlessly algorithms from these libraries, only by setting your training config file, leaving all the rest to MALCOLM.
Only two scripts are needed to do the job: one for pre-processing, one for training/testing/etc.
usage: python prepare_datasets.py
SageMaker pre-processing script
required arguments:
-d DATASET_NAME input dataset name (without extension)
-t TARGET feature name for the labels, or index if no headers
optional arguments:
-h, --help show this help message and exit
-f DATA_FOLDER data folder (default data)
--infra-s3 S3_BUCKET configuration of the S3 bucket from infra_s3.yaml
-p PREPROCESSOR preprocessor name
--scale subtract the mean and divide by the std
--encode one-hot encoding of categorical variables (label encoding for H2O)
--categ CAT1 CAT2 categorical features, encoded if --encoding is specified
--headers if enabled, attempt to parse headers
--clean ignore previously generated local files and erase intermediate files to regenerate them
-v verbose mode
Specify the dataset name with -d option, without file extension (e.g. filename_20190319
).
The script needs a unique local working folder, that will be used to download/upload/store all files.
Specifying a S3 bucket + S3 folder will trigger storing the final results in the same S3 folder than where the input file can be downloaded.
The loading process works this way:
-
First it tries to load a local h5 file (from WORKING_FOLDER) for the input dataset.
-
(If not found) it tries to load a local csv file, then saves a h5 file for the raw data to speedup next launch.
-
(If not found) it tries to decompress a .csv.bz2 file, then performs step 2.
-
(If not found) it tries to download a .csv file from Amazon S3, using provided S3 credentials, then performs step 2.
-
(If not found) it tries to download a .csv.bz2 file from Amazon S3, using provided S3 credentials, then performs step 2.
-
(If not found) it raises a FileNotFoundError exception.
In order to upload the dataset to S3, your AWS credentials (aws_access_key_id
and aws_secret_access_key
) must be stored in ~/.aws/credentials
, as recommended in the boto3 documentation. The region
parameter should also be specified, and the specified S3 bucket must exist.
Note that the current parsing of the libsvm format by SageMaker implies that runtime errors will arise a test time if the last categorical variable contains a new value. This issue won't arise if you use --new-categ ignore
or if your dataset contains at least one numerical feature.
python src/prepare_datasets.py -d abalone_multi -p abalone -t label --headers --scale --encode --infra-s3 s3_dev
usage: python src/run.py
SageMaker training/testing/deployment script
required arguments:
-a ACTIONS actions to perform among: t(training), p(predict), e(evaluate), g(grid search),
d(deploy endpoint): deploy the model trained or specified by --awsmid on SageMaker,
r(remove endpoint): delete the endpoint corresponding to the specified --awsmid
-d DATASET_NAME input dataset name (without extension)
-c CONFIG_FILENAME config file name based on `config/model_<configfilename>.yaml`
--model CONFIG_ID model config with a matching in `config/model_<configfilename>.yaml`
optional arguments:
-h, --help show this help message and exit
-f DATA_FOLDER data folder (default data)
-w WORKING_FOLDER working folder (default jobs)
--infra-s3 S3_BUCKET configuration of the S3 bucket from infra_s3.yaml (required for AWS algos)
--infra-sm SM_CONFIG configuration of the SageMaker platform from infra_sm.yaml (required for AWS algos)
--h2o H2O ip:port of a running h2o instance (default localhost:54321)
--model-id MODEL_ID model ID of a previously trained model. Ignored if 't' is part of the actions
--clean ignore previously generated local files and erase intermediate files to regenerate them
The script will first attempt to find the .h5
preprocessed dataset locally. If it does not exist, the dataset will be retrieved from the S3 bucket specified in config/config_model.yaml
.
Training an H2O algorithm will automatically launch a local H2O server. However, you may want to use an existing H2O instance (local or remote).
To launch an h2o instance on a remote server:
- connect to
REMOTESERVER
nohup java -Xmx20g -jar /mnt/sdb/h2o/h2o-3.24.0.2/h2o.jar -port 54342 -nthreads 20 >/mnt/sdb/h2o/logs/h2o.log 2>&1 &
-Xmx20g
20g of ram used-nthreads 20
using 20 threads, all available threads if not specified
To check if the instance is running point a browser to REMOTESERVER:54342
.
Warning: same versions of h2o python package and h2o instance are needed, check requirements.txt
python src/run.py -a tpe -d abalone_multi -c abalone --model sklearn_rf [--model-id MODEL_ID]
python src/run.py -a tpedr -d abalone_multi -c abalone --model aws_xgb --infra-s3 s3_dev --infra-sm sm_dev [--model-id MODEL_ID]
We encourage submitting pull requests for new features and inevitable bug-fixing. Current leads for improvements are:
-
Deployment/cleaning of sklearn and H2O models on SageMaker through -a d/r actions
-
(Parallel) execution of several models on a given dataset (including summary at the end of the evaluations)