ImagePipeline
An image pipeline for training image classifiers on the Hadoop cluster
Synopsis
The WMF Research team has been working on projects related to computer vision tools, such as prototypes of image classifiers trained on Commons categories. The project aimed to evaluate the feasibility of developing in-house computer vision tools to support future platform evolution.
We'd now like to run similar training procedures on the cluster, so we can fully utilize the 6 GPUs on 6 different nodes of our Analytics cluster. By using YARN node labels with tf-yarn, we were able to run tensorflow-rocm on the GPU nodes and plain tensorflow on the nodes without a GPU.
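For illustration, the node-label mechanism looks roughly like this in tf-yarn's documented API (a sketch only; pyenv_zip_path and experiment_fn are assumed to be prepared elsewhere, names may differ by tf-yarn version, and the actual calls live in the training scripts described below):
from tf_yarn import TaskSpec, NodeLabel, run_on_yarn

run_on_yarn(
    pyenv_zip_path,   # packaged Python environment shipped to the cluster (assumed)
    experiment_fn,    # function building the Estimator plus train/eval specs (assumed)
    task_specs={
        # NodeLabel.GPU asks YARN to schedule the task on a GPU-labeled node
        'chief': TaskSpec(memory='4 GiB', vcores=4, label=NodeLabel.GPU),
        'evaluator': TaskSpec(memory='4 GiB', vcores=1),
    },
)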
Getting started
Connect to stat1008 via SSH
$ ssh stat1008.eqiad.wmnet
Installation
First, clone the repository
$ git clone https://github.com/AikoChou/ImagePipeline.git
Set up and activate the virtual environment
$ cd ImagePipeline
$ conda-create-stacked venv
$ source conda-activate-stacked venv
Install the dependencies
$ export http_proxy=http://webproxy.eqiad.wmnet:8080
$ export https_proxy=http://webproxy.eqiad.wmnet:8080
$ pip install --ignore-installed -r requirements.txt
Running the script
To run the script for training an image classifier, you need to:
- Save your neural network model to JSON;
- Upload your data to HDFS;
- Set up the config file.
Prepare model
First, save your model as JSON. Keras provides the to_json() function, which describes any model in JSON format. The result can be saved to a file and later loaded via the model_from_json() function, which creates a new model from the JSON specification.
model = tf.keras.Sequential([
    ...
])

model_json = model.to_json()
with open('keras_model/model.json', 'w') as json_file:
    json_file.write(model_json)
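For example, a model consistent with the config used later in this README (160x160 inputs, a MobileNet backbone, and a top layer named 'dense') might look like the following sketch; the exact architecture is an assumption, not the project's actual model:
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights=None)

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, name='dense'),  # name matches layer_to_train = 'dense'
])

# Only the architecture is serialized here; weights are loaded separately.
model_json = model.to_json()
with open('keras_model/model.json', 'w') as json_file:
    json_file.write(model_json)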
The JSON specification of the model is then loaded in the model function model_fn in model.py:
def model_fn(features, labels, mode):
    model_file = (os.path.join('keras_model', 'model.json')
                  if os.path.isdir('keras_model') else 'model.json')
    with open(model_file, 'r') as f:
        model_json = f.read()
    model = tf.keras.models.model_from_json(model_json)
    ...
Here model_fn is a function that, given inputs and a number of other parameters, returns the ops necessary to perform training, evaluation, or prediction. Currently, we provide a simple template for image classifiers that covers both binary and multi-class classification. You can use the template by specifying a pre-defined loss, metric, and optimizer in the config file (explained later), or you can rewrite the whole model_fn on your own.
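If you do rewrite model_fn yourself, it must return a tf.estimator.EstimatorSpec for the given mode. A minimal sketch for binary classification (the tiny network and hyperparameters here are placeholders, not the template's actual code):
import tensorflow as tf

def model_fn(features, labels, mode):
    # Placeholder network; in the pipeline this is the JSON-loaded Keras model.
    logits = tf.keras.layers.Dense(1)(tf.keras.layers.Flatten()(features))
    # labels assumed to be shape [batch, 1] values in {0, 1}
    loss = tf.compat.v1.losses.sigmoid_cross_entropy(labels, logits)

    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.compat.v1.train.GradientDescentOptimizer(learning_rate=1e-3)
        train_op = optimizer.minimize(
            loss, global_step=tf.compat.v1.train.get_or_create_global_step())
        return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

    # EVAL: report accuracy alongside the loss.
    accuracy = tf.compat.v1.metrics.accuracy(
        labels=labels, predictions=tf.round(tf.sigmoid(logits)))
    return tf.estimator.EstimatorSpec(
        mode, loss=loss, eval_metric_ops={'binary_accuracy': accuracy})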
Prepare input data
Second, upload the training and evaluation data, saved as TFRecord files, to HDFS.
$ hadoop fs -copyFromLocal your_data.tfrecords folder_on_hdfs/
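If your data is not in TFRecord format yet, images can be serialized with tf.train.Example before uploading. A minimal sketch, where the feature keys 'image' and 'label' and the examples iterable are assumptions:
import tensorflow as tf

# examples: hypothetical iterable of (uint8 numpy array, int label) pairs
with tf.io.TFRecordWriter('your_data.tfrecords') as writer:
    for image, label in examples:
        example = tf.train.Example(features=tf.train.Features(feature={
            'image': tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[image.tobytes()])),
            'label': tf.train.Feature(
                int64_list=tf.train.Int64List(value=[label])),
        }))
        writer.write(example.SerializeToString())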
The data is loaded in the input function input_fn in data.py:
def input_fn(mode, input_context=None):
    if mode == tf.estimator.ModeKeys.TRAIN:
        dataset = tf.data.TFRecordDataset(cfg.train_data).map(parse_example)
        ...
    else:
        dataset = tf.data.TFRecordDataset(cfg.eval_data).map(parse_example)
        ...
Here input_fn provides input data for training, evaluation, or prediction as minibatches. You need to parse the data according to the format in which it was saved in the TFRecords, and perform normalization if needed.
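For instance, a parse_example matching the hypothetical TFRecord layout sketched above could look like this (the feature keys, image encoding, and normalization are assumptions; match them to your own data):
import tensorflow as tf

def parse_example(serialized):
    # Feature keys must match those used when the TFRecords were written.
    parsed = tf.io.parse_single_example(serialized, {
        'image': tf.io.FixedLenFeature([], tf.string),
        'label': tf.io.FixedLenFeature([], tf.int64),
    })
    image = tf.io.decode_raw(parsed['image'], tf.uint8)
    image = tf.reshape(image, (160, 160, 3))    # matches img_size in the config
    image = tf.cast(image, tf.float32) / 255.0  # normalize to [0, 1]
    return image, parsed['label']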
Set up the config file
The config file contains parameters for the various parts of the pipeline. At a minimum, you should modify _GLOBAL_CONFIG, _DATA_CONFIG, and _MODEL_CONFIG.
_GLOBAL_CONFIG
- name: the name of the YARN application
- hdfs_dir: the location on HDFS at which the model and its checkpoints will be saved
_GLOBAL_CONFIG = dict(
    name = 'ImagePipeline',
    hdfs_dir = f'{cluster_pack.get_default_fs()}user/{USER}/tf_yarn/tf_yarn_{int(datetime.now().timestamp())}'
)
_DATA_CONFIG
- train_data: the location of the training data on HDFS. If there are multiple files, give a list
- eval_data: the location of the evaluation data on HDFS. If there are multiple files, give a list
- img_size: the shape of the image data
- buffer_size: the buffer size for shuffling
_DATA_CONFIG = dict(
    train_data = [f'{cluster_pack.get_default_fs()}user/{USER}/pixels-160x160-shuffle-000.tfrecords'],
    eval_data = [f'{cluster_pack.get_default_fs()}user/{USER}/pixels-160x160-shuffle-001.tfrecords'],
    img_size = (160, 160, 3),
    buffer_size = 1000
)
_MODEL_CONFIG
- weights_to_load: warm-start from pre-trained weights or from weights saved in a checkpoint on HDFS
- load_var_name: whether to load a dict mapping variable names between the previous checkpoint (or pre-trained model) and the current model. If true, a 'var_name.json' file needs to be provided (see the hypothetical sketch after the config block)
- layer_to_train: train only the specified layer and freeze the others. This is used in transfer learning/fine-tuning to train the top layer only
- train_steps: total number of steps for which to train the model
- eval_steps: number of steps for which to evaluate the model. If None, the whole eval_data is evaluated
- batch_size: batch size to use. Adjust the batch size according to the device type (CPU or GPU), otherwise you may encounter network problems.[1]
_MODEL_CONFIG = dict(
    weights_to_load = f'{cluster_pack.get_default_fs()}user/{USER}/mobilenet/variables/variables',
    load_var_name = True,
    layer_to_train = 'dense',
    train_steps = 1000,
    eval_steps = None,
    batch_size = 256,
    learning_rate = 1e-3,
    optimizer = 'gradient_descent',
    loss_fn = 'binary_crossentropy',
    metric_fn = 'binary_accuracy'
)
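As for var_name.json, it presumably holds a flat JSON object mapping each variable name in the old checkpoint to the corresponding name in the current model. A hypothetical sketch that writes such a file (the variable names here are made up for illustration):
import json

var_name = {
    'MobilenetV2/Conv/weights': 'mobilenetv2_1.00_160/Conv1/kernel',
    'MobilenetV2/Conv/BatchNorm/beta': 'mobilenetv2_1.00_160/bn_Conv1/beta',
}
with open('var_name.json', 'w') as f:
    json.dump(var_name, f, indent=2)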
Other configuration blocks can remain unchanged, such as _RESOURCE_CONFIG, which provides default resource settings for the distributed tasks, and _HADOOP_ENV_CONFIG, which sets up the environment variables needed for TensorFlow to work with HDFS.
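For orientation, a hypothetical shape for _RESOURCE_CONFIG (the real defaults ship with the repo; the keys and values here are illustrative only):
_RESOURCE_CONFIG = dict(
    chief = dict(memory='4 GiB', vcores=4),
    evaluator = dict(memory='4 GiB', vcores=1),
    tensorboard = dict(memory='2 GiB', vcores=1),
)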
Run the script
Once everything above is set up, we can run the training procedure on the cluster. You can choose between two versions: one runs on CPU nodes, the other on GPU nodes.
$ python scripts/train.py # for CPU nodes
$ python scripts/train_on_gpu.py # for GPU nodes
Checking the performance
To check the accuracy of the model after training, we can look at the YARN logs.
$ yarn logs -applicationId application_xxxxxxxxxxxxx_xxxxx
In evaluator.log, we can see the evaluation accuracy and loss at the final training step.
...
2021-06-30 14:41:30,411:INFO:tensorflow: Finished evaluation at 2021-06-30-14:41:30
2021-06-30 14:41:30,412:INFO:tensorflow: Saving dict for global step 1007: binary_accuracy = 0.6373899, global_step = 1007, loss = 0.63418937
2021-06-30 14:41:30,474:INFO:tensorflow: Saving 'checkpoint_path' summary for global step 1007: hdfs://analytics-hadoop/user/aikochou/tf_yarn/tf_yarn_1625063221/model.ckpt-1007
2021-06-30 14:41:30,479:DEBUG:tensorflow: Calling exporter with the `is_the_final_export=True`.
2021-06-30 14:41:30,480:INFO:tensorflow: Waiting 453.976144 secs before starting next eval run.
2021-06-30 14:49:04,488:INFO:tensorflow: Exiting evaluation, global_step=1007 >= train max_steps=1000
2021-06-30 14:49:04,498:INFO:__main__: evaluator:0 SUCCEEDED
In addition, the logs show the locations at which the checkpoints were saved. By setting weights_to_load in the config file to the latest checkpoint, you can warm-start from the saved model state and continue training.
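For example, to continue training from the run shown above, point weights_to_load at the checkpoint path taken from evaluator.log (a sketch; the remaining keys stay as before):
_MODEL_CONFIG = dict(
    weights_to_load = 'hdfs://analytics-hadoop/user/aikochou/tf_yarn/tf_yarn_1625063221/model.ckpt-1007',
    load_var_name = False,  # assumption: variable names already match the current model
    # ... other keys unchanged
)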
In chief.log, we can see the initial loss and the loss at the final step.
2021-06-30 14:27:25,565:INFO:tensorflow: loss = 0.87347865, step = 0
2021-06-30 14:28:11,977:INFO:tensorflow: global_step/sec: 2.30507
2021-06-30 14:28:49,234:INFO:tensorflow: global_step/sec: 2.81827
2021-06-30 14:29:30,823:INFO:tensorflow: global_step/sec: 2.83734
2021-06-30 14:30:23,872:INFO:tensorflow: global_step/sec: 2.58249
2021-06-30 14:31:04,625:INFO:tensorflow: global_step/sec: 2.82185
2021-06-30 14:31:39,712:INFO:tensorflow: global_step/sec: 3.10662
2021-06-30 14:31:41,813:INFO:tensorflow: loss = 0.64253294, step = 694 (256.248 sec)
2021-06-30 14:32:15,668:INFO:tensorflow: global_step/sec: 3.05927
2021-06-30 14:32:19,758:INFO:tensorflow: Calling checkpoint listeners before saving checkpoint 817...
2021-06-30 14:32:19,758:INFO:tensorflow: Saving checkpoints for 817 into hdfs://analytics-hadoop/user/aikochou/tf_yarn/tf_yarn_1625063221/model.ckpt.
2021-06-30 14:32:20,674:INFO:tensorflow: Calling checkpoint listeners after saving checkpoint 817...
2021-06-30 14:32:50,571:INFO:tensorflow: global_step/sec: 3.06574
2021-06-30 14:33:23,267:INFO:tensorflow: Calling checkpoint listeners before saving checkpoint 1007...
2021-06-30 14:33:23,268:INFO:tensorflow: Saving checkpoints for 1007 into hdfs://analytics-hadoop/user/aikochou/tf_yarn/tf_yarn_1625063221/model.ckpt.
2021-06-30 14:33:24,361:INFO:tensorflow: Calling checkpoint listeners after saving checkpoint 1007...
2021-06-30 14:33:24,471:INFO:tensorflow: Loss for final step: 0.6461505.
2021-06-30 14:33:24,482:INFO:__main__: chief:0 SUCCEEDED