Jul 18th, 2018
predictive_maintenance_dataset.csv contains operational settings and sensor measurements for many wind turbines:
- operational_setting_1
- operational_setting_2
- sensor_measurement_1
- sensor_measurement_2 ...
A column called unit_number identifies which turbine each row belongs to, and a column called status indicates whether the turbine broke down that day (1) or not (0).
The task is to create a model that, when fed operational settings and sensor measurements (unit_number and time_stamp are not fed in), outputs 1 if the turbine will break down within the next 40 days and 0 otherwise.
For a closer look at the process, please review the Jupyter Notebook
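As a rough illustration (a sketch, not the repo's actual code), the 40-day lookahead label could be derived from the raw status column along these lines, assuming one row per turbine per day and the column names above:

import pandas as pd

# Hypothetical sketch: mark a row 1 if the same turbine has status == 1
# in the current row or any of the 39 rows that follow it
df = pd.read_csv('data/predictive_maintenance_dataset.csv')
df = df.sort_values(['unit_number', 'time_stamp'])

def breaks_within_40_days(status):
    # a reversed rolling max looks 40 rows forward in the original order
    return status[::-1].rolling(window=40, min_periods=1).max()[::-1].astype(int)

df['label'] = df.groupby('unit_number')['status'].transform(breaks_within_40_days)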
forecasting_dataset.csv contains pollution data for a city. The task is to create a model that, when fed the columns co_gt, nhmc, c6h6, s2, nox, s3, no2, s4, s5, t, rh, ah, and level, predicts the value of y six hours later.
For a closer look at the process, please review the Jupyter Notebook
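For intuition (again a sketch, not necessarily the notebook's code), assuming one row per hour, the six-hours-ahead target can be built with a simple shift:

import pandas as pd

# Hypothetical sketch: pair each row's features with the value of y six rows
# (hours) later; the final six rows then have no target and are dropped
df = pd.read_csv('data/forecasting_dataset.csv')
df['y_future'] = df['y'].shift(-6)
df = df.dropna(subset=['y_future'])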
A writeup explaining the design decisions, potential future work, and the reasons for the current choices: Notebook
To view the log files, open them with TensorBoard by running the command below in a terminal, from the directory that contains the logs folder:
tensorboard --logdir=logs
For both tasks, the model is packaged as a pipeline (a sketch of how such a pipeline might be assembled follows the two lists below).
Task 1 pipeline contains:
- One-hot encode the categorical variable (get dummies) and drop one level
- Select only the features that appear in the training set
- Impute missing values with the mean
- Feed-forward neural network built with Keras
Task 2 pipeline contains:
- Select only the features that appear in the training set
- One-hot encode the categorical variable (get dummies)
- Impute missing values with the mean
- Feed-forward neural network built with Keras
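A minimal sketch of how such a pipeline might be assembled (the transformer and network below are simplified stand-ins rather than the repo's code; the one-hot encoding step is omitted for brevity, and the layer sizes and training settings are assumptions):

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer   # SimpleImputer in newer sklearn
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier

class SelectTrainingFeatures(BaseEstimator, TransformerMixin):
    # Stand-in for "select only the features that appear in the training set":
    # remember the columns seen at fit time, reindex to them at transform time
    def fit(self, X, y=None):
        self.columns_ = X.columns
        return self
    def transform(self, X):
        return X.reindex(columns=self.columns_)

def build_net(input_dim=24):
    # Small feed-forward network; the architecture here is illustrative only,
    # and input_dim must match the number of columns after imputation
    net = Sequential()
    net.add(Dense(32, activation='relu', input_dim=input_dim))
    net.add(Dense(1, activation='sigmoid'))
    net.compile(optimizer='adam', loss='binary_crossentropy')
    return net

pipeline = Pipeline([
    ('select', SelectTrainingFeatures()),
    ('impute', Imputer(strategy='mean')),
    ('model', KerasClassifier(build_fn=build_net, epochs=10, verbose=0)),
])
# Usage: pipeline.fit(Xtrain, ytrain); ypred = pipeline.predict(Xval)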
Download the models, which are saved as pickle files, from the results folder.
For task 1:
- Load the pipeline first.
- Then load the Keras model into the pipeline (use Keras 1.2 to load the model).
# Imports needed for the snippet below:
from sklearn.externals import joblib   # or simply `import joblib` in newer environments
from keras.models import load_model
from sklearn.metrics import mean_squared_error
import numpy as np
import pandas as pd
from src.task1.reform_results import hard_label

# Load the pipeline first:
pl_load_in = joblib.load('../../results/task1_pipeline.pkl')
# Then, load the Keras model into the pipeline:
pl_load_in.named_steps['model'].model = load_model('../../results/task1_keras_model.h5')

# Test the model: compute and print the MSE on the validation set
ypred = pl_load_in.predict(Xval)
mse = mean_squared_error(yval, ypred)
print("Mean squared error: %f" % mse)

# Reset the index for comparison (omit if yval already has a clean index)
yval2 = yval.reset_index(drop=True)
# Assign hard labels (hard_label() is defined in src.task1.reform_results)
new_ypred = pd.DataFrame(ypred)[0].apply(hard_label)

# Compute and print the validation accuracy
accuracy = float(np.sum(new_ypred == yval2)) / yval2.shape[0]
print("accuracy: {}%".format(round(accuracy * 100, 3)))
For task 2:
Load the pipeline; the model is already included in it.
# Imports needed for the snippet below:
import pickle
from sklearn.metrics import r2_score

# Load the model from disk
filename = 'results/task2_model.pkl'   # path to the pickled model
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)

# Test the model
ypred = loaded_model.predict(Xtest)
print("R squared score is:", r2_score(ytest, ypred).round(3))
Dependencies:
- numpy
- pandas
- missingno
- imbalanced-learn
- sklearn
- statsmodels
- keras (2.0 for modelling; 1.2 if you only need to load a saved model and use it)
- matplotlib
- seaborn
- scikitplot
The hierarchy of this repository is as follows:
.
|-- README
|-- LICENSE
|-- .gitignore
|-- data
|   |-- predictive_maintenance_dataset.csv
|   |-- forecasting_dataset.csv
|-- doc
|   |-- notebook.md    # electronic lab notebook
|   |-- manuscript.md
|-- results            # all saved result models
|-- src                # source code used for both tasks
|   |-- task1          # code specific to task 1
|   |-- task2          # code specific to task 2
|-- test               # tests for functions
|-- assets             # stored images
|-- bin                # files slated for deletion but kept in case they are needed later