This project aims to predict winter wheat yields based on location and weather data. It is inspired by this data science challenge.
Here I briefly outline the main steps in my approach as well as my main results. A detailed report is also available: Full Report
A gradient-boosted decision tree regressor turned out to be the best performer. The tuned model achieved an R² of ~0.83 and a root-mean-square error (RMSE) of 5.3, against yield values that range from 10 to 80 in the dataset. The mean absolute percentage error (MAPE) is ~5%.
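For readers unfamiliar with these three metrics, here is a minimal sketch of how they can be computed from paired observed and predicted yields. The function name `regression_metrics` and the sample values are made up for illustration; they are not taken from the project notebooks or dataset.

```python
import math


def regression_metrics(observed, predicted):
    """Return R^2, RMSE and MAPE (in percent) for paired sequences."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(observed)
    mean_obs = sum(observed) / n
    # Residual and total sums of squares, as used in the R^2 definition.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    # MAPE assumes no observed value is zero.
    mape = 100.0 / n * sum(abs((o - p) / o) for o, p in zip(observed, predicted))
    return {"r2": r2, "rmse": rmse, "mape": mape}


# Made-up yields for illustration only; not from the project dataset.
observed = [40.0, 50.0, 60.0, 70.0]
predicted = [42.0, 48.0, 63.0, 66.0]
metrics = regression_metrics(observed, predicted)
```

RMSE is expressed in the same units as the yield, which is why it is quoted next to the 10 to 80 range, while MAPE is relative to each observed value.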
The table below summarizes the main steps in the workflow. The Jupyter notebooks linked in each step contain the commented code that was used to achieve the results.
Task | Summary | Notebook |
---|---|---|
Explore and clean data | Explore the data structure and impute missing values. | 01 |
Collect additional data | For each location determine elevation and length-of-day at a unified date. | 03 |
Feature engineering | Construct higher-level features by characterizing each location across the season. | 04 |
Statistical analysis | High-level statistical exploration of final feature set. | 05 |
Select algorithm | Compare a number of algorithms using cross validation to identify the most promising performers for this data/feature set. | 06 |
Tune model | Tune hyper-parameters of a gradient-boosted tree regressor using cross validation, learning curves and validation curves. Find the best balance between performance and the bias-variance tradeoff. | 06 |
Establish model performance | Use a 30% hold-out test set to compare predicted and observed yields. | 06 |
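
As an illustration of the "Collect additional data" step, length-of-day at a fixed date can be approximated from latitude alone. The sketch below uses a common textbook approximation for solar declination and the sunrise hour angle; it is an assumption about how such a feature might be derived, not necessarily the method used in the notebook. The function name `day_length_hours` is hypothetical.

```python
import math


def day_length_hours(latitude_deg, day_of_year):
    """Approximate hours of daylight at a latitude on a given day of the year.

    Ignores atmospheric refraction and the size of the solar disc.
    """
    # Approximate solar declination (radians); day 1 is January 1.
    declination = math.radians(-23.44) * math.cos(
        2.0 * math.pi / 365.0 * (day_of_year + 10)
    )
    latitude = math.radians(latitude_deg)
    # Cosine of the sunrise/sunset hour angle.
    cos_hour_angle = -math.tan(latitude) * math.tan(declination)
    # Clamp to handle polar day and polar night.
    cos_hour_angle = max(-1.0, min(1.0, cos_hour_angle))
    hour_angle = math.acos(cos_hour_angle)
    return 24.0 / math.pi * hour_angle


# Evaluate every location at the same date so values are comparable.
summer = day_length_hours(40.0, 172)  # around the June solstice
winter = day_length_hours(40.0, 355)  # around the December solstice
```

Using one unified date for all locations makes the feature a function of latitude only, so it acts as a smooth proxy for how far north a field lies.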
While the performance of the model appears quite good, closer inspection reveals a tendency to underpredict at high yield values (>60 observed). There is also some residual overfitting, even after careful tuning.
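
One simple way to check for this kind of bias is to compare the mean residual (predicted minus observed) over the whole test set with the mean residual restricted to high observed yields. The sketch below is a hypothetical helper with made-up numbers, not the analysis from the notebooks.

```python
def mean_residual(observed, predicted, min_observed=None):
    """Mean of (predicted - observed), optionally only where observed > min_observed."""
    pairs = [
        (o, p)
        for o, p in zip(observed, predicted)
        if min_observed is None or o > min_observed
    ]
    if not pairs:
        return None  # no samples in the requested range
    return sum(p - o for o, p in pairs) / len(pairs)


# Made-up yields for illustration only; not from the project dataset.
observed = [35.0, 45.0, 55.0, 65.0, 75.0]
predicted = [36.0, 44.0, 55.0, 61.0, 69.0]

overall_bias = mean_residual(observed, predicted)
high_yield_bias = mean_residual(observed, predicted, min_observed=60.0)
```

A clearly negative value for the high-yield subset, compared with a value near zero overall, would indicate systematic underprediction in that range.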
In future iterations, these issues could be addressed by:
- getting more data,
- engineering additional and/or different features, or
- using ensemble techniques by combining the results of different models.
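
As a rough illustration of the last option, the simplest ensemble is a (weighted) average of the predictions of several already-trained models. The helper name `average_predictions`, the model labels and the numbers below are all hypothetical; this project does not implement such an ensemble.

```python
def average_predictions(model_predictions, weights=None):
    """Combine per-model prediction lists into one weighted-average list."""
    if not model_predictions:
        raise ValueError("need predictions from at least one model")
    n_models = len(model_predictions)
    n_samples = len(model_predictions[0])
    if any(len(preds) != n_samples for preds in model_predictions):
        raise ValueError("all models must predict the same number of samples")
    if weights is None:
        weights = [1.0] * n_models  # plain average
    if len(weights) != n_models:
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    return [
        sum(w * preds[i] for w, preds in zip(weights, model_predictions)) / total
        for i in range(n_samples)
    ]


# Made-up predictions from two hypothetical models for three locations.
boosted_tree_preds = [52.0, 61.0, 70.0]
linear_model_preds = [50.0, 65.0, 74.0]

simple_average = average_predictions([boosted_tree_preds, linear_model_preds])
weighted_average = average_predictions(
    [boosted_tree_preds, linear_model_preds], weights=[3.0, 1.0]
)
```

Averaging tends to help most when the individual models make different kinds of errors; the weights would normally be chosen using validation data rather than the test set.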