rsangole / capstone_project

Predict 498 Capstone Project

Modeling Ideas

rsangole opened this issue · comments

commented

Team,

Please populate your modeling ideas here.

What types of models can be built? What ideas do you have in mind?

Target variables:

  1. Mosquito count [regression problem]
  2. West Nile virus present (y/n) [classification problem]

For 1, we could simply model total number of mosquitos (regardless of species), or we could try to build separate predictions for each species. I think I favor the former. It's simpler to implement and interpret. That's why I built a version of the data with one row per trap per date and a count of total mosquitos.
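A minimal sketch of that aggregation in Python (the team may end up working in R, but the idea is the same; the column names and values here are made-up stand-ins for the raw data):

```python
from collections import defaultdict

# Hypothetical mini-sample of the raw data: one row per trap, date,
# species, and batch of 1-50 mosquitos (column names are assumptions).
rows = [
    {"trap": "T002", "date": "2007-05-29", "species": "CULEX PIPIENS", "num_mosquitos": 50},
    {"trap": "T002", "date": "2007-05-29", "species": "CULEX PIPIENS", "num_mosquitos": 14},
    {"trap": "T002", "date": "2007-05-29", "species": "CULEX RESTUANS", "num_mosquitos": 3},
    {"trap": "T015", "date": "2007-05-29", "species": "CULEX RESTUANS", "num_mosquitos": 7},
]

# Collapse to one row per trap per date with a total count across
# species and batches -- the regression target for idea 1.
totals = defaultdict(int)
for r in rows:
    totals[(r["trap"], r["date"])] += r["num_mosquitos"]

print(totals[("T002", "2007-05-29")])  # 67
```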

For 2, we could model each species separately or presence of WNV in any species. Although I created a version of the data that has one row per trap per date, we could revert to the original format that has one row per trap per date per batch of 1-50 mosquitos. The latter is more complicated and ill-suited to some modeling approaches but might be slightly richer. I'm not sure if it's worth it. Thoughts?
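For the classification target, collapsing batches to trap/date level would look something like this sketch (again with assumed column names): WNV is "present" if any batch from that trap/date tested positive.

```python
from collections import defaultdict

# Hypothetical mini-sample: one row per trap/date/batch, with a 0/1
# flag indicating whether WNV was detected in that batch.
rows = [
    {"trap": "T002", "date": "2007-08-01", "wnv_present": 0},
    {"trap": "T002", "date": "2007-08-01", "wnv_present": 1},
    {"trap": "T015", "date": "2007-08-01", "wnv_present": 0},
]

# One row per trap per date: positive if any batch was positive.
wnv = defaultdict(int)
for r in rows:
    wnv[(r["trap"], r["date"])] |= r["wnv_present"]
```

Note this is exactly the information the batch-level format preserves and the aggregated format discards: which batch (and hence which species) was positive.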

Modeling approaches:

I think linear regression, logistic regression, and random forests can handle the data the way I tentatively structured it.

For time series methods, we might need to further reshape the data so that we have one observation for every calendar day (whereas right now we only have observations when a trap was tested). One option is to restructure so that we have weekly summaries for each trap instead of data aggregated to the date when the trap results were recorded.
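The weekly-summary idea could be sketched like this (stdlib Python; the dates and counts are invented for illustration), binning each trap visit by ISO week:

```python
from collections import defaultdict
from datetime import date

# Assume we already have one (trap, date, total_count) row per visit.
obs = [
    ("T002", date(2007, 7, 2), 41),   # ISO week 27
    ("T002", date(2007, 7, 5), 12),   # same ISO week
    ("T002", date(2007, 7, 11), 30),  # following week
]

# One summary per trap per ISO (year, week), summing counts within the week.
weekly = defaultdict(int)
for trap, d, count in obs:
    iso = d.isocalendar()  # (year, week, weekday)
    weekly[(trap, iso[0], iso[1])] += count
```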

Thoughts?

commented

One thing I thought of today while pondering this problem...

We can simplify the modeling work and build models to predict (exactly what you had in your post):

  1. How many mosq (time series prediction)
  2. How many expected cases of wnv (classification problem)

But -- both these would be at an "aggregate" Chicago level, yes? We will not be predicting the "location" of the high mosq count, or the "location" of the wnv cases.

Of course we could try predicting location too - but it gets much more complex.

Personally, I like the idea of running a single model to predict time- and location-specific mosquito counts & WNV presence (or, actually, two models -- one for each target variable). But I think we can simplify that by using latitude/longitude and other geographic indicators rather than one-hot encoding the traps by name.
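A tiny sketch of that feature choice (coordinates here are made up): describing each trap by its geography instead of a one-hot column per trap name keeps the feature space small and lets the model generalize to traps it never saw in training.

```python
# Hypothetical trap coordinates -- stand-ins, not real values.
traps = {
    "T002": {"lat": 41.95, "lon": -87.80},
    "T015": {"lat": 41.97, "lon": -87.71},
}

def features_for(trap, week):
    """Build geographic features for one trap/week observation,
    instead of a one-hot indicator per trap name."""
    t = traps[trap]
    return {"lat": t["lat"], "lon": t["lon"], "week": week}
```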

commented

List of websites talking about modeling this problem:

  1. KH Blog
  2. His github
  3. Another solution

Question: how to handle time, especially for lagged trap data?

If we want to use time series methods that will actually respect time (i.e. use only past data when examining each time point), the methods I'm familiar with require regular observations at some defined periodicity -- e.g. daily, weekly, monthly, quarterly or yearly data.

In our case, we have approximately weekly data. If we limit to traps that are active in a given year, I estimate that about 70% of the data points are taken within a week of the preceding data point for the same trap, and about 93% within two weeks. We do occasionally have 2-3 data points within the same week for a given trap. This presents some issues for calculating lag terms for the target variables, which is why I didn't construct them already.
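The gap statistics above come from computing days between consecutive visits per trap, roughly like this sketch (dates are invented):

```python
from datetime import date

# Observed visit dates for one trap (illustrative values).
visits = {
    "T002": [date(2007, 5, 29), date(2007, 6, 5), date(2007, 6, 26)],
}

# Gap in days between each observation and the preceding one.
gaps = {}
for trap, dates in visits.items():
    ds = sorted(dates)
    gaps[trap] = [(b - a).days for a, b in zip(ds, ds[1:])]

print(gaps["T002"])  # [7, 21]
```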

I favor weekly time series over monthly because we give up a lot of granularity if we move to monthly data.

If we want to force the data to conform to a weekly time series, we could impute the missing trap results, possibly just using a simple interpolation (average of the preceding and subsequent points). We also have discontinuities since data only runs from about May to October each year; we could treat those missing data points as having zero mosquitos. Although these approaches will introduce noise (and bias), they would give us a way to use conventional time series methods. I think they'll be useful for getting insights into decomposition (seasonal, trend components) and making decisions about lag terms (differencing, moving averages, etc.).

Pros: allows us to run time series methods (e.g. ARIMA). Cons: more data munging; have to run separate models for each trap (or cluster, if we have a clustering approach we're happy with).
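The simple-interpolation idea, as a sketch (week indices and counts are made up): fill a missing interior week with the average of the nearest observed neighbors.

```python
# Observed weekly totals for one trap; week 24 is missing.
weekly_counts = {22: 10, 23: 16, 25: 30}

filled = {}
for w in range(min(weekly_counts), max(weekly_counts) + 1):
    if w in weekly_counts:
        filled[w] = weekly_counts[w]
    else:
        # Average of the nearest preceding and subsequent observed weeks.
        prev = max(k for k in weekly_counts if k < w)
        nxt = min(k for k in weekly_counts if k > w)
        filled[w] = (weekly_counts[prev] + weekly_counts[nxt]) / 2

print(filled[24])  # 23.0
```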

Alternatively, instead of forcing the data into a weekly time series structure, we could simply use the observed dates and calculate the lag terms from the most recently observed results, with an additional variable indicating the lag time (in days). Pros: fewer changes to the data and data structure; works for methods that don't require periodicity or enforce time. Cons: we still have to deal with missing data for lag terms. Random forests, for example, do not typically run when you have missing data, and imputation is messy and could bias models.
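That alternative, sketched with invented values: each observation carries the most recent prior count plus a `lag_days` variable saying how stale that lag is.

```python
from datetime import date

# (date, total_count) observations for one trap, already sorted by date.
obs = [
    (date(2007, 5, 29), 10),
    (date(2007, 6, 5), 25),
    (date(2007, 6, 19), 40),
]

# Pair each observation with the preceding one to build lag features.
features = []
for prev, cur in zip(obs, obs[1:]):
    features.append({
        "date": cur[0],
        "count": cur[1],
        "lag_count": prev[1],                # most recent observed result
        "lag_days": (cur[0] - prev[0]).days  # irregular lag, kept explicit
    })
```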

Thoughts?

I have used external regressors with ARIMA using the forecast package in R, but it's a bit clunky and hard to work with many different regressors at the same time.

Another option may be adaptations of XGBoost specifically for time series.