BlightFight

Repo for Capstone Project of Data Science at Scale course offered by University of Washington on Coursera.

Final Report

Average Blight Risk Visualization

Task

Work with real data collected in Detroit to help urban planners predict blight (the deterioration and decay of buildings and older areas of large cities, due to neglect, crime, or lack of economic support).

Approach

Step 1: Establish a list of all the buildings with their spatial extents.

Done

  1. Filter out NAs and invalid coordinates (outside the bounds of Detroit)
  2. Extract latitude/longitude pairs and addresses (in raw text) from 4 files
  3. Concatenate them into one data frame
  4. Clean up the address field (extract numbers, drop symbols, normalize spelling, expand abbreviations, etc.)
  5. Cluster geolocations by fuzzy matching on the address field and incident proximity (eps = 0.000075).
  6. Represent each building with a rectangle centered at its average coordinates (see the sketch after this list).
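
The snippet below is a minimal sketch of items 1 and 6, assuming the records live in a pandas DataFrame with `lat`, `lon`, `address` and (after clustering) `building_id` columns; the Detroit bounding box and the rectangle half-width are illustrative values, not the exact ones used in the notebooks.

```python
import pandas as pd

# Illustrative values only: a rough bounding box for Detroit and an arbitrary
# half-width (in degrees) for the building rectangles.
DETROIT_LAT = (42.25, 42.47)
DETROIT_LON = (-83.30, -82.90)
HALF_SIZE = 0.00015

def filter_valid(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing or out-of-bounds coordinates (item 1)."""
    df = df.dropna(subset=["lat", "lon", "address"])
    in_bounds = df["lat"].between(*DETROIT_LAT) & df["lon"].between(*DETROIT_LON)
    return df[in_bounds]

def to_rectangles(clustered: pd.DataFrame) -> pd.DataFrame:
    """Represent each building cluster as a rectangle around its mean coordinates (item 6)."""
    centers = clustered.groupby("building_id")[["lat", "lon"]].mean()
    return centers.assign(
        lat_min=centers["lat"] - HALF_SIZE,
        lat_max=centers["lat"] + HALF_SIZE,
        lon_min=centers["lon"] - HALF_SIZE,
        lon_max=centers["lon"] + HALF_SIZE,
    )
```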

Tried

  • DBSCAN based on coordinates alone; the results were no good (a sketch of this attempt follows the list).
  • DBSCAN based on a combination of coordinates and address fields; this is impossible without rewriting the algorithm, because of the way it computes feature distances.
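
For reference, a sketch of what the coordinate-only attempt looks like with scikit-learn, assuming the same `lat`/`lon` columns; `min_samples=1` is an assumption. Mixing in the address field would require a precomputed pairwise distance matrix that combines string similarity with geographic distance, which is why that variant was dropped.

```python
from sklearn.cluster import DBSCAN

# Coordinate-only clustering: eps is in degrees, matching the proximity
# threshold mentioned above; min_samples=1 puts every incident in some cluster.
coords = df[["lat", "lon"]].to_numpy()
df["building_id"] = DBSCAN(eps=0.000075, min_samples=1).fit_predict(coords)
```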

Step 2: Generate a balanced data set for training and testing

Done

  1. Map demolition permits to buildings and derive positive labels.
  2. Randomly sample the same number of buildings with negative labels (see the sketch after this list).
  3. Concatenate them into a "training" set.
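
A sketch of the balanced set construction, under some assumptions about names: `buildings` holds one row per building and `demolished_ids` is the set of building ids matched to a demolition permit.

```python
import pandas as pd

# Buildings matched to a demolition permit become the positive class.
positives = buildings[buildings["building_id"].isin(demolished_ids)].copy()
positives["blighted"] = 1

# Randomly sample an equal number of the remaining buildings as negatives.
negatives = (
    buildings[~buildings["building_id"].isin(demolished_ids)]
    .sample(n=len(positives), random_state=42)
    .copy()
)
negatives["blighted"] = 0

labeled = pd.concat([positives, negatives], ignore_index=True)
```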

Note

This "training" set will later be divided into a (real) training set and a validation set. In this task it does not make much sense to use the remaining data as a "testing" set (at least not in the traditional sense), because it only contains buildings that are not on the demolition list, and there is no way to determine their true labels. So this part is a bit like semi-supervised learning: I'll evaluate the model on the validation set and use the remaining data for visualization and drawing conclusions. This is also what the task asks us to do anyway.

Step 3: Develop a naive model and evaluate its performance.

I believe it's OK to jump right to Step 4.

Step 4: Feature engineering.

Done

  1. Derive features from violations.csv, calls.csv and crimes.csv. Basically counts of one-hot-encoded categorical variables.
  2. Examine feature importance using a random forest. Got a ~0.83 AUC score on OOB data (see the sketch after this list).
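
A sketch of the count features and the random-forest check, assuming `violations` carries a `building_id` and a categorical `violation_code` column and `labeled` is the balanced set from Step 2; the column names and hyperparameters are assumptions.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Counts of one-hot-encoded violation categories per building.
violation_counts = (
    pd.get_dummies(violations[["building_id", "violation_code"]],
                   columns=["violation_code"])
    .groupby("building_id")
    .sum()
)

data = (
    labeled[["building_id", "blighted"]]
    .join(violation_counts, on="building_id")
    .fillna(0)
)
X = data.drop(columns=["building_id", "blighted"])
y = data["blighted"]

# oob_score=True scores each sample only with the trees that never saw it.
rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=42)
rf.fit(X, y)

print("OOB AUC:", roc_auc_score(y, rf.oob_decision_function_[:, 1]))
print(sorted(zip(rf.feature_importances_, X.columns), reverse=True)[:10])
```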

Note

Counts of violations and crimes are the simplest yet most important features, and I haven't even included a decaying propagation effect of nearby bad incidents.

Step 5: Develop a more advanced model.

  1. Trained an XGBoost model and got a ~0.85 AUC score on OOB data.
  2. Simplified the model and still got a ~0.849 AUC score (see the sketch after this list).
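
A sketch of the XGBoost step on the same feature matrix as above, with a simple stratified train/validation split; the hyperparameters are placeholders rather than the tuned values.

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import xgboost as xgb

# Hold out a stratified validation set from the balanced labeled data.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = xgb.XGBClassifier(
    n_estimators=300,
    max_depth=4,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
)
model.fit(X_train, y_train)

print("Validation AUC:", roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))
```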

Step 6: Evaluation and drawing conclusions.

Present a summary with some visualizations.

  1. Explain the model.
  2. Make a choropleth map of blight risk on out-of-sample data (a sketch follows below).
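
A sketch of the risk map with folium, assuming the out-of-sample predictions have been averaged per neighborhood into a DataFrame `risk` with `neighborhood` and `avg_risk` columns, and that a GeoJSON of Detroit neighborhoods is available; the file name and the `key_on` path are hypothetical.

```python
import folium

m = folium.Map(location=[42.36, -83.10], zoom_start=11)
folium.Choropleth(
    geo_data="detroit_neighborhoods.geojson",  # hypothetical GeoJSON file
    data=risk,
    columns=["neighborhood", "avg_risk"],
    key_on="feature.properties.name",          # assumed property in the GeoJSON
    fill_color="YlOrRd",
    legend_name="Average blight risk",
).add_to(m)
m.save("blight_risk_map.html")
```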

Author

Linghao Zhang

License

MIT license
