ML Runbook

Collection of solutions for common ML problems. Contributions are welcome :)

Dataset

If you have high variance (overfitting).
If the features are good enough for prediction, i.e. a human expert could make a manual estimate based on them.
If the algorithm has many parameters and can represent fairly complex functions.

Your model is performing very well on the training set, but poorly on the test set.

Your model performs poorly on both training and test sets.
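The two symptoms above can be told apart by comparing training and test scores. The following is a minimal sketch, not a definitive rule: the function name `diagnose` and the thresholds `good_enough` and `max_gap` are hypothetical and should be tuned to your metric and problem.

```python
def diagnose(train_score: float, test_score: float,
             good_enough: float = 0.9, max_gap: float = 0.1) -> str:
    """Label a fit from train/test scores where higher is better (e.g. accuracy).

    Both thresholds are illustrative assumptions, not values from this runbook.
    """
    # Poor on the training set already: the model cannot fit the data it has seen.
    if train_score < good_enough:
        return "high bias (underfitting)"
    # Good on the training set but much worse on the test set.
    if train_score - test_score > max_gap:
        return "high variance (overfitting)"
    return "ok"


if __name__ == "__main__":
    print(diagnose(0.99, 0.70))
    print(diagnose(0.60, 0.58))
```

Checking the training score first matters: a model that is poor on both sets usually has a small train/test gap, so a gap check alone would miss it.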

If the number of features is large (relative to the number of examples), use either logistic regression or SVM without a kernel.
If the number of features is small and the number of examples is intermediate (up to 10k), use an SVM with a Gaussian kernel.
If the number of features is small but the number of examples is large (over 10k), create/add more features, then use logistic regression or an SVM without a kernel.
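The rule of thumb above can be written as a small lookup. This is only a sketch: the name `suggest_classifier` is hypothetical, and treating "large relative to the number of examples" as `n_features >= n_examples` is an assumption, since the text does not give an exact ratio.

```python
def suggest_classifier(n_features: int, n_examples: int) -> str:
    """Return the suggested model family for a dataset of the given shape."""
    # Assumption: "large relative to the number of examples" means at least as many
    # features as examples. Everything else is treated as "small".
    if n_features >= n_examples:
        return "logistic regression or SVM without a kernel"
    # Few features, intermediate number of examples (up to 10k).
    if n_examples <= 10_000:
        return "SVM with Gaussian kernel"
    # Few features, many examples (over 10k).
    return "add more features, then logistic regression or SVM without a kernel"


if __name__ == "__main__":
    for shape in [(10_000, 1_000), (50, 5_000), (50, 1_000_000)]:
        print(shape, "->", suggest_classifier(*shape))
```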

When to use an anomaly detection algorithm (e.g. one based on a Gaussian distribution):

You expect a very small number of anomalies (up to 20) and a large number of non-anomalous examples.
You expect different types of anomalies and future anomalies may look like nothing you’ve seen so far.
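As an illustration of the Gaussian approach mentioned above, here is a minimal sketch that fits an independent Gaussian to each feature of the non-anomalous examples and flags points whose density falls below a threshold. The function names and the value of `epsilon` are assumptions made for this example; in practice `epsilon` would be chosen on a validation set that contains the few known anomalies.

```python
import math


def fit_gaussian(rows):
    """Estimate per-feature mean and variance from non-anomalous examples."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    variances = [sum((r[j] - means[j]) ** 2 for r in rows) / n for j in range(dims)]
    return means, variances


def density(x, means, variances):
    """Product of per-feature Gaussian densities (features assumed independent).

    Note: a feature with zero variance would cause a division by zero here.
    """
    p = 1.0
    for xj, mu, var in zip(x, means, variances):
        p *= math.exp(-((xj - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)
    return p


def is_anomaly(x, means, variances, epsilon=1e-3):
    """Flag x when its density is below epsilon (a hypothetical threshold)."""
    return density(x, means, variances) < epsilon


if __name__ == "__main__":
    normal = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.1], [1.1, 2.0], [0.8, 2.2]]
    means, variances = fit_gaussian(normal)
    print(is_anomaly([1.0, 2.0], means, variances))
    print(is_anomaly([5.0, 2.0], means, variances))
```

Because the model is fitted only on non-anomalous examples, it does not need to have seen a given type of anomaly before: anything sufficiently unlikely under the fitted distribution is flagged.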

When to use supervised learning: