java l2-regularization logistic-regression online-learning sgd

RLR: Regularized Logistic Regression

This is an assignment for CMU 10-605, "Machine Learning with Large Datasets". It contains a Java implementation of L2-regularized logistic regression trained with scalable online stochastic gradient descent (SGD). Sparse updates stay efficient because regularization is applied lazily: the decay a weight has accumulated is applied only when its feature next appears in an example. The hashing trick is used to save memory.
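The lazy regularization and hashing trick mentioned above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in LR.java: the class name `LazyLogisticRegression`, its fields, the constant learning rate, and the decay factor `1 - 2 * learningRate * l2` (which corresponds to a penalty of `l2 * ||w||^2`) are all hypothetical choices made for the example.

```java
// Hypothetical sketch of one binary classifier; not the repository's LR.java.
class LazyLogisticRegression {
    private final double[] weights;    // one weight per hash bucket (hashing trick)
    private final int[] lastUpdated;   // step at which each bucket was last decayed
    private final double learningRate; // assumed constant for simplicity
    private final double l2;           // assumed L2 regularization strength
    private int step = 0;              // number of training examples seen so far

    LazyLogisticRegression(int numBuckets, double learningRate, double l2) {
        this.weights = new double[numBuckets];
        this.lastUpdated = new int[numBuckets];
        this.learningRate = learningRate;
        this.l2 = l2;
    }

    // Hashing trick: map a token straight to a bucket, so no vocabulary is stored.
    int bucket(String token) {
        return Math.floorMod(token.hashCode(), weights.length);
    }

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Lazy regularization: apply all the decay steps bucket j skipped, in one go.
    private void catchUp(int j) {
        int skipped = step - lastUpdated[j];
        if (skipped > 0) {
            weights[j] *= Math.pow(1.0 - 2.0 * learningRate * l2, skipped);
            lastUpdated[j] = step;
        }
    }

    // Probability that the document belongs to this classifier's class.
    double predict(String[] tokens) {
        double z = 0.0;
        for (String token : tokens) {
            int j = bucket(token);
            catchUp(j);          // make sure the weight is up to date before reading it
            z += weights[j];
        }
        return sigmoid(z);
    }

    // One SGD step; only the buckets of tokens in this example are touched.
    void train(String[] tokens, int label) {
        step++;
        double p = predict(tokens);
        double gradientScale = learningRate * (label - p);
        for (String token : tokens) {
            weights[bucket(token)] += gradientScale;
        }
    }

    // Current (fully decayed) weight for a token's bucket.
    double weightOf(String token) {
        int j = bucket(token);
        catchUp(j);
        return weights[j];
    }
}
```

Because `catchUp` runs only for buckets that occur in the current document, the cost of each update depends on the document length rather than on the number of buckets.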

The data are articles from DBpedia, and each label is the type of an article. The dataset has 17 classes in total, taken from the first level of the DBpedia ontology. A document may belong to multiple classes, so a separate binary classifier is trained for each class. The data contain one document per line, in the format:

docID label1,label2,... word1 word2 word3...
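As a rough illustration of reading this format, the sketch below parses one line and derives the 0/1 target that a per-class binary classifier would train on. The class `LabeledDocument` and its methods are hypothetical names, and the code assumes the three fields are separated by whitespace (spaces or tabs); the actual parsing in LR.java may differ.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical holder for one parsed line; not taken from LR.java.
class LabeledDocument {
    final String docId;
    final List<String> labels;
    final String[] words;

    LabeledDocument(String docId, List<String> labels, String[] words) {
        this.docId = docId;
        this.labels = labels;
        this.words = words;
    }

    // Parse "docID label1,label2,... word1 word2 ..." (whitespace-separated, assumed).
    static LabeledDocument parse(String line) {
        String[] fields = line.trim().split("\\s+", 3);
        if (fields.length < 2) {
            throw new IllegalArgumentException("expected at least docID and labels: " + line);
        }
        List<String> labels = Arrays.asList(fields[1].split(","));
        String[] words = fields.length == 3 ? fields[2].split("\\s+") : new String[0];
        return new LabeledDocument(fields[0], labels, words);
    }

    // Binary target for the classifier of one class: 1 if the document has that label.
    int targetFor(String className) {
        return labels.contains(className) ? 1 : 0;
    }
}
```

For example, `LabeledDocument.parse("doc1 Person,Agent hello world")` (a made-up sample line) yields two labels and two words, and `targetFor("Person")` returns 1 for it.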

Given the path to the test dataset, LR.java streams the training data from stdin (System.in) and writes its predictions in the following format, one line per test sample:

label1 probability_label1,label2 probability_label2,...
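A line in this output format could be assembled along the following lines. This is only a sketch: `PredictionFormatter` is a hypothetical helper, and the way probabilities are printed (Java's default `Double` formatting) is an assumption, since the README does not specify the number format.

```java
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper; the number formatting is an assumption.
class PredictionFormatter {
    // Builds "label1 p1,label2 p2,..." in the iteration order of the map.
    static String format(Map<String, Double> probabilities) {
        StringJoiner joiner = new StringJoiner(",");
        for (Map.Entry<String, Double> entry : probabilities.entrySet()) {
            joiner.add(entry.getKey() + " " + entry.getValue());
        }
        return joiner.toString();
    }
}
```

Passing a `LinkedHashMap` keeps the labels in insertion order, so every output line lists the classes in the same sequence.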

See run.sh for an example that trains the logistic regression model for 20 iterations and produces predictions for the test data.
