Logistic Regression

Logistic regression is a supervised learning algorithm for binary or multinomial classification problems. It models the relationship between the features and the class label by estimating probabilities with the logistic function, which is the cumulative distribution function of the logistic distribution, and it can therefore be used to predict the probability of occurrence of an event.
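As a minimal sketch of the function described above, the logistic (sigmoid) function can be written as follows; the input z stands for the linear combination of the features:

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps any real-valued score into (0, 1),
    # which is interpreted as the probability of the positive class.
    return 1.0 / (1.0 + np.exp(-z))

# At z = 0 the two classes are equally likely.
print(sigmoid(0.0))  # 0.5
```

Scores far above zero map to probabilities near 1, and scores far below zero map to probabilities near 0, which is what makes the function usable as a probability estimate.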

In our case, the class of each output depends on several features, so the logistic regression classifier is parameterized by a weight matrix W and a bias vector b. Incorporating these parameters into the logistic function and generalizing it to multiple classes yields a new equation, the softmax function.
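A small sketch of that parameterization, with illustrative shapes for W and b (the actual dimensions depend on the number of features and classes in the data):

```python
import numpy as np

def softmax_probabilities(x, W, b):
    # Linear score for each class, parameterized by the weight matrix W
    # and the bias vector b.
    scores = x @ W + b
    # Subtract the maximum score before exponentiating for numerical stability;
    # this does not change the resulting probabilities.
    exp_scores = np.exp(scores - np.max(scores))
    # Normalize so the class probabilities sum to 1.
    return exp_scores / exp_scores.sum()

# Toy example: 2 features, 3 classes.
x = np.array([1.0, 2.0])
W = np.array([[0.2, -0.1, 0.4],
              [0.5, 0.3, -0.2]])
b = np.array([0.1, 0.0, -0.1])
p = softmax_probabilities(x, W, b)
print(p, p.sum())  # one probability per class; they sum to 1
```

With two classes, the softmax function reduces to the logistic function applied to the difference of the two class scores.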

To implement logistic regression in Python with the scikit-learn package, there are two logistic regression classifiers, namely LogisticRegression and LogisticRegressionCV. For both classifiers, we can opt for optimizers such as liblinear, newton-cg, sag, and lbfgs, together with the regularization methods supported by the selected optimizer. The liblinear solver supports both L1 and L2 regularization, with the dual formulation available only for the L2 penalty. The difference between the two classifiers is that LogisticRegressionCV selects its best hyperparameters automatically, by default using the StratifiedKFold cross-validator; this behavior can be altered through the cv parameter. The results of both algorithms with different setups are shown below. To obtain the log loss and percent accuracy, we divide the training set into 5 folds and use 5-fold cross-validation to validate each model. Moreover, we also evaluate each model on the test set to find which model predicts unseen data with the lowest log loss.
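The evaluation procedure above can be sketched as follows; a synthetic dataset is used here as a stand-in for the actual training and test data, and the specific solver and penalty are illustrative choices:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import log_loss, accuracy_score

# Synthetic stand-in for the actual dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fixed-hyperparameter classifier: liblinear solver with L2 regularization.
clf = LogisticRegression(solver="liblinear", penalty="l2", C=1.0)

# 5-fold cross-validation on the training set; for classifiers,
# cv=5 uses StratifiedKFold under the hood.
cv_logloss = -cross_val_score(clf, X_train, y_train, cv=5,
                              scoring="neg_log_loss")
cv_accuracy = cross_val_score(clf, X_train, y_train, cv=5,
                              scoring="accuracy")

# LogisticRegressionCV searches a grid of C values internally,
# selecting the best one by stratified 5-fold cross-validation.
clf_cv = LogisticRegressionCV(Cs=10, cv=5, solver="liblinear")
clf_cv.fit(X_train, y_train)

# Final check on the held-out test set.
test_logloss = log_loss(y_test, clf_cv.predict_proba(X_test))
test_accuracy = accuracy_score(y_test, clf_cv.predict(X_test))
print(f"CV log loss: {cv_logloss.mean():.3f}, "
      f"CV accuracy: {cv_accuracy.mean():.3f}")
print(f"Test log loss: {test_logloss:.3f}, "
      f"test accuracy: {test_accuracy:.3f}")
```

The same pattern applies to any of the other solver and penalty combinations mentioned above; only the constructor arguments change.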

Table 5: Accuracy and log loss of the logistic regression classifiers