Algorithm & Evaluation in Weka

Algorithm and Evaluation

In the Kaggle competition for San Francisco crime classification, each participant is evaluated on the multi-class logarithmic loss. Each incident is labeled with one true class, and for each incident we must submit a set of predicted probabilities, one for every class. The log loss is computed as

$$\text{logloss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log(p_{ij})$$

where $N$ is the number of cases in the test set, $M$ is the number of class labels, $\log$ is the natural logarithm, $y_{ij}$ is 1 if observation $i$ is in class $j$ and 0 otherwise, and $p_{ij}$ is the predicted probability that observation $i$ belongs to class $j$.
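The metric described above can be sketched in a few lines of plain Python. This is only an illustration: the function name `multiclass_log_loss`, the clipping constant `eps`, and the toy inputs are our own assumptions and do not come from the competition's scoring code. Because y_ij is 1 only for the true class, the inner sum over classes reduces to the log of the probability assigned to the true class.

```python
import math


def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Multi-class log loss.

    y_true: list of N true class indices (0 .. M-1).
    probs:  list of N rows, each holding M predicted probabilities.
    eps:    clipping bound (an assumption) that keeps log() away from 0.
    """
    n = len(y_true)
    total = 0.0
    for true_class, row in zip(y_true, probs):
        # Only the true class contributes, since y_ij is 0 for all other classes.
        p = min(max(row[true_class], eps), 1.0 - eps)
        total += math.log(p)
    return -total / n


# Toy example (hypothetical data): two incidents, three classes.
y_true = [0, 2]
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
]
loss = multiclass_log_loss(y_true, probs)
print(f"log loss = {loss:.4f}")
```

A useful sanity check is that predicting the same probability 1/M for every class gives a loss of log(M), so any useful classifier should score below that value.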

For this reason, we need an algorithm that can classify each incident and give us a probability for each class. After further research, and following our professor’s suggestions, we tested several algorithms, including Naive Bayes, k-Nearest Neighbor, Logistic Regression, and Random Forest. These algorithms are suitable because each one can output a probability for every class, which is what the log loss calculation requires. We ran these algorithms and validated them with 10-fold cross-validation in Weka. The results are shown in Table 1.

Table 1: Results of the classification algorithms we tested in Weka

We also use the scikit-learn package to write Python scripts, and we customize these algorithms to obtain better results. The sections below discuss how we customized each algorithm and the result it produced.
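The evaluation setup described above (probabilistic classifiers scored by log loss under 10-fold cross-validation) might look roughly like the following in scikit-learn. This is a sketch under stated assumptions, not the scripts used in this project: the data is synthetic (generated with `make_classification`), and all hyperparameters, such as `n_neighbors=15` and `n_estimators=100`, are placeholders chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (NOT the SF crime data): 600 rows, 4 classes.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=4, random_state=0
)

# The four classifier families named in the text; settings are assumptions.
models = {
    "Naive Bayes": GaussianNB(),
    "k-Nearest Neighbor": KNeighborsClassifier(n_neighbors=15),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# 10-fold cross-validation, stratified so every fold contains every class.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

scores = {}
for name, model in models.items():
    # Out-of-fold class probabilities for every row, shape (n_samples, n_classes).
    proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")
    scores[name] = log_loss(y, proba, labels=np.unique(y))
    print(f"{name}: log loss = {scores[name]:.4f}")
```

Using `cross_val_predict` with `method="predict_proba"` yields one out-of-fold probability row per sample, so the log loss is computed only on predictions for data the model did not see during training.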

EECS 349 Machine Learning Project on

San Francisco Crime Classification

Bo Guan, Panitan Wongse-ammat, Xinyuan Zhao

Email: {BoGuan2015, Top, xinyuanzhao2016}@u.northwestern.edu

Northwestern University
