
Discussion


The original data has only six usable features: Dates, DayOfWeek, PdDistrict, Address, and the X and Y coordinates of the crime location. We extract additional features from these six and assign numeric values to them, yielding Dates_month, Dates_day, Dates_year, hour48, minute, DayOfWeek_index, PdDistrict_index, grid_index10, and Category_index. We pass this modified dataset as input to the Naive Bayes, Logistic Regression, k-Nearest Neighbors, and Random Forest algorithms implemented in the scikit-learn package. The best result we achieve is from the Random Forest, and below is our analysis of why each algorithm performs the way it does.
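For concreteness, here is a minimal sketch of this feature extraction, assuming the Kaggle train.csv with columns Dates, DayOfWeek, PdDistrict, X, Y, and Category. The encodings of hour48 (interpreted here as half-hour slots 0-47) and grid_index10 (interpreted as a 10x10 grid over the X, Y coordinates) are our assumptions, not a definitive reconstruction of the preprocessing.

# Sketch of the feature extraction; hour48 and grid_index10 encodings are assumed
import pandas as pd

df = pd.read_csv("train.csv", parse_dates=["Dates"])

df["Dates_year"] = df["Dates"].dt.year
df["Dates_month"] = df["Dates"].dt.month
df["Dates_day"] = df["Dates"].dt.day
df["minute"] = df["Dates"].dt.minute
# hour48: half-hour slot of the day, 0..47 (assumed encoding)
df["hour48"] = df["Dates"].dt.hour * 2 + (df["Dates"].dt.minute >= 30)

# Map the nominal features to integer indices
df["DayOfWeek_index"] = df["DayOfWeek"].astype("category").cat.codes
df["PdDistrict_index"] = df["PdDistrict"].astype("category").cat.codes
df["Category_index"] = df["Category"].astype("category").cat.codes

# grid_index10: bucket the X/Y coordinates into a 10x10 spatial grid (assumed encoding)
x_bin = pd.cut(df["X"], bins=10, labels=False)
y_bin = pd.cut(df["Y"], bins=10, labels=False)
df["grid_index10"] = y_bin * 10 + x_bin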

 

As mentioned in the lecture slides, Naive Bayes trusts its assumptions during training. One of those assumptions is conditional independence between features. We speculate that features such as Dates_month, Dates_day, Dates_year, hour48, and minute break this assumption because they are all derived from the same timestamp and are therefore not conditionally independent. When the data violates the assumption, the result is a high log loss. Moreover, Naive Bayes is not very effective at producing probability estimates for each class. Since we assess each algorithm by its log loss, we believe this is why Naive Bayes gives us poor log-loss values.
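As an illustration, the Naive Bayes baseline can be scored by validation log loss roughly as follows. This sketch reuses the DataFrame built above; GaussianNB is an assumption, since the text does not specify which scikit-learn Naive Bayes variant was used.

# Evaluate a Naive Bayes baseline by validation log loss (GaussianNB is assumed)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import log_loss

features = ["Dates_month", "Dates_day", "Dates_year", "hour48", "minute",
            "DayOfWeek_index", "PdDistrict_index", "grid_index10"]
X_train, X_val, y_train, y_val = train_test_split(
    df[features], df["Category_index"], test_size=0.2,
    stratify=df["Category_index"], random_state=0)

nb = GaussianNB().fit(X_train, y_train)
print("Naive Bayes log loss:", log_loss(y_val, nb.predict_proba(X_val)))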

 

Logistic Regression behaves very similarly to Naive Bayes as the amount of training data asymptotically approaches infinity. Logistic Regression does not require conditional independence between features, so it can perform somewhat better than Naive Bayes; the fact that the standard deviation of the log-loss values for Logistic Regression is smaller than that of Naive Bayes supports this idea. However, the values of a nominal feature have no inherent order. When we convert nominal features into numeric indices, we unintentionally impose an ordering, and therefore a relationship, among the values within each feature. We believe this artificial relationship between the numeric values of each feature contributes to the poor performance of Logistic Regression.
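The ordering problem can be seen in a small experiment: feed the integer-coded nominal features to Logistic Regression as-is, then compare against a one-hot encoding of the same features, which removes the artificial ordering. This is only a hedged sketch reusing the split above; one-hot encoding is shown as a contrast for illustration, not as the preprocessing described in this report.

# Compare integer-coded nominal features against one-hot encoded ones
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import log_loss

lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("integer-coded log loss:", log_loss(y_val, lr.predict_proba(X_val)))

nominal = ["DayOfWeek_index", "PdDistrict_index", "grid_index10"]
enc = OneHotEncoder(handle_unknown="ignore")
lr_ohe = LogisticRegression(max_iter=1000).fit(enc.fit_transform(X_train[nominal]), y_train)
print("one-hot log loss:",
      log_loss(y_val, lr_ohe.predict_proba(enc.transform(X_val[nominal]))))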

 

Random Forest is easy to understand yet “complex” enough to make full use of all the features without any concern about their independence. By selecting a proper max_depth, we can easily avoid overfitting and get a good result.
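A sketch of how max_depth can be tuned by validation log loss follows; the depths and forest size below are illustrative values rather than the exact settings behind the reported result.

# Tune max_depth for the Random Forest using validation log loss (illustrative values)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss

for depth in (8, 12, 16, None):
    rf = RandomForestClassifier(n_estimators=100, max_depth=depth,
                                n_jobs=-1, random_state=0)
    rf.fit(X_train, y_train)
    print(f"max_depth={depth}: log loss =", log_loss(y_val, rf.predict_proba(X_val)))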

 

As we know, k-Nearest Neighbors cannot distinguish between features that are relevant to the output and those that are not. For this reason, no matter how we set the value of k, k-NN cannot accurately classify the output. The log-loss values in the table reflect that even choosing a large k does not improve the result. In addition, k-Nearest Neighbors can make good predictions only for the training data; when we test it on unseen data, it performs very poorly. This reflects the fact that k-NN is a lazy algorithm: it does not learn anything or build a prediction model during training, so it generalizes poorly to unseen data.
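The sketch below illustrates this point by scoring k-NN on the held-out split for several values of k; the k values are illustrative. Because k-NN's class probabilities are just neighbour vote fractions, classes that receive no votes get near-zero probability, which log loss penalizes heavily.

# Score k-NN on held-out data for several k values (illustrative)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import log_loss

for k in (10, 100, 500):
    knn = KNeighborsClassifier(n_neighbors=k, n_jobs=-1).fit(X_train, y_train)
    print(f"k={k}: validation log loss =", log_loss(y_val, knn.predict_proba(X_val)))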

 

Summary


After analyzing and preprocessing the dataset downloaded from Kaggle, we extract these 8 features: ‘month’, ‘day’, ‘year’, ‘hour48’, ‘minute’, ‘DayOfWeek_index’, ‘PdDistrict_index’, and ‘grid_index’. We test several algorithms in Weka and select Naive Bayes, Logistic Regression, Random Forest, and k-NN to implement in Python with the scikit-learn library. The best log loss we get is 2.474, from Random Forest, which ranks 697/2335 on the public leaderboard.
