
This part talks about the random forest algorithm. A random forest is an ensemble of decision trees: instead of growing a single tree, it trains many trees and combines their predictions. We train the model in Python with the scikit-learn package.


"min_samples_split" & "min_samples_leaf":

Before training, I set "min_samples_split" to 4 and "min_samples_leaf" to 2. "min_samples_split" is the minimum number of samples required to split an internal node, and "min_samples_leaf" is the minimum number of samples a newly created leaf must contain. Given the large size of the dataset, a leaf holding only one sample would fit the training data too precisely and is unlikely to generalize.
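A minimal sketch of this setup is below. The synthetic dataset from make_classification is an assumption standing in for the real data, which is not shown in this post:

```python
# Sketch: a random forest with the leaf/split constraints described above.
# The dataset here is synthetic; the original training data is not included.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

model = RandomForestClassifier(
    min_samples_split=4,  # an internal node needs at least 4 samples to split
    min_samples_leaf=2,   # every leaf must keep at least 2 samples
    random_state=0,
)
model.fit(X, y)
```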


"n_estimators" & "max_depth":

Then I found that the two parameters with the largest effect on the log_loss of the test set are "n_estimators" and "max_depth". "n_estimators" is the number of trees in the forest, and "max_depth" limits how deep each tree can grow. As the table shows, when n_estimators and max_depth are too small, the forest is too simple and underfits. When they are too large, the forest fits the training data too closely and overfits. With n_estimators = 220 and max_depth = 11, the best log_loss on the test set is 2.474.
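A hedged sketch of how such a parameter sweep might look is below. The grid values, the synthetic dataset, and the train/test split are illustrative assumptions, not the exact setup behind the table:

```python
# Sketch: sweep n_estimators and max_depth, keeping the combination with the
# lowest log_loss on a held-out split. Data and grid values are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_classes=3,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

best = (None, None, float("inf"))
for n_estimators in [50, 100, 220, 400]:   # illustrative grid
    for max_depth in [5, 8, 11, 20]:
        model = RandomForestClassifier(
            n_estimators=n_estimators,
            max_depth=max_depth,
            min_samples_split=4,
            min_samples_leaf=2,
            random_state=0,
        )
        model.fit(X_train, y_train)
        loss = log_loss(y_test, model.predict_proba(X_test))
        if loss < best[2]:
            best = (n_estimators, max_depth, loss)

print("best n_estimators=%s, max_depth=%s, log_loss=%.3f" % best)
```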


The most representative results are shown in the table below:
