Decision trees and random forests are data-driven methods that can be used both for classification (classification trees) and for prediction of numeric outcomes (regression trees). Trees are widely used because they are easy for people outside data mining to interpret and they provide accurate results. The decision tree algorithm is based on separating observations into smaller and smaller groups by making successive splits. These splits create rules, which are used to classify or predict the desired outcome. For example: IF Income > $500 AND Age < 23 THEN the observation belongs to class 1.
Here is an example of a decision tree that helps us predict whether or not to approve a loan.
There are two major ideas that help us construct a tree: recursive partitioning and pruning.
Recursive partitioning: Repeatedly split the records into two parts to achieve maximum homogeneity within new parts.
Pruning the tree: Simplify the tree by pruning peripheral branches to avoid overfitting.
Recursive partitioning steps:
1. Pick one of the predictor variables, Xi.
2. Pick a value of Xi that divides the training data into two portions.
3. Measure how pure each portion is (purity measures are discussed below).
4. Take the split that gives the maximum purity, then repeat the process within each resulting portion, and so on (see the sketch below).
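To make steps 2-4 concrete, here is a minimal R sketch of scanning one numeric predictor for its best split point; the gini helper, the best_split function, and the toy income/owner vectors are my own illustrative assumptions, not part of the example that follows.
# Illustrative sketch: score every candidate split of one predictor by weighted Gini impurity
gini <- function(y) 1 - sum((table(y) / length(y))^2)    # impurity of one portion
best_split <- function(x, y) {
  cuts <- sort(unique(x))
  cuts <- cuts[-length(cuts)]                            # a split must leave records on both sides
  score <- sapply(cuts, function(thr) {
    left  <- y[x <= thr]
    right <- y[x >  thr]
    (length(left) * gini(left) + length(right) * gini(right)) / length(y)
  })
  cuts[which.min(score)]                                  # value giving the purest two portions
}
income <- c(60, 85, 65, 61, 87, 110, 108, 82, 115, 52)   # toy data, in $1000s
owner  <- factor(c("No","Yes","No","No","Yes","Yes","Yes","No","Yes","No"))
best_split(income, owner)                                 # 82: incomes <= 82 are all non-owners here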
An example of how the algorithm splits observations and creates the nodes of a tree:
Goal: Classify 24 households as owning or not owning riding mowers.
Predictor variables: Income, Lot size.
The first split is made on the variable Lot Size at the value 19,000:
The second split is made on Income = $84,000:
After all the splits:
Impurity is measured with one of two common measures. A node is called pure if all of its records belong to the same class. The measures are the Gini index and entropy.
Gini index:
I(A) = 1 − Σ_{k=1..m} p_k²
where
m = number of classes
p_k = proportion of cases in rectangle A that belong to class k
I(A) = 0 when all cases belong to the same class
I(A) takes its maximum value when all classes are equally represented
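A quick numeric check of these two properties, using nothing more than the formula above in R:
p <- c(1, 0)       # all cases in one class
1 - sum(p^2)       # Gini = 0: a pure node
p <- c(0.5, 0.5)   # two classes equally represented
1 - sum(p^2)       # Gini = 0.5: the maximum for m = 2 classes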
Stopping the tree growth:
The natural end of the process is 100% purity in every leaf, which overfits the training data and leads to bad decisions on new data. Beyond a certain point, the error rate on validation data starts to increase.
A widespread method for stopping tree growth is CHAID (chi-squared automatic interaction detection): splitting stops when the purity improvement is not statistically significant.
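To illustrate the idea behind the test (this is only a sketch of the chi-squared check CHAID relies on, with made-up counts, not the full CHAID algorithm):
# Counts of the two classes in the two portions of a proposed split (made-up numbers)
split_table <- matrix(c(18, 12,
                        14, 16), nrow = 2, byrow = TRUE)
chisq.test(split_table)   # large p-value here, so the improvement is not significant and we stop splitting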
Random forest:
A random forest is an estimator that fits a number of decision tree classifiers on various subsamples of the dataset and uses averaging to improve predictive accuracy and control overfitting. In scikit-learn's RandomForestClassifier, the sub-sample size is always the same as the original input sample size, but the samples are drawn with replacement if bootstrap=True (the default).
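The bagging-and-voting idea can be sketched by hand in a few lines of R. This toy illustration uses the built-in iris data and rpart trees, not the randomForest implementation used in the script below:
library(rpart)
set.seed(1)
# Grow 25 trees, each on a bootstrap sample drawn with replacement
trees <- lapply(1:25, function(i) {
  idx <- sample(nrow(iris), replace = TRUE)
  rpart(Species ~ ., data = iris[idx, ], method = "class")
})
# Each tree votes; the ensemble prediction is the majority class per observation
votes <- sapply(trees, function(t) as.character(predict(t, iris, type = "class")))
forest_pred <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(forest_pred == iris$Species)   # accuracy of the ensemble on the training data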
Now I will show you a script that you can use in R to build decision trees and a random forest.
Random Forest and Decision Tree in R:
library(lattice)
library(ggplot2)
library(caret)
library(rpart)
library(rpart.plot)
library(rattle)
library(ROCR)
library(gplots)
library(randomForest)
ebaytest=read.csv("ebay_test.csv")
ebaytrain=read.csv("ebay_train.csv")
str(ebaytrain)
summary(ebaytrain)
library(rpart)
fone=formula(Competitive~Category+currency+sellerRating+Duration+endDay+ClosePrice+OpenPrice )
fit=rpart(fone, data=ebaytrain, method="class")
# fit=rpart(fone, data=ebaytest, method="anova")
library(rpart.plot)
prp(fit, type=2,faclen=3, main="Decision tree for ebaytrain")
prp(fit, type=1, extra=2,faclen=3, main="Decision tree for ebaytrain (node counts)")
plot(fit, uniform=TRUE)  # base R plot of the fitted tree (plotting the bare formula would fail)
text(fit, use.n=TRUE, cex=.8)
library(rattle) # pruning
fancyRpartPlot(fit)
fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"]
pfit<- prune(fit, cp= fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"])
plot(pfit, uniform=TRUE, main="Pruned Classification Tree for ebaytrain")
text(pfit, use.n=TRUE, all=TRUE, cex=.8)
pfit<- prune(fit, cp=0.010000)
plot(pfit, uniform=TRUE,main="Pruned Classification Tree for ebaytrain")
text(pfit, use.n=TRUE, all=TRUE, cex=.8)
pfit$cptable[which.min(pfit$cptable[,"xerror"]),"CP"]
pfit2<- prune(pfit, cp= pfit$cptable[which.min(pfit$cptable[,"xerror"]),"CP"])
plot(pfit2, uniform=TRUE, main="Pruned Classification Tree for ebaytrain2")
text(pfit2, use.n=TRUE, all=TRUE, cex=.8)
# New tree without the ClosePrice variable
ftwo=formula(Competitive~Category+currency+sellerRating+Duration+endDay+OpenPrice)
fit2=rpart(ftwo, data=ebaytrain, method="class")
prp(fit2, type=1, extra=2,faclen=3, main="Decision tree for ebaytrain without ClosePrice")
# using controls
fit2<-rpart(ftwo, data=ebaytrain, method="class",control= rpart.control(minsplit=500, minbucket=250))
prp(fit2)
fit2<-rpart(ftwo, data=ebaytrain, method="class",control= rpart.control(minsplit=700, minbucket=350))
prp(fit2)
fit2<-rpart(ftwo, data=ebaytrain, method="class",control= rpart.control(minsplit=300, minbucket=150))
prp(fit2)
fit2<-rpart(ftwo, data=ebaytrain, method="class",control= rpart.control(minsplit=10, minbucket=round(10/3)))
prp(fit2)
pfit2<- prune(fit2, cp= fit2$cptable[which.min(fit2$cptable[,"xerror"]),"CP"])
prp(pfit2)
printcp(fit) # the same can be done this way
library(caret)
library(ROCR)
library(rattle)
asRules(pfit2)
# Making predictions, confusion matrix, ROC curve, AUC on the testing data
pfit3<-predict(pfit2, ebaytest, type="class")
confusionMatrix(pfit3, ebaytest$Competitive, positive="More")
# ROC
pred_prob<-predict(pfit2, ebaytest, type="prob")
pred <- prediction(pred_prob[,2],ebaytest$Competitive)
perf <- performance(pred,"tpr","fpr")
performance(pred, "auc")@y.values
plot(perf)
# random forest
library(randomForest)
rfit=randomForest(Competitive~sellerRating+Duration+OpenPrice, data = ebaytrain)
#rm(list=ls())
Forest1<-randomForest(ftwo,data=ebaytrain,importance = TRUE,ntree = 1500)
# predict with forest
pred_f1<-predict(Forest1, newdata=ebaytest)
confusionMatrix(pred_f1, ebaytest$Competitive, positive="More")
plot(margin(Forest1))  # margin of the forest's (out-of-bag) predictions on the training data
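Since Forest1 was fit with importance = TRUE, you can also inspect which predictors the forest relies on most; the two calls below are standard randomForest functions, though they are not part of the original script:
importance(Forest1)                                        # mean decrease in accuracy / Gini per predictor
varImpPlot(Forest1, main = "Variable importance, Forest1")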
References:
Galit Shmueli, Data Mining for Business Analytics.
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html