Saturday, July 22, 2017

Logistic regression explained, with an example of logistic regression built in R.

Logistic regression explained: Conceptually, logistic regression looks much like linear regression, but it expands the territory where regression can be used: the dependent variable Y may be categorical. For example, when Y denotes a recommendation to hold, sell, or buy a stock, we can build an ordinal logistic regression with three categories. Logistic regression uses predictor variables to classify an observation whose class is unknown.
This regression is used in applications such as:
1. Classifying customers as returning or churning (classification)
2. Finding factors that differentiate between male and female top executives (profiling)
3. Predicting approval or disapproval of a loan based on customer information
Logistic regression is used in many fields, especially when we need to classify observations into categories (in particular, binary ones).

Logistic regression model:
The idea of logistic regression is simple: instead of modeling Y directly as the dependent variable, we model a function of it called the logit. To understand it, first take p = P(Y = 1), the probability that Y equals 1. Unlike Y, which can only be 0 or 1, p can take any value in [0, 1]. But that is still not enough, because if we fit the model p = B0 + B1X1 + B2X2 + … + BnXn, nothing guarantees that the right-hand side stays inside the 0-to-1 interval. The solution is to move to odds, p/(1 − p), and then take their logarithm.
Decision trees offer a similar solution for classification.

Log(odds) = B0 + B1X1 + B2X2 + … + BnXn

Now everything is in order: log odds can take any value from −infinity to +infinity. A log odds of 0 corresponds to even odds of 1, or a probability of 0.5. When the log odds are negative, the probability of the observation being in the class of interest goes down; when they are positive, it goes up.
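As a quick illustration, the probability → odds → log-odds round trip in R (the 0.75 is an arbitrary example value):

p<-0.75               # an example probability
odds<-p/(1-p)         # odds = 3
log(odds)             # log odds ≈ 1.0986, anywhere on (-Inf, +Inf)
plogis(log(odds))     # the logistic function inverts the logit: back to 0.75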

R logistic regression:

Read the files
Credit<-read.csv("Credit.csv")           # historical customers with known default status
Score<-read.csv("Credit scoring.csv")    # new customers to be scored later
library(caret)                           # for createDataPartition and confusionMatrix

Separate the data into testing and training parts
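Since createDataPartition samples rows at random, setting a seed first (the value below is arbitrary) makes the split reproducible:

set.seed(2017)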

index<-createDataPartition(Credit$default, p=0.75, list=FALSE, times=1)


Train<-Credit[index,]

Test<-Credit[-index,]



Do initial descriptive statistics aimed at understanding the differences between defaulted and non-defaulted customers (these could take the form of charts, t-tests, chi-square tests, or whatever you think is appropriate).

T-test

t.test(Credit$age~Credit$default)

## t = 3.5011, df = 294.08, p-value = 0.0005353

The p-value tells us whether a variable differs significantly between the two groups; here the small value (≈0.0005) shows that average age does.
Make boxplots to compare distributions

boxplot(Credit$age~Credit$default)
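For a categorical predictor, the chi-square test mentioned above applies; for example, education (`ed`, which appears in the model below) against default, assuming `ed` is read in as a factor:

chisq.test(table(Credit$ed, Credit$default))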

The initial regression model

Model1<-glm(default~.,data=Train, family="binomial")


summary(Model1)

## 
## Call:
## glm(formula = default ~ ., family = "binomial", data = Train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5736  -0.6640  -0.2858   0.3032   3.0940  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -0.982345   0.774908  -1.268 0.204908    
## age               0.040454   0.020184   2.004 0.045038 *  
## edhigh school    -0.004028   0.410193  -0.010 0.992164    
## edno high school -0.457978   0.403519  -1.135 0.256392    
## edpostgraduate    0.879676   1.386050   0.635 0.525648    
## edundergraduate  -0.366409   0.591096  -0.620 0.535336    
## employ           -0.254660   0.039732  -6.409 1.46e-10 ***
## address          -0.092858   0.027085  -3.428 0.000607 ***
## income           -0.024618   0.013601  -1.810 0.070287 .  
## debtinc           0.042544   0.038007   1.119 0.262988    
## creddebt          0.764952   0.151046   5.064 4.10e-07 ***
## othdebt           0.134778   0.100650   1.339 0.180546    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##



Education, age, income, and other debt are removed from the model as weak or non-significant predictors.

Model2<-glm(default~employ+address+debtinc+creddebt,data=Train, family="binomial")
Interpretation: each additional year of employment multiplies the odds of default by roughly 0.8, i.e., decreases them by about 20% (1 − 0.8); each additional year at the current address decreases the odds of default by about 8%, and so on.
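These percentages come from exponentiating the model coefficients, which turns them into odds ratios:

exp(coef(Model2))   # odds ratios; values below 1 shrink the odds of default
# e.g. an odds ratio of ~0.8 for employ means ~20% lower odds per extra year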


Accuracy Measure

pred_test<-predict(Model2, newdata=Test, type="response")   # predicted default probabilities

# Classify with a 0.5 cutoff and convert to a factor with the same levels as the reference
pr_label<-factor(ifelse(pred_test>0.5, "Yes", "No"), levels=levels(Test$default))

# Create the confusion matrix; you need to set which class is the positive case
confusionMatrix(pr_label, Test$default, positive="Yes")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  121  23
##        Yes   8  22
##                                           
##                Accuracy : 0.8218          
##                  95% CI : (0.7568, 0.8756)
##     No Information Rate : 0.7414          
##     P-Value [Acc > NIR] : 0.007927        
##                                           
##                   Kappa : 0.4788          
##  Mcnemar's Test P-Value : 0.011921        
##                                           
##             Sensitivity : 0.4889          
##             Specificity : 0.9380          
##          Pos Pred Value : 0.7333          
##          Neg Pred Value : 0.8403          
##              Prevalence : 0.2586          
##          Detection Rate : 0.1264          
##    Detection Prevalence : 0.1724          
##       Balanced Accuracy : 0.7134          
##                                           
##        'Positive' Class : Yes             
##
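As a sanity check, the reported accuracy is just the share of correct predictions on the matrix diagonal:

(121 + 22)/(121 + 23 + 8 + 22)   # 143/174 ≈ 0.8218, matching the Accuracy above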

The ROC Curve and AUC
library(ROCR)
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess

P_Test <- prediction(pred_test, Test$default) 



perf <- performance(P_Test,"tpr","fpr")

# The ROC curve

plot(perf)
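A dashed diagonal (an optional addition) marks random-guess performance, which makes the curve easier to judge:

abline(a=0, b=1, lty=2)   # chance line: TPR = FPR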




# Find Area under the curve (AUC)

performance(P_Test, "auc")@y.values

## [[1]]
## [1] 0.8463394

Predicting on the new scoring set
Scoring<-predict(Model2, newdata=Score, type="response")
Scoring<-data.frame(Score, Scores=Scoring)
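To eyeball the result, the scored customers can be sorted by predicted probability (illustrative):

head(Scoring[order(-Scoring$Scores),])   # the highest-risk new customers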


The file “Credit scoring.csv” includes cases for new customers with no known outcome (default/non-default). Use your final model to predict default probabilities for these cases. Compare the top 10% and bottom 10% of the customers (based on their default probability) on some independent variables. What do you see?
q<-quantile(Scoring$Scores, c(0.1,0.9))
df<-Scoring[Scoring$Scores<=q[1]|Scoring$Scores>=q[2],]

df$TB<-ifelse(df$Scores<=q[1],"Bottom","Top")
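With the Top/Bottom flag in place, group means show how the two tails differ; a sketch assuming the scoring file carries the same predictors used in Model2:

aggregate(cbind(employ, address, debtinc, creddebt) ~ TB, data=df, FUN=mean)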


Reference: Galit Shmueli, Data Mining for Business Analytics.

