Logistic regression explained:
Logistic regression resembles linear regression in its basic idea, but it extends regression to cases where the dependent variable Y is categorical. For example, when Y denotes a recommendation to hold, sell, or buy a stock, we can fit an ordinal logistic regression with three categories. Logistic regression uses predictor variables to classify an observation whose class is unknown.
This regression is used in applications such as:
1. Classifying customers as returning or churning (classification)
2. Finding factors that differentiate between male and female top executives (profiling)
3. Predicting approval or disapproval of a loan based on customer information
Logistic regression is used in many fields, especially when we need to predict a categorical outcome, in particular a binary one.
Logistic regression Model:
The idea of logistic regression is simple: instead of using Y itself as the dependent variable, we use a function of it, called the logit. To understand it, first take p = P(Y=1), the probability that Y equals 1. Whereas Y can only take the class values 1 or 0, p can take any value in the interval [0, 1]. That is still not enough, because if we fit the model p = B0 + B1X1 + B2X2 + … + BnXn, nothing guarantees that the right-hand side stays between 0 and 1. The solution is to work with the odds, p/(1-p), and then take their logarithm. A decision tree is another way to approach the same classification problem.
log(odds) = log(p/(1-p)) = B0 + B1X1 + B2X2 + … + BnXn
Now both sides have the same range: log odds can take any value from -infinity to +infinity. Log odds of 0 correspond to even odds of 1, that is, a probability of 0.5. As the log odds fall below 0, the probability that the observation belongs to the class drops below 0.5, and as they rise above 0 the probability rises above 0.5.
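The link between probability, odds and log odds can be checked with a few lines of R. This is only a small sketch with made-up probabilities; qlogis() and plogis() are the base R logit and inverse logit functions.

```r
# Convert between probability, odds and log odds (the logit).
p <- c(0.1, 0.5, 0.9)          # made-up probabilities for illustration

odds     <- p / (1 - p)        # below 1, exactly 1, above 1
log_odds <- log(odds)          # negative, 0, positive

# Base R ships both directions:
# qlogis() is the logit, plogis() is its inverse.
all.equal(log_odds, qlogis(p))
all.equal(p, plogis(log_odds))

# A linear predictor can be any real number,
# yet it always maps back into the interval (0, 1).
plogis(c(-10, 0, 10))
```

This is why the right-hand side of the model is allowed to be unbounded: whatever value it takes, the inverse logit turns it into a valid probability.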
R logistic regression:
Read the files
Score<-read.csv("Credit scoring.csv")
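Separate the data into training and testing parts
library(caret)
# Credit is the labeled data set (its loading step is not shown here)
index<-createDataPartition(Credit$default, p=0.75, list=FALSE, times=1)
Train<-Credit[index,]
Test<-Credit[-index,]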
Run initial descriptive statistics to understand the differences between defaulted and non-defaulted customers (this could be charts, a t-test, a chi-square test, or whatever you think is appropriate).
## t = 3.5011, df = 294.08, p-value = 0.0005353
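From the p-value we can see whether the variable differs significantly between the two groups. Here p = 0.0005 is well below 0.05, so the difference in means is significant.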
Make boxplots to compare distributions
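The t-test and boxplot steps might look like the sketch below. The data are simulated stand-ins (the demo data frame and its numbers are assumptions, not the tutorial's Credit data); only the column names default and debtinc are borrowed from it.

```r
# Simulated stand-in data; the real Credit data frame is not reproduced here.
set.seed(1)
demo <- data.frame(
  default = factor(rep(c("No", "Yes"), times = c(150, 50))),
  debtinc = c(rnorm(150, mean = 9, sd = 5), rnorm(50, mean = 14, sd = 7))
)

# Welch two-sample t-test (the t.test() default):
# does mean debtinc differ between defaulted and non-defaulted customers?
tt <- t.test(debtinc ~ default, data = demo)
tt$p.value

# Side-by-side boxplots of the same variable by outcome.
boxplot(debtinc ~ default, data = demo,
        xlab = "default", ylab = "debtinc")
```

The Welch version does not assume equal variances in the two groups, which is why its degrees of freedom are usually not a whole number, as in the output above.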
The initial regression model
Model1<-glm(default~.,data=Train, family="binomial")
summary(Model1)
## Call:
## glm(formula = default ~ ., family = "binomial", data = Train)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.5736  -0.6640  -0.2858   0.3032   3.0940
##
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)
## (Intercept)      -0.982345   0.774908  -1.268 0.204908
## age               0.040454   0.020184   2.004 0.045038 *
## edhigh school    -0.004028   0.410193  -0.010 0.992164
## edno high school -0.457978   0.403519  -1.135 0.256392
## edpostgraduate    0.879676   1.386050   0.635 0.525648
## edundergraduate  -0.366409   0.591096  -0.620 0.535336
## employ           -0.254660   0.039732  -6.409 1.46e-10 ***
## address          -0.092858   0.027085  -3.428 0.000607 ***
## income           -0.024618   0.013601  -1.810 0.070287 .
## debtinc           0.042544   0.038007   1.119 0.262988
## creddebt          0.764952   0.151046   5.064 4.10e-07 ***
## othdebt           0.134778   0.100650   1.339 0.180546
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Education, age, income, and other debt are dropped from the model: education, income, and other debt are not significant at the 5% level, and age is only borderline (p ≈ 0.045).
Model2<-glm(default~employ+address+debtinc+creddebt,data=Train, family="binomial")
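Interpretations: for each one-unit increase in employ, the odds of default are multiplied by about 0.8, that is, they decrease by 20% (1-0.8), holding the other variables constant. Likewise, each additional unit of address (years at the current address) decreases the odds of default by about 8%, etc.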
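Percentages like these come from exponentiating the coefficients. The sketch below uses the employ and address estimates from the Model1 summary above, because the Model2 coefficients are not printed; the resulting percentages are therefore close to, but not identical with, the ones quoted for Model2.

```r
# Estimates copied from the Model1 summary above (assumption: Model2's
# coefficients are similar; they are not shown in the text).
b <- c(employ = -0.254660, address = -0.092858)

odds_ratio <- exp(b)                   # multiplicative change in odds per unit
pct_change <- (odds_ratio - 1) * 100   # the same effect as a percentage

round(cbind(odds_ratio, pct_change), 3)

# With a fitted model object, the same odds ratios come from exp(coef(Model2)).
```

An odds ratio below 1 means the odds of default shrink as the predictor grows, and an odds ratio above 1 means they grow.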
Accuracy Measure
pred_test<-predict(Model2, newdata=Test, type="response")
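# confusionMatrix() expects factors with the same levels as Test$default
pr_label <-factor(ifelse(pred_test>0.5, "Yes", "No"), levels=c("No","Yes"))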
# Create confusionmatrix
# You need to set which one is the positive case
confusionMatrix(pr_label,Test$default, positive="Yes")
## Confusion Matrix and Statistics
##
##           Reference
## Prediction  No Yes
##        No  121  23
##        Yes   8  22
##
## Accuracy : 0.8218
## 95% CI : (0.7568, 0.8756)
## No Information Rate : 0.7414
## P-Value [Acc > NIR] : 0.007927
## Kappa : 0.4788
## Mcnemar's Test P-Value : 0.011921
## Sensitivity : 0.4889
## Specificity : 0.9380
## Pos Pred Value : 0.7333
## Neg Pred Value : 0.8403
## Prevalence : 0.2586
## Detection Rate : 0.1264
## Detection Prevalence : 0.1724
## Balanced Accuracy : 0.7134
## 'Positive' Class : Yes
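The main statistics in this output can be recomputed by hand from the four cell counts. The sketch below copies the counts from the confusion matrix above and treats "Yes" as the positive class; the variable names are only illustrative.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference).
cm <- matrix(c(121, 8, 23, 22), nrow = 2,
             dimnames = list(Prediction = c("No", "Yes"),
                             Reference  = c("No", "Yes")))

TN <- cm["No", "No"];  FN <- cm["No", "Yes"]
FP <- cm["Yes", "No"]; TP <- cm["Yes", "Yes"]

accuracy    <- (TP + TN) / sum(cm)   # share of all cases classified correctly
sensitivity <- TP / (TP + FN)        # share of actual defaulters caught
specificity <- TN / (TN + FP)        # share of non-defaulters cleared
ppv         <- TP / (TP + FP)        # Pos Pred Value
npv         <- TN / (TN + FN)        # Neg Pred Value

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, ppv = ppv, npv = npv), 4)
```

The low sensitivity next to the high specificity shows that, at the 0.5 cutoff, the model misses about half of the actual defaulters even though overall accuracy looks good.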
The ROC Curve and AUC
library(ROCR)
P_Test <- prediction(pred_test, Test$default)
perf <- performance(P_Test,"tpr","fpr")
# The ROC curve
plot(perf)
# Find Area under the curve (AUC)
performance(P_Test, "auc")@y.values
## [[1]]
## [1] 0.8463394
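One way to read the AUC is as the probability that a randomly chosen defaulter gets a higher predicted score than a randomly chosen non-defaulter. The base R sketch below checks this on simulated scores (the data and variable names are assumptions, not the tutorial's Test set) and compares it with the equivalent rank-sum formula.

```r
# Simulated scores and labels; not the tutorial's Test set.
set.seed(42)
truth <- rep(c(0, 1), times = c(60, 40))
score <- c(rbeta(60, 2, 5), rbeta(40, 4, 3))

# AUC = P(score of a random positive > score of a random negative),
# counting ties as one half.
pos <- score[truth == 1]
neg <- score[truth == 0]
auc_pairs <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

# The same number from the Mann-Whitney rank-sum formula.
r  <- rank(score)
n1 <- length(pos)
n0 <- length(neg)
auc_rank <- (sum(r[truth == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

c(auc_pairs = auc_pairs, auc_rank = auc_rank)
```

An AUC of 0.5 would mean the scores rank customers no better than chance, and 1 would mean every defaulter is scored above every non-defaulter.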
Predicting on the new scoring set
Scoring<-predict(Model2, newdata=Score, type="response")
Scoring<-data.frame(Score, Scores=Scoring)
The file “Credit scoring.csv” contains new customers with no known outcome (default/non-default). Use your final model to predict default probabilities for these cases, then compare the top 10% and bottom 10% of customers (ranked by predicted default probability) on some of the independent variables. What do you see?
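q<-quantile(Scoring$Scores, c(0.1,0.9))
# Label the bottom and top 10%; customers in between are left as NA
Scoring$TB<-ifelse(Scoring$Scores<=q[1],"Bottom", ifelse(Scoring$Scores>=q[2],"Top",NA))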
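The comparison itself could be done with aggregate(). The sketch below runs on simulated scored customers (the scored data frame and its values are assumptions; only the column names employ, creddebt and Scores follow the tutorial) and labels only the two tails, so the middle 80% is excluded from the comparison.

```r
# Simulated scored customers; column names mirror the tutorial's data.
set.seed(7)
scored <- data.frame(
  employ   = rpois(200, 8),
  creddebt = rexp(200, rate = 0.6),
  Scores   = runif(200)
)

q <- quantile(scored$Scores, c(0.1, 0.9))

# Label only the two tails; the middle 80% stays NA and drops out below.
scored$TB <- ifelse(scored$Scores <= q[1], "Bottom",
             ifelse(scored$Scores >= q[2], "Top", NA))

# Mean of each predictor within the two groups
# (rows with NA in TB are omitted by the formula interface).
cmp <- aggregate(cbind(employ, creddebt) ~ TB, data = scored, FUN = mean)
cmp
```

On the real scoring set, large gaps between the two rows point to the variables that separate low-risk from high-risk customers.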
