**Logistic regression explained:** Logistic regression resembles linear regression in its basic idea, but it expands the range of problems regression can handle: the dependent variable Y can be categorical. For example, when Y denotes a recommendation to hold, sell, or buy a stock, we can build an ordinal logistic regression with three categories. Logistic regression uses predictor variables to classify an observation whose class is unknown.

This regression is used in applications such as:

**1.** Classifying customers as returning or churning (classification)

**2.** Finding factors that differentiate between male and female top executives (profiling)

**3.** Predicting approval or disapproval of a loan based on customer information

Logistic regression is used in many different fields, especially when we need to predict a categorical (in particular, binary) outcome.

**Logistic regression Model:**

Decision trees offer a similar solution to the same kind of problem.

**log(odds) = B0 + B1X1 + B2X2 + … + BnXn**

Log odds can take values from −infinity to +infinity. A log odds of 0 corresponds to even odds of 1, or a probability of 0.5. When the log odds are negative, the probability that the observation belongs to the class of interest goes down; when they are positive, it goes up.
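The mapping from log odds to probability can be checked directly in R; `plogis` is the logistic (inverse-logit) function:

```r
# Convert log odds to probabilities:
# p = exp(log_odds) / (1 + exp(log_odds)) = plogis(log_odds)
log_odds <- c(-2, 0, 2)
round(plogis(log_odds), 3)
## [1] 0.119 0.500 0.881
```

As claimed, a log odds of 0 gives a probability of exactly 0.5, negative log odds give p < 0.5, and positive log odds give p > 0.5.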

**R logistic regression:**

Read the files

Credit<-read.csv("Credit.csv")

Score<-read.csv("Credit scoring.csv")


library(caret)

Separate the data into testing and training parts

index<-createDataPartition(Credit$default, p=0.75, list=F,times=1)

Train<-Credit[index,]

Test<-Credit[-index,]

Do initial descriptive statistics aimed at understanding differences between defaulted and non-defaulted customers (these could be charts, t-tests, chi-square tests, or whatever you think is appropriate).

T-test:

t.test(Credit$age~Credit$default)

## t = 3.5011, df = 294.08, p-value = 0.0005353

From the p-value, we can see whether the variable is significant; here p ≈ 0.0005, so average age differs significantly between defaulted and non-defaulted customers.

Make boxplots to compare distributions

boxplot(Credit$age~Credit$default)
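For a categorical predictor, a chi-square test of independence plays the role the t-test plays for numeric ones. A sketch, assuming `ed` (education) is a categorical column in Credit.csv as the regression output below suggests:

```r
# Cross-tabulate education against default status,
# then test whether the two are independent
tab <- table(Credit$ed, Credit$default)
tab
chisq.test(tab)
```

A small p-value would indicate that the distribution of education levels differs between defaulted and non-defaulted customers.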

The initial regression model

Model1<-glm(default~.,data=Train, family="binomial")

summary(Model1)

##

## Call:

## glm(formula = default ~ ., family = "binomial", data = Train)

##

## Deviance Residuals:
## Min 1Q Median 3Q Max

## -2.5736 -0.6640 -0.2858 0.3032 3.0940

##

## Coefficients:

## Estimate Std. Error z value Pr(>|z|)

## (Intercept) -0.982345 0.774908 -1.268 0.204908

## age 0.040454 0.020184 2.004 0.045038 *

## edhigh school -0.004028 0.410193 -0.010 0.992164

## edno high school -0.457978 0.403519 -1.135 0.256392

## edpostgraduate 0.879676 1.386050 0.635 0.525648

## edundergraduate -0.366409 0.591096 -0.620 0.535336

## employ -0.254660 0.039732 -6.409 1.46e-10 ***

## address -0.092858 0.027085 -3.428 0.000607 ***

## income -0.024618 0.013601 -1.810 0.070287 .

## debtinc 0.042544 0.038007 1.119 0.262988

## creddebt 0.764952 0.151046 5.064 4.10e-07 ***

## othdebt 0.134778 0.100650 1.339 0.180546

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##

**Education, age, income, and other debt are removed: education and other debt are non-significant, while age and income are only marginally significant.**

Model2<-glm(default~employ+address+debtinc+creddebt,data=Train, family="binomial")

Interpretations: each additional year of employment decreases the odds of default by about 20% (odds ratio ≈ 0.8, i.e., 1 − 0.8); each additional year at the current address decreases the odds of default by about 8%, and so on.
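The percentage changes quoted above come from exponentiating the coefficients. A quick way to tabulate the odds ratios for the fitted `Model2` (a sketch; the exact numbers depend on the training split):

```r
# Exponentiate coefficients to move from log odds to odds ratios;
# an odds ratio of 0.80 means a 20% drop in the odds of default
# per one-unit increase in the predictor
round(exp(coef(Model2)), 3)

# Confidence intervals on the odds-ratio scale
round(exp(confint(Model2)), 3)
```

An odds ratio below 1 means the predictor lowers the odds of default; above 1, it raises them.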

**Accuracy Measure**

library(caret)

pred_test<-predict(Model2, newdata=Test, type="response")

pr_label<-factor(ifelse(pred_test>0.5, "Yes", "No"), levels=c("No","Yes"))

# Create confusion matrix


# You need to set which one is the positive case

confusionMatrix(pr_label,Test$default, positive="Yes")

## Confusion Matrix and Statistics

##

## Reference

## Prediction No Yes

## No 121 23

## Yes 8 22

##

## Accuracy : 0.8218

## 95% CI : (0.7568, 0.8756)

## No Information Rate : 0.7414

## P-Value [Acc > NIR] : 0.007927

##

## Kappa : 0.4788

## Mcnemar's Test P-Value : 0.011921

##

## Sensitivity : 0.4889

## Specificity : 0.9380

## Pos Pred Value : 0.7333

## Neg Pred Value : 0.8403

## Prevalence : 0.2586

## Detection Rate : 0.1264

## Detection Prevalence : 0.1724

## Balanced Accuracy : 0.7134

##

## 'Positive' Class : Yes

##
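The headline statistics can be recomputed by hand from the table above. With "Yes" as the positive class, sensitivity is TP/(TP+FN) and specificity is TN/(TN+FP):

```r
# Cells from the confusion matrix above (rows = Prediction, columns = Reference)
TN <- 121; FN <- 23   # predicted "No"
FP <- 8;   TP <- 22   # predicted "Yes"

accuracy    <- (TP + TN) / (TP + TN + FP + FN)   # 143 / 174
sensitivity <- TP / (TP + FN)                    # recall for the "Yes" class
specificity <- TN / (TN + FP)
round(c(accuracy, sensitivity, specificity), 4)
## [1] 0.8218 0.4889 0.9380
```

The model is much better at recognizing non-defaulters (specificity 0.94) than defaulters (sensitivity 0.49), a common pattern when the positive class is the minority.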

**The ROC Curve and AUC**

library(ROCR)

## Loading required package: gplots

##

## Attaching package: 'gplots'

## The following object is masked from 'package:stats':

##

## lowess

P_Test <- prediction(pred_test, Test$default)

perf <- performance(P_Test,"tpr","fpr")

# The ROC curve

plot(perf)

# Find Area under the curve (AUC)

performance(P_Test, "auc")@y.values

## [[1]]

## [1] 0.8463394
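The 0.5 cutoff used earlier is not mandatory. The same ROCR `perf` object built above holds the TPR, FPR, and cutoff values, so one can pick a threshold, for example the one maximizing Youden's J statistic (TPR − FPR). A sketch, reusing `perf` from the ROC step:

```r
# Youden's J: sensitivity + specificity - 1 = TPR - FPR
tpr <- perf@y.values[[1]]
fpr <- perf@x.values[[1]]
cutoffs <- perf@alpha.values[[1]]

best <- which.max(tpr - fpr)
cutoffs[best]   # probability threshold that maximizes TPR - FPR
```

Lowering the threshold below 0.5 would catch more defaulters (higher sensitivity) at the cost of more false alarms.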

Predicting on the new scoring set

Scoring<-predict(Model2, newdata=Score, type="response")

Scoring<-data.frame(Score, Scores=Scoring)

The file “Credit scoring.csv” includes cases for new customers with no known outcome (default/non-default). Use your final model to predict default probabilities for these cases. Compare the top 10% and bottom 10% of customers (based on their default probability) by some independent variables. What do you see?

q<-quantile(Scoring$Scores, c(0.1,0.9))

df<-Scoring[Scoring$Scores<=q[1]|Scoring$Scores>=q[2],]

df$TB<-ifelse(df$Scores<=q[1],"Bottom","Top" )
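With the `TB` flag in place, group means of the independent variables show how the high-risk and low-risk deciles differ. A sketch, assuming the scoring file carries the same predictor columns used in `Model2`:

```r
# Compare mean employment tenure, years at address, and debt measures
# between the bottom 10% and top 10% by predicted default probability
aggregate(cbind(employ, address, debtinc, creddebt) ~ TB, data = df, FUN = mean)
```

Typically the "Top" (highest-risk) group shows shorter employment, shorter time at the current address, and higher debt ratios, mirroring the signs of the regression coefficients.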

*Reference: Data Mining for Business Analytics, Galit Shmueli*