Interest rate analysis in R using logistic models

Introduction

I will continue working with the lending club data.

I will take the perspective of an investor to the lending club.

The goal is to select a subset of the most promising loans to invest. I will do so using the method of logistic regression.

Load the data

lc_raw <- read_csv("~/Documents/LBS/Data Science/Session 03/LendingClub Data.csv",  skip=1) %>%  #since the first row is a title we want to skip it. 
  clean_names() # use janitor::clean_names()

ICE the data: Inspect, Clean, Explore

Any data science engagement starts with ICE. Inspecting, Clean and Explore the data.

Inspect the data

I will inspect the data to understand what different variables mean.

summary(lc_raw)
##     int_rate      loan_amnt      term_months    installment        dti      
##  Min.   :0      Min.   :  500   Min.   :36     Min.   :  16   Min.   : 0    
##  1st Qu.:0      1st Qu.: 5500   1st Qu.:36     1st Qu.: 169   1st Qu.: 8    
##  Median :0      Median :10000   Median :36     Median : 288   Median :13    
##  Mean   :0      Mean   :11253   Mean   :42     Mean   : 333   Mean   :13    
##  3rd Qu.:0      3rd Qu.:15000   3rd Qu.:60     3rd Qu.: 449   3rd Qu.:19    
##  Max.   :0      Max.   :35000   Max.   :60     Max.   :1305   Max.   :30    
##  NA's   :4669   NA's   :4669    NA's   :4669   NA's   :4669   NA's   :4669  
##   delinq_2yrs     annual_inc         grade            emp_title        
##  Min.   : 0     Min.   :   4000   Length:42538       Length:42538      
##  1st Qu.: 0     1st Qu.:  40800   Class :character   Class :character  
##  Median : 0     Median :  59534   Mode  :character   Mode  :character  
##  Mean   : 0     Mean   :  69148                                        
##  3rd Qu.: 0     3rd Qu.:  82800                                        
##  Max.   :11     Max.   :6000000                                        
##  NA's   :4669   NA's   :4669                                           
##   emp_length        home_ownership     verification_status   issue_d         
##  Length:42538       Length:42538       Length:42538        Length:42538      
##  Class :character   Class :character   Class :character    Class :character  
##  Mode  :character   Mode  :character   Mode  :character    Mode  :character  
##                                                                              
##                                                                              
##                                                                              
##                                                                              
##    zip_code          addr_state        loan_status            desc          
##  Length:42538       Length:42538       Length:42538       Length:42538      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##    purpose             title             x20            x21         
##  Length:42538       Length:42538       Mode:logical   Mode:logical  
##  Class :character   Class :character   NA's:42538     NA's:42538    
##  Mode  :character   Mode  :character                                
##                                                                     
##                                                                     
##                                                                     
##                                                                     
##    x22            x23            x24            x25            x26         
##  Mode:logical   Mode:logical   Mode:logical   Mode:logical   Mode:logical  
##  NA's:42538     NA's:42538     NA's:42538     NA's:42538     NA's:42538    
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##    x27            x28            x29            x30            x31         
##  Mode:logical   Mode:logical   Mode:logical   Mode:logical   Mode:logical  
##  NA's:42538     NA's:42538     NA's:42538     NA's:42538     NA's:42538    
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##    x32            x33            x34            x35            x36         
##  Mode:logical   Mode:logical   Mode:logical   Mode:logical   Mode:logical  
##  NA's:42538     NA's:42538     NA's:42538     NA's:42538     NA's:42538    
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##    x37            x38            x39            x40            x41         
##  Mode:logical   Mode:logical   Mode:logical   Mode:logical   Mode:logical  
##  NA's:42538     NA's:42538     NA's:42538     NA's:42538     NA's:42538    
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##    x42            x43            x44            x45            x46         
##  Mode:logical   Mode:logical   Mode:logical   Mode:logical   Mode:logical  
##  NA's:42538     NA's:42538     NA's:42538     NA's:42538     NA's:42538    
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##    x47            x48            x49            x50            x51         
##  Mode:logical   Mode:logical   Mode:logical   Mode:logical   Mode:logical  
##  NA's:42538     NA's:42538     NA's:42538     NA's:42538     NA's:42538    
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##    x52            x53            x54            x55            x56         
##  Mode:logical   Mode:logical   Mode:logical   Mode:logical   Mode:logical  
##  NA's:42538     NA's:42538     NA's:42538     NA's:42538     NA's:42538    
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##    x57            x58            x59            x60            x61         
##  Mode:logical   Mode:logical   Mode:logical   Mode:logical   Mode:logical  
##  NA's:42538     NA's:42538     NA's:42538     NA's:42538     NA's:42538    
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##    x62            x63            x64            x65            x66         
##  Mode:logical   Mode:logical   Mode:logical   Mode:logical   Mode:logical  
##  NA's:42538     NA's:42538     NA's:42538     NA's:42538     NA's:42538    
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##    x67            x68            x69            x70            x71         
##  Mode:logical   Mode:logical   Mode:logical   Mode:logical   Mode:logical  
##  NA's:42538     NA's:42538     NA's:42538     NA's:42538     NA's:42538    
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##    x72            x73            x74            x75            x76         
##  Mode:logical   Mode:logical   Mode:logical   Mode:logical   Mode:logical  
##  NA's:42538     NA's:42538     NA's:42538     NA's:42538     NA's:42538    
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##    x77            x78             x79             x80         
##  Mode:logical   Mode:logical   Mode :logical   Mode :logical  
##  NA's:42538     NA's:42538     FALSE:2660      FALSE:2718     
##                                NA's :39878     NA's :39820    
##                                                               
##                                                               
##                                                               
## 

Clean the data

Are there any redundant columns and rows? Are all the variables in the correct format (e.g., numeric, factor, date)? Lets fix it.

The variable “loan_status” contains information as to whether the loan has been repaid or charged off (i.e., defaulted). I will create a binary factor variable for this and focus on it in my further analysis.

lc_clean<- lc_raw %>%
  dplyr::select(-x20:-x80) %>% #delete empty columns
  filter(!is.na(int_rate)) %>%   #delete empty rows
  mutate(
    issue_d = lubridate::mdy(issue_d),  # lubridate::mdy() to fix date format
    term = factor(term_months),     # turn 'term' into a categorical variable
    delinq_2yrs = factor(delinq_2yrs) # turn 'delinq_2yrs' into a categorical variable
  ) %>% 
  mutate(default = dplyr::recode(loan_status, 
                      "Charged Off" = "1", 
                      "Fully Paid" = "0"))%>%
    mutate(default = as.factor(default)) %>%
  dplyr::select(-emp_title,-installment, -term_months, everything()) #move some not-so-important variables to the end. 

Explore the data

I will explore loan defaults by creating different visualizations. I will start with examining how prevalent defaults are, whether the default rate changes by loan grade or number of delinquencies, and a couple of scatter plots of defaults against loan amount and income.

#bar chart of defaults
def_vis1<-ggplot(data=lc_clean, aes(x=default)) +
  geom_bar(aes(y = (..count..)/sum(..count..))) + 
  labs(x="Default, 1=Yes, 0=No", y="relative frequencies") +
  scale_y_continuous(labels=scales::percent) +
  geom_text(aes( label = scales::percent((..count..)/sum(..count..) ),y=(..count..)/sum(..count..) ), stat= "count",vjust=-0.5) 

def_vis1

#bar chart of defaults per loan grade
def_vis2<-ggplot(data=lc_clean, aes(x=default), group=grade) +
  geom_bar(aes(y = (..count..)/sum(..count..), fill = factor(..x..)), stat="count")  +
  labs(title="Defaults by Grade", x="Default, 1=Yes, 0=No", y="relative frequencies") +
  scale_y_continuous(labels=scales::percent) +facet_grid(~grade) +
  theme(legend.position = "none") +
  geom_text(aes( label = scales::percent((..count..)/sum(..count..) ),y=(..count..)/sum(..count..) ), stat= "count",vjust=-0.5) 

def_vis2

#bar chart of defaults per number of Delinquencies
def_vis3<-lc_clean %>%
  filter(as.numeric(delinq_2yrs)<4) %>%
  ggplot(aes(x=default), group=delinq_2yrs) +
  geom_bar(aes(y = (..count..)/sum(..count..), fill = factor(..x..)), stat="count")  +
  labs(title="Defaults by Number of Delinquencies", x="Default, 1=Yes, 0=No", y="relative frequencies") +
  scale_y_continuous(labels=scales::percent) +
  facet_grid(~delinq_2yrs) +
  theme(legend.position = "none") +
  geom_text(aes( label = scales::percent((..count..)/sum(..count..) ),y=(..count..)/sum(..count..) ), stat= "count",vjust=-0.5)

def_vis3

#scatter plots 

#We select 2000 random loans to display only to make the display less busy. 
set.seed(1234)
reduced<-lc_clean[sample(0:nrow(lc_clean), 2000, replace = FALSE),]%>%
  mutate(default=as.numeric(default)-1) # also convert default to a numeric {0,1} to make it easier to plot.

          
# scatter plot of defaults against loan amount                         
def_vis4<-ggplot(data=reduced, aes(y=default,x=I(loan_amnt/1000)))  +
  labs(y="Default, 1=Yes, 0=No", x="Loan Amnt (1000 $)") +
  geom_jitter(width=0, height=0.05, alpha=0.7)

def_vis4

#scatter plot of defaults against loan amount.
def_vis5<-ggplot(data=reduced, aes(y=default,x=I(annual_inc/1000))) +
  labs(y="Default, 1=Yes, 0=No", x="Annual Income(1000 $)") +
  geom_jitter(width=0, height=0.05, alpha=0.7) +
  xlim(0,400)

def_vis5

I will also estimate a correlation table between defaults and other continuous variables.

# correlation table using GGally::ggcor()
# this takes a while to plot
lc_clean %>% 
    mutate(default=as.numeric(default)-1)%>%
  select(loan_amnt, dti, annual_inc, default) %>% #keep Y variable last
 ggcorr(method = c("pairwise", "pearson"), label_round=2, label = TRUE)

ggplot(lc_clean, aes(x= loan_amnt,y= default, colour=term)) +
  geom_boxplot() +
  facet_wrap(~grade) +
  labs(title= "Average annual income across all the grades for term 36 and 60", y="Default", x="Annual Income(1000 $)")

lc_clean %>% 
  select(grade, int_rate, default) %>% 
           ggpairs(aes(colour=default), alpha=1)+
  theme_bw()

From the graph it is interesting to observe that across particular loan grades from A to G the loan amount tends to increase in both terms for 36 months and 60 months. However, the loan amount for 36 months always stays slower disregard grade or default.

From the ggpairs plot, we can draw a few conclusions. First of all, the amount of data on default=1 is much smaller than the smount of data on default=0. Second of all, the biggest group by grade is group B, and groups that mostly tend to default are B, C or D. Interestingly, the interest rate tends to be bigger for people who fail to pay the loan back. Also the interest rate is the lowest for grade A and increases across the grades.


Linear vs. logistic regression for binary response variables

It is certainly possible to use the OLS approach to find the line that minimizes the sum of square errors when the dependent variable is binary (i.e., default no default). In this case, the predicted values take the interpretation of a probability. I can also estimate a logistic regression instead. I will do both below.

model_lm<-lm(as.numeric(default)~I(annual_inc/1000), lc_clean)
summary(model_lm)
## 
## Call:
## lm(formula = as.numeric(default) ~ I(annual_inc/1000), data = lc_clean)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.160 -0.149 -0.144 -0.133  1.327 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.16e+00   2.70e-03   429.3   <2e-16 ***
## I(annual_inc/1000) -2.48e-04   2.92e-05    -8.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.35 on 37867 degrees of freedom
## Multiple R-squared:  0.0019, Adjusted R-squared:  0.00188 
## F-statistic: 72.2 on 1 and 37867 DF,  p-value: <2e-16
logistic1<-glm(default~I(annual_inc/1000), family="binomial", lc_clean)
summary(logistic1)
## 
## Call:
## glm(formula = default ~ I(annual_inc/1000), family = "binomial", 
##     data = lc_clean)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -0.625  -0.579  -0.557  -0.511   3.639  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -1.51916    0.02924   -52.0   <2e-16 ***
## I(annual_inc/1000) -0.00408    0.00040   -10.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 31130  on 37868  degrees of freedom
## Residual deviance: 31005  on 37867  degrees of freedom
## AIC: 31009
## 
## Number of Fisher Scoring iterations: 5
ggplot(data=reduced, aes(x=I(annual_inc/1000), y=default)) +
  geom_smooth(method="lm", se=0, aes(color="OLS"))+ geom_smooth(method = "glm", method.args = list(family = "binomial"),  se=0, aes(color="Logistic")) +
  labs(y="Prob of Default", x="Annual Income(1000 $)") +  xlim(0,450)+scale_y_continuous(labels=scales::percent) +
  geom_jitter(width=0, height=0.05, alpha=0.7) +
  scale_colour_manual(name="Fitted Model", values=c("blue", "red"))

Conclusion

Logistic model in the case when real y (default) takes values equal to 0 or 1 is much better than OLS. It is because it avoids the negative values for probability, which are impossible to achieve empirically and not explainable.


Multivariate logistic regression

I can estimate logistic regression with multiple explanatory variables as well. For that, I will use annual_inc, term, grade, and loan amount as features.

logistic2<-glm(default~I(annual_inc/1000)+term+grade+loan_amnt, family="binomial", lc_clean)
summary(logistic2)
## 
## Call:
## glm(formula = default ~ I(annual_inc/1000) + term + grade + loan_amnt, 
##     family = "binomial", data = lc_clean)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.063  -0.608  -0.482  -0.345   4.099  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -2.43e+00   5.06e-02  -48.09   <2e-16 ***
## I(annual_inc/1000) -6.02e-03   4.73e-04  -12.73   <2e-16 ***
## term60              4.79e-01   3.56e-02   13.45   <2e-16 ***
## gradeB              6.60e-01   5.27e-02   12.52   <2e-16 ***
## gradeC              1.03e+00   5.39e-02   19.13   <2e-16 ***
## gradeD              1.29e+00   5.70e-02   22.58   <2e-16 ***
## gradeE              1.42e+00   6.66e-02   21.24   <2e-16 ***
## gradeF              1.67e+00   8.63e-02   19.30   <2e-16 ***
## gradeG              1.77e+00   1.34e-01   13.20   <2e-16 ***
## loan_amnt           2.84e-06   2.35e-06    1.21     0.23    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 31130  on 37868  degrees of freedom
## Residual deviance: 29286  on 37859  degrees of freedom
## AIC: 29306
## 
## Number of Fisher Scoring iterations: 5
#compare the fit of logistic 1 and logistic 2
anova(logistic1,logistic2)
Resid. DfResid. DevDfDeviance
3.79e+043.1e+04        
3.79e+042.93e+0481.72e+03

Explanation of the notions:

  1. Estimated Coefficient The Estimated Coeffiecient explains how a change in a particular varaible influence directly the risk factor decrease/increase. Risk factor is connected to the outcome y (probability) though the logistic function in the model. Sign of the coefficient has a general interpretation - if negative, the change in variable by 1 will cause decrease in the probability. If coeffiecient is positive, the change in variable by 1 will cause increase in probability.

  2. Standard error of coefficient The magnitude of the standard error estimates how estimated coefficient would be affected if different data are fed in the model. It is also useful when constructing the 95% Confidence Interval for the true coefficients.

  3. p-value of coefficient The p-value of the coefficient is connected to the Z-score in the Z-test. If the p-value is smaller than 0.05, the coeffieint is signifficant. The p-value describes the probability of the coeffiecient’s value falling inside the 95% confidence interval.

  4. Deviance Deviance (or -2 log-likelihood (-2LL) statistic) is a goodness-of-fit statistic for a statistical model. We often use it for statistical hypothesis testing. Deviance is a generalization of the idea of using the sum of squares of residuals in ordinary least squares to cases where model-fitting is achieved by maximum likelihood. It tells us how much unexplained variation there is in our logistic regression model - the higher the value the less accurate the model.

  5. AIC The Akaike Information Critetion rectifies deviance by penalizing number of coefficients. In other words AIC is an estimator of out-of-sample prediction error. The furmula is as follows: =-2log(L)+2k, where k is the number of coefficients. Again, the lower the score, the more accurate our model. AIC is used most frequently in situations when one is not able to easilty test model’s performance with use of machine learning (when having a small dataset or time series).

  6. Null Deviance It is the outcome of Deviance, when model depends only on intercept.

Conclusion

FOr our predicitons, logistic 2 model is better than logistic 1 model as both deviance and AIC are lower.


Predicting probability

#Predict the probability of default
prob_default2<-predict(logistic2, lc_clean, type="response")

#plot 1: Density of predictions
g0<-ggplot( lc_clean, aes( prob_default2 ) )+
  geom_density( size=1)+
  ggtitle( "Predicted Probability with Logistic 2" )+  xlab("Estimated Probability")
g0

#plot 2: Density of predictions by default
g1<-ggplot( lc_clean, aes( prob_default2, color=default) ) +
  geom_density( size=1)+
  ggtitle( "Predicted Probability with Logistic 2 by default" )+  xlab("Estimated Probability")
g1

From probability to classification

The logistic regression model gives us a sense of how likely defaults are; it gives us a probability estimate. To convert this into a prediction, we need to choose a cutoff probability and classify every loan with a predicted probability of default above the cutoff as a prediction of default (and vice versa for loans with a predicted probability below this cutoff).

Let’s choose a threshold of 20%. Of course some of our predictions will turn out to be right but some will turn out to be wrong – you can see this in the density figures of the previous section. Let’s call “default” the “positive” class since this is the class we are trying to predict. We could be making two types of mistakes. False positives (i.e., predict that a loan will default when it will not) and false negatives (I.e., predict that a loan will not default when it does). These errors are summarized in the confusion matrix.

Confusion matrix

I will now produce the confusion matrix for the model logistic 2 for a cutoff of 18%.

#using the logistic 2 model predict default probabilities
prob_default2<- predict(logistic2, lc_clean, type="response")
  
#Call any loan with probability more than 18% as default and any loan with lower probability as non-default.
p_class<-factor(ifelse(prob_default2>0.18,"1","0"),levels=levels(lc_clean$default))
  
#produce the confusion matrix and set default as the positive outcome
con2<-confusionMatrix(p_class,lc_clean$default,positive="1") 

#print the confusion matrix
con2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 24674  2818
##          1  7766  2611
##                                         
##                Accuracy : 0.721         
##                  95% CI : (0.716, 0.725)
##     No Information Rate : 0.857         
##     P-Value [Acc > NIR] : 1             
##                                         
##                   Kappa : 0.175         
##                                         
##  Mcnemar's Test P-Value : <2e-16        
##                                         
##             Sensitivity : 0.4809        
##             Specificity : 0.7606        
##          Pos Pred Value : 0.2516        
##          Neg Pred Value : 0.8975        
##              Prevalence : 0.1434        
##          Detection Rate : 0.0689        
##    Detection Prevalence : 0.2740        
##       Balanced Accuracy : 0.6208        
##                                         
##        'Positive' Class : 1             
## 

Using the confusion matrix, I can calculate the following:

a. Accuracy Percentage of status correctly predicted, regardless of default status. (2611+24674)/(2611+2818+24674+7766)=0.7205

b. Sensitivity Proportion of good loans correctly identified as non-default. 2611/(2611+2818)=0.48094

c. Specificity Proportion of bad loans correctly identified as default. 24674/(24674+7766)=0.76060

ROC curve and AUS

Using the model logistic 2 I will produce the ROC curve and calculate the AUC measure.

#estimate the ROC curve for Logistic 2
ROC_logistic2 <- roc(lc_clean$default,prob_default2)

#estimate the AUC for Logistic 2 and round it to two decimal places
AUC2<-round(auc(lc_clean$default,prob_default2)*100, digits=2)
#Plot the ROC curve and display the AUC in the title
ROC2<-ggroc(ROC_logistic2,  alpha = 0.5)+ ggtitle(paste("Model Logistic 2: AUC=",round(auc(lc_clean$default,prob_default2)*100, digits=2),"%"))  + geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), color="grey", linetype="dashed")+geom_segment(aes(x = 1, xend = 1, y = 0, yend = 1), color="black", linetype="dashed")+geom_segment(aes(x = 1, xend = 0, y = 1, yend = 1), color="black", linetype="dashed")

ROC2

The ROC curve shows for any given level of True Positives, what level of True Negatives we can expect.The stright line shows the situation in which instead of a model we were using the probability of 0.5. Anything about the line is better than the random prediction. We aim for the ROC to be as close to the (1.00,1.00) point as possible. The AUC measure measures the area under the curve. Again for random prediciton (stright line) the AUC=0.5. The better the AUC the better the model. We expect the AUC of any predictive model to be between 0.5 and 1 as we expect is to be better than the random prediciton based on 0.5 chance. In case of a bad model, the AUC could be equal to the number below 0.5. However, in that case, we should immediately drop the model, as it means, that our model predictions are worse that the random chance of being correct. The AUC cannot be equal to the number above 1, as in case AUC=1, the model predictions are already 100% correct.

Out of sample ROC curve estimation

So far we have only worked in-sample. I will now split the data into training and testing and estimate the models ROC curve and AUC measure out of sample.

# splitting the data into training and testing
set.seed(1234)
train_test_split <- initial_split(lc_clean, prop = 0.8)
testing <- testing(train_test_split) #20% of the data is set aside for testing
training <- training(train_test_split) #80% of the data is set aside for training

# run logistic 2 on the training set 
logistic2_in<-glm(default~I(annual_inc/1000)+term+grade+loan_amnt, family="binomial", training)

#calculate probability of default in the training sample 
p_in<-predict(logistic2_in, training, type = "response")
  
#ROC curve using in-sample predictions
ROC_logistic2_in <- roc(training$default,p_in)
#AUC using in-sample predictions
AUC_logistic2_in<-round(auc(training$default,p_in)*100, digits=2)
  
#calculate probability of default out of sample 
p_out<-predict(logistic2_in, testing, type = "response")

#ROC curve using out-of-sample predictions
ROC_logistic2_out <- roc(testing$default,p_out)
#AUC using out-of-sample predictions
AUC_logistic2_out <- round(auc(testing$default,p_out)*100, digits=2)
#plot in the same figure both ROC curves and print the AUC of both curves in the title
ggroc(list("Logistic 2 in-sample"=ROC_logistic2_in, "Logistic 2 out-of-sample"=ROC_logistic2_out))+ggtitle(paste("Model Logistic 2 in-sample AUC=",round(auc(training$default,p_in)*100, digits=2),"%\nModel Logistic 2 out-of-sample AUC=",round(auc(testing$default,p_out)*100, digits=2),"%"))    +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), color="grey", linetype="dashed")

There is a very weak evidence in overfitting, because the ROC curve and AUC statistic has remained roughly the same in both in-sample and out-of-sample cases. Therefore we can assume that overfitting does not occur in our model.

Selecting loans to invest using the model Logistic 2.

Before I look for a better model than logistic 2 let’s see how I can use this model to select loans to invest. Let’s make the simplistic assumption that every loan generates $20 profit if it is paid off and $100 loss if it is charged off for an investor. Let’s use a cut-off value to determine which loans to invest in, that is, if the predicted probability of default for a loan is below this value then I invest in that loan and not if it is above.

To do this I will split the data in three parts: training, validation, and testing.

# splitting the data into training and testing
set.seed(121)
train_test_split <- initial_split(lc_clean, prop = 0.6)
training <- training(train_test_split) #60% of the data is set aside for training
remaining <- testing(train_test_split) #40% of the data is set aside for validation & testing
set.seed(121)
train_test_split <- initial_split(remaining, prop = 0.5)
validation<-training(train_test_split) #50% of the remaining data (20% of total data) will be used for validation
testing<-testing(train_test_split) #50% of the remaining data (20% of total data) will be used for testing

Finding cut-off threshold

I will now use the trained model to determine the optimal cut-off threshold based on the validation test.

logistic2_in<-glm(default~I(annual_inc/1000)+term+grade+loan_amnt, family="binomial", training)
summary(logistic2_in)
## 
## Call:
## glm(formula = default ~ I(annual_inc/1000) + term + grade + loan_amnt, 
##     family = "binomial", data = training)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.053  -0.607  -0.487  -0.338   3.237  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -2.49e+00   6.64e-02  -37.54   <2e-16 ***
## I(annual_inc/1000) -5.70e-03   6.08e-04   -9.38   <2e-16 ***
## term60              5.02e-01   4.59e-02   10.94   <2e-16 ***
## gradeB              7.34e-01   6.89e-02   10.65   <2e-16 ***
## gradeC              1.07e+00   7.08e-02   15.09   <2e-16 ***
## gradeD              1.32e+00   7.48e-02   17.59   <2e-16 ***
## gradeE              1.42e+00   8.74e-02   16.27   <2e-16 ***
## gradeF              1.70e+00   1.11e-01   15.32   <2e-16 ***
## gradeG              1.78e+00   1.72e-01   10.35   <2e-16 ***
## loan_amnt           1.45e-06   3.03e-06    0.48     0.63    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 18591  on 22721  degrees of freedom
## Residual deviance: 17493  on 22712  degrees of freedom
## AIC: 17513
## 
## Number of Fisher Scoring iterations: 5
p_val<-predict(logistic2_in, validation, type = "response") #predict probability of default on the validation set

#we select the cutoff threshold using the estimated model and the validation set
profit=0
threshold=0
for(i in 1:100) {
  threshold[i]=i/400
  one_or_zero_search<-ifelse(p_val>threshold[i],"1","0")
  p_class_search<-factor(one_or_zero_search,levels=levels(validation$default))

  con_search<-confusionMatrix(p_class_search,validation$default,positive="1")
  profit[i]=con_search$table[1,1]*20-con_search$table[1,2]*100
}

ggplot(as.data.frame(threshold), aes(x=threshold,y=profit)) + geom_smooth(method = 'loess', se=0) +labs(title="Profit curve with logistic 2 based on validation set, ")

paste0("Based on the validation set: Maximum profit per loan is $", round(max(profit)/nrow(validation),2), " achieved at a threshold of ", threshold[which.is.max(profit)]*100,"%.")
## [1] "Based on the validation set: Maximum profit per loan is $5.55 achieved at a threshold of 17%."
#calculate probability of default out of sample 
p_out<-predict(logistic2_in, testing, type = "response")

#we select the cutoff threshold using the estimated model and the testing set
profitt=0
thresholdt=0
for(i in 1:100) {
  thresholdt[i]=i/400
  one_or_zero_search_t<-ifelse(p_out>thresholdt[i],"1","0")
  p_class_search_t<-factor(one_or_zero_search_t,levels=levels(testing$default))

  con_search_t<-confusionMatrix(p_class_search_t,testing$default,positive="1")
  profitt[i]=con_search_t$table[1,1]*20-con_search_t$table[1,2]*100
}

ggplot(as.data.frame(thresholdt), aes(x=thresholdt,y=profitt)) + geom_smooth(method = 'loess', se=0) +labs(title="Profit curve with logistic 2 based on testing set, ")

paste0("Based on the testing set: Maximum profit per loan is $", round(max(profitt)/nrow(testing),2), " achieved at a threshold of ", thresholdt[which.is.max(profitt)]*100,"%.")
## [1] "Based on the testing set: Maximum profit per loan is $5.62 achieved at a threshold of 17.75%."

Based on the validation set, we can observe that the maximum profit per loan is >’$’5.55
achieved at a threshold of 17% optimal cutoff. In case of using the testing set, I obtain also the cutoff point of 17.5%, however the maximum profit per loan amount to $5.62.

More realistic revenue model

I will now build a more realistic profit and loss model. Each loan has different terms (e.g., different interest rate and different duration) and therefore a different return if fully paid. For example, a 36 month loan of $5000 with installment of $163 per month would generate a return of 163*36/5000-1 if there was no default. Let’s assume that it would generate a loss of -60% if there was a default (the loss is not 100% because the loan may not default immediately and/or the lending club may be able to recover part of the loan).

validation2<-validation%>%
  mutate(default1=as.numeric(default)-1,
         return= ifelse (default1==1, -0.6, term_months*installment/loan_amnt-1))

validation2 %>% 
  summarize(total_return_percent = sum(return)/n()*100)
total_return_percent
11.5
summary(validation2)
##     int_rate       loan_amnt          dti         delinq_2yrs  
##  Min.   :0.050   Min.   :  500   Min.   : 0.00   0      :6766  
##  1st Qu.:0.090   1st Qu.: 5500   1st Qu.: 8.07   1      : 630  
##  Median :0.120   Median :10000   Median :13.32   2      : 129  
##  Mean   :0.121   Mean   :11284   Mean   :13.24   3      :  32  
##  3rd Qu.:0.150   3rd Qu.:15000   3rd Qu.:18.51   4      :   9  
##  Max.   :0.240   Max.   :35000   Max.   :29.99   5      :   4  
##                                                  (Other):   4  
##    annual_inc         grade            emp_length        home_ownership    
##  Min.   :   4200   Length:7574        Length:7574        Length:7574       
##  1st Qu.:  41000   Class :character   Class :character   Class :character  
##  Median :  60000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :  69926                                                           
##  3rd Qu.:  83000                                                           
##  Max.   :6000000                                                           
##                                                                            
##  verification_status    issue_d             zip_code          addr_state       
##  Length:7574         Min.   :2007-07-01   Length:7574        Length:7574       
##  Class :character    1st Qu.:2010-05-01   Class :character   Class :character  
##  Mode  :character    Median :2011-02-01   Mode  :character   Mode  :character  
##                      Mean   :2010-11-04                                        
##                      3rd Qu.:2011-08-01                                        
##                      Max.   :2011-12-01                                        
##                                                                                
##  loan_status            desc             purpose             title          
##  Length:7574        Length:7574        Length:7574        Length:7574       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  term      default   term_months    installment    emp_title        
##  36:5570   0:6490   Min.   :36.0   Min.   :  16   Length:7574       
##  60:2004   1:1084   1st Qu.:36.0   1st Qu.: 170   Class :character  
##                     Median :36.0   Median : 295   Mode  :character  
##                     Mean   :42.4   Mean   : 334                     
##                     3rd Qu.:60.0   3rd Qu.: 451                     
##                     Max.   :60.0   Max.   :1303                     
##                                                                     
##     default1         return      
##  Min.   :0.000   Min.   :-0.600  
##  1st Qu.:0.000   1st Qu.: 0.120  
##  Median :0.000   Median : 0.183  
##  Mean   :0.143   Mean   : 0.115  
##  3rd Qu.:0.000   3rd Qu.: 0.254  
##  Max.   :1.000   Max.   : 0.723  
## 

According to the assumption above, I reach a return of 11.45%.

Unfortunately, I cannot use the realized return to select loans to invest (as at the time we make the investment decision we do not know which loan will default). Instead, I can calculate an expected return using the estimated probabilities of default – expected return = return if not default * (1-prob(default)) + return if default * prob(default).

Expected return of the loans

In the nest step, I will calculate the expected return of the loans in the validation set using the logistic 2 model trained in the training set. I will check how the realized return varies as I change \(n\).

expected_return <- p_val*(-0.6)+(1-p_val)*(validation2$term_months*validation2$installment/validation2$loan_amnt-1)

expected_return <-as.data.frame(expected_return) %>% 
  arrange(-expected_return) 
  
expected_return_800<-expected_return %>% head(800)

expected_return_800%>% 
  summarize(expected_return_in_percent = sum(expected_return)/800*100)
expected_return_in_percent
23.6
expected_return_1500<-expected_return %>% head(1500)

expected_return_1500%>% 
  summarize(expected_return_in_percent =sum(expected_return)/1500*100)
expected_return_in_percent
20.5
expected_return_8000<-expected_return %>% head(8000)

expected_return_8000%>% 
  summarize(expected_return_in_percent =sum(expected_return)/8000*100)
expected_return_in_percent
11
expected_return_80<-expected_return %>% head(80)

expected_return_80%>% 
  summarize(expected_return_in_percent =sum(expected_return)/80*100)
expected_return_in_percent
35.2

In case n=800, our expected return amount to 23.64%. In case of increasing n, the perctange value of expected return decreases. If we decrease \(n\), then the expected return in percent increases. We can use the expected return metric as one of the means to select a portfolio, however we should not rely only on this measure. The less loans we choose, the less nominal profit we will obtain. Therefore, it would be smart to use other measures to complement this approach.


For \(n=800\), I will check how sensitive is my answer to the assumption that if a loan defaults I lose 60% of the value.

To answer this question I will assess how the realized return of the 800 loans chosen in my portfolio change if the loss proportion varies from 20%-80%.

expected_return20 <- p_val*(-0.2)+(1-p_val)*(validation2$term_months*validation2$installment/validation2$loan_amnt-1)

expected_return20 <-as.data.frame(expected_return20) %>% 
  arrange(-expected_return20) 
  
expected_return_800_20<-expected_return20 %>% head(800)

expected_return_800_20%>% 
  summarize(sum(expected_return20)/800*100)
sum(expected_return20)/800 * 100
33.4
expected_return80 <- p_val*(-0.8)+(1-p_val)*(validation2$term_months*validation2$installment/validation2$loan_amnt-1)

expected_return80 <-as.data.frame(expected_return80) %>% 
  arrange(-expected_return80) 
  
expected_return_800_80<-expected_return80 %>% head(800)

expected_return_800_80%>% 
  summarize(sum(expected_return80)/800*100)
sum(expected_return80)/800 * 100
19.4
expected_return40 <- p_val*(-0.4)+(1-p_val)*(validation2$term_months*validation2$installment/validation2$loan_amnt-1)

expected_return40 <-as.data.frame(expected_return40) %>% 
  arrange(-expected_return40) 
  
expected_return_800_40<-expected_return40 %>% head(800)

expected_return_800_40%>% 
  summarize(sum(expected_return40)/800*100)
sum(expected_return40)/800 * 100
28.3

If we lose 20% on default loans, we have expected return on the 800 loans as 33.36%. If we lose 80% on default loans, we have expected return on the 800 loans as 19.44%. The return changes but we still gain profit from it. The expected return increases more if the loss is small. It means that our answer is not this much sensitive to the assupmtion that if the loan defaults, we lose 60% of the value.

Models experimentation

I will now experiment with different models using more features, interactions, and non-linear transformations.

I will include below only my best model and explain why I chose it.

#showing best model
#splitting the data again
set.seed(121)
train_test_split <- initial_split(lc_clean, prop = 0.6)
training <- training(train_test_split) #60% of the data is set aside for training
remaining <- testing(train_test_split) 

train_test_split <- initial_split(remaining, prop = 0.5)
validation<-training(train_test_split)
testing<-testing(train_test_split) 

logistic_wild<-glm(default~
                     poly(I(annual_inc/1000),3)+
                     term+
                     grade+
                     poly(int_rate,3):loan_amnt+
                     issue_d*grade, family="binomial", training)
summary(logistic_wild)
## 
## Call:
## glm(formula = default ~ poly(I(annual_inc/1000), 3) + term + 
##     grade + poly(int_rate, 3):loan_amnt + issue_d * grade, family = "binomial", 
##     data = training)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.211  -0.606  -0.476  -0.320   3.000  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -1.22e+01   2.89e+00   -4.23  2.3e-05 ***
## poly(I(annual_inc/1000), 3)1 -1.49e+02   4.00e+01   -3.72  0.00020 ***
## poly(I(annual_inc/1000), 3)2 -1.88e+02   7.05e+01   -2.66  0.00773 ** 
## poly(I(annual_inc/1000), 3)3 -1.18e+02   2.87e+01   -4.12  3.8e-05 ***
## term60                        5.02e-01   4.91e-02   10.21  < 2e-16 ***
## gradeB                        1.33e+01   3.35e+00    3.97  7.2e-05 ***
## gradeC                        1.31e+01   3.36e+00    3.89  9.9e-05 ***
## gradeD                        1.03e+01   3.57e+00    2.87  0.00409 ** 
## gradeE                        7.04e+00   4.38e+00    1.61  0.10845    
## gradeF                        1.57e+01   6.26e+00    2.51  0.01212 *  
## gradeG                        3.05e+01   1.02e+01    2.98  0.00290 ** 
## issue_d                       6.40e-04   1.93e-04    3.31  0.00092 ***
## poly(int_rate, 3)1:loan_amnt  2.78e-03   4.71e-04    5.91  3.3e-09 ***
## poly(int_rate, 3)2:loan_amnt -1.99e-03   4.46e-04   -4.45  8.6e-06 ***
## poly(int_rate, 3)3:loan_amnt  1.10e-03   2.87e-04    3.82  0.00013 ***
## gradeB:issue_d               -8.65e-04   2.25e-04   -3.85  0.00012 ***
## gradeC:issue_d               -8.32e-04   2.26e-04   -3.69  0.00022 ***
## gradeD:issue_d               -6.25e-04   2.40e-04   -2.61  0.00905 ** 
## gradeE:issue_d               -4.04e-04   2.94e-04   -1.38  0.16879    
## gradeF:issue_d               -9.67e-04   4.19e-04   -2.31  0.02103 *  
## gradeG:issue_d               -1.96e-03   6.87e-04   -2.85  0.00432 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 18591  on 22721  degrees of freedom
## Residual deviance: 17378  on 22701  degrees of freedom
## AIC: 17420
## 
## Number of Fisher Scoring iterations: 10
#calculate probability of default in the training sample 
p_wild_in<-predict(logistic_wild, training, type = "response")

#ROC curve using in-sample predictions
ROC_logistic_wild <- roc(training$default,p_wild_in)
#AUC using in-sample predictions
AUC_logistic_wild<-round(auc(training$default,p_wild_in)*100, digits=2)

#quick check for overfitting

#calculate probability of default out of sample 
p_wild_out<-predict(logistic_wild, testing, type = "response")

#ROC curve using out-of-sample predictions
ROC_logistic_wild_out <- roc(testing$default,p_wild_out)
#AUC using out-of-sample predictions
AUC_logistic_wild_out <- round(auc(testing$default,p_wild_out)*100, digits=2)
#plot in the same figure both ROC curves and print the AUC of both curves in the title
ggroc(list("Logistic Wild in-sample"=ROC_logistic_wild, "Logistic Wild out-of-sample"=ROC_logistic_wild_out))+ggtitle(paste("Model Logistic Wild in-sample AUC=",round(auc(training$default,p_wild_in)*100, digits=2),"%\nModel Logistic Wild out-of-sample AUC=",round(auc(testing$default,p_wild_out)*100, digits=2),"%"))    +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), color="grey", linetype="dashed")

#very weak evidence of overfitting, we can compare now our model logistic_wild to logistic2
  
logistic2_in<-glm(default~I(annual_inc/1000)+term+grade+loan_amnt, family="binomial", training)

#calculate probability of default in the training sample of model logistic2_in
p_in<-predict(logistic2_in, training, type = "response")
  
#ROC curve using in-sample predictions
ROC_logistic2_in <- roc(training$default,p_in)
#AUC using in-sample predictions
AUC_logistic2_in<-round(auc(training$default,p_in)*100, digits=2)
#plot in the same figure both ROC curves and print the AUC of both curves in the title
ggroc(list("Logistic 2"=ROC_logistic2_in, "Logistic Wild"=ROC_logistic_wild))+ggtitle(paste("Model Logistic 2 on training set AUC=",round(auc(training$default,p_in)*100, digits=2),"%\nModel Logistic Wild on training set AUC=",round(auc(training$default,p_wild_in)*100, digits=2),"%"))    +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), color="grey", linetype="dashed")

#calculate probability of default in the training sample 
p_wild_val<-predict(logistic_wild, validation, type = "response")
expected_return60l <- p_wild_val*(-0.6)+(1-p_wild_val)*(validation$term_months*validation$installment/validation$loan_amnt-1)

expected_return60l <-as.data.frame(expected_return60l) %>% 
  arrange(-expected_return60l) 

expected_return_800_60l<-expected_return60l %>% head(800)

expected_return_800_60l%>% 
  summarize(expected_return_in_percent= sum(expected_return60l)/800*100)
expected_return_in_percent
24.3
p_val<-predict(logistic2_in, validation, type = "response")
expected_return <- p_val*(-0.6)+(1-p_val)*(validation$term_months*validation$installment/validation$loan_amnt-1)

expected_return <-as.data.frame(expected_return) %>% 
  arrange(-expected_return) 

expected_return_800<-expected_return %>% head(800)

expected_return_800%>% 
  summarize(expected_return_in_percent = sum(expected_return)/800*100)
expected_return_in_percent
23.7

My best model is model logistic_wild. I managed to achieve the AIC at the level of 17420 and AUC of the model accounting to 68,43%.When it comes to expected returns, the ones based on model logistic_wild bring return amounting to 24,34% (in comparison to 23,70% from model logistic2) basing it on the validation set. When searching the model, I tried the LASSO method on training split, I tried as well to add variables on inflation and price, however, they would increase my AIC and decrease the AUC. Therefore, I sticked to model logistic_wild.

Predicting outcomes for the testing data

The file “Assessment Data_2020.csv” contains information on almost 1800 new loans. I will use my best model form the previous point to choose 200 loans to invest. For this I will assume that the loss proportion is 60%.

lc_assessment<- read_csv("~/Documents/LBS/Data Science/Session 05/Assessment Data_2020.csv") 
 lc_assessment<-  lc_assessment %>% 
   janitor::clean_names() %>% # use janitor::clean_names() 
  mutate(
    issue_d = lubridate::mdy(issue_d),  # lubridate::mdy() to fix date format
    term = factor(term_months),     # turn 'term' into a categorical variable
    delinq_2yrs = factor(delinq_2yrs)) # turn 'delinq_2yrs' into a categorical variable
 
#  lc_assessment %>% head(10)
  
 p_assessment<-predict(logistic_wild, lc_assessment, type = "response")

  p_assessment %>% head(10)
##      1      2      3      4      5      6      7      8      9     10 
## 0.0873 0.1225 0.0498 0.1685 0.2225 0.1730 0.1021 0.0771 0.1113 0.1775
   expected_return_assessment <- p_assessment*(-0.6)+(1-p_assessment)*(lc_assessment$term_months*lc_assessment$installment/lc_assessment$loan_amnt-1)

lc_assessment_total<- lc_assessment %>% 
  mutate(expected_return=expected_return_assessment ) %>% 
arrange(-expected_return ) 

lc_assessment_total_200 <- lc_assessment_total %>%  
  head(200) %>% 
  mutate( agnieszka_prawda= 1)



merged_assessment<- merge(x = lc_assessment_total, y = lc_assessment_total_200, by = "loan_number", all.x = TRUE) %>% 
  select(loan_number, agnieszka_prawda) 
  
merged_assessment<- replace_na(merged_assessment,list(agnieszka_prawda=0))


merged_assessment %>% head(20) #glimpse for 10 rows
loan_numberagnieszka_prawda
10
20
30
40
51
60
70
80
90
100
110
120
130
140
151
160
170
180
190
201

Critique

No data science engagement is perfect. Therefore, I decided to include a critique of my project in here to reflect on the limitations of the analysis and ways for future improvement.

First of all, I believe collecting some more historical data on all the loan takers, on how they coped with their liabilities and credit history. I tried to use models on inflation and on bond prices however, the data did not improve my model, so I decided to drop it. Before using it in practice, I would like to try and find true value of the return in case of default instead of 60% assumed. I would also try to run Rigde regression and see how is compares to LASSO ( the best model in the homework).

Data on gender or race

We do not include data on race or gender as our model could become bias and disciminate poeple of particular sex or skin color. Even if such variables may be good predictors, they may sometimes be not the best choice for machine learning if not carefully trained and considered. It is an aesthetic approach to data and modelling. We need to remember that when using machine learning methods, the models may sometimes take spourious values as significant ones and become sexist or racist.