Credit Card Risk Analytics using Random Forest, XGBoost and Logistic Regression

Jan 17, 2021

Business understanding:

CredX is a leading credit card provider that gets thousands of credit card applicants every year. But in the past few years, it has experienced an increase in credit loss. The CEO believes that the best strategy to mitigate credit risk is to ‘acquire the right customers’.

In this project, we will help CredX identify the right customers using predictive models. Using past data on the bank’s applicants, we need to determine the factors affecting credit risk, create strategies to mitigate acquisition risk, and assess the financial benefit of the project.

Goals:

Help the business identify the key risk features to consider before approving a credit card.
Create a model that predicts the approval decision.
Create an application score card to be used in the approval process.

CRISP-DM Framework

Following the CRISP-DM framework, this document is organized into the topics below.
1) Business Understanding
2) Data Understanding
3) Data Preparation
4) Data Modelling
5) Model Evaluation and Insights Delivery

Problem Solving Methodology

Data Understanding

We have two datasets for solving the business problem:

Demographic data:

Gives information about the applicant.
Contains individual-level information such as age, gender, income and marital status.

Credit bureau data:

Gives the credit history recorded for each individual by the credit bureau.
Contains information such as outstanding balance, types of loans availed and number of times defaulted, if any.
In some cases, the credit bureau values are all zero and credit card utilization is missing. These are no-hit cases, where the credit bureau has no record of the applicant.
Where credit card utilization alone is missing, the applicant does not have any other credit card.

Feature Importance based on WOE and Information Value

Weight of Evidence (WOE) and Information Value (IV) are calculated for each feature in the demographic and credit bureau data against the target variable.
The plot shows no strong features in the demographic data, so we expect the model built on demographic plus credit bureau data to do better.
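To make the WOE and IV calculation concrete, here is a minimal sketch in Python with pandas. It is not the code used for this project: the function name `woe_iv`, the quantile binning, the 0.5 smoothing term and the synthetic columns (`utilization`, `noise`, `default`) are all assumptions made for illustration. WOE is taken as ln(share of goods / share of bads) per bin, and IV as the sum of (share of goods - share of bads) * WOE.

```python
import numpy as np
import pandas as pd

def woe_iv(df, feature, target, bins=5):
    """Return a per-bin WOE table and the total IV for one feature.

    Assumes target is 1 for a defaulter ("bad") and 0 otherwise ("good").
    """
    data = df[[feature, target]].copy()
    # Bin numeric features into quantiles; keep categorical features as they are.
    if pd.api.types.is_numeric_dtype(data[feature]) and data[feature].nunique() > bins:
        data["bin"] = pd.qcut(data[feature], q=bins, duplicates="drop")
    else:
        data["bin"] = data[feature]
    grouped = data.groupby("bin", observed=True)[target].agg(total="count", bad="sum")
    grouped["good"] = grouped["total"] - grouped["bad"]
    # Smoothing of 0.5 avoids log(0) for empty cells (an assumption of this sketch).
    k = len(grouped)
    dist_good = (grouped["good"] + 0.5) / (grouped["good"].sum() + 0.5 * k)
    dist_bad = (grouped["bad"] + 0.5) / (grouped["bad"].sum() + 0.5 * k)
    grouped["woe"] = np.log(dist_good / dist_bad)
    grouped["iv"] = (dist_good - dist_bad) * grouped["woe"]
    return grouped, float(grouped["iv"].sum())

# Synthetic data only: default probability rises with utilization, noise is unrelated.
rng = np.random.default_rng(0)
n = 5000
utilization = rng.uniform(0, 100, n)
default = (rng.uniform(0, 1, n) < 0.02 + 0.001 * utilization).astype(int)
demo = pd.DataFrame({
    "utilization": utilization,
    "noise": rng.normal(size=n),
    "default": default,
})

table, iv_util = woe_iv(demo, "utilization", "default")
_, iv_noise = woe_iv(demo, "noise", "default")
print(table[["good", "bad", "woe", "iv"]])
print("IV utilization:", round(iv_util, 3), "IV noise:", round(iv_noise, 3))
```

A feature that separates good and bad customers gets a clearly higher IV than an unrelated one. A common rule of thumb treats IV below roughly 0.02 as not predictive and above roughly 0.3 as strong, though the exact thresholds vary by source.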

Numerical Variable Distributions

The Age column is approximately normally distributed.
The remaining plots show right-skewed distributions.

Categorical Variable Segmented Distributions

There are more male applicants and more married applicants in the dataset.

EDA on Categorical Features Against Default rate

Education: Applicants whose education is categorized as Others show the highest default rate.

Marital Status: The default rate is slightly higher for single applicants than for married applicants.
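Default rates per category like the ones above can be computed with a simple group-by. The sketch below uses a tiny made-up sample; the column names `marital_status` and `default` are placeholders, not the real dataset schema, and the numbers do not come from the CredX data.

```python
import pandas as pd

# Made-up sample for illustration only; 1 in "default" marks a defaulter.
applicants = pd.DataFrame({
    "marital_status": ["Married", "Single", "Married", "Single", "Married", "Single"],
    "default": [0, 1, 0, 0, 1, 1],
})

# The mean of a 0/1 target within each group is that group's default rate.
default_rate = (
    applicants.groupby("marital_status")["default"]
    .agg(applicants="count", default_rate="mean")
    .sort_values("default_rate", ascending=False)
)
print(default_rate)
```

The same pattern works for numerical features once they are binned, for example with `pd.cut` or `pd.qcut`.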

EDA on Numerical Features Against Default Rate - 1

Income: Low-income applicants show a higher default rate.

No of Months in Current Residence: The default rate rises with the number of months in the current residence. This is only a correlation, though, so the underlying cause may be different.

EDA on Numerical Features Against Default Rate - 2

Avg CC utilization in last 12 months: Applicants with high credit card utilization in the last 12 months show a higher default rate.
Number of Trades: The default rate increases with the number of trades.

Correlation Plot for Numeric feature

Model Selection

The model with only demographic data does not give good accuracy or sensitivity.
Logistic Regression on demographic plus credit bureau data gives the best balance of accuracy, sensitivity and specificity.
Random Forest is the second-best model.
We therefore take Logistic Regression as the final model based on the evaluation metrics.

Model Performance

Logistic Regression (Demographic data only)
1. Model accuracy is around 54%.
2. Sensitivity is 0.048%.
3. Specificity is 96%.
4. The optimum cutoff is around 0.042.

Logistic Regression (Demographic + Credit Bureau data)
1. Model accuracy is around 63%.
2. Sensitivity is 63%.
3. Specificity is 59%.
4. The optimum cutoff is around 0.488.

Random Forest
1. Model accuracy is around 62%.
2. Sensitivity is 62%.
3. Specificity is 58%.
4. The optimum cutoff is around 0.5883.

XGBoost (Demographic + Credit Bureau data)
1. Model accuracy is around 57%.
2. Sensitivity is 57%.
3. Specificity is 62%.
4. The optimum cutoff is around 0.84.
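For readers who want to reproduce metrics like these, the sketch below shows how accuracy, sensitivity and specificity depend on the probability cutoff. The article does not say how the optimum cutoff was chosen, so the rule used here (pick the cutoff where sensitivity and specificity are closest) is an assumption. It is one common convention for imbalanced default data. The helper names and the synthetic scores are also illustrative only.

```python
import numpy as np

def metrics_at_cutoff(y_true, y_prob, cutoff):
    """Accuracy, sensitivity and specificity when prob >= cutoff is labelled a default."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= cutoff).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn)   # share of actual defaulters caught
    specificity = tn / (tn + fp)   # share of non-defaulters correctly passed
    return accuracy, sensitivity, specificity

def balanced_cutoff(y_true, y_prob, grid=None):
    """Assumed rule: choose the cutoff where sensitivity and specificity are closest."""
    if grid is None:
        grid = np.linspace(0.01, 0.99, 99)
    gaps = []
    for c in grid:
        _, sens, spec = metrics_at_cutoff(y_true, y_prob, c)
        gaps.append(abs(sens - spec))
    return float(grid[int(np.argmin(gaps))])

# Synthetic scores only: defaulters get somewhat higher probabilities on average.
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, 2000)
y_prob = np.clip(rng.normal(0.4 + 0.2 * y_true, 0.15), 0, 1)

cutoff = balanced_cutoff(y_true, y_prob)
acc, sens, spec = metrics_at_cutoff(y_true, y_prob, cutoff)
print(f"cutoff={cutoff:.2f} accuracy={acc:.3f} sensitivity={sens:.3f} specificity={spec:.3f}")
```

Raising the cutoff trades sensitivity for specificity, which is why the models above are compared on all three metrics together and not on accuracy alone.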

Final Model

The most significant variables captured in the final model, which will be used to develop the application score card, are listed below.

Logistic Regression (Demographic + Credit Bureau Data)

No.of.times.30.DPD.or.worse.in.last.12.months
Avgas.CC.Utilization.in.last.12.months
No.of.PL.trades.opened.in.last.12.months
No.of.Inquiries.in.last.12.months..excluding.home…auto.loans

Model Metrics for Final Model:

Accuracy: 0.631
Specificity: 0.5893
Sensitivity: 0.6333

Application Score Card

The application score card is built on the following basis:

Odds are defined as the probability of a good customer divided by the probability of a bad customer.
Odds should double every 20 points, so factor = 20/log(2).
The offset is set so that a score of 400 corresponds to odds of 10 to 1.
Based on these values, the application score card is built with a cutoff score of 335.
Customers scoring below 335 are likely to default, while customers with higher scores are less likely to default.
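The scaling rules above can be sketched in a few lines of Python. The text gives only the factor, so two things here are assumptions: that log means the natural logarithm, and that the standard scaling score = offset + factor * ln(odds) applies, with the offset chosen so that odds of 10 to 1 map to 400 points. The constant and function names are illustrative.

```python
import math

PDO = 20           # points needed to double the odds
BASE_SCORE = 400   # score assigned at the base odds
BASE_ODDS = 10     # good:bad odds of 10 to 1

FACTOR = PDO / math.log(2)
# Assumed offset: solves BASE_SCORE = OFFSET + FACTOR * ln(BASE_ODDS).
OFFSET = BASE_SCORE - FACTOR * math.log(BASE_ODDS)

def score_from_prob_bad(p_bad):
    """Convert a predicted probability of default into a score card score."""
    odds_good = (1 - p_bad) / p_bad
    return OFFSET + FACTOR * math.log(odds_good)

# Odds of 10:1 and 20:1 correspond to default probabilities of 1/11 and 1/21.
print(round(score_from_prob_bad(1 / 11), 2))
print(round(score_from_prob_bad(1 / 21), 2))
```

Under these assumptions, doubling the odds from 10:1 to 20:1 adds exactly 20 points, which is the property the factor is designed to give.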

Financial Benefit Assessment

As per the Application Score Card:

428 applicants would be rejected based on the score.
That is 428/1425, i.e. 30% of the rejected population.
962/1425, i.e. 67% of the rejected population, are non-defaulters by score who were nevertheless rejected. Approving them would be a financial benefit.

Of the applicants rejected under the old assessment system, we have an opportunity to acquire about 67% by issuing them a credit card, which would increase the bank’s revenue.

Assume the average monthly spend of a credit card customer is $800 and the profit margin is around 5% (MDR + late payment fees), which gives about $40 profit per customer per month on average.

So with the new model, for every 100 rejects we have an opportunity to increase profit by 40 * 67 = $2680 per month.
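The arithmetic behind this estimate is small enough to check directly. The sketch below just restates the assumptions from the text ($800 monthly spend, 5% margin, 67% of old rejects recoverable); the variable and function names are illustrative.

```python
avg_monthly_spend = 800    # USD per customer per month, as assumed in the text
profit_margin = 0.05       # MDR + late payment fees, as assumed in the text
recoverable_share = 0.67   # share of old rejects the score card would accept

profit_per_customer = avg_monthly_spend * profit_margin

def monthly_gain(n_rejects):
    """Estimated extra monthly profit from re-assessing n_rejects old rejections."""
    return n_rejects * recoverable_share * profit_per_customer

print(round(profit_per_customer, 2))
print(round(monthly_gain(100), 2))
```

Because the estimate is linear in the number of rejects, it scales directly, and it is only as good as the spend and margin assumptions behind it.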