Home Default Credit Dataset

Adam Davis
4 min read · Apr 11, 2021


“Develop success from failures. Discouragement and failure are two of the surest stepping stones to success.” ― Dale Carnegie

Many people are more familiar with using Python or R for their analysis needs. I’ve been using Julia for a few years now and it’s been working out pretty well, and there are plenty of wrappers available when you need them. I work in the mortgage industry, so we go through a huge amount of data: credit scores, payment amounts, delinquencies, and so on. Looking for a way to help the company (and myself) a little, I tried to see what would be most useful. The first answer was the chance of a person defaulting on their credit payments, and in turn on a loan we could give them. The next step was to find the best way to determine this. For testing purposes this was done with a neural network instead of logistic regression (the logistic regression test will be in a future post).

A lot of data is publicly available. This dataset came from a Kaggle.com competition a few years ago. The data is a 307,000x122 matrix. I ended up trimming this down substantially and decided that I wanted to predict a loan defaulting rather than not defaulting. We generally expect a loan we give somebody to be paid back; non-payment or defaulting would be a big no-no for us, and we want to prevent it if we can. This test would ideally nip that in the bud pretty early.
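I won’t walk through all of the trimming, but a minimal sketch with CSV.jl and DataFrames.jl looks something like this. The column names below are examples from the competition file, not necessarily the ten I settled on, and the variable names (a, side, top) are just what the later snippets assume.

using CSV, DataFrames

# Load the Kaggle training file and keep only the columns to model on.
df = CSV.read("application_train.csv", DataFrame)
keep = [:TARGET, :AMT_INCOME_TOTAL, :AMT_CREDIT, :AMT_ANNUITY,
        :DAYS_BIRTH, :DAYS_EMPLOYED, :EXT_SOURCE_1, :EXT_SOURCE_2,
        :EXT_SOURCE_3, :CNT_CHILDREN, :CNT_FAM_MEMBERS]
a = Matrix{Float64}(dropmissing(df[:, keep]))   # column 1 is the target, the rest are features

side = size(a, 1)   # number of rows after trimming
top  = size(a, 2)   # index of the last feature column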

Julia has a machine learning library called Flux. It lets us easily set up a neural network model and chain the layers together, and you can also plug in your own custom loss functions and optimizers. Through testing on the dataset I determined that the optimal structure would be a 10x6x6x6x2 network:

using Flux

# 10 input features → three hidden layers of 6 → 2 output classes
m = Chain(Dense(10, 6, sigmoid),
          Dense(6, 6, sigmoid),
          Dense(6, 6, sigmoid),
          Dense(6, 2, sigmoid),
          softmax)

The target column was one-hot encoded, and the problem was set up as a binary classification, still using softmax:

using Random
using Flux: onehotbatch

# Shuffle the rows, then split 90% train / 10% test.
sort_array = shuffle(collect(1:side))
a = a[sort_array, :]
mark = convert(Int, round(side * 0.9))

# Column 1 is the label; the remaining columns are the features.
y_train = a[1:mark, 1]
y_test  = a[(mark + 1):side, 1]
Y       = onehotbatch(y_train, 0:1)
Y_test  = onehotbatch(y_test, 0:1)

# Features are transposed so each column is one observation, as Flux expects.
X      = a[1:mark, 2:top]'
X_test = a[(mark + 1):side, 2:top]'

Feature selection is going to be up to you. For example, the dataset includes a lot of different borrower occupations, and many borrowers were never delinquent in their payments or had good credit.
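If you want a feel for a categorical column before deciding whether to keep it, a quick group count is enough. This is just a sketch, assuming the df from the loading step above and the OCCUPATION_TYPE column from the Kaggle file:

using DataFrames

# How many borrowers fall into each occupation? Sparse categories are
# candidates to drop or bucket before encoding.
occ_counts = combine(groupby(df, :OCCUPATION_TYPE), nrow => :count)
sort!(occ_counts, :count, rev = true)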

Testing the model in the early stages gave better performance with the sigmoid activation than with the relu (rectified linear unit) activation, so sigmoid stayed in. Softmax is there so that the probabilities of defaulting and not defaulting add up to one, and the highest probability is the “guess” the model makes based on the characteristics used from the dataset. The ADAM optimization algorithm was used with its default learning rate; testing showed that accuracy actually decreased with a larger learning rate. The loss and accuracy functions are as follows:

using Flux: logitcrossentropy, onecold; using Statistics: mean

loss(x, y) = logitcrossentropy(m(x), y)
opt = ADAM()   # default learning rate of 0.001
accuracy(x, y) = mean(onecold(m(x)) .== onecold(y))
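If you want to repeat the learning-rate comparison yourself, ADAM takes the rate as its first argument. This is only a sketch of the kind of setting I tested against, not the one I kept:

# a larger step size, for comparison with the default of 0.001
opt_fast = ADAM(0.01)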

Logit cross-entropy had a greater impact on accuracy than plain cross-entropy, and accuracy is what we care about here. To train the model I used a loop rather than a single call over the dataset repeated multiple times, so I could collect the training and test accuracy as well as the loss after each pass:

using Flux: throttle

# `dataset` and `evalcb` weren't shown above; one pass per train! call and a
# throttled progress printout are the usual choices.
dataset = [(X, Y)]
evalcb = () -> @show(loss(X, Y))

tots_loss = []
accu = []
test_accu = []
for j in 1:100
    Flux.train!(loss, Flux.params(m), dataset, opt, cb = throttle(evalcb, 30))
    push!(tots_loss, loss(X, Y))
    push!(accu, accuracy(X, Y))
    push!(test_accu, accuracy(X_test, Y_test))
end
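For comparison, the single-call approach mentioned above (repeating the dataset instead of looping) would look roughly like this. It trains for the same 100 passes, but you only get the final metrics rather than one reading per epoch, which is why I didn’t use it:

# 100 passes over the training data in one train! call
Flux.train!(loss, Flux.params(m), Iterators.repeated((X, Y), 100), opt,
            cb = throttle(evalcb, 30))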

After running the loop, which trains the model for 100 epochs, we have this graph of the accuracy for the training and test sets as well as the loss:
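I won’t reproduce the exact plotting code, but with Plots.jl something along these lines produces that graph, assuming the three vectors collected in the training loop:

using Plots

# accuracy per epoch for the training and test sets
p1 = plot(1:100, Float64.(accu), label = "train accuracy", xlabel = "epoch")
plot!(p1, 1:100, Float64.(test_accu), label = "test accuracy")

# training loss per epoch
p2 = plot(1:100, Float64.(tots_loss), label = "loss", xlabel = "epoch")

plot(p1, p2, layout = (1, 2))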

Not a huge change in the loss here; more fine-tuning could be done. We now want to see what the ROC curve looks like. This is the receiver operating characteristic, a visualization of the true positive rate against the false positive rate. The code for this is:

using Pkg; Pkg.add("ROCAnalysis")
using ROCAnalysis

curve = ROCAnalysis.roc(onecold(Y_test), onecold(m(X_test)))
plot(curve)

This looks like the model had average performance, not poor. How does the confusion matrix look?

using MLBase   # provides confusmat
C = confusmat(2, onecold(Y_test), onecold(m(X_test)))

2×2 Array{Int64,2}:
 0   460
 0  2519
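The accuracy quoted below can be read straight off this matrix; the diagonal holds the correct predictions:

using LinearAlgebra: diag

sum(diag(C)) / sum(C)   # (0 + 2519) / 2979 ≈ 0.845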

So it looks like it did pretty well. The accuracy here is 84.5%. We would like this to be in the nineties for sure, and more testing could help get the accuracy higher. A model with perfect prediction would have an area of one under the ROC curve. The AUC for this model is:

auc(curve)
0.5772071164820409

So there is room for improvement. According to the confusion matrix, the model predicted that every loan would default, and this was with roughly 20% of the loans in the data not defaulting; 15.5% of the test set was predicted incorrectly. The model picks out the defaulting loans at a somewhat acceptable rate, but more testing and feature selection will help increase the accuracy. More in future posts!
