Lending Club Loan Analysis: Making Money with Logistic Regression

The Lending Club is an online marketplace for loans. As a borrower, you can apply for a loan, and if accepted, your loan gets listed in the marketplace. As an investor, you can browse loans in the marketplace and invest in individual loans at your discretion. This peer-to-peer model has many advantages over traditional banking, such as lower overhead costs and a lower cost of capital.

But what excites me the most about peer-to-peer lending is the democratization of data. As an investor, you can see each and every rejected, completed, ongoing, and available loan. While loan data excludes personally identifiable information, it does include attributes like credit rating, location, college education level, lines of credit, and descriptions of why the applicant needs the loan.

For your average investor who doesn’t have the sophistication (or time) to sift through tens of thousands of loans, the Lending Club provides tools to find loans based on one’s risk and diversification goals. Being a data geek, I of course immediately downloaded the full dataset.

One of the first things I noticed was that many loans have fairly long descriptions:

"Dear Lenders, I was involved in a sports injury approximately 18 months ago…..Thank you for taking time to read this letter. Thank you"

While this borrower is clearly in an unfortunate situation (the full text was over 1500 characters in length), it appears as if borrowers who write longer descriptions actually have much higher default rates:

So, is it possible to aggregate across several attributes with the goal of improving upon Lending Club's basic investment strategies?

The basic problem to be solved here is one of predicting loan default rate. Given a loan with an interest rate of 12% and another loan with an interest rate of 16%, the expected default rate of each loan will tell me my expected return. As a simplification, I treat the expected interest rate as the stated rate multiplied by the probability that the loan does not default. For example, if the first loan has an expected default rate of 25%, and the second a rate of 50%, then my expected interest rates from the two loans would be 9% (12% × 0.75) and 8% (16% × 0.50), respectively. I’d be better off investing in the first loan.
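The arithmetic in that example can be sketched in a few lines of Python. This is a deliberately simplified model: it assumes a defaulted loan earns no interest at all, and the function name `expected_rate` is purely illustrative.

```python
def expected_rate(interest_rate, default_rate):
    """Expected interest rate, assuming a defaulted loan earns nothing."""
    return interest_rate * (1.0 - default_rate)


loan_a = expected_rate(0.12, 0.25)  # 12% loan, 25% expected default rate
loan_b = expected_rate(0.16, 0.50)  # 16% loan, 50% expected default rate

print(f"{loan_a:.2%} vs {loan_b:.2%}")  # prints "9.00% vs 8.00%"
```

A real return calculation would also have to account for partial repayment before default, fees, and the timing of payments, none of which this sketch attempts.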

The Lending Club’s analysis tools model default risk solely as a function of a single attribute, credit grade. I built a logistic regression model that optimizes over twelve different attributes including loan size, interest rate, application date, debt to income ratio, home ownership status, and description length.
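For readers who haven't seen logistic regression before, here is a minimal sketch of the idea in plain Python. It is not the code from this analysis: the gradient-descent fit, the two features, and the six toy loans are all made up for illustration, and a real model would use the full set of attributes and a proper library.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_default(w, b, x):
    """Predicted probability that a loan with features x defaults."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Toy data (made up): [scaled amount requested, scaled description length]
X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.4], [0.8, 0.9], [0.9, 0.7], [0.7, 0.8]]
y = [0, 0, 0, 1, 1, 1]  # 1 = defaulted

w, b = fit_logistic(X, y)
p_small = predict_default(w, b, [0.1, 0.1])
p_large = predict_default(w, b, [0.9, 0.9])
print(f"small/short: {p_small:.2f}, large/long: {p_large:.2f}")
```

The output of the model is a default probability per loan, which is exactly what the expected interest rate calculation needs as input.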

The model was trained on the earliest 50% of loans issued and evaluated on the other half. For each loan, I predict the expected default rate and use this to predict the expected interest rate for the loan. Loans are then sorted by highest expected rate. The following shows the actual interest rate earned when investing in the top 40 loans by predicted interest rate, on up to the top 1000 loans:

For investments over a smaller number of loans (fewer than 400), the logistic regression model clearly outperforms the two baselines. Credit grade binning computes risk as the average default rate of each credit grade, and the other baseline assumes a default rate of zero for all loans (i.e., it simply invests in loans with the highest interest rate first).
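The evaluation procedure and the two baselines can be sketched as follows. This is an assumption-laden illustration rather than the actual analysis code: the field names (`issued`, `grade`, `rate`, `defaulted`), the eight toy loans, and the rule that a defaulted loan earns zero interest are all hypothetical.

```python
from collections import defaultdict


def chronological_split(loans):
    """Earliest half for training, later half for evaluation."""
    ordered = sorted(loans, key=lambda loan: loan["issued"])
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]


def grade_default_rates(train):
    """Credit grade binning: average default rate per grade."""
    totals, defaults = defaultdict(int), defaultdict(int)
    for loan in train:
        totals[loan["grade"]] += 1
        defaults[loan["grade"]] += int(loan["defaulted"])
    return {grade: defaults[grade] / totals[grade] for grade in totals}


def top_k(loans, predict_default, k):
    """The k loans with the highest expected interest rate."""
    def expected(loan):
        return loan["rate"] * (1.0 - predict_default(loan))
    return sorted(loans, key=expected, reverse=True)[:k]


def realized_rate(loans):
    """Average rate actually earned; a defaulted loan counts as zero."""
    return sum(0.0 if l["defaulted"] else l["rate"] for l in loans) / len(loans)


# Toy loans (made up), deliberately out of date order.
loans = [
    {"issued": "2009-06", "grade": "C", "rate": 0.17, "defaulted": True},
    {"issued": "2007-03", "grade": "A", "rate": 0.08, "defaulted": False},
    {"issued": "2008-02", "grade": "C", "rate": 0.15, "defaulted": True},
    {"issued": "2010-04", "grade": "C", "rate": 0.15, "defaulted": False},
    {"issued": "2007-09", "grade": "A", "rate": 0.09, "defaulted": False},
    {"issued": "2009-01", "grade": "A", "rate": 0.10, "defaulted": False},
    {"issued": "2008-07", "grade": "C", "rate": 0.14, "defaulted": False},
    {"issued": "2009-11", "grade": "A", "rate": 0.09, "defaulted": False},
]

train, test = chronological_split(loans)
rates = grade_default_rates(train)

# Assumes every grade in the test half was also seen in training.
binned = realized_rate(top_k(test, lambda loan: rates[loan["grade"]], k=2))
naive = realized_rate(top_k(test, lambda loan: 0.0, k=2))
print(f"grade binning: {binned:.3f}, highest rate first: {naive:.3f}")
```

Any default predictor, including a logistic regression model, can be passed to `top_k` in place of the two lambdas, which makes it easy to compare strategies on the same held-out loans.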

To get a better idea of sensitivity, I retrained the model twelve times, each time holding out one of the twelve attributes and training on the remaining eleven. For each of these models, I then computed the expected interest rate for an investment of 80 loans. The resulting interest rate reductions, relative to the full model, are as follows:

Attribute:            Interest rate reduction
amount requested      0.83%
fico range            0.39%
application date      0.35%
earliest credit line  0.31%
interest rate         0.26%
open credit lines     0.26%
total credit lines    0.06%
home ownership        0.04%
credit grade          0.04%
debt to income ratio  0.04%
description length    0.04%
monthly income        0.00%

According to this analysis, the amount requested for a loan is the single most important attribute in the logistic regression model; the interest rate drops by 0.83% if this attribute is omitted. Description length, on the other hand, is relatively unimportant in terms of model sensitivity. This is because most loans actually have relatively short descriptions.
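The hold-one-out procedure behind the table is simple enough to sketch generically. In the snippet below, `evaluate` stands for "train a model on these attributes and return the interest rate of its top 80 loans"; that function is not shown, and `toy_evaluate` with its contribution numbers is a made-up stand-in so the sketch runs on its own.

```python
def leave_one_out_reductions(attributes, evaluate):
    """Interest rate lost when each attribute is held out, largest first."""
    baseline = evaluate(attributes)
    reductions = {}
    for held_out in attributes:
        remaining = [a for a in attributes if a != held_out]
        reductions[held_out] = baseline - evaluate(remaining)
    return dict(sorted(reductions.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical stand-in for retraining and scoring a model.
toy_contributions = {"monthly income": 0.0, "amount requested": 0.010, "fico range": 0.004}


def toy_evaluate(attributes):
    return 0.08 + sum(toy_contributions[a] for a in attributes)


reductions = leave_one_out_reductions(list(toy_contributions), toy_evaluate)
for attribute, drop in reductions.items():
    print(f"{attribute:<20} {drop:.2%}")
```

One caveat with this kind of analysis: when two attributes carry overlapping information (credit grade and FICO range, say), removing either one alone may show only a small reduction, because the other compensates.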

Surprisingly, application date is actually quite important to the model. However, when investing in a loan, this isn't a factor that you can really optimize over: you can't invest in a loan issued in 2007, nor can you invest in a future loan that no one has applied for yet. It appears as if the Lending Club's loan approvals have trended towards riskier loans with higher interest rates:

So, what’s the catch? Why am I blogging here instead of just quietly investing?

I do invest in Lending Club loans, and I will be incorporating my analysis here into my investment strategy.
There is of course much more complexity to this problem than I’m presenting. In particular, my model invests in loans with the highest expected return and doesn’t have any real risk model beyond this. I ignore all macroscopic effects.
Perhaps the biggest risk of all is if the Lending Club were to go out of business.
There are lots of details about my analysis that I haven't described here. All code can be found on GitHub: https://github.com/drjasondavis/Lending-Club-Learning.
There's a ton more work to be done here: incorporating semantic analysis of descriptions, education information about borrowers, etc.
The Lending Club assesses collection fees for loans that are past due. It's not 100% clear how these fees are applied, but it probably makes investing in riskier loans less appealing than the models presented here suggest. See more information here under "Investor Fees": http://www.lendingclub.com/public/rates-and-fees.action
As @jderick points out in the comments, this analysis doesn't accurately account for the cost of capital, which is higher for riskier loans with larger default rates.
I’m generally very bullish when it comes to online marketplaces, so I’m excited to share my findings.

Disclaimer: I am not an investment professional. I do not warrant any information supplied here. Invest at your own risk!