Lending Club Loan Analysis: Making Money with Logistic Regression

The Lending Club is an online marketplace for loans. As a borrower, you can apply for a loan, and if accepted, your loan gets listed in the marketplace. As an investor, you can browse loans in the marketplace and invest in individual loans at your discretion. This peer-to-peer model has many advantages over traditional banking counterparts, such as lower overhead costs and a lower cost of capital.

But what excites me the most about peer-to-peer lending is the democratization of data. As an investor, you can see each and every rejected, completed, ongoing, and available loan. While loan data excludes personally identifiable information, it does include attributes like credit rating, location, college education level, lines of credit, and descriptions of why the applicant needs the loan.

For your average investor who doesn’t have the sophistication (or time) to sift through tens of thousands of loans, the Lending Club provides tools to find loans based on one’s risk and diversification goals. Being a data geek, I of course immediately downloaded the full dataset.

One of the first things I noticed was that many loans have fairly long descriptions:

“Dear Lenders, I was involved in a sports injury approximately 18 months ago…..Thank you for taking time to read this letter.  Thank you”

While this borrower is clearly in an unfortunate situation (the full text was over 1500 characters in length), it appears as if borrowers who write longer descriptions actually have much higher default rates:

So, is it possible to aggregate across several attributes with the goal of improving upon Lending Club’s basic investment strategies?

The basic problem to be solved here is one of predicting loan default rate. Given a loan with an interest rate of 12% and another loan with an interest rate of 16%, the expected default rate of each loan will tell me my expected return. For example, if the first loan has an expected default rate of 25%, and the second a rate of 50%, then my expected interest rates from the loans would be 12% × (1 − 0.25) = 9% and 16% × (1 − 0.50) = 8%, respectively. I’d be better off investing in the first loan.
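The arithmetic in this example can be written as a one-line function. This is a minimal sketch; the function name is made up for illustration, and it uses the simplification described above, where the expected interest rate is the stated rate scaled by the probability that the loan is repaid.

```python
def expected_interest_rate(interest_rate, default_probability):
    """Stated interest rate scaled by the probability that the loan is repaid."""
    return interest_rate * (1.0 - default_probability)


# The two loans from the example above.
first_loan = expected_interest_rate(0.12, 0.25)
second_loan = expected_interest_rate(0.16, 0.50)

print(f"first loan:  {first_loan:.2%}")
print(f"second loan: {second_loan:.2%}")
```

Note that this measures interest collected only; it does not account for principal lost when a loan defaults.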

The Lending Club’s analysis tools model default risk solely as a function of a single attribute, credit grade. I built a logistic regression model that optimizes over twelve different attributes including loan size, interest rate, application date, debt to income ratio, home ownership status, and description length.
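For readers unfamiliar with logistic regression, here is a rough sketch of the idea in plain Python. It is not the code behind this post (that lives in the GitHub repository linked further down); the function names, the two toy features, and the toy data are all invented for illustration, and a real analysis would more likely use a library implementation.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(rows, labels, learning_rate=0.1, epochs=2000):
    """Fit weights by batch gradient descent on the log loss.

    rows: list of feature lists (assumed already scaled to similar ranges).
    labels: 1 if the loan defaulted, 0 if it was repaid.
    """
    n = len(rows)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            predicted = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            error = predicted - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def predict_default_probability(weights, bias, x):
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Made-up toy data: [amount requested, description length], both scaled to 0..1.
toy_rows = [[0.1, 0.0], [0.2, 0.1], [0.3, 0.2], [0.7, 0.8], [0.8, 0.9], [0.9, 1.0]]
toy_labels = [0, 0, 0, 1, 1, 1]

weights, bias = train_logistic_regression(toy_rows, toy_labels)
low_risk = predict_default_probability(weights, bias, [0.1, 0.0])
high_risk = predict_default_probability(weights, bias, [0.9, 1.0])
```

The output of the model is a probability of default per loan, which is exactly the quantity needed to turn a stated interest rate into an expected one.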

The model was trained on the earliest 50% of loans issued and evaluated on the other half. For each loan, I predict the expected default rate and use it to predict the loan’s expected interest rate. Loans are then sorted by expected interest rate, highest first. The following shows the actual interest rate for investments in the 40 loans with the highest predicted interest rates, up to investments in the best 1000 loans:

For investments over a smaller number of loans (fewer than 400), the logistic regression model clearly outperforms the other two methods. Credit grade binning computes risk as the average default rate of each credit grade, and the final method assumes a default rate of zero for all loans (i.e. just invest in loans with the highest interest rate first).
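The evaluation procedure and the two baselines can be sketched roughly as follows. This is an illustration under stated assumptions rather than the actual analysis code: the field names (`issued`, `grade`, `rate`, `defaulted`), the function names, and the toy loans are all made up, a defaulted loan is assumed to pay 0% interest, and grades unseen in training fall back to the overall default rate.

```python
from collections import defaultdict


def chronological_split(loans):
    """Earliest half for training, later half for evaluation."""
    ordered = sorted(loans, key=lambda loan: loan["issued"])
    midpoint = len(ordered) // 2
    return ordered[:midpoint], ordered[midpoint:]


def grade_binning_predictor(training_loans):
    """Predict default risk as the average default rate of the loan's credit grade."""
    totals, defaults = defaultdict(int), defaultdict(int)
    for loan in training_loans:
        totals[loan["grade"]] += 1
        defaults[loan["grade"]] += int(loan["defaulted"])
    rates = {grade: defaults[grade] / totals[grade] for grade in totals}
    overall = sum(defaults.values()) / sum(totals.values())
    # Assumption: a grade never seen in training gets the overall default rate.
    return lambda loan: rates.get(loan["grade"], overall)


def zero_default_predictor(loan):
    """Baseline that ranks purely by stated interest rate."""
    return 0.0


def realized_rate_of_top_n(loans, predict_default, n):
    """Rank by expected rate, then average the realized rate of the top n loans."""
    ranked = sorted(
        loans,
        key=lambda loan: loan["rate"] * (1.0 - predict_default(loan)),
        reverse=True,
    )
    chosen = ranked[:n]
    # Assumption: a defaulted loan yields 0% interest, a repaid loan its stated rate.
    return sum(0.0 if loan["defaulted"] else loan["rate"] for loan in chosen) / len(chosen)


# Made-up toy loans; ISO-style date strings sort chronologically.
toy_loans = [
    {"issued": "2008-03", "grade": "A", "rate": 0.10, "defaulted": False},
    {"issued": "2008-07", "grade": "E", "rate": 0.18, "defaulted": True},
    {"issued": "2009-01", "grade": "A", "rate": 0.11, "defaulted": False},
    {"issued": "2009-06", "grade": "E", "rate": 0.19, "defaulted": False},
    {"issued": "2010-02", "grade": "A", "rate": 0.12, "defaulted": False},
    {"issued": "2010-06", "grade": "E", "rate": 0.20, "defaulted": True},
    {"issued": "2011-01", "grade": "A", "rate": 0.11, "defaulted": False},
    {"issued": "2011-05", "grade": "E", "rate": 0.17, "defaulted": False},
]

train, test = chronological_split(toy_loans)
by_grade = grade_binning_predictor(train)
binned_rate = realized_rate_of_top_n(test, by_grade, n=2)
naive_rate = realized_rate_of_top_n(test, zero_default_predictor, n=2)
```

Any other predictor, such as a fitted logistic regression, can be passed to `realized_rate_of_top_n` in the same way, which makes the three strategies directly comparable. The toy numbers say nothing about real Lending Club performance.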

To get a better idea of sensitivity, for each of the twelve attributes used to train the model, I trained a new model that held out that attribute and used only the remaining eleven. I then computed the expected interest rate for an investment in 80 loans. The resulting interest rate reductions for each attribute are as follows:

Attribute:            Interest rate reduction
amount requested      0.83%
fico range            0.39%
application date      0.35%
earliest credit line  0.31%
interest rate         0.26%
open credit lines     0.26%
total credit lines    0.06%
home ownership        0.04%
credit grade          0.04%
debt to income ratio  0.04%
description length    0.04%
monthly income        0.00%

According to this analysis, the amount requested for a loan is the most important single attribute in the logistic regression model; the interest rate drops by 0.83% if this attribute is omitted. On the other hand, description length is relatively unimportant in terms of model sensitivity. This is because most loans actually have relatively short descriptions.
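The hold-one-attribute-out procedure behind the table can be sketched as a small loop. This is an illustrative outline only: `evaluate` is a hypothetical callable that trains a model on the given attributes and returns the resulting interest rate, and the stand-in below uses made-up additive contributions purely so the example runs.

```python
def hold_one_out_reductions(attributes, evaluate):
    """Interest rate reduction from omitting each attribute in turn.

    evaluate(attribute_subset) is assumed to train a model on that subset
    and return the interest rate achieved by the resulting investments.
    """
    baseline = evaluate(attributes)
    reductions = {}
    for held_out in attributes:
        remaining = [name for name in attributes if name != held_out]
        reductions[held_out] = baseline - evaluate(remaining)
    # Largest reduction (most important attribute) first.
    return dict(sorted(reductions.items(), key=lambda item: item[1], reverse=True))


# Stand-in for "train and evaluate": made-up additive contributions per attribute.
toy_contribution = {"amount requested": 0.008, "fico range": 0.004, "monthly income": 0.0}


def toy_evaluate(attribute_subset):
    return 0.10 + sum(toy_contribution[name] for name in attribute_subset)


reductions = hold_one_out_reductions(list(toy_contribution), toy_evaluate)
for name, drop in reductions.items():
    print(f"{name:<20} {drop:.2%}")
```

One caveat of this design: an attribute that is strongly correlated with another can show a reduction near zero, because the remaining attribute picks up the slack when the first is removed.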

Surprisingly, application date is actually quite important to the model. However, when investing in a loan, this isn’t a factor that you can really optimize over, e.g., you can’t invest in a loan issued in 2007, nor can you invest in a loan in the future that someone hasn’t yet applied for. It appears as if the Lending Club’s loan approvals have trended towards riskier loans with higher interest rates:

So, what’s the catch? Why am I blogging here instead of just quietly investing?

  • I do invest in Lending Club loans, and I will be incorporating my analysis here into my investment strategy.
  • There is of course much more complexity to this problem than I’m presenting. In particular, my model invests in loans with the highest expected return and doesn’t have any real risk model beyond this. I ignore all macroscopic effects.
  • Perhaps the biggest risk of all is if the Lending Club were to go out of business.
  • There are lots of details about my analysis that I haven’t described here. All code can be found on github: https://github.com/drjasondavis/Lending-Club-Learning.
  • There’s a ton more work to be done here: incorporating semantic analysis of descriptions, education information about borrowers, etc.
  • The Lending Club assesses collection fees for loans that are past due. It’s not 100% clear how these fees are applied, but it probably makes investing in riskier loans less appealing than the models presented here suggest. See more information here under “Investor Fees”: http://www.lendingclub.com/public/rates-and-fees.action
  • As @jderick points out in the comments, this analysis doesn’t accurately account for the cost of capital, which is higher for riskier loans with larger default rates.
  • I’m generally very bullish when it comes to online marketplaces, so I’m excited to share my findings.

Disclaimer: I am not an investment professional. I do not warrant any information supplied here. Invest at your own risk!

Published by

Jason Davis (@jasondavis)

Entrepreneur, Data Scientist, Hacker

14 thoughts on “Lending Club Loan Analysis: Making Money with Logistic Regression”

  1. Pretty interesting work. I’m not totally clear on what ‘default rate’ means. Is that per year or over the lifetime of the loan? Also, isn’t it important when the loan defaults as well?

    As for calling this ‘discrete optimization’, can you explain what you mean? I don’t think you are talking about integer programming, are you?

  2. “Default rate” is the probability that the loan isn’t repaid (i.e. the loan “defaults”).

    As for “discrete optimization”, I was referring to an argument that I ended up removing from the original post. Basically, the Lending Club bins loans into one of 36 discrete categories (A1 – G6). And each loan is assigned an interest rate as a function of these categories. This is a source of inefficiency in their system: one can imagine that some loans within any given category are better than others. So, even if their algorithms were “perfect”, there would still be an opportunity to optimize the discrete nature of their system.

  3. I must not be understanding something. In your example you say a 50% default rate would reduce your returns from 16% to 8%. But say you lent $100 and half of it defaulted, then you made 16% on the other half. Now you have $58, which is a net loss rather than an 8% gain.

    1. This is an excellent point and a big assumption in the model. Expected return could be more clearly called “expected interest rate collected from the loan”. In fact, cost of capital is a big driver of return, although measuring this (much less estimating it) is somewhat tricky because the Lending Club hasn’t been around for that long (as Peter Renton pointed out in his comment). I’ve updated the post to clarify this.

  4. Great analysis Jason. While I haven’t built a logistic regression model (I have no idea what that even is) I found your results fascinating. Although I was surprised you didn’t find much correlation with income – my analysis has shown it has made quite a difference on past Lending Club loans. Another factor you didn’t include is number of inquiries which I have found also makes quite a difference.

    A couple of other points. You are right in stating the average interest rate has been going up. One of the reasons for that is the number of larger loans on the platform (the average has been increasing for some time) which drives up interest rates. Also, collection fees are negligible, they make very little difference to returns.

    So what did you use for your dataset? Did you look at just completed loans or every loan on the platform? As you are no doubt aware, the average age of loans on the platform is still very young because Lending Club is growing so fast.

    1. Thanks Peter.

      Your observation about income is correct. My sensitivity analysis was fairly simple: I just removed a single attribute and then retrained my model. If, say, the Lending Club were to base its credit grade 100% off of income, then removing income would have no impact on the resulting model (since credit grade here would be 100% correlated with income, and credit grade would “take up the slack”).

      On its own, however, income is in fact a great predictor, and correlates quite highly to expected interest rate: http://drjasondavis.files.wordpress.com/2012/04/monthly_income_x_expected_interest_rate.png

      In terms of datasets, I used the raw data of ~45,000 loans available for download from the Lending Club (downloaded February 2012). I remove all loans with status “current”, “issued”, or “in review” from my analysis. This basically leaves me with those that are completed or defaulted, which is fewer than half of all loans.

      1. Interesting graph on income vs interest rate. I can see what you mean about the correlation.

        What I do when investing is I look for small pockets of mispricing where I believe the interest rates are overstated. One place I have found that is with income. Someone may be earning $10,000 a month but have a high interest rate due to a large loan amount or a low credit score (or other possible factors). Historically, some of these loans have performed very well and so I look for new loans that match this very small subsection of the loan history.

  5. It is an interesting analysis, especially the number of characters in the loan description. I also finished a similar analysis on LC’s 50,000+ historical loans. What I found is that default risk is below average if the Education and Loan Description fields have no characters and the Loan Title has 22 or fewer characters.

    1. Anil – very interesting. Someone needs to dig into some deeper semantic analysis of this content as well….

      1. Anil and Jason, You guys may want to check out textanalyser.net for more in depth text analysis options. That tool extracts lots of key parameters out of text. It would be interesting to see how average syllables per word and readability scores correlate to default rates. The part that I haven’t figured out is how to efficiently feed the text into the tool and extract the results. The other option would be to integrate their algorithms into your tool for a more customized solution.

        I have a few other questions…

        Any assumptions about defaulted loans are critical because these are the primary drivers of poor investment performance and defaults at the beginning of a loan are significantly worse than defaults at the end of the loan. This adds a lot of complexity to the problem. What are you doing to account for this issue? It seems like creating something like a default recovery index using something like the ratio of the ‘payments made to date’ divided by the ‘issued loan amount’ may be useful for this. This would represent the percentage of funds recovered from the defaults and would probably be useable as a weighting parameter.

        I was really surprised to see that application date was the third most sensitive attribute. I would have guessed that this would have very little influence on rates/returns. I know lending club became more strict on their approval process once in the past and it was a pretty big deal. This probably corresponds to the abrupt change in 2008. However the exponential rise in interest rate afterwards has me concerned of the possibility of a default bubble. Lending club has been growing exponentially at a high rate since inception and yet they quote their expected returns based on the linear average of all loans. This means that the vast vast majority of the loans used to come up with their figures are young ones and only a small percentage have been fully paid off. You mentioned that this paid off number was less than 50% of all loans. I believe it is much less than this but I haven’t actually taken the tally. They probably issued more loans in January 2013 than in all of 2008. This brings me back to my concern. I am afraid that the rise in rates is a result of Lending Club realizing that they underestimated the impact of defaults so they began pushing rates higher to compensate for the risk as they learned more and more about how actual loans panned out. Are they issuing riskier loans or are they merely adjusting rates. A graph of FICO score vs time may help to answer that.

        The other explanation for the rise in rate vs time could be an increase in the applicant/investor ratio. If more people apply for loans then investors can choose the high rate low risk applicants first and the rest will fail to be issued, thus pushing rates higher. I’m not sure if rates have this free market relationship but if they do then that is good news to investors. Investors may be uneasy about some of the risks whereas borrowers benefit from these same risks (risk that Lending Club fails, difficulty in collecting on delinquent notes), which leads me to expect that the borrower/applicant pool will grow faster than the lender pool. The exponential interest rate growth seems to support this theory.

        In short, I am waiting to see how more loans actually pan out before beginning to invest, but I have found your analysis to be the most detailed and effective one yet and thus would like to thank you for blogging about it.

        P.S. Sorry about the monster post!

      2. Hi Brendon – thanks for your thoughtful comments.

        Yes, you bring up some excellent points re loan age, acceptance rates, and true risk over time. My analysis made some very simple assumptions about default risk as a function of loan age. I used some sort of basic linear model wherein I assumed a sort of default rate per year (or per time unit). In reality, I’m sure loans tend to default less in early stages and more towards the end.

        The number of loans that the Lending Club is making is increasing at a fairly fast rate. Growing their business involves growing both the supply and demand side simultaneously. My guess is they grow the supply side as fast as possible and then lend out money accordingly. So, the more actual demand there is, the higher the quality of loans, since under this assumption they only have a fixed amount of money to loan.

        As for automating all of this, it would be great if the Lending Club had an API. Plug the data in, then have the algorithm make investment decisions automatically.

  6. Jason, did you try feature vector selection with Random Forest models? I got an optimal number of features as only 4 and they were ranked in this order:

    #Optimal number of features : 4
    #Feature ranking:
    #1. feature 9 (0.233677) #number of inquiries
    #2. feature 1 (0.193933) # FICO high
    #3. feature 10 (0.090055) #Revolving balance
    #4. feature 5 (0.080833) #DTI
    #5. feature 8 (0.076548) #Total_acc
    #6. feature 7 (0.065637) #Open account
    #7. feature 2 (0.063432) #ann_income
    #8. feature 4 (0.061500) #loan_amount
    #9. feature 0 (0.061454) #len(description)
    #10. feature 6 (0.049553) #purpose
    #11. feature 21 (0.017243) #Delinquencies 2year
    #12. feature 15 (0.006136) #Bankruptcies

    Interestingly, most of the probability of default can be predicted with just number of inquiries, FICO high score, revolving balance and DTI.

    What do you think? The code is here:


    1. Hi Pavel – I didn’t experiment with random forest models. It would be interesting to run some sort of PCA or correlation analysis between these variables – I remember several of the sources being quite interdependent.
      Look forward to checking out your SVM code!
