Lending Club Loan Analysis: Making Money with Logistic Regression

April 08, 2012 by Jason Davis in Data

The Lending Club is an online marketplace for loans. As a borrower, you can apply for a loan, and if accepted, your loan gets listed in the marketplace. As an investor, you can browse loans in the marketplace and invest in individual loans at your discretion. This peer-to-peer model has several advantages over traditional banking, such as lower overhead costs and a lower cost of capital.

But what excites me the most about peer-to-peer lending is the democratization of data. As an investor, you can see each and every rejected, completed, ongoing, and available loan. While loan data excludes personally identifiable information, it does include attributes like credit rating, location, college education level, lines of credit, and descriptions of why the applicant needs the loan.

Why ad networks should optimize for precision and not recall

January 30, 2012 by Jason Davis in Data, Startups

When I was working on my PhD and living in Austin, I owned several motorcycles and spent lots of time online researching parts, upgrades, repairs, etc. on sites like svrider.com and vfrworld.com. Without sites like these, when I had a problem with my motorcycle, I would have had to read the shop manual, go to the parts store, talk to a mechanic, call friends to ask for help, etc. I still did these things on occasion, but online resources made information more immediately accessible and made my research much more efficient. This sort of information availability is one of the defining disruptions of the web.

And not surprisingly, deep content is really my favorite “part” of the web. But internet ads, especially those on many of the sites I frequent, just don’t get my attention. Many ad networks today claim to have awesome semantic targeting technology that can build complex models for interpreting content in order to place the most relevant ad. But if a forum post is discussing steel brake lines for a motorcycle, and the ad network only has a generic ad for an auto parts store, then the placement can only be so relevant, regardless of technology. Some of the best content on the web is quite deep, but most ad inventory lacks the required specificity. The reason ad networks today aren’t able to get my attention has nothing to do with their technology. It is due entirely to a lack of inventory.

Recall & Precision

January 21, 2012 by Jason Davis in Data

Precision and recall are two fundamental quality measures in search and information retrieval applications. Google is fundamentally a search application, but Google doesn't need to optimize for recall; it just optimizes for precision. When I Google for “Michael Jordan”, I’m really just looking for a single page about the basketball legend. A search application’s ability to find *all* pages about “Michael Jordan” is a measure of recall, and users don't want to read thousands of pages about Mike, so Google doesn't optimize for this. Optimizing for recall is hard, and I spent most of my academic life working on algorithms to improve this measure. At an intuitive level, these algorithms "discover" relational inferences between various entities. For example, basketball, court, and rim are related, whereas field goal, penalty shot, and slam dunk are not. At a technical level, these algorithms work by learning a distance measure between objects. These distance metric learning algorithms did some very heavy lifting to show improvements in quality, but their improvements really only increased recall.
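To make the two measures concrete, here is a minimal Python sketch using the standard set-based definitions: precision is the fraction of returned results that are relevant, and recall is the fraction of all relevant items that were returned. The function name `precision_recall` and the page IDs are made up for illustration; this is not how any real search engine scores results.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a collection of retrieved items.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant
    # Guard against empty sets to avoid dividing by zero.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: ten pages are truly about the query,
# and the search returns three results, two of which are relevant.
relevant = {"p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8", "p9", "p10"}
retrieved = ["p1", "p2", "x1"]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the search looks good on precision (two of three results are relevant) while recall is low (only two of the ten relevant pages were found), which is the kind of trade-off described above.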