Precision and recall are two fundamental quality measures in search and information retrieval applications. Google is fundamentally a search application. But Google doesn’t need to optimize for recall, it just optimizes for precision. When I Google for “Michael Jordan”, I’m really just looking for a single page about the basketball legend. A search application’s ability to find *all* pages about “Michael Jordan” is a measure of recall, and users don’t want to read thousands of pages about Mike, so Google doesn’t optimize for this.
Optimizing for recall is hard, and I spent most of my academic life working on algorithms to improve this measure. At an intuitive level, these algorithms “discover” relational inferences between various entities. For example, basketball, court, and rim are related, whereas field goal, penalty shot, and slam dunk are not. At a technical level, these algorithms worked by learning a distance measure between objects. These distance metric learning algorithms did some very heavy lifting to show improvements in quality, but their improvements really only increased recall.
I found the math behind these algorithms to be very interesting. At the core of these algorithms was optimizing a class of distance functions called Bregman divergences. The simplest Bregman divergence which you may be familiar with is the Euclidean distance in two dimensional space. This distance can be generalized beyond two multi-dimensional spaces, and Bregman divergences are a class of distance measures that go beyond just standard sum of squares Euclidean distances. In fact, Bregman divergences can be further generalized between non-vector inputs, and the core distance measure used by these metric learning algorithms is in fact a distance between matrices. Without getting into details, this distance function between matrices is used to capture the relationship between pairs of features (words, phrases, etc.) in the problem domain.
The supervision aspect of this metric learning problem came in the form of pairs of constraints: two points (documents, web pages, etc.) can be constrained to be similar, or points can be constrained dissimilar.
Each point in the image above represents a handwritten digit (0-9). Here, images of the same digit would be constrained to be similar, whereas images of different digits would be constrained dissimilar.
These algorithms did some very heavy lifting to improve problem recall. And I did some very heavy lifting of my own to develop these algorithms. This problem is extremely hard in general, and I don’t work on these sorts of problems any more.
So, why is optimizing for precision easy? A search like “east village 13th street squatters” is really best served by a handful of highly relevant pages on the subject matter (for example, this page on east village squatting). Google is able to find these results because it crawls and indexes billions and billions of pages from the web. If Google instead indexed say only a few tens of millions of pages, their algorithmic challenges would be much greater: they’d have to make much deeper inferences about pages that are less topical to my query. For example, Wikipedia’s page on squatting makes reference to New York City, but it may ultimately refer me to a book that I could only find in a library.
Google of course has developed world class algorithms and has in many ways perfected web search. But Google optimizes for precision, and optimizing precision is easier than optimizing recall.
And I’ve found that many real world problems require optimizing precision over recall.