Your First A/B Tests: A Step by Step Guide


Your A/B testing system is hot off the press: you’ve integrated with Optimizely or Google’s Content Experiments, or maybe deployed some code of your own. You open your eyes, look around, and see a whole new world of experimental opportunities.

But suddenly, you start asking more questions and forming more hypotheses than one could possibly answer in a lifetime. Should you start testing 41 shades of background hues to see what’s most attention grabbing? Maybe start by testing a rewrite of every piece of copy on your site? Or maybe you should start A/B/C/D testing your web fonts?

No.

Here’s a step by step guide.

Step 1: Start by running an A/A test.


You’re excited about A/B testing, but you’ve never run an experiment on your product. And since your A/B system is hot off the presses, everyone else in your organization will also be new to this.

A/A tests are conceptually quite simple: the two variants are both A’s. You’ve changed absolutely nothing, except that you’re exercising the A/B system by randomly assigning users to buckets. A fundamental component of A/B testing is the natural statistical variation that occurs between buckets, and running an A/A test is a great way to “see” what this variation looks like. You might expect zero difference between two identical variants, but there will almost always be some.

So as your first step, run an “empty” A/A test, watch your results for a week, and try to familiarize yourself with some basic statistics behind A/B testing.
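If you want to build intuition before (or alongside) the live A/A test, a quick simulation shows how much two identical buckets naturally diverge. This is a minimal sketch, not tied to any particular A/B framework; the 5% conversion rate and bucket sizes are made-up numbers.

import random

# Simulate an A/A test: both buckets share the same underlying 5% conversion rate.
TRUE_RATE = 0.05
USERS_PER_BUCKET = 5000

def simulate_bucket(n, rate):
    """Return the observed conversion rate for n users with a true rate."""
    conversions = sum(1 for _ in range(n) if random.random() < rate)
    return conversions / n

for trial in range(5):
    a = simulate_bucket(USERS_PER_BUCKET, TRUE_RATE)
    b = simulate_bucket(USERS_PER_BUCKET, TRUE_RATE)
    print(f"trial {trial}: A={a:.3%}  B={b:.3%}  diff={(b - a):.3%}")

Even with identical variants, the observed difference bounces around by a few tenths of a percent; that noise floor is what your real experiments have to rise above.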

Step 2: A/B test a single feature release.

You make changes to your site constantly. Whitespace tweaks, bug fixes, and tooltips are a big part of keeping product quality high, but most of the time they’ll have no impact on metrics like time on site, conversion rate, or bounce rate. Don’t start by testing these.

Some aspects of your product get a high amount of traffic (say, your homepage or search page), and some components are mission critical to your business (in an e-commerce setting, your checkout funnel). Yet other components are neither high traffic nor mission critical: for example, your about page.

So start by picking a relatively simple front-end change affecting a high traffic page. For example, suppose your social media share button has some slick share logo, and you’d like to understand if clarifying the logo gets people to share more often.

Control: the original button with the slick share logo (share_old)
Test: the button with explicit “Share” copy (share_new)

Why is this a good test? First, the button appears on every content page on the site. Second, copy changes can oftentimes have larger impacts on users’ behavior than you’d expect. And finally, making this change is really simple. Your first true A/B experiment should focus on methodology and outcome; don’t test something that may be buggy or hard to integrate. Tools like Optimizely or Google’s Content Experiments are great for testing visual changes like the one above, but are much harder (or impossible) to use for things like search ranking, recommendations, or email.

Step 3: Iterate.

As you run more and more tests, you’ll start to learn which tests move the needle and which tests don’t.

If a test showed positive improvements, ask yourself if you can further improve on the result. If the “Share” copy above increased tweets by 100%, consider changing the background color from a dull gray to something more noticeable.

If a test showed little to no change, take a step back and ask the following:

  • Is volume too small to matter? Say your site gets tweeted 10,000 times per month, and your change increases this rate by 1%. That’s 100 additional tweets, probably not worth dwelling on.
  • Is the change at all impactful? You changed your share button background to hot pink, but people aren’t sharing any more than when it was gray. Perhaps the original problem with the share button was one of comprehension, not discoverability.
  • Is the change measurable? Some changes you make are brand or design focused. A new logo design shouldn’t show any significant differences in your A/B metrics. Maybe you shouldn’t have tested this in the first place.
  • Are you measuring the right things? Not every change you make will affect your site’s overall conversion rate. If you’re testing search ranking changes, you’d expect people to click higher in the result set if quality has improved.

This case is the hardest to analyze and can be frustrating. Dig deeper into what’s going on, run some user tests (I love usertesting.com), talk to your customers, and use your brain.

Finally, the most useful of an experiment’s three possible outcomes is oftentimes a negative change. Here, the test shows that things got significantly worse: volume is high enough, and the change is both negatively impactful and directly measurable.

Revisit your key hypotheses and test assumptions, and question everything.

Step 4: Think strategically, test strategically.

A/B testing has lots of benefits, and the biggest in my opinion is learning. Think about your strategic initiatives and develop larger hypotheses for your product and business.

You’ve run three experiments trying to get users to share your content more on Twitter, and each has shown no change. Maybe your site isn’t as social as you thought: you have a gifting site, and people actually don’t like sharing presents on Twitter.

Also think about strategic testing. You’re worried that your site is too slow, and engineering tells you they need six weeks to decrease average response time from 800 milliseconds to 500. You look at the dozens of initiatives on your roadmap and ask if this is really necessary. As step zero, instead of trying to speed up the site, slow it down: add a 200ms delay to every request in the test bucket, and measure engagement, etc. at an average response time of 1,000ms.
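One hedged way to wire up that slowdown is to delay only the test bucket’s requests. The sketch below is illustrative rather than any particular framework’s API; assign_bucket, the test name, and the 50/50 split are all made up for the example.

import random
import time

TEST_DELAY_SECONDS = 0.2  # 200ms artificial slowdown for the test bucket

def assign_bucket(user_id, test_fraction=0.5):
    # Stand-in for a real A/B framework; a real one would assign
    # deterministically per user rather than randomly per call.
    return "test" if random.random() < test_fraction else "control"

def handle_request(user_id):
    bucket = assign_bucket(user_id)
    if bucket == "test":
        time.sleep(TEST_DELAY_SECONDS)  # users in the test bucket see a slower site
    # ... render the page and log engagement metrics tagged with `bucket` ...
    return bucket

If engagement barely moves at 1,000ms, the six-week speedup probably isn’t your most valuable project right now.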

Or maybe you’re concerned with the quality of your homepage recommendation module: try an experiment in which the module shows popular content 50% of the time and existing personalized content the other 50%.

Your business as an experiment

As any successful entrepreneur or CEO will tell you, scaling your business requires learning, and learning requires an open mind. A/B testing can be a great tool for experimenting with your business, exposing you to new findings and opportunities. And when used properly and in the right context, proper experimentation can be eye opening.

Thanks to the product and engineering team @ Loverly, and to Dan McKinley and Greg Fodor for reading drafts of this.

A Last Minute Infographic; Last Minute Eatin’ featured in the NYTimes

[Infographic: the Last Minute Eatin’ Explorer, showing prime time availability by restaurant and neighborhood]

Last Minute Eatin’ has been running for over two months now and has tweeted over 1,500 tables since launch. LME knows exactly when and where tables are open, and you can now visualize these openings with the Last Minute Eatin’ Explorer. Every day at midnight, LME checks availability for almost 1,000 restaurants on OpenTable, and the infographic shows which restaurants and neighborhoods have prime time (between 7 and 10pm) tables available. Read more about LME data and how it works.

And in related news, Tejal Rao wrote a great piece about Last Minute Eatin’ and some other folks also tackling this problem: Coveted Restaurant Reservations Without the Groveling.


A Mixpanel Data Exporter


Mixpanel is a great analytics tool for small to medium sized web and mobile shops. And not surprisingly, their analytics product has pretty good adoption (over 1,400 companies using it, according to their homepage).

One thing I’ve noticed, however, is that as some of these shops grow in size, they slowly start to ask more than Mixpanel can answer. Their data science team may want to do some in-depth analysis of customer lifetime value. Their product team wants to do some deeper funnel analysis comparing variants in a recent A/B test. Or their search team wants to do some click-depth inference on long-tailed queries.

Luckily, Mixpanel allows customers to export their data for deeper analysis off of Mixpanel. This way, business folks can run custom SQL, analysts can play with data in R, and data scientists can do some Hadoop deep diving with Pig or Cascading. Further, if Mixpanel data lives with the rest of your data, you can easily join it with things that aren’t in Mixpanel, like customer acquisition costs from your paid ads on Google or Facebook.

I met up with an old friend last week who’s at a company that fits this exact mold. I mentioned that I had built a simple tool to pull Mixpanel data to S3, and he encouraged me to open source it. Having seen this pattern several times before, I spent a few minutes cleaning things up this morning, and you can now find my Mixpanel-Puller on Github.
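For context, a minimal puller of this sort might look roughly like the sketch below: fetch raw events from Mixpanel’s export API and drop them into S3. This is not the actual Mixpanel-Puller code; the endpoint, auth scheme, bucket name, and date range here are assumptions, so check Mixpanel’s export documentation (and the GitHub repo) for the real details.

import requests
import boto3

# Assumed values; replace with your own.
MIXPANEL_API_SECRET = "YOUR_API_SECRET"
S3_BUCKET = "your-analytics-bucket"
EXPORT_URL = "https://data.mixpanel.com/api/2.0/export/"  # raw event export endpoint; verify against current docs

def pull_events_to_s3(from_date, to_date):
    # Mixpanel's raw export returns one JSON event per line.
    resp = requests.get(
        EXPORT_URL,
        params={"from_date": from_date, "to_date": to_date},
        auth=(MIXPANEL_API_SECRET, ""),  # auth scheme is an assumption; older API versions used signed requests
        timeout=600,
    )
    resp.raise_for_status()

    key = f"mixpanel/raw/{from_date}_{to_date}.jsonl"
    boto3.client("s3").put_object(Bucket=S3_BUCKET, Key=key, Body=resp.content)
    return key

# Example: pull_events_to_s3("2013-06-01", "2013-06-01")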

Eight Ways You’ve Misconfigured Your A/B Test


You’ve read about the virtues of A/B testing feature releases. You love iterating quickly, testing quickly, and continually learning in a data-driven fashion. You appreciate the importance of keeping an eye on the statistics behind your testing, and perhaps you even use a tool or two to make sure your results are statistically valid.

But you ran a test last week, the results have been coming in for a while now, and the data just doesn’t look quite right.

Where to begin?

Read on for eight common problems, one of which might just be yours.

Problem: You’ve changed your weights mid-experiment

[Figure: experiment weights being ramped up mid-test]

A basic premise behind A/B testing is that of running two different variants at the same time. Your bounce rates, conversion rates, and pretty much all your metrics vary from day to day, and oftentimes this variation is larger than the difference between each of your experiment variants. Running your A variant on Sunday, Monday, and Tuesday, and then switching things to B on Wednesday, Thursday, and Friday just doesn’t work. An A/B test allows you to randomly assign users into one of two bins, running two variants simultaneously throughout the entire time period.

And as it turns out, switching variant weights mid-test suffers from the same problems as running variants serially. Intuitively speaking, switching from a 100/0 test (with A at 100% and B at 0%) on Tuesday to a 0/100 test on Wednesday is quite similar to switching from a 99/1 test on Tuesday to a 1/99 test on Wednesday. In both cases, while the “average test” turns out to be 50/50, the results are still dominated by the day over day variation from Tuesday to Wednesday: A’s metrics will mostly reflect Tuesday, and B’s metrics will mostly reflect Wednesday. If Tuesday happens to be stronger than Wednesday, then A will falsely appear to have won; conversely, if Wednesday is stronger, then B will falsely appear to be the winner.

Although the effect is less pronounced, switching from a 99/1 test to a 95/5 test suffers from this problem as well.

So make sure to measure results only over periods in which your A/B ratios remain constant. And as a corollary, make sure you’re also recording your A/B ratios as you change them.
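As a concrete sketch of that corollary: if you log each exposure along with the weight configuration that was live at the time, you can compare A vs B within each constant-weight period instead of pooling everything. The file and column names here are hypothetical.

import pandas as pd

# One row per exposure, tagged with the weight configuration in effect
# at the time (e.g. "50/50", "90/10"). Columns are hypothetical.
exposures = pd.read_csv("exposures.csv")  # columns: user_id, bucket, converted, weight_config

# Only compare A vs B within periods where the weights were constant.
by_period = (
    exposures
    .groupby(["weight_config", "bucket"])["converted"]
    .agg(["mean", "count"])
)
print(by_period)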

Google Analytics (as well as other tools) uses so-called “bandit tests” in which the variants ramp up (or down) depending on variant performance. The statistics are somewhat different from standard A/B testing, and changing weights mid-experiment is of course central to how bandit testing works. As an aside, I think Google’s bandit test optimizer is a nice tool for quick and dirty experiments, but I’m generally not a huge fan of bandit testing (see Dan’s rant for a good reflection of my take).

Problem: Your experiment retains users

[Figure: the test bucket slowly retaining more users than the control]

You just launched an experiment showing big improvements to your registration funnel, and registrations have increased by 10% in your experiment bucket (bucket B). But since registered users are more likely to come back to your site than non-registered users, you start seeing your B bucket growing in size. Most A/B systems bucket based on users and not visits. So what started as a 50/50 test will change to a 49/51 test on day two, and then a 48/52 test on day three, etc. as more users convert in your B group than the A. Your buckets are no longer evenly distributed either as B is biased towards logged-in and registered users.

Since your B variant is growing in size, and since this growth is coming from existing users and not new users, you’ll also start to see registration rates decrease for your test group (even though it’s actually outperforming the control!).

The first red flag to look out for here is your bucket sizes; even though you’ve configured your test to be 50/50, ratios start changing slowly over time. Once you notice this, you can try to restrict your analysis to the first few days (or week) of the experiment. Or you could also make an educated decision about the winner based on an understanding of the dynamics of what you’ve changed.
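One lightweight way to automate that red flag is a sample ratio mismatch check: compare the observed bucket sizes against the configured split with a chi-square test. A minimal sketch, assuming you can count exposures per bucket (the counts below are made up):

from scipy.stats import chisquare

def sample_ratio_p_value(observed_a, observed_b, expected_ratio=0.5):
    """P-value for observing these bucket sizes under the configured split."""
    total = observed_a + observed_b
    expected = [total * expected_ratio, total * (1 - expected_ratio)]
    stat, p_value = chisquare([observed_a, observed_b], f_exp=expected)
    return p_value

# Example: a test configured 50/50 that has drifted.
p = sample_ratio_p_value(10400, 11000)
if p < 0.001:
    print(f"Possible sample ratio mismatch (p={p:.2g}); don't trust the test results.")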

Bucketing based on visits instead of users is tempting, but can also add to user confusion as people will likely see multiple variants.

A related case that deserves special attention is features that are explicitly designed to create return visits to your site, for example eBay’s saved search reminders, or something even as simple as an email signup form. These sorts of features will quickly create imbalances in your visit counts.

Problem: Your segmentation is wrong

When you start segmenting users in your A/B tests, you’re effectively introducing a third variant for your experiment: A, B, and not in a variant. Consider the two following code snippets:

# INCORRECT
bucket = ab_selector.assign_bucket_and_log(test, user)  # every user is bucketed and logged, even outside the segment
if user.gender == 'female':
    if bucket == 'control':
        pass  # Render control variant
    elif bucket == 'test':
        pass  # Render test variant
else:
    pass  # Render control variant

# CORRECT
if user.gender == 'female':
    bucket = ab_selector.assign_bucket_and_log(test, user)  # only users in the segment are bucketed and logged
    if bucket == 'control':
        pass  # Render control variant
    elif bucket == 'test':
        pass  # Render test variant
else:
    pass  # Render control variant

In the first snippet, the bucketing logic appears outside the conditional. So when the call to assign_bucket_and_log is made, each and every user is logged into either the control or the test, regardless of whether or not they’ve met the selection criteria.

Let’s take a deeper look into how this sort of experiment should be bucketing users in the correct version of the code above:

              In Segment    Not In Segment
Control (A)   A             no variant
Test (B)      B             no variant

Let’s look at the problem with the incorrect configuration where all users are bucketed:

              In Segment    Not In Segment
Control (A)   A             A
Test (B)      B             A

Here, all Not In Segment users are shown variant A, even though the users in this group who were logged as seeing B actually ended up seeing A.

Problem: You’re cherry picking

Say you have a new recommendation algorithm that you think people will love, but it only works for users who have previously purchased 10 items.

Consider the following incorrect snippet:

# INCORRECT
if new_hot_algorithm.has_results:
    bucket = ab_selector.assign_bucket_and_log(test, user)  # only users the algorithm covers are bucketed and logged
    if bucket == 'test':
        pass  # Render test
    elif bucket == 'control':
        pass  # Render control
else:
    pass  # Render control

You may notice that this snippet is quite similar to the segmentation snippet. In fact, the correct version of the segmentation snippet actually looks quite similar to the incorrect version of the so-called “cherry picking” experiment.

Here, the feature only works for a certain segment of users (or conditions, etc). Let’s assume this experiment is deployed on your site’s homepage, and that it only “works” for 10% of users. And by “works”, we mean that new_hot_algorithm.has_results is true 10% of the time. And let’s further assume that the experiment has been incorrectly configured as shown above, and that the experimental results showed a 5% increase in purchase conversion rates.

The problem here is with the statement “My experiment just increased homepage conversion rates by 5%.” The correct statement requires backing this out over your 10% coverage, which yields an increase in homepage conversion of only 0.5%.
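Spelled out as arithmetic (assuming the covered segment converts at roughly the site-wide baseline rate):

segment_coverage = 0.10      # new_hot_algorithm.has_results is true for 10% of homepage users
lift_within_segment = 0.05   # the experiment showed +5% conversion for covered users

# Assumes covered users convert at roughly the site-wide baseline rate.
sitewide_lift = segment_coverage * lift_within_segment
print(f"Homepage-wide conversion lift: {sitewide_lift:.1%}")  # 0.5%, not 5%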

One could argue that this case isn’t a mistake, per se. However, a fundamental goal of A/B testing is to understand the impact of your changes at hand, and oftentimes this sort of “cherry picking” can lead to confused results.

Problem: You’re letting users “sneak preview” things


Say you’ve turned on the experiment for your new_hot_algorithm above to 5%. People start noticing, Twitter starts blowing up and users start begging you for access.

Responding to these concerns, you quickly modify your experiment to look like this:

bucket = ab_selector.assign_bucket_and_log(test, user)
if current_url.params['sneakpreview'] == 'on':
    bucket = 'test'  # INCORRECT! the logged bucket no longer matches what the user sees
if new_hot_algorithm.has_results:
    if bucket == 'test':
        pass  # Render test
    elif bucket == 'control':
        pass  # Render control
else:
    pass  # Render control

And you give your users a sneak preview link that looks something like http://www.funpalace.com?sneakpreview=on. And they tell their friends and tweet the link. And their friends’ friends retweet it. And suddenly lots and lots of people are sneak previewing.

The problem with the implementation above is that your framework logs the variant on line 1, but then re-assigns the bucket (and consequently, what the user sees) a couple of lines later whenever the sneakpreview parameter is present. Here’s what’s going on:

              sneakpreview not in URL    sneakpreview in URL
Control (A)   A                          B
Test (B)      B                          B

The problem lies in the case where the A/B selection framework selects and logs variant A, but the sneakpreview override forces the user to see variant B.

Problem: Caching!

[Figure: requests flowing through the edge cache to the data center (steps 1 through 5)]

Caching is generally at odds with A/B testing. Say your homepage is entirely static, and that you use Akamai (or some other edge network) to cache your homepage. And say one day you decide to run a test to change the copy on the homepage.

Your code looks something like this:

bucket = ab_selector.assign_bucket_and_log(test, user)
if bucket == 'control':
    greeting = "Greetings!"
elif bucket == 'test':
    greeting = "Welcome!"

And every 30 minutes (or however long your cache TTL is), exactly one request will fall through your edge network to your data center (steps 1 and 2 in the figure above), exercise the code above (step 3), and choose exactly one of the two variants. All subsequent requests for your homepage (steps 4 and 5) will return the cached version of the page, so every visitor sees either “Greetings!” or “Welcome!” rather than a 50/50 mix.

There are plenty of other ways that caching can invalidate your A/B testing results. So think about how you’re using memcache, squid, edge networks, and any other source of caching at any layer.
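If you do need to cache a page that’s under test, one common workaround is to include the variant in the cache key so each bucket gets its own cached copy (at the cost of a lower hit rate). A memcache-style sketch with hypothetical names, reusing the ab_selector convention from the snippets above:

# Hypothetical sketch: vary the cache key by experiment bucket so a cached
# page doesn't leak one variant to everyone.

def render_homepage(user, cache, ab_selector, test):
    bucket = ab_selector.assign_bucket_and_log(test, user)
    cache_key = f"homepage:v1:{bucket}"          # one cached copy per variant
    page = cache.get(cache_key)
    if page is None:
        greeting = "Greetings!" if bucket == "control" else "Welcome!"
        page = f"<h1>{greeting}</h1>"
        cache.set(cache_key, page, time=1800)    # 30 minute TTL, as in the example above
    return page

Edge caches like Akamai can’t run your application code, of course; there you’d typically vary the cached URL or response on the bucket, or simply exclude pages under test from edge caching.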

Problem: You’re using separate bucket logic for both logged in and logged out users


Perhaps the most technical error on this list, this one makes some deeper assumptions about how your bucketing works. Many A/B frameworks bucket users by hashing either user ids or session identifiers. The advantage is that a user’s bucket can be assigned randomly, but no explicit storage (i.e., a database) is required to maintain state. For more details, see Section 4.1.1 of Kohavi et al.’s KDD paper, or refer to Etsy’s implementation on GitHub.

While hashing schemes are great, they break down if you use different bucketing keys for your logged out vs logged in users.
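A minimal sketch of the hashing idea (not Etsy’s actual implementation, and the experiment name below is made up): hash the experiment name plus the user’s identifier into a number between 0 and 99, and map ranges of that number to variants.

import hashlib

def assign_bucket(experiment_name, identifier, test_percentage=50):
    """Deterministically map (experiment, identifier) to 'test' or 'control'.

    Including the experiment name in the hash keeps bucket assignments
    independent across experiments. Real frameworks add logging, ramps, etc.
    """
    digest = hashlib.md5(f"{experiment_name}:{identifier}".encode("utf-8")).hexdigest()
    slot = int(digest, 16) % 100
    return "test" if slot < test_percentage else "control"

# The same identifier always lands in the same bucket...
assert assign_bucket("search_ranking_v2", "QsfdET34") == assign_bucket("search_ranking_v2", "QsfdET34")
# ...but a different identifier (say, an email after login) may land elsewhere.
print(assign_bucket("search_ranking_v2", "QsfdET34"),
      assign_bucket("search_ranking_v2", "jane@gmail.com"))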

Let’s consider an example visit where a user arrives, sees a test, and then logs in:

Example Visit:

User arrives at homepage        No Variant
User searches, triggers test    A
User logs in                    No Variant
User searches, triggers test    B

Since the user’s unique identifier changes after she logs in, it’s possible that her A/B bucket will change as well. For example, if Jane’s session identifier is the string "QsfdET34", which hashes to bucket A, then after she logs in, her email (jane@gmail.com) could very well hash to bucket B.

Let’s assume that your A/B analytics discards visits like Jane’s that have conflicting A and B variants. And say you’re running a 90/10 experiment:


P(A | logged out) = 90%
P(B | logged out) = 10%
P(A | logged in)  = 90%
P(B | logged in)  = 10%

Now let’s look at visits that are both logged in and logged out:

P(A while logged out, A while logged in) = 81%
P(B while logged out, B while logged in) = 1%
P(A while logged out, B while logged in) = 9%
P(B while logged out, A while logged in) = 9%

Your analytics will discard the last two as inconsistent visits, leaving a breakdown of 81% vs 1% for your A and B variants. But your experiment was configured as 90% A and 10% B, resulting in incorrectly sized bins.
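You can sanity check these numbers (or any other split) with a couple of lines:

weights = {"A": 0.90, "B": 0.10}

# Probability of each (logged-out bucket, logged-in bucket) combination,
# assuming the two identifiers hash independently.
joint = {(out, inn): weights[out] * weights[inn]
         for out in weights for inn in weights}

consistent = {k: v for k, v in joint.items() if k[0] == k[1]}
kept = sum(consistent.values())
for (out, _), p in consistent.items():
    print(f"{out}: {p:.0%} of all visits, {p / kept:.1%} of the visits your analytics keeps")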

The simplest solution here is to bucket on only logged in users or only logged out users, but not both.

Problem: You actually have no idea how your A/B system works


There’s more complexity in A/B testing than appears at first glance, and understanding basic measurement constructs is critical in spotting problems.

Understanding your tools is critical to running a valid A/B test. You may be using a complete end-to-end A/B tool like Google Analytics Content Experiments. Or perhaps you’re using Optimizely’s slick JavaScript A/B framework and measuring results with Mixpanel. Maybe you’re doing A/B selection yourself and even using your own pipeline for analysis.

Almost two years ago, Google made some fundamental changes to how they sessionize their web visits. One side effect of this change involved how they count visits from third party referring sites (including email, social media, etc.); understanding this change is critical not just for A/B testing email, but for generally understanding email analytics. And of course the example above illustrates some unexpected side effects of hash-based bucketing.

A/B testing is a great way of measuring the effectiveness of the products and improvements you’re building. A well designed A/B system should make releasing an experiment alongside a feature release just as easy as releasing the feature by itself. But unlike most software debugging where errors can be easily reproduced by refreshing a web page or running a script, such is not the case with analytics and certainly not with A/B testing. So even if your A/B framework makes it really easy to setup and run A/B tests, A/B testing can get very complicated very quickly, and it isn’t always easy to debug things when they go wrong.

Thanks to Dan McKinley for reading drafts of this.

The Life of a Last Minute New York City Restaurant Reservation


There’s been a lot of hoopla recently about so-called high frequency restaurant reservation trading. Are computers stealing my reservations? Will I ever get into Per Se again? It’s Saturday night, I have a hot date but no table, am I screwed?

Last Wednesday, I launched Last Minute Eatin’, a same day reservation service that tweets New York’s hottest tables. While the service has only been running publicly for a couple of days now, I’ve been running it silently for quite some time, and it’s been monitoring thousands of restaurants for several months now.

Last Minute Eatin': How it works

Last Minute Eatin’ tweets the hottest tables at the hardest to get restaurants in New York. At the start of every day, LME checks OpenTable for same day availability at over 800 of New York’s most popular restaurants (specifically, those restaurants with at least 100 reviews on OpenTable). From this set, LME identifies the 100 hardest restaurants to get into: those that rarely have tables available on popular weekend nights.

These “hottest tables” are those listed on the Last Minute Eatin’ homepage.

Throughout the day, Last Minute Eatin’ continually checks each of these restaurants for availability. And during normal business hours (8am onward), LME tweets one table every 20 minutes.

Which table? Well, the hottest table from the most exclusive restaurant, of course.

A brief tour of a same day New York City reservation

So, Last Minute Eatin’ only tweets three reservations an hour, but behind the scenes, LME monitors and analyzes thousands of availabilities every day. How and when are NYC diners booking these tables? Read on.

Let’s take a look at when people start booking last minute tables. The following graph shows same day availabilities starting at midnight and running all the way up until dinner time.

[Figure: same-day availabilities by time of day]

Before noon, availabilities are pretty constant, after which they start to drop off all the way until dinner seating begins. LME only tweets “prime” reservations: tables after 7pm and before 10pm. Prime availabilities are fewer and harder to come by.

Fact: 42% of same day reservations are made for tables at “off-prime” hours: before 7pm or after 10pm.

And as expected, there are fewer availabilities on weekends as compared to weekdays.

[Figure: same-day availabilities by day of week]

Fact: NYC restaurants have 45% less availability on Fridays and Saturdays as compared to the rest of the week.

Cancellations occur when a table that was previously unavailable becomes available. Last Minute Eatin’ is all about cancellations: one minute a restaurant is booked, and the next minute a table becomes available. Last Minute Eatin’ tweets it, you click the link, and the table is yours.

Let’s take a closer look at some of the “hardest to get” cancellations: those that were available for less than 10 minutes before they were rebooked.

[Figure: hard-to-get cancellations by day of week]

Wednesday, Thursday, and Friday have more cancellations than any other day of the week. One hypothesis: Wednesday and Thursday are popular nights for business dinners which are more prone to be rescheduled, cancelled, or last-minute scheduled. And on Fridays, people make plans but are more prone to cancel if they’ve had a long week, etc. This is in contrast to Saturdays, where people generally stick to their plans; availability is hard to come by as are cancellations.

Fact: Although Fridays have relatively low availability rates, they have relatively high cancellation rates.

And let’s now take a look at what time of day we’ll generally find these “hard to get” cancellations:

[Figure: hard-to-get cancellations by hour of day]

Fact: Cancellations spike mid-day just after lunch and then again right before dinner. Surprisingly, cancellations also occur at a steady rate throughout the entire night as well.

High frequency restaurant reservations: the man behind the curtain

And finally, the question we’ve all been waiting for: how common are short, “hard to get” cancellations compared to longer ones?

[Figure: distribution of cancellation durations]

Fact: 70% of all cancellations open and close within a period of 10 minutes or less; 44% in a period of 5 minutes or less.

Equivalently, 26% of all cancellations are open for more than 5 minutes but no more than 10.

So, are computers stealing your reservations? Has “high frequency restaurant reservation trading” become the norm? Or maybe this phenomenon is just an artifact of old-fashioned, hard-workin’ people hanging out on OpenTable, trying to get a reservation.

Follow @LastMinuteEatin on Twitter today to get your last minute table tonight.

Thanks to Nellwyn Thomas, Joe Clark, and Frank Harris for reading drafts of this.

Last Minute Eatin’ Launches; Gets You the Hottest Table in Town


Do you love eating out but hate making plans? There’s nothing worse than trying to find a great table when you need it most, only to find that all your favorite places are completely booked. New York City can be expensive, but the thought of eating bad food at a second or third tier restaurant is unpalatable.

Last Minute Eatin’ is an experiment in immediate gratification and schedule free living. When same day restaurant openings come up, they get tweeted from @LastMinuteEatin along with a link to make your reservation on OpenTable. Last Minute Eatin’ continuously monitors thousands of reservation openings and cancellations every day, so if you see a table tweeted, rest assured it’s one of the hottest tables in the city for your last minute plans.

Last Minute Eatin’ is not about 30% off coupons, free drinks, or 2 for 1 “special offers” at places where no one wants to go. Last Minute Eatin’ is about spending money and enjoying life. I built Last Minute Eatin’ to solve a problem that I faced on a regular basis, and over the past few months, I’ve shared it with a select group of friends and family.

Today, it’s yours, and I hope you enjoy it.

Dr. Jason Davis

Follow @LastMinuteEatin on Twitter today to get your last minute table tonight.

Coding in the Rain

It’s been rainy here in NYC as of late. Just about the only thing worse than 90 degree city heat is 90 degree city heat with intense thunderstorms roaring through. So I find myself indoors when it rains, crunching data, writing code, checking into GitHub.

Of course I’m not unique here. There are thousands of other GitHub coders in New York and millions of contributors worldwide. The data scientist inside of me asks questions. Is it possible to measure these effects? And if so, exactly how much more do people code when it’s rainy?

So I pored over GitHub’s publicly available data on BigQuery and found that, yes, it is possible to measure the effects of weather and, yes, these effects are sizable. But the analysis turned out to be trickier than expected.

Data

The dataset here was collected from GitHub’s data hosted on BigQuery. GitHub check-in size can vary greatly by user and by project. While my check-ins may average a few hundred lines, others may average 2 or 3 lines. So if I write 1,000 lines of code, this may only be 4 or 5 check-ins for me, but dozens of check-ins for others. Instead of using a raw check-in count, I counted the number of unique users per location who checked in at any time during a given day.

Location data was geocoded using Google Maps API, and then nearest weather stations were determined via WeatherUnderground’s API. Historical weather was also collected from WeatherUnderground. Since the collection of historical weather data is somewhat expensive, I restricted analysis to the most popular 1,000 locations on GitHub.

The final dataset contains day, location, count, and weather information for each data point. You can download the full GitHub weather dataset in csv format.

All analysis was done using the IPython notebook and Numpy. If you haven’t checked out IPython’s new notebook feature, it’s awesome.

Challenges

Many factors drive check-in behavior on GitHub. For example, check-ins drop drastically on the weekends; people check in 56% more on Fridays than they do on Saturdays.

[Figure: check-in counts by day of week]

There’s also a general upward trend in GitHub usage, and check-ins drop during the New Year’s holiday.

[Figure: total check-ins over time]

The cyclic weekend / weekday behavior is visible here as well.

Unfortunately, measuring the impact of the weather turns out to be less straightforward. Let’s start by considering the naive analysis below.

A naive analysis of the weather

In the dataset at hand, it rains about 26% of the time. For each day in the dataset, compute total check-in counts for both rainy locations and non-rainy locations. Then normalize these counts across their respective probabilities:


p(rain) * E(rain_count_per_day) = observed_rain_count
p(clear) * E(clear_count_per_day) = observed_clear_count

Solving for the two expectations makes them comparable. Intuitively, each can be thought of as what the check-in counts would have looked like had it rained (or not rained) 100% of the time.
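In code, the naive normalization looks roughly like this (the counts in the example are placeholders, not the real dataset):

P_RAIN = 0.26  # it rains on about 26% of location-days in the dataset

def normalized_daily_counts(observed_rain_count, observed_clear_count):
    """Scale each observed count by the probability of its weather condition."""
    expected_if_always_raining = observed_rain_count / P_RAIN
    expected_if_always_clear = observed_clear_count / (1 - P_RAIN)
    return expected_if_always_raining, expected_if_always_clear

# Hypothetical day: 2,600 check-ins from rainy locations, 7,400 from clear ones.
print(normalized_daily_counts(2600, 7400))  # -> (10000.0, 10000.0): no visible difference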

Unfortunately, this analysis results in an imperceptible difference between the two groups.

[Figure: naively normalized rainy vs clear check-in counts]

And not surprisingly, the average increase in check-ins E(rain_count_per_day) – E(clear_count_per_day) showed effectively zero increase, much less a statistically significant increase.

Simulating a controlled environment

Why did the above analysis show zero change? In short, there’s too much other stuff going on. In the context of weather, factors such as day of week, seasonality, and location are much bigger drivers of check-in volume. And if weather is at all a factor, its effect is smaller.

In a controlled environment, you could run an A/B test: put 50% of users in a rainy bucket and put the other 50% of users in a clear bucket. Half of users would see rain; the other half see clear skies. The resulting data would look something like this: “On June 3, the rain test group within the NYC population committed 5,234 total check-ins, while the clear-skies group committed 4,923 check-ins”. The resulting analysis would simply aggregate these numbers across all dates and locations, and then measure the change in rainy vs sunny environments.

Of course, in the domain of weather, this sort of experiment is impossible.

So while we can’t measure these numbers directly, is it possible to estimate them? E.g., for each day and each location, if it did in fact rain on that day, how many check-ins would we have expected had it actually not rained?

Modeling clear skies with linear regression

To answer this question, I turned to one of the simplest tools in my toolbox: linear regression. To train the model, I used the following variables as inputs:

  • is_weekend: 0/1 indicator variable. “1” if day is Saturday or Sunday, “0” otherwise
  • days_since_beginning: # of days since first day in dataset
  • is_christmas: 0/1 indicator variable. “1” if the day falls in the holiday dip, which actually began on Saturday, December 21 and ran through January 2; “0” otherwise
  • seven_day_average_count: average check-in count over all locations for the 3 days before and after the date

The following graphs show the performance of this model when trained over various subsets of these features. The first plot shows the regression over just days_since_beginning. As expected, the regression captures the overall upward trend, but a single linear feature doesn’t really capture what’s going on.

[Figure: regression over days_since_beginning only]

Adding in the seven-day average count does a much better job of capturing non-linear growth patterns.

[Figure: regression over days_since_beginning and seven_day_average_count]

But the model still doesn’t capture the week-over-week dips in weekend check-ins. Adding is_weekend, along with the rest of the variables, improves the regression further.

[Figure: regression over all features]

Notice how the root mean squared error (RMSE) decreases as more and more independent variables are added to the model.

Given that the regression models above provide a reasonable way of predicting check-in counts, we can build a model over clear days, apply it to days when it actually rained, and then compare the actual number of check-ins to the expected number of check-ins predicted by the model.

Specifically, the process is:

  1. Split the data into a training set consisting of sunny date / location pairs, and a test set consisting of rainy pairs.
  2. Train a linear model (using the variables defined above) over the sunny training set.
  3. Evaluate the model on the rainy test set, and measure the change in actual counts compared to the counts predicted by the model.

The linear regression models account for the major sources of variance outlined above: weekends, seasonality, and top line GitHub growth. So for example, if it’s sunny in New York on a Friday and rains the next day on Saturday, the model will predict a decrease based on the fact that the weekend has arrived. The above process then compares this predicted (decreased) count for Saturday with the actual observed count. A decrease in check-in counts from Friday to Saturday is expected; the model predicts exactly how large this decrease should be had it not rained on Saturday, and comparing it with the actual value shows whether the decrease was larger or smaller than expected. The model serves as a normalizing factor.
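A condensed sketch of steps 1 through 3 (the csv name and column layout are my assumptions about the dataset above, not its actual schema):

import numpy as np
import pandas as pd

df = pd.read_csv("github_weather.csv")  # assumed columns: date, location, count, rained, plus the features above

features = ["is_weekend", "days_since_beginning", "is_christmas", "seven_day_average_count"]

def design_matrix(frame):
    X = frame[features].to_numpy(dtype=float)
    return np.column_stack([np.ones(len(X)), X])  # add an intercept column

train = df[df["rained"] == 0]   # sunny location-days
test = df[df["rained"] == 1]    # rainy location-days

# Fit ordinary least squares on sunny days only.
coeffs, *_ = np.linalg.lstsq(design_matrix(train), train["count"].to_numpy(dtype=float), rcond=None)

# Predict what rainy days "should" have looked like without rain, then compare.
predicted = design_matrix(test) @ coeffs
residual_lift = (test["count"].to_numpy(dtype=float) - predicted) / predicted
print(f"Average lift on rainy days vs clear-sky prediction: {residual_lift.mean():.1%}")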

The following plot shows predicted rain values vs actual rain values.

[Figure: predicted vs actual check-in counts on rainy days]

The actual counts are higher than the predicted counts, and people do in fact check in more often when it rains. The residuals of the regression (the average increase in actual rainy-day check-ins as compared to predicted clear-weather check-ins) show a 10.1% increase. Or equivalently, people code (or at least check in) 10% more when it’s raining than when it’s clear.

Assessing statistical significance

The method above partitions the data by weather: all sunny data goes into the training set and all rainy data into the test set. To assess statistical significance, I ran the same fit-and-predict process over randomly selected training and testing sets instead of weather-based ones. This is a standard process in statistics known as bootstrapping.
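Schematically, the resampling loop looks something like this (a standalone sketch; the number of iterations and the 26% test fraction mirror the setup above but are otherwise arbitrary):

import numpy as np

def average_lift(train_X, train_y, test_X, test_y):
    """Fit OLS on the training rows; return mean relative lift of actual vs predicted on test rows."""
    coeffs, *_ = np.linalg.lstsq(train_X, train_y, rcond=None)
    predicted = test_X @ coeffs
    return float(np.mean((test_y - predicted) / predicted))

def resampled_lifts(X, y, test_fraction=0.26, n_iterations=1000, seed=0):
    """Null distribution: repeat the fit/predict procedure on random (not weather-based) splits."""
    rng = np.random.default_rng(seed)
    lifts = []
    for _ in range(n_iterations):
        test_mask = rng.random(len(y)) < test_fraction
        lifts.append(average_lift(X[~test_mask], y[~test_mask], X[test_mask], y[test_mask]))
    return np.array(lifts)

# Significance: the fraction of random-split lifts that reach the observed 10.1%.
# p_value = np.mean(resampled_lifts(X, y) >= 0.101)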

The following graph shows the distribution of the average residual increase of actual vs predicted values over the testing set.

[Figure: distribution of the average residual increase over random train/test splits]

Since over 99.5% of the probability mass falls to the left of the 10% average residual increase value, we can conclude that the result here is in fact significant.

Conclusions

Is a 10.1% increase in check-ins on rainy days a surprising result? Probably not. But understanding relationships between human behavior and these sorts of externalities can be critical. While A/B tests represent a holy grail in terms of analysis, they’re not always feasible, possible, or practical.

The key idea here lies in building a model off the “A” test group and then applying it to infer what points in the “B” group would have looked like had they been in “A”. But in order to do this, you need a reasonable way of modeling behavior of what’s going on. In the domain of GitHub and weather, a simple linear regressor worked quite well. In other domains, more sophisticated algorithms may be necessary. And in some domains, building such a model may be impossible.

This sort of methodology can also be adapted to the context of scenario planning. What would check-in counts have looked like in July had it rained every day? Would online sales in Q1 have been lower had a massive snowstorm not hit the northeast? Will the construction delays on the F train affect my business in August? In the GitHub analysis here, the predictive model was applied to historical data. Applying this sort of analysis to future scenarios could be a useful tool for all types of businesses.

But alas, if GitHub sees an uptick in July check-ins, we know that at least some of this can be blamed on the weather.

Thanks to Andrew Morrison, Daniel Loreto, and Brian Kulis for reading drafts of this.

A Corollary to ExperimentCalculator.com (with examples)


Dan McKinley recently put together a very useful tool for estimating how long to run your A/B tests.

The obvious corollary here being, “your experiments will take much longer than you think”.

Let’s dive into some real-world numbers.

Adwords campaign optimization

The scenario. You’re buying clicks from Google Adwords to get people to sign up for your startup’s new service. You just made some copy changes to the landing page which you’re hoping will improve signup conversion. Your base signup rate is 10%, and you expect your new changes to increase signup rate to 15% (a +50% increase!). You spend $0.50 per click with a budget of $100 per day, so your landing pages see a total of 200 visits each day.

The statistics. You’d have to run this campaign for 8 days and spend $800 to verify the change. Alternatively, if the conversion rate increased to only 11% (a +10% relative change), you’d have to spend $15,000 to verify it.
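You can reproduce these orders of magnitude with the standard two-proportion sample size approximation (ExperimentCalculator wraps something similar; the 5% significance level and 80% power below are my assumptions, and different choices shift the numbers a bit):

from math import ceil
from statistics import NormalDist

def required_days(base_rate, target_rate, visitors_per_day, alpha=0.05, power=0.80):
    """Days needed for a 50/50 test, using the normal-approximation sample size formula."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = base_rate * (1 - base_rate) + target_rate * (1 - target_rate)
    n_per_variant = (z_alpha + z_beta) ** 2 * variance / (base_rate - target_rate) ** 2
    return ceil(2 * n_per_variant / visitors_per_day)

print(required_days(0.10, 0.15, 200))  # roughly a week of $100/day spend
print(required_days(0.10, 0.11, 200))  # months of spend to detect a +10% relative change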

Ecommerce optimization (Etsy)

The scenario. During a company hack week, a designer makes several changes to the cart page and wants to run a 1% experiment. The designer is quite bullish about the changes and thinks that it could in fact boost sales by 5% (!), or about $50 million from 2013’s expected sales of over $1 billion.


The statistics. According to their blog, Etsy sold over $100 million worth of goods in April with almost 1.5 billion page views. Assuming standard e-commerce conversion rates of 4% (along with some other assumptions about average order size), this experiment would need to run for over 3 years! An experiment affecting 10% of users would require only two weeks.

My last startup (Adtuitive)

The scenario. We bought relatively cheap display ads on niche content sites and matched sku-level ads from our database of millions of products. Depending on placement and sites, click rates for us were sometimes around 0.1% (which believe it or not was a huge improvement over static banner ads). We were serving around 200 million ads a month, and we were releasing an algorithmic change that we thought might increase click rates (and our revenue) by 10% (!).


As the change was somewhat major, we didn’t want to roll it out to more than 10% of visits during our experiment.

The statistics. We would have had to run the experiment for 39 days. Our 200 million ads per month equated to over 6 million per day, or about 500k visits per day (visitors view multiple ads). Running it at 50% would have required only 7 days.

Takeaways

Calling bullshit. Next time someone claims they increased their landing page conversion from 10% to 15%, you may want to question things. Exactly how many conversions are they dealing with? And how many separate changes did they make? Small changes are also harder to measure than larger ones.

Google’s famed 1% experiments really only work at Google scale. You’ll have to run your experiments at 10% or 50% levels. And of course, make sure you double check your statistics.

Opportunity cost. Experiments take more than just design and software to code up; they also take time to run and verify. So before restyling the checkout button, ask yourself if there are other parts of your core funnels or product that you’d be better off testing first.

Other reasons to test. Sometimes changes are necessary to accommodate future functionality or new strategic directions for the overall product, e.g., restyling the cart page to provide more whitespace for a future gift cards launch, or revamping the homepage to give attention to some fledgling social aspects of your site. In these cases, even when you expect a 0% change (or even a negative change), testing is still important to understand impact. And of course, the statistics still apply.

So, next time you’re planning to run an experiment, you may want to spend some time with Mr. ExperimentCalculator.com first. Your intuition is most likely wrong.

Restart


Etsy acquired my startup Adtuitive in 2009. At the time, we had a pretty cool product that automated online advertising for small retailers, and we were operating at a modest scale of 200 million ads a month.

Deciding to sell the company was tough, but the last three years at Etsy were awesome. I had the privilege of working with very talented folks across a full stack of things, from Hadoop infrastructure to search ranking to search UI. And of course, search ads. During this time, I saw Etsy grow from $180 million in sales in 2009 to over $80 million last October alone. My team grew from Adtuitive’s engineering team of only 4 to almost 30 in total.

I’ll miss my time at Etsy, but I’m an entrepreneur at heart, and it’s time to start over. I’ll be taking some time off in the upcoming months: time off from work, time off from management, time off from NYC life. I look forward to writing code again and thinking about real world problems at a fundamental and disruptive level. I’ll be back up and running at 110% sometime early to mid next year.

Stay in touch, or get back in touch, with me at jvdavis ‘at’ gmail.com.

NYC Dining: The cost of a “B”

If you eat out in New York City, the letter grade posted in a restaurant’s window should evoke some sort of visceral reaction. In July of 2010, the NYC Department of Health began rating each of the 24,000 restaurants throughout the city’s five boroughs. Each restaurant is given a grade of “A”, “B”, or “C” based on violations ranging from improper food temperature to sewage problems to the presence of vermin. You can browse the complete list here.

Fast forward two years, and the new system seems to be a win for consumers: Mayor Bloomberg credits the program with a 14% reduction in Salmonella cases, the lowest rate in 20 years. And according to this press release, NYC restaurant revenue is also up 9.3% since grading began. Still, many restaurateurs disagree, expressing anger over these health inspections. Restaurants complain about the complexity of the grading system, fighting with the city over infraction points, and spending additional money to keep their facilities up to the city’s guidelines.

But clearly the biggest cost associated with the city’s program is the fear of a “B” rating, or, even worse, an unmentionable “C” rating.

Just how costly is a “B”?

To quantify these costs, I correlated NYC restaurant inspection grade changes with restaurant ratings on the popular review site Yelp. Starting with the 1,000 most popular Manhattan restaurants on Yelp, I crawled each of their review pages and extracted ratings for each restaurant. NYC health inspection ratings are available via NYC’s OpenData initiative, and each of these top Yelp restaurants was then joined with its corresponding health code ratings. All code is available on GitHub under my Nyc Restaurant Inspection Project, along with a csv that joins Yelp restaurant reviews with their corresponding inspection ratings.
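The core before/after comparison boils down to something like the pandas sketch below; the real code is in the GitHub repo above, and the file and column names here are assumptions about the joined csv, not its actual schema.

import pandas as pd

WINDOW_DAYS = 60

# Assumed columns: restaurant_id, review_date, stars, grade_change_date, old_grade, new_grade
df = pd.read_csv("yelp_reviews_with_inspections.csv", parse_dates=["review_date", "grade_change_date"])

offset = (df["review_date"] - df["grade_change_date"]).dt.days
before = df[(offset < 0) & (offset >= -WINDOW_DAYS)]
after = df[(offset >= 0) & (offset < WINDOW_DAYS)]

summary = pd.DataFrame({
    "rating_before": before.groupby(["old_grade", "new_grade"])["stars"].mean(),
    "rating_after": after.groupby(["old_grade", "new_grade"])["stars"].mean(),
    "count_before": before.groupby(["old_grade", "new_grade"]).size(),
    "count_after": after.groupby(["old_grade", "new_grade"]).size(),
})
summary["rating_delta"] = summary["rating_after"] / summary["rating_before"] - 1
print(summary)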

According to the Mayor’s argument, Salmonella cases have gone down because restaurant inspection ratings have, on average, increased since the start of the program. The Mayor’s report claims that the number of “A” ratings has increased from 65% to 72% of all restaurants since the start of the program. Within the set of top Manhattan restaurants analyzed here, trends are similar: the average inspection rating (on a scale where 5.0 represents “A”, 4.0 “B”, etc.) has also increased since July 2010.

Looking at average Yelp reviews since 2005, we can see that the time period since August 2010 is relatively stable, hovering between 3.8 and 3.9.

To get a better sense of how ratings are affected by inspection grades, let’s look at restaurant grade changes (“A” to “B”, “A” to “C”) and see how their Yelp ratings changed in the 60 days before and after:

Change    Rating Before    Rating After    Delta
A -> C    3.94             3.68            -6.7%
B -> C    3.86             3.69            -4.6%
A -> B    3.77             3.76            -0.3%

Restaurants downgraded to a “C” rating received significantly lower Yelp ratings in the 60 days after the downgrade, but restaurants receiving a “B” rating were relatively unaffected in their review quality.

So restaurants with “C” ratings tend to receive lower-quality Yelp reviews, but do lower grades deter people from dining at a restaurant in the first place? Looking at overall review counts for the 60-day periods before and after rating changes:

Change    Count Before    Count After    Delta
A -> C    167             157            -6.0%
B -> C    214             230            +7.5%
A -> B    724             699            -3.5%

The increase in review counts in “B” to “C” downgrades is most likely due to the data being somewhat thin. Across all three downgrades, Yelp review counts as well as rating counts showed average decreases of almost 2%.

Takeaways

A recent study by Michael Luca found that increased Yelp review rating quality can lead to increased revenue. Among other things, the study also found that independently owned restaurants were much more affected by these reviews than ones with chain affiliations. Many of Manhattan’s top restaurants analyzed here are independent, and the decrease in Yelp ratings here undoubtedly also corresponds to lost revenue.

An interesting question to consider is one of causation: one goal of the inspection program is to improve sanitary conditions at restaurants in NYC. When a restaurant transitions from an “A” rating to a “C” rating, the only change we can say for certain is the letter grade posted outside the front door. In the days and weeks following a downgrade, one would expect restaurants to actually clean up their sanitary conditions. So, during the time period analyzed here, sanitary conditions before the downgrade are probably worse than after.

Of course, the other goal of NYC’s inspection process is to increase consumer awareness. And consumers seem to notice: when restaurants are downgraded, the costs are measurable.