Hacker News
The Surprising Power of Online Experiments (2017) (hbr.org)
81 points by nadalizadeh on Feb 24, 2019 | 24 comments


Notably absent from this article is any discussion of multi-armed bandits. A/B testing only tells you which of the treatments is better, whereas multi-armed bandit algorithms find the best treatment and exploit it during the test itself. If profit is involved, it seems obvious that you should be exploring treatments while, most importantly, exploiting the treatment that currently yields the most profit.

Google has a pretty good FAQ on this:

https://support.google.com/analytics/answer/2847021?hl=en&re...
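
To make the explore/exploit trade-off above concrete, here is a minimal epsilon-greedy sketch in Python (my own illustration, not from the article or the FAQ; the treatment names and conversion rates are invented):

    import random

    # Invented treatments with conversion rates the algorithm does not get to see.
    TRUE_RATES = {"A": 0.10, "B": 0.12}

    def epsilon_greedy(epsilon=0.1, rounds=10_000):
        """With probability epsilon show a random treatment (explore);
        otherwise show the treatment with the best observed rate (exploit)."""
        shows = {k: 0 for k in TRUE_RATES}
        wins = {k: 0 for k in TRUE_RATES}
        for _ in range(rounds):
            if random.random() < epsilon:
                arm = random.choice(list(TRUE_RATES))  # explore
            else:
                # Exploit; unplayed arms get priority via an "optimistic" infinite estimate.
                arm = max(TRUE_RATES,
                          key=lambda k: wins[k] / shows[k] if shows[k] else float("inf"))
            shows[arm] += 1
            wins[arm] += random.random() < TRUE_RATES[arm]  # simulated conversion
        return shows, wins

    print(epsilon_greedy())

Unlike a fixed 50/50 split, most of the traffic ends up on the better-performing treatment while the test is still running.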


Gonna toot my own horn here for a second, but here[0] is a presentation I did on multi-armed bandits (and specifically Thompson sampling, the tragically underutilized optimal bandit method) for an undergrad ML group. Amusingly, the diagram I use on slide 3 is from a Microsoft research paper. We had a guest speaker from MS's exp-platform team [1] the prior week; she had discussed A/B testing but hadn't touched on bandits, and I felt the need to make this exact point.

[0]: https://gtagency.github.io/2016/experimentation-with-no-ragr...

[1]: https://exp-platform.com/
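
Since the slides themselves aren't reproduced here, a minimal Thompson-sampling sketch for Bernoulli conversions (my own illustration, not taken from the linked presentation; the per-variant rates are made up):

    import random

    TRUE_RATES = [0.10, 0.11, 0.13]  # made-up per-variant conversion rates

    def thompson_sampling(rounds=10_000):
        # Beta(1, 1) prior on each arm's conversion rate.
        alpha = [1] * len(TRUE_RATES)  # 1 + observed successes
        beta = [1] * len(TRUE_RATES)   # 1 + observed failures
        pulls = [0] * len(TRUE_RATES)
        for _ in range(rounds):
            # Draw a plausible rate for each arm from its posterior and
            # play the arm whose draw is highest.
            draws = [random.betavariate(a, b) for a, b in zip(alpha, beta)]
            arm = draws.index(max(draws))
            pulls[arm] += 1
            if random.random() < TRUE_RATES[arm]:
                alpha[arm] += 1
            else:
                beta[arm] += 1
        return pulls

    print(thompson_sampling())  # traffic should concentrate on the best arm over time

The appeal is that exploration falls out of the posterior uncertainty itself: arms are tried roughly in proportion to the probability that they are the best one, with no epsilon to tune.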


Yep, and GA does multi-armed bandit testing.

Definitely the way to go for A/B.


It feels click-baity to reference a $50M revenue improvement without specifying the revenue before the split-tests. Is $50M a 1% improvement? 100%?


Clickbait scores well in A/B testing, so I'd expect an article about A/B testing to have a clickbait title.


And volume. A/B testing a website/service with 10 visitors per day doesn't make sense at all.


Anybody have a recommendation for an A/B testing service?

We've talked to Optimizely, but their pricing was going to come in the same ballpark as our AWS spend (into the six-figure range), which seems absurd. They charge based on monthly users, but a lot of our traffic consists of organic search bounces.

For now we just want to run ~5 experiments per month, want to record events server-side so we can be sure not to lose any, and are wary about implementing it ourselves since there are so many ways to screw it up without realizing it.


You could try the free version of Google Optimize if you don't mind using a Google product.


We use Hansel


At one place I worked we got around Optimizely pricing by only sending down Optimizely code to some small subset of our users — enough to yield statistical significance but not so much to break the bank.


You can use VWO.com

PS: it’s a product that I launched here on HN 9 years ago and wouldn’t have been possible without the awesome community.


It's not that hard to DIY. You can start with: https://facebook.github.io/planout/
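
For what it's worth, the assignment step in a DIY setup is usually just a deterministic hash of (experiment, user id), so a returning user always sees the same variant. A rough sketch (not PlanOut's actual API; the experiment name and user id are placeholders):

    import hashlib

    def assign_variant(experiment: str, user_id: str,
                       variants=("control", "treatment")) -> str:
        """Deterministically bucket a user: same experiment + user id,
        same variant, every time, on any server."""
        digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
        return variants[int(digest, 16) % len(variants)]

    # Log the assignment server-side together with the conversion events.
    print(assign_variant("checkout-button-color", "user-42"))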


Split (https://www.split.io/) works pretty well for us


I hypothesize that many of these statistical tests have led to worse outcomes. Not in the commonly espoused bad-for-society externality sense, but in a straight-up bad-for-business sense. There are three problems I see with running A/B tests at vast scale on small changes.

1) If the effect size is incredibly small, which it will be for most minor UI changes, detecting it with a statistical test is really difficult. If you're looking for an incredibly small positive effect, even with a sample size in the hundreds of thousands, the probability of rejecting the null hypothesis when the effect is actually negative is surprisingly high (see the simulation sketch after this comment). Very easy to make mistakes.

2) Short-term gains in engagement may lead to long-term disengagement.

3) Business incentives for management are easily misaligned. I would imagine a dominant negative influence is managers exaggerating the effect found in a test because it means they get to lead the change, in an otherwise vast tech ecosystem whose overall performance probably won't change all that much. Attribution is also hard (how sure are they about how much to attribute to the change?), so credit is difficult to allocate beyond the initial value sizing.
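
Regarding point 1, here is a small simulation sketch of that sign-error problem (my own, with invented numbers): the treatment is truly slightly worse than control, yet in runs like this a noticeable share of the results that reach "significance" declare it significantly better.

    import numpy as np

    def sign_error_rate(base_rate=0.05, true_lift=-0.0002, n=100_000,
                        trials=5_000, z_crit=1.96, seed=0):
        """A/B test a treatment that is truly slightly WORSE, many times over,
        and see how often a z-test calls it significantly BETTER."""
        rng = np.random.default_rng(seed)
        a = rng.binomial(n, base_rate, size=trials)              # control conversions
        b = rng.binomial(n, base_rate + true_lift, size=trials)  # treatment conversions
        p_a, p_b = a / n, b / n
        pooled = (a + b) / (2 * n)
        z = (p_b - p_a) / np.sqrt(2 * pooled * (1 - pooled) / n)
        significant = np.abs(z) > z_crit
        wrong_direction = np.mean(z[significant] > 0) if significant.any() else 0.0
        return significant.mean(), wrong_direction

    # Returns (share of runs reaching significance, share of those pointing the wrong way).
    print(sign_error_rate())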


I literally did a research project for a firm recently where we lowered wages. Workers are heavily monitored; I have data on every minute of their day. Post wage cut, workers worked just as hard. Awesome! Five months later, the best people are (significantly) gone. Seven months later, it is demonstrably obvious that this was a value-destroying move: average fixed effects look horrible now, yet this is only clear to those of us watching from the outside. Internally, the difference is completely overlooked.

I needed to find cites, and ironically I found the parent article this morning. This one is way better on the lies we tell ourselves: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3204791


How do you measure people like that?


1) Agreed that the smaller the effect, the more statistical power (usually from a larger sample size) you need to detect it. But assuming that all changes have tiny effects, and are therefore undetectable and a waste of time, is flawed. (A rough sample-size sketch follows at the end of this comment.)

Once upon a time we published over 100 a/b tests here: https://www.goodui.org/evidence/ and clearly the relative effects vary (not every single change has a small effect).

What's more, the effects of a/b tests can be further increased by grouping multiple higher-confidence ideas together into a single variation.

2) Short-term gains may (or may not) lead to long-term disengagement. Measuring both micro (shallow) and macro (deeper) metrics would be the right way to answer this.
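
On the sample-size question in 1): a rough back-of-the-envelope sketch using the standard two-proportion approximation (my own parameter choices, not numbers from GoodUI):

    from statistics import NormalDist

    def n_per_arm(p_control, p_treatment, alpha=0.05, power=0.8):
        """Approximate users needed per arm for a two-sided two-proportion z-test."""
        z_a = NormalDist().inv_cdf(1 - alpha / 2)
        z_b = NormalDist().inv_cdf(power)
        var = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
        return int((z_a + z_b) ** 2 * var / (p_treatment - p_control) ** 2) + 1

    print(n_per_arm(0.05, 0.0525))  # 5% relative lift: roughly 120k users per arm
    print(n_per_arm(0.05, 0.0505))  # 1% relative lift: roughly 3 million per arm

Which is the flip side of the same point: big effects are cheap to detect, while tiny ones need traffic that only very large sites have.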


Source: https://hbr.org/2017/09/the-surprising-power-of-online-exper...

I used to work with Ron Kohavi in his group.


Ouch. We changed the URL from http://blog.rootshell.ir/2019/02/how-to-increase-annual-reve..., which seems to have copied that content, and banned that site.

Thanks for the heads-up.


Thanks. I was wondering why Ron was writing there; he wasn't.


Good in-depth overview, but unfortunately it doesn't get into the bigger issue of the ethical concerns this sort of narrow maximization raises. At a certain point you end up becoming Facebook or YouTube and vastly amplifying toxic and dangerous content because it generates more comments or more time spent on the site. And even if you believe in pure amoral capitalism, blindly following what your algorithms tell you to do will eventually lead you into a trap, and the backlash won't necessarily be pretty.


Supporting your point, YouTube claims to be backing off its "clicks at any cost" model.

> As has been mentioned previously, our business depends on the trust users place in our services to provide reliable, high-quality information. The primary goal of our recommendation systems today is to create a trusted and positive experience for our users. Ensuring these recommendation systems less frequently provide fringe or low-quality disinformation content is a top priority for the company. The YouTube company-wide goal is framed not just as “Growth”, but as “Responsible Growth”.

[1] https://blog.google/documents/33/HowGoogleFightsDisinformati...


Nice article but visual examples would really help.

Someone could change the CSS of the ads, for example to flash periodically, and that would be the reason for the increased clicks.


I always wonder how long the winner of an A/B test keeps its edge.



