Real Price of an A/B Test Between Service Providers

Michael Kechinov
CEO at REES46 Technologies Inc.

Logic tells us that every online store should choose the marketing automation and personalization services that drive its revenue, and never use tools that hinder it or, even worse, incur losses. However, in real life, people often make weird choices.

Digital marketers like to limit their logic to “This company provides its services to some big and famous retailers. Those are no slouches when it comes to making money, so I’d better make the same choice.” Guess what: a lot of those big retailers based their choice on the same logic.

Why do that when it is so easy to run an A/B test and learn the real value of a service to you before putting any big money on the table?

I recently learned, to my surprise, that such useful A/B tests are rare to nonexistent. The reason, according to the market, is that running them is very costly: an A/B test of two recommendation engines would supposedly cost over $10000. Having run the calculations three times, I asked the only question on my mind: “Whoa..?! Where did you get these numbers from?”

In reply I was given the following list, which I quote:

  1. The price of correctly set up server-side traffic distribution and of sending custom dimensions to Google Analytics via Measurement Protocol.
  2. Since counting and splitting of the experimental sessions must start from the first page, even for new visitors, the testing company will have to use its own clientID instead of the Google CID (which is generated on the client’s side).
  3. Order status tracking in Google Analytics via Measurement Protocol (full sync with CRM/ERP).
  4. The full dataset on every order and session, loaded from Google Analytics via their API after the test has finished.
  5. Matching this data with the marginal income of every order to calculate EBITDA for each.
  6. Loading the resulting numbers into R/Python/Excel and assessing the sample distribution.
  7. Finding and isolating all the anomalies (95th/99th percentile) and calculating the standard deviation.
  8. Also, calculating the confidence intervals and statistical validity of the sample.
  9. Continuing the test if the confidence intervals are too wide and the required statistical significance has not been reached.

All that would cost you over $10000. And that is if you are able to find an expert who can manage it and a store that will agree to the test.

How can we be talking about a profitability analysis when 90% of online stores do not know how to run server-to-server transaction management?

End of the quote.

My reaction was literally:

What would I understand from this text if I were a marketer with very limited coding experience? I would look at the phrases “server traffic distribution”, “measurement protocol”, “percentile”, “confidence intervals” and think: “Wow! What is all that?! That’s way too complicated and would take the next half a year to understand.”

Unfortunately (for my opponents), I’ve spent half of my life coding, and half of that time managing different developer teams. I know well what devs tell their managers when they want to get out of a project: “This will be unbelievably complicated and will take a year.” They also use techy slang like “legacy”, “needs a lot of refactoring”, “the architecture won’t support it” and so on to back up their response. By the way, if you hear something like this from your developers, just tell them to stop whining.

It’s the same situation in our case with A/B testing.

I’ll try to translate the requirements into a more “human” language. I’ll also give you the code and the real price for this job. But first, let’s speak about the basics! In my area, the average price for a qualified outsourced dev is about $80 per hour. Given the cost of the job stated above ($10000), we can presume that an A/B test would require 125 hours of work. In other words, about 4 weeks of coding (we also presume that devs really work 5-6 hours a day and spend the rest drinking coffee, smoking and visiting the bathroom).
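The arithmetic behind that estimate can be checked in a few lines. This is only a sketch using the figures from the paragraph above; the five-day working week is my assumption.

```python
# Back-of-the-envelope check of the quoted budget (figures from the text).
quoted_price = 10_000   # USD, the price quoted for the A/B test
hourly_rate = 80        # USD per hour for an outsourced developer

hours = quoted_price / hourly_rate

# Assumption: a 5-day working week with 5 to 6 productive hours a day.
weeks_at_6h = hours / (6 * 5)
weeks_at_5h = hours / (5 * 5)

print(hours, round(weeks_at_6h, 1), round(weeks_at_5h, 1))
```

So the quoted price buys somewhere between four and five weeks of one developer's full attention.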

So, a month-long test. A qualified dev consistently working on the project for 1 month. Will this person single-handedly babysit the server to keep it stable throughout the project? Or will this person assign orders to the chosen segments manually, each time a new order comes in?

Let us return to the “unbelievably complicated” task list I’ve been given by my market colleagues.

The first point of attention is Google Analytics. The author says you cannot trust it, so we need to assign the visitors to the segments on the server ourselves, then upload this data to Google Analytics, and finally download it back from GA and match it against our dataset. This scheme is rather complicated, and I cannot understand why you would still use Google Analytics if you cannot trust it. In that case, it’s better to skip it altogether.

I do agree on one thing, though: if the goal of the test is to measure the net profit, then Google Analytics or Google Optimize is not the best choice. Why? Because we need to work only with the orders that have been paid for, not with all orders, and with the complete data set, not the partial data available in Google Analytics.

Let’s assign the visitors to the segments on the server and add the info about the segment to each order ourselves.

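PHP code that covers items 1 to 4:

// Assumes session_start() has been called and $db is an open PDO connection.
if (empty($_SESSION['abc_test_segments'])) {
  $segments = ['A', 'B', 'C'];
  // random_int() includes both bounds, so the last valid index is count - 1.
  $segment = $segments[random_int(0, count($segments) - 1)];
  $_SESSION['abc_test_segments'] = $segment;
  // One row per newly assigned visitor.
  $db->prepare('INSERT INTO segments (segment) VALUES (?)')->execute([$segment]);
}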

Translated into human, it means: if the customer has not yet been assigned to a segment, assign one and create a record in the database for this visit.

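If this customer places an order, record the segment on that order:

// $db is assumed to be an open PDO connection; the segment comes from the visitor's session.
$db->prepare('UPDATE orders SET segment = ? WHERE id = ?')->execute([$_SESSION['abc_test_segments'], $order_id]);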
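For readers who would rather try the idea without a PHP stack, here is a rough Python equivalent of the two steps above. It is a sketch, not the store's real code: the function names `assign_segment` and `tag_order` are hypothetical, a plain dict stands in for the visitor's session, and an in-memory SQLite database stands in for the store's database.

```python
import secrets
import sqlite3

SEGMENTS = ["A", "B", "C"]

def assign_segment(session: dict, db: sqlite3.Connection) -> str:
    """Return the visitor's segment, assigning one on the first visit."""
    if "abc_test_segments" not in session:
        segment = secrets.choice(SEGMENTS)  # uniform random pick
        session["abc_test_segments"] = segment
        # One row per newly assigned visitor.
        db.execute("INSERT INTO segments (segment) VALUES (?)", (segment,))
    return session["abc_test_segments"]

def tag_order(session: dict, db: sqlite3.Connection, order_id: int) -> None:
    """Copy the visitor's segment onto the order they have just placed."""
    db.execute(
        "UPDATE orders SET segment = ? WHERE id = ?",
        (session["abc_test_segments"], order_id),
    )

# Stand-in schema and data, for illustration only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE segments (segment TEXT)")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, segment TEXT)")
db.execute("INSERT INTO orders (id) VALUES (1)")

session = {}
first = assign_segment(session, db)
again = assign_segment(session, db)  # same visitor, same segment
tag_order(session, db, 1)
```

The second call returns the segment already stored in the session, so a returning visitor never switches segments in the middle of the test.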

SQL code that covers items 5 to 7:

-- items.id is assumed to be the primary key of the items table.
SELECT SUM(order_items.amount * items.price * items.price_margin) AS net_profit_segment_A
FROM order_items
LEFT JOIN items ON items.id = order_items.item_id
WHERE order_items.order_id IN (
  SELECT id FROM orders
  WHERE date BETWEEN (…) AND segment = 'A' AND status = 'complete'
    AND order_sum <= (SELECT MAX(order_sum) * 0.95 FROM orders WHERE date BETWEEN (…) AND status = 'complete')
    AND order_sum >= (SELECT MAX(order_sum) * 0.05 FROM orders WHERE date BETWEEN (…) AND status = 'complete')
);

It may look complicated to you, but an experienced dev would need only a couple of minutes to work it out. Translated into human, it means: get the net profit for all completed orders from segment A, excluding the most and least expensive orders as anomalies. The cut-offs here are 95% and 5% of the largest order sum, a quick stand-in for true percentiles. To get the net profit for segments B and C, you only need to change the letter A in the query to B or C respectively.
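The query above cuts at fractions of the largest order sum. If you want real 5th and 95th percentile cut-offs, as item 7 of the quoted list asks, the same calculation could be sketched in Python as below. The function name `trimmed_net_profit` and the sample orders are made up for illustration; in practice the rows would come from the orders table.

```python
from statistics import quantiles

def trimmed_net_profit(orders, low_pct=5, high_pct=95):
    """Sum net profit over orders whose sum lies inside the percentile band."""
    sums = [order["order_sum"] for order in orders]
    # With n=100, quantiles() returns the 99 cut points for percentiles 1..99.
    cuts = quantiles(sums, n=100, method="inclusive")
    low, high = cuts[low_pct - 1], cuts[high_pct - 1]
    return sum(
        order["net_profit"]
        for order in orders
        if low <= order["order_sum"] <= high
    )

# Made-up data: 100 completed orders of segment A with sums 1..100.
# Net profit is set equal to the order sum only to keep the numbers simple.
orders_a = [{"order_sum": s, "net_profit": s} for s in range(1, 101)]
profit_a = trimmed_net_profit(orders_a)
print(profit_a)
```

With this sample data the five cheapest and five most expensive orders fall outside the band, and only the remaining 90 contribute to the total.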

Item 8: the confidence intervals. What is a confidence interval? It is the range of values within which the true value of a metric is likely to lie. In our case, it tells us how large the difference in proceeds between segments must be before we can treat it as a real effect rather than noise. For instance, if segment A brings in $1746 and segment B $1754, the difference ($8) is too small to say that segment B will reliably outperform segment A by 0.5%.

A difference of $349, however, is large enough to count as a real effect. What about $78? This is where a confidence interval estimate helps: a concept (and a formula) familiar to any mathematics or economics student.

The code to determine the confidence interval (close your eyes, if you’re not a dev):

SELECT (
  1.96 * SQRT(
    (
      (SELECT COUNT(*) FROM orders WHERE segment = 'A') / (SELECT COUNT(*) FROM segments WHERE segment = 'A')
      * (1 - (SELECT COUNT(*) FROM orders WHERE segment = 'A') / (SELECT COUNT(*) FROM segments WHERE segment = 'A'))
    ) / (SELECT COUNT(*) FROM segments WHERE segment = 'A')
  )
) AS confidence_interval_for_A;

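A query of this type is run for every segment. Strictly speaking, it returns the margin of error (half the width of the 95% confidence interval, hence the 1.96) of the segment's conversion rate, that is, orders per assigned visitor.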
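The same formula, margin = 1.96 * sqrt(p * (1 - p) / n) with p = orders / visitors, is easy to play with outside the database. Below is a minimal Python sketch; the function names and the visitor and order counts are invented for illustration. Checking whether two intervals overlap is a deliberately cautious shortcut, not a full significance test.

```python
from math import sqrt

def conversion_interval(orders: int, visitors: int, z: float = 1.96):
    """Normal-approximation interval (low, high) for the rate orders/visitors."""
    p = orders / visitors
    margin = z * sqrt(p * (1 - p) / visitors)
    return p - margin, p + margin

def intervals_overlap(a, b) -> bool:
    """True if the two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical counts: 10,000 visitors were assigned to each segment.
a = conversion_interval(orders=200, visitors=10_000)
b = conversion_interval(orders=300, visitors=10_000)
c = conversion_interval(orders=210, visitors=10_000)

print(intervals_overlap(a, b))  # A against B
print(intervals_overlap(a, c))  # A against C
```

If the intervals of two segments do not overlap, the difference is very unlikely to be noise. If they do overlap, the safe reading is to keep the test running, which is exactly what item 9 of the quoted list says.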

If you paid attention, you noticed that the solution is only 1147 symbols long, while the problem took 1429 symbols to describe. If you divide $10000 (the stated cost of the test) by 1147, one symbol costs $8.72. A lawyer in my area makes less per symbol!

With all due respect, it took me 15 minutes to write this code and around 30 to google “confidence interval mysql”. That is less than an hour in total. Even if you give this project to a junior and pay this person as a qualified senior (say, $120 per hour), it would take only around 20 hours to complete. So the final price would be $2400, roughly 4 times less than the price initially stated.
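For the skeptics, the numbers in the last two paragraphs can be reproduced as follows. All inputs are taken from the text; nothing here is measured.

```python
quoted_price = 10_000     # USD, the stated cost of the test
solution_symbols = 1147   # length of the code above, as counted in the text

price_per_symbol = quoted_price / solution_symbols

junior_hours = 20         # generous estimate for a junior developer
senior_rate = 120         # USD per hour, a senior's rate
revised_price = junior_hours * senior_rate

ratio = quoted_price / revised_price

print(round(price_per_symbol, 2), revised_price, round(ratio, 1))
```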

All in All

It’s a good idea to run an A/B test of a service you are thinking of using. Don’t be tricked by “We are the leading company in this area”, “We can !guarantee! you progress”, “A/B testing is a very costly procedure” or “Millions of lemmings can’t be wrong”.

