At Incisively we have been working hard on an easy-to-use service that allows customers to **continuously optimise** various aspects of their own applications or sites.
As we explained in our earlier post, we believe that applying a continuous optimisation approach to your business, rather than one that revolves around ad-hoc AB testing, can simplify how you carry out optimisation, and lead to larger improvements in things like:

- conversion rates;
- engagement; and
- revenue and margin;

In this post, I'm going to give a short technical overview of how we model these optimisation problems, using many tried-and-tested approaches from the machine learning and statistics literature.

As a motivating example of the types of problem our product helps to solve, let's consider the case of an on-line retailer or travel agency, who wants to improve their conversion rates from the search result part of their funnel.
Reviewing the most common site-searches, it's clear that many products are often not shown on the first page of search results, which might be costing conversions.
To ensure that the default search result order encourages as many conversions as possible, Incisively can be used to optimise *strategies* for ordering products.

The remainder of this post outlines how we frame these types of problems in a machine learning context, and how we utilise some neat algorithms to help solve them.

## The Multi-armed Bandit Problem

The Multi-armed Bandit Problem (MAB problem) is a learning problem wherein an agent is repeatedly faced with $N$ possible choices (arms), each of which, when pulled, returns a payoff from some unknown distribution.
The agent wishes to maximise their overall payoff, but they only have the feedback from pulling arm $a_i$ at each time-step to learn from.
That is to say, the agent does not know what the underlying payoff distributions are for each arm, and can only *estimate* them by learning from payoff information.
Why is it called the Multi-armed Bandit Problem? Well, it's analogous to a slot machine with $N$ arms, and your problem, as the player, is to figure out which arm you should be pulling on each go.

The MAB problem is difficult because in order to perform well, one needs to balance *exploration* and *exploitation*.
Exploitation is the continued selection of the best known arm, while exploration involves selecting arms that are not necessarily expected to be best.
Exploration is important for learning the underlying payoff distribution of each arm.
Exploitation is vital in order to maximise the total payoff over time, by choosing the arm with the best payoff distribution as much as possible.

In the problem described in the previous section, different strategies for sorting search results are *arms*, and every time a user makes a search, deciding which of these strategies to sort their results by, is equivalent to pulling one of the arms.
Most of the time there will be no payoff at all when an arm is pulled, but occasionally there will be; the task is to exploit the best strategy in order to maximise overall payoff over time, while balancing the process of learning the underlying payoffs of all the strategies.

### Firming things up

The MAB model can be thought of as comprising two sets: (1) a set of arms $\mathcal{A}$; and (2) a set of payoff (reward) distributions $\mathcal{R}$.

$$\mathcal{B} = \langle\mathcal{A}, \mathcal{R}\rangle$$

At each time-step $t$, the agent chooses an arm $a_t \in \mathcal{A}$ and receives a payoff $r_{t} \sim \mathcal{R}_{a}$ from the reward distribution associated with that arm.
The agent's task is of course to maximise the cumulative payoff over some time-period $T$, i.e., $\sum^{T}_{t=1} r_t$.

In a special case of the MAB problem—the Binary (or Bernoulli) Bandit problem—all arm payoffs are are either zero or one.
Therefore, each arm's reward distribution $\mathcal{R}_{a}$ is the Binomial distribution $B(1,p_a)$, where $p_a$ is the probability of success from pulling arm $a$, and conversely $1-p_a$ describes the probability of failure.

When discussing ways of solving the bandit problem, rather than thinking in terms of maximising cumulative reward, it is more common to think in terms of *minimising regret*.
In decision theory, regret is the discrepancy between an action's payoff, and the payoff of the best action that was available.
In terms of the bandit problem, always pulling the best arm would result in zero regret, while every pull of the non-optimal arms results in some positive regret.
In terms of regret, we then wish to minimize the following in the Binary bandit problem:

$$\rho = Tp^{*} - \sum^{T}_{t=1}r_t$$

where $p^*$ is the maximal probability (and in this case maximum mean reward distribution value) available.

Real-world scenarios, such as maximising conversions or clicks on call-to-actions, lend themselves well to this form of the MAB problem.
Each strategy or variant that you associate with an arm $a$, has an underlying conversion/click-through rate, $p_a$.
The best of those is $p^{*}$.

Now the dilemma is clear.
Balancing the *exploitation* of the arm with $p^{*}$ against the *exploration* of arms, in order to accurately estimate to which arm $p^{*}$ belongs.

This problem is made harder when you consider that, in the real world, conversion rates change over time!
Therefore, even when you're confident that you have identified the arm with $p^{*}$, you must always maintain some form of exploration, to keep your estimations up to date.

Algorithms for solving the MAB problem form the basis for our approach to continuous optimisation.
In the next section I'll discuss in more detail how we go about doing this.

## Incisively's Approach to Solving the MAB Problem

Before discussing the approach we take at Incisively, I will quickly mention a couple of other approaches to solving the MAB problem.
The first is the elephant in the room: A/B testing.
AB is not really an algorithm at all, but is just a term for describing the process of carrying out a randomised experiment with two variants.
An algorithm that would carry out an A/B test to solve the MAB problem for the two-arm case might look like:

- While stopping criteria not reached, pull arm A with 50% probability and arm B with 50% probability.
- Once stopping criteria reached, forever pull the best arm if it's better than the other.

There is no trade-off between exploring and exploiting arms; they're given the same preference.
The stopping criteria needs to be determined before the algorithm starts, and involves achieving a certain number of pulls in order that a performance difference of a certain size can be detected.
If such a performance difference is detected, then the algorithm will pull the winning arm forever.

There are some issues with this approach to optimisation, for example:

- In the first stage of the algorithm, regret grows linearly because
there is always a (high) constant probability of selecting the
non-optimal arm.
- In the second stage of the algorithm exploration stops entirely.
While this leads to a zero regret situation
**if** the correct arm
is being pulled, any changes to the underlying payoff distributions
would not be detected.
- You need to decide before hand what size of performance difference
you want to detect, in order to know how long to run the experiment
for.

While the A/B testing approach is incredibly common for one-off tests, it doesn't lend itself well to continuous optimisation.
Luckily, the reinforcement learning literature has provided us with some far better algorithms over the years.

The canonical reinforcement learning algorithm for the MAB problem is called $\epsilon$-greedy (pronounced "epsilon greedy"), and it's trivial to describe:

- With probability $\epsilon$ pull one of the arms randomly.
- With probability $1-\epsilon$ pull the arm with the best average
historical payoff.

Algorithms that use strategies like this are called *semi-uniform strategies*, because they either exploit the best arm, or they uniformly explore others.
Notice that it differs from the A/B testing algorithm in the following ways: (1) from the beginning it spends some of its time exploiting the best known arm; (2) it never stops exploring; and (3) it does not need to run for a set amount of time.

This algorithm does a better job of balancing the exploration/exploitation dilemma, by maintaining a non-zero probability of exploration while always exploiting to some extent.
Another semi-uniform strategy, and a variant of the $\epsilon$-greedy algorithm, $\epsilon$-decreasing, reduces $\epsilon$ over time so that the best arm is exploited more often, as more data is built up about the best paying arm.

The problem with all these semi-uniform strategies though, is that either regret grows linearly (because they always explore with some fixed probability), or where it is possible to achieve logarithmic regret bounds, it requires the careful setting of a free parameter (e.g., in the case of $\epsilon$-decreasing, how to reduce $\epsilon$ over time).
There are, however, more powerful approaches that have the ability to provide logarithmic regret bounds without the need to carefully set free parameters.

One of the issues with the above algorithms is that they don't consider how good all arms are, with respect to each other.
Instead they rely on exploiting the perceived best arm, and uniformly sampling from the remaining arms.
Intuitively, this results in suboptimal performance, because the probability of an arm being the best is not used when determining which arm to pull.

There is a solution to this problem; an approach called Randomised Probability Matching.
A common way to apply a randomised probability matching approach to the MAB problem is to use a heuristic called *Thompson Sampling*.
Thompson Sampling is not particularly complicated or novel, in fact it was first described in the 1930s.
It was ignored in the literature for a long period of time, and only recently has been used as an approach for tackling the exploration / exploitation dilemma.

Thompson Sampling is trivial to describe.
Every time we need to pull an arm on the bandit, Thompson Sampling dictates that we randomly sample each arm's *estimated* payoff distribution to produce a vector of random variates.
The arm with the largest variate is the one that is then pulled.
The only real question is how to estimate the payoff distribution for each arm?

Thompson Sampling is Bayesian in nature, which means that for each arm we require a *prior* distribution.
Recall in the Binary Bandit problem, rewards for each arm are Bernoulli distributed according to some unknown probability $p$.
The *Beta distribution* is a conjugate prior for a Bernoulli likelihood function, so we can use the Beta distribution as a prior to model the underlying probability of $p$ for each arm in the bandit.

The Beta prior has two hyperparameters, $\alpha > 0$ and $\beta > 0$, and conveniently $\alpha-1$ can be interpreted as the number of successes from pulling the associated arm, and $\beta-1$ as the number of failures.
Updating the prior for each arm is trivial.
Given arm $a$ is pulled, the hyperparameters can be updated as follows:

$$\begin{align}
r_t &\sim \mathcal{R}_a \nonumber \\
\alpha_a &= \alpha_a + r_t \nonumber \\
\beta_a &= \beta_a + 1 - r_t \nonumber
\end{align}$$

Not only is this method is very simple, it has some useful properties:

- Regret grows logarithmically;
- Exploration is naturally balanced according to the estimated payoff
of each arm, and how confident the algorithm is in the estimation;
- There are no free parameters to set;
- Because payoff distributions are sampled, each arm pull is not
deterministic, which means we don't have to update the
distributions immediately after each pull.

While we have added a fair amount of secret sauce to our MAB algorithms, Thompson Sampling forms the basis of how our underlying optimisation platform works at Incisively.
In a future post I'll be describing some new features we've been busy working on.

## Further Reading