Reinforcement Learning

For retailers looking to personalise their digital offering, significant gains can be achieved with a rule-based approach combined with effective analytics, or even better with an integrated A/B testing framework.

However, this approach comes with its challenges, for example:

Even with significant existing insight (e.g. data, experience), it can be difficult to know in advance how effective a given personalisation rule will be.
A/B testing often takes time (e.g. days or weeks) to determine a statistically significant “rule winner”.
Over time, the natural tendency is to add more rules to target more effectively; at a certain point this complexity may hinder further optimisation efforts.

Machine learning represents an opportunity for more sophisticated retailers to work around the complexities of rule-based personalisation, by developing dynamic models that can ‘learn’ effective personalisation outcomes.

One approach growing in popularity involves the deployment of multi-armed bandit models, part of a branch of reinforcement learning algorithms that can learn successful outcomes by incorporating real-time feedback about the success (or otherwise) of each prediction.

Multi-armed bandits

By textbook definition, a multi-armed bandit (MAB) is a classic reinforcement learning framework for algorithms that make sequential decisions under uncertainty. In other words, it describes how an agent makes predictions in order to maximise the aggregate reward over the long term.

The name is derived from a thought experiment about playing multiple slot machines (also known as “one-armed bandits”) with unknown payouts in an order that maximises reward efficiently. For those unfamiliar with MABs, we recommend Anson Wong’s “Solving the Multi-Armed Bandit Problem” and Alex Slivkin’s “Exploring the fundamentals of multi-armed bandits” for a more comprehensive introduction.

Below are some interesting characteristics of MABs.

Reward calculations

Following each prediction, a reward is calculated based on how effectively that prediction achieved the retailer’s goals. For example: a retailer trying to optimise the layout of a certain screen may decide that a good measure of success is a user who clicks through to the next screen, a more delayed reward like a successful purchase, or even something as complex as a determination of the purchased cart value. Popular MAB frameworks don’t actually stipulate the reward function. This is left to the retailer and represents one of the biggest challenges (and opportunities) with deploying these models.
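To make the idea of a retailer-defined reward concrete, here is a minimal Python sketch that blends the three signals mentioned above (click-through, purchase and cart value) into a single score between 0 and 1. The SessionOutcome type, the weights and the cart value cap are all hypothetical choices of ours for illustration; they are not taken from any particular MAB framework.

```python
from dataclasses import dataclass


@dataclass
class SessionOutcome:
    """What we observed after showing a variant (hypothetical structure)."""
    clicked_through: bool
    purchased: bool
    cart_value: float  # in the retailer's currency; 0.0 if nothing was bought


def reward(outcome: SessionOutcome, max_cart_value: float = 1000.0) -> float:
    """Blend three signals into one reward in [0, 1].

    The weights (0.2 / 0.5 / 0.3) are illustrative assumptions only.
    """
    click_part = 0.2 if outcome.clicked_through else 0.0
    purchase_part = 0.5 if outcome.purchased else 0.0
    value_part = 0.0
    if outcome.purchased:
        # Cap the cart value so one very large basket cannot dominate.
        capped = min(max(outcome.cart_value, 0.0), max_cart_value)
        value_part = 0.3 * capped / max_cart_value
    return click_part + purchase_part + value_part
```

For example, under these assumed weights a session with a click-through and a purchase worth half the cap would score roughly 0.85, while a session with a click-through only would score 0.2. In practice, choosing and tuning these weights is exactly the challenge described above.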

Contextual bandits

Contextual bandits can be thought of as an extension of the multi-armed bandit (MAB) framework in which the customer’s context is taken into account. If we frame our definition with respect to the airline booking flow, the context can refer to any number of data points, such as the route, whether the trip is for leisure or business, whether the customer is a loyalty member, the time of day, etc. This extra information allows the algorithm to make better predictions by discovering attributes (features) that act as strong predictors of user behaviour.

While A/B tests take context into account as well, these parameters are predefined for A/B tests: contextual bandits are less prescriptive in that regard. Consequently, we might consider contextual bandits as an evolution of A/B testing, whereby an intelligent display can dynamically learn the best test variants, and maximise their respective benefits even during a test.
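As a rough illustration of how context changes the picture, the sketch below keeps a separate running reward estimate for every (context, variant) pair and uses a simple epsilon-greedy rule to choose between variants. This is a deliberately naive simplification of ours: production contextual bandits typically learn a model over context features rather than a lookup table, and the class and parameter names here are hypothetical.

```python
import random
from collections import defaultdict


class ContextualEpsilonGreedy:
    """Tabular contextual bandit: one reward estimate per (context, arm)."""

    def __init__(self, arms, epsilon=0.1, seed=None):
        self.arms = list(arms)
        self.epsilon = epsilon              # share of traffic used to explore
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)      # (context, arm) -> times shown
        self.values = defaultdict(float)    # (context, arm) -> mean reward

    def best_arm(self, context):
        """Arm with the highest estimated reward for this context."""
        return max(self.arms, key=lambda arm: self.values[(context, arm)])

    def select(self, context):
        """Explore a random arm with probability epsilon, else exploit."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        return self.best_arm(context)

    def update(self, context, arm, reward):
        """Fold the observed reward into the running mean for (context, arm)."""
        key = (context, arm)
        self.counts[key] += 1
        self.values[key] += (reward - self.values[key]) / self.counts[key]
```

Here the context can be any hashable value, for example a tuple such as ("business", "loyalty_member"). After each screen is shown, the caller passes the observed reward back via update, so the estimates for that context improve while the test is still running.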

Reward exploration vs. exploitation

As alluded to above, one of the major advantages of MABs over classic A/B testing is that they allow the retailer to trade off between “exploration” and “exploitation” of learned predictions.

In effect, the algorithm first learns how to make successful predictions (based on the reward calculation), and then exploits these predictions for future users presenting similar characteristics. For example, if screen layout variant “B” worked best for “business users”, then it keeps presenting this variant to future business users. (Please excuse my reference to a classic rule-based segmentation like ‘business user’: a typical ML model would discover far more obscure and opaque segments for itself.)

However, other prediction variants could produce even better outcomes in certain situations: for example, for a certain subset of these business users, or because users behave differently on different weekdays or simply change over the course of the year. MABs allow retailers to sacrifice a certain volume of traffic to “explore” different prediction variants in real time. Over time, therefore, the model progressively improves itself.

Contrast this with classic A/B testing, where the share of traffic allocated to each prediction is typically set at the start of the experiment, which could then run for weeks. Traffic going to non-performing test variants can represent a significant loss in revenue for retailers.

Diagram from Expedia Group Technology showing how MABs limit the opportunity cost of exploration. Source: https://medium.com/expedia-group-tech/how-we-optimized-hero-images-on-hotels-com-using-multi-armed-bandit-algorithms-4503c2c32eae
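The opportunity cost of a fixed traffic split can also be seen in a small simulation. The Python sketch below compares an even, fixed allocation (as in a classic A/B/n test) with Thompson sampling, one common MAB strategy, on variants with simulated conversion rates. The conversion rates and function names are assumptions of ours for illustration, and this is not the method used in the Expedia Group article.

```python
import random


def run_fixed_split(true_rates, rounds, rng):
    """Classic A/B/n allocation: every variant gets an equal, fixed share."""
    total = 0
    for t in range(rounds):
        arm = t % len(true_rates)           # round-robin = even traffic split
        total += 1 if rng.random() < true_rates[arm] else 0
    return total


def run_thompson(true_rates, rounds, rng):
    """Thompson sampling with a Beta(1, 1) prior on each conversion rate."""
    n = len(true_rates)
    wins = [1] * n                          # Beta alpha parameters
    losses = [1] * n                        # Beta beta parameters
    pulls = [0] * n
    total = 0
    for _ in range(rounds):
        # Draw a plausible conversion rate per variant, show the best draw.
        samples = [rng.betavariate(wins[i], losses[i]) for i in range(n)]
        arm = max(range(n), key=samples.__getitem__)
        reward = 1 if rng.random() < true_rates[arm] else 0
        wins[arm] += reward
        losses[arm] += 1 - reward
        pulls[arm] += 1
        total += reward
    return total, pulls
```

With assumed conversion rates such as [0.05, 0.10, 0.20], the fixed split keeps sending two thirds of the traffic to the weaker variants for the whole test, whereas the bandit shifts most traffic to the strongest variant as evidence accumulates, while still occasionally exploring the others.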

Industry examples

Below are a few examples of retailers and tech providers in the travel space who have experimented with contextual bandits and MABs as a way to improve personalisation:

Skyscanner: For the past few years, the data science team at Skyscanner has developed a contextual bandits service for optimising widget placement. When a customer searches for flights on Skyscanner, the flight results include any number of widgets, from filtering by airline brand to ancillary offers. While Skyscanner has a rich tradition of A/B testing, they found that A/B testing works as a global solution. Through experimentation, they validated and expanded the use of contextual bandits in order to optimise by route and deliver a more targeted experience.
Hotels.com: As customers tend to choose accommodation types based on images, selecting the primary or hero image is incredibly important. As part of a pioneering computer vision programme, Expedia Group used Amazon Mechanical Turk campaigns and historical A/B tests to score and rank their vast content library. They realised that complementing their current programme with MAB optimisation would incorporate customer feedback, which had been missing up to this point. The approach allowed them to explore new options for a hero image whilst minimising regret.
Deepair: As airlines shift ever closer to full dynamic pricing of flight and ancillary offers, Deepair has developed an innovative “Adaptive Model Selection Framework” that leverages MABs to adaptively choose between many competing dynamic pricing models and apply the best-performing one for each pricing transaction.
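The general idea of using a bandit to choose between competing models can be sketched in a few lines of Python. To be clear, this is not Deepair’s framework, whose internals we do not describe here: it is a generic illustration using the well-known UCB1 strategy, with hypothetical model names and simulated conversion rates, and it assumes rewards lie between 0 and 1.

```python
import math
import random


class UCB1ModelSelector:
    """Pick which candidate model serves the next request, using UCB1."""

    def __init__(self, model_names):
        self.names = list(model_names)
        self.counts = {name: 0 for name in self.names}    # times each was used
        self.means = {name: 0.0 for name in self.names}   # mean reward so far
        self.total = 0                                    # total requests served

    def choose(self):
        # Use every model once before comparing them.
        for name in self.names:
            if self.counts[name] == 0:
                return name

        def score(name):
            # Mean reward plus a bonus that shrinks as a model is used more.
            bonus = math.sqrt(2.0 * math.log(self.total) / self.counts[name])
            return self.means[name] + bonus

        return max(self.names, key=score)

    def record(self, name, reward):
        """Update the running mean reward for the model that was used."""
        self.total += 1
        self.counts[name] += 1
        self.means[name] += (reward - self.means[name]) / self.counts[name]


def simulate(selector, conversion_rates, rounds, seed=0):
    """Feed the selector simulated 0/1 rewards (hypothetical rates)."""
    rng = random.Random(seed)
    for _ in range(rounds):
        name = selector.choose()
        reward = 1.0 if rng.random() < conversion_rates[name] else 0.0
        selector.record(name, reward)
    return selector.counts
```

In a simulation with clearly different conversion rates per model, the selector should route most requests to the strongest model while still occasionally revisiting the others, which is the behaviour that makes bandits attractive for this kind of model selection.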

The way forward for travel

In a post-pandemic world, airlines and other travel retailers are faced with a choice: invest in improving their digital experience, or lose customers to those who do. While A/B testing and experimentation will remain important common practice, reinforcement learning, and MABs in particular, will continue to expand in prevalence and adoption in the travel scene for several reasons.

There are many varied opportunities to leverage MABs in travel. From implementing a recommender to suggest relevant destinations in the flight search screen, to improving flight/ancillary selling with hyper-personalised offer layouts in the booking flow, MABs can enable self-optimising experiences which truly engage each traveller.

Further, by establishing a generalised MAB framework, retailers can scale MABs more easily to more varied use cases, amplifying a data-driven working culture. With MABs, e-commerce, digital and product teams can experiment more dynamically, and minimise the opportunity cost of poorly performing variants.

This represents a transformation of internal processes, and an evolution towards enabling smarter product development and experimentation which ultimately drive long-term goals of maximising revenue and improving customer engagement.

Our roadmap

At Branchspace, we are looking to amplify our technology platform to leverage contextual bandits for digital experience personalisation, because we truly believe in the practical benefits for airlines as they evolve towards an ever more adaptive and real-time customer offering.

The good (and bad) news is that there are many tools, frameworks, platforms, algorithms and services available to help us on our journey – from highly flexible MAB algorithms that require significant infrastructure to deploy and optimise, all the way up to “ML-as-a-service” (MLaaS) offerings from the major public cloud providers, and everything in between.

Two such interesting MLaaS offerings are Azure Personalizer and Amazon Personalize, which wrap MAB libraries in (relatively speaking) easy-to-use cloud infrastructure to enable use cases like product ranking. In theory, you can train and deploy a functional self-optimising service in hours. In reality, deploying something that is not just functional but also effective requires (a) a deep understanding of what these services are doing behind the scenes, and (b) a lot of time, energy and cloud resourcing cost spent experimenting with different parameters, data attributes, reward algorithms, etc. While we see a lot of potential, we have likewise discovered many potentially prohibitive limitations.

Fortunately, these cloud providers offer more adventurous tech teams more sophisticated tools to build their own MAB infrastructure. One example is Amazon SageMaker RL, which provides pre-baked MAB libraries that can be deployed within the SageMaker infrastructure. The advantage is that this infrastructure is significantly more flexible; however, it generally has a steeper (human) learning curve.

We are hugely excited about the opportunity for this technology to accelerate our customers’ ability to deliver more adaptive and real-time personalisation. Combined with our ongoing work to expand our user experiences from transactional flows to full website and customer portals, we’re looking forward to deploying some truly innovative solutions in 2021.

We recognise as well that reinforcement learning can be a difficult realm to enter: if you have any questions, drop us a line and we’ll be happy to help.