Mixed-Effect Thompson Sampling
AISTATS 2023
TL;DR.
Looking for a practical algorithm, grounded in theory, that capitalizes on action similarities? Discover how to learn more efficiently in contextual bandits with large action spaces.
Summary.
Mixed-Effect Thompson Sampling (meTS) leverages action correlations to explore the action space efficiently. It enjoys:
Efficient posterior updates.
Potential information sharing among actions.
Reduced Bayesian regret compared to standard methods.
Strong empirical performance without additional tuning.
A bridge between offline learning of action correlations and online learning to act under uncertainty.
Illustration.
Think about movie recommendation, like when Netflix suggests movies to users. This situation can be seen as a contextual bandit problem. Here, we view user features as context, the movies they can choose as actions, and the rating they give to a movie as the reward.
A common tactic is to learn how each user would rate each movie, so we assess a movie's worth based on the ratings that other users gave it. This is what standard methods like LinUCB and LinTS do. The issue is that when you have a lot of movies (a.k.a. a large action space), this approach isn't very efficient. Luckily, there's a silver lining: movies tend to be connected because they fit into specific categories, like action, sci-fi, drama, and more.
Here's a key question: how can we leverage these categories to explore movies more efficiently? We'll explore some solutions, starting with the first one. Solution (A) involves learning how each user rates each movie category based on the ratings of all movies in that category. Then, when predicting a user's rating for a movie, we simply use the rating they would have given to that movie's category. This approach is more efficient because we're learning fewer parameters (ratings for a few categories instead of ratings for many individual movies) and we're using more data (ratings for all movies within a category instead of just one movie). However, there's a catch: it introduces a lot of bias because it assumes that all movies within the same category would receive the same rating from a user. In a nutshell, it might oversimplify things, and we summarize this below.
To address the bias issue, we can take a similar approach as in (A), but with a twist. Here, we treat the rating that a user assigns to a movie as a random variable whose mean is the rating that the user would typically give to the category that the movie falls under. This captures the connection between movies in the same category, as they all draw from the category's rating. However, since a movie's rating is no longer exactly equal to its category's rating, we need to learn both category ratings and individual movie ratings. The category ratings are learned from the ratings of all movies within that category, while the movie ratings are fine-tuned from the ratings of that specific movie. Allowing each movie to deviate from its category's rating is what reduces the bias of (A). This is Solution (B), and it is summarized below.
There's a limitation with (B): it assumes that a movie belongs to only one category, while in reality a movie often fits into multiple categories. So we take things a step further. As in (B), we treat the rating a user gives to a movie as a random variable, but now its mean is a mixture of the ratings that the user typically gives to all the categories the movie falls under. The coefficients in this mixture show how strongly the movie is associated with each category. This adds a layer of complexity, creating even more connections between movies and making exploration more statistically efficient. This final Solution (C) is what we call meTS, and it's a generalization of all the solutions we've discussed so far. It's like connecting all the dots; we summarize the three solutions below.
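To make the three solutions concrete, here is a schematic summary in equations. The notation is ours for illustration (theta_a for movie a's rating parameter, psi_l for category l's parameter, c(a) for the single category of movie a in (A) and (B), and b_{a,l} for the mixing weights in (C)); see the paper for the formal model.

```latex
% Schematic summary of the three solutions (illustrative notation).
\begin{align*}
\text{(A)} \quad & \theta_a = \psi_{c(a)}
  && \text{movie inherits its category's rating (efficient but biased)} \\
\text{(B)} \quad & \theta_a \sim \mathcal{N}\big(\psi_{c(a)},\, \Sigma_a\big)
  && \text{movie rating fluctuates around its category's rating} \\
\text{(C)} \quad & \theta_a \sim \mathcal{N}\Big(\textstyle\sum_{l} b_{a,l}\, \psi_l,\ \Sigma_a\Big)
  && \text{movie rating mixes all its categories (meTS)}
\end{align*}
```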
Formal Model.
Formally, we consider a mixed-effect model that can be written as
The corresponding graphical representation is given below.
A practical example is the Gaussian mixed-effect model where each action parameter is a Gaussian linear mixture of the effect parameters. That is, when
Even though the latent structure in this example is linear, the reward can be either linear or nonlinear.
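For reference, here is a minimal sketch of the generative process described above, under assumed notation (nu_l and Sigma_l for the effect prior means and covariances, b_{a,l} for the mixing weights); the paper states the exact model.

```latex
% Sketch of the Gaussian mixed-effect model (assumed notation).
\begin{align*}
\psi_l &\sim \mathcal{N}(\nu_l, \Sigma_l), && l \in [L], && \text{(effect parameters)} \\
\theta_a \mid \psi &\sim \mathcal{N}\Big(\textstyle\sum_{l=1}^{L} b_{a,l}\, \psi_l,\ \Sigma_a\Big), && a \in [K], && \text{(action parameters)} \\
r_t \mid x_t, a_t, \theta &\sim P(\cdot \mid x_t; \theta_{a_t}). && && \text{(reward, possibly nonlinear in } \theta_{a_t})
\end{align*}
```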
Bridging the Gap Between Offline and Online Learning.
A question remains! How do we establish the mixed-effect model (with its structure) in the first place? Well, there are two possible routes. First, it can be inherently embedded in the problem itself. For instance, in drug design, the actions represent different drugs, and the effects correspond to their components. The mixing weight (b) that links an action (drug) to an effect (component) reflects the dosage of that component within the drug. In such cases, the relationship is apparent from the nature of the problem. In other situations, however, it might not be immediately clear how these effects come about, and we need to learn this relationship. The good news is that this learning can be carried out offline using off-the-shelf techniques like Gaussian mixture models (GMMs), trained on offline (noisy) estimates of the action parameters to create a proxy structure that's well-suited for meTS.
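Here is a minimal sketch of this offline step, assuming we already have noisy offline estimates of the action parameters (the array theta_hat below is a random placeholder so the snippet runs end to end). It uses scikit-learn's GaussianMixture: the component means play the role of proxy effect parameters, and the responsibilities play the role of proxy mixing weights.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Offline (noisy) estimates of the K action parameters, one d-dimensional row per action.
# In practice theta_hat would come from, e.g., regression on logged interaction data.
K, d, L = 100, 8, 5
rng = np.random.default_rng(0)
theta_hat = rng.normal(size=(K, d))  # placeholder for real offline estimates

# Fit a GMM on the estimates: component means act as effect parameters,
# responsibilities act as mixing weights b_{a,l}.
gmm = GaussianMixture(n_components=L, covariance_type="full", random_state=0)
gmm.fit(theta_hat)

effect_means = gmm.means_                      # shape (L, d): proxy effect parameters
mixing_weights = gmm.predict_proba(theta_hat)  # shape (K, L): proxy weights b_{a,l}
```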
Algorithm.
With the model formally in place, we are ready to present meTS, a hierarchical Thompson sampling algorithm that operates in two steps: it first samples the effects, and then samples the actions based on the sampled effects. Here's how it works.
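For intuition, here is a schematic sketch of one round of this two-step hierarchical sampling in the Gaussian case. The helper structures (effect_posterior, action_posteriors, B) and the linear reward at the end are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def me_ts_round(effect_posterior, action_posteriors, B, context, rng):
    """One round of meTS (schematic): sample effects, then actions given effects.

    effect_posterior: (mean, cov) over the stacked effect parameters, shape (L*d,).
    action_posteriors: per-action conditional covariances given the sampled effects.
    B: (K, L) mixing weights linking actions to effects.
    """
    # Step 1: sample the effect parameters from their posterior.
    mu, cov = effect_posterior
    psi = rng.multivariate_normal(mu, cov).reshape(B.shape[1], -1)  # (L, d)

    # Step 2: sample each action parameter conditioned on the sampled effects;
    # the conditional mean mixes the sampled effects with weights B[a].
    thetas = np.stack([
        rng.multivariate_normal(B[a] @ psi, action_posteriors[a])
        for a in range(B.shape[0])
    ])

    # Act greedily with respect to the sampled parameters (linear reward assumed here).
    return int(np.argmax(thetas @ context))
```

After pulling the chosen action and observing its reward, both posterior levels are updated; in the Gaussian case these updates are in closed form, which is what makes the posterior updates efficient.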
We also develop theory for meTS, starting with derivations of the effect and action posteriors.
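As a rough illustration in the linear-Gaussian case, the conditional action posterior has the familiar conjugate form below, mixing the effect-induced prior mean with the data collected for that action. This is a standard Bayesian linear-regression sketch under our assumed notation, not the paper's exact derivation.

```latex
% Conjugate updates in the linear-Gaussian case (assumed notation; see the paper
% for the exact posteriors). Given sampled effects \psi, action a's prior mean is
% B_a \psi = \sum_l b_{a,l} \psi_l, and \sigma^2 is the reward noise variance.
\begin{align*}
\hat{\Sigma}_a^{-1} &= \Sigma_a^{-1} + \sigma^{-2} \textstyle\sum_{t:\, a_t = a} x_t x_t^\top, \\
\hat{\theta}_a &= \hat{\Sigma}_a \Big( \Sigma_a^{-1} B_a \psi
  + \sigma^{-2} \textstyle\sum_{t:\, a_t = a} r_t\, x_t \Big).
\end{align*}
% The effect posterior, which marginalizes out the action parameters,
% also admits a closed form in this setting.
```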
The corresponding regret bounds capture this structure and have the following form.
The regret bound of meTS depends on both the number of actions (K) and the number of effects (L). But there are special cases where it only depends on one of these factors. For example, if the action parameters are always the same for a given effect parameter, the agent only needs to learn the effect parameters, and the regret bound only depends on L. Similarly, if the effect parameters are always the same, the agent only needs to learn the action parameters, and the regret bound only depends on K. In general, the regret depends on both K and L, but it's still better than the regret of standard methods like LinTS, because meTS reduces the posterior variance.
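In equations, the two special cases read as follows (schematic, with the assumed notation from above).

```latex
% The two special cases above, schematically.
\begin{align*}
\Sigma_a \to 0 &\;\Rightarrow\; \theta_a = \textstyle\sum_l b_{a,l}\, \psi_l
  && \text{only the } L \text{ effect parameters remain to be learned,} \\
\Sigma_l \to 0 &\;\Rightarrow\; \psi_l = \nu_l \text{ is known}
  && \text{only the } K \text{ action parameters remain to be learned.}
\end{align*}
```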
To fully appreciate the benefits of meTS, we should pay close attention to the constants in the regret bound, as they can have a big impact. As we see above (refer to the paper for details), meTS can significantly reduce regret, especially when the action space is large and when the effect parameters are more uncertain than the action parameters. Additionally, meTS has computational efficiency similar to standard LinTS, and it greatly improves on the time and space complexities of a TS that models the joint distribution of all action parameters.
Experiments.
We start by simulating our mixed-effect model and observe that meTS and its factored variant (meTS-Fa) outperform all baseline methods that ignore the structure or incorporate it only partially. The gain is significant. Note that meTS-Fa is an approximation of meTS in which the effects are sampled independently. More details are in the paper.
But what if we can't simulate or access the mixed-effect model? This is the setting we face in our MovieLens experiment, where the rewards are not sampled from a mixed-effect model. Instead, we use a Gaussian mixture model (GMM) trained on offline estimates of the movie parameters to create a proxy mixed-effect model for meTS. Despite this misspecification, meTS still outperforms the other methods, which shows that meTS is flexible and robust to model misspecification.
Main takeaways from experiments.
When the mixed-effect structure is available, it's better to design a TS algorithm that exploits it rather than marginalizing the effects and applying a standard TS algorithm. This is because action information sharing reduces variance.
When the mixed-effect structure is not available, you can use offline estimates of the action parameters to create a proxy structure and apply meTS to it. If you don't even have offline estimates of the action parameters, you could use external datasets that are somewhat related to the problem. For example, you could use CLIP embeddings of movie images and descriptions to get noisy estimates of the action parameters and train a GMM on them. However, the effectiveness of meTS with external data has not yet been validated empirically.
Conclusion.
We have proposed a mixed-effect bandit framework that can be used to explore more efficiently. Our framework is represented by a two-level graphical model where actions can depend on multiple effects. We have also designed an exploration algorithm, meTS, that leverages this structure. meTS has been shown to perform extremely well on both synthetic and real-world problems.
The core idea of our approach is that TS relies on prior distributions, and these can be chosen or learned in a clever fashion to improve computational and statistical efficiency. Given the wide availability of offline data, we can use it to learn informative priors such as the mixed-effect prior, which can further improve the performance of TS.
Our algorithmic ideas apply to the general mixed-effect model, which opens the door to richer models, such as general directed acyclic graphs (DAGs) or diffusion models.