Exponential Smoothing for Off-Policy Learning
ICML 2023, Oral
TL;DR.
Ever grappled with off-policy evaluation (OPE) and learning (OPL)? If so, you know the challenge of the high variance that plagues the popular inverse propensity scoring (IPS) estimator. Here's some exciting news: we've designed a smooth regularization for IPS that introduces a little bias to tame that notorious variance. But wait, there's more! We back this up with a scalable PAC-Bayesian bound that breaks free from the widely used bounded importance weights assumption. The icing on the cake? Our bound also holds for standard IPS, without assuming uniform coverage by the logging policy.
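To make the idea concrete, here is a minimal numerical sketch. The power-α form of the smoothing, the function names, and the toy data are illustrative assumptions on our part (with α = 1 recovering standard IPS), not code from the paper:

```python
import numpy as np

# Toy logged bandit data: for each logged action we record its propensity under
# the logging policy, its probability under the target policy, and the observed
# reward. All values here are synthetic and purely illustrative.
rng = np.random.default_rng(0)
n = 10_000
pi0 = rng.uniform(0.05, 0.5, size=n)   # logging propensities of the logged actions
pi = rng.uniform(0.0, 1.0, size=n)     # target-policy probabilities of those actions
r = rng.binomial(1, 0.3, size=n)       # observed binary rewards

def ips(pi, pi0, r):
    """Standard IPS: unbiased, but the weights pi/pi0 blow up when pi0 is small."""
    return np.mean(pi / pi0 * r)

def smoothed_ips(pi, pi0, r, alpha=0.7):
    """Smoothed IPS (assumed form): raise the logging propensities to a power
    alpha in [0, 1], trading a little bias for lower variance; alpha = 1
    recovers standard IPS."""
    return np.mean(pi / pi0**alpha * r)

print(ips(pi, pi0, r), smoothed_ips(pi, pi0, r, alpha=0.7))
```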
Summary.
We explain the shortcomings of the commonly used hard clipping regularization for IPS and propose a more refined technique, which we call exponential smoothing.
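To contrast the two regularizers, here is a small sketch; the specific clipping variant (truncating the importance weights at a threshold M) and the power-α smoothing form are assumptions made for illustration. Clipping is non-smooth and treats every weight above the threshold identically, whereas smoothing dampens all weights continuously:

```python
import numpy as np

def clipped_weight(pi, pi0, M=10.0):
    # Hard clipping: truncate the importance weight pi/pi0 at M.
    # Non-smooth at the threshold, and flat (uninformative) beyond it.
    return np.minimum(pi / pi0, M)

def smoothed_weight(pi, pi0, alpha=0.7):
    # Exponential smoothing (assumed form): dampen small propensities by
    # raising pi0 to a power alpha in [0, 1]; smooth for every weight.
    return pi / pi0**alpha

pi0 = np.array([0.5, 0.1, 0.01, 0.001])  # logging propensities
pi = np.ones_like(pi0)                   # target policy puts all its mass on these actions
print(clipped_weight(pi, pi0))   # [ 2. 10. 10. 10.]        -> flat once clipped
print(smoothed_weight(pi, pi0))  # [~1.6 ~5.0 ~25.1 ~125.9] -> grows, but far slower than 1/pi0
```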
Diving deeper, we leverage PAC-Bayes theory to derive a two-sided generalization bound for our estimator. What's special about it? It is tractable, scalable, and applies to standard IPS without the traditional bounded importance weights (or uniform coverage) assumption.
We also shed light on the sample complexity of our learning method. And, to top it off, our results show that it performs strongly.