Online decision-making and learning under structured non-stationarity
Abstract
Many well-studied online decision-making and learning models rely on the assumption that the environment remains static and unaffected by the agent’s actions. This assumption, however, is unrealistic for many real-life applications. In recommendation systems, for instance, the preferences of a particular user (and, thus, the “best” products to recommend) can change over time. This change can occur either naturally (e.g., the user discovers a new category of desired products) or as a response to past recommendations (e.g., incessantly recommended products may come to be perceived as “spam” and thus be ignored in the future). In this dissertation, we consider generalizations of multi-armed bandit and optimal-stopping settings in which the payoff distribution of each action is a function of the number of rounds that have elapsed since its last play. We study these generalizations in the planning regime, where knowledge of the prior distributions is assumed, and in the learning regime, where the priors are learned online from the collected samples.
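To make the model concrete, the following is a minimal sketch (not taken from the dissertation) of the kind of structured non-stationarity described above: each arm’s expected payoff grows with the number of rounds since it was last played. The recovery curves, noise model, and the one-step greedy baseline policy are all illustrative assumptions, not the algorithms analyzed in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 3          # number of arms (assumed)
T = 20         # horizon (assumed)
base = np.array([1.0, 0.8, 0.6])   # hypothetical per-arm payoff ceilings

def expected_payoff(arm: int, delay: int) -> float:
    """Mean payoff of `arm` after `delay` rounds of rest (assumed saturating form)."""
    return base[arm] * (1.0 - np.exp(-0.5 * delay))

delays = np.full(K, T)   # treat every arm as fully rested at the start
total = 0.0
for t in range(T):
    # One-step greedy baseline: play the arm with the highest current mean.
    arm = int(np.argmax([expected_payoff(a, delays[a]) for a in range(K)]))
    total += rng.normal(expected_payoff(arm, delays[arm]), 0.1)
    delays += 1          # every arm rests one more round...
    delays[arm] = 0      # ...except the one just played, whose delay resets
print(f"greedy total payoff over {T} rounds: {total:.2f}")
```

The reset of the played arm’s delay is what couples the agent’s actions to the environment: repeatedly playing the same arm drives its payoff toward zero, mirroring the “spam” effect in the recommendation example.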