25w5428 Home Confirmed Participants Schedule Workshop Videos

Schedule for: 25w5428 - Advances in Stochastic Control and Reinforcement Learning

Beginning on Sunday, April 27 and ending Friday May 2, 2025

All times in Banff, Alberta time, MDT (UTC-6).

Sunday, April 27
16:00 - 17:30	Check-in begins at 16:00 on Sunday and is open 24 hours (Front Desk - Professional Development Centre)
17:30 - 19:30	Dinner ↓ A buffet dinner is served daily between 5:30pm and 7:30pm in Vistas Dining Room, top floor of the Sally Borden Building. (Vistas Dining Room)
20:00 - 22:00	Informal gathering (TCPL Foyer)

Monday, April 28
07:00 - 08:45	Breakfast ↓ Breakfast is served daily between 7 and 9am in the Vistas Dining Room, the top floor of the Sally Borden Building. (Vistas Dining Room)
08:45 - 09:00	Introduction and Welcome by BIRS Staff ↓ A brief introduction to BIRS with important logistical information, technology instruction, and opportunity for participants to ask questions. (TCPL 201)
09:00 - 10:00	Csaba Szepesvári: Foundations and Frontiers in Reinforcement Learning Theory ↓ Reinforcement Learning (RL) presents a rich set of theoretical challenges, centering on the design and analysis of algorithms that make decisions under uncertainty. This talk offers a high-level overview of core questions in RL theory, with a focus on generalization, sample efficiency, and computational tractability. I will discuss recent progress in understanding the use of function approximation for value-based methods, and examine the limitations and possibilities of learning from different data sources—including simulators, online interaction, and offline datasets. I will also briefly cover developments in online learning under partial observability. The talk will conclude with a look at underexplored directions. (TCPL 201)
10:00 - 10:30	Coffee Break (TCPL Foyer)
10:30 - 11:00	Christoph Reisinger: Efficient Learning for Entropy-Regularized Markov Decision Processes via Multilevel Monte Carlo ↓ Designing efficient learning algorithms with complexity guarantees for Markov decision processes (MDPs) with large or continuous state and action spaces remains a fundamental challenge. We address this challenge for entropy-regularized MDPs with Polish state and action spaces, assuming access to a generative model of the environment. We propose a novel family of multilevel Monte Carlo (MLMC) algorithms that integrate fixed-point iteration with MLMC techniques and a generic stochastic approximation of the Bellman operator. We quantify the precise impact of the chosen approximate Bellman operator on the accuracy of the resulting MLMC estimator. Leveraging this error analysis, we show that using a biased plain MC estimate for the Bellman operator results in quasi-polynomial sample complexity, whereas an unbiased randomized multilevel approximation of the Bellman operator achieves polynomial sample complexity in expectation. Notably, these complexity bounds are independent of the dimensions or cardinalities of the state and action spaces, distinguishing our approach from existing algorithms whose complexities scale with the sizes of these spaces. We validate these theoretical performance guarantees through numerical experiments. (TCPL 201)
11:10 - 11:40	Jose Blanchet: Wasserstein Distributionally Robust Regret Optimization ↓ Over the years, Distributionally Robust Control, Reinforcement Learning, and Optimization have been developed as an approach to dealing with optimal decision making in the context of distributional uncertainty. While the approach has been successfully applied in a wide range of areas, its adversarial nature can also lead to decisions that tend to be too conservative, especially in situations in which there is a significant upside in comparison to a relatively small potential downside. In order to mitigate overconservative decisions, we investigate Distributionally Robust Regret Optimization (DRRO) and we focus on Wasserstein-based distributional uncertainty sets, which are popular in the Distributionally Robust Optimization (DRO) settings due to various connections to traditional machine learning methods, including norm regularization, among others. We provide a systematic study of fundamental properties of Wasserstein DRRO in the spirit of what is known for the Wasserstein DRO counterpart, including convex reformulations, hardness, sensitivity analysis, and practical algorithms, among others. This is joint work with Lukas Fiechtner. (TCPL 201)
11:30 - 13:00	Lunch ↓ Lunch is served daily between 11:30am and 1:30pm in the Vistas Dining Room, the top floor of the Sally Borden Building. (Vistas Dining Room)
13:00 - 13:30	David Siska: Entropy Annealing for Policy Mirror Descent in Continuous Time and Space ↓ Entropy regularization has been widely used in policy optimization algorithms to enhance exploration and the robustness of the optimal control; however it also introduces an additional regularization bias. This work quantifies the impact of entropy regularization on the convergence of policy gradient methods for stochastic exit time control problems. We analyze a continuous-time policy mirror descent dynamics, which updates the policy based on the gradient of an entropy-regularized value function and adjusts the strength of entropy regularization as the algorithm progresses. We prove that with a fixed entropy level, the mirror descent dynamics converges exponentially to the optimal solution of the regularized problem. We further show that when the entropy level decays at suitable polynomial rates, the annealed flow converges to the solution of the unregularized problem at a rate of $\mathcal O(1/S)$ for discrete action spaces and, under suitable conditions, at a rate of $\mathcal O(1/\sqrt{S})$ for general action spaces, with $S$ being the gradient flow running time. The technical challenge lies in analyzing the gradient flow in the infinite-dimensional space of Markov kernels for nonconvex objectives. This paper explains how entropy regularization improves policy optimization, even with the true gradient, from the perspective of convergence rate. (TCPL 201)
13:30 - 14:00	Huyên Pham: An Optimal Interpolation Diffusion Model for Time Series: Bridging Schrödinger and Bass ↓ We address the problem of generating a continuous semi-martingale with a prescribed joint didstribution at successive times $0$ $=$ $t_0$ $<$ $\ldots$ $<$ $t_N$ $=$ $T$. This is formulated as an optimal interpolation problem that unifies the Schr\"odinger bridge and Bass frameworks, allowing for the construction of a diffusion process that simultaneously calibrates both drift and volatility to times series data. By decomposing the problem into a sequence of optimal transport problems between successive time steps, where mass is transported from a Dirac measure to a given conditional density, we derive a simple quasi-analytic formula to compute the drift and volatility of the optimal diffusion sequentially over each time interval. (TCPL 201)
14:00 - 14:30	Coffee Break (TCPL Foyer)
14:30 - 15:00	Wenpin Tang: Stochastic Approaches to Guide Generative Models and Applications ↓ Recently, there has been growing interest in guiding, or fine tuning pretrained diffusion models or LLMs for specific purposes, e.g., aesthetic quality of images, functional property of proteins, and downstream tasks in finance and operations management. In this talk, I will discuss several (principled) approaches, encompassing conditional guidance, regularization and reinforcement learning (RLHF). Some applications will also be presented. (TCPL 201)
15:00 - 15:30	Renyuan Xu: Stochastic Control for Fine-Tuning Diffusion Models: Optimality, Regularity, and Convergence ↓ Diffusion models have emerged as powerful tools for generative modeling, demonstrating exceptional capability in capturing target data distributions from large datasets. However, fine-tuning these massive models for specific downstream tasks, constraints, and human preferences remains a critical challenge. While recent advances have leveraged reinforcement learning algorithms to tackle this problem, much of the progress has been empirical, with limited theoretical understanding. To bridge this gap, we propose a stochastic control framework for fine-tuning diffusion models. Building on denoising diffusion probabilistic models as the pre-trained reference dynamics, our approach integrates linear dynamics control with Kullback-Leibler regularization. We establish the well-posedness and regularity of the stochastic control problem and develop a policy iteration algorithm (PI-FT) for numerical solution. We show that PI-FT achieves global convergence at a linear rate. Unlike existing work that assumes regularities throughout training, we prove that the control and value sequences generated by the algorithm maintain the regularity. Additionally, we explore extensions of our framework to parametric settings and continuous-time formulations. (TCPL 201)
17:30 - 19:30	Dinner ↓ A buffet dinner is served daily between 5:30pm and 7:30pm in Vistas Dining Room, top floor of the Sally Borden Building. (Vistas Dining Room)

Tuesday, April 29
07:00 - 09:00	Breakfast ↓ Breakfast is served daily between 7 and 9am in the Vistas Dining Room, the top floor of the Sally Borden Building. (Vistas Dining Room)
09:00 - 09:30	Serdar Yuksel: Stochastic Kernel Topologies on Models and Policies: Implications for Approximations, Robustness and Learning ↓ Stochastic kernels represent dynamical systems, randomized policies, and measurement channels, and thus offer a general mathematical framework for learning, robustness, and approximation analysis. In this talk, we will first present several kernel topologies and study their equivalence properties. These include weak* (also called Borkar) topology, Young topology, and kernel mean embedding topologies. Implications on convergence properties on model learning and policy approximations, and on robustness will be presented: On models viewed as kernels; we study robustness to model perturbations, including finite approximations for discrete-time models and robustness to more general modeling errors and study the mismatch loss of optimal control policies designed for incorrect models applied to the true system, as the incorrect model approaches the true model under a variety of kernel convergence criteria: We, in particular, show that the expected induced cost is robust under continuous weak convergence of transition kernels. Under stronger Wasserstein or total variation regularity, a modulus of continuity is also applicable. As an application of robustness under continuous weak convergence on empirical consistency of model learning, (i) robustness to empirical model learning for discounted and average cost criteria is obtained with sample complexity bounds; and (ii) convergence and near optimality of a quantized Q-learning algorithm for MDPs with standard Borel spaces, which we show to be converging to an optimal solution of an approximate model under both discounted and average cost criteria, is established. On policies viewed as kernels: In discrete-time, continuity of cost as well as invariant measures on control policies under Young topology will be established. In the context of continuous-time models, we obtain counterparts where we show continuity of cost in Young/Borkar policies, and robustness of optimal cost in models including discrete-time approximations for finite horizon and infinite-horizon discounted/ergodic criteria. Discrete-time approximations under several criteria and information structures will then be obtained via a unified approach of policy and model convergence. Based on joint work with Ali Kara, Naci Saldi, Somnath Pradhan, Tamas Linder, and Omar Mrani-Zentar. (TCPL 201)
09:30 - 10:00	Xinyu Li: An α-Potential Game Framework for N-player Dynamic Games ↓ This paper proposes and studies a general form of dynamic N-player non-cooperative games called $\alpha$-potential games, where the change of a player's value function upon her unilateral deviation from her strategy is equal to the change of an $\alpha$-potential function up to an error $\alpha$. Analogous to the static potential game (which corresponds to $\alpha=0$), the $\alpha$-potential game framework is shown to reduce the challenging task of finding $\alpha$-Nash equilibria for a dynamic game to minimizing the $\alpha$-potential function. Moreover, an analytical characterization of $\alpha$-potential functions is established, with $\alpha$ represented in terms of the magnitude of the asymmetry of value functions' second-order derivatives. For stochastic differential games in which the state dynamic is a controlled diffusion, $\alpha$ is characterized in terms of the number of players, the choice of admissible strategies, and the intensity of interactions and the level of heterogeneity among players. Two classes of stochastic differential games, namely distributed games and games with mean field interactions, are analyzed to highlight the dependence of $\alpha$ on general game characteristics that are beyond the mean-field paradigm, which focuses on the limit of N with homogeneous players. To analyze the $\alpha$-NE, the associated optimization problem is embedded into a conditional McKean-Vlasov control problem. A verification theorem is established to construct $\alpha$-NE based on solutions to an infinite-dimensional Hamilton-Jacobi-Bellman equation, which is reduced to a system of ordinary differential equations for linear-quadratic games. (TCPL 201)
10:00 - 10:30	Coffee Break (TCPL Foyer)
10:30 - 11:00	Gergely Neu: Optimal Transport Distances for Markov Chains ↓ How can one define similarity metrics between stochastic processes? Understanding this question can help us design better representations for dynamical systems, study distances between structured objects, formally verify complex programs, and so on. In the past, the dominant framework for studying this question has been that of bisimulation metrics, a concept coming from theoretical computer science. My recent work has been exploring an alternative perspective based on the theory of optimal transport, which has led to surprising results, including a proof of the fact that bisimulation metrics are, in fact, optimal transport distances. This realization allowed us to import tools from optimal transport and develop computationally efficient methods for computing distances between Markov chains via the reduction of the problem to a finite-dimensional linear program. In this talk, I will introduce this framework and the foundations of the most recent algorithmic developments, as well as discuss the potential for representation learning in more detail. Based on joint work with Sergio Calo, Anders Jonsson, Ludovic Schwartz, and Javier Segovia-Aguas. (TCPL 201)
11:00 - 11:30	Hao Ni: High Rank Path Development: An Approach to Learning the Filtration of Stochastic Processes ↓ Since the weak convergence for stochastic processes does not account for the growth of information over time which is represented by the underlying filtration, a slightly erroneous stochastic model in weak topology may cause huge loss in multi-periods decision making problems. To address such discontinuities Aldous introduced the extended weak convergence which can fully characterise all essential properties, including the filtration, of stochastic processes; however was considered to be hard to find efficient numerical implementations. In this talk, we introduce a novel metric called High Rank PCF Distance (HRPCFD) for extended weak convergence based on the high rank path development method from rough path theory, which also defines the characteristic function for measure-valued processes. We then show that such HRPCFD admits many favourable analytic properties which allows us to design efficient algorithms to ensure the stability and feasibility in training. Finally, by using such metric as the discriminator in hypothesis testing and generative modeling, our numerical experiments validate the out-performance of the approach based on HRPCFD compared with several state-of-the-art methods designed from the perspective of weak convergence, and therefore demonstrate the potential applications of this approach in many classical financial and economic circumstances such as optimal stopping or utility maximisation problems, where the weak convergence fails and the extended weak convergence is needed. (TCPL 201)
11:30 - 11:40	Group Photo ↓ Meet in foyer of TCPL to participate in the BIRS group photo. The photograph will be taken outdoors, so dress appropriately for the weather. Please don't be late, or you might not be in the official group photo! (TCPL Foyer)
11:40 - 13:00	Lunch ↓ Lunch is served daily between 11:30am and 1:30pm in the Vistas Dining Room, the top floor of the Sally Borden Building. (Vistas Dining Room)
13:00 - 13:30	Ben Hambly: Systemic Risk, Endogenous Contagion and Mean Field Control ↓ We consider some particle system models for systemic risk. The particles represent the health of financial institutions and we incorporate common noise and contagion into their dynamics. Defaults within the system reduce the financial health of other institutions, causing contagion. By taking a mean field limit we derive a McKean-Vlasov equation for the financial system as a whole. The task of a central planner, who wishes to control the system to prevent systemic events at minimal cost, leads to a novel McKean-Vlasov control problem. We discuss the mathematical issues and illustrate the results numerically. (TCPL 201)
13:30 - 14:00	Anran Hu: Continuous-Time Mean Field Games: A Primal-Dual Characterization ↓ This talk presents a primal-dual formulation for continuous-time mean field games (MFGs) and establishes a complete analytical characterization of the set of all Nash equilibria (NEs). We first show that for any given mean field flow, the representative player's control problem with {\it measurable coefficients} is equivalent to a linear program over the space of occupation measures. We then establish the dual formulation of this linear program as a maximization problem over smooth subsolutions of the associated Hamilton-Jacobi-Bellman (HJB) equation, which plays a fundamental role in characterizing NEs of MFGs. Finally, a complete characterization of \emph{all NEs for MFGs} is established by the strong duality between the linear program and its dual problem. This strong duality is obtained by studying the solvability of the dual problem, and in particular through analyzing the regularity of the associated HJB equation. Compared with existing approaches for MFGs, the primal-dual formulation and its NE characterization require neither the convexity of the associated Hamiltonian nor the uniqueness of its optimizer, and remain applicable when the HJB equation lacks classical or even continuous solutions. (TCPL 201)
14:00 - 14:30	Coffee Break (TCPL Foyer)
14:30 - 15:00	Sebastian Jaimungal: Broker-Trader Partial Information Nash Equilibria ↓ We study partial information Nash equilibrium between a broker and an informed trader. In this model, the informed trader, who possesses knowledge of a trading signal, trades multiple assets with the broker in a dealer market. Simultaneously, the broker trades these assets in a lit exchange where their actions impact the asset prices. The broker, however, only observes aggregate prices and cannot distinguish between underlying trends and volatility. Both the broker and the informed trader aim to maximize their penalized expected wealth. Using convex analysis, we characterize the Nash equilibrium and demonstrate its existence and uniqueness. Furthermore, we establish that this equilibrium corresponds to the solution of a nonstandard system of forward-backward stochastic differential equations. We develop a novel Picard iteration scheme for approximating the solution of the FBSDE system and demonstrate its efficacy on some toy examples. [ Joint work with Xuchen Wu ] (TCPL 201)
15:00 - 15:30	Leandro Sánchez-Betancourt: Market Making with Exogenous Competition ↓ We study liquidity provision in the presence of exogenous competition. We consider a 'reference market maker' who monitors her inventory and the aggregated inventory of the competing market makers. We assume that the competing market makers use a 'rule of thumb' to determine their posted depths, depending linearly on their inventory. By contrast, the reference market maker optimises over her posted depths, and we assume that her fill probability depends on the difference between her posted depths and the competition's depths in an exponential way. For a linear-quadratic goal functional, we show that this model admits an approximate closed-form solution. We illustrate the features of our model and compare against alternative ways of solving the problem either via an Euler scheme or state-of-the-art reinforcement learning techniques. (joint work with Martin Herdegen and Robert Boyce) (TCPL 201)
15:30 - 16:00	Samuel Cohen: Bayesian Opimal Adaptive Control ↓ In this talk we consider a class of partially observed stochastic optimal control problems, where the control determines the filtration. Despite this issue, we will see that the problem (in both strong and weak formulations) can still be characterized by the usual HJB equation, after an appropriate Markovian lift. This yields explicit infinitesimal characterizations of the exploration premium. (TCPL 201)
17:30 - 19:30	Dinner ↓ A buffet dinner is served daily between 5:30pm and 7:30pm in Vistas Dining Room, top floor of the Sally Borden Building. (Vistas Dining Room)

Wednesday, April 30
07:00 - 09:00	Breakfast ↓ Breakfast is served daily between 7 and 9am in the Vistas Dining Room, the top floor of the Sally Borden Building. (Vistas Dining Room)
09:00 - 09:30	Ruimeng Hu: A Mean Field Analysis of Climate Change Uncertainty ↓ Climate-economic models are usually highly nonlinear and complex, in addition to the challenges posed by model uncertainty and heterogeneity among the population. These features make it difficult to design effective policies that balance economic growth, climate mitigation, and social equity. In this talk, we present a quantitative computational analysis of socially optimal climate policy, and the expected discounted values of the social payoffs that determine these optimal policies, in the face of regional heterogeneity and model uncertainty. We design a deep reinforcement learning algorithm to solve the proposed climate-economic model, which we formulate as a mean-field control problem. This algorithm is used to evaluate key model mechanisms and quantify the uncertainty channels that drive our social valuations. We observe numerical convergence and demonstrate that, in our current setting, uncertainty aversion shifts optimal abatement and R\&D investment policies to mitigate climate change damage inequities. This is joint work with Michael Barnett (ASU), Lars Peter Hansen (UChicago), and Hezhong Zhang (UCSB). (TCPL 201)
09:30 - 10:00	Mathieu Lauriere: On Deep Learning and Reinforcement Learning for Continuous Time Stochastic Optimal Control ↓ This talk is composed of two parts. In the first part, we present a novel on-policy algorithm for solving stochastic optimal control (SOC) problems. By leveraging the Girsanov theorem, our method directly computes on-policy gradients of the SOC objective without expensive backpropagation through stochastic differential equations or adjoint problem solutions. This approach significantly accelerates the optimization of neural network control policies while scaling efficiently to high-dimensional problems and long time horizons. We evaluate our method on classical SOC benchmarks as well as applications to sampling from unnormalized distributions via Schrödinger-Föllmer processes and fine-tuning pre-trained diffusion models. Experimental results demonstrate substantial improvements in both computational speed and memory efficiency compared to existing approaches. This is joint work with Mengjian Hua and Eric Vanden-Eijnden. In the second part, we present a rigorous connection between discrete time reinforcement learning and continuous time reinforcement learning as introduced by Wang, Zariphopoulou and Zhou. More precisely, we prove a strong convergence result in the spirit of Euler schemes but with randomized actions. This is joint work with Rene Carmona. (TCPL 201)
10:00 - 10:30	Coffee Break (TCPL Foyer)
10:30 - 11:00	Yuhua Zhu: Optimal PhiBE — A Model-Free PDE-Based Framework for Continuous-Time RL ↓ This talk addresses continuous-time reinforcement learning (RL) in settings where the system dynamics are governed by a stochastic differential equation but remains unknown, with only discrete-time observations available. While the optimal Bellman equation (optimal-BE) enables model-free algorithms, its discretization error is significant when the reward function oscillates. Conversely, model-based PDE approaches offer better accuracy but suffer from non-identifiable inverse problems. To bridge this gap, we introduce Optimal-PhiBE, an equation that integrates discrete-time information into a PDE, combining the strengths of both RL and PDE formulations. Compared to the RL formulation, Optimal-PhiBE is less sensitive to reward oscillations, leading to smaller discretization errors. In linear-quadratic control, Optimal-PhiBE can even achieve accurate continuous-time optimal policy with only discrete-time information. Compared to the PDE formulation, it skips the identification of the dynamics and enables model-free algorithm derivation. Furthermore, we extend Optimal-PhiBE to higher orders, providing increasingly accurate approximations. (TCPL 201)
11:00 - 11:30	Philipp Plank: Policy Gradient Methods for Continuous-time Finite-horizon Linear-Quadratic Graphon Mean Field Games ↓ We analyze the convergence of policy gradient methods for continuous-time finite-horizon linear-quadratic graphon mean field games, which model the large-population limit of competing agents interacting weakly through a weighted graph. Each agent’s equilibrium policy is an affine function of the state variable, sharing a common slope function while having an agent-specific bias term. We propose a policy gradient method that iteratively performs multiple policy updates for a fixed population distribution, followed by an update of the distribution using the latest policies. We prove that these policy iterates converge globally to the optimal policy at a linear rate. Our analysis leverages the optimization landscape over infinite-dimensional policy spaces and carefully controls error propagation across iterations. Numerical experiments across various graphon structures validate the convergence and robustness of our algorithm. (TCPL 201)
11:30 - 13:00	Lunch ↓ Lunch is served daily between 11:30am and 1:30pm in the Vistas Dining Room, the top floor of the Sally Borden Building. (Vistas Dining Room)
13:30 - 17:30	Free Afternoon (Banff National Park)
17:30 - 19:30	Dinner ↓ A buffet dinner is served daily between 5:30pm and 7:30pm in Vistas Dining Room, top floor of the Sally Borden Building. (Vistas Dining Room)

Thursday, May 1
07:00 - 09:00	Breakfast ↓ Breakfast is served daily between 7 and 9am in the Vistas Dining Room, the top floor of the Sally Borden Building. (Vistas Dining Room)
09:00 - 09:30	Roxana Dumitrescu: On Multi-Scale Mean-Field Games ↓ In this talk, I will present an ongoing work which introduces several classes of two-scale mean-field games (with regular control and stopping) in the presence of common noise. I will present several results (which include the existence of an equilibria in this framework and the approximation in the $N$-player setting), and will explain the main steps of the proofs. (Online)
09:30 - 10:00	Dena Firoozi: Infinite-Dimensional LQ Mean Field Games ↓ Mean field games (MFGs) were originally developed in finite-dimensional spaces. However, there are scenarios where Euclidean spaces do not adequately capture the essence of problems such as those involving non-Markovian systems. We present a comprehensive study of linear-quadratic (LQ) MFGs in Hilbert spaces, involving agents whose dynamics are governed by infinite-dimensional stochastic equations. We first study the well-posedness of a system of N coupled semilinear stochastic evolution equations establishing the foundation of MFGs in Hilbert spaces. We then specialize to N-player LQ games and study the asymptotic behavior as the number of agents, N, approaches infinity. We develop an infinite-dimensional variant of the Nash Certainty Equivalence principle and characterize a unique Nash equilibrium for the limiting MFG. Furthermore, we demonstrate that the resulting limiting best-response strategies form an $\varepsilon$-Nash equilibrium for the N-player game in Hilbert spaces. Finally, we prove that neural operators can learn the solution operator to this class of MFGs. (TCPL 201)
10:00 - 10:30	Coffee Break (TCPL Foyer)
10:30 - 11:00	Clemens Possnig: Strategic Communication and Algorithmic Advice ↓ We study a model of communication in which a better-informed sender learns to communicate with a receiver who takes an action that affects the welfare of both. Specifically, we model the sender as a machine-learning-based algorithmic recommendation system and the receiver as a rational, best-responding agent that understands how the algorithm works. The results demonstrate robust communication, which either emerges from scratch (i.e., originating from babbling where no common language initially exists) or persists when initialized. We show that the sender's learning hinders communication, limiting the extent of information transmission even when the algorithm's designer's and the receiver's preferences are aligned. We then show that when the two are not aligned, there is a robust pattern where the algorithm plays a cut-off strategy pooling messages when its private information suggests actions in the direction of its preference bias while sending mostly separate signals otherwise. (TCPL 201)
11:30 - 13:00	Lunch ↓ Lunch is served daily between 11:30am and 1:30pm in the Vistas Dining Room, the top floor of the Sally Borden Building. (Vistas Dining Room)
13:00 - 13:30	Jiacheng Zhang: Minimax-Optimal Trust-Aware Multi-Armed Bandits ↓ Multi-armed bandit (MAB) algorithms have achieved significant success in sequential decision-making applications, under the premise that humans perfectly implement the recommended policy. However, existing methods often overlook the crucial factor of human trust in learning algorithms. When trust is lacking, humans may deviate from the recommended policy, leading to undesired learning performance. Motivated by this gap, we study the trust-aware MAB problem by integrating a dynamic trust model into the standard MAB framework. Specifically, it assumes that the recommended and actually implemented policy differs depending on human trust, which in turn evolves with the quality of the recommended policy. We establish the minimax regret in the presence of the trust issue and demonstrate the suboptimality of vanilla MAB algorithms such as the upper confidence bound (UCB) algorithm. To overcome this limitation, we introduce a novel two-stage trust-aware procedure that provably attains near-optimal statistical guarantees. A simulation study is conducted to illustrate the benefits of our proposed algorithm when dealing with the trust issue. (TCPL 201)
13:30 - 14:00	Han Yuxuan: Optimal Offline Policy Learning for MNL Bandits Under Partial Coverage ↓ Policy learning of the Multinomial Logit (MNL) bandits is a widely studied problem in data-driven assortment optimization, where sellers need to determine the optimal subset of the products to offer based on historical customer choice data. Most existing research on MNL bandits focuses on the online policy learning via repeated interaction with customers, such exploration can be costly in many real-world scenarios. In this talk, we explore the offline learning paradigm by providing a complete characterization of the complexity of offline MNL bandit learning problems. We show that the "optimal item coverage"—where each item in the optimal assortment appears sufficiently often in historical data—is necessary for efficient offline learning by establishing the corresponding information-theoretic lower bounds. For sufficiency, we show that the performance of Pessimistic Rank-Breaking (PRB) algorithm—an algorithm based on pessimism principle for offline decision making—can match the proposed lower bound. (TCPL 201)
14:00 - 14:30	Coffee Break (TCPL Foyer)
17:30 - 19:30	Dinner ↓ A buffet dinner is served daily between 5:30pm and 7:30pm in Vistas Dining Room, top floor of the Sally Borden Building. (Vistas Dining Room)

Friday, May 2
07:00 - 09:00	Breakfast ↓ Breakfast is served daily between 7 and 9am in the Vistas Dining Room, the top floor of the Sally Borden Building. (Vistas Dining Room)
09:00 - 10:00	Open Discussion (TCPL 201)
09:00 - 12:30	Yufei Zhang: Research Discussions - BIRS Placeholder Camera Activation (TCPL 201)
10:00 - 10:30	Coffee Break (TCPL Foyer)
10:30 - 11:00	Checkout by 11AM ↓ 5-day workshop participants are welcome to use BIRS facilities (TCPL ) until 3 pm on Friday, although participants are still required to checkout of the guest rooms by 11AM. (Front Desk - Professional Development Centre)
11:00 - 12:00	Open Discussion (Online)
12:00 - 13:30	Lunch from 11:30 to 13:30 (Vistas Dining Room)

©2025 Banff International Research Station for Mathematical Innovation and Discovery. All Rights Reserved.