Planning and learning under drifting MDPs: risk-averse tree search (RATS) and confidence-widened, sliding-window optimism for sublinear dynamic regret.
September 28, 2025 · Ali Najar, Mazdak Teymourian
Reinforcement learning (RL) agents traditionally assume a stationary environment, yet many real-world settings, from
robotics to finance, exhibit evolving dynamics and reward structures that violate this assumption. This non-stationarity
manifests as drifting transition probabilities, shifting objectives, or co-learning agents, leading to outdated value estimates,
policy instability, and performance degradation. Motivated by the ubiquity of time-varying phenomena in
practical applications, our work investigates algorithms that provably adapt to changing MDPs without prior knowledge
of drift magnitude or change points. We aim to (1) characterize the fundamental limits of learning under bounded
and unbounded non-stationarity, (2) design both model-based and model-free methods, leveraging sliding-window
estimation, exponential forgetting, and optimism-driven exploration, to minimize dynamic regret, and (3) extend
these techniques to deep and multi-agent settings where high-dimensional representations and adversarial co-learners
exacerbate non-stationarity. Through theoretical analysis and empirical evaluation, we seek to deliver scalable RL
solutions that maintain robust performance across a spectrum of evolving environments.
Reinforcement learning (RL) studies how an agent interacts with an environment over time in order to maximize cumulative reward. The environment is commonly modeled as a Markov Decision Process (MDP), defined by a tuple $(S, A, P, r, \gamma)$ of states, actions, a transition kernel, a reward function, and a discount factor. In a non-stationary MDP the transition kernel and reward function become time-dependent, $(P_t, r_t)$, and we assume their drift between consecutive steps is bounded:
$$
W_1\big(P_t(\cdot \mid s,a),\, P_{t+1}(\cdot \mid s,a)\big) \;\le\; L_p,
\qquad
\big| r_t(s,a) - r_{t+1}(s,a) \big| \;\le\; L_r,
$$
where $W_1$ is the Wasserstein-1 distance. This captures smoothly and gradually evolving environments.
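To make the drift condition concrete, here is a small check (my own sketch, not from either paper) that two consecutive next-state distributions for a fixed $(s,a)$ respect an assumed per-step bound $L_p$; the distributions and the bound are made-up numbers, and states are embedded on the integer line so that SciPy's `wasserstein_distance` applies.

```python
# Illustrative check of the per-step drift bound (hypothetical numbers).
import numpy as np
from scipy.stats import wasserstein_distance

states = np.arange(4)                        # states 0..3 embedded on the real line
P_t    = np.array([0.70, 0.20, 0.05, 0.05])  # P_t(. | s, a)
P_tp1  = np.array([0.60, 0.25, 0.10, 0.05])  # P_{t+1}(. | s, a)
L_p = 0.2                                    # assumed drift bound per time step

w1 = wasserstein_distance(states, states, P_t, P_tp1)
print(f"W1 drift = {w1:.3f}, within bound: {w1 <= L_p}")
```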
Because the per-step optimal policy $\pi_t^\star$ shifts over time, an RL algorithm cannot hope to achieve zero regret with respect to a fixed comparator. Instead, the relevant benchmark is dynamic regret, defined as
$$
\mathrm{Reg}_{\mathrm{dyn}}(T) \;=\; \sum_{t=1}^{T} \Big( V_t^{\pi_t^\star}(s_t) \;-\; V_t^{\pi_t}(s_t) \Big),
$$
where $\pi_t$ is the algorithm's policy at time $t$.
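A minimal sketch of the benchmark itself (mine, with made-up numbers): given per-step values of the comparator policy $\pi_t^\star$ and of the algorithm's policy $\pi_t$, dynamic regret is just the accumulated gap.

```python
def dynamic_regret(optimal_values, achieved_values):
    """Sum of per-step gaps V_t^{pi_t^*}(s_t) - V_t^{pi_t}(s_t)."""
    return sum(v_star - v for v_star, v in zip(optimal_values, achieved_values))

# Toy usage with made-up value sequences:
print(dynamic_regret([1.0, 0.9, 1.1], [0.8, 0.9, 0.7]))  # ~0.6
```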
Minimizing dynamic regret requires a careful balance:
Adaptation. Algorithms must forget or discount outdated data that no longer reflects the environment.
Exploration. At the same time, they must actively probe the environment to reduce uncertainty in the current dynamics.
This explorationāadaptation tradeoff is central to learning in non-stationary MDPs, and motivates algorithms such as RATS (robust planning under Lipschitz drift) and confidence-widened sliding-window RL (provable guarantees under bounded variation).
Imagine you are controlling a robot in a world that is constantly changing.
If your robot plans assuming that today's model will remain the same tomorrow, its strategy may fail badly.
This is the fundamental challenge of non-stationarity: plans that ignore possible changes in the environment are brittle.
How can we design an RL agent that plans robustly in the face of uncertainty about the future?
Risk-Averse Tree-Search (RATS) [Lecarpentier & Rachelson, 2019] addresses this challenge by adopting a minimax view of planning.
Instead of assuming the environment will behave according to a single estimated model, RATS treats nature as an adversary:
At each decision point, the agent considers all Lipschitz-consistent evolutions of the environment (i.e., all possible ways transitions and rewards could drift within known smoothness bounds).
It then evaluates the worst-case return across these scenarios.
Finally, it chooses the action that maximizes this worst-case value.
RATS assumes the non-stationary MDP is Lipschitz with respect to time: for all states $s$, actions $a$, and times $t, t'$,
$$
W_1\big(P_t(\cdot \mid s,a),\, P_{t'}(\cdot \mid s,a)\big) \;\le\; L_p\,|t - t'|,
\qquad
\big| r_t(s,a) - r_{t'}(s,a) \big| \;\le\; L_r\,|t - t'|,
$$
where $W_1$ is the 1-Wasserstein distance.
This definition bounds how fast the environment can evolve.
At a given time $t_0$, the agent only has access to the current snapshot $(P_{t_0}, r_{t_0})$.
Since the exact future models are unknown, we restrict them to the admissible set: all possible models that are consistent with the Lipschitz drift bounds.
Formally, for each $(s,a)$ and future time $t \ge t_0$, the admissible set is
$$
\Delta_{t_0}^{t} \;=\; \Big\{ (P, r) \;:\; W_1\big(P(\cdot \mid s,a),\, P_{t_0}(\cdot \mid s,a)\big) \le L_p\,(t - t_0) \ \text{ and } \ \big| r(s,a) - r_{t_0}(s,a) \big| \le L_r\,(t - t_0) \ \text{ for all } (s,a) \Big\}.
$$
The first constraint says that the transition distribution cannot drift faster than $L_p$ per time step (in $\ell_1$ or Wasserstein distance).
The second constraint says that the immediate reward cannot drift faster than $L_r$ per time step.
So $\Delta_{t_0}^t$ is a set of plausible MDPs at time $t$, centered around the snapshot $MDP_{t_0}$ and expanding as $t$ grows.
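As a sketch of this expanding-ball picture (my own illustration, with hypothetical numbers), the following checks whether a candidate model for one $(s,a)$ pair at time $t$ lies within the admissible set centered at the snapshot, whose radius grows linearly in $t - t_0$:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def in_admissible_set(P_cand, r_cand, P_snap, r_snap, t, t0, L_p, L_r, states):
    """Check, for a single (s, a), the growing-ball constraints
    W1(P_t, P_t0) <= L_p (t - t0) and |r_t - r_t0| <= L_r (t - t0)."""
    radius = t - t0
    p_ok = wasserstein_distance(states, states, P_cand, P_snap) <= L_p * radius
    r_ok = abs(r_cand - r_snap) <= L_r * radius
    return p_ok and r_ok

# Toy usage: one (s, a) pair, candidate model two steps after the snapshot.
states = np.arange(3)
print(in_admissible_set(np.array([0.55, 0.30, 0.15]), 0.9,
                        np.array([0.60, 0.30, 0.10]), 1.0,
                        t=3, t0=1, L_p=0.1, L_r=0.1, states=states))  # True
```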
Robust value function: Given this uncertainty, the agent evaluates a policy $\pi$ pessimistically, against the worst admissible sequence of models.
This defines the worst-case value function:
$$
V^{\pi}_{t_0}(s) \;=\; \min_{\{(P_t, r_t) \in \Delta_{t_0}^{t}\}_{t \ge t_0}} \; \mathbb{E}\Big[ \sum_{t \ge t_0} \gamma^{\,t - t_0}\, r_t(s_t, a_t) \;\Big|\; s_{t_0} = s,\ a_t \sim \pi \Big].
$$
Let $H(s)$ be the heuristic value used at leaves of depth $d_{\max}$.
If $H$ has uniform error $\delta$ with respect to the true value, then the propagated error at a node $\nu$ at depth $d$ is bounded by
$$
\big| \hat{V}(\nu) - V(\nu) \big| \;\le\; \gamma^{\,d_{\max} - d}\,\delta .
$$
This provides a principled way to construct pessimistic heuristics: by subtracting a drift-dependent margin from snapshot values, one obtains a safe lower bound for use at leaf nodes.
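A sketch of such a pessimistic leaf heuristic (mine, not the paper's exact bound): start from the snapshot value and subtract a margin that grows with the reward-drift constant and the leaf's temporal distance from the planning time. The specific margin form below is an illustrative choice.

```python
def pessimistic_leaf_heuristic(v_snapshot, steps_ahead, L_r, gamma):
    """Lower-bound a snapshot value estimate at a leaf reached `steps_ahead` steps
    after the planning time t0 by subtracting a drift-dependent margin.
    The margin form (accumulated worst-case reward drift over the discounted tail)
    is an illustrative choice, not the exact bound from the RATS paper."""
    margin = (L_r * steps_ahead) / (1.0 - gamma)
    return v_snapshot - margin

# Toy usage: snapshot value 5.0 at a leaf three steps ahead, L_r = 0.05, gamma = 0.9.
print(pessimistic_leaf_heuristic(5.0, steps_ahead=3, L_r=0.05, gamma=0.9))  # ~3.5
```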
Constructing a RATS tree of depth $d_{\max}$ has complexity
$$
O\Big( B \, |S|^{1.5} |A| \, (|S||A|)^{d_{\max}} \Big),
$$
where $B$ is the horizon, $|S|$ the number of states, and $|A|$ the number of actions.
The main cost comes from evaluating worst-case transitions using Wasserstein computations.
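To show where the minimax structure and the worst-case model evaluation enter, here is a heavily simplified backup step (my own sketch): rather than solving the exact Wasserstein-constrained minimization, it takes the worst case over a small finite list of candidate admissible models per action. All names and numbers are illustrative.

```python
import numpy as np

def worst_case_backup(candidate_models, child_values, gamma):
    """One minimax backup at a decision node.

    candidate_models: per action, a list of (reward, transition_probs) pairs,
                      each a finitely sampled member of the admissible set.
    child_values:     value estimates of the successor states.
    Returns max over actions of min over candidate models of r + gamma * E[V(s')].
    """
    best = -np.inf
    for action, models in candidate_models.items():
        worst = min(r + gamma * np.dot(p, child_values) for r, p in models)
        best = max(best, worst)
    return best

# Toy usage: 2 actions, 2 candidate models each, 3 successor states (made-up numbers).
child_values = np.array([1.0, 0.5, 0.0])
candidates = {
    "a0": [(0.2, np.array([0.8, 0.1, 0.1])), (0.2, np.array([0.6, 0.2, 0.2]))],
    "a1": [(0.3, np.array([0.5, 0.3, 0.2])), (0.3, np.array([0.4, 0.3, 0.3]))],
}
print(worst_case_backup(candidates, child_values, gamma=0.9))
```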
Figure: discounted return vs. drift $\epsilon$.
As drift $\epsilon$ increases, snapshot planning collapses: expected return degrades severely since it ignores model evolution.
DP-NSMDP maintains strong performance, but it has access to privileged information (future models).
RATS closely tracks the oracle's performance, even under large $\epsilon$, showing its robustness.
While the above guarantees are worst-case, experiments show that RATS also consistently improves Conditional Value-at-Risk (CVaR) at level $\alpha = 5\%$.
This means RATS not only secures the worst case mathematically, but also empirically minimizes downside risk, yielding safer behavior under uncertainty.
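For reference, CVaR at level $\alpha$ is simply the mean of the worst $\alpha$-fraction of outcomes; a minimal empirical version (standard definition, not code from the paper):

```python
import numpy as np

def empirical_cvar(returns, alpha=0.05):
    """Average of the worst alpha-fraction of returns (lower tail)."""
    returns = np.sort(np.asarray(returns))
    k = max(1, int(np.ceil(alpha * len(returns))))
    return returns[:k].mean()

# Toy usage: 100 made-up episode returns.
rng = np.random.default_rng(0)
returns = rng.normal(loc=10.0, scale=2.0, size=100)
print(f"CVaR@5%: {empirical_cvar(returns, alpha=0.05):.2f}")
```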
Reinforcement learning thrives on optimism in the face of uncertainty:
when we are unsure about transitions or rewards, we pretend they are as good as plausibly possible, then explore.
This principle underlies algorithms like UCRL2, which achieve tight regret bounds in stationary MDPs.
But what if the world itself is drifting?
In non-stationary environments, naive optimism can backfire: the optimistic model may imagine wildly long horizons or exploding diameters, leading to poor control of regret.
The blessing of optimism in drifting MDPs is more subtle:
sometimes more optimism, not less, actually restores robustness.
The key idea is to adapt sliding-window UCRL2 (SW-UCRL2) with a technique called confidence widening.
Instead of pooling all past data, the algorithm uses only the last $W$ steps to estimate rewards and transitions. For an episode $m$ starting at time $\tau(m)$, the sliding-window estimates $\hat r_{\tau(m)}(s,a)$ and $\hat p_{\tau(m)}(\cdot \mid s,a)$ are built from the steps in $[\tau(m) - W,\, \tau(m) - 1]$, and the transition confidence set is widened by $\eta$:
$$
H_{p,\tau(m)}(s,a) \;=\; \Big\{ \tilde p \in \Delta_{S} \;:\; \big\| \tilde p - \hat p_{\tau(m)}(\cdot \mid s,a) \big\|_1 \;\le\; \mathrm{rad}_{p,\tau(m)}(s,a) + \eta \Big\},
$$
where $\mathrm{rad}_{p,\tau(m)}(s,a)$ is the usual sliding-window confidence radius. Extended value iteration over these confidence sets produces an optimistic policy $\tilde \pi_m$, computed to accuracy $\epsilon_m = 1/\sqrt{\tau(m)}$.
The agent then follows $\tilde \pi_m$ until enough new data is collected to update estimates.
import itertools
import math

def SWUCRL2_CW(T, S, A, W, eta):
    t = 1
    s = initial_state()
    for m in itertools.count(1):                     # episodes m = 1, 2, ...
        tau = t                                      # episode start time
        nu = {(s_, a_): 0 for s_ in S for a_ in A}   # within-episode visit counts
        # confidence sets from the last W steps; transition sets widened by eta
        Hr = build_reward_confidence_sets(tau, W)
        Hp = build_transition_confidence_sets(tau, W, eta)
        # optimistic policy via extended value iteration on the widened sets
        pi_tilde = ExtendedValueIteration(Hr, Hp, eps=1 / math.sqrt(tau))
        # follow the policy until an episode-ending condition triggers
        while not multiple_of_W(t) and nu[s, pi_tilde(s)] < N_plus(tau, s, pi_tilde(s)):
            a = pi_tilde(s)
            r, s_next = step_env(s, a)
            nu[s, a] += 1
            s = s_next
            t += 1
            if t > T:
                return
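The helpers in the pseudocode are left abstract. As one possible reading of the confidence-widening step, here is a sketch (mine; an explicit `history` argument is added for self-containment, and the radius constant is simplified relative to the paper) that builds, for each $(s,a)$, a sliding-window empirical transition estimate together with an $\ell_1$ radius enlarged by $\eta$:

```python
import numpy as np

def build_transition_confidence_sets(history, tau, W, eta, n_states, n_actions, delta=0.05):
    """For each (s, a): center = sliding-window empirical next-state distribution,
    radius = concentration term + widening eta. `history[t-1]` is the (s, a, r, s_next)
    transition taken at time t; only the last W steps before tau are used.
    The radius constant is illustrative, not the paper's exact expression."""
    window = history[max(0, tau - 1 - W):tau - 1]
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, _, s_next in window:
        counts[s, a, s_next] += 1
    sets = {}
    for s in range(n_states):
        for a in range(n_actions):
            n_sa = counts[s, a].sum()
            p_hat = counts[s, a] / n_sa if n_sa > 0 else np.full(n_states, 1.0 / n_states)
            radius = np.sqrt(2 * n_states * np.log(1.0 / delta) / max(1.0, n_sa)) + eta
            sets[(s, a)] = (p_hat, radius)  # set = {p : ||p - p_hat||_1 <= radius}
    return sets

# Toy usage: 2 states, 2 actions, a short made-up history.
hist = [(0, 0, 1.0, 1), (1, 1, 0.0, 0), (0, 0, 0.5, 0), (1, 0, 1.0, 1)]
sets = build_transition_confidence_sets(hist, tau=5, W=4, eta=0.1, n_states=2, n_actions=2)
print(sets[(0, 0)])
```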
This algorithm, SWUCRL2-CW, achieves sublinear dynamic regret under non-stationarity.
Dynamic regret. Define dynamic regret as the gap to the best per-round policy sequence $\{\pi_t^\star\}$, as above:
$$
\mathrm{Reg}_{\mathrm{dyn}}(T) \;=\; \sum_{t=1}^{T} \Big( V_t^{\pi_t^\star}(s_t) - V_t^{\pi_t}(s_t) \Big).
$$
Let $B_r = \sum_{t=1}^{T-1} \max_{s,a} \big| r_{t+1}(s,a) - r_t(s,a) \big|$ and $B_p = \sum_{t=1}^{T-1} \max_{s,a} \big\| P_{t+1}(\cdot \mid s,a) - P_t(\cdot \mid s,a) \big\|_1$ denote the reward and transition variation budgets. With window $W$ and widening $\eta$ tuned to these budgets, SWUCRL2-CW attains dynamic regret $\tilde{\mathcal{O}}\big( (B_r + B_p)^{1/4}\, T^{3/4} \big)$, suppressing dependence on $|S|$, $|A|$, and the diameter.
Moreover, the meta-algorithm BORL (Bandit-over-RL) can learn $(W,\eta)$ online and achieves the same bound without prior knowledge of $(B_r,B_p)$.
Figure: structure of the BORL algorithm.
BORL (Bandit-over-RL).
The idea is to treat SWUCRL2-CW as a base learner with hyperparameters $(W,\eta)$ and then run an adversarial bandit (EXP3.P) over a finite set of candidates
$\mathcal{K}=\{(W_k,\eta_k)\}_{k=1}^K$. Each candidate is an "arm." Time is partitioned into blocks of length $H$ (picked so that one run of SWUCRL2-CW has enough samples to stabilize its windowed estimates).
At the start of block $m$:
Bandit selection. EXP3.P samples an arm $k_m \in \{1,\dots,K\}$, drawing arm $k$ with probability $p_{k,m}$.
Deploy base learner. Run SWUCRL2-CW with $(W_{k_m},\eta_{k_m})$ for the next $H$ steps, producing realized block return $R_m$ (or loss $\ell_m$).
Feedback to the bandit. Construct an importance-weighted loss estimate (normalized to $[0,1]$),
$$
\hat{\ell}_{k,m} \;=\; \frac{\ell_m \,\mathbb{1}\{k = k_m\}}{p_{k_m,m}},
$$
and update EXP3.P's weights (a minimal sketch of this step appears below). This lets the bandit learn which $(W,\eta)$ works best under the current non-stationarity.
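A minimal sketch of that feedback step (mine): a plain EXP3-style exponential-weights update with importance weighting; EXP3.P additionally adds a confidence/bias term for high-probability guarantees, which is omitted here.

```python
import numpy as np

def exp3_update(weights, probs, k_chosen, loss, eta_bandit):
    """One exponential-weights update from a single observed block loss in [0, 1].
    EXP3.P's extra bias term for high-probability bounds is omitted for brevity."""
    K = len(weights)
    loss_hat = np.zeros(K)
    loss_hat[k_chosen] = loss / probs[k_chosen]        # importance-weighted estimate
    return weights * np.exp(-eta_bandit * loss_hat)

# Toy usage: 4 candidate (W, eta) arms, arm 2 suffered block loss 0.3 (made-up).
w = np.ones(4)
p = 0.95 * w / w.sum() + 0.05 / 4                      # sampling mixed with uniform exploration
w = exp3_update(w, p, k_chosen=2, loss=0.3, eta_bandit=0.1)
print(np.round(w, 3))
```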
Intuition: different drifts favor different windows $W$ (short windows react faster; long windows reduce variance) and different widenings $\eta$ (extra optimism keeps the optimistic model's effective diameter controlled under drift). BORL adapts online by routing data to the currently promising configuration.
Why blocks?
SWUCRL2-CW updates policies episodically; giving it a contiguous budget of $H$ steps per arm avoids churning and lets its confidence sets/optimistic planning stabilize within each block.
The bandit only needs a scalar loss per block, keeping EXP3.Pās feedback simple and adversarially robust.
Loss choice.
Any block-level surrogate that orders arms correctly works (and is clipped to $[0,1]$ for EXP3.P): e.g., the negative average return over the block, or a proxy for dynamic regret estimated from the same statistics SWUCRL2-CW maintains (sliding-window counts, confidence radii).
Grid design.
Pick a small geometric grid for $W$ (e.g., $W \in \{2^0, 2^1, \dots\}$ up to a cap) and a few $\eta$ levels (e.g., $\eta \in \{0, \eta_0, 2\eta_0, \dots\}$). This keeps $K$ modest (logarithmically many choices), so EXP3.P's overhead is tiny.
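A possible instantiation of such a grid, with a made-up horizon cap and base widening level $\eta_0$:

```python
T = 100_000                     # assumed horizon
eta0 = 0.05                     # assumed base widening level

windows = [2 ** i for i in range(18) if 2 ** i <= T]     # 1, 2, 4, ..., capped by T
widenings = [0.0, eta0, 2 * eta0, 4 * eta0]

grid = [(W, eta) for W in windows for eta in widenings]  # candidate arms for EXP3.P
print(len(grid), grid[:4])
```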
High-level guarantee (why BORL matches the theorem).
Let $M=\lceil T/H\rceil$ be the number of blocks. EXP3.P ensures bandit regret
$$
\tilde{\mathcal{O}}\big(\sqrt{M\log K}\big)
$$
against the best fixed arm in hindsight. The best arm corresponds to a (near-)optimal $(W^\star,\eta^\star)$ for the unknown drift $(B_r,B_p)$, whose base regret over $T$ steps is
$$
\tilde{\mathcal{O}}\big( (B_r + B_p)^{1/4}\, T^{3/4} \big).
$$
Choosing a reasonable block length $H$ and a compact grid $\mathcal{K}$ makes the bandit overhead lower order, so BORL achieves the same $\tilde{\mathcal O}(T^{3/4})$ dynamic-regret rate, without knowing $(B_r,B_p)$ in advance.
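Putting the pieces together, a block-scheduling sketch of BORL (mine): `bandit` is assumed to expose `select()` and `update(arm, loss)` (e.g., the EXP3-style update sketched earlier), and `run_base_learner(W, eta, H)` is a placeholder that runs SWUCRL2-CW with the chosen hyperparameters for one block and returns a loss normalized to $[0,1]$.

```python
def borl_outer_loop(T, H, grid, bandit, run_base_learner):
    """Block-structured meta-loop: one bandit decision per block of H steps."""
    n_blocks = -(-T // H)                    # ceil(T / H)
    for m in range(n_blocks):
        k = bandit.select()                  # pick a candidate (W, eta) arm
        W, eta = grid[k]
        loss = run_base_learner(W, eta, H)   # deploy SWUCRL2-CW for this block
        bandit.update(k, loss)               # scalar, [0,1]-normalized feedback
```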
Lecarpentier, Erwan, and Rachelson, Emmanuel. Non-Stationary Markov Decision Processes: A Worst-Case Approach using Model-Based Reinforcement Learning. NeurIPS, 2019. https://arxiv.org/abs/1904.10090
Cheung, Wang Chi, Simchi-Levi, David, and Zhu, Ruihao. Reinforcement Learning for Non-Stationary Markov Decision Processes: The Blessing of (More) Optimism. ICML, 2020. https://arxiv.org/abs/2006.14389