
Fitted Q-Iteration Methods

The previous chapter established the theoretical foundations of simulation-based approximate dynamic programming: Monte Carlo integration for evaluating expectations, Q-functions for efficient action selection, and techniques for mitigating overestimation bias. Those developments assumed we could sample freely from transition distributions and choose optimization parameters without constraint. This chapter develops a unified framework for fitted Q-iteration algorithms that spans both offline and online settings. We begin with batch algorithms that learn from fixed datasets, then show how the same template generates online methods like DQN through systematic variations in data collection, optimization strategy, and function approximation.

Design Choices in FQI Methods

All FQI methods share the same two-level structure built on three core ingredients: a buffer $\mathcal{B}_t$ of transitions inducing an empirical distribution $\hat{P}_{\mathcal{B}_t}$, a target function $g$ that computes regression targets from the current Q-function, and an optimization procedure that fits the Q-function to the resulting targets. At iteration $n$, the outer loop constructs targets from individual transitions using the target function: for each transition $(s_i, a_i, r_i, s'_i)$, we compute $y_i^{(n)} = g(s_i, a_i, r_i, s'_i; \boldsymbol{\theta}_n)$, where typically $g(s,a,r,s'; \boldsymbol{\theta}) = r + \gamma \max_{a'} q(s',a'; \boldsymbol{\theta})$. The inner loop solves the regression problem $\min_{\boldsymbol{\theta}} \mathbb{E}_{((s,a),y) \sim \hat{P}_n^{\text{fit}}}[\ell(q(s, a; \boldsymbol{\theta}), y)]$ to find parameters that match these targets. We can write this abstractly as:

$$
\begin{aligned}
&\textbf{repeat } n = 0, 1, 2, \ldots \\
&\quad \text{Sample transitions from } \hat{P}_{\mathcal{B}_n} \text{ and construct targets via } g(\cdot; \boldsymbol{\theta}_n) \\
&\quad \boldsymbol{\theta}^{(n+1)} \leftarrow \texttt{fit}(\hat{P}_n^{\text{fit}}, \boldsymbol{\theta}_{\text{init}}, K) \\
&\textbf{until convergence}
\end{aligned}
$$

The fit operation minimizes the regression loss using $K$ optimization steps (gradient descent for neural networks, tree construction for ensembles, matrix inversion for linear models) starting from initialization $\boldsymbol{\theta}_{\text{init}}$. Standard supervised learning uses random initialization ($\boldsymbol{\theta}_{\text{init}} = \boldsymbol{\theta}_0$) and runs to convergence ($K = \infty$). Reinforcement learning algorithms vary these choices: warm starting from the previous iteration ($\boldsymbol{\theta}_{\text{init}} = \boldsymbol{\theta}^{(n)}$), partial optimization ($K \in \{10, \ldots, 100\}$), or single-step updates ($K=1$).
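To make the template concrete, the sketch below runs the two-level loop on a toy random MDP with a tabular Q-function standing in for the parameters $\boldsymbol{\theta}$; the toy transitions, the gradient-based fit, and all constants are illustrative assumptions rather than any specific algorithm from this chapter.

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 2, 0.9

# A fixed buffer of transitions (s, a, r, s') collected by some behavior policy.
N = 500
S = rng.integers(n_states, size=N)
A = rng.integers(n_actions, size=N)
R = rng.normal(size=N)
S_next = rng.integers(n_states, size=N)

theta = np.zeros((n_states, n_actions))        # tabular Q playing the role of theta

def targets(theta_n):
    """Target function g applied to every transition in the buffer."""
    return R + gamma * theta_n[S_next].max(axis=1)

def fit(y, theta_init, K, lr=0.5, batch=32):
    """Inner loop: K SGD steps on the squared regression loss, warm-started."""
    th = theta_init.copy()
    for _ in range(K):
        idx = rng.integers(N, size=batch)       # sample from the empirical law
        td = th[S[idx], A[idx]] - y[idx]        # prediction minus target
        np.subtract.at(th, (S[idx], A[idx]), lr * td / batch)
    return th

for n in range(50):                             # outer loop: value iteration
    y = targets(theta)                          # recompute targets with theta_n
    theta = fit(y, theta_init=theta, K=100)     # warm start, partial optimization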

The buffer $\mathcal{B}_t$ may stay fixed (batch setting) or change (online setting), but the targets always change because they depend on the evolving target function $g(\cdot; \boldsymbol{\theta}_n)$. In practice, we typically have one observed next state per transition, giving $g(s_i, a_i, r_i, s'_i; \boldsymbol{\theta}) = r_i + \gamma \max_{a'} q(s'_i, a'; \boldsymbol{\theta})$ for transition tuples $(s_i, a_i, r_i, s'_i)$.

Remark 1 (Notation: Buffer vs Regression Distribution)

We maintain a careful distinction between two empirical distributions throughout:

  • Buffer distribution $\hat{P}_{\mathcal{B}_t}$ over transitions $\tau = (s, a, r, s')$: the empirical distribution induced by the replay buffer $\mathcal{B}_t$, which contains raw experience tuples. This distribution is fixed (offline) or evolves via online collection (adding new transitions, dropping old ones).

  • Regression distribution $\hat{P}_t^{\text{fit}}$ over pairs $((s,a), y)$: the empirical distribution over supervised learning targets. This changes at every outer iteration $n$ as we recompute targets using the current target function $g(\cdot; \boldsymbol{\theta}_n)$.

The relationship: at iteration $n$, we construct $\hat{P}_n^{\text{fit}}$ from $\hat{P}_{\mathcal{B}_n}$ by applying the target function:

$$
\hat{P}_n^{\text{fit}} = (\mathrm{id}, g(\cdot; \boldsymbol{\theta}_n))_\# \hat{P}_{\mathcal{B}_n}
$$

where $g(s,a,r,s'; \boldsymbol{\theta}_n) = r + \gamma \max_{a'} q(s', a'; \boldsymbol{\theta}_n)$ or uses the smooth logsumexp operator.

This distinction matters pedagogically: the buffer distribution $\hat{P}_{\mathcal{B}_t}$ is fixed (offline) or evolves via online collection, while the regression distribution $\hat{P}_t^{\text{fit}}$ evolves via target recomputation. Fitted Q-iteration is the outer loop over target recomputation, not the inner loop over gradient steps.

This template provides a blueprint for instantiating concrete algorithms. Six design axes generate algorithmic diversity: the function approximator (trees, neural networks, linear models), the Bellman operator (hard max vs. smooth logsumexp, discussed in the regularized MDP chapter), the inner optimization strategy (full convergence, $K$ steps, or a single step), the initialization scheme (cold vs. warm start), the data collection mechanism (offline, online, replay buffer), and bias mitigation (none, double Q-learning, learned correction). While individual algorithms include additional refinements, these axes capture the primary sources of variation. The table below shows how several well-known methods instantiate this template:

| Algorithm | Approximator | Bellman | Inner Loop | Initialization | Data | Bias Fix |
| --- | --- | --- | --- | --- | --- | --- |
| FQI (Ernst et al., 2005) | Extra Trees | Hard | Full | Cold | Offline | None |
| NFQI (Riedmiller, 2005) | Neural Net | Hard | Full | Warm | Offline | None |
| Q-learning (Watkins, 1989) | Any | Hard | $K=1$ | Warm | Online | None |
| DQN (Mnih et al., 2013) | Deep NN | Hard | $K=1$ | Warm | Replay | None |
| Double DQN (Van Hasselt et al., 2016) | Deep NN | Hard | $K=1$ | Warm | Replay | Double Q |
| Soft Q (Haarnoja et al., 2017) | Neural Net | Smooth | $K$ steps | Warm | Replay | None |

This table omits continuous action methods (NFQCA, DDPG, SAC), which introduce an additional design dimension. We address those in the continuous action chapter. The initialization choice becomes particularly important when moving from batch to online algorithms.

Plug-In Approximation with Empirical Distributions

The exact Bellman operator involves expectations under the true transition law (combined with the behavior policy), which we denote abstractly by $P$ over transitions $\tau = (s,a,r,s')$:

$$
(\mathrm{L} q)(s,a) = r(s,a) + \gamma \int \max_{a'} q(s',a')\, P(ds' \mid s,a)
$$

In fitted Q-iteration we never see $P$ directly. Instead, we collect a finite set of transitions in a buffer $\mathcal{B}_t = \{\tau_1,\dots,\tau_{|\mathcal{B}_t|}\}$ and work with the empirical distribution

$$
\hat{P}_{\mathcal{B}_t} = \frac{1}{|\mathcal{B}_t|} \sum_{\tau \in \mathcal{B}_t} \delta_\tau
$$

where $\delta_\tau$ denotes a Dirac delta (point mass) centered at transition $\tau$. Each $\delta_\tau$ is a probability distribution that places mass 1 at the single point $\tau$ and mass 0 everywhere else. The sum creates a mixture distribution: a uniform distribution over the finite set of observed transitions. Sampling from $\hat{P}_{\mathcal{B}_t}$ means picking one transition uniformly at random from the buffer.

For any integrand $g(\tau)$ (loss, TD error, gradient term), expectations under $P$ are approximated by expectations under $\hat{P}_{\mathcal{B}_t}$ using the sample average estimator:

$$
\mathbb{E}_{\tau \sim P}\big[g(\tau)\big] \;\approx\; \mathbb{E}_{\tau \sim \hat{P}_{\mathcal{B}_t}}\big[g(\tau)\big] = \frac{1}{|\mathcal{B}_t|} \sum_{\tau \in \mathcal{B}_t} g(\tau)
$$

This is exactly the sample average estimator from the Monte Carlo chapter, applied now to transitions. Conceptually, fitted Q-iteration performs plug-in approximate dynamic programming. The plug-in principle is a general approach from statistics: when an algorithm requires an unknown population quantity, substitute its sample-based estimator. Here, we replace the unknown transition law $P$ with the empirical distribution $\hat{P}_{\mathcal{B}_t}$ and run value iteration using this empirical Bellman operator:

$$
(\widehat{\mathrm{L}}_{\mathcal{B}_t} q)(s,a) \triangleq r(s,a) + \gamma\; \mathbb{E}_{(r',s') \mid (s,a) \sim \hat{P}_{\mathcal{B}_t}} \Big[\max_{a'} q(s',a')\Big]
$$

From a computational viewpoint, we could describe all of this using sample averages and mini-batch gradients. The empirical distribution notation provides three benefits. First, it unifies offline and online algorithms: both perform value iteration under an empirical law $\hat{P}_{\mathcal{B}_t}$, differing only in whether $\mathcal{B}_t$ remains fixed or evolves. Second, it shows that methods like DQN perform stochastic optimization of a sample average approximation objective $\mathbb{E}_{\tau \sim \hat{P}_{\mathcal{B}_t}}[\ell(\cdot)]$, not some ad hoc non-stationary procedure. Third, it cleanly separates two sources of approximation that we will examine shortly: statistical bootstrap (resampling from $\hat{P}_{\mathcal{B}_t}$) versus temporal-difference bootstrap (using estimated values in targets).
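As a quick illustration of the plug-in principle, the snippet below compares an exact expectation under a small, known next-state distribution with its sample average over a buffer; the toy distribution and integrand are assumptions for illustration only.

import numpy as np

rng = np.random.default_rng(1)
p_next = np.array([0.2, 0.5, 0.3])           # P(s' | s, a) for one fixed (s, a)
v_next = np.array([1.0, 0.0, 2.0])           # max_a' q(s', a') for each next state

exact = (p_next * v_next).sum()              # true expectation under P

buffer = rng.choice(3, size=200, p=p_next)   # observed next states in the buffer
plug_in = v_next[buffer].mean()              # expectation under the empirical law

print(f"exact = {exact:.3f}, plug-in estimate = {plug_in:.3f}")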

Remark 2 (Mathematical Formulation of Empirical Distributions)

The empirical distribution $\hat{P}_{\mathcal{B}_t} = \frac{1}{|\mathcal{B}_t|}\sum_{\tau \in \mathcal{B}_t} \delta_{\tau}$ is a discrete probability measure over $|\mathcal{B}_t|$ points, regardless of whether the state and action spaces are continuous or discrete. For any measurable set $A \subseteq \mathcal{S} \times \mathcal{A} \times \mathbb{R} \times \mathcal{S}$:

$$
\hat{P}_{\mathcal{B}_t}(A) = \frac{1}{|\mathcal{B}_t|}\sum_{\tau \in \mathcal{B}_t} \mathbb{1}[\tau \in A] = \frac{|\{\tau \in \mathcal{B}_t : \tau \in A\}|}{|\mathcal{B}_t|}
$$

The empirical distribution assigns probability $1/|\mathcal{B}_t|$ to each observed tuple and zero elsewhere.

Data, Buffers, and the Unified Template

Fitted Q-iteration is built around three ingredients at any time $t$:

  1. A replay buffer $\mathcal{B}_t$ containing transitions, inducing an empirical distribution $\hat{P}_{\mathcal{B}_t}$

  2. A target function $g(s,a,r,s'; \boldsymbol{\theta}) : \mathcal{S} \times \mathcal{A} \times \mathbb{R} \times \mathcal{S} \times \Theta \to \mathbb{R}$ that computes regression targets for individual transitions. Unlike the Bellman operator $\mathrm{L}$, which acts on function spaces, $g$ operates on transition tuples, with parameters $\boldsymbol{\theta}$ (after the semicolon) specifying the current Q-function. Typically $g(s,a,r,s'; \boldsymbol{\theta}) = r + \gamma \max_{a'} q(s',a'; \boldsymbol{\theta})$ (hard max) or uses the smooth logsumexp operator.

  3. A loss function $\ell$ and an optimization budget (replay ratio, number of gradient steps)

Pushing transitions through the target function transforms $\hat{P}_{\mathcal{B}_t}$ into a regression distribution $\hat{P}_t^{\text{fit}}$ over pairs $((s,a), y)$, as described in the notation remark above. The inner loop minimizes the empirical risk $\mathbb{E}_{((s,a),y)\sim \hat{P}_t^{\text{fit}}} [\ell(q(s,a;\boldsymbol{\theta}), y)]$ via stochastic gradient descent on mini-batches.

Different algorithms correspond to different ways of evolving the buffer $\mathcal{B}_t$ and different replay ratios.

The replay ratio (optimization steps per environment transition) quantifies data reuse:

$$
\text{Replay ratio} = \begin{cases} K \cdot N_{\text{epochs}} \cdot N / b & \text{Offline (batch size } b \text{ from } N \text{ transitions)} \\ b & \text{DQN (one step per transition, batch size } b\text{)} \\ 1 & \text{Online Q-learning} \end{cases}
$$

where $K$ is the number of gradient steps per outer iteration. Large replay ratios reuse the same empirical distribution $\hat{P}_{\mathcal{B}_t}$ many times, implementing sample average approximation (offline FQI). Small replay ratios use each sample once, implementing stochastic approximation (online Q-learning). Higher replay ratios reduce the variance of estimates under $\hat{P}_{\mathcal{B}_t}$ but risk overfitting to the idiosyncrasies of the current empirical law, especially when it reflects outdated policies or narrow coverage.

Remark 3 (Two Notions of Bootstrap)

This perspective separates two different “bootstraps” that appear in FQI:

  1. Statistical bootstrap over data. By sampling with replacement from $\hat{P}_{\mathcal{B}_t}$, we approximate expectations under the (unknown) transition distribution $P$ with expectations under the empirical distribution $\hat{P}_{\mathcal{B}_t}$. This is identical to bootstrap resampling in statistics. Mini-batch training is exactly this: we treat the observed transitions as if they were the entire population and approximate expectations by repeatedly resampling from the empirical law. The replay ratio controls how many such bootstrap samples we take per environment interaction.

  2. Temporal-difference bootstrap over values. When we compute $y = r + \gamma \max_{a'} q(s',a';\boldsymbol{\theta})$, we replace the unknown continuation value with our current estimate. This is the TD notion of bootstrapping and the source of the maximization bias studied in the previous chapter.

FQI, DQN, and their variants combine both: the empirical distribution $\hat{P}_{\mathcal{B}_t}$ encodes how we reuse data (statistical bootstrap), and the target function $g$ encodes how we bootstrap values (TD bootstrap). Most bias-correction techniques (Keane–Wolpin, Double Q-learning, Gumbel losses) modify the second kind of bootstrapping while leaving the statistical bootstrap unchanged.

Every algorithm in this chapter minimizes an empirical risk of the form $\mathbb{E}_{((s,a),y)\sim \hat{P}_t^{\text{fit}}} [\ell(q(s,a;\boldsymbol{\theta}), y)]$, where expectations are computed via the sample average estimator over the buffer. Algorithmic diversity arises from choices of buffer evolution, target function, loss, and optimization schedule.

Batch Algorithms: Ernst’s FQI and NFQI

We begin with the simplest case from the buffer perspective. We are given a fixed transition dataset $\mathcal{D} = \{(s_i,a_i,r_i,s'_i)\}_{i=1}^N$ and never collect new data. The replay buffer is frozen:

$$
\mathcal{B}_t \equiv \mathcal{D}, \qquad \hat{P}_{\mathcal{B}_t} \equiv \hat{P}_{\mathcal{D}}
$$

so the only thing that changes across iterations is the target function $g(\cdot; \boldsymbol{\theta}_n)$ induced by the current Q-function. Every outer iteration of FQI samples from the same empirical distribution $\hat{P}_{\mathcal{D}}$ but pushes it through a new target function, producing a new regression distribution $\hat{P}_n^{\text{fit}}$.

At each outer iteration $n$, we construct the input set $X^{(n)} = \{(s_i, a_i)\}_{i=1}^N$ from the same transitions (the state-action pairs remain fixed), compute targets $y_i^{(n)} = g(s_i, a_i, r_i, s'_i; \boldsymbol{\theta}^{(n)}) = r_i + \gamma \max_{a'} q(s'_i, a'; \boldsymbol{\theta}^{(n)})$ using the current Q-function, and solve the regression problem $\boldsymbol{\theta}^{(n+1)} \leftarrow \texttt{fit}(X^{(n)}, y^{(n)}, \boldsymbol{\theta}_{\text{init}}, K)$. The buffer $\mathcal{B}_t = \mathcal{D}$ never changes, but the targets change at every iteration because they depend on the evolving target function $g(\cdot; \boldsymbol{\theta}_n)$. Each transition $(s_i, a_i, r_i, s'_i)$ provides a single Monte Carlo sample $s'_i$ for evaluating the Bellman operator at $(s_i, a_i)$, giving the one-sample estimate $\widehat{\mathrm{L}}_1 q$.

The following algorithm is simply approximate value iteration in which expectations under the transition kernel $P$ are replaced by expectations under the fixed empirical distribution $\hat{P}_{\mathcal{D}}$:

The fit operation in line 10 abstracts the inner optimization loop that minimizes $\sum_{i=1}^N \ell(q(s_i, a_i; \boldsymbol{\theta}), y_i^{(n)})$. This line encodes the algorithmic choice of which function approximator to use and how to optimize it. The initialization $\boldsymbol{\theta}_{\text{init}}$ and the number of optimization steps $K$ control whether we use cold or warm starting and whether we optimize to convergence or perform partial updates.

Fitted Q-Iteration (FQI): Ernst et al. (2005) instantiate this template with extremely randomized trees (extra trees), an ensemble method that partitions the state-action space into regions with piecewise-constant Q-values. The fit operation trains the ensemble to completion using the tree construction algorithm. Trees handle high-dimensional inputs naturally, and the ensemble reduces overfitting. FQI uses cold-start initialization: $\boldsymbol{\theta}_{\text{init}} = \boldsymbol{\theta}_0$ (randomly initialized) at every iteration, since trees do not naturally support incremental refinement. The loss $\ell$ is squared error. This method demonstrated that batch reinforcement learning could work with complex function approximators on continuous-state problems.
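A hedged sketch of this tree-based instantiation, using scikit-learn's ExtraTreesRegressor as the function approximator, is given below; the toy one-dimensional environment, the reward model, and all hyperparameters are illustrative assumptions rather than the experimental setup of Ernst et al. (2005).

import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)
N, n_actions, gamma = 2000, 3, 0.95

# Fixed offline dataset D = {(s_i, a_i, r_i, s'_i)} on a toy 1-D state space.
S = rng.uniform(-1, 1, size=(N, 1))
A = rng.integers(n_actions, size=N)
R = -np.abs(S[:, 0]) + 0.1 * rng.normal(size=N)
S_next = np.clip(S[:, 0] + 0.1 * (A - 1) + 0.05 * rng.normal(size=N), -1, 1)[:, None]

X = np.column_stack([S, A])                     # regression inputs (s, a)

q_model = None
for n in range(20):                             # outer loop: value iteration
    if q_model is None:
        y = R.copy()                            # first iteration: y = r
    else:
        # Evaluate max_a' q(s', a') with the current ensemble.
        q_next = np.column_stack([
            q_model.predict(np.column_stack([S_next, np.full(N, a)]))
            for a in range(n_actions)
        ])
        y = R + gamma * q_next.max(axis=1)
    # Cold start: a fresh ensemble is grown to completion at every iteration.
    q_model = ExtraTreesRegressor(n_estimators=50, random_state=0)
    q_model.fit(X, y)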

Neural Fitted Q-Iteration (NFQI): Riedmiller (2005) replaces the tree ensemble with a neural network $q(s,a; \boldsymbol{\theta})$, providing smooth interpolation across the state-action space. The fit operation runs gradient-based optimization (RProp, chosen for its insensitivity to hyperparameter choices) to convergence: the network is trained until the loss stops decreasing (multiple epochs through the full dataset $\mathcal{D}$), corresponding to $K=\infty$ in our framework. NFQI uses warm-start initialization: $\boldsymbol{\theta}_{\text{init}} = \boldsymbol{\theta}_n$ at iteration $n$, meaning the network continues learning from the previous iteration's weights rather than resetting. This ensures the network accurately represents the projected Bellman operator before moving to the next outer iteration. For episodic tasks with goal and forbidden regions, Riedmiller uses modified target computations (detailed below).

Remark 4 (Goal State Heuristics in NFQI)

For episodic tasks with goal states $S^+$ and forbidden states $S^-$, Riedmiller modifies the target structure:

$$
y_i^{(n)} = \begin{cases} c(s_i, a_i, s'_i) & \text{if } s'_i \in S^+ \text{ (goal reached)} \\ C^- & \text{if } s'_i \in S^- \text{ (forbidden state, typically } C^- = 1.0\text{)} \\ c(s_i, a_i, s'_i) + \gamma \max_{a'} q(s'_i, a'; \boldsymbol{\theta}_n) & \text{otherwise} \end{cases}
$$

where $c(s, a, s')$ is the immediate cost. Goal states have zero future cost (no bootstrapping), forbidden states receive a high penalty, and regular states use the standard Bellman backup. Additionally, the hint-to-goal heuristic adds synthetic transitions $(s, a, s')$ with $s \in S^+$ and target value $c(s,a,s') = 0$ to explicitly clamp the Q-function to zero in the goal region. This stabilizes learning by encoding the boundary condition without requiring additional prior knowledge.

From Nested to Flattened Q-Iteration

Fitted Q-iteration has an inherently nested structure: an outer loop performs approximate value iteration by computing Bellman targets, and an inner loop performs regression by fitting the function approximator to those targets. This nested structure shows that FQI is approximate dynamic programming with function approximation, distinct from supervised learning with changing targets.

When the inner loop uses gradient descent for $K$ steps, we have:

$$
\texttt{fit}(\mathcal{D}_n^{\text{fit}}, \boldsymbol{\theta}_n, K) = \boldsymbol{\theta}_n - \alpha \sum_{k=0}^{K-1} \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_n^{(k)}; \mathcal{D}_n^{\text{fit}})
$$

This is a sequence of updates $\boldsymbol{\theta}_n^{(k+1)} = \boldsymbol{\theta}_n^{(k)} - \alpha \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_n^{(k)}; \mathcal{D}_n^{\text{fit}})$ for $k = 0, \ldots, K-1$, starting from $\boldsymbol{\theta}_n^{(0)} = \boldsymbol{\theta}_n$. Since these inner updates are themselves a loop, we can algebraically rewrite the nested loops as a single flattened loop. This flattening is purely representational: the algorithm remains approximate value iteration, but the presentation obscures the conceptual structure.

In the flattened form, the parameters used for computing targets are called the target network $\boldsymbol{\theta}_{\text{target}}$, which corresponds to $\boldsymbol{\theta}_n$ in the nested form. The target network is updated every $K$ steps, marking the boundaries between outer iterations.

In contrast, the parameters being actively trained via gradient descent at each step are called the online network $\boldsymbol{\theta}_t$. In the nested view, these correspond to $\boldsymbol{\theta}_n^{(k)}$ within the inner loop. The term "online" here refers to the fact that these parameters are continuously updated at each gradient step, not that data must be collected online. The distinction is between the actively training parameters ($\boldsymbol{\theta}_t$) and the frozen parameters used for target computation ($\boldsymbol{\theta}_{\text{target}}$).

The online/target terminology is standard in the deep RL community. Many modern algorithms, especially those that collect data online like DQN, are presented in flattened form. This can make them appear different from batch methods when they are the same template with different design choices.

Tree ensemble methods like random forests or extra trees have no continuous parameter space and no gradient-based optimization. The fit operation builds the entire tree structure in one pass, so there is no sequence of incremental updates to collapse into a single loop. Ernst's FQI (Ernst et al., 2005) therefore retains the explicit nested structure with cold-start initialization at each outer iteration, while neural methods can be flattened.

Making the Nested Structure Explicit

To see how flattening works, we first make the nested structure completely explicit by expanding the fit operation to show the inner gradient descent loop. In terms of the buffer notation, the inner loop approximately minimizes the empirical risk

$$
\mathbb{E}_{((s,a),y)\sim \hat{P}_n^{\text{fit}}}[\ell(q(s,a;\boldsymbol{\theta}), y)]
$$

induced by the fixed buffer $\mathcal{B}_n = \mathcal{D}$ and the target function $g(\cdot; \boldsymbol{\theta}_n)$. Starting from the generic batch FQI template (Algorithm 1), we replace the abstract fit call with explicit gradient updates:

This makes the two-level structure completely transparent. The outer loop (indexed by $n$) computes targets $y_{s,a} = r + \gamma \max_{a'} q(s',a'; \boldsymbol{\theta}_n)$ using the parameters $\boldsymbol{\theta}_n$ from the end of the previous outer iteration. These targets remain fixed throughout the entire inner loop. The inner loop (indexed by $k$) performs $K$ gradient steps to fit $q(s,a; \boldsymbol{\theta})$ to the regression dataset $\mathcal{D}_n^{\text{fit}} = \{((s_i, a_i), y_i)\}$, warm starting from $\boldsymbol{\theta}_n$ (the parameters that computed the targets). After $K$ steps, the inner loop produces $\boldsymbol{\theta}_{n+1} = \boldsymbol{\theta}_n^{(K)}$, which becomes the starting point for the next outer iteration.

The notation $\boldsymbol{\theta}_n^{(k)}$ indicates that we are at inner step $k$ within outer iteration $n$. The targets depend only on $\boldsymbol{\theta}_n = \boldsymbol{\theta}_n^{(0)}$, not on the intermediate inner iterates $\boldsymbol{\theta}_n^{(k)}$ for $k > 0$.

Flattening into a Single Loop

We can now flatten the nested structure by treating all gradient steps uniformly and using a global step counter $t$ instead of separate outer/inner counters. We introduce a target network $\boldsymbol{\theta}_{\text{target}}$ that holds the parameters used for computing targets. This target network is updated every $K$ steps, which marks what would have been the boundary between outer iterations. The transformation works as follows: we replace the outer counter $n$ and the inner counter $k$ with a single counter $t = nK + k$. Advancing from outer iteration $n$ to $n+1$ corresponds to $K$ steps of $t$: from $t = nK$ to $t = (n+1)K$. The target network $\boldsymbol{\theta}_{\text{target}}$ equals $\boldsymbol{\theta}_n$ throughout outer iteration $n$ and is updated via $\boldsymbol{\theta}_{\text{target}} \leftarrow \boldsymbol{\theta}_t$ every $K$ steps (when $t \bmod K = 0$). Parameters that were $\boldsymbol{\theta}_n^{(k)}$ in nested form become $\boldsymbol{\theta}_t$ in flattened form.

This transformation is purely algebraic. No algorithmic behavior changes, only the presentation:

At step $t$, we have $n = \lfloor t/K \rfloor$ (outer iteration) and $k = t \bmod K$ (position within the inner loop). The target network $\boldsymbol{\theta}_{\text{target}}$ equals $\boldsymbol{\theta}_n$ throughout the $K$ steps from $t = nK$ to $t = (n+1)K - 1$, then is updated to $\boldsymbol{\theta}_{n+1}$ at $t = (n+1)K$. The parameters $\boldsymbol{\theta}_t$ correspond to $\boldsymbol{\theta}_n^{(k)}$ in the nested form. The flattening reindexes the iteration structure: with $K=10$, outer iteration $n=3$, inner step $k=7$ becomes global step $t = 3 \cdot 10 + 7 = 37$.
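The reindexing is nothing more than integer division and remainder; the short, purely illustrative check below makes the correspondence explicit.

K = 10
for t in (0, 7, 37, 40):
    n, k = divmod(t, K)                  # outer iteration and inner step
    print(f"t = {t:2d}  ->  n = {n}, k = {k}, last target sync at t = {n * K}")
assert divmod(3 * K + 7, K) == (3, 7)    # K=10, n=3, k=7 corresponds to t=37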

Flattening replaces the pair $(n,k)$ with a single global step index $t$, but the underlying empirical distribution $\hat{P}_{\mathcal{D}}$ remains the same: we still sample from the fixed offline dataset throughout.

The target network arises directly from flattening the nested FQI structure. When DQN is presented with a target network that updates every $K$ steps, this is approximate value iteration in flattened form. The algorithm still performs an outer loop (value iteration) and an inner loop (regression), but the presentation obscures this structure. The periodic target updates mark the boundaries between outer iterations. DQN is batch approximate DP in flattened form, using online data collection with a replay buffer.

Smooth Target Updates via Exponential Moving Average

An alternative to periodic hard updates is the exponential moving average (EMA), also called Polyak averaging, which updates the target network smoothly at every step rather than synchronizing it every $K$ steps: $\boldsymbol{\theta}_{\text{target}} \leftarrow (1-\tau)\,\boldsymbol{\theta}_{\text{target}} + \tau\,\boldsymbol{\theta}_t$.

With EMA updates, the target network slowly tracks the online network instead of making discrete jumps. For small $\tau$ (typically $\tau \in [0.001, 0.01]$), the target lags behind the online network by roughly $1/\tau$ steps. This provides smoother learning dynamics and avoids the discontinuous changes in targets that occur with periodic hard updates. The EMA approach became popular with DDPG (Lillicrap et al., 2015) for continuous control and is now standard in algorithms like TD3 (Fujimoto et al., 2018) and SAC (Haarnoja et al., 2018).
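The two update rules are compared below on parameter vectors represented as plain NumPy arrays; this is a hedged sketch, not tied to any particular library's API.

import numpy as np

def hard_update(theta_target, theta, t, K=10_000):
    """Periodic sync: copy the online parameters every K gradient steps."""
    return theta.copy() if t % K == 0 else theta_target

def ema_update(theta_target, theta, tau=0.01):
    """Polyak averaging: the target slowly tracks the online network."""
    return (1.0 - tau) * theta_target + tau * theta

theta = np.ones(3)
theta_target = np.zeros(3)
for _ in range(200):
    theta_target = ema_update(theta_target, theta, tau=0.01)
print(theta_target)   # each entry equals 1 - (1 - tau)**200, roughly 0.87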

Online Algorithms: DQN and Extensions

We now keep the same fitted Q-iteration template but allow the buffer $\mathcal{B}_t$ and its empirical distribution $\hat{P}_{\mathcal{B}_t}$ to evolve while we learn. Instead of repeatedly sampling from a fixed $\hat{P}_{\mathcal{D}}$, we collect new transitions during learning and store them in a circular replay buffer.

Note that "online" here takes on a second meaning: we collect data online (interacting with the environment) while also maintaining the distinction, inherited from the flattened FQI structure, between the online network (actively updated parameters $\boldsymbol{\theta}_t$) and the target network (frozen parameters $\boldsymbol{\theta}_{\text{target}}$). The online network now plays a dual role: it is both the set of parameters being trained and the policy used to collect new data for the buffer.

Deep Q-Network (DQN) instantiates the online template with moderate choices along the design axes: buffer capacity $B \approx 10^6$, mini-batch size $b \approx 32$, and target network update frequency $K \approx 10^4$. DQN is not an ad hoc collection of tricks: it is fitted Q-iteration in flattened form (as developed in the nested-to-flattened transformation earlier) with an evolving buffer $\mathcal{B}_t$ and a low replay ratio.

Deep Q-Network (DQN)

DQN maintains a circular replay buffer of capacity $B$ (typically $B \approx 10^6$). At each environment step, we store the new transition and sample a mini-batch of size $b$ from the buffer for training. This increases the replay ratio, reducing gradient variance at the cost of older, potentially off-policy data.

DQN uses two copies of the Q-network parameters: the online network $\boldsymbol{\theta}_t$, updated at every gradient step, and the target network $\boldsymbol{\theta}_{\text{target}}$, held fixed between periodic synchronizations.

As shown in the nested-to-flattened section, the target network is not a stabilization trick but the natural consequence of periodic outer-iteration boundaries in flattened FQI. The target network keeps targets fixed for $K$ gradient steps, which corresponds to the inner loop in the nested view.
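A self-contained sketch of this flattened loop on a toy chain environment appears below; the environment, the epsilon-greedy exploration constant, the tabular Q-table standing in for the network, and all hyperparameters are illustrative assumptions, not the published DQN configuration.

import numpy as np
from collections import deque

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 8, 2, 0.95
K, B, b, eps, lr = 50, 1000, 32, 0.1, 0.1

def env_step(s, a):
    """Toy chain: action 1 moves right, action 0 moves left; reward 1 at the right end."""
    s2 = min(max(s + (1 if a == 1 else -1), 0), n_states - 1)
    return s2, float(s2 == n_states - 1)

Q = np.zeros((n_states, n_actions))      # online parameters theta_t
Q_target = Q.copy()                      # target parameters theta_target
buffer = deque(maxlen=B)                 # circular replay buffer B_t
s = 0
for t in range(5000):
    if t % K == 0:
        Q_target = Q.copy()              # outer-iteration boundary: hard sync
    a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
    s2, r = env_step(s, a)
    buffer.append((s, a, r, s2))         # the buffer (and hence its empirical law) evolves
    s = 0 if r > 0 else s2               # restart the episode at the goal
    if len(buffer) >= b:
        idx = rng.integers(len(buffer), size=b)
        for si, ai, ri, s2i in (buffer[int(i)] for i in idx):
            done = (s2i == n_states - 1)
            y = ri + (0.0 if done else gamma * Q_target[s2i].max())   # target uses theta_target
            Q[si, ai] -= lr * (Q[si, ai] - y)                          # one SGD step per sample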

Double Deep Q-Network (Double DQN)

Double DQN addresses overestimation bias by decoupling action selection from action evaluation. Instead of using the target network for both purposes (as DQN does), Double DQN uses the online network (the currently training parameters $\boldsymbol{\theta}_t$) to select which action looks best, then uses the target network (the frozen parameters $\boldsymbol{\theta}_{\text{target}}$) to evaluate that action.

Double DQN leaves $\mathcal{B}_t$ and $\hat{P}_{\mathcal{B}_t}$ unchanged and modifies only the target function $g$ by decoupling action selection (online network) from evaluation (target network). Compare the two targets:

$$
y^{\text{DQN}} = r + \gamma \max_{a'} q(s', a'; \boldsymbol{\theta}_{\text{target}}), \qquad
y^{\text{Double DQN}} = r + \gamma\, q\big(s', \arg\max_{a'} q(s', a'; \boldsymbol{\theta}_t);\ \boldsymbol{\theta}_{\text{target}}\big)
$$

This decoupling mitigates overestimation bias because the noise in action selection (from the online network) is independent of the noise in action evaluation (from the target network). Recall that the online network holds the currently training parameters updated at each gradient step, while the target network holds the frozen parameters used for computing targets. By using different parameters for selection versus evaluation, we break the correlation that leads to systematic overestimation under the standard max operator.
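A hedged NumPy sketch of the two target functions for a mini-batch of transitions is shown below; Q_online and Q_target are assumed to be arrays indexed as Q[state, action], tabular stand-ins for the two networks.

import numpy as np

def dqn_targets(R, S_next, Q_target, gamma=0.99):
    # Selection and evaluation both use the target network.
    return R + gamma * Q_target[S_next].max(axis=1)

def double_dqn_targets(R, S_next, Q_online, Q_target, gamma=0.99):
    # Selection uses the online network; evaluation uses the target network.
    a_star = Q_online[S_next].argmax(axis=1)
    return R + gamma * Q_target[S_next, a_star]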

Q-Learning: The Limiting Case

DQN and Double DQN use moderate settings: $B \approx 10^6$, $K \approx 10^4$, mini-batch size $b \approx 32$. We can ask: what happens at the extreme where we minimize buffer capacity, eliminate replay, and update targets at every step? This limiting case gives classical Q-learning (Watkins, 1989): buffer capacity $B=1$, target network frequency $K=1$, and mini-batch size $b=1$.

In the buffer perspective, Q-learning has $\mathcal{B}_t = \{(s_t, a_t, r_t, s'_t)\}$ with $|\mathcal{B}_t| = 1$. The empirical distribution $\hat{P}_{\mathcal{B}_t}$ collapses to a Dirac mass at the current transition. We use each transition exactly once and then discard it. There is no separate target network: the parameters used for computing targets are updated immediately after each step ($K=1$, or equivalently $\tau=1$ in EMA). This makes Q-learning a stochastic approximation method with replay ratio 1.

Stochastic approximation is a general framework for solving equations of the form $\mathbb{E}[h(X; \boldsymbol{\theta})] = 0$ using noisy samples, without computing expectations explicitly. The classic example is root-finding: given a function $f(\boldsymbol{\theta}; Z)$ with expectation $\mathbb{E}[f(\boldsymbol{\theta}; Z)] = F(\boldsymbol{\theta})$, we want to solve $F(\boldsymbol{\theta}) = 0$. The Robbins-Monro procedure updates:

$$
\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \alpha_t f(\boldsymbol{\theta}_t; Z_t)
$$

using only the noisy samples $Z_t$, without ever computing $F(\boldsymbol{\theta})$ or its Jacobian. This is analogous to Newton's method in the deterministic case, but it replaces exact evaluations with stochastic estimates and avoids computing or inverting the Jacobian. Under diminishing step sizes ($\alpha_t \to 0$, $\sum_t \alpha_t = \infty$, $\sum_t \alpha_t^2 < \infty$), the iterates converge to solutions of $F(\boldsymbol{\theta}) = 0$.
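As a minimal illustration (a toy of our own, not from the chapter), the Robbins-Monro recursion below estimates the mean of a random variable by solving $F(\theta) = \theta - \mathbb{E}[Z] = 0$ from one noisy sample per iteration.

import numpy as np

rng = np.random.default_rng(0)
theta = 0.0
for t in range(1, 10_001):
    Z = rng.normal(loc=3.0, scale=2.0)       # noisy sample Z_t
    alpha = 1.0 / t                          # diminishing step sizes
    theta = theta - alpha * (theta - Z)      # theta_{t+1} = theta_t - alpha_t f(theta_t; Z_t)
print(theta)                                 # converges to the true mean 3.0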

Q-learning fits this framework by solving the Bellman residual equation. Recall from the projection methods chapter that the Bellman equation $q^* = \mathrm{L} q^*$ can be written as a residual equation $\mathcal{N}(q) \equiv \mathrm{L} q - q = 0$. For a parameterized Q-function $q(s,a; \boldsymbol{\theta})$, the residual at an observed transition $(s,a,r,s')$ is:

$$
R(s,a,r,s'; \boldsymbol{\theta}) = r + \gamma \max_{a'} q(s',a'; \boldsymbol{\theta}) - q(s,a; \boldsymbol{\theta})
$$

This is the TD error. Q-learning is a stochastic approximation method for solving $\mathbb{E}_{\tau \sim P}[R(\tau; \boldsymbol{\theta})] = 0$, where $P$ is the distribution of transitions under the behavior policy. Each observed transition provides a noisy sample of the residual, and the algorithm updates parameters using the gradient of the squared residual without computing expectations explicitly.

The algorithm works with any function approximator that supports incremental updates. The general form applies a gradient descent step to minimize the squared TD error. The only difference between linear and nonlinear cases is how the gradient $\nabla_{\boldsymbol{\theta}} q(s,a; \boldsymbol{\theta})$ is computed.

The gradient in line 5 depends on the choice of function approximator: for a linear architecture $q(s,a;\boldsymbol{\theta}) = \boldsymbol{\phi}(s,a)^\top \boldsymbol{\theta}$ it is simply the feature vector $\boldsymbol{\phi}(s,a)$, while for a neural network it is computed by backpropagation.
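A hedged sketch of a single update for the linear case, $q(s,a;\boldsymbol{\theta}) = \boldsymbol{\phi}(s,a)^\top \boldsymbol{\theta}$, is given below; the one-hot feature map, step size, and action set are illustrative assumptions.

import numpy as np

def q_learning_update(theta, phi, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One semi-gradient Q-learning step on a single transition."""
    q_sa = phi(s, a) @ theta
    q_next = max(phi(s_next, b) @ theta for b in actions)
    td_error = r + gamma * q_next - q_sa
    return theta + alpha * td_error * phi(s, a)    # gradient of q wrt theta is phi(s, a)

# Illustrative one-hot features over a small discrete state-action space.
n_states, n_actions = 4, 2
def phi(s, a):
    x = np.zeros(n_states * n_actions)
    x[s * n_actions + a] = 1.0
    return x

theta = np.zeros(n_states * n_actions)
theta = q_learning_update(theta, phi, s=0, a=1, r=1.0, s_next=2, actions=range(n_actions))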

Convergence Analysis via the ODE Method

Tabular Q-learning converges under diminishing step sizes ($\alpha_t \to 0$, $\sum_t \alpha_t = \infty$, $\sum_t \alpha_t^2 < \infty$) and sufficient exploration (Watkins & Dayan, 1992; Jaakkola et al., 1994; Tsitsiklis, 1994). The standard proof technique is the ordinary differential equation (ODE) method (Kushner & Yin, 2003; Borkar & Meyn, 2000), which analyzes stochastic approximation algorithms by studying an associated deterministic dynamical system.

The ODE method connects stochastic approximation to deterministic dynamics through a two-step argument. First, we show that as step sizes shrink, the discrete stochastic recursion $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha_t h(\boldsymbol{\theta}_t, Z_t)$ tracks the continuous deterministic ODE $\frac{d\boldsymbol{\theta}}{dt} = \bar{h}(\boldsymbol{\theta})$, where $\bar{h}(\boldsymbol{\theta}) = \mathbb{E}[h(\boldsymbol{\theta}, Z)]$ is the expected update direction. Second, we prove convergence of the ODE using Lyapunov functions or other dynamical systems tools. If the ODE converges to an equilibrium, the stochastic algorithm converges to the same point.

For tabular Q-learning, the update at state-action pair $(s,a)$ is:

$$
Q_{t+1}(s,a) = Q_t(s,a) + \alpha_t \big[ r + \gamma \max_{a'} Q_t(s',a') - Q_t(s,a) \big]
$$

where $(s,a,r,s')$ is sampled from the transition distribution $P$ induced by the behavior policy. The expected update direction is:

$$
\bar{h}(Q)(s,a) = \mathbb{E}_{(s,a,r,s') \sim P} \big[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \mid (s,a) \text{ visited} \big]
$$

Under sufficient exploration (all state-action pairs visited infinitely often), this expectation is well defined. Let $\xi(s,a)$ denote the stationary distribution over state-action pairs under the behavior policy. The ODE becomes:

$$
\frac{dQ(s,a)}{dt} = \sum_{s'} p(s' \mid s,a) \big[r(s,a,s') + \gamma \max_{a'} Q(s',a') \big] - Q(s,a)
$$

This is a deterministic dynamical system. The fixed points satisfy $Q(s,a) = \mathbb{E}_{s'}[r(s,a,s') + \gamma \max_{a'} Q(s',a')]$, which is the Bellman optimality equation. Convergence is established by showing that this ODE has a global Lyapunov function: the distance $\|Q - Q^*\|$ to the optimal Q-function decreases along trajectories. The Bellman operator is a contraction, which ensures that the ODE trajectories converge to $Q^*$.

The interaction between the stationary distribution $\xi$ and the empirical distribution $\hat{P}_{\mathcal{B}_t}$ is subtle. In pure stochastic approximation (Q-learning), we have $\mathcal{B}_t = \{(s_t, a_t, r_t, s'_t)\}$ with a single transition. Over time, as we explore, the sequence of visited state-action pairs $(s_t, a_t)$ follows the behavior policy, and under ergodicity, the empirical frequency with which we update each $(s,a)$ converges to the stationary distribution $\xi(s,a)$. The ODE method formalizes this: the expected update $\bar{h}(Q)$ averages over $P$, which implicitly weights by how often we visit each $(s,a)$ under the stationary distribution.

For linear Q-learning with function approximation, the ODE analysis becomes more complex. The projection methods chapter shows that non-monotone projections (like linear least squares) can fail to preserve contraction properties. The max operator in Q-learning creates additional complications that prevent general convergence guarantees, even though the algorithm may work in practice for well-chosen features.

Regression Losses and Noise Models

Fix a particular time $t$ and buffer contents $\mathcal{B}_t$. Sampling from $\mathcal{B}_t$ and pushing transitions through the target function $g(\cdot; \boldsymbol{\theta}_t)$ gives a regression distribution $\hat{P}_t^{\text{fit}}$ over pairs $((s,a), y)$. The fit operation in the inner loop is then a standard statistical estimation problem: given empirical samples from $\hat{P}_t^{\text{fit}}$, choose parameters $\boldsymbol{\theta}$ to minimize a loss:

$$
\mathcal{L}(\boldsymbol{\theta}; \mathcal{B}_t) \approx \mathbb{E}_{((s,a),y)\sim \hat{P}_t^{\text{fit}}} \big[\ell(q(s,a;\boldsymbol{\theta}), y)\big]
$$

The choice of loss $\ell$ implicitly specifies a noise model for the targets $y$ under $\hat{P}_t^{\text{fit}}$. Squared error corresponds to Gaussian noise, absolute error to Laplace noise, Gumbel regression to extreme-value noise, and classification losses to non-parametric noise models over value bins. Because Bellman targets are noisy, biased, and bootstrapped, this choice has a direct impact on how the algorithm interprets the empirical distribution $\hat{P}_t^{\text{fit}}$.

This section examines alternative loss functions for the regression step. The standard approach uses squared error, but the noise in Bellman targets has special structure due to the max operator and bootstrapping. Two strategies have shown empirical success: Gumbel regression, which uses the proper likelihood for extreme-value noise, and classification-based methods, which avoid parametric noise assumptions by working with distributions over value bins.

Gumbel Regression

Extreme value theory tells us that the maximum of Gaussian errors has Gumbel-distributed tails. If we take this distribution seriously, maximum likelihood estimation should use a Gumbel likelihood rather than a Gaussian one. Garg et al. (2023) developed this idea in Extreme Q-Learning (XQL). Instead of modeling the Bellman error as additive Gaussian noise:

$$
y_i = q^*(s_i, a_i) + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)
$$

they model it as Gumbel noise:

$$
y_i = q^*(s_i, a_i) + \varepsilon_i, \quad \varepsilon_i \sim -\text{Gumbel}(0, \beta)
$$

The negative Gumbel distribution arises because we are modeling errors in targets that overestimate the true value. The corresponding maximum likelihood loss is Gumbel regression:

$$
\mathcal{L}_{\text{Gumbel}}(\boldsymbol{\theta}) = \sum_i \left[\exp\left(\frac{q(s_i, a_i; \boldsymbol{\theta}) - y_i}{\beta}\right) - \frac{q(s_i, a_i; \boldsymbol{\theta}) - y_i}{\beta}\right]
$$

The temperature parameter $\beta$ controls the heaviness of the tail. The score function (gradient with respect to $q$) is:

$$
\frac{\partial \mathcal{L}_{\text{Gumbel}}}{\partial q} = \frac{1}{\beta}\left[\exp\left(\frac{q - y}{\beta}\right) - 1\right]
$$

When $q < y$ (underestimation), the exponential term is small and the gradient stays mild, with magnitude at most $1/\beta$. When $q > y$ (overestimation), the gradient grows exponentially with the error. This asymmetry deliberately penalizes overestimation more heavily than underestimation. When targets are systematically biased upward by the max operator, this loss geometry pushes the estimates toward conservative Q-values.

The Gumbel loss can be understood as the natural likelihood for problems involving max operators, just as the Gaussian is the natural likelihood for problems involving averages. The central limit theorem tells us that sums converge to Gaussians; extreme value theory tells us that maxima converge to Gumbel (for light-tailed base distributions). Squared error is optimal for Gaussian noise; Gumbel regression is optimal for Gumbel noise.

#  label: fig-gumbel-loss
#  caption: Gumbel regression reshapes the loss landscape (left) and gradient geometry (right), penalizing overestimation exponentially while keeping underestimation gentle.

%config InlineBackend.figure_format = 'retina'
import numpy as np
import matplotlib.pyplot as plt

# Apply book style
try:
    import scienceplots
    plt.style.use(['science', 'notebook'])
except (ImportError, OSError):
    pass  # Use matplotlib defaults

# Set up the figure with two subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Define error range (q - y)
error = np.linspace(-3, 3, 500)

# Different beta values to show control over asymmetry
betas = [0.5, 1.0, 2.0]
colors = ['#1f77b4', '#ff7f0e', '#2ca02c']
labels = [r'$\beta = 0.5$ (aggressive)', r'$\beta = 1.0$ (moderate)', r'$\beta = 2.0$ (mild)']

# Plot 1: Gumbel loss function
for beta, color, label in zip(betas, colors, labels):
    loss = np.exp(error / beta) - error / beta
    ax1.plot(error, loss, color=color, linewidth=2, label=label)

# Add reference: squared error for comparison
squared_error = error**2
ax1.plot(error, squared_error, 'k--', linewidth=1.5, alpha=0.5, label='Squared error')

ax1.axvline(x=0, color='gray', linestyle=':', alpha=0.5)
ax1.axhline(y=0, color='gray', linestyle=':', alpha=0.5)
ax1.set_xlabel(r'Error: $q - y$', fontsize=11)
ax1.set_ylabel('Loss', fontsize=11)
ax1.set_title('Gumbel Loss Function', fontsize=12, fontweight='bold')
ax1.legend(fontsize=9)
ax1.grid(True, alpha=0.3)
ax1.set_ylim([0, 10])

# Annotate regions
ax1.text(-2, 7, 'Underestimation\n' + r'$(q < y)$', 
         fontsize=9, ha='center', color='darkred', alpha=0.7)
ax1.text(2, 7, 'Overestimation\n' + r'$(q > y)$', 
         fontsize=9, ha='center', color='darkblue', alpha=0.7)

# Plot 2: Gradient (score function) - showing asymmetry
for beta, color, label in zip(betas, colors, labels):
    gradient = (1/beta) * (np.exp(error / beta) - 1)
    ax2.plot(error, gradient, color=color, linewidth=2, label=label)

# Add reference: gradient of squared error
squared_error_grad = 2 * error
ax2.plot(error, squared_error_grad, 'k--', linewidth=1.5, alpha=0.5, label='Squared error')

ax2.axvline(x=0, color='gray', linestyle=':', alpha=0.5)
ax2.axhline(y=0, color='gray', linestyle=':', alpha=0.5)
ax2.set_xlabel(r'Error: $q - y$', fontsize=11)
ax2.set_ylabel('Gradient magnitude', fontsize=11)
ax2.set_title('Gumbel Loss Gradient (Score Function)', fontsize=12, fontweight='bold')
ax2.legend(fontsize=9)
ax2.grid(True, alpha=0.3)
ax2.set_ylim([-5, 10])

# Annotate asymmetry
ax2.text(-2, -3, 'Mild gradient\n(tolerates underestimation)', 
         fontsize=9, ha='center', color='darkred', alpha=0.7)
ax2.text(2, 7, 'Steep gradient\n(penalizes overestimation)', 
         fontsize=9, ha='center', color='darkblue', alpha=0.7)

plt.tight_layout()

The left panel shows the Gumbel loss as a function of the error $q - y$. Notice the asymmetry: the loss grows exponentially for positive errors (overestimation) but increases only linearly for negative errors (underestimation). The parameter $\beta$ controls the degree of this asymmetry: smaller $\beta$ values create more aggressive penalization of overestimation.

The right panel shows the gradient (score function), which determines how strongly the optimizer pushes back against errors. For overestimation ($q > y$), the gradient grows exponentially, creating strong corrective pressure. For underestimation ($q < y$), the gradient remains relatively flat, with magnitude approaching $1/\beta$. This is in stark contrast to squared error (dashed line), which treats over- and underestimation symmetrically. By tuning $\beta$, we can calibrate how aggressively the loss combats the overestimation bias induced by the max operator in the Bellman target.
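A quick numerical check (an illustrative toy, not from XQL) confirms the conservative pull of this loss geometry: over a set of noisy targets, the Gumbel-regression minimizer sits below the sample mean, at the soft-minimum of the targets.

import numpy as np

rng = np.random.default_rng(0)
y = 1.0 + 0.5 * rng.normal(size=1000)              # noisy targets
beta = 0.5

qs = np.linspace(-1.0, 3.0, 2001)
loss = np.array([np.mean(np.exp((q - y) / beta) - (q - y) / beta) for q in qs])
q_hat = qs[loss.argmin()]

print(f"sample mean = {y.mean():.3f}, Gumbel-loss minimizer = {q_hat:.3f}")
# Closed-form minimizer: the soft-min of the targets, -beta * log(mean(exp(-y / beta))).
print(-beta * np.log(np.mean(np.exp(-y / beta))))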

Remark 5 (XQL vs Soft Q-Learning: Target Function vs Loss)

XQL (Gumbel loss) and Soft Q-learning both involve Gumbel distributions, but they operate at different levels of the FQI template. Recall our unified framework: buffer $\mathcal{B}_t$, target function $g$, loss $\ell$, optimization budget $K$.

Soft Q-learning changes the target function by using the smooth Bellman operator from regularized MDPs:

$$
g^{\text{soft}}(s,a,r,s'; \boldsymbol{\theta}) = r + \gamma \frac{1}{\beta}\log\sum_{a'} \exp(\beta q(s',a'; \boldsymbol{\theta}))
$$

This replaces the hard max with logsumexp for entropy regularization. The loss remains standard: $\ell(q,y) = (q-y)^2$. The Gumbel distribution appears through the Gumbel-max trick (logsumexp as a soft maximum), but this is about making the optimal policy stochastic, not about the noise structure in TD errors.

XQL keeps the hard max target function:

$$
g^{\text{hard}}(s,a,r,s'; \boldsymbol{\theta}) = r + \gamma \max_{a'} q(s',a'; \boldsymbol{\theta})
$$

and instead changes the loss to Gumbel regression: $\ell_{\text{Gumbel}}(q,y) = \exp\!\big(\tfrac{q-y}{\beta}\big) - \tfrac{q-y}{\beta}$. The Gumbel distribution here models the noise in the TD errors themselves, not the policy.

The asymmetric penalization in XQL (exponential penalty for overestimation) comes from the score function of the Gumbel likelihood, which is designed to handle extreme-value noise. Soft Q-learning uses symmetric L2 loss, so it does not preferentially penalize overestimation. The logsumexp smoothing in Soft Q reduces maximization bias by averaging over actions, but this is a property of the target function, not the loss geometry.

Both can be combined: Soft Q-learning with Gumbel loss would change both the target function (logsumexp) and the loss (asymmetric penalization).

The practical advantage of XQL is that the loss function itself handles the asymmetric error structure through its score function. We do not need to estimate variances or compute weighted averages. XQL has shown improvements in both value-based and actor-critic methods, particularly in offline reinforcement learning where the max-operator bias compounds across iterations without corrective exploration.

Classification-Based Q-Learning

From the buffer viewpoint, nothing changes upstream: we still sample transitions from $\hat{P}_{\mathcal{B}_t}$ and apply the same target function $g(\cdot; \boldsymbol{\theta}_t)$. Classification-based Q-learning changes only the loss $\ell$ and the target representation. Instead of regressing on a scalar $y \in \mathbb{R}$ with an L2 loss, we represent values as categorical distributions over bins and use a cross-entropy loss.

Choose a finite grid $z_1 < z_2 < \cdots < z_K$ spanning plausible return values. The network outputs logits $\ell_{\boldsymbol{\theta}}(s,a) \in \mathbb{R}^K$ converted to probabilities:

$$
p_{\boldsymbol{\theta}}(k \mid s,a) = \frac{\exp(\ell_{\boldsymbol{\theta}}(s,a)_k)}{\sum_{j=1}^K \exp(\ell_{\boldsymbol{\theta}}(s,a)_j)}, \qquad q(s,a; \boldsymbol{\theta}) = \sum_{k=1}^K z_k \, p_{\boldsymbol{\theta}}(k \mid s,a)
$$

Each scalar TD target $y_i$ is converted to a target distribution via two-hot encoding. If $y_i$ falls between bins $z_j$ and $z_{j+1}$:

$$
q_j(y_i) = \frac{z_{j+1} - y_i}{z_{j+1} - z_j}, \qquad q_{j+1}(y_i) = \frac{y_i - z_j}{z_{j+1} - z_j}, \qquad q_k(y_i) = 0 \text{ for } k \notin \{j, j+1\}
$$

This is barycentric interpolation: $\sum_k z_k q_k(y_i) = y_i$ recovers the scalar exactly, placing the two-hot encoding within the same framework as linear interpolation in the dynamic programming chapter (Algorithm 2).

#  label: fig-two-hot
#  caption: Two-hot encoding as barycentric interpolation: the left panel shows a single target spread over two bins, while the right panel compares encodings for multiple targets.

%config InlineBackend.figure_format = 'retina'
import numpy as np
import matplotlib.pyplot as plt

# Apply book style
try:
    import scienceplots
    plt.style.use(['science', 'notebook'])
except (ImportError, OSError):
    pass  # Use matplotlib defaults

def two_hot_encode(y, z_bins):
    """Compute two-hot encoding weights for target value y on grid z_bins."""
    K = len(z_bins)
    q = np.zeros(K)
    
    # Clip to grid boundaries
    if y <= z_bins[0]:
        q[0] = 1.0
        return q, 0
    if y >= z_bins[-1]:
        q[-1] = 1.0
        return q, K - 1
    
    # Find bins j, j+1 such that z_j <= y < z_{j+1}
    j = np.searchsorted(z_bins, y, side='right') - 1
    
    # Barycentric coordinates (linear interpolation)
    q[j+1] = (y - z_bins[j]) / (z_bins[j+1] - z_bins[j])
    q[j] = (z_bins[j+1] - y) / (z_bins[j+1] - z_bins[j])
    
    return q, j

# Set up the figure with two panels
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Define a grid of K=11 bins spanning [-5, 5]
K = 11
z_bins = np.linspace(-5, 5, K)

# Panel 1: Two-hot encoding for a specific target
y_target = 0.7
q, j = two_hot_encode(y_target, z_bins)

# Plot bins as vertical lines
for z in z_bins:
    ax1.axvline(z, color='lightgray', linestyle='--', linewidth=0.8, alpha=0.5)

# Plot the distribution
colors_bar = ['#E63946' if qi > 0 else '#A8DADC' for qi in q]
ax1.bar(z_bins, q, width=(z_bins[1] - z_bins[0]) * 0.8, 
        color=colors_bar, alpha=0.7, edgecolor='black', linewidth=1.5)

# Highlight the target value
ax1.axvline(y_target, color='#2E86AB', linewidth=3, label=f'Target $y = {y_target}$', zorder=10)
ax1.plot(y_target, 0, 'o', color='#2E86AB', markersize=12, zorder=11)

# Annotate the two active bins with barycentric coordinates
ax1.annotate(f'$q_{{{j}}} = {q[j]:.3f}$\n$= \\frac{{z_{{{j+1}}} - y}}{{z_{{{j+1}}} - z_{{{j}}}}}$',
            xy=(z_bins[j], q[j]), xytext=(z_bins[j] - 1.8, q[j] + 0.25),
            fontsize=10, color='#E63946', weight='bold',
            arrowprops=dict(arrowstyle='->', color='#E63946', lw=2))
ax1.annotate(f'$q_{{{j+1}}} = {q[j+1]:.3f}$\n$= \\frac{{y - z_{{{j}}}}}{{z_{{{j+1}}} - z_{{{j}}}}}$',
            xy=(z_bins[j+1], q[j+1]), xytext=(z_bins[j+1] + 0.5, q[j+1] + 0.25),
            fontsize=10, color='#E63946', weight='bold',
            arrowprops=dict(arrowstyle='->', color='#E63946', lw=2))

# Verify barycentric property
reconstruction = np.sum(z_bins * q)
textstr = f'Barycentric property:\n$\\sum_k z_k q_k(y) = {reconstruction:.4f} = y$'
ax1.text(0.02, 0.97, textstr, transform=ax1.transAxes, fontsize=10,
        verticalalignment='top', bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

ax1.set_xlabel('Bin value $z_k$', fontsize=11)
ax1.set_ylabel('Probability $q_k(y)$', fontsize=11)
ax1.set_title('Two-Hot Encoding = Linear Interpolation', fontsize=12, fontweight='bold')
ax1.set_ylim([0, 1.2])
ax1.legend(loc='upper right', fontsize=10)
ax1.grid(axis='y', alpha=0.3)

# Panel 2: Multiple targets to show the pattern
targets = [0.7, -2.3, 3.8]
colors = ['#2E86AB', '#A23B72', '#F18F01']

for idx, (y_target, color) in enumerate(zip(targets, colors)):
    q, _ = two_hot_encode(y_target, z_bins)
    offset = idx * 0.2
    ax2.bar(z_bins + offset, q, width=0.18, color=color, alpha=0.7, 
           label=f'$y = {y_target}$', edgecolor='black', linewidth=0.8)

for z in z_bins:
    ax2.axvline(z, color='lightgray', linestyle='--', linewidth=0.5, alpha=0.4)

ax2.set_xlabel('Bin value $z_k$', fontsize=11)
ax2.set_ylabel('Probability $q_k(y)$', fontsize=11)
ax2.set_title('Multiple Targets: Each Encoded as Two-Hot', fontsize=12, fontweight='bold')
ax2.legend(loc='upper right', fontsize=10)
ax2.grid(axis='y', alpha=0.3)

plt.tight_layout()

The left panel shows the two-hot encoding for $y=0.7$ on a grid of 11 bins. The target value is distributed over exactly two adjacent bins $z_j$ and $z_{j+1}$, with weights that are barycentric coordinates: the weight assigned to each bin is inversely proportional to the distance from the target to that bin. The inset verifies that $\sum_k z_k q_k(y)$ exactly recovers the original target; this is the same linear interpolation formula used throughout the book. The right panel shows that different targets produce different two-hot patterns, each concentrating mass on the two bins surrounding the target value.

The training objective minimizes the cross-entropy:

$$
\mathcal{L}_{\text{CE}}(\boldsymbol{\theta}) = -\mathbb{E}_{((s,a),y) \sim \hat{P}_t^{\text{fit}}}\left[ \sum_{k=1}^K q_k(y) \log p_{\boldsymbol{\theta}}(k \mid s, a) \right]
$$

which projects the target distribution onto the predicted distribution in the KL geometry of the simplex $\Delta^{K-1}$ rather than in L2 on $\mathbb{R}$. The gradient with respect to the logits is $\nabla_{\ell_{\boldsymbol{\theta}}} \mathcal{L}_{\text{CE}} = p_{\boldsymbol{\theta}} - q$, bounded in magnitude regardless of target size.
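The bounded-gradient claim is easy to verify numerically; the sketch below builds a two-hot target on an illustrative grid, computes the cross-entropy against softmax probabilities, and checks one logit gradient against a finite difference.

import numpy as np

z = np.linspace(-5, 5, 11)                       # value bins z_1 < ... < z_K
y = 0.7                                          # scalar TD target

# Two-hot encoding of y: barycentric weights on the two surrounding bins.
j = np.searchsorted(z, y, side='right') - 1
q = np.zeros_like(z)
q[j] = (z[j + 1] - y) / (z[j + 1] - z[j])
q[j + 1] = (y - z[j]) / (z[j + 1] - z[j])

logits = np.random.default_rng(0).normal(size=z.size)
p = np.exp(logits - logits.max()); p /= p.sum()  # softmax probabilities

ce = -(q * np.log(p)).sum()                      # cross-entropy loss
grad = p - q                                     # logit gradient, each entry in [-1, 1]

# Finite-difference check of the first logit's gradient.
eps = 1e-6
logits2 = logits.copy(); logits2[0] += eps
p2 = np.exp(logits2 - logits2.max()); p2 /= p2.sum()
print(grad[0], (-(q * np.log(p2)).sum() - ce) / eps)   # the two numbers agree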

This provides three sources of implicit robustness. First, gradient influence is bounded: each sample contributes $O(1)$ gradient magnitude per bin, unlike L2, where an error of magnitude $E$ contributes a gradient proportional to $E$. Second, the finite grid $[z_1, z_K]$ clips extreme targets to the boundary bins, preventing outliers from dominating the regression scale. Third, the two-hot encoding spreads mass across neighboring bins, providing label smoothing that averages noisy targets at the same $(s,a)$.

The two-hot weights $q_j(y_i), q_{j+1}(y_i)$ are barycentric coordinates, identical to linear interpolation in the dynamic programming chapter (Algorithm 2). This places the encoding within Gordon's monotone approximator framework (Definition 1): targets are convex combinations preserving order and boundedness. The neural network predicting $p_{\boldsymbol{\theta}}(\cdot \mid s,a)$ is non-monotone, making classification-based Q-learning a hybrid: monotone target structure paired with flexible function approximation.

Empirically, cross-entropy loss scales better with network capacity. Farebrother et al. (2024) found that L2-based DQN and CQL degrade when Q-networks scale to large ResNets, while classification loss (specifically HL-Gauss, which uses Gaussian smoothing instead of two-hot) maintains performance. The combination of KL geometry, quantization, and smoothing prevents the overfitting to noisy targets that plagues L2 with high-capacity networks.

Summary

Fitted Q-iteration has a two-level structure: an outer loop applies the Bellman operator to construct targets, and an inner loop fits a function approximator to those targets. All algorithms in this chapter instantiate this template with different choices of buffer $\mathcal{B}_t$, target function $g$, loss $\ell$, and optimization budget $K$.

The empirical distribution $\hat{P}_{\mathcal{B}_t}$ unifies offline and online methods through plug-in approximation: replace the unknown transition law $P$ with $\hat{P}_{\mathcal{B}_t}$ and minimize the empirical risk $\mathbb{E}_{((s,a),y)\sim \hat{P}_t^{\text{fit}}} [\ell(q, y)]$. Offline methods use a fixed buffer $\mathcal{B}_t \equiv \mathcal{D}$ (sample average approximation); online methods use a circular buffer (Q-learning with $B=1$ is stochastic approximation, DQN with large $B$ is a hybrid).

Target networks and online networks arise from flattening the nested loops. Merging the inner gradient steps with the outer value iteration creates a single loop in which two sets of parameters coexist: the online network $\boldsymbol{\theta}_t$ (actively updated at each gradient step, corresponding to $\boldsymbol{\theta}_n^{(k)}$) and the target network $\boldsymbol{\theta}_{\text{target}}$ (frozen for computing targets, updated every $K$ steps to mark outer-iteration boundaries, corresponding to $\boldsymbol{\theta}_n$). In online algorithms like DQN, the online network additionally serves as the behavior policy for data collection.

The next chapter directly parameterizes and optimizes policies instead of searching over value functions.

References
  1. Ernst, D., Geurts, P., & Wehenkel, L. (2005). Tree-Based Batch Mode Reinforcement Learning. Journal of Machine Learning Research, 6, 503–556.
  2. Riedmiller, M. (2005). Neural Fitted Q Iteration – First Experiences with a Data Efficient Neural Reinforcement Learning Method. Proceedings of the 16th European Conference on Machine Learning (ECML), 317–328.
  3. Watkins, C. J. C. H. (1989). Learning from Delayed Rewards [PhD thesis]. University of Cambridge.
  4. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., & others. (2013). Playing Atari with Deep Reinforcement Learning. NIPS Deep Learning Workshop.
  5. Van Hasselt, H., Guez, A., & Silver, D. (2016). Deep Reinforcement Learning with Double Q-learning. Proceedings of the AAAI Conference on Artificial Intelligence, 30(1).
  6. Haarnoja, T., Tang, H., Abbeel, P., & Levine, S. (2017). Reinforcement Learning with Deep Energy-Based Policies. Proceedings of the 34th International Conference on Machine Learning, 70, 1352–1361.
  7. Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2015). Continuous Control with Deep Reinforcement Learning. arXiv Preprint arXiv:1509.02971.
  8. Fujimoto, S., Hoof, H., & Meger, D. (2018). Addressing Function Approximation Error in Actor-Critic Methods. International Conference on Machine Learning (ICML), 1587–1596.
  9. Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. Proceedings of the 35th International Conference on Machine Learning (ICML), 1861–1870.
  10. Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3–4), 279–292.
  11. Jaakkola, T., Jordan, M. I., & Singh, S. P. (1994). Convergence of Stochastic Iterative Dynamic Programming Algorithms. Neural Computation, 6(6), 1185–1201.
  12. Tsitsiklis, J. N. (1994). Asynchronous Stochastic Approximation and Q-Learning. Machine Learning, 16(3), 185–202.
  13. Kushner, H. J., & Yin, G. G. (2003). Stochastic Approximation and Recursive Algorithms and Applications (2nd ed., Vol. 35). Springer.
  14. Borkar, V. S., & Meyn, S. P. (2000). ODE Methods for Temporal Difference Learning. SIAM Journal on Control and Optimization, 38(2), 447–465.
  15. Garg, D., Tang, J., Kahn, G., Levine, S., & Finn, C. (2023). Extreme Q-Learning: MaxEnt RL without Entropy. International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=SJ0Lde3tRL