
Fitted Q-Iteration for Continuous Action Spaces

The previous chapter showed how fitted Q-iteration handles large state spaces through function approximation. FQI maintains a unified structure across batch and online settings: a replay buffer $\mathcal{B}_t$ inducing an empirical distribution $\hat{P}_{\mathcal{B}_t}$, a target map $T_q$ derived from the Bellman operator, a loss function $\ell$, and an optimization budget. Algorithms differ in how they instantiate these components (buffer evolution, hard vs soft Bellman, update frequency), but all follow the same template.

However, this framework breaks down when the action space becomes large or continuous. Computing Bellman targets requires evaluating $\max_{a' \in \mathcal{A}} q(s',a';\boldsymbol{\theta})$ for each next state $s'$. When actions are continuous ($\mathcal{A} \subset \mathbb{R}^m$), this maximization requires solving a nonlinear program at every target computation. For a replay buffer with millions of transitions, this becomes computationally prohibitive.

This chapter addresses the continuous action problem while maintaining the FQI framework. We develop several approaches, unified by a common theme: amortization. Rather than solving the optimization problem $\max_a q(s,a;\boldsymbol{\theta})$ repeatedly at inference time, we invest computational effort during training to learn a mapping that directly produces good actions. This trades training-time cost for inference-time speed.

The strategies we examine are:

  1. Explicit optimization (Section 2): Solve the maximization numerically for a subset of states, accepting the computational cost for exact solutions.

  2. Policy network amortization (Sections 3-5): Learn a deterministic or stochastic policy network $\pi_{\boldsymbol{w}}$ that approximates $\arg\max_a q(s,a;\boldsymbol{\theta})$ or the optimal stochastic policy, enabling fast action selection via a single forward pass. This includes both hard-max methods (DDPG, TD3) and soft-max methods (SAC, PCL).

Each approach represents a different point in the computation-accuracy trade-off, and all fit within the FQI template by modifying how targets are computed.

Explicit Optimization

Recall that in fitted Q methods, the main idea is to compute the Bellman operator only at a subset of all states, relying on function approximation to generalize to the remaining states. At each step of the successive approximation loop, we build a dataset of input state-action pairs mapped to their corresponding optimality operator evaluations:

$$\mathcal{D}_n = \{((s, a), (\Bellman q)(s, a; \boldsymbol{\theta}_n)) \mid (s,a) \in \mathcal{B}\}$$

This dataset is then fed to our function approximator (neural network, random forest, linear model) to obtain the next set of parameters:

$$\boldsymbol{\theta}_{n+1} \leftarrow \texttt{fit}(\mathcal{D}_n)$$

While this strategy allows us to handle very large or even infinite (continuous) state spaces, it still requires maximizing over actions ($\max_{a \in \mathcal{A}}$) during dataset creation when computing the operator $\Bellman$ at each basepoint. This maximization becomes computationally expensive for large action spaces. We can address this by adding another level of optimization: for each sample added to our regression dataset, we employ numerical optimization methods to find actions that maximize the Bellman operator for the given state.

The above pseudocode introduces a generic $\texttt{maximize}$ routine, which represents any numerical optimization method that searches for an action maximizing the given function. This approach is versatile and can be adapted to different types of action spaces. For continuous action spaces, we can employ standard nonlinear optimization methods such as gradient descent or L-BFGS (e.g., using scipy.optimize.minimize). For large discrete action spaces, we can use integer programming solvers: linear integer programming if the Q-function approximator is linear in actions, or mixed-integer nonlinear programming (MINLP) solvers for nonlinear Q-functions. The choice of solver depends on the structure of our Q-function approximator and the constraints on our action space.
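
As a concrete illustration, here is a minimal sketch of the explicit-maximization step for a box-constrained continuous action space, using L-BFGS-B with random restarts. The callable `q_fn` and the way bounds are supplied are assumptions for this sketch, not part of any specific implementation.

```python
# Sketch: compute r + gamma * max_a' q_fn(s_next, a') by numerical optimization
# over a box-constrained continuous action space. `q_fn` is an assumed callable.
import numpy as np
from scipy.optimize import minimize

def bellman_target(q_fn, r, s_next, gamma, action_low, action_high, n_restarts=5):
    bounds = list(zip(action_low, action_high))
    best_val = -np.inf
    for _ in range(n_restarts):  # random restarts guard against local optima
        a0 = np.random.uniform(action_low, action_high)
        res = minimize(lambda a: -q_fn(s_next, a), a0,
                       method="L-BFGS-B", bounds=bounds)
        best_val = max(best_val, -res.fun)
    return r + gamma * best_val
```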

While explicit optimization provides exact solutions, it becomes computationally expensive when we need to compute targets for millions of transitions in a replay buffer. Can we avoid solving an optimization problem at every decision? The answer is amortization.

Amortized Optimization Approach

This process is computationally intensive. We can “amortize” some of this computation by replacing the explicit optimization for each sample with a learned mapping that outputs an approximate maximizer directly. For Q-functions, recall that the operator is given by:

$$(\Bellman q)(s,a) = r(s,a) + \gamma \int p(ds'|s,a)\max_{a' \in \mathcal{A}(s')} q(s', a')$$

If $q^*$ is the optimal state-action value function, then $v^*(s) = \max_a q^*(s,a)$, and we can derive the optimal policy directly by computing the decision rule:

$$\pi^\star(s) = \arg\max_{a \in \mathcal{A}(s)} q^\star(s,a)$$

Since $q^*$ is a fixed point of $\Bellman$, we can write:

$$\begin{align*} q^\star(s,a) &= (\Bellman q^*)(s,a) \\ &= r(s,a) + \gamma \int p(ds'|s,a) \max_{a' \in \mathcal{A}(s')} q^\star(s', a') \\ &= r(s,a) + \gamma \int p(ds'|s,a)\, q^\star(s', \pi^\star(s')) \end{align*}$$

Note that $\pi^\star$ is implemented by our $\texttt{maximize}$ numerical solver in the procedure above. A practical strategy would be to collect these maximizer values at each step and use them to train a function approximator that directly predicts these solutions. Due to computational constraints, we might want to compute these exact maximizer values only for a subset of states, based on some computational budget, and use the fitted decision rule to generalize to the remaining states. This leads to the following amortized version:

Note that the policy $\pi_{\boldsymbol{w}}$ is being trained on a dataset $\mathcal{D}_\pi$ containing optimal actions computed with respect to an evolving Q-function. Specifically, at iteration $n$, we collect pairs $(s', a^*_{s'})$ where $a^*_{s'} = \arg\max_a q(s', a; \boldsymbol{\theta}_n)$. However, after updating to $\boldsymbol{\theta}_{n+1}$, these actions may no longer be optimal with respect to the new Q-function.

A natural approach to handle this staleness would be to maintain only the most recent optimization data. We could modify our procedure to keep a sliding window of $K$ iterations, where at iteration $n$, we only use data from iterations $\max(0, n-K)$ to $n$. This would be implemented by augmenting each entry in $\mathcal{D}_\pi$ with a timestamp:

$$\mathcal{D}_\pi^{(n)} = \{(s', a^*_{s'}, t) \mid t \in \{n-K,\ldots,n\}\}$$

where $t$ indicates the iteration at which the optimal action was computed. When fitting the policy network, we would then only use data points that are at most $K$ iterations old:

$$\boldsymbol{w}_{n+1} \leftarrow \texttt{fit}(\{(s', a^*_{s'}) \mid (s', a^*_{s'}, t) \in \mathcal{D}_\pi^{(n)}, n-K \leq t \leq n\})$$

This introduces a trade-off between using more data (larger $K$) versus using more recent, accurate data (smaller $K$). The choice of $K$ would depend on how quickly the Q-function evolves and the computational budget available for computing exact optimal actions.

The main limitation of this approach, beyond the out-of-distribution drift, is that it requires computing exact optimal actions via the solver for states in $\mathcal{B}_{\text{opt}}$. Can we reduce or eliminate this computational expense? As the policy improves at selecting actions, we can bootstrap from these increasingly better choices. Continuously amortizing these improving actions over time creates a virtuous cycle of self-improvement toward the optimal policy. However, this bootstrapping process requires careful management: moving too quickly can destabilize training.

Deterministic Parametrized Policies

In this section, we consider deterministic parametrized policies of the form $\pi_{\boldsymbol{w}}(s)$ which directly output an action given a state. This approach differs from stochastic policies that output probability distributions over actions, making it particularly suitable for continuous control problems where the optimal policy is often deterministic. Fitted Q-value methods can be naturally extended to simultaneously learn both the Q-function and such a deterministic policy.

The Amortization Problem for Continuous Actions

When actions are continuous, $a \in \mathbb{R}^d$, extracting a greedy policy from a Q-function becomes computationally expensive. Consider a robot arm control task where the action is a $d$-dimensional torque vector. To act greedily given the Q-function $q(s,a; \boldsymbol{\theta})$, we must solve:

$$\pi(s) = \arg\max_{a \in \mathcal{A}} q(s, a; \boldsymbol{\theta}),$$

where $\mathcal{A} \subset \mathbb{R}^d$ is a continuous set (often a box or polytope). This requires running an optimization algorithm at every time step. For neural network Q-functions, this means solving a nonlinear program whose objective involves forward passes through the network.

After training converges, the agent must select actions in real-time during deployment. Running interior-point methods or gradient-based optimizers at every decision creates unacceptable latency, especially in high-frequency control where decisions occur at 100Hz or faster.

The solution is to amortize the optimization cost by learning a separate policy network $\pi_{\boldsymbol{w}}(s)$ that directly outputs actions. During training, we optimize $\boldsymbol{w}$ so that $\pi_{\boldsymbol{w}}(s) \approx \arg\max_a q(s,a; \boldsymbol{\theta})$ for states we encounter. At deployment, action selection reduces to a single forward pass through the policy network: $a = \pi_{\boldsymbol{w}}(s)$. The computational cost of optimization is paid during training (where time is less constrained) rather than at inference.

This introduces a second approximation beyond the Q-function. We now have two function approximators: a critic $q(s,a; \boldsymbol{\theta})$ that estimates values, and an actor $\pi_{\boldsymbol{w}}(s)$ that selects actions. The critic is trained using Bellman targets as in standard fitted Q-iteration. The actor is trained to maximize the critic:

$$\boldsymbol{w} \leftarrow \boldsymbol{w} + \alpha \mathbb{E}_s \left[\nabla_{\boldsymbol{w}} q(s, \pi_{\boldsymbol{w}}(s); \boldsymbol{\theta})\right],$$

where the expectation is over states in the dataset or replay buffer. This gradient ascent pushes the actor toward actions that the critic considers valuable. By the chain rule, this equals $(\nabla_a q(s,a; \boldsymbol{\theta})|_{a=\pi_{\boldsymbol{w}}(s)}) \cdot (\nabla_{\boldsymbol{w}} \pi_{\boldsymbol{w}}(s))$, which can be efficiently computed via backpropagation through the composition of the two networks.
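
A minimal sketch of this actor update, assuming PyTorch modules `actor(s) -> a` and `critic(s, a) -> q` (both names are illustrative). Maximizing the critic is implemented as minimizing its negative; autograd applies the chain rule through the actor-critic composition.

```python
# Sketch: one actor update step; only the actor's parameters are stepped.
import torch

def actor_update(actor, critic, states, actor_optimizer):
    actions = actor(states)                       # a = pi_w(s)
    actor_loss = -critic(states, actions).mean()  # ascend on E_s[q(s, pi_w(s))]
    actor_optimizer.zero_grad()
    actor_loss.backward()                         # backprop through critic into actor
    actor_optimizer.step()
```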

Neural Fitted Q-Iteration for Continuous Actions (NFQCA)

NFQCA (Hafner & Riedmiller, 2011) extends the NFQI template from the previous chapter to handle continuous action spaces by replacing the $\max_{a'} q(s',a'; \boldsymbol{\theta})$ operator in the Bellman target with a parameterized policy $\pi_{\boldsymbol{w}}(s')$. This transforms fitted Q-iteration into an actor-critic method: the critic $q(s,a; \boldsymbol{\theta})$ evaluates state-action pairs via the standard regression step, while the actor $\pi_{\boldsymbol{w}}(s)$ provides actions by directly maximizing the learned Q-function.

The algorithm retains the two-level structure of NFQI: an outer loop performs approximate value iteration by computing Bellman targets, and an inner loop fits the Q-function to those targets. NFQCA adds a third component (policy improvement) that updates $\boldsymbol{w}$ to maximize the Q-function over states sampled from the dataset.

From Discrete to Continuous Actions

Recall from the FQI chapter that NFQI computes Bellman targets using the hard max:

$$y_{s,a} = r + \gamma \max_{a' \in \mathcal{A}} q(s',a'; \boldsymbol{\theta}_n)$$

When $\mathcal{A}$ is finite and small, this max is computed by enumeration. When $\mathcal{A}$ is continuous or high-dimensional, enumeration is intractable. NFQCA replaces the max with a parameterized policy that approximately solves the maximization:

$$y_{s,a} = r + \gamma q(s', \pi_{\boldsymbol{w}_n}(s'); \boldsymbol{\theta}_n)$$

The policy $\pi_{\boldsymbol{w}}(s)$ acts as an amortized optimizer: instead of solving $\arg\max_{a'} q(s',a')$ from scratch at each state $s'$ during target computation, we train a neural network to output near-optimal actions directly. The term “amortized” refers to spreading the cost of optimization across training: we pay once to learn $\pi_{\boldsymbol{w}}$, then reuse it for all future target computations.

To train the policy, we maximize the expected Q-value under the distribution of states in the dataset. If we had access to the optimal Q-function $q^*$, we would solve:

$$\max_{\boldsymbol{w}} \mathbb{E}_{s \sim \hat{P}_{\mathcal{D}}}[q^*(s, \pi_{\boldsymbol{w}}(s))]$$

where $\hat{P}_{\mathcal{D}}$ is the empirical distribution over states induced by the offline dataset $\mathcal{D}$. In practice, we use the current Q-function approximation $q(s,a; \boldsymbol{\theta}_{n+1})$ after it has been fitted to the latest targets. The expectation is approximated by the sample average over states appearing in $\mathcal{D}$:

$$\max_{\boldsymbol{w}} \frac{1}{|\mathcal{D}|} \sum_{(s,a,r,s') \in \mathcal{D}} q(s, \pi_{\boldsymbol{w}}(s); \boldsymbol{\theta}_{n+1})$$

This policy improvement step runs after the Q-function has been updated, using the newly-fitted critic to guide the actor toward higher-value actions. Both the Q-function fitting and policy improvement use gradient-based optimization on the respective objectives.

The algorithm structure mirrors NFQI (Algorithm 1 in the FQI chapter) with two extensions. First, target computation (lines 7-8) replaces the discrete max with a policy network call $\pi_{\boldsymbol{w}_n}(s')$, making the Bellman operator tractable for continuous actions. Second, after fitting the Q-function (line 11), we add a policy improvement step (line 13) that updates $\boldsymbol{w}$ to maximize the Q-function evaluated at policy-generated actions over states in the dataset.

Both fit operations use gradient descent with warm starting, consistent with the NFQI template. The Q-function minimizes squared Bellman error using targets computed with the current policy. The policy maximizes the Q-function via gradient ascent on the composition $q(s, \pi_{\boldsymbol{w}}(s); \boldsymbol{\theta}_{n+1})$, which is differentiable end-to-end when both networks are differentiable. The gradient with respect to $\boldsymbol{w}$ is:

$$\nabla_{\boldsymbol{w}} q(s, \pi_{\boldsymbol{w}}(s); \boldsymbol{\theta}) = \nabla_a q(s, a; \boldsymbol{\theta})\Big|_{a=\pi_{\boldsymbol{w}}(s)} \cdot \nabla_{\boldsymbol{w}} \pi_{\boldsymbol{w}}(s)$$

computed via the chain rule (backpropagation through the actor into the critic). Modern automatic differentiation libraries handle this composition automatically.

Deep Deterministic Policy Gradient (DDPG)

We now extend NFQCA to the online setting with evolving replay buffers, mirroring how DQN extended NFQI in the FQI chapter. Just as DQN allowed $\mathcal{B}_t$ and $\hat{P}_{\mathcal{B}_t}$ to evolve during learning instead of using a fixed offline dataset, DDPG (Lillicrap et al., 2015) collects new transitions during training and stores them in a circular replay buffer.

Like DQN, DDPG uses the flattened FQI structure with target networks. But where DQN maintains a single target network $\boldsymbol{\theta}_{\text{target}}$ for the Q-function, DDPG maintains two target networks: one for the critic $\boldsymbol{\theta}_{\text{target}}$ and one for the actor $\boldsymbol{w}_{\text{target}}$. Both are updated periodically (every $K$ steps) to mark outer-iteration boundaries, following the same nested-to-flattened transformation shown for DQN.

The online network now plays a triple role in DDPG: (1) the parameters being actively trained ($\boldsymbol{\theta}_t$ for the critic, $\boldsymbol{w}_t$ for the actor), (2) the policy used to collect new data, and (3) the gradient source for policy improvement. The target networks serve only one purpose: computing stable Bellman targets.

Exploration via Action Noise

Since the policy $\pi_{\boldsymbol{w}}(s)$ is deterministic, exploration requires adding noise to actions during data collection:

$$a = \pi_{\boldsymbol{w}_t}(s) + \eta_t$$

where $\eta_t$ is exploration noise. The original DDPG paper used an Ornstein-Uhlenbeck (OU) process, which generates temporally correlated noise through the discretized stochastic differential equation:

$$\eta_{t+1} = \eta_t + \theta(\mu - \eta_t)\Delta t + \sigma\sqrt{\Delta t}\,\epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0,1)$$

where $\mu$ is the long-term mean (typically 0), $\theta$ controls the strength of mean reversion, $\sigma$ scales the random fluctuations, and $\Delta t$ is the time step. The term $\theta(\mu - \eta_t)\Delta t$ acts like damped motion through a viscous fluid: when $\eta_t$ deviates from $\mu$, this force pulls it back smoothly without oscillation. The random term $\sigma\sqrt{\Delta t}\,\epsilon_t$ adds perturbations, creating noise that wanders but is gently pulled back toward $\mu$. This temporal correlation produces smoother exploration trajectories than independent Gaussian noise.

However, later work (including TD3, discussed below) found that simple uncorrelated Gaussian noise $\eta_t \sim \mathcal{N}(0, \sigma^2)$ works equally well and is easier to tune. The exploration mechanism is orthogonal to the core algorithmic structure.
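
A minimal sketch of both exploration-noise options in numpy. The parameter values are illustrative defaults, not those of any particular paper.

```python
# Sketch: Ornstein-Uhlenbeck noise (temporally correlated) vs Gaussian noise.
import numpy as np

class OUNoise:
    def __init__(self, dim, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.eta = np.zeros(dim)

    def sample(self):
        # eta_{t+1} = eta_t + theta*(mu - eta_t)*dt + sigma*sqrt(dt)*eps
        self.eta += self.theta * (self.mu - self.eta) * self.dt \
                    + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.eta.shape)
        return self.eta

def gaussian_noise(dim, sigma=0.1):
    # uncorrelated alternative preferred by TD3
    return sigma * np.random.randn(dim)
```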

The algorithm structure parallels DQN (Algorithm 5 in the FQI chapter) with the continuous-action extensions from NFQCA. Lines 1-5 initialize both networks and their targets, following the same pattern as DQN but with an additional actor network. Line 3 uses the online actor with exploration noise for data collection, replacing DQN’s $\varepsilon$-greedy selection. Line 7 computes targets using both target networks: the actor target $\pi_{\boldsymbol{w}_{\text{target}}}(s'_i)$ selects the next action, and the critic target $q(\cdot; \boldsymbol{\theta}_{\text{target}})$ evaluates it. This replaces the $\max_{a'}$ operator in DQN. Lines 8-9 update both networks: the critic via TD error minimization, the actor via the policy gradient through the updated critic. Line 10 performs periodic hard updates every $K$ steps, marking outer-iteration boundaries.

The policy gradient in line 9 uses the chain rule to backpropagate through the actor-critic composition:

$$\nabla_{\boldsymbol{w}} q(s, \pi_{\boldsymbol{w}}(s); \boldsymbol{\theta}) = \nabla_a q(s,a; \boldsymbol{\theta})\Big|_{a=\pi_{\boldsymbol{w}}(s)} \cdot \nabla_{\boldsymbol{w}} \pi_{\boldsymbol{w}}(s)$$

This is identical to the NFQCA gradient, but now computed on mini-batches sampled from an evolving replay buffer rather than a fixed offline dataset. The critic gradient $\nabla_a q(s,a; \boldsymbol{\theta})$ at the policy-generated action provides the direction of steepest ascent in Q-value space, weighted by how sensitive the policy output is to its parameters via $\nabla_{\boldsymbol{w}} \pi_{\boldsymbol{w}}(s)$.
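
A minimal sketch of one DDPG learning step on a sampled mini-batch, assuming PyTorch modules `actor`, `critic`, and their target copies; all names and the batch layout are illustrative assumptions.

```python
# Sketch: DDPG critic regression toward the target-network Bellman target,
# followed by the actor (policy-gradient) step on the same batch.
import torch
import torch.nn.functional as F

def ddpg_step(batch, actor, critic, actor_targ, critic_targ,
              actor_opt, critic_opt, gamma=0.99):
    s, a, r, s_next, done = batch
    with torch.no_grad():
        a_next = actor_targ(s_next)                               # target actor selects a'
        y = r + gamma * (1 - done) * critic_targ(s_next, a_next)  # Bellman target
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    actor_loss = -critic(s, actor(s)).mean()   # policy improvement; only actor params stepped
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```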

Twin Delayed Deep Deterministic Policy Gradient (TD3)

DDPG inherits the overestimation bias from DQN’s use of the max operator in Bellman targets. TD3 (Fujimoto et al., 2018) addresses this through three modifications to the DDPG template, following similar principles to Double DQN but adapted for continuous actions and taking a more conservative approach.

Twin Q-Networks and the Minimum Operator

Recall from the Monte Carlo chapter that overestimation arises when we use the same noisy estimate both to select which action looks best and to evaluate that action. Double Q-learning breaks this coupling by maintaining two independent estimators with noise terms $\varepsilon^{(1)}_a$ and $\varepsilon^{(2)}_a$:

$$a^\star = \arg\max_{a} \left\{r(s,a) + \gamma \hat{\mu}^{(1)}_N(s,a)\right\}, \quad Y = r(s,a^\star) + \gamma \hat{\mu}^{(2)}_N(s,a^\star).$$

When $\varepsilon^{(1)}$ and $\varepsilon^{(2)}$ are independent, the tower property of conditional expectation gives $\mathbb{E}[\varepsilon^{(2)}_{a^\star} \mid a^\star] = \mathbb{E}[\varepsilon^{(2)}_{a^\star}] = 0$ because $a^\star$ (determined by $\varepsilon^{(1)}$) is independent of $\varepsilon^{(2)}$. This eliminates evaluation bias: we no longer use the same positive noise that selected an action to also inflate its value. By conditioning on the selected action and then taking expectations over the independent evaluation noise, the bias in the evaluation term vanishes.

Double DQN (Algorithm 6) implements this principle in the discrete action setting by using the online network $\boldsymbol{\theta}_t$ for selection ($a^*_i \leftarrow \arg\max_{a'} q(s_i',a'; \boldsymbol{\theta}_t)$) and the target network $\boldsymbol{\theta}_{\text{target}}$ for evaluation ($y_i \leftarrow r_i + \gamma q(s_i',a^*_i; \boldsymbol{\theta}_{\text{target}})$). Since these networks experience different training noise, their errors are approximately independent, achieving the independence condition needed to eliminate evaluation bias. However, selection bias remains: the argmax still picks actions that received positive noise in the selection network, so $\mathbb{E}_{\varepsilon^{(1)}}[\mu(s,a^\star)] \ge \max_a \mu(s,a)$.

TD3 takes a more conservative approach. Instead of decoupling selection from evaluation, TD3 maintains twin Q-networks $q^A(s,a; \boldsymbol{\theta}^A)$ and $q^B(s,a; \boldsymbol{\theta}^B)$ trained on the same data with different random initializations. When computing targets, TD3 uses the target policy $\pi_{\boldsymbol{w}_{\text{target}}}(s')$ to select actions (no maximization over a discrete set), then takes the minimum of the two Q-networks’ evaluations:

$$y_i = r_i + \gamma \min\left(q^A(s'_i, \tilde{a}_i; \boldsymbol{\theta}^A_{\text{target}}), q^B(s'_i, \tilde{a}_i; \boldsymbol{\theta}^B_{\text{target}})\right)$$

where $\tilde{a}_i = \pi_{\boldsymbol{w}_{\text{target}}}(s'_i)$. This minimum operation provides a pessimistic estimate: if the two Q-networks have independent errors $q^A(s',a) = q^*(s',a) + \varepsilon^A$ and $q^B(s',a) = q^*(s',a) + \varepsilon^B$, then $\mathbb{E}[\min(q^A, q^B)] \le q^*(s',a)$, producing systematic underestimation rather than overestimation.

The connection to the conditional independence framework is subtle but important. While Double DQN uses independence to eliminate bias in expectation (one network selects, another evaluates), TD3 uses independence to construct a deliberate lower bound. Both approaches rely on maintaining two Q-functions with partially decorrelated errors, achieved through different initializations and stochastic gradients during training, but they aggregate these functions differently. Double DQN’s decoupling targets unbiased estimation by breaking the correlation between selection and evaluation noise. TD3’s minimum operation targets robust estimation by taking the most pessimistic view when the two networks disagree.

This trade-off between bias and robustness is deliberate. In actor-critic methods, the policy gradient pushes toward actions with high Q-values. Overestimation is particularly harmful because it can lead the policy to exploit erroneous high-value regions. Underestimation is generally safer: the policy may ignore some good actions, but it will not be misled into pursuing actions that only appear valuable due to approximation error. The minimum operation implements a “trust the pessimist” principle that complements the policy optimization objective.

TD3 also introduces two additional modifications beyond the clipped double Q-learning. First, target policy smoothing adds clipped noise to the target policy’s actions when computing targets: $\tilde{a} = \pi_{\boldsymbol{w}_{\text{target}}}(s') + \text{clip}(\varepsilon, -c, c)$. This regularization prevents the policy from exploiting narrow peaks in the Q-function approximation error by averaging over nearby actions. Second, delayed policy updates change the actor update frequency: the actor updates every $d$ steps instead of every step. This reduces per-update error by letting the critics converge before the actor adapts to them.
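
A minimal sketch of TD3’s target computation combining target policy smoothing with the clipped double-Q minimum; network handles, action bounds, and hyperparameter values are illustrative assumptions.

```python
# Sketch: smoothed, clipped double-Q Bellman target for TD3.
import torch

def td3_target(r, s_next, done, actor_targ, qA_targ, qB_targ,
               gamma=0.99, sigma=0.2, c=0.5, a_low=-1.0, a_high=1.0):
    with torch.no_grad():
        a_next = actor_targ(s_next)
        noise = (torch.randn_like(a_next) * sigma).clamp(-c, c)   # target policy smoothing
        a_tilde = (a_next + noise).clamp(a_low, a_high)
        q_min = torch.min(qA_targ(s_next, a_tilde), qB_targ(s_next, a_tilde))
        return r + gamma * (1 - done) * q_min                     # pessimistic target
```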

TD3 also replaces DDPG’s hard target updates with exponential moving average (EMA) updates, following the smooth update scheme from Algorithm 4 in the FQI chapter. Instead of copying $\boldsymbol{\theta}_{\text{target}} \leftarrow \boldsymbol{\theta}_t$ every $K$ steps, EMA smoothly tracks the online network: $\boldsymbol{\theta}_{\text{target}} \leftarrow \tau \boldsymbol{\theta}_t + (1-\tau)\boldsymbol{\theta}_{\text{target}}$ at every update. For small $\tau \in [0.001, 0.01]$, the target lags behind the online network by roughly $1/\tau$ steps, providing smoother learning dynamics.
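
A minimal sketch of the EMA (Polyak) target update over PyTorch parameters; the default $\tau$ is illustrative.

```python
# Sketch: theta_target <- tau * theta_online + (1 - tau) * theta_target
import torch

def ema_update(online_net, target_net, tau=0.005):
    with torch.no_grad():
        for p, p_targ in zip(online_net.parameters(), target_net.parameters()):
            p_targ.mul_(1 - tau).add_(tau * p)
```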

The algorithm structure parallels Double DQN but with continuous actions. Lines 8.1-8.2 implement clipped double Q-learning: smoothing adds noise to target actions (preventing exploitation of Q-function artifacts), and the min operation (highlighted in blue) provides pessimistic value estimates. Both critics update toward the same shared target (lines 10-11), but their different initializations and stochastic gradient noise keep their errors partially decorrelated, following the same principle underlying Double DQN’s independence assumption. Line 13 gates policy updates to every $d$ steps (typically $d=2$), and lines 13.2-13.4 use EMA updates following Algorithm 4.

TD3 simplifies exploration by replacing DDPG’s Ornstein-Uhlenbeck process with uncorrelated Gaussian noise $\varepsilon \sim \mathcal{N}(0, \sigma_{\text{explore}}^2)$ (line 5.3). This eliminates the need to tune multiple OU parameters while providing equally effective exploration.

Soft Actor-Critic

DDPG and TD3 address continuous actions by learning a deterministic policy that amortizes the $\arg\max_a q(s,a)$ operation. But deterministic policies have a fundamental limitation: they require external exploration noise (Gaussian perturbations in TD3) and can converge to suboptimal deterministic behaviors without adequate coverage of the state-action space.

The smoothing chapter presents an alternative: entropy-regularized MDPs, where the agent maximizes expected return plus a bonus for policy randomness. This yields stochastic policies with exploration built into the objective itself. The smooth Bellman operator replaces the hard max with a soft-max:

$$v^*(s) = \frac{1}{\beta} \log \sum_{a \in \mathcal{A}} \exp\left(\beta \cdot q^*(s,a)\right)$$

where $\beta = 1/\alpha$ is the inverse temperature and $\alpha$ is the entropy regularization weight. For finite action spaces, this log-sum-exp is easy to compute. But for continuous actions $\mathcal{A} \subset \mathbb{R}^m$, the sum becomes an integral:

$$v^*(s) = \frac{1}{\beta} \log \int_{\mathcal{A}} \exp\left(\beta \cdot q^*(s,a)\right) da$$

This integral is intractable. We face an infinite-dimensional sum over the continuous action space. The very smoothness that gives us stochastic policies creates a new computational barrier, distinct from but analogous to the $\arg\max$ problem in standard FQI.

From Intractable Integral to Tractable Expectation

Soft actor-critic (SAC) (Haarnoja et al., 2018) exploits an equivalence between the intractable integral and an expectation. The optimal policy under entropy regularization is the Boltzmann distribution $\pi^*(a|s) \propto \exp(\beta \cdot q^*(s,a))$. Under this policy, the soft value function becomes:

$$v^*(s) = \mathbb{E}_{a \sim \pi^*(\cdot|s)}\left[q^*(s,a) - \alpha \log \pi^*(a|s)\right]$$

This converts the intractable integral into an expectation we can estimate by sampling. SAC learns a parametric policy $\pi_{\boldsymbol{\phi}}$ that approximates the Boltzmann distribution, enabling fast action selection via a single forward pass. For bootstrap targets, SAC samples $\tilde{a}' \sim \pi_{\boldsymbol{\phi}}(\cdot|s')$ and computes:

$$y = r + \gamma \left[\min_{j=1,2} q^j_{\boldsymbol{\theta}_{\text{target}}}(s', \tilde{a}') - \alpha \log \pi_{\boldsymbol{\phi}}(\tilde{a}'|s')\right]$$

The minimum over twin Q-networks applies the clipped double-Q trick from TD3. Exploration comes from the policy’s stochasticity rather than external noise.
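
A minimal sketch of the entropy-adjusted bootstrap target, assuming a policy object whose `sample()` returns an action and its log-probability and twin target critics; all interfaces are illustrative assumptions.

```python
# Sketch: SAC soft Bellman target with the clipped double-Q minimum.
import torch

def sac_target(r, s_next, done, policy, qA_targ, qB_targ, gamma=0.99, alpha=0.2):
    with torch.no_grad():
        a_next, logp_next = policy.sample(s_next)   # a' ~ pi_phi(.|s'), log pi_phi(a'|s')
        q_min = torch.min(qA_targ(s_next, a_next), qB_targ(s_next, a_next))
        return r + gamma * (1 - done) * (q_min - alpha * logp_next)
```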

Learning the Policy: Matching the Boltzmann Distribution

The Q-network update assumes a policy $\pi_{\boldsymbol{\phi}}$ that approximates the Boltzmann distribution $\pi^*(a|s) \propto \exp(\beta \cdot q^*(s,a))$. Training such a policy presents a problem: the Boltzmann distribution requires the partition function $Z(s) = \int_{\mathcal{A}} \exp(\beta \cdot q(s,a))\,da$, the very integral we are trying to avoid. SAC sidesteps this by minimizing the KL divergence from the policy to the (unnormalized) Boltzmann distribution:

$$\min_{\boldsymbol{\phi}} \mathbb{E}_{s \sim \mathcal{D}}\left[D_{KL}\left(\pi_{\boldsymbol{\phi}}(\cdot|s) \,\Big\|\, \frac{\exp(\beta \cdot q_{\boldsymbol{\theta}}(s,\cdot))}{Z_{\boldsymbol{\theta}}(s)}\right)\right]$$

Since $\log Z_{\boldsymbol{\theta}}(s)$ does not depend on $\boldsymbol{\phi}$, this reduces to:

$$\min_{\boldsymbol{\phi}} \mathbb{E}_{s \sim \mathcal{D}}\,\mathbb{E}_{a \sim \pi_{\boldsymbol{\phi}}(\cdot|s)}\left[\alpha \log \pi_{\boldsymbol{\phi}}(a|s) - q_{\boldsymbol{\theta}}(s,a)\right]$$

This pushes probability toward high Q-value actions while the $\log \pi_{\boldsymbol{\phi}}$ term penalizes concentrating probability mass, maintaining entropy. The entropy bonus comes from the KL divergence structure rather than from an explicit regularization term.

To estimate gradients of this objective, we face a technical problem: the policy parameters $\boldsymbol{\phi}$ appear in the sampling distribution $\pi_{\boldsymbol{\phi}}$, making $\nabla_{\boldsymbol{\phi}} \mathbb{E}_{a \sim \pi_{\boldsymbol{\phi}}}[\cdot]$ difficult to compute. SAC uses a Gaussian policy $\pi_{\boldsymbol{\phi}}(a|s) = \mathcal{N}(\mu_{\boldsymbol{\phi}}(s), \sigma_{\boldsymbol{\phi}}(s)^2)$ with the reparameterization trick. Express samples as a deterministic function of parameters and independent noise:

$$a = f_{\boldsymbol{\phi}}(s, \epsilon) = \mu_{\boldsymbol{\phi}}(s) + \sigma_{\boldsymbol{\phi}}(s) \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

This moves $\boldsymbol{\phi}$ out of the sampling distribution and into the integrand:

$$\min_{\boldsymbol{\phi}} \mathbb{E}_{s \sim \mathcal{D}}\,\mathbb{E}_{\epsilon \sim \mathcal{N}(0,I)}\left[\alpha \log \pi_{\boldsymbol{\phi}}(f_{\boldsymbol{\phi}}(s,\epsilon)|s) - q_{\boldsymbol{\theta}}(s,f_{\boldsymbol{\phi}}(s,\epsilon))\right]$$

We can now differentiate through $f_{\boldsymbol{\phi}}$ and the Q-network, as DDPG differentiates through a deterministic policy. SAC extends this by sampling noise $\epsilon$ at each gradient step rather than outputting a single deterministic action.
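
A minimal sketch of the reparameterized policy objective, assuming a network that outputs Gaussian parameters; the tanh squashing and its log-probability correction used in practice are omitted here, so treat this as an illustrative simplification.

```python
# Sketch: SAC policy loss via the reparameterization trick,
# minimizing alpha * log pi_phi(a|s) - min_j q_j(s, a).
import torch

def sac_policy_loss(policy_net, qA, qB, states, alpha=0.2):
    mu, log_std = policy_net(states)          # Gaussian head (assumed interface)
    std = log_std.exp()
    eps = torch.randn_like(mu)
    a = mu + std * eps                        # a = f_phi(s, eps), differentiable in phi
    logp = torch.distributions.Normal(mu, std).log_prob(a).sum(-1)
    q_min = torch.min(qA(states, a), qB(states, a))
    return (alpha * logp - q_min).mean()
```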

The algorithm interleaves three updates. The Q-networks (lines 7-10) follow fitted Q-iteration with the soft Bellman target: sample a next action $\tilde{a}'_i$ from the current policy, compute the entropy-adjusted target $y_i = r_i + \gamma[\min_j q^j_{\text{target}}(s'_i, \tilde{a}'_i) - \alpha \log \pi_{\boldsymbol{\phi}}(\tilde{a}'_i|s'_i)]$, and minimize squared error. The minimum over twin Q-networks mitigates overestimation as in TD3. The policy (lines 12-13) updates to match the Boltzmann distribution induced by the current Q-function, using the reparameterization trick for gradient estimation. Target networks update via EMA (lines 15-16) to stabilize training.

The stochastic policy serves the same amortization purpose as in DDPG and TD3: it replaces the intractable $\arg\max$ operation with a fast network forward pass. SAC’s entropy regularization produces exploration through the policy’s inherent stochasticity rather than external noise. This makes SAC more robust to hyperparameters and eliminates the need to tune exploration schedules.

Path Consistency Learning (PCL)

DDPG, TD3, and SAC all follow the same solution template from fitted Q-iteration: compute Bellman targets using the current Q-function, fit the Q-function to those targets, repeat. This is successive approximation, the function iteration approach $v_{k+1} = \Proj \Bellman v_k$ from the projection methods chapter.

Path Consistency Learning (PCL) (Nachum et al., 2017) solves the Bellman equation differently. Instead of iterating the operator, it directly minimizes a residual. This is the least-squares approach from projection methods: solve $\Residual(v) = 0$ by minimizing $\|\Residual(v)\|^2$. The method exploits special structure (smooth Bellman operators under deterministic dynamics) that conventional methods cannot leverage.

The Path Consistency Property

Consider the entropy-regularized Q-function Bellman equation from the smoothing chapter. Under general stochastic dynamics, it involves an expectation over next states:

$$q^*(s,a) = r(s,a) + \gamma \mathbb{E}_{s'}[v^*(s')]$$

Suppose the dynamics are deterministic: $s' = f(s,a)$. The next state is uniquely determined, so the expectation disappears:

$$q^*(s,a) = r(s,a) + \gamma v^*(f(s,a))$$

The value function relates to Q-functions through the soft-max:

$$v^*(s) = \alpha \log \int_{\mathcal{A}} \exp(q^*(s,a)/\alpha)\, da$$

Contrast two cases: general policies versus the optimal Boltzmann policy.

For general policies, the value equals an expectation:

$$v^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot|s)}[q^\pi(s,a) - \alpha \log \pi(a|s)]$$

This is an average. For a single observed action $a$, we have:

$$q^\pi(s,a) - \alpha\log\pi(a|s) = v^\pi(s) + \varepsilon(s,a)$$

where $\varepsilon(s,a)$ is sampling error with $\mathbb{E}_{a \sim \pi}[\varepsilon(s,a)] = 0$. Individual actions give noisy estimates that fluctuate around the mean.

For the optimal policy under entropy regularization, the Boltzmann structure produces an exact pointwise identity. The optimal policy is:

$$\pi^*(a|s) = \frac{\exp(q^*(s,a)/\alpha)}{\exp(v^*(s)/\alpha)}$$

Taking logarithms and rearranging:

$$v^*(s) = q^*(s,a) - \alpha \log \pi^*(a|s) \quad \text{for all } a$$

This holds exactly for every action $a$, not just in expectation. There is no sampling error. The advantage $q^*(s,a) - v^*(s)$ is encoded in the log-probability: suboptimal actions have low $q^*(s,a)$ but also large $-\alpha\log\pi^*(a|s)$ (low probability means large negative log-probability), and these terms balance exactly to give $v^*(s)$.

Now take a trajectory segment $(s_0, a_0, s_1, a_1, \ldots, s_d)$ where each transition follows the deterministic dynamics $s_{t+1} = f(s_t, a_t)$. Start with $q^*(s_0, a_0) = r_0 + \gamma v^*(s_1)$ and use equation (34) to substitute $v^*(s_1) = q^*(s_1,a_1) - \alpha\log\pi^*(a_1|s_1)$ exactly:

$$q^*(s_0, a_0) = r_0 + \gamma[q^*(s_1,a_1) - \alpha\log\pi^*(a_1|s_1)]$$

Substitute $q^*(s_1,a_1) = r_1 + \gamma v^*(s_2)$:

$$q^*(s_0, a_0) = r_0 + \gamma r_1 - \gamma\alpha\log\pi^*(a_1|s_1) + \gamma^2 v^*(s_2)$$

Continue this telescoping for $d$ steps. Each substitution is exact:

$$q^*(s_0, a_0) = \sum_{t=0}^{d-1} \gamma^t r_t - \alpha \sum_{t=1}^{d-1} \gamma^t \log\pi^*(a_t|s_t) + \gamma^d v^*(s_d)$$

Apply equation (34) once more to get $v^*(s_0) = q^*(s_0,a_0) - \alpha\log\pi^*(a_0|s_0)$:

$$v^*(s_0) = \sum_{t=0}^{d-1} \gamma^t [r_t - \alpha\log\pi^*(a_t|s_t)] + \gamma^d v^*(s_d)$$

Rearranging gives the path consistency residual:

$$R(s_0:s_d; \pi^*, v^*) = v^*(s_0) - \gamma^d v^*(s_d) - \sum_{t=0}^{d-1} \gamma^t[r_t - \alpha\log\pi^*(a_t|s_t)] = 0$$

The telescoping produces an exact identity: $R = 0$ for every action sequence, not just in expectation. The behavior policy never appears because the constraint holds as a deterministic identity for any observed $(s_0, a_0, \ldots, s_d)$. This enables off-policy learning without importance sampling.
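
A minimal sketch of the $d$-step residual for a single segment, given value estimates at the endpoints and per-step rewards and log-probabilities as plain arrays; names are illustrative.

```python
# Sketch: R = v(s_0) - gamma^d v(s_d) - sum_t gamma^t [r_t - alpha * log pi(a_t|s_t)]
import numpy as np

def path_residual(v0, vd, rewards, log_probs, gamma, alpha):
    d = len(rewards)
    discounts = gamma ** np.arange(d)
    path_term = np.sum(discounts * (np.asarray(rewards) - alpha * np.asarray(log_probs)))
    return v0 - gamma ** d * vd - path_term
```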

Remark 1 (Contrasting General Policies and Optimal Boltzmann Policies)

The distinction between equations (31) and (34) is subtle but crucial.

For general policies (equation (31)), the value is an average over actions sampled from the policy. Individual actions give noisy estimates: if we draw $a \sim \pi(\cdot|s)$, then $q^\pi(s,a) - \alpha\log\pi(a|s) = v^\pi(s) + \varepsilon$ where $\varepsilon$ is a zero-mean random variable. We need to average many samples to estimate $v^\pi(s)$ accurately. Multi-step telescoping would accumulate these sampling errors $\varepsilon_0, \varepsilon_1, \ldots, \varepsilon_{d-1}$, producing noisy residuals even at the true solution. Off-policy learning would require importance weights to correct for using actions from a different behavior policy.

For the optimal entropy-regularized policy (equation (34)), the Boltzmann structure collapses the expectation to a pointwise identity. The relationship $v^*(s) = q^*(s,a) - \alpha\log\pi^*(a|s)$ holds exactly for every action $a$, optimal or not. A suboptimal action has low $q^*(s,a)$ (low expected return) and low $\pi^*(a|s)$ (low probability), making $-\alpha\log\pi^*(a|s)$ large. These terms balance precisely to give $v^*(s)$. No sampling error exists. The telescoping is exact, producing a residual that equals zero for every action sequence, not just in expectation. Off-policy learning works because the constraint holds as a deterministic identity for any observed path.

This property is unique to soft-max operators. For the hard-max, the pointwise identity $v^*(s) = q^*(s,a)$ holds only when $a$ is optimal. Suboptimal actions satisfy $v^*(s) > q^*(s,a)$, an inequality that cannot be used to construct a residual.

Structural Requirements: Deterministic Dynamics and Entropy Regularization

PCL’s two structural requirements (deterministic dynamics and entropy regularization) are not arbitrary design choices. Each addresses a fundamental theoretical issue.

Deterministic Dynamics: Avoiding the Double Sampling Problem

Under stochastic dynamics, the Q-function Bellman equation has an expectation over next states:

$$q^*(s,a) = r(s,a) + \gamma \mathbb{E}_{s' \sim p(\cdot|s,a)}[v^*(s')]$$

The exact relationship (34) still holds, so we can write the path consistency constraint. But now consider what PCL minimizes: the squared residual $\mathbb{E}[R^2]$ where

$$R = v_{\boldsymbol{\phi}}(s_0) - \gamma^d v_{\boldsymbol{\phi}}(s_d) - \sum_{t=0}^{d-1} \gamma^t[r_t - \alpha\log\pi_{\boldsymbol{\theta}}(a_t|s_t)]$$

At the true optimum $(v^*, \pi^*)$, the constraint is $\mathbb{E}[R] = 0$, which implies $(\mathbb{E}[R])^2 = 0$. But PCL minimizes $\mathbb{E}[R^2]$, and by Jensen’s inequality:

$$\mathbb{E}[R^2] \geq (\mathbb{E}[R])^2$$

with equality only when $R$ has zero variance. Under stochastic dynamics, even at optimality, individual trajectory residuals are random variables with mean zero but positive variance (due to transition noise). Minimizing $\mathbb{E}[R^2]$ to zero would require driving $\text{Var}(R) \to 0$, which is impossible and pushes the solution away from the true optimum.

This is Baird’s double sampling problem (Baird, 1995). To get an unbiased gradient of $(\mathbb{E}[R])^2$, we need:

$$\nabla (\mathbb{E}[R])^2 = 2\mathbb{E}[R] \cdot \nabla \mathbb{E}[R] = 2\mathbb{E}[R] \cdot \mathbb{E}[\nabla R]$$

This requires two independent samples of the next state from the same $(s,a)$ pair: one for estimating $\mathbb{E}[R]$ and one for $\mathbb{E}[\nabla R]$. With a simulator, this is possible. With real trajectories, it is not.

Under deterministic dynamics, $R$ is deterministic (no transition noise), so $\mathbb{E}[R^2] = (\mathbb{E}[R])^2$ and Jensen’s inequality holds with equality. Minimizing the squared residual is equivalent to solving $\mathbb{E}[R] = 0$.

Entropy Regularization: Enabling All-Action Consistency

Attempt the same path consistency derivation with the hard-max Bellman operator. Under deterministic dynamics, the Q-function satisfies:

$$q^*(s,a) = r(s,a) + \gamma v^*(f(s,a))$$

where $v^*(s) = \max_{a'} q^*(s,a')$ and the optimal policy is $\pi^*(s) = \arg\max_a q^*(s,a)$ (deterministic).

Now try to relate $v^*(s)$ to an arbitrary observed action $a$. For the optimal action $a^* \in \arg\max_{a'} q^*(s,a')$, we have:

$$v^*(s) = q^*(s,a^*)$$

But for a suboptimal action $a \ne a^*$:

$$v^*(s) = \max_{a'} q^*(s,a') > q^*(s,a)$$

This is an inequality, not an equation. There is no formula expressing $v^*(s)$ in terms of $q^*(s,a)$ for suboptimal actions.

Attempt the multi-step telescoping. Start with $q^*(s_0, a_0) = r_0 + \gamma v^*(s_1)$. To continue, we need to express $v^*(s_1)$ using the observed action $a_1$. But we only have:

$$v^*(s_1) \geq q^*(s_1, a_1)$$

with equality only if $a_1$ happens to be optimal at $s_1$. We cannot substitute this into the Q-function equation to get an exact telescoping. The derivation breaks at the first step.

Compare this to the soft-max case. The Boltzmann structure gives equation (34): $v^*(s) = q^*(s,a) - \alpha\log\pi^*(a|s)$ for all actions $a$. The log-probability term compensates exactly for suboptimality: low-probability actions have large $-\alpha\log\pi^*(a|s)$, which adds to the low $q^*(s,a)$ to recover $v^*(s)$. This enables exact substitution at every step:

$$v^*(s_1) = q^*(s_1, a_1) - \alpha\log\pi^*(a_1|s_1) \quad \text{(exact for any } a_1\text{)}$$

The telescoping proceeds without inequalities or restrictions on which actions were chosen. Multi-step hard-max Q-learning lacks theoretical justification for off-policy data because when we observe a trajectory with suboptimal actions, we cannot write an exact path consistency constraint.

Both requirements are structural:

| Requirement | Addresses |
| --- | --- |
| Deterministic dynamics | Double sampling bias: ensures $\mathbb{E}[R^2] = (\mathbb{E}[R])^2$ |
| Entropy regularization | All-action consistency (equation (34)) |

Without deterministic dynamics, residual minimization is biased. Without entropy regularization, the constraint holds only for optimal actions.

The Learning Objective

Equation (39) provides a constraint that the optimal $(v^*, \pi^*)$ must satisfy: the residual equals zero for every observed path. For parametric approximations $(v_{\boldsymbol{\phi}}, \pi_{\boldsymbol{\theta}})$ that are not yet optimal, the residual is nonzero:

$$R(s_0:s_d; \boldsymbol{\theta}, \boldsymbol{\phi}) = v_{\boldsymbol{\phi}}(s_0) - \gamma^d v_{\boldsymbol{\phi}}(s_d) - \sum_{t=0}^{d-1} \gamma^t[r_t - \alpha \log \pi_{\boldsymbol{\theta}}(a_t|s_t)]$$

PCL minimizes the squared residual over observed path segments:

$$\min_{\boldsymbol{\theta}, \boldsymbol{\phi}} \sum_{\text{segments}} \frac{1}{2} R(s_i:s_{i+d}; \boldsymbol{\theta}, \boldsymbol{\phi})^2$$

This is the least-squares residual approach from the projection methods chapter. SAC computes targets $y_i$ and fits to them (successive approximation). PCL directly minimizes the residual without computing targets or performing a separate fitting step.

Gradient descent gives:

$$\begin{align} \boldsymbol{\theta}_{k+1} &= \boldsymbol{\theta}_k + \eta_\pi \sum_i R_i \cdot \alpha \sum_{t=0}^{d-1} \gamma^t \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}_k}(a_{i+t}|s_{i+t}) \\ \boldsymbol{\phi}_{k+1} &= \boldsymbol{\phi}_k - \eta_v \sum_i R_i \left[\nabla_{\boldsymbol{\phi}} v_{\boldsymbol{\phi}_k}(s_i) - \gamma^d \nabla_{\boldsymbol{\phi}} v_{\boldsymbol{\phi}_k}(s_{i+d})\right] \end{align}$$

where $R_i = R(s_i:s_{i+d}; \boldsymbol{\theta}_k, \boldsymbol{\phi}_k)$. Large residuals drive larger updates.

The algorithm collects trajectories from the current policy and stores them in a replay buffer. At each iteration, it samples a trajectory (possibly old) and performs gradient descent on the path residual for all $d$-step segments. The replay buffer enables off-policy learning: trajectories from old policies, expert demonstrations, or exploratory behavior all provide valid training signals.
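
A minimal sketch of one PCL update on a batch of $d$-step segments using autograd rather than the hand-written gradients above; the `log_prob` interface, tensor shapes, and the use of a single optimizer over both parameter sets are illustrative assumptions.

```python
# Sketch: minimize 0.5 * R^2 over sampled d-step segments directly;
# no target computation / regression split as in SAC.
import torch

def pcl_step(segments, value_net, policy_net, opt, gamma, alpha, d):
    loss = 0.0
    for states, actions, rewards in segments:     # states: d+1 entries, rewards: d entries
        logp = policy_net.log_prob(states[:-1], actions)          # log pi_theta(a_t|s_t)
        discounts = gamma ** torch.arange(d, dtype=rewards.dtype)
        path = torch.sum(discounts * (rewards - alpha * logp))
        R = value_net(states[0]) - gamma ** d * value_net(states[-1]) - path
        loss = loss + 0.5 * R ** 2
    opt.zero_grad()
    loss.backward()       # gradients flow into both theta and phi
    opt.step()
```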

Unified Parameterization: Single Q-Network

Algorithm 7 uses separate networks for policy and value. But we can use a single Q-network $q_{\boldsymbol{\theta}}(s,a)$ and derive both:

$$v_{\boldsymbol{\theta}}(s) = \alpha \log \sum_{a} \exp(q_{\boldsymbol{\theta}}(s,a)/\alpha), \qquad \pi_{\boldsymbol{\theta}}(a|s) = \frac{\exp(q_{\boldsymbol{\theta}}(s,a)/\alpha)}{\sum_{a'} \exp(q_{\boldsymbol{\theta}}(s,a')/\alpha)}$$

The path residual becomes:

$$R(s_i:s_{i+d}; \boldsymbol{\theta}) = v_{\boldsymbol{\theta}}(s_i) - \gamma^d v_{\boldsymbol{\theta}}(s_{i+d}) - \sum_{t=0}^{d-1} \gamma^t[r_{i+t} - \alpha \log \pi_{\boldsymbol{\theta}}(a_{i+t}|s_{i+t})]$$

and the gradient combines both value and policy contributions through the same parameters. This unified architecture eliminates the actor-critic separation: one Q-network serves both roles.
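
A minimal sketch of deriving both quantities from one Q-network for a finite action set (the continuous case would need a sampler or a flow); `q_values` is assumed to hold the Q-values of all actions at one state.

```python
# Sketch: v_theta(s) = alpha * logsumexp(q/alpha), log pi_theta(a|s) = (q - v)/alpha
import torch

def soft_value_and_policy(q_values, alpha):
    v = alpha * torch.logsumexp(q_values / alpha, dim=-1)   # soft value
    log_pi = (q_values - v.unsqueeze(-1)) / alpha            # Boltzmann log-probabilities
    return v, log_pi
```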

Connection to Existing Methods

Single-step case ($d=1$): The path residual becomes $R(s:s'; \boldsymbol{\theta}, \boldsymbol{\phi}) = v_{\boldsymbol{\phi}}(s) - \gamma v_{\boldsymbol{\phi}}(s') - r + \alpha\log\pi_{\boldsymbol{\theta}}(a|s)$. For the unified parameterization where $v_{\boldsymbol{\theta}}(s) = q_{\boldsymbol{\theta}}(s,a) - \alpha\log\pi_{\boldsymbol{\theta}}(a|s)$ exactly, this becomes $R = q_{\boldsymbol{\theta}}(s,a) - r - \gamma v_{\boldsymbol{\theta}}(s')$, the soft Bellman residual. Minimizing $\sum_i R_i^2$ is equivalent to soft Q-learning, though SAC solves this via successive approximation (compute targets, fit) rather than direct residual minimization.

No entropy ($\alpha \to 0$): The residual becomes $R = v(s_i) - \gamma^d v(s_{i+d}) - \sum_t \gamma^t r_t$, the negative $d$-step advantage. But unlike A2C/A3C where $v$ tracks the current policy’s value, PCL’s value converges to $v^*$ because the residual couples policy and value through the optimality condition.

Multi-step with hard-max: No analog exists. The hard-max Bellman operator $\max_a q(s,a)$ does not have an exact pointwise relationship like equation (34). Multi-step telescoping would accumulate errors from the max operator, making the constraint valid only in expectation under the optimal policy. The soft-max structure enables exact off-policy path consistency.

PCL vs SAC: Residual Minimization vs Successive Approximation

Both methods solve entropy-regularized MDPs but use fundamentally different solution strategies:

| Aspect | SAC | PCL |
| --- | --- | --- |
| Solution method | Successive approximation: compute targets $y_i$, fit $q$ to targets | Residual minimization: minimize $\sum_i R_i^2$ directly |
| Update structure | Target computation + regression step | Single gradient step on squared residual |
| Target networks | Required (mark outer-iteration boundaries) | None (residual constraint, not target fitting) |
| Temporal horizon | Single-step TD: $y = r + \gamma V(s')$ | Multi-step paths: accumulate over $d$ steps |
| Off-policy handling | Replay buffer with single-sample bias | No importance sampling (works for any trajectory) |
| Dynamics requirement | General stochastic transitions | Deterministic transitions $s' = f(s,a)$ |
| Architecture | Twin Q-networks + policy network | Single Q-network (unified parameterization) |

PCL requires deterministic dynamics. It gains multi-step telescoping and off-policy learning without importance weights, but only for deterministic systems (robotic manipulation, many control tasks). SAC works for general stochastic MDPs.

PCL as Amortization

PCL amortizes at a different level than DDPG/TD3/SAC. Those methods amortize the action maximization: learn a policy network that outputs $\arg\max_a q(s,a)$ directly. PCL amortizes the solution of the Bellman equation itself. Instead of repeatedly applying the Bellman operator (which requires $\int_{\mathcal{A}} \exp(q/\alpha)\, da$ at every iteration), PCL samples path segments and minimizes their residual. The computational cost of verifying optimality across all states and path lengths is distributed across training through sampled gradient updates.

Model Predictive Path Integral Control

SAC and PCL both learn policies that approximate the Boltzmann distribution $\pi^*(a|s) \propto \exp(\beta \cdot q^*(s,a))$ induced by entropy regularization. This amortization allows fast action selection at deployment: a single forward pass through the policy network. An alternative approach forgoes learning entirely and instead performs optimization at every decision.

Model Predictive Path Integral control (MPPI) (Williams et al., 2017) uses the Boltzmann weighting directly for action sequence selection. Given a dynamics model $s_{t+1} = f(s_t, a_t)$ and current state $s_0$, MPPI samples $K$ action sequences $\{\boldsymbol{a}^{(i)}\}_{i=1}^K$, rolls them out to get costs $C^{(i)} = \sum_{t=0}^{H-1} c(s_t^{(i)}, a_t^{(i)})$, and computes the optimal action as a weighted average:

$$a_0^* = \sum_{i=1}^K w^{(i)} a_0^{(i)}, \quad w^{(i)} = \frac{\exp(-C^{(i)}/\lambda)}{\sum_j \exp(-C^{(j)}/\lambda)}$$

where $\lambda > 0$ is a temperature parameter. The weighting $w^{(i)} \propto \exp(-C^{(i)}/\lambda)$ is exactly the Boltzmann distribution. MPPI solves the entropy-regularized objective:

$$\min_{\pi} \mathbb{E}_{\boldsymbol{a} \sim \pi}\left[\sum_{t=0}^{H-1} c(s_t, a_t)\right] - \lambda H(\pi)$$

where $\pi$ is the distribution over action sequences and $H(\pi) = -\mathbb{E}[\log \pi(\boldsymbol{a})]$ is its entropy, which enters as a bonus. The importance sampling estimate approximates the optimal action under this objective. The temperature $\lambda$ controls the trade-off between exploitation (focus on low-cost sequences) and exploration (maintain entropy).

The algorithm samples perturbed action sequences around a nominal trajectory $\{\bar{a}_t\}$ (often the previous optimal sequence, shifted forward). The Boltzmann weights assign high probability to low-cost sequences. After executing $a_0^*$, the agent observes the next state and replans.
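
A minimal sketch of one MPPI planning step in numpy, assuming a deterministic dynamics model `f(s, a)` and cost `c(s, a)`; all hyperparameter values are illustrative.

```python
# Sketch: sample perturbed sequences around a nominal plan, weight by exp(-cost/lambda),
# return the first action of the weighted-average sequence and the new nominal plan.
import numpy as np

def mppi_step(s0, nominal, f, c, K=256, lam=1.0, sigma=0.3):
    H, a_dim = nominal.shape
    noise = sigma * np.random.randn(K, H, a_dim)
    candidates = nominal[None] + noise                 # K perturbed action sequences
    costs = np.zeros(K)
    for i in range(K):                                 # roll out each sequence
        s = s0
        for t in range(H):
            costs[i] += c(s, candidates[i, t])
            s = f(s, candidates[i, t])
    w = np.exp(-(costs - costs.min()) / lam)           # Boltzmann weights (stabilized)
    w /= w.sum()
    a_star = np.tensordot(w, candidates, axes=1)       # weighted-average sequence
    return a_star[0], a_star                           # execute first action, keep plan
```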

MPPI as Non-Amortized Optimization

The contrast between MPPI and the methods in this chapter illuminates what amortization provides. SAC learns a policy $\pi_{\boldsymbol{\phi}}$ that approximates the Boltzmann distribution over actions at each state. PCL learns a Q-function from which the Boltzmann policy can be derived. Both invest computational effort during training to enable fast action selection at deployment: a single forward pass.

MPPI performs full optimization at every decision. At each state, it samples action sequences, weights them by exponentiated costs, and returns the weighted average. No learning occurs. The policy is implicitly defined by the optimization procedure itself.

This trade-off has practical consequences:

| Aspect | Amortized (SAC, PCL) | Non-Amortized (MPPI) |
| --- | --- | --- |
| Action selection | Single forward pass | $O(KH)$ model evaluations |
| Generalization | Policy generalizes across states | Optimization from scratch at each state |
| Model requirement | None (SAC) or deterministic (PCL) | Accurate dynamics model |
| Approximation error | Policy network approximation | None (exact optimization) |
| Adaptability | Requires retraining for new tasks | Adapts immediately to new cost functions |

MPPI excels at real-time control for systems with fast, accurate models (robotics, autonomous vehicles). The replanning handles model errors and disturbances without retraining. However, the per-step computation ($K \approx 100$-$1000$ rollouts) makes it expensive for complex dynamics or long horizons.

The entropy regularization that connects SAC, PCL, and MPPI is not coincidental. All three methods solve variants of the soft Bellman equation. SAC and PCL amortize the solution by learning value functions and policies. MPPI solves it directly through sampling. The Boltzmann weighting emerges in all cases as the optimal policy structure under entropy regularization.

An Alternative: Euler Equation Methods

The methods developed in this chapter all parameterize policies, but they remain rooted in the Bellman equation. NFQCA, DDPG, TD3, and SAC learn Q-functions through successive approximation, then derive policies by maximizing these Q-functions. PCL minimizes a path residual derived from the soft Bellman equation. The policy serves as an amortized optimizer for a value-based objective.

There is a different approach, developed in computational economics (Judd, 1992; Rust, 1996; Judd, 1998), that also parameterizes policies but solves an entirely different functional equation. Consider a control problem with continuous states and actions, deterministic dynamics $s' = f(s,a)$, and differentiable reward $r(s,a)$. The optimal action $\pi^*(s)$ satisfies the first-order condition:

$$\frac{\partial r(s,a)}{\partial a}\Big|_{a=\pi^*(s)} + \gamma\, \frac{\partial v^*(s')}{\partial s'}\Big|_{s'=f(s,\pi^*(s))}\, \frac{\partial f(s,a)}{\partial a}\Big|_{a=\pi^*(s)} = 0.$$

This Euler equation expresses optimality through derivatives rather than through the max operator. For problems with special structure (the Euler class, where dynamics are affine in the controlled state), envelope theorems eliminate $v^*$ entirely, yielding a closed functional equation $\mathcal{E}(\pi)(s) = 0$ in the policy alone.

With a parameterized policy $\pi_{\boldsymbol{\theta}}(s)$, we can discretize via collocation or Galerkin projection:

$$G(\boldsymbol{\theta}) := \begin{bmatrix} \mathcal{E}(\pi_{\boldsymbol{\theta}})(s_1) \\ \vdots \\ \mathcal{E}(\pi_{\boldsymbol{\theta}})(s_N) \end{bmatrix} = 0.$$

This is root-finding, not fixed-point iteration. Newton-type methods replace the successive approximation of fitted Q-iteration. The Euler operator is not a contraction, so convergence guarantees are problem-dependent.
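
A minimal sketch of the collocation system solved with a Newton-type root finder, assuming the modeler supplies a policy parameterization `pi(s, theta)` and a residual function `euler_residual(policy, s)`; everything here is illustrative.

```python
# Sketch: solve G(theta) = 0 at collocation states with scipy's hybrid Newton method.
import numpy as np
from scipy.optimize import root

def solve_euler_collocation(euler_residual, pi, theta0, collocation_states):
    def G(theta):
        return np.array([euler_residual(lambda s: pi(s, theta), s)
                         for s in collocation_states])
    # "hybr" needs a square system: len(theta0) == len(collocation_states)
    sol = root(G, theta0, method="hybr")
    return sol.x, sol.success
```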

What does this mean for reinforcement learning? The Euler approach shares the amortization idea: learn a policy network that directly outputs actions. But the training objective comes from first-order optimality conditions rather than from Bellman residuals or Q-function maximization. This raises questions worth considering. Could Euler-style objectives provide useful training signals for actor-critic methods? When dynamics are known or learned, could first-order conditions offer advantages over value-based objectives? The connection between these traditions remains underexplored.

Summary

When actions are continuous, computing $\max_{a} q(s,a;\boldsymbol{\theta})$ at each Bellman target requires solving a nonlinear program. Amortization replaces this runtime optimization with a learned policy network: a single forward pass instead of an optimization loop. The FQI structure from the previous chapter remains intact, with the policy network providing actions for target computation.

NFQCA, DDPG, TD3, and SAC use successive approximation: compute Bellman targets, fit the Q-function to the targets, update the policy to maximize the Q-function. Deterministic policies (DDPG, TD3) require external exploration noise; stochastic policies (SAC) explore through their inherent randomness. TD3 and SAC use twin Q-networks with $\min(q^1, q^2)$ to mitigate overestimation. PCL takes a different approach, minimizing the path consistency residual directly rather than iterating the Bellman operator. This requires deterministic dynamics but enables multi-step constraints without importance sampling.

MPPI forgoes learning entirely, performing Boltzmann-weighted optimization at every decision. This avoids policy approximation error but requires $O(KH)$ model rollouts per action. SAC, PCL, and MPPI all solve entropy-regularized objectives; SAC and PCL amortize the solution while MPPI computes it directly.

The next chapter takes a different approach: parameterize the policy directly and optimize via gradient ascent on expected return. The starting point is derivative estimation for stochastic optimization rather than Bellman equations, though value functions return as variance reduction tools in actor-critic methods.

References
  1. Hafner, R., & Riedmiller, M. (2011). Reinforcement learning in feedback control: Challenges and benchmarks from technical process control. Machine Learning, 84(1–2), 137–169. 10.1007/s10994-011-5235-x
  2. Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2015). Continuous Control with Deep Reinforcement Learning. arXiv Preprint arXiv:1509.02971.
  3. Fujimoto, S., van Hoof, H., & Meger, D. (2018). Addressing Function Approximation Error in Actor-Critic Methods. International Conference on Machine Learning (ICML), 1587–1596.
  4. Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. Proceedings of the 35th International Conference on Machine Learning (ICML), 1861–1870.
  5. Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic algorithms and applications. arXiv Preprint arXiv:1812.05905.
  6. Nachum, O., Norouzi, M., Xu, K., & Schuurmans, D. (2017). Bridging the Gap Between Value and Policy Based Reinforcement Learning. Advances in Neural Information Processing Systems, 30, 2775–2785.
  7. Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. Proceedings of the Twelfth International Conference on Machine Learning, 30–37.
  8. Williams, G., Aldrich, A., & Theodorou, E. A. (2017). Model Predictive Path Integral Control: From Theory to Parallel Computation. Journal of Guidance, Control, and Dynamics, 40(2), 344–357. 10.2514/1.G001921
  9. Judd, K. L. (1992). Projection methods for solving aggregate growth models. Journal of Economic Theory, 58(2), 410–452.
  10. Rust, J. (1996). Chapter 14 Numerical dynamic programming in economics. In Handbook of Computational Economics (pp. 619–729). Elsevier. 10.1016/s1574-0021(96)01016-7
  11. Judd, K. L. (1998). Numerical Methods in Economics. MIT Press.