
Policy Parametrization Methods

In the previous chapter, we explored various approaches to approximate dynamic programming, focusing on ways to handle large state spaces through function approximation. However, these methods still face significant challenges when dealing with large or continuous action spaces. The need to maximize over actions during the Bellman operator evaluation becomes computationally prohibitive as the action space grows.

This chapter explores a natural evolution of these ideas: rather than exhaustively searching over actions, we can parameterize and directly optimize the policy itself. We begin by examining how fitted Q methods, while effective for handling large state spaces, still struggle with action space complexity.

Embedded Optimization

Recall that in fitted Q methods, the main idea is to compute the Bellman operator only at a subset of all states, relying on function approximation to generalize to the remaining states. At each step of the successive approximation loop, we build a dataset of input state-action pairs mapped to their corresponding optimality operator evaluations:

Dn={((s,a),(Lq)(s,a;θn))(s,a)B}\mathcal{D}_n = \{((s, a), (Lq)(s, a; \boldsymbol{\theta}_n)) \mid (s,a) \in \mathcal{B}\}

This dataset is then fed to our function approximator (neural network, random forest, linear model) to obtain the next set of parameters:

θn+1fit(Dn)\boldsymbol{\theta}_{n+1} \leftarrow \texttt{fit}(\mathcal{D}_n)

While this strategy allows us to handle very large or even infinite (continuous) state spaces, it still requires maximizing over actions (maxaA\max_{a \in A}) during the dataset creation when computing the operator LL for each basepoint. This maximization becomes computationally expensive for large action spaces. A natural improvement is to add another level of optimization: for each sample added to our regression dataset, we can employ numerical optimization methods to find actions that maximize the Bellman operator for the given state.

The above pseudocode introduces a generic maximize\texttt{maximize} routine which represents any numerical optimization method that searches for an action maximizing the given function. This approach is versatile and can be adapted to different types of action spaces. For continuous action spaces, we can employ standard nonlinear optimization methods like gradient descent or L-BFGS (e.g., using scipy.optimize.minimize). For large discrete action spaces, we can use integer programming solvers - linear integer programming if the Q-function approximator is linear in actions, or mixed-integer nonlinear programming (MINLP) solvers for nonlinear Q-functions. The choice of solver depends on the structure of our Q-function approximator and the constraints on our action space.
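To make this concrete, here is a minimal sketch of such a maximize routine for a box-constrained continuous action space, using scipy.optimize.minimize on the negated Q-function. The Q-function signature q(s, a, theta) and the bounds format are assumptions made for illustration.

import numpy as np
from scipy.optimize import minimize

def maximize_q(q, s, theta, bounds, n_restarts=5, rng=None):
    """Numerically search for an action maximizing a -> q(s, a, theta) over a box."""
    rng = np.random.default_rng() if rng is None else rng
    lo = np.array([b[0] for b in bounds])
    hi = np.array([b[1] for b in bounds])
    best_a, best_val = None, -np.inf
    for _ in range(n_restarts):  # random restarts guard against local optima
        a0 = rng.uniform(lo, hi)
        res = minimize(lambda a: -q(s, a, theta), a0, bounds=bounds, method="L-BFGS-B")
        if -res.fun > best_val:
            best_a, best_val = res.x, -res.fun
    return best_a, best_val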

Amortized Optimization Approach

This process is computationally intensive. A natural question is whether we can “amortize” some of this computation by replacing the explicit optimization for each sample with a direct mapping that gives us an approximate maximizer directly. For Q-functions, recall that the operator is given by:

(Lq)(s,a) = r(s,a) + \gamma \int p(ds'|s,a)\max_{a' \in \mathcal{A}(s')} q(s', a')

If qq^* is the optimal state-action value function, then v(s)=maxaq(s,a)v^*(s) = \max_a q^*(s,a), and we can derive the optimal policy directly by computing the decision rule:

π(s)=argmaxaA(s)q(s,a)\pi^\star(s) = \arg\max_{a \in \mathcal{A}(s)} q^\star(s,a)

Since qq^* is a fixed point of LL, we can write:

q(s,a)=(Lq)(s,a)=r(s,a)+γp(dss,a)maxaA(s)q(s,a)=r(s,a)+γp(dss,a)q(s,π(s))\begin{align*} q^\star(s,a) &= (Lq^*)(s,a) \\ &= r(s,a) + \gamma \int p(ds'|s,a) \max_{a' \in \mathcal{A}(s')} q^\star(s', a') \\ &= r(s,a) + \gamma \int p(ds'|s,a) q^\star(s', \pi^\star(s')) \end{align*}

Note that π\pi^\star is implemented by our maximize\texttt{maximize} numerical solver in the procedure above. A practical strategy would be to collect these maximizer values at each step and use them to train a function approximator that directly predicts these solutions. Due to computational constraints, we might want to compute these exact maximizer values only for a subset of states, based on some computational budget, and use the fitted decision rule to generalize to the remaining states. This leads to the following amortized version:

An important observation about this procedure is that the policy d(s;w)d(s; \boldsymbol{w}) is being trained on a dataset Dd\mathcal{D}_d containing optimal actions computed with respect to an evolving Q-function. Specifically, at iteration n, we collect pairs (s,as)(s', a^*_{s'}) where as=argmaxaq(s,a;θn)a^*_{s'} = \arg\max_a q(s', a; \boldsymbol{\theta}_n). However, after updating to θn+1\boldsymbol{\theta}_{n+1}, these actions may no longer be optimal with respect to the new Q-function.

A natural approach to handle this staleness would be to maintain only the most recent optimization data. We could modify our procedure to keep a sliding window of K iterations, where at iteration n, we only use data from iterations max(0, n-K) to n. This would be implemented by augmenting each entry in Dd\mathcal{D}_d with a timestamp:

Dπt={(s,as,t)t{nK,,n}}\mathcal{D}_\pi^t = \{(s', a^*_{s'}, t) \mid t \in \{n-K,\ldots,n\}\}

where t indicates the iteration at which the optimal action was computed. When fitting the policy network, we would then only use data points that are at most K iterations old:

wn+1fit({(s,as)(s,as,t)Dπt,nKtn})\boldsymbol{w}_{n+1} \leftarrow \texttt{fit}(\{(s', a^*_{s'}) \mid (s', a^*_{s'}, t) \in \mathcal{D}_\pi^t, n-K \leq t \leq n\})

This introduces a trade-off between using more data (larger K) versus using more recent, accurate data (smaller K). The choice of K would depend on how quickly the Q-function evolves and the computational budget available for computing exact optimal actions.
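As a small illustration of this bookkeeping, the sliding-window filter could be implemented as follows; the (state, action, iteration) tuple layout mirrors the definition above and is otherwise an assumption.

def filter_recent(policy_dataset, n, K):
    """Keep only the (state, action) pairs computed within the last K iterations."""
    return [(s, a) for (s, a, t) in policy_dataset if n - K <= t <= n]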

Now the main issue with this approach, apart from the intrinsic out-of-distribution drift that we are trying to track, is that it requires “ground truth” - samples of optimal actions computed by the actual solver. This raises a natural question: how few samples do we actually need? Could we even envision eliminating the solver entirely? What seems impossible at first glance turns out to be achievable. The intuition is that as our policy improves at selecting actions, we can bootstrap from these increasingly better choices. As we continuously amortize these improving actions over time, it creates a virtuous cycle of self-improvement towards the optimal policy. But for this bootstrapping process to work, we need careful management. Move too quickly and the process may become unstable. Let’s examine how this balance can be achieved.

Deterministic Parametrized Policies

In this section, we consider deterministic parametrized policies of the form d(s;w)d(s; \boldsymbol{w}) which directly output an action given a state. This approach differs from stochastic policies that output probability distributions over actions, making it particularly suitable for continuous control problems where the optimal policy is often deterministic. We’ll see how fitted Q-value methods can be naturally extended to simultaneously learn both the Q-function and such a deterministic policy.

The Amortization Problem for Continuous Actions

When actions are continuous, aRda \in \mathbb{R}^d, extracting a greedy policy from a Q-function becomes computationally expensive. Consider a robot arm control task where the action is a dd-dimensional torque vector. To act greedily given Q-function q(s,a;θ)q(s,a; \boldsymbol{\theta}), we must solve:

π(s)=argmaxaAq(s,a;θ),\pi(s) = \arg\max_{a \in \mathcal{A}} q(s, a; \boldsymbol{\theta}),

where ARd\mathcal{A} \subset \mathbb{R}^d is a continuous set (often a box or polytope). This requires running an optimization algorithm at every time step. For neural network Q-functions, this means solving a nonlinear program whose objective involves forward passes through the network.

This computation happens at inference time. After training converges, we deploy the agent and it must select actions in real-time. Running interior-point methods or gradient-based optimizers at every decision creates unacceptable latency, especially in high-frequency control where decisions occur at 100Hz or faster.

The solution is to amortize the optimization cost by learning a separate policy network d(s;w)d(s; \boldsymbol{w}) that directly outputs actions. During training, we optimize w\boldsymbol{w} so that d(s;w)argmaxaq(s,a;θ)d(s; \boldsymbol{w}) \approx \arg\max_a q(s,a; \boldsymbol{\theta}) for states we encounter. At deployment, action selection reduces to a single forward pass through the policy network: a=d(s;w)a = d(s; \boldsymbol{w}). The computational cost of optimization is paid during training (where time is less constrained) rather than at inference.

This introduces a second approximation beyond the Q-function. We now have two function approximators: a critic q(s,a;θ)q(s,a; \boldsymbol{\theta}) that estimates values, and an actor d(s;w)d(s; \boldsymbol{w}) that selects actions. The critic is trained using Bellman targets as in standard fitted Q-iteration. The actor is trained to maximize the critic:

ww+αEs[wq(s,d(s;w);θ)],\boldsymbol{w} \leftarrow \boldsymbol{w} + \alpha \mathbb{E}_s \left[\nabla_{\boldsymbol{w}} q(s, d(s; \boldsymbol{w}); \boldsymbol{\theta})\right],

where the expectation is over states in the dataset or replay buffer. This gradient ascent pushes the actor toward actions that the critic considers valuable. By the chain rule, this equals (aq(s,a;θ)a=d(s;w))(wd(s;w))(\nabla_a q(s,a; \boldsymbol{\theta})|_{a=d(s;\boldsymbol{w})}) \cdot (\nabla_{\boldsymbol{w}} d(s; \boldsymbol{w})), which can be efficiently computed via backpropagation through the composition of the two networks.
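As a sketch of how this chain rule is handled in practice, the composition can be differentiated end-to-end with automatic differentiation (here in JAX); q and d stand in for arbitrary differentiable critic and actor functions, and all names are illustrative.

import jax
import jax.numpy as jnp

def actor_objective(w, theta, states, q, d):
    # Average critic value at the actor's actions. We return the negation so
    # that a gradient *descent* step on this loss is an ascent step on q.
    actions = jax.vmap(lambda s: d(w, s))(states)
    values = jax.vmap(lambda s, a: q(theta, s, a))(states, actions)
    return -jnp.mean(values)

actor_grad = jax.grad(actor_objective)  # backpropagates through q(s, d(s; w))
# One ascent step on q: w <- w - alpha * actor_grad(w, theta, states, q, d)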

Neural Fitted Q-iteration for Continuous Actions (NFQCA)

To develop this approach, let’s first consider an idealized setting where we have access to qq^\star, the optimal Q-function. Then we can state our goal as finding policy parameters w\boldsymbol{w} that maximize qq^\star with respect to the actions chosen by our policy across the state space:

maxwq(s,d(s;w))for all s\max_{\boldsymbol{w}} q^*(s, d(s; \boldsymbol{w})) \quad \text{for all } s

However, it’s computationally infeasible to satisfy this condition for every possible state ss, especially in large or continuous state spaces. To address this, we assume a distribution of states, denoted μ(s)\mu(s), and take the expectation, leading to the problem:

maxwEsμ(s)[q(s,d(s;w))]\max_{\boldsymbol{w}} \mathbb{E}_{s \sim \mu(s)}[q^*(s, d(s; \boldsymbol{w}))]

However, in practice we do not have access to q^*. Instead, we approximate q^* with a Q-function q(s, a; \boldsymbol{\theta}), parameterized by \boldsymbol{\theta}, which we learn simultaneously with the policy function d(s; \boldsymbol{w}). Given a sample of initial states drawn from \mu, we then maximize this objective via a Monte Carlo surrogate problem:

maxwEsμ(s)[q(s,d(s;w);θ)]maxw1BsBq(s,d(s;w);θ)\max_{\boldsymbol{w}} \mathbb{E}_{s \sim \mu(s)}[q(s, d(s; \boldsymbol{w}); \boldsymbol{\theta})] \approx \max_{\boldsymbol{w}} \frac{1}{|\mathcal{B}|} \sum_{s \in \mathcal{B}} q(s, d(s; \boldsymbol{w}); \boldsymbol{\theta})

When using neural networks to parametrize qq and dd, we obtain the Neural Fitted Q-Iteration with Continuous Actions (NFQCA) algorithm proposed by Hafner & Riedmiller (2011).

In practice, both the fit and minimize operations above are implemented using gradient descent. For the Q-function, the fit operation minimizes the mean squared error between the network’s predictions and the target values:

fit(Dq)=argminθ1Dq((s,a),y)Dq(q(s,a;θ)y)2\texttt{fit}(\mathcal{D}_q) = \arg\min_{\boldsymbol{\theta}} \frac{1}{|\mathcal{D}_q|} \sum_{((s,a), y) \in \mathcal{D}_q} (q(s,a; \boldsymbol{\theta}) - y)^2

For the policy update, the minimize operation uses gradient descent on the composition of the “critic” network qq and the “actor” network dd. This results in the following update rule:

wn+1=wn+αw(1B(s,a,r,s)Bq(s,d(s;w);θn+1))\boldsymbol{w}_{n+1} = \boldsymbol{w}_n + \alpha \nabla_{\boldsymbol{w}} \left(\frac{1}{|\mathcal{B}|} \sum_{(s,a,r,s') \in \mathcal{B}} q(s, d(s; \boldsymbol{w}); \boldsymbol{\theta}_{n+1})\right)

where α\alpha is the learning rate. Both operations can be efficiently implemented using modern automatic differentiation libraries and stochastic gradient descent variants like Adam or RMSProp.
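Putting the two updates together, one iteration of this scheme might look like the following sketch (reusing the conventions above); q(theta, s, a) and d(w, s) are assumed to be differentiable approximators with pytree parameters, and the arrays come from the dataset of basepoints.

import jax
import jax.numpy as jnp

def critic_loss(theta, states, actions, targets, q):
    # Mean squared error between predictions and the Bellman targets y
    preds = jax.vmap(lambda s, a: q(theta, s, a))(states, actions)
    return jnp.mean((preds - targets) ** 2)

def actor_loss(w, theta, states, q, d):
    # Negative average critic value at the actor's actions
    acts = jax.vmap(lambda s: d(w, s))(states)
    return -jnp.mean(jax.vmap(lambda s, a: q(theta, s, a))(states, acts))

def sgd_step(params, grads, lr):
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

def nfqca_iteration(theta, w, states, actions, targets, q, d, lr=1e-3):
    theta = sgd_step(theta, jax.grad(critic_loss)(theta, states, actions, targets, q), lr)
    w = sgd_step(w, jax.grad(actor_loss)(w, theta, states, q, d), lr)
    return theta, w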

Deep Deterministic Policy Gradient (DDPG)

Just as DQN adapted Neural Fitted Q-Iteration to the online setting, DDPG Lillicrap et al. (2015) extends NFQCA to learn from data collected online. Like NFQCA, DDPG simultaneously learns a Q-function and a deterministic policy that maximizes it, but differs in how it collects and processes data.

Instead of maintaining a fixed set of basepoints, DDPG uses a replay buffer that continuously stores new transitions as the agent interacts with the environment. Since the policy is deterministic, exploration becomes challenging. DDPG addresses this by adding noise to the policy’s actions during data collection:

a=d(s;w)+Na = d(s; \boldsymbol{w}) + \mathcal{N}

where N\mathcal{N} represents exploration noise drawn from an Ornstein-Uhlenbeck (OU) process. The OU process is particularly well-suited for control tasks as it generates temporally correlated noise, leading to smoother exploration trajectories compared to independent random noise. It is defined by the stochastic differential equation:

dNt=θ(μNt)dt+σdWtd\mathcal{N}_t = \theta(\mu - \mathcal{N}_t)dt + \sigma dW_t

where μ\mu is the long-term mean value (typically set to 0), θ\theta determines how strongly the noise is pulled toward this mean, σ\sigma scales the random fluctuations, and dWtdW_t is a Wiener process (continuous-time random walk). For implementation, we discretize this continuous-time process using the Euler-Maruyama method:

Nt+1=Nt+θ(μNt)Δt+σΔtϵt\mathcal{N}_{t+1} = \mathcal{N}_t + \theta(\mu - \mathcal{N}_t)\Delta t + \sigma\sqrt{\Delta t}\epsilon_t

where Δt\Delta t is the time step and ϵtN(0,1)\epsilon_t \sim \mathcal{N}(0,1) is standard Gaussian noise. Think of this process like a spring mechanism: when the noise value Nt\mathcal{N}_t deviates from μ\mu, the term θ(μNt)Δt\theta(\mu - \mathcal{N}_t)\Delta t acts like a spring force, continuously pulling it back. Unlike a spring, however, this return to μ\mu is not oscillatory - it’s more like motion through a viscous fluid, where the force simply decreases as the noise gets closer to μ\mu. The random term σΔtϵt\sigma\sqrt{\Delta t}\epsilon_t then adds perturbations to this smooth return trajectory. This creates noise that wanders away from μ\mu (enabling exploration) but is always gently pulled back (preventing the actions from wandering too far), with θ\theta controlling the strength of this pulling force.
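A minimal sketch of this discretized OU noise generator; the parameter values are arbitrary illustrations.

import numpy as np

def ou_noise(n_steps, theta=0.15, mu=0.0, sigma=0.2, dt=1e-2, x0=0.0, rng=None):
    """Simulate the OU process with the Euler-Maruyama scheme above."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.empty(n_steps)
    x_prev = x0
    for t in range(n_steps):
        x_prev = x_prev + theta * (mu - x_prev) * dt + sigma * np.sqrt(dt) * rng.standard_normal()
        x[t] = x_prev
    return x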

The policy gradient update follows the same principle as NFQCA:

wEsμ(s)[q(s,d(s;w);θ)]\nabla_{\boldsymbol{w}} \mathbb{E}_{s \sim \mu(s)}[q(s, d(s; \boldsymbol{w}); \boldsymbol{\theta})]

We then embed this exploration mechanism into the data collection procedure and use the same flattened FQI structure that we adopted in DQN. Similar to DQN, flattening the outer-inner optimization structure leads to the need for target networks - both for the Q-function and the policy.

Twin Delayed Deep Deterministic Policy Gradient (TD3)

While DDPG provided a foundation for continuous control with deep RL, it suffers from similar overestimation issues as DQN. TD3 Fujimoto et al. (2018) addresses these challenges through three key modifications: double Q-learning to reduce overestimation bias, delayed policy updates to reduce per-update error, and target policy smoothing to prevent exploitation of Q-function errors.

Similar to Double Q-learning, TD3 decouples action selection from evaluation when forming the targets. However, instead of reusing the existing online and target networks for this purpose, TD3 learns two Q-functions simultaneously and uses the minimum of the two when computing target values, which further combats overestimation bias.

Furthermore, when computing target Q-values, TD3 adds small random noise to the target policy’s actions and clips it to keep the perturbations bounded. This regularization technique essentially implements a form of “policy smoothing” that prevents the policy from exploiting areas where the Q-function may have erroneously high values:

$$\tilde{a} = d(s'; \boldsymbol{w}_{target}) + \text{clip}(\mathcal{N}(0, \sigma), -c, c)$$

While DDPG used the OU process which generates temporally correlated noise, TD3’s authors found that simple uncorrelated Gaussian noise works just as well for exploration. It is also easier to implement and tune since you only need to set a single parameter (σexplore\sigma_{explore}) for exploration rather than the multiple parameters required by the OU process (θ\theta, μ\mu, σ\sigma).

Finally, TD3 updates the policy network (and target networks) less frequently than the Q-networks, typically once every dd Q-function updates. This helps reduce the per-update error and gives the Q-functions time to become more accurate before they are used to update the policy.
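Putting the three ingredients together, the TD3 target for a single transition could be computed as in the following sketch; the function names (d_target, q1_target, q2_target) and the scalar transition format are illustrative assumptions.

import numpy as np

def td3_target(r, s_next, done, d_target, q1_target, q2_target,
               gamma=0.99, sigma=0.2, c=0.5, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    a = d_target(s_next)
    noise = np.clip(sigma * rng.standard_normal(np.shape(a)), -c, c)  # target policy smoothing
    a_tilde = a + noise
    q_min = np.minimum(q1_target(s_next, a_tilde), q2_target(s_next, a_tilde))  # clipped double Q
    return r + gamma * (1.0 - done) * q_min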

Soft Actor-Critic

Adapting the intuition of NFQCA to the smooth Bellman optimality equations leads us to the soft actor-critic algorithm Haarnoja et al. (2018). To understand this connection, let’s first examine how the smooth Bellman equations follow from entropy regularization.

Consider the standard Bellman operator augmented with an entropy term. The smooth Bellman operator Lβ\mathrm{L}_\beta takes the form:

(\mathrm{L}_\beta v)(s) = \max_{\pi \in \Pi^{MR}}\{\mathbb{E}_{a \sim \pi}[r(s,a) + \gamma v(s')] + \beta\mathcal{H}(\pi)\}

where H(π)=Eaπ[logπ(as)]\mathcal{H}(\pi) = -\mathbb{E}_{a \sim \pi}[\log \pi(a|s)] represents the entropy of the policy. To find the solution to the optimization problem embedded in the operator Lβ\mathrm{L}_\beta, we set the functional derivative of the objective with respect to the decision rule to zero:

δδπ(as)[Aπ(as)(r(s,a)+γv(s))daβAπ(as)logπ(as)da]=0\frac{\delta}{\delta \pi(a|s)} \left[\int_A \pi(a|s)(r(s,a) + \gamma v(s'))da - \beta\int_A \pi(a|s)\log \pi(a|s)da \right] = 0

Enforcing the normalization constraint \int_A \pi(a|s)da = 1 with a Lagrange multiplier \lambda(s) and setting the derivative of the Lagrangian to zero gives the stationarity condition:

r(s,a)+γv(s)β(1+logπ(as))λ(s)=0r(s,a) + \gamma v(s') - \beta(1 + \log \pi(a|s)) - \lambda(s) = 0

Solving for π\pi shows that the optimal policy is a Boltzmann distribution

π(as)=exp(1β(r(s,a)+γEs[v(s)]))Z(s)\pi^*(a|s) = \frac{\exp(\frac{1}{\beta}(r(s,a) + \gamma \mathbb{E}_{s'}[v(s')]))}{Z(s)}

When we substitute this optimal policy back into the entropy-regularized objective, we obtain:

v(s)=Ead[r(s,a)+γv(s)]+βH(d)=βlogAexp(1β(r(s,a)+γEs[v(s)]))da\begin{align*} v(s) &= \mathbb{E}_{a \sim d^*}[r(s,a) + \gamma v(s')] + \beta\mathcal{H}(d^*) \\ &= \beta \log \int_A \exp(\frac{1}{\beta}(r(s,a) + \gamma \mathbb{E}_{s'}[v(s')]))da \end{align*}

As we saw at the beginning of this chapter, the smooth Bellman optimality operator for Q-factors is defined as:

(\mathrm{L}_\beta q)(s,a) = r(s,a) + \gamma \mathbb{E}_{s'}\left[\beta \log \int_A \exp\left(\frac{1}{\beta}q(s',a')\right)da'\right]

This operator maintains the contraction property of its standard counterpart, guaranteeing a unique fixed point qq^*. The optimal policy takes the form:

d(as)=exp(1βq(s,a))Z(s)d^*(a|s) = \frac{\exp(\frac{1}{\beta}q^*(s,a))}{Z(s)}

where Z(s)=Aexp(1βq(s,a))daZ(s) = \int_A \exp(\frac{1}{\beta}q^*(s,a))da. The optimal value function can be recovered as:

v(s)=βlogAexp(1βq(s,a))dav^*(s) = \beta \log \int_A \exp(\frac{1}{\beta}q^*(s,a))da

Fitted Q-Iteration for the Smooth Bellman Equations

Following the principles of fitted value iteration, we can approximate the effect of the smooth Bellman operator by computing it exactly at a number of basepoints and generalizing elsewhere using function approximation. Concretely, given a collection of states s_i and actions a_i, we would compute regression target values:

yi=r(si,ai)+γEs[βlogAexp(1βqθ(s,a))da]y_i = r(s_i,a_i) + \gamma \mathbb{E}_{s'}\left[\beta \log \int_A \exp(\frac{1}{\beta}q_\theta(s',a'))da'\right]

and fit our Q-function approximator by minimizing:

minθi(qθ(si,ai)yi)2\min_\theta \sum_i (q_\theta(s_i,a_i) - y_i)^2

The expectation over next states can be handled through Monte Carlo estimation using samples from the environment: given a transition (si,ai,si)(s_i,a_i,s'_i), we can approximate:

Es[βlogAexp(1βqθ(s,a))da]βlogAexp(1βqθ(si,a))da\mathbb{E}_{s'}\left[\beta \log \int_A \exp(\frac{1}{\beta}q_\theta(s',a'))da'\right] \approx \beta \log \int_A \exp(\frac{1}{\beta}q_\theta(s'_i,a'))da'

However, we still face the challenge of computing the integral over actions. This motivates maintaining separate function approximators for both Q and V, using samples from the current policy to estimate the value function:

vψ(s)Ead(s;ϕ)[qθ(s,a)βlogd(as;ϕ)]v_\psi(s) \approx \mathbb{E}_{a \sim d(\cdot|s;\phi)}\left[q_\theta(s,a) - \beta \log d(a|s;\phi)\right]

By maintaining both approximators, we can estimate targets using sampled actions from our policy. Specifically, if we have a transition (si,ai,si)(s_i,a_i,s'_i) and sample aid(si;ϕ)a'_i \sim d(\cdot|s'_i;\phi), our target becomes:

yi=r(si,ai)+γ(qθ(si,ai)βlogd(aisi;ϕ))y_i = r(s_i,a_i) + \gamma\left(q_\theta(s'_i,a'_i) - \beta \log d(a'_i|s'_i;\phi)\right)

This approach is possible only because of the dual representation of the smooth Bellman equations as an entropy-regularized problem, which transforms the intractable log-sum-exp into a form we can estimate efficiently through sampling.
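As a sketch, the sampled target above can be computed as follows; sample_action(s') is an assumed helper that draws an action from the current policy and returns it together with its log-probability.

def soft_target(r, s_next, q_theta, sample_action, beta, gamma=0.99):
    a_next, logp = sample_action(s_next)  # a' ~ d(.|s'; phi) and log d(a'|s'; phi)
    return r + gamma * (q_theta(s_next, a_next) - beta * logp)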

Approximating Boltzmann Policies by Gaussians

The entropy-regularized objective and the smooth Bellman equation are mathematically equivalent. However, both formulations face a practical challenge: they require evaluating an intractable integral due to the Boltzmann distribution. Soft Actor-Critic (SAC) addresses this problem by approximating the optimal policy with a simpler, more tractable Gaussian distribution. Given the optimal soft policy:

d(as)=exp(1βq(s,a))Z(s)d^*(a|s) = \frac{\exp(\frac{1}{\beta}q^*(s,a))}{Z(s)}

we seek to approximate it with a Gaussian policy:

d(as;ϕ)=N(μϕ(s),σϕ(s))d(a|s;\phi) = \mathcal{N}(\mu_\phi(s), \sigma_\phi(s))

This approximation task naturally raises the question of how to measure the “closeness” between the target Boltzmann distribution and a candidate Gaussian approximation. Following common practice in deep learning, we employ the Kullback-Leibler (KL) divergence as our measure of distributional distance. To find the best approximation, we minimize the KL divergence between our policy and the optimal policy, using our current estimate qθq_\theta of qq^*:

minimizeϕEsμ(s)[DKL(d(s;ϕ)exp(1βqθ(s,))Z(s))]\operatorname{minimize}_{\phi} \mathbb{E}_{s \sim \mu(s)}\left[D_{KL}\left(d(\cdot|s;\phi) \| \frac{\exp(\frac{1}{\beta}q_\theta(s,\cdot))}{Z(s)}\right)\right]

However, an important question remains: how can we solve this optimization problem when it involves the intractable partition function Z(s)Z(s)? To see this, recall that for two distributions p and q, the KL divergence takes the form DKL(pq)=Exp[logp(x)logq(x)]D_{KL}(p\|q) = \mathbb{E}_{x \sim p}[\log p(x) - \log q(x)]. Let’s denote the target Boltzmann distribution based on our current Q-estimate as:

dθ(as)=exp(1βqθ(s,a))Zθ(s)d_\theta(a|s) = \frac{\exp(\frac{1}{\beta}q_\theta(s,a))}{Z_\theta(s)}

Then the KL minimization becomes:

DKL(d(s;ϕ)dθ)=Ead(s;ϕ)[logd(as;ϕ)logdθ(as)]=Ead(s;ϕ)[logd(as;ϕ)log(exp(1βqθ(s,a))Zθ(s))]=Ead(s;ϕ)[logd(as;ϕ)1βqθ(s,a)+logZθ(s)]\begin{align*} D_{KL}(d(\cdot|s;\phi)\|d_\theta) &= \mathbb{E}_{a \sim d(\cdot|s;\phi)}[\log d(a|s;\phi) - \log d_\theta(a|s)] \\ &= \mathbb{E}_{a \sim d(\cdot|s;\phi)}\left[\log d(a|s;\phi) - \log \left(\frac{\exp(\frac{1}{\beta}q_\theta(s,a))}{Z_\theta(s)}\right)\right] \\ &= \mathbb{E}_{a \sim d(\cdot|s;\phi)}\left[\log d(a|s;\phi) - \frac{1}{\beta}q_\theta(s,a) + \log Z_\theta(s)\right] \end{align*}

Since logZ(s)\log Z(s) is constant with respect to ϕ\phi, minimizing this KL divergence is equivalent to:

minimizeϕEsμ(s)Ead(s;ϕ)[logd(as;ϕ)1βqθ(s,a)]\operatorname{minimize}_{\phi} \mathbb{E}_{s \sim \mu(s)}\mathbb{E}_{a \sim d(\cdot|s;\phi)}[\log d(a|s;\phi) - \frac{1}{\beta}q_\theta(s,a)]

Reparameterizing the Objective

One last challenge remains: ϕ\phi appears in the distribution underlying the inner expectation, as well as in the integrand. This setting departs from standard empirical risk minimization (ERM) in supervised learning where the distribution of the data (e.g., cats and dogs in image classification) remains fixed regardless of model parameters. Here, however, the “data” - our sampled actions - depends directly on the parameters ϕ\phi we’re trying to optimize.

This dependence prevents us from simply using sample average estimators and differentiating through them, as we typically do in supervised learning. The challenge of correctly and efficiently estimating such derivatives has been extensively studied in the simulation literature under the umbrella of “derivative estimation.” SAC adopts a particular solution known as the reparameterization trick in deep learning (or the IPA estimator in simulation literature). This approach transforms the problem by pushing ϕ\phi inside the expectation through a change of variables.

To address this, we can express our Gaussian policy through a deterministic function fϕf_\phi that transforms noise samples to actions:

fϕ(s,ϵ)=μϕ(s)+σϕ(s)ϵ,ϵN(0,1)f_\phi(s,\epsilon) = \mu_\phi(s) + \sigma_\phi(s)\epsilon, \quad \epsilon \sim \mathcal{N}(0,1)

This transformation allows us to rewrite our objective using an expectation over the fixed noise distribution:

Esμ(s)EϵN(0,1)[logd(fϕ(s,ϵ)s;ϕ)1βqθ(s,fϕ(s,ϵ))]\begin{align*} &\mathbb{E}_{s \sim \mu(s)}\mathbb{E}_{\epsilon \sim \mathcal{N}(0,1)}[\log d(f_\phi(s,\epsilon)|s;\phi) - \frac{1}{\beta}q_\theta(s,f_\phi(s,\epsilon))] \end{align*}

Now ϕ\phi appears only in the integrand through the function fϕf_\phi, not in the sampling distribution. The objective involves two terms. First, the log-probability of our Gaussian policy has a simple closed form:

logd(fϕ(s,ϵ)s;ϕ)=12log(2πσϕ(s)2)(fϕ(s,ϵ)μϕ(s))22σϕ(s)2\log d(f_\phi(s,\epsilon)|s;\phi) = -\frac{1}{2}\log(2\pi\sigma_\phi(s)^2) - \frac{(f_\phi(s,\epsilon)-\mu_\phi(s))^2}{2\sigma_\phi(s)^2}

Second, \phi enters through the composition of q_\theta with f_\phi: q_\theta(s,f_\phi(s,\epsilon)). The chain rule for this composition involves derivatives of both functions. This could be problematic if the Q-function came from outside of our control (i.e., were not part of the computational graph), but since SAC learns it simultaneously with the policy, all required derivatives can be computed through automatic differentiation.

This composition of policy and value functions - where fϕf_\phi enters as input to qθq_\theta - directly parallels the structure we encountered in deterministic policy methods like NFQCA and DDPG. In those methods, we optimized:

maxϕEsμ(s)[qθ(s,fϕ(s))]\max_{\phi} \mathbb{E}_{s \sim \mu(s)}[q_\theta(s, f_\phi(s))]

where fϕ(s)f_\phi(s) was a deterministic policy. SAC extends this idea to stochastic policies by having fϕf_\phi transform both state and noise:

maxϕEsμ(s)EϵN(0,1)[qθ(s,fϕ(s,ϵ))]\max_{\phi} \mathbb{E}_{s \sim \mu(s)}\mathbb{E}_{\epsilon \sim \mathcal{N}(0,1)}[q_\theta(s,f_\phi(s,\epsilon))]

Thus, rather than learning a single action for each state as in DDPG, we learn a function that transforms random noise into actions, explicitly parameterizing a distribution over actions while maintaining the same underlying principle of differentiating through composed policy and value functions.
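The resulting reparameterized actor objective can be written down directly; the following JAX sketch assumes mu(phi, s), sigma(phi, s), and q(theta, s, a) are differentiable, batched functions, and all names are illustrative.

import jax
import jax.numpy as jnp

def sac_actor_loss(phi, theta, states, eps, mu, sigma, q, beta):
    std = sigma(phi, states)
    a = mu(phi, states) + std * eps                      # a = f_phi(s, eps)
    # log N(a; mu, std); note (a - mu)^2 / (2 std^2) simplifies to eps^2 / 2
    logp = -0.5 * jnp.log(2 * jnp.pi * std**2) - 0.5 * eps**2
    return jnp.mean(logp - q(theta, states, a) / beta)   # KL surrogate from above

actor_grad = jax.grad(sac_actor_loss)  # phi enters only through f_phi and logp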

Derivative Estimation for Stochastic Optimization

Consider optimizing an objective that involves an expectation:

J(θ)=Exp(x;θ)[f(x,θ)]J(\theta) = \mathbb{E}_{x \sim p(x;\theta)}[f(x,\theta)]

For concreteness, let’s examine a simple example where xN(θ,1)x \sim \mathcal{N}(\theta,1) and f(x,θ)=x2θf(x,\theta) = x^2\theta. The derivative we seek is:

ddθJ(θ)=ddθx2θp(x;θ)dx\frac{d}{d\theta}J(\theta) = \frac{d}{d\theta}\int x^2\theta p(x;\theta)dx

While we can compute this exactly for the Gaussian example, this is often impossible for more general problems. We might then be tempted to approximate our objective using samples:

J(θ)1Ni=1Nf(xi,θ),xip(x;θ)J(\theta) \approx \frac{1}{N}\sum_{i=1}^N f(x_i,\theta), \quad x_i \sim p(x;\theta)

Then differentiate this approximation:

ddθJ(θ)1Ni=1Nθf(xi,θ)\frac{d}{d\theta}J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \frac{\partial}{\partial \theta}f(x_i,\theta)

However, this naive approach ignores that the samples themselves depend on θ\theta. The correct derivative requires the product rule:

ddθJ(θ)=θ[f(x,θ)p(x;θ)]dx=[fθp(x;θ)+f(x,θ)p(x;θ)θ]dx\frac{d}{d\theta}J(\theta) = \int \frac{\partial}{\partial \theta}[f(x,\theta)p(x;\theta)]dx = \int \left[\frac{\partial f}{\partial \theta}p(x;\theta) + f(x,\theta)\frac{\partial p(x;\theta)}{\partial \theta}\right]dx

The issue is that while the first term can be estimated by Monte Carlo, since it is an expectation under p(x;\theta), the second term cannot be, as written, because it is not an expectation.

Would there be a way to transform our objective in such a way that the Monte Carlo estimator for the objective could be differentiated directly while ensuring that the resulting derivative is unbiased? We will see that there are two main solutions to that problem: by doing a change of measure, or a change of variables.

The Likelihood Ratio Method

One solution comes from rewriting our objective using any distribution q(x)q(x):

J(θ)=f(x,θ)p(x;θ)q(x)q(x)dx=Exq(x)[f(x,θ)p(x;θ)q(x)]J(\theta) = \int f(x,\theta)\frac{p(x;\theta)}{q(x)}q(x)dx = \mathbb{E}_{x \sim q(x)}\left[f(x,\theta)\frac{p(x;\theta)}{q(x)}\right]

Let’s write this more functionally by defining:

J(θ)=Exq(x)[h(x,θ)],h(x,θ)f(x,θ)p(x;θ)q(x)J(\theta) = \mathbb{E}_{x \sim q(x)}[h(x,\theta)] , \enspace h(x,\theta) \equiv f(x,\theta)\frac{p(x;\theta)}{q(x)}

Now when we differentiate JJ, it’s clear that we must take the partial derivative of hh with respect to its second argument:

ddθJ(θ)=Exq(x)[hθ(x,θ)]=Exq(x)[f(x,θ)θp(x;θ)q(x)+p(x;θ)q(x)fθ(x,θ)]\frac{d}{d\theta}J(\theta) = \mathbb{E}_{x \sim q(x)}\left[\frac{\partial h}{\partial \theta}(x,\theta)\right] = \mathbb{E}_{x \sim q(x)}\left[f(x,\theta)\frac{\partial}{\partial \theta}\frac{p(x;\theta)}{q(x)} + \frac{p(x;\theta)}{q(x)}\frac{\partial f}{\partial \theta}(x,\theta)\right]

The so-called “score function” derivative estimator is obtained for the choice of q(x)=p(x;θ)q(x) = p(x;\theta), where the ratio simplifies to 1 and its derivative becomes the score function:

ddθJ(θ)=Exp(x;θ)[f(x,θ)logp(x,θ)θ+f(x,θ)θ]\frac{d}{d\theta}J(\theta) = \mathbb{E}_{x \sim p(x;\theta)}\left[f(x,\theta)\frac{\partial \log p(x,\theta)}{\partial \theta} + \frac{\partial f(x,\theta)}{\partial \theta}\right]

The Reparameterization Trick

An alternative approach eliminates the θ\theta-dependence in the sampling distribution by expressing xx through a deterministic transformation of the noise:

x=g(ϵ,θ),ϵq(ϵ)x = g(\epsilon,\theta), \quad \epsilon \sim q(\epsilon)

Therefore if we want to sample from some target distribution p(x;θ)p(x;\theta), we can do so by first sampling from a simple base distribution q(ϵ)q(\epsilon) (like a standard normal) and then transforming those samples through a carefully chosen function gg. If g(,θ)g(\cdot,\theta) is invertible, the change of variables formula tells us how these distributions relate:

p(x;θ)=q(g1(x,θ))detg1(x,θ)x=q(ϵ)detg(ϵ,θ)ϵ1p(x;\theta) = q(g^{-1}(x,\theta))\left|\det\frac{\partial g^{-1}(x,\theta)}{\partial x}\right| = q(\epsilon)\left|\det\frac{\partial g(\epsilon,\theta)}{\partial \epsilon}\right|^{-1}

For example, if we want to sample from any multivariate Gaussian distributions with covariance matrix Σ\Sigma and mean μ\mu, it suffices to be able to sample from a standard normal noise and compute the linear transformation:

x=μ+Σ1/2ϵ,ϵN(0,I)x = \mu + \Sigma^{1/2}\epsilon, \quad \epsilon \sim \mathcal{N}(0,I)

where Σ1/2\Sigma^{1/2} is the matrix square root obtained via Cholesky decomposition. In the univariate case, this transformation is simply:

x=μ+σϵ,ϵN(0,1)x = \mu + \sigma \epsilon, \quad \epsilon \sim \mathcal{N}(0,1)

where σ=σ2\sigma = \sqrt{\sigma^2} is the standard deviation (square root of the variance).
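A quick numerical check of the multivariate version (purely illustrative): samples built as mu + L eps reproduce the target mean and covariance.

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
L = np.linalg.cholesky(Sigma)              # plays the role of Sigma^{1/2}
eps = rng.standard_normal((100_000, 2))
x = mu + eps @ L.T                         # x ~ N(mu, Sigma)
print(x.mean(axis=0))                      # close to mu
print(np.cov(x, rowvar=False))             # close to Sigma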

Common Examples of Reparameterization

The Truncated Normal Distribution

When we need samples constrained to an interval [a,b][a,b], we can use the truncated normal distribution. To sample from it, we transform uniform noise through the inverse cumulative distribution function (CDF) of the standard normal:

x=Φ1(uΦ(b)+(1u)Φ(a)),uUniform(0,1)x = \Phi^{-1}(u\Phi(b) + (1-u)\Phi(a)), \quad u \sim \text{Uniform}(0,1)

Here, \Phi denotes the cumulative distribution function (CDF) of the standard normal, \Phi^{-1} its inverse (the quantile function), and u a uniform random variable that interpolates between \Phi(a) and \Phi(b).

The resulting samples follow a normal distribution restricted to [a,b][a,b], with the density properly normalized over this interval.
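A sketch of this inverse-CDF sampler using SciPy's standard normal CDF and quantile function:

import numpy as np
from scipy.stats import norm

def truncated_normal(a, b, size, rng=None):
    """Sample a standard normal truncated to [a, b] via the inverse CDF."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(size=size)
    return norm.ppf(u * norm.cdf(b) + (1 - u) * norm.cdf(a))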

The Kumaraswamy Distribution

When we need samples in the unit interval [0,1], a natural choice might be the Beta distribution. However, its inverse CDF doesn’t have a closed form. Instead, we can use the Kumaraswamy distribution as a convenient approximation, which allows for a simple reparameterization:

x = (1-(1-u)^{1/\beta})^{1/\alpha}, \quad u \sim \text{Uniform}(0,1)

where \alpha > 0 and \beta > 0 are the two shape parameters and u is uniform noise on the unit interval.

The Kumaraswamy distribution has density:

f(x;α,β)=αβxα1(1xα)β1,x[0,1]f(x; \alpha, \beta) = \alpha\beta x^{\alpha-1}(1-x^{\alpha})^{\beta-1}, \quad x \in [0,1]
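The corresponding sampler is a one-liner; alpha and beta are the two shape parameters.

import numpy as np

def kumaraswamy_sample(alpha, beta, size, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(size=size)
    # Inverse of the CDF F(x) = 1 - (1 - x^alpha)^beta
    return (1 - (1 - u) ** (1 / beta)) ** (1 / alpha)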
The Gumbel-Softmax Distribution

When sampling from a categorical distribution with probabilities {πi}\{\pi_i\}, one approach uses Gumbel(0,1)\text{Gumbel}(0,1) noise combined with the argmax of log-perturbed probabilities:

argmaxi(logπi+gi),giGumbel(0,1)\text{argmax}_i(\log \pi_i + g_i), \quad g_i \sim \text{Gumbel}(0,1)

This approach, known in machine learning as the Gumbel-Max trick, relies on sampling Gumbel noise from uniform random variables through the transformation gi=log(log(ui))g_i = -\log(-\log(u_i)) where uiUniform(0,1)u_i \sim \text{Uniform}(0,1). To see why this gives us samples from the categorical distribution, consider the probability of selecting category ii:

P(argmaxj(logπj+gj)=i)=P(logπi+gi>logπj+gj for all ji)=P(gigj>logπjlogπi for all ji)\begin{align*} P(\text{argmax}_j(\log \pi_j + g_j) = i) &= P(\log \pi_i + g_i > \log \pi_j + g_j \text{ for all } j \neq i) \\ &= P(g_i - g_j > \log \pi_j - \log \pi_i \text{ for all } j \neq i) \end{align*}

These events are not independent across j, since they all involve the same g_i, but they become independent once we condition on g_i. Using the Gumbel CDF P(g_j \le z) = \exp(-e^{-z}) and the Gumbel density p(g) = e^{-g}e^{-e^{-g}}, we can write:

P(\text{argmax}_j(\log \pi_j + g_j) = i) = \int_{-\infty}^{\infty} p(g)\prod_{j \neq i} P\big(g_j < g + \log \pi_i - \log \pi_j\big)\,dg = \int_{-\infty}^{\infty} e^{-g}\exp\Big(-e^{-g}\sum_j \tfrac{\pi_j}{\pi_i}\Big)dg = \pi_i

where the last step uses the substitution u = e^{-g} together with \sum_j \pi_j = 1.

While we have shown that the Gumbel-Max trick gives us exact samples from a categorical distribution, the argmax operation isn’t differentiable. For stochastic optimization problems of the form:

Exp(x;θ)[f(x)]=EϵGumbel(0,1)[f(g(ϵ,θ))]\mathbb{E}_{x \sim p(x;\theta)}[f(x)] = \mathbb{E}_{\epsilon \sim \text{Gumbel}(0,1)}[f(g(\epsilon,\theta))]

we need gg to be differentiable with respect to θ\theta. This leads us to consider a continuous relaxation where we replace the hard argmax with a temperature-controlled softmax:

zi=exp((logπi+gi)/τ)jexp((logπj+gj)/τ)z_i = \frac{\exp((\log \pi_i + g_i)/\tau)}{\sum_j \exp((\log \pi_j + g_j)/\tau)}

As τ0\tau \to 0, this approximation approaches the argmax:

limτ0exp(xi/τ)jexp(xj/τ)={1if xi=maxjxj0otherwise\lim_{\tau \to 0} \frac{\exp(x_i/\tau)}{\sum_j \exp(x_j/\tau)} = \begin{cases} 1 & \text{if } x_i = \max_j x_j \\ 0 & \text{otherwise} \end{cases}

The resulting distribution over the probability simplex is called the Gumbel-Softmax (or Concrete) distribution. The temperature parameter τ\tau controls the discreteness of our samples: smaller values give samples closer to one-hot vectors but with less stable gradients, while larger values give smoother gradients but more diffuse samples.
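A sketch of drawing one relaxed sample, following the construction above:

import numpy as np

def gumbel_softmax_sample(log_probs, tau, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    g = -np.log(-np.log(rng.uniform(size=log_probs.shape)))  # Gumbel(0,1) noise
    z = (log_probs + g) / tau
    z = z - z.max()                                          # numerical stability
    return np.exp(z) / np.exp(z).sum()                       # relaxed one-hot sample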

Numerical Analysis of Gradient Estimators

Let us examine the behavior of our three gradient estimators for the stochastic optimization objective:

J(θ)=ExN(θ,1)[x2θ]J(\theta) = \mathbb{E}_{x \sim \mathcal{N}(\theta,1)}[x^2\theta]

To get an analytical expression for the derivative, first note that we can factor out θ\theta to obtain J(θ)=θE[x2]J(\theta) = \theta\mathbb{E}[x^2] where xN(θ,1)x \sim \mathcal{N}(\theta,1). By definition of the variance, we know that Var(x)=E[x2](E[x])2\text{Var}(x) = \mathbb{E}[x^2] - (\mathbb{E}[x])^2, which we can rearrange to E[x2]=Var(x)+(E[x])2\mathbb{E}[x^2] = \text{Var}(x) + (\mathbb{E}[x])^2. Since xN(θ,1)x \sim \mathcal{N}(\theta,1), we have Var(x)=1\text{Var}(x) = 1 and E[x]=θ\mathbb{E}[x] = \theta, therefore E[x2]=1+θ2\mathbb{E}[x^2] = 1 + \theta^2. This gives us:

J(θ)=θ(1+θ2)J(\theta) = \theta(1 + \theta^2)

Now differentiating with respect to θ\theta using the product rule yields:

ddθJ(θ)=1+3θ2\frac{d}{d\theta}J(\theta) = 1 + 3\theta^2

For concreteness, we fix θ=1.0\theta = 1.0 and analyze samples drawn using Monte Carlo estimation with batch size 1000 and 1000 independent trials. Evaluating at θ=1\theta = 1 gives us ddθJ(θ)θ=1=1+3(1)2=4\frac{d}{d\theta}J(\theta)\big|_{\theta=1} = 1 + 3(1)^2 = 4, which serves as our ground truth against which we compare our estimators:

  1. First, we consider the naive estimator that incorrectly differentiates the Monte Carlo approximation:

    g^naive(θ)=1Ni=1Nxi2\hat{g}_{\text{naive}}(\theta) = \frac{1}{N}\sum_{i=1}^N x_i^2

    For xN(1,1)x \sim \mathcal{N}(1,1), we have E[x2]=θ2+1=2.0\mathbb{E}[x^2] = \theta^2 + 1 = 2.0 and E[g^naive]=2.0\mathbb{E}[\hat{g}_{\text{naive}}] = 2.0. We should therefore expect a bias of about -2 in our experiment.

  2. Then we compute the score function estimator:

    g^SF(θ)=1Ni=1N[xi2θ(xiθ)+xi2]\hat{g}_{\text{SF}}(\theta) = \frac{1}{N}\sum_{i=1}^N \left[x_i^2\theta(x_i - \theta) + x_i^2\right]

    This estimator is unbiased with E[g^SF]=4\mathbb{E}[\hat{g}_{\text{SF}}] = 4

  3. Finally, through the reparameterization x=θ+ϵx = \theta + \epsilon where ϵN(0,1)\epsilon \sim \mathcal{N}(0,1), we obtain:

    g^RT(θ)=1Ni=1N[2θ(θ+ϵi)+(θ+ϵi)2]\hat{g}_{\text{RT}}(\theta) = \frac{1}{N}\sum_{i=1}^N \left[2\theta(\theta + \epsilon_i) + (\theta + \epsilon_i)^2\right]

    This estimator is also unbiased with E[g^RT]=4\mathbb{E}[\hat{g}_{\text{RT}}] = 4.

import jax
import jax.numpy as jnp
import matplotlib.pyplot as plt

key = jax.random.PRNGKey(0)

# Define the objective function f(x,θ) = x²θ where x ~ N(θ, 1)
def objective(x, theta):
    return x**2 * theta

# Naive Monte Carlo gradient estimation
@jax.jit
def naive_gradient_batch(key, theta):
    samples = jax.random.normal(key, (1000,)) + theta
    # Use jax.grad on the objective with respect to theta
    grad_fn = jax.grad(lambda t: jnp.mean(objective(samples, t)))
    return grad_fn(theta)

# Score function estimator (REINFORCE)
@jax.jit
def score_function_batch(key, theta):
    samples = jax.random.normal(key, (1000,)) + theta
    # f(x,θ) * ∂logp(x|θ)/∂θ + ∂f(x,θ)/∂θ
    # score function for N(θ,1) is (x-θ)
    score = samples - theta
    return jnp.mean(objective(samples, theta) * score + samples**2)

# Reparameterization gradient
@jax.jit
def reparam_gradient_batch(key, theta):
    eps = jax.random.normal(key, (1000,))
    # Use reparameterization x = θ + ε, ε ~ N(0,1)
    grad_fn = jax.grad(lambda t: jnp.mean(objective(t + eps, t)))
    return grad_fn(theta)

# Run trials
n_trials = 1000
theta = 1.0
true_grad = 1 + 3 * theta**2

keys = jax.random.split(key, n_trials)
naive_estimates = jnp.array([naive_gradient_batch(k, theta) for k in keys])
score_estimates = jnp.array([score_function_batch(k, theta) for k in keys])
reparam_estimates = jnp.array([reparam_gradient_batch(k, theta) for k in keys])

# Create violin plots with individual points
plt.figure(figsize=(12, 6))
data = [naive_estimates, score_estimates, reparam_estimates]
colors = ['#ff9999', '#66b3ff', '#99ff99']

parts = plt.violinplot(data, showextrema=False)
for i, pc in enumerate(parts['bodies']):
    pc.set_facecolor(colors[i])
    pc.set_alpha(0.7)

# Add box plots
plt.boxplot(data, notch=True, showfliers=False)

# Add true gradient line
plt.axhline(y=true_grad, color='r', linestyle='--', label='True Gradient')

plt.xticks([1, 2, 3], ['Naive', 'Score Function', 'Reparam'])
plt.ylabel('Gradient Estimate')
plt.title(f'Gradient Estimators (θ={theta}, true grad={true_grad:.2f})')
plt.grid(True, alpha=0.3)
plt.legend()

# Print statistics
methods = {
    'Naive': naive_estimates,
    'Score Function': score_estimates, 
    'Reparameterization': reparam_estimates
}

for name, estimates in methods.items():
    bias = jnp.mean(estimates) - true_grad
    variance = jnp.var(estimates)
    print(f"\n{name}:")
    print(f"Mean: {jnp.mean(estimates):.6f}")
    print(f"Bias: {bias:.6f}")
    print(f"Variance: {variance:.6f}")
    print(f"MSE: {bias**2 + variance:.6f}")

The numerical experiments corroborate our theory. The naive estimator consistently underestimates the true gradient by 2.0, though it maintains a relatively small variance. This systematic bias would make it unsuitable for optimization despite its low variance. The score function estimator corrects this bias but introduces substantial variance: while unbiased, it would require many samples to achieve reliable gradient estimates. Finally, the reparameterization trick achieves a much lower variance while remaining unbiased. Although this experiment is for didactic purposes only, it reproduces what is commonly found in practice: when applicable, the reparameterization estimator tends to perform better than its score function counterpart.

Score Function Gradient Estimation in Reinforcement Learning

Let G(τ)t=0Tr(st,at)G(\tau) \equiv \sum_{t=0}^T r(s_t, a_t) be the sum of undiscounted rewards in a trajectory τ\tau. The stochastic optimization problem we face is to maximize:

J(w)=Eτp(τ;w)[G(τ)]J(\boldsymbol{w}) = \mathbb{E}_{\tau \sim p(\tau;\boldsymbol{w})}[G(\tau)]

where τ=(s0,a0,s1,a1,...)\tau = (s_0,a_0,s_1,a_1,...) is a trajectory and G(τ)G(\tau) is the total return. Applying the score function estimator, we get:

wJ(w)=wEτ[G(τ)]=Eτ[G(τ)wlogp(τ;w)]=Eτ[G(τ)wt=0Tlogd(atst;w)]=Eτ[G(τ)t=0Twlogd(atst;w)]\begin{align*} \nabla_{\boldsymbol{w}}J(\boldsymbol{w}) &= \nabla_{\boldsymbol{w}}\mathbb{E}_{\tau}[G(\tau)] \\ &= \mathbb{E}_{\tau}\left[G(\tau)\nabla_{\boldsymbol{w}}\log p(\tau;\boldsymbol{w})\right] \\ &= \mathbb{E}_{\tau}\left[G(\tau)\nabla_{\boldsymbol{w}}\sum_{t=0}^T\log d(a_t|s_t;\boldsymbol{w})\right] \\ &= \mathbb{E}_{\tau}\left[G(\tau)\sum_{t=0}^T\nabla_{\boldsymbol{w}}\log d(a_t|s_t;\boldsymbol{w})\right] \end{align*}

We have eliminated the need to know the transition probabilities in this estimator since the probability of a trajectory factorizes as:

p(τ;w)=p(s0)t=0Td(atst;w)p(st+1st,at)p(\tau;\boldsymbol{w}) = p(s_0)\prod_{t=0}^T d(a_t|s_t;\boldsymbol{w})p(s_{t+1}|s_t,a_t)

Therefore, only the policy depends on w\boldsymbol{w}. When taking the logarithm of this product, we get a sum where all the w\boldsymbol{w}-independent terms vanish. The final estimator samples trajectories under the distribution p(τ;w)p(\tau; \boldsymbol{w}) and computes:

wJ(w)1Ni=1N[G(τ(i))t=0Twlogd(at(i)st(i);w)]\nabla_{\boldsymbol{w}}J(\boldsymbol{w}) \approx \frac{1}{N}\sum_{i=1}^N\left[G(\tau^{(i)})\sum_{t=0}^T\nabla_{\boldsymbol{w}}\log d(a_t^{(i)}|s_t^{(i)};\boldsymbol{w})\right]

This is a direct application of the score function estimator. However, we rarely use this form in practice and instead make several improvements to further reduce the variance.
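For reference, a minimal sketch of this basic estimator before those improvements; it assumes logp_fn(w, s, a) computes \log d(a|s;\boldsymbol{w}) with JAX-differentiable parameters and that each trajectory is a list of (s, a, r) tuples.

import jax

def reinforce_grad(w, trajectories, logp_fn):
    grads = []
    for traj in trajectories:
        G = sum(r for (_, _, r) in traj)  # total return G(tau)
        score = jax.grad(lambda w_: sum(logp_fn(w_, s, a) for (s, a, _) in traj))(w)
        grads.append(jax.tree_util.tree_map(lambda g: G * g, score))
    # Average over the N sampled trajectories
    return jax.tree_util.tree_map(lambda *gs: sum(gs) / len(gs), *grads)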

Leveraging Conditional Independence

Given the Markov property of the MDP, rewards r_k for k < t are conditionally independent of the action a_t given the history h_t = (s_0,a_0,...,s_{t-1},a_{t-1},s_t). As a consequence, only future rewards need to be considered when computing the policy gradient.

wJ(w)=Eτ[t=0Twlogd(atst;w)k=0Trk]=Eτ[t=0Twlogd(atst;w)(k=0t1rk+k=tTrk)]=Eτ[t=0Twlogd(atst;w)k=tTrk]\begin{align*} \nabla_{\boldsymbol{w}}J(\boldsymbol{w}) &= \mathbb{E}_{\tau}\left[\sum_{t=0}^T\nabla_{\boldsymbol{w}}\log d(a_t|s_t;\boldsymbol{w})\sum_{k=0}^T r_k\right] \\ &= \mathbb{E}_{\tau}\left[\sum_{t=0}^T\nabla_{\boldsymbol{w}}\log d(a_t|s_t;\boldsymbol{w})\left(\sum_{k=0}^{t-1} r_k + \sum_{k=t}^T r_k\right)\right] \\ &= \mathbb{E}_{\tau}\left[\sum_{t=0}^T\nabla_{\boldsymbol{w}}\log d(a_t|s_t;\boldsymbol{w})\sum_{k=t}^T r_k\right] \end{align*}

This conditional independence implies that the term \mathbb{E}_{\tau}\left[\sum_{t=0}^T\nabla_{\boldsymbol{w}}\log d(a_t|s_t;\boldsymbol{w})\sum_{k=0}^{t-1} r_k \right] vanishes. To see this, let's factor the trajectory distribution as:

p(τ)=p(s0,...,st,a0,...,at1)d(atst;w)p(st+1,...,sT,at+1,...,aTst,at)p(\tau) = p(s_0,...,s_t,a_0,...,a_{t-1})\cdot d(a_t|s_t;\boldsymbol{w})\cdot p(s_{t+1},...,s_T,a_{t+1},...,a_T|s_t,a_t)

We can now re-write a single term of this summation as:

Eτ[wlogd(atst;w)k=0t1rk]=Es0:t,a0:t1[k=0t1rkEat[wlogd(atst;w)]]\mathbb{E}_{\tau}\left[\nabla_{\boldsymbol{w}}\log d(a_t|s_t;\boldsymbol{w})\sum_{k=0}^{t-1} r_k\right] = \mathbb{E}_{s_{0:t},a_{0:t-1}}\left[\sum_{k=0}^{t-1} r_k \cdot \mathbb{E}_{a_t}\left[\nabla_{\boldsymbol{w}}\log d(a_t|s_t;\boldsymbol{w})\right]\right]

The inner expectation is zero because

Eat[wlogd(atst;w)]=wlogd(atst;w)d(atst;w)dat=wd(atst;w)d(atst;w)d(atst;w)dat=wd(atst;w)dat=wd(atst;w)dat=w1=0\begin{align*} \mathbb{E}_{a_t}\left[\nabla_{\boldsymbol{w}}\log d(a_t|s_t;\boldsymbol{w})\right] &= \int \nabla_{\boldsymbol{w}}\log d(a_t|s_t;\boldsymbol{w})d(a_t|s_t;\boldsymbol{w})da_t \\ &= \int \frac{\nabla_{\boldsymbol{w}}d(a_t|s_t;\boldsymbol{w})}{d(a_t|s_t;\boldsymbol{w})}d(a_t|s_t;\boldsymbol{w})da_t \\ &= \int \nabla_{\boldsymbol{w}}d(a_t|s_t;\boldsymbol{w})da_t \\ &= \nabla_{\boldsymbol{w}}\int d(a_t|s_t;\boldsymbol{w})da_t \\ &= \nabla_{\boldsymbol{w}}1 = 0 \end{align*}

The Monte Carlo estimator becomes:

wJ(w)1Ni=1N[t=0Twlogd(at(i)st(i);w)k=tTrk(i)]\nabla_{\boldsymbol{w}}J(\boldsymbol{w}) \approx \frac{1}{N}\sum_{i=1}^N\left[\sum_{t=0}^T\nabla_{\boldsymbol{w}}\log d(a_t^{(i)}|s_t^{(i)};\boldsymbol{w})\sum_{k=t}^T r_k^{(i)}\right]

The benefit of this estimator compared to the naive one is that it generally has less variance. More formally, we can show that this estimator arises from the application of a variance reduction technique known as the Extended Conditional Monte Carlo Method.

Variance Reduction via Control Variates

A control variate is a zero-mean random variable that we subtract from our estimator to reduce variance. Given an estimator ZZ and a control variate CC with E[C]=0\mathbb{E}[C]=0, we can construct a new unbiased estimator:

Zcv=ZαCZ_{\text{cv}} = Z - \alpha C

where α\alpha is a coefficient we can choose. The variance of this new estimator is:

Var(Zcv)=Var(Z)+α2Var(C)2αCov(Z,C)\text{Var}(Z_{\text{cv}}) = \text{Var}(Z) + \alpha^2\text{Var}(C) - 2\alpha\text{Cov}(Z,C)

The optimal α\alpha that minimizes this variance is:

α=Cov(Z,C)Var(C)\alpha^* = \frac{\text{Cov}(Z,C)}{\text{Var}(C)}

In the reinforcement learning setting, we usually choose Ct=wlogd(atst;w)C_t = \nabla_{\boldsymbol{w}}\log d(a_t|s_t;\boldsymbol{w}) as our control variate at each timestep. For a given state sts_t, our estimator at time tt is:

Zt=wlogd(atst;w)k=tTrkZ_t = \nabla_{\boldsymbol{w}}\log d(a_t|s_t;\boldsymbol{w})\sum_{k=t}^T r_k

Our control variate estimator becomes:

Zt,cv=ZtαtCt=wlogd(atst;w)(k=tTrkαt)Z_{t,\text{cv}} = Z_t - \alpha_t^* C_t = \nabla_{\boldsymbol{w}}\log d(a_t|s_t;\boldsymbol{w})(\sum_{k=t}^T r_k - \alpha_t^*)

Following the general theory, and using the fact that \mathbb{E}[C_t|s_t] = 0, the optimal coefficient is:

αt=Cov(Zt,Ctst)Var(Ctst)=E[ZtCtTst]E[Ztst]E[CtTst]E[CtCtTst]E[Ctst]E[CtTst]=E[wlogd(atst;w)wlogd(atst;w)Tk=tTrkst]0E[wlogd(atst;w)wlogd(atst;w)Tst]0=E[wlogd(atst;w)2k=tTrkst]E[wlogd(atst;w)2st]=Eatst[wlogd(atst;w)2]E[k=tTrkst]Eatst[wlogd(atst;w)2]=E[k=tTrkst]=vdw(st)\begin{align*} \alpha^*_t = \frac{\text{Cov}(Z_t,C_t|s_t)}{\text{Var}(C_t|s_t)} &= \frac{\mathbb{E}[Z_tC_t^T|s_t] - \mathbb{E}[Z_t|s_t]\mathbb{E}[C_t^T|s_t]}{\mathbb{E}[C_tC_t^T|s_t] - \mathbb{E}[C_t|s_t]\mathbb{E}[C_t^T|s_t]} \\ &= \frac{\mathbb{E}[\nabla_{\boldsymbol{w}}\log d(a_t|s_t;\boldsymbol{w})\nabla_{\boldsymbol{w}}\log d(a_t|s_t;\boldsymbol{w})^T\sum_{k=t}^T r_k|s_t] - 0}{\mathbb{E}[\nabla_{\boldsymbol{w}}\log d(a_t|s_t;\boldsymbol{w})\nabla_{\boldsymbol{w}}\log d(a_t|s_t;\boldsymbol{w})^T|s_t] - 0} \\ &= \frac{\mathbb{E}[\|\nabla_{\boldsymbol{w}}\log d(a_t|s_t;\boldsymbol{w})\|^2\sum_{k=t}^T r_k|s_t]}{\mathbb{E}[\|\nabla_{\boldsymbol{w}}\log d(a_t|s_t;\boldsymbol{w})\|^2|s_t]} \\ &= \frac{\mathbb{E}_{a_t|s_t}[\|\nabla_{\boldsymbol{w}}\log d(a_t|s_t;\boldsymbol{w})\|^2]\mathbb{E}[\sum_{k=t}^T r_k|s_t]}{\mathbb{E}_{a_t|s_t}[\|\nabla_{\boldsymbol{w}}\log d(a_t|s_t;\boldsymbol{w})\|^2]} \\ &= \mathbb{E}[\sum_{k=t}^T r_k|s_t] = v^{d_{\boldsymbol{w}}}(s_t) \end{align*}

Therefore, our variance-reduced estimator becomes:

Zcv,t=wlogd(atst;w)(k=tTrkvdw(st))Z_{\text{cv},t} = \nabla_{\boldsymbol{w}}\log d(a_t|s_t;\boldsymbol{w})\left(\sum_{k=t}^T r_k - v^{d_{\boldsymbol{w}}}(s_t)\right)

In practice, we won't have access to the true value function when implementing this estimator. So, as we did earlier for NFQCA and SAC, we commonly learn the value function simultaneously with the policy. To do so, we could either use a fitted value approach, or, even more simply, regress from states to sums of rewards to learn what Williams (1992) called a "baseline":

When implementing this algorithm nowadays, we always use mini-batching to make full use of our GPUs. Therefore, a more representative variant of this algorithm would be:
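A rough sketch of the core updates of such a variant (the function names and the use of JAX are assumptions): the baseline v(s; psi) is regressed onto the observed returns, while the policy uses the variance-reduced score-function gradient with centered returns G_t - v(s_t).

import jax
import jax.numpy as jnp

def policy_loss(w, states, actions, advantages, logp_fn):
    # Surrogate whose gradient is the variance-reduced score-function estimator
    logps = jax.vmap(lambda s, a: logp_fn(w, s, a))(states, actions)
    return -jnp.mean(logps * advantages)

def baseline_loss(psi, states, returns, v_fn):
    # Regress states onto observed sums of rewards (the "baseline")
    preds = jax.vmap(lambda s: v_fn(psi, s))(states)
    return jnp.mean((preds - returns) ** 2)

policy_grad = jax.grad(policy_loss)
baseline_grad = jax.grad(baseline_loss)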

Generalized Advantage Estimator

Given our control variate estimator with baseline v(s)v(s), we have:

wlogd(atst;w)(Gtv(st))\nabla_{\boldsymbol{w}}\log d(a_t|s_t;\boldsymbol{w})(G_t - v(s_t))

where GtG_t is the return k=tTrk\sum_{k=t}^T r_k. We can improve this estimator by considering how it relates to the advantage function, defined as:

A(st,at)=q(st,at)v(st)A(s_t,a_t) = q(s_t,a_t) - v(s_t)

where q(st,at)q(s_t,a_t) is the action-value function. Due to the Bellman equation:

q(st,at)=Est+1,rt[rt+γv(st+1)st,at]q(s_t,a_t) = \mathbb{E}_{s_{t+1},r_t}[r_t + \gamma v(s_{t+1})|s_t,a_t]

This leads to the one-step TD error:

δt=rt+γv(st+1)v(st)\delta_t = r_t + \gamma v(s_{t+1}) - v(s_t)

Now, let’s decompose our original term:

Gtv(st)=rt+γGt+1v(st)=rt+γv(st+1)v(st)+γ(Gt+1v(st+1))=δt+γ(Gt+1v(st+1))\begin{align*} G_t - v(s_t) &= r_t + \gamma G_{t+1} - v(s_t) \\ &= r_t + \gamma v(s_{t+1}) - v(s_t) + \gamma(G_{t+1} - v(s_{t+1})) \\ &= \delta_t + \gamma(G_{t+1} - v(s_{t+1})) \end{align*}

Expanding recursively:

Gtv(st)=δt+γ(Gt+1v(st+1))=δt+γ[δt+1+γ(Gt+2v(st+2))]=δt+γδt+1+γ2δt+2+...+γTtδT\begin{align*} G_t - v(s_t) &= \delta_t + \gamma(G_{t+1} - v(s_{t+1})) \\ &= \delta_t + \gamma[\delta_{t+1} + \gamma(G_{t+2} - v(s_{t+2}))] \\ &= \delta_t + \gamma\delta_{t+1} + \gamma^2\delta_{t+2} + ... + \gamma^{T-t}\delta_T \end{align*}

GAE generalizes this by introducing a weighted version with parameter λ\lambda:

AGAE(γ,λ)(st,at)=(1λ)k=0λkl=0kγlδt+lA^{\text{GAE}(\gamma,\lambda)}(s_t,a_t) = (1-\lambda)\sum_{k=0}^{\infty}\lambda^k\sum_{l=0}^k \gamma^l\delta_{t+l}

Which simplifies to:

AGAE(γ,λ)(st,at)=l=0(γλ)lδt+lA^{\text{GAE}(\gamma,\lambda)}(s_t,a_t) = \sum_{l=0}^{\infty}(\gamma\lambda)^l\delta_{t+l}

This formulation allows us to trade off bias and variance through \lambda: smaller values rely more heavily on the learned value function (lower variance, but more bias if the value estimates are off), while \lambda = 1 recovers the Monte Carlo return minus the baseline (unbiased given the true value function, but higher variance).

The general GAE algorithm with mini-batches is the following:
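Its core step is the advantage computation, which can be done with the backward recursion A_t = \delta_t + \gamma\lambda A_{t+1}; here is a sketch, assuming values holds the value estimates for s_0,...,s_T plus a bootstrap value for the final state.

import numpy as np

def gae_advantages(rewards, values, gamma, lam):
    """Compute GAE(gamma, lambda) advantages for one trajectory."""
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv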

With λ=0\lambda = 0, the GAE advantage estimator becomes just the one-step TD error:

AGAE(γ,0)(st,at)=δt=rt+γv(st+1)v(st)A^{\text{GAE}(\gamma,0)}(s_t,a_t) = \delta_t = r_t + \gamma v(s_{t+1}) - v(s_t)

The non-batched, purely online, GAE(0) algorithm then becomes:

This version was first derived by Richard Sutton in his 1984 PhD thesis. He called it the Adaptive Heuristic Actor-Critic algorithm. As far as I know, it was not derived using the score function method outlined here, but rather through intuitive reasoning (great intuition!).

The Policy Gradient Theorem

Sutton et al. (1999) provided an expression for the gradient of the infinite-horizon discounted return with respect to the parameters of a parameterized policy. Here is an alternative derivation obtained by considering a bilevel optimization problem:

maxwαvγd\max_{\mathbf{w}} \alpha^\top \mathbf{v}_\gamma^{d^\infty}

subject to:

(IγPd)vγd=rd(\mathbf{I} - \gamma \mathbf{P}_d) \mathbf{v}_\gamma^{d^\infty} = \mathbf{r}_d

The Implicit Function Theorem states that if there is a solution to the problem F(v,w)=0F(\mathbf{v}, \mathbf{w}) = 0, then we can “reparameterize” our problem as F(v(w),w)F(\mathbf{v}(\mathbf{w}), \mathbf{w}) where v(w)\mathbf{v}(\mathbf{w}) is an implicit function of w\mathbf{w}. If the Jacobian Fv\frac{\partial F}{\partial \mathbf{v}} is invertible, then:

\frac{d\mathbf{v}(\mathbf{w})}{d\mathbf{w}} = -\left(\frac{\partial F(\mathbf{v}(\mathbf{w}), \mathbf{w})}{\partial \mathbf{v}}\right)^{-1}\frac{\partial F(\mathbf{v}(\mathbf{w}), \mathbf{w})}{\partial \mathbf{w}}

Here we made it explicit in our notation that the derivatives must be evaluated at the root (\mathbf{v}(\mathbf{w}), \mathbf{w}) of F. For the remainder of this derivation, we will drop this dependence to keep the notation compact.

Applying this to our case with F(v,w)=(IγPd)vrdF(\mathbf{v}, \mathbf{w}) = (\mathbf{I} - \gamma \mathbf{P}_d)\mathbf{v} - \mathbf{r}_d:

vγdw=(IγPd)1(rdw+γPdwvγd)\frac{\partial \mathbf{v}_\gamma^{d^\infty}}{\partial \mathbf{w}} = (\mathbf{I} - \gamma \mathbf{P}_d)^{-1}\left(\frac{\partial \mathbf{r}_d}{\partial \mathbf{w}} + \gamma \frac{\partial \mathbf{P}_d}{\partial \mathbf{w}}\mathbf{v}_\gamma^{d^\infty}\right)

Then:

wJ(w)=αvγdw=xα(rdw+γPdwvγd)\begin{align*} \nabla_{\mathbf{w}}J(\mathbf{w}) &= \alpha^\top \frac{\partial \mathbf{v}_\gamma^{d^\infty}}{\partial \mathbf{w}} \\ &= \mathbf{x}_\alpha^\top\left(\frac{\partial \mathbf{r}_d}{\partial \mathbf{w}} + \gamma \frac{\partial \mathbf{P}_d}{\partial \mathbf{w}}\mathbf{v}_\gamma^{d^\infty}\right) \end{align*}

where we have defined the discounted state visitation distribution:

xαα(IγPd)1.\mathbf{x}_\alpha^\top \equiv \alpha^\top(\mathbf{I} - \gamma \mathbf{P}_d)^{-1}.

Remember the vector notation for MDPs:

[rd]s=ad(as)r(s,a)[Pd]ss=ad(as)P(ss,a)\begin{align*} [\mathbf{r}_d]_s &= \sum_a d(a|s)r(s,a) \\ [\mathbf{P}_d]_{ss'} &= \sum_a d(a|s)P(s'|s,a) \end{align*}

Then taking the derivatives gives us:

[rdw]s=awd(as)r(s,a)[Pdwvγd]s=awd(as)sP(ss,a)vγd(s)\begin{align*} \left[\frac{\partial \mathbf{r}_d}{\partial \mathbf{w}}\right]_s &= \sum_a \nabla_{\mathbf{w}}d(a|s)r(s,a) \\ \left[\frac{\partial \mathbf{P}_d}{\partial \mathbf{w}}\mathbf{v}_\gamma^{d^\infty}\right]_s &= \sum_a \nabla_{\mathbf{w}}d(a|s)\sum_{s'} P(s'|s,a)v_\gamma^{d^\infty}(s') \end{align*}

Substituting back:

wJ(w)=sxα(s)(awd(as)r(s,a)+γawd(as)sP(ss,a)vγd(s))=sxα(s)awd(as)(r(s,a)+γsP(ss,a)vγd(s))\begin{align*} \nabla_{\mathbf{w}}J(\mathbf{w}) &= \sum_s x_\alpha(s)\left(\sum_a \nabla_{\mathbf{w}}d(a|s)r(s,a) + \gamma\sum_a \nabla_{\mathbf{w}}d(a|s)\sum_{s'} P(s'|s,a)v_\gamma^{d^\infty}(s')\right) \\ &= \sum_s x_\alpha(s)\sum_a \nabla_{\mathbf{w}}d(a|s)\left(r(s,a) + \gamma \sum_{s'} P(s'|s,a)v_\gamma^{d^\infty}(s')\right) \end{align*}

This is the policy gradient theorem, where xα(s)x_\alpha(s) is the discounted state visitation distribution and the term in parentheses is the state-action value function q(s,a)q(s,a).

Normalized Discounted State Visitation Distribution

The discounted state visitation x_\alpha(s) is not normalized, so the expression we obtained above is not an expectation. However, we can transform it into one by normalizing by 1-\gamma. Note that for any initial distribution \alpha:

sxα(s)=α(IγPd)11=α11γ=11γ\sum_s x_\alpha(s) = \alpha^\top(\mathbf{I} - \gamma \mathbf{P}_d)^{-1}\mathbf{1} = \frac{\alpha^\top\mathbf{1}}{1-\gamma} = \frac{1}{1-\gamma}

Therefore, defining the normalized state distribution μα(s)=(1γ)xα(s)\mu_\alpha(s) = (1-\gamma)x_\alpha(s), we can write:

wJ(w)=11γsμα(s)awd(as)(r(s,a)+γsP(ss,a)vγd(s))=11γEsμα[awd(as)Q(s,a)]\begin{align*} \nabla_{\mathbf{w}}J(\mathbf{w}) &= \frac{1}{1-\gamma}\sum_s \mu_\alpha(s)\sum_a \nabla_{\mathbf{w}}d(a|s)\left(r(s,a) + \gamma \sum_{s'} P(s'|s,a)v_\gamma^{d^\infty}(s')\right) \\ &= \frac{1}{1-\gamma}\mathbb{E}_{s\sim\mu_\alpha}\left[\sum_a \nabla_{\mathbf{w}}d(a|s)Q(s,a)\right] \end{align*}

Now we have expressed the policy gradient theorem as an expectation under the normalized discounted state visitation distribution. But what does sampling from \mu_\alpha mean? Recall that \mathbf{x}_\alpha^\top = \alpha^\top(\mathbf{I} - \gamma \mathbf{P}_d)^{-1}. Using the Neumann series expansion (valid when \|\gamma \mathbf{P}_d\| < 1, which holds for \gamma < 1 since \mathbf{P}_d is a stochastic matrix), we have:

μα=(1γ)αk=0(γPd)k\boldsymbol{\mu}_\alpha^\top = (1-\gamma)\alpha^\top\sum_{k=0}^{\infty} (\gamma \mathbf{P}_d)^k

We can then factor out the first term from this summation to obtain:

μα=(1γ)αk=0(γPd)k=(1γ)α+(1γ)αk=1(γPd)k=(1γ)α+(1γ)αγPdk=0(γPd)k=(1γ)α+γμαPd\begin{align*} \boldsymbol{\mu}_\alpha^\top &= (1-\gamma)\alpha^\top\sum_{k=0}^{\infty} (\gamma \mathbf{P}_d)^k \\ &= (1-\gamma)\alpha^\top + (1-\gamma)\alpha^\top\sum_{k=1}^{\infty} (\gamma \mathbf{P}_d)^k \\ &= (1-\gamma)\alpha^\top + (1-\gamma)\alpha^\top\gamma\mathbf{P}_d\sum_{k=0}^{\infty} (\gamma \mathbf{P}_d)^k \\ &= (1-\gamma)\alpha^\top + \gamma\boldsymbol{\mu}_\alpha^\top \mathbf{P}_d \end{align*}

The balance equation:

μα=(1γ)α+γμαPd\boldsymbol{\mu}_\alpha^\top = (1-\gamma)\alpha^\top + \gamma\boldsymbol{\mu}_\alpha^\top \mathbf{P}_d

shows that μα\boldsymbol{\mu}_\alpha is a mixture distribution: with probability 1γ1-\gamma you draw a state from the initial distribution α\alpha (reset), and with probability γ\gamma you follow the policy dynamics Pd\mathbf{P}_d from the current state (continue). This interpretation directly connects to the geometric process: at each step you either terminate and resample from α\alpha (with probability 1γ1-\gamma) or continue following the policy (with probability γ\gamma).

import numpy as np

def sample_from_discounted_visitation(
    alpha, 
    policy, 
    transition_model, 
    gamma, 
    n_samples=1000
):
    """Sample states from the discounted visitation distribution.
    
    Args:
        alpha: Initial state distribution (vector of probabilities)
        policy: Function (state -> action probabilities)
        transition_model: Function (state, action -> next state probabilities)
        gamma: Discount factor
        n_samples: Number of states to sample
    
    Returns:
        Array of sampled states
    """
    samples = []
    n_states = len(alpha)
    
    # Initialize state from alpha
    current_state = np.random.choice(n_states, p=alpha)
    
    for _ in range(n_samples):
        samples.append(current_state)
        
        # With probability (1-gamma): reset
        if np.random.random() > gamma:
            current_state = np.random.choice(n_states, p=alpha)
        # With probability gamma: continue
        else:
            # Sample action from policy
            action_probs = policy(current_state)
            action = np.random.choice(len(action_probs), p=action_probs)
            
            # Sample next state from transition model
            next_state_probs = transition_model(current_state, action)
            current_state = np.random.choice(n_states, p=next_state_probs)
    
    return np.array(samples)

# Example usage for a simple 2-state MDP
alpha = np.array([0.7, 0.3])  # Initial distribution
policy = lambda s: np.array([0.8, 0.2])  # Dummy policy
transition_model = lambda s, a: np.array([0.9, 0.1])  # Dummy transitions
gamma = 0.9

samples = sample_from_discounted_visitation(alpha, policy, transition_model, gamma)

# Check empirical distribution
print("Empirical state distribution:")
print(np.bincount(samples) / len(samples))
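
For the dummy two-state example above, we can also compute \mu_\alpha in closed form and check that the empirical histogram approaches it as the number of samples grows (a small consistency check added here, reusing the variables defined above):

# Closed-form check for the dummy example above: both actions lead to the
# same next-state distribution, so P_d = [[0.9, 0.1], [0.9, 0.1]].
P_d = np.array([[0.9, 0.1],
                [0.9, 0.1]])
mu_analytic = (1 - gamma) * alpha @ np.linalg.inv(np.eye(2) - gamma * P_d)

print("Analytic normalized visitation:", mu_analytic)
# The empirical distribution printed above should approach this as n_samples grows.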

While the math shows that sampling from the discounted visitation distribution μα\boldsymbol{\mu}_\alpha would give us unbiased policy gradient estimates, Thomas (2014) demonstrated that this implementation can be detrimental to performance in practice. The issue arises because terminating trajectories early (with probability 1γ1-\gamma) reduces the effective amount of data we collect from each trajectory. This early termination weakens the learning signal, as many trajectories don’t reach meaningful terminal states or rewards.

Therefore, in practice, we typically sample complete trajectories from the undiscounted process (i.e., run the policy until natural termination or a fixed horizon) while still using γ\gamma in the advantage estimation. This approach preserves the full learning signal from each trajectory and has been empirically shown to lead to better performance.

This is one of several cases in RL where the theoretically optimal procedure differs from the best practical implementation.

Policy Optimization with a Model

In this section, we’ll explore how incorporating a model of the dynamics can help us design better policy gradient estimators. Let’s begin with a purely model-free approach in which a deterministic policy is trained to maximize a learned critic:

J(w)=Esρ[Q(s,d(s;w))],wJ(w)=Esρ[aQ(s,a)a=d(s;w)wd(s;w)]J(\boldsymbol{w}) = \mathbb{E}_{s\sim\rho}[Q(s,d(s;\boldsymbol{w}))], \enspace \nabla_{\boldsymbol{w}} J(\boldsymbol{w}) = \mathbb{E}_{s\sim\rho}[\nabla_a Q(s,a)|_{a=d(s;\boldsymbol{w})} \nabla_{\boldsymbol{w}} d(s;\boldsymbol{w})]
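
To make the chain rule concrete, here is a minimal sketch with an assumed quadratic critic Q(s,a) = -(a - 2s)^2 and a linear deterministic policy d(s;w) = ws (both illustrative choices, not from the text). The deterministic policy gradient estimate is compared against finite differences of the objective.

import numpy as np

rng = np.random.default_rng(1)
states = rng.standard_normal(10_000)          # samples s ~ rho

Q     = lambda s, a: -(a - 2.0 * s) ** 2      # assumed critic
dQ_da = lambda s, a: -2.0 * (a - 2.0 * s)     # its gradient with respect to the action
d     = lambda s, w: w * s                    # deterministic policy d(s; w)

w = 0.5
# Deterministic policy gradient: E[ dQ/da |_{a=d(s;w)} * dd/dw ]  with dd/dw = s
grad_dpg = np.mean(dQ_da(states, d(states, w)) * states)

# Finite-difference check of J(w) = E[Q(s, d(s;w))]
eps = 1e-5
J = lambda w: np.mean(Q(states, d(states, w)))
grad_fd = (J(w + eps) - J(w - eps)) / (2 * eps)

print(grad_dpg, grad_fd)   # the two estimates should agree closely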

Using the recursive structure of the Bellman equations, we can unroll our objective one step ahead:

J(w)=Esρ[r(s,d(s;w))+γEsp(s,d(s;w))[Q(s,d(s;w))]],J(\boldsymbol{w}) = \mathbb{E}_{s\sim\rho}[r(s,d(s;\boldsymbol{w})) + \gamma\mathbb{E}_{s'\sim p(\cdot|s,d(s;\boldsymbol{w}))}[Q(s',d(s';\boldsymbol{w}))]],

To differentiate this objective, we need access to both a model of the dynamics and the reward function, as shown in the following expression:

wJ(w)=Esρ[ar(s,a)a=d(s;w)wd(s;w)+γ(Esp(s,d(s;w))[aQ(s,a)a=d(s;w)wd(s;w)]+wd(s;w)sap(ss,a)a=d(s;w)Q(s,d(s;w))ds)]\nabla_{\boldsymbol{w}} J(\boldsymbol{w}) = \mathbb{E}_{s\sim\rho}[\nabla_a r(s,a)|_{a=d(s;\boldsymbol{w})}\nabla_{\boldsymbol{w}} d(s;\boldsymbol{w}) + \gamma(\mathbb{E}_{s'\sim p(\cdot|s,d(s;\boldsymbol{w}))}[\nabla_a Q(s',a)|_{a=d(s';\boldsymbol{w})}\nabla_{\boldsymbol{w}} d(s';\boldsymbol{w})] + \\\nabla_{\boldsymbol{w}} d(s;\boldsymbol{w})\int_{s'} \nabla_a p(s'|s,a)|_{a=d(s;\boldsymbol{w})}Q(s',d(s';\boldsymbol{w}))ds')]

While w\boldsymbol{w} doesn’t appear in the outer expectation over initial states, it affects the inner expectation over next states—a distributional dependency. As a result, the product rule of calculus yields two terms: the first being an expectation, and the second being problematic for sample average estimation. However, we have tools to address this challenge: we can either apply the reparameterization trick to the dynamics or use score function estimators.

For the reparameterization approach, assuming s' = f(s,a,\xi) where \xi is the noise variable:

\begin{align*} J^{\text{DPMB-R}}(\boldsymbol{w}) &= \mathbb{E}_{s\sim\rho,\xi}[r(s,d(s;\boldsymbol{w})) + \gamma Q(f(s,d(s;\boldsymbol{w}),\xi),d(f(s,d(s;\boldsymbol{w}),\xi);\boldsymbol{w}))]\\ \nabla_{\boldsymbol{w}} J^{\text{DPMB-R}}(\boldsymbol{w}) &= \mathbb{E}_{s\sim\rho,\xi}[\nabla_a r(s,a)|_{a=d(s;\boldsymbol{w})}\nabla_{\boldsymbol{w}} d(s;\boldsymbol{w}) + \\ &\gamma(\nabla_a Q(s',a)|_{a=d(s';\boldsymbol{w})}\nabla_{\boldsymbol{w}} d(s';\boldsymbol{w}) + \nabla_{s'} Q(s',d(s';\boldsymbol{w}))\nabla_a f(s,a,\xi)|_{a=d(s;\boldsymbol{w})}\nabla_{\boldsymbol{w}} d(s;\boldsymbol{w}))] \end{align*}

As for the score function approach:

wJDPMB-SF(w)=Esρ[ar(s,a)a=d(s;w)wd(s;w)+γEsp(s,d(s;w))[wlogp(ss,d(s;w))Q(s,d(s;w))+aQ(s,a)a=d(s;w)wd(s;w)]]\begin{align*} \nabla_{\boldsymbol{w}} J^{\text{DPMB-SF}}(\boldsymbol{w}) = \mathbb{E}_{s\sim\rho}[&\nabla_a r(s,a)|_{a=d(s;\boldsymbol{w})}\nabla_{\boldsymbol{w}} d(s;\boldsymbol{w}) + \\ &\gamma\mathbb{E}_{s'\sim p(\cdot|s,d(s;\boldsymbol{w}))}[\nabla_{\boldsymbol{w}} \log p(s'|s,d(s;\boldsymbol{w}))Q(s',d(s';\boldsymbol{w})) + \nabla_a Q(s',a)|_{a=d(s';\boldsymbol{w})}\nabla_{\boldsymbol{w}} d(s';\boldsymbol{w})]] \end{align*}

We’ve now developed a hybrid algorithm that combines model-based and model-free approaches while integrating derivative estimators for stochastic dynamics with a deterministic policy parameterization. While this hybrid estimator remains relatively unexplored in practice, it could prove valuable for systems with specific structural properties.

Consider a robotics scenario with hybrid continuous-discrete dynamics: a robotic arm operates in continuous space but interacts with discrete object states. While the arm’s policy remains differentiable (wd\nabla_{\boldsymbol{w}} d), the object state transitions follow categorical distributions. In this case, reparameterization becomes impractical, but the score function approach is viable if we can compute wlogp(ss,d(s;w))\nabla_{\boldsymbol{w}} \log p(s'|s,d(s;\boldsymbol{w})) from the known transition model. Similar structures arise in manufacturing processes, where machine actions might be continuous and differentiable, but material state transitions often follow discrete steps with known probabilities. Note that both approaches require knowledge of transition probabilities and won’t work with pure black-box simulators or systems where we can only sample transitions without probability estimates.
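
As a toy illustration of this hybrid setting (every modeling choice below is an assumption made for the sketch), suppose a continuous action a controls a categorical next-state distribution p(s'|s,a) = \text{softmax}(c \cdot a) with known coefficients c. The logits are differentiable in a, so the score term \nabla_{\boldsymbol{w}} \log p(s'|s,d(s;\boldsymbol{w})) factors as \nabla_a \log p(s'|s,a)|_{a=d(s;\boldsymbol{w})} \nabla_{\boldsymbol{w}} d(s;\boldsymbol{w}), even though s' itself is discrete.

import numpy as np

rng = np.random.default_rng(2)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

c = np.array([1.0, -0.5, 0.2])        # known logit coefficients, one per discrete next state
Q_next = np.array([0.0, 1.0, 3.0])    # placeholder values standing in for Q(s', d(s'; w))

def score_term_estimate(w, s, n_samples=50_000):
    """Monte Carlo estimate of E_{s'}[ grad_w log p(s'|s, d(s;w)) * Q(s', .) ]
    for a linear deterministic policy d(s; w) = w * s."""
    a = w * s                          # d(s; w)
    probs = softmax(c * a)             # p(.|s, a)
    s_next = rng.choice(len(c), size=n_samples, p=probs)
    # grad_a log p(s'|s,a) = c[s'] - E_p[c]  (softmax score); the chain rule adds dd/dw = s
    grad_a_logp = c[s_next] - probs @ c
    return np.mean(grad_a_logp * s * Q_next[s_next])

# Check against the exact expectation  sum_{s'} grad_w p(s'|s, d(s;w)) Q(s')
w, s = 0.7, 1.3
probs = softmax(c * (w * s))
grad_p_exact = probs * (c - probs @ c) * s
print(score_term_estimate(w, s), grad_p_exact @ Q_next)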

Another dimension to explore in our algorithm design is the number of steps we wish to unroll our model. This allows us to better understand and control the bias-variance tradeoffs in our methods.

Backpropagation Policy Optimization

The question of which derivative estimator to use arises only with stochastic dynamics. For systems with deterministic dynamics and a deterministic policy, the one-step unrolled gradient simplifies to:

wJ(w)=Esρ[ar(s,a)a=d(s;w)wd(s;w)+γ(aQ(s,a)a=d(s;w)wd(s;w)+wd(s;w)af(s,a)a=d(s;w)sQ(s,d(s;w)))]\begin{align*} \nabla_{\boldsymbol{w}} J(\boldsymbol{w}) = &\mathbb{E}_{s\sim\rho}[\nabla_a r(s,a)|_{a=d(s;\boldsymbol{w})}\nabla_{\boldsymbol{w}} d(s;\boldsymbol{w}) + \\ &\gamma(\nabla_a Q(s',a)|_{a=d(s';\boldsymbol{w})}\nabla_{\boldsymbol{w}} d(s';\boldsymbol{w}) + \nabla_{\boldsymbol{w}} d(s;\boldsymbol{w})\nabla_a f(s,a)|_{a=d(s;\boldsymbol{w})}\nabla_{s'} Q(s',d(s';\boldsymbol{w})))] \end{align*}

where s=f(s,d(s;w))s' = f(s,d(s;\boldsymbol{w})) is the deterministic next state. Notice the resemblance between this expression and that obtained for wJDPMB-R\nabla_{\boldsymbol{w}} J^{\text{DPMB-R}} above. They are essentially the same except that in the reparameterized case, the dynamics have made the dependence on the noise variable explicit and the outer expectation has been updated accordingly. This similarity arises because differentiation through reparameterized dynamics models is, in essence, backpropagation: it tracks the propagation of perturbations through a computation graph—which we refer to as a stochastic computation graph in this setting.

Still in this simplified setting with deterministic policies and dynamics, we can extend the gradient expression through n steps of model unrolling, leading to:

wJ(w)=Esρ[t=0n1γt(ar(st,at)at=d(st;w)wd(st;w)+str(st,d(st;w))wst)+γn(aQ(sn,a)a=d(sn;w)wd(sn;w)+snQ(sn,d(sn;w))wsn)]\begin{align*} \nabla_{\boldsymbol{w}} J(\boldsymbol{w}) = &\mathbb{E}_{s\sim\rho}[\sum_{t=0}^{n-1} \gamma^t(\nabla_a r(s_t,a_t)|_{a_t=d(s_t;\boldsymbol{w})}\nabla_{\boldsymbol{w}} d(s_t;\boldsymbol{w}) + \nabla_{s_t} r(s_t,d(s_t;\boldsymbol{w}))\nabla_{\boldsymbol{w}} s_t) + \\ &\gamma^n(\nabla_a Q(s_n,a)|_{a=d(s_n;\boldsymbol{w})}\nabla_{\boldsymbol{w}} d(s_n;\boldsymbol{w}) + \nabla_{s_n} Q(s_n,d(s_n;\boldsymbol{w}))\nabla_{\boldsymbol{w}} s_n)] \end{align*}

where st+1=f(st,d(st;w))s_{t+1} = f(s_t,d(s_t;\boldsymbol{w})) for t=0,,n1t=0,\ldots,n-1, s0=ss_0=s, and wst\nabla_{\boldsymbol{w}} s_t follows the recursive relationship:

wst+1=af(st,a)a=d(st;w)wd(st;w)+sf(st,d(st;w))wst\nabla_{\boldsymbol{w}} s_{t+1} = \nabla_a f(s_t,a)|_{a=d(s_t;\boldsymbol{w})}\nabla_{\boldsymbol{w}} d(s_t;\boldsymbol{w}) + \nabla_s f(s_t,d(s_t;\boldsymbol{w}))\nabla_{\boldsymbol{w}} s_t

with base case ws0=0\nabla_{\boldsymbol{w}} s_0 = 0 since the initial state does not depend on the policy parameters.
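
The recursion can be checked numerically. Below is a minimal sketch, assuming a scalar linear system f(s,a) = 0.9s + a, a quadratic reward, a linear policy d(s;w) = ws, and no terminal critic (all illustrative choices, not from the text); the \nabla_{s} factors from the expressions above are treated as total derivatives of the composition through the policy's state dependence. The accumulated gradient is compared against finite differences of the n-step return.

import numpy as np

# Assumed scalar system, purely illustrative
f      = lambda s, a: 0.9 * s + a            # deterministic dynamics
df_ds  = lambda s, a: 0.9
df_da  = lambda s, a: 1.0
r      = lambda s, a: -(s ** 2) - 0.1 * a ** 2
dr_ds  = lambda s, a: -2.0 * s
dr_da  = lambda s, a: -0.2 * a
d      = lambda s, w: w * s                  # deterministic policy
dd_dw  = lambda s, w: s
dd_ds  = lambda s, w: w

def n_step_return(s0, w, gamma, n):
    J, s = 0.0, s0
    for t in range(n):
        a = d(s, w)
        J += gamma ** t * r(s, a)
        s = f(s, a)
    return J                                  # terminal critic taken to be zero here

def n_step_gradient(s0, w, gamma, n):
    grad, s, ds_dw = 0.0, s0, 0.0             # base case: grad_w s_0 = 0
    for t in range(n):
        a = d(s, w)
        # total derivative of gamma^t r(s_t, d(s_t; w)) with respect to w
        grad += gamma ** t * (dr_da(s, a) * dd_dw(s, w)
                              + (dr_ds(s, a) + dr_da(s, a) * dd_ds(s, w)) * ds_dw)
        # recursive update of grad_w s_{t+1}
        ds_dw = (df_da(s, a) * dd_dw(s, w)
                 + (df_ds(s, a) + df_da(s, a) * dd_ds(s, w)) * ds_dw)
        s = f(s, a)
    return grad

s0, w, gamma, n = 1.0, -0.3, 0.95, 20
eps = 1e-6
g_fd = (n_step_return(s0, w + eps, gamma, n) - n_step_return(s0, w - eps, gamma, n)) / (2 * eps)
print(n_step_gradient(s0, w, gamma, n), g_fd)   # should agree closely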

At the two ends of the spectrum, n=0 falls back to the pure-critic, NFQCA-like approach, while n=\infty does not bootstrap at all and is fully model-based without a critic. We refer to this pure model-based, critic-free approach as backpropagation-based policy optimization (BPO).

But just as backpropagation through RNNs or very deep networks can be challenging due to exploding and vanishing gradients, “vanilla” Backpropagation Policy Optimization (BPO) without a critic can severely suffer from the curse of horizon. This is because it essentially implements single shooting optimization. Using a critic can be an effective remedy to this problem, allowing us to better control the bias-variance tradeoff while preserving gradient information that would be lost with a more drastic truncated backpropagation approach.

Stochastic Value Gradient (SVG)

The stochastic value gradient framework of Heess et al. (2015) applies this recipe for policy optimization with a model, using reparameterized dynamics together with a stochastic, reparameterized policy. In this setting, the model-based reparameterized estimator over n steps is

\begin{align*} J^{\text{SPMB-R-N}}(\boldsymbol{w}) &= \mathbb{E}_{s\sim\rho,\{\epsilon_i\}_{i=0}^{n},\{\xi_i\}_{i=0}^{n-1}}[\sum_{i=0}^{n-1} \gamma^i r(s_i,d(s_i,\epsilon_i;\boldsymbol{w})) + \gamma^n Q(s_n,d(s_n,\epsilon_n;\boldsymbol{w}))] \end{align*}

where s0=ss_0 = s and for i0i \geq 0, si+1=f(si,d(si,ϵi;w),ξi)s_{i+1} = f(s_i,d(s_i,\epsilon_i;\boldsymbol{w}),\xi_i). The gradient becomes:

\begin{align*} \nabla_{\boldsymbol{w}} J^{\text{SPMB-R-N}}(\boldsymbol{w}) &= \mathbb{E}_{s\sim\rho,\{\epsilon_i\}_{i=0}^{n},\{\xi_i\}_{i=0}^{n-1}}[\sum_{i=0}^{n-1} \gamma^i \left(\nabla_a r(s_i,a)|_{a=d(s_i,\epsilon_i;\boldsymbol{w})}\nabla_{\boldsymbol{w}} d(s_i,\epsilon_i;\boldsymbol{w}) + \nabla_{s_i} r(s_i,d(s_i,\epsilon_i;\boldsymbol{w}))\nabla_{\boldsymbol{w}} s_i\right) \enspace + \\ &\qquad \gamma^n(\nabla_a Q(s_n,a)|_{a=d(s_n,\epsilon_n;\boldsymbol{w})}\nabla_{\boldsymbol{w}} d(s_n,\epsilon_n;\boldsymbol{w}) + \nabla_{s_n} Q(s_n,d(s_n,\epsilon_n;\boldsymbol{w}))\nabla_{\boldsymbol{w}} s_n)] \end{align*}

where ws0=0\nabla_{\boldsymbol{w}} s_0 = 0 and for i0i \geq 0:

wsi+1=af(si,a,ξi)a=d(si,ϵi;w)wd(si,ϵi;w)+sif(si,d(si,ϵi;w),ξi)wsi\nabla_{\boldsymbol{w}} s_{i+1} = \nabla_a f(s_i,a,\xi_i)|_{a=d(s_i,\epsilon_i;\boldsymbol{w})}\nabla_{\boldsymbol{w}} d(s_i,\epsilon_i;\boldsymbol{w}) + \nabla_{s_i} f(s_i,d(s_i,\epsilon_i;\boldsymbol{w}),\xi_i)\nabla_{\boldsymbol{w}} s_i

While we could implement this gradient expression by hand, it is much easier, less error-prone, and usually faster to rely on automatic differentiation. Given a set of rollouts for which we know the primitive noise variables, we can compute the Monte Carlo objective:

J^SPMB-R-N(w)=1Mm=1M[i=0n1γir(sim,d(sim,ϵim;w))+γnQ(snm,d(snm,ϵnm;w))]\hat{J}^{\text{SPMB-R-N}}(\boldsymbol{w}) = \frac{1}{M}\sum_{m=1}^M [\sum_{i=0}^{n-1} \gamma^i r(s_i^m,d(s_i^m,\epsilon_i^m;\boldsymbol{w})) + \gamma^n Q(s_n^m,d(s_n^m,\epsilon_n^m;\boldsymbol{w}))]

where the states are generated recursively using the known noise variables: starting with initial state s_0^m, each subsequent state is computed as s_{i+1}^m = f(s_i^m,d(s_i^m,\epsilon_i^m;\boldsymbol{w}),\xi_i^m). Thus, a trajectory is completely determined by the sequence of noise variables (\epsilon_0^m, \xi_0^m, \epsilon_1^m, \xi_1^m, ..., \epsilon_{n-1}^m, \xi_{n-1}^m, \epsilon_n^m), where \epsilon_i^m are the action noise variables and \xi_i^m are the dynamics noise variables.
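
To make this concrete, here is a minimal sketch of rebuilding the surrogate from stored noise, assuming a scalar linear-Gaussian policy and dynamics and a placeholder critic (all illustrative, not from the paper). Because the rollout is a deterministic function of \boldsymbol{w} once the noise is fixed, the surrogate can be differentiated; finite differences are used below, while in practice automatic differentiation would do this for us.

import numpy as np

rng = np.random.default_rng(3)

# Assumed components, purely illustrative
d = lambda s, eps, w: w * s + 0.1 * eps          # stochastic policy via reparameterization
f = lambda s, a, xi: 0.9 * s + a + 0.05 * xi     # stochastic dynamics via reparameterization
r = lambda s, a: -(s ** 2) - 0.1 * a ** 2
Q = lambda s, a: -(s ** 2)                       # placeholder critic

def surrogate(w, s0, eps, xi, gamma):
    """J-hat for one rollout, rebuilt from the stored noise (eps_0..eps_n, xi_0..xi_{n-1})."""
    n = len(xi)
    J, s = 0.0, s0
    for i in range(n):
        a = d(s, eps[i], w)
        J += gamma ** i * r(s, a)
        s = f(s, a, xi[i])
    return J + gamma ** n * Q(s, d(s, eps[n], w))

# One recorded rollout: initial state plus its primitive noise variables
n, gamma = 5, 0.95
s0  = rng.standard_normal()
eps = rng.standard_normal(n + 1)     # action noise eps_0 .. eps_n
xi  = rng.standard_normal(n)         # dynamics noise xi_0 .. xi_{n-1}

w, h = 0.4, 1e-5
grad_fd = (surrogate(w + h, s0, eps, xi, gamma) - surrogate(w - h, s0, eps, xi, gamma)) / (2 * h)
print(surrogate(w, s0, eps, xi, gamma), grad_fd)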

The choice of unroll steps nn gives us precise control over the balance between model-based and critic-based components in our gradient estimator. At one extreme, setting n=0n = 0 yields a purely model-free algorithm known as SVG(0) (Heess et al., 2015), which relies entirely on the critic for value estimation:

J^SPMB-R-0(w)=1Mm=1MQ(s0m,d(s0m,ϵ0m;w))\hat{J}^{\text{SPMB-R-0}}(\boldsymbol{w}) = \frac{1}{M}\sum_{m=1}^M Q(s_0^m,d(s_0^m,\epsilon_0^m;\boldsymbol{w}))

At the other extreme, as nn \to \infty, we can drop the critic term (since γnQ0\gamma^n Q \to 0) to obtain a purely model-based algorithm, SVG(∞):

J^SPMB-R-(w)=1Mm=1Mi=0γir(sim,d(sim,ϵim;w))\hat{J}^{\text{SPMB-R-$\infty$}}(\boldsymbol{w}) = \frac{1}{M}\sum_{m=1}^M \sum_{i=0}^{\infty} \gamma^i r(s_i^m,d(s_i^m,\epsilon_i^m;\boldsymbol{w}))

In the 2015 paper, the authors make a specific choice to reparameterize both the dynamics and action selections using normal distributions. For the policy, they use:

at=d(st;w)+σ(st;w)ϵt,ϵtN(0,I)a_t = d(s_t;\boldsymbol{w}) + \sigma(s_t;\boldsymbol{w})\epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0,I)

where d(st;w)d(s_t;\boldsymbol{w}) predicts the mean action and σ(st;w)\sigma(s_t;\boldsymbol{w}) predicts the standard deviation. For the dynamics:

st+1=μ(st,at;ϕ)+Σ(st,at;ϕ)ξt,ξtN(0,I)s_{t+1} = \mu(s_t,a_t;\boldsymbol{\phi}) + \Sigma(s_t,a_t;\boldsymbol{\phi})\xi_t, \quad \xi_t \sim \mathcal{N}(0,I)

where \mu(s_t,a_t;\boldsymbol{\phi}) predicts the mean next state and \Sigma(s_t,a_t;\boldsymbol{\phi}) predicts the scale matrix of the noise (so that the next-state covariance is \Sigma\Sigma^\top).

Under this reparameterization, the n-step surrogate loss becomes:

J^SPMB-R-n(w)=1Mm=1M[t=0n1γtr(stm(w),d(stm(w);w)+σ(stm(w);w)ϵtm)+γnQ(snm(w),d(snm(w);w)+σ(snm(w);w)ϵnm)]\begin{align*} \hat{J}^{\text{SPMB-R-n}}(\boldsymbol{w}) = \frac{1}{M}&\sum_{m=1}^M [\sum_{t=0}^{n-1} \gamma^t r(s_t^m(\boldsymbol{w}),d(s_t^m(\boldsymbol{w});\boldsymbol{w}) + \sigma(s_t^m(\boldsymbol{w});\boldsymbol{w})\epsilon_t^m) + \\ &\gamma^n Q(s_n^m(\boldsymbol{w}),d(s_n^m(\boldsymbol{w});\boldsymbol{w}) + \sigma(s_n^m(\boldsymbol{w});\boldsymbol{w})\epsilon_n^m)] \end{align*}

where:

st+1m(w)=μ(stm(w),d(stm(w);w)+σ(stm(w);w)ϵtm;ϕ)+Σ(stm(w),atm(w);ϕ)ξtms_{t+1}^m(\boldsymbol{w}) = \mu(s_t^m(\boldsymbol{w}),d(s_t^m(\boldsymbol{w});\boldsymbol{w}) + \sigma(s_t^m(\boldsymbol{w});\boldsymbol{w})\epsilon_t^m;\boldsymbol{\phi}) + \Sigma(s_t^m(\boldsymbol{w}),a_t^m(\boldsymbol{w});\boldsymbol{\phi})\xi_t^m

Noise Inference in SVG

The method we’ve presented so far assumes we have direct access to the noise variables ϵ\epsilon and ξ\xi used to generate trajectories. This works well in the on-policy setting where we generate our own data. However, in off-policy scenarios where we receive externally generated trajectories, we need to infer these noise variables—a process the authors call noise inference.

For the Gaussian case discussed above, this inference is straightforward. Given an observed scalar action ata_t and the current policy parameters w\boldsymbol{w}, we can recover the action noise ϵt\epsilon_t through inverse reparameterization:

ϵt=atd(st;w)σ(st;w)\epsilon_t = \frac{a_t - d(s_t;\boldsymbol{w})}{\sigma(s_t;\boldsymbol{w})}

Similarly, for the dynamics noise where states are typically vector-valued:

ξt=Σ(st,at;ϕ)1(st+1μ(st,at;ϕ))\xi_t = \Sigma(s_t,a_t;\boldsymbol{\phi})^{-1}(s_{t+1} - \mu(s_t,a_t;\boldsymbol{\phi}))

This simple inversion is possible because the Gaussian reparameterization is an affine transformation, which is invertible as long as σ(st;w)\sigma(s_t;\boldsymbol{w}) is non-zero for scalar actions and Σ(st,at;ϕ)\Sigma(s_t,a_t;\boldsymbol{\phi}) is full rank for vector-valued states. The same principle extends naturally to vector-valued actions, where σ\sigma would be replaced by a full covariance matrix.
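
As a quick round-trip check of this inversion, the sketch below generates a transition with known noise from an assumed linear-Gaussian policy and dynamics model (all parameter values are illustrative) and then recovers the noise from the observed action and next state.

import numpy as np

rng = np.random.default_rng(4)

# Assumed Gaussian policy (scalar action) and dynamics (2-d state), for illustration only
w_mean, w_std = 0.8, 0.3                     # d(s; w) = w_mean * s[0], sigma(s; w) = w_std
A = np.array([[0.9, 0.1], [0.0, 0.8]])       # mu(s, a; phi) = A s + b a
b = np.array([1.0, 0.5])
L = np.array([[0.2, 0.0], [0.05, 0.1]])      # Sigma(s, a; phi): full-rank scale matrix

s = rng.standard_normal(2)

# Forward pass: sample a transition with known noise
eps_true = rng.standard_normal()
a = w_mean * s[0] + w_std * eps_true
xi_true = rng.standard_normal(2)
s_next = A @ s + b * a + L @ xi_true

# Noise inference: invert the affine reparameterizations
eps_inferred = (a - w_mean * s[0]) / w_std
xi_inferred = np.linalg.solve(L, s_next - (A @ s + b * a))

print(np.allclose(eps_inferred, eps_true), np.allclose(xi_inferred, xi_true))   # True True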

More generally, this idea of invertible transformations can be extended far beyond simple Gaussian reparameterization. We could consider a sequence of invertible transformations:

z0N(0,I)f1z1f2z2f3fKzK=atz_0 \sim \mathcal{N}(0,I) \xrightarrow{f_1} z_1 \xrightarrow{f_2} z_2 \xrightarrow{f_3} \cdots \xrightarrow{f_K} z_K = a_t

where each fkf_k is an invertible neural network layer. The forward process can be written compactly as:

at=(fKfK1f1)(z0;w)a_t = (f_K \circ f_{K-1} \circ \cdots \circ f_1)(z_0;\boldsymbol{w})

This is the idea behind normalizing flows: a series of invertible transformations that can transform a simple base distribution (like a standard normal) into a complex target distribution while maintaining exact invertibility.

The noise inference in this case would involve applying the inverse transformations:

z0=(f11fK1)(at;w)z_0 = (f_1^{-1} \circ \cdots \circ f_K^{-1})(a_t;\boldsymbol{w})

This approach offers several advantages:

  1. More expressive policies and dynamics models capable of capturing multimodal distributions

  2. Exact likelihood computation through the change-of-variables formula (useful, for example, for the log-probability terms in entropy regularization)

  3. Precise noise inference through the guaranteed invertibility of the flow

As far as I know, this approach has not been explored in the literature.
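
To make the construction concrete, here is a minimal sketch of such an invertible mapping built from elementwise affine layers followed by leaky-ReLU nonlinearities, chosen here only because their inverses are trivial to write down (a toy construction, not an implementation from any published method):

import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x >= 0, x, slope * x)

def leaky_relu_inv(y, slope=0.2):
    return np.where(y >= 0, y, y / slope)

def flow_forward(z, scales, shifts):
    """Apply K invertible layers: z <- leaky_relu(exp(s_k) * z + b_k)."""
    for s_k, b_k in zip(scales, shifts):
        z = leaky_relu(np.exp(s_k) * z + b_k)
    return z

def flow_inverse(a, scales, shifts):
    """Invert the layers in reverse order."""
    for s_k, b_k in zip(reversed(scales), reversed(shifts)):
        a = (leaky_relu_inv(a) - b_k) / np.exp(s_k)
    return a

rng = np.random.default_rng(5)
K, dim = 3, 2                                                  # 3 layers, 2-d action
scales = [rng.standard_normal(dim) * 0.5 for _ in range(K)]    # would be learnable in practice
shifts = [rng.standard_normal(dim) * 0.5 for _ in range(K)]

z0 = rng.standard_normal(dim)                 # base noise
a  = flow_forward(z0, scales, shifts)         # action produced by the flow
z0_recovered = flow_inverse(a, scales, shifts)

print(np.allclose(z0, z0_recovered))          # True: exact noise inference

In practice, the scales and shifts would be produced by networks conditioned on the state, more expressive invertible layers (such as coupling layers) would be used, and the per-layer log-determinants would be accumulated to obtain exact log-probabilities.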

DPG as a Special Case of SAC

At first glance, SAC and DPG might appear to be fundamentally different approaches to policy optimization. SAC begins with the principle of entropy maximization and policy distribution matching through KL divergence minimization, while DPG directly optimizes a deterministic policy to maximize expected Q-values. However, we can show that DPG is a special case of SAC when the temperature parameter goes to zero. Intuitively, as the temperature vanishes the entropy bonus disappears and the Boltzmann target distribution over actions collapses onto the greedy action, so the stochastic policy update reduces to the deterministic update \nabla_a Q \, \nabla_{\boldsymbol{w}} d of DPG.

Policy Optimization with a Trust Region

Trust region methods in optimization approximate the objective function with a simpler local model within a region where we “trust” this approximation to be accurate. This requires defining what we mean by a local region, and therefore choosing a geometry suited to the problem.

In standard optimization over the Euclidean space \mathbb{R}^n, at each iteration k we build a quadratic approximation around the current point x_k:

mk(p)=f(xk)+gkTp+12pTBkpm_k(p) = f(x_k) + g_k^T p + \frac{1}{2}p^T B_k p

where gk=f(xk)g_k = \nabla f(x_k) is the gradient and BkB_k approximates the Hessian. The trust region constrains updates using Euclidean distance:

minpmk(p) subject to pΔk\min_p m_k(p) \text{ subject to } \|p\| \leq \Delta_k

However, when optimizing over probability distributions p(x;\theta), the Euclidean geometry becomes unnatural. Instead, the Kullback-Leibler divergence provides a more natural means of measuring proximity:

DKL(p(x;θ)p(x;θk))=p(x;θ)log(p(x;θ)p(x;θk))dxD_{KL}(p(x;\theta) || p(x;\theta_k)) = \int p(x;\theta) \log\left(\frac{p(x;\theta)}{p(x;\theta_k)}\right)dx

This leads to the following trust region subproblem:

minθmk(θ) subject to DKL(p(x;θ)p(x;θk))Δk\min_\theta m_k(\theta) \text{ subject to } D_{KL}(p(x;\theta) || p(x;\theta_k)) \leq \Delta_k

For smooth parametric families (including exponential families), the KL divergence locally reduces to a quadratic form involving the Fisher Information Matrix I(\theta_k):

DKL(p(x;θ)p(x;θk))12(θθk)TI(θk)(θθk)D_{KL}(p(x;\theta) || p(x;\theta_k)) \approx \frac{1}{2}(\theta - \theta_k)^T I(\theta_k)(\theta - \theta_k)
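
This local equivalence is easy to verify numerically. For a categorical distribution with softmax logits \theta, the Fisher information matrix is I(\theta) = \text{diag}(p) - pp^\top, and the standalone check below (not code from the text) compares the exact KL divergence for a small parameter step against the quadratic form.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q):
    return np.sum(p * np.log(p / q))

rng = np.random.default_rng(6)
theta_k = rng.standard_normal(4)        # current logits
delta = 0.01 * rng.standard_normal(4)   # small parameter step

p_k = softmax(theta_k)
p_new = softmax(theta_k + delta)

fisher = np.diag(p_k) - np.outer(p_k, p_k)          # I(theta_k) for softmax logits
quad = 0.5 * delta @ fisher @ delta

print(kl(p_new, p_k), quad)             # nearly identical for small steps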

In both cases, after solving for the step, we evaluate the actual versus predicted reduction ratio:

ρk=f(xk)f(xk+p)mk(0)mk(p)\rho_k = \frac{f(x_k) - f(x_k + p)}{m_k(0) - m_k(p)}

This ratio determines both step acceptance and trust region adjustment:

Δk+1={α1Δkif ρk<η1 (poor prediction)Δkif η1ρk<η2 (acceptable)α2Δkif ρkη2 (very good)\Delta_{k+1} = \begin{cases} \alpha_1 \Delta_k & \text{if } \rho_k < \eta_1 \text{ (poor prediction)} \\ \Delta_k & \text{if } \eta_1 \leq \rho_k < \eta_2 \text{ (acceptable)} \\ \alpha_2 \Delta_k & \text{if } \rho_k \geq \eta_2 \text{ (very good)} \end{cases}

The method accepts steps when the model prediction is sufficiently accurate:

xk+1={xk+pif ρk>η1xkotherwisex_{k+1} = \begin{cases} x_k + p & \text{if } \rho_k > \eta_1 \\ x_k & \text{otherwise} \end{cases}
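
The acceptance test and radius update translate directly into code. The helper below is a generic sketch; the threshold and scaling constants eta1, eta2, alpha1, alpha2 follow the equations above, and their default values are illustrative choices rather than prescribed ones.

def trust_region_step(f_curr, f_trial, model_decrease, delta,
                      eta1=0.1, eta2=0.75, alpha1=0.5, alpha2=2.0):
    """One accept/reject decision and radius update for a trust region method.

    f_curr:         f(x_k)
    f_trial:        f(x_k + p)
    model_decrease: m_k(0) - m_k(p), the reduction predicted by the local model
    delta:          current trust region radius Delta_k
    """
    rho = (f_curr - f_trial) / model_decrease      # actual vs. predicted reduction

    # Radius update
    if rho < eta1:
        delta_next = alpha1 * delta                # poor prediction: shrink
    elif rho < eta2:
        delta_next = delta                         # acceptable: keep
    else:
        delta_next = alpha2 * delta                # very good: expand

    # Accept the step only if the model prediction was accurate enough
    accept = rho > eta1
    return accept, delta_next, rho
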
References
  1. Hafner, R., & Riedmiller, M. (2011). Reinforcement learning in feedback control: Challenges and benchmarks from technical process control. Machine Learning, 84(1–2), 137–169. 10.1007/s10994-011-5235-x
  2. Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2015). Continuous Control with Deep Reinforcement Learning. arXiv Preprint arXiv:1509.02971.
  3. Fujimoto, S., Hoof, H., & Meger, D. (2018). Addressing Function Approximation Error in Actor-Critic Methods. International Conference on Machine Learning (ICML), 1587–1596.
  4. Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. Proceedings of the 35th International Conference on Machine Learning (ICML), 1861–1870.