
Policy Gradient Methods

The previous chapter showed how to handle continuous action spaces in fitted Q-iteration by amortizing action selection with policy networks. Methods like NFQCA, DDPG, TD3, and SAC all learn both a Q-function and a policy, using the Q-function to guide policy improvement. This chapter explores a different approach: optimizing policies directly without maintaining explicit value functions.

Direct policy optimization offers several advantages. First, it naturally handles stochastic policies, which can be essential for partially observable environments or problems requiring explicit exploration. Second, it avoids the detour through value function approximation, which may introduce errors that compound during policy extraction. Third, for problems with simple policy classes but complex value landscapes, directly searching in policy space can be more efficient than searching in value space.

The foundation of policy gradient methods rests on computing gradients of expected returns with respect to policy parameters. This chapter develops the mathematical machinery needed for this computation, starting with general derivative estimation techniques from stochastic optimization, then specializing to reinforcement learning settings, and finally examining variance reduction methods that make these estimators practical.

Derivative Estimation for Stochastic Optimization

Consider optimizing an objective that involves an expectation:

$$J(\theta) = \mathbb{E}_{x \sim p(x;\theta)}[f(x,\theta)]$$

For concreteness, consider a simple example where $x \sim \mathcal{N}(\theta,1)$ and $f(x,\theta) = x^2\theta$. The derivative we seek is:

$$\frac{d}{d\theta}J(\theta) = \frac{d}{d\theta}\int x^2\theta \, p(x;\theta)\,dx$$

While we can compute this exactly for the Gaussian example, this is often impossible for more general problems. We might then be tempted to approximate our objective using samples:

$$J(\theta) \approx \frac{1}{N}\sum_{i=1}^N f(x_i,\theta), \quad x_i \sim p(x;\theta)$$

Then differentiate this approximation:

$$\frac{d}{d\theta}J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \frac{\partial}{\partial \theta}f(x_i,\theta)$$

However, this naive approach ignores that the samples themselves depend on $\theta$. The correct derivative requires the product rule:

$$\frac{d}{d\theta}J(\theta) = \int \frac{\partial}{\partial \theta}[f(x,\theta)p(x;\theta)]\,dx = \int \left[\frac{\partial f}{\partial \theta}p(x;\theta) + f(x,\theta)\frac{\partial p(x;\theta)}{\partial \theta}\right]dx$$

While the first term can be estimated by Monte Carlo, the second cannot: it is not written as an expectation.

To transform the objective so that its Monte Carlo estimator can be differentiated directly, while keeping the resulting derivative unbiased, there are two main solutions: a change of measure, or a change of variables.

The Likelihood Ratio Method

One solution comes from rewriting our objective using a proposal distribution $q(x)$ that does not depend on $\theta$:

$$J(\theta) = \int f(x,\theta)\frac{p(x;\theta)}{q(x)}q(x)\,dx = \mathbb{E}_{x \sim q(x)}\left[f(x,\theta)\frac{p(x;\theta)}{q(x)}\right]$$

Define the likelihood ratio $\rho(x, q, \theta) \equiv \frac{p(x;\theta)}{q(x)}$, where we treat $q$ as a separate argument. The objective becomes:

$$J(\theta) = \mathbb{E}_{x \sim q(x)}[f(x,\theta)\rho(x, q, \theta)]$$

When we differentiate $J$, we take the partial derivative with respect to $\theta$ while holding $q$ fixed (since $q$ does not depend on $\theta$):

$$\frac{d}{d\theta}J(\theta) = \mathbb{E}_{x \sim q(x)}\left[f(x,\theta)\frac{\partial \rho}{\partial \theta}(x, q, \theta) + \rho(x, q, \theta)\frac{\partial f}{\partial \theta}(x,\theta)\right]$$

The partial derivative of $\rho$ with respect to $\theta$ (treating $q$ as fixed) is:

$$\frac{\partial \rho}{\partial \theta}(x, q, \theta) = \frac{1}{q(x)}\frac{\partial p(x;\theta)}{\partial \theta} = \rho(x, q, \theta)\frac{\partial \log p(x;\theta)}{\partial \theta}$$

Now fix any reference parameter $\theta_0$ and choose the proposal distribution $q(x) = p(x;\theta_0)$. This is a fixed distribution that does not change as $\theta$ varies: we simply evaluate the family $p(x;\cdot)$ at the specific point $\theta_0$. With this choice, evaluating the gradient at $\theta = \theta_0$ gives $\rho(x, q, \theta_0) = p(x;\theta_0)/p(x;\theta_0) = 1$. The gradient formula becomes:

$$\frac{d}{d\theta}J(\theta)\Big|_{\theta=\theta_0} = \mathbb{E}_{x \sim p(x;\theta_0)}\left[f(x,\theta_0)\frac{\partial \log p(x;\theta)}{\partial \theta}\Big|_{\theta_0} + \frac{\partial f(x,\theta)}{\partial \theta}\Big|_{\theta_0}\right]$$

Since $\theta_0$ is arbitrary, we can drop the subscript and write the score function estimator as:

$$\frac{d}{d\theta}J(\theta) = \mathbb{E}_{x \sim p(x;\theta)}\left[f(x,\theta)\frac{\partial \log p(x;\theta)}{\partial \theta} + \frac{\partial f(x,\theta)}{\partial \theta}\right]$$

The Reparameterization Trick

An alternative approach eliminates the $\theta$-dependence in the sampling distribution by expressing $x$ through a deterministic transformation of the noise:

$$x = g(\epsilon,\theta), \quad \epsilon \sim q(\epsilon)$$

Therefore, if we want to sample from some target distribution $p(x;\theta)$, we can do so by first sampling from a simple base distribution $q(\epsilon)$ (like a standard normal) and then transforming those samples through a carefully chosen function $g$. If $g(\cdot,\theta)$ is invertible, the change of variables formula tells us how these distributions relate:

$$p(x;\theta) = q(g^{-1}(x,\theta))\left|\det\frac{\partial g^{-1}(x,\theta)}{\partial x}\right| = q(\epsilon)\left|\det\frac{\partial g(\epsilon,\theta)}{\partial \epsilon}\right|^{-1}$$

For example, to sample from any multivariate Gaussian distribution with mean $\mu$ and covariance matrix $\Sigma$, it suffices to sample standard normal noise and compute the linear transformation:

$$x = \mu + \Sigma^{1/2}\epsilon, \quad \epsilon \sim \mathcal{N}(0,I)$$

where $\Sigma^{1/2}$ is a matrix square root, for example the factor obtained via Cholesky decomposition. In the univariate case, this transformation is simply:

$$x = \mu + \sigma \epsilon, \quad \epsilon \sim \mathcal{N}(0,1)$$

where $\sigma = \sqrt{\sigma^2}$ is the standard deviation (the square root of the variance).
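To make the Gaussian case concrete, here is a small sketch in the JAX style used later in this chapter: sampling is reparameterized as $x = \theta + \epsilon$, so jax.grad can differentiate straight through the Monte Carlo estimate (the example objective $f(x,\theta) = x^2\theta$ anticipates the numerical experiment below).

import jax
import jax.numpy as jnp

def expected_f(theta, key, n=10_000):
    # Monte Carlo estimate of E_{x ~ N(theta, 1)}[x^2 * theta] via reparameterization
    eps = jax.random.normal(key, (n,))   # eps ~ N(0, 1): noise independent of theta
    x = theta + eps                      # x = g(eps, theta), so x ~ N(theta, 1)
    return jnp.mean(x**2 * theta)

key = jax.random.PRNGKey(0)
grad_estimate = jax.grad(expected_f)(1.0, key)   # close to the exact value 1 + 3*theta^2 = 4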

Common Examples of Reparameterization

The Truncated Normal Distribution

When we need samples constrained to an interval $[a,b]$, we can use the truncated normal distribution. To sample from it, we transform uniform noise through the inverse cumulative distribution function (CDF) of the standard normal:

$$x = \Phi^{-1}(u\Phi(b) + (1-u)\Phi(a)), \quad u \sim \text{Uniform}(0,1)$$

Here, $\Phi$ denotes the CDF of the standard normal, $\Phi^{-1}$ its inverse (the quantile function), and $a$ and $b$ are the truncation bounds.

The resulting samples follow a normal distribution restricted to $[a,b]$, with the density properly normalized over this interval.
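As a sketch (assuming the standard-normal case with fixed bounds $a$ and $b$; a location-scale version would standardize the bounds first), this transformation can be written directly with JAX's normal CDF and its inverse:

import jax
import jax.numpy as jnp
from jax.scipy.stats import norm
from jax.scipy.special import ndtri   # inverse CDF (quantile function) of the standard normal

def sample_truncated_std_normal(key, a, b, shape=()):
    # u ~ Uniform(0,1), then x = Phi^{-1}(u * Phi(b) + (1 - u) * Phi(a))
    u = jax.random.uniform(key, shape)
    return ndtri(u * norm.cdf(b) + (1.0 - u) * norm.cdf(a))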

The Kumaraswamy Distribution

When we need samples in the unit interval $[0,1]$, a natural choice might be the Beta distribution. However, its inverse CDF does not have a closed form. Instead, we can use the Kumaraswamy distribution as a convenient approximation, which allows for a simple reparameterization through its inverse CDF:

$$x = \left(1-(1-u)^{1/\beta}\right)^{1/\alpha}, \quad u \sim \text{Uniform}(0,1)$$

where $\alpha$ and $\beta$ are the two shape parameters of the distribution.

The Kumaraswamy distribution has density:

$$f(x; \alpha, \beta) = \alpha\beta x^{\alpha-1}(1-x^{\alpha})^{\beta-1}, \quad x \in [0,1]$$

The Gumbel-Softmax Distribution

When sampling from a categorical distribution with probabilities $\{\pi_i\}$, one approach uses $\text{Gumbel}(0,1)$ noise combined with the argmax of log-perturbed probabilities:

$$\text{argmax}_i(\log \pi_i + g_i), \quad g_i \sim \text{Gumbel}(0,1)$$

This approach, known in machine learning as the Gumbel-Max trick, relies on sampling Gumbel noise from uniform random variables through the transformation $g_i = -\log(-\log(u_i))$ where $u_i \sim \text{Uniform}(0,1)$. To see why this gives us samples from the categorical distribution, consider the probability of selecting category $i$:

$$
\begin{align*}
P(\text{argmax}_j(\log \pi_j + g_j) = i) &= P(\log \pi_i + g_i > \log \pi_j + g_j \text{ for all } j \neq i) \\
&= P(g_i - g_j > \log \pi_j - \log \pi_i \text{ for all } j \neq i)
\end{align*}
$$

Since the difference of two Gumbel random variables follows a logistic distribution, $g_i - g_j \sim \text{Logistic}(0,1)$, and these differences are independent for different $j$ (due to the independence of the original Gumbel variables), we can write:

$$
\begin{align*}
P(\text{argmax}_j(\log \pi_j + g_j) = i) &= \prod_{j \neq i} P(g_i - g_j > \log \pi_j - \log \pi_i) \\
&= \prod_{j \neq i} \frac{\pi_i}{\pi_i + \pi_j} = \pi_i
\end{align*}
$$

The last equality requires some additional algebra to show, but follows from the fact that these probabilities must sum to 1 over all $i$.

While the Gumbel-Max trick gives us exact samples from a categorical distribution, the argmax operation is not differentiable. For stochastic optimization problems of the form:

$$\mathbb{E}_{x \sim p(x;\theta)}[f(x)] = \mathbb{E}_{\epsilon \sim \text{Gumbel}(0,1)}[f(g(\epsilon,\theta))]$$

we need $g$ to be differentiable with respect to $\theta$. This leads us to consider a continuous relaxation where we replace the hard argmax with a temperature-controlled softmax:

$$z_i = \frac{\exp((\log \pi_i + g_i)/\tau)}{\sum_j \exp((\log \pi_j + g_j)/\tau)}$$

As $\tau \to 0$, this approximation approaches the argmax:

$$\lim_{\tau \to 0} \frac{\exp(x_i/\tau)}{\sum_j \exp(x_j/\tau)} = \begin{cases} 1 & \text{if } x_i = \max_j x_j \\ 0 & \text{otherwise} \end{cases}$$

The resulting distribution over the probability simplex is called the Gumbel-Softmax (or Concrete) distribution. The temperature parameter $\tau$ controls the discreteness of our samples: smaller values give samples closer to one-hot vectors but with less stable gradients, while larger values give smoother gradients but more diffuse samples.
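A minimal JAX sketch of this relaxed sampler (illustrative; the clamping of the uniform draw simply avoids taking log of zero):

import jax
import jax.numpy as jnp

def gumbel_softmax_sample(key, log_pi, tau):
    # Gumbel(0,1) noise via g = -log(-log(u)), u ~ Uniform(0,1)
    u = jax.random.uniform(key, log_pi.shape, minval=1e-6, maxval=1.0 - 1e-6)
    g = -jnp.log(-jnp.log(u))
    # Temperature-controlled softmax over the perturbed log-probabilities
    return jax.nn.softmax((log_pi + g) / tau)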

Numerical Analysis of Gradient Estimators

Let us examine the behavior of our three gradient estimators for the stochastic optimization objective:

$$J(\theta) = \mathbb{E}_{x \sim \mathcal{N}(\theta,1)}[x^2\theta]$$

To get an analytical expression for the derivative, first note that we can factor out $\theta$ to obtain $J(\theta) = \theta\,\mathbb{E}[x^2]$ where $x \sim \mathcal{N}(\theta,1)$. By definition of the variance, $\text{Var}(x) = \mathbb{E}[x^2] - (\mathbb{E}[x])^2$, which we can rearrange to $\mathbb{E}[x^2] = \text{Var}(x) + (\mathbb{E}[x])^2$. Since $x \sim \mathcal{N}(\theta,1)$, we have $\text{Var}(x) = 1$ and $\mathbb{E}[x] = \theta$, therefore $\mathbb{E}[x^2] = 1 + \theta^2$. This gives us:

$$J(\theta) = \theta(1 + \theta^2)$$

Differentiating with respect to $\theta$ using the product rule yields:

$$\frac{d}{d\theta}J(\theta) = 1 + 3\theta^2$$

For concreteness, we fix $\theta = 1.0$ and analyze samples drawn using Monte Carlo estimation with batch size 1000 and 1000 independent trials. Evaluating at $\theta = 1$ gives $\frac{d}{d\theta}J(\theta)\big|_{\theta=1} = 1 + 3(1)^2 = 4$, which serves as our ground truth against which we compare our estimators:

  1. First, we consider the naive estimator that incorrectly differentiates the Monte Carlo approximation:

    $$\hat{g}_{\text{naive}}(\theta) = \frac{1}{N}\sum_{i=1}^N x_i^2$$

    For $x \sim \mathcal{N}(1,1)$, we have $\mathbb{E}[x^2] = \theta^2 + 1 = 2.0$ and hence $\mathbb{E}[\hat{g}_{\text{naive}}] = 2.0$. We should therefore expect a bias of about $-2$ in our experiment.

  2. Then we compute the score function estimator:

    $$\hat{g}_{\text{SF}}(\theta) = \frac{1}{N}\sum_{i=1}^N \left[x_i^2\theta(x_i - \theta) + x_i^2\right]$$

    This estimator is unbiased, with $\mathbb{E}[\hat{g}_{\text{SF}}] = 4$.

  3. Finally, through the reparameterization $x = \theta + \epsilon$ where $\epsilon \sim \mathcal{N}(0,1)$, we obtain:

    $$\hat{g}_{\text{RT}}(\theta) = \frac{1}{N}\sum_{i=1}^N \left[2\theta(\theta + \epsilon_i) + (\theta + \epsilon_i)^2\right]$$

    This estimator is also unbiased, with $\mathbb{E}[\hat{g}_{\text{RT}}] = 4$.

Source
%config InlineBackend.figure_format = 'retina'
import jax
import jax.numpy as jnp
import matplotlib.pyplot as plt

# Apply book style
try:
    import scienceplots
    plt.style.use(['science', 'notebook'])
except (ImportError, OSError):
    pass  # Use matplotlib defaults

key = jax.random.PRNGKey(0)

# Define the objective function f(x,θ) = x²θ where x ~ N(θ, 1)
def objective(x, theta):
    return x**2 * theta

# Naive Monte Carlo gradient estimation
@jax.jit
def naive_gradient_batch(key, theta):
    samples = jax.random.normal(key, (1000,)) + theta
    # Use jax.grad on the objective with respect to theta
    grad_fn = jax.grad(lambda t: jnp.mean(objective(samples, t)))
    return grad_fn(theta)

# Score function estimator (REINFORCE)
@jax.jit
def score_function_batch(key, theta):
    samples = jax.random.normal(key, (1000,)) + theta
    # f(x,θ) * ∂logp(x|θ)/∂θ + ∂f(x,θ)/∂θ
    # score function for N(θ,1) is (x-θ)
    score = samples - theta
    return jnp.mean(objective(samples, theta) * score + samples**2)

# Reparameterization gradient
@jax.jit
def reparam_gradient_batch(key, theta):
    eps = jax.random.normal(key, (1000,))
    # Use reparameterization x = θ + ε, ε ~ N(0,1)
    grad_fn = jax.grad(lambda t: jnp.mean(objective(t + eps, t)))
    return grad_fn(theta)

# Run trials
n_trials = 1000
theta = 1.0
true_grad = 1 + 3 * theta**2

keys = jax.random.split(key, n_trials)
naive_estimates = jnp.array([naive_gradient_batch(k, theta) for k in keys])
score_estimates = jnp.array([score_function_batch(k, theta) for k in keys])
reparam_estimates = jnp.array([reparam_gradient_batch(k, theta) for k in keys])

# Create violin plots with individual points
plt.figure(figsize=(12, 6))
data = [naive_estimates, score_estimates, reparam_estimates]
colors = ['#ff9999', '#66b3ff', '#99ff99']

parts = plt.violinplot(data, showextrema=False)
for i, pc in enumerate(parts['bodies']):
    pc.set_facecolor(colors[i])
    pc.set_alpha(0.7)

# Add box plots
plt.boxplot(data, notch=True, showfliers=False)

# Add true gradient line
plt.axhline(y=true_grad, color='r', linestyle='--', label='True Gradient')

plt.xticks([1, 2, 3], ['Naive', 'Score Function', 'Reparam'])
plt.ylabel('Gradient Estimate')
plt.title(f'Gradient Estimators (θ={theta}, true grad={true_grad:.2f})')
plt.grid(True, alpha=0.3)
plt.legend()

# Print statistics
methods = {
    'Naive': naive_estimates,
    'Score Function': score_estimates, 
    'Reparameterization': reparam_estimates
}

for name, estimates in methods.items():
    bias = jnp.mean(estimates) - true_grad
    variance = jnp.var(estimates)
    print(f"\n{name}:")
    print(f"Mean: {jnp.mean(estimates):.6f}")
    print(f"Bias: {bias:.6f}")
    print(f"Variance: {variance:.6f}")
    print(f"MSE: {bias**2 + variance:.6f}")

Naive:
Mean: 2.000417
Bias: -1.999583
Variance: 0.005933
MSE: 4.004266

Score Function:
Mean: 3.996162
Bias: -0.003838
Variance: 0.057295
MSE: 0.057309

Reparameterization:
Mean: 3.999940
Bias: -0.000060
Variance: 0.017459
MSE: 0.017459
(Figure: violin and box plots of the naive, score function, and reparameterization gradient estimates, with the true gradient of 4 marked by a dashed line.)

The numerical experiments corroborate our theory. The naive estimator consistently underestimates the true gradient by 2.0, though it maintains a relatively small variance. This systematic bias would make it unsuitable for optimization despite its low variance. The score function estimator corrects this bias but introduces substantial variance. While unbiased, this estimator would require many samples to achieve reliable gradient estimates. Finally, the reparameterization trick achieves a much lower variance while remaining unbiased. While this experiment is for didactic purposes only, it reproduces what is commonly found in practice: that when applicable, the reparameterization estimator tends to perform better than the score function counterpart.

Score Function Methods in Reinforcement Learning

The score function estimator from the previous section applies directly to reinforcement learning. Since it requires only the ability to evaluate and differentiate $\log \pi_{\boldsymbol{w}}(a|s)$, it works with any differentiable policy, including discrete action spaces where reparameterization is unavailable. It requires no model of the environment dynamics.

Let $G(\tau) \equiv \sum_{t=0}^T r(s_t, a_t)$ be the sum of undiscounted rewards in a trajectory $\tau$. The stochastic optimization problem we face is to maximize:

$$J(\boldsymbol{w}) = \mathbb{E}_{\tau \sim p(\tau;\boldsymbol{w})}[G(\tau)]$$

where $\tau = (s_0,a_0,s_1,a_1,...)$ is a trajectory and $G(\tau)$ is the total return. Applying the score function estimator, we get:

$$
\begin{align*}
\nabla_{\boldsymbol{w}}J(\boldsymbol{w}) &= \nabla_{\boldsymbol{w}}\mathbb{E}_{\tau}[G(\tau)] \\
&= \mathbb{E}_{\tau}\left[G(\tau)\nabla_{\boldsymbol{w}}\log p(\tau;\boldsymbol{w})\right] \\
&= \mathbb{E}_{\tau}\left[G(\tau)\nabla_{\boldsymbol{w}}\sum_{t=0}^T\log \pi_{\boldsymbol{w}}(a_t|s_t)\right] \\
&= \mathbb{E}_{\tau}\left[G(\tau)\sum_{t=0}^T\nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t)\right]
\end{align*}
$$

We have eliminated the need to know the transition probabilities in this estimator, since the probability of a trajectory factorizes as:

$$p(\tau;\boldsymbol{w}) = p(s_0)\prod_{t=0}^T \pi_{\boldsymbol{w}}(a_t|s_t)p(s_{t+1}|s_t,a_t)$$

Only the policy terms depend on $\boldsymbol{w}$: when taking the logarithm of this product, we get a sum in which all the $\boldsymbol{w}$-independent terms vanish under differentiation. The final estimator samples trajectories under the distribution $p(\tau; \boldsymbol{w})$ and computes:

$$\nabla_{\boldsymbol{w}}J(\boldsymbol{w}) \approx \frac{1}{N}\sum_{i=1}^N\left[G(\tau^{(i)})\sum_{t=0}^T\nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t^{(i)}|s_t^{(i)})\right]$$

This is a direct application of the score function estimator. However, we rarely use this form in practice and instead make several improvements to further reduce the variance.

Leveraging Conditional Independence

Given the Markov property of the MDP, rewards $r_k$ for $k < t$ are conditionally independent of action $a_t$ given the history $h_t = (s_0,a_0,...,s_{t-1},a_{t-1},s_t)$. This allows us to consider only future rewards when computing policy gradients:

$$
\begin{align*}
\nabla_{\boldsymbol{w}}J(\boldsymbol{w}) &= \mathbb{E}_{\tau}\left[\sum_{t=0}^T\nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t)\sum_{k=0}^T r_k\right] \\
&= \mathbb{E}_{\tau}\left[\sum_{t=0}^T\nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t)\left(\sum_{k=0}^{t-1} r_k + \sum_{k=t}^T r_k\right)\right] \\
&= \mathbb{E}_{\tau}\left[\sum_{t=0}^T\nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t)\sum_{k=t}^T r_k\right]
\end{align*}
$$

The conditional independence structure means that the term $\mathbb{E}_{\tau}\left[\sum_{t=0}^T\nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t)\sum_{k=0}^{t-1} r_k \right]$ vanishes. To see this, factor the trajectory distribution as:

$$p(\tau) = p(s_0,...,s_t,a_0,...,a_{t-1})\, \pi_{\boldsymbol{w}}(a_t|s_t)\, p(s_{t+1},...,s_T,a_{t+1},...,a_T|s_t,a_t)$$

We can now rewrite a single term of this summation as:

$$\mathbb{E}_{\tau}\left[\nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t)\sum_{k=0}^{t-1} r_k\right] = \mathbb{E}_{s_{0:t},a_{0:t-1}}\left[\sum_{k=0}^{t-1} r_k \, \mathbb{E}_{a_t}\left[\nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t)\right]\right]$$

The inner expectation is zero because

$$
\begin{align*}
\mathbb{E}_{a_t}\left[\nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t)\right] &= \int \nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t)\,\pi_{\boldsymbol{w}}(a_t|s_t)\,da_t \\
&= \int \frac{\nabla_{\boldsymbol{w}}\pi_{\boldsymbol{w}}(a_t|s_t)}{\pi_{\boldsymbol{w}}(a_t|s_t)}\pi_{\boldsymbol{w}}(a_t|s_t)\,da_t \\
&= \int \nabla_{\boldsymbol{w}}\pi_{\boldsymbol{w}}(a_t|s_t)\,da_t \\
&= \nabla_{\boldsymbol{w}}\int \pi_{\boldsymbol{w}}(a_t|s_t)\,da_t \\
&= \nabla_{\boldsymbol{w}}1 = 0
\end{align*}
$$

The Monte Carlo estimator becomes:

$$\nabla_{\boldsymbol{w}}J(\boldsymbol{w}) \approx \frac{1}{N}\sum_{i=1}^N\left[\sum_{t=0}^T\nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t^{(i)}|s_t^{(i)})\sum_{k=t}^T r_k^{(i)}\right]$$

This gives us the REINFORCE algorithm:
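A minimal sketch of the resulting procedure, in the JAX style used earlier in this chapter (the rollout helper sample_trajectory and the policy interface policy_log_prob are illustrative assumptions, not part of the original text):

import jax
import jax.numpy as jnp

def reinforce_step(w, key, step_size=1e-2):
    # Roll out one trajectory under pi_w (assumed helper returning arrays)
    states, actions, rewards = sample_trajectory(w, key)
    returns = jnp.flip(jnp.cumsum(jnp.flip(rewards)))     # G_t = sum_{k >= t} r_k

    # g_hat = sum_t grad log pi_w(a_t | s_t) * G_t, with the returns held constant
    def weighted_logp(params):
        return jnp.sum(policy_log_prob(params, states, actions) * returns)
    g_hat = jax.grad(weighted_logp)(w)

    # Gradient ascent on the expected return
    return jax.tree_util.tree_map(lambda p, g: p + step_size * g, w, g_hat)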

The benefit of this estimator compared to the naive one (which would weight each score function by the full trajectory return $G(\tau)$) is that it generally has lower variance. This variance reduction arises from the conditional independence structure we exploited: past rewards do not depend on future actions. More formally, this estimator is an instance of a variance reduction technique known as the Extended Conditional Monte Carlo Method.

The Surrogate Loss Perspective

The algorithm above computes a gradient estimate $\hat{g}$ explicitly. In practice, implementations using automatic differentiation frameworks take a different approach: they define a surrogate loss whose gradient matches the REINFORCE estimator. For a single trajectory, consider:

$$L_{\text{surrogate}}(\boldsymbol{w}) = -\sum_{t=0}^T \log \pi_{\boldsymbol{w}}(a_t|s_t) \, G_t$$

where the returns $G_t$ and actions $a_t$ are treated as fixed constants (detached from the computation graph). Taking the gradient with respect to $\boldsymbol{w}$:

$$\nabla_{\boldsymbol{w}} L_{\text{surrogate}} = -\sum_{t=0}^T \nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t) \, G_t$$

Minimizing this surrogate loss via gradient descent yields the same update as maximizing expected return via REINFORCE. The negative sign converts our maximization problem into a minimization suitable for standard optimizers.

This surrogate loss is not the expected return $J(\boldsymbol{w})$ we are trying to maximize. It is a computational device that produces the correct gradient at the current parameter values. Several properties distinguish it from a true loss function:

  1. It changes each iteration. The returns $G_t$ come from trajectories sampled under the current policy. After updating $\boldsymbol{w}$, we must collect new trajectories and construct a new surrogate loss.

  2. Its value is not meaningful. Unlike supervised learning, where the loss measures prediction error, the numerical value of $L_{\text{surrogate}}$ has no direct interpretation. Only its gradient matters.

  3. It is valid only locally. The surrogate loss provides the correct gradient only at the parameters used to collect the data. Moving far from those parameters invalidates the gradient estimate.

This perspective explains why policy gradient code often looks different from the pseudocode above. Instead of computing $\hat{g}$ explicitly, implementations define the surrogate loss and call loss.backward():

# Surrogate loss implementation (single trajectory)
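# (PyTorch-style sketch: `policy`, `trajectory`, `rewards`, `compute_returns`,
#  and `optimizer` are assumed to be defined elsewhere)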
log_probs = [policy.log_prob(a_t, s_t) for s_t, a_t in trajectory]
returns = compute_returns(rewards)
surrogate_loss = -sum(lp * G for lp, G in zip(log_probs, returns))
surrogate_loss.backward()  # computes REINFORCE gradient
optimizer.step()

Variance Reduction via Control Variates

Recall that the REINFORCE gradient estimator, after leveraging conditional independence, takes the form:

$$\nabla_{\boldsymbol{w}}J(\boldsymbol{w}) \approx \frac{1}{N}\sum_{i=1}^N\left[\sum_{t=0}^T\nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t^{(i)}|s_t^{(i)})\sum_{k=t}^T r_k^{(i)}\right]$$

This is a sum over trajectories and timesteps. The gradient contribution at timestep $t$ of trajectory $i$ is:

$$\nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t^{(i)}|s_t^{(i)})\sum_{k=t}^T r_k^{(i)}$$

While unbiased, this estimator suffers from high variance because the return $\sum_{k=t}^T r_k$ can vary significantly across trajectories even for the same state-action pair. The control variate method provides a principled way to reduce this variance.

General Control Variate Theory

For a general estimator $Z$ of some quantity $\mu = \mathbb{E}[Z]$, and a control variate $C$ with known expectation $\mathbb{E}[C]=0$, we can construct:

$$Z_{\text{cv}} = Z - \alpha C$$

This remains unbiased since $\mathbb{E}[Z_{\text{cv}}] = \mathbb{E}[Z] - \alpha\mathbb{E}[C] = \mathbb{E}[Z]$. The variance is:

$$\text{Var}(Z_{\text{cv}}) = \text{Var}(Z) + \alpha^2\text{Var}(C) - 2\alpha\text{Cov}(Z,C)$$

The $-2\alpha\text{Cov}(Z,C)$ term is what enables variance reduction. If $Z$ and $C$ are positively correlated, we can choose $\alpha > 0$ to make this term negative and large in magnitude, reducing the overall variance. However, the $\alpha^2\text{Var}(C)$ term grows quadratically with $\alpha$, so if we make $\alpha$ too large, this quadratic term will eventually dominate and the variance will increase rather than decrease. The variance as a function of $\alpha$ is a parabola opening upward, with a unique minimum. Setting $\frac{d}{d\alpha}\text{Var}(Z_{\text{cv}}) = 0$ gives:

$$\alpha^* = \frac{\text{Cov}(Z,C)}{\text{Var}(C)}$$

This is the coefficient from ordinary least squares regression: we predict the estimator $Z$ using the control variate $C$ as the predictor. Since $\mathbb{E}[C] = 0$, the linear model is $Z \approx \mathbb{E}[Z] + \alpha^* C$, where $\alpha^*$ is the OLS slope coefficient. The control variate estimator $Z_{\text{cv}} = Z - \alpha^* C$ computes the residual: the part of $Z$ that cannot be explained by $C$.

Substituting $\alpha^*$ into the variance formula yields:

$$\text{Var}(Z_{\text{cv}}) = \text{Var}(Z) - \frac{[\text{Cov}(Z,C)]^2}{\text{Var}(C)} = (1 - R^2)\, \text{Var}(Z)$$

where $R^2 = \frac{[\text{Cov}(Z,C)]^2}{\text{Var}(Z)\text{Var}(C)}$ is the coefficient of determination from regressing $Z$ on $C$. The variance reduction is $R^2\, \text{Var}(Z)$: the better $C$ predicts $Z$, the more variance we eliminate.
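As a small numerical illustration (not from the original text): take $Z = X^2$ with $X \sim \mathcal{N}(1,1)$, so $\mathbb{E}[Z] = 2$ and $\text{Var}(Z) = 6$, and use the zero-mean control variate $C = X - 1$, for which the optimal coefficient is $\alpha^* = \text{Cov}(Z,C)/\text{Var}(C) = 2$:

import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(1)
x = jax.random.normal(key, (100_000,)) + 1.0   # X ~ N(1, 1)
z = x**2                                        # E[Z] = 2, Var(Z) = 6
c = x - 1.0                                     # control variate with E[C] = 0

alpha = jnp.cov(z, c)[0, 1] / jnp.var(c)        # empirical OLS slope, close to alpha* = 2
z_cv = z - alpha * c

print(jnp.mean(z), jnp.mean(z_cv))              # both approximately 2 (unbiased)
print(jnp.var(z), jnp.var(z_cv))                # variance drops from about 6 to about 2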

Application to REINFORCE

In the reinforcement learning setting, our REINFORCE gradient estimator is a sum over timesteps, $\sum_{t=0}^T Z_t$, where each $Z_t$ represents the gradient contribution at timestep $t$. We apply control variates separately to each term. Since $\text{Var}(\sum_t Z_t) = \sum_t \text{Var}(Z_t) + \sum_{t \neq s} \text{Cov}(Z_t, Z_s)$, reducing the variance of each $Z_t$ reduces the total variance, though we do not explicitly address the cross-timestep covariance terms.

For a given trajectory at state $s_t$, the gradient contribution at time $t$ is:

$$Z_t = \nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t)\sum_{k=t}^T r_k$$

This is the product of the score function $\nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t)$ and the return-to-go $\sum_{k=t}^T r_k$. We can subtract any state-dependent function $b(s_t)$ from the return without introducing bias, as long as $b(s_t)$ does not depend on $a_t$. This is because:

$$\mathbb{E}_{a_t \sim \pi_{\boldsymbol{w}}(\cdot|s_t)}\left[\nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t)b(s_t)\right] = b(s_t)\,\mathbb{E}_{a_t}\left[\nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t)\right] = 0$$

where the last equality follows from the score function identity (38).

We can now define our control variate as:

$$C_t = \nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t) \cdot b(s_t)$$

where $b(s_t)$ is a baseline function that depends only on the state. This satisfies $\mathbb{E}[C_t|s_t] = 0$. Our control variate estimator becomes:

$$Z_{t,\text{cv}} = Z_t - C_t = \nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t)\left(\sum_{k=t}^T r_k - b(s_t)\right)$$

The optimal baseline $b^*(s_t)$ minimizes the variance. To find it, consider the scalar parameter case for simplicity. Write $g(a_t) \equiv \nabla_w \log \pi_w(a_t|s_t)$ and $G_t = \sum_{k=t}^T r_k$. We want to minimize:

$$b^*(s_t) = \arg\min_{b} \text{Var}_{a_t \sim \pi_{\boldsymbol{w}}(\cdot|s_t)}\left[g(a_t)(G_t - b)\right]$$

Since the mean does not depend on $b$, minimizing the variance is equivalent to minimizing the second moment $\mathbb{E}[g(a_t)^2(G_t - b)^2 \,|\, s_t]$. Expanding and taking the derivative with respect to $b$ gives:

$$b^*(s_t) = \frac{\mathbb{E}_{a_t|s_t}\left[g(a_t)^2 G_t\right]}{\mathbb{E}_{a_t|s_t}\left[g(a_t)^2\right]}$$

For vector-valued parameters $\boldsymbol{w}$, we minimize a scalar proxy such as the trace of the covariance matrix, which yields the same formula with $\|\nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t)\|^2$ in place of $g(a_t)^2$:

$$b^*(s_t) = \frac{\mathbb{E}_{a_t|s_t}\left[\|\nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t)\|^2 G_t\right]}{\mathbb{E}_{a_t|s_t}\left[\|\nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t)\|^2\right]}$$

This is the exact optimal baseline: a weighted average of returns where the weights are the squared norms of the score function. In practice, we treat the squared norm as roughly constant across actions at a given state, which leads to the simpler and widely used choice:

$$b(s_t) \approx \mathbb{E}[G_t|s_t] = v^{\pi_{\boldsymbol{w}}}(s_t)$$

With this approximation, the variance-reduced gradient contribution at timestep $t$ becomes:

$$Z_{\text{cv},t} = \nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t)\left(\sum_{k=t}^T r_k - v^{\pi_{\boldsymbol{w}}}(s_t)\right)$$

The term in parentheses is exactly the advantage function, $A^{\pi_{\boldsymbol{w}}}(s_t, a_t) = q^{\pi_{\boldsymbol{w}}}(s_t, a_t) - v^{\pi_{\boldsymbol{w}}}(s_t)$, with the Q-function approximated by the Monte Carlo return $\sum_{k=t}^T r_k$. The full gradient estimate for a trajectory is then the sum over all timesteps:

$$\hat{g} = \sum_{t=0}^T Z_{\text{cv},t} = \sum_{t=0}^T \nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t)\left(G_t - v^{\pi_{\boldsymbol{w}}}(s_t)\right)$$

In practice, we do not have access to the true value function and must learn it. Unlike the methods in the amortization chapter, where we learned value functions to approximate the optimal Q-function, here our goal is policy evaluation: estimating the value of the current policy $\pi_{\boldsymbol{w}}$. The same function approximation techniques apply, but we target $v^{\pi_{\boldsymbol{w}}}$ rather than $v^*$. The simplest approach is to regress from states to Monte Carlo returns, learning what Williams (1992) called a "baseline".

When implementing this algorithm today, we use mini-batching to make full use of modern GPUs, so a more representative variant processes batches of transitions, as sketched below. The value function is trained by regressing states directly onto their sampled Monte Carlo returns $G$. Advantage normalization is not part of the optimal baseline derivation, but it improves optimization in practice and is standard in modern implementations.
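A sketch of the losses used in such a mini-batched variant (illustrative only; policy_log_prob, value_fn, and the batch arrays of states, actions, and returns are assumptions, not part of the original text):

import jax
import jax.numpy as jnp

def policy_and_value_losses(policy_params, value_params, states, actions, returns):
    # Baseline: learned value estimates v(s_t), detached from the policy update
    baselines = jax.lax.stop_gradient(value_fn(value_params, states))
    advantages = returns - baselines
    # Optional but standard: normalize advantages within the mini-batch
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    # Surrogate policy loss whose gradient matches the baselined REINFORCE estimator
    logp = policy_log_prob(policy_params, states, actions)
    policy_loss = -jnp.mean(logp * advantages)

    # Value function: regress states onto their Monte Carlo returns
    value_loss = jnp.mean((value_fn(value_params, states) - returns) ** 2)
    return policy_loss, value_loss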

Generalized Advantage Estimation

The baseline construction gave us a gradient estimator of the form:

$$\nabla_{\boldsymbol{w}}J(\boldsymbol{w}) \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=0}^T \nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t^{(i)}|s_t^{(i)}) \left(G_t^{(i)} - v(s_t^{(i)})\right)$$

where $G_t = \sum_{k=t}^T r_k$ is the Monte Carlo return from time $t$. For each visited state-action pair $(s_t, a_t)$, the term in parentheses,

$$\widehat{A}_t^{\text{MC}} = G_t - v(s_t),$$

is a Monte Carlo estimate of the advantage $A^{\pi}(s_t, a_t) = q^{\pi}(s_t, a_t) - v^{\pi}(s_t)$. If the baseline equals the true value function, $v = v^{\pi}$, then $\mathbb{E}[\widehat{A}_t^{\text{MC}} | s_t, a_t] = A^{\pi}(s_t, a_t)$, so this estimator is unbiased.

However, as an estimator it has two limitations. First, it has high variance because $G_t$ depends on all future rewards. Second, it uses the value function only as a baseline, not as a predictor of long-term returns: we essentially discard the information in $v(s_{t+1}), v(s_{t+2}), \ldots$

GAE addresses these issues by constructing a family of estimators that interpolate between pure Monte Carlo and pure bootstrapping. A parameter $\lambda$ controls the bias-variance tradeoff.

Decomposing the Monte Carlo Advantage

Fix a value function $v(s)$ (not necessarily equal to $v^{\pi}$) and define the one-step residual:

$$\delta_t = r_t + \gamma v(s_{t+1}) - v(s_t)$$

Start from the Monte Carlo advantage and add and subtract $\gamma v(s_{t+1})$:

$$
\begin{align*}
G_t - v(s_t) &= r_t + \gamma G_{t+1} - v(s_t) \\
&= r_t + \gamma v(s_{t+1}) - v(s_t) + \gamma(G_{t+1} - v(s_{t+1})) \\
&= \delta_t + \gamma(G_{t+1} - v(s_{t+1}))
\end{align*}
$$

Applying this decomposition recursively yields:

$$G_t - v(s_t) = \sum_{l=0}^{T-t} \gamma^l \delta_{t+l}$$

The Monte Carlo advantage is exactly the discounted sum of future residuals. This is an algebraic identity, not an approximation.

The sequence $\{\delta_{t+l}\}_{l \geq 0}$ provides incremental corrections to the value function as we move forward in time. The term $\delta_t$ depends only on $(s_t, a_t, s_{t+1})$; $\delta_{t+1}$ depends on $(s_{t+1}, a_{t+1}, s_{t+2})$, and so on. As $l$ increases, the corrections become more noisy (they depend on more random outcomes) and more sensitive to errors in the value function at later states. Although the full sum is unbiased when $v = v^{\pi}$, it can have high variance and can be badly affected by approximation error in $v$.

GAE as a Shrinkage Estimator

The decomposition above suggests a family of estimators that downweight residuals farther in the future. Let $\lambda \in [0,1]$ and define:

$$A_t^{\lambda} = \sum_{l=0}^{T-t} (\gamma\lambda)^l \delta_{t+l}$$

This is the generalized advantage estimator $A_t^{\text{GAE}(\gamma,\lambda)}$.

Two special cases illustrate the extremes. When $\lambda = 1$, we recover the Monte Carlo advantage:

$$A_t^{\lambda=1} = \sum_{l=0}^{T-t} \gamma^l \delta_{t+l} = G_t - v(s_t)$$

When $\lambda = 0$, we keep only the immediate residual:

$$A_t^{\lambda=0} = \delta_t = r_t + \gamma v(s_{t+1}) - v(s_t)$$

Intermediate values $0 < \lambda < 1$ interpolate between these extremes. The influence of $\delta_{t+l}$ decays geometrically as $(\gamma\lambda)^l$. The parameter $\lambda$ acts as a shrinkage parameter: small $\lambda$ shrinks the estimator toward the one-step residual; large $\lambda$ allows the estimator to behave more like the Monte Carlo advantage.

If $v = v^{\pi}$ is the true value function, then $\mathbb{E}[\delta_t | s_t, a_t] = A^{\pi}(s_t, a_t)$ and $\mathbb{E}[\delta_{t+l} | s_t, a_t] = 0$ for $l \geq 1$. In this case:

$$\mathbb{E}[A_t^{\lambda} | s_t, a_t] = \sum_{l=0}^{T-t} (\gamma\lambda)^l\, \mathbb{E}[\delta_{t+l} | s_t, a_t] = A^{\pi}(s_t, a_t)$$

for all $\lambda \in [0,1]$. When the value function is exact, GAE is unbiased regardless of $\lambda$; changing $\lambda$ only affects variance.

In practice, we approximate $v^{\pi}$ with a function approximator, and the residuals $\delta_{t+l}$ inherit approximation error. Distant residuals involve multiple applications of the approximate value function and are more contaminated by modeling error. Downweighting them (choosing $\lambda < 1$) introduces bias but can reduce variance and limit the impact of those errors.
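A minimal sketch of the backward recursion commonly used to compute these advantages, under the assumption that rewards has length $T$ and values has length $T+1$ (the final entry being the bootstrap value):

import jax.numpy as jnp

def compute_gae(rewards, values, gamma, lam):
    # One-step residuals delta_t = r_t + gamma * v(s_{t+1}) - v(s_t)
    deltas = rewards + gamma * values[1:] - values[:-1]
    # Backward recursion A_t = delta_t + gamma * lam * A_{t+1}
    advantages = []
    running = 0.0
    for delta in reversed(list(deltas)):
        running = delta + gamma * lam * running
        advantages.append(running)
    return jnp.asarray(advantages[::-1])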

Mixture of Multi-Step Estimators

Another perspective on GAE comes from multi-step returns. Define the $k$-step return from time $t$:

$$G_t^{(k)} = \sum_{l=0}^{k-1} \gamma^l r_{t+l} + \gamma^k v(s_{t+k})$$

and the corresponding $k$-step advantage estimator $A_t^{(k)} = G_t^{(k)} - v(s_t)$. Each $A_t^{(k)}$ uses $k$ rewards before bootstrapping; larger $k$ means more variance but less bootstrapping error.

The GAE estimator can be written as a geometric mixture:

$$A_t^{\lambda} = (1-\lambda) \sum_{k=1}^{T-t} \lambda^{k-1} A_t^{(k)}$$

GAE is a weighted average of the $k$-step advantage estimators, with shorter horizons weighted more heavily when $\lambda$ is small.

Using GAE in the Policy Gradient

Once we choose $\lambda$, we plug $A_t^{\lambda}$ in place of $G_t - v(s_t)$ in the policy gradient estimator:

$$\nabla_{\boldsymbol{w}}J(\boldsymbol{w}) \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=0}^T \nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t^{(i)}|s_t^{(i)})\, A_t^{\lambda,(i)}$$

We still use a control variate to reduce variance (the baseline $v$), but now we construct the advantage target by smoothing the sequence of residuals $\{\delta_t\}$ with a geometrically decaying kernel.

For the value function, it is convenient to define the $\lambda$-return:

$$G_t^{\lambda} = A_t^{\lambda} + v(s_t)$$

When $\lambda = 1$, $G_t^{\lambda}$ reduces to the Monte Carlo return; when $\lambda = 0$, it becomes the one-step bootstrapped target $r_t + \gamma v(s_{t+1})$.

At the level of the overall algorithm, $\lambda = 1$ reduces (up to advantage normalization) to the Monte Carlo baseline algorithm earlier in the chapter, while for $\lambda = 0$ the advantages become the one-step residuals $\delta_t$ and the $\lambda$-returns reduce to standard one-step bootstrapped targets.

Actor-Critic as the $\lambda = 0$ Limit

The case $\lambda = 0$ is particularly simple. The advantage becomes:

$$A_t^{\lambda=0} = \delta_t = r_t + \gamma v(s_{t+1}) - v(s_t)$$

and the policy update reduces to:

$$\boldsymbol{w} \leftarrow \boldsymbol{w} + \alpha_w \nabla_{\boldsymbol{w}}\log \pi_{\boldsymbol{w}}(a_t|s_t)\, \delta_t$$

while the value update becomes a standard one-step regression toward $r_t + \gamma v(s_{t+1})$. This gives the online actor-critic algorithm:
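A sketch of a single online update, assuming the same illustrative policy_log_prob and value_fn interfaces as in the earlier sketches:

import jax

def actor_critic_step(policy_params, value_params, transition, gamma, alpha_w, alpha_v):
    s, a, r, s_next = transition

    # TD residual delta = r + gamma * v(s') - v(s), used as the advantage estimate
    delta = r + gamma * value_fn(value_params, s_next) - value_fn(value_params, s)
    delta = jax.lax.stop_gradient(delta)

    # Actor: ascend the gradient of log pi_w(a|s) * delta
    actor_grads = jax.grad(lambda p: policy_log_prob(p, s, a) * delta)(policy_params)
    policy_params = jax.tree_util.tree_map(lambda p, g: p + alpha_w * g,
                                           policy_params, actor_grads)

    # Critic: semi-gradient step on the squared one-step error (target detached)
    def critic_loss(v):
        target = r + gamma * jax.lax.stop_gradient(value_fn(v, s_next))
        return (target - value_fn(v, s)) ** 2
    critic_grads = jax.grad(critic_loss)(value_params)
    value_params = jax.tree_util.tree_map(lambda p, g: p - alpha_v * g,
                                          value_params, critic_grads)
    return policy_params, value_params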

This algorithm was derived by Sutton in his 1984 thesis as an "adaptive heuristic" for temporal credit assignment. In the language of this chapter, it is the $\lambda = 0$ member of the GAE family: it uses the most local residual $\delta_t$ as both the target for the value function and the advantage estimate for the policy gradient.

Likelihood Ratio Methods in Reinforcement Learning

The score function estimator from the previous section is a special case of the likelihood ratio method where the proposal distribution equals the target distribution. We now consider the general case where they differ.

Recall the likelihood ratio gradient estimator from the beginning of this chapter. For the objective $J(\theta) = \mathbb{E}_{x \sim p(x;\theta)}[f(x)]$ and any proposal distribution $q(x)$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{x \sim q(x)}\left[f(x)\, \rho(x; q, \theta)\, \nabla_\theta \log p(x;\theta)\right]$$

where $\rho(x; q, \theta) = \frac{p(x;\theta)}{q(x)}$ is the likelihood ratio. The identity $\frac{\partial \rho}{\partial \theta} = \rho\, \nabla_\theta \log p$ holds because $x$ is treated as fixed, having been sampled from $q$, which does not depend on $\theta$.

In reinforcement learning, let $x = \tau$ be a trajectory, $f(\tau) = G(\tau)$ the return, $p(\tau;\boldsymbol{w})$ the trajectory distribution under policy $\pi_{\boldsymbol{w}}$, and $q(\tau)$ the trajectory distribution under some other policy $\pi_q$. The gradient becomes:

$$\nabla_{\boldsymbol{w}} J(\boldsymbol{w}) = \mathbb{E}_{\tau \sim \pi_q}\left[G(\tau)\, \rho(\tau) \sum_{t=0}^T \nabla_{\boldsymbol{w}} \log \pi_{\boldsymbol{w}}(a_t|s_t)\right]$$

where the trajectory likelihood ratio simplifies because transition probabilities cancel:

$$\rho(\tau) = \frac{p(\tau;\boldsymbol{w})}{q(\tau)} = \prod_{t=0}^T \frac{\pi_{\boldsymbol{w}}(a_t|s_t)}{\pi_q(a_t|s_t)} = \prod_{t=0}^T \rho_t$$

This product of $T+1$ ratios can become extremely large or small as $T$ grows, leading to high variance. The temporal structure provides some relief: since $\mathbb{E}_{a \sim \pi_q}[\rho] = 1$, future ratios $\rho_{k}$ for $k > t$ that do not affect the reward $r_t$ can be marginalized out. However, past ratios $\rho_{0:t-1}$ are still needed to correctly weight the probability of reaching state $s_t$.

In practice, algorithms like PPO and TRPO make an additional approximation: they use only the per-step ratio $\rho_t$ rather than the cumulative product $\rho_{0:t}$. This ignores the mismatch between the state distributions induced by the two policies. Combined with a baseline $b(s_t)$, the approximate estimator is:

$$\nabla_{\boldsymbol{w}} J(\boldsymbol{w}) \approx \mathbb{E}_{\tau \sim \pi_q}\left[\sum_{t=0}^T \rho_t\, \nabla_{\boldsymbol{w}} \log \pi_{\boldsymbol{w}}(a_t|s_t)\, (G_t - b(s_t))\right]$$

This approximation corresponds to maximizing the importance-weighted surrogate objective:

$$L^{\text{IS}}(\boldsymbol{w}) = \mathbb{E}_{\tau \sim \pi_q}\left[\sum_{t=0}^T \rho_t A_t\right]$$

where $A_t = G_t - b(s_t)$. Taking the gradient with respect to $\boldsymbol{w}$, only $\rho_t$ depends on $\boldsymbol{w}$ (since trajectories are sampled from $\pi_q$):

$$\nabla_{\boldsymbol{w}} L^{\text{IS}}(\boldsymbol{w}) = \mathbb{E}_{\tau \sim \pi_q}\left[\sum_{t=0}^T A_t\, \nabla_{\boldsymbol{w}} \rho_t\right]$$

The gradient of the ratio is:

$$\nabla_{\boldsymbol{w}} \rho_t = \nabla_{\boldsymbol{w}} \frac{\pi_{\boldsymbol{w}}(a_t|s_t)}{\pi_q(a_t|s_t)} = \frac{\nabla_{\boldsymbol{w}} \pi_{\boldsymbol{w}}(a_t|s_t)}{\pi_q(a_t|s_t)} = \rho_t\, \nabla_{\boldsymbol{w}} \log \pi_{\boldsymbol{w}}(a_t|s_t)$$

Substituting back:

$$\nabla_{\boldsymbol{w}} L^{\text{IS}}(\boldsymbol{w}) = \mathbb{E}_{\tau \sim \pi_q}\left[\sum_{t=0}^T \rho_t\, \nabla_{\boldsymbol{w}} \log \pi_{\boldsymbol{w}}(a_t|s_t)\, A_t\right]$$

This matches equation (76). When $\pi_q = \pi_{\boldsymbol{w}}$, the ratios $\rho_t = 1$ and we recover the score function estimator. The approximation error grows as the policies diverge, which motivates the trust region and clipping mechanisms discussed below.

Variance and the Dominance Condition

The ratio $\rho_t = \pi_{\boldsymbol{w}}(a_t|s_t)/\pi_q(a_t|s_t)$ is well-behaved only when the two policies are similar. If $\pi_{\boldsymbol{w}}$ assigns high probability to an action where $\pi_q$ assigns low probability, the ratio explodes. For example, if $\pi_q(a|s) = 0.01$ and $\pi_{\boldsymbol{w}}(a|s) = 0.5$, then $\rho = 50$, amplifying any noise in the advantage estimate.

Importance sampling also requires the dominance condition: the support of $\pi_{\boldsymbol{w}}$ must be contained in the support of $\pi_q$. If $\pi_{\boldsymbol{w}}(a|s) > 0$ but $\pi_q(a|s) = 0$, the ratio is undefined. Stochastic policies typically have full support, but the ratio can still become arbitrarily large as $\pi_q(a|s) \to 0$.

A common use case is to set $\pi_q = \pi_{\boldsymbol{w}_{\text{old}}}$, a previous version of the policy. This allows reusing data across multiple gradient steps: collect trajectories once, then update $\boldsymbol{w}$ several times. But each update moves $\boldsymbol{w}$ further from $\boldsymbol{w}_{\text{old}}$, making the ratios more extreme. Eventually, the gradient signal is dominated by a few samples with large weights.

Proximal Policy Optimization

The variance issues suggest a natural solution: keep the ratio $\rho_t$ close to 1 by ensuring the new policy stays close to the behavior policy. This keeps the importance-weighted surrogate $L^{\text{IS}}(\boldsymbol{w})$ from (77) well-behaved.

Trust Region Policy Optimization (TRPO) formalizes this by adding a constraint on the KL divergence between the old and new policies:

$$\max_{\boldsymbol{w}} L^{\text{IS}}(\boldsymbol{w}) \quad \text{subject to} \quad \mathbb{E}_s\left[D_{\text{KL}}(\pi_{\boldsymbol{w}_{\text{old}}}(\cdot|s)\, \|\, \pi_{\boldsymbol{w}}(\cdot|s))\right] \leq \delta$$

The KL constraint ensures that the two distributions remain similar, which bounds how extreme the importance weights can become. This is a constrained optimization problem, and one could in principle apply standard methods such as projected gradient descent or augmented Lagrangian approaches (as discussed in the trajectory optimization chapter). TRPO takes a different approach: it uses a second-order Taylor approximation of the KL constraint around the current parameters and solves the resulting trust region subproblem using conjugate gradient methods. This involves computing the Fisher information matrix (the Hessian of the KL divergence), which adds computational overhead.

Proximal Policy Optimization (PPO) achieves similar behavior through a simpler mechanism: rather than constraining the distributions to be similar, it directly clips the ratio $\rho_t$ to prevent it from moving too far from 1. This is a construction-level guarantee rather than an optimization-level constraint.

From Trajectory Expectations to State-Action Averages

Before defining the PPO objective, we need to clarify the relationship between the trajectory-level surrogate (77) and the state-action level objective that PPO actually optimizes. The importance-weighted surrogate is defined as an expectation over trajectories:

$$L^{\text{IS}}(\boldsymbol{w}) = \mathbb{E}_{\tau \sim \pi_q}\left[\sum_{t=0}^T \rho_t A_t\right]$$

We can rewrite this as an expectation over state-action pairs by introducing a sampling distribution. For a finite horizon $T$, define the averaged time-marginal distribution:

$$\xi_{\pi_q}(s, a) = \frac{1}{T+1} \sum_{t=0}^{T} d_t^{\pi_q}(s)\, \pi_q(a|s)$$

where $d_t^{\pi_q}(s)$ is the probability of being in state $s$ at time $t$ when following policy $\pi_q$ from the initial distribution. This is the uniform mixture over the time-indexed state-action distributions: we pick a timestep $t$ uniformly at random from $\{0, 1, \ldots, T\}$, then sample $(s, a)$ from the joint distribution at that timestep.

With this definition, the trajectory expectation becomes:

$$\mathbb{E}_{\tau \sim \pi_q}\left[\sum_{t=0}^T \rho_t A_t\right] = (T+1) \cdot \mathbb{E}_{(s,a) \sim \xi_{\pi_q}}\left[\rho(s, a)\, A(s, a)\right]$$

The factor $(T+1)$ is just a constant that does not affect the optimization. This reformulation shows that the importance-weighted surrogate is equivalent to an expectation over state-action pairs drawn from the averaged time-marginal distribution. This is not a stationary distribution or a discounted visitation distribution, but the empirical mixture induced by the finite-horizon rollout procedure.

The Clipped Surrogate Objective

PPO replaces the linear importance-weighted term $\rho A$ with a clipped version. For a state-action pair $(s, a)$ with advantage $A$ and importance ratio $\rho(\boldsymbol{w}) = \pi_{\boldsymbol{w}}(a|s) / \pi_{\boldsymbol{w}_{\text{old}}}(a|s)$, define the per-sample clipped objective:

$$\ell^{\text{CLIP}}(\boldsymbol{w}; s, a, A) = \min\left(\rho(\boldsymbol{w}) A, \ \text{clip}(\rho(\boldsymbol{w}), 1-\epsilon, 1+\epsilon)\, A\right)$$

where $\epsilon$ is a hyperparameter (typically 0.1 or 0.2) and $\text{clip}(x, a, b) = \max(a, \min(x, b))$ restricts $x$ to the interval $[a, b]$.

The population-level PPO objective is then:

$$L^{\text{CLIP}}(\boldsymbol{w}) = \mathbb{E}_{(s,a,A) \sim \xi_{\pi_{\boldsymbol{w}_{\text{old}}}}}\left[\ell^{\text{CLIP}}(\boldsymbol{w}; s, a, A)\right]$$

where the expectation is taken over the averaged time-marginal distribution (83) induced by $\pi_{\boldsymbol{w}_{\text{old}}}$.

In practice, we never compute this expectation exactly. Instead, we collect a batch of transitions $\mathcal{D} = \{(s_t^{(i)}, a_t^{(i)}, A_t^{(i)})\}$ by running $\pi_{\boldsymbol{w}_{\text{old}}}$ and approximate the expectation with an empirical average:

$$\hat{L}^{\text{CLIP}}(\boldsymbol{w}; \mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{(s,a,A) \in \mathcal{D}} \ell^{\text{CLIP}}(\boldsymbol{w}; s, a, A)$$

This is the same plug-in approximation used in fitted Q-iteration: replace the unknown population distribution with the empirical distribution $\hat{P}_{\mathcal{D}}$ induced by the collected batch, then compute the sample average. The empirical surrogate $\hat{L}^{\text{CLIP}}$ is simply an expectation under $\hat{P}_{\mathcal{D}}$. No assumptions about stationarity or discounted visitation are needed; we just average over the transitions we collected.

Intuition for the Clipping Mechanism

The $\min$ operator in (85) selects the more pessimistic estimate. Consider the two cases:

  1. Positive advantage ($A > 0$): the unclipped term $\rho A$ grows as the new policy increases the probability of the action. Once $\rho$ exceeds $1+\epsilon$, the clipped term takes over and the objective stops rewarding further increases.

  2. Negative advantage ($A < 0$): the objective would otherwise improve by pushing $\rho$ toward zero. Once $\rho$ falls below $1-\epsilon$, the clipped term caps the gain from decreasing the probability any further.

In both cases, the clipping removes the incentive to move the probability ratio beyond the interval $[1-\epsilon, 1+\epsilon]$. This keeps the new policy close to the old policy without explicitly computing or constraining the KL divergence.

The algorithm collects a batch of trajectories, then performs $K$ epochs of mini-batch updates on the same data. The empirical surrogate $\hat{L}^{\text{CLIP}}$ approximates the population objective (86) using samples from the averaged time-marginal distribution. The clipped objective ensures that even after multiple updates, the policy does not move too far from the policy that collected the data. The ratio $\rho$ is computed in log-space for numerical stability, as in the sketch below.

PPO has become one of the most widely used policy gradient algorithms due to its simplicity and robustness. Compared to TRPO, it avoids the computational overhead of constrained optimization while achieving similar sample efficiency. The clip parameter $\epsilon$ is the main hyperparameter controlling the trust region size: smaller values keep the policy closer to the behavior policy but may slow learning, while larger values allow faster updates but risk instability.
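A minimal sketch of the per-sample clipped loss described above (the policy_log_prob interface and the collected batch arrays are assumptions, not part of the original text):

import jax.numpy as jnp

def ppo_clip_loss(params, old_log_probs, states, actions, advantages, epsilon=0.2):
    # Ratio in log-space for numerical stability: rho = exp(log pi_new - log pi_old)
    log_probs = policy_log_prob(params, states, actions)
    rho = jnp.exp(log_probs - old_log_probs)

    unclipped = rho * advantages
    clipped = jnp.clip(rho, 1.0 - epsilon, 1.0 + epsilon) * advantages

    # Negative sign: minimize the loss to maximize the clipped surrogate objective
    return -jnp.mean(jnp.minimum(unclipped, clipped))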

The Policy Gradient Theorem

The algorithms developed so far (REINFORCE, actor-critic, GAE, and PPO) all estimate policy gradients from sampled trajectories. We now establish the theoretical foundation for these estimators by deriving the policy gradient theorem in the discounted infinite-horizon setting.

Sutton et al. (1999) provided the original derivation. Here we present an alternative approach using the Implicit Function Theorem, which frames policy optimization as a bilevel problem:

$$\max_{\mathbf{w}} \alpha^\top \mathbf{v}_\gamma^{\pi_{\boldsymbol{w}}}$$

subject to:

$$(\mathbf{I} - \gamma \mathbf{P}_{\pi_{\boldsymbol{w}}})\, \mathbf{v}_\gamma^{\pi_{\boldsymbol{w}}} = \mathbf{r}_{\pi_{\boldsymbol{w}}}$$

The Implicit Function Theorem states that if there is a solution to the problem $F(\mathbf{v}, \mathbf{w}) = 0$, then we can "reparameterize" our problem as $F(\mathbf{v}(\mathbf{w}), \mathbf{w})$ where $\mathbf{v}(\mathbf{w})$ is an implicit function of $\mathbf{w}$. If the Jacobian $\frac{\partial F}{\partial \mathbf{v}}$ is invertible, then:

$$\frac{d\mathbf{v}(\mathbf{w})}{d\mathbf{w}} = -\left(\frac{\partial F(\mathbf{v}(\mathbf{w}), \mathbf{w})}{\partial \mathbf{v}}\right)^{-1}\frac{\partial F(\mathbf{v}(\mathbf{w}), \mathbf{w})}{\partial \mathbf{w}}$$

Here we made explicit in our notation that the derivatives must be evaluated at a root $(\mathbf{v}(\mathbf{w}), \mathbf{w})$ of $F$. For the remainder of this derivation, we drop this dependence to keep the notation compact.

Applying this to our case with $F(\mathbf{v}, \mathbf{w}) = (\mathbf{I} - \gamma \mathbf{P}_{\pi_{\boldsymbol{w}}})\mathbf{v} - \mathbf{r}_{\pi_{\boldsymbol{w}}}$:

$$\frac{\partial \mathbf{v}_\gamma^{\pi_{\boldsymbol{w}}}}{\partial \mathbf{w}} = (\mathbf{I} - \gamma \mathbf{P}_{\pi_{\boldsymbol{w}}})^{-1}\left(\frac{\partial \mathbf{r}_{\pi_{\boldsymbol{w}}}}{\partial \mathbf{w}} + \gamma \frac{\partial \mathbf{P}_{\pi_{\boldsymbol{w}}}}{\partial \mathbf{w}}\mathbf{v}_\gamma^{\pi_{\boldsymbol{w}}}\right)$$

Then:

$$
\begin{align*}
\nabla_{\mathbf{w}}J(\mathbf{w}) &= \alpha^\top \frac{\partial \mathbf{v}_\gamma^{\pi_{\boldsymbol{w}}}}{\partial \mathbf{w}} \\
&= \mathbf{x}_\alpha^\top\left(\frac{\partial \mathbf{r}_{\pi_{\boldsymbol{w}}}}{\partial \mathbf{w}} + \gamma \frac{\partial \mathbf{P}_{\pi_{\boldsymbol{w}}}}{\partial \mathbf{w}}\mathbf{v}_\gamma^{\pi_{\boldsymbol{w}}}\right)
\end{align*}
$$

where we have defined the discounted state visitation distribution:

$$\mathbf{x}_\alpha^\top \equiv \alpha^\top(\mathbf{I} - \gamma \mathbf{P}_{\pi_{\boldsymbol{w}}})^{-1}.$$

Recall the vector notation for MDPs from the dynamic programming chapter:

$$
\begin{align*}
\mathbf{r}_\pi(s) &\equiv \sum_{a \in \mathcal{A}_s} \pi(a \mid s) \, r(s, a), \\
[\mathbf{P}_\pi]_{s,s'} &\equiv \sum_{a \in \mathcal{A}_s} \pi(a \mid s) \, p(s' \mid s, a).
\end{align*}
$$

Taking derivatives with respect to $\mathbf{w}$ gives:

$$
\begin{align*}
\left[\frac{\partial \mathbf{r}_{\pi_{\boldsymbol{w}}}}{\partial \mathbf{w}}\right]_s &= \sum_{a \in \mathcal{A}_s} \nabla_{\mathbf{w}}\pi_{\boldsymbol{w}}(a \mid s) \, r(s,a), \\
\left[\frac{\partial \mathbf{P}_{\pi_{\boldsymbol{w}}}}{\partial \mathbf{w}}\mathbf{v}_\gamma^{\pi_{\boldsymbol{w}}}\right]_s &= \sum_{a \in \mathcal{A}_s} \nabla_{\mathbf{w}}\pi_{\boldsymbol{w}}(a \mid s)\sum_{s'} p(s' \mid s,a) \, v_\gamma^{\pi_{\boldsymbol{w}}}(s').
\end{align*}
$$

Substituting back:

wJ(w)=sxα(s)(awπw(as)r(s,a)+γawπw(as)sp(ss,a)vγπw(s))=sxα(s)awπw(as)(r(s,a)+γsp(ss,a)vγπw(s))\begin{align*} \nabla_{\mathbf{w}}J(\mathbf{w}) &= \sum_s x_\alpha(s)\left(\sum_a \nabla_{\mathbf{w}}\pi_{\boldsymbol{w}}(a \mid s) \, r(s,a) + \gamma\sum_a \nabla_{\mathbf{w}}\pi_{\boldsymbol{w}}(a \mid s)\sum_{s'} p(s' \mid s,a) \, v_\gamma^{\pi_{\boldsymbol{w}}}(s')\right) \\ &= \sum_s x_\alpha(s)\sum_a \nabla_{\mathbf{w}}\pi_{\boldsymbol{w}}(a \mid s)\left(r(s,a) + \gamma \sum_{s'} p(s' \mid s,a) \, v_\gamma^{\pi_{\boldsymbol{w}}}(s')\right) \end{align*}

This is the policy gradient theorem, where xα(s)x_\alpha(s) is the discounted state visitation distribution and the term in parentheses is the state-action value function qπw(s,a)q^{\pi_{\boldsymbol{w}}}(s,a).
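
To make this concrete, the following sketch checks the theorem numerically on a made-up two-state, two-action MDP with a tabular softmax policy. It computes the gradient from the expression above (using the exact qq-values and the unnormalized visitation xαx_\alpha) and compares it against finite differences of J(w)=αvγπwJ(\mathbf{w}) = \alpha^\top \mathbf{v}_\gamma^{\pi_{\boldsymbol{w}}}; all constants are illustrative.

import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma = 2, 2, 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # P[s, a, s'] (made-up dynamics)
              [[0.7, 0.3], [0.05, 0.95]]])
R = np.array([[1.0, 0.0],                    # r(s, a) (made-up rewards)
              [0.0, 2.0]])
alpha = np.array([0.5, 0.5])                 # initial state distribution
w = rng.normal(size=(n_s, n_a))              # tabular softmax logits

def policy(w):
    z = np.exp(w - w.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)  # pi[s, a]

def solve_mdp(w):
    pi = policy(w)
    P_pi = np.einsum('sa,sat->st', pi, P)    # state-to-state transition matrix
    r_pi = np.einsum('sa,sa->s', pi, R)      # expected reward per state
    v = np.linalg.solve(np.eye(n_s) - gamma * P_pi, r_pi)
    # x_alpha^T = alpha^T (I - gamma P_pi)^{-1}
    x_alpha = np.linalg.solve((np.eye(n_s) - gamma * P_pi).T, alpha)
    return pi, v, x_alpha

def J(w):
    _, v, _ = solve_mdp(w)
    return alpha @ v

pi, v, x_alpha = solve_mdp(w)
q = R + gamma * np.einsum('sat,t->sa', P, v)   # q(s,a) = r + gamma E[v(s')]

# Policy gradient theorem: sum_s x_alpha(s) sum_a grad_w pi(a|s) q(s,a)
grad = np.zeros_like(w)
for s in range(n_s):
    for a in range(n_a):
        dpi = -pi[s] * pi[s, a]              # softmax Jacobian: d pi(.|s) / d w[s,a]
        dpi[a] += pi[s, a]
        grad[s, a] = x_alpha[s] * (dpi @ q[s])

# Finite-difference check of grad J(w)
eps, fd = 1e-5, np.zeros_like(w)
for idx in np.ndindex(*w.shape):
    w_p, w_m = w.copy(), w.copy()
    w_p[idx] += eps
    w_m[idx] -= eps
    fd[idx] = (J(w_p) - J(w_m)) / (2 * eps)

print(np.max(np.abs(grad - fd)))             # should be near machine precision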

Normalized Discounted State Visitation Distribution

The discounted state visitation xα(s)x_\alpha(s) is not normalized, so the expression we obtained above is not an expectation. However, we can turn it into one by multiplying by 1γ1 - \gamma and compensating with a factor of 1/(1γ)1/(1-\gamma) outside the sum. Note that for any initial distribution α\alpha:

sxα(s)=α(IγPπw)11=α11γ=11γ\sum_s x_\alpha(s) = \alpha^\top(\mathbf{I} - \gamma \mathbf{P}_{\pi_{\boldsymbol{w}}})^{-1}\mathbf{1} = \frac{\alpha^\top\mathbf{1}}{1-\gamma} = \frac{1}{1-\gamma}

Therefore, defining the normalized state distribution ξα(s)=(1γ)xα(s)\xi_\alpha(s) = (1-\gamma)x_\alpha(s), we can write:

wJ(w)=11γsξα(s)awπw(as)(r(s,a)+γsp(ss,a)vγπw(s))=11γEsξα[awπw(as)qπw(s,a)]\begin{align*} \nabla_{\mathbf{w}}J(\mathbf{w}) &= \frac{1}{1-\gamma}\sum_s \xi_\alpha(s)\sum_a \nabla_{\mathbf{w}}\pi_{\boldsymbol{w}}(a \mid s)\left(r(s,a) + \gamma \sum_{s'} p(s' \mid s,a) \, v_\gamma^{\pi_{\boldsymbol{w}}}(s')\right) \\ &= \frac{1}{1-\gamma}\mathbb{E}_{s\sim\xi_\alpha}\left[\sum_a \nabla_{\mathbf{w}}\pi_{\boldsymbol{w}}(a \mid s) \, q^{\pi_{\boldsymbol{w}}}(s,a)\right] \end{align*}
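
This normalization is easy to verify numerically. The snippet below uses a random stochastic matrix as a stand-in for Pπw\mathbf{P}_{\pi_{\boldsymbol{w}}} and checks that xαx_\alpha sums to 1/(1γ)1/(1-\gamma) while ξα=(1γ)xα\xi_\alpha = (1-\gamma)x_\alpha sums to one; all quantities are illustrative.

import numpy as np

rng = np.random.default_rng(0)
gamma, n = 0.9, 4

# Random stochastic matrix standing in for P_pi, and a random alpha.
P_pi = rng.random((n, n))
P_pi /= P_pi.sum(axis=1, keepdims=True)
alpha = rng.random(n)
alpha /= alpha.sum()

# x_alpha^T = alpha^T (I - gamma P_pi)^{-1}
x_alpha = np.linalg.solve((np.eye(n) - gamma * P_pi).T, alpha)
xi_alpha = (1 - gamma) * x_alpha

print(x_alpha.sum(), 1 / (1 - gamma))   # both ~10 for gamma = 0.9
print(xi_alpha.sum())                   # ~1: xi_alpha is a proper distribution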

Now we have expressed the policy gradient theorem in terms of expectations under the normalized discounted state visitation distribution. But what does sampling from ξα\xi_\alpha mean? Recall that xα=α(IγPπw)1\mathbf{x}_\alpha^\top = \alpha^\top(\mathbf{I} - \gamma \mathbf{P}_{\pi_{\boldsymbol{w}}})^{-1}. Using the Neumann series expansion (valid when γPπw<1\|\gamma \mathbf{P}_{\pi_{\boldsymbol{w}}}\| < 1, which holds for γ<1\gamma < 1 since Pπw\mathbf{P}_{\pi_{\boldsymbol{w}}} is a stochastic matrix) we have:

ξα=(1γ)αk=0(γPπw)k\boldsymbol{\xi}_\alpha^\top = (1-\gamma)\alpha^\top\sum_{k=0}^{\infty} (\gamma \mathbf{P}_{\pi_{\boldsymbol{w}}})^k

We can then factor out the first term from this summation to obtain:

ξα=(1γ)αk=0(γPπw)k=(1γ)α+(1γ)αk=1(γPπw)k=(1γ)α+(1γ)αγPπwk=0(γPπw)k=(1γ)α+γξαPπw\begin{align*} \boldsymbol{\xi}_\alpha^\top &= (1-\gamma)\alpha^\top\sum_{k=0}^{\infty} (\gamma \mathbf{P}_{\pi_{\boldsymbol{w}}})^k \\ &= (1-\gamma)\alpha^\top + (1-\gamma)\alpha^\top\sum_{k=1}^{\infty} (\gamma \mathbf{P}_{\pi_{\boldsymbol{w}}})^k \\ &= (1-\gamma)\alpha^\top + (1-\gamma)\alpha^\top\gamma\mathbf{P}_{\pi_{\boldsymbol{w}}}\sum_{k=0}^{\infty} (\gamma \mathbf{P}_{\pi_{\boldsymbol{w}}})^k \\ &= (1-\gamma)\alpha^\top + \gamma\boldsymbol{\xi}_\alpha^\top \mathbf{P}_{\pi_{\boldsymbol{w}}} \end{align*}

The balance equation:

ξα=(1γ)α+γξαPπw\boldsymbol{\xi}_\alpha^\top = (1-\gamma)\alpha^\top + \gamma\boldsymbol{\xi}_\alpha^\top \mathbf{P}_{\pi_{\boldsymbol{w}}}

shows that ξα\boldsymbol{\xi}_\alpha is a mixture distribution: with probability 1γ1-\gamma you draw a state from the initial distribution α\alpha (reset), and with probability γ\gamma you follow the policy dynamics Pπw\mathbf{P}_{\pi_{\boldsymbol{w}}} from the current state (continue). This interpretation directly connects to the geometric process: at each step you either terminate and resample from α\alpha (with probability 1γ1-\gamma) or continue following the policy (with probability γ\gamma).

import numpy as np

def sample_from_discounted_visitation(
    alpha, 
    policy, 
    transition_model, 
    gamma, 
    n_samples=1000
):
    """Sample states from the discounted visitation distribution.
    
    Args:
        alpha: Initial state distribution (vector of probabilities)
        policy: Function (state -> action probabilities)
        transition_model: Function (state, action -> next state probabilities)
        gamma: Discount factor
        n_samples: Number of states to sample
    
    Returns:
        Array of sampled states
    """
    samples = []
    n_states = len(alpha)
    
    # Initialize state from alpha
    current_state = np.random.choice(n_states, p=alpha)
    
    for _ in range(n_samples):
        samples.append(current_state)
        
        # With probability (1-gamma): reset
        if np.random.random() > gamma:
            current_state = np.random.choice(n_states, p=alpha)
        # With probability gamma: continue
        else:
            # Sample action from policy
            action_probs = policy(current_state)
            action = np.random.choice(len(action_probs), p=action_probs)
            
            # Sample next state from transition model
            next_state_probs = transition_model(current_state, action)
            current_state = np.random.choice(n_states, p=next_state_probs)
    
    return np.array(samples)

# Example usage for a simple 2-state MDP
alpha = np.array([0.7, 0.3])  # Initial distribution
policy = lambda s: np.array([0.8, 0.2])  # Dummy policy
transition_model = lambda s, a: np.array([0.9, 0.1])  # Dummy transitions
gamma = 0.9

samples = sample_from_discounted_visitation(alpha, policy, transition_model, gamma)

# Check empirical distribution
print("Empirical state distribution:")
print(np.bincount(samples) / len(samples))
Empirical state distribution:
[0.85 0.15]

While the math shows that sampling from the discounted visitation distribution ξα\boldsymbol{\xi}_\alpha would give us unbiased policy gradient estimates, Thomas (2014) demonstrated that this implementation can be detrimental to performance in practice. The issue arises because terminating trajectories early (with probability 1γ1-\gamma) reduces the effective amount of data we collect from each trajectory. This early termination weakens the learning signal, as many trajectories don’t reach meaningful terminal states or rewards.

Therefore, in practice, we typically sample complete trajectories from the undiscounted process (running the policy until natural termination or a fixed horizon) while still using γ\gamma in the advantage estimation. This approach preserves the full learning signal from each trajectory and has been empirically shown to lead to better performance.
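
As a minimal illustration of this convention, the sketch below computes discounted returns-to-go from one complete, undiscounted rollout; the reward sequence is made up.

import numpy as np

def discounted_returns_to_go(rewards, gamma):
    """Discounted return-to-go at every step of a full (undiscounted) rollout."""
    returns = np.zeros_like(rewards, dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

rewards = np.array([0.0, 0.0, 1.0, 0.0, 2.0])  # rewards from one full trajectory
print(discounted_returns_to_go(rewards, gamma=0.9))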

This is one of several cases in RL where the theoretically optimal procedure differs from the best practical implementation.

The Actor-Critic Architecture

The policy gradient theorem shows that the gradient depends on the action-value function qπw(s,a)q^{\pi_{\boldsymbol{w}}}(s,a). In practice, we do not have access to the true qq-function and must estimate it. This leads to the actor-critic architecture: the actor maintains the policy πw\pi_{\boldsymbol{w}}, while the critic maintains an estimate of the value function.

This architecture traces back to Sutton’s 1984 thesis, where he proposed the Adaptive Heuristic Critic. The actor uses the critic’s value estimates to compute advantage estimates for the policy gradient, while the critic learns from the same trajectories generated by the actor. The algorithms we developed earlier (REINFORCE with baseline, GAE, and the one-step actor-critic) are all instances of this architecture.

We are simultaneously learning two functions that depend on each other, which creates a stability challenge. The actor’s gradient uses the critic’s estimates, but the critic is trained on data generated by the actor’s policy. If both change too quickly, the learning process can become unstable.

Konda (2002) analyzed this coupled learning problem and established convergence guarantees under a two-timescale condition: the critic must update faster than the actor. Intuitively, the critic needs to “track” the current policy’s value function before the actor uses those estimates to update. If the actor moves too fast, it uses stale or inaccurate value estimates, leading to poor gradient estimates.

In practice, this is implemented by using different learning rates: a larger learning rate αθ\alpha_\theta for the critic and a smaller learning rate αw\alpha_w for the actor, with αθ>αw\alpha_\theta > \alpha_w. Alternatively, one can perform multiple critic updates per actor update. The soft actor-critic algorithm discussed earlier in the amortization chapter follows this same principle, inheriting the actor-critic structure while incorporating entropy regularization and learning Q-functions directly.
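
The sketch below illustrates the two-timescale idea with a minimal tabular one-step actor-critic on a made-up two-state MDP: the critic (a table of state values updated by TD(0)) uses a larger step size than the softmax actor, and, following the practical convention discussed earlier, states are generated by the undiscounted process. All constants are illustrative.

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 2, 2, 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # P[s, a, s'] (made-up dynamics)
              [[0.7, 0.3], [0.05, 0.95]]])
R = np.array([[1.0, 0.0],                    # R[s, a] (made-up rewards)
              [0.0, 2.0]])

theta = np.zeros((n_states, n_actions))      # actor: softmax policy logits
v = np.zeros(n_states)                       # critic: tabular state values
lr_critic, lr_actor = 0.1, 0.01              # two timescales: critic updates faster

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

s = 0
for step in range(20000):
    probs = softmax(theta[s])
    a = rng.choice(n_actions, p=probs)
    s_next = rng.choice(n_states, p=P[s, a])
    r = R[s, a]

    # Critic: TD(0) update toward the one-step target
    td_error = r + gamma * v[s_next] - v[s]
    v[s] += lr_critic * td_error

    # Actor: policy-gradient step using the TD error as the advantage estimate
    grad_log_pi = -probs                     # grad_theta[s] log pi(a|s) = e_a - pi(.|s)
    grad_log_pi[a] += 1.0
    theta[s] += lr_actor * td_error * grad_log_pi

    s = s_next

print("Learned values:", v)
print("Greedy actions:", softmax(theta[0]).argmax(), softmax(theta[1]).argmax())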

The actor-critic architecture also connects to the bilevel optimization perspective of the policy gradient theorem: the outer problem optimizes the policy, while the inner problem solves for the value function given that policy. The two-timescale condition ensures that the inner problem is approximately solved before taking a step on the outer problem.

Reparameterization Methods in Reinforcement Learning

When dynamics are known or can be learned, reparameterization provides an alternative to score function methods. By expressing actions and state transitions as deterministic functions of noise, we can backpropagate through trajectories to compute policy gradients with lower variance than score function estimators.

Stochastic Value Gradients

The reparameterization trick requires that we can express our random variable as a deterministic function of noise. In reinforcement learning, this applies naturally when we have a learned model of the dynamics. Consider a stochastic policy πw(as)\pi_{\boldsymbol{w}}(a|s) that we can reparameterize as a=πw(s,ϵ)a = \pi_{\boldsymbol{w}}(s,\epsilon) where ϵp(ϵ)\epsilon \sim p(\epsilon), and a dynamics model s=f(s,a,ξ)s' = f(s,a,\xi) where ξp(ξ)\xi \sim p(\xi) represents environment stochasticity. Both transformations are deterministic given the noise variables.

With these reparameterizations, we can write an nn-step return as a differentiable function of the noise:

Rn(s0,{ϵi},{ξi})=i=0n1γir(si,ai)R_n(s_0,\{\epsilon_i\},\{\xi_i\}) = \sum_{i=0}^{n-1} \gamma^i r(s_i,a_i)

where ai=πw(si,ϵi)a_i = \pi_{\boldsymbol{w}}(s_i,\epsilon_i) and si+1=f(si,ai,ξi)s_{i+1} = f(s_i,a_i,\xi_i) for i=0,...,n1i=0,...,n-1. The objective becomes:

J(w)=E{ϵi},{ξi}[Rn(s0,{ϵi},{ξi})]J(\boldsymbol{w}) = \mathbb{E}_{\{\epsilon_i\},\{\xi_i\}}[R_n(s_0,\{\epsilon_i\},\{\xi_i\})]

We can now apply the reparameterization gradient estimator:

wJ(w)=E{ϵi},{ξi}[wRn(s0,{ϵi},{ξi})]\nabla_{\boldsymbol{w}}J(\boldsymbol{w}) = \mathbb{E}_{\{\epsilon_i\},\{\xi_i\}}\left[\nabla_{\boldsymbol{w}}R_n(s_0,\{\epsilon_i\},\{\xi_i\})\right]

This gradient can be computed by automatic differentiation through the sequence of policy and model evaluations. The computation requires backpropagating through nn steps of model rollouts, which becomes expensive for large nn but avoids the high variance of score function estimators.

The Stochastic Value Gradients (SVG) framework Heess et al., 2015 uses this approach while introducing a hybrid objective that combines model rollouts with value function bootstrapping:

JSVG(n)(w)=E{ϵi},{ξi}[i=0n1γir(si,ai)+γnq(sn,an;θ)]J^{\text{SVG}(n)}(\boldsymbol{w}) = \mathbb{E}_{\{\epsilon_i\},\{\xi_i\}}\left[\sum_{i=0}^{n-1} \gamma^i r(s_i,a_i) + \gamma^n q(s_n,a_n;\theta)\right]

The terminal value function q(sn,an;θ)q(s_n,a_n;\theta) approximates the value beyond horizon nn, allowing shorter rollouts while still capturing long-term value. This creates a spectrum of algorithms parameterized by nn.

SVG(0): Model-Free Reparameterization

When n=0n=0, the objective collapses to:

JSVG(0)(w)=EsρEϵp(ϵ)[q(s,πw(s,ϵ);θ)]J^{\text{SVG}(0)}(\boldsymbol{w}) = \mathbb{E}_{s \sim \rho}\mathbb{E}_{\epsilon \sim p(\epsilon)}\left[q(s,\pi_{\boldsymbol{w}}(s,\epsilon);\theta)\right]

No model is required. We simply differentiate the critic with respect to actions sampled from the reparameterized policy. This is the approach used in DDPG Lillicrap et al., 2015 (with a deterministic policy where ϵ\epsilon is absent) and SAC Haarnoja et al., 2018 (where ϵ\epsilon produces the stochastic component). The gradient is:

wJSVG(0)=Es,ϵ[aq(s,a;θ)a=πw(s,ϵ)wπw(s,ϵ)]\nabla_{\boldsymbol{w}} J^{\text{SVG}(0)} = \mathbb{E}_{s,\epsilon}\left[\nabla_a q(s,a;\theta)\big|_{a=\pi_{\boldsymbol{w}}(s,\epsilon)} \nabla_{\boldsymbol{w}} \pi_{\boldsymbol{w}}(s,\epsilon)\right]

This requires only that the critic qq be differentiable with respect to actions, not a learned dynamics model. All bias comes from errors in the value function approximation.
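
The following sketch illustrates the SVG(0) gradient on a scalar toy problem with an assumed fixed quadratic critic q(s,a)=(a2s)2q(s,a) = -(a - 2s)^2 and a reparameterized Gaussian policy a=ws+σϵa = ws + \sigma\epsilon; the chain-rule estimate is compared against a finite-difference estimate computed on the same noise.

import numpy as np

rng = np.random.default_rng(0)
w, sigma, n = 0.5, 0.3, 100_000

s = rng.normal(size=n)            # states sampled from some distribution rho
eps = rng.normal(size=n)          # reparameterization noise
a = w * s + sigma * eps           # reparameterized actions

dq_da = -2.0 * (a - 2.0 * s)      # gradient of the (assumed) critic wrt the action
grad_svg0 = np.mean(dq_da * s)    # chain rule through a = w*s + sigma*eps

# Finite-difference check on the same noise (common random numbers)
def J(w_):
    a_ = w_ * s + sigma * eps
    return np.mean(-(a_ - 2.0 * s) ** 2)

h = 1e-5
grad_fd = (J(w + h) - J(w - h)) / (2 * h)
print(grad_svg0, grad_fd)         # the two estimates should agree closely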

SVG(1) to SVG(nn): Model-Based Rollouts

For n1n \geq 1, we unroll a learned dynamics model for nn steps before bootstrapping with the critic. Consider SVG(1):

JSVG(1)(w)=Es,ϵ,ξ[r(s,πw(s,ϵ))+γq(f(s,πw(s,ϵ),ξ),πw(s,ϵ);θ)]J^{\text{SVG}(1)}(\boldsymbol{w}) = \mathbb{E}_{s,\epsilon,\xi}\left[r(s,\pi_{\boldsymbol{w}}(s,\epsilon)) + \gamma q(f(s,\pi_{\boldsymbol{w}}(s,\epsilon),\xi), \pi_{\boldsymbol{w}}(s',\epsilon');\theta)\right]

where s=f(s,πw(s,ϵ),ξ)s' = f(s,\pi_{\boldsymbol{w}}(s,\epsilon),\xi) is the next state predicted by the model. The gradient now flows through both the reward and the model transition. Increasing nn propagates reward information more directly through the model rollout, reducing reliance on the critic. However, model errors compound over the horizon. If the model is inaccurate, longer rollouts can degrade performance.
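
To show how the gradient flows through the model step, here is a hand-differentiated SVG(1) sketch on a made-up one-dimensional problem with linear-Gaussian dynamics, a quadratic reward, a fixed quadratic critic, and a reparameterized linear-Gaussian policy; the manual chain rule is checked against finite differences on the same noise. All constants are illustrative.

import numpy as np

# Assumed setup: s' = a_dyn*s + b_dyn*u + xi, r(s,u) = -(s^2 + 0.1 u^2),
# critic q(s,u) = -(c1*s^2 + c2*u^2), policy u = w*s + sigma*eps.
rng = np.random.default_rng(0)
a_dyn, b_dyn, gamma = 0.8, 0.5, 0.9
c1, c2 = 2.0, 0.2
w, sigma, n = 0.3, 0.2, 200_000

s = rng.normal(size=n)
eps0, eps1 = rng.normal(size=n), rng.normal(size=n)    # policy noise at steps 0 and 1
xi = 0.1 * rng.normal(size=n)                          # dynamics noise

def objective(w_):
    u0 = w_ * s + sigma * eps0
    s1 = a_dyn * s + b_dyn * u0 + xi
    u1 = w_ * s1 + sigma * eps1
    return np.mean(-(s**2 + 0.1 * u0**2) + gamma * (-(c1 * s1**2 + c2 * u1**2)))

# Manual chain rule: the gradient flows through the reward, the model step,
# and the critic evaluated at the model-predicted next state.
u0 = w * s + sigma * eps0
s1 = a_dyn * s + b_dyn * u0 + xi
u1 = w * s1 + sigma * eps1
du0_dw = s
ds1_dw = b_dyn * du0_dw
du1_dw = s1 + w * ds1_dw
dr_dw = -0.2 * u0 * du0_dw
dq_dw = -2.0 * c1 * s1 * ds1_dw - 2.0 * c2 * u1 * du1_dw
grad_svg1 = np.mean(dr_dw + gamma * dq_dw)

h = 1e-5
grad_fd = (objective(w + h) - objective(w - h)) / (2 * h)
print(grad_svg1, grad_fd)   # should match up to finite-difference error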

SVG(\infty): Pure Model-Based Optimization

As nn \to \infty, we eliminate the critic entirely:

JSVG()(w)=E{ϵi},{ξi}[i=0T1γir(si,πw(si,ϵi))]J^{\text{SVG}(\infty)}(\boldsymbol{w}) = \mathbb{E}_{\{\epsilon_i\},\{\xi_i\}}\left[\sum_{i=0}^{T-1} \gamma^i r(s_i,\pi_{\boldsymbol{w}}(s_i,\epsilon_i))\right]

This is pure model-based policy optimization, differentiating through the entire trajectory. Approaches like PILCO Deisenroth & Rasmussen, 2011 and Dreamer Hafner et al., 2019 operate in this regime. With an accurate model, this provides the most direct gradient signal. The tradeoff is computational: backpropagating through hundreds of time steps is expensive, and gradient magnitudes can explode or vanish over long horizons.

The choice of nn reflects a fundamental bias-variance tradeoff. Small nn relies on the critic for long-term value estimation, inheriting its approximation errors. Large nn relies on the model, accumulating its prediction errors. In practice, intermediate values like n=5n=5 or n=10n=10 often work well when combined with a reasonably accurate learned model.

Noise Inference for Off-Policy Learning

A subtle issue arises when combining reparameterization with experience replay. SVG naturally supports off-policy learning: states ss can be sampled from a replay buffer rather than the current policy. However, reparameterization requires the noise variables ϵ\epsilon that generated each action.

For on-policy data, we can simply store ϵ\epsilon alongside each transition (s,a,r,s)(s, a, r, s'). For off-policy data collected under a different policy, the noise is unknown. To apply reparameterization gradients to such data, we must infer the noise that would have produced the observed action under the current policy.

For invertible policies, this is straightforward. If a=πw(s,ϵ)a = \pi_{\boldsymbol{w}}(s, \epsilon) with ϵN(0,I)\epsilon \sim \mathcal{N}(0, I), and the policy takes the form a=μw(s)+σw(s)ϵa = \mu_{\boldsymbol{w}}(s) + \sigma_{\boldsymbol{w}}(s) \odot \epsilon (as in a Gaussian policy), we can recover the noise exactly:

ϵ=aμw(s)σw(s)\epsilon = \frac{a - \mu_{\boldsymbol{w}}(s)}{\sigma_{\boldsymbol{w}}(s)}

This recovered ϵ\epsilon can then be used for gradient computation. However, this introduces a subtle dependence: the inferred ϵ\epsilon depends on the current policy parameters w\boldsymbol{w}, not just the data. As the policy changes during training, the same action aa corresponds to different noise values.
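
A minimal sketch of this inversion for a diagonal Gaussian policy, with illustrative stand-ins for the mean and standard-deviation heads:

import numpy as np

rng = np.random.default_rng(0)

def mu_w(s, w):
    return w * s                       # hypothetical mean head

def sigma_w(s, w):
    return 0.5 * np.ones_like(s)       # hypothetical (state-independent) std head

s = rng.normal(size=5)
a_stored = rng.normal(size=5)          # actions recorded under an old behavior policy
w = 0.7                                # current policy parameters

# Invert the reparameterization: the noise the *current* policy would have
# needed to produce the stored actions.
eps = (a_stored - mu_w(s, w)) / sigma_w(s, w)

# Reconstructing the action from the inferred noise recovers a_stored exactly,
# but note that eps depends on w: it changes as the policy is updated.
a_reconstructed = mu_w(s, w) + sigma_w(s, w) * eps
print(np.allclose(a_reconstructed, a_stored))   # True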

For dynamics noise ξ\xi, the situation is more complex. If we have a probabilistic model s=f(s,a,ξ)s' = f(s, a, \xi) and observe the actual next state ss', we could in principle infer ξ\xi. In practice, environment stochasticity is often treated as irreducible: we cannot replay the exact same noise realization. SVG handles this by either: (1) using deterministic models and ignoring environment stochasticity, (2) re-simulating from the model rather than using observed next states, or (3) using importance weighting to correct for the distribution mismatch.

The noise inference perspective connects reparameterization gradients to the broader question of credit assignment in RL. By explicitly tracking which noise realizations led to which outcomes, we can more precisely attribute value to policy parameters rather than to lucky or unlucky samples.

When dynamics are deterministic or can be accurately reparameterized, SVG-style methods offer an efficient alternative to the score function methods developed in the previous section. However, many reinforcement learning problems involve unknown dynamics or dynamics that resist accurate modeling. In those settings, score function methods remain the primary tool since they require only the ability to sample trajectories under the policy.

Summary

This chapter developed the mathematical foundations for policy gradient methods. Starting from general derivative estimation techniques in stochastic optimization, we saw two main approaches: the likelihood ratio (score function) method and the reparameterization trick. While the reparameterization trick typically offers lower variance, it requires that the sampling distribution be reparameterizable, making it inapplicable to discrete actions or environments with complex dynamics.

For reinforcement learning, the score function estimator provides a model-free gradient that depends only on the policy parametrization, not the transition dynamics. Through variance reduction techniques (leveraging conditional independence, using control variates, and the Generalized Advantage Estimator), we can make these gradients practical for learning. The likelihood ratio perspective then led to importance-weighted surrogates and PPO’s clipped objective for stable off-policy updates.

We also established the policy gradient theorem, which provides the theoretical foundation for these estimators in the discounted infinite-horizon setting. The actor-critic architecture emerges from approximating the value function that appears in this theorem, with the two-timescale condition ensuring stable learning.

When dynamics models are available, reparameterization through Stochastic Value Gradients offers lower-variance alternatives. SVG(0) recovers actor-critic methods like DDPG and SAC, while SVG(\infty) represents pure model-based optimization through differentiable simulation.

References
  1. Williams, R. J. (1992). Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8(3), 229–256. 10.1007/BF00992696
  2. Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (1999). Policy Gradient Methods for Reinforcement Learning with Function Approximation. Advances in Neural Information Processing Systems, 12, 1057–1063.
  3. Konda, V. R. (2002). Actor-Critic Algorithms [PhD thesis]. Massachusetts Institute of Technology.
  4. Heess, N., Wayne, G., Silver, D., Lillicrap, T., Erez, T., & Tassa, Y. (2015). Learning Continuous Control Policies by Stochastic Value Gradients. Advances in Neural Information Processing Systems, 28, 2944–2952.
  5. Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2015). Continuous Control with Deep Reinforcement Learning. arXiv Preprint arXiv:1509.02971.
  6. Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. Proceedings of the 35th International Conference on Machine Learning (ICML), 1861–1870.
  7. Deisenroth, M. P., & Rasmussen, C. E. (2011). PILCO: A Model-Based and Data-Efficient Approach to Policy Search. Proceedings of the 28th International Conference on Machine Learning (ICML), 465–472.
  8. Hafner, D., Lillicrap, T., Ba, J., & Norouzi, M. (2019). Dream to Control: Learning Behaviors by Latent Imagination. arXiv Preprint arXiv:1912.01603.