
Weighted Residual Methods for Functional Equations

The Bellman optimality equation \Bellman v = v, whose contraction property underlies the convergence of value iteration, is a functional equation: an equation where the unknown is an entire function rather than a finite-dimensional vector. When the state space is continuous or very large, we cannot represent the value function exactly on a computer. We must instead work with finite-dimensional approximations. This motivates weighted residual methods (also called minimum residual methods), a general framework for transforming infinite-dimensional problems into tractable finite-dimensional ones Chakraverty et al. (2019); Atkinson & Potra (1987).

A Motivating Example: Optimal Stopping with Continuous States

Before developing the general theory, consider a concrete example that illustrates the core challenge. An agent observes a state s[0,1]s \in [0, 1] and must decide whether to stop (receive reward ss and end the episode) or continue (receive nothing, and the state redraws uniformly on [0,1][0, 1]). With discount factor γ=0.9\gamma = 0.9, the Bellman optimality equation is:

v^*(s) = \max\left\{ s, \; \gamma \int_0^1 v^*(s') \, ds' \right\}.

The first term is the immediate payoff from stopping; the second is the discounted expected continuation value. Since the continuation value vˉ=01v(s)ds\bar{v} = \int_0^1 v^*(s') ds' is a constant (it doesn’t depend on the current state ss), the optimal policy has a threshold structure: stop if sss \geq s^* for some threshold ss^*, continue otherwise.

At the threshold, the agent is indifferent: s=γvˉs^* = \gamma \bar{v}. Computing vˉ\bar{v} by integrating vv^*:

\bar{v} = \int_0^{s^*} \gamma \bar{v} \, ds' + \int_{s^*}^1 s' \, ds' = s^* \gamma \bar{v} + \frac{1 - (s^*)^2}{2}.

Substituting s=γvˉs^* = \gamma \bar{v} and solving gives the exact threshold and value.
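Carrying out the substitution turns the equation for \bar{v} into a quadratic:

\bar{v} = \gamma^2 \bar{v}^2 + \frac{1 - \gamma^2 \bar{v}^2}{2} \quad\Longrightarrow\quad \gamma^2 \bar{v}^2 - 2\bar{v} + 1 = 0 \quad\Longrightarrow\quad \bar{v} = \frac{1 - \sqrt{1 - \gamma^2}}{\gamma^2},

taking the smaller root so that s^* = \gamma \bar{v} \leq 1. This is the closed form evaluated in the code below.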

Source
#  label: fig-optimal-stopping-exact
#  caption: The exact value function for the optimal stopping problem.

%config InlineBackend.figure_format = 'retina'
import numpy as np
import matplotlib.pyplot as plt

gamma = 0.9

# Solve for exact threshold
v_bar_exact = (1 - np.sqrt(1 - gamma**2)) / gamma**2
s_star_exact = gamma * v_bar_exact

print(f"Exact solution:")
print(f"  Threshold s* = {s_star_exact:.6f}")
print(f"  Continuation value v̄ = {v_bar_exact:.6f}")

# The exact value function
def v_exact(s):
    return np.where(s >= s_star_exact, s, gamma * v_bar_exact)

# Plot the exact value function
s_grid = np.linspace(0, 1, 200)
plt.figure(figsize=(8, 4))
plt.plot(s_grid, v_exact(s_grid), 'b-', linewidth=2, label='Exact $v^*(s)$')
plt.axvline(s_star_exact, color='r', linestyle='--', label=f'Threshold $s^* = {s_star_exact:.3f}$')
plt.xlabel('State $s$')
plt.ylabel('Value $v^*(s)$')
plt.legend()
plt.title('Optimal Stopping: Exact Value Function')
plt.grid(True, alpha=0.3)
plt.tight_layout()
Exact solution:
  Threshold s* = 0.626789
  Continuation value v̄ = 0.696432
<Figure size 800x400 with 1 Axes>

The exact value function is piecewise linear: constant at γvˉ\gamma \bar{v} below the threshold, equal to ss above it. Now suppose we want to approximate vv^* using a polynomial basis with nn terms:

\hat{v}(s; \theta) = \sum_{j=0}^{n-1} \theta_j s^j = \theta_0 + \theta_1 s + \theta_2 s^2 + \cdots

The residual at state ss measures how far our approximation is from satisfying the Bellman equation:

R(s; \theta) = \max\left\{ s, \; \gamma \int_0^1 \hat{v}(s'; \theta) \, ds' \right\} - \hat{v}(s; \theta).

For a perfect solution, R(s;θ)=0R(s; \theta) = 0 for all s[0,1]s \in [0, 1]. But a polynomial cannot exactly represent the kink at ss^*. We must choose how to make the residual “small” across the state space.

Collocation picks nn points {s1,,sn}\{s_1, \ldots, s_n\} and requires the residual to vanish exactly there:

R(s_i; \theta) = 0, \quad i = 1, \ldots, n.

Galerkin requires the residual to be orthogonal to each basis function:

\int_0^1 R(s; \theta) s^{j} w(s) \, ds = 0, \quad j = 0, \ldots, n-1.
Source
#  label: fig-collocation-comparison
#  caption: Polynomial collocation approximation with 5 Chebyshev nodes.

from scipy.integrate import quad

def chebyshev_nodes(n, a=0, b=1):
    """Chebyshev nodes on [a, b]."""
    k = np.arange(1, n + 1)
    nodes = 0.5 * (a + b) + 0.5 * (b - a) * np.cos((2*k - 1) * np.pi / (2*n))
    return np.sort(nodes)

def collocation_solve(n, gamma, max_iter=100, tol=1e-8):
    """Solve optimal stopping via polynomial collocation."""
    nodes = chebyshev_nodes(n)
    Phi = np.vander(nodes, n, increasing=True)
    
    theta = np.zeros(n)
    for iteration in range(max_iter):
        def v_approx(s):
            return sum(theta[j] * s**j for j in range(n))
        
        v_bar, _ = quad(v_approx, 0, 1)
        targets = np.maximum(nodes, gamma * v_bar)
        theta_new = np.linalg.solve(Phi, targets)
        
        if np.linalg.norm(theta_new - theta) < tol:
            return theta_new, iteration + 1
        theta = theta_new
    return theta, max_iter

# Solve with different numbers of basis functions
for n in [3, 5, 8]:
    theta, iters = collocation_solve(n, gamma)
    def v_approx(s, theta=theta, n=n):
        return sum(theta[j] * s**j for j in range(n))
    errors = [abs(v_approx(s) - v_exact(s)) for s in np.linspace(0, 1, 1000)]
    print(f"n = {n}: converged in {iters} iters, max error = {max(errors):.6f}")

# Plot comparison for n=5
n = 5
theta, _ = collocation_solve(n, gamma)
v_approx_5 = lambda s: sum(theta[j] * s**j for j in range(n))

plt.figure(figsize=(8, 4))
plt.plot(s_grid, v_exact(s_grid), 'b-', linewidth=2, label='Exact')
plt.plot(s_grid, [v_approx_5(s) for s in s_grid], 'r--', linewidth=2, label=f'Collocation ($n={n}$)')
plt.scatter(chebyshev_nodes(n), [v_approx_5(s) for s in chebyshev_nodes(n)], 
            color='red', s=50, zorder=5, label='Collocation nodes')
plt.xlabel('State $s$')
plt.ylabel('Value')
plt.legend()
plt.title('Polynomial Collocation Approximation')
plt.grid(True, alpha=0.3)
plt.tight_layout()
n = 3: converged in 49 iters, max error = 0.053991
n = 5: converged in 41 iters, max error = 0.052626
n = 8: converged in 54 iters, max error = 0.015880
<Figure size 800x400 with 1 Axes>

This example illustrates the fundamental tension in weighted residual methods: with finite parameters, we cannot satisfy the Bellman equation everywhere. We must choose how to allocate our approximation capacity. The rest of this chapter develops the general theory behind these choices.

Testing Whether a Residual Vanishes

Consider a functional equation N(f)=0\Residual(f) = 0, where N\Residual is an operator and the unknown ff is an entire function (in our case, the Bellman optimality equation Lv=v\Bellman v = v, which we can write as N(v)Lvv=0\Residual(v) \equiv \Bellman v - v = 0). Suppose we have found a candidate approximate solution f^\hat{f}. To verify it satisfies N(f^)=0\Residual(\hat{f}) = 0, we compute the residual function R(s)=N(f^)(s)R(s) = \Residual(\hat{f})(s). For a true solution, this residual should be the zero function: R(s)=0R(s) = 0 for every state ss.

How might we test whether a function is zero? One approach: sample many input points {s1,s2,,sm}\{s_1, s_2, \ldots, s_m\}, check whether R(si)=0R(s_i) = 0 at each, and summarize the results into a single scalar test by computing a weighted sum i=1mwiR(si)\sum_{i=1}^m w_i R(s_i) with weights wi>0w_i > 0. If RR is zero everywhere, this sum is zero. If RR is nonzero somewhere, we can choose points and weights to make the sum nonzero. For vectors in finite dimensions, the inner product r,y=i=1nriyi\langle \mathbf{r}, \mathbf{y} \rangle = \sum_{i=1}^n r_i y_i implements exactly this idea: it tests r\mathbf{r} by weighting and summing. Indeed, a vector rRn\mathbf{r} \in \mathbb{R}^n equals zero if and only if r,y=0\langle \mathbf{r}, \mathbf{y} \rangle = 0 for every vector yRn\mathbf{y} \in \mathbb{R}^n. To see why, suppose r0\mathbf{r} \neq \mathbf{0}. Choosing y=r\mathbf{y} = \mathbf{r} gives r,r=r2>0\langle \mathbf{r}, \mathbf{r} \rangle = \|\mathbf{r}\|^2 > 0, contradicting the claim that all inner products vanish.

The same principle extends to functions. A function RR equals the zero function if and only if its “inner product” with every “test function” pp vanishes:

R = 0 \quad \text{if and only if} \quad \langle R, p \rangle_w = \int_{\mathcal{S}} R(s) p(s) w(s) ds = 0 \quad \text{for all test functions } p,

where w(s)>0w(s) > 0 is a weight function that is part of the inner product definition. Why does this work? For the same reason as in finite dimensions: if RR is not the zero function, there must be some region where R(s)0R(s) \neq 0. We can then choose a test function pp that is nonzero in that same region (for instance, p(s)=R(s)p(s) = R(s) itself), which will produce R,pw=R(s)p(s)w(s)ds>0\langle R, p \rangle_w = \int R(s) p(s) w(s) ds > 0, witnessing that RR is nonzero. Conversely, if RR is the zero function, then R,pw=0\langle R, p \rangle_w = 0 for any test function pp.

This ability to distinguish between different functions using inner products is a fundamental principle from functional analysis. Just as we can test a vector by taking inner products with other vectors, we can test a function by taking inner products with other functions.

This transforms the pointwise condition “R(s)=0R(s) = 0 for all ss” (infinitely many conditions, one per state) into an equivalent condition about inner products. We still cannot test against all possible test functions, since there are infinitely many of those too. But the inner product perspective suggests a natural computational strategy: choose a finite collection of test functions {p1,,pn}\{p_1, \ldots, p_n\} and use them to construct nn conditions that we can actually compute.
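As a small numerical illustration of this strategy (not part of the chapter's algorithms; the residual and monomial test functions below are made up for illustration), consider a residual that happens to integrate to zero on [0, 1] with unit weight: the constant test function fails to detect it, but the next monomial does.

import numpy as np
from scipy.integrate import quad

R = lambda s: s - 0.5                                      # a residual that is not the zero function
test_functions = [lambda s, j=j: s**j for j in range(3)]   # p_j(s) = s^j, j = 0, 1, 2

# "Inner products" <R, p_j>_w with w(s) = 1 on [0, 1]
inner_products = [quad(lambda s, p=p: R(s) * p(s), 0, 1)[0] for p in test_functions]
print(inner_products)   # approximately [0, 1/12, 1/12]: the j = 1 test reveals R != 0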

From Variational Conditions to Optimization

Making a residual “small” is an optimization problem: we want to find \theta that minimizes \lVert R(\cdot; \theta) \rVert for some norm. Different methods correspond to different choices of norm: least squares minimizes a weighted L^2 norm of the residual, Galerkin a dual norm taken over the approximation space, and collocation a discrete sum of squared residuals at selected points.

The first-order conditions for these optimization problems take the form R,pjw=0\langle R, p_j \rangle_w = 0 for appropriate “test functions” pjp_j. The variational formulation is useful for analysis, but we are simply minimizing the residual in a chosen norm.

The rest of this chapter develops the computational framework: how to parameterize the unknown function, define the residual, choose a norm, and solve the resulting finite-dimensional problem.

The General Framework

Consider an operator equation of the form

\Residual(f) = 0,

where N:B1B2\Residual: B_1 \to B_2 is a continuous operator between complete normed vector spaces B1B_1 and B2B_2. For the Bellman equation, we have N(v)=Lvv\Residual(v) = \Bellman v - v, so that solving N(v)=0\Residual(v) = 0 is equivalent to finding the fixed point v=Lvv = \Bellman v.

Just as we transcribed infinite-dimensional continuous optimal control problems into finite-dimensional discrete optimal control problems in earlier chapters, we seek a finite-dimensional approximation to this infinite-dimensional functional equation. Recall that for continuous optimal control, we adopted control parameterization: we represented the control trajectory using a finite set of basis functions (piecewise constants, polynomials, splines) and searched over the finite-dimensional coefficient space instead of the infinite-dimensional function space. For integrals in the objective and constraints, we used numerical quadrature to approximate them with finite sums.

We follow the same strategy here. We parameterize the value function using a finite set of basis functions {φ1,,φn}\{\varphi_1, \ldots, \varphi_n\}, commonly polynomials (Chebyshev, Legendre), though other function classes (splines, radial basis functions, neural networks) are possible, and search for coefficients θ=(θ1,,θn)\theta = (\theta_1, \ldots, \theta_n) in Rn\mathbb{R}^n. When integrals appear in the Bellman operator or projection conditions, we approximate them using numerical quadrature. The projection method approach consists of several conceptual steps that accomplish this transcription.

Step 1: Choose a Finite-Dimensional Approximation Space

We begin by selecting a basis Φ={φ1,φ2,,φn}\Phi = \{\varphi_1, \varphi_2, \ldots, \varphi_n\} and approximating the unknown function as a linear combination:

\hat{f}(x) = \sum_{i=1}^n \theta_i \varphi_i(x).

The choice of basis functions \varphi_i is problem-dependent. Common choices include orthogonal polynomials (Chebyshev, Legendre), piecewise polynomial splines, and radial basis functions.

The number of basis functions nn determines the flexibility of our approximation. In practice, we start with small nn and increase it until the approximation quality is satisfactory. The only unknowns now are the coefficients θ=(θ1,,θn)\theta = (\theta_1, \ldots, \theta_n).
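As a minimal sketch of this linear-in-parameters structure (assuming a state in [0, 1] rescaled to [-1, 1] for numpy's Chebyshev helpers):

import numpy as np

def features(s, n):
    """Chebyshev basis values T_0(x), ..., T_{n-1}(x) at x = 2s - 1 for states s in [0, 1]."""
    return np.polynomial.chebyshev.chebvander(2.0 * np.asarray(s) - 1.0, n - 1)

theta = np.zeros(4)                                      # coefficients theta_1, ..., theta_n
v_hat = features(np.linspace(0.0, 1.0, 5), 4) @ theta    # hat{f}(x) = sum_i theta_i * phi_i(x)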

While the classical presentation of projection methods focuses on polynomial bases, the framework applies equally well to other function classes. Neural networks, for instance, can be viewed through this lens: a neural network f^(x;θ)\hat{f}(x; \theta) with parameters θ\theta defines a flexible function class, and many training procedures can be interpreted as projection methods with specific choices of test functions or residual norms. The distinction is that classical methods typically use predetermined basis functions with linear coefficients, while neural networks use adaptive nonlinear features. Throughout this chapter, we focus on the classical setting to develop the core concepts, but the principles extend naturally to modern function approximators.

Step 2: Define the Residual Function

Since we are approximating ff with f^\hat{f}, the operator N\Residual will generally not vanish exactly. Instead, we obtain a residual function:

R(x; \theta) = \Residual(\hat{f}(\cdot; \theta))(x).

This residual measures how far our candidate solution is from satisfying the equation at each point xx. As we discussed in the introduction, we want to make this residual small—an optimization problem whose formulation depends on how we measure “small.”

Step 3: Minimize the Residual

Having chosen our basis and defined the residual, we must find θ\theta that makes the residual small. This is an optimization problem: we minimize R(;θ)\lVert R(\cdot; \theta) \rVert for some norm. The choice of norm determines the method:

| Method | Norm being minimized | Conditions (n equations) |
|---|---|---|
| Least squares | \lVert R \rVert_w^2 = \int R(x; \theta)^2 w(x)\, dx | \int R \cdot \frac{\partial R}{\partial \theta_j} \, w \, dx = 0, \quad j = 1, \ldots, n |
| Galerkin | \lVert R \rVert_{\mathcal{V}^*} (dual norm of approx. space) | \int R(x; \theta) \varphi_j(x) w(x)\, dx = 0, \quad j = 1, \ldots, n |
| Collocation | \sum_i \omega_i R(x_i; \theta)^2 (discrete) | R(x_i; \theta) = 0, \quad i = 1, \ldots, n |

The first-order conditions for each optimization problem yield nn equations in the nn unknowns θ1,,θn\theta_1, \ldots, \theta_n. We now describe each method, starting with the most computationally attractive.

Collocation: Make the Residual Zero at Selected Points

The simplest approach is to choose nn points {x1,,xn}\{x_1, \ldots, x_n\} and require the residual to vanish exactly at each:

R(x_i; \theta) = 0, \quad i = 1, \ldots, n.

This gives nn equations for nn unknowns. Collocation is computationally attractive because it avoids integration entirely—we only evaluate RR at discrete points. The resulting system is:

\Residual(\hat{f}(\cdot; \theta))(x_i) = 0, \quad i = 1, \ldots, n.

For a linear operator, this is a linear system; for the Bellman equation, it is nonlinear due to the max.

Verify for yourself: with n=2n = 2 collocation points and n=2n = 2 basis functions, the system Φθ=t\boldsymbol{\Phi}\theta = t is a 2×22 \times 2 linear system. What must be true about the collocation matrix Φ\boldsymbol{\Phi} for this system to have a unique solution?

The choice of collocation points matters. Orthogonal collocation (or spectral collocation) places points at the zeros of the n-th orthogonal polynomial in a family (Chebyshev, Legendre, etc.). For Chebyshev polynomials T_0, T_1, \ldots, T_{n-1}, we place collocation points at the zeros of T_n(x). These points are also the nodes of Gauss-Chebyshev quadrature, so the same points serve double duty for interpolation and numerical integration.

The Chebyshev interpolation theorem guarantees that forcing R(xi;θ)=0R(x_i; \theta) = 0 at these carefully chosen points makes R(x;θ)R(x; \theta) small everywhere, with well-conditioned systems and near-optimal interpolation error.

Galerkin: Make the Residual Orthogonal to the Approximation Space

The Galerkin method requires the residual to be orthogonal to each basis function:

\int_{\mathcal{S}} R(x; \theta) \varphi_i(x) w(x) dx = 0, \quad i = 1, \ldots, n.

To understand why this is optimal, consider the approximation space V=span{φ1,,φn}\mathcal{V} = \text{span}\{\varphi_1, \ldots, \varphi_n\} as an nn-dimensional subspace. If the residual RR is orthogonal to all basis functions, then by linearity, RR is orthogonal to every function in V\mathcal{V}:

\langle R, g \rangle_w = \left\langle R, \sum_{i=1}^n c_i \varphi_i \right\rangle_w = \sum_{i=1}^n c_i \langle R, \varphi_i \rangle_w = 0 \quad \text{for all } g \in \mathcal{V}.

The residual has “zero overlap” with our approximation space—it is as “invisible” to our basis as possible. This is the defining property of orthogonal projection.

In what sense is Galerkin minimizing a norm? The dual norm of RR with respect to V\mathcal{V} measures RR by its largest inner product with functions in V\mathcal{V}:

\lVert R \rVert_{\mathcal{V}^*} = \sup_{\substack{g \in \mathcal{V} \\ \lVert g \rVert_w = 1}} \lvert \langle R, g \rangle_w \rvert.

The Galerkin conditions R,φjw=0\langle R, \varphi_j \rangle_w = 0 for all jj imply R,gw=0\langle R, g \rangle_w = 0 for all gVg \in \mathcal{V}, so RV=0\lVert R \rVert_{\mathcal{V}^*} = 0. Galerkin makes the residual “invisible” when measured against the approximation space—it minimizes the dual norm to zero.

A finite-dimensional analogy: to approximate a vector vR3\mathbf{v} \in \mathbb{R}^3 using only the xyxy-plane, the best approximation is v^=(v1,v2,0)\hat{\mathbf{v}} = (v_1, v_2, 0). The error r=vv^=(0,0,v3)\mathbf{r} = \mathbf{v} - \hat{\mathbf{v}} = (0, 0, v_3) points purely in the zz-direction, orthogonal to the plane. The Galerkin condition generalizes this: the residual is orthogonal to the approximation space.

Galerkin requires integration to compute the conditions, making it more expensive per iteration than collocation. However, when using orthogonal polynomial bases with matching weight functions, the integrals simplify and the resulting systems are well-conditioned.

Least Squares: Minimize the L2L^2 Norm of the Residual

The most direct approach is to minimize the weighted L2L^2 norm of the residual:

\min_\theta \int_{\mathcal{S}} R(x; \theta)^2 w(x) dx.

The first-order conditions are:

\int_{\mathcal{S}} R(x; \theta) \frac{\partial R(x; \theta)}{\partial \theta_j} w(x) dx = 0, \quad j = 1, \ldots, n.

This directly minimizes how far our approximation is from satisfying the equation. For the Bellman equation R=Lv^v^R = \Bellman\hat{v} - \hat{v}, this is Bellman residual minimization: we minimize Lv^v^w2\lVert \Bellman\hat{v} - \hat{v} \rVert_w^2.

The gradient Rθj\frac{\partial R}{\partial \theta_j} involves differentiating the operator N\Residual. For the Bellman operator with its max, this requires the Envelope Theorem (discussed in Step 4). The need to differentiate through the operator distinguishes least squares from Galerkin and collocation.
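As a rough, derivative-free sketch of this idea for the optimal stopping example (a discretized residual norm on a fixed grid handed to scipy's Nelder-Mead, rather than the first-order conditions above):

import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize

gamma, n = 0.9, 4
grid = np.linspace(0.0, 1.0, 101)

def bellman_residual_norm(theta):
    v = lambda s: sum(theta[j] * s**j for j in range(n))
    v_bar, _ = quad(v, 0, 1)                             # continuation value of the candidate
    residual = np.maximum(grid, gamma * v_bar) - np.array([v(s) for s in grid])
    return np.mean(residual**2)                          # discretized ||L v_hat - v_hat||_w^2

result = minimize(bellman_residual_norm, np.zeros(n), method="Nelder-Mead")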

Fitted Q-Iteration: Project, Then Iterate

For iterative methods, there is a computationally simpler alternative to minimizing the residual directly. Fitted Q-Iteration (FQI) uses a two-step iteration:

  1. Apply the Bellman operator to get a target: fk=Lv^kf_k = \Bellman \hat{v}_k

  2. Project the target back onto the approximation space: v^k+1=argminθv^(;θ)fkw2\hat{v}_{k+1} = \arg\min_\theta \lVert \hat{v}(\cdot; \theta) - f_k \rVert_w^2

The projection step solves minθv^fkw2\min_\theta \lVert \hat{v} - f_k \rVert_w^2, whose first-order conditions are v^fk,φjw=0\langle \hat{v} - f_k, \varphi_j \rangle_w = 0. This is a standard least-squares fit of the basis to the target values. Combining these steps gives:

\hat{v}_{k+1} = \Pi_w \, \Bellman \hat{v}_k,

where Πw\Pi_w denotes orthogonal projection onto span{φj}\text{span}\{\varphi_j\} with respect to the weighted inner product.

FQI does not minimize the Bellman residual Lv^v^2\lVert \Bellman\hat{v} - \hat{v} \rVert^2 directly. It projects, then iterates. FQI’s projection step uses only the gradient of v^\hat{v} with respect to θ\theta (the “semi-gradient”), while Bellman residual minimization requires differentiating through L\Bellman (the “full gradient”). We return to this distinction when discussing temporal difference learning.

Step 4: Solve the Finite-Dimensional Problem

The conditions from Step 3 give us a finite-dimensional problem to solve: for collocation, the n equations R(x_i; \theta) = 0; for Galerkin, the n orthogonality conditions \langle R, \varphi_i \rangle_w = 0; and for least squares, the n first-order conditions of minimizing \lVert R \rVert_w^2.

In each case, we have nn equations (or first-order conditions) in nn unknowns θ1,,θn\theta_1, \ldots, \theta_n. For the Bellman equation, these systems are nonlinear due to the max operator.

Computational Cost and Conditioning

The computational cost per iteration varies significantly across methods: collocation only requires evaluating the residual at n points, while Galerkin and least squares must integrate the residual against test functions or residual gradients, typically by numerical quadrature.

For methods requiring integration, the choice of quadrature rule should match the basis. Gaussian quadrature with nodes at orthogonal polynomial zeros is efficient. When combined with collocation at those same points, the quadrature is exact for polynomials up to a certain degree. This coordination between quadrature and collocation makes orthogonal collocation effective.

The conditioning of the system depends on the choice of test functions. The Jacobian matrix has entries:

J_{ij} = \frac{\partial P_i}{\partial \theta_j} = \left\langle \frac{\partial R(\cdot; \theta)}{\partial \theta_j}, p_i \right\rangle_w.

When test functions are orthogonal (or nearly so), the Jacobian tends to be well-conditioned. This is why orthogonal polynomial bases are preferred in Galerkin methods: they produce Jacobians with controlled condition numbers. Poorly chosen basis functions or collocation points can lead to nearly singular Jacobians, causing numerical instability. Orthogonal bases and carefully chosen collocation points (like Chebyshev nodes) help maintain good conditioning.

Two Main Solution Approaches

We have two fundamentally different ways to solve the projection equations: function iteration (exploiting fixed-point structure) and Newton’s method (exploiting smoothness). The choice depends on whether the original operator equation has contraction properties and how well those properties are preserved by the finite-dimensional approximation.

Method 1: Function Iteration (Successive Approximation)

When the operator equation has the form f=Tff = \Contraction f where T\Contraction is a contraction, the most natural approach is to iterate the operator directly:

\hat{f}^{(k+1)} = \Contraction \hat{f}^{(k)}.

The infinite-dimensional iteration becomes a finite-dimensional iteration in coefficient space once we choose our weighted residual method. Given a current approximation f^(k)(x;θ(k))\hat{f}^{(k)}(x; \theta^{(k)}), how do we find the coefficients θ(k+1)\theta^{(k+1)} for the next iterate f^(k+1)\hat{f}^{(k+1)}?

Different weighted residual methods answer this differently. For collocation, we proceed in two steps:

  1. Evaluate the operator: At each collocation point xix_i, compute what the next iterate should be: ti(k)=(Tf^(k))(xi)t_i^{(k)} = (\Contraction \hat{f}^{(k)})(x_i). These nn target values tell us what f^(k+1)\hat{f}^{(k+1)} should equal at the collocation points.

  2. Find matching coefficients: Determine θ(k+1)\theta^{(k+1)} so that f^(k+1)(xi;θ(k+1))=ti(k)\hat{f}^{(k+1)}(x_i; \theta^{(k+1)}) = t_i^{(k)} for all ii. This is a linear system: jθj(k+1)φj(xi)=ti(k)\sum_j \theta_j^{(k+1)} \varphi_j(x_i) = t_i^{(k)}.

In matrix form: Φθ(k+1)=t(k)\boldsymbol{\Phi} \theta^{(k+1)} = t^{(k)}, where Φ\boldsymbol{\Phi} is the collocation matrix with entries Φij=φj(xi)\Phi_{ij} = \varphi_j(x_i). Solving this system gives θ(k+1)=Φ1t(k)\theta^{(k+1)} = \boldsymbol{\Phi}^{-1} t^{(k)}.

For Galerkin, the projection condition f^(k+1)Tf^(k+1),φiw=0\langle \hat{f}^{(k+1)} - \Contraction \hat{f}^{(k+1)}, \varphi_i \rangle_w = 0 directly gives a system for θ(k+1)\theta^{(k+1)}. When T\Contraction is linear in its argument (as in many integral equations), this is a linear system. When T\Contraction is nonlinear (as in the Bellman equation), we must solve a nonlinear system at each iteration, though each solution still only involves nn unknowns rather than an infinite-dimensional function.

When T\Contraction is a contraction in the infinite-dimensional space with constant γ<1\gamma < 1, iterating it pulls any starting function toward the unique fixed point. The hope is that the finite-dimensional operator, evaluating T\Contraction and projecting back onto the span of the basis functions, inherits this contraction property. When it does, function iteration converges globally from any initial guess, with each iteration reducing the error by a factor of roughly γ\gamma. This is computationally attractive: we only evaluate the operator and solve a linear system (for collocation) or a relatively simple system (for other methods).

However, the finite-dimensional approximation doesn’t always preserve contraction. High-order polynomial bases, in particular, can create oscillations between basis functions that amplify rather than contract errors. Even when contraction is preserved, convergence can be painfully slow when γ\gamma is close to 1, the “weak contraction” regime common in economic problems with patient agents (γ0.95\gamma \approx 0.95 or higher). Finally, not all operator equations naturally present themselves as contractions; some require reformulation (like f=fαN(f)f = f - \alpha \Residual(f)), and finding a good α\alpha can be problem-specific.

Method 2: Newton’s Method

Alternatively, we can treat the projection equations as a rootfinding problem G(θ)=0G(\theta) = 0 where Gi(θ)=Pi(θ)G_i(\theta) = P_i(\theta) for test function methods, or solve the first-order conditions for least squares. Newton’s method uses the update:

\theta^{(k+1)} = \theta^{(k)} - J_G(\theta^{(k)})^{-1} G(\theta^{(k)}),

where JG(θ)J_G(\theta) is the Jacobian of GG at θ\theta.

To apply this update, we must compute the Jacobian entries Jij=GiθjJ_{ij} = \frac{\partial G_i}{\partial \theta_j}. For collocation, Gi(θ)=f^(xi;θ)(Tf^(;θ))(xi)G_i(\theta) = \hat{f}(x_i; \theta) - (\Contraction \hat{f}(\cdot; \theta))(x_i), so:

\frac{\partial G_i}{\partial \theta_j} = \frac{\partial \hat{f}(x_i; \theta)}{\partial \theta_j} - \frac{\partial (\Contraction \hat{f}(\cdot; \theta))(x_i)}{\partial \theta_j}.

The first term is straightforward (it’s just φj(xi)\varphi_j(x_i) for a linear approximation). The second term requires differentiating the operator T\Contraction with respect to the parameters.

When T\Contraction involves optimization (as in the Bellman operator Lv=maxa{r(s,a)+γE[v(s)]}\Bellman v = \max_a \{r(s,a) + \gamma \mathbb{E}[v(s')]\}), computing this derivative appears problematic because the max operator is not differentiable. However, the Envelope Theorem resolves this difficulty.

Before reading the box below, try differentiating v(θ)=maxxf(x,θ)v(\theta) = \max_x f(x, \theta) using the chain rule. What term involving x/θ\partial x^*/\partial \theta appears? Why might this term vanish at an optimum?
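The resolution is the standard Envelope Theorem: for a smooth objective with maximizer x^*(\theta),

\frac{d}{d\theta} \max_x f(x, \theta) = \frac{\partial f}{\partial \theta}\big(x^*(\theta), \theta\big),

because the indirect effect through \partial x^*/\partial \theta is multiplied by \partial f/\partial x, which vanishes at an interior optimum. Applied to the Bellman operator, only the direct dependence of the maximand on \theta survives; the dependence through the optimal action drops out.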

With the Envelope Theorem providing a tractable way to compute Jacobians for problems involving optimization, Newton’s method becomes practical for weighted residual methods applied to Bellman equations and similar problems. The method offers quadratic convergence near the solution. Once in the neighborhood of the true fixed point, Newton’s method typically converges in just a few iterations. Unlike function iteration, it doesn’t rely on the finite-dimensional approximation preserving any contraction property, making it applicable to a broader range of problems, particularly those with high-order polynomial bases or large discount factors where function iteration struggles.

However, Newton’s method demands more from both the algorithm and the user. Each iteration requires computing and solving a full Jacobian system, making the per-iteration cost significantly higher than function iteration. The method is also sensitive to initialization: started far from the solution, Newton’s method may diverge or converge to spurious fixed points that the finite-dimensional problem introduces but the original infinite-dimensional problem lacks. When applying the Envelope Theorem, implementation becomes more complex. We must track the optimal action at each evaluation point and compute the Jacobian entries using the formula above (expected basis function values at next states under optimal actions), though the economic interpretation (tracking how value propagates through optimal decisions) often makes the computation conceptually clearer than explicit derivative calculations would be.

Comparison and Practical Recommendations
| Method | Convergence | Per-iteration cost | Initial guess sensitivity |
|---|---|---|---|
| Function iteration | Linear (when contraction holds) | Low | Robust |
| Newton’s method | Quadratic (near solution) | Moderate (Jacobian + solve) | Requires good initial guess |

Which method to use? When the problem has strong contraction (small γ\gamma, well-conditioned bases, shape-preserving approximations like linear interpolation or splines), function iteration is simple and robust. For weak contraction (large γ\gamma, high-order polynomials), a hybrid approach works well: run function iteration for several iterations to enter the basin of attraction, then switch to Newton’s method for rapid final convergence. When the finite-dimensional approximation destroys contraction entirely (common with non-monotone bases), Newton’s method may be necessary from the start, though careful initialization (from a coarser approximation or perturbation methods) is required.

Quasi-Newton methods like BFGS or Broyden offer a middle ground. They approximate the Jacobian using function evaluations only, avoiding explicit derivative computations while maintaining superlinear convergence. This can be useful when computing the exact Jacobian via the Envelope Theorem is expensive or when the approximation quality is acceptable.

Step 5: Verify the Solution

Once we have computed a candidate solution f^\hat{f}, we must verify its quality. Projection methods optimize f^\hat{f} with respect to specific criteria (specific test functions or collocation points), but we should check that the residual is small everywhere, including directions or points we did not optimize over.

Typical diagnostic checks include evaluating the residual on a fine grid of states (or against test functions) that were not used in the fit, and re-solving with a larger basis to confirm that the computed solution changes little.

In summary, we have established a template: parameterize the unknown function using basis functions, define a residual measuring how far from a solution we are, and impose conditions via inner products with test functions. Different test functions yield different methods: Galerkin uses the basis itself, collocation uses delta functions at chosen points, and least squares uses residual gradients. We now apply this framework to the Bellman equation.

Application to the Bellman Equation

Consider the Bellman optimality equation v(s)=Lv(s)=maxaAs{r(s,a)+γjSp(js,a)v(j)}v(s) = \Bellman v(s) = \max_{a \in \mathcal{A}_s} \{ r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a) v(j) \}. For a candidate approximation v^(s)=i=1nθiφi(s)\hat{v}(s) = \sum_{i=1}^n \theta_i \varphi_i(s), the residual is:

R(s; \theta) = \Bellman\hat{v}(s) - \hat{v}(s) = \max_{a \in \mathcal{A}_s} \left\{ r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a) \hat{v}(j) \right\} - \sum_{i=1}^n \theta_i \varphi_i(s).

We examine how collocation and Galerkin, the two most common weighted residual methods for Bellman equations, specialize the general solution approaches from Step 4.

Collocation

For collocation, we choose nn states {s1,,sn}\{s_1, \ldots, s_n\} and require the Bellman equation to hold exactly at these points:

\sum_{j=1}^n \theta_j \varphi_j(s_i) = \max_{a \in \mathcal{A}_{s_i}} \left\{ r(s_i,a) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s_i,a) \sum_{\ell=1}^n \theta_\ell \varphi_\ell(s') \right\}, \quad i = 1, \ldots, n.

It helps to define the parametric Bellman operator Lφ:RnRn\mathrm{L}_\varphi: \mathbb{R}^n \to \mathbb{R}^n by [Lφ(θ)]i=[Lv^(;θ)](si)[\mathrm{L}_\varphi(\theta)]_i = [\Bellman\hat{v}(\cdot; \theta)](s_i), the Bellman operator evaluated at collocation point sis_i. Let Φ\boldsymbol{\Phi} be the n×nn \times n matrix with entries Φij=φj(si)\Phi_{ij} = \varphi_j(s_i). Then the collocation equations become Φθ=Lφ(θ)\boldsymbol{\Phi} \theta = \mathrm{L}_\varphi(\theta).

Function iteration for collocation proceeds as follows. Given current coefficients θ(k)\theta^{(k)}, we evaluate the Bellman operator at each collocation point to get target values ti(k)=[Lφ(θ(k))]it_i^{(k)} = [\mathrm{L}_\varphi(\theta^{(k)})]_i. We then find new coefficients by solving the linear system Φθ(k+1)=t(k)\boldsymbol{\Phi} \theta^{(k+1)} = t^{(k)}. This is parametric value iteration: apply the Bellman operator, fit the result.

When the state space is continuous, we approximate expectations using numerical quadrature (Gauss-Hermite for normal shocks, etc.). The method is simple and robust when the finite-dimensional approximation preserves contraction, but can be slow for large discount factors.

Newton’s method for collocation treats the problem as rootfinding: G(θ)=ΦθLφ(θ)=0G(\theta) = \boldsymbol{\Phi} \theta - \mathrm{L}_\varphi(\theta) = 0. The Jacobian is JG=ΦJLφJ_G = \boldsymbol{\Phi} - J_{\mathrm{L}_\varphi}, where the Envelope Theorem (Step 4) gives us [JLφ]ij=γsp(ssi,ai(θ))φj(s)[J_{\mathrm{L}_\varphi}]_{ij} = \gamma \sum_{s'} p(s'|s_i, a_i^*(\theta)) \varphi_j(s'). Here ai(θ)a_i^*(\theta) is the optimal action at collocation point sis_i given the current coefficients.

This converges rapidly near the solution but requires good initialization and more computation per iteration than function iteration. The method is equivalent to policy iteration: each step evaluates the value of the current greedy policy, then improves it.
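To make the Newton iteration concrete for the optimal stopping example (monomial basis, Chebyshev nodes as before), the Envelope Theorem gives the Jacobian of the parametric Bellman operator in closed form: \gamma/(j+1) in column j at nodes where continuing is optimal, and zero at nodes where stopping is optimal. A minimal sketch:

import numpy as np

gamma, n = 0.9, 5
k = np.arange(1, n + 1)
nodes = np.sort(0.5 + 0.5 * np.cos((2 * k - 1) * np.pi / (2 * n)))   # Chebyshev nodes on [0, 1]
Phi = np.vander(nodes, n, increasing=True)                            # collocation matrix
moments = 1.0 / (np.arange(n) + 1.0)                                  # integral of s^j over [0, 1]

theta = np.zeros(n)
for _ in range(50):
    v_bar = moments @ theta                          # exact integral of the polynomial candidate
    continuing = nodes < gamma * v_bar               # nodes where continuing is optimal
    G = Phi @ theta - np.maximum(nodes, gamma * v_bar)           # G(theta) = Phi theta - L_phi(theta)
    J_L = np.outer(continuing.astype(float), gamma * moments)    # Envelope Theorem Jacobian of L_phi
    step = np.linalg.solve(Phi - J_L, G)
    theta = theta - step
    if np.linalg.norm(step) < 1e-12:
        break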

Why is collocation popular for Bellman equations? Because it avoids integration when testing the residual. We only evaluate the Bellman operator at nn discrete points. In contrast, Galerkin requires integrating the residual against each basis function.

Worked Example: Collocation on the Optimal Stopping Problem

Returning to our motivating example, let us trace through the collocation algorithm with n=4n = 4 polynomial basis functions at Chebyshev nodes:

Source
import numpy as np
from scipy.integrate import quad

gamma = 0.9
n = 4

# Chebyshev nodes on [0, 1]
k = np.arange(1, n + 1)
nodes = 0.5 + 0.5 * np.cos((2*k - 1) * np.pi / (2*n))
nodes = np.sort(nodes)

# Vandermonde matrix
Phi = np.vander(nodes, n, increasing=True)

# Exact solution
v_bar_exact = (1 - np.sqrt(1 - gamma**2)) / gamma**2
s_star_exact = gamma * v_bar_exact
def v_exact(s):
    return np.where(s >= s_star_exact, s, gamma * v_bar_exact)

# Collocation iteration
theta = np.zeros(n)
print("Collocation iteration trace:")
print(f"{'Iter':<6} {'||theta||':<12} {'Max error':<12}")
print("-" * 30)

for iteration in range(15):
    def v_approx(s, th=theta):
        return sum(th[j] * s**j for j in range(n))
    v_bar, _ = quad(v_approx, 0, 1)
    test_points = np.linspace(0, 1, 100)
    max_error = max(abs(v_approx(s) - v_exact(s)) for s in test_points)
    print(f"{iteration:<6} {np.linalg.norm(theta):<12.6f} {max_error:<12.6f}")
    
    targets = np.maximum(nodes, gamma * v_bar)
    theta_new = np.linalg.solve(Phi, targets)
    if np.linalg.norm(theta_new - theta) < 1e-10:
        print(f"\nConverged in {iteration + 1} iterations")
        break
    theta = theta_new
Collocation iteration trace:
Iter   ||theta||    Max error   
------------------------------
0      0.000000     1.000000    
1      1.000000     0.626789    
2      1.613062     0.198576    
3      0.706118     0.087935    
4      1.044850     0.039335    
5      1.292433     0.032162    
6      1.412062     0.033449    
7      1.467070     0.034028    
8      1.492030     0.034289    
9      1.503301     0.034406    
10     1.508381     0.034459    
11     1.510668     0.034483    
12     1.511698     0.034493    
13     1.512161     0.034498    
14     1.512370     0.034500    

Galerkin

For Galerkin, we use the basis functions themselves as test functions. The conditions are:

\int_{\mathcal{S}} [\Bellman\hat{v}(s; \theta) - \hat{v}(s; \theta)] \varphi_i(s) w(s) ds = 0, \quad i = 1, \ldots, n.

where w(s)w(s) is a weight function (often the stationary distribution dπ(s)d^\pi(s) in RL applications, or simply w(s)=1w(s) = 1). Expanding this:

\int_{\mathcal{S}} \left[ \max_a \left\{ r(s,a) + \gamma \mathbb{E}[v(s')] \right\} - \sum_j \theta_j \varphi_j(s) \right] \varphi_i(s) w(s) ds = 0.

Function iteration for Galerkin works differently than for collocation. Given θ(k)\theta^{(k)}, we cannot simply evaluate the Bellman operator and fit. Instead, we must solve an integral equation. At each iteration, we seek θ(k+1)\theta^{(k+1)} satisfying:

\int_{\mathcal{S}} \sum_j \theta_j^{(k+1)} \varphi_j(s) \varphi_i(s) w(s) ds = \int_{\mathcal{S}} [\Bellman\hat{v}(s; \theta^{(k)})] \varphi_i(s) w(s) ds.

The left side is a linear system (the “mass matrix” Mij=φiφjwM_{ij} = \int \varphi_i \varphi_j w), and the right side requires integrating the Bellman operator output against each test function. When the basis functions are orthogonal polynomials with matching weight ww, the mass matrix is diagonal, simplifying the solve. But we still need numerical integration to evaluate the right side. This makes Galerkin substantially more expensive than collocation per iteration.
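A minimal sketch of this iteration for the optimal stopping example from earlier in the chapter (monomial basis on [0, 1] with w(s) = 1, so the mass matrix has entries 1/(i + j + 1); the right-hand side is computed by quadrature):

import numpy as np
from scipy.integrate import quad

gamma, n = 0.9, 5
i_idx, j_idx = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
M = 1.0 / (i_idx + j_idx + 1)                   # mass matrix: integral of s^i * s^j over [0, 1]

theta = np.zeros(n)
for _ in range(200):
    v_bar, _ = quad(lambda s: sum(theta[k] * s**k for k in range(n)), 0, 1)
    # Right-hand side: integral of [L v_hat](s) * s^i = max(s, gamma * v_bar) * s^i over [0, 1]
    rhs = np.array([quad(lambda s, i=i: max(s, gamma * v_bar) * s**i, 0, 1)[0] for i in range(n)])
    theta_new = np.linalg.solve(M, rhs)
    if np.linalg.norm(theta_new - theta) < 1e-10:
        break
    theta = theta_new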

Newton’s method for Galerkin similarly requires integration. The residual is R(s;θ)=Lv^(s;θ)v^(s;θ)R(s; \theta) = \Bellman\hat{v}(s; \theta) - \hat{v}(s; \theta), and we need Gi(θ)=R(s;θ)φi(s)w(s)ds=0G_i(\theta) = \int R(s; \theta) \varphi_i(s) w(s) ds = 0. The Jacobian entry is:

J_{ij} = \int \left[ \frac{\partial \Bellman\hat{v}(s; \theta)}{\partial \theta_j} - \varphi_j(s) \right] \varphi_i(s) w(s) ds.

The Envelope Theorem gives Lv^(s;θ)θj=γE[φj(s)s,a(s;θ)]\frac{\partial \Bellman\hat{v}(s; \theta)}{\partial \theta_j} = \gamma \mathbb{E}[\varphi_j(s') \mid s, a^*(s;\theta)], so we must integrate expected basis function values (under optimal actions) against test functions and weight. This requires both numerical integration and careful tracking of optimal actions across the state space, making it substantially more complex than collocation’s pointwise evaluation.

The advantage of Galerkin over collocation lies in its theoretical properties: when using orthogonal polynomials, Galerkin provides optimal approximation in the weighted L2L^2 norm. For smooth problems, this can yield better accuracy per degree of freedom than collocation. In practice, collocation’s computational simplicity usually outweighs Galerkin’s theoretical optimality for Bellman equations, especially in high-dimensional problems where integration becomes prohibitively expensive.

The algorithms above reduce the infinite-dimensional Bellman fixed-point problem to finite-dimensional coefficient computation. Collocation avoids integration entirely by requiring exact satisfaction at discrete points, while Galerkin imposes weighted orthogonality conditions requiring numerical quadrature. Both can be solved via function iteration (when contraction is preserved) or Newton’s method (for faster convergence near the solution). The discrete MDP specialization below reveals connections to algorithms widely used in reinforcement learning.

Galerkin for Discrete MDPs: LSTD and LSPI

When the state space is discrete and finite, the Galerkin conditions simplify dramatically. The integrals become sums, and we can write everything in matrix form. This specialization shows the connection to algorithms widely used in reinforcement learning.

For a discrete state space S={s1,,sm}\mathcal{S} = \{s_1, \ldots, s_m\}, the Galerkin orthogonality conditions

\int_{\mathcal{S}} [\Bellman\hat{v}(s; \theta) - \hat{v}(s; \theta)] \varphi_i(s) w(s) ds = 0

become weighted sums over states:

\sum_{s \in \mathcal{S}} \xi(s) [\Bellman\hat{v}(s; \theta) - \hat{v}(s; \theta)] \varphi_i(s) = 0, \quad i = 1, \ldots, n,

where ξ(s)0\xi(s) \geq 0 with sξ(s)=1\sum_s \xi(s) = 1 is a probability distribution over states. Define the feature matrix ΦRm×n\boldsymbol{\Phi} \in \mathbb{R}^{m \times n} with entries Φsi=φi(s)\Phi_{si} = \varphi_i(s) (each row contains the features for one state), and let Ξ=diag(ξ)\boldsymbol{\Xi} = \text{diag}(\xi) be the diagonal matrix with the state distribution on the diagonal.

Policy Evaluation: LSTD

For policy evaluation with a fixed policy π\pi, the Bellman operator is linear:

[\BellmanPi \hat{v}](s) = r(s, \pi(s)) + \gamma \sum_{j \in \mathcal{S}} p(j|s, \pi(s)) \hat{v}(j).

With linear function approximation v^(s)=φ(s)θ=iθiφi(s)\hat{v}(s) = \boldsymbol{\varphi}(s)^\top \theta = \sum_i \theta_i \varphi_i(s), this becomes:

[\BellmanPi \hat{v}](s) = r(s, \pi(s)) + \gamma \sum_{j \in \mathcal{S}} p(j|s, \pi(s)) \sum_i \theta_i \varphi_i(j).

Let rπRm\mathbf{r}_\pi \in \mathbb{R}^m be the vector of rewards [rπ]s=r(s,π(s))[\mathbf{r}_\pi]_s = r(s, \pi(s)), and PπRm×m\mathbf{P}_\pi \in \mathbb{R}^{m \times m} be the transition matrix with [Pπ]sj=p(js,π(s))[\mathbf{P}_\pi]_{sj} = p(j|s, \pi(s)). Then Lπv^=rπ+γPπΦθ\BellmanPi \hat{v} = \mathbf{r}_\pi + \gamma \mathbf{P}_\pi \boldsymbol{\Phi} \theta in vector form.

The Galerkin conditions require Lπv^v^,φiξ=0\langle \BellmanPi \hat{v} - \hat{v}, \varphi_i \rangle_\xi = 0 for all basis functions, which in matrix form is:

\boldsymbol{\Phi}^\top \boldsymbol{\Xi} (\mathbf{r}_\pi + \gamma \mathbf{P}_\pi \boldsymbol{\Phi} \theta - \boldsymbol{\Phi} \theta) = \mathbf{0}.

Rearranging:

\boldsymbol{\Phi}^\top \boldsymbol{\Xi} (\boldsymbol{\Phi} - \gamma \mathbf{P}_\pi \boldsymbol{\Phi}) \theta = \boldsymbol{\Phi}^\top \boldsymbol{\Xi} \mathbf{r}_\pi.

This is the LSTD (Least Squares Temporal Difference) solution. The matrix A=ΦΞ(ΦγPπΦ)\mathbf{A} = \boldsymbol{\Phi}^\top \boldsymbol{\Xi} (\boldsymbol{\Phi} - \gamma \mathbf{P}_\pi \boldsymbol{\Phi}) and vector b=ΦΞrπ\mathbf{b} = \boldsymbol{\Phi}^\top \boldsymbol{\Xi} \mathbf{r}_\pi give the linear system Aθ=b\mathbf{A} \theta = \mathbf{b}.

When ξ\xi is the stationary distribution of policy π\pi (so ξPπ=ξ\xi^\top \mathbf{P}_\pi = \xi^\top), this system has a unique solution, and the projected Bellman operator PLπ\Proj \BellmanPi is a contraction in the weighted L2L^2 norm ξ\|\cdot\|_\xi. This is the theoretical foundation for TD learning with linear function approximation. The fixed point computed here is the same one that TD(0) converges to stochastically; we derive the incremental algorithm in the Monte Carlo chapter.

Check that the dimensions work out: if we have mm states and nn basis functions, what are the dimensions of Φ\boldsymbol{\Phi}, Ξ\boldsymbol{\Xi}, Pπ\mathbf{P}_\pi, and the matrix A=ΦΞ(ΦγPπΦ)\mathbf{A} = \boldsymbol{\Phi}^\top \boldsymbol{\Xi}(\boldsymbol{\Phi} - \gamma \mathbf{P}_\pi \boldsymbol{\Phi})?

Worked Example: LSTD for Policy Evaluation

To illustrate LSTD concretely, consider a 3-state Markov chain under a fixed policy:

\mathbf{P}_\pi = \begin{pmatrix} 0.7 & 0.2 & 0.1 \\ 0.3 & 0.4 & 0.3 \\ 0.1 & 0.3 & 0.6 \end{pmatrix}, \quad \mathbf{r}_\pi = \begin{pmatrix} 1 \\ 2 \\ 0 \end{pmatrix}.
Source
import numpy as np

P_pi = np.array([[0.7, 0.2, 0.1], [0.3, 0.4, 0.3], [0.1, 0.3, 0.6]])
r_pi = np.array([1.0, 2.0, 0.0])
gamma = 0.9

# Feature matrix: phi_1(s) = 1, phi_2(s) = s
states = np.array([1, 2, 3])
Phi = np.column_stack([np.ones(3), states])

# Uniform weighting
xi = np.ones(3) / 3
Xi = np.diag(xi)

# LSTD matrices
A = Phi.T @ Xi @ (Phi - gamma * P_pi @ Phi)
b = Phi.T @ Xi @ r_pi

theta_lstd = np.linalg.solve(A, b)
v_lstd = Phi @ theta_lstd
v_exact = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)

print(f"LSTD solution: theta = ({theta_lstd[0]:.4f}, {theta_lstd[1]:.4f})")
print(f"\n{'State':<8} {'Exact':<12} {'LSTD':<12} {'Error':<12}")
print("-" * 44)
for s in range(3):
    print(f"{s+1:<8} {v_exact[s]:<12.4f} {v_lstd[s]:<12.4f} {v_lstd[s] - v_exact[s]:<12.4f}")

# Verify orthogonality
residual = r_pi + gamma * P_pi @ v_lstd - v_lstd
print(f"\nGalerkin orthogonality: <residual, phi_1> = {np.sum(xi * residual * Phi[:,0]):.6f}")
LSTD solution: theta = (12.2772, -0.9901)

State    Exact        LSTD         Error       
--------------------------------------------
1        10.0207      11.2871      1.2664      
2        10.8717      10.2970      -0.5746     
3        8.3418       9.3069       0.9652      

Galerkin orthogonality: <residual, phi_1> = 0.000000

The Bellman Optimality Equation: Function Iteration and Newton’s Method

For the Bellman optimality equation, the max operator introduces nonlinearity:

[\Bellman\hat{v}](s) = \max_{a \in \mathcal{A}_s} \left\{ r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a) \hat{v}(j) \right\}.

The Galerkin conditions become:

G(\theta) \equiv \boldsymbol{\Phi}^\top \boldsymbol{\Xi} (\Bellman\hat{v}(\cdot; \theta) - \boldsymbol{\Phi} \theta) = \mathbf{0},

where the Bellman operator must be evaluated at each state ss to find the optimal action and compute the target value. This is a system of nn nonlinear equations in nn unknowns.

Function iteration applies the Bellman operator and projects back. Given θ(k)\theta^{(k)}, compute the greedy policy π(k)(s)=argmaxa{r(s,a)+γjSp(js,a)φ(j)θ(k)}\pi^{(k)}(s) = \arg\max_a \{r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a) \boldsymbol{\varphi}(j)^\top \theta^{(k)}\} at each state, then solve:

\boldsymbol{\Phi}^\top \boldsymbol{\Xi} (\boldsymbol{\Phi} - \gamma \mathbf{P}_{\pi^{(k)}} \boldsymbol{\Phi}) \theta^{(k+1)} = \boldsymbol{\Phi}^\top \boldsymbol{\Xi} \mathbf{r}_{\pi^{(k)}}.

This evaluates the current greedy policy using LSTD, then implicitly improves by computing a new greedy policy at the next iteration. However, convergence can be slow when the finite-dimensional approximation poorly preserves contraction.

Newton’s method treats G(θ)=0G(\theta) = 0 as a rootfinding problem and uses the Jacobian to accelerate convergence. The Jacobian of GG is:

J_G(\theta) = \frac{\partial G}{\partial \theta} = \boldsymbol{\Phi}^\top \boldsymbol{\Xi} \left( \frac{\partial \Bellman\hat{v}(\cdot; \theta)}{\partial \theta} - \boldsymbol{\Phi} \right).

To compute Lv^(s;θ)θj\frac{\partial \Bellman\hat{v}(s; \theta)}{\partial \theta_j}, we use the Envelope Theorem from Step 4. At the current θ(k)\theta^{(k)}, let a(s;θ(k))a^*(s; \theta^{(k)}) be the optimal action at state ss. Then:

\frac{\partial [\Bellman\hat{v}](s; \theta^{(k)})}{\partial \theta_j} = \gamma \sum_{s' \in \mathcal{S}} p(s'|s, a^*(s; \theta^{(k)})) \varphi_j(s').

Define the policy π(k)(s)=a(s;θ(k))\pi^{(k)}(s) = a^*(s; \theta^{(k)}). The Jacobian becomes:

J_G(\theta^{(k)}) = \boldsymbol{\Phi}^\top \boldsymbol{\Xi} (\gamma \mathbf{P}_{\pi^{(k)}} \boldsymbol{\Phi} - \boldsymbol{\Phi}) = -\boldsymbol{\Phi}^\top \boldsymbol{\Xi} (\boldsymbol{\Phi} - \gamma \mathbf{P}_{\pi^{(k)}} \boldsymbol{\Phi}).

The Newton update θ(k+1)=θ(k)JG(θ(k))1G(θ(k))\theta^{(k+1)} = \theta^{(k)} - J_G(\theta^{(k)})^{-1} G(\theta^{(k)}) simplifies. We have:

G(\theta^{(k)}) = \boldsymbol{\Phi}^\top \boldsymbol{\Xi} (\Bellman\hat{v}(\cdot; \theta^{(k)}) - \boldsymbol{\Phi} \theta^{(k)}).

At each state ss, the greedy value is [Lv^(;θ(k))](s)=r(s,π(k)(s))+γjp(js,π(k)(s))φ(j)θ(k)[\Bellman\hat{v}(\cdot; \theta^{(k)})](s) = r(s, \pi^{(k)}(s)) + \gamma \sum_j p(j|s, \pi^{(k)}(s)) \boldsymbol{\varphi}(j)^\top \theta^{(k)}, which equals [Lπ(k)v^(;θ(k))](s)[\mathrm{L}_{\pi^{(k)}} \hat{v}(\cdot; \theta^{(k)})](s). Thus:

G(\theta^{(k)}) = \boldsymbol{\Phi}^\top \boldsymbol{\Xi} (\mathbf{r}_{\pi^{(k)}} + \gamma \mathbf{P}_{\pi^{(k)}} \boldsymbol{\Phi} \theta^{(k)} - \boldsymbol{\Phi} \theta^{(k)}).

The Newton step becomes:

\theta^{(k+1)} = \theta^{(k)} + [\boldsymbol{\Phi}^\top \boldsymbol{\Xi} (\boldsymbol{\Phi} - \gamma \mathbf{P}_{\pi^{(k)}} \boldsymbol{\Phi})]^{-1} \boldsymbol{\Phi}^\top \boldsymbol{\Xi} (\mathbf{r}_{\pi^{(k)}} + \gamma \mathbf{P}_{\pi^{(k)}} \boldsymbol{\Phi} \theta^{(k)} - \boldsymbol{\Phi} \theta^{(k)}).

Multiplying through and simplifying:

\boldsymbol{\Phi}^\top \boldsymbol{\Xi} (\boldsymbol{\Phi} - \gamma \mathbf{P}_{\pi^{(k)}} \boldsymbol{\Phi}) \theta^{(k+1)} = \boldsymbol{\Phi}^\top \boldsymbol{\Xi} \mathbf{r}_{\pi^{(k)}}.

This is LSPI (Least Squares Policy Iteration). Each Newton step:

  1. Computes the greedy policy π(k)(s)=argmaxa{r(s,a)+γjp(js,a)φ(j)θ(k)}\pi^{(k)}(s) = \arg\max_a \{r(s,a) + \gamma \sum_j p(j|s,a) \boldsymbol{\varphi}(j)^\top \theta^{(k)}\}

  2. Solves the LSTD equation for this policy to get θ(k+1)\theta^{(k+1)}

Newton’s method for the Galerkin-projected Bellman optimality equation is equivalent to policy iteration in the function approximation setting. Just as Newton’s method for collocation corresponded to policy iteration (Step 4), Newton’s method for discrete Galerkin gives LSPI.

Galerkin projection with linear function approximation reduces policy iteration to a sequence of linear systems, each solvable in closed form. For discrete MDPs, we can compute the matrices ΦΞΦ\boldsymbol{\Phi}^\top \boldsymbol{\Xi} \boldsymbol{\Phi} and ΦΞPπΦ\boldsymbol{\Phi}^\top \boldsymbol{\Xi} \mathbf{P}_\pi \boldsymbol{\Phi} exactly.
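As an illustration of these two steps, the sketch below runs LSPI on a hypothetical two-action extension of the three-state chain from the LSTD example (the second action's transition probabilities and rewards are made up for illustration; the features [1, s] and uniform weighting are reused):

import numpy as np

gamma = 0.9
states = np.array([1.0, 2.0, 3.0])
Phi = np.column_stack([np.ones(3), states])      # feature matrix (m x n)
Xi = np.diag(np.ones(3) / 3)                     # uniform state weighting

# P[a] is the transition matrix under action a; r[a] the reward vector (hypothetical numbers)
P = np.array([[[0.7, 0.2, 0.1], [0.3, 0.4, 0.3], [0.1, 0.3, 0.6]],
              [[0.5, 0.4, 0.1], [0.2, 0.5, 0.3], [0.3, 0.3, 0.4]]])
r = np.array([[1.0, 2.0, 0.0],
              [0.5, 2.5, 0.2]])

theta = np.zeros(2)
for _ in range(50):
    # Step 1: greedy policy under the current linear value estimate
    q = r + gamma * P @ (Phi @ theta)            # shape (actions, states)
    pi = np.argmax(q, axis=0)
    P_pi, r_pi = P[pi, np.arange(3)], r[pi, np.arange(3)]
    # Step 2: LSTD solve for the greedy policy (the Newton / LSPI step)
    A = Phi.T @ Xi @ (Phi - gamma * P_pi @ Phi)
    b = Phi.T @ Xi @ r_pi
    theta_new = np.linalg.solve(A, b)
    if np.allclose(theta_new, theta, atol=1e-10):
        break
    theta = theta_new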

Extension to Nonlinear Approximators

The weighted residual methods developed so far have focused on linear function classes: polynomial bases, piecewise linear interpolants, and linear combinations of fixed basis functions. Neural networks, kernel methods, and decision trees do not fit this template. How does the framework extend to nonlinear approximators?

Recall the Galerkin approach for linear approximation vθ=i=1dθiφiv_{\boldsymbol{\theta}} = \sum_{i=1}^d \theta_i \varphi_i. The orthogonality conditions vLv,φiw=0\langle v - \Bellman v, \varphi_i \rangle_w = 0 for all ii define a linear system with a closed-form solution. These equations arise from minimizing vLvw2\|v - \Bellman v\|_w^2 over the subspace, since at the minimum, the gradient with respect to each coefficient must vanish. The connection between norm minimization and orthogonality holds generally. For any norm w\|\cdot\|_w induced by an inner product ,w\langle \cdot, \cdot \rangle_w, minimizing f(θ)w2\|f(\boldsymbol{\theta})\|_w^2 with respect to parameters requires θif(θ)w2=0\frac{\partial}{\partial \theta_i} \|f(\boldsymbol{\theta})\|_w^2 = 0. Since fw2=f,fw\|f\|_w^2 = \langle f, f \rangle_w, the chain rule gives 2f,fθiw=02\langle f, \frac{\partial f}{\partial \theta_i} \rangle_w = 0. Minimizing the residual norm is thus equivalent to requiring orthogonality f,fθiw=0\langle f, \frac{\partial f}{\partial \theta_i} \rangle_w = 0 for all ii. The equivalence holds for any choice of inner product: weighted L2L^2 integrals for Galerkin, sums over collocation points for collocation, or sampled expectations for neural networks.

For nonlinear function classes parameterized by θRp\boldsymbol{\theta} \in \mathbb{R}^p (neural networks, kernel expansions), the same minimization principle applies:

\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} \|v_{\boldsymbol{\theta}} - \Bellman v_{\boldsymbol{\theta}}\|_w^2.

The first-order stationarity condition yields orthogonality:

\Big\langle v_{\boldsymbol{\theta}} - \Bellman v_{\boldsymbol{\theta}}, \frac{\partial v_{\boldsymbol{\theta}}}{\partial \theta_i} \Big\rangle_w = 0 \quad \text{for all } i.

The test functions are now the partial derivatives vθθi\frac{\partial v_{\boldsymbol{\theta}}}{\partial \theta_i}, which span the tangent space to the manifold {vθ:θRp}\{v_{\boldsymbol{\theta}} : \boldsymbol{\theta} \in \mathbb{R}^p\} at the current parameters. In the linear case vθ=iθiφiv_{\boldsymbol{\theta}} = \sum_i \theta_i \varphi_i, the partial derivative vθθi=φi\frac{\partial v_{\boldsymbol{\theta}}}{\partial \theta_i} = \varphi_i recovers the fixed basis functions of Galerkin. For nonlinear parameterizations, the test functions change with θ\boldsymbol{\theta}, and the orthogonality conditions define a nonlinear system solved by iterative gradient descent.

The dual pairing formulation Legrand & Junca (2025) extends this framework to settings where test objects need not be regular functions. We have been informal about this distinction in our treatment of collocation, but the Dirac deltas δ(xxi)\delta(x - x_i) used there are not classical functions. They are distributions, defined rigorously only through their action on test functions via N(v),δ(xxi)=(Nv)(xi)\langle \Residual(v), \delta(x - x_i) \rangle = (\Residual v)(x_i). The simple calculus argument for orthogonality does not apply directly to such objects; the dual pairing framework provides the proper mathematical foundation. The induced dual norm N(v)=supw=1N(v),w\|\Residual(v)\|_* = \sup_{\|w\|=1} |\langle \Residual(v), w \rangle| measures residuals by their worst-case effect on test functions, a perspective that has inspired adversarial formulations Zang et al. (2020) where both trial and test functions are learned.

The minimum residual framework thus connects classical projection methods to modern function approximation. The unifying principle is orthogonality of residuals to test functions. Linear methods use fixed test functions and admit closed-form solutions. Nonlinear methods use parameter-dependent test functions and require iterative optimization.

We now turn to the question of convergence: when does the iteration vk+1=PLvkv_{k+1} = \Proj \Bellman v_k converge?

Monotone Projection and the Preservation of Contraction

The informal discussion of shape preservation hints at a deeper theoretical question: when does the function iteration method converge? Recall from our discussion of collocation that function iteration proceeds in two steps:

  1. Apply the Bellman operator at collocation points: t^{(k)} = \mathrm{L}_\varphi(\theta^{(k)}), where t_i^{(k)} = \Bellman\hat{v}^{(k)}(s_i)

  2. Fit new coefficients to match these targets: \boldsymbol{\Phi} \theta^{(k+1)} = t^{(k)}, giving \theta^{(k+1)} = \boldsymbol{\Phi}^{-1} \mathrm{L}_\varphi(\theta^{(k)})

We can reinterpret this iteration in function space rather than coefficient space. Let P\Proj be the projection operator that takes any function ff and returns its approximation in span{φ1,,φn}\text{span}\{\varphi_1, \ldots, \varphi_n\}. For collocation, P\Proj is the interpolation operator: (Pf)(s)(\Proj f)(s) is the unique linear combination of basis functions that matches ff at the collocation points. Then Step 2 can be written as: fit v^(k+1)\hat{v}^{(k+1)} so that v^(k+1)(si)=Lv^(k)(si)\hat{v}^{(k+1)}(s_i) = \Bellman\hat{v}^{(k)}(s_i) for all collocation points, which means v^(k+1)=P(Lv^(k))\hat{v}^{(k+1)} = \Proj(\Bellman\hat{v}^{(k)}).

In other words, function iteration is equivalent to projected value iteration in function space:

v^(k+1)=PLv^(k).\hat{v}^{(k+1)} = \Proj \Bellman \hat{v}^{(k)}.

We know that standard value iteration vk+1=Lvkv_{k+1} = \Bellman v_k converges because L\Bellman is a γ\gamma-contraction in the sup norm. But now we’re iterating with the composed operator PL\Proj \Bellman instead of L\Bellman alone.
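
To make the two-step structure concrete, here is a minimal sketch of this iteration for the optimal stopping example, assuming a cubic monomial basis and evenly spaced collocation points (illustrative choices, not prescribed by anything above). For this problem, applying the Bellman operator only requires the integral of the current approximation, which is available in closed form from the coefficients.

import numpy as np

gamma = 0.9
degree = 3
s_colloc = np.linspace(0.0, 1.0, degree + 1)              # as many points as coefficients (assumed)
Phi = np.vander(s_colloc, degree + 1, increasing=True)    # Phi[i, j] = s_i ** j

theta = np.zeros(degree + 1)
for k in range(500):
    # Step 1: apply the Bellman operator at the collocation points.
    # The integral of sum_j theta_j s^j over [0, 1] is sum_j theta_j / (j + 1).
    v_bar = np.sum(theta / np.arange(1, degree + 2))
    targets = np.maximum(s_colloc, gamma * v_bar)
    # Step 2: fit new coefficients so the interpolant matches the targets exactly.
    theta_new = np.linalg.solve(Phi, targets)
    if np.max(np.abs(theta_new - theta)) < 1e-10:
        theta = theta_new
        break
    theta = theta_new

print(f"Stopped after {k + 1} iterations")
print(f"Implied threshold gamma * v_bar = {gamma * np.sum(theta / np.arange(1, degree + 2)):.4f}")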

This PL\Proj \Bellman structure is not specific to collocation. It is inherent in all projection methods. The general pattern is always the same: apply the Bellman operator to get a target function Lv^(k)\Bellman\hat{v}^{(k)}, then project it back onto our approximation space to get v^(k+1)\hat{v}^{(k+1)}. The projection step defines an operator P\Proj that depends on our choice of test functions: interpolation at the collocation points for collocation, orthogonal projection with respect to the weighted inner product for Galerkin, and weighted least-squares fitting for least squares.

But regardless of which projection method we use, iteration takes the form v^(k+1)=PLv^(k)\hat{v}^{(k+1)} = \Proj \Bellman\hat{v}^{(k)}.

The central question is whether the composition PL\Proj \Bellman inherits the contraction property of L\Bellman. If not, the iteration may diverge, oscillate, or converge to a spurious fixed point even though the original problem is well-posed.

Monotone Approximators and Stability

The answer turns out to depend on specific properties of the approximation operator P\Proj. This theory was developed independently across multiple research communities: computational economics Judd (1992)Judd (1996)McGrattan (1997)Santos & Vigo-Aguiar (1998), economic dynamics Stachurski (2009), and reinforcement learning Gordon (1995)Gordon (1999). These communities arrived at essentially the same mathematical conditions.

Monotonicity Implies Nonexpansiveness

It turns out that approximation operators satisfying two simple structural properties automatically preserve contraction. Specifically, suppose \Proj is monotone (v \leq w pointwise implies \Proj v \leq \Proj w) and preserves constants (\Proj(v + c\mathbf{1}) = \Proj v + c\mathbf{1} for every constant c). Then \Proj is nonexpansive in the sup norm:

\|\Proj v - \Proj w\|_\infty \leq \|v - w\|_\infty \quad \text{for all } v, w.

This result shows that monotonicity and constant preservation automatically imply nonexpansiveness. There is no need to verify this separately. The intuition is that a monotone, constant-preserving operator acts like a weighted average that respects order structure and cannot amplify differences between functions.

Preservation of Contraction

Combining nonexpansiveness with the contraction property of the Bellman operator yields the main stability result: if \Proj is nonexpansive in the sup norm, the composition \Proj\Bellman is a \gamma-contraction in the sup norm, so projected value iteration converges to a unique fixed point \hat{v}. Comparing \hat{v} with the true fixed point v^* of \Bellman gives the error bound

\|\hat{v} - v^*\|_\infty \leq \frac{1}{1-\gamma}\, \|\Proj v^* - v^*\|_\infty.

This error bound tells us that the fixed-point error is controlled by how well P\Proj can represent vv^*. If vRange(P)v^* \in \text{Range}(\Proj), then Pv=v\Proj v^* = v^* and the error vanishes. Otherwise, the error is proportional to the approximation error Pvv\|\Proj v^* - v^*\|_\infty, amplified by the factor (1γ)1(1-\gamma)^{-1}.

Averagers in Discrete-State Problems

For discrete-state problems, the monotonicity conditions have a natural interpretation as averaging with nonnegative weights: an approximation operator is an averager if it can be written as \Proj v = \mathbf{A} v for a matrix \mathbf{A} with nonnegative entries whose rows sum to one, that is, a stochastic matrix. This characterization was developed by Gordon in the context of reinforcement learning.

Averagers automatically satisfy the monotonicity conditions: linearity follows from matrix multiplication, monotonicity follows from nonnegativity of entries, and constant preservation follows from row sums equaling one.

This specializes the Santos-Vigo-Aguiar theorem to discrete states, expressed in the probabilistic language of stochastic matrices. The stochastic matrix characterization connects to Markov chain theory: Pv\Proj v represents expected values after one transition, and the monotonicity property reflects the fact that expectations preserve order.

Examples of averagers include state aggregation (averaging values within groups), K-nearest neighbors (averaging over nearest states), kernel smoothing with positive kernels, and multilinear interpolation on grids (barycentric weights are nonnegative and sum to one). Counterexamples include linear least squares regression (projection matrix may have negative entries) and high-order polynomial interpolation (Runge phenomenon produces negative weights).
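
As a quick numerical check of this classification, the sketch below builds two small averaging matrices, one for state aggregation and one for piecewise linear interpolation from a coarse set of knots, and verifies that the entries are nonnegative and the rows sum to one. The grid, grouping, and knot locations are made up for illustration.

import numpy as np

n = 7
states = np.linspace(0.0, 1.0, n)

# (a) Hypothetical state aggregation: states {0, 1, 2} form one group, {3, 4, 5, 6} the other.
groups = [np.array([0, 1, 2]), np.array([3, 4, 5, 6])]
A_agg = np.zeros((n, n))
for g in groups:
    A_agg[np.ix_(g, g)] = 1.0 / len(g)   # each state receives its group's average value

# (b) Piecewise linear interpolation from the coarse knot states {0, 3, 6}.
knots = np.array([0, 3, 6])
A_lin = np.zeros((n, n))
for i, s in enumerate(states):
    k = np.searchsorted(states[knots], s, side="right") - 1
    k = np.clip(k, 0, len(knots) - 2)
    left, right = knots[k], knots[k + 1]
    w = (s - states[left]) / (states[right] - states[left])
    A_lin[i, left], A_lin[i, right] = 1.0 - w, w   # barycentric weights

for name, A in [("state aggregation", A_agg), ("piecewise linear", A_lin)]:
    print(f"{name}: min entry = {A.min():.3f}, row sums = {np.unique(A.sum(axis=1).round(12))}")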

The following table summarizes which common approximation operators satisfy the monotonicity conditions:

| Method | Monotone? | Notes |
| --- | --- | --- |
| Piecewise linear interpolation | Yes | Always an averager; guaranteed stability |
| Multilinear interpolation (grid) | Yes | Barycentric weights are nonnegative and sum to one |
| Shape-preserving splines (Schumaker) | Yes | Designed to maintain monotonicity |
| State aggregation | Yes | Exact averaging within groups |
| Kernel smoothing (positive kernels) | Yes | If kernel integrates to one |
| High-order polynomial interpolation | No | Oscillations violate monotonicity (Runge phenomenon) |
| Least squares projection (arbitrary basis) | No | Projection matrix may have negative entries |
| Fourier/spectral methods | No | Not monotone-preserving in general |
| Neural networks | No | Highly flexible but no monotonicity guarantees |
The distinction between “safe” (monotone) and “potentially unstable” (non-monotone) approximators provides a rigorous foundation for the folk wisdom that linear interpolation is reliable while high-order polynomials can be dangerous for value iteration. But notice that the table’s verdict on “least squares projection” is somewhat abstract. It doesn’t specifically address the three weighted residual methods we introduced at the start of this chapter.

The choice of solution method determines which approximation operators are safe to use. Successive approximation (fixed-point iteration) requires monotone approximators to guarantee convergence. Rootfinding methods like Newton’s method do not require monotonicity; their stability depends on numerical properties of the Jacobian rather than on contraction preservation. These considerations suggest hybrid strategies: for example, run a few iterations with a monotone method to generate an initial guess, then switch to Newton’s method with a smooth approximation for rapid final convergence, as sketched below.
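
The sketch below illustrates one such hybrid on the optimal stopping example; the grid sizes, basis degree, and number of warm-start sweeps are arbitrary choices for illustration. A few sweeps of piecewise linear value iteration produce an initial guess, after which a standard rootfinder solves the cubic collocation residual system directly.

import numpy as np
from scipy.optimize import root

gamma = 0.9

# Stage 1: a few sweeps of projected value iteration with piecewise linear
# interpolation on a fine grid (a monotone approximator).
s_fine = np.linspace(0.0, 1.0, 101)
v = np.zeros_like(s_fine)
for _ in range(10):
    v_bar = np.trapezoid(v, s_fine)           # E[v] under Uniform[0, 1]
    v = np.maximum(s_fine, gamma * v_bar)     # Bellman targets; the interpolant matches them at the grid

# Stage 2: Newton-type rootfinding on the collocation residual system for a cubic monomial basis.
degree = 3
s_colloc = np.linspace(0.0, 1.0, degree + 1)
Phi = np.vander(s_colloc, degree + 1, increasing=True)

def residual(theta):
    v_bar = np.sum(theta / np.arange(1, degree + 2))   # exact integral of the monomial expansion
    return Phi @ theta - np.maximum(s_colloc, gamma * v_bar)

# Warm start: project the piecewise linear iterate onto the cubic basis by least squares.
theta0, *_ = np.linalg.lstsq(np.vander(s_fine, degree + 1, increasing=True), v, rcond=None)
sol = root(residual, theta0)
print(f"Rootfinder success: {sol.success}, residual norm: {np.linalg.norm(sol.fun):.2e}")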

Connecting Back to Collocation, Galerkin, and Least Squares

We have now developed a general stability theory for projected value iteration and surveyed which approximation operators are monotone. But what does this mean for the three specific weighted residual methods we introduced at the start of this chapter: collocation, Galerkin, and least squares? Each method defines a different projection operator P\Proj, and we now need to determine which satisfy the monotonicity conditions that guarantee convergence.

Collocation with piecewise linear interpolation is monotone. When we use collocation with piecewise linear basis functions on a grid, the projection operator performs linear interpolation between grid points. At any state ss between grid points sis_i and si+1s_{i+1}, the interpolated value is:

(Pv)(s)=si+1ssi+1siv(si)+ssisi+1siv(si+1).(\Proj v)(s) = \frac{s_{i+1} - s}{s_{i+1} - s_i} v(s_i) + \frac{s - s_i}{s_{i+1} - s_i} v(s_{i+1}).

The interpolation weights (barycentric coordinates) are nonnegative and sum to one, making this an averager in Gordon’s sense. Therefore collocation with piecewise linear bases satisfies the monotonicity conditions and the Santos-Vigo-Aguiar stability theorem applies. The folk wisdom that “linear interpolation is safe for value iteration” has a rigorous theoretical foundation.
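
The sketch below runs this projected iteration for the optimal stopping example, with the knot grid and quadrature grid chosen arbitrarily for illustration; because the interpolation operator is an averager, the monotone-approximator theory guarantees the iteration converges.

import numpy as np

gamma = 0.9

# Coarse knots carry the approximation; a finer grid is used only for the integral.
knots = np.linspace(0.0, 1.0, 11)          # interpolation knots = collocation points (assumed)
s_quad = np.linspace(0.0, 1.0, 401)        # quadrature grid for the continuation value

v_knots = np.zeros_like(knots)
for k in range(1000):
    v_quad = np.interp(s_quad, knots, v_knots)         # piecewise linear interpolant
    v_bar = np.trapezoid(v_quad, s_quad)               # E[v] under Uniform[0, 1]
    new_v_knots = np.maximum(knots, gamma * v_bar)     # Bellman operator at the knots
    if np.max(np.abs(new_v_knots - v_knots)) < 1e-10:
        v_knots = new_v_knots
        break
    v_knots = new_v_knots

v_bar_exact = (1 - np.sqrt(1 - gamma**2)) / gamma**2
v_bar_approx = np.trapezoid(np.interp(s_quad, knots, v_knots), s_quad)
print(f"Iterations used: {k + 1}")
print(f"Approximate continuation value: {v_bar_approx:.4f} (exact: {v_bar_exact:.4f})")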

Galerkin projection is generally not monotone. The Galerkin projection operator for a general basis {φ1,,φn}\{\varphi_1, \ldots, \varphi_n\} has the form:

P=Φ(ΦWΦ)1ΦW,\Proj = \boldsymbol{\Phi}(\boldsymbol{\Phi}^\top \mathbf{W} \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^\top \mathbf{W},

where W\mathbf{W} is a diagonal weight matrix and Φ\boldsymbol{\Phi} contains the basis function evaluations. This projection matrix typically has negative entries. To see why, consider a simple example with polynomial basis functions {1,x,x2}\{1, x, x^2\} on [1,1][-1, 1]. The projection of a function onto this space involves computing (ΦWΦ)1(\boldsymbol{\Phi}^\top \mathbf{W} \boldsymbol{\Phi})^{-1}, and the resulting operator can map nonnegative functions to functions with negative values. This is the same mechanism that underlies the Runge phenomenon in high-order polynomial interpolation: the projection weights oscillate in sign.
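
A small numerical check of this claim, assuming a quadratic monomial basis on an evenly spaced grid with uniform weights (an illustrative setup, not one used elsewhere in the chapter):

import numpy as np

s = np.linspace(-1.0, 1.0, 21)
Phi = np.vander(s, 3, increasing=True)             # basis {1, x, x^2} evaluated on the grid
W = np.eye(len(s)) / len(s)                        # uniform weight matrix

P = Phi @ np.linalg.solve(Phi.T @ W @ Phi, Phi.T @ W)   # projection matrix

print(f"Row sums (1, since constants lie in the basis): {np.unique(P.sum(axis=1).round(10))}")
print(f"Smallest entry of the projection matrix: {P.min():.4f}")
print(f"Monotone (all entries nonnegative)? {bool((P >= -1e-12).all())}")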

Since Galerkin projection is not monotone, the sup norm contraction theory does not guarantee convergence of projected value iteration vk+1=PLvkv_{k+1} = \Proj \Bellman v_k with Galerkin.

Least squares methods share the non-monotonicity issue. The least squares projection operator minimizes N(f^)w2\|\Residual(\hat{f})\|_w^2 and has the same mathematical form as Galerkin projection. It is a linear projection onto span{φ1,,φn}\text{span}\{\varphi_1, \ldots, \varphi_n\} with respect to a weighted inner product. Like Galerkin, the projection matrix typically contains negative entries and violates monotonicity.

The monotone approximator framework successfully covers collocation with simple bases, but leaves two important methods, Galerkin and least squares, without convergence guarantees. These methods are used in least-squares temporal difference learning (LSTD) and modern reinforcement learning with linear function approximation. We need a different analytical framework to understand when these non-monotone projections lead to convergent algorithms.

Monotone projections (piecewise linear interpolation, state aggregation) automatically preserve the Bellman operator’s contraction property, guaranteeing convergence of projected value iteration. Non-monotone projections (Galerkin, high-order polynomials) may destroy contraction in the sup norm, requiring either different solution methods (Newton) or analysis in different norms. The next section develops the latter approach for policy evaluation.

Beyond Monotone Approximators

The monotone approximator theory gives us a clean sufficient condition for convergence: if P\Proj is monotone (and constant-preserving), then P\Proj is non-expansive in the sup norm \|\cdot\|_\infty. Since L\Bellman is a γ\gamma-contraction in the sup norm, their composition PL\Proj \Bellman is also a γ\gamma-contraction in the sup norm, guaranteeing convergence of projected value iteration.

But what if P\Proj is not monotone? Can we still guarantee convergence? Galerkin and least squares projections typically violate monotonicity, yet they are widely used in practice, particularly in reinforcement learning through least-squares temporal difference learning (LSTD). In general, proving convergence for non-monotone projections is difficult. However, for the special case of policy evaluation, computing the value function vπv_\pi of a fixed policy π\pi, we can establish convergence by working in a different norm.

The Policy Evaluation Problem and LSTD

Consider the policy evaluation problem: given policy π\pi, we want to solve the policy Bellman equation vπ=rπ+γPπvπv_\pi = r_\pi + \gamma \mathbf{P}_\pi v_\pi, where rπr_\pi and Pπ\mathbf{P}_\pi are the reward vector and transition matrix under π\pi. This is the core computational task in policy iteration, actor-critic algorithms, and temporal difference learning. In reinforcement learning, we typically learn from sampled experience: trajectories (s0,a0,r1,s1,a1,r2,s2,)(s_0, a_0, r_1, s_1, a_1, r_2, s_2, \ldots) generated by following π\pi. If the Markov chain induced by π\pi is ergodic, the state distribution converges to a stationary distribution ξ\xi satisfying ξPπ=ξ\xi^\top \mathbf{P}_\pi = \xi^\top.
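
For concreteness, the sketch below sets up a small made-up chain (the transition matrix and rewards are illustrative assumptions) and computes its exact value function and stationary distribution; the same construction is reused in the LSTD sketches that follow.

import numpy as np

gamma = 0.9

# A small illustrative Markov reward process (the policy has already been applied).
P_pi = np.array([[0.7, 0.3, 0.0],
                 [0.1, 0.6, 0.3],
                 [0.2, 0.2, 0.6]])
r_pi = np.array([1.0, 0.0, 2.0])

# Exact policy evaluation: v_pi = (I - gamma * P_pi)^{-1} r_pi.
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)

# Stationary distribution: left eigenvector of P_pi for eigenvalue 1, normalized to sum to one.
eigvals, eigvecs = np.linalg.eig(P_pi.T)
xi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
xi = xi / xi.sum()

print(f"v_pi = {np.round(v_pi, 4)}")
print(f"xi = {np.round(xi, 4)}, check xi^T P_pi - xi^T = {np.round(xi @ P_pi - xi, 10)}")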

This distribution determines which states appear frequently in our data. States visited often contribute more samples and have more influence on any learned approximation. States visited rarely contribute little. For a linear approximation vθ(s)=jθjφj(s)v_\theta(s) = \sum_j \theta_j \varphi_j(s), the least-squares temporal difference (LSTD) algorithm computes coefficients by solving:

ΦΞ(ΦγPπΦ)θ=ΦΞrπ,\boldsymbol{\Phi}^\top \boldsymbol{\Xi} (\boldsymbol{\Phi} - \gamma \mathbf{P}_\pi \boldsymbol{\Phi}) \boldsymbol{\theta} = \boldsymbol{\Phi}^\top \boldsymbol{\Xi} \mathbf{r}_\pi,

where Φ\boldsymbol{\Phi} is the matrix of basis function evaluations and Ξ=diag(ξ)\boldsymbol{\Xi} = \text{diag}(\xi). We write this matrix equation for analysis purposes, but the actual algorithm does not compute it this way. For large state spaces, we cannot enumerate all states to form Φ\boldsymbol{\Phi} or explicitly represent the transition matrix Pπ\mathbf{P}_\pi. Instead, the practical algorithm accumulates sums from sampled transitions (s,r,s)(s, r, s'), incrementally building the matrices ΦΞΦ\boldsymbol{\Phi}^\top \boldsymbol{\Xi} \boldsymbol{\Phi} and ΦΞPπΦ\boldsymbol{\Phi}^\top \boldsymbol{\Xi} \mathbf{P}_\pi \boldsymbol{\Phi} without ever forming the full objects. The algorithm is derived from first principles through temporal difference learning, and the Galerkin perspective provides an interpretation of what it computes.
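
A minimal sketch of this sample-based accumulation, reusing the small made-up chain from above and adding two illustrative basis functions; the trajectory length is arbitrary:

import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

P_pi = np.array([[0.7, 0.3, 0.0],
                 [0.1, 0.6, 0.3],
                 [0.2, 0.2, 0.6]])
r_pi = np.array([1.0, 0.0, 2.0])
Phi = np.array([[1.0, 0.0],
                [1.0, 0.5],
                [1.0, 1.0]])        # two basis functions: a constant and a "position" feature

# Accumulate the LSTD matrices from sampled transitions (s, r, s') along one long trajectory.
A = np.zeros((2, 2))
b = np.zeros(2)
s = 0
for _ in range(50_000):
    s_next = rng.choice(3, p=P_pi[s])
    phi, phi_next = Phi[s], Phi[s_next]
    A += np.outer(phi, phi - gamma * phi_next)
    b += phi * r_pi[s]
    s = s_next

theta = np.linalg.solve(A, b)
print(f"LSTD coefficients: {np.round(theta, 4)}")
print(f"Approximate values Phi @ theta = {np.round(Phi @ theta, 4)}")
print(f"Exact values v_pi = {np.round(np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi), 4)}")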

LSTD as Projected Bellman Equation

To see what this equation means, let v^=Φθ\hat{v} = \boldsymbol{\Phi} \boldsymbol{\theta} be the solution. Expanding the parentheses:

ΦΞΦθγΦΞPπΦθ=ΦΞrπ.\boldsymbol{\Phi}^\top \boldsymbol{\Xi} \boldsymbol{\Phi} \boldsymbol{\theta} - \gamma \boldsymbol{\Phi}^\top \boldsymbol{\Xi} \mathbf{P}_\pi \boldsymbol{\Phi} \boldsymbol{\theta} = \boldsymbol{\Phi}^\top \boldsymbol{\Xi} \mathbf{r}_\pi.

Moving all terms to the left side and factoring out ΦΞ\boldsymbol{\Phi}^\top \boldsymbol{\Xi}:

ΦΞ(ΦθγPπΦθrπ)=0.\boldsymbol{\Phi}^\top \boldsymbol{\Xi} (\boldsymbol{\Phi} \boldsymbol{\theta} - \gamma \mathbf{P}_\pi \boldsymbol{\Phi} \boldsymbol{\theta} - \mathbf{r}_\pi) = \mathbf{0}.

Since v^=Φθ\hat{v} = \boldsymbol{\Phi} \boldsymbol{\theta} and the policy Bellman operator is Lπv^=rπ+γPπv^\BellmanPi \hat{v} = \mathbf{r}_\pi + \gamma \mathbf{P}_\pi \hat{v}, we can write:

ΦΞ(v^Lπv^)=0.\boldsymbol{\Phi}^\top \boldsymbol{\Xi} (\hat{v} - \BellmanPi \hat{v}) = \mathbf{0}.

Let φj\boldsymbol{\varphi}_j denote the jj-th column of Φ\boldsymbol{\Phi}, which contains the evaluations of the jj-th basis function at all states. The equation above says that for each jj:

φjΞ(v^Lπv^)=0.\boldsymbol{\varphi}_j^\top \boldsymbol{\Xi} (\hat{v} - \BellmanPi \hat{v}) = 0.

But φjΞ(v^Lπv^)\boldsymbol{\varphi}_j^\top \boldsymbol{\Xi} (\hat{v} - \BellmanPi \hat{v}) is exactly the ξ\xi-weighted inner product φj,v^Lπv^ξ\langle \boldsymbol{\varphi}_j, \hat{v} - \BellmanPi \hat{v} \rangle_\xi. So the residual v^Lπv^\hat{v} - \BellmanPi \hat{v} is orthogonal to every basis function, and therefore orthogonal to the entire subspace span(Φ)\text{span}(\boldsymbol{\Phi}).

By definition, the orthogonal projection Py\Proj y of a vector yy onto a subspace is the unique vector in that subspace such that yPyy - \Proj y is orthogonal to the subspace. Here, v^\hat{v} lies in span(Φ)\text{span}(\boldsymbol{\Phi}) (since v^=Φθ\hat{v} = \boldsymbol{\Phi} \boldsymbol{\theta}), and we have just shown that Lπv^v^\BellmanPi \hat{v} - \hat{v} is orthogonal to span(Φ)\text{span}(\boldsymbol{\Phi}). Therefore, v^=PLπv^\hat{v} = \Proj \BellmanPi \hat{v}, where P\Proj is orthogonal projection onto span(Φ)\text{span}(\boldsymbol{\Phi}) with respect to the ξ\xi-weighted inner product:

u,vξ=uΞv,vξ=vΞv,P=Φ(ΦΞΦ)1ΦΞ.\langle u, v \rangle_\xi = u^\top \boldsymbol{\Xi} v, \qquad \|v\|_\xi = \sqrt{v^\top \boldsymbol{\Xi} v}, \qquad \Proj = \boldsymbol{\Phi}(\boldsymbol{\Phi}^\top \boldsymbol{\Xi} \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^\top \boldsymbol{\Xi}.

The weighting by ξ\xi is not arbitrary. Temporal difference learning performs stochastic updates using individual transitions: θk+1=θk+αk(r+γvθk(s)vθk(s))vθk(s)\theta_{k+1} = \theta_k + \alpha_k (r + \gamma v_{\theta_k}(s') - v_{\theta_k}(s)) \nabla v_{\theta_k}(s), with states sampled from ξ\xi. The ODE analysis of this stochastic process (Borkar-Meyn theory) shows that it converges to the fixed point of the ξ\xi-weighted projected Bellman operator, which admits a closed-form expression. LSTD computes this fixed point directly.

Orthogonal Projection is Non-Expansive

Suppose ξ\xi is the steady-state distribution: ξPπ=ξ\xi^\top \mathbf{P}_\pi = \xi^\top. Our goal is to establish that PLπ\Proj \BellmanPi is a contraction in ξ\|\cdot\|_\xi. If we can establish that P\Proj is non-expansive in this norm and that Lπ\BellmanPi is a γ\gamma-contraction in ξ\|\cdot\|_\xi, then their composition will be a γ\gamma-contraction:

PLπvPLπwξLπvLπwξγvwξ.\|\Proj \BellmanPi v - \Proj \BellmanPi w\|_\xi \leq \|\BellmanPi v - \BellmanPi w\|_\xi \leq \gamma \|v - w\|_\xi.

First, we establish that orthogonal projection is non-expansive. For any vector vv, we can decompose v=Pv+(vPv)v = \Proj v + (v - \Proj v), where (vPv)(v - \Proj v) is orthogonal to the subspace span(Φ)\text{span}(\boldsymbol{\Phi}). By the Pythagorean theorem in the ξ\|\cdot\|_\xi inner product:

vξ2=Pvξ2+vPvξ2.\|v\|_\xi^2 = \|\Proj v\|_\xi^2 + \|v - \Proj v\|_\xi^2.

Since vPvξ20\|v - \Proj v\|_\xi^2 \geq 0, we have:

vξ2Pvξ2.\|v\|_\xi^2 \geq \|\Proj v\|_\xi^2.

Taking square roots of both sides (which preserves the inequality since both norms are non-negative):

Pvξvξ.\|\Proj v\|_\xi \leq \|v\|_\xi.

This holds for all vv, so P\Proj is non-expansive in ξ\|\cdot\|_\xi.

Contraction of Lπ\BellmanPi in ξ\|\cdot\|_\xi

To show Lπ=rπ+γPπ\BellmanPi = r_\pi + \gamma \mathbf{P}_\pi is a γ\gamma-contraction, we start from the identity:

LπvLπwξ=γPπ(vw)ξ=γPπ(vw)ξ.\|\BellmanPi v - \BellmanPi w\|_\xi = \|\gamma \mathbf{P}_\pi (v - w)\|_\xi = \gamma \|\mathbf{P}_\pi (v - w)\|_\xi.

This will be at most γvwξ\gamma \|v - w\|_\xi if Pπ\mathbf{P}_\pi is non-expansive, meaning Pπzξzξ\|\mathbf{P}_\pi z\|_\xi \leq \|z\|_\xi for any vector zz. We therefore need to establish that Pπ\mathbf{P}_\pi is non-expansive in ξ\|\cdot\|_\xi.

Before reading the proof below, try to show that Pπ\mathbf{P}_\pi is non-expansive in ξ\|\cdot\|_\xi. Hint: what property of ξ\xi relates it to Pπ\mathbf{P}_\pi?

Consider the squared norm of Pπz\mathbf{P}_\pi z. By definition of the weighted norm:

Pπzξ2=sξ(s)[(Pπz)(s)]2.\|\mathbf{P}_\pi z\|_\xi^2 = \sum_s \xi(s) [(\mathbf{P}_\pi z)(s)]^2.

The ss-th component of Pπz\mathbf{P}_\pi z is (Pπz)(s)=sp(ss,π(s))z(s)(\mathbf{P}_\pi z)(s) = \sum_{s'} p(s'|s,\pi(s)) z(s'). This is a weighted average of the values z(s)z(s') with weights p(ss,π(s))p(s'|s,\pi(s)) that sum to one. Therefore:

Pπzξ2=sξ(s)[sp(ss,π(s))z(s)]2.\|\mathbf{P}_\pi z\|_\xi^2 = \sum_s \xi(s) \left[\sum_{s'} p(s'|s,\pi(s)) z(s')\right]^2.

Since the function xx2x \mapsto x^2 is convex, Jensen’s inequality applied to the probability distribution p(s,π(s))p(\cdot|s,\pi(s)) gives:

[sp(ss,π(s))z(s)]2sp(ss,π(s))z(s)2.\left[\sum_{s'} p(s'|s,\pi(s)) z(s')\right]^2 \leq \sum_{s'} p(s'|s,\pi(s)) z(s')^2.

Substituting this into the norm expression:

Pπzξ2sξ(s)sp(ss,π(s))z(s)2=sz(s)2sξ(s)p(ss,π(s)).\|\mathbf{P}_\pi z\|_\xi^2 \leq \sum_s \xi(s) \sum_{s'} p(s'|s,\pi(s)) z(s')^2 = \sum_{s'} z(s')^2 \sum_s \xi(s) p(s'|s,\pi(s)).

The stationarity condition ξPπ=ξ\xi^\top \mathbf{P}_\pi = \xi^\top means sξ(s)p(ss,π(s))=ξ(s)\sum_s \xi(s) p(s'|s,\pi(s)) = \xi(s') for all ss'. Therefore:

Pπzξ2sz(s)2ξ(s)=zξ2.\|\mathbf{P}_\pi z\|_\xi^2 \leq \sum_{s'} z(s')^2 \xi(s') = \|z\|_\xi^2.

Taking square roots, Pπzξzξ\|\mathbf{P}_\pi z\|_\xi \leq \|z\|_\xi, so Pπ\mathbf{P}_\pi is non-expansive in ξ\|\cdot\|_\xi. This makes Lπ=rπ+γPπ\BellmanPi = r_\pi + \gamma \mathbf{P}_\pi a γ\gamma-contraction in ξ\|\cdot\|_\xi. Composing with the non-expansive projection:

PLπvPLπwξLπvLπwξγvwξ.\|\Proj \BellmanPi v - \Proj \BellmanPi w\|_\xi \leq \|\BellmanPi v - \BellmanPi w\|_\xi \leq \gamma \|v - w\|_\xi.

By Banach’s fixed-point theorem, PLπ\Proj \BellmanPi has a unique fixed point and iterates converge from any initialization.
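
The sketch below checks this numerically on the same made-up chain and features used earlier: projected value iteration from an arbitrary starting point is compared against the closed-form LSTD solution.

import numpy as np

gamma = 0.9

P_pi = np.array([[0.7, 0.3, 0.0],
                 [0.1, 0.6, 0.3],
                 [0.2, 0.2, 0.6]])
r_pi = np.array([1.0, 0.0, 2.0])
Phi = np.array([[1.0, 0.0],
                [1.0, 0.5],
                [1.0, 1.0]])

# Stationary distribution and the xi-weighted projection matrix.
eigvals, eigvecs = np.linalg.eig(P_pi.T)
xi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
xi = xi / xi.sum()
Xi = np.diag(xi)
Proj = Phi @ np.linalg.solve(Phi.T @ Xi @ Phi, Phi.T @ Xi)

# Projected value iteration v <- Proj(r + gamma * P v), from an arbitrary starting point.
v = np.array([10.0, -5.0, 3.0])
for _ in range(200):
    v = Proj @ (r_pi + gamma * P_pi @ v)

# Closed-form LSTD solution for comparison.
theta = np.linalg.solve(Phi.T @ Xi @ (Phi - gamma * P_pi @ Phi), Phi.T @ Xi @ r_pi)
print(f"Iterated fixed point:  {np.round(v, 6)}")
print(f"Closed-form Phi@theta: {np.round(Phi @ theta, 6)}")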

Interpretation: The On-Policy Condition

The result shows that convergence depends on matching the weighting to the operator. We cannot choose an arbitrary weighted L2L^2 norm and expect PLπ\Proj \BellmanPi to be a contraction. Instead, the weighting ξ\xi must have a specific relationship with the transition matrix Pπ\mathbf{P}_\pi in the operator Lπ\BellmanPi: namely, ξ\xi must be the stationary distribution of Pπ\mathbf{P}_\pi. This is what makes the weighted geometry compatible with the operator’s structure. When this match holds, Jensen’s inequality gives us non-expansiveness of Pπ\mathbf{P}_\pi in the ξ\|\cdot\|_\xi norm, and the composition PLπ\Proj \BellmanPi inherits the contraction property.

In reinforcement learning, this has a practical interpretation. When we learn by following policy π\pi and collecting transitions (s,a,r,s)(s, a, r, s'), the states we visit are distributed according to the stationary distribution of π\pi. This is on-policy learning. The LSTD algorithm uses data sampled from this distribution, which means the empirical weighting naturally matches the operator structure. Our analysis shows that the iterative algorithm vk+1=PLπvkv_{k+1} = \Proj \BellmanPi v_k converges to the same fixed point that LSTD computes in closed form.

This is fundamentally different from the monotone approximator theory. There, we required structural properties of P\Proj itself (monotonicity, constant preservation) to guarantee that P\Proj preserves the sup-norm contraction property of L\Bellman. Here, we place no such restriction on P\Proj. Galerkin projection is not monotone. Instead, convergence depends on matching the norm to the operator. When ξ\xi does not match the stationary distribution, as in off-policy learning where data comes from a different behavior policy, the Jensen inequality argument breaks down. The operator Pπ\mathbf{P}_\pi need not be non-expansive in ξ\|\cdot\|_\xi, and PLπ\Proj \BellmanPi may fail to contract. This explains divergence phenomena such as Baird’s counterexample Baird (1995).
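
A standard two-state example (a minimal cousin of Baird’s counterexample, assumed here rather than taken from the text above) makes the role of the weighting explicit. Both states transition to the second state, a single feature takes the values 1 and 2, and the sketch computes the factor by which the parameter is scaled in one sweep of projected policy evaluation under the stationary weighting and under a mismatched uniform weighting.

import numpy as np

gamma = 0.9

P_pi = np.array([[0.0, 1.0],
                 [0.0, 1.0]])      # both states jump to state 2
Phi = np.array([[1.0],
                [2.0]])            # single feature: phi(state 1) = 1, phi(state 2) = 2

def update_multiplier(xi):
    # With zero rewards, one sweep of v <- Proj_xi(gamma * P_pi v) scales theta by this factor.
    Xi = np.diag(xi)
    return np.linalg.solve(Phi.T @ Xi @ Phi, Phi.T @ Xi @ (gamma * P_pi @ Phi)).item()

xi_stationary = np.array([0.0, 1.0])   # stationary distribution of P_pi
xi_mismatched = np.array([0.5, 0.5])   # e.g., states sampled uniformly by a behavior policy

print(f"Multiplier under the stationary weighting: {update_multiplier(xi_stationary):.3f}")
print(f"Multiplier under the mismatched weighting: {update_multiplier(xi_mismatched):.3f}")

In this example the first multiplier equals γ = 0.9, so the sweeps contract, while the second is 1.08, so repeated sweeps drive the parameter off to infinity even though the underlying policy evaluation problem is perfectly well-posed.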

The Bellman Optimality Case

Can we extend this weighted L2L^2 analysis to the Bellman optimality operator Lv=maxa[ra+γPav]\Bellman v = \max_a [r_a + \gamma \mathbf{P}_a v]? The answer is no, at least not with this approach. The obstacle appears at the Jensen inequality step. For policy evaluation, we had:

Pπzξ2=sξ(s)[sp(ss,π(s))z(s)]2.\|\mathbf{P}_\pi z\|_\xi^2 = \sum_s \xi(s) \left[\sum_{s'} p(s'|s,\pi(s)) z(s')\right]^2.

The inner term is a convex combination of the values z(s)z(s'), which allowed us to apply Jensen’s inequality to the convex function xx2x \mapsto x^2. For the optimal Bellman operator, we would need to bound:

[maxasp(ss,a)z(s)]2.\left[\max_{a} \sum_{s'} p(s'|s,a) z(s')\right]^2.

But the maximum of convex combinations is not itself a convex combination. It is a pointwise maximum. Jensen’s inequality does not apply. We cannot conclude that maxa[Paz]\max_a [\mathbf{P}_a z] is non-expansive in any weighted L2L^2 norm.

Is convergence of PL\Proj \Bellman with Galerkin projection impossible, or merely difficult to prove? The situation is subtle. In practice, fitted Q-iteration and approximate value iteration with neural networks often work well, suggesting that some form of stability exists. But there are also well-documented divergence examples (e.g., Q-learning with linear function approximation can diverge). The theoretical picture remains incomplete. Some results exist for restricted function classes or under strong assumptions on the MDP structure, but no general convergence guarantee like the policy evaluation result is available. The interplay between the max operator, the projection, and the norm geometry is not well understood. This is an active area of research in reinforcement learning theory.

Despite these theoretical gaps, the practical algorithm template is straightforward. We now present fitted-value iteration as a meta-algorithm that combines any supervised learning method with the Bellman operator.

Fitted-Value/Q Iteration (FVI/FQI)

We have developed weighted residual methods through abstract functional equations: choose test functions, impose orthogonality conditions R,piw=0\langle R, p_i \rangle_w = 0, solve for coefficients. What are we actually computing when we solve these equations by successive approximation? The answer is simpler than the formalism suggests: function iteration with a fitting step.

Recall that the weighted residual conditions vLv,piw=0\langle v - \Bellman v, p_i \rangle_w = 0 define a fixed-point problem v=PLvv = \Proj \Bellman v, where P\Proj is a projection operator onto span(Φ)\text{span}(\boldsymbol{\Phi}). We can solve this by iteration: vk+1=PLvkv_{k+1} = \Proj \Bellman v_k. Under appropriate conditions (monotonicity of P\Proj, or matching the weight to the operator for policy evaluation), this converges to a solution.

In parameter space, this iteration becomes a fitting procedure. Consider Galerkin projection with a finite state space of nn states. Let Φ\boldsymbol{\Phi} be the n×dn \times d matrix of basis evaluations, W\mathbf{W} the diagonal weight matrix, and y\mathbf{y} the vector of Bellman operator evaluations: yi=(Lvk)(si)y_i = (\Bellman v_k)(s_i). The projection is:

θk+1=(ΦWΦ)1ΦWy.\boldsymbol{\theta}_{k+1} = (\boldsymbol{\Phi}^\top \mathbf{W} \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^\top \mathbf{W} \mathbf{y}.

This is weighted least-squares regression: fit Φθ\boldsymbol{\Phi} \boldsymbol{\theta} to targets y\mathbf{y}. For collocation, we require exact interpolation Φθk+1=y\boldsymbol{\Phi} \boldsymbol{\theta}_{k+1} = \mathbf{y} at chosen collocation points. For continuous state spaces, we approximate the Galerkin integrals using sampled states, reducing to the same finite-dimensional fitting problem. The abstraction remains consistent: function iteration in the abstract becomes “generate targets, fit to targets, repeat” in the implementation.

This extends beyond linear basis functions. Neural networks, decision trees, and kernel methods all implement variants of this procedure. Given data {(si,yi)}\{(s_i, y_i)\} where yi=(Lvk)(si)y_i = (\Bellman v_k)(s_i), each method produces a function vk+1:SRv_{k+1}: \mathcal{S} \to \mathbb{R} by fitting to the targets. The projection operator P\Proj is simply one instantiation of a fitting procedure. Galerkin and collocation correspond to specific choices of approximation class and loss function.

The abstraction fit\mathtt{fit} encapsulates all the complexity of function approximation, whether that involves solving a linear system, running gradient descent, or training an ensemble. Any regression model with a fit(X, y) interface works: LinearRegression, RandomForestRegressor, GradientBoostingRegressor, MLPRegressor, or custom neural networks. The projection operator P\Proj is one instantiation: when F\mathcal{F} is a linear subspace and we minimize weighted squared error, we recover Galerkin or collocation. The broader view is that FVI reduces dynamic programming to repeated calls to a supervised learning subroutine.

The following code demonstrates fitted-value iteration on the optimal stopping problem:

Source
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

gamma = 0.9
v_bar_exact = (1 - np.sqrt(1 - gamma**2)) / gamma**2
s_star_exact = gamma * v_bar_exact
def v_exact(s):
    return np.where(s >= s_star_exact, s, gamma * v_bar_exact)

def fitted_value_iteration(s_grid, gamma, degree, max_iter=50, tol=1e-6):
    X = s_grid.reshape(-1, 1)
    v = np.zeros(len(s_grid))
    
    for k in range(max_iter):
        # Use trapezoidal rule for E[v] under uniform distribution on [0,1]
        v_bar = np.trapezoid(v, s_grid)
        targets = np.maximum(s_grid, gamma * v_bar)
        
        model = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=1e-6))
        model.fit(X, targets)
        v_new = model.predict(X)
        
        if np.linalg.norm(v_new - v) < tol:
            return v_new, k + 1
        v = v_new
    return v, max_iter

s_grid = np.linspace(0, 1, 50)
print(f"{'Degree':<10} {'Iterations':<12} {'Max Error':<12}")
print("-" * 34)
for deg in [3, 5, 8]:
    v, iters = fitted_value_iteration(s_grid, gamma, deg)
    max_error = np.max(np.abs(v - v_exact(s_grid)))
    print(f"{deg:<10} {iters:<12} {max_error:<12.6f}")
Degree     Iterations   Max Error   
----------------------------------
3          26           0.035112    
5          25           0.022469    
8          25           0.020715    

A limitation of FVI/FQI is that it assumes we can evaluate the Bellman operator exactly. Computing yi=(Lvk)(si)y_i = (\Bellman v_k)(s_i) requires knowing transition probabilities and summing over all next states. In practice, we often have only a simulator or observed data. The next chapter shows how to approximate these expectations from samples, connecting the fitted-value iteration framework to simulation-based methods.

Summary

This chapter developed weighted residual methods for solving functional equations like the Bellman equation. We approximate the value function using a finite basis, then impose conditions that make the residual orthogonal to chosen test functions. Different choices of test functions yield different methods: Galerkin tests against the basis itself, collocation tests at specific points, and least squares minimizes the residual norm. All reduce to the same computational pattern: generate Bellman targets, fit a function approximator, repeat.

Convergence depends on how the projection interacts with the Bellman operator. The contraction property of the Bellman operator reappears here: whether projected iteration converges depends on whether the projection preserves this contraction or whether we work in a norm compatible with the operator. For monotone projections (piecewise linear interpolation, state aggregation), the composition PL\Proj \Bellman inherits the contraction property in the sup norm and iteration converges. For non-monotone projections like Galerkin, convergence requires matching the weighting to the stationary distribution, which holds in on-policy settings. The Bellman optimality case remains theoretically incomplete.

Throughout this chapter, we assumed access to the transition model: computing Lv(s)\Bellman v(s) requires summing over all next states weighted by transition probabilities. In practice, we often have only a simulator or observed trajectories, not an explicit model. The next chapter addresses this gap. Monte Carlo methods estimate expectations from sampled transitions, replacing exact Bellman operator evaluations with sample averages. This connects the projection framework developed here to the simulation-based algorithms used in reinforcement learning.

References
  1. Chakraverty, S., Mahato, N. R., Karunakar, P., & Rao, T. D. (2019). Weighted Residual Methods. In Advanced Numerical and Semi-Analytical Methods for Differential Equations (pp. 25–44). John Wiley & Sons, Inc. 10.1002/9781119423461.ch3
  2. Atkinson, K. E., & Potra, F. A. (1987). Projection and Iterated Projection Methods for Nonlinear Integral Equations. SIAM Journal on Numerical Analysis, 24(6), 1352–1373. 10.1137/0724087
  3. Legrand, M., & Junca, S. (2025). Weighted Residual Solution Methods.
  4. Zang, Y., Bao, G., Ye, X., & Zhou, H. (2020). Weak adversarial networks for high-dimensional partial differential equations. Journal of Computational Physics, 411, 109409. 10.1016/j.jcp.2020.109409
  5. Judd, K. L. (1992). Projection methods for solving aggregate growth models. Journal of Economic Theory, 58(2), 410–452.
  6. Judd, K. L. (1996). Approximation, perturbation, and projection methods in economic analysis. In H. M. Amman, D. A. Kendrick, & J. Rust (Eds.), Handbook of Computational Economics (Vol. 1, pp. 509–585). Elsevier.
  7. McGrattan, E. R. (1997). Application of Weighted Residual Methods to Dynamic Economic Models.
  8. Santos, M. S., & Vigo-Aguiar, J. (1998). Analysis of a numerical dynamic programming algorithm applied to economic models. Econometrica, 66(2), 409–426.
  9. Stachurski, J. (2009). Economic Dynamics: Theory and Computation. MIT Press.
  10. Gordon, G. J. (1995). Stable function approximation in dynamic programming. Proceedings of the Twelfth International Conference on International Conference on Machine Learning, 261–268.
  11. Gordon, G. J. (1999). Approximate Solutions to Markov Decision Problems [PhD thesis]. Carnegie Mellon University.
  12. Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. Proceedings of the Twelfth International Conference on Machine Learning, 30–37.