
Weighted Residual Methods for Functional Equations

The Bellman optimality equation $\Bellman v = v$ is a functional equation: an equation where the unknown is an entire function rather than a finite-dimensional vector. When the state space is continuous or very large, we cannot represent the value function exactly on a computer. We must instead work with finite-dimensional approximations. This motivates weighted residual methods (also called minimum residual methods), a general framework for transforming infinite-dimensional problems into tractable finite-dimensional ones (Chakraverty et al., 2019; Atkinson & Potra, 1987).

What Does It Mean for a Residual to Be Zero?

Consider a functional equation N(f)=0\Residual(f) = 0, where N\Residual is an operator and the unknown ff is an entire function (in our case, the Bellman optimality equation Lv=v\Bellman v = v, which we can write as N(v)Lvv=0\Residual(v) \equiv \Bellman v - v = 0). Suppose we have found a candidate approximate solution f^\hat{f}. To verify it satisfies N(f^)=0\Residual(\hat{f}) = 0, we compute the residual function R(s)=N(f^)(s)R(s) = \Residual(\hat{f})(s). For a true solution, this residual should be the zero function: R(s)=0R(s) = 0 for every state ss.

How might we test whether a function is zero? One approach: sample many input points {s1,s2,,sm}\{s_1, s_2, \ldots, s_m\}, check whether R(si)=0R(s_i) = 0 at each, and summarize the results into a single scalar test by computing a weighted sum i=1mwiR(si)\sum_{i=1}^m w_i R(s_i) with weights wi>0w_i > 0. If RR is zero everywhere, this sum is zero. If RR is nonzero somewhere, we can choose points and weights to make the sum nonzero. For vectors in finite dimensions, the inner product r,y=i=1nriyi\langle \mathbf{r}, \mathbf{y} \rangle = \sum_{i=1}^n r_i y_i implements exactly this idea: it tests r\mathbf{r} by weighting and summing. Indeed, a vector rRn\mathbf{r} \in \mathbb{R}^n equals zero if and only if r,y=0\langle \mathbf{r}, \mathbf{y} \rangle = 0 for every vector yRn\mathbf{y} \in \mathbb{R}^n. To see why, suppose r0\mathbf{r} \neq \mathbf{0}. Choosing y=r\mathbf{y} = \mathbf{r} gives r,r=r2>0\langle \mathbf{r}, \mathbf{r} \rangle = \|\mathbf{r}\|^2 > 0, contradicting the claim that all inner products vanish.

The same principle extends to functions. A function RR equals the zero function if and only if its “inner product” with every “test function” pp vanishes:

$$R = 0 \quad \text{if and only if} \quad \langle R, p \rangle_w = \int_{\mathcal{S}} R(s) p(s) w(s) ds = 0 \quad \text{for all test functions } p,$$

where w(s)>0w(s) > 0 is a weight function that is part of the inner product definition. Why does this work? For the same reason as in finite dimensions: if RR is not the zero function, there must be some region where R(s)0R(s) \neq 0. We can then choose a test function pp that is nonzero in that same region (for instance, p(s)=R(s)p(s) = R(s) itself), which will produce R,pw=R(s)p(s)w(s)ds>0\langle R, p \rangle_w = \int R(s) p(s) w(s) ds > 0, witnessing that RR is nonzero. Conversely, if RR is the zero function, then R,pw=0\langle R, p \rangle_w = 0 for any test function pp.

This ability to distinguish between different functions using inner products is a fundamental principle from functional analysis. Just as we can test a vector by taking inner products with other vectors, we can test a function by taking inner products with other functions.

Why is this useful? It transforms the pointwise condition “R(s)=0R(s) = 0 for all ss” (infinitely many conditions, one per state) into an equivalent condition about inner products. We still cannot test against all possible test functions, since there are infinitely many of those too. But the inner product perspective suggests a natural computational strategy: choose a finite collection of test functions {p1,,pn}\{p_1, \ldots, p_n\} and use them to construct nn conditions that we can actually compute.

This suggests the weighted residual (or minimum residual) approach: choose nn test functions {p1,,pn}\{p_1, \ldots, p_n\} and a weight function w(s)w(s), then require the residual to have zero weighted inner product with each test function:

$$\langle R, p_i \rangle_w = \int_{\mathcal{S}} R(s; \theta) p_i(s) w(s) ds = 0, \quad i = 1, \ldots, n,$$

where the residual is R(s;θ)=N(f^(;θ))(s)R(s; \theta) = \Residual(\hat{f}(\cdot; \theta))(s) and the approximation is f^(s;θ)=j=1nθjφj(s)\hat{f}(s; \theta) = \sum_{j=1}^n \theta_j \varphi_j(s). For the Bellman equation, N(v)=Lvv\Residual(v) = \Bellman v - v, so R(s;θ)=Lv^(s;θ)v^(s;θ)R(s; \theta) = \Bellman\hat{v}(s; \theta) - \hat{v}(s; \theta). This transforms the impossible task of verifying “R(s)=0R(s) = 0 for all ss” into a finite-dimensional problem: find nn coefficients θ=(θ1,,θn)\theta = (\theta_1, \ldots, \theta_n) satisfying nn weighted integral conditions.

The weight function w(s)w(s) is part of the inner product definition and serves important purposes: it can emphasize certain regions of the state space, or represent a natural probability measure over states. In unweighted problems, we simply take w(s)=1w(s) = 1. In reinforcement learning applications, w(s)w(s) is often chosen as the stationary distribution dπ(s)d^\pi(s) under a policy π\pi. This is exactly what happens in methods like LSTD (Least Squares Temporal Difference), which can be viewed as a Galerkin method with w(s)=dπ(s)w(s) = d^\pi(s).

Different choices of test functions give different methods, each with different computational and theoretical properties. When pi=φip_i = \varphi_i (test functions equal basis functions), the method is called Galerkin, and can be interpreted geometrically as a projection of the residual onto the orthogonal complement of the approximation space. The rest of this chapter develops this framework systematically: how to choose basis functions {φj}\{\varphi_j\}, how to select test functions {pi}\{p_i\}, and how to solve the resulting systems.

The General Framework

Consider an operator equation of the form

$$\Residual(f) = 0,$$

where N:B1B2\Residual: B_1 \to B_2 is a continuous operator between complete normed vector spaces B1B_1 and B2B_2. For the Bellman equation, we have N(v)=Lvv\Residual(v) = \Bellman v - v, so that solving N(v)=0\Residual(v) = 0 is equivalent to finding the fixed point v=Lvv = \Bellman v.

Just as we transcribed infinite-dimensional continuous optimal control problems into finite-dimensional discrete optimal control problems in earlier chapters, we seek a finite-dimensional approximation to this infinite-dimensional functional equation. Recall that for continuous optimal control, we adopted control parameterization: we represented the control trajectory using a finite set of basis functions (piecewise constants, polynomials, splines) and searched over the finite-dimensional coefficient space instead of the infinite-dimensional function space. For integrals in the objective and constraints, we used numerical quadrature to approximate them with finite sums.

We follow the same strategy here. We parameterize the value function using a finite set of basis functions {φ1,,φn}\{\varphi_1, \ldots, \varphi_n\}—commonly polynomials (Chebyshev, Legendre), though other function classes (splines, radial basis functions, neural networks) are possible—and search for coefficients θ=(θ1,,θn)\theta = (\theta_1, \ldots, \theta_n) in Rn\mathbb{R}^n. When integrals appear in the Bellman operator or projection conditions, we approximate them using numerical quadrature. The projection method approach consists of several conceptual steps that accomplish this transcription.

Step 1: Choose a Finite-Dimensional Approximation Space

We begin by selecting a basis Φ={φ1,φ2,,φn}\Phi = \{\varphi_1, \varphi_2, \ldots, \varphi_n\} and approximating the unknown function as a linear combination:

$$\hat{f}(x) = \sum_{i=1}^n \theta_i \varphi_i(x).$$

The choice of basis functions $\varphi_i$ is problem-dependent. Common choices include orthogonal polynomials (Chebyshev, Legendre), splines, and radial basis functions.

The number of basis functions nn determines the flexibility of our approximation. In practice, we start with small nn and increase it until the approximation quality is satisfactory. The only unknowns now are the coefficients θ=(θ1,,θn)\theta = (\theta_1, \ldots, \theta_n).
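To make the parameterization concrete, here is a minimal sketch in Python (NumPy) of evaluating $\hat{f}(x; \theta) = \sum_j \theta_j \varphi_j(x)$ with a Chebyshev basis on $[-1, 1]$; the function names and coefficient values are illustrative, not part of any particular library beyond NumPy.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def basis_matrix(x, n):
    """Return the matrix with entries phi_j(x_i) for the first n Chebyshev
    polynomials T_0, ..., T_{n-1} evaluated at the points x."""
    return C.chebvander(np.asarray(x), n - 1)

def f_hat(x, theta):
    """Evaluate the linear-in-parameters approximation sum_j theta_j * phi_j(x)."""
    return basis_matrix(x, len(theta)) @ theta

theta = np.array([0.5, -1.0, 0.25])          # n = 3 coefficients (a quadratic)
print(f_hat(np.linspace(-1.0, 1.0, 5), theta))
```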

While the classical presentation of projection methods focuses on polynomial bases, the framework applies equally well to other function classes. Neural networks, for instance, can be viewed through this lens: a neural network f^(x;θ)\hat{f}(x; \theta) with parameters θ\theta defines a flexible function class, and many training procedures can be interpreted as projection methods with specific choices of test functions or residual norms. The distinction is that classical methods typically use predetermined basis functions with linear coefficients, while neural networks use adaptive nonlinear features. Throughout this chapter, we focus on the classical setting to develop the core concepts, but the principles extend naturally to modern function approximators.

Step 2: Define the Residual Function

Since we are approximating ff with f^\hat{f}, the operator N\Residual will generally not vanish exactly. Instead, we obtain a residual function:

$$R(x; \theta) = \Residual(\hat{f}(\cdot; \theta))(x).$$

This residual measures how far our candidate solution is from satisfying the equation at each point xx. As we discussed in the introduction, we will assess whether this residual is “close to zero” by testing its inner products against chosen test functions.

Step 3: Impose Conditions to Determine Coefficients

Having chosen our basis and defined the residual, we must decide how to make the residual “close to zero.” As discussed in the introduction, we test the residual using weighted integrals. We select test functions {p1,,pn}\{p_1, \ldots, p_n\} and a weight function w(x)w(x) (often w(x)=1w(x) = 1 for unweighted problems), then impose weighted residual conditions:

$$\int_{\mathcal{S}} R(x; \theta) p_i(x) w(x) \, dx = 0, \quad i = 1, \ldots, n.$$

These $n$ integral conditions provide $n$ equations to determine the $n$ unknown coefficients in $\theta$. Different choices of test functions $p_i$ give different methods, which we examine in turn below.

An alternative is the least squares approach, which minimizes the weighted norm of the residual:

$$\min_\theta \int_{\mathcal{S}} R(x; \theta)^2 w(x) dx.$$

We focus primarily on methods distinguished by their choice of test functions pip_i. The Galerkin method, where pi=φip_i = \varphi_i, can be interpreted geometrically as a projection when working in a Hilbert space with a weighted inner product.

Let us examine the standard choices of test functions and what they tell us about the residual:

Galerkin Method: Test Against the Basis

The Galerkin method chooses test functions pi=φip_i = \varphi_i, the same basis functions used to approximate f^\hat{f}:

$$\int_{\mathcal{S}} R(x; \theta) \varphi_i(x) w(x) dx = 0, \quad i = 1, \ldots, n.$$

To understand what this means, recall that in finite dimensions, two vectors are orthogonal when their inner product is zero. For functions, R,φiw=R(x)φi(x)w(x)dx=0\langle R, \varphi_i \rangle_w = \int R(x) \varphi_i(x) w(x) dx = 0 expresses the same concept: RR and φi\varphi_i are orthogonal as functions with respect to the weighted inner product. But there’s more to this than just testing against individual basis functions.

Consider our approximation space span{φ1,,φn}\text{span}\{\varphi_1, \ldots, \varphi_n\} as an nn-dimensional subspace within the infinite-dimensional space of all functions. Any function gg in this space can be written as g=i=1nciφig = \sum_{i=1}^n c_i \varphi_i for some coefficients cic_i. If the residual RR is orthogonal to all basis functions φi\varphi_i, then by linearity of the inner product, for any such function gg:

$$\langle R, g \rangle = \left\langle R, \sum_{i=1}^n c_i \varphi_i \right\rangle = \sum_{i=1}^n c_i \langle R, \varphi_i \rangle = 0.$$

This shows that RR is orthogonal to every function we can represent with our basis. The residual has “zero overlap” with our approximation space: we cannot express any part of it using our basis functions. In this sense, the residual is as “invisible” to our approximation as possible.

This condition is the defining property of optimality. By choosing our approximation f^\hat{f} so that the residual R=N(f^)R = \Residual(\hat{f}) is orthogonal to the entire approximation space, we ensure that f^\hat{f} is the orthogonal projection of the true solution onto spanφ1,,φn\text{span}{\varphi_1, \ldots, \varphi_n}. Within this nn-dimensional space, no better choice is possible: any other coefficients would yield a residual with a nonzero component inside the space, and therefore a larger norm.

The finite-dimensional analogy makes this concrete. Suppose you want to approximate a vector vR3\mathbf{v} \in \mathbb{R}^3 using only the xyxy-plane (a 2D subspace). The best approximation is to project v\mathbf{v} onto the plane, giving v^=(v1,v2,0)\hat{\mathbf{v}} = (v_1, v_2, 0). The error is r=vv^=(0,0,v3)\mathbf{r} = \mathbf{v} - \hat{\mathbf{v}} = (0, 0, v_3), which points purely in the zz-direction, orthogonal to the entire xyxy-plane. We see the Galerkin condition in action: the error is orthogonal to the approximation space.
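The same check is easy to carry out numerically. The short NumPy sketch below projects a vector onto the $xy$-plane and verifies that the error is orthogonal to both basis vectors; the numbers are arbitrary and serve only to illustrate the orthogonality condition.

```python
import numpy as np

v = np.array([2.0, -1.0, 3.0])
Phi = np.array([[1.0, 0.0],          # columns span the xy-plane
                [0.0, 1.0],
                [0.0, 0.0]])

# Orthogonal projection onto span(Phi)
P = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T
v_hat = P @ v                        # best approximation in the plane: (2, -1, 0)
r = v - v_hat                        # error: (0, 0, 3)

print(Phi.T @ r)                     # Galerkin condition: [0., 0.]
```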

Method of Moments: Test Against Monomials

The method of moments, for problems on DRD \subset \mathbb{R}, chooses test functions pi(x)=xi1p_i(x) = x^{i-1} for i=1,,ni = 1, \ldots, n:

$$\langle R(\cdot; \theta), x^{i-1} \rangle = 0, \quad i = 1, \ldots, n.$$

This requires the first nn moments of the residual function to vanish, ensuring the residual is “balanced” in the sense that it has no systematic trend captured by low-order polynomials. The moments xkR(x;θ)w(x)dx\int x^k R(x; \theta) w(x) dx measure weighted averages of the residual, with increasing powers of xx giving more weight to larger values. Setting these to zero ensures the residual doesn’t grow systematically with xx. This approach is particularly useful when w(x)w(x) is chosen as a probability measure, making the conditions natural moment restrictions familiar from statistics and econometrics.

Collocation Method: Test Against Delta Functions

The collocation method chooses test functions pi(x)=δ(xxi)p_i(x) = \delta(x - x_i), the Dirac delta functions at points {x1,,xn}\{x_1, \ldots, x_n\}:

$$\langle R(\cdot; \theta), \delta(\cdot - x_i) \rangle = R(x_i; \theta) = 0, \quad i = 1, \ldots, n.$$

This is projection against the most localized test functions possible: delta functions that “sample” the residual at specific points, requiring the residual to vanish exactly where we test it. The computational advantage is significant: collocation avoids numerical integration entirely, requiring only pointwise evaluation of RR.

Orthogonal collocation combines collocation with spectral basis functions (orthogonal polynomials like Chebyshev, Legendre, or Hermite) for smooth problems. We choose collocation points at the zeros of the nn-th polynomial in the family. For example, with Chebyshev polynomials T0,T1,,Tn1T_0, T_1, \ldots, T_{n-1}, we place collocation points at the zeros of Tn(x)T_n(x).

These zeros are also optimal nodes for Gauss quadrature with the associated weight function. This coordination means that the points where we enforce $R(x_i; \theta) = 0$ are exactly the points where any remaining integrals are evaluated most accurately, so interpolation and quadrature work together rather than against each other.

This approach is often called a pseudospectral method or spectral collocation method. The Chebyshev interpolation theorem guarantees that forcing R(xi;θ)=0R(x_i; \theta) = 0 at these carefully chosen points makes R(x;θ)R(x; \theta) small everywhere, with well-conditioned systems and near-optimal interpolation error.
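As a concrete illustration, the sketch below (NumPy, assuming the domain $[-1,1]$ and a placeholder target function) builds the Chebyshev collocation points and the collocation matrix $\Phi_{ij} = T_j(x_i)$, and solves the resulting square system.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

n = 10
# Collocation points: the n zeros of T_n on [-1, 1]
x = np.cos((2.0 * np.arange(n) + 1.0) * np.pi / (2.0 * n))

# Collocation matrix Phi[i, j] = T_j(x_i), j = 0, ..., n-1
Phi = C.chebvander(x, n - 1)

# Enforcing R(x_i; theta) = 0 reduces to a square system in the coefficients.
# As a stand-in, interpolate a placeholder target function at the nodes:
t = np.sin(3.0 * x)
theta = np.linalg.solve(Phi, t)

print(np.linalg.cond(Phi))           # modest condition number at Chebyshev nodes
```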

Subdomain Method: Test Against Indicator Functions

The subdomain method partitions the domain into nn subregions {D1,,Dn}\{D_1, \ldots, D_n\} and chooses test functions pi=IDip_i = I_{D_i}, the indicator functions:

$$\langle R(\cdot; \theta), I_{D_i} \rangle_w = \int_{D_i} R(x; \theta) w(x) dx = 0, \quad i = 1, \ldots, n.$$

This requires the residual to have zero average over each subdomain, ensuring the approximation is good “on average” over each piece of the domain. This approach is particularly natural for finite element methods where the domain is divided into elements, ensuring local balance of the residual within each element.

Least Squares

The least squares approach appears different at first glance, but it also fits the test function framework. We minimize:

$$\min_\theta \int_{\mathcal{S}} R(x; \theta)^2 w(x) dx = \min_\theta \langle R(\cdot; \theta), R(\cdot; \theta) \rangle_w.$$

The first-order conditions for this minimization problem are:

$$\left\langle R(\cdot; \theta), \frac{\partial R(\cdot; \theta)}{\partial \theta_i} \right\rangle_w = 0, \quad i = 1, \ldots, n.$$

Thus least squares implicitly uses test functions pi=R/θip_i = \partial R / \partial \theta_i, the gradients of the residual with respect to parameters. Unlike other methods where test functions are chosen a priori, here they depend on the current guess for θ\theta and on the structure of our approximation.

We can now see the unifying structure of weighted residual methods: whether we use projection conditions or least squares minimization, all these methods follow the same template of restricting the search to an nn-dimensional function space and imposing nn conditions on the residual. For projection methods specifically, we pick nn test functions and require R,pi=0\langle R, p_i \rangle = 0. They differ only in their philosophy about which test functions best detect whether the residual is “nearly zero.” Galerkin tests against the approximation basis itself (natural for orthogonal bases), the method of moments tests against monomials (ensuring polynomial balance), collocation tests against delta functions (pointwise satisfaction), subdomain tests against indicators (local average satisfaction), and least squares tests against residual gradients (global norm minimization). Each choice reflects different priorities: computational efficiency, theoretical optimality, ease of implementation, or sensitivity to errors in different regions of the domain.

Step 4: Solve the Finite-Dimensional Problem

The projection conditions give us a system to solve for the coefficients θ\theta. For test function methods (Galerkin, collocation, moments, subdomain), we solve:

$$P_i(\theta) \equiv \langle R(\cdot; \theta), p_i \rangle_w = 0, \quad i = 1, \ldots, n.$$

This is a system of nn (generally nonlinear) equations in nn unknowns. For least squares, we solve the optimization problem minθR(;θ),R(;θ)w\min_\theta \langle R(\cdot; \theta), R(\cdot; \theta) \rangle_w.

Computational Cost and Conditioning

The computational cost per iteration varies significantly across methods: collocation requires only pointwise evaluation of the residual at the chosen nodes, while Galerkin, the method of moments, the subdomain method, and least squares all require numerical integration of the residual against their test functions.

For methods requiring integration, the choice of quadrature rule should match the basis. Gaussian quadrature with nodes at orthogonal polynomial zeros is efficient. When combined with collocation at those same points, the quadrature is exact for polynomials up to a certain degree. This coordination between quadrature and collocation makes orthogonal collocation effective.

The conditioning of the system depends on the choice of test functions. The Jacobian matrix has entries:

$$J_{ij} = \frac{\partial P_i}{\partial \theta_j} = \left\langle \frac{\partial R(\cdot; \theta)}{\partial \theta_j}, p_i \right\rangle_w.$$

When test functions are orthogonal (or nearly so), the Jacobian tends to be well-conditioned. This is why orthogonal polynomial bases are preferred in Galerkin methods: they produce Jacobians with controlled condition numbers. Poorly chosen basis functions or collocation points can lead to nearly singular Jacobians, causing numerical instability. Orthogonal bases and carefully chosen collocation points (like Chebyshev nodes) help maintain good conditioning.
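A quick numerical comparison illustrates the point. The sketch below (NumPy) contrasts the condition number of a monomial basis evaluated at equispaced points with a Chebyshev basis evaluated at Chebyshev nodes; the specific sizes are arbitrary.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

n = 15
x_equi = np.linspace(-1.0, 1.0, n)
x_cheb = np.cos((2.0 * np.arange(n) + 1.0) * np.pi / (2.0 * n))

V_mono = np.vander(x_equi, n, increasing=True)   # monomials at equispaced points
V_cheb = C.chebvander(x_cheb, n - 1)             # Chebyshev basis at Chebyshev nodes

print(np.linalg.cond(V_mono))   # huge: the system is nearly singular
print(np.linalg.cond(V_cheb))   # modest: the system is well conditioned
```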

Two Main Solution Approaches

We have two fundamentally different ways to solve the projection equations: function iteration (exploiting fixed-point structure) and Newton’s method (exploiting smoothness). The choice depends on whether the original operator equation has contraction properties and how well those properties are preserved by the finite-dimensional approximation.

Method 1: Function Iteration (Successive Approximation)

When the operator equation has the form f=Tff = \Contraction f where T\Contraction is a contraction, the most natural approach is to iterate the operator directly:

$$\hat{f}^{(k+1)} = \Contraction \hat{f}^{(k)}.$$

The infinite-dimensional iteration becomes a finite-dimensional iteration in coefficient space once we choose our weighted residual method. Given a current approximation f^(k)(x;θ(k))\hat{f}^{(k)}(x; \theta^{(k)}), how do we find the coefficients θ(k+1)\theta^{(k+1)} for the next iterate f^(k+1)\hat{f}^{(k+1)}?

Different weighted residual methods answer this differently. For collocation, we proceed in two steps:

  1. Evaluate the operator: At each collocation point xix_i, compute what the next iterate should be: ti(k)=(Tf^(k))(xi)t_i^{(k)} = (\Contraction \hat{f}^{(k)})(x_i). These nn target values tell us what f^(k+1)\hat{f}^{(k+1)} should equal at the collocation points.

  2. Find matching coefficients: Determine θ(k+1)\theta^{(k+1)} so that f^(k+1)(xi;θ(k+1))=ti(k)\hat{f}^{(k+1)}(x_i; \theta^{(k+1)}) = t_i^{(k)} for all ii. This is a linear system: jθj(k+1)φj(xi)=ti(k)\sum_j \theta_j^{(k+1)} \varphi_j(x_i) = t_i^{(k)}.

In matrix form: Φθ(k+1)=t(k)\boldsymbol{\Phi} \theta^{(k+1)} = t^{(k)}, where Φ\boldsymbol{\Phi} is the collocation matrix with entries Φij=φj(xi)\Phi_{ij} = \varphi_j(x_i). Solving this system gives θ(k+1)=Φ1t(k)\theta^{(k+1)} = \boldsymbol{\Phi}^{-1} t^{(k)}.

For Galerkin, the projection condition f^(k+1)Tf^(k+1),φiw=0\langle \hat{f}^{(k+1)} - \Contraction \hat{f}^{(k+1)}, \varphi_i \rangle_w = 0 directly gives a system for θ(k+1)\theta^{(k+1)}. When T\Contraction is linear in its argument (as in many integral equations), this is a linear system. When T\Contraction is nonlinear (as in the Bellman equation), we must solve a nonlinear system at each iteration—though each solution still only involves nn unknowns rather than an infinite-dimensional function.

When T\Contraction is a contraction in the infinite-dimensional space with constant γ<1\gamma < 1, iterating it pulls any starting function toward the unique fixed point. The hope is that the finite-dimensional operator—evaluating T\Contraction and projecting back onto the span of the basis functions—inherits this contraction property. When it does, function iteration converges globally from any initial guess, with each iteration reducing the error by a factor of roughly γ\gamma. This is computationally attractive: we only evaluate the operator and solve a linear system (for collocation) or a relatively simple system (for other methods).

However, the finite-dimensional approximation doesn’t always preserve contraction. High-order polynomial bases, in particular, can create oscillations between basis functions that amplify rather than contract errors. Even when contraction is preserved, convergence can be painfully slow when γ\gamma is close to 1—the “weak contraction” regime common in economic problems with patient agents (γ0.95\gamma \approx 0.95 or higher). Finally, not all operator equations naturally present themselves as contractions; some require reformulation (like f=fαN(f)f = f - \alpha \Residual(f)), and finding a good α\alpha can be problem-specific.

Method 2: Newton’s Method

Alternatively, we can treat the projection equations as a rootfinding problem G(θ)=0G(\theta) = 0 where Gi(θ)=Pi(θ)G_i(\theta) = P_i(\theta) for test function methods, or solve the first-order conditions for least squares. Newton’s method uses the update:

$$\theta^{(k+1)} = \theta^{(k)} - J_G(\theta^{(k)})^{-1} G(\theta^{(k)}),$$

where JG(θ)J_G(\theta) is the Jacobian of GG at θ\theta.

Computing the Jacobian: We must compute Jij=GiθjJ_{ij} = \frac{\partial G_i}{\partial \theta_j}. For collocation, Gi(θ)=f^(xi;θ)(Tf^(;θ))(xi)G_i(\theta) = \hat{f}(x_i; \theta) - (\Contraction \hat{f}(\cdot; \theta))(x_i), so:

$$\frac{\partial G_i}{\partial \theta_j} = \frac{\partial \hat{f}(x_i; \theta)}{\partial \theta_j} - \frac{\partial (\Contraction \hat{f}(\cdot; \theta))(x_i)}{\partial \theta_j}.$$

The first term is straightforward (it’s just φj(xi)\varphi_j(x_i) for a linear approximation). The second term requires differentiating the operator T\Contraction with respect to the parameters.

When T\Contraction involves optimization (as in the Bellman operator Lv=maxa{r(s,a)+γE[v(s)]}\Bellman v = \max_a \{r(s,a) + \gamma \mathbb{E}[v(s')]\}), computing this derivative appears problematic because the max operator is not differentiable. However, the Envelope Theorem resolves this difficulty.

With the Envelope Theorem providing a tractable way to compute Jacobians for problems involving optimization, Newton’s method becomes practical for weighted residual methods applied to Bellman equations and similar problems. The method offers quadratic convergence near the solution. Once in the neighborhood of the true fixed point, Newton’s method typically converges in just a few iterations. Unlike function iteration, it doesn’t rely on the finite-dimensional approximation preserving any contraction property, making it applicable to a broader range of problems, particularly those with high-order polynomial bases or large discount factors where function iteration struggles.

However, Newton’s method demands more from both the algorithm and the user. Each iteration requires computing and solving a full Jacobian system, making the per-iteration cost significantly higher than function iteration. The method is also sensitive to initialization: started far from the solution, Newton’s method may diverge or converge to spurious fixed points that the finite-dimensional problem introduces but the original infinite-dimensional problem lacks. When applying the Envelope Theorem, implementation becomes more complex. We must track the optimal action at each evaluation point and compute the Jacobian entries using the formula above (expected basis function values at next states under optimal actions), though the economic interpretation (tracking how value propagates through optimal decisions) often makes the computation conceptually clearer than explicit derivative calculations would be.

Comparison and Practical Recommendations
| Method | Convergence | Per-iteration cost | Initial guess sensitivity |
|---|---|---|---|
| Function iteration | Linear (when contraction holds) | Low | Robust |
| Newton's method | Quadratic (near solution) | Moderate (Jacobian + solve) | Requires good initial guess |

Which method to use? When the problem has strong contraction (small γ\gamma, well-conditioned bases, shape-preserving approximations like linear interpolation or splines), function iteration is simple and robust. For weak contraction (large γ\gamma, high-order polynomials), a hybrid approach works well: run function iteration for several iterations to enter the basin of attraction, then switch to Newton’s method for rapid final convergence. When the finite-dimensional approximation destroys contraction entirely (common with non-monotone bases), Newton’s method may be necessary from the start, though careful initialization (from a coarser approximation or perturbation methods) is required.

Quasi-Newton methods like BFGS or Broyden offer a middle ground. They approximate the Jacobian using function evaluations only, avoiding explicit derivative computations while maintaining superlinear convergence. This can be useful when computing the exact Jacobian via the Envelope Theorem is expensive or when the approximation quality is acceptable.

Step 5: Verify the Solution

Once we have computed a candidate solution f^\hat{f}, we must verify its quality. Projection methods optimize f^\hat{f} with respect to specific criteria (specific test functions or collocation points), but we should check that the residual is small everywhere, including directions or points we did not optimize over.

Typical diagnostic checks include evaluating the residual $R(x; \theta)$ on a dense grid of points not used in the solution (away from the collocation or quadrature nodes) and confirming that it stays small, and re-solving with a larger number of basis functions $n$ to verify that the computed solution changes little.

Application to the Bellman Equation

Consider the Bellman optimality equation v(s)=Lv(s)=maxaAs{r(s,a)+γjSp(js,a)v(j)}v(s) = \Bellman v(s) = \max_{a \in \mathcal{A}_s} \{ r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a) v(j) \}. For a candidate approximation v^(s)=i=1nθiφi(s)\hat{v}(s) = \sum_{i=1}^n \theta_i \varphi_i(s), the residual is:

$$R(s; \theta) = \Bellman\hat{v}(s) - \hat{v}(s) = \max_{a \in \mathcal{A}_s} \left\{ r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a) \hat{v}(j) \right\} - \sum_{i=1}^n \theta_i \varphi_i(s).$$

We examine how collocation and Galerkin—the two most common weighted residual methods for Bellman equations—specialize the general solution approaches from Step 4.

Collocation

For collocation, we choose nn states {s1,,sn}\{s_1, \ldots, s_n\} and require the Bellman equation to hold exactly at these points:

$$\sum_{j=1}^n \theta_j \varphi_j(s_i) = \max_{a \in \mathcal{A}_{s_i}} \left\{ r(s_i,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s_i,a) \sum_{\ell=1}^n \theta_\ell \varphi_\ell(j) \right\}, \quad i = 1, \ldots, n.$$

It helps to define the parametric Bellman operator Lφ:RnRn\mathrm{L}_\varphi: \mathbb{R}^n \to \mathbb{R}^n by [Lφ(θ)]i=[Lv^(;θ)](si)[\mathrm{L}_\varphi(\theta)]_i = [\Bellman\hat{v}(\cdot; \theta)](s_i), the Bellman operator evaluated at collocation point sis_i. Let Φ\boldsymbol{\Phi} be the n×nn \times n matrix with entries Φij=φj(si)\Phi_{ij} = \varphi_j(s_i). Then the collocation equations become Φθ=Lφ(θ)\boldsymbol{\Phi} \theta = \mathrm{L}_\varphi(\theta).

Function iteration for collocation proceeds as follows. Given current coefficients θ(k)\theta^{(k)}, we evaluate the Bellman operator at each collocation point to get target values ti(k)=[Lφ(θ(k))]it_i^{(k)} = [\mathrm{L}_\varphi(\theta^{(k)})]_i. We then find new coefficients by solving the linear system Φθ(k+1)=t(k)\boldsymbol{\Phi} \theta^{(k+1)} = t^{(k)}. This is parametric value iteration: apply the Bellman operator, fit the result.

When the state space is continuous, we approximate expectations using numerical quadrature (Gauss-Hermite for normal shocks, etc.). The method is simple and robust when the finite-dimensional approximation preserves contraction, but can be slow for large discount factors.
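A compact sketch of this loop, assuming a problem-specific callable `bellman_target` (a name introduced here for illustration) that evaluates $[\mathrm{L}_\varphi(\theta)]_i$ at the collocation points, including the maximization over actions and any quadrature, might look as follows.

```python
import numpy as np

def parametric_value_iteration(Phi, bellman_target, theta0, tol=1e-8, max_iter=1000):
    """Collocation function iteration: apply the Bellman operator, then fit.

    Phi            : (n, n) collocation matrix with entries phi_j(s_i)
    bellman_target : callable mapping theta to the vector [L_phi(theta)]_i, i.e. the
                     Bellman operator applied to v_hat(.; theta) at each collocation
                     point (including the max over actions and any quadrature)
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        t = bellman_target(theta)              # step 1: evaluate the operator
        theta_new = np.linalg.solve(Phi, t)    # step 2: fit the new coefficients
        if np.max(np.abs(theta_new - theta)) < tol:
            return theta_new
        theta = theta_new
    return theta
```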

Newton’s method for collocation treats the problem as rootfinding: G(θ)=ΦθLφ(θ)=0G(\theta) = \boldsymbol{\Phi} \theta - \mathrm{L}_\varphi(\theta) = 0. The Jacobian is JG=ΦJLφJ_G = \boldsymbol{\Phi} - J_{\mathrm{L}_\varphi}, where the Envelope Theorem (Step 4) gives us [JLφ]ij=γsp(ssi,ai(θ))φj(s)[J_{\mathrm{L}_\varphi}]_{ij} = \gamma \sum_{s'} p(s'|s_i, a_i^*(\theta)) \varphi_j(s'). Here ai(θ)a_i^*(\theta) is the optimal action at collocation point sis_i given the current coefficients.

This converges rapidly near the solution but requires good initialization and more computation per iteration than function iteration. The method is equivalent to policy iteration: each step evaluates the value of the current greedy policy, then improves it.
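A sketch of this Newton iteration, assuming a hypothetical `greedy_model` callable that returns the Bellman targets together with the envelope-theorem term $\gamma \sum_{s'} p(s'|s_i, a_i^*) \varphi_j(s')$ under the current coefficients, could look like this.

```python
import numpy as np

def newton_collocation(Phi, greedy_model, theta0, tol=1e-10, max_iter=50):
    """Newton's method for the collocation system G(theta) = Phi @ theta - L_phi(theta) = 0.

    greedy_model : callable mapping theta to a pair (t, E), where
        t[i]    = [L_phi(theta)]_i                                  (Bellman targets)
        E[i, j] = gamma * sum_{s'} p(s' | s_i, a_i*) phi_j(s')      (envelope-theorem term)
    computed from the transition model under the greedy actions a_i*.
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        t, E = greedy_model(theta)
        G = Phi @ theta - t
        J = Phi - E                             # Jacobian of G via the Envelope Theorem
        step = np.linalg.solve(J, G)
        theta = theta - step
        if np.max(np.abs(step)) < tol:
            break
    return theta
```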

Why is collocation popular for Bellman equations? Because it avoids integration when testing the residual. We only evaluate the Bellman operator at nn discrete points. In contrast, Galerkin requires integrating the residual against each basis function.

Galerkin

For Galerkin, we use the basis functions themselves as test functions. The conditions are:

$$\int_{\mathcal{S}} [\Bellman\hat{v}(s; \theta) - \hat{v}(s; \theta)] \varphi_i(s) w(s) ds = 0, \quad i = 1, \ldots, n.$$

where w(s)w(s) is a weight function (often the stationary distribution dπ(s)d^\pi(s) in RL applications, or simply w(s)=1w(s) = 1). Expanding this:

$$\int_{\mathcal{S}} \left[ \max_a \left\{ r(s,a) + \gamma \mathbb{E}[v(s')] \right\} - \sum_j \theta_j \varphi_j(s) \right] \varphi_i(s) w(s) ds = 0.$$

Function iteration for Galerkin works differently than for collocation. Given θ(k)\theta^{(k)}, we cannot simply evaluate the Bellman operator and fit. Instead, we must solve an integral equation. At each iteration, we seek θ(k+1)\theta^{(k+1)} satisfying:

$$\int_{\mathcal{S}} \sum_j \theta_j^{(k+1)} \varphi_j(s) \varphi_i(s) w(s) ds = \int_{\mathcal{S}} [\Bellman\hat{v}(s; \theta^{(k)})] \varphi_i(s) w(s) ds.$$

The left side is a linear system (the “mass matrix” Mij=φiφjwM_{ij} = \int \varphi_i \varphi_j w), and the right side requires integrating the Bellman operator output against each test function. When the basis functions are orthogonal polynomials with matching weight ww, the mass matrix is diagonal, simplifying the solve. But we still need numerical integration to evaluate the right side. This makes Galerkin substantially more expensive than collocation per iteration.
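The following sketch (NumPy, assuming a Chebyshev basis with its natural weight $w(x) = (1-x^2)^{-1/2}$ on $[-1,1]$) computes the mass matrix by Gauss-Chebyshev quadrature and shows that it comes out diagonal up to rounding; the sizes are arbitrary.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

n, n_quad = 5, 40
# Gauss-Chebyshev quadrature for the weight w(x) = 1/sqrt(1 - x^2):
# nodes are Chebyshev zeros and all weights equal pi / n_quad.
xq = np.cos((2.0 * np.arange(n_quad) + 1.0) * np.pi / (2.0 * n_quad))
wq = np.full(n_quad, np.pi / n_quad)

B = C.chebvander(xq, n - 1)              # B[q, j] = T_j(x_q)
M = B.T @ (wq[:, None] * B)              # M[i, j] ~ integral of T_i T_j w

print(np.round(M, 8))                    # diagonal up to rounding: pi, pi/2, pi/2, ...
```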

Newton’s method for Galerkin similarly requires integration. The residual is R(s;θ)=Lv^(s;θ)v^(s;θ)R(s; \theta) = \Bellman\hat{v}(s; \theta) - \hat{v}(s; \theta), and we need Gi(θ)=R(s;θ)φi(s)w(s)ds=0G_i(\theta) = \int R(s; \theta) \varphi_i(s) w(s) ds = 0. The Jacobian entry is:

$$J_{ij} = \int \left[ \frac{\partial \Bellman\hat{v}(s; \theta)}{\partial \theta_j} - \varphi_j(s) \right] \varphi_i(s) w(s) ds.$$

The Envelope Theorem gives Lv^(s;θ)θj=γE[φj(s)s,a(s;θ)]\frac{\partial \Bellman\hat{v}(s; \theta)}{\partial \theta_j} = \gamma \mathbb{E}[\varphi_j(s') \mid s, a^*(s;\theta)], so we must integrate expected basis function values (under optimal actions) against test functions and weight. This requires both numerical integration and careful tracking of optimal actions across the state space, making it substantially more complex than collocation’s pointwise evaluation.

The advantage of Galerkin over collocation lies in its theoretical properties: when using orthogonal polynomials, Galerkin provides optimal approximation in the weighted L2L^2 norm. For smooth problems, this can yield better accuracy per degree of freedom than collocation. In practice, collocation’s computational simplicity usually outweighs Galerkin’s theoretical optimality for Bellman equations, especially in high-dimensional problems where integration becomes prohibitively expensive.

Galerkin for Discrete MDPs: LSTD and LSPI

When the state space is discrete and finite, the Galerkin conditions simplify dramatically. The integrals become sums, and we can write everything in matrix form. This specialization shows the connection to algorithms widely used in reinforcement learning.

For a discrete state space S={s1,,sm}\mathcal{S} = \{s_1, \ldots, s_m\}, the Galerkin orthogonality conditions

$$\int_{\mathcal{S}} [\Bellman\hat{v}(s; \theta) - \hat{v}(s; \theta)] \varphi_i(s) w(s) ds = 0$$

become weighted sums over states:

$$\sum_{s \in \mathcal{S}} \xi(s) [\Bellman\hat{v}(s; \theta) - \hat{v}(s; \theta)] \varphi_i(s) = 0, \quad i = 1, \ldots, n,$$

where ξ(s)0\xi(s) \geq 0 with sξ(s)=1\sum_s \xi(s) = 1 is a probability distribution over states. Define the feature matrix ΦRm×n\boldsymbol{\Phi} \in \mathbb{R}^{m \times n} with entries Φsi=φi(s)\Phi_{si} = \varphi_i(s) (each row contains the features for one state), and let Ξ=diag(ξ)\boldsymbol{\Xi} = \text{diag}(\xi) be the diagonal matrix with the state distribution on the diagonal.

Policy Evaluation: LSTD

For policy evaluation with a fixed policy π\pi, the Bellman operator is linear:

$$[\BellmanPi \hat{v}](s) = r(s, \pi(s)) + \gamma \sum_{j \in \mathcal{S}} p(j|s, \pi(s)) \hat{v}(j).$$

With linear function approximation v^(s)=φ(s)θ=iθiφi(s)\hat{v}(s) = \boldsymbol{\varphi}(s)^\top \theta = \sum_i \theta_i \varphi_i(s), this becomes:

$$[\BellmanPi \hat{v}](s) = r(s, \pi(s)) + \gamma \sum_{j \in \mathcal{S}} p(j|s, \pi(s)) \sum_i \theta_i \varphi_i(j).$$

Let rπRm\mathbf{r}_\pi \in \mathbb{R}^m be the vector of rewards [rπ]s=r(s,π(s))[\mathbf{r}_\pi]_s = r(s, \pi(s)), and PπRm×m\mathbf{P}_\pi \in \mathbb{R}^{m \times m} be the transition matrix with [Pπ]sj=p(js,π(s))[\mathbf{P}_\pi]_{sj} = p(j|s, \pi(s)). Then Lπv^=rπ+γPπΦθ\BellmanPi \hat{v} = \mathbf{r}_\pi + \gamma \mathbf{P}_\pi \boldsymbol{\Phi} \theta in vector form.

The Galerkin conditions require Lπv^v^,φiξ=0\langle \BellmanPi \hat{v} - \hat{v}, \varphi_i \rangle_\xi = 0 for all basis functions, which in matrix form is:

$$\boldsymbol{\Phi}^\top \boldsymbol{\Xi} (\mathbf{r}_\pi + \gamma \mathbf{P}_\pi \boldsymbol{\Phi} \theta - \boldsymbol{\Phi} \theta) = \mathbf{0}.$$

Rearranging:

$$\boldsymbol{\Phi}^\top \boldsymbol{\Xi} (\boldsymbol{\Phi} - \gamma \mathbf{P}_\pi \boldsymbol{\Phi}) \theta = \boldsymbol{\Phi}^\top \boldsymbol{\Xi} \mathbf{r}_\pi.$$

This is the LSTD (Least Squares Temporal Difference) solution. The matrix A=ΦΞ(ΦγPπΦ)\mathbf{A} = \boldsymbol{\Phi}^\top \boldsymbol{\Xi} (\boldsymbol{\Phi} - \gamma \mathbf{P}_\pi \boldsymbol{\Phi}) and vector b=ΦΞrπ\mathbf{b} = \boldsymbol{\Phi}^\top \boldsymbol{\Xi} \mathbf{r}_\pi give the linear system Aθ=b\mathbf{A} \theta = \mathbf{b}.
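In code, the LSTD solution is a single linear solve. A minimal sketch, assuming the model quantities $\boldsymbol{\Phi}$, $\boldsymbol{\Xi}$, $\mathbf{P}_\pi$, and $\mathbf{r}_\pi$ are available as NumPy arrays:

```python
import numpy as np

def lstd(Phi, Xi, P_pi, r_pi, gamma):
    """Solve Phi' Xi (Phi - gamma P_pi Phi) theta = Phi' Xi r_pi.

    Phi  : (m, n) feature matrix, one row of features per state
    Xi   : (m, m) diagonal matrix of state weights xi(s)
    P_pi : (m, m) transition matrix under the policy
    r_pi : (m,)   reward vector under the policy
    """
    A = Phi.T @ Xi @ (Phi - gamma * P_pi @ Phi)
    b = Phi.T @ Xi @ r_pi
    return np.linalg.solve(A, b)
```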

When ξ\xi is the stationary distribution of policy π\pi (so ξPπ=ξ\xi^\top \mathbf{P}_\pi = \xi^\top), this system has a unique solution, and the projected Bellman operator PLπ\Proj \BellmanPi is a contraction in the weighted L2L^2 norm ξ\|\cdot\|_\xi. This is the theoretical foundation for TD learning with linear function approximation.

The Bellman Optimality Equation: Function Iteration and Newton’s Method

For the Bellman optimality equation, the max operator introduces nonlinearity:

$$[\Bellman\hat{v}](s) = \max_{a \in \mathcal{A}_s} \left\{ r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a) \hat{v}(j) \right\}.$$

The Galerkin conditions become:

$$G(\theta) \equiv \boldsymbol{\Phi}^\top \boldsymbol{\Xi} (\Bellman\hat{v}(\cdot; \theta) - \boldsymbol{\Phi} \theta) = \mathbf{0},$$

where the Bellman operator must be evaluated at each state ss to find the optimal action and compute the target value. This is a system of nn nonlinear equations in nn unknowns.

Function iteration applies the Bellman operator and projects back. Given θ(k)\theta^{(k)}, compute the greedy policy π(k)(s)=argmaxa{r(s,a)+γjSp(js,a)φ(j)θ(k)}\pi^{(k)}(s) = \arg\max_a \{r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a) \boldsymbol{\varphi}(j)^\top \theta^{(k)}\} at each state, then solve:

$$\boldsymbol{\Phi}^\top \boldsymbol{\Xi} (\boldsymbol{\Phi} - \gamma \mathbf{P}_{\pi^{(k)}} \boldsymbol{\Phi}) \theta^{(k+1)} = \boldsymbol{\Phi}^\top \boldsymbol{\Xi} \mathbf{r}_{\pi^{(k)}}.$$

This evaluates the current greedy policy using LSTD, then implicitly improves by computing a new greedy policy at the next iteration. However, convergence can be slow when the finite-dimensional approximation poorly preserves contraction.

Newton’s method treats G(θ)=0G(\theta) = 0 as a rootfinding problem and uses the Jacobian to accelerate convergence. The Jacobian of GG is:

$$J_G(\theta) = \frac{\partial G}{\partial \theta} = \boldsymbol{\Phi}^\top \boldsymbol{\Xi} \left( \frac{\partial \Bellman\hat{v}(\cdot; \theta)}{\partial \theta} - \boldsymbol{\Phi} \right).$$

To compute Lv^(s;θ)θj\frac{\partial \Bellman\hat{v}(s; \theta)}{\partial \theta_j}, we use the Envelope Theorem from Step 4. At the current θ(k)\theta^{(k)}, let a(s;θ(k))a^*(s; \theta^{(k)}) be the optimal action at state ss. Then:

$$\frac{\partial [\Bellman\hat{v}](s; \theta^{(k)})}{\partial \theta_j} = \gamma \sum_{s' \in \mathcal{S}} p(s'|s, a^*(s; \theta^{(k)})) \varphi_j(s').$$

Define the policy π(k)(s)=a(s;θ(k))\pi^{(k)}(s) = a^*(s; \theta^{(k)}). The Jacobian becomes:

$$J_G(\theta^{(k)}) = \boldsymbol{\Phi}^\top \boldsymbol{\Xi} (\gamma \mathbf{P}_{\pi^{(k)}} \boldsymbol{\Phi} - \boldsymbol{\Phi}) = -\boldsymbol{\Phi}^\top \boldsymbol{\Xi} (\boldsymbol{\Phi} - \gamma \mathbf{P}_{\pi^{(k)}} \boldsymbol{\Phi}).$$

The Newton update θ(k+1)=θ(k)JG(θ(k))1G(θ(k))\theta^{(k+1)} = \theta^{(k)} - J_G(\theta^{(k)})^{-1} G(\theta^{(k)}) simplifies. We have:

$$G(\theta^{(k)}) = \boldsymbol{\Phi}^\top \boldsymbol{\Xi} (\Bellman\hat{v}(\cdot; \theta^{(k)}) - \boldsymbol{\Phi} \theta^{(k)}).$$

At each state ss, the greedy value is [Lv^(;θ(k))](s)=r(s,π(k)(s))+γjp(js,π(k)(s))φ(j)θ(k)[\Bellman\hat{v}(\cdot; \theta^{(k)})](s) = r(s, \pi^{(k)}(s)) + \gamma \sum_j p(j|s, \pi^{(k)}(s)) \boldsymbol{\varphi}(j)^\top \theta^{(k)}, which equals [Lπ(k)v^(;θ(k))](s)[\mathrm{L}_{\pi^{(k)}} \hat{v}(\cdot; \theta^{(k)})](s). Thus:

$$G(\theta^{(k)}) = \boldsymbol{\Phi}^\top \boldsymbol{\Xi} (\mathbf{r}_{\pi^{(k)}} + \gamma \mathbf{P}_{\pi^{(k)}} \boldsymbol{\Phi} \theta^{(k)} - \boldsymbol{\Phi} \theta^{(k)}).$$

The Newton step becomes:

$$\theta^{(k+1)} = \theta^{(k)} + [\boldsymbol{\Phi}^\top \boldsymbol{\Xi} (\boldsymbol{\Phi} - \gamma \mathbf{P}_{\pi^{(k)}} \boldsymbol{\Phi})]^{-1} \boldsymbol{\Phi}^\top \boldsymbol{\Xi} (\mathbf{r}_{\pi^{(k)}} + \gamma \mathbf{P}_{\pi^{(k)}} \boldsymbol{\Phi} \theta^{(k)} - \boldsymbol{\Phi} \theta^{(k)}).$$

Multiplying through and simplifying:

$$\boldsymbol{\Phi}^\top \boldsymbol{\Xi} (\boldsymbol{\Phi} - \gamma \mathbf{P}_{\pi^{(k)}} \boldsymbol{\Phi}) \theta^{(k+1)} = \boldsymbol{\Phi}^\top \boldsymbol{\Xi} \mathbf{r}_{\pi^{(k)}}.$$

This is LSPI (Least Squares Policy Iteration). Each Newton step:

  1. Computes the greedy policy π(k)(s)=argmaxa{r(s,a)+γjp(js,a)φ(j)θ(k)}\pi^{(k)}(s) = \arg\max_a \{r(s,a) + \gamma \sum_j p(j|s,a) \boldsymbol{\varphi}(j)^\top \theta^{(k)}\}

  2. Solves the LSTD equation for this policy to get θ(k+1)\theta^{(k+1)}

Newton’s method for the Galerkin-projected Bellman optimality equation is equivalent to policy iteration in the function approximation setting. Just as Newton’s method for collocation corresponded to policy iteration (Step 4), Newton’s method for discrete Galerkin gives LSPI.
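A minimal model-based sketch of this iteration (NumPy, with hypothetical array conventions: `P[a, s, s']` for transition probabilities and `R[s, a]` for rewards) alternates the greedy-policy computation with an LSTD solve:

```python
import numpy as np

def lspi(Phi, Xi, P, R, gamma, theta0, max_iter=100, tol=1e-10):
    """LSPI: Newton's method for the Galerkin-projected Bellman optimality equation.

    Phi : (m, n) feature matrix
    Xi  : (m, m) diagonal state-weight matrix
    P   : (A, m, m) transition tensor, P[a, s, s'] = p(s' | s, a)
    R   : (m, A) reward matrix, R[s, a] = r(s, a)
    """
    m = Phi.shape[0]
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        # 1. Greedy policy under the current coefficients
        v = Phi @ theta
        Q = R + gamma * np.einsum('asj,j->sa', P, v)
        pi = np.argmax(Q, axis=1)
        # 2. LSTD evaluation of that policy
        P_pi = P[pi, np.arange(m), :]
        r_pi = R[np.arange(m), pi]
        A = Phi.T @ Xi @ (Phi - gamma * P_pi @ Phi)
        b = Phi.T @ Xi @ r_pi
        theta_new = np.linalg.solve(A, b)
        if np.max(np.abs(theta_new - theta)) < tol:
            return theta_new, pi
        theta = theta_new
    return theta, pi
```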

Galerkin projection with linear function approximation reduces policy iteration to a sequence of linear systems, each solvable in closed form. For discrete MDPs, we can compute the matrices ΦΞΦ\boldsymbol{\Phi}^\top \boldsymbol{\Xi} \boldsymbol{\Phi} and ΦΞPπΦ\boldsymbol{\Phi}^\top \boldsymbol{\Xi} \mathbf{P}_\pi \boldsymbol{\Phi} exactly. In the next section, we examine how these methods extend to the simulation-based setting, where we must estimate these matrices from samples rather than computing them exactly from the model.

Monotone Projection and the Preservation of Contraction

The informal discussion of shape preservation hints at a deeper theoretical question: when does the function iteration method converge? Recall from our discussion of collocation that function iteration proceeds in two steps:

  1. Apply the Bellman operator at the collocation points: $t^{(k)} = \mathrm{L}_\varphi(\theta^{(k)})$, where $t_i^{(k)} = \Bellman\hat{v}^{(k)}(s_i)$

  2. Fit new coefficients to match these targets: $\boldsymbol{\Phi} \theta^{(k+1)} = t^{(k)}$, giving $\theta^{(k+1)} = \boldsymbol{\Phi}^{-1} \mathrm{L}_\varphi(\theta^{(k)})$

We can reinterpret this iteration in function space rather than coefficient space. Let P\Proj be the projection operator that takes any function ff and returns its approximation in span{φ1,,φn}\text{span}\{\varphi_1, \ldots, \varphi_n\}. For collocation, P\Proj is the interpolation operator: (Pf)(s)(\Proj f)(s) is the unique linear combination of basis functions that matches ff at the collocation points. Then Step 2 can be written as: fit v^(k+1)\hat{v}^{(k+1)} so that v^(k+1)(si)=Lv^(k)(si)\hat{v}^{(k+1)}(s_i) = \Bellman\hat{v}^{(k)}(s_i) for all collocation points, which means v^(k+1)=P(Lv^(k))\hat{v}^{(k+1)} = \Proj(\Bellman\hat{v}^{(k)}).

In other words, function iteration is equivalent to projected value iteration in function space:

$$\hat{v}^{(k+1)} = \Proj \Bellman \hat{v}^{(k)}.$$

We know that standard value iteration vk+1=Lvkv_{k+1} = \Bellman v_k converges because L\Bellman is a γ\gamma-contraction in the sup norm. But now we’re iterating with the composed operator PL\Proj \Bellman instead of L\Bellman alone.

This $\Proj \Bellman$ structure is not specific to collocation. It is inherent in all projection methods. The general pattern is always the same: apply the Bellman operator to get a target function $\Bellman\hat{v}^{(k)}$, then project it back onto our approximation space to get $\hat{v}^{(k+1)}$. The projection step defines an operator $\Proj$ that depends on our choice of test functions: interpolation at the collocation points for collocation, and (weighted) orthogonal projection onto the span of the basis for Galerkin and least squares.

But regardless of which projection method we use, iteration takes the form v^(k+1)=PLv^(k)\hat{v}^{(k+1)} = \Proj \Bellman\hat{v}^{(k)}.

Does the composition PL\Proj \Bellman inherit the contraction property of L\Bellman? If not, the iteration may diverge, oscillate, or converge to a spurious fixed point even though the original problem is well-posed.

Monotone Approximators and Stability

The answer turns out to depend on specific properties of the approximation operator $\Proj$. This theory was developed independently across multiple research communities—computational economics (Judd, 1992, 1996; McGrattan, 1997; Santos & Vigo-Aguiar, 1998), economic dynamics (Stachurski, 2009), and reinforcement learning (Gordon, 1995, 1999)—arriving at essentially the same mathematical conditions.

Monotonicity Implies Nonexpansiveness

It turns out that approximation operators satisfying simple structural properties automatically preserve contraction.

Specifically, if $\Proj$ is monotone ($f \le g$ pointwise implies $\Proj f \le \Proj g$) and preserves constants ($\Proj(f + c\mathbf{1}) = \Proj f + c\mathbf{1}$ for any constant $c$), then $\Proj$ is nonexpansive in the sup norm; there is no need to verify nonexpansiveness separately. The intuition is that a monotone, constant-preserving operator acts like a weighted average that respects order structure and cannot amplify differences between functions.

Preservation of Contraction

Combining nonexpansiveness with the contraction property of the Bellman operator yields the main stability result.

The resulting bound on the fixed point $\hat{v}$ of $\Proj\Bellman$, $\|\hat{v} - v^*\|_\infty \le \frac{1}{1-\gamma}\|\Proj v^* - v^*\|_\infty$, tells us that the fixed-point error is controlled by how well $\Proj$ can represent $v^*$. If $v^* \in \text{Range}(\Proj)$, then $\Proj v^* = v^*$ and the error vanishes. Otherwise, the error is proportional to the approximation error $\|\Proj v^* - v^*\|_\infty$, amplified by the factor $(1-\gamma)^{-1}$.

Averagers in Discrete-State Problems

For discrete-state problems, the monotonicity conditions have a natural interpretation as averaging with nonnegative weights. This characterization was developed by Gordon in the context of reinforcement learning.

An averager is a linear approximation operator of the form $\Proj v = \mathbf{A} v$, where $\mathbf{A}$ is a stochastic matrix: its entries are nonnegative and each row sums to one. Averagers automatically satisfy the monotonicity conditions: linearity follows from matrix multiplication, monotonicity follows from nonnegativity of the entries, and constant preservation follows from the row sums equaling one.

This specializes the Santos-Vigo-Aguiar theorem to discrete states, expressed in the probabilistic language of stochastic matrices. The stochastic matrix characterization connects to Markov chain theory: Pv\Proj v represents expected values after one transition, and the monotonicity property reflects the fact that expectations preserve order.

Examples of averagers include state aggregation (averaging values within groups), K-nearest neighbors (averaging over nearest states), kernel smoothing with positive kernels, and multilinear interpolation on grids (barycentric weights are nonnegative and sum to one). Counterexamples include linear least squares regression (projection matrix may have negative entries) and high-order polynomial interpolation (Runge phenomenon produces negative weights).

The following table summarizes which common approximation operators satisfy the monotonicity conditions:

| Method | Monotone? | Notes |
|---|---|---|
| Piecewise linear interpolation | Yes | Always an averager; guaranteed stability |
| Multilinear interpolation (grid) | Yes | Barycentric weights are nonnegative and sum to one |
| Shape-preserving splines (Schumaker) | Yes | Designed to maintain monotonicity |
| State aggregation | Yes | Exact averaging within groups |
| Kernel smoothing (positive kernels) | Yes | If kernel integrates to one |
| High-order polynomial interpolation | No | Oscillations violate monotonicity (Runge phenomenon) |
| Least squares projection (arbitrary basis) | No | Projection matrix may have negative entries |
| Fourier/spectral methods | No | Not monotone-preserving in general |
| Neural networks | No | Highly flexible but no monotonicity guarantees |

The distinction between “safe” (monotone) and “potentially unstable” (non-monotone) approximators provides rigorous foundation for the folk wisdom that linear interpolation is reliable while high-order polynomials can be dangerous for value iteration. But notice that the table’s verdict on “least squares projection” is somewhat abstract—it doesn’t specifically address the three weighted residual methods we introduced at the start of this chapter.

The choice of solution method determines which approximation operators are safe to use. Successive approximation (fixed-point iteration) requires monotone approximators to guarantee convergence. Rootfinding methods like Newton’s method do not require monotonicity. Stability depends on numerical properties of the Jacobian rather than contraction preservation. These considerations suggest hybrid strategies. One approach runs a few iterations with a monotone method to generate an initial guess, then switches to Newton’s method with a smooth approximation for rapid final convergence.

Connecting Back to Collocation, Galerkin, and Least Squares

We have now developed a general stability theory for projected value iteration and surveyed which approximation operators are monotone. But what does this mean for the three specific weighted residual methods we introduced at the start of this chapter—collocation, Galerkin, and least squares? Each method defines a different projection operator P\Proj, and we now need to determine which satisfy the monotonicity conditions that guarantee convergence.

Collocation with piecewise linear interpolation is monotone. When we use collocation with piecewise linear basis functions on a grid, the projection operator performs linear interpolation between grid points. At any state ss between grid points sis_i and si+1s_{i+1}, the interpolated value is:

$$(\Proj v)(s) = \frac{s_{i+1} - s}{s_{i+1} - s_i} v(s_i) + \frac{s - s_i}{s_{i+1} - s_i} v(s_{i+1}).$$

The interpolation weights (barycentric coordinates) are nonnegative and sum to one, making this an averager in Gordon’s sense. Therefore collocation with piecewise linear bases satisfies the monotonicity conditions and the Santos-Vigo-Aguiar stability theorem applies. The folk wisdom that “linear interpolation is safe for value iteration” has rigorous theoretical foundation.

Galerkin projection is generally not monotone. The Galerkin projection operator for a general basis {φ1,,φn}\{\varphi_1, \ldots, \varphi_n\} has the form:

$$\Proj = \boldsymbol{\Phi}(\boldsymbol{\Phi}^\top \mathbf{W} \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^\top \mathbf{W},$$

where W\mathbf{W} is a diagonal weight matrix and Φ\boldsymbol{\Phi} contains the basis function evaluations. This projection matrix typically has negative entries. To see why, consider a simple example with polynomial basis functions {1,x,x2}\{1, x, x^2\} on [1,1][-1, 1]. The projection of a function onto this space involves computing (ΦWΦ)1(\boldsymbol{\Phi}^\top \mathbf{W} \boldsymbol{\Phi})^{-1}, and the resulting operator can map nonnegative functions to functions with negative values. This is the same phenomenon underlying the Runge phenomenon in high-order polynomial interpolation: the projection weights oscillate in sign.
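A small numerical check makes both claims concrete: the sketch below (NumPy) builds the least-squares projection matrix for the basis $\{1, x, x^2\}$ on a grid and shows that it has negative entries, while a piecewise linear interpolation matrix from a coarse grid has nonnegative rows summing to one; the grids are arbitrary.

```python
import numpy as np

# Least-squares / Galerkin projection onto {1, x, x^2} over a grid on [-1, 1]
x = np.linspace(-1.0, 1.0, 9)
Phi = np.vander(x, 3, increasing=True)                   # columns: 1, x, x^2
P_ls = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T          # uniform weights for simplicity
print(P_ls.min())                                        # negative entries: not monotone

# Piecewise linear interpolation from the coarse grid {-1, 0, 1} to the fine grid
grid = np.array([-1.0, 0.0, 1.0])
P_lin = np.zeros((len(x), len(grid)))
for i, xi in enumerate(x):
    j = min(np.searchsorted(grid, xi, side='right') - 1, len(grid) - 2)
    lam = (xi - grid[j]) / (grid[j + 1] - grid[j])
    P_lin[i, j], P_lin[i, j + 1] = 1.0 - lam, lam

print(P_lin.min(), P_lin.sum(axis=1))                    # nonnegative rows summing to one
```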

Since Galerkin projection is not monotone, the sup norm contraction theory does not guarantee convergence of projected value iteration vk+1=PLvkv_{k+1} = \Proj \Bellman v_k with Galerkin.

Least squares methods share the non-monotonicity issue. The least squares projection operator minimizes N(f^)w2\|\Residual(\hat{f})\|_w^2 and has the same mathematical form as Galerkin projection. It is a linear projection onto span{φ1,,φn}\text{span}\{\varphi_1, \ldots, \varphi_n\} with respect to a weighted inner product. Like Galerkin, the projection matrix typically contains negative entries and violates monotonicity.

The gap in our theory. The monotone approximator framework successfully covers collocation with simple bases, but leaves two important methods—Galerkin and least squares—without convergence guarantees. These methods are used in least-squares temporal difference learning (LSTD) and modern reinforcement learning with linear function approximation. We need a different analytical framework to understand when these non-monotone projections lead to convergent algorithms.

Beyond Monotone Approximators

The monotone approximator theory gives us a clean sufficient condition for convergence: if P\Proj is monotone (and constant-preserving), then P\Proj is non-expansive in the sup norm \|\cdot\|_\infty. Since L\Bellman is a γ\gamma-contraction in the sup norm, their composition PL\Proj \Bellman is also a γ\gamma-contraction in the sup norm, guaranteeing convergence of projected value iteration.

But what if P\Proj is not monotone? Can we still guarantee convergence? Galerkin and least squares projections typically violate monotonicity, yet they are widely used in practice, particularly in reinforcement learning through least-squares temporal difference learning (LSTD). In general, proving convergence for non-monotone projections is difficult. However, for the special case of policy evaluation—computing the value function vπv_\pi of a fixed policy π\pi—we can establish convergence by working in a different norm.

The Policy Evaluation Problem and LSTD

Consider the policy evaluation problem: given policy π\pi, we want to solve the policy Bellman equation vπ=rπ+γPπvπv_\pi = r_\pi + \gamma \mathbf{P}_\pi v_\pi, where rπr_\pi and Pπ\mathbf{P}_\pi are the reward vector and transition matrix under π\pi. This is the core computational task in policy iteration, actor-critic algorithms, and temporal difference learning. In reinforcement learning, we typically learn from sampled experience: trajectories (s0,a0,r1,s1,a1,r2,s2,)(s_0, a_0, r_1, s_1, a_1, r_2, s_2, \ldots) generated by following π\pi. If the Markov chain induced by π\pi is ergodic, the state distribution converges to a stationary distribution ξ\xi satisfying ξPπ=ξ\xi^\top \mathbf{P}_\pi = \xi^\top.

This distribution determines which states appear frequently in our data. States visited often contribute more samples and have more influence on any learned approximation. States visited rarely contribute little. For a linear approximation vθ(s)=jθjφj(s)v_\theta(s) = \sum_j \theta_j \varphi_j(s), the least-squares temporal difference (LSTD) algorithm computes coefficients by solving:

$$\boldsymbol{\Phi}^\top \boldsymbol{\Xi} (\boldsymbol{\Phi} - \gamma \mathbf{P}_\pi \boldsymbol{\Phi}) \boldsymbol{\theta} = \boldsymbol{\Phi}^\top \boldsymbol{\Xi} \mathbf{r}_\pi,$$

where Φ\boldsymbol{\Phi} is the matrix of basis function evaluations and Ξ=diag(ξ)\boldsymbol{\Xi} = \text{diag}(\xi). We write this matrix equation for analysis purposes, but the actual algorithm does not compute it this way. For large state spaces, we cannot enumerate all states to form Φ\boldsymbol{\Phi} or explicitly represent the transition matrix Pπ\mathbf{P}_\pi. Instead, the practical algorithm accumulates sums from sampled transitions (s,r,s)(s, r, s'), incrementally building the matrices ΦΞΦ\boldsymbol{\Phi}^\top \boldsymbol{\Xi} \boldsymbol{\Phi} and ΦΞPπΦ\boldsymbol{\Phi}^\top \boldsymbol{\Xi} \mathbf{P}_\pi \boldsymbol{\Phi} without ever forming the full objects. The algorithm is derived from first principles through temporal difference learning, and the Galerkin perspective provides an interpretation of what it computes.

LSTD as Projected Bellman Equation

To see what this equation means, let v^=Φθ\hat{v} = \boldsymbol{\Phi} \boldsymbol{\theta} be the solution. Expanding the parentheses:

ΦΞΦθγΦΞPπΦθ=ΦΞrπ.\boldsymbol{\Phi}^\top \boldsymbol{\Xi} \boldsymbol{\Phi} \boldsymbol{\theta} - \gamma \boldsymbol{\Phi}^\top \boldsymbol{\Xi} \mathbf{P}_\pi \boldsymbol{\Phi} \boldsymbol{\theta} = \boldsymbol{\Phi}^\top \boldsymbol{\Xi} \mathbf{r}_\pi.

Moving all terms to the left side and factoring out ΦΞ\boldsymbol{\Phi}^\top \boldsymbol{\Xi}:

ΦΞ(ΦθγPπΦθrπ)=0.\boldsymbol{\Phi}^\top \boldsymbol{\Xi} (\boldsymbol{\Phi} \boldsymbol{\theta} - \gamma \mathbf{P}_\pi \boldsymbol{\Phi} \boldsymbol{\theta} - \mathbf{r}_\pi) = \mathbf{0}.

Since v^=Φθ\hat{v} = \boldsymbol{\Phi} \boldsymbol{\theta} and the policy Bellman operator is Lπv^=rπ+γPπv^\BellmanPi \hat{v} = \mathbf{r}_\pi + \gamma \mathbf{P}_\pi \hat{v}, we can write:

ΦΞ(v^Lπv^)=0.\boldsymbol{\Phi}^\top \boldsymbol{\Xi} (\hat{v} - \BellmanPi \hat{v}) = \mathbf{0}.

Let φj\boldsymbol{\varphi}_j denote the jj-th column of Φ\boldsymbol{\Phi}, which contains the evaluations of the jj-th basis function at all states. The equation above says that for each jj:

φjΞ(v^Lπv^)=0.\boldsymbol{\varphi}_j^\top \boldsymbol{\Xi} (\hat{v} - \BellmanPi \hat{v}) = 0.

But φjΞ(v^Lπv^)\boldsymbol{\varphi}_j^\top \boldsymbol{\Xi} (\hat{v} - \BellmanPi \hat{v}) is exactly the ξ\xi-weighted inner product φj,v^Lπv^ξ\langle \boldsymbol{\varphi}_j, \hat{v} - \BellmanPi \hat{v} \rangle_\xi. So the residual v^Lπv^\hat{v} - \BellmanPi \hat{v} is orthogonal to every basis function, and therefore orthogonal to the entire subspace span(Φ)\text{span}(\boldsymbol{\Phi}).

By definition, the orthogonal projection Py\Proj y of a vector yy onto a subspace is the unique vector in that subspace such that yPyy - \Proj y is orthogonal to the subspace. Here, v^\hat{v} lies in span(Φ)\text{span}(\boldsymbol{\Phi}) (since v^=Φθ\hat{v} = \boldsymbol{\Phi} \boldsymbol{\theta}), and we have just shown that Lπv^v^\BellmanPi \hat{v} - \hat{v} is orthogonal to span(Φ)\text{span}(\boldsymbol{\Phi}). Therefore, v^=PLπv^\hat{v} = \Proj \BellmanPi \hat{v}, where P\Proj is orthogonal projection onto span(Φ)\text{span}(\boldsymbol{\Phi}) with respect to the ξ\xi-weighted inner product:

u,vξ=uΞv,vξ=vΞv,P=Φ(ΦΞΦ)1ΦΞ.\langle u, v \rangle_\xi = u^\top \boldsymbol{\Xi} v, \qquad \|v\|_\xi = \sqrt{v^\top \boldsymbol{\Xi} v}, \qquad \Proj = \boldsymbol{\Phi}(\boldsymbol{\Phi}^\top \boldsymbol{\Xi} \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^\top \boldsymbol{\Xi}.

The weighting by ξ\xi is not arbitrary. Temporal difference learning performs stochastic updates using individual transitions: θk+1=θk+αk(r+γvθk(s)vθk(s))vθk(s)\theta_{k+1} = \theta_k + \alpha_k (r + \gamma v_{\theta_k}(s') - v_{\theta_k}(s)) \nabla v_{\theta_k}(s), with states sampled from ξ\xi. The ODE analysis of this stochastic process (Borkar-Meyn theory) shows that the iterates converge to the fixed point of the ξ\xi-weighted projected Bellman operator PLπ\Proj \BellmanPi. LSTD computes this same fixed point directly by solving the linear system above.
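This correspondence can be checked numerically. The snippet below builds a small synthetic chain (the transition matrix, rewards, discount, and two-feature basis are invented for illustration), solves the LSTD system directly, forms the ξ-weighted projection matrix, and verifies that the LSTD solution is a fixed point of the projected Bellman operator.

```python
import numpy as np

# A small synthetic policy-evaluation problem (all numbers illustrative).
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])             # transition matrix P_pi
r = np.array([1.0, 0.0, -1.0])              # reward vector r_pi
gamma = 0.9
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 1.0]])                # basis matrix: 3 states, 2 features

# Stationary distribution: normalized left eigenvector of P for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
xi = np.real(evecs[:, np.argmax(np.real(evals))])
xi = xi / xi.sum()
Xi = np.diag(xi)

# Solve the LSTD system  Phi' Xi (Phi - gamma P Phi) theta = Phi' Xi r.
theta = np.linalg.solve(Phi.T @ Xi @ (Phi - gamma * P @ Phi), Phi.T @ Xi @ r)
v_hat = Phi @ theta

# Orthogonal projection onto span(Phi) in the xi-weighted inner product.
Proj = Phi @ np.linalg.solve(Phi.T @ Xi @ Phi, Phi.T @ Xi)

# v_hat is a fixed point of the projected Bellman operator Proj L_pi.
assert np.allclose(v_hat, Proj @ (r + gamma * P @ v_hat))
```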

Orthogonal Projection is Non-Expansive

Suppose ξ\xi is the stationary distribution: ξPπ=ξ\xi^\top \mathbf{P}_\pi = \xi^\top. Our goal is to show that PLπ\Proj \BellmanPi is a contraction in ξ\|\cdot\|_\xi. If P\Proj is non-expansive in this norm and Lπ\BellmanPi is a γ\gamma-contraction in ξ\|\cdot\|_\xi, then their composition is a γ\gamma-contraction:

PLπvPLπwξLπvLπwξγvwξ.\|\Proj \BellmanPi v - \Proj \BellmanPi w\|_\xi \leq \|\BellmanPi v - \BellmanPi w\|_\xi \leq \gamma \|v - w\|_\xi.

First, we establish that orthogonal projection is non-expansive. For any vector vv, we can decompose v=Pv+(vPv)v = \Proj v + (v - \Proj v), where (vPv)(v - \Proj v) is orthogonal to the subspace span(Φ)\text{span}(\boldsymbol{\Phi}). By the Pythagorean theorem in the ξ\|\cdot\|_\xi inner product:

vξ2=Pvξ2+vPvξ2.\|v\|_\xi^2 = \|\Proj v\|_\xi^2 + \|v - \Proj v\|_\xi^2.

Since vPvξ20\|v - \Proj v\|_\xi^2 \geq 0, we have:

vξ2Pvξ2.\|v\|_\xi^2 \geq \|\Proj v\|_\xi^2.

Taking square roots of both sides (which preserves the inequality since both norms are non-negative):

Pvξvξ.\|\Proj v\|_\xi \leq \|v\|_\xi.

This holds for all vv, so P\Proj is non-expansive in ξ\|\cdot\|_\xi.
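Continuing the small sketch above (it reuses Proj and xi from that snippet), a quick randomized check confirms that the projection never increases the ξ-weighted norm:

```python
# Continuation of the sketch above: Proj and xi come from that snippet.
rng = np.random.default_rng(0)

def norm_xi(v):
    return np.sqrt(np.sum(xi * v ** 2))              # the xi-weighted norm ||v||_xi

for _ in range(1000):
    v = rng.normal(size=3)
    assert norm_xi(Proj @ v) <= norm_xi(v) + 1e-12   # non-expansiveness of Proj
```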

Can Lπ\BellmanPi be a Contraction in ξ\|\cdot\|_\xi?

To show that the operator Lπ\BellmanPi, defined by Lπv=rπ+γPπv\BellmanPi v = r_\pi + \gamma \mathbf{P}_\pi v, is a γ\gamma-contraction, we compute:

LπvLπwξ=γPπ(vw)ξ=γPπ(vw)ξ.\|\BellmanPi v - \BellmanPi w\|_\xi = \|\gamma \mathbf{P}_\pi (v - w)\|_\xi = \gamma \|\mathbf{P}_\pi (v - w)\|_\xi.

This will be at most γvwξ\gamma \|v - w\|_\xi if Pπ\mathbf{P}_\pi is non-expansive, meaning Pπzξzξ\|\mathbf{P}_\pi z\|_\xi \leq \|z\|_\xi for any vector zz. So the key is to establish that Pπ\mathbf{P}_\pi is non-expansive in ξ\|\cdot\|_\xi.

Consider the squared norm of Pπz\mathbf{P}_\pi z. By definition of the weighted norm:

Pπzξ2=sξ(s)[(Pπz)(s)]2.\|\mathbf{P}_\pi z\|_\xi^2 = \sum_s \xi(s) [(\mathbf{P}_\pi z)(s)]^2.

The ss-th component of Pπz\mathbf{P}_\pi z is (Pπz)(s)=sp(ss,π(s))z(s)(\mathbf{P}_\pi z)(s) = \sum_{s'} p(s'|s,\pi(s)) z(s'). This is a weighted average of the values z(s)z(s') with weights p(ss,π(s))p(s'|s,\pi(s)) that sum to one. Therefore:

Pπzξ2=sξ(s)[sp(ss,π(s))z(s)]2.\|\mathbf{P}_\pi z\|_\xi^2 = \sum_s \xi(s) \left[\sum_{s'} p(s'|s,\pi(s)) z(s')\right]^2.

Since the function xx2x \mapsto x^2 is convex, Jensen’s inequality applied to the probability distribution p(s,π(s))p(\cdot|s,\pi(s)) gives:

[sp(ss,π(s))z(s)]2sp(ss,π(s))z(s)2.\left[\sum_{s'} p(s'|s,\pi(s)) z(s')\right]^2 \leq \sum_{s'} p(s'|s,\pi(s)) z(s')^2.

Substituting this into the norm expression:

Pπzξ2sξ(s)sp(ss,π(s))z(s)2=sz(s)2sξ(s)p(ss,π(s)).\|\mathbf{P}_\pi z\|_\xi^2 \leq \sum_s \xi(s) \sum_{s'} p(s'|s,\pi(s)) z(s')^2 = \sum_{s'} z(s')^2 \sum_s \xi(s) p(s'|s,\pi(s)).

The stationarity condition ξPπ=ξ\xi^\top \mathbf{P}_\pi = \xi^\top means sξ(s)p(ss,π(s))=ξ(s)\sum_s \xi(s) p(s'|s,\pi(s)) = \xi(s') for all ss'. Therefore:

Pπzξ2sz(s)2ξ(s)=zξ2.\|\mathbf{P}_\pi z\|_\xi^2 \leq \sum_{s'} z(s')^2 \xi(s') = \|z\|_\xi^2.

Taking square roots, Pπzξzξ\|\mathbf{P}_\pi z\|_\xi \leq \|z\|_\xi, so Pπ\mathbf{P}_\pi is non-expansive in ξ\|\cdot\|_\xi. This makes Lπ\BellmanPi a γ\gamma-contraction in ξ\|\cdot\|_\xi. Composing with the non-expansive projection:

PLπvPLπwξLπvLπwξγvwξ.\|\Proj \BellmanPi v - \Proj \BellmanPi w\|_\xi \leq \|\BellmanPi v - \BellmanPi w\|_\xi \leq \gamma \|v - w\|_\xi.

By Banach’s fixed-point theorem, PLπ\Proj \BellmanPi has a unique fixed point and iterates converge from any initialization.
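Continuing the same sketch (reusing P, r, gamma, xi, Proj, norm_xi, and v_hat from the snippets above), both conclusions can be checked directly: the transition matrix is non-expansive in the ξ-weighted norm, and projected value iteration converges to the LSTD fixed point.

```python
# P_pi is non-expansive in the xi-weighted norm when xi is stationary.
for _ in range(1000):
    z = rng.normal(size=3)
    assert norm_xi(P @ z) <= norm_xi(z) + 1e-12

# Projected value iteration converges to the LSTD solution v_hat.
v = np.zeros(3)
for _ in range(500):
    v = Proj @ (r + gamma * P @ v)
assert np.allclose(v, v_hat)
```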

Interpretation: The On-Policy Condition

The result shows that convergence depends on matching the weighting to the operator. We cannot choose an arbitrary weighted L2L^2 norm and expect PLπ\Proj \BellmanPi to be a contraction. Instead, the weighting ξ\xi must have a specific relationship with the transition matrix Pπ\mathbf{P}_\pi in the operator Lπ\BellmanPi: namely, ξ\xi must be the stationary distribution of Pπ\mathbf{P}_\pi. This is what makes the weighted geometry compatible with the operator’s structure. When this match holds, Jensen’s inequality gives us non-expansiveness of Pπ\mathbf{P}_\pi in the ξ\|\cdot\|_\xi norm, and the composition PLπ\Proj \BellmanPi inherits the contraction property.

In reinforcement learning, this has a practical interpretation. When we learn by following policy π\pi and collecting transitions (s,a,r,s)(s, a, r, s'), the states we visit are distributed according to the stationary distribution of π\pi. This is on-policy learning. The LSTD algorithm uses data sampled from this distribution, which means the empirical weighting naturally matches the operator structure. Our analysis shows that the iterative algorithm vk+1=PLπvkv_{k+1} = \Proj \BellmanPi v_k converges to the same fixed point that LSTD computes in closed form.

This is fundamentally different from the monotone approximator theory. There, we required structural properties of P\Proj itself (monotonicity, constant preservation) to guarantee that P\Proj preserves the sup-norm contraction property of L\Bellman. Here, we place no such restriction on P\Proj—Galerkin projection is not monotone. Instead, convergence depends on matching the norm to the operator. When ξ\xi does not match the stationary distribution, as in off-policy learning where data comes from a different behavior policy, the Jensen inequality argument breaks down. The operator Pπ\mathbf{P}_\pi need not be non-expansive in ξ\|\cdot\|_\xi, and PLπ\Proj \BellmanPi may fail to contract. This explains divergence phenomena such as Baird’s counterexample Baird (1995).
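The failure mode can be reproduced with a deliberately tiny example, smaller than Baird's construction but in the same spirit: one feature, zero rewards, and a chain that jumps from the first state to an absorbing second state, so the true value function is zero. Every number below is illustrative. With the stationary (on-policy) weighting, the projected iteration shrinks toward zero; concentrating the weight on the first state, as an off-policy behavior distribution might, makes the same iteration expand.

```python
import numpy as np

# Two states, one feature: phi(s1) = 1, phi(s2) = 2, zero rewards, s1 -> s2 -> s2.
Phi = np.array([[1.0], [2.0]])
P = np.array([[0.0, 1.0],
              [0.0, 1.0]])
r = np.zeros(2)
gamma = 0.9

def projected_iteration(xi, n_iters=50):
    Xi = np.diag(xi)
    Proj = Phi @ np.linalg.solve(Phi.T @ Xi @ Phi, Phi.T @ Xi)
    v = np.array([1.0, 2.0])                  # arbitrary start in span(Phi)
    for _ in range(n_iters):
        v = Proj @ (r + gamma * P @ v)
    return v

print(projected_iteration(np.array([0.0, 1.0])))   # stationary weighting: decays toward 0
print(projected_iteration(np.array([0.9, 0.1])))   # off-policy weighting: blows up
```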

Why Not Bellman Optimality?

Can we extend this weighted L2L^2 analysis to the Bellman optimality operator Lv=maxa[ra+γPav]\Bellman v = \max_a [r_a + \gamma \mathbf{P}_a v]? The answer is no, at least not with this approach. The obstacle appears at the Jensen inequality step. For policy evaluation, we had:

Pπzξ2=sξ(s)[sp(ss,π(s))z(s)]2.\|\mathbf{P}_\pi z\|_\xi^2 = \sum_s \xi(s) \left[\sum_{s'} p(s'|s,\pi(s)) z(s')\right]^2.

The inner term is a convex combination of the values z(s)z(s'), which allowed us to apply Jensen’s inequality to the convex function xx2x \mapsto x^2. For the optimal Bellman operator, we would need to bound:

[maxasp(ss,a)z(s)]2.\left[\max_{a} \sum_{s'} p(s'|s,a) z(s')\right]^2.

But the maximum of convex combinations is not itself a convex combination: it is a pointwise maximum, and Jensen’s inequality does not apply to it directly. Even if we apply Jensen’s inequality action by action, the maximizing action varies with the state and with zz, so the resulting transition structure is no longer Pπ\mathbf{P}_\pi and ξ\xi need not be stationary for it; the final step of the argument breaks down. We cannot conclude that maxa[Paz]\max_a [\mathbf{P}_a z] is non-expansive in any fixed weighted L2L^2 norm.

Is convergence of PL\Proj \Bellman with Galerkin projection impossible, or merely difficult to prove? The situation is subtle. In practice, fitted Q-iteration and approximate value iteration with neural networks often work well, suggesting that some form of stability exists. But there are also well-documented divergence examples (e.g., Q-learning with linear function approximation can diverge). The theoretical picture remains incomplete. Some results exist for restricted function classes or under strong assumptions on the MDP structure, but no general convergence guarantee like the policy evaluation result is available. The interplay between the max operator, the projection, and the norm geometry is not well understood. This is an active area of research in reinforcement learning theory.

From Abstract to Practical: Fitted-Value Iteration

We have developed weighted residual methods through abstract functional equations: choose test functions, impose orthogonality conditions R,piw=0\langle R, p_i \rangle_w = 0, solve for coefficients. But what are we actually computing when we solve these equations by successive approximation? The answer is simpler than the formalism suggests: function iteration with a fitting step.

Recall that the weighted residual conditions vLv,piw=0\langle v - \Bellman v, p_i \rangle_w = 0 define a fixed-point problem v=PLvv = \Proj \Bellman v, where P\Proj is a projection operator onto span(Φ)\text{span}(\boldsymbol{\Phi}). We can solve this by iteration: vk+1=PLvkv_{k+1} = \Proj \Bellman v_k. Under appropriate conditions (monotonicity of P\Proj, or matching the weight to the operator for policy evaluation), this converges to a solution.

In parameter space, this iteration becomes a fitting procedure. Consider Galerkin projection with a finite state space of nn states. Let Φ\boldsymbol{\Phi} be the n×dn \times d matrix of basis evaluations, W\mathbf{W} the diagonal weight matrix, and y\mathbf{y} the vector of Bellman operator evaluations: yi=(Lvk)(si)y_i = (\Bellman v_k)(s_i). The projection is:

θk+1=(ΦWΦ)1ΦWy.\boldsymbol{\theta}_{k+1} = (\boldsymbol{\Phi}^\top \mathbf{W} \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^\top \mathbf{W} \mathbf{y}.

This is weighted least-squares regression: fit Φθ\boldsymbol{\Phi} \boldsymbol{\theta} to the targets y\mathbf{y}. For collocation, we instead require exact interpolation Φθk+1=y\boldsymbol{\Phi} \boldsymbol{\theta}_{k+1} = \mathbf{y} at the chosen collocation points. For continuous state spaces, we approximate the Galerkin integrals using sampled states, reducing to the same finite-dimensional fitting problem. The recipe is the same throughout: function iteration in the abstract becomes “generate targets, fit to targets, repeat” in the implementation.
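The loop itself is short. The sketch below assumes a finite MDP whose transition tensor and rewards are known exactly (the sampled-data variant is the subject of the next chapter); the function name and array shapes are illustrative. Each sweep generates Bellman targets at every state and then performs the weighted least-squares fit written above. As discussed earlier, convergence is only guaranteed under extra conditions, such as a monotone fitting step or the matched weighting of policy evaluation.

```python
import numpy as np

def fitted_value_iteration(P, r, Phi, w, gamma, n_iters=200):
    """Projected value iteration in parameter space (a sketch).

    P:   (A, n, n) transition matrix for each action (assumed known exactly)
    r:   (A, n) reward for each action-state pair
    Phi: (n, d) basis matrix, w: (n,) positive weights for the fit"""
    W = np.diag(w)
    fit_op = np.linalg.solve(Phi.T @ W @ Phi, Phi.T @ W)  # (d, n) least-squares operator
    theta = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        v = Phi @ theta                            # current value estimates at all states
        y = np.max(r + gamma * P @ v, axis=0)      # Bellman targets y_i = (L v_k)(s_i)
        theta = fit_op @ y                         # fit: weighted least-squares regression
    return theta
```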

This extends beyond linear basis functions. Neural networks, decision trees, and kernel methods all implement variants of this procedure. Given data {(si,yi)}\{(s_i, y_i)\} where yi=(Lvk)(si)y_i = (\Bellman v_k)(s_i), each method produces a function vk+1:SRv_{k+1}: \mathcal{S} \to \mathbb{R} by fitting to the targets. The projection operator P\Proj is simply one instantiation of a fitting procedure—Galerkin and collocation correspond to specific choices of approximation class and loss function.

The abstraction fit\mathtt{fit} encapsulates all the complexity of the function-approximation step, whether that involves solving a linear system, running gradient descent, or training an ensemble: when F\mathcal{F} is a linear subspace and we minimize weighted squared error, we recover Galerkin or collocation, while neural networks and other nonlinear methods extend the same recipe beyond the cases covered by the convergence theory.

A limitation of this algorithm is that it assumes we can evaluate the Bellman operator exactly. Computing yi=(Lvk)(si)y_i = (\Bellman v_k)(s_i) requires knowing transition probabilities and summing over all next states. In practice, we often have only a simulator or observed data. The next chapter shows how to approximate these expectations from samples, connecting the fitted-value iteration framework developed here to simulation-based methods and reinforcement learning.

The Minimum Residual Framework for Nonlinear Approximators

The weighted residual methods developed in this chapter have focused on linear function classes: polynomial bases, piecewise linear interpolants, and linear combinations of fixed basis functions. Neural networks, kernel methods, and decision trees do not fit this template. How does the framework extend to nonlinear approximators?

Recall the Galerkin approach for linear approximation vθ=i=1dθiφiv_{\boldsymbol{\theta}} = \sum_{i=1}^d \theta_i \varphi_i. The orthogonality conditions vLv,φiw=0\langle v - \Bellman v, \varphi_i \rangle_w = 0 for all ii define a linear system with a closed-form solution. The least-squares method instead chooses the coefficients by minimizing vLvw2\|v - \Bellman v\|_w^2 directly, and the connection between norm minimization and orthogonality holds generally. For any norm w\|\cdot\|_w induced by an inner product ,w\langle \cdot, \cdot \rangle_w, minimizing f(θ)w2\|f(\boldsymbol{\theta})\|_w^2 with respect to the parameters requires θif(θ)w2=0\frac{\partial}{\partial \theta_i} \|f(\boldsymbol{\theta})\|_w^2 = 0. Since fw2=f,fw\|f\|_w^2 = \langle f, f \rangle_w, the chain rule gives 2f,fθiw=02\langle f, \frac{\partial f}{\partial \theta_i} \rangle_w = 0. Minimizing the residual norm is thus equivalent to requiring orthogonality f,fθiw=0\langle f, \frac{\partial f}{\partial \theta_i} \rangle_w = 0 for all ii. The equivalence holds for any choice of inner product: weighted L2L^2 integrals for Galerkin, sums over collocation points for collocation, or sampled expectations for neural networks.

For nonlinear function classes parameterized by θRp\boldsymbol{\theta} \in \mathbb{R}^p (neural networks, kernel expansions), the same minimization principle applies:

θ=argminθvθLvθw2.\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} \|v_{\boldsymbol{\theta}} - \Bellman v_{\boldsymbol{\theta}}\|_w^2.

Setting the gradient to zero yields an orthogonality condition. Holding the Bellman image Lvθ\Bellman v_{\boldsymbol{\theta}} fixed during differentiation (the semi-gradient convention standard in reinforcement learning), the condition reads:

vθLvθ,vθθiw=0for all i.\Big\langle v_{\boldsymbol{\theta}} - \Bellman v_{\boldsymbol{\theta}}, \frac{\partial v_{\boldsymbol{\theta}}}{\partial \theta_i} \Big\rangle_w = 0 \quad \text{for all } i.

The test functions are now the partial derivatives vθθi\frac{\partial v_{\boldsymbol{\theta}}}{\partial \theta_i}, which span the tangent space to the manifold {vθ:θRp}\{v_{\boldsymbol{\theta}} : \boldsymbol{\theta} \in \mathbb{R}^p\} at the current parameters. (Differentiating through the Bellman image as well adds a correction term to each test function; this is the residual-gradient variant of Baird (1995).) In the linear case vθ=iθiφiv_{\boldsymbol{\theta}} = \sum_i \theta_i \varphi_i, the partial derivative vθθi=φi\frac{\partial v_{\boldsymbol{\theta}}}{\partial \theta_i} = \varphi_i recovers the fixed basis functions of Galerkin. For nonlinear parameterizations, the test functions change with θ\boldsymbol{\theta}, and the orthogonality conditions define a nonlinear system solved iteratively, typically by gradient descent.
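As a sketch of the nonlinear case, the snippet below minimizes the weighted Bellman residual for a tiny hand-written parameterization, differentiating through the full residual (the residual-gradient route) and using finite-difference gradients so that the example stays dependency-free. The toy chain, the three-parameter approximator, the step size, and the iteration count are all illustrative assumptions, not a recommended setup.

```python
import numpy as np

# A toy policy-evaluation problem; every number here is illustrative.
states = np.linspace(-1.0, 1.0, 5)
n = len(states)
P = np.full((n, n), 1.0 / n)            # uniform transition matrix P_pi
r = states.copy()                        # reward vector r_pi (true solution is v(s) = s)
gamma, w = 0.9, np.full(n, 1.0 / n)      # discount and weights for ||.||_w

def v_theta(theta):
    # a small nonlinear approximator: v(s) = a + b * tanh(c * s)
    a, b, c = theta
    return a + b * np.tanh(c * states)

def residual_norm_sq(theta):
    v = v_theta(theta)
    return np.sum(w * (v - (r + gamma * P @ v)) ** 2)   # ||v_theta - L_pi v_theta||_w^2

def grad_fd(f, theta, eps=1e-6):
    # central finite differences keep the sketch free of autodiff dependencies
    g = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        g[i] = (f(theta + step) - f(theta - step)) / (2.0 * eps)
    return g

theta = np.array([0.0, 0.5, 1.0])        # illustrative initialization
for _ in range(5000):
    theta -= 0.02 * grad_fd(residual_norm_sq, theta)
print(residual_norm_sq(theta))           # remaining weighted Bellman residual
```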

The dual pairing formulation Legrand & Junca (2025) extends this framework to settings where test objects need not be regular functions. We have been informal about this distinction in our treatment of collocation, but the Dirac deltas δ(xxi)\delta(x - x_i) used there are not classical functions—they are distributions, defined rigorously only through their action on test functions via N(v),δ(xxi)=(Nv)(xi)\langle \Residual(v), \delta(x - x_i) \rangle = (\Residual v)(x_i). The simple calculus argument for orthogonality does not apply directly to such objects; the dual pairing framework provides the proper mathematical foundation. The induced dual norm N(v)=supw=1N(v),w\|\Residual(v)\|_* = \sup_{\|w\|=1} |\langle \Residual(v), w \rangle| measures residuals by their worst-case effect on test functions, a perspective that has inspired adversarial formulations Zang et al. (2020) where both trial and test functions are learned.

The minimum residual framework thus connects classical projection methods to modern function approximation. The unifying principle is orthogonality of residuals to test functions. Linear methods use fixed test functions and admit closed-form solutions. Nonlinear methods use parameter-dependent test functions and require iterative optimization. Convergence guarantees are well understood for linear projections (monotonicity, matched weightings) but remain an active research area for neural networks. The next chapter extends this framework to the simulation-based setting, where the Bellman operator itself must be approximated from samples rather than computed exactly.

References
  1. Chakraverty, S., Mahato, N. R., Karunakar, P., & Rao, T. D. (2019). Weighted Residual Methods. In Advanced Numerical and Semi-Analytical Methods for Differential Equations (pp. 25–44). John Wiley & Sons, Inc. 10.1002/9781119423461.ch3
  2. Atkinson, K. E., & Potra, F. A. (1987). Projection and Iterated Projection Methods for Nonlinear Integral Equations. SIAM Journal on Numerical Analysis, 24(6), 1352–1373. 10.1137/0724087
  3. Judd, K. L. (1992). Projection methods for solving aggregate growth models. Journal of Economic Theory, 58(2), 410–452.
  4. Judd, K. L. (1996). Approximation, perturbation, and projection methods in economic analysis. In H. M. Amman, D. A. Kendrick, & J. Rust (Eds.), Handbook of Computational Economics (Vol. 1, pp. 509–585). Elsevier.
  5. McGrattan, E. R. (1997). Application of Weighted Residual Methods to Dynamic Economic Models.
  6. Santos, M. S., & Vigo-Aguiar, J. (1998). Analysis of a numerical dynamic programming algorithm applied to economic models. Econometrica, 66(2), 409–426.
  7. Stachurski, J. (2009). Economic Dynamics: Theory and Computation. MIT Press.
  8. Gordon, G. J. (1995). Stable function approximation in dynamic programming. Proceedings of the Twelfth International Conference on Machine Learning, 261–268.
  9. Gordon, G. J. (1999). Approximate Solutions to Markov Decision Problems [PhD thesis]. Carnegie Mellon University.
  10. Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. Proceedings of the Twelfth International Conference on Machine Learning, 30–37.
  11. Legrand, M., & Junca, S. (2025). Weighted Residual Solution Methods.
  12. Zang, Y., Bao, G., Ye, X., & Zhou, H. (2020). Weak adversarial networks for high-dimensional partial differential equations. Journal of Computational Physics, 411, 109409. 10.1016/j.jcp.2020.109409