The Bellman optimality equation is a functional equation: an equation where the unknown is an entire function rather than a finite-dimensional vector. When the state space is continuous or very large, we cannot represent the value function exactly on a computer. We must instead work with finite-dimensional approximations. This motivates weighted residual methods (also called minimum residual methods), a general framework for transforming infinite-dimensional problems into tractable finite-dimensional ones (Chakraverty et al., 2019; Atkinson & Potra, 1987).
What Does It Mean for a Residual to Be Zero?¶
Consider a functional equation $N(f) = 0$, where $N$ is an operator and the unknown $f$ is an entire function (in our case, the Bellman optimality equation $v = Lv$, which we can write as $N(v) := v - Lv = 0$). Suppose we have found a candidate approximate solution $\hat{v}$. To verify it satisfies $N(\hat{v}) = 0$, we compute the residual function $R(s) = \hat{v}(s) - (L\hat{v})(s)$. For a true solution, this residual should be the zero function: $R(s) = 0$ for every state $s$.
How might we test whether a function $R$ is zero? One approach: sample many input points $s_1, \dots, s_m$, check whether $R(s_i) = 0$ at each, and summarize the results into a single scalar test by computing a weighted sum $\sum_i w_i R(s_i)$ with weights $w_i$. If $R$ is zero everywhere, this sum is zero. If $R$ is nonzero somewhere, we can choose points and weights to make the sum nonzero. For vectors in finite dimensions, the inner product implements exactly this idea: it tests whether $r = 0$ by weighting and summing. Indeed, a vector $r$ equals zero if and only if $\langle r, y \rangle = 0$ for every vector $y$. To see why, suppose $r \neq 0$. Choosing $y = r$ gives $\langle r, r \rangle = \|r\|^2 > 0$, contradicting the claim that all inner products vanish.
The same principle extends to functions. A function $R$ equals the zero function if and only if its “inner product” with every “test function” $\psi$ vanishes:

$$
\langle R, \psi \rangle_w = \int R(s)\, \psi(s)\, w(s)\, ds = 0 \quad \text{for every test function } \psi,
$$
where $w(s) > 0$ is a weight function that is part of the inner product definition. Why does this work? For the same reason as in finite dimensions: if $R$ is not the zero function, there must be some region where $R(s) \neq 0$. We can then choose a test function $\psi$ that is nonzero in that same region (for instance, $\psi = R$ itself), which will produce $\langle R, \psi \rangle_w > 0$, witnessing that $R$ is nonzero. Conversely, if $R$ is the zero function, then $\langle R, \psi \rangle_w = 0$ for any test function.
This ability to distinguish between different functions using inner products is a fundamental principle from functional analysis. Just as we can test a vector by taking inner products with other vectors, we can test a function by taking inner products with other functions.
Connection to Functional Analysis
The principle that “a function equals zero if and only if it has zero inner product with all test functions” is a consequence of the Hahn-Banach theorem, one of the cornerstones of functional analysis. The theorem guarantees that for any nonzero function $R$ in a suitable function space, there exists a continuous linear functional (which can be represented as an inner product with some test function $\psi$) that produces a nonzero value when applied to $R$. This is often phrased as “the dual space separates points.”
While you don’t need to know the Hahn-Banach theorem to use weighted residual methods, it provides the rigorous mathematical foundation ensuring that our inner product tests are theoretically sound. The constructive argument we gave above (choosing $\psi = R$) works in simple cases with well-behaved functions, but the Hahn-Banach theorem extends this guarantee to much more general settings.
Why is this useful? It transforms the pointwise condition “$R(s) = 0$ for all $s$” (infinitely many conditions, one per state) into an equivalent condition about inner products. We still cannot test against all possible test functions, since there are infinitely many of those too. But the inner product perspective suggests a natural computational strategy: choose a finite collection of test functions and use them to construct conditions that we can actually compute.
This suggests the weighted residual (or minimum residual) approach: choose test functions $\psi_1, \dots, \psi_n$ and a weight function $w$, then require the residual to have zero weighted inner product with each test function:

$$
\langle R(\cdot; c), \psi_i \rangle_w = \int R(s; c)\, \psi_i(s)\, w(s)\, ds = 0, \qquad i = 1, \dots, n,
$$
where the residual is $R(s; c) = \big(N\hat{v}(\cdot; c)\big)(s)$ and the approximation is $\hat{v}(s; c) = \sum_{j=1}^n c_j \phi_j(s)$. For the Bellman equation, $N(v) = v - Lv$, so $R(s; c) = \hat{v}(s; c) - \big(L\hat{v}(\cdot; c)\big)(s)$. This transforms the impossible task of verifying “$R(s) = 0$ for all $s$” into a finite-dimensional problem: find $n$ coefficients $c_1, \dots, c_n$ satisfying $n$ weighted integral conditions.
The weight function $w$ is part of the inner product definition and serves important purposes: it can emphasize certain regions of the state space, or represent a natural probability measure over states. In unweighted problems, we simply take $w(s) = 1$. In reinforcement learning applications, $w$ is often chosen as the stationary distribution $d^\pi$ under a policy $\pi$. This is exactly what happens in methods like LSTD (Least Squares Temporal Difference), which can be viewed as a Galerkin method with $w = d^\pi$.
Different choices of test functions give different methods, each with different computational and theoretical properties. When $\psi_i = \phi_i$ (test functions equal basis functions), the method is called Galerkin, and can be interpreted geometrically as a projection that forces the residual into the orthogonal complement of the approximation space. The rest of this chapter develops this framework systematically: how to choose basis functions $\phi_j$, how to select test functions $\psi_i$, and how to solve the resulting systems.
The General Framework¶
Consider an operator equation of the form

$$
N(f) = 0,
$$
where $N : B_1 \to B_2$ is a continuous operator between complete normed vector spaces $B_1$ and $B_2$. For the Bellman equation, we have $N(v) = v - Lv$, so that solving $N(v) = 0$ is equivalent to finding the fixed point $v = Lv$.
Just as we transcribed infinite-dimensional continuous optimal control problems into finite-dimensional discrete optimal control problems in earlier chapters, we seek a finite-dimensional approximation to this infinite-dimensional functional equation. Recall that for continuous optimal control, we adopted control parameterization: we represented the control trajectory using a finite set of basis functions (piecewise constants, polynomials, splines) and searched over the finite-dimensional coefficient space instead of the infinite-dimensional function space. For integrals in the objective and constraints, we used numerical quadrature to approximate them with finite sums.
We follow the same strategy here. We parameterize the value function using a finite set of basis functions $\phi_1, \dots, \phi_n$—commonly polynomials (Chebyshev, Legendre), though other function classes (splines, radial basis functions, neural networks) are possible—and search for coefficients $c \in \mathbb{R}^n$. When integrals appear in the Bellman operator or projection conditions, we approximate them using numerical quadrature. The projection method approach consists of several conceptual steps that accomplish this transcription.
Step 1: Choose a Finite-Dimensional Approximation Space¶
We begin by selecting a basis $\{\phi_j\}_{j=1}^n$ and approximating the unknown function as a linear combination:

$$
\hat{v}(s; c) = \sum_{j=1}^n c_j\, \phi_j(s).
$$
The choice of basis functions is problem-dependent. Common choices include:
Polynomials: For smooth problems, we might use Chebyshev polynomials or other orthogonal polynomial families
Splines: For problems where we expect the solution to have regions of different smoothness
Radial basis functions: For high-dimensional problems where tensor product methods become intractable
The number of basis functions $n$ determines the flexibility of our approximation. In practice, we start with small $n$ and increase it until the approximation quality is satisfactory. The only unknowns now are the coefficients $c = (c_1, \dots, c_n)$.
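As a concrete illustration, here is a minimal Python sketch of a Chebyshev basis and the resulting approximation $\hat{v}(s; c)$. The helper names (`chebyshev_basis`, `v_hat`) and the interval mapping are our own illustration, not a fixed API; only `numpy.polynomial.chebyshev.chebvander` is a library call.

```python
import numpy as np

def chebyshev_basis(s, n, lo=0.0, hi=1.0):
    """Evaluate phi_1..phi_n (Chebyshev polynomials T_0..T_{n-1}) at states s,
    after mapping [lo, hi] onto the canonical Chebyshev domain [-1, 1]."""
    x = 2.0 * (np.asarray(s, dtype=float) - lo) / (hi - lo) - 1.0
    return np.polynomial.chebyshev.chebvander(x, n - 1)  # shape (len(s), n)

def v_hat(s, c, lo=0.0, hi=1.0):
    """Approximation v_hat(s; c) = sum_j c_j * phi_j(s)."""
    return chebyshev_basis(s, len(c), lo, hi) @ c
```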
While the classical presentation of projection methods focuses on polynomial bases, the framework applies equally well to other function classes. Neural networks, for instance, can be viewed through this lens: a neural network with parameters $\theta$ defines a flexible function class, and many training procedures can be interpreted as projection methods with specific choices of test functions or residual norms. The distinction is that classical methods typically use predetermined basis functions with linear coefficients, while neural networks use adaptive nonlinear features. Throughout this chapter, we focus on the classical setting to develop the core concepts, but the principles extend naturally to modern function approximators.
Step 2: Define the Residual Function¶
Since we are approximating $v$ with $\hat{v}(\cdot; c)$, the operator will generally not vanish exactly. Instead, we obtain a residual function:

$$
R(s; c) = \big(N\hat{v}(\cdot; c)\big)(s).
$$
This residual measures how far our candidate solution is from satisfying the equation at each point $s$. As we discussed in the introduction, we will assess whether this residual is “close to zero” by testing its inner products against chosen test functions.
Step 3: Impose Conditions to Determine Coefficients¶
Having chosen our basis and defined the residual, we must decide how to make the residual “close to zero.” As discussed in the introduction, we test the residual using weighted integrals. We select test functions $\psi_1, \dots, \psi_n$ and a weight function $w$ (often $w(s) = 1$ for unweighted problems), then impose the weighted residual conditions:

$$
\int R(s; c)\, \psi_i(s)\, w(s)\, ds = 0, \qquad i = 1, \dots, n.
$$
These integral conditions provide $n$ equations to determine the $n$ unknown coefficients in $c$. Different choices of test functions give different methods:
Galerkin: $\psi_i = \phi_i$ (test with the basis functions themselves)
Collocation: $\psi_i(s) = \delta(s - s_i)$ for chosen points $s_i$ (test at specific points)
Method of moments: $\psi_i(s) = s^{i-1}$ (test with monomials)
An alternative is the least squares approach, which minimizes the weighted norm of the residual:

$$
\min_{c} \int R(s; c)^2\, w(s)\, ds.
$$
We focus primarily on methods distinguished by their choice of test functions $\psi_i$. The Galerkin method, where $\psi_i = \phi_i$, can be interpreted geometrically as a projection when working in a Hilbert space with a weighted inner product.
Let us examine the standard choices of test functions and what they tell us about the residual:
Galerkin Method: Test Against the Basis¶
The Galerkin method chooses test functions $\psi_i = \phi_i$, the same basis functions used to approximate $\hat{v}$:

$$
\int R(s; c)\, \phi_i(s)\, w(s)\, ds = 0, \qquad i = 1, \dots, n.
$$
To understand what this means, recall that in finite dimensions, two vectors are orthogonal when their inner product is zero. For functions, $\langle R, \phi_i \rangle_w = 0$ expresses the same concept: $R$ and $\phi_i$ are orthogonal as functions with respect to the weighted inner product. But there’s more to this than just testing against individual basis functions.
Consider our approximation space $\mathcal{V}_n = \operatorname{span}\{\phi_1, \dots, \phi_n\}$ as an $n$-dimensional subspace within the infinite-dimensional space of all functions. Any function in this space can be written as $g = \sum_i a_i \phi_i$ for some coefficients $a_i$. If the residual is orthogonal to all basis functions $\phi_i$, then by linearity of the inner product, for any such function $g$:

$$
\langle R, g \rangle_w = \sum_i a_i\, \langle R, \phi_i \rangle_w = 0.
$$
This shows that $R$ is orthogonal to every function we can represent with our basis. The residual has “zero overlap” with our approximation space: we cannot express any part of it using our basis functions. In this sense, the residual is as “invisible” to our approximation as possible.
This condition is the defining property of optimality. By choosing our approximation so that the residual is orthogonal to the entire approximation space, we ensure that $\hat{v}$ is the orthogonal projection of the true solution onto $\mathcal{V}_n$. Within this $n$-dimensional space, no better choice is possible: any other coefficients would yield a residual with a nonzero component inside the space, and therefore a larger norm.
The finite-dimensional analogy makes this concrete. Suppose you want to approximate a vector $v = (v_1, v_2, v_3) \in \mathbb{R}^3$ using only the $xy$-plane (a 2D subspace). The best approximation is to project $v$ onto the plane, giving $\hat{v} = (v_1, v_2, 0)$. The error is $v - \hat{v} = (0, 0, v_3)$, which points purely in the $z$-direction, orthogonal to the entire $xy$-plane. We see the Galerkin condition in action: the error is orthogonal to the approximation space.
Method of Moments: Test Against Monomials¶
The method of moments, for problems on an interval of the real line, chooses test functions $\psi_i(s) = s^{i-1}$ for $i = 1, \dots, n$:

$$
\int R(s; c)\, s^{i-1}\, w(s)\, ds = 0, \qquad i = 1, \dots, n.
$$
This requires the first $n$ moments of the residual function to vanish, ensuring the residual is “balanced” in the sense that it has no systematic trend captured by low-order polynomials. The moments measure weighted averages of the residual, with increasing powers of $s$ giving more weight to larger values. Setting these to zero ensures the residual doesn’t grow systematically with $s$. This approach is particularly useful when $w$ is chosen as a probability measure, making the conditions natural moment restrictions familiar from statistics and econometrics.
Collocation Method: Test Against Delta Functions¶
The collocation method chooses test functions $\psi_i(s) = \delta(s - s_i)$, the Dirac delta functions at points $s_1, \dots, s_n$:

$$
\int R(s; c)\, \delta(s - s_i)\, ds = R(s_i; c) = 0, \qquad i = 1, \dots, n.
$$
This is projection against the most localized test functions possible: delta functions that “sample” the residual at specific points, requiring the residual to vanish exactly where we test it. The computational advantage is significant: collocation avoids numerical integration entirely, requiring only pointwise evaluation of $R$.
Orthogonal collocation combines collocation with spectral basis functions (orthogonal polynomials like Chebyshev, Legendre, or Hermite) for smooth problems. We choose the $n$ collocation points at the zeros of the $n$-th-degree polynomial in the family. For example, with Chebyshev polynomials $T_0, \dots, T_{n-1}$ as the basis, we place collocation points at the zeros of $T_n$.
These zeros are also optimal nodes for Gauss quadrature with the associated weight function. This coordination means:
We get the computational simplicity of collocation: just pointwise evaluation
When we need integrals (inside the Bellman operator), the collocation points double as quadrature nodes with exactness for polynomials up to degree $2n - 1$
For smooth problems, spectral approximations achieve exponential convergence: the error decreases like $O(\rho^{-n})$ for some $\rho > 1$ as we add basis functions, compared to algebraic rates $O(n^{-k})$ for piecewise polynomials
This approach is often called a pseudospectral method or spectral collocation method. The Chebyshev interpolation theorem guarantees that forcing $R(s_i; c) = 0$ at these carefully chosen points keeps $R$ small everywhere, with well-conditioned systems and near-optimal interpolation error.
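A small companion helper, under the same assumptions as the basis sketch above, computes the pseudospectral collocation points: the zeros of $T_m$, mapped from $[-1, 1]$ to the problem domain.

```python
import numpy as np

def chebyshev_nodes(m, lo=0.0, hi=1.0):
    """Collocation points: the m zeros of the Chebyshev polynomial T_m,
    mapped from [-1, 1] to the problem domain [lo, hi]."""
    k = np.arange(1, m + 1)
    x = np.cos((2 * k - 1) * np.pi / (2 * m))  # zeros of T_m on [-1, 1]
    return lo + (hi - lo) * (x + 1.0) / 2.0
```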
Subdomain Method: Test Against Indicator Functions¶
The subdomain method partitions the domain into $n$ subregions $\Omega_1, \dots, \Omega_n$ and chooses test functions $\psi_i = \mathbf{1}_{\Omega_i}$, the indicator functions:

$$
\int_{\Omega_i} R(s; c)\, w(s)\, ds = 0, \qquad i = 1, \dots, n.
$$
This requires the residual to have zero average over each subdomain, ensuring the approximation is good “on average” over each piece of the domain. This approach is particularly natural for finite element methods where the domain is divided into elements, ensuring local balance of the residual within each element.
Least Squares¶
The least squares approach appears different at first glance, but it also fits the test function framework. We minimize:

$$
\min_{c} \int R(s; c)^2\, w(s)\, ds.
$$
The first-order conditions for this minimization problem are:

$$
\int R(s; c)\, \frac{\partial R(s; c)}{\partial c_i}\, w(s)\, ds = 0, \qquad i = 1, \dots, n.
$$
Thus least squares implicitly uses test functions $\psi_i = \partial R(\cdot; c)/\partial c_i$, the gradients of the residual with respect to the parameters. Unlike other methods where test functions are chosen a priori, here they depend on the current guess for $c$ and on the structure of our approximation.
We can now see the unifying structure of weighted residual methods: whether we use projection conditions or least squares minimization, all these methods follow the same template of restricting the search to an $n$-dimensional function space and imposing $n$ conditions on the residual. For projection methods specifically, we pick test functions $\psi_i$ and require $\langle R, \psi_i \rangle_w = 0$. They differ only in their philosophy about which test functions best detect whether the residual is “nearly zero.” Galerkin tests against the approximation basis itself (natural for orthogonal bases), the method of moments tests against monomials (ensuring polynomial balance), collocation tests against delta functions (pointwise satisfaction), subdomain tests against indicators (local average satisfaction), and least squares tests against residual gradients (global norm minimization). Each choice reflects different priorities: computational efficiency, theoretical optimality, ease of implementation, or sensitivity to errors in different regions of the domain.
Step 4: Solve the Finite-Dimensional Problem¶
The projection conditions give us a system to solve for the coefficients $c$. For test function methods (Galerkin, collocation, moments, subdomain), we solve:

$$
F_i(c) = \int R(s; c)\, \psi_i(s)\, w(s)\, ds = 0, \qquad i = 1, \dots, n.
$$
This is a system of $n$ (generally nonlinear) equations in $n$ unknowns. For least squares, we solve the optimization problem $\min_c \int R(s; c)^2\, w(s)\, ds$.
Computational Cost and Conditioning¶
The computational cost per iteration varies significantly across methods:
Collocation: Cheapest to evaluate since $F_i(c) = R(s_i; c)$ requires only pointwise evaluation (no integration). The Jacobian is also cheap: $\partial F_i/\partial c_j = \partial R(s_i; c)/\partial c_j$.
Galerkin and moments: More expensive due to integration. Computing each $F_i(c)$ requires numerical quadrature. Each Jacobian entry requires integrating $\partial R/\partial c_j$ against the test function $\psi_i$.
Least squares: Most expensive when done via the objective function, which requires integrating $R(s; c)^2$. However, the first-order conditions reduce it to a system like Galerkin’s, with test functions $\partial R/\partial c_i$.
For methods requiring integration, the choice of quadrature rule should match the basis. Gaussian quadrature with nodes at orthogonal polynomial zeros is efficient. When combined with collocation at those same points, the quadrature is exact for polynomials up to a certain degree. This coordination between quadrature and collocation makes orthogonal collocation effective.
The conditioning of the system depends on the choice of test functions. The Jacobian matrix has entries:

$$
J_{ij}(c) = \frac{\partial F_i}{\partial c_j}(c) = \int \frac{\partial R(s; c)}{\partial c_j}\, \psi_i(s)\, w(s)\, ds.
$$
When test functions are orthogonal (or nearly so), the Jacobian tends to be well-conditioned. This is why orthogonal polynomial bases are preferred in Galerkin methods: they produce Jacobians with controlled condition numbers. Poorly chosen basis functions or collocation points can lead to nearly singular Jacobians, causing numerical instability. Orthogonal bases and carefully chosen collocation points (like Chebyshev nodes) help maintain good conditioning.
Two Main Solution Approaches¶
We have two fundamentally different ways to solve the projection equations: function iteration (exploiting fixed-point structure) and Newton’s method (exploiting smoothness). The choice depends on whether the original operator equation has contraction properties and how well those properties are preserved by the finite-dimensional approximation.
Method 1: Function Iteration (Successive Approximation)¶
When the operator equation has the form $v = Lv$ where $L$ is a contraction, the most natural approach is to iterate the operator directly:

$$
v^{(k+1)} = L v^{(k)}.
$$
The infinite-dimensional iteration $v^{(k+1)} = L v^{(k)}$ becomes a finite-dimensional iteration in coefficient space once we choose our weighted residual method. Given a current approximation $\hat{v}(\cdot; c^{(k)})$, how do we find the coefficients $c^{(k+1)}$ for the next iterate $\hat{v}(\cdot; c^{(k+1)})$?
Different weighted residual methods answer this differently. For collocation, we proceed in two steps:
Evaluate the operator: At each collocation point $s_i$, compute what the next iterate should be: $y_i = \big(L\hat{v}(\cdot; c^{(k)})\big)(s_i)$. These target values tell us what $\hat{v}(\cdot; c^{(k+1)})$ should equal at the collocation points.
Find matching coefficients: Determine $c^{(k+1)}$ so that $\hat{v}(s_i; c^{(k+1)}) = y_i$ for all $i$. This is a linear system: $\sum_j c_j^{(k+1)} \phi_j(s_i) = y_i$.
In matrix form: $\Phi c^{(k+1)} = y$, where $\Phi$ is the collocation matrix with entries $\Phi_{ij} = \phi_j(s_i)$. Solving this system gives $c^{(k+1)} = \Phi^{-1} y$.
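In code, one function-iteration step for collocation is a target evaluation followed by a linear solve. This is a minimal sketch: `bellman_target` is a hypothetical callback that returns $\big(L\hat{v}(\cdot; c)\big)(s_i)$ at each node (including any embedded maximization and quadrature), and `basis` returns the collocation matrix $\Phi$.

```python
import numpy as np

def collocation_step(c, nodes, basis, bellman_target):
    """One function-iteration step for collocation.
    basis(nodes, n) returns Phi with Phi[i, j] = phi_j(s_i);
    bellman_target(nodes, c) returns y with y[i] = (L v_hat(.; c))(s_i)."""
    Phi = basis(nodes, len(c))
    y = bellman_target(nodes, c)
    return np.linalg.solve(Phi, y)  # solve Phi c_next = y
```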
For Galerkin, the projection condition $\big\langle \hat{v}(\cdot; c^{(k+1)}) - L\hat{v}(\cdot; c^{(k)}),\, \phi_i \big\rangle_w = 0$ directly gives a system for $c^{(k+1)}$. When the operator is linear in its argument (as in many integral equations), this is a linear system. When it is nonlinear (as in the Bellman equation), we must solve a nonlinear system at each iteration—though each solution still only involves $n$ unknowns rather than an infinite-dimensional function.
When $L$ is a contraction in the infinite-dimensional space with constant $\gamma < 1$, iterating it pulls any starting function toward the unique fixed point. The hope is that the finite-dimensional operator—evaluating $L$ and projecting back onto the span of the basis functions—inherits this contraction property. When it does, function iteration converges globally from any initial guess, with each iteration reducing the error by a factor of roughly $\gamma$. This is computationally attractive: we only evaluate the operator and solve a linear system (for collocation) or a relatively simple system (for other methods).
However, the finite-dimensional approximation doesn’t always preserve contraction. High-order polynomial bases, in particular, can create oscillations between basis functions that amplify rather than contract errors. Even when contraction is preserved, convergence can be painfully slow when $\gamma$ is close to 1—the “weak contraction” regime common in economic problems with patient agents ($\gamma = 0.99$ or higher). Finally, not all operator equations naturally present themselves as contractions; some require reformulation into fixed-point form (such as $v = v - \sigma N(v)$ for a damping parameter $\sigma$), and finding a good reformulation can be problem-specific.
Method 2: Newton’s Method¶
Alternatively, we can treat the projection equations as a rootfinding problem $F(c) = 0$, where $F_i(c) = \langle R(\cdot; c), \psi_i \rangle_w$ for test function methods, or solve the first-order conditions for least squares. Newton’s method uses the update:

$$
c^{(k+1)} = c^{(k)} - J(c^{(k)})^{-1} F(c^{(k)}),
$$
where $J(c^{(k)})$ is the Jacobian of $F$ at $c^{(k)}$.
Computing the Jacobian: We must compute $J_{ij} = \partial F_i/\partial c_j$. For collocation, $F_i(c) = \hat{v}(s_i; c) - \big(L\hat{v}(\cdot; c)\big)(s_i)$, so:

$$
J_{ij} = \phi_j(s_i) - \frac{\partial \big(L\hat{v}(\cdot; c)\big)(s_i)}{\partial c_j}.
$$
The first term is straightforward (it’s just $\phi_j(s_i)$ for a linear approximation). The second term requires differentiating the operator with respect to the parameters.
When $L$ involves optimization (as in the Bellman operator $(Lv)(s) = \max_a \big[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, v(s') \big]$), computing this derivative appears problematic because the max operator is not differentiable. However, the Envelope Theorem resolves this difficulty.
With the Envelope Theorem providing a tractable way to compute Jacobians for problems involving optimization, Newton’s method becomes practical for weighted residual methods applied to Bellman equations and similar problems. The method offers quadratic convergence near the solution. Once in the neighborhood of the true fixed point, Newton’s method typically converges in just a few iterations. Unlike function iteration, it doesn’t rely on the finite-dimensional approximation preserving any contraction property, making it applicable to a broader range of problems, particularly those with high-order polynomial bases or large discount factors where function iteration struggles.
However, Newton’s method demands more from both the algorithm and the user. Each iteration requires computing and solving a full Jacobian system, making the per-iteration cost significantly higher than function iteration. The method is also sensitive to initialization: started far from the solution, Newton’s method may diverge or converge to spurious fixed points that the finite-dimensional problem introduces but the original infinite-dimensional problem lacks. When applying the Envelope Theorem, implementation becomes more complex. We must track the optimal action at each evaluation point and compute the Jacobian entries using the formula above (expected basis function values at next states under optimal actions), though the economic interpretation (tracking how value propagates through optimal decisions) often makes the computation conceptually clearer than explicit derivative calculations would be.
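A generic Newton loop over the projection equations can be sketched in a few lines; `F` and `jac` are user-supplied callables returning $F(c)$ and $J(c)$ (for Bellman problems, $J$ would be assembled via the Envelope Theorem as above). The names and stopping rule are our own illustration.

```python
import numpy as np

def newton_solve(F, jac, c0, tol=1e-10, max_iter=50):
    """Newton iteration for the projection equations F(c) = 0:
    c <- c - J(c)^{-1} F(c), stopping when the step is small."""
    c = np.asarray(c0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(jac(c), F(c))
        c = c - step
        if np.linalg.norm(step, ord=np.inf) < tol:
            break
    return c
```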
Comparison and Practical Recommendations¶
| Method | Convergence | Per-iteration cost | Initial guess sensitivity |
|---|---|---|---|
| Function iteration | Linear (when contraction holds) | Low | Robust |
| Newton’s method | Quadratic (near solution) | Moderate (Jacobian + solve) | Requires good initial guess |
Which method to use? When the problem has strong contraction (small $\gamma$, well-conditioned bases, shape-preserving approximations like linear interpolation or splines), function iteration is simple and robust. For weak contraction (large $\gamma$, high-order polynomials), a hybrid approach works well: run function iteration for several iterations to enter the basin of attraction, then switch to Newton’s method for rapid final convergence. When the finite-dimensional approximation destroys contraction entirely (common with non-monotone bases), Newton’s method may be necessary from the start, though careful initialization (from a coarser approximation or perturbation methods) is required.
Quasi-Newton methods like BFGS or Broyden offer a middle ground. They approximate the Jacobian using function evaluations only, avoiding explicit derivative computations while maintaining superlinear convergence. This can be useful when computing the exact Jacobian via the Envelope Theorem is expensive or when the approximation quality is acceptable.
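For instance, SciPy’s rootfinder exposes Broyden’s method directly; here it is applied to a toy nonlinear system standing in for the projection equations (the system itself is purely illustrative).

```python
import numpy as np
from scipy.optimize import root

def F(c):
    """Toy projection-style system F(c) = 0."""
    return np.array([c[0] + 0.5 * np.tanh(c[1]) - 1.0,
                     c[1] - 0.5 * np.tanh(c[0])])

# Broyden's method builds a Jacobian approximation from function values alone.
sol = root(F, x0=np.zeros(2), method="broyden1")
print(sol.x)
```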
Step 5: Verify the Solution¶
Once we have computed a candidate solution $\hat{v}(\cdot; \hat{c})$, we must verify its quality. Projection methods optimize with respect to specific criteria (specific test functions or collocation points), but we should check that the residual is small everywhere, including directions or points we did not optimize over.
Typical diagnostic checks include the following (a minimal residual check is sketched after the list):
Computing the residual norm $\int R(s; \hat{c})^2\, w(s)\, ds$ using a more accurate quadrature rule than was used in the optimization
Evaluating $R(s; \hat{c})$ at many points $s$ not used in the fitting process
If using Galerkin with the first $n$ basis functions, checking orthogonality against higher-order basis functions $\phi_{n+1}, \phi_{n+2}, \dots$
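A minimal version of the dense-grid check, assuming a vectorized `residual(s, c)` that returns $R(s; c)$ pointwise:

```python
import numpy as np

def max_abs_residual(residual, c, lo=0.0, hi=1.0, n_check=10_000):
    """Evaluate |R(s; c)| on a dense grid of states not used in fitting
    and report the worst violation of the functional equation."""
    s_check = np.linspace(lo, hi, n_check)
    return np.max(np.abs(residual(s_check, c)))
```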
Application to the Bellman Equation¶
Consider the Bellman optimality equation $v = Lv$ with $(Lv)(s) = \max_a \big[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, v(s') \big]$. For a candidate approximation $\hat{v}(\cdot; c)$, the residual is:

$$
R(s; c) = \hat{v}(s; c) - \max_a \Big[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, \hat{v}(s'; c) \Big].
$$
We examine how collocation and Galerkin—the two most common weighted residual methods for Bellman equations—specialize the general solution approaches from Step 4.
Collocation¶
For collocation, we choose states $s_1, \dots, s_n$ and require the Bellman equation to hold exactly at these points:

$$
\hat{v}(s_i; c) = \max_a \Big[ r(s_i, a) + \gamma \sum_{s'} p(s' \mid s_i, a)\, \hat{v}(s'; c) \Big], \qquad i = 1, \dots, n.
$$
It helps to define the parametric Bellman operator $\hat{L} : \mathbb{R}^n \to \mathbb{R}^n$ by $(\hat{L}c)_i = \big(L\hat{v}(\cdot; c)\big)(s_i)$, the Bellman operator evaluated at collocation point $s_i$. Let $\Phi$ be the matrix with entries $\Phi_{ij} = \phi_j(s_i)$. Then the collocation equations become $\Phi c = \hat{L}c$.
Function iteration for collocation proceeds as follows. Given current coefficients $c^{(k)}$, we evaluate the Bellman operator at each collocation point to get target values $y = \hat{L}c^{(k)}$. We then find new coefficients by solving the linear system $\Phi c^{(k+1)} = y$. This is parametric value iteration: apply the Bellman operator, fit the result.
When the state space is continuous, we approximate expectations using numerical quadrature (Gauss-Hermite for normal shocks, etc.). The method is simple and robust when the finite-dimensional approximation preserves contraction, but can be slow for large discount factors.
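For concreteness, here is one way to build the expectation with Gauss-Hermite quadrature in NumPy; the transition function `g` (next state $s' = g(s, a) + \sigma \varepsilon$ with $\varepsilon \sim N(0, 1)$) and the 7-node rule are illustrative assumptions.

```python
import numpy as np

# Probabilists' Hermite rule: sum_i w_i f(x_i) approximates
# integral of f(x) exp(-x^2/2) dx, so divide by sqrt(2*pi) for E[f(eps)].
eps_nodes, eps_weights = np.polynomial.hermite_e.hermegauss(7)
eps_weights = eps_weights / np.sqrt(2.0 * np.pi)

def expected_v(v_hat_fn, c, s, a, g, sigma):
    """Approximate E[v_hat(g(s, a) + sigma * eps; c)] with a 7-node rule."""
    s_next = g(s, a) + sigma * eps_nodes
    return eps_weights @ v_hat_fn(s_next, c)
```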
Newton’s method for collocation treats the problem as rootfinding: $F(c) = \Phi c - \hat{L}c = 0$. The Jacobian is $J(c) = \Phi - \partial \hat{L}c/\partial c$, where the Envelope Theorem (Step 4) gives us $\big(\partial \hat{L}c/\partial c\big)_{ij} = \gamma \sum_{s'} p(s' \mid s_i, a_i^\star)\, \phi_j(s')$. Here $a_i^\star$ is the optimal action at collocation point $s_i$ given the current coefficients.
This converges rapidly near the solution but requires good initialization and more computation per iteration than function iteration. The method is equivalent to policy iteration: each step evaluates the value of the current greedy policy, then improves it.
Why is collocation popular for Bellman equations? Because it avoids integration when testing the residual. We only evaluate the Bellman operator at discrete points. In contrast, Galerkin requires integrating the residual against each basis function.
Galerkin¶
For Galerkin, we use the basis functions themselves as test functions. The conditions are:

$$
\int \Big[ \hat{v}(s; c) - \big(L\hat{v}(\cdot; c)\big)(s) \Big]\, \phi_i(s)\, w(s)\, ds = 0, \qquad i = 1, \dots, n,
$$
where $w(s)$ is a weight function (often the stationary distribution $d^\pi$ in RL applications, or simply $w(s) = 1$). Expanding this:

$$
\int \hat{v}(s; c)\, \phi_i(s)\, w(s)\, ds = \int \max_a \Big[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, \hat{v}(s'; c) \Big] \phi_i(s)\, w(s)\, ds.
$$
Function iteration for Galerkin works differently than for collocation. Given $c^{(k)}$, we cannot simply evaluate the Bellman operator and fit. Instead, we must solve an integral equation. At each iteration, we seek $c^{(k+1)}$ satisfying:

$$
\int \hat{v}(s; c^{(k+1)})\, \phi_i(s)\, w(s)\, ds = \int \big(L\hat{v}(\cdot; c^{(k)})\big)(s)\, \phi_i(s)\, w(s)\, ds, \qquad i = 1, \dots, n.
$$
The left side is a linear system in $c^{(k+1)}$ (through the “mass matrix” $M_{ij} = \int \phi_j(s)\, \phi_i(s)\, w(s)\, ds$), and the right side requires integrating the Bellman operator output against each test function. When the basis functions are orthogonal polynomials with matching weight $w$, the mass matrix is diagonal, simplifying the solve. But we still need numerical integration to evaluate the right side. This makes Galerkin substantially more expensive than collocation per iteration.
Newton’s method for Galerkin similarly requires integration. The residual is $R(s; c) = \hat{v}(s; c) - \big(L\hat{v}(\cdot; c)\big)(s)$, and we need $F_i(c) = \int R(s; c)\, \phi_i(s)\, w(s)\, ds = 0$. The Jacobian entry is:

$$
J_{ij} = \int \Big[ \phi_j(s) - \frac{\partial \big(L\hat{v}(\cdot; c)\big)(s)}{\partial c_j} \Big]\, \phi_i(s)\, w(s)\, ds.
$$
The Envelope Theorem gives $\partial \big(L\hat{v}(\cdot; c)\big)(s)/\partial c_j = \gamma \sum_{s'} p(s' \mid s, a^\star(s))\, \phi_j(s')$, so we must integrate expected basis function values (under optimal actions) against test functions and weight. This requires both numerical integration and careful tracking of optimal actions across the state space, making it substantially more complex than collocation’s pointwise evaluation.
The advantage of Galerkin over collocation lies in its theoretical properties: when using orthogonal polynomials, Galerkin provides optimal approximation in the weighted $L^2$ norm. For smooth problems, this can yield better accuracy per degree of freedom than collocation. In practice, collocation’s computational simplicity usually outweighs Galerkin’s theoretical optimality for Bellman equations, especially in high-dimensional problems where integration becomes prohibitively expensive.
Galerkin for Discrete MDPs: LSTD and LSPI¶
When the state space is discrete and finite, the Galerkin conditions simplify dramatically. The integrals become sums, and we can write everything in matrix form. This specialization shows the connection to algorithms widely used in reinforcement learning.
For a discrete state space $S$, the Galerkin orthogonality conditions

$$
\langle R(\cdot; c),\, \phi_i \rangle_w = 0, \qquad i = 1, \dots, n,
$$

become weighted sums over states:

$$
\sum_{s \in S} R(s; c)\, \phi_i(s)\, d(s) = 0, \qquad i = 1, \dots, n,
$$
where $d$ with $d(s) \ge 0$ and $\sum_s d(s) = 1$ is a probability distribution over states. Define the feature matrix $\Phi \in \mathbb{R}^{|S| \times n}$ with entries $\Phi_{si} = \phi_i(s)$ (each row contains the features for one state), and let $D = \operatorname{diag}(d)$ be the diagonal matrix with the state distribution on the diagonal.
Policy Evaluation: LSTD¶
For policy evaluation with a fixed policy $\pi$, the Bellman operator is linear:

$$
(L_\pi v)(s) = r(s, \pi(s)) + \gamma \sum_{s'} p(s' \mid s, \pi(s))\, v(s').
$$
With linear function approximation $\hat{v} = \Phi c$, this becomes:

$$
(L_\pi \Phi c)(s) = r(s, \pi(s)) + \gamma \sum_{s'} p(s' \mid s, \pi(s))\, (\Phi c)(s').
$$
Let $r_\pi$ be the vector of rewards $r(s, \pi(s))$, and $P_\pi$ be the transition matrix with entries $(P_\pi)_{ss'} = p(s' \mid s, \pi(s))$. Then $L_\pi \Phi c = r_\pi + \gamma P_\pi \Phi c$ in vector form.
The Galerkin conditions require $\langle \Phi c - L_\pi \Phi c,\, \phi_i \rangle_d = 0$ for all basis functions, which in matrix form is:

$$
\Phi^\top D \big( \Phi c - r_\pi - \gamma P_\pi \Phi c \big) = 0.
$$
Rearranging:

$$
\Phi^\top D\, (\Phi - \gamma P_\pi \Phi)\, c = \Phi^\top D\, r_\pi.
$$
This is the LSTD (Least Squares Temporal Difference) solution. The matrix $A = \Phi^\top D (\Phi - \gamma P_\pi \Phi)$ and vector $b = \Phi^\top D\, r_\pi$ give the linear system $Ac = b$.
When $d$ is the stationary distribution of policy $\pi$ (so $d^\top P_\pi = d^\top$), this system has a unique solution, and the projected Bellman operator $\Pi_D L_\pi$ is a contraction in the weighted norm $\|\cdot\|_D$. This is the theoretical foundation for TD learning with linear function approximation.
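The closed-form solution is a few lines of linear algebra when the model ($P_\pi$, $r_\pi$) and the weighting $d$ are available explicitly; this sketch simply mirrors the equations above.

```python
import numpy as np

def lstd(Phi, d, P_pi, r_pi, gamma):
    """Model-based LSTD: solve A c = b with
    A = Phi^T D (Phi - gamma P_pi Phi),  b = Phi^T D r_pi."""
    D = np.diag(d)
    A = Phi.T @ D @ (Phi - gamma * P_pi @ Phi)
    b = Phi.T @ D @ r_pi
    return np.linalg.solve(A, b)
```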
The Bellman Optimality Equation: Function Iteration and Newton’s Method¶
For the Bellman optimality equation, the max operator introduces nonlinearity:

$$
(Lv)(s) = \max_a \Big[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, v(s') \Big].
$$
The Galerkin conditions become:

$$
F(c) = \Phi^\top D \big( \Phi c - L(\Phi c) \big) = 0,
$$
where the Bellman operator $L$ must be evaluated at each state to find the optimal action and compute the target value. This is a system of $n$ nonlinear equations in $n$ unknowns.
Function iteration applies the Bellman operator and projects back. Given $c^{(k)}$, compute the greedy policy $\pi_k$ at each state, then solve:

$$
\Phi^\top D\, \Phi\, c^{(k+1)} = \Phi^\top D \big( r_{\pi_k} + \gamma P_{\pi_k} \Phi c^{(k)} \big).
$$
Each iteration performs one projected Bellman backup under the current greedy policy; the policy is implicitly improved when the next iteration recomputes the greedy policy. However, convergence can be slow when the finite-dimensional approximation poorly preserves contraction.
Newton’s method treats $F(c) = \Phi^\top D \big( \Phi c - L(\Phi c) \big) = 0$ as a rootfinding problem and uses the Jacobian to accelerate convergence. The Jacobian of $F$ is:

$$
J(c) = \Phi^\top D \Big( \Phi - \frac{\partial L(\Phi c)}{\partial c} \Big).
$$
To compute $\partial L(\Phi c)/\partial c$, we use the Envelope Theorem from Step 4. At the current $c$, let $a^\star(s)$ be the optimal action at state $s$. Then:

$$
\frac{\partial \big(L(\Phi c)\big)(s)}{\partial c_j} = \gamma \sum_{s'} p(s' \mid s, a^\star(s))\, \phi_j(s').
$$
Define the greedy policy $\pi_c(s) = a^\star(s)$. The Jacobian becomes:

$$
J(c) = \Phi^\top D\, (\Phi - \gamma P_{\pi_c} \Phi).
$$
The Newton update simplifies. Writing $\pi_k$ for the greedy policy at $c^{(k)}$, we have:

$$
c^{(k+1)} = c^{(k)} - \big[ \Phi^\top D (\Phi - \gamma P_{\pi_k} \Phi) \big]^{-1} \Phi^\top D \big( \Phi c^{(k)} - L(\Phi c^{(k)}) \big).
$$
At each state $s$, the greedy value is $\big(L(\Phi c^{(k)})\big)(s) = \max_a \big[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, (\Phi c^{(k)})(s') \big]$, which equals $r(s, \pi_k(s)) + \gamma \sum_{s'} p(s' \mid s, \pi_k(s))\, (\Phi c^{(k)})(s')$. Thus:

$$
L(\Phi c^{(k)}) = r_{\pi_k} + \gamma P_{\pi_k} \Phi c^{(k)}.
$$
The Newton step becomes:

$$
c^{(k+1)} = c^{(k)} - \big[ \Phi^\top D (\Phi - \gamma P_{\pi_k} \Phi) \big]^{-1} \Phi^\top D \big( \Phi c^{(k)} - r_{\pi_k} - \gamma P_{\pi_k} \Phi c^{(k)} \big).
$$
Multiplying through and simplifying:

$$
\Phi^\top D\, (\Phi - \gamma P_{\pi_k} \Phi)\, c^{(k+1)} = \Phi^\top D\, r_{\pi_k}.
$$
This is LSPI (Least Squares Policy Iteration). Each Newton step:
Computes the greedy policy $\pi_k$ under the current approximation $\Phi c^{(k)}$
Solves the LSTD equation $\Phi^\top D (\Phi - \gamma P_{\pi_k} \Phi)\, c^{(k+1)} = \Phi^\top D\, r_{\pi_k}$ for this policy to get $c^{(k+1)}$
Newton’s method for the Galerkin-projected Bellman optimality equation is equivalent to policy iteration in the function approximation setting. Just as Newton’s method for collocation corresponded to policy iteration (Step 4), Newton’s method for discrete Galerkin gives LSPI.
Galerkin projection with linear function approximation reduces policy iteration to a sequence of linear systems, each solvable in closed form. For discrete MDPs, we can compute the matrices $\Phi^\top D (\Phi - \gamma P_\pi \Phi)$ and vectors $\Phi^\top D\, r_\pi$ exactly. In the next section, we examine how these methods extend to the simulation-based setting, where we must estimate these matrices from samples rather than computing them exactly from the model.
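Putting the two steps together gives a model-based LSPI sketch for a small finite MDP; the array layout (`r[s, a]` rewards, one transition matrix `P[a]` per action) is our own convention for illustration.

```python
import numpy as np

def lspi(Phi, d, r, P, gamma, n_iter=50):
    """LSPI sketch for a finite MDP with known model.
    r[s, a]: rewards; P[a]: transition matrix under action a; d: state weights."""
    n_states, n_actions = r.shape
    D = np.diag(d)
    c = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        # Policy improvement: greedy policy under the current v_hat = Phi c
        q = np.stack([r[:, a] + gamma * P[a] @ (Phi @ c)
                      for a in range(n_actions)], axis=1)
        pi = q.argmax(axis=1)
        # Policy evaluation: LSTD for the greedy policy (one Newton step)
        r_pi = r[np.arange(n_states), pi]
        P_pi = np.stack([P[pi[s]][s] for s in range(n_states)])
        A = Phi.T @ D @ (Phi - gamma * P_pi @ Phi)
        c = np.linalg.solve(A, Phi.T @ D @ r_pi)
    return c
```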
Monotone Projection and the Preservation of Contraction¶
The informal discussion of shape preservation hints at a deeper theoretical question: when does the function iteration method converge? Recall from our discussion of collocation that function iteration proceeds in two steps:
Apply the Bellman operator at the collocation points: compute targets $y_i = \big(L\hat{v}(\cdot; c^{(k)})\big)(s_i)$ for $i = 1, \dots, n$
Fit new coefficients to match these targets: solve $\Phi c^{(k+1)} = y$, giving $c^{(k+1)} = \Phi^{-1} y$
We can reinterpret this iteration in function space rather than coefficient space. Let $\Pi$ be the projection operator that takes any function and returns its approximation in $\mathcal{V}_n = \operatorname{span}\{\phi_1, \dots, \phi_n\}$. For collocation, $\Pi$ is the interpolation operator: $\Pi f$ is the unique linear combination of basis functions that matches $f$ at the collocation points. Then Step 2 can be written as: fit $\hat{v}^{(k+1)}$ so that $\hat{v}^{(k+1)}(s_i) = \big(L\hat{v}^{(k)}\big)(s_i)$ for all collocation points, which means $\hat{v}^{(k+1)} = \Pi L \hat{v}^{(k)}$.
In other words, function iteration is equivalent to projected value iteration in function space:

$$
\hat{v}^{(k+1)} = \Pi L\, \hat{v}^{(k)}.
$$
We know that standard value iteration converges because $L$ is a $\gamma$-contraction in the sup norm. But now we’re iterating with the composed operator $\Pi L$ instead of $L$ alone.
This structure is not specific to collocation. It is inherent in all projection methods. The general pattern is always the same: apply the Bellman operator to get a target function $L\hat{v}^{(k)}$, then project it back onto our approximation space to get $\hat{v}^{(k+1)}$. The projection step defines an operator $\Pi$ that depends on our choice of test functions:
For collocation, $\Pi$ interpolates values at the collocation points
For Galerkin, $\Pi$ is orthogonal projection with respect to $\langle \cdot, \cdot \rangle_w$
For least squares, $\Pi$ minimizes the weighted residual norm
But regardless of which projection method we use, iteration takes the form $\hat{v}^{(k+1)} = \Pi L \hat{v}^{(k)}$.
Does the composition $\Pi L$ inherit the contraction property of $L$? If not, the iteration may diverge, oscillate, or converge to a spurious fixed point even though the original problem is well-posed.
Monotone Approximators and Stability¶
The answer turns out to depend on specific properties of the approximation operator $\Pi$. This theory was developed independently across multiple research communities—computational economics (Judd, 1992, 1996; McGrattan, 1997; Santos & Vigo-Aguiar, 1998), economic dynamics (Stachurski, 2009), and reinforcement learning (Gordon, 1995, 1999)—arriving at essentially the same mathematical conditions.
Monotonicity Implies Nonexpansiveness¶
It turns out that approximation operators satisfying two simple structural properties—monotonicity ($f \le g$ pointwise implies $\Pi f \le \Pi g$) and constant preservation ($\Pi(f + k\mathbf{1}) = \Pi f + k\mathbf{1}$ for every constant $k$)—automatically preserve contraction: any such $\Pi$ is nonexpansive in the sup norm, $\|\Pi f - \Pi g\|_\infty \le \|f - g\|_\infty$.
This proposition shows that monotonicity and constant preservation automatically imply nonexpansiveness. There is no need to verify this separately. The intuition is that a monotone, constant-preserving operator acts like a weighted average that respects order structure and cannot amplify differences between functions.
Preservation of Contraction¶
Combining nonexpansiveness with the contraction property of the Bellman operator yields the main stability result: $\Pi L$ is a $\gamma$-contraction in the sup norm, so projected value iteration converges to a unique fixed point $\hat{v}^\star$ satisfying

$$
\|\hat{v}^\star - v^\star\|_\infty \;\le\; \frac{1}{1 - \gamma}\, \|\Pi v^\star - v^\star\|_\infty.
$$

This error bound tells us that the fixed-point error is controlled by how well $\Pi$ can represent the true solution $v^\star$. If $\Pi v^\star = v^\star$, then $\hat{v}^\star = v^\star$ and the error vanishes. Otherwise, the error is proportional to the approximation error $\|\Pi v^\star - v^\star\|_\infty$, amplified by the factor $1/(1 - \gamma)$.
Averagers in Discrete-State Problems¶
For discrete-state problems, the monotonicity conditions have a natural interpretation as averaging with nonnegative weights: an averager is a linear operator $\Pi v = Av$ whose matrix $A$ is stochastic (nonnegative entries, rows summing to one). This characterization was developed by Gordon in the context of reinforcement learning.
Averagers automatically satisfy the monotonicity conditions: linearity follows from matrix multiplication, monotonicity follows from nonnegativity of entries, and constant preservation follows from row sums equaling one.
This specializes the Santos-Vigo-Aguiar theorem to discrete states, expressed in the probabilistic language of stochastic matrices. The stochastic matrix characterization connects to Markov chain theory: $Av$ represents expected values after one transition, and the monotonicity property reflects the fact that expectations preserve order.
Examples of averagers include state aggregation (averaging values within groups), K-nearest neighbors (averaging over nearest states), kernel smoothing with positive kernels, and multilinear interpolation on grids (barycentric weights are nonnegative and sum to one). Counterexamples include linear least squares regression (projection matrix may have negative entries) and high-order polynomial interpolation (Runge phenomenon produces negative weights).
The following table summarizes which common approximation operators satisfy the monotonicity conditions:
| Method | Monotone? | Notes |
|---|---|---|
| Piecewise linear interpolation | Yes | Always an averager; guaranteed stability |
| Multilinear interpolation (grid) | Yes | Barycentric weights are nonnegative and sum to one |
| Shape-preserving splines (Schumaker) | Yes | Designed to maintain monotonicity |
| State aggregation | Yes | Exact averaging within groups |
| Kernel smoothing (positive kernels) | Yes | If kernel integrates to one |
| High-order polynomial interpolation | No | Oscillations violate monotonicity (Runge phenomenon) |
| Least squares projection (arbitrary basis) | No | Projection matrix may have negative entries |
| Fourier/spectral methods | No | Not monotone-preserving in general |
| Neural networks | No | Highly flexible but no monotonicity guarantees |
The distinction between “safe” (monotone) and “potentially unstable” (non-monotone) approximators provides a rigorous foundation for the folk wisdom that linear interpolation is reliable while high-order polynomials can be dangerous for value iteration. But notice that the table’s verdict on “least squares projection” is somewhat abstract—it doesn’t specifically address the three weighted residual methods we introduced at the start of this chapter.
The choice of solution method determines which approximation operators are safe to use. Successive approximation (fixed-point iteration) requires monotone approximators to guarantee convergence. Rootfinding methods like Newton’s method do not require monotonicity. Stability depends on numerical properties of the Jacobian rather than contraction preservation. These considerations suggest hybrid strategies. One approach runs a few iterations with a monotone method to generate an initial guess, then switches to Newton’s method with a smooth approximation for rapid final convergence.
Connecting Back to Collocation, Galerkin, and Least Squares¶
We have now developed a general stability theory for projected value iteration and surveyed which approximation operators are monotone. But what does this mean for the three specific weighted residual methods we introduced at the start of this chapter—collocation, Galerkin, and least squares? Each method defines a different projection operator , and we now need to determine which satisfy the monotonicity conditions that guarantee convergence.
Collocation with piecewise linear interpolation is monotone. When we use collocation with piecewise linear basis functions on a grid, the projection operator performs linear interpolation between grid points. At any state $s$ between grid points $s_j$ and $s_{j+1}$, the interpolated value is:

$$
(\Pi f)(s) = (1 - t)\, f(s_j) + t\, f(s_{j+1}), \qquad t = \frac{s - s_j}{s_{j+1} - s_j} \in [0, 1].
$$
The interpolation weights $(1 - t)$ and $t$ (barycentric coordinates) are nonnegative and sum to one, making this an averager in Gordon’s sense. Therefore collocation with piecewise linear bases satisfies the monotonicity conditions and the Santos-Vigo-Aguiar stability theorem applies. The folk wisdom that “linear interpolation is safe for value iteration” has a rigorous theoretical foundation.
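The averager property is easy to verify numerically: write the interpolation as a matrix and check that every row is a probability vector. A small sketch:

```python
import numpy as np

def interp_matrix(grid, s_query):
    """Piecewise linear interpolation as a matrix A (so A @ f interpolates f):
    each row holds the barycentric weights (1 - t, t)."""
    A = np.zeros((len(s_query), len(grid)))
    for i, s in enumerate(s_query):
        j = int(np.clip(np.searchsorted(grid, s) - 1, 0, len(grid) - 2))
        t = (s - grid[j]) / (grid[j + 1] - grid[j])
        A[i, j], A[i, j + 1] = 1.0 - t, t
    return A

A = interp_matrix(np.linspace(0, 1, 5), np.random.default_rng(0).random(7))
print(np.all(A >= 0), np.allclose(A.sum(axis=1), 1.0))  # True True: an averager
```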
Galerkin projection is generally not monotone. The Galerkin projection operator for a general basis has the form:

$$
\Pi = \Phi\, (\Phi^\top W \Phi)^{-1} \Phi^\top W,
$$
where $W$ is a diagonal weight matrix and $\Phi$ contains the basis function evaluations. This projection matrix typically has negative entries. To see why, consider a simple example with polynomial basis functions $\{1, s, s^2\}$ on $[0, 1]$. The projection of a function onto this space involves computing $(\Phi^\top W \Phi)^{-1}$, and the resulting operator can map nonnegative functions to functions with negative values. This is the same phenomenon underlying the Runge phenomenon in high-order polynomial interpolation: the projection weights oscillate in sign.
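A few lines of NumPy confirm this on the quadratic example (uniform weights on nine grid points): the projection matrix has strictly negative entries, so it cannot be an averager.

```python
import numpy as np

s = np.linspace(0.0, 1.0, 9)
Phi = np.vander(s, 3, increasing=True)        # columns: 1, s, s^2
W = np.eye(len(s)) / len(s)                   # uniform weights
P_proj = Phi @ np.linalg.solve(Phi.T @ W @ Phi, Phi.T @ W)
print(P_proj.min())                           # negative: monotonicity fails
```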
Since Galerkin projection is not monotone, the sup norm contraction theory does not guarantee convergence of projected value iteration with Galerkin.
Least squares methods share the non-monotonicity issue. The least squares projection operator minimizes the weighted residual norm and has the same mathematical form as Galerkin projection. It is a linear projection onto $\operatorname{span}\{\phi_1, \dots, \phi_n\}$ with respect to a weighted inner product. Like Galerkin, the projection matrix typically contains negative entries and violates monotonicity.
The gap in our theory. The monotone approximator framework successfully covers collocation with simple bases, but leaves two important methods—Galerkin and least squares—without convergence guarantees. These methods are used in least-squares temporal difference learning (LSTD) and modern reinforcement learning with linear function approximation. We need a different analytical framework to understand when these non-monotone projections lead to convergent algorithms.
Beyond Monotone Approximators¶
The monotone approximator theory gives us a clean sufficient condition for convergence: if $\Pi$ is monotone (and constant-preserving), then $\Pi$ is non-expansive in the sup norm $\|\cdot\|_\infty$. Since $L$ is a $\gamma$-contraction in the sup norm, their composition $\Pi L$ is also a $\gamma$-contraction in the sup norm, guaranteeing convergence of projected value iteration.
But what if $\Pi$ is not monotone? Can we still guarantee convergence? Galerkin and least squares projections typically violate monotonicity, yet they are widely used in practice, particularly in reinforcement learning through least-squares temporal difference learning (LSTD). In general, proving convergence for non-monotone projections is difficult. However, for the special case of policy evaluation—computing the value function of a fixed policy $\pi$—we can establish convergence by working in a different norm.
The Policy Evaluation Problem and LSTD¶
Consider the policy evaluation problem: given policy $\pi$, we want to solve the policy Bellman equation $v_\pi = r_\pi + \gamma P_\pi v_\pi$, where $r_\pi$ and $P_\pi$ are the reward vector and transition matrix under $\pi$. This is the core computational task in policy iteration, actor-critic algorithms, and temporal difference learning. In reinforcement learning, we typically learn from sampled experience: trajectories generated by following $\pi$. If the Markov chain induced by $\pi$ is ergodic, the state distribution converges to a stationary distribution $d$ satisfying $d^\top P_\pi = d^\top$.
This distribution determines which states appear frequently in our data. States visited often contribute more samples and have more influence on any learned approximation. States visited rarely contribute little. For a linear approximation $\hat{v} = \Phi c$, the least-squares temporal difference (LSTD) algorithm computes coefficients by solving:

$$
\Phi^\top D \big( \Phi c - (r_\pi + \gamma P_\pi \Phi c) \big) = 0,
$$
where $\Phi$ is the matrix of basis function evaluations and $D = \operatorname{diag}(d)$. We write this matrix equation for analysis purposes, but the actual algorithm does not compute it this way. For large state spaces, we cannot enumerate all states to form $D$ or explicitly represent the transition matrix $P_\pi$. Instead, the practical algorithm accumulates sums from sampled transitions $(s_t, r_t, s_{t+1})$, incrementally building estimates of the matrices $A = \Phi^\top D (\Phi - \gamma P_\pi \Phi)$ and $b = \Phi^\top D\, r_\pi$ without ever forming the full objects. The algorithm is derived from first principles through temporal difference learning, and the Galerkin perspective provides an interpretation of what it computes.
LSTD as Projected Bellman Equation¶
To see what this equation means, let $c^\star$ be the solution. Expanding the parentheses:

$$
\Phi^\top D\, \Phi c^\star - \Phi^\top D\, r_\pi - \gamma\, \Phi^\top D\, P_\pi \Phi c^\star = 0.
$$
Moving all terms to the left side and factoring out $\Phi^\top D$:

$$
\Phi^\top D \big( \Phi c^\star - r_\pi - \gamma P_\pi \Phi c^\star \big) = 0.
$$
Since $\hat{v} = \Phi c^\star$ and the policy Bellman operator is $L_\pi v = r_\pi + \gamma P_\pi v$, we can write:

$$
\Phi^\top D \big( \hat{v} - L_\pi \hat{v} \big) = 0.
$$
Let $\Phi_{:i}$ denote the $i$-th column of $\Phi$, which contains the evaluations of the $i$-th basis function at all states. The equation above says that for each $i$:

$$
\Phi_{:i}^\top D \big( \hat{v} - L_\pi \hat{v} \big) = 0.
$$
But $\Phi_{:i}^\top D\, u = \sum_s d(s)\, \phi_i(s)\, u(s)$ is exactly the $d$-weighted inner product $\langle \phi_i, u \rangle_D$. So the residual $\hat{v} - L_\pi \hat{v}$ is orthogonal to every basis function, and therefore orthogonal to the entire subspace $\operatorname{span}(\Phi)$.
By definition, the orthogonal projection of a vector onto a subspace is the unique vector in that subspace such that the difference is orthogonal to the subspace. Here, $\hat{v}$ lies in $\operatorname{span}(\Phi)$ (since $\hat{v} = \Phi c^\star$), and we have just shown that $L_\pi \hat{v} - \hat{v}$ is orthogonal to $\operatorname{span}(\Phi)$. Therefore, $\hat{v} = \Pi_D L_\pi \hat{v}$, where $\Pi_D$ is orthogonal projection onto $\operatorname{span}(\Phi)$ with respect to the $D$-weighted inner product:

$$
\Pi_D = \Phi\, (\Phi^\top D \Phi)^{-1} \Phi^\top D.
$$
The weighting by $d$ is not arbitrary. Temporal difference learning performs stochastic updates using individual transitions: $c \leftarrow c + \alpha\, \phi(s_t) \big( r_t + \gamma\, \phi(s_{t+1})^\top c - \phi(s_t)^\top c \big)$, with states sampled from $d$. The ODE analysis of this stochastic process (Borkar-Meyn theory) shows convergence to a fixed point, which can be expressed in closed form as the $d$-weighted projected Bellman operator. LSTD is an algorithm that computes this analytical fixed point.
Orthogonal Projection is Non-Expansive¶
Suppose $d$ is the steady-state distribution: $d^\top P_\pi = d^\top$. Our goal is to establish that $\Pi_D L_\pi$ is a contraction in $\|\cdot\|_D$. If we can establish that $\Pi_D$ is non-expansive in this norm and that $L_\pi$ is a $\gamma$-contraction in $\|\cdot\|_D$, then their composition will be a $\gamma$-contraction:

$$
\|\Pi_D L_\pi u - \Pi_D L_\pi v\|_D \le \|L_\pi u - L_\pi v\|_D \le \gamma\, \|u - v\|_D.
$$
First, we establish that orthogonal projection is non-expansive. For any vector $u$, we can decompose $u = \Pi_D u + (u - \Pi_D u)$, where $u - \Pi_D u$ is orthogonal to the subspace. By the Pythagorean theorem in the $D$-weighted inner product:

$$
\|u\|_D^2 = \|\Pi_D u\|_D^2 + \|u - \Pi_D u\|_D^2.
$$
Since $\|u - \Pi_D u\|_D^2 \ge 0$, we have:

$$
\|\Pi_D u\|_D^2 \le \|u\|_D^2.
$$
Taking square roots of both sides (which preserves the inequality since both norms are non-negative):

$$
\|\Pi_D u\|_D \le \|u\|_D.
$$
This holds for all $u$, so $\Pi_D$ is non-expansive in $\|\cdot\|_D$.
Can $L_\pi$ Be a Contraction in $\|\cdot\|_D$?¶
To show $L_\pi$ is a $\gamma$-contraction, we need to verify:

$$
\|L_\pi u - L_\pi v\|_D = \|\gamma P_\pi (u - v)\|_D = \gamma\, \|P_\pi (u - v)\|_D.
$$
This will be at most $\gamma \|u - v\|_D$ if $P_\pi$ is non-expansive, meaning $\|P_\pi z\|_D \le \|z\|_D$ for any vector $z$. So the key is to establish that $P_\pi$ is non-expansive in $\|\cdot\|_D$.
Consider the squared norm of $P_\pi z$. By definition of the weighted norm:

$$
\|P_\pi z\|_D^2 = \sum_s d(s) \Big( \sum_{s'} p(s' \mid s, \pi(s))\, z(s') \Big)^2.
$$
The $s$-th component of $P_\pi z$ is $(P_\pi z)(s) = \sum_{s'} p(s' \mid s, \pi(s))\, z(s')$. This is a weighted average of the values $z(s')$ with weights that sum to one.
Since the function $x \mapsto x^2$ is convex, Jensen’s inequality applied to the probability distribution $p(\cdot \mid s, \pi(s))$ gives:

$$
\Big( \sum_{s'} p(s' \mid s, \pi(s))\, z(s') \Big)^2 \le \sum_{s'} p(s' \mid s, \pi(s))\, z(s')^2.
$$
Substituting this into the norm expression:

$$
\|P_\pi z\|_D^2 \le \sum_s d(s) \sum_{s'} p(s' \mid s, \pi(s))\, z(s')^2 = \sum_{s'} \Big( \sum_s d(s)\, p(s' \mid s, \pi(s)) \Big) z(s')^2.
$$
The stationarity condition $d^\top P_\pi = d^\top$ means $\sum_s d(s)\, p(s' \mid s, \pi(s)) = d(s')$ for all $s'$. Therefore:

$$
\|P_\pi z\|_D^2 \le \sum_{s'} d(s')\, z(s')^2 = \|z\|_D^2.
$$
Taking square roots, $\|P_\pi z\|_D \le \|z\|_D$, so $P_\pi$ is non-expansive in $\|\cdot\|_D$. This makes $L_\pi$ a $\gamma$-contraction in $\|\cdot\|_D$. Composing with the non-expansive projection:

$$
\|\Pi_D L_\pi u - \Pi_D L_\pi v\|_D \le \gamma\, \|u - v\|_D.
$$
By Banach’s fixed-point theorem, $\Pi_D L_\pi$ has a unique fixed point and iterates converge from any initialization.
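A numeric sanity check of the whole argument: for a random ergodic chain weighted by its stationary distribution, the $D$-norm of the linear part of $\Pi_D L_\pi$ is indeed at most $\gamma$. The similarity transform $D^{1/2} M D^{-1/2}$ converts the $D$-norm operator bound into an ordinary spectral-norm computation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, gamma = 20, 4, 0.95
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)   # random chain
evals, evecs = np.linalg.eig(P.T)                            # stationary dist:
d = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))]); d /= d.sum()
D = np.diag(d)
Phi = rng.standard_normal((n, k))                            # random features
Pi_D = Phi @ np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)     # D-weighted projection
M = Pi_D @ (gamma * P)                                       # linear part of Pi_D L_pi
S = np.diag(np.sqrt(d)) @ M @ np.diag(1.0 / np.sqrt(d))      # D-norm via similarity
print(np.linalg.norm(S, 2) <= gamma + 1e-12)                 # True
```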
Interpretation: The On-Policy Condition¶
The result shows that convergence depends on matching the weighting to the operator. We cannot choose an arbitrary weighted norm and expect $\Pi_D L_\pi$ to be a contraction. Instead, the weighting $d$ must have a specific relationship with the transition matrix $P_\pi$ in the operator $L_\pi$: namely, $d$ must be the stationary distribution of $P_\pi$. This is what makes the weighted geometry compatible with the operator’s structure. When this match holds, Jensen’s inequality gives us non-expansiveness of $P_\pi$ in the norm, and the composition inherits the contraction property.
In reinforcement learning, this has a practical interpretation. When we learn by following policy $\pi$ and collecting transitions $(s_t, a_t, r_t, s_{t+1})$, the states we visit are distributed according to the stationary distribution of $\pi$. This is on-policy learning. The LSTD algorithm uses data sampled from this distribution, which means the empirical weighting naturally matches the operator structure. Our analysis shows that the iterative algorithm converges to the same fixed point that LSTD computes in closed form.
This is fundamentally different from the monotone approximator theory. There, we required structural properties of $\Pi$ itself (monotonicity, constant preservation) to guarantee that $\Pi L$ preserves the sup-norm contraction property of $L$. Here, we place no such restriction on $\Pi$—Galerkin projection is not monotone. Instead, convergence depends on matching the norm to the operator. When $d$ does not match the stationary distribution, as in off-policy learning where data comes from a different behavior policy, the Jensen inequality argument breaks down. The operator $P_\pi$ need not be non-expansive in $\|\cdot\|_D$, and $\Pi_D L_\pi$ may fail to contract. This explains divergence phenomena such as Baird’s counterexample (Baird, 1995).
Why Not Bellman Optimality?¶
Can we extend this weighted analysis to the Bellman optimality operator $L$? The answer is no, at least not with this approach. The obstacle appears at the Jensen inequality step. For policy evaluation, we had:

$$
(P_\pi z)(s) = \sum_{s'} p(s' \mid s, \pi(s))\, z(s').
$$
The inner term is a convex combination of the values $z(s')$, which allowed us to apply Jensen’s inequality to the convex function $x \mapsto x^2$. For the optimal Bellman operator, we would need to bound:

$$
\sum_s d(s) \Big( \max_a \sum_{s'} p(s' \mid s, a)\, z(s') \Big)^2.
$$
But the maximum of convex combinations is not itself a convex combination. It is a pointwise maximum. Jensen’s inequality does not apply. We cannot conclude that $L$ is non-expansive in any weighted norm $\|\cdot\|_D$.
Is convergence of with Galerkin projection impossible, or merely difficult to prove? The situation is subtle. In practice, fitted Q-iteration and approximate value iteration with neural networks often work well, suggesting that some form of stability exists. But there are also well-documented divergence examples (e.g., Q-learning with linear function approximation can diverge). The theoretical picture remains incomplete. Some results exist for restricted function classes or under strong assumptions on the MDP structure, but no general convergence guarantee like the policy evaluation result is available. The interplay between the max operator, the projection, and the norm geometry is not well understood. This is an active area of research in reinforcement learning theory.
From Abstract to Practical: Fitted-Value Iteration¶
We have developed weighted residual methods through abstract functional equations: choose test functions, impose orthogonality conditions $\langle R, \psi_i \rangle_w = 0$, solve for coefficients. But what are we actually computing when we solve these equations by successive approximation? The answer is simpler than the formalism suggests: function iteration with a fitting step.
Recall that the weighted residual conditions define a fixed-point problem $\hat{v} = \Pi L \hat{v}$, where $\Pi$ is a projection operator onto $\mathcal{V}_n$. We can solve this by iteration: $\hat{v}^{(k+1)} = \Pi L \hat{v}^{(k)}$. Under appropriate conditions (monotonicity of $\Pi$, or matching the weight $d$ to the operator for policy evaluation), this converges to a solution.
In parameter space, this iteration becomes a fitting procedure. Consider Galerkin projection with a finite state space of $N$ states. Let $\Phi \in \mathbb{R}^{N \times n}$ be the matrix of basis evaluations, $D$ the diagonal weight matrix, and $y$ the vector of Bellman operator evaluations: $y_s = \big(L\hat{v}^{(k)}\big)(s)$. The projection is:

$$
c^{(k+1)} = (\Phi^\top D \Phi)^{-1} \Phi^\top D\, y.
$$
This is weighted least-squares regression: fit $\Phi c$ to the targets $y$. For collocation, we instead require exact interpolation at $n$ chosen collocation points. For continuous state spaces, we approximate the Galerkin integrals using sampled states, reducing to the same finite-dimensional fitting problem. The abstraction remains consistent: function iteration in the abstract becomes generate targets, fit to targets, repeat in the implementation.
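In pseudocode-like Python, the whole scheme is a two-step loop; `bellman_target` and `fit` are placeholders for the operator evaluation and the chosen fitting procedure (linear solve, regression, or any other function approximator).

```python
def fitted_value_iteration(states, bellman_target, fit, v0, n_iter=100):
    """Generic fitted VI: generate targets with the Bellman operator, fit, repeat.
    bellman_target(states, v) returns y with y[i] = (L v)(s_i);
    fit(states, y) returns a callable approximation v_hat."""
    v = v0
    for _ in range(n_iter):
        y = bellman_target(states, v)  # generate targets
        v = fit(states, y)             # project / regress onto the function class
    return v
```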
This extends beyond linear basis functions. Neural networks, decision trees, and kernel methods all implement variants of this procedure. Given data $\{(s_i, y_i)\}$ where $y_i = \big(L\hat{v}^{(k)}\big)(s_i)$, each method produces a function $\hat{v}^{(k+1)}$ by fitting to the targets. The abstraction $\hat{v}^{(k+1)} = \Pi L \hat{v}^{(k)}$ encapsulates all the complexity of function approximation, whether that involves solving a linear system, running gradient descent, or training an ensemble. The projection operator $\Pi$ is one instantiation: when the approximation class is a linear subspace and we minimize weighted squared error, we recover Galerkin or collocation. Neural networks and other nonlinear methods extend the same template beyond the settings where the theory is fully tractable.
A limitation of this algorithm is that it assumes we can evaluate the Bellman operator exactly. Computing $\big(L\hat{v}\big)(s)$ requires knowing the transition probabilities $p(s' \mid s, a)$ and summing over all next states. In practice, we often have only a simulator or observed data. The next chapter shows how to approximate these expectations from samples, connecting the fitted-value iteration framework developed here to simulation-based methods and reinforcement learning.
The Minimum Residual Framework for Nonlinear Approximators¶
The weighted residual methods developed in this chapter have focused on linear function classes: polynomial bases, piecewise linear interpolants, and linear combinations of fixed basis functions. Neural networks, kernel methods, and decision trees do not fit this template. How does the framework extend to nonlinear approximators?
Recall the Galerkin approach for linear approximation $\hat{v}(s; c) = \sum_j c_j \phi_j(s)$. The orthogonality conditions $\langle R(\cdot; c), \phi_i \rangle_w = 0$ for all $i$ define a linear system with a closed-form solution. These equations arise from minimizing $\|R(\cdot; c)\|_w^2$ over the subspace, since at the minimum, the gradient with respect to each coefficient must vanish. The connection between norm minimization and orthogonality holds generally. For any norm induced by an inner product $\langle \cdot, \cdot \rangle$, minimizing $\|R(\cdot; c)\|^2 = \langle R, R \rangle$ with respect to parameters requires the first-order conditions to hold. Since $\frac{\partial}{\partial c_i} \langle R, R \rangle = 2 \langle R, \partial R/\partial c_i \rangle$, the chain rule gives $\langle R, \partial R/\partial c_i \rangle = 0$. Minimizing the residual norm is thus equivalent to requiring orthogonality to the test functions $\psi_i = \partial R/\partial c_i$ for all $i$. The equivalence holds for any choice of inner product: weighted integrals for Galerkin, sums over collocation points for collocation, or sampled expectations for neural networks.
For nonlinear function classes $\hat{v}(\cdot; \theta)$ parameterized by $\theta \in \mathbb{R}^p$ (neural networks, kernel expansions), the same minimization principle applies:

$$
\min_\theta\; \|R(\cdot; \theta)\|^2 = \big\langle R(\cdot; \theta),\, R(\cdot; \theta) \big\rangle.
$$
The first-order stationarity condition yields orthogonality:

$$
\Big\langle R(\cdot; \theta),\; \frac{\partial R(\cdot; \theta)}{\partial \theta_j} \Big\rangle = 0, \qquad j = 1, \dots, p.
$$
The test functions are now the partial derivatives $\partial R(\cdot; \theta)/\partial \theta_j$, built from $\partial \hat{v}(\cdot; \theta)/\partial \theta_j$, which span the tangent space to the manifold $\{\hat{v}(\cdot; \theta) : \theta \in \mathbb{R}^p\}$ at the current parameters. In the linear case $\hat{v}(s; c) = \sum_j c_j \phi_j(s)$, the partial derivative $\partial \hat{v}/\partial c_j = \phi_j$ recovers the fixed basis functions of Galerkin. For nonlinear parameterizations, the test functions change with $\theta$, and the orthogonality conditions define a nonlinear system solved by iterative gradient descent.
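A bare-bones sketch of this iterative solve, using forward-difference gradients purely for illustration (any autodiff framework would replace the inner loop); `residual` is assumed to return the vector of residuals $R(s_i; \theta)$ at a fixed sample of states.

```python
import numpy as np

def minimize_residual(residual, theta0, lr=1e-2, n_steps=1000, eps=1e-6):
    """Gradient descent on (1/2) * sum_i R(s_i; theta)^2 for a nonlinear
    parameterization, with forward-difference gradients for illustration."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        r = residual(theta)                      # residual vector R(s_i; theta)
        grad = np.zeros_like(theta)
        for j in range(theta.size):              # <R, dR/dtheta_j> for each j
            bumped = theta.copy(); bumped[j] += eps
            grad[j] = r @ (residual(bumped) - r) / eps
        theta = theta - lr * grad
    return theta
```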
The dual pairing formulation (Legrand & Junca, 2025) extends this framework to settings where test objects need not be regular functions. We have been informal about this distinction in our treatment of collocation: the Dirac deltas used there are not classical functions—they are distributions, defined rigorously only through their action on test functions via $\langle \delta_{s_i}, f \rangle = f(s_i)$. The simple calculus argument for orthogonality does not apply directly to such objects; the dual pairing framework provides the proper mathematical foundation. The induced dual norm measures residuals by their worst-case effect on test functions, a perspective that has inspired adversarial formulations (Zang et al., 2020) where both trial and test functions are learned.
The minimum residual framework thus connects classical projection methods to modern function approximation. The unifying principle is orthogonality of residuals to test functions. Linear methods use fixed test functions and admit closed-form solutions. Nonlinear methods use parameter-dependent test functions and require iterative optimization. Convergence guarantees are well understood for linear projections (monotonicity, matched weightings) but remain an active research area for neural networks. The next chapter extends this framework to the simulation-based setting, where the Bellman operator itself must be approximated from samples rather than computed exactly.
- Chakraverty, S., Mahato, N. R., Karunakar, P., & Rao, T. D. (2019). Weighted Residual Methods. In Advanced Numerical and Semi-Analytical Methods for Differential Equations (pp. 25–44). John Wiley & Sons, Inc. 10.1002/9781119423461.ch3
- Atkinson, K. E., & Potra, F. A. (1987). Projection and Iterated Projection Methods for Nonlinear Integral Equations. SIAM Journal on Numerical Analysis, 24(6), 1352–1373. 10.1137/0724087
- Judd, K. L. (1992). Projection methods for solving aggregate growth models. Journal of Economic Theory, 58(2), 410–452.
- Judd, K. L. (1996). Approximation, perturbation, and projection methods in economic analysis. In H. M. Amman, D. A. Kendrick, & J. Rust (Eds.), Handbook of Computational Economics (Vol. 1, pp. 509–585). Elsevier.
- McGrattan, E. R. (1997). Application of Weighted Residual Methods to Dynamic Economic Models.
- Santos, M. S., & Vigo-Aguiar, J. (1998). Analysis of a numerical dynamic programming algorithm applied to economic models. Econometrica, 66(2), 409–426.
- Stachurski, J. (2009). Economic Dynamics: Theory and Computation. MIT Press.
- Gordon, G. J. (1995). Stable function approximation in dynamic programming. Proceedings of the Twelfth International Conference on International Conference on Machine Learning, 261–268.
- Gordon, G. J. (1999). Approximate Solutions to Markov Decision Problems [PhD thesis]. Carnegie Mellon University.
- Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. Proceedings of the Twelfth International Conference on Machine Learning, 30–37.
- Legrand, M., & Junca, S. (2025). Weighted Residual Solution Methods.
- Zang, Y., Bao, G., Ye, X., & Zhou, H. (2020). Weak adversarial networks for high-dimensional partial differential equations. Journal of Computational Physics, 411, 109409. 10.1016/j.jcp.2020.109409