
Monte Carlo Integration in Approximate Dynamic Programming

The projection methods from the previous chapter showed how to transform the infinite-dimensional fixed-point problem $\Bellman v = v$ into a finite-dimensional one by choosing basis functions $\{\varphi_i\}$ and imposing conditions that make the residual $R(s) = \Bellman v(s) - v(s)$ small. Different projection conditions (Galerkin orthogonality, collocation at points, least squares minimization) yield different finite-dimensional systems to solve.

However, we left unresolved the question of how to evaluate the Bellman operator itself. Applying $\Bellman$ at any state requires computing an expectation:

$$(\Bellman v)(s) = \max_{a \in \mathcal{A}_s} \left\{r(s,a) + \gamma \int v(s')\,p(ds'|s,a)\right\}$$

For discrete state spaces with a manageable number of states, this expectation is a finite sum we can compute exactly. For continuous or high-dimensional state spaces, we need numerical integration. The projection methods framework is compatible with any quadrature scheme, but leaves the choice of integration method unspecified.

This chapter addresses the integration subproblem that arises at two levels in approximate dynamic programming. First, we must evaluate the transition expectation $\int v(s')\,p(ds'|s,a)$ within the Bellman operator itself. Second, when using projection methods like Galerkin or least squares, we encounter outer integrations for enforcing orthogonality conditions or minimizing residuals over distributions of states. Both require numerical approximation in continuous or high-dimensional spaces.

We begin by examining deterministic numerical integration methods: quadrature rules that approximate integrals by evaluating integrands at carefully chosen points with associated weights. We discuss how to coordinate the choice of quadrature with the choice of basis functions to balance approximation accuracy and computational cost. Then we turn to Monte Carlo integration, which approximates expectations using random samples rather than deterministic quadrature points. This shift from deterministic to stochastic integration is what brings us into machine learning territory. When we replace exact transition probabilities with samples drawn from simulations or real interactions, projection methods combined with Monte Carlo integration become what the operations research community calls simulation-based approximate dynamic programming and what the machine learning community calls reinforcement learning. By relying on samples rather than explicit probability functions, we move from model-based planning to data-driven learning.

Evaluating the Bellman Operator with Numerical Quadrature

Before turning to Monte Carlo methods, we examine the structure of the numerical integration problem. When we apply the Bellman operator to an approximate value function $\hat{v}(s; \theta) = \sum_i \theta_i \varphi_i(s)$, we must evaluate integrals of the form:

$$\int \hat{v}(s'; \theta) \, p(s'|s,a) \, ds' = \int \left(\sum_i \theta_i \varphi_i(s')\right) p(s'|s,a) \, ds' = \sum_i \theta_i \int \varphi_i(s') \, p(s'|s,a) \, ds'$$

This shows two independent approximations:

  1. Value function approximation: We represent $v$ using basis functions $\{\varphi_i\}$ and coefficients $\theta$

  2. Quadrature approximation: We approximate each integral $\int \varphi_i(s') p(s'|s,a) ds'$ numerically

These choices are independent but should be coordinated. To see why, consider what happens in projection methods when we iterate $\hat{v}^{(k+1)} = \Proj \Bellman \hat{v}^{(k)}$. In practice, we cannot evaluate $\Bellman$ exactly due to the integrals, so we compute $\hat{v}^{(k+1)} = \Proj \BellmanQuad \hat{v}^{(k)}$ instead, where $\BellmanQuad$ denotes the Bellman operator with numerical quadrature.

The error in a single iteration can be bounded by the triangle inequality. Let $v$ denote the true fixed point $v = \Bellman v$. Then:

$$\begin{aligned} \|v - \hat{v}^{(k+1)}\| &= \|v - \Proj \BellmanQuad \hat{v}^{(k)}\| \\ &\le \|v - \Proj \Bellman v\| + \|\Proj \Bellman v - \Proj \Bellman \hat{v}^{(k)}\| + \|\Proj \Bellman \hat{v}^{(k)} - \Proj \BellmanQuad \hat{v}^{(k)}\| \\ &= \underbrace{\|v - \Proj \Bellman v\|}_{\text{best approximation error}} + \underbrace{\|\Proj \Bellman v - \Proj \Bellman \hat{v}^{(k)}\|}_{\text{contraction of iterate error}} + \underbrace{\|\Proj \Bellman \hat{v}^{(k)} - \Proj \BellmanQuad \hat{v}^{(k)}\|}_{\text{quadrature error}} \end{aligned}$$

The first term is the best we can do with our basis (how well $\Proj$ approximates the true solution). The second term decreases with iterations when $\Proj \Bellman$ is a contraction. The third term is the error from replacing exact integrals with quadrature, and it does not vanish as iterations proceed.

To make this concrete, consider evaluating $(\Bellman \hat{v})(s)$ for our current approximation $\hat{v}(s; \theta) = \sum_i \theta_i \varphi_i(s)$. We want:

$$(\Bellman \hat{v})(s) = \max_{a \in \mathcal{A}_s} \left\{r(s,a) + \gamma \sum_i \theta_i \int \varphi_i(s') p(s'|s,a) ds' \right\}$$

But we compute instead:

$$(\BellmanQuad \hat{v})(s) = \max_{a \in \mathcal{A}_s} \left\{r(s,a) + \gamma \sum_i \theta_i \sum_j w_j \varphi_i(s'_j) \right\}$$

where $\{(s'_j, w_j)\}$ are quadrature nodes and weights. If the quadrature error $\|(\Bellman \hat{v})(s) - (\BellmanQuad \hat{v})(s)\|$ is large relative to the basis approximation quality, we cannot exploit the expressive power of the basis. For instance, degree-10 Chebyshev polynomials represent smooth functions to $O(10^{-8})$ accuracy, but combined with rectangle-rule quadrature (error $O(h^2) \approx 10^{-2}$), the quadrature term dominates the error bound. We pay the cost of storing and manipulating 10 coefficients but achieve only $O(h^2)$ convergence in the quadrature mesh size.

This echoes the coordination principle from continuous optimal control (Chapter on trajectory optimization): when transcribing a continuous-time problem, we use the same quadrature nodes for both the running cost integral and the dynamics integral. There, coordination ensures that “where we pay” aligns with “where we enforce” the dynamics. Here, coordination ensures that integration accuracy matches approximation accuracy. Both are instances of balancing multiple sources of error in numerical methods.

Standard basis-quadrature pairings achieve this balance: for example, Chebyshev polynomials paired with Chebyshev-Gauss quadrature at the Chebyshev nodes, or Hermite polynomials paired with Gauss-Hermite quadrature when shocks are Gaussian.

To make this concrete, we examine what these pairings look like in practice for collocation and Galerkin projection.

Orthogonal Collocation with Chebyshev Polynomials

Consider approximating the value function using Chebyshev polynomials of degree $n-1$:

$$\hat{v}(s; \theta) = \sum_{i=0}^{n-1} \theta_i T_i(s)$$

For orthogonal collocation, we place collocation points at the zeros of $T_n(s)$, denoted $\{s_j\}_{j=1}^n$. At each collocation point, we require the Bellman equation to hold exactly:

$$\hat{v}(s_j; \theta) = \max_{a \in \mathcal{A}} \left\{r(s_j,a) + \gamma \int \hat{v}(s'; \theta) p(ds'|s_j,a)\right\}$$

The integral on the right must be approximated using quadrature. With Chebyshev-Gauss quadrature using the same nodes $\{s_k\}_{k=1}^n$ and weights $\{w_k\}_{k=1}^n$, this becomes:

$$\hat{v}(s_j; \theta) = \max_{a \in \mathcal{A}} \left\{r(s_j,a) + \gamma \sum_{k=1}^n w_k \hat{v}(s_k; \theta) p(s_k|s_j,a)\right\}$$

Substituting the basis representation $\hat{v}(s_k; \theta) = \sum_{i=0}^{n-1} \theta_i T_i(s_k)$:

$$\sum_{i=0}^{n-1} \theta_i T_i(s_j) = \max_{a \in \mathcal{A}} \left\{r(s_j,a) + \gamma \sum_{k=1}^n w_k p(s_k|s_j,a) \sum_{i=0}^{n-1} \theta_i T_i(s_k)\right\}$$

Rearranging:

$$\sum_{i=0}^{n-1} \theta_i T_i(s_j) = \max_{a \in \mathcal{A}} \left\{r(s_j,a) + \gamma \sum_{i=0}^{n-1} \theta_i \underbrace{\sum_{k=1}^n w_k T_i(s_k) p(s_k|s_j,a)}_{B_{ji}^a}\right\}$$

This yields a system of $n$ nonlinear equations (one per collocation point):

$$\sum_{i=0}^{n-1} T_i(s_j) \theta_i = \max_{a \in \mathcal{A}} \left\{r(s_j,a) + \gamma \sum_{i=0}^{n-1} B_{ji}^a \theta_i\right\}, \quad j=1,\ldots,n$$

The matrix elements $B_{ji}^a$ can be precomputed once the quadrature nodes, weights, and transition probabilities are known. Solving this system gives the coefficient vector $\theta$.
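
To make the pairing concrete in code, here is a minimal Python sketch of this collocation scheme, assuming a one-dimensional state on $[-1,1]$, a finite action set, and user-supplied callables `r(s, a)` and `p(s_next, s, a)` for the reward and transition density (hypothetical interfaces; the Chebyshev weight function is absorbed into the quadrature rule exactly as in the $B_{ji}^a$ construction above):

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def solve_collocation(r, p, actions, n=10, gamma=0.95, tol=1e-8, max_iter=500):
    """Orthogonal collocation on Chebyshev nodes with Chebyshev-Gauss quadrature."""
    nodes, weights = C.chebgauss(n)      # zeros of T_n and quadrature weights
    T = C.chebvander(nodes, n - 1)       # T[j, i] = T_i(s_j)

    # Precompute B[a][j, i] = sum_k w_k T_i(s_k) p(s_k | s_j, a)
    B = {a: np.array([[sum(weights[k] * T[k, i] * p(nodes[k], s_j, a)
                           for k in range(n))
                       for i in range(n)]
                      for s_j in nodes])
         for a in actions}
    R = {a: np.array([r(s_j, a) for s_j in nodes]) for a in actions}

    theta = np.zeros(n)
    for _ in range(max_iter):
        # Right-hand side of the collocation equations at every node
        rhs = np.max([R[a] + gamma * B[a] @ theta for a in actions], axis=0)
        theta_new = np.linalg.solve(T, rhs)   # enforce sum_i T_i(s_j) theta_i = rhs_j
        if np.max(np.abs(theta_new - theta)) < tol:
            break
        theta = theta_new
    return theta
```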

Galerkin Projection with Hermite Polynomials

For a problem with Gaussian shocks, we might use Hermite polynomials $\{H_i(s)\}_{i=0}^{n-1}$ weighted by the Gaussian density $\phi(s)$. The Galerkin condition requires:

$$\int \left(\Bellman \hat{v}(s; \theta) - \hat{v}(s; \theta)\right) H_j(s) \phi(s) ds = 0, \quad j=0,\ldots,n-1$$

Expanding the Bellman operator:

$$\int \left[\max_{a \in \mathcal{A}} \left\{r(s,a) + \gamma \int \hat{v}(s'; \theta) p(ds'|s,a)\right\} - \hat{v}(s; \theta)\right] H_j(s) \phi(s) ds = 0$$

We approximate this outer integral using Gauss-Hermite quadrature with nodes $\{s_\ell\}_{\ell=1}^m$ and weights $\{w_\ell\}_{\ell=1}^m$:

$$\sum_{\ell=1}^m w_\ell \left[\max_{a \in \mathcal{A}} \left\{r(s_\ell,a) + \gamma \int \hat{v}(s'; \theta) p(ds'|s_\ell,a)\right\} - \hat{v}(s_\ell; \theta)\right] H_j(s_\ell) = 0$$

The inner integral (transition expectation) is also approximated using Gauss-Hermite quadrature:

$$\int \hat{v}(s'; \theta) p(ds'|s_\ell,a) \approx \sum_{k=1}^m w_k \hat{v}(s_k; \theta) p(s_k|s_\ell,a)$$

Substituting the basis representation and collecting terms:

$$\sum_{\ell=1}^m w_\ell H_j(s_\ell) \max_{a \in \mathcal{A}} \left\{r(s_\ell,a) + \gamma \sum_{i=0}^{n-1} \theta_i \sum_{k=1}^m w_k H_i(s_k) p(s_k|s_\ell,a)\right\} = \sum_{\ell=1}^m w_\ell H_j(s_\ell) \sum_{i=0}^{n-1} \theta_i H_i(s_\ell)$$

This gives $n$ nonlinear equations in $n$ unknowns. The right-hand side simplifies using orthogonality: when using the same quadrature nodes for projection and integration, $\sum_{\ell=1}^m w_\ell H_j(s_\ell) H_i(s_\ell) = \delta_{ij} \|H_j\|^2$.
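
The discrete orthogonality claim is easy to check numerically. A small sketch using NumPy's probabilists' Gauss-Hermite rule (weight $e^{-s^2/2}$ rather than the normalized Gaussian density, so the diagonal entries come out as $\sqrt{2\pi}\,j!$):

```python
import numpy as np
from numpy.polynomial import hermite_e as He

m, n = 8, 5                          # quadrature nodes, basis functions
nodes, weights = He.hermegauss(m)    # Gauss-Hermite (probabilists') nodes/weights
Hmat = He.hermevander(nodes, n - 1)  # Hmat[l, i] = He_i(s_l)

# Discrete orthogonality: sum_l w_l He_j(s_l) He_i(s_l) = delta_ij ||He_j||^2,
# exact here because He_i * He_j has degree <= 2m - 1.
G = Hmat.T @ (weights[:, None] * Hmat)
print(np.round(G, 6))                # diagonal: sqrt(2*pi) * j!, off-diagonal: 0
```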

In both cases, the basis-quadrature pairing ensures that:

  1. The quadrature nodes appear in both the transition expectation and the outer projection

  2. The quadrature accuracy matches the polynomial approximation order

  3. Precomputed matrices capture the dynamics, making iterations efficient

Monte Carlo Integration

Deterministic quadrature rules work well when the state space has low dimension, the transition density is smooth, and we can evaluate it cheaply at arbitrary points. In many stochastic control problems none of these conditions truly hold. The state may be high dimensional, the dynamics may be given by a simulator rather than an explicit density, and the cost of each call to the model may be large. In that regime, deterministic quadrature becomes brittle. Monte Carlo methods offer a different way to approximate expectations, one that relies only on the ability to sample from the relevant distributions.

Monte Carlo as randomized quadrature

Consider a single expectation of the form

$$J = \int f(x)\,p(dx) = \mathbb{E}[f(X)],$$

where $X \sim p$ and $f$ is some integrable function. Monte Carlo integration approximates $J$ by drawing independent samples $X^{(1)},\ldots,X^{(N)} \sim p$ and forming the sample average

$$\hat{J}_N \equiv \frac{1}{N}\sum_{n=1}^N f\bigl(X^{(n)}\bigr).$$

This estimator has two basic properties that will matter throughout this chapter.

First, it is unbiased:

$$\mathbb{E}[\hat{J}_N] = \mathbb{E}\left[\frac{1}{N}\sum_{n=1}^N f(X^{(n)})\right] = \frac{1}{N}\sum_{n=1}^N \mathbb{E}[f(X^{(n)})] = \mathbb{E}[f(X)] = J.$$

Second, its variance scales as $1/N$. If we write

$$\sigma^2 \equiv \mathrm{Var}[f(X)],$$

then independence of the samples gives

$$\mathrm{Var}(\hat{J}_N) = \frac{1}{N^2}\sum_{n=1}^N \mathrm{Var}\bigl(f(X^{(n)})\bigr) = \frac{\sigma^2}{N}.$$

The central limit theorem then says that for large $N$,

$$\sqrt{N}\,(\hat{J}_N - J) \Longrightarrow \mathcal{N}(0,\sigma^2),$$

so the integration error decays at rate $O(N^{-1/2})$. This rate is slow compared to high-order quadrature in low dimension, but it has one crucial advantage: it does not explicitly depend on the dimension of $x$. Monte Carlo integration pays in variance, not in an exponential growth in the number of nodes.
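
A quick numerical check of the $O(N^{-1/2})$ rate, for an integrand whose expectation is known exactly (here $f(x) = x^2$ with $X \sim \mathcal{N}(0,1)$, so $J = 1$):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2                 # true value J = E[X^2] = 1 for X ~ N(0, 1)

for N in [100, 10_000, 1_000_000]:
    X = rng.standard_normal(N)
    J_hat = f(X).mean()
    stderr = f(X).std(ddof=1) / np.sqrt(N)    # estimated sigma / sqrt(N)
    print(f"N={N:>9d}  J_hat={J_hat:.4f}  approx. std. error={stderr:.4f}")
```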

It is often helpful to view Monte Carlo as randomized quadrature. A deterministic quadrature rule selects nodes $x_j$ and weights $w_j$ in advance and computes

$$\sum_j w_j f(x_j).$$

Monte Carlo can be written in the same form: if we draw $X^{(n)}$ from density $p$, the sample average

$$\hat{J}_N = \frac{1}{N}\sum_{n=1}^N f(X^{(n)})$$

is just a quadrature rule with random nodes and equal weights. More advanced Monte Carlo schemes, such as importance sampling, change both the sampling distribution and the weights, but the basic idea remains the same.

Monte Carlo evaluation of the Bellman operator

We now apply this to the Bellman operator. For a fixed value function $v$ and a given state-action pair $(s,a)$, the transition part of the Bellman operator is

$$\int v(s')\,p(ds' \mid s,a) = \mathbb{E}\left[v(S') \mid S = s, A = a\right].$$

If we can simulate next states $S'^{(1)},\ldots,S'^{(N)}$ from the transition kernel $p(\cdot \mid s,a)$, either by calling a simulator or by interacting with the environment, we can approximate this expectation by

$$\widehat{\mathbb{E}}_N\bigl[v(S') \mid s,a\bigr] \equiv \frac{1}{N}\sum_{n=1}^N v\bigl(S'^{(n)}\bigr).$$

If the immediate reward is also random, say

$$r = r(S,A,S') \quad \text{with} \quad (S', r) \sim p(\cdot \mid s,a),$$

we can approximate the full one-step return

$$\mathbb{E}\bigl[r + \gamma v(S') \mid S = s, A = a\bigr]$$

by

$$\widehat{G}_N(s,a) \equiv \frac{1}{N}\sum_{n=1}^N \bigl[r^{(n)} + \gamma v(S'^{(n)})\bigr],$$

where $(r^{(n)}, S'^{(n)})$ are independent samples given $(s,a)$. Again, this is an unbiased estimator of the Bellman expectation for fixed $v$.

Plugging this into the Bellman operator gives a Monte Carlo Bellman operator:

$$(\widehat{\Bellman}_N v)(s) \equiv \max_{a \in \mathcal{A}_s} \left\{ \widehat{G}_N(s,a) \right\}.$$

The expectation inside the braces is now replaced by a random sample average. In model-based settings, we implement this by simulating many next states for each candidate action $a$ at state $s$. In model-free settings, we obtain samples from real interactions and re-use them to estimate the expectation.
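
A minimal sketch of the model-based version, assuming a hypothetical simulator call `sample_step(rng, s, a)` that returns one sampled reward and next state:

```python
import numpy as np

def mc_bellman(v, s, actions, sample_step, N, gamma, rng):
    """Monte Carlo Bellman operator (L_hat_N v)(s): for each action, average
    N sampled one-step returns, then maximize over actions."""
    values = []
    for a in actions:
        returns = []
        for _ in range(N):
            r, s_next = sample_step(rng, s, a)    # one draw from p(. | s, a)
            returns.append(r + gamma * v(s_next))
        values.append(np.mean(returns))           # G_hat_N(s, a)
    return max(values)
```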

At this stage nothing about approximation or projection has entered yet. For a fixed value function $v$, Monte Carlo provides unbiased, noisy evaluations of $(\Bellman v)(s)$. The approximation question arises once we couple this stochastic evaluation with basis functions and projections.

Sampling the outer expectations

Projection methods introduce a second layer of integration. In Galerkin and least squares schemes, we choose a distribution $\mu$ over states (and sometimes actions) and enforce conditions of the form

$$\int R(s; \theta)\,p_i(s)\, \mu(ds) = 0 \quad \text{or} \quad \int R(s; \theta)^2\,\mu(ds) \text{ is minimized}.$$

Here $R(s; \theta)$ is the residual function, such as

$$R(s; \theta) = (\Bellman \hat{v})(s;\theta) - \hat{v}(s;\theta),$$

and $p_i$ are test functions or derivatives of the residual with respect to parameters.

Note a subtle but important shift in perspective from the previous chapter. There, the weight function $w(s)$ in the inner product $\langle f, g \rangle_w = \int f(s) g(s) w(s) ds$ could be any positive weight function (not necessarily normalized). For Monte Carlo integration, however, we need a probability distribution we can sample from. We write $\mu$ for this sampling distribution and express projection conditions as integrals with respect to $\mu$: $\int R(s; \theta) p_i(s) \mu(ds)$. If a problem was originally formulated with an unnormalized weight $w(s)$, we must either (i) normalize it to define $\mu$, or (ii) use importance sampling with a different $\mu$ and reweight samples by $w(s)/\mu(s)$. In reinforcement learning, $\mu$ is typically the empirical state visitation distribution from collected trajectories.

These outer integrals over $s$ are generally not easier to compute than the inner transition expectations. Monte Carlo gives a way to approximate them as well. If we can draw states $S^{(1)},\ldots,S^{(M)} \sim \mu$, we can approximate, for example, the Galerkin orthogonality conditions by

$$\int R(s; \theta)\,p_i(s)\,\mu(ds) \approx \frac{1}{M}\sum_{m=1}^M R\bigl(S^{(m)}; \theta\bigr)\,p_i\bigl(S^{(m)}\bigr).$$

Similarly, a least squares objective

$$\int R(s; \theta)^2\,\mu(ds)$$

is approximated by the empirical risk

$$\frac{1}{M}\sum_{m=1}^M R\bigl(S^{(m)}; \theta\bigr)^2.$$

If we now substitute Monte Carlo estimates for both the inner transition expectations and the outer projection conditions, we obtain fully simulation-based schemes. We no longer need explicit access to the transition kernel $p(\cdot \mid s,a)$ or the state distribution $\mu$. It is enough to be able to sample from them, either through a simulator or by interacting with the real system.
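
Putting the two layers together, here is a sketch of the resulting empirical least-squares objective, again assuming a hypothetical simulator call `sample_step(rng, s, a)` and a parametric value function `v_hat(s, theta)`:

```python
import numpy as np

def empirical_residual_loss(theta, sampled_states, actions, sample_step,
                            v_hat, N, gamma, rng):
    """(1/M) sum_m R(S^(m); theta)^2 with the inner transition expectation
    replaced by an N-sample Monte Carlo average."""
    total = 0.0
    for s in sampled_states:                      # S^(1), ..., S^(M) ~ mu
        backups = []
        for a in actions:
            draws = [sample_step(rng, s, a) for _ in range(N)]
            backups.append(np.mean([r + gamma * v_hat(sn, theta)
                                    for r, sn in draws]))
        residual = max(backups) - v_hat(s, theta) # (L_hat_N v_hat)(s) - v_hat(s)
        total += residual ** 2
    return total / len(sampled_states)
```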

Simulation-Based Projection Methods

Monte Carlo integration replaces both levels of integration in approximate dynamic programming. The transition expectation in the Bellman operator becomes $\frac{1}{N}\sum_{n=1}^N v(S'^{(n)})$, and the outer integrals for projection become empirical averages over sampled states. Writing $\Proj_M$ for projection using $M$ state samples and $\widehat{\Bellman}_N$ for the Monte Carlo Bellman operator with $N$ transition samples, the iteration becomes

$$\hat{v}^{(k+1)} = \Proj_M\,\widehat{\Bellman}_N \hat{v}^{(k)}.$$

In the typical reinforcement learning setting, $N=1$: each observed transition $(s_i, a_i, r_i, s'_i)$ provides exactly one sample of the next state. This means we work with high-variance single-sample estimates $r_i + \gamma v(s'_i)$ rather than averaged returns. Variance reduction comes from aggregating information across different transitions through the function approximator and from collecting long trajectories with many transitions. We examine this single-sample constraint in detail after introducing Q-functions below.

Unlike deterministic quadrature, which introduces a fixed bias at each iteration, Monte Carlo introduces random error with zero mean but nonzero variance. However, combining Monte Carlo with the maximization in the Bellman operator creates a systematic problem: while the estimate of the expected return for any individual action is unbiased, taking the maximum over these noisy estimates introduces upward bias. This overestimation compounds through value iteration and degrades the resulting policies. We address this challenge in the next chapter on batch reinforcement learning.

Amortizing Action Selection via Q-Functions

Monte Carlo integration enables model-free approximate dynamic programming: we no longer need explicit transition probabilities $p(s'|s,a)$, only the ability to sample next states. However, one computational challenge remains. The standard formulation of an optimal decision rule is

$$\pi(s) = \arg\max_{a \in \mathcal{A}} \left\{r(s,a) + \gamma \int v(s')p(ds'|s,a)\right\}.$$

Even with an optimal value function $v^*$ in hand, extracting an action at state $s$ requires evaluating the transition expectation for each candidate action. In the model-free setting, this means we must draw Monte Carlo samples from each action’s transition distribution every time we select an action. This repeated sampling “at inference time” wastes computation, especially when the same state is visited multiple times.

We can amortize this computation by working at a different level of representation. Define the state-action value function (or Q-function)

$$q(s,a) = r(s,a) + \gamma \int v(s')p(ds'|s,a).$$

The Q-function caches the result of evaluating each action at each state. Once we have $q$, action selection reduces to a finite maximization:

$$\pi(s) = \arg\max_{a \in \mathcal{A}(s)} q(s,a).$$

No integration appears in this expression. The transition expectation has been precomputed and stored in $q$ itself.

The optimal Q-function $q^*$ satisfies its own Bellman equation. Substituting the definition of $v^*(s) = \max_a q^*(s,a)$ into the expression for $q$:

$$q^*(s,a) = r(s,a) + \gamma \int p(ds'|s,a) \max_{a' \in \mathcal{A}(s')} q^*(s', a').$$

This defines a Bellman operator on Q-functions:

$$(\Bellman q)(s,a) = r(s,a) + \gamma \int p(ds'|s,a)\max_{a' \in \mathcal{A}(s')} q(s', a').$$

Like the Bellman operator on value functions, $\Bellman$ is a $\gamma$-contraction in the sup-norm, guaranteeing a unique fixed point $q^*$. We can thus apply the same projection and Monte Carlo techniques developed for value functions to Q-functions. The computational cost shifts from action selection (repeated sampling at decision time) to training (evaluating expectations during the iterations of approximate value iteration). Once $q$ is learned, acting is cheap.

Parametric Q-value iteration, which applies this operator within a projection scheme, evaluates the expectation $\int p(ds'|s,a)\max_{a'} q(s',a'; \boldsymbol{\theta}_n)$ at each state-action pair. In principle, we could approximate this integral using many Monte Carlo samples: draw $N$ next states from $p(\cdot|s,a)$ and average $\frac{1}{N}\sum_{i=1}^N \max_{a'} q(s'_i, a')$ to reduce variance. This is standard Monte Carlo integration applied to the Bellman operator.
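
A sketch of such a scheme with linear-in-features Q-functions and $N$-sample targets; the feature map `phi(s, a)` and the simulator call `sample_step(rng, s, a)` are assumed interfaces, not anything defined in the text above:

```python
import numpy as np

def parametric_q_value_iteration(phi, states, actions, sample_step, N, gamma,
                                 rng, n_iters=50):
    """Each iteration regresses N-sample Monte Carlo Bellman targets onto the
    features phi(s, a), so that q(s, a; theta) = phi(s, a) @ theta."""
    d = len(phi(states[0], actions[0]))
    theta = np.zeros(d)
    q = lambda s, a, th: phi(s, a) @ th

    for _ in range(n_iters):
        features, targets = [], []
        for s in states:
            for a in actions:
                draws = [sample_step(rng, s, a) for _ in range(N)]
                y = np.mean([r + gamma * max(q(sn, b, theta) for b in actions)
                             for r, sn in draws])
                features.append(phi(s, a))
                targets.append(y)
        # Least-squares projection of the Monte Carlo targets onto the features
        theta, *_ = np.linalg.lstsq(np.array(features), np.array(targets),
                                    rcond=None)
    return theta
```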

The Single-Sample Paradigm

Most reinforcement learning algorithms do not follow this pattern. Instead, they work with datasets of transition tuples $\mathcal{D} = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^M$, where each tuple records a single observed next state $s'_i$ from executing action $a_i$ in state $s_i$. Each transition provides exactly one sample of the dynamics: the target becomes $y_i = r_i + \gamma \max_{a'} q(s'_i, a')$ using only the single observed $s'_i$.
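
In this single-sample form, target computation over a batch is a short loop; a sketch, assuming a callable `q(s, a)` and a finite action set:

```python
import numpy as np

def single_sample_targets(batch, q, actions, gamma):
    """y_i = r_i + gamma * max_a' q(s'_i, a') from tuples (s, a, r, s_next)."""
    return np.array([r + gamma * max(q(s_next, a2) for a2 in actions)
                     for (_s, _a, r, s_next) in batch])
```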

This design choice has two origins. First, it reflects genuine constraints in some problem settings. A robot executing action $a$ in state $s$ observes one outcome $s'$ and cannot rewind to the exact same state to sample alternative continuations. Real physical systems, biological experiments, and costly interactions (clinical trials, manufacturing processes) provide one sample per executed action. However, many domains where RL is applied do not face this constraint. Game emulators support state saving and restoration, physics simulators can reset to arbitrary configurations, and software environments can be cloned cheaply. The single-sample pattern persists in these settings not from necessity but from algorithmic convention.

Second, and perhaps more fundamentally, the single-sample paradigm emerged from reinforcement learning’s roots in computational neuroscience and models of biological intelligence. Early work viewed RL as a model of animal learning, where organisms experience one consequence per action and must learn from streaming sensory data. This perspective shaped the field’s algorithmic agenda: develop methods that learn from single sequential experiences, as biological agents do. Temporal difference learning, Q-learning, and actor-critic methods all follow this template.

These origins matter because they established design choices that persist even when the original constraints do not apply. In many practical applications, we have fast simulators that support arbitrary state initialization (e.g., physics engines for robotics, game emulators for Atari). We could collect multiple next-state samples per $(s,a)$ pair to reduce variance in target computation. However, most algorithms do not exploit this capability. The transition tuple $(s_i, a_i, r_i, s'_i)$ with $N=1$ remains the standard data structure, and methods compensate for high variance through other means: larger replay buffers, function approximation that pools information across samples, multi-step returns, and careful optimization strategies.

This creates an asymmetry in how algorithms use Monte Carlo integration. For the outer expectation (sampling which states to evaluate at), algorithms freely adjust sample size: offline methods reuse entire datasets, replay buffers store thousands of transitions, online methods process one sample per step. For the inner expectation (sampling next states to evaluate the Bellman operator at a given $(s,a)$), the overwhelming majority of practical methods use $N=1$ regardless of whether the environment permits larger samples.

The single-sample paradigm shapes three aspects of algorithm design:

  1. Data structure: Algorithms store and manipulate tuples $\{(s_i, a_i, r_i, s'_i)\}$, not state-action pairs with multiple next-state samples

  2. Variance reduction: Comes from aggregation across different samples through function approximation and from trajectory length, not from repeated sampling at fixed $(s,a)$

  3. Target computation: Each transition yields exactly one target $y_i = r_i + \gamma \max_{a'} q(s'_i, a')$ evaluated at the single observed $s'_i$

The algorithms that follow in the next chapters all adopt this single-sample structure. Understanding its origins clarifies when the design choice might be revisited: in domains with cheap, resettable simulators and high target variance, averaging multiple next-state samples per $(s,a)$ pair offers a direct variance reduction mechanism that is rarely exploited in current practice.

Overestimation Bias and Mitigation Strategies

Monte Carlo integration provides unbiased estimates of individual expectations, but when we apply the Bellman operator’s maximization to these noisy estimates, a systematic problem arises: taking the maximum over noisy values creates upward bias. This overestimation compounds through iterative algorithms and can severely degrade the quality of learned policies. This section examines the sources of this bias and presents two approaches for mitigating it.

Sources of Overestimation Bias

When we apply Monte Carlo integration to the Bellman operator, each individual expectation is unbiased. The problem arises when we maximize over these noisy estimates. To make this precise, consider a given state-action pair $(s,a)$ and value function $v$. Define the true continuation value

$$\mu(s,a) \equiv \int v(s')\,p(ds' \mid s,a),$$

and its Monte Carlo approximation with $N$ samples:

$$\hat{\mu}_N(s,a) \equiv \frac{1}{N}\sum_{i=1}^N v(s'_i), \quad s'_i \sim p(\cdot|s,a).$$

Each estimator is unbiased: $\mathbb{E}[\hat{\mu}_N(s,a)] = \mu(s,a)$. The Monte Carlo Bellman operator is

$$(\widehat{\Bellman}_N v)(s) = \max_{a \in \mathcal{A}_s} \left\{ r(s,a) + \gamma \hat{\mu}_N(s,a)\right\}.$$

While each $\hat{\mu}_N(s,a)$ is unbiased, the maximization introduces systematic overestimation. To see why, let $a^*$ be the truly optimal action. The maximum of any collection is at least as large as any particular element:

$$\mathbb{E}\big[\max_a \{r(s,a) + \gamma \hat{\mu}_N(s,a)\}\big] \ge \mathbb{E}\big[r(s,a^*) + \gamma \hat{\mu}_N(s,a^*)\big] = r(s,a^*) + \gamma \mu(s,a^*) = (\Bellman v)(s).$$

The inequality is strict whenever multiple actions have nonzero variance. The maximization selects whichever action happens to receive a positive noise realization, and that inflated estimate contributes to the target value. This is Jensen’s inequality for the max operator: $\mathbb{E}[\max_a Y_a] \ge \max_a \mathbb{E}[Y_a]$ for random variables $\{Y_a\}$, since the max is convex. Even though we start with unbiased estimators, taking the maximum breaks unbiasedness.

Operationally, this means that repeatedly sampling fresh states and taking the max will not, on average, converge to the true value but systematically land above it. Unlike noise that averages out, this bias persists no matter how many samples we draw. Worse, it compounds across iterations as overestimates feed into future target computations.

Monte Carlo value iteration applies $v_{k+1} = \widehat{\Bellman}_N v_k$ repeatedly. The overestimation bias does not stay confined to a single iteration: it accumulates through the recursion. At iteration 2, we compute Monte Carlo estimates $\hat{\mu}_N(s,a) = \frac{1}{N}\sum_{i=1}^N v_1(s'_i)$, but $v_1$ is already biased upward from iteration 1. Averaging an overestimated function and then maximizing over noisy estimates compounds the bias: $\mathbb{E}[v_k] \ge \mathbb{E}[v_{k-1}] \ge \cdots \ge v^*$. Without correction, this feedback loop can produce severely inflated value estimates that yield poor policies.
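
A small simulation makes the effect concrete: with several actions of identical true value, every per-action estimate is unbiased, yet the maximum of the estimates is systematically positive.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, N, sigma, trials = 3, 10, 1.0, 100_000

# Each row holds unbiased N-sample estimates for 3 actions whose true values are all 0.
estimates = rng.normal(0.0, sigma / np.sqrt(N), size=(trials, n_actions))
print("mean individual estimate:", estimates.mean())              # approx. 0
print("mean of max over actions:", estimates.max(axis=1).mean())  # approx. 0.27 > 0
```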

Learning the Bias Correction

One approach, developed by Keane & Wolpin (1994) in the context of dynamic discrete choice models, treats the bias itself as a quantity to be learned and subtracted. For a given value function $v$ and Monte Carlo sample size $N$, define the bias at state $s$ as

$$\delta(s) = \mathbb{E}\big[(\widehat{\Bellman}_N v)(s)\big] - (\Bellman v)(s) \ge 0.$$

While we cannot compute $\delta(s)$ directly (we lack both the expectation and the exact Bellman application), the bias has structure. It depends on observable quantities: the number of actions $|\mathcal{A}_s|$, the sample size $N$, and the spread of action values. When one action dominates, $\delta(s)$ is small. When several actions have similar values, noise is more likely to flip the maximizer, increasing $\delta(s)$.

Rather than deriving $\delta(s)$ analytically, Keane and Wolpin proposed learning it empirically. The strategy follows a “simulate on a subset, interpolate everywhere” template common in econometric dynamic programming. At a carefully chosen set of states, we run both high-fidelity simulation (many samples or exact integration) and the low-fidelity $N$-sample estimate used in value iteration. The gap between these estimates provides training data for the bias. We then fit a regression model $g_\eta$ that predicts $\delta(s)$ from features of the state and action-value distribution. During value iteration, we subtract the predicted bias from the raw Monte Carlo maximum.

Useful features for predicting the bias include the spread of action values (from a separate high-fidelity simulation), the number of actions, and gaps to the best action. These are cheap to compute and track regimes where maximization bias is large. The procedure can be summarized as:
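
A rough sketch of the procedure (one possible reading of it, not the authors' original code), assuming a hypothetical `sample_values(s, a, N)` that returns $N$ Monte Carlo draws of the continuation value and using an ordinary linear regression for $g_\eta$:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_bias_model(train_states, actions, r, sample_values, gamma,
                   N_low=5, N_high=2000):
    """Fit g_eta: regress the observed gap between low- and high-fidelity
    Bellman maxima on cheap features of the action-value distribution."""
    X, y = [], []
    for s in train_states:
        lo = np.array([r(s, a) + gamma * sample_values(s, a, N_low).mean()
                       for a in actions])
        hi = np.array([r(s, a) + gamma * sample_values(s, a, N_high).mean()
                       for a in actions])
        spread = hi.std()
        gap = hi.max() - np.sort(hi)[-2]    # best minus second best (>= 2 actions)
        X.append([spread, gap, len(actions)])
        y.append(lo.max() - hi.max())       # noisy sample of the bias delta(s)
    return LinearRegression().fit(np.array(X), np.array(y))

# During value iteration, subtract the predicted bias from the raw N_low maximum:
#   corrected = lo.max() - g_eta.predict([[spread, gap, n_actions]])[0]
```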

The regression model $g_\eta$ is trained offline by comparing high and low-fidelity simulations at a grid of states, then applied during each value iteration step. This approach has been influential in econometrics for structural estimation problems. However, it has seen limited adoption in reinforcement learning. The computational overhead is substantial: it requires high-fidelity simulation at training states, careful feature engineering, and maintaining the regression model throughout learning. More critically, the circular dependency between the bias estimate and the value function can amplify errors if $g_\eta$ is misspecified.

Analytical Bias Correction

Lee et al. (2013) and Lee & Powell (2019) developed an alternative that derives the bias analytically using extreme value theory rather than learning it empirically. Their analysis considers the distributional structure of the noise in Monte Carlo estimates. When rewards are Gaussian with variance $\sigma^2$, the bias in $\max_a \widehat{R}(s,a)$ can be computed from extreme value theory.

For a single-state MDP where all randomness comes from stochastic rewards, suppose that for each action $a$, the observed reward is $\widehat{R}(s,a) = \mu_a + \varepsilon_a$ where $\varepsilon_a \sim \mathcal{N}(0, \sigma^2)$. When we form the target $\max_a \widehat{R}(s,a)$, we are taking the maximum of $M = |\mathcal{A}(s)|$ independent Gaussians. After appropriate centering and scaling, this maximum converges in distribution to a Gumbel random variable. The expected bias can be computed as:

$$\mathbb{E}\left[\max_{a} \varepsilon_a\right] \approx \sigma \left(\frac{\xi}{b_M} + b_M\right)$$

where $b_M = \sqrt{2\log M - \log\log M - \log 4\pi}$ and $\xi \approx 0.5772$ is the Euler-Mascheroni constant. This quantity is positive and grows with the number of actions $M$. The bias scales with the noise level $\sigma$ and with the action space size.

The corrected target becomes:

$$\tilde{y}_i = r_i + \gamma \max_{a'} q(s'_i, a'; \boldsymbol{\theta}^{(n)}) - B(s_i, a_i)$$

where the bias correction term is:

$$B(s,a) = \left(\frac{\xi}{b_{|A(s)|}} + b_{|A(s)|}\right) \hat{\sigma}(s,a)$$

Here $\hat{\sigma}(s,a) = \sqrt{\widehat{\text{Var}}[R(s,a)]}$ is the estimated standard deviation of the reward at state-action pair $(s,a)$. This can be computed from multiple observed reward samples at the same $(s,a)$ pair if available.
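
A sketch of the correction term as a function of the estimated noise level and the action count, following the asymptotic formula above (which needs a moderately large $M$ for the square root to be well defined):

```python
import numpy as np

XI = 0.5772156649          # Euler-Mascheroni constant

def bias_correction(sigma_hat, n_actions):
    """B(s,a) = (xi / b_M + b_M) * sigma_hat for M = n_actions Gaussian errors."""
    M = n_actions
    b_M = np.sqrt(2 * np.log(M) - np.log(np.log(M)) - np.log(4 * np.pi))
    return (XI / b_M + b_M) * sigma_hat

# For sigma_hat = 1 and M = 10 the formula gives about 1.63; the exact expected
# maximum of 10 standard normal draws is about 1.54, so the asymptotic
# approximation slightly overshoots at small M.
print(bias_correction(1.0, 10))
```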

This approach is exact for single-state MDPs where Q-values equal expected rewards and all bias comes from reward noise. For general multi-state MDPs, the situation is more complex. The bias arises from the max operator over Q-values at the next state $s'$, not from reward stochasticity at the current state. Lee and Powell’s multistate extension addresses this by introducing a separate correction term $B^T$ for transition stochasticity, estimated empirically by averaging $\max_{a'} q(s', a')$ over multiple sampled next states from the same $(s,a)$ pair, rather than using an analytical formula. The single-state analytical correction remains valuable for bandit problems and as a first-order approximation when reward variance dominates, but does not directly address max-operator bias from function approximation error in the Q-values themselves. The correction is optimal specifically for Gaussian errors. If the true errors have heavier tails, the correction may be insufficient.

Lee-Powell’s analytical formula can be viewed as deriving what Keane-Wolpin learned empirically, replacing the regression model with a closed-form expression from extreme value theory. The trade-off is generality versus computational cost: Keane-Wolpin adapts to any noise structure but requires expensive high-fidelity simulations at training states; Lee-Powell assumes Gaussian noise but evaluates instantly once $\hat{\sigma}$ is estimated. Both subtract a correction term from targets while preserving squared error loss.

An alternative to explicit bias correction is to replace the hard max operator with a soft maximum. The regularized MDP chapter discusses smooth Bellman operators that use logsumexp or Gaussian uncertainty-weighted aggregation to avoid the discontinuity and overestimation inherent in taking a hard maximum. These approaches modify the Bellman operator itself rather than correcting its bias after the fact, preserving squared error loss while constructing smoother targets. Another approach, examined in the next chapter, changes the loss function itself to match the noise structure in TD targets.

Decoupling Selection and Evaluation

An alternative approach modifies the estimator itself to break the coupling that creates bias. In the standard Monte Carlo update $\max_a \{r(s,a) + \gamma \hat{\mu}_N(s,a)\}$, the same noisy estimate both selects which action looks best and provides the value assigned to that action. To see the problem, decompose the estimator into its mean and noise:

$$\hat{\mu}_N(s,a) = \mu(s,a) + \varepsilon_a,$$

where $\varepsilon_a$ is zero-mean noise. Whichever action happens to have the largest $\varepsilon_a$ gets selected, and that same positive noise inflates the target value:

$$Y = r(s,a^\star) + \gamma \mu(s,a^\star) + \gamma \varepsilon_{a^\star},$$

where $a^\star = \arg\max_a \{r(s,a) + \gamma \mu(s,a) + \gamma \varepsilon_a\}$. This coupling (using the same random variable $\varepsilon_{a^\star}$ for both selection and evaluation) produces $\mathbb{E}[Y] \ge \max_a \{r(s,a) + \gamma \mu(s,a)\}$.

Double Q-learning (Van Hasselt et al., 2016) breaks this coupling by maintaining two independent Monte Carlo estimates. Conceptually, for each action $a$ at state $s$, we would draw two separate sets of samples from $p(\cdot|s,a)$:

$$\hat{\mu}^{(1)}_N(s,a) = \mu(s,a) + \varepsilon^{(1)}_a, \quad \hat{\mu}^{(2)}_N(s,a) = \mu(s,a) + \varepsilon^{(2)}_a,$$

where $\varepsilon^{(1)}_a$ and $\varepsilon^{(2)}_a$ are independent zero-mean noise terms. We use the first estimate to select the action but the second to evaluate it:

$$a^\star = \arg\max_{a} \left\{r(s,a) + \gamma \hat{\mu}^{(1)}_N(s,a)\right\}, \quad Y = r(s,a^\star) + \gamma \hat{\mu}^{(2)}_N(s,a^\star).$$

Note that $a^\star$ is a random variable. It is the result of maximizing over noisy estimates $\hat{\mu}^{(1)}_N(s,a)$, and therefore depends on the noise $\varepsilon^{(1)}$. The target value $Y$ uses the second estimator $\hat{\mu}^{(2)}_N(s,a^\star)$, evaluating it at the randomly selected action $a^\star$. Expanding the target value to make the dependencies explicit:

$$Y = r(s,a^\star) + \gamma \hat{\mu}^{(2)}_N(s,a^\star) = r(s,a^\star) + \gamma \mu(s,a^\star) + \gamma \varepsilon^{(2)}_{a^\star},$$

where $a^\star$ itself depends on $\varepsilon^{(1)}$. To see why this eliminates evaluation bias, write the expectation as a nested expectation:

$$\mathbb{E}[Y] = \mathbb{E}_{\varepsilon^{(1)}}\Big[\mathbb{E}_{\varepsilon^{(2)}}\big[r(s,a^\star) + \gamma \mu(s,a^\star) + \gamma \varepsilon^{(2)}_{a^\star} \mid a^\star\big]\Big].$$

The inner expectation, conditioned on the selected action $a^\star$, equals:

$$\mathbb{E}_{\varepsilon^{(2)}}\big[r(s,a^\star) + \gamma \mu(s,a^\star) + \gamma \varepsilon^{(2)}_{a^\star} \mid a^\star\big] = r(s,a^\star) + \gamma \mu(s,a^\star),$$

because $\mathbb{E}[\varepsilon^{(2)}_{a^\star} \mid a^\star] = 0$. Why does this conditional expectation vanish? Because $a^\star$ is a random variable determined entirely by $\varepsilon^{(1)}$ (through the argmax operation), while $\varepsilon^{(2)}$ is independent of $\varepsilon^{(1)}$. Therefore, $\varepsilon^{(2)}$ is independent of $a^\star$. By the fundamental property of independence, for any random variables $X$ and $Z$ with $X \perp Z$, we have $\mathbb{E}[X \mid Z] = \mathbb{E}[X]$. Applying this:

$$\mathbb{E}[\varepsilon^{(2)}_{a^\star} \mid a^\star] = \mathbb{E}[\varepsilon^{(2)}_{a^\star}] = 0,$$

where the final equality holds because each $\varepsilon^{(2)}_a$ has zero mean. Taking the outer expectation:

$$\mathbb{E}[Y] = \mathbb{E}_{\varepsilon^{(1)}}\big[r(s,a^\star) + \gamma \mu(s,a^\star)\big].$$

The evaluation noise contributes zero on average because it is independent of the selection. However, double Q-learning does not eliminate all bias. Two distinct sources of bias arise when maximizing over noisy estimates:

  1. Selection bias: Using $\arg\max_a \{r(s,a) + \gamma \mu(s,a) + \gamma \varepsilon^{(1)}_a\}$ tends to select actions that received positive noise. The noise $\varepsilon^{(1)}$ causes us to favor actions that got lucky, so $a^\star$ is not uniformly distributed but skewed toward actions with positive $\varepsilon^{(1)}_a$. Because $a^\star$ is therefore sometimes not the truly best action, its true value falls short of the maximum on average: $\mathbb{E}_{\varepsilon^{(1)}}[\mu(s,a^\star)] \le \max_a \mu(s,a)$.

  2. Evaluation bias: In the standard estimator, we use the same noisy estimate $\hat{\mu}_N(s,a^\star)$ that selected $a^\star$ to also evaluate it. If $a^\star$ was selected because $\varepsilon_{a^\star}$ happened to be large and positive, that same inflated estimate provides the target value. This creates additional bias beyond the selection effect.

Double Q-learning eliminates evaluation bias by using independent noise for evaluation: even though $a^\star$ is biased toward actions with positive $\varepsilon^{(1)}$, the evaluation uses $\varepsilon^{(2)}$ which is independent of the selection, so $\mathbb{E}[\varepsilon^{(2)}_{a^\star} \mid a^\star] = 0$. Selection bias remains, but removing evaluation bias substantially reduces total overestimation.

Remark 2 (Implementation via different Q-functions)

The conceptual framework above assumes we can draw multiple independent samples from $p(\cdot|s,a)$ for each state-action pair. In practice, this would require resetting a simulator to the same state and sampling multiple times, which is often infeasible. The practical implementation achieves the same effect differently: maintain two Q-functions that are trained on different data or updated at different times (e.g., one is a slowly-updating target network). Since the two Q-functions experience different noise realizations during training, their errors remain less correlated than if we used a single Q-function for both selection and evaluation. This is how Double DQN works, as we’ll see in the next chapter on online learning.

The following algorithm implements this principle with Monte Carlo integration. We maintain two Q-functions and alternate which one selects actions and which one evaluates them. The bias reduction mechanism applies whether we store Q-values in a table or use function approximation.
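
A sketch of the core update in this theoretical form, assuming tabular Q-functions stored as dictionaries (e.g., `collections.defaultdict(float)`) and a hypothetical simulator call `sample_step(rng, s, a)`:

```python
def double_q_update(q_sel, q_eval, s, a, actions, sample_step, N, gamma,
                    alpha, rng):
    """One update of q_sel at (s, a): q_sel picks the next action, q_eval
    scores it.  Call with (q1, q2) and (q2, q1) in alternation for symmetry."""
    target = 0.0
    for _ in range(N):
        r, s_next = sample_step(rng, s, a)
        a_star = max(actions, key=lambda b: q_sel[(s_next, b)])   # select
        target += r + gamma * q_eval[(s_next, a_star)]            # evaluate
    target /= N
    q_sel[(s, a)] += alpha * (target - q_sel[(s, a)])
```

Alternating the roles of the two Q-functions keeps their noise realizations approximately independent, which is what the decoupling argument above requires.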

Note that this algorithm is written in the theoretical form that assumes we can draw $N$ samples from each $(s,a)$ pair. In practice with single-trajectory data ($N=1$, one quadruple $(s,a,r,s')$ per transition), the double Q-learning principle still applies but we work with the single observed next state $s'$ and maintain two Q-functions that are trained on different subsets of data or updated at different frequencies to achieve the independence needed for bias reduction.

In each update, $q^{(1)}$ selects the action but $q^{(2)}$ evaluates it. The noise in $q^{(1)}$ influences which action is chosen, but the independent noise in $q^{(2)}$ does not inflate the evaluation. The update for $q^{(2)}$ applies the same principle symmetrically, with the roles of the two Q-functions swapped.

Weighted Estimators: A Third Approach

The Keane-Wolpin and double Q-learning approaches represent two strategies for addressing overestimation bias: explicit correction and decoupled estimation. A third approach, developed by D’Eramo et al. (2016), replaces the hard maximum with a probability-weighted average that accounts for uncertainty in value estimates.

The maximum estimator (ME) used in standard Q-learning takes the maximum of sample mean Q-values:

$$\hat{\mu}^{ME}_* = \max_{a \in \mathcal{A}} \hat{\mu}_a$$

where $\hat{\mu}_a$ is the sample mean Q-value for action $a$. While each individual $\hat{\mu}_a$ may be unbiased, the maximum is systematically upward biased. The bias can be bounded by:

$$\text{Bias}(\hat{\mu}^{ME}_*) \leq \sqrt{\frac{M-1}{M}\sum_{a=1}^M \frac{\sigma_a^2}{n_a}}$$

where $M = |\mathcal{A}|$ is the number of actions, $\sigma_a^2$ is the variance of Q-values for action $a$, and $n_a$ is the number of samples used to estimate it.

The double estimator (DE) from double Q-learning reduces this bias by decoupling selection and evaluation, but introduces negative bias:

$$\mathbb{E}[\hat{\mu}^{DE}_*] \leq \max_a \mu_a$$

This pessimism can be problematic when one action is clearly superior to others, as the estimator may systematically undervalue the optimal action.

The weighted estimator (WE) provides a middle ground. Instead of taking a hard maximum, compute a weighted average where weights reflect the probability that each action is optimal. Under a Gaussian approximation for the sampling distribution of Q-value estimates, the weight for action $a$ is:

$$w_a = \mathbb{P}\left(a \in \arg\max_{a' \in \mathcal{A}} \hat{\mu}_{a'}\right) = \int_{-\infty}^{+\infty} \phi\left(\frac{x - \hat{\mu}_a}{\hat{\sigma}_a/\sqrt{n_a}}\right) \prod_{b \neq a} \Phi\left(\frac{x - \hat{\mu}_b}{\hat{\sigma}_b/\sqrt{n_b}}\right) dx$$

where $\phi$ and $\Phi$ are the standard normal PDF and CDF. This integral computes the probability that action $a$ achieves the maximum by integrating over all possible true values $x$. For each $x$, the $\phi$ term gives the probability density that action $a$ has true value $x$, while the product of $\Phi$ terms gives the probability that all other actions have true values below $x$.

The weighted estimator then becomes:

$$\hat{\mu}^{WE}_* = \sum_{a \in \mathcal{A}} w_a \hat{\mu}_a$$

D’Eramo et al. established that this estimator satisfies $\text{Bias}(\hat{\mu}^{WE}_*) \leq \text{Bias}(\hat{\mu}^{ME}_*)$ and $\text{Bias}(\hat{\mu}^{WE}_*) \geq \text{Bias}(\hat{\mu}^{DE}_*)$. The weighted estimator thus provides a middle ground between the optimism of ME and the pessimism of DE.

The relative performance depends on the problem structure. When one action is clearly superior ($\mu_1 - \mu_2 \gg \sigma$), the signal dominates the noise and ME performs well with minimal overestimation. When multiple actions have similar values ($\mu_1 \approx \mu_2$), eliminating overestimation becomes critical and DE excels despite its pessimism. In intermediate cases, which represent most practical scenarios, WE adapts to the actual uncertainty and avoids both extremes.

The weighted estimator can be incorporated into Q-learning by maintaining variance estimates alongside Q-values and computing the weighted target at each step. The weights $w_a$ themselves can serve as an exploration policy that naturally adapts to estimation uncertainty. This approach adapts automatically to heterogeneous uncertainty across actions and provides balanced bias, but requires maintaining variance estimates (adding memory overhead) and computing integrals (expensive for large action spaces). The method assumes Gaussian sampling distributions, though it proves robust in practice even when this assumption is violated.
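
A sketch of the weight computation under the Gaussian approximation, using numerical integration (the densities are normalized here so that the weights sum to one):

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def weighted_estimator(mu_hat, sigma_hat, n_samples):
    """w_a = P(a is the argmax) under independent Gaussian sampling
    distributions N(mu_hat_a, (sigma_hat_a / sqrt(n_a))^2)."""
    mu_hat = np.asarray(mu_hat, dtype=float)
    se = np.asarray(sigma_hat, dtype=float) / np.sqrt(np.asarray(n_samples))
    M = len(mu_hat)
    w = np.empty(M)
    for a in range(M):
        integrand = lambda x: (norm.pdf(x, mu_hat[a], se[a]) *
                               np.prod([norm.cdf(x, mu_hat[b], se[b])
                                        for b in range(M) if b != a]))
        w[a], _ = quad(integrand, -np.inf, np.inf)
    return w, float(w @ mu_hat)

w, mu_we = weighted_estimator([1.0, 0.9, 0.2], [1.0, 1.0, 1.0], [10, 10, 10])
print(np.round(w, 3), mu_we)   # the two near-tied actions share most of the weight
```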

The weighted estimator relates to the smooth Bellman operators discussed in the regularized MDP chapter. Both replace the discontinuous hard maximum with smooth aggregation, but weighted estimation adapts to state-action-specific uncertainty while logsumexp applies uniform smoothing via temperature. The choice among ME, DE, and WE depends on computational constraints and problem characteristics. When variance estimates are available and action spaces are small, WE offers a principled approach that balances the extremes of overestimation and underestimation. When maintaining variance is expensive or action spaces are large, double Q-learning provides a simpler alternative that eliminates evaluation bias without explicit probability weighting.

Summary and Forward Look

This chapter developed the simulation-based approach to approximate dynamic programming by replacing analytical integration with Monte Carlo sampling. We showed how deterministic quadrature rules give way to randomized estimation when transition densities are expensive to evaluate or the state space is high-dimensional. We need only the ability to sample from transition distributions, not to compute their densities explicitly.

Three foundational components emerged:

  1. Monte Carlo Bellman operator: Replace exact expectations $\int v(s')p(ds'|s,a)$ with sample averages $\frac{1}{N}\sum v(s'_i)$, typically with $N=1$ in practice

  2. Q-functions: Cache state-action values to amortize the cost of action selection, eliminating the need for repeated sampling at decision time

  3. Bias mitigation: Address the systematic overestimation that arises from maximizing over noisy estimates through four approaches: learning empirical corrections (Keane-Wolpin), analytical adjustment from extreme value theory (Lee-Powell), decoupling selection from evaluation (double Q-learning), and probability-weighted aggregation (weighted estimators)

The algorithms presented (Parametric Q-Value Iteration, Keane-Wolpin, Double Q) remain in the theoretical setting where we can draw multiple samples from each state-action pair and choose optimization parameters freely. The next chapter addresses the practical constraints of reinforcement learning: working with fixed offline datasets of single-sample transitions, choosing function approximators and optimization strategies, and understanding the algorithmic design space that yields methods like FQI, NFQI, and DQN.

References
  1. Keane, M. P., & Wolpin, K. I. (1994). The Solution and Estimation of Discrete Choice Dynamic Programming Models by Simulation and Interpolation: Monte Carlo Evidence. The Review of Economics and Statistics, 76(4), 648. 10.2307/2109768
  2. Lee, D., Defourny, B., & Powell, W. B. (2013). Bias-corrected Q-learning to control max-operator bias in Q-learning. 2013 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 93–99. 10.1109/adprl.2013.6614994
  3. Lee, D., & Powell, W. B. (2019). Bias-Corrected Q-Learning With Multistate Extension. IEEE Transactions on Automatic Control, 64(10), 4011–4023. 10.1109/tac.2019.2912443
  4. Van Hasselt, H., Guez, A., & Silver, D. (2016). Deep Reinforcement Learning with Double Q-learning. Proceedings of the AAAI Conference on Artificial Intelligence, 30(1).
  5. D’Eramo, C., Nuara, A., Restelli, M., & Nowe, A. (2016). Estimating the Maximum Expected Value through Gaussian Approximation. Proceedings of the 33rd International Conference on Machine Learning, 48, 1032–1040.