Skip to article frontmatterSkip to article content

Dynamic Programming

Unlike the methods we’ve discussed so far, dynamic programming takes a step back and considers an entire family of related problems rather than a single optimization problem. This approach, while seemingly more complex at first glance, can often lead to efficient solutions.

Dynamic programming leverage the solution structure underlying many control problems that allows for a decomposition it into smaller, more manageable subproblems. Each subproblem is itself an optimization problem, embedded within the larger whole. This recursive structure is the foundation upon which dynamic programming constructs its solutions.

To ground our discussion, let us return to the domain of discrete-time optimal control problems (DOCPs). These problems frequently arise from the discretization of continuous-time optimal control problems. While the focus here will be on deterministic problems, these concepts extend naturally to stochastic problems by taking the expectation over the random quantities.

Consider a typical DOCP of Bolza type:

minimizeJcT(xT)+t=1T1ct(xt,ut)subject toxt+1=ft(xt,ut),t=1,,T1,ulbutuub,t=1,,T,xlbxtxub,t=1,,T,givenx1\begin{align*} \text{minimize} \quad & J \triangleq c_\mathrm{T}(\mathbf{x}_T) + \sum_{t=1}^{T-1} c_t(\mathbf{x}_t, \mathbf{u}_t) \\ \text{subject to} \quad & \mathbf{x}_{t+1} = \mathbf{f}_t(\mathbf{x}_t, \mathbf{u}_t), \quad t = 1, \ldots, T-1, \\ & \mathbf{u}_{lb} \leq \mathbf{u}_t \leq \mathbf{u}_{ub}, \quad t = 1, \ldots, T, \\ & \mathbf{x}_{lb} \leq \mathbf{x}_t \leq \mathbf{x}_{ub}, \quad t = 1, \ldots, T, \\ \text{given} \quad & \mathbf{x}_1 \end{align*}

Rather than considering only the total cost from the initial time to the final time, dynamic programming introduces the concept of cost from an arbitrary point in time to the end. This leads to the definition of the “cost-to-go” or “value function” Jk(xk)J_k(\mathbf{x}_k):

Jk(xk)cT(xT)+t=kT1ct(xt,ut)J_k(\mathbf{x}_k) \triangleq c_\mathrm{T}(\mathbf{x}_T) + \sum_{t=k}^{T-1} c_t(\mathbf{x}_t, \mathbf{u}_t)

This function represents the total cost incurred from stage kk onwards to the end of the time horizon, given that the system is initialized in state xk\mathbf{x}_k at stage kk. Suppose the problem has been solved from stage k+1k+1 to the end, yielding the optimal cost-to-go Jk+1(xk+1)J_{k+1}^\star(\mathbf{x}_{k+1}) for any state xk+1\mathbf{x}_{k+1} at stage k+1k+1. The question then becomes: how does this information inform the decision at stage kk?

Given knowledge of the optimal behavior from k+1k+1 onwards, the task reduces to determining the optimal action uk\mathbf{u}_k at stage kk. This control should minimize the sum of the immediate cost ck(xk,uk)c_k(\mathbf{x}_k, \mathbf{u}_k) and the optimal future cost Jk+1(xk+1)J_{k+1}^\star(\mathbf{x}_{k+1}), where xk+1\mathbf{x}_{k+1} is the resulting state after applying action uk\mathbf{u}_k. Mathematically, this is expressed as:

Jk(xk)=minuk[ck(xk,uk)+Jk+1(fk(xk,uk))]J_k^\star(\mathbf{x}_k) = \min_{\mathbf{u}_k} \left[ c_k(\mathbf{x}_k, \mathbf{u}_k) + J_{k+1}^\star(\mathbf{f}_k(\mathbf{x}_k, \mathbf{u}_k)) \right]

This equation is known as Bellman’s equation, named after Richard Bellman, who formulated the principle of optimality:

An optimal policy has the property that whatever the previous state and decision, the remaining decisions must constitute an optimal policy with regard to the state resulting from the previous decision.

In other words, any sub-path of an optimal path, from any intermediate point to the end, must itself be optimal. This principle is the basis for the backward induction procedure which computes the optimal value function and provides closed-loop control capabilities without having to use an explicit NLP solver.

Dynamic programming can handle nonlinear systems and non-quadratic cost functions naturally. It provides a global optimal solution, when one exists, and can incorporate state and control constraints with relative ease. However, as the dimension of the state space increases, this approach suffers from what Bellman termed the “curse of dimensionality.” The computational complexity and memory requirements grow exponentially with the state dimension, rendering direct application of dynamic programming intractable for high-dimensional problems.

Fortunately, learning-based methods offer efficient tools to combat the curse of dimensionality on two fronts: by using function approximation (e.g., neural networks) to avoid explicit discretization, and by leveraging randomization through Monte Carlo methods inherent in the learning paradigm. Most of this course is dedicated to those ideas.

Backward Recursion

The principle of optimality provides a methodology for solving optimal control problems. Beginning at the final time horizon and working backwards, at each stage the local optimization problem given by Bellman’s equation is solved. This process, termed backward recursion or backward induction, constructs the optimal value function stage by stage.

Upon completion of this backward pass, we now have access to the optimal control to take at any stage and in any state. Furthermore, we can simulate optimal trajectories from any initial state and applying the optimal policy at each stage to generate the optimal trajectory.

Example: Optimal Harvest in Resource Management

Dynamic programming is often used in resource management and conservation biology to devise policies to be implemented by decision makers and stakeholders : for eg. in fishereries, or timber harvesting. Per Conroy & Peterson (2013), we consider a population of a particular species, whose abundance we denote by xtx_t, where tt represents discrete time steps. Our objective is to maximize the cumulative harvest over a finite time horizon, while also considering the long-term sustainability of the population. This optimization problem can be formulated as:

maximizet=t0tfF(xtht)+FT(xtf)\text{maximize} \quad \sum_{t=t_0}^{t_f} F(x_t \cdot h_t) + F_\mathrm{T}(x_{t_f})

Here, F()F(\cdot) represents the immediate reward function associated with harvesting, hth_t is the harvest rate at time tt, and FT()F_\mathrm{T}(\cdot) denotes a terminal value function that could potentially assign value to the final population state. In this particular problem, we assign no terminal value to the final population state, setting FT(xtf)=0F_\mathrm{T}(x_{t_f}) = 0 and allowing us to focus solely on the cumulative harvest over the time horizon.

In our model population model, the abundance of a specicy xx ranges from 1 to 100 individuals. The decision variable is the harvest rate hh, which can take values from the set D={0,0.1,0.2,0.3,0.4,0.5}D = \{0, 0.1, 0.2, 0.3, 0.4, 0.5\}. The population dynamics are governed by a modified logistic growth model:

xt+1=xt+0.3xt(1xt/125)htxtx_{t+1} = x_t + 0.3x_t(1 - x_t/125) - h_tx_t

where the 0.3 represents the growth rate and 125 is the carrying capacity (the maximum population size given the available resources). The logistic growth model returns continuous values; however our DP formulation uses a discrete state space. Therefore, we also round the the outcomes to the nearest integer.

Applying the principle of optimality, we can express the optimal value function J(xt,t)J^\star(x_t,t) recursively:

J(xt,t)=maxhtD(F(x,h,t)+J(xt+1,t+1))J^\star(x_t, t) = \max_{h_t \in D} (F(x, h, t) + J^*(x_{t+1}, t+1))

with the boundary condition J(xtf)=0J^*(x_{t_f}) = 0.

It’s worth noting that while this example uses a relatively simple model, the same principles can be applied to more complex scenarios involving stochasticity, multiple species interactions, or spatial heterogeneity.

Source
import numpy as np

# Parameters
r_max = 0.3
K = 125
T = 20  # Number of time steps
N_max = 100  # Maximum population size to consider
h_max = 0.5  # Maximum harvest rate
h_step = 0.1  # Step size for harvest rate

# Create state and decision spaces
N_space = np.arange(1, N_max + 1)
h_space = np.arange(0, h_max + h_step, h_step)

# Initialize value function and policy
V = np.zeros((T + 1, len(N_space)))
policy = np.zeros((T, len(N_space)))

# Terminal value function (F_T)
def terminal_value(N):
    return 0

# State return function (F)
def state_return(N, h):
    return N * h

# State dynamics function
def state_dynamics(N, h):
    return N + r_max * N * (1 - N / K) - N * h

# Backward iteration
for t in range(T - 1, -1, -1):
    for i, N in enumerate(N_space):
        max_value = float('-inf')
        best_h = 0

        for h in h_space:
            if h > 1:  # Ensure harvest rate doesn't exceed 100%
                continue

            next_N = state_dynamics(N, h)
            if next_N < 1:  # Ensure population doesn't go extinct
                continue

            next_N_index = np.searchsorted(N_space, next_N)
            if next_N_index == len(N_space):
                next_N_index -= 1

            value = state_return(N, h) + V[t + 1, next_N_index]

            if value > max_value:
                max_value = value
                best_h = h

        V[t, i] = max_value
        policy[t, i] = best_h

# Function to simulate the optimal policy with conversion to Python floats
def simulate_optimal_policy(initial_N, T):
    trajectory = [float(initial_N)]  # Ensure first value is a Python float
    harvests = []

    for t in range(T):
        N = trajectory[-1]
        N_index = np.searchsorted(N_space, N)
        if N_index == len(N_space):
            N_index -= 1

        h = policy[t, N_index]
        harvests.append(float(N * h))  # Ensure harvest is a Python float

        next_N = state_dynamics(N, h)
        trajectory.append(float(next_N))  # Ensure next population value is a Python float

    return trajectory, harvests

# Example usage
initial_N = 50
trajectory, harvests = simulate_optimal_policy(initial_N, T)

print("Optimal policy:")
print(policy)
print("\nPopulation trajectory:", trajectory)
print("Harvests:", harvests)
print("Total harvest:", sum(harvests))

Handling Continuous Spaces with Interpolation

In many real-world problems, such as our resource management example, the state space is inherently continuous. Dynamic programming, however, is usually defined on discrete state spaces. To reconcile this, we approximate the value function on a finite grid of points and use interpolation to estimate its value elsewhere.

In our earlier example, we acted as if population sizes could only be whole numbers: 1 fish, 2 fish, 3 fish. But real measurements don’t fit neatly. What do you do with a survey that reports 42.7 fish? Our reflex in the code example was to round to the nearest integer, effectively saying “let’s just call it 43.” This corresponds to nearest-neighbor interpolation, also known as discretization. It’s the zeroth-order case: you assume the value between grid points is constant and equal to the closest one. In practice, this amounts to overlaying a grid on the continuous landscape and forcing yourself to stand at the intersections. In our demo code, this step was carried out with numpy.searchsorted.

While easy to implement, nearest-neighbor interpolation can introduce artifacts:

  1. Decisions may change abruptly, even if the state only shifts slightly.

  2. Precision is lost, especially in regimes where small variations matter.

  3. The curse of dimensionality forces an impractically fine grid if many state variables are added.

To address these issues, we can use higher-order interpolation. Instead of taking the nearest neighbor, we estimate the value at off-grid points by leveraging multiple nearby values.

Backward Recursion with Interpolation

Suppose we have computed Jk+1(x)J_{k+1}^\star(\mathbf{x}) only at grid points xXgrid\mathbf{x} \in \mathcal{X}_\text{grid}. To evaluate Bellman’s equation at an arbitrary xk+1\mathbf{x}_{k+1}, we interpolate. Formally, let Ik+1(x)I_{k+1}(\mathbf{x}) be the interpolation operator that extends the value function from Xgrid\mathcal{X}_\text{grid} to the continuous space. Then:

Jk(xk)=minuk[ck(xk,uk)+Ik+1(fk(xk,uk))].J_k^\star(\mathbf{x}_k) = \min_{\mathbf{u}_k} \Big[ c_k(\mathbf{x}_k, \mathbf{u}_k) + I_{k+1}\big(\mathbf{f}_k(\mathbf{x}_k, \mathbf{u}_k)\big) \Big].

For instance, in one dimension, linear interpolation gives:

Ik+1(x)=Jk+1(xl)+xxlxuxl(Jk+1(xu)Jk+1(xl)),I_{k+1}(x) = J_{k+1}^\star(x_l) + \frac{x - x_l}{x_u - x_l} \big(J_{k+1}^\star(x_u) - J_{k+1}^\star(x_l)\big),

where xlx_l and xux_u are the nearest grid points bracketing xx. Linear interpolation is often sufficient, but higher-order methods (cubic splines, radial basis functions) can yield smoother and more accurate estimates. The choice of interpolation scheme and grid layout both affect accuracy and efficiency. A finer grid improves resolution but increases computational cost, motivating strategies like adaptive grid refinement or replacing interpolation altogether with parametric function approximation which we are going to see later in this book.

In higher-dimensional spaces, naive interpolation becomes prohibitively expensive due to the curse of dimensionality. Several approaches such as tensorized multilinear interpolation, radial basis functions, and machine learning models address this challenge by extending a common principle: they approximate the value function at unobserved points using information from a finite set of evaluations. However, as dimensionality continues to grow, even tensor methods face scalability limits, which is why flexible parametric models like neural networks have become essential tools for high-dimensional function approximation.

Example: Optimal Harvest with Linear Interpolation

Here is a demonstration of the backward recursion procedure using linear interpolation.

Source
import numpy as np

# Parameters
r_max = 0.3
K = 125
T = 20  # Number of time steps
N_max = 100  # Maximum population size to consider
h_max = 0.5  # Maximum harvest rate
h_step = 0.1  # Step size for harvest rate

# Create state and decision spaces
N_space = np.arange(1, N_max + 1)
h_space = np.arange(0, h_max + h_step, h_step)

# Initialize value function and policy
V = np.zeros((T + 1, len(N_space)))
policy = np.zeros((T, len(N_space)))

# Terminal value function (F_T)
def terminal_value(N):
    return 0

# State return function (F)
def state_return(N, h):
    return N * h

# State dynamics function
def state_dynamics(N, h):
    return N + r_max * N * (1 - N / K) - N * h

# Function to linearly interpolate between grid points in N_space
def interpolate_value_function(V, N_space, next_N, t):
    if next_N <= N_space[0]:
        return V[t, 0]  # Below or at minimum population, return minimum value
    if next_N >= N_space[-1]:
        return V[t, -1]  # Above or at maximum population, return maximum value
    
    # Find indices to interpolate between
    lower_idx = np.searchsorted(N_space, next_N) - 1
    upper_idx = lower_idx + 1
    
    # Linear interpolation
    N_lower = N_space[lower_idx]
    N_upper = N_space[upper_idx]
    weight = (next_N - N_lower) / (N_upper - N_lower)
    return (1 - weight) * V[t, lower_idx] + weight * V[t, upper_idx]

# Backward iteration with interpolation
for t in range(T - 1, -1, -1):
    for i, N in enumerate(N_space):
        max_value = float('-inf')
        best_h = 0
        
        for h in h_space:
            if h > 1:  # Ensure harvest rate doesn't exceed 100%
                continue
            
            next_N = state_dynamics(N, h)
            if next_N < 1:  # Ensure population doesn't go extinct
                continue
            
            # Interpolate value for next_N
            value = state_return(N, h) + interpolate_value_function(V, N_space, next_N, t + 1)
            
            if value > max_value:
                max_value = value
                best_h = h
        
        V[t, i] = max_value
        policy[t, i] = best_h

# Function to simulate the optimal policy using interpolation
def simulate_optimal_policy(initial_N, T):
    trajectory = [initial_N]
    harvests = []

    for t in range(T):
        N = trajectory[-1]
        
        # Interpolate optimal harvest rate
        if N <= N_space[0]:
            h = policy[t, 0]
        elif N >= N_space[-1]:
            h = policy[t, -1]
        else:
            lower_idx = np.searchsorted(N_space, N) - 1
            upper_idx = lower_idx + 1
            weight = (N - N_space[lower_idx]) / (N_space[upper_idx] - N_space[lower_idx])
            h = (1 - weight) * policy[t, lower_idx] + weight * policy[t, upper_idx]
        
        harvests.append(float(N * h))  # Ensure harvest is a Python float
        next_N = state_dynamics(N, h)
        trajectory.append(float(next_N))  # Ensure next population value is a Python float

    return trajectory, harvests

# Example usage
initial_N = 50
trajectory, harvests = simulate_optimal_policy(initial_N, T)

print("Optimal policy:")
print(policy)
print("\nPopulation trajectory:", trajectory)
print("Harvests:", harvests)
print("Total harvest:", sum(harvests))

Due to pedagogical considerations, this example is using our own implementation of the linear interpolation procedure. However, a more general and practical approach would be to use a built-in interpolation procedure in Numpy. Because our state space has a single dimension, we can simply use scipy.interpolate.interp1d which offers various interpolation methods through its kind argument, which can take values in ‘linear’, ‘nearest’, ‘nearest-up’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘previous’, or ‘next’. ‘zero’, ‘slinear’, ‘quadratic’ and ‘cubic’.

Here’s a more general implementation which here uses cubic interpolation through the scipy.interpolate.interp1d function:

Source
import numpy as np
from scipy.interpolate import interp1d

# Parameters
r_max = 0.3
K = 125
T = 20  # Number of time steps
N_max = 100  # Maximum population size to consider
h_max = 0.5  # Maximum harvest rate
h_step = 0.1  # Step size for harvest rate

# Create state and decision spaces
N_space = np.arange(1, N_max + 1)
h_space = np.arange(0, h_max + h_step, h_step)

# Initialize value function and policy
V = np.zeros((T + 1, len(N_space)))
policy = np.zeros((T, len(N_space)))

# Terminal value function (F_T)
def terminal_value(N):
    return 0

# State return function (F)
def state_return(N, h):
    return N * h

# State dynamics function
def state_dynamics(N, h):
    return N + r_max * N * (1 - N / K) - N * h

# Function to create interpolation function for a given time step
def create_interpolator(V_t, N_space):
    return interp1d(N_space, V_t, kind='cubic', bounds_error=False, fill_value=(V_t[0], V_t[-1]))

# Backward iteration with interpolation
for t in range(T - 1, -1, -1):
    interpolator = create_interpolator(V[t+1], N_space)
    
    for i, N in enumerate(N_space):
        max_value = float('-inf')
        best_h = 0

        for h in h_space:
            if h > 1:  # Ensure harvest rate doesn't exceed 100%
                continue

            next_N = state_dynamics(N, h)
            if next_N < 1:  # Ensure population doesn't go extinct
                continue

            # Use interpolation to get the value for next_N
            value = state_return(N, h) + interpolator(next_N)

            if value > max_value:
                max_value = value
                best_h = h

        V[t, i] = max_value
        policy[t, i] = best_h

# Function to simulate the optimal policy using interpolation
def simulate_optimal_policy(initial_N, T):
    trajectory = [initial_N]
    harvests = []

    for t in range(T):
        N = trajectory[-1]
        
        # Create interpolator for the policy at time t
        policy_interpolator = interp1d(N_space, policy[t], kind='cubic', bounds_error=False, fill_value=(policy[t][0], policy[t][-1]))
        
        h = policy_interpolator(N)
        harvests.append(float(N * h))  # Ensure harvest is a Python float

        next_N = state_dynamics(N, h)
        trajectory.append(float(next_N))  # Ensure next population value is a Python float

    return trajectory, harvests

# Example usage
initial_N = 50
trajectory, harvests = simulate_optimal_policy(initial_N, T)

print("Optimal policy (first few rows):")
print(policy[:5])
print("\nPopulation trajectory:", trajectory)
print("Harvests:", harvests)
print("Total harvest:", sum(harvests))

Stochastic Dynamic Programming and Markov Decision Processes

While our previous discussion centered on deterministic systems, many real-world problems involve uncertainty. Stochastic Dynamic Programming (SDP) extends our framework to handle stochasticity in both the objective function and system dynamics. This extension naturally leads us to consider more general policy classes and to formalize when simpler policies suffice.

Decision Rules and Policies

Before diving into stochastic systems, we need to establish terminology for the different types of strategies a decision maker might employ. In the deterministic setting, we implicitly used feedback controllers of the form u(x,t)u(\mathbf{x}, t). In the stochastic setting, we must be more precise about what information policies can use and how they select actions.

A decision rule is a prescription for action selection in each state at a specified decision epoch. These rules can vary in their complexity based on two main criteria:

  1. Dependence on history: Markovian or History-dependent

  2. Action selection method: Deterministic or Randomized

Markovian decision rules depend only on the current state, while history-dependent rules consider the entire sequence of past states and actions. Formally, a history hth_t at time tt is:

ht=(s1,a1,,st1,at1,st)h_t = (s_1, a_1, \ldots, s_{t-1}, a_{t-1}, s_t)

The set of all possible histories at time tt, denoted HtH_t, grows exponentially with tt:

Deterministic rules select an action with certainty, while randomized rules specify a probability distribution over the action space.

These classifications lead to four types of decision rules:

  1. Markovian Deterministic (MD): πt:SAs\pi_t: \mathcal{S} \rightarrow \mathcal{A}_s

  2. Markovian Randomized (MR): πt:SP(As)\pi_t: \mathcal{S} \rightarrow \mathcal{P}(\mathcal{A}_s)

  3. History-dependent Deterministic (HD): πt:HtAs\pi_t: H_t \rightarrow \mathcal{A}_s

  4. History-dependent Randomized (HR): πt:HtP(As)\pi_t: H_t \rightarrow \mathcal{P}(\mathcal{A}_s)

where P(As)\mathcal{P}(\mathcal{A}_s) denotes the set of probability distributions over As\mathcal{A}_s.

A policy π\boldsymbol{\pi} is a sequence of decision rules, one for each decision epoch:

π=(π1,π2,...,πN1)\boldsymbol{\pi} = (\pi_1, \pi_2, ..., \pi_{N-1})

The set of all policies of class KK (where K{HR,HD,MR,MD}K \in \{HR, HD, MR, MD\}) is denoted as ΠK\Pi^K. These policy classes form a hierarchy:

ΠMDΠMRΠHR,ΠMDΠHDΠHR\Pi^{MD} \subset \Pi^{MR} \subset \Pi^{HR}, \quad \Pi^{MD} \subset \Pi^{HD} \subset \Pi^{HR}

The largest set ΠHR\Pi^{HR} contains all possible policies. We ask: under what conditions can we restrict attention to the much simpler set ΠMD\Pi^{MD} without loss of optimality?

Stochastic System Dynamics

In the stochastic setting, our system evolution takes the form:

xt+1=ft(xt,ut,wt)\mathbf{x}_{t+1} = \mathbf{f}_t(\mathbf{x}_t, \mathbf{u}_t, \mathbf{w}_t)

Here, wt\mathbf{w}_t represents a random disturbance or noise term at time tt due to the inherent uncertainty in the system’s behavior. The stage cost function may also incorporate stochastic influences:

ct(xt,ut,wt)c_t(\mathbf{x}_t, \mathbf{u}_t, \mathbf{w}_t)

In this context, our objective shifts from minimizing a deterministic cost to minimizing the expected total cost:

E[cT(xT)+t=1T1ct(xt,ut,wt)]\mathbb{E}\left[c_\mathrm{T}(\mathbf{x}_T) + \sum_{t=1}^{T-1} c_t(\mathbf{x}_t, \mathbf{u}_t, \mathbf{w}_t)\right]

where the expectation is taken over the distributions of the random variables wt\mathbf{w}_t. The principle of optimality still holds in the stochastic case, but Bellman’s optimality equation now involves an expectation:

Jk(xk)=minukEwk[ck(xk,uk,wk)+Jk+1(fk(xk,uk,wk))]J_k^\star(\mathbf{x}_k) = \min_{\mathbf{u}_k} \mathbb{E}_{\mathbf{w}_k}\left[c_k(\mathbf{x}_k, \mathbf{u}_k, \mathbf{w}_k) + J_{k+1}^\star(\mathbf{f}_k(\mathbf{x}_k, \mathbf{u}_k, \mathbf{w}_k))\right]

In practice, this expectation is often computed by discretizing the distribution of wk\mathbf{w}_k when the set of possible disturbances is very large or even continuous. Let’s say we approximate the distribution with KK discrete values wki\mathbf{w}_k^i, each occurring with probability pkip_k^i. Then our Bellman equation becomes:

Jk(xk)=minuki=1Kpki(ck(xk,uk,wki)+Jk+1(fk(xk,uk,wki)))J_k^\star(\mathbf{x}_k) = \min_{\mathbf{u}_k} \sum_{i=1}^K p_k^i \left(c_k(\mathbf{x}_k, \mathbf{u}_k, \mathbf{w}_k^i) + J_{k+1}^\star(\mathbf{f}_k(\mathbf{x}_k, \mathbf{u}_k, \mathbf{w}_k^i))\right)

Optimality Equations in the Stochastic Setting

When dealing with stochastic systems, a central question arises: what information should our control policy use? In the most general case, a policy might use the entire history of observations and actions. However, as we’ll see, the Markovian structure of our problems allows for dramatic simplifications.

Let ht=(s1,a1,s2,a2,,st1,at1,st)h_t = (s_1, a_1, s_2, a_2, \ldots, s_{t-1}, a_{t-1}, s_t) denote the complete history up to time tt. In the stochastic setting, the history-based optimality equations become:

ut(ht)=supaAst{rt(st,a)+jSpt(jst,a)ut+1(ht,a,j)},uN(hN)=rN(sN)u_t(h_t) = \sup_{a\in A_{s_t}}\left\{ r_t(s_t,a) + \sum_{j\in S} p_t(j\mid s_t,a)\, u_{t+1}(h_t,a,j) \right\},\quad u_N(h_N)=r_N(s_N)

where we now explicitly use the transition probabilities pt(jst,a)p_t(j|s_t,a) rather than a deterministic dynamics function.

Intuition: This formalizes Bellman’s principle of optimality: “An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.” The recursive structure means that optimal local decisions (choosing the best action at each step) lead to global optimality, even with uncertainty captured by the transition probabilities.

A simplification occurs when we examine these history-based equations more closely. The Markov property of our system dynamics and rewards means that the optimal return actually depends on the history only through the current state:

Intuition: The Markov property means that the current state contains all information needed to predict future evolution. The past provides no additional value for decision-making. This result allows us to work with value functions vt(s)v_t^*(s) indexed only by state and time, dramatically simplifying both theory and computation.

This state-sufficiency result, combined with the fact that randomization never helps when maximizing expected returns, leads to a dramatic simplification of the policy space:

Intuition: Even in stochastic systems, randomization in the policy doesn’t help when maximizing expected returns: you should always choose the action with the highest expected value. Combined with state sufficiency, this means simple state-to-action mappings are optimal.

These results justify focusing on deterministic Markov policies and lead to the backward recursion algorithm for stochastic systems:

While SDP provides us with a framework to for handling uncertainty, it makes the curse of dimensionality even more difficult to handle in practice. Both the state space and the disturbance space must be discretized. This can lead to a combinatorial explosion in the number of scenarios to be evaluated at each stage.

However, just as we tackled the challenges of continuous state spaces with discretization and interpolation, we can devise efficient methods to handle the additional complexity of evaluating expectations. This problem essentially becomes one of numerical integration. When the set of disturbances is continuous (as is often the case with continuous state spaces), we enter a domain where numerical quadrature methods could be applied. But these methods tend to scale poorly as the number of dimensions grows. This is where more efficient techniques, often rooted in Monte Carlo methods, come into play. Two ingredients tackle the curse of dimensionality:

  1. Function approximation (through discretization, interpolation, neural networks, etc.)

  2. Monte Carlo integration (simulation)

These two elements essentially distill the key ingredients of machine learning, which is the direction we’ll be exploring in this course.

Example: Stochastic Optimal Harvest in Resource Management

Building upon our previous deterministic model, we now introduce stochasticity to more accurately reflect the uncertainties inherent in real-world resource management scenarios Conroy & Peterson, 2013. As before, we consider a population of a particular species, whose abundance we denote by xtx_t, where tt represents discrete time steps. Our objective remains to maximize the cumulative harvest over a finite time horizon, while also considering the long-term sustainability of the population. However, we now account for two sources of stochasticity: partial controllability of harvest and environmental variability affecting growth rates. The optimization problem can be formulated as:

maximizeE[t=t0tfF(xtht)]\text{maximize} \quad \mathbb{E}\left[\sum_{t=t_0}^{t_f} F(x_t \cdot h_t)\right]

Here, F()F(\cdot) represents the immediate reward function associated with harvesting, and hth_t is the realized harvest rate at time tt. The expectation E[]\mathbb{E}[\cdot] over both harvest and growth rates, which we view as random variables. In our stochastic model, the abundance xx still ranges from 1 to 100 individuals. The decision variable is now the desired harvest rate dtd_t, which can take values from the set D=0,0.1,0.2,0.3,0.4,0.5D = {0, 0.1, 0.2, 0.3, 0.4, 0.5}. However, the realized harvest rate hth_t is stochastic and follows a discrete distribution:

ht={0.75dtwith probability 0.25dtwith probability 0.51.25dtwith probability 0.25h_t = \begin{cases} 0.75d_t & \text{with probability } 0.25 \\ d_t & \text{with probability } 0.5 \\ 1.25d_t & \text{with probability } 0.25 \end{cases}

By expressing the harvest rate as a random variable, we mean to capture the fact that harvesting is a not completely under our control: we might obtain more or less what we had intended to. Furthermore, we generalize the population dynamics to the stochastic case via:

$$

x_{t+1} = x_t + r_tx_t(1 - x_t/K) - h_tx_t $$

where K=125K = 125 is the carrying capacity. The growth rate rtr_t is now stochastic and follows a discrete distribution:

rt={0.85rmaxwith probability 0.251.05rmaxwith probability 0.51.15rmaxwith probability 0.25r_t = \begin{cases} 0.85r_{\text{max}} & \text{with probability } 0.25 \\ 1.05r_{\text{max}} & \text{with probability } 0.5 \\ 1.15r_{\text{max}} & \text{with probability } 0.25 \end{cases}

where rmax=0.3r_{\text{max}} = 0.3 is the maximum growth rate. Applying the principle of optimality, we can express the optimal value function J(xt,t)J^\star(x_t, t) recursively:

J(xt,t)=maxd(t)DE[F(xtht)+J(xt+1,t+1)]J^\star(x_t, t) = \max_{d(t) \in D} \mathbb{E}\left[F(x_t \cdot h_t) + J^\star(x_{t+1}, t+1)\right]

where the expectation is taken over the harvest and growth rate random variables. The boundary condition remains J(xtf)=0J^*(x_{t_f}) = 0. We can now adapt our previous code to account for the stochasticity in our model. One important difference is that simulating a solution in this context requires multiple realizations of our process. This is an important consideration when evaluating reinforcement learning methods in practice, as success cannot be claimed based on a single successful trajectory.

Source
import numpy as np
from scipy.interpolate import interp1d

# Parameters
r_max = 0.3
K = 125
T = 30  # Number of time steps
N_max = 100  # Maximum population size to consider
h_max = 0.5  # Maximum harvest rate
h_step = 0.1  # Step size for harvest rate

# Create state and decision spaces
N_space = np.linspace(1, N_max, 100)  # Using more granular state space
h_space = np.arange(0, h_max + h_step, h_step)

# Stochastic parameters
h_outcomes = np.array([0.75, 1.0, 1.25])
h_probs = np.array([0.25, 0.5, 0.25])
r_outcomes = np.array([0.85, 1.05, 1.15]) * r_max
r_probs = np.array([0.25, 0.5, 0.25])

# Initialize value function and policy
V = np.zeros((T + 1, len(N_space)))
policy = np.zeros((T, len(N_space)))

# State return function (F)
def state_return(N, h):
    return N * h

# State dynamics function (stochastic)
def state_dynamics(N, h, r):
    return N + r * N * (1 - N / K) - h * N

# Function to create interpolation function for a given time step
def create_interpolator(V_t, N_space):
    return interp1d(N_space, V_t, kind='linear', bounds_error=False, fill_value=(V_t[0], V_t[-1]))

# Backward iteration with stochastic dynamics
for t in range(T - 1, -1, -1):
    interpolator = create_interpolator(V[t+1], N_space)
    
    for i, N in enumerate(N_space):
        max_value = float('-inf')
        best_h = 0

        for h in h_space:
            if h > 1:  # Ensure harvest rate doesn't exceed 100%
                continue

            expected_value = 0
            for h_factor, h_prob in zip(h_outcomes, h_probs):
                for r_factor, r_prob in zip(r_outcomes, r_probs):
                    realized_h = h * h_factor
                    realized_r = r_factor

                    next_N = state_dynamics(N, realized_h, realized_r)
                    if next_N < 1:  # Ensure population doesn't go extinct
                        continue

                    # Use interpolation to get the value for next_N
                    value = state_return(N, realized_h) + interpolator(next_N)
                    expected_value += value * h_prob * r_prob

            if expected_value > max_value:
                max_value = expected_value
                best_h = h

        V[t, i] = max_value
        policy[t, i] = best_h

# Function to simulate the optimal policy using interpolation (stochastic version)
def simulate_optimal_policy(initial_N, T, num_simulations=100):
    all_trajectories = []
    all_harvests = []

    for _ in range(num_simulations):
        trajectory = [initial_N]
        harvests = []

        for t in range(T):
            N = trajectory[-1]
            
            # Create interpolator for the policy at time t
            policy_interpolator = interp1d(N_space, policy[t], kind='linear', bounds_error=False, fill_value=(policy[t][0], policy[t][-1]))
            
            intended_h = policy_interpolator(N)
            
            # Apply stochasticity
            h_factor = np.random.choice(h_outcomes, p=h_probs)
            r_factor = np.random.choice(r_outcomes, p=r_probs)
            
            realized_h = intended_h * h_factor
            harvests.append(N * realized_h)

            next_N = state_dynamics(N, realized_h, r_factor)
            trajectory.append(next_N)

        all_trajectories.append(trajectory)
        all_harvests.append(harvests)

    return all_trajectories, all_harvests

# Example usage
initial_N = 50
trajectories, harvests = simulate_optimal_policy(initial_N, T)

# Calculate average trajectory and total harvest
avg_trajectory = np.mean(trajectories, axis=0)
avg_total_harvest = np.mean([sum(h) for h in harvests])

print("Optimal policy (first few rows):")
print(policy[:5])
print("\nAverage population trajectory:", avg_trajectory)
print("Average total harvest:", avg_total_harvest)

# Plot results
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
plt.subplot(121)
for traj in trajectories[:20]:  # Plot first 20 trajectories
    plt.plot(range(T+1), traj, alpha=0.3)
plt.plot(range(T+1), avg_trajectory, 'r-', linewidth=2)
plt.title('Population Trajectories')
plt.xlabel('Time')
plt.ylabel('Population')

plt.subplot(122)
plt.hist([sum(h) for h in harvests], bins=20)
plt.title('Distribution of Total Harvest')
plt.xlabel('Total Harvest')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

Linear Quadratic Regulator via Dynamic Programming

We now examine a special case where the backward recursion admits a closed-form solution. When the system dynamics are linear and the cost function is quadratic, the optimization at each stage can be solved analytically. Moreover, the value function itself maintains a quadratic structure throughout the recursion, and the optimal policy reduces to a simple linear feedback law. This result eliminates the need for discretization, interpolation, or any function approximation. The infinite-dimensional problem collapses to tracking a finite set of matrices.

Consider a discrete-time linear system:

xt+1=Atxt+Btut\mathbf{x}_{t+1} = A_t\mathbf{x}_t + B_t\mathbf{u}_t

where xtRn\mathbf{x}_t \in \mathbb{R}^n is the state and utRm\mathbf{u}_t \in \mathbb{R}^m is the control input. The matrices AtRn×nA_t \in \mathbb{R}^{n \times n} and BtRn×mB_t \in \mathbb{R}^{n \times m} describe the system dynamics at time tt.

The cost function to be minimized is quadratic:

J=12xTQTxT+12t=0T1(xtQtxt+utRtut)J = \frac{1}{2}\mathbf{x}_T^\top Q_T \mathbf{x}_T + \frac{1}{2}\sum_{t=0}^{T-1} \left(\mathbf{x}_t^\top Q_t \mathbf{x}_t + \mathbf{u}_t^\top R_t \mathbf{u}_t\right)

where QT0Q_T \succeq 0 (positive semidefinite), Qt0Q_t \succeq 0, and Rt0R_t \succ 0 (positive definite) are symmetric matrices of appropriate dimensions. The positive definiteness of RtR_t ensures the minimization problem is well-posed.

What we now have to observe is that if the terminal cost is quadratic, then the value function at every earlier stage remains quadratic. This is not immediately obvious, but it follows from the structure of Bellman’s equation combined with the linearity of the dynamics.

We claim that the optimal cost-to-go from any stage tt takes the form:

Jt(xt)=12xtPtxtJ_t^\star(\mathbf{x}_t) = \frac{1}{2}\mathbf{x}_t^\top P_t \mathbf{x}_t

for some positive semidefinite matrix PtP_t. At the terminal time, this is true by definition: PT=QTP_T = Q_T.

Let’s verify this structure and derive the recursion for PtP_t using backward induction. Suppose we’ve established that Jt+1(xt+1)=12xt+1Pt+1xt+1J_{t+1}^\star(\mathbf{x}_{t+1}) = \frac{1}{2}\mathbf{x}_{t+1}^\top P_{t+1} \mathbf{x}_{t+1}. Bellman’s equation at stage tt states:

Jt(xt)=minut[12xtQtxt+12utRtut+Jt+1(xt+1)]J_t^\star(\mathbf{x}_t) = \min_{\mathbf{u}_t} \left[ \frac{1}{2}\mathbf{x}_t^\top Q_t \mathbf{x}_t + \frac{1}{2}\mathbf{u}_t^\top R_t \mathbf{u}_t + J_{t+1}^\star(\mathbf{x}_{t+1}) \right]

Substituting the dynamics xt+1=Atxt+Btut\mathbf{x}_{t+1} = A_t\mathbf{x}_t + B_t\mathbf{u}_t and the quadratic form for Jt+1J_{t+1}^\star:

Jt(xt)=minut[12xtQtxt+12utRtut+12(Atxt+Btut)Pt+1(Atxt+Btut)]J_t^\star(\mathbf{x}_t) = \min_{\mathbf{u}_t} \left[ \frac{1}{2}\mathbf{x}_t^\top Q_t \mathbf{x}_t + \frac{1}{2}\mathbf{u}_t^\top R_t \mathbf{u}_t + \frac{1}{2}(A_t\mathbf{x}_t + B_t\mathbf{u}_t)^\top P_{t+1} (A_t\mathbf{x}_t + B_t\mathbf{u}_t) \right]

Expanding the last term:

(Atxt+Btut)Pt+1(Atxt+Btut)=xtAtPt+1Atxt+2xtAtPt+1Btut+utBtPt+1Btut(A_t\mathbf{x}_t + B_t\mathbf{u}_t)^\top P_{t+1} (A_t\mathbf{x}_t + B_t\mathbf{u}_t) = \mathbf{x}_t^\top A_t^\top P_{t+1} A_t \mathbf{x}_t + 2\mathbf{x}_t^\top A_t^\top P_{t+1} B_t \mathbf{u}_t + \mathbf{u}_t^\top B_t^\top P_{t+1} B_t \mathbf{u}_t

The expression inside the minimization becomes:

12xtQtxt+12utRtut+12xtAtPt+1Atxt+xtAtPt+1Btut+12utBtPt+1Btut\frac{1}{2}\mathbf{x}_t^\top Q_t \mathbf{x}_t + \frac{1}{2}\mathbf{u}_t^\top R_t \mathbf{u}_t + \frac{1}{2}\mathbf{x}_t^\top A_t^\top P_{t+1} A_t \mathbf{x}_t + \mathbf{x}_t^\top A_t^\top P_{t+1} B_t \mathbf{u}_t + \frac{1}{2}\mathbf{u}_t^\top B_t^\top P_{t+1} B_t \mathbf{u}_t

Collecting terms involving ut\mathbf{u}_t:

=12xt(Qt+AtPt+1At)xt+xtAtPt+1Btut+12ut(Rt+BtPt+1Bt)ut= \frac{1}{2}\mathbf{x}_t^\top (Q_t + A_t^\top P_{t+1} A_t) \mathbf{x}_t + \mathbf{x}_t^\top A_t^\top P_{t+1} B_t \mathbf{u}_t + \frac{1}{2}\mathbf{u}_t^\top (R_t + B_t^\top P_{t+1} B_t) \mathbf{u}_t

This is a quadratic function of ut\mathbf{u}_t. To find the minimizer, we take the gradient with respect to ut\mathbf{u}_t and set it to zero:

ut=(Rt+BtPt+1Bt)ut+BtPt+1Atxt=0\frac{\partial}{\partial \mathbf{u}_t} = (R_t + B_t^\top P_{t+1} B_t) \mathbf{u}_t + B_t^\top P_{t+1} A_t \mathbf{x}_t = 0

Since Rt+BtPt+1BtR_t + B_t^\top P_{t+1} B_t is positive definite (both RtR_t and Pt+1P_{t+1} are positive semidefinite with RtR_t strictly positive), we can solve for the optimal control:

ut=(Rt+BtPt+1Bt)1BtPt+1Atxt\mathbf{u}_t^\star = -(R_t + B_t^\top P_{t+1} B_t)^{-1} B_t^\top P_{t+1} A_t \mathbf{x}_t

Define the gain matrix:

Kt=(Rt+BtPt+1Bt)1BtPt+1AtK_t = (R_t + B_t^\top P_{t+1} B_t)^{-1} B_t^\top P_{t+1} A_t

so that ut=Ktxt\mathbf{u}_t^\star = -K_t\mathbf{x}_t. This is a linear feedback policy: the optimal control is simply a linear function of the current state.

Substituting ut\mathbf{u}_t^\star back into the cost-to-go expression and simplifying (by completing the square), we obtain:

Jt(xt)=12xtPtxtJ_t^\star(\mathbf{x}_t) = \frac{1}{2}\mathbf{x}_t^\top P_t \mathbf{x}_t

where PtP_t satisfies the discrete-time Riccati equation:

Pt=Qt+AtPt+1AtAtPt+1Bt(Rt+BtPt+1Bt)1BtPt+1AtP_t = Q_t + A_t^\top P_{t+1} A_t - A_t^\top P_{t+1} B_t (R_t + B_t^\top P_{t+1} B_t)^{-1} B_t^\top P_{t+1} A_t

Putting everything together, the backward induction procedure under the LQR setting then becomes:

Markov Decision Process Formulation

Rather than expressing the stochasticity in our system through a disturbance term as a parameter to a deterministic difference equation, we often work with an alternative representation (more common in operations research) which uses the Markov Decision Process formulation. The idea is that when we model our system in this way with the disturbance term being drawn indepently of the previous stages, the induced trajectory are those of a Markov chain. Hence, we can re-cast our control problem in that language, leading to the so-called Markov Decision Process framework in which we express the system dynamics in terms of transition probabilities rather than explicit state equations. In this framework, we express the probability that the system is in a given state using the transition probability function:

pt(xt+1xt,ut)p_t(\mathbf{x}_{t+1} | \mathbf{x}_t, \mathbf{u}_t)

This function gives the probability of transitioning to state xt+1\mathbf{x}_{t+1} at time t+1t+1, given that the system is in state xt\mathbf{x}_t and action ut\mathbf{u}_t is taken at time tt. Therefore, ptp_t specifies a conditional probability distribution over the next states: namely, the sum (for discrete state spaces) or integral over the next state should be 1.

Given the control theory formulation of our problem via a deterministic dynamics function and a noise term, we can derive the corresponding transition probability function through the following relationship:

pt(xt+1xt,ut)=P(Wt{wW:xt+1=ft(xt,ut,w)})={wW:xt+1=ft(xt,ut,w)}qt(w)\begin{aligned} p_t(\mathbf{x}_{t+1} | \mathbf{x}_t, \mathbf{u}_t) &= \mathbb{P}(\mathbf{W}_t \in \left\{\mathbf{w} \in \mathbf{W}: \mathbf{x}_{t+1} = f_t(\mathbf{x}_t, \mathbf{u}_t, \mathbf{w})\right\}) \\ &= \sum_{\left\{\mathbf{w} \in \mathbf{W}: \mathbf{x}_{t+1} = f_t(\mathbf{x}_t, \mathbf{u}_t, \mathbf{w})\right\}} q_t(\mathbf{w}) \end{aligned}

Here, qt(w)q_t(\mathbf{w}) represents the probability density or mass function of the disturbance Wt\mathbf{W}_t (assuming discrete state spaces). When dealing with continuous spaces, the above expression simply contains an integral rather than a summation.

For a system with deterministic dynamics and no disturbance, the transition probabilities become much simpler and be expressed using the indicator function. Given a deterministic system with dynamics:

xt+1=ft(xt,ut)\mathbf{x}_{t+1} = f_t(\mathbf{x}_t, \mathbf{u}_t)

The transition probability function can be expressed as:

pt(xt+1xt,ut)={1if xt+1=ft(xt,ut)0otherwisep_t(\mathbf{x}_{t+1} | \mathbf{x}_t, \mathbf{u}_t) = \begin{cases} 1 & \text{if } \mathbf{x}_{t+1} = f_t(\mathbf{x}_t, \mathbf{u}_t) \\ 0 & \text{otherwise} \end{cases}

With this transition probability function, we can recast our Bellman optimality equation:

Jt(xt)=maxutU{ct(xt,ut)+xt+1pt(xt+1xt,ut)Jt+1(xt+1)}J_t^\star(\mathbf{x}_t) = \max_{\mathbf{u}_t \in \mathbf{U}} \left\{ c_t(\mathbf{x}_t, \mathbf{u}_t) + \sum_{\mathbf{x}_{t+1}} p_t(\mathbf{x}_{t+1} | \mathbf{x}_t, \mathbf{u}_t) J_{t+1}^\star(\mathbf{x}_{t+1}) \right\}

Here, c(xt,ut){c}(\mathbf{x}_t, \mathbf{u}_t) represents the expected immediate reward (or negative cost) when in state xt\mathbf{x}_t and taking action ut\mathbf{u}_t at time tt. The summation term computes the expected optimal value for the future states, weighted by their transition probabilities.

This formulation offers several advantages:

  1. It makes the Markovian nature of the problem explicit: the future state depends only on the current state and action, not on the history of states and actions.

  2. For discrete-state problems, the entire system dynamics can be specified by a set of transition matrices, one for each possible action.

  3. It allows us to bridge the gap with the wealth of methods in the field of probabilistic graphical models and statistical machine learning techniques for modelling and analysis.

Notation in Operations Reseach

The presentation above was intended to bridge the gap between the control-theoretic perspective and the world of closed-loop control through the idea of determining the value function of a parametric optimal control problem. We then saw how the backward induction procedure was applicable to both the deterministic and stochastic cases by taking the expectation over the disturbance variable. We then said that we can alternatively work with a representation of our system where instead of writing our model as a deterministic dynamics function taking a disturbance as an input, we would rather work directly via its transition probability function, which gives rise to the Markov chain interpretation of our system in simulation.

Note that the notation used in control theory tends to differ from that found in operations research communities, in which the field of dynamic programming flourished. We summarize those (purely notational) differences in this section.

In operations research, the system state at each decision epoch is typically denoted by sSs \in \mathcal{S}, where SS is the set of possible system states. When the system is in state ss, the decision maker may choose an action aa from the set of allowable actions As\mathcal{A}_s. The union of all action sets is denoted as A=sSAs\mathcal{A} = \bigcup_{s \in \mathcal{S}} \mathcal{A}_s.

The dynamics of the system are described by a transition probability function pt(js,a)p_t(j | s, a), which represents the probability of transitioning to state jSj \in \mathcal{S} at time t+1t+1, given that the system is in state ss at time tt and action aAsa \in \mathcal{A}_s is chosen. This transition probability function satisfies:

jSpt(js,a)=1\sum_{j \in \mathcal{S}} p_t(j | s, a) = 1

It’s worth noting that in operations research, we typically work with reward maximization rather than cost minimization, which is more common in control theory. However, we can easily switch between these perspectives by simply negating the quantity. That is, maximizing a reward function is equivalent to minimizing its negative, which we would then call a cost function.

The reward function is denoted by rt(s,a)r_t(s, a), representing the reward received at time tt when the system is in state ss and action aa is taken. In some cases, the reward may also depend on the next state, in which case it is denoted as rt(s,a,j)r_t(s, a, j). The expected reward can then be computed as:

rt(s,a)=jSrt(s,a,j)pt(js,a)r_t(s, a) = \sum_{j \in \mathcal{S}} r_t(s, a, j) p_t(j | s, a)

Combined together, these elemetns specify a Markov decision process, which is fully described by the tuple:

{T,S,As,pt(s,a),rt(s,a)}\{T, S, \mathcal{A}_s, p_t(\cdot | s, a), r_t(s, a)\}

where T\mathrm{T} represents the set of decision epochs (the horizon).

What is an Optimal Policy?

Let’s go back to the starting point and define what it means for a policy to be optimal in a Markov Decision Problem. For this, we will be considering different possible search spaces (policy classes) and compare policies based on the ordering of their value from any possible start state. The value of a policy π\boldsymbol{\pi} (optimal or not) is defined as the expected total reward obtained by following that policy from a given starting state. Formally, for a finite-horizon MDP with NN decision epochs, we define the value function vπ(s,t)v^{\boldsymbol{\pi}}(s, t) as:

vπ(s,t)E[k=tN1rt(Sk,Ak)+rN(SN)St=s]v^{\boldsymbol{\pi}}(s, t) \triangleq \mathbb{E}\left[\sum_{k=t}^{N-1} r_t(S_k, A_k) + r_N(S_N) \mid S_t = s\right]

where StS_t is the state at time tt, AtA_t is the action taken at time tt, and rtr_t is the reward function. For simplicity, we write vπ(s)v^{\boldsymbol{\pi}}(s) to denote vπ(s,1)v^{\boldsymbol{\pi}}(s, 1), the value of following policy π\boldsymbol{\pi} from state ss at the first stage over the entire horizon NN.

In finite-horizon MDPs, our goal is to identify an optimal policy, denoted by π\boldsymbol{\pi}^*, that maximizes total expected reward over the horizon NN. Specifically:

vπ(s)vπ(s),sS,πΠHRv^{\boldsymbol{\pi}^*}(s) \geq v^{\boldsymbol{\pi}}(s), \quad \forall s \in \mathcal{S}, \quad \forall \boldsymbol{\pi} \in \Pi^{\text{HR}}

We call π\boldsymbol{\pi}^* an optimal policy because it yields the highest possible value across all states and all policies within the policy class ΠHR\Pi^{\text{HR}}. We denote by vv^* the maximum value achievable by any policy:

v(s)=maxπΠHRvπ(s),sSv^*(s) = \max_{\boldsymbol{\pi} \in \Pi^{\text{HR}}} v^{\boldsymbol{\pi}}(s), \quad \forall s \in \mathcal{S}

In reinforcement learning literature, vv^* is typically referred to as the “optimal value function,” while in some operations research references, it might be called the “value of an MDP.” An optimal policy π\boldsymbol{\pi}^* is one for which its value function equals the optimal value function:

vπ(s)=v(s),sSv^{\boldsymbol{\pi}^*}(s) = v^*(s), \quad \forall s \in \mathcal{S}

This notion of optimality applies to every state. Policies optimal in this sense are sometimes called “uniformly optimal policies.” A weaker notion of optimality, often encountered in reinforcement learning practice, is optimality with respect to an initial distribution of states. In this case, we seek a policy πΠHR\boldsymbol{\pi} \in \Pi^{\text{HR}} that maximizes:

sSvπ(s)P1(S1=s)\sum_{s \in \mathcal{S}} v^{\boldsymbol{\pi}}(s) P_1(S_1 = s)

where P1(S1=s)P_1(S_1 = s) is the probability of starting in state ss.

The maximum value can be achieved by searching over the space of deterministic Markovian Policies. Consequently:

v(s)=maxπΠHRvπ(s)=maxπΠMDvπ(s),sSv^*(s) = \max_{\boldsymbol{\pi} \in \Pi^{\mathrm{HR}}} v^{\boldsymbol{\pi}}(s) = \max _{\boldsymbol{\pi} \in \Pi^{M D}} v^{\boldsymbol{\pi}}(s), \quad s \in S

This equality significantly simplifies the computational complexity of our algorithms, as the search problem can now be decomposed into NN sub-problems in which we only have to search over the set of possible actions. This is the backward induction algorithm, which we present a second time, but departing this time from the control-theoretic notation and using the MDP formalism:

Note that the same procedure can also be used for finding the value of a policy with minor changes;

This code could also finally be adapted to support randomized policies using:

vπ(st,t)=atAstπt(atst)(rt(st,at)+jSpt(jst,at)vπ(j,t+1))v^{\boldsymbol{\pi}}(s_t, t) = \sum_{a_t \in \mathcal{A}_{s_t}} \pi_t(a_t \mid s_t) \left( r_t(s_t, a_t) + \sum_{j \in S} p_t(j | s_t, a_t) v^{\boldsymbol{\pi}}(j, t+1) \right)

Example: Sample Size Determination in Pharmaceutical Development

Pharmaceutical development is the process of bringing a new drug from initial discovery to market availability. This process is lengthy, expensive, and risky, typically involving several stages:

  1. Drug Discovery: Identifying a compound that could potentially treat a disease.

  2. Preclinical Testing: Laboratory and animal testing to assess safety and efficacy. . Clinical Trials: Testing the drug in humans, divided into phases:

    • Phase I: Testing for safety in a small group of healthy volunteers.

    • Phase II: Testing for efficacy and side effects in a larger group with the target condition.

    • Phase III: Large-scale testing to confirm efficacy and monitor side effects.

  3. Regulatory Review: Submitting a New Drug Application (NDA) for approval.

  4. Post-Market Safety Monitoring: Continuing to monitor the drug’s effects after market release.

This process can take 10-15 years and cost over $1 billion Adams & Brantner (2009). The high costs and risks involved call for a principled approach to decision making. We’ll focus on the clinical trial phases and NDA approval, per the MDP model presented by Chang (2010):

  1. States (SS): Our state space is S={s1,s2,s3,s4}S = \{s_1, s_2, s_3, s_4\}, where:

    • s1s_1: Phase I clinical trial

    • s2s_2: Phase II clinical trial

    • s3s_3: Phase III clinical trial

    • s4s_4: NDA approval

  2. Actions (AA): At each state, the action is choosing the sample size nin_i for the corresponding clinical trial. The action space is A={10,11,...,1000}A = \{10, 11, ..., 1000\}, representing possible sample sizes.

  3. Transition Probabilities (PP): The probability of moving from one state to the next depends on the chosen sample size and the inherent properties of the drug. We define:

    • P(s2s1,n1)=p12(n1)=i=0η1n1(n1i)p0i(1p0)n1iP(s_2|s_1, n_1) = p_{12}(n_1) = \sum_{i=0}^{\lfloor\eta_1 n_1\rfloor} \binom{n_1}{i} p_0^i (1-p_0)^{n_1-i} where p0p_0 is the true toxicity rate and η1\eta_1 is the toxicity threshold for Phase I.

  1. Rewards (RR): The reward function captures the costs of running trials and the potential profit from a successful drug:

    • r(si,ni)=ci(ni)r(s_i, n_i) = -c_i(n_i) for i=1,2,3i = 1, 2, 3, where ci(ni)c_i(n_i) is the cost of running a trial with sample size nin_i.

    • r(s4)=g4r(s_4) = g_4, where g4g_4 is the expected profit from a successful drug.

  2. Discount Factor (γ\gamma): We use a discount factor 0<γ10 < \gamma \leq 1 to account for the time value of money and risk preferences.

Source
import numpy as np
from scipy.stats import binom
from scipy.stats import norm

def binomial_pmf(k, n, p):
    return binom.pmf(k, n, p)

def transition_prob_phase1(n1, eta1, p0):
    return np.sum([binomial_pmf(i, n1, p0) for i in range(int(eta1 * n1) + 1)])

def transition_prob_phase2(n2, eta2, delta):
    return norm.cdf((np.sqrt(n2) / 2) * delta - norm.ppf(1 - eta2))

def transition_prob_phase3(n3, eta3, delta):
    return norm.cdf((np.sqrt(n3) / 2) * delta - norm.ppf(1 - eta3))

def immediate_reward(n):
    return -n  # Negative to represent cost

def backward_induction(S, A, gamma, g4, p0, delta, eta1, eta2, eta3):
    V = np.zeros(len(S))
    V[3] = g4  # Value for NDA approval state
    optimal_n = [None] * 3  # Store optimal n for each phase

    # Backward induction
    for i in range(2, -1, -1):  # Iterate backwards from Phase III to Phase I
        max_value = -np.inf
        for n in A:
            if i == 0:  # Phase I
                p = transition_prob_phase1(n, eta1, p0)
            elif i == 1:  # Phase II
                p = transition_prob_phase2(n, eta2, delta)
            else:  # Phase III
                p = transition_prob_phase3(n, eta3, delta)
            value = immediate_reward(n) + gamma * p * V[i+1]
            if value > max_value:
                max_value = value
                optimal_n[i] = n
        V[i] = max_value

    return V, optimal_n

# Set up the problem parameters
S = ['Phase I', 'Phase II', 'Phase III', 'NDA approval']
A = range(10, 1001)
gamma = 0.95
g4 = 10000
p0 = 0.1  # Example toxicity rate for Phase I
delta = 0.5  # Example normalized treatment difference
eta1, eta2, eta3 = 0.2, 0.1, 0.025

# Run the backward induction algorithm
V, optimal_n = backward_induction(S, A, gamma, g4, p0, delta, eta1, eta2, eta3)

# Print results
for i, state in enumerate(S):
    print(f"Value for {state}: {V[i]:.2f}")
print(f"Optimal sample sizes: Phase I: {optimal_n[0]}, Phase II: {optimal_n[1]}, Phase III: {optimal_n[2]}")

# Sanity checks
print("\nSanity checks:")
print(f"1. NDA approval value: {V[3]}")
print(f"2. All values non-positive and <= NDA value: {all(v <= V[3] for v in V)}")
print(f"3. Optimal sample sizes in range: {all(10 <= n <= 1000 for n in optimal_n if n is not None)}")

Infinite-Horizon MDPs

It often makes sense to model control problems over infinite horizons. We extend the previous setting and define the expected total reward of policy πΠHR\boldsymbol{\pi} \in \Pi^{\mathrm{HR}}, vπv^{\boldsymbol{\pi}} as:

vπ(s)=E[t=1r(St,At)]v^{\boldsymbol{\pi}}(s) = \mathbb{E}\left[\sum_{t=1}^{\infty} r(S_t, A_t)\right]

One drawback of this model is that we could easily encounter values that are ++\infty or -\infty, even in a setting as simple as a single-state MDP which loops back into itself and where the accrued reward is nonzero.

Therefore, it is often more convenient to work with an alternative formulation which guarantees the existence of a limit: the expected total discounted reward of policy πΠHR\boldsymbol{\pi} \in \Pi^{\mathrm{HR}} is defined to be:

vγπ(s)limNE[t=1Nγt1r(St,At)]v_\gamma^{\boldsymbol{\pi}}(s) \equiv \lim_{N \rightarrow \infty} \mathbb{E}\left[\sum_{t=1}^N \gamma^{t-1} r(S_t, A_t)\right]

for 0γ<10 \leq \gamma < 1 and when maxsSmaxaAsr(s,a)=Rmax<\max_{s \in \mathcal{S}} \max_{a \in \mathcal{A}_s}|r(s, a)| = R_{\max} < \infty, in which case, vγπ(s)(1γ)1Rmax|v_\gamma^{\boldsymbol{\pi}}(s)| \leq (1-\gamma)^{-1} R_{\max}.

Finally, another possibility for the infinite-horizon setting is the so-called average reward or gain of policy πΠHR\boldsymbol{\pi} \in \Pi^{\mathrm{HR}} defined as:

gπ(s)limN1NE[t=1Nr(St,At)]g^{\boldsymbol{\pi}}(s) \equiv \lim_{N \rightarrow \infty} \frac{1}{N} \mathbb{E}\left[\sum_{t=1}^N r(S_t, A_t)\right]

We won’t be working with this formulation in this course due to its inherent practical and theoretical complexities.

Extending the previous notion of optimality from finite-horizon models, a policy π\boldsymbol{\pi}^* is said to be discount optimal for a given γ\gamma if:

vγπ(s)vγπ(s) for each sS and all πΠHRv_\gamma^{\boldsymbol{\pi}^*}(s) \geq v_\gamma^{\boldsymbol{\pi}}(s) \quad \text { for each } s \in S \text { and all } \boldsymbol{\pi} \in \Pi^{\mathrm{HR}}

Furthermore, the value of a discounted MDP vγ(s)v_\gamma^*(s), is defined by:

vγ(s)maxπΠHRvγπ(s)v_\gamma^*(s) \equiv \max _{\boldsymbol{\pi} \in \Pi^{\mathrm{HR}}} v_\gamma^{\boldsymbol{\pi}}(s)

More often, we refer to vγv_\gamma by simply calling it the optimal value function.

As for the finite-horizon setting, the infinite horizon discounted model does not require history-dependent policies, since for any πΠHR\boldsymbol{\pi} \in \Pi^{HR} there exists a πΠMR\boldsymbol{\pi}^{\prime} \in \Pi^{MR} with identical total discounted reward:

vγ(s)maxπΠHRvγπ(s)=maxπΠMRvγπ(s).v_\gamma^*(s) \equiv \max_{\boldsymbol{\pi} \in \Pi^{HR}} v_\gamma^{\boldsymbol{\pi}}(s)=\max_{\boldsymbol{\pi} \in \Pi^{MR}} v_\gamma^{\boldsymbol{\pi}}(s) .

Random Horizon Interpretation of Discounting

The use of discounting can be motivated both from a modeling perspective and as a means to ensure that the total reward remains bounded. From the modeling perspective, we can view discounting as a way to weight more or less importance on the immediate rewards vs. the long-term consequences. There is also another interpretation which stems from that of a finite horizon model but with an uncertain end time. More precisely:

Let vνπ(s)v_\nu^{\boldsymbol{\pi}}(s) denote the expected total reward obtained by using policy π\boldsymbol{\pi} when the horizon length ν\nu is random. We define it by:

vνπ(s)Esπ[Eν{t=1νr(St,At)}]v_\nu^{\boldsymbol{\pi}}(s) \equiv \mathbb{E}_s^{\boldsymbol{\pi}}\left[\mathbb{E}_\nu\left\{\sum_{t=1}^\nu r(S_t, A_t)\right\}\right]

Vector Representation in Markov Decision Processes

Let V be the set of bounded real-valued functions on a discrete state space S. This means any function fV f \in V satisfies the condition:

f=maxsSf(s)<.\|f\| = \max_{s \in S} |f(s)| < \infty.

where notation f \|f\| represents the sup-norm (or \ell_\infty -norm) of the function f f .

When working with discrete state spaces, we can interpret elements of V as vectors and linear operators on V as matrices, allowing us to leverage tools from linear algebra. The sup-norm (\ell_\infty norm) of matrix H\mathbf{H} is defined as:

HmaxsSjSHs,j\|\mathbf{H}\| \equiv \max_{s \in S} \sum_{j \in S} |\mathbf{H}_{s,j}|

where Hs,j\mathbf{H}_{s,j} represents the (s,j)(s, j)-th component of the matrix H\mathbf{H}.

For a Markovian decision rule πΠMD\pi \in \Pi^{MD}, we define:

rπ(s)r(s,π(s)),rπRS,[Pπ]s,jp(js,π(s)),PπRS×S.\begin{align*} \mathbf{r}_\pi(s) &\equiv r(s, \pi(s)), \quad \mathbf{r}_\pi \in \mathbb{R}^{|S|}, \\ [\mathbf{P}_\pi]_{s,j} &\equiv p(j \mid s, \pi(s)), \quad \mathbf{P}_\pi \in \mathbb{R}^{|S| \times |S|}. \end{align*}

For a randomized decision rule πΠMR\pi \in \Pi^{MR}, these definitions extend to:

rπ(s)aAsπ(as)r(s,a),[Pπ]s,jaAsπ(as)p(js,a).\begin{align*} \mathbf{r}_\pi(s) &\equiv \sum_{a \in A_s} \pi(a \mid s) \, r(s, a), \\ [\mathbf{P}_\pi]_{s,j} &\equiv \sum_{a \in A_s} \pi(a \mid s) \, p(j \mid s, a). \end{align*}

In both cases, rπ\mathbf{r}_\pi denotes a reward vector in RS\mathbb{R}^{|S|}, with each component rπ(s)\mathbf{r}_\pi(s) representing the reward associated with state ss. Similarly, Pπ\mathbf{P}_\pi is a transition probability matrix in RS×S\mathbb{R}^{|S| \times |S|}, capturing the transition probabilities under decision rule π\pi.

For a nonstationary Markovian policy π=(π1,π2,)ΠMR\boldsymbol{\pi} = (\pi_1, \pi_2, \ldots) \in \Pi^{MR}, the expected total discounted reward is given by:

vγπ(s)=E[t=1γt1r(St,At)|S1=s].\mathbf{v}_\gamma^{\boldsymbol{\pi}}(s)=\mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^{t-1} r\left(S_t, A_t\right) \,\middle|\, S_1 = s\right].

Using vector notation, this can be expressed as:

vγπ=t=1γt1Pπt1rπ1=rπ1+γPπ1rπ2+γ2Pπ1Pπ2rπ3+=rπ1+γPπ1(rπ2+γPπ2rπ3+γ2Pπ2Pπ3rπ4+).\begin{aligned} \mathbf{v}_\gamma^{\boldsymbol{\pi}} &= \sum_{t=1}^{\infty} \gamma^{t-1} \mathbf{P}_{\boldsymbol{\pi}}^{t-1} \mathbf{r}_{\pi_1} \\ &= \mathbf{r}_{\pi_1} + \gamma \mathbf{P}_{\pi_1} \mathbf{r}_{\pi_2} + \gamma^2 \mathbf{P}_{\pi_1} \mathbf{P}_{\pi_2} \mathbf{r}_{\pi_3} + \cdots \\ &= \mathbf{r}_{\pi_1} + \gamma \mathbf{P}_{\pi_1} \left( \mathbf{r}_{\pi_2} + \gamma \mathbf{P}_{\pi_2} \mathbf{r}_{\pi_3} + \gamma^2 \mathbf{P}_{\pi_2} \mathbf{P}_{\pi_3} \mathbf{r}_{\pi_4} + \cdots \right). \end{aligned}

This formulation leads to a recursive relationship:

vγπ=rπ1+γPπ1vγπ=t=1γt1Pπt1rπt\begin{align*} \mathbf{v}_\gamma^{\boldsymbol{\pi}} &= \mathbf{r}_{\pi_1} + \gamma \mathbf{P}_{\pi_1} \mathbf{v}_\gamma^{\boldsymbol{\pi}^{\prime}}\\ &=\sum_{t=1}^{\infty} \gamma^{t-1} \mathbf{P}_{\boldsymbol{\pi}}^{t-1} \mathbf{r}_{\pi_t} \end{align*}

where π=(π2,π3,)\boldsymbol{\pi}^{\prime} = (\pi_2, \pi_3, \ldots).

For a stationary policy π=const(π)\boldsymbol{\pi} = \mathrm{const}(\pi) with constant decision rule π\pi, the total expected reward simplifies to:

vγπ=rπ+γPπvγπ=t=1γt1Pπt1rπ\begin{align*} \mathbf{v}_\gamma^{\pi} &= \mathbf{r}_\pi+ \gamma \mathbf{P}_\pi \mathbf{v}_\gamma^{\pi} \\ &=\sum_{t=1}^{\infty} \gamma^{t-1} \mathbf{P}_\pi^{t-1} \mathbf{r}_{\pi} \end{align*}

This last expression is called a Neumann series expansion, and it’s guaranteed to exists under the assumptions of bounded reward and discount factor strictly less than one.

Note that for any induced matrix norm \|\cdot\| (i.e., a norm satisfying HvHv\|\mathbf{H}\mathbf{v}\| \leq \|\mathbf{H}\| \cdot \|\mathbf{v}\| for all vectors v\mathbf{v}) and any matrix H\mathbf{H}, the spectral radius is bounded by:

ρ(H)H.\rho(\mathbf{H}) \leq \|\mathbf{H}\|.

This inequality provides a practical way to verify the convergence condition ρ(H)<1\rho(\mathbf{H}) < 1 by checking the simpler condition H<1\|\mathbf{H}\| < 1 rather than trying to compute the eigenvalues directly.

We can now verify that (IγPd)(\mathbf{I} - \gamma \mathbf{P}_d) is invertible and the Neumann series converges.

  1. Norm of the transition matrix: Since Pd\mathbf{P}_d is a stochastic matrix (each row sums to 1 and all entries are non-negative), its \ell_\infty-norm is:

    Pd=maxsSjS[Pd]s,j=maxsS1=1.\|\mathbf{P}_d\| = \max_{s \in S} \sum_{j \in S} [\mathbf{P}_d]_{s,j} = \max_{s \in S} 1 = 1.
  2. Norm of the scaled matrix: Using the homogeneity property of norms, we have:

    γPd=γPd=γ1=γ.\|\gamma \mathbf{P}_d\| = |\gamma| \cdot \|\mathbf{P}_d\| = |\gamma| \cdot 1 = |\gamma|.
  3. Bounding the spectral radius: Since the spectral radius is bounded by the matrix norm:

    ρ(γPd)γPd=γ.\rho(\gamma \mathbf{P}_d) \leq \|\gamma \mathbf{P}_d\| = |\gamma|.
  4. Verifying convergence: Since 0γ<10 \leq \gamma < 1 by assumption, we have:

    ρ(γPd)γ<1.\rho(\gamma \mathbf{P}_d) \leq |\gamma| < 1.

    This strict inequality guarantees that (IγPd)(\mathbf{I} - \gamma \mathbf{P}_d) is invertible and the Neumann series converges.

Therefore, the Neumann series expansion converges and yields:

vγd=(IγPd)1rd=t=0(γPd)trd=t=1γt1Pdt1rd.\mathbf{v}_\gamma^{d^\infty} = (\mathbf{I} - \gamma \mathbf{P}_d)^{-1} \mathbf{r}_d = \sum_{t=0}^{\infty} (\gamma \mathbf{P}_d)^t \mathbf{r}_d = \sum_{t=1}^{\infty} \gamma^{t-1} \mathbf{P}_d^{t-1} \mathbf{r}_d.

Consequently, for a stationary policy, vγd\mathbf{v}_\gamma^{d^\infty} can be determined as the solution to the linear equation:

v=rd+γPdv,\mathbf{v} = \mathbf{r}_d+ \gamma \mathbf{P}_d\mathbf{v},

which can be rearranged to:

(IγPd)v=rd.(\mathbf{I} - \gamma \mathbf{P}_d) \mathbf{v} = \mathbf{r}_d.

We can also characterize vγd\mathbf{v}_\gamma^{d^\infty} as the solution to an operator equation. More specifically, define the transformation Ld\mathrm{L}_d by

Ldvrd+γPdv\mathrm{L}_d \mathbf{v} \equiv \mathbf{r}_d+\gamma \mathbf{P}_d\mathbf{v}

for any vV\mathbf{v} \in V. Intuitively, Ld\mathrm{L}_d takes a value function v\mathbf{v} as input and returns a new value function that combines immediate rewards (rd\mathbf{r}_d) with discounted future values (γPdv\gamma \mathbf{P}_d\mathbf{v}).

Therefore, we view Ld\mathrm{L}_d as an operator mapping elements of VV to VV: i.e., Ld:VV\mathrm{L}_d: V \rightarrow V. The fact that the value function of a policy is the solution to a fixed-point equation can then be expressed with the statement:

vγd=Ldvγd.\mathbf{v}_\gamma^{d^\infty}=\mathrm{L}_d \mathbf{v}_\gamma^{d^\infty}.

This is a fixed-point equation: the value function vγd\mathbf{v}_\gamma^{d^\infty} is a fixed point of the operator Ld\mathrm{L}_d.

Solving Operator Equations

The operator equation we encountered in MDPs, vγd=Ldvγd\mathbf{v}_\gamma^{d^\infty} = \mathrm{L}_d \mathbf{v}_\gamma^{d^\infty}, is a specific instance of a more general class of problems known as operator equations. These equations appear in various fields of mathematics and applied sciences, ranging from differential equations to functional analysis.

Operator equations can take several forms, each with its own characteristics and solution methods:

  1. Fixed Point Form: x=T(x)x = \mathrm{T}(x), where T:XX\mathrm{T}: X \rightarrow X. Common in fixed-point problems, such as our MDP equation, we seek a fixed point xx^* such that x=T(x)x^* = \mathrm{T}(x^*).

  2. General Operator Equation: T(x)=y\mathrm{T}(x) = y, where T:XY\mathrm{T}: X \rightarrow Y. Here, XX and YY can be different spaces. We seek an xXx \in X that satisfies the equation for a given yYy \in Y.

  3. Nonlinear Equation: T(x)=0\mathrm{T}(x) = 0, where T:XY\mathrm{T}: X \rightarrow Y. A special case of the general operator equation where we seek roots or zeros of the operator.

  4. Variational Inequality: Find xKx^* \in K such that T(x),xx0\langle \mathrm{T}(x^*), x - x^* \rangle \geq 0 for all xKx \in K. Here, KK is a closed convex subset of XX, and T:KX\mathrm{T}: K \rightarrow X^* (the dual space of XX). These problems often arise in optimization, game theory, and partial differential equations.

Successive Approximation Method

For equations in fixed point form, a common numerical solution method is successive approximation, also known as fixed-point iteration:

The convergence of successive approximation depends on the properties of the operator T\mathrm{T}. In the simplest and most common setting, we assume T\mathrm{T} is a contraction mapping. The Banach Fixed-Point Theorem then guarantees that T\mathrm{T} has a unique fixed point, and the successive approximation method will converge to this fixed point from any starting point. Specifically, T\mathrm{T} is a contraction if there exists a constant q[0,1)q \in [0,1) such that for all x,yXx,y \in X:

d(T(x),T(y))qd(x,y)d(\mathrm{T}(x), \mathrm{T}(y)) \leq q \cdot d(x,y)

where dd is the metric on XX. In this case, the rate of convergence is linear, with error bound:

d(xn,x)qn1qd(x1,x0)d(x_n, x^*) \leq \frac{q^n}{1-q} d(x_1, x_0)

However, the contraction mapping condition is not the only one that can lead to convergence. For instance, if T\mathrm{T} is nonexpansive (i.e., Lipschitz continuous with Lipschitz constant 1) and XX is a Banach space with certain geometrical properties (e.g., uniformly convex), then under additional conditions (e.g., T\mathrm{T} has at least one fixed point), the successive approximation method can still converge, albeit potentially more slowly than in the contraction case.

In practice, when dealing with specific problems like MDPs or differential equations, the properties of the operator often naturally align with one of these convergence conditions. For example, in discounted MDPs, the Bellman operator is a contraction in the supremum norm, which guarantees the convergence of value iteration.

Newton-Kantorovich Method

The Newton-Kantorovich method is a generalization of Newton’s method from finite dimensional vector spaces to infinite dimensional function spaces: rather than iterating in the space of vectors, we are iterating in the space of functions.

Newton’s method is often written as the familiar update:

xk+1=xk[DF(xk)]1F(xk),x_{k+1} = x_k - [DF(x_k)]^{-1} F(x_k),

which makes it look as though the essence of the method is “take a derivative and invert it.” But the real workhorse behind Newton’s method (both in finite and infinite dimensions) is linearization.

At each step, the idea is to replace the nonlinear operator F:XYF:X \to Y by a local surrogate model of the form

F(x+h)F(x)+Lh,F(x+h) \approx F(x) + Lh,

where LL is a linear map capturing how small perturbations in the input propagate to changes in the output. This is a Taylor-like expansion in Banach spaces: the role of the derivative is precisely to provide the correct notion of such a linear operator.

To find a root of FF, we impose the condition that the surrogate vanishes at the next iterate:

0=F(x+h)F(x)+Lh.0 = F(x+h) \approx F(x) + Lh.

Solving this linear equation gives the increment hh. In finite dimensions, LL is the Jacobian matrix; in Banach spaces, it must be the Fréchet derivative.

But what exactly is a Fréchet derivative in infinite dimensions? To understand this, we need to generalize the concept of derivative from finite-dimensional calculus. In infinite-dimensional spaces, there are several notions of differentiability, each with different strengths and requirements:

1. Gâteaux (Directional) Derivative

We say that the Gâteaux derivative of FF at xx in a specific direction hh is:

F(x;h)=limt0F(x+th)F(x)tF'(x; h) = \lim_{t \to 0} \frac{F(x + th) - F(x)}{t}

This quantity measures how the function FF changes along the ray x+thx + th. While this limit may exist for each direction hh separately, it doesn’t guarantee that the derivative is linear in hh. This is a key limitation: the Gâteaux derivative can exist in all directions but still fail to provide a good linear approximation.

2. Hadamard Directional Derivative

Rather than considering a single direction of perturbation, we now consider a bundle of perturbations around hh. We ask how the function changes as we approach the target direction from nearby directions. We say that FF has a Hadamard directional derivative if:

F(x;h)=limt0hhF(x+th)F(x)tF'(x; h) = \lim_{\substack{t \downarrow 0 \\ h' \to h}} \frac{F(x + t h') - F(x)}{t}

This is a stronger condition than Gâteaux differentiability because it requires the limit to be uniform over nearby directions. However, it still doesn’t guarantee linearity in hh.

3. Fréchet Derivative

The strongest and most natural notion: FF is Fréchet differentiable at xx if there exists a bounded linear operator LL such that:

limh0F(x+h)F(x)Lhh=0\lim_{h \to 0} \frac{\|F(x + h) - F(x) - Lh\|}{\|h\|} = 0

This definition directly addresses the inadequacy of the previous notions. Unlike Gâteaux and Hadamard derivatives, the Fréchet derivative explicitly requires the existence of a linear operator LL that provides a good approximation. Key properties:

Relationship:

Freˊchet differentiableHadamard directionally diff.Gaˆteaux directionally diff.\text{Fréchet differentiable} \Rightarrow \text{Hadamard directionally diff.} \Rightarrow \text{Gâteaux directionally diff.}

In the context of the Newton-Kantorovich method, we work with an operator F:XYF: X \to Y where both XX and YY are Banach spaces. The Fréchet derivative F(x)F'(x) is the best linear approximation of FF near xx, and it’s exactly this linear operator LL that we use in our linearization F(x+h)F(x)+F(x)hF(x+h) \approx F(x) + F'(x)h.

Now apart from those mathematical technicalities, Newton-Kantorovich has in essence the same structure as that of the original Newton’s method. That is, it applies the following sequence of steps:

  1. Linearize the Operator: Given an approximation xn x_n , we consider the Fréchet derivative of F F , denoted by F(xn) F'(x_n) . This derivative is a linear operator that provides a local approximation of F F near xn x_n .

  2. Set Up the Newton Step: The method then solves the linearized equation for a correction hn h_n :

    F(xn)hn=F(xn).F'(x_n) h_n = -F(x_n).

    This equation represents a linear system where hn h_n is chosen so that the linearized operator F(xn)+F(xn)hn F(x_n) + F'(x_n)h_n equals zero.

  3. Update the Solution: The new approximation xn+1 x_{n+1} is then given by:

    xn+1=xn+hn.x_{n+1} = x_n + h_n.

    This correction step refines xn x_n , bringing it closer to the true solution.

  4. Repeat Until Convergence: We repeat the linearization and update steps until the solution xn x_n converges to the desired tolerance, which can be verified by checking that F(xn) \|F(x_n)\| is sufficiently small, or by monitoring the norm xn+1xn \|x_{n+1} - x_n\| .

The convergence of Newton-Kantorovich does not hinge on F F being a contraction over the entire domain (as it could be the case for successive approximation). The convergence properties of the Newton-Kantorovich method are as follows:

  1. Local Convergence: Under mild conditions (e.g., FF is Fréchet differentiable and F(x)F'(x) is invertible near the solution), the method converges locally. This means that if the initial guess is sufficiently close to the true solution, the method will converge.

  2. Global Convergence: Global convergence is not guaranteed in general. However, under stronger conditions (e.g., FF is analytic and satisfies certain bounds), the method can converge globally.

  3. Rate of Convergence: When the method converges, it typically exhibits quadratic convergence. This means that the error at each step is proportional to the square of the error at the previous step:

    xn+1xCxnx2\|x_{n+1} - x^*\| \leq C\|x_n - x^*\|^2

    where xx^* is the true solution and CC is some constant. This quadratic convergence is significantly faster than the linear convergence typically seen in methods like successive approximation.

Optimality Equations for Infinite-Horizon MDPs

Recall that in the finite-horizon setting, the optimality equations are:

vn(s)=maxaAs{r(s,a)+γjSp(js,a)vn+1(j)}v_n(s) = \max_{a \in A_s} \left\{r(s, a) + \gamma \sum_{j \in S} p(j | s, a) v_{n+1}(j)\right\}

where vn(s)v_n(s) is the value function at time step nn for state ss, AsA_s is the set of actions available in state ss, r(s,a)r(s, a) is the reward function, γ\gamma is the discount factor, and p(js,a)p(j | s, a) is the transition probability from state ss to state jj given action aa.

Intuitively, we would expect that by taking the limit of nn to infinity, we might get the nonlinear equations:

v(s)=maxaAs{r(s,a)+γjSp(js,a)v(j)}v(s) = \max_{a \in A_s} \left\{r(s, a) + \gamma \sum_{j \in S} p(j | s, a) v(j)\right\}

which are called the optimality equations or Bellman equations for infinite-horizon MDPs.

We can adopt an operator-theoretic perspective by defining operators on the space VV of bounded real-valued functions on the state space SS. For a deterministic Markov rule πΠMD\pi \in \Pi^{MD}, define the policy-evaluation operator:

(Lπv)(s)=r(s,π(s))+γjSp(js,π(s))v(j)(\BellmanPi v)(s) = r(s,\pi(s)) + \gamma \sum_{j \in \mathcal{S}} p(j|s,\pi(s)) v(j)

The Bellman optimality operator is then:

LvmaxπΠMD{rπ+γPπv}\Bellman \mathbf{v} \equiv \max_{\pi \in \Pi^{MD}} \left\{\mathbf{r}_\pi + \gamma \mathbf{P}_\pi \mathbf{v}\right\}

where ΠMD\Pi^{MD} is the set of Markov deterministic decision rules, rπ\mathbf{r}_\pi is the reward vector under decision rule π\pi, and Pπ\mathbf{P}_\pi is the transition probability matrix under decision rule π\pi.

Note that while we write maxπΠMD\max_{\pi \in \Pi^{MD}}, we do not implement the above operator by enumerating all decision rules. Rather, the fact that we compare policies based on their value functions in a componentwise fashion means that maximizing over the space of Markovian deterministic rules reduces to the following update in component form:

(Lv)(s)=maxaAs{r(s,a)+γjSp(js,a)v(j)}(\Bellman \mathbf{v})(s) = \max_{a \in \mathcal{A}_s} \left\{r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a) v(j)\right\}

For convenience, we define the greedy selector Greedy(v)ΠMD\mathrm{Greedy}(v) \in \Pi^{MD} that extracts an optimal decision rule from a value function:

Greedy(v)(s)argmaxaAs{r(s,a)+γjSp(js,a)v(j)}\mathrm{Greedy}(v)(s) \in \arg\max_{a \in \mathcal{A}_s} \left\{r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a) v(j)\right\}

In Puterman’s terminology, such a greedy selector is called vv-improving (or conserving when it achieves the maximum). This operator will be useful for expressing algorithms succinctly:

The equivalence between these two forms can be shown mathematically, as demonstrated in the following proposition and proof.

Algorithms for Solving the Optimality Equations

The optimality equations are operator equations. Therefore, we can apply general numerical methods to solve them. Applying the successive approximation method to the Bellman optimality equation yields a method known as “value iteration” in dynamic programming. A direct application of the blueprint for successive approximation yields the following algorithm:

The termination criterion in this algorithm is based on a specific bound that provides guarantees on the quality of the solution. This is in contrast to supervised learning, where we often use arbitrary termination criteria based on computational budget or early stopping when the learning curve flattens. This is because establishing implementable generalization bounds in supervised learning is challenging.

However, in the dynamic programming context, we can derive various bounds that can be implemented in practice. These bounds help us terminate our procedure with a guarantee on the precision of our value function and, correspondingly, on the optimality of the resulting policy.

Newton-Kantorovich Applied to Bellman Optimality

We now apply the Newton-Kantorovich framework to the Bellman optimality equation. Let

(Lv)(s)=maxaA(s){r(s,a)+γsp(ss,a)v(s)}.(\Bellman v)(s) = \max_{a \in A(s)} \left\{ r(s,a) + \gamma \sum_{s'} p(s' \mid s,a) v(s') \right\}.

The problem is to find vv such that Lv=v\Bellman v = v, or equivalently B(v):=Lvv=0\mathrm{B}(v) := \Bellman v - v = 0. The operator L\Bellman is piecewise affine, hence not globally differentiable, but it is directionally differentiable everywhere in the Hadamard sense and Fréchet differentiable at points where the maximizer is unique.

We consider three complementary perspectives for understanding and computing its derivative.

Perspective 1: Max of Affine Maps

In tabular form, for finite state and action spaces, the Bellman operator can be written as a pointwise maximum of affine maps:

(Lv)(s)=maxaA(s){r(s,a)+γ(Pav)(s)},(\Bellman v)(s) = \max_{a \in A(s)} \left\{ r(s,a) + \gamma (P_a v)(s) \right\},

where PaRS×SP_a \in \mathbb{R}^{|S| \times |S|} is the transition matrix associated with action aa. Each Qav:=ra+γPavQ_a v := r^a + \gamma P_a v is affine in vv. The operator L\Bellman therefore computes the upper envelope of a finite set of affine functions at each state.

At any vv, let the active set at state ss be

A(s;v):=argmaxaA(s)(Qav)(s).\mathcal{A}^*(s; v) := \arg\max_{a \in A(s)} (Q_a v)(s).

Then the Hadamard directional derivative exists and is given by

(L(v;h))(s)=maxaA(s;v)γ(Pah)(s).(\Bellman '(v; h))(s) = \max_{a \in \mathcal{A}^*(s; v)} \gamma (P_a h)(s).

If the active set is a singleton, this expression becomes linear in hh, and L\Bellman is Fréchet differentiable at vv, with

L(v)=γPπv,\Bellman'(v) = \gamma P_{\pi_v},

where πv(s):=a(s)\pi_v(s) := a^*(s) is the greedy policy at vv.

Perspective 2: Envelope Theorem

Consider now a value function approximated as a linear combination of basis functions:

vc(s)=jcjϕj(s).v_c(s) = \sum_j c_j \phi_j(s).

At a node sis_i, define the parametric maximization

vi(c):=(Lvc)(si)=maxaA(si){r(si,a)+γjcjEssi,a[ϕj(s)]}.v_i(c) := (\Bellman v_c)(s_i) = \max_{a \in A(s_i)} \left\{ r(s_i,a) + \gamma \sum_j c_j \mathbb{E}_{s' \mid s_i, a}[\phi_j(s')] \right\}.

Define

Fi(a,c):=r(si,a)+γjcjEssi,a[ϕj(s)],F_i(a, c) := r(s_i,a) + \gamma \sum_j c_j \mathbb{E}_{s' \mid s_i, a}[\phi_j(s')],

so that vi(c)=maxaFi(a,c)v_i(c) = \max_a F_i(a, c). Since FiF_i is linear in cc, we can apply the envelope theorem (Danskin’s theorem): if the optimizer ai(c)a_i^*(c) is unique or selected measurably, then

vicj(c)=γEssi,ai(c)[ϕj(s)].\frac{\partial v_i}{\partial c_j}(c) = \gamma \mathbb{E}_{s' \mid s_i, a_i^*(c)}[\phi_j(s')].

We do not need to differentiate the optimizer ai(c)a_i^*(c) itself. The result extends to the subdifferential case when ties occur, where the Jacobian becomes set-valued.

This result is useful when solving the collocation equation Φc=v(c)\Phi c = v(c). Newton’s method requires the Jacobian v(c)v'(c), and this expression allows us to compute it without involving any derivatives of the optimal action.

Perspective 3: The Implicit Function Theorem

The third perspective applies the implicit function theorem to understand when the Bellman operator is differentiable despite containing a max operator. The maximization problem defines an implicit relationship between the value function and the optimal action, and the implicit function theorem tells us when this relationship is smooth enough to differentiate through.

The Bellman operator is defined as

(Lv)(s)=maxa{r(s,a)+γjp(js,a)v(j)}.(\Bellman v)(s) = \max_{a} \left\{ r(s,a) + \gamma \sum_j p(j \mid s,a) v(j) \right\}.

The difficulty is that the max operator encodes a discrete selection: which action achieves the maximum. To apply the implicit function theorem, we reformulate this as follows. For each action aa, define the action-value function:

Qa(v,s):=r(s,a)+γjp(js,a)v(j).Q_a(v, s) := r(s,a) + \gamma \sum_j p(j \mid s,a) v(j).

The optimal action at vv satisfies the optimality condition:

Qa(s)(v,s)Qa(v,s)for all a.Q_{a^*(s)}(v, s) \geq Q_a(v, s) \quad \text{for all } a.

Now suppose that at a particular vv, action a(s)a^*(s) is a strict local maximizer in the sense that there exists δ>0\delta > 0 such that

Qa(s)(v,s)>Qa(v,s)+δfor all aa(s).Q_{a^*(s)}(v, s) > Q_a(v, s) + \delta \quad \text{for all } a \neq a^*(s).

This strict inequality is the regularity condition needed for the implicit function theorem. It ensures that the optimal action is unique at vv and remains so in a neighborhood of vv.

To see why, consider any perturbation v+hv + h with h\|h\| small. Since QaQ_a is linear in vv, we have:

Qa(v+h,s)=Qa(v,s)+γjp(js,a)h(j).Q_a(v+h, s) = Q_a(v, s) + \gamma \sum_j p(j \mid s,a) h(j).

The perturbation term is bounded: γjp(js,a)h(j)γh|\gamma \sum_j p(j \mid s,a) h(j)| \leq \gamma \|h\|. Therefore, for h<δ/γ\|h\| < \delta/\gamma, the strict gap ensures that

Qa(s)(v+h,s)>Qa(v+h,s)for all aa(s).Q_{a^*(s)}(v+h, s) > Q_a(v+h, s) \quad \text{for all } a \neq a^*(s).

Thus a(s)a^*(s) remains the unique maximizer throughout the neighborhood {v+h:h<δ/γ}\{v + h : \|h\| < \delta/\gamma\}.

The implicit function theorem now applies: in this neighborhood, the mapping va(s;v)v \mapsto a^*(s; v) is constant (and hence smooth), taking the value a(s)a^*(s). This allows us to write

(Lv)(s)=Qa(s)(v,s)=r(s,a(s))+γjp(js,a(s))v(j)(\Bellman v)(s) = Q_{a^*(s)}(v, s) = r(s,a^*(s)) + \gamma \sum_j p(j \mid s,a^*(s)) v(j)

as an explicit formula that holds throughout the neighborhood. Since Qa(s)(,s)Q_{a^*(s)}(\cdot, s) is an affine (hence smooth) function of vv, we can differentiate it:

ddv(Lv)(s)=γPa(s).\frac{d}{dv} (\Bellman v)(s) = \gamma P_{a^*(s)}.

More precisely, for any perturbation hh:

(L(v+h))(s)=(Lv)(s)+γjp(js,a(s))h(j)+o(h).(\Bellman (v+h))(s) = (\Bellman v)(s) + \gamma \sum_j p(j \mid s,a^*(s)) h(j) + o(\|h\|).

This is the Fréchet derivative:

L(v)=γPπv,\Bellman'(v) = \gamma P_{\pi_v},

where πv(s)=a(s)\pi_v(s) = a^*(s) is the greedy policy.

The role of the implicit function theorem: It guarantees that when the maximizer is unique with a strict gap (the regularity condition), the argmax function va(s;v)v \mapsto a^*(s; v) is locally constant, which removes the non-differentiability of the max operator. Without this regularity condition (specifically, at points where multiple actions tie for optimality), the implicit function theorem does not apply, and the operator is not Fréchet differentiable. The active set perspective (Perspective 1) and the envelope theorem (Perspective 2) provide the tools to handle these non-smooth points.

Connection to Policy Iteration

We return to the Newton-Kantorovich step:

(IL(vn))hn=vnLvn,vn+1=vnhn.(I - \Bellman'(v_n)) h_n = v_n - \Bellman v_n, \quad v_{n+1} = v_n - h_n.

Suppose L(vn)=γPπvn\Bellman'(v_n) = \gamma P_{\pi_{v_n}} for the greedy policy πvn\pi_{v_n}. Then

(IγPπvn)vn+1=rπvn,(I - \gamma P_{\pi_{v_n}}) v_{n+1} = r^{\pi_{v_n}},

which is exactly policy evaluation for πvn\pi_{v_n}. Recomputing the greedy policy from vn+1v_{n+1} yields the next iterate.

Thus, policy iteration is Newton-Kantorovich applied to the Bellman optimality equation. At points of nondifferentiability (when ties occur), the operator is still semismooth, and policy iteration corresponds to a semismooth Newton method. The envelope theorem is what justifies the simplification of the Jacobian to γPπv\gamma P_{\pi_v}, bypassing the need to differentiate through the optimizer. This completes the equivalence.

The Semismooth Newton Perspective

The three perspectives we developed above (the active set view, the envelope theorem, and the implicit function theorem) all point toward a deeper framework for understanding Newton-type methods on non-smooth operators. This framework, known as semismooth Newton methods, was developed precisely to handle operators like the Bellman operator that are piecewise smooth but not globally differentiable. The connection between policy iteration and semismooth Newton methods has been rigorously developed in recent work Gargiani et al. (2022).

The classical Newton-Kantorovich method assumes the operator is Fréchet differentiable everywhere. The derivative exists, is unique, and varies continuously with the base point. But the Bellman operator L\Bellman violates this assumption at any value function where multiple actions tie for optimality at some state. At such points, the implicit function theorem fails, and there is no unique Fréchet derivative.

Semismooth Newton methods address this by replacing the notion of a single Jacobian with a generalized derivative that captures the behavior of the operator near non-smooth points. The most commonly used generalized derivative is the Clarke subdifferential, which we can think of as the convex hull of all possible “candidate Jacobians” that arise from limits approaching the non-smooth point from different directions.

For the Bellman residual B(v)=Lvv\mathrm{B}(v) = \Bellman v - v, the Clarke subdifferential at a point vv can be characterized explicitly using our first perspective. Recall that at each state ss, we defined the active set A(s;v)=argmaxaQa(v,s)\mathcal{A}^*(s; v) = \arg\max_a Q_a(v, s). When this set contains multiple actions, the operator is not Fréchet differentiable. However, it remains directionally differentiable in all directions, and the Clarke subdifferential consists of all matrices of the form

B(v)={IγPπ:π(s)A(s;v) for all s}.\partial \mathrm{B}(v) = \left\{ I - \gamma P_\pi : \pi(s) \in \mathcal{A}^*(s; v) \text{ for all } s \right\}.

In words, the generalized Jacobian is the set of all matrices IγPπI - \gamma P_\pi where π\pi is any policy that selects an action from the active set at each state. When the maximizer is unique everywhere, this set reduces to a singleton, and we recover the classical Fréchet derivative. When ties occur, the set has multiple elements: precisely the convex combinations mentioned in Perspective 1.

The semismooth Newton method for solving B(v)=0\mathrm{B}(v) = 0 proceeds by selecting an element JkB(vk)J_k \in \partial \mathrm{B}(v_k) at each iteration and solving

Jkhk=B(vk),vk+1=vk+hk.J_k h_k = -\mathrm{B}(v_k), \quad v_{k+1} = v_k + h_k.

What this tells us is that any choice from the Clarke subdifferential yields a valid Newton-like update. In the context of the Bellman equation, choosing Jk=IγPπkJ_k = I - \gamma P_{\pi_k} where πk\pi_k is any greedy policy corresponds exactly to the policy evaluation step in policy iteration. The freedom in selecting which action to choose when ties occur translates to the freedom in selecting which element of the subdifferential to use.

Under appropriate regularity conditions (specifically, when the residual function is BD-regular or CD-regular), the semismooth Newton method converges locally at a quadratic rate Gargiani et al. (2022). This means that near the solution, the error decreases quadratically:

vk+1vCvkv2.\|v_{k+1} - v^*\| \leq C \|v_k - v^*\|^2.

This theoretical result explains an empirical observation that has long been noted in practice: policy iteration typically converges in very few iterations, often just a handful, even when the state and action spaces are enormous and the space of possible policies is exponentially large.

The semismooth Newton framework also suggests a spectrum of methods interpolating between value iteration and policy iteration. Value iteration can be interpreted as a Newton-like method where we choose Jk=IJ_k = I at every iteration, ignoring the dependence of L\Bellman on vv entirely. This choice guarantees global convergence through the contraction property but sacrifices the quadratic local convergence rate. Policy iteration, at the other extreme, uses the full generalized Jacobian Jk=IγPπkJ_k = I - \gamma P_{\pi_k}, achieving quadratic convergence but at the cost of solving a linear system at each iteration.

Between these extremes lie methods that use approximate Jacobians. One natural variant is to choose Jk=αIJ_k = \alpha I for some scalar α>1\alpha > 1. This leads to the update

vk+1=α1αvk+1αLvk.v_{k+1} = \frac{\alpha - 1}{\alpha} v_k + \frac{1}{\alpha} \Bellman v_k.

This is known as α\alpha-value iteration or successive over-relaxation when α>1\alpha > 1. For appropriate choices of α\alpha, this method retains global convergence while achieving better local rates than standard value iteration, and it requires only pointwise operations rather than solving a linear system. The Newton perspective thus unifies existing algorithms and generates new ones by systematically exploring different approximations to the generalized Jacobian.

The connection to semismooth Newton methods places policy iteration within a broader mathematical framework that extends far beyond dynamic programming. Semismooth Newton methods are used in optimization (for complementarity problems and variational inequalities), in PDE-constrained optimization (for problems with control constraints), and in economics (for equilibrium problems). The Bellman equation, viewed through this lens, is simply one instance of a piecewise smooth equation, and the tools developed for such equations apply directly.

Policy Iteration

While we derived policy iteration-like steps from the Newton-Kantorovich method, it’s worth examining policy iteration as a standalone algorithm, as it has been traditionally presented in the field of dynamic programming.

The policy iteration algorithm for discounted Markov decision problems is as follows:

As opposed to value iteration, this algorithm produces a sequence of both deterministic Markovian decision rules {πn}\{\pi_n\} and value functions {vn}\{\mathbf{v}^n\}. We recognize in this algorithm the linearization step of the Newton-Kantorovich procedure, which takes place here in the policy evaluation step 3 where we solve the linear system (IγPπn)v=rπn(\mathbf{I}-\gamma \mathbf{P}_{\pi_n}) \mathbf{v} = \mathbf{r}_{\pi_n}. In practice, this linear system could be solved either using direct methods (eg. Gaussian elimination), using simple iterative methods such as the successive approximation method for policy evaluation, or more sophisticated methods such as GMRES.

References
  1. Conroy, M. J., & Peterson, J. T. (2013). Decision Making in Natural Resource Management: A Structured, Adaptive Approach: A Structured, Adaptive Approach. Wiley. 10.1002/9781118506196
  2. Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.
  3. Adams, C. P., & Brantner, V. V. (2009). Spending on new drug development1. Health Economics, 19(2), 130–141. 10.1002/hec.1454
  4. Chang, M. (2010). Monte Carlo Simulation for the Pharmaceutical Industry: Concepts, Algorithms, and Case Studies. CRC Press. 10.1201/ebk1439835920
  5. Gargiani, M., Zanelli, A., Liao-McPherson, D., Summers, T. H., & Lygeros, J. (2022). Dynamic Programming Through the Lens of Semismooth Newton-Type Methods. IEEE Control Systems Letters, 6, 2996–3001. 10.1109/LCSYS.2022.3181213