Dynamic Programming#
Unlike the methods we’ve discussed so far, dynamic programming takes a step back and considers not just a single optimization problem, but an entire family of related problems. This approach, while seemingly more complex at first glance, can often lead to efficient solutions.
Dynamic programming leverage the solution structure underlying many control problems that allows for a decomposition it into smaller, more manageable subproblems. Each subproblem is itself an optimization problem, embedded within the larger whole. This recursive structure is the foundation upon which dynamic programming constructs its solutions.
To ground our discussion, let us return to the domain of discrete-time optimal control problems (DOCPs). These problems frequently arise from the discretization of continuous-time optimal control problems. While the focus here will be on deterministic problems, it is worth noting that these concepts extend naturally to stochastic problems by taking the expectation over the random quantities.
Consider a typical DOCP of Bolza type:
Rather than considering only the total cost from the initial time to the final time, dynamic programming introduces the concept of cost from an arbitrary point in time to the end. This leads to the definition of the “cost-to-go” or “value function” \(J_k(\mathbf{x}_k)\):
This function represents the total cost incurred from stage \(k\) onwards to the end of the time horizon, given that the system is initialized in state \(\mathbf{x}_k\) at stage \(k\). Suppose the problem has been solved from stage \(k+1\) to the end, yielding the optimal cost-to-go \(J_{k+1}^\star(\mathbf{x}_{k+1})\) for any state \(\mathbf{x}_{k+1}\) at stage \(k+1\). The question then becomes: how does this information inform the decision at stage \(k\)?
Given knowledge of the optimal behavior from \(k+1\) onwards, the task reduces to determining the optimal action \(\mathbf{u}_k\) at stage \(k\). This control should minimize the sum of the immediate cost \(c_k(\mathbf{x}_k, \mathbf{u}_k)\) and the optimal future cost \(J_{k+1}^\star(\mathbf{x}_{k+1})\), where \(\mathbf{x}_{k+1}\) is the resulting state after applying action \(\mathbf{u}_k\). Mathematically, this is expressed as:
This equation is known as Bellman’s equation, named after Richard Bellman, who formulated the principle of optimality:
An optimal policy has the property that whatever the previous state and decision, the remaining decisions must constitute an optimal policy with regard to the state resulting from the previous decision.
In other words, any sub-path of an optimal path, from any intermediate point to the end, must itself be optimal.
Dynamic programming can handle nonlinear systems and non-quadratic cost functions naturally. It provides a global optimal solution, when one exists, and can incorporate state and control constraints with relative ease. Howver, as the dimension of the state space increases, this approach suffers from what Bellman termed the “curse of dimensionality.” The computational complexity and memory requirements grow exponentially with the state dimension, rendering direct application of dynamic programming intractable for high-dimensional problems.
Fortunately, learning-based methods offer efficient tools to combat the curse of dimensionality on two fronts: by using function approximation (e.g., neural networks) to avoid explicit discretization, and by leveraging randomization through Monte Carlo methods inherent in the learning paradigm. Most of this course is dedicated to those ideas.
Backward Recursion#
The principle of optimality provides a methodology for solving optimal control problems. Beginning at the final time horizon and working backwards, at each stage the local optimization problem given by Bellman’s equation is solved. This process, termed backward recursion or backward induction constructs the optimal value function stage by stage.
Algorithm 3 (Backward Recursion for Dynamic Programming)
Input: Terminal cost function \(c_\mathrm{T}(\cdot)\), stage cost functions \(c_t(\cdot, \cdot)\), system dynamics \(f_t(\cdot, \cdot)\), time horizon \(\mathrm{T}\)
Output: Optimal value functions \(J_t^\star(\cdot)\) and optimal control policies \(\mu_t^\star(\cdot)\) for \(t = 1, \ldots, T\)
Initialize \(J_T^\star(\mathbf{x}) = c_\mathrm{T}(\mathbf{x})\) for all \(\mathbf{x}\) in the state space
For \(t = T-1, T-2, \ldots, 1\):
For each state \(\mathbf{x}\) in the state space:
Compute \(J_t^\star(\mathbf{x}) = \min_{\mathbf{u}} \left[ c_t(\mathbf{x}, \mathbf{u}) + J_{t+1}^\star(f_t(\mathbf{x}, \mathbf{u})) \right]\)
Compute \(\mu_t^\star(\mathbf{x}) = \arg\min_{\mathbf{u}} \left[ c_t(\mathbf{x}, \mathbf{u}) + J_{t+1}^\star(f_t(\mathbf{x}, \mathbf{u})) \right]\)
End For
End For
Return \(J_t^\star(\cdot)\), \(\mu_t^\star(\cdot)\) for \(t = 1, \ldots, T\)
Upon completion of this backward pass, we now have access to the optimal control to take at any stage and in any state. Furthermore, we can simulate optimal trajectories from any initial state and applying the optimal policy at each stage to generate the optimal trajectory.
Example: Optimal Harvest in Resource Management#
Dynamic programming is often used in resource management and conservation biology to devise policies to be implemented by decision makers and stakeholders : for eg. in fishereries, or timber harvesting. Per [9], we consider a population of a particular species, whose abundance we denote by \(x_t\), where \(t\) represents discrete time steps. Our objective is to maximize the cumulative harvest over a finite time horizon, while also considering the long-term sustainability of the population. This optimization problem can be formulated as:
Here, \(F(\cdot)\) represents the immediate reward function associated with harvesting, \(h_t\) is the harvest rate at time \(t\), and \(F_\mathrm{T}(\cdot)\) denotes a terminal value function that could potentially assign value to the final population state. In this particular problem, we assign no terminal value to the final population state, setting \(F_\mathrm{T}(x_{t_f}) = 0\) and allowing us to focus solely on the cumulative harvest over the time horizon.
In our model population model, the abundance of a specicy \(x\) ranges from 1 to 100 individuals. The decision variable is the harvest rate \(h\), which can take values from the set \(D = \{0, 0.1, 0.2, 0.3, 0.4, 0.5\}\). The population dynamics are governed by a modified logistic growth model:
where the \(0.3\) represents the growth rate and \(125\) is the carrying capacity (the maximum population size given the available resources). The logistic growth model returns continuous values; however our DP formulation uses a discrete state space. Therefore, we also round the the outcomes to the nearest integer.
Applying the principle of optimality, we can express the optimal value function \(J^\star(x_t,t)\) recursively:
with the boundary condition \(J^*(x_{t_f}) = 0\).
It’s worth noting that while this example uses a relatively simple model, the same principles can be applied to more complex scenarios involving stochasticity, multiple species interactions, or spatial heterogeneity.
Show code cell source
import numpy as np
# Parameters
r_max = 0.3
K = 125
T = 20 # Number of time steps
N_max = 100 # Maximum population size to consider
h_max = 0.5 # Maximum harvest rate
h_step = 0.1 # Step size for harvest rate
# Create state and decision spaces
N_space = np.arange(1, N_max + 1)
h_space = np.arange(0, h_max + h_step, h_step)
# Initialize value function and policy
V = np.zeros((T + 1, len(N_space)))
policy = np.zeros((T, len(N_space)))
# Terminal value function (F_T)
def terminal_value(N):
return 0
# State return function (F)
def state_return(N, h):
return N * h
# State dynamics function
def state_dynamics(N, h):
return N + r_max * N * (1 - N / K) - N * h
# Backward iteration
for t in range(T - 1, -1, -1):
for i, N in enumerate(N_space):
max_value = float('-inf')
best_h = 0
for h in h_space:
if h > 1: # Ensure harvest rate doesn't exceed 100%
continue
next_N = state_dynamics(N, h)
if next_N < 1: # Ensure population doesn't go extinct
continue
next_N_index = np.searchsorted(N_space, next_N)
if next_N_index == len(N_space):
next_N_index -= 1
value = state_return(N, h) + V[t + 1, next_N_index]
if value > max_value:
max_value = value
best_h = h
V[t, i] = max_value
policy[t, i] = best_h
# Function to simulate the optimal policy with conversion to Python floats
def simulate_optimal_policy(initial_N, T):
trajectory = [float(initial_N)] # Ensure first value is a Python float
harvests = []
for t in range(T):
N = trajectory[-1]
N_index = np.searchsorted(N_space, N)
if N_index == len(N_space):
N_index -= 1
h = policy[t, N_index]
harvests.append(float(N * h)) # Ensure harvest is a Python float
next_N = state_dynamics(N, h)
trajectory.append(float(next_N)) # Ensure next population value is a Python float
return trajectory, harvests
# Example usage
initial_N = 50
trajectory, harvests = simulate_optimal_policy(initial_N, T)
print("Optimal policy:")
print(policy)
print("\nPopulation trajectory:", trajectory)
print("Harvests:", harvests)
print("Total harvest:", sum(harvests))
Optimal policy:
[[0.2 0.2 0.2 ... 0.4 0.4 0.5]
[0.2 0.2 0.2 ... 0.4 0.4 0.5]
[0.2 0.2 0.2 ... 0.4 0.4 0.4]
...
[0.2 0.2 0.2 ... 0.5 0.5 0.5]
[0.2 0.5 0.5 ... 0.5 0.5 0.5]
[0.2 0.5 0.5 ... 0.5 0.5 0.5]]
Population trajectory: [50.0, 54.0, 63.2016, 53.614938617856, 62.80047226002128, 65.89520835342945, 62.063500827311884, 65.23169346891407, 61.5424456170318, 64.7610004703774, 61.171531280797, 64.42514256278633, 60.90621923290014, 52.003257133909514, 61.11382126799714, 64.37282756165249, 60.86484408994034, 39.80100508132969, 28.038916051902078, 20.544298889444192, 15.422475391094192]
Harvests: [5.0, 0.0, 18.960480000000004, 0.0, 6.280047226002129, 13.17904167068589, 6.206350082731189, 13.046338693782815, 6.15424456170318, 12.95220009407548, 6.1171531280797, 12.885028512557268, 18.271865769870047, 0.0, 6.111382126799715, 12.874565512330499, 30.43242204497017, 19.900502540664846, 14.019458025951039, 10.272149444722096]
Total harvest: 212.66322943492608
Discretization and Interpolation#
In many real-world problems, such as our resource management example, the state space is inherently continuous. However, the dynamic programming algorithm we’ve discussed operates on discrete state spaces. To bridge this gap, we have two main approaches: discretization and interpolation. In the previous example, we used a discretization method by rounding off values to the nearest grid point.
In our idealized models, we imagined population sizes as whole numbers—1 fish, 2 fish, 3 fish—but nature rarely conforms to such simplicity. What do you do when your survey reports 42.7 fish, for example? Without much explanation, our reflex in the previous example was to simply round things off. After all, what’s the harm in calling 42.7 fish just 43? This approach, known as discretization, is the simplest way to handle continuous states. It’s like overlaying a grid on a smooth landscape and only allowing yourself to stand at the intersections. In our demo code, we implemented this step via the numpy.searchsorted function.
Discretization is straightforward and allows you to apply dynamic programming algorithms directly. For many problems, it might even be sufficient. However, as you might imagine from the various time discretization schemes we’ve encountered in trajectory optimization, we can do better. Specifically, we want to address the following issues:
Our model might make abrupt changes in decisions, even when the population barely shifts.
We’re losing precision, especially when dealing with smaller population sizes where every individual counts.
We might want to scale up and add more factors to our model—perhaps considering the population’s age structure or environmental variables. However, the curse of dimensionality might leave us needing an impossibly large number of grid points to maintain accuracy.
Rather than confining ourselves to grid intersections as we did with basic discretization, we can estimate the value between them via interpolation. When you encounter a state that doesn’t exactly match a grid point—like that population of 42.7 fish—you can estimate its value based on nearby points you’ve already calculated. In its simplest form, we could use linear interpolation. Intuitively, it’s like estimating the height of a hill between two surveyed points by drawing a straight line between them. Let’s formalize this approach in the context of the backward induction procedure.
Backward Recursion with Interpolation#
In a continuous state space, we don’t have \(J_{k+1}^\star(\mathbf{x}_{k+1})\) directly available for all possible \(\mathbf{x}_{k+1}\). Instead, we have \(J_{k+1}^\star(\mathbf{x})\) for a set of discrete grid points \(\mathbf{x} \in \mathcal{X}_\text{grid}\). We use interpolation to estimate \(J_{k+1}^\star(\mathbf{x}_{k+1})\) for any \(\mathbf{x}_{k+1}\) not in \(\mathcal{X}_\text{grid}\).
Let’s define an interpolation function \(I_{k+1}(\mathbf{x})\) that estimates \(J_{k+1}^\star(\mathbf{x})\) for any \(\mathbf{x}\) based on the known values at grid points. Then, we can express Bellman’s equation with interpolation as:
For linear interpolation in a one-dimensional state space, \(I_{k+1}(\mathbf{x})\) would be defined as:
where \(x_l\) and \(x_u\) are the nearest lower and upper grid points to \(x\), respectively.
Here’s a pseudo-code algorithm for backward recursion with interpolation:
Algorithm 4 (Backward Recursion with Interpolation for Dynamic Programming)
Input:
Terminal cost function \(c_\mathrm{T}(\cdot)\)
Stage cost functions \(c_t(\cdot, \cdot)\)
System dynamics \(f_t(\cdot, \cdot)\)
Time horizon \(\mathrm{T}\)
Grid of state points \(\mathcal{X}_\text{grid}\)
Set of possible actions \(\mathcal{U}\)
Output: Optimal value functions \(J_t^\star(\cdot)\) and optimal control policies \(\mu_t^\star(\cdot)\) for \(t = 1, \ldots, T\) at grid points
Initialize \(J_T^\star(\mathbf{x}) = c_\mathrm{T}(\mathbf{x})\) for all \(\mathbf{x} \in \mathcal{X}_\text{grid}\)
For \(t = T-1, T-2, \ldots, 1\):
For each state \(\mathbf{x} \in \mathcal{X}_\text{grid}\):
Initialize \(J_t^\star(\mathbf{x}) = \infty\) and \(\mu_t^\star(\mathbf{x}) = \text{None}\)
For each action \(\mathbf{u} \in \mathcal{U}\):
Compute next state \(\mathbf{x}_\text{next} = f_t(\mathbf{x}, \mathbf{u})\)
Compute interpolated future cost \(J_\text{future} = I_{t+1}(\mathbf{x}_\text{next})\)
Compute total cost \(J_\text{total} = c_t(\mathbf{x}, \mathbf{u}) + J_\text{future}\)
If \(J_\text{total} < J_t^\star(\mathbf{x})\):
Update \(J_t^\star(\mathbf{x}) = J_\text{total}\)
Update \(\mu_t^\star(\mathbf{x}) = \mathbf{u}\)
End For
End For
Return \(J_t^\star(\cdot)\), \(\mu_t^\star(\cdot)\) for \(t = 1, \ldots, T\)
The choice of interpolation method can significantly affect the accuracy of the solution. Linear interpolation is simple and often effective, but higher-order methods like cubic spline interpolation might provide better results in some problems. Furthermore, the layout and density of the grid points in \(\mathcal{X}_\text{grid}\) can impact both the accuracy and computational efficiency. A finer grid generally provides better accuracy but increases computational cost. To balance this tradeoff, you might consider techniques like adaptive grid refinement or function approximation methods instead of fixed grid-based interpolation. Special care may also be needed for states near the boundaries, where interpolation might not be possible in all directions.
While simple to implement, interpolation methods scale poorly in multi-dimensional spaces in terms of computational complexity. Techniques like multilinear interpolation with tensorized representations or more advanced methods like radial basis function interpolation might be necessary.
To better address this computational challenge, we will broaden our perspective through the lens of numerical approximation methods for solving functional operator equations. Polynomial interpolation is a form of approximation, with properties akin to generalization in machine learning. By building these connections, we will develop techniques capable of more robustly handling the curse of dimensionality by leveraging the generalization properties of machine learning models, and the “blessing of randomness” inherent in supervised learning and Monte Carlo methods.
Example: Optimal Harvest with Linear Interpolation#
Here is a demonstration of the backward recursion procedure using linear interpolation.
Show code cell source
import numpy as np
# Parameters
r_max = 0.3
K = 125
T = 20 # Number of time steps
N_max = 100 # Maximum population size to consider
h_max = 0.5 # Maximum harvest rate
h_step = 0.1 # Step size for harvest rate
# Create state and decision spaces
N_space = np.arange(1, N_max + 1)
h_space = np.arange(0, h_max + h_step, h_step)
# Initialize value function and policy
V = np.zeros((T + 1, len(N_space)))
policy = np.zeros((T, len(N_space)))
# Terminal value function (F_T)
def terminal_value(N):
return 0
# State return function (F)
def state_return(N, h):
return N * h
# State dynamics function
def state_dynamics(N, h):
return N + r_max * N * (1 - N / K) - N * h
# Function to linearly interpolate between grid points in N_space
def interpolate_value_function(V, N_space, next_N, t):
if next_N <= N_space[0]:
return V[t, 0] # Below or at minimum population, return minimum value
if next_N >= N_space[-1]:
return V[t, -1] # Above or at maximum population, return maximum value
# Find indices to interpolate between
lower_idx = np.searchsorted(N_space, next_N) - 1
upper_idx = lower_idx + 1
# Linear interpolation
N_lower = N_space[lower_idx]
N_upper = N_space[upper_idx]
weight = (next_N - N_lower) / (N_upper - N_lower)
return (1 - weight) * V[t, lower_idx] + weight * V[t, upper_idx]
# Backward iteration with interpolation
for t in range(T - 1, -1, -1):
for i, N in enumerate(N_space):
max_value = float('-inf')
best_h = 0
for h in h_space:
if h > 1: # Ensure harvest rate doesn't exceed 100%
continue
next_N = state_dynamics(N, h)
if next_N < 1: # Ensure population doesn't go extinct
continue
# Interpolate value for next_N
value = state_return(N, h) + interpolate_value_function(V, N_space, next_N, t + 1)
if value > max_value:
max_value = value
best_h = h
V[t, i] = max_value
policy[t, i] = best_h
# Function to simulate the optimal policy using interpolation
def simulate_optimal_policy(initial_N, T):
trajectory = [initial_N]
harvests = []
for t in range(T):
N = trajectory[-1]
# Interpolate optimal harvest rate
if N <= N_space[0]:
h = policy[t, 0]
elif N >= N_space[-1]:
h = policy[t, -1]
else:
lower_idx = np.searchsorted(N_space, N) - 1
upper_idx = lower_idx + 1
weight = (N - N_space[lower_idx]) / (N_space[upper_idx] - N_space[lower_idx])
h = (1 - weight) * policy[t, lower_idx] + weight * policy[t, upper_idx]
harvests.append(float(N * h)) # Ensure harvest is a Python float
next_N = state_dynamics(N, h)
trajectory.append(float(next_N)) # Ensure next population value is a Python float
return trajectory, harvests
# Example usage
initial_N = 50
trajectory, harvests = simulate_optimal_policy(initial_N, T)
print("Optimal policy:")
print(policy)
print("\nPopulation trajectory:", trajectory)
print("Harvests:", harvests)
print("Total harvest:", sum(harvests))
Optimal policy:
[[0. 0. 0. ... 0.4 0.4 0.4]
[0. 0. 0. ... 0.4 0.4 0.4]
[0. 0. 0. ... 0.4 0.4 0.4]
...
[0. 0. 0.3 ... 0.5 0.5 0.5]
[0.2 0.5 0.5 ... 0.5 0.5 0.5]
[0.2 0.5 0.5 ... 0.5 0.5 0.5]]
Population trajectory: [50, 59.0, 62.445600000000006, 62.793456961535966, 60.906514028106535, 64.1847685511936, 60.71600257278426, 64.0117639631371, 60.5789261378371, 63.88717626457206, 60.48012279248407, 63.79731874379539, 60.40881570882111, 63.73243881376377, 60.3573056779798, 63.685556376683536, 60.32007179593332, 39.523630889226936, 27.8698229545787, 20.431713488016012, 15.34347899187751]
Harvests: [0.0, 5.9, 9.027135936000038, 11.26173625265758, 6.0906514028106535, 12.83695371023872, 6.071600257278426, 12.80235279262742, 6.057892613783711, 12.777435252914414, 6.0480122792484075, 12.759463748759078, 6.040881570882111, 12.746487762752755, 6.03573056779798, 12.737111275336709, 30.16003589796666, 19.761815444613468, 13.93491147728935, 10.215856744008006]
Total harvest: 213.26606498696546
Due to pedagogical considerations, this example is using our own implementation of the linear interpolation procedure. However, a more general and practical approach would be to use a built-in interpolation procedure in Numpy. Because our state space has a single dimension, we can simply use scipy.interpolate.interp1d which offers various interpolation methods through its kind
argument, which can take values in ‘linear’, ‘nearest’, ‘nearest-up’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘previous’, or ‘next’. ‘zero’, ‘slinear’, ‘quadratic’ and ‘cubic’.
Here’s a more general implementation which here uses cubic interpolation through the scipy.interpolate.interp1d
function:
Show code cell source
import numpy as np
from scipy.interpolate import interp1d
# Parameters
r_max = 0.3
K = 125
T = 20 # Number of time steps
N_max = 100 # Maximum population size to consider
h_max = 0.5 # Maximum harvest rate
h_step = 0.1 # Step size for harvest rate
# Create state and decision spaces
N_space = np.arange(1, N_max + 1)
h_space = np.arange(0, h_max + h_step, h_step)
# Initialize value function and policy
V = np.zeros((T + 1, len(N_space)))
policy = np.zeros((T, len(N_space)))
# Terminal value function (F_T)
def terminal_value(N):
return 0
# State return function (F)
def state_return(N, h):
return N * h
# State dynamics function
def state_dynamics(N, h):
return N + r_max * N * (1 - N / K) - N * h
# Function to create interpolation function for a given time step
def create_interpolator(V_t, N_space):
return interp1d(N_space, V_t, kind='cubic', bounds_error=False, fill_value=(V_t[0], V_t[-1]))
# Backward iteration with interpolation
for t in range(T - 1, -1, -1):
interpolator = create_interpolator(V[t+1], N_space)
for i, N in enumerate(N_space):
max_value = float('-inf')
best_h = 0
for h in h_space:
if h > 1: # Ensure harvest rate doesn't exceed 100%
continue
next_N = state_dynamics(N, h)
if next_N < 1: # Ensure population doesn't go extinct
continue
# Use interpolation to get the value for next_N
value = state_return(N, h) + interpolator(next_N)
if value > max_value:
max_value = value
best_h = h
V[t, i] = max_value
policy[t, i] = best_h
# Function to simulate the optimal policy using interpolation
def simulate_optimal_policy(initial_N, T):
trajectory = [initial_N]
harvests = []
for t in range(T):
N = trajectory[-1]
# Create interpolator for the policy at time t
policy_interpolator = interp1d(N_space, policy[t], kind='cubic', bounds_error=False, fill_value=(policy[t][0], policy[t][-1]))
h = policy_interpolator(N)
harvests.append(float(N * h)) # Ensure harvest is a Python float
next_N = state_dynamics(N, h)
trajectory.append(float(next_N)) # Ensure next population value is a Python float
return trajectory, harvests
# Example usage
initial_N = 50
trajectory, harvests = simulate_optimal_policy(initial_N, T)
print("Optimal policy (first few rows):")
print(policy[:5])
print("\nPopulation trajectory:", trajectory)
print("Harvests:", harvests)
print("Total harvest:", sum(harvests))
Optimal policy (first few rows):
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0.1 0.1 0.1 0.1 0.1 0.1 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.3
0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.4 0.4 0.4
0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.3
0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4
0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0.1 0.1 0.1 0.1 0.1 0.1 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.3
0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.4 0.4 0.4
0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.3
0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4
0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0.1 0.1 0.1 0.1 0.1 0.1 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.3
0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.4 0.4 0.4
0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4]]
Population trajectory: [50, 59.0, 62.445600000000006, 62.855816819468515, 66.38501069094303, 62.46338144008508, 66.19082983826176, 62.307060290079974, 65.86630883298251, 62.0275406161329, 65.08602250342238, 61.40579635061663, 65.16431296091169, 61.453283417050585, 65.25607512917725, 61.51516182245858, 65.3615391678991, 42.035857104726354, 29.387853805711547, 21.437532761435143, 16.047063462998082]
Harvests: [3.3073317494565994e-20, 5.8999999999999995, 8.96477607806749, 5.84550227506383, 13.260405311492967, 5.647548383617885, 13.22607620843378, 5.815662115341462, 13.186571332467969, 6.31598238982395, 13.039176123074048, 5.613609913801768, 13.068992991332264, 5.569578810421298, 13.097683026436256, 5.526294879593217, 32.68102988779014, 21.017928552363177, 14.693926902855772, 10.718766380713353]
Total harvest: 213.18951156269063
Stochastic Dynamic Programming in Control Theory#
While our previous discussion centered on deterministic systems, many real-world problems involve uncertainty. Stochastic Dynamic Programming (SDP) extends our framework to handle stochasticity in both the objective function and system dynamics.
In the stochastic setting, our system evolution takes the form:
Here, \(\mathbf{w}_t\) represents a random disturbance or noise term at time \(t\) due to the inherent uncertainty in the system’s behavior. The stage cost function may also incorporate stochastic influences:
In this context, our objective shifts from minimizing a deterministic cost to minimizing the expected total cost:
where the expectation is taken over the distributions of the random variables \(\mathbf{w}_t\). The principle of optimality still holds in the stochastic case, but Bellman’s optimality equation now involves an expectation:
In practice, this expectation is often computed by discretizing the distribution of \(\mathbf{w}_k\) when the set of possible disturbances is very large or even continuous. Let’s say we approximate the distribution with \(K\) discrete values \(\mathbf{w}_k^i\), each occurring with probability \(p_k^i\). Then our Bellman equation becomes:
The backward recursion algorithm for SDP follows a similar structure to its deterministic counterpart, with the key difference being that we now have to compute expected values:
Algorithm 5 (Backward Recursion for Stochastic Dynamic Programming)
Input: Terminal cost function \(c_\mathrm{T}(\cdot)\), stage cost functions \(c_t(\cdot, \cdot, \cdot)\), system dynamics \(\mathbf{f}_t(\cdot, \cdot, \cdot)\), time horizon \(\mathrm{T}\), disturbance distributions
Output: Optimal value functions \(J_t^\star(\cdot)\) and optimal control policies \(\mu_t^\star(\cdot)\) for \(t = 1, \ldots, T\)
Initialize \(J_T^\star(\mathbf{x}) = c_\mathrm{T}(\mathbf{x})\) for all \(\mathbf{x}\) in the state space
For \(t = T-1, T-2, \ldots, 1\):
For each state \(\mathbf{x}\) in the state space:
Compute \(J_t^\star(\mathbf{x}) = \min_{\mathbf{u}} \mathbb{E}_{\mathbf{w}_t}\left[c_t(\mathbf{x}, \mathbf{u}, \mathbf{w}_t) + J_{t+1}^\star(\mathbf{f}_t(\mathbf{x}, \mathbf{u}, \mathbf{w}_t))\right]\)
Compute \(\mu_t^\star(\mathbf{x}) = \arg\min_{\mathbf{u}} \mathbb{E}_{\mathbf{w}_t}\left[c_t(\mathbf{x}, \mathbf{u}, \mathbf{w}_t) + J_{t+1}^\star(\mathbf{f}_t(\mathbf{x}, \mathbf{u}, \mathbf{w}_t))\right]\)
End For
End For
Return \(J_t^\star(\cdot)\), \(\mu_t^\star(\cdot)\) for \(t = 1, \ldots, T\)
While SDP provides us with a framework to for handling uncertainty, it makes the curse of dimensionality even more difficult to handle in practice. Not only does the state space need to be discretized, but now the disturbance space must be discretized as well. This can lead to a combinatorial explosion in the number of scenarios to be evaluated at each stage.
However, just as we tackled the challenges of continuous state spaces with discretization and interpolation, we can devise efficient methods to handle the additional complexity of evaluating expectations. This problem essentially becomes one of numerical integration. When the set of disturbances is continuous (as is often the case with continuous state spaces), we enter a domain where numerical quadrature methods could be applied. But these methods tend to scale poorly as the number of dimensions grows. This is where more efficient techniques, often rooted in Monte Carlo methods, come into play. The combination of two key ingredients emerges to tackle the curse of dimensionality:
Function approximation (through discretization, interpolation, neural networks, etc.)
Monte Carlo integration (simulation)
These two elements essentially distill the key ingredients of machine learning, which is the direction we’ll be exploring in this course.
Example: Stochastic Optimal Harvest in Resource Management#
Building upon our previous deterministic model, we now introduce stochasticity to more accurately reflect the uncertainties inherent in real-world resource management scenarios [9]. As before, we consider a population of a particular species, whose abundance we denote by \(x_t\), where \(t\) represents discrete time steps. Our objective remains to maximize the cumulative harvest over a finite time horizon, while also considering the long-term sustainability of the population. However, we now account for two sources of stochasticity: partial controllability of harvest and environmental variability affecting growth rates. The optimization problem can be formulated as:
Here, \(F(\cdot)\) represents the immediate reward function associated with harvesting, and \(h_t\) is the realized harvest rate at time \(t\). The expectation \(\mathbb{E}[\cdot]\) over both harvest and growth rates, which we view as random variables. In our stochastic model, the abundance \(x\) still ranges from 1 to 100 individuals. The decision variable is now the desired harvest rate \(d_t\), which can take values from the set \(D = {0, 0.1, 0.2, 0.3, 0.4, 0.5}\). However, the realized harvest rate \(h_t\) is stochastic and follows a discrete distribution:
By expressing the harvest rate as a random variable, we mean to capture the fact that harvesting is a not completely under our control: we might obtain more or less what we had intended to. Furthermore, we generalize the population dynamics to the stochastic case via:
where \(K = 125\) is the carrying capacity. The growth rate \(r_t\) is now stochastic and follows a discrete distribution:
where \(r_{\text{max}} = 0.3\) is the maximum growth rate. Applying the principle of optimality, we can express the optimal value function \(J^\star(x_t, t)\) recursively:
where the expectation is taken over the harvest and growth rate random variables. The boundary condition remains \(J^*(x_{t_f}) = 0\). We can now adapt our previous code to account for the stochasticity in our model. One important difference is that simulating a solution in this context requires multiple realizations of our process. This is an important consideration when evaluating reinforcement learning methods in practice, as success cannot be claimed based on a single successful trajectory.
Show code cell source
import numpy as np
from scipy.interpolate import interp1d
# Parameters
r_max = 0.3
K = 125
T = 30 # Number of time steps
N_max = 100 # Maximum population size to consider
h_max = 0.5 # Maximum harvest rate
h_step = 0.1 # Step size for harvest rate
# Create state and decision spaces
N_space = np.linspace(1, N_max, 100) # Using more granular state space
h_space = np.arange(0, h_max + h_step, h_step)
# Stochastic parameters
h_outcomes = np.array([0.75, 1.0, 1.25])
h_probs = np.array([0.25, 0.5, 0.25])
r_outcomes = np.array([0.85, 1.05, 1.15]) * r_max
r_probs = np.array([0.25, 0.5, 0.25])
# Initialize value function and policy
V = np.zeros((T + 1, len(N_space)))
policy = np.zeros((T, len(N_space)))
# State return function (F)
def state_return(N, h):
return N * h
# State dynamics function (stochastic)
def state_dynamics(N, h, r):
return N + r * N * (1 - N / K) - h * N
# Function to create interpolation function for a given time step
def create_interpolator(V_t, N_space):
return interp1d(N_space, V_t, kind='linear', bounds_error=False, fill_value=(V_t[0], V_t[-1]))
# Backward iteration with stochastic dynamics
for t in range(T - 1, -1, -1):
interpolator = create_interpolator(V[t+1], N_space)
for i, N in enumerate(N_space):
max_value = float('-inf')
best_h = 0
for h in h_space:
if h > 1: # Ensure harvest rate doesn't exceed 100%
continue
expected_value = 0
for h_factor, h_prob in zip(h_outcomes, h_probs):
for r_factor, r_prob in zip(r_outcomes, r_probs):
realized_h = h * h_factor
realized_r = r_factor
next_N = state_dynamics(N, realized_h, realized_r)
if next_N < 1: # Ensure population doesn't go extinct
continue
# Use interpolation to get the value for next_N
value = state_return(N, realized_h) + interpolator(next_N)
expected_value += value * h_prob * r_prob
if expected_value > max_value:
max_value = expected_value
best_h = h
V[t, i] = max_value
policy[t, i] = best_h
# Function to simulate the optimal policy using interpolation (stochastic version)
def simulate_optimal_policy(initial_N, T, num_simulations=100):
all_trajectories = []
all_harvests = []
for _ in range(num_simulations):
trajectory = [initial_N]
harvests = []
for t in range(T):
N = trajectory[-1]
# Create interpolator for the policy at time t
policy_interpolator = interp1d(N_space, policy[t], kind='linear', bounds_error=False, fill_value=(policy[t][0], policy[t][-1]))
intended_h = policy_interpolator(N)
# Apply stochasticity
h_factor = np.random.choice(h_outcomes, p=h_probs)
r_factor = np.random.choice(r_outcomes, p=r_probs)
realized_h = intended_h * h_factor
harvests.append(N * realized_h)
next_N = state_dynamics(N, realized_h, r_factor)
trajectory.append(next_N)
all_trajectories.append(trajectory)
all_harvests.append(harvests)
return all_trajectories, all_harvests
# Example usage
initial_N = 50
trajectories, harvests = simulate_optimal_policy(initial_N, T)
# Calculate average trajectory and total harvest
avg_trajectory = np.mean(trajectories, axis=0)
avg_total_harvest = np.mean([sum(h) for h in harvests])
print("Optimal policy (first few rows):")
print(policy[:5])
print("\nAverage population trajectory:", avg_trajectory)
print("Average total harvest:", avg_total_harvest)
# Plot results
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 6))
plt.subplot(121)
for traj in trajectories[:20]: # Plot first 20 trajectories
plt.plot(range(T+1), traj, alpha=0.3)
plt.plot(range(T+1), avg_trajectory, 'r-', linewidth=2)
plt.title('Population Trajectories')
plt.xlabel('Time')
plt.ylabel('Population')
plt.subplot(122)
plt.hist([sum(h) for h in harvests], bins=20)
plt.title('Distribution of Total Harvest')
plt.xlabel('Total Harvest')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
Optimal policy (first few rows):
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.3
0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.4 0.4
0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.3
0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.4 0.4
0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.3
0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.4 0.4
0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.3
0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.4 0.4
0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.3
0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.4 0.4
0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4]]
Average population trajectory: [50. 59.144 63.01791265 61.90379293 62.48486942 62.5432987
62.50870009 63.28488739 62.36131516 62.63290211 62.52594072 62.19942686
62.72926534 62.55024395 62.85173443 62.48246875 62.51623895 62.55143499
62.64905504 62.18322939 62.35395919 62.05178891 62.61193761 62.02569485
62.46265882 62.64708742 61.37887892 42.75048552 30.14094071 21.89689001
16.30109051]
Average total harvest: 313.30972350077883

Dynamic Programming#
Rather than expressing the stochasticity in our system through a disturbance term as a parameter to a deterministic difference equation, we often work with an alternative representation (more common in operations research) which uses the Markov Decision Process formulation. The idea is that when we model our system in this way with the disturbance term being drawn indepently of the previous stages, the induced trajectory are those of a Markov chain. Hence, we can re-cast our control problem in that language, leading to the so-called Markov Decision Process framework in which we express the system dynamics in terms of transition probabilities rather than explicit state equations. In this framework, we express the probability that the system is in a given state using the transition probability function:
This function gives the probability of transitioning to state \(\mathbf{x}_{t+1}\) at time \(t+1\), given that the system is in state \(\mathbf{x}_t\) and action \(\mathbf{u}_t\) is taken at time \(t\). Therefore, \(p_t\) specifies a conditional probability distribution over the next states: namely, the sum (for discrete state spaces) or integral over the next state should be 1.
Given the control theory formulation of our problem via a deterministic dynamics function and a noise term, we can derive the corresponding transition probability function through the following relationship:
Here, \(q_t(\mathbf{w})\) represents the probability density or mass function of the disturbance \(\mathbf{W}_t\) (assuming discrete state spaces). When dealing with continuous spaces, the above expression simply contains an integral rather than a summation.
For a system with deterministic dynamics and no disturbance, the transition probabilities become much simpler and be expressed using the indicator function. Given a deterministic system with dynamics:
The transition probability function can be expressed as:
With this transition probability function, we can recast our Bellman optimality equation:
Here, \({c}(\mathbf{x}_t, \mathbf{u}_t)\) represents the expected immediate reward (or negative cost) when in state \(\mathbf{x}_t\) and taking action \(\mathbf{u}_t\) at time \(t\). The summation term computes the expected optimal value for the future states, weighted by their transition probabilities.
This formulation offers several advantages:
It makes the Markovian nature of the problem explicit: the future state depends only on the current state and action, not on the history of states and actions.
For discrete-state problems, the entire system dynamics can be specified by a set of transition matrices, one for each possible action.
It allows us to bridge the gap with the wealth of methods in the field of probabilistic graphical models and statistical machine learning techniques for modelling and analysis.
Notation in Operations Reseach#
The presentation above was intended to bridge the gap between the control-theoretic perspective and the world of closed-loop control through the idea of determining the value function of a parametric optimal control problem. We then saw how the backward induction procedure was applicable to both the deterministic and stochastic cases by taking the expectation over the disturbance variable. We then said that we can alternatively work with a representation of our system where instead of writing our model as a deterministic dynamics function taking a disturbance as an input, we would rather work directly via its transition probability function, which gives rise to the Markov chain interpretation of our system in simulation.
Now we should highlight that the notation used in control theory tends to differ from that found in operations research communities, in which the field of dynamic programming flourished. We summarize those (purely notational) differences in this section.
In operations research, the system state at each decision epoch is typically denoted by \(s \in \mathcal{S}\), where \(S\) is the set of possible system states. When the system is in state \(s\), the decision maker may choose an action \(a\) from the set of allowable actions \(\mathcal{A}_s\). The union of all action sets is denoted as \(\mathcal{A} = \bigcup_{s \in \mathcal{S}} \mathcal{A}_s\).
The dynamics of the system are described by a transition probability function \(p_t(j | s, a)\), which represents the probability of transitioning to state \(j \in \mathcal{S}\) at time \(t+1\), given that the system is in state \(s\) at time \(t\) and action \(a \in \mathcal{A}_s\) is chosen. This transition probability function satisfies:
It’s worth noting that in operations research, we typically work with reward maximization rather than cost minimization, which is more common in control theory. However, we can easily switch between these perspectives by simply negating the quantity. That is, maximizing a reward function is equivalent to minimizing its negative, which we would then call a cost function.
The reward function is denoted by \(r_t(s, a)\), representing the reward received at time \(t\) when the system is in state \(s\) and action \(a\) is taken. In some cases, the reward may also depend on the next state, in which case it is denoted as \(r_t(s, a, j)\). The expected reward can then be computed as:
Combined together, these elemetns specify a Markov decision process, which is fully described by the tuple:
where \(\mathrm{T}\) represents the set of decision epochs (the horizon).
Decision Rules and Policies#
In the presentation provided so far, we directly assumed that the form of our feedback controller was of the form \(u(\mathbf{x}, t)\). The idea is that rather than just looking at the stage as in open-loop control, we would now consider the current state to account for the presence of noise. We came to that conclusion by considering the parametric optimization problem corresponding to the trajectory optimization perspective and saw that the “argmax” counterpart to the value function (the max) was exactly this function \(u(x, t)\). But this presentation was mostly for intuition and neglected the fact that we could consider other kinds of feedback controllers. In the context of MDPs and under the OR terminology, we should now rather talk of policies instead of controllers.
But to properly introduce the concept of policy, we first have to talk about decision rules. A decision rule is a prescription of a procedure for action selection in each state at a specified decision epoch. These rules can vary in their complexity due to their potential dependence on the history and ways in which actions are then selected. Decision rules can be classified based on two main criteria:
Dependence on history: Markovian or History-dependent
Action selection method: Deterministic or Randomized
Markovian decision rules are those that depend only on the current state, while history-dependent rules consider the entire sequence of past states and actions. Formally, we can define a history \(h_t\) at time \(t\) as:
where \(s_u\) and \(a_u\) denote the state and action at decision epoch \(u\). The set of all possible histories at time \(t\), denoted \(H_t\), grows rapidly with \(t\):
This exponential growth in the size of the history set motivates us to seek conditions under which we can avoid searching for history-dependent decision rules and instead focus on Markovian rules, which are much simpler to implement and evaluate.
Decision rules can be further classified as deterministic or randomized. A deterministic rule selects an action with certainty, while a randomized rule specifies a probability distribution over the action space.
These classifications lead to four types of decision rules:
Markovian Deterministic (MD): \(d_t: \mathcal{S} \rightarrow \mathcal{A}_s\)
Markovian Randomized (MR): \(d_t: \mathcal{S} \rightarrow \mathcal{P}(\mathcal{A}_s)\)
History-dependent Deterministic (HD): \(d_t: H_t \rightarrow \mathcal{A}_s\)
History-dependent Randomized (HR): \(d_t: H_t \rightarrow \mathcal{P}(\mathcal{A}_s)\)
Where \(\mathcal{P}(\mathcal{A}_s)\) denotes the set of probability distributions over \(\mathcal{A}_s\).
It’s important to note that decision rules are stage-wise objects. However, to solve an MDP, we need a strategy for the entire horizon. This is where we make a distinction and introduce the concept of a policy. A policy \(\pi\) is a sequence of decision rules, one for each decision epoch:
Where \(N\) is the horizon length (possibly infinite). The set of all policies of class \(K\) (where \(K\) can be HR, HD, MR, or MD) is denoted as \(\Pi^K\).
A special type of policy is a stationary policy, where the same decision rule is used at all epochs: \(\pi = (d, d, ...)\), often denoted as \(d^\infty\).
The relationships between these policy classes form a hierarchy:
Where SD stands for Stationary Deterministic and SR for Stationary Randomized. The largest set is by far the set of history randomized policies.
A fundamental question in MDP theory is: under what conditions can we avoid working with the set \(\Pi^{HR}\) and focus for example on the much simpler set of deterministic Markovian policy? Even more so, we will see that in the infinite horizon case, we can drop the dependance on time and simply consider stationary deterministic Markovian policies.
What is an Optimal Policy?#
Let’s go back to the starting point and define what it means for a policy to be optimal in a Markov Decision Problem. For this, we will be considering different possible search spaces (policy classes) and compare policies based on the ordering of their value from any possible start state. The value of a policy \(\pi\) (optimal or not) is defined as the expected total reward obtained by following that policy from a given starting state. Formally, for a finite-horizon MDP with \(N\) decision epochs, we define the value function \(v_\pi(s, t)\) as:
where \(S_t\) is the state at time \(t\), \(A_t\) is the action taken at time \(t\), and \(r_t\) is the reward function. For simplicity, we write \(v_\pi(s)\) to denote \(v_\pi(s, 1)\), the value of following policy \(\pi\) from state \(s\) at the first stage over the entire horizon \(N\).
In finite-horizon MDPs, our goal is to identify an optimal policy, denoted by \(\pi^*\), that maximizes total expected reward over the horizon \(N\). Specifically:
We call \(\pi^*\) an optimal policy because it yields the highest possible value across all states and all policies within the policy class \(\Pi^{\text{HR}}\). We denote by \(v^*\) the maximum value achievable by any policy:
In reinforcement learning literature, \(v^*\) is typically referred to as the “optimal value function,” while in some operations research references, it might be called the “value of an MDP.” An optimal policy \(\pi^*\) is one for which its value function equals the optimal value function:
It’s important to note that this notion of optimality applies to every state. Policies optimal in this sense are sometimes called “uniformly optimal policies.” A weaker notion of optimality, often encountered in reinforcement learning practice, is optimality with respect to an initial distribution of states. In this case, we seek a policy \(\pi \in \Pi^{\text{HR}}\) that maximizes:
where \(P_1(S_1 = s)\) is the probability of starting in state \(s\).
A fundamental result in MDP theory states that the maximum value can be achieved by searching over the space of deterministic Markovian Policies. Consequently:
This equality significantly simplifies the computational complexity of our algorithms, as the search problem can now be decomposed into \(N\) sub-problems in which we only have to search over the set of possible actions. This is the backward induction algorithm, which we present a second time, but departing this time from the control-theoretic notation and using the MDP formalism:
Algorithm 6 (Backward Induction)
Input: State space \(S\), Action space \(A\), Transition probabilities \(p_t\), Reward function \(r_t\), Time horizon \(N\)
Output: Optimal value functions \(v^*\)
Initialize:
Set \(t = N\)
For all \(s_N \in S\):
\[v^*(s_N, N) = r_N(s_N)\]
For \(t = N-1\) to \(1\):
For each \(s_t \in S\): a. Compute the optimal value function:
\[v^*(s_t, t) = \max_{a \in A_{s_t}} \left\{r_t(s_t, a) + \sum_{j \in S} p_t(j | s_t, a) v^*(j, t+1)\right\}\]b. Determine the set of optimal actions:
\[A_{s_t,t}^* = \arg\max_{a \in A_{s_t}} \left\{r_t(s_t, a) + \sum_{j \in S} p_t(j | s_t, a) v^*(j, t+1)\right\}\]
Return the optimal value functions \(u_t^*\) and optimal action sets \(A_{s_t,t}^*\) for all \(t\) and \(s_t\)
Note that the same procedure can also be used for finding the value of a policy with minor changes;
Algorithm 7 (Policy Evaluation)
Input:
State space \(S\)
Action space \(A\)
Transition probabilities \(p_t\)
Reward function \(r_t\)
Time horizon \(N\)
A markovian deterministic policy \(\pi\)
Output: Value function \(v^\pi\) for policy \(\pi\)
Initialize:
Set \(t = N\)
For all \(s_N \in S\):
\[v_\pi(s_N, N) = r_N(s_N)\]
For \(t = N-1\) to \(1\):
For each \(s_t \in S\): a. Compute the value function for the given policy:
\[v_\pi(s_t, t) = r_t(s_t, d_t(s_t)) + \sum_{j \in S} p_t(j | s_t, d_t(s_t)) v_\pi(j, t+1)\]
Return the value function \(v^\pi(s_t, t)\) for all \(t\) and \(s_t\)
This code could also finally be adapted to support randomized policies using:
Example: Sample Size Determination in Pharmaceutical Development#
Pharmaceutical development is the process of bringing a new drug from initial discovery to market availability. This process is lengthy, expensive, and risky, typically involving several stages:
Drug Discovery: Identifying a compound that could potentially treat a disease.
Preclinical Testing: Laboratory and animal testing to assess safety and efficacy. . Clinical Trials: Testing the drug in humans, divided into phases:
Phase I: Testing for safety in a small group of healthy volunteers.
Phase II: Testing for efficacy and side effects in a larger group with the target condition.
Phase III: Large-scale testing to confirm efficacy and monitor side effects.
Regulatory Review: Submitting a New Drug Application (NDA) for approval.
Post-Market Safety Monitoring: Continuing to monitor the drug’s effects after market release.
This process can take 10-15 years and cost over $1 billion [1]. The high costs and risks involved call for a principled approach to decision making. We’ll focus on the clinical trial phases and NDA approval, per the MDP model presented by [6]:
States (\(S\)): Our state space is \(S = \{s_1, s_2, s_3, s_4\}\), where:
\(s_1\): Phase I clinical trial
\(s_2\): Phase II clinical trial
\(s_3\): Phase III clinical trial
\(s_4\): NDA approval
Actions (\(A\)): At each state, the action is choosing the sample size \(n_i\) for the corresponding clinical trial. The action space is \(A = \{10, 11, ..., 1000\}\), representing possible sample sizes.
Transition Probabilities (\(P\)): The probability of moving from one state to the next depends on the chosen sample size and the inherent properties of the drug. We define:
\(P(s_2|s_1, n_1) = p_{12}(n_1) = \sum_{i=0}^{\lfloor\eta_1 n_1\rfloor} \binom{n_1}{i} p_0^i (1-p_0)^{n_1-i}\) where \(p_0\) is the true toxicity rate and \(\eta_1\) is the toxicity threshold for Phase I.
Of particular interest is the transition from Phase II to Phase III which we model as:
\(P(s_3|s_2, n_2) = p_{23}(n_2) = \Phi\left(\frac{\sqrt{n_2}}{2}\delta - z_{1-\eta_2}\right)\)
where \(\Phi\) is the cumulative distribution function (CDF) of the standard normal distribution:
\(\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^x e^{-t^2/2} dt\)
This is giving us the probability that we would observe a treatment effect this large or larger if the null hypothesis (no treatment effect) were true. A higher probability indicates stronger evidence of a treatment effect, making it more likely that the drug will progress to Phase III.
In this expression, \(\delta\) is called the “normalized treatment effect”. In clinical trials, we’re often interested in the difference between the treatment and control groups. The “normalized” part means we’ve adjusted this difference for the variability in the data. Specifically \(\delta = \frac{\mu_t - \mu_c}{\sigma}\) where \(\mu_t\) is the mean outcome in the treatment group, \(\mu_c\) is the mean outcome in the control group, and \(\sigma\) is the standard deviation of the outcome. A larger \(\delta\) indicates a stronger treatment effect.
Furthermore, the term \(z_{1-\eta_2}\) is the \((1-\eta_2)\)-quantile of the standard normal distribution. In other words, it’s the value where the probability of a standard normal random variable being greater than this value is \(\eta_2\). For example, if \(\eta_2 = 0.05\), then \(z_{1-\eta_2} \approx 1.645\). A smaller \(\eta_2\) makes the trial more conservative, requiring stronger evidence to proceed to Phase III.
Finally, \(n_2\) is the sample size for Phase II. The \(\sqrt{n_2}\) term reflects that the precision of our estimate of the treatment effect improves with the square root of the sample size.
\(P(s_4|s_3, n_3) = p_{34}(n_3) = \Phi\left(\frac{\sqrt{n_3}}{2}\delta - z_{1-\eta_3}\right)\) where \(\eta_3\) is the significance level for Phase III.
Rewards (\(R\)): The reward function captures the costs of running trials and the potential profit from a successful drug:
\(r(s_i, n_i) = -c_i(n_i)\) for \(i = 1, 2, 3\), where \(c_i(n_i)\) is the cost of running a trial with sample size \(n_i\).
\(r(s_4) = g_4\), where \(g_4\) is the expected profit from a successful drug.
Discount Factor (\(\gamma\)): We use a discount factor \(0 < \gamma \leq 1\) to account for the time value of money and risk preferences.
Show code cell source
import numpy as np
from scipy.stats import binom
from scipy.stats import norm
def binomial_pmf(k, n, p):
return binom.pmf(k, n, p)
def transition_prob_phase1(n1, eta1, p0):
return np.sum([binomial_pmf(i, n1, p0) for i in range(int(eta1 * n1) + 1)])
def transition_prob_phase2(n2, eta2, delta):
return norm.cdf((np.sqrt(n2) / 2) * delta - norm.ppf(1 - eta2))
def transition_prob_phase3(n3, eta3, delta):
return norm.cdf((np.sqrt(n3) / 2) * delta - norm.ppf(1 - eta3))
def immediate_reward(n):
return -n # Negative to represent cost
def backward_induction(S, A, gamma, g4, p0, delta, eta1, eta2, eta3):
V = np.zeros(len(S))
V[3] = g4 # Value for NDA approval state
optimal_n = [None] * 3 # Store optimal n for each phase
# Backward induction
for i in range(2, -1, -1): # Iterate backwards from Phase III to Phase I
max_value = -np.inf
for n in A:
if i == 0: # Phase I
p = transition_prob_phase1(n, eta1, p0)
elif i == 1: # Phase II
p = transition_prob_phase2(n, eta2, delta)
else: # Phase III
p = transition_prob_phase3(n, eta3, delta)
value = immediate_reward(n) + gamma * p * V[i+1]
if value > max_value:
max_value = value
optimal_n[i] = n
V[i] = max_value
return V, optimal_n
# Set up the problem parameters
S = ['Phase I', 'Phase II', 'Phase III', 'NDA approval']
A = range(10, 1001)
gamma = 0.95
g4 = 10000
p0 = 0.1 # Example toxicity rate for Phase I
delta = 0.5 # Example normalized treatment difference
eta1, eta2, eta3 = 0.2, 0.1, 0.025
# Run the backward induction algorithm
V, optimal_n = backward_induction(S, A, gamma, g4, p0, delta, eta1, eta2, eta3)
# Print results
for i, state in enumerate(S):
print(f"Value for {state}: {V[i]:.2f}")
print(f"Optimal sample sizes: Phase I: {optimal_n[0]}, Phase II: {optimal_n[1]}, Phase III: {optimal_n[2]}")
# Sanity checks
print("\nSanity checks:")
print(f"1. NDA approval value: {V[3]}")
print(f"2. All values non-positive and <= NDA value: {all(v <= V[3] for v in V)}")
print(f"3. Optimal sample sizes in range: {all(10 <= n <= 1000 for n in optimal_n if n is not None)}")
Value for Phase I: 7869.92
Value for Phase II: 8385.83
Value for Phase III: 9123.40
Value for NDA approval: 10000.00
Optimal sample sizes: Phase I: 75, Phase II: 239, Phase III: 326
Sanity checks:
1. NDA approval value: 10000.0
2. All values non-positive and <= NDA value: True
3. Optimal sample sizes in range: True
Infinite-Horizon MDPs#
It often makes sense to model control problems over infinite horizons. We extend the previous setting and define the expected total reward of policy \(\pi \in \Pi^{\mathrm{HR}}\), \(v^\pi\) as:
One drawback of this model is that we could easily encounter values that are \(+\infty\) or \(-\infty\), even in a setting as simple as a single-state MDP which loops back into itself and where the accrued reward is nonzero.
Therefore, it is often more convenient to work with an alternative formulation which guarantees the existence of a limit: the expected total discounted reward of policy \(\pi \in \Pi^{\mathrm{HR}}\) is defined to be:
for \(0 \leq \gamma < 1\) and when \(\max_{s \in \mathcal{S}} \max_{a \in \mathcal{A}_s}|r(s, a)| = R_{\max} < \infty\), in which case, \(|v_\gamma^\pi(s)| \leq (1-\gamma)^{-1} R_{\max}\).
Finally, another possibility for the infinite-horizon setting is the so-called average reward or gain of policy \(\pi \in \Pi^{\mathrm{HR}}\) defined as:
We won’t be working with this formulation in this course due to its inherent practical and theoretical complexities.
Extending the previous notion of optimality from finite-horizon models, a policy \(\pi^*\) is said to be discount optimal for a given \(\gamma\) if:
Furthermore, the value of a discounted MDP \(v_\gamma^*(s)\), is defined by:
More often, we refer to \(v_\gamma\) by simply calling it the optimal value function.
As for the finite-horizon setting, the infinite horizon discounted model does not require history-dependent policies, since for any \(\pi \in \Pi^{HR}\) there exists a \(\pi^{\prime} \in \Pi^{MR}\) with identical total discounted reward: $\( v_\gamma^*(s) \equiv \max_{\pi \in \Pi^{HR}} v_\gamma^\pi(s)=\max_{\pi \in \Pi^{MR}} v_\gamma^\pi(s) . \)$
Random Horizon Interpretation of Discounting#
The use of discounting can be motivated both from a modeling perspective and as a means to ensure that the total reward remains bounded. From the modeling perspective, we can view discounting as a way to weight more or less importance on the immediate rewards vs. the long-term consequences. There is also another interpretation which stems from that of a finite horizon model but with an uncertain end time. More precisely:
Let \(v_\nu^\pi(s)\) denote the expected total reward obtained by using policy \(\pi\) when the horizon length \(\nu\) is random. We define it by:
Theorem 1 (Random horizon interpretation of discounting)
Suppose that the horizon \(\nu\) follows a geometric distribution with parameter \(\gamma\), \(0 \leq \gamma < 1\), independent of the policy such that \(P(\nu=n) = (1-\gamma) \gamma^{n-1}, \, n=1,2, \ldots\), then \(v_\nu^\pi(s) = v_\gamma^\pi(s)\) for all \(s \in \mathcal{S}\) .
Proof. See proposition 5.3.1 in [33]
Under the bounded reward assumption and \(\gamma < 1\), the series converges and we can reverse the order of summation :
where the last line follows from the geometric series:
Vector Representation in Markov Decision Processes#
Let V be the set of bounded real-valued functions on a discrete state space S. This means any function \( f \in V \) satisfies the condition:
where notation \( \|f\| \) represents the sup-norm (or \( \ell_\infty \)-norm) of the function \( f \).
When working with discrete state spaces, we can interpret elements of V as vectors and linear operators on V as matrices, allowing us to leverage tools from linear algebra. The sup-norm (\(\ell_\infty\) norm) of matrix \(\mathbf{H}\) is defined as:
where \(\mathbf{H}_{s,j}\) represents the \((s, j)\)-th component of the matrix \(\mathbf{H}\).
For a Markovian decision rule \(d \in D^{MD}\), we define:
For a randomized decision rule \(d \in D^{MR}\), these definitions extend to:
In both cases, \(\mathbf{r}_d\) denotes a reward vector in \(\mathbb{R}^{|S|}\), with each component \(\mathbf{r}_d(s)\) representing the reward associated with state \(s\). Similarly, \(\mathbf{P}_d\) is a transition probability matrix in \(\mathbb{R}^{|S| \times |S|}\), capturing the transition probabilities under decision rule \(d\).
For a nonstationary Markovian policy \(\pi = (d_1, d_2, \ldots) \in \Pi^{MR}\), the expected total discounted reward is given by:
Using vector notation, this can be expressed as:
This formulation leads to a recursive relationship:
where \(\pi^{\prime} = (d_2, d_3, \ldots)\).
For stationary policies, where \(\pi = d^{\infty} \equiv (d, d, \ldots)\), the total expected reward simplifies to:
This last expression is called a Neumann series expansion, and it’s guaranteed to exists under the assumptions of bounded reward and discount factor strictly less than one. Consequently, for a stationary policy, \(\mathbf{v}_\gamma^{d^\infty}\) can be determined as the solution to the linear equation:
which can be rearranged to:
We can also characterize \(\mathbf{v}_\gamma^{d^\infty}\) as the solution to an operator equation. More specifically, define the linear transformation \(\mathrm{L}_d\) by
for any \(v \in V\). Intuitively, \(\mathrm{L}_d\) takes a value function \(\mathbf{v}\) as input and returns a new value function that combines immediate rewards (\(\mathbf{r}_d\)) with discounted future values (\(\gamma \mathbf{P}_d\mathbf{v}\)).
Therefore, we view \(\mathrm{L}_d\) as an operator mapping elements of \(V\) to \(V\): ie. \(\mathrm{L}_d: V \rightarrow V\). The fact that the value function of a policy is the solution to a system of equations can then be expressed with the statement:
Solving Operator Equations#
The operator equation we encountered in MDPs, \(\mathbf{v}_\gamma^{d^\infty} = \mathrm{L}_d \mathbf{v}_\gamma^{d^\infty}\), is a specific instance of a more general class of problems known as operator equations. These equations appear in various fields of mathematics and applied sciences, ranging from differential equations to functional analysis.
Operator equations can take several forms, each with its own characteristics and solution methods:
Fixed Point Form: \(x = \mathrm{T}(x)\), where \(\mathrm{T}: X \rightarrow X\). Common in fixed-point problems, such as our MDP equation, we seek a fixed point \(x^*\) such that \(x^* = \mathrm{T}(x^*)\).
General Operator Equation: \(\mathrm{T}(x) = y\), where \(\mathrm{T}: X \rightarrow Y\). Here, \(X\) and \(Y\) can be different spaces. We seek an \(x \in X\) that satisfies the equation for a given \(y \in Y\).
Nonlinear Equation: \(\mathrm{T}(x) = 0\), where \(\mathrm{T}: X \rightarrow Y\). A special case of the general operator equation where we seek roots or zeros of the operator.
Variational Inequality: Find \(x^* \in K\) such that \(\langle \mathrm{T}(x^*), x - x^* \rangle \geq 0\) for all \(x \in K\). Here, \(K\) is a closed convex subset of \(X\), and \(\mathrm{T}: K \rightarrow X^*\) (the dual space of \(X\)). These problems often arise in optimization, game theory, and partial differential equations.
Successive Approximation Method#
For equations in fixed point form, a common numerical solution method is successive approximation, also known as fixed-point iteration:
Algorithm 8 (Successive Approximation)
Input: An operator \(\mathrm{T}: X \rightarrow X\), an initial guess \(x_0 \in X\), and a tolerance \(\epsilon > 0\)
Output: An approximate fixed point \(x^*\) such that \(\|x^* - \mathrm{T}(x^*)\| < \epsilon\)
Initialize \(n = 0\)
repeat
3. Compute \(x_{n+1} = \mathrm{T}(x_n)\)
4. If \(\|x_{n+1} - x_n\| < \epsilon\), return \(x_{n+1}\)
5. Set \(n = n + 1\)until convergence or maximum iterations reached
The convergence of successive approximation depends on the properties of the operator \(\mathrm{T}\). In the simplest and most common setting, we assume \(\mathrm{T}\) is a contraction mapping. The Banach Fixed-Point Theorem then guarantees that \(\mathrm{T}\) has a unique fixed point, and the successive approximation method will converge to this fixed point from any starting point. Specifically, \(\mathrm{T}\) is a contraction if there exists a constant \(q \in [0,1)\) such that for all \(x,y \in X\):
where \(d\) is the metric on \(X\). In this case, the rate of convergence is linear, with error bound:
However, the contraction mapping condition is not the only one that can lead to convergence. For instance, if \(\mathrm{T}\) is nonexpansive (i.e., Lipschitz continuous with Lipschitz constant 1) and \(X\) is a Banach space with certain geometrical properties (e.g., uniformly convex), then under additional conditions (e.g., \(\mathrm{T}\) has at least one fixed point), the successive approximation method can still converge, albeit potentially more slowly than in the contraction case.
In practice, when dealing with specific problems like MDPs or differential equations, the properties of the operator often naturally align with one of these convergence conditions. For example, in discounted MDPs, the Bellman operator is a contraction in the supremum norm, which guarantees the convergence of value iteration.
Newton-Kantorovich Method#
The Newton-Kantorovich method is a generalization of Newton’s method from finite dimensional vector spaces to infinite dimensional function spaces: rather than iterating in the space of vectors, we are iterating in the space of functions. Just as in the finite-dimensional counterpart, the idea is to improve the rate of convergence of our method by taking an “educated guess” on where to move next using a linearization of our operator at the current point. Now the concept of linearization, which is synonymous with derivative, will also require a generalization. Here we are in essence trying to quantify how the output of the operator \(\mathrm{T}\) – a function – varies as we perturb its input – also a function. The right generalization here is that of the Fréchet derivative.
Before we delve into the Fréchet derivative, it’s important to understand the context in which it operates: Banach spaces. A Banach space is a complete normed vector space: ie a vector space, that has a norm, and which every Cauchy sequence convergeces. Banach spaces provide a natural generalization of finite-dimensional vector spaces to infinite-dimensional settings. The norm in a Banach space allows us to quantify the “size” of functions and the “distance” between functions. This allows us to define notions of continuity, differentiability, and for analyzing the convergence of our method. Furthermore, the completeness property of Banach spaces ensures that we have a well-defined notion of convergence.
In the context of the Newton-Kantorovich method, we typically work with an operator \(\mathrm{T}: X \to Y\), where both \(X\) and \(Y\) are Banach spaces and whose Fréchet derivative at a point \(x \in X\), denoted \(\mathrm{T}'(x)\), is a bounded linear operator from \(X\) to \(Y\) such that:
where \(\|\cdot\|_X\) and \(\|\cdot\|_Y\) are the norms in \(X\) and \(Y\) respectively. In other words, \(\mathrm{T}'(x)\) is the best linear approximation of \(\mathrm{T}\) near \(x\).
Now apart from those mathematical technicalities, Newton-Kantorovich has in essence the same structure as that of the original Newton’s method. That is, it applies the following sequence of steps:
Linearize the Operator: Given an approximation \( x_n \), we consider the Fréchet derivative of \( \mathrm{T} \), denoted by \( \mathrm{T}'(x_n) \). This derivative is a linear operator that provides a local approximation of \( \mathrm{T} \), near \( x_n \).
Set Up the Newton Step: The method then solves the linearized equation for a correction \( h_n \):
\[ \mathrm{T}(x_n) h_n = \mathrm{T}(x_n) - x_n. \]This equation represents a linear system where \( h_n \) is chosen to minimize the difference between \( x_n \) and \( \mathrm{T}(x_n) \) with respect to the operator’s local behavior.
Update the Solution: The new approximation \( x_{n+1} \) is then given by:
\[ x_{n+1} = x_n - h_n. \]This correction step refines \( x_n \), bringing it closer to the true solution.
Repeat Until Convergence: We repeat the linearization and update steps until the solution \( x_n \) converges to the desired tolerance, which can be verified by checking that \( \|\mathrm{T}(x_n) - x_n\| \) is sufficiently small, or by monitoring the norm \( \|x_{n+1} - x_n\| \).
The convergence of Newton-Kantorovich does not hinge on \( \mathrm{T} \) being a contraction over the entire domain – as it could be the case for successive approximation. The convergence properties of the Newton-Kantorovich method are as follows:
Local Convergence: Under mild conditions (e.g., \(\mathrm{T}\) is Fréchet differentiable and \(\mathrm{T}'(x)\) is invertible near the solution), the method converges locally. This means that if the initial guess is sufficiently close to the true solution, the method will converge.
Global Convergence: Global convergence is not guaranteed in general. However, under stronger conditions (e.g., \( \mathrm{T} \), is analytic and satisfies certain bounds), the method can converge globally.
Rate of Convergence: When the method converges, it typically exhibits quadratic convergence. This means that the error at each step is proportional to the square of the error at the previous step:
\[ \|x_{n+1} - x^*\| \leq C\|x_n - x^*\|^2 \]where \(x^*\) is the true solution and \(C\) is some constant. This quadratic convergence is significantly faster than the linear convergence typically seen in methods like successive approximation.
Optimality Equations for Infinite-Horizon MDPs#
Recall that in the finite-horizon setting, the optimality equations are:
where \(v_n(s)\) is the value function at time step \(n\) for state \(s\), \(A_s\) is the set of actions available in state \(s\), \(r(s, a)\) is the reward function, \(\gamma\) is the discount factor, and \(p(j | s, a)\) is the transition probability from state \(s\) to state \(j\) given action \(a\).
Intuitively, we would expect that by taking the limit of \(n\) to infinity, we might get the nonlinear equations:
which are called the optimality equations or Bellman equations for infinite-horizon MDPs.
We can adopt an operator-theoretic perspective by defining a nonlinear operator \(\mathrm{L}\) on the space \(V\) of bounded real-valued functions on the state space \(S\). We can then show that the value of an MDP is the solution to the following fixed-point equation:
where \(D^{MD}\) is the set of Markov deterministic decision rules, \(\mathbf{r}_d\) is the reward vector under decision rule \(d\), and \(\mathbf{P}_d\) is the transition probability matrix under decision rule \(d\).
Note that while we write \(\max_{d \in D^{MD}}\), we do not implement the above operator in this way. Written in this fashion, it would indeed imply that we first need to enumerate all Markov deterministic decision rules and pick the maximum. Now the fact that we compare policies based on their value functions in a componentwise fashion, maxing over the space of Markovian deterministic rules amounts to the following update in component form:
The equivalence between these two forms can be shown mathematically, as demonstrated in the following proposition and proof.
Proposition 2
The operator \(\mathrm{L}\) defined as a maximization over Markov deterministic decision rules:
is equivalent to the componentwise maximization over actions:
Proof. Let’s define the right-hand side of the first equation as \(R_1\) and the right-hand side of the second equation as \(R_2\). We’ll prove that \(R_1 \leq R_2\) and \(R_2 \leq R_1\), which will establish their equality.
Step 1: Proving \(R_1 \leq R_2\)
For any \(d \in D^{MD}\), we have:
This is because \(d(s) \in \mathcal{A}_s\), so the left-hand side is included in the set over which we’re maximizing on the right-hand side.
Taking the maximum over all \(d \in D^{MD}\) on the left-hand side doesn’t change this inequality:
Therefore, \(R_1 \leq R_2\).
Step 2: Proving \(R_2 \leq R_1\)
Let \(a^* \in \mathcal{A}_s\) be the action that achieves the maximum in \(R_2\). We can construct a Markov deterministic decision rule \(d^*\) such that \(d^*(s) = a^*\) and \(d^*(s')\) is arbitrary for \(s' \neq s\). Then:
Conclusion: Since we’ve shown \(R_1 \leq R_2\) and \(R_2 \leq R_1\), we can conclude that \(R_1 = R_2\), which proves the equivalence of the two forms of the operator.
Algorithms for Solving the Optimality Equations#
The optimality equations are operator equations. Therefore, we can apply general numerical methods to solve them. Applying the successive approximation method to the Bellman optimality equation yields a method known as “value iteration” in dynamic programming. A direct application of the blueprint for successive approximation yields the following algorithm:
Algorithm 9 (Value Iteration)
Input Given an MDP \((S, A, P, R, \gamma)\) and tolerance \(\varepsilon > 0\)
Output Compute an \(\varepsilon\)-optimal value function \(V\) and policy \(\pi\)
Initialize \(v_0(s) = 0\) for all \(s \in S\)
\(n \leftarrow 0\)
repeat
For each \(s \in S\):
\(v_{n+1}(s) \leftarrow \max_{a \in A} \left\{r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a)v_n(s')\right\}\)
\(\delta \leftarrow \|v_{n+1} - v_n\|_\infty\)
\(n \leftarrow n + 1\)
until \(\delta < \frac{\varepsilon(1-\gamma)}{2\gamma}\)
For each \(s \in S\):
\(\pi(s) \leftarrow \arg\max_{a \in A} \left\{r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a)v_n(s')\right\}\)
return \(v_n, \pi\)
The termination criterion in this algorithm is based on a specific bound that provides guarantees on the quality of the solution. This is in contrast to supervised learning, where we often use arbitrary termination criteria based on computational budget or early stopping when the learning curve flattens. This is because establishing implementable generalization bounds in supervised learning is challenging.
However, in the dynamic programming context, we can derive various bounds that can be implemented in practice. These bounds help us terminate our procedure with a guarantee on the precision of our value function and, correspondingly, on the optimality of the resulting policy.
Proposition 3 (Convergence of Value Iteration)
(Adapted from Puterman [33] theorem 6.3.1)
Let \(v_0\) be any initial value function, \(\varepsilon > 0\) a desired accuracy, and let \(\{v_n\}\) be the sequence of value functions generated by value iteration, i.e., \(v_{n+1} = \mathrm{L}v_n\) for \(n \geq 1\), where \(\mathrm{L}\) is the Bellman optimality operator. Then:
\(v_n\) converges to the optimal value function \(v^*_\gamma\),
The algorithm terminates in finite time,
The resulting policy \(\pi_\varepsilon\) is \(\varepsilon\)-optimal, and
When the algorithm terminates, \(v_{n+1}\) is within \(\varepsilon/2\) of \(v^*_\gamma\).
Proof. Parts 1. and 2. follow directly from the fact that \(\mathrm{L}\) is a contraction mapping. Hence, by Banach’s fixed-point theorem, it has a unique fixed point (which is \(v^*_\gamma\)), and repeated application of \(\mathrm{L}\) will converge to this fixed point. Moreover, this convergence happens at a linear/geometric rate, which ensures that we reach the termination condition in finite time.
To show that the Bellman optimality operator \(\mathrm{L}\) is a contraction mapping, we need to prove that for any two value functions \(v\) and \(u\):
where \(\gamma \in [0,1)\) is the discount factor and \(\|\cdot\|_\infty\) is the supremum norm.
Let’s start by writing out the definition of \(\mathrm{L}v\) and \(\mathrm{L}u\):
Now, for any state s, let \(a_V\) be the action that achieves the maximum for \(\mathrm{L}\)V, and \(a_U\) be the action that achieves the maximum for \(\mathrm{L}u\). Then:
By the definition of \(a_V\) and \(a_U\), we know that:
Subtracting these inequalities:
Taking the absolute value of both sides and using the fact that \(\sum_{j \in \mathcal{S}} p(j|s,a) = 1\) for any s and a:
Since this holds for all \(s\), we can take the supremum over \(s\):
Thus, we have shown that \(\mathrm{L}\) is indeed a contraction mapping with contraction factor \(\gamma\).
Now, let’s focus on parts 3. and 4. Suppose our algorithm has just terminated, i.e., \(\|v_{n+1} - v_n\| < \frac{\varepsilon(1-\gamma)}{2\gamma}\) for some \(n\). We want to show that our current value function \(v_{n+1}\) and the policy \(\pi_\varepsilon\) derived from it are close to optimal.
We start with the following inequality:
This inequality is derived using the triangle inequality:
Let’s focus on the first term, \(\|V^{\pi_\varepsilon}_\gamma - v_{n+1}\|\). We can expand this:
Here, we’ve used the fact that \(V^{\pi_\varepsilon}_\gamma\) is a fixed point of \(\mathrm{L}_{\pi_\varepsilon}\), that \(\mathrm{L}_{\pi_\varepsilon}v_{n+1} = \mathrm{L}v_{n+1}\) (by definition of \(\pi_\varepsilon\)), and that both \(\mathrm{L}\) and \(\mathrm{L}_{\pi_\varepsilon}\) are contractions with factor \(\gamma\).
Rearranging this inequality, we get:
We can derive a similar bound for \(\|v_{n+1} - v^*_\gamma\|\):
Now, remember that our algorithm terminated when \(\|v_{n+1} - v_n\| < \frac{\varepsilon(1-\gamma)}{2\gamma}\). Plugging this into our bounds:
Finally, combining these results with our initial inequality:
This completes the proof. We’ve shown that when the algorithm terminates, our value function \(v_{n+1}\) is within \(\varepsilon/2\) of optimal (part 4.), and our policy \(\pi_\varepsilon\) is \(\varepsilon\)-optimal (part 3.).
Newton-Kantorovich applied to the Optimality Equations#
Another perspective on the optimality equations is that instead of looking for \(v_\gamma^\star\) as the fixed-point of \(\mathrm{L}\) in the fixed-point problem find \(v\) such that \(\mathrm{L} v = v\), we consider instead the related form in the nonlinear operator equation \(\mathrm{L}v - v = 0\). Writing things in this way highlights the fact that we could also consider using Newton-Kantorovich iteration to solve this equation instead of the successive approximation method.
If we were to apply the blueprint of Newton-Kantorovich, we would get:
Algorithm 10 (Newton-Kantorovich for Bellman Optimality Equation)
Input: MDP \((S, A, P, R, \gamma)\), initial guess \(v_0\), tolerance \(\epsilon > 0\)
Output: Approximate optimal value function \(v^\star\)
Initialize \(n = 0\)
repeat
Compute the Fréchet derivative of \(\mathrm{B}(v) = \mathrm{L}v - v\) at \(v_n\):
\[\mathrm{B}'(v_n) = \mathrm{L}' - \mathrm{I}\]where \(\mathrm{I}\) is the identity operator
Solve the linear equation for \(h_n\):
\[(\mathrm{L}' - \mathrm{I})h_n = v_n - \mathrm{L}v_n\]Update: \(v_{n+1} = v_n - h_n\)
\(n = n + 1\)
until \(\|\mathrm{L}v_n - v_n\| < \epsilon\)
return \(v_n\)
The key difference in this application is the specific form of our operator \(\mathrm{B}(v) = \mathrm{L}v - v\), where \(\mathrm{L}\) is the Bellman optimality operator. The Fréchet derivative of this operator is \(\mathrm{B}'(v) = \mathrm{L}' - \mathrm{I}\), where \(\mathrm{L}'\) is the Fréchet derivative of the Bellman optimality operator and \(\mathrm{I}\) is the identity operator.
Let’s examine the Fréchet derivative of \(\mathrm{L}\) more closely. The Bellman optimality operator \(\mathrm{L}\) is defined as:
For each state \(s\), let \(a^*(s)\) be the action that achieves the maximum in \((\mathrm{L}v)(s)\). Then, for any function \(h\), the Fréchet derivative of \(\mathrm{L}\) at v has the following effect when applied to any function \(h\):
This means that the Fréchet derivative \(\mathrm{L}'(v)\) is a linear operator that, when applied to a function \(h\), gives a new function whose value at each state \(s\) is a discounted expectation of \(h\) over the next states, using the transition probabilities corresponding to the optimal action \(a^*(s)\) at the current point \(v\).
Now, let’s look more closely at what happens in each iteration of this Newton-Kantorovich method:
In step 3, when we compute \(\mathrm{L}'(v_n)\), we’re essentially defining a policy based on the current value function estimate. This policy chooses the action \(a^*(s)\) for each state \(s\).
In step 4, we solve the linear equation:
\[(\mathrm{L}' - \mathrm{I})h_n = v_n - \mathrm{L}v_n\]This can be rearranged to:
\[\mathrm{L}'h_n = \mathrm{L}v_n\]Solving this equation is equivalent to evaluating the policy defined by \(a^*(s)\).
The update in step 5 can be rewritten as:
\[v_{n+1} = \mathrm{L}v_n\]This step improves our value function estimate based on the policy we just evaluated.
Interestingly, this sequence of steps - deriving a policy from the current value function, evaluating that policy, and then improving the value function - is a well-known algorithm in dynamic programming called policy iteration. In fact, policy iteration can be viewed as simply the application of the Newton-Kantorovich method to the operator \(\mathrm{B}(v) = \mathrm{L}v - v\).
Policy Iteration#
While we derived policy iteration-like steps from the Newton-Kantorovich method, it’s worth examining policy iteration as a standalone algorithm, as it has been traditionally presented in the field of dynamic programming.
The policy iteration algorithm for discounted Markov decision problems is as follows:
Algorithm 11 (Policy Iteration)
Input: MDP \((S, A, P, R, \gamma)\) Output: Optimal policy \(\pi^*\)
Initialize: \(n = 0\), select an arbitrary decision rule \(d_0 \in D\)
repeat 3. (Policy evaluation) Obtain \(\mathbf{v}^n\) by solving:
\[(\mathbf{I}-\gamma \mathbf{P}_{d_n}) \mathbf{v} = \mathbf{r}_{d_n}\](Policy improvement) Choose \(d_{n+1}\) to satisfy:
\[d_{n+1} \in \arg\max_{d \in D}\left\{\mathbf{r}_d+\gamma \mathbf{P}_d \mathbf{v}^n\right\}\]Set \(d_{n+1} = d_n\) if possible.
If \(d_{n+1} = d_n\), return \(d^* = d_n\)
\(n = n + 1\)
until convergence
As opposed to value iteration, this algorithm produces a sequence of both deterministic Markovian decision rules \(\{d_n\}\) and value functions \(\{\mathbf{v}^n\}\). We recognize in this algorithm the linearization step of the Newton-Kantorovich procedure, which takes place here in the policy evaluation step 3 where we solve the linear system \((\mathbf{I}-\gamma \mathbf{P}_{d_n}) \mathbf{v} = \mathbf{r}_{d_n}\). In practice, this linear sytem could be solved either using direct methods (eg. Gaussian elimination), using simple iterative methods such as the successive approximation method for policy evaluation, or more sophisticated methods such as GMRES.