7. From Trajectories to Policies#

The methods seen so far, whether in discrete or continuous time, were deterministic trajectory optimization methods. Given an initial state, they provide a prescription of controls to apply as a function of time. In the collocation case, we would also obtain a state trajectory corresponding to those controls as a by-product. The control function, \(\mathbf{u}[i]\) (in discrete time) or \(\mathbf{u}(t)\) (in continuous time), is blind to the system’s state at any point in time: it simply reads off the values stored in memory or performs some form of interpolation in time. This approach is optimal under the assumption that our system is deterministic. No matter how many times we repeat the experiment from the same initial state, we will always get the same result; hence, reapplying the same controls in the same order is sufficient.

However, no matter how good our model is, real-life deployment of our controller will inevitably involve prediction errors. In such cases, a simple control function that only considers the current time—an open-loop controller—won’t suffice. We must inform our controller that the real world is likely not in the state it thinks it’s in: we need to provide feedback to close the loop. Depending on the structure of the noise affecting our system, we may encounter feedback controllers that depend on the entire history (in the case of partial observability) or just the current state (under perfect observability). Dynamic programming will provide us with solution methods to derive such controllers, both in continuous time (via the Hamilton-Jacobi equations) and in discrete time (via the Bellman optimality equations). It also provides the framework needed to address partially observable systems (for which we can’t directly measure the state) using the POMDP formulation. We will cover these solution methods in this chapter.

8. Closing the Loop by Replanning#

There exists a simple recipe for closing the loop: since we will likely end up in states that we haven’t planned for, we might as well recompute our solution as frequently as possible using the updated state information as initial state to our trajectory optimization procedure. This replanning or reoptimization strategy is called Model Predictive Control (MPC). Given any trajectory optimization algorithm, we can turn it into a closed-loop variant using MPC.

Consider a trajectory optimization problem in Bolza form,

\[\begin{split} \begin{aligned} \text{minimize} \quad & c(\mathbf{x}(t_0 + T)) + \int_{t_0}^{t_0 + T} c(\mathbf{x}(t), \mathbf{u}(t)) \, dt \\ \text{subject to} \quad & \dot{\mathbf{x}}(t) = \mathbf{f}(\mathbf{x}(t), \mathbf{u}(t)) \\ & \mathbf{g}(\mathbf{x}(t), \mathbf{u}(t)) \leq \mathbf{0} \\ & \mathbf{u}_{\text{min}} \leq \mathbf{u}(t) \leq \mathbf{u}_{\text{max}} \\ \text{given} \quad & \mathbf{x}(t_0) = \mathbf{x}_{\text{current}} \enspace . \end{aligned} \end{split}\]

MPC then proceeds as follows. At the current time \( t_0 \), the MPC controller considers a prediction horizon \( T \) over which it optimizes the future trajectory of the system. The controller then solves the trajectory optimization problem with the initial state set to the current state \( \mathbf{x}_{\text{current}} = \mathbf{x}(t_0) \). This yields an optimal control trajectory \( \mathbf{u}^*(t) \) for \( t \in [t_0, t_0 + T] \).

However, instead of applying the entire computed control trajectory, the controller extracts only the first part of the solution, namely \( \mathbf{u}^*(t_0) \), and applies it to the system. This strategy is called receding horizon control. The idea is that the control \( \mathbf{u}^*(t_0) \) is based on the best prediction available at time \( t_0 \), considering the current state of the system and expected future disturbances.

After applying the first control input, the system evolves according to its dynamics, and the controller measures the new state at the next time step, \( t_0 + \Delta t \). Using this updated state as the new initial condition, the MPC controller re-solves the trajectory optimization problem over a shifted prediction horizon \( [t_0 + \Delta t, t_0 + \Delta t + T] \).

This procedure is then repeated ad infinitum, or until the end of the overall control task. More concisely, here is pseudo-code showing the general blueprint of MPC methods; a direct code sketch of this loop follows the algorithm.

Algorithm 8.1 (Non-linear Model Predictive Control)

Input:

  • Prediction horizon \( T \)

  • Time step \( \Delta t \)

  • Initial state \( \mathbf{x}(t_0) \)

  • Cost functions \( c(\mathbf{x}(t), \mathbf{u}(t)) \) and \( c(\mathbf{x}(t_0 + T)) \)

  • System dynamics \( \dot{\mathbf{x}}(t) = \mathbf{f}(\mathbf{x}(t), \mathbf{u}(t)) \)

  • Constraints \( \mathbf{g}(\mathbf{x}(t), \mathbf{u}(t)) \leq \mathbf{0} \)

  • Control limits \( \mathbf{u}_{\text{min}} \leq \mathbf{u}(t) \leq \mathbf{u}_{\text{max}} \)

Output:

  • Control input sequence \( \{\mathbf{u}(t)\} \) applied to the system

Initialization:

  1. Set \( t \leftarrow t_0 \)

  2. Measure initial state \( \mathbf{x}_{\text{current}} \leftarrow \mathbf{x}(t) \)

Procedure:

  1. Repeat until the end of the control task:

    1. Solve the following optimization problem:

    \[\begin{split} \begin{aligned} \text{minimize} \quad & c(\mathbf{x}(t + T)) + \int_{t}^{t + T} c(\mathbf{x}(\tau), \mathbf{u}(\tau)) \, d\tau \\ \text{subject to} \quad & \dot{\mathbf{x}}(\tau) = \mathbf{f}(\mathbf{x}(\tau), \mathbf{u}(\tau)) \quad \forall \tau \in [t, t + T] \\ & \mathbf{g}(\mathbf{x}(\tau), \mathbf{u}(\tau)) \leq \mathbf{0} \quad \forall \tau \in [t, t + T] \\ & \mathbf{u}_{\text{min}} \leq \mathbf{u}(\tau) \leq \mathbf{u}_{\text{max}} \quad \forall \tau \in [t, t + T] \\ \text{given} \quad & \mathbf{x}(t) = \mathbf{x}_{\text{current}} \end{aligned} \end{split}\]
    • Obtain the optimal control trajectory \( \mathbf{u}^*(\tau) \) and the optimal state trajectory \( \mathbf{x}^*(\tau) \) for \( \tau \in [t, t + T] \).

    2. Apply the first control input:

    \[ \mathbf{u}(t) \leftarrow \mathbf{u}^*(t) \]
    • Apply \( \mathbf{u}(t) \) to the system.

    3. Advance the system state:

    • Let the system evolve under the control \( \mathbf{u}(t) \) according to the dynamics \( \dot{\mathbf{x}}(t) = \mathbf{f}(\mathbf{x}(t), \mathbf{u}(t)) \).

    • Update \( t \leftarrow t + \Delta t \).

    4. Measure the new state:

    \[ \mathbf{x}_{\text{current}} \leftarrow \mathbf{x}(t) \]
  2. End Repeat
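The blueprint above translates almost line for line into a single-shooting implementation. The following sketch is a minimal, hedged illustration rather than a reference implementation: it assumes user-supplied `dynamics(x, u)`, `stage_cost(x, u)`, and `terminal_cost(x)` callables (names chosen here only for illustration), a horizon discretized into `N` steps of length `dt`, forward-Euler integration, and `scipy.optimize.minimize` as the NLP solver.

import numpy as np
from scipy.optimize import minimize

def mpc_controller(x_current, dynamics, stage_cost, terminal_cost,
                   N, dt, u_min, u_max, u_guess=None):
    """Solve one receding-horizon problem; return the first input and the full sequence."""
    n_u = np.atleast_1d(u_min).size

    def rollout_cost(u_flat):
        # Single shooting: integrate the dynamics forward under the candidate controls
        u_seq = u_flat.reshape(N, n_u)
        x, total = np.array(x_current, dtype=float), 0.0
        for u in u_seq:
            total += stage_cost(x, u) * dt
            x = x + dt * dynamics(x, u)  # forward Euler step
        return total + terminal_cost(x)

    if u_guess is None:
        u_guess = np.zeros((N, n_u))
    bounds = list(zip(np.tile(np.atleast_1d(u_min), N), np.tile(np.atleast_1d(u_max), N)))
    sol = minimize(rollout_cost, np.ravel(u_guess), method="SLSQP", bounds=bounds)
    u_opt = sol.x.reshape(N, n_u)
    return u_opt[0], u_opt  # apply u_opt[0]; keep the rest as a warmstart for the next call

The closed loop is then obtained exactly as in steps 2 to 4 of the algorithm: apply the returned first input, let the (real or simulated) system evolve for \( \Delta t \), measure the new state, and call the function again.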

8.1. Example: Propofol Infusion Control#

This problem explores the control of propofol infusion in total intravenous anesthesia (TIVA). Our presentation follows the problem formulation developed by Sawaguchi et al. [39]. The primary objective is to maintain the desired level of unconsciousness while minimizing adverse reactions and ensuring quick recovery after surgery.

The level of unconsciousness is measured by the Bispectral Index (BIS), which is obtained using an electroencephalography (EEG) device. The BIS ranges from \(0\) (complete suppression of brain activity) to \(100\) (fully awake), with the target range for general anesthesia typically between \(40\) and \(60\).

The goal is to design a control system that regulates the infusion rate of propofol to maintain the BIS within the target range. This can be formulated as an optimal control problem:

\[\begin{split} \begin{align*} \min_{u(t)} & \int_{0}^{T} \left( BIS(t) - BIS_{\text{target}} \right)^2 + \lambda u(t)^2 \, dt \\ \text{subject to:} \\ \dot{x}_1 &= -(k_{10} + k_{12} + k_{13})x_1 + k_{21}x_2 + k_{31}x_3 + \frac{u(t)}{V_1} \\ \dot{x}_2 &= k_{12}x_1 - k_{21}x_2 \\ \dot{x}_3 &= k_{13}x_1 - k_{31}x_3 \\ \dot{x}_e &= k_{e0}(x_1 - x_e) \\ BIS(t) &= E_0 - E_{\text{max}}\frac{x_e^\gamma}{x_e^\gamma + EC_{50}^\gamma} \end{align*} \end{split}\]

Where:

  • \(u(t)\) is the propofol infusion rate (mg/kg/h)

  • \(x_1\), \(x_2\), and \(x_3\) are the drug concentrations in different body compartments

  • \(x_e\) is the effect-site concentration

  • \(k_{ij}\) are rate constants for drug transfer between compartments

  • \(BIS(t)\) is the Bispectral Index

  • \(\lambda\) is a regularization parameter penalizing excessive drug use

  • \(E_0\), \(E_{\text{max}}\), \(EC_{50}\), and \(\gamma\) are parameters of the pharmacodynamic model

The dynamics model used in this problem is the so-called pharmacokinetic-pharmacodynamic (PK/PD) model, which consists of three main components:

  1. Pharmacokinetic Model, which describes how the drug distributes through the body over time. It’s based on a three-compartment model:

    • Central compartment (blood and well-perfused organs)

    • Shallow peripheral compartment (muscle and other tissues)

    • Deep peripheral compartment (fat)

  2. Effect Site Model, which represents the delay between drug concentration in the blood and its effect on the brain.

  3. Pharmacodynamic Model that relates the effect-site concentration to the observed BIS.

The propofol infusion control problem presents several interesting challenges from a research perspective. First, there are delays: the drug takes time to distribute across the compartments, and the BIS measurement itself lags behind the effect-site concentration. These delays can lead to instability if not properly addressed in the control design.

Furthermore, every patient is different. Hence, we cannot simply learn a single controller offline and hope that it will generalize to an entire patient population. We account for this variability through Model Predictive Control (MPC), dynamically adapting to the model mismatch through replanning. The response to a given dose also varies between patients and must be carefully managed to avoid overdosing, which adds another layer of complexity: we have to incorporate safety constraints. Finally, the patient might suddenly change state, for example due to surgical stimuli, and the controller must adapt quickly to compensate for the disturbance to the system.

import numpy as np
from scipy.optimize import minimize
import matplotlib.pyplot as plt

class Patient:
    def __init__(self, age, weight):
        self.age = age
        self.weight = weight
        self.set_pk_params()
        self.set_pd_params()

    def set_pk_params(self):
        self.v1 = 4.27 * (self.weight / 70) ** 0.71 * (self.age / 30) ** (-0.39)
        self.v2 = 18.9 * (self.weight / 70) ** 0.64 * (self.age / 30) ** (-0.62)
        self.v3 = 238 * (self.weight / 70) ** 0.95
        self.cl1 = 1.89 * (self.weight / 70) ** 0.75 * (self.age / 30) ** (-0.25)
        self.cl2 = 1.29 * (self.weight / 70) ** 0.62
        self.cl3 = 0.836 * (self.weight / 70) ** 0.77
        self.k10 = self.cl1 / self.v1
        self.k12 = self.cl2 / self.v1
        self.k13 = self.cl3 / self.v1
        self.k21 = self.cl2 / self.v2
        self.k31 = self.cl3 / self.v3
        self.ke0 = 0.456

    def set_pd_params(self):
        self.E0 = 100
        self.Emax = 100
        self.EC50 = 3.4
        self.gamma = 3

def pk_model(x, u, patient):
    x1, x2, x3, xe = x
    dx1 = -(patient.k10 + patient.k12 + patient.k13) * x1 + patient.k21 * x2 + patient.k31 * x3 + u / patient.v1
    dx2 = patient.k12 * x1 - patient.k21 * x2
    dx3 = patient.k13 * x1 - patient.k31 * x3
    dxe = patient.ke0 * (x1 - xe)
    return np.array([dx1, dx2, dx3, dxe])

def pd_model(ce, patient):
    return patient.E0 - patient.Emax * (ce ** patient.gamma) / (ce ** patient.gamma + patient.EC50 ** patient.gamma)

def simulate_step(x, u, patient, dt):
    x_next = x + dt * pk_model(x, u, patient)
    bis = pd_model(x_next[3], patient)
    return x_next, bis

def objective(u, x0, patient, dt, N, target_bis):
    x = x0.copy()
    total_cost = 0
    for i in range(N):
        x, bis = simulate_step(x, u[i], patient, dt)
        total_cost += (bis - target_bis)**2 + 0.1 * u[i]**2
    return total_cost

def mpc_step(x0, patient, dt, N, target_bis):
    u0 = 10 * np.ones(N)  # Initial guess
    bounds = [(0, 20)] * N  # Infusion rate between 0 and 20 mg/kg/h
    
    result = minimize(objective, u0, args=(x0, patient, dt, N, target_bis),
                      method='SLSQP', bounds=bounds)
    
    return result.x[0]  # Return only the first control input

def run_mpc_simulation(patient, T, dt, N, target_bis):
    steps = int(T / dt)
    x = np.zeros((steps+1, 4))
    bis = np.zeros(steps+1)
    u = np.zeros(steps)
    
    for i in range(steps):
        # Add noise to the current state to simulate real-world uncertainty
        x_noisy = x[i] + np.random.normal(0, 0.01, size=4)
        
        # Use noisy state for MPC planning
        u[i] = mpc_step(x_noisy, patient, dt, N, target_bis)
        
        # Evolve the true state using the deterministic model
        x[i+1], bis[i] = simulate_step(x[i], u[i], patient, dt)
    
    bis[-1] = pd_model(x[-1, 3], patient)
    return x, bis, u

# Set up the problem
patient = Patient(age=40, weight=70)
T = 120  # Total time in minutes
dt = 0.5  # Time step in minutes
N = 20  # Prediction horizon
target_bis = 50  # Target BIS value

# Run MPC simulation
x, bis, u = run_mpc_simulation(patient, T, dt, N, target_bis)

# Plot results
t = np.arange(0, T+dt, dt)
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(10, 12), sharex=True)

ax1.plot(t, bis)
ax1.set_ylabel('BIS')
ax1.set_ylim(0, 100)
ax1.axhline(y=target_bis, color='r', linestyle='--')

ax2.plot(t[:-1], u)
ax2.set_ylabel('Infusion Rate (mg/kg/h)')

ax3.plot(t, x[:, 3])
ax3.set_ylabel('Effect-site Concentration (µg/mL)')
ax3.set_xlabel('Time (min)')

plt.tight_layout()
plt.show()

print(f"Initial BIS: {bis[0]:.2f}")
print(f"Final BIS: {bis[-1]:.2f}")
print(f"Mean infusion rate: {np.mean(u):.2f} mg/kg/h")
print(f"Final effect-site concentration: {x[-1, 3]:.2f} µg/mL")
[Figure: simulation results showing BIS, infusion rate (mg/kg/h), and effect-site concentration (µg/mL) over time under the MPC controller]
Initial BIS: 100.00
Final BIS: 50.16
Mean infusion rate: 8.85 mg/kg/h
Final effect-site concentration: 3.39 µg/mL

9. Computational Efficiency Improvements#

One challenge of Model Predictive Control (MPC) is its computational cost. In real-time applications, such as adaptive optics, the controller may need to operate at extremely high frequencies—for example, 1000 Hz. In this scenario, the solver has just 1 millisecond to compute an optimal solution, pushing the limits of computational efficiency.

9.1. Explicit MPC#

A potential solution to this problem is to offload some of the computational effort offline. Instead of solving the optimization problem at every time step during execution, we could attempt to precompute solutions for all potential states in advance. At first glance, this seems impractical—without leveraging specific structure or partitioning the state space intelligently, precomputing solutions for every possible state would be infeasible. However, with the right techniques, this approach becomes viable.

This is the essence of explicit MPC, which hinges on a subfield of mathematical programming known as multi-parametric (or simply parametric) programming. A multiparametric programming problem can be described by the following formulation:

\[\begin{split} \begin{array}{cl} z(\boldsymbol{\theta}) = \min_{\mathbf{x}} & f(\mathbf{x}, \boldsymbol{\theta}) \\ \text { s.t. } & \mathbf{g}(\mathbf{x}, \boldsymbol{\theta}) \leq 0 \\ & \mathbf{h}(\mathbf{x}, \boldsymbol{\theta}) = 0 \\ & \mathbf{x} \in \mathbb{R}^n \\ & \boldsymbol{\theta} \in \mathbb{R}^m \end{array} \end{split}\]

where:

  • \(\mathbf{x} \in \mathbb{R}^n\) are the decision variables,

  • \(\boldsymbol{\theta} \in \mathbb{R}^m\) are the parameters,

  • \(f(\mathbf{x}, \boldsymbol{\theta})\) is the objective function,

  • \(\mathbf{g}(\mathbf{x}, \boldsymbol{\theta}) \leq 0\) and \(\mathbf{h}(\mathbf{x}, \boldsymbol{\theta}) = 0\) are the inequality and equality constraints, respectively.

Parametric programming methods provide ways of efficiently evaluating \(z(\boldsymbol{\theta})\) – the value function – by leveraging the structure of the solution space. In particular, they exploit the fact that the parameter space can be partitioned into critical regions, i.e., regions where the optimal solution structure remains unchanged. Within each region, the solution can often be expressed as a piecewise affine function of the parameters, which is easy to represent and compute offline.

In trajectory optimization problems, the initial state \(\boldsymbol{x}_0\) can also be treated as a parameter. This transforms the problem into a parametric optimization problem, where \(\boldsymbol{x}_0\) defines a family of optimization problems, each yielding a different optimal solution. The relationship between the parameters and solutions can be described using two key mappings:

  • \(\boldsymbol{u}^\star(\boldsymbol{x}_0)\): The optimal control sequence as a function of the initial state.

  • \(v(\boldsymbol{x}_0)\): The value function, which gives the optimal objective value for a given \(\boldsymbol{x}_0\).

It is at this level that parametric programming comes into play: it provides efficient methods for computing these mappings offline, that is, without resorting to direct calls to a nonlinear programming solver for every new \(\boldsymbol{x}_0\) encountered along a trajectory.
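To make the idea of critical regions concrete, consider a toy parametric problem small enough to solve in closed form: \( z(\theta) = \min_x (x - \theta)^2 \) subject to \( -1 \leq x \leq 1 \). The optimizer is \( x^\star(\theta) = \operatorname{clip}(\theta, -1, 1) \), a piecewise affine function of the parameter, and the value function is piecewise quadratic; the three pieces are exactly the critical regions. The snippet below (an illustrative sketch, not tied to any explicit MPC toolbox) simply sweeps \( \theta \), solves each instance numerically, and prints the solution next to its closed form.

import numpy as np
from scipy.optimize import minimize_scalar

def solve_instance(theta):
    """min_x (x - theta)^2  subject to  -1 <= x <= 1."""
    res = minimize_scalar(lambda x: (x - theta) ** 2, bounds=(-1.0, 1.0), method="bounded")
    return res.x, res.fun

for theta in np.linspace(-3, 3, 13):
    x_star, z = solve_instance(theta)
    # Critical regions: theta < -1 (lower bound active), -1 <= theta <= 1 (unconstrained),
    # theta > 1 (upper bound active); x*(theta) is affine on each region.
    print(f"theta={theta:+.1f}  x*={x_star:+.4f}  closed form={np.clip(theta, -1, 1):+.4f}  z={z:.4f}")

Explicit MPC builds on the same structure: for linear dynamics with a quadratic cost and polyhedral constraints, the critical regions and the affine control law on each region are computed offline, and the online controller reduces to locating the current state’s region and evaluating an affine function.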

9.1.1. Amortized Optimization and Neural Networks#

The idea of solving an entire family of optimization problems efficiently is not unique to parametric programming. In machine learning, amortized optimization (or amortized inference) aims to “learn to optimize” by constructing models that generalize over a family of optimization problems. This approach is particularly relevant in applications such as hyperparameter optimization, meta-learning, and probabilistic inference.

In contrast to explicit MPC, which partitions the state space, amortized optimization typically uses neural networks to approximate the mapping from parameters to optimal solutions. This has led to recent explorations of amortizing NMPC (Nonlinear MPC) controllers into neural networks, blending the structure of MPC with the generalization power of neural networks. This represents a promising direction for creating efficient controllers that combine physics-based models, safety constraints, and the flexibility of learned models.
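As a rough illustration of the amortization idea (and only that: the toy problem, the sampling scheme, and the surrogate below are all placeholders of our own choosing), we can generate training pairs \((\mathbf{x}_0, \mathbf{u}^\star_0(\mathbf{x}_0))\) by calling a trajectory optimizer offline on sampled initial states, and then fit a cheap function approximator to the mapping. In a realistic setting, a neural network would play the role of the polynomial surrogate.

import numpy as np
from scipy.optimize import minimize

# Toy problem: scalar system x_{k+1} = x_k + dt * u_k with a quadratic cost
dt, N = 0.1, 20

def first_optimal_control(x0):
    """Solve the finite-horizon problem from x0 and return the first optimal input."""
    def cost(u_seq):
        x, total = x0, 0.0
        for u in u_seq:
            total += x ** 2 + 0.1 * u ** 2
            x = x + dt * u
        return total + 10.0 * x ** 2
    return minimize(cost, np.zeros(N), method="SLSQP").x[0]

# Offline phase: query the expensive solver on sampled initial states
x0_train = np.linspace(-2.0, 2.0, 21)
u_train = np.array([first_optimal_control(x0) for x0 in x0_train])

# Amortization: fit a cheap surrogate of the mapping x0 -> u*_0(x0)
surrogate = np.poly1d(np.polyfit(x0_train, u_train, deg=3))

# Online phase: the surrogate replaces the solver call
for x0 in (-1.3, 0.4, 1.7):
    print(f"x0={x0:+.2f}  solver: {first_optimal_control(x0):+.4f}  surrogate: {surrogate(x0):+.4f}")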

9.2. Warmstarting and Predictor-Corrector MPC#

Another way in which we can speed up NMPC is by providing good initial guesses to the solver. When solving a series of optimization problems along a trajectory, the solution to the previous problem is likely to be close to that of the current one. Hence, as a heuristic, it often makes sense to “warmstart” from the previous solution.

Another alternative is to extrapolate what the next solution ought to be based on the previous one. What we mean here is that rather than simply using the last solution as a guess for that of the current problem, we leverage the “sensitivity information” around the last solution to make a guess about where we might be going. This idea is reminiscent of predictor-corrector schemes which we have briefly discussed in the first chapter.

To implement a predictor-corrector MPC scheme, we need to understand how the optimal solution \(\mathbf{x}^*(\boldsymbol{\theta})\) and the value function \(z(\boldsymbol{\theta})\) change as the parameters \(\boldsymbol{\theta}\) vary. We achieve this by applying the implicit function theorem to the KKT (Karush-Kuhn-Tucker) conditions of the parametric problem. The KKT conditions are necessary conditions for optimality and can be written compactly as:

\[ \mathbf{F}(\mathbf{x}, \boldsymbol{\gamma}, \boldsymbol{\nu}, \boldsymbol{\theta}) = 0, \]

where \(\mathbf{F}\) encapsulates the stationarity, primal feasibility, dual feasibility, and complementary slackness conditions from the KKT theorem. By treating \(\mathbf{x}\), \(\boldsymbol{\gamma}\), and \(\boldsymbol{\nu}\) as functions of the parameters \(\boldsymbol{\theta}\), the implicit function theorem guarantees that, under certain regularity conditions, these optimal variables are continuously differentiable with respect to \(\boldsymbol{\theta}\). This allows us to compute sensitivity derivatives, which describe how small changes in \(\boldsymbol{\theta}\) affect the optimal solution.

By leveraging this sensitivity information, we can predict changes in the optimal solution and “warm-start” the optimization process at the next time step in MPC. This concept is related to numerical continuation, where a complex optimization problem is solved by gradually transforming a simpler, well-understood problem into the more difficult one.
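A minimal way to see the benefit of warmstarting, before bringing in any sensitivity information, is to reuse the previous solution, shifted by one step, as the initial guess for the next problem in the sequence. The sketch below reuses the toy single-shooting setup from the amortization example above and compares solver iteration counts with and without warmstarting; the numbers are only indicative, and the shift-and-repeat heuristic stands in for the full predictor step.

import numpy as np
from scipy.optimize import minimize

dt, N = 0.1, 20

def make_cost(x0):
    def cost(u_seq):
        x, total = x0, 0.0
        for u in u_seq:
            total += x ** 2 + 0.1 * u ** 2
            x = x + dt * u
        return total + 10.0 * x ** 2
    return cost

u_prev = np.zeros(N)
for x0 in np.linspace(1.5, 1.0, 6):  # states encountered along a trajectory
    cold = minimize(make_cost(x0), np.zeros(N), method="SLSQP")
    # Warmstart: shift the previous solution forward one step and repeat the last input
    warm_guess = np.append(u_prev[1:], u_prev[-1])
    warm = minimize(make_cost(x0), warm_guess, method="SLSQP")
    u_prev = warm.x
    print(f"x0={x0:.2f}  cold-start iterations: {cold.nit:3d}  warm-start iterations: {warm.nit:3d}")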

Unlike the methods we’ve discussed so far, dynamic programming takes a step back and considers not just a single optimization problem, but an entire family of related problems. This approach, while seemingly more complex at first glance, can often lead to efficient solutions.

Dynamic programming leverages the solution structure underlying many control problems, which allows them to be decomposed into smaller, more manageable subproblems. Each subproblem is itself an optimization problem, embedded within the larger whole. This recursive structure is the foundation upon which dynamic programming constructs its solutions.

To ground our discussion, let us return to the domain of discrete-time optimal control problems (DOCPs). These problems frequently arise from the discretization of continuous-time optimal control problems. While the focus here will be on deterministic problems, it is worth noting that these concepts extend naturally to stochastic problems by taking the expectation over the random quantities.

Consider a typical DOCP of Bolza type:

\[\begin{split} \begin{align*} \text{minimize} \quad & J \triangleq c_\mathrm{T}(\mathbf{x}_T) + \sum_{t=1}^{T-1} c_t(\mathbf{x}_t, \mathbf{u}_t) \\ \text{subject to} \quad & \mathbf{x}_{t+1} = \mathbf{f}_t(\mathbf{x}_t, \mathbf{u}_t), \quad t = 1, \ldots, T-1, \\ & \mathbf{u}_{lb} \leq \mathbf{u}_t \leq \mathbf{u}_{ub}, \quad t = 1, \ldots, T, \\ & \mathbf{x}_{lb} \leq \mathbf{x}_t \leq \mathbf{x}_{ub}, \quad t = 1, \ldots, T, \\ \text{given} \quad & \mathbf{x}_1 \end{align*} \end{split}\]

Rather than considering only the total cost from the initial time to the final time, dynamic programming introduces the concept of cost from an arbitrary point in time to the end. This leads to the definition of the “cost-to-go” or “value function” \(J_k(\mathbf{x}_k)\):

\[ J_k(\mathbf{x}_k) \triangleq c_\mathrm{T}(\mathbf{x}_T) + \sum_{t=k}^{T-1} c_t(\mathbf{x}_t, \mathbf{u}_t) \]

This function represents the total cost incurred from stage \(k\) onwards to the end of the time horizon, given that the system is initialized in state \(\mathbf{x}_k\) at stage \(k\). Suppose the problem has been solved from stage \(k+1\) to the end, yielding the optimal cost-to-go \(J_{k+1}^\star(\mathbf{x}_{k+1})\) for any state \(\mathbf{x}_{k+1}\) at stage \(k+1\). The question then becomes: how does this information inform the decision at stage \(k\)?

Given knowledge of the optimal behavior from \(k+1\) onwards, the task reduces to determining the optimal action \(\mathbf{u}_k\) at stage \(k\). This control should minimize the sum of the immediate cost \(c_k(\mathbf{x}_k, \mathbf{u}_k)\) and the optimal future cost \(J_{k+1}^\star(\mathbf{x}_{k+1})\), where \(\mathbf{x}_{k+1}\) is the resulting state after applying action \(\mathbf{u}_k\). Mathematically, this is expressed as:

\[ J_k^\star(\mathbf{x}_k) = \min_{\mathbf{u}_k} \left[ c_k(\mathbf{x}_k, \mathbf{u}_k) + J_{k+1}^\star(\mathbf{f}_k(\mathbf{x}_k, \mathbf{u}_k)) \right] \]

This equation is known as Bellman’s equation, named after Richard Bellman, who formulated the principle of optimality:

An optimal policy has the property that whatever the previous state and decision, the remaining decisions must constitute an optimal policy with regard to the state resulting from the previous decision.

In other words, any sub-path of an optimal path, from any intermediate point to the end, must itself be optimal.
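To see the recursion in action on something small enough to check by hand, the sketch below applies Bellman’s equation by brute force to a made-up problem with three states, three actions, and two decision stages. All of the dynamics and costs here are arbitrary; the point is only the mechanics of working backwards.

import numpy as np

# Tiny made-up instance: states {0, 1, 2}, actions {-1, 0, +1}, two decision stages
states, actions = [0, 1, 2], [-1, 0, 1]
f = lambda x, u: int(np.clip(x + u, 0, 2))   # dynamics, clipped to the state space
c = lambda x, u: x ** 2 + abs(u)             # stage cost
c_T = lambda x: x ** 2                       # terminal cost

J = {x: c_T(x) for x in states}              # initialize with J_T*
for k in (2, 1):                             # backward pass over the two stages
    J_new, mu = {}, {}
    for x in states:
        q = {u: c(x, u) + J[f(x, u)] for u in actions}   # Bellman right-hand side
        mu[x] = min(q, key=q.get)                        # optimal action at (x, k)
        J_new[x] = q[mu[x]]
    print(f"stage {k}:  J* = {J_new}   mu* = {mu}")
    J = J_new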

Dynamic programming can handle nonlinear systems and non-quadratic cost functions naturally. It provides a globally optimal solution, when one exists, and can incorporate state and control constraints with relative ease. However, as the dimension of the state space increases, this approach suffers from what Bellman termed the “curse of dimensionality.” The computational complexity and memory requirements grow exponentially with the state dimension, rendering direct application of dynamic programming intractable for high-dimensional problems.

Fortunately, learning-based methods offer efficient tools to combat the curse of dimensionality on two fronts: by using function approximation (e.g., neural networks) to avoid explicit discretization, and by leveraging randomization through Monte Carlo methods inherent in the learning paradigm. Most of this course is dedicated to those ideas.

9.3. Backward Recursion#

The principle of optimality provides a methodology for solving optimal control problems. Beginning at the final time horizon and working backwards, at each stage the local optimization problem given by Bellman’s equation is solved. This process, termed backward recursion or backward induction, constructs the optimal value function stage by stage.

Algorithm 9.1 (Backward Recursion for Dynamic Programming)

Input: Terminal cost function \(c_\mathrm{T}(\cdot)\), stage cost functions \(c_t(\cdot, \cdot)\), system dynamics \(f_t(\cdot, \cdot)\), time horizon \(\mathrm{T}\)

Output: Optimal value functions \(J_t^\star(\cdot)\) and optimal control policies \(\mu_t^\star(\cdot)\) for \(t = 1, \ldots, T\)

  1. Initialize \(J_T^\star(\mathbf{x}) = c_\mathrm{T}(\mathbf{x})\) for all \(\mathbf{x}\) in the state space

  2. For \(t = T-1, T-2, \ldots, 1\):

    1. For each state \(\mathbf{x}\) in the state space:

      1. Compute \(J_t^\star(\mathbf{x}) = \min_{\mathbf{u}} \left[ c_t(\mathbf{x}, \mathbf{u}) + J_{t+1}^\star(f_t(\mathbf{x}, \mathbf{u})) \right]\)

      2. Compute \(\mu_t^\star(\mathbf{x}) = \arg\min_{\mathbf{u}} \left[ c_t(\mathbf{x}, \mathbf{u}) + J_{t+1}^\star(f_t(\mathbf{x}, \mathbf{u})) \right]\)

    2. End For

  3. End For

  4. Return \(J_t^\star(\cdot)\), \(\mu_t^\star(\cdot)\) for \(t = 1, \ldots, T\)

Upon completion of this backward pass, we have access to the optimal control to take at any stage and in any state. Furthermore, we can simulate optimal trajectories from any initial state by applying the optimal policy at each stage.

9.3.1. Example: Optimal Harvest in Resource Management#

Dynamic programming is often used in resource management and conservation biology to devise policies to be implemented by decision makers and stakeholders, for example in fisheries or timber harvesting. Following [6], we consider a population of a particular species, whose abundance we denote by \(x_t\), where \(t\) represents discrete time steps. Our objective is to maximize the cumulative harvest over a finite time horizon, while also considering the long-term sustainability of the population. This optimization problem can be formulated as:

\[ \text{maximize} \quad \sum_{t=t_0}^{t_f} F(x_t \cdot h_t) + F_\mathrm{T}(x_{t_f}) \]

Here, \(F(\cdot)\) represents the immediate reward function associated with harvesting, \(h_t\) is the harvest rate at time \(t\), and \(F_\mathrm{T}(\cdot)\) denotes a terminal value function that could potentially assign value to the final population state. In this particular problem, we assign no terminal value to the final population state, setting \(F_\mathrm{T}(x_{t_f}) = 0\) and allowing us to focus solely on the cumulative harvest over the time horizon.

In our population model, the abundance \(x\) of the species ranges from 1 to 100 individuals. The decision variable is the harvest rate \(h\), which can take values from the set \(D = \{0, 0.1, 0.2, 0.3, 0.4, 0.5\}\). The population dynamics are governed by a modified logistic growth model:

\[ x_{t+1} = x_t + 0.3x_t(1 - x_t/125) - h_tx_t \]

where \(0.3\) is the growth rate and \(125\) is the carrying capacity (the maximum population size given the available resources). The logistic growth model returns continuous values; however, our DP formulation uses a discrete state space. Therefore, we also round the outcomes to the nearest grid point.

Applying the principle of optimality, we can express the optimal value function \(J^\star(x_t,t)\) recursively:

\[ J^\star(x_t, t) = \max_{h_t \in D} \left( F(x_t \cdot h_t) + J^\star(x_{t+1}, t+1) \right) \]

with the boundary condition \(J^\star(x_{t_f}, t_f) = 0\).

It’s worth noting that while this example uses a relatively simple model, the same principles can be applied to more complex scenarios involving stochasticity, multiple species interactions, or spatial heterogeneity.

import numpy as np

# Parameters
r_max = 0.3
K = 125
T = 20  # Number of time steps
N_max = 100  # Maximum population size to consider
h_max = 0.5  # Maximum harvest rate
h_step = 0.1  # Step size for harvest rate

# Create state and decision spaces
N_space = np.arange(1, N_max + 1)
h_space = np.arange(0, h_max + h_step, h_step)

# Initialize value function and policy
V = np.zeros((T + 1, len(N_space)))
policy = np.zeros((T, len(N_space)))

# Terminal value function (F_T)
def terminal_value(N):
    return 0

# State return function (F)
def state_return(N, h):
    return N * h

# State dynamics function
def state_dynamics(N, h):
    return N + r_max * N * (1 - N / K) - N * h

# Backward iteration
for t in range(T - 1, -1, -1):
    for i, N in enumerate(N_space):
        max_value = float('-inf')
        best_h = 0

        for h in h_space:
            if h > 1:  # Ensure harvest rate doesn't exceed 100%
                continue

            next_N = state_dynamics(N, h)
            if next_N < 1:  # Ensure population doesn't go extinct
                continue

            next_N_index = np.searchsorted(N_space, next_N)
            if next_N_index == len(N_space):
                next_N_index -= 1

            value = state_return(N, h) + V[t + 1, next_N_index]

            if value > max_value:
                max_value = value
                best_h = h

        V[t, i] = max_value
        policy[t, i] = best_h

# Function to simulate the optimal policy with conversion to Python floats
def simulate_optimal_policy(initial_N, T):
    trajectory = [float(initial_N)]  # Ensure first value is a Python float
    harvests = []

    for t in range(T):
        N = trajectory[-1]
        N_index = np.searchsorted(N_space, N)
        if N_index == len(N_space):
            N_index -= 1

        h = policy[t, N_index]
        harvests.append(float(N * h))  # Ensure harvest is a Python float

        next_N = state_dynamics(N, h)
        trajectory.append(float(next_N))  # Ensure next population value is a Python float

    return trajectory, harvests

# Example usage
initial_N = 50
trajectory, harvests = simulate_optimal_policy(initial_N, T)

print("Optimal policy:")
print(policy)
print("\nPopulation trajectory:", trajectory)
print("Harvests:", harvests)
print("Total harvest:", sum(harvests))
Optimal policy:
[[0.2 0.2 0.2 ... 0.4 0.4 0.5]
 [0.2 0.2 0.2 ... 0.4 0.4 0.5]
 [0.2 0.2 0.2 ... 0.4 0.4 0.4]
 ...
 [0.2 0.2 0.2 ... 0.5 0.5 0.5]
 [0.2 0.5 0.5 ... 0.5 0.5 0.5]
 [0.2 0.5 0.5 ... 0.5 0.5 0.5]]

Population trajectory: [50.0, 54.0, 63.2016, 53.614938617856, 62.80047226002128, 65.89520835342945, 62.063500827311884, 65.23169346891407, 61.5424456170318, 64.7610004703774, 61.171531280797, 64.42514256278633, 60.90621923290014, 52.003257133909514, 61.11382126799714, 64.37282756165249, 60.86484408994034, 39.80100508132969, 28.038916051902078, 20.544298889444192, 15.422475391094192]
Harvests: [5.0, 0.0, 18.960480000000004, 0.0, 6.280047226002129, 13.17904167068589, 6.206350082731189, 13.046338693782815, 6.15424456170318, 12.95220009407548, 6.1171531280797, 12.885028512557268, 18.271865769870047, 0.0, 6.111382126799715, 12.874565512330499, 30.43242204497017, 19.900502540664846, 14.019458025951039, 10.272149444722096]
Total harvest: 212.66322943492605

9.4. Discretization and Interpolation#

In many real-world problems, such as our resource management example, the state space is inherently continuous. However, the dynamic programming algorithm we’ve discussed operates on discrete state spaces. To bridge this gap, we have two main approaches: discretization and interpolation. In the previous example, we used a discretization method by rounding off values to the nearest grid point.

In our idealized models, we imagined population sizes as whole numbers—1 fish, 2 fish, 3 fish—but nature rarely conforms to such simplicity. What do you do when your survey reports 42.7 fish, for example? Without much explanation, our reflex in the previous example was to simply round things off. After all, what’s the harm in calling 42.7 fish just 43? This approach, known as discretization, is the simplest way to handle continuous states. It’s like overlaying a grid on a smooth landscape and only allowing yourself to stand at the intersections. In our demo code, we implemented this step via the numpy.searchsorted function.

Discretization is straightforward and allows you to apply dynamic programming algorithms directly. For many problems, it might even be sufficient. However, as you might imagine from the various time discretization schemes we’ve encountered in trajectory optimization, we can do better. Specifically, we want to address the following issues:

  1. Our model might make abrupt changes in decisions, even when the population barely shifts.

  2. We’re losing precision, especially when dealing with smaller population sizes where every individual counts.

  3. We might want to scale up and add more factors to our model—perhaps considering the population’s age structure or environmental variables. However, the curse of dimensionality might leave us needing an impossibly large number of grid points to maintain accuracy.

Rather than confining ourselves to grid intersections as we did with basic discretization, we can estimate the value between them via interpolation. When you encounter a state that doesn’t exactly match a grid point—like that population of 42.7 fish—you can estimate its value based on nearby points you’ve already calculated. In its simplest form, we could use linear interpolation. Intuitively, it’s like estimating the height of a hill between two surveyed points by drawing a straight line between them. Let’s formalize this approach in the context of the backward induction procedure.

9.4.1. Backward Recursion with Interpolation#

In a continuous state space, we don’t have \(J_{k+1}^\star(\mathbf{x}_{k+1})\) directly available for all possible \(\mathbf{x}_{k+1}\). Instead, we have \(J_{k+1}^\star(\mathbf{x})\) for a set of discrete grid points \(\mathbf{x} \in \mathcal{X}_\text{grid}\). We use interpolation to estimate \(J_{k+1}^\star(\mathbf{x}_{k+1})\) for any \(\mathbf{x}_{k+1}\) not in \(\mathcal{X}_\text{grid}\).

Let’s define an interpolation function \(I_{k+1}(\mathbf{x})\) that estimates \(J_{k+1}^\star(\mathbf{x})\) for any \(\mathbf{x}\) based on the known values at grid points. Then, we can express Bellman’s equation with interpolation as:

\[ J_k^\star(\mathbf{x}_k) = \min_{\mathbf{u}_k} \left[ c_k(\mathbf{x}_k, \mathbf{u}_k) + I_{k+1}(\mathbf{f}_k(\mathbf{x}_k, \mathbf{u}_k)) \right] \]

For linear interpolation in a one-dimensional state space, \(I_{k+1}(\mathbf{x})\) would be defined as:

\[ I_{k+1}(x) = J_{k+1}^\star(x_l) + \frac{x - x_l}{x_u - x_l} \left(J_{k+1}^\star(x_u) - J_{k+1}^\star(x_l)\right) \]

where \(x_l\) and \(x_u\) are the nearest lower and upper grid points to \(x\), respectively.
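For a one-dimensional grid, this formula is exactly what numpy.interp computes, which gives a convenient sanity check. In the snippet below, the grid values standing in for \(J_{k+1}^\star\) are arbitrary.

import numpy as np

x_grid = np.array([0.0, 1.0, 2.0, 3.0])   # grid points
J_grid = np.array([5.0, 3.0, 2.5, 4.0])   # stand-in values of J*_{k+1} at the grid points

def interp_manual(x):
    # Locate the bracketing grid points x_l <= x <= x_u and apply the formula above
    u = np.searchsorted(x_grid, x)
    l = u - 1
    return J_grid[l] + (x - x_grid[l]) / (x_grid[u] - x_grid[l]) * (J_grid[u] - J_grid[l])

x = 1.7
print(interp_manual(x), np.interp(x, x_grid, J_grid))   # both print 2.65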

Here’s a pseudo-code algorithm for backward recursion with interpolation:

Algorithm 9.2 (Backward Recursion with Interpolation for Dynamic Programming)

Input:

  • Terminal cost function \(c_\mathrm{T}(\cdot)\)

  • Stage cost functions \(c_t(\cdot, \cdot)\)

  • System dynamics \(f_t(\cdot, \cdot)\)

  • Time horizon \(\mathrm{T}\)

  • Grid of state points \(\mathcal{X}_\text{grid}\)

  • Set of possible actions \(\mathcal{U}\)

Output: Optimal value functions \(J_t^\star(\cdot)\) and optimal control policies \(\mu_t^\star(\cdot)\) for \(t = 1, \ldots, T\) at grid points

  1. Initialize \(J_T^\star(\mathbf{x}) = c_\mathrm{T}(\mathbf{x})\) for all \(\mathbf{x} \in \mathcal{X}_\text{grid}\)

  2. For \(t = T-1, T-2, \ldots, 1\):

    1. For each state \(\mathbf{x} \in \mathcal{X}_\text{grid}\):

      1. Initialize \(J_t^\star(\mathbf{x}) = \infty\) and \(\mu_t^\star(\mathbf{x}) = \text{None}\)

      2. For each action \(\mathbf{u} \in \mathcal{U}\):

        1. Compute next state \(\mathbf{x}_\text{next} = f_t(\mathbf{x}, \mathbf{u})\)

        2. Compute interpolated future cost \(J_\text{future} = I_{t+1}(\mathbf{x}_\text{next})\)

        3. Compute total cost \(J_\text{total} = c_t(\mathbf{x}, \mathbf{u}) + J_\text{future}\)

        4. If \(J_\text{total} < J_t^\star(\mathbf{x})\):

          1. Update \(J_t^\star(\mathbf{x}) = J_\text{total}\)

          2. Update \(\mu_t^\star(\mathbf{x}) = \mathbf{u}\)

    2. End For

  3. End For

  4. Return \(J_t^\star(\cdot)\), \(\mu_t^\star(\cdot)\) for \(t = 1, \ldots, T\)

The choice of interpolation method can significantly affect the accuracy of the solution. Linear interpolation is simple and often effective, but higher-order methods like cubic spline interpolation might provide better results in some problems. Furthermore, the layout and density of the grid points in \(\mathcal{X}_\text{grid}\) can impact both the accuracy and computational efficiency. A finer grid generally provides better accuracy but increases computational cost. To balance this tradeoff, you might consider techniques like adaptive grid refinement or function approximation methods instead of fixed grid-based interpolation. Special care may also be needed for states near the boundaries, where interpolation might not be possible in all directions.

While simple to implement, interpolation methods scale poorly in multi-dimensional spaces in terms of computational complexity. Techniques like multilinear interpolation with tensorized representations or more advanced methods like radial basis function interpolation might be necessary.
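For low-dimensional continuous states, multilinear interpolation over a tensor-product grid remains workable; SciPy’s RegularGridInterpolator implements it directly. The sketch below uses an arbitrary two-dimensional value table purely to show the mechanics of querying off-grid states during a Bellman backup.

import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Tensor-product grid over a hypothetical 2-D state (say, abundance and habitat quality)
x1_grid = np.linspace(1, 100, 25)
x2_grid = np.linspace(0, 1, 11)
# Stand-in value table J*_{k+1}(x1, x2); any array of shape (len(x1_grid), len(x2_grid)) works
J_grid = np.sqrt(x1_grid)[:, None] * (1.0 + x2_grid)[None, :]

interp = RegularGridInterpolator((x1_grid, x2_grid), J_grid,
                                 method="linear", bounds_error=False, fill_value=None)

# Multilinear estimates of J*_{k+1} at off-grid states encountered during the backup
states = np.array([[42.7, 0.33], [87.1, 0.90]])
print(interp(states))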

To better address this computational challenge, we will broaden our perspective through the lens of numerical approximation methods for solving functional operator equations. Polynomial interpolation is a form of approximation, with properties akin to generalization in machine learning. By building these connections, we will develop techniques capable of more robustly handling the curse of dimensionality by leveraging the generalization properties of machine learning models, and the “blessing of randomness” inherent in supervised learning and Monte Carlo methods.

9.4.1.1. Example: Optimal Harvest with Linear Interpolation#

Here is a demonstration of the backward recursion procedure using linear interpolation.

import numpy as np

# Parameters
r_max = 0.3
K = 125
T = 20  # Number of time steps
N_max = 100  # Maximum population size to consider
h_max = 0.5  # Maximum harvest rate
h_step = 0.1  # Step size for harvest rate

# Create state and decision spaces
N_space = np.arange(1, N_max + 1)
h_space = np.arange(0, h_max + h_step, h_step)

# Initialize value function and policy
V = np.zeros((T + 1, len(N_space)))
policy = np.zeros((T, len(N_space)))

# Terminal value function (F_T)
def terminal_value(N):
    return 0

# State return function (F)
def state_return(N, h):
    return N * h

# State dynamics function
def state_dynamics(N, h):
    return N + r_max * N * (1 - N / K) - N * h

# Function to linearly interpolate between grid points in N_space
def interpolate_value_function(V, N_space, next_N, t):
    if next_N <= N_space[0]:
        return V[t, 0]  # At or below the smallest grid point: clamp to its value
    if next_N >= N_space[-1]:
        return V[t, -1]  # At or above the largest grid point: clamp to its value
    
    # Find indices to interpolate between
    lower_idx = np.searchsorted(N_space, next_N) - 1
    upper_idx = lower_idx + 1
    
    # Linear interpolation
    N_lower = N_space[lower_idx]
    N_upper = N_space[upper_idx]
    weight = (next_N - N_lower) / (N_upper - N_lower)
    return (1 - weight) * V[t, lower_idx] + weight * V[t, upper_idx]

# Backward iteration with interpolation
for t in range(T - 1, -1, -1):
    for i, N in enumerate(N_space):
        max_value = float('-inf')
        best_h = 0
        
        for h in h_space:
            if h > 1:  # Ensure harvest rate doesn't exceed 100%
                continue
            
            next_N = state_dynamics(N, h)
            if next_N < 1:  # Ensure population doesn't go extinct
                continue
            
            # Interpolate value for next_N
            value = state_return(N, h) + interpolate_value_function(V, N_space, next_N, t + 1)
            
            if value > max_value:
                max_value = value
                best_h = h
        
        V[t, i] = max_value
        policy[t, i] = best_h

# Function to simulate the optimal policy using interpolation
def simulate_optimal_policy(initial_N, T):
    trajectory = [initial_N]
    harvests = []

    for t in range(T):
        N = trajectory[-1]
        
        # Interpolate optimal harvest rate
        if N <= N_space[0]:
            h = policy[t, 0]
        elif N >= N_space[-1]:
            h = policy[t, -1]
        else:
            lower_idx = np.searchsorted(N_space, N) - 1
            upper_idx = lower_idx + 1
            weight = (N - N_space[lower_idx]) / (N_space[upper_idx] - N_space[lower_idx])
            h = (1 - weight) * policy[t, lower_idx] + weight * policy[t, upper_idx]
        
        harvests.append(float(N * h))  # Ensure harvest is a Python float
        next_N = state_dynamics(N, h)
        trajectory.append(float(next_N))  # Ensure next population value is a Python float

    return trajectory, harvests

# Example usage
initial_N = 50
trajectory, harvests = simulate_optimal_policy(initial_N, T)

print("Optimal policy:")
print(policy)
print("\nPopulation trajectory:", trajectory)
print("Harvests:", harvests)
print("Total harvest:", sum(harvests))
Optimal policy:
[[0.  0.  0.  ... 0.4 0.4 0.4]
 [0.  0.  0.  ... 0.4 0.4 0.4]
 [0.  0.  0.  ... 0.4 0.4 0.4]
 ...
 [0.  0.  0.3 ... 0.5 0.5 0.5]
 [0.2 0.5 0.5 ... 0.5 0.5 0.5]
 [0.2 0.5 0.5 ... 0.5 0.5 0.5]]

Population trajectory: [50, 59.0, 62.445600000000006, 62.793456961535966, 60.906514028106535, 64.1847685511936, 60.71600257278426, 64.0117639631371, 60.5789261378371, 63.88717626457206, 60.48012279248407, 63.79731874379539, 60.40881570882111, 63.73243881376377, 60.3573056779798, 63.685556376683536, 60.32007179593332, 39.523630889226936, 27.8698229545787, 20.431713488016012, 15.34347899187751]
Harvests: [0.0, 5.9, 9.027135936000038, 11.26173625265758, 6.0906514028106535, 12.83695371023872, 6.071600257278426, 12.80235279262742, 6.057892613783711, 12.777435252914414, 6.0480122792484075, 12.759463748759078, 6.040881570882111, 12.746487762752755, 6.03573056779798, 12.737111275336709, 30.16003589796666, 19.761815444613468, 13.93491147728935, 10.215856744008006]
Total harvest: 213.2660649869655

Due to pedagogical considerations, this example uses our own implementation of the linear interpolation procedure. However, a more general and practical approach would be to use a built-in routine from SciPy. Because our state space has a single dimension, we can simply use scipy.interpolate.interp1d, which offers various interpolation methods through its kind argument: ‘linear’, ‘nearest’, ‘nearest-up’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘previous’, or ‘next’.

Here’s a more general implementation which here uses cubic interpolation through the scipy.interpolate.interp1d function:

import numpy as np
from scipy.interpolate import interp1d

# Parameters
r_max = 0.3
K = 125
T = 20  # Number of time steps
N_max = 100  # Maximum population size to consider
h_max = 0.5  # Maximum harvest rate
h_step = 0.1  # Step size for harvest rate

# Create state and decision spaces
N_space = np.arange(1, N_max + 1)
h_space = np.arange(0, h_max + h_step, h_step)

# Initialize value function and policy
V = np.zeros((T + 1, len(N_space)))
policy = np.zeros((T, len(N_space)))

# Terminal value function (F_T)
def terminal_value(N):
    return 0

# State return function (F)
def state_return(N, h):
    return N * h

# State dynamics function
def state_dynamics(N, h):
    return N + r_max * N * (1 - N / K) - N * h

# Function to create interpolation function for a given time step
def create_interpolator(V_t, N_space):
    return interp1d(N_space, V_t, kind='cubic', bounds_error=False, fill_value=(V_t[0], V_t[-1]))

# Backward iteration with interpolation
for t in range(T - 1, -1, -1):
    interpolator = create_interpolator(V[t+1], N_space)
    
    for i, N in enumerate(N_space):
        max_value = float('-inf')
        best_h = 0

        for h in h_space:
            if h > 1:  # Ensure harvest rate doesn't exceed 100%
                continue

            next_N = state_dynamics(N, h)
            if next_N < 1:  # Ensure population doesn't go extinct
                continue

            # Use interpolation to get the value for next_N
            value = state_return(N, h) + interpolator(next_N)

            if value > max_value:
                max_value = value
                best_h = h

        V[t, i] = max_value
        policy[t, i] = best_h

# Function to simulate the optimal policy using interpolation
def simulate_optimal_policy(initial_N, T):
    trajectory = [initial_N]
    harvests = []

    for t in range(T):
        N = trajectory[-1]
        
        # Create interpolator for the policy at time t
        policy_interpolator = interp1d(N_space, policy[t], kind='cubic', bounds_error=False, fill_value=(policy[t][0], policy[t][-1]))
        
        h = policy_interpolator(N)
        harvests.append(float(N * h))  # Ensure harvest is a Python float

        next_N = state_dynamics(N, h)
        trajectory.append(float(next_N))  # Ensure next population value is a Python float

    return trajectory, harvests

# Example usage
initial_N = 50
trajectory, harvests = simulate_optimal_policy(initial_N, T)

print("Optimal policy (first few rows):")
print(policy[:5])
print("\nPopulation trajectory:", trajectory)
print("Harvests:", harvests)
print("Total harvest:", sum(harvests))
Optimal policy (first few rows):
[[0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.1 0.1 0.1 0.1 0.1 0.1 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.3
  0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.4 0.4 0.4
  0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4]
 [0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.3
  0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4
  0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4]
 [0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.1 0.1 0.1 0.1 0.1 0.1 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.3
  0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.4 0.4 0.4
  0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4]
 [0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.3
  0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4
  0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4]
 [0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.1 0.1 0.1 0.1 0.1 0.1 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.3
  0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.4 0.4 0.4
  0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4]]

Population trajectory: [50, 59.0, 62.445600000000006, 62.855816819468515, 66.38501069094303, 62.46338144008508, 66.19082983826176, 62.307060290079974, 65.86630883298251, 62.0275406161329, 65.08602250342238, 61.40579635061663, 65.16431296091169, 61.453283417050585, 65.25607512917725, 61.51516182245858, 65.3615391678991, 42.035857104726354, 29.387853805711547, 21.437532761435143, 16.047063462998082]
Harvests: [3.3073317494565994e-20, 5.8999999999999995, 8.96477607806749, 5.84550227506383, 13.260405311492967, 5.647548383617885, 13.22607620843378, 5.815662115341462, 13.186571332467969, 6.31598238982395, 13.039176123074048, 5.613609913801768, 13.068992991332264, 5.569578810421298, 13.097683026436256, 5.526294879593217, 32.68102988779014, 21.017928552363177, 14.693926902855772, 10.718766380713353]
Total harvest: 213.18951156269063

10. Stochastic Dynamic Programming in Control Theory#

While our previous discussion centered on deterministic systems, many real-world problems involve uncertainty. Stochastic Dynamic Programming (SDP) extends our framework to handle stochasticity in both the objective function and system dynamics.

In the stochastic setting, our system evolution takes the form:

\[ \mathbf{x}_{t+1} = \mathbf{f}_t(\mathbf{x}_t, \mathbf{u}_t, \mathbf{w}_t) \]

Here, \(\mathbf{w}_t\) represents a random disturbance or noise term at time \(t\) due to the inherent uncertainty in the system’s behavior. The stage cost function may also incorporate stochastic influences:

\[ c_t(\mathbf{x}_t, \mathbf{u}_t, \mathbf{w}_t) \]

In this context, our objective shifts from minimizing a deterministic cost to minimizing the expected total cost:

\[ \mathbb{E}\left[c_\mathrm{T}(\mathbf{x}_T) + \sum_{t=1}^{T-1} c_t(\mathbf{x}_t, \mathbf{u}_t, \mathbf{w}_t)\right] \]

where the expectation is taken over the distributions of the random variables \(\mathbf{w}_t\). The principle of optimality still holds in the stochastic case, but Bellman’s optimality equation now involves an expectation:

\[ J_k^\star(\mathbf{x}_k) = \min_{\mathbf{u}_k} \mathbb{E}_{\mathbf{w}_k}\left[c_k(\mathbf{x}_k, \mathbf{u}_k, \mathbf{w}_k) + J_{k+1}^\star(\mathbf{f}_k(\mathbf{x}_k, \mathbf{u}_k, \mathbf{w}_k))\right] \]

In practice, this expectation is often computed by discretizing the distribution of \(\mathbf{w}_k\) when the set of possible disturbances is very large or even continuous. Let’s say we approximate the distribution with \(K\) discrete values \(\mathbf{w}_k^i\), each occurring with probability \(p_k^i\). Then our Bellman equation becomes:

\[ J_k^\star(\mathbf{x}_k) = \min_{\mathbf{u}_k} \sum_{i=1}^K p_k^i \left(c_k(\mathbf{x}_k, \mathbf{u}_k, \mathbf{w}_k^i) + J_{k+1}^\star(\mathbf{f}_k(\mathbf{x}_k, \mathbf{u}_k, \mathbf{w}_k^i))\right) \]
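In code, this discretized expectation is simply a probability-weighted sum over the disturbance values. The helper below is a generic sketch: `f`, `c`, and `J_next` are assumed to be user-supplied callables, and `w_values`, `w_probs` hold the discretized disturbance distribution.

import numpy as np

def stochastic_bellman_backup(x, controls, f, c, J_next, w_values, w_probs):
    """Return (J_k*(x), u_k*(x)) for the discretized stochastic Bellman equation."""
    best_value, best_u = np.inf, None
    for u in controls:
        # E_w[ c(x, u, w) + J_{k+1}*(f(x, u, w)) ] as a finite weighted sum
        expected = sum(p * (c(x, u, w) + J_next(f(x, u, w)))
                       for w, p in zip(w_values, w_probs))
        if expected < best_value:
            best_value, best_u = expected, u
    return best_value, best_u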

The backward recursion algorithm for SDP follows a similar structure to its deterministic counterpart, with the key difference being that we now have to compute expected values:

Algorithm 10.1 (Backward Recursion for Stochastic Dynamic Programming)

Input: Terminal cost function \(c_\mathrm{T}(\cdot)\), stage cost functions \(c_t(\cdot, \cdot, \cdot)\), system dynamics \(\mathbf{f}_t(\cdot, \cdot, \cdot)\), time horizon \(\mathrm{T}\), disturbance distributions

Output: Optimal value functions \(J_t^\star(\cdot)\) and optimal control policies \(\mu_t^\star(\cdot)\) for \(t = 1, \ldots, T\)

  1. Initialize \(J_T^\star(\mathbf{x}) = c_\mathrm{T}(\mathbf{x})\) for all \(\mathbf{x}\) in the state space

  2. For \(t = T-1, T-2, \ldots, 1\):

    1. For each state \(\mathbf{x}\) in the state space:

      1. Compute \(J_t^\star(\mathbf{x}) = \min_{\mathbf{u}} \mathbb{E}_{\mathbf{w}_t}\left[c_t(\mathbf{x}, \mathbf{u}, \mathbf{w}_t) + J_{t+1}^\star(\mathbf{f}_t(\mathbf{x}, \mathbf{u}, \mathbf{w}_t))\right]\)

      2. Compute \(\mu_t^\star(\mathbf{x}) = \arg\min_{\mathbf{u}} \mathbb{E}_{\mathbf{w}_t}\left[c_t(\mathbf{x}, \mathbf{u}, \mathbf{w}_t) + J_{t+1}^\star(\mathbf{f}_t(\mathbf{x}, \mathbf{u}, \mathbf{w}_t))\right]\)

    2. End For

  3. End For

  4. Return \(J_t^\star(\cdot)\), \(\mu_t^\star(\cdot)\) for \(t = 1, \ldots, T\)

While SDP provides us with a framework for handling uncertainty, it makes the curse of dimensionality even more difficult to handle in practice. Not only does the state space need to be discretized, but the disturbance space must now be discretized as well. This can lead to a combinatorial explosion in the number of scenarios to be evaluated at each stage.

However, just as we tackled the challenges of continuous state spaces with discretization and interpolation, we can devise efficient methods to handle the additional complexity of evaluating expectations. This problem essentially becomes one of numerical integration. When the set of disturbances is continuous (as is often the case with continuous state spaces), we enter a domain where numerical quadrature methods could be applied. But these methods tend to scale poorly as the number of dimensions grows. This is where more efficient techniques, often rooted in Monte Carlo methods, come into play. The combination of two key ingredients emerges to tackle the curse of dimensionality:

  1. Function approximation (through discretization, interpolation, neural networks, etc.)

  2. Monte Carlo integration (simulation)

These two elements essentially distill the key ingredients of machine learning, which is the direction we’ll be exploring in this course.
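As a small illustration of the second ingredient, the snippet below estimates one expectation of the kind that appears inside the stochastic Bellman backup, once with a weighted sum over a discretized disturbance and once by plain Monte Carlo sampling; the disturbance distribution and the integrand are arbitrary placeholders whose exact expectation happens to be 2.

import numpy as np

rng = np.random.default_rng(0)

# Placeholder integrand standing in for c(x, u, w) + J*_{k+1}(f(x, u, w)) as a function of w
def cost_plus_future(w):
    return (1.0 + w) ** 2

# Grid-based estimate: discretize w ~ N(0, 1) on [-3, 3] and weight by the normalized density
w_values = np.linspace(-3, 3, 61)
w_probs = np.exp(-0.5 * w_values ** 2)
w_probs /= w_probs.sum()
grid_estimate = np.sum(w_probs * cost_plus_future(w_values))

# Monte Carlo estimate: sample w ~ N(0, 1) and average
mc_estimate = cost_plus_future(rng.standard_normal(10_000)).mean()

print(f"grid: {grid_estimate:.3f}   Monte Carlo: {mc_estimate:.3f}   exact: 2.000")

In one dimension the weighted grid sum is perfectly adequate; the appeal of Monte Carlo is that its error decreases at a rate independent of the dimension of \(\mathbf{w}\), whereas the size of a tensor-product grid grows exponentially with it.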

10.1. Example: Stochastic Optimal Harvest in Resource Management#

Building upon our previous deterministic model, we now introduce stochasticity to more accurately reflect the uncertainties inherent in real-world resource management scenarios [6]. As before, we consider a population of a particular species, whose abundance we denote by \(x_t\), where \(t\) represents discrete time steps. Our objective remains to maximize the cumulative harvest over a finite time horizon, while also considering the long-term sustainability of the population. However, we now account for two sources of stochasticity: partial controllability of harvest and environmental variability affecting growth rates. The optimization problem can be formulated as:

\[ \text{maximize} \quad \mathbb{E}\left[\sum_{t=t_0}^{t_f} F(x_t \cdot h_t)\right] \]

Here, \(F(\cdot)\) represents the immediate reward function associated with harvesting, and \(h_t\) is the realized harvest rate at time \(t\). The expectation \(\mathbb{E}[\cdot]\) is taken over both the harvest and growth rates, which we view as random variables. In our stochastic model, the abundance \(x\) still ranges from 1 to 100 individuals. The decision variable is now the desired harvest rate \(d_t\), which can take values from the set \(D = \{0, 0.1, 0.2, 0.3, 0.4, 0.5\}\). However, the realized harvest rate \(h_t\) is stochastic and follows a discrete distribution:

\[\begin{split} h_t = \begin{cases} 0.75d_t & \text{with probability } 0.25 \\ d_t & \text{with probability } 0.5 \\ 1.25d_t & \text{with probability } 0.25 \end{cases} \end{split}\]

By expressing the harvest rate as a random variable, we mean to capture the fact that harvesting is not completely under our control: we might obtain more or less than we intended. Furthermore, we generalize the population dynamics to the stochastic case via:

\[x_{t+1} = x_t + r_tx_t(1 - x_t/K) - h_tx_t \]

where \(K = 125\) is the carrying capacity. The growth rate \(r_t\) is now stochastic and follows a discrete distribution:

\[\begin{split} r_t = \begin{cases} 0.85r_{\text{max}} & \text{with probability } 0.25 \\ 1.05r_{\text{max}} & \text{with probability } 0.5 \\ 1.15r_{\text{max}} & \text{with probability } 0.25 \end{cases} \end{split}\]

where \(r_{\text{max}} = 0.3\) is the maximum growth rate. Applying the principle of optimality, we can express the optimal value function \(J^\star(x_t, t)\) recursively:

\[ J^\star(x_t, t) = \max_{d_t \in D} \mathbb{E}\left[F(x_t \cdot h_t) + J^\star(x_{t+1}, t+1)\right] \]

where the expectation is taken over the harvest and growth rate random variables. The boundary condition remains \(J^\star(x_{t_f}, t_f) = 0\). We can now adapt our previous code to account for the stochasticity in our model. One important difference is that evaluating a solution in this context requires multiple realizations of the process. This is an important consideration when evaluating reinforcement learning methods in practice, as success cannot be claimed based on a single successful trajectory.

import numpy as np
from scipy.interpolate import interp1d

# Parameters
r_max = 0.3
K = 125
T = 30  # Number of time steps
N_max = 100  # Maximum population size to consider
h_max = 0.5  # Maximum harvest rate
h_step = 0.1  # Step size for harvest rate

# Create state and decision spaces
N_space = np.linspace(1, N_max, 100)  # Using more granular state space
h_space = np.arange(0, h_max + h_step, h_step)

# Stochastic parameters
h_outcomes = np.array([0.75, 1.0, 1.25])
h_probs = np.array([0.25, 0.5, 0.25])
r_outcomes = np.array([0.85, 1.05, 1.15]) * r_max
r_probs = np.array([0.25, 0.5, 0.25])

# Initialize value function and policy
V = np.zeros((T + 1, len(N_space)))
policy = np.zeros((T, len(N_space)))

# State return function (F)
def state_return(N, h):
    return N * h

# State dynamics function (stochastic)
def state_dynamics(N, h, r):
    return N + r * N * (1 - N / K) - h * N

# Function to create interpolation function for a given time step
def create_interpolator(V_t, N_space):
    return interp1d(N_space, V_t, kind='linear', bounds_error=False, fill_value=(V_t[0], V_t[-1]))

# Backward iteration with stochastic dynamics
for t in range(T - 1, -1, -1):
    interpolator = create_interpolator(V[t+1], N_space)
    
    for i, N in enumerate(N_space):
        max_value = float('-inf')
        best_h = 0

        for h in h_space:
            if h > 1:  # Ensure harvest rate doesn't exceed 100%
                continue

            expected_value = 0
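            # Average over the 3 x 3 joint outcomes of harvest and growth disturbances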
            for h_factor, h_prob in zip(h_outcomes, h_probs):
                for r_factor, r_prob in zip(r_outcomes, r_probs):
                    realized_h = h * h_factor
                    realized_r = r_factor

                    next_N = state_dynamics(N, realized_h, realized_r)
                    if next_N < 1:  # Exclude outcomes that drive the population below 1;
                        continue    # their probability mass is dropped from the expectation

                    # Use interpolation to get the value for next_N
                    value = state_return(N, realized_h) + interpolator(next_N)
                    expected_value += value * h_prob * r_prob

            if expected_value > max_value:
                max_value = expected_value
                best_h = h

        V[t, i] = max_value
        policy[t, i] = best_h

# Function to simulate the optimal policy using interpolation (stochastic version)
def simulate_optimal_policy(initial_N, T, num_simulations=100):
    all_trajectories = []
    all_harvests = []

    for _ in range(num_simulations):
        trajectory = [initial_N]
        harvests = []

        for t in range(T):
            N = trajectory[-1]
            
            # Create interpolator for the policy at time t
            policy_interpolator = interp1d(N_space, policy[t], kind='linear', bounds_error=False, fill_value=(policy[t][0], policy[t][-1]))
            
            intended_h = policy_interpolator(N)
            
            # Apply stochasticity
            h_factor = np.random.choice(h_outcomes, p=h_probs)
            r_factor = np.random.choice(r_outcomes, p=r_probs)
            
            realized_h = intended_h * h_factor
            harvests.append(N * realized_h)

            next_N = state_dynamics(N, realized_h, r_factor)
            trajectory.append(next_N)

        all_trajectories.append(trajectory)
        all_harvests.append(harvests)

    return all_trajectories, all_harvests

# Example usage
initial_N = 50
trajectories, harvests = simulate_optimal_policy(initial_N, T)

# Calculate average trajectory and total harvest
avg_trajectory = np.mean(trajectories, axis=0)
avg_total_harvest = np.mean([sum(h) for h in harvests])

print("Optimal policy (first few rows):")
print(policy[:5])
print("\nAverage population trajectory:", avg_trajectory)
print("Average total harvest:", avg_total_harvest)

# Plot results
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
plt.subplot(121)
for traj in trajectories[:20]:  # Plot first 20 trajectories
    plt.plot(range(T+1), traj, alpha=0.3)
plt.plot(range(T+1), avg_trajectory, 'r-', linewidth=2)
plt.title('Population Trajectories')
plt.xlabel('Time')
plt.ylabel('Population')

plt.subplot(122)
plt.hist([sum(h) for h in harvests], bins=20)
plt.title('Distribution of Total Harvest')
plt.xlabel('Total Harvest')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()
Optimal policy (first few rows):
[[0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.3
  0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.4 0.4
  0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4]
 [0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.3
  0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.4 0.4
  0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4]
 [0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.3
  0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.4 0.4
  0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4]
 [0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.3
  0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.4 0.4
  0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4]
 [0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.3
  0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.4 0.4
  0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4]]

Average population trajectory: [50.         59.378      63.02594219 62.07399048 62.38525635 62.88470217
 62.17465721 62.2248334  62.47938852 62.59676755 62.31105159 62.78535414
 62.49696249 62.49840737 62.72194781 62.03486706 62.81537474 62.76013195
 62.18232565 62.51755689 62.41471196 62.7422975  62.12139399 62.63093949
 62.37802382 62.16700238 61.75763962 42.27394412 30.04020103 21.33341045
 15.75001317]
Average total harvest: 314.7663220224656
Figure: sample population trajectories and their average (left); distribution of total harvest across the 100 simulations (right).