1. Discrete-Time Trajectory Optimization#
The sciences do not try to explain, they hardly even try to interpret, they mainly make models. By a model is meant a mathematical construct which, with the addition of certain verbal interpretations, describes observed phenomena. The justification of such a mathematical construct is solely and precisely that it is expected to work.
—John von Neumann
This course considers two broad categories of environments, each with its own specialized solution methods: deterministic and stochastic environments. Stochastic problems are mathematically more general than their deterministic counterparts. Despite this generality, however, algorithms designed for stochastic problems are not necessarily more powerful in practice than those designed for deterministic ones. We should keep in mind that stochasticity and determinism are assumed properties of the world, which we model, perhaps imperfectly, into our algorithms. In this course, we adopt a pragmatic perspective on this matter, and will make assumptions only to the extent that they help us design algorithms that are ultimately useful in practice: a lesson we have certainly learned from the success of deep learning methods in recent years. With this pragmatic stance, we start our journey with deterministic discrete-time models.
1.1. Irrigation Management Example#
Resource allocation problems, found across various fields of operations research, are a specific kind of deterministic discrete-time optimal control problem. For example, consider the problem of irrigation management posed by Hall and Butcher [20], in which a decision maker is tasked with finding the optimal amount of water to allocate throughout the various growth stages of a plant in order to maximize the yield. Clearly, allocating all the water in the first stage, thereby flooding the crop, would be a bad idea. Similarly, letting the crop dry out and only watering at the end is also suboptimal. Our solution, that is, a prescription for the amount of water to use at any point in time, should balance those two extremes. To achieve this goal, our system should ideally be informed by the expected growth dynamics of the given crops as a function of the water provided: a process which depends at the very least on physical properties known to influence the soil moisture. The basic model considered in that paper describes the evolution of the moisture content through the equation:
where \(\eta\) is a fixed “efficiency” constant determining the soil moisture response to irrigation, \(e_t\) is a known quantity summarizing the effects of water loss due to evaporation, and \(\phi_t\) represents the expected added moisture due to precipitation.
Furthermore, we should ensure that the total amount of water used throughout the season does not exceed the maximum allowed amount. To avoid situations where our crop receives too little or too much water, we further impose the condition that \(w_p \leq w_{t+1} \leq w_f\) for all stages, for some given values of the so-called permanent wilting percentage \(w_p\) and field capacity \(w_f\). The yield can reach a maximum value \(Y_{\max}\), and is reduced according to the water deficiencies incurred throughout the season. The authors make the assumption that such deficiencies interact multiplicatively across stages, so that the total yield is given by \(\left[\prod_{t=1}^N d_t(w_t)\right] Y_{\max}\). Due to the operating cost of watering operations (for example, energy consumption of the pumps, human labor, etc.), a more meaningful objective is to maximize \(\prod_{t=1}^N d_t\left(w_t\right) Y_{\max } - \sum_{t=1}^N c_t\left(u_t\right)\). The problem specification laid out above can be turned into the following mathematical program:
The multiplicative form of the yield term in the objective function has been eliminated by adding a new variable, \(a_t\), representing the accumulated product of water deficiencies since the beginning of the season.
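To make the structure concrete, here is a minimal sketch of one stage of this model in Python. It assumes the additive moisture balance described above, and the efficiency constant, the evaporation and precipitation sequences, and the deficiency response \(d_t\) are hypothetical placeholders rather than values from Hall and Butcher's paper.

```python
import numpy as np

def irrigation_step(w, q, a, u, t, eta, e, phi, d_t):
    """One stage of the simplified irrigation model: soil moisture w,
    remaining seasonal water allotment q, accumulated deficiency factor a."""
    w_next = w + eta * u - e[t] + phi[t]    # additive moisture balance (assumed form)
    q_next = q - u                          # water drawn from the seasonal allotment
    a_next = a * d_t(w, t)                  # multiplicative yield deficiency
    return w_next, q_next, a_next

# Hypothetical placeholder data for a 5-stage season
eta, N = 0.8, 5
e = np.full(N, 0.10)                        # evaporation losses
phi = np.full(N, 0.05)                      # expected precipitation
d_t = lambda w, t: np.clip(w, 0.0, 1.0)     # stand-in deficiency response

w, q, a = 0.5, 1.0, 1.0
for t in range(N):
    w, q, a = irrigation_step(w, q, a, u=0.2, t=t, eta=eta, e=e, phi=phi, d_t=d_t)
print(w, q, a)
```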
Clearly, this model is a simplification of the real phenomena at play: the physical process of water absorption by a plant through its root system. Many more aspects of the world would have to be included for a more faithful reproduction of the real process: for example, real-time meteorological data, pressure, soil type, solar irradiance, shading and topography of the terrain, etc. We could even go to the level of modelling the inner workings of the plant itself to understand exactly how much water will get absorbed. More crucially, our assumption that the water absorption takes place instantaneously at discrete points in time is certainly not true. So should we go back to the drawing board and consider a better model? The answer is that “it depends”. It depends on the adequacy of the solution when deployed in the real world, or on whether it helped provide insights to the user. Put simply: is the system useful to those who interact with it? Answering this question requires us to think more broadly about our system and how it will interact with the end users and society.
2. Mathematical Programming Formulation#
The three quantities \((w_t, q_t, a_t)\) appearing in the model of Hall and Butcher [20] above have the property that they encompass all of the information necessary to simulate the process. We say that together they form the “state”, whose time evolution is specified via a so-called dynamics function, which we commonly denote by \(f_t\). In discrete-time systems, the dynamics are often described by “difference equations,” as opposed to the “differential equations” used for continuous-time systems. When the dynamics function depends on the time index \(t\), we refer to it as “non-stationary” dynamics. Conversely, if the function \(f\) remains constant across all time steps, we call it “stationary” dynamics. In the context of optimal control theory, we refer more generally to \(u_t\) as a “control” variable, while in other communities it is called an “action”. Whether the problem is posed as a minimization or a maximization problem is also a matter of community convention, with control theory typically posing problems in terms of cost minimization, while the OR and RL communities usually adopt a reward maximization perspective. In this course, we will alternate between the two equivalent formulations while ensuring that the context is sufficiently clear to understand which one is used.
The problem stated above is known as a Discrete-Time Optimal Control Problem (DOCP), which we write more generically as:
(Discrete-Time Optimal Control Problem of Bolza Type)
The objective function (sometimes called a “performance index”) in a Bolza problem comprises two terms: the sum of immediate costs or rewards per stage, and a terminal cost function (sometimes called the “scrap value”). If the terminal cost function is omitted from the objective function, the resulting DOCP is said to be of Lagrange type.
(Discrete-Time Optimal Control Problem of Lagrange Type)
Finally, a Mayer problem is one in which the objective function only contains a terminal cost function without explicitly accounting for immediate costs encountered across stages:
(Discrete-Time Optimal Control Problem of Mayer Type)
When writing the optimal control problem in any of those three forms, it is implied that both the controls \(u_1, ..., u_T\) and the state trajectory \(x_1, ..., x_T\) are optimization variables. Since we ultimately care about the decisions themselves, the idea of posing the states themselves as optimization variables seems unnecessary given that we have access to the dynamics. We will soon see that there indeed exists a way to get rid of the state variables and their associated constraints through a process of explicit simulation with the class of “shooting methods”, thereby turning what would otherwise be a constrained optimization problem into an unconstrained one.
2.1. Reduction to Mayer Problems#
While it might appear at first glance that Bolza problems are somehow more expressive or powerful, we can show that both Lagrange and Bolza problems can be reduced to a Mayer problem through the idea of “state augmentation”. The overall idea is that the explicit sum of costs can be eliminated by maintaining a running sum of costs as an additional state variable \(y_t\). The augmented system equation then becomes:
where \(\boldsymbol{\tilde{x}}_t \triangleq (\mathbf{x}_t, y_t)\) is the concatenation of the running cost so far and the underlying state of the original system. The terminal cost function in the Bolza-to-Mayer transformation is computed with:
This transformation is often helpful to simplify mathematical derivations (as we are about to see shortly) but can also be used to streamline algorithmic implementations (by maintaining one version of the code rather than three with many if/else statements). That being said, there can also be computational advantages to working with the specific problem types rather than the one-size-fits-all Mayer reduction.
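To make the reduction concrete, here is a minimal sketch of the state augmentation in Python (using JAX arrays, consistent with the code later in this chapter). The helper name is hypothetical; `f_t`, `c_t`, and `c_T` stand for the dynamics, stage cost, and terminal cost of the original Bolza problem.

```python
import jax.numpy as jnp

def bolza_to_mayer(f_t, c_t, c_T):
    """Append a running-cost variable y_t to the state so that the
    transformed problem only has a terminal cost (Mayer form)."""
    def f_tilde(x_tilde, u, t):
        x, y = x_tilde[:-1], x_tilde[-1]
        x_next = f_t(x, u, t)
        y_next = y + c_t(x, u, t)                  # accumulate the stage cost
        return jnp.concatenate([x_next, jnp.array([y_next])])

    def c_T_tilde(x_tilde):
        x, y = x_tilde[:-1], x_tilde[-1]
        return y + c_T(x)                          # terminal cost of the Mayer problem

    return f_tilde, c_T_tilde
```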
3. Numerical Methods for Solving DOCPs#
Let’s assume that an optimal control problem has been formulated in one of the forms presented earlier and has been given to us to solve. The following section explores numerical solutions applicable to these problems, focusing on trajectory optimization. Our goal is to output an optimal control (and state trajectory) based on the given cost function and dynamics structure. It’s important to note that the methods presented here are not learning methods just yet; they don’t involve ingesting data or inferring unknown quantities from it. However, these methods represent a central component of any decision-learning system, and we will later explore how learning concepts can be incorporated.
Before delving into the solution methods, let’s consider an electric vehicle energy management problem, which we will use as a test bed throughout this section. Consider an electric vehicle traversing a planned route, where we aim to optimize its energy consumption over a 20-minute journey. Our simplified model represents the vehicle’s state using two variables: \(x_1\), the battery state of charge as a percentage, and \(x_2\), the vehicle’s speed in meters per second. The control input \(u\), ranging from -1 to 1, represents the motor power, with negative values indicating regenerative braking and positive values representing acceleration. The problem can be formally expressed as a mathematical program in Bolza form:
The system dynamics, represented by \(f_t(x_t, u_t)\), describe how the battery charge and vehicle speed evolve based on the current state and control input. The initial condition \(x_1 = [1, 0]^T\) indicates that the vehicle starts with a fully charged battery and zero initial speed. The constraints \(-1 \leq u_t \leq 1\) and \(-5 \leq x_{t,1}, x_{t,2} \leq 5\) ensure that the control inputs and state variables remain within acceptable ranges throughout the journey. This model is of course highly simplistic and neglects the nonlinear nature of battery discharge as well as the effects of air resistance, road grade, and vehicle mass on the vehicle’s motion. Furthermore, our model ignores the effect of environmental factors like wind and temperature on regenerative braking. Route-specific information such as elevation changes and speed limits is absent, as is the consideration of auxiliary power consumption such as heating and entertainment. These are all possible improvements to our model which we ignore for the moment for the sake of simplicity.
3.1. Single Shooting Methods#
Given access to an unconstrained optimization solver, the easiest method to implement is by far what is known as “single shooting” in control theory. The idea is simple: rather than solving for the state variables through equality constraints, we transform the original constrained problem into an unconstrained one through “simulation”, i.e., by recursively computing the evolution of our system for any given set of controls and initial state. In the deterministic setting, given an initial state, we can always exactly reconstruct the resulting sequence of states by “rolling out” our model, a process which some communities would refer to as “time marching”. Mathematically, this amounts to forming the following unconstrained program:
To implement this transformation, we construct a set of helper functions \(\boldsymbol{\phi}_1, ..., \boldsymbol{\phi}_T\) whose role is to compute the state at any time \(t=1, ..., T\) resulting from applying the sequence of controls starting from the initial state. We can define those functions recursively as
(Naive Single Shooting: re-computation/checkpointing)
Inputs Initial state \(\mathbf{x}_1\), time horizon \(T\), control bounds \(\mathbf{u}_{lb}\) and \(\mathbf{u}_{ub}\), state transition functions \(\mathbf{f}_t\), cost functions \(c_t\)
Output Optimal control sequence \(\mathbf{u}^*_{1:T-1}\)
Initialize \(\mathbf{u}_{1:T-1}\) within bounds \([\mathbf{u}_{lb}, \mathbf{u}_{ub}]\)
Define \(\boldsymbol{\phi}_t(\mathbf{u}_{1:T-1}, \mathbf{x}_1)\) for \(t = 1, ..., T\):
If \(t = 1\):
Return \(\mathbf{x}_1\)
Else:
\(\mathbf{x} \leftarrow \mathbf{x}_1\)
For \(i = 1\) to \(t-1\):
\(\mathbf{x} \leftarrow \mathbf{f}_{i}(\mathbf{x}, \mathbf{u}_{i})\)
Return \(\mathbf{x}\)
Define objective function \(J(\mathbf{u}_{1:T-1})\):
\(J \leftarrow c_T(\boldsymbol{\phi}_T(\mathbf{u}_{1:T-1}, \mathbf{x}_1))\)
For \(t = 1\) to \(T-1\):
\(J \leftarrow J + c_t(\boldsymbol{\phi}_t(\mathbf{u}_{1:T-1}, \mathbf{x}_1), \mathbf{u}_t)\)
Return \(J\)
Solve optimization problem: \(\mathbf{u}^*_{1:T-1} \leftarrow \arg\min_{\mathbf{u}_{1:T-1}} J(\mathbf{u}_{1:T-1})\) subject to \(\mathbf{u}_{lb} \leq \mathbf{u}_t \leq \mathbf{u}_{ub}, \, t=1,\ldots,T-1\)
Return \(\mathbf{u}^*_{1:T-1}\)
Show code cell content
import jax
import jax.numpy as jnp
from jax import grad, jit
from jax.example_libraries import optimizers
import matplotlib.pyplot as plt
def single_shooting_ev_optimization(T=20, num_iterations=1000, step_size=0.01):
"""
Implements the single shooting method for the electric vehicle energy optimization problem.
Args:
T: time horizon
num_iterations: number of optimization iterations
step_size: step size for the optimizer
Returns:
optimal_u: optimal control sequence
"""
def f(x, u, t):
return jnp.array([
x[0] + 0.1 * x[1] + 0.05 * u,
x[1] + 0.1 * u
])
def c(x, u, t):
if t == T:
return x[0]**2 + x[1]**2
else:
return 0.1 * (x[0]**2 + x[1]**2 + u**2)
def phi(u, x1, t):
x = x1
for k in range(1, t+1):
if k > 1:
x = f(x, u[k-2], k-1)
return x
def objective(u):
total_cost = 0
for t in range(1, T):
x_t = phi(u, x1, t)
total_cost += c(x_t, u[t-1], t)
x_T = phi(u, x1, T)
total_cost += c(x_T, 0.0, T) # No control at final step
return total_cost
def clip_controls(u):
return jnp.clip(u, -1.0, 1.0)
x1 = jnp.array([1.0, 0.0]) # Initial state: full battery, zero speed
# Initialize controls
u_init = jnp.zeros(T-1)
# Setup optimizer
optimizer = optimizers.adam(step_size)
opt_init, opt_update, get_params = optimizer
opt_state = opt_init(u_init)
@jit
def step(i, opt_state):
u = get_params(opt_state)
value, grads = jax.value_and_grad(objective)(u)
opt_state = opt_update(i, grads, opt_state)
u = get_params(opt_state)
u = clip_controls(u)
opt_state = opt_init(u)
return value, opt_state
# Run optimization
for i in range(num_iterations):
value, opt_state = step(i, opt_state)
if i % 100 == 0:
print(f"Iteration {i}, Cost: {value}")
optimal_u = get_params(opt_state)
return optimal_u
def plot_results(optimal_u, T):
# Compute state trajectory
x1 = jnp.array([1.0, 0.0])
x_trajectory = [x1]
for t in range(T-1):
x_next = jnp.array([
x_trajectory[-1][0] + 0.1 * x_trajectory[-1][1] + 0.05 * optimal_u[t],
x_trajectory[-1][1] + 0.1 * optimal_u[t]
])
x_trajectory.append(x_next)
x_trajectory = jnp.array(x_trajectory)
time = jnp.arange(T)
plt.figure(figsize=(12, 8))
plt.subplot(2, 1, 1)
plt.plot(time, x_trajectory[:, 0], label='Battery State of Charge')
plt.plot(time, x_trajectory[:, 1], label='Vehicle Speed')
plt.xlabel('Time Step')
plt.ylabel('State Value')
plt.title('Optimal State Trajectories')
plt.legend()
plt.grid(True)
plt.subplot(2, 1, 2)
plt.plot(time[:-1], optimal_u, label='Motor Power Input')
plt.xlabel('Time Step')
plt.ylabel('Control Input')
plt.title('Optimal Control Inputs')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
# Run the optimization
optimal_u = single_shooting_ev_optimization()
print("Optimal control sequence:", optimal_u)
# Plot the results
plot_results(optimal_u, T=20)
Iteration 0, Cost: 2.9000003337860107
Iteration 100, Cost: 1.4151430130004883
Iteration 200, Cost: 1.4088482856750488
Iteration 300, Cost: 1.409121036529541
Iteration 400, Cost: 1.4093767404556274
Iteration 500, Cost: 1.4096131324768066
Iteration 600, Cost: 1.4098304510116577
Iteration 700, Cost: 1.4100292921066284
Iteration 800, Cost: 1.410210371017456
Iteration 900, Cost: 1.410375714302063
Optimal control sequence: [-0.84949356 -0.76774836 -0.6298094 -0.4975385 -0.38879833 -0.3057558
-0.2055591 -0.1496703 -0.07521398 -0.0221795 0.01586018 0.05557606
0.09427628 0.11390017 0.13218477 0.1561778 0.1723089 0.17151853
0.18140294]
The approach outlined in Algorithm 3.1 stems directly from the mathematical definition and involves recomputing the sequence of states from the beginning every time the instantaneous cost function along the trajectory needs to be evaluated. This implementation has the benefit that it requires very little storage, as the only quantity we have to maintain in addition to the running cost is the last state. However, this simplicity and storage saving come at a steep computational cost, as they require re-computing the trajectory up to any given stage starting from the initial state.
A more practical and efficient implementation combines trajectory unrolling with cost accumulation. This process can be realized through a simple for-loop in frameworks like JAX, which can trace code execution through control flow. Alternatively, a more efficient scan operation could be employed. By simultaneously computing the trajectory and summing costs, we eliminate redundant calculations, effectively trading increased storage for reduced computation, in contrast with the recomputation (checkpointing-like) strategy of the naive variant above.
(Single Shooting: Trajectory Storage)
Inputs Initial state \(\mathbf{x}_1\), time horizon \(T\), control bounds \(\mathbf{u}_{lb}\) and \(\mathbf{u}_{ub}\), state transition functions \(\mathbf{f}_t\), cost functions \(c_t\)
Output Optimal control sequence \(\mathbf{u}^*_{1:T-1}\)
Initialize \(\mathbf{u}_{1:T-1}\) within bounds \([\mathbf{u}_{lb}, \mathbf{u}_{ub}]\)
Define function ComputeTrajectoryAndCost(\(\mathbf{u}_{1:T-1}, \mathbf{x}_1\)):
Initialize \(\mathbf{x} \leftarrow [\mathbf{x}_1]\) // List to store states
Initialize \(J \leftarrow 0\) // Total cost
For \(t = 1\) to \(T-1\):
\(J \leftarrow J + c_t(\mathbf{x}[t], \mathbf{u}_t)\)
\(\mathbf{x}_{\text{next}} \leftarrow \mathbf{f}_t(\mathbf{x}[t], \mathbf{u}_t)\)
Append \(\mathbf{x}_{\text{next}}\) to \(\mathbf{x}\)
\(J \leftarrow J + c_T(\mathbf{x}[T])\) // Add final state cost
Return \(\mathbf{x}, J\)
Define objective function \(J(\mathbf{u}_{1:T-1})\):
\(\_, J \leftarrow\) ComputeTrajectoryAndCost(\(\mathbf{u}_{1:T-1}, \mathbf{x}_1\))
Return \(J\)
Solve optimization problem: \(\mathbf{u}^*_{1:T-1} \leftarrow \arg\min_{\mathbf{u}_{1:T-1}} J(\mathbf{u}_{1:T-1})\) subject to \(\mathbf{u}_{lb} \leq \mathbf{u}_t \leq \mathbf{u}_{ub}, \, t=1,\ldots,T-1\)
Return \(\mathbf{u}^*_{1:T-1}\)
Show code demonstration
import jax
import jax.numpy as jnp
from jax import grad, jit
from jax.example_libraries import optimizers
import matplotlib.pyplot as plt
def single_shooting_ev_optimization(T=20, num_iterations=1000, step_size=0.01):
"""
Implements the single shooting method for the electric vehicle energy optimization problem.
Args:
T: time horizon
num_iterations: number of optimization iterations
step_size: step size for the optimizer
Returns:
optimal_u: optimal control sequence
"""
def f(x, u, t):
return jnp.array([
x[0] + 0.1 * x[1] + 0.05 * u,
x[1] + 0.1 * u
])
def c(x, u, t):
if t == T:
return x[0]**2 + x[1]**2
else:
return 0.1 * (x[0]**2 + x[1]**2 + u**2)
def compute_trajectory_and_cost(u, x1):
x = x1
total_cost = 0
for t in range(1, T):
total_cost += c(x, u[t-1], t)
x = f(x, u[t-1], t)
total_cost += c(x, 0.0, T) # No control at final step
return total_cost
def objective(u):
return compute_trajectory_and_cost(u, x1)
def clip_controls(u):
return jnp.clip(u, -1.0, 1.0)
x1 = jnp.array([1.0, 0.0]) # Initial state: full battery, zero speed
# Initialize controls
u_init = jnp.zeros(T-1)
# Setup optimizer
optimizer = optimizers.adam(step_size)
opt_init, opt_update, get_params = optimizer
opt_state = opt_init(u_init)
@jit
def step(i, opt_state):
u = get_params(opt_state)
value, grads = jax.value_and_grad(objective)(u)
opt_state = opt_update(i, grads, opt_state)
u = get_params(opt_state)
u = clip_controls(u)
opt_state = opt_init(u)
return value, opt_state
# Run optimization
for i in range(num_iterations):
value, opt_state = step(i, opt_state)
if i % 100 == 0:
print(f"Iteration {i}, Cost: {value}")
optimal_u = get_params(opt_state)
return optimal_u
def plot_results(optimal_u, T):
# Compute state trajectory
x1 = jnp.array([1.0, 0.0])
x_trajectory = [x1]
for t in range(T-1):
x_next = jnp.array([
x_trajectory[-1][0] + 0.1 * x_trajectory[-1][1] + 0.05 * optimal_u[t],
x_trajectory[-1][1] + 0.1 * optimal_u[t]
])
x_trajectory.append(x_next)
x_trajectory = jnp.array(x_trajectory)
time = jnp.arange(T)
plt.figure(figsize=(12, 8))
plt.subplot(2, 1, 1)
plt.plot(time, x_trajectory[:, 0], label='Battery State of Charge')
plt.plot(time, x_trajectory[:, 1], label='Vehicle Speed')
plt.xlabel('Time Step')
plt.ylabel('State Value')
plt.title('Optimal State Trajectories')
plt.legend()
plt.grid(True)
plt.subplot(2, 1, 2)
plt.plot(time[:-1], optimal_u, label='Motor Power Input')
plt.xlabel('Time Step')
plt.ylabel('Control Input')
plt.title('Optimal Control Inputs')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
# Run the optimization
optimal_u = single_shooting_ev_optimization()
print("Optimal control sequence:", optimal_u)
# Plot the results
plot_results(optimal_u, T=20)
Iteration 0, Cost: 2.9000003337860107
Iteration 100, Cost: 1.4151430130004883
Iteration 200, Cost: 1.4088490009307861
Iteration 300, Cost: 1.4091218709945679
Iteration 400, Cost: 1.4093778133392334
Iteration 500, Cost: 1.4096143245697021
Iteration 600, Cost: 1.4098315238952637
Iteration 700, Cost: 1.410030484199524
Iteration 800, Cost: 1.4102118015289307
Iteration 900, Cost: 1.4103769063949585
Optimal control sequence: [-0.84949297 -0.7677483 -0.62980956 -0.4975385 -0.3887983 -0.3057558
-0.20555909 -0.14967027 -0.0752141 -0.0221795 0.01586002 0.05557553
0.09427629 0.11389964 0.13218471 0.15617755 0.17241438 0.17151856
0.18140264]
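As noted above, the Python for-loop that rolls out the trajectory and accumulates the cost can also be expressed with jax.lax.scan, which traces the loop body once instead of unrolling it. The following sketch assumes time-invariant dynamics and passes the stage and terminal costs as separate functions, so that no Python branching on the (traced) time index is needed inside the scan; the simplified EV model from the code above is reused for illustration.

```python
import jax
import jax.numpy as jnp

def make_scan_objective(f, stage_cost, terminal_cost, x1):
    """Single-shooting objective where the rollout and the cost accumulation
    are fused into one jax.lax.scan over the control sequence."""
    def step(carry, u):
        x, total = carry
        total = total + stage_cost(x, u)
        x = f(x, u)
        return (x, total), None

    def objective(u_seq):
        (x_T, total), _ = jax.lax.scan(step, (x1, jnp.zeros(())), u_seq)
        return total + terminal_cost(x_T)
    return objective

# Same simplified EV dynamics and costs as in the code above
f = lambda x, u: jnp.array([x[0] + 0.1 * x[1] + 0.05 * u, x[1] + 0.1 * u])
stage_cost = lambda x, u: 0.1 * (x[0]**2 + x[1]**2 + u**2)
terminal_cost = lambda x: x[0]**2 + x[1]**2
objective = make_scan_objective(f, stage_cost, terminal_cost, jnp.array([1.0, 0.0]))
print(jax.value_and_grad(objective)(jnp.zeros(19)))
```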
3.1.1. Dealing with Bound Constraints#
While we have successfully eliminated the dynamics as explicit constraints through what essentially amounts to a “reparametrization” of our problem, we’ve been silent regarding the bound constraints. The view of single shooting as a perfect transformation from a constrained problem to an unconstrained one is not entirely accurate: we must leave something on the table, and that something is the ability to easily impose state constraints.
By directly simulating the process from the initial state, each control sequence induces one and only one state path, and there is no direct way to tell the optimizer that the states along this path must stay within given bounds: the rollout may well produce a trajectory that violates the state constraints and is therefore infeasible (unrealistic).
Fortunately, the situation is much better for bound constraints on the controls. If we choose gradient descent as our method for solving this problem, we can consider a simple extension to readily support these kinds of bound constraints. The approach, in this case, would be what we call projected gradient descent. The general form of a projected gradient descent step can be expressed as:
where \(\mathcal{P}_C\) denotes the projection onto the feasible set \(C\), \(\alpha\) is the step size, and \(\nabla J(\mathbf{u}_k)\) is the gradient of the objective function at the current point \(\mathbf{u}_k\). In general, the projection operation can be computationally expensive or even intractable. However, in the case of box constraints (i.e., bound constraints), the projection simplifies to an element-wise clipping operation:
With this simple change, we can maintain the computational simplicity of unconstrained optimization while enforcing the bound constraints at each iteration, i.e., ensuring that we remain feasible throughout optimization. Moreover, it can be shown that this projection preserves the convergence properties of the gradient descent method: under suitable conditions (such as Lipschitz continuity of the gradient), projected gradient descent converges to a stationary point of the constrained problem.
Here is the algorithm for projected gradient descent with bound constraints for a general problem of the form:
where \(J(\mathbf{u})\) is our objective function, and \(\mathbf{u}_{lb}\) and \(\mathbf{u}_{ub}\) are the lower and upper bounds on the control variables, respectively.
(Projected Gradient Descent for Bound Constraints)
Input: Initial point \(\mathbf{u}_0\), learning rate \(\alpha\), bounds \(\mathbf{u}_{lb}\) and \(\mathbf{u}_{ub}\), maximum iterations \(\max_\text{iter}\), tolerance \(\varepsilon\)
Initialize \(k = 0\)
While \(k < \max_\text{iter}\) and not converged:
Compute gradient: \(\mathbf{g}_k = \nabla J(\mathbf{u}_k)\)
Update: \(\mathbf{u}_{k+1} = \text{clip}(\mathbf{u}_k - \alpha \mathbf{g}_k, \mathbf{u}_{lb}, \mathbf{u}_{ub})\)
Check convergence: if \(\|\mathbf{u}_{k+1} - \mathbf{u}_k\| < \varepsilon\), mark as converged
\(k = k + 1\)
Return \(\mathbf{u}_k\)
In this algorithm, the clip function projects the updated point back onto the feasible region defined by the bounds, i.e., each component of \(\mathbf{u}\) is clamped element-wise to the interval \([\mathbf{u}_{lb}, \mathbf{u}_{ub}]\).
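A minimal NumPy sketch of this scheme, assuming a generic gradient function grad_J supplied by the user:

```python
import numpy as np

def projected_gradient_descent(grad_J, u0, u_lb, u_ub, alpha=0.01,
                               max_iter=1000, tol=1e-6):
    """Gradient step followed by element-wise clipping onto the box [u_lb, u_ub]."""
    u = np.clip(u0, u_lb, u_ub)                  # start from a feasible point
    for _ in range(max_iter):
        u_new = np.clip(u - alpha * grad_J(u), u_lb, u_ub)
        if np.linalg.norm(u_new - u) < tol:      # converged
            return u_new
        u = u_new
    return u

# Toy usage: minimize ||u - 2||^2 subject to -1 <= u <= 1 (solution: all ones)
grad_J = lambda u: 2.0 * (u - 2.0)
print(projected_gradient_descent(grad_J, np.zeros(5), -1.0, 1.0))
```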
3.1.2. On the choice of optimizer#
Despite frequent mentions of automatic differentiation, it’s important to note that the single shooting approaches outlined in this section need not rely on gradient-based optimization methods. In fact, one could use any method provided by scipy.optimize.minimize, which offers a range of options such as the following (a short sketch appears after this list):
Derivative-free methods like Nelder-Mead Simplex, suitable for problems where gradients are unavailable or difficult to compute.
Quasi-Newton methods like BFGS (Broyden-Fletcher-Goldfarb-Shanno), which by default uses finite differences rather than automatic differentiation to approximate gradients.
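Swapping the optimizer only requires changing the method argument of scipy.optimize.minimize. The sketch below uses a stand-in objective rather than the full single-shooting rollout; note that bound support for Nelder-Mead requires a reasonably recent SciPy version, and L-BFGS-B is used here as the bound-constrained quasi-Newton option.

```python
import numpy as np
from scipy.optimize import minimize

# Stand-in objective; in practice this would be the single-shooting rollout cost
def objective(u):
    return np.sum(u**2) + np.sum((u[1:] - u[:-1])**2)

u0 = np.zeros(19)
bounds = [(-1.0, 1.0)] * len(u0)

# Derivative-free method
res_nm = minimize(objective, u0, method='Nelder-Mead', bounds=bounds)

# Quasi-Newton method with finite-difference gradients
res_lbfgs = minimize(objective, u0, method='L-BFGS-B', bounds=bounds)

print(res_nm.fun, res_lbfgs.fun)
```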
Another common strategy for single shooting methods is to use stochastic optimization techniques. For instance, random search generates a number of candidate solutions randomly and evaluates them. This approach is useful for problems with badly behaved loss landscapes or when gradient information is unreliable. More sophisticated stochastic methods include:
Genetic Algorithms: These mimic biological evolution, using mechanisms like selection, crossover, and mutation to evolve a population of solutions over generations [21]. (Implemented in DEAP library)
Simulated Annealing: Inspired by the annealing process in metallurgy, this method allows for occasional “uphill” moves to escape local minima [24]. (Available in SciPy)
Particle Swarm Optimization: This technique simulates the social behavior of organisms in a swarm, with particles (candidate solutions) moving through the search space and influencing each other [23]. (Implemented in PySwarms library)
The selection of an optimization method for single shooting is influenced by multiple factors: problem-specific characteristics, available computational resources, and the balance between exploring the solution space and exploiting known good solutions. While gradient-based methods generally offer faster convergence when applicable, derivative-free and stochastic approaches tend to be more robust to complex non-convex loss landscapes, albeit at the cost of increased computational demands.
In practice, however, this choice is often guided by the tools at hand and the practitioners’ familiarity with them. For instance, researchers with a background in deep learning tend to gravitate towards first-order gradient-based optimization techniques along with automatic differentiation for efficient derivative computation.
3.2. Constrained Optimization Approach#
The mathematical programming formulation presented earlier lends itself readily to off-the-shelf solvers for nonlinear mathematical programs. For example, we can use the scipy.optimize.minimize function along with the SLSQP (Sequential Least Squares Programming) solver to obtain a solution to any feasible Bolza problem of the form presented below.
The following code demonstrates how the electric vehicle energy management problem can be solved directly using scipy.optimize.minimize in a black-box fashion. In the next sections, we will dive deeper into the mathematical underpinnings of constrained optimization and implement our own solvers. For the moment, I simply want to bring to your attention that the solution to this problem, as expressed in its original form, involves solving for two interdependent quantities: the optimal sequence of controls, and that of the states encountered when applying them to the system. Is that a bug, or a feature? We’ll see that it depends…
Show code cell source
import numpy as np
from scipy.optimize import minimize
import matplotlib.pyplot as plt
def solve_docp(c_T, c_t, f_t, x_1, T, u_lb, u_ub, x_lb, x_ub):
"""
Solve a Discrete-Time Optimal Control Problem of Bolza Type using scipy.minimize with SLSQP.
Parameters:
- c_T: function, terminal cost c_T(x_T)
- c_t: function, stage cost c_t(x_t, u_t)
- f_t: function, state transition f_t(x_t, u_t)
- x_1: array, initial state
- T: int, time horizon
- u_lb, u_ub: arrays, lower and upper bounds for control inputs
- x_lb, x_ub: arrays, lower and upper bounds for states
Returns:
- result: OptimizeResult object from scipy.optimize.minimize
"""
n_x = len(x_1)
n_u = len(u_lb)
def objective(z):
x = z[:T*n_x].reshape(T, n_x)
u = z[T*n_x:].reshape(T, n_u)
cost = c_T(x[-1])
for t in range(T):
cost += c_t(x[t], u[t])
return cost
def constraints(z):
x = z[:T*n_x].reshape(T, n_x)
u = z[T*n_x:].reshape(T, n_u)
cons = []
# State transition constraints
for t in range(T-1):
cons.extend(x[t+1] - f_t(x[t], u[t]))
# Initial state constraint
cons.extend(x[0] - x_1)
return np.array(cons)
# Set up bounds
bounds = []
for t in range(T):
bounds.extend([(xl, xu) for xl, xu in zip(x_lb, x_ub)])
for t in range(T):
bounds.extend([(ul, uu) for ul, uu in zip(u_lb, u_ub)])
# Initial guess
z0 = np.zeros(T * (n_x + n_u))
# Solve the optimization problem
result = minimize(
objective,
z0,
method='SLSQP',
constraints={'type': 'eq', 'fun': constraints},
bounds=bounds,
options={'ftol': 1e-6, 'maxiter': 1000}
)
return result
def plot_results(x_opt, u_opt, T):
"""
Plot the optimal states and control inputs.
Parameters:
- x_opt: array, optimal states
- u_opt: array, optimal control inputs
- T: int, time horizon
"""
time = np.arange(T)
plt.figure(figsize=(12, 8))
# Plot states
plt.subplot(2, 1, 1)
plt.plot(time, x_opt[:, 0], label='Battery State of Charge')
plt.plot(time, x_opt[:, 1], label='Vehicle Speed')
plt.xlabel('Time Step')
plt.ylabel('State Value')
plt.title('Optimal State Trajectories')
plt.legend()
plt.grid(True)
# Plot control inputs
plt.subplot(2, 1, 2)
plt.plot(time, u_opt, label='Motor Power Input')
plt.xlabel('Time Step')
plt.ylabel('Control Input')
plt.title('Optimal Control Inputs')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
def example_docp():
# Define problem-specific functions and parameters
def c_T(x_T):
return x_T[0]**2 + x_T[1]**2
def c_t(x_t, u_t):
return 0.1 * (x_t[0]**2 + x_t[1]**2 + u_t[0]**2)
def f_t(x_t, u_t):
return np.array([
x_t[0] + 0.1 * x_t[1] + 0.05 * u_t[0],
x_t[1] + 0.1 * u_t[0]
])
x_1 = np.array([1.0, 0.0])
T = 20
u_lb = np.array([-1.0])
u_ub = np.array([1.0])
x_lb = np.array([-5.0, -5.0])
x_ub = np.array([5.0, 5.0])
result = solve_docp(c_T, c_t, f_t, x_1, T, u_lb, u_ub, x_lb, x_ub)
print("Optimization successful:", result.success)
print("Optimal cost:", result.fun)
# Extract optimal states and controls
x_opt = result.x[:T*2].reshape(T, 2)
u_opt = result.x[T*2:].reshape(T, 1)
print("Optimal states:")
print(x_opt)
print("Optimal controls:")
print(u_opt)
# Plot the results
plot_results(x_opt, u_opt, T)
if __name__ == "__main__":
example_docp()
Optimization successful: True
Optimal cost: 1.4233413234222974
Optimal states:
[[ 1.00000000e+00 -9.22148533e-19]
[ 9.53308082e-01 -9.33838361e-02]
[ 9.05334190e-01 -1.70654854e-01]
[ 8.56730650e-01 -2.33730963e-01]
[ 8.08051698e-01 -2.84342673e-01]
[ 7.59764997e-01 -3.24047541e-01]
[ 7.12262103e-01 -3.54243821e-01]
[ 6.65868006e-01 -3.76183251e-01]
[ 6.20849775e-01 -3.90983061e-01]
[ 5.77424403e-01 -3.99637194e-01]
[ 5.35765910e-01 -4.03026741e-01]
[ 4.96011780e-01 -4.01929653e-01]
[ 4.58268768e-01 -3.97029746e-01]
[ 4.22618163e-01 -3.88925007e-01]
[ 3.89120539e-01 -3.78135253e-01]
[ 3.57820043e-01 -3.65109194e-01]
[ 3.28748272e-01 -3.50230898e-01]
[ 3.01927788e-01 -3.33825687e-01]
[ 2.77375293e-01 -3.16165539e-01]
[ 2.55104512e-01 -2.97473993e-01]]
Optimal controls:
[[-9.33838361e-01]
[-7.72710176e-01]
[-6.30761088e-01]
[-5.06117101e-01]
[-3.97048684e-01]
[-3.01962800e-01]
[-2.19394301e-01]
[-1.47998103e-01]
[-8.65413246e-02]
[-3.38954763e-02]
[ 1.09708836e-02]
[ 4.89990726e-02]
[ 8.10473863e-02]
[ 1.07897540e-01]
[ 1.30260589e-01]
[ 1.48782960e-01]
[ 1.64052114e-01]
[ 1.76601475e-01]
[ 1.86915465e-01]
[-1.51238167e-09]]
3.2.1. Nonlinear Programming#
Unless specific assumptions are made on the dynamics and cost structure, a DOCP is, in its most general form, a nonlinear mathematical program (commonly referred to as an NLP, not to be confused with Natural Language Processing). An NLP can be formulated as follows:
Where:
\(f: \mathbb{R}^n \to \mathbb{R}\) is the objective function
\(\mathbf{g}: \mathbb{R}^n \to \mathbb{R}^m\) represents inequality constraints
\(\mathbf{h}: \mathbb{R}^n \to \mathbb{R}^\ell\) represents equality constraints
Unlike unconstrained optimization commonly used in deep learning, the optimality of a solution in constrained optimization must consider both the objective value and constraint feasibility. To illustrate this, consider the following problem, which includes both equality and inequality constraints:
In this example, the objective function \(f(x_1, x_2)\) is quadratic, the inequality constraint \(g(x_1, x_2)\) defines a circular feasible region centered at \((1, 1)\) with a radius of \(\sqrt{1.5}\), and the equality constraint \(h(x_1, x_2)\) requires \(x_2\) to lie on a sine wave. The following code demonstrates the difference between the unconstrained and constrained solutions to this problem.
Show code cell source
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize
# Define the objective function
def objective(x):
return (x[0] - 1)**2 + (x[1] - 2.5)**2
# Define the inequality constraint function
def constraint(x):
return -(x[0] - 1)**2 - (x[1] - 1)**2 + 1.5
# Define the gradient of the objective function
def objective_gradient(x):
return np.array([2*(x[0] - 1), 2*(x[1] - 2.5)])
# Define the gradient of the inequality constraint function
def constraint_gradient(x):
return np.array([-2*(x[0] - 1), -2*(x[1] - 1)])
# Define the sine wave equality constraint function
def sine_wave_equality_constraint(x):
return x[1] - (0.5 * np.sin(2 * np.pi * x[0]) + 1.5)
# Define the gradient of the sine wave equality constraint function
def sine_wave_equality_constraint_gradient(x):
return np.array([-np.pi * np.cos(2 * np.pi * x[0]), 1])
# Define the constraints including the sine wave equality constraint
sine_wave_constraints = [{'type': 'ineq', 'fun': constraint, 'jac': constraint_gradient}, # Inequality constraint
{'type': 'eq', 'fun': sine_wave_equality_constraint, 'jac': sine_wave_equality_constraint_gradient}] # Sine wave equality constraint
# Define only the inequality constraint
inequality_constraints = [{'type': 'ineq', 'fun': constraint, 'jac': constraint_gradient}]
# Initial guess
x0 = [1.25, 1.5]
# Solve the optimization problem with the sine wave equality constraint
res_sine_wave_constraint = minimize(objective, x0, method='SLSQP', jac=objective_gradient,
constraints=sine_wave_constraints, options={'disp': False})
x_opt_sine_wave_constraint = res_sine_wave_constraint.x
# Solve the optimization problem with only the inequality constraint
res_inequality_only = minimize(objective, x0, method='SLSQP', jac=objective_gradient,
constraints=inequality_constraints, options={'disp': False})
x_opt_inequality_only = res_inequality_only.x
# Solve the unconstrained optimization problem for reference
res_unconstrained = minimize(objective, x0, method='SLSQP', jac=objective_gradient, options={'disp': False})
x_opt_unconstrained = res_unconstrained.x
# Generate data for visualization
x = np.linspace(-1, 4, 400)
y = np.linspace(-1, 4, 400)
X, Y = np.meshgrid(x, y)
Z = (X - 1)**2 + (Y - 2.5)**2 # Objective function values
constraint_values = (X - 1)**2 + (Y - 1)**2
# Data for sine wave constraint
x_sine = np.linspace(-1, 4, 400)
y_sine = 0.5 * np.sin(2 * np.pi * x_sine) + 1.5
# Visualization with Improved Color Scheme
plt.figure(figsize=(8, 6))
plt.contourf(X, Y, Z, levels=100, cmap='viridis', alpha=0.6) # Heatmap for the objective function
# Plot all the optimal points
plt.plot(x_opt_inequality_only[0], x_opt_inequality_only[1], 'ro', label='Optimal Solution (Inequality Only)', markersize=8, markeredgecolor='black')
plt.plot(x_opt_sine_wave_constraint[0], x_opt_sine_wave_constraint[1], 'mo', label='Optimal Solution (Sine Wave Equality & Inequality)', markersize=8, markeredgecolor='black')
plt.plot(x_opt_unconstrained[0], x_opt_unconstrained[1], 'co', label='Unconstrained Minimum', markersize=8, markeredgecolor='black')
# Adjust constraint boundary colors
plt.contour(X, Y, constraint_values, levels=[1.5], colors='navy', linewidths=2, linestyles='dashed')
plt.contourf(X, Y, constraint_values, levels=[0, 1.5], colors='skyblue', alpha=0.3)
# Plot the sine wave equality constraint with a high contrast color
plt.plot(x_sine, y_sine, 'lime', linestyle='--', linewidth=2, label='Sine Wave Equality Constraint')
plt.xlim([-1, 4])
plt.ylim([-1, 4])
plt.xlabel('x1')
plt.ylabel('x2')
plt.title('Example NLP')
plt.legend(loc='upper left', fontsize='small', edgecolor='black', fancybox=True)
plt.grid(True)
# Set the aspect ratio to be equal so the circle appears correctly
plt.gca().set_aspect('equal', adjustable='box')
plt.show()
3.2.1.1. Karush-Kuhn-Tucker (KKT) conditions#
While this example is simple enough to convince ourselves visually of the solution to this particular problem, it falls short of providing us with an actionable characterization of what constitutes an optimal solution in general. The Karush-Kuhn-Tucker (KKT) conditions provide us with an answer to this problem by generalizing the first-order optimality conditions of unconstrained optimization to problems involving both equality and inequality constraints. This result relies on the construction of an auxiliary function called the Lagrangian, defined as:
where \(\boldsymbol{\mu} \in \mathbb{R}^m\) and \(\boldsymbol{\lambda} \in \mathbb{R}^\ell\) are known as Lagrange multipliers. The first-order optimality conditions then state that if \(\mathbf{x}^*\) is a local optimum (satisfying suitable regularity conditions), then there must exist corresponding Lagrange multipliers \(\boldsymbol{\mu}^*\) and \(\boldsymbol{\lambda}^*\) such that:
The gradient of the Lagrangian with respect to \(\mathbf{x}\) must be zero at the optimal point (stationarity):
\[\nabla_x \mathcal{L}(\mathbf{x}^*, \boldsymbol{\mu}^*, \boldsymbol{\lambda}^*) = \nabla f(\mathbf{x}^*) + \sum_{i=1}^m \mu_i^* \nabla g_i(\mathbf{x}^*) + \sum_{j=1}^\ell \lambda_j^* \nabla h_j(\mathbf{x}^*) = \mathbf{0}\]In the case where we only have equality constraints, this means that the gradient of the objective is a linear combination of the constraint gradients at the optimal point; with a single equality constraint, the two gradients are collinear.
A valid solution of an NLP is one which satisfies all the constraints (primal feasibility)
\[\begin{aligned} \mathbf{g}(\mathbf{x}^*) &\leq \mathbf{0}, \enspace \text{and} \enspace \mathbf{h}(\mathbf{x}^*) &= \mathbf{0} \end{aligned}\]Furthermore, the Lagrange multipliers for inequality constraints must be non-negative (dual feasibility)
\[\boldsymbol{\mu}^* \geq \mathbf{0}\]This condition stems from the fact that the inequality constraints can only push the solution in one direction.
Finally, for each inequality constraint, either the constraint is active (equality holds) or its corresponding Lagrange multiplier is zero at an optimal solution (complementary slackness)
\[\mu_i^* g_i(\mathbf{x}^*) = 0, \quad \forall i = 1,\ldots,m\]
Let’s now solve our example problem above, this time using Ipopt via the Pyomo interface so that we can access the Lagrange multipliers found by the solver.
Show code cell content
from pyomo.environ import *
from pyomo.opt import SolverFactory
from myst_nb import glue
import math
# Define the Pyomo model
model = ConcreteModel()
# Define the variables
model.x1 = Var(initialize=1.25)
model.x2 = Var(initialize=1.5)
# Define the objective function
def objective_rule(model):
return (model.x1 - 1)**2 + (model.x2 - 2.5)**2
model.obj = Objective(rule=objective_rule, sense=minimize)
# Define the inequality constraint (circle)
def inequality_constraint_rule(model):
return (model.x1 - 1)**2 + (model.x2 - 1)**2 <= 1.5
model.ineq_constraint = Constraint(rule=inequality_constraint_rule)
# Define the equality constraint (sine wave) using Pyomo's math functions
def equality_constraint_rule(model):
return model.x2 == 0.5 * sin(2 * math.pi * model.x1) + 1.5
model.eq_constraint = Constraint(rule=equality_constraint_rule)
# Create a suffix component to capture dual values
model.dual = Suffix(direction=Suffix.IMPORT)
# Create a solver
solver=SolverFactory('ipopt')
# Solve the problem
results = solver.solve(model, tee=False)
# Check if the solver found an optimal solution
if (results.solver.status == SolverStatus.ok and
results.solver.termination_condition == TerminationCondition.optimal):
# Print the results
print(f"x1: {value(model.x1)}")
print(f"x2: {value(model.x2)}")
# Print the objective value
print(f"Objective value: {value(model.obj)}")
# Print the Lagrange multipliers (dual values)
print("\nLagrange multipliers:")
for c in model.component_objects(Constraint, active=True):
for index in c:
print(f"{c.name}[{index}]: {model.dual[c[index]]}")
glue(f"{c.name}[{index}]", model.dual[c[index]], display=False)
else:
print("Solver did not find an optimal solution.")
print(f"Solver Status: {results.solver.status}")
print(f"Termination Condition: {results.solver.termination_condition}")
After running the code, we find that the Lagrange multiplier associated with the inequality constraint is very small, essentially zero, which suggests that the inequality constraint is not active at the optimal solution: the solution point lies inside the circle defined by this constraint. This can be verified visually in the figure above. As for the equality constraint, its corresponding Lagrange multiplier is non-zero, which indicates that this constraint is active at the optimal solution. In general, when we find a Lagrange multiplier close to zero (like the one for the inequality constraint), it means that the constraint is not “binding”: the optimal solution does not lie on the boundary defined by this constraint. In contrast, a non-zero Lagrange multiplier, such as the one for the equality constraint, indicates that the constraint is active and that any relaxation of it would directly affect the objective function’s value, as required by the stationarity condition.
3.2.1.2. Lagrange Multiplier Theorem#
The KKT conditions introduced above characterize the solution structure of general constrained optimization problems. When only equality constraints are present, they reduce to the first-order optimality conditions of the classical Lagrange multiplier theorem. Let’s re-state them in that simpler setting:
(Lagrange Multiplier Theorem)
Consider the constrained optimization problem:
where \(\mathbf{x} \in \mathbb{R}^n\), \(f: \mathbb{R}^n \to \mathbb{R}\), and \(h_i: \mathbb{R}^n \to \mathbb{R}\) for \(i = 1, \ldots, m\).
Assume that:
\(f\) and \(h_i\) are continuously differentiable functions.
The gradients \(\nabla h_i(\mathbf{x}^*)\) are linearly independent at the optimal point \(\mathbf{x}^*\).
Then, there exist unique Lagrange multipliers \(\lambda_i^* \in \mathbb{R}\), \(i = 1, \ldots, m\), such that the following first-order optimality conditions hold:
Stationarity: \(\nabla f(\mathbf{x}^*) + \sum_{i=1}^m \lambda_i^* \nabla h_i(\mathbf{x}^*) = \mathbf{0}\)
Primal feasibility: \(h_i(\mathbf{x}^*) = 0\), for \(i = 1, \ldots, m\)
Note that both the stationarity and primal feasibility statements are simply saying that the derivative of the Lagrangian in either the primal or dual variables must be zero at an optimal constrained solution. In other words:
Letting \(\mathbf{F}(\mathbf{x}, \boldsymbol{\lambda})\) stand for \(\nabla_{\mathbf{x}, \boldsymbol{\lambda}} L(\mathbf{x}, \boldsymbol{\lambda})\), the Lagrange multiplier theorem tells us that an optimal primal-dual pair is actually a zero of that function \(\mathbf{F}\): the derivative of the Lagrangian. We can therefore use this observation to craft a solution method for equality-constrained optimization based on Newton’s method, a numerical procedure for finding zeros of nonlinear functions.
3.2.1.3. Newton’s Method#
Newton’s method is a numerical procedure for solving root-finding problems. These are nonlinear systems of equations of the form:
Find \(\mathbf{z}^* \in \mathbb{R}^n\) such that \(\mathbf{F}(\mathbf{z}^*) = \mathbf{0}\)
where \(\mathbf{F}: \mathbb{R}^n \to \mathbb{R}^n\) is a continuously differentiable function. Newton’s method then consists in applying the following sequence of iterates:
where \(\mathbf{z}^k\) is the k-th iterate, and \(\nabla \mathbf{F}(\mathbf{z}^k)\) is the Jacobian matrix of \(\mathbf{F}\) evaluated at \(\mathbf{z}^k\).
Newton’s method exhibits local quadratic convergence: if the initial guess \(\mathbf{z}^0\) is sufficiently close to the true solution \(\mathbf{z}^*\), and \(\nabla \mathbf{F}(\mathbf{z}^*)\) is nonsingular, the method converges quadratically to \(\mathbf{z}^*\) [31]. However, the method is sensitive to the initial guess; if it’s too far from the desired solution, Newton’s method might fail to converge or converge to a different root. To mitigate this problem, a set of techniques known as numerical continuation methods [2] have been developed. These methods effectively enlarge the basin of attraction of Newton’s method by solving a sequence of related problems, progressing from an easy one to the target problem. This approach is reminiscent of several concepts in machine learning and statistical inference: curriculum learning in machine learning, where models are trained on increasingly complex data; tempering in Markov Chain Monte Carlo (MCMC) samplers, which gradually adjusts the target distribution to improve mixing; and modern diffusion models, which use a similar concept of gradually transforming noise into structured data.
3.2.1.3.1. Efficient Implementation of Newton’s Method#
Note that each step of Newton’s method involves computing the inverse of a Jacobian matrix. However, a cardinal rule in numerical linear algebra is to avoid computing matrix inverses explicitly: rarely, if ever, should there be an np.linalg.inv in your code. Instead, the numerically stable and computationally efficient approach is to solve a linear system of equations at each step.
Given the Newton iteration:
We can reformulate this as a two-step procedure:
Solve the linear system: \(\underbrace{[\nabla \mathbf{F}(\mathbf{z}^k)]}_{\mathbf{A}} \Delta \mathbf{z}^k = -\mathbf{F}(\mathbf{z}^k)\)
Update: \(\mathbf{z}^{k+1} = \mathbf{z}^k + \Delta \mathbf{z}^k\)
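A minimal JAX sketch of this solve-then-update iteration for a generic root-finding problem; the Jacobian is obtained with jax.jacfwd purely for convenience, and the toy system is illustrative only.

```python
import jax
import jax.numpy as jnp

def newton_solve(F, z0, max_iter=50, tol=1e-10):
    """Newton's method written as 'solve a linear system, then update',
    never forming an explicit matrix inverse."""
    jac_F = jax.jacfwd(F)
    z = z0
    for _ in range(max_iter):
        Fz = F(z)
        if jnp.linalg.norm(Fz) < tol:
            break
        dz = jnp.linalg.solve(jac_F(z), -Fz)   # step 1: linear solve
        z = z + dz                             # step 2: update
    return z

# Toy example: intersection of the circle x^2 + y^2 = 4 with the line y = x
F = lambda z: jnp.array([z[0]**2 + z[1]**2 - 4.0, z[1] - z[0]])
print(newton_solve(F, jnp.array([1.0, 0.5])))  # approximately (sqrt(2), sqrt(2))
```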
The structure of the linear system in step 1 often allows for specialized solution methods. In the context of automatic differentiation, matrix-free linear solvers are particularly useful. These solvers can find a solution without explicitly forming the matrix \(\mathbf{A}\), requiring only the ability to evaluate matrix-vector or vector-matrix products. Typical examples include classical matrix-splitting methods (e.g., Richardson iteration) and the conjugate gradient method, available in SciPy through scipy.sparse.linalg.cg. Another option is the Generalized Minimal Residual method (GMRES), implemented via scipy.sparse.linalg.gmres, which can also handle non-symmetric and indefinite systems.
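To illustrate the matrix-free idea, the sketch below solves a linear system with scipy.sparse.linalg.gmres through a LinearOperator that only exposes matrix-vector products. The dense matrix here is just a stand-in for a Jacobian that we would rather not form explicitly.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

n = 200
rng = np.random.default_rng(0)
J = rng.standard_normal((n, n)) + n * np.eye(n)   # stand-in, non-symmetric system matrix
rhs = rng.standard_normal(n)

# The solver only ever sees matrix-vector products, never the matrix itself
A_op = LinearOperator((n, n), matvec=lambda v: J @ v)

dz, info = gmres(A_op, rhs)
print(info, np.linalg.norm(J @ dz - rhs))         # info == 0 indicates convergence
```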
By inspecting the structure of matrix \(\mathbf{A}\) in the specific application where the function \(\mathbf{F}\) is the derivative of the Lagrangian, we will also uncover an important structure known as the KKT matrix. This structure will then allow us to derive a Quadratic Programming (QP) sub-problem as part of a larger iterative procedure for solving equality and inequality constrained problems via Sequential Quadratic Programming (SQP).
3.2.1.4. Solving Equality Constrained Programs with Newton’s Method#
To solve equality-constrained optimization problems using Newton’s method, we begin by recognizing that the problem reduces to finding a zero of the function \(\mathbf{F}(\mathbf{z}) = \nabla_{\mathbf{x}, \boldsymbol{\lambda}} L(\mathbf{x}, \boldsymbol{\lambda})\). Here, \(\mathbf{F}\) represents the derivative of the Lagrangian function, and \(\mathbf{z} = (\mathbf{x}, \boldsymbol{\lambda})\) combines both the primal variables \(\mathbf{x}\) and the dual variables (Lagrange multipliers) \(\boldsymbol{\lambda}\). Explicitly, we have:
Newton’s method involves linearizing \(\mathbf{F}(\mathbf{z})\) around the current iterate \(\mathbf{z}^k = (\mathbf{x}^k, \boldsymbol{\lambda}^k)\) and then solving the resulting linear system. At each iteration \(k\), Newton’s method updates the current estimate by solving the linear system:
However, instead of explicitly inverting the Jacobian matrix \(\nabla \mathbf{F}(\mathbf{z}^k)\), we solve the linear system:
where \(\Delta \mathbf{z}^k = (\Delta \mathbf{x}^k, \Delta \boldsymbol{\lambda}^k)\) represents the Newton step for the primal and dual variables. Substituting the expression for \(\mathbf{F}(\mathbf{z})\) and its Jacobian, the system becomes:
The matrix on the left-hand side is known as the KKT matrix, as it stems from the Karush-Kuhn-Tucker conditions for this optimization problem. The solution of this system provides the updates \(\Delta \mathbf{x}^k\) and \(\Delta \boldsymbol{\lambda}^k\), which are then used to update the primal and dual variables:
3.2.1.4.1. Demonstration#
The following code demonstrates how we can implement this idea in JAX. In this demonstration, we minimize a quadratic objective function subject to a single equality constraint, a problem formally stated as follows:
Geometrically speaking, the constraint \(h(x)\) describes a unit circle centered at the origin. To solve this problem using the method of Lagrange multipliers, we form the Lagrangian:
For this particular problem, it so happens that we can also find an analytical solution without even having to use Newton’s method. From the first-order optimality conditions, we obtain the following system of equations:
From the first two equations, we then get:
We can substitute these into the third equation (the constraint) to obtain:
This value of the Lagrange multiplier can then be backsubstituted into the above equations to obtain \(x_1 = \frac{2}{\sqrt{5}}\) and \(x_2 = \frac{1}{\sqrt{5}}\). We can verify numerically (and visually on the following graph) that the point \((2/\sqrt{5}, 1/\sqrt{5})\) is indeed the point on the unit circle closest to \((2, 1)\).
Show code cell source
import jax
import jax.numpy as jnp
from jax import grad, jit, jacfwd
import matplotlib.pyplot as plt
# Define the objective function and constraint
def f(x):
return (x[0] - 2)**2 + (x[1] - 1)**2
def g(x):
return x[0]**2 + x[1]**2 - 1
# Lagrangian
def L(x, lambda_):
return f(x) + lambda_ * g(x)
# Gradient and Hessian of Lagrangian
grad_L_x = jit(grad(L, argnums=0))
grad_L_lambda = jit(grad(L, argnums=1))
hess_L_xx = jit(jacfwd(grad_L_x, argnums=0))
hess_L_xlambda = jit(jacfwd(grad_L_x, argnums=1))
# Newton's method
@jit
def newton_step(x, lambda_):
grad_x = grad_L_x(x, lambda_)
grad_lambda = grad_L_lambda(x, lambda_)
hess_xx = hess_L_xx(x, lambda_)
hess_xlambda = hess_L_xlambda(x, lambda_).reshape(-1)
# Construct the full KKT matrix
kkt_matrix = jnp.block([
[hess_xx, hess_xlambda.reshape(-1, 1)],
[hess_xlambda, jnp.array([[0.0]])]
])
# Construct the right-hand side
rhs = jnp.concatenate([-grad_x, -jnp.array([grad_lambda])])
# Solve the KKT system
delta = jnp.linalg.solve(kkt_matrix, rhs)
return x + delta[:2], lambda_ + delta[2]
def solve_constrained_optimization(x0, lambda0, max_iter=100, tol=1e-6):
x, lambda_ = x0, lambda0
for i in range(max_iter):
x_new, lambda_new = newton_step(x, lambda_)
if jnp.linalg.norm(jnp.concatenate([x_new - x, jnp.array([lambda_new - lambda_])])) < tol:
break
x, lambda_ = x_new, lambda_new
return x, lambda_, i+1
# Analytical solution
def analytical_solution():
x1 = 2 / jnp.sqrt(5)
x2 = 1 / jnp.sqrt(5)
lambda_opt = jnp.sqrt(5) - 1
return jnp.array([x1, x2]), lambda_opt
# Solve the problem numerically
x0 = jnp.array([0.5, 0.5])
lambda0 = 0.0
x_opt_num, lambda_opt_num, iterations = solve_constrained_optimization(x0, lambda0)
# Compute analytical solution
x_opt_ana, lambda_opt_ana = analytical_solution()
# Verify the result
print("\nNumerical Solution:")
print(f"Constraint violation: {g(x_opt_num):.6f}")
print(f"Objective function value: {f(x_opt_num):.6f}")
print("\nAnalytical Solution:")
print(f"Constraint violation: {g(x_opt_ana):.6f}")
print(f"Objective function value: {f(x_opt_ana):.6f}")
print("\nComparison:")
x_diff = jnp.linalg.norm(x_opt_num - x_opt_ana)
lambda_diff = jnp.abs(lambda_opt_num - lambda_opt_ana)
print(f"Difference in x: {x_diff}")
print(f"Difference in lambda: {lambda_diff}")
# Precision test
rtol = 1e-5 # relative tolerance
atol = 1e-8 # absolute tolerance
x_close = jnp.allclose(x_opt_num, x_opt_ana, rtol=rtol, atol=atol)
lambda_close = jnp.isclose(lambda_opt_num, lambda_opt_ana, rtol=rtol, atol=atol)
print("\nPrecision Test:")
print(f"x values are close: {x_close}")
print(f"lambda values are close: {lambda_close}")
if x_close and lambda_close:
print("The numerical solution matches the analytical solution within the specified tolerance.")
else:
print("The numerical solution differs from the analytical solution more than the specified tolerance.")
# Visualize the result
plt.figure(figsize=(12, 10))
# Create a mesh for the contour plot
x1_range = jnp.linspace(-1.5, 2.5, 100)
x2_range = jnp.linspace(-1.5, 2.5, 100)
X1, X2 = jnp.meshgrid(x1_range, x2_range)
Z = jnp.array([[f(jnp.array([x1, x2])) for x1 in x1_range] for x2 in x2_range])
# Plot filled contours
contour = plt.contourf(X1, X2, Z, levels=50, cmap='viridis', alpha=0.7, extent=[-1.5, 2.5, -1.5, 2.5])
plt.colorbar(contour, label='Objective Function Value')
# Plot the constraint
theta = jnp.linspace(0, 2*jnp.pi, 100)
x1 = jnp.cos(theta)
x2 = jnp.sin(theta)
plt.plot(x1, x2, color='red', linewidth=2, label='Constraint')
# Plot the optimal points (numerical and analytical) and initial point
plt.scatter(x_opt_num[0], x_opt_num[1], color='red', s=100, edgecolor='white', linewidth=2, label='Numerical Optimal Point')
plt.scatter(x_opt_ana[0], x_opt_ana[1], color='blue', s=100, edgecolor='white', linewidth=2, label='Analytical Optimal Point')
plt.scatter(x0[0], x0[1], color='green', s=100, edgecolor='white', linewidth=2, label='Initial Point')
# Add labels and title
plt.xlabel('x1', fontsize=12)
plt.ylabel('x2', fontsize=12)
plt.title('Constrained Optimization: Numerical vs Analytical Solution', fontsize=14)
plt.legend(fontsize=10)
plt.grid(True, linestyle='--', alpha=0.7)
# Set the axis limits explicitly
plt.xlim(-1.5, 2.5)
plt.ylim(-1.5, 2.5)
plt.tight_layout()
plt.show()
3.2.2. The SQP Approach: Taylor Expansion and Quadratic Approximation#
Sequential Quadratic Programming (SQP) tackles the problem of solving constrained programs by iteratively solving a sequence of simpler subproblems. Specifically, these subproblems are quadratic programs (QPs) that approximate the original problem around the current iterate by using a quadratic model of the objective function and a linear model of the constraints. Suppose we have the following optimization problem with equality constraints:
At each iteration \(k\), we approximate the objective function \(f(\mathbf{x})\) using a second-order Taylor expansion around the current iterate \(\mathbf{x}^k\). The standard Taylor expansion for \(f\) would be:
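\[ f(\mathbf{x}) \approx f(\mathbf{x}^k) + \nabla f(\mathbf{x}^k)^T (\mathbf{x} - \mathbf{x}^k) + \frac{1}{2} (\mathbf{x} - \mathbf{x}^k)^T \nabla^2 f(\mathbf{x}^k) (\mathbf{x} - \mathbf{x}^k). \]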
This expansion uses the Hessian of the objective function \(\nabla^2 f(\mathbf{x}^k)\) to capture the curvature of \(f\). However, in the context of constrained optimization, we also need to account for the effect of the constraints on the local behavior of the solution. If we were to use only \(\nabla^2 f(\mathbf{x}^k)\), we would not capture the influence of the constraints on the curvature of the feasible region. The resulting subproblem might then lead to steps that violate the constraints or are less effective in achieving convergence. The choice that we make instead is to use the Hessian of the Lagrangian, \(\nabla^2_{\mathbf{x}\mathbf{x}} L(\mathbf{x}^k, \boldsymbol{\lambda}^k)\), leading to the following quadratic model:
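\[ f(\mathbf{x}) \approx f(\mathbf{x}^k) + \nabla f(\mathbf{x}^k)^T \Delta \mathbf{x} + \frac{1}{2} \Delta \mathbf{x}^T \nabla^2_{\mathbf{x}\mathbf{x}} L(\mathbf{x}^k, \boldsymbol{\lambda}^k) \Delta \mathbf{x}, \quad \text{with } \Delta \mathbf{x} = \mathbf{x} - \mathbf{x}^k. \]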
Similarly, the equality constraints \(\mathbf{h}(\mathbf{x})\) are linearized around \(\mathbf{x}^k\):
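\[ \mathbf{h}(\mathbf{x}) \approx \mathbf{h}(\mathbf{x}^k) + \nabla \mathbf{h}(\mathbf{x}^k) \Delta \mathbf{x}. \]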
Combining these approximations, we obtain a Quadratic Programming (QP) subproblem, which approximates our original problem locally at \(\mathbf{x}^k\) but is easier to solve:
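\[\begin{split} \begin{aligned} \text{Minimize} \quad & \nabla f(\mathbf{x}^k)^T \Delta \mathbf{x} + \frac{1}{2} \Delta \mathbf{x}^T \nabla^2_{\mathbf{x}\mathbf{x}} L(\mathbf{x}^k, \boldsymbol{\lambda}^k) \Delta \mathbf{x} \\ \text{subject to} \quad & \nabla \mathbf{h}(\mathbf{x}^k) \Delta \mathbf{x} + \mathbf{h}(\mathbf{x}^k) = \mathbf{0}, \end{aligned} \end{split}\]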
where \(\Delta \mathbf{x} = \mathbf{x} - \mathbf{x}^k\). The QP subproblem solved at each iteration focuses on finding the optimal step direction \(\Delta \mathbf{x}\) for the primal variables. While solving this QP, we obtain not only the step \(\Delta \mathbf{x}\) but also the associated Lagrange multipliers for the QP subproblem, which correspond to an updated dual variable vector \(\boldsymbol{\lambda}^{k+1}\). More specifically, after solving the QP, we use \(\Delta \mathbf{x}^k\) to update the primal variables:
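\[ \mathbf{x}^{k+1} = \mathbf{x}^k + \Delta \mathbf{x}^k. \]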
Simultaneously, the Lagrange multipliers from the QP provide the updated dual variables \(\boldsymbol{\lambda}^{k+1}\). We summarize the SQP algorithm in the following pseudo-code:
(Sequential Quadratic Programming (SQP))
Input: Initial estimate \(\mathbf{x}^0\), initial Lagrange multipliers \(\boldsymbol{\lambda}^0\), tolerance \(\epsilon > 0\).
Output: Solution \(\mathbf{x}^*\), Lagrange multipliers \(\boldsymbol{\lambda}^*\).
Procedure:
Compute the QP Solution: Solve the QP subproblem to obtain \(\Delta \mathbf{x}^k\). The QP solver also provides the updated Lagrange multipliers \(\boldsymbol{\lambda}^{k+1}\) associated with the constraints.
Update the Estimates: Update the primal variables:
\[ \mathbf{x}^{k+1} = \mathbf{x}^k + \Delta \mathbf{x}^k. \]
Set the dual variables to the updated values \(\boldsymbol{\lambda}^{k+1}\) from the QP solution.
Repeat Until Convergence: Continue iterating until \(\|\Delta \mathbf{x}^k\| < \epsilon\) and the KKT conditions are satisfied.
3.2.2.1. Connection to Newton’s Method in the Equality-Constrained Case#
The QP subproblem in SQP is directly related to applying Newton’s method for equality-constrained optimization. To see this, note that the KKT matrix of the QP subproblem is:
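\[ \begin{bmatrix} \nabla^2_{\mathbf{x}\mathbf{x}} L(\mathbf{x}^k, \boldsymbol{\lambda}^k) & \nabla \mathbf{h}(\mathbf{x}^k)^T \\ \nabla \mathbf{h}(\mathbf{x}^k) & \mathbf{0} \end{bmatrix} \begin{bmatrix} \Delta \mathbf{x} \\ \boldsymbol{\lambda}^{k+1} \end{bmatrix} = - \begin{bmatrix} \nabla f(\mathbf{x}^k) \\ \mathbf{h}(\mathbf{x}^k) \end{bmatrix}. \]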
This is exactly the same linear system that we have to solve when applying Newton’s method to the KKT conditions of the original program! Thus, solving the QP subproblem at each iteration of SQP is equivalent to taking a Newton step on the KKT conditions of the original nonlinear problem.
3.2.3. SQP for Inequality-Constrained Optimization#
So far, we’ve applied the ideas behind Sequential Quadratic Programming (SQP) to problems with only equality constraints. Now, let’s extend this framework to handle optimization problems that also include inequality constraints. Consider a general nonlinear optimization problem that includes both equality and inequality constraints:
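\[\begin{split} \begin{aligned} \text{Minimize} \quad & f(\mathbf{x}) \\ \text{subject to} \quad & \mathbf{g}(\mathbf{x}) \leq \mathbf{0}, \\ & \mathbf{h}(\mathbf{x}) = \mathbf{0}. \end{aligned} \end{split}\]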
As we did earlier, we approximate this problem by constructing a quadratic model of the objective and a linearization of the constraints. The QP subproblem at each iteration is then formulated as:
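\[\begin{split} \begin{aligned} \text{Minimize} \quad & \nabla f(\mathbf{x}^k)^T \Delta \mathbf{x} + \frac{1}{2} \Delta \mathbf{x}^T \nabla^2_{\mathbf{x}\mathbf{x}} L(\mathbf{x}^k, \boldsymbol{\lambda}^k, \boldsymbol{\nu}^k) \Delta \mathbf{x} \\ \text{subject to} \quad & \nabla \mathbf{g}(\mathbf{x}^k) \Delta \mathbf{x} + \mathbf{g}(\mathbf{x}^k) \leq \mathbf{0}, \\ & \nabla \mathbf{h}(\mathbf{x}^k) \Delta \mathbf{x} + \mathbf{h}(\mathbf{x}^k) = \mathbf{0}, \end{aligned} \end{split}\]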
where \(\Delta \mathbf{x} = \mathbf{x} - \mathbf{x}^k\) represents the step direction for the primal variables. The following pseudocode outlines the steps involved in applying SQP to a problem with both equality and inequality constraints:
(Sequential Quadratic Programming (SQP) with Inequality Constraints)
Input: Initial estimate \(\mathbf{x}^0\), initial multipliers \(\boldsymbol{\lambda}^0, \boldsymbol{\nu}^0\), tolerance \(\epsilon > 0\).
Output: Solution \(\mathbf{x}^*\), Lagrange multipliers \(\boldsymbol{\lambda}^*, \boldsymbol{\nu}^*\).
Procedure:
Initialization: Set \(k = 0\).
Repeat:
a. Construct the QP Subproblem: Formulate the QP subproblem using the current iterate \(\mathbf{x}^k\), \(\boldsymbol{\lambda}^k\), and \(\boldsymbol{\nu}^k\):
\[\begin{split} \begin{aligned} \text{Minimize} \quad & \nabla f(\mathbf{x}^k)^T \Delta \mathbf{x} + \frac{1}{2} \Delta \mathbf{x}^T \nabla^2_{\mathbf{x}\mathbf{x}} L(\mathbf{x}^k, \boldsymbol{\lambda}^k, \boldsymbol{\nu}^k) \Delta \mathbf{x} \\ \text{subject to} \quad & \nabla \mathbf{g}(\mathbf{x}^k) \Delta \mathbf{x} + \mathbf{g}(\mathbf{x}^k) \leq \mathbf{0}, \\ & \nabla \mathbf{h}(\mathbf{x}^k) \Delta \mathbf{x} + \mathbf{h}(\mathbf{x}^k) = \mathbf{0}. \end{aligned} \end{split}\]
b. Solve the QP Subproblem: Solve for \(\Delta \mathbf{x}^k\) and obtain the updated Lagrange multipliers \(\boldsymbol{\lambda}^{k+1}\) and \(\boldsymbol{\nu}^{k+1}\).
c. Update the Estimates: Update the primal variables and multipliers:
\[ \mathbf{x}^{k+1} = \mathbf{x}^k + \Delta \mathbf{x}^k. \]
d. Check for Convergence: If \(\|\Delta \mathbf{x}^k\| < \epsilon\) and the KKT conditions are satisfied, stop. Otherwise, set \(k = k + 1\) and repeat.
Return: \(\mathbf{x}^* = \mathbf{x}^{k+1}, \boldsymbol{\lambda}^* = \boldsymbol{\lambda}^{k+1}, \boldsymbol{\nu}^* = \boldsymbol{\nu}^{k+1}\).
3.2.3.1. Demonstration with JAX and CVXPY#
Consider the following equality and inequality-constrained problem:
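\[\begin{split} \begin{aligned} \text{Minimize} \quad & (x_1 - 2)^2 + (x_2 - 1)^2 \\ \text{subject to} \quad & x_1^2 + x_2^2 - 1 = 0, \\ & x_2 - x_1^2 \leq 0. \end{aligned} \end{split}\]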
This example builds on our previous one but adds a parabola-shaped inequality constraint. We require our solution to lie not only on the circle defining our equality constraint but also below the parabola. To solve the QP subproblem, we will be using the CVXPY package. While the Lagrangian and derivatives could be computed easily by hand, we use JAX for generality:
Show code cell source
import jax
import jax.numpy as jnp
from jax import grad, jit, jacfwd, hessian
import numpy as np
import cvxpy as cp
import matplotlib.pyplot as plt
# Define the objective function and constraints
@jit
def f(x):
    return (x[0] - 2)**2 + (x[1] - 1)**2

@jit
def g(x):
    # Equality constraint: the unit circle x1^2 + x2^2 = 1
    return jnp.array([x[0]**2 + x[1]**2 - 1])

@jit
def h(x):
    # Inequality constraint: x[1] <= x[0]^2 (solution must lie below the parabola)
    return jnp.array([x[1] - x[0]**2])
# Compute gradients and Jacobians using JAX
grad_f = jit(grad(f))
hess_f = jit(hessian(f))
jac_g = jit(jacfwd(g))
jac_h = jit(jacfwd(h))
@jit
def lagrangian(x, lambda_, nu):
    return f(x) + jnp.dot(lambda_, g(x)) + jnp.dot(nu, h(x))
hess_L = jit(hessian(lagrangian, argnums=0))
def solve_qp_subproblem(x, lambda_, nu):
    n = len(x)
    delta_x = cp.Variable(n)
    # Convert JAX arrays to numpy for cvxpy
    grad_f_np = np.array(grad_f(x))
    hess_L_np = np.array(hess_L(x, lambda_, nu))
    jac_g_np = np.array(jac_g(x))
    jac_h_np = np.array(jac_h(x))
    g_np = np.array(g(x))
    h_np = np.array(h(x))
    obj = cp.Minimize(grad_f_np.T @ delta_x + 0.5 * cp.quad_form(delta_x, hess_L_np))
    constraints = [
        jac_g_np @ delta_x + g_np == 0,
        jac_h_np @ delta_x + h_np <= 0
    ]
    prob = cp.Problem(obj, constraints)
    prob.solve()
    return delta_x.value, prob.constraints[0].dual_value, prob.constraints[1].dual_value
def sqp(x0, max_iter=100, tol=1e-6):
    x = x0
    lambda_ = jnp.zeros(1)
    nu = jnp.zeros(1)
    for i in range(max_iter):
        delta_x, new_lambda, new_nu = solve_qp_subproblem(x, lambda_, nu)
        if jnp.linalg.norm(delta_x) < tol:
            break
        x = x + delta_x
        lambda_ = new_lambda
        nu = new_nu
    return x, lambda_, nu, i+1
# Initial point
x0 = jnp.array([0.5, 0.5])
# Solve using SQP
x_opt, lambda_opt, nu_opt, iterations = sqp(x0)
print(f"Optimal x: {x_opt}")
print(f"Optimal lambda: {lambda_opt}")
print(f"Optimal nu: {nu_opt}")
print(f"Iterations: {iterations}")
# Visualize the result
plt.figure(figsize=(12, 10))
# Create a mesh for the contour plot
x1_range = jnp.linspace(-1.5, 2.5, 100)
x2_range = jnp.linspace(-1.5, 2.5, 100)
X1, X2 = jnp.meshgrid(x1_range, x2_range)
Z = jnp.array([[f(jnp.array([x1, x2])) for x1 in x1_range] for x2 in x2_range])
# Plot filled contours
contour = plt.contourf(X1, X2, Z, levels=50, cmap='viridis', alpha=0.7)
plt.colorbar(contour, label='Objective Function Value')
# Plot the equality constraint
theta = jnp.linspace(0, 2*jnp.pi, 100)
x1_eq = jnp.cos(theta)
x2_eq = jnp.sin(theta)
plt.plot(x1_eq, x2_eq, color='red', linewidth=2, label='Equality Constraint')
# Plot the inequality constraint and shade the feasible region
x1_ineq = jnp.linspace(-1.5, 2.5, 100)
x2_ineq = x1_ineq**2
plt.plot(x1_ineq, x2_ineq, color='orange', linewidth=2, label='Inequality Constraint')
# Shade the feasible region for the inequality constraint
x2_lower = jnp.minimum(x2_ineq, 2.5)
plt.fill_between(x1_ineq, -1.5, x2_lower, color='gray', alpha=0.2, hatch='\\/...', label='Feasible Region')
# Plot the optimal and initial points
plt.scatter(x_opt[0], x_opt[1], color='red', s=100, edgecolor='white', linewidth=2, label='Optimal Point')
plt.scatter(x0[0], x0[1], color='green', s=100, edgecolor='white', linewidth=2, label='Initial Point')
# Add labels and title
plt.xlabel('x1', fontsize=12)
plt.ylabel('x2', fontsize=12)
plt.title('SQP for Inequality Constraints with CVXPY and JAX', fontsize=14)
plt.legend(fontsize=10, loc='upper center')
plt.grid(True, linestyle='--', alpha=0.7)
# Set the axis limits explicitly
plt.xlim(-1.5, 2.5)
plt.ylim(-1.5, 2.5)
plt.tight_layout()
plt.show()
# Verify the result
print(f"\nEquality constraint violation: {g(x_opt)[0]:.6f}")
print(f"Inequality constraint violation: {h(x_opt)[0]:.6f}")
print(f"Objective function value: {f(x_opt):.6f}")
3.2.4. The Arrow-Hurwicz-Uzawa algorithm#
While the SQP method addresses constrained optimization problems by sequentially solving quadratic subproblems, an alternative approach emerges from viewing constrained optimization as a min-max problem. This perspective leads to a simpler algorithm, originally introduced by Arrow, Hurwicz, and Uzawa [3]. Consider the following general constrained optimization problem encompassing both equality and inequality constraints:
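\[\begin{split} \begin{aligned} \text{Minimize} \quad & f(\mathbf{x}) \\ \text{subject to} \quad & \mathbf{g}(\mathbf{x}) \leq \mathbf{0}, \\ & \mathbf{h}(\mathbf{x}) = \mathbf{0}. \end{aligned} \end{split}\]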
Using the Lagrangian function \(L(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\mu}) = f(\mathbf{x}) + \boldsymbol{\mu}^T \mathbf{g}(\mathbf{x}) + \boldsymbol{\lambda}^T \mathbf{h}(\mathbf{x})\), we can reformulate this problem as the following min-max problem:
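\[ \min_{\mathbf{x}} \; \max_{\boldsymbol{\lambda}, \, \boldsymbol{\mu} \geq \mathbf{0}} \; L(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\mu}). \]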
The role of each component in this min-max structure can be understood as follows:
The outer minimization over \(\mathbf{x}\) finds the feasible point that minimizes the objective function \(f(\mathbf{x})\).
The maximization over \(\boldsymbol{\mu} \geq 0\) ensures that inequality constraints \(\mathbf{g}(\mathbf{x}) \leq \mathbf{0}\) are satisfied. If any inequality constraint is violated, the corresponding term in \(\boldsymbol{\mu}^T \mathbf{g}(\mathbf{x})\) can be made arbitrarily large by choosing a large enough \(\mu_i\).
The maximization over \(\boldsymbol{\lambda}\) ensures that equality constraints \(\mathbf{h}(\mathbf{x}) = \mathbf{0}\) are satisfied.
Using this observation, we can devise an algorithm which, like SQP, updates both the primal and dual variables at every step. Rather than using second-order optimization, however, we simply take first-order gradient steps: a descent step in the primal variable and an ascent step in the dual one. This procedure is known as Gradient Descent Ascent (GDA) in the learning and optimization communities. In the case of equality constraints only, the algorithm looks like the following:
(Arrow-Hurwicz-Uzawa for equality constraints only)
Input: Initial guess \(\mathbf{x}^0\), \(\boldsymbol{\lambda}^0\), step sizes \(\alpha\), \(\beta\)
Output: Optimal \(\mathbf{x}^*\), \(\boldsymbol{\lambda}^*\)
1: for \(k = 0, 1, 2, \ldots\) until convergence do
2: \(\mathbf{x}^{k+1} = \mathbf{x}^k - \alpha \nabla_{\mathbf{x}} L(\mathbf{x}^k, \boldsymbol{\lambda}^k)\) (Primal update)
3: \(\boldsymbol{\lambda}^{k+1} = \boldsymbol{\lambda}^k + \beta \nabla_{\boldsymbol{\lambda}} L(\mathbf{x}^{k+1}, \boldsymbol{\lambda}^k)\) (Dual update)
4: end for
5: return \(\mathbf{x}^k\), \(\boldsymbol{\lambda}^k\)
Now, to account for the fact that the Lagrange multipliers of the inequality constraints must remain non-negative, we can reuse our previous idea from projected gradient descent for bound constraints and add a projection, or clipping, step to ensure that this condition is satisfied throughout. In this case, the algorithm looks like the following:
(Arrow-Hurwicz-Uzawa for equality and inequality constraints)
Input: Initial guess \(\mathbf{x}^0\), \(\boldsymbol{\lambda}^0\), \(\boldsymbol{\mu}^0 \geq 0\), step sizes \(\alpha\), \(\beta\), \(\gamma\)
Output: Optimal \(\mathbf{x}^*\), \(\boldsymbol{\lambda}^*\), \(\boldsymbol{\mu}^*\)
1: for \(k = 0, 1, 2, \ldots\) until convergence do
2: \(\mathbf{x}^{k+1} = \mathbf{x}^k - \alpha \nabla_{\mathbf{x}} L(\mathbf{x}^k, \boldsymbol{\lambda}^k, \boldsymbol{\mu}^k)\) (Primal update)
3: \(\boldsymbol{\lambda}^{k+1} = \boldsymbol{\lambda}^k + \beta \nabla_{\boldsymbol{\lambda}} L(\mathbf{x}^{k+1}, \boldsymbol{\lambda}^k, \boldsymbol{\mu}^k)\) (Dual update for equality constraints)
4: \(\boldsymbol{\mu}^{k+1} = [\boldsymbol{\mu}^k + \gamma \nabla_{\boldsymbol{\mu}} L(\mathbf{x}^{k+1}, \boldsymbol{\lambda}^k, \boldsymbol{\mu}^k)]_+\) (Dual update with clipping for inequality constraints)
5: end for
6: return \(\mathbf{x}^k\), \(\boldsymbol{\lambda}^k\), \(\boldsymbol{\mu}^k\)
Here, \([\cdot]_+\) denotes the projection onto the non-negative orthant, ensuring that \(\boldsymbol{\mu}\) remains non-negative.
However, as is widely known from the lessons of GAN (Generative Adversarial Network) training [12], Gradient Descent Ascent (GDA) can fail to converge or suffer from instability. The Arrow-Hurwicz-Uzawa algorithm, also known as the first-order Lagrangian method, is known to converge only locally, in the vicinity of an optimal primal-dual pair.
Show code cell source
import jax
import jax.numpy as jnp
from jax import grad, jit, value_and_grad
import optax
import matplotlib.pyplot as plt
# Define the objective function and constraints
@jit
def f(x):
    return (x[0] - 2)**2 + (x[1] - 1)**2

@jit
def g(x):
    # Equality constraint: the unit circle x1^2 + x2^2 = 1
    return jnp.array([x[0]**2 + x[1]**2 - 1])

@jit
def h(x):
    # Inequality constraint: x[1] <= x[0]^2 (solution must lie below the parabola)
    return jnp.array([x[1] - x[0]**2])
# Define the Lagrangian
@jit
def lagrangian(x, lambda_, mu):
    return f(x) + jnp.dot(lambda_, g(x)) + jnp.dot(mu, h(x))
# Compute gradients of the Lagrangian
grad_L_x = jit(grad(lagrangian, argnums=0))
grad_L_lambda = jit(grad(lagrangian, argnums=1))
grad_L_mu = jit(grad(lagrangian, argnums=2))
# Define the Arrow-Hurwicz-Uzawa update step
@jit
def update(carry, t):
    x, lambda_, mu, opt_state_x, opt_state_lambda, opt_state_mu = carry
    # Compute gradients
    grad_x = grad_L_x(x, lambda_, mu)
    grad_lambda = grad_L_lambda(x, lambda_, mu)
    grad_mu = grad_L_mu(x, lambda_, mu)
    # Update primal variables (minimization)
    updates_x, opt_state_x = optimizer_x.update(grad_x, opt_state_x)
    x = optax.apply_updates(x, updates_x)
    # Update dual variables (maximization)
    updates_lambda, opt_state_lambda = optimizer_lambda.update(grad_lambda, opt_state_lambda)
    lambda_ = optax.apply_updates(lambda_, -updates_lambda)  # Positive update for maximization
    updates_mu, opt_state_mu = optimizer_mu.update(grad_mu, opt_state_mu)
    mu = optax.apply_updates(mu, -updates_mu)  # Positive update for maximization
    # Project mu onto the non-negative orthant
    mu = jnp.maximum(mu, 0)
    return (x, lambda_, mu, opt_state_x, opt_state_lambda, opt_state_mu), x
def arrow_hurwicz_uzawa(x0, lambda0, mu0, max_iter=1000):
    # Initialize optimizers
    global optimizer_x, optimizer_lambda, optimizer_mu
    optimizer_x = optax.adam(learning_rate=0.01)
    optimizer_lambda = optax.adam(learning_rate=0.01)
    optimizer_mu = optax.adam(learning_rate=0.01)
    opt_state_x = optimizer_x.init(x0)
    opt_state_lambda = optimizer_lambda.init(lambda0)
    opt_state_mu = optimizer_mu.init(mu0)
    init_carry = (x0, lambda0, mu0, opt_state_x, opt_state_lambda, opt_state_mu)
    # Use jax.lax.scan for the optimization loop
    (x, lambda_, mu, _, _, _), trajectory = jax.lax.scan(update, init_carry, jnp.arange(max_iter))
    return x, lambda_, mu, trajectory
# Initial point
x0 = jnp.array([0.5, 0.5])
lambda0 = jnp.zeros(1)
mu0 = jnp.zeros(1)
# Solve using Arrow-Hurwicz-Uzawa
x_opt, lambda_opt, mu_opt, trajectory = arrow_hurwicz_uzawa(x0, lambda0, mu0, max_iter=1000)
print(f"Final x: {x_opt}")
print(f"Final lambda: {lambda_opt}")
print(f"Final mu: {mu_opt}")
# Visualize the result
plt.figure(figsize=(12, 10))
# Create a mesh for the contour plot
x1_range = jnp.linspace(-1.5, 2.5, 100)
x2_range = jnp.linspace(-1.5, 2.5, 100)
X1, X2 = jnp.meshgrid(x1_range, x2_range)
Z = jnp.array([[f(jnp.array([x1, x2])) for x1 in x1_range] for x2 in x2_range])
# Plot filled contours
contour = plt.contourf(X1, X2, Z, levels=50, cmap='viridis', alpha=0.7)
plt.colorbar(contour, label='Objective Function Value')
# Plot the equality constraint
theta = jnp.linspace(0, 2*jnp.pi, 100)
x1_eq = jnp.cos(theta)
x2_eq = jnp.sin(theta)
plt.plot(x1_eq, x2_eq, color='red', linewidth=2, label='Equality Constraint')
# Plot the inequality constraint and shade the feasible region
x1_ineq = jnp.linspace(-1.5, 2.5, 100)
x2_ineq = x1_ineq**2
plt.plot(x1_ineq, x2_ineq, color='orange', linewidth=2, label='Inequality Constraint')
# Shade the feasible region for the inequality constraint
x2_lower = jnp.minimum(x2_ineq, 2.5)
plt.fill_between(x1_ineq, -1.5, x2_lower, color='gray', alpha=0.2, hatch='\\/...', label='Feasible Region')
# Plot the optimal and initial points
plt.scatter(x_opt[0], x_opt[1], color='red', s=100, edgecolor='white', linewidth=2, label='Final Point')
plt.scatter(x0[0], x0[1], color='green', s=100, edgecolor='white', linewidth=2, label='Initial Point')
# Plot the optimization trajectory using scatter plot
scatter = plt.scatter(trajectory[:, 0], trajectory[:, 1], c=jnp.arange(len(trajectory)),
cmap='cool', s=10, alpha=0.5)
plt.colorbar(scatter, label='Iteration')
# Add labels and title
plt.xlabel('x1', fontsize=12)
plt.ylabel('x2', fontsize=12)
plt.title('Arrow-Hurwicz-Uzawa Algorithm with JAX and Adam', fontsize=14)
plt.legend(fontsize=10, loc='upper center', bbox_to_anchor=(0.5, -0.05), ncol=3)
plt.grid(True, linestyle='--', alpha=0.7)
# Set the axis limits explicitly
plt.xlim(-1.5, 2.5)
plt.ylim(-1.5, 2.5)
plt.tight_layout()
plt.show()
# Verify the result
print(f"\nEquality constraint violation: {g(x_opt)[0]:.6f}")
print(f"Inequality constraint violation: {h(x_opt)[0]:.6f}")
print(f"Objective function value: {f(x_opt):.6f}")
4. The Discrete-Time Pontryagin Maximum Principle#
Discrete-time optimal control problems (DOCPs) form a specific class of nonlinear programming problems. Therefore, we can apply the general results from the Karush-Kuhn-Tucker (KKT) conditions to characterize the structure of optimal solutions to DOCPs in any of their three forms. The discrete-time analogue of the KKT conditions for DOCPs is known as the discrete-time Pontryagin Maximum Principle (PMP). The PMP was first described by Pontryagin in 1956 [32] for continuous-time systems, with the discrete-time version following shortly after. Similar to the KKT conditions, the PMP is useful from both theoretical and practical perspectives. It not only allows us to sometimes find closed-form solutions but also inspires the development of algorithms.
Importantly, the PMP goes beyond the KKT conditions by demonstrating the existence of a particular recursive equation—the adjoint equation. This equation governs the evolution of the derivative of the Hamiltonian, a close cousin to the Lagrangian. The adjoint equation enables us to transform the PMP into an algorithmic procedure, which has much in common with backpropagation [36] in deep learning. This connection between optimal control and deep learning has been noted by several researchers, including Griewank [16] in the context of automatic differentiation, and LeCun [26] in his early work on neural networks.
4.1. PMP for Mayer Problems#
Before delving into more general cases, let’s consider a Mayer problem where the goal is to minimize a terminal cost function \(c_T(\mathbf{x}_T)\):
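\[\begin{split} \begin{aligned} \text{Minimize} \quad & c_T(\mathbf{x}_T) \\ \text{subject to} \quad & \mathbf{x}_{t+1} = \mathbf{f}_t(\mathbf{x}_t, \mathbf{u}_t), \quad t = 1, \ldots, T-1, \\ & \mathbf{x}_{\text{lb}} \leq \mathbf{x}_t \leq \mathbf{x}_{\text{ub}}, \quad t = 1, \ldots, T, \\ & \text{given } \mathbf{x}_1. \end{aligned} \end{split}\]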
As done previously using the single shooting method, we reformulate this problem as an unconstrained optimization problem (excluding the state bound constraints since we lack a straightforward way to incorporate them directly). This reformulation is:
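\[ \min_{\mathbf{u}_{1:T-1}} \; J(\mathbf{u}_{1:T-1}) \triangleq c_T\big(\boldsymbol{\phi}_T(\mathbf{u}_{1:T-1})\big), \]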
where the state evolution functions \(\boldsymbol{\phi}_t\) are defined recursively as:
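\[ \boldsymbol{\phi}_1 = \mathbf{x}_1, \qquad \boldsymbol{\phi}_{t+1} = \mathbf{f}_t(\boldsymbol{\phi}_t, \mathbf{u}_t), \quad t = 1, \ldots, T-1. \]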
To find the first-order optimality condition, we differentiate the objective function \(J(\mathbf{u}_{1:T-1})\) with respect to each control variable \(\mathbf{u}_t\) and set it to zero:
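\[ \frac{\partial J(\mathbf{u}_{1:T-1})}{\partial \mathbf{u}_t} = 0, \quad t = 1, \ldots, T-1. \]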
Applying the chain rule, we get:
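\[ \frac{\partial J}{\partial \mathbf{u}_t} = \nabla c_T(\mathbf{x}_T)^T \frac{\partial \boldsymbol{\phi}_T}{\partial \mathbf{u}_t} = 0. \]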
Now, let’s expand the derivative \(\frac{\partial \boldsymbol{\phi}_T}{\partial \mathbf{u}_t}\) using its non-recursive form. From the definition of the state evolution functions, we have:
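\[ \frac{\partial \boldsymbol{\phi}_T}{\partial \mathbf{u}_t} = \frac{\partial \mathbf{f}_{T-1}}{\partial \mathbf{x}_{T-1}} \frac{\partial \mathbf{f}_{T-2}}{\partial \mathbf{x}_{T-2}} \cdots \frac{\partial \mathbf{f}_{t+1}}{\partial \mathbf{x}_{t+1}} \frac{\partial \mathbf{f}_t}{\partial \mathbf{u}_t}, \]
with each Jacobian evaluated along the trajectory generated by \(\mathbf{u}_{1:T-1}\).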
The above can also be written more recursively. For \(s \geq t\), the derivative of \(\boldsymbol{\phi}_s\) with respect to \(\mathbf{u}_t\) is:
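\[ \frac{\partial \boldsymbol{\phi}_{s+1}}{\partial \mathbf{u}_t} = \frac{\partial \mathbf{f}_s}{\partial \mathbf{x}_s} \frac{\partial \boldsymbol{\phi}_s}{\partial \mathbf{u}_t}, \quad s > t, \]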
and
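\[ \frac{\partial \boldsymbol{\phi}_{t+1}}{\partial \mathbf{u}_t} = \frac{\partial \mathbf{f}_t}{\partial \mathbf{u}_t}. \]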
The overall derivative is then of the form:
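\[ \frac{\partial J}{\partial \mathbf{u}_t} = \boldsymbol{\lambda}_{t+1}^T \frac{\partial \mathbf{f}_t}{\partial \mathbf{u}_t}, \qquad \boldsymbol{\lambda}_{t+1}^T = \nabla c_T(\mathbf{x}_T)^T \frac{\partial \mathbf{f}_{T-1}}{\partial \mathbf{x}_{T-1}} \cdots \frac{\partial \mathbf{f}_{t+1}}{\partial \mathbf{x}_{t+1}}, \]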
where \(\boldsymbol{\lambda}_t\) is called the adjoint (co-state) variable and accumulates the derivative in reverse, from the terminal time backward. The evolution of this variable also obeys a difference equation, but one which runs backward in time: the adjoint equation. The recursive relationship for the adjoint equation is then:
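\[ \boldsymbol{\lambda}_t = \left( \frac{\partial \mathbf{f}_t}{\partial \mathbf{x}_t} \right)^T \boldsymbol{\lambda}_{t+1}, \]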
with the terminal condition:
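\[ \boldsymbol{\lambda}_T = \nabla c_T(\mathbf{x}_T). \]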
The first-order optimality condition in terms of the adjoint variable can finally be written as:
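\[ \frac{\partial J}{\partial \mathbf{u}_t} = \left( \frac{\partial \mathbf{f}_t}{\partial \mathbf{u}_t} \right)^T \boldsymbol{\lambda}_{t+1} = \mathbf{0}, \quad t = 1, \ldots, T-1. \]
To make the backward structure explicit, here is a minimal JAX sketch (separate from the examples above) of the adjoint recursion for a Mayer problem. The dynamics f_t and terminal cost c_T below are placeholder choices used purely for illustration; the sketch computes the gradient of the objective with respect to each control via the adjoint equation and checks it against direct automatic differentiation of the unrolled objective.
import jax
import jax.numpy as jnp

def f_t(x, u):
    # Placeholder dynamics (illustrative only)
    return jnp.array([x[0] + 0.1 * x[1], x[1] + 0.1 * (u[0] - jnp.sin(x[0]))])

def c_T(x):
    # Placeholder terminal cost (illustrative only)
    return jnp.sum((x - jnp.array([1.0, 0.0]))**2)

def rollout(x1, us):
    # Forward pass: x_{t+1} = f_t(x_t, u_t)
    xs = [x1]
    for u in us:
        xs.append(f_t(xs[-1], u))
    return xs

def adjoint_gradients(x1, us):
    # Backward pass implementing the adjoint recursion
    xs = rollout(x1, us)
    lam = jax.grad(c_T)(xs[-1])                         # terminal condition: lambda_T = grad c_T(x_T)
    grads = [None] * len(us)
    for t in reversed(range(len(us))):
        A = jax.jacobian(f_t, argnums=0)(xs[t], us[t])  # df_t/dx_t
        B = jax.jacobian(f_t, argnums=1)(xs[t], us[t])  # df_t/du_t
        grads[t] = B.T @ lam                            # dJ/du_t = (df_t/du_t)^T lambda_{t+1}
        lam = A.T @ lam                                 # adjoint equation: lambda_t = (df_t/dx_t)^T lambda_{t+1}
    return grads

x1 = jnp.zeros(2)
us = [jnp.array([0.1]), jnp.array([-0.2]), jnp.array([0.3])]
adjoint = adjoint_gradients(x1, us)
# Sanity check against direct differentiation of the unrolled objective
direct = jax.grad(lambda us: c_T(rollout(x1, us)[-1]))(us)
print([bool(jnp.allclose(a, d, atol=1e-5)) for a, d in zip(adjoint, direct)])
This is the same computation that reverse-mode automatic differentiation performs when we call jax.grad on the unrolled objective, which is why the adjoint method and backpropagation coincide here.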
4.2. PMP for Bolza Problems#
To derive the adjoint equation for the Bolza problem, we consider the optimal control problem where the objective is to minimize both a terminal cost \(c_T(\mathbf{x}_T)\) and the sum of intermediate costs \(c_t(\mathbf{x}_t, \mathbf{u}_t)\):
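\[\begin{split} \begin{aligned} \text{Minimize} \quad & c_T(\mathbf{x}_T) + \sum_{t=1}^{T-1} c_t(\mathbf{x}_t, \mathbf{u}_t) \\ \text{subject to} \quad & \mathbf{x}_{t+1} = \mathbf{f}_t(\mathbf{x}_t, \mathbf{u}_t), \quad t = 1, \ldots, T-1, \\ & \text{given } \mathbf{x}_1. \end{aligned} \end{split}\]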
To handle the constraints, we introduce the Lagrangian function with multipliers \(\boldsymbol{\lambda}_t\) for each constraint \(\mathbf{x}_{t+1} = \mathbf{f}_t(\mathbf{x}_t, \mathbf{u}_t)\):
The existence of an optimal constrained solution \((\mathbf{x}^\star, \mathbf{u}^\star)\) implies that there exists a unique set of Lagrange multipliers \(\boldsymbol{\lambda}_t^\star\) such that the derivative of the Lagrangian with respect to all variables equals zero: \(\nabla L(\mathbf{x}^\star, \mathbf{u}^\star, \boldsymbol{\lambda}^\star) = 0.\)
To simplify, we rearrange the Lagrangian so that each state variable \(\mathbf{x}_t\) appears only once in the summation:
Note that by adding and subtracting, we can write:
Substituting this back into the Lagrangian gives:
By differentiating the Lagrangian with respect to each state \(\mathbf{x}_i\), we obtain:
We finally obtain the adjoint equation by setting the above expression to zero at an optimal primal-dual pair, and re-arranging the terms:
The optimality condition for the controls is obtained by differentiating the Lagrangian with respect to each control \(\mathbf{u}_t\) and evaluating it at the optimum:
As expected from the general theory of constrained optimization, we finally recover the fact that the constraints must be satisfied at an optimal solution:
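\[ \mathbf{x}^\star_{t+1} = \mathbf{f}_t(\mathbf{x}^\star_t, \mathbf{u}^\star_t), \quad t = 1, \ldots, T-1. \]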