
Smooth Bellman Optimality Equations

Dynamic programming methods suffer from the curse of dimensionality and can quickly become difficult to apply in practice. We may also be dealing with large or continuous state or action spaces. We have seen so far that we could address this problem using discretization or interpolation; these were already examples of approximate dynamic programming. In this chapter, we will see other forms of approximation meant to facilitate the optimization problem, whether by approximating the optimality equations, the value function, or the policy itself. Approximation theory is at the heart of learning methods, and fundamentally, this chapter is about the application of learning ideas to solve complex decision-making problems.

Smooth Bellman Optimality Equations

While the standard Bellman optimality equations use the max operator to determine the best action, an alternative formulation known as the smooth or soft Bellman optimality equations replaces this with a softmax operator. This approach originated from Rust (1987) and was later rediscovered in the context of maximum entropy inverse reinforcement learning Ziebart et al. (2008), which then led to soft Q-learning Haarnoja et al. (2017) and soft actor-critic Haarnoja et al. (2018), a state-of-the-art deep reinforcement learning algorithm.

In the infinite-horizon setting, the smooth Bellman optimality equations take the form:

$$v_\gamma^\star(s) = \frac{1}{\beta} \log \sum_{a \in \mathcal{A}_s} \exp\left(\beta\left(r(s, a) + \gamma \sum_{j \in \mathcal{S}} p(j | s, a) v_\gamma^\star(j)\right)\right)$$

Adopting an operator-theoretic perspective, we can define a nonlinear operator $\mathrm{L}_\beta$ such that the smooth value function of an MDP is then the solution to the following fixed-point equation:

$$(\mathrm{L}_\beta \mathbf{v})(s) = \frac{1}{\beta} \log \sum_{a \in \mathcal{A}_s} \exp\left(\beta\left(r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a) v(j)\right)\right)$$

As $\beta \to \infty$, $\mathrm{L}_\beta$ converges to the standard Bellman operator $\mathrm{L}$. Furthermore, it can be shown that the smooth Bellman operator is a contraction mapping in the supremum norm, and therefore has a unique fixed point. However, as opposed to the usual “hard” setting, the fixed point of $\mathrm{L}_\beta$ is associated with the value function of an optimal stochastic policy defined by the softmax distribution:

$$\pi(a|s) = \frac{\exp\left(\beta\left(r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a) v_\gamma^\star(j)\right)\right)}{\sum_{a' \in \mathcal{A}_s} \exp\left(\beta\left(r(s,a') + \gamma \sum_{j \in \mathcal{S}} p(j|s,a') v_\gamma^\star(j)\right)\right)}$$

Despite the confusing terminology, the above “softmax” policy is simply the smooth counterpart to the argmax operator in the original optimality equation: it acts as a soft-argmax.
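
To make the smoothing concrete, here is a small numerical sketch (not part of the original text) of the two operators at a single state: the log-sum-exp soft maximum and the softmax soft argmax. The action values in `q` are made up for illustration.

```python
import numpy as np

def smooth_max(q, beta):
    """(1/beta) * log-sum-exp(beta * q), computed in a numerically stable way."""
    q = np.asarray(q, dtype=float)
    m = q.max()
    return m + np.log(np.exp(beta * (q - m)).sum()) / beta

def softmax_policy(q, beta):
    """Boltzmann/softmax distribution over actions with inverse temperature beta."""
    z = beta * (np.asarray(q, dtype=float) - np.max(q))
    w = np.exp(z)
    return w / w.sum()

q = np.array([1.0, 0.8, -0.5])   # hypothetical action values at one state
for beta in [0.5, 2.0, 10.0, 100.0]:
    print(beta, smooth_max(q, beta), softmax_policy(q, beta).round(3))
# As beta grows, smooth_max(q, beta) approaches max(q) = 1.0 and the softmax
# concentrates on the argmax, recovering the "hard" operators.
```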

This formulation is interesting for several reasons. First, smoothness is a desirable property from an optimization standpoint. Unlike $\gamma$, we view $\beta$ as a hyperparameter of our algorithm, which we can control to achieve the desired level of accuracy.

Second, while presented from an intuitive standpoint where we replace the max by the log-sum-exp (a smooth maximum) and the argmax by the softmax (a smooth argmax), this formulation can also be obtained from several other perspectives, each offering its own theoretical tools and solution methods. For example, Rust (1987) derived this formulation by considering a setting in which the rewards are perturbed by Gumbel-distributed noise. When working on the corresponding augmented state space and integrating out the noise, we obtain the smooth equations. Rust leveraged this interpretation for modeling purposes.

Smooth Value Iteration Algorithm

The smooth value iteration algorithm replaces the max operator in standard value iteration with the logsumexp operator: starting from an arbitrary $\mathbf{v}_0$, we repeatedly apply $\mathbf{v}_{n+1} = \mathrm{L}_\beta \mathbf{v}_n$ until successive iterates are sufficiently close, and then extract the softmax policy from the final value function.
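
The published algorithm box is not reproduced here; the following is a minimal sketch of the iteration, assuming a tabular MDP stored as NumPy arrays (the transition tensor `P`, reward matrix `R`, and the tiny two-state example are all invented for illustration).

```python
import numpy as np

def smooth_value_iteration(R, P, gamma, beta, tol=1e-8, max_iter=10_000):
    """Successive approximation with the smooth Bellman operator L_beta.

    R: (S, A) reward array r(s, a)
    P: (S, A, S) transition array p(j | s, a)
    Returns the smooth optimal values v and the softmax policy pi of shape (S, A).
    """
    S, A = R.shape
    v = np.zeros(S)
    for _ in range(max_iter):
        q = R + gamma * P @ v                 # q(s, a) = r(s,a) + gamma * sum_j p(j|s,a) v(j)
        m = q.max(axis=1, keepdims=True)
        v_new = (m + np.log(np.exp(beta * (q - m)).sum(axis=1, keepdims=True)) / beta).ravel()
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    q = R + gamma * P @ v
    w = np.exp(beta * (q - q.max(axis=1, keepdims=True)))
    pi = w / w.sum(axis=1, keepdims=True)
    return v, pi

# Tiny synthetic MDP (2 states, 2 actions) for illustration only.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
v, pi = smooth_value_iteration(R, P, gamma=0.95, beta=5.0)
print(v, pi.round(3))
```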

Differences from standard value iteration:

  • the hard max over actions is replaced by the $\frac{1}{\beta} \log \sum_a \exp(\beta\,\cdot)$ operator, so each backup is a smooth (differentiable) function of the action values;
  • the iterates converge to the unique fixed point of $\mathrm{L}_\beta$ rather than of the hard Bellman operator;
  • the policy extracted at the end is the stochastic softmax (soft-argmax) policy rather than a deterministic argmax;
  • as $\beta \to \infty$, both the updates and the extracted policy recover standard value iteration.

There is also a way to obtain this equation by starting from the energy-based formulation often used in supervised learning, in which an unnormalized probability distribution is turned into a normalized one through the softmax transformation. This is essentially what Ziebart et al. (2008) did in their paper. Furthermore, this perspective bridges to the literature on probabilistic graphical models, in which the problem of finding an optimal smooth policy can be cast as maximum likelihood estimation (an inference problem). This is the idea of control as inference, which also admits the converse, inference as control, used nowadays to derive fast samplers and amortized inference techniques with reinforcement learning Levine (2018).

Finally, it’s worth noting that we can also derive this form from an entropy-regularized formulation in which the reward is augmented with a bonus for the entropy of the policy. This formulation admits a solution that coincides with the smooth Bellman equations Haarnoja et al. (2017).

Alternative Soft Maximum: Gaussian Uncertainty-Weighted Aggregation

The logsumexp operator provides one way to soften the hard maximum, but alternative approaches exist. When Q-value estimates have heterogeneous uncertainty (some actions estimated more precisely than others), we can weight actions by the probability they are optimal under a Gaussian uncertainty model. D'Eramo et al. (2016) proposed computing weights as:

$$w_{a'} = \int_{-\infty}^{+\infty} \phi\left(\frac{x - \hat{\mu}_{a'}}{\hat{\sigma}_{a'}/\sqrt{n}}\right) \prod_{b \neq a'} \Phi\left(\frac{x - \hat{\mu}_b}{\hat{\sigma}_b/\sqrt{n}}\right) dx$$

where $\hat{\mu}_{a'}$ and $\hat{\sigma}_{a'}$ are the sample mean and standard deviation of the Q-value estimates, $n$ is the sample size, and $\phi$, $\Phi$ are the standard normal PDF and CDF. The soft Bellman target becomes $v(s') = \sum_{a'} w_{a'} q(s', a')$, a probability-weighted expectation rather than a hard maximum.

This differs from logsumexp in that it adapts to state-action-specific uncertainty (actions with tighter confidence intervals receive more weight), whereas logsumexp applies uniform smoothing via temperature $\beta$. The Gaussian-weighted approach requires maintaining variance estimates and computing integrals, making it more expensive than logsumexp. However, it provides a principled way to reduce overestimation bias in Q-learning while avoiding the pessimism of double Q-learning. We return to this estimator in the simulation-based methods chapter when discussing overestimation bias mitigation strategies.
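
As a sketch of how such weights could be computed numerically, the snippet below integrates the probability that each action's true value is largest, using the normalized Gaussian density for each estimate so that the weights sum to one (the means, standard deviations, and sample size are made up; `optimality_weights` is a hypothetical helper, not code from D'Eramo et al.).

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def optimality_weights(mu, sigma, n):
    """Probability that each action's true value is the largest, assuming
    independent Gaussian estimates with mean mu[a] and std sigma[a]/sqrt(n)."""
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    se = sigma / np.sqrt(n)                      # standard errors of the Q estimates
    w = np.empty_like(mu)
    for a in range(len(mu)):
        others = [b for b in range(len(mu)) if b != a]
        integrand = lambda x: norm.pdf(x, mu[a], se[a]) * np.prod(
            [norm.cdf(x, mu[b], se[b]) for b in others])
        w[a], _ = quad(integrand, -np.inf, np.inf)
    return w / w.sum()                           # guard against quadrature error

mu = np.array([1.0, 0.9, 0.2])       # sample means of q(s', a')
sigma = np.array([0.5, 0.1, 0.3])    # sample standard deviations
w = optimality_weights(mu, sigma, n=20)
print(w.round(3), "soft target:", float(w @ mu))
```

The per-state quadrature over actions is what makes this approach noticeably more expensive than a single logsumexp, as noted above.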

Gumbel Noise on the Rewards

We can obtain the smooth Bellman equation by considering a setting in which Gumbel noise is added to the reward function. This derivation provides theoretical insight and connects to practical modeling scenarios in which rewards are randomly perturbed.

Step 1: Define the Augmented MDP with Gumbel Noise

At each time period and state $s$, we draw an action-indexed shock vector:

$$\boldsymbol{\epsilon}_t(s) = \big(\epsilon_t(s,a)\big)_{a \in \mathcal{A}_s}, \quad \text{where } \epsilon_t(s,a) \ \text{i.i.d.} \sim \mathrm{Gumbel}(\mu_\epsilon, 1/\beta)$$

These shocks are independent across time periods, states, and actions, and are independent of the MDP transition dynamics $p(\cdot \mid s, a)$.

The Gumbel distribution with location parameter $\mu$ and scale parameter $1/\beta$ has probability density function:

$$f(x; \mu, \beta) = \beta\exp\left(-\beta(x-\mu)-\exp(-\beta(x-\mu))\right)$$

To generate a Gumbel-distributed random variable, we can use inverse transform sampling: $X = \mu - \frac{1}{\beta} \ln(-\ln(U))$, where $U$ is uniform on $(0,1)$.
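
A quick sketch of this sampler, checked against the Gumbel mean $\mu + \gamma_{\mathrm{EM}}/\beta$ and variance $\pi^2/(6\beta^2)$ (standard facts about the Gumbel distribution that are not stated in the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gumbel(mu, beta, size):
    """Inverse-transform sampling: X = mu - (1/beta) * ln(-ln(U)), U ~ Uniform(0,1)."""
    u = rng.uniform(size=size)
    return mu - np.log(-np.log(u)) / beta

beta = 2.0
x = sample_gumbel(mu=0.0, beta=beta, size=200_000)
euler_gamma = 0.5772156649
print(x.mean(), euler_gamma / beta)          # Gumbel(mu, 1/beta) has mean mu + gamma_EM / beta
print(x.var(), (np.pi**2 / 6) / beta**2)     # ... and variance pi^2 / (6 beta^2)
```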

We now define an augmented MDP with:

  • augmented states $(s, \boldsymbol{\epsilon})$, where $\boldsymbol{\epsilon} = (\epsilon(a))_{a \in \mathcal{A}_s}$ is the current shock vector;
  • perturbed rewards $r(s,a) + \epsilon(a)$;
  • transitions that factorize as $p\big((s', \boldsymbol{\epsilon}') \mid (s, \boldsymbol{\epsilon}), a\big) = p(s' \mid s, a)\, p(\boldsymbol{\epsilon}')$, with $p(\boldsymbol{\epsilon}')$ the product of the Gumbel densities above.

The transition factorizes because the next shock vector $\boldsymbol{\epsilon}'$ is drawn independently of the current state and action (conditional independence).

Step 2: The Hard Bellman Equation on the Augmented State Space

The Bellman optimality equation for the augmented MDP is:

$$\tilde{v}(s, \boldsymbol{\epsilon}) = \max_{a \in \mathcal{A}_s} \left\{ r(s,a) + \epsilon(a) + \gamma \mathbb{E}_{s', \boldsymbol{\epsilon}'}\left[\tilde{v}(s', \boldsymbol{\epsilon}') \mid s, a\right] \right\}$$

Here the expectation is over the next augmented state $(s', \boldsymbol{\epsilon}')$, which includes both the next state $s' \sim p(\cdot | s, a)$ and the next shock vector $\boldsymbol{\epsilon}' \sim p(\cdot)$.

This is a perfectly well-defined Bellman equation, and an optimal stationary policy exists:

$$\pi(s, \boldsymbol{\epsilon}) \in \operatorname{argmax}_{a \in \mathcal{A}_s} \left\{ r(s,a) + \epsilon(a) + \gamma \mathbb{E}_{s', \boldsymbol{\epsilon}'}\left[\tilde{v}(s', \boldsymbol{\epsilon}') \mid s, a\right] \right\}$$

However, this equation is computationally intractable because:

  • the augmented state contains the continuous shock vector $\boldsymbol{\epsilon} \in \mathbb{R}^{|\mathcal{A}_s|}$, so the augmented state space is uncountable and cannot be enumerated or tabulated;
  • every backup requires an expectation over the next shock vector in addition to the next state, and $\tilde{v}$ has no finite representation over which to carry it out.

We never solve this equation directly. Instead, we use it as a mathematical device to derive the smooth Bellman equation.

Step 3: Define the Ex-Ante (Inclusive) Value Function

The idea here is to consider the expected value before observing the current shocks. We define what some authors in econometrics call the inclusive value or ex-ante value:

$$v(s) := \mathbb{E}_{\boldsymbol{\epsilon}}\big[\tilde{v}(s, \boldsymbol{\epsilon})\big]$$

This is the value of being in state $s$ before we observe the current-period shock vector $\boldsymbol{\epsilon}$.

Step 4: Separate the Deterministic and Random Components

Now we take the expectation of the augmented Bellman equation with respect to the current shocks only (everything that does not depend on the current $\boldsymbol{\epsilon}$ can be pulled out).

First, note that by the law of iterated expectations and independence of shocks across time:

$$\mathbb{E}_{\boldsymbol{\epsilon}'}\big[\tilde{v}(s', \boldsymbol{\epsilon}')\big] = v(s')$$

This follows from our definition of $v$ and the fact that the next shock is independent of everything else.

Now define the deterministic part of the right-hand side:

$$x_a(s) := r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a) v(j)$$

This is the expected return from taking action $a$ in state $s$ without the shock. Using this notation, the augmented Bellman equation becomes:

$$\tilde{v}(s, \boldsymbol{\epsilon}) = \max_{a \in \mathcal{A}_s} \left\{ x_a(s) + \epsilon(a) \right\}$$

Taking the expectation over $\boldsymbol{\epsilon}$ on both sides:

$$v(s) = \mathbb{E}_{\boldsymbol{\epsilon}}\left[\max_{a \in \mathcal{A}_s} \left\{ x_a(s) + \epsilon(a) \right\}\right]$$

Step 5: Apply the Gumbel Random Utility Identity

We now invoke a result from extreme value theory: if the shocks $\epsilon(a)$, $a \in \mathcal{A}_s$, are i.i.d. $\mathrm{Gumbel}(\mu_\epsilon, 1/\beta)$ and the $x_a$ are constants, then

$$\mathbb{E}\left[\max_{a \in \mathcal{A}_s} \{x_a + \epsilon(a)\}\right] = \mu_\epsilon + \frac{\gamma_{\mathrm{EM}}}{\beta} + \frac{1}{\beta} \log \sum_{a \in \mathcal{A}_s} \exp(\beta x_a),$$

where $\gamma_{\mathrm{EM}} \approx 0.5772$ is the Euler–Mascheroni constant. Choosing the location $\mu_\epsilon = -\gamma_{\mathrm{EM}}/\beta$ makes the shocks mean-zero, and the additive constant vanishes.

Applying this identity to our problem (with mean-zero shocks):

$$v(s) = \mathbb{E}_{\boldsymbol{\epsilon}}\left[\max_{a \in \mathcal{A}_s} \{x_a(s) + \epsilon(a)\}\right] = \frac{1}{\beta} \log \sum_{a \in \mathcal{A}_s} \exp(\beta x_a(s))$$

Substituting the definition of $x_a(s)$:

$$v(s) = \frac{1}{\beta} \log \sum_{a \in \mathcal{A}_s} \exp\left(\beta\left(r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a) v(j)\right)\right)$$

We have arrived at the smooth Bellman equation.
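
A Monte Carlo sanity check of the identity used in Step 5, with made-up values $x_a(s)$ and the mean-zero location choice discussed above:

```python
import numpy as np

rng = np.random.default_rng(1)
beta = 3.0
x = np.array([0.4, 0.1, -0.2])                 # arbitrary x_a(s) values at one state
euler_gamma = 0.5772156649
mu_eps = -euler_gamma / beta                   # location that makes the shocks mean-zero

# Monte Carlo estimate of E[max_a {x_a + eps(a)}] with Gumbel(mu_eps, 1/beta) shocks
u = rng.uniform(size=(500_000, x.size))
eps = mu_eps - np.log(-np.log(u)) / beta
mc = (x + eps).max(axis=1).mean()

lse = np.log(np.exp(beta * x).sum()) / beta    # (1/beta) * log-sum-exp(beta * x)
print(mc, lse)                                 # the two numbers agree to Monte Carlo error
```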

Step 6: Summary of the Derivation

To recap the logical flow:

  1. We constructed an augmented MDP with state $(s, \boldsymbol{\epsilon})$ where shocks perturb rewards

  2. We wrote the standard Bellman equation for this augmented MDP (hard max, but over a continuous, uncountably infinite augmented state space)

  3. We defined the ex-ante value $v(s) = \mathbb{E}_{\boldsymbol{\epsilon}}[\tilde{v}(s, \boldsymbol{\epsilon})]$ to eliminate the continuous shock component

  4. We separated deterministic and random terms: $\tilde{v}(s, \boldsymbol{\epsilon}) = \max_a \{x_a(s) + \epsilon(a)\}$

  5. We applied the Gumbel identity to evaluate $\mathbb{E}_{\boldsymbol{\epsilon}}[\max_a \{\cdots\}]$ in closed form as a log-sum-exp

The augmented MDP with shocks exists only as a mathematical device. We never approximate $\tilde{v}$, never discretize $\boldsymbol{\epsilon}$, and never enumerate the augmented state space. The only computational object we work with is $v(s)$ on the original (finite) state space, which satisfies the smooth Bellman equation.

Deriving the Optimal Smooth Policy

Now that we have derived the smooth value function, we can also obtain the corresponding optimal policy. The question is: what policy should we follow in the original MDP (without explicitly conditioning on shocks)?

In the augmented MDP, the optimal policy is deterministic but depends on the shock realization:

$$\pi(s, \boldsymbol{\epsilon}) \in \operatorname{argmax}_{a \in \mathcal{A}_s} \left\{ r(s,a) + \epsilon(a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a) v(j) \right\}$$

However, we want a policy defined on the original state $s$ (not the augmented state). We obtain it by marginalizing over the current shocks, essentially asking: “what is the probability that action $a$ is optimal when we average over all possible shock realizations?”

Define an indicator function:

$$I_a(\boldsymbol{\epsilon}) = \begin{cases} 1 & \text{if } a \in \operatorname{argmax}_{a' \in \mathcal{A}_s} \left\{ x_{a'}(s) + \epsilon(a') \right\} \\ 0 & \text{otherwise} \end{cases}$$

where $x_a(s) = r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a) v(j)$ as before.

The ex-ante probability that action $a$ is optimal at state $s$ is:

$$\pi(a|s) = \mathbb{E}_{\boldsymbol{\epsilon}}[I_a(\boldsymbol{\epsilon})] = \mathbb{P}_{\boldsymbol{\epsilon}}\left(a \in \operatorname{argmax}_{a'} \left\{ x_{a'}(s) + \epsilon(a') \right\}\right)$$

This is the probability that action $a$ achieves the maximum when utilities are perturbed by Gumbel noise.

For i.i.d. Gumbel shocks, this probability has a well-known closed form: the probability that alternative $a$ attains the maximum is proportional to $\exp(\beta x_a(s))$ (the multinomial logit formula from discrete choice theory). Applying this result to our problem:

$$\pi(a|s) = \frac{\exp\left(\beta x_a(s)\right)}{\sum_{a' \in \mathcal{A}_s} \exp\left(\beta x_{a'}(s)\right)} = \frac{\exp\left(\beta\left(r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a)v(j)\right)\right)}{\sum_{a' \in \mathcal{A}_s} \exp\left(\beta\left(r(s,a') + \gamma \sum_{j \in \mathcal{S}} p(j|s,a')v(j)\right)\right)}$$

This is the softmax policy or Gibbs/Boltzmann policy with inverse temperature $\beta$.

Properties:

  • every action receives strictly positive probability, so the policy is stochastic;
  • as $\beta \to \infty$, the distribution concentrates on $\operatorname{argmax}_a x_a(s)$ and we recover the deterministic greedy policy;
  • as $\beta \to 0$, the distribution approaches the uniform distribution over $\mathcal{A}_s$;
  • the policy depends on the shocks only through their distribution, so it is a function of $s$ alone.

This completes the derivation: the smooth Bellman equation yields a value function v(s)v(s), and the corresponding optimal policy is the softmax over Q-values.
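
The same simulation idea can be used to check the policy: the frequency with which each action attains the Gumbel-perturbed maximum should match the softmax probabilities (again with made-up $x_a(s)$ values):

```python
import numpy as np

rng = np.random.default_rng(2)
beta = 3.0
x = np.array([0.4, 0.1, -0.2])                 # x_a(s) for one state, made up
euler_gamma = 0.5772156649
mu_eps = -euler_gamma / beta

# Empirical frequency with which each action attains the perturbed maximum
u = rng.uniform(size=(500_000, x.size))
eps = mu_eps - np.log(-np.log(u)) / beta
freq = np.bincount((x + eps).argmax(axis=1), minlength=x.size) / u.shape[0]

softmax = np.exp(beta * x) / np.exp(beta * x).sum()
print(freq.round(4), softmax.round(4))         # the two distributions agree closely
```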

Regularized Markov Decision Processes

Regularized MDPs Geist et al. (2019) provide another perspective on how the smooth Bellman equations come to be. This framework offers a more general approach in which we seek to find optimal policies under the infinite horizon criterion while also accounting for a regularizer that influences the kind of policies we try to obtain.

Let’s set up some necessary notation. First, recall that the policy evaluation operator for a stationary policy with decision rule $\pi$ is defined as:

$$\mathrm{L}_\pi \mathbf{v} = \mathbf{r}_\pi + \gamma \mathbf{P}_\pi \mathbf{v}$$

where $\mathbf{r}_\pi$ is the expected reward vector under policy $\pi$, $\gamma$ is the discount factor, and $\mathbf{P}_\pi$ is the state transition probability matrix under $\pi$. A complementary object to the value function is the q-function (or Q-factor) representation:

$$\begin{aligned} q_\gamma^{\pi}(s, a) &= r(s, a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a) v_\gamma^{\pi}(j) \\ v_\gamma^{\pi}(s) &= \sum_{a \in \mathcal{A}_s} \pi(a | s) q_\gamma^{\pi}(s, a) \end{aligned}$$

The policy evaluation operator can then be written in terms of the q-function as:

$$[\mathrm{L}_\pi v](s) = \langle \pi(\cdot | s), q(s, \cdot) \rangle$$

Legendre-Fenchel Transform

The workhorse behind the theory of regularized MDPs is the Legendre-Fenchel transform, also known as the convex conjugate. For a strongly convex function $\Omega: \Delta_{\mathcal{A}} \rightarrow \mathbb{R}$, its Legendre-Fenchel transform $\Omega^*: \mathbb{R}^{\mathcal{A}} \rightarrow \mathbb{R}$ is defined as:

$$\Omega^*(q(s, \cdot)) = \max_{\pi(\cdot|s) \in \Delta_{\mathcal{A}}} \langle \pi(\cdot | s), q(s, \cdot) \rangle - \Omega(\pi(\cdot | s))$$

An important property of this transform is that it has a unique maximizing argument, given by the gradient of $\Omega^*$. This gradient is Lipschitz and satisfies:

$$\nabla \Omega^*(q(s, \cdot)) = \arg\max_\pi \langle \pi(\cdot | s), q(s, \cdot) \rangle - \Omega(\pi(\cdot | s))$$

An important example of a regularizer is the negative entropy, which gives rise to the smooth Bellman equations as we are about to see.

Regularized Bellman Operators

With these concepts in place, we can now define the regularized Bellman operators:

  1. Regularized Policy Evaluation Operator ($\mathrm{L}_{\pi,\Omega}$):

    $$[\mathrm{L}_{\pi,\Omega} v](s) = \langle q(s,\cdot), \pi(\cdot | s) \rangle - \Omega(\pi(\cdot | s))$$
  2. Regularized Bellman Optimality Operator ($\mathrm{L}_\Omega$):

    $$[\mathrm{L}_\Omega v](s) = [\max_\pi \mathrm{L}_{\pi,\Omega} v ](s) = \Omega^*(q(s, \cdot))$$

It can be shown that the addition of a regularizer in these regularized operators still preserves the contraction properties, and therefore the existence of a solution to the optimality equations and the convergence of successive approximation.

The regularized value function of a stationary policy with decision rule $\pi$, denoted by $v_{\pi,\Omega}$, is the unique fixed point of the operator equation:

$$\text{find $v$ such that } \enspace v = \mathrm{L}_{\pi,\Omega} v$$

Under the usual assumptions on the discount factor and the boundedness of the reward, the value of a policy can also be found in closed form by solving for $\mathbf{v}$ in the linear system of equations:

$$(\mathbf{I} - \gamma \mathbf{P}_\pi) \mathbf{v} = \mathbf{r}_\pi - \boldsymbol{\Omega}_\pi$$

where $[\boldsymbol{\Omega}_\pi](s) = \Omega(\pi(\cdot|s))$ is the vector of regularization terms at each state.
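
As a small illustration of this linear system, here is a sketch that evaluates a fixed stochastic decision rule under a negative-entropy regularizer scaled by a coefficient $\alpha$ (the MDP arrays and the decision rule are made up; `regularized_policy_evaluation` is a hypothetical helper):

```python
import numpy as np

def regularized_policy_evaluation(R, P, pi, gamma, alpha):
    """Solve (I - gamma * P_pi) v = r_pi - Omega_pi for a fixed decision rule pi.

    R: (S, A) rewards, P: (S, A, S) transitions, pi: (S, A) decision rule.
    Uses the scaled negative-entropy regularizer Omega(pi(.|s)) = alpha * sum_a pi ln pi.
    """
    S, A = R.shape
    r_pi = (pi * R).sum(axis=1)                       # expected reward under pi
    P_pi = np.einsum("sa,saj->sj", pi, P)             # state transition matrix under pi
    omega_pi = alpha * (pi * np.log(pi)).sum(axis=1)  # negative entropy at each state
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi - omega_pi)

# Reusing the tiny synthetic MDP from the smooth value iteration sketch
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])
print(regularized_policy_evaluation(R, P, pi, gamma=0.95, alpha=0.2))
```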

The associated state-action value function $q_{\pi,\Omega}$ is given by:

$$\begin{aligned} q_{\pi,\Omega}(s, a) &= r(s, a) + \sum_{j \in \mathcal{S}} \gamma p(j|s,a) v_{\pi,\Omega}(j) \\ v_{\pi,\Omega}(s) &= \sum_{a \in \mathcal{A}_s} \pi(a | s) q_{\pi,\Omega}(s, a) - \Omega(\pi(\cdot | s)) \end{aligned}$$

The regularized optimal value function $v^*_\Omega$ is then the unique fixed point of $\mathrm{L}_\Omega$ in the fixed point equation:

$$\text{find $v$ such that } v = \mathrm{L}_\Omega v$$

The associated state-action value function $q^*_\Omega$ is given by:

$$\begin{aligned} q^*_\Omega(s, a) &= r(s, a) + \sum_{j \in \mathcal{S}} \gamma p(j|s,a) v^*_\Omega(j) \\ v^*_\Omega(s) &= \Omega^*(q^*_\Omega(s, \cdot)) \end{aligned}$$

An important result in the theory of regularized MDPs is that there exists a unique optimal regularized policy. Specifically, if $\pi^*_\Omega$ is a conserving decision rule (i.e., $\pi^*_\Omega = \arg\max_\pi \mathrm{L}_{\pi,\Omega} v^*_\Omega$), then the randomized stationary policy $\boldsymbol{\pi} = \mathrm{const}(\pi^*_\Omega)$ is the unique optimal regularized policy.

In practice, once we have found $v^*_\Omega$, we can derive the optimal decision rule by taking the gradient of the convex conjugate evaluated at the optimal action-value function:

$$\pi^*(\cdot | s) = \nabla \Omega^*(q^*_\Omega(s, \cdot))$$

Recovering the Smooth Bellman Equations

Under this framework, we can recover the smooth Bellman equations by choosing $\Omega$ to be the negative entropy, and obtain the softmax policy as the gradient of the convex conjugate. Let’s show this explicitly:

  1. Using the negative entropy regularizer:

    $$\Omega(d(\cdot|s)) = \sum_{a \in \mathcal{A}_s} d(a|s) \ln d(a|s)$$
  2. The convex conjugate:

    $$\Omega^*(q(s, \cdot)) = \ln \sum_{a \in \mathcal{A}_s} \exp q(s,a)$$
  3. Now, let’s write out the regularized Bellman optimality equation:

    $$v^*_\Omega(s) = \Omega^*(q^*_\Omega(s, \cdot))$$
  4. Substituting the expressions for $\Omega^*$ and $q^*_\Omega$:

    $$v^*_\Omega(s) = \ln \sum_{a \in \mathcal{A}_s} \exp \left(r(s, a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a) v^*_\Omega(j)\right)$$

This matches the form of the smooth Bellman equation we derived earlier (here with $\beta = 1$, since the regularizer is unscaled), with the log-sum-exp operation replacing the max operation of the standard Bellman equation.

Furthermore, the optimal policy is given by the gradient of $\Omega^*$:

$$d^*(a|s) = \nabla \Omega^*(q^*_\Omega(s, \cdot)) = \frac{\exp(q^*_\Omega(s,a))}{\sum_{a' \in \mathcal{A}_s} \exp(q^*_\Omega(s,a'))}$$

This is the familiar softmax policy we encountered in the smooth MDP setting.
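
A quick numerical check of this last identity, comparing a central finite-difference gradient of the log-sum-exp conjugate against the softmax at made-up q-values:

```python
import numpy as np

def logsumexp(q):
    m = q.max()
    return m + np.log(np.exp(q - m).sum())

def softmax(q):
    w = np.exp(q - q.max())
    return w / w.sum()

q = np.array([0.3, -1.2, 0.9, 0.0])      # arbitrary q*(s, .) values for one state
eps = 1e-6
grad_fd = np.array([
    (logsumexp(q + eps * e) - logsumexp(q - eps * e)) / (2 * eps)
    for e in np.eye(q.size)
])
print(grad_fd.round(5), softmax(q).round(5))   # the gradient of Omega* equals the softmax
```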

Smooth Policy Iteration Algorithm

Now that we’ve seen how the regularized MDP framework leads to smooth Bellman equations, we present smooth policy iteration. Unlike value iteration which directly iterates the Bellman operator, policy iteration alternates between policy evaluation and policy improvement steps.

Key properties of smooth policy iteration (a compact code sketch follows this list):

  1. Entropy-regularized evaluation: The policy evaluation step (line 12 of Algorithm 2) accounts for the entropy bonus $\alpha H(\pi(\cdot|s))$, where $\alpha = 1/\beta$

  2. Stochastic policy improvement: The policy improvement step (lines 12-14 of Algorithm 3) uses a softmax instead of a deterministic argmax, producing a stochastic policy

  3. Temperature parameter:

    • Higher $\beta$ → policies closer to deterministic (lower entropy)

    • Lower $\beta$ → more stochastic policies (higher entropy)

    • As $\beta \to \infty$ → recovers standard policy iteration

  4. Convergence: Like standard policy iteration, this algorithm converges to the unique optimal regularized value function and policy
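
Since Algorithms 2 and 3 themselves are not reproduced in this excerpt, the following sketch shows only the alternation they describe, reusing the tabular representation from the value iteration sketch above (all arrays invented for illustration):

```python
import numpy as np

def soft_policy_iteration(R, P, gamma, beta, n_iters=50):
    """Alternate entropy-regularized evaluation and softmax improvement.

    R: (S, A) rewards, P: (S, A, S) transitions, beta: inverse temperature.
    Not the text's Algorithms 2-3, just the structure they describe.
    """
    S, A = R.shape
    alpha = 1.0 / beta
    pi = np.full((S, A), 1.0 / A)                          # start from the uniform policy
    for _ in range(n_iters):
        # Policy evaluation with the entropy bonus alpha * H(pi(.|s))
        r_pi = (pi * R).sum(axis=1) + alpha * (-(pi * np.log(pi)).sum(axis=1))
        P_pi = np.einsum("sa,saj->sj", pi, P)
        v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # Softmax (soft-greedy) policy improvement
        q = R + gamma * P @ v
        w = np.exp(beta * (q - q.max(axis=1, keepdims=True)))
        pi = w / w.sum(axis=1, keepdims=True)
    return v, pi

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
v, pi = soft_policy_iteration(R, P, gamma=0.95, beta=5.0)
print(v, pi.round(3))
```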

Equivalence Between Smooth Bellman Equations and Entropy-Regularized MDPs

We have now seen two distinct ways to arrive at smooth Bellman equations. Earlier in this chapter, we introduced the logsumexp operator as a smooth approximation to the max operator, motivated by analytical tractability and the desire for differentiability. Just now, we derived the same equations through the lens of regularized MDPs, where we explicitly penalize the entropy of policies. These two perspectives are mathematically equivalent: solving the smooth Bellman equation with inverse temperature parameter $\beta$ yields exactly the same optimal value function and optimal policy as solving the entropy-regularized MDP with regularization strength $\alpha = 1/\beta$. The two formulations are not merely similar. They describe identical optimization problems.

To see this equivalence clearly, consider the standard MDP problem with rewards $r(s,a)$ and transition probabilities $p(j|s,a)$. The regularized MDP framework tells us to solve:

$$\max_\pi \mathbb{E}_\pi \left[ \sum_{t=0}^\infty \gamma^t r(s_t, a_t) \right] + \alpha \mathbb{E}_\pi \left[ \sum_{t=0}^\infty \gamma^t H(\pi(\cdot|s_t)) \right],$$

where $H(\pi(\cdot|s)) = -\sum_a \pi(a|s) \ln \pi(a|s)$ is the entropy of the policy at state $s$, and $\alpha > 0$ is the entropy regularization strength.

We can rewrite this objective by absorbing the entropy term into a modified reward function. Define the entropy-augmented reward:

$$\tilde{r}(s,a,\pi) = r(s,a) + \alpha H(\pi(\cdot|s)).$$

However, this formulation makes the reward depend on the entire policy at each state, which is awkward. We can reformulate this more cleanly by expanding the entropy term. Recall that the entropy is:

$$H(\pi(\cdot|s)) = -\sum_a \pi(a|s) \ln \pi(a|s).$$

When we take the expectation over actions drawn from $\pi$, we have:

$$\mathbb{E}_{a \sim \pi(\cdot|s)} [H(\pi(\cdot|s))] = \sum_a \pi(a|s) \left[-\sum_{a'} \pi(a'|s) \ln \pi(a'|s)\right] = -\sum_{a'} \pi(a'|s) \ln \pi(a'|s),$$

since the entropy doesn’t depend on which action is actually sampled. But we can also write this as:

$$H(\pi(\cdot|s)) = -\sum_a \pi(a|s) \ln \pi(a|s) = \mathbb{E}_{a \sim \pi(\cdot|s)}[-\ln \pi(a|s)].$$

This shows that adding $\alpha H(\pi(\cdot|s))$ to the expected reward at state $s$ is equivalent to adding $-\alpha \ln \pi(a|s)$ to the reward of taking action $a$ at state $s$. More formally:

$$\begin{aligned} &\mathbb{E}_\pi \left[ \sum_{t=0}^\infty \gamma^t r(s_t, a_t) \right] + \alpha \mathbb{E}_\pi \left[ \sum_{t=0}^\infty \gamma^t H(\pi(\cdot|s_t)) \right] \\ &= \mathbb{E}_\pi \left[ \sum_{t=0}^\infty \gamma^t r(s_t, a_t) \right] + \alpha \mathbb{E}_\pi \left[ \sum_{t=0}^\infty \gamma^t \mathbb{E}_{a_t \sim \pi(\cdot|s_t)}[-\ln \pi(a_t|s_t)] \right] \\ &= \mathbb{E}_\pi \left[ \sum_{t=0}^\infty \gamma^t \left( r(s_t, a_t) - \alpha \ln \pi(a_t|s_t) \right) \right]. \end{aligned}$$

The entropy bonus at each state, when averaged over the policy, becomes a per-action penalty proportional to the negative log probability of the action taken. This reformulation is more useful because the modified reward now depends only on the state, the action taken, and the probability assigned to that specific action by the policy, not on the entire distribution over actions.

This expression shows that entropy regularization is equivalent to adding a state-action dependent penalty term $-\alpha \ln \pi(a|s)$ to the reward. Intuitively, this term amounts to paying a cost for low-entropy (near-deterministic) policies.

Now, when we write down the Bellman equation for this entropy-regularized problem, at each state $s$ we need to find the decision rule $d(\cdot|s) \in \Delta(\mathcal{A}_s)$ (a probability distribution over actions) that attains the maximum in:

$$v(s) = \max_{d(\cdot|s) \in \Delta(\mathcal{A}_s)} \sum_a d(a|s) \left[ r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a) v(j) - \alpha \ln d(a|s) \right].$$

Here $\Delta(\mathcal{A}_s) = \{d(\cdot|s) : d(a|s) \geq 0, \sum_a d(a|s) = 1\}$ denotes the probability simplex over actions available at state $s$. The optimization is over randomized decision rules at each state, constrained to be valid probability distributions.

This is a convex optimization problem with a linear constraint. We form the Lagrangian:

$$\mathcal{L}(d, \lambda) = \sum_a d(a|s) \left[ r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a) v(j) - \alpha \ln d(a|s) \right] - \lambda \left(\sum_a d(a|s) - 1\right),$$

where $\lambda$ is the Lagrange multiplier enforcing the normalization constraint. Taking the derivative with respect to $d(a|s)$ and setting it to zero:

$$r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a) v(j) - \alpha(1 + \ln d^*(a|s)) - \lambda = 0.$$

Solving for $d^*(a|s)$, absorbing the constant factor $e^{-1}$ into the multiplier (i.e., relabeling $\lambda + \alpha$ as $\lambda$):

$$d^*(a|s) = \exp\left(\frac{1}{\alpha}\left(r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a) v(j) - \lambda\right)\right).$$

Using the normalization constraint $\sum_a d^*(a|s) = 1$ to solve for $\lambda$:

$$\sum_a \exp\left(\frac{1}{\alpha}\left(r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a) v(j)\right)\right) = \exp\left(\frac{\lambda}{\alpha}\right).$$

Therefore:

$$\lambda = \alpha \ln \sum_a \exp\left(\frac{1}{\alpha}\left(r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a) v(j)\right)\right).$$

Substituting this back into the Bellman equation and simplifying:

$$v(s) = \alpha \ln \sum_a \exp\left(\frac{1}{\alpha}\left(r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a) v(j)\right)\right).$$

Setting $\beta = 1/\alpha$ (the inverse temperature), this becomes:

$$v(s) = \frac{1}{\beta} \ln \sum_a \exp\left(\beta\left(r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a) v(j)\right)\right).$$

We recover the smooth Bellman equation we derived earlier using the logsumexp operator. The inverse temperature parameter $\beta$ controls how closely the logsumexp approximates the max: as $\beta \to \infty$, we recover the standard Bellman equation, while for finite $\beta$, we have a smooth approximation that corresponds to optimizing with entropy regularization strength $\alpha = 1/\beta$.
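
A small numerical confirmation of this maximization: the entropy-regularized objective evaluated at the softmax distribution equals $\alpha \ln \sum_a \exp(x_a/\alpha)$, and any other distribution on the simplex scores lower (the action values are made up; comparison distributions are drawn from a Dirichlet):

```python
import numpy as np

rng = np.random.default_rng(3)
alpha = 0.5
q = np.array([0.7, 0.2, -0.4])            # made-up action values x_a(s)

def objective(d):
    """Entropy-regularized objective: sum_a d(a) * (q(a) - alpha * ln d(a))."""
    return float(d @ (q - alpha * np.log(d)))

d_star = np.exp(q / alpha) / np.exp(q / alpha).sum()      # softmax with temperature alpha
closed_form = alpha * np.log(np.exp(q / alpha).sum())     # alpha * logsumexp(q / alpha)

print(objective(d_star), closed_form)     # identical up to floating point
for _ in range(3):                        # any other distribution scores lower
    d = rng.dirichlet(np.ones(q.size))
    print(objective(d) <= closed_form + 1e-12)
```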

The optimal policy is:

$$\pi^*(a|s) = \frac{\exp\left(\beta q^*(s,a)\right)}{\sum_{a'} \exp\left(\beta q^*(s,a')\right)} = \text{softmax}_\beta(q^*(s,\cdot))(a),$$

which is exactly the softmax policy parametrized by inverse temperature.

The derivation establishes the complete equivalence: the value function $v^*$ that solves the smooth Bellman equation is identical to the optimal value function $v^*_\Omega$ of the entropy-regularized MDP (with $\Omega$ being the negative entropy and $\alpha = 1/\beta$), and the softmax policy that is greedy with respect to this value function achieves the maximum of the entropy-regularized objective. Both approaches yield the same numerical solution: the same values at every state and the same policy prescriptions. The only difference is how we conceptualize the problem: as smoothing the Bellman operator for computational tractability, or as explicitly trading off reward maximization against policy entropy.

This equivalence has important implications. When we use smooth Bellman equations with a logsumexp operator, we are implicitly solving an entropy-regularized MDP. Conversely, when we explicitly add entropy regularization to an MDP objective, we arrive at smooth Bellman equations as the natural description of optimality. This dual perspective will prove valuable in understanding various algorithms and theoretical results. For instance, in soft actor-critic methods and other maximum entropy reinforcement learning algorithms, the connection between smooth operators and entropy regularization provides both computational benefits (differentiability) and conceptual clarity (why we want stochastic policies).

Entropy-Regularized Dynamic Programming Algorithms

While the smooth Bellman equations (using logsumexp) and entropy-regularized formulations are mathematically equivalent, it is instructive to present the algorithms explicitly in the entropy-regularized form, where the entropy bonus appears directly in the update equations.
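
The algorithm boxes are not reproduced in this excerpt; as a minimal sketch of what an update with an explicit entropy bonus looks like, the following performs value iteration where each backup averages $q(s,a) - \alpha \ln \pi(a|s)$ under the softmax policy, which coincides with the logsumexp update at $\beta = 1/\alpha$ (same invented MDP as before):

```python
import numpy as np

def entropy_regularized_value_iteration(R, P, gamma, alpha, tol=1e-8, max_iter=10_000):
    """Value iteration where the entropy bonus appears explicitly in the backup.

    Each update computes the softmax policy for the current values and backs up
    v(s) = sum_a pi(a|s) * (q(s,a) - alpha * ln pi(a|s)),
    which equals the logsumexp update with beta = 1 / alpha.
    """
    S, A = R.shape
    v = np.zeros(S)
    for _ in range(max_iter):
        q = R + gamma * P @ v
        pi = np.exp((q - q.max(axis=1, keepdims=True)) / alpha)
        pi /= pi.sum(axis=1, keepdims=True)
        v_new = (pi * (q - alpha * np.log(pi))).sum(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, pi
        v = v_new
    return v, pi

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
v, pi = entropy_regularized_value_iteration(R, P, gamma=0.95, alpha=0.2)
print(v, pi.round(3))   # matches smooth value iteration with beta = 1 / alpha = 5
```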


References
  1. Rust, J. (1987). Optimal Replacement of GMC Bus Engines: An Empirical Model of Harold Zurcher. Econometrica, 55(5), 999–1033.
  2. Ziebart, B. D., Maas, A. L., Bagnell, J. A., & Dey, A. K. (2008). Maximum Entropy Inverse Reinforcement Learning. Proceedings of the 23rd AAAI Conference on Artificial Intelligence, 1433–1438.
  3. Haarnoja, T., Tang, H., Abbeel, P., & Levine, S. (2017). Reinforcement Learning with Deep Energy-Based Policies. Proceedings of the 34th International Conference on Machine Learning, 70, 1352–1361.
  4. Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. Proceedings of the 35th International Conference on Machine Learning (ICML), 1861–1870.
  5. Levine, S. (2018). Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review. arXiv preprint arXiv:1805.00909.
  6. D’Eramo, C., Nuara, A., Restelli, M., & Nowe, A. (2016). Estimating the Maximum Expected Value through Gaussian Approximation. Proceedings of the 33rd International Conference on Machine Learning, 48, 1032–1040.
  7. Geist, M., Scherrer, B., & Pietquin, O. (2019). A Theory of Regularized Markov Decision Processes. In K. Chaudhuri & R. Salakhutdinov (Eds.), Proceedings of the 36th International Conference on Machine Learning (Vol. 97, pp. 2160–2169). PMLR. https://proceedings.mlr.press/v97/geist19a.html