Dynamic programming methods suffer from the curse of dimensionality and can quickly become difficult to apply in practice. Moreover, we may be dealing with large or continuous state or action spaces. So far, we have addressed this problem using discretization or interpolation; these were already examples of approximate dynamic programming. In this chapter, we will see other forms of approximation meant to facilitate the optimization problem, approximating either the optimality equations, the value function, or the policy itself. Approximation theory is at the heart of learning methods, and fundamentally, this chapter is about applying learning ideas to solve complex decision-making problems.
Smooth Bellman Optimality Equations#
While the standard Bellman optimality equations use the max operator to determine the best action, an alternative formulation known as the smooth or soft Bellman optimality equations replaces this with a softmax operator. This approach originated from [37] and was later rediscovered in the context of maximum entropy inverse reinforcement learning [44], which then led to soft Q-learning [22] and soft actor-critic [23], a state-of-the-art deep reinforcement learning algorithm.
In the infinite-horizon setting, the smooth Bellman optimality equations take the form:

\[ v_\gamma^\star(s) = \frac{1}{\beta} \log \sum_{a \in \mathcal{A}_s} \exp\left(\beta\left(r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a)\, v_\gamma^\star(j)\right)\right), \quad s \in \mathcal{S} \]
Adopting an operator-theoretic perspective, we can define a nonlinear operator \(\mathrm{L}_\beta\) such that the smooth value function of an MDP is then the solution to the following fixed-point equation:
As \(\beta \to \infty\), \(\mathrm{L}_\beta\) converges to the standard Bellman operator \(\mathrm{L}\). Furthermore, it can be shown that the smooth Bellman operator is a contraction mapping in the supremum norm, and therefore has a unique fixed point. However, as opposed to the usual “hard” setting, the fixed point of \(\mathrm{L}_\beta\) is associated with the value function of an optimal stochastic policy defined by the softmax distribution:

\[ \pi(a|s) = \frac{\exp\left(\beta\left(r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a)\, v_\gamma^\star(j)\right)\right)}{\sum_{a' \in \mathcal{A}_s} \exp\left(\beta\left(r(s,a') + \gamma \sum_{j \in \mathcal{S}} p(j|s,a')\, v_\gamma^\star(j)\right)\right)} \]
Despite the confusing terminology, the above “softmax” policy is simply the smooth counterpart to the argmax operator in the original optimality equation: it acts as a soft-argmax.
This formulation is interesting for several reasons. First, smoothness is a desirable property from an optimization standpoint. Unlike \(\gamma\), we view \(\beta\) as a hyperparameter of our algorithm, which we can control to achieve the desired level of accuracy.
Second, while presented from an intuitive standpoint where we replace the max by the log-sum-exp (a smooth maximum) and the argmax by the softmax (a smooth argmax), this formulation can also be obtained from several other perspectives, each offering its own theoretical tools and solution methods. For example, Rust [37] derived this formulation by considering a setting in which the rewards are stochastic, perturbed by a Gumbel noise variable. When we consider the corresponding augmented state space and integrate out the noise, we obtain the smooth equations. This interpretation is leveraged by Rust for modeling purposes.
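To make the smoothing concrete, here is a minimal NumPy sketch (the array values and variable names are our own) showing that the scaled log-sum-exp approaches the max and the softmax approaches a one-hot argmax as \(\beta\) grows:

```python
import numpy as np
from scipy.special import logsumexp, softmax

q = np.array([1.0, 1.5, 0.2])   # action values at some state

for beta in [1.0, 10.0, 100.0]:
    smooth_max = logsumexp(beta * q) / beta   # smooth counterpart of max_a q(a)
    soft_argmax = softmax(beta * q)           # smooth counterpart of argmax_a q(a)
    print(beta, round(float(smooth_max), 4), soft_argmax.round(3))

# As beta grows, smooth_max approaches q.max() = 1.5
# and soft_argmax approaches the one-hot vector [0, 1, 0].
```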
Smooth Value Iteration Algorithm#
The smooth value iteration algorithm replaces the max operator in standard value iteration with the logsumexp operator. Here’s the algorithm structure:
Algorithm 12 (Smooth Value Iteration)
Input: MDP \((S, A, r, p, \gamma)\), inverse temperature \(\beta > 0\), tolerance \(\epsilon > 0\)
Output: Approximate optimal value function \(v\) and stochastic policy \(\pi\)
Initialize \(v(s) \leftarrow 0\) for all \(s \in S\)
repeat
\(\quad \Delta \leftarrow 0\)
\(\quad\) for each state \(s \in S\) do
\(\quad\quad\) for each action \(a \in A_s\) do
\(\quad\quad\quad q(s,a) \leftarrow r(s,a) + \gamma \sum_{j \in S} p(j|s,a) v(j)\)
\(\quad\quad\) end for
\(\quad\quad v_{\text{new}}(s) \leftarrow \frac{1}{\beta} \log \sum_{a \in A_s} \exp(\beta \cdot q(s,a))\)
\(\quad\quad \Delta \leftarrow \max(\Delta, |v_{\text{new}}(s) - v(s)|)\)
\(\quad\quad v(s) \leftarrow v_{\text{new}}(s)\)
\(\quad\) end for
until \(\Delta < \epsilon\)
Extract policy: for each state \(s \in S\) do
\(\quad\) Compute \(q(s,a)\) for all \(a \in A_s\) as in lines 5-7
\(\quad \pi(a|s) \leftarrow \frac{\exp(\beta \cdot q(s,a))}{\sum_{a' \in A_s} \exp(\beta \cdot q(s,a'))}\) for all \(a \in A_s\)
end for
return \(v, \pi\)
Key differences from standard value iteration:
Line 8 uses \(\frac{1}{\beta} \log \sum_a \exp(\beta \cdot q(s,a))\) instead of \(\max_a q(s,a)\)
Line 15 extracts a stochastic policy using softmax instead of a deterministic argmax policy
As \(\beta \to \infty\), the algorithm converges to standard value iteration
Lower \(\beta\) values produce more stochastic policies with higher entropy
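As a concrete reference, here is a minimal NumPy sketch of Algorithm 12 under the assumption that the MDP is given in tabular form as arrays (the array layout and function name are our own choices):

```python
import numpy as np
from scipy.special import logsumexp, softmax

def smooth_value_iteration(r, P, gamma, beta, tol=1e-8, max_iter=10_000):
    """Smooth value iteration on a tabular MDP.

    r[s, a]    : expected reward for taking action a in state s
    P[s, a, j] : probability of moving to state j from (s, a)
    """
    n_states, n_actions = r.shape
    v = np.zeros(n_states)
    for _ in range(max_iter):
        q = r + gamma * P @ v                       # q(s,a) = r(s,a) + gamma * sum_j p(j|s,a) v(j)
        v_new = logsumexp(beta * q, axis=1) / beta  # smooth max over actions
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    q = r + gamma * P @ v
    pi = softmax(beta * q, axis=1)                  # soft-argmax (stochastic) policy
    return v, pi
```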
There is also a way to obtain this equation by starting from the energy-based formulation often used in supervised learning, in which we convert an unnormalized probability distribution into a distribution using the softmax transformation. This is essentially what Ziebart et al. [44] did in their paper. Furthermore, this perspective bridges to the literature on probabilistic graphical models, in which we can cast the problem of finding an optimal smooth policy as one of maximum likelihood estimation (an inference problem). This is the idea of control as inference, which also admits the converse - that of inference as control - used nowadays for deriving fast samplers and amortized inference techniques using reinforcement learning [28].
Finally, it is worth noting that we can also derive this form from an entropy-regularized formulation in which the reward is augmented with a bonus for the entropy of the policy (equivalently, penalized by its negative entropy). This formulation admits a solution that coincides with the smooth Bellman equations [22].
Gumbel Noise on the Rewards#
We can obtain the smooth Bellman equation by considering a setting in which we have Gumbel noise added to the reward function. More precisely, we define an MDP whose state space is now that of \(\tilde{s} = (s, \epsilon)\), where the reward function is given by
and where the transition probability function is:
This expression stems from the conditional independence assumption that we make on the noise variable given the state.
Furthermore, we assume that \(\epsilon(a)\) is a random variable following a Gumbel distribution with location 0 and scale \(1/\beta\). The Gumbel distribution is a continuous probability distribution used to model the maximum (or minimum) of a number of samples of various distributions. Its probability density function is:

\[ f(x; \mu, \beta) = \frac{1}{\beta} \exp\left(-\left(\frac{x-\mu}{\beta} + e^{-(x-\mu)/\beta}\right)\right) \]
where \(\mu\) is the location parameter and \(\beta\) is the scale parameter. To generate a Gumbel-distributed random variable, one can use inverse transform sampling: draw \(U\) uniformly on the interval \((0,1)\) and set \(X = \mu - \beta \ln(-\ln(U))\).
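As a quick sanity check of this recipe (a small sketch of our own, not part of the original derivation), we can compare inverse-transform samples against NumPy's built-in Gumbel sampler:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, scale = 0.0, 0.5            # location and scale (scale plays the role of 1/beta in the text)

# Inverse transform sampling: U ~ Uniform(0,1), X = mu - scale * ln(-ln U)
u = rng.uniform(size=100_000)
x = mu - scale * np.log(-np.log(u))

# Reference samples from NumPy's Gumbel generator
y = rng.gumbel(loc=mu, scale=scale, size=100_000)

print(x.mean(), y.mean())   # both close to mu + 0.5772 * scale (Euler-Mascheroni constant)
print(x.std(), y.std())     # both close to pi * scale / sqrt(6)
```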
The Bellman equation in this augmented state space becomes:
Furthermore, since all we have done is define another MDP, we still have a contraction, and an optimal stationary policy \(\boldsymbol{\pi} = \mathrm{const}(\pi)\) can be found via the following deterministic Markovian decision rule:
Note how the expectation is now over the next augmented state space and is therefore both over the next state in the original MDP and over the next perturbation. While in the general case there isn’t much that we can do to simplify the expression for the expectation over the next state in the MDP, we can however leverage a remarkable property of the Gumbel distribution which allows us to eliminate the \(\epsilon\) term in the above and recover the familiar smooth Bellman equation.
For a set of random variables \(X_1, \ldots, X_n\), each following a Gumbel distribution with location parameters \(\mu_1, \ldots, \mu_n\) and scale parameter \(1/\beta\), extreme value theory tells us that:
In our case, each \(X_i\) corresponds to \(r(s,a_i) + \epsilon(a_i) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a_i)v(j)\) for a given action \(a_i\). The location parameter \(\mu_i\) is \(r(s,a_i) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a_i)v(j)\), and the scale parameter is \(1/\beta\).
Applying this result to our problem, and taking the expectation over the noise \(\epsilon\):
If we define \(v_\gamma^\star(s) = \mathbb{E}_\epsilon[v_\gamma^\star(s,\epsilon)]\), we obtain the smooth Bellman equation:
This final equation is the smooth Bellman equation, which we derived by introducing Gumbel noise to the reward function and leveraging properties of the Gumbel distribution and extreme value theory.
Now, in the same way that we have been able to simplify and specialize the form of the value function under Gumbel noise, we can also derive an expression for the corresponding optimal policy. To see this, we apply similar steps and start with the optimal decision rule for the augmented MDP:
In order to simplify this expression by taking the expectation over the noise variable, we define an indicator function for the event that action \(a\) is in the set of optimal actions:
Note that this definition allows us to recover the original expression since:
This set-valued – but deterministic – function \(\pi(s,\epsilon)\) gives us the set of optimal actions for a given state \(s\) and noise realization \(\epsilon\). For simplicity, consider the case where the optimal set of actions at \(s\) is a singleton such that taking the expectation over the noise variable gives us:
Now, we can leverage a key property of the Gumbel distribution. For a set of random variables \(\{X_i = \mu_i + \epsilon_i\}\), where the \(\epsilon_i\) are i.i.d. Gumbel(0, \(1/\beta\)) random variables, we have:

\[ \mathbb{P}\left(X_i = \max_{j} X_j\right) = \frac{\exp(\beta \mu_i)}{\sum_{j} \exp(\beta \mu_j)} \]
In our case, \(X_a = r(s,a) + \epsilon(a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a)v_\gamma^\star(j)\) for each action \(a\) (after marginalizing out \(\epsilon'\)), with \(\mu_a = r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a)v_\gamma^\star(j)\).
Applying this property and using the definition \(v_\gamma^\star(s) = \mathbb{E}_\epsilon[v_\gamma^\star(s,\epsilon)]\), we get:
This gives us the optimal stochastic policy for the smooth MDP. Note that as \(\beta \to \infty\), this policy approaches the deterministic policy of the original MDP, while for finite \(\beta\), it gives a stochastic policy.
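The following short Monte Carlo sketch (our own code, with made-up values) illustrates the property used above: perturbing each action's value with independent Gumbel(0, \(1/\beta\)) noise and taking the argmax selects actions with exactly the softmax probabilities.

```python
import numpy as np
from scipy.special import softmax

rng = np.random.default_rng(0)
beta = 2.0
mu = np.array([1.0, 1.5, 0.2])     # deterministic part, e.g. r(s,a) + gamma * E[v]

# Perturb each action value with independent Gumbel(0, 1/beta) noise and take the argmax
noise = rng.gumbel(loc=0.0, scale=1.0 / beta, size=(500_000, mu.size))
counts = np.bincount(np.argmax(mu + noise, axis=1), minlength=mu.size)

print(counts / counts.sum())       # empirical argmax frequencies
print(softmax(beta * mu))          # exp(beta*mu_a) / sum_a' exp(beta*mu_a')
```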
Regularized Markov Decision Processes#
Regularized MDPs [16] provide another perspective on how the smooth Bellman equations come to be. This framework offers a more general approach in which we seek to find optimal policies under the infinite horizon criterion while also accounting for a regularizer that influences the kind of policies we try to obtain.
Let’s set up some necessary notation. First, recall that the policy evaluation operator for a stationary policy with decision rule \(\pi\) is defined as:
where \(\mathbf{r}_\pi\) is the expected reward vector under policy \(\pi\), \(\gamma\) is the discount factor, and \(\mathbf{P}_\pi\) is the state transition probability matrix under \(\pi\). A complementary object to the value function is the q-function (or Q-factor) representation:
The policy evaluation operator can then be written in terms of the q-function as:
Legendre-Fenchel Transform#
The workhorse behind the theory of regularized MDPs is the Legendre-Fenchel transform, also known as the convex conjugate. For a strongly convex function \(\Omega: \Delta_{\mathcal{A}} \rightarrow \mathbb{R}\), its Legendre-Fenchel transform \(\Omega^*: \mathbb{R}^{\mathcal{A}} \rightarrow \mathbb{R}\) is defined as:
An important property of this transform is that it has a unique maximizing argument, given by the gradient of \(\Omega^*\). This gradient is Lipschitz and satisfies:
An important example of a regularizer is the negative entropy, which gives rise to the smooth Bellman equations as we are about to see.
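As a numerical illustration (a sketch of our own, anticipating the negative-entropy case worked out below), we can check that maximizing \(\langle q, \pi\rangle - \Omega(\pi)\) over the simplex with \(\Omega(\pi) = \sum_a \pi_a \ln \pi_a\) gives \(\Omega^*(q) = \log \sum_a e^{q_a}\), attained at the softmax of \(q\):

```python
import numpy as np
from scipy.special import logsumexp, softmax

rng = np.random.default_rng(0)
q = rng.normal(size=4)

def objective(p, q):
    # <q, p> - Omega(p), with Omega the negative entropy sum_a p_a ln p_a
    return q @ p - np.sum(p * np.log(p))

# Candidate maximizer: the softmax of q (the gradient of Omega*)
p_star = softmax(q)
print(objective(p_star, q), logsumexp(q))   # both equal Omega*(q) = log sum_a exp(q_a)

# Random points on the simplex never do better
for _ in range(5):
    p = rng.dirichlet(np.ones(4))
    assert objective(p, q) <= objective(p_star, q) + 1e-9
```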
Regularized Bellman Operators#
With these concepts in place, we can now define the regularized Bellman operators:
Regularized Policy Evaluation Operator \((\mathrm{L}_{\pi,\Omega})\):
\[ [\mathrm{L}_{\pi,\Omega} v](s) = \langle q(s,\cdot), \pi(\cdot | s) \rangle - \Omega(\pi(\cdot | s)) \]

Regularized Bellman Optimality Operator \((\mathrm{L}_\Omega)\):
\[ [\mathrm{L}_\Omega v](s) = [\max_\pi \mathrm{L}_{\pi,\Omega} v ](s) = \Omega^*(q(s, \cdot)) \]
It can be shown that the addition of a regularizer in these regularized operators still preserves the contraction properties, and therefore the existence of a solution to the optimality equations and the convergence of successive approximation.
The regularized value function of a stationary policy with decision rule \(\pi\), denoted by \(v_{\pi,\Omega}\), is the unique fixed point of the operator equation:
Under the usual assumptions on the discount factor and the boundedness of the reward, the value of a policy can also be found in closed form by solving for \(\mathbf{v}\) in the linear system of equations:
where \([\boldsymbol{\Omega}_\pi](s) = \Omega(\pi(\cdot|s))\) is the vector of regularization terms at each state.
The associated state-action value function \(q_{\pi,\Omega}\) is given by:
The regularized optimal value function \(v^*_\Omega\) is then the unique fixed point of \(\mathrm{L}_\Omega\) in the fixed point equation:
The associated state-action value function \(q^*_\Omega\) is given by:
An important result in the theory of regularized MDPs is that there exists a unique optimal regularized policy. Specifically, if \(\pi^*_\Omega\) is a conserving decision rule (i.e., \(\pi^*_\Omega = \arg\max_\pi \mathrm{L}_{\pi,\Omega} v^*_\Omega\)), then the randomized stationary policy \(\boldsymbol{\pi} = \mathrm{const}(\pi^*_\Omega)\) is the unique optimal regularized policy.
In practice, once we have found \(v^*_\Omega\), we can derive the optimal decision rule by taking the gradient of the convex conjugate evaluated at the optimal action-value function:
Recovering the Smooth Bellman Equations#
Under this framework, we can recover the smooth Bellman equations by choosing \(\Omega\) to be the negative entropy, and obtain the softmax policy as the gradient of the convex conjugate. Let’s show this explicitly:
Using the negative entropy regularizer:
\[ \Omega(d(\cdot|s)) = \sum_{a \in \mathcal{A}_s} d(a|s) \ln d(a|s) \]

The convex conjugate:

\[ \Omega^*(q(s, \cdot)) = \ln \sum_{a \in \mathcal{A}_s} \exp q(s,a) \]

Now, let's write out the regularized Bellman optimality equation:

\[ v^*_\Omega(s) = \Omega^*(q^*_\Omega(s, \cdot)) \]

Substituting the expressions for \(\Omega^*\) and \(q^*_\Omega\):
\[ v^*_\Omega(s) = \ln \sum_{a \in \mathcal{A}_s} \exp \left(r(s, a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a) v^*_\Omega(j)\right) \]
This is precisely the form of the smooth Bellman equation we derived earlier (here with \(\beta = 1\); scaling the regularizer to \(\frac{1}{\beta}\sum_a d(a|s)\ln d(a|s)\) recovers the general inverse temperature), with the log-sum-exp operation replacing the max operation of the standard Bellman equation.
Furthermore, the optimal policy is given by the gradient of \(\Omega^*\):
This is the familiar softmax policy we encountered in the smooth MDP setting.
Smooth Policy Iteration Algorithm#
Now that we have seen how the regularized MDP framework leads to the smooth Bellman equations, we present smooth policy iteration. Unlike value iteration, which directly iterates the Bellman operator, policy iteration alternates between policy evaluation and policy improvement steps.
Algorithm 13 (Smooth Policy Evaluation)
Input: MDP \((S, A, r, p, \gamma)\), policy \(\pi\), inverse temperature \(\beta > 0\), tolerance \(\epsilon > 0\)
Output: Value function \(v^\pi\) for policy \(\pi\)
Initialize \(v(s) \leftarrow 0\) for all \(s \in S\)
Set \(\alpha \leftarrow 1/\beta\)
repeat
\(\quad \Delta \leftarrow 0\)
\(\quad\) for each state \(s \in S\) do
\(\quad\quad v_{\text{old}} \leftarrow v(s)\)
\(\quad\quad\) for each action \(a \in A_s\) do
\(\quad\quad\quad q(s,a) \leftarrow r(s,a) + \gamma \sum_{j \in S} p(j|s,a) v(j)\)
\(\quad\quad\) end for
\(\quad\quad\) Compute expected Q-value: \(\bar{q} \leftarrow \sum_{a \in A_s} \pi(a|s) \cdot q(s,a)\)
\(\quad\quad\) Compute policy entropy: \(H \leftarrow -\sum_{a \in A_s} \pi(a|s) \log \pi(a|s)\)
\(\quad\quad v(s) \leftarrow \bar{q} + \alpha H\)
\(\quad\quad \Delta \leftarrow \max(\Delta, |v(s) - v_{\text{old}}|)\)
\(\quad\) end for
until \(\Delta < \epsilon\)
return \(v\)
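A minimal NumPy sketch of Algorithm 13 follows (the tabular array layout and function name are our own conventions); it is reused by the policy iteration sketch further below:

```python
import numpy as np

def smooth_policy_evaluation(r, P, gamma, pi, beta, tol=1e-8, max_iter=10_000):
    """Evaluate a stochastic policy with an entropy bonus of weight alpha = 1/beta.

    r[s, a], P[s, a, j] as before; pi[s, a] is the decision rule at each state.
    """
    alpha = 1.0 / beta
    v = np.zeros(r.shape[0])
    entropy = -np.sum(pi * np.log(pi + 1e-12), axis=1)   # H(pi(.|s)); small constant guards log(0)
    for _ in range(max_iter):
        q = r + gamma * P @ v                             # q(s,a) under the current value estimate
        v_new = np.sum(pi * q, axis=1) + alpha * entropy  # expected q-value plus entropy bonus
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v
```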
Algorithm 14 (Smooth Policy Iteration)
Input: MDP \((S, A, r, p, \gamma)\), inverse temperature \(\beta > 0\), tolerance \(\epsilon > 0\)
Output: Approximate optimal value function \(v\) and stochastic policy \(\pi\)
Initialize \(\pi(a|s) \leftarrow 1/|A_s|\) for all \(s \in S, a \in A_s\) (uniform policy)
repeat
\(\quad\) Policy Evaluation:
\(\quad\quad\) \(v \leftarrow\) SmoothPolicyEvaluation(\(S, A, r, p, \gamma, \pi, \beta, \epsilon\))
\(\quad\) Policy Improvement:
\(\quad\) policy_stable \(\leftarrow\) true
\(\quad\) for each state \(s \in S\) do
\(\quad\quad \pi_{\text{old}}(\cdot|s) \leftarrow \pi(\cdot|s)\)
\(\quad\quad\) for each action \(a \in A_s\) do
\(\quad\quad\quad q(s,a) \leftarrow r(s,a) + \gamma \sum_{j \in S} p(j|s,a) v(j)\)
\(\quad\quad\) end for
\(\quad\quad\) for each action \(a \in A_s\) do
\(\quad\quad\quad \pi(a|s) \leftarrow \frac{\exp(\beta \cdot q(s,a))}{\sum_{a' \in A_s} \exp(\beta \cdot q(s,a'))}\)
\(\quad\quad\) end for
\(\quad\quad\) if \(\|\pi(\cdot|s) - \pi_{\text{old}}(\cdot|s)\| > \epsilon\) then
\(\quad\quad\quad\) policy_stable \(\leftarrow\) false
\(\quad\quad\) end if
\(\quad\) end for
until policy_stable
return \(v, \pi\)
Key properties of smooth policy iteration:
Entropy-regularized evaluation: The policy evaluation step (line 12 of Algorithm 13) accounts for the entropy bonus \(\alpha H(\pi(\cdot|s))\) where \(\alpha = 1/\beta\)
Stochastic policy improvement: The policy improvement step (lines 12-14 of Algorithm 14) uses softmax instead of deterministic argmax, producing a stochastic policy
Temperature parameter:
Higher \(\beta\) → policies closer to deterministic (lower entropy)
Lower \(\beta\) → more stochastic policies (higher entropy)
As \(\beta \to \infty\) → recovers standard policy iteration
Convergence: Like standard policy iteration, this algorithm converges to the unique optimal regularized value function and policy
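Putting the two steps together, here is a corresponding sketch of Algorithm 14 (again our own code, reusing smooth_policy_evaluation from the sketch above):

```python
import numpy as np
from scipy.special import softmax

def smooth_policy_iteration(r, P, gamma, beta, tol=1e-8, max_iter=1_000):
    """Alternate entropy-regularized evaluation and softmax improvement."""
    n_states, n_actions = r.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)   # uniform initial policy
    for _ in range(max_iter):
        # Policy evaluation (entropy-regularized), using the earlier sketch
        v = smooth_policy_evaluation(r, P, gamma, pi, beta, tol)
        # Policy improvement: softmax over the q-values
        q = r + gamma * P @ v
        pi_new = softmax(beta * q, axis=1)
        if np.max(np.abs(pi_new - pi)) < tol:
            return v, pi_new
        pi = pi_new
    return v, pi
```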
Equivalence Between Smooth Bellman Equations and Entropy-Regularized MDPs#
We have now seen two distinct ways to arrive at smooth Bellman equations. Earlier in this chapter, we introduced the logsumexp operator as a smooth approximation to the max operator, motivated by analytical tractability and the desire for differentiability. Just now, we derived the same equations through the lens of regularized MDPs, where we explicitly penalize the entropy of policies. Remarkably, these two perspectives are mathematically equivalent: solving the smooth Bellman equation with inverse temperature parameter \(\beta\) yields exactly the same optimal value function and optimal policy as solving the entropy-regularized MDP with regularization strength \(\alpha = 1/\beta\). The two formulations are not merely similar—they describe identical optimization problems.
To see this equivalence clearly, consider the standard MDP problem with rewards \(r(s,a)\) and transition probabilities \(p(j|s,a)\). The regularized MDP framework tells us to solve:
where \(H(\pi(\cdot|s)) = -\sum_a \pi(a|s) \ln \pi(a|s)\) is the entropy of the policy at state \(s\), and \(\alpha > 0\) is the entropy regularization strength.
We can rewrite this objective by absorbing the entropy term into a modified reward function. Define the entropy-augmented reward:
However, this formulation makes the reward depend on the entire policy at each state, which is awkward. We can reformulate this more cleanly by expanding the entropy term. Recall that the entropy is:
When we take the expectation over actions drawn from \(\pi\), we have:
since the entropy doesn’t depend on which action is actually sampled. But we can also write this as:
This shows that adding \(\alpha H(\pi(\cdot|s))\) to the expected reward at state \(s\) is equivalent to adding \(-\alpha \ln \pi(a|s)\) to the reward of taking action \(a\) at state \(s\). More formally:
The entropy bonus at each state, when averaged over the policy, becomes a per-action penalty proportional to the negative log probability of the action taken. This reformulation is more useful because the modified reward now depends only on the state, the action taken, and the probability assigned to that specific action by the policy—not on the entire distribution over actions.
This expression shows that entropy regularization is equivalent to adding a state-action dependent term \(-\alpha \ln \pi(a|s)\) to the reward. Intuitively, this term amounts to paying a relative cost for low-entropy (near-deterministic) policies.
Now, when we write down the Bellman equation for this entropy-regularized problem, at each state \(s\) we need to find the decision rule \(d(\cdot|s) \in \Delta(\mathcal{A}_s)\) (a probability distribution over actions) that maximizes:
Here \(\Delta(\mathcal{A}_s) = \{d(\cdot|s) : d(a|s) \geq 0, \sum_a d(a|s) = 1\}\) denotes the probability simplex over actions available at state \(s\). The optimization is over randomized decision rules at each state, constrained to be valid probability distributions.
This is a convex optimization problem with a linear constraint. We form the Lagrangian:
where \(\lambda\) is the Lagrange multiplier enforcing the normalization constraint. Taking the derivative with respect to \(d(a|s)\) and setting it to zero:
Solving for \(d^*(a|s)\):
Using the normalization constraint \(\sum_a d^*(a|s) = 1\) to solve for \(\lambda\):
Therefore:
Substituting this back into the Bellman equation and simplifying:
Setting \(\beta = 1/\alpha\) (the inverse temperature), this becomes:
This is precisely the smooth Bellman equation we derived earlier using the logsumexp operator. The inverse temperature parameter \(\beta\) controls how closely the logsumexp approximates the max: as \(\beta \to \infty\), we recover the standard Bellman equation, while for finite \(\beta\), we have a smooth approximation that corresponds to optimizing with entropy regularization strength \(\alpha = 1/\beta\).
The optimal policy is:
which is exactly the softmax policy parametrized by inverse temperature.
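For reference, here is the calculation recapped in a single block (our own compact summary, using the shorthand \(q(s,a) = r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j|s,a) v(j)\) consistent with the notation above):

\[
\begin{aligned}
\mathcal{L}(d, \lambda) &= \sum_{a} d(a|s)\big(q(s,a) - \alpha \ln d(a|s)\big) + \lambda\Big(1 - \sum_{a} d(a|s)\Big), \\
0 = \frac{\partial \mathcal{L}}{\partial d(a|s)} &= q(s,a) - \alpha \ln d(a|s) - \alpha - \lambda
\;\;\Longrightarrow\;\; d^*(a|s) = \frac{\exp(q(s,a)/\alpha)}{\sum_{a'} \exp(q(s,a')/\alpha)}, \\
v(s) &= \sum_{a} d^*(a|s)\big(q(s,a) - \alpha \ln d^*(a|s)\big) = \alpha \ln \sum_{a} \exp\big(q(s,a)/\alpha\big) = \frac{1}{\beta} \ln \sum_{a} \exp\big(\beta\, q(s,a)\big).
\end{aligned}
\]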
The derivation establishes the complete equivalence: the value function \(v^*\) that solves the smooth Bellman equation is identical to the optimal value function \(v^*_\Omega\) of the entropy-regularized MDP (with \(\Omega\) being negative entropy and \(\alpha = 1/\beta\)), and the softmax policy that is greedy with respect to this value function achieves the maximum of the entropy-regularized objective. Both approaches yield the same numerical solution—the same values at every state and the same policy prescriptions. The only difference is how we conceptualize the problem: as smoothing the Bellman operator for computational tractability, or as explicitly trading off reward maximization against policy entropy.
This equivalence has important implications. When we use smooth Bellman equations with a logsumexp operator, we are implicitly solving an entropy-regularized MDP. Conversely, when we explicitly add entropy regularization to an MDP objective, we arrive at smooth Bellman equations as the natural description of optimality. This dual perspective will prove valuable in understanding various algorithms and theoretical results. For instance, in soft actor-critic methods and other maximum entropy reinforcement learning algorithms, the connection between smooth operators and entropy regularization provides both computational benefits (differentiability) and conceptual clarity (why we want stochastic policies).
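To see this equivalence concretely, the following self-contained sketch (our own construction, on a randomly generated tabular MDP) iterates both forms of the update and checks that they reach the same fixed point:

```python
import numpy as np
from scipy.special import logsumexp, softmax

rng = np.random.default_rng(0)
n_states, n_actions, gamma, beta = 5, 3, 0.9, 2.0
alpha = 1.0 / beta

r = rng.normal(size=(n_states, n_actions))
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, :] sums to 1

# (1) Smooth Bellman updates: v <- (1/beta) log sum_a exp(beta * q)
v1 = np.zeros(n_states)
for _ in range(2_000):
    q = r + gamma * P @ v1
    v1 = logsumexp(beta * q, axis=1) / beta

# (2) Entropy-regularized updates: softmax policy, then expected q plus alpha * entropy
v2 = np.zeros(n_states)
for _ in range(2_000):
    q = r + gamma * P @ v2
    pi = softmax(q / alpha, axis=1)
    v2 = np.sum(pi * q, axis=1) + alpha * (-np.sum(pi * np.log(pi), axis=1))

print(np.max(np.abs(v1 - v2)))   # numerically zero: the two fixed points coincide
```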
Entropy-Regularized Dynamic Programming Algorithms#
While the smooth Bellman equations (using logsumexp) and entropy-regularized formulations are mathematically equivalent, it is instructive to present the algorithms explicitly in the entropy-regularized form, where the entropy bonus appears directly in the update equations.
Algorithm 15 (Entropy-Regularized Value Iteration)
Input: MDP \((S, A, r, p, \gamma)\), entropy weight \(\alpha > 0\), tolerance \(\epsilon > 0\)
Output: Approximate optimal value function \(v\) and stochastic policy \(\pi\)
Initialize \(\pi(a|s) \leftarrow 1/|A_s|\) for all \(s \in S, a \in A_s\) (uniform policy)
Initialize \(v(s) \leftarrow 0\) for all \(s \in S\)
repeat
\(\quad \Delta \leftarrow 0\)
\(\quad\) for each state \(s \in S\) do
\(\quad\quad\) Policy Improvement: Update policy for current value estimate
\(\quad\quad\) for each action \(a \in A_s\) do
\(\quad\quad\quad q(s,a) \leftarrow r(s,a) + \gamma \sum_{j \in S} p(j|s,a) v(j)\)
\(\quad\quad\) end for
\(\quad\quad\) for each action \(a \in A_s\) do
\(\quad\quad\quad \pi_{\text{new}}(a|s) \leftarrow \frac{\exp(q(s,a)/\alpha)}{\sum_{a' \in A_s} \exp(q(s,a')/\alpha)}\)
\(\quad\quad\) end for
\(\quad\quad\) Value Update: Compute regularized value
\(\quad\quad v_{\text{new}}(s) \leftarrow \sum_{a \in A_s} \pi_{\text{new}}(a|s) \cdot q(s,a) + \alpha H(\pi_{\text{new}}(\cdot|s))\)
\(\quad\quad\) where \(H(\pi_{\text{new}}(\cdot|s)) = -\sum_{a \in A_s} \pi_{\text{new}}(a|s) \log \pi_{\text{new}}(a|s)\)
\(\quad\quad \Delta \leftarrow \max(\Delta, |v_{\text{new}}(s) - v(s)|)\)
\(\quad\quad v(s) \leftarrow v_{\text{new}}(s)\)
\(\quad\quad \pi(\cdot|s) \leftarrow \pi_{\text{new}}(\cdot|s)\)
\(\quad\) end for
until \(\Delta < \epsilon\)
return \(v, \pi\)
Key features:
Line 11 updates the policy using the softmax of Q-values, with temperature \(\alpha\)
Line 14 explicitly computes the entropy-regularized value: expected Q-value plus entropy bonus
The algorithm maintains and updates a stochastic policy throughout
As \(\alpha \to 0\) (or equivalently \(\beta \to \infty\)), this recovers standard value iteration
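A corresponding NumPy sketch of Algorithm 15 (our own array conventions) reads:

```python
import numpy as np
from scipy.special import softmax

def entropy_regularized_value_iteration(r, P, gamma, alpha, tol=1e-8, max_iter=10_000):
    """Softmax policy improvement followed by an entropy-augmented value update."""
    n_states, n_actions = r.shape
    v = np.zeros(n_states)
    pi = np.full((n_states, n_actions), 1.0 / n_actions)   # uniform initial policy
    for _ in range(max_iter):
        q = r + gamma * P @ v                              # q(s,a)
        pi = softmax(q / alpha, axis=1)                    # policy improvement at temperature alpha
        entropy = -np.sum(pi * np.log(pi), axis=1)         # H(pi(.|s))
        v_new = np.sum(pi * q, axis=1) + alpha * entropy   # regularized value update
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, pi
        v = v_new
    return v, pi
```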
Algorithm 16 (Entropy-Regularized Policy Iteration)
Input: MDP \((S, A, r, p, \gamma)\), entropy weight \(\alpha > 0\), tolerance \(\epsilon > 0\)
Output: Approximate optimal value function \(v\) and stochastic policy \(\pi\)
Initialize \(\pi(a|s) \leftarrow 1/|A_s|\) for all \(s \in S, a \in A_s\) (uniform policy)
repeat
\(\quad\) Policy Evaluation: Solve for \(v^\pi\) such that for all \(s \in S\):
\(\quad\quad\) Option 1 (Iterative):
\(\quad\quad\) Initialize \(v(s) \leftarrow 0\) for all \(s \in S\)
\(\quad\quad\) repeat
\(\quad\quad\quad\) for each state \(s \in S\) do
\(\quad\quad\quad\quad\) Compute \(q^\pi(s,a) \leftarrow r(s,a) + \gamma \sum_{j \in S} p(j|s,a) v(j)\) for all \(a \in A_s\)
\(\quad\quad\quad\quad v_{\text{new}}(s) \leftarrow \sum_{a \in A_s} \pi(a|s) \cdot q^\pi(s,a) + \alpha H(\pi(\cdot|s))\)
\(\quad\quad\quad\) end for
\(\quad\quad\quad\) if \(\max_s |v_{\text{new}}(s) - v(s)| < \epsilon\) then break
\(\quad\quad\quad v \leftarrow v_{\text{new}}\)
\(\quad\quad\) until convergence
\(\quad\quad\) Option 2 (Direct): Solve linear system \((\mathbf{I} - \gamma \mathbf{P}_\pi) \mathbf{v} = \mathbf{r}_\pi + \alpha \mathbf{H}_\pi\)
\(\quad\quad\) where \([\mathbf{r}_\pi](s) = \sum_a \pi(a|s) r(s,a)\) and \([\mathbf{H}_\pi](s) = H(\pi(\cdot|s))\)
\(\quad\) Policy Improvement:
\(\quad\) policy_changed \(\leftarrow\) false
\(\quad\) for each state \(s \in S\) do
\(\quad\quad \pi_{\text{old}}(\cdot|s) \leftarrow \pi(\cdot|s)\)
\(\quad\quad\) for each action \(a \in A_s\) do
\(\quad\quad\quad q(s,a) \leftarrow r(s,a) + \gamma \sum_{j \in S} p(j|s,a) v(j)\)
\(\quad\quad\) end for
\(\quad\quad\) for each action \(a \in A_s\) do
\(\quad\quad\quad \pi(a|s) \leftarrow \frac{\exp(q(s,a)/\alpha)}{\sum_{a' \in A_s} \exp(q(s,a')/\alpha)}\)
\(\quad\quad\) end for
\(\quad\quad\) if \(\|\pi(\cdot|s) - \pi_{\text{old}}(\cdot|s)\| > \epsilon\) then
\(\quad\quad\quad\) policy_changed \(\leftarrow\) true
\(\quad\quad\) end if
\(\quad\) end for
until policy_changed \(=\) false
return \(v, \pi\)
Key features:
Policy Evaluation (lines 3-15): Computes the value of the current policy including entropy bonus
Option 1: Iterative method (successive approximation)
Option 2: Direct solution via linear system
Policy Improvement (lines 16-29): Updates policy to softmax over Q-values
Line 14 shows the vector form: the linear system includes the entropy vector \(\mathbf{H}_\pi\)
The algorithm alternates between evaluating the current stochastic policy and improving it
Converges to the unique optimal entropy-regularized policy
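Finally, here is a sketch of Algorithm 16 using the direct (linear-system) evaluation option; as before, the tabular array layout and function name are our own:

```python
import numpy as np
from scipy.special import softmax

def entropy_regularized_policy_iteration(r, P, gamma, alpha, tol=1e-10, max_iter=500):
    """Policy iteration with entropy-regularized evaluation solved as a linear system."""
    n_states, n_actions = r.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)   # uniform initial policy
    for _ in range(max_iter):
        # Policy evaluation: solve (I - gamma * P_pi) v = r_pi + alpha * H_pi
        P_pi = np.einsum('sa,saj->sj', pi, P)           # [P_pi](s, j) = sum_a pi(a|s) p(j|s,a)
        r_pi = np.sum(pi * r, axis=1)                   # expected reward under pi
        H_pi = -np.sum(pi * np.log(pi), axis=1)         # per-state policy entropy
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi + alpha * H_pi)

        # Policy improvement: softmax over the q-values at temperature alpha
        q = r + gamma * P @ v
        pi_new = softmax(q / alpha, axis=1)
        if np.max(np.abs(pi_new - pi)) < tol:
            return v, pi_new
        pi = pi_new
    return v, pi
```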