Dynamic programming methods suffer from the curse of dimensionality and can quickly become difficult to apply in practice, particularly when the state or action spaces are large or continuous. So far, we have addressed this problem with discretization or interpolation, which were already examples of approximate dynamic programming. In this chapter, we look at other forms of approximation meant to make the optimization problem tractable, whether by approximating the optimality equations, the value function, or the policy itself.
Approximation theory is at the heart of learning methods, and fundamentally, this chapter will be about the application of learning ideas to solve complex decision-making problems.
While the standard Bellman optimality equations use the max operator to determine the best action, an alternative formulation known as the smooth or soft Bellman optimality equations replaces this with a softmax operator. This approach originated from Rust (1987) and was later rediscovered in the context of maximum entropy inverse reinforcement learning Ziebart et al. (2008), which then led to soft Q-learning Haarnoja et al. (2017) and soft actor-critic Haarnoja et al. (2018), a state-of-the-art deep reinforcement learning algorithm.
In the infinite-horizon setting, the smooth Bellman optimality equations take the form:

$$
v(s) = \frac{1}{\beta} \log \sum_{a \in \mathcal{A}_s} \exp\Big(\beta \Big[\, r(s,a) + \gamma \sum_{j \in \mathcal{S}} p(j \mid s,a)\, v(j) \Big]\Big), \qquad \text{for all } s \in \mathcal{S}.
$$
Adopting an operator-theoretic perspective, we can define a nonlinear operator Lβ that maps a value function to the right-hand side of the smooth optimality equation above; the smooth value function of an MDP is then the solution to the fixed-point equation v = Lβ v.
As β→∞, Lβ converges to the standard Bellman operator L. Furthermore, it can be shown that the smooth Bellman operator is a contraction mapping in the supremum norm, and therefore has a unique fixed point. However, as opposed to the usual “hard” setting, the fixed point of Lβ is associated with the value function of an optimal stochastic policy defined by the softmax distribution:

$$
d(a \mid s) = \frac{\exp\big(\beta\, q(s,a)\big)}{\sum_{a'} \exp\big(\beta\, q(s,a')\big)}, \qquad \text{where } q(s,a) = r(s,a) + \gamma \sum_{j} p(j \mid s,a)\, v(j).
$$
Despite the confusing terminology, the above “softmax” policy is simply the smooth counterpart to the argmax operator in the original optimality equation: it acts as a soft-argmax.
This formulation is interesting for several reasons. First, smoothness is a desirable property from an optimization standpoint. Unlike the discount factor γ, which is part of the problem specification, β is a hyperparameter of our algorithm, which we can control to achieve the desired level of accuracy in approximating the original max.
Second, while presented here from an intuitive standpoint in which we replace the max by the log-sum-exp (a smooth maximum) and the argmax by the softmax (a smooth argmax), this formulation can also be obtained from several other perspectives, each offering its own theoretical tools and solution methods. For example, Rust (1987) arrived at these equations by considering a setting in which the rewards are stochastic and perturbed by Gumbel noise. When considering the corresponding augmented state space and integrating out the noise, we obtain the smooth equations. Rust leverages this interpretation for modeling purposes.
The smooth value iteration algorithm replaces the max operator in standard value iteration with the logsumexp operator. Here is the algorithm structure (a code sketch follows the list of differences below):
Differences from standard value iteration:
Line 8 uses (1/β) log ∑_a exp(β⋅q(s,a)) instead of max_a q(s,a)
Line 15 extracts a stochastic policy using softmax instead of a deterministic argmax policy
As β→∞, the algorithm converges to standard value iteration
Lower β values produce more stochastic policies with higher entropy
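To make this concrete, here is a minimal sketch of smooth value iteration on a tabular MDP. It assumes reward and transition arrays R[s, a] and P[s, a, s']; these array names and the SciPy helpers are our own choices, not part of the algorithm listing above.

```python
import numpy as np
from scipy.special import logsumexp, softmax

def smooth_value_iteration(R, P, gamma, beta, tol=1e-8, max_iter=10_000):
    """Smooth value iteration: logsumexp backup and softmax policy extraction."""
    n_states, n_actions = R.shape
    v = np.zeros(n_states)
    for _ in range(max_iter):
        # q(s,a) = r(s,a) + γ Σ_j p(j|s,a) v(j)
        q = R + gamma * P @ v
        # Smooth maximum: (1/β) log Σ_a exp(β q(s,a)) replaces max_a q(s,a)
        v_new = logsumexp(beta * q, axis=1) / beta
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    # Stochastic policy: softmax over Q-values instead of a deterministic argmax
    pi = softmax(beta * (R + gamma * P @ v), axis=1)
    return v, pi
```

The only changes relative to standard value iteration are the logsumexp backup and the softmax policy extraction at the end.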
There is also a way to obtain these equations by starting from the energy-based formulation often used in supervised learning, in which we convert an unnormalized probability model into a distribution using the softmax transformation. This is essentially what Ziebart et al. (2008) did in their paper. Furthermore, this perspective connects to the literature on probabilistic graphical models, in which the problem of finding an optimal smooth policy can be cast as maximum likelihood estimation (an inference problem). This is the idea of control as inference, which also admits the converse - that of inference as control - used nowadays for deriving fast samplers and amortized inference techniques based on reinforcement learning Levine (2018).
Finally, it is worth noting that we can also derive this form by considering an entropy-regularized formulation in which the reward is augmented with a bonus for the entropy of the policy (equivalently, a penalty on near-deterministic policies). This formulation admits a solution that coincides with the smooth Bellman equations Haarnoja et al. (2017).
Alternative Soft Maximum: Gaussian Uncertainty-Weighted Aggregation
The logsumexp operator provides one way to soften the hard maximum, but alternative approaches exist. When Q-value estimates have heterogeneous uncertainty (some actions estimated more precisely than others), we can weight actions by the probability they are optimal under a Gaussian uncertainty model. D'Eramo et al. (2016) proposed computing weights as:

$$
w_{a'} = \int_{-\infty}^{\infty} \frac{\sqrt{n}}{\hat\sigma_{a'}}\, \phi\!\left(\frac{x - \hat\mu_{a'}}{\hat\sigma_{a'}/\sqrt{n}}\right) \prod_{b \neq a'} \Phi\!\left(\frac{x - \hat\mu_{b}}{\hat\sigma_{b}/\sqrt{n}}\right) dx,
$$

i.e., the probability that action a′ attains the maximum when each estimate is modeled as an independent Gaussian,
where μ̂_a′ and σ̂_a′ are the sample mean and standard deviation of the Q-value estimates, n is the sample size, and ϕ, Φ are the standard normal PDF and CDF. The soft Bellman target becomes v(s′) = ∑_a′ w_a′ q(s′,a′), a probability-weighted expectation rather than a hard maximum.
This differs from logsumexp in that it adapts to state-action-specific uncertainty (actions with tighter confidence intervals receive more weight), whereas logsumexp applies uniform smoothing via temperature β. The Gaussian-weighted approach requires maintaining variance estimates and computing integrals, making it more expensive than logsumexp. However, it provides a principled way to reduce overestimation bias in Q-learning while avoiding the pessimism of double Q-learning. We return to this estimator in the simulation-based methods chapter when discussing overestimation bias mitigation strategies.
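As an illustration, here is one plausible way to compute probability-of-optimality weights of this kind, treating each Q-value estimate as an independent Gaussian with mean μ̂_a and standard error σ̂_a/√n and integrating numerically. The function name, arrays, and integration bounds are our own pragmatic choices rather than part of the method.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def gaussian_optimality_weights(mu, sigma, n):
    """Weight each action by the probability that it has the largest true value."""
    se = sigma / np.sqrt(n)  # standard error of each Q-value estimate
    def weight(a):
        # Integrate the density of estimate a times the CDFs of all other estimates
        integrand = lambda x: norm.pdf(x, mu[a], se[a]) * np.prod(
            [norm.cdf(x, mu[b], se[b]) for b in range(len(mu)) if b != a])
        lo, hi = mu.min() - 8 * se.max(), mu.max() + 8 * se.max()
        return quad(integrand, lo, hi)[0]
    w = np.array([weight(a) for a in range(len(mu))])
    return w / w.sum()  # normalize away residual integration error

# Soft Bellman target: probability-weighted expectation instead of a hard max
mu, sigma = np.array([1.0, 1.2, 0.8]), np.array([0.5, 1.5, 0.2])
w = gaussian_optimality_weights(mu, sigma, n=10)
v_target = np.dot(w, mu)
```

Because each weight requires a one-dimensional numerical integral per action, this is noticeably more expensive than a single logsumexp over the Q-values.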
We can obtain the smooth Bellman equation by considering a setting in which Gumbel noise is added to the reward function. This derivation provides theoretical insight and connects to practical modeling scenarios in which rewards are randomly perturbed.
Step 1: Define the Augmented MDP with Gumbel Noise
At each time period and state s, we draw an action-indexed shock vector ϵ = (ϵ(a))_{a∈A_s}, with each component drawn i.i.d. from a Gumbel distribution with scale 1/β, independently across time periods. The shock ϵ(a) perturbs the reward of action a, and the augmented state is the pair (s, ϵ).
For the augmented state (s, ϵ), the Bellman optimality equation reads

$$
\tilde{v}(s,\epsilon) = \max_{a \in \mathcal{A}_s} \Big\{ r(s,a) + \epsilon(a) + \gamma \sum_{j} p(j \mid s,a)\, \mathbb{E}_{\epsilon'}\big[\tilde{v}(j,\epsilon')\big] \Big\}.
$$

However, this equation is computationally intractable because:
The augmented state space is continuous (the shock component takes values in ℝ^{|A_s|})
The shocks are fresh each period
We would need to solve for ṽ over an uncountable domain
We never solve this equation directly. Instead, we use it as a mathematical device to derive the smooth Bellman equation.
Step 3: Define the Ex-Ante (Inclusive) Value Function
The idea here is to consider the expected value before observing the current shocks. We define what some authors in econometrics call the inclusive value or ex-ante value:

$$
v(s) = \mathbb{E}_{\epsilon}\big[\tilde{v}(s,\epsilon)\big].
$$

This is the value of being in state s before we observe the current-period shock vector ϵ.
Step 4: Separate the Deterministic and Random Components
Now we take the expectation of the augmented Bellman equation with respect to the current shocks only (everything that does not depend on the current ϵ can be pulled out).
First, note that by the law of iterated expectations and the independence of shocks across time, the continuation value inside the augmented Bellman equation involves only the ex-ante value of the next state:

$$
\mathbb{E}_{\epsilon'}\big[\tilde{v}(j,\epsilon')\big] = v(j).
$$

Writing x_a(s) = r(s,a) + γ ∑_j p(j∣s,a) v(j) for the deterministic part, the augmented equation becomes ṽ(s,ϵ) = max_a {x_a(s) + ϵ(a)}. Taking the expectation over ϵ and applying the Gumbel expectation identity then yields the smooth Bellman equation. To recap the derivation:
We constructed an augmented MDP with state (s,ϵ) where shocks perturb rewards
We wrote the standard Bellman equation for this augmented MDP (hard max, but over an infinite-dimensional state space)
We defined the ex-ante value v(s) = E_ϵ[ṽ(s,ϵ)] to eliminate the continuous shock component
We separated deterministic and random terms: ṽ(s,ϵ) = max_a {x_a(s) + ϵ(a)}
We applied the Gumbel identity to evaluate E_ϵ[max_a {⋯}] in closed form as a log-sum-exp
The augmented MDP with shocks exists only as a mathematical device. We never approximate ṽ, never discretize ϵ, and never enumerate the augmented state space. The only computational object we work with is v(s) on the original (finite) state space, which satisfies the smooth Bellman equation.
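The Gumbel identity is easy to check numerically. The snippet below is a quick Monte Carlo sanity check under the common convention that the shocks are i.i.d. Gumbel with scale 1/β, centered so they have mean zero; whether the Euler–Mascheroni constant is absorbed into the rewards or into the shock location is a modeling convention, so treat the centering here as an assumption of this sketch.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
beta = 2.0
x = np.array([0.3, 1.0, 0.7])   # x_a(s): deterministic part for each action

# i.i.d. Gumbel shocks with scale 1/β, shifted to have mean zero
eps = rng.gumbel(loc=-np.euler_gamma / beta, scale=1.0 / beta,
                 size=(1_000_000, x.size))

# E_ϵ[max_a {x_a(s) + ϵ(a)}] should match the smooth (log-sum-exp) maximum of x
print((x + eps).max(axis=1).mean())     # Monte Carlo estimate
print(logsumexp(beta * x) / beta)       # closed form: (1/β) log Σ_a exp(β x_a)
```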
Now that we have derived the smooth value function, we can also obtain the corresponding optimal policy. The question is: what policy should we follow in the original MDP (without explicitly conditioning on shocks)?
In the augmented MDP, the optimal policy is deterministic but depends on the shock realization: at state (s, ϵ), it selects any action in argmax_a {x_a(s) + ϵ(a)}.
However, we want a policy for the original state space s (not the augmented state). We obtain it by marginalizing over the current shocks, essentially asking: “what is the probability that action a is optimal when we average over all possible shock realizations?” For Gumbel shocks, this probability has a closed form:

$$
\pi(a \mid s) = \mathbb{P}_{\epsilon}\Big( a \in \arg\max_{a'} \big\{ x_{a'}(s) + \epsilon(a') \big\} \Big) = \frac{\exp\big(\beta\, x_a(s)\big)}{\sum_{a'} \exp\big(\beta\, x_{a'}(s)\big)}.
$$
This is the softmax policy or Gibbs/Boltzmann policy with inverse temperature β.
Properties:
As β→∞: the policy becomes deterministic, concentrating on the action(s) with highest x_a(s) (recovers the standard greedy policy)
As β→0: the policy becomes uniform over all actions (maximum entropy)
For finite β>0: the policy is stochastic, with probability mass proportional to exponentiated Q-values
This completes the derivation: the smooth Bellman equation yields a value function v(s), and the corresponding optimal policy is the softmax over Q-values.
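The temperature behaviour listed above is easy to visualize. The snippet below sweeps β for an arbitrary made-up vector of action values x (our own example, not tied to any particular MDP) and prints the resulting softmax policies.

```python
import numpy as np
from scipy.special import softmax

x = np.array([0.3, 1.0, 0.7])   # x_a(s) for a single state, chosen arbitrarily
for beta in (0.01, 1.0, 10.0, 100.0):
    # β→0: nearly uniform (maximum entropy); β→∞: concentrates on argmax_a x_a
    print(f"beta={beta:>6}: {softmax(beta * x)}")
```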
Regularized MDPs Geist et al. (2019) offer yet another perspective on how the smooth Bellman equations come to be. This framework is more general: we seek optimal policies under the infinite-horizon criterion while also accounting for a regularizer that shapes the kind of policies we try to obtain.
Let’s set up some necessary notation. First, recall that the policy evaluation operator for a stationary policy with decision rule π is defined as:

$$
\mathrm{L}_\pi v = r_\pi + \gamma P_\pi v,
$$
where rπ is the expected reward vector under policy π, γ is the discount factor, and Pπ is the state transition probability matrix under π. A complementary object to the value function is the q-function (or Q-factor) representation:

$$
q_\pi(s,a) = r(s,a) + \gamma \sum_{j} p(j \mid s,a)\, v_\pi(j).
$$
The workhorse behind the theory of regularized MDPs is the Legendre-Fenchel transform, also known as the convex conjugate. For a strongly convex function Ω: Δ_A → ℝ, its Legendre-Fenchel transform Ω∗: ℝ^A → ℝ is defined as:

$$
\Omega^*(q) = \max_{\pi \in \Delta_{\mathcal{A}}} \; \langle \pi, q \rangle - \Omega(\pi).
$$
An important property of this transform is that it has a unique maximizing argument, given by the gradient of Ω∗. This gradient is Lipschitz and satisfies:

$$
\nabla \Omega^*(q) = \arg\max_{\pi \in \Delta_{\mathcal{A}}} \; \langle \pi, q \rangle - \Omega(\pi).
$$
It can be shown that the addition of a regularizer in these regularized operators still preserves the contraction properties, and therefore the existence of a solution to the optimality equations and the convergence of successive approximation.
The regularized value function of a stationary policy with decision rule π, denoted by vπ,Ω, is the unique fixed point of the operator equation:

$$
v_{\pi,\Omega} = \mathrm{L}_{\pi,\Omega}\, v_{\pi,\Omega}, \qquad \text{where } \mathrm{L}_{\pi,\Omega}\, v = r_\pi - \Omega(\pi) + \gamma P_\pi v
$$

and Ω(π) denotes the vector whose s-th component is Ω(π(⋅∣s)).
Under the usual assumptions on the discount factor and the boundedness of the reward, the value of a policy can also be found in closed form by solving for v in the linear system of equations:

$$
(I - \gamma P_\pi)\, v = r_\pi - \Omega(\pi).
$$
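As a small illustration, here is a sketch of regularized policy evaluation by a direct linear solve, for the special case where Ω is the negative entropy scaled by 1/β. The array names (R, P, pi) and the assumption that the policy puts strictly positive mass on every action are ours.

```python
import numpy as np

def regularized_policy_evaluation(R, P, pi, gamma, beta):
    """Solve (I - γ P_π) v = r_π - Ω(π) for the negative-entropy regularizer."""
    n_states = R.shape[0]
    r_pi = np.einsum('sa,sa->s', pi, R)                    # r_π(s) = Σ_a π(a|s) r(s,a)
    P_pi = np.einsum('sa,saj->sj', pi, P)                  # P_π(s,j) = Σ_a π(a|s) p(j|s,a)
    omega = np.einsum('sa,sa->s', pi, np.log(pi)) / beta   # Ω(π(·|s)) = (1/β) Σ_a π ln π
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi - omega)
```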
An important result in the theory of regularized MDPs is that there exists a unique optimal regularized policy. Specifically, if π∗_Ω is a conserving decision rule (i.e., π∗_Ω = argmax_π L_{π,Ω} v∗_Ω), then the stationary randomized policy that applies π∗_Ω at every decision stage is the unique optimal regularized policy.
In practice, once we have found v∗_Ω, we can derive the optimal decision rule by taking the gradient of the convex conjugate evaluated at the optimal action-value function:

$$
\pi^*_\Omega(\cdot \mid s) = \nabla \Omega^*\big(q^*_\Omega(s,\cdot)\big).
$$
Under this framework, we can recover the smooth Bellman equations by choosing Ω to be the negative entropy, and obtain the softmax policy as the gradient of the convex conjugate. Let's show this explicitly. Take Ω(π(⋅∣s)) = (1/β) ∑_a π(a∣s) ln π(a∣s). Its convex conjugate is

$$
\Omega^*\big(q(s,\cdot)\big) = \frac{1}{\beta} \log \sum_{a} \exp\big(\beta\, q(s,a)\big),
$$

so the regularized optimality equation becomes v∗(s) = (1/β) log ∑_a exp(β q∗(s,a)).
This matches the form of the smooth Bellman equation we derived earlier, with the log-sum-exp operation replacing the max operation of the standard Bellman equation.
Furthermore, the optimal policy is given by the gradient of Ω∗:

$$
\pi^*(a \mid s) = \big[\nabla \Omega^*\big(q^*(s,\cdot)\big)\big]_a = \frac{\exp\big(\beta\, q^*(s,a)\big)}{\sum_{a'} \exp\big(\beta\, q^*(s,a')\big)}.
$$
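These conjugacy claims can be verified numerically. The sketch below maximizes ⟨π, q⟩ − (1/β) ∑_a π_a ln π_a over the simplex with a generic constrained optimizer and compares the result against the scaled log-sum-exp and the softmax; the vector q and the solver settings are arbitrary choices for illustration.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp, softmax

beta = 2.0
q = np.array([0.3, 1.0, 0.7])

def neg_objective(p):
    # Negative of ⟨p, q⟩ − (1/β) Σ_a p_a ln p_a (we minimize the negative)
    return -(p @ q - (p * np.log(p)).sum() / beta)

x0 = np.ones_like(q) / q.size
res = minimize(neg_objective, x0,
               bounds=[(1e-9, 1.0)] * q.size,
               constraints=[{'type': 'eq', 'fun': lambda p: p.sum() - 1.0}])

print(-res.fun, logsumexp(beta * q) / beta)   # Ω*(q): maximized value vs. closed form
print(res.x, softmax(beta * q))               # maximizer vs. ∇Ω*(q) = softmax(β q)
```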
Now that we have seen how the regularized MDP framework leads to smooth Bellman equations, we present smooth policy iteration. Unlike value iteration, which directly iterates the Bellman operator, policy iteration alternates between policy evaluation and policy improvement steps; a code sketch follows the list of properties below.
Key properties of smooth policy iteration:
Entropy-regularized evaluation: The policy evaluation step (line 12 of Algorithm 2) accounts for the entropy bonus αH(π(⋅∣s)), where α = 1/β
Stochastic policy improvement: The policy improvement step (lines 12-14 of Algorithm 3) uses a softmax instead of a deterministic argmax, producing a stochastic policy
Temperature parameter:
Higher β → policies closer to deterministic (lower entropy)
Lower β → more stochastic policies (higher entropy)
As β→∞ → recovers standard policy iteration
Convergence: Like standard policy iteration, this algorithm converges to the unique optimal regularized value function and policy
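Here is a compact sketch of smooth (entropy-regularized) policy iteration for a tabular MDP, assuming reward and transition arrays R[s, a] and P[s, a, s'] and a fixed number of evaluation sweeps rather than an exact solve; all names and stopping rules are our own choices, not those of the algorithm listings referenced above.

```python
import numpy as np
from scipy.special import softmax

def smooth_policy_iteration(R, P, gamma, beta, n_iters=100, eval_iters=500):
    n_states, n_actions = R.shape
    alpha = 1.0 / beta
    pi = np.full((n_states, n_actions), 1.0 / n_actions)   # start from the uniform policy
    v = np.zeros(n_states)
    for _ in range(n_iters):
        # Policy evaluation with entropy bonus: v(s) = Σ_a π(a|s)[q(s,a) − α ln π(a|s)]
        for _ in range(eval_iters):
            q = R + gamma * P @ v
            v = np.einsum('sa,sa->s', pi, q - alpha * np.log(pi))
        # Policy improvement: softmax over Q-values instead of a deterministic argmax
        new_pi = softmax(beta * (R + gamma * P @ v), axis=1)
        if np.allclose(new_pi, pi, atol=1e-10):
            break
        pi = new_pi
    return v, pi
```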
Equivalence Between Smooth Bellman Equations and Entropy-Regularized MDPs
We have now seen two distinct ways to arrive at smooth Bellman equations. Earlier in this chapter, we introduced the logsumexp operator as a smooth approximation to the max operator, motivated by analytical tractability and the desire for differentiability. Just now, we derived the same equations through the lens of regularized MDPs, where we explicitly penalize the entropy of policies. These two perspectives are mathematically equivalent: solving the smooth Bellman equation with inverse temperature parameter β yields exactly the same optimal value function and optimal policy as solving the entropy-regularized MDP with regularization strength α=1/β. The two formulations are not merely similar. They describe identical optimization problems.
To see this equivalence clearly, consider the standard MDP problem with rewards r(s,a) and transition probabilities p(j∣s,a). The regularized MDP framework tells us to solve:

$$
\max_{\pi} \; \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t \Big( r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big) \right].
$$
However, this formulation makes the reward depend on the entire policy at each state, which is awkward. We can reformulate it more cleanly by expanding the entropy term. Recall that the entropy is

$$
\mathcal{H}\big(\pi(\cdot \mid s)\big) = -\sum_a \pi(a \mid s) \ln \pi(a \mid s),
$$

so that α H(π(⋅∣s)) = ∑_a π(a∣s) (−α ln π(a∣s)).
This shows that adding α H(π(⋅∣s)) to the expected reward at state s is equivalent to adding −α ln π(a∣s) to the reward of taking action a at state s. More formally:

$$
\sum_a \pi(a \mid s)\, r(s,a) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s)\big) = \sum_a \pi(a \mid s)\Big( r(s,a) - \alpha \ln \pi(a \mid s) \Big).
$$
The entropy bonus at each state, when averaged over the policy, becomes a per-action penalty proportional to the negative log probability of the action taken. This reformulation is more useful because the modified reward now depends only on the state, the action taken, and the probability assigned to that specific action by the policy, not on the entire distribution over actions.
This expression shows that entropy regularization is equivalent to adding a state-action dependent penalty term −α ln π(a∣s) to the reward. Intuitively, this term amounts to paying a cost for low-entropy (near-deterministic) policies.
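The identity above is a one-line numerical check; the policy vector below is an arbitrary example of ours.

```python
import numpy as np

alpha = 0.5
pi_s = np.array([0.7, 0.2, 0.1])                        # some policy at a state s
entropy_bonus = -alpha * (pi_s * np.log(pi_s)).sum()    # α H(π(·|s))
avg_penalty = (pi_s * (-alpha * np.log(pi_s))).sum()    # E_{a~π}[−α ln π(a|s)]
print(np.isclose(entropy_bonus, avg_penalty))           # True
```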
Now, when we write down the Bellman equation for this entropy-regularized problem, at each state s we need to find the decision rule d(⋅∣s) ∈ Δ(A_s) (a probability distribution over actions) that maximizes:

$$
\sum_a d(a \mid s) \Big[ r(s,a) + \gamma \sum_{j} p(j \mid s,a)\, v(j) - \alpha \ln d(a \mid s) \Big].
$$
Here Δ(A_s) = {d(⋅∣s) : d(a∣s) ≥ 0, ∑_a d(a∣s) = 1} denotes the probability simplex over the actions available at state s. The optimization is over randomized decision rules at each state, constrained to be valid probability distributions.
This is a convex optimization problem with a linear constraint. We form the Lagrangian:

$$
\Lambda\big(d(\cdot \mid s), \lambda\big) = \sum_a d(a \mid s)\Big[ r(s,a) + \gamma \sum_{j} p(j \mid s,a)\, v(j) - \alpha \ln d(a \mid s) \Big] + \lambda \Big( 1 - \sum_a d(a \mid s) \Big).
$$
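The remaining steps are standard, so we sketch them here for completeness, writing q(s,a) = r(s,a) + γ ∑_j p(j∣s,a) v(j) for brevity. Setting the derivative of the Lagrangian with respect to d(a∣s) to zero gives

$$
q(s,a) - \alpha \ln d(a \mid s) - \alpha - \lambda = 0
\quad\Longrightarrow\quad
d(a \mid s) \propto \exp\big(q(s,a)/\alpha\big),
$$

and enforcing normalization yields

$$
d^*(a \mid s) = \frac{\exp\big(\beta\, q(s,a)\big)}{\sum_{a'} \exp\big(\beta\, q(s,a')\big)}, \qquad \beta = 1/\alpha.
$$

Substituting this maximizer back into the objective collapses the bracketed term to a constant in a, leaving

$$
v(s) = \frac{1}{\beta} \log \sum_a \exp\big(\beta\, q(s,a)\big).
$$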
We recover the smooth Bellman equation we derived earlier using the logsumexp operator. The inverse temperature parameter β controls how closely the logsumexp approximates the max: as β→∞, we recover the standard Bellman equation, while for finite β, we have a smooth approximation that corresponds to optimizing with entropy regularization strength α=1/β.
The maximizing decision rule, d∗(a∣s) = exp(β q(s,a)) / ∑_a′ exp(β q(s,a′)), is exactly the softmax policy parametrized by the inverse temperature.
The derivation establishes the complete equivalence: the value function v∗ that solves the smooth Bellman equation is identical to the optimal value function vΩ∗ of the entropy-regularized MDP (with Ω being negative entropy and α=1/β), and the softmax policy that is greedy with respect to this value function achieves the maximum of the entropy-regularized objective. Both approaches yield the same numerical solution: the same values at every state and the same policy prescriptions. The only difference is how we conceptualize the problem: as smoothing the Bellman operator for computational tractability, or as explicitly trading off reward maximization against policy entropy.
This equivalence has important implications. When we use smooth Bellman equations with a logsumexp operator, we are implicitly solving an entropy-regularized MDP. Conversely, when we explicitly add entropy regularization to an MDP objective, we arrive at smooth Bellman equations as the natural description of optimality. This dual perspective will prove valuable in understanding various algorithms and theoretical results. For instance, in soft actor-critic methods and other maximum entropy reinforcement learning algorithms, the connection between smooth operators and entropy regularization provides both computational benefits (differentiability) and conceptual clarity (why we want stochastic policies).
While the smooth Bellman equations (using logsumexp) and the entropy-regularized formulation are mathematically equivalent, it is instructive to present the algorithms explicitly in the entropy-regularized form, where the entropy bonus appears directly in the update equations; a code sketch follows the feature list below.
Features:
Line 11 updates the policy using the softmax of Q-values, with temperature α
Line 14 explicitly computes the entropy-regularized value: expected Q-value plus entropy bonus
The algorithm maintains and updates a stochastic policy throughout
As α→0 (or equivalently β→∞), this recovers standard value iteration
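Here is a minimal sketch of value iteration written in this explicit entropy-regularized form, again assuming tabular arrays R[s, a] and P[s, a, s'] of our own naming. It is numerically equivalent to the logsumexp version sketched earlier, since the expected Q-value plus the entropy bonus equals the scaled log-sum-exp.

```python
import numpy as np
from scipy.special import softmax

def entropy_regularized_value_iteration(R, P, gamma, alpha, tol=1e-8, max_iter=10_000):
    v = np.zeros(R.shape[0])
    for _ in range(max_iter):
        q = R + gamma * P @ v
        pi = softmax(q / alpha, axis=1)                  # softmax policy with temperature α
        entropy = -(pi * np.log(pi)).sum(axis=1)         # H(π(·|s))
        v_new = (pi * q).sum(axis=1) + alpha * entropy   # expected Q-value plus entropy bonus
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, pi
        v = v_new
    return v, pi
```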
Features:
Policy Evaluation (lines 3-15): Computes the value of the current policy including entropy bonus
Rust, J. (1987). Optimal Replacement of GMC Bus Engines: An Empirical Model of Harold Zurcher. Econometrica, 55(5), 999–1033.
Ziebart, B. D., Maas, A. L., Bagnell, J. A., & Dey, A. K. (2008). Maximum Entropy Inverse Reinforcement Learning. Proceedings of the 23rd AAAI Conference on Artificial Intelligence, 1433–1438.
Haarnoja, T., Tang, H., Abbeel, P., & Levine, S. (2017). Reinforcement Learning with Deep Energy-Based Policies. Proceedings of the 34th International Conference on Machine Learning, 70, 1352–1361.
Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. Proceedings of the 35th International Conference on Machine Learning, 1861–1870.
Levine, S. (2018). Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review. arXiv preprint arXiv:1805.00909.
D’Eramo, C., Nuara, A., Restelli, M., & Nowe, A. (2016). Estimating the Maximum Expected Value through Gaussian Approximation. Proceedings of the 33rd International Conference on Machine Learning, 48, 1032–1040.
Geist, M., Scherrer, B., & Pietquin, O. (2019). A Theory of Regularized Markov Decision Processes. Proceedings of the 36th International Conference on Machine Learning, 97, 2160–2169. https://proceedings.mlr.press/v97/geist19a.html