DIME: Diffusion-Based Maximum Entropy Reinforcement Learning

1Autonomous Learning Robots (ALR), Karlsruhe Institute of Technology (KIT) 2Interactive Robot Perception & Learning (PEARL), TU Darmstadt 3Intelligent Autonomous Systems Group (IAS), TU Darmstadt 4Hessian.AI 5German Research Center for AI (DFKI) 6Centre for Cognitive Science, TU Darmstadt
This paper was published at ICML 2025. The code can be found in the corresponding GitHub repository, together with all learning curves for replotting. We have additionally run DIME on all environments of the DMC suite and put those results alongside the ones from the paper in the GitHub repository. Please reach out to us if you have questions or are interested in collaborations.

Abstract

Maximum entropy reinforcement learning (MaxEnt-RL) has become the standard approach to RL due to its beneficial exploration properties. Traditionally, policies are parameterized using Gaussian distributions, which significantly limits their representational capacity. Diffusion-based policies offer a more expressive alternative, yet integrating them into MaxEnt-RL poses challenges—primarily due to the intractability of computing their marginal entropy. To overcome this, we propose Diffusion-Based Maximum Entropy RL (DIME). DIME leverages recent advances in approximate inference with diffusion models to derive a lower bound on the maximum entropy objective. Additionally, we propose a policy iteration scheme that provably converges to the optimal diffusion policy. Our method enables the use of expressive diffusion-based policies while retaining the principled exploration benefits of MaxEnt-RL, significantly outperforming other diffusion-based methods on challenging high-dimensional control benchmarks. It is also competitive with state-of-the-art non-diffusion-based RL methods while requiring fewer algorithmic design choices and smaller update-to-data ratios, reducing computational complexity.

Maximum Entropy Reinforcement Learning with Diffusion Policies

DIME is a reinforcement learning algorithm for training diffusion policies in the maximum entropy RL framework. Notably, DIME allows controlling the exploration-exploitation trade-off of diffusion policies by adjusting the entropy-scaling parameter $\alpha$. The resulting behavior of the generative process is shown in the following figure.

Figure 1: The effect of the entropy-scaling parameter $\alpha$. Panels (a)-(c) show diffusion processes for different $\alpha$ values, starting at a prior distribution $\mathcal{N}(0,I)$ and going backward in time to approximate the target distribution $\exp{\left(Q^\pi/\alpha\right)}/Z^\pi$. Small values of $\alpha$ (a) lead to concentrated target distributions with little noise in the diffusion trajectories, especially at the last time steps. As $\alpha$ increases, (b) and (c), the target distribution is smoothed more and the samples at the last time steps become noisier. The parameter $\alpha$ therefore directly controls exploration: the larger $\alpha$, the noisier the samples.
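For context, $\alpha$ enters through the standard maximum entropy objective (written here in generic notation, not copied from the paper),

$$J_{\text{MaxEnt}}(\pi)=\mathbb{E}_{\pi}\Big[\sum_{t} r(s_t,a_t)+\alpha\,\mathcal{H}\big(\pi(\cdot\mid s_t)\big)\Big],$$

whose per-state policy-improvement target is the Boltzmann distribution $\pi(a\mid s)\propto\exp\left(Q^{\pi}(s,a)/\alpha\right)$, i.e., exactly the target distribution approximated by the diffusion process in Figure 1: a large $\alpha$ flattens the target and yields noisier actions, while a small $\alpha$ concentrates it around high-value actions.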


Optimizing the maximum entropy RL objective with diffusion policies is difficult because their marginal likelihood is intractable. DIME's key contribution is a lower bound on this maximum entropy objective that can be evaluated and optimized efficiently. The lower bound is based on the ratio between the backward and the forward diffusion process, where the forward process starts at the target distribution $\vec{\pi}_0(a^0|s)=\frac{\exp{\left(Q_\phi(s,a^0)/\alpha\right)}}{Z_\phi(s)}$, thereby including the Q-function in the policy-update loss. Please find the exact equations in the paper.
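To make the structure of such a bound concrete, below is a minimal, self-contained Monte Carlo sketch in plain NumPy. It assumes Gaussian transition kernels for both processes, and all ingredients (q_value, backward_mean, forward_mean, and the constants) are hypothetical stand-ins rather than DIME's learned networks, noise schedule, or exact estimator; the precise form of the bound is given in the paper. The sketch simulates the backward (generative) chain $a^N\rightarrow\dots\rightarrow a^0$ and accumulates the log-ratio between the forward process, whose endpoint density is $\exp\left(Q_\phi(s,a^0)/\alpha\right)/Z_\phi(s)$, and the backward process; the intractable $\log Z_\phi(s)$ is a constant and does not affect the policy update.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical placeholders -- illustrative values, not DIME's actual hyperparameters.
ACT_DIM, N_STEPS, ALPHA = 2, 8, 0.1
SIGMA_BWD = 0.3   # std of the backward (denoising) kernels
BETA = 0.2        # noise level of the forward (noising) kernels

def q_value(s, a):
    """Toy stand-in for the learned critic Q_phi(s, a): a quadratic bowl."""
    return -np.sum((a - 0.5) ** 2)

def backward_mean(a_t, t, s):
    """Toy stand-in for the learned denoising (policy) network."""
    return 0.9 * a_t

def forward_mean(a_prev, t):
    """Toy VP-style noising mean that shrinks actions toward the N(0, I) prior."""
    return np.sqrt(1.0 - BETA) * a_prev

def log_normal(x, mean, std):
    """Log-density of an isotropic Gaussian."""
    return -0.5 * np.sum(((x - mean) / std) ** 2 + np.log(2.0 * np.pi * std ** 2))

def elbo_estimate(s):
    """One-sample estimate of a bound of the form
    E_backward[ Q(s, a^0)/alpha + log forward(a^{1:N} | a^0) - log backward(a^{0:N}) ],
    which lower-bounds log Z_phi(s) up to that additive constant."""
    a = rng.normal(size=ACT_DIM)                             # a^N ~ N(0, I) prior
    log_ratio = -log_normal(a, np.zeros(ACT_DIM), 1.0)       # - log prior(a^N)
    for t in range(N_STEPS, 0, -1):                          # simulate the backward chain
        mean = backward_mean(a, t, s)
        a_prev = rng.normal(mean, SIGMA_BWD)                 # a^{t-1} ~ backward kernel
        log_ratio -= log_normal(a_prev, mean, SIGMA_BWD)     # - log backward(a^{t-1} | a^t)
        log_ratio += log_normal(a, forward_mean(a_prev, t), np.sqrt(BETA))  # + log forward(a^t | a^{t-1})
        a = a_prev
    return q_value(s, a) / ALPHA + log_ratio                 # a is now a^0

# Averaging many one-sample estimates gives the Monte Carlo value of the bound for this state.
print(np.mean([elbo_estimate(s=None) for _ in range(2000)]))

In DIME, the gradient of the actual bound with respect to the backward (policy) parameters drives the policy update; the sketch above only evaluates a bound of this shape and makes no claim about the exact estimator, reparameterization, or noise schedule used in the paper.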

Entropy Scaling Benefits and Improved Performance over a Gaussian Policy

Figure 2: Entropy Scaling Sensitivity (a)-(b). The $\alpha$ parameter controls the exploration-exploitation trade-off. (a) shows the learning curves for varying $\alpha$ values on DMC's dog-run task. Too large $\alpha$ values ($\alpha=0.1$) prevent learning, whereas too small values ($\alpha\leq10^{-5}$) converge to suboptimal behavior. (b) shows the aggregated final performance for each learning curve in (a). With increasing $\alpha$, the final performance improves until it reaches an optimum at $\alpha=10^{-3}$, after which it drops again. Diffusion Policy Benefit (c)-(d). We compare DIME to a Gaussian policy with the same implementation details as DIME on the (c) humanoid-run and (d) dog-run tasks. The diffusion-based policy reaches a higher return and converges faster.

Number of Diffusion Steps and DIME's Performance Compared to Other Diffusion-Based Baselines

Figure 3: Varying the Number of Diffusion Steps (a)-(b). The number of diffusion steps affects both performance and computation time. (a) shows DIME's learning curves for varying numbers of diffusion steps on DMC's humanoid-run task. Two diffusion steps perform poorly; four and eight steps perform similarly to each other but worse than 16 and 32 steps, which again perform similarly. (b) shows the computation time for one million environment steps for the corresponding learning curves: the fewer the diffusion steps, the less computation time is required. Learning Curves on the Gym Benchmark Suite (c)-(d). We compare DIME against various diffusion baselines and CrossQ on (c) Ant-v3 and (d) Humanoid-v3 from the Gym suite. DIME outperforms all other diffusion-based methods and performs on par with CrossQ on the Ant environment. On the high-dimensional Humanoid-v3 environment, DIME performs favorably and also outperforms CrossQ.
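The computation-time trend in (b) follows from the fact that drawing one action requires one denoiser evaluation per diffusion step. The following minimal timing sketch (with a hypothetical stand-in denoiser and an illustrative action dimensionality, not DIME's actual network) shows the resulting roughly linear scaling of sampling cost with the number of steps:

import time
import numpy as np

rng = np.random.default_rng(0)
ACT_DIM = 21  # illustrative action dimensionality

def denoiser(a_t, t, s):
    """Hypothetical stand-in for the learned denoising network."""
    return 0.9 * a_t

def sample_action(s, n_steps, sigma=0.3):
    """Draw one action by simulating n_steps reverse-diffusion steps from the N(0, I) prior."""
    a = rng.normal(size=ACT_DIM)
    for t in range(n_steps, 0, -1):
        a = rng.normal(denoiser(a, t, s), sigma)  # one step = one denoiser call
    return a

for n in (2, 4, 8, 16, 32):
    start = time.perf_counter()
    for _ in range(10_000):
        sample_action(s=None, n_steps=n)
    print(f"{n:2d} diffusion steps: {time.perf_counter() - start:.2f}s for 10k actions")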

Comparison to SOTA RL Methods on DMC and MyoSuite

DIME is compared against recent state-of-the-art (SOTA) RL methods on the humanoid and dog environments from DMC and the difficult environments from the MyoSuite. DIME performs favorably on the high-dimensional dog tasks and on par with the SOTA Gaussian policy-based method BRO.

Figure 4: Training curves on DMC's dog and humanoid tasks and the hand environments from the MyoSuite. DIME performs favorably on the high-dimensional dog tasks, where it either significantly outperforms all baselines (dog-run) or converges faster to the final performance. On the humanoid tasks, DIME outperforms all diffusion-based baselines, CrossQ, and BRO Fast; it performs on par with BRO on humanoid-stand and slightly worse on humanoid-run and humanoid-walk. On the MyoSuite environments, DIME performs consistently across all tasks, either outperforming the baselines or performing on par.

BibTeX

@inproceedings{celik2025dime,
  title     = {{DIME}: Diffusion-Based Maximum Entropy Reinforcement Learning},
  author    = {Onur Celik and Zechu Li and Denis Blessing and Ge Li and Daniel Palenicek and Jan Peters and Georgia Chalvatzaki and Gerhard Neumann},
  booktitle = {Forty-second International Conference on Machine Learning},
  year      = {2025},
  url       = {https://openreview.net/forum?id=Aw6dBR7Vxj}
}