Figure 1: The effect of the reward scaling parameter $\alpha$. The figures in (a)-(c) show
diffusion processes for different $\alpha$ values, starting at the prior distribution $\mathcal{N}(0,I)$ and going
backward in time to approximate the target distribution $\exp{\left(Q^\pi/\alpha\right)}/Z^\pi$.
Small values of $\alpha$ (a) lead to concentrated target distributions with less noise in the diffusion
trajectories, especially at the last time steps. The higher $\alpha$ becomes (b) and (c), the more the target
distribution is smoothed and the noisier the samples at the last time steps become.
Therefore, $\alpha$ directly controls exploration: larger values of $\alpha$ enforce noisier samples.
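This relationship between $\alpha$ and exploration is easy to verify numerically: dividing $Q^\pi$ by a larger $\alpha$ flattens the target density $\exp(Q^\pi/\alpha)/Z^\pi$ and raises its entropy. Below is a minimal, hypothetical sketch (plain NumPy with a toy quadratic stand-in for the critic $Q^\pi$; an illustration of the scaling effect, not the paper's implementation) that evaluates this density on a discretized 1D action grid and prints its entropy for several $\alpha$ values.

```python
# Minimal sketch: how the reward scaling alpha flattens the target density
# exp(Q/alpha)/Z. The quadratic Q below is a toy stand-in for a learned critic.
import numpy as np

def target_density(q_values, alpha):
    """Normalized Boltzmann-like target exp(Q/alpha) / Z on a discretized action grid."""
    logits = q_values / alpha
    logits = logits - logits.max()      # subtract max for numerical stability
    unnormalized = np.exp(logits)
    return unnormalized / unnormalized.sum()

def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability bins."""
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())

actions = np.linspace(-2.0, 2.0, 401)   # 1D action grid
q = -(actions - 0.5) ** 2               # toy critic with a single optimum at a = 0.5

for alpha in (1e-5, 1e-3, 1e-1):
    p = target_density(q, alpha)
    print(f"alpha={alpha:.0e}  entropy={entropy(p):.3f} nats")
# Small alpha: mass concentrates on the argmax of Q (low entropy, little exploration).
# Large alpha: the density flattens toward uniform (high entropy, noisier samples).
```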
Figure 2: Entropy Scaling Sensitivity (a)-(b). The $\alpha$ parameter controls the
exploration-exploitation trade-off. (a) shows the learning curves for varying values of $\alpha$ on DMC's dog-run task.
Too high $\alpha$ values ($\alpha=0.1$) prevent learning, whereas too small $\alpha$ values ($\alpha\leq10^{-5}$) lead to convergence to suboptimal behavior.
(b) shows the aggregated end performance for each learning curve in (a).
With increasing $\alpha$, the end performance improves until it reaches an optimum at $\alpha=10^{-3}$, after which it starts dropping.
Diffusion Policy Benefit (c) and (d). We compare DIME to a Gaussian policy with the same implementation details as DIME on the (c) humanoid-run and (d) dog-run tasks.
The diffusion-based policy reaches a higher return (c) and converges faster.
Figure 3: Varying the Number of Diffusion Steps (a)-(b). The number of diffusion steps
affects both performance and computation time (see the sampling-loop sketch after this caption). (a) shows DIME's learning curves for varying numbers of diffusion steps on DMC's humanoid-run task.
Two diffusion steps perform poorly; four and eight diffusion steps perform similarly to each other but worse than 16 and 32 diffusion steps, which in turn perform on par.
(b) shows the computation time for 1M steps of the corresponding learning curves.
The fewer the diffusion steps, the less computation time is required.
Learning Curves on Gym Benchmark Suite (c)-(d).
We compare DIME against various diffusion baselines and CrossQ on the (c) Ant-v3 and (d) Humanoid-v3 environments from the Gym suite.
DIME outperforms all diffusion-based methods and performs on par with CrossQ on the Ant environment.
DIME performs favorably on the high-dimensional Humanoid-v3 environment, where it also outperforms CrossQ.
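The computation-time trend in Figure 3 (b) follows directly from the sampling loop: drawing one action requires one network evaluation per denoising step, so the per-action cost grows roughly linearly with the number of diffusion steps. The following is a minimal, hypothetical sketch of such a loop (a generic Euler-Maruyama reverse-time integrator with a dummy score network; the names, the squashing, and the noise handling are assumptions for illustration, not DIME's actual sampler).

```python
# Minimal sketch: reverse-diffusion action sampling with a configurable number of
# denoising steps. Cost per action is one score-network call per step, so compute
# scales linearly with num_steps. All names here are hypothetical placeholders.
import numpy as np

def sample_action(score_net, obs, action_dim, num_steps, rng):
    """Draw one action by a simple Euler-Maruyama integration of a reverse-time process."""
    dt = 1.0 / num_steps
    a = rng.standard_normal(action_dim)                 # start from the N(0, I) prior
    for k in range(num_steps):
        t = 1.0 - k * dt                                # integrate backward from t=1 to t=0
        drift = score_net(obs, a, t)                    # learned denoising direction
        a = a + drift * dt + np.sqrt(dt) * rng.standard_normal(action_dim)
    return np.tanh(a)                                   # squash into a bounded action range

# Toy usage with a dummy score network that pulls samples toward zero.
rng = np.random.default_rng(0)
dummy_score = lambda obs, a, t: -a
for steps in (2, 4, 8, 16, 32):
    act = sample_action(dummy_score, obs=None, action_dim=6, num_steps=steps, rng=rng)
    print(steps, np.round(act[:3], 3))
```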
Figure 4: Training curves on DMC's dog and humanoid tasks and on the hand environments from the MyoSuite.
DIME performs favorably on the high-dimensional dog tasks, where it either significantly outperforms all baselines (dog-run)
or converges faster to the final performance.
On the humanoid tasks, DIME outperforms all diffusion-based baselines as well as CrossQ and BRO Fast; it performs on par
with BRO on the humanoid-stand task and slightly worse on the humanoid-run and humanoid-walk tasks.
In the MyoSuite environments, DIME performs consistently across all tasks, either outperforming the baselines or performing on par with them.
@inproceedings{celik2025dime,
  title     = {{DIME}: Diffusion-Based Maximum Entropy Reinforcement Learning},
  author    = {Onur Celik and Zechu Li and Denis Blessing and Ge Li and Daniel Palenicek and Jan Peters and Georgia Chalvatzaki and Gerhard Neumann},
  booktitle = {Forty-second International Conference on Machine Learning},
  year      = {2025},
  url       = {https://openreview.net/forum?id=Aw6dBR7Vxj}
}