Denoising Diffusion Policy Optimization (DDPO)

Tags: #AI #nlp #llm #RLHF

Equation

$$ \mathcal{J}_{\text{DDRL}}(\theta) = \mathbb{E}_{c \sim p(c),\, x_0 \sim p_\theta(x_0 \mid c)} \big[ r(x_0, c) \big] $$

$$ w_{\text{RWR}}(x_0, c) = \frac{1}{Z} \exp\big( \beta\, r(x_0, c) \big) $$

$$ w_{\text{sparse}}(x_0, c) = \mathbf{1} \big[ r(x_0, c) \geq C \big] $$

$$ \nabla_\theta \mathcal{J}_{\text{DDRL}} = \mathbb{E} \Big[ \sum_{t=0}^{T} \nabla_\theta \log p_\theta(x_{t-1} \mid x_t, c)\; r(x_0, c) \Big] $$

$$ \nabla_\theta \mathcal{J}_{\text{DDRL}} = \mathbb{E} \Big[ \sum_{t=0}^{T} \frac{p_\theta(x_{t-1} \mid x_t, c)}{p_{\theta_{\text{old}}}(x_{t-1} \mid x_t, c)}\; \nabla_\theta \log p_\theta(x_{t-1} \mid x_t, c)\; r(x_0, c) \Big] $$
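As a quick illustration of the two reward-weighting schemes above, the following is a minimal NumPy sketch, not code from the paper: `rewards`, `beta`, and `threshold` are illustrative names, and Z is taken as the per-batch normalizer so the RWR weights sum to 1.

```python
import numpy as np

# Toy batch of rewards r(x_0, c) for sampled images; values are illustrative.
rewards = np.array([0.2, 1.5, -0.3, 0.9, 2.1])

beta = 1.0        # inverse-temperature in the RWR exponential weighting
threshold = 1.0   # cutoff C for the sparse (indicator) weighting

# w_RWR(x_0, c) = (1/Z) * exp(beta * r(x_0, c)),
# with Z chosen here so the weights sum to 1 over the batch.
exp_r = np.exp(beta * rewards)
w_rwr = exp_r / exp_r.sum()

# w_sparse(x_0, c) = 1[r(x_0, c) >= C]
w_sparse = (rewards >= threshold).astype(np.float32)

print("RWR weights:   ", np.round(w_rwr, 3))
print("Sparse weights:", w_sparse)
```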

Latex Code

$$ \mathcal{J}_\text{DDRL}(\theta) = \mathbb{E}_{c \sim p(c),\, x_{0} \sim p_{\theta}(x_{0} \mid c)} \big[ r(x_{0}, c) \big] $$

$$ w_{\text{RWR}}(x_0, c) = \frac{1}{Z} \exp\big( \beta r(x_0, c) \big) $$

$$ w_{\text{sparse}}(x_0, c) = \mathbf{1} \big[ r(x_0, c) \geq C \big] $$

$$ \nabla_\theta \mathcal{J}_\text{DDRL} = \mathbb{E} \Big[ \sum_{t=0}^{T} \nabla_\theta \log p_\theta(x_{t-1} \mid x_t, c) \; r(x_0, c) \Big] $$

$$ \nabla_\theta \mathcal{J}_\text{DDRL} = \mathbb{E} \Big[ \sum_{t=0}^{T} \frac{p_\theta (x_{t-1} \mid x_t, c)}{p_{\theta_\text{old}} (x_{t-1} \mid x_t, c)} \; \nabla_\theta \log p_\theta(x_{t-1} \mid x_t, c) \; r(x_0, c) \Big] $$
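To make the two policy-gradient estimators concrete, here is a minimal PyTorch sketch of my own, not the paper's or the TRL implementation's code. It treats each denoising step as a Gaussian action with a toy mean network, and builds surrogate losses whose gradients match the two formulas above: the on-policy REINFORCE form and the importance-sampled form used when reusing trajectories generated under the old parameters. All names (`mean_head`, `step_log_prob`, the random tensors) are stand-ins for what a real sampler would record.

```python
import torch

torch.manual_seed(0)
B, T, D = 4, 10, 8                            # batch, denoising steps, toy latent dim

mean_head = torch.nn.Linear(D, D)             # toy "policy" producing per-step means
x_t = torch.randn(B, T, D)                    # noisy latents x_t (fixed, from the sampler)
x_prev = torch.randn(B, T, D)                 # sampled x_{t-1} (fixed, from the sampler)
rewards = torch.randn(B)                      # r(x_0, c), one scalar per trajectory

def step_log_prob(model, x_t, x_prev, sigma=1.0):
    # log N(x_{t-1}; mu_theta(x_t), sigma^2 I), summed over latent dims -> shape (B, T)
    mu = model(x_t)
    return torch.distributions.Normal(mu, sigma).log_prob(x_prev).sum(dim=-1)

logp = step_log_prob(mean_head, x_t, x_prev)  # log p_theta(x_{t-1} | x_t, c)
logp_old = logp.detach()                      # pretend the old model equals the current one

# 1) On-policy REINFORCE surrogate: the gradient of
#    -(sum_t log p_theta) * r  is  -sum_t grad log p_theta * r.
loss_reinforce = -(logp.sum(dim=1) * rewards).mean()

# 2) Importance-sampled surrogate: ratio_t = p_theta / p_theta_old,
#    and grad ratio_t = ratio_t * grad log p_theta, matching the last formula.
ratio = torch.exp(logp - logp_old)            # shape (B, T); old log-probs are detached
loss_is = -((ratio * rewards[:, None]).sum(dim=1)).mean()

loss_is.backward()
print("REINFORCE loss:", float(loss_reinforce), " IS loss:", float(loss_is))
```

Minimizing either surrogate by gradient descent ascends the DDRL objective; in practice the importance-sampled form is typically combined with PPO-style ratio clipping.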


Introduction

DDPO trains diffusion models with reinforcement learning. The denoising diffusion RL (DDRL) objective is to maximize a reward signal r(x_0, c) defined on the final samples x_0 and contexts c, for some context distribution p(c) of our choosing. Treating the reverse denoising process as a multi-step decision problem, each transition p_θ(x_{t-1} | x_t, c) is an action whose log-probability can be differentiated, which yields the REINFORCE-style and importance-sampled policy-gradient estimators above.
Paper: Training Diffusion Models with Reinforcement Learning
Hugging Face: DDPO Trainer
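The objective itself has a simple Monte Carlo reading: sample contexts from p(c), run the model to get samples x_0, and average the rewards. The sketch below illustrates this with hypothetical stand-ins (`sample_contexts`, `sample_images`, `brightness_reward` are toy functions, not real APIs from the paper or TRL).

```python
import torch

# Monte Carlo estimate of J_DDRL(theta) = E_{c ~ p(c), x0 ~ p_theta(x0|c)}[ r(x0, c) ]

def sample_contexts(n):
    # toy "prompts": just integer class labels
    return torch.randint(0, 10, (n,))

def sample_images(contexts):
    # stand-in for running the full reverse diffusion chain p_theta(x0 | c)
    return torch.rand(len(contexts), 3, 64, 64)

def brightness_reward(images, contexts):
    # toy reward r(x0, c): mean pixel brightness of each image
    return images.mean(dim=(1, 2, 3))

contexts = sample_contexts(16)
images = sample_images(contexts)
rewards = brightness_reward(images, contexts)
print("Monte Carlo estimate of J_DDRL:", rewards.mean().item())
```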
