Denoising Diffusion Policy Optimization (DDPO)

Tags: #AI #nlp #llm #RLHF

Equation

$$\mathcal{J}_\text{DDRL}(\theta) = \mathbb{E}_{c \sim p(c),\, x_0 \sim p_\theta(x_0 \mid c)} \big[ r(x_0, c) \big]$$

$$w_\text{RWR}(x_0, c) = \frac{1}{Z} \exp\big(\beta\, r(x_0, c)\big)$$

$$w_\text{sparse}(x_0, c) = \mathbf{1}\big[ r(x_0, c) \geq C \big]$$

$$\nabla_\theta \mathcal{J}_\text{DDRL} = \mathbb{E}\left[ \sum_{t=0}^{T} \nabla_\theta \log p_\theta(x_{t-1} \mid x_t, c)\; r(x_0, c) \right]$$

$$\nabla_\theta \mathcal{J}_\text{DDRL} = \mathbb{E}\left[ \sum_{t=0}^{T} \frac{p_\theta(x_{t-1} \mid x_t, c)}{p_{\theta_\text{old}}(x_{t-1} \mid x_t, c)}\; \nabla_\theta \log p_\theta(x_{t-1} \mid x_t, c)\; r(x_0, c) \right]$$

Latex Code

\mathcal{J}_\text{DDRL}(\theta) = \mathbb{E}_{c \sim p(c),\, x_0 \sim p_\theta(x_0 \mid c)} \big[ r(x_0, c) \big]

w_\text{RWR}(x_0, c) = \frac{1}{Z} \exp\big(\beta\, r(x_0, c)\big)

w_\text{sparse}(x_0, c) = \mathbf{1}\big[ r(x_0, c) \geq C \big]

\nabla_\theta \mathcal{J}_\text{DDRL} = \mathbb{E}\left[ \sum_{t=0}^{T} \nabla_\theta \log p_\theta(x_{t-1} \mid x_t, c)\; r(x_0, c) \right]

\nabla_\theta \mathcal{J}_\text{DDRL} = \mathbb{E}\left[ \sum_{t=0}^{T} \frac{p_\theta(x_{t-1} \mid x_t, c)}{p_{\theta_\text{old}}(x_{t-1} \mid x_t, c)}\; \nabla_\theta \log p_\theta(x_{t-1} \mid x_t, c)\; r(x_0, c) \right]
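As a quick illustration of the two sample-weighting schemes above, here is a minimal NumPy sketch. The batch of rewards, the temperature beta, and the threshold C are placeholder values chosen for the example, not values from the paper.

```python
import numpy as np

def rwr_weights(rewards: np.ndarray, beta: float) -> np.ndarray:
    """Reward-weighted regression weights: w = exp(beta * r) / Z."""
    # Shift by the max reward before exponentiating for numerical stability;
    # the shift cancels once we divide by the normalizer Z.
    logits = beta * (rewards - rewards.max())
    w = np.exp(logits)
    return w / w.sum()

def sparse_weights(rewards: np.ndarray, threshold: float) -> np.ndarray:
    """Sparse weights: indicator that the reward clears the threshold C."""
    return (rewards >= threshold).astype(np.float64)

# Hypothetical rewards for a batch of four generated samples x_0.
rewards = np.array([0.2, 0.9, 0.5, 1.3])
print(rwr_weights(rewards, beta=2.0))          # soft, normalized weights
print(sparse_weights(rewards, threshold=0.8))  # hard 0/1 weights
```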


Introduction

Reinforcement learning training of diffusion models. The denoising diffusion RL (DDRL) objective is to maximize a reward signal r(x_0, c), defined on the generated samples x_0 and contexts c, for a context distribution p(c) of our choosing. The weights w_RWR and w_sparse correspond to reward-weighted regression and to hard reward thresholding, while the last two expressions are the score-function (REINFORCE) and importance-sampling gradient estimators that DDPO uses to optimize the objective directly.
paper: Training Diffusion Models with Reinforcement Learning
huggingface: DDPO Trainer
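
To make the two gradient estimators concrete, here is a minimal PyTorch-style sketch of surrogate losses whose autograd gradients match the expressions above. The tensor shapes and the way per-step log-probabilities are obtained from the sampler are assumptions for illustration; this is not the Hugging Face DDPO Trainer implementation.

```python
import torch

def ddpo_sf_loss(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Score-function (REINFORCE) surrogate.

    log_probs: (batch, T) values of log p_theta(x_{t-1} | x_t, c) for each
               denoising step, computed under the current policy theta.
    rewards:   (batch,) reward r(x_0, c) of the final denoised sample.
    Minimizing this loss performs gradient ascent on J_DDRL.
    """
    return -(log_probs.sum(dim=1) * rewards).mean()

def ddpo_is_loss(log_probs: torch.Tensor,
                 old_log_probs: torch.Tensor,
                 rewards: torch.Tensor) -> torch.Tensor:
    """Importance-sampling surrogate for reusing samples drawn under theta_old.

    The gradient of ratio = p_theta / p_theta_old is ratio * grad(log p_theta),
    so differentiating this surrogate reproduces the last estimator above.
    """
    ratio = torch.exp(log_probs - old_log_probs.detach())  # per-step likelihood ratio
    return -(ratio * rewards.unsqueeze(1)).sum(dim=1).mean()
```

In practice the paper also clips the likelihood ratio, PPO-style, to keep each update within a trust region; the clipping is omitted here to stay close to the equations as written.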
