Cheatsheet of LaTeX Code for the Most Popular Machine Learning Equations
Navigation
In this blog, we summarize the LaTeX code for the most popular machine learning equations, covering distance measures and generative models. The first section collects common measures of the distance between data distributions, including KL-Divergence, JS-Divergence, Wasserstein Distance (Optimal Transport), Maximum Mean Discrepancy (MMD), and Mahalanobis Distance. The second section provides the LaTeX code for generative models: Generative Adversarial Networks (GAN), Variational AutoEncoder (VAE), and Diffusion Models (DDPM).
- 1. Distance Measure
- 1.1 Kullback-Leibler Divergence(KL-Divergence)
- 1.2 Jensen-Shannon Divergence(JS-Divergence)
- 1.3 Wasserstein Distance(Optimal Transport)
- 1.4 Maximum Mean Discrepancy(MMD)
- 1.5 Mahalanobis Distance
- 2. Generative Models
- 2.1 Generative Adversarial Networks(GAN)
- 2.2 Variational AutoEncoder(VAE)
- 2.3 Diffusion Models(DDPM)
Kullback-Leibler Divergence(KL-Divergence)
Equation
Latex Code
KL(P||Q)=\sum_{x}P(x)\log(\frac{P(x)}{Q(x)})
Explanation
KL-Divergence measures how the distribution P diverges from a reference distribution Q. It is non-negative, equals zero only when P = Q, and is not symmetric in P and Q.
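To complement the formula, here is a minimal NumPy sketch that evaluates the discrete KL-Divergence between two probability vectors; it assumes both inputs sum to 1 and that Q(x) > 0 wherever P(x) > 0.
Python Code (illustrative)
import numpy as np

def kl_divergence(p, q):
    # KL(P||Q) = sum_x P(x) * log(P(x) / Q(x)); terms with P(x) = 0 contribute zero.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Example: KL between two distributions over three outcomes.
print(kl_divergence([0.5, 0.3, 0.2], [0.4, 0.4, 0.2]))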
-
Jensen-Shannon Divergence(JS-Divergence)
Equation
Latex Code
JS(P||Q)=\frac{1}{2}KL(P||\frac{(P+Q)}{2})+\frac{1}{2}KL(Q||\frac{(P+Q)}{2})
Explanation
JS-Divergence symmetrizes KL-Divergence by comparing P and Q against their mixture M = (P+Q)/2. It is symmetric and bounded, which makes it better behaved when the supports of P and Q differ.
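A corresponding sketch, assuming the kl_divergence helper and the numpy import from the KL-Divergence example above are in scope:
Python Code (illustrative)
def js_divergence(p, q):
    # JS(P||Q) = 0.5 * KL(P||M) + 0.5 * KL(Q||M), where M = (P + Q) / 2.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)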
-
Wasserstein Distance(Optimal Transport)
Equation
Latex Code
W_{p}(P,Q)=\left(\inf_{J \in \mathcal{J}(P,Q)} \int \|x-y\|^{p} \, \mathrm{d}J(x,y)\right)^{\frac{1}{p}}
Explanation
Wasserstein Distance measures the minimal cost of transporting the mass of P onto Q, where the infimum is taken over all couplings J in \mathcal{J}(P,Q), i.e. joint distributions whose marginals are P and Q (hence the name Optimal Transport).
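Computing the infimum over all couplings is a linear program in general; for one-dimensional empirical distributions with the same number of samples it reduces to matching order statistics. A minimal sketch under that assumption:
Python Code (illustrative)
import numpy as np

def wasserstein_1d(x, y, p=1):
    # For 1-D samples of equal size, the optimal coupling matches sorted order,
    # so W_p is the l_p average of gaps between order statistics.
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))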
-
Maximum Mean Discrepancy(MMD)
Equation
Latex Code
\textup{MMD}(\mathbb{F},X,Y):=\sup_{f \in \mathbb{F}}(\frac{1}{m}\sum_{i=1}^{m}f(x_{i}) - \frac{1}{n}\sum_{j=1}^{n}f(y_{j}))
Explanation
MMD measures the largest gap between the empirical means of f over the two samples X = {x_{1}, ..., x_{m}} and Y = {y_{1}, ..., y_{n}}, where f ranges over the function class \mathbb{F}. When \mathbb{F} is the unit ball of a reproducing kernel Hilbert space, MMD admits a closed-form kernel estimator.
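The supremum above is rarely evaluated directly; the sketch below uses the common RBF-kernel, biased estimator of MMD^2 as a stand-in for the definition, assuming X and Y are arrays of shape (m, d) and (n, d).
Python Code (illustrative)
import numpy as np

def mmd2_rbf(X, Y, gamma=1.0):
    # Biased estimator of MMD^2 with RBF kernel k(a, b) = exp(-gamma * ||a - b||^2).
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    def gram(A, B):
        d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
        return np.exp(-gamma * d2)
    return float(gram(X, X).mean() + gram(Y, Y).mean() - 2.0 * gram(X, Y).mean())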
-
Mahalanobis Distance
Equation
Latex Code
D_{M}(x,y)=\sqrt{(x-y)^{T}\Sigma^{-1}(x-y)}
Explanation
Mahalanobis Distance measures how far a point lies from a distribution (or how far two points of the same distribution lie from each other), rescaling each direction by the inverse covariance matrix \Sigma^{-1}. See https://www.sciencedirect.com/topics/engineering/mahalanobis-distance for more details.
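A minimal sketch computing the Mahalanobis distance of a point to the mean of a dataset, with \Sigma estimated from the data; it assumes the data is multivariate with an invertible covariance matrix.
Python Code (illustrative)
import numpy as np

def mahalanobis(x, data):
    # sqrt((x - mu)^T Sigma^{-1} (x - mu)) with mu and Sigma estimated from `data` of shape (n, d).
    data = np.asarray(data, dtype=float)
    mu = data.mean(axis=0)
    sigma_inv = np.linalg.inv(np.cov(data, rowvar=False))
    d = np.asarray(x, dtype=float) - mu
    return float(np.sqrt(d @ sigma_inv @ d))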
-
Generative Adversarial Networks(GAN)
Equation
Latex Code
\min_{G} \max_{D} V(D,G)=\mathbb{E}_{x \sim p_{data}(x)}[\log D(x)]+\mathbb{E}_{z \sim p_{z}(z)}[\log(1-D(G(z)))]
Explanation
The GAN minimax objective is shown above: the discriminator D is trained to maximize V(D,G) by assigning high probability to real samples x and low probability to generated samples G(z), while the generator G is trained to minimize it. See the paper Generative Adversarial Networks for more details.
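As an illustration of the objective, the sketch below estimates V(D,G) from minibatches; `D` (a callable returning probabilities in (0,1)) and `G` (a callable mapping noise to samples) are hypothetical placeholders, not an implementation from the paper.
Python Code (illustrative)
import numpy as np

def gan_value(D, G, x_real, z):
    # Monte-Carlo estimate of V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))].
    # The discriminator ascends this value; the generator descends it.
    return float(np.mean(np.log(D(x_real))) + np.mean(np.log(1.0 - D(G(z)))))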
-
Variational AutoEncoder(VAE)
Estimating the Log-likelihood and Posterior
Equation
Latex Code
\log p_{\theta}(x)=\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x)] \\ =\mathbb{E}_{q_{\phi}(z|x)}[\log \frac{p_{\theta}(x,z)}{p_{\theta}(z|x)}] \\ =\mathbb{E}_{q_{\phi}(z|x)}[\log [\frac{p_{\theta}(x,z)}{q_{\phi}(z|x)} \times \frac{q_{\phi}(z|x)}{p_{\theta}(z|x)}]] \\ =\mathbb{E}_{q_{\phi}(z|x)}[\log [\frac{p_{\theta}(x,z)}{q_{\phi}(z|x)} ]] +D_{KL}(q_{\phi}(z|x) || p_{\theta}(z|x))\\
Explanation
The marginal log-likelihood decomposes into the evidence lower bound (the first expectation) plus the KL divergence between the approximate posterior q_{\phi}(z|x) and the true posterior p_{\theta}(z|x). Since the KL term is non-negative, the first term is a lower bound on \log p_{\theta}(x).
Evidence Lower Bound
Equation
Latex Code
\mathbb{L}_{\theta,\phi}(\mathbf{x})=\mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}[\log p_{\theta}(\mathbf{x},\mathbf{z})-\log q_{\phi}(\mathbf{z}|\mathbf{x}) ]
Explanation
The evidence lower bound (ELBO) is the objective maximized jointly over the encoder parameters \phi and the decoder parameters \theta. It can equivalently be written as a reconstruction term \mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)] minus D_{KL}(q_{\phi}(z|x)||p_{\theta}(z)).
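A minimal Monte-Carlo sketch of the ELBO; `log_p_xz` and `log_q_z_given_x` are hypothetical callables returning log densities, and `z_samples` are assumed to be drawn from q_{\phi}(z|x).
Python Code (illustrative)
import numpy as np

def elbo_estimate(log_p_xz, log_q_z_given_x, x, z_samples):
    # E_{q(z|x)}[log p(x, z) - log q(z|x)] approximated with samples z ~ q(z|x).
    terms = [log_p_xz(x, z) - log_q_z_given_x(z, x) for z in z_samples]
    return float(np.mean(terms))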
Reparameterization trick
Equation
Latex Code
z = \mu + \epsilon \cdot \sigma, \quad \epsilon \sim \mathcal{N}(0, I)
Explanation
The reparameterization trick rewrites a sample from q_{\phi}(z|x)=\mathcal{N}(\mu,\sigma^{2}) as z = \mu + \epsilon \cdot \sigma with \epsilon \sim \mathcal{N}(0, I), so gradients can flow through \mu and \sigma. See the paper Auto-Encoding Variational Bayes for more details.
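A minimal sketch of the reparameterization trick for a diagonal Gaussian posterior (parameter names are illustrative):
Python Code (illustrative)
import numpy as np

def reparameterize(mu, sigma, rng=None):
    # Sample z = mu + eps * sigma with eps ~ N(0, I), so the sample is a
    # deterministic, differentiable function of mu and sigma given eps.
    if rng is None:
        rng = np.random.default_rng()
    eps = rng.standard_normal(np.shape(mu))
    return np.asarray(mu, dtype=float) + eps * np.asarray(sigma, dtype=float)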
-
Diffusion Models(DDPM)
Explanation
See the paper Denoising Diffusion Probabilistic Models and the blog post https://lilianweng.github.io/posts/2021-07-11-diffusion-models/ for more details; the notation below follows the latter.
1.1 Forward Process
Equation
Latex Code
q(x_{t}|x_{t-1})=\mathcal{N}(x_{t};\sqrt{1-\beta_{t}}x_{t-1},\beta_{t}I) \\q(x_{1:T}|x_{0})=\prod_{t=1}^{T}q(x_{t}|x_{t-1})
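A minimal sketch of one forward (noising) step, treating x as a plain array (illustrative only):
Python Code (illustrative)
import numpy as np

def forward_step(x_prev, beta_t, rng=None):
    # Sample x_t ~ q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I).
    if rng is None:
        rng = np.random.default_rng()
    noise = rng.standard_normal(np.shape(x_prev))
    return np.sqrt(1.0 - beta_t) * np.asarray(x_prev, dtype=float) + np.sqrt(beta_t) * noise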
1.2 Forward Process Reparameterization Trick
Equation
Latex Code
x_{t}=\sqrt{\alpha_{t}}x_{t-1}+\sqrt{1-\alpha_{t}} \epsilon_{t-1}\\=\sqrt{\alpha_{t}\alpha_{t-1}}x_{t-2} + \sqrt{1-\alpha_{t}\alpha_{t-1}} \bar{\epsilon}_{t-2}\\=\text{...}\\=\sqrt{\bar{\alpha}_{t}}x_{0}+\sqrt{1-\bar{\alpha}_{t}}\epsilon \\\alpha_{t}=1-\beta_{t}, \bar{\alpha}_{t}=\prod_{i=1}^{t}\alpha_{i}
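Using the closed form above, x_t can be sampled directly from x_0; the sketch assumes `alphas` is the sequence alpha_1, ..., alpha_T with alpha_i = 1 - beta_i (the indexing convention here is illustrative).
Python Code (illustrative)
import numpy as np

def forward_to_t(x0, alphas, t, rng=None):
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, I),
    # where alpha_bar_t = prod_{i=1}^{t} alpha_i.
    if rng is None:
        rng = np.random.default_rng()
    alpha_bar_t = float(np.prod(alphas[:t]))
    eps = rng.standard_normal(np.shape(x0))
    return np.sqrt(alpha_bar_t) * np.asarray(x0, dtype=float) + np.sqrt(1.0 - alpha_bar_t) * eps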
1.3 Reverse Process
Latex Code
p_\theta(\mathbf{x}_{0:T}) = p(\mathbf{x}_T) \prod^T_{t=1} p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t) \\ p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t))
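A minimal sketch of one reverse (denoising) step; it assumes the model's predicted mean \mu_\theta(x_t, t) has already been computed and that \Sigma_\theta(x_t, t) = \sigma_t^2 I, a common simplification.
Python Code (illustrative)
import numpy as np

def reverse_step(mu_pred, sigma_t, rng=None):
    # Sample x_{t-1} ~ N(mu_theta(x_t, t), sigma_t^2 * I) given the predicted mean.
    if rng is None:
        rng = np.random.default_rng()
    noise = rng.standard_normal(np.shape(mu_pred))
    return np.asarray(mu_pred, dtype=float) + sigma_t * noise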
1.4 Reverse Process Variational Lower Bound
Latex Code
\begin{aligned} - \log p_\theta(\mathbf{x}_0) &\leq - \log p_\theta(\mathbf{x}_0) + D_\text{KL}(q(\mathbf{x}_{1:T}\vert\mathbf{x}_0) \| p_\theta(\mathbf{x}_{1:T}\vert\mathbf{x}_0) ) \\ &= -\log p_\theta(\mathbf{x}_0) + \mathbb{E}_{\mathbf{x}_{1:T}\sim q(\mathbf{x}_{1:T} \vert \mathbf{x}_0)} \Big[ \log\frac{q(\mathbf{x}_{1:T}\vert\mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T}) / p_\theta(\mathbf{x}_0)} \Big] \\ &= -\log p_\theta(\mathbf{x}_0) + \mathbb{E}_q \Big[ \log\frac{q(\mathbf{x}_{1:T}\vert\mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T})} + \log p_\theta(\mathbf{x}_0) \Big] \\ &= \mathbb{E}_q \Big[ \log \frac{q(\mathbf{x}_{1:T}\vert\mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T})} \Big] \\ \text{Let }L_\text{VLB} &= \mathbb{E}_{q(\mathbf{x}_{0:T})} \Big[ \log \frac{q(\mathbf{x}_{1:T}\vert\mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T})} \Big] \geq - \mathbb{E}_{q(\mathbf{x}_0)} \log p_\theta(\mathbf{x}_0) \end{aligned}
1.5 Reverse Process Variational Lower Bound Decomposition Multiple KL-Divergence
Latex Code
\begin{aligned}L_\text{VLB} &= \mathbb{E}_{q(\mathbf{x}_{0:T})} \Big[ \log\frac{q(\mathbf{x}_{1:T}\vert\mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T})} \Big] \\&= \mathbb{E}_q \Big[ \log\frac{\prod_{t=1}^T q(\mathbf{x}_t\vert\mathbf{x}_{t-1})}{ p_\theta(\mathbf{x}_T) \prod_{t=1}^T p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t) } \Big] \\&= \mathbb{E}_q [\underbrace{D_\text{KL}(q(\mathbf{x}_T \vert \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_T))}_{L_T} + \sum_{t=2}^T \underbrace{D_\text{KL}(q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t))}_{L_{t-1}} \underbrace{- \log p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)}_{L_0} ]\end{aligned}
1.6 Reverse Process Variational Lower Bound Loss Function
Latex Code
\begin{aligned} L_\text{VLB} &= L_T + L_{T-1} + \dots + L_0 \\ \text{where } L_T &= D_\text{KL}(q(\mathbf{x}_T \vert \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_T)) \\ L_t &= D_\text{KL}(q(\mathbf{x}_t \vert \mathbf{x}_{t+1}, \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_t \vert\mathbf{x}_{t+1})) \text{ for }1 \leq t \leq T-1 \\ L_0 &= - \log p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1) \end{aligned}
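Because both q(x_{t-1}|x_t, x_0) and p_\theta(x_{t-1}|x_t) are Gaussian, each L_t term above is a KL divergence between Gaussians. A minimal sketch for the diagonal-covariance case:
Python Code (illustrative)
import numpy as np

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    # KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ), summed over dimensions.
    mu_q, var_q = np.asarray(mu_q, dtype=float), np.asarray(var_q, dtype=float)
    mu_p, var_p = np.asarray(mu_p, dtype=float), np.asarray(var_p, dtype=float)
    return float(0.5 * np.sum(np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0))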