Cheatsheet of Latex Code for Multi-Task Learning Equations
Navigation
In this blog, we will summarize the LaTeX code for the most fundamental equations of multi-task learning (MTL) and transfer learning (TL). Multi-Task Learning aims to optimize N related tasks simultaneously and achieve an overall trade-off between them. Typical network structures include shared-bottom models, the Cross-Stitch Network, Multi-Gate Mixture of Experts (MMoE), Progressive Layered Extraction (PLE), and the Entire Space Multi-Task Model (ESSM). Different from multi-task learning, transfer learning focuses on transferring knowledge learned on source tasks to a target task. In the following sections, we discuss the MTL equations in more detail for quick reference.
- 1. Multi-Task Learning(MTL)
- 1.1 Shared-Bottom Model
- 1.2 Multi-Gate Mixture of Experts (MMoE)
- 1.3 Progressive Layered Extraction (PLE)
- 1.4 Entire Space Multi-Task Model (ESSM)
- 1.5 Cross-Stitch Network
1. Multi-Task Learning(MTL)
1.1 Shared-Bottom Model
Equation
Latex Code
y_{k}=h^{k}(f(x))
Explanation
Shared-Bottom MTL models use a shared representation f(x) for K individual tasks. For each task k, a task-specific tower with parameters h^{k}(.) produces the output for that task.
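As a concrete illustration, here is a minimal PyTorch-style sketch of a shared-bottom model. The class name, layer sizes, and one-layer towers are illustrative assumptions, not taken from any specific paper.

```python
import torch
import torch.nn as nn

class SharedBottom(nn.Module):
    """Minimal shared-bottom sketch: one shared encoder f(x) and K task towers h^k."""
    def __init__(self, input_dim, hidden_dim, num_tasks):
        super().__init__()
        # Shared representation f(x)
        self.bottom = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        # One task-specific tower h^k per task
        self.towers = nn.ModuleList(nn.Linear(hidden_dim, 1) for _ in range(num_tasks))

    def forward(self, x):
        shared = self.bottom(x)                               # f(x), shared by all tasks
        return [tower(shared) for tower in self.towers]       # y_k = h^k(f(x))
```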
1.2 Multi-Gate Mixture of Experts (MMoE)
Equation
Latex Code
g^{k}(x)=\text{softmax}(W_{gk}x) \\ f^{k}(x)=\sum^{n}_{i=1}g^{k}(x)_{i}f_{i}(x) \\ y_{k}=h^{k}(f^{k}(x))
Explanation
The Multi-Gate Mixture of Experts (MMoE) model was first introduced in the KDD 2018 paper Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts. The model introduces an MMoE layer that models the relationship among K tasks using N experts. Assume the input feature x has dimension D, and there are K output tasks and N expert networks. The gating network g^{k}(x) is an N-dimensional vector of softmax-normalized relative weights, where W_{gk} is a trainable matrix of size R^{N \times D}. f^{k}(x) is the weighted sum of the outputs of the N experts for task k: f_{i}(x) is the output of the i-th expert, and f^{k}(x) is the representation of the k-th task obtained as the weighted summation over the N experts. See the paper Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts for more details.
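Below is a minimal PyTorch-style sketch of the MMoE layer described above. PyTorch, the expert/tower architectures, and the layer sizes are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class MMoE(nn.Module):
    """Minimal MMoE sketch: N experts f_i, plus a softmax gate g^k and tower h^k per task."""
    def __init__(self, input_dim, expert_dim, num_experts, num_tasks):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(input_dim, expert_dim), nn.ReLU())
            for _ in range(num_experts))
        # W_gk: one trainable N x D gating matrix per task (no bias, as in the equation)
        self.gates = nn.ModuleList(
            nn.Linear(input_dim, num_experts, bias=False) for _ in range(num_tasks))
        self.towers = nn.ModuleList(nn.Linear(expert_dim, 1) for _ in range(num_tasks))

    def forward(self, x):
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)   # (B, N, expert_dim)
        outputs = []
        for gate, tower in zip(self.gates, self.towers):
            w = torch.softmax(gate(x), dim=-1)                  # g^k(x): (B, N)
            fk = (w.unsqueeze(-1) * expert_out).sum(dim=1)      # f^k(x): weighted sum of experts
            outputs.append(tower(fk))                           # y_k = h^k(f^k(x))
        return outputs
```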
1.3 Progressive Layered Extraction (PLE)
Equation
Latex Code
g^{k}(x)=w^{k}(x)S^{k}(x) \\ w^{k}(x)=\text{softmax}(W^{k}_{g}x) \\ S^{k}(x)=\left[E^{T}_{(k,1)},E^{T}_{(k,2)},...,E^{T}_{(k,m_{k})},E^{T}_{(s,1)},E^{T}_{(s,2)},...,E^{T}_{(s,m_{s})}\right]^{T} \\ y^{k}(x)=t^{k}(g^{k}(x)) \\ g^{k,j}(x)=w^{k,j}(g^{k,j-1}(x))S^{k,j}(x)
Explanation
The Progressive Layered Extraction (PLE) model modifies the original MMoE structure and explicitly separates the experts into shared experts and task-specific experts. Assume there are m_{s} shared experts and m_{k} task-specific experts for task k. S^{k}(x) is a selection matrix composed of the (m_{s} + m_{k}) D-dimensional expert output vectors, i.e. of dimension (m_{s} + m_{k}) \times D. w^{k}(x) denotes the gating vector of size (m_{s} + m_{k}), and W^{k}_{g} is a trainable matrix of dimension (m_{s} + m_{k}) \times D. t^{k} denotes the task-specific tower parameters. The progressive extraction structure means that the gating network g^{k,j}(x) of the j-th extraction layer takes the output of the previous gating layer, g^{k,j-1}(x), as its input. See the paper Progressive Layered Extraction (PLE): A Novel Multi-Task Learning (MTL) Model for Personalized Recommendations for more details.
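Below is a minimal PyTorch-style sketch of a single PLE extraction layer with shared and task-specific experts. The class name, expert architecture, and sizes are illustrative assumptions; a full PLE model would stack several such layers before the task towers.

```python
import torch
import torch.nn as nn

class PLELayer(nn.Module):
    """Minimal sketch of one PLE extraction layer: per-task experts plus shared experts;
    each task's gate mixes its own experts with the shared ones."""
    def __init__(self, input_dim, expert_dim, num_tasks, experts_per_task, shared_experts):
        super().__init__()
        def make_expert():
            return nn.Sequential(nn.Linear(input_dim, expert_dim), nn.ReLU())
        self.task_experts = nn.ModuleList(
            nn.ModuleList(make_expert() for _ in range(experts_per_task))
            for _ in range(num_tasks))
        self.shared_experts = nn.ModuleList(make_expert() for _ in range(shared_experts))
        # W^k_g: gate over the (m_k + m_s) experts visible to task k
        self.gates = nn.ModuleList(
            nn.Linear(input_dim, experts_per_task + shared_experts, bias=False)
            for _ in range(num_tasks))

    def forward(self, x):
        shared_out = [e(x) for e in self.shared_experts]
        task_reps = []
        for k, gate in enumerate(self.gates):
            # S^k(x): this task's expert outputs followed by the shared expert outputs
            sk = torch.stack([e(x) for e in self.task_experts[k]] + shared_out, dim=1)
            w = torch.softmax(gate(x), dim=-1)                      # w^k(x)
            task_reps.append((w.unsqueeze(-1) * sk).sum(dim=1))     # g^k(x)
        return task_reps  # fed to the task towers t^k, or to the next extraction layer
```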
1.4 Entire Space Multi-Task Model (ESSM)
Equation
Latex Code
L(\theta_{cvr},\theta_{ctr})=\sum^{N}_{i=1}l(y_{i},f(x_{i};\theta_{ctr}))+\sum^{N}_{i=1}l(y_{i}\&z_{i},f(x_{i};\theta_{ctr}) \times f(x_{i};\theta_{cvr}))
Explanation
The ESSM model uses two separate towers to model the pCTR and pCVR prediction tasks simultaneously, and the product f(x_{i};\theta_{ctr}) \times f(x_{i};\theta_{cvr}) serves as the pCTCVR estimate, so both losses are defined over the entire impression space. Here y_{i} denotes the click label, z_{i} the conversion label, and y_{i}\&z_{i} the joint click-and-conversion label. See the paper Entire Space Multi-Task Model: An Effective Approach for Estimating Post-Click Conversion Rate for more details.
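The following is a minimal PyTorch-style sketch of the ESSM objective above, assuming the two towers already produce probabilities p_ctr and p_cvr; the function name and tensor conventions are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def essm_loss(p_ctr, p_cvr, click, conversion):
    """Minimal sketch of the ESSM objective: the CTR output is supervised by click
    labels, and pCTCVR = pCTR * pCVR is supervised by click-and-convert labels, so the
    CVR tower is trained over the entire impression space."""
    p_ctcvr = p_ctr * p_cvr                                           # pCTCVR = pCTR x pCVR
    ctr_loss = F.binary_cross_entropy(p_ctr, click)                   # l(y_i, f(x_i; theta_ctr))
    ctcvr_loss = F.binary_cross_entropy(p_ctcvr, click * conversion)  # l(y_i & z_i, pCTCVR)
    return ctr_loss + ctcvr_loss
```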
1.5 Cross-Stitch Network
Equation
Latex Code
\begin{bmatrix} \tilde{x}^{ij}_{A}\\\tilde{x}^{ij}_{B}\end{bmatrix}=\begin{bmatrix} a_{AA} & a_{AB}\\ a_{BA} & a_{BB} \end{bmatrix}\begin{bmatrix} x^{ij}_{A}\\ x^{ij}_{B} \end{bmatrix}
Explanation
The cross-stitch unit takes two activation maps x^{ij}_{A} and x^{ij}_{B} from the previous layers of tasks A and B, learns a linear combination of the two inputs, and combines them into two new representations \tilde{x}^{ij}_{A} and \tilde{x}^{ij}_{B}. The linear combination is controlled by the learnable parameters a_{AA}, a_{AB}, a_{BA}, a_{BB}. See the paper Cross-stitch Networks for Multi-task Learning for more details.
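Below is a minimal PyTorch-style sketch of one cross-stitch unit applied to two task activations. The class name and the near-identity initialization are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossStitchUnit(nn.Module):
    """Minimal sketch of a cross-stitch unit: a learnable 2x2 matrix of mixing weights
    that linearly combines the activations of task A and task B at one layer."""
    def __init__(self):
        super().__init__()
        # Initialized near the identity so each task mostly keeps its own activations at first
        self.alpha = nn.Parameter(torch.tensor([[0.9, 0.1], [0.1, 0.9]]))

    def forward(self, x_a, x_b):
        x_a_new = self.alpha[0, 0] * x_a + self.alpha[0, 1] * x_b   # tilde{x}_A
        x_b_new = self.alpha[1, 0] * x_a + self.alpha[1, 1] * x_b   # tilde{x}_B
        return x_a_new, x_b_new
```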