Multi-Gate Mixture of Experts (MMoE)
Tags: #machine learning #multi task

Equation
$$g^{k}(x)=\text{softmax}(W_{gk}x) \\ f^{k}(x)=\sum^{n}_{i=1}g^{k}(x)_{i}f_{i}(x) \\ y_{k}=h^{k}(f^{k}(x))$$

Latex Code
g^{k}(x)=\text{softmax}(W_{gk}x) \\ f^{k}(x)=\sum^{n}_{i=1}g^{k}(x)_{i}f_{i}(x) \\ y_{k}=h^{k}(f^{k}(x))
Explanation
The Multi-Gate Mixture-of-Experts (MMoE) model was first introduced in the KDD 2018 paper Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts. The model introduces an MMoE layer that models the relationships among K tasks using N shared expert networks. Assume the input feature x has dimension D, and that there are K output tasks and N experts. For task k, the gating network produces g^{k}(x), an N-dimensional vector of relative expert weights obtained via a softmax, where W_{gk} is a trainable matrix in R^{N \times D}. f_{i}(x) is the output of the i-th expert, and f^{k}(x) is the weighted sum of the N expert outputs for task k, i.e. the shared representation routed to that task. Finally, y_{k} = h^{k}(f^{k}(x)), where h^{k} is the task-specific tower network that produces the prediction for task k.
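To make the three equations concrete, here is a minimal PyTorch sketch of an MMoE layer with per-task gates and towers. The class name, hidden sizes, single-output towers, and the usage dimensions are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class MMoE(nn.Module):
    """Minimal sketch of a Multi-gate Mixture-of-Experts layer (illustrative sizes)."""

    def __init__(self, input_dim, num_experts, num_tasks, expert_hidden=32, tower_hidden=16):
        super().__init__()
        # N expert networks f_i, shared across all K tasks.
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(input_dim, expert_hidden), nn.ReLU())
             for _ in range(num_experts)]
        )
        # K gating networks: g^k(x) = softmax(W_gk x), one linear map per task.
        self.gates = nn.ModuleList(
            [nn.Linear(input_dim, num_experts, bias=False) for _ in range(num_tasks)]
        )
        # K task-specific tower networks h^k (single scalar output here, an assumption).
        self.towers = nn.ModuleList(
            [nn.Sequential(nn.Linear(expert_hidden, tower_hidden), nn.ReLU(),
                           nn.Linear(tower_hidden, 1))
             for _ in range(num_tasks)]
        )

    def forward(self, x):
        # Stack expert outputs: shape (batch, num_experts, expert_hidden).
        expert_out = torch.stack([expert(x) for expert in self.experts], dim=1)
        outputs = []
        for gate, tower in zip(self.gates, self.towers):
            # g^k(x): (batch, num_experts), softmax over the expert dimension.
            weights = torch.softmax(gate(x), dim=-1)
            # f^k(x) = sum_i g^k(x)_i * f_i(x): weighted sum over experts.
            fused = torch.einsum("be,beh->bh", weights, expert_out)
            # y_k = h^k(f^k(x)): task-specific tower on the fused representation.
            outputs.append(tower(fused))
        return outputs


# Usage sketch with hypothetical sizes: D=8 input features, N=4 experts, K=2 tasks.
model = MMoE(input_dim=8, num_experts=4, num_tasks=2)
predictions = model(torch.randn(5, 8))  # list of K tensors, each of shape (5, 1)
```

In this sketch each task gets its own softmax gate over the same pool of experts, which is what distinguishes MMoE from a single shared-gate mixture of experts.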
Related Documents
- See the paper Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts (KDD 2018) for details.