ViNet: Pushing the limits of Visual Modality for Audio-Visual Saliency Prediction

Samyak Jain, Pradeep Yarlagadda, Shreyank Jyoti, Shyamgopal Karthik, Ramanathan Subramanian, Vineet Gandhi

We propose the ViNet architecture for audio-visual saliency prediction. ViNet is a fully convolutional encoder-decoder architecture. The encoder uses visual features from a network trained for action recognition, and the decoder infers a saliency map via trilinear interpolation and 3D convolutions, combining features from multiple hierarchies. The overall architecture of ViNet is conceptually simp...
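
To make the decoder's trilinear interpolation step concrete, here is a minimal pure-Python sketch of trilinear upsampling on a small time x height x width volume. The function names, the endpoint-aligned sampling convention, and the toy input are assumptions chosen for illustration. They are not taken from the ViNet implementation, which would more plausibly use a deep-learning framework's upsampling layer alongside its 3D convolutions.

```python
def lerp(a, b, w):
    """Linear interpolation between a and b with weight w in [0, 1]."""
    return a + (b - a) * w


def source_index(i, out_size, in_size):
    """Map output index i to a fractional input coordinate.

    Assumption: endpoints of the output grid align with endpoints of the
    input grid (one of several common conventions).
    """
    if out_size == 1 or in_size == 1:
        return 0.0
    return i * (in_size - 1) / (out_size - 1)


def neighbours(coord, in_size):
    """Return the two surrounding integer indices and the blend weight."""
    lo = int(coord)
    hi = min(lo + 1, in_size - 1)
    return lo, hi, coord - lo


def trilinear_upsample(vol, out_shape):
    """Resize a nested-list volume vol[t][y][x] to out_shape = (T, H, W)."""
    in_t, in_h, in_w = len(vol), len(vol[0]), len(vol[0][0])
    out_t, out_h, out_w = out_shape
    out = []
    for t in range(out_t):
        t0, t1, wt = neighbours(source_index(t, out_t, in_t), in_t)
        plane = []
        for y in range(out_h):
            y0, y1, wy = neighbours(source_index(y, out_h, in_h), in_h)
            row = []
            for x in range(out_w):
                x0, x1, wx = neighbours(source_index(x, out_w, in_w), in_w)
                # Interpolate along x on the four edges of the enclosing cell.
                c00 = lerp(vol[t0][y0][x0], vol[t0][y0][x1], wx)
                c01 = lerp(vol[t0][y1][x0], vol[t0][y1][x1], wx)
                c10 = lerp(vol[t1][y0][x0], vol[t1][y0][x1], wx)
                c11 = lerp(vol[t1][y1][x0], vol[t1][y1][x1], wx)
                # Then along y, then along the temporal axis.
                c0 = lerp(c00, c01, wy)
                c1 = lerp(c10, c11, wy)
                row.append(lerp(c0, c1, wt))
            plane.append(row)
        out.append(plane)
    return out


# Toy 2x2x2 volume whose values are a linear function of (t, y, x).
vol = [[[4 * t + 2 * y + x for x in range(2)] for y in range(2)] for t in range(2)]
up = trilinear_upsample(vol, (3, 3, 3))
```

Because the toy volume is linear in each coordinate, the upsampled grid reproduces that function at the new sample points, which is a quick sanity check on the interpolation. In a decoder of the kind the abstract describes, an upsampling step like this would be interleaved with learned 3D convolutions that merge features from different encoder levels.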