Authors
Saurabh Sahu, Palash Goyal
Publication date
2022/5/23
Conference
ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Pages
1830-1834
Publisher
IEEE
Description
Robust video scene classification models should capture the spatial (pixel-wise) and temporal (frame-wise) characteristics of a video effectively. Transformer models with self-attention, which are designed to produce contextualized representations for individual tokens given a sequence of tokens, are becoming increasingly popular in many computer vision tasks. However, the use of Transformer-based models for video understanding is still relatively unexplored. Moreover, these models fail to exploit the strong temporal relationships between neighboring video frames to get potent frame-level representations. In this paper, we propose a novel self-attention block that leverages both local and global temporal relationships between the video frames to obtain better contextualized representations for the individual frames. This enables the model to understand the video at various granularities. We illustrate the …
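The abstract does not include implementation details, so the following is only a minimal sketch of the local-plus-global idea, not the authors' method. It assumes frames have already been encoded as d-dimensional embeddings, restricts one attention branch to a window of neighboring frames, runs a second unrestricted branch over all frames, and fuses the two. All names here (LocalGlobalBlock, window_size, the fusion layer) are hypothetical.

# Sketch only: local + global temporal self-attention over frame embeddings.
# Assumes input of shape (batch, frames, dim); not the paper's actual code.
import torch
import torch.nn as nn

class LocalGlobalBlock(nn.Module):
    def __init__(self, dim=256, heads=4, window_size=5):
        super().__init__()
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.window_size = window_size
        self.fuse = nn.Linear(2 * dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        t = x.size(1)
        # Local branch: mask out attention beyond +/- window_size frames,
        # so each frame attends only to its temporal neighbors.
        idx = torch.arange(t, device=x.device)
        local_mask = (idx[None, :] - idx[:, None]).abs() > self.window_size
        local_out, _ = self.local_attn(x, x, x, attn_mask=local_mask)
        # Global branch: unrestricted self-attention over all frames.
        global_out, _ = self.global_attn(x, x, x)
        # Fuse both granularities, then add a residual connection.
        fused = self.fuse(torch.cat([local_out, global_out], dim=-1))
        return self.norm(x + fused)

# Usage: contextualize 32 frame embeddings of dimension 256.
frames = torch.randn(2, 32, 256)
out = LocalGlobalBlock()(frames)  # shape: (2, 32, 256)

The boolean attention mask is the simplest way to express the local constraint with a stock multi-head attention layer; a production implementation would likely use a more efficient windowed attention rather than masking a full attention matrix.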