Authors
Ziyi Lin, Shijie Geng, Renrui Zhang, Peng Gao, Gerard De Melo, Xiaogang Wang, Jifeng Dai, Yu Qiao, Hongsheng Li
Publication date
2022/10/23
Book
European Conference on Computer Vision
Pages
388-404
Publisher
Springer Nature Switzerland
Description
Video recognition has been dominated by the end-to-end learning paradigm: a video recognition model is first initialized with the weights of a pretrained image model and then trained end-to-end on videos. This lets the video network benefit from the pretrained image model. However, finetuning on videos demands substantial computation and memory, while the alternative of directly using pretrained image features without finetuning the image backbone leads to subpar results. Fortunately, recent advances in Contrastive Language-Image Pre-training (CLIP) open a new route for visual recognition tasks. Pretrained on large open-vocabulary image–text pair data, these models learn powerful visual representations with rich semantics. In this paper, we present Efficient Video Learning (EVL), an efficient framework for directly training high-quality video recognition models with …
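The frozen-backbone paradigm the abstract contrasts with end-to-end finetuning can be illustrated in a minimal PyTorch sketch. The tiny convolutional backbone below is a hypothetical stand-in for a pretrained image encoder such as CLIP's visual tower; the shapes, class count, and pooling choice are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pretrained image backbone (e.g. a CLIP
# visual encoder); in practice this would be a large pretrained model.
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
backbone.requires_grad_(False)  # frozen: the image weights are never finetuned
backbone.eval()

# Lightweight trainable head on top of the frozen per-frame features
# (EVL uses a more elaborate temporal decoder; this is just the pattern).
head = nn.Linear(8, 4)  # 4 hypothetical action classes

video = torch.randn(2, 16, 3, 32, 32)       # (batch, frames, C, H, W)
b, t = video.shape[:2]
with torch.no_grad():                        # no gradients through the backbone
    feats = backbone(video.flatten(0, 1))    # (b*t, 8) per-frame features
feats = feats.view(b, t, -1).mean(dim=1)     # simple temporal average pooling
logits = head(feats)                         # only the head receives gradients
```

Because the backbone is frozen, per-frame features can even be precomputed once and cached, which is the source of the efficiency the paper targets.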