Authors
Xiaoqi Zhao, Youwei Pang, Lihe Zhang, Huchuan Lu, Xiang Ruan
Publication date
2021
Journal
arXiv preprint arXiv:2101.12482
Description
Existing CNN-based RGB-D Salient Object Detection (SOD) networks all require pre-training on ImageNet to learn the hierarchical features that provide a good initialization. However, collecting and annotating large-scale datasets is time-consuming and expensive. In this paper, we utilize Self-Supervised Representation Learning (SSL) to design two pretext tasks: the cross-modal auto-encoder and depth-contour estimation. Our pretext tasks require only a small amount of unlabeled RGB-D data for pre-training, which enables the network to capture rich semantic contexts and reduces the gap between the two modalities, thereby providing an effective initialization for the downstream task. In addition, to address the inherent problem of cross-modal fusion in RGB-D SOD, we propose a multi-path fusion (MPF) module that splits a single feature fusion into multi-path fusion to achieve an adequate perception of consistent and differential information. The MPF module is general and suitable for both cross-modal and cross-level feature fusion. In extensive experiments on six benchmark RGB-D SOD datasets, our model pre-trained on an RGB-D dataset (6,335 images without any annotations) performs favorably against most state-of-the-art RGB-D methods pre-trained on ImageNet (1,280,000 images with image-level annotations).
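To make the two pretext tasks concrete, here is a minimal PyTorch sketch of the cross-modal auto-encoding idea: each modality's encoder is trained to reconstruct the other modality, so the supervision signal comes from the unlabeled RGB-D pairs themselves. All module and layer choices below are illustrative assumptions; the abstract does not specify the paper's actual encoder/decoder architectures, and the depth-contour estimation head would be trained analogously against contours extracted from depth maps.

```python
import torch
import torch.nn as nn

class CrossModalAutoEncoder(nn.Module):
    """Pretext-task sketch (hypothetical layers): each encoder must
    reconstruct the *other* modality, forcing its features to carry
    cross-modal semantics without any human annotations."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.rgb_encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU())
        self.depth_encoder = nn.Sequential(
            nn.Conv2d(1, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU())
        # Decoders map each modality's features into the opposite modality.
        self.to_depth = nn.Conv2d(feat_dim, 1, 3, padding=1)
        self.to_rgb = nn.Conv2d(feat_dim, 3, 3, padding=1)

    def forward(self, rgb, depth):
        pred_depth = self.to_depth(self.rgb_encoder(rgb))    # RGB -> depth
        pred_rgb = self.to_rgb(self.depth_encoder(depth))    # depth -> RGB
        return pred_rgb, pred_depth

# Self-supervised losses: the "targets" are the inputs themselves,
# so no labels are needed during pre-training.
rgb = torch.randn(2, 3, 224, 224)
depth = torch.randn(2, 1, 224, 224)
pred_rgb, pred_depth = CrossModalAutoEncoder()(rgb, depth)
loss = (nn.functional.l1_loss(pred_rgb, rgb)
        + nn.functional.l1_loss(pred_depth, depth))
```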
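The multi-path fusion idea can likewise be sketched: instead of one fusion operation, several parallel paths model consistent cues (e.g. an element-wise product) and differential cues (e.g. an element-wise difference) before recombination, which is what lets the same module serve cross-modal and cross-level fusion. The specific paths and recombination below are assumptions for illustration, not the paper's exact MPF design.

```python
import torch
import torch.nn as nn

class MPF(nn.Module):
    """Multi-path fusion sketch: split a single fusion step into
    parallel paths so consistent and differential information between
    two feature maps is modeled explicitly.  The path operations
    (product / difference / sum) are assumed, not taken from the paper."""
    def __init__(self, channels):
        super().__init__()
        self.paths = nn.ModuleList(
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                          nn.ReLU())
            for _ in range(3))
        # Recombine the three paths into one fused feature map.
        self.merge = nn.Conv2d(3 * channels, channels, 1)

    def forward(self, f_a, f_b):
        # f_a / f_b: cross-modal (RGB vs. depth) or cross-level features.
        consistent = self.paths[0](f_a * f_b)                # shared cues
        differential = self.paths[1](torch.abs(f_a - f_b))   # complementary cues
        combined = self.paths[2](f_a + f_b)                  # overall response
        return self.merge(torch.cat([consistent, differential, combined], dim=1))

fused = MPF(64)(torch.randn(2, 64, 56, 56), torch.randn(2, 64, 56, 56))
```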
Total citations
[Per-year citation chart, 2021–2024]
Scholar articles
X Zhao, Y Pang, L Zhang, H Lu, X Ruan - arXiv preprint arXiv:2101.12482, 2021