Authors
Zuhui Wang, Yunting Yin, IV Ramakrishnan
Publication date
2024/4/14
Conference
ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Pages
8245-8249
Publisher
IEEE
Description
Image-text matching aims to accurately identify matched cross-modal pairs. Current methods typically project cross-modal features into a common embedding space, but they frequently suffer from imbalanced feature representations across modalities, leading to unreliable retrieval results. To address these limitations, we introduce a novel Feature Enhancement Module that adaptively aggregates single-modal features for more balanced and robust image-text retrieval. Additionally, we propose a new loss function that overcomes the shortcomings of the original triplet ranking loss, thereby significantly improving retrieval performance. The proposed model has been evaluated on two public datasets and achieves competitive retrieval performance compared with several state-of-the-art models. Implementation code can be found here.
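The abstract cites the original triplet ranking loss as the baseline that its new objective improves upon but does not detail either loss. For background only, below is a minimal PyTorch-style sketch of the standard bidirectional triplet ranking loss with hardest-negative mining (as popularized by VSE++), which is the conventional baseline in image-text retrieval; the margin value, the hardest-negative choice, and the `TripletRankingLoss` class name are illustrative assumptions and do not represent the paper's proposed loss.

```python
import torch
import torch.nn as nn


class TripletRankingLoss(nn.Module):
    """Standard bidirectional triplet ranking loss with hardest-negative mining
    (VSE++-style baseline). Margin value is an illustrative assumption, not a
    detail taken from the paper."""

    def __init__(self, margin: float = 0.2):
        super().__init__()
        self.margin = margin

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        # img_emb, txt_emb: (batch, dim), assumed L2-normalized so the dot
        # product equals cosine similarity.
        scores = img_emb @ txt_emb.t()           # (batch, batch) similarity matrix
        pos = scores.diag().view(-1, 1)          # matched-pair scores on the diagonal

        # Hinge costs: image anchor vs. negative texts, and text anchor vs. negative images.
        cost_txt = (self.margin + scores - pos).clamp(min=0)
        cost_img = (self.margin + scores - pos.t()).clamp(min=0)

        # Zero out the positive pairs before selecting the hardest negatives.
        mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
        cost_txt = cost_txt.masked_fill(mask, 0)
        cost_img = cost_img.masked_fill(mask, 0)

        # Sum the hardest (maximum-violation) negative per anchor in both directions.
        return cost_txt.max(dim=1)[0].sum() + cost_img.max(dim=0)[0].sum()
```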