Authors
Yoli Shavit, Ron Ferens, Yosi Keller
Publication date
2024/3/15
Journal
Computer Vision and Image Understanding
Pages
103982
Publisher
Academic Press
Description
Contemporary state-of-the-art localization methods perform feature matching against a structured scene model or learn to regress the scene 3D coordinates. The resulting matches between 2D query pixels and 3D scene coordinates are used to estimate the camera pose using PnP and RANSAC, requiring the camera intrinsics for both the query and reference images. An alternative approach is to directly regress the camera pose from the query image. Although less accurate, absolute camera pose regression does not require any additional information at inference time and is typically lightweight and fast. Recently, Transformers were proposed for learning multi-scene camera pose regression, employing encoders to attend to spatially varying deep features while using decoders to embed multiple scene queries at once. In this work, we show that Transformer Encoders can aggregate and extract task-informative …
Total citations
Citations per year chart: 2022, 2023, 2024
Scholar articles
Y Shavit, R Ferens, Y Keller - arXiv preprint arXiv:2103.11477, 2021
Y Shavit, R Ferens, Y Keller - Computer Vision and Image Understanding, 2024