
CVPR 2020 Image Matching Challenge: a new dataset + a new evaluation protocol. SOTA should be trembling!

Recovering the 3D structure of a scene from a collection of images is a popular topic in computer vision research; it is what lets us admire Easter Island on Google Maps from thousands of miles away. Such reconstructions rely on images captured under controlled conditions, which yields consistent, high-quality results but constrains the capture devices and viewpoints. Now imagine reconstructing this complex world with SfM from the vast number of photos on the internet, without any professional equipment at all.

To accelerate research in this area and make better use of the information in such images, Google together with UVIC, CTU, and EPFL published the paper "Image Matching across Wide Baselines: From Paper to Practice" [PDF]. It introduces a new benchmark, an evaluation pipeline plus a dataset, for methods used in 3D reconstruction, focused on matching between 2D images. The evaluation pipeline makes it easy to plug in and assess popular feature-matching algorithms, both classical and learning-based.

Google has released the dataset for the 2020 Image Matching Challenge: see the official blog post; the leaderboard is linked at the end of this article.

Introduction

Image feature matching is one of the fundamental, core problems of computer vision; it underpins image retrieval [48, 7, 69, 91, 63], 3D reconstruction [3, 43, 79, 106], re-localization [74, 75, 51], and SLAM [61, 30, 31], among many other areas. The problem has been studied for decades but remains far from solved. The main challenges include changes in viewpoint, scale, rotation, illumination, occlusion, and differences across cameras.

In recent years, researchers have shifted toward end-to-end learned methods (image → pose), but these have yet to match the performance of the classical pipeline (image → matches → bundle adjustment). The classical approach splits 3D reconstruction into two sub-problems: feature matching and pose estimation. New methods for either sub-problem are typically evaluated with "intermediate metrics", but measuring a sub-problem in isolation says little about overall performance. For instance, some works only demonstrate an advantage over handcrafted SIFT on a particular dataset; do those algorithms still hold up in real applications? The experiments below show that, properly tuned, classical methods can rival algorithms billed as "SOTA" (a real slap in the face).

It is time to evaluate differently: rather than dwelling on intermediate metrics, this paper focuses on performance on downstream tasks. Contributions:

  1. 30k images with depth maps and ground-truth poses (posed images)
  2. A modular pipeline combining dozens of classical and recent methods for feature extraction, matching, and pose estimation, plus various heuristics, each of which can be swapped and tuned independently
  3. Two downstream tasks: stereo and multi-view reconstruction
  4. A thorough study of dozens of handcrafted and learned features and techniques, their combinations, and the hyperparameter selection process

Related Work

Local Features

Local features became mainstream after the introduction of SIFT. Their pipeline typically has several stages: keypoint detection, orientation estimation, and descriptor extraction. Besides SIFT, handcrafted features include SURF [15], ORB [73], and AKAZE [4].

Modern descriptors are usually deep networks trained on image patches pre-cropped around SIFT (i.e. DoG) keypoints; examples include DeepDesc [82], TFeat [11], L2-Net [89], HardNet [57], SOSNet [90], and LogPolarDesc [34] (the vast majority of these are trained on the same dataset).

Some recent works exploit additional cues such as geometry or global context during training, including GeoDesc [50] and ContextDesc [49].

Other methods train the detector and the descriptor separately, e.g. TILDE [95], TCDet [103], QuadNet [78], and Key.Net [13]; still others train the two jointly, e.g. LIFT [99], DELF [63], SuperPoint [31], LF-Net [64], D2-Net [33], and R2D2 [72].

Robust Matching

In wide-baseline stereo the inlier ratio can drop to 10% or even lower, so matching requires algorithms that can still recover the pose from such contaminated sets. Common choices are the 5- [62], 7- [41], and 8-point [39] algorithms combined with random sample consensus (RANSAC) [36]. Improved variants include local optimization [24], MLESAC [92], PROSAC [23], DEGENSAC [26], GC-RANSAC [12], MAGSAC [29], and CNe (Context Networks) [100] + RANSAC, as well as [70, 104, 85, 102]. The authors add at the end: "Despite their promise, it remains unclear how well they perform in real settings" (skepticism noted, ha).
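The whole RANSAC family shares one skeleton: repeatedly fit a model to a random minimal sample and keep the hypothesis with the most inliers. A minimal illustrative sketch of my own (line fitting instead of pose estimation, plain Python), not any of the cited implementations:

```python
import random

def ransac_line(points, n_iters=200, inlier_thresh=0.1, rng=random.Random(0)):
    """Minimal RANSAC: fit y = a*x + b to points contaminated by outliers."""
    best_model, best_inliers = None, []
    for _ in range(n_iters):
        (x1, y1), (x2, y2) = rng.sample(points, 2)   # minimal sample: 2 points
        if x1 == x2:
            continue                                  # degenerate sample
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        # Score the hypothesis by counting points within the inlier threshold
        inliers = [(x, y) for x, y in points if abs(y - (a * x + b)) < inlier_thresh]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (a, b), inliers
    return best_model, best_inliers

# 20 inliers on y = 2x + 1, plus 5 gross outliers
pts = [(x * 0.1, 2 * (x * 0.1) + 1) for x in range(20)]
pts += [(0.5, 9.0), (1.1, -4.0), (0.3, 7.0), (1.7, 0.0), (0.9, 8.0)]
(a, b), inliers = ransac_line(pts)
```

The variants above differ mainly in how samples are drawn (PROSAC), how hypotheses are scored (MLESAC, MAGSAC), and how the best model is refined (LO-RANSAC, GC-RANSAC).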

Structure from Motion (SfM)

Representative methods include [3, 43, 27, 37, 106]; the most popular are VisualSFM [98] and COLMAP [79] (the latter used here for ground truth).

Datasets and Benchmarks

Previous feature-matching datasets include:

  • The Oxford dataset [54]: 48 images with ground-truth homographies
  • HPatches [9]: 696 planar images with illumination and viewpoint changes, no occlusion
  • DTU [1], Edge Foci [107], Webcam [95], AMOS [67], and Strecha's [83]

All of these datasets have limitations: narrow baselines, noisy ground truth, or too few images. Learned descriptors are typically trained on [20]; part of the reason they beat SIFT may simply be overfitting (would the authors blush reading this?).
Datasets for navigation, re-localization, and SLAM include KITTI [38], Aachen [76], RobotCar [52], and CMU Seasons [75, 8], but they lack the rich variety of transformations found in the Phototourism data.

The Phototourism Dataset

Since the datasets above are so "lacking", the authors built what they consider the best public alternative: the Phototourism dataset. It is based on 25 popular landmark collections selected from [43, 88] (30k images in total), each with hundreds to thousands of images. For the paper, 11 scenes are used, 9 for testing and 2 for validation. Images are downsized to at most 1024 pixels, and COLMAP [79] is used to estimate poses, point clouds, and depth maps; occlusions are filtered using the reconstructed model.

Specifically, see the two tables below:

The Pipeline

The pipeline is shown in the figure above; the blue boxes mark its stages, each introduced in turn below.

Feature Extraction

The authors consider three families of features:

  1. Fully handcrafted features:
    SIFT [48] (and RootSIFT [6]), SURF [15], ORB [73], AKAZE [4], and the FREAK [108] descriptor on BRISK [109] keypoints, all using the OpenCV implementations; except for ORB, detection thresholds are lowered to extract more features.
    In addition, several DoG variants from VLFeat [94] are considered: (VL-)DoG, Hessian [16], Hessian-Laplace [55], Harris-Laplace [55], MSER [53], plus their affine variants: DoG-Affine, Hessian-Affine [55, 14], DoG-AffNet [59], Hessian-AffNet [59]
  2. Descriptors learned on DoG keypoints:
    L2-Net [89], HardNet [57], GeoDesc [50], SOSNet [90], ContextDesc [49], LogPolarDesc [34]
  3. End-to-end learned features:
    SuperPoint [31], LF-Net [64], and D2-Net [33], with single- (SS) and multi-scale (MS) variants

Feature Matching

Plain nearest-neighbor matching is used here.
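As a sketch of what nearest-neighbor matching means in practice, here is my own NumPy illustration with Lowe's ratio test added (the benchmark's actual matcher may differ in details):

```python
import numpy as np

def nn_match(desc1, desc2, ratio=0.8):
    """Match each descriptor in desc1 to its nearest neighbor in desc2.

    A match is kept only if it passes the ratio test: the nearest
    neighbor must be clearly closer than the second-nearest.
    """
    # Pairwise squared Euclidean distances, shape (N1, N2)
    d = ((desc1[:, None, :] - desc2[None, :, :]) ** 2).sum(-1)
    nn = np.argsort(d, axis=1)[:, :2]            # two nearest neighbors per query
    d1 = d[np.arange(len(desc1)), nn[:, 0]]
    d2 = d[np.arange(len(desc1)), nn[:, 1]]
    keep = np.sqrt(d1) < ratio * np.sqrt(d2)     # ratio test on true distances
    return [(i, nn[i, 0]) for i in np.where(keep)[0]]

# Toy example: desc1[0] is clearly closest to desc2[1]
desc1 = np.array([[1.0, 0.0, 0.0]])
desc2 = np.array([[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]])
matches = nn_match(desc1, desc2)
```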

Outlier Filtering

Context Networks [100] followed by RANSAC [100, 85], abbreviated CNe; the effect is shown below:

Stereo task

Given images $\mathbf{I}_i$ and $\mathbf{I}_j$, the fundamental matrix $\mathbf{F}_{i,j}$ is estimated. Besides the RANSAC [36, 25] implementations in OpenCV [19] and scikit-learn [65], the authors also use DEGENSAC [26], GC-RANSAC [12], and MAGSAC [29]. Finally, the pose is recovered with OpenCV's recoverPose function.
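Under the hood, pose recovery boils down to decomposing the essential matrix (obtained from $\mathbf{F}$ via the camera intrinsics) into rotation and translation. Below is a NumPy sketch of the textbook SVD decomposition (Hartley & Zisserman), not OpenCV's implementation; the cheirality check that selects among the four candidates is omitted:

```python
import numpy as np

def skew(t):
    return np.array([[0, -t[2], t[1]], [t[2], 0, -t[0]], [-t[1], t[0], 0]])

def decompose_essential(E):
    """Return the four (R, t) candidates encoded by an essential matrix.

    In practice the correct one is chosen by a cheirality check:
    triangulated points must lie in front of both cameras (omitted here).
    """
    U, _, Vt = np.linalg.svd(E)
    W = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    R1, R2 = U @ W @ Vt, U @ W.T @ Vt
    if np.linalg.det(R1) < 0:      # a valid rotation must have det = +1
        R1, R2 = -R1, -R2
    t = U[:, 2]
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]

# Round-trip check: build E = [t]x R from a known pose, then decompose it
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0],
                   [np.sin(theta),  np.cos(theta), 0],
                   [0, 0, 1]])
t_true = np.array([1.0, 0.2, 0.1])
t_true /= np.linalg.norm(t_true)
E = skew(t_true) @ R_true
candidates = decompose_essential(E)
err = min(np.linalg.norm(R - R_true) for R, _ in candidates)
```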

Multi-view task

Since the goal is to evaluate features rather than SfM algorithms, the authors sample images from each large scene to form small subsets called "bags": 100 bags each of 3 and 5 images, 50 bags of 10 images, and 25 bags of 25 images, for 275 bags in total. The matches surviving outlier filtering are then fed into COLMAP [79] for SfM reconstruction.
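The bag construction can be sketched as follows (plain Python; the scene representation and helper names are my own, not the benchmark's code):

```python
import random

# Bag sizes -> number of bags, per the paper: 100 bags of 3 and of 5 images,
# 50 bags of 10, 25 bags of 25 -- 275 bags in total
BAG_SPEC = {3: 100, 5: 100, 10: 50, 25: 25}

def make_bags(image_ids, rng=random.Random(0)):
    """Sample 'bags' (small random image subsets) from one scene."""
    bags = []
    for size, count in BAG_SPEC.items():
        for _ in range(count):
            bags.append(rng.sample(image_ids, size))
    return bags

scene = [f"img_{i:04d}.jpg" for i in range(500)]   # a toy scene with 500 images
bags = make_bags(scene)
```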

Error Metrics

  1. mAA (mean Average Accuracy): stereo and multi-view tasks
  2. ATE (Absolute Trajectory Error): multi-view task
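mAA integrates pose accuracy over a range of error thresholds: for each angular threshold up to a cap (10° in the paper), compute the fraction of pairs whose pose error falls below it, then average. A NumPy sketch of my reading of the metric, not the official implementation:

```python
import numpy as np

def mean_average_accuracy(pose_errors_deg, max_threshold=10.0, num_steps=10):
    """mAA: average, over evenly spaced thresholds up to max_threshold
    degrees, of the fraction of poses with error under each threshold."""
    errors = np.asarray(pose_errors_deg)
    thresholds = np.linspace(max_threshold / num_steps, max_threshold, num_steps)
    accuracy_per_threshold = [(errors <= t).mean() for t in thresholds]
    return float(np.mean(accuracy_per_threshold))

# Toy example: four image pairs with pose errors of 0.5, 2, 6 and 50 degrees
maa = mean_average_accuracy([0.5, 2.0, 6.0, 50.0])
```

Unlike accuracy at a single threshold, mAA rewards methods whose errors are not just under the cap but small.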

The Experiments Begin: Configuration Details Matter

First, RANSAC is compared under different parameter settings (confidence, epipolar error threshold, and maximum number of iterations):

Overall, MAGSAC performs best, with DEGENSAC a close second. The authors also note that "default settings can be woefully inadequate. For example, OpenCV sets τ = 0.99 and η = 3 pixels, which results in a mAP at 10° of 0.5292 on the validation set – a performance drop of 23.9% relative." So when using OpenCV's RANSAC day to day, tune the parameters yourself.
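Why the defaults matter: the standard RANSAC termination rule chooses the number of iterations $N$ so that, with confidence $\tau$, at least one all-inlier minimal sample of size $s$ is drawn, i.e. $N = \log(1-\tau)/\log(1-w^s)$ for inlier ratio $w$. At the low inlier ratios of wide-baseline stereo, a capped iteration budget silently degrades results. A quick illustration:

```python
import math

def ransac_iterations(confidence, inlier_ratio, sample_size):
    """Iterations needed to draw at least one uncontaminated minimal
    sample with the given confidence (standard stopping criterion)."""
    return math.ceil(math.log(1 - confidence) /
                     math.log(1 - inlier_ratio ** sample_size))

# 7-point fundamental-matrix samples at 10% vs. 90% inlier ratio
need = ransac_iterations(confidence=0.999999, inlier_ratio=0.1, sample_size=7)
easy = ransac_iterations(confidence=0.999999, inlier_ratio=0.9, sample_size=7)
```

At a 10% inlier ratio the required budget explodes past a hundred million samples, while at 90% a couple of dozen suffice; any fixed default sits wrongly for one of the two regimes.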

The authors further argue that the optimal RANSAC inlier threshold differs for each local feature, and ran the following experiment.
Figure 5. RANSAC – inlier threshold $\eta$
The figure shows that the DoG-based learned features cluster together while the others are scattered, which makes a single threshold hard to pick; the authors therefore used each method's recommended settings, or otherwise reasonable values, as the inlier threshold.

Results

The paper reports many results and conclusions; here we focus on just a few of interest.

8k Features

So who is the long-awaited true SOTA? With every feature's hyperparameters tuned to their best, the test results are:

  1. On mAA, DoG keypoints occupy the top spots, with SOSNet ranked #1, closely followed by HardNet.
  2. 'HardNetAmos+' [56] is trained on more data (Brown [20], HPatches [9], AMOS [67]), yet performs worse than the model trained only on Brown's 'Liberty'.
  3. On the multi-view task, DoG+HardNet is top-tier, slightly ahead of ContextDesc, SOSNet, and LogPolarDesc;
  4. R2D2 is the best-performing end-to-end method and also does well on multi-view (#6), but is worse than SIFT on stereo;
  5. D2-Net does not perform well, possibly because image downsampling degrades keypoint localization;
  6. Properly tuned SIFT, and especially RootSIFT, ranks #9 on both the stereo and multi-view tasks, within 13.1% and 4.9% of the so-called SOTA. (A proud day for classical features!)

2k Features

This budget allows a fair comparison with LF-Net and SuperPoint; the results are shown in the figure below:

Conclusions:

  1. Key.Net+HardNet performs best, with LogPolarDesc in second place;
  2. R2D2 ranks #2 on the stereo task and #7 on the multi-view task

8k vs. 2k

Conclusions:

  1. DoG-based methods readily benefit from more features, whereas learned methods benefit from retraining (a conclusion drawn from the Key.Net+HardNet combination, which the authors retrained to excellent results);
  2. Overall, the learned features (Key.Net, SuperPoint, R2D2, LF-Net) do better in the multi-view setting than in the stereo setting (the authors' hypothesis: they are robust but localize keypoints less precisely)

Illumination Changes

The authors applied adaptive histogram equalization (CLAHE [66]) to adjust image photometry; as the figure above shows, nearly all learning-based methods degrade, possibly because they were never trained on such inputs. SIFT does not improve noticeably either, perhaps because the SIFT descriptor is already near-optimal under its design assumptions.
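For reference, CLAHE is a tiled, clip-limited variant of histogram equalization. Here is a minimal NumPy sketch of plain global histogram equalization (my own illustration, not CLAHE itself, which additionally works per tile and clips the histogram to limit contrast):

```python
import numpy as np

def equalize_histogram(gray):
    """Global histogram equalization for an 8-bit grayscale image.

    Maps intensities through the normalized CDF, so the output
    histogram is approximately uniform.
    """
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum().astype(np.float64)
    cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min())   # normalize to [0, 1]
    lut = np.round(cdf * 255).astype(np.uint8)          # lookup table
    return lut[gray]

# A low-contrast toy image: all values squeezed into [100, 120]
img = np.tile(np.arange(100, 121, dtype=np.uint8), (21, 1))
out = equalize_histogram(img)   # dynamic range is stretched toward [0, 255]
```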

New Metrics vs. Traditional Metrics

This section relates the traditional evaluation metrics to the one proposed in this paper.

  1. Matching score is a fairly sensible choice: it appears correlated with mAA, though a higher matching score does not guarantee a higher mAA, e.g. RootSIFT vs. ContextDesc;
  2. Repeatability is harder to connect to final pose quality. AKAZE has the best repeatability but very poor matching score and pose mAA; in the authors' own words (arXiv v1), its "descriptor may hurt its performance";
  3. Key.Net achieves the best repeatability yet trails the DoG-based methods on mAA, even when using the same HardNet descriptor;
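To make "repeatability" concrete, here is a simplified NumPy definition on a toy setup (the paper's exact protocol differs in details such as symmetry and scale handling):

```python
import numpy as np

def repeatability(kp1, kp2, warp, eps=3.0):
    """Fraction of keypoints in image 1 that, after warping into image 2
    by the ground-truth transform, land within eps pixels of some
    detected keypoint in image 2 (simplified, one-directional)."""
    proj = np.array([warp(p) for p in kp1])
    d = np.linalg.norm(proj[:, None, :] - kp2[None, :, :], axis=-1)
    return float((d.min(axis=1) < eps).mean())

# Toy setup: image 2 is image 1 shifted by (10, 0); one detection is missed
kp1 = np.array([[5.0, 5.0], [20.0, 30.0], [40.0, 10.0]])
kp2 = np.array([[15.0, 5.0], [30.0, 30.0]])     # third keypoint not re-detected
rep = repeatability(kp1, kp2, lambda p: p + np.array([10.0, 0.0]))
```

Note that this score never looks at descriptors, which is exactly why a high-repeatability detector paired with a weak descriptor can still yield poor matching score and pose mAA.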

Note: all results above are from the arXiv release of the paper; for the latest numbers, see the official leaderboard.

Since I am currently using SuperPoint features (SuperPoint (2k features, NMS=4), DEGENSAC), I paid special attention to its standing. In the 2k-feature camp it does not fare well (#35 out of 52 entries at the moment), but SuperPoint + SuperGlue + DEGENSAC and SuperPoint + GIFT + Graph Motion Coherence Network + DEGENSAC rank #1 and #2 respectively, which is rather reassuring!

References

1. H. Aanaes, A. L. Dahl, and K. Steenstrup Pedersen. Interesting Interest Points. IJCV, 97:18–35, 2012. 2
2. H. Aanaes and F. Kahl. Estimation of Deformable Structure and Motion. In Vision and Modelling of Dynamic Scenes Workshop, 2002. 6
3. S. Agarwal, N. Snavely, I. Simon, S.M. Seitz, and R. Szeliski. Building Rome in One Day. In ICCV, 2009. 1, 2
4. P. F. Alcantarilla, J. Nuevo, and A. Bartoli. Fast Explicit Diffusion for Accelerated Features in Nonlinear Scale Spaces. In BMVC, 2013. 2, 3
5. Anonymous. DeepSFM: Structure From Motion Via Deep Bundle Adjustment. In Submission to ICLR, 2020. 2
6. Relja Arandjelovic. Three things everyone should know to improve object retrieval. In CVPR, 2012. 3
7. Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. NetVLAD: CNN Architecture for Weakly Supervised Place Recognition. In CVPR, 2016. 1
8. Hernan Badino, Daniel Huber, and Takeo Kanade. The CMU Visual Localization Data Set. http://3dvis.ri.cmu.edu/data-sets/localization, 2011. 2
9. V. Balntas, K. Lenc, A. Vedaldi, and K. Mikolajczyk. HPatches: A Benchmark and Evaluation of Handcrafted and Learned Local Descriptors. In CVPR, 2017. 2, 7
10. Vassileios Balntas, Shuda Li, and Victor Prisacariu. RelocNet: Continuous Metric Learning Relocalisation using Neural Nets. In The European Conference on Computer Vision (ECCV), September 2018. 1
11. V. Balntas, E. Riba, D. Ponsa, and K. Mikolajczyk. Learning Local Feature Descriptors with Triplets and Shallow Convolutional Neural Networks. In BMVC, 2016. 2
12. Daniel Barath and Ji Matas. Graph-cut ransac. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018. 2, 4
13. Axel Barroso-Laguna, Edgar Riba, Daniel Ponsa, and Krystian Mikolajczyk. Key.Net: Keypoint Detection by Handcrafted and Learned CNN Filters. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019. 2, 3
14. A. Baumberg. Reliable Feature Matching Across Widely Separated Views. In CVPR, pages 774–781, 2000. 3, 6
15. H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded Up Robust Features. In ECCV, 2006. 2, 3
16. P. R. Beaudet. Rotationally invariant image operators. In Proceedings of the 4th International Joint Conference on Pattern Recognition, pages 579–583, Kyoto, Japan, Nov. 1978. 3, 6
17. Jia-Wang Bian, Yu-Huan Wu, Ji Zhao, Yun Liu, Le Zhang, Ming-Ming Cheng, and Ian Reid. An Evaluation of Feature Matchers for Fundamental Matrix Estimation. In BMVC, 2019. 2
18. Eric Brachmann and Carsten Rother. Neural- Guided RANSAC: Learning Where to Sample Model Hypotheses. In ICCV, 2019. 2
19. G. Bradski. The OpenCV Library. Dr. Dobb’s Journal of Software Tools, 2000. 4
20. M. Brown, G. Hua, and S. Winder. Discriminative Learning of Local Image Descriptors. PAMI, 2011. 1, 2, 7
21. M. Brown and D. Lowe. Automatic Panoramic Image Stitching Using Invariant Features. IJCV, 74:59–73, 2007. 2
22. Mai Bui, Christoph Baur, Nassir Navab, Slobodan Ilic, and Shadi Albarqouni. Adversarial Networks for Camera Pose Regression and Refinement. In The IEEE International Conference on Computer Vision (ICCV) Workshops, Oct 2019. 1
23. Ondřej Chum and Jiří Matas. Matching with PROSAC Progressive Sample Consensus. In CVPR, pages 220–226, June 2005. 2
24. Ondřej Chum, Jiří Matas, and Josef Kittler. Locally Optimized RANSAC. In PR, 2003. 2
25. Ondřej Chum, Jiří Matas, and Josef Kittler. Locally Optimized RANSAC. In Pattern Recognition, 2003. 4
26. Ondrej Chum, Tomas Werner, and Jiri Matas. Two-View Geometry Estimation Unaffected by a Dominant Plane. In CVPR, 2005. 2, 4
27. Hainan Cui, Xiang Gao, Shuhan Shen, and Zhanyi Hu. Hsfm: Hybrid structure-from-motion. In CVPR, July 2017. 2
28. Zheng Dang, Kwang Moo Yi, Yinlin Hu, Fei Wang, Pascal Fua, and Mathieu Salzmann. Eigendecomposition-Free Training of Deep Networks with Zero Eigenvalue-Based Losses. In ECCV, 2018. 4
29. Daniel Barath, Jiri Matas, and Jana Noskova. MAGSAC: Marginalizing Sample Consensus. In CVPR, 2019. 1, 2, 4
30. D. Detone, T. Malisiewicz, and A. Rabinovich. Toward Geometric Deep SLAM. arXiv preprint arXiv:1707.07410, 2017. 1
31. D. Detone, T. Malisiewicz, and A. Rabinovich. Superpoint: Self-Supervised Interest Point Detection and Description. CVPR Workshop on Deep Learning for Visual SLAM, 2018. 1, 2, 3, 8
32. J. Dong and S. Soatto. Domain-Size Pooling in Local Descriptors: DSP-SIFT. In CVPR, 2015. 6
33. M. Dusmanu, I. Rocco, T. Pajdla, M. Pollefeys, J. Sivic, A. Torii, and T. Sattler. D2-Net: A Trainable CNN for Joint Detection and Description of Local Features. In CVPR, 2019. 1, 2, 3, 8
34. Patrick Ebel, Anastasiia Mishchuk, Kwang Moo Yi, Pascal Fua, and Eduard Trulls. Beyond Cartesian Representations for Local Descriptors. In ICCV, 2019. 2, 3, 6
35. Vassileios Balntas et al. SILDa: A Multi-Task Dataset for Evaluating Visual Localization. https://github.com/scape-research/silda, 2018. 2
36. M.A Fischler and R.C. Bolles. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Communications ACM, 24(6):381–395, 1981. 1, 2, 4
37. P. Gay, V. Bansal, C. Rubino, and A. D. Bue. Probabilistic Structure from Motion with Objects (PSfMO). In ICCV, 2017. 2
38. Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In CVPR, 2012. 2
39. R.I. Hartley. In Defense of the Eight-Point Algorithm. PAMI, 19(6):580–593, June 1997. 2
40. R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2000. 1
41. R. I. Hartley. Projective reconstruction and invariants from multiple images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(10):1036–1041, Oct 1994. 1, 2
42. K. He, Y. Lu, and S. Sclaroff. Local Descriptors Optimized for Average Precision. In CVPR, 2018. 1
43. J. Heinly, J.L. Schoenberger, E. Dunn, and J-M. Frahm. Reconstructing the World in Six Days. In CVPR, 2015. 1, 2, 3
44. Karel Lenc, Varun Gulshan, and Andrea Vedaldi. VLBenchmarks. http://www.vlfeat.org/benchmarks/, 2011. 2
45. A. Kendall, M. Grimes, and R. Cipolla. Posenet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization. In ICCV, pages 2938–2946, 2015. 1
46. J. Krishna Murthy, Ganesh Iyer, and Liam Paull. gradSLAM: Dense SLAM meets Automatic Differentiation. arXiv, 2019. 2
47. Zhengqi Li and Noah Snavely. MegaDepth: Learning Single-View Depth Prediction from Internet Photos. In CVPR, 2018. 2
48. David G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. IJCV, 60(2):91–110, November 2004. 1, 2, 3, 4, 6, 8, 15
49. Zixin Luo, Tianwei Shen, Lei Zhou, Jiahui Zhang, Yao Yao, Shiwei Li, Tian Fang, and Long Quan. ContextDesc: Local Descriptor Augmentation with Cross-Modality Context. In CVPR, 2019. 2, 3
50. Z. Luo, T. Shen, L. Zhou, S. Zhu, R. Zhang, Y. Yao, T. Fang, and L. Quan. Geodesc: Learning Local Descriptors by Integrating Geometry Constraints. In ECCV, 2018. 2, 3
51. Simon Lynen, Bernhard Zeisl, Dror Aiger, Michael Bosse, Joel Hesch, Marc Pollefeys, Roland Siegwart, and Torsten Sattler. Large-scale, real-time visual-inertial localization revisited. arXiv Preprint, 2019. 1
52. Will Maddern, Geoffrey Pascoe, Chris Linegar, and Paul Newman. 1 year, 1000 km: The Oxford RobotCar dataset. IJRR, 36(1):3–15, 2017. 2
53. J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust Wide-Baseline Stereo from Maximally Stable Extremal Regions. IVC, 22(10):761–767, 2004. 3, 6
54. K. Mikolajczyk and C. Schmid. A Performance Evaluation of Local Descriptors. PAMI, 27(10):1615–1630, 2004. 2
55. K. Mikolajczyk, C. Schmid, and A. Zisserman. Human Detection Based on a Probabilistic Assembly of Robust Part Detectors. In ECCV, pages 69–82, 2004. 3, 6
56. Milan Pultar, Dmytro Mishkin, and Jiri Matas. Leveraging Outdoor Webcams for Local Descriptor Learning. In Proceedings of CVWW 2019, 2019. 7
57. A. Mishchuk, D. Mishkin, F. Radenovic, and J. Matas. Working Hard to Know Your Neighbor’s Margins: Local Descriptor Learning Loss. In NeurIPS, 2017. 2, 3, 6
58. Dmytro Mishkin, Jiri Matas, and Michal Perdoch. MODS: Fast and Robust Method for Two-View Matching. CVIU, 2015. 6, 15
59. D. Mishkin, F. Radenovic, and J. Matas. Repeatability is Not Enough: Learning Affine Regions via Discriminability. In ECCV, 2018. 3, 6
60. Arun Mukundan, Giorgos Tolias, and Ondrej Chum. Explicit Spatial Encoding for Deep Local Descriptors. In CVPR, 2019. 1
61. R. Mur-Artal, J. Montiel, and J. Tardos. Orb-Slam: A Versatile and Accurate Monocular Slam System. IEEE Transactions on Robotics, 31(5):1147–1163, 2015. 1
62. D. Nister. An Efficient Solution to the Five-Point Relative Pose Problem. In CVPR, June 2003. 2
63. Hyeonwoo Noh, Andre Araujo, Jack Sim, Tobias Weyand, and Bohyung Han. Large-Scale Image Retrieval with Attentive Deep Local Features. In ICCV, 2017. 1, 2
64. Yuki Ono, Eduard Trulls, Pascal Fua, and Kwang Moo Yi. LF-Net: Learning Local Features from Images. In NeurIPS, 2018. 2, 3
65. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011. 4
66. Stephen M. Pizer, E. Philip Amburn, John D. Austin, Robert Cromartie, Ari Geselowitz, Trey Greer, Bart ter Haar Romeny, John B. Zimmerman, and Karel Zuiderveld. Adaptive histogram equalization and its variations. Computer vision, graphics, and image processing, 1987. 15
67. M. Pultar, D. Mishkin, and J. Matas. Leveraging Outdoor Webcams for Local Descriptor Learning. In Computer Vision Winter Workshop, 2019. 2, 7
68. C.R. Qi, H. Su, K. Mo, and L.J. Guibas. Pointnet: Deep Learning on Point Sets for 3D Classification and Segmentation. In CVPR, 2017. 4
69. Filip Radenovic, Giorgos Tolias, and Ondrej Chum. CNN Image Retrieval Learns from BoW: Unsupervised Fine-Tuning with Hard Examples. In ECCV, 2016. 1
70. R. Ranftl and V. Koltun. Deep Fundamental Matrix Estimation. In ECCV, 2018. 2, 4
71. J. Revaud, P. Weinzaepfel, C. De Souza, N. Pion, G. Csurka, Y. Cabon, and M. Humenberger. R2D2: Repeatable and Reliable Detector and Descriptor. In arXiv Preprint, 2019. 8
72. Jérôme Revaud, Philippe Weinzaepfel, César Roberto de Souza, Noe Pion, Gabriela Csurka, Yohann Cabon, and Martin Humenberger. R2D2: Repeatable and Reliable Detector and Descriptor. In NeurIPS, 2019. 2
73. E. Rublee, V. Rabaud, K. Konolidge, and G. Bradski. ORB: An Efficient Alternative to SIFT or SURF. In ICCV, 2011. 2, 3, 6
74. Torsten Sattler, Bastian Leibe, and Leif Kobbelt. Improving Image-Based Localization by Active Correspondence Search. In ECCV, 2012. 1
75. T. Sattler, W. Maddern, C. Toft, A. Torii, L. Hammarstrand, E. Stenborg, D. Safari, M. Okutomi, M. Pollefeys, J. Sivic, F. Kahl, and T. Pajdla. Benchmarking 6DOF Outdoor Visual Localization in Changing Conditions. In CVPR, 2018. 1, 2
76. Torsten Sattler, Tobias Weyand, Bastian Leibe, and Leif Kobbelt. Image Retrieval for Image-Based Localization Revisited. In BMVC, 2012. 2
77. Torsten Sattler, Qunjie Zhou, Marc Pollefeys, and Laura Leal-Taixe. Understanding the Limitations of CNN-based Absolute Camera Pose Regression. In CVPR, 2019. 1
78. N. Savinov, A. Seki, L. Ladicky, T. Sattler, and M. Pollefeys. Quad-Networks: Unsupervised Learning to Rank for Interest Point Detection. CVPR, 2017. 2
79. J.L. Schönberger and J.M. Frahm. Structure-From-Motion Revisited. In CVPR, 2016. 1, 2, 3, 4, 6
80. J.L. Schönberger, H. Hardmeier, T. Sattler, and M. Pollefeys. Comparative Evaluation of Hand-Crafted and Learned Local Features. In CVPR, 2017. 2
81. Yunxiao Shi, Jing Zhu, Yi Fang, Kuochin Lien, and Junli Gu. Self-Supervised Learning of Depth and Ego-motion with Differentiable Bundle Adjustment. arXiv Preprint, 2019. 2
82. E. Simo-serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer. Discriminative Learning of Deep Convolutional Feature Point Descriptors. In ICCV, 2015. 2
83. C. Strecha, W.V. Hansen, L. Van Gool, P. Fua, and U. Thoennessen. On Benchmarking Camera Calibration and Multi-View Stereo for High Resolution Imagery. In CVPR, 2008. 2
84. J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A Benchmark for the Evaluation of RGB-D SLAM Systems. In IROS, 2012. 4
85. Weiwei Sun, Wei Jiang, Eduard Trulls, Andrea Tagliasacchi, and Kwang Moo Yi. Attentive Context Normalization for Robust Permutation-Equivariant Learning. In arXiv Preprint, 2019. 2, 4, 8
86. Chengzhou Tang and Ping Tan. Ba-Net: Dense Bundle Adjustment Network. In ICLR, 2019. 2
87. Keisuke Tateno, Federico Tombari, Iro Laina, and Nassir Navab. Cnn-slam: Real-time dense monocular slam with learned depth prediction. In CVPR, July 2017. 2
88. B. Thomee, D.A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L. Li. YFCC100M: the New Data in Multimedia Research. In CACM, 2016. 3
89. Y. Tian, B. Fan, and F. Wu. L2-Net: Deep Learning of Discriminative Patch Descriptor in Euclidean Space. In CVPR, 2017. 2, 3
90. Yurun Tian, Xin Yu, Bin Fan, Fuchao Wu, Huub Heijnen, and Vassileios Balntas. SOSNet: Second Order Similarity Regularization for Local Descriptor Learning. In CVPR, 2019. 1, 2, 3
91. Giorgos Tolias, Yannis Avrithis, and Hervé Jégou. Image Search with Selective Match Kernels: Aggregation Across Single and Multiple Images. IJCV, 116(3):247–261, Feb 2016. 1
92. P.H.S. Torr and A. Zisserman. MLESAC: A New Robust Estimator with Application to Estimating Image Geometry. CVIU, 78:138–156, 2000. 2
93. B. Triggs, P. Mclauchlan, R. Hartley, and A. Fitzgibbon. Bundle Adjustment – A Modern Synthesis. In Vision Algorithms: Theory and Practice, pages 298–372, 2000. 1
94. Andrea Vedaldi and Brian Fulkerson. Vlfeat: An open and portable library of computer vision algorithms. In Proceedings of the 18th ACM International Conference on Multimedia, MM ’10, pages 1469–1472, 2010. 3
95. Y. Verdie, K. M. Yi, P. Fua, and V. Lepetit. TILDE: A Temporally Invariant Learned DEtector. In CVPR, 2015. 2
96. S. Vijayanarasimhan, S. Ricco, C. Schmid, R. Sukthankar, and K. Fragkiadaki. Sfm-Net: Learning of Structure and Motion from Video. arXiv Preprint, 2017. 2
97. X. Wei, Y. Zhang, Y. Gong, and N. Zheng. Kernelized Subspace Pooling for Deep Local Descriptors. In CVPR, 2018. 1
98. Changchang Wu. Towards Linear-Time Incremental Structure from Motion. In 3DV, 2013. 2, 6
99. Kwang Moo Yi, Eduard Trulls, Vincent Lepetit, and Pascal Fua. LIFT: Learned Invariant Feature Transform. In ECCV, 2016. 2
100. K. M. Yi, E. Trulls, Y. Ono, V. Lepetit, M. Salzmann, and P. Fua. Learning to Find Good Correspondences. In CVPR, 2018. 2, 3, 4, 7, 13, 17
101. S. Zagoruyko and N. Komodakis. Learning to Compare Image Patches via Convolutional Neural Networks. In CVPR, 2015. 6
102. Jiahui Zhang, Dawei Sun, Zixin Luo, Anbang Yao, Lei Zhou, Tianwei Shen, Yurong Chen, Long Quan, and Hongen Liao. Learning Two-View Correspondences and Geometry Using Order-Aware Network. ICCV, 2019. 2, 3, 4
103. Xu Zhang, Felix X. Yu, Svebor Karaman, and Shih-Fu Chang. Learning Discriminative and Transformation Covariant Local Feature Detectors. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017. 2
104. Chen Zhao, Zhiguo Cao, Chi Li, Xin Li, and Jiaqi Yang. NM-Net: Mining Reliable Neighbors for Robust Feature Correspondences. In CVPR, 2019. 2, 4
105. Qunjie Zhou, Torsten Sattler, Marc Pollefeys, and Laura Leal-Taixe. To learn or not to learn: Visual localization from essential matrices. arXiv Preprint, 2019. 1
106. Siyu Zhu, Runze Zhang, Lei Zhou, Tianwei Shen, Tian Fang, Ping Tan, and Long Quan. Very Large-Scale Global SfM by Distributed Motion Averaging. In CVPR, June 2018. 1, 2
107. C.L. Zitnick and K. Ramnath. Edge Foci Interest Points. In ICCV, 2011. 2
108. A. Alahi, R. Ortiz, and P. Vandergheynst. FREAK: Fast Retina Keypoint. In CVPR, 2012. 7, 11
109. S. Leutenegger, M. Chli, and R. Y. Siegwart. BRISK: Binary Robust Invariant Scalable Keypoints. In ICCV, pages 2548–2555, 2011. 7