G2G: Exploiting Intra-Group Geometry for Inter-Group Pose Estimation

Wei, Yufei; Ye, Shuhao; Hu, Chenxiao; Pan, Yiyuan; Feng, Dongyu; Xiong, Rong; Wang, Yue; Jiao, Yanmei

Abstract

Recovering the relative 6-DoF pose between two image groups underlies cross-sequence relocalization and multi-camera rig odometry. Each group carries known intra-group geometry from visual odometry or rig calibration, and pretrained multi-view backbones already fuse such geometry into visual features. Yet current models treat all views as an unstructured set, leaving cross-group reasoning as the missing piece. We cast group-to-group pose estimation as a single unified problem that fully exploits the geometry within each image group, and introduce G2G, which keeps the multi-view foundation model entirely frozen so that its rich 3D representations are preserved rather than collapsing under fine-tuning. Three lightweight trainable modules bridge the two groups: a perceiver resampler, a cross-group bridge with merged self-attention, and a multi-frame pose head, together adding about 32M parameters (under 6% of the full model) and supervised only by relative poses. Across four datasets that span indoor and outdoor simulation, real-world cross-session capture with appearance change, and zero-shot sim-to-real transfer, G2G attains state-of-the-art accuracy on both tasks, while every baseline is retrained with its full original supervision.
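To make the problem setup concrete, here is a minimal sketch (not the authors' code) of how one inter-group transform combines with known intra-group poses to place every frame of one group in the other group's reference frame. The 4x4 homogeneous convention, in which `T_x_y` maps points from frame `y` to frame `x`, and the helper names `se3`, `rot_z`, and `place_group_b` are assumptions introduced for illustration.

```python
import numpy as np

def se3(R, t):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def rot_z(deg):
    """Rotation about the z-axis by `deg` degrees."""
    a = np.deg2rad(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def place_group_b(T_a0_b0, poses_b):
    """Express each group-B frame pose (given in B's reference frame b0)
    in group A's reference frame a0 by left-multiplying the inter-group pose."""
    return [T_a0_b0 @ T_b0_bi for T_b0_bi in poses_b]

# Hypothetical intra-group geometry for group B (e.g. from visual odometry).
poses_b = [se3(np.eye(3), [0.0, 0.0, 0.0]),
           se3(rot_z(10.0), [1.0, 0.0, 0.0])]

# Hypothetical inter-group pose: the single quantity that must be estimated.
T_a0_b0 = se3(rot_z(90.0), [2.0, 0.0, 0.0])

placed = place_group_b(T_a0_b0, poses_b)
print(np.round(placed[1][:3, 3], 6))  # position of B's second frame in A's frame
```

Under this convention, the intra-group poses are fixed inputs, so a single 6-DoF transform is enough to register all frames of both groups in one coordinate system.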

Explore in 3D

Interactive Reconstruction Viewer

Orbit real reconstructions placed by G2G. In Relocalization, merge two sequences and compare predicted vs. ground-truth alignment. In Rig Odometry, stitch consecutive multi-camera rigs frame by frame.


More Results

Ablation study

Ablation of G2G design choices on five settings (mean errors)


Cross-sequence localization: full metric suite

Per-dataset translation / rotation errors and accuracy thresholds


Rig odometry: full metric suite

Per-configuration translation / rotation errors, accuracy thresholds, and mAA@30°
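The metrics named above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the evaluation code behind the tables: the rotation error is taken as the geodesic angle between rotation matrices, the translation error as a plain L2 distance, and `maa` follows one common convention (accuracy averaged over integer angular thresholds from 1° to 30°). The page does not spell out its exact mAA definition, and all function names here are hypothetical.

```python
import numpy as np

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle between two 3x3 rotation matrices, in degrees."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from round-off.
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def translation_error(t_pred, t_gt):
    """Plain L2 distance between two translation vectors."""
    return float(np.linalg.norm(np.asarray(t_pred, float) - np.asarray(t_gt, float)))

def accuracy_at(errors, threshold):
    """Fraction of samples whose error is strictly below `threshold`."""
    return float(np.mean(np.asarray(errors, dtype=float) < threshold))

def maa(errors_deg, max_threshold=30):
    """Mean of accuracy_at over integer thresholds 1..max_threshold (assumed convention)."""
    thresholds = np.arange(1, max_threshold + 1)
    return float(np.mean([accuracy_at(errors_deg, th) for th in thresholds]))

# Hypothetical per-pair angular errors in degrees.
errors = [0.5, 10.5, 40.0]
print(accuracy_at(errors, 5.0), maa(errors))
```

Averaging over many thresholds rewards methods that are accurate at tight tolerances, whereas a single accuracy threshold only counts how many pairs clear one bar.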


Pose error vs. field-of-view overlap

**Per-pair overlap distribution.** Histograms of field-of-view overlap for the evaluation pairs of (a) HM3D (median 0.50), (b) TartanGround (median 0.55), (c) NCLT (median 0.24), and (d) ZJH (median 0.34); each panel's box reports n, min, max, mean, and median. Faint dashed lines mark the nine curriculum thresholds (0.50 down to 0.10). NCLT has the lowest overlap, which makes it the hardest cross-sequence setting.

**Mean rotation error vs. field-of-view overlap.** Panels (a)–(d) cover HM3D, TartanGround, NCLT, and ZJH. In each, every method's mean rotation error (log scale) is binned by the field-of-view overlap of the selected top-1 window, i.e., the geometric overlap the model actually sees. On ZJH, solid and dashed curves denote the simulated and real splits. The MA-AB oracle (gray) is a reference ceiling. G2G (red) degrades most gradually as overlap drops and stays closest to the oracle, while baselines grow by up to an order of magnitude in the low-overlap regime.

**Mean translation error vs. field-of-view overlap.** Panels (a)–(d) cover HM3D, TartanGround, NCLT, and ZJH. In each, every method's mean translation error in meters (log scale) is binned by the field-of-view overlap of the selected top-1 window, i.e., the geometric overlap the model actually sees. On ZJH, solid and dashed curves denote the simulated and real splits. Translation follows the main-table scale convention: VGGT is up-to-scale (raw L2), Reloc3R is Procrustes-aligned, and the remaining metric methods (G2G, LoMa, CoViS-Net, MA) use bare L2, so absolute values are not comparable across methods. The MA-AB oracle (gray) is a reference ceiling; G2G (red) stays closest to it and degrades most gradually as overlap drops, while baselines grow sharply in the low-overlap regime.
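The binning behind these error-vs-overlap curves can be sketched as below. This is a minimal illustration, not the plotting code used for the figures: the bin edges, the sample values, and the function name `mean_error_by_overlap` are all assumptions, and the actual bin layout is not stated on this page.

```python
import numpy as np

def mean_error_by_overlap(overlap, error, edges):
    """Mean error per overlap bin; bins with no samples are returned as NaN."""
    overlap = np.asarray(overlap, dtype=float)
    error = np.asarray(error, dtype=float)
    n_bins = len(edges) - 1
    # np.digitize gives i with edges[i-1] <= x < edges[i]; shift to 0-based bins
    # and clamp so out-of-range values (including x == edges[-1]) fall in the
    # first or last bin instead of being dropped.
    idx = np.clip(np.digitize(overlap, edges) - 1, 0, n_bins - 1)
    means = np.full(n_bins, np.nan)
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            means[b] = error[mask].mean()
    return means

# Hypothetical data: five equal-width overlap bins on [0, 1].
edges = np.linspace(0.0, 1.0, 6)
overlap = [0.05, 0.15, 0.45, 0.55, 0.95]
error = [20.0, 10.0, 2.0, 1.0, 0.5]

means = mean_error_by_overlap(overlap, error, edges)
print(means)
```

Leaving empty bins as NaN keeps gaps visible when the curve is plotted on a log scale, rather than interpolating across overlap ranges that contain no evaluation pairs.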

Additional case study

**Qualitative examples on real data.** (Top) A zero-shot sim-to-real ZJH localization pair in which the two groups image opposite sides of the same wall, leaving little direct visual overlap; the insets show G2G's predicted poses tracking the ground-truth trajectory on both groups. (Bottom) A long-range cross-session NCLT rig pair recorded on different dates, with large signage changed and roadside trees removed between sessions; G2G still brings the two groups into one frame, with a residual inter-group rotation of 1.12° and a residual translation of 5.3 cm.
