G2G: Exploiting Intra-Group Geometry
for Inter-Group Pose Estimation

Yufei Wei* 1, Shuhao Ye* 1, Chenxiao Hu1, Yiyuan Pan2, Dongyu Feng2,
Rong Xiong1, Yue Wang1, Yanmei Jiao† 3
1State Key Laboratory of Industrial Control Technology, Zhejiang University
2Zhejiang Humanoid Robot Innovation Center, Ningbo
3Hangzhou Normal University

Framework. A frozen multi-view backbone (MapAnything, DINOv2 ViT-L/14) encodes each group's geometry. Three lightweight trainable modules (a perceiver resampler, a cross-group bridge with merged self-attention, and a multi-frame pose head) jointly refine the intra-group representations and regress the inter-group relative 6-DoF pose.

Abstract

Recovering the relative 6-DoF pose between two image groups underlies cross-sequence relocalization and multi-camera rig odometry. Each group carries known intra-group geometry from visual odometry or rig calibration, and pretrained multi-view backbones already fuse such geometry into visual features. Yet current models treat all views as an unstructured set, leaving cross-group reasoning as the missing piece. We cast group-to-group pose estimation as a single unified problem that fully exploits the geometry within each image group, and introduce G2G, which keeps the multi-view foundation model entirely frozen so that its rich 3D representations are preserved rather than collapsing under fine-tuning. Three lightweight trainable modules bridge the two groups: a perceiver resampler, a cross-group bridge with merged self-attention, and a multi-frame pose head, together adding about 32M parameters (under 6% of the full model) and supervised only by relative poses. Across four datasets that span indoor and outdoor simulation, real-world cross-session capture with appearance change, and zero-shot sim-to-real transfer, G2G attains state-of-the-art accuracy on both tasks, while every baseline is retrained with its full original supervision.

Task 1

Cross-Sequence Relocalization

Given two short sequences captured at different times, G2G recovers the relative pose that aligns them.
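As a minimal sketch of what "aligning" two sequences means, assume each group's intra-group poses are expressed relative to that group's own reference frame, and that the predicted inter-group pose maps group B's reference frame into group A's. Composing the two then places every frame of B in A's frame. The helper names (`make_pose`, `align_group`), the pose convention, and the numeric values are illustrative assumptions, not taken from the G2G code.

```python
import math

def matmul(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def make_pose(yaw_deg, tx, ty, tz):
    """Build a 4x4 rigid transform: rotation about z by yaw, plus a translation."""
    c = math.cos(math.radians(yaw_deg))
    s = math.sin(math.radians(yaw_deg))
    return [[c, -s, 0.0, tx],
            [s, c, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]

def align_group(T_a_b, poses_b):
    """Express every frame of group B in group A's reference frame.

    T_a_b   : inter-group pose (B's reference frame expressed in A's frame).
    poses_b : intra-group poses of B's frames relative to B's reference frame.
    """
    return [matmul(T_a_b, T_b_i) for T_b_i in poses_b]

# Hypothetical values: B's reference frame sits 2 m along A's x-axis, yawed by 90 degrees.
T_a_b = make_pose(90.0, 2.0, 0.0, 0.0)
# Group B: a reference frame and a second frame 1 m further along B's x-axis.
poses_b = [make_pose(0.0, 0.0, 0.0, 0.0), make_pose(0.0, 1.0, 0.0, 0.0)]
aligned = align_group(T_a_b, poses_b)
```

With these values, the second frame of B ends up at roughly (2, 1, 0) in A's frame: a single inter-group pose, combined with the known intra-group geometry, is enough to place a whole sequence.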

Table 1 Cross-sequence localization on four datasets (mean errors)


Task 2

Multi-Camera Rig Odometry

Given two rigid camera rigs at consecutive timestamps, G2G estimates the inter-rig motion. The demos render one camera for clarity, while inference and pose prediction use the full multi-camera rig.
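To illustrate how an inter-rig motion relates to the motion of any single camera on the rig, the sketch below conjugates the rig motion with a camera-to-rig extrinsic from calibration. The function names, the frame conventions (`T_rig0_rig1` expresses the rig at time 1 in the rig frame at time 0, and `T_rig_cam` maps camera coordinates to rig coordinates), and the numbers are assumptions for illustration only.

```python
import math

def matmul(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def invert_rigid(T):
    """Invert a 4x4 rigid transform [R t; 0 1] as [R^T, -R^T t; 0 1]."""
    R_t = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    new_t = [-sum(R_t[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [R_t[i] + [new_t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def camera_motion(T_rig0_rig1, T_rig_cam):
    """Relative pose of one camera between two timestamps.

    Computes inv(T_rig_cam) @ T_rig0_rig1 @ T_rig_cam, which is valid
    because the rig is rigid (the extrinsic does not change over time).
    """
    return matmul(matmul(invert_rigid(T_rig_cam), T_rig0_rig1), T_rig_cam)

# Hypothetical rig motion: a pure 90-degree yaw with no translation.
c, s = math.cos(math.pi / 2), math.sin(math.pi / 2)
T_rig0_rig1 = [[c, -s, 0.0, 0.0],
               [s, c, 0.0, 0.0],
               [0.0, 0.0, 1.0, 0.0],
               [0.0, 0.0, 0.0, 1.0]]
# Hypothetical extrinsic: camera mounted 0.5 m to the side of the rig origin.
T_rig_cam = [[1.0, 0.0, 0.0, 0.0],
             [0.0, 1.0, 0.0, 0.5],
             [0.0, 0.0, 1.0, 0.0],
             [0.0, 0.0, 0.0, 1.0]]
T_cam0_cam1 = camera_motion(T_rig0_rig1, T_rig_cam)
```

In this example the rig only rotates, yet the off-center camera also translates, by about (-0.5, -0.5, 0) in its own frame. This is one reason the rig's calibrated geometry carries information that a single camera's view does not.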

Table 2 Multi-camera rig odometry on six configurations (mean errors)


Explore in 3D

Interactive Reconstruction Viewer

Orbit real reconstructions placed by G2G. In Relocalization, merge two sequences and compare predicted vs. ground-truth alignment. In Rig Odometry, stitch consecutive multi-camera rigs frame by frame.

More Results

Ablation study
Ablation of G2G design choices on five settings (mean errors)


Cross-sequence localization: full metric suite
Per-dataset translation / rotation errors and accuracy thresholds


Rig odometry: full metric suite
Per-configuration translation / rotation errors, accuracy thresholds, and mAA@30°

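For readers unfamiliar with these metrics, the sketch below shows one common way to compute a rotation error, an accuracy threshold, and a mean average accuracy. The page does not define mAA@30°; the version here averages the accuracy over integer thresholds from 1° to 30°, which is a widespread convention but an assumption about this work. The function names and the sample errors are hypothetical.

```python
import math

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices (nested lists)."""
    # trace(R_pred^T @ R_gt) equals the sum of elementwise products.
    trace = sum(R_pred[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against values slightly outside [-1, 1] from rounding.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))

def accuracy_at(errors, threshold):
    """Fraction of errors strictly below the threshold."""
    return sum(e < threshold for e in errors) / len(errors)

def mean_average_accuracy(errors_deg, max_threshold_deg=30):
    """Average of accuracy_at over integer thresholds 1..max_threshold_deg."""
    thresholds = range(1, max_threshold_deg + 1)
    return sum(accuracy_at(errors_deg, t) for t in thresholds) / len(thresholds)

# Identity versus a 90-degree yaw.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
yaw_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
err = rotation_error_deg(identity, yaw_90)

# Hypothetical per-pair angular errors in degrees.
sample_errors = [0.5, 10.5, 40.0]
acc_5 = accuracy_at(sample_errors, 5.0)
maa_30 = mean_average_accuracy(sample_errors)
```

Unlike a mean error, a threshold-averaged score is bounded: a pair with a 40° error contributes zero regardless of how far past 30° it lies, so a few catastrophic failures do not dominate the score.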

Pose error vs. field-of-view overlap
Per-pair field-of-view overlap distribution for each dataset
Per-pair overlap distribution. Histograms of field-of-view overlap for the evaluation pairs of (a) HM3D (median 0.50), (b) TartanGround (median 0.55), (c) NCLT (median 0.24), and (d) ZJH (median 0.34); each panel's box reports n, min, max, mean, and median. Faint dashed lines mark the nine curriculum thresholds (0.50 down to 0.10). NCLT has the lowest overlap, which makes it the hardest cross-sequence setting.
Mean rotation error binned by field-of-view overlap for each dataset
Mean rotation error vs. field-of-view overlap. Panels (a)–(d) cover HM3D, TartanGround, NCLT, and ZJH. In each, every method's mean rotation error (log scale) is binned by the field-of-view overlap of the selected top-1 window, i.e. the geometric overlap the model actually sees. On ZJH, solid and dashed curves denote the simulated and real splits. The MA-AB oracle (gray) is a reference ceiling. G2G (red) degrades most gradually as overlap drops and stays closest to the oracle, while baselines grow by up to an order of magnitude in the low-overlap regime.
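The binning behind such a curve can be sketched as follows: each evaluation pair contributes its error to the bin containing its field-of-view overlap, and each bin reports the mean. The bin edges, function name, and sample values below are hypothetical; the page does not state the bin edges used in the figure.

```python
def mean_error_by_overlap(overlaps, errors, edges):
    """Mean error per overlap bin.

    Bins are half-open [edges[b], edges[b + 1]), except the last bin,
    which also includes its right edge. Empty bins yield None.
    """
    n_bins = len(edges) - 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for ov, err in zip(overlaps, errors):
        for b in range(n_bins):
            is_last = b == n_bins - 1
            if edges[b] <= ov < edges[b + 1] or (is_last and ov == edges[-1]):
                sums[b] += err
                counts[b] += 1
                break
    return [s / c if c else None for s, c in zip(sums, counts)]

# Hypothetical per-pair overlaps and rotation errors (degrees).
overlaps = [0.05, 0.20, 0.30, 0.60, 0.90]
errors = [12.0, 8.0, 3.0, 1.0, 0.5]
edges = [0.0, 0.25, 0.5, 0.75, 1.0]
binned = mean_error_by_overlap(overlaps, errors, edges)
```

Plotting `binned` on a log-scaled axis, as the figure does, makes order-of-magnitude differences between methods in the low-overlap bins easy to read.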
Mean translation error binned by field-of-view overlap for each dataset
Mean translation error vs. field-of-view overlap. Panels (a)–(d) cover HM3D, TartanGround, NCLT, and ZJH. In each, every method's mean translation error in meters (log scale) is binned by the field-of-view overlap of the selected top-1 window, i.e. the geometric overlap the model actually sees. On ZJH, solid and dashed curves denote the simulated and real splits. Translation follows the main-table scale convention: VGGT is up-to-scale (raw L2), Reloc3R is Procrustes-aligned, and the remaining metric methods (G2G, LoMa, CoViS-Net, MA) use bare L2, so absolute values are not comparable across methods. The MA-AB oracle (gray) is a reference ceiling; G2G (red) stays closest to it and degrades most gradually as overlap drops, while baselines grow sharply in the low-overlap regime.
Additional case study
Predicted vs. ground-truth poses on the reconstructed scene
Qualitative examples on real data. (Top) A zero-shot sim-to-real ZJH localization pair in which the two groups image opposite sides of the same wall, leaving little direct visual overlap; the insets show G2G's predicted poses tracking the ground-truth trajectory on both groups. (Bottom) A long-range cross-session NCLT rig pair recorded on different dates, with large signage changed and roadside trees removed between sessions; G2G still brings the two groups into one frame, with a residual inter-group rotation of 1.12° and a residual translation of 5.3 cm.

BibTeX

@misc{wei2026g2gexploitingintragroupgeometry,
      title={G2G: Exploiting Intra-Group Geometry for Inter-Group Pose Estimation},
      author={Yufei Wei and Shuhao Ye and Chenxiao Hu and Yiyuan Pan and Dongyu Feng and Rong Xiong and Yue Wang and Yanmei Jiao},
      year={2026},
      eprint={2606.08284},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.08284},
}