SurgCUT3R: Surgical Scene-Aware Continuous Understanding of Temporal 3D Representation

Kaiyuan Xu1, Fangzhou Hong2, Daniel Elson1, Baoru Huang1,3
1The Hamlyn Centre for Robotic Surgery, Imperial College London, UK
2S-Lab, College of Computing and Data Science, Nanyang Technological University, Singapore
3Department of Computer Science, University of Liverpool, UK
ICRA 2026

Summary

SurgCUT3R adapts a state-of-the-art unified online reconstruction model to monocular surgical endoscopic video, addressing (i) the lack of supervised training data and (ii) accumulated pose drift on long sequences.

  • Metric-scale pseudo-GT depth from public stereo surgical datasets (SCARED, StereoMIS).
  • Hybrid supervision combining pseudo-GT with geometric self-correction to resist label noise.
  • Hierarchical long-sequence inference using global stability + local accuracy models to suppress drift.
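The metric-scale pseudo-GT depth in the first bullet comes from stereo pairs, where depth follows from disparity via the standard relation Z = f·B/d. A minimal sketch of that conversion (the function name, validity threshold, and NaN masking are illustrative, not the paper's exact pipeline):

```python
import numpy as np

def disparity_to_metric_depth(disparity, focal_px, baseline_m, min_disp=0.5):
    """Convert a stereo disparity map (pixels) to metric depth via Z = f * B / d.

    focal_px   -- rectified focal length in pixels
    baseline_m -- stereo baseline in meters
    Pixels with near-zero disparity are unreliable and marked invalid (NaN).
    """
    depth = np.full(disparity.shape, np.nan, dtype=np.float64)
    valid = disparity > min_disp
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth
```

Because f and B are known from stereo calibration, the resulting maps carry absolute metric scale, which is what lets them serve as pseudo-ground-truth for a monocular model.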

4D visualization

Depth maps and pointmaps.

Abstract

Reconstructing surgical scenes from monocular endoscopic video is critical for advancing robotic-assisted surgery. However, the application of state-of-the-art general-purpose reconstruction models is constrained by two key challenges: the lack of supervised training data and performance degradation over long video sequences. To overcome these limitations, we propose SurgCUT3R, a systematic framework that adapts unified 3D reconstruction models to the surgical domain. Our contributions are threefold. First, we develop a data generation pipeline that exploits public stereo surgical datasets to produce large-scale, metric-scale pseudo-ground-truth depth maps, effectively bridging the data gap. Second, we propose a hybrid supervision strategy that couples our pseudo-ground-truth with geometric self-correction to enhance robustness against inherent data imperfections. Third, we introduce a hierarchical inference framework that employs two specialized models to effectively mitigate accumulated pose drift over long surgical videos: one for global stability and one for local accuracy. Experiments on the SCARED and StereoMIS datasets demonstrate that our method achieves a competitive balance between accuracy and efficiency, delivering near state-of-the-art but substantially faster pose estimation and offering a practical and effective solution for robust reconstruction in surgical environments.
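The hybrid supervision described above can be pictured as a weighted combination of a pseudo-GT term and a geometric self-consistency term. The sketch below is only an illustration of that idea; the function name, the simple L1 terms, and the weighting scheme are assumptions, not the paper's exact formulation:

```python
import numpy as np

def hybrid_depth_loss(pred, pseudo_gt, geo_depth, valid, alpha=0.7):
    """Illustrative hybrid loss: an L1 term against stereo pseudo-GT plus an
    L1 self-consistency term against a geometrically derived depth estimate.

    valid -- boolean mask excluding pixels where pseudo-GT is unreliable,
             which is how noisy labels are kept from dominating the loss.
    """
    l_pgt = np.abs(pred - pseudo_gt)[valid].mean()   # supervised term
    l_geo = np.abs(pred - geo_depth)[valid].mean()   # self-correction term
    return alpha * l_pgt + (1.0 - alpha) * l_geo
```

The point of the second term is robustness: where the pseudo-labels are imperfect, a geometry-derived signal pulls the prediction back toward a consistent solution.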

Method

Overview of SurgCUT3R pipeline and hierarchical inference
Overview of SurgCUT3R. Left: The unified reconstruction pipeline. Streaming video frames are encoded by a ViT encoder and interact with a persistent state, which is continuously updated to sequentially output the pointmap and camera parameters for each frame. Right: Our hierarchical framework for long-sequence inference. The pink lines represent camera trajectories. A sparse but globally stable trajectory from a global model (Mglobal) provides anchor points to correct and stitch the dense but locally drifting trajectories from a local model (Mlocal), producing a final, drift-corrected trajectory.
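The anchor-based stitching in the hierarchical framework amounts to aligning each dense local trajectory segment to the sparse, globally stable anchor poses. One standard way to do this (an assumption here, not necessarily the paper's exact solver) is a least-squares Sim(3) fit via the Umeyama method:

```python
import numpy as np

def umeyama_sim3(src, dst):
    """Least-squares similarity transform (s, R, t) with dst ~= s * R @ src + t.
    src, dst are (N, 3) arrays of corresponding 3D points (Umeyama, 1991)."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                 # cross-covariance
    U, S, Vt = np.linalg.svd(cov)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # avoid reflections
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / ((xs ** 2).sum() / len(src))
    t = mu_d - s * R @ mu_s
    return s, R, t

def correct_segment(local_traj, anchor_idx, global_anchors):
    """Align a dense (but drifting) local trajectory to sparse global anchor
    positions, then apply the fitted Sim(3) to every local camera position."""
    s, R, t = umeyama_sim3(local_traj[anchor_idx], global_anchors)
    return (s * (R @ local_traj.T)).T + t
```

Fitting the transform on anchor correspondences only, then applying it to the whole segment, is what lets a handful of globally consistent poses remove the accumulated drift of the dense local estimate.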

Results

Quantitative evaluation table
Quantitative results. Quantitative evaluation of our SurgCUT3R method against existing methods in endoscopic scene reconstruction. The best and second-best results are shown in bold and underlined, respectively.
Qualitative results of 3D reconstruction
Qualitative results of 3D reconstruction. With videos (small images) as input, this figure shows the reconstruction from the first frame (large images, left) and the accumulated 3D model from multiple frames (large images, right). The alignment between the single-frame and multi-frame reconstructions highlights the geometric consistency of our method.
Qualitative results of monocular depth estimation
Qualitative results of monocular depth estimation. We compare our method with MonST3R [24], Spann3R [25], AF-SfMLearner [29], EndoDAC [30], and MegaSaM [28] on the SCARED [34] and StereoMIS [35] datasets. Our method achieves the best qualitative results among feed-forward methods.
Qualitative comparison of camera trajectories (ablation)
Qualitative comparison of camera trajectories. Left: Without the hierarchical inference framework. Right: With our hierarchical inference framework (Ours).

BibTeX

@inproceedings{xu2026surgcut3r,
  title     = {SurgCUT3R: Surgical Scene-Aware Continuous Understanding of Temporal 3D Representation},
  author    = {Xu, Kaiyuan and Hong, Fangzhou and Elson, Daniel and Huang, Baoru},
  booktitle = {Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)},
  year      = {2026},
  url       = {https://YOUR_DOMAIN.com/SurgCUT3R}
}