SurgCUT3R: Surgical Scene-Aware Continuous Understanding of Temporal 3D Representation

Kaiyuan Xu1, Fangzhou Hong2, Daniel Elson1, Baoru Huang1,3
1The Hamlyn Centre for Robotic Surgery, Imperial College London, UK
2S-Lab, College of Computing and Data Science, Nanyang Technological University, Singapore
3Department of Computer Science, University of Liverpool, UK
ICRA 2026

Summary

SurgCUT3R adapts a state-of-the-art unified online reconstruction model to monocular surgical endoscopic video, addressing (i) the lack of supervised training data and (ii) accumulated pose drift on long sequences.

  • Metric-scale pseudo-GT depth from public stereo surgical datasets (SCARED, StereoMIS).
  • Hybrid supervision combining pseudo-GT with geometric self-correction to resist label noise.
  • Hierarchical long-sequence inference using global stability + local accuracy models to suppress drift.
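The metric-scale pseudo-GT depth in the first bullet comes from stereo pairs, where depth follows from disparity via the standard relation Z = f·B/d. A minimal sketch of that conversion (the function name, validity threshold, and NaN masking are illustrative, not the paper's exact pipeline):

```python
import numpy as np

def disparity_to_metric_depth(disparity, focal_px, baseline_m, min_disp=0.5):
    """Convert a stereo disparity map (pixels) to metric depth via Z = f * B / d.

    focal_px   -- rectified focal length in pixels
    baseline_m -- stereo baseline in meters
    Pixels with near-zero disparity are unreliable and marked invalid (NaN).
    """
    depth = np.full(disparity.shape, np.nan, dtype=np.float64)
    valid = disparity > min_disp
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth
```

Because f and B are known from stereo calibration, the resulting maps carry absolute metric scale, which is what lets them serve as pseudo-ground-truth for a monocular model.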

4D visualization

Depth maps and pointmaps.

Abstract

Reconstructing surgical scenes from monocular endoscopic video is critical for advancing robotic-assisted surgery. However, the application of state-of-the-art general-purpose reconstruction models is constrained by two key challenges: the lack of supervised training data and performance degradation over long video sequences. To overcome these limitations, we propose SurgCUT3R, a systematic framework that adapts unified 3D reconstruction models to the surgical domain. Our contributions are threefold. First, we develop a data generation pipeline that exploits public stereo surgical datasets to produce large-scale, metric-scale pseudo-ground-truth depth maps, effectively bridging the data gap. Second, we propose a hybrid supervision strategy that couples our pseudo-ground-truth with geometric self-correction to enhance robustness against inherent data imperfections. Third, we introduce a hierarchical inference framework that employs two specialized models to effectively mitigate accumulated pose drift over long surgical videos: one for global stability and one for local accuracy. Experiments on the SCARED and StereoMIS datasets demonstrate that our method achieves a competitive balance between accuracy and efficiency, delivering near state-of-the-art but substantially faster pose estimation and offering a practical and effective solution for robust reconstruction in surgical environments.
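The hybrid supervision described above can be pictured as a weighted combination of a pseudo-GT term and a geometric self-consistency term. The sketch below is only an illustration of that idea; the function name, the simple L1 terms, and the weighting scheme are assumptions, not the paper's exact formulation:

```python
import numpy as np

def hybrid_depth_loss(pred, pseudo_gt, geo_depth, valid, alpha=0.7):
    """Illustrative hybrid loss: an L1 term against stereo pseudo-GT plus an
    L1 self-consistency term against a geometrically derived depth estimate.

    valid -- boolean mask excluding pixels where pseudo-GT is unreliable,
             which is how noisy labels are kept from dominating the loss.
    """
    l_pgt = np.abs(pred - pseudo_gt)[valid].mean()   # supervised term
    l_geo = np.abs(pred - geo_depth)[valid].mean()   # self-correction term
    return alpha * l_pgt + (1.0 - alpha) * l_geo
```

The point of the second term is robustness: where the pseudo-labels are imperfect, a geometry-derived signal pulls the prediction back toward a consistent solution.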

Method

Overview of SurgCUT3R pipeline and hierarchical inference
Overview of SurgCUT3R. Left: The unified reconstruction pipeline. Streaming video frames are encoded by a ViT encoder and interact with a persistent state, which is continuously updated to sequentially output the pointmap and camera parameters for each frame. Right: Our hierarchical framework for long-sequence inference. The pink lines represent camera trajectories. A sparse but globally stable trajectory from a global model (Mglobal) provides anchor points to correct and stitch the dense but locally drifting trajectories from a local model (Mlocal), producing a final, drift-corrected trajectory.
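The anchor-based stitching in the hierarchical framework amounts to aligning each dense local trajectory segment to the sparse, globally stable anchor poses. One standard way to do this (an assumption here, not necessarily the paper's exact solver) is a least-squares Sim(3) fit via the Umeyama method:

```python
import numpy as np

def umeyama_sim3(src, dst):
    """Least-squares similarity transform (s, R, t) with dst ~= s * R @ src + t.
    src, dst are (N, 3) arrays of corresponding 3D points (Umeyama, 1991)."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                 # cross-covariance
    U, S, Vt = np.linalg.svd(cov)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # avoid reflections
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / ((xs ** 2).sum() / len(src))
    t = mu_d - s * R @ mu_s
    return s, R, t

def correct_segment(local_traj, anchor_idx, global_anchors):
    """Align a dense (but drifting) local trajectory to sparse global anchor
    positions, then apply the fitted Sim(3) to every local camera position."""
    s, R, t = umeyama_sim3(local_traj[anchor_idx], global_anchors)
    return (s * (R @ local_traj.T)).T + t
```

Fitting the transform on anchor correspondences only, then applying it to the whole segment, is what lets a handful of globally consistent poses remove the accumulated drift of the dense local estimate.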

Results

Quantitative evaluation table
Quantitative results. Quantitative evaluation of our SurgCUT3R method against existing methods in endoscopic scene reconstruction. The best and second-best results are shown in bold and underlined, respectively.
Qualitative results of 3D reconstruction
Qualitative results of 3D reconstruction. With videos (small images) as input, this figure shows the reconstruction from the first frame (large images, left) and the accumulated 3D model from multiple frames (large images, right). The alignment between the single-frame and multi-frame reconstructions highlights the geometric consistency of our method.
Qualitative results of monocular depth estimation
Qualitative results of monocular depth estimation. We compare our method with MonST3R [24], Spann3R [25], AF-SfMLearner [29], EndoDAC [30], and MegaSaM [28] on the SCARED [34] and StereoMIS [35] datasets. Our method achieves the best qualitative results among feed-forward methods.
Qualitative comparison of camera trajectories (ablation)
Qualitative comparison of camera trajectories. Left: Without the hierarchical inference framework. Right: With our hierarchical inference framework (Ours).

BibTeX

@inproceedings{xu2026surgcut3r,
  title     = {SurgCUT3R: Surgical Scene-Aware Continuous Understanding of Temporal 3D Representation},
  author    = {Xu, Kaiyuan and Hong, Fangzhou and Elson, Daniel and Huang, Baoru},
  booktitle = {Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)},
  year      = {2026},
  url       = {https://YOUR_DOMAIN.com/SurgCUT3R}
}