Fischer, V., Magdaleno, A., Calek, A., Cavalcanti, N., Hoffman, N., Germann, C., Wüthrich, J., Krähenmann, M., Farshad, M., Fürnstahl, P. & Calvet, L.
arXiv (2026)
The study presents a robust multi-view pipeline for 3D hand pose estimation in surgical environments, designed to operate without domain-specific fine-tuning and relying solely on off-the-shelf pretrained models. By combining person detection, whole-body pose estimation, 2D hand keypoint prediction, and constrained 3D optimization, the approach addresses challenges such as occlusions, intense lighting, and uniform glove appearance. In addition, a large-scale surgical benchmark dataset with over 68,000 frames and 3,000 annotated hand poses is introduced. Quantitative evaluation shows substantial improvements over baselines, establishing a strong foundation for future research in surgical computer vision.
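To illustrate the step from 2D hand keypoints to 3D in a multi-view setup, here is a minimal sketch of linear (DLT) triangulation of a single keypoint from calibrated views. This is not the authors' method: the summary describes a constrained 3D optimization, whereas this is only the unconstrained linear estimate that such an optimization might start from. The function names, camera intrinsics, and test point are all hypothetical.

```python
import numpy as np


def triangulate_point(projections, points_2d):
    """Linear (DLT) triangulation of one keypoint seen in several views.

    projections: list of 3x4 camera projection matrices (assumed calibrated).
    points_2d:   list of (u, v) pixel observations, one per view.
    """
    rows = []
    for P, (u, v) in zip(projections, points_2d):
        # Each view contributes two linear constraints on the homogeneous point.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]


def project(P, X):
    """Project a 3D point into a view with projection matrix P."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]


# Two hypothetical cameras with shared intrinsics and a 0.5 m baseline.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])

# Hypothetical fingertip position, observed without noise in both views.
X_true = np.array([0.1, -0.05, 2.0])
observations = [project(P1, X_true), project(P2, X_true)]

X_est = triangulate_point([P1, P2], observations)
print(X_est)
```

With noisy or partly occluded detections, the linear estimate degrades, which is presumably where constraints in the 3D optimization help.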