Abstract: RGB-D salient object detection aims to accurately locate salient regions by jointly exploiting RGB appearance information and depth structural cues. To address the shortcomings of existing methods in modeling cross-modal complementarity and multi-level feature fusion, this paper proposes a cross-modal selection and perceptual refinement network. Specifically, a cross-modal selective fusion strategy adaptively models the differences and complementarities between RGB and depth features at the channel, spatial, and frequency levels, while a modulated fusion mechanism enhances the selectivity and stability of high-level semantic fusion. Furthermore, a perceptual refinement scheme is incorporated into the decoding stage to progressively integrate high-level semantics with low-level structural details, thereby improving the structural integrity and boundary quality of the predicted salient objects. Experiments on six widely used RGB-D salient object detection benchmarks demonstrate that the proposed method consistently outperforms existing approaches in detection accuracy, boundary preservation, and robustness in complex scenes.
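The channel- and spatial-level selection described above can be illustrated, in highly simplified form, as modality-wise softmax gating between RGB and depth feature maps. This is a minimal NumPy sketch under assumptions of my own (the function name, the gating statistics, and the equal weighting of the two branches are hypothetical; the paper's learned parameters and frequency-level branch are omitted), not the network's actual implementation:

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_selective_fusion(rgb, depth):
    """Toy selective fusion of RGB and depth features of shape (C, H, W).

    Channel gate: per-channel global statistics decide, channel by channel,
    how much to trust each modality. Spatial gate: per-pixel statistics
    decide, location by location, which modality dominates. Both gates are
    convex combinations, so the fused feature stays within the range
    spanned by the two inputs.
    """
    # Channel-level selection: softmax over the two modalities per channel.
    rgb_ch = rgb.mean(axis=(1, 2))                                  # (C,)
    dep_ch = depth.mean(axis=(1, 2))                                # (C,)
    ch_gate = softmax(np.stack([rgb_ch, dep_ch]), axis=0)           # (2, C)
    fused_ch = (ch_gate[0][:, None, None] * rgb
                + ch_gate[1][:, None, None] * depth)

    # Spatial-level selection: softmax over the two modalities per pixel.
    rgb_sp = rgb.mean(axis=0)                                       # (H, W)
    dep_sp = depth.mean(axis=0)                                     # (H, W)
    sp_gate = softmax(np.stack([rgb_sp, dep_sp]), axis=0)           # (2, H, W)
    fused_sp = sp_gate[0] * rgb + sp_gate[1] * depth

    # Equal-weight merge of the two branches (an assumption for brevity;
    # a real network would learn this combination).
    return 0.5 * (fused_ch + fused_sp)

rgb = np.random.rand(8, 16, 16)
depth = np.random.rand(8, 16, 16)
fused = cross_modal_selective_fusion(rgb, depth)
```

Because each gate is a softmax over the two modalities, the fusion is adaptive per channel and per location rather than a fixed average, which is the intuition behind "selective" fusion.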