Abstract: Self-supervised monocular depth estimation aims to predict pixel-wise dense depth maps from a single RGB image without ground-truth supervision. To address the limitations of existing methods in scene-structure perception and local-detail handling, this paper proposes a new method based on spatial-frequency feature refinement and aggregation, comprising a spatial frequency feature refinement module and a dual-stream dynamic aggregation module. Specifically, the spatial frequency feature refinement module extracts and processes multi-scale fine-grained local features through a spatial refinement unit, while its frequency refinement unit applies the discrete cosine transform together with a multi-angle channel attention mechanism to enhance scene-structure perception and suppress noise and redundant information. The dual-stream dynamic aggregation module then adaptively fuses global and local depth cues through a dual-stream convolutional attention mechanism. Extensive experiments show that the proposed method significantly outperforms existing state-of-the-art models on mainstream datasets, achieves a favorable balance between accuracy and parameter count, and exhibits strong cross-dataset generalization.
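The abstract's frequency refinement unit combines a discrete cosine transform with channel attention. As a rough illustration of that idea (not the paper's actual module — the function name, the low-frequency block size `k`, and the softmax weighting are all assumptions for this sketch), one could weight feature channels by the energy of their low-frequency DCT coefficients:

```python
import numpy as np
from scipy.fft import dctn


def frequency_channel_attention(feat: np.ndarray, k: int = 4) -> np.ndarray:
    """Hypothetical sketch of DCT-based channel attention.

    feat: feature map of shape (C, H, W).
    k:    size of the low-frequency DCT block used to score each channel.
    """
    c_channels = feat.shape[0]
    energies = np.empty(c_channels)
    for c in range(c_channels):
        # 2D DCT per channel; low frequencies land in the top-left corner
        coeffs = dctn(feat[c], norm="ortho")
        energies[c] = np.abs(coeffs[:k, :k]).sum()
    # Softmax over channels -> attention weights emphasizing
    # channels with strong low-frequency (structural) content
    w = np.exp(energies - energies.max())
    w /= w.sum()
    return feat * w[:, None, None]


feat = np.random.rand(8, 16, 16)
out = frequency_channel_attention(feat)
print(out.shape)  # (8, 16, 16)
```

The intuition matches the abstract's claim: low-frequency DCT coefficients capture coarse scene structure, so reweighting channels by that energy emphasizes structural cues while downweighting noisy, high-frequency-dominated channels.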