Abstract: Widely used deep learning models struggle to comprehensively capture both the detailed features and the semantic information of buildings when extracting features from optical remote sensing images. To address this, this paper proposes an optical remote sensing image building detection method based on cross-scale and fine-grained encoders. The method designs a cross-scale encoder based on a deep Swin Transformer and a fine-grained encoder based on ResNeXt to extract, respectively, global contextual information and local detailed information from optical remote sensing images, enabling high-precision building detection. This paper also designs a Branch Fusion Module that introduces a position-sensitive channel attention mechanism to strengthen the model's ability to capture inter-channel relationships, achieving precise detection of target buildings. Extensive experiments were conducted on the WHU, INRIA, and Massachusetts datasets, comparing the proposed model with various state-of-the-art models. The experimental results validate the superior performance of the proposed method for building detection in optical remote sensing images.
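To make the dual-branch design concrete, the sketch below illustrates the general pattern of fusing a global-context branch and a local-detail branch with a channel attention gate. This is a minimal illustration, not the authors' implementation: the tensor shapes, the concatenation-based fusion, and the plain squeeze-and-excitation-style gate (standing in for the paper's position-sensitive channel attention) are all assumptions.

```python
import numpy as np

def channel_attention(x):
    """Gate channels by their global statistics (SE-style stand-in for the
    paper's position-sensitive channel attention; an assumption)."""
    # x: (C, H, W) feature map
    pooled = x.mean(axis=(1, 2))             # global average pool -> (C,)
    weights = 1.0 / (1.0 + np.exp(-pooled))  # sigmoid gate per channel
    return x * weights[:, None, None]        # reweight each channel map

def fuse_branches(global_feat, local_feat):
    """Concatenate the two encoder outputs along channels, then gate."""
    fused = np.concatenate([global_feat, local_feat], axis=0)  # (2C, H, W)
    return channel_attention(fused)

# Stand-ins for the cross-scale (Swin) and fine-grained (ResNeXt) outputs.
g = np.random.rand(8, 16, 16)
l = np.random.rand(8, 16, 16)
out = fuse_branches(g, l)
print(out.shape)  # -> (16, 16, 16)
```

In a real network the gate would be learned (e.g. a small MLP over the pooled vector) and the fusion would feed a decoder; here the point is only the two-branch concatenate-then-reweight structure.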