Abstract: Feature extraction for dermoscopic image classification suffers from information loss, blurred lesion boundaries, and insufficient use of multi-scale features. To address these issues, an improved FastViT (Fast Hybrid Vision Transformer) network named EFFViT (Enhanced Positional Information and Feature Fusion Vision Transformer) is proposed to improve skin lesion classification accuracy. First, a dual-channel gated coordinate attention feature token mixer is designed to strengthen the representation of local spatial positional information, improving lesion localization and detail extraction. Second, a feature enhancement module is constructed to strengthen the model's ability to capture fine details. Finally, a multi-scale feature fusion module integrates information from different scales, improving the perception of both global and local features. On the ISIC 2018 dataset, EFFViT achieves accuracy, precision, recall, and F1-score of 94.7%, 92.5%, 93.3%, and 92.0%, respectively; on ISIC 2019, the corresponding values are 93.8%, 90.6%, 90.2%, and 90.3%. Compared with current state-of-the-art algorithms, EFFViT demonstrates superior performance in skin lesion image classification.
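
For readers unfamiliar with how the four reported metrics relate to one another in a multi-class setting, the following is a minimal sketch of computing accuracy and macro-averaged precision, recall, and F1-score from a confusion matrix. The abstract does not state which averaging scheme was used, so macro averaging is an assumption here; the function name `macro_metrics` and the toy confusion matrix are hypothetical and are not taken from the paper or its datasets.

```python
from typing import Dict, List


def macro_metrics(confusion: List[List[int]]) -> Dict[str, float]:
    """Compute accuracy and macro-averaged precision, recall, and F1.

    confusion[i][j] counts samples whose true class is i and whose
    predicted class is j. Macro averaging is an assumption of this
    sketch, not something stated in the abstract.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(n))

    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))  # column sum
        actual_k = sum(confusion[k])                          # row sum
        p = tp / predicted_k if predicted_k else 0.0
        r = tp / actual_k if actual_k else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    return {
        "accuracy": correct / total if total else 0.0,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,  # mean of per-class F1, not F1 of the means
    }


# Hypothetical 3-class confusion matrix, for illustration only.
toy_confusion = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 9],
]
metrics = macro_metrics(toy_confusion)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because the macro F1-score is the mean of per-class F1 values rather than the harmonic mean of the averaged precision and recall, it can fall below both of them, as it does in this toy example. That is consistent with the ISIC 2018 figures above, where the F1-score (92.0%) is lower than both precision (92.5%) and recall (93.3%), although the averaging scheme actually used is not specified in the abstract.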