Abstract: To address the mismatch between generated captions and image content caused by insufficient visual information in image captioning, an image captioning method based on ECA-Net is proposed. First, image segmentation features are used as an additional source of visual information: an iterative independent layer normalization module fuses the segmentation features with grid features, and image features are extracted by a dual-information-flow network. Second, an efficient channel attention (ECA) module is added to the encoder to learn correlations among image features through cross-channel interaction, so that predictions focus more closely on the visual content. Finally, the decoder predicts the next word from the provided visual information and the partially generated caption, producing accurate captions. Experimental results on the MSCOCO dataset show that the proposed method strengthens the dependencies among the visual features of an image, yielding captions with higher relevance and more accurate grammar.
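The core of the ECA module referenced above is channel attention without dimensionality reduction: a global average pool produces one descriptor per channel, a shared 1D convolution over the channel axis captures local cross-channel interaction, and a sigmoid yields per-channel weights that rescale the feature map. The following is a minimal NumPy sketch of that mechanism, not the paper's implementation; the fixed averaging kernel stands in for learned convolution weights, and in ECA-Net the kernel size is chosen adaptively from the channel count.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def eca_attention(features, kernel_size=3):
    """Sketch of Efficient Channel Attention over a (C, H, W) feature map.

    1. Global average pooling gives one descriptor per channel.
    2. A 1D convolution of width `kernel_size` over the channel axis
       models local cross-channel interaction (no channel reduction).
    3. A sigmoid turns the result into per-channel weights that
       rescale the input features.
    """
    C, H, W = features.shape
    desc = features.mean(axis=(1, 2))          # (C,) channel descriptors
    pad = kernel_size // 2
    padded = np.pad(desc, pad, mode="edge")    # pad channel axis at the ends
    # Shared 1D conv weights; a plain average here, learned in practice.
    w = np.ones(kernel_size) / kernel_size
    conv = np.array([np.dot(padded[i:i + kernel_size], w) for i in range(C)])
    weights = sigmoid(conv)                    # (C,) attention weights
    return features * weights[:, None, None]   # rescale each channel
```

Because the attention path touches only the pooled channel descriptors, its cost is linear in the number of channels, which is what makes the module cheap enough to drop into an encoder.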