Abstract: To address the problem that existing common-subspace-based cross-modal retrieval methods struggle to fully exploit local consistency within each modality, a cross-modal image-text retrieval method that integrates multi-head attention into graph convolution is proposed. To improve intra-modality local consistency, each sample within a modality is treated as a node to construct a modality graph, and graph convolutional encoding is used to mine the interaction information among sample features within the modality. To distinguish the influence of different neighbor nodes on the central node, an attention mechanism is incorporated into the graph convolution so that the weight coefficient of each neighbor node is learned adaptively. To learn multiple groups of correlated features between nodes, a multi-head attention layer with weight parameters is constructed to update the central node information. To learn common representations with high local consistency and semantic consistency, the weights of the common representation learning layers are shared and optimized under semantic constraints and modality-invariant constraints. Experimental results show that on the Wikipedia and Pascal Sentence cross-modal datasets, the average mAP of the proposed method across different retrieval tasks is 2.6%~42.5% and 3.3%~54.3% higher, respectively, than that of 8 existing methods.
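
The sketch below is not the authors' implementation; it is a minimal illustration of the kind of multi-head graph attention update the abstract describes, in which each sample is a node of the modality graph and the central node's feature is refreshed from adaptively weighted neighbors. The class name GraphMultiHeadAttention, the feature sizes, and the toy adjacency matrix are assumptions introduced for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphMultiHeadAttention(nn.Module):
    """GAT-style multi-head attention over a modality graph (illustrative sketch)."""

    def __init__(self, in_dim, out_dim, num_heads=4):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = out_dim // num_heads
        # Shared linear projection of node features, split across heads.
        self.proj = nn.Linear(in_dim, out_dim, bias=False)
        # Per-head attention vectors scoring (center, neighbor) pairs.
        self.attn_src = nn.Parameter(torch.empty(num_heads, self.head_dim))
        self.attn_dst = nn.Parameter(torch.empty(num_heads, self.head_dim))
        nn.init.xavier_uniform_(self.attn_src)
        nn.init.xavier_uniform_(self.attn_dst)

    def forward(self, x, adj):
        # x: (N, in_dim) node features; adj: (N, N) 0/1 adjacency of the modality graph.
        n = x.size(0)
        h = self.proj(x).view(n, self.num_heads, self.head_dim)       # (N, H, D)
        # Attention logit e_ij = LeakyReLU(a_src . h_i + a_dst . h_j) per head.
        src = (h * self.attn_src).sum(-1)                              # (N, H)
        dst = (h * self.attn_dst).sum(-1)                              # (N, H)
        e = F.leaky_relu(src.unsqueeze(1) + dst.unsqueeze(0), 0.2)     # (N, N, H)
        # Keep only graph neighbors, then normalize over each node's neighborhood
        # so every neighbor receives an adaptively learned weight coefficient.
        e = e.masked_fill(adj.unsqueeze(-1) == 0, float("-inf"))
        alpha = torch.softmax(e, dim=1)                                # (N, N, H)
        # Update central nodes as attention-weighted sums of neighbor features.
        out = torch.einsum("ijh,jhd->ihd", alpha, h)                   # (N, H, D)
        return out.reshape(n, -1)                                      # concatenate heads


if __name__ == "__main__":
    x = torch.randn(5, 16)                               # 5 sample nodes, 16-d features
    adj = (torch.eye(5) + torch.diag(torch.ones(4), 1)   # chain graph with self-loops
           + torch.diag(torch.ones(4), -1))
    layer = GraphMultiHeadAttention(16, 32, num_heads=4)
    print(layer(x, adj).shape)                           # torch.Size([5, 32])
```

In a full pipeline of the kind the abstract outlines, one such layer would be applied per modality (image and text), and the resulting node representations would then pass through weight-shared common representation layers trained with semantic and modality-invariant losses.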