Cross-modal image and text retrieval based on graph convolution and multi-head attention

DOI:

Author:

Affiliation:

1. School of Mechanical Engineering, Jiangnan University; 2. School of Internet of Things Engineering, Jiangnan University

Author biography:

Corresponding author:

CLC number:

TP391

Fund project:

National Natural Science Foundation of China

Abstract:

To address the problem that existing common-subspace-based cross-modal retrieval methods struggle to fully exploit local consistency within each modality, a cross-modal image-text retrieval method that integrates a multi-head attention mechanism into graph convolution is proposed. To improve intra-modality local consistency, each sample within a modality is taken as a node to construct a modality graph, and graph convolutional encoding is used to mine the interaction information among sample features within the modality. To distinguish the influence of different neighbor nodes on the central node, an attention mechanism is introduced into the graph convolution to adaptively learn the weight coefficient of each neighbor node. To learn multiple groups of correlated features between nodes, a multi-head attention layer with weight parameters is constructed to update the central node information. To learn common representations that are both highly locally consistent and semantically consistent, the weights of the common representation learning layer are shared across modalities and optimized with a semantic constraint and a modality-invariance constraint. Experimental results show that on the Wikipedia and Pascal Sentence cross-modal datasets, the average mAP of the proposed method across different retrieval tasks is 2.6%~42.5% and 3.3%~54.3% higher, respectively, than those of eight existing methods.
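The abstract outlines four components: a modality graph over per-sample nodes, attention-weighted graph convolution, a multi-head attention layer, and a shared common representation layer trained with semantic and modality-invariance constraints. Below is a minimal PyTorch sketch of how these pieces could fit together. It is not the authors' implementation: the kNN cosine-similarity graph construction, all layer sizes, the number of heads, and the MSE form of the modality-invariance constraint are illustrative assumptions.

```python
# Minimal sketch of the pipeline described in the abstract (PyTorch).
# ASSUMPTIONS: kNN cosine-similarity adjacency, GAT-style attention scores,
# layer sizes, and an MSE modality-invariance term are all illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


def knn_graph(feats: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Modality graph: each sample is a node linked to its k most
    cosine-similar samples (self-loops included via top-(k+1))."""
    f = F.normalize(feats, dim=1)
    sim = f @ f.t()
    idx = sim.topk(k + 1, dim=1).indices
    adj = torch.zeros_like(sim)
    adj.scatter_(1, idx, 1.0)
    return adj


class MultiHeadGraphAttention(nn.Module):
    """Graph convolution with attention: each head adaptively weights the
    neighbors' contributions when updating the central node, and multiple
    heads learn several groups of correlated features."""

    def __init__(self, in_dim: int, out_dim: int, heads: int = 4):
        super().__init__()
        self.heads, self.out_dim = heads, out_dim
        self.W = nn.Linear(in_dim, heads * out_dim, bias=False)
        self.a_src = nn.Parameter(torch.empty(heads, out_dim))
        self.a_dst = nn.Parameter(torch.empty(heads, out_dim))
        nn.init.xavier_uniform_(self.a_src)
        nn.init.xavier_uniform_(self.a_dst)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        n = x.size(0)
        h = self.W(x).view(n, self.heads, self.out_dim)        # (N, H, D)
        e = F.leaky_relu(                                      # e[i, j, h]
            (h * self.a_src).sum(-1).unsqueeze(1)              # score of central node i
            + (h * self.a_dst).sum(-1).unsqueeze(0), 0.2)      # score of neighbor j
        e = e.masked_fill(adj.unsqueeze(-1) == 0, float("-inf"))
        alpha = torch.softmax(e, dim=1)                        # neighbor weights per head
        out = torch.einsum("ijh,jhd->ihd", alpha, h)           # weighted aggregation
        return F.elu(out.reshape(n, self.heads * self.out_dim))


class CrossModalNet(nn.Module):
    """Two modality-specific encoders feeding one *shared* common
    representation layer; a shared classifier supplies the semantic constraint."""

    def __init__(self, img_dim=4096, txt_dim=300, hid=256, common=128,
                 n_classes=10, heads=4):
        super().__init__()
        self.img_enc = MultiHeadGraphAttention(img_dim, hid, heads)
        self.txt_enc = MultiHeadGraphAttention(txt_dim, hid, heads)
        self.common = nn.Linear(hid * heads, common)   # weights shared by both branches
        self.classify = nn.Linear(common, n_classes)

    def forward(self, img, txt):
        z_img = self.common(self.img_enc(img, knn_graph(img)))
        z_txt = self.common(self.txt_enc(txt, knn_graph(txt)))
        return z_img, z_txt, self.classify(z_img), self.classify(z_txt)


def total_loss(z_img, z_txt, logit_img, logit_txt, labels, lam=0.1):
    semantic = F.cross_entropy(logit_img, labels) + F.cross_entropy(logit_txt, labels)
    invariant = F.mse_loss(z_img, z_txt)   # paired image/text embeddings should agree
    return semantic + lam * invariant


# Toy batch: 32 image/text pairs with 4096-d image and 300-d text features.
imgs, txts = torch.randn(32, 4096), torch.randn(32, 300)
labels = torch.randint(0, 10, (32,))
model = CrossModalNet()
loss = total_loss(*model(imgs, txts), labels)
loss.backward()
```

Routing both branches through the same `self.common` layer is one plausible reading of the shared common representation learning layer in the abstract: the semantic term pulls same-class samples together in the common space, while the invariance term aligns the image and text embeddings of each pair.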

History
  • Received: 2023-02-03
  • Revised: 2023-04-10
  • Accepted: 2023-04-25
  • Available online:
  • Published: