A Chinese Cross-Modal Entity Alignment Method Based on a Multimodal Knowledge Graph


Source: https://www.ecice06.com/CN/10.19678/j.issn.1000-3428.0066938

With the rise of interactive tasks over multimodal data, jointly exploiting knowledge from different modalities has become increasingly important. Against this background, this article presents a Chinese cross-modal entity alignment method based on a multimodal knowledge graph, aimed at determining whether an image entity and a text entity refer to the same object. A single- and dual-stream interactive pre-trained language model (CCMEA) is designed and combined with self-supervised learning to extract and match image and text features efficiently. Experimental results show that the method delivers significant performance gains on multiple datasets, reaching a mean recall of 94.3% on a textile-industry dataset.

1. Introduction

The emergence of interactive tasks across multimodal data places high demands on jointly exploiting knowledge from different modalities, and multimodal knowledge graphs have arisen in response. When constructing a multimodal knowledge graph, it is crucial to determine whether an image entity and a text entity refer to the same object, which requires aligning Chinese cross-modal entities. To address this problem, this article proposes a Chinese cross-modal entity alignment method based on a multimodal knowledge graph.

2. Methodology

2.1 CCMEA Model Architecture

The CCMEA model adopts a single- and dual-stream interactive pre-trained language model architecture targeting fine-grained domain images and Chinese text. Based on self-supervised learning, the model extracts visual and textual features with a vision encoder and a text encoder, refines them with a cross-encoder for fine-grained modeling, and finally computes the matching degree between image and text entities via contrastive learning.


Figure 2: Principle of contrastive learning
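For illustration, the following is a minimal PyTorch sketch of a dual-stream image-text matcher trained with a symmetric contrastive loss of the kind described above. The encoder modules, projection dimension, and class names are illustrative assumptions rather than the authors' implementation, and the cross-encoder refinement stage is omitted.

```python
# Minimal sketch of a dual-stream image-text matcher with a symmetric
# contrastive (InfoNCE-style) loss. Names and dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualStreamMatcher(nn.Module):
    def __init__(self, vision_encoder: nn.Module, text_encoder: nn.Module, embed_dim: int = 512):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g. a ViT backbone returning pooled features
        self.text_encoder = text_encoder          # e.g. a BERT-style backbone returning [CLS] features
        self.vision_proj = nn.LazyLinear(embed_dim)
        self.text_proj = nn.LazyLinear(embed_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # learnable log-temperature

    def forward(self, images, text_tokens):
        v = F.normalize(self.vision_proj(self.vision_encoder(images)), dim=-1)
        t = F.normalize(self.text_proj(self.text_encoder(text_tokens)), dim=-1)
        # Similarity matrix: entry (i, j) scores image i against text j.
        return self.logit_scale.exp() * v @ t.t()

def contrastive_loss(logits: torch.Tensor) -> torch.Tensor:
    # Matched image-text pairs sit on the diagonal; both retrieval directions are supervised.
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```

In this formulation, each image is pulled toward its paired text and pushed away from the other texts in the batch, which is the matching signal the figure above illustrates.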

2.2 Experimental Design

Experiments are conducted on the MUGE and Flickr30k-CN datasets and on a self-built TEXTILE dataset. The matching degree between image and text entities is computed via contrastive learning, and model performance is evaluated using mean recall (MR).
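On cross-modal retrieval benchmarks such as Flickr30k-CN, mean recall is conventionally the average of Recall@1, Recall@5, and Recall@10 over both retrieval directions; the sketch below computes it from a precomputed image-text similarity matrix under that assumption, which may differ in detail from the paper's exact protocol.

```python
# Sketch of Recall@K and Mean Recall (MR) for image-text retrieval, assuming a
# square similarity matrix where the i-th image is the true match of the i-th text.
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    # For each query (row), check whether its true match appears among the top-k columns.
    ranks = np.argsort(-sim, axis=1)                     # columns sorted by descending similarity
    hits = (ranks[:, :k] == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return float(hits.mean())

def mean_recall(sim_i2t: np.ndarray) -> float:
    sim_t2i = sim_i2t.T                                  # text-to-image uses the transposed matrix
    scores = [recall_at_k(m, k) for m in (sim_i2t, sim_t2i) for k in (1, 5, 10)]
    return float(np.mean(scores))
```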

3. Experimental Results and Analysis

The results show that, on the MUGE and Flickr30k-CN datasets, the CCMEA model improves mean recall (MR) over the WukongViT-B baseline by 3.20 and 11.96 percentage points, respectively. On the self-built TEXTILE dataset in particular, CCMEA reaches a mean recall of 94.3%, demonstrating that the method aligns Chinese cross-modal entities effectively and with high accuracy and practical value.

4. Application Scenarios

The method has significant application value in constructing a multimodal knowledge graph for the textile industry. Such a knowledge graph enables more precise image retrieval and knowledge acquisition, providing technical support for the industry's intelligent development.


Figure 5: Workflow for constructing a textile-industry multimodal knowledge graph
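As an illustration of how aligned embeddings could support image retrieval over such a knowledge graph, the hypothetical helper below ranks a bank of pre-encoded images against a Chinese text query. It reuses the DualStreamMatcher sketch from Section 2.1 and is not part of the published method.

```python
# Illustrative text-to-image retrieval over pre-encoded images, e.g. fabric photos
# linked to entities in a textile knowledge graph. `model` is assumed to be a
# trained DualStreamMatcher-style module (see the sketch in Section 2.1).
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve_images(model, query_tokens, image_bank: torch.Tensor, top_k: int = 5):
    """Return indices of the top_k images whose embeddings best match the query text.

    image_bank: (N, D) tensor of L2-normalized image embeddings computed offline.
    """
    q = F.normalize(model.text_proj(model.text_encoder(query_tokens)), dim=-1)  # (1, D)
    scores = (q @ image_bank.t()).squeeze(0)         # cosine similarity to every stored image
    return torch.topk(scores, k=top_k).indices.tolist()
```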

5. Conclusion

This article proposed a Chinese cross-modal entity alignment method based on a multimodal knowledge graph. The designed CCMEA model extracts and matches image and text features efficiently. Experimental results show significant performance gains on multiple datasets, with a mean recall of 94.3% on the textile-industry dataset, offering a new approach to constructing and applying multimodal knowledge graphs.

