Multimodal named entity recognition based on explicit-implicit dual-path fusion
CHEN Qiang (陈强), GU Xiaoyan (谷晓燕), YANG Yi (杨溢)
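Abstract:
Existing text-based named entity recognition methods struggle to exploit visual information, and mainstream multimodal named entity recognition (MNER) methods suffer from insufficient mining of cross-modal semantic associations, limited ability to fuse heterogeneous data, and sensitivity to the semantic gap between modalities. To address these problems, this paper proposes DPF-MNER (dual-path fusion MNER), a multimodal named entity recognition model based on explicit-implicit dual-path fusion. The model introduces a dual-path fusion mechanism to achieve deep cross-modal alignment. In the explicit path, a target entity-word relation graph is constructed to explicitly model the semantic correspondence between text entities and image regions. In the implicit path, a hard-sample alignment mechanism based on momentum contrastive learning is designed: a cross-modal memory bank is maintained through momentum updates, guiding the model to pull related image-text pairs together and push unrelated pairs apart in a shared semantic space, which mitigates modality bias. Experiments on ME-MNER, a self-constructed military-domain dataset, and on the public Twitter-2017 dataset show that DPF-MNER achieves F1 scores of 87.05% and 86.35%, respectively, verifying the effectiveness of the method in improving entity recognition accuracy and model generalization.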
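The implicit path described in the abstract (momentum updates, a cross-modal memory bank, and a contrastive objective that pulls matching image-text pairs together) can be illustrated with a minimal NumPy sketch in the spirit of momentum contrast. This is not the authors' implementation: the function names, the momentum coefficient `m`, the temperature value, and the FIFO queue policy are all assumptions chosen for illustration, and the hard-sample selection step is omitted because the abstract does not specify it.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def momentum_update(query_params, key_params, m=0.999):
    """Exponential moving average: key <- m * key + (1 - m) * query.
    `m` is an assumed value, not taken from the paper."""
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]

def info_nce_with_queue(text_q, image_k, queue, temperature=0.07):
    """InfoNCE loss for text queries against image keys.
    text_q:  (B, D) text features from the query encoder
    image_k: (B, D) image features of the matching pairs (positives)
    queue:   (K, D) memory bank of earlier image keys (negatives)
    """
    q = l2_normalize(text_q)
    k = l2_normalize(image_k)
    neg = l2_normalize(queue)
    pos_logits = np.sum(q * k, axis=1, keepdims=True)      # (B, 1)
    neg_logits = q @ neg.T                                  # (B, K)
    logits = np.concatenate([pos_logits, neg_logits], axis=1) / temperature
    logits -= logits.max(axis=1, keepdims=True)             # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[:, 0].mean())                    # positive is column 0

def enqueue(queue, new_keys):
    """FIFO memory bank: append the newest keys and keep the last K rows."""
    size = len(queue)
    merged = np.concatenate([queue, l2_normalize(new_keys)], axis=0)
    return merged[-size:]

# Toy usage: one aligned pair and a two-entry memory bank of negatives.
text = np.array([[1.0, 0.0, 0.0, 0.0]])
image = np.array([[1.0, 0.0, 0.0, 0.0]])
bank = np.array([[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]])
loss_aligned = info_nce_with_queue(text, image, bank)
bank = enqueue(bank, image)
```

In this toy setup the loss is close to zero when the text and image features coincide and the memory bank holds only orthogonal negatives; swapping the positive for an unrelated image raises it. A full model would compute the features with separate query and key encoders, with the key encoder refreshed through `momentum_update` rather than by gradient descent.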
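Keywords: multimodal named entity recognition; contrastive learning; cross-modal alignment; explicit-implicit fusion
Foundation: Equipment Pre-Research Field Fund Project (61403120404)
Author: CHEN Qiang (陈强), GU Xiaoyan (谷晓燕), YANG Yi (杨溢)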
DOI: 10.16508/j.cnki.11-5866/n.2025.06.009