Few-Shot Cross-Domain Document Retrieval Model for Resource-Constrained Scenarios

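Citation: YANG Decao, MIAO Yiran, CHEN Chao, et al. Few-shot cross-domain document retrieval model for resource-constrained scenarios [J]. Journal of China West Normal University (Natural Sciences), 2025, 46(6): 667-676.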

Authors: YANG Decao (杨得草), MIAO Yiran (苗怡然), CHEN Chao (陈超), YU Jiuhuan (于久桓), LI Qizhi (李齐治), PENG Dezhong (彭德中)

Corresponding author: YANG Decao (born 1996), male, engineer, mainly engaged in nuclear technology support work.

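Abstract: As the Internet has grown, tens of thousands of new data items appear online every day, and users struggle to retrieve exactly what they want from this mass of data. To help users find their target information precisely, this paper proposes a few-shot cross-domain text retrieval model based on intrinsic semantic contrastive learning and sentence vector aggregation. Intrinsic semantic contrastive learning addresses the generalization problem caused by inconsistent data distributions, and it also overcomes the difficulty in NLP of building contrastive learning on data augmentation. The sentence vector aggregation module addresses the difficulty of processing long documents when GPU memory is insufficient. Experiments on a purpose-built few-shot cross-domain text retrieval dataset show that the proposed method effectively improves retrieval performance and makes long texts tractable when GPU memory is limited.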

Keywords: document retrieval; document representation; contrastive learning; domain generalization; few-shot learning
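
The abstract names two components, sentence vector aggregation and contrastive learning, without giving their formulation. The following is only a minimal sketch of the general idea, not the authors' model. It assumes mean pooling as the aggregation step and a standard InfoNCE loss as the contrastive objective. The names `encode_sentence`, `encode_document`, `info_nce`, the constant `DIM`, and the hashed bag-of-words encoder are all hypothetical stand-ins; a real system would presumably use a pretrained language model as the sentence encoder.

```python
import math
import re
import zlib

DIM = 64  # width of the toy embedding (assumption, not from the paper)


def encode_sentence(sentence: str) -> list[float]:
    """Toy stand-in for a sentence encoder: hashed bag of words, L2-normalised."""
    vec = [0.0] * DIM
    for token in re.findall(r"\w+", sentence.lower()):
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def encode_document(text: str) -> list[float]:
    """Encode a document one sentence at a time and mean-pool the vectors.

    Only a running sum is kept, so memory use does not grow with document length.
    """
    sentences = [s.strip() for s in re.split(r"[.!?。！？]+", text) if s.strip()]
    total = [0.0] * DIM
    for sentence in sentences:
        for i, v in enumerate(encode_sentence(sentence)):
            total[i] += v
    count = max(len(sentences), 1)
    mean = [v / count for v in total]
    norm = math.sqrt(sum(v * v for v in mean)) or 1.0
    return [v / norm for v in mean]


def cosine(a: list[float], b: list[float]) -> float:
    """Dot product; equals cosine similarity for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(query, positive, negatives, temperature: float = 0.1) -> float:
    """Standard InfoNCE loss: -log softmax of the positive pair's similarity."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[0]
```

In this sketch the loss is small when the query is closer to its positive document than to the negatives, and large otherwise. Because each sentence is encoded separately, the cost of one encoder call is bounded by sentence length rather than document length, which is the property the abstract attributes to the aggregation module.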


