Information Retrieval in long documents: Word clustering approach for improving Semantics

2302.10150v2

Chinese Title

Information Retrieval in long documents: Word clustering approach for improving Semantics

English Title

Information Retrieval in long documents: Word clustering approach for improving Semantics

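Chinese Abstract

In this paper, we propose an alternative to deep neural networks for semantic information retrieval in long documents. The new approach uses clustering techniques to take word meaning into account in information retrieval systems that target both long and short documents. It relies on a specially designed clustering algorithm that groups words with similar meanings into clusters. The dual (lexical and semantic) representation of documents and queries is based on the vector space model proposed by Gerard Salton, with the vector space formed by the resulting clusters. Our proposal is novel at several levels: first, we propose an efficient algorithm that builds clusters of semantically close words from word embeddings; next, we define a formula for weighting these clusters; finally, we propose a function that effectively combines word meaning with a lexical model widely used in information retrieval. An evaluation in three different contexts on two datasets, SQuAD and TREC-CAR, shows that the proposal significantly improves on classical keyword-only approaches without degrading the lexical aspect.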

English Abstract

In this paper, we propose an alternative to deep neural networks for semantic information retrieval in the case of long documents. This new approach exploits clustering techniques to take into account the meaning of words in Information Retrieval systems targeting long as well as short documents. It uses a specially designed clustering algorithm to group words with similar meanings into clusters. The dual representation (lexical and semantic) of documents and queries is based on the vector space model proposed by Gerard Salton, applied in the vector space constituted by the formed clusters. The originality of our proposal lies at several levels: first, we propose an efficient algorithm for constructing clusters of semantically close words using word embeddings as input; second, we define a formula for weighting these clusters; and third, we propose a function that efficiently combines the meanings of words with a lexical model widely used in Information Retrieval. The evaluation of our proposal in three contexts with two different datasets, SQuAD and TREC-CAR, has shown that it significantly improves on classical approaches based only on keywords, without degrading the lexical aspect.
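To make the three steps in the abstract concrete (cluster words, represent texts over the clusters, combine with a lexical score), here is a minimal Python sketch on toy data. It is not the paper's method: the abstract does not give the clustering algorithm, the cluster-weighting formula, or the combination function, so this sketch substitutes a greedy threshold clustering, raw cluster frequency as the weight, and a linear interpolation controlled by a hypothetical parameter `alpha`. The embedding values and all function names are likewise made up for illustration.

```python
import math
from collections import Counter

# Toy 2-D word embeddings (hypothetical values, for illustration only).
EMBEDDINGS = {
    "car": (1.0, 0.1),
    "automobile": (0.9, 0.2),
    "vehicle": (0.8, 0.3),
    "banana": (0.1, 1.0),
    "fruit": (0.2, 0.9),
}


def dense_cosine(u, v):
    """Cosine similarity between two equal-length tuples."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sparse_cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(key, 0) for key, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def cluster_words(embeddings, threshold=0.9):
    """Greedy clustering (an assumption, not the paper's algorithm):
    a word joins the first cluster whose seed word is similar enough,
    otherwise it becomes the seed of a new cluster."""
    seeds = []
    word_to_cluster = {}
    for word, vec in embeddings.items():
        for cluster_id, seed in enumerate(seeds):
            if dense_cosine(vec, embeddings[seed]) >= threshold:
                word_to_cluster[word] = cluster_id
                break
        else:
            word_to_cluster[word] = len(seeds)
            seeds.append(word)
    return word_to_cluster


def cluster_vector(tokens, word_to_cluster):
    """Semantic representation: cluster frequencies (stand-in weighting)."""
    return Counter(word_to_cluster[t] for t in tokens if t in word_to_cluster)


def combined_score(query, doc, word_to_cluster, alpha=0.5):
    """Linear mix of a lexical and a semantic vector-space score.
    alpha=1.0 is purely lexical, alpha=0.0 purely semantic."""
    lexical = sparse_cosine(Counter(query), Counter(doc))
    semantic = sparse_cosine(
        cluster_vector(query, word_to_cluster),
        cluster_vector(doc, word_to_cluster),
    )
    return alpha * lexical + (1 - alpha) * semantic


WORD_TO_CLUSTER = cluster_words(EMBEDDINGS)
QUERY = ["car"]
DOC_A = ["automobile", "road"]  # no shared keyword, but same cluster as "car"
DOC_B = ["banana", "road"]      # no shared keyword, different cluster

if __name__ == "__main__":
    print("doc A:", combined_score(QUERY, DOC_A, WORD_TO_CLUSTER))
    print("doc B:", combined_score(QUERY, DOC_B, WORD_TO_CLUSTER))
```

In this toy setup a purely lexical model cannot separate the two documents, because neither contains the query keyword, while the cluster-based term ranks the document mentioning "automobile" above the one mentioning "banana".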

Get the PDF

View the Chinese PDF - 2302.10150v2
