Information Retrieval in Long Documents: A Word Clustering Approach for Improving Semantics

2302.10150v2

Chinese Title (translated)

Information Retrieval in Long Documents: A Word Clustering Approach for Improving Semantics

English Title

Information Retrieval in long documents: Word clustering approach for improving Semantics

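Chinese Abstract (translated)

In this paper, we propose an alternative to deep neural networks for semantic information retrieval in long documents. The new approach uses clustering techniques to account for the meaning of words in information retrieval systems that target both long and short documents. It relies on a specially designed clustering algorithm that groups words with similar meanings into clusters. The dual representation (lexical and semantic) of documents and queries is based on the vector space model proposed by Gerard Salton, applied in the space formed by these clusters. The novelty of our proposal lies at several levels: first, we propose an efficient algorithm that takes word embeddings as input and builds clusters of semantically close words; next, we define a formula for weighting these clusters; finally, we propose a function that effectively combines word meaning with a lexical model widely used in information retrieval. An evaluation of our proposal in three different contexts, using two different datasets, SQuAD and TREC-CAR, shows that it significantly improves on classical keyword-only approaches without degrading the lexical aspect.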

English Abstract

In this paper, we propose an alternative to deep neural networks for semantic information retrieval in the case of long documents. This new approach exploits clustering techniques to take into account the meaning of words in Information Retrieval systems targeting long as well as short documents. It uses a specially designed clustering algorithm to group words with similar meanings into clusters. The dual representation (lexical and semantic) of documents and queries is based on the vector space model proposed by Gerard Salton, in the vector space constituted by the formed clusters. Our proposal is original at several levels: first, we propose an efficient algorithm for constructing clusters of semantically close words using word embeddings as input; then, we define a formula for weighting these clusters; finally, we propose a function that efficiently combines the meanings of words with a lexical model widely used in Information Retrieval. The evaluation of our proposal in three contexts with two different datasets, SQuAD and TREC-CAR, has shown that it significantly improves on classical approaches based only on keywords, without degrading the lexical aspect.
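
The three steps named in the abstract (cluster words, represent documents and queries over clusters, combine lexical and semantic scores) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, since the abstract does not give the actual algorithm or formulas: the greedy threshold clustering, the raw-count cluster weights, the linear mix controlled by `alpha`, and all function names are hypothetical stand-ins rather than the authors' method.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two dense vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def cluster_words(embeddings, threshold=0.8):
    """Greedy clustering (assumed, not the paper's algorithm): a word joins
    the first cluster whose seed vector is similar enough, else starts a new one."""
    seeds = []            # seed vector of each cluster, indexed by cluster id
    word_to_cluster = {}
    for word, vec in embeddings.items():
        for cid, seed in enumerate(seeds):
            if cosine(vec, seed) >= threshold:
                word_to_cluster[word] = cid
                break
        else:
            word_to_cluster[word] = len(seeds)
            seeds.append(vec)
    return word_to_cluster


def bag(tokens, word_to_cluster=None):
    """Sparse vector of raw counts over words (lexical) or cluster ids (semantic).
    Raw counts stand in for the paper's cluster weighting formula."""
    if word_to_cluster is None:
        return Counter(tokens)
    return Counter(word_to_cluster[t] for t in tokens if t in word_to_cluster)


def sparse_cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def combined_score(query, doc, word_to_cluster, alpha=0.5):
    """Assumed linear mix of a lexical score and a cluster-based semantic score."""
    lexical = sparse_cosine(bag(query), bag(doc))
    semantic = sparse_cosine(bag(query, word_to_cluster), bag(doc, word_to_cluster))
    return alpha * lexical + (1 - alpha) * semantic
```

With toy embeddings in which "car" and "automobile" point in nearly the same direction, a query containing only "car" still scores above zero against a document containing only "automobile", because both words map to the same cluster even though they share no keyword.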

PDF Download

View Chinese PDF - 2302.10150v2
