Information Retrieval in Long Documents: A Word Clustering Approach for Improving Semantics

2302.10150v2

Title (translated from Japanese)#

Information Retrieval in Long Documents: A Word Clustering Approach to Improve Semantics

English title#

Information Retrieval in long documents: Word clustering approach for improving Semantics

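Abstract (translated from Japanese)#

In this paper, we propose an alternative to deep neural networks for semantic information retrieval in long documents. The new approach uses clustering techniques to take word meaning into account in information retrieval systems that target both long and short documents. It uses a specially designed clustering algorithm to group semantically similar words into clusters. The dual representation (lexical and semantic) of documents and queries is based on a vector space model built from the formed clusters. The originality of our proposal lies at several levels: first, we propose an efficient algorithm that builds clusters of semantically close words using word embeddings as input; next, we define a formula for weighting these clusters; and finally, we propose a function that efficiently combines word meaning with a lexical model widely used in information retrieval. Evaluation in three different contexts using two different datasets, SQuAD and TREC-CAR, shows that the proposal significantly improves on conventional keyword-only approaches without degrading the lexical aspect.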

English abstract#

In this paper, we propose an alternative to deep neural networks for semantic information retrieval in long documents. This new approach exploits clustering techniques to take into account the meaning of words in Information Retrieval systems targeting long as well as short documents. It uses a specially designed clustering algorithm to group words with similar meanings into clusters. The dual representation (lexical and semantic) of documents and queries is based on the vector space model proposed by Gerard Salton, applied in the vector space constituted by the formed clusters. The originality of our proposal lies at several levels: first, we propose an efficient algorithm for constructing clusters of semantically close words using word embeddings as input; then we define a formula for weighting these clusters; and finally we propose a function that efficiently combines the meanings of words with a lexical model widely used in Information Retrieval. The evaluation of our proposal in three contexts with two different datasets, SQuAD and TREC-CAR, has shown that it significantly improves on classical approaches based only on keywords, without degrading the lexical aspect.
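
The abstract names three ingredients (clusters of semantically close words built from word embeddings, a weighting of those clusters, and a function combining the lexical and semantic views) but does not spell out any of them. The sketch below is only a rough illustration of that general shape, not the paper's method: the greedy threshold clustering, the raw-count cluster weights, the linear interpolation with `alpha`, the toy 2-D embeddings, and every function name are assumptions made for this example.

```python
import math
from collections import Counter

# Toy 2-D word embeddings; hypothetical values chosen only for illustration.
EMBEDDINGS = {
    "car": (1.0, 0.1),
    "automobile": (0.9, 0.2),
    "vehicle": (0.8, 0.3),
    "banana": (0.1, 1.0),
    "fruit": (0.2, 0.9),
}


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_words(embeddings, threshold=0.9):
    """Greedy clustering (an assumption, not the paper's algorithm):
    a word joins the first cluster whose seed vector is similar enough,
    otherwise it starts a new cluster."""
    seeds = []  # (seed_vector, cluster_id) pairs
    word_to_cluster = {}
    for word in sorted(embeddings):
        vec = embeddings[word]
        for seed_vec, cluster_id in seeds:
            if cosine(vec, seed_vec) >= threshold:
                word_to_cluster[word] = cluster_id
                break
        else:
            cluster_id = len(seeds)
            seeds.append((vec, cluster_id))
            word_to_cluster[word] = cluster_id
    return word_to_cluster


def bag_cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def combined_score(query, document, word_to_cluster, alpha=0.5):
    """Interpolate a lexical score (term space) with a semantic score
    (cluster space). Raw counts stand in for the paper's weighting formula."""
    q_tokens = query.lower().split()
    d_tokens = document.lower().split()
    lexical = bag_cosine(Counter(q_tokens), Counter(d_tokens))
    q_clusters = Counter(word_to_cluster[t] for t in q_tokens if t in word_to_cluster)
    d_clusters = Counter(word_to_cluster[t] for t in d_tokens if t in word_to_cluster)
    semantic = bag_cosine(q_clusters, d_clusters)
    return alpha * lexical + (1 - alpha) * semantic


clusters = cluster_words(EMBEDDINGS)
print(combined_score("car", "automobile vehicle", clusters))
print(combined_score("car", "banana fruit", clusters))
```

In this toy setup the query "car" shares no keyword with either document, so a purely lexical score cannot separate them; the cluster-space term is what ranks "automobile vehicle" above "banana fruit". A real system would presumably replace the term-count cosine with an established lexical model and tune `alpha` on held-out data.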

