Long-Context Speech Synthesis with Context-Aware Memory

2508.14713v1

Title#

Long-Context Speech Synthesis with Context-Aware Memory

Abstract#

In long-text speech synthesis, current approaches typically convert text to speech at the sentence level and concatenate the results to form pseudo-paragraph-level speech. These methods overlook the contextual coherence of paragraphs, leading to reduced naturalness and inconsistencies in style and timbre across the long-form speech. To address these issues, we propose a Context-Aware Memory (CAM)-based long-context Text-to-Speech (TTS) model. The CAM block integrates and retrieves both long-term memory and local context details, enabling dynamic memory updates and transfers within long paragraphs to guide sentence-level speech synthesis. Furthermore, the prefix mask enhances the in-context learning ability by enabling bidirectional attention on prefix tokens while maintaining unidirectional generation. Experimental results demonstrate that the proposed method outperforms baseline and state-of-the-art long-context methods in terms of prosody expressiveness, coherence, and context inference cost across paragraph-level speech.
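The prefix mask described in the abstract can be illustrated with a small sketch. The abstract gives only the behavior (bidirectional attention on prefix tokens, unidirectional generation afterwards), so the function name `prefix_attention_mask`, its parameters, and the boolean-matrix representation below are assumptions made for illustration, not the authors' implementation.

```python
def prefix_attention_mask(prefix_len: int, total_len: int) -> list[list[bool]]:
    """Build a prefix-style attention mask (illustrative sketch only).

    mask[i][j] is True when query position i may attend to key position j.
    Positions 0 .. prefix_len-1 are treated as the prefix (context) tokens.
    """
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must be between 0 and total_len")

    mask = []
    for i in range(total_len):
        row = []
        for j in range(total_len):
            if j < prefix_len:
                # Prefix keys are visible to every query: prefix tokens see
                # each other in both directions, and generated tokens see
                # the whole prefix.
                row.append(True)
            else:
                # Keys outside the prefix follow the usual causal rule, so
                # generation stays unidirectional.
                row.append(j <= i)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Two prefix tokens followed by two generated tokens.
    for row in prefix_attention_mask(prefix_len=2, total_len=4):
        print("".join("1" if allowed else "0" for allowed in row))
```

With `prefix_len=2` and `total_len=4`, the first two rows allow attention only within the prefix (in both directions), while the last two rows see the full prefix plus a causal view of the generated positions.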

Article Page#

Long-Context Speech Synthesis with Context-Aware Memory

PDF#

View Chinese PDF - 2508.14713v1
