Title#
Long-Context Speech Synthesis with Context-Aware Memory
Abstract#
In long-text speech synthesis, current approaches typically convert text to speech at the sentence level and concatenate the results to form pseudo-paragraph-level speech. These methods overlook the contextual coherence of paragraphs, which reduces naturalness and causes inconsistencies in style and timbre across long-form speech. To address these issues, we propose a long-context Text-to-Speech (TTS) model based on Context-Aware Memory (CAM). The CAM block integrates and retrieves both long-term memory and local context details, enabling dynamic memory updates and transfers within long paragraphs to guide sentence-level speech synthesis. Furthermore, a prefix mask enhances in-context learning by enabling bidirectional attention over prefix tokens while keeping generation unidirectional. Experimental results demonstrate that the proposed method outperforms the baseline and state-of-the-art long-context methods in prosody expressiveness, coherence, and context inference cost for paragraph-level speech.
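The prefix mask mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common prefix-LM formulation: the first `prefix_len` positions attend to each other in both directions, while all later positions attend causally. The function name `prefix_mask` and its parameters are hypothetical, and the abstract does not specify the paper's actual mask layout, so this is one plausible reading rather than the authors' implementation.

```python
def prefix_mask(prefix_len, total_len):
    """Build a boolean attention mask as a list of rows.

    mask[i][j] is True when query position i may attend to key position j.
    Assumption (not from the paper): positions 0..prefix_len-1 form the prefix.
    """
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must be between 0 and total_len")
    mask = []
    for i in range(total_len):
        row = []
        for j in range(total_len):
            causal = j <= i                                   # unidirectional generation
            within_prefix = i < prefix_len and j < prefix_len  # bidirectional prefix
            row.append(causal or within_prefix)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Three prefix tokens followed by two generated tokens.
    for row in prefix_mask(prefix_len=3, total_len=5):
        print("".join("1" if allowed else "0" for allowed in row))
```

In this sketch the top-left `prefix_len` by `prefix_len` block is fully visible, and every row after the prefix is lower-triangular, so generated tokens can see the whole prefix and earlier outputs but never future positions.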