A Study of the Scale-Invariant Signal-to-Distortion Ratio in Speech Separation with Noisy References

2508.14623v1

Title

A Study of the Scale Invariant Signal to Distortion Ratio in Speech Separation with Noisy References

Abstract

This paper examines the implications of using the Scale-Invariant Signal-to-Distortion Ratio (SI-SDR) as both an evaluation metric and a training objective in supervised speech separation when the training references contain noise, as is the case with the de facto benchmark WSJ0-2Mix. A derivation of the SI-SDR with noisy references reveals that the noise either limits the achievable SI-SDR or leads to undesired noise in the separated outputs. To address this, a method is proposed to enhance the references and augment the mixtures with WHAM!, aiming to train models that avoid learning noisy references. Two models trained on these enhanced datasets are evaluated with the non-intrusive NISQA.v2 metric. The results show reduced noise in the separated speech but suggest that processing the references may introduce artefacts, limiting overall quality gains. A negative correlation is found between SI-SDR and perceived noisiness across models on the WSJ0-2Mix and Libri2Mix test sets, underlining the conclusion from the derivation.
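
The trade-off described in the abstract can be illustrated numerically. The following is a minimal sketch, not the paper's derivation or code: it uses the commonly cited SI-SDR definition (projection of the estimate onto the reference), synthetic Gaussian signals in place of real speech, and the hypothetical helper names `si_sdr` and `add_noise_at_snr`. Under these assumptions, a perfectly clean estimate scored against a noisy reference reaches an SI-SDR of roughly the reference's own SNR, while an estimate that reproduces the reference noise scores far higher.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-12):
    """SI-SDR in dB, using the common projection-based definition."""
    # Remove the mean so the measure ignores DC offsets.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Optimal scale that projects the estimate onto the reference.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)


def add_noise_at_snr(clean, noise, snr_db):
    """Return clean + scaled noise so that the result has the given SNR in dB."""
    clean_power = np.dot(clean, clean)
    noise_power = np.dot(noise, noise)
    gain = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise


# Synthetic stand-ins for speech and reference noise (assumption: Gaussian signals).
rng = np.random.default_rng(0)
clean_speech = rng.standard_normal(16000)
reference_noise = rng.standard_normal(16000)

snr_db = 20.0
noisy_reference = add_noise_at_snr(clean_speech, reference_noise, snr_db)

# A noise-free estimate is capped near the SNR of the reference.
print("clean estimate vs noisy reference:", si_sdr(clean_speech, noisy_reference))
# An estimate that copies the reference noise is not capped in this way.
print("noisy estimate vs noisy reference:", si_sdr(noisy_reference, noisy_reference))
```

In this toy setting, maximizing SI-SDR against a noisy reference rewards reproducing the noise, which matches the abstract's statement that the noise either bounds the score or ends up in the separated output.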
