Applying Speech Recognition x Large Language Models x Text-to-Speech to AI Interview Systems

2507.16835v1

Title#

Evaluating Speech-to-Text x LLM x Text-to-Speech Combinations for AI Interview Systems

Abstract#

Voice-based conversational AI systems increasingly rely on cascaded architectures that combine speech-to-text (STT), large language models (LLMs), and text-to-speech (TTS) components. We present a large-scale empirical comparison of STT x LLM x TTS stacks using data sampled from over 300,000 AI-conducted job interviews. We use an LLM-as-a-Judge automated evaluation framework to assess conversational quality, technical accuracy, and skill assessment capabilities. Our analysis of five production configurations reveals that a stack combining Google's STT, GPT-4.1, and Cartesia's TTS outperforms the alternatives in both objective quality metrics and user satisfaction scores. Surprisingly, we find that objective quality metrics correlate only weakly with user satisfaction scores, suggesting that user experience in voice-based AI systems depends on factors beyond technical performance. Our findings provide practical guidance for selecting components in multimodal conversations and contribute a validated evaluation methodology for human-AI interactions.
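To make the cascaded architecture concrete, here is a minimal sketch of how an STT x LLM x TTS stack might be wired together so that configurations can be swapped and compared. Everything in it is an assumption for illustration: the `VoiceStack` class, the callable signatures, and the `fake_*` stand-ins are hypothetical and are not taken from the paper or from any vendor SDK.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical component signatures (assumptions, not a vendor API).
# Real STT/LLM/TTS clients would be wrapped to fit these shapes.
SpeechToText = Callable[[bytes], str]
LanguageModel = Callable[[str], str]
TextToSpeech = Callable[[str], bytes]


@dataclass(frozen=True)
class VoiceStack:
    """One STT x LLM x TTS configuration, run as a cascade."""

    name: str
    stt: SpeechToText
    llm: LanguageModel
    tts: TextToSpeech

    def respond(self, audio_in: bytes) -> bytes:
        transcript = self.stt(audio_in)    # speech -> text
        reply_text = self.llm(transcript)  # text -> text
        return self.tts(reply_text)        # text -> speech


# Toy stand-ins so the sketch runs without any external service.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(prompt: str) -> str:
    return f"Interviewer: could you expand on '{prompt}'?"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


stack = VoiceStack("toy-stack", fake_stt, fake_llm, fake_tts)
reply_audio = stack.respond(b"I built a caching layer")
print(reply_audio.decode("utf-8"))
```

Because each stage is just a callable, comparing configurations amounts to building several `VoiceStack` instances that differ in one component and running the same inputs through each.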
