Evaluating Speech-to-Text x LLM x Text-to-Speech Combinations for AI Interview Systems

2507.16835v1

English Title

Evaluating Speech-to-Text x LLM x Text-to-Speech Combinations for AI Interview Systems

English Abstract

Voice-based conversational AI systems increasingly rely on cascaded architectures that combine speech-to-text (STT), large language model (LLM), and text-to-speech (TTS) components. We present a large-scale empirical comparison of STT x LLM x TTS stacks using data sampled from over 300,000 AI-conducted job interviews. We use an automated LLM-as-a-Judge evaluation framework to assess conversational quality, technical accuracy, and skill assessment capabilities. Our analysis of five production configurations reveals that a stack combining Google's STT, GPT-4.1, and Cartesia's TTS outperforms the alternatives in both objective quality metrics and user satisfaction scores. Surprisingly, we find that objective quality metrics correlate only weakly with user satisfaction scores, suggesting that user experience in voice-based AI systems depends on factors beyond technical performance. Our findings provide practical guidance for selecting components in multimodal conversations and contribute a validated evaluation methodology for human-AI interactions.
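
As a rough illustration of the cascaded architecture the abstract describes, the following minimal sketch chains three interchangeable components in sequence. All names here (`CascadedStack`, `fake_stt`, `fake_llm`, `fake_tts`) are hypothetical placeholders and are not taken from the paper; real STT, LLM, and TTS services each expose their own APIs, which are not modeled here.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical component signatures (assumptions for illustration only).
SpeechToText = Callable[[bytes], str]
LanguageModel = Callable[[str], str]
TextToSpeech = Callable[[str], bytes]


@dataclass
class CascadedStack:
    """One STT x LLM x TTS configuration, run as a sequential pipeline."""
    stt: SpeechToText
    llm: LanguageModel
    tts: TextToSpeech

    def respond(self, audio_in: bytes) -> bytes:
        transcript = self.stt(audio_in)    # candidate speech -> text
        reply_text = self.llm(transcript)  # text -> interviewer's next turn
        return self.tts(reply_text)        # text -> synthesized speech


# Stand-in stubs so the sketch runs without any external service.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(prompt: str) -> str:
    return f"Thanks. Can you expand on: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


stack = CascadedStack(stt=fake_stt, llm=fake_llm, tts=fake_tts)
reply_audio = stack.respond(b"I used Redis for caching")
```

Because each stage only depends on the text or audio handed to it by the previous one, swapping a single component yields a new stack to compare, which is the kind of combination the abstract says was evaluated.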

PDF Access

View the Chinese PDF - 2507.16835v1
