Evaluating Speech-to-Text x LLM x Text-to-Speech Combinations for AI Interview Systems

2507.16835v1

Title

Evaluating Speech-to-Text x LLM x Text-to-Speech Combinations for AI Interview Systems

Abstract

Voice-based conversational AI systems increasingly rely on cascaded architectures that combine speech-to-text (STT), large language model (LLM), and text-to-speech (TTS) components. We present a large-scale empirical comparison of STT x LLM x TTS stacks using data sampled from over 300,000 AI-conducted job interviews. We use an automated LLM-as-a-Judge evaluation framework to assess conversational quality, technical accuracy, and skill assessment capabilities. Our analysis of five production configurations shows that a stack combining Google's STT, GPT-4.1, and Cartesia's TTS outperforms the alternatives on both objective quality metrics and user satisfaction scores. Surprisingly, objective quality metrics correlate only weakly with user satisfaction scores, suggesting that user experience in voice-based AI systems depends on factors beyond technical performance. Our findings provide practical guidance for selecting components for multimodal conversational systems and contribute a validated evaluation methodology for human-AI interactions.
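
To make the cascaded architecture concrete, the following is a minimal sketch of how an STT x LLM x TTS stack might be wired together so that individual components can be swapped and compared. It is not the authors' implementation: the `CascadedStack` class and the three stub functions are hypothetical placeholders, and the stubs stand in for real vendor services without calling any of them.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadedStack:
    """Hypothetical container for one STT x LLM x TTS configuration."""
    stt: Callable[[bytes], str]   # audio in -> transcript
    llm: Callable[[str], str]     # transcript -> text reply
    tts: Callable[[str], bytes]   # text reply -> audio out

    def respond(self, audio_in: bytes) -> bytes:
        # Each stage consumes the previous stage's output, which is what
        # makes the architecture "cascaded".
        transcript = self.stt(audio_in)
        reply = self.llm(transcript)
        return self.tts(reply)


# Stub components (assumptions for illustration only): they treat the
# "audio" as UTF-8 text so the example runs without any external service.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


stack = CascadedStack(stt=fake_stt, llm=fake_llm, tts=fake_tts)
audio_out = stack.respond(b"hello")
```

Because each stage is just a callable, comparing configurations amounts to constructing several `CascadedStack` instances with different components and scoring their outputs with the same evaluation procedure.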
