English title#
Language Detection by Means of the Minkowski Norm: Identification Through Character Bigrams and Frequency Analysis
English abstract#
The debate surrounding language identification has gained renewed attention in recent years, especially with the rapid evolution of AI-powered language models. Non-AI-based approaches to language identification, however, have been overshadowed. This research explores a mathematical implementation of an algorithm for language determination that leverages monogram and bigram frequency rankings derived from established linguistic research. The datasets comprise texts varying in length, historical period, and genre, including short stories, fairy tales, and poems. Despite these variations, the method achieves over 80% accuracy on texts shorter than 150 characters and reaches 100% accuracy on longer texts. These results demonstrate that classical frequency-based approaches remain effective and scalable alternatives to AI-driven models for language detection.
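The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that each language is represented by relative frequencies of single characters and character bigrams, and that a text is assigned to the language whose profile is closest under the Minkowski distance, d_p(a, b) = (sum_i |a_i - b_i|^p)^(1/p). The function names, the choice p = 2, and the tiny reference snippets are all hypothetical; the paper draws on published frequency rankings rather than profiles built from toy sentences.

```python
from collections import Counter


def ngram_profile(text, n):
    """Relative frequencies of character n-grams (letters and spaces, lowercased)."""
    cleaned = "".join(ch for ch in text.lower() if ch.isalpha() or ch == " ")
    grams = [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    return {gram: count / total for gram, count in Counter(grams).items()}


def minkowski_distance(a, b, p=2.0):
    """Minkowski distance of order p between two sparse frequency profiles."""
    keys = set(a) | set(b)
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) ** p for k in keys) ** (1.0 / p)


def detect_language(text, references, p=2.0):
    """Return the label whose monogram + bigram profile is closest to the text."""
    scores = {}
    for lang, ref_text in references.items():
        distance = 0.0
        for n in (1, 2):  # monograms, then bigrams
            distance += minkowski_distance(
                ngram_profile(text, n), ngram_profile(ref_text, n), p
            )
        scores[lang] = distance
    return min(scores, key=scores.get)


# Toy reference snippets (assumption: stand-ins for real frequency tables).
REFERENCES = {
    "english": "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun",
    "german": "der schnelle braune fuchs springt über den faulen hund und dann schläft der hund in der sonne",
}
```

With p = 1 the distance reduces to the Manhattan distance and with p = 2 to the Euclidean distance, so the order p is a tunable parameter of this sketch. A call such as `detect_language(some_text, REFERENCES)` returns one of the reference labels; with reference material this small the result is only illustrative.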