Language Detection by Means of the Minkowski Norm: Identification Through Character Bigrams and Frequency Analysis

2507.16284v2

English Title

Language Detection by Means of the Minkowski Norm: Identification Through Character Bigrams and Frequency Analysis

English Abstract

The debate surrounding language identification has gained renewed attention in recent years, especially with the rapid evolution of AI-powered language models. Non-AI-based approaches to language identification, however, have been overshadowed. This research explores a mathematical implementation of an algorithm for language determination that leverages monogram and bigram frequency rankings derived from established linguistic research. The datasets comprise texts varying in length, historical period, and genre, including short stories, fairy tales, and poems. Despite these variations, the method achieves over 80% accuracy on texts shorter than 150 characters and reaches 100% accuracy on longer texts. These results demonstrate that classical frequency-based approaches remain effective and scalable alternatives to AI-driven models for language detection.
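
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea: compare a text's character-bigram frequencies against per-language reference profiles and pick the profile at the smallest Minkowski distance. Everything concrete here is an assumption for illustration, not the paper's method: the function names, the use of relative frequencies rather than published frequency rankings, the default norm order of 2, and the toy one-sentence profiles standing in for real linguistic frequency tables.

```python
from collections import Counter


def bigram_frequencies(text):
    """Relative frequencies of adjacent letter pairs inside each word."""
    counts = Counter()
    for word in text.lower().split():
        letters = "".join(ch for ch in word if ch.isalpha())
        counts.update(letters[i:i + 2] for i in range(len(letters) - 1))
    total = sum(counts.values())
    return {bg: n / total for bg, n in counts.items()} if total else {}


def minkowski_distance(p, q, order=2.0):
    """Minkowski distance of the given order between two sparse vectors."""
    keys = set(p) | set(q)
    total = sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) ** order for k in keys)
    return total ** (1.0 / order)


def detect_language(text, profiles, order=2.0):
    """Return the name of the profile nearest to the text's bigram vector."""
    observed = bigram_frequencies(text)
    return min(
        profiles,
        key=lambda name: minkowski_distance(observed, profiles[name], order),
    )


# Toy reference profiles built from one sentence each (illustrative only;
# a real system would use frequency tables from large corpora).
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and then runs through the forest",
    "german": "der schnelle braune fuchs springt über den faulen hund und läuft durch den wald",
}
PROFILES = {name: bigram_frequencies(sample) for name, sample in SAMPLES.items()}

print(detect_language("the dog runs over the fox", PROFILES))
```

With `order=1` the distance reduces to the Manhattan distance and with `order=2` to the Euclidean distance, so the norm order is a natural parameter to tune when short texts yield sparse bigram vectors.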
