SemToken: Semantic-Aware Tokenization for Efficient Long-Context Language Modeling

2508.15190v1

Title

SemToken: Semantic-Aware Tokenization for Efficient Long-Context Language Modeling

Abstract

Tokenization plays a critical role in language modeling, yet existing approaches such as Byte-Pair Encoding (BPE) or WordPiece operate purely on frequency statistics, ignoring the underlying semantic structure of text. This leads to over-tokenization of semantically redundant spans and underutilization of contextual coherence, particularly in long-context scenarios. In this work, we propose SemToken, a semantic-aware tokenization framework that jointly reduces token redundancy and improves computation efficiency. SemToken first extracts contextual semantic embeddings via lightweight encoders and performs local semantic clustering to merge semantically equivalent tokens. Then, it allocates heterogeneous token granularity based on semantic density, allowing finer-grained tokenization in content-rich regions and coarser compression in repetitive or low-entropy spans. SemToken can be seamlessly integrated with modern language models and attention acceleration methods. Experiments on long-context language modeling benchmarks such as WikiText-103 and LongBench show that SemToken achieves up to 2.4× reduction in token count and 1.9× speedup, with negligible or no degradation in perplexity and downstream accuracy. Our findings suggest that semantic structure offers a promising new axis for optimizing tokenization and computation in large language models.
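The abstract describes local semantic clustering only at a high level, so the following is a minimal sketch of one way the merging step could look, not the authors' implementation. It assumes each token already has a contextual embedding (here hand-written 2-D toy vectors stand in for the output of a lightweight encoder) and greedily folds a token into the current span when its cosine similarity to the span's mean embedding reaches a threshold. The function name `merge_adjacent`, the threshold of 0.9, and the toy data are all hypothetical. Density-based granularity allocation is not modeled here.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Span = Tuple[List[str], List[float]]  # (member tokens, mean embedding)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_adjacent(
    tokens: List[str],
    embeddings: List[Vector],
    threshold: float = 0.9,  # hypothetical value, not taken from the paper
) -> List[Span]:
    """Greedy local clustering over adjacent tokens.

    A token joins the current span when its embedding is close enough to the
    span's running mean embedding; otherwise it starts a new span.
    """
    spans: List[Span] = []
    for tok, emb in zip(tokens, embeddings):
        if spans:
            members, mean = spans[-1]
            if cosine(mean, emb) >= threshold:
                n = len(members)
                # Update the running mean with the newly merged token.
                new_mean = [(m * n + e) / (n + 1) for m, e in zip(mean, emb)]
                spans[-1] = (members + [tok], new_mean)
                continue
        spans.append(([tok], list(emb)))
    return spans


# Toy input: three near-identical "very" tokens followed by two distinct ones.
tokens = ["very", "very", "very", "long", "cat"]
embeddings = [[1.0, 0.0], [0.99, 0.05], [1.0, 0.02], [0.0, 1.0], [0.7, 0.7]]

spans = merge_adjacent(tokens, embeddings)
print([members for members, _ in spans])
# → [['very', 'very', 'very'], ['long'], ['cat']]
print(f"compression ratio: {len(tokens) / len(spans):.2f}x")
```

In this toy case five input tokens collapse into three spans. Restricting merges to adjacent tokens keeps the pass linear in sequence length and preserves token order, which is one plausible reading of "local" clustering in the abstract.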
