TrorYongASR: Permuted AutoRegressive Sequence Modeling for Automatic Speech Recognition

Author

Kimang Khun

Published

May 7, 2026

Abstract

TrorYongASR is an Encoder-Decoder model for Automatic Speech Recognition (ASR). It is designed for transcription, translation, and audio-based language detection. It is smaller than the Whisper model and offers competitive performance in terms of WER and inference speed. TrorYongASR is inspired by PARSeq’s architecture (Bautista and Atienza 2022): the decoder has only one layer and predicts tokens corresponding to the observed positions. Currently, TrorYongASR comes in two configurations: tiny and small. The available pre-trained weights support Khmer and English, and TrorYongASR can be trained to support more languages without compromising any of the advantages mentioned above. Leveraging the pre-trained audio-encoder weights of Whisper (Radford et al. 2023) as initialization, TrorYongASR achieves a WER of 75.88% for Khmer and 54.33% for English in the tiny configuration, and of 50.46% and 21.75% respectively in the small configuration.

Evaluation

The evaluation assesses two capabilities, (1) language detection and (2) transcription, on two datasets (google/fleurs for Khmer and openslr/librispeech_asr for English). All results are from the test split of each dataset, reflecting the model’s generalization to unseen data.

| Dataset | Language | Test examples | Description |
|---|---|---|---|
| google/fleurs | Khmer | 765 | Multi-lingual dataset with Khmer language samples |
| librispeech.clean | English | 2620 | Clean speech dataset for English transcription |

Audio clips longer than 30 seconds are excluded from the evaluation (which is why google/fleurs has 765 examples instead of 771).

TrorYongASR currently comes in two configurations:

| Model Size | Tiny | Small |
|---|---|---|
| Audio Encoder | 4 layers, 6 heads | 12 layers, 12 heads |
| Audio Context | 1500 | 1500 |
| Text Decoder | 1 layer, 12 heads | 1 layer, 24 heads |
| Text Context | 1024 | 1024 |
| Embedding Dim | 384 | 768 |
| Parameters | 29M | 136M |

The audio waveform is converted to a log-mel spectrogram with 80 mel bins (the same as Whisper models of comparable size).
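
A minimal sketch of this preprocessing step, with parameter values taken from the public Whisper feature extractor (16 kHz input, 25 ms windows, 10 ms hops, the max-minus-8 clamp and rescale); TrorYongASR’s exact pipeline may differ:

```python
import numpy as np
import librosa

def log_mel_spectrogram(audio: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Whisper-style log-mel features: 80 mel bins, 25 ms windows, 10 ms hops."""
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=400, hop_length=160, n_mels=80
    )
    log_mel = np.log10(np.maximum(mel, 1e-10))
    # Clamp the dynamic range and rescale, as the Whisper code does.
    log_mel = np.maximum(log_mel, log_mel.max() - 8.0)
    return (log_mel + 4.0) / 4.0
```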

Metrics and Results

Language Detection

Language detection measures the model’s ability to recognize the spoken language from audio input. Since TrorYongASR currently supports two languages, this is a binary classification task. The classic metrics are used (a short scikit-learn sketch follows the list):

  • Precision: Proportion of predicted languages that are correct
  • Recall: Proportion of actual language samples correctly identified
  • F1-score: Harmonic mean of precision and recall
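
A minimal sketch of computing these metrics with scikit-learn, assuming per-example true and predicted language labels (the label names are illustrative):

```python
from sklearn.metrics import precision_recall_fscore_support

y_true = ["km", "km", "en", "en"]  # actual spoken languages (illustrative)
y_pred = ["km", "en", "en", "en"]  # languages predicted by the model

# One precision/recall/F1 triple per language.
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=["km", "en"], average=None, zero_division=0
)
```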

Results:

| Model | Metric | Khmer (fleurs) | English (librispeech.clean) |
|---|---|---|---|
| Tiny | Precision | 100% | 100% |
| Tiny | Recall | 100% | 100% |
| Tiny | F1-score | 100% | 100% |
| Small | Precision | 100% | 99% |
| Small | Recall | 96% | 100% |
| Small | F1-score | 98% | 99% |

The tiny model achieved perfect language detection on both datasets, indicating excellent binary classification between Khmer and English audio. The small model performs slightly worse, tending to over-predict English.

The 100% language detection scores may appear unusually high. This is expected because during pre-training, the model performs permutations on word tokens starting from position 3, while the first three positions (start token, language token, and task token) remain fixed. Since language detection relies on the language token at position 1, and this token is never permuted during pre-training, the model can achieve perfect accuracy on language detection tasks.

Transcription

For the transcription task, the three metrics below are used:

  • Token Error Rate (TER): Proportion of incorrectly transcribed tokens
  • Character Error Rate (CER): Proportion of characters that are incorrect
  • Word Error Rate (WER): Proportion of words that are incorrect

Token Error Rate measures the model’s ability to predict the next token given the audio input and the ground-truth prefix of tokens. It is a weaker metric than Word Error Rate (WER) and Character Error Rate (CER) because it does not account for insertions, deletions, and the error propagation of autoregressive decoding as comprehensively. Token Error Rate is used here because Khmer text lacks word boundaries, making WER and CER calculations challenging without additional preprocessing.
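
Under this definition, TER can be computed with teacher forcing: feed the ground-truth prefix at every position, take the argmax prediction, and count mismatches. A minimal sketch (the per-position logits interface is an assumption):

```python
import torch

def token_error_rate(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Teacher-forced TER: the fraction of positions whose argmax token
    differs from the ground-truth token, given the true prefix."""
    preds = logits.argmax(dim=-1)  # (T,) predicted token ids
    return (preds != targets).float().mean().item()
```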

Transcription Results:

| Model | Metric | Khmer (fleurs) | English (librispeech.clean) | Mixed (Khmer + English) |
|---|---|---|---|---|
| Tiny | WER | 75.88% | 54.33% | 60.36% |
| Tiny | CER | 54.99% | 42.41% | 46.18% |
| Tiny | TER | 54% | 17% | 27% |
| Small | WER | 50.46% | 21.75% | 29.78% |
| Small | CER | 35.89% | 16.58% | 22.37% |
| Small | TER | 43% | 8% | 18% |

Key Observations:

  • The tiny model performs noticeably better on English (54.33% WER, 42.41% CER, 17% TER) than on Khmer (75.88% WER, 54.99% CER, 54% TER)
  • The small model shows strong performance on English (21.75% WER, 16.58% CER, 8% TER) and moderate performance on Khmer (50.46% WER, 35.89% CER, 43% TER)
  • The larger model benefits from the increased embedding dimension (768 vs 384) and the deeper audio encoder (12 layers vs 4)

Note: To compute CER and WER, whitespace is inserted between Khmer words (Khmer script does not mark word boundaries the way English does). The khmercut PyPI package is used to segment Khmer text into words, which are then joined back together with spaces.
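
A minimal sketch of this preprocessing, assuming khmercut exposes a tokenize function as shown in its README, with jiwer computing the error rates:

```python
import jiwer
from khmercut import tokenize  # assumed API: returns a list of Khmer words

def segment(text: str) -> str:
    # Insert spaces between segmented Khmer words so that standard
    # word-level alignment applies.
    return " ".join(tokenize(text))

ref_text = "សួស្តីពិភពលោក"  # illustrative reference transcript
hyp_text = "សួស្តីពិភពលោក"  # illustrative model output

reference, hypothesis = segment(ref_text), segment(hyp_text)
wer = jiwer.wer(reference, hypothesis)  # word-level errors / reference words
cer = jiwer.cer(reference, hypothesis)  # char-level errors / reference chars
```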

WER Comparison with Whisper:

| Size | Model | Parameters | Khmer (fleurs) | English (librispeech.clean) |
|---|---|---|---|---|
| Tiny | TrorYongASR | 29M | 75.88% | 54.33% |
| Tiny | Whisper | 39M | 100.6% | 7.6% |
| Small | TrorYongASR | 136M | 50.46% | 21.75% |
| Small | Whisper | 244M | 104.4% | 3.4% |

Key Observations:

  • Whisper models have more parameters at comparable sizes (39M vs 29M for tiny, 244M vs 136M for small)
  • Whisper achieves significantly lower word error rates on English (7.6% vs 54.33% for tiny, 3.4% vs 21.75% for small)
  • Whisper performs worse on Khmer (100.6% vs 75.88% for tiny, 104.4% vs 50.46% for small)
  • Error rates above 100% are possible because insertions count as errors; for Whisper on Khmer they indicate that the model largely fails on a language barely represented in its training data

Note: Whisper’s WER numbers are taken from the original paper (Radford et al. 2023).

Result Summary

Language Detection: Both model sizes achieve near-perfect precision, recall, and F1-score on both datasets, indicating excellent binary classification between Khmer and English audio. As explained above, this is expected: the language token at position 1 is never permuted during pre-training, so the model receives effectively monotonic supervision for it.

Transcription: The small model shows strong performance on English (21.75% WER, 16.58% CER, 8% TER) and moderate performance on Khmer (50.46% WER, 35.89% CER, 43% TER). The tiny model performs worse on English (54.33% WER, 42.41% CER, 17% TER) and significantly worse on Khmer (75.88% WER, 54.99% CER, 54% TER). The gap between the two sizes suggests that TrorYongASR’s performance scales with model size.

Note on Translation Task: The models are also trained for the translation task, but its evaluation is deferred to future work due to data scarcity (the pre-training set contains only 2,000 Khmer-audio-to-English-text examples and 1,000 English-audio-to-Khmer-text examples).

Downstream Tasks and Limitations

TrorYongASR can be integrated into:

  • IoT devices: Smart speakers, voice assistants
  • Mobile applications: Android/iOS apps with speech recognition
  • Larger ASR systems: As a component in multi-language ASR pipelines

Due to the lack of suitable training data, TrorYongASR currently has the following limitations:

  • no timestamp prediction
  • no voice activity detection
  • translation performance is still limited

TrorYongASR Design

In sequence modeling, deep learning models are trained to generate future tokens conditioned on past tokens. Concretely, for a given audio \({\boldsymbol{x}}\) and its corresponding \(T\)-length sequence of text tokens \({\boldsymbol{y}}\), the likelihood under the standard autoregressive model is given by \[ \mathbb{P}_\theta({\boldsymbol{y}}|{\boldsymbol{x}}) = p_\theta(y_1|{\boldsymbol{x}})\times p_\theta(y_2|y_1, {\boldsymbol{x}})\times \dots \times p_\theta(y_T|y_{T-1},\dots, y_1, {\boldsymbol{x}}) =\prod_{t=1}^Tp_\theta(y_t|{\boldsymbol{y}}_{<t}, {\boldsymbol{x}}) \tag{1}\] where \(\theta\) denotes the parameters of the model. One seeks \(\theta\) that minimizes the negative log of Equation 1 \[\begin{equation} {\mathrm{NLL}}(\theta) = -\sum_{t=1}^T\log p_{\theta}(y_t|{\boldsymbol{y}}_{<t}, {\boldsymbol{x}}) \end{equation}\] This method is known as monotonic autoregressive generation.
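
In code, minimizing this NLL is ordinary teacher-forced cross-entropy. A minimal sketch, assuming a decoder that returns one row of logits per target position:

```python
import torch
import torch.nn.functional as F

def nll(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Teacher-forced NLL of Equation 1.
    logits: (T, vocab_size); targets: (T,) ground-truth token ids."""
    return F.cross_entropy(logits, targets, reduction="sum")
```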

Inspired by the PARSeq model (Bautista and Atienza 2022), TrorYongASR adapts Permutation Language Modeling (PLM) (Yang et al. 2019) for Automatic Speech Recognition (ASR). Let \({\mathcal{Z}}_T\) be the set of all possible permutations of \(\{1,2,\dots,T\}\). Without loss of generality, \(T\) can be fixed and omitted from the notation to aid readability. Let \({\boldsymbol{z}}\in{\mathcal{Z}}\) be a permutation that specifies an ordering of the text tokens in \({\boldsymbol{y}}\), and define

\[\begin{align*} \mathbb{P}_\theta({\boldsymbol{y}}_{{\boldsymbol{z}}}|{\boldsymbol{x}}) = p_\theta(y_{z_1}|{\boldsymbol{x}})\times p_\theta(y_{z_2}|y_{z_1}, {\boldsymbol{x}})\times \dots \times p_\theta(y_{z_T}|y_{z_{T-1}},\dots, y_{z_1}, {\boldsymbol{x}}) =\prod_{t=1}^Tp_\theta(y_{z_t}|{\boldsymbol{y}}_{{\boldsymbol{z}}_{<t}}, {\boldsymbol{x}}) \end{align*}\] as the likelihood for a given \({\boldsymbol{z}}\). So, Equation 1 is the case where \(z_t=t, \forall 1\le t\le T\).

TrorYongASR finds \(\theta\) that minimizes the expected negative log-likelihood

\[ {\mathbb{E}}_{{\boldsymbol{z}}\in{\mathcal{Z}}}\left[-\sum_{t=1}^T\log p_{\theta}(y_{z_t}|{\boldsymbol{y}}_{{\boldsymbol{z}}_{<t}}, {\boldsymbol{x}})\right] \tag{2}\]

TrorYongASR is thus more general than the monotonic autoregressive method, which predicts tokens in a single standard direction, namely left-to-right. Moreover, the objective in Equation 2 only requires modifying the Auditory-lingual Decoder (the Audio Encoder can follow that of the Whisper model). An overview of the TrorYongASR architecture is given in Figure 1.

Optimizing Equation 2 exactly means training over all \(T!\) factorizations, which is computationally prohibitive in practice. Following PARSeq, TrorYongASR approximates this by sampling 8 permutations for each audio. That is, TrorYongASR minimizes the following

\[ \widetilde{{\mathrm{NLL}}}(\theta) = \frac{1}{8} \sum_{1\le k\le 8}\left[-\sum_{t=1}^T\log p_{\theta}(y_{z_t^k}|{\boldsymbol{y}}_{{\boldsymbol{z}}^k_{<t}}, {\boldsymbol{x}})\right] \tag{3}\]
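
In code, Equation 3 is the same teacher-forced cross-entropy averaged over the 8 sampled orderings, each encoded as an attention mask (a mask-construction sketch appears after Table 1). The model interface below is hypothetical:

```python
import torch
import torch.nn.functional as F

def plm_loss(model, audio, tokens, targets, masks) -> torch.Tensor:
    """Equation 3: average teacher-forced NLL over K sampled permutations.
    masks: list of K boolean attention masks, one per permutation z^k."""
    losses = []
    for mask in masks:  # K = 8 in TrorYongASR
        logits = model(audio, tokens, attn_mask=mask)  # (T, vocab_size)
        losses.append(F.cross_entropy(logits, targets, reduction="sum"))
    return torch.stack(losses).mean()
```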

As mentioned in (Bautista and Atienza 2022), the permutation \({\boldsymbol{z}}^k\) is encapsulated by the attention mask. For example, the standard causal attention mask enforces the identity permutation. So, TrorYongASR also crafts the attention mask to enforce the ordering specified by \({\boldsymbol{z}}^k\). Table 1 gives an example with 4 permutations and their corresponding attention masks.

Figure 1: TrorYongASR architecture. Dropout layers are omitted due to space constraints; \([B]\), \([L]\), \([T]\), \([E]\), and \([P]\) are the begin-of-sequence, language, task, end-of-sequence, and padding tokens, respectively. This figure presents the case of 16 distinct target prediction positions. The QKV-projection is shown explicitly because, particularly for TrorYongASR, the single position basis \({\mathbf{p}}\) is used for each position to directly form the query projection (unlike the standard decoder of (Vaswani et al. 2017), no linear layer for the query projection is required), while the context vectors are the token embeddings of the input context and are projected to keys and values. Root-mean-squared normalization and RoPE are applied to the query and key projections. Attention masks are generated from the chosen permutations and are used only for the context-position attention. The last linear layer outputs logits over the vocabulary; these logits are then used to compute the cross-entropy loss.

In the following sections, I first explain how I integrate RoPE into TrorYongASR, since this is not obvious when query vectors contain purely positional information. Then I elaborate on attention-mask manipulation specifically for the ASR task, where the decoder requires not just a begin-of-sequence token (the sot token in Whisper) but also language and task tokens in order to decode an audio.

RoPE Integration in Auditory-lingual Decoder

In PARSeq (Bautista and Atienza 2022), the decoder takes 3 inputs: position, context, and the encoder output (image features in PARSeq, audio features in TrorYongASR). Its first Multi-head Attention module performs context-position attention: query vectors encode the target prediction positions, whereas context vectors combine semantic embeddings with positional information. This decouples the context from the target position, allowing the model to learn from PLM (see (Bautista and Atienza 2022) for more detail). PARSeq achieves this by using position embeddings as query vectors and adding position embeddings to token embeddings to form context vectors. However, RoPE is a rotational operator, so one must design query vectors such that each position holds the same neutral information, letting RoPE inject positional information right before the attention mechanism.

Position Basis

TrorYongASR uses a single embedding vector \({\mathbf{p}}\in {\mathbb{R}}^{d}\), where \(d\) is the embedding dimension, for every target prediction position to form the query projection directly. This vector is learnable and can be understood as a position basis. The context vectors are simply the token embeddings. Positional information is incorporated into queries and keys via the rotational operator. This preserves the decoupling concept from PARSeq while enabling RoPE for better performance.

The motivation for using the position basis \({\mathbf{p}}\) as the query projection is twofold. First, any single vector \({\mathbf{p}}'\in{\mathbb{R}}^{d}\) suffices to form the query vectors, because positional information is incorporated via the RoPE operation after the query projection. Second, let \(W_q\in{\mathbb{R}}^{d\times d}\) be the weights of the query linear layer; the query projection of any target prediction position is then \(W_q{\mathbf{p}}'\in{\mathbb{R}}^{d}\). Since both \(W_q\) and \({\mathbf{p}}'\) are learnable, one can save parameters by directly modeling the product \(W_q{\mathbf{p}}'\) as a single vector \({\mathbf{p}}\).
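
A minimal sketch of this design (the apply_rope helper follows the standard RoPE formulation; class and function names are illustrative):

```python
import torch
import torch.nn as nn

def apply_rope(x: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
    """Standard RoPE: rotate channel pairs of x by angles theta_i * pos."""
    d = x.shape[-1]
    theta = 10000.0 ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    ang = pos[:, None].float() * theta[None, :]  # (T, d/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

class PositionBasisQuery(nn.Module):
    """A single learnable vector p serves directly as the query projection
    for every target position; RoPE then makes each position distinct."""
    def __init__(self, d: int):
        super().__init__()
        self.p = nn.Parameter(torch.randn(d) * d ** -0.5)

    def forward(self, T: int) -> torch.Tensor:
        q = self.p.expand(T, -1)               # identical "neutral" query per position
        return apply_rope(q, torch.arange(T))  # positions injected by RoPE
```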

Null Context

In PARSeq’s implementation, the begin-of-sequence token, denoted \([B]\), is assigned as the null context and carries no positional information. TrorYongASR preserves this by applying the null rotation to the embedding of \([B]\). Consequently, from position 1 onward, keys receive a rotation that is one position behind the queries: position 0 at index 0 of the query projection and the language token \([L]\) at index 1 of the key projection receive the same rotation, as do position 1 at index 1 of the query projection and the task token \([T]\) at index 2 of the key projection, and so on.
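
Concretely, the rotation indices line up as in this small sketch:

```python
import torch

# Queries for target prediction positions 0..T-1 get rotations 0..T-1,
# while context keys ([B], [L], [T], y_1, ...) at index i get rotation
# max(i - 1, 0), so [B] receives the null (zero-angle) rotation.
T = 6
q_rot = torch.arange(T)                              # 0, 1, 2, 3, 4, 5
k_rot = torch.clamp(torch.arange(T + 1) - 1, min=0)  # 0, 0, 1, 2, 3, 4, 5
```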

Attention Mask Manipulation

Following Whisper, TrorYongASR has 3 start-of-text tokens: the begin-of-sequence, language, and task tokens. When generating permutations, these three tokens remain fixed and only the text tokens are permuted. Table 1 shows an example of 4 permutations of a three-element text sequence. Table 1 (a) corresponds to the standard autoregressive model with the standard causal attention mask.

Table 1: Illustration of AR attention masks for each permutation. The table header (with the \([B]\) token) pertains to the input context, while the header column (with the \([E]\) token at position \(6\)) corresponds to the output tokens. \(1\) means that the output token has conditional dependency on the corresponding input token. \(0\) means that no information flows from input to output. For instance, the position of \([L]\) has conditional dependency on token \([B]\), but no information flows from token \([L]\) to the position of \([L]\). Note that regardless of permutations, position \(6\) corresponding to token \([E]\) has information from all input tokens, and information of token \([B]\), \([L]\), and \([T]\) flow to all text token positions (in particular, information of token \([B]\) flows to all positions). For notation, \([B]\) is begin-of-sequence token, \([L]\) is language token, \([T]\) is task token, and \([E]\) is end-of-sequence token.
(a) \({\boldsymbol{z}}^1:= [1,2,3,4,5,6]\)

|         | \([B]\) | \([L]\) | \([T]\) | \(y_1\) | \(y_2\) | \(y_3\) |
|---------|---------|---------|---------|---------|---------|---------|
| \([L]\) | 1 | 0 | 0 | 0 | 0 | 0 |
| \([T]\) | 1 | 1 | 0 | 0 | 0 | 0 |
| \(y_1\) | 1 | 1 | 1 | 0 | 0 | 0 |
| \(y_2\) | 1 | 1 | 1 | 1 | 0 | 0 |
| \(y_3\) | 1 | 1 | 1 | 1 | 1 | 0 |
| \([E]\) | 1 | 1 | 1 | 1 | 1 | 1 |

(b) \({\boldsymbol{z}}^2:= [1,2,3,6,5,4]\)

|         | \([B]\) | \([L]\) | \([T]\) | \(y_1\) | \(y_2\) | \(y_3\) |
|---------|---------|---------|---------|---------|---------|---------|
| \([L]\) | 1 | 0 | 0 | 0 | 0 | 0 |
| \([T]\) | 1 | 1 | 0 | 0 | 0 | 0 |
| \(y_1\) | 1 | 1 | 1 | 0 | 1 | 1 |
| \(y_2\) | 1 | 1 | 1 | 0 | 0 | 1 |
| \(y_3\) | 1 | 1 | 1 | 0 | 0 | 0 |
| \([E]\) | 1 | 1 | 1 | 1 | 1 | 1 |

(c) \({\boldsymbol{z}}^3:= [1,2,3,4,6,5]\)

|         | \([B]\) | \([L]\) | \([T]\) | \(y_1\) | \(y_2\) | \(y_3\) |
|---------|---------|---------|---------|---------|---------|---------|
| \([L]\) | 1 | 0 | 0 | 0 | 0 | 0 |
| \([T]\) | 1 | 1 | 0 | 0 | 0 | 0 |
| \(y_1\) | 1 | 1 | 1 | 0 | 0 | 0 |
| \(y_2\) | 1 | 1 | 1 | 1 | 0 | 1 |
| \(y_3\) | 1 | 1 | 1 | 1 | 0 | 0 |
| \([E]\) | 1 | 1 | 1 | 1 | 1 | 1 |

(d) \({\boldsymbol{z}}^4:= [1,2,3,5,6,4]\)

|         | \([B]\) | \([L]\) | \([T]\) | \(y_1\) | \(y_2\) | \(y_3\) |
|---------|---------|---------|---------|---------|---------|---------|
| \([L]\) | 1 | 0 | 0 | 0 | 0 | 0 |
| \([T]\) | 1 | 1 | 0 | 0 | 0 | 0 |
| \(y_1\) | 1 | 1 | 1 | 0 | 1 | 1 |
| \(y_2\) | 1 | 1 | 1 | 0 | 0 | 0 |
| \(y_3\) | 1 | 1 | 1 | 0 | 1 | 0 |
| \([E]\) | 1 | 1 | 1 | 1 | 1 | 1 |

So the main implementation task is to generate the permutations \({\boldsymbol{z}}^k\) and derive the corresponding attention masks as in Table 1.
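
Below is a minimal sketch of that implementation, assuming positions 0-2 hold the fixed \([B]\), \([L]\), \([T]\) tokens; it reproduces the layout of Table 1:

```python
import torch

def sample_text_permutation(n: int) -> torch.Tensor:
    """Random ordering over the n text tokens only; [B], [L], [T] stay fixed."""
    return torch.randperm(n)

def attention_mask(perm: torch.Tensor) -> torch.Tensor:
    """Boolean mask as in Table 1. Rows are output positions
    ([L], [T], y_1..y_n, [E]); columns are context tokens
    ([B], [L], [T], y_1..y_n). True means attention is allowed."""
    n = perm.numel()
    rank = torch.empty(n, dtype=torch.long)
    rank[perm] = torch.arange(n)   # rank[i] = step at which y_{i+1} is predicted
    m = torch.zeros(n + 3, n + 3, dtype=torch.bool)
    m[0, 0] = True                 # [L] attends to [B] only
    m[1, :2] = True                # [T] attends to [B] and [L]
    m[2:2 + n, :3] = True          # every text position sees the fixed tokens
    # y_i may attend to y_j iff j is predicted before i under this ordering.
    m[2:2 + n, 3:] = rank[None, :] < rank[:, None]
    m[2 + n, :] = True             # [E] attends to the full context
    return m

# Example: perm = torch.tensor([2, 1, 0]) reproduces Table 1 (b),
# i.e. the ordering z^2 = [1, 2, 3, 6, 5, 4].
```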

References

Bautista, Darwin, and Rowel Atienza. 2022. “Scene Text Recognition with Permuted Autoregressive Sequence Models.” In European Conference on Computer Vision, 178–96. Springer.
Macháček, Dominik, Raj Dabre, and Ondřej Bojar. 2023. “Turning Whisper into Real-Time Transcription System.” In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: System Demonstrations, 17–24.
Radford, Alec, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. “Robust Speech Recognition via Large-Scale Weak Supervision.” In International Conference on Machine Learning, 28492–518. PMLR.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” Advances in Neural Information Processing Systems 30.
Yang, Zhilin, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. “Xlnet: Generalized Autoregressive Pretraining for Language Understanding.” Advances in Neural Information Processing Systems 32.

Citation

BibTeX citation:
@online{khun2026,
  author = {Khun, Kimang},
  title = {TrorYongASR: {Permuted} {AutoRegressive} {Sequence}
    {Modeling} for {Automatic} {Speech} {Recognition}},
  date = {2026-05-07},
  url = {https://kimang18.github.io/krorngai-blog/TrorYongASR/},
  langid = {en}
}
For attribution, please cite this work as:
Khun, Kimang. 2026. “TrorYongASR: Permuted AutoRegressive Sequence Modeling for Automatic Speech Recognition.” May 7, 2026. https://kimang18.github.io/krorngai-blog/TrorYongASR/.