TrorYongOCR: Encoder-Decoder Model for Scene Text Recognition

Science
Technology
Meet Your Tiny OCR Model
Author

Kimang KHUN

Published

February 19, 2026

Modified

June 15, 2026

Abstract

TrorYongOCR is a tiny encoder-decoder model for the Scene Text Recognition task. It prepends the encoding of image patches to the “begin of sequence” token to condition the generation of the next character token. In LLM terms, the patch encodings can simply be seen as a prefill prompt. The single text decoder block of TrorYongOCR generates character tokens autoregressively from this prefill prompt, without a cross-attention mechanism. Moreover, TrorYongOCR can process input images of arbitrary aspect ratio. The current pre-trained weights support two languages: Khmer and English. The model is deployed on a Hugging Face Space here for demonstration.

Use Image Encoding as a Prefill Prompt

TrorYongOCR is designed as follows. Given \(L\) transformer blocks:

  • \(L-1\) are encoding blocks that encode a given image
  • the last block is a single decoding block without cross-attention mechanism
  • each transformer block uses exclusive self-attention (Zhai 2026) and SwiGLU in the MLP

For the single decoding block,

  • the latent state of an image (the output of the last encoder block) is concatenated with the input character embedding (token embedding, including the bos token) to create the context, i.e. the key and value vectors (think of it as a prefill prompt)
  • the query vector is simply the input character embedding (token embedding).
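The two bullets above can be sketched in plain Python. This is a minimal illustration, not the package's implementation: it assumes a single head, plain softmax attention (not the exclusive self-attention variant), no learned projections, no rotary embedding, and no pad mask. The function name `decoder_attention` is hypothetical.

```python
import math

def decoder_attention(image_enc, char_emb):
    """Single-head attention: queries are characters, keys/values are [image ; characters]."""
    P, T = len(image_enc), len(char_emb)
    context = image_enc + char_emb          # prefill prompt followed by character tokens
    d = len(char_emb[0])
    outputs, weights = [], []
    for t, q in enumerate(char_emb):
        # Character t sees every image position and characters 0..t (causal).
        visible = P + t + 1
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in context[:visible]]
        m = max(scores)
        exp = [math.exp(s - m) for s in scores]
        z = sum(exp)
        w = [e / z for e in exp] + [0.0] * (T - t - 1)   # future characters get zero weight
        weights.append(w)
        outputs.append([sum(w[j] * context[j][i] for j in range(P + T)) for i in range(d)])
    return outputs, weights

image_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # P = 3 patch encodings
char_emb = [[0.5, 0.5], [1.0, -1.0]]               # T = 2 tokens, the first being bos
out, w = decoder_attention(image_enc, char_emb)
```

Each row of `w` has length P + T = 5, while there are only T = 2 rows: the image positions act purely as context and never as queries.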

Figure 1 gives an overview of the TrorYongOCR architecture.

Figure 1: TrorYongOCR architecture overview. The input image is transformed into patch embeddings. The embeddings are passed through \(L-1\) encoder blocks to generate the image encoding (latent state). The image encoding is concatenated with the character embedding (i.e., token embedding) before undergoing the causal self-attention mechanism in the single decoder block, which generates the latent state of the character embedding. Finally, the linear layer projects the normalized latent state into logits over the character set.

Batching Images of Different Aspect Ratios Using Prefix-Padding

Images are treated as sequences of patches. When batching images, the shorter sequences are prepended with padding patches to obtain a fixed batch shape (see Figure 2). A pad mask is derived and used in the attention mechanism to prevent image patches from attending to padding patches. Since the padding patches are added at the beginning of the sequences, the pad mask is also used when applying the rotary embedding, so that image patches receive the correct rotation (i.e., padding patches all stay in position 1 as shown in Figure 3). With this approach, we can train a transformer-based model on images at their ‘native’ aspect ratio.
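The batching step can be sketched as follows. This is an illustration under stated assumptions rather than the package's code: `prefix_pad` is a hypothetical helper, positions are one-dimensional and 0-indexed (so the article's "position 1" corresponds to index 0, the zero-rotation position), and real patches are numbered from their first real patch.

```python
def prefix_pad(sequences, pad):
    """Left-pad patch sequences to a common length; return batch, pad mask, and positions."""
    max_len = max(len(s) for s in sequences)
    batch, mask, positions = [], [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        batch.append([pad] * n_pad + list(seq))
        mask.append([False] * n_pad + [True] * len(seq))   # True marks a real patch
        # Assumption: pads keep index 0 (zero rotation), real patches count 0, 1, 2, ...
        positions.append([0] * n_pad + list(range(len(seq))))
    return batch, mask, positions

batch, mask, positions = prefix_pad([["a1", "a2", "a3", "a4"], ["b1", "b2"]], pad="PAD")
```

Because the padding sits in front, the last real patch is always the last element of each row, so character positions can continue directly after it for every sequence in the batch.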

Figure 2: Example of prefix-padding, which enables images with different aspect ratios to be batched together. First, images are patchified. The shorter sequences are then prepended with padding patches so that every sequence has the same length. Prefix-padding is specific to the architecture of TrorYongOCR, where the image encoding plays a role similar to a prefill prompt in LLMs. By padding at the beginning of the patch sequence, we ensure that positions continue without a gap when the image encoding is concatenated with the character embedding. We rely on the attention mask to prevent image patches from attending to padding patches.

Figure 3: TrorYongOCR process flow. The batched sequences of patches are projected into the embedding space. These patch embeddings are represented inside the dashed rectangle at the bottom. Encoder blocks rely on the patch pad mask to prevent image patches from attending to padding patches. The image encodings are then concatenated with the character embeddings, as shown inside the dashed rectangle in the middle. The decoder block processes this concatenation using the causal attention mask and the pad mask. This flow generates the latent state of each character embedding, shown in the dashed rectangle at the top. This latent state is then root-mean-square normalized and processed by the final linear layer to output logits for the next character token. Note that the pad embeddings or encodings (which are all zeros in practice) all stay at position 1 (i.e., they receive no rotation when the rotary embedding is applied).

This approach enables regularization techniques such as patch dropping with a variable drop rate, or resolution sampling with preserved aspect ratio.
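As an illustration of patch dropping with a variable drop rate, here is a minimal sketch. The article does not say how the rate is sampled; drawing it uniformly from \([0, \text{max\_drop\_rate}]\) is an assumption, and `drop_patches` is a hypothetical helper.

```python
import random

def drop_patches(patches, max_drop_rate, rng):
    """Drop a random subset of patches, keeping the survivors in their original order."""
    rate = rng.uniform(0.0, max_drop_rate)            # assumed: one drop rate per sample
    n_keep = max(1, round(len(patches) * (1.0 - rate)))
    keep = sorted(rng.sample(range(len(patches)), n_keep))
    return [patches[i] for i in keep]

rng = random.Random(0)
patches = list(range(20))
kept = drop_patches(patches, max_drop_rate=0.5, rng=rng)
```

The surviving sequences have different lengths from one sample to the next, which is exactly the case prefix-padding already handles at batching time.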

Model Configuration

The model configuration is as follows. While preserving the aspect ratio, the input image is resized so that \(\min(H, W) = 32\), where \(H\) and \(W\) are the height and width of the image respectively. This reduces the computation cost of training, since images with high resolution and large aspect ratio produce very long sequences of patches. The image patch size is \((8, 4)\), where \(8\) is along the width of the input image. The context length for the character sequence is up to \(1024\). The transformer has \(4\) blocks, each with embedding dimension \(d_{model}=384\) and \(h=6\) heads. In particular, the encoding blocks (blocks \(1\) to \(3\)) have MLP dimension \(d_{MLP}=2 \times d_{model}=768\) and the decoding block has \(d_{MLP}=4 \times d_{model}=1536\) (see Table 1).
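To get a feel for the sequence lengths this configuration produces, the sketch below computes the patch grid for a given image size. The helper `patch_grid` is hypothetical, and rounding partial border patches up with `ceil` is an assumption; the article does not say how sizes that are not multiples of the patch size are handled.

```python
import math

def patch_grid(height, width, short_side=32, patch_h=4, patch_w=8):
    """Return (rows, cols, n_patches) after an aspect-preserving resize."""
    scale = short_side / min(height, width)
    new_h, new_w = round(height * scale), round(width * scale)
    # Assumption: partial patches at the border are padded, hence ceil.
    rows, cols = math.ceil(new_h / patch_h), math.ceil(new_w / patch_w)
    return rows, cols, rows * cols

ratio_4 = patch_grid(32, 128)     # aspect ratio 4
ratio_20 = patch_grid(64, 1280)   # aspect ratio 20, resized to 32 x 640
```

With a fixed short side, the number of patches grows linearly with the aspect ratio: 128 patches at ratio 4 and 640 patches at ratio 20.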

Table 1: Configuration of Transformer Blocks of TrorYongOCR
Layer \(d_{model}\) \(h\) \(d_{MLP}\) Role
1 384 6 768 Encoder
2 384 6 768 Encoder
3 384 6 768 Encoder
4 384 6 1536 Decoder
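The table can also be written down as a small configuration object. The names `BlockConfig` and `mlp_ratio` are hypothetical and not taken from the tror-yong-ocr package; the sketch only derives the MLP width and per-head dimension from the numbers stated above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BlockConfig:
    role: str
    d_model: int = 384
    n_heads: int = 6
    mlp_ratio: int = 2          # encoder blocks use 2, the decoder block uses 4

    @property
    def d_mlp(self):
        return self.mlp_ratio * self.d_model

    @property
    def head_dim(self):
        assert self.d_model % self.n_heads == 0
        return self.d_model // self.n_heads

blocks = [BlockConfig("encoder")] * 3 + [BlockConfig("decoder", mlp_ratio=4)]
```

Every block has a head dimension of 384 / 6 = 64, and only the decoder block doubles the MLP width relative to the encoders.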

Compared to PARSeq

PARSeq is an encoder-decoder model that implements Permuted Language Modeling (Yang et al. 2019): the text decoder uses the position embedding as the query vector, the character embedding (token embedding plus position embedding) as the context vector, and the latent state from the image encoder as memory for the cross-attention mechanism (see Figure 3 of Bautista and Atienza 2022). PARSeq’s encoder aims to extract image features that help the decoder’s character-image alignment. It uses a fixed resolution of \((128, 32)\) (i.e., a fixed aspect ratio equal to \(4\)). This differs significantly from TrorYongOCR’s encoder, which tries to project image patches into the character embedding space. Moreover, TrorYongOCR can process input images of arbitrary aspect ratio.

Compared to DTrOCR

DTrOCR is a decoder-only model. The image embedding (i.e., patch embedding plus position embedding) is concatenated with the input character embedding to form the query vector. A [SEP] token is added at the beginning of the input character embedding to indicate sequence separation (the [SEP] token is equivalent to the bos token in TrorYongOCR). The causal self-attention mechanism is then applied to the query vector from layer to layer to generate characters autoregressively (see Figure 2 of Fujitake 2024). In contrast, TrorYongOCR concatenates the input character embedding with the image encoding to form only the key and value vectors of the single decoder block, while the query vector is the character embedding alone.
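One way to see the difference is to compare the shape of the attention score matrix (queries by keys). The sketch below is simple arithmetic derived from the description above, with hypothetical names; for TrorYongOCR it covers the decoder block only, since the encoder blocks attend among patches separately.

```python
def score_shapes(n_patches, n_chars):
    """Attention score matrix shapes (queries, keys) for one layer or block."""
    n = n_patches + n_chars
    return {
        "decoder_only": (n, n),              # every position is a query
        "prefill_decoder": (n_chars, n),     # only characters are queries
    }

shapes = score_shapes(n_patches=128, n_chars=25)
```

For 128 patches and 25 characters, the concatenated-query layout scores 153 queries against 153 keys, whereas the prefill layout scores 25 queries against the same 153 keys.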

Training Details

TrorYongOCR is implemented as a PyPI package and can be installed via

pip install tror-yong-ocr

The pre-trained weights of TrorYongOCR can be found here. They are obtained by pre-training on the seanghay/khmer-hanuman-100k and SoyVitou/KhmerSynthetic1M datasets and fine-tuning on the Khmer Scene Text dataset (Nom et al. 2024).

KhmerSynthetic1M

KhmerSynthetic1M is a dataset by Mr. Soy Vitou. This dataset contains images in a gray monochromatic color palette (black, white, gray, etc.). The distribution of the number of tokens per image is fairly uniform, with a maximum of around \(120\) tokens. This implies that some images have an aspect ratio much higher than \(4\).

khmer-hanuman-100k

This dataset by Mr. Yat Seanghay contains images with a variety of background colors and character colors.

KhmerST: A Low-Resource Khmer Scene Text Detection and Recognition Benchmark

KhmerST is the first Khmer scene-text dataset consisting of:

  • 1,544 annotated images
  • 997 indoor scenes
  • 547 outdoor scenes

It has diverse conditions:

  • flat and raised text
  • low illumination
  • distant and partially occluded text.

The annotations are made at line level with polygon bounding boxes.

To fine-tune TrorYongOCR, we crop the polygon bounding boxes to keep only the text regions. Then, we use a warp operation to transform each polygon into a rectangle.
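The article does not specify the warp. One common choice is a perspective warp, which first needs the size of the target rectangle. The sketch below computes that size for a four-corner polygon; the helper `target_rect_size` and the corner ordering are assumptions, and polygons with more than four points would need a different treatment.

```python
import math

def target_rect_size(quad):
    """quad: four (x, y) corners ordered top-left, top-right, bottom-right, bottom-left."""
    tl, tr, br, bl = quad
    # Use the longer of each pair of opposite edges so no text is squeezed.
    width = max(math.dist(tl, tr), math.dist(bl, br))
    height = max(math.dist(tl, bl), math.dist(tr, br))
    return round(width), round(height)

quad = [(10, 20), (110, 30), (108, 62), (8, 52)]   # a slightly rotated text line
w, h = target_rect_size(quad)
```

With OpenCV, the quad and the corners of a `w` by `h` rectangle could then be passed to `cv2.getPerspectiveTransform`, and the resulting matrix to `cv2.warpPerspective`, to obtain the rectified text image.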

Hyperparameters

The hyperparameters related to the learning rate are given in Table 2.

Table 2: Training hyperparameters. The learning rate follows a warmup phase covering 1% of the total steps, during which it increases linearly to \(lr_{max}\), followed by a cosine annealing phase during which it decreases to \(lr_{min}=10^{-5}\).
Batch size Epochs LR Schedule Warmup
\(1024\) \(16\) linear \(\rightarrow\) cosine 1%
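The schedule in Table 2 can be sketched as a function of the step index. This is a minimal illustration: the article does not give \(lr_{max}\) or the learning rate at the first warmup step, so the value `1e-3` and the choice to start at \(lr_{max}\) divided by the number of warmup steps are assumptions, and `lr_at` is a hypothetical helper.

```python
import math

def lr_at(step, total_steps, lr_max, lr_min=1e-5, warmup_frac=0.01):
    """Linear warmup to lr_max, then cosine annealing down to lr_min."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return lr_max * (step + 1) / warmup_steps
    # Fraction of the annealing phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps - 1)
    progress = min(progress, 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

total = 10_000
schedule = [lr_at(s, total, lr_max=1e-3) for s in range(total)]   # lr_max is hypothetical
```

In this example the first 100 steps ramp up to `lr_max`, and the remaining steps decay along a half cosine until the last step reaches `lr_min`.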

TrorYongOCR is trained using a hybrid MuonAdam optimizer.

Weight Initialization

The weights are initialized as SOTA models regularly do. As an exception, the position embedding used in the decoding block is initialized with \(std=1.0\).

The code for the general initialization is given below.

from typing import Sequence

import torch.nn as nn


# Defined as a method on the model class, hence the `self` parameter.
def init_weights(self, module: nn.Module, name: str = '', exclude: Sequence[str] = ()):
    """Initialize the weights using the typical initialization schemes used in SOTA models."""
    # Skip modules whose qualified name starts with any excluded prefix.
    if any(map(name.startswith, exclude)):
        return
    if isinstance(module, nn.Linear):
        nn.init.trunc_normal_(module.weight, std=0.02)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        nn.init.trunc_normal_(module.weight, std=0.02)
        # Keep the padding embedding at exactly zero.
        if module.padding_idx is not None:
            module.weight.data[module.padding_idx].zero_()
    elif isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, (nn.LayerNorm, nn.BatchNorm2d, nn.GroupNorm)):
        # Affine parameters may be absent (e.g. elementwise_affine=False).
        if module.weight is not None:
            nn.init.ones_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

References

Bautista, Darwin, and Rowel Atienza. 2022. “Scene Text Recognition with Permuted Autoregressive Sequence Models.” In European Conference on Computer Vision, 178–96. Springer.
Fujitake, Masato. 2024. “DTrOCR: Decoder-Only Transformer for Optical Character Recognition.” In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 8025–35.
Nom, Vannkinh, Souhail Bakkali, Muhammad Muzzamil Luqman, Mickaël Coustaty, and Jean-Marc Ogier. 2024. “KhmerST: A Low-Resource Khmer Scene Text Detection and Recognition Benchmark.” In Proceedings of the Asian Conference on Computer Vision, 1777–92.
Yang, Zhilin, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. “XLNet: Generalized Autoregressive Pretraining for Language Understanding.” Advances in Neural Information Processing Systems 32.
Zhai, Shuangfei. 2026. “Exclusive Self Attention.” arXiv Preprint arXiv:2603.09078.

Citation

BibTeX citation:
@online{khun2026,
  author = {KHUN, Kimang},
  title = {TrorYongOCR: {Encoder-Decoder} {Model} for {Scene} {Text}
    {Recognition}},
  date = {2026-02-19},
  url = {https://kimang18.github.io/krorngai-blog/TrorYongOCR/},
  langid = {en}
}
For attribution, please cite this work as:
KHUN, Kimang. 2026. “TrorYongOCR: Encoder-Decoder Model for Scene Text Recognition.” February 19, 2026. https://kimang18.github.io/krorngai-blog/TrorYongOCR/.