TrorYongOCR: Encoder-Decoder Model for Scene Text Recognition
Abstract
TrorYongOCR is a tiny encoder-decoder model for the Scene Text Recognition task. It prepends the encodings of image patches to the “begin of sequence” token to condition the generation of the next character token. In LLM terms, the patch encodings can be seen as a prefill prompt. The single text decoder block of TrorYongOCR generates character tokens autoregressively from this prefill prompt, without a cross-attention mechanism. Moreover, TrorYongOCR can process input images of arbitrary aspect ratio. The current pre-trained weights support two languages: Khmer and English. The model is deployed on a Hugging Face Space here for demonstration.
Use Image Encoding as a Prefill Prompt
TrorYongOCR is designed as follows. Given \(L\) transformer blocks:
- \(L-1\) are encoding blocks that encode a given image
- the last block is a single decoding block without a cross-attention mechanism
- each transformer block is implemented in the exclusive self-attention style (Zhai 2026), with SwiGLU in the MLP
For the single decoding block,
- the latent state of the image (the output of the last encoder block) is concatenated with the input character embedding (token embedding, including the bos token) to create the context vector, i.e. the key and value vectors (think of it as a prefill prompt)
- the query vector is simply the input character embedding (token embedding).
TrorYongOCR architecture overview can be found in Figure 1.
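The decoding block described above can be sketched in PyTorch as follows. This is a minimal illustration, not the package's implementation: the function name `prefix_decoder_attention` and the weight arguments are hypothetical, it uses standard scaled dot-product attention rather than the exclusive self-attention variant, and it assumes causal masking among character tokens while every token may attend to all image patches.

```python
import torch
import torch.nn.functional as F


def prefix_decoder_attention(image_latent, token_emb, w_q, w_k, w_v, num_heads):
    """Attention of a single decoding block (hypothetical sketch).

    image_latent: (B, P, D) output of the last encoder block
    token_emb:    (B, T, D) character embeddings, starting with the bos token
    w_q, w_k, w_v: (D, D) projection matrices (assumed, for illustration)
    """
    B, P, D = image_latent.shape
    T = token_emb.shape[1]

    # Context = image encoding (the "prefill prompt") followed by the tokens.
    context = torch.cat([image_latent, token_emb], dim=1)  # (B, P + T, D)

    def split_heads(x):
        return x.view(B, -1, num_heads, D // num_heads).transpose(1, 2)

    q = split_heads(token_emb @ w_q)  # queries come from tokens only
    k = split_heads(context @ w_k)    # keys and values come from the context
    v = split_heads(context @ w_v)

    # True = may attend. Tokens see every patch, and earlier tokens only.
    mask = torch.cat(
        [torch.ones(T, P, dtype=torch.bool),
         torch.tril(torch.ones(T, T, dtype=torch.bool))],
        dim=1,
    )  # (T, P + T)

    out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
    return out.transpose(1, 2).reshape(B, T, D)
```

Because the queries contain only character tokens, the output has one vector per token, and no separate cross-attention layer is needed to read the image.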
Batching Images of Different Aspect Ratio Using Prefix-Padding
Images are treated as sequences of patches. When batching images, shorter sequences are prepended with padding patches so that the batch has a fixed shape (see Figure 2). A pad mask is derived and used in the attention mechanism to prevent image patches from attending to padding patches. Since the padding patches are added at the beginning of each sequence, the pad mask is also used when applying rotary embedding, so that image patches receive the correct rotation (i.e., padding patches all stay in position 1, as shown in Figure 3). With this approach, we can train a transformer-based model on images at their ‘native’ aspect ratio.
This approach also enables regularization techniques such as patch dropping with a variable drop rate, or resolution sampling with preserved aspect ratio.
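A minimal sketch of prefix-padding is given below. It is an assumption-laden illustration: the helper name `prefix_pad` is hypothetical, patch sequences are treated as one-dimensional, and the position indexing (padding clamped to position 1, real patches counted from 1) is one possible reading of the description, not necessarily what the package does.

```python
import torch


def prefix_pad(seqs):
    """Left-pad patch sequences to a common length (hypothetical helper).

    seqs: list of tensors of shape (N_i, D).
    Returns:
      batch     (B, N_max, D): padded patches, padding comes first
      mask      (B, N_max):    True for real patches, False for padding
      positions (B, N_max):    position ids to feed to the rotary embedding
    """
    n_max = max(s.shape[0] for s in seqs)
    dim = seqs[0].shape[1]
    batch = torch.zeros(len(seqs), n_max, dim)
    mask = torch.zeros(len(seqs), n_max, dtype=torch.bool)
    for i, s in enumerate(seqs):
        start = n_max - s.shape[0]
        batch[i, start:] = s
        mask[i, start:] = True

    # Count only real patches, so the first real patch always gets position 1
    # regardless of how much padding precedes it. Padding is clamped to 1.
    positions = torch.cumsum(mask.long(), dim=1).clamp(min=1)
    return batch, mask, positions
```

The mask can then be broadcast as a key mask, for example `mask[:, None, None, :]`, so that no query attends to a padding patch.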
Model Configuration
The model configuration is as follows. While preserving the aspect ratio, the input image is resized so that \(\min(H, W) = 32\), where \(H\) and \(W\) are the height and width of the image respectively. This reduces the computation cost of training, since images with high resolution and large aspect ratio produce very long patch sequences. The image patch size is \((8, 4)\), where \(8\) is along the width of the input image. The context length for the character sequence is up to \(1024\). The transformer configuration is as follows: there are \(4\) blocks, each with embedding dimension \(d_{model}=384\) and \(h=6\) heads. The encoding blocks (blocks \(1\) to \(3\)) have MLP dimension \(d_{MLP}=2 \times d_{model}=768\), and the decoding block has \(d_{MLP}=4 \times d_{model}=1536\) (see Table 1).
| Layer | \(d_{model}\) | \(h\) | \(d_{MLP}\) | Role |
|---|---|---|---|---|
| 1 | 384 | 6 | 768 | Encoder |
| 2 | 384 | 6 | 768 | Encoder |
| 3 | 384 | 6 | 768 | Encoder |
| 4 | 384 | 6 | 1536 | Decoder |
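To see how the resizing rule and the patch size determine the sequence length, the following sketch counts patches for a given image size. The function name `patch_grid` is hypothetical, and rounding the resized sides up to a multiple of the patch size is an assumption; the source does not state how remainders are handled.

```python
import math


def patch_grid(height, width, short_side=32, patch_w=8, patch_h=4):
    """Return (rows, cols, num_patches) after an aspect-preserving resize."""
    scale = short_side / min(height, width)
    # Assumption: round each resized side up to a whole number of patches.
    new_h = math.ceil(height * scale / patch_h) * patch_h
    new_w = math.ceil(width * scale / patch_w) * patch_w
    rows, cols = new_h // patch_h, new_w // patch_w
    return rows, cols, rows * cols


# A 64 x 256 image (aspect ratio 4) is resized to 32 x 128.
print(patch_grid(64, 256))   # (8, 16, 128)
# A 32 x 960 image (aspect ratio 30) keeps its size.
print(patch_grid(32, 960))   # (8, 120, 960)
```

The patch count grows linearly with the aspect ratio once the short side is fixed at 32, which is why the short side is kept small.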
Compared to PARSeq
The PARSeq model is an encoder-decoder model that implements Permuted Language Modeling (Yang et al. 2019): the text decoder uses the position embedding as the query vector, the character embedding (token embedding plus position embedding) as the context vector, and the latent state from the image encoder as memory for the cross-attention mechanism (see Figure 3 of Bautista and Atienza 2022). PARSeq’s encoder aims to extract image features that help the decoder’s character-image alignment. It uses a fixed resolution \((128, 32)\) (i.e., a fixed aspect ratio equal to \(4\)). This is significantly different from TrorYongOCR’s encoder, which tries to project image patches into the character embedding space. Moreover, TrorYongOCR can process input images of arbitrary aspect ratio.
Compared to DTrOCR
DTrOCR is a decoder-only model. The image embedding (i.e., patch embedding plus position embedding) is concatenated with the input character embedding to form the query vector. A [SEP] token is added at the beginning of the input character embedding to indicate sequence separation (the [SEP] token is equivalent to the bos token in TrorYongOCR). A causal self-attention mechanism is then applied to the query vector from layer to layer to generate characters autoregressively (see Figure 2 of Fujitake 2024). In contrast, TrorYongOCR concatenates the input character embedding with the image encoding to form the key and value vectors for the single decoder block.
Training Detail
TrorYongOCR is implemented as a PyPI package and can be installed via

    pip install tror-yong-ocr

The pre-trained weights of TrorYongOCR can be found here. They were obtained by pre-training on the seanghay/khmer-hanuman-100k and SoyVitou/KhmerSynthetic1M datasets and fine-tuning on the Khmer Scene Text dataset (Nom et al. 2024).
KhmerSynthetic1M
KhmerSynthetic1M is a dataset by Mr. Soy Vitou. It contains images in a monochromatic gray palette (black, white, gray, etc.). The distribution of the number of tokens per image is fairly uniform, and the maximum number of tokens is around \(120\). This implies that some images have an aspect ratio much higher than \(4\).
khmer-hanuman-100k
This dataset by Mr. Yat Seanghay contains images with a variety of background colors and character colors.
KhmerST: A Low-Resource Khmer Scene Text Detection and Recognition Benchmark
KhmerST is the first Khmer scene-text dataset consisting of:
- 1,544 annotated images
- 997 indoor scenes
- 547 outdoor scenes
It has diverse conditions:
- flat and raised text
- low illumination
- distant and partially occluded text.
The annotations are done at line-level with polygon bounding boxes.
To fine-tune TrorYongOCR, we crop the polygon bounding boxes to keep only the text regions. Then, we use a warp operation to transform each polygon into a rectangle.
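As an illustration of such a warp, the sketch below computes the \(3 \times 3\) projective transform that maps a four-point polygon onto an axis-aligned rectangle. It is only a sketch under assumptions: the source does not say which warp was used or how polygons with more than four points are handled, and the function names here are hypothetical. Resampling the pixels with the resulting matrix is left to an image library.

```python
import numpy as np


def quad_to_rect_homography(quad, out_w, out_h):
    """Solve for H so that H @ [x, y, 1] maps quad corners to rectangle corners.

    quad: four (x, y) points ordered top-left, top-right,
          bottom-right, bottom-left (assumed ordering).
    """
    dst = [(0, 0), (out_w - 1, 0), (out_w - 1, out_h - 1), (0, out_h - 1)]
    rows, rhs = [], []
    for (x, y), (u, v) in zip(quad, dst):
        # u = (h1 x + h2 y + h3) / (h7 x + h8 y + 1), and likewise for v.
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.append(v)
    h = np.linalg.solve(np.array(rows, dtype=float), np.array(rhs, dtype=float))
    return np.append(h, 1.0).reshape(3, 3)


def apply_homography(H, point):
    """Map a single (x, y) point through H."""
    x, y, w = H @ np.array([point[0], point[1], 1.0])
    return x / w, y / w
```

For example, with `quad = [(10, 20), (110, 30), (105, 70), (5, 60)]` and an output size of \(128 \times 32\), the first corner maps to \((0, 0)\) and the third to \((127, 31)\).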
Hyperparameters
The training hyperparameters, including the learning rate schedule, are given in Table 2.
| Batch size | Epochs | LR Schedule | Warmup |
|---|---|---|---|
| \(1024\) | \(16\) | linear \(\rightarrow\) cosine | 1% |
TrorYongOCR is trained using a hybrid MuonAdam optimizer.
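The "linear \(\rightarrow\) cosine" schedule with 1% warmup can be sketched as a function of the training step. This is a generic formulation, not the author's training code: the function name `lr_at` is hypothetical, and the peak learning rate and the final learning rate (`min_lr`, set to zero here) are assumptions since the source does not state them.

```python
import math


def lr_at(step, total_steps, peak_lr, warmup_frac=0.01, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

With `total_steps=1000` and `peak_lr=1.0`, the first 10 steps ramp up linearly, step 10 is at the peak, and the rate decays to `min_lr` by the final step.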
Weight Initialization
We initialize the weights following the schemes typically used in SOTA models. The code to initialize the weights is given below.
As an exception, the position embedding used in the decoding block is initialized with \(std=1.0\).

    from typing import Sequence

    import torch.nn as nn


    # Method of the model class; called once per submodule with its qualified name.
    def init_weights(self, module: nn.Module, name: str = '', exclude: Sequence[str] = ()):
        """Initialize the weights using the typical initialization schemes used in SOTA models."""
        # Skip modules whose name starts with any excluded prefix.
        if any(map(name.startswith, exclude)):
            return
        if isinstance(module, nn.Linear):
            nn.init.trunc_normal_(module.weight, std=0.02)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            nn.init.trunc_normal_(module.weight, std=0.02)
            # Keep the padding row at zero if the embedding has one.
            if module.padding_idx is not None:
                module.weight.data[module.padding_idx].zero_()
        elif isinstance(module, nn.Conv2d):
            nn.init.kaiming_normal_(module.weight)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
        elif isinstance(module, (nn.LayerNorm, nn.BatchNorm2d, nn.GroupNorm)):
            # Norm layers may be created without affine parameters or bias.
            if module.weight is not None:
                nn.init.ones_(module.weight)
            if module.bias is not None:
                nn.init.zeros_(module.bias)

References
Citation
@online{khun2026,
author = {KHUN, Kimang},
title = {TrorYongOCR: {Encoder-Decoder} {Model} for {Scene} {Text}
{Recognition}},
date = {2026-02-19},
url = {https://kimang18.github.io/krorngai-blog/TrorYongOCR/},
langid = {en}
}