MomentUm Orthogonalized by Newton-Schulz

Train Faster with Fewer GPUs and Great Performance

KHUN Kimang

ក្រង AI

2026-03-10

Summary

Muon

Muon is an optimizer that performs the following operations:

  1. Compute the gradient of the parameters \({\mathbf{W}}\)
  2. Update the momentum matrix \({\mathbf{M}}\) using the gradient
  3. Compute the orthogonal matrix \({\mathbf{O}}\) of \({\mathbf{M}}\)
  4. Update the parameters \({\mathbf{W}}\) using \({\mathbf{O}}\)

Muon performs better than Adam

Figure 1: Scaling-law experiments comparing Muon and Adam. Muon is \(\approx 2\times\) more computationally efficient than Adam under compute-optimal training.

Muon improves the training stability of the Kimi K2 language model

Figure 2: Per-step training loss curve of Kimi K2, without smoothing or sub-sampling. It shows no spikes throughout the entire training process. Note that we omit the very beginning of training for clarity.

Definition

Linear Layer

  • parameters: \({\mathbf{W}}\in{\mathbb{R}}^{k\times d}\)
  • forward pass: \(\hat{{\boldsymbol{y}}}={\mathbf{W}}{\boldsymbol{x}}\), where \({\boldsymbol{x}}\in{\mathbb{R}}^{d}\)

For example, in case \(k=3\) and \(d=4\), we have \[ \begin{bmatrix} \hat{y}_{1} \\ \hat{y}_{2} \\ \hat{y}_{3} \end{bmatrix} = \begin{bmatrix} w_{11} & w_{12} & w_{13} & w_{14} \\ w_{21} & w_{22} & w_{23} & w_{24} \\ w_{31} & w_{32} & w_{33} & w_{34} \end{bmatrix} \begin{bmatrix} x_{1} \\ x_{2} \\ x_{3} \\ x_{4} \end{bmatrix} \]

  • compute loss: \(L({\boldsymbol{y}}, \hat{{\boldsymbol{y}}})\) where \({\boldsymbol{y}}\in{\mathbb{R}}^{k}\) is the ground-truth
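As a concrete sketch of the forward pass and loss for \(k=3, d=4\) in NumPy (the squared-error loss is an assumed example; the slides leave \(L\) generic):

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 3, 4

W = rng.standard_normal((k, d))  # parameters W in R^{k x d}
x = rng.standard_normal(d)       # input x in R^d
y = rng.standard_normal(k)       # ground-truth y in R^k

y_hat = W @ x                          # forward pass: y_hat = W x
loss = 0.5 * np.sum((y_hat - y) ** 2)  # example squared-error loss L(y, y_hat)
```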

Gradient Descent

  • backward pass: \(\frac{\partial L}{\partial w_{ij}}\) for \(1\le i\le k\) and \(1\le j\le d\)
  • update: \[w_{ij}^{\mathrm{new}} \leftarrow w_{ij}^{\mathrm{old}} - \alpha \frac{\partial L}{\partial w_{ij}}\]

where \(\alpha\) is the learning rate

Gradient Descent

  • backward pass: \(\frac{\partial L}{\partial w_{ij}}\) for \(1\le i\le k\) and \(1\le j\le d\)
  • update: \[w_{ij} \leftarrow w_{ij} - \alpha \frac{\partial L}{\partial w_{ij}}\]

For \(k=3\) and \(d=4\), we have \[ {\mathbf{W}} \leftarrow \begin{bmatrix} w_{11} & w_{12} & w_{13} & w_{14} \\ w_{21} & w_{22} & w_{23} & w_{24} \\ w_{31} & w_{32} & w_{33} & w_{34} \end{bmatrix} - \alpha \begin{bmatrix} \frac{\partial L}{\partial w_{11}} & \frac{\partial L}{\partial w_{12}} & \frac{\partial L}{\partial w_{13}} & \frac{\partial L}{\partial w_{14}} \\ \frac{\partial L}{\partial w_{21}} & \frac{\partial L}{\partial w_{22}} & \frac{\partial L}{\partial w_{23}} & \frac{\partial L}{\partial w_{24}} \\ \frac{\partial L}{\partial w_{31}} & \frac{\partial L}{\partial w_{32}} & \frac{\partial L}{\partial w_{33}} & \frac{\partial L}{\partial w_{34}} \end{bmatrix} \]
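The matrix update above can be run end-to-end. Assuming the squared-error loss \(L=\tfrac12\lVert{\mathbf{W}}{\boldsymbol{x}}-{\boldsymbol{y}}\rVert^2\) (an illustrative choice, not fixed by the slides), the full gradient matrix is the outer product \((\hat{{\boldsymbol{y}}}-{\boldsymbol{y}}){\boldsymbol{x}}^{\top}\):

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, alpha = 3, 4, 0.01   # alpha: learning rate

W = rng.standard_normal((k, d))
x = rng.standard_normal(d)
y = rng.standard_normal(k)

def loss(W):
    return 0.5 * np.sum((W @ x - y) ** 2)

loss_before = loss(W)
grad = np.outer(W @ x - y, x)  # dL/dw_ij = (y_hat_i - y_i) * x_j
W = W - alpha * grad           # gradient-descent update on the whole matrix
loss_after = loss(W)           # smaller than loss_before for a small enough alpha
```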

Adaptive Moment Estimation (Adam)

  • backward pass: \(\frac{\partial L}{\partial w_{ij}}\) for \(1\le i\le k\) and \(1\le j\le d\)
  • update momentum \({\mathbf{M}}\in{\mathbb{R}}^{k\times d}\) where \[m_{ij} \leftarrow \beta_1 m_{ij} + (1-\beta_1) \frac{\partial L}{\partial w_{ij}}\]
  • update the second-moment estimate \({\mathbf{V}}\in{\mathbb{R}}^{k\times d}\) (Root Mean Squared propagation, RMSprop) where \[\nu_{ij} \leftarrow \beta_2 \nu_{ij} + (1-\beta_2) \left(\frac{\partial L}{\partial w_{ij}}\right)^2\]
  • update: \[w_{ij} \leftarrow w_{ij} - \alpha \frac{m_{ij}}{\sqrt{\nu_{ij}} + \varepsilon}\]
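A minimal element-wise sketch of this Adam step in NumPy (the bias correction from the original Adam paper is omitted, matching the slide; hyperparameter defaults are the conventional ones):

```python
import numpy as np

def adam_step(W, grad, M, V, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update as written on the slide (no bias correction)."""
    M = beta1 * M + (1 - beta1) * grad       # first moment (momentum)
    V = beta2 * V + (1 - beta2) * grad ** 2  # second moment (RMSprop)
    W = W - alpha * M / (np.sqrt(V) + eps)   # element-wise scaled step
    return W, M, V
```

Note that \({\mathbf{M}}\) and \({\mathbf{V}}\) are both the same shape as \({\mathbf{W}}\), which is why Adam's optimizer state is twice the parameter size.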

Concept of Muon

  • backward pass: \(\frac{\partial L}{\partial w_{ij}}\) for \(1\le i\le k\) and \(1\le j\le d\)
  • update momentum \({\mathbf{M}}\in{\mathbb{R}}^{k\times d}\) where \[m_{ij} \leftarrow \beta_1 m_{ij} + (1-\beta_1) \frac{\partial L}{\partial w_{ij}}\]
  • compute orthogonal of \({\mathbf{M}}\): \[\mathrm{Ortho}({\mathbf{M}}) := \arg\min_{{\mathbf{O}}}{||{\mathbf{O}}-{\mathbf{M}}||} \mathrm{\ subject\ to:\ either\ } {\mathbf{O}}{\mathbf{O}}^{\top} = {\mathbf{I}}_{k\times k} \mathrm{\ or\ } {\mathbf{O}}^{\top}{\mathbf{O}}= {\mathbf{I}}_{d\times d} \tag{1}\]
  • update \({\mathbf{W}}\leftarrow {\mathbf{W}}-\alpha \mathrm{Ortho}({\mathbf{M}})\)

Concept of Muon

  • Equation 1 has one known solution: \({\mathbf{O}}={\mathbf{U}}{\mathbf{I}}_{k\times d}{\mathbf{V}}^{\top}\) where \({\mathbf{M}}={\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top}\),
    • \({\mathbf{S}}\) is diagonal
    • \({\mathbf{U}}\) and \({\mathbf{V}}\) are orthonormal
    • \({\mathbf{U}}{\mathbf{U}}^{\top} ={\mathbf{U}}^{\top}{\mathbf{U}}={\mathbf{I}}_{k\times k}\) and \({\mathbf{V}}{\mathbf{V}}^{\top} ={\mathbf{V}}^{\top}{\mathbf{V}}={\mathbf{I}}_{d\times d}\)
  • \({\mathbf{S}}\), \({\mathbf{U}}\), and \({\mathbf{V}}\) can be computed by Singular Value Decomposition (SVD)
  • SVD is computationally expensive
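The SVD solution can be checked numerically. A sketch using NumPy's reduced SVD; for \(k\le d\) the result satisfies \({\mathbf{O}}{\mathbf{O}}^{\top}={\mathbf{I}}_{k\times k}\):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 4))  # momentum matrix, k=3, d=4

U, s, Vt = np.linalg.svd(M, full_matrices=False)  # M = U diag(s) Vt

# O = U I_{k x d} V^T: keep the directions of M, set all singular values to 1
O = U @ Vt
```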

Design of Muon

  • let \(\phi(x):=ax + bx^3 + cx^5\) be a quintic polynomial
  • applying \(\phi\) to \({\mathbf{M}}\), with matrix powers taken as \({\mathbf{M}}^{3}:=({\mathbf{M}}{\mathbf{M}}^{\top}){\mathbf{M}}\) and \({\mathbf{M}}^{5}:=({\mathbf{M}}{\mathbf{M}}^{\top})^{2}{\mathbf{M}}\), gives

\[ \begin{aligned} a{\mathbf{M}}+ b({\mathbf{M}}{\mathbf{M}}^{\top}){\mathbf{M}}+ c({\mathbf{M}}{\mathbf{M}}^{\top})^2{\mathbf{M}} &= a{\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} +b \left({\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top}({\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top})^{\top}\right){\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} +c\left({\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top}({\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top})^{\top}\right)^2{\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} \\ &= a{\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} +b \left({\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top}{\mathbf{V}}({\mathbf{U}}{\mathbf{S}})^{\top}\right){\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} +c\left({\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top}({\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top})^{\top}\right)^2{\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} \\ &= a{\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} +b \left({\mathbf{U}}{\mathbf{S}}({\mathbf{U}}{\mathbf{S}})^{\top}\right){\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} +c\left({\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top}({\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top})^{\top}\right)^2{\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} \\ &= a{\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} +b \left({\mathbf{U}}{\mathbf{S}}{\mathbf{S}}^{\top}{\mathbf{U}}^{\top}\right){\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} +c\left({\mathbf{U}}{\mathbf{S}}{\mathbf{S}}^{\top}{\mathbf{U}}^{\top}\right)\left({\mathbf{U}}{\mathbf{S}}{\mathbf{S}}^{\top}{\mathbf{U}}^{\top}\right){\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} \\ &= a{\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} +b {\mathbf{U}}{\mathbf{S}}^{3}{\mathbf{V}}^{\top} +c\left({\mathbf{U}}{\mathbf{S}}^{2}{\mathbf{U}}^{\top}\right){\mathbf{U}}{\mathbf{S}}^{3}{\mathbf{V}}^{\top} \\ &= a{\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} +b {\mathbf{U}}{\mathbf{S}}^{3}{\mathbf{V}}^{\top} +c{\mathbf{U}}{\mathbf{S}}^{5}{\mathbf{V}}^{\top} \\ &= {\mathbf{U}}\left(a{\mathbf{S}}+b {\mathbf{S}}^{3} 
+c{\mathbf{S}}^{5}\right){\mathbf{V}}^{\top} \\ &= {\mathbf{U}}\phi({\mathbf{S}}){\mathbf{V}}^{\top} \end{aligned} \]
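The identity \(\phi({\mathbf{M}})={\mathbf{U}}\phi({\mathbf{S}}){\mathbf{V}}^{\top}\) can be verified numerically, where \(\phi({\mathbf{S}})\) acts on the singular values (the coefficients here are arbitrary example values):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 4))
a, b, c = 1.5, -0.5, 0.1  # arbitrary example coefficients

U, s, Vt = np.linalg.svd(M, full_matrices=False)  # M = U diag(s) Vt

# phi(M) with matrix powers M^3 := (M M^T) M and M^5 := (M M^T)^2 M
MMt = M @ M.T
lhs = a * M + b * MMt @ M + c * MMt @ MMt @ M

# U phi(S) V^T: phi applied to each singular value
rhs = U @ np.diag(a * s + b * s**3 + c * s**5) @ Vt
```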

Design of Muon

  • let \(\phi(x):=ax + bx^3 + cx^5\) be a quintic polynomial
  • we have \[ \begin{aligned} \phi({\mathbf{M}}) &= {\mathbf{U}}\phi({\mathbf{S}}){\mathbf{V}}^{\top} \\ \phi\big(\phi({\mathbf{M}})\big) &= {\mathbf{U}}\phi\big(\phi({\mathbf{S}})\big){\mathbf{V}}^{\top} \\ \phi\Big(\phi\big(\phi({\mathbf{M}})\big)\Big) &= {\mathbf{U}}\phi\Big(\phi\big(\phi({\mathbf{S}})\big)\Big){\mathbf{V}}^{\top} \\ &\vdots \\ \phi^N({\mathbf{M}}) &= {\mathbf{U}}\phi^N({\mathbf{S}}){\mathbf{V}}^{\top} \end{aligned} \]

Design of Muon

  • Since \({\mathbf{S}}\) is diagonal, \(\phi({\mathbf{S}})\) simply applies \(\phi\) to each diagonal element of \({\mathbf{S}}\)
  • For example, in case \(k=3, d=4\), \[ \phi({\mathbf{S}}) = \begin{bmatrix} \phi(s_{11}) & 0 & 0 & 0 \\ 0 & \phi(s_{22}) & 0 & 0 \\ 0 & 0 & \phi(s_{33}) & 0 \end{bmatrix}, \mathrm{\ and\ } \phi^N({\mathbf{S}}) = \begin{bmatrix} \phi^N(s_{11}) & 0 & 0 & 0 \\ 0 & \phi^N(s_{22}) & 0 & 0 \\ 0 & 0 & \phi^N(s_{33}) & 0 \end{bmatrix} \]
  • we can select the coefficients \((a, b, c)\) such that \(\phi^N(x) \to 1\) as \(N\to\infty\) for all \(x\in(0, 1]\) (in practice the coefficients are tuned so that a few iterations already bring every value close to 1)
  • in that case, \(\phi^N({\mathbf{S}})\to{\mathbf{I}}_{k\times d }\) as \(N\to\infty\) and \(\phi^N({\mathbf{M}})\to {\mathbf{O}}\)
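This behavior can be illustrated on scalars. With the coefficients \((3.4445, -4.7750, 2.0315)\) used later in the slides, a handful of iterations pushes every starting value in \((0, 1]\) into a narrow band around 1 (these coefficients trade exact convergence for speed, so the iterates hover near 1 rather than reaching it exactly):

```python
def phi(x, a=3.4445, b=-4.7750, c=2.0315):
    return a * x + b * x**3 + c * x**5

# Iterate phi on a few "singular values" in (0, 1]
for x0 in (0.01, 0.1, 0.5, 1.0):
    x = x0
    for _ in range(10):  # N = 10 iterations of phi
        x = phi(x)
    print(f"phi^10({x0}) = {x:.3f}")  # lands near 1 for every starting value
```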

Muon

  • backward pass: \(\frac{\partial L}{\partial w_{ij}}\) for \(1\le i\le k\) and \(1\le j\le d\)
  • update momentum \({\mathbf{M}}\in{\mathbb{R}}^{k\times d}\) where \[m_{ij} \leftarrow \beta_1 m_{ij} + (1-\beta_1) \frac{\partial L}{\partial w_{ij}}\]
  • normalize the momentum \[{\mathbf{M}}':=\frac{{\mathbf{M}}}{||{\mathbf{M}}||_{F}+\varepsilon} \] so that its singular values fall in \((0, 1]\)
  • with \((a, b, c) = (3.4445, -4.7750, 2.0315)\), compute \({\mathbf{O}}:=\phi^5({\mathbf{M}}')\)
  • update \({\mathbf{W}}\leftarrow {\mathbf{W}}-\alpha {\mathbf{O}}\)

Muon consumes less GPU memory than Adam

  • Muon needs storage only for the momentum \({\mathbf{M}}\), which is the same size as the parameters
  • Adam needs storage for both the momentum \({\mathbf{M}}\) and the RMSprop second moment \({\mathbf{V}}\), each the same size as the parameters
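A back-of-the-envelope sketch of this optimizer-state comparison (fp32 state and a 1B-parameter model are assumed for illustration):

```python
n_params = 1_000_000_000  # 1B-parameter model (illustrative)
bytes_per_value = 4       # fp32 optimizer state (assumed)

muon_state = 1 * n_params * bytes_per_value  # momentum M only
adam_state = 2 * n_params * bytes_per_value  # momentum M plus second moment V

print(f"Muon: {muon_state / 2**30:.1f} GiB, Adam: {adam_state / 2**30:.1f} GiB")
```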

References

Liu, Jingyuan, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, et al. 2025. “Muon Is Scalable for LLM Training.” https://arxiv.org/abs/2502.16982.
Team, Kimi, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, et al. 2025. “Kimi K2: Open Agentic Intelligence.” arXiv Preprint arXiv:2507.20534.