MomentUm Orthogonalized by Newton-Schulz

Train Faster with Fewer GPUs and Great Performance

KHUN Kimang

ក្រង AI

2026-03-10

Summary

Muon

Muon is an optimizer that performs the following operations:

  1. Compute the gradient of the parameters \({\mathbf{W}}\)
  2. Update the momentum matrix \({\mathbf{M}}\) using the gradient
  3. Compute the orthogonal matrix \({\mathbf{O}}\) of \({\mathbf{M}}\)
  4. Update the parameters \({\mathbf{W}}\) using \({\mathbf{O}}\)

Muon performs better than Adam

Figure 1: Scaling-law experiments comparing Muon and Adam. Muon is \(\approx 2\times\) more computationally efficient than Adam under compute-optimal training.

Muon improves the training stability of the Kimi K2 language model

Figure 2: Per-step training loss curve of Kimi K2, without smoothing or sub-sampling. It shows no spikes throughout the entire training process. Note that we omit the very beginning of training for clarity.

Definition

Linear Layer

  • parameters: \({\mathbf{W}}\in{\mathbb{R}}^{k\times d}\)
  • forward pass: \(\hat{{\boldsymbol{y}}}={\mathbf{W}}{\boldsymbol{x}}\), where \({\boldsymbol{x}}\in{\mathbb{R}}^{d}\)

For example, in case \(k=3\) and \(d=4\), we have \[ \begin{bmatrix} \hat{y}_{1} \\ \hat{y}_{2} \\ \hat{y}_{3} \end{bmatrix} = \begin{bmatrix} w_{11} & w_{12} & w_{13} & w_{14} \\ w_{21} & w_{22} & w_{23} & w_{24} \\ w_{31} & w_{32} & w_{33} & w_{34} \end{bmatrix} \begin{bmatrix} x_{1} \\ x_{2} \\ x_{3} \\ x_{4} \end{bmatrix} \]

  • compute loss: \(L({\boldsymbol{y}}, \hat{{\boldsymbol{y}}})\) where \({\boldsymbol{y}}\in{\mathbb{R}}^{k}\) is the ground-truth
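As a concrete sketch of the forward pass and loss for \(k=3, d=4\) in NumPy (the squared-error loss is an assumed example; the slides leave \(L\) generic):

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 3, 4

W = rng.standard_normal((k, d))  # parameters W in R^{k x d}
x = rng.standard_normal(d)       # input x in R^d
y = rng.standard_normal(k)       # ground-truth y in R^k

y_hat = W @ x                          # forward pass: y_hat = W x
loss = 0.5 * np.sum((y_hat - y) ** 2)  # example squared-error loss L(y, y_hat)
```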

Gradient Descent

  • backward pass: \(\frac{\partial L}{\partial w_{ij}}\) for \(1\le i\le k\) and \(1\le j\le d\)
  • update: \[w_{ij}^{\mathrm{new}} \leftarrow w_{ij}^{\mathrm{old}} - \alpha \frac{\partial L}{\partial w_{ij}}\]

where \(\alpha\) is the learning rate

Gradient Descent

  • backward pass: \(\frac{\partial L}{\partial w_{ij}}\) for \(1\le i\le k\) and \(1\le j\le d\)
  • update: \[w_{ij} \leftarrow w_{ij} - \alpha \frac{\partial L}{\partial w_{ij}}\]

For \(k=3\) and \(d=4\), we have \[ {\mathbf{W}} \leftarrow \begin{bmatrix} w_{11} & w_{12} & w_{13} & w_{14} \\ w_{21} & w_{22} & w_{23} & w_{24} \\ w_{31} & w_{32} & w_{33} & w_{34} \end{bmatrix} - \alpha \begin{bmatrix} \frac{\partial L}{\partial w_{11}} & \frac{\partial L}{\partial w_{12}} & \frac{\partial L}{\partial w_{13}} & \frac{\partial L}{\partial w_{14}} \\ \frac{\partial L}{\partial w_{21}} & \frac{\partial L}{\partial w_{22}} & \frac{\partial L}{\partial w_{23}} & \frac{\partial L}{\partial w_{24}} \\ \frac{\partial L}{\partial w_{31}} & \frac{\partial L}{\partial w_{32}} & \frac{\partial L}{\partial w_{33}} & \frac{\partial L}{\partial w_{34}} \end{bmatrix} \]
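The matrix update above can be run end-to-end. Assuming the squared-error loss \(L=\tfrac12\lVert{\mathbf{W}}{\boldsymbol{x}}-{\boldsymbol{y}}\rVert^2\) (an illustrative choice, not fixed by the slides), the full gradient matrix is the outer product \((\hat{{\boldsymbol{y}}}-{\boldsymbol{y}}){\boldsymbol{x}}^{\top}\):

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, alpha = 3, 4, 0.01   # alpha: learning rate

W = rng.standard_normal((k, d))
x = rng.standard_normal(d)
y = rng.standard_normal(k)

def loss(W):
    return 0.5 * np.sum((W @ x - y) ** 2)

loss_before = loss(W)
grad = np.outer(W @ x - y, x)  # dL/dw_ij = (y_hat_i - y_i) * x_j
W = W - alpha * grad           # gradient-descent update on the whole matrix
loss_after = loss(W)           # smaller than loss_before for a small enough alpha
```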

Adaptive Moment Estimation (Adam)

  • backward pass: \(\frac{\partial L}{\partial w_{ij}}\) for \(1\le i\le k\) and \(1\le j\le d\)
  • update momentum \({\mathbf{M}}\in{\mathbb{R}}^{k\times d}\) where \[m_{ij} \leftarrow \beta_1 m_{ij} + (1-\beta_1) \frac{\partial L}{\partial w_{ij}}\]
  • update the second-moment estimate \({\mathbf{V}}\in{\mathbb{R}}^{k\times d}\) (Root Mean Squared propagation, RMSprop) where \[\nu_{ij} \leftarrow \beta_2 \nu_{ij} + (1-\beta_2) \left(\frac{\partial L}{\partial w_{ij}}\right)^2\]
  • update: \[w_{ij} \leftarrow w_{ij} - \alpha \frac{m_{ij}}{\sqrt{\nu_{ij}} + \varepsilon}\]
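A minimal element-wise sketch of this Adam step in NumPy (the bias correction from the original Adam paper is omitted, matching the slide; hyperparameter defaults are the conventional ones):

```python
import numpy as np

def adam_step(W, grad, M, V, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update as written on the slide (no bias correction)."""
    M = beta1 * M + (1 - beta1) * grad       # first moment (momentum)
    V = beta2 * V + (1 - beta2) * grad ** 2  # second moment (RMSprop)
    W = W - alpha * M / (np.sqrt(V) + eps)   # element-wise scaled step
    return W, M, V
```

Note that \({\mathbf{M}}\) and \({\mathbf{V}}\) are both the same shape as \({\mathbf{W}}\), which is why Adam's optimizer state is twice the parameter size.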

Concept of Muon

  • backward pass: \(\frac{\partial L}{\partial w_{ij}}\) for \(1\le i\le k\) and \(1\le j\le d\)
  • update momentum \({\mathbf{M}}\in{\mathbb{R}}^{k\times d}\) where \[m_{ij} \leftarrow \beta_1 m_{ij} + (1-\beta_1) \frac{\partial L}{\partial w_{ij}}\]
  • compute orthogonal of \({\mathbf{M}}\): \[\mathrm{Ortho}({\mathbf{M}}) := \arg\min_{{\mathbf{O}}}{||{\mathbf{O}}-{\mathbf{M}}||} \mathrm{\ subject\ to:\ either\ } {\mathbf{O}}{\mathbf{O}}^{\top} = {\mathbf{I}}_{k\times k} \mathrm{\ or\ } {\mathbf{O}}^{\top}{\mathbf{O}}= {\mathbf{I}}_{d\times d} \tag{1}\]
  • update \({\mathbf{W}}\leftarrow {\mathbf{W}}-\alpha \mathrm{Ortho}({\mathbf{M}})\)

Concept of Muon

  • Equation 1 has one known solution: \({\mathbf{O}}={\mathbf{U}}{\mathbf{I}}_{k\times d}{\mathbf{V}}^{\top}\) where \({\mathbf{M}}={\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top}\),
    • \({\mathbf{S}}\) is diagonal
    • \({\mathbf{U}}\) and \({\mathbf{V}}\) are orthonormal
    • \({\mathbf{U}}{\mathbf{U}}^{\top} ={\mathbf{U}}^{\top}{\mathbf{U}}={\mathbf{I}}_{k\times k}\) and \({\mathbf{V}}{\mathbf{V}}^{\top} ={\mathbf{V}}^{\top}{\mathbf{V}}={\mathbf{I}}_{d\times d}\)
  • \({\mathbf{S}}\), \({\mathbf{U}}\), and \({\mathbf{V}}\) can be computed by Singular Value Decomposition (SVD)
  • SVD is computationally expensive
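The SVD solution can be checked numerically. A sketch using NumPy's reduced SVD; for \(k\le d\) the result satisfies \({\mathbf{O}}{\mathbf{O}}^{\top}={\mathbf{I}}_{k\times k}\):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 4))  # momentum matrix, k=3, d=4

U, s, Vt = np.linalg.svd(M, full_matrices=False)  # M = U diag(s) Vt

# O = U I_{k x d} V^T: keep the directions of M, set all singular values to 1
O = U @ Vt
```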

Design of Muon

  • let \(\phi(x):=ax + bx^3 + cx^5\) be a quintic polynomial
  • applying \(\phi\) to \({\mathbf{M}}\), with matrix powers taken as \({\mathbf{M}}^{3}:=({\mathbf{M}}{\mathbf{M}}^{\top}){\mathbf{M}}\) and \({\mathbf{M}}^{5}:=({\mathbf{M}}{\mathbf{M}}^{\top})^{2}{\mathbf{M}}\), gives

\[ \begin{aligned} a{\mathbf{M}}+ b({\mathbf{M}}{\mathbf{M}}^{\top}){\mathbf{M}}+ c({\mathbf{M}}{\mathbf{M}}^{\top})^2{\mathbf{M}} &= a{\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} +b \left({\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top}({\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top})^{\top}\right){\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} +c\left({\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top}({\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top})^{\top}\right)^2{\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} \\ &= a{\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} +b \left({\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top}{\mathbf{V}}({\mathbf{U}}{\mathbf{S}})^{\top}\right){\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} +c\left({\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top}({\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top})^{\top}\right)^2{\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} \\ &= a{\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} +b \left({\mathbf{U}}{\mathbf{S}}({\mathbf{U}}{\mathbf{S}})^{\top}\right){\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} +c\left({\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top}({\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top})^{\top}\right)^2{\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} \\ &= a{\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} +b \left({\mathbf{U}}{\mathbf{S}}{\mathbf{S}}^{\top}{\mathbf{U}}^{\top}\right){\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} +c\left({\mathbf{U}}{\mathbf{S}}{\mathbf{S}}^{\top}{\mathbf{U}}^{\top}\right)\left({\mathbf{U}}{\mathbf{S}}{\mathbf{S}}^{\top}{\mathbf{U}}^{\top}\right){\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} \\ &= a{\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} +b {\mathbf{U}}{\mathbf{S}}^{3}{\mathbf{V}}^{\top} +c\left({\mathbf{U}}{\mathbf{S}}^{2}{\mathbf{U}}^{\top}\right){\mathbf{U}}{\mathbf{S}}^{3}{\mathbf{V}}^{\top} \\ &= a{\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} +b {\mathbf{U}}{\mathbf{S}}^{3}{\mathbf{V}}^{\top} +c{\mathbf{U}}{\mathbf{S}}^{5}{\mathbf{V}}^{\top} \\ &= {\mathbf{U}}\left(a{\mathbf{S}}+b {\mathbf{S}}^{3} 
+c{\mathbf{S}}^{5}\right){\mathbf{V}}^{\top} \\ &= {\mathbf{U}}\phi({\mathbf{S}}){\mathbf{V}}^{\top} \end{aligned} \]
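The identity \(\phi({\mathbf{M}})={\mathbf{U}}\phi({\mathbf{S}}){\mathbf{V}}^{\top}\) can be verified numerically, where \(\phi({\mathbf{S}})\) acts on the singular values (the coefficients here are arbitrary example values):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 4))
a, b, c = 1.5, -0.5, 0.1  # arbitrary example coefficients

U, s, Vt = np.linalg.svd(M, full_matrices=False)  # M = U diag(s) Vt

# phi(M) with matrix powers M^3 := (M M^T) M and M^5 := (M M^T)^2 M
MMt = M @ M.T
lhs = a * M + b * MMt @ M + c * MMt @ MMt @ M

# U phi(S) V^T: phi applied to each singular value
rhs = U @ np.diag(a * s + b * s**3 + c * s**5) @ Vt
```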

Design of Muon

  • let \(\phi(x):=ax + bx^3 + cx^5\) be a quintic polynomial
  • we have \[ \begin{aligned} \phi({\mathbf{M}}) &= {\mathbf{U}}\phi({\mathbf{S}}){\mathbf{V}}^{\top} \\ \phi\big(\phi({\mathbf{M}})\big) &= {\mathbf{U}}\phi\big(\phi({\mathbf{S}})\big){\mathbf{V}}^{\top} \\ \phi\Big(\phi\big(\phi({\mathbf{M}})\big)\Big) &= {\mathbf{U}}\phi\Big(\phi\big(\phi({\mathbf{S}})\big)\Big){\mathbf{V}}^{\top} \\ &\vdots \\ \phi^N({\mathbf{M}}) &= {\mathbf{U}}\phi^N({\mathbf{S}}){\mathbf{V}}^{\top} \end{aligned} \]

Design of Muon

  • Since \({\mathbf{S}}\) is diagonal, \(\phi({\mathbf{S}})\) simply applies \(\phi\) to each diagonal element of \({\mathbf{S}}\)
  • For example, in case \(k=3, d=4\), \[ \phi({\mathbf{S}}) = \begin{bmatrix} \phi(s_{11}) & 0 & 0 & 0 \\ 0 & \phi(s_{22}) & 0 & 0 \\ 0 & 0 & \phi(s_{33}) & 0 \end{bmatrix}, \mathrm{\ and\ } \phi^N({\mathbf{S}}) = \begin{bmatrix} \phi^N(s_{11}) & 0 & 0 & 0 \\ 0 & \phi^N(s_{22}) & 0 & 0 \\ 0 & 0 & \phi^N(s_{33}) & 0 \end{bmatrix} \]
  • we can select the coefficients \((a, b, c)\) such that \(\phi^N(x) \to 1\) as \(N\to\infty\) for all \(x\in(0, 1]\) (in practice the coefficients are tuned so that a few iterations already bring every value close to 1)
  • in that case, \(\phi^N({\mathbf{S}})\to{\mathbf{I}}_{k\times d }\) as \(N\to\infty\) and \(\phi^N({\mathbf{M}})\to {\mathbf{O}}\)
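This behavior can be illustrated on scalars. With the coefficients \((3.4445, -4.7750, 2.0315)\) used later in the slides, a handful of iterations pushes every starting value in \((0, 1]\) into a narrow band around 1 (these coefficients trade exact convergence for speed, so the iterates hover near 1 rather than reaching it exactly):

```python
def phi(x, a=3.4445, b=-4.7750, c=2.0315):
    return a * x + b * x**3 + c * x**5

# Iterate phi on a few "singular values" in (0, 1]
for x0 in (0.01, 0.1, 0.5, 1.0):
    x = x0
    for _ in range(10):  # N = 10 iterations of phi
        x = phi(x)
    print(f"phi^10({x0}) = {x:.3f}")  # lands near 1 for every starting value
```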

Muon

  • backward pass: \(\frac{\partial L}{\partial w_{ij}}\) for \(1\le i\le k\) and \(1\le j\le d\)
  • update momentum \({\mathbf{M}}\in{\mathbb{R}}^{k\times d}\) where \[m_{ij} \leftarrow \beta_1 m_{ij} + (1-\beta_1) \frac{\partial L}{\partial w_{ij}}\]
  • normalize the momentum \[{\mathbf{M}}':=\frac{{\mathbf{M}}}{||{\mathbf{M}}||_{F}+\varepsilon} \] so that its singular values fall in \((0, 1]\)
  • with \((a, b, c) = (3.4445, -4.7750, 2.0315)\), compute \({\mathbf{O}}:=\phi^5({\mathbf{M}}')\)
  • update \({\mathbf{W}}\leftarrow {\mathbf{W}}-\alpha {\mathbf{O}}\)

Muon consumes less GPU memory than Adam

  • Muon needs storage only for the momentum \({\mathbf{M}}\), which is the same size as the parameters
  • Adam needs storage for both the momentum \({\mathbf{M}}\) and the RMSprop second moment \({\mathbf{V}}\), each the same size as the parameters
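A back-of-the-envelope sketch of this optimizer-state comparison (fp32 state and a 1B-parameter model are assumed for illustration):

```python
n_params = 1_000_000_000  # 1B-parameter model (illustrative)
bytes_per_value = 4       # fp32 optimizer state (assumed)

muon_state = 1 * n_params * bytes_per_value  # momentum M only
adam_state = 2 * n_params * bytes_per_value  # momentum M plus second moment V

print(f"Muon: {muon_state / 2**30:.1f} GiB, Adam: {adam_state / 2**30:.1f} GiB")
```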

References

Liu, Jingyuan, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, et al. 2025. “Muon Is Scalable for LLM Training.” https://arxiv.org/abs/2502.16982.
Team, Kimi, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, et al. 2025. “Kimi K2: Open Agentic Intelligence.” arXiv Preprint arXiv:2507.20534.