Train Faster With Fewer GPUs and Great Performance
2026-03-10
Muon is an optimizer that performs the computation below.
Figure 1: Scaling law experiments comparing Muon and Adam. Muon is \(\approx\) 2\(\times\) more computationally efficient than Adam under compute-optimal training.
Figure 1 of Liu et al. (2025).
Figure 2: Per-step training loss curve of Kimi K2, without smoothing or sub-sampling. It shows no spikes throughout the entire training process. Note that we omit the very beginning of training for clarity.
Figure 3 of Team et al. (2025).
For example, with \(k=3\) and \(d=4\), we have \[ \begin{bmatrix} \hat{y}_{1} \\ \hat{y}_{2} \\ \hat{y}_{3} \end{bmatrix} = \begin{bmatrix} w_{11} & w_{12} & w_{13} & w_{14} \\ w_{21} & w_{22} & w_{23} & w_{24} \\ w_{31} & w_{32} & w_{33} & w_{34} \end{bmatrix} \begin{bmatrix} x_{1} \\ x_{2} \\ x_{3} \\ x_{4} \end{bmatrix} \]
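As a quick sanity check, the same matrix-vector product can be computed with NumPy (the numeric values below are made-up illustration values, not from the text):

```python
import numpy as np

# Shapes from the example: k = 3 outputs, d = 4 inputs.
W = np.arange(1, 13, dtype=float).reshape(3, 4)  # rows are (w_i1, ..., w_i4)
x = np.array([1.0, 0.0, 2.0, -1.0])

y_hat = W @ x  # each y_hat_i is the dot product of row i of W with x
print(y_hat)   # -> [ 3. 11. 19.]
```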
The gradient-descent update is \({\mathbf{W}} \leftarrow {\mathbf{W}} - \alpha \frac{\partial L}{\partial {\mathbf{W}}}\), where \(\alpha\) is the learning rate.
For \(k=3\) and \(d=4\), we have \[ {\mathbf{W}} \leftarrow \begin{bmatrix} w_{11} & w_{12} & w_{13} & w_{14} \\ w_{21} & w_{22} & w_{23} & w_{24} \\ w_{31} & w_{32} & w_{33} & w_{34} \end{bmatrix} - \alpha \begin{bmatrix} \frac{\partial L}{\partial w_{11}} & \frac{\partial L}{\partial w_{12}} & \frac{\partial L}{\partial w_{13}} & \frac{\partial L}{\partial w_{14}} \\ \frac{\partial L}{\partial w_{21}} & \frac{\partial L}{\partial w_{22}} & \frac{\partial L}{\partial w_{23}} & \frac{\partial L}{\partial w_{24}} \\ \frac{\partial L}{\partial w_{31}} & \frac{\partial L}{\partial w_{32}} & \frac{\partial L}{\partial w_{33}} & \frac{\partial L}{\partial w_{34}} \end{bmatrix} \]
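The entry-wise update above can be sketched for a single linear layer. The squared-error loss and its gradient here are illustrative assumptions, not something the text specifies:

```python
import numpy as np

# Hypothetical loss L = ||W x - y||^2 / 2 for illustration;
# its gradient w.r.t. W is the outer product (W x - y) x^T,
# which has the same (k, d) shape as W itself.
alpha = 0.1
W = np.zeros((3, 4))
x = np.array([1.0, 0.0, 2.0, -1.0])
y = np.array([3.0, 11.0, 19.0])

grad = np.outer(W @ x - y, x)  # dL/dW, one partial derivative per entry w_ij
W = W - alpha * grad           # the matrix update shown above, entry by entry
```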
\[ \begin{aligned} a{\mathbf{M}}+ b({\mathbf{M}}{\mathbf{M}}^{\top}){\mathbf{M}}+ c({\mathbf{M}}{\mathbf{M}}^{\top})^2{\mathbf{M}} &= a{\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} +b \left({\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top}({\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top})^{\top}\right){\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} +c\left({\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top}({\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top})^{\top}\right)^2{\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} \\ &= a{\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} +b \left({\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top}{\mathbf{V}}({\mathbf{U}}{\mathbf{S}})^{\top}\right){\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} +c\left({\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top}({\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top})^{\top}\right)^2{\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} \\ &= a{\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} +b \left({\mathbf{U}}{\mathbf{S}}({\mathbf{U}}{\mathbf{S}})^{\top}\right){\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} +c\left({\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top}({\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top})^{\top}\right)^2{\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} \\ &= a{\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} +b \left({\mathbf{U}}{\mathbf{S}}{\mathbf{S}}^{\top}{\mathbf{U}}^{\top}\right){\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} +c\left({\mathbf{U}}{\mathbf{S}}{\mathbf{S}}^{\top}{\mathbf{U}}^{\top}\right)\left({\mathbf{U}}{\mathbf{S}}{\mathbf{S}}^{\top}{\mathbf{U}}^{\top}\right){\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} \\ &= a{\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} +b {\mathbf{U}}{\mathbf{S}}^{3}{\mathbf{V}}^{\top} +c\left({\mathbf{U}}{\mathbf{S}}^{2}{\mathbf{U}}^{\top}\right){\mathbf{U}}{\mathbf{S}}^{3}{\mathbf{V}}^{\top} \\ &= a{\mathbf{U}}{\mathbf{S}}{\mathbf{V}}^{\top} +b {\mathbf{U}}{\mathbf{S}}^{3}{\mathbf{V}}^{\top} +c{\mathbf{U}}{\mathbf{S}}^{5}{\mathbf{V}}^{\top} \\ &= {\mathbf{U}}\left(a{\mathbf{S}}+b {\mathbf{S}}^{3} 
+c{\mathbf{S}}^{5}\right){\mathbf{V}}^{\top} \\ &= {\mathbf{U}}\phi({\mathbf{S}}){\mathbf{V}}^{\top} \end{aligned} \]
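The last line shows that the odd matrix polynomial acts only on the singular values, leaving \({\mathbf{U}}\) and \({\mathbf{V}}\) untouched. This is what lets Muon's Newton-Schulz iteration push all singular values toward 1 (i.e. approximately orthogonalize the momentum matrix) without computing an explicit SVD. A minimal NumPy sketch; the coefficients and step count are the commonly cited ones from Keller Jordan's Muon write-up and should be treated as assumptions here:

```python
import numpy as np

def newton_schulz(M, steps=5, a=3.4445, b=-4.7750, c=2.0315):
    """Approximately map M = U S V^T to U V^T by iterating
    X <- aX + b(XX^T)X + c(XX^T)^2 X, which applies phi to the singular values."""
    # Normalize so every singular value is <= 1, a range where the iteration behaves.
    X = M / (np.linalg.norm(M) + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X  # one step of U phi(S) V^T
    return X

O = newton_schulz(np.random.default_rng(0).standard_normal((3, 4)))
print(np.linalg.svd(O, compute_uv=False))  # singular values pushed toward 1
```

Note the iteration is deliberately loose: with these coefficients the singular values land near 1 rather than exactly at 1, which is reported to be sufficient in practice while converging in very few matrix multiplies.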
Kimang KHUN (Ph.D.), ក្រង AI