Focus Period Lund 2026
PhD Student
George Mason University (USA)
Michael Crawshaw is a final-year Ph.D. student in Computer Science at George Mason University. His research focuses on optimization for machine learning, with the goal of developing optimization theory that faithfully explains practical machine learning, and building more efficient optimization algorithms for deep learning. Prior to his Ph.D., he received an M.S. in Computer Science (2022) from George Mason University, and a B.S. in Mathematics and Computer Science (2019) from The Ohio State University.
Presenting: An Exploration of Non-Euclidean Gradient Descent: Muon and its Many Variants
The recently introduced Muon optimizer has demonstrated great efficiency for training language models, though its design is a heuristic mix of steepest descent in the spectral norm with practical tricks. This talk will cover our work to develop a principled foundation for Muon, and along the way we explore various design decisions that lead to new optimization algorithms. To define a steepest descent method over a neural network, we need to choose a norm for each layer, a way to aggregate these norms across layers, and whether to use normalization. We systematically explore different alternatives for aggregating norms across layers, both formalizing existing combinations of Adam and Muon as a type of non-Euclidean gradient descent and deriving new variants of the Muon optimizer. Through a comprehensive experimental evaluation of the optimizers within our framework, we find that Muon is sensitive to the choice of learning rate, whereas a new variant we call MuonMax is significantly more robust. We then show how to combine any non-Euclidean gradient method with model-based momentum (known as Momo). The new Momo variants of Muon are less sensitive to the choice of learning rate (and often achieve a better validation score), which greatly alleviates the cost of tuning hyperparameters.
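For readers unfamiliar with the core idea, the sketch below illustrates one steepest-descent step in the spectral norm: the update direction for a weight matrix is the orthogonalized gradient U Vᵀ from the SVD G = U S Vᵀ. This is a minimal illustration, not the speaker's implementation; the function name, shapes, and learning rate are hypothetical, and practical Muon approximates U Vᵀ with a Newton-Schulz iteration and adds momentum, both omitted here.

import numpy as np

def spectral_steepest_descent_step(W, G, lr):
    # Orthogonalize the gradient: keep the singular vectors of G
    # but replace every singular value with 1, giving U @ Vt.
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return W - lr * (U @ Vt)

# Illustrative usage on a random layer and gradient.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))
G = rng.standard_normal((64, 32))
W = spectral_steepest_descent_step(W, G, lr=0.02)

Because every singular value of the step is 1, the update has spectral norm exactly lr regardless of the gradient's scale, which is what makes the method a steepest descent in the spectral norm rather than the Euclidean one.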
