Focus Period Lund 2026
Postdoctoral Researcher
French Institute for Research in Computer Science and Automation – Inria (France)
Frederik Kunstner is a postdoctoral researcher at Inria Paris, working with Francis Bach. His research focuses on the intersection of optimization theory and machine learning, aiming to build a better understanding of how to train ML models. He received his PhD from the University of British Columbia, where he worked with Mark Schmidt. His thesis received the CAIAC Best Doctoral Dissertation Award and an AAAI/ACM SIGAI Honorable Mention, and his work on the EM algorithm won the AISTATS 2021 Best Paper Award.
Presenting: Adam and Gradient Descent with Zipf-distributed tokens
Adam, and more recently Muon, outperform gradient descent in training transformer-based language models, but we have a poor understanding of why they improve training performance. We give empirical evidence that their benefit is due to the heavy-tailed distribution of words in text data. In text, the frequency of the kth most frequent word is proportional to 1/k, following Zipf's law. This frequency imbalance leads to poor performance with gradient descent, while Adam does not suffer from the same slowdown. To better understand the impact of the data distribution on training performance, we study a linear bigram model for next-token prediction when tokens follow power laws. We show that Zipf's law is the worst case, leading to the largest separation in performance between gradient descent and sign descent (a proxy for Adam), and that this performance gap grows with the vocabulary size.
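
To make the setup concrete, here is a minimal, self-contained sketch (not the speaker's code) of the kind of experiment the abstract describes: token pairs sampled under Zipf's law and a linear bigram model trained with plain gradient descent versus sign descent as a proxy for Adam. The vocabulary size, step sizes, iteration count, and the i.i.d. sampling of token pairs are illustrative simplifications, not the paper's actual configuration.

# Toy illustration of the setup in the abstract, under assumed hyperparameters.
import numpy as np

rng = np.random.default_rng(0)

V = 64        # vocabulary size (illustrative choice)
T = 20_000    # number of (previous token, next token) pairs
steps = 200

# Zipf's law: P(token k) proportional to 1/k.
freqs = 1.0 / np.arange(1, V + 1)
probs = freqs / freqs.sum()

# Synthetic bigram data; for simplicity both tokens are drawn
# independently from the Zipf marginal.
x = rng.choice(V, size=T, p=probs)
y = rng.choice(V, size=T, p=probs)

def loss_and_grad(W):
    """Average cross-entropy of the linear bigram model p(y|x) = softmax(W[x])."""
    logits = W[x]                                # shape (T, V), W[x] is a copy
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    loss = -np.mean(np.log(p[np.arange(T), y]))
    # Gradient of the loss w.r.t. W: softmax error, accumulated per input token.
    p[np.arange(T), y] -= 1.0
    grad = np.zeros_like(W)
    np.add.at(grad, x, p / T)
    return loss, grad

for name, update in [("gradient descent", lambda g: 0.5 * g),
                     ("sign descent", lambda g: 0.01 * np.sign(g))]:
    W = np.zeros((V, V))
    for _ in range(steps):
        _, g = loss_and_grad(W)
        W -= update(g)
    final_loss, _ = loss_and_grad(W)
    print(f"{name:16s} final loss: {final_loss:.4f}")

The intuition this toy setup exposes: rows of the gradient corresponding to rare input tokens are proportionally small, so gradient descent updates them slowly, whereas sign descent applies equal-magnitude updates to every coordinate regardless of token frequency.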
