Detailed program
Please note that the program is still subject to change.
May 5, 2026
17:00 - 19:00
Historical Museum at Lund University
Krafts Torg 1, 223 50 Lund.
Welcome reception at the Historical Museum
A welcome drink and some hors d’oeuvres will be served.
Day 1 – May 6, 2026
08:30 - 08:45
Registration
08:45 - 09:00
Opening
09:00 - 09:40
TBA
Peter Richtárik, KAUST
Biography
Peter Richtárik is a professor of Computer Science at King Abdullah University of Science and Technology (KAUST), Saudi Arabia, where he leads the Optimization and Machine Learning Lab. Through his work on randomized and distributed optimization algorithms, he has contributed to the foundations of machine learning and optimization, and he is one of the original developers of Federated Learning. His work has attracted international awards, including the Charles Broyden Prize, the SIAM SIGEST Best Paper Award, and a Distinguished Speaker Award at the 2019 International Conference on Continuous Optimization. He serves as an Area Chair for leading machine learning conferences, including NeurIPS, ICML, and ICLR, and is an Action Editor of JMLR and an Associate Editor of Numerische Mathematik and Optimization Methods and Software.
09:40 - 10:20
Exploiting Similarity in Federated Learning
Sebastian Stich, CISPA and ELLIS
Biography
Dr. Sebastian Stich is a tenured faculty member at the CISPA Helmholtz Center for Information Security and a member of the European Laboratory for Learning and Intelligent Systems (ELLIS). His research focuses on the intersection of machine learning, optimization, and statistics, with an emphasis on efficient parallel and distributed algorithms for training models over decentralized datasets.
He obtained his PhD from ETH Zurich and held postdoctoral positions at UCLouvain and EPFL. His work has been recognized with a Meta Research Award (2022), a Google Research Scholar Award (2023), and an ERC Consolidator Grant (CollectiveMinds, 2024).
Abstract
We provide a brief introduction to local update methods developed for federated optimization and discuss their worst-case complexity. Surprisingly, these methods often perform much better in practice than predicted by theoretical analyses under classical assumptions. Recent years have revealed that their performance can be better described using refined notions that capture the similarity among client objectives. In this talk, we introduce a generic framework based on a distributed proximal point algorithm, which consolidates many of our insights and allows for the adaptation of arbitrary centralized optimization algorithms to the convex federated setting, including accelerated variants. Our theoretical analysis shows that the derived methods enjoy faster convergence when the degree of similarity among clients is high.
Based on joint work with Xiaowen Jiang and Anton Rodomanov.
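To make the distributed proximal point idea concrete, here is a minimal toy sketch (a hypothetical illustration, not the authors' algorithm or code): each client applies the proximal operator of its own objective at the current server point, and the server averages the results. For the identical-curvature quadratic client objectives used below, the fixed point of this iteration is exactly the minimizer of the average objective.

```python
def prox_quadratic(x, a, gamma):
    # Proximal operator of f_i(u) = 0.5 * (u - a)**2 evaluated at x:
    # argmin_u 0.5*(u - a)**2 + (1/(2*gamma))*(u - x)**2
    return (x + gamma * a) / (1 + gamma)

# Toy client objectives centered at a_i; their average is minimized at mean(a_i).
clients = [1.0, 2.0, 6.0]
x = 0.0
for _ in range(60):
    # each client takes one local proximal step; the server averages
    x = sum(prox_quadratic(x, a, gamma=1.0) for a in clients) / len(clients)

print(round(x, 6))  # converges to mean(clients) = 3.0
```

Here the averaged iteration contracts geometrically toward the consensus minimizer; when client objectives are more similar, the local proximal steps agree more closely, which is the intuition behind the similarity-dependent rates discussed in the talk.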
10:20 - 10:50
Coffee
10:50 - 11:30
TBA
Niao He, ETH Zurich
Biography
Niao He is currently an Associate Professor in the Department of Computer Science at ETH Zürich, where she leads the Optimization & Decision Intelligence (ODI) Group. She is also a core faculty member at:
- Institute of Machine Learning
- ETH AI Center
- ETH Foundations of Data Science
- Max Planck ETH Center for Learning Systems
Niao He’s work lies at the interface of optimization and machine learning, with a primary focus on the algorithmic and theoretical foundations of principled, scalable, and trustworthy decision intelligence. She is also interested in developing machine learning models and algorithms for interdisciplinary applications in operations management, mechanism design, control and robotics, and beyond.
With thanks to the Swiss National Science Foundation, ETH Foundations, and NCCR Automation for generously funding her current research.
11:30 - 12:10
TBA
Anastasia Koloskova, University of Zurich
Biography
Anastasia Koloskova is an Assistant Professor of AI and Optimization in the Department of Mathematical Modeling and Machine Learning at the University of Zurich. Her research focuses on machine learning and optimization, particularly on decentralized and collaborative learning and privacy. Previously, she was a postdoctoral researcher at Stanford University (STAIR lab, Prof. Sanmi Koyejo), and she completed her PhD at EPFL in the Machine Learning and Optimization Laboratory (MLO) with Prof. Martin Jaggi.
12:10 - 13:40
Lunch
13:40 - 14:20
Strong convergence and fast residual decay for monotone operator flows via Tikhonov regularization
Radu I. Boţ, University of Vienna
Biography
Radu I. Boţ is Professor of Applied Mathematics with Emphasis on Optimization at the Faculty of Mathematics of the University of Vienna and a founding member of the Research Platform “Data Science@Uni Vienna”. He currently serves as Dean of the Faculty of Mathematics at the University of Vienna. He received his Diploma and M.Sc. degrees in Mathematics from Babeş-Bolyai University in Cluj-Napoca, Romania, and earned his Ph.D. degree as well as his Habilitation in Mathematics from Chemnitz University of Technology, Germany.
His research interests include continuous-time and discrete-time models for optimization and monotone inclusions, convex analysis, nonsmooth and variational analysis, monotone operator theory, and optimization methods for data science. His research has been funded by the Austrian Science Fund, the Austrian Research Promotion Agency, the German Research Foundation, the Romanian National Research Council, the Australian Research Council, as well as by industrial partners. He is (co-)author of the books Duality in Vector Optimization and Conjugate Duality in Convex Optimization, published by Springer. Radu I. Boţ serves on the editorial boards of several leading journals, including Mathematical Programming, Computational Optimization and Applications, Applied Mathematics and Optimization, and the Journal of Optimization Theory and Applications. Since January 2026, he has been Editor-in-Chief of the prestigious SIAM Journal on Optimization.
Abstract
In the framework of real Hilbert spaces, we investigate first-order dynamical systems governed by monotone and continuous operators. It has been established that for these systems, only the ergodic trajectory converges to a zero of the operator. However, trajectory convergence is assured for operators with the stronger property of cocoercivity. For this class of operators, the trajectory’s velocity and the operator values along the trajectory converge in norm to zero at a rate of o(1/√t) as t → +∞.
In this talk, we show that augmenting a monotone operator flow with a Tikhonov regularization term ensures not only strong convergence of the trajectory to the minimal-norm element of the zero set, but also enables the derivation of explicit convergence rates. In particular, we establish norm rates for the trajectory’s velocity and for the residual of the operator along the trajectory, expressed in terms of the regularization function. In some particular cases, these rates can be as fast as O(1/t) as t → +∞. In this way, we emphasize a surprising acceleration feature of the Tikhonov regularization. Additionally, we explore these properties for monotone operator flows that incorporate time rescaling and an anchor point. For a specific choice of the Tikhonov regularization function, these flows are closely linked to second-order dynamical systems with a vanishing damping term. The convergence and convergence rate results we achieve for these systems complement recent findings for the Fast Optimistic Gradient Descent Ascent (OGDA) dynamics.
Finally, we derive, via an explicit discretization of the Tikhonov-regularized monotone flow, a novel Extra-Gradient method with an anchor term governed by general parameters. We establish strong convergence to specific points within the solution set, as well as convergence rates expressed in terms of the regularization parameters. Notably, our approach recovers the fast residual decay rate O(1/k) as k → +∞ for standard parameter choices.
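Schematically, the Tikhonov-regularized flow discussed in this abstract takes the following generic form (a sketch for orientation; the talk's precise assumptions on the regularization function ε are as stated above):

```latex
% A : H -> H a monotone operator on a real Hilbert space, with A^{-1}(0) nonempty.
% The vanishing Tikhonov term \varepsilon(t) x(t) induces strong convergence of
% x(t) to the minimal-norm element of A^{-1}(0).
\dot{x}(t) + A(x(t)) + \varepsilon(t)\, x(t) = 0, \qquad t \geq t_0,
\qquad \varepsilon(t) > 0, \quad \varepsilon(t) \to 0 \ \text{as} \ t \to +\infty .
```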
14:20 - 15:00
Lightning Talks
15:00 - 15:45
Poster session & coffee
15:45 - 16:25
Gradient alignment, learning, and optimization
Jelena Diakonikolas, University of Wisconsin-Madison
Biography
Jelena Diakonikolas is an Assistant Professor in the Department of Computer Sciences and (by courtesy) the Department of Statistics at the University of Wisconsin-Madison. She is also an affiliate of the Data Science Institute at UW-Madison.
Her main research interests are in the area of large-scale optimization. She is also interested in applications of optimization methods, particularly within machine learning.
Prior to joining UW-Madison, Jelena Diakonikolas was a Postdoctoral Fellow at UC Berkeley’s Foundations of Data Analysis (FODA) TRIPODS Institute, where she primarily worked with Mike Jordan. In Fall 2018, she was a Microsoft Research Fellow at the Simons Institute for the Theory of Computing, associated with the program on Foundations of Data Science. Prior to starting the postdoctoral position at UC Berkeley, she was a Postdoctoral Associate at the Department of Computer Science, Boston University, where she worked with Lorenzo Orecchia. She completed her Ph.D. at the Department of Electrical Engineering, Columbia University, where she was co-advised by Gil Zussman and Cliff Stein.
Some of her publications appear under her maiden name, Marašević.
Abstract
Generalized Linear Models (GLMs) represent functions formed by composing a known univariate nonlinear activation with a linear map defined by an unknown vector w. The learning task of recovering w from i.i.d. labeled examples (x, y), where y is a noisy evaluation of the GLM, leads to a nonconvex, often nonsmooth optimization problem, even for simple activations.
GLMs are a fundamental model in supervised learning, capturing low-dimensional structure in high-dimensional data. While the setting with zero-mean bounded-variance noise has been well studied, more realistic formulations—where labels may deviate arbitrarily from any ground-truth GLM—are substantially more challenging. In particular, more relaxed notions of error and much stronger structural assumptions about both the activation and the distribution generating the data are required for computational tractability. Most provable guarantees in this regime have emerged only recently.
In this talk, I will survey these developments and present a unifying optimization-theoretic framework based on local error bounds. These bounds capture how the gradient field remains meaningfully aligned with a target solution, thus providing a geometric “signal” that enables efficient learning with first-order methods, despite nonconvexity and noise. I will further discuss a generalization of these results to the setting of single-index models, where the activation is unknown and optimization is performed over a class of unknown activations, in addition to the parameter vector.
16:25 - 17:05
TBA
Nicolas Loizou, Johns Hopkins University
Biography
Nicolas Loizou is an Assistant Professor in the Department of Applied Mathematics and Statistics and the Mathematical Institute for Data Science (MINDS) at Johns Hopkins University, where he leads the Optimization and Machine Learning Lab. Prior to this, he was a Postdoctoral Research Fellow at Mila – Quebec Artificial Intelligence Institute and the University of Montreal. He holds a Ph.D. in Optimization and Operational Research from the University of Edinburgh, School of Mathematics, an M.Sc. in Computing from Imperial College London, and a BSc in Mathematics from the National and Kapodistrian University of Athens.
His research interests include large-scale optimization, machine learning, randomized numerical linear algebra, distributed and decentralized algorithms, algorithmic game theory, and federated learning. He currently serves as action editor for Information and Inference: A Journal of the IMA, Optimization Methods and Software, and Transactions on Machine Learning Research. He has received several awards, including the OR Society’s 2019 Doctoral Award (runner-up), the IVADO Fellowship, the COAP 2020 Best Paper Award, the CISCO 2023 Research Award, and the Catalyst 2025 Award.
Day 2 – May 7, 2026
09:00 - 09:40
Recent advances on the systematic analysis and design of first-order optimization algorithms
Convex optimization as a proof assistant for algorithm analysis and design
Adrien Taylor, Inria Paris
Biography
Adrien Taylor is currently a research scientist at the French Institute for Research in Computer Science and Automation (Inria) in Paris, within the SIERRA team. Before that, he was a postdoctoral researcher in the same team in 2017-2019, working with Francis Bach. He completed his PhD at Université catholique de Louvain, in the department of mathematical engineering (part of the ICTEAM institute), under the supervision of François Glineur and Julien Hendrickx, supported by an F.R.S.-FNRS FRIA scholarship.
His research currently focuses on optimization (mostly first-order) and numerical analysis with a bit of control and machine learning. He finds it particularly important to push toward reproducible (including theory) and understandable science, and many of his research projects have this orientation. Adrien Taylor was awarded an ERC Starting Grant 2024 (project CASPER) for working in this direction from fall 2024 to fall 2029.
Abstract
Complexity analysis plays a key role in the design and analysis of algorithms in modern optimization theory. However, establishing worst-case convergence bounds classically requires non-obvious insights and ad hoc reasoning. This talk aims to provide a gentle introduction to performance estimation techniques for the analysis of first-order optimization algorithms, along with a few open questions and recent developments around them. The talk will be accompanied by concrete examples and demonstrations of recent packages for computer-aided complexity analyses, including the PEPit package, available at https://pepit.readthedocs.io/.
09:40 - 10:20
TBA
Pontus Giselsson, Lund University
Biography
Pontus Giselsson is an Associate Professor at the Department of Automatic Control at Lund University and an organizer of the ELLIIT focus period on Optimization for Learning.
His main research focus is optimization, a modeling tool used as a core component in a wide range of problems such as optimal control, financial decision making, signal reconstruction, route planning, statistical estimation, and machine learning training. Optimization problems can be coarsely divided into convex or nonconvex, smooth or nonsmooth, and small-scale or large-scale; contemporary problems in, e.g., machine learning, signal reconstruction, control, and statistical estimation are often large-scale. His group's research focuses on understanding and developing efficient algorithms for solving such problems, with an emphasis on convex and nonsmooth problems and, in particular, on so-called operator splitting methods and their stochastic variants. The group develops frameworks for understanding a wide range of operator splitting methods that allow for a unified analysis and pave the way for the design of new and improved algorithms. It also develops tools for automated algorithm analysis, in which a so-called performance estimation problem is formulated that exactly captures the worst possible performance of an optimization algorithm over a user-specified class of problems; a solution to this typically small-scale performance estimation problem yields convergence guarantees for the analyzed algorithm.
10:20 - 10:50
Coffee
10:50 - 11:30
TBA
Peter Ochs, Saarland University
Biography
Peter Ochs received his M.Sc. degree in mathematics from Saarland University, Germany, in 2010, and his Ph.D. degree in mathematics from the University of Freiburg in 2015. During his Ph.D., he spent three months as a visiting researcher at TU Graz in Austria. After a year as a postdoctoral researcher at Saarland University, he returned to Freiburg. In November 2017, he became Junior Professor of Applied Mathematics at Saarland University and, in September 2020, Tenure-Track Professor at the University of Tübingen, with the final evaluation successfully completed in 2020.
Since March 2023, he has been a full Professor of Mathematics and Computer Science at Saarland University, where he heads the Mathematical Optimization for Data Science group. He received best paper awards at the Scale Space and Variational Methods Conference (SSVM) in 2015 and at the German Conference on Pattern Recognition (GCPR) in 2016. His research interests are in nonsmooth optimization with applications in computer vision, machine learning, image analysis, and data science in general.
11:30 - 12:10
Spectral optimizers for deep learning: muon, scion, and so on
Antonio Silveti-Falls, CentraleSupélec
Biography
Antonio (Tony) Silveti-Falls is an associate professor (maître de conférences) at CentraleSupélec, south of Paris, where he is a member of the Centre pour la Vision Numérique laboratory and the Inria team OPIS. After receiving his PhD in mathematics from Université de Caen Normandie in 2021, where he was supervised by Jalal Fadili and Gabriel Peyré, he completed a postdoc at the Toulouse School of Economics with Jérôme Bolte and Edouard Pauwels. His research focuses on nonsmooth, stochastic, and non-Euclidean optimization, especially conditional gradient methods (Frank-Wolfe) and conservative calculus (path-differentiable functions) applied to deep learning. His work on the generalized conditional gradient method won the best paper award at SPARS 2019.
Abstract
We discuss some recent advances in optimization for deep learning, with special attention paid to the spectral norm. We will comment on both the theoretical and the empirical properties of these algorithms, especially using the former to predict the latter.
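As background for the spectral viewpoint, optimizers such as Muon replace the raw momentum matrix by an approximate orthogonalization (mapping all singular values toward 1), computed with a Newton-Schulz iteration instead of an SVD. A minimal NumPy sketch follows; it is illustrative only, using the quintic coefficients commonly quoted for Muon, while real implementations add shape handling, scaling, and momentum bookkeeping:

```python
import numpy as np

def orthogonalize(M, steps=5):
    """Approximate U @ V.T from the SVD M = U @ S @ V.T via a quintic
    Newton-Schulz iteration, which pushes all singular values toward 1."""
    a, b, c = 3.4445, -4.7750, 2.0315   # coefficients often used in Muon
    X = M / (np.linalg.norm(M) + 1e-7)  # Frobenius normalization: all s_i <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X  # odd polynomial acts on singular values
    return X

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 4))  # stand-in for a momentum matrix
O = orthogonalize(G)
print(np.linalg.svd(O, compute_uv=False).round(2))  # singular values near 1
```

Because the iteration is an odd matrix polynomial, it transforms each singular value independently while leaving the singular vectors untouched, so the result approximates the spectral-norm-constrained steepest-descent direction.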
12:10 - 13:40
Lunch
13:40 - 14:20
TBA
Wotao Yin, Alibaba DAMO Academy
Biography
Wotao Yin is an applied mathematician, scientist, and engineer currently serving as the director of the Decision Intelligence Lab at the Alibaba DAMO Academy, following a tenure as a Professor of Mathematics at UCLA. He received his Ph.D. in OR from Columbia University and is widely recognized for his research in computational optimization, particularly large-scale and distributed algorithms, operator splitting methods, and their applications in image processing and machine learning. His contributions to the field have been honored with numerous prestigious awards, including the Morningside Gold Medal, the NSF CAREER Award, the Alfred P. Sloan Research Fellowship, and the INFORMS Egon Balas Prize.
14:20 - 15:00
Lightning Talks
15:00 - 15:45
Poster session & coffee
15:45 - 16:25
First-Order Methods through Partial Linearization
Alp Yurtsever, Umeå University
Biography
Alp Yurtsever is a WASP Assistant Professor of Optimization and Machine Learning at the Department of Mathematics and Mathematical Statistics, Umeå University, Sweden. His research develops theory and algorithms for challenging optimization problems, motivated by applications in resource allocation, networked decision-making, and machine learning. His interests include conic programming, large-scale semidefinite programming, structured nonconvex and bilevel optimization, quantum-assisted optimization, distributed learning, operator splitting, and adaptive methods. Prior to joining Umeå University, he received his PhD in Computer and Communication Sciences (EDIC) from École Polytechnique Fédérale de Lausanne (EPFL), where his dissertation was awarded a Thesis Distinction, and completed a postdoctoral fellowship at the Massachusetts Institute of Technology (MIT) in the Laboratory for Information and Decision Systems (LIDS).
Abstract
Difference-of-convex algorithms are built on a partial linearization mechanism. Taking this mechanism as a starting point, I consider objectives of the form F = f + g and focus on settings where linearizing g leads to tractable surrogate problems. This yields a DCA-type template for first-order methods, within which several classical first-order methods can be recovered as special cases. This viewpoint exposes a broad algorithmic design space induced by decomposition choices, but also raises a fundamental selection problem: which decomposition should one use in practice? I will illustrate this question with a concrete case study on projection-free methods, where different decompositions lead to distinct oracle complexity guarantees.
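In the notation of the abstract, the partial-linearization template can be sketched as follows (a generic form; the talk's precise assumptions on f and g, and the handling of nonsmooth g via subgradients, are as the speaker states them):

```latex
% Objective F = f + g. At iterate x_k, linearize only g and solve the surrogate:
x_{k+1} \in \operatorname*{arg\,min}_{x} \; f(x) + \langle \nabla g(x_k),\, x - x_k \rangle .
```

Different splittings of the same F into f + g yield different surrogates and hence different methods, which is exactly the decomposition-selection question the talk raises.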
16:25 - 17:05
TBA
Hayden Schaeffer, UCLA
Biography
Hayden Schaeffer is the Director of Applied Mathematics and a Professor of Mathematics at the University of California, Los Angeles. His research is in mathematical and scientific machine learning, differential equations, randomization, and modeling. He has received an NSF CAREER award and an AFOSR Young Investigator Award. Previously, he was an NSF Mathematical Sciences Postdoctoral Research Fellow, a von Karman Instructor at Caltech, a UC President’s Postdoctoral Fellow at UC Irvine, an NDSEG Fellow, and a Collegium of University Teaching Fellow at UCLA.
19:00
Turning Torso, Lilla Varvsgatan 14, 211 15 Malmö
Symposium dinner
Bus transport to dinner venue Turning Torso in Malmö departs from Lund Cathedral at 18:00.
Day 3 – May 8, 2026
09:00 - 09:40
Making your Theory-to-Practice Work: Online-to-Batch via Schedules & Schedule-Free Learning
Aaron Defazio, FAIR, Meta Superintelligence Labs
Biography
Aaron Defazio is a Research Scientist at FAIR (Fundamental AI Research), part of Meta Superintelligence Labs, where he researches new, theoretically driven approaches to AI training, with the ultimate goal of developing automatic, reliable, and fast optimization methods. He has previously worked on deep-learning-based methods for MRI imaging (the fastMRI project) and on automated theorem proving. His Schedule-Free Learning method won the AlgoPerf Self-Tuning Track Challenge in 2024, and in 2023 his work on the D-Adaptation method received an ICML Best Paper Award. He obtained his PhD in Computer Science from the Australian National University in 2014.
Abstract
I will introduce an alternative view of learning rate schedules, in which they are regarded as a technique for ensuring optimal convergence rates for the last iterate of an optimization procedure, a form of online-to-batch conversion. This view leads to a highly predictive theory of optimal learning rate schedules, explaining the learning rate warmup and annealing procedures used in practice. Going beyond this, I will show how this viewpoint suggests Schedule-Free approaches, in which learning rate schedules are replaced by iterate averaging schemes. These yield a number of benefits: no need to specify the stopping time in advance, smoother loss curves, and often better eval metrics.
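To make the iterate-averaging idea concrete, here is a minimal sketch of a Schedule-Free-style update on a toy 1D quadratic (simplified from the published Schedule-Free scheme; the constants are illustrative, not tuned, and the real method handles stochastic gradients and more):

```python
def grad(y):
    # gradient of the toy objective f(y) = 0.5 * (y - 3.0)**2
    return y - 3.0

beta, lr, T = 0.9, 0.5, 2000
z = x = 0.0                          # z: base SGD sequence, x: averaged sequence
for t in range(1, T + 1):
    y = (1 - beta) * z + beta * x    # gradient is evaluated at an interpolation
    z = z - lr * grad(y)             # plain SGD step, with no schedule at all
    x = (1 - 1 / t) * x + (1 / t) * z  # running average replaces annealing

print(round(x, 3))  # the averaged iterate approaches the minimizer 3.0
```

Note that nothing here depends on knowing T in advance: the averaged iterate x is a valid output at every step, which is the practical benefit the abstract highlights.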
09:40 - 10:20
TBA
Jason Altschuler, University of Pennsylvania
Biography
Jason Altschuler is an Assistant Professor at UPenn in the Department of Statistics and Data Science, and, by courtesy, in the Departments of Computer Science, Electrical Engineering, and Applied Mathematics. Previously, he received his undergraduate degree from Princeton and his PhD from MIT. He is the recipient of a Sloan Fellowship in Mathematics, the ICS Prize for the best papers at the interface of computer science and operations research, the MIT Sprowls Dissertation Award, the Mathematical Optimization Society’s Tucker Finalist Prize, and Undergraduate Teaching Excellence Awards. His research interests lie at the interface of optimization, probability, and machine learning, with a focus on the design and analysis of efficient algorithms.
10:20 - 10:50
Coffee
10:50 - 11:30
TBA
Jeremy Bernstein, Thinking Machines Lab
Biography
Jeremy Bernstein is a machine learning researcher based in San Francisco, California. He works at Thinking Machines Lab. His goal is to uncover the computational and statistical laws of natural and artificial intelligence, and thereby design learning systems that are more efficient, more automatic and more useful in practice.
11:30 - 12:10
Training LLMs: Do We Understand Our Optimizers?
Antonio Orvieto, ELLIS Institute Tübingen, MPI
Biography
Antonio studied Control Engineering in Italy and Switzerland. He holds a PhD in Computer Science from ETH Zürich and spent time at DeepMind (UK), Meta (US), MILA (CA), INRIA (FR), and HILTI (LI). He is currently a Hector Endowed Fellow and Principal Investigator (PI) at the ELLIS Institute Tübingen and an Independent Group Leader at the MPI for Intelligent Systems, where he leads the Deep Models and Optimization group. He received the ETH medal for outstanding doctoral theses and the Schmidt Sciences AI2050 Early Career Fellowship.
In his research, Antonio strives to improve the efficiency of deep learning technologies by pioneering new architectures and training techniques grounded in theoretical knowledge. His work encompasses two main areas: understanding the intricacies of large-scale optimization dynamics and designing innovative architectures and powerful optimizers capable of handling complex data. Central to his studies is exploring innovative techniques for decoding patterns in sequential data, with implications in biology, neuroscience, natural language processing, and music generation.
Abstract
Why does Adam so consistently outperform SGD when training Transformer language models? Despite numerous proposed explanations, the optimizer gap remains largely unexplained. In this talk, we will present results from two complementary studies. First, using over 2000 language model training runs, we compare Adam with simplified variants such as signed gradient and signed momentum. We find that while signed momentum is faster than SGD, it still lags behind Adam; however, we crucially notice that constraining Adam’s momentum parameters to be equal (beta1 = beta2) retains near-optimal performance. This is of great practical importance and also reveals a new insight: Adam in this form has a robust statistical interpretation and a clear link to mollified sign descent. Second, through carefully tuned comparisons of SGD with momentum and Adam, we show that SGD can actually match Adam in small-batch training, but loses ground as batch size grows. Analyzing both Transformer experiments and quadratic models with stochastic differential equations, we shed new light on the role of batch size in shaping training dynamics.
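The tied-betas observation is easy to experiment with. Below is a minimal pure-Python sketch of Adam with beta1 = beta2 on a toy quadratic (an illustration only, not the study's experimental setup):

```python
import math

def adam_tied_betas(grad, x0, lr=0.1, beta=0.95, eps=1e-8, steps=200):
    # Standard Adam update with the two momentum parameters tied: beta1 = beta2.
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta * m + (1 - beta) * g        # first moment estimate
        v = beta * v + (1 - beta) * g * g    # second moment estimate
        m_hat = m / (1 - beta ** t)          # bias correction
        v_hat = v / (1 - beta ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# f(x) = 0.5 * (x - 3.0)**2. For a constant-sign gradient, m_hat / sqrt(v_hat)
# behaves like sign(g), i.e. a mollified sign-descent step of size about lr.
x = adam_tied_betas(lambda x: x - 3.0, x0=0.0)
print(round(x, 1))  # ends near the minimizer 3.0
```

With tied betas and a slowly varying gradient, the update direction approaches sign(g), which is the link to mollified sign descent mentioned in the abstract.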
12:10 - 13:40
Lunch
13:40 - 14:20
River-Valley Landscapes in Neural Network Training and a Theory-Practice Gap for Momentum
Chulhee Yun, KAIST
Biography
Chulhee “Charlie” Yun is an Ewon Assistant Professor at the KAIST Kim Jaechul Graduate School of AI, where he has directed the Optimization & Machine Learning Laboratory since 2022. Starting September 2025, he holds a joint affiliation with the KAIST Graduate School of AI for Math and a part-time Visiting Faculty Researcher position at Google Research. He completed his PhD at the Laboratory for Information and Decision Systems (LIDS) at MIT, under the joint supervision of Prof. Suvrit Sra and Prof. Ali Jadbabaie, following an MSc from Stanford University and a BSc from KAIST. His research focuses on the theoretical aspects of optimization algorithms, machine learning, and deep learning, with the goal of bridging the gap between theory and practice in these areas.
Abstract
Neural network training is often believed to be largely confined to a low-dimensional subspace aligned with the sharpest-curvature directions (Gur-Ari et al., 2018). In this talk, I will present evidence that challenges this picture: in modern neural network training, substantial progress can instead be driven by movement in the “bulk,” outside the sharpest-curvature subspace. Building on this observation, I introduce a “river-valley” view of the loss landscape, where sharp directions form valley walls while learning happens along a flatter river direction. This lens helps explain many common behaviors of neural network optimizers, most notably why Polyak momentum can accelerate convergence by increasing effective progress along the river, and why schedule-free methods (Defazio et al., 2024) often track low-loss trajectories. I will close with a theoretical counterpoint from our recent work: in nonconvex optimization under a mere smoothness assumption, momentum admits worst-case lower bounds showing it can be strictly slower than non-momentum counterparts. This contrast raises the question of which assumptions and which notions of progress are needed to faithfully connect theory to practice.
14:20 - 15:00
Lightning Talks
15:00 - 15:45
Poster session & coffee
15:45 - 16:25
Understanding Optimization in Deep Learning with Central Flows
A two-part talk with Alex Damian
Jeremy Cohen, The Flatiron Institute
Biography
Jeremy Cohen is a research fellow at the Flatiron Institute, New York, USA. He is broadly interested in turning deep learning into a principled engineering discipline, and currently works on understanding the dynamics of optimization algorithms in deep learning. He obtained his PhD in 2024 from Carnegie Mellon University, advised by Zico Kolter and Ameet Talwalkar.
Abstract
Traditional theories of optimization cannot describe the dynamics of optimization in deep learning, even in the simple setting of deterministic training. The challenge is that optimizers typically operate in a complex, oscillatory regime called the “edge of stability.” In this work, we develop theory that can describe the dynamics of optimization in this regime. Our key insight is that while the *exact* trajectory of an oscillatory optimizer may be challenging to analyze, the *time-averaged* (i.e. smoothed) trajectory is often much more tractable. To analyze an optimizer, we derive a differential equation called a “central flow” that characterizes this time-averaged trajectory. We empirically show that these central flows can predict long-term optimization trajectories for generic neural networks with a high degree of numerical accuracy. By interpreting these central flows, we are able to understand how gradient descent makes progress even as the loss sometimes goes up; how adaptive optimizers “adapt” to the local loss landscape; and how adaptive optimizers implicitly navigate towards regions where they can take larger steps. Our results suggest that central flows can be a valuable theoretical tool for reasoning about optimization in deep learning.
16:25 - 17:05
Understanding Optimization in Deep Learning with Central Flows
A two-part talk with Jeremy Cohen
Alex Damian, The Kempner Institute at Harvard University
Biography
Alex Damian is a research fellow at the Kempner Institute at Harvard University and will join MIT in Fall 2026 as an Assistant Professor of Mathematics and EECS [AI+D]. His research focuses on the mathematical foundations of deep learning, with particular emphasis on optimization dynamics and representation learning. He received his Ph.D. in Applied and Computational Mathematics from Princeton University, where he was advised by Jason D. Lee, and his B.S. in Mathematics from Duke University. His work has been supported by the NSF Graduate Research Fellowship and the Jane Street Graduate Research Fellowship.
Abstract
Traditional theories of optimization cannot describe the dynamics of optimization in deep learning, even in the simple setting of deterministic training. The challenge is that optimizers typically operate in a complex, oscillatory regime called the “edge of stability.” In this work, we develop theory that can describe the dynamics of optimization in this regime. Our key insight is that while the *exact* trajectory of an oscillatory optimizer may be challenging to analyze, the *time-averaged* (i.e. smoothed) trajectory is often much more tractable. To analyze an optimizer, we derive a differential equation called a “central flow” that characterizes this time-averaged trajectory. We empirically show that these central flows can predict long-term optimization trajectories for generic neural networks with a high degree of numerical accuracy. By interpreting these central flows, we are able to understand how gradient descent makes progress even as the loss sometimes goes up; how adaptive optimizers “adapt” to the local loss landscape; and how adaptive optimizers implicitly navigate towards regions where they can take larger steps. Our results suggest that central flows can be a valuable theoretical tool for reasoning about optimization in deep learning.
17:05 - 17:15
Closing