Detailed program
Please note that the program is still subject to change.
May 5, 2026
17:00 - 19:00
Historical Museum at Lund University
Krafts Torg 1, 223 50 Lund.
Welcome reception at the Historical Museum
A welcome drink and some hors d’oeuvres will be served.
Day 1 – May 6, 2026
08:30 - 08:45
Registration
08:45 - 09:00
Opening
09:00 - 09:40
An Alternative to the Frank-Wolfe Method & Potential Applications to ML
Peter Richtárik, KAUST
Biography
Peter Richtárik is a professor of Computer Science at King Abdullah University of Science and Technology (KAUST), Saudi Arabia, where he leads the Optimization and Machine Learning Lab. Through his work on randomized and distributed optimization algorithms, he has contributed to the foundations of machine learning and optimization. He is one of the original developers of Federated Learning. Prof. Richtárik’s work has attracted international awards, including the Charles Broyden Prize, the SIAM SIGEST Best Paper Award, and a Distinguished Speaker Award at the 2019 International Conference on Continuous Optimization. He serves as an Area Chair for leading machine learning conferences, including NeurIPS, ICML, and ICLR, and is an Action Editor of JMLR and an Associate Editor of Numerische Mathematik and Optimization Methods and Software.
Abstract
I will talk about a new method based on the linear minimization oracle. The method has stronger convergence properties than the Frank-Wolfe method, but relies on a somewhat more involved linear minimization oracle and delicate step-size rules. Since Frank-Wolfe has many applications across machine learning (e.g., the recent Scion optimizer is a stochastic variant of Frank-Wolfe with momentum), the new method is potentially an interesting alternative to, or even a replacement for, the now classical Frank-Wolfe approach.
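For orientation (not part of the talk itself), here is a minimal sketch of the classical Frank-Wolfe iteration being compared against; the quadratic objective and the L1-ball linear minimization oracle are illustrative choices of ours.

```python
import numpy as np

def frank_wolfe(grad, lmo, x0, num_iters=100):
    """Classical Frank-Wolfe (conditional gradient) method.

    grad: returns the gradient of f at x.
    lmo:  linear minimization oracle, s = argmin_{s in C} <g, s>.
    """
    x = x0
    for k in range(num_iters):
        s = lmo(grad(x))             # query the linear minimization oracle
        gamma = 2.0 / (k + 2.0)      # classical open-loop step size
        x = (1 - gamma) * x + gamma * s
    return x

# Illustrative instance: minimize ||x - b||^2 over the unit L1 ball,
# whose LMO returns a signed coordinate vertex.
b = np.array([0.8, -0.3, 0.5])

def lmo(g):
    s = np.zeros_like(g)
    i = np.argmax(np.abs(g))
    s[i] = -np.sign(g[i])
    return s

x = frank_wolfe(lambda x: 2 * (x - b), lmo, x0=np.zeros(3))
```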
09:40 - 10:20
Exploiting Similarity in Federated Learning
Sebastian Stich, CISPA and ELLIS
Biography
Dr. Sebastian Stich is a tenured faculty member at the CISPA Helmholtz Center for Information Security and a member of the European Laboratory for Learning and Intelligent Systems (ELLIS). His research focuses on the intersection of machine learning, optimization, and statistics, with an emphasis on efficient parallel and distributed algorithms for training models over decentralized datasets.
He obtained his PhD from ETH Zurich and held postdoctoral positions at UCLouvain and EPFL. His work has been recognized with a Meta Research Award (2022), a Google Research Scholar Award (2023), and an ERC Consolidator Grant (CollectiveMinds, 2024).
Abstract
We provide a brief introduction to local update methods developed for federated optimization and discuss their worst-case complexity. Surprisingly, these methods often perform much better in practice than predicted by theoretical analyses using classical assumptions. Recent years have revealed that their performance can be better described using refined notions that capture the similarity among client objectives. In this talk, we introduce a generic framework based on a distributed proximal point algorithm, which consolidates many of our insights and allows for the adaptation of arbitrary centralized optimization algorithms to the convex federated setting, including accelerated variants. Our theoretical analysis shows that the derived methods enjoy faster convergence when the degree of similarity among clients is high.
Based on joint work with Xiaowen Jiang and Anton Rodomanov.
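As generic background (a sketch of ours, not the talk's proximal-point framework), the basic local update method, local SGD / FedAvg, alternates local gradient steps with server averaging; the quadratic clients below are placeholders with similar but non-identical minimizers.

```python
import numpy as np

def local_sgd(client_grads, x0, rounds=50, local_steps=10, lr=0.1):
    """Local SGD / FedAvg: each client takes several local gradient steps
    between communications, then the server averages the local iterates."""
    x = x0.copy()
    for _ in range(rounds):
        locals_ = []
        for grad in client_grads:            # one gradient oracle per client
            y = x.copy()
            for _ in range(local_steps):     # local updates, no communication
                y -= lr * grad(y)
            locals_.append(y)
        x = np.mean(locals_, axis=0)         # server averaging step
    return x

# Illustrative clients: quadratics whose minimizers are similar but not equal,
# mimicking a high degree of similarity among client objectives.
targets = [np.array([1.0, 0.0]), np.array([0.9, 0.2]), np.array([1.1, -0.1])]
client_grads = [(lambda t: (lambda y: y - t))(t) for t in targets]
x = local_sgd(client_grads, x0=np.zeros(2))
```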
10:20 - 10:50
Coffee
10:50 - 11:30
Three Paths to Minima Selection
Niao He, ETH Zurich
Biography
Niao He is currently an Associate Professor in the Department of Computer Science at ETH Zürich, where she leads the Optimization & Decision Intelligence (ODI) Group. She is also a core faculty member at:
- Institute of Machine Learning
- ETH AI Center
- ETH Foundations of Data Science
- Max Planck ETH Center for Learning Systems
Niao He’s work lies at the interface of optimization and machine learning, with a primary focus on the algorithmic and theoretical foundations for principled, scalable, and trustworthy decision intelligence. She is also interested in developing machine learning models and algorithms for interdisciplinary applications in operations management, mechanism design, control & robotics, etc.
With thanks to the Swiss National Science Foundation, ETH Foundations, and NCCR Automation for generously funding her current research.
Abstract
Classical optimization is often built around a simple goal: find a minimizer. Most existing theories emphasize convergence rates towards the minima, implicitly treating all solutions as equivalent once optimality is achieved. Modern machine learning applications tell a different story. Distinct minimizers with identical objective values can differ dramatically in properties that matter in practice, including generalization performance, robustness, and even broader societal implications. In overparameterized models, especially in deep learning, a striking phenomenon emerges: even after training error reaches zero, test performance continues to improve, indicating that optimization dynamics keep evolving within the set of global minima. This raises a fundamental and largely under-explored question: which minima do optimization algorithms implicitly select? More ambitiously, can we actively steer optimization dynamics toward desirable minima? In this talk, I will present recent results that shed light on both implicit and active minima selection, highlighting the roles played by optimization dynamics (first- and zeroth-order methods), stochastic noise, and the geometry of the solution landscape.
11:30 - 12:10
TBA
Anastasia Koloskova, University of Zurich
Biography
Anastasia Koloskova is an Assistant Professor of AI and Optimization in the Department of Mathematical Modeling and Machine Learning at the University of Zurich. Her research focuses on machine learning and optimization, particularly in decentralized and collaborative learning and privacy. Previously, Anastasia Koloskova was a postdoctoral researcher at Stanford University (STAIR lab, Prof. Sanmi Koyejo), and completed her PhD at EPFL in the Machine Learning and Optimization Laboratory (MLO) with Prof. Martin Jaggi.
12:10 - 13:40
Lunch
13:40 - 14:20
Strong convergence and fast residual decay for monotone operator flows via Tikhonov regularization
Radu I. Boţ, University of Vienna
Biography
Radu I. Boţ is Professor of Applied Mathematics with Emphasis on Optimization at the Faculty of Mathematics of the University of Vienna and a founding member of the Research Platform “Data Science@Uni Vienna”. He currently serves as Dean of the Faculty of Mathematics at the University of Vienna. He received his Diploma and M.Sc. degrees in Mathematics from Babeş-Bolyai University in Cluj-Napoca, Romania, and earned his Ph.D. degree as well as his Habilitation in Mathematics from Chemnitz University of Technology, Germany.
His research interests include continuous-time and discrete-time models for optimization and monotone inclusions, convex analysis, nonsmooth and variational analysis, monotone operator theory, and optimization methods for data science. His research has been funded by the Austrian Science Fund, the Austrian Research Promotion Agency, the German Research Foundation, the Romanian National Research Council, the Australian Research Council, as well as by industrial partners. He is (co-)author of the books Duality in Vector Optimization and Conjugate Duality in Convex Optimization, published by Springer. Radu I. Boţ serves on the editorial boards of several leading journals, including Mathematical Programming, Computational Optimization and Applications, Applied Mathematics and Optimization, and the Journal of Optimization Theory and Applications. Since January 2026, he has been Editor-in-Chief of the prestigious SIAM Journal on Optimization.
Abstract
In the framework of real Hilbert spaces, we investigate first-order dynamical systems governed by monotone and continuous operators. It has been established that for these systems, only the ergodic trajectory converges to a zero of the operator. However, trajectory convergence is assured for operators with the stronger property of cocoercivity. For this class of operators, the trajectory’s velocity and the operator values along the trajectory converge in norm to zero at a rate of o(1/√t) as t → +∞.
In this talk, we show that augmenting a monotone operator flow with a Tikhonov regularization term ensures not only strong convergence of the trajectory to the minimal-norm element of the zero set, but also enables the derivation of explicit convergence rates. In particular, we establish norm rates for the trajectory’s velocity and for the residual of the operator along the trajectory, expressed in terms of the regularization function. In some particular cases, these rates can be as fast as O(1/t) as t → +∞. In this way, we emphasize a surprising acceleration feature of the Tikhonov regularization. Additionally, we explore these properties for monotone operator flows that incorporate time rescaling and an anchor point. For a specific choice of the Tikhonov regularization function, these flows are closely linked to second-order dynamical systems with a vanishing damping term. The convergence and convergence rate results we achieve for these systems complement recent findings for the Fast Optimistic Gradient Descent Ascent (OGDA) dynamics.
Finally, we derive, via an explicit discretization of the Tikhonov regularized monotone flow, a novel Extra-Gradient method with an anchor term governed by general parameters. We establish strong convergence to specific points within the solution set, as well as convergence rates expressed in terms of the regularization parameters. Notably, our approach recovers the fast residual decay rate O(1/k) as k → +∞ for standard parameter choices.
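Schematically, in notation chosen here for illustration (M the monotone operator, ε(t) the Tikhonov regularization function), the regularized flow studied in this talk takes the form

\[ \dot{x}(t) = -M(x(t)) - \varepsilon(t)\,x(t), \qquad t \ge t_0, \]

and under suitable decay conditions on ε(t) the trajectory x(t) converges strongly to the minimal-norm zero of M, with velocity and residual rates expressed in terms of ε.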
14:20 - 15:00
Lightning Talks
15:00 - 15:45
Poster session & coffee
15:45 - 16:25
Gradient alignment, learning, and optimization
Jelena Diakonikolas, University of Wisconsin-Madison
Biography
Jelena Diakonikolas is an Assistant Professor in the Department of Computer Sciences and, by courtesy, the Department of Statistics at the University of Wisconsin-Madison. She is also an affiliate of the Data Science Institute at UW-Madison.
Her main research interests are in the area of large-scale optimization. She is also interested in applications of optimization methods, particularly within machine learning.
Prior to joining UW-Madison, Jelena Diakonikolas was a Postdoctoral Fellow at UC Berkeley’s Foundations of Data Analysis (FODA) TRIPODS Institute, where she primarily worked with Mike Jordan. In Fall 2018, she was a Microsoft Research Fellow at the Simons Institute for the Theory of Computing, associated with the program on Foundations of Data Science. Prior to starting the postdoctoral position at UC Berkeley, she was a Postdoctoral Associate at the Department of Computer Science, Boston University, where she worked with Lorenzo Orecchia. She completed her Ph.D. at the Department of Electrical Engineering, Columbia University, where she was co-advised by Gil Zussman and Cliff Stein.
Some publications are under her maiden name, Marašević.
Abstract
Generalized Linear Models (GLMs) represent functions formed by composing a known univariate nonlinear activation with a linear map defined by an unknown vector w. The learning task—recovering w from i.i.d. labeled examples (x, y), where y is a noisy evaluation of the GLM—leads to a nonconvex, often nonsmooth optimization problem, even for simple activations.
GLMs are a fundamental model in supervised learning, capturing low-dimensional structure in high-dimensional data. While the setting with zero-mean bounded-variance noise has been well studied, more realistic formulations—where labels may deviate arbitrarily from any ground-truth GLM—are substantially more challenging. In particular, more relaxed notions of error and much stronger structural assumptions about both the activation and the distribution generating the data are required for computational tractability. Most provable guarantees in this regime have emerged only recently.
In this talk, I will survey these developments and present a unifying optimization-theoretic framework based on local error bounds. These bounds capture how the gradient field remains meaningfully aligned with a target solution, thus providing a geometric “signal” that enables efficient learning with first-order methods, despite nonconvexity and noise. I will further discuss a generalization of these results to the setting of single-index models, where the activation is unknown and optimization is performed over a class of unknown activations, in addition to the parameter vector.
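In symbols (notation chosen here for concreteness): with a known activation σ and unknown parameter vector w, a GLM models labels as noisy evaluations of x ↦ σ(⟨w, x⟩), and learning amounts to the empirical risk minimization

\[ \min_{w} \; \frac{1}{n} \sum_{i=1}^{n} \big( \sigma(\langle w, x_i \rangle) - y_i \big)^2, \]

which is nonconvex (and, for activations such as the ReLU, nonsmooth) in w.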
16:25 - 17:05
Extragradient Methods for Modern Machine Learning: New Theory, Step-Size Rules, and Stochastic Variants
Nicolas Loizou, Johns Hopkins University
Biography
Nicolas Loizou is an Assistant Professor in the Department of Applied Mathematics and Statistics and the Mathematical Institute for Data Science (MINDS) at Johns Hopkins University, where he leads the Optimization and Machine Learning Lab. He holds secondary appointments in the Departments of Computer Science and Electrical and Computer Engineering and is a member of the Johns Hopkins Data Science Institute and the Ralph O’Connor Sustainable Energy Institute (ROSEI). Prior to this, he was a Postdoctoral Research Fellow at Mila – Quebec Artificial Intelligence Institute and the University of Montreal. He holds a Ph.D. in Optimization and Operational Research from the University of Edinburgh, School of Mathematics, an M.Sc. in Computing from Imperial College London, and a BSc in Mathematics from the National and Kapodistrian University of Athens.
His research interests include large-scale optimization, machine learning, randomized numerical linear algebra, distributed and decentralized algorithms, algorithmic game theory, and federated learning. He currently serves as an action editor for Information and Inference: A Journal of the IMA, Optimization Methods and Software, and Transactions on Machine Learning Research. He has received several awards and fellowships, including the OR Society’s 2019 Doctoral Award (runner-up) for the “Most Distinguished Body of Research leading to the Award of a Doctorate in the field of Operational Research”, the IVADO Fellowship, the COAP 2020 Best Paper Award, the CISCO 2023 Research Award, and the Catalyst 2025 Award.
Abstract
Extragradient methods are a fundamental class of algorithms for min-max optimization problems and variational inequalities, with growing relevance in modern machine learning. While the classical theory is largely developed under smoothness and other relatively restrictive assumptions, many machine learning problems call for analysis under weaker regularity conditions and in stochastic, large-scale settings. In this talk, we present new convergence results for deterministic and stochastic extragradient methods beyond the classical framework. In particular, we establish guarantees under weaker regularity assumptions, namely the (L0, L1)-Lipschitz condition, and derive new step-size rules that expand the range of provably convergent regimes. We also introduce Polyak-type step sizes for deterministic and stochastic extragradient methods, leading to adaptive variants with favorable theoretical properties and practical performance. Our results focus primarily on monotone problems, with extensions to selected structured non-monotone settings. We conclude with numerical experiments illustrating both the theory and the practical behavior of the proposed methods.
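For reference, a minimal sketch of the classical deterministic extragradient iteration with a fixed step size, the starting point the abstract refers to; the bilinear saddle-point example is our illustrative choice.

```python
import numpy as np

def extragradient(F, z0, step=0.1, num_iters=1000):
    """Classical extragradient for a variational inequality with operator F:
    extrapolate to a midpoint, then update using the operator at the midpoint."""
    z = z0
    for _ in range(num_iters):
        z_half = z - step * F(z)     # extrapolation (prediction) step
        z = z - step * F(z_half)     # correction step evaluated at the midpoint
    return z

# Illustrative monotone operator: the bilinear saddle point min_x max_y xy,
# on which plain gradient descent-ascent diverges but extragradient converges.
F = lambda z: np.array([z[1], -z[0]])
z = extragradient(F, z0=np.array([1.0, 1.0]))
```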
Day 2 – May 7, 2026
09:00 - 09:40
Recent advances on the systematic analysis and design of first-order optimization algorithms
Convex optimization as a proof assistant for algorithm analysis and design
Adrien Taylor, Inria Paris
Biography
Adrien Taylor is currently a research scientist at the French Institute for Research in Computer Science and Automation (Inria) in Paris, within the SIERRA team. Before that, he was a postdoctoral researcher in the same team in 2017-2019, working with Francis Bach. He completed a PhD at Université catholique de Louvain, in the department of mathematical engineering (part of the ICTEAM institute), where he held an F.R.S.-FNRS FRIA scholarship under the supervision of François Glineur and Julien Hendrickx.
His research currently focuses on optimization (mostly first-order) and numerical analysis with a bit of control and machine learning. He finds it particularly important to push toward reproducible (including theory) and understandable science, and many of his research projects have this orientation. Adrien Taylor was awarded an ERC Starting Grant 2024 (project CASPER) for working in this direction from fall 2024 to fall 2029.
Abstract
Complexity analysis plays a key role in the design and analysis of algorithms in modern optimization theory. However, establishing worst-case convergence bounds classically requires non-obvious insights and ad hoc reasoning. This talk aims to provide a gentle introduction to performance estimation techniques for the analysis of first-order optimization algorithms, along with a few open questions and recent developments around them. The talk will be accompanied by concrete examples and demonstrations of recent packages for computer-aided complexity analyses, including the PEPit package, available at https://pepit.readthedocs.io/.
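To give a flavor of such computer-aided analyses, here is a small script in the style of the examples in the PEPit documentation, computing a tight worst-case bound for gradient descent on smooth strongly convex functions; treat the exact class and method names as indicative of the package's API rather than authoritative.

```python
from PEPit import PEP
from PEPit.functions import SmoothStronglyConvexFunction

L, mu, gamma, n = 1.0, 0.1, 1.0, 5     # smoothness, strong convexity, step, #iters

problem = PEP()
func = problem.declare_function(SmoothStronglyConvexFunction, mu=mu, L=L)
xs = func.stationary_point()           # the (implicitly defined) minimizer

x = problem.set_initial_point()
problem.set_initial_condition((x - xs) ** 2 <= 1)

for _ in range(n):                     # the algorithm under analysis
    x = x - gamma * func.gradient(x)

problem.set_performance_metric((x - xs) ** 2)
tau = problem.solve()                  # tight worst-case bound via a small SDP
# tau should match the known rate max(|1 - gamma*mu|, |1 - gamma*L|) ** (2 * n).
```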
09:40 - 10:20
Automated tight Lyapunov analysis for first-order methods
Pontus Giselsson, Lund University
Biography
Pontus Giselsson is Associate Professor at the Department of Automatic Control at Lund University, and organizer of the ELLIIT focus period on Optimization for learning.
His research focuses on optimization, with particular emphasis on principled methodologies for algorithm analysis and design. Although many optimization methods are still studied on a case-by-case basis, their analyses often exhibit strong common structure. His work seeks to capture these similarities through unified frameworks and automated tools for systematically analyzing, designing, and improving algorithms, with rigorous convergence guarantees.
Abstract
This talk is about automating a central step in the convergence analysis of splitting methods for structured optimization and inclusion problems: the search for a suitable Lyapunov inequality. While this step underlies many existing convergence proofs, finding such an inequality is often technically delicate and carried out on a case-by-case basis. To aid in this process, we present a numerical Lyapunov-based framework, along with the AutoLyap package, in which quadratic convergence certificates can be found by solving small semidefinite programs. Using this methodology, we derive significantly extended convergent parameter regions for classical methods including Douglas–Rachford splitting, ADMM, and the Chambolle–Pock method, when the analysis is specialized to convex optimization problems rather than the broader monotone inclusion setting. These results highlight the potential of automated Lyapunov analysis to uncover improved convergence guarantees for classical and new splitting methods.
10:20 - 10:50
Coffee
10:50 - 11:30
Learning Optimization Algorithms with Average Case Convergence Rates
Peter Ochs, Saarland University
Biography
Peter Ochs received his M.Sc. degree in mathematics from Saarland University in Germany in 2010, and his Ph.D. degree in mathematics from the University of Freiburg in 2015. During his Ph.D., he spent three months as a visiting researcher at TU Graz in Austria. After a year as a postdoctoral researcher at Saarland University, he returned to Freiburg. In November 2017, he became Junior Professor of Applied Mathematics at Saarland University and, in September 2020, Tenure-Track Professor at the University of Tübingen, with the final evaluation successfully completed in 2020.
Since March 2023, he has been Full Professor of Mathematics and Computer Science at Saarland University, where he heads the Mathematical Optimization for Data Science group. He received the best paper award at the Scale Space and Variational Methods Conference (SSVM) in 2015 and at the German Conference on Pattern Recognition (GCPR) in 2016. His research interests are in non-smooth optimization with applications in computer vision, machine learning, image analysis, and data science in general.
Abstract
The change of paradigm from purely model-driven to data-driven (learning-based) approaches has tremendously altered the picture in many applications in Machine Learning, Computer Vision, Signal Processing, Inverse Problems, Statistics, and so on. There is no need to mention the significant boost in performance for many specific applications, thanks to the advent of large-scale Deep Learning. In this talk, we open the area of optimization algorithms to this data-driven paradigm, for which theoretical guarantees are indispensable. The expectations about an optimization algorithm go clearly beyond empirical evidence, as there may be a whole processing pipeline depending on a reliable output of the optimization algorithm, and application domains of algorithms can vary significantly. While there is already a vast literature on “learning to optimize”, there are no theoretical guarantees associated with these algorithms that meet these expectations from an optimization point of view. We develop the first framework to learn optimization algorithms with provable generalization guarantees for certain classes of optimization problems, while the learning-based backbone enables the algorithms’ functioning far beyond the limitations of classical (deterministic) worst-case bounds. Our results rely on PAC-Bayes bounds for general, unbounded loss functions based on exponential families. We learn optimization algorithms with provable generalization guarantees (PAC-bounds) and an explicit trade-off between a high probability of convergence and a high convergence speed.
11:30 - 12:10
Spectral optimizers for deep learning: muon, scion, and so on
Antonio Silveti-Falls, CentraleSupélec
Biography
Antonio (Tony) Silveti-Falls is an associate professor (maître de conférences) at CentraleSupélec in the south of Paris, where he is a member of the Centre pour la Vision Numérique laboratory and the INRIA team OPIS. After receiving his PhD in mathematics from Université de Caen Normandie in 2021, where he was supervised by Jalal Fadili and Gabriel Peyré, he completed a postdoc at Toulouse School of Economics with Jérôme Bolte and Edouard Pauwels. His research continues to focus on {nonsmooth, stochastic, noneuclidean} optimization, especially conditional gradient methods (Frank-Wolfe) and conservative calculus (path differentiable functions) applied to deep learning. His work on the generalized conditional gradient method won the best paper award at SPARS 2019.
Abstract
We discuss some recent advances in optimization for deep learning, with special attention paid to the spectral norm. We will comment on both the theoretical and the empirical properties of these algorithms, especially using the former to predict the latter.
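As rough context (a sketch of ours, not code from the talk): optimizers in this family, such as Muon and Scion, steer updates by approximately orthogonalizing the momentum matrix, i.e., replacing it by its polar factor, the steepest-descent direction under the spectral norm. A classical way to do this without an SVD is the Newton-Schulz iteration; the cubic coefficients below are the textbook choice, while production implementations use tuned higher-order polynomial variants, and the surrounding update is a hypothetical Muon-style step.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=10):
    """Approximate the orthogonal polar factor of G (the U V^T of its SVD)
    with the classical cubic Newton-Schulz iteration, avoiding an explicit SVD."""
    X = G / (np.linalg.norm(G, ord=2) + 1e-12)  # scale singular values into (0, 1]
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X         # X <- 1.5 X - 0.5 X X^T X
    return X

def muon_style_step(W, grad, buf, lr=0.02, beta=0.95):
    """One hypothetical Muon-style update on a weight matrix W:
    accumulate momentum, then apply the orthogonalized momentum as the step."""
    buf = beta * buf + grad
    W = W - lr * newton_schulz_orthogonalize(buf)
    return W, buf
```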
12:10 - 13:40
Lunch
13:40 - 14:20
TBA
Wotao Yin, Alibaba DAMO Academy
Biography
Wotao Yin is an applied mathematician, scientist, and engineer currently serving as the director of the Decision Intelligence Lab at the Alibaba DAMO Academy, following a tenure as a Professor of Mathematics at UCLA. He received his Ph.D. in Operations Research from Columbia University and is widely recognized for his research in computational optimization, particularly large-scale and distributed algorithms, operator splitting methods, and their applications in image processing and machine learning. His contributions to the field have been honored with numerous prestigious awards, including the Morningside Gold Medal, the NSF CAREER Award, the Alfred P. Sloan Research Fellowship, and the INFORMS Egon Balas Prize.
14:20 - 15:00
Lightning Talks
15:00 - 15:45
Poster session & coffee
15:45 - 16:25
First-Order Methods through Partial Linearization
Alp Yurtsever, Umeå University
Biography
Alp Yurtsever is a WASP Assistant Professor of Optimization and Machine Learning at the Department of Mathematics and Mathematical Statistics, Umeå University, Sweden. His research develops theory and algorithms for challenging optimization problems, motivated by applications in resource allocation, networked decision-making, and machine learning. His interests include conic programming, large-scale semidefinite programming, structured nonconvex and bilevel optimization, quantum-assisted optimization, distributed learning, operator splitting, and adaptive methods. Prior to joining Umeå University, he received his PhD in Computer and Communication Sciences (EDIC) from École Polytechnique Fédérale de Lausanne (EPFL), where his dissertation was awarded a Thesis Distinction, and completed a postdoctoral fellowship at the Massachusetts Institute of Technology (MIT) in the Laboratory for Information and Decision Systems (LIDS).
Abstract
Difference-of-convex algorithms are built on a partial linearization mechanism. Taking this mechanism as a starting point, I consider objectives of the form F = f + g and focus on settings where linearizing g leads to tractable surrogate problems. This yields a DCA-type template for first-order methods. Within this template, several classical first-order methods can be recovered as special cases. This viewpoint exposes a broad algorithmic design space induced by decomposition choices, but also raises a fundamental selection problem: which decomposition should one use in practice? I will illustrate this question with a concrete case study using projection-free methods, where different decompositions lead to distinct oracle complexity guarantees.
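Schematically (notation chosen here), the partial linearization step for F = f + g keeps f exact and linearizes g at the current iterate:

\[ x_{k+1} \in \operatorname*{argmin}_{x} \; f(x) + \langle \nabla g(x_k), x \rangle, \]

so that, for instance, taking f to be the indicator function of a compact convex set turns the surrogate problem into the linear minimization oracle used by projection-free (Frank-Wolfe-type) methods.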
16:25 - 17:05
Architectures and Optimization for General-Task PDE Models
Hayden Schaeffer, UCLA
Biography
Hayden Schaeffer is the Director of Applied Mathematics and a Professor of Mathematics at the University of California, Los Angeles. His research is in mathematical and scientific machine learning, differential equations, randomization, and modeling. He has received an NSF CAREER award and an AFOSR Young Investigator Award. Previously, he was an NSF Mathematical Sciences Postdoctoral Research Fellow, a von Karman Instructor at Caltech, a UC President’s Postdoctoral Fellow at UC Irvine, an NDSEG Fellow, and a Collegium of University Teaching Fellow at UCLA.
Abstract
Learning general-purpose models for partial differential equations (PDEs) is an important problem in scientific machine learning, requiring methods that generalize across equations, discretizations, and data regimes. We present recent progress on multi-task, multi-operator learning and PDE foundation models for spatiotemporal prediction, based on global representation learning and autoregressive modeling. This includes multimodal inputs such as partial observations, varying parameters, and heterogeneous data sources, enabling zero-shot and few-shot generalization. We also discuss scalable training strategies based on Muon and its adaptive extensions, which combine orthogonalized momentum with moment-based normalization to improve stability and convergence in the large-model regime. In particular, NAMO and its diagonal variant utilize Muon-style update directions with Adam-like moment adaptation, improving robustness to gradient noise and heterogeneous data while preserving efficient scaling. This enables models that can be trained once and adapted across regimes, providing effective surrogate models for complex spatiotemporal systems.
19:00
Turning Torso, Lilla Varvsgatan 14, 211 15 Malmö
Symposium dinner
Bus transport to the dinner venue, Turning Torso in Malmö, departs from Lund Cathedral at 18:00.
Day 3 – May 8, 2026
09:00 - 09:40
Making your Theory-to-Practice Work: Online-to-Batch via Schedules & Schedule-Free Learning
Aaron Defazio, FAIR, Meta Superintelligence Labs
Biography
Aaron Defazio is a Research Scientist at FAIR (Fundamental AI Research), part of Meta Superintelligence Labs, where he researches new theoretically driven approaches to AI training, with the ultimate goal of developing automatic, reliable, and fast optimization methods. He has previously worked on deep learning methods for MRI imaging (the fastMRI project) and automated theorem proving. His Schedule-Free Learning method won the AlgoPerf Self-Tuning Track Challenge in 2024, and in 2023 his work on the D-Adaptation method was awarded an ICML Best Paper Award. He obtained his PhD in Computer Science from the Australian National University in 2014.
Abstract
I will introduce an alternative view of learning rate schedules, where they are considered as a technique for ensuring optimal convergence rates for the last iterate of an optimization procedure, a form of online-to-batch conversion. This view leads to a highly predictive theory of optimal learning rate schedules, explaining the learning rate warmup and annealing procedures used in practice. Going beyond this, I will show how this viewpoint suggests Schedule-Free approaches, where learning rate schedules are replaced by iterate averaging schemes, which yield a number of benefits: no need to specify the stopping time in advance, smoother loss curves, and often better eval metrics.
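A minimal sketch of Schedule-Free SGD in the form published by Defazio et al.; the variable names and the simple quadratic test problem are ours.

```python
import numpy as np

def schedule_free_sgd(grad, x0, lr=0.1, beta=0.9, num_iters=1000):
    """Schedule-Free SGD: the gradient is evaluated at an interpolation y of a
    base SGD sequence z and a running average x; no stepsize schedule is used."""
    z = x0.copy()                # base SGD-like sequence
    x = x0.copy()                # averaged sequence, used at evaluation time
    for t in range(1, num_iters + 1):
        y = (1 - beta) * z + beta * x          # gradient evaluation point
        z = z - lr * grad(y)                   # SGD step on the base sequence
        x = (1 - 1.0 / t) * x + (1.0 / t) * z  # equal-weight online average
    return x

# Illustrative use on a simple quadratic with minimizer (3, -1).
x = schedule_free_sgd(lambda v: v - np.array([3.0, -1.0]), x0=np.zeros(2))
```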
09:40 - 10:20
Acceleration by Stepsize Hedging
Jason Altschuler, University of Pennsylvania
Biography
Jason Altschuler is an Assistant Professor at UPenn in the Department of Statistics and Data Science, and by courtesy also the Departments of Computer Science, Electrical Engineering, and Applied Mathematics. Previously, he received his undergraduate degree from Princeton and his PhD from MIT. He is the recipient of a Sloan Fellowship in Mathematics, the ICS Prize for the best papers at the interface of computer science and operations research, the MIT Sprowls Dissertation Award, a finalist distinction for the Mathematical Optimization Society’s Tucker Prize, and Undergraduate Teaching Excellence Awards. His research interests lie at the interface of optimization, probability, and machine learning, with a focus on the design and analysis of efficient algorithms.
Abstract
It is commonly said that the most important hyperparameter in deep learning is the stepsize schedule. However, even in seemingly simple convex settings, it is unclear how best to choose stepsizes. In this talk, I will describe a new approach for choosing stepsizes which has enabled us to dispel longstanding beliefs about the speed limit of gradient descent in convex optimization and min-max optimization. The key idea is “hedging” between short steps and long steps since bad cases for the former are good cases for the latter, and vice versa. Properly combining these stepsizes yields faster convergence due to the misalignment of worst-case functions.
This talk is based on a line of work with Pablo Parrilo, Henry Shugart, and Jinho Bok that originates from my 2018 Master’s Thesis — which established for the first time that judiciously chosen stepsizes can enable accelerated convex optimization. Prior to this thesis, the only such result was for the special case of quadratics, due to Young in 1953.
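The hedging idea is easiest to see in the quadratic case that Young treated (a standard computation, in notation chosen here): for f(x) = (λ/2) x² with curvature λ ∈ [m, L], a gradient step with stepsize γ contracts the error by |1 − γλ|, so any single stepsize must trade off the two endpoints. A pair of steps with stepsizes γ₁, γ₂ instead contracts by

\[ |1 - \gamma_1 \lambda| \cdot |1 - \gamma_2 \lambda|, \]

and since the curvatures at which a long step performs badly are exactly those at which a short step performs well, choosing 1/γ₁ and 1/γ₂ at Chebyshev points of [m, L] makes this product uniformly small; this beats any single repeated stepsize.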
10:20 - 10:50
Coffee
10:50 - 11:30
TBA
Jeremy Bernstein, Thinking Machines Lab
Biography
Jeremy Bernstein is a machine learning researcher based in San Francisco, California. He works at Thinking Machines Lab. His goal is to uncover the computational and statistical laws of natural and artificial intelligence, and thereby design learning systems that are more efficient, more automatic and more useful in practice.
11:30 - 12:10
Training LLMs: Do We Understand Our Optimizers?
Antonio Orvieto, ELLIS Institute Tübingen, MPI
Biography
Antonio studied Control Engineering in Italy and Switzerland. He holds a PhD in Computer Science from ETH Zürich and spent time at DeepMind (UK), Meta (US), MILA (CA), INRIA (FR), and HILTI (LI). He is currently a Hector Endowed Fellow and Principal Investigator (PI) at the ELLIS Institute Tübingen and Independent Group Leader of the MPI for Intelligent Systems, where he leads the Deep Models and Optimization group. He received the ETH medal for outstanding doctoral theses and the Schmidt Sciences AI2050 Early Career Fellowship.
In his research, Antonio strives to improve the efficiency of deep learning technologies by pioneering new architectures and training techniques grounded in theoretical knowledge. His work encompasses two main areas: understanding the intricacies of large-scale optimization dynamics and designing innovative architectures and powerful optimizers capable of handling complex data. Central to his studies is exploring innovative techniques for decoding patterns in sequential data, with implications in biology, neuroscience, natural language processing, and music generation.
Abstract
Why does Adam so consistently outperform SGD when training Transformer language models? Despite numerous proposed explanations, the optimizer gap remains largely unexplained. In this talk, we will present results from two complementary studies. First, using over 2000 language model training runs, we compare Adam with simplified variants such as signed gradient and signed momentum. We find that while signed momentum is faster than SGD, it still lags behind Adam; however, we crucially notice that constraining Adam’s momentum parameters to be equal (beta1 = beta2) retains near-optimal performance. This is of great practical importance and also reveals a new insight: Adam in this form has a robust statistical interpretation and a clear link to mollified sign descent. Second, through carefully tuned comparisons of SGD with momentum and Adam, we show that SGD can actually match Adam in small-batch training, but loses ground as batch size grows. Analyzing both Transformer experiments and quadratic models with stochastic differential equations, we shed new light on the role of batch size in shaping training dynamics.
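For concreteness, a minimal sketch (ours) of full-batch Adam with the momentum parameters tied, beta1 = beta2 = β, the constrained variant the abstract reports as retaining near-optimal performance; the hyperparameter values are illustrative.

```python
import numpy as np

def adam_tied_betas(grad, x0, lr=1e-3, beta=0.95, eps=1e-8, num_iters=1000):
    """Adam with beta1 = beta2 = beta: first and second moments decay at the
    same rate, the constrained variant reported to stay near-optimal."""
    x = x0.copy()
    m = np.zeros_like(x0)        # first moment (momentum)
    v = np.zeros_like(x0)        # second moment
    for t in range(1, num_iters + 1):
        g = grad(x)
        m = beta * m + (1 - beta) * g
        v = beta * v + (1 - beta) * g * g
        m_hat = m / (1 - beta ** t)            # bias corrections
        v_hat = v / (1 - beta ** t)
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x
```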
12:10 - 13:40
Lunch
13:40 - 14:20
River-Valley Landscapes in Neural Network Training and a Theory-Practice Gap for Momentum
Chulhee Yun, KAIST
Biography
Chulhee “Charlie” Yun is an Ewon Assistant Professor at KAIST Kim Jaechul Graduate School of AI, where he has directed the Optimization & Machine Learning Laboratory since 2022. Starting September 2025, he holds a joint affiliation with KAIST Graduate School of AI for Math and a part-time Visiting Faculty Researcher position at Google Research. He completed his PhD at the Laboratory for Information and Decision Systems (LIDS) at MIT, under the joint supervision of Prof. Suvrit Sra and Prof. Ali Jadbabaie, following an MSc from Stanford University and a BSc from KAIST. His research focuses on the theoretical aspects of optimization algorithms, machine learning, and deep learning, with the goal of bridging the gap between theory and practice in these areas.
Abstract
Neural network training is often believed to be largely confined to a low-dimensional subspace aligned with the sharpest-curvature directions (Gur-Ari et al., 2018). In this talk, I will present evidence that challenges this picture: in modern neural network training, substantial progress can instead be driven by movement in the “bulk,” outside the sharpest-curvature subspace. Building on this observation, I introduce a “river-valley” view of the loss landscape, where sharp directions form valley walls while learning happens along a flatter river direction. This lens helps explain many common behaviors of neural network optimizers—most notably why Polyak momentum can accelerate convergence by increasing effective progress along the river—and why schedule-free methods (Defazio et al., 2024) often track low-loss trajectories. I will close with a theoretical counterpoint from our recent work: in nonconvex optimization under a mere smoothness assumption, momentum admits worst-case lower bounds showing it can be strictly slower than non-momentum counterparts. This contrast raises the question of which assumptions and which notions of progress are needed to faithfully connect theory to practice.
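For reference, the Polyak (heavy-ball) momentum recursion referred to above, in standard notation:

\[ x_{k+1} = x_k - \gamma \nabla f(x_k) + \beta (x_k - x_{k-1}); \]

in the river-valley picture, the momentum term accumulates displacement along the flat river direction, while oscillations across the steep valley walls tend to cancel between consecutive iterates.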
14:20 - 15:00
Lightning Talks
15:00 - 15:45
Poster session & coffee
15:45 - 16:25
Understanding Optimization in Deep Learning with Central Flows
A two-part talk with Alex Damian
Jeremy Cohen, The Flatiron Institute
Biography
Jeremy Cohen is a research fellow at the Flatiron Institute, New York, USA. He is broadly interested in turning deep learning into a principled engineering discipline, and currently works on understanding the dynamics of optimization algorithms in deep learning. He obtained his PhD in 2024 from Carnegie Mellon University, advised by Zico Kolter and Ameet Talwalkar.
Abstract
Traditional theories of optimization cannot describe the dynamics of optimization in deep learning, even in the simple setting of deterministic training. The challenge is that optimizers typically operate in a complex, oscillatory regime called the “edge of stability.” In this work, we develop theory that can describe the dynamics of optimization in this regime. Our key insight is that while the *exact* trajectory of an oscillatory optimizer may be challenging to analyze, the *time-averaged* (i.e. smoothed) trajectory is often much more tractable. To analyze an optimizer, we derive a differential equation called a “central flow” that characterizes this time-averaged trajectory. We empirically show that these central flows can predict long-term optimization trajectories for generic neural networks with a high degree of numerical accuracy. By interpreting these central flows, we are able to understand how gradient descent makes progress even as the loss sometimes goes up; how adaptive optimizers “adapt” to the local loss landscape; and how adaptive optimizers implicitly navigate towards regions where they can take larger steps. Our results suggest that central flows can be a valuable theoretical tool for reasoning about optimization in deep learning.
16:25 - 17:05
Understanding Optimization in Deep Learning with Central Flows
A two-part talk with Jeremy Cohen
Alex Damian, The Kempner Institute at Harvard University
Biography
Alex Damian is a research fellow at the Kempner Institute at Harvard University and will join MIT in Fall 2026 as an Assistant Professor of Mathematics and EECS (AI+D). His research focuses on the mathematical foundations of deep learning, with particular emphasis on optimization dynamics and representation learning. He received his Ph.D. in Applied and Computational Mathematics from Princeton University, where he was advised by Jason D. Lee, and his B.S. in Mathematics from Duke University. His work has been supported by the NSF Graduate Research Fellowship and the Jane Street Graduate Research Fellowship.
Abstract
Traditional theories of optimization cannot describe the dynamics of optimization in deep learning, even in the simple setting of deterministic training. The challenge is that optimizers typically operate in a complex, oscillatory regime called the “edge of stability.” In this work, we develop theory that can describe the dynamics of optimization in this regime. Our key insight is that while the *exact* trajectory of an oscillatory optimizer may be challenging to analyze, the *time-averaged* (i.e. smoothed) trajectory is often much more tractable. To analyze an optimizer, we derive a differential equation called a “central flow” that characterizes this time-averaged trajectory. We empirically show that these central flows can predict long-term optimization trajectories for generic neural networks with a high degree of numerical accuracy. By interpreting these central flows, we are able to understand how gradient descent makes progress even as the loss sometimes goes up; how adaptive optimizers “adapt” to the local loss landscape; and how adaptive optimizers implicitly navigate towards regions where they can take larger steps. Our results suggest that central flows can be a valuable theoretical tool for reasoning about optimization in deep learning.
17:05 - 17:15
Closing