May 6 – May 8, 2026

Focus Period Symposium: Optimization for Learning

AF Borgen, Lund

The ELLIIT Focus Period Symposium is the highlight of the five-week focus period, during which young international scholars, ELLIIT researchers and other well-established international academics gather in Lund to work together on joint research challenges. See the current list of confirmed speakers on the invited speakers page.

The focus period symposium on Optimization for Learning takes place in AF Borgen, Sandgatan 2, 223 50 Lund.

Please note that the number of participants is limited and that registration might close earlier than the deadline indicates.

Detailed program

Please note that the program is still subject to change.

May 5, 2026


17:00 - 19:00

Historical Museum at Lund University

Krafts Torg 1, 223 50 Lund.

Welcome reception at the Historical Museum

A welcome drink and some hors d’oeuvres will be served.


Day 1 – May 6, 2026


08:30 - 08:45

Registration


08:45 - 09:00

Opening


09:00 - 09:40

An Alternative to the Frank-Wolfe Method & Potential Applications to ML 

Peter Richtárik, KAUST 

Biography

Peter Richtárik is a professor of Computer Science at King Abdullah University of Science and Technology – KAUST, Saudi Arabia, where he leads the Optimization and Machine Learning Lab. Through his work on randomized and distributed optimization algorithms, he has contributed to the foundations of machine learning and optimization. He is one of the original developers of Federated Learning. Prof. Richtárik’s work has attracted international awards, including the Charles Broyden Prize, the SIAM SIGEST Best Paper Award, and a Distinguished Speaker Award at the 2019 International Conference on Continuous Optimization. He serves as an Area Chair for leading machine learning conferences, including NeurIPS, ICML and ICLR, and is an Action Editor of JMLR and an Associate Editor of Numerische Mathematik and Optimization Methods and Software.

Abstract

I will talk about a new method based on the linear minimization oracle. The method has stronger convergence properties than the Frank-Wolfe method, but relies on a somewhat more involved linear minimization oracle and delicate step-size rules. Since Frank-Wolfe has many applications across machine learning (e.g., the recent Scion optimizer is a stochastic variant of Frank-Wolfe with momentum), the new method is potentially an interesting alternative to, or even a replacement for, the now classical Frank-Wolfe approach.
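Since the new method is framed against classical Frank-Wolfe, a minimal sketch of that baseline may help fix ideas: the only constraint machinery is a linear minimization oracle (LMO). The toy problem (a quadratic over the probability simplex) and all names below are illustrative, not the speaker's method:

```python
import numpy as np

def lmo_simplex(grad):
    """Linear minimization oracle over the probability simplex:
    a linear function is minimized at a vertex, so pick the coordinate
    with the smallest gradient entry."""
    s = np.zeros_like(grad)
    s[np.argmin(grad)] = 1.0
    return s

def frank_wolfe(grad_f, x0, steps=200):
    """Classical Frank-Wolfe with the standard 2/(k+2) step size."""
    x = x0
    for k in range(steps):
        s = lmo_simplex(grad_f(x))       # call the LMO at the current iterate
        gamma = 2.0 / (k + 2)            # open-loop step size, no line search
        x = (1 - gamma) * x + gamma * s  # convex combination stays feasible
    return x

# Toy problem: minimize f(x) = ||x - b||^2 over the simplex; here b is
# itself on the simplex, so the minimizer is b.
b = np.array([0.1, 0.5, 0.4])
x_hat = frank_wolfe(lambda x: 2 * (x - b), np.array([1.0, 0.0, 0.0]))
```

Note that feasibility is maintained for free: every iterate is a convex combination of simplex vertices, with no projection ever computed.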


09:40 - 10:20

Exploiting Similarity in Federated Learning

Sebastian Stich, CISPA and ELLIS

Biography

Dr. Sebastian Stich is a tenured faculty member at the CISPA Helmholtz Center for Information Security and a member of the European Laboratory for Learning and Intelligent Systems (ELLIS). His research focuses on the intersection of machine learning, optimization, and statistics, with an emphasis on efficient parallel and distributed algorithms for training models over decentralized datasets. 

He obtained his PhD from ETH Zurich and held postdoctoral positions at UCLouvain and EPFL. His work has been recognized with a Meta Research Award (2022), a Google Research Scholar Award (2023), and an ERC Consolidator Grant (CollectiveMinds, 2024). 

Abstract

We provide a brief introduction to local update methods developed for federated optimization and discuss their worst-case complexity. Surprisingly, these methods often perform much better in practice than predicted by theoretical analyses using classical assumptions. Recent years have revealed that their performance can be better described using refined notions that capture the similarity among client objectives. In this talk, we introduce a generic framework based on a distributed proximal point algorithm, which consolidates many of our insights and allows for the adaptation of arbitrary centralized optimization algorithms to the convex federated setting, including accelerated variants. Our theoretical analysis shows that the derived methods enjoy faster convergence when the degree of similarity among clients is high.

Based on joint work with Xiaowen Jiang and Anton Rodomanov. 
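As context for the local update methods discussed in the talk, a minimal FedAvg-style round can be sketched: each client takes a few local gradient steps and the server averages. This is an illustrative toy with quadratic clients, not the distributed proximal point framework of the talk; all names are ours:

```python
import numpy as np

def local_update_round(clients, x, local_steps, lr):
    """One communication round: every client runs a few plain gradient
    steps on its own quadratic objective 0.5*y^T A y - b^T y, starting
    from the shared iterate x, and the server averages the results."""
    local_models = []
    for A, b in clients:
        y = x.copy()
        for _ in range(local_steps):
            y -= lr * (A @ y - b)  # local gradient step
        local_models.append(y)
    return np.mean(local_models, axis=0)

# Two similar (but not identical) clients; the global optimum of the
# summed objective solves (A1 + A2) x = b1 + b2.
A1, b1 = np.eye(2) * 1.0, np.array([1.0, 0.0])
A2, b2 = np.eye(2) * 1.2, np.array([0.9, 0.1])
x = np.zeros(2)
for _ in range(100):
    x = local_update_round([(A1, b1), (A2, b2)], x, local_steps=5, lr=0.1)

x_star = np.linalg.solve(A1 + A2, b1 + b2)
```

With similar clients, the fixed point of this scheme lies close to the global optimum; heterogeneity among clients introduces a small drift, which is exactly the kind of effect that similarity-based analyses quantify.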


10:20 - 10:50

Coffee


10:50 - 11:30

Three Paths to Minima Selection

Niao He, ETH Zurich 

Biography

Niao He is currently an Associate Professor in the Department of Computer Science at ETH Zürich, where she leads the Optimization & Decision Intelligence (ODI) Group. She is also a core faculty member at:

Niao He’s work lies at the interface of optimization and machine learning, with a primary focus on the algorithmic and theoretical foundations for principled, scalable, and trustworthy decision intelligence. She is also interested in developing machine learning models and algorithms for interdisciplinary applications in operations management, mechanism design, control & robotics, etc.

With thanks to the Swiss National Science Foundation, the ETH Foundation, and NCCR Automation for generously funding the current research.

Abstract

Classical optimization is often built around a simple goal: find a minimizer. Most existing theories emphasize convergence rates towards the minima, implicitly treating all solutions as equivalent once optimality is achieved. Modern machine learning applications tell a different story. Distinct minimizers with identical objective values can differ dramatically in properties that matter in practice, including generalization performance, robustness, and even broader societal implications. In overparameterized models, especially in deep learning, a striking phenomenon emerges: even after training error reaches zero, test performance continues to improve, indicating that optimization dynamics keep evolving within the set of global minima. This raises a fundamental and largely under-explored question: which minima do optimization algorithms implicitly select? More ambitiously, can we actively steer optimization dynamics toward desirable minima? In this talk, I will present recent results that shed light on both implicit and active minima selection, highlighting the roles played by optimization dynamics (first- and zeroth-order methods), stochastic noise, and the geometry of the solution landscape.


11:30 - 12:10

TBA

Anastasia Koloskova, University of Zurich

Biography

Anastasia Koloskova is an Assistant Professor of AI and Optimization in the Department of Mathematical Modeling and Machine Learning at the University of Zurich. Her research focuses on machine learning and optimization, particularly in decentralized and collaborative learning and privacy. Previously, Anastasia Koloskova was a postdoctoral researcher at Stanford University (STAIR lab, Prof. Sanmi Koyejo), and completed her PhD at EPFL in the Machine Learning and Optimization Laboratory (MLO) with Prof. Martin Jaggi.


12:10 - 13:40

Lunch


13:40 - 14:20

Strong convergence and fast residual decay for monotone operator flows via Tikhonov regularization

Radu I. Boţ, University of Vienna 

Biography

Radu I. Boţ is Professor of Applied Mathematics with Emphasis on Optimization at the Faculty of Mathematics of the University of Vienna and a founding member of the Research Platform “Data Science@Uni Vienna”. He currently serves as Dean of the Faculty of Mathematics at the University of Vienna. He received his Diploma and M.Sc. degrees in Mathematics from Babeş-Bolyai University in Cluj-Napoca, Romania, and earned his Ph.D. degree as well as his Habilitation in Mathematics from Chemnitz University of Technology, Germany.

His research interests include continuous-time and discrete-time models for optimization and monotone inclusions, convex analysis, nonsmooth and variational analysis, monotone operator theory, and optimization methods for data science. His research has been funded by the Austrian Science Fund, the Austrian Research Promotion Agency, the German Research Foundation, the Romanian National Research Council, the Australian Research Council, as well as by industrial partners. He is (co-)author of the books Duality in Vector Optimization and Conjugate Duality in Convex Optimization, published by Springer. Radu I. Boţ serves on the editorial boards of several leading journals, including Mathematical Programming, Computational Optimization and Applications, Applied Mathematics and Optimization, and the Journal of Optimization Theory and Applications. Since January 2026, he has been Editor-in-Chief of the prestigious SIAM Journal on Optimization.

Abstract

In the framework of real Hilbert spaces, we investigate first-order dynamical systems governed by monotone and continuous operators. It has been established that for these systems, only the ergodic trajectory converges to a zero of the operator. However, trajectory convergence is assured for operators with the stronger property of cocoercivity. For this class of operators, the trajectory’s velocity and the operator values along the trajectory converge in norm to zero at a rate of o(1/√t) as t → +∞.

In this talk, we show that augmenting a monotone operator flow with a Tikhonov regularization term ensures not only strong convergence of the trajectory to the minimal-norm element of the zero set, but also enables the derivation of explicit convergence rates. In particular, we establish norm rates for the trajectory’s velocity and for the residual of the operator along the trajectory, expressed in terms of the regularization function. In some particular cases, these rates can be as fast as O(1/t) as t → +∞. In this way, we emphasize a surprising acceleration feature of the Tikhonov regularization. Additionally, we explore these properties for monotone operator flows that incorporate time rescaling and an anchor point. For a specific choice of the Tikhonov regularization function, these flows are closely linked to second-order dynamical systems with a vanishing damping term. The convergence and convergence rate results we achieve for these systems complement recent findings for the Fast Optimistic Gradient Descent Ascent (OGDA) dynamics.

Finally, we derive, via an explicit discretization of the Tikhonov regularized monotone flow, a novel Extra-Gradient method with an anchor term governed by general parameters. We establish strong convergence to specific points within the solution set, as well as convergence rates expressed in terms of the regularization parameters. Notably, our approach recovers the fast residual decay rate O(1/k) as k → +∞ for standard parameter choices.


14:20 - 15:00

Lightning Talks


15:00 - 15:45

Poster session & coffee


15:45 - 16:25

Gradient alignment, learning, and optimization

Jelena Diakonikolas, University of Wisconsin-Madison 

Biography

Jelena Diakonikolas is an Assistant Professor in the Department of Computer Sciences and (by courtesy) the Department of Statistics at the University of Wisconsin-Madison. She is also an affiliate of the Data Science Institute at UW-Madison.

Her main research interests are in the area of large-scale optimization. She is also interested in applications of optimization methods, particularly within machine learning. 

Prior to joining UW-Madison, Jelena Diakonikolas was a Postdoctoral Fellow at UC Berkeley’s Foundations of Data Analysis (FODA) TRIPODS Institute, where she primarily worked with Mike Jordan. In Fall 2018, she was a Microsoft Research Fellow at the Simons Institute for the Theory of Computing, associated with the program on Foundations of Data Science. Prior to starting the postdoctoral position at UC Berkeley, she was a Postdoctoral Associate at the Department of Computer Science, Boston University, where she worked with Lorenzo Orecchia. She completed her Ph.D. at the Department of Electrical Engineering, Columbia University, where she was co-advised by Gil Zussman and Cliff Stein.

Some of her publications appear under her maiden name, Marašević.

Abstract

Generalized Linear Models (GLMs) represent functions formed by composing a known univariate nonlinear activation with a linear map defined by an unknown vector w. The learning task of recovering w from i.i.d. labeled examples (x, y), where y is a noisy evaluation of the GLM, leads to a nonconvex, often nonsmooth optimization problem, even for simple activations.

GLMs are a fundamental model in supervised learning, capturing low-dimensional structure in high-dimensional data. While the setting with zero-mean bounded-variance noise has been well studied, more realistic formulations—where labels may deviate arbitrarily from any ground-truth GLM—are substantially more challenging. In particular, more relaxed notions of error and much stronger structural assumptions about both the activation and the distribution generating the data are required for computational tractability. Most provable guarantees in this regime have emerged only recently. 

In this talk, I will survey these developments and present a unifying optimization-theoretic framework based on local error bounds. These bounds capture how the gradient field remains meaningfully aligned with a target solution, thus providing a geometric “signal” that enables efficient learning with first-order methods, despite nonconvexity and noise. I will further discuss a generalization of these results to the setting of single-index models, where the activation is unknown and optimization is performed over a class of unknown activations, in addition to the parameter vector. 

 


16:25 - 17:05

Extragradient Methods for Modern Machine Learning: New Theory, Step-Size Rules, and Stochastic Variants 

Nicolas Loizou, Johns Hopkins University 

Biography

Nicolas Loizou is an Assistant Professor in the Department of Applied Mathematics and Statistics and the Mathematical Institute for Data Science (MINDS) at Johns Hopkins University, where he leads the Optimization and Machine Learning Lab. He holds secondary appointments in the Departments of Computer Science and Electrical and Computer Engineering and is a member of the Johns Hopkins Data Science Institute and the Ralph O’Connor Sustainable Energy Institute (ROSEI). Prior to this, he was a Postdoctoral Research Fellow at Mila – Quebec Artificial Intelligence Institute and the University of Montreal. He holds a Ph.D. in Optimization and Operational Research from the University of Edinburgh, School of Mathematics, an M.Sc. in Computing from Imperial College London, and a B.Sc. in Mathematics from the National and Kapodistrian University of Athens.
 
His research interests include large-scale optimization, machine learning, randomized numerical linear algebra, distributed and decentralized algorithms, algorithmic game theory, and federated learning. He currently serves as an action editor for Information and Inference: A Journal of the IMA, Optimization Methods and Software, and Transactions on Machine Learning Research. He has received several awards and fellowships, including the OR Society’s 2019 Doctoral Award (runner-up) for the “Most Distinguished Body of Research leading to the Award of a Doctorate in the field of Operational Research”, the IVADO Fellowship, the COAP 2020 Best Paper Award, the CISCO 2023 Research Award, and the Catalyst 2025 Award.

Abstract

Extragradient methods are a fundamental class of algorithms for min-max optimization problems and variational inequalities, with growing relevance in modern machine learning. While the classical theory is largely developed under smoothness and other relatively restrictive assumptions, many machine learning problems call for analysis under weaker regularity conditions and in stochastic, large-scale settings. In this talk, we present new convergence results for deterministic and stochastic extragradient methods beyond the classical framework. In particular, we establish guarantees under weaker regularity assumptions, namely the (L0 ,L1 )-Lipschitz condition, and derive new step-size rules that expand the range of provably convergent regimes. We also introduce Polyak-type step sizes for deterministic and stochastic extragradient methods, leading to adaptive variants with favorable theoretical properties and practical performance. Our results focus primarily on monotone problems, with extensions to selected structured non-monotone settings. We conclude with numerical experiments illustrating both the theory and the practical behavior of the proposed methods.
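For readers unfamiliar with the method family, the basic deterministic, constant step-size extragradient iteration can be sketched on a toy bilinear saddle problem; the step-size rules and (L0, L1)-regularity analysis of the talk are beyond this illustration, and all names are ours:

```python
import numpy as np

def extragradient(F, z0, eta, steps):
    """Basic extragradient: extrapolate to a half step, then update the
    iterate using the operator evaluated at the half step."""
    z = z0
    for _ in range(steps):
        z_half = z - eta * F(z)   # extrapolation (prediction) step
        z = z - eta * F(z_half)   # update using the extrapolated point
    return z

# Monotone operator of the bilinear saddle problem min_x max_y x*y:
# F(x, y) = (y, -x), whose unique zero is (0, 0).
F = lambda z: np.array([z[1], -z[0]])
z = extragradient(F, np.array([1.0, 1.0]), eta=0.3, steps=100)
```

On this bilinear example, plain gradient descent-ascent spirals away from the equilibrium for any constant step size, while the extrapolation step makes the extragradient map a contraction toward (0, 0).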


Day 2 – May 7, 2026


09:00 - 09:40

Recent advances on the systematic analysis and design of first-order optimization algorithms  

Convex optimization as a proof assistant for algorithm analysis and design 

Adrien Taylor, Inria Paris 

Biography

Adrien Taylor is currently a research scientist at the French Institute for Research in Computer Science and Automation – Inria in Paris, within the SIERRA team. Before that, he was a postdoctoral researcher in the same team in 2017-2019, working with Francis Bach. He completed his PhD at Université catholique de Louvain, in the department of mathematical engineering (part of the ICTEAM institute), under the supervision of François Glineur and Julien Hendrickx, supported by an F.R.S.-FNRS FRIA scholarship.

His research currently focuses on optimization (mostly first-order) and numerical analysis with a bit of control and machine learning. He finds it particularly important to push toward reproducible (including theory) and understandable science, and many of his research projects have this orientation. Adrien Taylor was awarded an ERC Starting Grant 2024 (project CASPER) for working in this direction from fall 2024 to fall 2029. 

Abstract

Complexity analysis plays a key role in the design and analysis of algorithms in modern optimization theory. However, establishing worst-case convergence bounds classically requires non-obvious insights and ad hoc reasoning. This talk aims to provide a gentle introduction to performance estimation techniques for the analysis of first-order optimization algorithms, along with a few open questions and recent developments around them. The talk will be accompanied by concrete examples and demonstrations of the usage of recent packages for computer-aided complexity analyses, including the PEPit package, available at https://pepit.readthedocs.io/.


09:40 - 10:20

Automated tight Lyapunov analysis for first-order methods

Pontus Giselsson, Lund University 

Biography

Pontus Giselsson is an Associate Professor at the Department of Automatic Control at Lund University and the organizer of the ELLIIT focus period on Optimization for Learning.

His research focuses on optimization, with particular emphasis on principled methodologies for algorithm analysis and design. Although many optimization methods are still studied on a case-by-case basis, their analyses often exhibit strong common structure. His work seeks to capture these similarities through unified frameworks and automated tools for systematically analyzing, designing, and improving algorithms, with rigorous convergence guarantees.

Abstract

This talk is about automating a central step in the convergence analysis of splitting methods for structured optimization and inclusion problems: the search for a suitable Lyapunov inequality. While this step underlies many existing convergence proofs, finding such an inequality is often technically delicate and carried out on a case-by-case basis. To aid in this process, we present a numeric Lyapunov-based framework along with the AutoLyap package in which quadratic convergence certificates can be found by solving small semidefinite programs. Using this methodology, we derive significantly extended convergent parameter regions for classical methods including Douglas–Rachford splitting, ADMM, and the Chambolle–Pock method, when the analysis is specialized to convex optimization problems rather than the broader monotone inclusion setting. These results highlight the potential of automated Lyapunov analysis to uncover improved convergence guarantees for classical and new splitting methods.


10:20 - 10:50

Coffee


10:50 - 11:30

Learning Optimization Algorithms with Average Case Convergence Rates

Peter Ochs, Saarland University

Biography

Peter Ochs received his M.Sc. degree in mathematics from Saarland University, Germany, in 2010, and his Ph.D. degree in mathematics from the University of Freiburg in 2015. During his Ph.D., he spent three months as a visiting researcher at TU Graz in Austria. After a year as a postdoctoral researcher at Saarland University, he returned to Freiburg. In November 2017, he became a Junior Professor of Applied Mathematics at Saarland University and, in September 2020, a Tenure-Track Professor at the University of Tübingen, with the final evaluation successfully completed in 2020.

Since March 2023, he has been a full Professor of Mathematics and Computer Science at Saarland University, where he heads the Mathematical Optimization for Data Science group. He received the Best Paper Award at the Scale Space and Variational Methods Conference (SSVM) in 2015 and at the German Conference on Pattern Recognition (GCPR) in 2016. His research interests are in non-smooth optimization with applications in computer vision, machine learning, image analysis, and data science in general.

Abstract

The change of paradigm from purely model-driven to data-driven (learning-based) approaches has tremendously altered the picture in many applications in Machine Learning, Computer Vision, Signal Processing, Inverse Problems, Statistics and so on. There is no need to mention the significant boost in performance for many specific applications, thanks to the advent of large-scale Deep Learning. In this talk, we open the area of optimization algorithms to this data-driven paradigm, for which theoretical guarantees are indispensable. The expectations about an optimization algorithm clearly go beyond empirical evidence, as a whole processing pipeline may depend on a reliable output of the optimization algorithm, and the application domains of algorithms can vary significantly. While there is already a vast literature on “learning to optimize”, the algorithms in it come with no theoretical guarantees that meet these expectations from an optimization point of view. We develop the first framework to learn optimization algorithms with provable generalization guarantees for certain classes of optimization problems, while the learning-based backbone enables the algorithms to function far beyond the limitations of classical (deterministic) worst-case bounds. Our results rely on PAC-Bayes bounds for general, unbounded loss functions based on exponential families. We learn optimization algorithms with provable generalization guarantees (PAC bounds) and an explicit trade-off between a high probability of convergence and a high convergence speed.


11:30 - 12:10

Spectral optimizers for deep learning: muon, scion, and so on 

Antonio Silveti-Falls, CentraleSupélec 

Biography

Antonio (Tony) Silveti-Falls is an associate professor (maître de conférences) at CentraleSupélec in the south of Paris, where he is a member of the Centre pour la Vision Numérique laboratory and the INRIA team OPIS. After receiving his PhD in mathematics from Université de Caen Normandie in 2021, where he was supervised by Jalal Fadili and Gabriel Peyré, he completed a postdoc at Toulouse School of Economics with Jérôme Bolte and Edouard Pauwels. His research continues to focus on {nonsmooth, stochastic, noneuclidean} optimization, especially conditional gradient methods (Frank-Wolfe) and conservative calculus (path differentiable functions) applied to deep learning. His work on the generalized conditional gradient method won the best paper award at SPARS 2019. 

Abstract

We discuss some recent advances in optimization for deep learning, with special attention paid to the spectral norm. We will comment on both the theoretical and the empirical properties of these algorithms, especially using the former to predict the latter. 


12:10 - 13:40

Lunch


13:40 - 14:20

TBA

Wotao Yin, Alibaba DAMO Academy 

Biography

Wotao Yin is an applied mathematician, scientist, and engineer currently serving as the director of the Decision Intelligence Lab at the Alibaba DAMO Academy, following a tenure as a Professor of Mathematics at UCLA. He received his Ph.D. in OR from Columbia University and is widely recognized for his research in computational optimization, particularly large-scale and distributed algorithms, operator splitting methods, and their applications in image processing and machine learning. His contributions to the field have been honored with numerous prestigious awards, including the Morningside Gold Medal, the NSF CAREER Award, the Alfred P. Sloan Research Fellowship, and the INFORMS Egon Balas Prize.


14:20 - 15:00

Lightning Talks


15:00 - 15:45

Poster session & coffee


15:45 - 16:25

First-Order Methods through Partial Linearization 

Alp Yurtsever, Umeå University 

Biography

Alp Yurtsever is a WASP Assistant Professor of Optimization and Machine Learning at the Department of Mathematics and Mathematical Statistics, Umeå University, Sweden. His research develops theory and algorithms for challenging optimization problems, motivated by applications in resource allocation, networked decision-making, and machine learning. His interests include conic programming, large-scale semidefinite programming, structured nonconvex and bilevel optimization, quantum-assisted optimization, distributed learning, operator splitting, and adaptive methods. Prior to joining Umeå University, he received his PhD in Computer and Communication Sciences (EDIC) from École Polytechnique Fédérale de Lausanne (EPFL), where his dissertation was awarded a Thesis Distinction, and completed a postdoctoral fellowship at the Massachusetts Institute of Technology (MIT) in the Laboratory for Information and Decision Systems (LIDS). 

Abstract

Difference-of-convex algorithms are built on a partial linearization mechanism. Taking this mechanism as a starting point, I consider objectives of the form F = f + g and focus on settings where linearizing g leads to tractable surrogate problems. This yields a DCA-type template for first-order methods. Within this template, several classical first-order methods can be recovered as special cases. This viewpoint exposes a broad algorithmic design space induced by decomposition choices, but also raises a fundamental selection problem: which decomposition should one use in practice? I will illustrate this question with a concrete case study using projection-free methods, where different decompositions lead to distinct oracle complexity guarantees.
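A minimal sketch of such a template, under one illustrative decomposition choice (the function names and the toy objective are ours, not from the talk):

```python
import numpy as np

def partial_linearization(grad_g, argmin_surrogate, x0, steps=100):
    """DCA-type template for F = f + g: linearize g at the current
    iterate x_k and minimize the tractable surrogate
    f(x) + <grad_g(x_k), x>."""
    x = x0
    for _ in range(steps):
        x = argmin_surrogate(grad_g(x))
    return x

# One decomposition choice: f(x) = 0.5*||x||^2, so the surrogate
# argmin_x 0.5*||x||^2 + <v, x> has the closed form x = -v.
c, b = 0.5, np.array([1.0, 2.0])
grad_g = lambda x: c * (x - b)   # g(x) = 0.5*c*||x - b||^2
argmin_surrogate = lambda v: -v
x = partial_linearization(grad_g, argmin_surrogate, np.zeros(2))

# Stationarity of F: x + c*(x - b) = 0, i.e. x* = c*b / (1 + c).
x_star = c * b / (1 + c)
```

Different splits of F into f and g give different surrogates and hence different methods; the more of the objective one linearizes, the cheaper each surrogate but the weaker the per-step progress, which is one face of the selection problem the talk raises.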


16:25 - 17:05

Architectures and Optimization for General-Task PDE Models

Hayden Schaeffer, UCLA 

Biography

Hayden Schaeffer is the Director of Applied Mathematics and a Professor of Mathematics at the University of California, Los Angeles. His research is in mathematical and scientific machine learning, differential equations, randomization, and modeling. He has received an NSF CAREER award and an AFOSR Young Investigator Award. Previously, he was an NSF Mathematical Sciences Postdoctoral Research Fellow, a von Karman Instructor at Caltech, a UC President’s Postdoctoral Fellow at UC Irvine, an NDSEG Fellow, and a Collegium of University Teaching Fellow at UCLA.

Abstract

Learning general-purpose models for partial differential equations (PDEs) is an important problem in scientific machine learning, requiring methods that generalize across equations, discretizations, and data regimes. We present recent progress on multi-task, multi-operator learning and PDE foundation models for spatiotemporal prediction, based on global representation learning and autoregressive modeling. This includes multimodal inputs such as partial observations, varying parameters, and heterogeneous data sources, enabling zero-shot and few-shot generalization. We also discuss scalable training strategies based on Muon and its adaptive extensions, which combine orthogonalized momentum with moment-based normalization to improve stability and convergence in the large-model regime. In particular, NAMO and its diagonal variant utilize Muon-style update directions with Adam-like moment adaptation, improving robustness to gradient noise and heterogeneous data while preserving efficient scaling. This enables models that can be trained once and adapted across regimes, providing effective surrogate models for complex spatiotemporal systems. 


19:00

Turning Torso, Lilla Varvsgatan 14, 211 15 Malmö

Symposium dinner

Bus transport to dinner venue Turning Torso in Malmö departs from Lund Cathedral at 18:00.


Day 3 – May 8, 2026


09:00 - 09:40

Making your Theory-to-Practice Work: Online-to-Batch via Schedules & Schedule-Free Learning 

Aaron Defazio, FAIR, Meta Superintelligence Labs 

Biography

Aaron Defazio is a Research Scientist at FAIR (Fundamental AI Research), part of Meta Superintelligence Labs, where he researches new theoretically driven approaches to AI training, with the ultimate goal of developing automatic, reliable and fast optimization methods. He has previously worked on deep learning based methods for MRI imaging (the fastMRI project) and automated theorem proving. His Schedule-Free Learning method won the AlgoPerf Self-Tuning Track Challenge in 2024, and in 2023 his work on the D-Adaptation method was awarded an ICML Best Paper Award. He obtained his PhD in Computer Science from the Australian National University in 2014.

Abstract

I will introduce an alternative view of learning rate schedules, where they are considered as a technique for ensuring optimal convergence rates for the last iterate of an optimization procedure, a form of online-to-batch conversion. This view leads to a highly predictive theory of optimal learning rate schedules, explaining the learning rate warmup and annealing procedures used in practice. Going beyond this, I will show how this viewpoint suggests Schedule-Free approaches, where learning rate schedules are replaced by iterate averaging schemes, which yield a number of benefits: no need to specify the stopping time in advance, smoother loss curves, and often better eval metrics.
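As a rough illustration of replacing a schedule by averaging: the sketch below is plain Polyak-style uniform averaging on a noisy toy problem, not the actual Schedule-Free algorithm (which uses a more refined interpolation and averaging scheme); all names are ours:

```python
import numpy as np

def sgd_with_averaging(grad, x0, lr, steps, rng):
    """Constant-step SGD that returns both the last iterate and a uniform
    running average of the iterates; the average plays the role usually
    played by a decaying learning-rate schedule."""
    x = x0
    avg = x0.copy()
    for k in range(1, steps + 1):
        x = x - lr * grad(x, rng)
        avg += (x - avg) / k  # running mean of x_1, ..., x_k
    return x, avg

# Noisy gradients of f(x) = 0.5*||x||^2 (minimizer at 0): the last iterate
# of constant-step SGD hovers in a noise ball, while the average settles.
rng = np.random.default_rng(0)
grad = lambda x, rng: x + rng.normal(0.0, 1.0, size=x.shape)
last, avg = sgd_with_averaging(grad, np.array([5.0]), lr=0.1, steps=5000, rng=rng)
```

The last iterate stalls in a noise ball whose radius scales with the learning rate, while the averaged iterate keeps improving; no decay schedule, and hence no stopping time, needs to be fixed in advance.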


09:40 - 10:20

Acceleration by Stepsize Hedging

Jason Altschuler, University of Pennsylvania

Biography

Jason Altschuler is an Assistant Professor at UPenn in the Department of Statistics and Data Science, and, by courtesy, also in the Departments of Computer Science, Electrical Engineering, and Applied Mathematics. Previously, he received his undergraduate degree from Princeton and his PhD from MIT. He is the recipient of a Sloan Fellowship in Mathematics, the ICS Prize for the best papers at the interface of computer science and operations research, the MIT Sprowls Dissertation Award, the Mathematical Optimization Society’s Tucker Finalist Prize, and Undergraduate Teaching Excellence Awards. His research interests lie at the interface of optimization, probability, and machine learning, with a focus on the design and analysis of efficient algorithms.

Abstract

It is commonly said that the most important hyperparameter in deep learning is the stepsize schedule. However, even in seemingly simple convex settings, it is unclear how best to choose stepsizes. In this talk, I will describe a new approach for choosing stepsizes which has enabled us to dispel longstanding beliefs about the speed limit of gradient descent in convex optimization and min-max optimization. The key idea is “hedging” between short steps and long steps since bad cases for the former are good cases for the latter, and vice versa. Properly combining these stepsizes yields faster convergence due to the misalignment of worst-case functions. 

This talk is based on a line of work with Pablo Parrilo, Henry Shugart, and Jinho Bok that originates from my 2018 Master’s Thesis — which established for the first time that judiciously chosen stepsizes can enable accelerated convex optimization. Prior to this thesis, the only such result was for the special case of quadratics, due to Young in 1953. 
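The quadratic special case credited to Young above can be sketched directly: on a quadratic with eigenvalues in [m, L], alternating a short and a long step (placed at the two Chebyshev nodes of the interval) beats the best constant stepsize. This is a toy numpy illustration of that classical case, not the schedule developed in the talk; the problem sizes are arbitrary choices.

```python
import numpy as np

# Hedging short and long steps on f(x) = 0.5 x^T A x, eigenvalues in [m, L].
L_, m_ = 100.0, 1.0
A = np.diag([m_, L_])
x0 = np.array([1.0, 1.0])

def run(x, stepsizes):
    for eta in stepsizes:
        x = x - eta * (A @ x)
    return x

n = 20
# Best constant stepsize for this eigenvalue range: 2 / (L + m).
const = [2.0 / (L_ + m_)] * n
# Two Chebyshev nodes of [m, L] give one short and one long stepsize.
nodes = [(L_ + m_) / 2 + (L_ - m_) / 2 * np.cos(np.pi * (2 * k - 1) / 4)
         for k in (1, 2)]
hedged = [1.0 / lam for lam in nodes] * (n // 2)

err_const = np.linalg.norm(run(x0, const))
err_hedged = np.linalg.norm(run(x0, hedged))
print(err_const, err_hedged)   # the hedged alternation ends closer to the optimum
```

The long step is unstable on its own (it expands the sharp eigendirection), but each short/long pair contracts every eigendirection, which is the "bad cases for one are good cases for the other" effect described in the abstract.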

Profile picture of Jason Altschuler.

10:20 - 10:50

Coffee


10:50 - 11:30

TBA

Jeremy Bernstein, Thinking Machines Lab 

Biography

Jeremy Bernstein is a machine learning researcher based in San Francisco, California. He works at Thinking Machines Lab. His goal is to uncover the computational and statistical laws of natural and artificial intelligence, and thereby design learning systems that are more efficient, more automatic and more useful in practice. 

Profile picture of Jeremy Bernstein

11:30 - 12:10

Training LLMs: Do We Understand Our Optimizers? 

Antonio Orvieto, ELLIS Institute Tübingen, MPI

Biography

Antonio studied Control Engineering in Italy and Switzerland. He holds a PhD in Computer Science from ETH Zürich and spent time at DeepMind (UK), Meta (US), MILA (CA), INRIA (FR), and HILTI (LI). He is currently a Hector Endowed Fellow and Principal Investigator (PI) at the ELLIS Institute Tübingen and an Independent Group Leader at the MPI for Intelligent Systems, where he leads the Deep Models and Optimization group. He received the ETH medal for outstanding doctoral theses and the Schmidt Sciences AI2050 Early Career Fellowship.

In his research, Antonio strives to improve the efficiency of deep learning technologies by pioneering new architectures and training techniques grounded in theoretical knowledge. His work encompasses two main areas: understanding the intricacies of large-scale optimization dynamics and designing innovative architectures and powerful optimizers capable of handling complex data. Central to his studies is exploring innovative techniques for decoding patterns in sequential data, with implications in biology, neuroscience, natural language processing, and music generation.

Abstract

Why does Adam so consistently outperform SGD when training Transformer language models? Despite numerous proposed explanations, the optimizer gap remains largely unexplained. In this talk, we will present results from two complementary studies. First, using over 2000 language model training runs, we compare Adam with simplified variants such as signed gradient and signed momentum. We find that while signed momentum is faster than SGD, it still lags behind Adam; crucially, however, constraining Adam’s momentum parameters to be equal (beta1 = beta2) retains near-optimal performance. This is of great practical importance and also reveals a new insight: Adam in this form has a robust statistical interpretation and a clear link to mollified sign descent. Second, through carefully tuned comparisons of SGD with momentum and Adam, we show that SGD can in fact match Adam in small-batch training, but loses ground as batch size grows. Analyzing both Transformer experiments and quadratic models with stochastic differential equations, we shed new light on the role of batch size in shaping training dynamics. 
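The constrained variant described above (beta1 = beta2) can be sketched in a few lines. This is a minimal numpy sketch under standard Adam update rules with the two momentum parameters tied, not the authors' implementation; the test problem and hyperparameters are arbitrary choices.

```python
import numpy as np

def adam_equal_betas(grad_fn, x, steps=2000, lr=0.02, beta=0.9, eps=1e-8):
    """Minimal Adam with the constraint beta1 = beta2 = beta (a sketch)."""
    m = np.zeros_like(x)                  # EMA of gradients
    v = np.zeros_like(x)                  # EMA of squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta * m + (1 - beta) * g
        v = beta * v + (1 - beta) * g * g
        m_hat = m / (1 - beta ** t)       # bias corrections
        v_hat = v / (1 - beta ** t)
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

# Diagonal quadratic: grad f(x) = curv * x, minimizer at the origin.
curv = np.array([10.0, 1.0])
x_final = adam_equal_betas(lambda z: curv * z, np.array([5.0, -3.0]))
print(x_final)   # ends near the origin
```

With beta1 = beta2, the update m_hat / sqrt(v_hat) is a smoothed version of sign(g) (its magnitude is bounded by 1), which is the link to mollified sign descent the abstract mentions.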

Profile picture of Antonio Orvieto.

12:10 - 13:40

Lunch


13:40 - 14:20

River-Valley Landscapes in Neural Network Training and a Theory-Practice Gap for Momentum 

Chulhee Yun, KAIST 

Biography

Chulhee “Charlie” Yun is an Ewon Assistant Professor at the KAIST Kim Jaechul Graduate School of AI, where he has directed the Optimization & Machine Learning Laboratory since 2022. Starting September 2025, he holds a joint affiliation with the KAIST Graduate School of AI for Math and a part-time Visiting Faculty Researcher position at Google Research. He received his PhD from the Laboratory for Information and Decision Systems (LIDS) at MIT, under the joint supervision of Prof. Suvrit Sra and Prof. Ali Jadbabaie, following an MSc from Stanford University and a BSc from KAIST. His research focuses on the theoretical aspects of optimization algorithms, machine learning, and deep learning, with the goal of bridging the gap between theory and practice in these areas. 

Abstract

Neural network training is often believed to be largely confined to a low-dimensional subspace aligned with the sharpest-curvature directions (Gur-Ari et al., 2018). In this talk, I will present evidence that challenges this picture: in modern neural network training, substantial progress can instead be driven by movement in the “bulk,” outside the sharpest-curvature subspace. Building on this observation, I introduce a “river-valley” view of the loss landscape, where sharp directions form valley walls while learning happens along a flatter river direction. This lens helps explain many common behaviors of neural network optimizers—most notably why Polyak momentum can accelerate convergence by increasing effective progress along the river—and why schedule-free methods (Defazio et al., 2024) often track low-loss trajectories. I will close with a theoretical counterpoint from our recent work: in nonconvex optimization under a mere smoothness assumption, momentum admits worst-case lower bounds showing it can be strictly slower than non-momentum counterparts. This contrast raises the question of which assumptions and which notions of progress are needed to faithfully connect theory to practice. 
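The river-valley intuition can be caricatured on a quadratic with one sharp (wall) and one flat (river) direction: at the same stepsize, Polyak heavy-ball momentum makes far more progress along the river than plain gradient descent. This is a toy numpy illustration of the intuition only, not the talk's analysis; the curvatures and coefficients are arbitrary choices.

```python
import numpy as np

# Toy "river-valley": sharp wall direction (curvature 100) and flat
# river direction (curvature 0.1).
A = np.diag([100.0, 0.1])
eta = 0.018                      # stable for GD on the wall: eta * 100 < 2
mu = 0.9                         # Polyak momentum coefficient

def gd(x, steps):
    for _ in range(steps):
        x = x - eta * (A @ x)
    return x

def heavy_ball(x, steps):
    v = np.zeros_like(x)
    for _ in range(steps):
        v = mu * v - eta * (A @ x)
        x = x + v
    return x

x0 = np.array([1.0, 1.0])
# Compare the river coordinate (index 1) after the same number of steps.
print(abs(gd(x0, 300)[1]), abs(heavy_ball(x0, 300)[1]))
```

Both methods damp the wall coordinate quickly; the difference is almost entirely in the river coordinate, where momentum effectively multiplies the stepsize by roughly 1/(1 - mu).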

Profile picture of Chulhee Yun.

14:20 - 15:00

Lightning Talks


15:00 - 15:45

Poster session & coffee


15:45 - 16:25

Understanding Optimization in Deep Learning with Central Flows

A two-part talk with Alex Damian

Jeremy Cohen, The Flatiron Institute

Biography

Jeremy Cohen is a research fellow at the Flatiron Institute in New York, USA. He is broadly interested in turning deep learning into a principled engineering discipline, and currently works on understanding the dynamics of optimization algorithms in deep learning. He obtained his PhD in 2024 from Carnegie Mellon University, advised by Zico Kolter and Ameet Talwalkar. 

Abstract

Traditional theories of optimization cannot describe the dynamics of optimization in deep learning, even in the simple setting of deterministic training. The challenge is that optimizers typically operate in a complex, oscillatory regime called the “edge of stability.” In this work, we develop theory that can describe the dynamics of optimization in this regime. Our key insight is that while the *exact* trajectory of an oscillatory optimizer may be challenging to analyze, the *time-averaged* (i.e. smoothed) trajectory is often much more tractable. To analyze an optimizer, we derive a differential equation called a “central flow” that characterizes this time-averaged trajectory. We empirically show that these central flows can predict long-term optimization trajectories for generic neural networks with a high degree of numerical accuracy. By interpreting these central flows, we are able to understand how gradient descent makes progress even as the loss sometimes goes up; how adaptive optimizers “adapt” to the local loss landscape; and how adaptive optimizers implicitly navigate towards regions where they can take larger steps. Our results suggest that central flows can be a valuable theoretical tool for reasoning about optimization in deep learning. 

Profile picture of Jeremy Cohen.

16:25 - 17:05

Understanding Optimization in Deep Learning with Central Flows

A two-part talk with Jeremy Cohen

Alex Damian, The Kempner Institute at Harvard University 

Biography

Alex Damian is a research fellow at the Kempner Institute at Harvard University and will join MIT in Fall 2026 as an Assistant Professor of Mathematics and EECS [AI+D]. His research focuses on the mathematical foundations of deep learning, with particular emphasis on optimization dynamics and representation learning. He received his Ph.D. in Applied and Computational Mathematics from Princeton University, where he was advised by Jason D. Lee, and his B.S. in Mathematics from Duke University. His work has been supported by the NSF Graduate Research Fellowship and the Jane Street Graduate Research Fellowship. 

Abstract

Traditional theories of optimization cannot describe the dynamics of optimization in deep learning, even in the simple setting of deterministic training. The challenge is that optimizers typically operate in a complex, oscillatory regime called the “edge of stability.” In this work, we develop theory that can describe the dynamics of optimization in this regime. Our key insight is that while the *exact* trajectory of an oscillatory optimizer may be challenging to analyze, the *time-averaged* (i.e. smoothed) trajectory is often much more tractable. To analyze an optimizer, we derive a differential equation called a “central flow” that characterizes this time-averaged trajectory. We empirically show that these central flows can predict long-term optimization trajectories for generic neural networks with a high degree of numerical accuracy. By interpreting these central flows, we are able to understand how gradient descent makes progress even as the loss sometimes goes up; how adaptive optimizers “adapt” to the local loss landscape; and how adaptive optimizers implicitly navigate towards regions where they can take larger steps. Our results suggest that central flows can be a valuable theoretical tool for reasoning about optimization in deep learning. 

Profile picture of Alex Damian.

17:05 - 17:15

Closing

Map

A map showing the center of Lund city.