Focus Period Lund 2026

PhD Student

ETH Zurich and ETH AI Center Fellow (Switzerland)

Ilyas Fatkhullin is a Ph.D. candidate in Computer Science at ETH Zurich and an ETH AI Center Fellow. His research lies at the intersection of optimization theory and reinforcement learning, developing principled methods for reliable learning in high-dimensional, non-convex, and stochastic settings, with a focus on hidden convexity, heavy-tailed noise, and communication-efficient distributed training. His work has appeared in venues such as NeurIPS, ICML, and AISTATS, and in journals including SIAM Journal on Optimization, SIAM Journal on Control and Optimization, and JMLR. Highlights include an oral presentation at NeurIPS 2021 (main track), a spotlight presentation at ICML 2022 (main track), and oral presentations at the NeurIPS 2025 OPT and COML workshops. He is an invited early-career scholar at the ELLIIT Focus Period Lund 2026.

Presenting: Can SGD Handle Heavy-Tailed Noise? 

Stochastic Gradient Descent (SGD) is a cornerstone of large-scale optimization, yet its theoretical behavior under heavy-tailed noise, which is common in modern machine learning and reinforcement learning, remains poorly understood. In this work, we rigorously investigate whether vanilla SGD, without any adaptive modifications, can provably succeed under such adverse stochastic conditions. Assuming only that the stochastic gradients have bounded p-th moments for some p > 1, we establish sharp convergence guarantees for (projected) SGD across convex, strongly convex, and non-convex problem classes. In particular, we show that SGD achieves minimax-optimal sample complexity under minimal assumptions in the convex and strongly convex regimes. For non-convex objectives, we prove convergence to a stationary point with sample complexity O(ε^{-2p/(p-1)}), and we complement this positive result with a matching lower bound specific to SGD with arbitrary polynomial step-size schedules. Finally, we establish a high-probability lower bound for SGD, suggesting a tight polynomial dependence on the inverse of the failure probability, unlike the recent literature on adaptive methods, which can achieve a polylogarithmic dependence. On the one hand, these results challenge the prevailing view that heavy-tailed noise renders SGD ineffective and establish vanilla SGD as a robust and theoretically principled baseline, even in regimes where the variance is unbounded. On the other hand, our negative results justify the use of adaptive step-size upgrades to vanilla SGD in such challenging settings.
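
For intuition only, the following minimal Python sketch illustrates the setting described above: vanilla projected SGD with a polynomial step-size schedule, applied to a toy quadratic objective whose stochastic gradients are corrupted by Student-t noise. With 1.5 degrees of freedom this noise has unbounded variance but bounded p-th moments for p < 1.5, placing it in the heavy-tailed regime the abstract considers. Every function name, constant, and the objective itself are illustrative assumptions, not the paper's experiments or guarantees.

import numpy as np

rng = np.random.default_rng(0)

def grad(x):
    # Gradient of the toy objective f(x) = 0.5 * ||x||^2 (illustrative only).
    return x

def heavy_tailed_noise(dim, df=1.5):
    # Student-t noise: p-th moments are finite only for p < df,
    # so with df = 1.5 the variance is infinite.
    return rng.standard_t(df, size=dim)

def projected_sgd(x0, steps=10_000, c=0.1, alpha=0.5, radius=10.0):
    # Vanilla projected SGD with polynomial step sizes gamma_t = c / (t + 1)**alpha,
    # projected onto a Euclidean ball of the given radius.
    x = np.array(x0, dtype=float)
    for t in range(steps):
        g = grad(x) + heavy_tailed_noise(x.size)  # heavy-tailed stochastic gradient
        x = x - c / (t + 1) ** alpha * g          # plain SGD step, no clipping or adaptivity
        norm = np.linalg.norm(x)
        if norm > radius:                         # projection onto the ball
            x = x * (radius / norm)
    return x

print(np.linalg.norm(projected_sgd(np.full(5, 5.0))))  # distance to the minimizer at 0

The sketch only makes the objects in the abstract concrete (the step-size schedule, the projection, and a noise model with bounded p-th moments but unbounded variance); its behavior on this toy problem says nothing about the general guarantees discussed in the talk.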