Schedule for: 24w5297 - Mathematics of Deep Learning

Beginning on Sunday, June 9, and ending on Friday, June 14, 2024

All times in Oaxaca, Mexico time, CDT (UTC-5).

Sunday, June 9
14:00 - 23:59 Check-in begins (Front desk at your assigned hotel)
19:30 - 22:00 Dinner (Restaurant Hotel Hacienda Los Laureles)
20:30 - 21:30 Informal gathering (Hotel Hacienda Los Laureles)
Monday, June 10
07:30 - 08:45 Breakfast (Restaurant Hotel Hacienda Los Laureles)
08:45 - 09:00 Introduction and Welcome (Conference Room San Felipe)
09:00 - 10:00 Jeremias Sulam: Yes, my deep network works! But... what did it learn?
Abstract: Modern machine learning methods are revolutionizing what we can do with data — from TikTok video recommendations to biomarker discovery in cancer research. Yet, the complexity of these deep models makes it harder to understand what functions these data-dependent models are computing, and which features they learn to be important for a given task. In this talk, I will review two approaches for making general deep learning models more interpretable. The first will study an unsupervised setting in the context of imaging inverse problems and will show how to design and train networks that provide exact proximal operators approximating that of the (log) prior distribution of the data. The second will switch to supervised classification problems in computer vision, where we will re-think the use of Shapley coefficients for black-box model explanations.
(Conference Room San Felipe)
10:00 - 10:30 Coffee Break (Conference Room San Felipe)
10:30 - 11:15 Sam Buchanan: Lecture I: White-Box Architecture Design via Unrolled Optimization and Compression
Abstract: We will discuss the overall objective of high-dimensional data analysis, that is, learning and transforming the data distribution towards template distributions with relevant semantic content for downstream tasks (such as linear discriminative representations or expressive mixtures of semantically meaningful incoherent subspaces), and how it connects to deep representation learning. We will briefly review classical methods such as sparse coding through dictionary learning as particular instantiations of this learning paradigm when the underlying signal model is linear or sparsely generated, along with unrolled optimization as a design principle for white-box deep networks that are interpretable ab initio. To bridge from these model-specific white-box networks to white-box networks for general high-dimensional data distributions, we will then discuss the information-theoretic and statistical principles behind linear discriminative representations, and design a loss function, called the coding rate reduction, which is optimized at such a representation.
(Conference Room San Felipe)
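For reference (standard notation from the coding-rate-reduction literature, not quoted from the talk), for features Z = [z_1, ..., z_n] in R^{d x n} partitioned into k groups by diagonal membership matrices \Pi_1, ..., \Pi_k, the coding rate reduction objective takes roughly the form

R(Z) = (1/2) \log\det( I + (d / (n \epsilon^2)) Z Z^\top ),
R_c(Z | \Pi) = \sum_{j=1}^{k} ( \mathrm{tr}(\Pi_j) / (2n) ) \log\det( I + (d / (\mathrm{tr}(\Pi_j) \epsilon^2)) Z \Pi_j Z^\top ),
\Delta R(Z, \Pi) = R(Z) - R_c(Z | \Pi),

and is maximized by representations whose class components span low-dimensional, mutually incoherent subspaces.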
11:15 - 11:45 Coffee Break (Hotel Hacienda Los Laureles)
11:45 - 12:30 Sam Buchanan: Lecture II: White-Box Transformers via Sparse Rate Reduction
Abstract: We demonstrate how combining sparse coding and rate reduction yields "sparse linear discriminative representations" using an objective called sparse rate reduction. We develop CRATE, a deep network architecture, by unrolling the optimization of this objective and parameterizing the feature distribution in each layer. CRATE's operators are mathematically interpretable, with each layer representing an optimization step, making the network a transparent "white box". Remarkably, CRATE closely resembles the transformer architecture, suggesting that the interpretability gains from such networks might also improve our understanding of current, practical deep architectures. Experiments show that CRATE, despite its simplicity, indeed learns to compress and sparsify representations of large-scale real-world image and text datasets, and achieves performance very close to highly engineered transformer-based models.
(Conference Room San Felipe)
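Schematically (a hedged reading of the CRATE construction in my notation, not a verbatim description from the talk), each unrolled layer alternates a compression step and a sparsification step,

Z^{\ell + 1/2} \approx Z^\ell + \mathrm{MSSA}(Z^\ell)        (multi-head subspace self-attention: an approximate gradient step on the coding rate R_c),
Z^{\ell + 1} = \mathrm{ISTA}(Z^{\ell + 1/2}; D^\ell)          (a proximal/ISTA step promoting sparsity in a learned dictionary D^\ell),

so that the attention-like and MLP-like blocks of a transformer layer each carry an explicit optimization interpretation.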
12:30 - 14:00 Lunch (Restaurant Hotel Hacienda Los Laureles)
14:00 - 14:30 Lachlan MacDonald: Equivariant convolutional networks over arbitrary Lie groups
Abstract: The notion of equivariance has had a profound impact on machine learning, allowing the construction of data-efficient models that respect symmetries in data. A notable implementation of equivariance in machine learning is group equivariant convolutional neural networks (GCNNs); however, the range of groups considered in much of the literature on GCNNs is restricted by inadequately general theory. In this talk, I will discuss a natural generalisation of the best-known framework, due to Cohen and Welling in 2016, to arbitrary (connected) Lie groups, building on ideas from noncommutative geometry, and discuss its applications and current shortcomings.
(Conference Room San Felipe)
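As background (my notation, standard in the GCNN literature), the Cohen-Welling framework is built on group convolutions: for a signal f and filter \psi on a group G with Haar measure \mu,

(\psi \ast f)(g) = \int_G \psi(g^{-1} h) f(h) \, d\mu(h),

and layers of this form are equivariant in the sense that \Phi(L_g f) = L_g \Phi(f) for the left-translation action L_g. The talk concerns making such layers rigorous when G is an arbitrary connected Lie group.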
14:30 - 15:00 Joshua Agterberg: A Precise High-Dimensional Statistical Theory for Convex and Nonconvex Matrix Sensing
Abstract: Matrix sensing, or trace regression, is the problem of estimating a low-rank matrix from linear measurements perturbed with noise; this model serves as a toy model for various deep learning problems, particularly two-layer neural networks. Several existing works have studied both convex and nonconvex approaches to this problem, establishing minimax error rates when the number of measurements is sufficiently large relative to the rank and dimension of the low-rank matrix, though a precise comparison of these procedures remains unexplored. We will explicitly compare convex and nonconvex approaches to this problem and establish the asymptotic mean-squared error of each. We show that the nonconvex problem with known rank uniformly dominates the convex relaxation. Time permitting, we will discuss our proof, which is based on a novel generalization of the Convex Gaussian Min-Max theorem as well as a careful study of the interplay between local minima of the convex, regularized nonconvex, and unregularized nonconvex formulations and their corresponding “debiased” estimators.
(Conference Room San Felipe)
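For orientation (generic formulations of this flavor, not necessarily the exact ones analyzed in the talk), with measurements y = \mathcal{A}(X^*) + \xi of a rank-r matrix X^*, the two approaches are typically

convex relaxation:        \min_X (1/2) \| y - \mathcal{A}(X) \|_2^2 + \lambda \| X \|_* ,
nonconvex (known rank):   \min_{U, V \text{ with } r \text{ columns}} (1/2) \| y - \mathcal{A}(U V^\top) \|_2^2 ,

and the result above compares their asymptotic mean-squared errors.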
15:00 - 15:30 Coffee Break (Conference Room San Felipe)
15:30 - 16:30 Alejandro Ribeiro: Constrained Reinforcement Learning
Abstract: Constrained reinforcement learning (CRL) involves multiple rewards that must individually accumulate to given thresholds. CRL arises naturally in cyberphysical systems which are most often specified by a set of requirements. We explain in this talk that CRL problems have null duality gaps even though they are not convex. These facts imply that they can be solved in the dual domain but that standard dual gradient descent algorithms may fail to find optimal policies. We circumvent this limitation with the introduction of a state augmented algorithm in which Lagrange multipliers are incorporated in the state space. We show that state augmented algorithms sample from stochastic policies that achieve target rewards. We further introduce resilient CRL as a mechanism to relax constraints when requirements are overspecified. We illustrate results and implications with a brief discussion of safety constraints.
(Conference Room San Felipe)
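For reference (standard CRL notation, assumed rather than taken from the talk), the constrained problem is

\max_\pi V_0(\pi) \quad \text{subject to} \quad V_i(\pi) \ge c_i, \; i = 1, \dots, m, \qquad V_i(\pi) = \mathbb{E}_\pi \big[ \sum_t \gamma^t r_i(s_t, a_t) \big],

with Lagrangian L(\pi, \lambda) = V_0(\pi) + \sum_i \lambda_i ( V_i(\pi) - c_i ). The state-augmented approach mentioned in the abstract, roughly speaking, runs the policy on the augmented state (s_t, \lambda_t) while the multipliers \lambda_t are themselves updated online.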
16:30 - 17:00 Wilson Gregory: Equivariant Tensor Polynomials for Sparse Vector Recovery.
Abstract: First, we show how to write O(d)-equivariant polynomials from tensor inputs to tensor outputs in a convenient fashion. We then use these polynomials to build a machine learning model for sparse vector recovery. This model outperforms the best theoretically known spectral methods in some settings.
(Conference Room San Felipe)
17:00 - 19:00 Adjourn (Hotel Hacienda Los Laureles)
19:00 - 21:00 Dinner (Restaurant Hotel Hacienda Los Laureles)
Tuesday, June 11
07:30 - 09:00 Breakfast (Restaurant Hotel Hacienda Los Laureles)
09:00 - 10:00 Fanny Yang: Surprising phenomena of max-ℓ𝑝-margin classifiers in high dimensions
Abstract: In recent years, the analysis of max-ℓ_p-margin classifiers has gained attention from the theory community, not only due to the implicit bias of first-order methods, but also due to the observation of harmless interpolation for neural networks. In this talk, I will discuss two results. First, we show that, surprisingly, in the noiseless case, while minimizing the ℓ_1-norm achieves optimal rates for regression with hard-sparse ground truths, this adaptivity does not directly carry over to the corresponding max-ℓ_1-margin classification problem. Further, for noisy observations, we prove that max-ℓ_p-margin classifiers can achieve 1/\sqrt{n} rates for p slightly larger than one, while the max-ℓ_1-margin classifier only achieves rates of order 1/\sqrt{\log(d/n)}.
(Online - CMO)
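For reference (the standard definition, in my notation), the max-ℓ_p-margin classifier on linearly separable data (x_i, y_i), i = 1, ..., n, is

\hat{w}_p \in \arg\max_{\| w \|_p \le 1} \; \min_{1 \le i \le n} y_i \langle w, x_i \rangle,

with p = 2 recovering the usual hard-margin SVM direction and p = 1 the max-ℓ_1-margin classifier discussed in the abstract.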
10:00 - 10:30 Coffee Break (Conference Room San Felipe)
10:30 - 11:30 Gitta Kutyniok: Overcoming the Boundaries of Artificial Intelligence: A Mathematical Approach
Abstract: Classical approaches to artificial intelligence typically employ digital hardware. However, it turns out that such computing platforms impose serious restrictions on AI-based algorithms in terms of computability, reliability, legal requirements, and energy consumption. In this lecture, we will first discuss current mathematical limitations of artificial intelligence imposed by digital hardware, modeled as a Turing machine. We will then show how those boundaries can be overcome by embracing analog computing approaches, modeled by the Blum-Shub-Smale machine. This will reveal the tremendous importance of novel computing hardware, such as neuromorphic hardware, for future AI computing. Finally, we will discuss mathematical aspects of spiking neural networks, which mimic natural neural networks much more closely than classical artificial neural networks and are perfectly adapted to neuromorphic hardware.
(Online - CMO)
11:30 - 12:00 Group Photo (Hotel Hacienda Los Laureles)
12:00 - 14:00 Lunch (Restaurant Hotel Hacienda Los Laureles)
14:00 - 15:00 Rene Vidal: Learning Dynamics of Overparametrized Neural Networks
Abstract: This talk will provide a detailed analysis of the dynamics of gradient-based methods in overparameterized models. For linear networks, we show that the weights converge to an equilibrium at a linear rate that depends on the imbalance between input and output weights (which is fixed at initialization) and the margin of the initial solution. For ReLU networks, we show that the dynamics has a feature learning phase, where neurons collapse to one of the class centers, and a classifier learning phase, where the loss converges to zero at a rate of 1/t.
(Conference Room San Felipe)
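A concrete instance of the conserved imbalance (a standard fact for linear networks, stated in my notation rather than the speaker's): for a two-layer linear network f(x) = W_2 W_1 x trained by gradient flow on any loss, the quantity

D = W_1 W_1^\top - W_2^\top W_2

is constant along the trajectory, so it is determined by the initialization and enters the linear convergence rate described above.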
15:00 - 15:30 Coffee Break (Conference Room San Felipe)
15:30 - 16:30 Joshua Vogelstein: Probably Approximately Correct in the Future.
Abstract: In real-world applications, the distribution of the data and our goals evolve over time, and we therefore care about performance over time rather than just instantaneous performance. Yet the prevailing theoretical framework in artificial intelligence (AI) is probably approximately correct (PAC) learning, which ignores time. Existing strategies to address the dynamic nature of distributions and goals have typically not treated time formally, but rather heuristically. We therefore enrich PAC learning by assuming the data are sampled from a stochastic process, rather than a random variable, and adjust the loss accordingly. This generalizes the notion of learning to something we call "prospective learning". We prove that time-agnostic empirical risk minimization cannot solve certain trivially simple prospective learning problems. We then prove that a simple time-aware augmentation of empirical risk minimization provably solves certain prospective learning problems. Numerical experiments illustrate that a few different ways of incorporating time, including modifications of a transformer, lead to improved algorithms for prospective learning, including on visual recognition tasks constructed from MNIST and CIFAR. This framework offers a conceptual link towards both (i) improving AI solutions for currently intractable problems, and (ii) better characterizing the naturally intelligent systems that solve them.
(Conference Room San Felipe)
16:30 - 17:00 Salma Tarmoun: Gradient Descent and Attention Models: Challenges Posed by the Softmax Function
Abstract: Transformers have become ubiquitous in modern machine learning applications, yet their training remains a challenging task, often requiring extensive trial and error. Unlike previous architectures, transformers possess unique attention-based components, which can complicate the training process. The standard optimization algorithm, Gradient Descent, consistently underperforms in this context, underscoring the need for a deeper understanding of these difficulties: existing theoretical frameworks fall short and fail to explain this phenomenon. To address this gap, we analyze a simplified Softmax attention model that captures some of the core challenges associated with training transformers. Through a local analysis of the gradient dynamics, we highlight the role of the Softmax function in shaping the local curvature of the loss and show how it can lead to ill-conditioning of these models, which in turn can severely hamper the convergence speed. Our experiments confirm these theoretical findings on the critical impact of Softmax on the dynamics of Gradient Descent.
(Conference Room San Felipe)
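For context (the standard single-head form; the simplified model in the talk may differ), softmax attention computes

\mathrm{Attn}(X) = \mathrm{softmax}\!\big( X W_Q W_K^\top X^\top / \sqrt{d_k} \big) \, X W_V,

and the analysis above concerns how the row-wise softmax shapes the local curvature of the loss in models of this kind.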
17:00 - 19:00 Adjourn (Hotel Hacienda Los Laureles)
19:00 - 21:00 Dinner (Restaurant Hotel Hacienda Los Laureles)
Wednesday, June 12
07:30 - 09:00 Breakfast (Restaurant Hotel Hacienda Los Laureles)
09:00 - 10:00 Matus Telgarsky: Where within F_1 does gradient descent search?
Abstract: To what extent does gradient descent depart from its initial set of features and learn new features? On the one hand, learning arbitrarily good features promises sample-complexity improvements; on the other, it is known to be computationally intractable. This talk will survey a few different regimes and their tradeoffs, specifically cases where gradient descent does (and does not!) need tremendous computation to go below sample-complexity lower bounds for the initial feature set.
(Conference Room San Felipe)
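For reference (the usual variation-norm definition of F_1, e.g. as in Bach's convex neural network framework; my notation), F_1 consists of infinite-width one-hidden-layer networks

f(x) = \int \sigma( \langle w, x \rangle + b ) \, d\mu(w, b), \qquad \| f \|_{F_1} = \inf_\mu \int d|\mu|(w, b),

where the infimum is over signed measures \mu realizing f; the talk asks which parts of this class gradient descent can actually reach at reasonable computational cost.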
10:00 - 10:30 Coffee Break (Conference Room San Felipe)
10:30 - 11:30 Qi Lei: Data Reconstruction Attacks and Defenses: From Theory to Practice
Abstract: Reconstruction attacks and defenses are essential to understanding the data leakage problem in machine learning. However, prior work has centered around empirical observations of gradient inversion attacks, lacks theoretical grounding, and cannot disentangle the usefulness of defense methods from the computational limitations of attack methods. In this talk, we propose to view the problem as an inverse problem, enabling us to theoretically and systematically evaluate data reconstruction attacks. For various defense methods, we derive the algorithmic upper bound and the matching (in feature dimension and architecture dimension) information-theoretic lower bound on the reconstruction error for two-layer neural networks. To complement the theoretical results and investigate the utility-privacy trade-off, we define a natural evaluation metric that compares defense methods with similar utility loss under the strongest attacks. We further propose a strong reconstruction attack that helps revise some previous understanding of the strength of defense methods under our proposed evaluation metric.
(Conference Room San Felipe)
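For orientation (the generic gradient-inversion formulation, not necessarily the specific attack proposed in the talk), given a shared gradient g = \nabla_\theta \ell( f_\theta(x^*), y^* ), a reconstruction attack solves an inverse problem of the form

\hat{x} \in \arg\min_x \big\| \nabla_\theta \ell( f_\theta(x), y ) - g \big\|_2^2 \; (+ \text{ optionally a prior on } x),

and defenses such as gradient pruning or noising can then be compared by how much they increase the attainable reconstruction error.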
11:30 - 12:00 Free Time (Hotel Hacienda Los Laureles)
12:00 - 13:00 Lunch (Restaurant Hotel Hacienda Los Laureles)
13:00 - 19:00 Free Afternoon | Visit to Monte Alban (Oaxaca)
19:00 - 21:00 Dinner (Restaurant Hotel Hacienda Los Laureles)
Thursday, June 13
07:30 - 09:00 Breakfast (Restaurant Hotel Hacienda Los Laureles)
09:00 - 10:00 Luana Ruiz: A Poincaré Inequality and Consistency Results for Signal Sampling on Large Graphs
Abstract: Large-scale graph machine learning is challenging, as the complexity of learning models scales with the graph size. Subsampling the graph is a viable alternative, but sampling on graphs is nontrivial because graphs are non-Euclidean. Existing graph sampling techniques require not only computing the spectra of large matrices but also repeating these computations whenever the graph changes, e.g., grows. In this work, we introduce a signal sampling theory for a type of graph limit, the graphon. We prove a Poincaré inequality for graphon signals and show that complements of node subsets satisfying this inequality are unique sampling sets for Paley-Wiener spaces of graphon signals. Exploiting connections with spectral clustering and Gaussian elimination, we prove that such sampling sets are consistent in the sense that unique sampling sets on a convergent graph sequence converge to unique sampling sets on the graphon. We then propose a related graphon signal sampling algorithm for large graphs and demonstrate its good empirical performance on graph machine learning tasks.
(Conference Room San Felipe)
10:00 - 10:30 Coffee Break (Conference Room San Felipe)
10:30 - 11:30 Cristóbal Guzmán: The Role of Sparsity in Differentially-Private Learning
Abstract: With the increasing use of personally sensitive data in machine learning applications, privacy has become a central concern. In this context, differential privacy (DP) offers a rigorous and quantifiable control of the privacy risk of a machine learning model. One of the main problems of interest in differential privacy is stochastic optimization, where we are interested in computing a model that approximately minimizes the empirical or population excess risk while satisfying differential privacy with respect to the data used for training the model. In this talk, we will present various settings where one can obtain nearly dimension-independent accuracy rates in differentially private (stochastic) optimization and saddle-point problems. We start with results involving stochastic optimization under polyhedral constraints, popular in sparsity-oriented machine learning. Next, we move to stochastic saddle-point problems, where we study the use of stochastic mirror-descent methods and vertex sampling; these results are applied to problems including DP and group-fair machine learning, and DP synthetic data generation. Finally, I will present results on a "dual" counterpart of the above problems: stochastic optimization with sparse gradients, a setting of high relevance in large embedding models. Here, we provide new matching upper and lower bounds in both the convex and nonconvex settings.
(Conference Room San Felipe)
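For reference, the privacy notion throughout is (\epsilon, \delta)-differential privacy: a randomized algorithm M is (\epsilon, \delta)-DP if for all pairs of datasets D, D' differing in a single record and all measurable events A,

\Pr[ M(D) \in A ] \le e^{\epsilon} \Pr[ M(D') \in A ] + \delta.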
11:30 - 12:00 Soledad Villar: Symmetries in machine learning: point clouds and graphs.
Abstract: In this talk, we give an overview of the use of exact and approximate symmetries in machine learning models. We will focus on (1) mathematical tools to efficiently implement invariant functions on point clouds, and (2) symmetries as model selection for graph neural networks and bias-variance tradeoffs.
(Online - CMO)
12:00 - 14:00 Lunch (Restaurant Hotel Hacienda Los Laureles)
14:00 - 15:00 Chong You: How many FLOPs is a token worth?
Abstract: Despite the remarkable capabilities of deep learning models, their training and inference processes are often hindered by exorbitant computational costs, largely attributed to the intensive nature of dense matrix multiplication. This presentation illuminates an intriguing revelation: within trained Transformer models lies a substantial degree of inherent sparsity in intermediate activation maps, with percentages such as 3.0% for T5 and 6.3% for ViT. Through rigorous experimentation, we demonstrate the prevalence of such sparsity across diverse tasks, Transformer architectures of varying scales, and at all levels of depth. By harnessing this sparsity, we offer practical strategies to enhance the efficiency of inference for large language models. Concluding the discussion, we delve into theoretical insights elucidating the origins of sparsity within unconstrained feature models.
(Conference Room San Felipe)
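As a toy illustration of the kind of measurement behind these figures (my own PyTorch sketch, not the speaker's code; the abstract's 3.0% and 6.3% presumably refer to the fraction of nonzero entries in the intermediate activation maps of trained models, whereas a randomly initialized block sits near 50%):

import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy feed-forward (MLP) block of a Transformer layer.
d_model, d_ff, n_tokens = 64, 256, 128
fc1, act, fc2 = nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)

x = torch.randn(n_tokens, d_model)
with torch.no_grad():
    hidden = act(fc1(x))                                  # intermediate activation map
    frac_nonzero = (hidden > 0).float().mean().item()     # sparsity statistic
    _ = fc2(hidden)                                       # completes the block

print(f"fraction of nonzero intermediate activations: {frac_nonzero:.3f}")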
15:00 - 15:30 Coffee Break (Conference Room San Felipe)
15:30 - 16:00 Josue Tonelli-Cueto: Lazy quotient metrics: Approximate symmetries for ML models
Abstract: In many learning contexts, the target labels don't change much if we apply a small deformation to the input data. Mathematically, this can be formalized by quotienting out approximate symmetries in the design of the hypothesis class of functions. By doing so, we obtain a space of functions that are invariant with respect to "small" symmetries but not "large" ones. In this talk, I present a new approach in which we formalize the notion of approximate symmetries using what we call the lazy quotient metric, and apply it to toy machine learning problems. This is joint work with Dustin Mixon, Soledad Villar, and Brantley Vose.
(Conference Room San Felipe)
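For context (only the standard construction; the lazy modification is the talk's contribution), quotienting a metric space (X, d) by a group G acting on X gives the quotient pseudometric

d_G(x, y) = \inf_{g \in G} d(x, g \cdot y),

which treats all symmetries as exact; the lazy quotient metric described above modifies this so that, informally, only "small" symmetries are quotiented out while "large" ones still separate points.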
16:00 - 16:30 Young Kyung Kim: Vision Transformers with Natural Language Semantics
Abstract: Tokens or patches within Vision Transformers (ViT) lack essential semantic information, unlike their counterparts in natural language processing (NLP). Typically, ViT tokens are associated with rectangular image patches that lack specific semantic context, making interpretation difficult and failing to effectively encapsulate fundamental visual information. We introduce a novel transformer model, the Semantic Vision Transformer (sViT), which leverages recent progress on segmentation models to design novel tokenizer strategies. sViT effectively harnesses semantic information, creating an inductive bias reminiscent of convolutional neural networks while capturing the global dependencies and contextual information within images that are characteristic of transformers. Through validation on real datasets, sViT demonstrates superiority over ViT while requiring less training data. Furthermore, sViT shows significant superiority in out-of-distribution generalization and robustness to natural distribution shifts, attributed to its scale-invariant semantic characteristics. Notably, the use of semantic tokens significantly enhances the model's interpretability. Lastly, the proposed paradigm facilitates the introduction of new and powerful augmentation techniques at the token (or segment) level, increasing training data diversity and generalization capabilities. To conclude, just as sentences are made of words, images are formed by semantic objects; our proposed methodology leverages recent progress in object segmentation and takes an important and natural step toward interpretable and robust vision transformers.
(Conference Room San Felipe)
16:30 - 17:00 Samar Hadou: Robust Unrolled Networks
Abstract: Deep unrolling is an emerging learning-to-optimize method that unrolls a truncated iterative algorithm into the layers of a trainable neural network. However, the convergence guarantees and generalizability of unrolled networks are still open problems. To tackle these problems, we provide deep unrolled architectures with a stochastic descent nature by imposing descending constraints during training. The descending constraints are enforced layer by layer to ensure that each unrolled layer takes, on average, a descent step toward the optimum during training. We theoretically prove that the sequence constructed by the outputs of the unrolled layers is then guaranteed to converge. We also show that our imposed constraints provide the unrolled networks with robustness to perturbations. We numerically assess unrolled architectures trained under the proposed constraints in two applications, namely sparse coding using the learnable iterative shrinkage and thresholding algorithm (LISTA) and image inpainting using a proximal generative flow (GLOW-Prox), and demonstrate the performance and robustness benefits of the proposed method.
(Conference Room San Felipe)
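For reference (the standard LISTA recursion named in the abstract, with per-layer learnable parameters as is typical in unrolled networks; my notation), the sparse-coding layers have the form

z^{(k+1)} = \mathrm{soft}_{\theta_k}\!\big( W_e y + S_k z^{(k)} \big), \qquad \mathrm{soft}_\theta(u) = \mathrm{sign}(u) \max( |u| - \theta, 0 ),

and the descending constraints described above additionally require each such layer to decrease the objective, on average, during training.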
17:00 - 19:00 Adjourn (Hotel Hacienda Los Laureles)
19:00 - 21:00 Dinner (Restaurant Hotel Hacienda Los Laureles)
Friday, June 14
07:30 - 09:00 Breakfast (Restaurant Hotel Hacienda Los Laureles)
09:00 - 10:00 TBA (Hotel Hacienda Los Laureles)
10:00 - 10:30 Coffee Break (Conference Room San Felipe)
10:30 - 11:00 TBA (Hotel Hacienda Los Laureles)
11:00 - 11:30 TBA (Conference Room San Felipe)
11:30 - 12:00 TBA (Conference Room San Felipe)
12:00 - 13:30 Lunch (Restaurant Hotel Hacienda Los Laureles)