Schedule for: 24w5199 - Mathematical and Statistical Tools for High Dimensional Data on Compressive Networks

Beginning on Sunday, May 26 and ending Friday, May 31, 2024

All times in Oaxaca, Mexico time, CDT (UTC-5).

Sunday, May 26
14:00 - 23:59 Check-in begins (Front desk at your assigned hotel)
19:30 - 22:00 Dinner (Restaurant Hotel Hacienda Los Laureles)
20:30 - 21:30 Informal gathering (Hotel Hacienda Los Laureles)
Monday, May 27
07:30 - 09:15 Breakfast (Restaurant Hotel Hacienda Los Laureles)
09:15 - 09:30 Introduction and Welcome (Conference Room San Felipe)
09:30 - 10:30 Ji Zhu: Statistical Network Analysis: Estimation and Inference
Recent advances in computing and measurement technologies have led to an explosion in the amount of data with network structures in a variety of fields including social networks, biological networks, transportation networks, the World Wide Web, and so on. This creates a compelling need to understand the generative mechanism of these networks and to explore various characteristics of the network structures in a principled way. Latent space models are powerful statistical tools for modeling and understanding network data. While the importance of accounting for uncertainty in network analysis is well recognized, current literature predominantly focuses on point estimation and prediction, leaving the statistical inference of latent space network models an open question. In this talk, I will present some of our recent work that aims to fill this gap by providing a general framework for analyzing the theoretical properties of the maximum likelihood estimators for latent space network models. In particular, we establish uniform consistency and individual asymptotic distribution results for latent space network models with a broad range of link functions and edge types. Furthermore, the proposed framework enables us to generalize our results to the sparse and dependent-edge scenarios. Our theories are supported by simulation studies and have the potential to be applied in downstream inferences, such as link prediction and network-assisted supervised learning.
(Conference Room San Felipe)
10:30 - 11:00 Coffee Break (Conference Room San Felipe)
11:00 - 12:00 Group Photo (Hotel Hacienda Los Laureles)
13:30 - 15:00 Lunch (Restaurant Hotel Hacienda Los Laureles)
15:00 - 16:00 Linglong Kong: Inference for functional response model
In functional response models, one of the key components is estimating the functional coefficients, also called varying coefficients. The existing literature offers various methods, for example, kernel smoothing, spline, functional principal component, and Reproducing Kernel Hilbert Space (RKHS)-based methods. However, inference on the functional coefficients is limited. In this talk, we will discuss some techniques for inference with kernel and spline smoothing. We will also explore the potential for inference with functional principal component and RKHS-based methods. If possible, we will discuss inference when treating functional data as lying in a manifold space. Our framework is general and is capable of dealing with different loss functions, such as L2, L1, quantile, expectile, and M-estimation.
(Conference Room San Felipe)
16:00 - 16:30 Coffee Break (Conference Room San Felipe)
16:30 - 17:30 Ping-Shou Zhong: Causal Inference for Biomarker Identification with High-dimensional Outcome Variables
In the fields of genomics, genetics, and neuroimaging, the identification of biomarkers plays a crucial role in detecting disease-caused changes among a multitude of potential candidates. Existing causal inference methods predominantly address low-dimensional outcome variables, rendering them unsuitable for biomarker identification involving a substantial number of candidates. To address this gap, we present a novel causal inference procedure tailored for high-dimensional data when the number of biomarker candidates exceeds the sample size. Our proposed method exhibits doubly robust properties, offering resilience against misspecification in both propensity score functions and outcome regression models. We establish the asymptotic distributions of our proposed statistic, which vary depending on the particular misspecifications in propensity score or outcome regression models. To adaptively estimate moments in the asymptotic distributions, we introduce a bootstrap procedure. We evaluate the finite-sample performance of our approach through comprehensive numerical simulation studies. Additionally, we apply our method to a diffusion MRI dataset, identifying regions of interest with potential as biomarkers for Parkinson's disease.
(Conference Room San Felipe)
19:00 - 21:00 Dinner (Restaurant Hotel Hacienda Los Laureles)
Tuesday, May 28
07:30 - 09:00 Breakfast (Restaurant Hotel Hacienda Los Laureles)
09:30 - 10:30 Ting Li: Optimal Clustering of Discrete Mixtures and Multi-layer Networks
We first study the fundamental limit of clustering networks when a multi-layer network is present. Under the mixture multi-layer stochastic block model (MMSBM), we show that the minimax optimal network clustering error rate takes an exponential form and is characterized by the R\'enyi-$1/2$ divergence between the edge probability distributions of the component networks. We propose a novel two-stage network clustering method consisting of a tensor-based initialization algorithm, which involves both node and sample splitting, and a refinement procedure based on a likelihood-based Lloyd's algorithm. Network clustering must be accompanied by node community detection. Our proposed algorithm achieves the minimax optimal network clustering error rate and allows extreme network sparsity under MMSBM. Numerical simulations and real data experiments both validate that our method outperforms existing methods. We then extend our methodology and analysis framework to study the minimax optimal clustering error rate for mixtures of discrete distributions, including Binomial, Poisson, and multi-layer Poisson networks. The minimax optimal clustering error rates in these discrete mixtures all take the same exponential form characterized by the R\'enyi-$1/2$ divergence. These optimal clustering error rates in discrete mixtures can also be achieved by our proposed two-stage clustering algorithm.
(Conference Room San Felipe)
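For readers unfamiliar with the quantity named in the abstract above, the following is a minimal sketch of the R\'enyi divergence of order 1/2 between two Bernoulli edge distributions. The edge probabilities p and q and the function name are illustrative assumptions; the exact form of the error rate from the talk is not reproduced here.

```python
import math


def renyi_half_bernoulli(p: float, q: float) -> float:
    """Renyi divergence of order 1/2 between Bernoulli(p) and Bernoulli(q).

    p and q are hypothetical edge probabilities of two component networks.
    """
    # Bhattacharyya coefficient: sum over both outcomes of sqrt(P(x) * Q(x)).
    bc = math.sqrt(p * q) + math.sqrt((1.0 - p) * (1.0 - q))
    # The order-1/2 Renyi divergence equals -2 times the log of that coefficient.
    return -2.0 * math.log(bc)


if __name__ == "__main__":
    # Well-separated edge probabilities give a larger divergence than close ones.
    print(renyi_half_bernoulli(0.10, 0.11))
    print(renyi_half_bernoulli(0.10, 0.50))
```

The divergence is symmetric in p and q at order 1/2 and vanishes when the two distributions coincide.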
10:30 - 11:00 Coffee Break (Conference Room San Felipe)
11:00 - 12:00 Min Yang: Scalable Methodologies for Big Data Analysis: Integrating Flexible Statistical Models and Optimal Designs
The formidable challenge presented by the analysis of big data stems not just from its sheer volume, but also from the diversity, complexity, and the rapid pace at which it needs to be processed or delivered. A compelling approach is to analyze a sample of the data, while still preserving the comprehensive information contained in the full dataset. Although there is a considerable amount of research on this subject, the majority of it relies on classical statistical models, such as linear models and generalized linear models. These models serve as powerful tools when the relationships between input and output variables are uniform. However, they may not be suitable when applied to complex datasets, as they tend to yield suboptimal results in the face of inherent complexity or heterogeneity. In this presentation, we will introduce a broadly applicable and scalable methodology designed to overcome these challenges. This is achieved through an in-depth exploration and integration of cutting-edge statistical methods, drawing particularly from neural network models and, more specifically, Mixture-of-Experts (ME) models, along with optimal designs.
(Conference Room San Felipe)
13:30 - 15:00 Lunch (Restaurant Hotel Hacienda Los Laureles)
15:00 - 16:00 Wenlu Tang: Statistical Inference for Pairwise Comparison Models
Pairwise comparison models have been widely used for utility evaluation and ranking across various fields. The increasing scale of problems today underscores the need to understand statistical inference in these models when the number of subjects diverges, a topic currently lacking in the literature except in a few special instances. To partially address this gap, this paper establishes a near-optimal asymptotic normality result for the maximum likelihood estimator in a broad class of pairwise comparison models, as well as a non-asymptotic convergence rate for each individual subject under comparison. The key idea lies in identifying the Fisher information matrix as a weighted graph Laplacian, which can be studied via a meticulous spectral analysis. Our findings provide a unified theory for performing statistical inference in a wide range of pairwise comparison models beyond the Bradley--Terry model, benefiting practitioners with theoretical guarantees for their use. Simulations utilizing synthetic data are conducted to validate the asymptotic normality result, followed by a hypothesis test using a tennis competition dataset.
(Conference Room San Felipe)
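To illustrate the key idea in the abstract above, here is a minimal sketch for the Bradley--Terry special case only, where P(i beats j) is the logistic function of u_i - u_j. In that case the Fisher information for the utility vector u is the Laplacian of the comparison graph with edge weights n_ij * p_ij * (1 - p_ij). The function name, the utilities, and the comparison counts are illustrative assumptions, not taken from the talk.

```python
import numpy as np


def bt_fisher_information(u, n_comparisons):
    """Fisher information of Bradley-Terry utilities as a weighted graph Laplacian.

    u: utilities, one per subject (hypothetical values).
    n_comparisons: symmetric matrix, entry (i, j) = number of times i and j met.
    """
    u = np.asarray(u, dtype=float)
    n = np.asarray(n_comparisons, dtype=float)
    # p[i, j] = P(i beats j) = logistic(u_i - u_j).
    p = 1.0 / (1.0 + np.exp(-(u[:, None] - u[None, :])))
    # Edge weight: number of comparisons times the Bernoulli variance p(1 - p).
    w = n * p * (1.0 - p)
    np.fill_diagonal(w, 0.0)
    # Graph Laplacian: degree matrix minus weight matrix.
    return np.diag(w.sum(axis=1)) - w


if __name__ == "__main__":
    utilities = [0.0, 0.5, -0.3]
    counts = [[0, 10, 0], [10, 0, 5], [0, 5, 0]]
    info = bt_fisher_information(utilities, counts)
    print(info)
    # Rows sum to zero, reflecting that utilities are identified only up to a shift.
    print(info.sum(axis=1))
```

The zero row sums are why a normalization such as fixing the sum of utilities is typically needed before inverting the matrix.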
16:00 - 16:30 Coffee Break (Conference Room San Felipe)
16:30 - 17:30 Bei Jiang: Online local differential private quantile inference via self-normalization
Based on binary inquiries, we developed an algorithm to estimate population quantiles under Local Differential Privacy (LDP). By self-normalizing, our algorithm provides asymptotically normal estimation with valid inference, resulting in tight confidence intervals without the need for nuisance parameters to be estimated. Our proposed method can be conducted fully online, leading to high computational efficiency and minimal storage requirements. We also proved an optimality result by an elegant application of one central limit theorem of Gaussian Differential Privacy (GDP) when targeting the frequently encountered median estimation problem. With mathematical proof and extensive numerical testing, we demonstrate the validity of our algorithm both theoretically and experimentally.
(Conference Room San Felipe)
19:00 - 21:00 Dinner (Restaurant Hotel Hacienda Los Laureles)
Wednesday, May 29
07:30 - 09:00 Breakfast (Restaurant Hotel Hacienda Los Laureles)
09:00 - 10:00 Jingfang Huang: Analysis based tools for high dimensional data sets.
We present our recent work on developing tools for special classes of high dimensional data sets. Utilizing the compressible features in the data sets and an associated network structure, we show how some high dimensional data sets can be effectively processed in 10+ dimensions. We present two examples to demonstrate some ideas. In the first example, we show how to compute the expectations of low-rank functions when the random variables follow the truncated multivariate normal distribution (TMVN) in dimensions as high as 1000 when the variance matrix has hierarchical low-rank structures. In the second example, we show some recent work on the dependence test of a scalar random variable with a vector of 10+ variables.
(Conference Room San Felipe)
10:30 - 11:00 Coffee Break (Conference Room San Felipe)
11:00 - 12:00 Jeremy Marzuola: Spectral minimal partitions, nodal deficiency and the Dirichlet-to-Neumann map
The oscillation of a Laplacian eigenfunction gives a great deal of information about the manifold on which it is defined. This oscillation can be encoded in the nodal deficiency, an important geometric quantity that is notoriously hard to compute, or even estimate. Here we compare two recently obtained formulas for the nodal deficiency, one in terms of an energy function on the space of equipartitions of the manifold, and the other in terms of a two-sided Dirichlet-to-Neumann map defined on the nodal set. We relate these two approaches by giving an explicit formula for the Hessian of the equipartition energy in terms of the Dirichlet-to-Neumann map. This allows us to compute Hessian eigenfunctions, and hence directions of steepest descent, for the equipartition energy in terms of the corresponding Dirichlet-to-Neumann eigenfunctions. Our results do not assume bipartiteness, and hence are relevant to the study of spectral minimal partitions. This is joint work partly with Greg Berkolaiko, Yaiza Canzani and Graham Cox. I will also discuss maximal oscillations in so-called “chain domains,” a continuum version of a Stochastic Block Model; this is joint work with Tom Beck and Yaiza Canzani. Finally, I will discuss an in-progress version of this notion for analyzing spectral partitions of graphs.
(Conference Room San Felipe)
12:00 - 13:00 Lunch (Restaurant Hotel Hacienda Los Laureles)
13:00 - 19:00 Free Afternoon (Oaxaca)
19:00 - 21:00 Dinner (Restaurant Hotel Hacienda Los Laureles)
Thursday, May 30
07:30 - 09:00 Breakfast (Restaurant Hotel Hacienda Los Laureles)
09:30 - 10:30 Braxton Osting: Targeting Influence in a social network
We introduce and analyze a model for targeting influence in a social network. Namely, the opinion of each member of the network lies in an opinion space, taken to be the convex hull of some extreme opinions. Member opinions are assumed to evolve via a nearest-neighbor Laplacian dynamical model and are affected by opinion authorities. We pose the question: how should opinion authorities be chosen to maximally target influence within the network? We establish that our general problem is NP-hard and that the objective function is a submodular function. Introducing a convex relaxation, we show that the problem can be approximately solved using fast methods. This is joint work with Zachary Boyd, Nicolas Fraiman, Jeremy Marzuola, and Peter Mucha.
(Conference Room San Felipe)
10:30 - 11:00 Coffee Break (Conference Room San Felipe)
11:00 - 12:00 Vladas Pipiras: Multivariate (high-dimensional) time series modeling for multiple subjects
The focus of this talk is on multivariate, possibly high-dimensional, time series modeling for multiple subjects. The time series could be large sparse vector autoregressions (VARs) or dynamic factor models sharing common structures across the subjects. The talk will cover parts of several recent papers of the speaker and co-authors, touching upon methodological and computational issues, several applications, and theoretical challenges.
(Conference Room San Felipe)
13:30 - 15:00 Lunch (Restaurant Hotel Hacienda Los Laureles)
15:00 - 16:00 Sayan Banerjee: Centrality measures in dynamic random networks
Centrality measures (CM) are vertex statistics which quantify the ‘popularity’ of a vertex in the network. These can be local statistics like degree, or they can be non-local, like Google’s PageRank, where the popularity of a vertex depends on the network geometry beyond its one-step neighborhood. We investigate the role of CM in the evolution, asymptotics and reconstruction of dynamic random networks. In particular, we discuss a class of growing networks where the propensity of a vertex to attract new ‘friends’ depends on its current centrality score. This includes preferential attachment models (CM = degree) and random surfer models (CM = PageRank). We explore root and seed detection problems, the distribution of centrality scores of the network vertices, and compare the efficacy of different centrality measures in quantifying popularity.
(Conference Room San Felipe)
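The contrast between a local and a non-local centrality measure in the abstract above can be made concrete with a small sketch. It computes in-degree and PageRank by power iteration on a toy directed graph in which two vertices share the same in-degree but receive links from sources of different popularity. The graph, the damping factor of 0.85, and the function name are illustrative assumptions; the growing-network models from the talk are not simulated here.

```python
import numpy as np


def pagerank(adj, damping=0.85, tol=1e-12, max_iter=1000):
    """PageRank by power iteration; adj[i, j] = 1 means an edge from i to j."""
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    out_deg = adj.sum(axis=1)
    safe_deg = np.where(out_deg > 0, out_deg, 1.0)
    # Row-stochastic transitions; vertices with no out-edges jump uniformly.
    trans = np.where(out_deg[:, None] > 0, adj / safe_deg[:, None], 1.0 / n)
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_rank = (1.0 - damping) / n + damping * (trans.T @ rank)
        if np.abs(new_rank - rank).sum() < tol:
            return new_rank
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Toy graph: 1 -> 0, 2 -> 0, 3 -> 0, 0 -> 1, 4 -> 2.
    edges = [(1, 0), (2, 0), (3, 0), (0, 1), (4, 2)]
    adj = np.zeros((5, 5))
    for src, dst in edges:
        adj[src, dst] = 1.0
    print("in-degree:", adj.sum(axis=0))
    print("PageRank: ", pagerank(adj))
```

Vertices 1 and 2 both have in-degree 1, but vertex 1 is linked from the highly ranked vertex 0, so PageRank separates them where degree cannot.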
16:00 - 16:30 Coffee Break (Conference Room San Felipe)
16:30 - 17:30 Zach Boyd: Correlation networks: Interdisciplinary approaches beyond thresholding
Many empirical networks originate from correlational data, arising in domains as diverse as psychology, neuroscience, genomics, microbiology, finance, and climate science. Specialized algorithms and theory have been developed in different application domains for working with such networks, as well as in statistics, network science, and computer science, often with limited communication between practitioners in different fields. This leaves significant room for cross-pollination across disciplines. A central challenge is that it is not always clear how to best transform correlation matrix data into networks for the application at hand, and probably the most widespread method, i.e., thresholding on the correlation value to create either unweighted or weighted networks, suffers from multiple problems. In this article, we review various methods of constructing and analyzing correlation networks, ranging from thresholding and its improvements to weighted networks, regularization, dynamic correlation networks, threshold-free approaches, and more. Finally, we propose and discuss a variety of key open questions currently confronting this field.
(Conference Room San Felipe)
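As a point of reference for the thresholding baseline mentioned in the abstract above, here is a minimal sketch that builds an unweighted or weighted network from a correlation matrix. Thresholding on the absolute correlation, the cutoff of 0.5, and the function name are illustrative assumptions; the improvements and threshold-free alternatives reviewed in the talk are not shown.

```python
import numpy as np


def threshold_correlation_network(data, threshold=0.5, weighted=False):
    """Build a network from data (rows = observations, columns = variables).

    An edge joins two variables when |correlation| >= threshold (assumed rule).
    """
    corr = np.corrcoef(data, rowvar=False)
    keep = np.abs(corr) >= threshold
    # Remove self-loops: every variable has correlation 1 with itself.
    np.fill_diagonal(keep, False)
    if weighted:
        # Keep the correlation value as the edge weight.
        return np.where(keep, corr, 0.0)
    return keep.astype(int)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=500)
    # Column 1 is a noisy copy of column 0; column 2 is independent noise.
    data = np.column_stack([x, x + 0.1 * rng.normal(size=500), rng.normal(size=500)])
    print(threshold_correlation_network(data))
```

The result depends directly on the chosen cutoff, which is one of the sensitivities that motivates looking beyond plain thresholding.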
19:00 - 21:00 Dinner (Restaurant Hotel Hacienda Los Laureles)
Friday, May 31
07:30 - 09:00 Breakfast (Restaurant Hotel Hacienda Los Laureles)
09:30 - 10:30 Nicolas Fraiman: Semi-Supervised Community Detection: A Quasi-Stationary Distribution Approach
This talk introduces a novel method for semi-supervised community detection in graphs that significantly improves detection accuracy by leveraging limited labeled data. Our approach employs quasi-stationary distributions of random walks to enhance detection capabilities. We demonstrate its effectiveness on the partially labeled stochastic block model, obtaining explicit error rates for recovery, and showing successful community detection even below the information-theoretic threshold for partial recovery with no side information. This is joint work with Michael Nisenzon.
(Conference Room San Felipe)
10:30 - 11:00 Coffee Break (Conference Room San Felipe)
11:00 - 12:00 Todd Ogden (Conference Room San Felipe)
12:00 - 14:00 Lunch (Restaurant Hotel Hacienda Los Laureles)