Idea Seminar

The MLOPT Idea Seminar is a casual forum where junior researchers from all around the world are invited to present their recent ideas and findings. The seminar aims to provide an open and supportive platform for researchers to share their published or in-progress papers with the opportunity to receive feedback from their peers. Speakers are also encouraged to bring up open problems or recent research which caught their attention so that it may spark interesting discussions and potentially future collaboration. The series covers a range of topics, typically but not limited to the intersection of machine learning and optimization. We meet on Thursdays from 12:30 pm to 1:30 pm CT.

Select seminar recordings will be uploaded at our YouTube channel (https://www.youtube.com/channel/UCOVAy4FR4YlfPHuft0Rv1-w), subject to speaker approval, for a wider audience to access and engage.

Upcoming Events

December 19, 2024: Haoyue Bai (UW–Madison)

Speaker: Haoyue Bai (UW–Madison)
Time: December 5, 2024, 11:30 AM – 12:30 PM CT
Zoom Link: This link
Title: AHA: Adaptive Human-Assisted Out-of-Distribution Learning
Abstract: Modern machine learning models deployed often encounter distribution shifts in real-world applications, manifesting as covariate or semantic out-of-distribution (OOD) shifts. These shifts give rise to challenges in OOD generalization and OOD detection. This work introduces a novel, integrated approach AHA (Adaptive Human-Assisted OOD learning) to simultaneously address both OOD generalization and detection through a human-assisted framework by labeling data in the wild. Our approach strategically labels examples within a novel maximum disambiguation region, where the number of semantic and covariate OOD data roughly equalizes. By labeling within this region, we can maximally disambiguate the two types of OOD data, thereby maximizing the utility of the fixed labeling budget.
Bio: I am a Ph.D. student at the Computer Sciences Department, University of Wisconsin-Madison, working with Prof. Robert Nowak. Prior to that, I obtained my bachelor’s degree at Zhejiang University and my master’s degree at The Hong Kong University of Science and Technology. My research focuses on trustworthy AI and developing reliable machine learning in the open-world, with an emphasis on enhancing robustness against distribution shifts and detecting out-of-distribution data.
Location: Mech Engr 3210

Previous Events

December 5, 2024: Sean (Xuefeng) Du (UW–Madison)

Speaker: Sean (Xuefeng) Du (UW–Madison)
Time: December 5, 2024, 11:30 AM – 12:30 PM CT
Zoom Link: This link
Title: HaloScope: Harnessing Unlabeled LLM Generations for Hallucination Detection
Abstract: The increasing reliance on large language models (LLMs) has brought attention to their reliability, particularly concerning hallucinations—fabricated or untruthful information embedded in model outputs. This talk presents HaloScope, a novel framework that addresses the challenge of hallucination detection using unlabeled LLM generations naturally emerging in deployment scenarios. HaloScope introduces a two-step process: (1) identifying a latent subspace within LLM representations that captures hallucinated content, and (2) leveraging this subspace to estimate membership scores, enabling the training of a binary truthfulness classifier. The framework eliminates the need for labor-intensive human annotations or extensive labeled datasets, offering flexibility and scalability for real-world applications. Empirical results demonstrate HaloScope’s superior performance on benchmark datasets and its robustness across diverse LLM architectures, outperforming state-of-the-art methods by a significant margin. This talk will explore HaloScope’s methodology, experimental insights, and implications for advancing reliable LLM deployments.
Bio: Sean (Xuefeng) Du is a final year CS Ph.D. student at University of Wisconsin-Madison advised by Prof. Sharon Li. His research interest is in reliable machine learning and the applications to foundation models and AI safety. His first-author papers have been recognized with multiple oral and spotlight presentations at NeurIPS and CVPR. He is a recipient of the Jane Street Graduate Research Fellowship, and Rising Stars in Data Science award.
Location: Mech Engr 3210

November 14, 2024: Julia Nakhleh (UW–Madison)

Speaker: Julia Nakhleh (UW–Madison)
Time: November 14, 2024, 12:30 PM – 1:30 PM CT
Zoom Link: This link
Title: The Effects of Multi-Task Learning on ReLU Neural Network Functions
Abstract: We study the properties of solutions to multi-task shallow ReLU neural network learning problems, wherein the network is trained to fit a dataset with minimal sum of squared weights. Remarkably, the solutions learned for each individual task resemble those obtained by solving a kernel method, revealing a novel connection between neural networks and kernel methods. It is known that single-task neural network training problems are equivalent to minimum norm interpolation problem in a non-Hilbertian Banach space, and that the solutions of such problems are generally non-unique. In contrast, we prove that the solutions to univariate-input, multi-task neural network interpolation problems are almost always unique, and coincide with the solution to a minimum-norm interpolation problem in a Sobolev (Reproducing Kernel) Hilbert Space. We also demonstrate a similar phenomenon in the multivariate-input case; specifically, we show that neural network learning problems with large numbers of diverse tasks are approximately equivalent to an $\ell^2$$ (Hilbert space) minimization problem over a fixed kernel determined by the optimal neurons.
Bio: Julia is a PhD student in Computer Science at UW Madison, advised by Robert Nowak. Her interests broadly lie in the mathematical theory of deep learning, particularly in the mathematical properties of neural networks trained with different regularization strategies (such as weight decay).
Location: ECB 1045

October 31, 2024: Zheyang Xiong (UW-Madison)

Speaker: Zheyang Xiong (UW-Madison)
Time: October 31, 2024, 12:30 PM – 1:30 PM CT
Zoom Link: This link
Title: Everything Everywhere All at Once: LLMs can In-Context Learn Multiple Tasks in Superposition
Abstract: Large Language Models (LLMs) have demonstrated remarkable in-context learning (ICL) capabilities. In this study, we explore a surprising phenomenon related to ICL: LLMs can perform multiple, computationally distinct ICL tasks simultaneously, during a single inference call, a capability we term “task superposition”. We provide empirical evidence of this phenomenon across various LLM families and scales and show that this phenomenon emerges even if we train the model to in-context learn one task at a time. We offer theoretical explanations that this capability is well within the expressive power of transformers. We also explore how LLMs internally compose task vectors during superposition. Furthermore, we show that larger models can solve more ICL tasks in parallel, and better calibrate their output distribution. Our findings offer insights into the latent capabilities of LLMs, further substantiate the perspective of “LLMs as superposition of simulators”, and raise questions about the mechanisms enabling simultaneous task execution.
Bio: Zheyang Xiong is a graduate student in Computer Science at the University of Wisconsin-Madison, advised by Prof. Dimitris Papailiopoulos. His research interest is in Large Language Models.
Location: Mechanical Engineering, Room 3210

May 3, 2024: Dan Fu (Stanford)

Speaker: Dan Fu (Stanford)
Time: May 3, 2024, 12:30 PM – 1:30 PM CT
Zoom Link: https://uwmadison.zoom.us/j/99660474178?pwd=a2ZnellHazNJSEtqQjA3MENibWo2dz09
Title: Hardware-Aware Efficient Primitives for Machine Learning
Abstract: Efficiency is increasingly tied to quality to machine learning, with more efficient training algorithms leading to more powerful models. However, today’s most popular machine learning models are built on asymptotically inefficient primitives. For example, attention in Transformers scales quadratically in the input size, while MLPs scale quadratically in model dimension. In this talk, I discuss my work on improving the efficiency of the core primitives in machine learning, with an emphasis on hardware-aware algorithms and long-context applications. First, I focus on replacing attention with gated state space models (SSMs) and convolutions, which scale sub-quadratically in context length. I describe the H3 (Hungry Hungry Hippos) architecture, a gated SSM architecture that matches Transformers in quality up to 3B parameters and achieves 2.4x faster inference. Second, I focus on developing hardware-aware algorithms for SSMs and convolutions. I describe FlashFFTConv, a fast algorithm for computing SSMs and convolutions on GPU by optimizing the Fast Fourier Transform (FFT). FlashFFTConv yields up to 7x speedup and 5x memory savings, even over vendor solutions from Nvidia. Third, I will briefly touch on how these same techniques can also be used to develop sub-quadratic scaling in the model dimension. I will describe Monarch Mixer, which uses a generalization of the FFT to achieve sub-quadratic scaling in both sequence length and model dimension. Throughout the talk, I will give examples of how these ideas are beginning to take hold, with gated SSMs and their variants now leading to state-of-the-art performance in long-context language models, embedding models, and DNA foundation models.
Bio: Dan Fu is a PhD student in the Computer Science Department at Stanford University, where he is co-advised by Christopher Ré and Kayvon Fatahalian. His research interests are at the intersection of systems and machine learning. Recently, he has focused on developing algorithms and architectures to make machine learning more efficient, especially for enabling longer-context applications. His research has appeared as oral and spotlight presentations at NeurIPS, ICML, and ICLR, and he has received the best student paper runner up at UAI. Dan has also been supported by an NDSEG fellowship.
Location: Engineering Research Building (1500 Engineering Drive) Room 106

April 26, 2024 Aniket Rege (UW-Madison)

Speaker: Aniket Rege (UW-Madison)
Time: April 26, 2024, 12:30 PM – 1:30 PM CT
Zoom Link: https://uwmadison.zoom.us/j/91746213778?pwd=OE03U2VpOHBPZHJMZG1mL2RCSGZhQT09
Title: Matryoshka Representation Learning and Adaptive Semantic Search
Abstract: Learned representations are the backbone of modern ML systems, but often struggle to adapt to the required capacity of downstream tasks. To this end we propose Matryoshka Representation Learning which learns coarse-to-fine nested representations at no training or inference overhead to existing frameworks. We show MRL models can match or surpass independently trained baselines at practical dataset regimes (ImageNet, JFT) across vision (ResNet, ViT), language (BERT) and V+L (ALIGN) modalities on both classification and retrieval tasks. MRL notably powers OpenAI’s latest embedding models (Jan ’24) and Google Search products. We also propose the Adaptive Approximate Nearest Neighbors Search (AdANNS) framework, which leverages the flexibility of MRL to decouple the building blocks of web-scale search systems (search space pruning and approximate distance computation). AdANNS mitigates the rigidity of web-scale search systems and provides pareto optimal compute-accuracy tradeoff on ImageNet (image) and Natural Questions (document) retrieval. MRL and AdANNS are a step towards flexible representation learning and fast yet accurate web-scale search.
Bio: Aniket Rege is a 1^st year PhD student working with Prof. Ramya Vinayak (primary) and Prof. Yong Jae Lee. He is broadly interested in generative models and learning from human preferences (alignment). He graduated with a master’s degree in electrical and computer engineering from the University of Washington in May 2023, advised by Prof. Ali Farhadi.
Location: Materials Science and Engineering (1509 University Ave) Room 221

April 19, 2024: Ian Covert (Stanford)

Speaker: Ian Covert (Stanford)
Time: April 19, 2024, 12:30 PM – 1:30 PM CT
Zoom Recording: https://uwmadison.zoom.us/rec/share/gbKZBjMOucEItMIgBpo8XZ9FFVdch73vrK76ck3UFJHsBz-Kme2kq9o5N3mQRTHo.lN3wZTOlKLAHW7xq
Title: Using Deep Learning to Improve the Transparency of Deep Learning
Abstract: As AI systems become increasingly ubiquitous, understanding their decision-making processes is crucial. However, modern machine learning models are black boxes that provide limited insight into which input features or training examples influence their predictions. This talk presents several methods to provide actionable transparency into ML models, including adaptive feature selection, feature attribution and data valuation. A commonality between these methods is the use of deep learning to facilitate otherwise intractable algorithms, through a general technique known as “amortized optimization.”
Bio: Ian Covert is a postdoc at Stanford University in the labs of James Zou and Tatsunori Hashimoto. He received his PhD from the University of Washington in the lab of Su-In Lee, where his work focused on developing methods for transparent machine learning. Previously, he was also a student researcher at Google Brain and Citadel Securities.
Location: Engineering Research Building (1500 Engineering Drive) Room 106

April 5, 2024: Liu Yang (UW-Madison)

Speaker: Liu Yang (UW-Madison)
Time: April 5, 2024, 12:30 PM – 1:30 PM CT
Recording Link: https://youtu.be/3et5AZ_j8bA
Title: Looped Transformers are Better at Learning Learning Algorithms
Abstract: Transformers have demonstrated effectiveness in in-context solving data-fitting problems from various (latent) models, as reported by Garg et al. However, the absence of an inherent iterative structure in the transformer architecture presents a challenge in emulating the iterative algorithms, which are commonly employed in traditional machine learning methods. To address this, we propose the utilization of looped transformer architecture and its associated training methodology, with the aim of incorporating iterative characteristics into the transformer architectures. Experimental results suggest that the looped transformer achieves performance comparable to the standard transformer in solving various data-fitting problems, while utilizing less than 10% of the parameter count.
Bio: Liu Yang is a graduate student in computer science at the University of Wisconsin-Madison, co-advised by Prof. Nowak, Prof. Papailiopoulos, and Prof. Lee. Her current research interests are centered on the in-context learnability of large language models.
Location: Engineering Research Building (1500 Engineering Drive) Room 106

Mar 22, 2024: Ekin Akyürek (MIT)

Speaker: Ekin Akyürek (MIT)
Time: Mar 22, 2024, 12:30 PM – 1:30 PM CT
Recording Link: https://youtu.be/CXLL_-UFZsQ
Title: In-context Language Learning and N-gram Heads
Abstract: Language models exhibit in-context learning (ICL), the ability to learn new tasks from just a few examples of prompts in the context presented to them. Prior work has studied ICL through the lens of simple learning problems like linear regression, but there remains a gap in understanding the rich language generation capabilities exhibited in real language models. In this talk, I will discuss a new model problem for understanding ICL — in-context learning of (formal) languages (ICLL). In ICLL, language models are presented with example strings from a probabilistic language and must generate additional strings from that same language. Focusing on regular languages sampled from random finite automata, we study the behavior of a variety of sequence models on the ICLL task. We show that Transformers significantly outperform recurrent and convolutional models on these tasks. Moreover, we find evidence that their ability to do so relies on specialized “n-gram heads” (higher-order variants of induction heads) that compute input-conditional next-token distributions. Finally, we show that hard-wiring these heads into neural models improves performance not just on formal language learning, but modeling of real natural-language text — improving the perplexity of 340M-parameter models by up to 1.14 points (6.7%) on the SlimPajama dataset.
Bio: Ekin Akyürek is a PhD candidate at MIT CSAIL studying generalization in neural sequence models, advised by Jacob Andreas. He wants to understand how learning and algorithmic reasoning work in LMs, and use this understanding to design better models. His early work aimed to enable neural sequence models to compositionally generalize new tasks by a small number of demonstrations. His recent work contributed to understanding of how Transformers do in-context learning in model problems like linear regression and formal language learning. He is a recipient of Amazon Alexa Fellowship, and area chair paper award in ACL 2023. He received his Bachelor’s degrees in Electrical & Electronics Engineering and in Physics from Koç University in 2019.
Location: Engineering Research Building (1500 Engineering Drive) Room 106

Mar 15, 2024: Mu Cai (UW-Madison)

Speaker: Mu Cai (UW-Madison)
Time: May 15, 2024, 12:30 PM – 1:30 PM CT
Recording: https://www.youtube.com/watch?v=6WbUVOAtFE8
Title: Compositional Visual-Linguistic Models Via Visual Markers and Counterfactual Examples
Abstract: Vision-Language models like CLIP, GPT-4, and LLaVA have made significant advancements in visual recognition and reasoning, yet they still struggle with understanding region-level visual information and complex linguistic concepts such as distinguishing between “black shirt and blue pants” and “blue shirts and black pants”. Our research indicates that compositionality can enhance these models’ capabilities. By using visually marked, overlaid images, our refined method can reach state-of-the-art performance levels in region-level understanding. Moreover, we found that using counterfactual reasoning to curate compositional images and captions can enhance the model’s understanding of complex object relationships. We also demonstrate that visual markers can be represented as Scalable Vector Graphics (SVG), allowing visual information to be textually represented, thereby eliminating the need for a visual encoder when building Vision-Language models.
Bio: Mu Cai is a fourth-year Ph.D. student in the Computer Sciences Department at the University of Wisconsin-Madison, advised by Prof. Yong Jae Lee. His research interest lies in the intersection of deep learning and computer vision. He is especially interested in multimodal generative models, video and 3D understanding.
Location: Engineering Research Building (1500 Engineering Drive) Room 106

Mar 8, 2024: Haotian Liu (UW-Madison)

Speaker: Haotian Liu (UW-Madison)
Title: Steerable Visual Intelligence
Time: Mar 8, 2024, 12:30 PM – 1:30 PM CT
Recording Link: https://www.youtube.com/watch?v=To8Gzo3IPlc
Abstract: Understanding and reasoning about the visual world based on human instructions has long been a challenging problem. The previous paradigm, which involved training supervised models on many sub-tasks and unifying them into a large system, was not streamlined and offered limited steerability. In this talk, I will introduce two of my recent works, REACT and the LLaVA-series, that approach this problem by enhancing customizability using retrieval, and bringing improved steerability with natural language instructions. We demonstrate that REACT and the LLaVA-series offer a promising path for building customizable, large multimodal models that follow human intent at an affordable cost. Finally, I will present several future directions I am eager to explore in building next-generation steerable visual intelligence systems.
Bio: Haotian Liu is a final-year PhD student at the University of Wisconsin-Madison, advised by Prof. Yong Jae Lee. His research primarily focuses on computer vision and vision-language multimodal learning. His recent work has centered on building customizable and steerable large models that follow humans’ intent, including instruction-following multimodal models, controllable image generation, and customizable foundation models. He co-organized the 1st and 2nd Workshop on Computer Vision in the Wild in ECCV 2022 and CVPR 2023.
Location: Engineering Research Building (1500 Engineering Drive) Room 514

Mar 1, 2024: Fanghui Liu (University of Warwick)

Speaker: Fanghui Liu (University of Warwick)
Title: Can overfitted deep neural networks in adversarial training generalize? – An approximation viewpoint
Time: Mar 1, 2024, 12:30 PM – 1:30 PM CT
Recording Link: https://www.youtube.com/watch?v=6mqGmNRo7Po
Abstract: In this talk, I will discuss whether overfitted DNNs in adversarial training can generalize from an approximation viewpoint. We prove by construction the existence of infinitely many adversarial training classifiers on over-parameterized DNNs that obtain arbitrarily small adversarial training error (overfitting), whereas achieving good robust generalization error under certain conditions concerning the data quality, well separated, and perturbation level. This construction is optimal and thus points out the fundamental limits of DNNs under adversarial training with statistical guarantees. Part of this talk comes from our recent work [https://arxiv.org/abs/2401.13624].
Bio: Fanghui Liu is currently an assistant professor at University of Warwick, UK. His research interests focus on machine learning theory as well as theoretical-oriented applications. For his work on learning theory and cooperation, he was chosen for AAAI New Faculty Highlights 2024, Rising Star in AI (KAUST 2023) and presented two tutorials at ICASSP 2023 and CVPR 2023. Prior to his current position, Fanghui worked as a postdoc researcher at EPFL, Switzerland and KU Leuven, Belgium, respectively. He received his PhD from Shanghai Jiao Tong University, China in 2019 and bachelor’s degree from Harbin Institute of Technology in 2014.
Location: Engineering Research Building (1500 Engineering Drive) Room 514

Feb 23, 2024: Yuchen Zeng (UW-Madison)

Title: Can MLLMs Perform Text-to-Image In-Context Learning?
Speaker: Yuchen Zeng (UW-Madison)
Time: Feb 23, 2024, 12:45 PM – 1:45 PM CT
Recording Link: https://www.youtube.com/watch?v=GNLkneJWnEE&t=1s
Paper Link: https://arxiv.org/abs/2402.01293
Abstract: The evolution from Large Language Models (LLMs) to Multimodal Large Language Models (MLLMs) has spurred research into extending In-Context Learning (ICL) to its multimodal counterpart. Existing such studies have primarily concentrated on image-to-text ICL. However, the Text-to-Image ICL (T2I-ICL), with its unique characteristics and potential applications, remains underexplored. To address this gap, we formally define the task of T2I-ICL and present CoBSAT, the first T2I-ICL benchmark dataset, encompassing ten tasks. Utilizing our dataset to benchmark six state-of-the-art MLLMs, we uncover considerable difficulties MLLMs encounter in solving T2I-ICL. We identify the primary challenges as the inherent complexity of multimodality and image generation. To overcome these challenges, we explore strategies like fine-tuning and Chain-of-Thought prompting, demonstrating notable improvements. Our code and dataset are available at this https URL.
Bio: Yuchen is a graduate student pursuing a PhD’s degree in the Department of Computer Science at the University of Wisconsin-Madison. She is advised by Prof. Kangwook Lee. Her current research interest is centering on large language models.
Location: Engineering Research Building (1550 Engineering Drive) Room 106

Nov 17, 2023: Harit Vishwakarma (UW-Madison)

Title: Promises and Pitfalls of Threshold-based Auto-labeling
Speaker: Harit Vishwakarma (UW-Madison)
Time: Nov 17, 2023, 12:30 PM CT
Recording Link: https://www.youtube.com/watch?v=WYUKyBy2yRs
Paper Link: https://openreview.net/pdf?id=RUCFAKNDb2
Abstract: Creating large-scale high-quality labeled datasets is a major bottleneck in supervised learning. Threshold-based auto-labeling (TBAL), where validation data obtained from humans is used to find a confidence threshold above which the data is machine-labeled, reduces reliance on manual annotation. TBAL is emerging as a widely-used solution in practice. Given the long shelf-life and diverse usage of the resulting datasets, understanding when the data obtained by such auto-labeling systems can be relied on is crucial. We analyze TBAL systems and derive sample complexity bounds on the amount of human-labeled validation data required for guaranteeing the quality of machine-labeled data. Our results provide two crucial insights. First, reasonable chunks of unlabeled data can be automatically and accurately labeled by seemingly bad models. Second, a hidden downside of TBAL systems is potentially prohibitive validation data usage. Together, these insights describe the promise and pitfalls of using such systems.
Bio: Harit is a CS Ph.D. student at UW-Madison, co-advised by Fred Sala and Ramya Vinayak. His research interests lie in the foundations of data-centric and safe machine learning. Before starting his Ph.D. he obtained a master’s from IISc, Bangalore, and worked at IBM Research, India for 3 years.
Location: Engineering Centers Building (1550 Engineering Drive) M1059 (M floor)

Nov 10, 2023: John William Thickstun (Stanford)

Title: Anticipation and the Anticipatory Music Transformer
Speaker: John William Thickstun (Stanford)
Recording Link:https://www.youtube.com/watch?v=b_vFfssGJNI
Time: Nov 10, 2023, 12:30 PM – 1:30 PM
Location: Engineering Centers Building (1550 Engineering Drive) M1059 (M floor)
Abstract: Imagine you are given a melody, and asked to compose a harmonizing accompaniment; this melody is an example of a temporal point process: a sequence of (musical) events that arrive stochastically at points in time. An accompaniment is also a temporal point process, one that is tightly correlated with—but asynchronous from—the melody. Motivated by this example, we are interested in constructing generative models of a temporal point process (the event process) that can be conditioned on realizations of a second, correlated point process (the control process). We achieve this using anticipation: a new method of controllable generation that interleaves sequences of events and controls, such that controls appear following stopping times in the event sequence. We apply anticipation to construct anticipatory infilling models of music; these models unlock new control capabilities for music generation, without sacrificing the performance of unconditional generation.
Bio: John Thickstun is a Postdoc at Stanford University, working with Percy Liang and members of the Center for Research on Foundation Models. John’s work has been recognized by an NSF Graduate Fellowship, a Qualcomm Innovation Fellowship, and outstanding paper awards at the Neurips and ACL conferences. His current research focus is on advancing the capabilities, controllability, and governance of generative models.

Nov 3, 2023: Daniel Beaglehole (UC San Diego)

Title: Mechanism of feature learning in neural networks
Speaker: Daniel Beaglehole (UC San Diego)
Time: Nov 3, 2023, 12:30 PM – 1:30 PM CT (10:30 AM – 11:30 AM PT)
Location: Engineering Centers Building (1550 Engineering Drive) M1059 (M floor)
Recording: https://www.youtube.com/watch?v=QUAWfCkOuTw&t=122s
Abstract: Understanding how neural networks select features, or relevant patterns in data, for prediction is necessary for their reliable use in several technological and scientific applications. In my talk, I will present a unifying mechanism that characterizes feature learning in general neural network architectures. We posit the Deep Neural Feature Ansatz, which states that features selected by neural networks are captured by a mathematical operator known as average gradient outer product (AGOP). Moreover, we demonstrate that AGOP is a universal, backpropagation-free feature learning mechanism that enables feature learning in non-feature learning models such as kernel machines.
Bio: Daniel is a PhD student in the CSE department at UC San Diego in the machine learning and theory groups. He is advised by Professor Mikhail Belkin and Sanjoy Dasgupta. Prior to his PhD, he completed an MS in computer science at Columbia, where he researched under the mentorship of Professor Alexandr Andoni.

Oct 27, 2023: Jifan Zhang (UW-Madison)

Title: LabelBench: A Comprehensive Framework for Benchmarking Adaptive Label-Efficient Learning
Speaker: Jifan Zhang (UW-Madison)
Time: Oct 27, 2023, 12:30 PM – 1:30 PM CT
Location: Engineering Centers Building (1550 Engineering Drive) M1059 (M floor)
Recording link: https://www.youtube.com/watch?v=aaiCMSVwkvY
Abstract: Labeled data are critical to modern machine learning applications, but obtaining labels can be expensive. To mitigate this cost, machine learning methods, such as transfer learning, semi-supervised learning and active learning, aim to be label-efficient: achieving high predictive performance from relatively few labeled examples. While obtaining the best label-efficiency in practice often requires combinations of these techniques, existing benchmark and evaluation frameworks do not capture a concerted combination of all such techniques. This paper addresses this deficiency by introducing LabelBench, a new computationally-efficient framework for joint evaluation of multiple label-efficient learning techniques. As an application of LabelBench, we introduce a novel benchmark of state-of-the-art active learning methods in combination with semi-supervised learning for fine-tuning pretrained vision transformers. Our benchmark demonstrates better label-efficiencies than previously reported in active learning. LabelBench’s modular codebase is open-sourced for the broader community to contribute label-efficient learning methods and benchmarks. The repository can be found at: this URL.
Bio: Jifan is a PhD student at UW-Madison working with Prof. Robert Nowak. His work focuses on label-efficient learning and its modern application to large-scale deep learning systems. He is also generally interested in human-in-the-loop learning and cross-disciplinary research that apply these methods in the real world.

Oct 20, 2023: Giannis Daras (UT-Austin)

Title: Consistent Diffusion Models and Learning from Corrupted Data with Ambient Diffusion
Speaker: Giannis Daras (UT-Austin)
Time: Oct 20, 2023, 12:30 PM – 1:30 PM CT
Location: ECB M1059 (M floor)
Recoding: https://uwmadison.zoom.us/rec/share/mZPbg7r70Xvs37rFN9gCVrtJ_8CVLmbSMa4O-8KYTvcJBi_qW6tHA9IvEVO08IRB.l0MbeY1QeT4819pb
Abstract: In this talk, we will explore recent algorithmic innovations that generalize diffusion models and tackle the limitations of the standard formulation. In the first part of the presentation, I will introduce the problem of diffusion sampling drift that arises from the recursive nature of the generation process and the propagation of errors due to imperfect score-matching. I will show how to mitigate this problem by enforcing a consistency property (NeurIPS 2023) which states that predictions of the model on its own generated data are consistent across time. Empirically our training objective yields state-of-the-art results for conditional and unconditional generation. Theoretically, we show that if the score is learned perfectly on some non-drifted points and if the consistency property is enforced everywhere, then the score is learned accurately everywhere. In the second part of the talk, we will focus on the problem of learning a generative model using only lossy measurements. This problem arises in scientific applications where access to uncorrupted samples is impossible or expensive to acquire. I will present Ambient Diffusion (NeurIPS 2023), the first diffusion-based framework that can learn an unknown distribution using only highly corrupted samples. Ambient Diffusion can be used to finetune a pre-trained foundation model or train a model from scratch.
Bio: Giannis Daras is a fourth-year Ph.D. student in the Computer Science Department of The University of Texas at Austin, supervised by Prof. Alexandros Dimakis. Giannis is also a Research Scientist Intern at NVIDIA. His research is supported by the Bodossaki Fellowship, the Onassis Fellowship, and the Leventis Fellowship. Giannis is doing research on generative modeling. He is interested in learning generative models from lossy measurements and developing algorithms to use generative priors for solving inverse problems.

Oct 6, 2023: Joe Shenouda (UW-Madison)

Time: Oct 6, 2023, 12:30 PM CT
Speaker: Joe Shenouda (UW-Madison)
Recording: https://www.youtube.com/watch?v=3nj1sg4rMkE
Title: Vector-Valued Variation Spaces and Width Bounds for Deep Neural Networks: Insights on Weight Decay Regularization
Abstract: Deep neural networks (DNNs) trained to minimize a loss term plus the sum of squared weights via gradient descent corresponds to the common approach of training with weight decay. In this talk I will discuss our new insights into this common learning framework. We characterize the kinds of functions learned by training with weight decay for multi-output (vector-valued) ReLU neural networks. This extends previous characterizations that were limited to single-output (scalar-valued) networks. This characterization requires the definition of a new class of neural function spaces that we call vector-valued variation (VV) spaces. We prove that neural networks (NNs) are optimal solutions to learning problems posed over VV spaces via a novel representer theorem. This new representer theorem shows that solutions to these learning problems exist as vector-valued neural networks with widths bounded in terms of the number of training data. Next, via a novel connection to the multi-task lasso problem, we derive new and tighter bounds on the widths of homogeneous layers in DNNs. The bounds are determined by the effective dimensions of the training data embeddings in/out of the layers. This result sheds new light on the architectural requirements for DNNs. Finally, the connection to the multi-task lasso problem suggests a new approach to compressing pre-trained networks.
Bio: Joe Shenouda is a Ph.D. student at the University of Wisconsin-Madison, advised by Professors Kangwook Lee and Rob Nowak. He completed his B.S. in Electrical and Computer Engineering at Rutgers University. His research interests include signal processing, machine learning, and deep learning.

Sep 22, 2023: Grigorios Chrysos (UW-Madison)

Time: Sep 22, 2023, 12:30 PM CT
Speaker: Grigorios Chrysos (UW-Madison)
link: [recording]
Title: Designing polynomial expansions in deep neural networks
Abstract: The meticulous design of neural architectures is a critical component behind the success of deep networks. Polynomial Networks (PNs) enable a new network design that treats a network as a high-degree polynomial expansion of the input. In the first part of the talk, we identify how polynomial expansions exist inside popular deep networks, such as non-local networks. Then, we exhibit how we can design different PNs and how we can estimate their unknown parameters. In the last part of the talk, we extend the PNs beyond the single-input variable polynomial expansions. Having multiple (possibly diverse) inputs is critical for conditional generation tasks. PNs can be extended to tackle such conditional generation tasks and we showcase how they can be used for recovering missing attribute combinations from the training set, e.g. in image generation.
Bio: Prof. Grigorios Chrysos is a new Assistant Professor at UW-Madison. His research mainly focuses on reliable machine learning and the design and study of expressive models.

April 28, 2023: Seiyun Shin (UIUC)

Time: Apr 28, 2023, 12:30 PM CT
Speaker: Seiyun Shin (UIUC)
link: [paper][zoom]
Title: Adaptive Power Method: Eigenvector Estimation from Sampled Data
Abstract: Computing the dominant eigenvectors of a matrix $A$ has many applications, such as principal component analysis, spectral embedding, and PageRank. However, in general, this task relies on the complete knowledge of the matrix $A$, which can be too large to store or even infeasible to observe in many applications, e.g., large-scale social networks. Thus, a natural question is how to accurately estimate the eigenvectors of $A$ when only partial observations can be made by sampling entries from $A$. To this end, we propose the Adaptive Power Method (APM), a variant of the well-known power method. At each power iteration, APM adaptively selects a subset of the entries of $A$ to observe based on the current estimate of the top eigenvector. We show that APM can estimate the dominant eigenvector(s) of A with squared error at most $\epsilon$ by observing roughly $O(n\epsilon^{-2}\log^2(n/\epsilon))$ entries of an $n \times n$ matrix. We present empirical results for the problem of eigenvector centrality computation on two real-world graphs and show that APM significantly outperforms a non-adaptive estimation algorithm using the same number of observations. Furthermore, in the context of eigenvector centrality, APM can also adaptively allocate the observation budget to selectively refine the estimate of nodes with high centrality scores in the graph.
Bio: Seiyun Shin is a Ph.D. student at UIUC ECE, advised by Professors Ilan Shomorony and Han Zhao. His research interest lies in theoretical machine learning, establishing sample complexity in GraphML (e.g., graph neural networks), instance-adaptive algorithms, and multi-armed bandits.

Apr 21, 2023: Yihan Du (Tsinghua)

Time: Apr 21, 2023, 9:30 AM CT (10:30 PM Beijing Time)
Speaker: Yihan Du (Tsinghua)
link: [zoom]
Title: Risk-aware Online Decision Making
Abstract: Classic online decision-making (e.g., bandit and reinforcement learning) algorithms focus mainly on the risk-neutral criterion, i.e., maximizing the expected cumulative reward, and can fail to avoid rare but disastrous situations. As a result, existing online decision-making algorithms cannot be directly applied to tackle real-world risk-sensitive tasks, such as autonomous driving and clinical treatment planning, where policies that ensure low risk of getting into catastrophic situations are strongly preferred. Motivated by the above facts, we study risk-aware online decision making. Specifically, we investigate bandit and reinforcement learning (RL) problems equipped with risk-sensitive criteria, e.g., mean-covariance and conditional value-at-risk (CVaR), which aim to control the risk during decision making and prevent the agent from getting into catastrophic states. We design efficient risk-aware bandit and RL algorithms with rigorous theoretical guarantees.
Bio: Yihan Du is a Ph.D. student at the Institute for Interdisciplinary Information Sciences (headed by Prof. Andrew Chi-Chih Yao) of Tsinghua University, advised by Prof. Longbo Huang. She was a research intern at Microsoft Research Asia supervised by Dr. Wei Chen (IEEE Fellow, Director of Microsoft Research Asia Theory Center), and a visiting Ph.D. student at Cornell University supervised by Prof. Wen Sun. She is broadly interested in the area of machine learning, with emphases on online learning, reinforcement learning and representation learning.

April 14, 2023: Sanket Vaibhav Mehta (CMU)

Time: Apr 14, 2023, 12:30 PM CT
Speaker: Sanket Vaibhav Mehta (CMU)
link: [Video]
Title: Understanding the Role of Initialization and Training Dynamics in Lifelong Learning
Abstract: The lifelong learning paradigm in machine learning is an attractive alternative to the more prominent isolated learning scheme not only due to its resemblance to biological learning, but also its potential to reduce energy waste by obviating excessive model re-training. A key challenge to this paradigm is the phenomenon of catastrophic forgetting. With the increasing popularity and success of pre-trained models in machine learning, we pose the question: What role does pre-training play in lifelong learning, specifically with respect to catastrophic forgetting? We investigate existing methods in the context of large, pre-trained models and evaluate their performance on a variety of text and image classification tasks. Across all settings, we observe that generic pre-training implicitly alleviates the effects of catastrophic forgetting when learning multiple tasks sequentially compared to randomly initialized models. We then further investigate why pre-training alleviates forgetting in this setting. We study this phenomenon by analyzing the loss landscape, finding that pre-trained weights appear to ease forgetting by leading to wider minima. Based on this insight, we propose jointly optimizing for current task loss and loss basin sharpness in order to explicitly encourage wider basins during sequential fine-tuning. We show that this optimization approach leads to performance comparable to the state-of-the-art in task-sequential continual learning across multiple settings, without retaining a memory that scales in size with the number of tasks.
Bio: Sanket Vaibhav Mehta is a Ph.D. student at LTI, CMU advised by Emma Strubell. He is primarily interested in machine learning, natural language processing, and optimization, with a particular focus on continuous learning from limited labeled data, multiple tasks, and non-stationary data streams. His doctoral thesis centers on the design of efficient lifelong learning systems that can address the issue of catastrophic forgetting of previously learned knowledge while enabling incremental learning of new tasks.

Mar 31, 2023: Zhenmei Shi (UW-Madison)

Time: Mar 31, 2023, 12:30 PM CT
Speaker: Zhenmei Shi (UW-Madison)
Link: [Video]
Title: The Trade-off between Universality and Label Efficiency of Representations from Contrastive Learning – ICLR 2023 Spotlight
Abstract: Pre-trained representations (a.k.a. foundation models) have recently become a prevalent learning paradigm, where one first pre-trains a representation using large-scale unlabeled data and then learns simple classifiers on top of the representation using small labeled data from the downstream tasks. There are two key desiderata for the representation: label efficiency (the ability to learn an accurate classifier on top of the representation with a small amount of labeled data) and universality (usefulness across a wide range of downstream tasks). In this paper, we focus on one of the most popular instantiations of this paradigm: contrastive learning with linear probing, i.e., learning a linear predictor on the representation pre-trained by contrastive learning. We show that there exists a trade-off between the two desiderata so that one may not be able to achieve both simultaneously.Specifically, we provide analysis using a theoretical data model and show that, while more diverse pre-training data result in more diverse features for different tasks (improving universality), it puts less emphasis on task-specific features, giving rise to larger sample complexity for down-stream supervised tasks, and thus worse prediction performance. Guided by this analysis, we propose a contrastive regularization method to improve the trade-off. We validate our analysis and method empirically with systematic experiments using real-world datasets and foundation models.
Short Bio: Zhenmei Shi is a Ph.D. student in the UW-Madison Computer Sciences department, advised by Prof. Yingyu Liang. His research interest is in Understanding and Improving Deep Learning, particularly feature learning (representation learning) in deep neural networks.

Mar 24, 2023: Dheeraj Nagaraj (Google)

Time: Mar 24, 2023, 12:30 PM CT
Speaker: Dheeraj Nagaraj (Google)
Links: [ArXiv][Recording]
Title: Utilizing the CLT Structure in Stochastic Approximations of Sampling Algorithms
Abstract: We consider stochastic approximations of sampling algorithms like Langevin Monte Carlo and Interacting Particle Dynamics with random batches. These are heavily deployed in Bayesian inference, and the physical sciences. The noise induced by random batches is approximately Gaussian (due to the Central Limit Theorem) while the Brownian motion driving the algorithm is exactly Gaussian. We utilize this structure to provide improved guarantees for sampling algorithms under significantly weaker assumptions. We also propose covariance correction, which rescales the brownian motion to approximately remove the random batch error. We show that covariance corrected algorithms enjoy even better convergence.
Bio: Dheeraj Nagaraj is a research scientist at Google Research. He works on topics in stochastic optimization, applied probability and machine learning. Currently, he is interested in questions regarding systems which exhibit spatio-temporal dependence like random graphs, sampling algorithms and multi-agent reinforcement learning. He obtained his PhD from the Lab for Information and Decision Systems at MIT in 2021, advised by Guy Bresler.

Feb 3, 2023: Kwangjun Ahn (MIT)

Time: Feb 3, 2023, 12:30 PM CT
Speaker: Kwangjun Ahn (MIT)
Title: Learning threshold neurons via edge of stability
Abstract: Existing analyses of neural network training often operate under the unrealistic assumption of an extremely small learning rate. This lies in stark contrast to practical wisdom and empirical studies, such as the work of J. Cohen et al. (ICLR 2021), which exhibit startling new phenomena (the “edge of stability” or “unstable convergence”) and potential benefits for generalization in the large learning rate regime. Despite a flurry of recent works on this topic, however, the latter effect is still poorly understood. In this paper, we take a step towards understanding genuinely non-convex training dynamics with large learning rates by performing a detailed analysis of gradient descent for simplified models of two-layer neural networks. For these models, we prove the edge of stability phenomenon and discover a sharp phase transition for the step size below which the neural network fails to learn “threshold-like” neurons (i.e., neurons with a non-zero first-layer bias). This elucidates one possible mechanism by which the edge of stability can in fact lead to better generalization, as threshold neurons are basic building blocks with a useful inductive bias for many tasks.
Short Bio: Kwangjun Ahn is a Ph.D. student at MIT EECS, advised by Professors Suvrit Sra and Ali Jadbabaie. His research interest lies in Machine Learning, Optimization, and Learning for Control.

Dec 9, 2022: Liu Yang (UW-Madison)

Time: Dec 9, 2022, 12:30 PM CT
Speaker: Liu Yang (UW-Madison)
Title: A Better Way to Decay: Proximal Gradient Training Algorithms for Neural Nets
Abstract: Weight decay is one of the most widely used forms of regularization in deep learning, and has been shown to improve generalization and robustness. The optimization objective driving weight decay is a sum of losses plus a term proportional to the sum of squared weights. This paper argues that stochastic gradient descent (SGD) may be an inefficient algorithm for this objective. For neural networks with ReLU activations, solutions to the weight decay objective are equivalent to those of a different objective in which the regularization term is instead a sum of products of ℓ2 (not squared) norms of the input and output weights associated with each ReLU. This alternative (and effectively equivalent) regularization suggests a novel proximal gradient algorithm for network training. Theory and experiments support the new training approach, showing that it can converge much faster to the sparse solutions it shares with standard weight decay training.
Short bio: Liu Yang is a graduate student in UW Madison CS department, co-advised by Prof. Nowak, Prof. Papailiopoulos, and Prof. Lee. Her research interests lie in the intersection of machine learning and deep learning.

Nov 25, 2022: Thanksgiving

No Events.

Nov 11, 2022: Dohyun Kwon (UW-Madison)

Time: Nov 11, 2022, 12:30 PM CT
Speaker: Dohyun Kwon (UW-Madison)
Title: Convergence of score-based generative modeling and related problems
Links: [Video]
Abstract: Score-based generative models and diffusion probabilistic models have recently demonstrated remarkable performance in numerous applications. However, it remains unclear whether the distance between the given data distribution and the model distribution is sufficiently small. In this talk, we obtain the upper bound of the Wasserstein distance between two distributions. Under suitable assumptions, this theoretically guarantees that the framework approximates data distributions in the space of probability measures equipped with the Wasserstein distance. Possible variants from the theoretical analysis and related open questions will be discussed. This talk is based on joint work with Ying Fan and Kangwook Lee (UW-Madison).
Short bio: Dohyun Kwon is a Van Vleck Visiting Assistant Professor in the Department of Mathematics at the University of Wisconsin-Madison. His research interests are partial differential equations and their applications to physics and optimization. Before joining UW-Madison, he obtained his Ph. D. in mathematics from UCLA in June 2020 under the advisement of Professor Inwon Kim. He received B.S. in Mathematical Sciences from Seoul National University in South Korea. He is a recipient of the Kwanjeong Educational Foundation Fellowship for doctoral study and the Beckenbach Award from UCLA.

Oct 28, 2022: Jiefeng Chen (UW-Madison)

Time: Oct 28, 2022, 12:30 PM CT
Speaker: Jiefeng Chen (UW-Madison)
Title: Detecting Errors and Estimating Accuracy on Unlabeled Data with Self-training Ensembles
Links: [Video]
Abstract: When a deep learning model is deployed in the wild, it can encounter test data drawn from distributions different from the training data distribution and suffer a drop in performance. For safe deployment, it is essential to estimate the accuracy of the pre-trained model on the test data. However, the labels for the test inputs are usually not immediately available in practice, and obtaining them can be expensive. This observation leads to two challenging tasks: (1) unsupervised accuracy estimation, which aims to estimate the accuracy of a pre-trained classifier on a set of unlabeled test inputs; (2) error detection, which aims to identify misclassified test inputs. In this talk, I will introduce a principled and practically effective framework called self-training ensembles that simultaneously addresses these two tasks.
Short bio: Jiefeng Chen is a Ph.D. student in the Computer Science Department at the University of Wisconsin-Madison. He is co-advised by Prof. Yingyu Liang and Prof. Somesh Jha. His research interest is trustworthy machine learning, including adversarial robustness, handling distribution shifts and out-of-distribution detection, etc.

Oct. 14, 2022: Yeonwoo Jeong (SNU)

Time: Oct 14, 2022, 12:30 PM CT
Speaker: Yeonwoo Jeong (SNU)
Links:[Video]
Title: Combinatorial optimization for machine learning
Abstract: Discrete variables are often more efficient to define real-world problems than continuous variables. However, neural networks and their gradient descent style training are basically designed to represent continuous representations. Therefore, training neural networks with discrete variables is often a difficult problem. Previous works proposed Gumbel-soft max trick, straight-through estimator, and REINFORCE trick to handle discrete variables. In this seminar, I propose another approach to handling discrete variables via combinatorial optimization when training neural networks. This seminar handles several domains: metric learning, disentanglement, and pruning.
Short bio: Yeonwoo is a fifth-year Ph.D. student in Computer Science and Engineering at Seoul National University, advised by Hyun Oh Song. He previously graduated from Seoul National University in 2018 with a B.S. in electrical and computer engineering. His research interests are combinatorial optimization and machine learning. Also, He is interested in Blockchain and cryptocurrency.

Sep. 2, 2022: Andrew Lowy (USC)

Title: Private Federated Learning Without a Trusted Server
Speaker: Andrew Lowy (USC)
Time: Sep. 2, 2022, 12:30 pm CT
Links: [Video]
Abstract: This work considers federated learning in the absence of a trustworthy server/clients. In this setting, each client needs to ensure the privacy of its own data, even if the server or other clients act adversarially. This requirement motivates the study of client-level local differential privacy (CL-LDP), an intermediate privacy notion between the “no trust” classical local differential privacy (LDP) and “high trust” central differential privacy (CDP) models. We provide tight (up to logarithms) upper and lower bounds for CL-LDP convex/strongly convex federated stochastic optimization with homogeneous (i.i.d.) client data. The CL-LDP federated learning rates sit between the rates for classical LDP and CDP, and match the optimal statistical rates in certain practical parameter regimes. Remarkably, we show that similar rates are attainable for smooth losses with arbitrary heterogeneous client data distributions, via a linear-time accelerated CL-LDP algorithm. With a secure “shuffler” to anonymize client reports (but without the presence of a trusted server), our algorithm attains the optimal CDP rates for stochastic convex/strongly convex optimization. Time permitting, we will also discuss CL-LDP federated learning with non-convex loss functions.
Short Bio: Andrew Lowy is a PhD student in Applied Math at University of Southern California, advised by Meisam Razaviyayn. Prior to attending USC, Andrew received his B.A. in public policy and economics at Princeton University, and completed a post-bac in mathematics at Columbia University. His research interests lie in the areas of machine learning and optimization, primarily in the intersection of these areas with differential privacy and algorithmic fairness. He especially enjoys understanding fundamental limits and developing scalable algorithms.

Aug. 19, 2022: Peihao Wang (UT-Austin)

Aug. 19, 2022: Peihao Wang (UT-Austin)
Title: Signal Processing for Implicit Neural Representation
Time: Aug 19, 12:30 pm CT
Links: [Video][Paper]
Abstract: Implicit Neural Representations (INR) encoding continuous multi-media data via multi-layer perceptrons has shown undebatable promise in various computer vision tasks. Despite many successful applications, editing and processing an INR remains intractable as signals are represented by agnostic parameters of a neural network. Existing works manipulate such continuous representations via processing on their discretized instance, which breaks down the compactness and continuous nature of INR. In this work, we present a pilot study on the question: how to directly modify an INR without explicit decoding? We answer this question by proposing an implicit neural signal processing network, dubbed INSP-Net, via differential operators on INR. Our key insight is that spatial gradients of neural networks can be computed analytically and invariant to translation, while mathematically we show that any continuous convolution filter can be uniformly approximated by a linear combination of high-order differential operators. With these two knobs, we instantiate the INR signal operator as a composition of computational graphs corresponding to the high-order derivatives, where the weighting parameters can be either handcrafted or data-driven learned. Based on our proposed INSP-Net, we further build the first Convolutional Neural Network (CNN) that implicitly runs on INRs, named INSP-ConvNet. Our experiments validate the expressiveness of INSP-Net and INSP-ConvNet in fitting low-level image processing kernels (e.g. edge detection, blurring, deblurring, denoising, inpainting) as well as for high-level tasks on implicit fields such as image classification.
Bio: Peihao Wang is a PhD student in Electrical and Computer Engineering at The University of Texas at Austin advised by Prof. Atlas Wang. Prior to that, he obtained bachelor’s degree from ShanghaiTech University where he worked with Prof. Jingyi Yu and Prof. Manolis Tsakiris. His research interests lie in theories and applications of graph representation learning, 3D vision/geometry/graphics, and computational photography.

Aug. 5, 2022: Sixu Li (UW-Madison)

Speaker: Sixu Li (UW-Madison)
Time: Aug 5th, 12:30 pm CT
Links: [Video]
Title: Wasserstein Barycenter-based Model Fusion and Connections to Loss Landscapes of Neural Networks
Abstract: We propose a unified mathematical framework for neural network (NN) model fusion based on the concepts of Wasserstein barycenter (WB) and Gromov-Wasserstein barycenter (GWB). In our framework, the fusion occurs in a layer-wise manner and builds on an interpretation of a node in a network as a function of the layer preceding it. The versatility of our mathematical framework allows us to manage the fusion of different types of NNs, including fully connected NN, CNN, ResNet, RNN, and LSTM, in each case exploiting the specific structure of the network architecture. We present extensive numerical experiments to: 1) illustrate the strengths of our approach in relation to other model fusion methodologies and 2) reveal new insights about the loss landscapes for different types of network architectures and datasets. In particular, from a certain perspective, our findings provide new empirical evidence for recent conjectures which say that local minima found by gradient-based methods end up lying on the same basin of the loss landscape after a proper permutation of weights is applied to one of the models.
Short Bio: Sixu is a second-year PhD student in the Stats department working with Professor Nicolas Garcia Trillos. His research interests include optimal transport theory, interaction particle system, and their applications to federated learning.

July 8, 2022: Jiawei Zhang (MIT)

Speaker: Jiawei Zhang (MIT)

Time: July 8th, 2022, 12:30 pm CT (1:30 pm ET)
Link: [Slides][Video]
Title: What is a Good Metric to Study Generalization of Minimax Learners?
Abstract: Minimax optimization has served as the backbone of many machine learning (ML) problems. Although the convergence behavior of optimization algorithms has been extensively studied in the minimax settings, their generalization guarantees in {stochastic minimax optimization problems}, i.e., how the solution trained on empirical data performs on unseen testing data, have been relatively underexplored. A fundamental question remains elusive: What is a good metric to study generalization of minimax learners? In this paper, we aim to answer this question by first showing that primal risk, a universal metric to study generalization in minimization problems, {which has also been adopted recently to study generalization in minimax ones,} fails in simple examples. We thus propose a new metric to study generalization of minimax learners: the primal gap, {defined as the difference between the primal risk and its minimum over all models,} to circumvent the issues. Next, we derive generalization {error} bounds for the primal gap in nonconvex-concave settings. As byproducts of our analysis, we also solve two open questions: establishing generalization {error} bounds for primal risk and primal-dual risk{, another existing metric that is only well-defined when the global saddle-point exists,} in the strong sense, i.e., without strong concavity or assuming that the maximization and expectation can be interchanged, while either of these assumptions was needed in the literature. Finally, we leverage this new metric to compare the generalization behavior of two popular algorithms — gradient descent-ascent (GDA) and gradient descent-max (GDMax) in stochastic minimax optimization.
Short Bio: Jiawei Zhang is a postdoc associate in Laboratory for Information & Decision Systems At MIT working with Prof. Asuman Ozdaglar. He attained his Ph.D. degree from the Chinese University of Hong Kong (Shenzhen) supervised by Prof. Tom Luo. Previously, he attained B.Sc in Mathematics (Hua Loo-Keng talent program) from the University of Science and Technology of China. His research interest includes optimization theory and algorithms for nonconvex problems, with application to machine learning, and signal processing.

July 1, 2022: Sewon Min (UWashington)

Speaker: Sewon Min (UWashington)

Time: July 1, 2022, 12:30 pm CT
Link: [Slides]
Title: Understanding and Improving Learning through Inference with Large Language Models
Abstract: Language models are capable of “learning at inference” (also referred to as in-context learning) – learning a new task by conditioning on k examples and making a prediction for a new input with no parameter updates. While impressive, models suffer from high variance and low worst-case accuracy. Moreover, we do not understand how or why in-context learning works. First, I will introduce new methods that lead to significant performance gains by reducing variance and improving worst-case accuracy. I will then show that in-context learning in fact works very differently from conventional learning: the model does not benefit from the correctly paired training data, but rather benefit from the correct specification of the independent distribution of inputs and labels. Finally, I will conclude the talk with lessons learned, limitations, and avenues for future work.
Short Bio: Sewon Min is a Ph.D. student in the Paul G. Allen School of Computer Science & Engineering at the University of Washington, and a part-time visiting researcher at Meta AI. Her research is in the area of natural language processing and machine learning. She was a co-organizer of multiple workshops and tutorials at ACL, EMNLP, NeurIPS, and AKBC, including a workshop on Machine Reading for Question Answering, a workshop on Representation Learning for NLP, a workshop on Semiparametric Methods in NLP, and a tutorial on Zero- and Few-shot Learning with Pretrained Language Models. Prior to UW, she obtained a B.S. degree in Computer Science & Engineering from Seoul National University.

Jun. 24, 2022: Yuxin Sun (UW-Madison)

Speaker: Yuxin Sun (UW-Madison)

Link: [Video]
Time: June 24, 12:30 pm CT (10:30 am PT, 1:30 pm ET)
Title: On Outlier Robust Learning of Ising Models
Abstract: We study the problem of learning high-temperature Ising models in the outlier-robust setting where a constant fraction of the samples is adversarially corrupted. We provide the first computationally efficient robust learning algorithm for this problem with near-optimal error guarantees $O(\epsilon\log(1/\epsilon))$. Our algorithm can be seen as a special case of an algorithm for robustly learning a distribution from a general exponential family. To prove its correctness for Ising models, we establish new anti-concentration results for degree-2 polynomials of Ising models that may be of independent interest.

In addition, we show that no efficient SQ algorithm with access to an $\epsilon$-corrupted ferromagnetic high-temperature Ising model can learn the model to total variation distance $o(\epsilon\log(1/\epsilon))$. Our SQ lower bound matches the error guarantees of known algorithms for this problem, providing evidence that the current upper bound for this task is the best possible. At the technical level, we develop a generic SQ lower bound for discrete high-dimensional distributions starting from low-dimensional moment matching constructions that we believe will find other applications.
Short Bio: Yuxin Sun is a Ph.D. student in the Department of Computer Sciences at the University of Wisconsin-Madison, advised by Prof. Ilias Diakonikolas. His research interests lie between theoretical computer science and machine learning with a focus on developing efficient algorithms for fundamental problems in data science with provable guarantee, especially in settings where data might be heterogeneous or contaminated.

Jun. 3, 2022: Dung (June) Thai (UMass Amherst)

Speaker: Dung (June) Thai (UMass Amherst)

Link: [Video]
Title: Case-Based Reasoning for Question Answering
Abstract: Question answering (QA) tasks often necessitate a diverse set of reasoning chains across various types of questions and data availability. Handling new combinations of inferential chains and entities while adapting to new data is challenging. Case-Based Reasoning (CBR) by design fits well for these needs. A recent study (Das et al.) has successfully adopted CBR for supervised knowledge-base QA, significantly improved performances on questions with novel entities. This talk further explores CBR methods for weak-supervised KBQA and open-domain QA.
Short Bio: Dung (June) Thai is a Ph.D. student at UMass Amherst, advised by Prof. Andrew McCallum. Her research focuses on question answering systems. Prior to the Ph.D. program, she got her BS and MS degree from Vietnam National University.

May. 13, 2022: Xuefeng Du (UW-Madison)

Links: [Video]
Title: Unknown-aware learning for object detection
Abstract: Out-of-distribution (OOD) detection is indispensable for deploying machine learning models in the wild. One of the key challenges in OOD detection is that models lack supervision signals from unknown data. Previous approaches rely on real outlier datasets for model regularization, which can be costly and sometimes infeasible to obtain in practice. In this talk, we are going to talk about our recent approaches to virtual outlier synthesis and spatial-temporal unknown distillation to generate unknowns in a convenient way. Then, we design sophisticated regularization terms to help our model achieve strong OOD detection performance on a new benchmark for object detection.
Short bio: Xuefeng is a second-year CS Ph.D. student at UW-Madison. His research focus is trustworthy machine learning, such as adversarial robustness, learning problem with label noise, and out-of-distribution detection. He is also interested in neural architecture search, graph mining, and high-level vision tasks, such as object detection and segmentation.

Apr. 29, 2022: Andrew Wagenmaker (UWashington)

Apr. 29, 2022: Andrew Wagenmaker (UWashington)
Links: [Video]
Title: Towards the True Complexity of Learning Near-Optimal Policies in Markov Decision Processes
Abstract: Recent advances in reinforcement learning (RL) demonstrate its potential for solving real-world problems. However, significant challenges remain, limiting its widespread deployment. In particular, the large sample complexity—the number of interactions with the environment—required by RL algorithms to learn near-optimal policies has proved a critical difficulty. In this work, we take steps towards addressing this challenge by seeking to understand the true sample complexity of RL, with the goal of developing algorithms that take into account problem structure to learn efficiently. We consider two directions in particular. First, we study the instance-dependent complexity of learning near-optimal policies in tabular Markov Decision Processes (MDPs). We seek to obtain sample complexities that scale not with worst-case measures such as the number of states and actions, but with quantities that depend on the inherent difficulty of each individual MDP. We propose a new instance-dependent measure of complexity, the gap-visitation complexity, and an algorithm with sample complexity scaling as the gap-visitation complexity. Furthermore, by analyzing the gap-visitation complexity on several examples, we show that existing approaches to learning near-optimal policies can have sample complexity that is arbitrarily suboptimal. Next, we study how the structure of the MDP affects the ability to generalize across reward functions. We show that, when the MDP has a particular underlying linear structure, the sample complexity of learning a near-optimal policy for an arbitrary number of reward functions is, in the worst case, no larger than the sample complexity of learning a near-optimal policy for a single reward function, implying that reward-free RL is no harder than reward-aware RL in linear MDPs.
Short bio: Andrew is a fourth-year Ph.D. student in Computer Science at the University of Washington working with Kevin Jamieson. His research interests are in active learning, reinforcement learning, and control theory. He is supported by an NSF Graduate Research Fellowship

Apr. 15, 2022: Jeongyeol Kwon (UTexas)

Time: 12:30 pm CDT
Location: Orchard view room, Wisconsin Institutes for Discovery (WID), 330 N Orchard St.
Title: Reinforcement Learning with Latent Contexts
Abstract: Learning a near optimal policy in a partially observable system remains an elusive challenge in contemporary reinforcement learning. One example where such a challenge arises is a system associated with latent contexts. In this talk, we consider episodic reinforcement learning in a reward-mixing Markov decision process (RM-MDP). There, a reward function is drawn from one of multiple possible reward models at the beginning of every episode, but the identity of the chosen reward model (context) is not revealed to the agent. With no assumptions on system dynamics, even for two switching-reward models, the problem requires several new ideas beyond existing algorithmic and analysis techniques for efficient exploration. We develop the first efficient algorithm that does not require any assumptions for two switching-reward models. I will then briefly discuss the scalability and generalizability of the proposed method in more challenging scenarios.
Short bio: Jeongyeol is a Ph.D. Candidate in Electrical and Computer Engineering at the University of Texas at Austin. He is broadly interested in theoretical aspects of machine learning, with a focus on Statistical Learning Theory, Optimization, Bandits/Reinforcement Learning, and Robust Statistics. His current research aims to develop efficient and robust algorithms for sequential decision-making problems arising from real applications.

Apr. 1, 2022: Defeng Liu (PolyMtl)

Link: [Video]
Title: Learning to search in local branching for solving mixed-integer programs
Abstract: Finding high-quality solutions to mixed-integer linear programming problems (MILPs) is of great importance for many practical applications. In this respect, the refinement heuristic local branching (LB) has been proposed to produce improving solutions and has been highly influential for the development of local search methods in MILP. For a LB algorithm, the choice of the neighborhood size is critical to performance. Although it was initialized by a conservative value in the original LB scheme, our new observation is that the best size is strongly dependent on the particular MILP instance. In this work, we investigate the relation between the size of the search neighborhood and the behavior of the underlying LB algorithm, and we devise a leaning based framework for guiding the neighborhood search of the LB heuristic. We computationally show that the neighborhood size can indeed be learned, leading to improved performances and that the overall algorithm generalizes well both with respect to the instance size and, remarkably, across instances.
Short Bio: Defeng Liu is currently doing his Ph.D. in Mathematics at the Polytechnique Montréal, under the supervision of Prof. Andrea Lodi. His research focuses on Mixed-Integer Programming, Mathematical Optimization, and Machine Learning. Namely, Defeng is interested in applying deep learning and reinforcement learning techniques in mixed-integer programs to solve complex problems, aiming at bringing machine learning and operations research closer together.

Mar. 18, 2022: Yae Jee Cho (CMU)

Links: [Video]
Title: Leveraging Biased Client Selection in Federated Learning
Abstract: Federated learning is a distributed optimization paradigm that enables a large number of resource-limited client nodes to cooperatively train a model without data sharing. Previous works analyzed federated learning by accounting of data heterogeneity, communication/computation limitations, and partial client participation. However, most assume unbiased client participation, where clients are selected such that the aggregated model update is unbiased. In our work, we investigate how biased client selection can be leveraged to improve many aspects of federated learning including convergence speed, fairness, and client incentives.
Short Bio: Yae Jee Cho is a Ph.D. student in the ECE department at Carnegie Mellon University advised by Prof. Gauri Joshi. Her main research interests are in distributed machine learning, in particular, federated learning with a focus on client systems and data heterogeneity. She received her B.S. and M.S. degrees in Electrical & Computer Engineering from Yonsei University. She has worked with the Privacy in AI team at Microsoft Research as a research intern on federated learning and is a recipient of the Korean Government Doctoral Scholarship.

Mar. 4, 2022: Boxin Zhao (UChicago)

Links: [arXiv][arXiv]
Title: Adaptive Client Sampling in Federated Learning via Online Learning with Bandit Feedback
Abstract: In federated learning (FL) problems, client sampling plays a key role in the convergence speed of training algorithm. However, while being an important problem in FL, client sampling is lack of study. In this paper, we propose an online learning with bandit feedback framework to understand the client sampling problem in FL. By adapting an Online Stochastic Mirror Descent algorithm to minimize the variance of gradient estimation, we propose a new adaptive client sampling algorithm. Besides, we use online ensemble method and doubling trick to automatically choose the tuning parameters in the algorithm. Theoretically, we show dynamic regret bound with comparator as the theoretically optimal sampling sequence; we also include the total variation of this sequence in our upper bound, which is a natural measure of the intrinsic difficulty of the problem. To the best of our knowledge, these theoretical contributions are novel to existing literature. Moreover, by implementing both synthetic and real data experiments, we show empirical evidence of the advantages of our proposed algorithms over widely-used uniform sampling and also other online learning based sampling strategies in previous studies. We also examine its robustness to the choice of tuning parameters. Finally, we discuss its possible extension to sampling without replacement and personalized FL objective. While the original goal is to solve client sampling problem, this work has more general applications on stochastic gradient descent and stochastic coordinate descent methods.
Short bio: Boxin Zhao is a Ph.D. student in Statistics and Econometrics at the University of Chicago, advised by Prof. Mladen Kolar. His research interests include machine learning, federated learning, probabilistic graphical models, high-dimensional statistics, and optimization, with a focus on developing novel methodologies with both practical applications and theoretical guarantees.

Feb. 18, 2022: Sungryull Sohn (LG AI Research US)

Links: [Video] [arXiv]
Title: Successor Feature Landmarks for Long-Horizon Goal-Conditioned Reinforcement Learning
Abstract: Operating in the real world often requires agents to learn about a complex environment and apply this understanding to achieve a breadth of goals. This problem, known as goal-conditioned reinforcement learning (GCRL), becomes especially challenging for long-horizon goals. Current methods have tackled this problem by augmenting goal-conditioned policies with graph-based planning algorithms. However, they struggle to scale to large, high-dimensional state spaces and assume access to exploration mechanisms for efficiently collecting training data. In this work, we introduce Successor Feature Landmarks (SFL), a framework for exploring large, high-dimensional environments so as to obtain a policy that is proficient for any goal. SFL leverages the ability of successor features (SF) to capture transition dynamics, using it to drive exploration by estimating state-novelty and to enable high-level planning by abstracting the state-space as a non-parametric landmark-based graph. We further exploit SF to directly compute a goal-conditioned policy for inter-landmark traversal, which we use to execute plans to “frontier” landmarks at the edge of the explored state space. We show in our experiments on MiniGrid and ViZDoom that SFL enables efficient exploration of large, high-dimensional state spaces and outperforms state-of-the-art baselines on long-horizon GCRL tasks.
Short bio: Sungryull Sohn is a research scientist at LG AI research US. He received B.S. and M.S. degrees in Electrical Engineering from KAIST, and a Ph.D. degree from the University of Michigan. Before his Ph.D., he worked at Electronics and Telecommunications Research Institute (ETRI) as a researcher. During his Ph.D., he worked at Microsoft Research as a research intern and Google Brain as a research intern and student researcher. His research interests include various areas in deep reinforcement learning, such as compositional task learning, hierarchical reinforcement learning, meta-reinforcement learning, multi-task reinforcement learning, offline reinforcement learning, and neural combinatorial optimization.

Feb. 4, 2022: Jiaxin Hu (UW-Madison)

Links: [Video][arXiv]
Title: Multiway Spherical Clustering via Degree-Corrected Tensor Block Models
Abstract: We consider the problem of multiway clustering in the presence of unknown degree heterogeneity. Such data problems arise commonly in applications such as recommendation system, neuroimaging, community detection, and hypergraph partitions in social networks. The allowance of degree heterogeneity provides great flexibility in clustering models, but the extra complexity poses significant challenges in both statistics and computation. Here, we develop a degree-corrected tensor block model with estimation accuracy guarantees. We present the phase transition of clustering performance based on the notion of angle separability, and we characterize three signal-to-noise regimes corresponding to different statistical-computational behaviors. In particular, we demonstrate that an intrinsic statistical-to-computational gap emerges only for tensors of order three or greater. Further, we develop an efficient polynomial-time algorithm that provably achieves exact clustering under mild signal conditions. The efficacy of our procedure is demonstrated through two data applications, one on human brain connectome project, and another on Peru Legislation network dataset.
Short Bio: Jiaxin is a second-year Ph.D. student in Statistics at the University of Wisconsin-Madison working with Prof. Miaoyan Wang. Her research interests lie in the intersection of statistics and machine learning. Currently, she is working on higher-order tensor methods with application in networks. Prior to the Ph.D. program, she received a Master’s degree in Statistics from UW-Madison in 2020 and Bachelor’s degree in Statistics from Wuhan University in 2019.

Jan. 21, 2022: Jy-yong Sohn (UW-Madison)

Links: [Video][arXiv]
Title: GenLabel: Mixup Relabeling using Generative Models
Abstract: Mixup is a data augmentation method that generates new data points by mixing a pair of input data. While mixup generally improves the prediction performance, it sometimes degrades the performance. In this paper, we first identify the main causes of this phenomenon by theoretically and empirically analyzing the mixup algorithm. To resolve this, we propose GenLabel, a simple yet effective relabeling algorithm designed for the mixup. In particular, GenLabel helps the mixup algorithm correctly label mixup samples by learning the class-conditional data distribution using generative models. Via extensive theoretical and empirical analysis, we show that mixup, when used together with GenLabel, can effectively resolve the aforementioned phenomenon, improving the generalization performance and the adversarial robustness.
Short Bio: Jy-yong is a post-doctoral researcher in the Department of Electrical and Computer Engineering (ECE) at the University of Wisconsin-Madison, working with Prof. Dimitris Papailiopoulos and Prof. Kangwook Lee. He is interested in the intersection of machine learning, information theory, and distributed algorithms. He received his Ph.D. degree in 2020 from KAIST, under the supervision of Prof. Jaekyun Moon. He is a recipient of the IEEE ICC Best Paper Award, Qualcomm Innovation Awards, KAIST Global Leader Fellowship, KAIST EE Best Research Achievement Award, and NRF Korea Post-doctoral Fellowship.

Jan. 7, 2022: Hongyi Wang (CMU)

Links: [video][arXiv]
Title: On the Utility of Gradient Compression in Distributed Training Systems
Abstract: A rich body of prior work has highlighted the existence of communication bottlenecks in synchronous data-parallel training. To alleviate these bottlenecks, a long line of recent research proposes gradient and model compression methods. In this talk, I will first present an evaluation of gradient compression methods’ efficacy and comparisons of their scalability with optimized implementations of synchronous data-parallel SGD across more than 200 realistic distributed setups. The observation is that, surprisingly, only in 6 cases out of 200, gradient compression methods provide promising speedup. I will then introduce our extensive investigation to identify the root causes of this phenomenon and present a performance model that can be used to identify the benefits of gradient compression for a variety of system setups. Finally, I will propose a list of desirable properties that gradient compression methods should satisfy, in order for them to provide meaningful utility in practice.
Short Bio: Hongyi Wang is currently a postdoctoral research fellow at the Machine Learning Department at CMU working with Eric Xing. He obtained his Ph.D. degree at the CS department at UW-Madison, advised by Dimitris Papailiopoulos. His research interests are machine learning algorithms and systems. Hongyi has developed several well-known frameworks regarding communication-efficient and robust distributed learning, which were appeared on ICML, NeurIPS, ICLR, and MLSys. His recent work on federated learning has received lots of attention as well as the Baidu best paper award from the SpicyFL workshop at NeurIPS 2020.

Dec. 10, 2021: Angeliki Giannou (UW-Madison)

Links: [video] [arxiv]
Title: Survival of the strictest: Stable and unstable equilibria under regularized learning with partial information
Abstract: In this work, we examine the Nash equilibrium convergence properties of no-regret learning in general N-player games. For concreteness, we focus on the archetypal follow the regularized leader (FTRL) family of algorithms, and we consider the full spectrum of uncertainty that the players may encounter – from noisy, oracle-based feedback, to bandit, payoff-based information. In this general context, we establish a comprehensive equivalence between the stability of a Nash equilibrium and its support: a Nash equilibrium is stable and attracting with arbitrarily high probability if and only if it is strict (i.e., each equilibrium strategy has a unique best response).
Short Bio: Angeliki is a first-year Ph.D. student in the UW-Madison, advised by Prof. Dimitris Papailiopoulos. Prior to that, she received a Bachelor & M.Sc degree in ECE from National Technical University of Athens, working with Prof. Panayotis Mertikopoulos. Her research interests involve machine learning theory and optimization.

Nov. 12, 2021: Chulhee Yun (MIT)

Links: [video] [arxiv]
Title: Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and Beyond
Abstract: In distributed learning, local SGD (also known as federated averaging) and its simple baseline minibatch SGD are widely studied optimization methods. Most existing analyses of these methods assume independent and unbiased gradient estimates obtained via with-replacement sampling. In contrast, we study shuffling-based variants: minibatch and local Random Reshuffling, which draw stochastic gradients without replacement and are thus closer to practice. For smooth functions satisfying the Polyak-Łojasiewicz condition, we obtain convergence bounds (in the large epoch regime) which show that these shuffling-based variants converge faster than their with-replacement counterparts. Moreover, we prove matching lower bounds showing that our convergence analysis is tight. Finally, we propose an algorithmic modification called synchronized shuffling that leads to convergence rates faster than our lower bounds in near-homogeneous settings. Joint work with Shashank Rajput (UW-Madison) and Suvrit Sra (MIT).

Organizers: Yuchen Zeng and Angeliki Giannou
Faculty coordinator: Prof. Kangwook Lee
Sponsor: FuriosaAI
Get the latest MLOPT Idea Seminar news: join the MLOPT Idea Seminar mailing list on Google Groups (you must have a google account to join).
Check past MLOPT Idea Seminar recordings: subscribe to our YouTube channel!
If you are interested in presenting at the MLOPT Idea Seminar, please contact yzeng58@wisc.edu.