Abstracts
A (2+ϵ)-Approximation Algorithm for Metric k-Median
-
The metric k-median problem is a fundamental NP-hard clustering task where the goal is to select exactly k centers from a potential set to minimize the total connection cost for n clients in a metric space.
While seminal work provided a 2-approximation that opens k centers in expectation, achieving a comparable guarantee for algorithms that open exactly k centers has remained a challenge, with the previous best factor standing at 2.613, often achieved via bi-point rounding techniques. In this talk, we close this gap: for any ϵ>0, we introduce a (2+ϵ)-approximation algorithm for metric k-median. Our approach combines two key components. First, we develop a modification of the classic Greedy algorithm employing O(log(n)/ϵ) adaptive phases, yielding a (2+ϵ)-approximation that opens at most k+O(log(n)/ϵ2) centers. Second, we design a novel (2+ϵ)-approximation algorithm specifically tailored for stable instances, leveraging a k-means++ inspired sampling approach and a reduction to submodular optimization under a partition matroid constraint. This combined approach achieves the first (2+ϵ)-approximation factor for the metric k-median problem guaranteed to open exactly k centers.This is joint work with Vincent Cohen-Addad, Fabrizio Grandoni, Euiwoong Lee, and Chris Schwiegelshohn.
Moment Estimation, Sampling, and the Levy-Khintchine Representation Theorem
-
In this talk I'll survey a series of results from SODA'25, ITCS'25, and STOC'25 on a unified view of moment estimation and sampling in the traditional data stream model, where a vector xϵRn is updated and we are either interested in estimating f-moments: Σi f(xi), or f-sampling: sampling i from the support of x proportional to f(i).
The main message is that many prior sketches can be interpreted as sampling from Levy processes, and may be analyzed using the Levy-Khintchine representation theorem. Moreover, after understanding this connection, we are able to design simpler moment-estimation sketches & samplers by explicit reference to corresponding Levy processes.
Joint work with Dingyu Wang.
Faster Combinatorial Matrix-Multiplication
-
It has long been conjectured that no "combinatorial" matrix-multiplication algorithm (avoiding Strassen-like divide-and-conquer) can achieve truly-subcubic runtime n3-Ω(1). We present an O(n2.89)-time algorithm, which only uses a bunch of convolutions in ℤmk (multivariate polynomial multiplications) via FFT, building on the work of Cohn, Kleinberg, Szegedy and Umans (ISAAC'06).
The algorithm is faster than naïve MatMul (n3) for n ≳ 312. To make the algorithm more practical, we consider an approximate version of it, using low-degree approximating polynomials. This scheme yields the first Approximate MatMul algorithm (AMM) which breaks the linear speed-accuracy tradeoff of compression-based (sketching or sampling) algorithms. We discuss potential applications of the algorithm for inference speedup in Deep Learning, offering a generic alternative to MatMuls in LLMs.Joint work with Yahel Uffenheimer and Kevin Pratt.
The Bidirected Cut Relaxation for Steiner Tree has Integrality Gap Smaller than 2
-
The Steiner tree problem is one of the most prominent network design problems. It asks to connect a given set of terminals in the cheapest possible way. The best-known approximation algorithms for Steiner tree involve enumeration of a (polynomial but) very large number of candidate components and have therefore high running times.
A promising ingredient for the design of fast and accurate approximation algorithms for Steiner tree is the bidirected cut relaxation (BCR). This linear programming relaxation is known to be integral in the spanning tree case [Edmonds'67], that is, when all vertices are terminals. For general instances, however it was not known whether the integrality gap of BCR is better than the integrality gap of the natural undirected relaxation, which is well-known to be exactly 2.
We resolve this question by proving that the integrality gap of BCR ist at most 1.9988.This is joint work with Jarek Byrka and Fabrizio Grandoni.
Improved Algorithms for Estimating the Spanning Tree Count
-
We study the problem of estimating the number of spanning trees in an undirected graph G up to multiplicative (1+ϵ) error, and give an algorithm with running time m1.5/ϵ, improving over previous works in sparse graphs. To obtain the results, we take a different approach than previous works, which leveraged Schur complement determinant sparsifiers, and instead use tools such as the localization of electrical flows.
Joint work with Richard Peng, Junzhao Yang.
Towards Overcoming the Reranking Bottleneck
-
Reranking is a popular approach to information retrieval. It proceeds in two stages. In the first stage, a "quick-and-dirty" data structure retrieves a shortlist of r points closest to the query, where the length of the shortlist r is larger than the desired output k. In the second stage, the shortlist is post-processed to identify k<<r points that satisfy the desired objective. For example, the postprocessing could identify the k most "diverse" points in the shortlist or use a "slower-but-accurate" distance metric to identify the best answers. Despite its popularity, it has various drawbacks; notably the quality of the output is limited by the accuracy of the first stage.
In this talk, I will discuss an alternative to reranking, which fuses the two stages into a single search procedure. The new approach crucially uses recent developments in graph-based algorithms for high-dimensional similarity search, as well the tools developed to analyze such algorithms.Talk based on:
- Piyush Anand, Piotr Indyk, Ravishankar Krishnaswamy, Sepideh Mahabadi, Vikas C Raykar, Kirankumar Shiragur, Haike Xu, Graph-Based Algorithms for Diverse Similarity SearchLink opens in a new window, ICML 2025.
- Haike Xu, Sandeep Silwal, Piotr Indyk, A Bi-metric Framework for Fast Similarity SearchLink opens in a new window, 2024.
- Piotr Indyk, Haike Xu, Worst-case performance of popular approximate nearest neighbor search implementations: Guarantees and limitationsLink opens in a new window, NeurIPS 2023.
Spectral Clustering in Birthday Paradox Time
-
Given a vertex in a (k,φ,ε)-clusterable graph, i.e. a graph whose vertex set can be partitioned into a disjoint union of φ-expanders of size ~n/k with outer conductance bounded by ε, can one quickly tell which cluster it belongs to? This is a classical question going back to the expansion testing problem of Goldreich and Ron'11 (the case of k=2) that has received a lot of attention in the literature. For k=2 a sample of ~n1/2+O(ε/φ^2) logarithmic length walks from a given vertex approximately determines its cluster membership by the birthday paradox: two vertices whose random walk samples are 'close' are likely in the same cluster, and otherwise in different clusters.
The study of the general case k>2 was initiated by Czumaj, Peng and Sohler [STOC'15], and the work of Chiplunkar et al. [FOCS'18], Gluch et al. [SODA'21] showed that ~poly(k) n1/2+O(ε/φ^2) random walk samples, and the same query time, suffices for general k. This matches the k=2 result up to polynomial factors in k, but one notices a conceptual inconsistency: if the birthday paradox is indeed the phenomenon guiding the query complexity here, then the query complexity should decrease, as opposed to increase, with the number of clusters k! The currently best known query time (of Gluch et al. [SODA'21]), however, increases with k due to computationally heavy linear-algebraic post-processing of random walk samples.
We design a novel representation of vertices in a (k,φ,ε)-clusterable graph by a mixture of samples of logarithmic length walks. This representation not only uses the optimal ~(n/k)1/2+O(ε/\φ^2) number of walks per vertex, but also allows for fast nearest neighbor search: given a collection of ~k vertices representing the clusters, and a query vertex x, we can find the cluster of x, using nearly linear time in the representation size of x. This gives a spectral clustering oracle with query time ~(n/k)1/2+O(ε/φ^2) and space complexity k (n/k)1/2+O(ε/φ^2), matching the birthday paradox bound.
Joint work with Ekaterina Kochetkova and Weronika Wrzos-Kaminska.
Data Dependent Frameworks for High-Dimensional Problems
-
We present a general framework for designing efficient data structures for high-dimensional pattern-matching problems: preprocess a dataset X of n points so that given a query y, find x in X satisfying F(x,y)=1 for some predicate F. This captures many classic problems, such as the (online) Orthogonal Vectors (OV) problem, where x,y are d-dimensional binary vectors, and F(x,y) is 1 iff x*y=0.
Our framework relies on efficient communication protocols for the predicate F under product distribution. The latter is the key innovation: eg, for the OV predicate (Set Disjointness), while the worst-case communication complexity is Ω(d), the complexity under products is only ~√d. This framework falls into a general recent line of data-dependent (data structure) algorithms where the underlying data transformations depend on the structure of the overall dataset X - in contrast to, say, the standard data-oblivious methods like JL dimension reduction.
Joint work with Shunhua Jiang and Omri Weinstein.
The Price of Explainability for Clustering
-
A recent line of work asks: given a clustering optimization problem (like k-means or k-medians), how does the cost of the best explainable clustering compare to that of the best clustering? We consider the model where a clustering is called explainable if it can be defined by a tree of axis-parallel cuts. Building on a recent sequence of works, we pin down the optimal price of explainability for k-median clustering, and also improve on existing results for k-means clustering. In this talk, I will discuss some of these results (both ours and prior results), the techniques underlying them, and point out some of the open questions in the area.
This talk is based on this paper (https://arxiv.org/abs/2304.09743Link opens in a new window) with Madhusudhan Pittu (CMU), Ola Svensson (EPFL), and Rachel Yuan (CMU).
How to Protect Yourself from Threatening Skeletons: Optimal Padded Decompositions for Minor-Free Graphs
-
A metric space has padding parameter β if, roughly, for any scale Δ>0, it admits a stochastic decomposition into clusters of diameter at most Δ, such that any ball of radius γΔ is contained within a single cluster with probability at least e-γβ. The padding parameter is an important characteristic of metric spaces with vast algorithmic implications.
This talk will begin by surveying different classic constructions of padded decompositions. We will then discuss the main ideas in our recent proof that the shortest path metric of every Kr-minor-free graph has a padding parameter of O(log r) (which is also tight). This resolves a long-standing open question and exponentially improves the previous bound.
Algorithms for Experimental Design
-
Randomized controlled trials (RCTs), or A/B tests, are the standard method for estimating causal effects of new treatments. A key challenge is to assign experimental units into two treatment groups so that pre-treatment characteristics (known as covariates) are balanced, improving estimation when covariates correlate with treatment outcomes, while preserving enough randomness to remain robust when they do not. In the first part of the talk, I will describe how tools from theoretical computer science, particularly discrepancy theory, can improve the design of RCTs. In the second part, I will show how ideas from concentration inequalities can be used to analyze treatment data and construct interval estimators for causal effects.
Sublinear Metric Steiner Tree via Improved Bounds for Set Cover
-
We study the metric Steiner tree problem in the sublinear query model. In this problem, for a set of n points V in a metric space given to us by means of query access to an n×n matrix w, and a set of terminals T⊆V, the goal is to find the minimum-weight subset of the edges that connects all the terminal vertices.
Recently, Chen, Khanna and Tan [SODA'23] gave an algorithm that uses Õ(n13/7) queries and outputs a (2−η)-estimate of the metric Steiner tree weight, where η>0 is a universal constant. A key component in their algorithm is a sublinear algorithm for a particular set cover problem where, given a set system (U,F), the goal is to provide a multiplicative-additive estimate for |U|−SC(U,F). Here U is the set of elements, F is the collection of sets, and SC(U,F) denotes the optimal set cover size of (U,F). In particular, their algorithm returns a (1/4,ε⋅|U|)-multiplicative-additive estimate for this set cover problem using Õ(|F|7/4) membership oracle queries.
In this work, we improve the query complexity of (2−η)-estimating the metric Steiner tree weight to Õ(n5/3) by showing a (1/2,ε⋅|U|)-estimate for the above set cover problem using Õ(|F|5/3) membership queries. To design our set cover algorithm, we estimate the size of a random greedy maximal matching for an auxiliary multigraph that the algorithm constructs implicitly, without access to its adjacency list or matrix.
This is a joint work with Mohammad Roghani, Jakub Tarnawski, and Ali Vakilian.