
Talk Abstracts by Research Theme

Theory, Foundations and Discrete Mathematics

James Town — Saddle-to-Saddle Dynamics and Implicit Bias Without Large Overparametrisation
The success of modern deep learning hinges on the use of first-order methods to optimise neural networks. Despite this, many foundational questions about why such algorithms succeed remain open. We present a complete and non-asymptotic description of the dynamics of gradient flow on a two-layer neural network with ReLU activation functions and orthogonal inputs in the small initialisation limit. We prove convergence to zero loss with high probability with a number of neurons sub-polynomial in the input dimension, despite the non-convexity of the loss landscape. We also show that the training process follows saddle-to-saddle dynamics, providing insight into the feature learning process. We discuss current work seeking to prove that in high dimensions, the norm of the learned network weights is with high probability bounded by a constant multiple of the norm of the optimal interpolator. We discuss numerical evidence supporting our findings and future directions of research.

Christopher Brown — Local Graph Algorithms
Many fundamental graph problems can be solved efficiently in a global setting. However, as datasets grow larger with advances in data storage and collection, even linear-time algorithms may prove too slow on inputs of this magnitude. Furthermore, it is often not feasible to store the entire output at once, rendering most classical approaches impractical. Much work has therefore gone into the development of sublinear algorithms designed to perform in a distributed setting, particularly in the field of local computation algorithms (LCAs). Unlike a typical graph algorithm, which aims to compute the entire output, local algorithms instead have each part of the graph compute its own part of the output using information only from its local neighbourhood. For the problem of finding a maximal independent set (MIS) in a graph, this means that each node must decide whether it is part of the MIS by looking only at local information, as in the sketch below. This presentation will cover some of the important results and recent advances in this area.
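
To make the local decision concrete, here is a minimal Python sketch of the classic random-ranks simulation of the greedy MIS (the standard textbook LCA, not the specific results of the talk): a node joins the MIS exactly when no lower-ranked neighbour does, so each query only ever explores a local region of the graph.

```python
import random

def lca_mis_query(v, adj, rank, memo=None):
    """Decide locally whether v joins the maximal independent set.

    v is in the MIS iff no lower-ranked neighbour is in the MIS;
    the recursion only ever explores v's local neighbourhood."""
    if memo is None:
        memo = {}
    if v in memo:
        return memo[v]
    for u in sorted(adj[v], key=lambda u: rank[u]):
        if rank[u] < rank[v] and lca_mis_query(u, adj, rank, memo):
            memo[v] = False
            return False
    memo[v] = True
    return True

# Toy graph: the path 0-1-2-3; each query inspects only nearby nodes.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
rank = {v: random.random() for v in adj}
print({v: lca_mis_query(v, adj, rank) for v in adj})
```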

Luisa Estrada Plata — On the Limits of PAC Learning of Networks from Opinion Dynamics
Agents in social networks with threshold-based dynamics change opinions when influenced by sufficiently many peers. Existing literature typically assumes that the network structure and dynamics are fully known, which is often unrealistic. In this work, we ask how to learn a network structure from samples of the agents’ synchronous opinion updates. Firstly, if the opinion dynamics follow a threshold rule in which a fixed number of influencers prevent opinion change (e.g., unanimity and quasi-unanimity), we provide an efficient PAC learning algorithm provided that the number of influencers per agent is bounded. Secondly, under standard computational complexity assumptions, we prove that if agents’ opinions follow the majority of their influencers, then there is no efficient PAC learning algorithm. We propose a polynomial-time heuristic that successfully learns consistent networks in over 98% of our simulations on random graphs, with no failures for some specified conditions on the numbers of agents and opinion diffusion examples.
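
As a concrete instance of the dynamics being learned, the following sketch (data structures and tie-breaking convention are illustrative, not the paper's) runs one synchronous majority update; the learner observes pairs of consecutive opinion vectors like the one printed here and must infer the hidden influencer sets.

```python
def synchronous_update(opinions, influencers):
    """One synchronous step: agent i adopts the majority opinion of its
    influencer set (ties keep the agent's current opinion)."""
    new = opinions.copy()
    for i, infl in influencers.items():
        votes = sum(opinions[j] for j in infl)  # influencers holding opinion 1
        if 2 * votes > len(infl):
            new[i] = 1
        elif 2 * votes < len(infl):
            new[i] = 0
    return new

opinions = {0: 1, 1: 0, 2: 1, 3: 0}
influencers = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(synchronous_update(opinions, influencers))
```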

Neha Rino — New Product Constructions for Intersecting Dense Automata
Nondeterministic finite automata (NFAs) are a simple and foundational model of computation that precisely capture the class of regular languages. Computing the intersection of regular languages is one of the most basic problems in automata theory. Given two NFAs, the classic method of constructing an NFA that recognises the intersection of their languages uses what is known as the direct product of automata, and this classic construction from 1959 appears to have remained the state of the art until now. In this talk, we will describe three new product constructions for NFAs (the nodding product, the catchup product, and the leapfrog product) that all recognise the intersection while yielding much smaller NFAs than the classical direct product. These product constructions provide a faster algorithm for the related problem of checking whether the intersection of regular languages is empty, and we show that this new algorithm is optimal subject to a well-known fine-grained complexity hypothesis, the Combinatorial k-Clique hypothesis.
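
For reference, the 1959 baseline the talk improves on is easy to state in code. The sketch below (the dictionary encoding of an NFA is illustrative) builds the classic direct product, whose state set is the full Cartesian product of the two state sets — exactly the size that the nodding, catchup, and leapfrog products reduce; those new constructions are not reproduced here.

```python
from itertools import product

def direct_product(nfa1, nfa2):
    """Classic product construction: states are pairs, and the product
    automaton accepts exactly the intersection of the two languages.
    Note it always has |Q1| * |Q2| states."""
    states = set(product(nfa1["states"], nfa2["states"]))
    delta = {}
    for (p, q), a in product(states, nfa1["alphabet"]):
        delta[(p, q), a] = {(p2, q2)
                            for p2 in nfa1["delta"].get((p, a), set())
                            for q2 in nfa2["delta"].get((q, a), set())}
    return {
        "states": states,
        "alphabet": nfa1["alphabet"],
        "delta": delta,
        "start": set(product(nfa1["start"], nfa2["start"])),
        "accept": set(product(nfa1["accept"], nfa2["accept"])),
    }
```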

Jedrzej Olkowski — On Important Cuts with Additional Structural Properties in Eulerian Digraphs
We will present a meta-theorem for Eulerian digraphs that generalizes the notion of important cuts. As our main contribution, we show an upper bound (independent of the size of the graph) on the number of s–t cuts of size at most k for which the graph induced by the s-side of the cut satisfies a digraph property P expressible in monadic second-order logic. Moreover, we show that, under some additional assumptions on P, we can enumerate all such important s–t cuts in time depending polynomially on n. Finally, we show how this framework can be used to design an FPT algorithm for the Arc-Odd-Cycle-Transversal problem on Eulerian digraphs parameterized by the solution size, in contrast to its known W[1]-hardness on general digraphs.

Anton Baychkov — Beyond Lower Quota: Avoiding Overrepresentation in Multi-Winner Voting
In the recent social choice literature on proportional representation, much attention has been given to the question of avoiding underrepresentation in approval-based multi-winner voting. In this paper, we explore the largely overlooked complementary question of avoiding overrepresentation. This has not been explored systematically, despite being a desirable property with concrete applications. Intuitively, overrepresentation happens when a group determines a disproportionately large part of the committee, thereby exceeding the group's quota. We formulate a strong and appealing axiom for avoiding overrepresentation, called justified upper quota (JUQ). We introduce a generalization of Thiele rules, composite Thiele rules, and characterize the unique rule in this class satisfying our axiom. This rule, Adams-AV, which naturally extends Adams' apportionment method, has not been studied before. Additionally, we introduce a polynomial-time rule that satisfies JUQ. Furthermore, we introduce justified near quota, an axiom that balances avoiding under- and overrepresentation. It characterizes the unique Thiele rule extending the Sainte-Laguë apportionment method. Finally, we analyze the compatibility of our axioms with established proportionality notions such as EJR+.

Nikhil Bansal — Cloning is as Hard as Learning for Stabilizer States
The impossibility of cloning non-orthogonal states lies at the foundations of quantum theory as the No-Cloning theorem. Even when allowing for approximation errors, cloning an arbitrary pure state requires as many initial copies as needed to fully learn the state. Rather than arbitrary unknown states, modern quantum learning theory often considers structured classes of states and exploits such structure to develop learning algorithms that outperform general-state learning. This raises the question: how do the sample complexities of learning and cloning relate for such structured classes? We answer this question for a very important class of states: stabilizer states, which have a linear structure. We show that cloning is as hard as learning for stabilizer states, establishing the No-approximate-Cloning theorem for them. We show this by relating the task of quantum cloning of structured states to the task of sample amplification of probability distributions: producing more samples from an unknown distribution than the number actually drawn. For arbitrary discrete distributions, sample amplification is known to be easier than learning. We prove that for discrete distributions with an underlying linear structure, sample amplification is as hard as learning. We then “lift” this hardness to quantum cloning using tools from representation theory.

Jinqiao Hu — Failure of Symmetry of Information for Randomized Computations
Symmetry of Information (SoI) is a fundamental result in Kolmogorov complexity stating that for all n-bit strings x and y, K(x,y)=K(y)+K(x∣y) up to an additive error of order O(log n). In contrast, understanding whether SoI holds for time-bounded Kolmogorov complexity measures is closely related to longstanding open problems in complexity theory and cryptography, such as the P vs NP question and the existence of one-way functions. In this paper, we prove that SoI fails for rKt complexity, the randomized analogue of Levin’s Kt complexity. This is the first unconditional result of this type for a randomized notion of time-bounded Kolmogorov complexity. Motivated by applications in cryptography, we also establish the failure of SoI for a related notion called pKt complexity, and provide an extension of the results to the average-case setting. More generally, we establish a close relationship between the validity of SoI for rKt and the existence of randomized algorithms approximating rKt(x). Our findings complement previous work demonstrating the failure of SoI for Kt complexity. In contrast, the randomized setting poses a significant challenge, which we overcome using insights from recent structural results about rKt, and techniques from meta-complexity and computational pseudorandomness.

Ermiya Farokhnejad — Additive One Approximation for Minimum Degree Spanning Tree: Breaking the O(mn) Time Barrier
We consider the minimum degree spanning tree problem. As input, we receive an undirected, connected graph G=(V,E) with n nodes and m edges, and our task is to find a spanning tree T of G that minimizes the maximum degree of a node in T. The problem is known to be NP-hard. In the early 1990s, an influential work by Fürer and Raghavachari presented a local search algorithm that runs in O(mn) time and returns a spanning tree with maximum degree at most Δ*+1, where Δ* is the optimal objective. This remained the state-of-the-art runtime bound for computing an additive one approximation until now. We break this O(mn) runtime barrier by providing a deterministic algorithm that returns an additive one approximate optimal spanning tree in O(mn^{3/4}) time. This constitutes substantive progress towards answering an open question that has been repeatedly posed in the literature. Our algorithm is based on a novel application of the blocking flow paradigm.

Dimitrios Tsintsilidas — The Complexity of Student-Teacher Games: How many counterexamples to find the right answer?
The Student–Teacher Game is a model of computation for total search problems at the second level of the polynomial hierarchy; that is, problems of finding an element that satisfies a given property and cannot be falsified by any counterexample. In this model, computation is carried out interactively between the Student, who attempts to find the required element, and the Teacher, who provides counterexamples to failed attempts. In this talk, we introduce total search problems and the framework of Student–Teacher Games as a model of interactive counterexample computation. We then investigate how this model is affected by the number of rounds and by the number of queries the Student can make to the Teacher in each round. Finally, we present examples of lower bounds that arise from connections between this model and logic.

Machine Learning

Haili Yuan — Heterophily-Robust Graph Learning via Diffusion: Random Walks, Convection Diffusion, and Graph-Agnostic Clustering
Graph learning methods often assume homophily, i.e. that connected nodes share labels or features. This assumption breaks in heterophilic graphs, where standard message passing and diffusion can oversmooth representations or amplify noise. This survey talk reviews diffusion-based strategies that remain effective under heterophily, based on three recent lines of work that I have studied and reproduced. Rather than a broad taxonomy, I focus on the modelling choices that directly alter how information propagates. I first revisit random-walk diffusion, where heterophily-aware Personalised PageRank adapts both transition probabilities and restart distributions using local heterophily and feature diversity, and then uses the resulting scores to extract node-centric contextual subgraphs for node classification. I then move to continuous-time dynamics, where graph neural convection-diffusion augments neural diffusion with a learnable convection term, enabling directional information flow across dissimilar neighbours instead of pure smoothing. Finally, I discuss unsupervised clustering, where diffusion-based graph-agnostic clustering reframes representation learning and clustering as Dirichlet energy minimisation and performs dual diffusion on the observed topology and an attribute-induced affinity graph, improving robustness to noisy or heterophilic edges. Across these perspectives, a shared principle is to make diffusion adaptive: local heterophily estimation, controllable diffusion depth or integration time, and multi-view diffusion over topology and attributes.
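
For orientation, the random-walk backbone that these lines of work modify is plain Personalised PageRank. The sketch below shows the vanilla power iteration, with the transition matrix P and the restart vector left untouched; the heterophily-aware variant discussed in the talk adapts both, in ways not reproduced here.

```python
import numpy as np

def personalised_pagerank(A, restart, alpha=0.15, iters=100):
    """Power iteration for PPR: pi = alpha*restart + (1-alpha)*P^T pi,
    where P is the row-stochastic transition matrix of adjacency A."""
    P = A / A.sum(axis=1, keepdims=True)
    pi = restart.copy()
    for _ in range(iters):
        pi = alpha * restart + (1 - alpha) * P.T @ pi
    return pi

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
restart = np.array([1.0, 0.0, 0.0, 0.0])  # personalised on node 0
print(personalised_pagerank(A, restart).round(3))
```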

Ben Lewis — Text to Stealthy Adversarial Face Masks
Recent studies have demonstrated that modern facial recognition systems, which are based on deep neural networks, are vulnerable to adversarial attacks, including the use of accessories, makeup patterns, or precision lighting. However, developing attacks that are both robust and stealthy remains a significant challenge. In this context, we introduce a novel diffusion-based method (DAFR) capable of generating robust and stealthy face masks for dodging recognition systems. Specifically, our approach is capable of producing high-fidelity printable textures using the guidance of textual prompts to determine the style. This method can also be adapted for impersonation purposes, where the system misidentifies the attacker as a specific other individual. Finally, we address a gap in the existing literature by presenting a comprehensive benchmark (FAAB) for evaluating adversarial accessories in three dimensions, assessing their robustness and stealthiness.

Barath Ashok — Survey of GNNs and their explainability
Graph Neural Networks (GNNs) have become vital for recent progress in molecular structure modelling, social network analysis, and analysis of financial data. This is due to their ability to capture and learn from both graph structural information, such as neighbourhood information, and node features that may be independent of the graph structure. Despite this success, GNNs are widely regarded as black-box models, leading to concerns about their deployment in safety-critical and high-stakes scenarios. This survey talk provides a systematic overview of Graph Neural Networks, explainability in GNNs, and the unique challenges one faces when trying to build interpretations for these GNN models.

João Lobo Pevidor — Learning steps: how gradient descent improves representation in neural networks
Despite their widespread success, the internal learning dynamics of neural networks remain mathematically elusive. The paper “High-dimensional Asymptotics of Feature Learning: How One Gradient Step Improves the Representation” by Ba et al. shows that for a two-layer network, the first-layer weights after a single gradient descent step are well approximated by the initial weights plus a rank-one, data-dependent matrix. This reveals that feature learning follows a simple, low-dimensional trajectory from the very first step. In this talk, we’ll unpack the intuition behind this result and its implications for understanding feature learning in these models. We’ll see how this allows neural networks to move beyond static random-feature models, steadily improving their approximation of the target function.
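
A quick numeric way to probe the rank-one claim (a sketch under an arbitrary choice of sizes and target, not the paper's exact setting): compute the first gradient step on the first-layer weights of a two-layer ReLU network trained on a single-index task, and inspect the singular values of the update. Under the paper's proportional regime, the spectrum should be dominated by a single spike.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, n = 256, 128, 4096                      # input dim, width, samples
X = rng.standard_normal((n, d)) / np.sqrt(d)
w_star = rng.standard_normal(d)
y = np.tanh(X @ w_star)                       # single-index target
W = rng.standard_normal((N, d)) / np.sqrt(d)  # first-layer init
a = rng.standard_normal(N) / np.sqrt(N)       # fixed second layer

# Gradient of the squared loss w.r.t. W for f(x) = a^T relu(Wx)
pre = X @ W.T                                 # pre-activations, (n, N)
res = np.maximum(pre, 0) @ a - y              # residuals
G = ((res[:, None] * (pre > 0)) * a).T @ X / n

s = np.linalg.svd(G, compute_uv=False)
print(s[:4] / s[0])  # a dominant first singular value indicates ~rank-one
```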

Tuan Nguyen — Safety Game: Balancing Safe and Informative Conversations with Blackbox Agentic AI using LP Solvers
Ensuring that large language models (LLMs) comply with safety requirements is a central challenge in AI deployment. Existing alignment approaches primarily operate during training, such as through fine-tuning or reinforcement learning from human feedback, but these methods are costly and inflexible. In this work, we propose a model-independent, black-box framework for safety alignment that does not require retraining or access to the underlying LLM architecture. As a proof of concept, we address the problem of trading off between generating safe but uninformative answers versus helpful yet potentially risky ones. We formulate this dilemma as a two-player zero-sum game whose minimax equilibrium captures the optimal balance between safety and helpfulness. LLM agents operationalize this framework by leveraging a linear programming solver at inference time to compute equilibrium strategies.
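
The equilibrium computation itself is classical: a finite zero-sum matrix game reduces to a linear program. The sketch below (the toy payoff matrix is illustrative; the talk's game and payoffs may differ) solves for the row player's maximin mixed strategy with scipy.

```python
import numpy as np
from scipy.optimize import linprog

def solve_zero_sum(A):
    """Row player's maximin strategy for payoff matrix A via LP:
    maximise v subject to (x^T A)_j >= v per column, x a distribution."""
    m, n = A.shape
    c = np.zeros(m + 1); c[-1] = -1.0            # variables [x, v]; min -v
    A_ub = np.hstack([-A.T, np.ones((n, 1))])    # v - (x^T A)_j <= 0
    b_ub = np.zeros(n)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    b_eq = np.array([1.0])                       # x sums to 1
    bounds = [(0, None)] * m + [(None, None)]    # x >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]

# Toy safety game: rows = {answer informatively, refuse},
# columns = {prompt is benign, prompt is adversarial}.
A = np.array([[2.0, -1.0],
              [0.0,  1.0]])
strategy, value = solve_zero_sum(A)
print(strategy.round(3), round(value, 3))  # mixed strategy, game value 0.5
```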

Xuefeng Xu — Federated Computation of ROC and PR Curves
Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves are fundamental tools for evaluating machine learning classifiers. However, in Federated Learning scenarios, where data is distributed across multiple clients, computing these curves is challenging due to privacy and communication constraints. In this paper, we propose a novel method for approximating ROC and PR curves in a federated setting by estimating quantiles of the prediction score distribution under distributed differential privacy. We provide theoretical bounds on the Area Error between the true and estimated curves, demonstrating the trade-offs between approximation accuracy, privacy, and communication cost. Empirical results on real-world datasets demonstrate that our method achieves high approximation accuracy with minimal communication and strong privacy guarantees.
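
A sketch of the quantile-to-curve step (not the paper's exact estimator): given quantiles of the positive- and negative-class score distributions, e.g. aggregated privately across clients, each candidate threshold t yields the ROC point (1 − F_neg(t), 1 − F_pos(t)), where the class CDFs are interpolated from the quantile estimates.

```python
import numpy as np

def roc_from_quantiles(q_pos, q_neg, levels):
    """Approximate ROC from estimated per-class score quantiles.

    q_pos[i] (resp. q_neg[i]) estimates the levels[i]-quantile of the
    positive (negative) score distribution; tails are clipped to [0, 1]."""
    thresholds = np.union1d(q_pos, q_neg)
    F_pos = np.interp(thresholds, q_pos, levels, left=0.0, right=1.0)
    F_neg = np.interp(thresholds, q_neg, levels, left=0.0, right=1.0)
    return 1.0 - F_neg, 1.0 - F_pos  # (FPR, TPR) per threshold

levels = np.linspace(0.05, 0.95, 19)
scores_pos = np.random.default_rng(1).normal(1.0, 1.0, 5000)
scores_neg = np.random.default_rng(2).normal(-1.0, 1.0, 5000)
fpr, tpr = roc_from_quantiles(np.quantile(scores_pos, levels),
                              np.quantile(scores_neg, levels), levels)
print(np.trapz(tpr[::-1], fpr[::-1]))  # rough AUC estimate, ~0.92 here
```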

High Performance Computing

Prakanth Thilakaraj — Multi Stage Sequential Reinforcement Environment for MLIR Optimization
Machine learning–guided compiler optimization has recently attracted significant attention, with applications such as optimization pass prediction in LLVM IR and heuristic tuning in domain-specific language compilers. In this work, we extend this paradigm to the Multi-Level Intermediate Representation (MLIR), a compiler infrastructure for DSL development that is widely used in AI compiler stacks. We present MLIRCompilerEnv, a reinforcement learning environment for predicting optimization passes and pass options in MLIR-based compiler pipelines, and evaluate it in the context of AI compilation. Our design targets the linalg, affine, and scf dialect levels, while remaining extensible to other MLIR-based compiler stacks for meta-optimization. We formulate compiler pass selection as a multi-stage RL problem, where each dialect stage is associated with a dedicated agent responsible for predicting optimization passes and their corresponding options. A message-passing graph neural network serves as a shared backbone across agents, extracting structural features from the intermediate representations. Agents operate sequentially to mirror the ordering of stages in the MLIR lowering pipeline, and end-to-end execution runtime is used as the reward signal to guide learning.
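
A heavily simplified, hypothetical skeleton of one stage of such an environment (the two pass names are real mlir-opt flags, but the runner invocation, file handling, and observation encoding are placeholders; the actual MLIRCompilerEnv design may differ substantially):

```python
import subprocess, time

class MLIRStageEnv:
    """Hypothetical sketch of one dialect stage: the action appends a pass,
    and the reward is the negative execution runtime of the lowered module."""

    PASSES = ["--linalg-fuse-elementwise-ops",
              "--convert-linalg-to-affine-loops"]  # real mlir-opt flags

    def __init__(self, module_path, max_steps=8):
        self.module_path, self.max_steps = module_path, max_steps
        self.reset()

    def reset(self):
        self.pipeline, self.steps = [], 0
        return self._observe()

    def _observe(self):
        # Placeholder: the real environment returns a GNN encoding of the IR.
        return tuple(self.pipeline)

    def step(self, action):
        self.pipeline.append(self.PASSES[action])
        self.steps += 1
        subprocess.run(["mlir-opt", *self.pipeline, self.module_path,
                        "-o", "/tmp/out.mlir"], check=True)
        t0 = time.perf_counter()
        subprocess.run(["mlir-runner", "/tmp/out.mlir"], check=True)  # hypothetical runner
        reward = -(time.perf_counter() - t0)
        return self._observe(), reward, self.steps >= self.max_steps, {}
```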

Artificial Intelligence

Daniela Valdes — Identifying commitment and accountability in executive boards
Leadership is increasingly understood as a collective, relational process emerging through moment-to-moment interaction. Although recent scholarship emphasises commitment, such analysis continues to face issues of scalability and researcher bias. At the same time, leadership communication studies often overlook live meetings, leaving a significant gap in understanding how leadership unfolds. Large Language Models (LLMs) and machine learning offer ways to analyse such interactions at scale while enhancing qualitative depth. This study brings together leadership science, computer science, and sociolinguistics to examine how collective commitment and accountability are co-created within seven UK National Health Service Executive Boards. Using a robustly developed codebook, we applied in-context learning to train LLMs to detect ‘leadership moments’, defined as visible accountability toward action. The model was also trained to recognise instances where accountability was absent. Results show that Gemma 3.3 27bn, used zero-shot, achieved 74.8% accuracy in identifying collective leadership, though performance dropped for non-accountability cases.

Aakash Rao — HistoForge
HistoForge is a pathologist-facing, agentic computational pathology platform designed to support whole-slide image (WSI) workflows end to end. The system combines a LibreChat-based interface with a locally served large language model and a modular Model Context Protocol tool ecosystem, enabling users to execute complex WSI-level analyses through natural language while preserving auditability and deterministic execution. Core capabilities include dataset validation, visualisation, standardised preprocessing, thumbnailing and QC reports, and a “model zoo” for feature extraction and downstream predictive tasks. HistoForge integrates modern histopathology representation learning methods and supports configurable pipelines for slide-level classification and risk stratification.

Rafid Ameer Mahmud — A Game-Theoretic Formulation of Opinion Diversity in Social Networks
With the ubiquity of communication media such as social media and personalised recommendations, opinion formation and shifts have become fast and reactive. Prior research has shown that continued interaction between two opinions has a polarising effect, causing them to drift towards extremes, form echo chambers, and lose diversity. In this presentation, we extend the analysis of the two-opinion model to a network of opinions in which opinions are represented by both individuals and external content such as advertisements, seminars, and campaigns. We also present a metric to define population-level diversity and monitor polarisation. We then formulate this phenomenon as a game on a network with the goal of maximising diversity by manipulating the edges of the network.

Yuanzhen Shuai — Block-Level Personalized Model Construction for Heterogeneous Federated Learning
Federated learning enables collaborative model training without sharing raw data, but most traditional approaches assume homogeneous model architectures across clients, limiting their applicability in real-world scenarios. Personalized federated learning has recently emerged to address client heterogeneity in data distribution, computational capability, and model design. This talk surveys recent developments in personalized federated learning under heterogeneous model settings. In particular, we focus on block-level model aggregation strategies that construct customized client models through modular neural network components. We further discuss the advantages and limitations of modular personalization approaches and highlight open challenges in heterogeneous federated learning.
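
One plausible reading of block-level aggregation, as a sketch (the structure and naming are illustrative): treat each client model as a dictionary of named blocks and average each block only over the clients that actually hold it, so heterogeneous architectures can still share their compatible components.

```python
import numpy as np

def aggregate_blocks(client_models):
    """Average each named block over the clients that hold it; clients
    with heterogeneous architectures may hold different block subsets."""
    sums, counts = {}, {}
    for model in client_models:
        for name, w in model.items():
            sums[name] = sums.get(name, 0) + w
            counts[name] = counts.get(name, 0) + 1
    return {name: sums[name] / counts[name] for name in sums}

clients = [
    {"embed": np.ones(4), "head": np.ones(2)},
    {"embed": 3 * np.ones(4)},              # smaller client: no head block
]
print(aggregate_blocks(clients))            # embed averaged, head kept as-is
```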

Shufan Yang — SimRank and Its Variants: Properties, Limitations, and Random Walk Approaches
SimRank is a classical similarity measure for graphs, defining node similarity through recursive neighbour comparison. While it satisfies several desirable properties for similarity search, it also exhibits fundamental limitations. In particular, SimRank is sensitive to odd distances, tends to over-propagate similarity across structurally different nodes, and struggles to capture role-based similarity. To address efficiency and scalability, many works reinterpret SimRank through coupled random walks, where similarity is expressed as the probability that two walks meet. This survey reviews the formulation of SimRank, summarizes its main variants and random-walk-based computation methods, and discusses how different design choices affect key similarity properties relevant to large-scale similarity search.
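
The recursive definition is short enough to compute directly. The naive fixed-point iteration below implements s(a, b) = C / (|I(a)||I(b)|) · Σ_{i∈I(a), j∈I(b)} s(i, j) with s(a, a) = 1, over in-neighbour sets I(·); its cost per iteration is what motivates the random-walk reformulation, in which s(a, b) is the probability that two backward walks from a and b meet.

```python
import numpy as np

def simrank(adj, C=0.8, iters=10):
    """Naive fixed-point iteration of SimRank over in-neighbour sets."""
    n = len(adj)
    in_nbrs = [[u for u in range(n) if v in adj[u]] for v in range(n)]
    S = np.eye(n)
    for _ in range(iters):
        S_new = np.eye(n)
        for a in range(n):
            for b in range(a + 1, n):
                Ia, Ib = in_nbrs[a], in_nbrs[b]
                if Ia and Ib:
                    s = sum(S[i, j] for i in Ia for j in Ib)
                    S_new[a, b] = S_new[b, a] = C * s / (len(Ia) * len(Ib))
        S = S_new
    return S

adj = {0: [2, 3], 1: [2, 3], 2: [], 3: []}  # 0 and 1 both point to 2 and 3
print(simrank(adj).round(3))
```

In the example, nodes 2 and 3 share both in-neighbours and receive similarity C/2 = 0.4, while nodes 0 and 1, having no in-neighbours at all, stay at 0 — an instance of the structural blind spots the talk discusses.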

Delong Huang — Error-Bounded Extrapolation PINNs
Physics-informed neural networks (PINNs) can be inefficient and unstable when trained over an entire domain at once, particularly when the solution contains localized nonlinear structure. We propose Error-Bounded Extrapolation PINNs, which expand the reliable solution region under an explicit error tolerance and learn the solution through Taylor-expansion remainder learning. Beginning from an initial trusted subdomain, the method expands the active region in stages. At each stage, a Taylor-based extrapolation is constructed from the previously learned solution values and derivatives at the current subdomain and used as a prescribed baseline in the newly activated region, as sketched below. By repeatedly applying error-bounded expansion and remainder-focused physics enforcement, the approach targets local nonlinearity more effectively than global training.
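
In formulas, one staged step reads roughly as follows (the notation is illustrative, with a second-order expansion chosen for concreteness; the talk's exact formulation may differ):

```latex
% Stage k: trusted subdomain \Omega_k with learned solution u_k.
% On the newly activated region, anchor a Taylor baseline at an
% interface point x_0 of the current subdomain:
\hat{u}(x) \;=\; u_k(x_0) \;+\; \nabla u_k(x_0)\cdot(x - x_0)
      \;+\; \tfrac12\,(x - x_0)^{\top}\nabla^2 u_k(x_0)\,(x - x_0),
% then train a network only on the remainder,
u_{k+1}(x) \;=\; \hat{u}(x) + R_\theta(x),
% enforcing the PDE residual on u_{k+1} and accepting the expansion
% only while the remainder respects the error tolerance:
\max_{x \,\in\, \Omega_{k+1}\setminus\Omega_k} |R_\theta(x)| \;\le\; \varepsilon.
```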

Computer Security

Jack Nolan — Cryptographically Verifiable Election Audits
A Risk-Limiting Audit (RLA) is a procedure that can be used to catch errors in a reported election outcome with high probability. Current practice is for RLAs to be administered by a trusted independent auditing authority, which raises security and transparency concerns. In this talk I present a novel paper ballot design that divides printed ballots into batches whose tallies can be publicly verified if no ballot from the batch is lost. The design further provides some verifiability guarantees by embedding cryptograms on ballots as well as voter receipts. Encrypted votes are published to a publicly available append-only bulletin board, which can be used to tabulate votes for comparison with the tallies reported by the election authority.
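
The bulletin-board component can be illustrated with a minimal hash chain (a generic sketch of the append-only property, not the paper's construction): every entry commits to the current head, so deleting or editing any published ciphertext invalidates every subsequent hash.

```python
import hashlib, json

class BulletinBoard:
    """Append-only log: each entry commits to the previous head, so any
    retroactive edit changes every later hash."""

    def __init__(self):
        self.entries, self.head = [], b"genesis"

    def append(self, record):
        blob = json.dumps(record, sort_keys=True).encode()
        self.head = hashlib.sha256(self.head + blob).digest()
        self.entries.append((record, self.head))
        return self.head.hex()

board = BulletinBoard()
receipt = board.append({"ballot": 17, "cipher": "a1b2..."})
print(receipt)  # published so voters can check inclusion
```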

Yihan Cai — From Zero-Knowledge Proofs to Well-Formedness Proofs
Zero-knowledge proofs (ZKPs) enable a prover to convince a verifier that a statement is true while revealing nothing beyond its validity. This talk surveys the core ideas behind ZKPs and shows how a small set of generic techniques scales to proofs used in practical systems. We start with standard intuition and formalize the three defining properties: completeness, soundness, and zero-knowledge. We then introduce Sigma protocols, show how they compose into OR-proofs, and explain how they are made non-interactive via the Fiat–Shamir heuristic. Finally, we show how these OR-proofs serve as the main building block for well-formedness proofs used in verifiable e-voting.
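
As a concrete running example, here is Schnorr's Sigma protocol for knowledge of a discrete logarithm, made non-interactive with Fiat–Shamir, at deliberately toy parameters (do not use these sizes in practice):

```python
import hashlib, secrets

# Schnorr proof of knowledge of x with y = g^x (mod p), for a subgroup
# of prime order q, where p = 2q + 1 and g generates the subgroup.
p, q, g = 2039, 1019, 4  # toy-sized parameters only!

def prove(x, y):
    r = secrets.randbelow(q)
    t = pow(g, r, p)                               # commitment
    c = int.from_bytes(hashlib.sha256(f"{g}{y}{t}".encode()).digest(), "big") % q
    s = (r + c * x) % q                            # response
    return t, s

def verify(y, t, s):
    c = int.from_bytes(hashlib.sha256(f"{g}{y}{t}".encode()).digest(), "big") % q
    return pow(g, s, p) == (t * pow(y, c, p)) % p  # g^s = t * y^c

x = 123
y = pow(g, x, p)
t, s = prove(x, y)
print(verify(y, t, s))  # True
```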

Yasmine Vazirinejad — Hybrid Password-Authenticated Key Exchange
Password-Authenticated Key Exchange protocols allow two parties sharing only a low-entropy password to establish a secure session key without relying on a Public Key Infrastructure. Hybrid PAKEs address the challenge of post-quantum resilience by combining classical and post-quantum assumptions. We present a three-pass hybrid PAKE that combines the classical J-PAKE protocol with the post-quantum Key Encapsulation Mechanism ML-KEM (Kyber). We analyze the protocol in the Universal Composability framework and show that it achieves perfect forward secrecy against passive post-quantum adversaries.
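
The glue step can be sketched generically (an HKDF-style combiner written for illustration, not the talk's exact key schedule): derive the session key from both shared secrets and the transcript, so the key remains secret as long as either the classical or the post-quantum assumption holds, and binding the transcript prevents mix-and-match attacks.

```python
import hashlib, hmac

def combine_keys(k_classical: bytes, k_pq: bytes, transcript: bytes) -> bytes:
    """Hash-combiner sketch: extract from both secrets, then expand with
    the session transcript bound in."""
    prk = hmac.new(b"hybrid-pake-v1", k_classical + k_pq, hashlib.sha256).digest()
    return hmac.new(prk, transcript + b"session-key", hashlib.sha256).digest()

session_key = combine_keys(b"\x01" * 32, b"\x02" * 32, b"ids||nonces||publics")
print(session_key.hex())
```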

Computer Vision

Yundi Hong — PartMatch: Part-Aware Pseudo-Labeling for Fine-Grained Semi-Supervised Learning
Fine-grained visual classification is characterized by subtle inter-class differences that are often localized in small, discriminative object parts. Existing semi-supervised learning methods rely solely on global pseudo-labels, which may be unstable for fine-grained categories. In this work, we introduce PartMatch, a part-aware semi-supervised framework that integrates part token extraction into the FreeMatch paradigm. Our method constructs entropy-weighted part-level predictions, fuses them with global weak-view predictions via a reliability-aware combination, and enforces part-global consistency through a KL regularization loss. Extensive experiments demonstrate consistent improvements on CUB-200-2011 and Stanford Dogs.
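
A sketch of the fusion step as described (the weighting constants and tensor shapes are illustrative, not the paper's exact configuration): part predictions are softmax-normalised, weighted by negative entropy so that confident parts count more, combined with the global prediction, and tied back to it with a KL term.

```python
import torch
import torch.nn.functional as F

def fuse_predictions(global_logits, part_logits):
    """Entropy-weighted fusion of part and global predictions, plus a
    KL(fused || global) consistency regulariser."""
    g = F.softmax(global_logits, dim=-1)             # (B, C)
    p = F.softmax(part_logits, dim=-1)               # (B, P, C)
    ent = -(p * p.clamp_min(1e-8).log()).sum(-1)     # per-part entropy, (B, P)
    w = F.softmax(-ent, dim=-1)                      # confident parts weigh more
    p_agg = (w.unsqueeze(-1) * p).sum(1)             # aggregated parts, (B, C)
    fused = 0.5 * g + 0.5 * p_agg                    # illustrative mixing weight
    kl = F.kl_div(g.clamp_min(1e-8).log(), fused, reduction="batchmean")
    return fused, kl

fused, kl = fuse_predictions(torch.randn(4, 200), torch.randn(4, 6, 200))
print(fused.shape, kl.item())
```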

Jiaqi Li — Towards Mitigating Modality Bias in Vision-Language Models for Temporal Action Localization
Temporal Action Localization requires identifying both the boundaries and categories of actions in untrimmed videos. While vision-language models offer rich semantics to complement visual evidence, existing approaches tend to overemphasize linguistic priors at the expense of visual performance, leading to a pronounced modality bias. We propose ActionVLM, a vision-language aggregation framework that systematically mitigates modality bias in TAL. Our key insight is to preserve vision as the dominant signal while adaptively exploiting language only when beneficial. Experiments on THUMOS14 show that our model outperforms the state of the art by up to 3.2% mAP.

Computational Biology

Paraskevi Mylona — Modelling the impact of synaptic heterogeneity on information processing in neurons
Information transfer between neurons takes place at synapses. A neuron can have hundreds of synapses on its axon and convey diverse signals to the neural circuit. Synapses therefore do not just convey information but act as filters with distinctive properties. To understand how the signal is filtered, we first need to understand the mechanisms regulating vesicular release and how calcium signals are decoded and translated into complex release patterns. We employ a computational model that combines fast and slow calcium dynamics, integrating a reaction-diffusion model of the presynaptic side with global calcium kinetics and the inherent stochasticity of vesicular release.

Jack Bacon — Orientation Invariant Latent Space Clustering For Synthetic Cell Images
We propose an interactive learning approach to generate synthetic cell microscopy images. Recent advances in Variational Autoencoders and Generative Adversarial Networks enable clustering in the latent space to generate samples from different classes with distinct features. Current methods are, however, oversensitive to the orientation of training data, which is arbitrary in cell microscopy. We demonstrate that the inclusion of orientation-invariant architectures can eliminate this sensitivity and improve cluster accuracy. We then incorporate this methodology into existing hierarchical clustering techniques to reduce the cost of annotation for new biological datasets.

Natural Language Processing

Jakub Czarlinski — Towards User-Aligned Speech Synthesis using Reinforcement Learning
Modern text-to-speech systems predominantly rely on supervised learning, framing speech synthesis as a sequence-to-sequence prediction task with a learnable mapping from text to audio. While effective, this approach is limited in its ability to capture the nuances of human speech due to its lack of a feedback mechanism from the environment. Acoustic hints from a user’s response, including emotional cues, comprehension signals, or urgency, remain unutilised. We explore the steps required to transition into using reinforcement learning for speech synthesis, the potential benefits of this approach, and the challenges that need to be overcome for its long-term viability.

Phong Do — Prompt-as-Graph: A Structured Framework for Prompt Sensitivity Analysis
While prompting has established itself as the primary interface for steering language models, it remains an inexact science characterized by unpredictable brittleness. Seemingly innocuous edits can lead to disproportionate shifts in model behavior, making reliability difficult to guarantee. To move prompt engineering from an intuitive art toward a more rigorous discipline, we propose Prompt-as-Graph, a structured representation that decomposes prompt strings into typed, interacting components. Leveraging this abstraction, we formalize prompt robustness through the lens of Byzantine-fault tolerance and describe an evaluation protocol that quantifies robustness using metrics such as keep-correct rate and risk curves.
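
The keep-correct rate can be pinned down in a few lines (a sketch of one natural reading; the talk's exact definition may differ): restrict attention to the examples the model answered correctly under the original prompt and measure the fraction still correct after the perturbation.

```python
def keep_correct_rate(orig_correct, pert_correct):
    """Of the inputs the model got right under the original prompt, the
    fraction that stay right under the perturbed prompt."""
    kept = [p for o, p in zip(orig_correct, pert_correct) if o]
    return sum(kept) / len(kept) if kept else float("nan")

# One entry per evaluation example: correct (True) or not, per prompt variant.
orig = [True, True, True, False, True]
pert = [True, False, True, False, True]
print(keep_correct_rate(orig, pert))  # 0.75
```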

Danial Ataee — Machine Unlearning for LLMs
Large Language Models have rapidly become central to modern computing, with widespread adoption across academia, industry, and society. Their deployment, however, raises important concerns around privacy, safety, and regulatory compliance. In particular, LLMs may inadvertently retain and reproduce sensitive personal data, copyrighted content, or harmful behaviours learned during training. This talk reviews the motivations behind unlearning, outlines common threat and use-case scenarios, and discusses key algorithmic approaches, including fine-tuning-based, gradient-based, and data-influence methods.
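
The fine-tuning-based family mentioned above can be sketched in a few lines (a generic gradient-ascent objective for illustration, not any specific method from the talk): push the loss up on the forget set while anchoring behaviour on a retain set.

```python
import torch

def unlearning_loss(model, forget_batch, retain_batch, lam=1.0):
    """Fine-tuning-based unlearning sketch: ascend the loss on the forget
    set while a retain-set term anchors the rest of the model's behaviour."""
    ce = torch.nn.functional.cross_entropy
    forget = ce(model(forget_batch["x"]), forget_batch["y"])
    retain = ce(model(retain_batch["x"]), retain_batch["y"])
    return -forget + lam * retain  # minimising this pushes the forget loss up

model = torch.nn.Linear(16, 4)  # stand-in for an LLM
batch = lambda: {"x": torch.randn(8, 16), "y": torch.randint(0, 4, (8,))}
loss = unlearning_loss(model, batch(), batch())
loss.backward()
```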

Databases

Kitaek Lee — Vector Databases: Foundations, Design Trade-offs, and Open Research Challenges
Vector databases have emerged as a core infrastructure for modern data-intensive applications, particularly those involving similarity search over high-dimensional embeddings produced by machine learning models. Unlike traditional relational or key–value databases, vector databases are designed to support approximate nearest neighbour search at scale, balancing accuracy, latency, and resource efficiency. This survey presentation provides a structured overview of vector databases, focusing on their conceptual foundations and system-level design choices. We discuss common ANN indexing approaches, practical design considerations, and open research challenges such as scalability under limited memory budgets and hardware-aware optimisation.
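
A self-contained illustration of the accuracy/latency trade-off (random-hyperplane LSH, one classical ANN approach among those surveyed): vectors are bucketed by the sign pattern of a few random projections, and a query scans only its own bucket rather than the whole collection.

```python
import numpy as np

class SignLSH:
    """Random-hyperplane LSH: vectors hashing to the same bucket are likely
    to have high cosine similarity, so a query scans only one bucket."""

    def __init__(self, dim, n_bits=12, seed=0):
        self.planes = np.random.default_rng(seed).standard_normal((n_bits, dim))
        self.buckets = {}

    def _key(self, v):
        return tuple((self.planes @ v > 0).astype(int))

    def add(self, idx, v):
        self.buckets.setdefault(self._key(v), []).append((idx, v))

    def query(self, v, k=5):
        cands = self.buckets.get(self._key(v), [])
        cands.sort(key=lambda iv: -v @ iv[1]
                   / (np.linalg.norm(v) * np.linalg.norm(iv[1])))
        return [i for i, _ in cands[:k]]

index = SignLSH(dim=64)
data = np.random.default_rng(1).standard_normal((1000, 64))
for i, v in enumerate(data):
    index.add(i, v)
print(index.query(data[42]))  # should include 42 itself
```

More bits per key mean smaller buckets and faster queries but a higher chance of missing true neighbours — the same recall/latency dial that production systems expose.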

Others

Qiman Zhang — GSim*: A Scalable Model for Billion-Node Cross-Graph Similarity with Accuracy Guarantees
Assessing node similarity in graph structures is fundamental to numerous applications, including recommendation systems and web search. We propose GSim*, an enhanced similarity model that jointly considers both in- and out-neighbour interactions for more expressive structural matching. We further develop an efficient algorithm that maintains an adaptive low-dimensional similarity representation and introduces a novel iterative scheme. Experiments demonstrate 1–4 orders of magnitude speedup over state-of-the-art methods on billion-edge graphs while preserving exact accuracy.

Min Zhang — UPPR+: Scaling Uncertain Personalised PageRank Computation on Billion-Sized Graphs with Mutually Exclusive Edges
While Personalised PageRank is widely used for ranking nodes in certain (deterministic) graphs, research on PPR for uncertain graphs remains limited. To address this, we propose UPPR+, an efficient scheme to retrieve PPR on billion-scale uncertain graphs. We introduce an efficient approach that groups uncertain edges, removes duplicates in the source node set, and maps uncertainties into a small subspace using low-dimensional embeddings. We further devise a novel uncertain-subspace reduction method and accelerate the aggregation of PPRs over all possible worlds within the subspace via subset embedding. Empirical studies validate that UPPR+ is substantially faster than state-of-the-art competitors without compromising accuracy.
