Computer Science News
SIGMOD 2024 Test of Time Award for ‘PrivBayes’
The work of Professor Graham Cormode has been recognized with a “test of time” award. Each year, the ACM SIGMOD conference presents this award to the paper published at SIGMOD 10 to 12 years earlier that has had the biggest impact and so has passed the “test of time”. The 2014 paper “PrivBayes: Private Data Release via Bayesian Networks” (Jun Zhang, Graham Cormode, Cecilia M. Procopiuc, Divesh Srivastava, Xiaokui Xiao) was selected for this honour. The award will be presented at the 2024 ACM SIGMOD Conference in Santiago, Chile.
Summary of PrivBayes: The paper studies a fundamental problem in data privacy: given a relation R containing personal data, how can we release a synthetic version of R that preserves the statistical essence of R without compromising privacy? This was an open problem because of the curse of dimensionality: prior attempts struggled, incurring prohibitive computation costs and information loss when R contained more than half a dozen columns. The PrivBayes paper solves this problem under differential privacy. The key insight of PrivBayes lies in its use of Bayesian networks to decompose R into a set of smaller and more manageable relations. This decomposition allows the intricate statistical relationships within the data to be preserved in a lower-dimensional space. By focusing on the decomposed relations, PrivBayes is able to construct a statistical model that both protects individual privacy and retains the utility of the data. This model then serves as the foundation for generating a synthetic dataset that mirrors the original's statistical properties without the heavy computational costs or the accuracy degradation from which previous methods suffer.
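The decomposition idea can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the network structure is assumed to be fixed and public (PrivBayes also learns the structure under differential privacy), the privacy budget is split evenly across the low-dimensional marginals, neighbouring datasets are assumed to differ by adding or removing one row (so each count has sensitivity 1), and all names (`noisy_joint`, `fit`, `sample`, the toy attributes) are hypothetical.

```python
import random
from collections import Counter
from itertools import product


def laplace(scale, rng):
    # The difference of two i.i.d. exponentials is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_joint(rows, attrs, domains, epsilon, rng):
    """Noisy joint distribution over `attrs` via the Laplace mechanism."""
    counts = Counter(tuple(r[a] for a in attrs) for r in rows)
    cells = list(product(*(domains[a] for a in attrs)))
    # Assumption: add/remove-one-row neighbours, so the histogram has
    # L1 sensitivity 1 and the noise scale is 1 / epsilon.
    noisy = {c: max(counts[c] + laplace(1.0 / epsilon, rng), 0.0) for c in cells}
    total = sum(noisy.values())
    if total == 0:
        return {c: 1.0 / len(cells) for c in cells}
    return {c: v / total for c, v in noisy.items()}


def fit(rows, network, domains, epsilon, rng):
    """One noisy low-dimensional joint per (child, parents) pair."""
    eps_each = epsilon / len(network)  # sequential composition
    return {
        child: noisy_joint(rows, parents + (child,), domains, eps_each, rng)
        for child, parents in network
    }


def sample(model, network, domains, n, rng):
    """Draw synthetic rows attribute by attribute, in network order."""
    out = []
    for _ in range(n):
        row = {}
        for child, parents in network:
            key = tuple(row[p] for p in parents)
            weights = [model[child][key + (v,)] for v in domains[child]]
            if sum(weights) == 0:  # no noisy mass for this parent setting
                weights = [1.0] * len(domains[child])
            row[child] = rng.choices(domains[child], weights=weights)[0]
        out.append(row)
    return out


# Toy relation with three columns (invented data, for illustration only).
domains = {
    "age": ["young", "old"],
    "edu": ["school", "degree"],
    "income": ["low", "high"],
}
rows = (
    [{"age": "young", "edu": "school", "income": "low"}] * 60
    + [{"age": "young", "edu": "degree", "income": "high"}] * 40
    + [{"age": "old", "edu": "school", "income": "low"}] * 30
    + [{"age": "old", "edu": "degree", "income": "high"}] * 70
)

# Assumed structure: age -> edu -> income, listed parents-first.
network = [("age", ()), ("edu", ("age",)), ("income", ("edu",))]

rng = random.Random(0)
model = fit(rows, network, domains, epsilon=1.0, rng=rng)
synthetic = sample(model, network, domains, n=5, rng=rng)
```

Here the full relation would need a noisy table with 2 × 2 × 2 cells, whereas the sketch only perturbs tables with at most 4 cells each. With more columns the gap grows quickly, which is the motivation for working with low-dimensional pieces.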
Impact: PrivBayes was the first practical solution for privacy-preserving synthesis of relations. It has been widely adopted in commercial platforms and open-source tools. Notable examples include SAP’s Data Intelligence Cloud, which implements PrivBayes so that users can generate synthetic data for machine learning, and the following open-source data synthesis tools, which incorporate either PrivBayes or its variants:
- Reprosyn, by the Alan Turing Institute
- Synthcity, by the University of Cambridge
- DataSynthesizer, by the Data Responsibly Consortium
- Synthetic Data Gym, by DataCebo
- DPART, by Hazy
- SmartNoise, by OpenDP
PrivBayes is acknowledged in a number of patents held by major corporations such as Microsoft and SAP. Furthermore, a recent comparative study funded by the NIST Public Safety Communications Research Division concludes that PrivBayes is a data synthesis method that “data practitioners would most easily adopt”.
Most recently, PrivBayes has been adopted by Israel’s Ministry of Health to release statistics on live births, after an extensive comparison with alternative methods.
Apart from its practical impact, PrivBayes has catalyzed a large body of subsequent research, extending its influence beyond the data management field to areas such as security, machine learning, and healthcare. PrivBayes has made its mark in three distinct ways. First, it introduced an innovative approach for synthesizing relations through low-dimensional decomposition. This established the notion of “marginal-based methods” and directly influenced a number of subsequent methods, such as PrivMRF, MST and AIM, which are now considered state-of-the-art. Second, PrivBayes set a high standard for privacy-preserving data synthesis, and it has frequently been used as a benchmark for new privacy protection solutions. Third, as a practical solution for data synthesis, it has been adopted in various application domains, such as the generation of synthetic medical records, educational data (modelling outcomes for learning tasks), and curricula vitae (field, role, experience level, etc.). NIST has run high-profile competitions for private data generation. In 2018, an approach using PrivBayes came 3rd, while the winning entry extended it. In 2020, the winning approach used PrivMRF and the runner-up used MST; both methods are based on PrivBayes.