research-article

Workload-Driven Antijoin Cardinality Estimation

Authors:

Chris JermaineAuthors Info & Claims

ACM Transactions on Database Systems (TODS), Volume 40, Issue 3

Article No.: 16, Pages 1 - 41

https://doi.org/10.1145/2818178

Published: 23 October 2015 Publication History

Abstract

Antijoin cardinality estimation is among a handful of problems that has eluded accurate efficient solutions amenable to implementation in relational query optimizers. Given the widespread use of antijoin and subset-based queries in analytical workloads and the extensive research targeted at join cardinality estimation—a seemingly related problem—the lack of adequate solutions for antijoin cardinality estimation is intriguing. In this article, we introduce a novel sampling-based estimator for antijoin cardinality that (unlike existent estimators) provides sufficient accuracy and efficiency to be implemented in a query optimizer. The proposed estimator incorporates three novel ideas. First, we use prior workload information when learning a mixture superpopulation model of the data offline. Second, we design a Bayesian statistics framework that updates the superpopulation model according to the live queries, thus allowing the estimator to adapt dynamically to the online workload. Third, we develop an efficient algorithm for sampling from a hypergeometric distribution in order to generate Monte Carlo trials, without explicitly instantiating either the population or the sample. When put together, these ideas form the basis of an efficient antijoin cardinality estimator satisfying the strict requirements of a query optimizer, as shown by the extensive experimental results over synthetically-generated as well as massive TPC-H data.

References

[1]

A. Aboulnaga and S. Chaudhuri. 1999. Self-tuning histograms: Building histograms without looking at data. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 181--192.

Digital Library

[2]

S. Acharya, P.B. Gibbons, V. Poosala, and S. Ramaswamy. 1999. Join synopses for approximate query answering. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 275--286.

Digital Library

[3]

S. Agarwal, H. Milner, A. Kleiner, A. Talwalkar, M. Jordan, S. Madden, B. Mozafari, and I. Stoica. 2014. Knowing when you're wrong: Building fast and reliable approximate query processing systems. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 481--492.

Digital Library

[4]

S. Babu, P. Bizarro, and D. DeWitt. 2005. Proactive reoptimization. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 107--118. http://dx.doi.org/10.1145/1066157.1066171

Digital Library

[5]

T. Benaglia, D. Chauveau, D.R. Hunter, and D. Young. 2009. mixtools: An R package for analyzing finite mixture models. J. Stat. Softw. 32, 6 (2009), 1--29.

[6]

J.A. Bilmes. 1998. A Gentle Tutorial of the EM algorithm and its application to parameter estimation for gaussian mixture and hidden markov models. ftp://ftp.icsi.berkeley.edu/pub/techreports/1997/tr-97-021.pdf. (Last accessed September 2014.)

[7]

N. Bruno, S. Chaudhuri, and L. Gravano. 2001. STHoles: A multidimensional workload-aware histogram. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 211--222.

Digital Library

[8]

S. Chaudhuri. 1998. An overview of query optimization in relational systems. In Proceedings of the ACM PODS Symposium on Principles of Database Systems. 34--43.

Digital Library

[9]

S. Chaudhuri, R. Motwani, and V. Narasayya. 1999. On random sampling over joins. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 263--274.

Digital Library

[10]

M. Colgan. 2010. Optimizer transformations: Subquery unesting, Part 2. https://blogs.oracle.com/optimizer/entry/optimizer_transformations_subquery_unesting_part_2. (Last accessed September 2014.)

[11]

G. Cormode, M.N. Garofalakis, P.J. Haas, and C. Jermaine. 2012. Synopses for massive data: Samples, histograms, wavelets, sketches. Found. Trends Datab. 4, 1--3 (2012), 1--294.

Digital Library

[12]

A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Maximum-likelihood from incomplete data via the EM algorithm. J. Royal Statist. 39 (1977).

[13]

A. Dobra, C. Jermaine, F. Rusu, and F. Xu. 2009. Turbo-charging estimate convergence in DBO. Proc. VLDB 2, 1 (2009), 419--430.

Digital Library

[14]

S. Ganguly, P.B. Gibbons, Y. Matias, and A. Silberschatz. 1996. Bifocal sampling for skew-resistant join size estimation. SIGMOD Record 25, 2 (1996), 271--281.

Digital Library

[15]

A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. 2003. Bayesian data analysis. Chapman & Hall/CRC.

[16]

P.B. Gibbons. 2001. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In Proceedings of the VLDB International Conference on Very Large Data Bases. 541--550.

Digital Library

[17]

G. Graefe. 1995. The cascades framework for query optimization. IEEE Data Eng. Bulle. 18, 3 (1995), 19--29.

[18]

P.J. Haas, I.F. Ilyas, G.M. Lohman, and V. Markl. 2009. Discovering and exploiting statistical properties for query optimization in relational databases: A survey. Stat. Anal. Data Mining 1, 4 (2009), 223--250.

Digital Library

[19]

P.J. Haas, J.F. Naughton, S. Seshadri, and L. Stokes. 1995. Sampling-based estimation of the number of distinct values of an attribute. In Proceedings of the VLDB International Conference on Very Large Data Bases. 311--322.

Digital Library

[20]

P.J. Haas and A.N. Swami. 1992. Sequential sampling procedures for query size estimation. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 341--350.

Digital Library

[21]

Y.E. Ioannidis. 1996. Query optimization. ACM Comput. Surv. 28, 1 (1996), 121--123.

Digital Library

[22]

Y.E. Ioannidis. 2003. The history of histograms (abridged). In Proceedings of the VLDB International Conference on Very Large Data Bases. 19--30.

Digital Library

[23]

Y.E. Ioannidis and S. Christodoulakis. 1993. Optimal histograms for limiting worst-case error propagation in the size of join results. Trans. Datab. Syst. 18, 4 (1993), 709--748.

Digital Library

[24]

C. Jermaine, A. Dobra, A. Pol, and S. Joshi. 2005. Online estimation for subset-based SQL queries. In Proceedings of the VLDB International Conference on Very Large Data Bases. 745--756.

Digital Library

[25]

N.L. Johnson, S. Kotz, and N. Balakrishnan. 1996. Discrete Multivariate Distributions. John Wiley & Sons, Inc.

[26]

S. Joshi and C. Jermaine. 2009. Sampling-based estimators for subset-based queries. VLDB J. 18, 1 (2009), 181--202.

Digital Library

[27]

N. Kabra and D. DeWitt. 1998. Efficient mid-query reoptimization of suboptimal query execution plans. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 106--117.

Digital Library

[28]

R. Kaushik, J.F. Naughton, R. Ramakrishnan, and V.T. Chakaravarthy. 2005. Synopses for query optimization: A space-complexity perspective. Trans. Datab. Syst. 30, 4 (2005), 1102--1127.

Digital Library

[29]

R. Kaushik and D. Suciu. 2009. Consistent histograms in the presence of distinct value counts. Proc. VLDB 2, 1 (2009), 850--861.

Digital Library

[30]

G.M. Lohman. 2014. Is query optimization a “solved” problem? http://wp.sigmod.org/?p=1075. (Last accessed May 2014.)

[31]

V. Markl, G. Lohman, and V. Raman. 2003. LEO: An automatic query optimizer for DB2. IBM Syst. J. 42, 1 (2003), 98--106.

Digital Library

[32]

V. Markl, V. Raman, D.E. Simmen, G.M. Lohman, and H. Pirahesh. 2004. Robust query processing through progressive optimization. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 659--670.

Digital Library

[33]

G. Moerkotte, T. Neumann, and G. Steidl. 2009. Preventing bad plans by bounding the impact of cardinality estimation errors. PVLDB 2, 1 (2009), 982--993.

Digital Library

[34]

K.P. Murphy. 2006. Binomial and multinomial distributions. http://www.cs.ubc.ca/&sim;murphyk/Teaching/CSE40-Fall06/reading/bernoulli.pdf. (Last accessed September 2014.)

[35]

A. Pol and C. Jermaine. 2005. Relational confidence bounds are easy with the bootstrap. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 587--598.

Digital Library

[36]

V. Poosala, P.J. Haas, Y.E. Ioannidis, and E.J. Shekita. 1996. Improved histograms for selectivity estimation of range predicates. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 294--305.

Digital Library

[37]

C.P. Robert and G. Casella. 2005. Monte Carlo Statistical Methods. Springer.

Digital Library

[38]

R. Schrag. 2005. Speeding up queries with semi-joins and anti-joins: How Oracle evaluates EXISTS, NOT EXISTS, IN, and NOT IN. http://www.dbspecialists.com/files/presentations/semijoins.html. (Last accessed September 2014.)

[39]

P.G. Selinger, M.M. Astrahan, D.D. Chamberlin, R.A. Lorie, and T.G. Price. 1979. Access path selection in a relational database management system. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 23--34.

Digital Library

[40]

J. Spiegel and N. Polyzotis. 2009. TuG synopses for approximate query answering. Trans. Datab. Syst. 34, 1 (2009).

Digital Library

[41]

U. Srivastava, P.J. Haas, V. Markl, and N. Megiddo. 2006. ISOMER: Consistent histogram construction using query feedback. In Proceedings of the IEEE ICDE International Conference on Data Engineering. 39.

Digital Library

[42]

M. Stillger, G. Lohman, V. Markl, and M. Kandil. 2001. LEO—DB2's learning optimizer. In Proceedings of the VLDB International Conference on Very Large Data Bases. 19--28.

Digital Library

[43]

A. Swami and K.B. Schiefer. 1994. On the estimation of join result sizes. In Proceedings of the EDBT Extended Database Technology Conference. 287--300.

Digital Library

[44]

F. Yu, W. Hou, C. Luo, D. Che, and M. Zhu. 2013. CS2: A new database synopsis for query estimation. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 469--480.

Digital Library

[45]

K. Zeng, S. Gao, B. Mozafari, and C. Zaniolo. 2014. The analytical bootstrap: A new method for fast error estimation in approximate query processing. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 277--288.

Digital Library

Cited By

Alvarez-Ayllon APalomo-Duarte MDodero J(2019)Interactive Data Exploration of Distributed Raw Files: A Systematic Mapping StudyIEEE Access10.1109/ACCESS.2018.28822447(10691-10717)Online publication date: 2019
https://doi.org/10.1109/ACCESS.2018.2882244
Zhao ZChristensen RLi FHu XYi KDas GJermaine CBernstein P(2018)Random Sampling over Joins RevisitedProceedings of the 2018 International Conference on Management of Data10.1145/3183713.3183739(1525-1539)Online publication date: 27-May-2018
https://dl.acm.org/doi/10.1145/3183713.3183739

Index Terms

Workload-Driven Antijoin Cardinality Estimation

Recommendations

Pessimistic Cardinality Estimation: Tighter Upper Bounds for Intermediate Join Cardinalities
SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data

In this work we introduce a novel approach to the problem of cardinality estimation over multijoin queries. Our approach leveraging randomized hashing and data sketching to tighten these bounds beyond the current state of the art. We demonstrate that ...
ASM: Harmonizing Autoregressive Model, Sampling, and Multi-dimensional Statistics Merging for Cardinality Estimation
SIGMOD

Recent efforts in learned cardinality estimation (CE) have substantially improved estimation accuracy and query plans inside query optimizers. However, achieving decent efficiency, scalability, and the support of a wide range of queries at the same time, ...
Query Size Estimation for Joins Using Systematic Sampling

We propose a new approach to the estimation of query result sizes for join queries. The technique, which we have called “systematic sampling—SYSSMP”, is a novel variant of the sampling-based approach. A key novelty of the systematic sampling is that it ...

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Database Systems

ACM Transactions on Database Systems Volume 40, Issue 3

October 2015

247 pages

ISSN:0362-5915

EISSN:1557-4644

DOI:10.1145/2838914

Editor:
Christian S. Jensen
Aalborg University, Denmark

Issue’s Table of Contents

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 October 2015

Accepted: 01 June 2015

Revised: 01 December 2014

Received: 01 May 2014

Published in TODS Volume 40, Issue 3

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

2
Total Citations
View Citations
222
Total Downloads

Downloads (Last 12 months)6
Downloads (Last 6 weeks)1

Reflects downloads up to 15 Sep 2024

Other Metrics

View Author Metrics

Citations

Cited By

Alvarez-Ayllon APalomo-Duarte MDodero J(2019)Interactive Data Exploration of Distributed Raw Files: A Systematic Mapping StudyIEEE Access10.1109/ACCESS.2018.28822447(10691-10717)Online publication date: 2019
https://doi.org/10.1109/ACCESS.2018.2882244
Zhao ZChristensen RLi FHu XYi KDas GJermaine CBernstein P(2018)Random Sampling over Joins RevisitedProceedings of the 2018 International Conference on Management of Data10.1145/3183713.3183739(1525-1539)Online publication date: 27-May-2018
https://dl.acm.org/doi/10.1145/3183713.3183739

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents