skip to main content
research-article

Workload-Driven Antijoin Cardinality Estimation

Published: 23 October 2015 Publication History

Abstract

Antijoin cardinality estimation is among a handful of problems that has eluded accurate efficient solutions amenable to implementation in relational query optimizers. Given the widespread use of antijoin and subset-based queries in analytical workloads and the extensive research targeted at join cardinality estimation—a seemingly related problem—the lack of adequate solutions for antijoin cardinality estimation is intriguing. In this article, we introduce a novel sampling-based estimator for antijoin cardinality that (unlike existent estimators) provides sufficient accuracy and efficiency to be implemented in a query optimizer. The proposed estimator incorporates three novel ideas. First, we use prior workload information when learning a mixture superpopulation model of the data offline. Second, we design a Bayesian statistics framework that updates the superpopulation model according to the live queries, thus allowing the estimator to adapt dynamically to the online workload. Third, we develop an efficient algorithm for sampling from a hypergeometric distribution in order to generate Monte Carlo trials, without explicitly instantiating either the population or the sample. When put together, these ideas form the basis of an efficient antijoin cardinality estimator satisfying the strict requirements of a query optimizer, as shown by the extensive experimental results over synthetically-generated as well as massive TPC-H data.

References

[1]
A. Aboulnaga and S. Chaudhuri. 1999. Self-tuning histograms: Building histograms without looking at data. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 181--192.
[2]
S. Acharya, P.B. Gibbons, V. Poosala, and S. Ramaswamy. 1999. Join synopses for approximate query answering. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 275--286.
[3]
S. Agarwal, H. Milner, A. Kleiner, A. Talwalkar, M. Jordan, S. Madden, B. Mozafari, and I. Stoica. 2014. Knowing when you're wrong: Building fast and reliable approximate query processing systems. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 481--492.
[4]
S. Babu, P. Bizarro, and D. DeWitt. 2005. Proactive reoptimization. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 107--118. http://dx.doi.org/10.1145/1066157.1066171
[5]
T. Benaglia, D. Chauveau, D.R. Hunter, and D. Young. 2009. mixtools: An R package for analyzing finite mixture models. J. Stat. Softw. 32, 6 (2009), 1--29.
[6]
J.A. Bilmes. 1998. A Gentle Tutorial of the EM algorithm and its application to parameter estimation for gaussian mixture and hidden markov models. ftp://ftp.icsi.berkeley.edu/pub/techreports/1997/tr-97-021.pdf. (Last accessed September 2014.)
[7]
N. Bruno, S. Chaudhuri, and L. Gravano. 2001. STHoles: A multidimensional workload-aware histogram. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 211--222.
[8]
S. Chaudhuri. 1998. An overview of query optimization in relational systems. In Proceedings of the ACM PODS Symposium on Principles of Database Systems. 34--43.
[9]
S. Chaudhuri, R. Motwani, and V. Narasayya. 1999. On random sampling over joins. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 263--274.
[10]
M. Colgan. 2010. Optimizer transformations: Subquery unesting, Part 2. https://blogs.oracle.com/optimizer/entry/optimizer_transformations_subquery_unesting_part_2. (Last accessed September 2014.)
[11]
G. Cormode, M.N. Garofalakis, P.J. Haas, and C. Jermaine. 2012. Synopses for massive data: Samples, histograms, wavelets, sketches. Found. Trends Datab. 4, 1--3 (2012), 1--294.
[12]
A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Maximum-likelihood from incomplete data via the EM algorithm. J. Royal Statist. 39 (1977).
[13]
A. Dobra, C. Jermaine, F. Rusu, and F. Xu. 2009. Turbo-charging estimate convergence in DBO. Proc. VLDB 2, 1 (2009), 419--430.
[14]
S. Ganguly, P.B. Gibbons, Y. Matias, and A. Silberschatz. 1996. Bifocal sampling for skew-resistant join size estimation. SIGMOD Record 25, 2 (1996), 271--281.
[15]
A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. 2003. Bayesian data analysis. Chapman & Hall/CRC.
[16]
P.B. Gibbons. 2001. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In Proceedings of the VLDB International Conference on Very Large Data Bases. 541--550.
[17]
G. Graefe. 1995. The cascades framework for query optimization. IEEE Data Eng. Bulle. 18, 3 (1995), 19--29.
[18]
P.J. Haas, I.F. Ilyas, G.M. Lohman, and V. Markl. 2009. Discovering and exploiting statistical properties for query optimization in relational databases: A survey. Stat. Anal. Data Mining 1, 4 (2009), 223--250.
[19]
P.J. Haas, J.F. Naughton, S. Seshadri, and L. Stokes. 1995. Sampling-based estimation of the number of distinct values of an attribute. In Proceedings of the VLDB International Conference on Very Large Data Bases. 311--322.
[20]
P.J. Haas and A.N. Swami. 1992. Sequential sampling procedures for query size estimation. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 341--350.
[21]
Y.E. Ioannidis. 1996. Query optimization. ACM Comput. Surv. 28, 1 (1996), 121--123.
[22]
Y.E. Ioannidis. 2003. The history of histograms (abridged). In Proceedings of the VLDB International Conference on Very Large Data Bases. 19--30.
[23]
Y.E. Ioannidis and S. Christodoulakis. 1993. Optimal histograms for limiting worst-case error propagation in the size of join results. Trans. Datab. Syst. 18, 4 (1993), 709--748.
[24]
C. Jermaine, A. Dobra, A. Pol, and S. Joshi. 2005. Online estimation for subset-based SQL queries. In Proceedings of the VLDB International Conference on Very Large Data Bases. 745--756.
[25]
N.L. Johnson, S. Kotz, and N. Balakrishnan. 1996. Discrete Multivariate Distributions. John Wiley & Sons, Inc.
[26]
S. Joshi and C. Jermaine. 2009. Sampling-based estimators for subset-based queries. VLDB J. 18, 1 (2009), 181--202.
[27]
N. Kabra and D. DeWitt. 1998. Efficient mid-query reoptimization of suboptimal query execution plans. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 106--117.
[28]
R. Kaushik, J.F. Naughton, R. Ramakrishnan, and V.T. Chakaravarthy. 2005. Synopses for query optimization: A space-complexity perspective. Trans. Datab. Syst. 30, 4 (2005), 1102--1127.
[29]
R. Kaushik and D. Suciu. 2009. Consistent histograms in the presence of distinct value counts. Proc. VLDB 2, 1 (2009), 850--861.
[30]
G.M. Lohman. 2014. Is query optimization a “solved” problem? http://wp.sigmod.org/?p=1075. (Last accessed May 2014.)
[31]
V. Markl, G. Lohman, and V. Raman. 2003. LEO: An automatic query optimizer for DB2. IBM Syst. J. 42, 1 (2003), 98--106.
[32]
V. Markl, V. Raman, D.E. Simmen, G.M. Lohman, and H. Pirahesh. 2004. Robust query processing through progressive optimization. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 659--670.
[33]
G. Moerkotte, T. Neumann, and G. Steidl. 2009. Preventing bad plans by bounding the impact of cardinality estimation errors. PVLDB 2, 1 (2009), 982--993.
[34]
K.P. Murphy. 2006. Binomial and multinomial distributions. http://www.cs.ubc.ca/∼murphyk/Teaching/CSE40-Fall06/reading/bernoulli.pdf. (Last accessed September 2014.)
[35]
A. Pol and C. Jermaine. 2005. Relational confidence bounds are easy with the bootstrap. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 587--598.
[36]
V. Poosala, P.J. Haas, Y.E. Ioannidis, and E.J. Shekita. 1996. Improved histograms for selectivity estimation of range predicates. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 294--305.
[37]
C.P. Robert and G. Casella. 2005. Monte Carlo Statistical Methods. Springer.
[38]
R. Schrag. 2005. Speeding up queries with semi-joins and anti-joins: How Oracle evaluates EXISTS, NOT EXISTS, IN, and NOT IN. http://www.dbspecialists.com/files/presentations/semijoins.html. (Last accessed September 2014.)
[39]
P.G. Selinger, M.M. Astrahan, D.D. Chamberlin, R.A. Lorie, and T.G. Price. 1979. Access path selection in a relational database management system. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 23--34.
[40]
J. Spiegel and N. Polyzotis. 2009. TuG synopses for approximate query answering. Trans. Datab. Syst. 34, 1 (2009).
[41]
U. Srivastava, P.J. Haas, V. Markl, and N. Megiddo. 2006. ISOMER: Consistent histogram construction using query feedback. In Proceedings of the IEEE ICDE International Conference on Data Engineering. 39.
[42]
M. Stillger, G. Lohman, V. Markl, and M. Kandil. 2001. LEO—DB2's learning optimizer. In Proceedings of the VLDB International Conference on Very Large Data Bases. 19--28.
[43]
A. Swami and K.B. Schiefer. 1994. On the estimation of join result sizes. In Proceedings of the EDBT Extended Database Technology Conference. 287--300.
[44]
F. Yu, W. Hou, C. Luo, D. Che, and M. Zhu. 2013. CS2: A new database synopsis for query estimation. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 469--480.
[45]
K. Zeng, S. Gao, B. Mozafari, and C. Zaniolo. 2014. The analytical bootstrap: A new method for fast error estimation in approximate query processing. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 277--288.

Cited By

View all
  • (2019)Interactive Data Exploration of Distributed Raw Files: A Systematic Mapping StudyIEEE Access10.1109/ACCESS.2018.28822447(10691-10717)Online publication date: 2019
  • (2018)Random Sampling over Joins RevisitedProceedings of the 2018 International Conference on Management of Data10.1145/3183713.3183739(1525-1539)Online publication date: 27-May-2018

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Database Systems
ACM Transactions on Database Systems  Volume 40, Issue 3
October 2015
247 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/2838914
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 October 2015
Accepted: 01 June 2015
Revised: 01 December 2014
Received: 01 May 2014
Published in TODS Volume 40, Issue 3

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Antijoin operator
  2. Bayesian statistics
  3. Monte Carlo
  4. query optimization
  5. sampling

Qualifiers

  • Research-article
  • Research
  • Refereed

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)6
  • Downloads (Last 6 weeks)1
Reflects downloads up to 15 Sep 2024

Other Metrics

Citations

Cited By

View all
  • (2019)Interactive Data Exploration of Distributed Raw Files: A Systematic Mapping StudyIEEE Access10.1109/ACCESS.2018.28822447(10691-10717)Online publication date: 2019
  • (2018)Random Sampling over Joins RevisitedProceedings of the 2018 International Conference on Management of Data10.1145/3183713.3183739(1525-1539)Online publication date: 27-May-2018

View Options

Get Access

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media