research-article

Index compression is good, especially for random access

Authors:

Stefan Büttcher,

Charles L. A. ClarkeAuthors Info & Claims

CIKM '07: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management

Pages 761 - 770

https://doi.org/10.1145/1321440.1321546

Published: 06 November 2007 Publication History

Abstract

Index compression techniques are known to substantially decrease the storage requirements of a text retrieval system. As a side-effect, they may increase its retrieval performance by reducing disk I/O overhead. Despite this advantage, developers sometimes choose to store index data in uncompressed form, in order to not obstruct random access into each index term's postings list.

In this paper, we show that index compression does not harm random access performance. In fact, we demonstrate that, in some cases, random access into a term's postings list may be realized more efficiently if the list is stored in compressed form instead of uncompressed. This is regardless of whether the index is stored on disk or in main memory, since both types of storage - hard drives and RAM - do not support efficient random access in the first place.

References

[1]

V. N. Anh and A. Moffat. Inverted Index Compression using Word-Aligned Binary Codes. Information Retrieval, 8(1):151--166, January 2005.

Digital Library

[2]

E. A. Brewer. Combining Systems and Databases: A Search Engine Retrospective. In J. M. Hellerstein and M. Stonebraker, editors, Readings in Database Systems (4th edition), Cambridge, Massachusetts, 2005. MIT Press.

[3]

D. Burger, J. R. Goodman, and G. S. Sohi. Memory Systems. In The Computer Science and Engineering Handbook, pages 447--461. 1997.

[4]

S. Büttcher and C. L. A. Clarke. Unaligned Binary Codes for Index Compression in Schema-Independent Text Retrieval Systems. University of Waterloo Technical Report CS-2006-40, October 2006.

[5]

B. Cui, B. C. Ooi, J. Su, and K.-L. Tan. Main Memory Indexing: The Case for BD-Tree. IEEE Transactions on Knowledge and Data Engineering, 16(7):870--874, 2004.

Digital Library

[6]

E. D. Demaine, A. López-Ortiz, and J. I. Munro. Adaptive Set Intersections, Unions, and Differences. In Proceedings of the Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2000), pages 743--752, San Francisco, USA, January 2000.

Digital Library

[7]

E. D. Demaine, A. López-Ortiz, and J. I. Munro. Experiments on Adaptive Set Intersections for Text Retrieval Systems. In Revised Papers from the Third International Workshop on Algorithm Engineering and Experimentation (ALENEX 2001), pages 91--104, Washington, DC, USA, January 2001.

Digital Library

[8]

A. S. Fraenkel and S. T. Klein. Novel Compression of Sparse Bit-Strings - Preliminary Report. Combinatorial Algorithms on Words, NATO ASI Series, 12:169--183, 1985.

[9]

S. W. Golomb. Run-Length Encodings. IEEE Transactions on Information Theory, IT-12:399--401, July 1966.

Digital Library

[10]

R. A. Hankins and J. M. Patel. Effect of Node Size on the Performance of Cache-Conscious B+-Trees. In Proceedings of the 2003 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 283--294, San Diego, USA, June 2003.

Digital Library

[11]

S. Héman. Super-Scalar Database Compression between RAM and CPU Cache. Master's Thesis. University of Amsterdam. Amsterdam, The Netherlands, July 2005.

[12]

M. Kowarschik and C. Weiß. An Overview of Cache Optimization Techniques and Cache-Aware Numerical Algorithms. In Algorithms for Memory Hierarchies, Advanced Lectures, pages 213--232, March 2002.

[13]

A. Moffat and L. Stuiver. Binary Interpolative Coding for Effective Index Compression. Information Retrieval, 3(1):25--47, 2000.

Digital Library

[14]

A. Moffat and J. Zobel. Self-Indexing Inverted Files for Fast Text Retrieval. ACM Transactions on Information Systems, 14(4):349--379, October 1996.

Digital Library

[15]

Y. Perl, A. Itai, and H. Avni. Interpolation Search - A log log n Search. Communications of the ACM, 21(7):550--553, 1978.

Digital Library

[16]

J. Rao and K. A. Ross. Cache Conscious Indexing for Decision-Support in Main Memory. In Proceedings of 25th International Conference on Very Large Data Bases, pages 78--89, Edinburgh, UK, September 1999.

Digital Library

[17]

J. Rao and K. A. Ross. Making B+-Trees Cache Conscious in Main Memory. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 475--486, Dallas, USA, May 2000.

Digital Library

[18]

C. Ruemmler and J. Wilkes. An Introduction to Disk Drive Modeling. The Computer Journal, 27(3):17--28, 1994.

Digital Library

[19]

F. Scholer, H. E. Williams, J. Yiannis, and J. Zobel. Compression of Inverted Indexes for Fast Query Evaluation. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 222--229, Tampere, Finland, August 2002.

Digital Library

[20]

A. J. Smith. Bibliography and Reading on CPU Cache Memories and Related Topics. SIGARCH Computer Architecture News, 14(1):22--42, 1986.

Digital Library

[21]

H. Turtle and J. Flood. Query Evaluation: Strategies and Optimization. Information Processing & Management, 31(1):831--850, November 1995.

Digital Library

[22]

H. E. Williams and J. Zobel. Compressing Integers for Fast File Access. The Computer Journal, 42(3):193--201, 1999.

Digital Library

[23]

L. Xiao, X. Zhang, and S. A. Kubricht. Improving Memory Performance of Sorting Algorithms. ACM Journal of Experimental Algorithms, 5, 2000.

Digital Library

[24]

J. Zobel and A. Moffat. Adding Compression to a Full-Text Retrieval System. Software - Practice and Experience, 25(8):891--903, 1995.

Digital Library

[25]

J. Zobel and A. Moffat. Inverted Files for Text Search Engines. ACM Computing Surveys, 38(2):6, 2006.

Digital Library

[26]

M. Zukowski, S. Héman, N. Nes, and P. A. Boncz. Super-Scalar RAM-CPU Cache Compression. In ICDE, page 59, 2006.

Digital Library

Cited By

Zhang YFang HLi X(2019)Scalable Top-K Query Processing Using Graphics Processing UnitLanguages and Compilers for Parallel Computing10.1007/978-3-030-35225-7_16(240-261)Online publication date: 15-Nov-2019
https://doi.org/10.1007/978-3-030-35225-7_16
Mallia AOttaviano GPorciani ETonellotto NVenturini RKando NSakai TJoho HLi Hde Vries AWhite R(2017)Faster BlockMax WAND with Variable-sized BlocksProceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3077136.3080780(625-634)Online publication date: 7-Aug-2017
https://dl.acm.org/doi/10.1145/3077136.3080780
Pibiri GVenturini R(2017)Clustered Elias-Fano IndexesACM Transactions on Information Systems10.1145/305277336:1(1-33)Online publication date: 6-Apr-2017
https://dl.acm.org/doi/10.1145/3052773
Show More Cited By

Index Terms

Index compression is good, especially for random access
1. Information systems
  1. Information retrieval
    1. Evaluation of retrieval results

Recommendations

An efficient random access inverted index for information retrieval
WWW '10: Proceedings of the 19th international conference on World wide web

To improve query performance and space efficiency, an efficient random access blocked inverted index (RABI) is proposed. RABI divides an inverted list into blocks and compresses different part of each block with the corresponding encoding method to ...
An Experimental Study of Bitmap Compression vs. Inverted List Compression
SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data

Bitmap compression has been studied extensively in the database area and many efficient compression schemes were proposed, e.g., BBC, WAH, EWAH, and Roaring. Inverted list compression is also a well-studied topic in the information retrieval community ...
Using d-gap patterns for index compression
WWW '07: Proceedings of the 16th international conference on World Wide Web

Sequential patterns of d-gaps exist pervasively in inverted lists of Web document collection indices due to the cluster property. In this paper the information of d-gap sequential patterns is used as a new dimension for improving inverted index ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

CIKM '07: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management

November 2007

1048 pages

ISBN:9781595938039

DOI:10.1145/1321440

Co-chair:
Alberto H. F. Laender,
Conference Chairs:
André O. Falcão
Universidade de Lisboa, Portugal
,
Øystein Haug Olsen,
General Chair:
Mário J. Silva
(Universidade de Lisboa, Portugal)
,
Program Chairs:
Ricardo Baeza-Yates,
Deborah L. McGuinness,
Bjorn Olstad

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 November 2007

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

CIKM07

Sponsor:

CIKM07: Conference on Information and Knowledge Management

November 6 - 10, 2007

Lisbon, Portugal

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Upcoming Conference

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

31
Total Citations
View Citations
573
Total Downloads

Downloads (Last 12 months)2
Downloads (Last 6 weeks)1

Reflects downloads up to 15 Sep 2024

Other Metrics

View Author Metrics

Citations

Cited By

Zhang YFang HLi X(2019)Scalable Top-K Query Processing Using Graphics Processing UnitLanguages and Compilers for Parallel Computing10.1007/978-3-030-35225-7_16(240-261)Online publication date: 15-Nov-2019
https://doi.org/10.1007/978-3-030-35225-7_16
Mallia AOttaviano GPorciani ETonellotto NVenturini RKando NSakai TJoho HLi Hde Vries AWhite R(2017)Faster BlockMax WAND with Variable-sized BlocksProceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3077136.3080780(625-634)Online publication date: 7-Aug-2017
https://dl.acm.org/doi/10.1145/3077136.3080780
Pibiri GVenturini R(2017)Clustered Elias-Fano IndexesACM Transactions on Information Systems10.1145/305277336:1(1-33)Online publication date: 6-Apr-2017
https://dl.acm.org/doi/10.1145/3052773
Konow RNavarro GClarke CLópez-Ortíz A(2017)Inverted TreapsACM Transactions on Information Systems10.1145/300718635:3(1-45)Online publication date: 4-Jan-2017
https://dl.acm.org/doi/10.1145/3007186
Lin JTrotman A(2017)The role of index compression in score-at-a-time query evaluationInformation Retrieval Journal10.1007/s10791-016-9291-520:3(199-220)Online publication date: 25-Jan-2017
https://doi.org/10.1007/s10791-016-9291-5
Eknath VDhananjay G(2016)Emerging opportunities in Domain Specific SearchProceedings of the Second International Conference on Information and Communication Technology for Competitive Strategies10.1145/2905055.2905222(1-4)Online publication date: 4-Mar-2016
https://dl.acm.org/doi/10.1145/2905055.2905222
Lemire DBoytsov LKurz N(2016)SIMD compression and the intersection of sorted integersSoftware—Practice & Experience10.1002/spe.232646:6(723-749)Online publication date: 1-Jun-2016
https://dl.acm.org/doi/10.1002/spe.2326
Lemire DBoytsov L(2015)Decoding billions of integers per second through vectorizationSoftware—Practice & Experience10.1002/spe.220345:1(1-29)Online publication date: 1-Jan-2015
https://dl.acm.org/doi/10.1002/spe.2203
Gerasoulis A(2014)Web Search Engines:: Practice and ExperienceComputing Handbook, Third Edition10.1201/b16812-58(1-24)Online publication date: 8-May-2014
https://doi.org/10.1201/b16812-58
Wu HFang HLi JWang XGarofalakis MSoboroff ISuel TWang M(2014)Document Prioritization for Scalable Query ProcessingProceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management10.1145/2661829.2661914(1609-1618)Online publication date: 3-Nov-2014
https://dl.acm.org/doi/10.1145/2661829.2661914
Show More Cited By

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents