skip to main content
10.1145/1321440.1321546acmconferencesArticle/Chapter ViewAbstractPublication PagescikmConference Proceedingsconference-collections
research-article

Index compression is good, especially for random access

Published: 06 November 2007 Publication History

Abstract

Index compression techniques are known to substantially decrease the storage requirements of a text retrieval system. As a side-effect, they may increase its retrieval performance by reducing disk I/O overhead. Despite this advantage, developers sometimes choose to store index data in uncompressed form, in order to not obstruct random access into each index term's postings list.
In this paper, we show that index compression does not harm random access performance. In fact, we demonstrate that, in some cases, random access into a term's postings list may be realized more efficiently if the list is stored in compressed form instead of uncompressed. This is regardless of whether the index is stored on disk or in main memory, since both types of storage - hard drives and RAM - do not support efficient random access in the first place.

References

[1]
V. N. Anh and A. Moffat. Inverted Index Compression using Word-Aligned Binary Codes. Information Retrieval, 8(1):151--166, January 2005.
[2]
E. A. Brewer. Combining Systems and Databases: A Search Engine Retrospective. In J. M. Hellerstein and M. Stonebraker, editors, Readings in Database Systems (4th edition), Cambridge, Massachusetts, 2005. MIT Press.
[3]
D. Burger, J. R. Goodman, and G. S. Sohi. Memory Systems. In The Computer Science and Engineering Handbook, pages 447--461. 1997.
[4]
S. Büttcher and C. L. A. Clarke. Unaligned Binary Codes for Index Compression in Schema-Independent Text Retrieval Systems. University of Waterloo Technical Report CS-2006-40, October 2006.
[5]
B. Cui, B. C. Ooi, J. Su, and K.-L. Tan. Main Memory Indexing: The Case for BD-Tree. IEEE Transactions on Knowledge and Data Engineering, 16(7):870--874, 2004.
[6]
E. D. Demaine, A. López-Ortiz, and J. I. Munro. Adaptive Set Intersections, Unions, and Differences. In Proceedings of the Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2000), pages 743--752, San Francisco, USA, January 2000.
[7]
E. D. Demaine, A. López-Ortiz, and J. I. Munro. Experiments on Adaptive Set Intersections for Text Retrieval Systems. In Revised Papers from the Third International Workshop on Algorithm Engineering and Experimentation (ALENEX 2001), pages 91--104, Washington, DC, USA, January 2001.
[8]
A. S. Fraenkel and S. T. Klein. Novel Compression of Sparse Bit-Strings - Preliminary Report. Combinatorial Algorithms on Words, NATO ASI Series, 12:169--183, 1985.
[9]
S. W. Golomb. Run-Length Encodings. IEEE Transactions on Information Theory, IT-12:399--401, July 1966.
[10]
R. A. Hankins and J. M. Patel. Effect of Node Size on the Performance of Cache-Conscious B+-Trees. In Proceedings of the 2003 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 283--294, San Diego, USA, June 2003.
[11]
S. Héman. Super-Scalar Database Compression between RAM and CPU Cache. Master's Thesis. University of Amsterdam. Amsterdam, The Netherlands, July 2005.
[12]
M. Kowarschik and C. Weiß. An Overview of Cache Optimization Techniques and Cache-Aware Numerical Algorithms. In Algorithms for Memory Hierarchies, Advanced Lectures, pages 213--232, March 2002.
[13]
A. Moffat and L. Stuiver. Binary Interpolative Coding for Effective Index Compression. Information Retrieval, 3(1):25--47, 2000.
[14]
A. Moffat and J. Zobel. Self-Indexing Inverted Files for Fast Text Retrieval. ACM Transactions on Information Systems, 14(4):349--379, October 1996.
[15]
Y. Perl, A. Itai, and H. Avni. Interpolation Search - A log log n Search. Communications of the ACM, 21(7):550--553, 1978.
[16]
J. Rao and K. A. Ross. Cache Conscious Indexing for Decision-Support in Main Memory. In Proceedings of 25th International Conference on Very Large Data Bases, pages 78--89, Edinburgh, UK, September 1999.
[17]
J. Rao and K. A. Ross. Making B+-Trees Cache Conscious in Main Memory. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 475--486, Dallas, USA, May 2000.
[18]
C. Ruemmler and J. Wilkes. An Introduction to Disk Drive Modeling. The Computer Journal, 27(3):17--28, 1994.
[19]
F. Scholer, H. E. Williams, J. Yiannis, and J. Zobel. Compression of Inverted Indexes for Fast Query Evaluation. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 222--229, Tampere, Finland, August 2002.
[20]
A. J. Smith. Bibliography and Reading on CPU Cache Memories and Related Topics. SIGARCH Computer Architecture News, 14(1):22--42, 1986.
[21]
H. Turtle and J. Flood. Query Evaluation: Strategies and Optimization. Information Processing & Management, 31(1):831--850, November 1995.
[22]
H. E. Williams and J. Zobel. Compressing Integers for Fast File Access. The Computer Journal, 42(3):193--201, 1999.
[23]
L. Xiao, X. Zhang, and S. A. Kubricht. Improving Memory Performance of Sorting Algorithms. ACM Journal of Experimental Algorithms, 5, 2000.
[24]
J. Zobel and A. Moffat. Adding Compression to a Full-Text Retrieval System. Software - Practice and Experience, 25(8):891--903, 1995.
[25]
J. Zobel and A. Moffat. Inverted Files for Text Search Engines. ACM Computing Surveys, 38(2):6, 2006.
[26]
M. Zukowski, S. Héman, N. Nes, and P. A. Boncz. Super-Scalar RAM-CPU Cache Compression. In ICDE, page 59, 2006.

Cited By

View all
  • (2019)Scalable Top-K Query Processing Using Graphics Processing UnitLanguages and Compilers for Parallel Computing10.1007/978-3-030-35225-7_16(240-261)Online publication date: 15-Nov-2019
  • (2017)Faster BlockMax WAND with Variable-sized BlocksProceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3077136.3080780(625-634)Online publication date: 7-Aug-2017
  • (2017)Clustered Elias-Fano IndexesACM Transactions on Information Systems10.1145/305277336:1(1-33)Online publication date: 6-Apr-2017
  • Show More Cited By

Index Terms

  1. Index compression is good, especially for random access

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    CIKM '07: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management
    November 2007
    1048 pages
    ISBN:9781595938039
    DOI:10.1145/1321440
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 06 November 2007

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. index compression
    2. main memory
    3. ram
    4. random access

    Qualifiers

    • Research-article

    Conference

    CIKM07

    Acceptance Rates

    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

    Upcoming Conference

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)2
    • Downloads (Last 6 weeks)1
    Reflects downloads up to 15 Sep 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2019)Scalable Top-K Query Processing Using Graphics Processing UnitLanguages and Compilers for Parallel Computing10.1007/978-3-030-35225-7_16(240-261)Online publication date: 15-Nov-2019
    • (2017)Faster BlockMax WAND with Variable-sized BlocksProceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval10.1145/3077136.3080780(625-634)Online publication date: 7-Aug-2017
    • (2017)Clustered Elias-Fano IndexesACM Transactions on Information Systems10.1145/305277336:1(1-33)Online publication date: 6-Apr-2017
    • (2017)Inverted TreapsACM Transactions on Information Systems10.1145/300718635:3(1-45)Online publication date: 4-Jan-2017
    • (2017)The role of index compression in score-at-a-time query evaluationInformation Retrieval Journal10.1007/s10791-016-9291-520:3(199-220)Online publication date: 25-Jan-2017
    • (2016)Emerging opportunities in Domain Specific SearchProceedings of the Second International Conference on Information and Communication Technology for Competitive Strategies10.1145/2905055.2905222(1-4)Online publication date: 4-Mar-2016
    • (2016)SIMD compression and the intersection of sorted integersSoftware—Practice & Experience10.1002/spe.232646:6(723-749)Online publication date: 1-Jun-2016
    • (2015)Decoding billions of integers per second through vectorizationSoftware—Practice & Experience10.1002/spe.220345:1(1-29)Online publication date: 1-Jan-2015
    • (2014)Web Search Engines:: Practice and ExperienceComputing Handbook, Third Edition10.1201/b16812-58(1-24)Online publication date: 8-May-2014
    • (2014)Document Prioritization for Scalable Query ProcessingProceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management10.1145/2661829.2661914(1609-1618)Online publication date: 3-Nov-2014
    • Show More Cited By

    View Options

    Get Access

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media