skip to main content
10.1145/1031171.1031259acmconferencesArticle/Chapter ViewAbstractPublication PagescikmConference Proceedingsconference-collections
Article

Approximating the top-m passages in a parallel question answering system

Published: 13 November 2004 Publication History

Abstract

We examine the problem of retrieving the top-<i>m</i> ranked items from a large collection, randomly distributed across an <i>n</i>-node system. In order to retrieve the top <i>m</i> overall, we must retrieve the top <i>m</i> from the subcollection stored on each node and merge the results. However, if we are willing to accept a small probability that one or more of the top-<i>m</i> items may be missed, it is possible to reduce computation time by retrieving only the top <i>k < m</i> from each node. In this paper, we demonstrate that this simple observation can be exploited in a realistic application to produce a substantial efficiency improvement without compromising the quality of the retrieved results. To support our claim, we present a statistical model that predicts the impact of the optimization. The paper is structured around a specific application~---~passage retrieval for question answering~---~but the primary results are more broadly applicable.

References

[1]
Nicolas Bruno, Surajit Chaudhuri, and Luis Gravano. Top-k selection queries over relational databases: Mapping strategies and performance evaluation. ACM-Transactions on Database Systems, 27(2):153--187, 2002.
[2]
Michael J. Carey and Donald Kossmann. On saying "Enough already!" in SQL. In Proceedings of the 1997 ACM SIGMODIn ternational Conference on Management of Data, pages 219--230, 1997.
[3]
David Carmel, Yoelle S. Maarek, Matan Mandelbrod, Yosi Mass, and Aya Soffer. Searching XML documents via XML fragments. In 26nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 151--158, Toronto, Canada, July 2003.
[4]
C. L. A. Clarke, G. V. Cormack, T .R. Lynam, and E. L. Terra. Question answering by passage selection. In Tomek Strzalkowski and Sanda Harabagiu, editors, Advances in Open Domain Question Answering. Kluwer Academic Press, 2004.
[5]
Charles L. A. Clarke, Gordon V. Cormack, and Thomas R. Lynam. Exploiting redundancy in question answering. In 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 358--365, New Orleans, September 2001.
[6]
Charles L. A. Clarke, Philip L. Tilker, Allen Quoc-Luan Tran, Kevin Harris, and Antonio S. Cheng. A reliable storage management layer for distributed information retrieval systems. In 12th ACM International Conference on Information and Knowledge Management, pages 207--215, New Orleans, 2003.
[7]
David Hawking, Nick Craswell, and Paul Thistlewaite. Overview of the TREC-7 very large collection track. In 7th Text REtrieval Conference, Gaithersburg, Maryland, November 1998.
[8]
Ihab F. Ilyas, Rahul Shah, Walid G. Aref, Jeffrey Scott Vitter, and Ahmed K. Elmagarmid. Rank-aware query optimization. In Proceedings of the ACM SIGMOD/PODS 2004 Conference, Paris, 2004.
[9]
A. Ittycheriah, M. Franz, W-J Zhu, A. Ratnaparkhi, and R.J. Mammone. IBM's statistical question answering system. In 9th Text REtrieval Conference, Gaithersburg, Maryland, November 2000.
[10]
Jaap Kamps, Maarten de Rijke, and Börkur Sigurbjörnsson. Length normalization in XML retrieval. In 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Sheffield, UK, 2004.
[11]
Marcin Kaszkiel and Justin Zobel. Effective ranking with arbitrary passages. Journal of the American Society of Information Science, 52(4):344--364, 2001.
[12]
G. Kazai, M. Lalmas, and A.P. de Vries. The overlap problem in content-oriented XML retrieval evaluation. In 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Sheffield, UK, 2004.
[13]
G. Geunbae Lee, S. Lee, H. Jung, B-H Cho, C. Lee, B-K Kwak, J. Cha, D. Kim, J. An, J. Seo, H. Kim, and K. Kim. SiteQ: Engineering high performance QA system using lexico-semantic pattern matching and shallow NLP. In 10th Text REtrieval Conference, Gaithersburg, Maryland, November 2001.
[14]
M. Light, G. Mann, E. Rilo., and E. Breck. Analyses for elucidating current question answering technology. Journal of Natural Language Engineering, 7(4), 2001.
[15]
Das Madirakshi, R. Manmatha, and Edward M. Riseman. Indexing flower patent images using domain knowledge. IEEE Intelligent Systems, 14(5), 1999.
[16]
Christof Monz. From Document Retrieval to Question Answering. Ph. D. dissertation, Universiteit Van Amsterdam, December 2003.
[17]
Apostol Natsev, Yuan-Chi Chang, John R. Smith, Chung-Sheng Li, and Jeffrey Scott Vitter. Supporting incremental join queries on ranked inputs. In Proceedings of the 27th International Conference on Very Large Data Bases, pages 281--290, Rome, Italy, 2001.
[18]
Berthier Ribeiro-Neto, Edleno S. Moura, and Marden S. Neubert. Efficient distributed algorithms to build inverted files. In 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 105--112, Berkeley, California, August 1999.
[19]
Tomek Strzalkowski and Sanda Harabagiu, editors. Advances in Open Domain Question Answering. Kluwer Academic Press, 2004.
[20]
Stefanie Tellex, Boris Katz, Jimmy Lin, Aaron Fermandes, and Gregory Marton. Quantitative evaluation of passage retrieval algorithms for question answering. In 26nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 41--47, Toronto, Canada, July 2003.
[21]
Ellen M. Voorhees and Dawn Tice. Building a question answering test collection. In 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 200--207, Athens, August 2000.
[22]
Roger Weber and Michael Mlivoncic. Efficient region-based image retreival. In 12th ACM International Conference on Information and Knowledge Management, pages 69--76, New Orleans, 2003.
[23]
Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishing, San Francisco, second edition, 1999.

Cited By

View all
  • (2014)Lightweight approximate top-k for distributed settings2014 IEEE International Conference on Big Data (Big Data)10.1109/BigData.2014.7004313(835-844)Online publication date: Oct-2014
  • (2014)Interaction History Based Answer Formulation for Question AnsweringKnowledge Engineering and the Semantic Web10.1007/978-3-319-11716-4_11(128-139)Online publication date: 2014
  • (2012)Improving on-demand learning to rank through parallelismProceedings of the 13th international conference on Web Information Systems Engineering10.1007/978-3-642-35063-4_38(526-537)Online publication date: 28-Nov-2012
  • Show More Cited By

Index Terms

  1. Approximating the top-m passages in a parallel question answering system

      Recommendations

      Comments

      Information & Contributors

      Information

      Published In

      cover image ACM Conferences
      CIKM '04: Proceedings of the thirteenth ACM international conference on Information and knowledge management
      November 2004
      678 pages
      ISBN:1581138741
      DOI:10.1145/1031171
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Sponsors

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 13 November 2004

      Permissions

      Request permissions for this article.

      Check for updates

      Author Tags

      1. parallel information retrieval
      2. question answering
      3. ranking queries

      Qualifiers

      • Article

      Conference

      CIKM04
      Sponsor:
      CIKM04: Conference on Information and Knowledge Management
      November 8 - 13, 2004
      D.C., Washington, USA

      Acceptance Rates

      Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

      Upcoming Conference

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)3
      • Downloads (Last 6 weeks)0
      Reflects downloads up to 15 Sep 2024

      Other Metrics

      Citations

      Cited By

      View all
      • (2014)Lightweight approximate top-k for distributed settings2014 IEEE International Conference on Big Data (Big Data)10.1109/BigData.2014.7004313(835-844)Online publication date: Oct-2014
      • (2014)Interaction History Based Answer Formulation for Question AnsweringKnowledge Engineering and the Semantic Web10.1007/978-3-319-11716-4_11(128-139)Online publication date: 2014
      • (2012)Improving on-demand learning to rank through parallelismProceedings of the 13th international conference on Web Information Systems Engineering10.1007/978-3-642-35063-4_38(526-537)Online publication date: 28-Nov-2012
      • (2012)Fast and effective soft linksSoftware: Practice and Experience10.1002/spe.212243:5(577-593)Online publication date: 11-Apr-2012

      View Options

      Get Access

      Login options

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media