Research Article
DOI: 10.1145/3397271.3401033

Bayesian Inferential Risk Evaluation On Multiple IR Systems

Published: 25 July 2020

Abstract

Information retrieval (IR) ranking models in production systems continually evolve in response to user feedback, insights from research, and new developments. Rather than investing all engineering resources to produce a single challenger to the existing system, a commercial provider might choose to explore multiple new ranking models simultaneously. However, even small changes to a complex model can have unintended consequences. In particular, the per-topic effectiveness profile is likely to change, and even when an overall improvement is achieved, gains are rarely observed for every query, introducing the risk that some users or queries may be negatively impacted by the new model if deployed into production.
Risk adjustments that re-weight losses relative to gains, and thereby mitigate such behavior, are available when making one-to-one system comparisons, but not for one-to-many or many-to-one comparisons. Moreover, no existing IR evaluation methodology integrates priors from previous or alternative rankers in a homogeneous inferential framework. In this work, we propose a Bayesian approach in which multiple challengers are compared to a single champion. We also show that risk can be incorporated into the comparison, and demonstrate the benefits of doing so. Finally, we consider the alternative scenario commonly encountered in academic research, where a single challenger is compared against several previous champions.
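To make the abstract's notions of risk re-weighting and posterior comparison concrete, here is a minimal, self-contained Python sketch. It is not the paper's model: it assumes a URisk-style adjustment in which each loss counts (1 + alpha) times (in the spirit of Wang et al., SIGIR 2012), and a simple flat-prior Normal model for the mean per-topic delta, whereas the paper develops a richer hierarchical Bayesian framework. All scores and run names below are synthetic.

```python
# Illustrative sketch only -- not the paper's model. Assumes a URisk-style
# re-weighting of per-topic losses (in the spirit of Wang et al., SIGIR 2012)
# and a flat-prior Normal model for the mean delta; all data are synthetic.
import numpy as np

rng = np.random.default_rng(0)

def risk_adjusted_deltas(champion, challenger, alpha=1.0):
    """Per-topic score deltas, with each loss counted (1 + alpha) times."""
    delta = challenger - champion              # per-topic gain (+) or loss (-)
    losses = np.minimum(delta, 0.0)
    gains = np.maximum(delta, 0.0)
    return gains + (1.0 + alpha) * losses

def posterior_mean_samples(deltas, draws=10_000):
    """Draws from the posterior of the mean delta under a flat prior.

    With a flat prior on (mu, log sigma), the marginal posterior of mu is
    a Student-t with n-1 degrees of freedom, scaled and shifted."""
    n = len(deltas)
    m, s = deltas.mean(), deltas.std(ddof=1)
    return m + (s / np.sqrt(n)) * rng.standard_t(df=n - 1, size=draws)

# Synthetic per-topic effectiveness scores (e.g. AP) over 50 topics:
# one champion system and three hypothetical challenger runs.
champion = rng.beta(2, 5, size=50)
challengers = {
    f"run_{i}": np.clip(champion + rng.normal(0.01, 0.05, size=50), 0.0, 1.0)
    for i in range(3)
}

for name, run in challengers.items():
    post = posterior_mean_samples(risk_adjusted_deltas(champion, run, alpha=2.0))
    print(f"{name}: P(risk-adjusted mean delta > 0) = {(post > 0).mean():.3f}")
```

In the one-champion, many-challengers setting the paper targets, a hierarchical model would additionally pool information across challengers and topics, rather than fitting each challenger independently as above.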

Supplementary Material

MP4 File (3397271.3401033.mp4)
A video of the talk presented at SIGIR 2020.




Published In

SIGIR '20: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2020, 2548 pages
ISBN: 9781450380164
DOI: 10.1145/3397271
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. Bayesian inference
  2. credible intervals
  3. effectiveness metric
  4. multiple comparisons
  5. risk-biased evaluation

Qualifiers

  • Research-article

Funding Sources

  • Australian Research Council

Conference

SIGIR '20

Acceptance Rates

Overall acceptance rate: 792 of 3,983 submissions (20%)


Cited By

  • (2023) Distributionally-Informed Recommender System Evaluation. ACM Transactions on Recommender Systems, 2(1):1-27. DOI: 10.1145/3613455. Online publication date: 5 August 2023.
  • (2022) Risk-Sensitive Deep Neural Learning to Rank. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 803-813. DOI: 10.1145/3477495.3532056. Online publication date: 6 July 2022.
  • (2021) ERR is not C/W/L: Exploring the Relationship Between Expected Reciprocal Rank and Other Metrics. In Proceedings of the 2021 ACM SIGIR International Conference on Theory of Information Retrieval, pages 231-237. DOI: 10.1145/3471158.3472239. Online publication date: 11 July 2021.
  • (2021) Bayesian System Inference on Shallow Pools. In Advances in Information Retrieval, pages 209-215. DOI: 10.1007/978-3-030-72240-1_17. Online publication date: 28 March 2021.
