Research Article
DOI: 10.1145/3397271.3401033

Bayesian Inferential Risk Evaluation On Multiple IR Systems

Published: 25 July 2020

Abstract

Information retrieval (IR) ranking models in production systems continually evolve in response to user feedback, insights from research, and new developments. Rather than investing all engineering resources to produce a single challenger to the existing system, a commercial provider might choose to explore multiple new ranking models simultaneously. However, even small changes to a complex model can have unintended consequences. In particular, the per-topic effectiveness profile is likely to change, and even when an overall improvement is achieved, gains are rarely observed for every query, introducing the risk that some users or queries may be negatively impacted by the new model if deployed into production.
Risk adjustments that re-weight losses relative to gains, and thereby mitigate such behavior, are available when making one-to-one system comparisons, but not for one-to-many or many-to-one comparisons. Moreover, no existing IR evaluation methodology integrates priors from previous or alternative rankers in a homogeneous inferential framework. In this work, we propose a Bayesian approach in which multiple challengers are compared to a single champion. We also show that risk can be incorporated into the comparison, and demonstrate the benefits of doing so. Finally, we consider the alternative scenario commonly encountered in academic research, where a single challenger is compared against several previous champions.
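To make the abstract's notions of risk re-weighting and posterior comparison concrete, here is a minimal, self-contained Python sketch. It is not the paper's model: it assumes a URisk-style adjustment in which each loss counts (1 + alpha) times (in the spirit of Wang et al., SIGIR 2012), and a simple flat-prior Normal model for the mean per-topic delta, whereas the paper develops a richer hierarchical Bayesian framework. All scores and run names below are synthetic.

```python
# Illustrative sketch only -- not the paper's model. Assumes a URisk-style
# re-weighting of per-topic losses (in the spirit of Wang et al., SIGIR 2012)
# and a flat-prior Normal model for the mean delta; all data are synthetic.
import numpy as np

rng = np.random.default_rng(0)

def risk_adjusted_deltas(champion, challenger, alpha=1.0):
    """Per-topic score deltas, with each loss counted (1 + alpha) times."""
    delta = challenger - champion              # per-topic gain (+) or loss (-)
    losses = np.minimum(delta, 0.0)
    gains = np.maximum(delta, 0.0)
    return gains + (1.0 + alpha) * losses

def posterior_mean_samples(deltas, draws=10_000):
    """Draws from the posterior of the mean delta under a flat prior.

    With a flat prior on (mu, log sigma), the marginal posterior of mu is
    a Student-t with n-1 degrees of freedom, scaled and shifted."""
    n = len(deltas)
    m, s = deltas.mean(), deltas.std(ddof=1)
    return m + (s / np.sqrt(n)) * rng.standard_t(df=n - 1, size=draws)

# Synthetic per-topic effectiveness scores (e.g. AP) over 50 topics:
# one champion system and three hypothetical challenger runs.
champion = rng.beta(2, 5, size=50)
challengers = {
    f"run_{i}": np.clip(champion + rng.normal(0.01, 0.05, size=50), 0.0, 1.0)
    for i in range(3)
}

for name, run in challengers.items():
    post = posterior_mean_samples(risk_adjusted_deltas(champion, run, alpha=2.0))
    print(f"{name}: P(risk-adjusted mean delta > 0) = {(post > 0).mean():.3f}")
```

In the one-champion, many-challengers setting the paper targets, a hierarchical model would additionally pool information across challengers and topics, rather than fitting each challenger independently as above.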

Supplementary Material

MP4 File (3397271.3401033.mp4)
A video of the talk presented at SIGIR 2020.




Published In

SIGIR '20: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2020, 2548 pages
ISBN: 9781450380164
DOI: 10.1145/3397271
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. Bayesian inference
  2. credible intervals
  3. effectiveness metric
  4. multiple comparisons
  5. risk-biased evaluation

Qualifiers

  • Research-article

Funding Sources

  • Australian Research Council

Conference

SIGIR '20

Acceptance Rates

Overall acceptance rate: 792 of 3,983 submissions (20%)


Cited By

  • (2023) Distributionally-Informed Recommender System Evaluation. ACM Transactions on Recommender Systems, 2(1):1-27. DOI: 10.1145/3613455. Online publication date: 5 August 2023.
  • (2022) Risk-Sensitive Deep Neural Learning to Rank. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 803-813. DOI: 10.1145/3477495.3532056. Online publication date: 6 July 2022.
  • (2021) ERR is not C/W/L: Exploring the Relationship Between Expected Reciprocal Rank and Other Metrics. In Proceedings of the 2021 ACM SIGIR International Conference on Theory of Information Retrieval, pages 231-237. DOI: 10.1145/3471158.3472239. Online publication date: 11 July 2021.
  • (2021) Bayesian System Inference on Shallow Pools. In Advances in Information Retrieval, pages 209-215. DOI: 10.1007/978-3-030-72240-1_17. Online publication date: 28 March 2021.
