Showing posts with label replication. Show all posts

Monday, 27 October 2025

Problems with eLife's new article type: Replication studies

 

I was interested to receive an email from eLife last week, telling me that "As part of our commitment to open science, scientific rigour and transparency we now accept submissions of Replication Studies". Sounds good, I thought, but on reading further I became increasingly dismayed. I think the way this is set up dooms it to failure.

Back in 2023, when eLife created a storm by altering its publishing model, I argued that they would not achieve their aims unless they changed the basis for selecting articles for peer review.  Although eLife has abandoned the traditional binary outcome where articles are accepted or rejected, there is still a decision delegated to editors, which is whether the article is selected for peer review. My suggestion was that they should adopt results-blind selection. This avoids the bias that favours articles with positive results. Publication bias leaves well-conducted studies with null results to languish unpublished. This matters because a cumulative science should include negative as well as positive findings. Nissen et al (2016) showed how publication bias leads to what they termed the "canonization of false facts". It is a universal phenomenon and I doubt that eLife editors are immune to it.

One method of avoiding publication bias is the Registered Reports format, whereby the introduction, methods and analysis plan are peer reviewed before data are collected, with the article accepted in principle by a journal, provided the researchers do the study as planned, or give adequate reasons for deviating from the protocol. In my experience, biomedical researchers are highly resistant to Registered Reports; they tell me the approach is incompatible with how they work, which involves development of ideas and methods in the course of doing a study 🙄. However, that argument does not hold for replication studies, where the idea is to reproduce the methods of an existing published study. Indeed, eLife took a pioneering stand in hosting the Reproducibility Project on Cancer Biology (Errington et al., 2021a). So it is disappointing that their current proposal is for a much more timid approach.

First of all, it sounds as if the plan is for individuals to complete a replication study before submitting  it to eLife. This means that we'll be up against publication bias all over again - the likelihood of an editor selecting a study for peer review will depend not on the strength and suitability of methods but on whether the replication was "successful".

Second, the eLife instructions state: "The authors should work closely with the authors of the original study and summarize their interactions with the original authors as part of a cover letter". It seems entirely appropriate to liaise with original authors to ensure that materials and methods are suitable for replication. However, we know from the Cancer Biology Replication project that many authors were unresponsive or even obstructive when asked to give advice on a replication study (Errington et al, 2021b). Many attempts at replication failed because it was impossible to work out exactly what had been done in the original study or because authors would not share materials. In effect, then, the current eLife approach to replication studies gives original authors the ability to veto any replication attempt.

I can't think who on earth would take the risk of doing a replication study under these conditions. Many scientists already regard replication as an inferior type of research activity, for which replicators get little credit - and potential abuse. Furthermore, it is difficult to get funds for replication studies, because they are deemed insufficiently novel. Yet, previous large-scale replication studies have found that direct replications often fail to find original effects, or replicate the effect but with a smaller effect size. So you make yourself unpopular by attempting a replication, and then when you come to try to submit it to eLife you are told it won't be peer reviewed because you obtained a null effect.

I think that we won't get replication studies in biosciences unless they are explicitly incentivised - and judged on their methodological quality rather than their results. Meanwhile, because of the field's obsession with novelty, research progress stalls as people keep trying to build on results without knowing whether they provide a solid foundation.

My prediction: give it a year, and see how many Replication Studies have been accepted for peer review in eLife. I'll be surprised if it's more than zero.

References

Errington, T. M., Mathur, M., Soderberg, C. K., Denis, A., Perfito, N., Iorns, E., & Nosek, B. A. (2021a). Investigating the replicability of preclinical cancer biology. eLife, 10, e71601. https://doi.org/10.7554/eLife.71601

Errington, T. M., Denis, A., Perfito, N., Iorns, E., & Nosek, B. A. (2021b). Challenges for assessing replicability in preclinical cancer biology. eLife, 10, e67995. https://doi.org/10.7554/eLife.67995

Nissen, S. B., Magidson, T., Gross, K., & Bergstrom, C. T. (2016). Publication bias and the canonization of false facts. eLife, 5, e21451. https://doi.org/10.7554/eLife.21451

Monday, 20 October 2025

A LEAP into the future, or off a cliff: Wellcome LEAP's new $50M program

A few days ago, I saw this post on LinkedIn:

How does the gut microbiome shape early brain development? That’s what FORM, a new $50 million programme from Wellcome Leap, aims to answer. Critically, it wants to identify the role of the microbiome in autism and other neurological disorders. Applicants from universities, companies and non-profits are invited to submit project proposals by 14 November. 

The full programme announcement for FORM (Foundations of a Resilient Microbiome) that you can download here hops around citing various references that indicate the microbiome is important for early development and can be influenced by factors such as antibiotics. So far, so uncontroversial. But then the topic of autism is introduced.

First we hear that autism diagnoses have increased. Then, a section devoid of references states:

 Many have attributed this increase to expanded surveillance, broadening of diagnostic categories (to include milder autism-related difficulties), or increased public awareness. While all are true, the significance of the increase suggests other rising risk factors may also be contributing. 

We are then told that this can't be due to genes, because they are pretty stable in populations, and so we seem led remorselessly to the conclusion that it must be an environmental factor - and what better culprit could there be than the microbiome?

If I were going to predicate a $50 million research program on that premise, I'd do a bit more research into those studies on the increase in diagnosis. Diagnostic criteria have changed radically, so children who in the past would have had other diagnoses, or no diagnosis, are now encompassed within autism. Furthermore, there is wider understanding of autism, and a diagnosis can bring with it educational support, which can be a reason why parents will seek a diagnosis. Here's a simple explainer that I wrote in 2012, and subsequent studies by Lundström et al (2015), Cardinal et al (2016) and Zeidan et al (2022).

The fragments of supportive evidence that are provided for the autism/microbiome link seem cherry-picked and are not impressive. At least one claim seems just plain wrong: "Babies exposed to antibiotics in the first 6 months of life may be twice as likely to develop ASD as those exposed later." The cited paper by Azad et al (2016) doesn't mention autism, and when I searched for another source, I found a solid-looking study claiming no association. Other cited results are the kinds of findings that you get if you test for so many associations that some are bound to come up by chance. The handful of animal model studies that are cited have been criticised on methodological grounds.

Things go more seriously off the rails when specific quantitative goals are set for the program: 

Autism currently affects about 3.2% of children. To identify what proportion of these cases may be attributable to gut microbiome dysfunction, we will need objective biomarkers that can detect the dysfunction with high accuracy (balanced accuracy >90%). Establishing this will require a large cohort — likely more than 15,000 children — to ensure statistical power. With that sample size, we can reliably estimate whether microbiome dysfunction accounts for as much as 50% of ASD cases (around 1.6% of all children) or as little as 10% (about 0.3%).

Three things about this:

  1. In a footnote it is noted that "Severe autism with an established genetic origin (about 10–20%) and mild/ moderate ASD (~40%) fall outside the scope of this program" - these estimates don't seem to take that into account. And it's not clear if the databases that will be used for the analysis actually allow one to distinguish autism subtypes.
  2. You can't establish causality from observational data. As has been shown by Yap et al (2021), the microbiome is influenced by specific dietary preferences of autistic children.
  3. It is assumed that the lower bound is that microbiome dysfunction explains 10% of cases. This shows remarkable commitment to a causal hypothesis that has no solid evidence: a realistic lower bound would be zero.

As regards the plans in "Thrust 2" to have a diagnostic set of biomarkers that will predict severe autism-related difficulties, there are so many issues here that I recommend reading previous blogposts I wrote on this topic, here and here. In brief, screening is only effective if there is a strong association between biomarkers and outcomes and if the biomarker measures are stable. Even if those conditions are met, if the base rate of the condition (autism) is low, you will be overwhelmed with false positives.
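The arithmetic behind that false-positive worry is easy to check with Bayes' rule. Here is a minimal sketch in Python, using the 3.2% prevalence figure from the FORM announcement and taking "balanced accuracy >90%" to mean sensitivity = specificity = 0.90 (my assumption; the announcement does not break the figure down):

```python
# Positive predictive value (PPV) of a screening test under a low base rate.
# Assumed figures: prevalence 3.2% (from the FORM announcement);
# sensitivity = specificity = 0.90 is one reading of "balanced accuracy >90%".

def ppv(prevalence: float, sensitivity: float, specificity: float) -> float:
    """Probability that a positive screen is a true case (Bayes' rule)."""
    true_pos = prevalence * sensitivity          # true positives per child screened
    false_pos = (1 - prevalence) * (1 - specificity)  # false positives per child
    return true_pos / (true_pos + false_pos)

print(f"PPV at 3.2% prevalence: {ppv(0.032, 0.90, 0.90):.0%}")
# At 3.2% prevalence, roughly three out of four positive screens are false alarms.

# If microbiome-linked autism accounted for only half of cases (~1.6% of children),
# the predictive value of a positive screen drops further:
print(f"PPV at 1.6% prevalence: {ppv(0.016, 0.90, 0.90):.0%}")
```

Even a test with 90% sensitivity and specificity, applied at these base rates, would flag far more unaffected children than affected ones - which is the false-positive problem in a nutshell.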

I was surprised that a reputable funding body was associated with this program, so I wanted to find out more about Wellcome LEAP.  They are a U.S.-based non-profit organization founded by the Wellcome Trust that:

builds bold, unconventional programs, and funds them at scale. Programs optimized to deliver breakthroughs in human health over 5 – 10 years and demonstrate seemingly impossible results on seemingly impossible timelines.

No doubt I'm too conventional, but I'm nervous of "seemingly impossible" things. If they are claimed, I am suspicious, especially if this occurs on "seemingly impossible timelines".

Reading on, I can see lots of things to like about LEAP. The idea is to cut time spent in bureaucratic processes of setting up grants, and to bring together networks of researchers from different institutions and different disciplines who can work together to solve problems that involve large and complex datasets, and check generalisability of findings. This kind of international collaborative approach that involves diverse populations is a definite bonus of LEAP.

The worrying bit was the emphasis on speed - especially since this was coupled with an expectation that the results would be commercialised.

This is clarified here:

Wellcome Leap anticipates that it will normally further our mission (and the organization’s mission) to commercialize the results of Wellcome Leap-funded research. If we determine that a Performer’s organization is not making appropriate efforts to further commercialization, either itself or through a third party (e.g., a licensee), we have the option to request a meeting and a remedial commercialization plan to address the issue.

Given that I have in the past written in favour of slow science, it's perhaps not surprising that this funding model doesn't appeal to me. The thing is that not all delays in scientific progress are down to bureaucracy or timidity. A major obstacle to progress is time wasted trying to build on prior research findings that prove to be illusory.

Science should be cumulative, which means we should be able to proceed with confidence and assume that published research is robust. The likelihood of this being the case is low if we are dealing with complex multidimensional systems, where the temptation is to just hunt around until we find something that looks exciting, embellish it with a few statistical credentials and claim we have a novel result.  As Chin (2025) has argued, when put under pressure to deliver speedy results, scientists may be forced into a position of hyping their findings, cutting corners, and reporting only favourable results. 

I was frankly dismayed to read that the Program Director of FORM, in her prior role as Program Director for the Wellcome Leap “First 1000 Days” (1kD) initiative:

led the delivery of multiple new product opportunities to improve cognitive development in the first 1000 days of a child’s life — including a breakthrough microbiome-directed diagnostic and therapeutic now positioned for commercialization

The 1kD initiative was funded in summer of 2021. So it seems that in four years a microbiome-directed "diagnostic and therapeutic" has been developed and validated. I'm afraid that without solid published evidence, this looks like NeuroPointDX all over again. 

Wellcome LEAP programs are focused around "What If" questions. My question is "What if there is no association between autism and microbiome dysfunction?" All three "thrusts" of this program depend on there being an association. My suggestion is to set aside 1% of the funds for this program for pre-registered replications of the studies that are cited as foundational for the research. I know the idea is to fund high-risk, unconventional research, but a lot of time and money could be saved by checking that the foundations are solid before building an edifice on this premise. 


Tuesday, 10 September 2019

Responding to the replication crisis: reflections on Metascience2019

Talk by Yang Yang at Metascience 2019
I'm just back from MetaScience 2019. It was an immense privilege to be invited to speak at such a stimulating and timely meeting, and I would like to thank the Fetzer Franklin Fund, who not only generously funded the meeting, but also ensured the impressively smooth running of a packed schedule. The organisers - Brian Nosek, Jonathan Schooler, Jon Krosnick, Leif Nelson and Jan Walleczek - did a great job in bringing together speakers on a range of topics, and the quality of talks was outstanding. For me, highlights were hearing great presentations from people well outside my usual orbit, such as Melissa Schilling on 'Where do breakthrough ideas come from?', Carl Bergstrom on modelling grant funding systems, and Cailin O'Connor on scientific polarisation.

The talks were recorded, but I gather it may be some months before the film is available. Meanwhile, slides of many of the presentations are available here, and there is a copious Twitter stream on the hashtag #metascience2019. Special thanks are due to Joseph Fridman (@joseph_fridman): if you look at his timeline, you can pretty well reconstruct the entire meeting from live tweets. Noah Haber (@NoahHaber) also deserves special mention for extensive commentary, including a post-conference reflection starting here.  It is a sign of a successful meeting, I think, if it gets people, like Noah, raising more general questions about the direction the field is going in, and it is in that spirit I would like to share some of my own thoughts.

In the past 15 years or so, we have made enormous progress in documenting problems with credibility of research findings, not just in psychology, but in many areas of science. Metascience studies have helped us quantify the extent of the problem and begun to shed light on the underlying causes. We now have to confront the question of what we do next. That would seem to be a no-brainer: we need to concentrate on fixing the problem. But there is a real danger of rushing in with well-intentioned solutions that may be ineffective at best or have unintended consequences at worst.

One question is whether we should be continuing with a focus on replication studies. Noah Haber was critical of the number of talks that focused on replication, but I had a rather different take on this: it depends on what the purpose of a replication study is. I think further replication initiatives, in the style of the original Reproducibility Project, can be invaluable in highlighting problems (or not) in a field. Tim Errington's talk about the Cancer Biology Reproducibility Project demonstrated beautifully how a systematic attempt to replicate findings can reveal major problems in a field. Studies in this area are often dependent on specialised procedures and materials, which are either poorly described or unavailable. In such circumstances it becomes impossible for other labs to reproduce the methods, let alone replicate the results. The mindset of many researchers in this area is also unhelpful – the sense is that competition dominates, and open science ideals are not part of the training of scientists. But these are problems that can be fixed.

As was evident from my questions after the talk, I was less enthused by the idea of doing a large replication of Daryl Bem's studies on extra-sensory perception. Zoltán Kekecs and his team have put in a huge amount of work to ensure that this study meets the highest standards of rigour, and it is a model of collaborative planning, ensuring input into the research questions and design from those with very different prior beliefs. I just wondered what the point was. If you want to put in all that time, money and effort, wouldn't it be better to investigate a hypothesis about something that doesn't contradict the laws of physics? There were two responses to this. Zoltán's view was that the study would tell us more than whether or not precognition exists: it would provide a model of methods that could be extended to other questions. That seems reasonable: some of the innovations, in terms of automated methods and collaborative working, could be applied in other contexts to ensure original research was done to the highest standards. Jonathan Schooler, on the other hand, felt it was unscientific of me to prejudge the question, given a large previous literature of positive findings on ESP, including a meta-analysis. Given that I come from a field where there are numerous phenomena that have been debunked after years of apparent positive evidence, I was not swayed by this argument. (See for instance this blogpost on 5-HTTLPR and depression). If the study by Kekecs et al sets such a high standard that the results will be treated as definitive, then I guess it might be worthwhile. But somehow I doubt that a null finding in this study will convince believers to abandon this line of work.

Another major concern I had was the widespread reliance on proxy indicators of research quality. One talk that exemplified this was Yang Yang's presentation on machine intelligence approaches to predicting replicability of studies. He started by noting that non-replicable results get cited just as much as replicable ones: a depressing finding indeed, and one that motivated the study he reported. His talk was clever at many levels. It was ingenious to use the existing results from the Reproducibility Project as a database that could be mined to identify characteristics of results that replicated. I'm not qualified to comment on the machine learning approach, which involved using ngrams extracted from texts to predict a binary category of replicable or not. But implicit in this study was the idea that the results from this exercise could be useful in future in helping us identify, just on the basis of textual analysis, which studies were likely to be replicable.

Now, this seems misguided on several levels. For a start, as we know from the field of medical screening, the usefulness of a screening test depends on the base rate of the condition you are screening for, the extent to which the sample you develop the test on is representative of the population, and the accuracy of prediction. I would be frankly amazed if the results of this exercise yielded a useful screener. But even if they did, then Goodhart's law would kick in: as soon as researchers became aware that there was a formula being used to predict how replicable their research was, they'd write their papers in a way that would maximise their score. One can even imagine whole new companies springing up who would take your low-scoring research paper and, for a price, revise it to get a better score. I somehow don't think this would benefit science. In defence of this approach, it was argued that it would allow us to identify characteristics of replicable work, and encourage people to emulate these. But this seems back-to-front logic. Why try to optimise an indirect, weak proxy for what makes good science (ngram characteristics of the write-up) rather than optimising, erm, good scientific practices. Recommended readings in this area include Philip Stark's short piece on Preproducibility, as well as Florian Markowetz's 'Five selfish reasons to work reproducibly'.

My reservations here are an extension of broader concerns about reliance on text-mining in meta-science (see e.g. https://peerj.com/articles/1715/). We have this wonderful ability to pull in mountains of data from online literature to see patterns that might be undetectable otherwise. But ultimately, the information that we extract cannot give more than a superficial sense of the content. It seems sometimes that we're moving to a situation where science will be done by bots, leaving the human brain out of the process altogether. This would, to my mind, be a mistake.









Friday, 9 February 2018

Improving reproducibility: the future is with the young


I've recently had the pleasure of reviewing the applications to a course on Advanced Methods for Reproducible Science that I'm running in April together with Marcus Munafò and Chris Chambers. We take a broad definition of 'Reproducibility' and cover not only ways to ensure that code and data are available for those who wish to reproduce experimental results, but also focus on how to design, analyse and pre-register studies to give replicable and generalisable findings.

There is a strong sense of change in the air. Last year, most applicants were psychologists, even though we prioritised applications in biomedical sciences, as we are funded by the Biotechnology and Biological Sciences Research Council and European College of Neuropsychopharmacology. The sense was that issues of reproducibility were not so high on the radar of disciplines outside psychology. This year things are different. We again attracted a fair number of psychologists, but we also have applicants from fields as diverse as gene expression, immunology, stem cells, anthropology, pharmacology and bioinformatics.

One thing that came across loud and clear in the letters of application to the course was dissatisfaction with the status quo. I've argued before that we have a duty to sort out poor reproducibility because it leads to enormous waste of time and talent of those who try to build on a glitzy but non-replicable result. I've edited these quotes to avoid identifying the authors, but these comments – all from PhD students or postdocs in a range of disciplines - illustrate my point:
  • 'I wanted to replicate the results of an influential intervention that has been widely adopted. Remarkably, no systematic evidence has ever been published that the approach actually works. So far, it has been extremely difficult to establish contact with initial investigators or find out how to get hold of the original data for re-analysis.' 

  • 'I attempted a replication of a widely-cited study, which failed. Although I first attributed it to a difference between experimental materials in the two studies, I am no longer sure this is the explanation.' 

  • 'I planned to use the methods of a widely cited study for a novel piece of research. The results of this previous study were strong, published in a high impact journal, and the methods apparently straightforward to implement, so this seemed like the perfect approach to test our predictions. Unfortunately, I was never able to capture the previously observed effect.' 

  • 'After working for several years in this area, I have come to the conclusion that much of the research may not be reproducible. Much of it is conducted with extremely small sample sizes, reporting implausibly large effect sizes.' 

  • 'My field is plagued by irreproducibility. Even at this early point in my career, I have been affected in my own work by this issue and I believe it would be difficult to find someone who has not themselves had some relation to the topic.' 

  • 'At the faculty I work in, I have witnessed that many people are still confused about or unaware of the very basics of reproducible research.'

Clearly, we can't generalise to all early-career researchers: those who have applied for the course are a self-selected bunch. Indeed, some of them are already trying to adopt reproducible practices, and to bring about change to the local scientific environment. I hope, though, that what we are seeing is just the beginning of a groundswell of dissatisfaction with the status quo. As Chris Chambers suggested in this podcast, I think that change will come more from the grassroots than from established scientists.

We anticipate that the greater diversity of subjects covered this year will make the course far more challenging for the tutors, but we expect it will also make it even more stimulating and fun than last year (if that is possible!). The course lasts several days and interactions between people are as important as the course content in making it work. I'm pretty sure that the problems and solutions from my own field have relevance for other types of data and methods, but I anticipate I will learn a lot from considering the challenges encountered in other disciplines.

Training early career researchers in reproducible methods does not just benefit them: those who attended the course last year have become enthusiastic advocates for reproducibility, with impacts extending beyond their local labs. We are optimistic that as the benefits of reproducible working become more widely known, the face of science will change so that fewer young people will find their careers stalled because they trusted non-replicable results.

Friday, 16 December 2016

When is a replication not a replication?


Replication studies have been much in the news lately, particularly in the field of psychology, where a great deal of discussion has been stimulated by the Reproducibility Project spearheaded by Brian Nosek.

Replication of a study is an important way to test the reproducibility and generalisability of the results. It has been a standard requirement for publication in reputable journals in the field of genetics for several years (see Kraft et al, 2009). However, at interdisciplinary boundaries, the need for replication may not be appreciated, especially where researchers from other disciplines include genetic associations in their analyses. I’m interested in documenting how far replications are routinely included in genetics papers that are published in neuroscience journals, and so I attempted to categorise a set of papers on this basis.

I’ve encountered many unanticipated obstacles in the course of this study (unintelligible papers and uncommunicative authors, to name just two I have blogged about), but I had not expected to find it difficult to make this binary categorisation. But it has become clear that there are nuances to the idea of replication. Here are two of those I have encountered:

a)    Studies which include a straightforward Discovery and Replication sample, but which fail to reproduce the original result in the Replication sample. The authors then proceed to analyse the data with both samples combined and conclude that the original result is still there, so all is okay. Now, as far as I am concerned, you can’t treat this as a successful replication; the best you can say of it is that it is an extension of the original study to a larger sample size.  But if, as is typically the case, the original result was afflicted by the Winner’s Curse, then the combined result will be biased.
b)    Studies which use different phenotypes for Discovery and Replication samples. On the one hand, one can argue that such studies are useful for identifying how generalizable the initial result is to changes in measures. It may also be the only practical solution if using pre-existing samples for replication, as one has to use what measures are available. The problem is that there is an asymmetry in terms of how the results are then treated. If the same result is obtained with a new sample using different measures, this can be taken as strong evidence that the genotype is influencing a trait regardless of how it is measured. But when the Replication sample fails to reproduce the original result, one is left with uncertainty as to whether it was type I error, or a finding that is sensitive to how it is measured. I’ve found that people are very reluctant to treat failures to replicate as undermining the original finding in this circumstance.

I’m reminded of arguments in the field of social psychology, where failures to reproduce well-known phenomena are often attributed to minor changes in the procedures or lack of ‘flair’ of experimenters. The problem is that while this interpretation could be valid, there is another, less palatable, interpretation, which is that the original finding was a type I error.  This is particularly likely when the original study was underpowered or the phenotype was measured using an unreliable instrument. 

There is no simple solution, but as a start, I’d suggest that researchers in this field should, where feasible, use the same phenotype measures in Discovery and Replication samples. Where that is not feasible, they could pre-register their predictions for a Replication Sample prior to looking at the data, taking into account the reliability of the phenotype measures and the power of the Replication Sample to detect the original effect, given its sample size.

Saturday, 11 July 2015

Publishing replication failures: some lessons from history


I recently travelled to Lismore, Ireland, to speak at the annual Robert Boyle summer school. I had been intrigued by the invitation, as it was clear this was not the usual kind of scientific meeting. The theme of Robert Boyle, who was born in Lismore Castle, was approached from very different angles, and those attending included historians of science, scientists, journalists, as well as interested members of the public. We were treated to reconstructions of some of Boyle's livelier experiments, heard wonderful Irish music, and we celebrated the installation of a plaque at Lismore Castle to honour Katherine Jones, Boyle's remarkable sister, who was also a scientist.

My talk was on the future of scientific scholarly publication, a topic that the Royal Society had explored in a series of meetings to celebrate the 350th Anniversary of the publication of Philosophical Transactions. I'm particularly interested in the extent to which current publishing culture discourages good science, and I concluded by proposing the kind of model that I recently blogged about, where the traditional science journal is no longer relevant to communicating science.

What I hadn't anticipated was the relevance of some of Boyle's writing to such contemporary themes.

Boyle, of course, didn't have to grapple with issues such as the Journal Impact Factor or Open Access payments. But some of the topics he covered are remarkably contemporary. He would have been interested in the views of Jason Mitchell, John L. Loeb Associate Professor of the Social Sciences at Harvard, who created a stir last year by writing a piece entitled "On the emptiness of failed replications". I see that the essay has now been removed from the Harvard website, but the main points can be found here*. It was initially thought to be a parody, but it seems to have been a sincere attempt at defending the thesis that "unsuccessful experiments have no meaningful scientific value." Furthermore, according to Mitchell, "Whether they mean to or not, authors and editors of failed replications are publicly impugning the scientific integrity of their colleagues." I have taken issue with this standpoint in an earlier blogpost; my view is that we should not assume that a failure to replicate a result is due to fraud or malpractice, but rather should encourage replication attempts as a means of establishing which results are reproducible.

I am most grateful to Eoin Gill of Calmast for pointing me to Robert Boyle's writings on this topic, and for sending me transcripts of the most relevant bits. Boyle has two essays on "the Unsuccessfulness of Experiments" in a collection of papers entitled “Certain Physiological Essays and other Tracts”. In these he discusses (at inordinate length!) the problems that arise when an experimental result fails to replicate. He starts by noting that such unsuccessful experiments are not uncommon:
… in the serious and effectual prosecution of Experimental Philosophy, I must add one discouragement more, which will perhaps as much surprize you as dishearten you; and it is, That besides that you will find …… many of the Experiments publish'd by Authors, or related to you by the persons you converse with, false or unsuccessful, … you will meet with several Observations and Experiments, which though communicated for true by Candid Authors or undistrusted Eye-witnesses, or perhaps recommended to you by your own experience, may upon further tryal disappoint your expectation, either not at all succeeding constantly, or at least varying much from what you expected. (opening passage)
He is interested in exploring the reasons for such failure; his first explanation seems equivalent to one that those using statistical analyses are all too familiar with – a chance false positive result.
And that if you should have the luck to make an Experiment once, without being able to perform the same thing again, you might be apt to look upon such disappointments as the effects of an unfriendliness in Nature or Fortune to your particular attempts, as proceed but from a secret contingency incident to some experiments, by whomsoever they be tryed. (p. 44)
And he urges the reader not to be discouraged – replication failures happen to everyone!
…. though some of your Experiments should not always prove constant, you have divers Partners in that infelicity, who have not been discouraged by it. (p. 44)
He identifies various possible systematic reasons for such failure: a problem with skill of the experimenter, with purity of ingredients, or variation in the specific context in which the experiment is conducted. He even, implicitly, addresses statistical power, noting how one needs many observations to distinguish what is general from individual variation.
…the great variety in the number, magnitude, position, figure, &c. of the parts taken notice of by Anatomical Writers in their dissections of that one Subject the humane body, about which many errors would have been delivered by Anatomists, if the frequency of dissections had not enabled them to discern betwixt those things that are generally and uniformly found in dissected bodies, and those which are but rarely, and (if I may so speak) through some wantonness or other deviation of Nature, to be met with. (p. 94)
Because of such uncertainties, Boyle emphasises the need for replication, and the dangers of building complex theory on the basis of a single experiment:
….try those Experiments very carefully, and more than once, upon which you mean to build considerable Superstructures either theorical or practical, and to think it unsafe to rely too much upon single Experiments, especially when you have to deal in Minerals: for many to their ruine have found, that what they at first look'd upon as a happy Mineral Experiment has prov'd in the issue the most unfortunate they ever made. (p. 106)
I'm sure there are some modern scientists who must be thinking their lives would have been much easier had they heeded this advice. But perhaps the most relevant to the modern world, where there is such concern about the consequences of failure to replicate, are Boyle's comments on the reputational impact of publishing irreproducible results:
…if an Author that is wont to deliver things upon his own knowledge, and shews himself careful not to be deceived, and unwilling to deceive his Readers, shall deliver any thing as having try'd or seen it, which yet agrees not with our tryals of it; I think it but a piece of Equity, becoming both a Christian and a Philosopher, to think (unless we have some manifest reason to the contrary) that he set down his Experiment or Observation as he made it, though for some latent reason it does not constantly hold; and that therefore though his Experiment be not to be rely'd upon, yet his sincerity is not to be rejected. Nay, if the Author be such an one as has intentionally and really deserved well of Mankind, for my part I can be so grateful to him, as not only to forbear to distrust his Veracity, as if he had not done or seen what he says he did or saw, but to forbear to reject his Experiments, till I have tryed whether or no by some change of Circumstances they may not be brought to succeed. (p. 107)
The importance of fostering a 'no blame' culture was one theme that emerged in a recent meeting on Reproducibility and Reliability of Biomedical Research at the Academy of Medical Sciences. It seems that in this, as in so many other aspects of science, Boyle's views are well-suited to the 21st century.

For more on Robert Boyle, see here


12th July 2015: Thanks to Daniël Lakens who pointed me to the Wayback machine, where earlier versions of the article can be found:   http://web.archive.org/web/*/http://wjh.harvard.edu/~jmitchel/writing/failed_science.htm

Friday, 29 August 2014

Replication and reputation: Whose career matters?



©CartoonStock.com
Some people are really uncomfortable with the idea that psychology studies should be replicated. The most striking example is Jason Mitchell, Professor at Harvard University, who famously remarked in an essay that "unsuccessful experiments have no meaningful scientific value".

Hard on his heels now comes UCLA's Matthew Lieberman, who has published a piece in Edge on the replication crisis. Lieberman is careful to point out that he thinks we need replication. Indeed, he thinks no initial study should be taken at face value - it is, according to him, just a scientific anecdote, and we'll always need more data. He emphasises: "Anyone who says that replication isn't absolutely essential to the success of science is pretty crazy on that issue, as far as I'm concerned."

It seems that what he doesn't like, though, is how people are reporting their replication attempts, especially when they fail to confirm the initial finding. "There's a lot of stuff going on", he complains, "where there's now people making their careers out of trying to take down other people's careers". He goes on to say that replications aren't unbiased, that people often go into them trying to shoot down the original findings, and that this can lead to bad science:
"Making a public process of replication, and a group deciding who replicates what they replicate, only replicating the most counterintuitive findings, only replicating things that tend to be cheap and easy to replicate, tends to put a target on certain people's heads and not others. I don't think that's very good science that we, as a group, should sanction."
It's perhaps not surprising that a social neuroscientist should be interested in the social consequences of replication, but I would take issue with Lieberman's analysis. His depiction of the power of the non-replicators seems misguided. You do a replication to move up in your career? Seriously? Has Lieberman ever come across anyone who was offered a job because they failed to replicate someone else? Has he ever tried to publish a replication in a high-impact outlet? Give it a try and you'll soon be told it is not novel enough. Many of the most famous journals are notorious for turning down failures to replicate studies that they themselves published.  Lieberman is correct in noting that failures to replicate can get a lot of attention on Twitter, but a strong Twitter following is not going to recommend you to a hiring committee (and, btw, that Kardashian index paper was a parody).

Lieberman makes much of the career penalty for those whose work is not replicated. But anyone who has been following the literature on replication will be aware of just how common non-replication is (see e.g. Ioannidis, 2005). There are various possible reasons for this, and nobody with any sense would count it against someone if they do a well-conducted and adequately powered study that does not replicate. What does count against them is if they start putting forward implausible reasons why the replication must be wrong and they must be right. If they can show the replicators did a bad job, their reputation can only be enhanced. But they'll be in a weak position if their original study was not methodologically strong and should not have been submitted for publication without further evidence to support it. In other words, reputation and career prospects will, at the end of the day, come down to the scientific rigour of a person's research, not to whether a particular result did or did not cross a threshold of p < .05.

The problem with failures to replicate is that they can arise for at least four reasons, and it can be hard to know which applies in an individual case. One reason, emphasized by Lieberman, is that the replicator may be incompetent or biased. But a positive feature of the group replication efforts that Lieberman so dislikes is that the methods and data are entirely open, allowing anyone who wishes to evaluate them to do so – see for instance this example. Others have challenged replication failures on the grounds that there are crucial aspects of the methodology that only the original experimenter knows about. To those I recommend making all aspects of methods explicit.

A second possibility is that a scientist does a well-designed study whose results don't replicate because all results are influenced by randomness – this could mean that the original effect was a false positive, or that the replication was a false negative. The truth of the matter will only be settled by more, rather than less, replication, but there's research showing that the odds are that an initial large effect will be smaller on replication, and may disappear altogether - the so-called Winner's Curse (Button et al, 2013).
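The Winner's Curse is easy to demonstrate. Here is a toy simulation of my own (not from Button et al, and every number in it is invented): studies of a true standardized effect of 0.3 are run with 20 participants per group, and only the 'significant' ones see the light of day.

```python
import random
import statistics

random.seed(1)
d_true, n_per_group = 0.3, 20
se = (2 / n_per_group) ** 0.5          # approximate SE of a standardized mean difference

published = []
for _ in range(200_000):
    d_obs = random.gauss(d_true, se)   # each simulated study's estimate
    if d_obs / se > 1.96:              # only 'significant' results get published
        published.append(d_obs)

# The published literature reports an average effect well over
# twice the true value of 0.3
print(round(statistics.mean(published), 2))
```

On these assumptions, the underpowered-but-significant results are inflated on average, so a faithful replication will tend to find a smaller effect even when the original effect is real – exactly the pattern the Winner's Curse predicts.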

The third reason why someone's work doesn't replicate is if they are a charlatan or fraudster, who has learned that they can have a very successful career by telling lies. We all hope they are very rare and we all agree they should be stopped. Nobody would make the assumption that someone must be in this category just because a study fails to replicate.

The fourth reason for lack of replication arises when researchers are badly trained and simply don't understand about probability theory, and so engage in various questionable research practices to tweak their data to arrive at something 'significant'. Although they are innocent of bad intentions, they stifle scientific progress by cluttering the field with nonreplicable results. Unfortunately, such practices are common and often not recognised as a problem, though there is growing awareness of the need to tackle them.

There are repeated references in Lieberman's article to people's careers: not just the people who do the replications ("trying to create a career out of a failure to replicate someone") but also the careers of those who aren't replicated ("When I got into the field it didn't seem like there were any career-threatening giant debates going on"). There is, however, another group whose careers we should consider: graduate students and postdocs who may try to build on published work only to find that the original results don't stand up. Publication of non-replicable findings leads to enormous waste in science and demoralization of the next generation. One reason why I take reproducibility initiatives seriously is because I've seen too many young people demoralized after finding that the exciting effect they want to investigate is actually an illusion.

While I can sympathize with Lieberman's plea for a more friendly and cooperative tone to the debate, at the end of the day, replication is now on the agenda and it is inevitable that there will be increasing numbers of cases of replication failure.

So suppose I conduct a methodologically sound study that fails to replicate a colleague's work. Should I hide my study away for fear of rocking the boat or damaging someone's career? Have a quiet word with the author of the original piece? Rather than holding back for fear of giving offence it is vital that we make our data and methods public: For a great example of how to do this in a rigorous yet civilized fashion I recommend this blogpost by Betsy Levy Paluck.

In short, we need to develop a more mature understanding that the move towards more replication is not about making or breaking careers: it is about providing an opportunity to move science forward, improve our methodology and establish which results are reliable (Ioannidis, 2012). And this can only help the careers of those who come behind us.


References  
Button, K., Ioannidis, J., Mokrysz, C., Nosek, B., Flint, J., Robinson, E., & Munafò, M. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(6), 365-376. DOI: 10.1038/nrn3475

Ioannidis, J. (2005). Contradicted and initially stronger effects in highly cited clinical research. JAMA, 294(2). DOI: 10.1001/jama.294.2.218

Ioannidis, J. (2012). Why science is not necessarily self-correcting. Perspectives on Psychological Science, 7(6), 645-654. DOI: 10.1177/1745691612464056

Thursday, 9 January 2014

Off with the old and on with the new: the pressures against cumulative research

 
Yesterday I escaped a very soggy Oxford to make it down to London for a symposium on "Increasing value, reducing waste" in Research. The meeting marked the publication of a special issue of the Lancet containing five papers and two commentaries, which can be downloaded here.

I was excited by the symposium because, although the focus was on medicine, it raised a number of issues that have much broader relevance for science, several of which I have raised on this blog: pre-registration of research, the criteria used by high-impact journals, ethics regulation, academic backlogs, and incentives for researchers. It was impressive to see that major players in the field of medicine are now recognizing that there is a massive problem of waste in research. Better still, they are taking seriously the need to devise ways in which this could be fixed.

I hope to blog about more of the issues that came up in the meeting, but for today I'll confine myself to one topic that I hadn't really thought about much before, but which I see as important, namely the importance of doing research that builds on previous research, and the current pressures against this.

Iain Chalmers presented one of the most disturbing slides of the day, a forest plot of effect sizes found in medical trials for a treatment to prevent bleeding during surgery.
Based on Figure 3 of Chalmers et al, 2014
Time is along the x-axis, and the horizontal line corresponds to a result where the active and control treatments do not differ. Points below the line whose confidence intervals do not cross it show a beneficial effect of treatment. The graph shows that the effectiveness of the treatment was clearly established by around 2002, yet a further 20 studies involving several hundred patients were reported in the literature after that date. Chalmers made the point that it is simply unethical to do a clinical trial if previous research has already established an effect. The problem is that researchers often don't check the literature to see what has already been done, and so there is wasteful repetition of studies. In the field of medicine this is particularly serious because patients may be denied the most effective treatment if they enrol in a research project.

Outside medicine, I'm not sure this is so much of an issue. In fact, as I've argued elsewhere, in psychology and neuroscience I think there's more of a problem with lack of replication. But there definitely is much neglect of prior research. I lose count of the number of papers I review where the introduction presents a biased view of the literature that supports the authors' conclusions. For instance, if you are interested in the relation between auditory deficit and children's language disorders, it is possible to write an introduction presenting this association as an established fact, or to write one arguing that it has been comprehensively debunked. I have seen both.

Is this just lazy, biased or ignorant authors? In part, I suspect it is. But I think there is a deeper problem which has to do with the insatiable demand for novelty shown by many journals, especially the high-impact ones. These journals typically have a lot of pressure on page space and often allow only 500 words or less for an introduction. Unless authors can refer to a systematic review of the topic they are working on, they are obliged to give the briefest account of prior literature. It seems we no longer value the idea that research should build on what has gone before: rather, everyone wants studies that are so exciting that they stand alone. Indeed, if a study is described as 'incremental' research, that is typically the death knell in a funding committee.

We need good syntheses of past research, yet these are not valued because they are not deemed novel. One point made by Iain Chalmers was that funders have in the past been reluctant to give grants for systematic reviews. Reviews also aren't rated highly in academia: for instance, I'm proud of a review on mismatch negativity that I published in Psychological Bulletin in 2007. It not only condensed and critiqued existing research, but also discovered patterns in data that had not previously been noted. However, for the REF, and for my publications list on a grant renewal, reviews don't count.

We need a rethink of our attitude to reviews. Medicine has led the way and specified rigorous criteria for systematic reviews, so that authors can't just cherrypick specific studies of interest. But it has also shown us that such reviews are an invaluable part of the research process. They help ensure that we do not waste resources by addressing questions that have already been answered, and they encourage us to think of research as a cumulative, developing process, rather than a series of disconnected, dramatic events.

Reference
Chalmers, I., Bracken, M. B., Djulbegovic, B., Garattini, S., Grant, J., Gülmezoglu, A. M., Howells, D. W., Ioannidis, J. P. A., & Oliver, S. (2014). How to increase value and reduce waste when research priorities are set. Lancet. DOI: 10.1016/S0140-6736(13)62229-1

Thursday, 19 January 2012

Novelty, interest and replicability


So at last, your paper is written. It represents the culmination of many years’ work. You think it is an important advance for the field. You write it up. You carefully format it for your favoured journal. You grapple with the journal’s portal, tracking down details of recommended reviewers, and then sit back. You anticipate a delay of a few weeks before you get reviewer comments. But, no. What’s this? A decision letter within a week: “Unfortunately we receive many more papers than we can publish or indeed review and must make difficult decisions on the basis of novelty and general interest as well as technical correctness.” It’s the publishing equivalent of the grim reaper: a reject without review.

It happens increasingly often, especially if you send work to journals with high impact factors. I’ve been an editor and I know there are difficult decisions to make. It can be kinder to an author to reject immediately if you sense that the paper isn’t going to make it through the review process. One thing you learn as an author is that there’s no point protesting or moaning. You just try again with another journal. I’m confident our paper is important and will get published, and there’s no reason for me to single this journal out for complaint. But this experience has made me reflect more generally on factors affecting publication, and I do think there are things about the system that are problematic.

So, using this blog as my soapbox, there are two points I’d like to make: A little one and a big one. Let’s get the little one out of the way first. It’s simply this: if a journal commonly rejects papers without review, then it shouldn’t be fussy about the format in which a paper is submitted. It’s just silly for busy people to spend time getting the references correctly punctuated, or converting their figures to a specific format, if there’s a strong probability that their paper will be bounced. Let the formatting issues be addressed after the first round of review.
The second point concerns the criteria of “novelty and general interest”. My guess is that our paper was triaged on the novelty criterion because it involved replication. We reported a study that involved measuring electrical brain responses to sounds. We compared these responses in children with developmental language impairments and typically-developing children. The rationale is explained in a blogpost I wrote for the Wellcome Trust.
We’re not the first people to do this kind of research. There have been a few previous studies, but it’s a fair summary to say the literature is messy. I reviewed part of it a few years back and I was shocked at how bad things were. It was virtually impossible to draw any general conclusions from 26 studies. Now these studies are really hard to do. Just recruiting people is difficult and it can take months if not years to get an adequate sample. Then there is the data analysis which is not for the innumerate or faint-hearted. So a huge amount of time and money had gone into these studies, but we didn’t seem to be progressing very far. The reason was simple: you couldn’t generalise because nobody ever attempted to replicate previous research. The studies were focussed on the same big questions, but they differed in important ways. So if they got different results, you couldn’t tell why.
In response to this, part of my research strategy has been to take those studies that look the strongest and attempt to replicate them. So when we found strikingly similar results to a study by Shafer et al (2010) I was excited. The fact that two independent labs on different sides of the world had obtained virtually the same result gave me confidence in the findings. I was able to build on this result to do some novel analyses that helped establish the direction of causal influences, and felt that at last we were getting somewhere. But my excitement was clearly not shared by the journal editor, who no doubt felt our findings were not sufficiently novel. I wasn’t particularly surprised by this decision, as this is the way things work. But is the focus on novelty good for science?
The problem is that unless novel findings are replicated, we don’t know which results are solid and reliable. We ought to know: we apply statistical methods with the sole goal of establishing this. But in practice, statistics are seldom used appropriately. People generate complex datasets and then explore different ways of analysing data to find statistically significant results. In electrophysiological studies, there are numerous alternative ways in which data can be analysed, by examining different peaks in a waveform, different methods of identifying peaks, different electrodes, different time windows, and so on. If you do this, it is all too easy for “false positives” to be mistaken as genuine effects (Simmons, Nelson, & Simonsohn, 2011). And the problem is compounded by the “file drawer problem” whereby people don’t publish null results. Such considerations led Ioannidis (2005) to conclude that most published research findings are false.
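The cost of that analytic flexibility is easy to quantify. In this toy simulation of my own (the eight 'analysis routes' are a hypothetical stand-in for choices of peak, electrode and time window, treated as independent for simplicity), there is no real effect at all, yet a 'significant' result turns up far more often than the nominal 5% of the time:

```python
import random

random.seed(2)
k_analyses, trials = 8, 10_000   # 8 alternative analysis routes (a made-up number)

hits = 0
for _ in range(trials):
    # Null world: no true effect; each route yields an independent z statistic
    z_stats = [random.gauss(0, 1) for _ in range(k_analyses)]
    if any(abs(z) > 1.96 for z in z_stats):   # report whichever analysis 'worked'
        hits += 1

# With 8 independent looks the chance of at least one false positive is
# 1 - 0.95**8, roughly a third - not the advertised 5%
print(round(hits / trials, 2))
```

In real electrophysiological data the alternative analyses are correlated, which tames the inflation somewhat, but the direction of the problem is the same – which is exactly why Simmons et al urge disclosure of every analysis that was tried.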
This is well-recognised in the field of genetics, where it became apparent that most early studies linking genetic variants to phenotypes were spurious (see Flint et al). The reaction, reflected in a recent editorial in Behavior Genetics has been to insist that authors replicate findings of associations between genes and behaviour. So if you want to say something novel, you have to demonstrate the effect in two independent samples.
This is all well and good, but requiring that authors replicate their results is unrealistic in a field where a study takes several years to complete, or involves a rare disorder. You can, however, create an expectation that researchers include a replication of prior work when designing a study, and/or use existing research to generate a priori predictions about expected effects.
It wouldn’t be good for science if journals only published boring replications of things we already knew. Once a finding is established as reliable, then there’s no point in repeating the study. But something that has been demonstrated at least twice in independent samples (replicable) is far more important to science than something that has never been shown before (novel), because the latter is likely to be spurious. I see this as a massive challenge for psychology and neuroscience.
In short, my view is that top journals should reverse their priorities and treat replicability as more important than novelty.
Unfortunately, most scientists don’t bother to attempt replications because they know the work will be hard to publish. We will only reverse that perception if journal editors begin to put emphasis on replicability.
A few individuals are speaking out on this topic. I recommend a blogpost by Brian Knutson who argued, “Replication should be celebrated rather than denigrated.” He suggested that we need a replicability index to complement the H-index. If scientists were rewarded for doing studies that others can replicate, we might see a very different rank ordering of research stars.
I leave the last word to Kent Anderson: “Perhaps we’re measuring the wrong things … Perhaps we should measure how many results have been replicated. Without that, we are pursuing a cacophony of claims, not cultivating a world of harmonious truths.”


Simmons, J., Nelson, L., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366. DOI: 10.1177/0956797611417632