Showing posts with label classification. Show all posts

Tuesday, 6 December 2022

Biomarkers to screen for autism (again)


Diagnosis of autism from biomarkers is a holy grail for biomedical researchers. The days when it was thought we would find “the autism gene” are long gone, and it’s clear that both the biology and the psychology of autism are highly complex and heterogeneous. One approach is to search for individual genes where mutations are more likely in those with autism. Another is to address the complexity head-on by looking for combinations of biomarkers that could predict who has autism. The latter approach is adopted in a paper by Bao et al (2022), who claimed that an ensemble of gene expression measures taken from blood samples could accurately predict which toddlers were autistic (ASD) and which were typically-developing (TD). An anonymous commenter on PubPeer queried whether the method was as robust as the authors claimed, arguing that there was evidence for “overfitting”. I was asked for my thoughts by a journalist, and they were complicated enough to merit a blogpost. The bottom line is that there are reasons to be cautious about the conclusion of the authors that they have developed “an innovative and accurate ASD gene expression classifier”.

 

Some of the points I raise here applied to a previous biomarker study that I blogged about in 2019. These are general issues about the mismatch between what is done in typical studies in this area and what is needed for a clinically useful screening test.

 

Base rates

Consider first how a screening test might be used. One possibility is that there might be a move towards universal screening, allowing early diagnosis that might help ensure intervention starts young.  But for effective screening in that context, you need extremely high diagnostic accuracy, and accuracy depends on the frequency of autism in the population.  I discussed this back in 2010. The levels of accurate classification reported by Bao et al would be of no use for population screening because there would be an extremely high rate of false positives, given that most children don’t have autism.
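To make the base-rate point concrete, here is a minimal sketch of how the proportion of true cases among positive results (the positive predictive value) falls as autism becomes rarer in the tested population. The sensitivity, specificity and prevalence figures are hypothetical values chosen for illustration; they are not taken from Bao et al.

```python
def positive_predictive_value(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Proportion of positive test results that are true cases (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical test characteristics, for illustration only.
SENSITIVITY = 0.80
SPECIFICITY = 0.80

# Hypothetical base rates: general population, higher-risk group, balanced study sample.
for prevalence in (0.02, 0.20, 0.50):
    ppv = positive_predictive_value(SENSITIVITY, SPECIFICITY, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV = {ppv:.2f}")
```

With these assumed figures, fewer than one in ten positive results would be true cases at a 2% base rate, compared with four in five when half of those tested have autism.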

 

Diagnostic specificity

But, you may say, we aren’t talking about universal screening.  The test might be particularly useful for those who either (a) already have an older child with autism, or (b) are concerned about their child’s development.  Here the probability of a positive autism diagnosis is higher than in the general population.  However, if that’s what we are interested in, then we need a different comparison group – not typically-developing toddlers, but unaffected siblings of children with autism, and/or children with other neurodevelopmental disorders.   

When I had a look at the code that the authors deposited for data analysis, it implied that they did have data on children with more general developmental delays, and sibs of those with autism, but they are not reported in this paper. 

 

The analyses done by the researchers are extremely complex and time-consuming, and it is understandable that they may prefer to start out with the clearest case of comparing autism with typically-developing children. But the acid test of the suitability of the classifier for clinical use would be a demonstration that it could distinguish children with autism from unaffected siblings, and from nonautistic children with intellectual disability.

 

Reliability of measures

If you run a diagnostic test, an obvious question is whether you’d get the same result on a second test run.  With biological and psychological measures the answer is almost always no, but the key issue for a screener is just how much change there is. Gene expression levels could vary from occasion to occasion depending on time of day or what you’d eaten – I have no idea how important this might be, but it's not possible to evaluate in this paper, where measures come from a single blood sample. My personal view is that the whole field of biomedical research needs to wake up to the importance of reliability of measurement so that researchers don’t waste time exploring the predictive power of measures that may be too unreliable to be useful.  Information about stability of measures over time is a basic requirement for any diagnostic measure.
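As a rough illustration of why stability matters, the sketch below simulates a measure taken on two occasions and asks how many children flagged the first time would be flagged again. It assumes a simple model in which each score is a stable true level plus independent occasion-specific noise, with "reliability" meaning the share of variance due to the true level. The cutoff (top 20%) and the reliability values are hypothetical and say nothing about gene expression measures in particular.

```python
import random


def top_fraction_cutoff(values, flag_top):
    """Score at or above which a child falls in the top `flag_top` fraction."""
    return sorted(values)[int(len(values) * (1.0 - flag_top))]


def retest_agreement(reliability, n_children=20_000, flag_top=0.2, seed=0):
    """Fraction of children flagged on occasion 1 who are flagged again on occasion 2."""
    rng = random.Random(seed)
    true_sd = reliability ** 0.5            # stable, child-specific component
    error_sd = (1.0 - reliability) ** 0.5   # occasion-specific noise
    first, second = [], []
    for _ in range(n_children):
        true_level = rng.gauss(0.0, true_sd)
        first.append(true_level + rng.gauss(0.0, error_sd))
        second.append(true_level + rng.gauss(0.0, error_sd))
    cut1 = top_fraction_cutoff(first, flag_top)
    cut2 = top_fraction_cutoff(second, flag_top)
    flagged_first = [i for i, v in enumerate(first) if v >= cut1]
    still_flagged = sum(1 for i in flagged_first if second[i] >= cut2)
    return still_flagged / len(flagged_first)


for r in (0.5, 0.7, 0.9):  # hypothetical reliabilities
    print(f"reliability {r:.1f}: {retest_agreement(r):.0%} flagged again at retest")
```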

 

A related issue concerns comparability of procedures for autism and TD groups. Were blood samples collected by the same clinicians over the same period and processed in the same lab for these two groups? Were the blood analyses automated and/or done blind? It’s crucial to be confident that minor differences in clinical or lab procedures do not bias results in this kind of study.

 

Overfitting

Overfitting is really just a polite way of saying that an apparent result may be noise. If you run enough analyses, something is bound to look significant, just by chance. In the first step of the analysis, the researchers ran 42,840 models on “training” data from 93 autistic and 82 TD children and found 1,822 of them performed better than .80 on a measure that reflects diagnostic accuracy (AUC-ROC – the probability that a model ranks a randomly chosen autistic child above a randomly chosen TD child: .50 is chance, and 1.00 is perfect classification). So we can see that just over 4% of the models (1822/42840) performed this well.

 

The researchers were aware of the possibility of overfitting, and they addressed it head-on, saying: “To test this, we permuted the sample labels (i.e., ASD and TD) for all subjects in our Training set and ran the pipeline to test all feature engineering and classification methods. Importantly, we tested all 42,840 candidate models and found the median AUC-ROC score was 0.5101 with the 95th CI (0.42–0.65) on the randomized samples. As expected, only rare chance instances of good 'classification' occurred.”  The distribution of scores is shown in Figure 2b. 
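The logic of the label-permutation check can be sketched in a few lines. This is not the authors' pipeline: it uses a single made-up score per child as a stand-in for one model's output, and only the group sizes (93 and 82) come from the paper. The function names are my own.

```python
import random
import statistics


def auc_roc(scores, labels):
    """AUC as the probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_null(scores, labels, n_permutations, rng):
    """AUC values after shuffling the labels, which breaks any real link to the scores."""
    shuffled = list(labels)
    null_aucs = []
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        null_aucs.append(auc_roc(scores, shuffled))
    return null_aucs


rng = random.Random(0)
labels = [1] * 93 + [0] * 82                    # training-set group sizes
scores = [rng.gauss(0.0, 1.0) for _ in labels]  # hypothetical model output (noise)
null_aucs = permutation_null(scores, labels, 200, rng)
print(f"median AUC with permuted labels: {statistics.median(null_aucs):.3f}")
print(f"largest AUC with permuted labels: {max(null_aucs):.3f}")
```

A real check of this kind would rerun the whole model-selection pipeline on each set of permuted labels, as the authors describe, rather than scoring one fixed set of predictions.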

 

 


Figure 2b from Bao et al (2022)

 

They then ran a further analysis on a “test set” of 34 autistic and 31 TD children who had been held out of the original analysis, and found that 742 of the 1822 models performed better than .80 in classification. That’s 40% of the tested models.  Assuming I have understood the methods correctly, that does look meaningful and hard to explain just in terms of statistical noise.  In effect, they have run a replication study and found that a substantial subset of the identified models do continue to separate autism and TD groups when new children are considered. The claim is that there is substantial overlap in the models that fall in the right-hand area under the curve for the red and pink distributions.
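For anyone who wants to check the arithmetic, the two proportions can be computed directly from the counts reported in the paper. The variable names are mine.

```python
n_models = 42_840        # candidate models run on the training set
n_good_training = 1_822  # models with AUC-ROC above .80 in training
n_good_both = 742        # of those, models also above .80 in the held-out test set

training_rate = n_good_training / n_models
holdout_rate = n_good_both / n_good_training

print(f"above .80 in training: {training_rate:.1%} of all models")
print(f"above .80 in test set: {holdout_rate:.1%} of the selected models")
```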

 

The PubPeer commenter seems concerned that results look too good to be true. In particular, Figure 2b suggests the models perform a bit better in the test set than in the training set. But the figure shows the distribution of scores for all the models (not just the selected models) and, given the small sample sizes, the differences between distributions do not seem large to me. I was more surprised by the relatively tight distribution of AUC-ROC values obtained in the permutation analysis, as I would have anticipated some models would have given high classification accuracy just by chance in a sample of this size.

The researchers went on to present data for the set of models that achieved .8 classification in both training and test sets. This seemed a reasonable approach to me. The PubPeer commenter is correct in arguing that there will be some bias caused by selecting models this way, and that one would expect poorer performance in a completely new sample, but the 2-stage selection of models would seem to ensure there is not "massive overfitting". I think there would be a problem if only 4% of the 1822 selected models had given accurate classification, but the good rate of agreement between the models selected in the training and test samples, coupled with the lack of good models in the permuted data, suggests there is a genuine effect here.

 

Conclusion

So, in sum, I think that the results can’t just be attributed to overfitting, but I nevertheless have reservations about whether they would be useful for screening for autism.  And one of the first things I’d check if I were the researchers would be the reliability of the diagnostic classification in repeated blood samples taken on different occasions, as that would need to be high for the test to be of clinical use.

 

Note: I'd welcome comments or corrections on this post. Please note, comments are moderated to avoid spam, and so may not appear immediately. If you post a comment and it has not appeared in 24 hr, please email me and I'll ensure it gets posted. 

 PS. See comment from original PubPeer poster attached. 

Also, 8th Dec 2022, I added a further PubPeer comment asking authors to comment on Figure 2B, which does seem odd. 

https://pubpeer.com/publications/B693366B2B51D143C713359F151F7B#4  


PPS. 10th Dec 2022. 

Author Eric Courchesne has responded to several of the points made in this blogpost on Pubpeer: https://pubpeer.com/publications/B693366B2B51D143C713359F151F7B#5  



 


 

 

 

 

 

 

 

 

Saturday, 29 February 2020

Developmental Language Disorder (DLD) in relation to DSM-5

The tl;dr (too long, didn't read) message in this post is that for all intents and purposes, Developmental Language Disorder (DLD), as defined in the CATALISE project, can be regarded as equivalent to DSM-5 Language Disorder (American Psychiatric Association, 2013). This is a question of interest to people working in systems that require a DSM-5 diagnosis for access to services or insurance payments.

Diagnostic flowchart for CATALISE, based on Figure 1 from Bishop et al (2017).





Figure 1 shows how DLD is defined in the CATALISE project (Bishop et al, 2017). In this framework, DLD is a subset of the broader term 'Language Disorder', with the 'Developmental' prefix used to indicate that the language problems are not associated with a known biomedical condition. The list of biomedical conditions includes brain injury, acquired epileptic aphasia in childhood, certain neurodegenerative conditions, cerebral palsy, oral language limitations associated with sensori-neural hearing loss, autism, intellectual disability and genetic conditions such as Down syndrome.

In DSM-5, 'Language Disorder' is used with a meaning that closely corresponds to DLD. DSM-5 Diagnostic criteria for Language Disorder (category 315.32; F80.2) are as follows:

A. Persistent difficulties in the acquisition and use of language across modalities (i.e. spoken, written, sign language, or other) due to deficits in comprehension or production that include the following: 
  • Reduced vocabulary (word knowledge and use) 
  • Limited sentence structure (ability to put words and word endings together to form sentences based on the rules of grammar and morphology) 
  • Impairments in discourse (ability to use vocabulary and connect sentences to explain or describe a topic or series of events or have a conversation) 
B. Language abilities are substantially and quantifiably below those expected for age, resulting in functional limitations in effective communication, social participation, academic achievement, or occupational performance, individually or in any combination 
C. Onset of symptoms is in the early developmental period 
D. The difficulties are not attributable to hearing or other sensory impairment, motor dysfunction, or another medical or neurological condition, and are not better explained by intellectual disability (intellectual developmental disorder) or global developmental delay. 

The relationship between the CATALISE terminology and DSM-5 is shown in Figure 2. In both CATALISE DLD and DSM-5 Language Disorder, cognitive referencing is not used (i.e. there is no IQ cutoff for inclusion in the category), the language problems are seen in early childhood and are persistent and lead to functional impairment, and children with associated biomedical conditions are excluded. 
Figure 2: Set diagram showing relationship between CATALISE and DSM-5 terminology
The CATALISE criteria are more explicit than DSM-5 in relation to bilingual/multilingual children, making it clear that one would not diagnose DLD unless the child showed poor language competence in their best language, but a similar idea is conveyed in the "Differential diagnosis" section of DSM-5, where it is noted that "Language disorder needs to be distinguished from normal developmental variations... Regional, social, or cultural/ethnic variations of language must be considered when an individual is being assessed for language impairment." Unfortunately, a recent account of DLD by a member of the CATALISE panel stated that bilingual/multilingual children were excluded from the DLD category (Rice, 2020). It is unclear how this misunderstanding arose, but it is clearly discrepant with Figure 1.

One might ask why the CATALISE panel decided against alignment with the DSM-5 terminology, given that there is such overlap between CATALISE DLD and DSM-5 Language Disorder. The answer is that "Language Disorder" as defined in DSM-5 is a problematic label because on the one hand it refers to a specific DSM-5 category 315.32 (F80.2), but on the other hand it is widely used to refer to symptom-level problems in many conditions. "Language disorder" is a hopeless term to use in a literature search because it will turn up a far broader range of conditions than those defined by the diagnostic criteria above, including many types of acquired language disorder associated with both developmental and adult-onset conditions.

This broader meaning of the term gives ample scope for confusion. Indeed, even within the DSM-5 manual, ambiguity is apparent. Under "Differential diagnosis" mention is made of Neurological disorders, with the statement "Language disorder can be acquired in association with neurological disorders, including epilepsy (e.g. acquired aphasia or Landau-Kleffner syndrome)". A literal reading of this would mean that, contrary to Diagnostic criteria point D, children with acquired aphasia or Landau-Kleffner syndrome can be regarded as having Language Disorder. I don't think that is what the authors intended: rather they imply that one should differentiate acquired language disorder associated with neurological conditions from "Language Disorder" that corresponds to DSM-5 category 315.32 (F80.2).

In short, in proposing the DLD label, the CATALISE panel were capturing a grouping that is conceptually the same as DSM-5 Language Disorder. We felt, however, that "Language Disorder" was a problematic label that would generate confusion, and so we used the more traditional term "Developmental Language Disorder" specifically to identify a category of language disorder without associated biomedical conditions.

References
American Psychiatric Association. (2013). Diagnostic and statistical manual of mental disorders (5th ed.). Arlington, VA: American Psychiatric Publishing.
Bishop, D. V. M., Snowling, M. J., Thompson, P. A., Greenhalgh, T., & CATALISE Consortium. (2017). Phase 2 of CATALISE: a multinational and multidisciplinary Delphi consensus study of problems with language development: Terminology. Journal of Child Psychology and Psychiatry, 58(10), 1068-1080. doi:10.1111/jcpp.12721
Rice, M. (2020). Clinical lessons from studies of children with Specific Language Impairment. Perspectives of the ASHA Special Interest Groups, 5(1), 12-29. doi:10.1044/2019_PERSP-19-00011

Saturday, 12 January 2019

NeuroPointDX's blood test for Autism Spectrum Disorder: a critical evaluation

NeuroPointDX (NPDX), a Madison-based biomedical company, is developing blood tests for early diagnosis of Autism Spectrum Disorder (ASD). According to their Facebook page, the NPDX ASD test is available in 45 US states. It does not appear to require FDA approval. On the Payments tab of the website, we learn that the test is currently self-pay (not covered by insurance), but for those who have difficulty meeting the costs, a Payment Plan is available, whereby the test is conducted after a down payment is received, but the results are not disclosed to the referring physician until two further payments have been made.

So what does the test achieve, and what is the evidence behind it?

Claims made for the test
On their website, NPDX describe their test as a 'tool for earlier ASD diagnosis'. Specifically they say:
'It can be difficult to know when to be concerned because kids develop different skills, like walking and talking, at different times. It can be hard to tell if a child is experiencing delayed development that could signal a condition like ASD or is simply developing at a different pace compared to his or her peers...... This is why a biological test, one that’s less susceptible to interpretation, could help doctors diagnose children with ASD at a younger age. The NPDX ASD test was developed for children as young as 18 months old.'
They go on to say:
'In our research of autism spectrum disorder (ASD) and metabolism, we found differences in the metabolic profiles of certain small molecules in the blood of children with ASD. The NPDX ASD test measures a number of molecules in the blood called metabolites and compares them to these metabolic profiles.

The results of our metabolic test provide the ordering physician with information about the child’s metabolism. In some instances, this information may be used to inform more precise treatment. Preliminary research suggests, for example, that adding or removing certain foods or supplements may be beneficial for some of these children. NeuroPointDX is working on further studies to explore this.

The NPDX ASD test can identify about 30% of children with autism spectrum disorder with an increased risk of an ASD diagnosis. This means that three in 10 kids with autism spectrum disorder could receive an earlier diagnosis, get interventions sooner, and potentially receive more precise treatment suggestions from their doctors, based on information about their own metabolism.'
They further state that this is:  'A new approach to thinking about ASD that has been rigorously validated in a large clinical study' and they note that results from their Children’s Autism Metabolome Project (CAMP) study have been 'published in a peer-reviewed, highly-regarded journal, Biological Psychiatry'.

The test is recommended for a child who:
  • Has failed screening for developmental milestones indicating risk for ASD (e.g. M-CHAT, ASQ-3, PEDS, STAT, etc.). 
  • Has a family history such as a sibling diagnosed with ASD. 
  • Has an ASD diagnosis for whom additional metabolic information may provide insight into the child’s condition and therapy.
In September, Xconomy, which reports on biotech developments, ran an interview with Stemina CEO and co-founder Elizabeth Donley, which gives more background, noting that the test is not intended as a general population screen, but rather as a way of identifying specific subtypes among children with developmental delay.

Where are the non-autistic children with developmental delay? 
I looked at the published paper from the CAMP study in Biological Psychiatry.

Given the recommendations made by NPDX, I had expected that the study would recruit children with developmental delay and compare metabolomic profiles in those who did and did not subsequently meet diagnostic criteria for ASD.

However, what I found instead was a study that compared metabolomics in 516 children with a certified diagnosis of ASD and 164 typically-developing children. There was a striking difference between the two groups in 'developmental quotient (DQ)', which is an index of overall developmental level. The mean DQ for the ASD group was 62.8 (SD = 17.8), whereas that of the typically developing comparison group was 100.1 (SD = 16.5). This information can be found in Supplementary Materials Table 3.

It is not possible, using this study design, to use metabolomic results to distinguish children with ASD from other cases of developmental delay. To do that, we'd need a comparison sample of non-autistic children with developmental delay.

The CAMP study is registered on ClinicalTrials.gov, where it is described as follows:
'The purpose of this study is to identify a metabolite signature in blood plasma and/or urine using a panel of biomarker metabolites that differentiate children with autism spectrum disorder (ASD) from children with delayed development (DD) and/or typical development (TD), to develop an algorithm that maximizes sensitivity and specificity of the biomarker profile, and to evaluate the algorithm as a diagnostic tool.' (My emphasis)
The study is also included on the NIH Project Reporter portfolio, where the description includes the following information:
'Stemina seeks funding to enroll 1500 patients in a well-defined clinical study to develop a biomarker-based diagnostic test capable of classifying ASD relative to other developmental delays at greater than 80% accuracy. In addition, we propose to identify metabolic subtypes present within the ASD spectrum that can be used for personalized treatment. The study will include ASD, DD and TD children between 18 and 48 months of age. Inclusion of DD patients is a novel and important aspect of this proposed study from the perspective of a commercially available diagnostic test.' (My emphasis)
So, the authors were aware that it was important to include a group with developmental delay, but they then reported no data on this group. Such children are difficult to recruit, especially for a study involving invasive procedures, and it is not unusual for studies to fail to meet recruitment goals. That is understandable. But it is not understandable that the test should then be described as being useful for diagnosing ASD from within a population with developmental delay, when it has not been validated for that purpose.

Is the test more accurate than behavioural diagnostic tests? 
A puzzling aspect of the NPDX claims is a footnote (marked *) on this webpage:
'Our test looks for certain metabolic imbalances that have been identified through our clinical study to be associated with ASD. When we detect one or more imbalance(s), there is an increased risk that the child will receive an ASD diagnosis'
*Compared to the results of the ADOS-2 (Autism Diagnostic Observation Schedule), Second Edition
It's not clear exactly what is meant by this: it sounds as though the claim is that the blood test is more accurate than ADOS-2. That can't be right, though, because in the CAMP study, we are told: 'The Autism Diagnostic Observation Schedule–Second Version (ADOS-2) was performed by research-reliable clinicians to confirm an ASD diagnosis.' So all the ASD children in the study met ADOS-2 criteria. It looks like 'compared to' means 'based on' in this context, but it is then unclear what the 'increased risk' refers to.

How reliable is the test?
A test's validity depends crucially on its reliability: if a blood test gives different results on different occasions, then it cannot be used for diagnosis of a long-term condition. Presumably because of this, the account of the study on ClinicalTrials.gov states: 'A subset of the subjects will be asked to return to the clinic 30-60 days later to obtain a replicate metabolic profile.' Yet no data on this replicate sample is reported in the Biological Psychiatry paper.

I have no expertise in metabolomics, but it seems reasonable to suppose that amines measured in the blood may vary from one occasion to another; indeed in 2014 the authors published a preliminary report on a smaller sample from CAMP, where they specifically noted that, presumably to minimise impact of medication or special diets, blood samples were taken when the child was fasting and prior to morning administration of medication. (34% of the ASD group and 10% of the typically-developing group were on regular medication, and 19% of the ASD group were on gluten and/or casein-free diets).

I contacted the authors to ask for information on this point. They did not provide any data on test-retest reliability beyond stating:
Thirty one CAMP subjects were recruited at random for a test-retest analysis during CAMP. These subjects were all amino acid dysregulation metabotype negative at the initial time point (used in the analysis for the manuscript). The subjects were sampled 30-60 days later for retest analysis. At the second time point the 31 subjects were still metabotype negative. There are plans for additional resampling of a select group of CAMP subjects. These will include metabotype positive individuals.
Thus, we do not currently know whether a positive result on the NPDX ASD test is meaningful, in the sense of being a consistent physiological marker in the individual.

Scientific evaluation of the methods used in the Biological Psychiatry paper 
The Biological Psychiatry paper describing development of the test is highly complex, involving a wide range of statistical methods. In their previous paper with a smaller sample, the authors described thousands of blood markers and claimed that using machine learning methods, they could identify a subset that discriminated the ASD and typically-developing groups with above chance accuracy. However, they noted this finding needed confirmation in a larger sample.

In the 2018 Biological Psychiatry paper, no significant differences were found for measures of metabolite abundance, failing to replicate the 2014 findings. However, further consideration of the data led the authors to concentrate instead on ratios between metabolites. As they noted: 'Ratios can uncover biological properties not evident with individual metabolites and increase the signal when two metabolites with a negative correlation are evaluated.'

Furthermore, they focused on individuals with extreme values for ratio scores, on the grounds that ASD is a heterogeneous condition, and the interest is in identifying subgroups who may have altered metabolism. The basic logic is illustrated in Figure 1 – the idea is to find a cutoff on the distribution which selects a higher proportion of ASD than typical cases. Because 76% of the sample are ASD cases, we would expect 76% of the cases in the tail of the distribution to be ASD even if the measure were unrelated to diagnosis. However, by exploring different cutoffs, it can be possible to identify a higher proportion. The proportion of ASD cases above a positive cutoff (or below a negative cutoff) is known as the positive predictive value (PPV), and for some of the ratios examined by the researchers, it was over 90%.


Figure 1: Illustrative distributions of z-scores for 4 of the 31 metabolites in ASD and typical group: this plot shows raw levels for metabolites; blue boxes show the numbers falling above or below a cutoff that is set to maximise group differences. The final analysis focused on ratios between metabolites, rather than raw levels. From Figure S2, Smith et al (2018).

This kind of approach readily lends itself to finding spurious 'positive' results, insofar as one is first inspecting the data and then identifying a cutoff that maximises the difference between two groups. It is noteworthy that the metabolites that were selected for consideration in ratio scores were identified on the basis that they showed negative correlations within a subset of the ASD sample (the 'training set'). Accordingly, PPV values from a 'training set' are likely to be biased and will over-estimate group differences. However, to avoid circularity, one can take cutoffs from the training set, and then see how they perform with a new subset of data that was not used to derive the cutoff – the 'test set'. Provided the test set is predetermined prior to any analysis, and totally separate from the training set, then the results with the test set can be regarded as giving a good indication of how the test would perform in a new sample. This is a standard way of approaching this kind of classification problem.

Usually, the PPV for a test set will be less good than for a training set: this is just a logical consequence of the fact that observed differences between groups will involve random noise as well as true population differences, and these will boost the PPV. In the test set, random effects will be different and so are more likely to hinder rather than help prediction, and so PPV will decline. However, in the Biological Psychiatry paper, the PPVs for the test sets were only marginally different from those from the training sets: for the ratios described in Table 1, the mean PPV was .887 (range .806 - .943) for the training set, and mean .880 (range .757 - .975) for the test set.
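The expected drop in PPV can be shown with a small simulation. This is a sketch under strong assumptions, not a reconstruction of the authors' analysis: the marker is pure noise with no true group difference, 76% of each simulated sample is ASD, and the cutoff is chosen to maximise PPV in the training data while flagging at least 10 children. The sample sizes, the selection rule and all function names are my own choices.

```python
import random
import statistics


def ppv_above(cutoff, values, is_asd):
    """Share of ASD cases among those scoring above the cutoff (None if nobody does)."""
    flagged = [a for v, a in zip(values, is_asd) if v > cutoff]
    return sum(flagged) / len(flagged) if flagged else None


def simulate_sample(n, p_asd, rng):
    """Marker values are pure noise: the same distribution in both groups."""
    values = [rng.gauss(0.0, 1.0) for _ in range(n)]
    is_asd = [1 if rng.random() < p_asd else 0 for _ in range(n)]
    return values, is_asd


def best_training_cutoff(values, is_asd, min_flagged=10):
    """Pick the cutoff that maximises PPV in the training data."""
    candidates = sorted(values)[:-min_flagged]  # leaves at least min_flagged above
    return max(candidates, key=lambda c: ppv_above(c, values, is_asd))


def run(n_sims=200, n_train=200, n_test=200, p_asd=0.76, seed=1):
    """Mean PPV in training and test sets when the cutoff is tuned on training data."""
    rng = random.Random(seed)
    train_ppvs, test_ppvs = [], []
    for _ in range(n_sims):
        train = simulate_sample(n_train, p_asd, rng)
        test = simulate_sample(n_test, p_asd, rng)
        cutoff = best_training_cutoff(*train)
        test_ppv = ppv_above(cutoff, *test)
        if test_ppv is None:  # nobody in the test set exceeds the cutoff
            continue
        train_ppvs.append(ppv_above(cutoff, *train))
        test_ppvs.append(test_ppv)
    return statistics.mean(train_ppvs), statistics.mean(test_ppvs)


train_mean, test_mean = run()
print(f"mean PPV, training set: {train_mean:.3f}")
print(f"mean PPV, test set:     {test_mean:.3f}")
```

In this noise-only setup the training PPV is inflated by the search over cutoffs, while the test PPV falls back towards the 76% base rate.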

I wanted to understand this better, and asked the authors for their analysis scripts, so I could reconstruct what they did. Here is the reply I received from Beth Donley:
We would be happy to have a call to discuss the methodology used to arrive at the findings in our paper. Our scripts and the source code they rely on are proprietary and will not be made public unless and until we publish them in a paper of our own. We think it would be more meaningful to have a call to discuss our approach so that you can ask questions and we can provide answers.
My questions were sufficiently technical and complex that this was not going to work, so I provided written questions, to which I received responses. However, although the replies were prompt, they did not really inspire confidence, and, without the scripts I could not check anything.

For instance:
My question: Is there an explanation for why the PPVs are so similar for training and test datasets? Usually you'd expect a drop in PPV in the test dataset if the function was optimised for the training dataset, just because the training threshold would inevitably be capitalising on chance.
Response: We observed this phenomenon, as well, and were surprised by the similarity of the training and test confusion matrix performance metric values. We have no way to know why the metrics were similar between sets. Our best guess is that the demographics of the training and test set of subjects had closely matched demographic and study related variables.
But the demographic similarity between a test and training set is not the main issue here. One thing that crucially determines how close the results will be is the reliability of the metabolomic measure. The lower the test-retest reliability of the measure, the more likely that results from a training set will fail to replicate. So it would be helpful if the authors would report the quantitative data that they have on this question.

If we ignore all the problems, how good is prediction? 
Unfortunately, it is virtually impossible to tell how accurate the test would be in a real-life context. First, we would have to make the assumption that a non-autistic group with developmental delay would be comparable to the typically-developing group. If non-autistic children with developmental delay show metabolomic imbalances, then the test's potential for diagnosis of ASD is compromised. Second, we would have to come up with an estimate of how many children who are given the test will actually have ASD: that's very hard to judge, but let us suppose it may be as high as 50%. Then, for the ratios reported in the Biological Psychiatry paper, we can compute that around 50% to 83% of those testing positive would have ASD. Note that the majority of children with and without ASD won't have scores in the tail of the distribution and will not therefore test positive (see Figure 1). On the NPDX website it is claimed that around 30% of children with ASD test positive: that is hard to square with the account in Biological Psychiatry, which reported 'an altered metabolic phenotype' in 16.7% of those with ASD.
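One way to do this kind of conversion is via the likelihood ratio: a PPV observed in a sample with one proportion of ASD cases is turned into odds, divided by the sample's prior odds, and then re-applied to a new prior. The sketch below assumes that sensitivity and specificity would be unchanged in the new setting. The function names and the two input PPVs are illustrative, and I have not tried to reproduce the exact range quoted above.

```python
def odds(p: float) -> float:
    """Convert a probability to odds."""
    return p / (1.0 - p)


def rebase_ppv(ppv_observed: float, prevalence_observed: float, prevalence_new: float) -> float:
    """PPV expected at a new prevalence, given the PPV observed at another.

    Assumes sensitivity and specificity (and hence the positive likelihood
    ratio) are the same in both settings.
    """
    likelihood_ratio = odds(ppv_observed) / odds(prevalence_observed)
    new_odds = likelihood_ratio * odds(prevalence_new)
    return new_odds / (1.0 + new_odds)


study_prevalence = 516 / (516 + 164)  # share of ASD cases in the CAMP sample (about 76%)
for ppv in (0.80, 0.90):              # illustrative PPVs, not specific values from the paper
    rebased = rebase_ppv(ppv, study_prevalence, 0.50)
    print(f"PPV {ppv:.2f} in the study sample -> {rebased:.2f} if half of those tested have ASD")
```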

Conflict of interest and need for transparency
The published paper gives a comprehensive COI statement as follows:
AMS, MAL, and REB are employees of, JJK and PRW were employees of, and ELRD is an equity owner in Stemina Biomarker Discovery Inc. AMS, JJK, PRW, MAL, ELRD, and REB are inventors on provisional patent application 62/623,153 titled “Amino Acid Analysis and Autism Subsets” filed January 29, 2018. DGA receives research funding from the National Institutes of Health, the Simons Foundation, and Stemina Biomarker Discovery Inc. He is on the scientific advisory boards of Stemina Biomarker Discovery Inc. and Axial Therapeutics.
It is generally accepted that just because there is COI, this does not invalidate the work: it simply provides a context in which it can be interpreted. The study reported in Biological Psychiatry represents a huge investment of time and money, with research funds contributed from both public and private sources. In the Xconomy interview, it is stated that the research has cost $8 million to date. This kind of work may only be possible to do with involvement of a biotechnology company which is willing to invest funds in the hope of making discoveries that can be commercialised; this is a similar model to drug development.

Where there is a strong commercial interest in the outcome of research, the best way of counteracting negative impressions is for researchers to be as open and transparent as possible. This was not the case with the NPDX study: as described above, there were substantial changes from the registered protocol on ClinicalTrials.gov, not discussed in the paper. The analysis scripts are not available – this means we have to take on trust details of the methods in an area where the devil is in the detail. As Philip Stark has argued, a paper that is long on results but short on methods is more like an advertisement than a research communication: "Science should be ‘show me’, not ‘trust me’; it should be ‘help me if you can’, not ‘catch me if you can’."

Postscript
On 27th December, Biological Psychiatry published correspondence on the Smith et al paper by Kristin Sainani and Steven Goodman from Stanford University. They raised some of the points noted above regarding the lack of predictive utility of the blood test in clinical contexts, the lack of a comparison sample with developmental delay, and the conflict of interest issues. In their response, the authors made the point that they had noted these limitations in their published paper.

References
Sainani, K. L., & Goodman, S. N. (2018). Lack of diagnostic utility of 'amino acid dysregulation metabotypes'. Biological Psychiatry. doi:10.1016/j.biopsych.2018.11.012

Smith, A. M., Donley, E. L. R., Burrier, R. E., King, J. J., & Amaral, D. G. (2018). Reply to: Lack of Diagnostic Utility of “Amino Acid Dysregulation Metabotypes”. Biological Psychiatry. doi: https://doi.org/10.1016/j.biopsych.2018.11.013

Smith, A. M., King, J. J., West, P. R., Ludwig, M. A., Donley, E. L. R., Burrier, R. E., & Amaral, D. G. (2018). Amino acid dysregulation metabotypes: Potential biomarkers for diagnosis and individualized treatment for subtypes of autism spectrum disorder. Biological Psychiatry. doi:https://doi.org/10.1016/j.biopsych.2018.08.016

Stark, P. (2018). Before reproducibility must come preproducibility. Nature, 557, 613. doi:10.1038/d41586-018-05256-0

West, P. R., Amaral, D. G., Bais, P., Smith, A. M., Egnash, L. A., Ross, M. E., . . . Burrier, R. E. (2014). Metabolomics as a tool for discovery of biomarkers of Autism Spectrum Disorder in the blood plasma of children. PLOS One, 9(11), e112445. doi:  https://doi.org/10.1371/journal.pone.0112445

Sunday, 11 May 2014

Changing the landscape of psychiatric research:

What will the RDoC initiative by NIMH achieve?


©CartoonStock.com

There's a lot wrong with current psychiatric classification. Every few years, the American Psychiatric Association comes up with a new set of labels and diagnostic criteria, but whereas the Diagnostic and Statistical Manual used to be seen as some kind of Bible for psychiatrists, the latest version, DSM5, has been greeted with hostility and derision. The number of diagnostic categories keeps multiplying without any commensurate increase in the evidence base to validate the categories. It has been argued that vested interests from pharmaceutical companies create pressures to medicalise normality so that everyone will sooner or later have a diagnosis (Frances, 2013). And even excluding such conflict of interest, there are concerns that such well-known categories as schizophrenia and depression lack reliability and validity (Kendell & Jablensky, 2003).

In 2013, Tom Insel, Director of the US funding agency, National Institute of Mental Health (NIMH), created a stir with a blogpost in which he criticised the DSM5 and laid out the vision of a new Research Domain Criteria (RDoC) project. This aimed "to transform diagnosis by incorporating genetics, imaging, cognitive science, and other levels of information to lay the foundation for a new classification system."

He drew parallels with physical medicine, where diagnosis is not made purely on the basis of symptoms, but also uses measures of underlying physiological function that help distinguish between conditions and indicate the most appropriate treatment. This, he argued, should be the goal of psychiatry, to go beyond presenting symptoms to underlying causes, reconceptualising disorders in terms of neural systems.

This has, of course, been a goal for many researchers for several years, but Insel expressed frustration at the lack of progress, noting that at present: "We cannot design a system based on biomarkers or cognitive performance because we lack the data". That being the case, he argued, a priority for NIMH should be to create a framework for collecting relevant data. This would entail casting aside conventional psychiatric diagnoses, working with dimensions rather than categories, and establishing links between genetic, neural and behavioural levels of description.

This represents a massive shift in research funding strategy, and some are uneasy about it. Nobody, as far as I am aware, is keen to defend the status quo, as represented by DSM.  As Insel remarked in his blogpost: "Patients with mental disorders deserve better". The issue is whether RDoC is going to make things any better. I see five big problems.

1. McLaren (2011) is among those querying the assumption that mental illnesses are 'disorders of brain circuits'. The goal of the RDoC program is to fill in a huge matrix with new research findings. The rows of the matrix are not the traditional diagnostic categories: instead they are five research domains: Negative Valence Systems, Positive Valence Systems, Cognitive Systems, Systems for Social Processes, Arousal/Regulatory Systems, each of which has subdivisions: e.g. Cognitive Systems is broken down into Attention, Perception, Working memory, Declarative memory, Language behavior and Cognitive (effortful) control. The columns of the matrix are Genes, Molecules, Cells, Circuits, Physiology, Behavior, Self-Reports, and Paradigms. Strikingly absent is anything about experience or environment.

This seems symptomatic of our age. I remember sitting through a conference presentation about a study investigating whether brain measures could predict response to cognitive behaviour therapy in depression.  OK, it's possible that they might, but what surprised me was that no measures of past life events or current social circumstances were included in the study. My intuitions may be wrong, but it would seem that these factors are likely to play a role. My impression is that some of the more successful interventions developed in recent years are based not on neurobiology or genetics, but on a detailed analysis of the phenomenology of mental illness, as illustrated, for example, by the work of my colleagues David Clark and Anke Ehlers. Consideration of such factors is strikingly absent from RDoC.

2. The goal of the RDoC is ultimately to help patients, but the link with intervention is unclear. Suppose I become increasingly obsessed with checking electrical switches, such that I am unable to function in my job. Thanks to the RDoC program, I'm found to have a dysfunctional neural circuit. Presumably the benefit of this is that I could be given a new pharmacological intervention targeting that circuit, which will make me less obsessive. But how long will I stay on the drug? It's not given me any way to cope with the tendency to check, or with the unwanted thoughts that obtrude into my consciousness, and they are likely to recur when I come off it.  I'm not opposed to pharmacological interventions in principle, but they tend not to have a 'stop rule'.

There are psychological interventions that tackle the symptoms and the cognitive processes that underlie them more directly.  Could better knowledge of neurobiological correlates help develop more of these?  I guess it is possible, but my overall sense is that this translational potential is exaggerated – just as with the current hype around 'educational neuroscience'. The RDoC program embodies a mistaken belief that neuroscientific research is inherently better than psychological research because it deals with primary causes, when in fact it cannot capture key clinical phenomena. For instance, the distinction between a compulsive hand-washer and a compulsive checker is unlikely to have a clear brain correlate, yet we need to know about the specific symptoms of the individual to help them overcome them.

3. Those proposing RDoC appear to have a naive view of the potential of genetics to inform psychiatry.  It's worth quoting in detail from their vision of the kinds of study that would be encouraged by NIMH, as stated here:

Recent studies have shown that a number of genes reported to confer risk for schizophrenia, such as DISC1 (“Disrupted in schizophrenia”) and neuregulin, actually appear to be similar in risk for unipolar and bipolar mood disorders. ... Thus, in one potential design, inclusion criteria might simply consist of all patients seen for evaluation at a psychotic disorders treatment unit. The independent variable might comprise two groups of patients: One group would be positive and the other negative for one or more risk gene configurations (SNP or CNV), with the groups matched on demographics such as age, sex, and education. Dependent variables could be responses to a set of cognitive paradigms, and clinical status on a variety of symptom measures. Analyses would be conducted to compare the pattern of differences in responses to the cognitive or emotional tasks in patients who are positive and negative for the risk configurations.

This sounds to me like a recipe for wasting a huge amount of research funding. The effect sizes of most behavioural/cognitive genetic associations are tiny and so one would need an enormous sample size to see differences related to genotype. Coupled with an open-ended search for differences between genotypes on a battery of cognitive measures, this would undoubtedly generate some 'significant' results which could go on to mislead the field for some time before a failure to replicate was reported (cf. Munafò & Gage, 2013).
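
To put rough numbers on both halves of this worry, here is a minimal Python sketch. It uses the usual normal-approximation formula for the sample size needed to compare two group means, and the familywise error rate for a battery of independent tests. The effect size of d = 0.1, the alpha of .05, the power of 80% and the battery sizes are all illustrative assumptions, not figures from the RDoC documents.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-group comparison of means
    (normal approximation; effect size expressed as Cohen's d)."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2.0 * ((z_alpha + z_power) / effect_size_d) ** 2)


def familywise_error(n_tests, alpha=0.05):
    """Chance of at least one false positive across n_tests tests,
    ASSUMING the tests are independent and every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


# ASSUMPTION: hypothetical effect sizes and battery sizes, for illustration.
for d in (0.5, 0.2, 0.1):
    print(f"d={d}: about {n_per_group(d)} participants per genotype group")
for k in (1, 10, 20):
    print(f"{k} measures: P(at least one false positive) = "
          f"{familywise_error(k):.2f}")
```

Real cognitive measures are correlated, so the independence assumption overstates the familywise rate somewhat, but the qualitative point stands: small effects need very large samples, and an open-ended battery makes spurious 'significant' findings likely.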

The NIMH website notes that "the current diagnostic system is not informed by recent breakthroughs in genetics". There is good reason for that: to date, the genetic findings have been disappointing. Such associations as are found either involve extremely rare and heterogeneous mutations of large effect, or involve common genetic variants whose small effects are not of clinical significance. We cannot know what the future holds, but to date talk of 'breakthroughs' is misleading.

4. Some of the entries in the RDoC matrix also suggest a lack of appreciation of the difference between studying individual differences versus group effects.  The RDoC program is focused on understanding individual differences. That requires particularly stringent criteria for measures, which need to be adequately reliable, valid and sensitive to pick up differences between people.  I appreciate that the published RDoC matrices are seen as a starting-point and not as definitive, but I would recommend that more thought goes into establishing the psychometric credibility of measures before embarking on expensive studies looking for correlations between genes, brains and behaviour. If the rank ordering of a group of people on a measure is not the same from one occasion to another, or if there are substantial floor or ceiling effects, that measure is not going to be much use as an indicator of an underlying construct. Furthermore, if different versions of a task that are supposed to tap into a single construct give different patterns of results, then we need a rethink – see e.g. Foti et al, 2013; Shilling et al, 2002, for examples.  Such considerations are often ignored by those attempting to move experimental work into a translational phase. If we are really to achieve 'precision medicine' we need precise measures.
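
The point about rank ordering can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: each person has a stable 'true' score, each testing occasion adds independent random measurement noise, and test-retest reliability is indexed by the Spearman rank correlation between the two occasions. The function names, sample size and noise levels are all hypothetical.

```python
import math
import random


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def simulated_retest(noise_sd, n_people=200, seed=1):
    """Rank-order agreement between two occasions of a noisy measure."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(0.0, 1.0) for _ in range(n_people)]
    time1 = [t + rng.gauss(0.0, noise_sd) for t in true_scores]
    time2 = [t + rng.gauss(0.0, noise_sd) for t in true_scores]
    return spearman(time1, time2)


# ASSUMPTION: noise levels are arbitrary, chosen to contrast good and poor measures.
for noise_sd in (0.3, 1.0, 2.0):
    print(f"noise SD={noise_sd}: test-retest rho={simulated_retest(noise_sd):.2f}")
```

A measure whose noise is large relative to true differences between people will reorder them from one occasion to the next, which caps how strongly it can correlate with anything else, genes and brain measures included.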

5. The matrix as it stands does not give much confidence that the RDoC approach will give clearer gene-brain-behaviour links than traditional psychiatric categories.

For instance, BDNF appears in the Gene column of the matrix for the constructs of acute threat, auditory perception, declarative memory, goal selection, and response selection. COMT appears with threat, loss, frustrative nonreward, reward learning, goal selection, response selection and reception of facial communication. Of course, it's early days. The whole purpose of the enterprise is to flesh out the matrix with more detailed and accurate information. Nevertheless, the attempts at summarising what is known to date do not inspire confidence that this goal will be achieved.

After such a list of objections to RDoC, I do have one good thing to say about it, which is that it appears to be encouraging and embracing data-sharing and open science. This will be an important advance that may help us find out more quickly which avenues are worth exploring and which are cul-de-sacs. I suspect we will find out some useful things from the RDoC project: I just have reservations as to whether they will be of any benefit to psychiatry, or more importantly, to psychiatric patients.

References
Foti, D., Kotov, R., & Hajcak, G. (2013). Psychometric considerations in using error-related brain activity as a biomarker in psychotic disorders. Journal of Abnormal Psychology, 122(2), 520-531. doi: 10.1037/a0032618

Frances, A. (2013). Saving normal: An insider's revolt against out-of-control psychiatric diagnosis, DSM-5, big pharma, and the medicalization of ordinary life. New York: HarperCollins.

Kendell, R., & Jablensky, A. (2003). Distinguishing between the validity and utility of psychiatric diagnoses. American Journal of Psychiatry, 160, 4-12.

McLaren, N. (2011). Cells, circuits, and syndromes: A critical commentary on the NIMH Research Domain Criteria project. Ethical Human Psychology and Psychiatry, 13(3), 229-236. doi: 10.1891/1559-4343.13.3.229

Munafò, M. R., & Gage, S. H. (2013). Improving the reliability and reporting of genetic association studies. Drug and Alcohol Dependence(0). doi: http://dx.doi.org/10.1016/j.drugalcdep.2013.03.023

Shilling, V. M., Chetwynd, A., & Rabbitt, P. M. A. (2002). Individual inconsistency across measures of inhibition: an investigation of the construct validity of inhibition in older adults. Neuropsychologia, 40, 605-619.


This article (Figshare version) can be cited as:
 Bishop, Dorothy V M (2014): Changing the landscape of psychiatric research: What will the RDoC initiative by NIMH achieve?. figshare. http://dx.doi.org/10.6084/m9.figshare.1030210  


P.S. 8th October 2015.
RDoC is in the news again, leading Jon Roiser to send me a tweet asking whether my views expressed re social factors were just intuitions or evidence-based. That's a good question, given the importance I attach to evidence. So is there any evidence that past life events or current social situation predict response to intervention in depression? 
I have to confess I am not an expert in this area. My views are largely formed from what I learned years ago when training as a clinical psychologist, when research by Brown and Harris showed life events were potent predictors of depression:

Brown, G.W. & Harris, T.O. (1978). Social origins of depression: A study of psychiatric disorder in women. London: Tavistock. 

These studies were not on intervention, but it does seem plausible that the same factors that are associated with initial onset will also influence response to intervention. Thus it seems reasonable that it would be harder to treat someone's depression if they are still experiencing the factors that led to the initial depression, e.g. living in an abusive relationship, coping with the death of a loved one, or experiencing financial stress.
In response to Jon's query, I did a small trawl through recent articles in Web of Science; I have only looked at abstracts for these, so I don't know how good the evidence is, but the general impression is that social factors and life events are still regarded as important factors in the etiology of depression - and therefore might also be expected to influence response to intervention. Here's a handful of papers:

Colman, I., Zeng, Y., McMartin, S. E., Naicker, K., Ataullahjan, A., Weeks, M., . . . Galambos, N. L. (2014). Protective factors against depression during the transition from adolescence to adulthood: Findings from a national Canadian cohort. Preventive Medicine, 65, 28-32. doi: 10.1016/j.ypmed.2014.04.008

Cwik, M., Barlow, A., Tingey, L., Goklish, N., Larzelere-Hinton, F., Craig, M., & Walkup, J. T. (2015). Exploring Risk and Protective Factors with a Community Sample of American Indian Adolescents Who Attempted Suicide. Archives of Suicide Research, 19(2), 172-189. doi: 10.1080/13811118.2015.1004472
Dour, H. J., Wiley, J. F., Roy-Byrne, P., Stein, M. B., Sullivan, G., Sherbourne, C. D., . . . Craske, M. G. (2014). Perceived social support mediates anxiety and depressive symptom changes following primary care intervention. Depression and Anxiety, 31(5), 436-442. doi: 10.1002/da.22216
Kemner, S. M., Mesman, E., Nolen, W. A., Eijckemans, M. J. C., & Hillegers, M. H. J. (2015). The role of life events and psychological factors in the onset of first and recurrent mood episodes in bipolar offspring: results from the Dutch Bipolar Offspring Study. Psychological Medicine, 45(12), 2571-2581. doi: 10.1017/s0033291715000495
Sheidow, A. J., Henry, D. B., Tolan, P. H., & Strachan, M. K. (2014). The Role of Stress Exposure and Family Functioning in Internalizing Outcomes of Urban Families. Journal of Child and Family Studies, 23(8), 1351-1365. doi: 10.1007/s10826-013-9793-3

I'd be happy to consider alternative evidence, but my view is that if we want to look at brain or gene predictors, we'd do well to also assess life events and social factors - things that are relatively easy to measure, might explain a significant proportion of variance, and could also possibly provide a mechanism to account for neurobiological findings.