Showing posts with label intelligence. Show all posts

Tuesday, 18 February 2020

Why eugenics is wrong

I really didn't think I would need to write a blogpost on this topic, but it seems eugenics is having a resurgence in the UK, so here we go.

The idea of eugenics is deceptively simple. Given that some traits are heritable, we should identify those with beneficial heritable traits and encourage them to have more children, while discouraging (or even preventing) those with less desirable traits from breeding. This way we will improve the human race.

Those promoting eugenics have decided that high intelligence is a desirable trait, and indeed it does correlate with things like good educational outcomes, earnings and health. It is also heritable. So wouldn't it be great to improve the human race by genetic selection for intelligence?

Much of the debate on this question has focussed on whether it could be done, rather than whether it should be. Many who would blench at enforced sterilisation have warmed to the suggestion that Polygenic Risk Scores can be used to predict educational attainment, and so could be used for embryo selection. However, those promoting this idea have exaggerated the predictive power of polygenic scores (see this preprint by Tim Morris for a recent analysis, and this review of Kevin Mitchell's book Innate for other links). But let us suppose for a moment that in the future we could predict an individual's intelligence from a genetic score: would it be acceptable then to use this information to improve the human race?

The flaw in the argument is exposed when you consider the people who are making it. Typically, they are people who do well on intelligence tests. In effect, they are saying "The world needs more people like us". Looking at those advocating eugenic policies, many of us would beg to differ.

The bad state of the world we live in is not caused by unintelligent people. It is caused by intelligent people who have used their abilities to amass disproportionate wealth, manipulate others or elbow them out of the way. Eugenicists should be especially aware that their advantages are due to luck rather than merit, yet they behave as if they deserve them, and fiercely protect them from "people not like us".

If we really wanted to use our knowledge of genetics to make the world a better place, we would select for the traits of kindness and tolerance. Rather than sterilising the unintelligent, we would minimise breeding by those who are characterised by greed and a sense of superiority over other human beings. But there's the catch: it's only those who think they're superior to others who actually want to implement eugenic policies.

Wednesday, 15 May 2013

Have we become slower and dumber?

Guest post by Patrick Rabbitt

http://www.flickr.com/photos/sciencemuseum/3321607591/
This week, a paper by Woodley et al (2013) was widely quoted in the media (e.g. Daily Mail, Telegraph). The authors dramatically announced that the average intelligence of populations of Western industrialised societies has fallen since the Victorian era. This is provocative because previous analyses of large archived datasets of intelligence test scores by Flynn and others show the opposite. However, Woodley et al did not examine average intelligence test scores obtained from different generations. They compared 16 sets of data from Simple Reaction Time (SRT) experiments made on groups of people at various times between 1884 and 2002. In all of these experiments volunteers responded to a single light signal by pressing a single response key. Data for women are incomplete but averages of SRTs for men increase significantly with year of testing. Because Woodley et al regard SRTs as good inverse proxy measures for intelligence test scores, and as in some senses “purer” measures of intelligence than pencil and paper tests, they concluded that more recent samples are less intelligent than earlier ones.

Throughout their paper the authors argue that higher intelligence of persons alive during the Victorian era can explain why their creativity and achievements were markedly greater than those of later, duller generations. We can leave aside the important question of whether there is any sound evidence that creativity and intellectual achievements have declined since a Great Victorian Flowering, because only two of the 16 datasets they compared were collected before Victoria’s death in 1901. The remaining 14 datasets date between 1941 and 2004 and, of these, only four were collected before 1970. So most of the studies analysed were made within my personal working lifespan. This provokes both nostalgia and distrust.

Between 1959 and 2004 I collected reaction times (RTs) from many large samples of people but it would make no sense for me to compare absolute values of group mean RTs that I obtained before and after 1975. This was because, until 1975, like nearly all of my colleagues, the only apparatus I had were Dekatron counters, the Birren Psychomet or SPARTA apparatus, none of which measured intervals shorter than 100 msec. Consequently, when my apparatus gave a reading of 200 msec. the actual Reaction Time might be anywhere between 200 and 299 msec. Like most of my colleagues I always computed and published mean RTs to three decimal places, but this was pretentious because all the RTs I had collected had been, in effect, rounded down by my equipment. After 1975, easier access to computers and better programs gradually began to allow true millisecond resolution. More investigators took advantage of new equipment and our reports of millisecond averages became less misleading. I am unsurprised that mean RTs computed from post-1975 data were consistently and significantly longer than those for pre-1975 data.
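The rounding-down argument can be illustrated with a small simulation. This is only a sketch under stated assumptions: the "true" RTs are drawn from an invented normal distribution (mean 250 ms, SD 40 ms), and the timer is assumed to truncate to whole 100 ms steps exactly as described above. None of the numbers come from the studies under discussion.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" reaction times in ms. The distribution is an assumption
# chosen for illustration only, not data from any of the studies discussed.
true_rt = rng.normal(loc=250.0, scale=40.0, size=100_000)

# A timer that only advances in 100 ms steps shows 200 for anything from
# 200 up to (but not including) 300 ms: it truncates rather than rounds.
STEP_MS = 100
recorded_rt = np.floor(true_rt / STEP_MS) * STEP_MS

bias = true_rt.mean() - recorded_rt.mean()
print(f"mean true RT:     {true_rt.mean():.1f} ms")
print(f"mean recorded RT: {recorded_rt.mean():.1f} ms")
print(f"underestimate:    {bias:.1f} ms")
```

Under these assumptions the recorded mean falls short of the true mean by roughly half a timer step. Whether real pre-1975 apparatus behaved like this idealised truncating timer is, of course, an empirical question.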

Changes in recording accuracy are a sufficient reason to withhold excitement at Woodley et al’s comparison. It is worth noticing that different methodological issues also make it tricky to compare absolute values for means of RTs that were collected at different times and so with different kinds of equipment. For example RTs are affected by differences in signal visibility and rise-times to maximum brightness between tungsten lamps, computer monitor displays, neon bulbs and LCDs. The stiffness and “throw” of response buttons will also have varied between the set-ups that investigators used. When comparing absolute values of SRTs, another important factor is whether or not each signal to respond is preceded by a warning signal, whether the periods between warning signals and response signals are constant or variable and just how long they are (intervals between, approximately, 200 and 800 ms allow faster RTs than shorter or longer ones). Knowing these methodological quirks makes us realise that, in marked contrast to intelligence tests, methodologies for measuring RT have been thoroughly explored but never standardised.

So I do not yet believe that Woodley et al’s analyses show that psychologists of my generation were probably (once!) smarter than our young colleagues (now) are. This seems unlikely, but perhaps if I read further publications by these industrious investigators I may become convinced that this is really the case.


References
Flynn, J. R. (1987). Massive IQ gains in 14 nations - what IQ tests really measure. Psychological Bulletin, 101(2), 171-191. doi: 10.1037/0033-2909.101.2.171
Woodley, M. A., te Nijenhuis, J., & Murphy, R. (2013). Were the Victorians cleverer than us? The decline in general intelligence estimated from a meta-analysis of the slowing of simple reaction time. Intelligence. http://dx.doi.org/10.1016/j.intell.2013.04.006


POST SCRIPT, 24th May 2013

Dr Woodley has published a response to my critique on James Thompson's blog. He asks me to answer. I am glad to do so. Sluggishness has been due only to the pleasure of reading the many articles to which Woodley drew my attention. Dorothy’s remorseless archaeology of this trove, summarised in the table below, has provoked much domestic merriment during the past few days. We are grateful to Dr Woodley for this diversion. Here are my thoughts on his comments on my post.

Woodley et al used data from a meta-analysis by Silverman (2010). I am grateful to Prof Silverman for very rapid access to his paper in which he compared average times to make a single response to a light signal from large samples in Francis Galton's anthropometric laboratories and from several later, smaller samples dating from 1941 to 2006. To these Woodley et al added a dataset from Helen Bradford Thompson's 1903 monograph "The mental traits of sex".

As Silverman (2010) trenchantly points out, there is a limit to possible comparisons from these datasets: “In principle, it would be possible to uncover the rate at which RT increased (since the Galton studies) by controlling for potentially confounding variables in a multiple regression analysis. However, this requires that each of these variables be represented by multiple data points, but this requirement cannot be met by the present dataset. Accurately describing change over time also requires that both ends of the temporal dimension be well represented in the dataset and that the dataset be free of outliers (Cohen, Cohen, West, & Aiken, 2003); neither of these requirements can be met …… Thus, it is important to reiterate that the purpose … is not to show that RT has changed according to a specific function over time but rather to show that modern studies have obtained RTs that are far longer than those obtained by Galton.”

Neither Silverman nor Woodley et al seem much concerned that results of comparisons might depend on differences between studies in apparatus and methods, which are shown here, together with temporal resolution where reported. 


Since Galton's dataset is the key baseline for the conclusion that population mean RT is increasing, it is worth considering details of his equipment described here and in a wonderful archival paper “Galton’s Data a Century Later” by Johnson et al (1985): “……during its descent the pendulum gives a sight-signal by brushing against a very light and small mirror which reflects a light off or onto a screen, or, on the other hand, it gives a sound-signal by a light weight being thrown off the pendulum by impact with a hollow box. The position of the pendulum at either of these occurrences is known. The position of the pendulum when the response is made is obtained by means of a thread stretched parallel to the axis of the pendulum by two elastic bands one above and one below, the thread being in a plane through the axes of the pendulum, perpendicular to the plane of the pendulum's motion. This thread moves freely between two parallel bars in a horizontal plane, and the pressing of a key causes the horizontal bars to clamp the thread. Thus the clamped thread gives the position of the pendulum on striking the key. The elastic bands provide for the pendulum not being suddenly checked on the clamping. The horizontal bars are just below a horizontal scale, 800 mm. below the point of suspension of the pendulum. Galton provided a table for reading off the distance along the scale from the vertical position of the pendulum in terms of the time taken from the vertical position to the position in which the thread is clamped." (p. 347).

Contemporary journal referees would press authors for reassurance that the apparatus produced consistent values over trials and had no bias to over- or underestimate. Obviously this would have been very difficult for Galton to achieve.

In my earlier post I noted that over the mid- to late 20th century it became obvious that to report reaction times (RT) to three decimal places is misleading if equipment only allows recording in 100 ms steps. In that case a reading of 200 ms will remain until a further 100 ms have elapsed, effectively "rounding down" the RT. Woodley argues that we cannot assume that rounding down occurred. I do not follow his reasoning on this point. He also offers a statistical analysis to confirm that if the temporal resolution of the measure is the only difference between studies, this would not systematically underestimate RT. Disagreement on whether rounding occurred may only be resolved with empirical data comparing recorded and true RTs between equipments.

A general concern with comparisons of RTs between studies is that they are significantly affected by the methodology and apparatus used to collect them. This is not only a matter of differences in resolution: it can also lead to systematic bias in the timing of trials. For a comprehensive account of how minor differences between different 21st century computers and commercial software can flaw comparisons between studies see Plant and Quinlan (2013), who write: "All that we can state with absolute certainty is that all studies are likely to suffer from varying degrees of presentation, synchronization, and response time errors if timing error was not specifically controlled for." I earlier suggested that apparently trivial procedural details can markedly affect RTs. Among these are whether or not participants are given warning signals, whether the intervals between warning signals and response signals are constant or vary across trials and how long these intervals are, the brightness of signal lamps and the mechanical properties of response keys. A further point also turns out to be relevant to assessment of Woodley et al's argument: average values will also depend on the number of trials recorded, and averaged, for each person, and whether outliers are excluded. Note, for instance, that the equipment used in the studies by Deary and Der, though appropriate for the comparisons that they made and reported, did not record RTs for individual trials but an averaged RT for an entire session. This makes it impossible to exclude outliers, as is normal good practice. The point is that comparisons that are satisfactory within the context of a single well-run experiment may be seriously misleading if made between equally scrupulous experiments using different apparatus and procedures. Johnson et al (1985) and Silverman (2010) stress that Galton’s data were wonderfully internally consistent. This reassures us that equipment and methods were well standardised within his own study. It cannot give any assurance that his data can sensibly be compared with those obtained with other very diverse equipments and methodologies.

Another excellent feature of the Galton dataset is that re-testing of part of his large initial sample allowed estimates of reliability of his measures. With his large sample sizes even low values of test/re-test correlations were better than chance. Nevertheless it is interesting that the test-retest correlation for visual RT, at .17, on which Silverman’s and Woodley’s conclusions depend, was lower than the next lowest (high frequency auditory acuity, .28), or Snellen eye-chart (.58) and visual acuity (.76 to .79) (Johnson et al, 1985, Table 2).

We do not know whether warning signals were used in Galton's RT studies, or, if so, how long the preparatory intervals between warning and response signals might have been. Silverman (2010) had earlier acknowledged that preparatory interval duration might be an issue but felt that he could ignore it, first because of a report by Teichner that Wundt’s discovery of fore-period duration effects could not be independently substantiated, and second because he accepted Seashore et al’s (1941) reassurance that there are no effects on RT of fore-period duration.

Ever since a convincing study by Klemmer (1957) it has been recognised that the durations of preparatory intervals do significantly affect reaction times, that the effects of fore-period variation are large and that results cannot be usefully compared unless these factors are taken into consideration. Indeed during the 1960s fore-period effects were the staple topic of a veritable academic industry (see review by Niemi and Naatanen, 1981, and far too many other papers by Bertelson, Nickerson, Sanders, Rabbitt etc. etc). In this context Seashore et al’s (1941) failure to find fore-period effects does not increase our confidence in their results as one of the data points on which Woodley et al’s analysis is based.

Silverman’s lack of interest in fore-period duration was also heightened by Johnson et al’s (1985) comment that, as far as they were able to discover, each of Galton’s volunteers was only given one trial. Silverman implies that if each of Galton’s volunteers only recorded a single RT, variations in preparatory intervals are hardly an issue. It is also arguable that this relaxed procedure might have lengthened rather than shortened RTs. Well… Yes and No. First, it would be nice to know just how volunteers were alerted that their single trial was imminent. By a nod or a wink? A friendly pat on the shoulder? A verbal “Ready”? Second, an important point of using warning signals, and of recording many rather than just one trial, is that the first thing that all of us who run simple RT experiments discover is that volunteers are very prone to “jump the gun” and begin to respond before any signal appears, so recording absurdly fast “RTs” that can be as low as 10 to 60 ms. 20th and 21st century investigators are usually (!) careful to identify and exclude such observations. Many also edit out what they regard as implausibly slow responses. We do not know whether or how either kind of editing occurred in the Galton laboratories. Many participants would have jumped the gun and if this was their sole recorded reaction the effects on group means would have been considerable. If Galton’s staff did edit RTs, both acceptance of impulsive responses and dismissal of very slow responses would reduce means and favour the idea of “Speedy Victorians”.

I would like to stress that my concerns are methodological rather than dogmatic. Investigators of reaction times try to test models for information processing by making small changes in single variables in tasks run on the same apparatus and with exactly the same procedures. This makes us wary of conclusions from comparisons between datasets collected with wildly different equipments, procedures and groups of people. My concerns were shared by some of those whose data are used by Silverman and Woodley et al. For example, the operational guide for the Datico Terry 84 device used by Anger et al states that "A single device has been chosen because it is very difficult to compare reaction time data from different test devices".

Because I have spent most of my working life using RTs to compare the mental abilities of people of different ages I am very much in favour of using RT measurements as a research tool for individual differences. (For my personal interpretation of the relationships between people’s calendar ages and gross brain status and their performance on measures of mental speed, of fluid intelligence, of executive function, and of memory see e.g. Rabbitt et al, 2007). I also strongly believe that mining archived data is a very valuable scientific endeavour and becomes more valuable as the volume of available data exponentially increases. For example, Flynn’s dogged analyses of archived intelligence test scores show that data mining has raised provocative and surprising questions. I also believe, with Silverman, that large population studies provide good epidemiological evidence of the effects of changes in incidence of malnutrition or of misuse of pesticides or antibiotics. I am more amused than concerned when, in line with Galton’s strange eugenic obsessions, they are also discussed as potential illustrations of growing degeneracy of our species due to increased survival odds for the biologically unfit. As I noted in my original post, my only concern is that it is a time-wasting mistake to uncritically treat measurements of Reaction Times as being, in some sense, “purer”, more direct and more trustworthy indices of individual differences than other measures such as intelligence tests. Of course RTs can be sensitive and reliable measures of individual differences but, as things stand, equipments and procedures are not standardised and, because RTs are liable to many methodological quirks, we obtain widely different mean values from different population samples even from apparently very similar tasks.

Wednesday, 21 November 2012

Moderate drinking in pregnancy: toxic or benign?

There’s no doubt that getting tipsy while pregnant is a seriously bad idea. Alcohol is a toxin that can pass through the placenta to the foetus and cause damage to the developing brain. For women who are regular heavy drinkers or binge drinkers, there is a risk that the child will develop foetal alcohol syndrome, a condition that affects physical development and is associated with learning difficulties.

But what of more moderate drinking? The advice is conflicting. Many doctors take the view that alcohol is never going to be good for the developing foetus and they recommend complete abstention during pregnancy as a precautionary measure. Others have argued, though, that this advice is too extreme, and that moderate drinking does not pose any risk to the child.

Last week a paper by Lewis et al was published in PLOS One providing evidence on this issue, and concluding that moderate drinking does pose a risk and should be avoided. The methodology of the paper was complex and it’s worth explaining in detail what was done.

The researchers used data from ALSPAC, a large study that followed the progress of several thousand British children from before birth. A great strength of this study is that information was gathered prospectively: in the case of maternal drinking, mothers completed questionnaires during pregnancy, at 18 and 32 weeks gestation.  Obviously, the data won’t be perfect: you have to rely on women to report their intake honestly, but it’s hard to see how else to gather such data without being overly intrusive. When children were 8 years old, they were given a standard IQ test, and this was the dependent variable in the study.

One obvious thing to do with the data would be to see if there is any relationship between amount drunk in pregnancy and the child’s IQ. Quite a few studies have done this and a recent systematic review concluded that, provided one excluded women who drank more than 12 g (1.5 UK units) per day or who were binge-drinkers, there was no impact on the child. Lewis et al pointed out, however, that this is not watertight, because drinking in pregnancy is associated with other confounding factors. Indeed, in their study, the lowest IQs were obtained by children of mothers who did not drink at all during pregnancy. However, these mothers were also likely to be younger and less socially-advantaged than mothers who drank, making it hard to disentangle causal influences.

So this is where the clever bit of the study design came in, in the shape of Mendelian randomisation. The logic goes like this: there are genetic differences between people in how they metabolise alcohol. Some people can become extremely drunk, or indeed ill, after a single drink, whereas others can drink everyone else under the table. This relates to variation in a set of genes known as ADH genes, which are clustered together on chromosome 4. If a woman metabolises alcohol slowly, this could be particularly damaging to the foetus, because alcohol hangs around in the bloodstream longer. There are quite large racial differences in ADH genes, and for that reason the researchers restricted consideration just to those of White European background. For this group, they showed that variation in ADH genes is not related to social background. So they had a very specific prediction: for women who drank in pregnancy, there should be a relationship between their ADH genes and the child’s outcome. However, if the woman did not drink at all, then the ADH genotype should make no difference. This is the result they reported. It’s important to be clear that they did not directly estimate the impact of maternal drinking on the child’s IQ: rather, they inferred that if ADH genotype is associated with child’s IQ only in drinkers, then this is indirect evidence that drinking is having an impact. This is a neat way of showing that there is an effect of a risk factor (alcohol consumption) while avoiding the complications of confounding by social class differences.
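The logic of this design can be sketched with simulated data. Everything here is hypothetical: the allele frequency, the strength of the social confound and the size of the harm per risk allele are invented for illustration, and the model is far simpler than anything in Lewis et al. The point is only to show how genotype can reveal an effect of drinking even when a direct comparison of drinkers and abstainers points the other way.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# All numbers below are hypothetical, chosen only to illustrate the logic.
risk_alleles = rng.binomial(2, 0.3, size=n)   # 0, 1 or 2 slow-metabolism alleles
social = rng.normal(size=n)                   # confounder: social advantage

# Drinking depends on social background but NOT on genotype.
drinks = rng.random(n) < 1 / (1 + np.exp(-social))

# IQ depends on social background, and on genotype only if the mother drank.
HARM_PER_ALLELE = 2.0
iq = (100 + 5 * social
      - HARM_PER_ALLELE * risk_alleles * drinks
      + rng.normal(0, 15, size=n))

def slope(x, y):
    """Least-squares slope of y on x (here: IQ points per risk allele)."""
    return np.polyfit(x, y, 1)[0]

slope_drinkers = slope(risk_alleles[drinks], iq[drinks])
slope_abstainers = slope(risk_alleles[~drinks], iq[~drinks])
naive_gap = iq[drinks].mean() - iq[~drinks].mean()

print(f"IQ per risk allele, drinkers:   {slope_drinkers:+.2f}")
print(f"IQ per risk allele, abstainers: {slope_abstainers:+.2f}")
print(f"drinkers minus abstainers:      {naive_gap:+.2f}")
```

In this toy world the children of drinkers have the higher mean IQ, because drinking goes with social advantage, yet the genotype slope is negative in drinkers and close to zero in abstainers. That is the pattern the authors predicted.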

Several bloggers, however, were critical of the study. Skeptical Scalpel noted that the effect on IQ was relatively small and not of clinical significance. However, in common with some media reports, he seems to have misunderstood the study and assumed that the figure of 1.8 IQ points was an estimate of the difference between drinkers and abstainers – rather than the effect of ADH risk alleles in drinkers (see below). David Spiegelhalter pointed out that there was no direct estimate of the size of the effect of maternal alcohol intake. Indeed, when drinkers and non-drinkers were directly compared, IQs were actually slightly lower in non-drinkers. Carl Heneghan also commented on the small IQ effect size, but was particularly concerned about the statistical analysis, arguing that it did not adjust adequately for the large number of genetic variants that were considered.

Should we dismiss effects because they are small? I’m not entirely convinced by that argument. Yes, it’s true that IQ is not a precise measure: if an individual child has an IQ of 100, there is error of measurement around that estimate so that the 95% confidence interval is around 95-105 (wider still if a short form IQ is used, as was the case here). This measurement error is larger than the per-allele effects reported by Lewis et al., but they were reporting means from very large numbers of children. If there are reliable differences between these means, then this would indicate a genuine impact on cognition, potentially as large as 3.5 IQ points (for those with four rather than two risk alleles). Sure, we should not alarm people by implying that moderate drinking causes clinically significant learning difficulties, but I don’t think we should just dismiss such a result. Overall cognitive ability is influenced by a host of risk factors, most of which are small, but whose effects add together. For a child who already has other risks present, even a small downwards nudge to IQ could make a difference.
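To see why a difference smaller than the measurement error for one child can still be reliable for a group mean, here is a rough calculation. It assumes an SD of 16 IQ points (close to the SDs quoted from the paper later in this post), independent scores and a normal approximation. The group sizes are hypothetical and are not taken from Lewis et al.

```python
import math

SD_IQ = 16.0   # assumption: roughly the SDs quoted from Lewis et al
Z95 = 1.96     # normal approximation for a 95% interval

def ci_halfwidth_of_mean(sd, n):
    """Half-width of an approximate 95% CI for the mean of n independent scores."""
    return Z95 * sd / math.sqrt(n)

# Hypothetical group sizes, not taken from the paper.
for n in (100, 1000, 4000):
    print(f"n = {n:5d}: group mean known to about +/- "
          f"{ci_halfwidth_of_mean(SD_IQ, n):.2f} IQ points")
```

The uncertainty around a group mean shrinks with the square root of the group size, so with thousands of children a difference of a point or two between means is not swamped by the error attached to any single child's score.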

But what about Heneghan’s concern about the reliability of the results? This is something that also worried me when I scrutinised Table 1, which shows for each genetic locus the ‘per allele’ effect on IQ. I’ve plotted the data for child genotypes in Figure 1. Only one SNP (#10) seems to have a significant effect on child IQ. Yet when all loci were entered into a stepwise multiple regression analysis, no fewer than four child loci were identified as having a significant effect. The authors suggested that this could reflect interactions between genes that are on the same genetic pathway.

Figure 1: Effect of child SNP variants (per allele) on IQ (in IQ points), with 95% CI, from Lewis et al Table 1.

I had been warned about stepwise regression by those who taught me statistics many years ago. Wikipedia has a section on Criticisms, noting that results can be biased when many variables are included as predictors. But I found it hard to tell just how serious a problem this was. When in doubt, I find it helpful to simulate data, and so that is what I did in this case, using a function in R that generates multivariate normal data. So I made a dataset where there was no relationship between any of 11 variables – ten of which were designated as genetic loci, and one as IQ. I then ran backwards stepwise regression on the dataset. I repeated this exercise many times, and was surprised at just how often spurious associations of IQ with ‘genotypes’ were seen (as described here). I was concerned that this dataset was not a realistic simulation, because the genotype data from Lewis et al consisted of counts of how many uncommon alleles there were at a given locus (0, 1 or 2 – corresponding to aa, aA or AA, if you remember Mendel’s peas). So I also simulated that situation from the same dataset, but actually it made no difference to the findings. Nor did it make any difference if I allowed for correlations between the ‘genotypes’. Overall, I came away alarmed at just how often you can get spurious results from backwards stepwise regression – at least if you use the AIC criterion that is the default in the R package.
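For readers who want to try this kind of exercise, here is a minimal sketch of the idea in Python. It is not the R script used for this post: the backward elimination below is a hand-rolled approximation of AIC-based stepwise selection, and the sample size, allele frequency and number of runs are all assumptions made for illustration.

```python
import numpy as np

def aic(y, X):
    """AIC (up to an additive constant) of an OLS fit of y on X; X includes the intercept."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return n * np.log(rss / n) + 2 * X.shape[1]

def backward_stepwise(y, predictors):
    """Drop one predictor at a time while that lowers AIC; return the surviving column indices."""
    n = len(y)
    kept = list(range(predictors.shape[1]))

    def design(cols):
        return np.column_stack([np.ones(n), predictors[:, cols]])

    current = aic(y, design(kept))
    while kept:
        trials = [(aic(y, design([c for c in kept if c != drop])), drop)
                  for drop in kept]
        best_aic, best_drop = min(trials)
        if best_aic >= current:
            break  # no single removal improves AIC any further
        kept.remove(best_drop)
        current = best_aic
    return kept

rng = np.random.default_rng(2)
N_CHILDREN, N_LOCI, N_RUNS = 500, 10, 200   # hypothetical sizes
runs_with_survivor = 0
total_kept = 0
for _ in range(N_RUNS):
    # 'Genotypes' are allele counts 0/1/2; IQ is generated independently of all of them.
    genotypes = rng.binomial(2, 0.3, size=(N_CHILDREN, N_LOCI)).astype(float)
    iq = rng.normal(100, 15, size=N_CHILDREN)
    kept = backward_stepwise(iq, genotypes)
    total_kept += len(kept)
    runs_with_survivor += bool(kept)

print(f"runs keeping at least one null 'locus': {runs_with_survivor / N_RUNS:.0%}")
print(f"mean number of null 'loci' kept:        {total_kept / N_RUNS:.2f}")
```

Because IQ is generated independently of every 'genotype', any locus that survives the elimination is spurious by construction. The AIC rule is lenient about keeping a single predictor, so with ten candidates a clean result is the exception rather than the rule.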

Lewis et al did one further analysis, generating an overall risk score based on the number of risk alleles (i.e. the version of the gene associated with lower IQ) for the four loci that were selected by the stepwise regression. This gave a significant association with child IQ, just in those who drank in pregnancy: mean IQ was 104.0 (SD = 15.8) for those with 4+ risk alleles, 105.4 (SD = 16.1) for those with 3 risk alleles and 107.5 (SD = 16.3) for those with 2 or fewer risk alleles. However, I was able to show very similar results from my analysis of random data: the problem here is that in a very large sample with many variables some associations will emerge as significant just by chance, and if you then select just those variables and add them up, you are capitalising on the chance effect.
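The capitalisation on chance can be shown in a few lines. This sketch uses a deliberately simpler selection rule than the paper, picking the four loci with the largest absolute correlation with IQ rather than running a stepwise regression. As before, the data are pure noise and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
N, N_LOCI, N_PICK = 2000, 10, 4   # hypothetical sizes

def null_sample():
    """Allele counts and IQ scores with no true relationship between them."""
    g = rng.binomial(2, 0.3, size=(N, N_LOCI)).astype(float)
    iq = rng.normal(100, 15, size=N)
    return g, iq

def corr_with_iq(g, iq):
    return np.array([np.corrcoef(g[:, j], iq)[0, 1] for j in range(g.shape[1])])

# Discovery sample: pick the loci that happen to correlate most with IQ.
g, iq = null_sample()
r = corr_with_iq(g, iq)
picked = np.argsort(-np.abs(r))[:N_PICK]
signs = np.sign(r[picked])   # +1 where the allele count went with HIGHER IQ

def risk_score(genotypes):
    """Count alleles in whichever direction went with LOWER IQ in the discovery sample."""
    chosen = genotypes[:, picked]
    return np.sum(np.where(signs > 0, 2 - chosen, chosen), axis=1)

r_discovery = np.corrcoef(risk_score(g), iq)[0, 1]

# Fresh sample from the same null process, scored with the same rule.
g2, iq2 = null_sample()
r_replication = np.corrcoef(risk_score(g2), iq2)[0, 1]

print(f"risk score vs IQ, same sample:  {r_discovery:+.3f}")
print(f"risk score vs IQ, fresh sample: {r_replication:+.3f}")
```

In the sample used for selection the risk score is negatively related to IQ by construction, because each locus was chosen and oriented using the same data. In a fresh sample there is no such guarantee, which is why replication is the real test.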

One other thing intrigued me. The authors made a binary divide between those who reported drinking in pregnancy and those who did not. The category of drinker spanned quite a wide range from those who reported drinking less than 1 unit per week (either in the first 3 months or at 32 weeks of pregnancy) up to those who reported drinking up to 6 units per week. (Those drinking more than this were excluded, because the interest was in moderate drinkers). Now I’d have thought there would be interest in looking more quantitatively at the impact of moderate drinking, to see if there was a dose-response effect, with a larger effect of genotype on those who drank more. The authors mentioned a relevant analysis where the effect of genotype score on child IQ was greater after adjustment for amount drunk at 32 weeks of pregnancy, but it is not clear whether this was a significant increase, or whether the same was seen for amount drunk at 18 weeks. In particular, one cannot tell whether there is a safe amount to drink from the data reported in this paper. In a reply to my comment on the PLOS One paper, the first author states: “We have since re-run our analysis among the small group of women who reported drinking less than 1 unit throughout pregnancy and we found a similar effect to that which we reported in the paper.” But that suggests there is no dose-response effect for alcohol: I’m not an expert on alcohol effects, but I do find it surprising that less than one drink per week should have an effect on the foetal brain – though as the author points out, it’s possible that women under-reported their intake.

I’m also not a statistical expert and I hesitate to recommend an alternative approach to the analysis, though I am aware that there are multiple regression methods designed to avoid the pitfalls of stepwise regression. It will be interesting to see whether, as predicted by the authors, the genetic variants associated with lower IQ are those that predispose to slow alcohol metabolism. At the end of the day, the results will stand or fall according to whether they replicate in an independent sample.


Reference
Lewis SJ, Zuccolo L, Davey Smith G, Macleod J, Rodriguez S, Draper ES, Barrow M, Alati R, Sayal K, Ring S, Golding J, & Gray R (2012). Fetal Alcohol Exposure and IQ at Age 8: Evidence from a Population-Based Birth-Cohort Study. PloS one, 7 (11) PMID: 23166662

Tuesday, 13 November 2012

Flaky chocolate and the New England Journal of Medicine



Early in October a weird story hit the media: a nation’s chocolate consumption is predictive of its number of Nobel prize-winners, after correcting for population size. This is the kind of kooky statistic that journalists love, and the story made a splash. But was it serious? Most academics initially assumed not. The source of the story was the New England Journal of Medicine, an august publication with stringent standards, which triages a high proportion of submissions that don’t get sent out for review. (And don't try asking for an explanation of why you’ve been triaged). It seemed unlikely that a journal with such exacting standards would give space to a lightweight piece on chocolate. So the first thought was that the piece had been published to make a point about the dangers of assuming causation from correlation, or the inaccuracies that can result when a geographical region is used as the unit of analysis. But reading the article more carefully gave one pause. It did have a somewhat jocular tone. Yet if this was intended as a cautionary tale, we might have expected it to be accompanied by some serious discussion of the methodological and interpretive problems with this kind of analysis. Instead, beneficial effects of dietary flavanols were presented as the most plausible explanation of the findings.

The author, cardiologist Franz Messerli, did discuss the possibility of a non-causal explanation for the findings, only to dismiss it. He stated “as to a third hypothesis, it is difficult to identify a plausible common denominator that could possibly drive both chocolate consumption and the number of Nobel laureates over many years. Differences in socioeconomic status from country to country and geographic and climatic factors may play some role, but they fall short of fully explaining the close correlation observed.” And how do we know “they fall short?” Well, because the author, Dr Messerli, says so.

As is often the case, the blogosphere did a better job of critiquing the paper than the journal editors and reviewers (see, for instance, here and here). The failure to consider seriously the role of a third explanatory variable was widely commented on, but, as far as I am aware, nobody actually did the analysis that Messerli should have done. I therefore thought I'd give it a go. Messerli explained where he’d got his data from – a chocolatier’s website and Wikipedia – so it was fairly straightforward to reproduce them (with some minor differences due to missing data from one chocolate website that's gone offline). Wikipedia helpfully also provided data on gross domestic product (GDP) per head for different nations, and it was easy to find another site with data on proportion of GDP spent on education (except China, which has figures here). So I re-ran the analysis, computing the partial correlation between chocolate consumption and Nobel prizes after adjusting for spend per head on education. When education spend was partialled out, the correlation dropped from .73 to .41, just falling short of statistical significance.
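For readers unfamiliar with partialling out, here is a minimal Python sketch of a first-order partial correlation: the correlation between x and y once the linear influence of a third variable z has been removed from both. The function names and the small data set are my own illustrative inventions; the numbers are made up and are not the chocolate, Nobel or education figures used in the analysis above.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, z):
    """Correlation between x and y with z partialled out (first-order formula)."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Made-up illustrative values, NOT the data from the post.
z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # hypothetical third variable
x = [1.2, 1.9, 3.4, 3.8, 5.1, 6.3]   # hypothetical predictor
y = [0.8, 2.3, 2.9, 4.4, 4.7, 6.1]   # hypothetical outcome

print("raw r:    ", round(pearson(x, y), 2))
print("partial r:", round(partial_corr(x, y, z), 2))
```

When x and y are both strongly related to z, as in this toy example, the partial correlation can be much smaller than the raw one, which is the pattern a third-variable explanation would predict.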

Since Nobel laureates typically are awarded their prizes only after a long period of achievement, a more convincing test of the association would be based on data on both chocolate consumption and education spend from a few decades ago. I’ve got better things to do than to dig out the figures, but I suggest that Dr Messerli might find this a useful exercise.

Another point to note is that the mechanism proposed by Dr Messerli involves an impact of improved cardiovascular fitness on cognitive function. The number of Nobel laureates is not the measure one would pick if setting out to test this hypothesis. The topic of national differences in ability is a contentious and murky one, but it seemed worth looking at such data as are available on the web to see what the chocolate association looks like when a more direct measure is used. For the same 22 countries, the correlation between chocolate consumption and estimated average cognitive ability is nonsignificant at .24, falling to .13 when education spend is partialled out.
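The significance statements above can be checked from the reported correlations alone. The sketch below converts each r to a t statistic and compares it with the standard two-tailed 5% critical value. It assumes that all four correlations are based on the same 22 countries, that one degree of freedom is lost for each variable partialled out, and that a two-tailed test at the .05 level is the criterion; the helper name and labels are mine.

```python
import math

def t_for_r(r, n, n_covariates=0):
    """t statistic and degrees of freedom for a (partial) correlation."""
    df = n - 2 - n_covariates
    return r * math.sqrt(df / (1 - r ** 2)), df

# Two-tailed 5% critical values of Student's t, from standard tables.
T_CRIT = {19: 2.093, 20: 2.086}

N = 22  # assumption: same 22 countries in every analysis

# (label, reported r, number of variables partialled out)
reported = [
    ("chocolate vs Nobel prizes", 0.73, 0),
    ("  education spend partialled out", 0.41, 1),
    ("chocolate vs cognitive ability", 0.24, 0),
    ("  education spend partialled out", 0.13, 1),
]

for label, r, k in reported:
    t, df = t_for_r(r, N, k)
    verdict = "significant" if abs(t) > T_CRIT[df] else "not significant"
    print(f"{label}: r = {r:.2f}, t({df}) = {t:.2f}, {verdict}")
```

Under these assumptions only the raw chocolate and Nobel correlation clears the threshold, and the partial correlation of .41 falls a little below it, consistent with the description above.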

I did write a letter to the New England Journal of Medicine reporting the first of my analyses (all there was room for: they allow you 175 words), but, as expected, they weren't interested. "I am sorry that we will not be able to print your recent letter to the editor regarding the Messerli article of 18-Oct-2012." they wrote. "The space available for correspondence is very limited, and we must use our judgment to present a representative selection of the material received."

It took me all of 45 minutes to extract the data and run these analyses. So why didn’t Dr Messerli do this? And why did the NEJM editor allow him to get away with asserting that third variables “fall short” when it’s so easy to check it out? Could it be that in our celebrity-obsessed world, the journal editors think that there’s no such thing as bad publicity?

Messerli, F. (2012). Chocolate Consumption, Cognitive Function, and Nobel Laureates. New England Journal of Medicine, 367(16), 1562-1564. DOI: 10.1056/NEJMon1211064