Wool and Nuts: warppls

Monday, November 11, 2013

Latitude and cancer rates in US states: Aaron Blaisdell’s intuition confirmed

In the comments section of my previous post on cancer rates in the US states, my friend Aaron Blaisdell noted that: “…comparing states that are roughly comparable in terms of number of seniors per 1000 individuals, latitude appears to have the largest effect on rates of cancer.”

Good point, so I collected data on the latitudes of US states, built a more complex model (with several multivariate controls), and analyzed it with WarpPLS 4.0.

The coefficient of association for the effect of latitude on cancer rates (path coefficient) turned out to be 0.35. Its P value was lower than 0.001, meaning that the probability that this is a false positive is less than a tenth of a percent, or that we can be 99.9 percent confident that this is not a false positive.

This was calculated controlling for the: (a) proportion of seniors in the population (population age); (b) proportion of obese individuals in the population (obesity rates); and (c) the possible moderating effect of latitude on the effect of population age on cancer rates. The graph below shows this multivariate-adjusted association.

What is cool about a multivariate analysis is that you can control for certain effects. For example, since we are controlling for proportion of seniors in the population (population age), the fact that we have a state with a very low proportion of seniors (Alaska) does not tilt the effect toward that outlier as much as it would if we had not controlled for the proportion of seniors. This is a mathematical property that is difficult to grasp, but that makes multivariate adjustment such a powerful technique.

I should note that the 99.9 percent confidence mentioned above refers to the coefficient of association. That is, we are quite confident that the coefficient of association is not zero; that is it. The P value does not support the hypothesized direction of causality (latitude -> cancer) or exclude the possibility of a major confounder causing the effect.

Nonetheless, among the newest features of WarpPLS 4.0 (still a beta version) are several causality assessment coefficients: path-correlation signs, R-squared contributions, path-correlation ratios, path-correlation differences, Warp2 bivariate causal direction ratios, Warp2 bivariate causal direction differences, Warp3 bivariate causal direction ratios, and Warp3 bivariate causal direction differences. Without going into a lot of technical detail, which you can get from the User Manual without even having to install the software, I can tell you that all of these causality assessment coefficients support the hypothesized direction of causality.

Also, while we cannot exclude the possibility of a major confounder causing the effect, we included two possible confounders in the analysis and controlled for their effects. They were the proportion of seniors in the population (population age) and the proportion of obese individuals in the population (obesity rates).

Having said all of the above, I should also say that the effect is similar in magnitude to the effect of population age on cancer rates, which I discussed in the previous post linked above. That is, it is not the type of effect that would be clearly noticeable in a person’s normal life.

Sunlight exposure? Maybe.

We do know that our body naturally produces as much as 10,000 IU of vitamin D based on a few minutes of sun exposure when the sun is high. Getting that much vitamin D from dietary sources is very difficult, even after “fortification”.

Monday, October 28, 2013

Aging and cancer: The importance of taking a hard look at the numbers

The table below is from a study by Hayat and colleagues. It illustrates one common trend regarding cancer – it increases dramatically in incidence among those who are older. With some exceptions, such as Hodgkin's lymphoma, there is a significant increase in risk particularly after 50 years of age.

So I decided to get state data from the US Census web site on the percentage of seniors (age 65 or older) by state and cancer diagnoses per 1,000 people. I was able to get some recent data, for 2011.

I analyzed the data with WarpPLS (version 4.0 has just been released), generating the types of coefficients that would normally be reported by researchers who wanted to make an effect appear very strong.

In this case, the effect would be essentially of population aging on cancer incidence (assessed indirectly), summarized in the graph below. The graph was generated by WarpPLS. The scales are standardized, and so are the coefficients of association in the two segments shown. As you can see, the coefficients of association increase as we move along the horizontal scale, because this is a nonlinear relationship. The overall coefficient of association, which is a weighted average of the two betas shown, is 0.84. The probability that this is a false positive is less than 1 percent.

A beta coefficient of 0.84 essentially means that a 1 standard deviation variation in the percentage of seniors in a state is associated with an overall 84 percent increase in cancer diagnoses, taking the standardized unit of the number of cancer diagnoses as the baseline. This sounds very strong and would usually be presented as an enormous effect. Since the standard deviation for the percentage of seniors in various states is 1.67, one could say that for each 1.67 increment in the percentage of seniors in a state the number of cancer diagnoses goes up by 84 percent.

Effects expressed in percentages can sometimes give a very misleading picture. For example, let us consider an increase in mortality due to a disease from 1 to 2 cases for each 1 million people. This essentially is a 100 percent increase! Moreover, the closer the baseline is to zero, the more impressive the effect becomes, since the percentage increase is calculated by dividing the increment by the baseline number. As the baseline number approaches zero, the percentage increase from the baseline approaches infinity.
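To make the point concrete, here is a minimal Python sketch of the percentage calculation described above. The function name is mine, and the shrinking baselines in the loop are illustrative values, not data from any study.

```python
def relative_increase_percent(baseline, new):
    """Percentage increase of `new` over `baseline` (increment divided by baseline)."""
    return (new - baseline) / baseline * 100.0


# The example from the text: 1 -> 2 cases per 1 million people.
print(relative_increase_percent(1, 2))

# Same absolute increment (1 case per million), but with ever smaller
# baselines. These baselines are illustrative only.
for baseline in (1.0, 0.1, 0.01):
    pct = relative_increase_percent(baseline, baseline + 1.0)
    print(f"baseline {baseline}: about {pct:.0f} percent increase")
```

The absolute increment never changes in the loop; only the baseline does, and the percentage grows without bound as the baseline shrinks.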

Now let us take a look at the graph below, also generated by WarpPLS. Here the scales are unstandardized, which means that they refer to the original measures in their respective original scales. (Standardization makes the variables dimensionless, which is sometimes useful when the original measurement scales are not comparable – e.g., dollars vs. meters.) As you can see here, the number of cancer diagnoses per 1,000 people goes from a low of 3.74 in Utah to a high of 6.64 in Maine.
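For readers who have not met standardization before, the sketch below shows one common way to do it (subtract the mean, divide by the sample standard deviation). Only the 3.74 and 6.64 values come from the text; the values in between are placeholders I made up, and WarpPLS may use a different convention internally.

```python
import statistics


def standardize(values):
    """Return z-scores: (value - mean) / sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample SD; population SD is another convention
    return [(v - mean) / sd for v in values]


# Diagnoses per 1,000 people. The first and last values are the Utah and
# Maine figures from the text; the middle three are HYPOTHETICAL placeholders.
diagnoses_per_1000 = [3.74, 4.60, 5.10, 5.70, 6.64]

z_scores = standardize(diagnoses_per_1000)
for raw, z in zip(diagnoses_per_1000, z_scores):
    print(f"{raw:.2f} per 1,000 -> {z:+.2f} standard deviations from the mean")
```

After standardization the variable has mean 0 and standard deviation 1, which is why the standardized graph carries no trace of the original "per 1,000" unit.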

One may be tempted to explain the increase in cancer diagnoses that we see on this graph based on various factors (e.g., lifestyle), but the percentage of seniors in a state seems like a very good and reasonable predictor. You may say: This is very depressing. You may be even more depressed if I tell you that controlling for state obesity rates does not change this picture at all.

But look at what these numbers really mean. What we see here is an increase in cancer diagnoses per 1,000 people of less than 3. In other words, there is a minute increase of less than 3 diagnoses for each group of 1,000 people considered. It certainly feels terrible if you are one of the 3 diagnosed, but it is still a minute increase.

Also note that one of the scales, for diagnoses, refers to increments of 1 in 1,000; while the other, for seniors, refers to increments of 1 in 100. This leads to an interesting effect. If you move from Alaska to Florida you will see a significant increase in the number of seniors around, as the difference in the percentage of seniors between these two states is about 10. However, the difference in the number of cancer diagnoses will not be even close to the difference in the presence of seniors.

The situation above is very common in medical research. An effect that is fundamentally tiny is stated in such a way that the general public has the impression that the effect is enormous. Often the reason is not to promote a drug, but to attract media attention to a research group or organization.

When you look at the actual numbers, the magnitude of the effect is such that it would go unnoticed in real life. By real life I mean: John, since we moved from Alaska to Maine I have been seeing a lot more people of my age being diagnosed with cancer. An effect of the order of 3 in 1,000 would not normally be noticed in real life by someone whose immediate circle of regular acquaintances included fewer than 333 people (about 1,000 divided by 3).
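The "333 people" figure is simple arithmetic, sketched below in Python. The function names are mine; the 3 in 1,000 figure is the upper bound on the increase discussed above, and the circle sizes in the loop are arbitrary.

```python
def expected_extra_cases(circle_size, extra_per_1000=3):
    """Expected additional diagnoses in a circle of acquaintances of a given size."""
    return circle_size * extra_per_1000 / 1000


def circle_size_for_one_extra_case(extra_per_1000=3):
    """Circle size at which one additional diagnosis would be expected."""
    return 1000 / extra_per_1000


print(round(circle_size_for_one_extra_case()))  # about 333 people

# Arbitrary circle sizes, for illustration.
for size in (50, 150, 333):
    print(size, "acquaintances ->", round(expected_extra_cases(size), 2), "extra diagnoses")
```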

But thanks to Facebook, things are changing … to be fair, the traditional news media (particularly television) tends to increase perceived effects a lot more than social media, often in a very stressful way.

Monday, April 2, 2012

The 2012 Arch Intern Med red meat-mortality study: Eating 234 g/d of red meat could reduce mortality by 23 percent

As we have seen in an earlier post on the China Study data, which explored relationships hinted at by Denise Minger’s previous and highly perceptive analysis, one can use a multivariate analysis tool like WarpPLS to explore relationships based on data reported by others. This is true even when the dataset available is fairly small.

So I entered the data reported in the most recent (published online in March 2012) study looking at the relationship between red meat consumption and mortality into WarpPLS to do some exploratory analyses. I discussed the study in my previous post; it was conducted by Pan et al. (Frank B. Hu is the senior author) and published in the prestigious Archives of Internal Medicine. The data I used is from Table 1 of the article; it reports figures on several variables along 5 quintiles, based on separate analyses of two samples, called “Health Professionals” and “Nurses Health” samples. The Health Professionals sample comprised males; the Nurses Health sample, females.

Below is an interesting exploratory model, with results. It includes a number of hypotheses, represented by arrows, which seem to make sense. This is helpful, because a model incorporating hypotheses that make sense allows for easy identification of nonsense results, and thus rejection of the model or the data. (Refutability is one of the most important characteristics of good theoretical models.) Keep in mind that the sample size here is very small (N=10), as the authors of the study reported data along 5 quintiles for the Health Professionals sample, together with 5 quintiles for the Nurses Health sample. In a sense, this is somewhat helpful, because a small sample tends to be “unstable”, leading nonsense results and other signs of problems to show up easily – one example would be multivariate coefficients of association (the beta coefficients reported near the arrows) greater than 1 due to collinearity.

So what does the model above tell us? It tells us that smoking (Smokng) is associated with reduced physical activity (PhysAct); beta = -0.92. It tells us that smoking (Smokng) is associated with reduced food intake (FoodInt); beta = -0.36. It tells us that physical activity (PhysAct) is associated with reduced incidence of diabetes (Diabetes); beta = -0.25. It tells us that increased food intake (FoodInt) is associated with increased incidence of diabetes (Diabetes); beta = 0.93. It tells us that increased food intake (FoodInt) is associated with increased red meat intake (RedMeat); beta = 0.60. It tells us that increased incidence of diabetes (Diabetes) is associated with increased mortality (Mort); beta = 0.61. It tells us that being female (SexM1F2) is associated with reduced mortality (Mort); beta = -0.67.

Some of these betas are a bit too high (e.g., 0.93), due to the level of collinearity caused by such a small sample. Due to being quite high, they are statistically significant even in a small sample. Betas greater than 0.20 tend to become statistically significant when the sample size is 100 or greater; so all of the coefficients above would be statistically significant with a larger sample size. What is the common denominator of all of the associations above? The common denominator is that all of them make sense, qualitatively speaking; there is not a single case where the sign is the opposite of what we would expect. There is one association that is shown on the graph and that is missing from my summary of associations above; and it also makes sense, at least to me. The model also tells us that increased red meat intake (RedMeat) is associated with reduced mortality (Mort); beta = -0.25. More technically, it tells us that, when we control for biological sex (SexM1F2) and incidence of diabetes (Diabetes), increased red meat intake (RedMeat) is associated with reduced mortality (Mort).

How do we roughly estimate this effect in terms of amounts of red meat consumed? The -0.25 means that, for each standard deviation in the amount of red meat consumed, there is a corresponding 0.25 standard deviation reduction of mortality. (This interpretation is possible because I used WarpPLS’ linear analysis algorithm; a nonlinear algorithm would lead to a more complex interpretation.) The standard deviation for red meat consumption is 0.897 servings. Each serving has about 84 g. And the highest number of servings in the dataset is 3.1 servings, or 260 g/d (calculated as: 3.1*84). To stay a bit shy of this extreme, let us consider a slightly lower intake amount, which is 3.1 standard deviations, or 234 g/d (calculated as: 3.1*0.897*84). Since the standard deviation for mortality is 0.3 percentage points, we can conclude that an extra 234 g of red meat per day is associated with a reduction in mortality of approximately 23 percent (calculated as: 3.1*0.25*0.3).
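The arithmetic in the paragraph above can be retraced with a few lines of Python. All numbers are the ones quoted in the text; the variable names are mine. Note that the last product is expressed in the units of the mortality standard deviation, which the text gives in percentage points.

```python
# Figures quoted in the text.
SD_RED_MEAT_SERVINGS = 0.897   # standard deviation of red meat intake, servings/day
GRAMS_PER_SERVING = 84         # approximate grams per serving
BETA_REDMEAT_MORT = -0.25      # standardized path coefficient RedMeat -> Mort
SD_MORTALITY = 0.3             # standard deviation of mortality, percentage points
MAX_SERVINGS = 3.1             # highest number of servings in the dataset
N_SD = 3.1                     # number of standard deviations considered

max_intake_g = MAX_SERVINGS * GRAMS_PER_SERVING
extra_intake_g = N_SD * SD_RED_MEAT_SERVINGS * GRAMS_PER_SERVING
mortality_change = N_SD * BETA_REDMEAT_MORT * SD_MORTALITY

print(f"highest intake in dataset: {max_intake_g:.1f} g/d")
print(f"3.1 standard deviations of intake: {extra_intake_g:.1f} g/d")
print(f"associated change in mortality: {mortality_change:.4f} (units of the mortality SD)")
```

This only works as a straight multiplication because the model is linear; with a nonlinear algorithm the effect of 3.1 standard deviations would not be 3.1 times the effect of one.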

Let me repeat for emphasis: the data reported by the authors suggests that, when we control for biological sex and incidence of diabetes, an extra 234 g of red meat per day is associated with a reduction in mortality of approximately 23 percent. This is exactly the opposite, qualitatively speaking, of what was reported by the authors in the article. I should note that this is also a minute effect, like the effect reported by the authors. (The mortality rates in the article are expressed as percentages, with the lowest being around 1 percent. So this 23 percent is a percentage of a percentage.) If you were to compare a group of 100 people who ate little red meat with another group of the same size that ate 234 g more of red meat every day, over a period of more than 20 years, you would not find a single additional death in either group. If you were to compare matched groups of 1,000 individuals, you would find only 2 additional deaths among the folks who ate little red meat.

At the same time, we can also see that excessive food intake is associated with increased mortality via its effect on diabetes. The product beta coefficient for the mediated effect FoodInt --> Diabetes --> Mort is 0.57. This means that, for each standard deviation of food intake, there is a corresponding 0.57 standard deviation increase in mortality, via an increase in the incidence of diabetes. This is very likely at levels of food consumption where significantly more calories are consumed than spent, ultimately leading to many people becoming obese. The standard deviation for food intake is 355 calories. The highest daily food intake quintile reported in the article is 2,396 calories, which happens to be associated with the highest mortality (and is probably an underestimation); the lowest is 1,202 (also probably underestimated).
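The 0.57 figure is the product of the two betas along the mediated path, both of which were listed earlier in this post. A minimal Python check, with variable names of my own choosing:

```python
# Standardized path coefficients reported earlier in the post.
beta_foodint_to_diabetes = 0.93
beta_diabetes_to_mort = 0.61

# The mediated (indirect) effect is the product of the betas along the path.
indirect_effect = beta_foodint_to_diabetes * beta_diabetes_to_mort
print(f"FoodInt --> Diabetes --> Mort: {indirect_effect:.2f}")

# Compared, in absolute size, with the direct RedMeat -> Mort beta of -0.25.
beta_redmeat_to_mort = -0.25
print(abs(indirect_effect) > abs(beta_redmeat_to_mort))
```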

So, in summary, the data suggests that, for the particular sample studied (made up of two subsamples): (a) red meat intake is protective in terms of overall mortality, through a direct effect; and (b) the deleterious effect of overeating on mortality is stronger than the protective effect of red meat intake. These conclusions are consistent with those of my previous post on the same study. The difference is that the previous post suggested a possible moderating protective effect; this post suggests a possible direct protective effect. Both effects are small, as was the negative effect reported by the authors of the study. Neither is statistically significant, due to sample size limitations (secondary data from an article; N=10). And all of this is based on a study that categorized various types of processed meat as red meat, and that did not distinguish grass-fed from non-grass-fed meat.

By the way, in discussions of red meat intake’s effect on health, iron overload is often mentioned. What many people don’t seem to realize is that iron overload is caused primarily by hereditary haemochromatosis. Another cause is “blood doping” to improve athletic performance. Hereditary haemochromatosis is a very rare genetic disorder; rare enough to be statistically “invisible” in any study that does not specifically target people with this disorder.

Monday, March 19, 2012

The 2012 red meat-mortality study (Arch Intern Med): The data suggests that red meat is protective

I am not a big fan of using arguments such as “food questionnaires are unreliable” and “observational studies are worthless” to completely dismiss a study. There are many reasons for this. One of them is that, when people misreport certain diet and lifestyle patterns, but do that consistently (i.e., everybody underreports food intake), the biasing effect on coefficients of association is minor. Measurement errors may remain for this or other reasons, but regression methods (linear and nonlinear) assume the existence of such errors, and are designed to yield robust coefficients in their presence. Besides, for me to use these types of arguments would be hypocritical, since I myself have done several analyses on the China Study data, and built what I think are valid arguments based on those analyses.

My approach is: Let us look at the data, any data, carefully, using appropriate analysis tools, and see what it tells us; maybe we will find evidence of measurement errors distorting the results and leading to mistaken conclusions, or maybe not. With this in mind, let us take a look at the top part of Table 3 of the most recent (published online in March 2012) study looking at the relationship between red meat consumption and mortality, authored by Pan et al. (Frank B. Hu is the senior author) and published in the prestigious Archives of Internal Medicine. This is a prominent journal, with an average of over 270 citations per article according to Google Scholar. The study has received much media attention recently.

Take a look at the area highlighted in red, focusing on data from the Health Professionals sample. That is the multivariate-adjusted cardiovascular mortality rate, listed as a normalized percentage, in the highest quintile (Q5) of red meat consumption from the Health Professionals sample. The non-adjusted percentages are 1.4 percent mortality in Q5 and 1.13 in Q1 (from Table 1 of the same article); so the multivariate adjustment-normalization changed the values of the percentages somewhat, but not much. The highlighted 1.35 number suggests that for each group of 100 people who consumed a lot of red meat (Q5), when compared with a group of 100 people who consumed little red meat (Q1), there were on average 0.35 more deaths over the same period of time (more than 20 years).

The heavy red meat eaters in Q5 consumed 972.73 percent more red meat than those in Q1. This is calculated with data from Table 1 of the same article, as: (2.36-0.22)/0.22. In Q5, the 2.36 number refers to the number of servings of red meat per day, with each serving being approximately 84 g. So the heavy red meat eaters ate approximately 198 g per day (a bit less than 0.5 lb), while the light red meat eaters ate about 18 g per day. In other words, the heavy red meat eaters ate 9.7273 times more, or 972.73 percent more, red meat.

So, just to be clear, even though the folks in Q5 consumed 972.73 percent more red meat than the folks in Q1, in each matched group of 100 you would not find a single additional death over the same time period. If you looked at matched groups of 1,000 individuals, you would find 3 more deaths among the heavy red meat eaters. The same general pattern, of a minute difference, repeats itself throughout Table 3. As you can see, all of the reported mortality ratios are 1-point-something. In fact, this same pattern repeats itself in all mortality tables (all-cause, cardiovascular, cancer). This is all based on a multivariate analysis that according to the authors controlled for a large number of variables, including baseline history of diabetes.
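The contrast between the relative difference in intake and the absolute difference in deaths can be retraced in Python. The servings, grams per serving and the 1.35 figure are the ones quoted above; the variable names are mine, and the 1.35 is read the same way as in this post, as a normalized percentage against a Q1 value of 1.

```python
GRAMS_PER_SERVING = 84      # approximate, as stated in the post
q1_servings = 0.22          # servings/day in the lowest quintile (Table 1)
q5_servings = 2.36          # servings/day in the highest quintile (Table 1)

q1_grams = q1_servings * GRAMS_PER_SERVING
q5_grams = q5_servings * GRAMS_PER_SERVING
relative_more = (q5_servings - q1_servings) / q1_servings

# Highlighted multivariate-adjusted figure for Q5, read as in the post.
adjusted_q5 = 1.35
extra_deaths_per_100 = adjusted_q5 - 1.0
extra_deaths_per_1000 = extra_deaths_per_100 * 10

print(f"Q1 intake: {q1_grams:.1f} g/d, Q5 intake: {q5_grams:.1f} g/d")
print(f"Q5 ate {relative_more:.4f} times more red meat than Q1")
print(f"extra deaths per 100: {extra_deaths_per_100:.2f}")
print(f"extra deaths per 1,000: {extra_deaths_per_1000:.1f}")
```

Putting the two results side by side shows the asymmetry: a relative difference in intake of several hundred percent sits next to an absolute difference of well under one death per 100 people.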

Interestingly, looking at data from the same sample (Health Professionals), the incidence of diabetes is 75 percent higher in Q5 than in Q1. The same is true for the second sample (Nurses Health), where the Q5-Q1 difference in incidence of diabetes is even greater - 81 percent. This caught my eye, diabetes being such a prototypical “disease of affluence”. So I entered the whole data reported in the article into HCE and WarpPLS, and conducted some analyses. The graphs below are from HCE. The data includes both samples – Health Professionals and Nurses Health.

HCE calculates bivariate correlations, and so does WarpPLS. But WarpPLS stores numbers with a higher level of precision, so I used WarpPLS for calculating coefficients of association, including correlations. I also double-checked the numbers with other software, just in case (e.g., SPSS and MATLAB). Here are the correlations calculated by WarpPLS, which refer to the graphs above: 0.030 for red meat intake and mortality; 0.607 for diabetes and mortality; and 0.910 for food intake and diabetes. Yes, you read it right, the correlation between red meat intake and mortality is a very low and non-significant 0.030 in this dataset. Not a big surprise when you look at the related HCE graph, with the line going up and down almost at random. Note that I included the quintiles data from both the Health Professionals and Nurses Health samples in one dataset.
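For anyone who wants to double-check a bivariate correlation by hand, below is a minimal Pearson correlation sketch in plain Python. It is not the code used by HCE or WarpPLS, and the two 10-point series are HYPOTHETICAL placeholders, not the values from the article.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# HYPOTHETICAL data: 10 points, standing in for 5 quintiles from each sample.
red_meat_servings = [0.2, 0.6, 1.0, 1.5, 2.4, 0.5, 1.0, 1.6, 2.2, 3.1]
mortality_percent = [1.1, 1.3, 1.0, 1.2, 1.4, 0.7, 0.6, 0.8, 0.6, 0.7]

print(f"r = {pearson_r(red_meat_servings, mortality_percent):.3f}")
```

Pooling two samples with different baseline mortality, as in the made-up series above, is one way a near-zero overall correlation can arise even when each sample shows a trend of its own.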

Those folks in Q5 had a much higher incidence of diabetes, and yet the increase in mortality for them was significantly lower, in percentage terms. A key difference between Q5 and Q1 being what? The Q5 folks ate a lot more red meat. This looks suspiciously suggestive of a finding that I came across before, based on an analysis of the China Study II data. The finding was that animal food consumption (and red meat is an animal food) was protective, actually reducing the negative effect of wheat flour consumption on mortality. That analysis actually suggested that wheat flour consumption may not be so bad if you eat 221 g or more of animal food daily.

So, I built the model below in WarpPLS, where red meat intake (RedMeat) is hypothesized to moderate the relationship between diabetes incidence (Diabetes) and mortality (Mort). Below I am also including the graphs for the direct and moderating effects; the data is standardized, which reduces estimation error, particularly in moderating effects estimation. I used a standard linear algorithm for the calculation of the path coefficients (betas next to the arrows) and jackknifing for the calculation of the P values (confidence = 1 – P value). Jackknifing is a resampling technique that does not require multivariate normality and that tends to work well with small samples; as is the case with nonparametric techniques in general.
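To give a feel for jackknifing, here is a generic leave-one-out sketch in plain Python that estimates the standard error of a standardized slope. This is the textbook delete-one jackknife, not necessarily the exact procedure WarpPLS implements, and the 10 data points are HYPOTHETICAL.

```python
import math


def standardized_slope(rows):
    """Slope of y on x with both z-scored (equals Pearson r for one predictor)."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    syy = sum((y - mean_y) ** 2 for _, y in rows)
    return sxy / math.sqrt(sxx * syy)


def jackknife_se(rows, stat):
    """Delete-one jackknife standard error of `stat` computed on `rows`."""
    n = len(rows)
    # Recompute the statistic n times, each time leaving one row out.
    leave_one_out = [stat(rows[:i] + rows[i + 1:]) for i in range(n)]
    mean_loo = sum(leave_one_out) / n
    return math.sqrt((n - 1) / n * sum((t - mean_loo) ** 2 for t in leave_one_out))


# HYPOTHETICAL (diabetes incidence, mortality) pairs; N=10 as in the post.
data = [(2.0, 0.6), (2.5, 0.7), (3.1, 0.6), (3.4, 0.8), (3.9, 0.9),
        (2.8, 1.0), (3.3, 1.1), (3.8, 1.3), (4.4, 1.2), (4.9, 1.4)]

beta = standardized_slope(data)
se = jackknife_se(data, standardized_slope)
print(f"beta = {beta:.2f}, jackknife SE = {se:.2f}, ratio = {beta / se:.2f}")
```

The ratio of the coefficient to its jackknife standard error is what a P value would then be derived from; no normality assumption enters the resampling step itself.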

The direct effect of diabetes on mortality is positive (0.68) and almost statistically significant at the P < 0.05 level (confidence of 94 percent), which is noteworthy because the sample size here is so small – only 10 data points, 5 quintiles from the Health Professionals sample and 5 from the Nurses Health sample. The moderating effect is negative (-0.11), but not statistically significant (confidence of 61 percent). In the moderating effect graphs (shown side-by-side), this negative moderation is indicated by a slightly less steep inclination of the regression line for the graph on the right, which refers to high red meat intake. A less steep inclination means a less strong relationship between diabetes and mortality – among the folks who ate the most red meat.

Not too surprisingly, at least to me, the results above suggest that red meat per se may well be protective, although we should consider at least two other possibilities. One is that red meat intake is a marker for consumption of some other things, possibly present in animal foods, that are protective - e.g., choline and vitamin K2. The other possibility is that red meat is protective in part by displacing other less healthy foods. Perhaps what we are seeing here is a combination of these.

Whatever the reason may be, red meat consumption seems to actually lessen the effect of diabetes on mortality in this sample. That is, according to this data, the more red meat is consumed, the fewer people die from diabetes. The protective effect might have been stronger if the participants had eaten more red meat, or more animal foods containing the protective factors; recall that the threshold for protection in the China Study II data was consumption of 221 g or more of animal food daily. Having said that, it is also important to note that, if you eat excess calories to the point of becoming obese, from red meat or any other sources, your risk of developing diabetes will go up – as the earlier HCE graph relating food intake and diabetes implies.

Please keep in mind that this post is the result of a quick analysis of secondary data reported in a journal article, and its conclusions may be wrong, even though I did my best not to make any mistake (e.g., mistyping data from the article). The authors likely spent months, if not more, in their study; and have the support of one of the premier research universities in the world. Still, this post raises serious questions. I say this respectfully, as the authors did seem to try their best to control for all possible confounders.

I should also say that the moderating effect I uncovered is admittedly a fairly weak effect on this small sample and not statistically significant. But its magnitude is apparently greater than the reported effects of red meat on mortality, which are not only minute but may well be statistical artifacts. The Cox proportional hazards analysis employed in the study, which is commonly used in epidemiology, is nothing more than a sophisticated ANCOVA; it is a semi-parametric version of a special case of the broader analysis method automated by WarpPLS.

Finally, I could not control for confounders because, given the small sample, inclusion of confounders (e.g., smoking) leads to massive collinearity. WarpPLS calculates collinearity estimates automatically, and is particularly thorough at doing that (calculating them at multiple levels), so there is no way to ignore them. Collinearity can severely distort results, as pointed out in a YouTube video on WarpPLS. Collinearity can even lead to changes in the signs of coefficients of association, in the context of multivariate analyses - e.g., a positive association appears to be negative. The authors have the original data – a much, much larger sample - which makes it much easier to deal with collinearity.

Moderating effects analyses – we need more of that in epidemiological research, eh?

Monday, January 16, 2012

The China Study II: Wheat’s total effect on mortality is significant, complex, and highlights the negative effects of low animal fat diets

The graph below shows the results of a multivariate nonlinear WarpPLS analysis including the variables listed below. Each row in the dataset refers to a county in China, from the publicly available China Study II dataset. As always, I thank Dr. Campbell and his collaborators for making the data publicly available. Other analyses based on the same dataset are also available.
- Wheat: wheat flour consumption in g/d.
- Aprot: animal protein consumption in g/d.
- PProt: plant protein consumption in g/d.
- %FatCal: percentage of calories coming from fat.
- Mor35_69: number of deaths per 1,000 people in the 35-69 age range.
- Mor70_79: number of deaths per 1,000 people in the 70-79 age range.

Below are the total effects of wheat flour consumption, along with the number of paths used to calculate them, and the respective P values (i.e., probabilities that the effects are due to chance). Total effects are calculated by considering all of the paths connecting two variables. Identifying each path is a bit like solving a maze puzzle; you have to follow the arrows connecting the two variables. Version 3.0 of WarpPLS (soon to be released) does that automatically, and also calculates the corresponding P values.
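The "maze puzzle" can be sketched in a few lines of Python: follow every chain of arrows from one variable to another, multiply the betas along each chain, and add the products up. This is only an illustration of the idea for a linear model without feedback loops, not WarpPLS code, and the model structure and coefficients below are HYPOTHETICAL, chosen only to keep the example small.

```python
def total_effect(paths, source, target):
    """Sum, over all directed paths from source to target, of the product of
    the path coefficients along each path. Assumes no cycles in the model.
    Returns (total effect, number of paths found)."""
    total, n_paths = 0.0, 0
    stack = [(source, 1.0)]          # (current variable, product of betas so far)
    while stack:
        node, product = stack.pop()
        if node == target:
            total += product
            n_paths += 1
            continue
        for next_node, beta in paths.get(node, {}).items():
            stack.append((next_node, product * beta))
    return total, n_paths


# HYPOTHETICAL model: arrows and betas are made up for illustration.
model = {
    "Wheat": {"PProt": 0.5, "Mort": 0.2},
    "PProt": {"AProt": -0.4, "Mort": 0.1},
    "AProt": {"Mort": -0.3},
}

effect, n_paths = total_effect(model, "Wheat", "Mort")
print(f"total effect = {effect:.2f} over {n_paths} paths")
```

In this toy model there is one direct path and two mediated ones; a model with more mediators, like the one discussed in this post, simply yields more chains to add up.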

To the best of my knowledge, this is the first time that total effects are calculated for this dataset. As you can see, the total effects of wheat flour consumption on mortality in the 35-69 and 70-79 age ranges are both significant, and fairly complex in this model, each relying on 7 paths. The P value for mortality in the 35-69 age range is 0.038; in other words, the probability that the effect is “real”, and thus not due to chance, is 96.2 percent (100-3.8=96.2). The P value for mortality in the 70-79 age range is 0.024; a 97.6 percent probability that the effect is “real”.

Note that in the model the effects of wheat flour consumption on mortality in both age ranges are hypothesized to be mediated by animal protein consumption, plant protein consumption, and fat consumption. These mediating effects have been suggested by previous analyses discussed on this blog. The strongest individual paths are between wheat flour consumption and plant protein consumption, plant protein consumption and animal protein consumption, as well as animal protein consumption and fat consumption.

So wheat flour consumption contributes to plant protein consumption, probably by being a main source of plant protein (through gluten). Plant protein consumption in turn decreases animal protein consumption, which significantly decreases fat consumption. From this latter connection we can tell that most of the fat consumed likely came from animal sources.

How much fat and protein are we talking about? The graphs below tell us how much, and these graphs are quite interesting. They suggest that, in this dataset, daily protein consumption tended to be on average 60 g, whatever the source. If more protein came from plant foods, the proportion from animal foods went down, and vice-versa.

The more animal protein consumed, the more fat is also consumed in this dataset. And that is animal fat, which comes mostly in the form of saturated and monounsaturated fats, in roughly equal amounts. How do I know that it is animal fat? Because of the strong association with animal protein. By the way, with a few exceptions (e.g., some species of fatty fish) animal foods in general provide only small amounts of polyunsaturated fats – omega-3 and omega-6.

Individually, animal protein and wheat flour consumption have the strongest direct effects on mortality in both age ranges. Animal protein consumption is protective, and wheat flour consumption detrimental.

Does the connection between animal protein, animal fat, and longevity mean that a diet high in saturated and monounsaturated fats is healthy for most people? Not necessarily, at least without extrapolation, although the results do not suggest otherwise. Look at the amounts of fat consumed per day. They range from a little less than 20 g/d to a little over 90 g/d. By comparison, one steak of top sirloin (about 380 g of meat, cooked) trimmed to almost no visible fat gives you about 37 g of fat.

These results do suggest that consumption of animal fats, primarily saturated and monounsaturated fats, is likely to be particularly healthy in the context of a low fat diet. Or, said in a different way, these results suggest that longevity is decreased by diets that are low in animal fats.

How much fat should one eat? In this dataset, the more fat was consumed together with animal protein (i.e., the more animal fat was consumed), the better in terms of longevity. In other words, in this dataset the lowest levels of mortality were associated with the highest levels of animal fat consumption. The highest level of fat consumption in the dataset was a little over 90 g/d.

What about higher fat intake contexts? Well, we know that men on a high fat diet such as a variation of the Optimal Diet can consume on average a little over 170 g/d of animal fat (130 g/d for women), and their health markers remain generally good.

One of the critical limiting factors, in terms of health, seems to be the amount of animal fat that one can eat and still remain relatively lean. Dietary saturated and monounsaturated fats are healthy. But when accumulated as excess body fat, beyond a certain level, they become pro-inflammatory.

Monday, December 12, 2011

Finding your sweet spot for muscle gain with HCE

In order to achieve muscle gain, one has to repeatedly hit the “supercompensation” window, which is a fleeting period of time occurring at some point in the muscle recovery phase after an intense anaerobic exercise session. The figure below, from Vladimir Zatsiorsky’s and William Kraemer’s outstanding book Science and Practice of Strength Training, provides an illustration of the supercompensation idea. Supercompensation is covered in more detail in a previous post.

Trying to hit the supercompensation window is a common denominator among HealthCorrelator for Excel (HCE) users who employ the software to maximize muscle gain. (That is, among those who know and subscribe to the theory of supercompensation.) This post outlines what I believe is a good way of doing that while avoiding some pitfalls. The data used in the example that follows was created by me, and is based on a real case. I disguised the data, simplified it, added error etc. to make the underlying method relatively easy to understand, and so that the data cannot be traced back to its “real case” user (for privacy).

Let us assume that John Doe is an intermediate weight training practitioner. That is, he has already gone through the beginning stage where most gains come from neural adaptation. For him, new gains in strength are a reflection of gains in muscle mass. The table below summarizes the data John obtained when he decided to vary the following variables in order to see what effects they have on his ability to increase the weight with which he conducted the deadlift in successive exercise sessions:
- Number of rest days in between exercise sessions (“Days of rest”).
- The amount of weight he used in each deadlift session (“Deadlift weight”).
- The amount of weight he was able to add to the bar each session (“Delta weight”).
- The number of deadlift sets and reps (“Deadlift sets” and “Deadlift reps”, respectively).
- The total exercise volume in each session (“Deadlift volume”). This was calculated as follows: “Deadlift weight” x “Deadlift sets” x “Deadlift reps”.
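The derived columns in the list above can be sketched in a few lines of Python. This is only an illustration, not how HCE computes anything: the session numbers are made up, the field names are hypothetical, and I am assuming that “Delta weight” is the difference from the previous session's weight, with the first session set to zero.

```python
# Hypothetical session log; the numbers are invented for illustration only.
sessions = [
    {"days_of_rest": 3, "weight": 215, "sets": 3, "reps": 20},
    {"days_of_rest": 6, "weight": 225, "sets": 2, "reps": 22},
    {"days_of_rest": 7, "weight": 240, "sets": 2, "reps": 18},
]


def add_derived_columns(log):
    """Return a copy of the log with "volume" and "delta_weight" added."""
    out = []
    previous_weight = None
    for row in log:
        new_row = dict(row)
        # Deadlift volume = weight x sets x reps, as defined in the list above.
        new_row["volume"] = row["weight"] * row["sets"] * row["reps"]
        # Assumption: delta weight is the change from the previous session.
        if previous_weight is None:
            new_row["delta_weight"] = 0
        else:
            new_row["delta_weight"] = row["weight"] - previous_weight
        previous_weight = row["weight"]
        out.append(new_row)
    return out


table = add_derived_columns(sessions)
for row in table:
    print(row)
```

Keeping the raw columns and the derived ones side by side makes it easy to paste the values (not formulas) into a spreadsheet later.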

John’s ability to increase the weight with which he conducted the deadlift in each session is measured as “Delta weight”. That was his main variable of interest. This may not look like an ideal choice at first glance, as arguably “Deadlift volume” is a better measure of total effort and thus actual muscle gain. The reality is that this does not matter much in his case, because: John had long rest periods within sets, of around 5 minutes; and he made sure to increase the weight in each successive session as soon as he felt he could, and by as much as he could, thus never doing more than 24 reps. If you think that the number of reps employed by John is too high, take a look at a post in which I talk about Doug Miller and his ideas on weight training.

Below are three figures, with outputs from HCE: a table showing the coefficients of association between “Delta weight” and the other variables, and two graphs showing the variation of “Delta weight” against “Deadlift volume” and “Days of rest”. As you can see, nothing seems to be influencing “Delta weight” strongly enough to reach the 0.6 level that I recommend as the threshold for a “real effect” to be used in HCE analyses. There are two possibilities here: it is what it looks like, that is, none of the variables influence “Delta weight”; or there are effects, but they do not show up in the associations table (as associations equal to or greater than 0.6) because of nonlinearity.

The graph of “Delta weight” against “Deadlift volume” is all over the place, suggesting a lack of association. This is true for the other variables as well, except “Days of rest”; the last graph above. That graph, of “Delta weight” against “Days of rest”, suggests the existence of a nonlinear association with the shape of an inverted J curve. This type of association is fairly common. In this case, it seems that “Delta weight” is maximized in the 6-7 range of “Days of rest”. Still, even varying things almost randomly, John achieved a solid gain over the time period. That was a 33 percent gain from the baseline “Deadlift weight”, a gain calculated as: (285-215)/215.

HCE, unlike WarpPLS, does not take nonlinear relationships into consideration in the estimation of coefficients of association. In order to discover nonlinear associations, users have to inspect the graphs generated by HCE, as John did. Based on his inspection, John decided to change things a bit, now working out on the right side of the J curve, with 6 or more “Days of rest”. That was difficult for John at first, as he was addicted to exercising at a much higher frequency; but after a while he became a “minimalist”, even trying very long rest periods.

Below are four figures. The first is a table summarizing the data John obtained for his second trial. The other three are outputs from HCE, analogous to those obtained in the first trial: a table showing the coefficients of association between “Delta weight” and the other variables, two graphs (side-by-side) showing “Delta weight” against “Deadlift sets” and “Deadlift reps”, and one graph of “Delta weight” against “Days of rest”. As you can see, “Days of rest” now influences “Delta weight” very strongly. The corresponding association is a very high -0.981! The negative sign means that “Delta weight” decreases as “Days of rest” increase. This does NOT mean that rest is not important; remember, John is now operating on the right side of the J curve, with 6 or more “Days of rest”.

The last graph above suggests that taking 12 or more “Days of rest” shifted things toward the end of the supercompensation window, in fact placing John almost outside of that window at 13 “Days of rest”. Even so, there was no loss of strength, and thus probably no muscle loss. Loss of strength would be suggested by a negative “Delta weight”, which did not occur (the “Delta weight” went down to zero, at 13 “Days of rest”). The two graphs shown side-by-side suggest that 2 “Deadlift sets” seem to work just as well for John as 3 or 4, and that “Deadlift reps” in the 18-24 range also work well for John.

In this second trial, John achieved a better gain over a similar time period than in the first trial. That was a 36 percent gain from the baseline “Deadlift weight”, a gain calculated as: (355-260)/260. John started with a lower baseline than at the end of the first trial period, probably due to detraining, but achieved a final “Deadlift weight” that was likely very close to his maximum potential (at the reps used). Because of this, the 36 percent gain in the period is a lot more impressive than it looks, as it happened toward the end of a saturation curve (e.g., the far right end of a logarithmic curve).
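For readers who want to check the arithmetic behind the two gains, here is a tiny sketch. The function name is mine; the four weights are the ones quoted in the text for the two trials. The results come out close to the rounded percentages given above.

```python
def percent_gain(start, end):
    """Gain relative to the starting value, expressed in percent."""
    return 100 * (end - start) / start


first_trial = percent_gain(215, 285)   # (285-215)/215
second_trial = percent_gain(260, 355)  # (355-260)/260

print(f"First trial:  {first_trial:.1f} percent")
print(f"Second trial: {second_trial:.1f} percent")
```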

One important thing to keep in mind is that if an HCE user identifies a nonlinear relationship of the J-curve type by inspecting the graphs like John did, in further analyses the focus should be on the right or left side of the curve by either: splitting the dataset into two, and running a separate analysis for each new dataset; or running a new trial, now sticking with a range of variation on the right or left side of the curve, as John did. The reason is that nonlinear relationships tend to distort the linear coefficients calculated by HCE, hiding a real relationship between two variables.
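The dataset-splitting idea can be illustrated with a short sketch. The data below is invented to have an inverted-J shape, the split point between 6 and 7 days is an assumption read off that made-up curve, and the Pearson correlation is used as a stand-in for a linear coefficient of association; I am not claiming it is exactly what HCE calculates.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Invented inverted-J data: delta weight peaks around 6-7 days of rest.
days = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
delta = [0, 2, 5, 8, 10, 10, 8, 7, 5, 3, 2, 0]

# Whole dataset: the rising and falling sides largely cancel each other out.
r_all = pearson(days, delta)

# Split at the (assumed) peak and analyze each side separately.
left = [(d, w) for d, w in zip(days, delta) if d <= 6]
right = [(d, w) for d, w in zip(days, delta) if d >= 7]
r_left = pearson([d for d, _ in left], [w for _, w in left])
r_right = pearson([d for d, _ in right], [w for _, w in right])

print(f"all: {r_all:.2f}  left: {r_left:.2f}  right: {r_right:.2f}")
```

With this made-up data the full-range coefficient stays well below the 0.6 threshold, while each side of the curve on its own shows a strong association, positive on the left and negative on the right.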

This is a very simplified example. Most serious bodybuilders will measure variations in a number of variables at the same time, for a number of different exercise types and formats, and for longer periods. That is, their “HealthData” sheet in HCE will be a lot more complex. They will also have multiple instances of HCE running on their computer. HCE is a collection of sheets and code that can be copied, and saved with different names. The default is “HCE_1_0.xls” or “HCE_1_0.xlsm”, depending on which version you are using. Each new instance of HCE may contain a different dataset for analysis, stored in the “HealthData” sheet.

It is strongly recommended that you keep your data in a separate set of sheets, as a backup. That is, do not store all your data in the “HealthData” sheets in different HCE instances. Also, when you copy your data into the “HealthData” sheet in HCE, copy only the values and formats, and NOT the formulas. If you copy the formulas, you may end up having some problems, as some of the cells in the “HealthData” sheet will not be storing values. I also recommend storing values for other types of variables, particularly perception-based variables.

Examples of perception-based variables are: “Perceived stress”, “Perceived delayed onset muscle soreness (DOMS)”, and “Perceived non-DOMS pain”. These can be answered on Likert-type scales, such as scales going from 1 (very strongly disagree) to 7 (very strongly agree) in response to self-prepared question-statements like “I feel stressed out” (for “Perceived stress”). If you find that a variable like “Perceived non-DOMS pain” is associated with working out at a particular volume range, that may help you avoid serious injury in the future, as non-DOMS pain is not a very good sign (). You also may find that working out in the volume range that is associated with non-DOMS pain adds nothing in terms of muscle gain.

Generally speaking, I think that many people will find out that their sweet spot for muscle gain involves less frequent exercise at lower volumes than they think. Still, each individual is unique; there is no one quite like John. The relationship between “Delta weight” and “Days of rest” varies from person to person based on age; older folks generally require more rest. It also varies based on whether the person is dieting or not; less food intake leads to longer recovery periods. Women will probably see visible lower-body muscle gain, but very little visible upper-body muscle gain (in the absence of steroid use), even as they experience upper-body strength gains. Other variables of interest for both men and women may be body weight, body fat percentage, and perceived muscle tone.

Saturday, November 5, 2011

The China Study II: How gender takes us to the elusive and deadly factor X

The graph below shows the mortality in the 35-69 and 70-79 age ranges for men and women for the China Study II dataset. I discussed other results in my two previous posts, all taking us to this post. The full data for the China Study II is publicly available. The mortality numbers are actually averages of male and female deaths per 1,000 people in each of several counties, in each of the two age ranges.

Men do tend to die earlier than women, but the difference above is too large.

Generally speaking, when you look at a set time period that is long enough for a good number of deaths (not to be confused with “a number of good deaths”) to be observed, you tend to see around 5-10 percent more deaths among men than among women. This is when other variables are controlled for, or when men and women do not adopt dramatically different diets and lifestyles. One of many examples is a study in Finland; you have to go beyond the abstract on this one.

As you can see from the graph above, in the China Study II dataset this difference in deaths is around 50 percent!

This huge difference could be caused by there being significantly more men than women per county included in the dataset. But if you take a careful look at the description of the data collection methods employed, this does not seem to be the case. In fact, the methodology descriptions suggest that the researchers tried to have approximately the same number of women and men studied in each county. The numbers reported also support this assumption.

As I said before, this is a well executed research project, for which Dr. Campbell and his collaborators should be commended. I may not agree with all of their conclusions, but this does not detract even a bit from the quality of the data they have compiled and made available to us all.

So there must be another factor X causing this enormous difference in mortality (and thus longevity) among men and women in the China Study II dataset.

What could be this factor X?

This situation helps me illustrate a point that I have made here before, mostly in the comments under other posts. Sometimes a variable, and its effects on other variables, are mostly a reflection of another unmeasured variable. Gender is a variable that is often involved in this type of situation. Frequently men and women do things very differently in a given population due to cultural reasons (as opposed to biological reasons), and those things can have a major effect on their health.

So, the search for our factor X is essentially a search for a health-relevant variable that is reflected by gender but that is not strictly due to the biological aspects that make men and women different (these can explain only a 5-10 percent difference in mortality). That is, we are looking for a variable that shows a lot of variation between men and women, that is behavioral, and that has a clear impact on health. Moreover, as it should be clear from my last post, we are looking for a variable that is unrelated to wheat flour and animal protein consumption.

As it turns out, the best candidate for the factor X is smoking, particularly cigarette smoking.

The second best candidate for factor X is alcohol abuse. Alcohol abuse can be just as bad for one’s health as smoking is, if not worse, but it may not be as good a candidate for factor X because the difference in prevalence between men and women does not appear to be just as large in China (). But it is still large enough for us to consider it a close second as a candidate for factor X, or a component of a more complex factor X – a composite of smoking, alcohol abuse and a few other coexisting factors that may be reflected by gender.

I have had some discussions about this with a few colleagues and doctoral students who are Chinese (thanks William and Wei), and they mentioned stress to me, based on anecdotal evidence. Moreover, they pointed out that stressful lifestyles, smoking, and alcohol abuse tend to happen together - with a much higher prevalence among men than women.

What an anti-climax for this series of posts, eh?

With all the talk on the Internetz about safe and unsafe starches, animal protein, wheat bellies, and whatnot! C’mon Ned, give me a break! What about insulin!? What about leucine deficiency … or iron overload!? What about choline!? What about something truly mysterious, related to an obscure or emerging biochemistry topic; a hormone du jour like leptin perhaps? Whatever, something cool!

Smoking and alcohol abuse!? These are way too obvious. This is NOT cool at all!

Well, reality is often less mysterious than we want to believe it is.

Let me focus on smoking from here on, since it is the top candidate for factor X, although much of the following applies to alcohol abuse and a combination of the two as well.

One gets different statistics on cigarette smoking in China depending on the time period studied, but one thing seems to be a common denominator in these statistics. Men tend to smoke in much, much higher numbers than women in China. And this is not a recent phenomenon.

For example, a study conducted in 1996 states that “smoking continues to be prevalent among more men (63%) than women (3.8%)”, and notes that these results are very similar to those in 1984, around the time when the China Study II data was collected.

A 1995 study reports similar percentages: “A total of 2279 males (67%) but only 72 females (2%) smoke”. Another study notes that in 1976 “56% of the men and 12% of the women were ever-smokers”, which together with other results suggests that the gap increased significantly in the 1980s, with many more men than women smoking. And, most importantly, smoking industrial cigarettes.

So we are possibly talking about a gigantic difference here; the prevalence of industrial cigarette smoking among men may have been over 30 times the prevalence among women in the China Study II dataset.
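The size of that gap is easy to check from the percentages quoted above. The sketch below simply divides the male prevalence by the female prevalence for each set of figures; the labels are mine, and the 1976 numbers refer to ever-smokers, so they are not strictly comparable with the other two.

```python
# (men %, women %) exactly as quoted in the studies mentioned above.
surveys = {
    "1996 study": (63.0, 3.8),
    "1995 study": (67.0, 2.0),
    "1976 figures (ever-smokers)": (56.0, 12.0),
}


def prevalence_ratio(men_pct, women_pct):
    """How many times more prevalent smoking is among men than among women."""
    return men_pct / women_pct


ratios = {label: prevalence_ratio(*pcts) for label, pcts in surveys.items()}

for label, ratio in ratios.items():
    print(f"{label}: {ratio:.1f} times")
```

Only the 1995 percentages yield a ratio above 30; the 1996 percentages give a ratio roughly half that size, which is why the "over 30 times" figure is best read as an upper-end possibility.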

Given the above, it is reasonable to conclude that the variable “SexM1F2” reflects very strongly the variable “Smoking”, related to industrial cigarette smoking, and in an inverse way. I did something that, grossly speaking, made the mysterious factor X explicit in the WarpPLS model discussed in my previous post. I replaced the variable “SexM1F2” in the model with the variable “Smoking” by using a reverse scale (i.e., 1 and 2, but reversing the codes used for “SexM1F2”). The results of the new WarpPLS analysis are shown on the graph below. This is of course far from ideal, but gives a better picture to readers of what is going on than sticking with the variable “SexM1F2”.
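The reverse coding just described amounts to a one-line transformation. A minimal sketch follows, with a hypothetical helper name and four made-up rows; it is not taken from WarpPLS.

```python
def reverse_code(value, low=1, high=2):
    """Reverse a code on a low..high scale (here 1 becomes 2 and 2 becomes 1)."""
    return low + high - value


# Hypothetical rows: male, female, male, female (1 = male, 2 = female).
sex_m1f2 = [1, 2, 1, 2]

# "Smoking" proxy: men (coded 1) get the high value, women (coded 2) the low one.
smoking = [reverse_code(v) for v in sex_m1f2]

print(smoking)
```

Because this is a simple linear flip, any linear association involving the new variable keeps its magnitude and only changes sign relative to “SexM1F2”.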

With this revised model, the associations of smoking with mortality in the 35-69 and 70-79 age ranges are a lot stronger than those of animal protein and wheat flour consumption. The R-squared coefficients for mortality in both ranges are higher than 20 percent, which is a sign that this model has decent explanatory power. Animal protein and wheat flour consumption are still significantly associated with mortality, even after we control for smoking; animal protein seems protective and wheat flour detrimental. And smoking’s association with the amount of animal protein and wheat flour consumed is practically zero.

Replacing “SexM1F2” with “Smoking” would be particularly far from ideal if we were analyzing this data at the individual level. It could lead to some outlier-induced errors; for example, due to the possible existence of a minority of female chain smokers. But this variable replacement is not as harmful when we look at county-level data, as we are doing here.

In fact, this is as good and parsimonious a model of mortality as I’ve ever seen based on county-level China Study II data.

Now, here is an interesting thing. Does the original China Study II analysis of univariate correlations show smoking as a major problem in terms of mortality? Not really.

The table below, from the China Study II report, shows ALL of the statistically significant (P<0.05) univariate correlations with mortality in the 70-79 age range. I highlighted the only measure that is directly related to smoking; that is “dSMOKAGEm”, listed as “questionnaire AGE MALE SMOKERS STARTED SMOKING (years)”.

The high positive correlation with “dSMOKAGEm” does not even make a lot of sense, as one would expect a negative correlation here – i.e., the earlier in life folks start smoking, the higher should be the mortality. But this reverse-signed correlation may be due to smokers who get an early start dying in disproportionally high numbers before they reach age 70, and thus being captured by another age range mortality variable. The fact that other smoking-related variables are not showing up on the table above is likely due to distortions caused by inter-correlations, as well as measurement problems like the one just mentioned.

As one looks at these univariate correlations, most of them make sense, although several can be and probably are distorted by correlations with other variables, even unmeasured variables. And some unmeasured variables may turn out to be critical. Remember what I said in my previous post – the variable “SexM1F2” was introduced by me; it was not in the original dataset. “Smoking” is this variable, but reversed, to account for the fact that men are heavy smokers and women are not.

Univariate correlations are calculated without adjustments or control. To correct this problem one can adjust a variable based on other variables; as in “adjusting for age”. This is not such a good technique, in my opinion; it tends to be time-consuming to implement, and prone to errors. One can alternatively control for the effects of other variables; a better technique, employed in multivariate statistical analyses. This latter technique is the one employed in WarpPLS analyses.
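To give a feel for what "controlling for" a third variable does to an association, here is a sketch using the standard first-order partial correlation formula. This is a textbook shortcut and not the algorithm WarpPLS uses; the three input correlations are made-up numbers.

```python
from math import sqrt


def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation between x and y after removing the linear effect of z."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Made-up example: x and y look associated, but both are tied to z.
raw = 0.5
controlled = partial_correlation(r_xy=raw, r_xz=0.6, r_yz=0.7)

print(f"univariate: {raw:.2f}  controlling for z: {controlled:.2f}")
```

In this invented case most of the univariate association disappears once z is held constant, which is the kind of distortion that uncontrolled correlations can hide.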

Why don’t more smoking-related variables show up on the univariate correlations table above? The reason is that the table summarizes associations calculated based on data for both sexes. Since the women in the dataset smoked very little, including them in the analysis together with men lowers the strength of smoking-related associations, which would probably be much stronger if only men were included. It lowers the strength of the associations to the point that their P values become higher than 0.05, leading to their exclusion from tables like the one above. This is where the aggregation process that may lead to ecological fallacy rears its ugly head.
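One way this dilution can happen is sketched below with invented county-level numbers. I am assuming that a male smoking measure is correlated against mortality averaged over both sexes, and that female mortality varies for reasons unrelated to male smoking. This is an illustration of the mechanism only, not a reanalysis of the China Study II data.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Invented data for six counties.
male_smoking = [50, 55, 60, 65, 70, 75]      # percent of men who smoke
male_mortality = [20, 22, 23, 26, 27, 30]    # tracks male smoking closely
female_mortality = [18, 10, 16, 9, 17, 11]   # driven by other factors

# Aggregated "both sexes" mortality, as a simple average of the two.
combined = [(m + f) / 2 for m, f in zip(male_mortality, female_mortality)]

r_men = pearson(male_smoking, male_mortality)
r_combined = pearson(male_smoking, combined)

print(f"men only: {r_men:.2f}  both sexes: {r_combined:.2f}")
```

With these numbers the men-only association is very strong, while the association with the aggregated mortality is markedly weaker, even though nothing about the men has changed.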

No one can blame Dr. Campbell for not issuing warnings about smoking, even as they came mixed with warnings about animal food consumption. The former warnings, about smoking, make a lot of sense based on the results of the analyses in this and the last two posts.

The latter warnings, about animal food consumption, seem increasingly ill-advised. Animal food consumption may actually be protective in regards to the factor X, as it seems to be protective in terms of wheat flour consumption.

Monday, October 31, 2011

The China Study II: Gender, mortality, and the mysterious factor X

WarpPLS and HealthCorrelator for Excel were used to do the analyses below. For other China Study analyses, many using WarpPLS as well as HealthCorrelator for Excel, click here. For the dataset used, visit the HealthCorrelator for Excel site and check under the sample datasets area. As always, I thank Dr. T. Colin Campbell and his collaborators for making the data publicly available for independent analyses.

In my previous post I mentioned some odd results that led me to additional analyses. Below is a screen snapshot summarizing one such analysis, of the ordered associations between mortality in the 35-69 and 70-79 age ranges and all of the other variables in the dataset. As I said before, this is a subset of the China Study II dataset, which does not include all of the variables for which data was collected. The associations shown below were generated by HealthCorrelator for Excel.

The top associations are positive and with mortality in the other range (the “M006 …” and “M005 …” variables). This is to be expected if ecological fallacy is not a big problem in terms of conclusions drawn from this dataset. In other words, the same things cause mortality to go up in the two age ranges, uniformly across counties. This is reassuring from a quantitative analysis perspective.

The second highest association in both age ranges is with the variable “SexM1F2”. This variable is a “dummy” variable coded as 1 for male sex and 2 for female, which I added to the dataset myself – it did not exist in the original dataset. The association in both age ranges is negative, meaning that being female is protective. These associations reflect in part the role of gender on mortality, more specifically the biological aspects of being female, since we have seen before in previous analyses that being female is generally health-protective.

I was able to add a gender-related variable to the model because the data was originally provided for each county separately for males and females, as well as through “totals” that were calculated by aggregating data from both males and females. So I essentially de-aggregated the data by using data from males and females separately, in which case the totals were not used (otherwise I would have artificially reduced the variance in all variables, also possibly adding uniformity where it did not belong). Using data from males and females separately is the reverse of the aggregation process that can lead to ecological fallacy problems.
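The de-aggregation step can be sketched as turning each county record into two rows, one per sex, and tagging each row with the dummy variable. The county labels, field names and numbers below are hypothetical; only the 1 = male, 2 = female coding comes from the text.

```python
# Hypothetical county records with separate male and female values.
counties = [
    {"county": "A", "mortality_m": 30.0, "mortality_f": 20.0},
    {"county": "B", "mortality_m": 26.0, "mortality_f": 18.0},
]


def deaggregate(rows):
    """Emit one row per county and sex, adding the SexM1F2 dummy variable."""
    long_rows = []
    for row in rows:
        # Male row, coded 1. The "totals" columns are deliberately not used.
        long_rows.append(
            {"county": row["county"], "SexM1F2": 1, "mortality": row["mortality_m"]}
        )
        # Female row, coded 2.
        long_rows.append(
            {"county": row["county"], "SexM1F2": 2, "mortality": row["mortality_f"]}
        )
    return long_rows


long_rows = deaggregate(counties)
for row in long_rows:
    print(row)
```

The resulting table has twice as many rows as there are counties, and the variance between men and women within a county is preserved rather than averaged away.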

Anyway, the associations with the variable “SexM1F2” got me thinking about a possibility. What if females consumed significantly less wheat flour and more animal protein in this dataset? This could be one of the reasons behind these strong associations between being female and living longer. So I built a more complex WarpPLS model than the one in my previous post, and ran a linear multivariate analysis on it. The results are shown below.

What do these results suggest? They suggest no strong associations between gender and wheat flour or animal protein consumption. That is, when you look at county averages, men and women consumed about the same amounts of wheat flour and animal protein. Also, the results suggest that animal protein is protective and wheat flour is detrimental, in terms of longevity, regardless of gender. The associations between animal protein and wheat flour are essentially the same as the ones in my previous post. The beta coefficients are a bit lower, but some P values improved (i.e., decreased); the latter most likely due to better resample set stability after including the gender-related variable.

Most importantly, there is a very strong protective effect associated with being female, and this effect is independent of what the participants ate.

Now, if you are a man, don’t rush to take hormones to become a woman with the goal of living longer just yet. This advice is not only due to the likely health problems related to becoming a transgender person; it is also due to a little problem with these associations. The problem is that the protective effect suggested by the coefficients of association between gender and mortality seems too strong to be due to men "being women with a few design flaws".

There is a mysterious factor X somewhere in there, and it is not gender per se. We need to find a better candidate.

One interesting thing to point out here is that the above model has good explanatory power in regards to mortality. I'd say unusually good explanatory power given that people die for a variety of reasons, and here we have a model explaining a lot of that variation. The model explains 45 percent of the variance in mortality in the 35-69 age range, and 28 percent of the variance in the 70-79 age range.

In other words, the model above explains nearly half of the variance in mortality in the 35-69 age range. It could form the basis of a doctoral dissertation in nutrition or epidemiology with important implications for public health policy in China. But first the factor X must be identified, and it must be somehow related to gender.

Next post coming up soon ...

Monday, October 24, 2011

The China Study II: Animal protein, wheat, and mortality … there is something odd here!

WarpPLS and HealthCorrelator for Excel were used in the analyses below. For other China Study analyses, many using WarpPLS and HealthCorrelator for Excel, click here. For the dataset used, visit the HealthCorrelator for Excel site and check under the sample datasets area. I thank Dr. T. Colin Campbell and his collaborators at the University of Oxford for making the data publicly available for independent analyses.

The graph below shows the results of a multivariate linear WarpPLS analysis including the following variables: Wheat (wheat flour consumption in g/d), Aprot (animal protein consumption in g/d), Mor35_69 (number of deaths per 1,000 people in the 35-69 age range), and Mor70_79 (number of deaths per 1,000 people in the 70-79 age range).

Just a technical comment here, regarding the possibility of ecological fallacy. I am not going to get into this in any depth now, but let me say that the patterns in the data suggest that, with the possible exception of some variables (e.g., blood glucose, gender; the latter will get us going in the next few posts), ecological fallacy due to county aggregation is not a big problem. The threat of ecological fallacy exists, here and in many other datasets, but it is generally overstated (often by those whose previous findings are contradicted by aggregated results).

I have not included plant protein consumption in the analysis because plant protein consumption is very strongly and positively associated with wheat flour consumption. The reason is simple. Almost all of the plant protein consumed by the participants in this study was probably gluten, from wheat products. Fruits and vegetables have very small amounts of protein. Keeping that in mind, what the graph above tells us is that:

- Wheat flour consumption is significantly and negatively associated with animal protein consumption. This is probably due to those eating more wheat products tending to consume less animal protein.

- Wheat flour consumption is positively associated with mortality in the 35-69 age range. The P value (P=0.06) is just shy of the 5 percent (i.e., P=0.05) that most researchers would consider to be the threshold for statistical significance. More consumption of wheat in a county, more deaths in this age range.

- Wheat flour consumption is significantly and positively associated with mortality in the 70-79 age range. More consumption of wheat in a county, more deaths in this age range.

- Animal protein consumption is not significantly associated with mortality in the 35-69 age range.

- Animal protein consumption is significantly and negatively associated with mortality in the 70-79 age range. More consumption of animal protein in a county, fewer deaths in this age range.

Let me tell you, from my past experience analyzing health data (as well as other types of data, from different fields), that these coefficients of association do not suggest super-strong associations. Actually this is also indicated by the R-squared coefficients, which vary from 3 to 7 percent. These are the variances explained by the model on the variables above the R-squared coefficients. They are low, which means that the model has weak explanatory power.
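For readers less familiar with R-squared, here is a minimal sketch of how the variance explained can be computed for a single predictor with ordinary least squares. The five data points are invented, and WarpPLS arrives at its R-squared coefficients through its own multivariate procedure, so this only illustrates the concept.

```python
def r_squared(xs, ys):
    """Share of the variance in ys explained by a straight-line fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    # Residual sum of squares versus total sum of squares.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Invented county averages: wheat flour (g/d) and deaths per 1,000 people.
wheat = [100, 200, 300, 400, 500]
mortality = [12, 9, 14, 10, 13]

weak_fit = r_squared(wheat, mortality)
print(f"variance explained: {100 * weak_fit:.1f} percent")
```

A value in the single digits, as with this made-up data, means that the predictor leaves the vast majority of the variation in mortality unexplained.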

R-squared coefficients of 20 percent and above would be more promising. I hate to disappoint hardcore carnivores and the fans of the “wheat is murder” theory, but these coefficients of association and variance explained are probably way less than what we would expect to see if animal protein was humanity's salvation and wheat its demise.

Moreover, the lack of association between animal protein consumption and mortality in the 35-69 age range is a bit strange, given that there is an association suggestive of a protective effect in the 70-79 age range.

Of course death happens for all kinds of reasons, not only what we eat. Still, let us take a look at some other graphs involving these foodstuffs to see if we can form a better picture of what is going on here. Below is a graph showing mortality at the two age ranges for different levels of animal protein consumption. The results are organized in quintiles.

As you can see, the participants in this study consumed relatively little animal protein. The lowest mortality in the 70-79 age range, arguably the range of higher vulnerability, was for the 28 to 35 g/d quintile of consumption. That was the highest consumption quintile. About a quarter to a third of 1 lb/d of beef, and less of seafood (in general), would give you that much animal protein.

Keep in mind that the unit of analysis here is the county, and that these results are based on county averages. I wish I had access to data on individual participants! Still I stand by my comment earlier on ecological fallacy. Don't worry too much about it just yet.

Clearly the above results and graphs contradict claims that animal protein consumption makes people die earlier, and go somewhat against the notion that animal protein consumption causes things that make people die earlier, such as cancer. But they do so in a messy way - that spike in mortality in the 70-79 age range for 21-28 g/d of animal protein is a bit strange.

Below is a graph showing mortality at the two age ranges (i.e., 35-69 and 70-79) for different levels of wheat flour consumption. Again, the results are shown in quintiles.

Without a doubt the participants in this study consumed a lot of wheat flour. The lowest mortality in the 70-79 age range, which is the range of higher vulnerability, was for the 300 to 450 g/d quintile of wheat flour consumption. The high end of this range is about 1 lb/d of wheat flour! How many slices of bread would this be equivalent to? I don’t know, but my guess is that it would be many.

Well, this is not exactly the smoking gun linking wheat with early death, a connection that has been reaching near mythical proportions on the Internetz lately. Overall, the linear trend seems to be one of decreased longevity associated with wheat flour consumption, as suggested by the WarpPLS results, but the relationship between these two variables is messy and somewhat weak. It is not even clearly nonlinear, at least in terms of the ubiquitous J-curve relationship.

Frankly, there is something odd about these results.

This oddity led me to explore, using HealthCorrelator for Excel, all ordered associations between mortality in the 35-69 and 70-79 age ranges and all of the other variables in the dataset. That in turn led me to a more complex WarpPLS analysis, which I’ll talk about in my next post, which is still being written.

I can tell you right now that there will be more oddities there, which will eventually take us to what I refer to as the mysterious factor X. Ah, by the way, that factor X is not gender - but gender leads us to it.

Monday, May 23, 2011

The China Study II: Wheat may not be so bad if you eat 221 g or more of animal food daily

In previous posts on this blog covering the China Study II data we’ve looked at the competing effects of various foods, including wheat and animal foods. Unfortunately we have had to stick to the broad group categories available from the specific data subset used; e.g., animal foods, instead of categories of animal foods such as dairy, seafood, and beef. This is still a problem, until I can find the time to get more of the China Study II data in a format that can be reliably used for multivariate analyses.

What we haven’t done yet, however, is to look at moderating effects. And that is something we can do now. A moderating effect is the effect of a variable on the effect of another variable on a third. Sounds complicated, but WarpPLS makes it very easy to test moderating effects. All you have to do is to make a variable (e.g., animal food intake) point at a direct link (e.g., between wheat flour intake and mortality). The moderating effect is shown on the graph as a dashed arrow going from a variable to a link between two variables.

The graph below shows the results of an analysis where animal food intake (Afoods) is hypothesized to moderate the effects of wheat flour intake (Wheat) on mortality in the 35 to 69 age range (Mor35_69) and mortality in the 70 to 79 age range (Mor70_79). A basic linear algorithm was used, whereby standardized partial regression coefficients, both moderating and direct, are calculated based on the equations of best-fitting lines.
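
To make the idea of a linear moderating effect concrete, here is a minimal sketch that assumes the textbook approach: standardize the variables, add the product of the predictor and the moderator as an extra term, and fit by ordinary least squares. This is not the WarpPLS implementation, which may handle the product term differently. The data are simulated, and the function names are hypothetical.

```python
import random
from statistics import mean, pstdev

def standardize(v):
    """Rescale a list to mean 0 and standard deviation 1."""
    m, s = mean(v), pstdev(v)
    return [(vi - m) / s for vi in v]

def solve(a, b):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def moderated_betas(x, mod, y):
    """Fit y ~ x + mod + x*mod on standardized variables by least squares."""
    zx, zm, zy = standardize(x), standardize(mod), standardize(y)
    # Columns: intercept, direct effect, moderator, product (moderating) term.
    cols = [[1.0] * len(zx), zx, zm, [a * b for a, b in zip(zx, zm)]]
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(p * q for p, q in zip(ci, zy)) for ci in cols]
    _, b_x, b_mod, b_inter = solve(xtx, xty)
    return b_x, b_mod, b_inter

# Simulated data, NOT the China Study II data: wheat raises the outcome,
# and the moderator weakens that effect.
rng = random.Random(1)
wheat = [rng.gauss(0, 1) for _ in range(2000)]
afoods = [rng.gauss(0, 1) for _ in range(2000)]
mortality = [0.3 * w - 0.25 * w * a + rng.gauss(0, 1)
             for w, a in zip(wheat, afoods)]

b_x, b_mod, b_inter = moderated_betas(wheat, afoods, mortality)
print(f"direct: {b_x:+.2f}  moderator: {b_mod:+.2f}  moderating: {b_inter:+.2f}")
```

A negative coefficient on the product term means that the slope of the outcome on the predictor gets smaller as the moderator increases, which is the pattern a dashed arrow with a negative beta represents.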

From the graph above we can tell that wheat flour intake increases mortality significantly in both age ranges: in the 35 to 69 age range (beta=0.17, P=0.05) and in the 70 to 79 age range (beta=0.24, P=0.01). This is a finding that we have seen before in previous posts, and that has been one of the main findings of Denise Minger’s analysis of the China Study data. Denise and I used different data subsets and analysis methods, and reached essentially the same results.

But here is what is interesting about the moderating effects analysis results summarized on the graph above. They suggest that animal food intake significantly reduces the negative effect of wheat flour consumption on mortality in the 70 to 79 age range (beta=-0.22, P<0.01). This is a relatively strong moderating effect. The moderating effect of animal food intake is not significant for the 35 to 69 age range (beta=-0.00, P=0.50); the beta here is negative but very low, suggesting a very weak protective effect.

Below are two standardized plots showing the relationships between wheat flour intake and mortality in the 70 to 79 age range when animal food intake is low (left plot) and high (right plot). As you can see, the best-fitting line is flat on the right plot, meaning that wheat flour intake has no effect on mortality in the 70 to 79 age range when animal food intake is high. When animal food intake is low (left plot), the effect of wheat flour intake on mortality in this range is significant; its strength is indicated by the upward slope of the best-fitting line.
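
The two plots can be related back to the coefficients with a bit of arithmetic. The sketch below assumes a plain linear reading of the model, in which the slope of mortality on wheat equals the direct beta plus the moderating beta times the standardized animal food intake. That reading is an assumption for illustration only; the plots themselves may have been produced by splitting the sample rather than by this formula.

```python
# Standardized coefficients reported above for the 70 to 79 age range.
BETA_WHEAT = 0.24        # direct effect of wheat flour intake on mortality
BETA_MODERATION = -0.22  # moderating effect of animal food intake

def wheat_slope(afoods_z):
    """Slope of mortality on wheat at a given animal food intake.

    afoods_z is animal food intake in standard deviations from the mean.
    """
    return BETA_WHEAT + BETA_MODERATION * afoods_z

for z in (-1.0, 0.0, 1.0):
    print(f"animal food z = {z:+.1f}: wheat slope = {wheat_slope(z):+.2f}")

# Intake level (in standard deviations) where the fitted slope crosses zero.
zero_crossing_z = -BETA_WHEAT / BETA_MODERATION
print(f"slope reaches zero at z = {zero_crossing_z:.2f}")
```

Under that assumption the slope is about twice as steep one standard deviation below the mean as it is at the mean, and it is nearly flat one standard deviation above the mean, which is in line with the upward-sloping left plot and the flat right plot.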

What these results seem to be telling us is that wheat flour consumption contributes to early death for some people, perhaps those who are most sensitive or intolerant to wheat. These people are represented in the variable measuring mortality in the 35 to 69 age range, and not in the 70 to 79 age range, since they died before reaching the age of 70.

Those in the 70 to 79 age range may be the least sensitive ones, for whom animal food intake seems to be protective, but only if that intake is above a certain level. This is not a ringing endorsement of wheat, but it certainly helps explain wheat consumption in long-living groups around the world, including the French.

How much animal food does it take for the protective effect to be observed? In the China Study II sample, it is about 221 g/day or more. That is approximately the intake level above which the relationship between wheat flour intake and mortality in the 70 to 79 age range becomes statistically indistinguishable from zero. That is a little less than ½ lb, or about 7.8 oz, of animal food intake per day.