
Tuesday, July 10, 2012

Partisan Statistic Alert

Some people are touting the statistics reported here, which show that the national debt has increased by a smaller percentage under Obama than under the four previous presidents.

Reagan: 189%
Bush I: 55%
Clinton: 37%
Bush II: 86%
Obama: 35%
I double-checked the numbers, and they are technically correct. But they’re also meaningless.

First, these figures compare presidents instead of presidential terms. It shouldn’t be surprising that a two-term president will tend to rack up more debt than a one-term president.

Second, this is one of those instances where percentages are completely misleading. Each administration inherits the debt built up by all previous administrations, and that inherited debt provides the denominator for calculating the percentage increase. As a result, the percentage is automatically pulled downward for later presidents simply because they are later.

(For comparison, imagine if the entire $14.9 trillion in debt accumulated since 1980 had been added in equal-sized chunks by all eight presidential terms. That would be $1.87 trillion per term. Yet the percentages wouldn’t be equal at all. They would decline in every single term, from a high of 201% for Reagan 1 to a low of 13% for Obama.)
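Here’s a quick sketch of that arithmetic in Python, assuming a starting debt of roughly $0.93 trillion in 1981 (the value implied by the 201% figure); the point is just that equal additions produce shrinking percentages:

```python
# Equal-sized debt additions still produce shrinking percentage increases,
# because each term inherits a bigger denominator. The starting debt is an
# assumption (about $0.93 trillion, consistent with the 201% figure above).

terms = ["Reagan 1", "Reagan 2", "Bush I", "Clinton 1", "Clinton 2",
         "Bush II 1", "Bush II 2", "Obama"]
debt = 0.93                    # trillions of dollars at the start of 1981 (assumed)
chunk = 14.9 / 8               # the same ~$1.86 trillion added every term

for term in terms:
    pct = 100 * chunk / debt   # increase relative to the inherited debt
    print(f"{term:10s} +{pct:4.0f}%")
    debt += chunk
```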

So what happens when we correct for both errors? Correcting the term problem first, and also adjusting dollars for inflation (something else I don’t think the original source did), here are the percentage increases by presidential term:


Now Obama’s record isn’t the best. He has the third highest percentage increase, and he hasn’t even finished his term yet. (I used the most recent national debt figures, which you can find here. For pre-1993 figures, see here.)

But again, the percentage is misleading. It would be better to look at the absolute dollar increase (again, adjusted for inflation). Here’s what you get:


Now it becomes clear: Compared to the previous seven presidential terms, Obama has presided over the largest increase in the national debt. And again, his term isn’t over yet.

Obviously, Obama’s defenders will say his actions were justified. He inherited a terrible economy, a large stimulus was necessary to boost performance, some expenditure increases were outside Obama’s control, etc. Those arguments might even be right, and they’re free to make them… but only after admitting that the national debt did, in fact, increase dramatically during Obama’s term.

One final addendum: these numbers could, of course, be adjusted in many other ways as well. You could adjust for the size of GDP or population. You could change the start and end dates to reflect who passed the relevant budgets, or to reflect that a presidential term doesn’t start until a couple of months after the election; doing so would shift some of Obama’s debt into Bush II’s second term (as well as shorten Obama’s effective time in office). In truth, there’s something inherently silly about trying to attribute changes in the national debt to specific presidents at all, since additional debt results from a complex interplay of policies created by multiple presidents and congresses over time. All I’m really trying to correct here is two very obvious errors that the creators of these particular statistics should have seen instantly, and probably would have seen if they didn’t have partisan blinders on.


Tuesday, May 15, 2012

Circular Reasoning

Former student Gabe Krupa, remembering a lecture I gave on the topic of misleading graphs and statistics, alerted me to this graphic showing the fall in Yahoo’s enterprise value. (I didn’t know the term enterprise value, but apparently it’s similar to market capitalization but with a few tweaks.)


As you can see, from 2006 to 2012, Yahoo’s enterprise value fell from $54.9 billion to $17.26 billion. The current value is just under a third of its value six years ago. But that big circle looks a lot more than three times the size of the small circle. In fact, it’s about ten times the size.

As Gabe said in his email to me, the creators of this graphic used a 2/3 reduction in the radius of the circle when they should have used a 2/3 reduction in the area. Since the area of a circle increases with the square of the radius, the graphic drastically overstates the difference in value. (To be more specific, the small circle’s radius is about 31.4% of the big circle’s radius. The square of 0.314 is 0.098, meaning the small circle’s area is 9.8% of the big circle’s area.)
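Here’s the arithmetic as a short sketch; the only inputs are the two enterprise values from the graphic:

```python
import math

big_value, small_value = 54.9, 17.26     # Yahoo enterprise value, $ billions
ratio = small_value / big_value          # about 0.314

# Scaling the RADIUS by the value ratio shrinks the AREA by the ratio squared,
# which is what the graphic's creators effectively did.
area_ratio_if_radius_scaled = ratio ** 2          # about 0.098

# To make the areas proportional to the values, the radius should instead be
# scaled by the square root of the value ratio.
correct_radius_scale = math.sqrt(ratio)           # about 0.56

print(f"value ratio:           {ratio:.3f}")
print(f"resulting area ratio:  {area_ratio_if_radius_scaled:.3f}")
print(f"correct radius scale:  {correct_radius_scale:.3f}")
```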

This kind of error was highlighted in Darrell Huff’s How to Lie With Statistics, first published in 1954. The bad news is that media sources still make the same error, whether purposely or accidentally, almost 60 years later. The good news is that apparently some students really do remember what they learned in class, even years later. My thanks to Gabe for bringing this example to my attention six years after taking my course.


Wednesday, March 09, 2011

Unions, Education, and Simpson's Paradox

I know, I know, I never blog anymore. I hope to change that soon. Maybe.

But in the meantime, I had to draw attention to this post by IowaHawk. As Tyler Cowen observes, the post is marred by some unnecessary partisan vitriol. But the central point is nevertheless fascinating. Although unionized Wisconsin outperforms un-unionized Texas in educational performance when you look at overall figures, it turns out that Texas outperforms Wisconsin for every major ethnic group. Now, this might sound impossible, until you realize that Texas has a substantially higher percentage of blacks and Hispanics, who tend to get lower test scores regardless of the state.

I'm not trying to make political hay out of this; I'll leave that to others. No, the reason I find this so interesting is that, as a math nerd, I'm excited at any sighting of Simpson's Paradox in the wild. The same phenomenon can happen when, for instance, a hospital with better doctors gets a disproportionate number of the sickest patients. See here for a previous, albeit artificially constructed, appearance of Simpson's Paradox on this blog.
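For readers who want to see the mechanics, here's a toy example with invented numbers (not the actual Texas/Wisconsin scores):

```python
# State T beats State W within each group, yet W wins overall, because W's
# students are weighted toward the higher-scoring group. The shares and scores
# below are made up purely to illustrate Simpson's Paradox.

states = {
    "T": [("group A", 0.4, 220), ("group B", 0.6, 200)],
    "W": [("group A", 0.8, 215), ("group B", 0.2, 195)],
}

for state, groups in states.items():
    overall = sum(share * score for _, share, score in groups)
    detail = ", ".join(f"{name}: {score}" for name, _, score in groups)
    print(f"State {state}: {detail}, overall: {overall:.0f}")

# State T is higher in both groups (220 > 215 and 200 > 195),
# but State W's overall average (211) beats State T's (208).
```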


Thursday, August 12, 2010

Tax Cuts Relative to Tax Payments

So this graph from Ezra Klein is making the rounds. In a nutshell, it shows that the proposed GOP tax cut (really an extension of the Bush tax cut) gives lots of money back to the very rich, whereas the Democratic tax cut (really a partial repeal of the Bush tax cut, meaning a tax increase) is not so generous with the rich.


The graph speaks for itself. Or does it? What the graph doesn’t show is how much each income group pays in taxes to begin with. The real question is how much each group is getting back relative to how much they put in.

Think of a tax cut as a kind of rebate: the government took some of your money, and now it’s giving some back. So how big is the rebate per dollar of tax paid? Using IRS data and the numbers in Klein’s graph, I’ve broken it down:


The chart shows that under both plans, the highest-income groups get a much smaller rebate per dollar, while the lowest-income groups get a much larger rebate per dollar. The difference is that the Democratic plan gives the rich almost no rebate at all -- about 1 cent per dollar -- whereas the GOP plan does give the rich a rebate of about 13 cents per dollar. Meanwhile, everyone earning less than $200K gets a rebate of at least 22 cents per dollar, with some groups getting much larger rebates (reaching as high as 73 cents per dollar for households earning $10-$20K).
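The calculation behind the chart is just division; here’s a sketch with placeholder figures (the actual IRS and Klein numbers aren’t reproduced here):

```python
# Rebate per dollar = the group's tax cut divided by the taxes it already pays.
# The dollar amounts below are placeholders, not the real IRS/Klein figures.

groups = {
    "$10K-$20K":  {"taxes_paid": 15,  "tax_cut": 11},    # $ billions, hypothetical
    "$50K-$200K": {"taxes_paid": 500, "tax_cut": 120},
    "over $200K": {"taxes_paid": 900, "tax_cut": 110},
}

for name, g in groups.items():
    rebate = g["tax_cut"] / g["taxes_paid"]
    print(f"{name:12s} {rebate:.2f} per dollar of tax paid")
```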

The fact is that the rich pay a whole lot more in taxes than everyone else. It should be no surprise that their tax cuts are larger as well; the only way to avoid this is to give a disproportionate tax cut to people with lower incomes. For more, see this post from a few years ago.


Monday, July 26, 2010

How Is This a Problem?

Business Insider’s Chart of the Day purports to show the “dismal” state of America’s household net worth. The authors describe the chart as showing the ratio of household net worth to disposable personal income “falling back to levels last seen in the late 1980s and early 1990s.”


Now, the main thing that makes our current position look bad is those two big spikes on the far right. As the authors note, those correspond to the dot-com bubble and housing bubble, respectively. And since those were, y’know, bubbles, they don’t really represent the value of fundamentals. Those years should arguably be ignored. But once you ignore those years, a quick eyeball reveals that the profile is pretty much flat. The ratio has hovered around 500% for over half a century, with the exception of a dip during the 1970s and early 1980s. And 500% is where we are now. How is this a problem?

Maybe the authors think the figure should be rising over time, and the flatness of the graph (once the bubbles are removed) reflects stagnation. But that’s a non sequitur. Since the figure is a ratio, it’s perfectly consistent with both rising net worth and rising disposable income. For the ratio to trend upward, net worth would need to rise more quickly than disposable income. But as far as I know, there’s no reason to expect that. Am I missing something?


Thursday, May 20, 2010

U.S. News: Less Transparency = More Fairness

Robert Morse today announced that, in response to evidence that law schools had been gaming its rankings, U.S. News would change the way it estimates the "Employment at 9 Months" measure for schools that decline to report that figure. Paul Caron offers some background here. Said Morse: "U.S. News is planning to significantly change its estimate for the at-graduation rate employment for nonresponding schools in order to create an incentive for more law schools to report their actual at-graduation employment rate data. This new estimating procedure will not be released publicly before we publish the rankings."

I understand that U.S. News generated the formula it formerly used to estimate the Emp9 figure for non-reporting schools by running a regression comparing the Emp0 and Emp9 data from reporting schools. It used to puzzle me that U.S. News evidently did not re-run the regression each year, but rather stuck with the original estimate. In retrospect, though, I see that sticking to the same formula might have partially helped U.S. News offset the gaming it so dislikes. After all, as more and more schools with low numbers refused to report Emp9 data, opting to rely instead on the publicized formula, the estimated relationship between Emp0 and Emp9 scores would shift so as to favor non-reporting schools. Better to stick with the old formula, dated though it might be, than to increase the incentive to opt out of reporting.
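For the curious, the estimation step presumably looked something like a simple linear regression of Emp9 on Emp0 across reporting schools; here is a sketch with fabricated data:

```python
# Fit Emp9 as a linear function of Emp0 using schools that report both, then
# use the fitted line to impute Emp9 for schools that report only Emp0.
# The numbers here are fabricated; this is just the shape of the procedure.

import numpy as np
from scipy import stats

emp0 = np.array([55, 60, 68, 72, 80, 85, 90, 95])   # at-graduation employment (%)
emp9 = np.array([78, 82, 86, 88, 91, 94, 96, 98])   # nine-month employment (%)

fit = stats.linregress(emp0, emp9)
print(f"Emp9 estimate = {fit.intercept:.1f} + {fit.slope:.2f} * Emp0")

nonreporter_emp0 = 65                                # a school that reports only Emp0
print("imputed Emp9:", round(fit.intercept + fit.slope * nonreporter_emp0, 1))
```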

U.S. News thus avoided a vicious cycle, but only at the cost of signaling to schools exactly when hiding Emp9 data would help their rankings. Will its new reticence work? Schools can now only guess at how U.S. News will turn Emp0 numbers into Emp9 estimates, and will rightly worry that they might misjudge the new cutoff. Even if big-E ethics does not counsel reporting Emp9 numbers, therefore, small-c conservatism will. Granted, a school might reason, "U.S. News will still try to find a reasonably accurate way to turn Emp0 data into Emp9 estimates, and it has always helped us to not report in the past, so it remains a gamble worth taking." But such schools should also rightly worry that U.S. News might throw a punitive little kick into its new formula, to encourage schools to worry more about accuracy than about rankings.

[Crossposted at Agoraphilia and MoneyLaw.]


Monday, March 02, 2009

The Singles Map

Richard Florida has posted a nice map (originally from National Geographic) showing which cities have the greatest numerical disparity between single men and single women. It appears that I'm in a pickle, being a single guy in L.A., which has the largest excess of men over women.



But wait a minute... the spots on the map are based on absolute differences, not scaled for population. It looks like Las Vegas has about 20,000 more single men than women, a difference that constitutes about 1% of the Las Vegas region's population. The L.A. region has a larger difference of 40,000, but that constitutes only 0.3% of the region's population. (It would be even better to take the ratio of men to women in each region, but I don't have that data.)
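The scaling point in a few lines of arithmetic (the population figures below are rough assumptions implied by the percentages above, not census data):

```python
# A larger absolute surplus of single men can still be a smaller share of the
# region's population. Populations here are rough assumptions.

regions = {
    "Las Vegas":   {"extra_single_men": 20_000, "population": 2_000_000},
    "Los Angeles": {"extra_single_men": 40_000, "population": 13_000_000},
}

for name, r in regions.items():
    share = 100 * r["extra_single_men"] / r["population"]
    print(f"{name:12s} surplus {r['extra_single_men']:>6,} = {share:.1f}% of population")
```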

Also, shouldn't we take demographics into account? I'll bet a lot of those single women are aging widows. Notice that the retirement mecca of Miami has a rather large female-over-male disparity. (I'm reminded of this old post about where to find single women.)


Sunday, June 08, 2008

Genetics and Life Expectancy

The major problem with using life expectancy to measure the quality of a country’s healthcare system is that life expectancy is affected by so many things other than healthcare: diet, crime, geography, education, and so on. And, of course, genetics. According to this article from New Scientist magazine (subscription required), the Japanese are more likely to have a key longevity-improving gene (which happens to reside in the mitochondrial, rather than nuclear, DNA) than other nationalities.

The most striking example comes from Japan. Here, there is a common variant in mitochondrial DNA, a change in a single DNA "letter". A decade ago Masashi Tanaka, now at the Tokyo Metropolitan Institute of Gerontology, and his colleagues reported that this tiny change almost halved the risk of being hospitalised for any age-related disease at all, while doubling the chance of living to 100. Most Japanese centenarians have the variant, but unfortunately for the rest of us it's very rare outside Japan.
Note that Japan always comes out at or near the top of life expectancy rankings.

Of course, we have long known (or at least suspected) that genetics had something to do with longevity. But for national health statistics, what matters is distribution; if the distribution of longevity-improving genes were independent of national origin, then genetic effects would wash out in international comparisons. As this example demonstrates, that’s not the case.


Thursday, March 27, 2008

Musing about Correlations

In the course I'm currently teaching, the exams usually consist of two sections: multiple choice and short answer. I just finished recording the midterm grades, and for curiosity's sake I calculated the correlation between the two sections. For one class, the correlation between multiple choice and short answer was 0.59; for the other class, it was 0.70.
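For anyone who wants to replicate the calculation, it's just the Pearson correlation between each student's two section totals; here's a minimal sketch with placeholder scores:

```python
# Pearson correlation between multiple-choice and short-answer totals.
# The scores below are placeholders, not the actual class data.

import statistics

mc = [38, 45, 29, 41, 33, 47, 36, 40]   # multiple-choice totals (hypothetical)
sa = [22, 28, 18, 25, 19, 27, 24, 23]   # short-answer totals (hypothetical)

r = statistics.correlation(mc, sa)      # requires Python 3.10+
print(f"correlation between sections: {r:.2f}")
```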

The question is, what correlation should I hope to get if the test has been optimally designed? Two things occur to me:

1. I don't want a correlation of zero, because I expect smarter and better prepared students to well on both parts (and dimmer and less prepared students to do poorly on both parts). A correlation of zero would raise the worry that my test doesn't really measure ability and preparation.

2. I don't want a correlation of one, because that means the sections are effectively redundant. Multiple choice is much easier to grade, so why bother grading the short answers if the multiple choice section tells me everything I need to know? If the two sections measure different things, correlation should be less than perfect.

So the correlation should be in between zero and one. But that's a pretty wide range. How much correlation should I aim for? Is there even a right answer to the question?


Sunday, March 23, 2008

Death of the Diet: Doubting the Data

I wish I had known last Tuesday was “Death of the Diet Day.” I probably would have celebrated it. It sounds like a really fun day.

But why March 18th? That’s supposedly “the day when more commitments will fall by the wayside than on any other day in the calendar” (in the U.K.). The article implies that the commitments in question are New Year’s resolutions. So in other words, DoD Day is defined as the modal day of commitment breaking. Why not report the median or mean day?

It’s easy enough to dismiss the mean. If enough people keep their commitments indefinitely, the mean can’t even be calculated: for those people the breaking date never arrives, so the largest observations are missing from the data.

But what about the median? That would seem to be more useful than the mode; it would tell the day by which exactly 50% of commitments have been broken. So why not tell us that? I have a hypothesis, though I’d need to see the underlying data to be sure. If the distribution of diet-breaking dates is positively skewed, meaning we have a relatively large number of people keeping their commitments a long time (or indefinitely), then the median is probably a good bit later than the mode. That is, the situation probably looks something like this:

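A small simulation makes the shape concrete; the distribution below is invented for illustration, not the actual survey data:

```python
# In a right-skewed distribution of diet-breaking days, the mode (the single
# most common breaking day) comes well before the median (the day by which
# half of all diets have been broken). Entirely made-up data.

import random
import statistics

random.seed(0)
days = [round(random.expovariate(1 / 60)) + 1 for _ in range(10_000)]

print("mode:  ", statistics.mode(days))     # most common breaking day (early)
print("median:", statistics.median(days))   # half of diets broken by this day (later)
```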
So reporting the median would mean waiting longer to announce the failure of our collective willpower, and that would not be consistent with the ongoing campaign to convince Westerners that we’re all too fat and need somebody to come to our rescue.

I’m also skeptical about the very notion of a diet being broken on a specific day. I’ve never had a successful diet that didn’t allow for the occasional indulgence. To construe any one indulgence as “breaking” the diet is to miss the point badly. Having treats every once in a while is part of a good diet. The demise of a diet is not revealed by what happens on any single day, but by a pattern of behavior over time.


Wednesday, February 13, 2008

Correlation versus Causation

I have a friend-of-a-friend who refuses to drink diet soda, on grounds that it makes you fat. The evidence? The fact that so many diet-soda drinkers are fat. Our friend-in-common refers to this guy as "the Caveman." I think the Caveman would appreciate this article, which notes a statistical correlation between drinking diet soda and having metabolic syndrome (a combination of cardiovascular risk factors including abdominal obesity and high blood pressure):

The scientists gathered dietary information on more than 9,500 men and women ages 45 to 64 and tracked their health for nine years. ... [S]urprisingly, the risk of developing metabolic syndrome was 34 percent higher among those who drank one can of diet soda a day compared with those who drank none.
Surprisingly? I'm not surprised at all. Fortunately, a co-author of the study at least hints at the most likely explanation: "Why is it happening? Is it some kind of chemical in the diet soda, or something about the behavior of diet soda drinkers?" I'm betting on the latter.


Thursday, October 25, 2007

An Oldie But a Goodie

This statistical error is very well-known, but it’s worth calling out again. I recently received a mailer from GEICO trying to persuade me to switch car insurance providers. Among other things, the mailer claims:

Even if you’ve received a rate quote from GEICO, try us again, because new customers report an average annual savings over $500. (emphasis added)
Wow, does that mean a random person switching to GEICO can expect to save, on average, about $500? Um, no. Most, if not all, of GEICO’s new customers are people who expected to save money because GEICO gave them a quote better than their current insurer. The people who got a quote worse than their current insurer either didn’t switch or switched to someone besides GEICO. So the sample of new customers is obviously biased in favor of people who would save money and against those who would lose money. It would be very surprising for the average savings of new customers not to be positive.
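A quick simulation shows how strong this selection effect is, even when the new quotes are drawn from exactly the same distribution as people’s current premiums (all numbers invented):

```python
# Everyone gets a quote, but only people who would save actually switch.
# Even with no systematic price advantage, the average saving among new
# customers comes out large and positive. Premium figures are invented.

import random
import statistics

random.seed(1)
current = [random.gauss(1500, 400) for _ in range(100_000)]
quotes  = [random.gauss(1500, 400) for _ in range(100_000)]   # same distribution

savings_all       = [c - q for c, q in zip(current, quotes)]
savings_switchers = [s for s in savings_all if s > 0]         # only winners switch

print(f"average saving, everyone quoted:    ${statistics.mean(savings_all):7.2f}")
print(f"average saving, new customers only: ${statistics.mean(savings_switchers):7.2f}")
```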


Wednesday, October 24, 2007

How to Score

This post will seem frivolous at first. It’s not. Wait for the punchline.

Suppose John Doe cares about two things in a woman: looks and personality. Moreover, he says these characteristics are equally important to him (that is, he places 50% weight on looks, 50% weight on personality).

The World Mating Association (WMA) would like to create a ranking of women according to John’s preferences. So the WMA assembles a group of women and marches them in front of John, and he scores each woman’s looks on a scale from 0 to 10. Then he interacts with each woman (from behind a screen, if you insist) and scores each woman’s personality on a scale from 0 to 10.

The scores John gives for looks range all over the map, from 0 to 10, while the scores he gives for personality are bunched together in the 6 to 8 range. (Maybe John is a relatively tolerant guy when it comes to personality, though he doesn’t think anyone is super-fantastic.)

To calculate composite scores, the WMA decides to rescale the personality scores. It calculates each woman’s personality score as follows: personality = 10 x (raw score – 6) / (8 – 6). In other words, it measures a woman’s score as the percentage of the distance between the lowest-scoring and highest-scoring women. A woman John gave a 7 would be rescaled to a 5, because she’s halfway between 6 and 8. A woman he gave a 6 would now be a 0, and a woman he gave an 8 would now be a 10.

Once the personality scores have been rescaled, the WMA computes each woman’s composite score by computing the weighted average of the scores, using the weights John provided (50% and 50%).

Does this method make sense? Well, let’s see. Take two women, Alma and Betsy. Alma got a 9 on looks and a 6 (now rescaled to 0) on personality. Betsy got a 6 on looks and a 7 (now rescaled to 5) on personality. So Alma’s and Betsy’s composite scores are 4.5 and 5.5 respectively. Betsy wins!
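Here’s that arithmetic in a few lines, in case you want to check it:

```python
# Rescale personality onto 0-10 using the observed 6-to-8 range, then take the
# 50/50 weighted average with looks, as the WMA procedure above specifies.

def rescale_personality(raw, low=6, high=8):
    """Map the observed personality range [low, high] onto 0-10."""
    return 10 * (raw - low) / (high - low)

women = {"Alma":  {"looks": 9, "personality": 6},
         "Betsy": {"looks": 6, "personality": 7}}

for name, s in women.items():
    composite = 0.5 * s["looks"] + 0.5 * rescale_personality(s["personality"])
    print(f"{name}: composite = {composite}")

# Alma: 4.5, Betsy: 5.5; the rescaling, not John's stated weights, decides the winner.
```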

“But wait a minute,” John objects. “I said looks and personality were equally important to me. Alma’s three whole points better looking than Betsy. And while her personality is not quite as nice, it’s not that different. I thought they were both nice enough. All things considered, I’d give Alma a 7.5 and Betsy a 6.5. What I’m trying to say is, I like Alma better!”

The problem, obviously, is the rescaling. John said personality matters just as much as looks to him – but fortunately for him, he likes most women’s personalities. The WMA’s approach exaggerated the significance of personality to John by treating women whose personalities he liked somewhat (6’s) as women he didn’t like at all, and women whose personalities he liked a lot (8’s) as women he thought were flawless.

Okay, so the WMA’s approach is clearly flawed. The punchline is that the method I’ve just described is the method the World Health Organization (WHO) used to make its composite scores of healthcare system performance. These are the scores used to create the widely-cited rankings of nations’ healthcare systems.

As I’ve noted in previous posts, the WHO method calculates performance as an index of five different factors (such as health level, health responsiveness, and financial fairness). Some of these factors arguably shouldn’t be included at all, but set that aside. In order to assign weights to these five factors, they conducted a survey of “health experts” about their relative importance. This is equivalent to asking John the importance he attaches to looks and personality. Assume, for the sake of argument, that the resulting weights are sensible.

Nevertheless, the resulting composite scores are not sensible. Because the five component factors were measured on different scales to begin with, the WHO researchers had no choice but to scale them to make them comparable. When they scaled them, they used the approach described above: they measured a country’s factor score as the percentage of the distance between the lowest-scoring and highest-scoring countries for that factor.

As a result, a factor could have an exaggerated effect on the composite health performance scores if the raw scores for that factor were bunched more tightly than were other factors. If, for instance, financial fairness ranged from 0.5 to 10.0 on the fairness scale, countries with fairness of 0.5 would be treated as having a fairness of 0. Essentially, a country that is somewhat fair would get treated as not fair at all. (This is assuming the raw fairness measure is meaningful to begin with, which I suspect it is not.)

How would fixing this problem affect the WHO rankings? Honestly, I don’t know. In fact, there may be no objective answer to that question. Since the five factors are on different scales, some rescaling is unavoidable. But as soon as you rescale, the meaning of factor weights is questionable at best. What if John had said he cared equally about two things in women – body mass index (BMI) and intelligence (IQ)? What would it even mean to give equal weight to BMI points and IQ points? Any rescaling of BMI and IQ to make them “comparable,” e.g., by using the range or standard deviation, would unavoidably be affected by the relative dispersion of women on these two scales. Unless John could tell us which BMIs were equivalent to which IQs, John’s 50-50 weighting could be swamped by differences in dispersion that John may know nothing about. The same is true of the factor weights used to construct the WHO healthcare rankings.


Tuesday, August 21, 2007

Spinning Good News Into Bad

Via Radley, I find a USA Today article about trends in drunk-driving accidents. As Radley notes, the number of drunk-driving-related deaths fell nationwide from 2005 to 2006, and it also fell in 28 states. Yet the headline reads, “Drunken driving deaths up in 22 states.” Talk about seeing the glass half-empty.

Even more statistical deception lies within the article. While the number of traffic fatalities involving drivers with BAC of 0.08 or above dropped (from 13,582 in 2005 to 13,470 in 2006), the number of fatalities involving drivers with any alcohol at all in their systems rose slightly (from 17,590 in 2005 to 17,602 in 2006). Guess which statistic the Secretary of Transportation decided to focus on?

"The number of people who died on the nation's roads actually fell last year," U.S. Transportation Secretary Mary Peters said at a news conference in this Washington suburb. "However the trend did not extend to alcohol-related crashes."
Naturally, these statistics are being used to promote an expanded campaign against drunk driving, with the slogan, “Drunk Driving. Over the Limit. Under Arrest.” But wait a minute – the number of fatalities involving drivers over the legal limit went down!

Even if we focus on fatalities involving drivers with any alcohol at all in their systems, we should still adjust for population growth. By my calculations, the number of such fatalities per 100,000 people fell slightly, from 5.93 in 2005 to 5.88 in 2006.
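For the record, here’s the per-capita arithmetic (the population figures are my approximations, not from the article):

```python
# Fatalities involving any alcohol in a driver's system, per 100,000 people.
# Population figures are approximate assumptions, not taken from the article.

fatalities = {2005: 17_590, 2006: 17_602}
population = {2005: 296_500_000, 2006: 299_400_000}   # approximate U.S. population

for year in (2005, 2006):
    rate = 100_000 * fatalities[year] / population[year]
    print(f"{year}: {rate:.2f} per 100,000 people")
```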

And, as Radley observes, all statistics of this nature are based on the underlying assumption that alcohol was the cause of every accident in which one of the drivers had alcohol in his system – whether or not that driver was deemed at fault.


Saturday, July 07, 2007

WHO's Healthcare Rankings, Part 3

[NOTE TO NEW READERS: This series of blog posts culminated in a Cato Institute Briefing Paper, which discusses all of my criticisms of the WHO healthcare rankings.]

This is probably my last post on this topic, but no promises. I want to draw attention to the margins of error associated with WHO’s healthcare rankings, which media reports on the rankings typically neglect to mention.

If you look at the study that produced the rankings for “overall health system attainment” (this is the one that ranked France, Canada, and the U.S. 6th, 7th, and 14th, respectively), an 80% confidence interval puts the U.S. rank anywhere from 7th to 24th. France is anywhere from 3rd to 11th, Canada from 4th to 14th. Here is a blown-up section of the study’s Figure 2, which shows the intervals graphically:

The U.S. is not named specifically on the horizontal axis, but it is the country just to the left of Iceland. Notably, its interval is a good bit wider than those around it. Obviously, there is considerable overlap among these intervals, and we cannot say with great confidence that the U.S. doesn't rank better than both France and Canada.

But these intervals result only from errors associated with random sampling in the construction of the statistics. They do not consider differences that could result from different weightings of the factors that compose the attainment index. As I argued in the two previous posts, there is good reason to think the proper weight for three of these factors is, in fact, zero. The authors of the study did not calculate rankings based on that weighting, but they did consider some other possible factor weights. Here is a blown-up section of the relevant graph (the study’s Figure 5):

According to the study, this figure shows that “for only a small number of countries was there any substantive change in rank” as a result of different factor weights. But looking at the section above, one country stands out as having an especially wide interval. That country is the one just to the left of Iceland – once again, the U.S. In other words, the ranking of the U.S. healthcare system is especially sensitive to the choice of weights in the index.

And, it should be noted, the rank resulting from any given factor weighting will itself have a margin of error resulting from random sampling. That means the two different sorts of intervals shown above ought to be considered jointly, resulting in still wider ranking intervals. More reason, I think, to regard the WHO healthcare rankings as unreliable at best.

[UPDATE: See also Part 1 and Part 2.]


Monday, July 02, 2007

WHO's Healthcare Rankings, Part 2

[NOTE TO NEW READERS: This series of blog posts culminated in a Cato Institute Briefing Paper, which discusses all of my criticisms of the WHO healthcare rankings.]

As I noted in the last post, WHO’s ranking of healthcare systems relies on a measure of performance that includes “financial fairness,” which has nothing to do with the quality of healthcare. At best – and even this is highly questionable – it says something about how many people face financial hardship as a result of the healthcare they receive.

But this is not the only problematic factor in the WHO rankings. The rankings result from an index based on five factors, weighted as follows:

1. Health Level: 25%
2. Health Distribution: 25%
3. Responsiveness: 12.5%
4. Responsiveness Distribution: 12.5%
5. Financial Fairness: 25%
Only two of these – health level and responsiveness – are direct indicators of health outcomes. Even these are subject to some objections (such as that health level is affected by things like crime and nutrition), but they’re at least relevant. But neither health distribution nor responsiveness distribution properly belongs in an index of healthcare performance.

Why not? Because inequality (that’s what “distribution” is all about) is distinct from quality of care. You could have a system characterized by both extensive inequality and good care for everyone. Suppose, for instance, that Country A has responsiveness ranging from “good” to “excellent,” while Country B has responsiveness that is uniformly “poor.” Then Country B does better than Country A in terms of responsiveness distribution, despite Country A having better responsiveness than Country B for even the worst-off citizens. The same point applies to the distribution of health level.

To put it another way, suppose that a country currently provides everyone the same quality of healthcare. And then suppose the quality of healthcare improves for half of the population, while remaining the same (not getting any worse) for the other half. This is obviously an improvement – some people get better off, and no one gets worse off. But this change would cause the country to fall in the WHO rankings, other things equal.

[UPDATE: Clarification of the above example. As a result of the change, average health quality would rise, but inequality would rise as well. The former effect would tend to increase the country's WHO ranking, while the latter effect would tend to decrease it. The overall effect is ambiguous, even though common sense says the effect should be unambiguously positive.]
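To make the tension concrete, here is a stylized two-factor index (not the WHO’s actual formula) in which the level and equality effects pull against each other; with equality measured as one minus the coefficient of variation, the composite actually falls even though no one is worse off:

```python
# Stylized index: 50% average health level (normalized to 0-1), 50% "equality"
# measured as 1 minus the coefficient of variation. NOT the WHO's formula;
# just an illustration of how an unambiguous improvement can lower a
# level-plus-distribution composite.

import statistics

def composite(health):
    level = statistics.mean(health) / 100
    cv = statistics.pstdev(health) / statistics.mean(health)
    return 0.5 * level + 0.5 * (1 - cv)

before = [50] * 10                # everyone receives the same quality of care
after  = [50] * 5 + [80] * 5      # half improve, nobody gets worse

print(f"before: {composite(before):.3f}")   # 0.750
print(f"after:  {composite(after):.3f}")    # about 0.710, lower despite the improvement
```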

Now, it’s not silly to consider the quality of care received by the worst-off or poorest citizens. But distribution statistics emphatically don’t do that! They measure relative differences in quality, without regard to the absolute level of quality. A better approach would include in the index a factor for the health quality of the worst-off individuals. Or you could construct a separate health performance index for (say) the bottom 20% of the income distribution. These approaches would surely have problems of their own, but they would at least be focusing on the real concern. WHO’s current approach, sadly, doesn’t even do that much.

[UPDATE: See also Part 1 and Part 3.]


Sunday, July 01, 2007

WHO's Healthcare Rankings, Part 1

[NOTE TO NEW READERS: This series of blog posts culminated in a Cato Institute Briefing Paper, which discusses all of my criticisms of the WHO healthcare rankings.]

As a result of Michael Moore’s SiCKO and the ensuing public debate, you’ve probably heard that the U.S. healthcare system ranks 37th in performance compared to other countries. Meanwhile, France’s and Canada’s systems both rank in the top 10, according to the World Health Organization. Here’s CNN.com’s story on the subject.

I was already skeptical of these numbers when I first heard them, because health performance statistics are affected by many things besides healthcare, such as crime, nutrition, and lifestyle choices. But I had assumed the statistics at least measured actual health outcomes. It took a post from David Masten at Distributed Republic to make me realize I had assumed incorrectly. Masten draws attention to this crucial little sentence (emphasis added):

The rankings are based on general health of the population, access, patient satisfaction and how the care’s paid for.
Including the mode of payment when measuring the system’s performance is, as Masten astutely observes, assuming what you’re trying to prove. After all, the whole question is how healthcare ought to be financed – publicly or privately, with insurance or out-of-pocket payments, etc.

To see the illogic for myself, I downloaded the relevant WHO report and the study it was based on. But before I could verify the factors included in the health performance index, I had to figure out which index to look at. It turns out that the U.S. ranks 37th on the “overall performance index.” But on this index, while it’s true that France is #1, Canada does not rank in the top 10 – it’s only #30. There is another index, “overall health system attainment,” on which the U.S. ranks #15 (while France and Canada are #6 and #7, respectively). As far as I can tell, the two indices are based on the same underlying data, but with the “overall performance index” calibrated according to some measure of how well the country is theoretically capable of doing. I’m still trying to figure out exactly how this calibration works. In any case, it looks an awful lot like someone cherry-picked the results to make the U.S.’s relative performance look worse than it is. Contrary to CNN.com (and possibly Michael Moore – I haven’t seen the movie yet), there is no index that has both Canada and France in the top 10 and the U.S. at 37.

But back to Masten’s point. Both of these indices include “financial fairness” (FF) as a factor with 25% weight in measuring the system’s performance. FF is measured first by finding a household’s contribution to health expenditure as a percentage of household income (beyond subsistence), and then looking at the distribution of this percentage over all households. The wider the distribution, the worse a nation performs on the health performance index (other things equal). But it should not be surprising at all that a larger percentage of poor people’s income will be spent on health than would be spent by the rich. Insofar as healthcare is treated as a necessity, we should expect that people will spend a decreasing fraction (not a decreasing amount, but a decreasing fraction) of their income on healthcare as their income increases. Rich people tend to spend a larger percentage of their income on luxuries than do the poor.

More importantly, the spread of household contributions will obviously narrow when the government shoulders more of the health spending burden. In the extreme, if the government pays for all healthcare, every household will spend the same percentage of its income – zero – on healthcare. In other words, this measure of health outcomes necessarily makes countries that rely on private payment look inferior.

It gets worse. The ostensible reason for including FF in the healthcare performance index is to consider the possibility of people landing in dire financial straits because of their health needs. It’s debatable whether this factor deserves inclusion in a strict measure of actual health performance – but let’s suppose it does. FF does not actually measure exposure to risk of impoverishment. FF is based on cubing (!) the absolute difference, for each household, between that household’s contribution and the average household’s contribution to healthcare. Consequently, FF is negatively affected by households that spend a larger percentage of their income on healthcare than others. But FF is also negatively affected by households that spend a much smaller percentage of their income on healthcare than others! This is senseless, but it’s a natural result of focusing on distribution instead of the effectiveness of the healthcare people receive.
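A stylized version of that penalty (not the WHO’s exact formula) makes the oddity plain:

```python
# Sum the cubed absolute gaps between each household's health-spending share
# and the average share. This is a stylized stand-in for the FF measure, not
# the WHO's exact formula. Note that a household spending far LESS than
# average raises the penalty, just as a household hit with huge bills does.

def penalty(shares):
    mean = sum(shares) / len(shares)
    return sum(abs(s - mean) ** 3 for s in shares)

uniform      = [0.07] * 5                      # everyone spends 7% of income on health
one_big_bill = [0.07, 0.07, 0.07, 0.07, 0.15]  # one household hit with large costs
one_lucky    = [0.07, 0.07, 0.07, 0.07, 0.00]  # one household spends nothing at all

for name, shares in [("uniform", uniform),
                     ("one big bill", one_big_bill),
                     ("one lucky household", one_lucky)]:
    print(f"{name:20s} penalty = {penalty(shares):.6f}")
```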

More later.

[UPDATE: See also Part 2 and Part 3.]


Tuesday, May 22, 2007

D&D Statistics

I put this question on yesterday's exam for my Econ-Stat students. I'm curious to see how many people can figure out the answer without having heard the corresponding lesson in class.

The King of Wakovia has two orders of knights at his service: the Order of the Stone and the Order of the Temple. The Stone Knights suffer a higher death rate when fighting dragons than do the Temple Knights. However, when fighting red dragons, the Temple Knights have the higher death rate; and when fighting green dragons, the Temple Knights also have the higher death rate. There are no other kinds of dragons. What do we call this phenomenon, and how is it possible? Be specific.
And if you're an RPG geek, I expect your answer to be even more specific.


Wednesday, April 04, 2007

On Creating Bogus Data Sets

For a class I teach on economic statistics, I occasionally need to cook up a bogus data set for my students to analyze using regression analysis and other tools. This turns out to be a much more difficult task than you might expect.

The main problem is that almost any data set I create that actually represents a true underlying relationship (e.g., output is some function of labor and capital) comes out with insanely high levels of statistical significance (p-values in the hundred-thousandths or millionths). And no, I’m not making the mistake of failing to include an error term; these high levels of statistical significance persist even when I include random errors with fairly high variances.

The significance levels do decrease with higher variance in the error term, but they’re still ridiculously high. By the time I’ve inserted enough variance to get more believable p-values, the y-values have become untenable (for instance, producing negative values for output). Those with econometrics experience will say this is just a problem of truncated variables – which is true, but the econometric methods needed to deal with truncated variables are far beyond the level of the class.

I’ve tried other methods of solving the problem, such as including additional explanatory variables when creating the y-values and then omitting those variables from the regression. This has been only moderately successful, especially since it’s hard to control how much effect your omitted variables will have on the regression results.

I have found one method, however, that allows me to create realistic-looking data sets and exercise a modicum of control over how large or small the significance levels are. It is this: I create the data set using a true underlying relationship as well as an error term. Then I replace some fraction q of the y-values with random numbers, in the same approximate range as the other y-values, but completely unaffected by the corresponding x-values. By changing q, I can approximately target the levels of significance that will pop out of the regression. I’ve found that a q between 20% and 40% of the observations will usually generate significance levels in the “interesting” range (p-values ranging from 0.001 to 0.30 or so).
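Here’s a minimal sketch of that contamination trick, with an arbitrary true relationship and an arbitrary q; the only point is that raising q reliably pushes the p-value up into the “interesting” range:

```python
# Build y from a true linear relationship plus noise, then overwrite a
# fraction q of the y-values with pure noise drawn from the same range.
# The relationship and parameters below are arbitrary.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, q = 60, 0.3                                   # sample size, contamination fraction

x = rng.uniform(10, 100, size=n)
y = 5 + 0.8 * x + rng.normal(0, 10, size=n)      # true relationship with error

mask = rng.random(n) < q                         # pick roughly q of the observations
y[mask] = rng.uniform(y.min(), y.max(), size=mask.sum())   # replace with pure noise

fit = stats.linregress(x, y)
print(f"slope p-value with q = {q}: {fit.pvalue:.4f}")
```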

As for what we can conclude about the real world and the statistical tools we use to analyze it, make of this what you will.


Monday, March 26, 2007

The How, Who, and Why of Strategic Emp9 Reporting

In a prior post, I described how U.S. News & World Report calculates law schools' "employment at 9 months" figures and observed that its "Emp9" formula encourages semantic sleight of hand. Specifically, a law school can score notably higher in USN&WR's rankings by characterizing a graduate as "unemployed and not seeking work" rather than as "unemployed and studying for the Bar full-time." I closed that post with several questions, which I here start tackling.

Why does that classification strategy benefit law schools?

Working through the details of USN&WR's Emp9 formula, spelled out in my prior post, makes evident why a law school benefits from classifying its graduates as "unemployed and not seeking work" instead of "unemployed and studying for the Bar full-time." Putting students in the former category decreases the denominator in USN&WR's Emp9 formula, thus increasing a school's Emp9 score. Calling a student "unemployed and studying for the Bar full-time" has no like effect.

Recur to the randomly chosen example I used earlier: American University School of Law. In the fall of 2005, it reported to the ABA (and thus presumably to USN&WR) that it had had 16 students "unemployed and not seeking work" and 3 students "unemployed and studying for the Bar full-time" nine months after their 2004 graduation. Those figures, together with others used by USN&WR, gave it a 97.2% Emp9 in the 2007 rankings (which issued in March of 2006). Suppose that American had classified all 19 of those students as "unemployed and not seeking work." In that event, it would have boasted an Emp9 of 98.1%. Conversely, if it had classified all 19 as "unemployed and studying for the Bar full-time," it would have had an Emp9 of only 92.7%.
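To see the mechanism in miniature, here is a deliberately simplified stand-in for the formula (the real Emp9 calculation involves more categories and weights than this); the only feature that matters is that "not seeking work" graduates drop out of the denominator while "studying for the Bar" graduates stay in it:

```python
# Simplified stand-in for the Emp9 calculation: "unemployed and not seeking
# work" is excluded from the denominator, while "unemployed and studying for
# the Bar full-time" counts as unemployed. The real USN&WR formula has more
# categories; the class-size numbers below are hypothetical.

def emp9(employed, studying_for_bar, not_seeking):
    denominator = employed + studying_for_bar   # not_seeking intentionally excluded
    return 100 * employed / denominator

employed = 330                                  # hypothetical graduating class
print(f"{emp9(employed, studying_for_bar=3,  not_seeking=16):.1f}%")   # mixed classification
print(f"{emp9(employed, studying_for_bar=0,  not_seeking=19):.1f}%")   # all 'not seeking': score rises
print(f"{emp9(employed, studying_for_bar=19, not_seeking=0):.1f}%")    # all 'studying': score falls
```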

Which law schools pursue that strategy?

I do not know which law schools, if any, have embraced the characterization strategy I've described. I am not privy to any law school's deliberations on that count—not even my own school's. I venture, however, that if a law school has opportunistically pushed graduates into the "unemployed and not seeking work" category and out of the "unemployed and studying for the Bar full-time" one, it will tend to report a large percentage of graduates in the former category relative to the latter one.

To identify which law schools might have done that, I drew on data in the 2007 ABA-LSAC Official Guide to ABA-Approved Law Schools. I collected it in an Excel file, together with Emp9 data from last year's USN&WR rankings, and offer it to you for the asking; just drop me a line. (Contrary to what it promised last fall, the ABA has still not made its data available in an easily downloaded format.)

I invite you to analyze that data as you see fit. As I said, though, comparing a law school's "unemployed and not seeking work" figure with its "unemployed and studying for the Bar full-time" figure might help us to determine—not "prove"!—which law schools have reported placement data in ways that improved their Emp9 scores. When I subtracted the latter measure from the former, I found that these schools had "strategic reporting indicator" scores above 5%:

Strategic Emp09 Reporting Table

Again, I emphasize that I do not know why the law schools listed above had so many graduates "unemployed and not seeking work" relative to "unemployed and studying for the Bar full-time." I've written to a couple of those schools seeking explanations, but as yet gotten no replies. For now, all I can say is that the law schools listed above, as well as many other schools with notably high though lesser "strategic reporting indicator" scores, reported placement numbers in a pattern consistent with what we would expect from a school that categorized its graduates so as to maximize its USN&WR ranking.

How much do they benefit from it?

Because the Emp9 measure counts for 14% of a law school's score in the rankings, and because most schools' scores cluster in a narrow range, relatively small changes in a school's Emp9 measure can have a large effect on its ranking. Ted Seto documents that phenomenon quite ably in his excellent paper. To give you an idea of how much strategically categorizing graduates can help a law school, allow me to run some numbers through my model of the 2007 rankings.

The 2007 USN&WR rankings, for instance, credited UCLA with a 99.7% Emp9, an overall score of 71, and a rank of 15. Suppose, however, that UCLA had reported not the "unemployed and not seeking work" and "unemployed and studying for the Bar full-time" percentages related above, but rather the average percentages reported by all law schools ranked by USN&WR—2.3% and 3.0%, respectively. In that event, holding all else equal, UCLA would have had a 90.0% Emp9, an overall score of about 68, and a rank of 17—neck-and-neck with its cross-town rival, USC.


I here pause again, leaving for now unaddressed these questions: "Is that [i.e., strategic Emp9 reporting] ethical?" "How did we get into the mess?" and "How do we get out of it?" I'm not very sure, yet, about my answers. (With regard to the last two, at least, Andy P. Morriss' and William D. Henderson's impressive draft paper offers some leads.) Please feel free to comment here, or to email me, if you have your own answers.

[Crossposted to MoneyLaw.]
