
Friday, January 27, 2006

Robert Pape Tests His Theory of Suicide Terrorism Statistically: 9th in a 10-Part Series

Introductory Comments

This, the 9th article in a long-running series on Robert Pape's Dying to Win, continues its recent scrutiny of his efforts to test his theory's causal pathways by use of logistic regression. Aside from a few introductory comments that follow, the article is wholly concerned with analyzing all the bugs, errors, glitches, blunders, and rippling misinterpretations that envelop Pape's statistical work for his stated purpose . . . or, come to that, any purpose under the sun except those hatched and admired by the denizens of the funny farm.

The Necessary Background

Recall that the 4th, 5th, and 6th buggy articles in this series set out the basics of linear regression and of non-linear logistic regression, and the use of logit modeling or analysis that enables a researcher like Pape to estimate the coefficients of his independent variables and monitor the behavior of his dependent variable's outcomes --- whether suicide terrorism occurs or not in each of the 58 cases in his data-set or sample selection --- as a linear regression in log-odds terms.

No need to say anything more about these technical basics. If you find that you're unable to make sense of today's buggy analysis, you'd be well advised to look over those earlier articles again.

Pape's Disastrously Small Sample Size

The 6th and 7th buggy articles also set out the severely flawed nature of Pape's data-set, both substantively and for its itty-bitty sample size . . . too small for the reliable use of maximum likelihood estimation, the normal and most effective way logit modeling estimates the coefficients of the independent variables and the behavior of the outcome or dependent variable.

Summary and a Pointer to Parts Two and Three Today

We'll say a little more today about the huge problems --- problems Pape apparently was unaware of --- caused by using such a small data-set for logit analysis . . . or maybe, come to think of it, problems that he just side-stepped in case one of his 20 expert scholarly chums ever put him wise. From several angles, these problems torpedo any effective logistic regression run on his data-set whether viewed as . . .

  • The minimum size data-set Pape needed for Maximum Likelihood Estimation, MLE, which entails "asymptotic" assumptions --- which means that the samples are large enough to "assume" (not prove) that MLE will produce unbiased coefficients of the estimators (independent variables) as well as an error term that assumes or approximates a normal distribution. Pape's set, as we've seen in earlier buggy articles and will see again today, is simply too small to meet this minimal requirement. And note carefully that asymptotic assumptions are precisely that, assumptions and nothing more: we will return to this critical point in Part One.
  • Or the number of variables he used for reporting any results in a 2x2 classification table for "predictive success,"
  • Or the number of "events" needed on the smaller outcome of his binary, qualitative dependent variable . . . which for Pape is Y = 1, with only 9 events of suicide terrorism occurring not just in his sample of 58 cases, but in the population from which his sample is drawn (one and the same).


As Parts Two and Three will explain carefully, Hosmer and Lemeshow --- the authors of the best book on applied logistic regression, and by far --- insist that a minimum of 10 events on the smaller outcome of the categorical (qualitative) variable is needed for each of the predictors or estimators on the right-side of the logit model . . . which means that Pape couldn't even accurately estimate a null model with only an intercept term. By this measure, Pape's logistic regression plops into fatuity again. In particular, on p. 99, his reported logit model has at least 4 estimators or independent variables, and so he would need a minimum of 40 suicide terrorist events to produce anything close to reliable estimation of the variables' coefficients or parameters. As it is, recall, there are only 9 such suicide terrorist events in his entire population! (Other logistic regression theorists, by the way, require even more events for proper logit modeling by means of maximum likelihood estimation or MLE than do Hosmer and Lemeshow.)
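By way of illustration, here is a minimal sketch in Python of the Hosmer-Lemeshow events-per-variable rule just described. The figures --- 4 predictors, 9 events of Y = 1 among 58 cases --- come from the discussion above; the little function itself is purely illustrative, not anything Pape ran:

```python
# A minimal sketch of the events-per-variable (EPV) rule: at least 10 events
# on the rarer outcome of Y for each predictor in the logit model.

def epv_check(n_events: int, n_predictors: int, min_epv: int = 10) -> bool:
    """Return True if the data meet the minimum events-per-variable rule."""
    required = min_epv * n_predictors
    print(f"events available: {n_events}, events required: {required}, "
          f"EPV = {n_events / n_predictors:.1f}")
    return n_events >= required

# Pape's reported model on p. 99: at least 4 estimators, but only 9
# suicide-terrorism events (Y = 1) in his entire 58-case population.
epv_check(n_events=9, n_predictors=4)   # prints EPV = 2.2 and returns False
```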

And It Gets Worse

As we'll see, all these problems that entangle Pape's logit models are compounded by other howlers --- such as an inaccurate interpretation of his interaction term and a zero-cell defect that any logistic regression researcher should easily have caught and corrected. A zero-cell defect, which shows up innocently in Pape's reported logit model, will "play havoc with the estimation routines." (See J.S. Cramer, Logit Models From Economics and Other Fields, Cambridge University Press, 2003, p. 46.)
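To see what "playing havoc" looks like in practice, consider a minimal sketch on purely hypothetical, simulated data --- not Pape's --- in which one cell of a binary predictor contains no Y = 1 events at all. Maximum likelihood then has no finite estimate for the affected coefficient, and the fitting routine either fails outright or returns an absurdly large coefficient and standard error:

```python
# A minimal sketch of the zero-cell defect on simulated data: no Y = 1
# events occur in the x == 0 cell, so the MLE for that cell's log-odds
# heads off toward minus infinity (quasi-complete separation).

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=60)                        # binary predictor, two cells
y = np.where(x == 1, rng.integers(0, 2, size=60), 0)   # zero cell: no events when x == 0

X = sm.add_constant(x.astype(float))
try:
    res = sm.Logit(y, X).fit(disp=False)
    print(res.params, res.bse)    # coefficient and standard error blow up
except Exception as e:            # recent statsmodels versions flag the separation
    print("estimation failed:", e)
```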

Then, too, in logit modeling, the use of case-study data frequently creates a problem of "endogenous sample selection" --- sometimes called "state-dependent sampling" --- which means that the sample values of an X estimator or independent variable "are not independent of the values taken by Y." Unless corrected, as J.S. Cramer observes on p. 39, such interdependence in the data between X estimators and the Y outcome-variable will "do serious damage to maximum likelihood estimation" --- the universal way that logit modeling estimates the parameters and other effects of logistic regression.

Whether out of innocence, incompetence, or fudging, Pape has sidestepped all these crackling problems of estimation and interpretation that hound his reported logit model on p. 99 of Dying to Win . . . the resulting statistical work a horror-show exemplar, when you get down to it, of everything that's wrong with reflexive, cookbook statistical regression --- software driven and mechanically carried out, with little or no understanding of what efficient logistic regression entails. To compound the ignorance, Pape then serves up some misleading puff-claims that herald his logit model's success.

All these and other technical howlers, almost telephone-book in size, that infest Pape's statistical work are carefully explained in parts one, two, and three of today's buggy article.

All of Which Brings Us To the Greatest Blunder of All: Pape's Data-Set Fantasies

The 7th and 8th buggy articles also delved deeply into the even more serious flaws and howlers that make his data-set totally unreliable in substantive terms . . . like virtually all the major data-sets and charts in his book, about 25 in all, which turn out to be largely make-believe stuff like the Sea-Serpents thought to be genuine by ancient peoples. Part One will touch on these fantasized data-sets later on today.

Recall Some of

. . . the most blatant deficiencies set out in the last buggy article. Specifically,

1. Click here for Pape's table 1 on p. 15 of Dying to Win that is the first installment of a lengthy snow-job that blankets out the towering near-monopoly of radical Islamist groups in suicide terrorism after 1980.


(i) Note prof bug's favorite gem in this bleached-out table, with its Hall-of-Mirrors display and its cover-up stuff: case 18, where Professor Pape confesses that he's unaware of the "religion" of the Iraqi Kabooming rebels. Hmmm, who might they be . . . Slews of Seventh Day Adventists who took a wrong turn on the Hollywood freeway one night in early 2003 and ended up hours later full of road-rage in Baghdad? A gaggle of Gone-Haywire Quakers, tired of their aggression-blocked, turn-the-other-cheek pacifism? Nested swarms of New Age Spiritualists telepathetically transported to Iraq and hearing Vangelis music inform them to kill US soldiers and their allies . . . oh, and of course a few more than 10,000 Iraqi school kids, shoppers, cafe patrons, pedestrians, and people at prayer?

Isn't it touching to see how modestly Professor Pape confesses to his ignorance here . . . this despite having 16 research assistants at his disposal and 20 expert scholarly chums as readers of his ms.? Reminds you of poor Alice at the Mad Hatter tea party, no less puzzled than these 37 researchers taken together, the poor things, when the Mad Hatter set out a brand new riddle:

"Why is a raven like a writing-desk?"
`Have you guessed the riddle yet?' the Hatter said, turning to Alice again.
`No, I give it up,' Alice replied: `What's the answer?'
`I haven't the slightest idea,' said the Hatter.
`Nor I,' said the March Hare.

"Nor us either" replied without dissent, even if much more recent,
A Sly-Boots Professor and his 36 Simple-Soul Assistants,
Each a totally stumped Innocent,
Left in the dark, hopelessly Ignorant.

The Riddle of the Rebels' religion,
An Iraqi puzzle with no solution.




Yes, no doubt about it: this Pape-puzzlement is so touching that prof bug promises in the very next buggy article to help our self-confessed ignoramus-colleague and his 36 adjutant-sidemen work through their distressing failure of enlightenment on this score . . . starting, maybe, with some more heartfelt tips as to the religion of the "Iraqi" rebels such as Abu Musab al-Zarqawi, the head of al Qaeda's Kaboomings in the country who, at last report --- a report obviously found unreliable by Professor Pape, hawkeyed as ever in his sixth-sense soothsayer's way --- was regarded as a Jordanian.

Not to worry though.

When you're briskly busy sweating 5th-dimension non-linear regression 20 hours a day, you don't want to be distracted by trivial details of this sort, do you? Iraqi rebels are all Iraqi, right? And their religion is a real brain-scratching puzzler, just about the biggest scientific dilemma of the 21st century, wouldn't you say? --- beyond even the ability of agile mental giants like the March Hare and the Mad Hatter to solve, not to forget poor perplexed Alice who had tumbled recently down the dark, mile-long rabbit-hole into another world and ended up in their company at a tea-party . . . maybe, who knows? immediately after reading some of Dying to Win in early ms.-form and stumbling around in amazed disbelief. In which case, when you ponder it, she's lucky that she hadn't ended up in a padded-cell for chronic, helplessly incurable brain-blown schizos.

(ii.) Wait though!

One further tip that comes right to mind by way of helping our bewildered colleague is worth mentioning just now: possibly the Kabooming Iraqi rebels are a Meshuga Mob of Manic Malibu Sun-Worshippers, who had been vacationing at a very chic Club Med in the middle of the Iraqi desert (delicious fried lizards for breakfast, yummy sautéed lizard-gizzards with organic turnips and dandelions for dinner, and lots of cheap hash for snacks, ha ha, the kind you don't really eat) when the war with the Saddamite regime began in March 2003; and so --- good and pissed off, and very understandably so, I mean, who could blame them now that their sunshine was occasionally blotted out ever afterwards by swarms of American attack helicopters flying overhead --- they decided in unison as a raucous Rolling Stones song played loudly through their iPods (all up-to-date, the latest model just out with up-to-five-hours more ear-splitting battery life) that they had had enough, You don't piss on sun-tanners, you imperialist nasties, you! and pell-mell, not a dissenting vote, agreed to convert their beloved Sun-Lotion bottles ("Waterproof Coppertone, 45 SPF Rated") into Molotov cocktails that they've been slinging around in Iraqi cities ever since.

Vigilante vengeance at its best, right?

Right, provided we add that their vengeance is altruistic in motive, just as Professor Pape says it is repeatedly in Dying to Win. Their Kaboomings, we have to further infer from Pape-logic, are fervently supported by oppressed sun-tanners world-wide, unable any longer to enjoy lengthy, low-cost sojourns at that formerly chic Iraqi Club-Med in the desert --- the joint now turned into a gay-bar run with zest by al Qaeda terrorists, all briskly keen to demonstrate, for everyone with 20-20 vision, how rollickingly diverse Islamist fundamentalism happens to be around the world. Yes, very very diverse, and exactly as Professor Pape assures us it is between pages 105 and 110 in his no-bullshitting, straight-from-the-horse's-mouth account . . . which, alas, manages once more to crash into fantasy-land fatuity, AKA data-table 13, the whole thing a whitewash job of breath-taking dimensions and unrelieved eye-popping error.

So click here for the rest of this startling, breathtaking analysis of the Maniacal Mobs of Mass-Murdering Malibu Sun-Tanners --- the possible hard-core Kaboomers in Iraq that Professor Pape, his 16 research assistants, and his 20 scholarly pals are seemingly in the dark about (how sad! how sad!) --- tucked away neatly in an Appendix at the very end of today's buggy article. Click now or read later, depending on your preference --- but note, you are duly warned: if it's late at night when you're reading this and eerie noises are sounding in the inky night-shadows outside your residence, you'd do well to wait until the morning . . . the high-pounding tension calculated to keep you awake in a damp sweat until then anyway.



2. For a corrected buggy table that shows how Pape omitted 20 cases of suicide terrorism between 1980 and the start of February 2004, click here.

3. Poor Professor Pape can't even divide properly . . . or check to see if his research assistants could.

On p. 205, there's an extraordinary pie-chart that pretends to show the ideology or religious background of 38 known Hezbollah suicide-terrorists in the 1980s. It shows that 71% of them were Christians, yet the one paragraph that sets out the data, just above it, finds that only 3 of the 38 were Christians. Usually, in earth-bound mathematics, 3/38 equals 7.9%, not 71%, but what the heck, when you're busy adding bleach most of the time to your analysis of Islamic terrorism, you're probably too busy to do 2nd-grade mathematics properly. You are left wondering whether any of the 20 "expert" scholars whom Pape acknowledges at the book's end as readers of his manuscript were sober or even sane when they looked it over.

4. As for an even worse botch-job of data-analysis and the extravagantly misleading claims that Pape postulates on its basis, see this table about al Qaeda suicide bombers and Pape's eye-popping inability to ever check his sources . . . no doubt brought to him by the broom-and-shovel graduate research assistants, all 16 of them acknowledged in Dying to Win as indispensable to his analytical catastrophe. Click here.


Which brings us to



5. The specific data-set that Pape contrives in chapter 6 for running on his logit models.

Based on 58 cases he coded that involve democratic governments militarily occupying either foreign territory or their own regional territories where restive ethnic minorities were active, it derives in part from the earlier error-riddled data-sets that hide the overwhelming dominance of Islamic terrorist groups in the 23 years after 1980 that Pape focused on, and it's just as full of howlers . . . beginning with his decision to look only at democratic occupiers, then followed by the use of crude categories for classifying his coded data. Needless to add, the resulting data-set is as markedly misleading as the other major data-sets in his book. In particular, as the data-sets corrected by prof bug showed, most targeted countries were not democratic military occupiers but Islamic ones, none of which were democratic when attacked by radical Islamist suicide terrorists except for Turkey . . . a point that we'll briefly clarify once more in a moment or two.




No need to say more about Professor Pape's Fairyland Data-Sets at this point, most of which will no doubt have a place-of-honor one day in the Valhalla of Whitewash and Pishposh. We'll place them in storage, labeled clearly buggy article #11 --- the article after the next installment in this series --- while we move directly to Part One and more technical statistical matters that underscore just how rollickingly cruddy Pape's use of logistic regression turns out to be.

PART ONE: HOW TO EVALUATE ANY MULTI-REGRESSION MODEL
. . . AND SPECIFICALLY PAPE'S LOGIT MODEL


Whether you're dealing with linear or non-linear regression --- the latter, the case of Pape's logit models --- there are five major ways to evaluate any researcher's statistical regressions, and what follows are general comments about them, with some specific tag-on observations about Pape's work from time to time. A detailed scrutiny of Pape's numerous statistical blunders will be analyzed in Parts Two and Three . . . which, it seems to prof bug, merit being moved forward to the next buggy article, today's article ending with the technical analysis that is about to begin.



FIVE CRITERIA ARE NEEDED TO VALIDATE THE SUCCESS OF ANY REGRESSION MODEL

The Five in Question Are:

1. The substantive quality of the sample-set used for modeling purposes.

2. The sample-size and whether it's drawn from a natural population or --- at the opposite end of the spectrum --- is created, coded, and classified by the researcher, with some populations in between these two poles.

3. The apparent validity of the fitted regression equation, whether linear or non-linear . . . essentially, what statistical software will produce, and increasingly with ease --- which doesn't amount to much in valid terms per se.

4. The use of "internal" validation techniques on the fitted and statistically tested regression equation: which means drawing a new sample from the same population and running the equation on it. Almost always, as we'll see, there is "shrinkage" when the equation is run again this way: the estimated coefficients and tests of model performance turn out to have been over-optimistic the first time. Variants of this new sampling are split-sampling, cross-validation, double-cross validation, jackknifing, and bootstrapping . . . all terms clarified in a few seconds.

5. The use of "external" validation techniques on the fitted and statistically tested regression equation that has been adjusted by means of internal validation: in effect, this refers to applying that equation to samples drawn from a new population . . . either one of a similar sort at present (say, 30,000 new cardiac patients treated with statin drugs over the last year) or a similar population in the future (30,000 new patients in following years).


Pape's Fiasco on All Five Criteria

We'll clarify these criteria presently. Simply note here how, on each and every one, Pape's reported logit model on p. 99 is gravely deficient --- a matter that we've already seen regarding the first two: the use of totally inaccurate, whitewashed data-sets, including the one on which his logit models are tested, and a sample size so deficient that it can't possibly produce accurate logistic regression results. In Part Two, we'll summarize his deficiencies on these scores once more.

As for "apparent" validity, the 7th buggy article showed how a small sample size of his sort can't generate accurate estimations of the independent variables' parameters or coefficients, and it's doubly inaccurate, as we'll see in Part Three, because there aren't enough instances of suicide terrorism in his small data set of 58 cases --- only 9 --- for producing accurate estimations in logistic regression even of a null model with only an intercept term and no other estimators. And those aren't the only obstacles hindering accurate estimation and statistical tests of his logit models. As we'll see, the results of his logit model set out in a reorganized 2x2 table are seriously misinterpreted by him, besides being mediocre. Worse, there's a zero-cell defect in that table that plays havoc with logit estimations, and to top off the statistical blunders, Pape has, with near certainty, misinterpreted the key interaction term for one of his variables.

Which Brings Us To the Latter Two Kinds of Validation

Take internal validation. Pape hasn't subjected his logit model to a new sample drawn from the same 58 cases --- and it wouldn't help if he had: if the original sample of 58 cases is too small for accurate logistic regression by means of Maximum Likelihood Estimation, the jackknifed or bootstrapped samples derived from it --- samples that would check whether the 58 cases were independent of one another --- would be smaller still.

As for external validation, the only way Pape could check his logit model this way would be for him to apply it to the numerous cases of suicide terrorism that have erupted since the end of 2003. Even then, assuming he saw a professional responsibility to do so during the 16 months or so that followed January 1st, 2004, and ended with his book's publication, he'd have wasted his time once more. There are too many blunders that entangle his reported logit model. It's constructed and run on a phony data-set; it couldn't possibly produce reliable estimations of the model's coefficients because of its size even if the data were accurate; it would be doubly wrong because of the zero-cell defect and the misinterpreted interaction term; it couldn't be tested for model performance with any reliability; and it would be mediocre even if none of these statistical howlers were present because, remember, even a statistical nincompoop could outperform Pape's model by simply predicting that there'd be no suicide terrorism observed for each and every one of Pape's 58 cases --- resulting in a nincompoop "predictive success" of about 85%, which is much better than what Pape's reported model mopped up.
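The nincompoop's arithmetic, for anyone who wants it spelled out (a minimal sketch; the 58 and 9 are Pape's figures as reported above):

```python
# The "nincompoop baseline": predict the modal outcome (Y = 0, no suicide
# terrorism) for every case and compute the resulting classification accuracy.

cases, events = 58, 9
baseline_accuracy = (cases - events) / cases
print(f"all-zeros 'predictive success': {baseline_accuracy:.1%}")   # 84.5%
```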

Which leaves you wondering what adjective the statistical nincompoop would apply to Pape's wondrous statistical fiasco.


THE FIVE CRITERIA CLARIFIED

1. The Substantive Quality of the Sample Set

If the sample set's data are bad, inadequate, or misleading, no amount of technical fixes or remedies --- however elaborate or clever --- will compensate for these deficiencies: just the opposite, they might even conceal them and make proper interpretation and reporting of the final, fitted regression model's outcomes harder than should be the case. Pape's data-set, developed in part from earlier make-believe and wholly erroneous data, and in part from the use of very coarse categories like the measure of ethnic rebellion, couldn't result in accurate logistic regression outcomes even if he were a statistical Einstein . . . rather than, or so it seems, hardly better than a student in a second-quarter course on multi-regression.

2. The Technical Side of the Sample Set

A trio of questions predominate here:

  • Is the sample selection developed from a natural population or a near-natural one, or is it, like Pape's, a self-created, self-coded, self-organized population? There are, needless to add, some populations of data in between, but all statistical researchers should feel obligated to warn their readers clearly, and if need be repeatedly, when the population is not natural or near-natural.


  • Is the sample selected in a probabilistic way, or is it, like Pape's, wholly self-created --- the latter because the sample is equivalent to the population of relevant cases, or so Pape claims, and he himself coded and classified all 58 such cases?


  • And, even if the sample is randomly drawn from a natural population of data, is the sample-size large enough to run a regression model on it --- whether linear or not --- and assume that the coefficients or parameters of the variables are reliable? Any honest researcher would warn his readers if the sample set is a small one --- which is true of Pape's sample of 58 cases (equivalent to the population of cases or observations he created, coded, and organized) --- and would do the exact opposite of Pape: ensure that readers understood how heavily qualified any statistical results drawn from it would have to be.


3. The Apparent Validity of the Final Regression Model:

This criterion covers all the indispensable ground-work in multi-regression, but it's the least reliable of the three validity measures in a key sense of the term: whether the effects or results of any final (or fitted) regression model can be generalized. Without internal and external validation, these results --- estimates of the variables' parameters (or coefficients) and so on --- are strictly conditional on the sample on which the estimations are made. Unfortunately, it's where about 95% of regression modeling comes to a full stop. If anything, the better statistical software has become and the easier it is to use, the more suspect most multi-regression work has become in all the social sciences . . . little different than mechanical cookbook stuff with little or no theoretical interest that can be generalized in valid ways beyond the technical details run on the existing sample set.

Which, come to that, doesn't mean that Pape has even done a good job in basic logistic regression modeling, just the opposite --- another series of bungles and misleading reports . . . or so we'll see in Part Two.

More concretely, apparent validity refers to these technical basics of regression modeling:

(i) The selection and coding of variables in model construction;

(ii.) The estimation of the fitted model's parameter values (regression coefficients) as they influenced the behavior of Y, the dependent variable;

(iii.) The statistical tests applied to overall model performance and individual parameter values for hypothesis testing . . .



These tests refer to two things essentially: 1) testing model performance by the use of overall measures like R-square or ANOVA for linear regression, or a goodness-of-fit test like the Hosmer-Lemeshow statistic for logistic regression, and 2) applying some statistical test --- say, a t-test in linear regression or a Wald statistic in logistic regression --- to determine whether the coefficient parameters can be distinguished from random chance at some minimal level, usually 5.0% or less.

If they can, the null hypothesis that each of the fitted model's coefficients is statistically insignificant and doesn't influence the response or outcome-variable (Y) can be rejected. Statistical power is often inferred on that basis. (As the 5th buggy article in this series noted, the Wald statistic --- though widely used for testing individual variables for their statistical significance --- is generally unreliable compared to the loglikelihood ratio test (-2LL) for two or three reasons.)
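For the curious, here is an illustrative sketch --- on simulated data, with assumed coefficients, nothing drawn from Pape's book --- of the two tests side by side for a single logit coefficient:

```python
# An illustrative sketch comparing the Wald statistic with the likelihood-
# ratio test (-2LL difference) for one coefficient in a fitted logit model.

import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-(0.5 + 0.8 * x)))                 # assumed true model
y = rng.binomial(1, p)

full = sm.Logit(y, sm.add_constant(x)).fit(disp=False)
null = sm.Logit(y, np.ones((n, 1))).fit(disp=False)    # intercept-only model

wald = (full.params[1] / full.bse[1]) ** 2             # Wald chi-square, 1 df
lr = 2 * (full.llf - null.llf)                         # likelihood ratio, 1 df
print(f"Wald p = {stats.chi2.sf(wald, 1):.4f}, LR p = {stats.chi2.sf(lr, 1):.4f}")
```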


(iv.) The interpretation by the researcher of the software's results, including the effort to treat the results as a "predictive model" in frequency terms . . . which is really a matter of properly interpreting the comparison between the fitted model's predicted outcomes on each case or data-point in the sample set and the actual observed outcomes.

The difference between the two, which Pape never mentions, is a measure of the error terms viewed as a matter of frequency, and it's not surprising that the best logistic regression theorists don't even like or use the term "predictive" here: for them it's rather a question of whether the researcher properly classified the original data --- see (i.) --- as organized by logistic or linear regression software in a 2x2 classification table . . . and even these classified results still need to be submitted to the other two criteria of validation.

In any case, as we noted before and will note in Part Two again, even those logistic regression theorists who take predictive models seriously would not regard Pape's logit model as reported on p. 99 as fitting that category, because a predictive model is essentially one where the outcomes are reasonably homogeneous on the Y dependent variable.


(v.) A full and careful report of the statistical work for readers.


What Follows?

To put it bluntly, what follows is that any statistically satisfying results of only apparent validity are of limited value in theoretical work or as guides to practical policymaking . . . at any rate if they stop there.

As we'll see, even the best modeling and performance on statistical tests that go no further are strictly conditional on the sample-selection and its data that were used for performing all three tasks: model-construction, parameter-estimation, and model and coefficient calibration and hypothesis testing. In the health sciences, for instance, a general consensus has emerged among logistic regression theorists that apparent validity almost invariably yields "over-optimistic" results. Unlike Pape's logistic regression, moreover, the two binary outcomes of the qualitative dependent variable --- Y = 1 (say, the incidence of cardiac attacks after medical treatment for people over age 65) and Y = 0 (its incidence for the sampled population over 65 who haven't been treated) --- are there equally important. Pape, self-delusive as ever --- or a con-artist if not --- thinks that Y = 0 in his logit models (or the non-occurrence of suicide terrorism in his self-created sample of 58 coded cases) is of interest to anyone.

For a little fable about a statistical wonder, a humorous off-the-wall professor of sociological statistics at the University of New Orleans who is trying to coax our National Security Adviser into funding a research project for a cool $10 million or so to investigate and test statistically the dangerous menace of "non-occurring suicide-terrorism in the USA," click here. It seems absurd, of course, but no more so than Pape's efforts to coax us into believing that the non-occurrence of suicide-terrorism in his logit model --- 49 of the 58 cases in his rigged data-set --- has theoretical or practical significance . . . and, according to a few emails prof bug has received, the fable is a lot funnier.




ENTER THE TWO OTHER CRITERIA: INTERNAL AND EXTERNAL VALIDATION

What They Involve

To compensate for over-optimistic results that are nearly omnipresent --- especially on reported outcomes by patients themselves (or their doctors) --- regression theorists in the health sciences and educational studies argue that two other validation criteria are more important than the apparent validation that's run on the same data used for model construction, estimation, and performance: internal validation and external validation.

4. Internal Validation, Remember,

. . . refers to the need to run the fitted regression model that satisfies apparent validity on a new sample drawn from the same population. It could be an entirely new sample, or there could be various technical means of deriving a new or better sample by means of split-sampling, cross-validation, jackknifing, or bootstrapping.

These Techniques Briefly Explained:

Split-sampling is the simplest form of internal validation to use: randomly draw two samples from the population of cases or observations, one for constructing the regression model and the other for testing its performance and statistical significance.
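A minimal sketch of the idea, on hypothetical simulated data (the coefficients and sample sizes below are invented for illustration):

```python
# Split-sampling: fit the model on one randomly drawn half of the data,
# then test its classification accuracy on the held-out half.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 3))
y = rng.binomial(1, 1 / (1 + np.exp(-X @ np.array([0.6, -0.4, 0.2]))))

X_fit, X_test, y_fit, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
model = LogisticRegression().fit(X_fit, y_fit)
print(f"accuracy on the held-out half: {model.score(X_test, y_test):.3f}")
```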

Cross-validation, which will be looked at in detail in a few moments, is a sophisticated variant of split-sampling. A randomly drawn sample is split into two parts, one half used for model construction and the other for testing, and then vice versa . . . with an average of the two tests of model performance then regarded as sounder, with less "shrinkage," than otherwise. The technique is then repeated several times.

There are variants of this technique as well. One variant is to leave out all the cases or observations when a new sample is drawn that appeared in previous samples. Another is to repeat this same variant but continue doing so until all the cases or observations in the population have been selected for model construction and testing at least once. For instance, suppose you have a large population of 10,000 cases (Pape's fantasy-land population, which is self-constructed by him, has only 58 cases): a careful researcher would draw, say, a new sample of 500 cases twenty times, with replacement of old cases in each new sample, until all 10,000 cases or observations have been used for testing model performance.
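Sketched in the same hypothetical terms as the split-sampling example above --- simulated data, invented coefficients --- repeated two-fold cross-validation looks like this:

```python
# Repeated two-fold cross-validation: split the sample in half, fit on one
# half and test on the other, swap the halves, average the scores, then
# repeat the whole procedure on fresh random splits.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 3))
y = rng.binomial(1, 1 / (1 + np.exp(-X @ np.array([0.6, -0.4, 0.2]))))

cv = RepeatedKFold(n_splits=2, n_repeats=5, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=cv)
print(f"mean cross-validated accuracy over 10 fits: {scores.mean():.3f}")
```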

Jackknifing is a third form of internal validation. It's really cross-validation of a complex sort, which omits one case at a time and continues sample-selection this way as the population of remaining cases gets ever smaller until a new sample is too tiny to use for regression purposes. As you can see, a large population of 10,000 cases could involve several thousand samples --- something that can nonetheless be handled fairly quickly these days on powerful pc's.
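In code --- again a minimal sketch on simulated data, not a recipe that could rescue Pape's model:

```python
# The jackknife (leave-one-out) idea: refit the model n times, omitting one
# observation each time, and examine the spread of the coefficient estimates.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
y = rng.binomial(1, 1 / (1 + np.exp(-X @ np.array([0.7, -0.5]))))

coefs = []
for i in range(len(y)):
    mask = np.arange(len(y)) != i            # leave observation i out
    m = LogisticRegression().fit(X[mask], y[mask])
    coefs.append(m.coef_[0])
print("jackknife std. dev. of the coefficients:", np.std(coefs, axis=0))
```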

Bootstrapping, which is extensively used in econometrics these days, is a computer-intensive variant of jackknifing and cross-validation. A researcher draws a sample randomly from an underlying population --- say, a sample of 500 from the 10,000 total cases in that population --- and continues drawing new samples of 500 cases hundreds or thousands of times, each time with replacement of the cases already used in the previous sample. The bootstrapped samples are then used for model construction, and the model itself is tested on the original sample or on new, possibly bootstrapped samples.
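A minimal bootstrap sketch, once more on simulated data with invented coefficients:

```python
# The bootstrap: resample the cases with replacement many times, refit the
# model on each resample, and use the spread of the estimates --- here, a
# 95% percentile interval --- as a measure of their stability.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 2))
y = rng.binomial(1, 1 / (1 + np.exp(-X @ np.array([0.7, -0.5]))))

boot_coefs = []
for _ in range(1000):
    idx = rng.integers(0, len(y), size=len(y))    # draw n cases with replacement
    m = LogisticRegression().fit(X[idx], y[idx])
    boot_coefs.append(m.coef_[0])
lo, hi = np.percentile(boot_coefs, [2.5, 97.5], axis=0)
print("95% bootstrap intervals per coefficient:", list(zip(lo, hi)))
```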


In our (hopefully) funny fable about Professor Bernard de Stapler of the University of New Orleans, the manic, half-conman professor runs a rapid bootstrapping of his data-base a good thousand times on his powerful up-to-date notebook in order to achieve an up-to-the-moment point-estimate of the number of non-occurrences of suicide terrorism in the US in the previous 10 minutes of his interview with the National Security Adviser. His point estimate, arrived at by God-alone-knows what kind of regression technique, was 5.216 trillion non-occurrences in that brief period, with a confidence interval of 13 trillion on the upside and 643 (or something like that) on the downside.


5. External Validation, You'll Recall,

. . . refers to testing the "predictive success" of a regression model that is both apparently valid and internally valid on a sample drawn from a new population --- of different patients, say, or more generally from any different but very similar population --- to see if the predicted frequency outcomes hold on this new sample. In Pape's case, this would mean --- assuming that he had a sound data-set to begin with (which he doesn't) and that it was large enough for logistic regression to fit a model to it (which it isn't) --- that his logit model would lack full validity unless it were applied to a large enough number of new suicide terrorist cases after the start of 2004.




THE CAUSES OF OVER-OPTIMISM CLARIFIED

Some References to the Technical Literature

Note that these remarks about internal and external validation draw from several sources, especially the following:

Internal Validation of Predictive Models, written by six statistical specialists in this country and Holland; Jason W. Osborne, "Prediction in Multiple Regression"; Ewout Steyerberg and Frank Harrell, "Validation of Predictive Regression Models"; Frank Harrell Jr., "Regression Modeling and Validation Strategies"; and Marie-Christine Jaulent et al., "Logistic Regression Model: Conditions Required for the Stability of Prediction".

Its Causes

The reasons for this over-optimism derive, at bottom, from a major problem that encumbers all statistical regression: in plain, to-the-point terms, any regression model's estimated coefficients or parameters and its report of the fitted model in terms of "predictive success" are strictly conditional on the sample used and can't be generalized with full accuracy to other samples drawn from the same population, let alone new, fully similar populations.

This, note quickly, is the case despite the invocation by most regression researchers of the central limit theorem and assumptions of an asymptotic nature. Meaning? Meaning three related claims: that

1) If the same fitted and tested regression model were run a large number of times on increasingly large samples (derived in probability manner or simulated in Monte Carlo studies), and you calculated the "average" sample mean of them,

2) Then the error term would be normally distributed, and moreover

3) The final mean and variance of the last, ever-larger sample would eventually be the same as the actual mean and variance of the population from which the sample has been drawn.


In different terms, the mean and variance wouldn't exactly match the values of the population even with these three assumptions taken as sound, and for two reasons.

For one thing, error-term and disturbing conditions in each sample wouldn't be exactly the same as those in the other samples no matter how many times a new sample were drawn from the underlying population, or however large the sample might be . . . provided it's not equal to the underlying population. (If the sample is equivalent to the population, then you are dealing with strictly descriptive statistics with no statistical inference involved . . . a simple statement of fact, never mentioned by Pape, however, when he runs his fantasized data-set on his logit model in chapter 6 of his book.) For another thing, asymptotic assumptions --- which are another name for 1) above, the use of ever larger samples run an ever larger number of times --- make it strictly impossible for the "final sample" mean and variance to ever match exactly the population's mean and variance, quite simply because asymptotic refers to a curve that ever more closely approaches a final value without ever actually intersecting with it.

(The clearest exposition of the asymptotic assumptions that prof bug is aware of can be found in Peter Kennedy, A Guide to Econometrics [MIT, 5th ed.], pp. 19-23 and 33, where Kennedy notes that the "asymptotic expectation" may be equal to the "plim," but not always, and pp. 429-435, where Kennedy delves more deeply into the technical nature of asymptotic properties. In ways that are particularly relevant to Pape's small sample size used for non-linear logistic regression, Kennedy notes on p. 438 the particular importance of asymptotic assumptions when non-linear data are involved . . . "the algebra" needed for dealing with small samples of the latter can become formidably difficult. At one point in his exposition, Kennedy jokes about these assumptions with an anecdote: three statisticians go duck-hunting. The hunter to the left shoots a foot in front of a nearby duck; the hunter to the right shoots a foot behind; the hunter in the middle then exclaims, "Looks like we shot the sucker dead-center, fellows!")

Despite these problems that surround the three related assumptions, they are at the heart of frequentist --- or standard, non-Bayesian --- inferential statistics, which doesn't mean they're necessarily right.

If they were, you see, there wouldn't be the vexing problem of shrinkage or over-optimism that regression specialists have worried increasingly about in the health sciences, where fitted and tested models --- usually but not always of a non-linear logistic regression sort --- don't pan out accurately when applied to new samples of patients. Something else to ponder here, and quickly: even though the problem of shrinkage exists no matter how large the sample size might be, it is much more serious for small sample-sizes of the Pape sort.

Shrinkage and a Hypothetical Buggy Example

For the time being, simply note that this problem of over-optimism is technically known as "shrinkage": a major test of model performance in linear regression like R2 will turn out to be lower when the same model is used on a different sample. (See Jason W. Osborne, http://pareonline.net/getvn.asp?v=7&n=2.)

If it helps, consider the following hypothetical case.

Think of a non-linear logistic model that tests successfully for model performance on a large sample when it comes to explaining and predicting the effects of using cholesterol-lowering drugs on a population of 50,000 patients with cardio-health problems after a year's treatment in 2004. Let's say that the successfully tested model, in apparently valid terms, uses four estimators (or independent variables): the amount of daily exercise, diet, the number of visits to the doctor in the 12-month period, and of course the use of statin or other cholesterol-lowering medicine, plus the intercept term. Enter shrinkage or over-optimism. Specifically, no matter how good that model's performance turns out to be on the various measures of apparent validity for logit modeling, it will not produce the same good results if a new sample is drawn from those 50,000 patients: invariably, its over-optimism --- or shrinkage in statistical jargon --- will be anywhere from 1 or 2% to far higher than that. Worse, if a new population of patients with the same cardio-health problems and similar medical treatment is generated in 2005 and then sampled randomly, the original fitted logit model is likely to turn out to be even less reliable on that new sample.
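The hypothetical case is easy to mimic in simulation. The sketch below uses invented data (eight noisy predictors, not the cardiac study's real variables) purely to make the apparent-versus-new-sample gap visible:

```python
# Shrinkage in miniature: the model's apparent accuracy on the sample that
# built it typically exceeds its accuracy on a fresh sample drawn from the
# same population.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)

def draw_sample(n=150, k=8):
    X = rng.normal(size=(n, k))
    beta = np.array([0.4, 0.4] + [0.0] * (k - 2))    # most predictors are noise
    y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta)))
    return X, y

X1, y1 = draw_sample()
X2, y2 = draw_sample()                  # a second sample, same population
model = LogisticRegression().fit(X1, y1)
print(f"apparent accuracy:   {model.score(X1, y1):.3f}")
print(f"new-sample accuracy: {model.score(X2, y2):.3f}   # typically lower")
```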

  Back to Pape.

If the dangers of over-optimism are found repeatedly to occur in health-care prediction models and those used in educational studies, think how much greater the dangers are when a scholar contrives his own data-set, itself too small for effective logistic regression at all, and makes puffed-up claims about the results that are, in turn, badly misinterpreted . . . something Part Three will deal with in depth. So what should you expect a good statistical researcher to do if something important is at stake in his or her regression-work --- whether in theoretical terms or, in the health areas, in practical matters of medical treatment for physical maladies and mental troubles, or both?


Worse for Pape's Logit Modeling, Small-Size Samples Are Especially Prone to Shrinkage As Well As to Erratic Instability When Run on New Samples

A Revealing Regression Exercise

The best way to show these two drawbacks is by means of a concrete example taken from the work of a statistical theorist, Jason W. Osborne. Drawn from an article of his, "Prediction in Multiple Regression," it uses a table --- reproduced below --- to summarize the wildly erratic nature of parameter-estimation, R-square as a test of linear-regression model-performance, and extreme shrinkage when run on different sample sizes.

The original model --- set out in the initial row of the table below the column-headings --- is run on the entire population of 8th grade pupils, 24,599 in all, whose academic achievement in the 12th grade was predicted by means of linear regression. The original model is then subject to two forms of internal validation: 1) pairs of samples of five different sizes drawn from the population of 24,599 8th grade pupils, and 2) the model derived from each of the five sample sizes then subjected to double cross-validation for shrinkage --- or over-optimism.

As you can see from the table, the original model constructed on the entire population of 8th grade pupils looks like this:

Y'= -1.71+2.08(GPA) -0.73(race) -0.60(part) +0.32(pared)

where four predictors are used --- GPA in the 8th grade, race (white = 0, nonwhite = 1), participation in school-based extracurricular activities (no = 0, yes = 1), and parents' education --- plus the intercept term (-1.71).

R2, the coefficient of determination found in column 3, is the single best statistic for testing overall model performance in linear regression, though there are other statistics used for hypothesis testing. R-square measures the proportion of the variability of Y, the outcome variable, that is explained by the variability of the X independent variables. It runs between 0 and +1, with zero a totally bombed regression equation and +1 a perfect fit, something that only happens in identity terms, as when changes in a Y variable measured in, say, inches are regressed on a sole X variable measured in centimeters.

In the 4th column, you have the cross-validity coefficient, ryy', which correlates the "predicted" outcomes or scores in a regression equation with the "observed" outcomes or scores . . . something dealt with at length in the 7th buggy article where we looked at Pape's logit model's predicted vs. observed outcomes as set out in a 2x2 classification table. Ryy', when squared, is the R-square that results when the original regression model is run on the alternative sample drawn from the same population --- a matter of cross-validation. When ryy'2 is subtracted from the original R2 derived on the first sample in each sample-set, you have a measure of the "shrinkage" between the two estimated outcomes. You can see, if you look briefly at the table, that shrinkage always exists, but it is much greater when the sample set is of small size.
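To make the subtraction concrete, a one-line computation using the 5:1-ratio Sample 1 figures from Osborne's Table 1, reproduced below:

```python
# Shrinkage = R2 on the original sample minus the squared cross-validity
# coefficient from the second sample (Osborne, Table 1, 5:1 ratio, Sample 1).

r2_original, r_yy_squared = 0.62, 0.53
print(f"shrinkage: {r2_original - r_yy_squared:.2f}")   # 0.09
```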

The only other technical matter in the Osborne table is the use of "double" cross-validation, mentioned earlier. As you might recall, it also uses two samples drawn from the same population as simple cross-validation does, but it goes further and constructs and tests regression models or equations in both samples, producing a more rigorous test of how generalizable the regression equations might be.

The Results?

Osborne's own comments can't be matched for pithy relevance:



How Does Sample Size Affect the Shrinkage and Stability of a Prediction Equation?

"As discussed above, there are many different opinions as to the minimum sample size one should use in prediction research. As an illustration of the effects of different subject to predictor ratios on shrinkage and stability of a regression equation, data from the National Education Longitudinal Survey of 1988 (NELS 88, from the National Center for Educational Statistics) were used to construct prediction equations identical to our running example. This data set contains data on 24,599 eighth grade students representing 1052 schools in the United States. Further, the data can be weighted to exactly represent the population, so an accurate population estimate can be obtained for comparison. Two samples, each representing ratios of 5, 15, 40, 100, and 400 subjects per predictor were randomly selected from this sample (randomly selecting from the full sample for each new pair of a different size). Following selection of the samples, prediction equations were calculated, and double cross-validation was performed. The results are presented in Table 1".


Table 1: Comparison of Double Cross-Validation Results with Differing Subject:Predictor Ratios

Ratio       Sample     Obtained Prediction Equation                                 R2    ryy'2   Shrinkage
Population             Y'= -1.71 +2.08(GPA) -0.73(race) -0.60(part) +0.32(pared)    .48
5:1         Sample 1   Y'= -8.47 +1.87(GPA) -0.32(race) +5.71(part) +0.28(pared)    .62   .53     .09
            Sample 2   Y'= -6.92 +3.03(GPA) +0.34(race) +2.49(part) -0.32(pared)    .81   .67     .14
15:1        Sample 1   Y'= -4.46 +2.62(GPA) -0.31(race) +0.30(part) +0.32(pared)    .69   .24     .45
            Sample 2   Y'= -1.99 +1.55(GPA) +0.34(race) +1.04(part) -0.58(pared)    .53   .49     .04
40:1        Sample 1   Y'= -0.49 +2.34(GPA) -0.79(race) -1.51(part) +0.08(pared)    .55   .50     .05
            Sample 2   Y'= -2.05 +2.03(GPA) -0.61(race) -0.37(part) +0.51(pared)    .58   .53     .05
100:1       Sample 1   Y'= -1.89 +2.05(GPA) -0.52(race) -0.17(part) +0.35(pared)    .46   .45     .01
            Sample 2   Y'= -2.04 +1.92(GPA) -0.01(race) +0.32(part) +0.37(pared)    .46   .45     .01
400:1       Sample 1   Y'= -1.26 +1.95(GPA) -0.70(race) -0.41(part) +0.37(pared)    .47   .46     .01
            Sample 2   Y'= -1.10 +1.94(GPA) -0.45(race) -0.56(part) +0.35(pared)    .42   .41     .01


Osborne follows up with two paragraphs that interpret the results in a nifty incisive manner:

"The first observation from the table is that, by comparing regression line equations, the very small samples have wildly fluctuating equations (both intercept and regression coefficients). Even the 40:1 ratio samples have impressive fluctuations in the actual equation. While the fluctuations in the 100:1 sample are fairly small in magnitude, some coefficients reverse direction, or are far off of the population regression line. As expected, it is only in the largest ratios presented, the 100:1 and 400:1 ratios, that the equations stabilize and remain close to the population equation.

"Comparing variance accounted for, variance accounted for is overestimated in the equations with less than a 100:1 ratio.

"Cross-validity coefficients vary a great deal across samples until a 40:1 ratio is reached, where they appear to stabilize. Finally, it appears that shrinkage appears to minimize as a 40:1 ratio is reached. If one takes Pedhazur's suggestion to compare cross-validity coefficients to determine if your equation is stable, from these data one would need a 40:1 ratio or better before that criterion would be reached. If the goal is to get an accurate, stable estimate of the population regression equation (which it should be if that equation is going to be widely used outside the original sample), it appears desirable to have at least 100 subjects per predictor."


From this example, you get a clear idea, hopefully, of how erratically and even wildly the same model can perform when applied to sample sizes of noticeable difference . . . the erratic behavior far less noticeable as the sample size for each predictor increases markedly. And note something very relevant to our own assessment of Pape's extravagantly deficient sample size used for his logit modeling: if Osborne's findings are generalized, Pape --- whose reported logit model has four predictors, which amounts to 14.5 cases (subjects) for each --- can't remotely approximate the minimal 100 cases per predictor needed for proper stability of estimation if it were ever subject to varying sample sizes drawn from the same population.
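The back-of-the-envelope version of that comparison (Pape's figures as discussed above, Osborne's recommended ratio):

```python
# Subject-to-predictor ratio check: Pape's 58 cases and 4 predictors against
# Osborne's recommended 100:1 ratio for a stable, generalizable equation.

cases, predictors, recommended = 58, 4, 100
print(f"Pape's ratio: {cases / predictors:.1f}:1; "
      f"a {recommended}:1 ratio would need {recommended * predictors} cases")
```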



Once again, except for the eagerly awaited Appendix, the length of this buggy exposition suggests that it would be better to end today's installment here, continuing our analysis of all the howlers and screamers in Pape's Alice-in-Wonderland statistical work in the next buggy article.


APPENDIX: THE FOREIGN LEGION OF DEATH-DEALING MALIBU KABOOMERS

This continues prof bug's little effort, begun earlier in today's introductory comments, to help out our perplexed colleague, Professor Pape --- and presumably his 16 research assistants as well, not to forget his 20 expert scholarly pals into the bargain --- all of whom, whether individually or collectively, are very certain that the al Qaeda Foreign Legionnaires busily kabooming in Iraq since early 2003 are all "Iraqi" rebels, even as simultaneously, and of course quite understandably, these 37 research-specialists have been unable to figure out the religion of these "Iraqi" suicide-terrorists.

Who couldn't be sympathetic here with their plight?

The religion of the "Iraqi" suicide terrorists has emerged as one of the biggest scientific puzzlers of the 21st century, and you can't but admire this laudable --- yes, to go further, even exemplary --- piece of self-confessed ignorance in the face of such brain-numbing uncertainty, regarding which prof bug, a tad less modest than his esteemed colleagues, has offered some tips on the possible religious affiliation in question: a clutch of crazed Quakers, or maybe Seventh Day Adventist adventurers full of road rage, or telepathetic New Age Spiritualists hearing voices in Vangelis music telling them to Kaboom by long-distance mental effort, or --- prof bug's championed candidate --- pissed-off Malibu sun-tanners, touchy-tetchy that their sun-bathing rights have been infringed on by the helicopters of the oppressive democratic-country coalition suppressing Iraqi freedoms ever since late March 2003.

(ii-b) And note quickly: the Malibu Mob's Sun-Worshipping dominance of Iraqi Kaboomers would also solve a big puzzle for the rest of us: Why, if Professor Pape's theory is right on this score about the Kaboomers' altruistic motives, have 90% of the victims in Iraq turned out to be Muslim civilians, whether Shia, Sunnis, or Kurds?

The answer? Well, you see . . . unable to discern things properly through their ultra-dark Ralph Lauren sun-shades bought at a costly Italian boutique in downtown Santa Monica (Christian Dior shades no longer the fashion, very very yesterday-stuff), these Maniacal Mobs of Mass-Murdering Malibu Residents have repeatedly, but very very excusably, confused over 10,000 Iraqi school-kids, cafe-patrons, people in prayer at mosques, theater-goers, and passers-by with the nasty American and British Special-Op Forces in the country, the confusion, to be precise, leading them to Kaboom the former targets into cadaverous condition at a rate of about 15:1 compared to the evil occupiers themselves . . . the latter, in Pape-think, the proper intended victims of the "Iraqi rebels" and, it appears, deservedly so.

(iii.) Oh oh! before we move on, another puzzler pushes to the fore and needs buggy illumination.

Namely? Well, as the 2nd column for case 18 in Pape's table 1 shows, he knows for certain that the Kaboomers are all "Iraqis", even as he and his 16 research-assistants and 20 expert chums remain, alas, lamentably ignorant of their religion in column 3. And so logic, if nothing else, demands that prof bug grapple with this contradiction. No loose ends left untied here, right? This series on Pape isn't just straggly pap, is it?

Obviously not. So start by noticing just how sure-footed Professor Pape's claim about the rebels' citizenship is in that 2nd column --- Iraqis one and all, Abu Musab al-Zarqawi and his Foreign Legion of al Qaeda hitmen included.

In that case, we, as diligent readers of his book, are left with only one inference possible here: each and every one of our High-Pep Pissed-Off California Sun-Crazies --- bustlingly busy Kabooming at a breakneck pace around the country the last three years --- must have taken enough time off to attend a secret swearing-in ceremony in a darkened cellar one night where, to a man and woman, along with an occasional University of Chicago guest-of-honor, they were duly awarded Iraqi citizenship by an authorized Iraqi official . . . most likely, if we're allowed a touch of speculation here, a former Baathist Minister of Justice with only 113,333 cadavers on his hands, who might once have been a teaching assistant in the University of Chicago in his youth. Maybe, who knows for sure? --- formerly in its political science department, the proud bearer two decades later of an "A+" grade for interstellar performance in the Intermediate Seminar in Alternative-Universe Statistics, a young professor in charge at the time (no names to be mentioned, without more evidence of course). And speaking of time, the late-night Honorary Iraqi-Citizenship Award itself --- to use a little more Pape-speak here --- a token of universal Iraqi gratitude, nothing more, but also nothing less, for the Choleric California Kaboomer's Heroic Sacrifices as "Honorary Community-Minded Altruists" in the Iraqi People's death-dealing struggle against all demonic foreign oppressors.

Note, now, how everything falls into place, all good things, and a few bad ones, conforming to the uncannily accurate arguments in Dying to Win: The Strategic Logic of Suicide Terrorism.

With each suicide attack, no exceptions whatsoever, calculated and carried out with impeccable Strategic Logic; yes, impeccable, beyond all reproach --- no two ways about it. Computed, if you insist on getting down to cases --- and precision is what Dying to Win is all about, right? --- yes, computed to the nth degree and then double-checked with the fastidious use of advanced polychotomous logistic regression, mainly with m-slope coefficients identical for all options, but, if need be, with independence-of-irrelevant-alternatives (oh! oh! that darn pesky IIA problem, wouldn't you know, Dude?); plus, when necessary, the use too of nested multi-agent game theory that entails, at times, no way around it either --- Yeah, you think there's an alternative here, huh Dude? Then you go and simulate it with Monte-Carlo techniques on your own damn pc, not mine, Creep! --- the tricky application of appropriate parametric restrictions as a necessary limit of suitable approximation in a Super-Game of Milky-Way dimensions, each partition in which has been fully inspired by ancient Aztec Sun-Worshipping Rituals. All these calculations, please note, carefully counted, ciphered, and checked, moreover, on costly laptop computers --- air-shipped all the way from Malibu in the air-conditioned baggage compartment (with, truth be known, concentrated wheatgrass juice laced with 200-proof Vodka in freeze-dried form stuffed inside the CD-drives next to the pretzels and blue-corn tortilla chips), and run on batteries charged with solar-cells exclusively . . . your run-of-the-mill Sun-Tanning Terrorist from Malibu very very respectful of the environment; plus, as a further double-check, more last-second intelligence-reconnaissance calculations out in the field itself --- the chief intelligence-officer, a long-time Malibu resident, the chief technical adviser to Jack Bauer himself in 24 Hours and hence doubly qualified --- as the Made-Pure Malibu-Kaboomers work their way stealthily through the sunny streets of Baghdad and toward downtown except . . . well, except for that occasional mishap in tactical follow-through, that chronic, terribly lamentable failure, repeated a thousand times over by now, to see clearly who the damn victims happened to be through those ultra-dark Ralph Lauren shades.

(iv.) What? What's That? --- you can practically hear Professor Pape or his 16 research-assistants ask, or maybe it's just one of the 20 helping-hand expert scholars (or is prof bug hearing voices once more?): Why don't the dopes take off their Ralph Lauren shades and pick their victims more carefully?

A very logical question, no? Almost worthy of logistic regression with a predictor model, yes . . . Professor Pape's logit models to the rescue, bugles blowing?

And exactly the logic that we'd expect 21 scholars and 16 research assistants to use even if they can't collectively figure out, all 37 of them, their brains whirring and humming madly like a turbine at berserker-speed, what the religion of the Iraqi rebels might be . . . this, you understand, even as they do know that Abu Musab al-Zarqawi and his al Qaeda legionnaires are --- contrary to popular prejudice --- 100% pure-bred Iraqis. Yes, very logical --- this query; prof bug is insistent here, egghead thought at its best. Alas, alas! you cannot expect Pious Malibu Sun-Tanners to blow themselves straight skywards, Kaboom! Kaboom! no matter how altruistic the cause, unless these Self-Sacrificing Purified Sun-Worshippers go to meet their Lord and Master Savior rising daily in the East in full Sacramental Attire, can you?

The moral?

Never piss off a Malibu Sun-Worshipper in his or her Ralph Lauren shades bent on getting at least 16 hours of sun-rays daily, no interruptions tolerated . . . wars or not.

(v.) But whoa, another puzzler suddenly prompts itself here --- a real brain-scratcher this time, always assuming this last bit of bugged-out speculation has struck you as unimpeachably sound. How would Professor Pape know all the Kabooming rebels happen to be 100% Iraqis in Column 2 of his table, the invited al Qaeda guests-of-honor included, if he and his busy-bee assistants and impressively savvy scholarly chums, their brains otherwise crackling with snappy inventive power, couldn't figure out the rebels' religious status by the time he and they reached Column 3 of case 18 in his Alice-in-Wonderland table 1?

Huh? How would that be possible? Or did Professor Pape and his 36 eagle-eyed helpers run out of dough to buy more whitewash?

Ha! Ha! Time to fess up: prof bug has only been kidding here. Really, cross-his-heart and hope-to-die, he assumes that Professor Pape and his 16 research assistants and 20 helping side-men scholars are way too deftly fast in the upstairs-department not to know what the religion of the "Iraqi rebels" happens to be . . . which does leave you wondering, though, at any rate in prof bug's mind, just who Professor Pape thinks he might be kidding in Column 3.


To return to your original spot, click here.