PART ONE: HOW TO EVALUATE ANY MULTI-REGRESSION MODEL
. . . AND SPECIFICALLY PAPE'S LOGIT MODEL
Whether you're dealing with linear or non-linear regression --- the latter being the case of Pape's logit models --- there are five major ways to evaluate any researcher's statistical regressions. What follows are general comments about them, with some specific tag-on observations about Pape's work from time to time. A detailed scrutiny of Pape's numerous statistical blunders will come in Parts Two and Three . . . which, it seems to prof bug, merit being moved forward to the next buggy article, today's article ending with the technical analysis that is about to begin.
FIVE CRITERIA ARE NEEDED TO VALIDATE THE SUCCESS OF ANY REGRESSION MODEL
The Five in Question Are:
1. The substantive quality of the sample-set used for modeling purposes.
2. The sample-size and whether it's drawn from a natural population or --- at the opposite end of the spectrum --- is created, coded, and classified by the researcher, with some populations in between these two poles.
3. The apparent validity of the fitted regression equation, whether linear or non-linear . . . essentially, what statistical software will produce, and increasingly with ease --- which doesn't amount to much in valid terms per se.
4. The use of "internal" validation techniques on the fitted and statistically tested regression equation: which means drawing a new sample from the same population and running the equation on it. Almost always, as we'll see, there is "shrinkage" when the equation is run again this way: the estimated coefficients and tests of model performance turn out to have been over-optimistic the first time. Variants of this new sampling are split-sampling, cross-validation, double-cross validation, jackknifing, and bootstrapping . . . all terms clarified in a few seconds.
5. The use of "external" validation techniques on the fitted and statistically tested regression equation that has been adjusted by means of internal validation: in effect, this refers to applying that equation to samples drawn from a new population . . . either one of a similar sort at present (say, 30,000 new cardiac patients treated with statin drugs over the last year) or a similar population in the future (30,000 new patients in following years).
Pape's Fiasco on All Five Criteria
We'll clarify these criteria presently. Simply note here how, on each and every one, Pape's reported logit model on p. 99 is gravely deficient --- a matter that we've already seen regarding the first two: the use of totally inaccurate, whitewashed data-sets, including the one on which his logit models are tested, and a sample size so deficient that it can't possibly produce accurate logistic regression results. In Part Two, we'll summarize his deficiencies on these scores once more.
As for "apparent" validity, the 7th buggy article showed how a small sample size of his sort can't generate accurate estimates of the independent variables' parameters or coefficients, and it's doubly inaccurate, as we'll see in Part Three, because there aren't enough instances of suicide terrorism in his small data set of 58 cases --- only 9 --- to produce accurate estimates in logistic regression even of a null model with only an intercept variable and no other estimators. And those aren't the only obstacles hindering accurate estimation and statistical testing of his logit models. As we'll see, the results of his logit model, set out in a reorganized 2x2 table, are seriously misinterpreted by him, besides being mediocre. Worse, there's a zero-cell defect in that table that plays havoc with logit estimation, and to top off the statistical blunders, Pape has almost certainly misinterpreted the key interaction term for one of his variables.
Which Brings Us To the Latter Two Kinds of Validation
Take internal validation. Pape hasn't subjected his logit model to a new sample drawn from the same 58 cases --- which wouldn't help if he did: if the original sample of 58 cases is too small for accurate logistic regression by means of Maximum Likelihood Estimation, then the jackknifed or bootstrapped samples that would check whether the 58 cases were independent of one another and lead to new samples would be smaller still.
As for external validation, the only way Pape could check his logit model this way would be to apply it to the numerous cases of suicide terrorism that have erupted since the end of 2003. Even then, assuming he saw a professional responsibility to do so during the 16 months or so that followed January 1st, 2004, and ended with his book's publication, he'd have wasted his time once more. Too many blunders entangle his reported logit model. It's constructed and run on a phony data-set; it couldn't possibly produce reliable estimates of the model's coefficients, given its size, even if the data were accurate; it would be doubly wrong because of the zero-cell defect and the misinterpreted interaction term; it couldn't be tested for model performance with any reliability; and it would be mediocre even if none of these statistical howlers were present because, remember, even a statistical nincompoop could outperform Pape's model by simply predicting that there'd be no suicide terrorism observed for each and every one of Pape's 58 cases --- a nincompoop "predictive success" of about 85%, which is much better than Pape's reported model mopped up.
Which leaves you wondering what adjective the statistical nincompoop would apply to Pape's wondrous statistical fiasco.
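The nincompoop baseline is worth verifying for yourself. Here is a minimal sketch in Python --- the 58-case and 9-event counts come from the discussion above; everything else is illustrative:

```python
# Pape's sample: 58 cases, 9 with suicide terrorism (Y = 1), 49 without (Y = 0).
observed = [1] * 9 + [0] * 49

# The "nincompoop" model: predict Y = 0 for every single case.
predictions = [0] * len(observed)

correct = sum(p == o for p, o in zip(predictions, observed))
accuracy = correct / len(observed)
print(f"Baseline 'predictive success': {accuracy:.1%}")  # about 84.5%
```

Any reported model that can't beat that figure in frequency terms is, predictively speaking, worse than no model at all.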
THE FIVE CRITERIA CLARIFIED
1. The Substantive Quality of the Sample Set
If the sample set's data are bad, inadequate, or misleading, no amount of technical fixes or remedies --- however elaborate or clever --- will compensate for these deficiencies: just the opposite, they might even conceal them and make proper interpretation and reporting of the final, fitted regression model's outcomes harder than should be the case. Pape's data-set, developed in part from earlier make-believe and wholly erroneous data, and in part from the use of very coarse categories like the measure of ethnic rebellion, couldn't result in accurate logistic regression outcomes even if he were a statistical Einstein . . . rather than, or so it seems, hardly better than a student in a second-quarter course on multi-regression.
2. The Technical Side of the Sample Set
A trio of questions predominate here:
- Is the sample drawn from a natural population or a near-natural one, or is it, like Pape's, a self-created, self-coded, self-organized population? There are, needless to add, some populations of data in between, but all statistical researchers should feel obligated to warn their readers clearly, and if need be repeatedly, whenever the population is not natural or near-natural.
- Is the sample selected in a probabilistic way, or is it, like Pape's, wholly self-created? The latter because his sample is equivalent to the population of relevant cases, or so Pape claims, and he himself coded and classified all 58 such cases.
- And, even if the sample is randomly drawn from a natural population of data, is the sample-size large enough to run a regression model on it --- whether linear or not --- and assume that the coefficients or parameters of the variables are reliable? Any honest researcher would warn his readers if the sample set is a small one --- which is true of Pape's sample of 58 cases (equivalent to the population of cases or observations he created, coded, and organized) --- and he would do the exact opposite of Pape: ensure that they understood how qualified any statistical results would have to be.
3. The Apparent Validity of the Final Regression Model:
This criterion covers all the indispensable ground-work in multi-regression, but it's the least reliable of the three validity measures in a key sense of the term: whether the effects or results of any final (or fitted) regression model can be generalized. Without internal and external validation, these results --- estimates of the variables' parameters (or coefficients) and so on --- are strictly conditional on the sample on which the estimations are made. Unfortunately, it's where about 95% of regression modeling comes to a full stop. If anything, the better statistical software has become and the easier it is to use, the more suspect most multi-regression work has become in all the social sciences . . . little different from mechanical cookbook stuff, with little or no theoretical interest that can be generalized in valid ways beyond the technical details run on the existing sample set.
Which, come to that, doesn't mean that Pape has even done a good job in basic logistic regression modeling, just the opposite --- another series of bungles and misleading reports . . . or so we'll see in Part Two.
More concretely, apparent validity refers to these technical basics of regression modeling:
(i.) The selection and coding of variables in model construction;
(ii.) The estimation of the fitted model's parameter values (regression coefficients) as they influence the behavior of Y, the dependent variable;
(iii.) The statistical tests applied to overall model performance and to individual parameter values for hypothesis testing . . .
These tests refer to two things essentially: 1) testing overall model performance by means of measures like R-square or ANOVA for linear regression, or a goodness-of-fit test like the Hosmer-Lemeshow statistic for logistic regression; and 2) applying a statistical test --- say, a t-test in linear regression or a Wald statistic in logistic regression --- to determine whether the estimated coefficients can be distinguished from random chance at some minimal significance level, usually 5.0% or less.
If they can, the null hypothesis that each of the fitted model's coefficients is statistically insignificant and doesn't influence the response or outcome variable (Y) can be rejected. Statistical power is often inferred on that basis. (As the 5th buggy article in this series noted, the Wald statistic --- though widely used for testing individual variables for statistical significance --- is generally unreliable compared to the log-likelihood ratio test (-2LL) for two or three reasons.)
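To make the -2LL idea concrete, here is a hedged back-of-the-envelope sketch. The log-likelihood of an intercept-only (null) logistic model depends only on the event counts --- here the 9 events out of 58 cases discussed above. The numbers illustrate the technique; this is not a re-analysis of Pape's data:

```python
import math

n, events = 58, 9            # case and event counts from the 58-case sample
p0 = events / n              # the null model's single fitted probability

# Log-likelihood of the intercept-only (null) model:
ll_null = events * math.log(p0) + (n - events) * math.log(1 - p0)
print(f"-2LL (null) = {-2 * ll_null:.2f}")

# For any fitted model with log-likelihood ll_full, the likelihood-ratio
# statistic is 2 * (ll_full - ll_null), compared against a chi-square
# distribution with degrees of freedom equal to the number of added predictors.
```

The larger the gap between the fitted model's log-likelihood and this null baseline, the stronger the evidence that the predictors add anything at all.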
(iv.) The interpretation by the researcher of the software's results, including the effort to treat the results as a "predictive model" in frequency terms . . . which is really a matter of properly interpreting the comparison between the fitted model's predicted outcome on each case or data-point in the sample set and the actual observed outcome.
The difference between the two, which Pape never mentions, is a measure of the error terms viewed as a matter of frequency. It's not surprising that the best logistic regression theorists don't even like or use the term "predictive"; for them it's rather a question of whether the researcher properly classified the original data --- see (i.) --- as organized by logistic or linear regression software in a 2x2 classification table . . . without, note, even these classified results having been submitted to the other two criteria of validation.
In any case, as we noted before and will note again in Part Two, even those logistic regression theorists who take predictive models seriously would not regard Pape's logit model as reported on p. 99 as fitting that category, because a predictive model is essentially one where the outcomes are not lopsidedly homogeneous on the Y dependent variable.
(v.) A full and careful report of the researcher's statistical work for readers.
To put it bluntly, what follows is that any statistically satisfying results of only apparent validity are of limited value in theoretical work or as guides to practical policymaking . . . at any rate if they stop there.
As we'll see, even the best modeling and performance on statistical tests that go no further are strictly conditional on the sample and data used for performing all three tasks: model construction, parameter estimation, and model and coefficient calibration and hypothesis testing. In the health sciences, for instance, a general consensus has emerged among logistic regression theorists that apparent validity almost invariably produces "over-optimistic" results. Unlike in Pape's logistic regression, moreover, the two binary outcomes of the qualitative dependent variable --- Y = 1 (say, the incidence of cardiac attacks after medical treatment for people over 65 in age) and Y = 0 (its incidence for the sampled population over 65 who haven't been treated) --- are equally important. Pape, self-delusive as ever --- or a con-artist if not --- thinks that Y = 0 in his logit models (the non-occurrence of suicide terrorism in his self-created sample of 58 coded cases) is of interest to anyone besides himself.
For a little fable about a statistical wonder, a humorous off-the-wall professor of sociological statistics at the University of New Orleans who is trying to coax our National Security Adviser into funding a research project for a cool $10 million or so to investigate and test statistically the dangerous menace of "non-occurring suicide-terrorism in the USA," click here. It seems absurd, of course, but no more so than Pape's efforts to coax us into believing that the non-occurrence of suicide-terrorism in his logit model --- 49 of the 58 cases in his rigged data-set --- has theoretical or practical significance . . . and, according to a few emails prof bug has received, it's a lot funnier.
ENTER THE TWO OTHER CRITERIA: INTERNAL AND EXTERNAL VALIDATION
What They Involve
To compensate for over-optimistic results that are nearly omnipresent --- especially on reported outcomes by patients themselves (or their doctors) --- regression theorists in the health sciences and educational studies argue that two other validation criteria are more important than the apparent validation that's run on the same data used for model construction, estimation, and performance: internal validation and external validation.
4. Internal Validation, Remember,
. . . refers to the need to run the fitted regression model that satisfies apparent validity on a new sample drawn from the same population. It could be an entirely new sample, or the new sample could be derived by various technical means: split-sampling, cross-validation, jackknifing, or bootstrapping.
These Techniques Briefly Explained:
Split-sampling is the simplest form of internal validation to use: randomly draw two samples from the population of cases or observations, one for constructing the regression model and the other for testing its performance and statistical significance.
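A minimal sketch of that random split, in Python with stand-in data (the 1,000-case population is hypothetical):

```python
import random

random.seed(0)
cases = list(range(1000))          # stand-ins for 1,000 observations

# Randomly split: one half for model construction, the other for testing.
shuffled = cases[:]
random.shuffle(shuffled)
construction = shuffled[: len(shuffled) // 2]
testing = shuffled[len(shuffled) // 2:]

assert len(construction) == len(testing) == 500
assert not set(construction) & set(testing)   # no overlap between the halves
```

The model is fitted on `construction` and its performance statistics are computed on `testing`, never the other way around.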
Cross-validation, which will be looked at in detail in a few moments, is a sophisticated variant of split-sampling. A randomly drawn sample is split into two parts, one half used for model construction and the other for testing, and then vice versa . . . with the average of the two tests of model performance then regarded as sounder, with less "shrinkage," than otherwise. The technique is then repeated several times.
There are variants of this technique as well. One is to exclude from each newly drawn sample all the cases or observations that appeared in previous samples. Another is to repeat this same variant until every case or observation in the population has been used for model construction and testing at least once. For instance, suppose you have a large population of 10,000 cases (Pape's fantasy-land population, self-constructed by him, has only 58): a careful researcher might draw a new sample of 500 cases twenty times, excluding previously drawn cases each round, until all 10,000 cases or observations have been used for testing model performance.
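The repeat-until-every-case-is-used variant is essentially what statisticians call k-fold cross-validation, which can be sketched as follows (pure Python; the case indices are stand-ins for real observations, and `k_fold_indices` is our own illustrative helper, not a library function):

```python
import random

def k_fold_indices(n_cases, k, seed=0):
    """Yield (construction, testing) index lists; each case is tested once."""
    idx = list(range(n_cases))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]          # k roughly equal folds
    for i in range(k):
        testing = folds[i]
        construction = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield construction, testing

# Every case lands in a testing fold exactly once:
tested = [c for _, testing in k_fold_indices(100, 5) for c in testing]
assert sorted(tested) == list(range(100))
```

With k folds, the model is constructed k times, each time on all folds but one, and the k test results are averaged.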
Jackknifing is a third form of internal validation. It's really cross-validation of a systematic sort: omit one case at a time, re-running the model on each of the resulting samples of size n - 1 until every case has been left out once. As you can see, a large population of 10,000 cases would involve 10,000 such samples --- something that can nonetheless be handled fairly quickly these days on powerful pc's.
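A minimal jackknife sketch, here applied to the simplest possible statistic --- the mean of a small made-up data set --- where each replicate omits one case:

```python
import math

data = [2.1, 2.5, 1.9, 3.0, 2.7, 2.2, 2.8, 2.4]   # made-up observations
n = len(data)

# One jackknife replicate per case: the mean with that case left out.
replicates = [(sum(data) - x) / (n - 1) for x in data]
rep_mean = sum(replicates) / n

# Standard jackknife standard-error formula.
se = math.sqrt((n - 1) / n * sum((r - rep_mean) ** 2 for r in replicates))
print(f"jackknife SE of the mean: {se:.4f}")
```

For the mean, this reproduces the classical standard error s/sqrt(n) exactly; for more complicated statistics (regression coefficients included), the same leave-one-out machinery gives an honest estimate of sampling variability.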
Bootstrapping, which is extensively used in econometrics these days, is a computer-intensive variant of jackknifing and cross-validation. A researcher draws a sample randomly from an underlying population --- say, a sample of 500 from the 10,000 total cases in that population --- and continues drawing new samples of 500 cases hundreds or thousands of times, each time with replacement of the cases already used in the previous sample. The bootstrapped samples are then used for model construction, and the model itself is tested on the original sample or from new, possibly bootstrapped samples.
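A minimal bootstrap sketch along the same lines, drawing resamples with replacement and reading off a rough percentile confidence interval (the data are synthetic):

```python
import random

random.seed(0)
data = [random.gauss(100, 15) for _ in range(500)]   # hypothetical sample

# Draw 2,000 bootstrap samples, each with replacement, same size as the data.
boot_means = []
for _ in range(2000):
    resample = random.choices(data, k=len(data))
    boot_means.append(sum(resample) / len(resample))

boot_means.sort()
lo, hi = boot_means[49], boot_means[1949]   # rough 95% percentile interval
print(f"bootstrap 95% CI for the mean: ({lo:.1f}, {hi:.1f})")
```

The same resampling loop can wrap a whole regression fit instead of a mean, which is exactly how bootstrapped validation of a fitted model proceeds.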
In our (hopefully) funny fable about Professor Bernard de Stapler of the University of New Orleans, the manic, half-conman professor runs a rapid bootstrap of his data-base a good thousand times on his powerful up-to-date notebook in order to achieve an up-to-the-moment point-estimate of the number of suicide-terrorism non-occurrences in the US in the previous 10 minutes of his interview with the National Security Adviser. His point estimate, arrived at by God-alone-knows what kind of regression technique, was 5.216 trillion non-occurrences in that brief period, with a confidence interval of 13 trillion on the upside and 643 (or something like that) on the downside.
5. External Validation, You'll Recall,
. . . refers to testing the "predictive success" of a regression model that is both apparently valid and internally valid on a sample drawn from a new but similar population, to see if the predicted frequency outcomes hold on this new sample. In Pape's case, this would mean --- assuming that he had a sound data-set to begin with (which he doesn't) and that it was large enough for logistic regression to fit a model to it (which it isn't) --- that his logit model would lack full validity unless it were applied to a large enough number of new suicide-terrorism cases after the start of 2004.
THE CAUSES OF OVER-OPTIMISM CLARIFIED
Some References to the Technical Literature
Note that these remarks about internal and external validation draw from several sources, especially the following: "Internal Validation of Predictive Models," written by six statistical specialists in this country and Holland; Jason W. Osborne, "Prediction in Multiple Regression"; Ewout Steyerberg and Frank Harrell, "Validation of Predictive Regression Models"; Frank Harrell Jr., "Regression Modeling and Validation Strategies"; and Marie-Christine Jaulent et al., "Logistic Regression Model: Conditions Required for the Stability of Prediction."
The reasons for this over-optimism derive, at bottom, from a major problem that encumbers all statistical regression: in plain, to-the-point terms, any regression model's estimated coefficients or parameters, and its report of the fitted model in terms of "predictive success," are strictly conditional on the sample used and can't be generalized with full accuracy to other samples drawn from the same population, let alone to new, fully similar populations.
This, note quickly, is the case despite most regression researchers' invocation of the central limit theorem and asymptotic assumptions. Meaning? Meaning three related claims: that
1) If the same fitted and tested regression model were run a large number of times on increasingly large samples (drawn in a probabilistic manner or simulated in Monte Carlo studies), and you calculated the "average" sample mean of them,
2) Then the error term would be normally distributed, and moreover
3) The mean and variance of the ever larger samples would eventually converge on the actual mean and variance of the population from which the samples have been drawn.
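Claims 1) through 3) are easy to watch in a quick simulation: draw samples of growing size from a deliberately non-normal population (exponential, true mean 1.0) and average the sample means. All the numbers below are illustrative:

```python
import random

random.seed(1)
# Population: exponential with rate 1, so the true mean is exactly 1.0.
results = {}
for n in (10, 100, 1000, 10000):
    # Average the sample means of 200 independent samples of size n.
    means = [
        sum(random.expovariate(1.0) for _ in range(n)) / n
        for _ in range(200)
    ]
    results[n] = sum(means) / len(means)
    print(f"n={n:>6}: average sample mean = {results[n]:.3f}")
# The averages hug the true mean 1.0 ever more tightly as n grows,
# without ever being guaranteed to hit it exactly.
```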
Even with these three assumptions taken as sound, though, the sample mean and variance wouldn't exactly match the population's values, and for two reasons.
For one thing, the error-term and disturbing conditions in each sample wouldn't be exactly the same as those in the other samples, no matter how many times a new sample were drawn from the underlying population or however large the sample might be . . . provided it's not equal to the underlying population. (If the sample is equivalent to the population, then you are dealing with strictly descriptive statistics, with no statistical inference involved . . . a simple statement of fact, never mentioned by Pape, however, when he runs his logit model on his fantasized data-set in chapter 6 of his book.) For another thing, asymptotic assumptions --- another name for 1) above, the use of ever larger samples run an ever larger number of times --- make it strictly impossible for the "final sample" mean and variance to ever match the population's mean and variance exactly, quite simply because "asymptotic" refers to a curve that ever more closely approaches a final value without ever actually intersecting it.
(The clearest exposition of the asymptotic assumptions that prof bug is aware of can be found in Peter Kennedy, A Guide to Econometrics [MIT, 5th ed.], pp. 19-23 and 33, where Kennedy notes that the "asymptotic expectation" may be equal to the "plim," but not always; and pp. 429-435, where Kennedy delves more deeply into the technical nature of asymptotic properties. In ways that are particularly relevant to Pape's small sample size used for non-linear logistic regression, Kennedy notes on p. 438 the particular importance of asymptotic assumptions when non-linear data are involved . . . "the algebra" needed for dealing with small samples of the latter "can become formidably" difficult. At some point in his exposition, Kennedy jokes about these assumptions with an anecdote: three statisticians go duck-hunting. The hunter on the left shoots a foot in front of a nearby duck; the hunter on the right shoots a foot behind; the hunter in the middle then exclaims, "Looks like we shot the sucker dead-center, fellows!")
Despite these problems that surround the three related assumptions, they are at the heart of frequentist --- or standard, non-Bayesian --- inferential statistics, which doesn't mean they're necessarily right.
If they were, you see, there wouldn't be the vexing problem of shrinkage or over-optimism that regression specialists have worried about increasingly in the health sciences, where fitted and tested models --- usually but not always of a non-linear logistic regression sort --- don't pan out accurately when applied to new samples of patients. Something else to ponder here: even though the problem of shrinkage exists no matter how large the sample size might be, it is much more serious for small sample-sizes of the Pape sort.
Shrinkage and a Hypothetical Buggy Example
For the time being, simply note that this problem of over-optimism is technically known as "shrinkage": a major test of model performance in linear regression like R-square will turn out to be lower when the same model is run on a different sample. (See Jason W. Osborne, http://pareonline.net/getvn.asp?v=7&n=2.)
If it helps, consider the following hypothetical case.
Think of a non-linear logistic model that tests successfully for model performance on a large sample when it comes to explaining and predicting the effects of cholesterol-lowering drugs on a population of 50,000 patients with cardio-health problems after a year's treatment in 2004. Let's say that the successfully tested model --- in apparently valid terms --- uses four estimators (or independent variables): the amount of daily exercise, diet, the number of visits to the doctor in the 12-month period, and of course the use of statin or other cholesterol-lowering medicine, plus the intercept term. Enter shrinkage or over-optimism. Specifically, no matter how good that model's performance turns out to be on the various measures of apparent validity for logit modeling, it will not produce the same good results if a new sample is drawn from those 50,000 patients: invariably, its over-optimism --- or shrinkage in statistical jargon --- will be anywhere from 1 or 2% to far higher. Worse, if a new population of patients with the same cardio-health problems and similar medical treatment is generated in 2005 and then sampled randomly, the original fitted logit model is likely to turn out to be even less reliable on that new sample.
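The same shrinkage phenomenon can be simulated in a few dozen lines: fit a one-predictor linear model on a small synthetic sample, then score it on a fresh sample from the same synthetic population. (Linear regression and made-up data are used here purely to keep the sketch short; the logic carries over to logit models.)

```python
import random

random.seed(2)

def draw_sample(n):
    """Synthetic population: y = 2x + Gaussian noise (purely illustrative)."""
    xs = [random.gauss(0, 1) for _ in range(n)]
    ys = [2 * x + random.gauss(0, 2) for x in xs]
    return xs, ys

def fit_ols(xs, ys):
    """Ordinary least squares for a single predictor plus intercept."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope          # intercept, slope

def r_squared(xs, ys, intercept, slope):
    my = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Fit on a small construction sample, score on a fresh one, repeat many times.
gaps = []
for _ in range(200):
    xs, ys = draw_sample(20)               # small sample, Pape-style
    a, b = fit_ols(xs, ys)
    new_xs, new_ys = draw_sample(20)       # fresh sample, same population
    gaps.append(r_squared(xs, ys, a, b) - r_squared(new_xs, new_ys, a, b))

avg_shrinkage = sum(gaps) / len(gaps)
print(f"average R-square shrinkage over 200 trials: {avg_shrinkage:.3f}")
```

On average the apparent R-square beats the fresh-sample R-square: that positive gap is shrinkage, and it grows as the construction sample shrinks.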
Back to Pape.
If the dangers of over-optimism are found repeatedly in health-care prediction models and those used in educational studies, think how much greater the dangers are when a scholar contrives his own data-set, itself too small for effective logistic regression at all, and makes puffed-up claims about the results that are, in turn, badly misinterpreted . . . something Part Three will deal with in depth. So what should you expect a good statistical researcher to do if something important is at stake in his or her regression-work --- whether theoretical matters or, in the health areas, practical matters of medical treatment for physical maladies, mental troubles, or both?
Worse for Pape's Logit Modeling, Small-Size Samples Are Especially Prone to Shrinkage As Well As to Erratic Instability When Run on New Samples
A Revealing Regression Exercise
The best way to show these two drawbacks is by means of a concrete example taken from the work of a statistical theorist, Jason W. Osborne. His article, "Prediction in Multiple Regression," uses a table --- reproduced below --- to summarize the wildly erratic nature of parameter estimation and of R-square as a test of linear-regression model performance, along with extreme shrinkage, when the same model is run on different sample sizes.
The original model --- set out in the initial row of the table below the column-headings --- is run on the entire population of 8th grade pupils, 24,599 in all, whose academic achievement in the 12th grade was predicted by means of linear regression. The original model is then subjected to two forms of internal validation: 1) samples of five different sizes are drawn from the population of 24,599 8th grade pupils, two at each size, and 2) the model derived from each sample is then subjected to double cross-validation to measure shrinkage --- or over-optimism.
As you can see from table, the original model constructed on the entire population of 8th grade pupils looks like this:
Y'= -1.71+2.08(GPA) -0.73(race) -0.60(part) +0.32(pared)
where four predictors plus the intercept term (-1.71) are used: GPA in the 8th grade, race (white = 0, nonwhite = 1), participation in school-based extracurricular activities (no = 0, yes = 1), and parents' education.
R-square, the coefficient of determination found in column 3, is the single best statistic for testing overall model performance in linear regression, though there are other statistics used for hypothesis testing. R-square measures the proportion of the variability of Y, the outcome variable, that is explained by the variability of the X independent variables. It runs between 0 and 1, with 0 a totally bombed regression equation and 1 a perfect regression --- something that only happens in identity terms, as when a Y variable measured in inches is regressed on a sole X variable measured in centimeters.
In the 4th column, you have the cross-validity coefficient, r-yy', which correlates the "predicted" outcomes or scores of a regression equation with the "observed" outcomes or scores . . . something dealt with at length in the 7th buggy article, where we looked at Pape's logit model's predicted vs. observed outcomes as set out in a 2x2 classification table. The cross-validity coefficient, when squared, is the R-square that results when the original regression model is run on the alternative sample drawn from the same population --- a matter of cross-validation. When this squared coefficient is subtracted from the original R-square derived on the first sample in each sample-set, you have a measure of the "shrinkage" between the two estimated outcomes. You can see, if you look briefly at the table, that shrinkage always exists, but it is much greater when the sample set is of small size.
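In arithmetic terms, each row's shrinkage figure is simply one fit statistic minus the other; taking the first 15:1 sample in Osborne's table as an example:

```python
# Values from Osborne's 15:1 ratio, Sample 1 row.
r2_original = 0.69     # R-square on the construction sample
r2_cross = 0.24        # squared cross-validity coefficient on the other sample

shrinkage = r2_original - r2_cross
print(f"shrinkage = {shrinkage:.2f}")   # 0.45, as reported in the table
```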
The only other technical matter in the Osborne table is the use of "double" cross-validation, mentioned earlier. As you might recall, it also uses two samples drawn from the same population as simple cross-validation does, but it goes further and constructs and tests regression models or equations in both samples, producing a more rigorous test of how generalizable the regression equations might be.
Osborne's own comments can't be matched for pithy relevance:
How Does Sample Size Affect the Shrinkage and Stability of a Prediction Equation?
"As discussed above, there are many different opinions as to the minimum sample size one should use in prediction research. As an illustration of the effects of different subject to predictor ratios on shrinkage and stability of a regression equation, data from the National Education Longitudinal Survey of 1988 (NELS 88, from the National Center for Educational Statistics) were used to construct prediction equations identical to our running example. This data set contains data on 24,599 eighth grade students representing 1052 schools in the United States. Further, the data can be weighted to exactly represent the population, so an accurate population estimate can be obtained for comparison. Two samples, each representing ratios of 5, 15, 40, 100, and 400 subjects per predictor were randomly selected from this sample (randomly selecting from the full sample for each new pair of a different size). Following selection of the samples, prediction equations were calculated, and double cross-validation was performed. The results are presented in Table 1".
Table 1: Comparison of Double Cross Validation Results
With Differing Subject:Predictor ratios
| Sample Ratio (subjects:predictors) || Obtained Prediction Equation || R-square || r-yy' squared || Shrinkage |
| Population || Y'= -1.71+2.08(GPA) -0.73(race) -0.60(part) +0.32(pared) || .48 || || |
| 5:1 |
| Sample 1 || Y'= -8.47 +1.87(GPA) -0.32(race) +5.71(part) +0.28(pared) || .62 || .53 || .09 |
| Sample 2 || Y'= -6.92 +3.03(GPA) +0.34(race) +2.49 (part) -0.32(pared) || .81 || .67 || .14 |
| 15:1 |
| Sample 1 || Y'= -4.46 +2.62(GPA) -0.31(race) +0.30(part) +0.32(pared) || .69 || .24 || .45 |
| Sample 2 || Y'= -1.99 +1.55(GPA) +0.34(race) +1.04 (part) -0.58(pared) || .53 || .49 || .04 |
| 40:1 |
| Sample 1 || Y'= -0.49 +2.34(GPA) -0.79(race) -1.51(part) +0.08(pared) || .55 || .50 || .05 |
| Sample 2 || Y'= -2.05 +2.03(GPA) -0.61(race) -0.37(part) +0.51(pared) || .58 || .53 || .05 |
| 100:1 |
| Sample 1 || Y'= -1.89 +2.05(GPA) -0.52(race) -0.17(part) +0.35(pared) || .46 || .45 || .01 |
| Sample 2 || Y'= -2.04 +1.92(GPA) -0.01(race) +0.32(part) +0.37(pared) || .46 || .45 || .01 |
| 400:1 |
| Sample 1 || Y'= -1.26 +1.95(GPA) -0.70(race) -0.41(part) +0.37(pared) || .47 || .46 || .01 |
| Sample 2 || Y'= -1.10 +1.94(GPA) -0.45(race) -0.56(part) +0.35(pared) || .42 || .41 || .01 |
Osborne follows up with two paragraphs that interpret the results in a nifty incisive manner:
"The first observation from the table is that, by comparing regression line equations, the very small samples have wildly fluctuating equations (both intercept and regression coefficients). Even the 40:1 ratio samples have impressive fluctuations in the actual equation. While the fluctuations in the 100:1 sample are fairly small in magnitude, some coefficients reverse direction, or are far off of the population regression line. As expected, it is only in the largest ratios presented, the 100:1 and 400:1 ratios, that the equations stabilize and remain close to the population equation.
"Comparing variance accounted for, variance accounted for is overestimated in the equations with less than a 100:1 ratio.
"Cross-validity coefficients vary a great deal across samples until a 40:1 ratio is reached, where they appear to stabilize. Finally, it appears that shrinkage appears to minimize as a 40:1 ratio is reached. If one takes Pedhazur's suggestion to compare cross-validity coefficients to determine if your equation is stable, from these data one would need a 40:1 ratio or better before that criterion would be reached. If the goal is to get an accurate, stable estimate of the population regression equation (which it should be if that equation is going to be widely used outside the original sample), it appears desirable to have at least 100 subjects per predictor."
From this example you get a clear sense, one hopes, of how erratically, even wildly, the same model can perform when applied to samples of markedly different size . . . the erratic behavior fading only as the number of cases per predictor grows large. And note something very relevant to our own assessment of Pape's extravagantly deficient sample size for his logit modeling: if Osborne's findings generalize, Pape --- whose reported logit model has four predictors, which works out to about 14.5 cases (subjects) per predictor --- can't remotely approach the minimum of 100 cases per predictor needed for stable estimation were his equation ever run on varying samples drawn from the same population.
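For readers who want to see the mechanics at work, here is a minimal sketch of double cross-validation and shrinkage in Python. It uses synthetic data from a made-up four-predictor population model --- not Osborne's NELS 88 data, so every variable name and coefficient below is an illustrative assumption --- but the procedure is the one Osborne describes: fit an equation on each of two samples, apply each equation to the *other* sample, and subtract the cross-validity coefficient from the fitting sample's R².

```python
# Sketch of double cross-validation and "shrinkage" on SYNTHETIC data.
# The data-generating model and all coefficients here are assumptions
# for demonstration only, not Osborne's NELS 88 variables.
import numpy as np

rng = np.random.default_rng(0)

def fit_ols(X, y):
    """Ordinary least squares with an intercept column prepended."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

def r_squared(y, y_hat):
    """Squared correlation between observed and predicted scores."""
    return np.corrcoef(y, y_hat)[0, 1] ** 2

def double_cross_validate(n_per_predictor, n_predictors=4):
    """Draw two samples of size ratio*predictors, fit an equation on
    each, apply it to the other sample, and report shrinkage:
    R^2 in the fitting sample minus r_yy'^2 in the other sample."""
    n = n_per_predictor * n_predictors
    samples = []
    for _ in range(2):
        X = rng.normal(size=(n, n_predictors))
        # "Population" model: only two predictors matter, plus noise.
        y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=1.5, size=n)
        samples.append((X, y))
    out = []
    for (Xa, ya), (Xb, yb) in [(samples[0], samples[1]),
                               (samples[1], samples[0])]:
        beta = fit_ols(Xa, ya)
        r2_fit = r_squared(ya, beta[0] + Xa @ beta[1:])    # R^2, own sample
        r2_cross = r_squared(yb, beta[0] + Xb @ beta[1:])  # r_yy'^2, other
        out.append((r2_fit, r2_cross, r2_fit - r2_cross))
    return out

for ratio in (5, 15, 40, 100, 400):
    for r2_fit, r2_cross, shrink in double_cross_validate(ratio):
        print(f"{ratio:>3}:1  R2={r2_fit:.2f}  cross={r2_cross:.2f}  "
              f"shrink={shrink:+.2f}")
```

Run at the ratios in Table 1 (5:1 through 400:1), the printed shrinkage figures tend to bounce around at the small ratios and settle near zero at the large ones --- the same qualitative pattern Osborne reports, though of course any single random draw can buck it.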
Once again, except for the eagerly awaited Appendix, the length of this buggy exposition suggests that it would be better to end today's installment here, continuing our analysis of all the howlers and screamers in Pape's Alice-in-Wonderland statistical work in the next buggy article.
APPENDIX: THE FOREIGN LEGION OF DEATH-DEALING MALIBU KABOOMERS
This continues prof bug's little effort, begun earlier in today's introductory comments, to help out our perplexed colleague, Professor Pape --- and presumably his 16 research assistants as well, not to forget his 20 expert scholarly pals into the bargain --- all of whom, whether individually or collectively, are very certain that the al Qaeda Foreign Legionnaires busily kabooming in Iraq since early 2003 are all "Iraqi" rebels, even as simultaneously, and of course quite understandably, these 37 research-specialists have been unable to figure out the religion of these "Iraqi" suicide-terrorists.
Who couldn't be sympathetic here with their plight?
The religion of the "Iraqi" suicide terrorists has emerged as one of the biggest scientific puzzlers of the 21st century, and you can't but admire this laudable --- yes, to go further, even exemplary --- piece of self-confessed ignorance in the face of such brain-numbing uncertainty, regarding which prof bug, a tad less modest than his esteemed colleagues, has offered some tips on the possible religious affiliation in question: a clutch of crazed Quakers, or maybe Seventh Day Adventist adventurers full of road rage, or telepathetic New Age Spiritualists hearing voices in Vangelis music telling them to Kaboom by long-distance mental effort, or --- prof bug's championed candidate --- pissed-off Malibu sun-tanners, touchy-tetchy that their sun-bathing rights have been infringed on by the helicopters of the oppressive democratic-country coalition suppressing Iraqi freedoms ever since late March 2003.
(ii-b) And note quickly: the Malibu Mob's Sun-Worshipping dominance of Iraqi Kaboomers would also solve a big puzzle for the rest of us: Why, if Professor Pape's theory is right on this score about the Kaboomers' altruistic motives, have 90% of the victims in Iraq turned out to be Muslim civilians, whether Shia, Sunnis, or Kurds?
The answer? Well, you see . . . unable to discern things properly through their ultra-dark Ralph Lauren sun-shades bought at a costly Italian boutique in downtown Santa Monica (Christian Dior shades no longer the fashion, very very yesterday-stuff), these Maniacal Mobs of Mass-Murdering Malibu Residents have repeatedly, but very very excusably, confused over 10,000 Iraqi school-kids, cafe-patrons, people in prayer at mosques, theater-goers, and passers-by with the nasty American and British Special-Op Forces in the country, the confusion, to be precise, leading them to Kaboom the former targets into cadaverous condition at a rate of about 15:1 compared to the evil occupiers themselves . . . the latter, in Pape-think, the proper intended victims of the "Iraqi rebels" and, it appears, deservedly so.
(iii.) Oh oh! before we move on, another puzzler pushes to the fore and needs buggy illumination.
Namely? Well, as the 2nd column for case 18 in Pape's table 1 shows, he knows for certain that the Kaboomers are all "Iraqis", even as he and his 16 research-assistants and 20 expert chums remain, alas, lamentably ignorant of their religion in column 3. And so logic, if nothing else, demands that prof bug grapple with this contradiction. No loose ends left untied here, right? This series on Pape isn't just straggly pap, is it?
Obviously not. So start by noticing just how sure-footed Professor Pape's claim about the rebels' citizenship is in that 2nd column --- Iraqis one and all, Abu Musab al-Zarqawi and his Foreign Legion of al Qaeda hitmen included.
In that case, we, as diligent readers of his book, are left with only one inference possible here: each and every one of our High-Pep Pissed-Off California Sun-Crazies --- bustlingly busy Kabooming at a breakneck pace around the country the last three years --- must have taken enough time off to attend a secret swearing-in ceremony in a darkened cellar one night where, to a man and woman, along with an occasional University of Chicago guest-of-honor, they were duly awarded Iraqi citizenship by an authorized Iraqi official . . . most likely, if we're allowed a touch of speculation here, a former Baathist Minister of Justice with only 113,333 cadavers on his hands, who might once have been a teaching assistant in the University of Chicago in his youth. Maybe, who knows for sure? --- formerly in its political science department, the proud bearer two decades later of an "A+" grade for interstellar performance in the Intermediate Seminar in Alternative-Universe Statistics, a young professor in charge at the time (no names to be mentioned, without more evidence of course). And speaking of time, the late-night Honorary Iraqi-Citizenship Award itself --- to use a little more Pape-speak here --- a token of universal Iraqi gratitude, nothing more, but also nothing less, for the Choleric California Kaboomer's Heroic Sacrifices as "Honorary Community-Minded Altruists" in the Iraqi People's death-dealing struggle against all demonic foreign oppressors.
Note, now, how everything falls into place, all good things, and a few bad ones, conforming to the uncannily accurate arguments in Dying to Win: The Strategic Logic of Suicide Terrorism.
With each suicide attack, no exceptions whatsoever, calculated and carried out with impeccable Strategic Logic; yes, impeccable, beyond all reproach --- no two ways about it. Computed, if you insist on getting down to cases --- and precision is what Dying to Win is all about, right? --- yes, computed to the nth degree and then double-checked with the fastidious use of advanced polychotomous logistic regression, mainly with m-slope coefficients identical for all options, but, if need be, with independence-of-irrelevant-alternatives (oh! oh! that darn pesky IIA problem, wouldn't you know, Dude?); plus, when necessary, the use too of nested multi-agent game theory that entails, at times, no way around it either --- Yeah, you think there's an alternative here, huh Dude? Then you go and simulate it with Monte-Carlo techniques on your own damn pc, not mine, Creep! --- the tricky application of appropriate parametric restrictions as a necessary limit of suitable approximation in a Super-Game of Milky-Way dimensions, each partition in which has been fully inspired by ancient Aztec Sun-Worshipping Rituals. All these calculations, please note, carefully counted, ciphered, and checked, moreover, on costly laptop computers --- air-shipped all the way from Malibu in the air-conditioned baggage compartment (with, truth be known, concentrated wheatgrass juice laced with 200-proof Vodka in freeze-dried form stuffed inside the CD-drives next to the pretzels and blue-corn tortilla chips), and run on batteries charged with solar-cells exclusively . . . 
your run-of-the-mill Sun-Tanning Terrorist from Malibu very very respectful of the environment; plus, as further double-check, more last-second intelligence-reconnaissance calculations out in the field itself --- the chief intelligence-officer, a long-time Malibu resident, the chief technical adviser to Jack Bauer himself in 24 Hours and hence doubly qualified --- as the Made-Pure Malibu-Kaboomers work their way stealthily through the sunny streets of Baghdad and toward downtown except . . . well, except for that occasional mishap in tactical follow-through, that chronic, terribly lamentable failure, repeated a thousand times over by now, to see clearly who the damn victims happened to be through those ultra-dark Ralph Lauren shades.
(iv.) What? What's That? --- you can practically hear Professor Pape or his 16 research-assistants ask, or maybe it's just one of the 20 helping-hand expert scholars (or is prof bug hearing voices once more?): Why don't the dopes take off their Ralph Lauren shades and pick their victims more carefully?
A very logical question, no? Almost worthy of logistic regression with a predictor model, yes . . . Professor Pape's logit models to the rescue, bugles blowing?
And exactly the logic that we'd expect 21 scholars and 16 research assistants to use, even if they can't collectively figure out, all 37 of them, their brains whirring and humming madly like a turbine at berserker-speed, what the religion of the Iraqi rebels might be . . . this, you understand, even as they do know that Abu Musab al-Zarqawi and his al Qaeda legionnaires are --- contrary to popular prejudice --- 100% pure-bred Iraqis. Yes, very logical --- this query; prof bug is insistent here, egghead thought at its best. Alas, Alas! you cannot expect Pious Malibu Sun-Tanners to blow themselves straight skywards, Kaboom! Kaboom! no matter how altruistic the cause, unless these Self-Sacrificing Purified Sun-Worshippers go to meet their Lord and Master Savior rising daily in the East in full Sacramental Attire, can you?
Never piss off a Malibu Sun-Worshipper in his or her Ralph Lauren shades bent on getting at least 16 hours of sun-rays daily, no interruptions tolerated . . . wars or not.
(v.) But whoa, another puzzler suddenly prompts itself here --- a real brain-scratcher this time, always assuming this last bit of bugged-out speculation has struck you as unimpeachably sound. How would Professor Pape know all the Kabooming rebels happen to be 100% Iraqis in Column 2 of his table, the invited al Qaeda guests-of-honor included, if he and his busy-bee assistants and impressively savvy scholarly chums, their brains otherwise crackling with snappy inventive power, couldn't figure out the rebels' religious status by the time he and they reached Column 3 of case 18 in his Alice-in-Wonderland table 1?
Huh? How would that be possible? Or did Professor Pape and his 36 eagle-eyed helpers run out of dough to buy more whitewash?
Ha! Ha! time to fess up, prof bug has only been kidding here. Really, cross-his-heart and hope-to-die, he assumes that Professor Pape and his 16 research assistants and 20 helping side-men scholars are way too deftly fast in the upstairs-department not to know what the religion of the "Iraqi rebels" happens to be . . . which does leave you wondering, though, at any rate in prof bug's mind, just who Professor Pape thinks he might be kidding in Column 3.