Independent Education Review
A peer-reviewed electronic journal.   ISSN 1557-2870

Phelps' critique of Cannell paper

Review of second version of "Lake Woebegone," Twenty Years Later:

I recommend that this article be rejected for publication in the TEG Review.  There are simply too many false and unsupported assertions.

I would be amenable to publishing the original submission as a commentary, provided that one of us checks the claim Dr. Cannell makes about the California tests, and that the claim is either supported with a citation or deleted if it cannot be corroborated.  Alternatively, a longer essay that focuses solely on Dr. Cannell's recollections, and eschews commentary on current testing programs, would be fine.

I also think TEG owes Dr. Cannell an apology.  TEG's Web site encourages people to send in essays, and that's what Dr. Cannell did at first.  Another reviewer, perhaps unaware of that policy, seems to have encouraged Dr. Cannell to turn it into a research piece, with unfortunate results.  The number of questions Dr. Cannell takes on in the second version of his paper would probably require a good year of research work to answer.  Finding some citations to the effect that so-and-so says this or that can be done in a week or two, but with what result?  One can find citations to support any allegation.

Below are just some of the comments and criticisms I have about the second paper Dr. Cannell submitted.

Nagging questions:

How do we know that West Virginia used the exact same test questions for eight years in a row?  [paragraph 8]  Did the author consult technical reports and, if so, which ones?  Did the author talk to someone and, if so, who and when?

How do we know that the federal government forced WV to stop excluding lower functioning students and, if it did, on which tests?  The federal government does not have jurisdiction over most WV testing.

The "test preparation" materials cited in paragraph 29: did they contain actual, identical operational test items, or just "study guide" items?  All tests should have study guides with sample items.  No student should be surprised by test structure and item format on the day of the exam.  Seeing the exact same items in advance, however, is not acceptable.

As I understand it, and Dr. Cannell provides no evidence to the contrary, the Texas Education Agency did all that Dr. Cannell would recommend that they should have done, with item rotation and tight security.  But, Dr. Cannell accuses them of cheating.   I would argue that what Dr. Cannell currently cites as evidence for cheating is no evidence at all and, if I am correct, thousands of people are being casually slandered.  If he has good evidence that security was lax, or there was too little item rotation, he should cite it.

Some assertions are factually incorrect, for example:

The rest of the world does not use broad tests of achievement (except for, at best, a few percent of the total, and not then for high-stakes).  The rest of the world uses standards-based, criterion-referenced, narrow, specific tests--mostly end-of-level and end-of-course tests that are a 100% match to a jurisdiction-wide, uniform curriculum.  "Broad tests of achievement" are almost uniquely a North American development, derived from IQ and aptitude tests, and only imperfectly converted into achievement tests.  Our nationally norm-referenced, or "broad", tests of achievement are aligned with no one's standards, no one's curriculum, are blatantly unfair to use in high-stakes situations and, moreover, are illegal to use in high-stakes situations.

The fact that trends in one test's scores do not parallel trends in another test's scores is not evidence of cheating.  One is comparing apples and oranges--and two moving targets.  At the very least, one should not even be thinking of making such a comparison unless one has first done a curriculum match study between the two tests.  I would guess that the Texas TAAS and the state NAEP of the 1990s were about a 50% match or less.  Saying that one can use trends in one test as evidence of cheating in another is about like saying that students who are training in, and improving their test scores in, oncology should be, measure for measure, improving their scores on podiatry tests, even though they aren't studying podiatry. 

The author claims that "things have changed little since the 1980s," but he also admits that he hasn't been paying attention and, indeed, he hasn't.  There is simply no comparison between, say, the California or Massachusetts testing programs of today and twenty years ago.  Twenty years ago these states, essentially, did nothing, or pretty close to it.  Look at the hundreds of pages in their Web sites today, or the several dozen technical reports available, or the thousands of pages of data available, or the news coverage (mostly because now there are stakes to their tests).  [paragraph 8]

CTB was neither the largest nor most successful of the test publishers twenty years ago nor is it today.  [paragraph 7]  Throwaway facts like this are dangerous.  It is a piece of information that does not help the story Dr. Cannell is trying to tell.  Essentially, it is irrelevant information.  But, some readers will know that it is not true, and they will then be less likely to believe other claims in the article.

Likewise, few psychometricians would consider Educational Measurement: Issues and Practice to be “the premier journal of educational psychometrics.”  Probably, they would think of Psychometrika, Ed. & Psych Measurement, or the Journal of Ed. Measurement.  Adding this assertion makes it seem like Dr. Cannell is willing to make up facts in order to promote his point of view.  This tendency diminishes the credibility of the entire article.

Moreover, Dr. Cannell cites the Fall 1990 EM:IP issue as supportive of his work.  To the contrary, that issue, written by CRESST researchers, while agreeing that there was a Lake Wobegon “effect,” also asserted that Cannell was entirely wrong in his characterization of its causes and the degree of the effect.

The Rand report cited as evidence that the Texas TAAS was bad is convoluted and contrived.  I recommend reading Nicholas Stix’s “October Surprise” series from late 2000.  One should be able to find it on the Web.

Misunderstandings:

The fact that WV's test shows a higher percentage of "proficient" students than the NAEP does is not proof of cheating.  It implies that WV set its cut score for "proficient" relatively lower than did the NAEP but, again, these are apples and oranges.  WV teachers are required to teach the WV standards, not the NAEP standards.
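
To illustrate only the cut-score part of this point, here is a minimal sketch in Python.  The scores and both cut scores are made-up numbers, not data from WV or the NAEP, and the function name is hypothetical.  It shows how the same group of students can yield very different "percent proficient" figures depending solely on where the cut score is set.  It does not address the separate problem that the two tests measure different content.

```python
def percent_proficient(scores, cut_score):
    """Return the percentage of scores at or above the cut score."""
    at_or_above = sum(1 for s in scores if s >= cut_score)
    return 100.0 * at_or_above / len(scores)


# Hypothetical scale scores for ten students (made-up numbers).
scores = [180, 195, 205, 210, 220, 230, 240, 250, 265, 280]

lower_cut = 210   # hypothetical, more lenient "proficient" standard
higher_cut = 245  # hypothetical, more demanding "proficient" standard

# Same students, same scores; only the cut score differs.
print(percent_proficient(scores, lower_cut))
print(percent_proficient(scores, higher_cut))
```

With these made-up numbers, seven of the ten students clear the lower cut and three clear the higher one, with no change in what any student knows.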

The ACT and SAT are not the highest stakes tests.  Indeed, they may more accurately be categorized as medium stakes tests.  One can do poorly on either test, and one will still get into college somewhere.  By contrast, a couple dozen states, and most other countries, require passage of a test in order to graduate.  That's high stakes.

There are three types of tests used in education:  achievement, aptitude, and monitoring tests.  Achievement tests are designed to measure how much you have learned about something in particular.  Aptitude tests are designed to predict how well you might do in the future.  Monitoring tests are designed to get a snapshot of an entire system's performance.  The NAEP is a monitoring test and is based on no one's curriculum.  The ACT and SAT are "aptitude" tests whose sole reason for being is to predict future performance.  They do this by measuring as wide an array of knowledge and skill as they can, with no attention paid to what the curriculum may be anywhere.  The theory is that those who have the widest base of knowledge can most easily build new knowledge, say, in college.  It is not really valid to compare the ACT and SAT to state achievement tests--they are different tests, designed for different purposes.  By law, high stakes achievement tests (e.g., high school graduation tests) must be based on the standards and curriculum to which the students have been exposed.  On these high stakes tests, often the scores trend up over time, as they should.  Students and teachers are motivated to work harder.  Teachers learn to adapt their lessons over time to be more successful.  Schools learn how to align their instruction over time to better match the standards on which the test is based.  Meanwhile, scores on unrelated tests, not based on the same standards and curriculum, may not rise in lockstep.  And, why should they?

Many readers are going to be confused by the phrase "truly standardized test."  It looks as if the ACT, SAT, and NAEP are given this label.  But, they are no more or less standardized than the other tests discussed in the paper.  Moreover, they are no "truer"--in the sense that they are administered with tight security and item rotation--than most current high-stakes, standards-based tests.  Most low/no-stakes tests, however, are not administered with tight security and item rotation.

The last line suggests that we join the rest of the world with standardized national tests.  But, the U.S. Constitution prevents us from doing this.  Besides, it is not true that all of the rest of the world uses standardized national tests.  Switzerland does not.  Germany does not.  Canada does not.  Belgium does not.  Other countries with federal education systems like ours do not.  Only some of those with national education systems do.  Like it or not, the U.S. Constitution says nothing about education.  Therefore, the states are responsible for education.  We can have national tests in the U.S., but they cannot be standards-based, criterion-referenced tests.  They can, however, be Lake Wobegon tests.

Like it or not, one can have West Virginia reading and West Virginia math, but there is no American reading nor American math.  States set standards, the U.S. does not.  A state can have a uniform curriculum, the U.S. cannot.  And, like it or not, state standards and state curricula vary substantially.  A few months ago, I spent some time writing mathematics test items.  I gathered all the textbooks available to me and noticed (1) they teach a lot in the schools now that they did not teach when I was a kid (e.g., proofs in elementary school, discrete math (e.g., networking, graph theory), exploratory data analysis in elementary school, stats and probability in middle school) and (2) no two textbooks are alike.  If one added up all the content in all the, say, 4th-grade textbooks, one would end up with three years' worth of math instruction.  No single school can teach all of it.  Topics, and the sequencing of topics, vary from state to state and district to district.  To expect students who have studied exploratory data analysis and graph theory to do just as well on a different test that covers other topics (to which they have not been exposed) as they might on a test that covers those two topics is unreasonable.

The author recommends, at the end, using "broad tests of achievement", but that's what the Lake Wobegon tests were.  The other, non-Lake Wobegon tests that Dr. Cannell encountered in the 1980s--the ones that were administered with high levels of security and item rotation--typically were standards-based and/or criterion-referenced tests (i.e., "narrow" and specific tests of achievement).

Dr. Cannell should realize that some items on a test must be the same from year to year; it is necessary for equating.