kitchen table math, the sequel: 2005 math scores in NY state

## Monday, September 10, 2007

### 2005 math scores in NY state

re: Brett's post about the easier 2005 test

As Brett mentioned, the reading test wasn't the only problem. The math test was easier, too.

The News obtained technical details on high-stakes math tests given to fourth-graders across the state over the past six years and found that in every year when scores went up, testmakers had identified the questions as easier during pretest trials.

In years when scores were lower, pretest trials showed the questions were harder.

"That's pretty strong evidence that something is just not right with the test," said New York University Prof. Robert Tobias, who ran the Board of Education's testing department for 13 years.

"If this were a single year's data or two years' data, I would say it would be inappropriate to make conclusions," Tobias said. "But with the pattern over time ...that's prima facie evidence that something's not right."

In 2005, for example, when a record-breaking 85% of New York State's fourth-graders passed the test, the questions had the highest average easy score in years. The easy score was .73 - meaning the average question was answered correctly by 73% of the kids who participated in pretest trials.

In contrast, when 68% of kids passed the state test in 2002, the easy score was .61.

here's more:

Before any high-stakes test is given to kids in New York, testmakers subject every possible question to an experimental trial called a field test.

[snip]

Every question gets an easy score - called a P-value - that comes from the percent of field-testers who correctly answered a question.

If 61% of kids get a question right - as field-testers did for the average question on the 2002 fourth-grade exam - the question has a P-value of .61.
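The P-value arithmetic described above can be sketched in a few lines of Python. The response data here is invented for illustration (it does not come from any actual field test), and the function names are hypothetical.

```python
def item_p_value(responses):
    """Fraction of field-testers who answered one question correctly.

    `responses` is a list of booleans, one per field-tester (made-up data).
    """
    return sum(responses) / len(responses)


def average_p_value(items):
    """Mean P-value across all the questions on one test form."""
    return sum(item_p_value(r) for r in items) / len(items)


# Invented example: 100 field-testers, 61 of whom got the question right.
question = [True] * 61 + [False] * 39
print(item_p_value(question))  # 0.61
```

A higher average P-value means the field-testers found the questions easier overall, which is the sense in which the article calls it an "easy score."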

Kids in New York get the same number of points for correct answers regardless of whether a question is rated easy or difficult. One way testmakers equalize exams is by requiring more correct answers on easier tests.

If the 2005 test was easier than the 2002 test, that wasn't done. Kids needed 40 points to pass the 2002 test but only 39 points to pass in 2005.

"Wow!" said NYU testing expert Robert Tobias. "This is really good evidence that the test was easier, substantially easier."

State officials deny that the 2005 test was easier. Testmaker CTB/McGraw-Hill used the highly regarded Item Response Theory to ensure equivalent exams.

Item Response Theory considers the difficulty of a question, its ability to separate smart kids from struggling ones and the odds that a kid can guess the answer correctly.
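Those three ingredients (difficulty, discrimination, and guessing) match what is commonly called the three-parameter logistic model. Here is a rough Python sketch of that model. Whether CTB/McGraw-Hill used exactly this form is an assumption on my part, and the parameter values are made up.

```python
import math


def prob_correct(theta, a, b, c):
    """Three-parameter logistic (3PL) item response function.

    theta: student ability
    a: discrimination (how sharply the item separates stronger from weaker students)
    b: difficulty (the ability level at which the curve is halfway between c and 1)
    c: guessing floor (chance a very low-ability student still answers correctly)
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Illustrative (made-up) parameters for two four-option multiple-choice items.
easy = dict(a=1.0, b=-1.0, c=0.25)
hard = dict(a=1.0, b=1.0, c=0.25)

for theta in (-2.0, 0.0, 2.0):
    print(theta,
          round(prob_correct(theta, **easy), 2),
          round(prob_correct(theta, **hard), 2))
```

At any given ability level the easier item has the higher chance of a correct answer, and neither item ever drops below the guessing floor.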

Test experts who reviewed technical reports from state exams for the Daily News said McGraw-Hill used state-of-the-art equating methods, but they said they couldn't know if the equating was done properly without a thorough audit.

Some experts were troubled by the fact that test scores have gone up and down over the past six years in the same pattern as the easiness ratings.

"It is worrisome that the average P-values for those items on the field test do tend to track the overall passing rate," said Columbia University testing expert James Corter.

This report is significant in terms of Irvington's public use of data. Last year the administration presented data on our ELA scores to the board. In that meeting we learned that the 8th grade class had scored poorly on the ELA exam. Where 43% of the class had earned 4s in 4th grade, only 16.7% of them had earned 4s in 8th grade.

The administration offered 3 explanations:

• a couple of ELA teachers took sudden leaves, so many 8th graders were taught by substitutes

• 18 new students moved into the district, 14 of whom were "receiving services" (mostly 504C or "building support"); these low-scoring students depressed the scores of the rest of the class (total class size approximately 150)

• you really can't compare one year's kids to any other year's kids anyway because "the scaling might be different"

And there it was left.

It is a mathematical impossibility for 18 new students entering a class of 150 to cause a decline from 43% earning 4s to 16.7%.
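Here is the worst-case arithmetic as a short Python sketch. It assumes the class of roughly 150 includes the 18 newcomers, that none of the newcomers scored a 4, and that the returning students held steady at 43%. Those assumptions are mine, chosen to be as generous as possible to the administration's explanation.

```python
import math


def share_of_4s_after_influx(class_size, newcomers, prior_share):
    """Worst-case share of 4s if every newcomer scores below 4
    and the returning students keep their prior share of 4s."""
    returning = class_size - newcomers
    fours = prior_share * returning
    return fours / class_size


def newcomers_needed(returning, prior_share, target_share):
    """Newcomers (all scoring below 4) needed to dilute prior_share to target_share."""
    return math.ceil(returning * (prior_share / target_share - 1))


worst_case = share_of_4s_after_influx(class_size=150, newcomers=18, prior_share=0.43)
print(f"{worst_case:.1%}")  # 37.8%

print(newcomers_needed(returning=132, prior_share=0.43, target_share=0.167))  # 208
```

Even in the worst case, 18 newcomers pull the share of 4s down only to about 38%. Under these assumptions it would take around 208 newcomers, all scoring below 4, to dilute the class to 16.7%.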

That said, the percentages are beside the point.

What matters is that, in 4th grade, 68 students in this class scored 4s on ELA; in 8th grade, only 25 students scored 4s. You can jigger the figures in a couple of ways (fewer kids were tested in 8th grade than in 4th, for instance), but any way you slice it, a bunch of 4s turned into 3s. The district offered excuses, then told us that the 8th grade test was "unnecessarily difficult" and the scores would "bounce back."

What the administration meant was that the scores on the 8th grade ELA exam would bounce back on the high school Regents, which has low cut scores. The tests aren't equivalent.*

Moreover, the 8th grade ELA test, which is in fact more difficult than any of the other state ELA tests, is the only test that comes within calling distance of matching NAEP. (Look at the column for New York.)

This use of data by district administrators is unsound.

a district email to the community

Not long after that, things heated up in the district for a number of reasons, one of them being the fact that I had begun to write up and distribute to the community data parents had not seen. This post in particular, on the recentering of SAT scores in 1995, was forwarded widely around the community, I'm told.

A couple of other things occurred, including a public threat by the teachers' union to sue me (or to look into suing me - that's the threat that gets made around here)......

One thing led to another and presently the superintendent issued an email filled with all good data, including the news that:

This year’s 5th grade, the first to use Trailblazers in both 3rd and 4th grades, scored 96% of students at proficient or mastery level. Sixty-one percent of students scored at mastery level.

There was no mention made of the "bad" data discussed at the board meeting. As far as I'm aware, the administration has never mentioned the 8th grade ELA scores outside of an untelevised school board meeting attended by only a handful of parents.

Now we learn that the 2005 math test, the test that yielded Irvington's 96% 3s and 4s, was easier than the test taken by older students using the old curriculum (SRA Math).

It strikes me as unlikely that the administration will mention this development to the wider community. Nor do I know whether the administration has plans to discuss these reports with the board.

auditing the data

I believe that school districts need, at a minimum, routine independent audits of data, data analysis, and the use of data to make decisions. This is simply good practice. When I was on the board of NAAR, we were required to undergo an independent audit each and every year of operation. Businesses must do the same.

For a school, standardized test scores are money; scores directly support real estate value.

I would also like to see citizens' oversight committees set up to give data and the district's use of data a second look, and to offer guidance on the way in which data is used by schools. A number of statisticians and researchers live in Irvington; we need these people looking at our data and offering the administration their expertise.

2005 math scores in NY state - test far easier
All Quizzes Not Created Equal (Daily News)
4th and 5th graders subjected to comparison study
2002 NY state math test
2005 NY state math test
answer keys
if you can't improve the results, make the test easier (2005 reading scores in NY)

* No links--sorry. I'm not going to spend hours of my life running down data on the NY edu-web site, which has become even more impossible to navigate than it was last year. Scores don't bounce; this is a core psychometric truth. The burden is on my district to prove that high school Regents is equivalent to the 8th grade ELA in difficulty, not on me to prove that it's not. They're the ones asserting an anomaly as a reality.

#### 5 comments:

Anonymous said...

>>One way testmakers equalize exams is by requiring more correct answers on easier tests.

Hmm??? Looking over my children's data, here's what I see:

ELA 2005

4th grade
Raw Score for a '3' was 28-36
Raw Score for a '4' was 37-42

ELA 2007

6th grade (cohort from above)
Raw Score for a '3' was 28-36
Raw Score for a '4' was 37-39

4th grade
Raw Score for a '3' was 30-40
Raw Score for a '4' was 41-43.

Looks like the range for a 4 has been reduced while a '3' has widened in 4th grade. Not much difference in #correct needed for Grade 4: 28/42 in 2005, 30/43 in 2007. I doubt this is statistically significant, but I'm sure someone will have some numbers claiming it is.

We noted a difference in that the test changed for the 4th grade extended response - 2007 was much easier as the student is now directed to 'remember to include point a,b,c,...', but that doesn't seem evident in the scoring.

The biggest thing I noticed is that there is now little wiggle room at the top to distinguish a high 3 from a 4. I'd like to know what the method is to determine that line between the '3' and the '4', since apparently questions are not weighted as to their difficulty.
Either way, students here are dropped from the honors program if they fail to achieve a '4', regardless of whether their instructor covered the test material or if there is any real significance in setting the bar for the '4' above the 90% mark.

Catherine Johnson said...

Either way, students here are dropped from the honors program if they fail to achieve a '4', regardless of whether their instructor covered the test material or if there is any real significance in setting the bar for the '4' above the 90% mark.

????????

Is this true?

(Do you mind if I ask whether you are from Irvington?)

Do you have other thoughts to share?

This is horrifying.

Catherine Johnson said...

The ELA test, in the middle school, has NO wiggle room.

Christopher scored a 3, while getting an average 95% correct on the test.

There is no range in the 4s at all.

Anonymous said...

>>>(Do you mind if I ask whether you are from Irvington?)

I am in Orange County, NY.

>>Do you have other thoughts to share?

I don't want to be personal and post scores, but from my child's data, it is quite possible to improve one's percent correct and see a decrease in scale score from the previous year. This decrease can be interpreted politically. So IMHO.. if a gatekeeper is trying to keep a student out based on ELA results, it may be a good idea to look at raws as well as scale scores.

Catherine Johnson said...

I don't want to be personal and post scores, but from my child's data, it is quite possible to improve one's percent correct and see a decrease in scale score from the previous year. This decrease can be interpreted politically. So IMHO.. if a gatekeeper is trying to keep a student out based on ELA results, it may be a good idea to look at raws as well as scale scores.

Thank you VERY much.

I'm embarrassed to say I hadn't even looked at this, with C's scores.

No---wait.

Now I'm thinking I did.

oh, boy

Dealing with all this is simply exhausting.

The fact is: my kid isn't learning math to anything like proficiency, period.

Meanwhile his ITBS scores (I gave him the test myself) indicate that he does read well (95th percentile) though he scored a (high) 3 on the ELA.