kitchen table math, the sequel: death by data

## Saturday, March 12, 2011

### death by data

A teacher in my town posted a link to this story:
The Lab School (Lab School at US News) has selective admissions, and Ms. Isaacson’s students have excelled. Her first year teaching, 65 of 66 scored proficient on the state language arts test, meaning they got 3’s or 4’s; only one scored below grade level with a 2. More than two dozen students from her first two years teaching have gone on to Stuyvesant High School or Bronx High School of Science, the city’s most competitive high schools.

“Definitely one of a kind,” said Isabelle St. Clair, now a sophomore at Bard, another selective high school. “I’ve had lots of good teachers, but she stood out — I learned so much from her.”

You would think the Department of Education would want to replicate Ms. Isaacson — who has degrees from the University of Pennsylvania and Columbia — and sprinkle Ms. Isaacsons all over town. Instead, the department’s accountability experts have developed a complex formula to calculate how much academic progress a teacher’s students make in a year — the teacher’s value-added score — and that formula indicates that Ms. Isaacson is one of the city’s worst teachers.

According to the formula, Ms. Isaacson ranks in the 7th percentile among her teaching peers — meaning 93 per cent are better.

This may seem disconnected from reality, but it has real ramifications. Because of her 7th percentile, Ms. Isaacson was told in February that it was virtually certain that she would not be getting tenure this year. “My principal said that given the opportunity, she would advocate for me,” Ms. Isaacson said. “But she said don’t get your hopes up, with a 7th percentile, there wasn’t much she could do.”

That’s not the only problem Ms. Isaacson’s 7th percentile has caused. If the mayor and governor have their way, and layoffs are no longer based on seniority but instead are based on the city’s formulas that scientifically identify good teachers, Ms. Isaacson is pretty sure she’d be cooked.

She may leave anyway. She is 33 and had a successful career in advertising and finance before taking the teaching job, at half the pay.

“I love teaching,” she said. “I love my principal, I feel so lucky to work for her. But the people at the Department of Education — you feel demoralized.”

How could this happen to Ms. Isaacson? It took a lot of hard work by the accountability experts.

Everyone who teaches math or English has received a teacher data report. On the surface the report seems straightforward. Ms. Isaacson’s students had a prior proficiency score of 3.57. Her students were predicted to get a 3.69 — based on the scores of comparable students around the city. Her students actually scored 3.63. So Ms. Isaacson’s value added is 3.63-3.69.

[snip]

The calculation for Ms. Isaacson’s 3.69 predicted score is even more daunting. It is based on 32 variables — including whether a student was “retained in grade before pretest year” and whether a student is “new to city in pretest or post-test year.”

[snip]

In plain English, Ms. Isaacson’s best guess about what the department is trying to tell her is: Even though 65 of her 66 students scored proficient on the state test, more of her 3s should have been 4s.
But that is only a guess.

Moreover, as the city indicates on the data reports, there is a large margin of error. So Ms. Isaacson’s 7th percentile could actually be as low as zero or as high as the 52nd percentile — a score that could have earned her tenure.

Evaluating New York Teachers, Perhaps the Numbers Do Lie
By MICHAEL WINERIP
Published: March 6, 2011
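
The arithmetic in the quoted teacher data report can be sketched in a few lines of Python. This is only an illustration built from the three numbers the article reports. The function names are made up for the example, and the city's 32-variable formula for the predicted score is not reproduced here.

```python
# The three scores below are quoted in the article; everything else is illustrative.
PRIOR = 3.57      # prior proficiency score of her students
PREDICTED = 3.69  # score the city's model predicted for them
ACTUAL = 3.63     # score they actually earned


def value_added(actual: float, predicted: float) -> float:
    """Value added as the article describes it: actual minus predicted."""
    return actual - predicted


def share_of_expected_gain(prior: float, predicted: float, actual: float) -> float:
    """Fraction of the predicted year-over-year gain that was actually achieved."""
    return (actual - prior) / (predicted - prior)


if __name__ == "__main__":
    print(f"value added: {value_added(ACTUAL, PREDICTED):+.2f}")
    share = share_of_expected_gain(PRIOR, PREDICTED, ACTUAL)
    print(f"share of expected gain: {share:.0%}")
```

On these numbers the value added is about -0.06, and the class made roughly half of the gain the model predicted.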

I have four reactions.

1.
A 32-variable teacher evaluation scheme does not sit right with me if only because it lacks transparency. This teacher has no idea why her score falls in the bottom 7% of all teachers in NYC, and neither does anyone else including her principal and students.

2.
Is this teacher running afoul of a ceiling effect? Her students were already scoring well above average coming into her class -- isn't it harder to bring above-average students further up than it is to bring below-average students to average? Working on SAT math with C., I'm convinced that the jump from 550 to 600 is a shorter leap than the one from 600 to 650. Whether or not that's true for the SAT specifically, I'm pretty sure people have shown it to be true with other tests.

[pause]

Yes. It's a well-known effect. *

3.
I flatly reject the assumption that New York state tests are capable of distinguishing between a group of students earning 3.57 on average and a group of students earning 3.69 on average. A few years back, when C., who is a fantastically good reader,** scored a 3 on reading, I got in touch with our then-curriculum director, who told me that NY state tests in some grades have essentially no range of scores in the 4 category at all. That is, if you score a 37 or 38 out of 38 correct, say, you earn a 4; score a 36 and you're a 3. I checked the test and sure enough. She was right. There was no range at all for the 4. I don't know whether David Steiner has changed the tests in the year he's been in office, but even if he has, I reject the idea that the tests are now valid and can accurately assess what the gap between a 3.57 and a 3.69 means (if anything) and whether it is equivalent to the gap between a 3.01 and a 3.13.***

4.
On the other hand, suppose the 7% ranking is right. What might account for that?

One possibility: the Lab School is a constructivist enterprise (here's the Math Department), and this teacher was trained at Columbia Teachers College. She is teaching English and social studies to 7th graders. New York state requires that teachers have a bachelor's degree in their field of specialty beginning in 7th grade, which means that most 7th grade teachers are teaching English or social studies, not both. As one of her students, Marya, put it: “I really liked how she’d incorporate what we were doing in history with what we did in English. It was much easier to learn.”

Interdisciplinary teaching at the middle school level tends to be shallow because students aren't expert in any of the fields being blurred together (and teachers are expert in just one field), and the only commonalities you can find between disciplines tend to be obvious and current-eventsy. For example, back when one of our middle school principals explained to us that henceforth character education would be 'embedded' in all subject matter, the best example he could come up with was that the father in The Miracle Worker is an angry patriarchal male who is abusive towards his handicapped child. That "interdisciplinary" reading of The Miracle Worker is anachronistic, simplistic, and wrong.

It's impossible to say whether these scores mean anything.

But if they do, they suggest to me that a 7th grade teacher needs to focus all of her efforts on English or on history, not on both.

English literature and history are very different disciplines.

I wonder how other teachers in the school fared - and, if they did better, how true they were to the middle school model?

* The article doesn't tell us whether the city's statisticians correct for ceiling effects and regression to the mean.

** C. typically missed just 1 or 2 items on SAT reading & writing tests.

*** Of course, given the very wide range for 3s, perhaps a .12 difference is significant. It's impossible to know -- and that's the problem.

cranberry said...

Some schools for the gifted in our area combine English and history. It can be an improvement, as it reorients the English instruction to expository writing, rather than creative writing. The schools which do this allot two periods to the combined course.

Catherine Johnson said...

Good point (re: writing)

That's less the case here in NY, I think (now I'm forgetting which state you're from --- was it CA?)

NY has terrific history standards, so combining excellent history standards with mediocre English standards isn't good...

Still and all, it's extremely difficult to be expert in more than one discipline.

English and history are **very** different.

Ed (who is a historian) was amazed when he team taught a course with a literature professor ---- history 'reading' and literature 'reading' have almost nothing in common.

Catherine Johnson said...

don't know why I used the scare quotes there...

Rudbeckia Hirta said...

I forget who is being quoted (too lazy to check google), but there is a saying in statistics: All models are wrong; some models are useful.

I suspect that a model with 32 variables and a really wide error, like the one cited, is not particularly useful. Bad models like this give value-added analysis a bad name.

Too bad the Powers That Be can't put sufficiently anonymized data on Kaggle.com in order to find some better solutions.

Catherine Johnson said...

Hi Rudbeckia!

Thanks for weighing in.

Obviously I'm no expert, but that is **exactly** my intuition.

32 variables & wide margin of error: it's crazy

Plus I am committed to the concept that parents, students, teachers, administrators, and taxpayers should be able to understand an evaluation system --- certainly to a far greater degree than anyone can fathom this one.

Catherine Johnson said...

And you're right: this gives value-added a TERRIBLE reputation.

Catherine Johnson said...

At least here in Irvington, my position on value-added is that it needs to be used for the kids, not (so much) the teachers.

Each student's parents should receive a report each year showing them whether their child made at least one year's progress in each subject -- which means using norm-referenced tests like the ITBS.

From there, I think you'd see outlier teachers: you'd see the teachers whose students routinely make lots more than 1 year's progress, and you'd see the teachers whose students routinely make far less progress.

Everyone else would be in a mushy middle (I think) -- and the focus of energy would be on bringing any kids who have fallen behind back to where they should be.

ChemProf said...

In general, with modeling, I always like Fermi's analysis: "With four parameters I can fit an elephant, and with five I can make him wiggle his trunk." He is admittedly talking about adjustable parameters, but at some point (and with 32 parameters you are basically always there) you are just overfitting the data.

Anonymous said...

"The Lab School (Lab School at US News) has selective admissions, and Ms. Isaacson's students have excelled. Her first year teaching, 65 of 66 scored proficient on the state language arts test, meaning they got 3's or 4's; only one scored below grade level with a 2. More than two dozen students from her first two years teaching have gone on to Stuyvesant High School or Bronx High School of Science, the city's most competitive high schools."

I am *NOT* going to defend a model with 32 parameters, here.

And I'm also not going to go into things like mean regression.

But ... consider:

(1) Ms. Isaacson is teaching at a selective school. The kids who
get in are going to do well, almost no matter what. The only
question is, "How *much* are they going to excel?"

Given this, almost all of the students scoring at proficient or
better (3+) isn't much of a surprise. Right? The average
*incoming* score was closer to 4 than to 3 (3.57). I'd guess that a good
50% of the kids could have skipped class all year and still pulled
at least a 3 (those would be the kids who scored a 4 in 6th grade).

(2) We are, implicitly, evaluating Ms. Isaacson by how well her students
are doing in absolute terms. High proficiency scores and getting into
"good" schools. But, again, even without her *THESE* kids would be
expected to do well. I'll also note that the ones getting into Bronx
High School of Science may have done so more on their math and science
scores than their English and history scores.

The whole *point* to these sorts of value-add measures is to distinguish
between "you are teaching better than the average teacher" and "you have
better students than the average teacher." It is totally possible that
she *isn't* above average. She has students who do well in absolute terms,
but possibly not relative to expectations.

(3) Unless all her peers (those with the great students) are also well below
average, her kids did not increase their scores as much as her peer group.
We could argue that the scores don't matter, but then we shouldn't be
pushing for value-add in the first place. Her kids went from 3.57 to 3.63.
The expectation was double this: 3.57 to 3.69. If we have 10 teachers total
who get kids averaging 3.57, and she has the lowest scoring class after one year,
well ... shouldn't she get a low value-add ranking (again, assuming we care
mostly about these test scores)? If not, then who should? Some teachers with
students who are expected to do poorly, and do?

(to be continued)

Anonymous said...

(cont'd)

(4) "I'm convinced that the jump from 550 to 600 [on the SAT] is a shorter leap than the one from 600 to 650"

Yes, it is. That is why we should be measuring standard deviations. But, in this case
it seems like her peer group did (or was expected to) raise the scores more than
she did. It is totally possible that the bozos who built the model with 32
parameters screwed this up. If we knew how her peers with the great students did
we would have a better idea.

(5) "I flatly reject the assumption that New York state tests are capable of distinguishing between a group of students earning 3.57 on average and a group of students earning 3.69 on average."

Without knowing the population size, standard deviation and maybe some
repeatability metric, you probably shouldn't be so sure.
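
To make the point about population size and standard deviation concrete, here is a rough sketch of how the uncertainty in a class average shrinks as the class grows. The standard deviation of 0.5 is a made-up placeholder (the article does not report one), and `standard_error` is a hypothetical helper name. Only the class size of 66 and the 0.06 gap come from the article.

```python
import math


def standard_error(std_dev: float, n: int) -> float:
    """Standard error of a mean: spread of individual scores / sqrt(group size)."""
    if n <= 0:
        raise ValueError("n must be positive")
    return std_dev / math.sqrt(n)


if __name__ == "__main__":
    ASSUMED_STD_DEV = 0.5  # ASSUMPTION: not reported anywhere in the article
    CLASS_SIZE = 66        # first-year class size mentioned in the article
    GAP = 3.69 - 3.63      # predicted minus actual, from the article

    se = standard_error(ASSUMED_STD_DEV, CLASS_SIZE)
    print(f"standard error of the class mean: {se:.3f}")
    print(f"gap measured in standard errors: {GAP / se:.2f}")
```

Under that assumed spread, the 0.06 gap is about one standard error, which would not be strong evidence either way. A smaller spread or a larger group would change the picture, which is why the missing numbers matter.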

In any event, why do we think she is a great teacher EXCEPT FOR THE FACT THAT HER
KIDS WERE DOING WELL WHEN SHE GOT THEM AND DID WELL GOING FORWARD? Maybe another
way to phrase this is, "Other than a pretty dramatic wipeout, what would cause you
to rate a teacher with superstar kids as lower than average?"

-Mark Roulo

What I find most interesting is that this example illustrates perfectly why many
teachers are reluctant to be judged using value-add. Just in the opposite direction
from what we are used to. The common example is a good teacher with horrible students,
who do badly but much better than expected. The fear is that this hypothetical teacher
will be ranked lower than she should be. But since ranking is a zero-sum game (1% of
the teachers will be in each percentile), if we give a pass to the teachers with great
students, we wind up hurting the teachers with the poor students.

Catherine Johnson said...

chemprof - I hate to say it, but I don't quite follow.

What does it mean to include the whole elephant & then make its trunk wiggle?

(If you're around....)

Anonymous said...

FYI, Fermi is quoting Von Neumann.

In any event, the point is that the more randomly selected/fitted parameters you have, the more likely that you are over-fitting. Which means that your model will match the data you have, but won't match the actual thing you wish to measure very well. This will result in a poor match between your model and future data you gather.

Which is almost surely what is going on here. The likelihood that all 32 parameters are "significant" is very low.

If I were making a model, I'd probably start with something like:

(a) Years of formal schooling the parents have
(b) Race of the child and/or parents
(c) Income of the parents

Then I'd look at:
(d) Parents single/divorced
(e) *Maybe* quality of the college/university that the parents attended.

My guess is that very much beyond this won't tell us a lot about a population of 100+ students. In the case of (a) and (c), my model would say that "more" is better, and I'd gather data and fit a curve to compute how much better.

In the case of (b), I'd expect a stacking of Asian, White, Hispanic, Black (expected best to worst scores) and again, I'd curve fit to find the exact values.

That's 3 parameters. If we can't get fairly close with 3 or 4 parameters, we probably don't have the *CORRECT* parameters, and need a new model.

-Mark Roulo

Anonymous said...

Oh, yeah.

I'd also include the IQ of the student(s), if I could get it.

Duh.

I've gotten so used to thinking that this data isn't available, that I forgot that I'd like it in my model for expected performance.

And prior year's results.

Okay, I've now got five parameters. I think I'd be fine with just IQ plus prior year(s) performance if I had those two pieces of data. For a population of 100+.

-Mark R.

Anonymous said...

FYI, a few years ago I tried to fit a straight line between California standardized test scores of some sort (STAR? I think those were replaced a dozen years back) as reported per school, and average education level as reported per-school. The education was binned into four, so it was very rough.

I still got about a 0.7 correlation at the school level.

I'd like to think that with better data, the correlation would be much higher. Maybe 0.8. This is with a one-factor model.
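
A one-factor fit like the one described above amounts to computing a Pearson correlation and a least-squares line. The sketch below shows that calculation. The per-school numbers are invented placeholders, not the California data, and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Least-squares straight line; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


if __name__ == "__main__":
    # ASSUMPTION: invented data. One row per school: a parent-education
    # bin (1 to 4) and an average test score.
    education_bin = [1, 1, 2, 2, 3, 3, 4, 4]
    avg_score = [640, 700, 690, 760, 750, 800, 790, 860]

    intercept, slope = fit_line(education_bin, avg_score)
    print(f"r = {pearson_r(education_bin, avg_score):.2f}")
    print(f"score = {intercept:.1f} + {slope:.1f} * education_bin")
```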

Now, per-teacher would probably have more variation, but still ...

32 parameters is nuts.

My basic conclusion is that the definition of a "good school", which to most people means "high test scores" is "lots of parents who have a college degree." Most people looking for a "good school" don't quite think of it in those terms, but this is what they are seeking. And paying lots of money in higher house prices, too. The ability to send their kids to a school where their kids' schoolmates' parents have college degrees.

-Mark Roulo

Catherine Johnson said...

I think I'd be fine with just IQ plus prior year(s) performance if I had those two pieces of data.

jeez

I didn't even think of that

they've got 32 parameters and 1 of those 32 probably is **not** IQ

Catherine Johnson said...

My basic conclusion is that the definition of a "good school", which to most people means "high test scores" is "lots of parents who have a college degree."

That's exactly what "good school" means -- although Laurence Steinberg says that the peers are more important than the school.

Steinberg says it's worth paying extra for peers with college-educated parents.

Anonymous said...

One thing I like doing for discussions like this is to try and find a sports analogy. People tend not to get so hung up on non-PC conclusions in sports, but they also often care a lot. This can lead to enlightenment.

So ... baseball:

(1) You can get a *VERY* good handle on how valuable a batter is with just two values, which can be combined into one number. You need on-base percentage (OBP), which is, for every 100 times he comes to the plate, how often does he get on base? And you need slugging percentage (SLG), which says how many bases he gets each time he has an at-bat. In both cases, more is better. And you can combine them with this: (OBP*3 + SLG)/2 to get a number that works the way most people who follow baseball can understand.

There *ARE* more sophisticated models, but they don't improve on this one by much. So ... two parameters, both of which are pretty easily understood.
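
The two-parameter batter formula above, (3 × on-base percentage + slugging percentage) / 2, is simple enough to write out directly. The sketch below does so. The function name is made up, and the two batters are hypothetical, not real players.

```python
def batter_value(on_base: float, slugging: float) -> float:
    """Combine the two inputs as (3 * on-base percentage + slugging percentage) / 2."""
    return (3 * on_base + slugging) / 2


if __name__ == "__main__":
    # ASSUMPTION: hypothetical stat lines, chosen only to show the weighting.
    batters = {
        "patient hitter": (0.400, 0.450),  # (on-base, slugging)
        "free swinger": (0.320, 0.520),
    }
    for name, (on_base, slugging) in batters.items():
        print(f"{name}: {batter_value(on_base, slugging):.3f}")
```

Because on-base percentage is weighted three times as heavily, the hypothetical patient hitter comes out ahead even though the free swinger has the higher slugging percentage.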

(2) For pitchers it is a bit more complicated, but you can basically track strikeouts, walks and home runs and then put them together to get a single number. Again, one can improve (for starting pitchers, you also care about how "efficient" they are), but basically you'll get the right answer for ranking pitchers with just these three.

I get that teaching is more complicated. But 32 parameters is nuts.

-Mark Roulo

ChemProf said...

Yeah, Mark covered it (and sorry for the misattributed quote). When you have a lot of adjustable parameters, you can fit anything, but your model usually loses its predictive value because it is over-fitted.
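
A toy illustration of that over-fitting point, under stated assumptions: the data are contrived (a true relationship of y = x with alternating ±0.5 noise), and the "many-parameter" model is simply the polynomial that passes through every training point, not anything resembling the city's formula. All names here are hypothetical.

```python
from typing import Callable, Sequence


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Callable[[float], float]:
    """Two-parameter least-squares line; returns a prediction function."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_exact_polynomial(xs: Sequence[float], ys: Sequence[float]) -> Callable[[float], float]:
    """One parameter per data point: the Lagrange polynomial through every point."""
    def predict(x: float) -> float:
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict


# ASSUMPTION: toy data. The true relationship is y = x, with +/-0.5 of noise.
TRAIN_X = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
TRAIN_Y = [x + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(TRAIN_X)]
NEW_X, NEW_Y = 6.0, 6.0  # a "future" observation lying on the true line

line = fit_line(TRAIN_X, TRAIN_Y)
poly = fit_exact_polynomial(TRAIN_X, TRAIN_Y)

if __name__ == "__main__":
    for name, model in (("2-parameter line", line), ("6-parameter polynomial", poly)):
        train_err = max(abs(model(x) - y) for x, y in zip(TRAIN_X, TRAIN_Y))
        new_err = abs(model(NEW_X) - NEW_Y)
        print(f"{name}: worst training error {train_err:.2f}, error on new point {new_err:.2f}")
```

With this toy data the six-parameter curve reproduces every training point yet lands far from the new observation, while the straight line misses the training points slightly and stays close to it.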

SteveH said...

"Student Characteristics"?

What kind of fudge factor is that? I assume that's a big handicap for teachers in high SES areas.

"According to the formula, Ms. Isaacson ranks in the 7th percentile among her teaching peers — meaning 93 per cent are better."

93 percent are better at her school? I don't think so.

Empirical formulas like this are supposed to reflect reality, not drive it. When I create a merit function like this, I don't blame reality if the results don't match up. Lesser variables usually have greater error ranges. Also, there is nothing linking this to some sort of absolute scale. Nobody is stating what can be expected from 6+ hours a day. Nobody says what level the test is at.

Probably the "Student Characteristics" factor killed her score. I'll bet the Pretest numbers were high too.

How about looking at the total effect of a school? If kids come into fifth grade not knowing their times tables, a teacher could easily get a big bump in the Pretest/Posttest change. Does this mean that it's better to be a good teacher in a bad school? Does it mean that a good teacher is one who remediates very well? As a school improves, the pretest numbers will increase and the overall teacher quality will decrease. Did anyone ever really study the equation? Did anyone ever do a sensitivity analysis on the variables? Did anyone stop to think that they are measuring the wrong thing?

I know first hand that good students (and parental support) can make teachers look good, but how will this formula improve teaching if the Pretest numbers are already high?

Anonymous said...

"93 percent are better at her school? I don't think so."

I would hope that the results were computed district wide. I hope.

"I'll bet the Pretest numbers were high too."

They probably were, for the same reason that *any* extreme result is probably a bit further from 'true'. Both for low and high scores.

Still, the big questions are, (a) how did she do against her peer group [those teachers teaching similarly high powered classes], and (b) are the differences statistically significant.

My guess is that nobody knows.

-Mark Roulo

cranberry said...

One more question: what's the difference between a 3 and a 4 on the state test? Is it a useful distinction?

Here's a link to a scoring guide for our state tests: http://www.doe.mass.edu/mcas/student/. You'll notice that the highest marks are given to students who quote, with quotation marks, from the short written texts on the exam. This is not a useful academic skill; in my opinion, it is counterproductive to drill students in setting verbatim quotes from short passages into short essay answers. If you read some of the example passages, you'll see that the highest marks are awarded to the highly artificial essays littered with quotation marks.

If she has many students flourishing at demanding high schools, she may be teaching gifted children well--just not doing test prep.

lgm said...

That particular test had 41 pts. 8 are short response, 3 are a paragraph to edit, and the remainder multiple choice: