Phony Precision in Teacher Evaluations

Let's talk more about phony precision in teacher evaluations:

As I said before, using standardized test scores for even a small part of teacher evaluations is problematic, because the test is only part of the evaluation, but it's all of the decision. Because the rest of a teacher's evaluation is based on imprecise measures like observations, the very precise-looking score a teacher gets from his or her students' tests ends up driving the entire evaluation.

For example: let's say a teacher in a district has to be cut. If two teachers are rated in their observations as "partially effective," the one with the lower rating based on testing will lose her job - even if her observation shows she was almost rated "effective," and the other teacher was almost "ineffective."

What's worse is that all of the evidence shows that the use of tests in teacher evaluations leads to high error rates: the teachers' ratings will show a phony precision. We will be making precise, high-stakes decisions based on evidence that is not accurate enough. This is a serious problem; I would wager that this kind of arbitrary decision making will delegitimize teaching enough to fundamentally change the profession for the worse.

Now, one of the responses to this argument is to say that we should try to make teacher observations more precise. We should use observation tools, like Danielson's Framework, to give us a more accurate "score" of a teacher's effectiveness than merely saying the teacher is "Ineffective, Partially Effective, Effective, or Highly Effective."

Washington, D.C. already tried this under Michelle Rhee's IMPACT system. Let's see how they make it work:

This is a sample of a teacher's score after five observations. In each observation, the teacher was rated on a scale of 1 to 4, 4 being the highest. All of the scores are averaged to give us a final teacher score of 3.7.

Let's suppose that this year, DCPS decides to give a merit pay bonus to every teacher who earns a 3.7 or higher. Hooray! It's bonus time for this teacher. Sorry all you 3.6's; you were so close! Well, better luck next year...
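
To make the arithmetic concrete, here's a minimal sketch of that kind of calculation. The component ratings and the 3.7 bonus cutoff below are hypothetical stand-ins, not the actual DCPS formula: each observation is scored by averaging whole-number rubric ratings, and the five observation scores are then averaged into one final number.

```python
# A minimal sketch (not the actual DCPS calculation) of how averaging
# whole-number rubric ratings produces a decimal "score" like 3.7.
# All of the numbers below are hypothetical.

observations = [
    [4, 4, 3, 4],  # each number is a 1-4 rating on one rubric component
    [4, 3, 4, 4],
    [4, 4, 4, 3],
    [3, 4, 4, 3],
    [4, 4, 4, 3],
]

# Score each observation as the average of its component ratings,
# then average the five observation scores into one final number.
obs_scores = [sum(obs) / len(obs) for obs in observations]
final_score = sum(obs_scores) / len(obs_scores)

print(obs_scores)             # [3.75, 3.75, 3.75, 3.5, 3.75]
print(round(final_score, 1))  # 3.7

# Hypothetical merit-pay cutoff from the example above:
BONUS_CUTOFF = 3.7
print("Bonus!" if final_score >= BONUS_CUTOFF else "Better luck next year")
```

Notice that a final score like 3.7 looks far more precise than any of the whole-number judgments that went into it.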

"Wait!" says the reformyist. "At least this is fair! At least we're not giving breathing bonuses just for getting through another year! This is much better, because the better teacher is getting the bonus. Sure it's only a 0.1 difference, but at least it's fair!"

Is it? Allow me another imperfect metaphor:

After a week-long blizzard, a grant for snow removal is available to the town with the highest average snowfall. Two towns claim to average the most: Ravitchton and Rheeville. Snowfall in both is measured in feet:

Day                Ravitchton   Rheeville
Mon                3 ft.        3 ft.
Tue                3 ft.        3 ft.
Wed                3 ft.        4 ft.
Thurs              2 ft.        3 ft.
Fri                3 ft.        3 ft.
AVERAGE SNOWFALL   2.8 ft.      3.2 ft.

Well, there you have it: Rheeville averages more snowfall.

Except then we find out that more accurate readings were taken*:

Day                Ravitchton   Rheeville
Mon                3.9 ft.      3.1 ft.
Tue                3.6 ft.      3.2 ft.
Wed                3.8 ft.      4.1 ft.
Thurs              2.9 ft.      3.1 ft.
Fri                3.8 ft.      3.0 ft.
AVERAGE SNOWFALL   3.6 ft.      3.3 ft.

Uh-oh: looks like we gave that grant for extra snow removal to the wrong town...

Averaging the multiple ratings in the first table did not make the snowfall measurement more accurate; it merely gave the illusion of precision. Only when we can make the initial measurement more accurate can we derive an overall score that we can confidently act upon.
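
If you want to check that arithmetic yourself, here's a quick sketch in Python. The whole-foot readings in the first table happen to match the finer readings chopped down to whole feet; the footnote below takes up the question of proper rounding.

```python
# A quick check of the arithmetic in the two snowfall tables above.
ravitchton = [3.9, 3.6, 3.8, 2.9, 3.8]   # the finer readings
rheeville  = [3.1, 3.2, 4.1, 3.1, 3.0]

def average(readings):
    return round(sum(readings) / len(readings), 1)

# Whole-foot readings, as in the first table: the finer numbers
# truncated to whole feet (not rounded -- see the footnote).
coarse_rav = [int(r) for r in ravitchton]   # [3, 3, 3, 2, 3]
coarse_rhe = [int(r) for r in rheeville]    # [3, 3, 4, 3, 3]

print(average(coarse_rav), average(coarse_rhe))   # 2.8 3.2 -> Rheeville "wins" the grant
print(average(ravitchton), average(rheeville))    # 3.6 3.3 -> Ravitchton actually got more snow
```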

Now, I have yet to see a teacher evaluation rubric that claims observers can make distinctions between various elements of teaching practice at more than a few different levels. Danielson uses four; IMPACT uses four. No more precision should be attributed to a teacher's "score" from an observation than that.

And if that's true... well, we're back to where we were with standardized tests. The test is only part of the evaluation, but it is the majority of the decision.

And if a teacher's livelihood depends on that test's score, what do you think she's going to do? Not teach to the test?

I doubt it.


* Some of you may be reading this thinking: "Wait a minute; you're not rounding correctly. If that 3.9 of Ravitchton's was rounded to a 4, you wouldn't have nearly the disparities. So teacher evaluations won't be as inaccurate as you suggest." Good point, except:

1) We're still dealing with a range of values collapsed into a single number; that means we're still going to have margins of error. Here's the table if we rounded:


Day                Ravitchton   Rheeville
Mon                4 ft.        3 ft.
Tue                4 ft.        3 ft.
Wed                4 ft.        4 ft.
Thurs              3 ft.        3 ft.
Fri                4 ft.        3 ft.
AVERAGE SNOWFALL   3.8 ft.      3.2 ft.

Uh-oh: now we've exaggerated the actual difference between the towns. If we had more towns competing, that shift might have made a difference.

The point stands: you can't generate precision simply by adding more observations or more things to observe.
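
If you'd like to see that point in action, here's a rough simulation. The noise level and the number of observations are my own assumptions, not anything from IMPACT: two teachers whose "true" effectiveness differs by only 0.1 are each observed several times, every observation is forced onto the whole-number 1-to-4 scale, and we count how often the averaged ratings put the weaker teacher ahead.

```python
# A rough simulation (assumptions mine, not from IMPACT or Danielson):
# how often does averaging coarse 1-4 ratings mis-order two teachers
# whose true effectiveness differs by only 0.1?
import random

def observed_average(true_score, n_observations):
    ratings = []
    for _ in range(n_observations):
        noisy = true_score + random.gauss(0, 0.5)      # observer noise (assumed)
        ratings.append(min(4, max(1, round(noisy))))   # forced onto the 1-4 scale
    return sum(ratings) / len(ratings)

random.seed(1)
trials, flipped = 10_000, 0
for _ in range(trials):
    stronger = observed_average(3.2, n_observations=5)
    weaker   = observed_average(3.1, n_observations=5)
    if weaker > stronger:
        flipped += 1

print(f"Weaker teacher ranked higher in {100 * flipped / trials:.0f}% of trials")
```

Under these made-up numbers, the averaged ratings rank the weaker teacher higher in a large fraction of the trials. Adding more whole-number observations shrinks the margin of error slowly, but it doesn't create tenths-place precision.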

2) I think my analogy is more apt when it comes to teacher ratings. You get an "Effective" if you meet a minimum requirement; you get "Highly Effective" if you meet the next minimum requirement. A 3.1 is functionally the same as a 3.9 in this system: neither earned a 4.0, even though the 3.9 was close.

But make your case in the comments; I'm always glad to see something I may have missed. Just don't be a jerk about it, OK?

ADDING: Look at the IMPACT TLF Framework again. Notice that the scores are all written to one decimal place, but the underlying values are whole numbers. In other words, a score is given as "3.0," but there is never a "3.1" or a "3.2"; the score is always an integer.

That's deceptive garbage, and I'm amazed no one's called this system out before. Well, maybe they have, and I haven't seen it.

You don't add a ".0" to a value unless the digit after the decimal could actually take on different values. Danielson's framework doesn't allow for that: you give a 1, 2, 3, or 4, and nothing in between.

Whoever put this together is using math to create phony precision. It's completely fraudulent and I'd say unethical.