118. Biases in Student Evaluations of Teaching

A growing body of evidence suggests that student evaluations of teaching are subject to gender and racial bias. In this episode, Dr. Kristina Mitchell joins us to discuss her recent study that examines these issues. After six years as the Director of Online Education at Texas Tech University, Kristina now works for a science curriculum publishing company and teaches part time at San Jose State University.

Show Notes

  • Chávez, K., & Mitchell, K. M. "Exploring Bias in Student Evaluations: Gender, Race, and Ethnicity." PS: Political Science & Politics, 1-5.
  • Flaherty, C. (2018). "Arbitrating the Use of Student Evaluations of Teaching." Inside Higher Ed, August 31, 2018.
  • Disciplinary organization statements on student evaluations
  • Peterson, D. A., Biederman, L. A., Andersen, D., Ditonto, T. M., & Roe, K. (2019). "Mitigating Gender Bias in Student Evaluations of Teaching." PLoS ONE, 14(5). – A study indicating that informing students about bias in student evaluations mitigates that bias.

Transcript

John: A growing body of evidence suggests that student evaluations of teaching are subject to gender and racial bias. In this episode, we discuss a recent study that examines these issues.

[MUSIC]

John: Thanks for joining us for Tea for Teaching, an informal discussion of innovative and effective practices in teaching and learning.

Rebecca: This podcast series is hosted by John Kane, an economist…

John: …and Rebecca Mushtare, a graphic designer.

Rebecca: Together, we run the Center for Excellence in Learning and Teaching at the State University of New York at Oswego.

[MUSIC]

John: Our guest today is Dr. Kristina Mitchell. After six years as the Director of Online Education at Texas Tech University, Kristina now works for a science curriculum publishing company and teaches part time at San Jose State University. Welcome back, Kristina.

Kristina: Thank you.

Rebecca: Today’s teas are:

John: Diet Coke?

Kristina: Diet Dr. Pepper, actually. [LAUGHTER]

John: Oh… I’m sorry.

Rebecca: Switching it up. [LAUGHTER]

John: And mine is Prince of Wales tea.

Rebecca: I have Christmas tea today. I know, I switched it up.

John: Ok.

In one of your earlier visits to our podcast, you discussed some of your earlier work on gender bias in student evaluations. We’ve invited you back today to discuss your newest study, co-authored with Kerry Chávez, entitled “Exploring Bias in Student Evaluations: Gender, Race, and Ethnicity.” Could you tell us a bit about the origin of this new study?

Kristina: Well, one of the things that seems to be inevitable when someone publishes a study on bias in student evaluations is that there’s always a reluctance by some in the community to believe the results. And most often there will be some question about what was being controlled for, or how the selection was done, or the sampling, or the research design. So really, the first impetus was just to shore up the existing findings and continue to demonstrate the potential bias that might exist. But, in addition, there’s a real dearth of research on race in student evaluations. The research on gender bias in student evaluations is becoming more and more robust, but there’s not very much yet on race and ethnicity. And so we were presented with the opportunity to do this… it almost presented itself as a natural experiment, with 14 identical online sections of the course and a different professor in each one, of different genders, races, and ethnicities. So, we took it as an opportunity to shore up the gender literature and expand the race literature.

John: And so, the only difference in the course was the welcome video, if I remember?

Kristina: That is the only difference in the course. Everything else about the course was identical: the lectures, the assignments, and even the emails that the students received when they were corresponding with the course instructor. We had a course coordinator, which was me, and I was sort of the behind-the-scenes person filtering through all the emails to make sure that the students were getting the same tone, the same style, everything the same about how they were interacting with their course.

John: And how long were these videos?

Kristina: The videos were just about three minutes in length. Everyone read an identical script that just told students the professor’s name and sort of had a generic message about how they were looking forward to a good semester. It was a summer course, just a five-week course. And that was the extent of the students’ direct interaction with the professor in a way that wasn’t filtered through a course coordinator.

Rebecca: Although they all thought they were interacting with the instructor, right?

Kristina: Yes, they were all told that this instructor was theirs. And of course, the instructor was instrumental in the management of the course; we just made sure that the professor was not directly facing the students without it being filtered through a coordinator, just to make sure that each professor was responding with the same tone and the same information.

John: …which sounds like a lot of work for you.

Kristina: It was a lot of work for me. But fortunately, it really allowed us to control for literally everything. We controlled for absolutely everything that you could control for. When I was doing the research, when I was compiling all of the data and getting everything ready, I was just thinking to myself, “Surely there’s no chance that I’m going to find significance. All of this was for nothing; I’m going to have to publish a null result that could potentially undermine other people’s research on gender and racial bias.” I just thought, “There’s absolutely no way… we’ve controlled for far too much for there to ever be any bias.” So, it was just astonishing to find that even with all of that control, we still found a statistically significant difference. Even with a small sample.

Rebecca: Can you talk a little bit about how many students and sections were involved?

Kristina: So, there were 14 different sections, each with a different instructor and about 200 students per section. And the students enrolled in the sections, all at the same time, when registration opened. There wasn’t necessarily any reason to think that any particular section was characteristically different than any other section. They all kind of filled up about the same.

Rebecca: Did they know the instructor’s name and things ahead of time when they registered?

Kristina: They did. When they registered, they were able to see what the instructor’s name was. But considering that, once again, the sections opened up for registration eight at a time, and these were intro classes that every student needs to take to graduate, we didn’t really think that there was any reason to believe that students would be drawn to any particular instructor, especially since it’s an online course.

John: When we talked about an earlier study, you mentioned that this was sort of like a jobs program for political scientists in Texas.

Kristina: We always joke that Texas, by making it a requirement for students to take two semesters of political science to graduate with a public university degree, passed what we call the Political Science Professor Full Employment Act, because it ensures that we will have many students needing to take our classes in Texas. Unfortunately, now that I’m in California, only one of those classes is required. So, it’s slightly less full employment, although I’m still getting to teach both online and face to face here in California.

Rebecca: Was there both a qualitative and a quantitative component to the current study?

Kristina: So, this one, we focused primarily on the quantitative component. In our earlier study, we spent a lot of time doing text analysis of the comments that we received. In this study, we didn’t do anything quite as rigorous as a full content analysis, in particular because the number of comments was so low. But we did review them, we looked through them, and we did code each one as a positive or a negative comment. And the reason that we did this is because there really shouldn’t have been any reason for any difference in comments whatsoever. Once again, other than the welcome video, students never interacted directly with a professor in a different way. So, for example, if a student emailed a professor and the professor needed to respond, the professor would tell me, as the course coordinator, the messaging that needed to go out… you know, the answer to the question that the student needed, but I would compose that in my own words. So that means that all of the responses were filtered through the way that I would say it, as me, the course coordinator. So, there’s no difference in the kinds of interactions that students had with the content, with the course, or with the professor. And yet, we still found that women received negative comments and men did not. One of the professors who was in the study was laughing and saying he was going to keep his incredibly positive review in his tenure file, because he was told he was the most intelligent, well-spoken, cooperative professor that the students had ever had the chance to encounter. And once again, those were my words. I was the good one. So, the professor was just laughing and saying he was going to include that in his promotion file, even though he didn’t do anything. Whereas for women, we saw comments like “She got super annoyed when people would email her” and “did not come off very approachable or helpful.” It was me, it was always me. They were both hearing my words, but because they were filtered through someone of two different genders, the students perceived them differently. And that’s really consistent with the literature that shows that students expect women to behave in nurturing ways: to be caring, to be helpful and friendly, whereas they view men as competent experts in their field.

John: In terms of the magnitude of the difference, how large was the average effect of the perceived gender of the instructor?

Kristina: So, when we look at just the overall average evaluation score between men and women, we saw about a 0.2 difference. So, on a scale of five, that may or may not be substantively important, and that’s a question that, of course, still remains: whether the 0.2 difference is important in a substantive way. But given that student evaluations are used in promotion, hiring, and pay-grade decisions, any statistically significant difference is concerning, especially in a situation like this where we controlled for everything. When we looked at the white versus non-white difference, just looking at the overall average, we didn’t find a significant difference. Those significant differences didn’t start popping up for ethnicity until we used an OLS regression and included final grades as a control there as well.
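
For readers who want to see what an analysis along these lines might look like, here is a minimal sketch in Python of a difference-of-means test and an OLS regression with a final-grade control. The file name, column names, and data layout are hypothetical illustrations, not the study’s actual dataset or exact model specification.

```python
# Illustrative sketch only: hypothetical file and column names, not the
# study's actual data or exact specification.
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf

# Assumed layout: one row per student evaluation, with columns
#   eval_score  - overall evaluation on a 1-5 scale
#   female      - 1 if the instructor was perceived as a woman, else 0
#   nonwhite    - 1 if the instructor was perceived as non-white, else 0
#   final_grade - the student's final grade in the course
df = pd.read_csv("evaluations.csv")  # hypothetical file name

# Difference-of-means test for perceived instructor gender
women = df.loc[df["female"] == 1, "eval_score"]
men = df.loc[df["female"] == 0, "eval_score"]
t_stat, p_value = stats.ttest_ind(women, men, equal_var=False)
print(f"Mean gap (women - men): {women.mean() - men.mean():.2f}, p = {p_value:.3f}")

# OLS regression adding final grade as a control, the step at which the
# ethnicity difference described above became significant
model = smf.ols("eval_score ~ female + nonwhite + final_grade", data=df).fit()
print(model.summary())
```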

John: How did you measure the students’ perceptions of their instructors’ ethnicity and gender? While gender may often be correctly guessed by watching the instructor’s welcome video, ethnicity may not always be obvious. What did you do to assess this?

Kristina: Absolutely. So, it is a little bit more difficult to decide whether a student will know what ethnicity their professor is. So we did ask about both gender and ethnicity because, of course, gender isn’t always obvious either. We decided to show pictures of the professors to a group of students who were Texas Tech students, but who were not enrolled in any of the courses. We just showed pictures of the instructors and asked the students to tell us what they perceived the person’s gender to be, and whether they perceived the person to be white or non-white. And we used a threshold: if 60% of the students perceived the professor to be non-white, then we said, “Okay, we’ll count this person as non-white,” whether or not they identify that way. For example, we had one professor in the study who is a Hispanic man but has blond hair and blue eyes, and none of the students accurately identified his ethnicity. So, we didn’t count him as non-white in the study, because the students perceived him as being white.
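
To make the perception-based coding rule concrete, here is a tiny illustrative sketch in Python. The instructor labels and shares are made up; the study used photo ratings from Texas Tech students who were not enrolled in the courses.

```python
# Illustrative sketch of the 60% perception threshold described above.
# Instructor names and rating shares are hypothetical.
perceived_nonwhite_share = {
    "Instructor A": 0.85,
    "Instructor B": 0.10,  # e.g., an instructor whom raters perceived as white
    "Instructor C": 0.62,
}

THRESHOLD = 0.60  # at least 60% of raters must perceive the instructor as non-white

coding = {
    name: "non-white" if share >= THRESHOLD else "white"
    for name, share in perceived_nonwhite_share.items()
}
print(coding)  # {'Instructor A': 'non-white', 'Instructor B': 'white', 'Instructor C': 'non-white'}
```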

John: Were the names informative in cases like that?

Kristina: In that case, the name perhaps could be informative… he has a very long and complicated Venezuelan name, but it might not immediately look to students like a Hispanic name. Students might see Garcia or Gomez and think “Hispanic person,” but they might not see Sagarvazu and think “Hispanic person.” Other names that might have given students more of a clue that their instructor was non-white belonged to our Asian faculty. Some of those names could potentially give the students a hint in advance of what ethnicity their instructor was going to be. But again, we don’t really think that students were choosing these online sections based on the professor’s name, especially because students were used to the idea of just taking introduction to political science online at Texas Tech University, and likely weren’t really thinking, “Which professor should I choose?”

Rebecca: So given these results, what should we be doing?

Kristina: You know, I have been saying for a long time that the use of student evaluations in hiring, tenure, promotion, and pay decisions should just be outlawed. It’s absurd that we’re still using them. I understand that there is a need to measure teacher effectiveness, especially in terms of how students are learning. So it’s really important to try and find alternate measures, because student evaluations of teaching are flawed for so many reasons; one being that students aren’t necessarily very good at evaluating their professors’ effectiveness as teachers. Sometimes professors who are really challenging, and perhaps really getting the most out of their students, are also getting some low evaluations. But, most importantly, for employment law purposes, these are discriminatory. If women and faculty of color are being treated differently by these criteria, or evaluated differently, then we need to find a different way to evaluate them…

John: You’ve made them good cases here again, and I think this contributes to the evidence on that. What might you recommend that campuses do to provide evaluations of instruction?

Kristina: I think that’s a really great question. I think that we should start with exploring your evaluations of teaching to see if those suffer from the same biases because they may, and they might not be a better alternative. Other things that might be worth exploring are portfolio-based evaluation… so, allowing professors and teachers to tell their administration why they’re a good teacher, instead of looking for some objective measure of this, I think teachers and professors who are intentional with their practices would be able to put together a really successful portfolio that would show their administration that they are effective. There’s also some talk about using assessment-based measures, things like standardized testing or exit exams or student portfolios. Those might suffer from problems as well. And one thing that I found, especially now as people in the law profession have started reaching out to me for my insight on these kinds of cases, is that it’s really difficult to show in a court case that we should get rid of a discriminatory practice if there’s not an alternative to that practice. So, what attorneys have told me is that, “Yes, maybe they’re discriminatory, but if the university needs to measure teaching effectiveness, and we don’t have a good alternate way to do it, a court is likely to just let it stand.” So, I think it’s really important that our next move in the research agenda is to try and find out what practices might be able to measure effectiveness without suffering from the same bias.

Rebecca: I think that’s a really good point to help us understand the urgency of doing these things, and coming up with alternatives and really what the real impacts are, rather than a small difference in pay or something people might write off as being whatever. But, if things are going into lawsuits and things and then just letting it stand, even though you can demonstrate that it’s biased, then I think that makes it a little more urgent for people who might not be motivated otherwise.

John: And while a 0.2 difference may not seem like much, that’s often a good share of the range from the highest to lowest evaluations in a department. So, in terms of the rank ordering of people, that can make a very significant difference in the perceived quality of their teaching.

Kristina: Especially when departments sometimes use an “Are you above the mean or are you below the mean?” standard… 0.2 could very well kick you above or below the mean in terms of your scores, which, you know, also seems like a really bizarre way to measure whether you’re effective… if you’re above average, then you are; if you’re below average, then you’re not. I’m not really sure that that’s an adequate way to measure anything. But, one thing that we have seen is a couple of universities move toward a different way of evaluating their teaching effectiveness. Ryerson University in Canada recently decided that student evaluations of teaching in their current form could no longer be used because of these discrimination issues. And a university in Oregon, I can’t remember if it was the University of Oregon or Oregon State, but one of them has just moved to a much more open format of teaching evaluations, where students aren’t just saying 2 out of 5 or 4 out of 5. Instead, they’re asked to provide a paragraph with some insight on the effectiveness, and if the questions are worded appropriately, then maybe we can see some really useful feedback, because I know I found a lot of useful feedback in my student comments. Really open-ended comments, I think, can also lead to inappropriate things like comments on appearance or comments on personality, but directed prompts… “What would you change about the workload?” …those kinds of questions… might produce some really valuable feedback.

John: If the questions are on things that are fairly objective that students are qualified to evaluate, that could be helpful.

Rebecca: Sometimes students are really insightful on those things if you’re specific and start with the evidence-based practice. The practice itself isn’t what’s debatable, but how it’s implemented, or the scaffolding, or the timing… those are all things where feedback could be really helpful. And students often have good ideas about these things if you open up a dialogue with them.

Kristina: Exactly. And I think that using student evaluations in this way is helpful to those of us who teach, and I think that comes back down to: what is the purpose of student evaluations? Why are we doing them? If it’s to try and improve our teaching practices, then let’s use them for that purpose. Let’s ask students directed questions where they have a chance to tell us what they liked and didn’t like, and then let us filter those responses to improve what we’re doing. Instead, we’ve almost turned them into this gatekeeping mechanism to keep people from getting promotions, to keep people from getting hired. And it’s especially punishing to our adjuncts. As adjunct professors make up a larger and larger share of the teaching force, the fact that they could be not hired again, or offered fewer classes or no classes at all, just because of a 0.2 difference on their teaching evaluations is really concerning.

Rebecca: It’s also in some ways, a way of advocating for making sure that we spend time in the classrooms with part-time faculty and know what is going on. Sometimes we reserve those classroom visits and informal feedback with our peers to only tenure-track faculty rather than expanding that across part time faculty as well. And I think we can all gain insight from seeing a wider range of teaching practices inside and outside our departments across full-time and part-time faculty,

Kristina: And even letting our part-time faculty conduct some of these peer evaluations. Now that I’m teaching part time, I really see a difference in what it’s like to be part-time faculty. And it’s great in a lot of ways: it gives you a lot of flexibility, and it gives you a lot of time to have fun with your students. And it’s a challenge in a lot of other ways too. But I think that if we open the lines of communication between faculty and students and between different types of faculty, and really nail that down as the purpose of student evaluations, it would help a lot in making them more useful.

John: One of the approaches that some departments have started to use in terms of peer evaluations is not to leave them too open ended, but to have very structured ones. And some of them involve very structured types of observations where you just record what’s happening at fixed time intervals in terms of who is participating, what is the activity, and so forth. And that, at least in theory, should provide a more neutral measure of what’s actually taking place in the classroom, and could also provide more insight into whether evidence-based practices are being used, which could lead to more positive developments in terms of how people are teaching.

Kristina: Yeah, I think that’s really interesting. I think sometimes it can be really difficult to give or receive a truly unbiased peer evaluation because it’s really easy to start saying, “Oh, the students looked like they were having fun.” What does that mean? That’s not really objective. But I think it’s also important to recognize that a 1 to 5 scale of students saying this teacher is effective is also not objective in any way. So, the idea of there being an objective measure of teaching effectiveness, I think we should move away from that idea.

Rebecca: That’s a lot of food for thought.

Kristina: A lot of tea for thought. [LAUGHTER]

John: That’s true.

But, this is coming from more and more directions now. Several disciplinary associations have issued statements recommending that student evaluations of teaching not be used as primary instruments in promotion and tenure decisions. And I think we’re going to be seeing more of that, especially as the research base grows.

Kristina: And there is some good news for listeners who might be wondering, in the meantime, what can we do about this? How can I help? One recent article describes a sort of small quasi-experiment, where the authors gave their students some information about this research before having them fill out their student evaluation forms. So, they just briefly told the students that student evaluations may sometimes be biased based on race, gender, or ethnicity. And they found that this was able to mitigate some of that bias. So, in the meantime, if we’re looking for ways to address this, it’s especially important for our allies who are white and who are men to be advocates in this… to take the time in their classes to say there’s evidence that these evaluations may be biased in favor of a certain kind of faculty member. If we can make sure that messaging is getting out there from the right people who can help, then we can start to mitigate some of that bias.

John: We’ll share a link to that study in our show notes.

Kristina: You know, being a white woman myself, I am of course more comfortable and qualified to talk about gender bias. Hopefully we can get more faculty members of color to join us in this research agenda, because it’s meaningful for them as well; our research is starting to show that this bias exists for them too. And there’s simply not enough discussion of that in the conversation. One thing that we did not publish in our study, because it was just a side question: when we were asking students what they perceived the gender and race of the people in the pictures to be, we threw in a question just for fun, “Do you think you would have difficulty understanding this professor’s English?” because one thing that we hear so many times from our colleagues with accents is that this comes up regularly in their evaluations. And what we found is that for our Asian faculty members, the students… I mean, not 100%, but the vast majority of the students said, “Yes, I think I’ll have trouble understanding this faculty member’s English.” And some of our Asian faculty members speak with heavily accented English and some don’t. And interestingly, our Hispanic colleague that I mentioned earlier, with blond hair and blue eyes, has a very thick Venezuelan accent, and no students were concerned about being able to understand his English. So, I think these elements need to be brought into the conversation as well. And I hope to see people who are closer to that discussion, and for whom it might be more meaningful, join in and start doing this research. If there are any co-authors out there, I’m happy to start a new study.

John: The effects you found for ethnicity were relatively weak compared to the effects for gender. But, with a larger sample size, you might be able to get more robust or stronger results on that.

Kristina: Absolutely. So in our difference-of-means test, ethnicity didn’t come out as significant. It did come out as significant in our regression, but the substantive effect was a little lower.

John: And you were unable to do interactions because of the size of the sample, right?

Kristina: We only had one non-white woman. And so I don’t think our statistical analysis program would have been very kind to us with only one observation in our interaction term.

Rebecca: So, we always wrap up, Kristina, by asking, as you know: what’s next?

Kristina: That’s a great question. My current position is in K-12 science curriculum. So I still teach part time, but I’m heavily involved in the curriculum world at the K-12 level now. And one thing that’s been really different is that K-12 teaching is definitely more dominated by women than higher education is and I would love to start looking at how we can get our K-12 students to be primed to think about women and men as equal in the sciences, because thinking about their high school teachers as their teachers, and then they go to college and they see men as professors could potentially continue to exacerbate those biases. So, I’d really love to start doing some research and exploring how we can change our children’s attitudes towards women in the sciences from the ground up.

Rebecca: That sounds really interesting.

John: And it’s important work and that’s an area where we certainly could see a lot of improvements.

Rebecca: Well, thank you for joining us. As always, an interesting conversation and many things for us to be thinking about and taking action on.

Kristina: Thank you. Always a pleasure to join.

[MUSIC]

John: If you’ve enjoyed this podcast, please subscribe and leave a review on iTunes or your favorite podcast service. To continue the conversation, join us on our Tea for Teaching Facebook page.

Rebecca: You can find show notes, transcripts and other materials on teaforteaching.com. Music by Michael Gary Brewer.

[MUSIC]