Making Judge Feedback More Representative

Maja Cimerman, Calum Worsley and Tomas Beerthuis

Good judging is a crucial part of any tournament, and a good adjudicator needs many skills. In general, we say a good judge is able to accurately understand and describe the debate as it happened, objectively evaluate and comparatively weigh the contributions of each team, and participate constructively in a panel discussion while also allowing other judges to voice their views. It is difficult for CA-teams to know how good someone is at all of these different skills. Feedback on judges (teams on chairs, chairs on wings and wings on chairs) is one of the few ways to assess these attributes and help determine the quality of a judge. That makes feedback an essential tool for the debating community in raising the overall quality of judging at our competitions.

Last summer, the European Universities Debating Championships took place in Zagreb from August 18 to August 23. During this yearly event (with 9 preliminary rounds and a break to Quarterfinals for both ESL and Open teams), the CA-team and the Tab-team put in place a feedback system to evaluate judges. In every open round, teams could give feedback on their chair judges (through a virtual or physical form). In all of the rounds, chairs gave feedback on their wings and wings on their chairs. This led to 1777 pieces of feedback being submitted to the tabroom. In this article we (Maja Cimerman and Tomas Beerthuis, DCAs, and Calum Worsley, Tab Co-ordinator) would like to share what we found and what we have learnt from it. In this way we hope to make feedback in the debating community more effective and, through that, help improve the quality of judging.

What did we do with this feedback?

Let’s start by saying that every piece of feedback was looked at by a member of the CA-team. We can assure you that we were very much sleep deprived, but also that this helped us tremendously in determining how judges were performing at our competition. Feedback at Euros worked in the following way:

– Every piece of feedback was submitted to our system. In this system we could look at the scores on a set of determinants for every individual judge for each round. This allowed us to establish whether the ranking we had allocated to a judge was consistent with the score or if the former needed to be raised/lowered. By that we mean if a judge received very poor feedback when chairing, then this would be a reason to make this person a wing and look at their judging with more scrutiny.

– Next to that, we closely inspected very high and very low ratings every round, to understand the extreme cases (and take appropriate action where necessary).

– We also inspected comments closely, to ensure we learned more about our judges (particularly those that none of us knew from previous competitions).

– Every round, 2 members of the CA-team would ‘sit-out’ (not judge) in order to look at feedback and determine if the rankings of judges needed to be changed.

Looking at so much data and especially putting it all together and analysing it after the tournament gave us some insights into how people give feedback and how useful feedback is at (large) competitions. We found a number of things that are valuable to share and may help to improve the quality of feedback for future competitions.

Finding #1: People do not use the scale

For every question, irrespective of the specific content asked, respondents could choose from a 1 to 5 scale (with 1 being the lowest score and 5 being the highest score). Looking at the results of our feedback forms, we realised 5 was a disproportionately popular answer across all questions asked, indicating that people start their evaluation at 5 and work down from there (see Graph 1). At best this kind of scale can tell us something about judges that people are really dissatisfied with, but it fails to differentiate among good judges, meaning it has little value in determining which judges should break. Any judge of average quality would receive a 5, but an absolute top judge would also receive a 5. On the other side of the spectrum we can interpret 1’s as judges people are really dissatisfied with, but it is not clear what 2’s, 3’s and 4’s mean. While some respondents may use the full scale, the fact that it is not used in the same way by all respondents skews the results. This makes it very hard to determine the relative difference between judges, apart from the extremes. And even with the extremes, people tend to go to the ‘1’ very quickly (perhaps sometimes out of resentment), which may not be an accurate reflection of the person judging.
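The kind of tally behind this observation can be sketched in a few lines. This is only an illustration under stated assumptions, not the analysis run on the Zagreb data: the `scores` list is invented, and `score_distribution` is a hypothetical helper name.

```python
from collections import Counter

def score_distribution(scores, scale=(1, 2, 3, 4, 5)):
    """Return the share of responses at each point of the scale."""
    if not scores:
        raise ValueError("no scores to summarise")
    counts = Counter(scores)
    total = len(scores)
    return {point: counts.get(point, 0) / total for point in scale}

# Invented example data (NOT taken from the Zagreb feedback forms).
scores = [5, 5, 4, 5, 3, 5, 1, 5, 4, 5]

distribution = score_distribution(scores)
# The scale point chosen most often by respondents.
most_common_score = max(distribution, key=distribution.get)

for point, share in distribution.items():
    print(f"{point}: {share:.0%}")
print("most common answer:", most_common_score)
```

If the most common answer sits at the top of the scale, as in this made-up sample, the scale leaves no room to separate solid judges from exceptional ones.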

To address this, we propose rethinking how we define the answer scale, making 3 the response that would be expected most frequently and also closest to the average response. This seems more logical, because it allows CA-teams to better understand differences between judges. 3 would be the rank you would give to most judges who perform as expected, indicating the judge was solid. 5 would be the rank for an exceptional judge and 1 would be the rank for a judge you would be really dissatisfied with. While this might require a bit of redefining how we think about judges (a mental shift towards awarding a good judge a 3 rather than a 5), it is actually something we already do, very successfully, with speaker points, where the distribution is very close to a normal distribution.

To implement such a change 2 things need to be done:

  1. The feedback scale should be revised and explicitly included and explained in both speakers and judges briefings. Raising more awareness with participants on how to use the system will help contribute to making this mental shift.
  2. The scale on the feedback forms should be adjusted to reflect the discussion. This is an on-going process, and different scales might be used [1], but the authors of this article are most fond of keeping the 1-5 scale, while adding a description of each of the values rather than focusing on the number. Obviously this would depend on the question, but we see it as something like:

How well did the judge explain the reasoning of the decision?

[] Poor performance (Poor explanation of the debate. Did not agree with their reasoning of the ranking at all.)

[] Acceptable (Somewhat acceptable reasoning explaining their decision. Was not fully convinced by their explanation of the ranking.)

[] Meets expectations (Good reasoning explaining their point of view. I could see and understand why they decided as they did.)

[] Exceeds expectations (Great reasoning explaining their point of view. I was convinced that was the correct reading of the debate.)

[] Top performance (Excellent explanation of the debate. Not only did I fully agree with their explanation, it gave me new insight in the debate.)

Although the system would still capture ‘Poor performance’ as a 1, this way of framing feedback would trigger people to think in a more nuanced way about the actual performance of a judge rather than thinking about a number. Sometimes there is a tendency for people to give a 5 when they are satisfied, but that doesn’t always adequately capture the performance of the judge. This is a way to make feedback more consistent across the board and give the CA teams more useful information on the quality of judges.
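How the descriptive answers could still be stored as numbers can be sketched as follows. This is an assumption about how a tab system might do it, not a description of the system used in Zagreb; `LABEL_TO_SCORE` and `to_score` are hypothetical names, and the mapping simply follows the order of the answers listed above.

```python
# Hypothetical mapping from the descriptive labels to the stored 1-5 score.
LABEL_TO_SCORE = {
    "Poor performance": 1,
    "Acceptable": 2,
    "Meets expectations": 3,
    "Exceeds expectations": 4,
    "Top performance": 5,
}

def to_score(label):
    """Convert a descriptive answer to its numeric score."""
    try:
        return LABEL_TO_SCORE[label]
    except KeyError:
        raise ValueError(f"unknown answer label: {label!r}") from None

# Invented example answers, as a respondent might tick them on the form.
answers = ["Meets expectations", "Top performance", "Acceptable"]
numeric = [to_score(answer) for answer in answers]
print(numeric)
```

Respondents only ever see the wording, while the CA team can keep working with the same 1-5 numbers as before.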

The same descriptive scale can be applied to the majority of other questions as well by simply reformulating their grammatical structure, while keeping the same content of the questions. For example the current question “Was this person active in the discussion?” could be changed to: “How helpful was this person in the discussion (for reaching the final decision)?”. Along with the structure of the questions, the answers would obviously be changed as well, so that answer number 3 is the one we expect to be the most common or average. For the specific example above:

How helpful was this person in the discussion (for reaching the final decision)?

[] Poor performance (Mostly disruptive or not involved at all.)

[] Acceptable (Only somewhat helpful and/or barely involved.)

[] Meets expectations (Helpful and active in the discussion.)

[] Exceeds expectations (Very good contribution to the discussion, all relevant and excellent.)

[] Top performance (Great contribution, changed some of my views of the debate.)

Finding #2: Your ranking in a debate determines what kind of feedback you are going to give

For a community that prides itself on reasoning and critical thinking, it is interesting to see the role emotions play when giving feedback. More specifically, the data show (see Graph 1) that 1st placed teams give feedback which almost exclusively evaluates judges positively, 2nd placed teams are a bit more critical of their judges, 3rd placed teams even more so, and 4th placed teams are most likely to give judges bad feedback (the only group where “1” was the most common answer). This might be unsurprising, given that the worst placed teams were probably least happy with the outcome of the adjudication and the best ranked teams were the happiest. However, it also means this kind of feedback tells us little about the actual quality of the judge.

Graph 1: Frequency of responses on a scale of 1-5 for judge evaluation questionnaires, by answering group. [CoW = Chair on Wing, WoC = Wing on Chair, ToC = Team on Chair, ToC 1st = 1st ranked Team on Chair, ToC 2nd = 2nd ranked Team on Chair, ToC 3rd = 3rd ranked Team on Chair, ToC 4th = 4th ranked Team on Chair]

We already take a team’s position into account when weighing their feedback: in the feedback module, team feedback is always shown alongside the position the team took in that round, for the CA team’s information. This data possibly calls for even greater consideration of a team’s position in determining the value of the feedback they give us. For instance, a first ranked team delivering horrible feedback on a judge warrants closer attention from the CA team than a first ranked team praising the judge.
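One possible way to make this consideration concrete is to compare each score with what teams finishing in the same position typically give. The sketch below is only an illustration of that idea, not the method used by the CA team: the feedback records are invented, and `position_baselines` and `deviation` are hypothetical helper names.

```python
from statistics import mean

# Invented example feedback: (team position in the round, score given to the chair).
feedback = [
    (1, 5), (1, 5), (1, 4), (1, 2),
    (2, 4), (2, 5), (2, 3),
    (3, 3), (3, 4), (3, 2),
    (4, 1), (4, 3), (4, 2),
]

def position_baselines(feedback):
    """Average score given by teams finishing in each position."""
    by_position = {}
    for position, score in feedback:
        by_position.setdefault(position, []).append(score)
    return {position: mean(scores) for position, scores in by_position.items()}

def deviation(position, score, baselines):
    """How far a score sits from what teams in that position usually give."""
    return score - baselines[position]

baselines = position_baselines(feedback)

# The same score of 2 is compared against two different baselines.
first_place_low = deviation(1, 2, baselines)
fourth_place_low = deviation(4, 2, baselines)
print(first_place_low, fourth_place_low)
```

In this made-up sample, a 2 from a 1st placed team sits further below its group's baseline than a 2 from a 4th placed team, which is the kind of case that deserves a closer look.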

However, adjusting the weight of feedback based on ranking will do little to tackle the real problem: on average, when teams win, they applaud their judge, and when they lose, they punish the judge with bad feedback. This is something that needs to be seriously discussed and considered within the community (possibly even with a debaters’ briefing that flags the role their emotions play, so they might be more vigilant about them); otherwise there is little value in reading, triaging and entering the feedback we get from teams. Although emotions in debating competitions are normal, we should realize that this (currently) is seriously affecting what kind of feedback people give their judges. We should also realize that complex debates (with sometimes unsatisfying outcomes) may further trigger this effect. All of this undermines the credibility of feedback and makes it more difficult to evaluate the performance of judges. In turn, this makes it more difficult for CA-teams to adjust the rank of a judge appropriately, which again has an effect on the quality of judging at the competition.

Some other comments

We would also like to add some pragmatic issues of incorporating feedback into judge evaluation. These do not stem from empirical analysis of the feedback; rather, they reflect issues we stumbled upon when looking at it.

a. In retrospect, we found the questions to chairs regarding their wings about the participation in the discussion (Was this person active in the discussion?) less useful, as a wing judge might get 1 on all other questions and 5 on this question. We believe a better phrasing might be: How helpful was this person in the discussion? (Something we have already discussed in Finding #1.) This way we could possibly also scrap the question about how willing they are to compromise (If you disagreed, did they show willingness to take your view on board?) and overall reduce the number of questions.

b. In terms of Wings on Chair feedback we realised some wings got confused by the initial call question (On reflection, do you think this person’s initial call was reasonable?), as some chairs do not disclose their ranking during the discussion. We propose either scrapping the question or reducing its relative importance.

c. Some things to look out for when interpreting feedback:

Feedback should not be judged only by the aggregate score; we should look at scores for individual questions/rounds and see what these tell us. For example:

  1. A fresher who received phenomenal feedback as a wing but terrible feedback as a chair might be a really good judge, but inexperienced or unconfident as a chair. If this person were to break as a talent, this could very much contribute to their development, making them a potential chair at a future competition.
  2. A chair who consistently scores very low on taking other judges seriously should probably not be chairing (outrounds), because they will be too dominant in the discussion and thus might stifle it.
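The difference between an aggregate score and a breakdown can be illustrated with a small sketch. Everything here is assumed for the example: the records are invented, the question names are shorthand, and `breakdown` and `role_averages` are hypothetical helpers rather than features of any real tab software.

```python
from statistics import mean

# Invented example records: (judge, role in that round, question, score).
records = [
    ("Judge A", "wing", "reasoning", 5),
    ("Judge A", "wing", "helpful in discussion", 5),
    ("Judge A", "chair", "reasoning", 2),
    ("Judge A", "chair", "took other judges seriously", 1),
    ("Judge B", "chair", "reasoning", 4),
    ("Judge B", "chair", "took other judges seriously", 4),
]

def breakdown(records, judge):
    """Average score per (role, question) for one judge."""
    grouped = {}
    for name, role, question, score in records:
        if name == judge:
            grouped.setdefault((role, question), []).append(score)
    return {key: mean(scores) for key, scores in grouped.items()}

def role_averages(records, judge):
    """Average score per role, to spot a gap between chairing and winging."""
    grouped = {}
    for name, role, _question, score in records:
        if name == judge:
            grouped.setdefault(role, []).append(score)
    return {role: mean(scores) for role, scores in grouped.items()}

judge_a = role_averages(records, "Judge A")
overall_a = mean(score for name, _r, _q, score in records if name == "Judge A")
print("overall:", overall_a, "by role:", judge_a)
```

For the invented "Judge A", the single overall average looks middling, while the per-role view shows strong feedback as a wing and weak feedback as a chair, which is the distinction the first example above relies on.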

Conclusion

Reading and evaluating feedback is time consuming, especially when the aggregate score is insufficient for a holistic evaluation and relevant information needs to be extracted from minor scores and specific answers. This often results in lengthy discussions about the merit of a specific piece of feedback, which takes too great a toll on the CA team’s time in such a fast paced tournament. Thus a different way of doing and interpreting feedback is necessary. Some of the changes we discussed touch on how we ask questions and others touch on a mental shift that is necessary in the debating community to make feedback a little bit more reasonable. This article provides some suggestions on how to do that; however, we see it as an ongoing process in which the discussions we have within the community will play a crucial role.

[1] For example, a 1-9 scale or a Likert scale.