Text Tab

Saad Amjad

Debate tournaments, especially big ones, can be messy when it comes to disseminating information, be it the tab, the motion for the round, or other major announcements – especially with intense pre-debate prep demanding everyone's attention.

In many instances debaters find it difficult to note down their rooms, the names of the other teams and the correct wording of the motion. Information slides exacerbate the problem, especially if any word requires further clarification, and running the draw multiple times contributes to significant delays. Projector displays and oral announcements require the audience to be attentive the whole time, which can be both inconvenient and stressful, and there is always the risk of someone missing an important announcement for any number of reasons, especially towards the end of the day when there is a rush to get back to the buses on time.

A recent development in the Bangladesh debate circuit has tackled these issues with ingenuity & pragmatism.

Built by a team of five developers headed by Nazmus Sakib of the Islamic University of Technology, TexTab has been quite a success in its short stint on the local circuit, playing a substantial role in distributing essential information to the relevant parties and keeping tournaments running efficiently.

TexTab ensures that each individual receives the information they need at the right time and, in the case of announcements, spares them from having to remember things said before the late-night party started. Most importantly, it gives the adjudication core and organizing committee greater flexibility to adjust the schedule as the tournament progresses, since there is no longer any worry about people missing an announcement.

So how does it work?

TexTab is a personalized, SMS-integrated debate tabulation system. When deployed, it sends text messages to each participant's mobile phone via bulk SMS on their carrier. Currently the system sends three types of information – the draw for each round (Image 1), the motion for each round along with any accompanying information slides (Image 2), and alerts such as reporting notices (Image 3). The system also supports custom sender masks, meaning the name of the tournament can appear as the SMS sender ID.

In terms of compatibility, the system operates seamlessly with Tournaman and Tabbie. The team is adept at running BP tournaments; however, the dearth of Asians and Australs format tournaments in this part of the world means that development for 3v3 formats is still in its initial stages.

Image 1: Sample SMS – Personalized draw for debater
Image 2: Sample SMS – Motion of specific round
Image 3: Sample SMS – Alert for reporting

About the operations      
The system consolidates three programs into one platform – the tabulation software, the TexTab data processor and the TexTab transmitter application. For the draw SMS, all three work together.

For motion and alert SMS, only the transmitter app is required. The steps are outlined in the illustration below:

[Illustration: TexTab workflow – from draw generation to SMS delivery]

For an example of the time involved: in a 40-team tournament, the average time between the end of draw generation and the SMS arriving on participants' phones is 12 minutes.
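
To make the workflow concrete, here is a minimal illustrative sketch (not TexTab's actual code) of how a draw exported from the tab software might be turned into personalized SMS payloads. The file name, column layout and the send_sms gateway call are assumptions for the example.

```python
import csv

def send_sms(sender_id, phone, text):
    """Placeholder for the carrier's bulk-SMS gateway; the sender_id acts as
    the custom mask shown on recipients' phones."""
    print(f"[{sender_id}] -> {phone}: {text}")

def draw_messages(draw_csv):
    """Read a draw exported from the tab software and build one SMS per speaker.

    Assumed columns: round, room, team, position, speaker, phone."""
    messages = []
    with open(draw_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            text = (f"Round {row['round']}: {row['team']} are "
                    f"{row['position']} in room {row['room']}.")
            messages.append((row["phone"], text))
    return messages

if __name__ == "__main__":
    for phone, text in draw_messages("round1_draw.csv"):
        send_sms("IUT IV", phone, text)
```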

Once recruited for a tournament, the TexTab team sets up the database and tab room beforehand. One assignment team consists of a Tab Director and two TexTab executives. The team takes care of all tabulation-related tasks – briefing and directing runners, coordinating with the organizers and the adjudication core, and providing tab feedback, a break summary and tabulation reports at the end. The approach is to let the organizing committee completely outsource everything related to tabulation and focus on other priorities, like managing food and drinks and ensuring the debates run on time. It also spares the organizing committee from having to train volunteers specifically for ballot- and feedback-form-related tasks.

Other platforms:

There are several considerations when comparing SMS with WiFi-based delivery, such as the reliability and availability of internet connections, costs, SIM cards, and the need to develop iOS and Android apps. There are plans to expand the system to cover alternative communication modes and platforms in future, with a WiFi-based app receiving particular focus.

Challenges

Still in the infancy of both its development and operation, the software does have its hitches, however minor. Most of the issues are purely logistical. In some instances, an SMS went to the wrong person because a mobile number was entered incorrectly, or because the individual had swapped into a different team combination from the one originally submitted.

In a few other instances, SMS messages to some recipients were delayed by 15-20 minutes; this author received a long information slide on a gas deal during the Prime Minister's speech (much obliged).

However, these delays were purely down to the connectivity and network quality of the local carriers, and were not faced by participants on better service packages at the same events.

So far, the system has been used successfully at two major local tournaments, the Dhaka University IV 2014 and the IUT IV 2014, each with more than 11 debate rooms and over 8 rounds of debate, with no further issues reported.

In terms of scaling up, the only remaining concern is international tournaments, where some participants may have limited or no access to a working SIM card.

However, this can be partially circumvented by having each team or institution submit one working number, giving them access to all the announcements regarding the tournament.

The system has been tested and found reliable with carriers in a number of countries, so it remains a valid option for regional or national-level tournaments. Realistically, though, a WiFi-based app will need to become the default platform for the system to function fully at international events.

Message from the developers:

“The field of tabulation should evolve with the pace of technology and scale of tournaments. Investment is required to encourage young developers to create new methods in tab system, analysis and service. We designed TexTab as a model that changes the way tabulation is engaged intellectually and incentivizes creativity. As such, the TexTab system and service is a commercial one.

Our tab directors and executives are always excited to volunteer at tournaments where just the tabulation software is required, at no-cost. Subscribing to the full package of SMS-integrated TexTab, however, has its costs. We bill a fee which covers service- and SMS-charges. We are looking to try our system on a broader scope, to both cater to the needs of the tournaments, and to learn the specific requirements of participants and tailor our system to suit them better in the future.

We eagerly await feedback from the international community.” – Nazmus Sakib, Developer, TexTab.

Contact Us

Email: nsakib002@gmail.com

Call: +880-175-553-0753

Making Judge Feedback More Representative

Maja Cimerman, Calum Worsley and Tomas Beerthuis

Good judging is a crucial part of any tournament. There are many skills a good adjudicator should have: in general, we say a good judge is able to accurately understand and describe the debate as it happened, to objectively evaluate and comparatively weigh the contributions of each of the teams, and to participate constructively in a panel discussion while also allowing other judges to voice their views. It is difficult for CA-teams to know how good someone is at all of these different skills. Feedback on judges (teams on chairs, chairs on wings and wings on chairs) is one of the only ways to assess these attributes and help determine the quality of a judge. That makes feedback an essential tool in the debating community for furthering the overall quality of judging at our competitions.

Last summer, the European Universities Debating Championships took place in Zagreb from August 18 to August 23. During this yearly event (with 9 preliminary rounds and a break to quarterfinals for both ESL and Open teams), the CA-team and the Tab-team put in place a feedback system to evaluate judges. In every open round, teams could give feedback on their chair judges (through a virtual or physical form). In all rounds, chairs gave feedback on their wings and wings on their chairs. This led to 1777 pieces of feedback being submitted to the tabroom. In this article we (Maja Cimerman and Tomas Beerthuis, DCAs, and Calum Worsley, Tab Co-ordinator) would like to share what we found and what we have learnt from it. In this way we hope to make feedback in the debating community more effective and, through that, help improve the quality of judging.

What did we do with this feedback?

Let’s start by saying that every piece of feedback was looked at by a member of the CA-team. We can assure you that we were very much sleep deprived, but also that this helped us tremendously in determining how judges were performing at our competition. Feedback at Euros worked in the following way:

–       Every piece of feedback was submitted to our system. In this system we could look at the scores on a set of determinants for every individual judge for each round. This allowed us to establish whether the ranking we had allocated to a judge was consistent with their scores, or whether it needed to be raised or lowered. By that we mean that if a judge received very poor feedback when chairing, this would be a reason to make that person a wing and look at their judging with more scrutiny.

–       Next to that, we closely inspected very high and very low ratings every round, to understand the extreme cases (and take appropriate action where necessary).

–       We also inspected comments closely, to ensure we learned more about our judges (particularly those that none of us knew from previous competitions).

–       Every round, 2 members of the CA-team would ‘sit-out’ (not judge) in order to look at feedback and determine if the rankings of judges needed to be changed.

Looking at so much data and especially putting it all together and analysing it after the tournament gave us some insights into how people give feedback and how useful feedback is at (large) competitions. We found a number of things that are valuable to share and may help to improve the quality of feedback for future competitions.

Finding #1: People do not use the scale

For every question, and irrespective of the specific content asked, respondents could choose from a 1 to 5 scale (with 1 being the lowest score and 5 the highest). Looking at the results of our feedback forms, we realised 5 was a disproportionately popular answer across all questions, indicating that people start their evaluation at 5 and work down from there (see Graph 1). At best this kind of scale can tell us something about judges that people are really dissatisfied with, but it fails to differentiate among good judges, meaning it has little value in determining the judges who should break. Any judge of average quality would receive a 5, but an absolute top judge would also receive a 5. On the other side of the spectrum we can interpret 1s as judges people are really dissatisfied with, but it is not clear what 2s, 3s and 4s are. While some respondents may use the full scale, the fact that it is not used equally across all respondents skews the results. This makes it very hard to determine the relative difference between judges, apart from at the extremes. And even with the extremes, people tend to reach for a '1' very quickly (perhaps sometimes out of resentment), while that may not be an accurate reflection of the person's judging.

To address this, we propose rethinking how we define the answer scale, making 3 the response that would be expected most frequently and also the one closest to the average response. This seems more logical, because it allows CA-teams to better understand the differences between judges. A 3 would be the score you give to most judges who perform as expected, indicating the judge was solid. A 5 would be the score for an exceptional judge and a 1 the score for a judge you were really dissatisfied with. While this might require a bit of redefining how we think about judges (a mental shift to awarding a good judge a 3 rather than a 5), it is actually something we already do, very successfully, with speaker points, where the distribution is very close to a normal distribution.

To implement such a change 2 things need to be done:

  1. The feedback scale should be revised and explicitly included and explained in both the speakers' and judges' briefings. Raising participants' awareness of how to use the system will help bring about this mental shift.
  2. The scale on the feedback forms should be adjusted to reflect this discussion. This is an ongoing process and different scales might be used [1], but the authors of this article are most fond of keeping the 1-5 scale while adding a description of each of the values rather than focusing on the number. Obviously this would depend on the question, but we see it as something like:

How well did the judge explain the reasoning of the decision?

[] Poor performance (Poor explanation of the debate. Did not agree with their reasoning of the ranking at all.)

[] Acceptable (Somewhat acceptable reasoning explaining their decision. Was not fully convinced by their explanation of the ranking.)

[] Meets expectations (Good reasoning explaining their point of view. I could see and understand why they decided as they did.)

[] Exceeds expectations (Great reasoning explaining their point of view. I was convinced that was the correct reading of the debate.)

[] Top performance (Excellent explanation of the debate. Not only did I fully agree with their explanation, it gave me new insight in the debate.)

Although the system would still capture ‘Poor performance’ as a 1, this way of framing feedback would trigger people to think in a more nuanced way about the actual performance of a judge rather than thinking about a number. Sometimes there is a tendency for people to give a 5 when they are satisfied, but that doesn’t always adequately capture the performance of the judge. This is a way to make feedback more consistent across the board and give the CA teams more useful information on the quality of judges.
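
As a minimal sketch of how such a descriptive form could still feed numeric values into the tab system (the label wording and the storage format below are assumptions for illustration only):

```python
# Hypothetical mapping from descriptive options to the stored 1-5 values.
SCALE = {
    "Poor performance": 1,
    "Acceptable": 2,
    "Meets expectations": 3,
    "Exceeds expectations": 4,
    "Top performance": 5,
}

def record_answer(question, label):
    """Store the descriptive answer alongside its numeric value;
    respondents only ever see the label."""
    return {"question": question, "label": label, "value": SCALE[label]}

print(record_answer("How well did the judge explain the reasoning of the decision?",
                    "Meets expectations"))  # value stored as 3
```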

The same descriptive scale can be applied to most other questions as well, by simply reformulating their grammatical structure while keeping their content. For example, the current question “Was this person active in the discussion?” could be changed to “How helpful was this person in the discussion (for reaching the final decision)?”. Along with the structure of the questions, the answers would obviously change as well, with answer number 3 being the one we expect to be the most common or average. For the specific example above:

How helpful was this person in the discussion (for reaching the final decision)?

[] Poor performance (Mostly disruptive or not involved at all.)

[] Acceptable (Only somewhat helpful and/or barely involved.)

[] Meets expectations (Helpful and active in the discussion.)

[] Exceeds expectations (Very good contribution to the discussion, all relevant and excellent.)

[] Top performance (Great contribution, changed some of my views of the debate.)

Finding #2: Your ranking in a debate determines what kind of feedback you are going to give

For a community that prides itself on reasoning and critical thinking, it is interesting to see the role emotions play in giving feedback. More specifically, the data show (see Graph 1) that 1st-placed teams give feedback which almost exclusively evaluates judges positively, 2nd-placed teams are a bit more critical of their judges, 3rd-placed teams even more so, and 4th-placed teams are the most likely to give judges bad feedback (the only group for which “1” was the most common answer). This might be unsurprising, given that the worst-placed teams were probably least happy with the outcome of the adjudication and the best-ranked teams the happiest; however, it also means this kind of feedback tells us little about the actual quality of the judge.

Graph 1: Frequency of responses on a 1-5 scale for judge evaluation questionnaires, by answering group. [CoW = Chair on Wing, WoC = Wing on Chair, ToC = Team on Chair, ToC 1st = 1st-ranked Team on Chair, ToC 2nd = 2nd-ranked Team on Chair, ToC 3rd = 3rd-ranked Team on Chair, ToC 4th = 4th-ranked Team on Chair]

We already control for a team's position when weighing their feedback: in the feedback module, team feedback is always displayed next to the position the team took in that round, for the CA team's information. This data possibly calls for even greater consideration of a team's position when determining the value of the feedback they give us. For instance, a first-ranked team delivering terrible feedback on a judge warrants more attention from the CA team than a first-ranked team praising the judge.
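
To illustrate one possible way of weighting feedback by the position of the team that gave it (this is a sketch of the idea only, not the module actually used at Zagreb; the bias values and the boost factor are assumptions):

```python
# Feedback that runs against a team's presumed self-interest (a winning team
# complaining, a losing team praising) is treated as more informative.
EXPECTED_BIAS = {1: 1.0, 2: 0.5, 3: -0.5, 4: -1.0}  # positive = expected to praise

def weighted_feedback(scores):
    """scores: list of (team_rank_in_round, score_on_1_to_5_scale) tuples."""
    total, weight_sum = 0.0, 0.0
    for rank, score in scores:
        surprise = -(score - 3) * EXPECTED_BIAS[rank]  # high when against self-interest
        weight = 1.0 + 0.5 * max(0.0, surprise)        # boost surprising feedback
        total += weight * score
        weight_sum += weight
    return total / weight_sum

# A 4th-placed team awarding a 5 carries more weight than a 1st-placed team doing so.
print(round(weighted_feedback([(1, 5), (4, 5), (2, 3)]), 2))
```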

However, adjusting the weight of feedback based on ranking will not go far towards tackling the real problem: on average, when teams win they applaud their judge, and when they lose they punish the judge with bad feedback. This is something that needs to be seriously discussed and considered within the community (possibly even with a debaters' briefing to flag the role emotions play, so people can be more vigilant about them); otherwise there is little value in reading, triaging and entering the feedback we get from teams. Although emotions at debating competitions are normal, we should realize that this is currently seriously affecting the kind of feedback people give their judges. We should also realize that complex debates (with sometimes unsatisfying outcomes) may further trigger this effect. All of this distorts the credibility of feedback and makes it more difficult to evaluate the performance of judges. In turn, this makes it more difficult for CA-teams to adjust a judge's rank appropriately, which again has an effect on the quality of judging at the competition.

Some other comments

We would also like to add some pragmatic issues around incorporating feedback into judge evaluation. These do not stem from empirical analysis of the feedback; rather, they reflect issues we stumbled upon while looking at it.

a. In retrospect, we found the question to chairs regarding their wings' participation in the discussion (“Was this person active in the discussion?”) less useful, as a wing judge might get a 1 on all other questions and a 5 on this one. We believe a better phrasing might be “How helpful was this person in the discussion?” (something we have already discussed in Finding #1). This way we could possibly also scrap the question about willingness to compromise (“If you disagreed, did they show willingness to take your view on board?”) and reduce the overall number of questions.

b. In terms of Wings on Chair feedback we realised some wings got confused by the initial call question (On reflection, do you think this person’s initial call was reasonable?), as some chairs do not disclose their ranking during the discussion. We propose either scrapping the question or reducing its relative importance.

c. Some things to look out for when interpreting feedback:

Feedback should not be judged only by the aggregate score; we should look at the scores for individual questions and rounds and see what these tell us. For example:

  1. A fresher who received phenomenal feedback as a wing but terrible feedback as a chair might be a really good judge who is simply inexperienced or unconfident as a chair. If this person were to break as a talent, it could contribute greatly to their development, making them a potential chair at a future competition.
  2. A chair who consistently scores very low on taking other judges seriously should probably not be chairing (out-rounds), because they will be too dominant in the discussion and might stifle it.

Conclusion

Reading and evaluating feedback is time-consuming, especially when the aggregate score is insufficient for a holistic evaluation and relevant information needs to be extracted from individual scores and specific answers. This often results in lengthy discussions about the merit of a specific piece of feedback, which takes too great a toll on the CA team's time at such a fast-paced tournament. A different way of collecting and interpreting feedback is therefore necessary. Some of the changes we discussed touch on how we ask questions, and others on a mental shift that is needed in the debating community to make feedback a little bit more reasonable. This article provides some suggestions on how to do that; however, we see it as an ongoing process in which the discussions we have within the community will play a crucial role.

  1. Such as a 1-9 scale or a Likert scale.

Introducing Elo Ratings in British Parliamentary Debating

Ashish Kumar, Michael Goekjian and Richard Coates

Introduction: The Elo Ranking System [1]

Debating is a competitive hobby. Part of the pleasure of debating comes from being able to know how good one is compared to other debaters. This explains, inter alia, speaker and team tabs. However, speaker and team tabs, as well as results from individual debates, often do not provide us with the information we might want to have.

We propose implementing the Elo rating system in British Parliamentary debating (“BP debating”) to solve this problem. The Elo rating system calculates the relative skill levels of players in competitor-versus-competitor games. A detailed explanation of the mathematics of the Elo mechanism is to be found in the next section, but our proposal can be summarised thus:

  1. To begin with, every speaker is given a certain number of Elo points – we propose 1500. This is a player's Elo rating. (1500 points will also be given to any individual who is just beginning British Parliamentary debating.)
  2. When speakers form teams, their team will be given a team rating – this is the average of the two speakers’ Elo ratings.
  3. When a team wins, it will steal points from the losing team. These points will be added to the speakers’ Elo ratings. A team loses to any team ranked above it in a room and wins against any team ranked below it. So a team that is 3rd in a debate wins against 1 team and loses against 2 teams.
  4. The number of Elo points stolen is determined by the gap in the team ratings and not the gap in individual speaker Elo ratings [2]. Winning against a relatively weak team results in a small number of Elo points stolen; winning against a relatively strong team results in a large number of Elo points stolen.
  5. Over time, speakers’ Elo ratings will change to reflect their debating ability.
  6. Speakers are globally ranked according to their Elo rating. There should also be ESL and EFL rankings. We hope that there will be regional rankings too.
  7. For the Elo rating system, both in-round and out-round performance can be taken into account. This is because even in out-rounds where full team rankings are not produced, we know for certain that the teams progressing to the next out-round have beaten the 2 teams that have not progressed to the next out-round. We do not see distortions arising from including out-rounds in the Elo calculation.
  8. Speakers will fall off [3] the public Elo rating list if they:
    1. Finish university education
    2. Are inactive for 1 year [4]
    3. Indicate that they intend to cease competitive debating
    4. Otherwise do not wish to be included on the rating list
  9. In principle the Elo rating system can be extended to include all debating tournaments. Practical concerns might dictate that only relatively major tournaments are included in the system, although the system should not be excessively difficult to put into place.

The Elo rating system has been implemented in chess, basketball, and Major League Baseball. An instantly updated Elo ranking of all professional chess players rated 2700 and above can be found at http://www.2700chess.com/ [5], and might illustrate what an Elo ranking system, if implemented for debating, could look like.

In Section 1 the Elo mechanism is explained in detail and illustrated with a hypothetical example. In Section 2 we point out some of the benefits that implementing the Elo rating system might have. In Section 3 we illustrate Elo implementation by running the Elo mechanism on Zagreb EUDC 2014, and make some brief comments on the results. In Section 4 we briefly list some possible further avenues of exploration with regard to Elo implementation.

Section 1: mathematical outline and hypothetical example

The Elo rating system adjusts a debater's score after every debate based on how their team's performance compares to the performance implied by the difference between their rating and those of the other debaters in the room. If a debater exceeds that expectation, their score moves up. If they underperform, their score is reduced. Given a large enough sample of debates, the implied probabilities of victory will approach the actual probabilities, since ratings are frequently adjusted to reflect speakers' performances.

The Elo rating system treats each four-team debate as a series of six [6] pairwise matchups between the four teams. If team A ranks above team B, team A is treated as winning against team B and vice versa. There is no additional adjustment for beating another team by more than one place in the final ranking: if team A also beat team C they would receive credit for that independently.

To understand how the adjustment process works, consider the following scenario:

–      Two teams: team 1, debaters A and B; team 2, debaters C and D.

–      Elo ratings of A, B, C and D are R_A, R_B, R_C and R_D, respectively.

–      Team 1 beats team 2.

First, we calculate the team ratings of teams 1 and 2, T_1 and T_2 respectively, namely the linear (arithmetic) average of the individual ratings of the two debaters:

T_1 = (R_A + R_B) / 2
T_2 = (R_C + R_D) / 2

This is fairly intuitive – both team members clearly contribute to the overall strength of a pairing. The type of average used is arbitrary. We picked the arithmetic mean because it is simple, but some other, larger average (e.g. a quadratic mean) may be more appropriate, given the propensity of the stronger team member to dominate their combined performance.

The difference between the team ratings of teams 1 and 2 yields the expected probabilities of victory, P_1 and P_2, respectively:

P_1 = 1 / (1 + 10^((T_2 − T_1) / 400))
P_2 = 1 / (1 + 10^((T_1 − T_2) / 400))

This is an intuitive way of deriving probabilities of victory:

–      The probabilities sum to 1. This is reassuring; one team should indeed beat the other.

–      If the two teams are equally ranked, the probabilities will both be 0.5.

–      As T_1 − T_2 increases, P_1 tends to (i.e. gets arbitrarily close to) 1 and P_2 tends to 0.

Note that we divide the difference in scores by 400 (the divisor) in the probability calculation. The choice of 400 here is arbitrary. Roughly, it determines how much the implied probabilities change given a shift in the score difference – a larger divisor gives rise to smaller change in the probabilities. 400 is the divisor used in chess.
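
To make the divisor's role concrete: with a 400-point gap in team ratings, the stronger team's expected probability of victory is P_1 = 1 / (1 + 10^(−1)) ≈ 0.91, whereas with a divisor of 800 the same gap would give P_1 = 1 / (1 + 10^(−0.5)) ≈ 0.76 – the larger divisor flattens the implied probabilities.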

We now adjust the scores based on the difference between the expected and actual outcomes of the match. If a team scores 1 point for a victory and 0 points for a loss, we would expect team 1 to win P_1 points and team 2 P_2 points in any given match. Given this, we calculate the changes in scores for the members of team 1 and team 2, Δ_1 and Δ_2, respectively:

Δ_1 = 32 × (1 − P_1)
Δ_2 = 32 × (0 − P_2)

We are simply multiplying the difference between the expected and actual outcomes for both teams by 32 (the K-factor). The K-factor determines the magnitude of the Elo adjustment. To calculate the new Elo rating of each debater, we simply add the Δ-value for the relevant team to the ratings of each of its constituent debaters. It is worth noting the following:

–      The multiplier 32 is arbitrary; it is the maximum number of points a given matchup can move a player's score. So given that each team faces three others in any given debate, a single debate can move a player's score by as much as 96 points (though this is practically impossible).

–      Since P_2 = 1 − P_1, a little algebraic rearrangement will show that team 1's gain is team 2's loss and vice versa. So, unless new debaters join the system, points are merely redistributed, not created.

–      We calculate the score adjustment for each pairwise matchup in a given debate before adjusting the scores – so if team 1's scores change by Δ_1 as a result of their beating team 2, we do not add this change on until we have calculated how much they gain or lose from their results against teams 3 and 4.
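
Putting these pieces together, the following is a minimal code sketch of the pairwise update for a single BP debate, assuming the parameters above (arithmetic-mean team ratings, a divisor of 400 and a K-factor of 32). It illustrates the mechanism described in this section rather than prescribing an implementation.

```python
from itertools import combinations

K = 32        # K-factor: maximum points one pairwise matchup can transfer
DIVISOR = 400

def team_rating(r_a, r_b):
    """Arithmetic mean of the two speakers' ratings."""
    return (r_a + r_b) / 2

def expected(t1, t2):
    """Expected probability that a team rated t1 beats a team rated t2."""
    return 1 / (1 + 10 ** ((t2 - t1) / DIVISOR))

def bp_deltas(team_ratings_in_rank_order):
    """Return the Elo change for each of the four teams, given their ratings
    listed in finishing order (1st to 4th). All six pairwise matchups are
    evaluated before any rating is adjusted; each speaker's rating then
    changes by their team's delta."""
    deltas = [0.0, 0.0, 0.0, 0.0]
    for winner, loser in combinations(range(4), 2):  # earlier rank beats later rank
        p_win = expected(team_ratings_in_rank_order[winner],
                         team_ratings_in_rank_order[loser])
        deltas[winner] += K * (1 - p_win)
        deltas[loser] -= K * (1 - p_win)
    return deltas

# Example: four team ratings in finishing order.
print([round(d, 1) for d in bp_deltas([1550, 1500, 1450, 1400])])
```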

The following example debate illustrates what an Elo adjustment might look like. The exact Elo ratings of the speakers have been chosen arbitrarily.

–      Team 1, pro-am, debaters A and B, rankings 2500 and 1500 respectively.

–      Team 2, strong team, debaters C and D, rankings 2200 and 2300 respectively.

–      Team 3, intermediate team, debaters E and F, both ranked 1900.

–      Team 4, novice team, debaters G and H, rankings 1700 and 1500 respectively.

Suppose team 1 wins, team 2 comes 2nd, team 3 comes 3rd and team 4 comes 4th. We will consider the six pairwise matchups and the adjustments to each debater's rating.

  1. 1 vs 2
  2. 1 vs 3
  3. 1 vs 4
  4. 2 vs 3
  5. 2 vs 4
  6. 3 vs 4

We then add up the relevant Δ-values to obtain the following new ratings for debaters A to H:

  1. A: 2540.3 (2500 + 40.3)
  2. B: 1540.3 (1500 + 40.3)
  3. C: 2179.7 (2200 − 20.3)
  4. D: 2279.7 (2300 − 20.3)
  5. E: 1891.3 (1900 − 8.7)
  6. F: 1891.3 (1900 − 8.7)
  7. G: 1688.7 (1700 − 11.3)
  8. H: 1488.7 (1500 − 11.3)

Several things should be noted. First, the large increase in the Elo points of A and B is due to the fact that Team 1 was a pro-am team: the relatively low Elo rating of B pulled Team 1's team rating down, and it was hence rewarded more for its victory. Second, the change in the Elo score of Team 4 is small despite its loss, because it is a novice team. Third, despite Team 2 coming second in the debate, it lost points overall, because it lost more points to Team 1 than it gained from defeating the relatively weak Teams 3 and 4. A “guaranteed second” does not always gain a strong team points.

Section 2: Why implement Elo?

Before we discuss the positive reasons to implement the Elo rating system we would like to point out that the Elo system does not require much more information than is currently captured and publicly shown in tournament tabs. All that is required for the Elo system to work are:

  1. Records of the composition of each team. This is currently captured on all tabs.
  2. Records of the wins and losses of teams in in-rounds and out-rounds. This is currently captured on all interactive tabs [7], but not on non-interactive tabs. Richard Coates, one of the tab engineers of the Oxford and Cambridge IVs 2014 and EUDC 2014, is currently developing an online central database that would capture all the information needed for the Elo rating system to work across multiple tournaments. The Elo system would effectively require the use of interactive tabs across most tournaments. (A sketch of the minimal records such a database might store follows this list.)
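
As a rough sketch of the minimal records such a central database might store (the class and field names here are assumptions for illustration, not the schema actually being built):

```python
from dataclasses import dataclass

@dataclass
class TeamEntry:
    """Composition of a team at a given tournament."""
    tournament: str
    team: str
    speakers: tuple   # e.g. ("Speaker One", "Speaker Two")

@dataclass
class DebateResult:
    """One debate's outcome, from which all pairwise wins and losses follow."""
    tournament: str
    round: str        # e.g. "R3" or "QF"
    ranking: list     # team names ordered 1st..4th (partial for out-rounds)
```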

Performance across time
The first benefit of an Elo rating system is that it allows for the accurate tracking of performance across time. This is currently very difficult to do. Looking at speaker and team tabs across different tournaments is a helpful guide, but team tabs do not take into account the varying strengths of the field at different tournaments, and neither do rankings on speaker tabs. Speaker score averages are problematic because judges in different regions, circuits and tournaments might have different scoring standards. A novice speaker might fail to break at three tournaments in a row even though they are consistently improving; the Elo system would allow this speaker to observe real improvement in performance and encourage them to continue speaking. Often the illusion of stagnation is discouraging to novices. Conversely, a speaker might break top at one tournament and fail to break at another despite performing equally strongly. It would be helpful to have a metric which can detect improvement or consistency in both of these cases.

Another time-based issue arises when a “snapshot” of a speaker's strength is used as a proxy for strength over a certain period of time, even though that “snapshot” is not representative. For instance, the person who tops the speaker tab at WUDC is often called the “world's best speaker” or “World No.1” for a period of a year, even though that speaker's strength will fluctuate over the course of the year. The rankings generated by the Elo system will probably be a lot more generous to a larger number of speakers; we might see, for instance, several speakers continuously jostling for the no.1 ranking. We might speak of “so-and-so being the best speaker from June to October”, for instance, which would be a more accurate way of capturing global rank and performance.

Note that speaker tabs and the Elo rating system measure different things. First, performance on speaker tabs is based on the numerical score a judge awards in the round (an absolute measure), while the Elo rating is based purely on relative performance. Speaker tabs account for the fact that one might have won against a strong team in a terrible debate (which means speakers get low speaker scores despite a “good” relative performance), while the Elo rating system cannot. Second, speaker tabs are in some sense more fine-grained than the Elo rating system, since they account for variation within teams. Third, the Elo rating system does not take into account margins of victory, so a 1-point win and a 20-point win are treated the same, while speaker tabs capture (albeit indirectly) such margins. We highlight these factors to point out that the Elo rating system cannot claim to replace speaker tabs, which will continue to remain important.

Comparisons with speakers against whom one has not competed

The second benefit of the Elo rating system is that it allows for comparison with speakers against whom one has not competed; more precisely, it allows for strength comparisons of speakers across circuits. Currently there is no reliable way of telling if a speaker in one regional circuit is stronger than a speaker in another regional circuit. Educated guesses are always possible but are imprecise. It is plausible that a speaker who dominates a particular circuit is not in fact performing particularly well; or that a speaker who is not doing particularly well in a circuit is in fact performing very well relative to the rest of the world. The Elo rating system helps to clear away some of this uncertainty.

Of course, if different circuits had no contact with each other at all the Elo rating system would not be able to provide these comparisons, since the “Elo pools” of each circuit would be closed and Elo points could not be stolen by or from other circuits. This would mean that the weakening or strengthening of a circuit relative to the rest of the world would not be detectable. This concern can be addressed. Regional competitions such as the EUDC, Sydney Mini, ABP, and the US BP Nationals provide one valuable place for pools to mix. The most important competition from the perspective of getting accurate comparison across regions is WUDC, since representatives from all debating circuits will be present, and will determine (together with the number of individuals beginning to debate) the size of their circuit’s Elo pool for the rest of the year.

Consider two speakers A and B in two circuits X and Y respectively, both of whom have never participated in the same tournament. Currently it is very difficult for A and B to compare their debating strength. However, circuit X and Y both send (their strongest) teams to WUDC. If circuit X happens to be strong relative to circuit Y, then its teams will increase the size of circuit X’s Elo pool relative to circuit Y (by winning more debates than circuit Y’s teams at WUDC). If A and B perform roughly equally against teams in their own circuits, it is then likely that A will have a greater number of Elo points than B, since more Elo points collected from WUDC will diffuse into circuit X than circuit Y. A might then be able to say with a reasonable degree of confidence that he/she is a stronger debater than B.

No problem arises even when a circuit's WUDC teams are highly unrepresentative of the quality of the circuit in general. If the WUDC teams are particularly strong, then they are also unlikely to have the Elo points they gained at WUDC stolen from them by other teams in their circuit. The circuit's Elo pool increases, but the Elo points are also more tightly locked up in a few teams. The converse logic applies where the WUDC teams are particularly weak.

Large-scale comparisons

The third benefit of the Elo rating system is that it allows certain large-scale comparisons to be made easily. One has been mentioned above – the relative strength of different circuits. However, Elo ratings could also be helpful in detecting bias in circuits towards or against certain genders or races. If circuit X has a large number of female speakers in its regional top 20 ranking and circuit Y has a small number of female speakers in its regional top 20 ranking (controlling for factors like participation rates), this suggests that circuit Y might have a bias against female speakers. More prosaically, if, say, half of the debaters in a circuit are female but none of them are ranked in that circuit's top 20 speakers, something is probably wrong. Thus, the Elo rating system is of interest not just to individual speakers who want to become better debaters, but to tournament organisers and bodies like the WUDC Council that have a general interest in making debating fair and inclusive.

Determining tournament/room strength

The fourth benefit of the Elo rating system is that it allows for accurate categorization of tournament strength. For example, for the purposes of novice competitions or pro-ams, we currently determine who an “am” or “novice” is in debating by reference to how many university-level tournaments they have broken in. It might be worth considering broadening the definition of “am” or “novice” to include individuals whose Elo ratings fall below a certain number. A person could have debated for a long time and still benefit hugely from being partnered with a strong debater. The Elo rating system would also let us determine the overall strength of a tournament (or room) by simply taking the average Elo rating of the relevant speakers. Universities deciding which tournaments to send their teams to might find objective measurements of tournaments' strength useful. Furthermore, knowing the strength of a particular room in a competition might aid CA teams in judge allocation; they might want to put the best judges in rooms that fall within a certain Elo bracket, for instance.

Including out-rounds

The fifth benefit of the Elo rating system is that it allows us to integrate performance over in-rounds and out-rounds in a single measurement. This means that Elo ratings capture more information about team performance than team tabs do. A team that progresses from the quarter-finals of a tournament to the semi-finals must beat the two teams that do not progress from the quarter-finals. Thus, it steals points from two teams but, assuming that the judges did not come to a comprehensive team ranking, should neither steal points from nor lose points to the team that progresses through the out-round with it. Loosely speaking, we might say that a team that progresses through an out-round takes a “1.5” ranking, while teams that do not progress take a “3.5” ranking. This makes sense; half of the teams that progress come 1st and half 2nd, and the teams that do not progress come 3rd and 4th half of the time respectively. Of course, these assumptions do not hold true for particular teams; note, however, that including this data is certainly less distortionary than excluding it altogether, since we are certain that each team has won/lost against two other teams, and these wins are just as valid as wins against any other team in an in-round.
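
A minimal sketch of how an out-round could be folded into the same pairwise update (the function name is an assumption; whether advancing teams exchange points with each other is a design choice, and here they simply do not):

```python
def outround_pairs(advancing, eliminated):
    """Return the pairwise (winner, loser) results implied by an out-round.

    Each advancing team is treated as beating each eliminated team; the two
    advancing teams are not compared with each other, and neither are the two
    eliminated teams, since no full ranking is produced."""
    return [(winner, loser) for winner in advancing for loser in eliminated]

# Each pair would then be fed into the same expected-score/delta calculation
# used for in-rounds.
print(outround_pairs(["Team A", "Team C"], ["Team B", "Team D"]))
# [('Team A', 'Team B'), ('Team A', 'Team D'), ('Team C', 'Team B'), ('Team C', 'Team D')]
```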

Estimating individual tournament performance

The sixth benefit of the Elo rating system is that it allows a speaker to estimate their performance rating to a reasonable degree. The performance rating measures the strength of performance at a single tournament; knowing their own performance rating for each tournament would allow a speaker to know which tournament represented their strongest or weakest performance in terms of debating strength, without distortions relating to the strength of the tournament field. It might also allow us to determine the strongest tournament performance by any person recorded in a certain period – a person might not win a tournament, but still be responsible for a stunning performance overall. One way of estimating a speaker's performance rating for a tournament [8] (sketched in code after the list below) is to:

  1. Take the rating of each team beaten and add 400;
  2. Take the rating of each team lost to and subtract 400;
  3. Sum the figures obtained; and
  4. Divide by the number of debates multiplied by three (the number of teams debated against)
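
As a rough sketch of that estimate in code (this mirrors the heuristic described above; the input format is an assumption):

```python
def performance_rating(results):
    """Estimate a tournament performance rating.

    results: list of (opponent_team_rating, won) tuples, with one entry per
    opposing team in every debate (three per BP round)."""
    adjusted = [rating + 400 if won else rating - 400 for rating, won in results]
    return sum(adjusted) / len(adjusted)

# One round: beat teams rated 1550 and 1480, lost to a team rated 1620.
print(round(performance_rating([(1550, True), (1480, True), (1620, False)]), 1))
```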

A possible advantage of being able to calculate performance rating relates to tie-breaks. Ceteris paribus, we want the team with the higher performance rating to break to out-rounds. Current tiebreak measures tend to arbitrarily favour either consistency or variance in performance (e.g., counting wins) or provide only a limited snapshot of the team’s performance that overemphasises team-specific interactions (e.g., head-to-head records and tiebreak debates). Performance rating might provide a better measure of overall debating strength, although this requires the Elo rating system to be relatively well-developed (i.e., implemented for a significant period of time) so that team ratings accurately capture team strength.

Encouraging pro-ams

The seventh possible benefit of the Elo rating system relates to pro-ams. It is plausible that strong speakers will see pro-aming as a way to gain rating points, since pro-aming lowers the team rating. Provided that strong speakers believe that they will continue to perform relatively well even when pro-aming, this lowered team rating makes it appear easier for them to gain Elo points from wins. Of course, this effect is not at all a mathematical certainty – the fact that we employ team ratings when determining the size of the Elo point transfer ought to mean that a strong speaker is neither punished nor rewarded when speaking with a novice – but our experience indicates that it is at least plausible that strong speakers perform very well (i.e., not significantly worse than if they were not speaking with a novice) when speaking with novices.

We do not believe that the Elo rating system will be particularly humiliating or off-putting for individuals with low Elo ratings. We should first note that there is no reason to believe that Elo ratings are more embarrassing than speaker tabs, which already list all individuals from best to worst regardless of language category. Being part of the debating community appears to already involve being willing to publicly share one's successes and failures, as in any other competitive activity. The Elo rating system refines information that is already available in the form of interactive tabs. We also note that, in relation to speaker tabs at large tournaments, individuals tend to be interested only in (1) their own ranking; (2) the rankings of individuals they know personally; and (3) the top 20 speakers. We see no reason to believe things will be different in relation to Elo rankings. This means that a speaker who is world no.255, for instance, has absolutely nothing to feel ashamed of or worry about. If this is in fact a problem, however, the solution would be to publicly display only the Elo ratings of the world's top 100 speakers. Furthermore, we have reason to believe that Elo ratings might be especially encouraging for novice speakers, who might not see clear indicators of improvement at their first few tournaments if they do not break. And Elo ratings will also tell individuals when they have stagnated, so that they can do something about it if they want to.

Does the Elo rating system make debating too competitive? This is hard to tell. Some speakers will want to debate more to improve their ranking; others might want to debate less for fear of damaging it. And (we hope) people will continue to be motivated to debate or not to debate by factors unrelated to Elo rankings: the general need to live a full life, the desire to see friends (debaters or otherwise), the enjoyment of debates, and the desire to do well. It is hard to imagine the Elo rating system making a huge difference to people's decision-making. What we do know for certain is that people will have more information upon which to base their decisions. This is good.

Section 3: Zagreb EUDC 2014: what would Elo look like?

We assigned each speaker who participated in Zagreb EUDC 1500 points. Each team therefore started the tournament with a rating of 1500. We calculated Elo ratings both after the in-rounds and after the entire tournament. The top 50 teams are ranked below according to their post-EUDC Elo ratings. Since team and speaker ratings are identical (given that everyone began with the same Elo rating), we do not explicitly consider individual ratings.

Several things should be noted:

  1. Hebrew A broke into both the Open and ESL out-rounds. Since it debated in the Open out-rounds first, its out-round-inclusive Elo was calculated by making the relevant Elo adjustment from the Open quarter-final before making the adjustments from the ESL quarter-final and semi-final.
  2. Since everyone started the tournament with 1500 Elo points, the post-tournament Elo rankings also function as a measure of tournament performance strength.
  3. The relevant calculations were not particularly difficult to carry out. Once the Elo formula was provided, the relevant coding for Tabbie took less than 1 hour to complete, although several corrections had to be made later. We estimate that, if told in advance, individuals familiar with Tabbie will be able to perform the relevant Elo calculations for a tournament in less than 30 minutes, assuming that the relevant coding has been completed. The relevant data input and calculations were made easier for this EUDC illustration by the fact that all teams and individuals started off with the same rating, but we do not believe that obtaining speakers’ Elo ratings pre-tournament will be difficult. Obtaining Elo ratings can be integrated into current tournament registration procedures. If there is a central database that immediately updates and stores Elo ratings, this can be consulted. For individuals who wish to write programs that calculate Elo ratings, note that:
    1. Elo point transfers in each debate must be calculated independently. Thus, the team that takes a 1st does not have its Elo rating adjusted after the size of the point transfer from one other team has been calculated: all the Δ-values for all teams must be added up before the point transfer is made. (See the hypothetical example provided in Section 2.)
    2. If a team’s rating changes by X over the course of a tournament, then each speaker will also have his/her Elo rating change by X.
    3. In an in-round, a team’s rating can change by a maximum of 96 Elo; in out-rounds, 64 Elo.

EUDC 2014 Elo ratings (top 50)

The three lists below show, respectively, the final Elo ratings (including out-rounds), the Elo ratings after the in-rounds only, and the official team tab.

Elo (Final)
1. SHEFFIELD A 1791
2. OXFORD A 1767
3. OXFORD B 1754
4. CAMBRIDGE A 1748
5. BELGRADE B 1738
6. CAMBRIDGE C 1736
7. EDINBURGH A 1723
8. GUU A 1701
9. CAMBRIDGE B 1697
10. BERLIN A 1683
11. LUND A 1677
12. NOTTINGHAM A 1676
13. KCL A 1675
14. OXFORD C 1670
15. BPP A 1653
16. DURHAM B 1649
17. DURHAM A 1646
18. DURHAM C 1641
19. UCD L&H A 1640
20. TCD PHIL A 1640
21. LSE A 1639
22. BIRMINGHAM A 1638
23. WARWICK B 1638
24. WARWICK A 1638
25. HEBREW A 1623
26. TARTU A 1623
27. MANCHESTER A 1610
28. SOAS A 1608
29. ABERYSTWYTH A 1608
30. BGU A 1608
31. GUU B 1607
32. UCD L&H C 1607
33. UCC PHIL A 1606
34. BBU A 1606
35. LEIDEN A 1604
36. TEL AVIV B 1603
37. BELGRADE A 1581
38. HULL A 1581
39. ELTE A 1579
40. TCD HIST B 1579
41. IMPERIAL B 1578
42. TILBURY H A 1578
43. LSE B 1577
44. WARSAW A 1577
45. BRISTOL B 1577
46. UCC LAW B 1577
47. TCD HIST A 1576
48. STRATHCLYDE A 1576
49. ULU C 1576
50. LANCASTER A 1576

Elo (after in-rounds)

1. CAMBRIDGE C 1779
2. OXFORD B 1769
3. GUU A 1737
4. CAMBRIDGE B 1734
5. OXFORD A 1732
6. CAMBRIDGE A 1701
7. OXFORD C 1701
8. DURHAM B 1681
9. LSE A 1673
10. EDINBURGH A 1673
11. NOTTINGHAM A 1672
12. SHEFFIELD A 1671
13. KCL A 1671
14. DURHAM C 1669
15. HEBREW A 1668
16. DURHAM A 1646
17. BELGRADE B 1645
18. BERLIN A 1644
19. BPP A 1642
20. TCD PHIL A 1640
21. UCD L&H A 1640
22. TEL AVIV B 1639
23. WARWICK B 1638
24. WARWICK A 1638
25. BIRMINGHAM A 1638
26. TARTU A 1638
27. LUND A 1636
28. BUCHAREST A 1610
29. MANCHESTER A 1610
30. TILBURY HOUSE A 1609
31. ELTE A 1608
32. SOAS A 1608
33. BGU A 1608
34. ABERYSTWYTH A 1608
35. UCD L&H C 1607
36. MANNHEIM A 1607
37. GUU B 1607
38. BBU A 1606
39. UCC PHIL A 1606
40. LEIDEN A 1602
41. HULL A 1581
42. TCD HIST B 1579
43. IMPERIAL B 1578
44. LSE B 1577
45. UCC LAW B 1577
46. BRISTOL B 1577
47. WARSAW A 1577
48. TCD HIST A 1576
49. STRATHCLYDE A 1576
50. LANCASTER A 1576

Team Tab

1. CAMBRIDGE C
2. OXFORD B
3. CAMBRIDGE B
4. OXFORD A
5. GUU A
6. CAMBRIDGE A
7. OXFORD C
8. EDINBURGH A
9. LSE A
10. SHEFFIELD A
11. DURHAM B
12. KCL A
13. HEBREW A
14. NOTTINGHAM A
15. DURHAM C
16. BPP A
17. BELGRADE B
18. LUND A
19. UCD L&H A
20. TCD PHIL A
21. DURHAM A
22. TARTU A
23. WARWICK A
24. BERLIN A
25. WARWICK B
26. BIRMINGHAM A
27. TEL AVIV B
28. LEIDEN A
29. BUCHAREST A
30. SOAS A
31. UCC PHIL A
32. UCD L&H C
33. GUU B
34. ABERYSTWYTH A
35. BBU A
36. TILBURY HOUSE A
37. MANNHEIM A
38. MANCHESTER A
39. ELTE A
40. BGU A
41. TCD HIST B
42. TCD HIST A
43. STRATHCLYDE A
44. LSE B
45. GUU C
46. HULL A
47. LANCASTER A
48. BRISTOL B
49. ULU C
50. STRATHCLYDE B

Several things should be noted:

  1. The changes in Elo rating are relatively large, often approaching 300 points. This is because many speakers began with a score (1500) that was highly unlikely to represent their debating strength, and because EUDC is a large tournament where each team must debate against at least 27 others. There are hence at least 9 ratings adjustments, each with a hypothetical maximum size of 96 Elo points, to be made.
  2. The ranking according to in-round team ratings corresponds fairly well with the team tab, with some minor divergences (see Durham B, Nottingham A, ELTE A, and Lund A, for example). This is unsurprising, given that EUDC (1) has a relatively large number of in-rounds and (2) employs power-pairing. Less correspondence will tend to be seen at smaller tournaments.
  3. The out-rounds have a significant impact on Elo rating. Sheffield A, ranked equal 12th based on Elo after the in-rounds, gains 120 Elo points by defeating 7 strong teams in the out-rounds to come 1st in the final Elo rankings and very close to crossing the 1800 mark. Belgrade B also moves from 17th to 5th position in this manner.
  4. Even though EUDC 2014 is a large tournament, it is unclear if the Elo rankings above are representative of the speakers’ relative strength; more time might be needed for estimated and actual performance to match and for Elo ratings to stabilise. We did not calculate Elo ratings on a round-by-round basis, and so do not know if Elo rankings stabilised before Round 9. For teams at the upper and lower ends of the Elo ranking, we suspect that this is unlikely to be the case.

Section 4: Further issues for consideration

Issues that we have not had time or space to discuss but which are relevant and might merit exploration include:

  1. Modifying any one of the arbitrary parameters used in our Elo calculation, such as the initial number of Elo points (1500), or the size of the divisor in the probability calculation (400).
  2. Using the geometric rather than arithmetic mean to determine team ratings.
  3. Specific K-factor issues:
    1. Having higher K-factors for tournaments deemed to be important.
    2. Having lower (or higher) K-factors for out-rounds.
    3. It might be especially useful to have a K-factor that starts out large but shrinks down to a minimum value over time, to ensure that people can rapidly move towards their representative Elo rating from the initial 1500. A simple formula for achieving this might be to have a K-factor of: 500/(number of debates), with a minimum K-factor of 32. This drastically reduces the time it takes to move away from the 1500 rating, since the first few (rated) debates will have a very large impact.
    4. Having a rating-staggered K-factor. E.g.: a K-factor of 32 for ratings between 1200 and 1600, 24 for ratings between 1600 and 2000, and 16 for ratings above 2000.
  4. Excluding certain tournaments from Elo calculations.
  5. Implementing (separate) Elo ratings for non-BP debating formats, with which we are not intimately familiar. We note that the relevant calculations ought to be simpler where debates only include 2 teams.
  6. Integrating the Elo ratings for BP and non-BP formats. This is worth serious consideration, since debaters in the Australian and Asian circuits debate mostly in the Australs and Asians formats. Implementing Elo ratings only for BP debating means that (1) these debaters have few chances to have their Elo rating adjusted, sometimes as few as 3 a year, and that (2) both UADC and Australs are excluded from Elo calculations. Separate Elo ratings might be necessary if the Australs and Asians formats are considered too different from the BP format for a single Elo rating to make sense. Since we are not intimately familiar with the Australs/Asians formats, however, we do not take a stand on this issue.
  7. Using Elo as an aid in team allocations for WUDC. Given that the demand for WUDC spots appears to be growing faster than WUDC can accommodate it, Elo ratings might be useful in determining which one among two institutions gets, say, a 3rd team for WUDC. We might wish to give the spot to the team with the higher Elo rating. Of course, this assumes a certain set of aims of the WUDC, and we do not take a stand in this article on this issue.
  1. We would like to express our gratitude to the many individuals who discussed our proposal with us and provided us with important insights and suggestions.
  2. This has important implications for teams where the Elo ratings of the speakers differ significantly. See the discussion of pro-ams in Section 2.
  3. Note that this might appear to pose a problem for tournaments which include speakers who are not on the Elo rating list. However, the solution is to calculate and store the Elo ratings of individuals not on the list, without publicly revealing them. Speakers can thus still gain or lose points fairly at open tournaments by debating against “retired” debaters, whose Elo ratings are resurrected for the purposes of making the relevant calculations.
  4. This period can be modified to reflect different levels of BP debating activity in different circuits. 6 months might be more suitable for the IONA circuit, for instance.
  5. The site is maintained by chess enthusiasts and shows: (1) Elo ratings; (2) world rankings; (3) recent changes in Elo rating and ranking; (4) recent games played; and (5) progress charts over time for each player.
  6. 4!/(2!(4−2)!) = 6, i.e. the number of distinct pairings of 4 teams.
  7. This is the interactive tab for the Cambridge IV 2014: http://www.tabbieballots.com/tabs/cambiv2014/teamtab.html. Clicking on a team’s name shows its win-loss record.
  8. Used by some chess clubs.