Introduction: The Elo Ranking System
Debating is a competitive hobby. Part of the pleasure of debating comes from knowing how good one is compared to other debaters. This explains, inter alia, speaker and team tabs. However, speaker and team tabs, as well as results from individual debates, often do not provide us with the information we might want to have.
We propose implementing the Elo rating system in British Parliamentary debating (“BP debating”) to solve this problem. The Elo rating system calculates the relative skill levels of players in competitor-versus-competitor games. A detailed explanation of the mathematics of the Elo mechanism is to be found in the next section, but our proposal can be summarised thus:
- To begin with, every speaker is given a certain number of Elo points – we propose 1500. This is a player’s Elo rating. (1500 points will also be given to any individual who is beginning British Parliamentary debating.)
- When speakers form teams, their team will be given a team rating – this is the average of the two speakers’ Elo ratings.
- When a team wins, it will steal points from the losing team. These points will be added to the speakers’ Elo ratings. A team loses to any team ranked above it in a room and wins against any team ranked below it. So a team that is 3rd in a debate wins against 1 team and loses against 2 teams.
- The number of Elo points stolen is determined by the gap in the team ratings and not the gap in individual speaker Elo ratings. Winning against a relatively weak team results in a small number of Elo points stolen; winning against a relatively strong team results in a large number of Elo points stolen.
- Over time, speakers’ Elo ratings will change to reflect their debating ability.
- Speakers are globally ranked according to their Elo rating. There should also be ESL and EFL rankings. We hope that there will be regional rankings too.
- For the Elo rating system, both in-round and out-round performance can be taken into account. This is because even in out-rounds where full team rankings are not produced, we know for certain that the teams progressing to the next out-round have beaten the 2 teams that have not progressed to the next out-round. We do not see distortions arising from including out-rounds in the Elo calculation.
- Speakers will fall off the public Elo rating list if they:
  - Finish university education
  - Are inactive for 1 year
  - Indicate that they intend to cease competitive debating
  - Otherwise do not wish to be included on the rating list
- In principle the Elo rating system can be extended to include all debating tournaments. Practical concerns might dictate that only relatively major tournaments are included in the system, although the system should not be excessively difficult to put into place.
The Elo rating system has been implemented in chess, basketball, and Major League Baseball. An instant-update Elo ranking of all professional chess players with Elo ratings of 2700 and above can be found at http://www.2700chess.com/ and might illustrate what an Elo ranking system, if implemented for debating, might look like.
In Section 1 the Elo mechanism is explained in detail and illustrated with a hypothetical example. In Section 2 we point out some of the benefits that implementing the Elo rating system might have. In Section 3 we illustrate Elo implementation by running the Elo mechanism for Zagreb EUDC 2014, and make some brief comments on the results. In Section 4 we briefly list some possible further avenues of exploration with regard to Elo implementation.
Section 1: mathematical outline and hypothetical example
The Elo rating system adjusts a debater’s score after every debate based on how their team’s performance compares to that implied by the difference between their score and those of the other debaters in the room. If a debater exceeds that expectation, their score moves up. If they underperform, their score is reduced. Given a large enough sample of debates, the implied probabilities of victory will approach the actual probabilities, because ratings are adjusted after every debate to reflect speakers’ performances.
The Elo rating system treats each four-team debate as a series of six pairwise matchups between the four teams. If team A ranks above team B, team A is treated as winning against team B and vice versa. There is no additional adjustment for beating another team by more than one place in the final ranking: if team A also beat team C they would receive credit for that independently.
To understand how the adjustment process works, consider the following scenario:
– Two teams: team 1, debaters A and B; team 2, debaters C and D.
– Elo ratings of A, B, C and D are RA, RB, RC and RD, respectively.
– Team 1 beats team 2.
First, we calculate the team ratings of teams 1 and 2, T1 and T2, respectively, namely the arithmetic mean of the individual ratings of the two speakers on each team:
$$T_1 = \frac{R_A + R_B}{2}, \qquad T_2 = \frac{R_C + R_D}{2}$$
This is fairly intuitive – both team members clearly contribute to the overall strength of a pairing. The type of average used is arbitrary. We picked the arithmetic mean because it is simple, but some other, larger average (e.g. a quadratic mean) may be more appropriate, given the propensity of the stronger team member to dominate their combined performance.
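The choice of average can be illustrated with a short sketch. The function names are hypothetical, and the ratings are those of an illustrative pro-am pairing (2500 and 1500). The quadratic mean is shown only as one candidate for a "larger" average, not as a recommendation.

```python
import math


def arithmetic_team_rating(r1: float, r2: float) -> float:
    """Team rating as the plain average of the two speakers' ratings."""
    return (r1 + r2) / 2


def quadratic_team_rating(r1: float, r2: float) -> float:
    """Team rating as the quadratic mean (root mean square) of the two ratings.

    This is never smaller than the arithmetic mean, so it gives more
    weight to the stronger speaker.
    """
    return math.sqrt((r1 ** 2 + r2 ** 2) / 2)


if __name__ == "__main__":
    print(arithmetic_team_rating(2500, 1500))           # 2000.0
    print(round(quadratic_team_rating(2500, 1500), 1))  # 2061.6
```

For two equally rated speakers both averages coincide; the gap only opens up as the ratings within a team diverge.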
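The difference between the team ratings of teams 1 and 2 yields the expected probabilities of victory, P1 and P2, respectively:
$$P_1 = \frac{1}{1 + 10^{(T_2 - T_1)/400}}, \qquad P_2 = \frac{1}{1 + 10^{(T_1 - T_2)/400}}$$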
This is an intuitive way of deriving probabilities of victory:
– The probabilities sum to 1. This is reassuring; one team should indeed beat the other.
– If the two teams are equally ranked, the probabilities will both be 0.5.
– As T1 – T2 increases, P1 tends to (i.e. gets arbitrarily close to) 1 and P2 tends to 0.
Note that we divide the difference in scores by 400 (the divisor) in the probability calculation. The choice of 400 here is arbitrary. Roughly, it determines how much the implied probabilities change given a shift in the score difference – a larger divisor gives rise to smaller change in the probabilities. 400 is the divisor used in chess.
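A minimal sketch of the probability calculation follows. It assumes the base-10 logistic curve that is conventional in chess Elo; the function name `win_probability` and its `divisor` parameter are illustrative. It checks the properties listed above and shows the effect of a larger divisor.

```python
def win_probability(t_own: float, t_opp: float, divisor: float = 400.0) -> float:
    """Expected probability that the team rated t_own beats the team rated t_opp.

    Assumption: the base-10 logistic form used in chess Elo.
    """
    return 1.0 / (1.0 + 10 ** ((t_opp - t_own) / divisor))


if __name__ == "__main__":
    p1 = win_probability(2000, 1900)
    p2 = win_probability(1900, 2000)
    print(round(p1, 3), round(p2, 3))   # 0.64 0.36 (the two sum to 1)
    print(win_probability(1500, 1500))  # 0.5 for equally rated teams

    # Doubling the divisor makes the same 100-point gap matter less.
    print(round(win_probability(2000, 1900, divisor=800), 3))  # 0.571
```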
We now adjust the scores based on the difference between the expected and actual outcomes of the match. If a team scores 1 point for a victory and 0 points for a loss, we would expect team 1 to win P1 and team 2 P2 points in any given match. Given this, we calculate the changes in scores for members of team 1 and team 2, Δ1 and Δ2, respectively:
$$\Delta_1 = 32\,(1 - P_1), \qquad \Delta_2 = 32\,(0 - P_2)$$
We are simply multiplying the difference between the expected and actual outcomes for both teams by 32 (the K-factor). The K-factor determines the magnitude of the Elo adjustment. To calculate the new Elo rating of the various debaters, we simply add the Δ-values for the relevant team to the ratings of each of its constituent debaters. It is worth noting the following:
– The multiplier 32 is arbitrary; it is the maximum number of points a given matchup can move a player’s score. So given that each team faces three others in any given debate, a single debate can move a player’s score by as much as 96 points (though this is practically impossible).
– Since P2 = 1 – P1, a little algebraic rearrangement will show that team 1’s gain is team 2’s loss and vice versa. So, unless new debaters join the system, points are merely redistributed, not created.
– We calculate the score adjustment for each pairwise matchup in a given debate before adjusting the scores – so if team 1’s scores change by Δ1 as a result of their beating team 2, we do not add this change on until we have calculated how much they gain or lose from their results against teams 3 and 4.
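The bookkeeping described in these notes can be sketched as follows, assuming the base-10 Elo curve with a divisor of 400 and a K-factor of 32. The names `win_probability` and `debate_deltas` are hypothetical. All six pairwise adjustments are computed from the pre-debate ratings and summed before anything is applied, so no rating changes midway through the calculation.

```python
from itertools import combinations

K_FACTOR = 32
DIVISOR = 400


def win_probability(t_own, t_opp):
    """Expected probability that t_own beats t_opp (base-10 curve assumed)."""
    return 1.0 / (1.0 + 10 ** ((t_opp - t_own) / DIVISOR))


def debate_deltas(team_ratings):
    """Return the total rating change for each team in one debate.

    team_ratings lists the pre-debate team ratings in finishing order
    (1st place first).
    """
    deltas = [0.0] * len(team_ratings)
    for upper, lower in combinations(range(len(team_ratings)), 2):
        # The team at index `upper` finished above the team at index `lower`.
        expected = win_probability(team_ratings[upper], team_ratings[lower])
        transfer = K_FACTOR * (1 - expected)
        deltas[upper] += transfer   # the winner's gain ...
        deltas[lower] -= transfer   # ... is exactly the loser's loss
    return deltas


if __name__ == "__main__":
    # Four equally rated teams: every matchup moves 16 points.
    print(debate_deltas([1500, 1500, 1500, 1500]))  # [48.0, 16.0, -16.0, -48.0]
```

Because each transfer is added to one team and subtracted from another, the changes always sum to zero, and no team can gain or lose 96 points or more in a single debate.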
The following is an example debate that illustrates what an Elo adjustment might look like. The exact Elo ratings of the speakers have been chosen arbitrarily.
– Team 1, pro-am, debaters A and B, ratings 2500 and 1500 respectively.
– Team 2, strong team, debaters C and D, ratings 2200 and 2300 respectively.
– Team 3, intermediate team, debaters E and F, both rated 1900.
– Team 4, novice team, debaters G and H, ratings 1700 and 1500 respectively.
Suppose team 1 wins, team 2 comes 2nd, team 3 comes 3rd and team 4 comes 4th. We will consider the various pairwise matchups and the adjustments to each debater’s rating.
- 1 vs 2
- 1 vs 3
- 1 vs 4
- 2 vs 3
- 2 vs 4
- 3 vs 4
We then add up the relevant Δ-values to obtain the following ratings:
- A: 2540.3 (2500 + 40.3)
- B: 1540.3 (1500 + 40.3)
- C: 2179.7 (2200 – 20.3)
- D: 2279.7 (2300 – 20.3)
- E: 1891.3 (1900 – 8.7)
- F: 1891.3 (1900 – 8.7)
- G: 1688.7 (1700 – 11.3)
- H: 1488.7 (1500 – 11.3)
Several things should be noted. First, the large increase in the number of Elo points A and B have is due to the fact that theirs was a pro-am team: the relatively low Elo rating of B pulled Team 1’s rating down, and the team was hence rewarded more for victory. Second, the change in the Elo score of Team 4 is small despite its loss: this is because it is a novice team. Third, despite coming second in the debate, Team 2 lost points overall because it lost more points to Team 1 than it gained from defeating the relatively weak Teams 3 and 4. A “guaranteed second” does not always gain a strong team points.
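The example can be re-run with a short script. This is a sketch under stated assumptions: the base-10 curve with divisor 400, a K-factor of 32, and arithmetic-mean team ratings; all names are illustrative. Under those assumptions it gives Team 1 a gain of about 40.3 points, keeps the total number of points constant, and leaves Team 2 with a net loss despite its second place. Its figures for Teams 2 to 4 come out a few points away from those listed above (approximately -21.4, -10.5 and -8.5), which suggests the original figures were produced with slightly different inputs or rounding.

```python
from itertools import combinations

K_FACTOR = 32
DIVISOR = 400

# Teams in finishing order, each mapping speaker -> pre-debate rating.
TEAMS = [
    {"A": 2500, "B": 1500},  # Team 1, pro-am, 1st
    {"C": 2200, "D": 2300},  # Team 2, strong team, 2nd
    {"E": 1900, "F": 1900},  # Team 3, intermediate team, 3rd
    {"G": 1700, "H": 1500},  # Team 4, novice team, 4th
]


def win_probability(t_own, t_opp):
    """Expected probability that t_own beats t_opp (base-10 curve assumed)."""
    return 1.0 / (1.0 + 10 ** ((t_opp - t_own) / DIVISOR))


def rate_debate(teams):
    """Return (team deltas, new speaker ratings) for one four-team debate."""
    team_ratings = [sum(t.values()) / len(t) for t in teams]
    deltas = [0.0] * len(teams)
    for upper, lower in combinations(range(len(teams)), 2):
        expected = win_probability(team_ratings[upper], team_ratings[lower])
        transfer = K_FACTOR * (1 - expected)
        deltas[upper] += transfer
        deltas[lower] -= transfer
    # Apply each team's total change to both of its speakers at the end.
    new_ratings = {}
    for team, delta in zip(teams, deltas):
        for speaker, rating in team.items():
            new_ratings[speaker] = rating + delta
    return deltas, new_ratings


if __name__ == "__main__":
    deltas, new_ratings = rate_debate(TEAMS)
    for place, delta in enumerate(deltas, start=1):
        print(f"Team {place}: {delta:+.1f}")
    for speaker, rating in new_ratings.items():
        print(f"{speaker}: {rating:.1f}")
```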
Section 2: Why implement Elo?
Before we discuss the positive reasons to implement the Elo rating system we would like to point out that the Elo system does not require much more information than is currently captured and publicly shown in tournament tabs. All that is required for the Elo system to work are:
- Records of the composition of each team. This is currently captured on all tabs.
- Records of the wins and losses of teams in in-rounds and out-rounds. This is currently captured on all interactive tabs, but not non-interactive tabs. Richard Coates, one of the tab engineers of the Oxford and Cambridge IVs 2014 and EUDC 2014, is currently developing an online central database that would capture all the information needed for the Elo rating system to work across multiple tournaments. The Elo system would effectively require the use of interactive tabs across most tournaments.
Performance across time
The first benefit of an Elo rating system is that it allows for the accurate tracking of performance across time. This is currently very difficult to do. Looking at speaker and team tabs across different tournaments is a helpful guide, but team tabs do not take into account the varying strengths of the field at tournaments, and neither do rankings on speaker tabs. Speaker score averages are problematic as judges in different regions, circuits and tournaments might have different scoring standards. A novice speaker might fail to break at three tournaments in a row even though that speaker might be consistently improving; the Elo system would allow this speaker to observe real improvements in performance and encourage the novice to continue speaking. Often the illusion of stagnation is discouraging to novices. Conversely, a speaker might break top at a tournament and fail to break at another despite performing equally strongly. It would be helpful to have a metric which can detect improvement or consistency in these two cases.
Another time-based issue arises when a “snapshot” of a speaker’s strength is used as a proxy for strength over a certain period of time, even though that “snapshot” is not representative. For instance, the person who tops the speaker tab at the WUDC is often called the “world’s best speaker” or “World No.1” for a period of a year, even though that speaker’s strength will fluctuate over the course of a year. The rankings generated by the Elo system will probably be a lot more generous to a larger number of speakers; we might see, for instance, several speakers continuously jostle for the no.1 ranking. We might speak of “so-and-so being the best speaker from June to October”, for instance, which would be a more accurate way of capturing global rank and performance.
Note that speaker tabs and the Elo rating system measure different things. First, performance on speaker tabs is based on the numerical score a judge awards in the round (an absolute measure), while the Elo rating is based purely on relative performance. Speaker tabs account for the fact that one might have won against a strong team in a terrible debate (which means speakers get low speaker scores despite a “good” relative performance), while the Elo rating system cannot. Second, speaker tabs are in some sense more fine-grained than the Elo rating system, since they account for variation within teams. Third, the Elo rating system does not take into account margins of victory, and so a 1-point and a 20-point win are treated the same, while speaker tabs capture (albeit indirectly) such margins. We highlight these factors to point out that the Elo rating system cannot claim to replace speaker tabs, which will continue to remain important.
Comparisons with speakers against whom one has not competed
The second benefit of the Elo rating system is that it allows for comparison with speakers against whom one has not competed; more precisely, it allows for strength comparisons of speakers across circuits. Currently there is no reliable way of telling if a speaker in one regional circuit is stronger than a speaker in another regional circuit. Educated guesses are always possible but are imprecise. It is plausible that a speaker who dominates a particular circuit is not in fact performing particularly well; or that a speaker who is not doing particularly well in a circuit is in fact performing very well relative to the rest of the world. The Elo rating system helps to clear away some of this uncertainty.
Of course, if different circuits had no contact with each other at all the Elo rating system would not be able to provide these comparisons, since the “Elo pools” of each circuit would be closed and Elo points could not be stolen by or from other circuits. This would mean that the weakening or strengthening of a circuit relative to the rest of the world would not be detectable. This concern can be addressed. Regional competitions such as the EUDC, Sydney Mini, ABP, and the US BP Nationals provide one valuable place for pools to mix. The most important competition from the perspective of getting accurate comparison across regions is WUDC, since representatives from all debating circuits will be present, and will determine (together with the number of individuals beginning to debate) the size of their circuit’s Elo pool for the rest of the year.
Consider two speakers A and B in two circuits X and Y respectively, both of whom have never participated in the same tournament. Currently it is very difficult for A and B to compare their debating strength. However, circuit X and Y both send (their strongest) teams to WUDC. If circuit X happens to be strong relative to circuit Y, then its teams will increase the size of circuit X’s Elo pool relative to circuit Y (by winning more debates than circuit Y’s teams at WUDC). If A and B perform roughly equally against teams in their own circuits, it is then likely that A will have a greater number of Elo points than B, since more Elo points collected from WUDC will diffuse into circuit X than circuit Y. A might then be able to say with a reasonable degree of confidence that he/she is a stronger debater than B.
No problem arises even when a circuit’s WUDC teams are highly unrepresentative of the quality of the circuit in general. If the WUDC teams are particularly strong, then they are also unlikely to have the Elo points they gained at WUDC stolen from them by other teams in their circuit. The circuit’s Elo pool increases, but the Elo points are also more tightly locked up in a few teams. The converse logic applies where the WUDC teams are particularly weak.
The third benefit of the Elo rating system is that it allows for certain large-scale comparisons to be easily made. One has been mentioned above – the relative strength of different circuits. However, Elo ratings could also be helpful in detecting bias in circuits towards or against certain genders or races. If circuit X has a large number of female speakers in its regional top 20 ranking and circuit Y has a small number of female speakers in its regional top 20 ranking (controlling for factors like the participation rates of people with different sexual orientations), this suggests that circuit Y might have a bias against female speakers. More prosaically if, say, half of the debaters in a circuit are female but none of them are ranked in that circuit’s top 20 speakers, something is probably wrong. Thus, the Elo rating system is of interest not just to individual speakers who want to become better debaters, but to tournament organisers and bodies like the WUDC Council that have a general interest in making debating fair and inclusive.
Determining tournament/room strength
The fourth benefit of the Elo rating system is that it allows for accurate categorization of tournament strength. For example, for the purposes of novice competitions or pro-ams, we currently determine who an “am” or “novice” is in debating by reference to how many university-level tournaments they have broken in. It might be worth considering broadening the definition of “am” or “novice” to include individuals whose Elo ratings fall below a certain number. A person could have debated for a long time and still benefit hugely from being partnered with a strong debater. The Elo rating system would also let us determine what the overall strength of a tournament (or room) is by simply obtaining the average Elo rating of the relevant speakers. Universities deciding which tournaments to send their teams to might find objective measurements of tournaments’ strength useful. Furthermore, knowing the strength of a particular room in a competition might aid CA teams in judge allocation; they might want to put the best judges in rooms that fall within a certain Elo bracket, for instance.
The fifth benefit of the Elo rating system is that it allows us to integrate performance over in- and out-rounds in a single measurement. This means that Elo ratings capture more information about team performance than team tabs do. A team that progresses from the quarter-finals of a tournament to the semi-finals must beat the two teams that do not progress from the quarter-finals. Thus, it steals points from two teams but, assuming that the judges did not come to a comprehensive team ranking, should neither steal points from nor lose points to the team that progresses through the out-round with it. Loosely speaking, we might say that a team that progresses through an out-round takes a “1.5” ranking, while teams that do not progress take a “3.5” ranking. This makes sense; half of the teams that progress come 1st, and half 2nd, and teams that do not progress come 3rd and 4th half of the time respectively. Of course, these assumptions do not hold true for particular teams; note, however, that including this data is certainly less distortionary than excluding it altogether, since we are certain that each team has won/lost against two other teams, and these wins are just as valid as wins against any other team in an in-round.
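The out-round rule can be sketched as follows, assuming the base-10 Elo curve with divisor 400 and a K-factor of 32. The function names and the team labels W, X, Y and Z are hypothetical. Only the four matchups between a progressing team and an eliminated team are rated; the two progressing teams are not rated against each other, and neither are the two eliminated teams.

```python
def win_probability(t_own, t_opp, divisor=400):
    """Expected probability that t_own beats t_opp (base-10 curve assumed)."""
    return 1.0 / (1.0 + 10 ** ((t_opp - t_own) / divisor))


def out_round_deltas(progressing, eliminated, k_factor=32):
    """Rating changes for one out-round room.

    progressing, eliminated: dicts mapping team name -> pre-round team rating.
    """
    deltas = {name: 0.0 for name in (*progressing, *eliminated)}
    for winner, winner_rating in progressing.items():
        for loser, loser_rating in eliminated.items():
            transfer = k_factor * (1 - win_probability(winner_rating, loser_rating))
            deltas[winner] += transfer
            deltas[loser] -= transfer
    return deltas


if __name__ == "__main__":
    # Hypothetical quarter-final with four equally rated teams.
    room = out_round_deltas({"W": 1600, "X": 1600}, {"Y": 1600, "Z": 1600})
    print(room)  # {'W': 16.0, 'X': 16.0, 'Y': -16.0, 'Z': -16.0}
```

Since each team is involved in only two rated matchups, its rating can move by at most 64 points in an out-round, compared with 96 in an in-round.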
Estimating individual tournament performance
The sixth benefit of the Elo rating system is that it allows for a speaker to estimate to a reasonable degree their performance rating. The performance rating measures the strength of performance at only one tournament; knowing his/her own performance ratings for each tournament would allow a speaker to know which tournament represented their strongest or weakest performance in terms of debating strength, without distortions relating to the strength of the tournament field. It might also allow us to determine the strongest tournament performance by any person recorded in a certain period – a person might not win a tournament, but still be responsible for a stunning performance overall. One way of estimating a speaker’s performance rating for a tournament is to:
- Take the rating of each team beaten and add 400;
- Take the rating of each team lost to and subtract 400;
- Sum the figures obtained; and
- Divide by the number of debates multiplied by three (the number of teams debated against).
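The steps above can be sketched as a single function. The name `performance_rating` and the sample ratings are hypothetical; the sketch assumes every debate has four teams, so that the number of opponents equals the number of debates multiplied by three.

```python
def performance_rating(beaten, lost_to, offset=400):
    """Estimate a tournament performance rating.

    beaten:  team ratings of every team finished above in a room.
    lost_to: team ratings of every team finished below in a room.
    """
    opponents = len(beaten) + len(lost_to)
    if opponents == 0:
        raise ValueError("at least one rated matchup is required")
    total = sum(r + offset for r in beaten) + sum(r - offset for r in lost_to)
    return total / opponents


if __name__ == "__main__":
    # One debate, finishing 2nd: lost to a 2250 team, beat 1900 and 1600 teams.
    print(performance_rating(beaten=[1900, 1600], lost_to=[2250]))  # 2050.0
```

A team that beats every opponent is credited with the average opposing rating plus 400, so the estimate becomes less informative for unbeaten or winless teams.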
A possible advantage of being able to calculate performance rating relates to tie-breaks. Ceteris paribus, we want the team with the higher performance rating to break to out-rounds. Current tiebreak measures tend to arbitrarily favour either consistency or variance in performance (e.g., counting wins) or provide only a limited snapshot of the team’s performance that overemphasises team-specific interactions (e.g., head-to-head records and tiebreak debates). Performance rating might provide a better measure of overall debating strength, although this requires the Elo rating system to be relatively well-developed (i.e., implemented for a significant period of time) so that team ratings accurately capture team strength.
The seventh possible benefit of the Elo rating system relates to pro-ams. It is plausible that strong speakers will see pro-aming as a way to gain rating points, since pro-aming lowers the team rating. Provided that strong speakers believe that they will continue to perform relatively well even when pro-aming, this lowered team rating makes it appear easier for them to gain Elo points from wins. Of course, this effect is not at all a mathematical certainty – the fact that we employ team ratings when determining the size of the Elo point transfer ought to mean that a strong speaker is neither punished nor rewarded when speaking with a novice – but our experience indicates that it is at least plausible that strong speakers perform very well (i.e., not significantly worse than if they were not speaking with a novice) when speaking with novices.
We do not believe that the Elo rating system will be particularly humiliating or off-putting for individuals with low Elo ratings. We should first note that there is no reason to believe that Elo ratings are more embarrassing than speaker tabs, which already list all individuals from best to worst regardless of language category. Being part of the debating community appears to already involve being willing to publicly share one’s successes and failures, as in any other competitive activity. The Elo rating system finesses information that is already available in the form of interactive tabs. We also note that, in relation to speaker tabs at large tournaments, individuals tend to be interested only in (1) their own ranking; (2) the rankings of individuals they know personally; and (3) the top 20 speakers. We have no reason to believe things will be different in relation to Elo rankings. This means that a speaker who is world no.255, for instance, has absolutely nothing to feel ashamed or worry about. If this is in fact a problem, however, the solution would be to only publicly display the Elo ratings of the world’s top 100 speakers. Furthermore, we have reason to believe that Elo ratings might be especially encouraging for novice speakers, who might not see clear indicators of improvement at their first few tournaments if they do not break. And Elo ratings will also tell individuals when they have stagnated so that if they want to they can do something about it.
Does the Elo rating system make debating too competitive? This is hard to tell. Some speakers will want to debate more to improve their ranking; others might want to debate less for fear of damaging it. And (we hope) people will continue to be motivated to debate or not to debate by factors unrelated to Elo rankings: the general need to live a full life, the desire to see friends (debaters or otherwise), the enjoyment of debates, and the desire to do well. It is hard to imagine the Elo rating system making a huge difference to people’s decision-making. What we will know for certain is that people will have more information upon which to base their decisions. This is good.
Section 3: Zagreb EUDC 2014: what would Elo look like?
We assigned every speaker who participated in Zagreb EUDC 1500 points. Therefore each team started the tournament with a rating of 1500. We calculated Elo ratings both after the in-rounds, and after the entire tournament. The top 50 teams were ranked according to their post-EUDC Elo ratings. Since team and speaker ratings are identical (given that everyone began with the same Elo rating) we do not explicitly consider individual ratings.
Several things should be noted:
- Hebrew A broke into both the Open and ESL out-rounds. Since it debated in the Open out-rounds first, its out-round-inclusive Elo was calculated by making the relevant Elo adjustment from the Open quarter-final before making the adjustments from the ESL quarter-final and semi-final.
- Since everyone started the tournament with 1500 Elo points, the post-tournament Elo rankings also function as a measure of tournament performance strength.
- The relevant calculations were not particularly difficult to carry out. Once the Elo formula was provided, the relevant coding for Tabbie took less than 1 hour to complete, although several corrections had to be made later. We estimate that, if told in advance, individuals familiar with Tabbie will be able to perform the relevant Elo calculations for a tournament in less than 30 minutes, assuming that the relevant coding has been completed. The relevant data input and calculations were made easier for this EUDC illustration by the fact that all teams and individuals started off with the same rating, but we do not believe that obtaining speakers’ Elo ratings pre-tournament will be difficult. Obtaining Elo ratings can be integrated into current tournament registration procedures. If there is a central database that immediately updates and stores Elo ratings, this can be consulted. For individuals who wish to write programs that calculate Elo ratings, note that:
  - Elo point transfers in each debate must be calculated independently. Thus, the team that takes a 1st does not have its Elo rating adjusted after the size of the point transfer from one other team has been calculated: all the Δ-values for all teams must be added up before the point transfer is made. (See the hypothetical example provided in Section 1.)
  - If a team’s rating changes by X over the course of a tournament, then each speaker will also have his/her Elo rating change by X.
  - In an in-round, a team’s rating can change by a maximum of 96 Elo; in out-rounds, 64 Elo.
EUDC 2014 Elo ratings (top 50)
| Rank | Elo (after out-rounds) | Elo (after in-rounds) | Team tab |
|---|---|---|---|
| 1 | SHEFFIELD A 1791 | CAMBRIDGE C 1779 | CAMBRIDGE C |
| 2 | OXFORD A 1767 | OXFORD B 1769 | OXFORD B |
| 3 | OXFORD B 1754 | GUU A 1737 | CAMBRIDGE B |
| 4 | CAMBRIDGE A 1748 | CAMBRIDGE B 1734 | OXFORD A |
| 5 | BELGRADE B 1738 | OXFORD A 1732 | GUU A |
| 6 | CAMBRIDGE C 1736 | CAMBRIDGE A 1701 | CAMBRIDGE A |
| 7 | EDINBURGH A 1723 | OXFORD C 1701 | OXFORD C |
| 8 | GUU A 1701 | DURHAM B 1681 | EDINBURGH A |
| 9 | CAMBRIDGE B 1697 | LSE A 1673 | LSE A |
| 10 | BERLIN A 1683 | EDINBURGH A 1673 | SHEFFIELD A |
| 11 | LUND A 1677 | NOTTINGHAM A 1672 | DURHAM B |
| 12 | NOTTINGHAM A 1676 | SHEFFIELD A 1671 | KCL A |
| 13 | KCL A 1675 | KCL A 1671 | HEBREW A |
| 14 | OXFORD C 1670 | DURHAM C 1669 | NOTTINGHAM A |
| 15 | BPP A 1653 | HEBREW A 1668 | DURHAM C |
| 16 | DURHAM B 1649 | DURHAM A 1646 | BPP A |
| 17 | DURHAM A 1646 | BELGRADE B 1645 | BELGRADE B |
| 18 | DURHAM C 1641 | BERLIN A 1644 | LUND A |
| 19 | UCD L&H A 1640 | BPP A 1642 | UCD L&H A |
| 20 | TCD PHIL A 1640 | TCD PHIL A 1640 | TCD PHIL A |
| 21 | LSE A 1639 | UCD L&H A 1640 | DURHAM A |
| 22 | BIRMINGHAM A 1638 | TEL AVIV B 1639 | TARTU A |
| 23 | WARWICK B 1638 | WARWICK B 1638 | WARWICK A |
| 24 | WARWICK A 1638 | WARWICK A 1638 | BERLIN A |
| 25 | HEBREW A 1623 | BIRMINGHAM A 1638 | WARWICK B |
| 26 | TARTU A 1623 | TARTU A 1638 | BIRMINGHAM A |
| 27 | MANCHESTER A 1610 | LUND A 1636 | TEL AVIV B |
| 28 | SOAS A 1608 | BUCHAREST A 1610 | LEIDEN A |
| 29 | ABERYSTWYTH A 1608 | MANCHESTER A 1610 | BUCHAREST A |
| 30 | BGU A 1608 | TILBURY HOUSE A 1609 | SOAS A |
| 31 | GUU B 1607 | ELTE A 1608 | UCC PHIL A |
| 32 | UCD L&H C 1607 | SOAS A 1608 | UCD L&H C |
| 33 | UCC PHIL A 1606 | BGU A 1608 | GUU B |
| 34 | BBU A 1606 | ABERYSTWYTH A 1608 | ABERYSTWYTH A |
| 35 | LEIDEN A 1604 | UCD L&H C 1607 | BBU A |
| 36 | TEL AVIV B 1603 | MANNHEIM A 1607 | TILBURY HOUSE A |
| 37 | BELGRADE A 1581 | GUU B 1607 | MANNHEIM A |
| 38 | HULL A 1581 | BBU A 1606 | MANCHESTER A |
| 39 | ELTE A 1579 | UCC PHIL A 1606 | ELTE A |
| 40 | TCD HIST B 1579 | LEIDEN A 1602 | BGU A |
| 41 | IMPERIAL B 1578 | HULL A 1581 | TCD HIST B |
| 42 | TILBURY HOUSE A 1578 | TCD HIST B 1579 | TCD HIST A |
| 43 | LSE B 1577 | IMPERIAL B 1578 | STRATHCLYDE A |
| 44 | WARSAW A 1577 | LSE B 1577 | LSE B |
| 45 | BRISTOL B 1577 | UCC LAW B 1577 | GUU C |
| 46 | UCC LAW B 1577 | BRISTOL B 1577 | HULL A |
| 47 | TCD HIST A 1576 | WARSAW A 1577 | LANCASTER A |
| 48 | STRATHCLYDE A 1576 | TCD HIST A 1576 | BRISTOL B |
| 49 | ULU C 1576 | STRATHCLYDE A 1576 | ULU C |
| 50 | LANCASTER A 1576 | LANCASTER A 1576 | STRATHCLYDE B |
Several things should be noted:
- The changes in Elo rating are relatively large, often approaching 300 points. This is because many speakers began with a score (1500) that was highly unlikely to represent their debating strength, and because EUDC is a large tournament where each team must debate against at least 27 others. There are hence at least 9 ratings adjustments, each with a hypothetical maximum size of 96 Elo points, to be made.
- The ranking according to in-round team ratings corresponds fairly well to the team tab, with some minor divergences (see Durham B, Nottingham A, Elte A, and Lund A, for example). This is unsurprising, given that the EUDC (1) has a relatively large number of in-rounds and (2) employs power-pairing. Less correspondence will tend to be seen in smaller tournaments.
- The out-rounds have a significant impact on Elo rating. Sheffield A, ranked equal 12th based on Elo after the in-rounds, gains 120 Elo points by defeating 7 strong teams in the out-rounds to come 1st in the final Elo rankings, and comes very close to crossing the 1800 mark. Belgrade B also moves from 17th to 5th position in this manner.
- Even though EUDC 2014 is a large tournament, it is unclear if the Elo rankings above are representative of the speakers’ relative strength; more time might be needed for estimated and actual performance to match and for Elo ratings to stabilise. We did not calculate Elo ratings on a round-by-round basis, and so do not know if Elo rankings stabilised before Round 9. For teams at the upper and lower ends of the Elo ranking, we suspect that this is unlikely to be the case.
Section 4: Further issues for consideration
Issues that we have not had time or space to discuss but which are relevant and might merit exploration include:
- Modifying any one of the arbitrary parameters used in our Elo calculation, such as the initial number of Elo points (1500), or the size of the divisor in the probability calculation (400).
- Using the geometric rather than arithmetic mean to determine team ratings.
- Specific K-factor issues:
  - Having higher K-factors for tournaments deemed to be important.
  - Having lower (or higher) K-factors for out-rounds.
  - It might be especially useful to have a K-factor that starts out large but shrinks down to a minimum value over time, to ensure that people can rapidly move towards their representative Elo rating from the initial 1500. A simple formula for achieving this might be to have a K-factor of: 500/(number of debates), with a minimum K-factor of 32. This drastically reduces the time it takes to move away from the 1500 rating, since the first few (rated) debates will have a very large impact.
  - Having a rating-staggered K-factor. E.g.: a K-factor of 32 for ratings between 1200 and 1600, 24 for ratings between 1600 and 2000, and 16 for ratings above 2000.
- Excluding certain tournaments from Elo calculations.
- Implementing (separate) Elo ratings for non-BP debating formats, with which we are not intimately familiar. We note that the relevant calculations ought to be simpler where debates only include 2 teams.
- Integrating the Elo ratings for BP and non-BP formats. This is worth serious consideration, since debaters in the Australian and Asian circuits debate mostly in the Australs and Asians formats. Implementing Elo ratings only for BP debating means that (1) these debaters have few chances to have their Elo rating adjusted, sometimes as few as 3 a year, and that (2) both UADC and Australs are excluded from Elo calculations. Separate Elo ratings might be necessary if the Australs and Asians formats are considered too different from the BP format for a single Elo rating to make sense. Since we are not intimately familiar with the Australs/Asians formats, however, we do not take a stand on this issue.
- Using Elo as an aid in team allocations for WUDC. Given that the demand for WUDC spots appears to be growing faster than WUDC can accommodate it, Elo ratings might be useful in determining which one among two institutions gets, say, a 3rd team for WUDC. We might wish to give the spot to the team with the higher Elo rating. Of course, this assumes a certain set of aims of the WUDC, and we do not take a stand in this article on this issue.
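The two K-factor schedules suggested in the list above can be sketched as follows. The function names are hypothetical, and several details are assumptions the list leaves open: the debate count is taken to include the debate being rated (so it is at least 1), a rating exactly on a boundary falls into the higher bracket, and ratings below 1200 are given the same K-factor as the lowest bracket.

```python
def decaying_k(debates_played: int, numerator: float = 500.0, minimum: float = 32.0) -> float:
    """K-factor of 500 / (number of debates), never dropping below 32.

    Assumption: debates_played counts the debate being rated, so it is >= 1.
    """
    if debates_played < 1:
        raise ValueError("debates_played must be at least 1")
    return max(minimum, numerator / debates_played)


def staggered_k(rating: float) -> int:
    """K-factor that depends on the current rating.

    Assumptions: boundary ratings belong to the higher bracket, and
    ratings below 1200 are treated like the 1200-1600 bracket.
    """
    if rating >= 2000:
        return 16
    if rating >= 1600:
        return 24
    return 32


if __name__ == "__main__":
    for n in (1, 5, 10, 16, 50):
        print(n, decaying_k(n))
    for r in (1500, 1800, 2100):
        print(r, staggered_k(r))
```

With these defaults the decaying schedule reaches its floor of 32 from the sixteenth rated debate onwards, since 500/16 is just under 32.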