Analysis: Coping with the Schedule

More than a gauntlet is needed for schedule balance


The Ontario SchoolReach provincial championship whittles roughly 40 teams down to three national invites. To coordinate the largest field of any SchoolReach event, teams are split into pools of (usually) eight that play round-robins within the pool, with the top two in each group moving on to the playoffs.

The composition of the pool can play a significant role in how far a team can progress in the tournament. There are good-faith efforts to balance the pools, but historically with no other background information, organizers had to use reputation (and geographical separation) to form the pools. Often, this led to strange results, such as two 2013 national invites coming from the same preliminary group. Ideally, and with more information, teams would be sorted so that they earn a final rank appropriate to their performance.

But I can’t solve that for now. What I can do is look back, thanks to the data I have collected from past tournaments. I occasionally get asked (or hear complaints) about how teams don’t get a fair shot during provincials, either through losing a playoff spot to a “weaker” team or having to deal with a group of death. I took a look at some numbers.

The analysis is based on teams that had at least 10 appearances at Ontario provincials since 1999. Results from 2003-05 are excluded from the averages because I don’t have pool composition for those tournaments (just points and ranks). 18 schools fit the bill, including most of the modern “usual suspects” for national qualifications.

Fig 1. Average rank and PPG of frequent Ontario SchoolReach championship attendees

First up is a team’s average rank against their average round-robin points per game. See figure 1, and excuse the crowded labels in places; some teams are close together. There is an unsurprising relationship – teams that finish well scored more points to get there. There are four teams that are at least a full standard deviation from the linear trend:

  • UTS earns more points than necessary to get their rank. They are also limited by being unable to go below 1, even though they would fit closer to a theoretical rank of “0”.
  • Lisgar gets the round-robin points to justify an average rank in the 1-3 range. However, they have a history of stumbling in the playoffs, especially the televised ones, which gives them a lower final rank than their seed would suggest.
  • I will get back to Leaside in a later graph. In the early years, the team scored UTS-esque point tallies. In their later years, they had schedule benefits. Their mid-years are excluded (2003-05).
  • Assumption earns fewer points than expected. It will be seen later that my past assumption (pun intended) that they get easy draws is false. Instead, they probably earn lots of close wins in the prelims, landing on the better side of the playoff bubble through razor-thin margins of victory.
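The outlier check used in these charts can be sketched with an ordinary least-squares fit: flag any team whose residual from the trend line exceeds one standard deviation. This is a minimal sketch; the team names and numbers below are hypothetical, not the real data.

```python
def linear_outliers(points):
    """points: {team: (x, y)}. Return teams whose residual from the
    least-squares trend line exceeds one standard deviation."""
    xs = [x for x, _ in points.values()]
    ys = [y for _, y in points.values()]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    resid = {t: y - (slope * x + intercept) for t, (x, y) in points.items()}
    sd = (sum(r * r for r in resid.values()) / n) ** 0.5
    return [t for t, r in resid.items() if abs(r) > sd]

# Hypothetical (PPG, average rank) pairs; team E sits well off the trend.
flagged = linear_outliers({"A": (400, 4), "B": (300, 6),
                           "C": (200, 8), "D": (100, 10), "E": (250, 12)})
```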
Fig 2. Average rank and strength of schedule of frequent Ontario provincials attendees

Next is the comparison of rank and strength of schedule. The relationship is not as strong, but teams with better ranks usually have an easier schedule. This is expected for balanced pools – the top teams in the pools face teams weaker than them, while the bottom teams face opponents stronger than them. Unfortunately, we don’t have information-based balance, so we are starting to see some outliers:

  • Leaside is on the low side of this chart. They were getting statistically significantly easier schedules than their rank would suggest. However, I believe I can explain this – Leaside made the provincial final in all (and only) the three excluded years. Leaside was extremely good in the 2003-05 period. They were also a very strong prelim team before that, but would slip in the playoffs. For their remaining active years (consecutively until 2009), they probably benefited from reputation placing them as expected pool winners, but they never made playoffs again after the 2005 run. If the 2003-05 results could be added, they would have a higher average rank with probably not too much change in SOS.
  • Lisgar appears low, but is within a standard deviation. As mentioned before, their average rank is worse than expected because of historical poor playoff performance.
  • The cluster of Oakville-Trafalgar, Waterloo, and Westdale have a right to gripe. They face statistically significantly tougher schedules than their results would justify – Westdale is almost two standard deviations from the trend. OTHS is particularly surprising: they had good results in the missing years (thanks to University Challenge celebrity Eric Monkman), but don’t appear to have been given a “boost” from that reputation; they seem to be put in pools under the assumption they don’t do well. Westdale’s tough luck was also looked at in an earlier post when I posited (incorrectly) that Hamilton teams in general suffered from bad schedules.
Fig 3. Average strength of schedule and PPG of frequent Ontario provincials attendees

The last graph, comparing SOS and PPG, could be summarized as how teams cope with the schedule they’ve been dealt. Strength of schedule loosely represents pool strength and the potential unbalance, so teams getting PPGs above the trend are punching above their weight to overcome a bad draw. A few teams are outliers:

  • Westdale still stands out (OTHS and Waterloo draw closer to the trend in this analysis). Their single greatest mountain to climb was the 2013 pool: they had a 5-2 record, their second-best ever PPG relative to the set, and a final rank of 11th, all while dealing with two nationals-bound teams and a third team that also got into the playoffs. Westdale also incredibly made playoffs in 2009 with a 1.15 SOS. Westdale often got the worst schedules, but they made every effort to try to get something out of it.
  • Assumption is the outlier on the low end. I don’t wish to suggest that they are a low-effort team, though. They get schedules that are roughly fair for what is expected of them, but the first analysis suggested that they just don’t pick up large margins of victory.
  • UTS is also an outlier. They appear to have an easy slate of opponents, but they are still performing better than their schedule would expect. UTS has had a few years with tough pools (including the 2013 one mentioned earlier) while still consistently putting up points – they have qualified for nationals four times with a preliminary SOS greater than 1. Organizers (unintentionally) throw tough teams at UTS, and they still prevail.

So there are some data to ponder. I’m sure there are some less-frequent teams that also struggle or get an easy break, but the teams highlighted here should have enough sample size to stand out. Use your own results to see how your team compares to these provincial regulars.

On distribution

A topic about topics

As teams get started for the year, a common question is “What should I be studying?” Good teams usually have a system for splitting categories among players, so what actually is important?

Ideally, there is a topic distribution. A thorough distribution usually has a fixed number of questions for each broad subject per pack, and a fixed number of questions for each subcategory over the course of a set. For example, a set might have 15 science questions per pack, with 10 geology questions spread over the twelve packs in the set. Topic distributions are well-established in the quizbowl circuits south of the border; not only do they offer guidelines on what to study, but they also organize the submission of questions editors receive from writers.

There is no official distribution in Reach. I found this out while setting up a writing effort in 2014, and I don’t think that has changed since. I will return to that later, but I will start with my brief foray into coaching in 2012. For that year, I tried to reverse-engineer a distribution from the limited number of complete sets I had on hand at the school (the 2010-11 and 2012-13 intramural sets). The chart:

Broad subject distribution of 2010 & 2012 intramural sets. Error bars are standard deviation based on amounts from each individual pack.

A brief explanation of the broad subjects:

  • Pop Culture: TV, movies, music, games, and sports. I generously lumped all books (even Twilight) with literature. The largest subcategory was sports, with about 7.1% of total question content.
  • History: Anything under the domain of history up to 2000, unless it fit more specifically into a smaller category like science or arts. I subdivided history into Canadian, US, post-Roman Europe, Ancient, and World. Europe had 6.7% of total content.
  • Science: For this case, science includes solving math problems, though I tallied them as a subcategory separate from math concepts. Some topics, like Newton discovering gravity, fell under history if I deemed it more suitable. Perhaps surprisingly to some, neither computational math nor chemistry (via elemental symbol questions) was largest; it was biology with 5.4% of total content.
  • Geography: The catch-all for identifying things on a map. Theory of geography was non-existent. There was a fairly even balance between Canadian, European, and “rest of World”, with “rest of North America” being the subcategory that lost out. I can’t recollect fully, but I think “rest of world” was largely dominated by identifying capitals, while Canadian geography got its point haul from several who-am-I questions.
  • Literature: With less than 10% of total content, even after throwing in “pop lit” and children’s books, this is the most lacking subject in the distribution. Most quizbowl distributions place literature around 20% as one of the “big three” with history and science. In the sets I reviewed, Canadian, US, and European literature combined (the rest of the world was non-existent) came to 7.1% – the same as sports.
  • “Words”: This is another catch-all for questions in which the word itself is more important than having background knowledge in a subject. The three subcategories I used were “definitions (incl. translation)”, “spelling”, and “wordplay”. “Wordplay” includes anagramming and those quirky questions where you add a letter to make a different word. For “definitions”, I was willing to place science/history/etc ones in that corresponding subject, but a lot were just things like “what does genuflect mean?” In my opinion, none of this “words” category belongs in quizzing, but it’s there.
  • Miscellaneous: The last catch-all for questions that don’t fit another subject. Provincial flowers, mixing colors, slogans, and so on. It is possible for legitimate topics to appear as “miscellaneous” (even quizbowl accounts for this), but I’m not holding my breath for questions on administration, shop class, nursing, and other studied material that fall between the subject cracks.
  • R/M/P: This is a quizbowl clumping. It stands for “religion, mythology, and philosophy”, which are topics that can often overlap. “Religion” in this case refers to practices and beliefs; events (such as the life of Buddha or the 95 Theses) usually fall under history. Dominated by identifying Greco-Roman gods, the mythology subcategory leads the way with 2.6% of total content.
  • Fine Arts: This category has your visual (painting, sculpture, etc) and auditory (music, opera, etc) arts. Smaller topics like dance, architecture, and certain films find their way in as well. Music is slightly favored over visual art in the sets I reviewed, but the subject as a whole is fairly small.
  • Current Events: Topics after 2000 that wouldn’t be considered “pop culture” or sports are found here. It ended up being fairly evenly split between politics and other newsworthy events. This is a subject that can be difficult to “study” for; it tends to require a habit of being world-aware.
  • Social Science: Got an interest in economics, anthropology, psychology, linguistics, or dozens of other social sciences? Too bad.

Take all this with a grain of salt. It is only a sample size of two sets, and sets do evolve as the difficulty and years change – a modern Nationals set is not likely to have much wordplay, for example. However, it is still reasonable to expect that the “big three” of Reach will be pop culture, history, and science this year.

When I wrote in 2014, I used the following distribution:

  • 19.3% History. I tried a roughly even split between Canada/US/Europe (all eras)/World, but shifting some US to Europe.
  • 19.3% Science. No computational math. I restricted questions on elements to not make any reference to their symbol, except possibly as a gimme at the end of a what-am-I or long question.
  • 17.9% Literature. This was my weakest subject, but it needed a big boost from what existed.
  • 12.9% Pop culture. Sports was toned down to balance out with movies, TV, music, and games.
  • 8.6% Geography
  • 5.7% Fine arts
  • 5% R/M/P
  • 4.3% Current events. So much for that category when my questions started showing up three years after the fact.
  • 2.1% Social sciences
  • 5% Miscellaneous. No “words”. Usually, “miscellaneous” became a multi-clue question that spanned several different subjects (like how “blue” appears in science, literature, or music, for example).

It should be noted that the list above is what I submitted. Editorial control determined what topics appeared in question sets, and it needed to accommodate questions from other authors.
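Written out as data, the submitted distribution is easy to check. The percentages below are copied from the list above (they sum to 100.1% because of rounding), and the 40-question pack size is a hypothetical illustration, not a Reach standard.

```python
# Percent of total content per broad subject, from the 2014 submission.
distribution = {
    "History": 19.3, "Science": 19.3, "Literature": 17.9,
    "Pop culture": 12.9, "Geography": 8.6, "Fine arts": 5.7,
    "R/M/P": 5.0, "Current events": 4.3,
    "Social sciences": 2.1, "Miscellaneous": 5.0,
}

# For a hypothetical 40-question pack, convert percentages to question counts.
per_pack = {subj: round(pct / 100 * 40) for subj, pct in distribution.items()}
```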

I don’t know the future of a distribution in Reach. Ever since quizbowl showed up, there have been rumours of players trying to adapt the topics, but rarely do they get to the point of writing and submitting. Quizbowl writing is a more lucrative venture, which discourages any new talent from coming in at Reach. There is also the problem that the majority of customers would resist change, even a change to better reflect a school curriculum (when was the last time you used anagrams or Drake in class?).

For now, if you’re looking for something to study, find the biggest topic no one else on your team is covering. That’s the best bet. How to study is another matter…

The R-Value

The points you gave me, nothing else can save me, SOS

Several of my posts have referenced the “R-value”. I think most people realize it is some sort of statistical measure of a team’s strength, but they are confused by either its derivation or interpretation. I am long overdue on clarifying this.

Primarily, the R-value is a mechanism to rank teams who all played the same questions, but did not necessarily play each other. The two most useful applications for this are the Ontario regional-to-provincial and the Ontario provincial prelim-to-playoff qualification systems. Both have a large number of teams that need to be condensed to a small fraction of top teams that would proceed to a higher level, and they all played (roughly) the same questions.

A mechanism exists for this purpose in the US. National Academic Quiz Tournaments’ college program has a couple hundred university teams compete in regional tournaments, all vying to qualify for 64 spots in their national championship (across two divisions). The regional tournaments are all played on the same set of questions. Originally, NAQT used an undisclosed “S-value” to statistically determine which teams, beyond regional winners, deserved a spot in the national championship. With the cooperation of regional hosts providing stats promptly, NAQT could quickly analyze the results and issue qualification invitations a few days after the regional tournaments. Prior to the 2010 season, Dwight Wynne proposed a modified formula, published transparently so that all teams could verify their values were correct. NAQT adopted this, and named the mechanism the “D-value” in honour of Dwight. In 2015, the Academic Competition Federation introduced their “A-value” for national qualifications, which largely followed the D-value formula.

The R-value is a D-value modified for SchoolReach. The “R” stands for “Reach” or “Reach for the Top”. SchoolReach results typically lack the detailed answer conversion information available in quizbowl, so the R-value is dependent on total points and strength of schedule. I also added two modifications that I will get to later.

The R-value asks: “How does a team compare to a theoretical average team playing on the question set?” It is answered in the form of a percentage; if a team has an R-value of 100%, they were statistically average for the field. A step-by-step process to get there:

Note: my primitive embedding of LaTeX in WordPress is used below; it may not render in your browser.

  • First, calculate all teams’ round-robin points-per-game (RRPPG). All games which occur in a round-robin system are included, even if a team plays another team multiple times. Playoffs, tiebreaking games, and exhibition matches are excluded. If certain games are known to be “extended” (for example, double-length), that is reflected in the “RR games” total.
  • RRPPG=\frac{RRPts}{RRG}
  • With the RRPPGs known, determine each team’s round-robin opponent average PPG (RROppPPG). This is the average of the PPGs of each opponent a team played, double- or triple-counting where appropriate if they faced each other multiple times. Note: this is not the same as a team’s average points against, which is not used in this analysis.
  • RROppPPG=\frac{RRPPG_{opp_1} +RRPPG_{opp_2} +...+RRPPG_{opp_n}}{RRG}
  • The question set’s average points is also needed. This covers all pools and all sites where the questions were used for the purpose of the rank. I determine this average through total RR points and total RR games, so larger sites that have more games do end up with a larger influence on the set average.
  • SetPPG=\frac{\sum{RRPts}}{\sum{RRG}}
  • Strength of schedule (SOS) is a factor to determine how strong a team’s opponents were compared to facing an average set of opponents for the field. A value above 1 indicates a tougher than average schedule; below 1 is a lower than average schedule. In reasonably balanced pools, it is typical to have top teams below 1 and bottom teams above 1 – a top team doesn’t play itself, but its high point tally contributes to the total of one of its weaker opponents. Also, by comparing across multiple pools/sites, SOS can give an overview of how strong a pool/site was.
  • SOS=\frac{RROppPPG}{SetPPG}
  • Now for the biggest leap: the points a team earned must be modified to account for how strong its schedule was. Racking up 400 PPG is far more difficult against national contenders than against novices. Adjusted RRPPG multiplies points by the SOS factor – a tougher schedule gives a team a higher adjusted point total. This adjusted value theoretically represents a team’s PPG if they faced a slate of average teams. Note: this value is not shown in result tables.
  • RRPPG_{adj}=RRPPG \times SOS
  • This value is suitable on its own for ranking. However, I add an extra step of normalizing for the set, so I can compare across years. Earning 400 PPG is far more difficult when the set average is 200 compared to a set average of 300. For example, the late ’90s/early ’00s had much higher set point totals than today (owing to different formats), and a normalization is needed to compare historical teams of that era to today. The calculated result is the raw R-value, which I convert to a percentage for easier comprehension of how different from average a team is.
  • Rval_{raw}=\frac{RRPPG_{adj}}{SetPPG} \times 100\%
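The steps above can be sketched in a few lines of Python. This is a minimal illustration on a hypothetical three-team round-robin; the function name and input format are my own, not anything official.

```python
from collections import defaultdict

def raw_rvalues(games):
    """games: list of (team_a, pts_a, team_b, pts_b) round-robin results.
    Returns {team: raw R-value, as a percentage}."""
    pts = defaultdict(float)   # total RR points per team
    ngames = defaultdict(int)  # RR games per team (RRG)
    opps = defaultdict(list)   # opponents faced, repeats included

    for a, pa, b, pb in games:
        pts[a] += pa; ngames[a] += 1; opps[a].append(b)
        pts[b] += pb; ngames[b] += 1; opps[b].append(a)

    rrppg = {t: pts[t] / ngames[t] for t in pts}
    # Set average from totals, so larger sites carry more weight.
    set_ppg = sum(pts.values()) / sum(ngames.values())
    opp_ppg = {t: sum(rrppg[o] for o in opps[t]) / ngames[t] for t in pts}
    sos = {t: opp_ppg[t] / set_ppg for t in pts}        # strength of schedule
    adj = {t: rrppg[t] * sos[t] for t in pts}           # RRPPG_adj
    return {t: adj[t] / set_ppg * 100 for t in pts}     # raw R-value, %

# Hypothetical three-team round-robin:
games = [("A", 300, "B", 200), ("A", 350, "C", 150), ("B", 250, "C", 200)]
rvals = raw_rvalues(games)
```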

Raw R-value is the number I use for most comparison purposes. In earlier posts, I tried to show some examples of how this statistic is useful for predicting future performance (especially playoffs) and analyzing outlier results. If R-value is to be used for any sort of qualification system, however, it needs to account for the universally-accepted idea that it is most important to win games. Almost all tournaments use final ranks based primarily on winning (either in playoffs or just prelim results). A team with a low (raw) R-value that finishes ahead of a team with a high R-value deserves qualification just as much as (if not more than) the teams below it in the standings. The actual R-value is then calculated, based on NAQT’s system (quoting from their D-value page):

After the raw values are computed, they are listed in order for each [site] and a correction is applied to ensure that invitations do not break the order-of-finish at [a site]. Starting at the top of each [site], each team is checked to see if it finished above one or more teams with higher D-values. If it did, then that team and every team between it and the lowest team with a higher D-value are given the mean D-value of that group and ranked in order by their finish.

Let’s say a site winner had a raw R-value of 120%, while the runner-up, upset in the final, finished with a raw R-value of 140%. Under this adjustment, both teams end up with the mean, 130%, for their true R-value. The winner receives a boost for finishing above one or more stronger teams, while the lower teams receive a penalty for not reaching their “potential”. The true R-values would then be compared across pools/sites for qualification purposes; if tied teams straddle the cutoff for qualification, invites are issued in order of rank at the tournament.

I do deviate slightly from this formula, though. It is possible, but rare, for the top-ranked team in this averaging to end up with a lower R-value for finishing higher than a stronger team (e.g., 1st 120%, 2nd 80%, 3rd 130%; all teams get 110%). I don’t believe this should ever happen. If it does, I modify the averaging by this algorithm:

  • First, follow the NAQT algorithm
  • If the first team in the averaging has their R-value lower than their raw R-value, ignore the last team (which has a higher raw R-value than the first team)
  • Proceed to the team one rank above the formerly-last team and attempt the R-value average again. Repeat until the first team improves upon their R-value.
  • Continue the NAQT algorithm with the next team after the new set of averaged teams
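A sketch of that modified averaging, under my reading of the steps above. The input is a list of raw R-values in final order of finish; all numbers are hypothetical.

```python
def corrected_rvalues(raw):
    """raw: raw R-values in final order of finish (index 0 = 1st place).
    Applies the order-of-finish averaging, shrinking a group from the
    bottom whenever the mean would drop its top team below its raw value."""
    n = len(raw)
    out = [None] * n
    i = 0
    while i < n:
        # Lower-finishing teams with a higher raw value than team i.
        higher = [j for j in range(i + 1, n) if raw[j] > raw[i]]
        assigned = False
        for j in reversed(higher):   # widest group first (plain NAQT rule)
            mean = sum(raw[i:j + 1]) / (j - i + 1)
            if mean >= raw[i]:       # group must not hurt its top team
                out[i:j + 1] = [mean] * (j - i + 1)
                i = j + 1
                assigned = True
                break
        if not assigned:             # no acceptable group; keep raw value
            out[i] = raw[i]
            i += 1
    return out
```

With the winner/runner-up example, `corrected_rvalues([120, 140])` gives both teams 130; with the three-team edge case, the winner keeps 120 while 2nd and 3rd average to 105.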

Look at the 2016 Ontario Provincials results for an example. Woburn had a very high raw R-value (131.8%), but finished very low (22nd). Under the basic D-value algorithm, 4th-placed London Central would have joined the big set of teams all the way down to Woburn, and ended up with a decrease in their R-value, thanks to the many intermediary teams with low raw R-values. Instead, Woburn was ignored, and the next-lowest team with a higher raw R-value (Hillfield at 132.9%) was tested. Again, this would drop Central’s R-value because of the low value for intermediary Marc Garneau. It is only an average with 5th-placed Waterloo that allows Central to improve on their raw result. From this, the algorithm goes to the next “unaveraged” team, Marc Garneau, who starts the group all the way down to Woburn because they earn a slight R-value boost. 6th through 22nd end up with a final R-value of 110.6% each.

And that’s how you get the R-value. The math isn’t that complicated, but it does require detailed number-crunching, especially for the opponent PPG step. Until more thorough result reporting occurs in SchoolReach, it is probably the best analysis that can be done with the information available. Thankfully, it is a fairly reliable metric for team performance, and I hope to show some examples in future posts.

Shootout theory

Boy, that title can be taken out of context.

In the 15 or so years of “shootouts” in SchoolReach, they have been the most captivating part of a match. Over the course of a blitz of questions, teams must demonstrate depth of knowledge across all four players, as correct answers slowly whittle down the field until all the pressure rests on the final teammate to earn the 40 points. It’s nail-biting, it’s a big swing of points, it’s…

…the least important stretch of a game.

Yes, I will argue that the shootout is insignificant to the point of irrelevancy for a good team. In fact, it can be a statistical annoyance in the context of a whole tournament. It just requires a different mindset.

The shootout offers 0 or 40 points over 12 questions. Let us assume that a match featuring at least one good team will see the 40 points attained, and not let all that buzzing go to waste. The shootout thus offers 3.33 potential points per question (PPPQ). Compared to other types of questions:

  • List question: 50 PPPQ
  • “What-am-I?”: 40 PPPQ
  • 20-point special: 20 PPPQ
  • Team scramble: 10 PPPQ, but an effective 40 if all the “potential” is dependent on the first part
  • Snappers/open questions: 10 PPPQ
  • Assigned questions: 10 PPPQ, but depends on opponent being incorrect every time
  • Relay: 6.25 PPPQ (other half of relay is unavailable to one side)
  • Shootout: 3.33 PPPQ
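The list above is simple arithmetic: PPPQ is just the maximum points a question type can offer one team, divided by the number of questions it consumes. A quick sketch:

```python
def pppq(max_points, n_questions):
    # Potential points per question for one team.
    return max_points / n_questions

shootout = pppq(40, 12)   # 40 points spread across 12 questions
what_am_i = pppq(40, 1)   # 40 points on a single question
```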

But why is this relevant? Shouldn’t 40 points from a scramble/bonus group or a “what-am-I?” be the same as 40 points from a shootout? Yes, it’s still 40 points, but it is an extremely inefficient source of points on which to focus. A subsequent span of 12 open questions can easily net you enough points to recover from any shootout loss, and a correct team scramble opener gives you all the shootout potential with one buzz.

Point efficiency arises from the fact that there is a limit of 80-90 questions in a game. Earning points is not only critical for winning (obviously), but also for improving your position in tournament standings, through seedings and tiebreaks. In fact, the mere existence of a shootout can have more impact on your standing than the outcome of it! See below:

Suppose you are a reasonably good team that averages >300 points per game (more than 1/3 of available points) in a tournament that has a bye round. Your reasonably good rival in the standings sat out during a round whose pack contained a shootout, while your own bye fell in a round without one, so the equivalent 12 questions your rival played were a mixture of 10 PPPQ formats (assigned, open, etc). Over those 12 filler questions, your rival would earn more than 1/3 of the points on average, which is more than the 40 points you would gain from winning your shootout.
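The arithmetic behind that scenario, using the illustrative numbers from the paragraph:

```python
filler_available = 12 * 10            # 12 regular questions at 10 PPPQ
one_third_share = filler_available / 3  # what a team taking 1/3 earns
shootout_max = 40                     # best case from winning the shootout
# A rival taking exactly 1/3 of the filler points already matches your
# maximum shootout haul; a >1/3 team beats it outright.
```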

When I ran regional tournaments, I reviewed the sets in advance to determine the potential points in a match, and normalized scores so that byes would not skew the field. As far as I can tell, no one else in the history of SchoolReach has done this, and standings are just based on actual points. If every game consistently had exactly one shootout, this would be less concerning, but that is not the case.

I hope I have demonstrated that, in theory, shootouts are not worth their perceived importance. Unfortunately, the issue of morale remains. Shootouts are inherently set up to be a momentum swing that can start an underdog comeback or solidify a lead. It’s also a gimmick to give a greater chance of upsets, since upsets are usually more likely to occur when fewer total points are available. The best thing a good team can do is find a “mental zone” to ignore any effects of a shootout, good or bad. A good team should know that both a win and a loss are insignificant compared to a good buzz on a “what-am-I?” or team scramble, and that a stretch of 12 open questions has more impact than all the time spent on a shootout. Of course you should still attempt a shootout, but don’t fret over it…

…Worry about the “what-am-I”. But that’s another story.