
Thursday, January 21, 2016

CS grades: probably more normal than you think they are

It's commonly said that computer science grades are bimodal. And people in the CS education community have spent a lot of time speculating and exploring why that could be. A few years back, I sat through a special session at ICER on that very topic, and it occurred to me: has anybody actually tested if the grades are bimodal?

From what I've seen, people will take a quick visual look at their grade distributions, see two peaks, and conclude the grades are bimodal. I've done it myself.

Here's the thing: eyeballing a distribution is unreliable. If you gave me some graphs of real-world data, I wouldn't be able to tell on a quick glance whether they're, say, Gaussian or Poissonian. And if I expected it to be one of the two, confirmation bias and System 1 Thinking would probably result in me concluding that it looks like my expectation.

Two peaks in real-world data don't necessarily mean you have a bimodal distribution, particularly when the two peaks are close together. In this context, a bimodal distribution means two different normal distributions added together (because you're sampling two different populations at the same time).

It's quite common for normal distributions to have two "peaks", due to noise in the data. Or the way the data was binned. Indeed, the Wikipedia article on Normal distribution has this histogram of real world data that is considered normal -- but has two peaks:

And since this graph looks in all honesty like a lot of the grades distributions I've seen, I decided I'd statistically test whether CS grades distributions are normal vs. bimodal. I got my hands on the final grades distributions of all the undergraduate CS classes at the University of British Columbia (UBC), from 1996 to 2013. That came out to 778 different lecture sections, containing a total of 30,214 final grades (average class size: 75).
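To see how easily this can happen, here's a quick R sketch (R being what I used for the rest of the analysis) that draws a class-sized sample from a single normal distribution and histograms it with two different bin widths. The sample size and bin counts are just for illustration -- try a few seeds and watch the "peaks" come and go.

    # A genuinely normal "class" of grades can easily show two apparent peaks,
    # depending on sampling noise and how the histogram is binned.
    set.seed(1)
    grades <- rnorm(75, mean = 70, sd = 12)  # average class size at UBC was 75

    par(mfrow = c(1, 2))
    hist(grades, breaks = 10, main = "10 bins", xlab = "Final grade")
    hist(grades, breaks = 20, main = "20 bins", xlab = "Final grade")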

How do you test for normality vs bimodality?

There are a bunch of ways to test whether some data are consistent with a particular statistical distribution.

One way is to fit your data to whatever formula describes that distribution. You can then eyeball whether your resulting curve matches the data, or you could look at the residuals, or even do a goodness of fit test. (It's worth noting that you could fit a bimodal model to a normal distribution -- the two sub-distributions would just be extremely close together! But if a single normal distribution fits the data, that's the simpler explanation -- Occam's razor and all.)
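Here's a rough sketch of what that could look like in R: fit a single normal and a two-component mixture, then compare them with BIC, which penalizes the mixture for its extra parameters. The mixtools package and the simulated grades here are just for illustration.

    # Rough sketch: fit a single normal and a two-component normal mixture
    # to one section's grades, then compare the fits with BIC.
    # mixtools and the simulated grades are for illustration only.
    library(MASS)      # fitdistr()
    library(mixtools)  # normalmixEM()

    set.seed(1)
    grades <- rnorm(75, mean = 70, sd = 12)  # stand-in for one section's final grades

    fit_normal  <- fitdistr(grades, "normal")
    fit_mixture <- normalmixEM(grades, k = 2)

    n <- length(grades)
    # BIC = -2 * log-likelihood + (number of parameters) * log(n)
    bic_normal  <- -2 * fit_normal$loglik  + 2 * log(n)  # mean, sd
    bic_mixture <- -2 * fit_mixture$loglik + 5 * log(n)  # 2 means, 2 sds, 1 mixing weight

    # The lower BIC wins; for genuinely normal data the single normal usually does.
    c(normal = bic_normal, mixture = bic_mixture)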

Another way is to use a pre-established statistical test which will allow you to reject/accept a null hypothesis about the nature of your data. I went this route, for the ease of checking hundreds of different distributions and comparing them.

There is a large variety of tests for whether a distribution is normal. I chose Shapiro-Wilk, since it has the highest statistical power.
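Shapiro-Wilk is built into base R as shapiro.test(), so running it on one section's grades looks something like this (simulated grades again, for illustration):

    # Shapiro-Wilk test for normality (shapiro.test() is in base R's stats package).
    # Null hypothesis: the sample comes from a normal distribution.
    set.seed(1)
    grades <- rnorm(75, mean = 70, sd = 12)  # stand-in for one section's final grades

    shapiro.test(grades)$p.value  # p < 0.05 would mean rejecting normality for this section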

There aren't as many tests for whether a distribution is bimodal. Most of them work, more or less, by estimating the difference between the means of the two sub-distributions in the bimodal model and testing whether those means are sufficiently far apart. I used Hartigan's Dip Test, because it was the only one I could get working in R #OverlyHonestMethods.
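For anyone who wants to try it, the dip test lives in the diptest package on CRAN; a minimal example:

    # Hartigan's dip test for unimodality, from the diptest package on CRAN.
    # Null hypothesis: the sample is unimodal; a small p-value suggests multiple modes.
    library(diptest)

    set.seed(1)
    grades <- rnorm(75, mean = 70, sd = 12)  # stand-in for one section's final grades

    dip.test(grades)$p.value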

I also computed the kurtosis for every distribution, because I had read that a necessary but not sufficient condition for bimodality is that kurtosis < 3. When you run this many statistical tests, you're gonna have a lot of false positives. To minimize false positives, I only used Hartigan's Dip Test on distributions where the kurtosis was less than 3. I set my alpha value at 0.05, so I expect a false positive rate of 5%.
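Putting the pieces together, the screening looks roughly like the sketch below. I'm using the moments package for kurtosis, which reports Pearson's kurtosis (a normal distribution comes out around 3, matching the kurtosis < 3 rule); the package choice, the data layout, and the simulated sections are all just for illustration.

    # Sketch of the screening pipeline: for each lecture section, compute kurtosis;
    # only sections with kurtosis < 3 get Hartigan's dip test. Every section also
    # gets a Shapiro-Wilk test. Simulated data stands in for the real sections.
    library(moments)   # kurtosis(): Pearson's kurtosis, ~3 for a normal distribution
    library(diptest)   # dip.test(): Hartigan's dip test

    alpha <- 0.05
    set.seed(1)
    sections <- replicate(50, rnorm(75, mean = 70, sd = 12), simplify = FALSE)

    results <- do.call(rbind, lapply(sections, function(grades) {
      k     <- kurtosis(grades)
      dip_p <- if (k < 3) dip.test(grades)$p.value else NA  # skip high-kurtosis sections
      sw_p  <- shapiro.test(grades)$p.value
      data.frame(kurtosis = k, dip_p = dip_p, shapiro_p = sw_p)
    }))

    sum(results$kurtosis < 3)                 # sections that could be bimodal
    sum(results$dip_p < alpha, na.rm = TRUE)  # sections flagged as multimodal
    sum(results$shapiro_p < alpha)            # sections flagged as not normal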

Test results

Starting with kurtosis: 323 of the 778 lecture sections had a kurtosis less than 3. This means that 455 (58%) of the classes were definitely not bimodal, and that at most 323 (42%) classes could be bimodal.

Next I applied Hartigan's Dip Test to the 323 classes which had a kurtosis less than 3. For this test, the null hypothesis is that the population is unimodal, so if p is less than alpha, we conclude the distribution is multimodal. This was the case for 45 classes (14% of those tested, 5.8% of all the classes).

For the Shapiro-Wilk test, the null hypothesis is that the population is normally-distributed. So, if the p value is less than the alpha value, we conclude the population is not normally distributed. This was the case for 106 classes (13.6% of all the classes).

44 of the 45 classes which the dip test flagged as multimodal were amongst the 106 classes which the Shapiro-Wilk test indicated weren't normally-distributed. In short, 13.6% of the classes weren't normal, and many of those also tested as multimodal.

For the 86.4% of classes where we failed to reject the null hypothesis, we can expect -- but not guarantee, because of type II error -- that they are normally-distributed. I've got a large sample size, and good statistical power. From bootstrapping a likely beta value, I estimate my false negative rate is around 1.48%.
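I won't reproduce the full bootstrap here, but the idea is something like the sketch below: simulate "classes" from a plausible two-normal mixture, run Shapiro-Wilk on each, and see how often the test misses the bimodality. The mixture parameters below are made up purely for illustration, so the rate it prints isn't the 1.48% figure.

    # Rough sketch of estimating a Shapiro-Wilk false negative (type II error) rate:
    # how often does a genuinely bimodal class pass as normal?
    # The mixture parameters are illustrative, not the ones behind the 1.48% estimate.
    set.seed(1)

    simulate_bimodal_class <- function(n = 75) {
      in_upper <- rbinom(n, 1, 0.5) == 1               # which sub-population each student is in
      ifelse(in_upper, rnorm(n, 80, 8), rnorm(n, 55, 8))
    }

    n_sims <- 2000
    missed <- replicate(n_sims, shapiro.test(simulate_bimodal_class())$p.value >= 0.05)
    mean(missed)  # estimated false negative rate under these assumed parameters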

Bottom line: applying that 1.48% false negative rate to the 86.4% of classes that tested as normal gives an estimated 85.1% of UBC's undergrad CS classes with normally-distributed final grades. 5.8% of the classes tested as bimodal, which isn't a whole lot more than the false positive rate I'd expect to see (5%).

Discussion

I've only analyzed distributions from one institution, so you might be thinking "maybe UBC is special". And maybe UBC is special.

I couldn't get my hands on a similar quantity of data from my home institution (U of Toronto). But every U of T class I could test was normally-distributed (n=5). Including classes that I'd taught, where I'd eyeballed the grades, and then told my colleagues/TAs/students that my grades were bimodal. Oops.

Since I thought CS classes were bimodal, when I looked at my noisy grades distributions, I saw bimodality. Good old System 1 Thinking. Had I taken the time to fit my data, or statistically test it, I would have instead concluded it was normally-distributed.

I'm currently reading Stephen Jay Gould's The Mismeasure of Man, and this part stuck out for me: "Statisticians are trained to be suspicious of distributions with multiple modes." Where you see multiple modes, you're likely either looking at a lot of noise -- or two populations are improperly being sampled together.

Why are CS distributions so noisy? My colleague Nick Falkner recently did a series of blog posts on assessments in CS classes, and how truly ugly they are. And my colleagues Daniel Zingaro, Andrew Petersen and Michelle Craig have written a couple of lovely articles which together tell a story: if you ask students a bunch of small, incremental concept questions, rather than one giant all-encompassing code-writing question, you get grades distributions which look more normal. How we assess our students affects what sort of distribution we get.

Perhaps once we as CS educators figure out better ways to assess our students, our grades distributions won't be quite so noisy -- and prone to miscategorization?

3 comments:

  1. Interesting analysis! You should post this to the SIGCSE-members listserv or one of the CS Ed facebook groups.

    Did you notice any kind of relationship between the courses that might be bimodal? E.g., were they mostly introductory courses, or taught by the same professors?

    Virginia Tech publishes grade distributions for CS courses going back to '03, I might try and run this analysis on their data. Do you have the code lying around somewhere?

    ~acbart@vt.edu

  2. Thanks!

    To answer your questions:

    The data I got only included instructors for 458 of the 778 classes, so I didn't do any analysis by instructor.

    Of the 45 classes which were positive for Hartigan's dip:
    16 were 100-level: 36%
    5 were 200-level: 11%
    12 were 300-level: 27%
    12 were 400-level: 27%

    For comparison, in the full set of 778 classes:
    171 were 100-level: 22%
    165 were 200-level: 21%
    243 were 300-level: 31%
    199 were 400-level: 26%

    It's worth noting that most of these classes were probably false positives -- I'd expect to see 40 false positives with alpha = 0.05. So likely only 5 classes were "really" bimodal, but it would be difficult to tell you which they are.

    With the relatively small number of classes that did come out as bimodal, I'm hesitant to draw relationships there.

    I've emailed you code! I'd be curious to see what you find.

    Cheers,
    Elizabeth

  3. I've taught CS at undergraduate level where the results were, I'm pretty sure, unambiguously bimodal, which fits well with the subjective observation that some students understood the material and others had very little idea of it. A lot of the latter would either drop out of CS (either change subject, or leave the university) or be failed at first- or second-year grading.

    Could the discrepancy between your measurements and what is commonly believed be explained by it being the final grades that you've looked at (I presume that means grades in the final year) and so what you're seeing is what's left of a bimodal distribution once the lower mode has been filtered out in earlier years of the course?
