Debarghya Das has a fascinating story on how he managed to bypass a silly web security layer to get access to the results of 150,000 ICSE (10th grade) and 65,000 ISC (12th grade) students in India. While the lack of security and the total disregard for safeguarding sensitive information is an interesting topic, what is more fascinating about this episode is the analysis of the results, which unearthed score tampering. The school boards changed the scores of students to give them "grace" points and bump them up to the passing level. The boards also seem to have tampered with some other scores, but the motive for that tampering remains unclear (at least to me).
I would encourage you to read the entire analysis and the comments, but a tl;dr version is:
32, 33 and 34 were visibly absent. This chain of 3 consecutive numbers is the longest chain of absent numbers. Coincidentally, 35 happens to be the pass mark.
Here's a complete list of unattained marks -
36, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 56, 57, 59, 61, 63, 65, 67, 68, 70, 71, 73, 75, 77, 79, 81, 82, 84, 85, 87, 89, 91, 93. Yes, that's 33 numbers!
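The quoted numbers are easy to sanity-check. Here is a minimal Python sketch (my own, not code from the original analysis) that takes the marks listed above, confirms the count, and groups them into runs of consecutive values. The function and variable names are made up for illustration.

```python
def consecutive_runs(marks):
    """Group integers into (start, end) runs of consecutive values."""
    runs = []
    for m in sorted(set(marks)):
        if runs and m == runs[-1][1] + 1:
            runs[-1][1] = m          # extend the current run
        else:
            runs.append([m, m])      # start a new run
    return [(start, end) for start, end in runs]


# Marks quoted above: the three just below the pass mark of 35 ...
BELOW_PASS = [32, 33, 34]
# ... and the "complete list of unattained marks" from the analysis.
UNATTAINED = [
    36, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 56, 57, 59, 61, 63, 65,
    67, 68, 70, 71, 73, 75, 77, 79, 81, 82, 84, 85, 87, 89, 91, 93,
]

runs = consecutive_runs(BELOW_PASS + UNATTAINED)
longest = max(end - start + 1 for start, end in runs)
longest_runs = [(s, e) for s, e in runs if e - s + 1 == longest]

print(len(UNATTAINED), "unattained marks")
print("longest run length:", longest, "runs:", longest_runs)
```

On the numbers quoted here, the list does contain 33 marks, the longest run has length 3, and 32 to 34 is one such run.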
The comments are even more fascinating: people are pointing out flaws in his approach and challenging the CLT (central limit theorem) argument with rebuttals. If there had been no tampering with the scores, the distribution would defy the CLT to a degree so improbable that I can't even compute it. In other words, the chances of this guy being wrong about his inferences and conclusions are almost zero, if not zero.
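To get a feel for how small that chance is, here is a back-of-the-envelope sketch in Python. It is not the CLT argument from the comments, just a simpler independence calculation: assume each of the 150,000 10th grade students independently lands on some particular mark with probability p, and ask how likely it is that nobody does. The value p = 0.001 is my own made-up assumption, not a figure from the analysis.

```python
import math

N_STUDENTS = 150_000   # 10th grade result count mentioned in the post
P_MARK = 0.001         # hypothetical: 0.1% of students get this exact mark


def log10_prob_nobody_scores(n_students, p_mark):
    """log10 of (1 - p)^n, the chance that no student gets a given mark,
    assuming students score independently (a simplifying assumption)."""
    return n_students * math.log1p(-p_mark) / math.log(10)


log10_p = log10_prob_nobody_scores(N_STUDENTS, P_MARK)
print(f"P(mark never appears) is about 10^{log10_p:.1f}")
```

Under these assumptions, even a fairly rare mark has a chance of roughly 10^-65 of never showing up at all, and that is for a single mark, before asking for 33 of them to be missing at once.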
He uses fairly simple statistical techniques and MapReduce-style computing to analyze a fairly decent-sized data set and to infer and prove a specific hypothesis (most people, including me, believed that grace points existed, but we had no evidence to prove it). He even created a public GitHub repository of his work, which he later made private.
I am not a lawyer and I don't know whether what he did is legal, but I do admire his courage in not posting this anonymously, as many people in the comments suggested he should have. Hope he doesn't get into any trouble.
After spending a little more time trying to comprehend this situation, I have two thoughts:
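Since his repository is no longer public, I can't show his code, but the MapReduce-style idea is simple enough to sketch. The Python below is my own illustration, not his implementation: each chunk of records is mapped to a per-mark count, the counts are reduced by adding them together, and any mark with a zero count in a given range is reported as unattained. The record layout (student id, mark) and the sample data are hypothetical.

```python
from collections import Counter
from functools import reduce


def map_chunk(records):
    """Map step: count how often each mark occurs in one chunk."""
    return Counter(mark for _student_id, mark in records)


def reduce_counts(left, right):
    """Reduce step: merge two partial histograms."""
    return left + right


def mark_histogram(chunks):
    """Run map over every chunk, then fold the partial counts together."""
    return reduce(reduce_counts, map(map_chunk, chunks), Counter())


def unattained(hist, lo, hi):
    """Marks in [lo, hi] that no student received."""
    return [m for m in range(lo, hi + 1) if hist[m] == 0]


# Hypothetical sample data, split into two chunks as if processed in parallel.
chunks = [
    [("s1", 35), ("s2", 38)],
    [("s3", 35), ("s4", 40)],
]
hist = mark_histogram(chunks)
print(unattained(hist, 35, 40))
```

The map step never needs to see the whole data set at once, which is why this style scales comfortably to a couple of hundred thousand records.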
The first observation, shocking but unfortunately not surprising, is how careless the school boards are in making such sensitive information available on their website without basic security. It is not hard to find web developers in India who understand basic or even advanced security; it's simply laziness and carelessness on the school boards' side not to bother with this. I am hoping that all government as well as non-government institutions will learn from this breach and tighten up their access and data security.
The second revelation is that it's not a terribly bad idea to publicly distribute this and similar datasets after removing PII (personally identifiable information) from them, to let people legitimately go crazy with them. If this dataset were publicly available, people would analyze it, find patterns, and challenge fundamental education practices. Open source has been living proof that software becomes more secure when it is opened up to the public to hack and find flaws that can then be fixed. Knowing the Indian bureaucracy, I don't see them going in this direction. Turns out I have seen this movie before. I have been an advocate of making electronic voting machines available to researchers to examine the validity of a fair election process. Instead of allowing security researchers access to an electronic voting machine, Indian officials accused a researcher of stealing a voting machine and arrested him. However, if India is serious about competing globally in education, this might very well be the first step toward bringing in transparency.