*This is an idea motivated by the popular book Freakonomics, by Steven D. Levitt and Stephen J. Dubner.*
One of the most contentious issues in the education space today is the use of high-stakes testing. A simple Google search will yield many articles and studies about the efficacy of such programs, with seemingly no well-established consensus. It's a problem that has been debated by school administrators, education policy makers, parents, and students alike. The tests are called "high stakes" because instead of merely measuring students' progress and knowledge, they are increasingly used to hold schools accountable for the results.
Ever since the enactment of the No Child Left Behind Act (NCLB), the federal government has mandated high-stakes testing as a means of holding teachers and schools accountable for their results. Even before NCLB, standardized testing was already widely used to regulate school performance: 20 states rewarded schools for performing well, and 32 states sanctioned schools that performed badly.
The Chicago Public School system embraced standardized testing in 1996. Under the new system, a school with low reading scores would be placed on probation and face the threat of being shut down, with its staff reassigned or relieved of their duties. Chicago also implemented a policy requiring students to achieve a minimum score on the Iowa Test of Basic Skills in order to be promoted to third, sixth, and eighth grade.
Advocates of standardized testing often argue that these tests increase students' overall learning and capabilities by incentivising them to study. The tests also prevent poorly performing students from advancing, thereby giving them the extra time they need to learn the material. Critics of high-stakes testing argue that it puts unnecessary pressure on students who don't happen to test well, and that teachers may focus solely on test preparation rather than the more varied topics that are critical to the growth of a well-rounded student.
Even before NCLB, regulatory policies were widely implemented in the education space. Teachers whose classrooms don't perform well may be passed over for a promotion or see their job security put at risk, and schools that underperform may have their funding slashed or be closed. On the flip side, schools that do well may be rewarded with extra funding and escape the looming threat of being shut down. Teachers are also rewarded if their classrooms show significant testing improvement: they may be considered for a promotion, a raise, or even a bonus. California, for instance, has a policy that grants teachers $25,000 for producing classrooms with high test scores.
In this incentive-rich landscape, performing well looks more attractive than ever. So attractive, in fact, that some teachers just might consider something unspeakable in the academic world: cheating.
To catch a cheater, it helps to think like one. Suppose I were a teacher with only an hour before turning in the answer sheets, and I wanted to artificially inflate my classroom's scores by changing some answers. I certainly wouldn't want to change too many: perhaps just a few questions here and there, for maybe a third of the class. I also wouldn't want to correct random questions individually, because that would take too much time. Instead, I might memorize a string of correct answers and copy it in. I'd also focus on the last few questions, where the questions are harder, so I'd have a higher chance of replacing incorrect answers with correct ones.
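This strategy leaves a detectable fingerprint. As a minimal sketch (all answer data below is simulated, not from the Chicago database), here is what happens when a hypothetical teacher pastes the same correct block over the last few questions for a third of the class:

```python
import random

random.seed(0)
CHOICES = "ABCD"
N_QUESTIONS = 20
N_STUDENTS = 15

# A hypothetical answer key and a class of students answering randomly.
key = "".join(random.choice(CHOICES) for _ in range(N_QUESTIONS))
students = ["".join(random.choice(CHOICES) for _ in range(N_QUESTIONS))
            for _ in range(N_STUDENTS)]

# The tampering: for a third of the class, overwrite the last six
# (hardest) questions with the same memorized string of correct answers.
tampered = list(students)
for i in range(N_STUDENTS // 3):
    tampered[i] = tampered[i][:-6] + key[-6:]

# The fingerprint: the tampered sheets now share one identical suffix,
# which is exactly the kind of repeated substring a clustering pass finds.
shared_suffixes = {sheet[-6:] for sheet in tampered[:N_STUDENTS // 3]}
print(shared_suffixes)
```

Honest random answers rarely agree on six questions in a row, so an identical block shared by several students is a strong signal.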
The Chicago Public School System has a database of test answers for each classroom from 3rd grade to 7th grade, covering 1993 to 2000. This database is protected and proprietary, so without an institutional organisation backing our research we can't get access to the whole thing. However, the book gives two examples:
Since we are looking for strings of similar answers, we use the Levenshtein distance for hierarchical clustering. After performing hierarchical clustering on the strings using their pairwise Levenshtein distances, we can pick out similar strings by choosing a cutoff threshold for the clusters. This threshold is the maximum distance (i.e., the maximum number of edits) allowed between strings in the same cluster; strings grouped together under the threshold are considered similar to each other.
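As a sketch, the Levenshtein distance between two answer strings can be computed with the standard dynamic-programming recurrence (the answer strings here are invented for illustration):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# Two hypothetical answer sheets (one letter per question) that differ
# in a single answer:
print(levenshtein("ABDCABBD", "ABDCACBD"))  # -> 1
```

A distance of 0 means identical sheets; a small distance between long sheets means suspiciously similar ones.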
A reasonable cutoff threshold can be determined by inspecting the dendrogram for each classroom.
The number of dense 'sub-trees' in the dendrogram for Class A suggests that many strings are small derivations of one particular answer sequence, whereas for Class B this is much less obvious. Setting the cutoff for Class A to around 18, we get the following:
Focus on Cluster 1 and it should become immediately obvious what happened here.
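Extracting the clusters under a cutoff can be sketched with SciPy's `fcluster`. The real analysis used a cutoff of about 18 on full-length answer sheets; the toy strings below are much shorter, so a cutoff of 2 plays the same role here:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# Hypothetical answer strings, one per student.
answers = ["ABDCAB", "ABDCAC", "ABDCAB", "DCABDA", "CBADCA"]
n = len(answers)
dm = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dm[i, j] = dm[j, i] = levenshtein(answers[i], answers[j])
Z = linkage(squareform(dm), method="average")

# Cut the tree: strings within CUTOFF edits of each other get one label.
CUTOFF = 2
labels = fcluster(Z, t=CUTOFF, criterion="distance")

# Group the students by label to inspect the suspicious clusters.
clusters = {}
for sheet, label in zip(answers, labels):
    clusters.setdefault(int(label), []).append(sheet)
print(clusters)
```

A large cluster of near-identical sheets, like the three matching strings above, is exactly the pattern Cluster 1 exhibits for Class A.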
If we do the same for Class B, however:
The patterns of repeated strings are much less clear.
Therefore, we reach our verdict for Class A: suspicious.