Unsplash / Kenny Eliason

Reverse Engineering the Freakonomics Cheating Algorithm - A Clustering Exercise

By:
Category:
Date:

Background

See Github here

*This is an idea motivated by the popular book Freakonomics by Steven D. Levitt and Stephen J. Dubner.

One of the most volatile in the current in the education space is the use of high stakes testing. A simple Google search will yield many articles and studies about the efficacy and effectiveness of such program with a seemingly no well-established consensus. It's a problem that's been debates by many school administrators, education policy makers, parents and students alike. The tests are referred to as "high stakes" because instead of merely testing students on their progress and knowledge, schools are increasingly held accountable for the results.

Ever since the initiation of the No Child Left Behind law (NCLB), the federal government has mandated the use of high-stakes testing as a means to keep the teachers and schools accountable for their results. Even before NCLB, there was already quite a widespread use of standardized testing to regulate school performance. 20 states awarded schools for performing well and 32 states sanctioned schools that were doing badly.

The Chicago Public School system embraced standardized testing in 1996 but under the new NCLB system, a school with low reading scores would be placed on probation and face the threat of being shut down and its staff faced the threat of being reassigned or relieved of their duties. Chicago also implemented a policy where in order to promote to third, six and eighth grade, students had to achieve the minimum score on the Iowa Test of Basic Skills.

The advocates of standardized testing often argue these tests increase the overall learning and capabilities of students by incentivising them to study. The tests also prevent poor performing students from advancing thereby supporting and accommodating their need for extra time to learn the material. Critics of high stakes testing argued that it put unnecessary pressure on students if they don't happen to test well, and that teachers may focus solely on test preparation rather than more important and varied topics that are critical for the growth of a well-rounded student.

Even before NCLB, there was already widespread implementation of regulatory policies in the educational space, teachers whose classrooms don't perform well may get looked over for a promotion or have their job security at risk. Schools who don't perform well may have their funding slashed or their school closed. On the flip side, schools that do well are rewarded with potentially extra funding and don't have the looming threat of being shut down. Teachers are also rewarded if their classroom shows significant testing improvement, they may be looked at for a promotion or a raise or even a bonus, California has a policy that grants teachers $25,000 for producing classrooms with high test scores.

This incentive rich landscape, performing well looked more attractive than ever, so attractive in fact that they just might consider something unspeakable in the academic world: Cheating.

Analysis

To catch a cheater, it does help to think like one. If I were a teacher and the hour I had before turning them in I wanted to artificially inflate my classroom scores by changing some answers. I certainly wouldn't want to change too many, perhaps just a few questions here and there for maybe 1/3 of the class. I also wouldn't want to correct random questions individually because it would take too much time. Maybe I'd remember a string of correct answers and use that instead. Maybe I should also go through the last few questions where the questions are harder and that way, I'd have a higher chance of replacing incorrect answers with correct ones.

The Chicago Public School System has a database of test answers from each classroom from 3rd grade to 7th grade from 1993 to 2000. This is database is protected and proprietary so without an institutional organisation backing our research we can't get access to the whole thing. However, in the book there are two examples given:

Class A

112a4a342cb214d0001acd24a3a12dadbcb4a0000000 d4a2341cacbddad3142a2344a2ac23421c00adb4b3cb 1b2a34d4ac42d23b141acd24a3a12dadbcb4a2134141 dbaab3dcacb1dadbc42ac2cc31012dadbcb4adb40000 d12443d43232d32323c213c22d2c23234c332db4b300 db2abad1acbdda212b1acd24a3a12dadbcb400000000 d4aab2124cbddadbcb1a42cca3412dadbcb423134bc1 1b33b4d4a2b1dadbc3ca22c000000000000000000000 d43a3a24acb1d32b412acd24a3a12dadbcb422143bco 313a3ad1ac3d2a23431223c000012dadbcb400000000 db2a33dcacbd32d313c21142323cc300000000000000 d43ab4d1ac3dd43421240d24a3a12dadbcb400000000 db223a24acb11a3b24cacd12a241cdadbcb4adb4b300 db4abadcacb1dad3141ac212a3a1c3a144ba2db41b43 1142340c2cbddadb4b1acd24a3a12dadbcb43d133bc4 214ab4dc4cbdd31b1b2213c4ad412dadbcb4adb00000 1423b4d4a23d24131413234123a243a2413a21441343 3b3ab4d14c3d2ad4cbcac1c003a12dadbcb4adb40000 dba2ba21ac3d2ad3c4c4cd40a3a12dadbcb400000000 d122ba2cacbd1a13211a2d02a2412dOdbcb4adb4b3c0 144a3adc4cbddadbcbc2c2cc43a12dadbcb4211ab343 d43aba3cacbddadbcbca42c2a3212dadbcb42344b3cb

Class B

db3a431422bd131b4413cd422a1acda332342d3ab4c4
d1aa1a11acb2d3dbc1ca22c23242c3a142b3adb243c1
d42a12d2a4b1d32b21ca2312a3411d00000000000000
3b2a34344c32d21b1123cdc000000000000000000000
34aabad12cbdd3d4c1ca112cad2ccd00000000000000
d33a3431a2b2d2d44b2acd2cad2c2223b40000000000
23aa32d2a1bd2431141342c13d212d233c34a3b3b000
d32234d4a1bdd23b242a22c2a1a1cda2b1baa33a0000
d3aab23c4cbddadb23c322c2a222223232b443b24bc3
d13a14313c31d42b14c421c42332cd2242b3433a3343
d13a3ad122b1da2b11242dc1a3a12100000000000000
d12a3ad1a13d23d3cb2a21ccada24d2131b440000000
314a133c4cbd142141ca424cad34c122413223ba4b40
d42a3adcacbddadbc42ac2c2ada2cda341baa3b24321
db24c412c1ada2c3a341ba20000000db1134dc2cb2da
d1341431acbddad3c4c213412da22d3d1132a1344b1b
1ba41a21a1b2dadb24ca22c1ada2cd32413200000000
dbaa33d2a2bddadbcbca11c2a2accda1b2ba20000000

Since we are looking of a string of similar answers we used the Levenstein distance for hierarchical clustering. After performing hierarchical clustering on strings using their Levenshtein distances, you can pick out similar strings by determining a cutoff threshold for the clusters. This threshold represents the maximum distance (i.e., the maximum number of edits) between strings in the same cluster. Strings grouped together under this threshold are considered similar to each other.

A reasonable cutoff threshold can be determined using a dendrogram for both classrooms.

Dendrogram for Class_A
Dendrogram for Class_B

The number of organse 'sub-trees' in the Dendrogram in the ClassA that there are a lot of derivations off of one particular string sequence whereas for Class B it is much less obvious. We see that for Class A after setting the cutoff to around 18 we get the following:

Class A
Cluster 1:
112a4a342cb214d0001acd24a3a12dadbcb4a0000000 1b2a34d4ac42d23b141acd24a3a12dadbcb4a2134141 dbaab3dcacb1dadbc42ac2cc31012dadbcb4adb40000 db2abad1acbdda212b1acd24a3a12dadbcb400000000 d4aab2124cbddadbcb1a42cca3412dadbcb423134bc1 d43a3a24acb1d32b412acd24a3a12dadbcb422143bco 313a3ad1ac3d2a23431223c000012dadbcb400000000 d43ab4d1ac3dd43421240d24a3a12dadbcb400000000 db223a24acb11a3b24cacd12a241cdadbcb4adb4b300 1142340c2cbddadb4b1acd24a3a12dadbcb43d133bc4 214ab4dc4cbdd31b1b2213c4ad412dadbcb4adb00000 3b3ab4d14c3d2ad4cbcac1c003a12dadbcb4adb40000 dba2ba21ac3d2ad3c4c4cd40a3a12dadbcb400000000 d122ba2cacbd1a13211a2d02a2412dOdbcb4adb4b3c0 144a3adc4cbddadbcbc2c2cc43a12dadbcb4211ab343 d43aba3cacbddadbcbca42c2a3212dadbcb42344b3cb
Cluster 3: - d4a2341cacbddad3142a2344a2ac23421c00adb4b3cb
Cluster 7: - d12443d43232d32323c213c22d2c23234c332db4b300
Cluster 4: - 1b33b4d4a2b1dadbc3ca22c000000000000000000000
Cluster 5: - db2a33dcacbd32d313c21142323cc300000000000000
Cluster 2: - db4abadcacb1dad3141ac212a3a1c3a144ba2db41b43
Cluster 6: - 1423b4d4a23d24131413234123a243a2413a21441343

Focus on Cluster 1 and it should become immediately obvious what happened here.

If we do the same for B; however:

Class_B
Cluster 12:
112a4a342cb214d0001acd24a3a12dadbcb4a0000000
Cluster 2:
d4a2341cacbddad3142a2344a2ac23421c00adb4b3cb 1b33b4d4a2b1dadbc3ca22c000000000000000000000 db4abadcacb1dad3141ac212a3a1c3a144ba2db41b43
Cluster 1:
1b2a34d4ac42d23b141acd24a3a12dadbcb4a2134141 d12443d43232d32323c213c22d2c23234c332db4b300 db2a33dcacbd32d313c21142323cc300000000000000
Cluster 4:
dbaab3dcacb1dadbc42ac2cc31012dadbcb4adb40000
Cluster 5:
db2abad1acbdda212b1acd24a3a12dadbcb400000000
Cluster 11:
d4aab2124cbddadbcb1a42cca3412dadbcb423134bc1
Cluster 7:
d43a3a24acb1d32b412acd24a3a12dadbcb422143bco
Cluster 8:
313a3ad1ac3d2a23431223c000012dadbcb400000000
Cluster 3:
d43ab4d1ac3dd43421240d24a3a12dadbcb400000000 3b3ab4d14c3d2ad4cbcac1c003a12dadbcb4adb40000
Cluster 9:
db223a24acb11a3b24cacd12a241cdadbcb4adb4b300
Cluster 13:
1142340c2cbddadb4b1acd24a3a12dadbcb43d133bc4
Cluster 10:
214ab4dc4cbdd31b1b2213c4ad412dadbcb4adb00000
Cluster 6:
1423b4d4a23d24131413234123a243a2413a21441343

It is much less clear about patterns of repeated strings.

Therefore, we can conclude that Class A: Suspicious.

"The essence of greatness is the perception that virtue is perception that virtue is enough."
— Ralph Waldo Emerson