Code Stylometry: How AI Could End Anonymous Hacking

Hackers are a constant threat in today's society for both individuals and organizations. New machine learning algorithms can identify hackers by their unique coding style, taking away their anonymity.

[Image: fingerprints over Python code]
Illustration: © IoT For All

Hacking has become an archetype of 21st-century crime. An anonymous programmer can code their way past cybersecurity software to steal valuable information from people and businesses, and either sell that data back to the owners or occasionally release the private information to the public.

In an era already threatened by data misuse, hackers are just one more threat. That is, at least, until artificial intelligence (AI) changes things.

For years, many hackers have been good at slipping under the radar of cybersecurity systems. It seems like every few months we hear about another major corporation losing valuable customer data to cybercriminals.

Recent innovations have called into question how long hackers can stay hidden. Could AI take away their anonymity?

What Is Code Stylometry? Discovering Digital Fingerprints

Two computer scientists, Rachel Greenstadt (professor at Drexel University) and Aylin Caliskan (professor at George Washington University), have conducted extensive work to learn whether we can use AI to identify anonymous hackers. They’ve learned that, even from small extracts of code, an AI system can distinguish one programmer’s work from another’s.

In the past, noticing these patterns was very difficult. For context, consider the variation inherent in most open source projects. These projects can be hectic, combining the work of numerous contributors into one near-seamless codebase. The more developers there are working on a single project, the more difficult it is to determine who wrote which piece.

But computers can notice patterns more easily than humans. That’s why machine learning was the obvious tool to try for identifying anonymous programmers. Could a computer successfully tell apart the contributions of multiple code authors? The answer, Caliskan and Greenstadt learned, was yes.

It all comes down to measuring the stylistic characteristics of a piece of code, such as how it is laid out, how things are named, and which constructs the author favors, and matching them to a known programming style. AI can be used to reliably recognize specific programming patterns. This helps computers recognize certain hackers via their unique digital fingerprint.
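To make the idea of a style "fingerprint" more concrete, here is a minimal Python sketch of how a handful of stylistic measurements might be pulled out of a code sample. The function name `style_features` and the particular features chosen are assumptions made for illustration; they are not the feature set used by Caliskan and Greenstadt, which the article does not describe.

```python
import ast
from collections import Counter


def style_features(source: str) -> dict:
    """Return a few illustrative style measurements for Python source.

    Hypothetical helper: the features below are examples only.
    """
    lines = source.splitlines()
    non_blank = [ln for ln in lines if ln.strip()]
    tree = ast.parse(source)

    # Layout habits: indentation character and typical line length.
    tab_indented = sum(1 for ln in non_blank if ln.startswith("\t"))
    avg_line_len = sum(len(ln) for ln in non_blank) / max(len(non_blank), 1)

    # Naming habits: how often identifiers contain an underscore.
    names = [n.id for n in ast.walk(tree) if isinstance(n, ast.Name)]
    with_underscore = sum(1 for name in names if "_" in name)

    # Structural habits: which syntax constructs the author reaches for.
    node_counts = Counter(type(n).__name__ for n in ast.walk(tree))

    return {
        "avg_line_length": avg_line_len,
        "tab_indent_ratio": tab_indented / max(len(non_blank), 1),
        "underscore_name_ratio": with_underscore / max(len(names), 1),
        "for_loops": node_counts["For"],
        "while_loops": node_counts["While"],
        "list_comprehensions": node_counts["ListComp"],
    }


# Two ways of summing a list, written in different styles.
SAMPLE_LOOP = "total = 0\nfor item_value in [1, 2, 3]:\n    total += item_value\n"
SAMPLE_COMP = "total = sum([x for x in [1, 2, 3]])\n"

if __name__ == "__main__":
    print(style_features(SAMPLE_LOOP))
    print(style_features(SAMPLE_COMP))
```

Both samples compute the same result, yet they produce different measurements: one author writes an explicit loop with descriptive names, the other a terse comprehension. A real system would collect far more features than this and feed them to a trained classifier.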

Making the Web Safer With Code Stylometry

Using what Aylin Caliskan calls “Code Stylometry,” we can now de-anonymize coders by analyzing samples of their code, whether source code or compiled binaries.

As the researchers posited in their abstract, “Source code authorship attribution… [may] enable attribution of successful attacks from code left behind on an infected system, or aid in resolving copyright, copyleft, and plagiarism issues in the programming fields.”

It’s pattern recognition, performed with a speed and precision far beyond what any person is capable of.
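As a rough illustration of attribution by pattern matching, the Python sketch below compares an unknown snippet against samples from known authors and picks the closest match. It swaps in a deliberately simple technique (character trigram counts compared by cosine similarity) rather than the researchers' actual method, and the author names, code samples, and function names are all hypothetical.

```python
import math
from collections import Counter


def ngram_profile(code: str, n: int = 3) -> Counter:
    """Count every run of n consecutive characters in the code."""
    return Counter(code[i:i + n] for i in range(len(code) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two count profiles (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attribute(unknown: str, known: dict) -> str:
    """Return the known author whose sample is most similar to the unknown code."""
    target = ngram_profile(unknown)
    return max(known, key=lambda author: cosine(target, ngram_profile(known[author])))


# Hypothetical authors with visibly different habits:
# "alice" uses tabs and terse names, "bob" uses four spaces and snake_case.
KNOWN = {
    "alice": "def addAll(xs):\n\tt=0\n\tfor x in xs:\n\t\tt+=x\n\treturn t\n",
    "bob": (
        "def add_all(values):\n"
        "    total = 0\n"
        "    for value in values:\n"
        "        total = total + value\n"
        "    return total\n"
    ),
}

UNKNOWN = (
    "def count_items(items):\n"
    "    count = 0\n"
    "    for item in items:\n"
    "        count = count + 1\n"
    "    return count\n"
)

if __name__ == "__main__":
    print(attribute(UNKNOWN, KNOWN))
```

The unknown snippet does something different from either known sample, but it shares the space-indented, spelled-out habits of the hypothetical author "bob", so that is the profile it lands closest to. With only two tiny samples this is a toy; it says nothing about accuracy on real code.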

This brings us to the same question that plagues all data solutions: what about our right to privacy?

There are both advantages and risks in the de-anonymization of code snippets. With Code Stylometry, hackers can be traced much more easily, but it can also pose a threat to the privacy of anonymous code contributors.

As the study points out, “Contributors to open-source projects may hide their identity whether they are Bitcoin’s creator or just a programmer who does not want her employer to know about her side activities. They may live in a regime that prohibits certain types of software, such as censorship circumvention tools.”

Are the innovation and the ability to solve cybercrimes worth the new privacy risks? Perhaps only the future will tell.

What we know now is that companies may soon have greater protection against hackers, as malware authors become increasingly easy to identify (and thereby prosecute).

Written by Alexander Lewis, a media relations expert for Paessler AG, a global leader in systems monitoring.