Algorithm may be used for malware code attribution

Dec 30, 2015 02:37 GMT

Researchers from three universities and the US Army Research Laboratory have created a machine learning algorithm that can accurately identify which programmer wrote a piece of code, even after the code has been compiled into an executable binary.

Previously, the same researchers had built a similar algorithm that identifies programmers based on their coding style in source code (code stylometry).

The new research continues that work and extends the algorithm to cases where the source code isn't accessible and only the compiled executable binary is available.

De-anonymizing programmers may halt the creation of controversial software

By providing a proof-of-concept in their paper, the researchers are sounding the alarm for programmers who may not want their names associated with controversial software.

As training data, the algorithm uses source code samples (compiled into binaries) from 600 programmers who participated in the Google Code Jam competition.

All programmers had to implement the same functionality, but each did it in their own way and in their own coding style. This allowed the algorithm to learn to distinguish different coding styles from decompiled executable binaries. Contrary to what many think, decompilation does not reproduce the original source code exactly.
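To make the idea concrete, here is a toy sketch of style-based attribution. It is not the researchers' method: the article does not describe their features or classifier, so this example substitutes character n-gram counts and cosine similarity, and the author names and code snippets are made up for illustration.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text, n=3):
    """Count overlapping character n-grams in one code sample."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(samples_by_author):
    """Merge each author's samples into a single n-gram profile."""
    profiles = {}
    for author, samples in samples_by_author.items():
        merged = Counter()
        for sample in samples:
            merged.update(ngram_profile(sample))
        profiles[author] = merged
    return profiles


def attribute(profiles, unknown):
    """Return the author whose profile is most similar to the unknown sample."""
    target = ngram_profile(unknown)
    return max(profiles, key=lambda author: cosine(profiles[author], target))


# Hypothetical decompiler-like output from two made-up authors
# who solve the same task (summing an array) in different styles.
training = {
    "author_a": [
        "for (i = 0; i < n; i++) { sum += v[i]; }",
        "for (j = 0; j < m; j++) { total += w[j]; }",
    ],
    "author_b": [
        "while (1) { if (idx >= len) break; acc = acc + buf[idx]; idx = idx + 1; }",
        "while (1) { if (pos >= end) break; res = res + arr[pos]; pos = pos + 1; }",
    ],
}
profiles = train(training)
guess = attribute(profiles, "for (k = 0; k < cnt; k++) { out += data[k]; }")
print(guess)
```

The unknown snippet uses different variable names than any training sample, yet its loop structure and operators still match the first author's habits. A real system would need far richer features and many more samples per author.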

The algorithm has a high de-anonymization accuracy

According to the researchers, the algorithm managed to de-anonymize executable binaries written by 20 programmers with an accuracy of 96%, after the machine learning classifier was trained on only eight executable binaries per programmer.

When tested on binaries from all 600 programmers, the researchers reported an accuracy of 52%, which is more than acceptable for a recently created algorithm that hasn't yet seen years of development.
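To put those two figures in context, it helps to compare them with random guessing. The short calculation below assumes every programmer is equally likely to be the author, which the article does not state, and the function names are illustrative only.

```python
def chance_accuracy(num_authors):
    """Accuracy of uniform random guessing among equally likely authors."""
    return 1.0 / num_authors


def lift_over_chance(reported_accuracy, num_authors):
    """How many times better than random guessing a reported accuracy is."""
    return reported_accuracy / chance_accuracy(num_authors)


# Figures reported in the article: (number of programmers, accuracy).
reported = [(20, 0.96), (600, 0.52)]
for num_authors, accuracy in reported:
    baseline = chance_accuracy(num_authors)
    lift = lift_over_chance(accuracy, num_authors)
    print(f"{num_authors} authors: chance {baseline:.2%}, "
          f"reported {accuracy:.0%}, about {lift:.0f}x chance")
```

Under that assumption, random guessing would be right 5% of the time with 20 programmers and about 0.17% of the time with 600. So the 52% figure, although lower in absolute terms, is roughly 312 times better than chance, compared with about 19 times for the 96% figure.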

"Stripping and removing symbol information from the executable binaries reduces the accuracy to 66%, which is a surprisingly small drop," says Mrs. Caliskan-Islam, one of the researchers. "This suggests that coding style survives complicated transformations."

Algorithm structure overview

The researchers also concluded that de-anonymization accuracy goes up with the programmer's skill, since advanced programmers often develop their own coding style, quite distinct from standard textbook approaches.

Authors of controversial software may want to stay away from GitHub

Thanks to open source code repositories like GitHub, state agencies could build a database of developers and their coding styles, and then compare it against the coding style used in "anti-establishment" software to identify the author.

The researchers said that when the algorithm was tested on GitHub repositories, it achieved a 62% de-anonymization accuracy. They did note, however, that the algorithm is of little use on collaborative projects where multiple programmers contribute to the same source code.

Despite all the privacy implications this research may have, the algorithm can also be used by security researchers to track down malware authors. For now, the researchers say the algorithm is not ready to take on malware code, which is often very well obfuscated.

"Our results so far suggest that while stylistic analysis is unlikely to provide a 'smoking gun' in the malware case, it may contribute significantly to attribution efforts," Mrs. Caliskan-Islam also noted.

Aylin Caliskan-Islam presented the research paper at the 32nd Chaos Communication Congress (32C3). A video of the talk can be downloaded from the conference's website.
