Source Code and Cross-Domain Authorship Attribution

31. Chaos Communication Congress

Stylometry is the study of linguistic style found in text. Stylometry existed long before computers but now the field is dominated by artificial intelligence techniques. Writing style is a marker of identity that can be found in a document through linguistic information to perform authorship recognition. Authorship recognition is a threat to anonymity but knowing ways to identify authors provides methods for anonymizing authors as well. Even basic stylometry systems reach high accuracy in classifying authors correctly. Stylometry can also be used in source code to identify the author of a program. In this talk, we investigate methods to de-anonymize source code authors of C++ and authors across different domains. Source code authorship attribution could provide proof of authorship in court, automate the process of finding a cyber criminal from the source code left in an infected system, or aid in resolving copyright, copyleft and plagiarism issues in the programming fields. Programmers can obfuscate their variable or function names, but not the structures they subconsciously prefer to use or their favorite increment operators. Following this intuition, we create a new feature set that reflects coding style from properties derived from abstract syntax trees. We reach 99% accuracy in attributing 36 authors each with ten files. We experiment with many different sized datasets leading to high true positive rates. Such a unique representation of coding style has not been used as a machine learning feature to attribute authors and therefore this is a valuable contribution to the field. We also examine the need for cross-domain stylometry, where the documents of known authorship and the documents in question are written in different contexts. Specifically, we look at blogs, Twitter feeds, and Reddit comments. While traditional methods in stylometry that work well within one domain fail to identify authors across domains, we are able to improve the accuracy of cross-domain stylometry to as high as 80%. Being able to identify authors across domains facilitates linking identities across the Internet making this a key privacy concern; users can take other measures to ensure their anonymity, but due to their unique writing style, they may not be as anonymous as they believe.

Anonymity is a topic researched in detail at the Privacy, Security, and Automation Lab at Drexel University. We study how to effectively identify the author of text with unknown authors and how to anonymize text of known authorship. In our previous talks at CCC, we have presented methods to identify authors of regular text, translated text and users a.k.a cyber-criminals of online underground forums. We introduced our authorship anonymization framework ‘Anonymouth’. Many times, we received questions on how applying de-anonymization techniques would work on source code and different domains. In this year’s talk, we will focus on identifying the authors of source code and cross-domain stylometry. Can the authors of source code be identified automatically through features of their programming style? Do they leave coding “footprints”? Holding important implications for protecting intellectual property as well as for identifying malware authors and tracking how malware spreads and evolves, this question spurred a cross-cutting research project involving NLP and machine learning. Code stylometry requires features unique to coding and to the programming language. Source code has different properties than common writing, such as the lineage, keywords, comments, the way functions and variables are created, and the grammar of the program. Aware that methods from text analytics can strengthen cyber analytics, this project sought to advance the potential of automated linguistic-type analysis, or stylometry, for authorship attribution of source code. A corpus of tens of thousands of users was built by scraping Google Code Jam Competition dataset. Specifically investigated were new ways of representing coding style through NLP-inspired syntactic, lexical and layout features. Random forests with 300 hundred trees were used along with less than ten decision features per tree. The main dataset had 173 authors each with six source code files with less then 100 lines of C++ code. A series of experiments was performed to discover the feature set that yielded the highest recognition accuracy: 91%. 57% of the features with information gain were syntactic and the rest were lexical and layout features. Tests on a validation dataset of exact same size showed 86% accuracy with the same features. The features that had information gain in the validation experiments all had information gain in the original dataset, which shows that the method and feature set are robust and abstract syntax trees show best promise. Source code is just one domain studied in authorship attribution. We also study the problem of domain adaption in stylometry. Can we identify the author of an anonymous blog from a suspect group of Twitter accounts? The ability to do so would lead to the ability to link accounts and identities across the Internet. We can achieve high accuracy at identifying authors of documents within the same domain, including blogs, Twitter feeds, and Reddit comments, even when classifying with up to 200 authors. Identifying the author of a group of tweets from among 200 tweeters yields an accuracy of 94% and identifying the author of a blog entry from among 200 bloggers yields an accuracy of 71%. When we try to identify to author of a collection of tweets based on a collection of blogs from 200 authors, however, accuracy drops to 7% using the same method and features. We are able to increase the accuracy, however, by applying an augmented version of doppelganger finder, a stylometric approach for multiple account detection that can handle small stylistic changes. This provides significant improvements in each of the cross-domain cases. Advances in authorship attribution offer both positive and negative repercussions for security. However, it is important to understand the assumptions that underlie these results. Blind application of stylometric methods could be dangerous if the domain is not understood. This work shows that stylometric methods are domain dependent. Whether used defensively or offensively, this is certain to impact user account security.

Speakers: Aylin greenie Rebekah Overdorf