Using natural language processing to reverse engineer code

Japanese Russian dictionary

Larry Wall, creator of the Perl programming language, created a custom degree plan in college, an interdisciplinary course of study in natural and artificial languages, i.e. linguistics and programming languages. Many of the features of Perl were designed as an attempt to apply natural language principles to the design of an artificial language.

I’ve been thinking of a different connection between natural and artificial languages, namely using natural language processing (NLP) to reverse engineer source code.

The source code of computer program is text, but not a text. That is, it consists of plain text files, but it’s not a text in the sense that Paradise Lost or an email is a text. The most efficient way to parse a programming language is as a programming language. Treating it as an English text will loose vital structure, and wrongly try to impose a foreign structure.

But what if you have two computer programs? That’s the problem I’ve been thinking about. I have code in two very different programming languages, and I’d like to know how functions in one code base relate to those in the other. The connections are not ones that a compiler could find. The connections are more psychological than algorithmic. I’d like to reverse engineer, for example, which function in language A a developer had in mind when he wrote a function in language B.

Both code bases are in programming language, but the function names are approximately natural language. If a pair of functions have the same name in both languages, and that name is not generic, then there’s a good chance they’re related. And if the names are similar, maybe they’re related.

I’ve done this sort of thing informally forever. I imagine most programmers do something like this from time to time. But only recently have I needed to do this on such a large scale that proceeding informally was not an option. I wrote a script to automate some of the work by looking for fuzzy matches between function names in both languages. This was far from perfect, but it reduced the amount of sleuthing necessary to line up the two sets of source code.

Around a year ago I had to infer which parts of an old Fortran program corresponded to different functions in a Python program. I also had to infer how some poorly written articles mapped to either set of source code. I did all this informally, but I wonder now whether NLP might have sped up my detective work.

Another situation where natural language processing could be helpful in software engineering is determining code authorship. Again this is something most programmers have probably done informally, saying things like “I bet Bill wrote this part of the code because it looks like his style” or “Looks like Pat left her fingerprints here.” This could be formalized using NLP techniques, and I imagine it has been. Just as Frederick Mosteller and colleagues did a statistical analysis of The Federalist Papers to determine who wrote which paper, I’m sure there have been similar analyses to try to find out who wrote what code, say for legal reasons.

Maybe this already has a name, but I like “unnatural language processing” for the application of natural language processing to unnatural (i.e. programming) languages. I’ve done a lot of ad hoc unnatural language processing, and I’m curious how much of it I could automate in the future.

Related NLP posts

6 thoughts on “Unnatural language processing”

Bob Cunningham

11 March 2017 at 09:05

Two thoughts based on work I did decades ago:

The assembler output of different compilers can be compared using well-established reverse-engineering tools. Back in the day when UV-EPROM (!!) space was precious, this was often done to compare the output of compilers for different languages and/or different compilers for a single language.

The reverse-engineering tools did have limitations, so obtaining some program behavior details required obtaining and comparing code execution traces.

But how to compare source for a compiled language with an interpreted one, or two interpreted languages?

The comparison is simplified if both programs are in the same language: What transforms would you apply to convert a program in language A to language B? I found that a full conversion often wasn’t needed: Using an intermediate pidgin that each could more easily be converted to could be “close enough” to make accurate functional comparisons.

Attempting this today, I’d probably experiment with extracting and comparing the AST (abstract syntax tree) from each language compiler/interpreter.

Source-level tools can only compare source: Library calls are often unique to the language (and OS), and only the binary-level whole-program comparison could peer inside them. But that was so much work that we’d avoid it to the greatest extent possible.

David Locke

11 March 2017 at 18:16

In the past, I’ve run across books on idiom, like C+ Idiom or Python Idiom.

Map source language to the algorithm to target language.

Patrick Stein

12 March 2017 at 09:50

I’ve seen a number of papers on identifying who wrote particular source code. Even this one that analyzes the generated binary file: http://ftp.cs.wisc.edu/paradyn/papers/Rosenblum11Authorship.pdf

Google “code authorship” for many more.

It may also be fruitful to compare the runtime performance of bits of code, too, if trying to find which parts map to each other. “This section of code just accessed 2MB of data linearly interspersed with bouncing all through the heap.

12 March 2017 at 10:21

Even names can be tricky.

I once had to modify a Fortran-66 code where a particular variable was “cwiggle”. Looking at the journal article that the programmer had referenced at the top of the source it turned out that was his translation for the Greek letter for “zeta”.

Andreas Hadjiprocopis

12 March 2017 at 15:10

Regarding “code authorship”, it sounds to me as a good idea to assemble a ‘corpus’ from github (small+personal projects) and test your methods to determine which author wrote what.

You can even sponsor an author-identification competition. Apart from harvesting the methods developed during the competition, I am quite sure programmers will soon start adopting special-personal code-styles to be identified (and honored) by – re: Kliroy was here. Except from Stuxnet guys, of course. They will stay mainstream for obvious reasons.

There are 32 keywords in C and 150,000+ words in English. So, in programming one has a limited repertoire of vocabulary (not the variables) and when the programmer-artist tries to “express him/herself” via this art called programming, there are countless tweaks and tricks one can use. Comments excluded.

Following are some ideas for code-author discrimination (appropriate for a Sunday afternoon):
1) first and foremost, indentation. This only applies for languages belonging to the pre-fascist programming era where indentation was not enforced by lang-spec.

1-1/2) TAB was the invention of a genious. Tab versus tap-tap-tap-space can reveal a lot about the mental composition of a programmer.

1-3/4) A mixture of indendations ofthen is a sign of extreme copy-paste from external sources.

2) for/while-loops (and indentation therein):
I knew a fellow programmer, now, sadly, institutionalised, who had
for(i=MAXAINDEX;i–>0;){} [AINDEX is sic]
tatooed onto his skin. Note the i–>0. And are countless others. while-loops are, in my experience, rare, maybe unecessary, but do-while’s a master’s stroke.

3) Initialising other variables within a for-loop is also a kind of (rare) compulsion (and often does makes sense) e.g. for(i=0,p=&(data[i]);*p;p++){} /*C*/

4) Years ago, at university, there were two opposite factions: one worshiped ascii32 (aka space), the other denounced it, hated it, hated it.
A bull seing red would not have felt the rush of vasopressin more in his brain than them space-haters, when confronted by a forloop or an if-statement with superfluous spaces after the “if” or “for”. In the case of “for”, some go all the way introducing it even in between its 3 expressions. In Cincinatti State prison, there is a graffiti scratched on the wall : “I killed for ( i = 0 ; i < 1 ; i + + )". Space can be as annoying as upper-case comments (plus more expensive byte-wise) and I saw blood spilt for it. And right it was.

5) i++ vs ++i.
All (cs-people) understand i=i+1, All-minus-few understand i+=1, All-few-few understand i++. But i++ versus ++i is quite a tricky concept and few would risk using ++i in their programs without being prepared to expend the mental energy to enquire post/pre-increment side-effects (e.g. does it create a temp var or a new object etc.). So in this case, those who understand these tricky concepts use the constructs with a zeal bordering to, if not surpassing, the exhibitionist. Others stick to mainstream and happy to earn their daily bread.

6) Personally, when starting a new project and naming functions I can do either java-like abcXyzAbc() or c/c++-like abc_xyz_abc() depending on mood – but be consistent throughout the project. I rarely do fortran cryptic funcnames – but when i do i do good e.g. abax, tabax or xabatabax.

6 1/2) and being on var and func naming, quite a few programmers opt for names in the realm of scatology. Others from the domain of anatomy, often the parts we tend to hide with underwear. Or the parts we miss most. Others a healthy mixture of both. I have also seen them using names from the bible – Dodo, Lot & Job. And of course notorious Sodoma arrays.

Travis Hoppe

5 April 2017 at 22:00

Regarding the statistical properties of keywords, awhile back I looked to see if they follow a power law as natural language does (eg. Zipf’s law). Interestingly, it turns out that they _don’t_, though the comments do. I called it “code linguistics” for lack of a better term. There’s a nice plot here:

http://thoppe.github.io/code-linguistics/HnC_presentation.html#/3/2

Comments are closed.