%pylab inline
import os
import subprocess
This blog post is a follow-up to my article "A Deep Learning Approach to Program Similarity", where I go deeper into why we want to do this.
Technically, a binary is any file stored on disk, since every file is stored as a sequence of 0s and 1s, but throughout these blog posts we will consider only executable binaries: programs.
The original aim of the paper was to find a way to classify similar programs by leveraging the information stored in the binary, hoping that the classification would resemble semantic similarity as closely as possible (true semantic similarity is technically not computable). The idea is that every executable must contain all the semantic content of the program, hidden somewhere in its sequence of 0s and 1s, so we should be able to train a deep network complex enough to capture these patterns accurately.
For no particular reason, other than the huge wealth of information on the topic, we will try to classify images extracted from the binaries. After all, if the information is still there, it shouldn't matter what kind of representation we use, as long as we don't lose any of it.
(I'm lying: there's more to this, but I don't know if it will fit in these blog posts.)
So let's import an executable file and transform it into an image.
Import the executable file as a numpy array:
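def read_data(filename):
    # Open in binary mode: an executable is not text
    with open(filename, 'rb') as infile:
        # Each array element packs two consecutive bytes of the file
        data = np.fromfile(infile, dtype=np.uint16)
    return data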
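The choice of dtype decides how many bytes end up in each pixel. As a small sketch (the byte string below is only the four-byte magic number that ELF files start with, not a real executable), the same bytes can be viewed as four 8-bit values or as two 16-bit values. `np.frombuffer` and the explicit little-endian `'<u2'` dtype are my choices for the illustration, not something `read_data` uses:

```python
import numpy as np

# The 4-byte magic number at the start of an ELF file
raw = b"\x7fELF"

# One value per byte
as_bytes = np.frombuffer(raw, dtype=np.uint8)

# One value per pair of bytes, little-endian (low byte first)
as_words = np.frombuffer(raw, dtype="<u2")

print(as_bytes.tolist())  # [127, 69, 76, 70]
print(as_words.tolist())  # [17791, 17996]
```

On a little-endian machine `np.uint16` behaves like `'<u2'`, so with `read_data` each pixel packs two consecutive bytes of the file.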
This is the first program, just a hello world, without the world part:
#include <stdio.h>
int main(void){
    printf("hello");
    return 0;
}
We store it in the 'binaries' folder and compile it with 'gcc hello_world.c -o hello_world'. Alternatively, we can start automating everything, since we'll need to work with lots of programs in the end:
def call(command):
    # Run a shell-style command string and capture stdout and stderr
    process = subprocess.Popen(command.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    output, error = process.communicate()
    return output, error

def compile_source(source, binary):
    if os.path.isfile(binary):  # skip if the file has already been compiled
        return
    output, error = call('gcc -std=c99 ' + source + ' -o ' + binary + ' -lm')
    if error:
        # gcc reports both warnings and errors on stderr
        print(error.decode('utf-8', errors='replace'))
source_path = 'binaries/hello_world.c'
exe_path = source_path[:-2]  # strip the '.c' extension
compile_source(source_path, exe_path)
data = read_data(exe_path)
print(data.shape)
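One caveat with the helpers above: `command.split()` breaks on paths that contain spaces, and gcc writes warnings to stderr too, so a non-empty stderr does not necessarily mean the build failed. A minimal sketch of an alternative, assuming we pass the arguments as a list and look at the return code instead (`run_command` and `gcc_command` are hypothetical names, not part of the original code):

```python
import subprocess
import sys


def run_command(args):
    """Run a command given as a list of arguments.

    Returns (returncode, stdout, stderr), with output decoded to text.
    """
    process = subprocess.Popen(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output to text
    )
    output, error = process.communicate()
    return process.returncode, output, error


def gcc_command(source, binary):
    """Build the gcc invocation as a list, with the flags used by compile_source."""
    return ["gcc", "-std=c99", source, "-o", binary, "-lm"]


# Demo with the Python interpreter itself, so the sketch runs without gcc
code, out, err = run_command([sys.executable, "-c", "print('hello')"])
print(code, out.strip())

# Paths with spaces stay intact because no string splitting is involved
print(gcc_command("binaries/my program.c", "binaries/my program"))
```

A return code of 0 means the command succeeded, whatever ended up on stderr.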
At this point we have just imported the executable file as one big one-dimensional numpy array. What happens if we visualize it now? Well, it would just look like a column (or row) of differently colored pixels, which is not very interesting. (It would also give us an error with most *.imshow() functions, since they expect more than one dimension.)
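Let's then assume that our images have a width, and set it to $64$: it's a nice power of $2$ and it echoes the 64-bit word size of our systems. (It is not an instruction length: on x86-64, for instance, instructions are variable-length and at most 15 bytes long. Since each value we read is 16 bits wide, a row of $64$ values covers $128$ bytes of the file.)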
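To get a feeling for the resulting image sizes, the arithmetic can be sketched as follows. The helper name `image_shape` and the 8600-byte file size are made up for the example; the 2-byte item size assumes the `np.uint16` values read earlier:

```python
def image_shape(n_bytes, width=64, itemsize=2):
    """Image shape and zero padding for a file of n_bytes.

    itemsize=2 assumes 16-bit values (np.uint16).
    """
    n_values = n_bytes // itemsize   # floor division: an odd trailing byte is not counted
    height = -(-n_values // width)   # ceiling division
    padding = height * width - n_values
    return (height, width), padding


# Hypothetical 8600-byte executable
shape, padding = image_shape(8600)
print(shape, padding)  # (68, 64) 52
```

So 8600 bytes become 4300 values, which need 68 rows of 64 pixels, with the last row completed by 52 zeros.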
def reshape(data, width=64):
    # Pad with zeros so that the length is a multiple of the width
    padding = -data.shape[0] % width
    data = np.pad(data, (0, padding), mode='constant')
    # -1 lets numpy infer the height
    return data.reshape((-1, width))

data = reshape(data)
print(data.shape)
plt.imshow(data, cmap=plt.cm.jet)
Now, that looks like a proper ELF file (since we're working on Ubuntu) with its structure. We can also see that it has a lot of padding (those big blue areas represent zeros), but we will deal with this later.
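Both observations can also be checked numerically rather than by eye. The sketch below is only an illustration: `is_elf` and `zero_fraction` are hypothetical helpers, and the demo runs on a 16-byte stand-in file rather than on a real build, so that it works anywhere:

```python
import os
import tempfile


def is_elf(raw):
    """True if the byte string starts with the ELF magic number."""
    return raw[:4] == b"\x7fELF"


def zero_fraction(raw):
    """Fraction of the bytes in raw that are zero."""
    return raw.count(b"\x00") / float(len(raw)) if raw else 0.0


# Stand-in for a real executable: ELF magic followed by 12 zero bytes.
# Point `path` at e.g. 'binaries/hello_world' to check a real build.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x7fELF" + b"\x00" * 12)
    path = tmp.name

with open(path, "rb") as infile:
    raw = infile.read()
os.remove(path)

print(is_elf(raw), zero_fraction(raw))  # True 0.75
```

The zero fraction counts individual bytes, while the image shows 16-bit values, so the two are related but not identical measures of padding.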
The idea is still to use Deep Learning for image recognition applied to our problem, so we should add more samples to the dataset.
Since the images will have to be classified, we chose small programs (only in C) downloaded from https://www.programiz.com/c-programming/examples. Here are a couple of them:
Greatest Common Divisor (gcd.c):
#include <stdio.h>
int main()
{
    int n1, n2, i, gcd;
    printf("Enter two integers: ");
    scanf("%d %d", &n1, &n2);
    for(i=1; i <= n1 && i <= n2; ++i)
    {
        // Checks if i is factor of both integers
        if(n1%i==0 && n2%i==0)
            gcd = i;
    }
    printf("G.C.D of %d and %d is %d", n1, n2, gcd);
    return 0;
}
N is Palindrome (n_is_palindrome.c):
#include <stdio.h>
int main()
{
    int n, reversedInteger = 0, remainder, originalInteger;
    printf("Enter an integer: ");
    scanf("%d", &n);
    originalInteger = n;
    // reversed integer is stored in variable
    while( n!=0 )
    {
        remainder = n%10;
        reversedInteger = reversedInteger*10 + remainder;
        n /= 10;
    }
    // palindrome if originalInteger and reversedInteger are equal
    if (originalInteger == reversedInteger)
        printf("%d is a palindrome.", originalInteger);
    else
        printf("%d is not a palindrome.", originalInteger);
    return 0;
}
We can visualize them:
def visualize(source_path):
    exe_path = source_path[:-2]  # strip the '.c' extension
    compile_source(source_path, exe_path)
    data = read_data(exe_path)
    data = reshape(data)
    plt.figure()  # new figure, so consecutive calls don't draw over each other
    plt.imshow(data, cmap=plt.cm.jet)

f_p_1 = 'binaries/gcd.c'
f_p_2 = 'binaries/n_is_palindrome.c'
visualize(f_p_1)
visualize(f_p_2)
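As the dataset grows, listing every path by hand gets tedious. A minimal sketch of how the sources in a folder could be collected, assuming all of them end in '.c' (`list_sources` is a hypothetical helper, and the demo uses a throwaway folder with empty files so that it runs anywhere):

```python
import glob
import os
import tempfile


def list_sources(folder):
    """Return sorted (source_path, exe_path) pairs for every .c file in folder."""
    pairs = []
    for source_path in sorted(glob.glob(os.path.join(folder, "*.c"))):
        exe_path, _ = os.path.splitext(source_path)  # drop the '.c' extension
        pairs.append((source_path, exe_path))
    return pairs


# Demo on a throwaway folder with two empty source files
folder = tempfile.mkdtemp()
for name in ("gcd.c", "n_is_palindrome.c"):
    open(os.path.join(folder, name), "w").close()

for source_path, exe_path in list_sources(folder):
    print(os.path.basename(source_path), "->", os.path.basename(exe_path))
```

In the notebook, the loop body would simply call `visualize(source_path)` for each entry of `list_sources('binaries')`.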
OK, now we have a pretty easy way to visualize the executable files in our dataset. In the next posts we'll go over how to use them for classification.