%pylab inline
import os
How do we transform programs in a way that preserves their semantic properties but still changes them enough to work as a data augmentation tool for image classifications? That was our question in the last blog.
What is a transformation of the images that will preserve the semantic content of the code?
Trick question, since the transformation applies to the images only as an afterthought, as the best way to preserve the program itself while modifying it is to apply obfuscations.
For the sake of this blog post I won't dwell deep into the subject, but suffice to say that obfuscations are simply syntactic transformations applied to the code in order to render reverse engineering harder while preserving the semantic content (the obfuscated program calculates exactly the same function as the original one).
Also, we won't be writing any obfuscations ourselves but we'll make use of Tigress, a state of the art obfuscator for C code:
http://tigress.cs.arizona.edu/ <- refer to this website for a description of all the transformation we use.
We will focus on certain transformations embedded in Tigress, since we don't need the code to be effectively obfuscated (as in, we don't need to fool reverse engineers), we just need to generate programs that process the same function but differ in their binary representation.
Let's see how we can apply obfuscations to our code, first in a simple way and then iteratively. Applying the transformation Split to the main function of gcd.c we create a new file, gcd-Split.c that will have the original main function split into two subfunctions.
import blog_1 as b1
command = 'tigress --Transform=Split --Functions=main --out=obf_binaries/gcd-Split.c binaries/gcd.c'
output, error = b1.call(command)
print error #not technically an error, but it's everything that gets printed to stderr
The original function:
int main()
{
int n1, n2, i, gcd;
printf("Enter two integers: ");
scanf("%d %d", &n1, &n2);
for(i=1; i <= n1 && i <= n2; ++i){
if(n1%i==0 && n2%i==0)
gcd = i;
}
printf("G.C.D of %d and %d is %d", n1, n2, gcd);
return 0;
}
The split version:
int main(int _formal_argc , char **_formal_argv , char **_formal_envp )
{
int n1 ;
int n2 ;
int i ;
int gcd ;
int _BARRIER_0 ;
{
megaInit();
_global_argc = _formal_argc;
_global_argv = _formal_argv;
_global_envp = _formal_envp;
_BARRIER_0 = 1;
_1_main_main_split_1(& n1, & n2, & i);
_1_main_main_split_2(& n1, & n2, & i, & gcd);
return (0);
}
}
void _1_main_main_split_1(int *n1 , int *n2 , int *i )
{
{
printf((char const */* __restrict */)"Enter two integers: ");
scanf((char const */* __restrict */)"%d %d", & *n1, & *n2);
*i = 1;
}
}
void _1_main_main_split_2(int *n1 , int *n2 , int *i , int *gcd )
{
{
while (1) {
if (*i <= *n1) {
if (! (*i <= *n2)) {
break;
}
} else {
break;
}
if (*n1 % *i == 0) {
if (*n2 % *i == 0) {
*gcd = *i;
}
}
(*i) ++;
}
printf((char const */* __restrict */)"G.C.D of %d and %d is %d", *n1, *n2, *gcd);
}
}
Well, they do indeed look very different. Now we have a 'gcd-Split.c' file in our 'obf_binaries' folder, we can compile it and show it just like we did in the last blog post.
First the original, then the obfuscated version:
f_p = 'binaries/gcd.c'
b1.visualize(f_p)
f_p = 'obf_binaries/gcd-Split.c'
b1.visualize(f_p)
Now the two binaries don't appear as different as the source codes, what happened?
Truth be told, a lot of optimizations take place during compilation, so it's possible that a lot of the code that was autogenerated by Tigress didn't get past the compiler. But the images are indeed different, it's just that their differences aren't easy to spot for a human eye.
f_p = 'binaries/gcd'
o_f_p = 'obf_binaries/gcd-Split'
bin_img = b1.reshape(b1.read_data(f_p))
obf_img = b1.reshape(b1.read_data(o_f_p))
diff = abs(len(bin_img) - len(obf_img)) #difference in size between the two images
print diff
plt.imshow(np.equal(bin_img, obf_img[:-diff]))
Yellow represents True, while purple is False, this means that the padding is mostly the same (it's all zeros afterall) while the actual body of the function is different. (although the np.equal comparison might be too strict, the binaries could just be shifted)
It has to be noted that we had to compare the entire original binary with only a portion of the obfuscated file, since they have different lengths.
Tigress allows to stack obfuscations on top of one another, which grants us a way to complicate the binary even further. Let's try to obfuscate our GCD program with both Split and InitOpaque.
command = 'tigress --Transform=Split --Functions=main --Transform=InitOpaque --Functions=main --out=obf_binaries/gcd-Split-InitOpaque.c binaries/gcd.c'
output, error = b1.call(command)
f_p = 'binaries/gcd.c'
b1.visualize(f_p)
f_p = 'obf_binaries/gcd-Split-InitOpaque.c'
b1.visualize(f_p)
f_p = 'binaries/gcd'
o_f_p = 'obf_binaries/gcd-Split-InitOpaque'
bin_img = b1.reshape(b1.read_data(f_p))
obf_img = b1.reshape(b1.read_data(o_f_p))
diff = abs(len(bin_img) - len(obf_img)) #difference in size between the two images
print diff
plt.imshow(np.equal(bin_img, obf_img[:-diff]))
Now this looks like a substantial chage from the original, we are lucky that the transformation InitOpaque initializes a lot of variables and functions that can be used to insert opaque predicates, and we are lucky that these variables and functions (although unused) can't be excised by the compiler.
We notice that the first few lines always return a lot of matches, that's because the header of an ELF file (such as any executable on this Ubuntu machine) is standard. We will in fact skip it entirely in later iterations.
Let's build a function to help us automate these stacked obfuscations:
def _obfuscate(source, trans_fun=[('InitOpaque', 'main')]):
binary = source[:-2]
command = 'tigress '
ts = ''
for t, f in trans_fun:
command += ('--Transform=' + t + ' --Functions=' + f +' ')
ts += '-' + t
command += '--Transform=CleanUp --CleanUpKinds=annotations,constants,names ' #needed
target_path = 'obf_binaries/'
target_name = binary + ts
target_extension = '.c '
target = target_path + target_name + target_extension
command += '--Environment=x86_64:Linux:Gcc:4.6 '
command += '--out=' + target
command += 'binaries/' + source
output, error = b1.call(command)
b1.compile_source(target, target_path + target_name)
Technically now we just need a list of transformations that we can apply to the code as-is in order to create a list of their combinations. Since with 8 transformations we can technically have around 110.000 loops, we will limit this algorithm to only 20 iterations.
As a brief aside, we always start with programs that has the logic mostly contained in the main function, as that is the function that we always modify. We experimented with obfuscating all the functions (or subsets of all) but certain obfuscations create way too many new functions, it rendered the stacked approach unfeasible.
import itertools
transformations = ['Split', 'EncodeLiterals','EncodeArithmetic', 'InitEntropy', 'RandomFuns', 'InitImplicitFlow', 'Flatten', 'InitOpaque']
limit = 20
def obfuscate(sources):
c = itertools.combinations
p = itertools.permutations
base = len(sources)
counter = 0
for s in sources:
print 'Obfuscating ', s
counter += 1
count = 0
for i in range(1, len(transformations) + 1):
for comb in c(transformations, i):
for perm in p(comb):
transf = []
count += 1
if count >= limit:
break
for el in perm:
transf.append((el, 'main'))
print count, ') Obfuscations: ', perm
print 'obfuscate("', s, '",trans_fun=', transf,')'
_obfuscate(s, trans_fun=transf)
sources = ('gcd.c',)
obfuscate(sources)
Let's check some of these:
f_p = 'obf_binaries/gcd-Split-InitImplicitFlow'
o_f_p = 'obf_binaries/gcd-Split-RandomFuns'
bin_img = b1.reshape(b1.read_data(f_p))
obf_img = b1.reshape(b1.read_data(o_f_p))
diff = abs(len(bin_img) - len(obf_img)) #difference in size between the two images
plt.imshow(np.equal(bin_img, obf_img[:-diff]))
f_p = 'obf_binaries/gcd-Split-EncodeLiterals'
o_f_p = 'obf_binaries/gcd-EncodeLiterals-Split'
bin_img = b1.reshape(b1.read_data(f_p))
obf_img = b1.reshape(b1.read_data(o_f_p))
diff = abs(len(bin_img) - len(obf_img)) #difference in size between the two images
plt.imshow(np.equal(bin_img, obf_img[:-diff]))
Changing the parameter limit we now can freely augment our dataset by a factor of over 100k (but even in our study we constrained it to 200).
In the next blog post we'll start thinking about a Neural Network that can help us classify these programs as similar.
>>