[previous] $\leftarrow$ blog post $\rightarrow$ [next]

In [1]:
%pylab inline
import os
Populating the interactive namespace from numpy and matplotlib

[ 31/08/2018] - Binary Data Augmentation

How do we transform programs in a way that preserves their semantic properties but still changes them enough to work as a data augmentation tool for image classifications? That was our question in the last blog.

What is a transformation of the images that will preserve the semantic content of the code?
Trick question, since the transformation applies to the images only as an afterthought, as the best way to preserve the program itself while modifying it is to apply obfuscations.

For the sake of this blog post I won't dwell deep into the subject, but suffice to say that obfuscations are simply syntactic transformations applied to the code in order to render reverse engineering harder while preserving the semantic content (the obfuscated program calculates exactly the same function as the original one).

Also, we won't be writing any obfuscations ourselves but we'll make use of Tigress, a state of the art obfuscator for C code:
http://tigress.cs.arizona.edu/ <- refer to this website for a description of all the transformation we use.
We will focus on certain transformations embedded in Tigress, since we don't need the code to be effectively obfuscated (as in, we don't need to fool reverse engineers), we just need to generate programs that process the same function but differ in their binary representation.

Let's see how we can apply obfuscations to our code, first in a simple way and then iteratively. Applying the transformation Split to the main function of gcd.c we create a new file, gcd-Split.c that will have the original main function split into two subfunctions.

In [2]:
import blog_1 as b1
command = 'tigress --Transform=Split --Functions=main --out=obf_binaries/gcd-Split.c binaries/gcd.c'
output, error = b1.call(command)
In [3]:
print error #not technically an error, but it's everything that gets printed to stderr
gcc -D_GNUCC -E -DCIL=1 binaries/gcd.c -o /tmp/cil-Ntkh6xDa.i
/home/utente/phd/bin/tigress-2.2//cilly.asm.exe --doObf --Transform Split --Functions main --out obf_binaries/gcd-Split.c /tmp/cil-Ntkh6xDa.i
gcc -D_GNUCC -E obf_binaries/gcd-Split.c -o /tmp/cil-qkK8ASm0.cil.i
gcc -D_GNUCC -c -o /tmp/cil-BTWce0h3.o /tmp/cil-qkK8ASm0.cil.i
gcc -D_GNUCC -o a.out /tmp/cil-BTWce0h3.o

The original function:


int main()
{
    int n1, n2, i, gcd;

    printf("Enter two integers: ");
    scanf("%d %d", &n1, &n2);

    for(i=1; i <= n1 && i <= n2; ++i){
        if(n1%i==0 && n2%i==0)
            gcd = i;
    }

    printf("G.C.D of %d and %d is %d", n1, n2, gcd);

    return 0;
}

The split version:


int main(int _formal_argc , char **_formal_argv , char **_formal_envp ) 
{ 
  int n1 ;
  int n2 ;
  int i ;
  int gcd ;
  int _BARRIER_0 ;

  {
  megaInit();
  _global_argc = _formal_argc;
  _global_argv = _formal_argv;
  _global_envp = _formal_envp;
  _BARRIER_0 = 1;
  _1_main_main_split_1(& n1, & n2, & i);
  _1_main_main_split_2(& n1, & n2, & i, & gcd);
  return (0);
}
}
void _1_main_main_split_1(int *n1 , int *n2 , int *i ) 
{ 


  {
  printf((char const   */* __restrict  */)"Enter two integers: ");
  scanf((char const   */* __restrict  */)"%d %d", & *n1, & *n2);
  *i = 1;
}
}
void _1_main_main_split_2(int *n1 , int *n2 , int *i , int *gcd ) 
{ 


  {
  while (1) {
    if (*i <= *n1) {
      if (! (*i <= *n2)) {
        break;
      }
    } else {
      break;
    }
    if (*n1 % *i == 0) {
      if (*n2 % *i == 0) {
        *gcd = *i;
      }
    }
    (*i) ++;
  }
  printf((char const   */* __restrict  */)"G.C.D of %d and %d is %d", *n1, *n2, *gcd);
}
}

Well, they do indeed look very different. Now we have a 'gcd-Split.c' file in our 'obf_binaries' folder, we can compile it and show it just like we did in the last blog post.

First the original, then the obfuscated version:

In [4]:
f_p = 'binaries/gcd.c'
b1.visualize(f_p)
In [5]:
f_p = 'obf_binaries/gcd-Split.c'
b1.visualize(f_p)

Now the two binaries don't appear as different as the source codes, what happened?
Truth be told, a lot of optimizations take place during compilation, so it's possible that a lot of the code that was autogenerated by Tigress didn't get past the compiler. But the images are indeed different, it's just that their differences aren't easy to spot for a human eye.

In [6]:
f_p = 'binaries/gcd'
o_f_p = 'obf_binaries/gcd-Split'
bin_img = b1.reshape(b1.read_data(f_p))
obf_img = b1.reshape(b1.read_data(o_f_p))
diff = abs(len(bin_img) - len(obf_img)) #difference in size between the two images
print diff
plt.imshow(np.equal(bin_img, obf_img[:-diff]))
1
Out[6]:
<matplotlib.image.AxesImage at 0x7f9cad27ead0>

Yellow represents True, while purple is False, this means that the padding is mostly the same (it's all zeros afterall) while the actual body of the function is different. (although the np.equal comparison might be too strict, the binaries could just be shifted)
It has to be noted that we had to compare the entire original binary with only a portion of the obfuscated file, since they have different lengths.


Tigress allows to stack obfuscations on top of one another, which grants us a way to complicate the binary even further. Let's try to obfuscate our GCD program with both Split and InitOpaque.

In [7]:
command = 'tigress --Transform=Split --Functions=main --Transform=InitOpaque --Functions=main --out=obf_binaries/gcd-Split-InitOpaque.c binaries/gcd.c'
output, error = b1.call(command)
In [8]:
f_p = 'binaries/gcd.c'
b1.visualize(f_p)
In [9]:
f_p = 'obf_binaries/gcd-Split-InitOpaque.c'
b1.visualize(f_p)
In [10]:
f_p = 'binaries/gcd'
o_f_p = 'obf_binaries/gcd-Split-InitOpaque'
bin_img = b1.reshape(b1.read_data(f_p))
obf_img = b1.reshape(b1.read_data(o_f_p))
diff = abs(len(bin_img) - len(obf_img)) #difference in size between the two images
print diff
plt.imshow(np.equal(bin_img, obf_img[:-diff]))
130
Out[10]:
<matplotlib.image.AxesImage at 0x7f9cad168110>

Now this looks like a substantial chage from the original, we are lucky that the transformation InitOpaque initializes a lot of variables and functions that can be used to insert opaque predicates, and we are lucky that these variables and functions (although unused) can't be excised by the compiler.
We notice that the first few lines always return a lot of matches, that's because the header of an ELF file (such as any executable on this Ubuntu machine) is standard. We will in fact skip it entirely in later iterations.


Let's build a function to help us automate these stacked obfuscations:

In [11]:
def _obfuscate(source, trans_fun=[('InitOpaque', 'main')]):
    binary = source[:-2]

    command = 'tigress '
    ts = ''
    for t, f in trans_fun:
        command += ('--Transform=' + t + ' --Functions=' + f +' ')
        ts += '-' + t
    
    command += '--Transform=CleanUp --CleanUpKinds=annotations,constants,names ' #needed
    
    target_path = 'obf_binaries/'
    target_name = binary + ts 
    target_extension = '.c '
    target = target_path + target_name + target_extension

    command += '--Environment=x86_64:Linux:Gcc:4.6 '
    command += '--out=' + target
    command += 'binaries/' + source
    output, error = b1.call(command)
    
    b1.compile_source(target, target_path + target_name)

Technically now we just need a list of transformations that we can apply to the code as-is in order to create a list of their combinations. Since with 8 transformations we can technically have around 110.000 loops, we will limit this algorithm to only 20 iterations.
As a brief aside, we always start with programs that has the logic mostly contained in the main function, as that is the function that we always modify. We experimented with obfuscating all the functions (or subsets of all) but certain obfuscations create way too many new functions, it rendered the stacked approach unfeasible.

In [12]:
import itertools

transformations = ['Split', 'EncodeLiterals','EncodeArithmetic', 'InitEntropy',  'RandomFuns', 'InitImplicitFlow', 'Flatten', 'InitOpaque']
limit = 20

def obfuscate(sources):
    c = itertools.combinations
    p = itertools.permutations

    base = len(sources)
    counter = 0
    for s in sources:
        print 'Obfuscating ', s
        counter += 1
        count = 0
        for i in range(1, len(transformations) + 1):
            for comb in c(transformations, i):
                for perm in p(comb):
                    transf = []
                    count += 1
                    if count >= limit:
                        break
                    for el in perm:
                        transf.append((el, 'main'))
                    print count, ') Obfuscations: ', perm
                    print 'obfuscate("', s, '",trans_fun=', transf,')'
                    _obfuscate(s, trans_fun=transf)
In [13]:
sources = ('gcd.c',)
obfuscate(sources)
Obfuscating  gcd.c
1 ) Obfuscations:  ('Split',)
obfuscate(" gcd.c ",trans_fun= [('Split', 'main')] )
2 ) Obfuscations:  ('EncodeLiterals',)
obfuscate(" gcd.c ",trans_fun= [('EncodeLiterals', 'main')] )
3 ) Obfuscations:  ('EncodeArithmetic',)
obfuscate(" gcd.c ",trans_fun= [('EncodeArithmetic', 'main')] )
4 ) Obfuscations:  ('InitEntropy',)
obfuscate(" gcd.c ",trans_fun= [('InitEntropy', 'main')] )
5 ) Obfuscations:  ('RandomFuns',)
obfuscate(" gcd.c ",trans_fun= [('RandomFuns', 'main')] )
6 ) Obfuscations:  ('InitImplicitFlow',)
obfuscate(" gcd.c ",trans_fun= [('InitImplicitFlow', 'main')] )
7 ) Obfuscations:  ('Flatten',)
obfuscate(" gcd.c ",trans_fun= [('Flatten', 'main')] )
8 ) Obfuscations:  ('InitOpaque',)
obfuscate(" gcd.c ",trans_fun= [('InitOpaque', 'main')] )
9 ) Obfuscations:  ('Split', 'EncodeLiterals')
obfuscate(" gcd.c ",trans_fun= [('Split', 'main'), ('EncodeLiterals', 'main')] )
10 ) Obfuscations:  ('EncodeLiterals', 'Split')
obfuscate(" gcd.c ",trans_fun= [('EncodeLiterals', 'main'), ('Split', 'main')] )
11 ) Obfuscations:  ('Split', 'EncodeArithmetic')
obfuscate(" gcd.c ",trans_fun= [('Split', 'main'), ('EncodeArithmetic', 'main')] )
12 ) Obfuscations:  ('EncodeArithmetic', 'Split')
obfuscate(" gcd.c ",trans_fun= [('EncodeArithmetic', 'main'), ('Split', 'main')] )
13 ) Obfuscations:  ('Split', 'InitEntropy')
obfuscate(" gcd.c ",trans_fun= [('Split', 'main'), ('InitEntropy', 'main')] )
14 ) Obfuscations:  ('InitEntropy', 'Split')
obfuscate(" gcd.c ",trans_fun= [('InitEntropy', 'main'), ('Split', 'main')] )
15 ) Obfuscations:  ('Split', 'RandomFuns')
obfuscate(" gcd.c ",trans_fun= [('Split', 'main'), ('RandomFuns', 'main')] )
16 ) Obfuscations:  ('RandomFuns', 'Split')
obfuscate(" gcd.c ",trans_fun= [('RandomFuns', 'main'), ('Split', 'main')] )
17 ) Obfuscations:  ('Split', 'InitImplicitFlow')
obfuscate(" gcd.c ",trans_fun= [('Split', 'main'), ('InitImplicitFlow', 'main')] )
18 ) Obfuscations:  ('InitImplicitFlow', 'Split')
obfuscate(" gcd.c ",trans_fun= [('InitImplicitFlow', 'main'), ('Split', 'main')] )
19 ) Obfuscations:  ('Split', 'Flatten')
obfuscate(" gcd.c ",trans_fun= [('Split', 'main'), ('Flatten', 'main')] )

Let's check some of these:

In [14]:
f_p = 'obf_binaries/gcd-Split-InitImplicitFlow'
o_f_p = 'obf_binaries/gcd-Split-RandomFuns'
bin_img = b1.reshape(b1.read_data(f_p))
obf_img = b1.reshape(b1.read_data(o_f_p))
diff = abs(len(bin_img) - len(obf_img)) #difference in size between the two images
plt.imshow(np.equal(bin_img, obf_img[:-diff]))
Out[14]:
<matplotlib.image.AxesImage at 0x7f9cad3024d0>
In [15]:
f_p = 'obf_binaries/gcd-Split-EncodeLiterals'
o_f_p = 'obf_binaries/gcd-EncodeLiterals-Split'
bin_img = b1.reshape(b1.read_data(f_p))
obf_img = b1.reshape(b1.read_data(o_f_p))
diff = abs(len(bin_img) - len(obf_img)) #difference in size between the two images
plt.imshow(np.equal(bin_img, obf_img[:-diff]))
Out[15]:
<matplotlib.image.AxesImage at 0x7f9cae3c2650>

Changing the parameter limit we now can freely augment our dataset by a factor of over 100k (but even in our study we constrained it to 200).
In the next blog post we'll start thinking about a Neural Network that can help us classify these programs as similar.

[previous] $\leftarrow$ blog post $\rightarrow$ [next]

>>