%pylab inline
import os
In the last blog post we discussed how to automate the visualization of binaries. So let's assume we've gathered a sizable dataset of simple programs: in our case, 32 simple programs downloaded from https://www.programiz.com/c-programming/examples and 15 files taken from solutions to Google Code Jam competitions. Here's the complete list:
[x for x in os.listdir('binaries/') if x.endswith('.c')]
The 32 initial programs are very simple and result in very small binaries (easier to deal with when we classify images), while the other 15 are slightly bigger and closer to real programs. The filenames should be explanatory enough to tell which functions are being computed.
*Those that aren't clear are probably from GCJ*
In a classification problem we would now have 47 classes; in other words, we have 47 functions that compute different things.
*Truth be told, we inserted 2 programs that have 2 implementations each, gcd and fib; I'll talk about that in the future*
But if we started training now, there would be no way to avoid overfitting (or to test at all), since every class contains only one sample, and with one sample per class there's no way to discover the patterns we hope to find.
To keep things simple: in image processing, when datasets are small we can turn to data augmentation:
import warnings; warnings.simplefilter('ignore') #I have some messed up dependencies, let's not talk about this
from scipy import ndimage
import random
foto = plt.imread('foto.png') #this is my pic, you can choose anything
plt.imshow(foto)
Let's assume this is the only sample from the class 'gazing_into_the_abyss'. We can provide the dataset with many more samples from this class by transforming it in random ways that keep the semantic content intact.
For example, we can rotate the image, mask it with a circular dark border, zoom into it (crop), and flip it vertically or horizontally.
def rotate(foto):
    # rotate by a random angle
    random_degree = random.uniform(-34, 46)
    return ndimage.rotate(foto, random_degree)

def mask(foto):
    # work on a copy so we don't zero out the original sample in place
    foto = foto.copy()
    lx, ly, lz = foto.shape
    X, Y = np.ogrid[0:lx, 0:ly]
    circle = (X - lx / 2) ** 2 + (Y - ly / 2) ** 2 > lx * ly / 4
    foto[circle] = 0
    return foto

def crop(foto):
    # cut off a random border on every side
    lx, ly, lz = foto.shape
    i = random.randint(4, 15)
    return foto[lx // i: -lx // i, ly // i: -ly // i]

def flipud(foto):
    return np.flipud(foto)

def fliplr(foto):
    return np.fliplr(foto)
Now, to use these transformations, we can create as many variations of the original sample as we like by chaining a random number of randomly chosen transformations.
def apply_random_transf(foto):
    mod_fotos = []
    transformations = {'rotate': rotate, 'mask': mask, 'crop': crop,
                       'flipud': flipud, 'fliplr': fliplr}
    n_images = 20
    for i in range(n_images):
        t_foto = foto
        n_transformations = random.randint(1, 3)  # chain at most 3 transformations
        for j in range(n_transformations):
            key = random.choice(list(transformations))  # dict keys need list() in Python 3
            t_foto = transformations[key](t_foto)
        mod_fotos.append(t_foto)
    return mod_fotos
foto = plt.imread('foto.png')
mod_fotos = apply_random_transf(foto)
And now we can display (or save) the 20 new samples in our class:
fig = plt.figure(figsize=(20, 20))
for i in range(len(mod_fotos)):
    img = mod_fotos[i]
    fig.add_subplot(4, 5, i + 1)
    plt.imshow(img, cmap=plt.cm.jet)
    plt.title(i)  # legend() doesn't work with images, so title each subplot with its index
Ok, this is neat and all, but how would this work for binaries?
#import the functions from the first blog post
import blog_1 as b1
f_p = 'binaries/gcd.c'
bin_img = b1.reshape(b1.read_data(f_p))
b1.visualize(f_p)
plt.savefig('bin_img.png')
foto = plt.imread('bin_img.png')
mod_fotos = apply_random_transf(foto)
fig = plt.figure(figsize=(20, 20))
for i in range(len(mod_fotos)):
    img = mod_fotos[i]
    fig.add_subplot(4, 5, i + 1)
    plt.imshow(img, cmap=plt.cm.jet)
    plt.title(i)
Ok, there are several problems here. We can get rid of the axes no problem (just lazy to go back and fix the code), but that would still leave us with much worse problems. In image classification, all of the above transformations preserve the semantic content of the images. But think about our real problem at hand: a binary's image is just its bytes laid out in reading order, so rotating, cropping, masking, or flipping the image reorders or throws away those bytes, and with them the instructions that make the program what it is.
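To make this concrete, here's a minimal sketch (with made-up byte values standing in for a real binary) showing that even an innocent flip completely reorders the underlying byte stream:
import numpy as np

# hypothetical stand-in for a binary: 16 consecutive byte values
# laid out in reading order, reshaped into a 4x4 'image'
bytes_as_img = np.arange(16, dtype=np.uint8).reshape(4, 4)
flipped = np.flipud(bytes_as_img)
print(bytes_as_img.flatten())  # original byte order: [ 0  1  2 ... 15]
print(flipped.flatten())       # after the flip:      [12 13 14 15  8  9 ...]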
For this reason, common data augmentation tactics won't work on binary images, and we have to think of something else. How do we transform programs in a way that preserves their semantic properties but still changes them enough to work as a data augmentation tool for image classification?
We'll see in the next blog post.