How to effortlessly hide cryptosystems with GCC - Part 1

Hiding crypto/hashing at compile time for 1$ yelp

Over time while solving some CTF the following came out: by looking where does massive xor occure in binaries one can often shortcut the static analysis phase by discovering hot areas where interesting things happen. This is possible because numerous crackme exercises rely on some custom hashing or well-known one that are slightly altered. In this article we will verify this assumption and see if we can avoid it to actually find out these areas.

Disclaimer

I’m not a cryptanalyst nor a gcc dev, hence there are certainly some shortcuts here and there. The methodology presented here is not suited for serious applications and should only be used for testing purposes, CTF challenges etc.

XOR as a metric

If one have a look at most of symmetric crypto algorithms and hashing functions it’s quite clear that the XOR logical operation is widely spread, mainly because of it’s mathematical properties. Here is a small non exhaustive xor application list:

  • mixing function for OTP or Feistel networks
  • block cipher at every step to mix the plaintext (PT) with key material or IVs.
  • MD5, SHA-N, …
  • safe for mixing entropy sources as long as one of them is “random” and independant.
The XOR's logical table can be summarized by the following: the output is true only when inputs differ (one is true, the other is false).
XOR logic gate.





Beside that, it’s a commutative, associative and linear operator that can be used to flip bit, count parity…

Let’s take xor as our metric for spotting hot areas in a unknown binary. The idea is to draw a XOR heatmap by using sort of an histogram between the offset of the xor instruction in the binary and its runtime execution occurences. For that one can build a small PIN tool that builds this mapping during the execution. In the following example my pintool is named cryptseek.so. I also wrote a small piece of C code called dummy_xor to xor an input file, producing an output file according to the key passed as a third argument to test the pintool.

$ dd if=/dev/urandom of=input bs=1M count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0778394 s, 13.5 MB/s
$ pin -t cryptseek.so -- ./dummy_xor input output example_key
$ cat xor_count.map
Total number of threads = 1
Thread [0]
0x000000abf:1048576

So unsurprisingly it looks like one just XORed 1M of data since the pintool found 1048576 execution of the xor instruction at the offset 0x000000abf in the ELF. By looking at the offset one effectively fall into the function that plays with buffers by xoring them. Nothing fancy except that you may ask why the xor %eax, %rax and friends don’t show up? If unfiltered, these offsets will generate a lot of false positives. Typically the tool ends up counting function executions. Also the tool support multi-threaded programs because nobody wants to miss the good part if spawned into a thread, which happen quite regularly. So, now that we found them, how to hide everything painlessly?

Logical equivalents

The XOR is a logical operator or a logical gate often described as an elementary building block of a larger digital circuit. Turns out that you can reproduce the output of this elementary gate with other gates (See first chapter of The Elements of Computing Systems). So we’ll basically rewrite the “atomic” xor instruction with a combination of other “atomic” instructions. Note that the overall replacement is no more atomic. Here is the three different ways to implement a xor operation using NANDs, NORs and a mix of both:

Xor logical equivalents

Here is the (not much simplified on purpose) equivalent one have to write in C:

inline int xor1(int a, int b)
{
	return (~(a & b) & (a | b));
}

inline int xor2(int a, int b)
{
	return (~((~(~(a & b) & a)) & (~(~(a & b) & b) )));
}

inline int xor3(int a, int b)
{
	return (~((~(~a | ~b)) | (~(a | b))));
}

But there is a lot of problems with that approach:

  • It’s hella error prone.
  • If one makes some functions to only have to take care once, those function have to be templated or available for each type size.
  • It takes days to modify a big library and again, it’s error prone.
  • One have to “randomly” choose the version that replaces the current xor by yourself.
  • Still want to do it?

So it’s doable by hand for two or three occurences but if you aim at recompiling something bigger you clearly want some tools.

GCC to the rescue

Since GCC 4.5 one can write plugins to GCC. It means that you do not need to recompile your GCC or get a new one. But you surely need some headers (mine are located at /usr/lib/gcc/ARCH/VERSION/plugin/include). Plugins allow you to play with pretty much everything inside the beast while compiling. Thus we can add a new pass to GCC in order to perform some trickeries like modifying the GIMPLE representation of the code. Here we’ll register callbacks after the first SSA pass of GCC, that is before RTL generation. So we basically just tweak the SSA tree derived from the GIMPLE program. Modification is pretty straightforward but very heavy because of the following:

The compiler modifies the program representation so that every time a variable is assigned in the code, a new version of the variable is created.

First thing to do in your plugin is to register the appropriate callback, they will be called internally by GCC.

int plugin_init(struct plugin_name_args *plugin_info,
                struct plugin_gcc_version *version)  
{                                                    
        struct register_pass_info pass_info;       
        const char *plugin_name = plugin_info->base_name;

        pass_info.pass = new pass_noxor;
        pass_info.reference_pass_name = "ssa";
        pass_info.ref_pass_instance_number = 1;
        pass_info.pos_op = PASS_POS_INSERT_AFTER;
                                                 
        register_callback(plugin_name, PLUGIN_ATTRIBUTES, register_attributes,
                          NULL);                                              
        register_callback(plugin_name, PLUGIN_PASS_MANAGER_SETUP, NULL,       
                          &pass_info);                                 
        register_callback(plugin_name, PLUGIN_INFO, NULL, &noxor_info);
        parse_arg(plugin_info->argc, plugin_info->argv);               
                                                                       
        return 0;                                       
}                

Here we register three callbacks, the first one will be called by GCC while searching for registered attributes. Since one potentially wants to select which function should be hidden to let the other untouched and do not raise any suspicions we’ll register an attribute to decorate the functions we want. The callback simply put into a vector de function addresse for later check. The second one provide GCC the information to let it knows when to call your pass entry point. The last one provide help for the plugin while gcc -v is called.

Then our pass entry function will be eventually called later for each new function in the compilation unit. Since the interesting function have already been pushed into a vector one can determine easily if instrumentation is required or not. If yes the entry function check each basic block and each experssion for the flag BIT_XOR_EXPR then call obfuscate() function below:

enum xor_flavour {
	mixed,
	nand,
	nor,
	_nb_flavour
};

void (*noxor[_nb_flavour])(gimple_stmt_iterator *, tree *);

static void obfuscate(gimple_stmt_iterator *gsi, gimple *xor_stmt)
{
	tree ops[3];

	/* We'll have 3 operands:
	* it's a simple assign with a BIT_XOR_EXPR as rhs.
	* on something like D.3602 = D.3601 ^ 42 we'll have
	* op0 = D.3602	(SSA_NAME)
	* op1 = D.3601	(SSA_NAME)
	* op2 = 42	(INTEGER_CST)
	*/
	for (unsigned i = 0; i < gimple_num_ops(xor_stmt); ++i)
		ops[i] = gimple_op(xor_stmt, i);

	noxor[(int)RAND(0, _nb_flavour - 1)](gsi, ops);
#ifdef DEBUG
	++xor_removed;
#endif
}

The previous function just randomly choose a xor flavour to make things a little harder to discover. But the main part of the code resides in the function that actually do the change. The following code is the nand flavour :

static void noxor_nand(gimple_stmt_iterator *gsi, tree *ops)
{
	gimple_stmt_iterator old;
	gimple* stmt;
	tree temp[12];

	old = *gsi;

	/* T0 = a */
	temp[0] = create_tmp_var(TREE_TYPE(ops[1]), "TMP");
	stmt = gimple_build_assign(temp[0], VAR_DECL, ops[1]);
	gsi_insert_after(gsi, stmt, GSI_NEW_STMT);
	/* T1 = b */
	temp[1] = create_tmp_var(TREE_TYPE(ops[2]), "TMP");
	stmt = gimple_build_assign(temp[1], VAR_DECL, ops[2]);
	gsi_insert_after(gsi, stmt, GSI_NEW_STMT);
	/* T2 = T0 & T1 */
	temp[2] = create_tmp_var(TREE_TYPE(ops[2]), "TMP");
	stmt = gimple_build_assign(temp[2], BIT_AND_EXPR,
				   temp[1], temp[0]);
	gsi_insert_after(gsi, stmt, GSI_NEW_STMT);
	/* T3 = ~T2 */
	temp[3] = create_tmp_var(TREE_TYPE(ops[2]), "TMP");
	stmt = gimple_build_assign(temp[3], BIT_NOT_EXPR,
				   temp[2]);
	gsi_insert_after(gsi, stmt, GSI_NEW_STMT);
	/* T4 = a */
	temp[4] = create_tmp_var(TREE_TYPE(ops[1]), "TMP");
	stmt = gimple_build_assign(temp[4], VAR_DECL, ops[1]);
	gsi_insert_after(gsi, stmt, GSI_NEW_STMT);
	/* T5 = T3 & T4 */
	temp[5] = create_tmp_var(TREE_TYPE(ops[2]), "TMP");
	stmt = gimple_build_assign(temp[5], BIT_AND_EXPR,
				   temp[3], temp[4]);
	gsi_insert_after(gsi, stmt, GSI_NEW_STMT);
	/* T6 = ~T5 */
	temp[6] = create_tmp_var(TREE_TYPE(ops[2]), "TMP");
	stmt = gimple_build_assign(temp[6], BIT_NOT_EXPR,
				   temp[5]);
	gsi_insert_after(gsi, stmt, GSI_NEW_STMT);
	/* T7 = b */
	temp[7] = create_tmp_var(TREE_TYPE(ops[2]), "TMP");
	stmt = gimple_build_assign(temp[7], VAR_DECL, ops[2]);
	gsi_insert_after(gsi, stmt, GSI_NEW_STMT);
	/* T8 = T7 & T3 */
	temp[8] = create_tmp_var(TREE_TYPE(ops[2]), "TMP");
	stmt = gimple_build_assign(temp[8], BIT_AND_EXPR,
				   temp[7], temp[3]);
	gsi_insert_after(gsi, stmt, GSI_NEW_STMT);
	/* T9 = ~T8 */
	temp[9] = create_tmp_var(TREE_TYPE(ops[2]), "TMP");
	stmt = gimple_build_assign(temp[9], BIT_NOT_EXPR,
				   temp[8]);
	gsi_insert_after(gsi, stmt, GSI_NEW_STMT);
	/* T10 = T6 & T9 */
	temp[10] = create_tmp_var(TREE_TYPE(ops[2]), "TMP");
	stmt = gimple_build_assign(temp[10], BIT_AND_EXPR,
				   temp[6], temp[9]);
	gsi_insert_after(gsi, stmt, GSI_NEW_STMT);
	/* T11 = ~T10 */
	temp[11] = create_tmp_var(TREE_TYPE(ops[2]), "TMP");
	stmt = gimple_build_assign(temp[11], BIT_NOT_EXPR,
				   temp[10]);
	gsi_insert_after(gsi, stmt, GSI_NEW_STMT);
	stmt = gimple_build_assign(ops[0], VAR_DECL, temp[11]);
	gsi_insert_after(gsi, stmt, GSI_NEW_STMT);

	gsi_remove(&old, true);

#ifdef DEBUG
	puts("xor obfuscated with noxor_nand");
#endif
}

As you can see since it’s in SSA pass we have to create a new version of every variable each time they’re assigned. Don’t forget to remove the old expression at the end otherwise some GCC’s internal tree coherency checks will fail and compilation will abort. So after all, how does it looks when using it?

$ gcc -fplugin=./plugin_noxor.so -fplugin-arg-plugin_noxor-all Md5.c -o Md5_noxor
xor obfuscated with noxor_nand                                                                                           
xor obfuscated with noxor_nand
xor obfuscated with noxor_mixed
xor obfuscated with noxor_nor
...
xor obfuscated with noxor_mixed
xor obfuscated with noxor_nor
xor obfuscated with noxor_mixed
xor obfuscated with noxor_mixed
xor obfuscated with noxor_mixed
xor obfuscated with noxor_nand
48 xor obfuscated in function at Md5.c:232
$

Pretty easy. Note that a plugin argument (-fplugin-arg-plugin_noxor-all) is used here to enforce obfuscation of all xor in all function if found. So that it’s super easy to recompile entire libraries.

So here is the first two xor of the function static void Transform (buf, in) in the previous linked Md5.c.

  401751:       48 01 45 f0             add    %rax,-0x10(%rbp)
  401755:       48 8b 45 f0             mov    -0x10(%rbp),%rax
  401759:       48 33 45 e8             xor    -0x18(%rbp),%rax
  40175d:       48 33 45 e0             xor    -0x20(%rbp),%rax
  401761:       48 89 c2                mov    %rax,%rdx
  401764:       48 8b 45 d0             mov    -0x30(%rbp),%rax

When the obfuscation pass is applied, we get:

  401751:       48 01 45 f0             add    %rax,-0x10(%rbp)
  401755:       48 8b 45 f0             mov    -0x10(%rbp),%rax
  401759:       48 8b 55 e8             mov    -0x18(%rbp),%rdx
  40175d:       48 21 d0                and    %rdx,%rax
  401760:       48 f7 d0                not    %rax
  401763:       48 8b 55 f0             mov    -0x10(%rbp),%rdx
  401767:       48 21 c2                and    %rax,%rdx
  40176a:       48 89 d1                mov    %rdx,%rcx
  40176d:       48 f7 d1                not    %rcx
  401770:       48 8b 55 e8             mov    -0x18(%rbp),%rdx
  401774:       48 21 d0                and    %rdx,%rax
  401777:       48 f7 d0                not    %rax
  40177a:       48 21 c8                and    %rcx,%rax
  40177d:       48 f7 d0                not    %rax
  401780:       48 89 c2                mov    %rax,%rdx
  401783:       48 8b 45 e0             mov    -0x20(%rbp),%rax
  401787:       48 21 d0                and    %rdx,%rax
  40178a:       48 f7 d0                not    %rax
  40178d:       48 21 c2                and    %rax,%rdx
  401790:       48 89 d1                mov    %rdx,%rcx
  401793:       48 f7 d1                not    %rcx
  401796:       48 8b 55 e0             mov    -0x20(%rbp),%rdx
  40179a:       48 21 d0                and    %rdx,%rax
  40179d:       48 f7 d0                not    %rax
  4017a0:       48 21 c8                and    %rcx,%rax
  4017a3:       48 f7 d0                not    %rax
  4017a6:       48 89 c2                mov    %rax,%rdx
  4017a9:       48 8b 45 d0             mov    -0x30(%rbp),%rax

Here the two xor have been randomly replaced by the noxor_nand logical equivalent (cf the compilation output). However, you may ask yourself what about the benchs? When you do this kind of things this is clearly not your biggest problem, but for completeness here is a perf comparaison that shows a little drop in throughput of instrumented functions:

$ dd if=/dev/urandom of=md5me bs=4M count=100
100+0 records in
100+0 records out
419430400 bytes (419 MB, 400 MiB) copied, 19.6668 s, 21.3 MB/s

$ perf stat -B cat md5me| ./md5

 Performance counter stats for 'cat md5me':

        256.924382      task-clock (msec)         #    0.054 CPUs utilized          
           102,358      context-switches          #    0.398 M/sec                  
               339      cpu-migrations            #    0.001 M/sec                  
                88      page-faults               #    0.343 K/sec                  
       644,362,513      cycles                    #    2.508 GHz                    
       423,174,110      stalled-cycles-frontend   #   65.67% frontend cycles idle   
   <not supported>      stalled-cycles-backend   
       417,658,029      instructions              #    0.65  insns per cycle        
                                                  #    1.01  stalled cycles per insn
        81,936,737      branches                  #  318.914 M/sec                  
           848,974      branch-misses             #    1.04% of all branches        

       4.722898110 seconds time elapsed

ef2dc535da0d8424feea9ff03001d982

$ perf stat -B cat md5me| ./md5_noxor

 Performance counter stats for 'cat md5me':

        258.991388      task-clock (msec)         #    0.051 CPUs utilized          
           102,356      context-switches          #    0.395 M/sec                  
               437      cpu-migrations            #    0.002 M/sec                  
                90      page-faults               #    0.348 K/sec                  
       651,951,486      cycles                    #    2.517 GHz                    
       428,197,285      stalled-cycles-frontend   #   65.68% frontend cycles idle   
   <not supported>      stalled-cycles-backend   
       418,076,927      instructions              #    0.64  insns per cycle        
                                                  #    1.02  stalled cycles per insn
        82,013,652      branches                  #  316.666 M/sec                  
           894,521      branch-misses             #    1.09% of all branches        

       5.060046020 seconds time elapsed

ef2dc535da0d8424feea9ff03001d982

Before I forget, let’s run the pintool to prove (again) that everything works:

$ cat md5me | pin -t cryptseek.so -- md5_noxor
ef2dc535da0d8424feea9ff03001d982
$ cat xor_count.map
Total number of threads = 1
Thread [0]

Et voilà!

Conclusion

It obviously works. But some work is still required to avoid obvious solutions to counter this small obfuscation layer. One might wants to add random junk instructions in the middle of the xor logical replacement in order to avoid counterpart to build signature easily. Another 5 minutes fix would be to avoid replacing the xor when both rhs operands are the same like we did in the pin tool. Nevertheless I had a lot of fun doing these tools, I’ll probably modify this PoC with the previous observation and potentially write a part 2 to mess with signature based algorithm identification tools.

Thanks for reading, stay tuned!

References