The Beginning: Assembly

Prerequisites: a little bit of programming experience

To begin our journey into reverse engineering super cool malware, there are a few foundational things that we need to cover first. I’m going to go into this with the assumption that you’re able to program at least a little bit. If not, there are plenty of resources to get started: I’d recommend Python for ease of use and C for relevance to reverse engineering, and a potentially good place to start is here.

The aim of this blog post is to get rid of any fear of assembly you might have. It’s intimidating to learn, but you’ll soon realise that it’s fundamentally simple.

When you write code, it’s typically either compiled or interpreted. You might not understand or even care about the intricacies of how compilers and interpreters work (but if you do, I have a talk about it), but fundamentally what they do is turn your code from a (relatively) human-readable format into a machine-readable one. A CPU can’t execute source code directly, which is why your source code often gets turned into a “binary” (e.g. a Portable Executable (.exe) on Windows or an ELF binary on Linux). The binary contains the machine code that the compiler produces when it translates your source code.

You might be wondering where assembly comes into this. The neat thing about assembly is that it has a (very nearly) one-to-one correspondence with machine code. This means it’s trivial to convert machine code into assembly, and assembly, whilst not very readable, is at least human readable. Fundamentally, reverse engineering is about understanding what a particular piece of software does, be that malware or something else. Assembly is an incredible tool for gaining that understanding, so it’s something we often leverage and therefore need to understand.
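
To make that correspondence concrete, here is a minimal sketch of a toy disassembler in Python. The names disassemble and REG32 are made up for this illustration, and it only understands two x86-64 encodings: B8+r followed by a 4-byte immediate (mov into a 32-bit register) and 0F 05 (syscall). A real disassembler handles far more than this, so treat it as an illustration of the idea rather than a usable tool.

```python
# Toy disassembler: understands only two x86-64 encodings, for illustration.
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def disassemble(code: bytes) -> list[str]:
    """Translate machine code bytes into assembly text, one instruction at a time."""
    out = []
    i = 0
    while i < len(code):
        op = code[i]
        if 0xB8 <= op <= 0xBF:
            # B8+r imm32: mov r32, imm32 (the immediate is stored little-endian)
            if i + 5 > len(code):
                raise ValueError(f"truncated instruction at offset {i}")
            imm = int.from_bytes(code[i + 1:i + 5], "little")
            out.append(f"mov {REG32[op - 0xB8]}, {imm}")
            i += 5
        elif code[i:i + 2] == b"\x0f\x05":
            out.append("syscall")
            i += 2
        else:
            raise ValueError(f"unknown opcode {op:#04x} at offset {i}")
    return out

machine_code = bytes.fromhex("b801000000" "bf01000000" "0f05")
for line in disassemble(machine_code):
    print(line)
```

Running this prints mov eax, 1, then mov edi, 1, then syscall. In 64-bit mode a write to eax also clears the top half of rax, so mov eax, 1 leaves rax holding 1.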

So to address the problem itself, what is assembly?

Assembly represents instructions for the CPU; for example, add, sub and mul represent arithmetic operations. Assembly instructions are fundamentally quite simple: you can think of each one as describing one “thing” the CPU can do, and those things don’t get very complicated.
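
As a rough illustration of how small each “thing” is, the sketch below models a CPU as nothing more than a dictionary of registers. The step and run helpers are hypothetical names for this post, and the toy only supports add and sub with a constant as the second operand, which is a big simplification of what a real CPU accepts.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF  # registers in this toy are 64 bits wide

def step(regs: dict, instruction: str) -> None:
    """Apply one 'op dest, constant' instruction to the register dict in place."""
    op, operands = instruction.split(maxsplit=1)
    dest, src = [part.strip() for part in operands.split(",")]
    value = int(src)
    if op == "add":
        regs[dest] = (regs[dest] + value) & MASK64
    elif op == "sub":
        regs[dest] = (regs[dest] - value) & MASK64
    else:
        raise ValueError(f"unsupported instruction: {op}")

def run(program: list[str], regs: dict) -> dict:
    """Execute the instructions in order, just like a CPU walking through code."""
    for instruction in program:
        step(regs, instruction)
    return regs

print(run(["add rax, 10", "sub rax, 3"], {"rax": 5}))  # {'rax': 12}
```

Each instruction changes one register by one simple operation, and a program is just a list of those steps run in order.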

Let’s take hello world as an example:

section .data
    msg db "Hello world!", 0ah

section .text
    global _start

_start:
    mov rax, 1
    mov rdi, 1
    mov rsi, msg
    mov rdx, 13
    syscall

    mov rax, 60
    mov rdi, 0
    syscall

This may look super complex now, but when we break it down it’s quite simple.

Firstly, there are the two sections at the top. We’ll delve deeper into how memory segments work in a later post, but for now all you need to know is that your program gets split into various blocks: .text is where code is held and .data is where (intuitively) data is held.

The line msg db "Hello world!", 0ah gives many hints as to what we’re looking at. msg is a name that we would define ourselves if we wrote this assembly; all it does is give a name to this piece of data within the data segment. db stands for “define byte”, which places the bytes that follow into the program. The data itself is “Hello world!” of course, and the 0ah is hexadecimal A, or decimal 10, which in ASCII denotes a line feed (or newline). So in effect this line says “msg” in .data refers to the bytes “Hello world!\n”.
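
If you want to check those numbers for yourself, a few lines of Python will do it. This is just a sanity check of the byte values, and the variable name msg is borrowed from the assembly for convenience.

```python
# The bytes that `msg db "Hello world!", 0ah` describes.
msg = b"Hello world!" + bytes([0x0A])

print(msg)                      # b'Hello world!\n'
print(len(msg))                 # 13
print(0x0A, 0x0A == ord("\n"))  # 10 True
```

The length of 13 (12 characters plus the newline) is the same 13 that gets moved into rdx in the code.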

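The next bit, section .text followed by global _start, says that the .text segment contains the symbol _start and makes it visible to the linker, which by default uses it as the point where the program starts running. The code itself is then defined after the _start: label.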

Finally we get to the meat of the code.

mov rax, 1
mov rdi, 1
mov rsi, msg
mov rdx, 13
syscall

So here we have to introduce the concept of registers. Registers are little bits of memory inside the CPU. This means they’re very fast to access but equally have to be very small. There are several general purpose and specific purpose registers, and there’ll be another post going more in depth about how registers work, but all you need to know here is that rax, rdx, rdi and rsi are all general purpose registers, with rdi and rsi conventionally used to pass the first two parameters. mov copies a value from one place to another and the syntax for it is mov dest, src, so mov rax, 1 means put 1 into the register rax.
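
Here is a small sketch of what mov does, again treating the registers as a Python dictionary. The mov function and MASK64 constant are hypothetical helpers for this post, and the sketch ignores memory operands entirely.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF  # each register holds 64 bits

def mov(regs: dict, dest: str, src) -> None:
    """Copy src into dest. src is either a register name or an integer constant."""
    value = regs[src] if isinstance(src, str) else src
    regs[dest] = value & MASK64

regs = {"rax": 0, "rdi": 0, "rsi": 0, "rdx": 0}
mov(regs, "rax", 1)      # mov rax, 1
mov(regs, "rdx", 13)     # mov rdx, 13
mov(regs, "rdi", "rax")  # mov rdi, rax: rax still holds 1 afterwards
print(regs)              # {'rax': 1, 'rdi': 1, 'rsi': 0, 'rdx': 13}
```

Despite the name, nothing is “moved” out of the source: the value is copied and the source keeps it.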

So why are we moving seemingly random values into registers? Well, at the end we’re using the syscall instruction, which hands control to the operating system and asks it to run a particular system function (a “system call”) based on the register values. The values we put in the registers are effectively the parameters to the syscall.

On 64-bit Linux, rax specifies which syscall we’re invoking; in this case 1 corresponds to write, which writes data to a file or stream. rdi specifies the stream to be written to, and 1 corresponds to stdout, so the text ends up on screen. rsi holds a pointer to the data to be written, which in this case is the address of msg, and rdx holds the amount of data to be written, which in this case is 13 bytes (the 12 characters of “Hello world!” plus the newline).
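
For comparison, here is roughly the same request made from Python. os.write is a thin wrapper around the operating system’s write call, so you hand it the same pieces of information: which stream to write to, and the data (Python works out the length for you). The write_message helper is a made-up name for this sketch.

```python
import os

MSG = b"Hello world!\n"

def write_message(fd: int, data: bytes) -> int:
    """Ask the operating system to write data to fd; returns the number of bytes written."""
    return os.write(fd, data)

if __name__ == "__main__":
    written = write_message(1, MSG)  # 1 is stdout, just like in the assembly
    print(written)                   # 13
```

The return value is the number of bytes the operating system actually wrote, which for a short message like this is normally all 13.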

mov rax, 60
mov rdi, 0
syscall

Finally, the last bit of the assembly. Now that the rest has been explained this is quite simple: all that’s going on here is that the exit syscall (number 60) is being invoked with an exit status of 0, which tells the operating system that the program has finished execution successfully.
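
The 0 placed in rdi here is the exit status, which is reported back to whatever launched the program. As a small sketch of that idea, the Python below starts a child process that exits with a chosen status and reads the status back from the parent’s side. run_and_get_status is a hypothetical helper name, and the child here is another Python interpreter rather than our assembly program.

```python
import subprocess
import sys

def run_and_get_status(status: int) -> int:
    """Start a child Python process that exits with `status`; return what the parent sees."""
    child = subprocess.run([sys.executable, "-c", f"import sys; sys.exit({status})"])
    return child.returncode

print(run_and_get_status(0))  # 0
print(run_and_get_status(3))  # 3
```

By convention a status of 0 means success and anything else signals some kind of failure.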

Finally, it’s worth noting that the syntax used here is that of NASM, an assembler that’s commonly used for writing assembly by hand, and it isn’t exactly what you’ll get out of reverse engineering tools. NASM was used here as its syntax is often easier to understand, but the Intel and AT&T syntaxes you’ll commonly see from disassembly tools are quite similar (although AT&T writes the source operand before the destination) and the same concepts will hold.
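
To give a feel for how small the difference is, here is a toy converter from the dest, src style used in this post to AT&T style. The to_att function and the REGISTERS set are made up for this sketch, and it only copes with registers, decimal constants and names like msg; anything involving memory operands is deliberately rejected.

```python
REGISTERS = {"rax", "rbx", "rcx", "rdx", "rsi", "rdi", "rsp", "rbp"}

def to_att(line: str) -> str:
    """Rewrite 'op dest, src' in AT&T style: source first, % on registers, $ on constants."""
    parts = line.split(maxsplit=1)
    if len(parts) == 1:
        return parts[0]  # no operands, e.g. syscall
    op, operands = parts
    if "[" in operands:
        raise ValueError("memory operands are not handled in this sketch")
    dest, src = [part.strip() for part in operands.split(",")]

    def fmt(operand: str) -> str:
        return f"%{operand}" if operand in REGISTERS else f"${operand}"

    return f"{op} {fmt(src)}, {fmt(dest)}"

for line in ["mov rax, 1", "mov rsi, msg", "syscall"]:
    print(to_att(line))
```

This prints mov $1, %rax, then mov $msg, %rsi, then syscall: the same instructions, just written with the operands the other way around and some extra punctuation.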

In summary, the point of this post is to demonstrate that whilst things often look scary in assembly, they’re much simpler than they seem. In terms of going further with the topic, I’d recommend playing around with NASM: there’s an assembler available to download on the official site, but you can also find various online ones too. I’d also recommend games like Exapunks and Shenzhen I/O, as they do an incredible job of gamifying assembly.

In the next post I’ll dive a little deeper into some of the concepts I glossed over here, such as memory segments and registers.