Monday 22 September 2014

Assembly Generated from Function Calls on x86-64

Two weeks ago in SPO600 we were given a task: compile a hello world C program, look at the Assembly code that gets generated then modify the code in small ways and notice how the Assembly code changes.

A second and separate task we were given was to learn about some feature of Assembly, teach it to the other students in the class in the form of a short presentation, and blog about what we discovered. I chose to investigate what happens when a function gets called in C, in x86-64 Assembly, and in particular what happens to the arguments passed into the function.

These two tasks are two separate labs, but since they are similar (and I'm lazy :P), I will combine them into a single blog entry.

WARNING/DISCLAIMER! What follows is a combination of personal research from reading materials I found on the web, trial and error with compiling and ojbdump-ing, and at times, wild speculation based on what I'm observing. Don't trust anything I say as authoritative!

A function that only references local variables and arguments is a standalone entity. At compile-time, it has no knowledge of where the arguments came from, or what values they should have. Therefore, when a function begins execution, it must look elsewhere to obtain the value of its arguments, as they are not defined within the function itself. Functions need to make assumptions about where to look for arguments, where to look for return values, and what stuff remains the same after a different function gets called and then returns. For a given computer system, such a set of rules governing the placement of arguments, return values, and other expected behavior, is called the "calling conventions" for that specific system.

A stack frame (I've also seen this called an "activation record") of a function at some moment in time, is the region of memory where the function stores local variables, its arguments, and information needed to restore the state of the caller upon returning. I gave the definition with respect to "some moment in time", because according to my understanding, it is possible for a stack frame to grow and shrink throughout the duration of the function's execution.

The way I, and most other stuff I read, visualize memory made the following assumptions about orientation. This is important to clarify, so that when I write about one location being 'above' or 'below' another, or about the direction of memory growth, your mental picture is the same thing as mine is. Throughout the rest of this post I will assume that:

1. Higher memory addresses are visualized as being above lower memory addresses.

2. The stack frame is a stack data structure, that grows downward, towards lower memory addresses.

This implies that if I have two local variables, X and Y, and Y was declared after X, then Y will have an address that is less than X.

I read some basic tutorials, and watched some videos about what happens when a function gets called. Typically, most explanations I saw told a story about what 'would' happen in an 'ideal case', but the tedious details of what actually happens is very specific to an instruction set architecture and an operating system. The 'ideal case' scenario would go something like this:

A function begins execution. There is a register that holds what is called the stack pointer (SP) and base pointer (BP) of the functions activations record. Above the base pointer is the address that would have been held by the program counter (PC), had the function not been called. This is where the function will return to upon returning, by loading this value back into the PC. Assuming that the size of pointer types is 8, then just above these 8 bytes, should be the arguments of the function. How much space each argument takes, is inferred from its type. So, for example, if my function is passes an int, int, and a long double, in that order, and assuming that the size of int and long double is 4 and 10 respectively, then each of these arguments can be addressed by BP + 8, BP + 12, and BP + 16 respectively. Remember! We are adding 8 to account for the saved PC of the function that called us! How did those values get there? It was the responsibility of the function that called our function to put them there, as well as set the SP to point to the right place. So, suppose we wanted to call another function, it would be OUR responsibility to decrement the SP by enough to store the values of the arguments to the function, and put the right values in there. When ever a local variable is declared, the stack pointer moves down as many bytes as is needed to make room for that local variable. So, with the numbers we've been using so far, that would be 4 bytes for an int declaration, 8 for a pointer declaration, and 10 for a double declaration.

But Alas! Things are not really this simple on x86-64 and most 'real' architectures, mostly because we can optimize on this behavior, and because of alignment issues.

1. First, we must manually store the value of the callers BP, by pushing it onto the stack. Then we must set our BP to be equal to our SP which was decremented by the caller.

2. Compilers are smart enough not to have to move the SP for every declaration and function call, but can move it by the right size just once at the beginning of the function. So, if I declare four ints, and then call a function that accepts two ints, then (ignoring alignment) the stack pointer would be decremented by 24 byes at the very beginning.

3. The stack pointer must always point to an address that is a multiple of 16. Also, the compiler may allocate more memory than you would expect to improve efficiency by aligning some variables in such a way as to waste memory but improve speed.

4. Perhaps most relevant to the code below, the caller will, whenever possible, pass the values of the arguments to a called function using registers directly as opposed to pushing them onto the memory stack. The called function can infer whether to look in the registers or above the base pointer for a particular argument from its type.

So let's start compiling functions to see what actually happens! :D First, let's compile 6 simple functions which are exactly the same except except for the types of the arguments and return value:


 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
int i(int a, int b, int c, int d, int e, int f)
{
  return (a + b + c) * (d + e + f);
}

char c(char a, char b, char c, char d, char e, char f)
{
  return (a + b + c) * (d + e + f);
}

long long ll(long long a, long long b, long long c, long long d, long long e, long long f)
{
  return (a + b + c) * (d + e + f);
}

float f(float a, float b, float c, float d, float e, float f)
{
  return (a + b + c) * (d + e + f);
}

double d(double a, double b, double c, double d, double e, double f)
{
  return (a + b + c) * (d + e + f);
}

long double ld(long double a, long double b, long double c, long double d, long double e, long double f)
{
  return (a + b + c) * (d + e + f);
}

Let's take a look at the Assembly output. I turned on some basic optimization ("-O1" flag) because it makes the calling convention more readily transparent. For example, I noticed that without optimization, the compiler would 'always' store the arguments from their registers onto its stack frame, even if it was not necessary. The Assembly output:


 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
ex1.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <i>:
   0:   01 f7                   add    %esi,%edi
   2:   01 fa                   add    %edi,%edx
   4:   41 01 c8                add    %ecx,%r8d
   7:   45 01 c1                add    %r8d,%r9d
   a:   89 d0                   mov    %edx,%eax
   c:   41 0f af c1             imul   %r9d,%eax
  10:   c3                      retq

0000000000000011 <c>:
  11:   01 f7                   add    %esi,%edi
  13:   01 fa                   add    %edi,%edx
  15:   41 01 c9                add    %ecx,%r9d
  18:   45 01 c8                add    %r9d,%r8d
  1b:   44 89 c0                mov    %r8d,%eax
  1e:   0f af c2                imul   %edx,%eax
  21:   c3                      retq

0000000000000022 <ll>:
  22:   48 01 f7                add    %rsi,%rdi
  25:   48 01 fa                add    %rdi,%rdx
  28:   49 01 c8                add    %rcx,%r8
  2b:   4d 01 c1                add    %r8,%r9
  2e:   48 89 d0                mov    %rdx,%rax
  31:   49 0f af c1             imul   %r9,%rax
  35:   c3                      retq

0000000000000036 <f>:
  36:   f3 0f 58 c8             addss  %xmm0,%xmm1
  3a:   f3 0f 58 d1             addss  %xmm1,%xmm2
  3e:   f3 0f 58 e3             addss  %xmm3,%xmm4
  42:   f3 0f 58 ec             addss  %xmm4,%xmm5
  46:   f3 0f 59 d5             mulss  %xmm5,%xmm2
  4a:   0f 28 c2                movaps %xmm2,%xmm0
  4d:   c3                      retq

000000000000004e <d>:
  4e:   f2 0f 58 c8             addsd  %xmm0,%xmm1
  52:   f2 0f 58 d1             addsd  %xmm1,%xmm2
  56:   f2 0f 58 e3             addsd  %xmm3,%xmm4
  5a:   f2 0f 58 ec             addsd  %xmm4,%xmm5
  5e:   f2 0f 59 d5             mulsd  %xmm5,%xmm2
  62:   66 0f 28 c2             movapd %xmm2,%xmm0
  66:   c3                      retq

0000000000000067 <ld>:
  67:   db 6c 24 18             fldt   0x18(%rsp)
  6b:   db 6c 24 08             fldt   0x8(%rsp)
  6f:   de c1                   faddp  %st,%st(1)
  71:   db 6c 24 28             fldt   0x28(%rsp)
  75:   de c1                   faddp  %st,%st(1)
  77:   db 6c 24 48             fldt   0x48(%rsp)
  7b:   db 6c 24 38             fldt   0x38(%rsp)
  7f:   de c1                   faddp  %st,%st(1)
  81:   db 6c 24 58             fldt   0x58(%rsp)
  85:   de c1                   faddp  %st,%st(1)
  87:   de c9                   fmulp  %st,%st(1)
  89:   c3                      retq
                                                                                                                                                        64,35-39      Bot

The important thing to notice is that the integer and floating point arguments are put into particular registers consistently. The order is always the same. For integer types, it's %rdi, %rsi, %rdx, %rcx, %r8, %r9. For floats and doubles, it's %xmm0, %xmm1, ... %xmm7. The return value is always stored in the 'A' register for integer types, and the %xmm0 register for floats and doubles,

However, for long doubles, I am a little bit confused by what I am seeing (maybe someone who understands better can chime in?). After reading the calling convention portion of the System V ABI for x86-64, I assumed that long double arguments should be pushed onto the FPU stack, if they can fit into those registers. On my system they can: sizeof(long double) == 10, CHAR_BIT == 8, and the FPU stack registers are 80 bits wide. Instead, what I am seeing is the long double being put 16 bytes above the base pointer. (The 16 points is where the saved program counter, and caller's base pointer are stored). Perhaps long doubles must padded to be 16 bytes? But then why is the return value pushed onto the %st register (top of the FPU stack)? Weird...

In any case, there were four interesting cases that came to mind:

1. There are arguments of different types in different combinations.

2: There are lots of arguments. Specifically, when there are more arguments than there are registers of the appropriate type to store them.

3. The size of the type of some of the arguments or the return value is too wide to fit into registers (a structure type with many fields, for example).

4. When the function accepts a variable number of arguments.

I will write about the fourth case, functions of a variable number of arguments, in a separate blog entry.

Let's start with the case when there are arguments of different types:


1
2
3
4
float diff_arg_types(int i, char c, long long ll, float f, double d, long double ld, int x, int y, int z)
{
  return (i + c + ll + x + y + x) * (f + d + ld);
}

This function produces the following assembly (this time, with no optimizations, since I want to be very explicit about which registers correspond to which arguments):


 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
0000000000000000 <diff_arg_types>:
   0:   55                      push   %rbp
   1:   48 89 e5                mov    %rsp,%rbp
   4:   89 7d fc                mov    %edi,-0x4(%rbp)
   7:   89 f0                   mov    %esi,%eax
   9:   48 89 55 f0             mov    %rdx,-0x10(%rbp)
   d:   f3 0f 11 45 ec          movss  %xmm0,-0x14(%rbp)
  12:   f2 0f 11 4d e0          movsd  %xmm1,-0x20(%rbp)
  17:   89 4d e8                mov    %ecx,-0x18(%rbp)
  1a:   44 89 45 dc             mov    %r8d,-0x24(%rbp)
  1e:   44 89 4d d8             mov    %r9d,-0x28(%rbp)
  22:   88 45 f8                mov    %al,-0x8(%rbp)
  25:   0f be 55 f8             movsbl -0x8(%rbp),%edx
  29:   8b 45 fc                mov    -0x4(%rbp),%eax
  2c:   01 d0                   add    %edx,%eax
  2e:   48 63 d0                movslq %eax,%rdx
  31:   48 8b 45 f0             mov    -0x10(%rbp),%rax
  35:   48 01 c2                add    %rax,%rdx
  38:   8b 45 e8                mov    -0x18(%rbp),%eax
  3b:   48 98                   cltq
  3d:   48 01 c2                add    %rax,%rdx
  40:   8b 45 dc                mov    -0x24(%rbp),%eax
  43:   48 98                   cltq
  45:   48 01 c2                add    %rax,%rdx
  48:   8b 45 e8                mov    -0x18(%rbp),%eax
  4b:   48 98                   cltq
  4d:   48 01 d0                add    %rdx,%rax
  50:   48 89 45 c8             mov    %rax,-0x38(%rbp)
  54:   df 6d c8                fildll -0x38(%rbp)
  57:   f3 0f 10 45 ec          movss  -0x14(%rbp),%xmm0
  5c:   0f 5a c0                cvtps2pd %xmm0,%xmm0
  5f:   f2 0f 58 45 e0          addsd  -0x20(%rbp),%xmm0
  64:   f2 0f 11 45 c0          movsd  %xmm0,-0x40(%rbp)
  69:   dd 45 c0                fldl   -0x40(%rbp)
  6c:   db 6d 10                fldt   0x10(%rbp)
  6f:   de c1                   faddp  %st,%st(1)
  71:   de c9                   fmulp  %st,%st(1)
  73:   d9 5d d4                fstps  -0x2c(%rbp)
  76:   f3 0f 10 45 d4          movss  -0x2c(%rbp),%xmm0
  7b:   f3 0f 11 45 c0          movss  %xmm0,-0x40(%rbp)
  80:   8b 45 c0                mov    -0x40(%rbp),%eax
  83:   89 45 c0                mov    %eax,-0x40(%rbp)
  86:   f3 0f 10 45 c0          movss  -0x40(%rbp),%xmm0
  8b:   5d                      pop    %rbp
  8c:   c3                      retq
                                                                                                                                                        51,1          Bot

There's a lot of stuff here, but we don't care about most of it at the moment. First look at lines 4 through twelve. It looks like the compiler is just using the next avaialble register for that type! For examples, it uses the integer registers until it hits a float and a double. So it puts those arguments in %xmm0 and %xmm1, and continues to put the final three int arguments into registers %rcx, %r8, and %r9. And the long double gets put 16 bytes above the base pointer, since on line 35 we see that locations value being pushed onto the FPU stack.

Now! Let's see what will happen if we pass in more arguments than there are registers to store those arguments. My C code:


 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
int i(int a, int b, int c, int d, int e, int f, int g, int h, int i, int j)
{
  return (a + b + c + d + e) * (f + g + h + i + j);
}

char c(char a, char b, char c, char d, char e, char f, char g, char h, char i, char j)
{
  return (a + b + c + d + e) * (f + g + h + i + j);
}

double d(double a, double b, double c, double d, double e, double f, double g, double h, double i, double j)
{
  return (a + b + c + d + e) * (f + g + h + i + j);
}

The Assembly (this time with optimizations turned on again):


 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
0000000000000000 <i>:
   0:   01 f7                   add    %esi,%edi
   2:   01 fa                   add    %edi,%edx
   4:   01 d1                   add    %edx,%ecx
   6:   41 01 c8                add    %ecx,%r8d
   9:   44 03 4c 24 08          add    0x8(%rsp),%r9d
   e:   44 89 c8                mov    %r9d,%eax
  11:   03 44 24 10             add    0x10(%rsp),%eax
  15:   03 44 24 18             add    0x18(%rsp),%eax
  19:   03 44 24 20             add    0x20(%rsp),%eax
  1d:   41 0f af c0             imul   %r8d,%eax
  21:   c3                      retq

0000000000000022 <c>:
  22:   41 01 f8                add    %edi,%r8d
  25:   44 01 c6                add    %r8d,%esi
  28:   01 f2                   add    %esi,%edx
  2a:   01 d1                   add    %edx,%ecx
  2c:   44 02 4c 24 20          add    0x20(%rsp),%r9b
  31:   44 89 c8                mov    %r9d,%eax
  34:   02 44 24 08             add    0x8(%rsp),%al
  38:   02 44 24 10             add    0x10(%rsp),%al
  3c:   02 44 24 18             add    0x18(%rsp),%al
  40:   0f af c1                imul   %ecx,%eax
  43:   c3                      retq

0000000000000044 <d>:
  44:   f2 0f 58 c8             addsd  %xmm0,%xmm1
  48:   f2 0f 58 d1             addsd  %xmm1,%xmm2
  4c:   f2 0f 58 da             addsd  %xmm2,%xmm3
  50:   f2 0f 58 e3             addsd  %xmm3,%xmm4
  54:   f2 0f 58 f5             addsd  %xmm5,%xmm6
  58:   f2 0f 58 fe             addsd  %xmm6,%xmm7
  5c:   f2 0f 58 7c 24 08       addsd  0x8(%rsp),%xmm7
  62:   66 0f 28 ef             movapd %xmm7,%xmm5
  66:   f2 0f 58 6c 24 10       addsd  0x10(%rsp),%xmm5
  6c:   f2 0f 59 e5             mulsd  %xmm5,%xmm4
  70:   66 0f 28 c4             movapd %xmm4,%xmm0
  74:   c3                      retq

Here we can see that the compiler uses as many registers as it can, and when it runs out, it starts to place the arguments starting above the base pointer of the callee function. Also note that all the arguments smaller than 8 bytes get aligned to exactly 8 bytes. So in function 'c' for example, where all the arguments are characters, the seventh, eighth, ninth, and tenth argument gets stored at 0x8, 0x10, 0x 18, and 0x20  above the stack pointer, respectively. These are eight byte chunks. (Note: With optimization turned on, the function is not pushing %rbp onto the stack and assigning a new value to it, so it reaches 8 bytes above the STACK POINTER, and NOT 16 bytes above the BASE POINTER as in example 2. I apologize for the confusion).

Similarly, with the double arguments, the first eight are stored in %xmm0 - %xmm7, and the last two are stored at %rsp + 0x8 and %rsp + 0x10.

Now, the last case of interest is when we pass to or return from  the function, values that are too wide for registers:


 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
typedef struct {
  char c;
  int i;
  long long ll;
  float f;
  double d;
  long double ld;
} big_struct;

big_struct fun(big_struct b1, big_struct b2)
{
  big_struct b1b2 = {.c = b1.c + b2.c,
                     .i = b1.i + b2.i,
                     .ll = b1.ll + b2.ll,
                     .f = b1.f + b2.f,
                     .d = b1.d + b2.d,
                     .ld = b2.ld + b2.ld };
  return b1b2;
}

And the Assembly:


 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
0000000000000000 <fun>:
   0:   48 89 f8                mov    %rdi,%rax
   3:   8b 4c 24 0c             mov    0xc(%rsp),%ecx
   7:   03 4c 24 3c             add    0x3c(%rsp),%ecx
   b:   48 8b 54 24 10          mov    0x10(%rsp),%rdx
  10:   48 03 54 24 40          add    0x40(%rsp),%rdx
  15:   f3 0f 10 4c 24 18       movss  0x18(%rsp),%xmm1
  1b:   f3 0f 58 4c 24 48       addss  0x48(%rsp),%xmm1
  21:   f2 0f 10 44 24 20       movsd  0x20(%rsp),%xmm0
  27:   f2 0f 58 44 24 50       addsd  0x50(%rsp),%xmm0
  2d:   db 6c 24 58             fldt   0x58(%rsp)
  31:   d8 c0                   fadd   %st(0),%st
  33:   0f b6 74 24 38          movzbl 0x38(%rsp),%esi
  38:   40 02 74 24 08          add    0x8(%rsp),%sil
  3d:   40 88 37                mov    %sil,(%rdi)
  40:   89 4f 04                mov    %ecx,0x4(%rdi)
  43:   48 89 57 08             mov    %rdx,0x8(%rdi)
  47:   f3 0f 11 4f 10          movss  %xmm1,0x10(%rdi)
  4c:   f2 0f 11 47 18          movsd  %xmm0,0x18(%rdi)
  51:   db 7f 20                fstpt  0x20(%rdi)
  54:   c3                      retq

From this we can infer that the two structs are laid out on top of each other, above the stack pointer. Each field from the struct is added to its corresponding field in the other struct, and stored in a register. For example: 0xc +  %rsp is added to 0x3c + %rsp and stored in %ecx, 0x10 + %rsp is added to 0x40 + %rsp and stored in %rdx, and so on. What's interesting is how the struct is returned. The calling function is expected to put into the register %rdi, the base address of a memory location in which the caller is supposed to store the resturn value. Thus, from lines 16 through 20 we see the values in the registers where the results of our previous calculations were put, being stored at an offset from the address in %rdi.

I felt I learnt a lot from investigating the X86-64 calling conventions on my machine. However, I now have more questions than when I started :) Many of which can probably be answered by a combination of further experimentation and reading documentation and standards, but alas, this is a topic for another blog post! The question at the front of my mind at the moment are:

1. Why aren't arguments of type long double passed through the FPU register stack?

2. What happens if the size of the struct, and the types of the fields are changed? Are structs ever passed in registers?

3. Tricky alignment questions (really, I just want a set of explicit alignment rules).

4. In the last example, I am having trouble understanding lines 14 and 15. I know from reading parts of the ABI standard that the address of where a struct is to be put is stored in %rdi. But here, it looks like %rdi is being manipulated in some way. Also, the first field of the first struct begins at 0xc bytes above above the stack pointer. But here it looks like the computer is grabbing data at 0x8 bytes above the stack pointer? But this leaves only 4 bytes of meaningful data between 0x8 and 0xc. What is this data and what the heck does it have to do with %rdi (the address of where to store the return value)?