Adam's Blog

Tuesday, 20 January 2015

Introduction to SSL and Public Key Cryptography

This week for work, I was asked by my boss to set up SSL for a server that hosts an older version of an internal Seneca web application built around BigBlueButton. I didn't understand very much about how SSL worked, so I decided to look into it a little bit to better understand what I was actually doing. I knew it had something to do with security, and I remember all that fuss about the heartbleed bug last year, but beyond that I knew nothing.

SSL stands for "Secure Sockets Layer". Newer versions of SSL are called "TLS", for "Transport Layer Security", but people still often refer to it as "SSL", so I will too.

SSL is a protocol through which two computers can establish a secure connection to send information back and forth. Once such a connection is established, they can be absolutely certain that if anyone is listening in on that connection, they won't be able to understand anything. All of the messages will look like gibberish. So, I guess you could think of it like a protocol that lets two computers invent a language that only they can understand :)

But how does it do that?

Background

Before we can talk about SSL, we need a bit of background information. We can imagine that there are three algorithms which we can call the "encryption algorithm", the "decryption algorithm", and the "key-pair generation algorithm". We don't need to know how they are implemented, as each algorithm set is different. Also, they all use a bunch of crazy math that most of us don't understand. What matters is how these three algorithm are related to each other, which we will talk about next.

The key-pair generation algorithm uses a randomly generated number to output two really big numbers, which we call "key1", and "key2". These two keys are magic, because when used with the other two algorithms (the encryption and decryption algorithms), either key can be used to decrypt a message that was encrypted with the other key. If we were to think of these algorithms as functions, they would be prototyped something like this:

/* Takes as input a random number, and outputs a pair of (really big) numbers. */

   keyPair keyGen(randomNumber);


/* Takes as input, a plain-text message, and a key (generated from the key-pair
 * algorithm) and outputs an encrypted string (called "ciphertext"). */

   cipherText encrypt(message, key);


/* Takes as input, ciphertext, encrypted by the encryption algorithm above, and
 * outputs a string. This string will either be more gibberish, or the original message
 * used in the encryption algorithm, depending on whether the right key was used. */

   string decrypt(cipherText, key);

The idea is that the encryption algorithm takes a message that we want to encrypt, together with one of the numbers from our key generation algorithm. The decryption algorithm will only be able to recover the original message, if the key that it uses is the OTHER key that the key generation algorithm created.

Put another way, we can say that key1 can encrypt a message, but it cannot be used to decrypt the ciphertext that is generated, nor can any other number except one: key2. Only key2 can decrypt the message encrypted with key1. The opposite holds as well: key2 can be used to encrypt messages, but only key1 will be able to decrypt the resulting ciphertext. When two keys stand in such a situation as this (either key can decrypt messages encrypted only with the other), it is called "asymmetric cryptography".

Public Key Cryptography

Whenever two computers want to communicate securely, they typically use a secure protocol (think of HTTP vs HTTPS). But how would this work? We can apply what we learned above to see how!

Let us designate one of those computers as a client, and one as a server, though the same would apply were it a peer-to-peer arrangement. Both client and server generate a set of keys using the keyGen algorithm above. (The server would usually only generate the key-pair once, where as the client would regenerate a new key-pair for every server it connects to, and for every session. We'll see why in a second, when we talk about certificate authorities.) Each computer designates one of those keys as its "public key", and the other as its "private key". The private key is stored on the local computer, and should not be given out to anyone under any circumstances. The public key on the other hand, can be given to anyone who asks for it; there is no danger in doing so.

Continuing with the example, the two computers would exchange public keys so that now, secure communication can take place. Whenever the client wants to send a secure message to the server, it would encrypt the message with the server's public key. Then, since only the server knows its own private key, it would be the only computer that would be capable of decrypting the message. Likewise, when the server wants to send a message to the client, the server encrypts the message using the client's public key, so that only the client could decrypt the message with the client's private key.

Great! Well... not really... This works in theory, but there are two major problems.

First, how do two computers initially exchange public keys? The request for the server's public key is necessarily insecure (since the client doesn't yet know the servers public key). A malicious computer could intercept this message, and pretend to be the server by replying to the client with the malicious computer's public key, and forward the request to the server with the malicious public key. The the malicious computer would act like a middle agent between the client and the server, decrypting every message with its "malicious private key", reading the messages, and then re-encrypting with its "malicious public key", forwarding the message to the destination without either client or server ever knowing. This is called a "man in the middle" attack.

The second thing wrong with this, is that most of the encryption algorithms are insanely strong. It would take a gazillion years to figure out what a computer's private key is, knowing only the ciphertext. Why is this a problem? The reason it's so strong is that it's computationally expensive, and therefore, not feasible to encrypt every single message exchanged between client and server, as this would greatly reduce performance.

Certificate Authorities

The solution to the man-in-the-middle problem mention above, is to use trusted organizations called "certificate authorities" to verify that the public key you receive back really does belong to who you think it does. In our man-in-the-middle example above, once our client receives the server's public key back, it verifies it with a certificate authority. If our message was intercepted, and we were given back a phony public key (by a malicious agent), the certificate authority would tell us. If you've ever seen those "site is not secure" or "Your connection is not private" messages, that is because the site's public key was not recognized by a certificate authority as belonging to the computer you are connecting to. This could mean that someone is trying to hack you. But often, this is simply because the site owners didn't wanted to pay for a certificate authority to verify who they are, but they still wanted to have secure connections to their site. Or it could mean that they "self-signed" their certificate as well. Both of these sort of defeat the purpose of using SSL, but they are perhaps better than no security at all.

Browsers come built-in with a list of certificate authorities and their public keys, so that we can be absolutely certain that our communication with certificate authorities is secure. For example, on Chrome (version 39), to view the recognized certificate authorities, I can select:

1. => "Settings"
2. => "Show advanced settings"
3. => "HTTPS/SSL"
4. => "Manage Certificates"
5. => "Authorities"

Session Keys

The solution to the second problem (that encrypting and decrypting is CPU intensive) is to not use the public keys to encrypt every message. The public keys are just used to set up a "handshake", and to verify the identities of the computers communicating.

Basically, once the server and client trust each other, they generate a "session key", which is a number to be used with a different, less expensive encryption algorithm to both encrypt and decrypt messages. The process of agreeing on this session key is done through public key encryption, but once agreed upon, this less expensive encryption technique is used instead.

As the name suggest, the session key only lasts for a short time. So even though the session key method uses a weaker form of encryption, and is therefore easier to crack, it is valid for only a short time. By the time anyone would be able to crack it, the session key would have expired.

SSL

Now we can finally get some understanding of how SSL would work. Let us suppose that a web browser, call it simply "myBrowser", wants to connect to a web server, call it "server.com", using HTTPS. Something like this would happen:

1. myBrowser examines the url of the request. Since the request is using HTTPS, myBrowser will issue a request (usually to port 443, since port 80 is conventionally for plain old HTTP), for the server.com's public key (and some other SSL stuff). This message is unencrypted.

2. server.com will receive the request, and respond with its public key (certificate, and some other SSL information). This message is unencrypted.

3. myBrowser will receive this public key, and verify it with a certificate authority. This message is encrypted with the certificate authorities' public key. If myBrowser cannot verify the public key from server.com, then myBrowser warns the user.

4. If server.com's public key is verified by a certificate authority, then myBrowser will generate a key-pair, and use one key as its private key. myBrowser will also generate a big random number.

5. myBrowser sends an "OK, let's use SSL!" message to server.com. In this message, it sends myBrowser's public key, and the big random number it just generated. This message will be encrypted using server.com's public key.

6. server.com will decrypt the message using its private key. It uses the big random number sent by my browser, does a bunch of math on it, and creates another big random number, called a "master secret". It then creates a "session key" from this master secret, that it will use to encrypt and decrypt all messages exchanged with myBrowser.

7. server.com sends this big random number back to the myBrowser. This message will be encrypted using myBrowser's public key.

8. myBrowser will decrypt the message using its private key. myBrowser performs the same math on the master secret that server.com did, to generate the same session key that server.com did.

9. Now, myBrowser and server.com send messages back and forth both encrypting and decrypting their messages with the less expensive "session key encryption algorithms", instead of public keys and private keys.

Conclusion

It was a lot of fun learning the basics of how SSL works! As a bonus, along the way I picked up a bunch of cool and fancy sounding terms like "symmetric session key" to help me sound smart :P

What was really clever, was how two computers can establish a secure connection in the first place, and that if it is done properly, then there is no point at which someone can intercept a message and do any harm. All of the data that a malicious computer can intercept and actually understand is public anyway, and all the stuff that is secret or sensitive cannot be cracked. Cool, huh?

Monday, 22 September 2014

Assembly Generated from Function Calls on x86-64

Two weeks ago in SPO600 we were given a task: compile a hello world C program, look at the Assembly code that gets generated then modify the code in small ways and notice how the Assembly code changes.

A second and separate task we were given was to learn about some feature of Assembly, teach it to the other students in the class in the form of a short presentation, and blog about what we discovered. I chose to investigate what happens when a function gets called in C, in x86-64 Assembly, and in particular what happens to the arguments passed into the function.

These two tasks are two separate labs, but since they are similar (and I'm lazy :P), I will combine them into a single blog entry.

WARNING/DISCLAIMER! What follows is a combination of personal research from reading materials I found on the web, trial and error with compiling and ojbdump-ing, and at times, wild speculation based on what I'm observing. Don't trust anything I say as authoritative!

A function that only references local variables and arguments is a standalone entity. At compile-time, it has no knowledge of where the arguments came from, or what values they should have. Therefore, when a function begins execution, it must look elsewhere to obtain the value of its arguments, as they are not defined within the function itself. Functions need to make assumptions about where to look for arguments, where to look for return values, and what stuff remains the same after a different function gets called and then returns. For a given computer system, such a set of rules governing the placement of arguments, return values, and other expected behavior, is called the "calling conventions" for that specific system.

A stack frame (I've also seen this called an "activation record") of a function at some moment in time, is the region of memory where the function stores local variables, its arguments, and information needed to restore the state of the caller upon returning. I gave the definition with respect to "some moment in time", because according to my understanding, it is possible for a stack frame to grow and shrink throughout the duration of the function's execution.

The way I, and most other stuff I read, visualize memory made the following assumptions about orientation. This is important to clarify, so that when I write about one location being 'above' or 'below' another, or about the direction of memory growth, your mental picture is the same thing as mine is. Throughout the rest of this post I will assume that:

1. Higher memory addresses are visualized as being above lower memory addresses.

2. The stack frame is a stack data structure, that grows downward, towards lower memory addresses.

This implies that if I have two local variables, X and Y, and Y was declared after X, then Y will have an address that is less than X.

I read some basic tutorials, and watched some videos about what happens when a function gets called. Typically, most explanations I saw told a story about what 'would' happen in an 'ideal case', but the tedious details of what actually happens is very specific to an instruction set architecture and an operating system. The 'ideal case' scenario would go something like this:

A function begins execution. There is a register that holds what is called the stack pointer (SP) and base pointer (BP) of the functions activations record. Above the base pointer is the address that would have been held by the program counter (PC), had the function not been called. This is where the function will return to upon returning, by loading this value back into the PC. Assuming that the size of pointer types is 8, then just above these 8 bytes, should be the arguments of the function. How much space each argument takes, is inferred from its type. So, for example, if my function is passes an int, int, and a long double, in that order, and assuming that the size of int and long double is 4 and 10 respectively, then each of these arguments can be addressed by BP + 8, BP + 12, and BP + 16 respectively. Remember! We are adding 8 to account for the saved PC of the function that called us! How did those values get there? It was the responsibility of the function that called our function to put them there, as well as set the SP to point to the right place. So, suppose we wanted to call another function, it would be OUR responsibility to decrement the SP by enough to store the values of the arguments to the function, and put the right values in there. When ever a local variable is declared, the stack pointer moves down as many bytes as is needed to make room for that local variable. So, with the numbers we've been using so far, that would be 4 bytes for an int declaration, 8 for a pointer declaration, and 10 for a double declaration.

But Alas! Things are not really this simple on x86-64 and most 'real' architectures, mostly because we can optimize on this behavior, and because of alignment issues.

1. First, we must manually store the value of the callers BP, by pushing it onto the stack. Then we must set our BP to be equal to our SP which was decremented by the caller.

2. Compilers are smart enough not to have to move the SP for every declaration and function call, but can move it by the right size just once at the beginning of the function. So, if I declare four ints, and then call a function that accepts two ints, then (ignoring alignment) the stack pointer would be decremented by 24 byes at the very beginning.

3. The stack pointer must always point to an address that is a multiple of 16. Also, the compiler may allocate more memory than you would expect to improve efficiency by aligning some variables in such a way as to waste memory but improve speed.

4. Perhaps most relevant to the code below, the caller will, whenever possible, pass the values of the arguments to a called function using registers directly as opposed to pushing them onto the memory stack. The called function can infer whether to look in the registers or above the base pointer for a particular argument from its type.

So let's start compiling functions to see what actually happens! :D First, let's compile 6 simple functions which are exactly the same except except for the types of the arguments and return value:

int i(int a, int b, int c, int d, int e, int f)
{
  return (a + b + c) * (d + e + f);
}

char c(char a, char b, char c, char d, char e, char f)
{
  return (a + b + c) * (d + e + f);
}

long long ll(long long a, long long b, long long c, long long d, long long e, long long f)
{
  return (a + b + c) * (d + e + f);
}

float f(float a, float b, float c, float d, float e, float f)
{
  return (a + b + c) * (d + e + f);
}

double d(double a, double b, double c, double d, double e, double f)
{
  return (a + b + c) * (d + e + f);
}

long double ld(long double a, long double b, long double c, long double d, long double e, long double f)
{
  return (a + b + c) * (d + e + f);
}

Let's take a look at the Assembly output. I turned on some basic optimization ("-O1" flag) because it makes the calling convention more readily transparent. For example, I noticed that without optimization, the compiler would 'always' store the arguments from their registers onto its stack frame, even if it was not necessary. The Assembly output:

ex1.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <i>:
   0:   01 f7                   add    %esi,%edi
   2:   01 fa                   add    %edi,%edx
   4:   41 01 c8                add    %ecx,%r8d
   7:   45 01 c1                add    %r8d,%r9d
   a:   89 d0                   mov    %edx,%eax
   c:   41 0f af c1             imul   %r9d,%eax
  10:   c3                      retq

0000000000000011 <c>:
  11:   01 f7                   add    %esi,%edi
  13:   01 fa                   add    %edi,%edx
  15:   41 01 c9                add    %ecx,%r9d
  18:   45 01 c8                add    %r9d,%r8d
  1b:   44 89 c0                mov    %r8d,%eax
  1e:   0f af c2                imul   %edx,%eax
  21:   c3                      retq

0000000000000022 <ll>:
  22:   48 01 f7                add    %rsi,%rdi
  25:   48 01 fa                add    %rdi,%rdx
  28:   49 01 c8                add    %rcx,%r8
  2b:   4d 01 c1                add    %r8,%r9
  2e:   48 89 d0                mov    %rdx,%rax
  31:   49 0f af c1             imul   %r9,%rax
  35:   c3                      retq

0000000000000036 <f>:
  36:   f3 0f 58 c8             addss  %xmm0,%xmm1
  3a:   f3 0f 58 d1             addss  %xmm1,%xmm2
  3e:   f3 0f 58 e3             addss  %xmm3,%xmm4
  42:   f3 0f 58 ec             addss  %xmm4,%xmm5
  46:   f3 0f 59 d5             mulss  %xmm5,%xmm2
  4a:   0f 28 c2                movaps %xmm2,%xmm0
  4d:   c3                      retq

000000000000004e <d>:
  4e:   f2 0f 58 c8             addsd  %xmm0,%xmm1
  52:   f2 0f 58 d1             addsd  %xmm1,%xmm2
  56:   f2 0f 58 e3             addsd  %xmm3,%xmm4
  5a:   f2 0f 58 ec             addsd  %xmm4,%xmm5
  5e:   f2 0f 59 d5             mulsd  %xmm5,%xmm2
  62:   66 0f 28 c2             movapd %xmm2,%xmm0
  66:   c3                      retq

0000000000000067 <ld>:
  67:   db 6c 24 18             fldt   0x18(%rsp)
  6b:   db 6c 24 08             fldt   0x8(%rsp)
  6f:   de c1                   faddp  %st,%st(1)
  71:   db 6c 24 28             fldt   0x28(%rsp)
  75:   de c1                   faddp  %st,%st(1)
  77:   db 6c 24 48             fldt   0x48(%rsp)
  7b:   db 6c 24 38             fldt   0x38(%rsp)
  7f:   de c1                   faddp  %st,%st(1)
  81:   db 6c 24 58             fldt   0x58(%rsp)
  85:   de c1                   faddp  %st,%st(1)
  87:   de c9                   fmulp  %st,%st(1)
  89:   c3                      retq
                                                                                                                                                        64,35-39      Bot

The important thing to notice is that the integer and floating point arguments are put into particular registers consistently. The order is always the same. For integer types, it's %rdi, %rsi, %rdx, %rcx, %r8, %r9. For floats and doubles, it's %xmm0, %xmm1, ... %xmm7. The return value is always stored in the 'A' register for integer types, and the %xmm0 register for floats and doubles,

However, for long doubles, I am a little bit confused by what I am seeing (maybe someone who understands better can chime in?). After reading the calling convention portion of the System V ABI for x86-64, I assumed that long double arguments should be pushed onto the FPU stack, if they can fit into those registers. On my system they can: sizeof(long double) == 10, CHAR_BIT == 8, and the FPU stack registers are 80 bits wide. Instead, what I am seeing is the long double being put 16 bytes above the base pointer. (The 16 points is where the saved program counter, and caller's base pointer are stored). Perhaps long doubles must padded to be 16 bytes? But then why is the return value pushed onto the %st register (top of the FPU stack)? Weird...

In any case, there were four interesting cases that came to mind:

1. There are arguments of different types in different combinations.

2: There are lots of arguments. Specifically, when there are more arguments than there are registers of the appropriate type to store them.

3. The size of the type of some of the arguments or the return value is too wide to fit into registers (a structure type with many fields, for example).

4. When the function accepts a variable number of arguments.

I will write about the fourth case, functions of a variable number of arguments, in a separate blog entry.

Let's start with the case when there are arguments of different types:

float diff_arg_types(int i, char c, long long ll, float f, double d, long double ld, int x, int y, int z)
{
  return (i + c + ll + x + y + x) * (f + d + ld);
}

This function produces the following assembly (this time, with no optimizations, since I want to be very explicit about which registers correspond to which arguments):

0000000000000000 <diff_arg_types>:
   0:   55                      push   %rbp
   1:   48 89 e5                mov    %rsp,%rbp
   4:   89 7d fc                mov    %edi,-0x4(%rbp)
   7:   89 f0                   mov    %esi,%eax
   9:   48 89 55 f0             mov    %rdx,-0x10(%rbp)
   d:   f3 0f 11 45 ec          movss  %xmm0,-0x14(%rbp)
  12:   f2 0f 11 4d e0          movsd  %xmm1,-0x20(%rbp)
  17:   89 4d e8                mov    %ecx,-0x18(%rbp)
  1a:   44 89 45 dc             mov    %r8d,-0x24(%rbp)
  1e:   44 89 4d d8             mov    %r9d,-0x28(%rbp)
  22:   88 45 f8                mov    %al,-0x8(%rbp)
  25:   0f be 55 f8             movsbl -0x8(%rbp),%edx
  29:   8b 45 fc                mov    -0x4(%rbp),%eax
  2c:   01 d0                   add    %edx,%eax
  2e:   48 63 d0                movslq %eax,%rdx
  31:   48 8b 45 f0             mov    -0x10(%rbp),%rax
  35:   48 01 c2                add    %rax,%rdx
  38:   8b 45 e8                mov    -0x18(%rbp),%eax
  3b:   48 98                   cltq
  3d:   48 01 c2                add    %rax,%rdx
  40:   8b 45 dc                mov    -0x24(%rbp),%eax
  43:   48 98                   cltq
  45:   48 01 c2                add    %rax,%rdx
  48:   8b 45 e8                mov    -0x18(%rbp),%eax
  4b:   48 98                   cltq
  4d:   48 01 d0                add    %rdx,%rax
  50:   48 89 45 c8             mov    %rax,-0x38(%rbp)
  54:   df 6d c8                fildll -0x38(%rbp)
  57:   f3 0f 10 45 ec          movss  -0x14(%rbp),%xmm0
  5c:   0f 5a c0                cvtps2pd %xmm0,%xmm0
  5f:   f2 0f 58 45 e0          addsd  -0x20(%rbp),%xmm0
  64:   f2 0f 11 45 c0          movsd  %xmm0,-0x40(%rbp)
  69:   dd 45 c0                fldl   -0x40(%rbp)
  6c:   db 6d 10                fldt   0x10(%rbp)
  6f:   de c1                   faddp  %st,%st(1)
  71:   de c9                   fmulp  %st,%st(1)
  73:   d9 5d d4                fstps  -0x2c(%rbp)
  76:   f3 0f 10 45 d4          movss  -0x2c(%rbp),%xmm0
  7b:   f3 0f 11 45 c0          movss  %xmm0,-0x40(%rbp)
  80:   8b 45 c0                mov    -0x40(%rbp),%eax
  83:   89 45 c0                mov    %eax,-0x40(%rbp)
  86:   f3 0f 10 45 c0          movss  -0x40(%rbp),%xmm0
  8b:   5d                      pop    %rbp
  8c:   c3                      retq
                                                                                                                                                        51,1          Bot

There's a lot of stuff here, but we don't care about most of it at the moment. First look at lines 4 through twelve. It looks like the compiler is just using the next avaialble register for that type! For examples, it uses the integer registers until it hits a float and a double. So it puts those arguments in %xmm0 and %xmm1, and continues to put the final three int arguments into registers %rcx, %r8, and %r9. And the long double gets put 16 bytes above the base pointer, since on line 35 we see that locations value being pushed onto the FPU stack.

Now! Let's see what will happen if we pass in more arguments than there are registers to store those arguments. My C code:

int i(int a, int b, int c, int d, int e, int f, int g, int h, int i, int j)
{
  return (a + b + c + d + e) * (f + g + h + i + j);
}

char c(char a, char b, char c, char d, char e, char f, char g, char h, char i, char j)
{
  return (a + b + c + d + e) * (f + g + h + i + j);
}

double d(double a, double b, double c, double d, double e, double f, double g, double h, double i, double j)
{
  return (a + b + c + d + e) * (f + g + h + i + j);
}

The Assembly (this time with optimizations turned on again):

0000000000000000 <i>:
   0:   01 f7                   add    %esi,%edi
   2:   01 fa                   add    %edi,%edx
   4:   01 d1                   add    %edx,%ecx
   6:   41 01 c8                add    %ecx,%r8d
   9:   44 03 4c 24 08          add    0x8(%rsp),%r9d
   e:   44 89 c8                mov    %r9d,%eax
  11:   03 44 24 10             add    0x10(%rsp),%eax
  15:   03 44 24 18             add    0x18(%rsp),%eax
  19:   03 44 24 20             add    0x20(%rsp),%eax
  1d:   41 0f af c0             imul   %r8d,%eax
  21:   c3                      retq

0000000000000022 <c>:
  22:   41 01 f8                add    %edi,%r8d
  25:   44 01 c6                add    %r8d,%esi
  28:   01 f2                   add    %esi,%edx
  2a:   01 d1                   add    %edx,%ecx
  2c:   44 02 4c 24 20          add    0x20(%rsp),%r9b
  31:   44 89 c8                mov    %r9d,%eax
  34:   02 44 24 08             add    0x8(%rsp),%al
  38:   02 44 24 10             add    0x10(%rsp),%al
  3c:   02 44 24 18             add    0x18(%rsp),%al
  40:   0f af c1                imul   %ecx,%eax
  43:   c3                      retq

0000000000000044 <d>:
  44:   f2 0f 58 c8             addsd  %xmm0,%xmm1
  48:   f2 0f 58 d1             addsd  %xmm1,%xmm2
  4c:   f2 0f 58 da             addsd  %xmm2,%xmm3
  50:   f2 0f 58 e3             addsd  %xmm3,%xmm4
  54:   f2 0f 58 f5             addsd  %xmm5,%xmm6
  58:   f2 0f 58 fe             addsd  %xmm6,%xmm7
  5c:   f2 0f 58 7c 24 08       addsd  0x8(%rsp),%xmm7
  62:   66 0f 28 ef             movapd %xmm7,%xmm5
  66:   f2 0f 58 6c 24 10       addsd  0x10(%rsp),%xmm5
  6c:   f2 0f 59 e5             mulsd  %xmm5,%xmm4
  70:   66 0f 28 c4             movapd %xmm4,%xmm0
  74:   c3                      retq

Here we can see that the compiler uses as many registers as it can, and when it runs out, it starts to place the arguments starting above the base pointer of the callee function. Also note that all the arguments smaller than 8 bytes get aligned to exactly 8 bytes. So in function 'c' for example, where all the arguments are characters, the seventh, eighth, ninth, and tenth argument gets stored at 0x8, 0x10, 0x 18, and 0x20 above the stack pointer, respectively. These are eight byte chunks. (Note: With optimization turned on, the function is not pushing %rbp onto the stack and assigning a new value to it, so it reaches 8 bytes above the STACK POINTER, and NOT 16 bytes above the BASE POINTER as in example 2. I apologize for the confusion).

Similarly, with the double arguments, the first eight are stored in %xmm0 - %xmm7, and the last two are stored at %rsp + 0x8 and %rsp + 0x10.

Now, the last case of interest is when we pass to or return from the function, values that are too wide for registers:

typedef struct {
  char c;
  int i;
  long long ll;
  float f;
  double d;
  long double ld;
} big_struct;

big_struct fun(big_struct b1, big_struct b2)
{
  big_struct b1b2 = {.c = b1.c + b2.c,
                     .i = b1.i + b2.i,
                     .ll = b1.ll + b2.ll,
                     .f = b1.f + b2.f,
                     .d = b1.d + b2.d,
                     .ld = b2.ld + b2.ld };
  return b1b2;
}

And the Assembly:

0000000000000000 <fun>:
   0:   48 89 f8                mov    %rdi,%rax
   3:   8b 4c 24 0c             mov    0xc(%rsp),%ecx
   7:   03 4c 24 3c             add    0x3c(%rsp),%ecx
   b:   48 8b 54 24 10          mov    0x10(%rsp),%rdx
  10:   48 03 54 24 40          add    0x40(%rsp),%rdx
  15:   f3 0f 10 4c 24 18       movss  0x18(%rsp),%xmm1
  1b:   f3 0f 58 4c 24 48       addss  0x48(%rsp),%xmm1
  21:   f2 0f 10 44 24 20       movsd  0x20(%rsp),%xmm0
  27:   f2 0f 58 44 24 50       addsd  0x50(%rsp),%xmm0
  2d:   db 6c 24 58             fldt   0x58(%rsp)
  31:   d8 c0                   fadd   %st(0),%st
  33:   0f b6 74 24 38          movzbl 0x38(%rsp),%esi
  38:   40 02 74 24 08          add    0x8(%rsp),%sil
  3d:   40 88 37                mov    %sil,(%rdi)
  40:   89 4f 04                mov    %ecx,0x4(%rdi)
  43:   48 89 57 08             mov    %rdx,0x8(%rdi)
  47:   f3 0f 11 4f 10          movss  %xmm1,0x10(%rdi)
  4c:   f2 0f 11 47 18          movsd  %xmm0,0x18(%rdi)
  51:   db 7f 20                fstpt  0x20(%rdi)
  54:   c3                      retq

From this we can infer that the two structs are laid out on top of each other, above the stack pointer. Each field from the struct is added to its corresponding field in the other struct, and stored in a register. For example: 0xc + %rsp is added to 0x3c + %rsp and stored in %ecx, 0x10 + %rsp is added to 0x40 + %rsp and stored in %rdx, and so on. What's interesting is how the struct is returned. The calling function is expected to put into the register %rdi, the base address of a memory location in which the caller is supposed to store the resturn value. Thus, from lines 16 through 20 we see the values in the registers where the results of our previous calculations were put, being stored at an offset from the address in %rdi.

I felt I learnt a lot from investigating the X86-64 calling conventions on my machine. However, I now have more questions than when I started :) Many of which can probably be answered by a combination of further experimentation and reading documentation and standards, but alas, this is a topic for another blog post! The question at the front of my mind at the moment are:

1. Why aren't arguments of type long double passed through the FPU register stack?

2. What happens if the size of the struct, and the types of the fields are changed? Are structs ever passed in registers?

3. Tricky alignment questions (really, I just want a set of explicit alignment rules).

4. In the last example, I am having trouble understanding lines 14 and 15. I know from reading parts of the ABI standard that the address of where a struct is to be put is stored in %rdi. But here, it looks like %rdi is being manipulated in some way. Also, the first field of the first struct begins at 0xc bytes above above the stack pointer. But here it looks like the computer is grabbing data at 0x8 bytes above the stack pointer? But this leaves only 4 bytes of meaningful data between 0x8 and 0xc. What is this data and what the heck does it have to do with %rdi (the address of where to store the return value)?