3 System Calls

3.1 Overview

System calls are how userspace programs interact with the kernel. The general principle behind how they work is described below.

3.1.1 System call numbers

Every system call has a system call number which is known by both userspace and the kernel. For example, both might know that system call number 10 is open(), system call number 11 is read(), and so on (the actual numbers differ between architectures).

The Application Binary Interface (ABI) is very similar to an API, but rather than describing an interface between pieces of software it describes conventions at the level of the hardware. The ABI will define which register the system call number should be put in so the kernel can find it when it is asked to do the system call.

3.1.2 Arguments

System calls are no good without arguments; for example open() needs to tell the kernel exactly what file to open. Once again the ABI will define which registers arguments should be put into for the system call.
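As a minimal sketch of these two ideas, assuming a Linux system with glibc, the syscall() library function takes the system call number first and the arguments after it, and places them in whatever registers the ABI requires. The helper name raw_write is our own invention for illustration.

```c
#define _GNU_SOURCE        /* exposes syscall() in <unistd.h> on glibc */
#include <string.h>
#include <sys/syscall.h>   /* SYS_write: the system call number */
#include <unistd.h>        /* syscall() */

/* Hypothetical helper: write a string to a file descriptor by
 * system call number, rather than through the write() wrapper. */
long raw_write(int fd, const char *msg)
{
	/* the number goes first, then the three arguments write expects */
	return syscall(SYS_write, fd, msg, strlen(msg));
}
```

Calling raw_write(1, "hello\n") should behave like write(1, "hello\n", 6): the kernel sees the same number and the same three arguments either way.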

3.1.3 The trap

To actually perform the system call, there needs to be some way to communicate to the kernel we wish to make a system call. All architectures define an instruction, usually called break or something similar, that signals to the hardware we wish to make a system call.

Specifically, this instruction will tell the hardware to modify the instruction pointer to point to the kernel's system call handler (when the operating system sets itself up it tells the hardware where its system call handler lives). So once userspace executes the break instruction, it has lost control of the program and passed it over to the kernel.

The rest of the operation is fairly straightforward. The kernel looks in the predefined register for the system call number, and looks it up in a table to see which function it should call. This function is called, does what it needs to do, and places its return value into another register defined by the ABI as the return register.

The final step is for the kernel to jump back to the userspace program, so it can continue from where it left off. The userspace program gets the data it needs from the return register, and continues happily on its way!

Although the details of the process can get quite hairy, this is basically all there is to a system call.
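The table lookup described above can be sketched in ordinary C. This is only a userspace simulation under our own assumptions: the numbers, the function names and handle_trap are all hypothetical, and a real kernel reads the number and arguments from registers rather than from function parameters.

```c
/* Hypothetical system call numbers, agreed on by both "sides". */
enum { NR_GETPID = 0, NR_ADD = 1, NR_MAX = 2 };

typedef long (*syscall_fn)(long a, long b);

/* Two pretend system calls. */
static long sys_getpid(long a, long b) { (void)a; (void)b; return 42; }
static long sys_add(long a, long b)    { return a + b; }

/* The table the "kernel" indexes with the system call number. */
static const syscall_fn syscall_table[NR_MAX] = {
	[NR_GETPID] = sys_getpid,
	[NR_ADD]    = sys_add,
};

/* Stand-in for the handler that the trap instruction jumps to. */
long handle_trap(long nr, long arg0, long arg1)
{
	if (nr < 0 || nr >= NR_MAX)
		return -1;	/* unknown system call number */
	return syscall_table[nr](arg0, arg1);
}
```

Here handle_trap(NR_ADD, 2, 3) looks up entry 1 in the table and returns 5, while an out-of-range number is rejected before the table is touched.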

3.1.4 libc

Although you can do all of the above by hand for each system call, system libraries usually do most of the work for you. The standard library that deals with system calls on UNIX-like systems is libc; we will learn more about its roles in future weeks.

3.2 Analysing a system call

As the system libraries usually deal with making system calls for you, we need to do some low-level hacking to illustrate exactly how system calls work.

We will illustrate how probably the simplest system call, getpid(), works. This call takes no arguments and returns the ID of the currently running program (or process; we'll look more at the process in later weeks).

#include <stdio.h>

/* for syscall() */
#include <sys/syscall.h>
#include <unistd.h>

/* system call numbers */
#include <asm/unistd.h>

void function(void)
{
	int pid;

	pid = syscall(__NR_getpid);
}
Example 3.2.1 getpid() example

We start by writing a small C program which we can use to illustrate the mechanism behind system calls. The first thing to note is that there is a syscall function provided by the system libraries for directly making system calls. This provides an easy way for programmers to directly make system calls without having to know the exact assembly language routines for making the call on their hardware. So why do we use getpid() at all? Firstly, it is much clearer to use a symbolic function name in your code. However, more importantly, getpid() may work in very different ways on different systems. For example, on Linux the system library has been able to cache the result of getpid(), so that if it is called twice the library does not pay the penalty of making an entire system call to find out the same information again.

By convention under Linux, system call numbers are defined in the asm/unistd.h file from the kernel source. Being in the asm subdirectory, this is different for each architecture Linux runs on. Again by convention, system call numbers are given a #define name consisting of __NR_ followed by the name of the system call. Thus you can see our code will be making the getpid system call, storing the value in pid.
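As a quick check of the naming convention, here is a small sketch (assuming Linux with glibc; the function name getpid_matches is our own) that makes the call by number and compares the answer with the ordinary getpid() wrapper.

```c
#define _GNU_SOURCE        /* exposes syscall() in <unistd.h> on glibc */
#include <sys/syscall.h>   /* pulls in the __NR_ and SYS_ names */
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the direct system call and the
 * library wrapper report the same process ID, 0 otherwise. */
int getpid_matches(void)
{
	long direct = syscall(__NR_getpid);	/* call by number */
	pid_t wrapped = getpid();		/* call by name   */

	return direct == (long)wrapped;
}
```

On glibc the header sys/syscall.h also provides SYS_getpid as another spelling of the same number, which is why either name can be passed to syscall().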

We will have a look at how several architectures implement this code under the hood. We're going to look at real code, so things can get quite hairy. But stick with it -- this is exactly how your system works!

3.2.1 PowerPC

PowerPC is a RISC architecture common in older Apple computers, and the core of devices such as the Xbox 360.


/* On powerpc a system call basically clobbers the same registers like a
 * function call, with the exception of LR (which is needed for the
 * "sc; bnslr" sequence) and CR (where only CR0.SO is clobbered to signal
 * an error return status).
 */

#define __syscall_nr(nr, type, name, args...)				\
	unsigned long __sc_ret, __sc_err;				\
	{								\
		register unsigned long __sc_0  __asm__ ("r0");		\
		register unsigned long __sc_3  __asm__ ("r3");		\
		register unsigned long __sc_4  __asm__ ("r4");		\
		register unsigned long __sc_5  __asm__ ("r5");		\
		register unsigned long __sc_6  __asm__ ("r6");		\
		register unsigned long __sc_7  __asm__ ("r7");		\
									\
		__sc_loadargs_##nr(name, args);				\
		__asm__ __volatile__					\
			("sc           \n\t"				\
			 "mfcr %0      "				\
			: "=&r" (__sc_0),				\
			  "=&r" (__sc_3),  "=&r" (__sc_4),		\
			  "=&r" (__sc_5),  "=&r" (__sc_6),		\
			  "=&r" (__sc_7)				\
			: __sc_asm_input_##nr				\
			: "cr0", "ctr", "memory",			\
			  "r8", "r9", "r10","r11", "r12");		\
		__sc_ret = __sc_3;					\
		__sc_err = __sc_0;					\
	}								\
	if (__sc_err & 0x10000000)					\
	{								\
		errno = __sc_ret;					\
		__sc_ret = -1;						\
	}								\
	return (type) __sc_ret

#define __sc_loadargs_0(name, dummy...)					\
	__sc_0 = __NR_##name
#define __sc_loadargs_1(name, arg1)					\
	__sc_loadargs_0(name);						\
	__sc_3 = (unsigned long) (arg1)
#define __sc_loadargs_2(name, arg1, arg2)				\
	__sc_loadargs_1(name, arg1);					\
	__sc_4 = (unsigned long) (arg2)
#define __sc_loadargs_3(name, arg1, arg2, arg3)				\
	__sc_loadargs_2(name, arg1, arg2);				\
	__sc_5 = (unsigned long) (arg3)
#define __sc_loadargs_4(name, arg1, arg2, arg3, arg4)			\
	__sc_loadargs_3(name, arg1, arg2, arg3);			\
	__sc_6 = (unsigned long) (arg4)
#define __sc_loadargs_5(name, arg1, arg2, arg3, arg4, arg5)		\
	__sc_loadargs_4(name, arg1, arg2, arg3, arg4);			\
	__sc_7 = (unsigned long) (arg5)

#define __sc_asm_input_0 "0" (__sc_0)
#define __sc_asm_input_1 __sc_asm_input_0, "1" (__sc_3)
#define __sc_asm_input_2 __sc_asm_input_1, "2" (__sc_4)
#define __sc_asm_input_3 __sc_asm_input_2, "3" (__sc_5)
#define __sc_asm_input_4 __sc_asm_input_3, "4" (__sc_6)
#define __sc_asm_input_5 __sc_asm_input_4, "5" (__sc_7)

#define _syscall0(type,name)						\
type name(void)								\
{									\
	__syscall_nr(0, type, name);					\
}

#define _syscall1(type,name,type1,arg1)					\
type name(type1 arg1)							\
{									\
	__syscall_nr(1, type, name, arg1);				\
}

#define _syscall2(type,name,type1,arg1,type2,arg2)			\
type name(type1 arg1, type2 arg2)					\
{									\
	__syscall_nr(2, type, name, arg1, arg2);			\
}

#define _syscall3(type,name,type1,arg1,type2,arg2,type3,arg3)		\
type name(type1 arg1, type2 arg2, type3 arg3)				\
{									\
	__syscall_nr(3, type, name, arg1, arg2, arg3);			\
}

#define _syscall4(type,name,type1,arg1,type2,arg2,type3,arg3,type4,arg4) \
type name(type1 arg1, type2 arg2, type3 arg3, type4 arg4)		\
{									\
	__syscall_nr(4, type, name, arg1, arg2, arg3, arg4);		\
}

#define _syscall5(type,name,type1,arg1,type2,arg2,type3,arg3,type4,arg4,type5,arg5) \
type name(type1 arg1, type2 arg2, type3 arg3, type4 arg4, type5 arg5)	\
{									\
	__syscall_nr(5, type, name, arg1, arg2, arg3, arg4, arg5);	\
}
Example 3.2.1.1 PowerPC system call example

This code snippet from the kernel header file asm/unistd.h shows how we can implement system calls on PowerPC. It looks very complicated, but it can be broken down step by step.

Firstly, jump to the end of the example where the _syscallN macros are defined. You can see there are many macros, each one taking progressively one more argument. We'll concentrate on the simplest version, _syscall0, to start with. It only takes two arguments: the return type of the system call (e.g. a C int or char, etc) and the name of the system call. For getpid this would be done as _syscall0(int,getpid).

Easy so far! We now have to start pulling apart the __syscall_nr macro. This is not dissimilar to where we were before: it takes the number of arguments as the first parameter, followed by the type, the name and then the actual arguments.

The first step is declaring some names for registers. What this essentially does is say that __sc_0 refers to r0 (i.e. register 0). The compiler will usually use registers however it wants, so it is important we give it constraints so that it doesn't decide to go using the registers we need in some ad-hoc manner.

We then call __sc_loadargs with the interesting ## parameter. That is the preprocessor's paste operator, which joins the text on either side of it into a single name; here the right-hand side is whatever was passed as nr. Thus for our example it expands to __sc_loadargs_0(name, args);. We can see below that __sc_loadargs_0 sets __sc_0 to be the system call number; notice the paste operator again with the __NR_ prefix we talked about, and the variable name that refers to a specific register.
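If the paste operator is new to you, this tiny sketch shows it in isolation. The DEMO_ names and the numbers are made up for illustration; they simply stand in for the kernel's __NR_ defines.

```c
/* Hypothetical numbers standing in for the kernel's __NR_ defines. */
#define DEMO_NR_getpid 20
#define DEMO_NR_read    3

/* ## pastes its two operands into a single token before the compiler
 * proper sees it, so DEMO_NUMBER(getpid) becomes DEMO_NR_getpid. */
#define DEMO_NUMBER(name) DEMO_NR_##name

int demo_getpid_number(void) { return DEMO_NUMBER(getpid); }
int demo_read_number(void)   { return DEMO_NUMBER(read); }
```

After preprocessing, the two functions simply return 20 and 3; the name passed to the macro never exists as a variable at all.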

So, all this tricky-looking code actually does is put the system call number in register 0! Following the code through, we can see that the other macros will place the system call arguments into r3 through r7 (so you can have a maximum of 5 arguments to your system call).
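The way each loadargs macro builds on the one before it can be sketched with plain variables standing in for registers. Everything named demo_ or DEMO_ here is hypothetical, and only the first three levels are shown.

```c
/* Hypothetical stand-ins for registers r0, r3 and r4. */
static unsigned long demo_r0, demo_r3, demo_r4;

/* Each macro does the work of the previous one, then loads one more. */
#define DEMO_LOAD_0(nr)						\
	demo_r0 = (nr)
#define DEMO_LOAD_1(nr, arg1)					\
	DEMO_LOAD_0(nr);					\
	demo_r3 = (unsigned long) (arg1)
#define DEMO_LOAD_2(nr, arg1, arg2)				\
	DEMO_LOAD_1(nr, arg1);					\
	demo_r4 = (unsigned long) (arg2)

/* Load "system call" number 4 with the arguments 10 and 20. */
void demo_load_two(void)
{
	DEMO_LOAD_2(4, 10, 20);
}
```

After demo_load_two() runs, demo_r0 holds 4, demo_r3 holds 10 and demo_r4 holds 20, which mirrors what the real macros arrange for r0, r3 and r4.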

Now we are ready to tackle the __asm__ section. What we have here is called inline assembly because it is assembler code mixed right in with source code. The exact syntax is a little too complicated to go into right here, but we can point out the important parts.

Just ignore the __volatile__ bit for now; it is telling the compiler that this code is unpredictable so it shouldn't try and be clever with it. Again we'll start at the end and work backwards. All the stuff after the colons is a way of communicating to the compiler about what the inline assembly is doing to the CPU registers. The compiler needs to know so that it doesn't try using any of these registers in ways that might cause a crash.

But the interesting part is the two assembly statements in the first argument. The one that does all the work is the sc call. That's all you need to do to make your system call!

So what happens when this call is made? Well, the processor is interrupted and knows to transfer control to a specific piece of code set up at system boot time to handle interrupts. There are many interrupts; system calls are just one. This code will then look in register 0 to find the system call number; it then looks up a table and finds the right function to jump to in order to handle that system call. This function receives its arguments in registers 3 - 7.

So, what happens once the system call handler runs and completes? Control returns to the next instruction after the sc, in this case mfcr (move from condition register). This copies the contents of the condition register into the first output operand, __sc_0, which lives in r0. As the comment at the top of the listing notes, the kernel uses the CR0.SO bit of the condition register to signal an error return status, so this instruction captures that status straight away, before any other code has a chance to change the condition register.

Well, we're almost done! The only thing left is to return the value from the system call. We see that __sc_ret is set from r3 and __sc_err is set from r0. This is interesting; what are these two values all about?

One is the return value, and one is the error value. Why do we need two variables? System calls can fail, just as any other function. The problem is that a system call can return any possible value; we cannot say "a negative value indicates failure" since a negative value might be perfectly acceptable for some particular system call.

So our system call function, before returning, ensures its result is in register r3, and signals failure by setting the CR0.SO bit of the condition register (the mfcr instruction after sc copied the condition register into r0, which is where __sc_err comes from). We test that bit with the mask 0x10000000; if it is set, the call failed and r3 holds an error code rather than a result. In that case we set the global errno value to it (this is the standard variable for getting error information on call failure) and set the return to be -1. Of course, if a valid result is received we return it directly.

So our calling function should check the return value is not -1; if it is it can check errno to find the exact reason why the call failed.
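From the caller's side this convention looks the same on every architecture. As a small sketch (the helper name failed_close_errno is our own), we can provoke a failure on purpose by closing a file descriptor that cannot be valid and then look at errno.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: returns the errno value left behind by a
 * deliberately failing close(), or 0 if the call did not fail. */
int failed_close_errno(void)
{
	errno = 0;
	if (close(-1) == -1)	/* -1 is never a valid descriptor */
		return errno;	/* the exact reason for the failure */
	return 0;
}
```

The value returned is EBADF, the "bad file descriptor" error, which the library stored in errno before handing -1 back to us.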

And that is an entire system call on a PowerPC!

3.2.2 x86 system calls

Below we have the same interface as implemented for the x86 processor.

/* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */

#define __syscall_return(type, res)				\
do {								\
        if ((unsigned long)(res) >= (unsigned long)(-125)) {	\
                errno = -(res);					\
                res = -1;					\
        }							\
        return (type) (res);					\
} while (0)

/* XXX - _foo needs to be __foo, while __NR_bar could be _NR_bar. */
#define _syscall0(type,name)			\
type name(void)					\
{						\
long __res;					\
__asm__ volatile ("int $0x80"			\
        : "=a" (__res)				\
        : "0" (__NR_##name));			\
__syscall_return(type,__res);			\
}

#define _syscall1(type,name,type1,arg1)			\
type name(type1 arg1)					\
{							\
long __res;						\
__asm__ volatile ("int $0x80"				\
        : "=a" (__res)					\
        : "0" (__NR_##name),"b" ((long)(arg1)));	\
__syscall_return(type,__res);				\
}

#define _syscall2(type,name,type1,arg1,type2,arg2)			\
type name(type1 arg1,type2 arg2)					\
{									\
long __res;								\
__asm__ volatile ("int $0x80"						\
        : "=a" (__res)							\
        : "0" (__NR_##name),"b" ((long)(arg1)),"c" ((long)(arg2)));	\
__syscall_return(type,__res);						\
}

#define _syscall3(type,name,type1,arg1,type2,arg2,type3,arg3)		\
type name(type1 arg1,type2 arg2,type3 arg3)				\
{									\
long __res;								\
__asm__ volatile ("int $0x80"						\
        : "=a" (__res)							\
        : "0" (__NR_##name),"b" ((long)(arg1)),"c" ((long)(arg2)),	\
                  "d" ((long)(arg3)));					\
__syscall_return(type,__res);						\
}

#define _syscall4(type,name,type1,arg1,type2,arg2,type3,arg3,type4,arg4)	\
type name (type1 arg1, type2 arg2, type3 arg3, type4 arg4)			\
{										\
long __res;									\
__asm__ volatile ("int $0x80"							\
        : "=a" (__res)								\
        : "0" (__NR_##name),"b" ((long)(arg1)),"c" ((long)(arg2)),		\
          "d" ((long)(arg3)),"S" ((long)(arg4)));				\
__syscall_return(type,__res);							\
}

#define _syscall5(type,name,type1,arg1,type2,arg2,type3,arg3,type4,arg4,	\
          type5,arg5)								\
type name (type1 arg1,type2 arg2,type3 arg3,type4 arg4,type5 arg5)		\
{										\
long __res;									\
__asm__ volatile ("int $0x80"							\
        : "=a" (__res)								\
        : "0" (__NR_##name),"b" ((long)(arg1)),"c" ((long)(arg2)),		\
          "d" ((long)(arg3)),"S" ((long)(arg4)),"D" ((long)(arg5)));		\
__syscall_return(type,__res);							\
}

#define _syscall6(type,name,type1,arg1,type2,arg2,type3,arg3,type4,arg4,			\
          type5,arg5,type6,arg6)								\
type name (type1 arg1,type2 arg2,type3 arg3,type4 arg4,type5 arg5,type6 arg6)			\
{												\
long __res;											\
__asm__ volatile ("push %%ebp ; movl %%eax,%%ebp ; movl %1,%%eax ; int $0x80 ; pop %%ebp"	\
        : "=a" (__res)										\
        : "i" (__NR_##name),"b" ((long)(arg1)),"c" ((long)(arg2)),				\
          "d" ((long)(arg3)),"S" ((long)(arg4)),"D" ((long)(arg5)),				\
          "0" ((long)(arg6)));									\
__syscall_return(type,__res);									\
}
Example 3.2.2.1 x86 system call example

The x86 architecture is very different from the PowerPC that we looked at previously. The x86 is classed as a CISC processor as opposed to the RISC PowerPC, and has dramatically fewer registers.

Start by looking at the simplest _syscall0 macro. It simply calls the int instruction with a value of 0x80. This instruction makes the CPU raise interrupt 0x80, which will jump to code that handles system calls in the kernel.

We can start inspecting how to pass arguments with the longer macros. Notice how the PowerPC implementation cascaded macros downwards, adding one argument at a time. This implementation has slightly more copied code, but is a little easier to follow.

x86 register names are based around letters, rather than the numerical based register names of PowerPC. We can see from the zero argument macro that only the A register gets loaded; from this we can tell that the system call number is expected in the EAX register. As we start loading registers in the other macros you can see the short names of the registers in the arguments to the __asm__ call.
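The listing uses int $0x80, the 32-bit x86 entry point. As a hedged sketch of the same register idea on a 64-bit Linux machine, the function below (raw_getpid is our own name) uses the x86-64 syscall instruction instead, which is a different trap instruction from the one in the listing; the system call number still goes in through the A register and the result comes back out of it. On any other platform the sketch simply falls back to the library's syscall() function.

```c
#define _GNU_SOURCE        /* exposes syscall() in <unistd.h> on glibc */
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: getpid made "by hand" where we know how. */
long raw_getpid(void)
{
#if defined(__linux__) && defined(__x86_64__)
	long res;

	/* "=a" ties res to RAX for the result; "0" reuses that same
	 * register to pass the system call number in.  The syscall
	 * instruction itself overwrites RCX and R11, so we list them
	 * as clobbered. */
	__asm__ volatile ("syscall"
		: "=a" (res)
		: "0" ((long) SYS_getpid)
		: "rcx", "r11", "memory");
	return res;
#else
	/* Assumption: elsewhere we let the system library do the work. */
	return syscall(SYS_getpid);
#endif
}
```

Either branch should report the same process ID as an ordinary call to getpid().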

We see something a little more interesting in _syscall6, the macro taking 6 arguments. Notice the push and pop instructions? These work with the stack on x86, "pushing" a value onto the top of the stack in memory, and "popping" the value from the stack back into a register. Thus in the case of having six arguments we need to store the value of the ebp register in memory, put our argument into it (the mov instruction), make our system call and then restore the original value into ebp. Here you can see the disadvantage of not having enough registers; stores to memory are expensive so the more you can avoid them, the better.

Another thing you might notice is that nothing follows the int instruction in the way that mfcr followed sc on the PowerPC. That instruction was there to collect the error status from the condition register; on x86 the kernel does not report errors through a separate flag, so there is nothing extra to collect after the call.

The only thing left to contrast is the return value. On the PowerPC we had two registers with return values from the kernel, one with the value and one with an error indication. However on x86 we only have one return value that is passed into __syscall_return. That macro casts the return value to unsigned long and compares it to an (architecture and kernel dependent) range of negative values that might represent error codes (note that the errno value is positive, so the negative result from the kernel is negated). However, this means that system calls cannot return small negative values, since they are indistinguishable from error codes. Some system calls that have this requirement, such as getpriority(), add an offset to their return value to force it to always be positive; it is up to userspace to realise this and subtract this constant value to get back to the "real" value.
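The range check can be tried out on its own as an ordinary function. This is only a sketch of the logic in __syscall_return: decode_result is our own name, and the cut-off of -125 is copied from the listing above rather than from any current kernel.

```c
#include <errno.h>

/* Hypothetical helper mirroring __syscall_return: turn the single raw
 * value the kernel hands back into the usual "-1 and errno" form. */
long decode_result(long res)
{
	/* Seen as unsigned, the small negative numbers are the very
	 * largest values, so one comparison picks out the error range. */
	if ((unsigned long) res >= (unsigned long) (-125)) {
		errno = (int) -res;	/* errno values are positive */
		return -1;
	}
	return res;			/* an ordinary result */
}
```

So decode_result(7) is just 7, decode_result(-9) sets errno to 9 and returns -1, and a large negative value such as -1000 falls outside the error range and is passed through untouched.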