The gen on function perilogues.

Yes, "perilogue" is a real word — sort of. It's only ever been used as a technical term in computing, and was first used by Edward Anton Schneider of Carnegie-Mellon University in 1976 to mean the start or finish of an operation. Clearly this is a useful term for the combination of a prologue and an epilogue, which are inseparable from each other when it comes to discussions of compiled functions in computer languages, and lack another word for their combination.

As "prologue" comes from the Greek "προ", meaning "before", and as "epilogue" comes from the Greek "επι", meaning "after", so "perilogue" comes from the Greek "περι", meaning "around/about". Indeed, the word "περιλεγειν" actually exists in Classical Greek, in the writings of Hermippus, meaning "to talk around" something.

A function perilogue comprises a function prologue and a function epilogue, which bracket the actual body of a function fore and aft. The prologue and epilogue are strongly coupled to one another, and are effectively a single unit, the perilogue. The prologue sets up an exection environment for the function body, and the epilogue tears it down again.

Activation records and red zones

Function perilogues are, by their natures, specific to instruction set architectures. The standard perilogues for an architecture set up a standard stack frame for the architecture in the perilogue and tear it back down again in the epilogue.

Formally, a stack frame is part of a function's overall activation record. An activation record formally comprises:

a parameter area where the stack-stored parameters (or, in some function calling conventions such as IBM's Optlink calling convention, space reserved for register-stored parameters to be spilled into) are held
a return address where the address of the next instruction to execute within the calling function is stored
one or more saved frame pointers holding the calling functions'/function's saved frame pointer(s)
a save area where non-volatile registers that the function code modifies are saved across the function body
a locals area where function-local variables are stored, the address of the start of which is the current stack pointer
a red zone that is below the stack pointer

Strictly speaking, function activation records don't have to be stored on call stacks. Modern processor architectures employ what is known as dynamic allocation for activation records, where activation records are created and destroyed on the fly, being pushed onto and popped off the top of a call stack. Another possibility, not employed in mainstream architectures nowadays, is static allocation, where it is known that functions are not reëntrant and programs are not multi-threaded. With static allocation, activation records are simply stored in fixed portions of a program's read/write data area, determined at compile/link time.

A red zone is the area immediately below the stack, that can be overwritten at any time as a consequence of asynchronous events occuring during execution of the function: signals, exceptions, or interrupts. Because it is liable to be destroyed at any moment, functions should not attempt to use it for storage.

In many processor architectures, the stack pointer register always points to the bottom of the standard stack frame. When a function needs itself to call other functions (i.e. it is not a so-called leaf function) it has to construct the parameter area for the called function's stack frame. There are two common techniques:

On the x86 architecture, code temporarily modifies the stack pointer to make space for the new parameter area, on the fly, by pushing and popping (or otherwise manipulating) the stack. This is not terribly efficient, requiring specialized optimization hardware in x86 CPUs just to reduce its gross serial depencies upon a single processor register. As explained later, some optimizing compilers for x86 take a more RISC-like approach.
On more RISC-like architectures such as MIPS, the convention is for space large enough to hold the parameter area for any function that will be called (which size the compiler obviously can work out at code generation time, by simply taking the maximum of the sizes of the individual parameter areas) to be pre-allocated by the function prologue, below the locals area, with the stack pointer register pointing to its base. Thus a standard stack frame contains an extra area, a calling parameter area, below the locals area, part or all of which overlaps the parameter areas of the activation records of called functions.

Many processor architectures also have a frame pointer register. Sometimes this points to the saved frame pointers area. Sometimes it points elsewhere within a standard stack frame. The frame pointer isn't strictly speaking used to locate the stack frame itself. The stack pointer does that quite happily, after all. The frame pointer is used to provide simple access, using the shortest/quickest instruction forms, to the locals area and parameters area of a stack frame. For example:

On the x86 architecture, code temporarily modifies the stack pointer to make space for new calling parameter areas, on the fly, by pushing and popping (or otherwise manipulating) the stack. However, The frame pointer register — (E)BP — remains fixed, pointing at the middle of the stack frame. Compilers thus generate code that references function parameters and function-local variables using register-relative addressing via the frame pointer, and that references calling parameters using register-relative addressing via the stack pointer.
With the x86-64 standard function perilogue, the frame pointer is, by convention, exactly 128 bytes above the base of the stack frame. This doesn't necessarily point to any definite part of the stack frame, because stack frames for different functions (of course) have different sizes of locals areas, save areas, calling parameter areas, and so forth, meaning that whatever is at offset 128 is going to depend from exactly what function is called. The purpose of this offset is so that the frame pointer register can be used in preference to the stack pointer register when accessing the stack frame. The 128 byte offset means that short instructions using a register-relative addressing mode with signed 8-bit offsets can access 256 bytes of the stack frame via the frame pointer register, as opposed to only 128 bytes of the stack frame via the stack pointer register.

The standard x86 function perilogue

The standard function perilogue on the x86 architecture comprises the ENTER and LEAVE instructions:

enter N,M
…
leave
ret

This standard x86 perilogue sets up and tears down a standard x86 stack frame. The base of the stack frame is pointed to by the ESP register, and comprises N bytes for function-local variables, followed by M+1 saved frame pointers, pointing to the stack frame(s) of the calling function(s), followed by the caller to the function's return address and the (stack-stored) parameters to the function.

The EBP register points to the first (i.e. immediately enclosing caller's) saved frame pointer, which in turn was the value of the EBP register within that function's standard perilogue.

Although this is the smallest size for such a perilogue, it is not necessarily the fastest to execute, especially in the case where there is only the 1 enclosing stack frame pointer to be saved. Moreover, the ENTER and LEAVE instructions did not exist on the 8086. So traditionally, and in some cases for speed, the standard perilogue (for a non-nested function, where M will be zero) has an alternative form, that does exactly the same thing:

push ebp
mov ebp, esp
sub esp, N
…
mov esp, ebp
pop ebp
ret

Interestingly, this is not optimal. The execution of each PUSH or POP instruction depends from the result of execution of all preceding ones, since each instruction has to read, modify, and write the stack pointer register. Consider the case where the function uses several non-volatile registers (for the sake of exposition, assume the 32-bit OS/2 system API calling convention and a function body that does something like an optimized memcpy() using MOVSD and thus requiring EDI and ESI), and as a consequence its perilogue has to save and restore several non-volatile register values on the call stack in the save area:

push ebp
mov ebp, esp
sub esp, N
push edi
push esi
push ebx
…
pop ebx
pop esi
pop edi
mov esp, ebp
pop ebp
ret

Each PUSH instruction (and, indeed, the RET instruction at the end) has a sequential dependency from the instructions that immediately precede it, because they all need to know the ESP value from their predecessors. Similarly, the second and subsequent POP instructions all have sequential dependencies on their immediate predecessors. Worse still, the incremented ESP register value is then entirely discarded anyway. This does not provide the processor with the ability to schedule multiple operations in parallel, internally, as many x86 processors are nowadays capable of.

The Intel Pentium M and the Intel Pentium Atom have a mechanism called "ESP folding" that ameliorates this to an extent. ESP folding handles the implicit accesses to the ESP register, by PUSH, POP, CALL, and RET, in the Address Generation Unit (AGU). This reduces the impact of the multiple successive PUSH and POP instructions in the aforegiven non-optimal perilogue. However, as Intel's IA-32 Software Optimization Reference Manual explains (see §2.4.1, §3.4.2.6, §12.3.2.2, and §12.3.3.6), this doesn't solve all of the execution speed problems with this perilogue, because it still contains explicit accesses to the ESP register, in the SUB and MOV instructions. As the Reference Manual states mixing the Arithmetic Logic Unit (ALU) with the AGU causes execution stalls. And that's exactly what mixing SUB/MOV with PUSH/POP does.

Moreover: These are but two x86 processors from one vendor that even have ESP folding in the first place. Other processors have no such amelioration even for the sequential dependencies of the PUSH, POP, and RET instructions.

All of these problems go away by employing a more optimal perilogue, whose speedy execution isn't limited to just a certain few Intel x86 processors with special-case support. A more optimal perilogue has just two read-modify-write operations on ESP, manipulating it once each in prologue/epilogue and then using instructions that have no serial dependencies upon one another to place/retrieve the saved non-volatile registers and saved frame pointer onto/from the stack. (On other, more RISC-like, processor architectures, this is the approach taken by standard function perilogues. Even on x86 architectures, this is the approach taken by some optimizing compilers when setting up the parameter area before calling a function.) Further optimization can be enabled by following Intel's recommendation to not mix the ALU with the AGU, using LEA to calculate the new effective ESP address (which is, after all, what LEA is there for). Yet more optimizations still can be performed by taking advantage of the write-combining properties of the processor's L1 cache, and scheduling the MOV instructions accordingly.

Combining all these results in a standard perilogue that looks like:

LOCALS_SIZE equ …
SAVE_SIZE   equ 16

lea esp, [esp-LOCALS_SIZE-SAVE_SIZE]
mov [esp+LOCALS_SIZE+SAVE_SIZE-16], ebx
mov [esp+LOCALS_SIZE+SAVE_SIZE-12], esi
mov [esp+LOCALS_SIZE+SAVE_SIZE-8], edi
mov [esp+LOCALS_SIZE+SAVE_SIZE-4], ebp
lea ebp, [esp+LOCALS_SIZE+SAVE_SIZE-4]
…
mov ebx, [esp+LOCALS_SIZE+SAVE_SIZE-16]
mov esi, [esp+LOCALS_SIZE+SAVE_SIZE-12]
mov edi, [esp+LOCALS_SIZE+SAVE_SIZE-8]
mov ebp, [esp+LOCALS_SIZE+SAVE_SIZE-4]
lea esp, [esp+LOCALS_SIZE+SAVE_SIZE]
ret

As mentioned, when setting up the parameter area of a new stack frame, in order to call a function, some optimizing compilers (when optimizing for time, not space) use the above approach, thus:

; … code that calculates parameters in local variables …
mov edx,[ebp-SAVE_SIZE-LOCALS_SIZE+LOCAL_VAR_3]
mov ecx,[ebp-SAVE_SIZE-LOCALS_SIZE+LOCAL_VAR_2]
mov eax,[ebp-SAVE_SIZE-LOCALS_SIZE+LOCAL_VAR_1]
lea esp, [esp-PARAMS_SIZE]
mov [esp+8], edx
mov [esp+4], ecx
mov [esp+0], eax
call function
lea esp, [esp+PARAMS_SIZE]

However, other compilers generate code that simply PUSHes and POPs the stack within the body of the function to set up parameter areas. To add insult to injury, the code also uses an ALU instruction, ADD, on the ESP register immediately after the called function has issued an AGU instruction that is "ESP foldable", RET, causing an execution stall.

; … code that calculates parameters in local variables …
mov edx,[ebp-SAVE_SIZE-LOCALS_SIZE+LOCAL_VAR_3]
mov ecx,[ebp-SAVE_SIZE-LOCALS_SIZE+LOCAL_VAR_2]
mov eax,[ebp-SAVE_SIZE-LOCALS_SIZE+LOCAL_VAR_1]
push edx
push ecx
push eax
call function
add esp, PARAMS_SIZE

Both approaches — PUSH and MOV — modify ESP on the fly, allocating and deallocating call stack space for the new parameter area of each individual called function as it is called. In general, therefore, in x86 programming the stack pointer register is not always at a fixed position at the base of the function's standard stack frame (as it is by convention on other instruction architectures). This in turn means that accessing local variables and function parameters is generally done via the frame pointer, using positive constant offsets for function parameters and negative constant offsets for local variables, not the stack pointer.

The standard MIPS function perilogue

The standard function perilogue on the MIPS architecture is fairly similar to the perilogue on the x86 architecture:

subu sp,frame_size
sw ra,frame_size-8(sp)
…
lw ra,frame_size-8(sp)
addu sp,frame_size
jr ra

There are several notable differences:

The MIPS standard perilogue allows greater instruction parallelism by only writing to the stack pointer register once, allowing all register saves/loads to be overlapped since they don't depend from each other's results as a sequence of x86 PUSH and POP instructions do.

Saving a non-volatile register, that the function wants to use for itself, in the save area of the stack frame is merely a matter of an additional sp-relative save/load pair:
```
sw s0,frame_size-16(sp)
…
lw s0,frame_size-16(sp)
```
The MIPS standard perilogue involves saving the return address from a register onto the call stack and then retrieving it from the call stack before returning. The x86 architecture's CALL and RET instructions do this implicitly. In the MIPS architecture, the return address is in the ra ("return address") register, and how (and indeed whether) it is saved into a stack frame is up to each individual function. ra is less of a special case register and more like a simple non-volatile register, that one handles like any other non-volatile register: saving it in the save area if the function needs to re-use it itself (as a non-leaf function will, but a leaf function will not).
By convention, the stack frame on MIPS includes enough space to construct the largest parameters area of any called function, and constructing a parameters area again is a sequence of parallelizable stores that are not interdependent. For example:
```
sw s0,16(sp) # set parameter #5
sw s1,20(sp) # set parameter #6
jal called_function
```

There are also similarities. There are stack pointer (sp) and frame pointer (fp) registers. The frame pointer points to the middle of a stack frame, and the stack pointer points to the bottom.

Stack walking

On x86 architectures, because the frame pointer register (EBP) points to the saved frame pointer of the immediate caller, it is possible to walk the stack, as long as every function employs a standard stack frame. Such a process starts with the current frame pointer register value and follows the saved stack frame pointers until the top of the stack area (or an invalid saved frame pointer value) is reached.

Many compilers targetting x86 platforms provide options for disabling the generation of a standard perilogue for functions that can do without one, or disabling just the part that involves saving and restoring the caller's frame pointer and setting up and tearing down a new one for the called function. However, walking the stack is employed by debuggers, exception handlers, and postmortem analysis utilities. So although many compilers provide these options, one should be aware of the effect that they will have upon debugging, exception handling, and postmortem analyses of a program's execution.

One trick with frame pointers used on 16:32 and 16:16 x86 code is to signal the calling distance of a function, by setting a flag in its saved frame pointer for the calling function. This allows anything walking the stack to know whether the return address in the stack frame is a near (0:16 or 0:32) or a far (16:16 or 16:32) address (and hence where the parameter area begins relative to the frame pointer area).

This marking is done by taking advantage of the fact that in normal use a stack frame will never be aligned to an odd address. (Compilers and operating systems conspire to enforce this. Compilers ensure that they only ever change the value of ESP by an even number of bytes, and operating systems ensure that stack tops are always aligned to a multiple of 2 bytes.) A far function is marked by simple expedient of incrementing and decrementing the saved frame pointer by 1 byte. Bit #0 of the saved frame pointer thus becomes a "far function" flag. Such a perilogue generally looks like this:

inc ebp
enter N,M
…
leave
dec ebp
ret

Stack walking code has, of course, to be aware of whether this convention is used by the libraries on the target platform that the program is running on, and whether the program itself was compiled to employ it, of course.

Stack probes

Both 32-bit OS/2 and Win32 provide applications softwares with the capability of having what are known as decommitted stacks. A decommitted stack is a thread's stack area that starts off allocated but not committed. In other words: The range of virtual address space is allocated to the stack, but there are no virtual memory pages committed into that range of addresses. Decommitted stacks allow applications to specify large stack sizes without incurring (unless actualy required as the program executes) the overhead of all of the virtual memory pages that would be necessary were the stack area fully committed. This is done via a mechanism that involves guard pages.

The details of operation of the guard page mechanism can be found in the 32-bit OS/2 Developers' Toolkit documentation and the MSDN documentation for Win32. Simply put: All pages in a thread's stack down to some point are committed; the next page is a committed page marked as a guard page; and all pages below that are not committed. Accessing a guard page causes a page fault and an application-mode exception, in response to which the operating system automatically turns the guard page into an ordinary committed page (by removing its guard page attribute), to commit the next lower page in the stack (if it can), and (if committed) mark that next lower page as a guard page in its turn.

Thus any application that guarantees that it will access its stack space more-or-less as an actual stack, pushing things onto the top serially, will have its initially entirely decommitted stack automatically committed for it by the operating system as the stack grows downwards. (When the operating system hits the bottom of the stack, usually an exception is raised by the operating system. But that's another discussion, with complexities and subtleties of its own, tangential to this one.)

The x86 standard function perilogue does access its stack space more-or-less like a stack. It pushes activation records onto the call stack, and pops them back off again. The problem ensues when an activation record is larger than a memory page.

More particularly, the problem ensues when the perilogue is using the SUB or LEA instructions to decrement the stack pointer. A PUSH or ENTER instruction decrements the stack pointer, but also touches the memory at the stack pointer address. The write access to the memory location will trigger the guard page mechanism. A perilogue that used PUSH or ENTER to decrement the stack pointer exclusively would never suffer a problem. But perilogues don't always use PUSH or ENTER. Indeed, as already discussed, for the greatest speed optimization a perilogue doesn't use PUSH, POP, ENTER, or LEAVE anywhere, but rather uses LEA to modify the stack pointer register and ESP-relative MOV instructions to save and restore registers and frame pointers. It doesn't necessarily incorporate those MOV instructions in strictly descending order of stack location, either.

How various compilers actually incorporate stack probes into their function perilogues:

IBM VisualAge C/C++ for OS/2: Stack probes are generated longhand, inline in the function perilogue itself.
Watcom C/C++: Calls to different internal run-time library support functions are generated according to whether RTL checks are performed in addition to the platform's stack probe requirements.

If RTL stack checking is enabled as well:
```
push locals_size + call_parms_size
call __CHK
sub esp, locals_size
```
If RTL stack checking is disabled:
```
push locals_size + call_parms_size
call __GRO
sub esp, locals_size
```
If RTL stack checking is enabled, and the code was compiled with an older version of the compiler:
```
mov eax,locals_size + call_parms_size
call __STK
sub esp, locals_size
```

MetaWare High C/C++ for OS/2:

Either:

mov eax, locals_size
call __chkstk ; Modifies ESP directly.

Or:

mov eax, locals_size
call __chkstk2
sub esp, locals_size

Borland C/C++:

mov eax, locals_size
call __chkstk ; Modifies ESP directly.

This is where stack probes come in. Stack probes are a simple idea. The compiler generates extra code in the function perilogue if there's a danger that it might skip over more than one page of the stack without actually touching it when pushing the function's activation record.

Usually, compilers simply note when the sizes of the locals area is larger than a page, and spit out extra dummy memory references, immediately after the SUB instruction that decrements ESP, that touch the intervening stack pages in the right, descending, order. There are several complications to this scheme:

ESP cannot be decremented more than 1 page past the lowest stack page access at any point. This is because an asynchronous exception, whose handler would of course immediately start pushing things onto the stack at the current ESP address, may arrive at any point during the stack probe sequence itself. So compilers take one of two approaches:
- They do all stack probe operations before decrementing ESP, by writing into the red zone. It's safe to touch the red zone, just not safe to rely upon it retaining the values that one may have written to it. Stack probes don't care about values of the memory locations touched. Indeed, some stack probe mechanisms write completely arbitrary data themselves (such as zero, or 0xdeadbeef, or whatever happens to be in the EAX register at the time).
- They perform progressive stack probes, decrementing ESP one page at a time. For example, a 12KiB stack probe, unrolled, would look like:
```
lea esp,[esp - 1000h]
mov [esp],eax
lea esp,[esp - 1000h]
mov [esp],eax
lea esp,[esp - 1000h]
mov [esp],eax
```
The range of stack locations that the stack probe has to cover is not simply the size of the locals area. If the function perilogue doesn't use PUSH to save non-volatile registers into the save area, the size of the save area must be added to the probe total as well. Similarly with saved stack frame pointers, if they are stored with MOV rather than PUSH. Basically, however much space is to be skipped over with SUB or LEA is however much space needs to be probed.
An often forgotten part of the function activation record, when it comes to stack probes, is the parameters area. Unless a compiler always uses PUSH to place parameter values onto the call stack, it must also perform stack probes when it calls functions that take more than one page's worth of parameters. Here is a simple C++ program that illustrates a case where stack probes of the calling parameters area are also required:
```
struct s { char b[16384] ; } ;

int f ( s p ) { return p.b[0] ; }

s d ;

int main () { return f(d) ; }
```
Unfortunately, many compilers forget that stack probes are also required for setting up a calling parameters area properly, and don't employ any stack probe generation logic when generating calls to functions. Borland C/C++, MetaWare High C/C++, and EMX C/C++ all overlook this necessity. Watcom C/C++ does not.

It is worth nothing that the latter two points are strong arguments in favour of the MIPS standard function perilogue design over the (non-optimal) x86 standard function perilogue design. In the MIPS approach, setting up enough room to hold the largest parameters area of any called function is a part of the standard function perilogue itself, rather than being deferred to on-the-fly modifications to the stack pointer within the function body. As such, a single stack probe operation can be done in the perilogue that encompasses the sizes of the locals area, the saved frame pointer(s) area, the save area, and the calling parameters areas, all in one gulp

Callback functions in Win16

In Win16, the standard function calling convention for callback functions requires that a function perform additional set up and teardown, within the standard perilogue. This nested perilogue ensures that the DS register within the function has whatever value the AX register had on entry to the function, and takes the form:

push ds
mov ds,ax
…
pop ds

In addition, the prologue must be prefixed with one of two 3-byte prefixes:

mov ax,ds
nop

or:

push ds
pop ax
nop

Seemingly, this is a very long-winded way of doing nothing, setting the DS register to what it already was and splatting the AX register along the way. If the function is not exported and used as a callback, that is exactly what it is.

The point of the extra perilogue that modifies the DS register during function execution is to allow an instance thunk to be set up with the MakeProcInstance() call in order to use the function as a callback. This instance thunk loads a target DS selector value into AX and calls the function. The Windows loader collaborates in this. For every function exported from an EXE or a DLL, it scans its first 3 bytes, and overwrites either of the aforegiven sequences with three nop instructions. (This of course makes it impossible to call them except through an instance thunk.)

The total function perilogue for a Win16 callback function, with the standard perilogue (in "optimize for time" form), the far function call marker, and the Win16 mechanism for making a function instance thunkable, was quite hefty. In 1989, Michael Geary discovered a trick that did away with a lot of this, by observing that the whole instance thunk mechanism wasn't necessary. In EXEs, the SS register already held the proper data segment selector; and in DLLs, one could simply perform a load of a constant which the program image loader would fixup to point to DGROUP for the DLL.

Instance thunkable function "Smart callback" in an EXE "Smart callback" in a DLL Non-callback far function

Comparison of Win16 function perilogues
Instance thunkable function	"Smart callback" in an EXE	"Smart callback" in a DLL	Non-callback far function
`mov ax,ds nop inc ebp enter N,M push ds mov ds,ax … pop ds leave dec ebp ret`	`mov ax,ss inc ebp enter N,M push ds mov ds,ax … pop ds leave dec ebp ret`	`mov ax,DGROUP inc ebp enter N,M push ds mov ds,ax … pop ds leave dec ebp ret`	`inc ebp enter N,M … leave dec ebp ret`

mov ax,ds
nop
inc ebp
enter N,M
push ds
mov ds,ax
…
pop ds
leave
dec ebp
ret

mov ax,ss
inc ebp
enter N,M
push ds
mov ds,ax
…
pop ds
leave
dec ebp
ret

mov ax,DGROUP
inc ebp
enter N,M
push ds
mov ds,ax
…
pop ds
leave
dec ebp
ret

inc ebp
enter N,M
…
leave
dec ebp
ret

© Copyright 2010 Jonathan de Boyne Pollard. "Moral" rights asserted.
Permission is hereby granted to copy and to distribute this web page in its original, unmodified form as long as its last modification datestamp is preserved.