Module-3
C Compilers and Optimization
• C Compilers and Optimization: Basic C Data Types, C Looping
Structures, Register Allocation, Function Calls, Pointer Aliasing,
Portability Issues.
• The aim of this module is to help you write C code in a style that will
compile efficiently on the ARM architecture.
• This includes small examples to show how the compiler translates C
source to ARM assembler.
• Once we understand the translation process, we can more easily distinguish fast C code from slow C code.
C compilers and optimization
• This module gives an idea of the problems the C compiler faces when optimizing your code.
• Understanding these problems helps us write source code that compiles more efficiently, in terms of increased speed and reduced code size.
• Optimizing code takes time and reduces source code readability.
• Usually, it’s only worth optimizing functions that are frequently
executed and important for performance.
• A performance profiling tool, found in most ARM simulators, can be used to find these frequently executed functions.
• C compilers have to translate your C function literally into assembler
so that it works for all possible inputs.
Basic C Datatypes
• Some of these datatypes are more efficient to use for local variables
than others.
• There are also differences between the addressing modes available
when loading and storing data of each type.
• ARM processors have 32-bit registers and 32-bit data processing
operations.
• The ARM architecture is a RISC load/store architecture.
• Loads that act on 8- or 16-bit values extend the value to 32 bits before
writing to an ARM register.
• Unsigned values are zero-extended, and signed values sign-extended.
• This means that the cast of a loaded value to an int type does not cost
extra instructions.
• Similarly, a store of an 8- or 16-bit value selects the lowest 8 or 16 bits of the register.
• The cast of an int to a smaller type does not cost extra instructions on a store (see the sketch at the end of this section).
• The ARMv4 architecture and above support signed 8-bit and 16-bit
loads and stores directly, through new instructions. Since these
instructions are a later addition, they do not support as many
addressing modes as the pre-ARMv4 instructions.
• ARMv5 adds instruction support for 64-bit loads and stores.
• Prior to ARMv4, ARM processors were not good at handling signed 8-
bit or any 16-bit values. Therefore ARM C compilers define char to be
an unsigned 8-bit value, rather than a signed 8-bit value as is typical in
many other compilers.
• Compilers armcc and gcc use the datatype mappings in the given
table for an ARM target.
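As a minimal sketch of the points above (the function names below are illustrative, not from the original text), widening a loaded byte and narrowing a stored one cost no extra instructions, because the byte load zero-extends as it loads and the byte store writes only the lowest eight bits of the register:

    int read_byte(unsigned char *p)
    {
        return *p;                    /* LDRB zero-extends to 32 bits: the widening cast is free */
    }

    void write_byte(unsigned char *p, int value)
    {
        *p = (unsigned char)value;    /* STRB stores the lowest 8 bits: the narrowing cast is free */
    }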
Local Variable Types
• ARMv4-based processors can efficiently load and store 8-, 16-, and
32-bit data.
• However, most ARM data processing operations are 32-bit only.
• For this reason, you should use a 32-bit datatype, int or long, for local
variables wherever possible.
• Avoid using char and short as local variable types, even if you are
manipulating an 8- or 16-bit value.
• The one exception is when you want wrap-around to occur. If you
require modulo arithmetic of the form 255 + 1 = 0, then use the char
type.
The following code checksums a data packet containing 64 words. A checksum function sums the values in a data packet. Most communication protocols (such as TCP/IP) have a checksum or cyclic redundancy check (CRC) routine to check for errors in a data packet. Compare the compiler output for this function when the loop counter i is declared as a char with the output when i is an unsigned int.
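A sketch of the two versions under discussion is shown below, using the checksum_v1 and checksum_v2 names referred to later in this module; the exact code on the original slides may differ slightly.

    int checksum_v1(int *data)
    {
        char i;              /* 8-bit loop counter */
        int sum = 0;

        for (i = 0; i < 64; i++)
        {
            sum += data[i];
        }
        return sum;
    }

    int checksum_v2(int *data)
    {
        unsigned int i;      /* 32-bit loop counter */
        int sum = 0;

        for (i = 0; i < 64; i++)
        {
            sum += data[i];
        }
        return sum;
    }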
• In the first case, the compiler inserts an extra AND instruction to
reduce i to the range 0 to 255 before the comparison with 64.
• This instruction disappears in the second case.
• Reducing the number of instructions speeds up program execution and thereby increases the performance of the system.
Now suppose the data packet contains 16-bit values and we need a 16-bit checksum. It is tempting to write the following C code:
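A sketch of this tempting but inefficient version, here called checksum_v3 (the name and exact details are assumed):

    short checksum_v3(short *data)
    {
        unsigned int i;
        short sum = 0;                  /* 16-bit partial sum forces narrowing casts */

        for (i = 0; i < 64; i++)
        {
            sum = (short)(sum + data[i]);
        }
        return sum;
    }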
The expression sum + data[i] is an integer and so can only be assigned to a short using an (implicit or explicit) narrowing cast. In the compiled assembly output, the compiler must insert extra instructions to implement the narrowing cast.
• The loop is now three instructions longer than the loop in checksum_v2 earlier! There are two reasons for the extra instructions:
• The LDRH instruction does not allow for a shifted address offset as the LDR
instruction did in checksum_v2. Therefore the first ADD in the loop calculates
the address of item i in the array. The LDRH loads from an address with no
offset. LDRH has fewer addressing modes than LDR as it was a later addition
to the ARM instruction set. (See Table 5.1.)
• The cast reducing sum + data[i] to a short requires two MOV instructions. The compiler shifts left by 16 and then right by 16 to implement a 16-bit sign extend. The shift right is a sign-extending shift, so it replicates the sign bit to fill the upper 16 bits.
• We can avoid the second problem by using an int type variable to
hold the partial sum. We only reduce the sum to a short type at the
function exit.
• The first problem can be solved by accessing the array by
incrementing the pointer data rather than using an index as in data[i].
This is efficient regardless of array type size or element size. All ARM
load and store instructions have a post-increment addressing mode.
The checksum_v4 code fixes all the problems discussed above. It uses int type
local variables to avoid unnecessary casts. It increments the pointer data
instead of using an index offset data[i].
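A sketch of checksum_v4 along these lines (the exact details are assumed):

    short checksum_v4(short *data)
    {
        unsigned int i;
        int sum = 0;                 /* 32-bit partial sum: no casts inside the loop */

        for (i = 0; i < 64; i++)
        {
            sum += *(data++);        /* pointer increment instead of data[i] */
        }
        return (short)sum;           /* single narrowing cast at function exit */
    }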
Function Argument Types
• Converting local variables from types char or short to type int
increases performance and reduces code size.
• The same holds for function arguments.
• If the function arguments and return values are of type char or short, either the caller or the callee must perform a cast to the narrow type, so these arguments and return values introduce extra casts.
• These casts increase code size and decrease performance.
• Therefore it is more efficient to use the int type for function arguments and return values, even if you are only passing an 8-bit value, as sketched below.
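A minimal sketch of the difference, using hypothetical function names:

    short add_narrow(short a, short b)   /* narrow argument and return types */
    {
        return (short)(a + b);           /* a narrowing cast is needed at the call boundary */
    }

    int add_wide(int a, int b)           /* int arguments and return value */
    {
        return a + b;                    /* no extra casts in caller or callee */
    }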
Signed versus Unsigned Types
• The previous sections demonstrate the advantages of using int rather
than a char or short type for local variables and function arguments.
• This section compares the efficiencies of signed int and unsigned int.
• If your code uses addition, subtraction, and multiplication, then there
is no performance difference between signed and unsigned
operations. However, there is a difference when it comes to division.
• Consider the following short example that averages two integers (sketched at the end of this section):
• The compiler adds one to the sum before shifting right if the sum is negative. In other words, it replaces x/2 by the statement:
(x<0) ? ((x+1) >> 1): (x >> 1)
• It must do this because x is signed.
• In C on an ARM target, a divide by two is not a right shift if x is negative.
• For example, −3 >> 1 = −2 but −3/2 = −1.
• Division rounds towards zero, but arithmetic right shift rounds towards
−∞.
• It is more efficient to use unsigned types for divisions. The compiler
converts unsigned power of two divisions directly to right shifts.
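A sketch of the averaging example referred to above, in signed and unsigned form (the function names are assumed):

    int average_signed(int a, int b)
    {
        /* Signed division must round towards zero, so the compiler cannot
           use a plain arithmetic shift when a + b may be negative. */
        return (a + b) / 2;
    }

    unsigned int average_unsigned(unsigned int a, unsigned int b)
    {
        /* Unsigned division by a power of two compiles directly to a
           logical shift right. */
        return (a + b) / 2;
    }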
The Efficient Use of C Types
• For local variables held in registers, don’t use a char or short type unless 8-bit or 16-bit modular
arithmetic is necessary. Use the signed or unsigned int types instead. Unsigned types are faster
when you use divisions.
• For array entries and global variables held in main memory, use the type with the smallest size
possible to hold the required data. This saves memory footprint. The ARMv4 architecture is
efficient at loading and storing all data widths provided you traverse arrays by incrementing the
array pointer. Avoid using offsets from the base of the array with short type arrays, as LDRH does
not support this.
• Use explicit casts when reading array entries or global variables into local variables, or writing
local variables out to array entries. The casts make it clear that for fast operation you are taking a
narrow width type stored in memory and expanding it to a wider type in the registers. Switch on
implicit narrowing cast warnings in the compiler to detect implicit casts.
• Avoid implicit or explicit narrowing casts in expressions because they usually cost extra cycles.
Casts on loads or stores are usually free because the load or store instruction performs the cast
for you.
• Avoid char and short types for function arguments or return values. Instead use the int type
even if the range of the parameter is smaller. This prevents the compiler performing unnecessary
casts.
C Looping Structures
• This section looks at the most efficient ways to code for and while
loops on the ARM.
• We start by looking at loops with a fixed number of iterations and
then move on to loops with a variable number of iterations.
• Finally we look at loop unrolling.
Loops with a Fixed Number of Iterations
• It takes three instructions to implement the incrementing for loop used in the checksum examples above:
• An ADD to increment i
• A compare to check if i is less than 64
• A conditional branch to continue the loop if i < 64
• This is not efficient. On the ARM, a loop should only use two
instructions:
• A subtract to decrement the loop counter, which also sets the condition code
flags on the result
• A conditional branch instruction
• The key point is that the loop counter should count down to zero
rather than counting up to some arbitrary limit.
• Then the comparison with zero is free since the result is stored in the
condition flags.
• Since we are no longer using i as an array index, there is no problem
in counting down rather than up.
This example shows the improvement if we switch to a decrementing loop rather than an incrementing loop.
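A sketch of the decrementing version, here called checksum_v5 (details assumed):

    int checksum_v5(int *data)
    {
        unsigned int i;
        int sum = 0;

        for (i = 64; i != 0; i--)    /* count down to zero */
        {
            sum += *(data++);
        }
        return sum;
    }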
• The SUBS and BNE instructions implement the loop.
• Our checksum example now has the minimum number of four
instructions per loop.
• For an unsigned loop counter i we can use either of the loop
continuation conditions i!=0 or i>0. As i can’t be negative, they are the
same condition.
• For a signed loop counter, we should use the condition i>0 to
continue the loop.
Loops Using a Variable Number of Iterations
• A do-while loop gives better performance and code density than a for
loop.
Now suppose we want our checksum routine to handle packets of arbitrary size. We pass in a variable N giving the number of words in the data packet. Using the lessons from the last section, we count down until N = 0 and don’t require an extra loop counter i.
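A sketch of this routine, here called checksum_v6 (details assumed; N is taken to be at least one):

    int checksum_v6(int *data, unsigned int N)
    {
        int sum = 0;

        do
        {
            sum += *(data++);
        } while (--N != 0);          /* one subtract and one conditional branch per loop */
        return sum;
    }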
Loop Unrolling
• When implementing a loop, there are additional instructions besides the loop body: a subtract to decrement the loop count and a conditional branch. These instructions are called the loop overhead.
• On ARM7 or ARM9 processors the subtract takes one cycle and the
branch three cycles, giving an overhead of four cycles per loop.
• We can save some of these cycles by unrolling a loop, that is, repeating the loop body several times and reducing the number of loop iterations by the same proportion.
The following code unrolls our packet checksum loop by four times.
Assume that the number of words in the packet N is a multiple of
four.
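A sketch of the four-times unrolled loop, here called checksum_v7 (details assumed; N must be a nonzero multiple of four):

    int checksum_v7(int *data, unsigned int N)
    {
        int sum = 0;

        do
        {
            sum += *(data++);
            sum += *(data++);
            sum += *(data++);
            sum += *(data++);
            N -= 4;                  /* one decrement and one branch per four additions */
        } while (N != 0);
        return sum;
    }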
The next example handles the checksum of any size of data packet using a loop that has been unrolled four times.
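A sketch of a version that handles any packet size, here called checksum_v8 (details assumed):

    int checksum_v8(int *data, unsigned int N)
    {
        unsigned int i;
        int sum = 0;

        for (i = N / 4; i != 0; i--)     /* unrolled loop: four words per iteration */
        {
            sum += *(data++);
            sum += *(data++);
            sum += *(data++);
            sum += *(data++);
        }
        for (i = N & 3; i != 0; i--)     /* leftover loop: remaining zero to three words */
        {
            sum += *(data++);
        }
        return sum;
    }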
• There are two things we need to ask when unrolling a loop:
■ How many times should we unroll the loop?
• Only unroll loops that are important for the overall performance of the
application. Otherwise unrolling will increase the code size with little
performance benefit. Unrolling may even reduce performance by evicting more
important code from the cache.
■ What if the number of loop iterations is not a multiple of the unroll amount?
• We can try to arrange it so that array sizes are multiples of our unroll amount.
If this isn’t possible, then we must add extra code to take care of the leftover
cases. This increases the code size a little but keeps the performance high.
Writing Loops Efficiently
• Use loops that count down to zero. Then the compiler does not need to allocate a
register to hold the termination value, and the comparison with zero is free.
• Use unsigned loop counters by default and the continuation condition i!=0 rather
than i>0. This will ensure that the loop overhead is only two instructions.
• Use do-while loops rather than for loops when you know the loop will iterate at
least once. This saves the compiler checking to see if the loop count is zero.
• Unroll important loops to reduce the loop overhead. Do not over unroll. If the loop
overhead is small as a proportion of the total, then unrolling will increase code size
and hurt the performance of the cache.
• Try to arrange that the number of elements in arrays is a multiple of four or eight. You can then unroll loops easily by two, four, or eight times without worrying about the leftover array elements.
Register Allocation
• The compiler attempts to allocate a processor register to each local
variable you use in a C function.
• It will try to use the same register for different local variables if the uses of the variables do not overlap.
• When there are more local variables than available registers, the
compiler stores the excess variables on the processor stack.
• These variables are called spilled or swapped out variables since they
are written out to memory.
• Spilled variables are slow to access compared to variables allocated to
registers.
• To implement a function efficiently, you need to
• minimize the number of spilled variables
• ensure that the most important and frequently accessed variables are stored
in registers
• The number of processor registers the ARM C compilers have available
for allocating variables is shown in the following table. It shows the
standard register names and usage when following the ARM-Thumb
procedure call standard (ATPCS), which is used in code generated by C
compilers.
Efficient Register Allocation
• Try to limit the number of local variables in the internal loop of
functions to 12. The compiler should be able to allocate these to ARM
registers.
• You can guide the compiler as to which variables are important by
ensuring these variables are used within the innermost loop.
Function Calls
• The ARM Procedure Call Standard (APCS) defines how to pass function
arguments and return values in ARM registers.
• The ARM-Thumb Procedure Call Standard (ATPCS) covers ARM and
Thumb interworking as well.
• The first four integer arguments are passed in the first four ARM
registers: r0, r1, r2, and r3.
• Subsequent integer arguments are placed on the full descending stack.
• Function return integer values are passed in r0.
• Two-word arguments such as long long or double are passed in a pair of
consecutive argument registers and returned in r0, r1.
• Functions with four or fewer arguments are far more efficient to call
than functions with five or more arguments.
• For functions with four or fewer arguments, the compiler can pass all
the arguments in registers. For functions with more arguments, both
the caller and callee must access the stack for some arguments.
• If your C function needs more than four arguments, or your C++
method more than three explicit arguments, then it is almost always
more efficient to use structures.
• Group related arguments into structures, and pass a structure pointer
rather than multiple arguments.
The program below shows a typical routine to insert N bytes from the array data into a queue. We implement the queue using a cyclic buffer with start address Q_start (inclusive) and end address Q_end (exclusive).
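A sketch of the queue insertion routine in two forms, one passing five separate arguments and one passing a structure pointer (names and details are assumed; N is taken to be at least one):

    /* Five arguments: the fifth (N) no longer fits in r0-r3 and is passed on the stack. */
    char *queue_bytes_v1(char *Q_start, char *Q_end, char *Q_ptr,
                         char *data, unsigned int N)
    {
        do
        {
            *(Q_ptr++) = *(data++);
            if (Q_ptr == Q_end)
            {
                Q_ptr = Q_start;         /* wrap around the cyclic buffer */
            }
        } while (--N);
        return Q_ptr;
    }

    /* Grouping the related arguments into a structure keeps every argument in a register. */
    typedef struct {
        char *Q_start;                   /* buffer start address (inclusive) */
        char *Q_end;                     /* buffer end address (exclusive)   */
        char *Q_ptr;                     /* current insertion position       */
    } Queue;

    void queue_bytes_v2(Queue *queue, char *data, unsigned int N)
    {
        char *Q_ptr = queue->Q_ptr;
        char *Q_end = queue->Q_end;

        do
        {
            *(Q_ptr++) = *(data++);
            if (Q_ptr == Q_end)
            {
                Q_ptr = queue->Q_start;
            }
        } while (--N);
        queue->Q_ptr = Q_ptr;
    }

The structure-based form passes only three arguments, so the caller and callee never touch the stack for argument passing.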
Pointer Aliasing
• Two pointers are said to alias when they point to the same address.
• If you write to one pointer, it will affect the value you read from the
other pointer.
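A minimal sketch of the effect, using a hypothetical function: because timer1 and step might point to the same address, the compiler must reload *step after the first write instead of keeping its value in a register.

    void update_timers(int *timer1, int *timer2, int *step)
    {
        *timer1 += *step;    /* this write may change *step if timer1 aliases step */
        *timer2 += *step;    /* so *step must be loaded from memory again here */
    }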
Portability Issues
• This section summarizes problems we may face when porting C code from another architecture to the ARM architecture.
• The char type. On the ARM, char is unsigned rather than signed as for many other processors. A common problem concerns loops that use a char loop counter i and the continuation condition i ≥ 0; since i is unsigned, these become infinite loops. In this situation, armcc produces a warning of unsigned comparison with zero. You should either use a compiler option to make char signed or change loop counters to type int.
• The int type. Some older architectures use a 16-bit int, which may cause problems when moving to ARM’s 32-bit int type, although this is rare nowadays. Note that expressions are promoted to an int type before evaluation. Therefore if i = -0x1000, the expression i == 0xF000 is true on a 16-bit machine but false on a 32-bit machine.
• Unaligned data pointers. Some processors support the loading of short and int typed values from unaligned addresses. A C program may manipulate pointers directly so that they become unaligned, for example, by casting a char * to an int *. ARM architectures up to ARMv5TE do not support unaligned pointers. To detect them, run the program on an ARM with an alignment checking trap. For example, you can configure the ARM720T to data abort on an unaligned access.
• Endian assumptions. C code may make assumptions about the endianness of a
memory system, for example, by casting a char * to an int *. If you configure the
ARM for the same endianness the code is expecting, then there is no issue.
Otherwise, you must remove endian-dependent code sequences and replace them
by endian-independent ones.
• Function prototyping. The armcc compiler passes arguments narrow, that is, reduced to the range of the argument type. If functions are not prototyped correctly, then the function may return the wrong answer. Other compilers that pass arguments wide may give the correct answer even if the function prototype is incorrect. Always use ANSI prototypes.
• Use of bit-fields. The layout of bits within a bit-field is implementation and endian dependent. If C code assumes that bits are laid out in a certain order, then the code is not portable.
• Use of enumerations. Although enum is portable, different compilers allocate different
numbers of bytes to an enum. The gcc compiler will always allocate four bytes to an enum
type. The armcc compiler will only allocate one byte if the enum takes only eight-bit
values. Therefore you can’t cross-link code and libraries between different compilers if you
use enums in an API structure.
• Inline assembly. Using inline assembly in C code reduces portability between architectures. You should separate any inline assembly into small inlined functions that can easily be replaced. It is also useful to supply reference, plain C implementations of these functions that can be used on other architectures, where this is possible.
• The volatile keyword. Use the volatile keyword on the type definitions of ARM memory-mapped peripheral locations. This keyword prevents the compiler from optimizing away the memory access. It also ensures that the compiler generates a data access of the correct type. For example, if you define a memory location as a volatile short type, then the compiler will access it using 16-bit load and store instructions LDRSH and STRH.
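A minimal sketch, assuming a hypothetical 16-bit timer register mapped at address 0x40000010:

    #define TIMER_COUNT (*(volatile unsigned short *)0x40000010)

    void wait_for_timer(void)
    {
        while (TIMER_COUNT != 0)
        {
            /* volatile forces a fresh 16-bit load of the register on every iteration */
        }
    }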