How to Optimize Programs

2016-04-10

This is a study note for Chapter 5 of CSAPP (Computer Systems: A Programmer's Perspective).

Writing an efficient program requires both choosing appropriate algorithms and data structures and writing source code that the compiler can optimize effectively.

When optimizing code, maintain simplicity and readability, since the code will eventually need to be maintained and extended.

1 Reducing Memory Accesses

Consider the following two functions:

void twiddle1(int *xp, int *yp)
{
    *xp += *yp;
    *xp += *yp;
}

void twiddle2(int *xp, int *yp)
{
    *xp += 2 * *yp;
}

twiddle1() and twiddle2() have similar computational behavior. However, twiddle2() performs only three memory references: reading *xp, reading *yp, and writing *xp. twiddle1() performs six: it reads *xp twice, reads *yp twice, and writes *xp twice, so it is less efficient in practice.

The two functions are not entirely equivalent. If xp and yp point to the same memory location (xp == yp), then twiddle1() leaves *xp at 4 times its original value, while twiddle2() leaves it at 3 times its original value. This situation, where two pointers refer to the same memory location, is called memory aliasing, and it is one reason the compiler cannot safely transform twiddle1() into twiddle2() on its own.

2 Reducing Function Calls

Consider the following two functions:

int f(void);   /* some function, possibly with side effects */

int cycle1(void)
{
    return f() + f() + f() + f();
}

int cycle2(void)
{
    return 4 * f();
}

cycle1() calls f() four times, and each call pays the full overhead of setting up a stack frame, executing f(), and tearing the frame down. cycle2() pays that overhead only once.

The two are not always equivalent, however. If f() has side effects (say, it modifies a global variable), then cycle1() applies the side effect four times while cycle2() applies it only once. This is why the compiler cannot safely perform this transformation automatically; the programmer must do it by hand.

3 Cycles Per Element (CPE)

Consider a process that accumulates all elements of a vector into a single sum (a reduction operation).

For a vector P = (a1, a2, a3, ..., ai, ..., an), the result is computed as:

result = a1 + a2 + a3 + ... + an

Code 1

void combine(vec_pointer ptr, data_t *dest)
{
    long int i;
    *dest = 0;
    for (i = 0; i < length(ptr); i++) { /* length(ptr) returns the number of elements in the vector */
        data_t val;                     /* Local variable to hold the current element */
        get_vec_element(ptr, i, &val);  /* Copy the i-th element of vector ptr into val */
        *dest = *dest + val;            /* Accumulate val into *dest to compute the sum */
    }
}
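Code 1 depends on helpers (length(), get_vec_element()) that are described only in prose. A minimal self-contained sketch of the vector abstraction, assuming data_t is float (the names mirror those used in the code above; new_vec is a hypothetical constructor added here for completeness):

```c
#include <stdlib.h>

typedef float data_t;

typedef struct {
    long int len;
    data_t *data;
} vec_rec, *vec_pointer;

/* Return the number of elements in the vector */
long int length(vec_pointer ptr) { return ptr->len; }

/* Copy the i-th element into *val; return 0 on a bad index, 1 on success */
int get_vec_element(vec_pointer ptr, long int i, data_t *val)
{
    if (i < 0 || i >= ptr->len)
        return 0;
    *val = ptr->data[i];
    return 1;
}

/* Allocate a zero-initialized vector of n elements */
vec_pointer new_vec(long int n)
{
    vec_pointer p = malloc(sizeof(vec_rec));
    if (p) {
        p->len = n;
        p->data = calloc(n, sizeof(data_t));
    }
    return p;
}
```

With these definitions in place, Code 1 compiles and runs as written.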

Code 1 uses a for loop to accumulate the sum. Regardless of the vector size, each call to combine() incurs stack-frame setup and teardown, which takes a fixed time S. As the number of elements n varies, the total time is T = S + n * L, where L is the time per loop iteration measured in clock cycles; the book calls this quantity Cycles Per Element (CPE).

4 Eliminating Loop Inefficiencies — Code Motion

In Code 1's for loop, length() is called on every iteration to get the number of vector elements. Since the length does not change during the loop, storing it in a local variable before the loop removes these redundant calls and can significantly improve performance. This transformation is called code motion.

Code 2

void combine(vec_pointer ptr, data_t *dest)
{
    long int i;
    *dest = 0;
    long int v_length = length(ptr);   /* Store length(ptr) result in a local variable */
    for (i = 0; i < v_length; i++) {
        data_t val;
        get_vec_element(ptr, i, &val);
        *dest = *dest + val;
    }
}

5 Reducing Procedure Calls

In Code 2's for loop, get_vec_element() is called each time to get the i-th element, which is also costly. A reasonable alternative is to directly obtain the vector data, store it in a local variable, and access it as needed.

Code 3

/*
 * Given the definition of vec_pointer struct
 */
typedef struct {
    long int len;
    data_t *data;
} vec_rec, *vec_pointer;                /* vec_pointer is a pointer to vec_rec */

/*
 * New function
 */
data_t *get_vec_start(vec_pointer ptr)
{
    return ptr->data;                   /* Directly get the data pointer of the vector */
}

void combine(vec_pointer ptr, data_t *dest)
{
    long int i;
    *dest = 0;
    long int v_length = length(ptr);
    data_t *data = get_vec_start(ptr);  /* Store vector data in a local array pointer */
    for (i = 0; i < v_length; i++) {
        *dest = *dest + data[i];
    }
}

6 Eliminating Unnecessary Memory References

Below is the assembly code for the for loop in Code 3:

/*
 * code3: data_t = float
 * i in %rdx, data in %rax, dest in %rbp, bound in %r12
 */

.L498:
    movss (%rbp), %xmm0           /* Load *dest into %xmm0 */
    addss (%rax, %rdx, 4), %xmm0  /* Add data[i] to it */
    movss %xmm0, (%rbp)           /* Store the result back to *dest */
    addq  $1, %rdx                /* Increment i */
    cmpq  %rdx, %r12              /* Compare i against the bound */
    jg    .L498                   /* If i < bound, loop again */

In Code 3's loop, each iteration reads the value at dest, adds data[i] to it, and writes the result back. But the value read at the start of each iteration is exactly the value written at the end of the previous one, so this memory round trip is wasted work.

A better solution is to accumulate into a local variable, which the compiler can keep in a register, and write the final result to memory once, after the loop completes.

Code 4

void combine(vec_pointer ptr, data_t *dest)
{
    long int i;
    long int v_length = length(ptr);
    data_t *data = get_vec_start(ptr);
    data_t acc = 0;                     /* Local variable to store accumulated value */
    for (i = 0; i < v_length; i++) {
        acc = acc + data[i];            /* Use accumulator acc to sum data[i] */
    }
    *dest = acc;                        /* Store final result to dest */
}
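CSAPP actually parameterizes combine over the operation with macros, so the same code measures both addition and multiplication. A sketch of that pattern (combine_acc is a hypothetical name; the book uses IDENT and OP exactly as below):

```c
typedef float data_t;

/* Addition uses IDENT 0 / OP +; multiplication would use IDENT 1 / OP * */
#define IDENT 0
#define OP +

data_t combine_acc(const data_t *data, long int n)
{
    long int i;
    data_t acc = IDENT;            /* accumulator held in a register */
    for (i = 0; i < n; i++)
        acc = acc OP data[i];      /* expands to acc + data[i] here */
    return acc;
}
```

Switching the two macros is enough to re-run every experiment in this chapter for a different operation or data type.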

Corresponding assembly code:

/*
 * code4: data_t = float
 * i in %rdx, data in %rax, bound in %rbp, acc in %xmm0
 */

.L488:
    addss (%rax, %rdx, 4), %xmm0  /* Add data[i] to acc, kept in %xmm0 */
    addq  $1, %rdx                /* Increment i */
    cmpq  %rdx, %rbp              /* Compare i against the bound */
    jg    .L488                   /* If i < bound, loop again */

7 Loop Unrolling

At this point the loop body is quite lean, but the loop itself still has overhead. If we can reduce the number of iterations while keeping the result correct, we can achieve further improvements.

Loop unrolling is a program transformation that reduces the number of iterations by increasing the number of elements computed per iteration. It improves performance in two ways:

  1. Reduces the number of operations that don't directly contribute to the program result, such as loop index calculations and conditional branches;
  2. Enables further transformations that reduce the number of operations on the critical path of the computation.

Critical path: A chain of data dependencies formed during repeated loop execution.

Code 5

void combine(vec_pointer ptr, data_t *dest)
{
    long int i;
    long int v_length = length(ptr);
    long int limit = v_length - 1;
    data_t *data = get_vec_start(ptr);
    data_t acc = 0;

    /* Loop 1 */
    for (i = 0; i < limit; i += 2) {          /* Step by 2 */
        acc = (acc + data[i]) + data[i + 1];  /* Accumulate next two data[i] elements */
    }

    /* Loop 2 */
    for (; i < v_length; i++) {               /* Accumulate remaining elements */
        acc = acc + data[i];
    }
    *dest = acc;
}

Code 5 contains two for loops:

  1. The first loop steps by 2, so we must ensure it never reads out of bounds (in particular for data[i + 1]). Its body executes only while i < n - 1, so the largest index accessed, i + 1, satisfies i + 1 < (n - 1) + 1 = n.
  2. The second loop picks up the at-most-one remaining element when the vector length is odd.
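The same bound reasoning generalizes to unrolling by any factor K: the main loop runs while i < n - K + 1, so the largest index touched, i + K - 1, stays below n. A 4-way sketch (sum4 is a hypothetical name, operating directly on a data array):

```c
typedef float data_t;

/* 4-way unrolled reduction over data[0..n-1] */
data_t sum4(const data_t *data, long int n)
{
    long int i;
    long int limit = n - 3;   /* n - K + 1 with K = 4 */
    data_t acc = 0;

    /* Main loop: four elements per iteration */
    for (i = 0; i < limit; i += 4)
        acc = acc + data[i] + data[i + 1] + data[i + 2] + data[i + 3];

    /* Cleanup loop: at most K - 1 leftover elements */
    for (; i < n; i++)
        acc = acc + data[i];
    return acc;
}
```

The cleanup loop handles vectors whose length is not a multiple of 4, just as Loop 2 in Code 5 handles odd lengths.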

8 Improving Parallelism

Looking at the computation in Code 4's loop: each computation of acc must wait for the previous iteration's acc to complete.

Similarly, in Code 5's Loop 1: computing acc first calculates acc + data[i], then + data[i+1], which also requires waiting for the previous iteration's acc.

In other words, the computation of acc forms a single sequential computation flow — a critical path. If we can split this flow, we can leverage the CPU's out-of-order execution to compute in parallel and improve efficiency.

A feasible approach: accumulate the even-indexed and odd-indexed elements into separate sums, then add the two sums at the end.

Code 6

void combine(vec_pointer ptr, data_t *dest)
{
    long int i;
    long int v_length = length(ptr);
    long int limit = v_length - 1;
    data_t *data = get_vec_start(ptr);
    data_t acc1 = 0;
    data_t acc2 = 0;

    /* Loop 1 */
    for (i = 0; i < limit; i += 2) {
        acc1 = acc1 + data[i];       /* Accumulate even-indexed elements (i = 0, 2, ...) */
        acc2 = acc2 + data[i + 1];   /* Accumulate odd-indexed elements (i = 1, 3, ...) */
    }

    /* Loop 2 */
    for (; i < v_length; i++) {      /* Accumulate remaining elements */
        acc1 = acc1 + data[i];
    }
    *dest = acc1 + acc2;             /* Sum the two accumulators */
}

Code 6 has two critical paths, each roughly half as long as the single path in Code 5, and the two chains can execute in parallel.
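The idea extends to more accumulators, as long as enough registers remain available (see point 1 in section 10). A sketch with four independent accumulators (sum4x4 is a hypothetical name):

```c
typedef float data_t;

/* Reduction with four parallel accumulators: four independent dependency chains */
data_t sum4x4(const data_t *data, long int n)
{
    long int i;
    long int limit = n - 3;   /* n - K + 1 with K = 4 */
    data_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;

    for (i = 0; i < limit; i += 4) {
        a0 += data[i];        /* each accumulator only waits on itself */
        a1 += data[i + 1];
        a2 += data[i + 2];
        a3 += data[i + 3];
    }
    for (; i < n; i++)        /* cleanup for leftover elements */
        a0 += data[i];
    return (a0 + a1) + (a2 + a3);
}
```

Because the four additions in the loop body have no data dependence on one another, the CPU can issue them in parallel rather than serializing on a single acc.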

9 Reassociation Transformation

Consider the computation line in Code 5's Loop 1:

acc = (acc + data[i]) + data[i + 1];

Transform it to:

acc = acc + (data[i] + data[i + 1]);

Although this changes only the position of the parentheses, it significantly improves performance. In the first form, the addition acc + data[i] must wait for the previous iteration's acc to complete. In the second form, data[i] + data[i+1] has no such dependence. Leveraging the CPU's out-of-order execution, the second form can compute data[i] + data[i+1] for the next iteration while still finishing the previous iteration's acc, improving efficiency. (Note that floating-point addition is not associative, so for data_t = float this transformation can change the rounding of the result slightly; for most applications the difference is acceptable.)

Code 7

void combine(vec_pointer ptr, data_t *dest)
{
    long int i;
    long int v_length = length(ptr);
    long int limit = v_length - 1;
    data_t *data = get_vec_start(ptr);
    data_t acc = 0;

    /* Loop 1 */
    for (i = 0; i < limit; i += 2) {
        acc = acc + (data[i] + data[i + 1]);  /* Reassociation transformation */
    }

    /* Loop 2 */
    for (; i < v_length; i++) {
        acc = acc + data[i];
    }
    *dest = acc;
}

10 Some Considerations

  1. Loop parallelism cannot be increased indefinitely. Once the number of accumulators exceeds the number of available registers, the compiler will spill the excess variables to the stack, causing a dramatic performance drop.

  2. Modern CPUs predict branches and execute speculatively past them. Mispredictions, however, incur significant performance penalties.

    • First, don't worry much about branches that are highly predictable (P361);
    • Second, for hard-to-predict branches, prefer conditional data transfer over conditional control transfer.

A compiler can translate v = test-expr ? then-expr : else-expr; into either of two forms:

/* Conditional data transfer */
vt = then-expr;
v = else-expr;
t = test-expr;
if (t) v = vt;

/* Conditional control transfer */
if (!test-expr)
    goto false;
v = then-expr;
goto done;
false:
v = else-expr;
done:
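A concrete C illustration of the two styles, based on the minmax pattern discussed in the book (sketched here; minmax1/minmax2 are the illustrative names): minmax1 branches on each comparison, while minmax2 computes both candidate values unconditionally and selects between them, which compilers typically lower to conditional-move instructions.

```c
/* Conditional control transfer: a branch the CPU must predict each iteration.
 * If the a[i] > b[i] outcome is random, mispredictions are frequent. */
void minmax1(long a[], long b[], long n)
{
    long i;
    for (i = 0; i < n; i++) {
        if (a[i] > b[i]) {
            long t = a[i];
            a[i] = b[i];
            b[i] = t;
        }
    }
}

/* Conditional data transfer: both values are computed, then one is selected.
 * Compilers typically emit cmov-style code with no branch to mispredict. */
void minmax2(long a[], long b[], long n)
{
    long i;
    for (i = 0; i < n; i++) {
        long min = a[i] < b[i] ? a[i] : b[i];
        long max = a[i] < b[i] ? b[i] : a[i];
        a[i] = min;
        b[i] = max;
    }
}
```

Both functions leave each pair (a[i], b[i]) ordered as (min, max); on data with unpredictable comparisons, the second form usually runs substantially faster.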