Abstract
We provide efficient single- and double-precision GPU (Graphics Processing Unit) implementa-
tions of Strassen’s matrix multiplication algorithm as well as of Winograd’s variant of this algorithm.
The single-precision implementations of these two algorithms are compared analytically using the
arithmetic count, device-memory transactions, and device memory to multiprocessor data volume
metrics. Our analysis indicates that, for 16384 × 16384 matrices, our single-precision implementation
of Strassen’s algorithm limited to four levels of recursion reduces the number of arithmetics by 41.3%,
the number of transactions by 33.7%, and the volume by 29.2% relative to the best known GPU
implementation of the classical n3 matrix multiplication algorithm. The corresponding reductions
achieved by Winograd’s variant are 41.3%, 35.1%, and 31.5%. Experimental results obtained using
an NVIDIA C1060 GPU indicate a speedup of 32% for Strassen’s 4-level implementation and 33% for
Winograd’s variant relative to the sgemm code in CUBLAS 3.0 when multiplying 16384 × 16384 ma-
trices. Our double-precision implementations of Strassen’s and Winograd’s algorithms, respectively,
achieve a speedup of 20.2% and 21% relative to dgemm when the matrix size n is 8192. The maximum
numerical errors introduced by Strassen’s and Winograd’s algorithms are about 2 orders of magnitude
higher than those for sgemm when n = 16384 and about 1 order of magnitude higher than for dgemm
when n = 8192. The average numerical errors introduced by Strassen’s and Winograd’s algorithms are,
respectively, 2 and 3 orders of magnitude higher than those for sgemm when n = 16384 and about 1
order of magnitude higher than for dgemm when n = 8192.
Keywords: GPU, CUDA, matrix multiplication, Strassen’s algorithm, Winograd’s
variant, accuracy
1 Introduction
Matrix multiplication is an integral component of the CUDA (Compute Unified Device Architecture)
BLAS library [2] and much effort has been expended in obtaining an efficient CUDA implementation.
The current implementation in the CUDA BLAS library is based on an algorithm due to Volkov and
Demmel [18]. A further 3% reduction (on the NVIDIA Tesla C1060) in run time is achieved by the
algorithm GPU8 [12]. Li, Ranka, and Sahni [12] provide a step-by-step development of efficient GPU
matrix multiplication algorithms beginning with the classical three-loop O(n^3) single-core algorithm to
multiply two n × n matrices. Although significant effort has been expended to obtain efficient GPU
algorithms for matrix multiplication based on the classical O(n^3) single-core algorithm, there appears to
be no work toward obtaining efficient GPU implementations of any of the single-core matrix multiplication algorithms
whose complexity is less than O(n^3). Of these latter lower complexity algorithms, Strassen’s original
O(n^2.81) algorithm [17] and Winograd’s variant [20] of this algorithm, whose asymptotic complexity is
also O(n^2.81), are considered the most practical. Hence, we focus on these two algorithms in this paper.

* This work was supported, in part, by the National Science Foundation under grants CNS0829916, CNS0905308, CCF0903430, NETS 0963812, and NETS 1115184.

Figure 1: Block decomposition of C = A × B: [C11 C12; C21 C22] = [A11 A12; A21 A22] × [B11 B12; B21 B22]
We note that the asymptotically fastest matrix multiplication algorithm at this time has a complexity of
O(n^2.38) [5] and it is believed that “an optimal algorithm for matrix multiplication will run in essentially
O(n^2) time” [14].
Both Strassen’s algorithm and Winograd’s variant compute the product C of two matrices A and B by
first decomposing each matrix into 4 roughly equal sized blocks as in Figure 1. Strassen’s algorithm [17]
computes C by performing 7 matrix multiplications and 18 add/subtracts using the following equations:

M1 = (A11 + A22) ∗ (B11 + B22)        C11 = M1 + M4 − M5 + M7
M2 = (A21 + A22) ∗ B11                C12 = M3 + M5
M3 = A11 ∗ (B12 − B22)                C21 = M2 + M4
M4 = A22 ∗ (B21 − B11)                C22 = M1 − M2 + M3 + M6
M5 = (A11 + A12) ∗ B22
M6 = (A21 − A11) ∗ (B11 + B12)
M7 = (A12 − A22) ∗ (B21 + B22)

When this block decomposition is applied recursively until the block dimensions reach (or fall below)
a threshold value (say τ), the arithmetic complexity of Strassen’s algorithm becomes O(n^2.81).
Winograd’s variant of Strassen’s method uses the following equations to compute C with 7 matrix
multiplies and 15 add/subtracts [8]:
S1 = A21 + A22 M1 = S2 ∗ S6 V1 = M1 + M2
S2 = S1 − A11 M2 = A11 ∗ B11 V2 = V1 + M4
S3 = A11 − A21 M3 = A12 ∗ B21 C11 = M2 + M3
S4 = A12 − S2 M4 = S3 ∗ S7 C12 = V1 + M5 + M6
S5 = B12 − B11 M5 = S1 ∗ S5 C21 = V2 − M7
S6 = B22 − S5 M6 = S4 ∗ B22 C22 = V2 + M5
S7 = B22 − B12 M7 = A22 ∗ S8
S8 = S6 − B21
Although the recursive application of Winograd’s variant also results in an asymptotic complexity of
O(n^2.81), the reduction in the number of matrix add/subtracts from 18 to 15 manifests itself as a slightly
smaller measured run time in practice.
While there appears to be no GPU implementation of either Strassen’s algorithm or Winograd’s variant,
both have been implemented for other architectures. For example, Bailey, Lee, and Simon [1] describe
an implementation of Strassen’s algorithm for the CRAY-2 and CRAY Y-MP. This implementation uses
three temporary (scratch) matrices at each level of the recursion. The total space required by these
temporary matrices is at most n^2. However, the computation can be done using 2 temporaries, T1 and
T2, at each level, using the steps given in Figure 2; this reduces the space required by temporary matrices
to at most 2n^2/3.
Step Computation Comment
1 C12 = A21 − A11
2 C21 = B11 + B12
3 C22 = C12 ∗ C21 M6
4 C12 = A12 − A22
5 C21 = B21 + B22
6 C11 = C12 ∗ C21 M7
7 C12 = A11 + A22
8 C21 = B11 + B22
9 T1 = C12 ∗ C21 M1
10 C11 = T1 + C11 M1 + M7
11 C22 = T1 + C22 M1 + M6
12 T2 = A21 + A22
13 C21 = T2 ∗ B11 M2
14 C22 = C22 − C21 M1 − M2 + M6
15 T1 = B21 − B11
16 T2 = A22 ∗ T1 M4
17 C21 = C21 + T2 M2 + M4
18 C11 = C11 + T2 M1 + M4 + M7
19 T1 = B12 − B22
20 C12 = A11 ∗ T1 M3
21 C22 = C22 + C12 M1 − M2 + M3 + M6
22 T2 = A11 + A12
23 T1 = T2 ∗ B22 M5
24 C12 = C12 + T1 M3 + M5
25 C11 = C11 − T1 M1 + M4 − M5 + M7

Figure 2: Strassen’s algorithm using two temporary matrices T1 and T2
Douglas et al. [8] provide an implementation of Winograd’s variant that uses two temporary matrices
at each level of the recursion. So, this implementation, which is given in Figure 3, also uses at most 2n^2/3
memory for temporary matrices. Douglas et al. [8] report on the performance of their implementation
on a variety of serial and parallel computers.
Huss-Lederman et al. [10, 11] describe two implementations of Winograd’s variant. The first uses two
temporary matrices at each level of the recursion and is identical to the implementation of Douglas et al.
[8] (Figure 3). The second implementation uses 3 temporaries at each level of the recursion. This second
implementation, however, is recommended only for the case when we are using the Winograd variant to
do a multiply-accumulate (i.e., C = αAB +βC) and not when we are doing a straight multiply (C = AB)
as in this paper. So, we do not consider this implementation further in this paper. Boyer et al. [3] show
how to implement Winograd’s variant using no temporary matrix. They provide two implementations.
The first does not increase the number of arithmetic operations but overwrites the input matrices A and
B. Since we do not permit overwriting of the input matrices, we do not consider this implementation.
Although the second in-place implementation does not overwrite the input matrices, it increases the
number of arithmetics by a constant factor. So, we do not consider this implementation either.
Step Computation Comment
1 T1 = A11 − A21 S3
2 T2 = B22 − B12 S7
3 C21 = T1 ∗ T2 M4
4 T1 = A21 + A22 S1
5 T2 = B12 − B11 S5
6 C22 = T1 ∗ T2 M5
7 T1 = T1 − A11 S2
8 T2 = B22 − T2 S6
9 C11 = T1 ∗ T2 M1
10 T1 = A12 − T1 S4
11 C12 = T1 ∗ B22 M6
12 C12 = C22 + C12 M5 + M6
13 T1 = A11 ∗ B11 M2
14 C11 = C11 + T1 V1
15 C12 = C11 + C12 V1 + M5 + M6
16 C11 = C11 + C21 V2
17 T2 = T2 − B21 S8
18 C21 = A22 ∗ T2 M7
19 C21 = C11 − C21 V2 − M7
20 C22 = C11 + C22 V2 + M5
21 C11 = A12 ∗ B21 M3
22 C11 = T1 + C11 M2 + M3

Figure 3: Douglas et al.’s [8] implementation of Winograd’s variant using two temporary matrices
The remainder of this paper is organized as follows. In Section 2, we describe the architecture of the
NVIDIA Tesla C1060 GPU. The CUDA programming model is described in Section 3 and the fastest
O(n^3) GPU matrix multiplication algorithm GPU8 [12] is described in Section 4. Section 5 gives the
basic GPU kernels used in our GPU adaptations of Strassen’s algorithm and Winograd’s variant and also
analyzes these kernels for their device-memory transaction and volume complexity. A one-level GPU
implementation of Strassen’s algorithm and Winograd’s variant (i.e., an implementation that does not
apply Strassen’s and Winograd’s equations recursively) is given in Section 6 and the general multilevel
recursive implementation is given in Section 7. Experimental results for single- and double-precision
implementations of Strassen’s and Winograd’s algorithms are presented in Section 8. We conclude in
Section 9.
Throughout this paper, we assume that n is a power of 2. Adaptations to other values of n may be
done using methods such as padding and peeling [10, 11].
2 GPU Architecture
NVIDIA’s Tesla C1060 GPU, Figure 5, is an example of NVIDIA’s general purpose parallel computing
architecture CUDA (Compute Unified Device Architecture) [16]. Figure 5 is a simplified version of
Figure 4 with N = 30 and M = 8. The C1060 comprises 30 streaming multiprocessors (SMs) and each
SM comprises 8 scalar processors (SPs), 16KB of on-chip shared memory, and 16,384 32-bit registers.
Figure 4: NVIDIA’s GPU hardware model [7]

Each SP has its own integer and single-precision floating point units. Each SM has 1 double-precision
floating-point unit and 2 single-precision transcendental function (special function, SF) units that are
shared by the 8 SPs in the SM. The 240 SPs of a Tesla C1060 share 4GB of off-chip memory referred
to as device or global memory [7]. A C1060 has a peak performance of 933 GFlops of single-precision
floating-point operations and 78 GFlops of double-precision operations. The peak of 933GFlops is for the
case when Multiply-Add (MADD) instructions are dual issued with special function (SF) instructions. In
the absence of SF instructions, the peak is 622GFlops (MADDs only) [19]. The C1060 consumes 188W
of power. The architecture of the NVIDIA Tesla C2050 (also known as Fermi) corresponds to Figure 4
with N = 14 and M = 32. So, a C2050 has 14 SMs and each SM has 32 SPs giving the C2050 a total of
448 SPs or cores. Although each SP of a C2050 has its own integer, single- and double-precision units,
the 32 SPs of an SM share 4 single-precision transcendental function units. An SM has 64KB of on-chip
memory that can be “configured as 48KB of shared memory with 16KB of L1 cache (default setting)
or as 16KB of shared memory with 48KB of L1 cache” [7]. Additionally, there are 32K 32-bit registers
per SM and 3GB of off-chip device/global memory that is shared by all 14 SMs. The peak performance
of a C2050 is 1,288 GFlops (or 1.288TFlops) of single-precision operations and 515GFlops of double-
precision operations and the power consumption is 238W [4]. Once again, the peak of 1,288GFlops
requires that MADDs and SF instructions be dual issued. When there are MADDs alone, the peak
single-precision rate is 1.03 TFlops. Notice that the ratio of power consumption to peak single-precision
GFlop rate is 0.2W/GFlop for the C1060 and 0.18W/GFlop for the C2050. The corresponding ratio for
double-precision operations is 2.4W/GFlop for the C1060 and 0.46W/GFlop for the C2050. In NVIDIA
parlance, the C1060 has compute capability 1.3 while the compute capability of the C2050 is 2.0.
A Tesla GPU is packaged as a double-wide PCIe card (Figure 6) and using an appropriate motherboard
and a sufficiently large power supply, one can install up to 4 GPUs on the same motherboard. In this
paper, we focus on single GPU computation.
3 Programming Model
At a high-level, a GPU uses the master-slave programming model [21] in which the GPU operates as
a slave under the control of a master or host processor. In our experimental setup for single-precision
matrix multiplication, the master or host is a 2.8GHz Xeon quad core processor and the GPU is the
NVIDIA Tesla C1060. For double-precision matrix multiplication, the host is a XXX six core processor
and the GPU is the NVIDIA Tesla C2050. We describe the programming model making explicit reference
to the single-precision setup only. Programming in the master-slave model requires us to write a program
that runs on the master processor (in our case the Xeon). This master program sends data to the slave(s)
(in our case a single C1060 GPU), invokes a kernel or function that runs on the slave(s) and processes
this sent data, and finally receives the results back from the slave. This process of sending data to the
slave, executing a slave kernel, and receiving the computed results may be repeated several times by the
master program. In CUDA, the host/master and GPU/slave codes may be written in C. CUDA provides
extensions to C to allow for data transfer to/from device memory and for kernel/slave code to access
registers, shared memory, and device memory.
At another level, GPUs use the SIMT (single instruction multiple thread) programming model in
which the GPU accomplishes a computational task using thousands of light weight threads. The threads
are grouped into blocks and the blocks are organized as a grid. While a block of threads may be 1-,
2-, or 3-dimensional, the grid of blocks may only be 1- or 2-dimensional. Kernel invocation requires the
specification of the block and grid dimensions along with any parameters the kernel may have. This is
illustrated below for a matrix multiply kernel MatrixMultiply that has the parameters a, b, c, and n,
where a, b, and c are pointers to the start of the row-major representation of n × n matrices and the
kernel computes c = a ∗ b.
MatrixMultiply<<<GridDimensions, BlockDimensions>>>(a,b,c,n)
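For concreteness, a host-side launch might look as follows (the 16 × 16 tile size and the one-thread-per-output-element mapping are illustrative assumptions, not the configuration used by GPU8 or sgemm):

   dim3 blockDim(16, 16);                 // a 16 x 16 block of threads
   dim3 gridDim(n / 16, n / 16);          // one block per 16 x 16 tile of c
   MatrixMultiply<<<gridDim, blockDim>>>(a, b, c, n);
   cudaThreadSynchronize();               // block the host until the kernel completes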
A GPU has a block scheduler that dynamically assigns thread blocks to SMs. Since all the threads of
a thread block are assigned to the same SM, the threads of a block may communicate with one another
via the shared memory of an SM. Further, the resources needed by a block of threads (e.g., registers
and shared memory) should be sufficiently small that a block can be run on an SM. The block scheduler
assigns more than 1 block to run concurrently on an SM when the combined resources needed by the
assigned blocks do not exceed the resources available to an SM. However, since CUDA provides no
mechanism to specify a subset of blocks that are to be co-scheduled on an SM, threads of different blocks
can communicate only via the device memory.
Once a block is assigned to an SM, its threads are scheduled to execute on the SM’s SPs by the SM’s
warp scheduler. The warp scheduler divides the threads of the blocks assigned to an SM into warps of
32 consecutively indexed threads from the same block. Multidimensional thread indexes are serialized in
row-major order for partitioning into warps. So, a block that has 128 threads is partitioned into 4 warps.
Every thread currently assigned to an SM has its own instruction counter and set of registers. The warp
scheduler selects a warp of ready threads for execution. If the instruction counters for all threads in the
selected warp are the same, all 32 threads execute in 1 step. On a GPU with compute capability 2.0,
each SM has 32 cores and so all 32 threads can perform their common instruction in parallel, provided,
of course, this common instruction is an integer or floating point operation. On a GPU with compute
capability 1.3, each SM has 8 SPs and so a warp can execute the common instruction for only 8 threads
in parallel. Hence, when the compute capability is 1.3, the GPU takes 4 rounds of parallel execution
to execute the common instruction for all 32 threads of a warp. When the instruction counters of the
threads of a warp are not the same, the GPU executes the different instructions serially. Note that the
instruction counters may become different as the result of “data dependent conditional branches” in a
kernel [7]. When the compute capability is 1.3, the device-memory accesses of a half warp are coalesced
into a single transaction when the data being accessed lie in the same 128-byte segment of device memory.
The transaction size is reduced to 64 bytes in case the accessed data are in the same 64-byte segment
and to 32 bytes when they are in the same 32-byte segment. The transaction size determines the volume
of data moved.
An SM’s warp scheduler is able to hide much of the 400 to 600 cycle latency of a device-memory
access by executing warps that are ready to do arithmetics while other warps wait for device-memory
accesses to complete. So, the performance of code that makes many accesses to device memory can often
be improved by optimizing it to increase the number of warps scheduled on an SM. This optimization
could involve increasing the number of threads per block and/or reducing the shared memory and register
utilization of a block to enable the scheduling of a larger number of blocks on an SM.
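As a hypothetical example of this trade-off on the C1060 (the block parameters below are illustrative, not those of any kernel in this paper): a block of 256 threads that uses 16 registers per thread and 4KB of shared memory needs 256 × 16 = 4,096 registers and 4KB of shared memory, so up to 4 such blocks (1,024 threads, or 32 warps) can be resident on an SM within its 16,384 registers and 16KB of shared memory. If the same block instead used 40 registers per thread, 256 × 40 = 10,240 registers would allow only one resident block (8 warps), leaving the warp scheduler far fewer warps with which to hide device-memory latency.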
__device__ void update2(float *a, float b, float *c)
{// add b times one column of the transposed a tile into the 16 partial
 // results held in the calling thread's registers c[0..15]
   for (int i = 0; i < 16; i++)
      c[i] += a[i * 4] * b;   // consecutive column elements are 4 floats apart
}
using two 128-byte transactions. To accomplish this, each thread reads a 1 × 4 sub-matrix of a using
the data type float4. The 16 × 64 a sub-matrix that is input from device memory may be viewed as a
16 × 16 matrix in which each element is a 1 × 4 vector. The transpose of this 16 × 16 matrix of vectors
is stored in the array as[16][65] with each 1 × 4 vector using four adjacent elements of a row of as. This
mapping ensures that the 16 elements in each column of the 16 × 64 sub-matrix of a that is input from
device memory are stored in different banks of shared memory. So, the writes to shared memory done
by a half warp of GPU8 are conflict free. Further, by storing the transpose of a 16 × 16 matrix of 1 × 4
vectors rather than the transpose of a 16 × 64 matrix of scalars, GPU8 is able to do the writes to shared
memory using float4s rather than floats as would otherwise be the case. This reduces the time to
write to shared memory.
The number of half warps is n^2/256. In each iteration of the for i loop, a half warp makes 4 128-byte
transactions to read in a values and 64 64-byte transactions to read in b values. Thus, GPU8 makes
n^3/4096 128-byte device-memory transactions on a and n^3/256 64-byte transactions on b. Additionally,
n^2/16 64-byte transactions are made on c. Each transaction has 100% utilization. So, the total
number of transactions is 17n^3/4096 + n^2/16 and the volume is 9n^3/32 + 4n^2. By comparison, the
number of transactions and volume for the sgemm code in CUBLAS 3.0 [2] are 5n^3/1024 + n^2/16 and
5n^3/16 + 4n^2, respectively.
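Comparing leading terms shows the advantage of GPU8 over sgemm for large n:

$$\frac{17n^3/4096}{5n^3/1024} = \frac{17}{20} = 0.85, \qquad \frac{9n^3/32}{5n^3/16} = \frac{9}{10} = 0.90,$$

i.e., GPU8 asymptotically makes about 15% fewer device-memory transactions and generates about 10% less data volume than the CUBLAS 3.0 sgemm.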
1. add(X, Y, Z) computes Z = X + Y using the kernel code of Figure 10. Each thread fetches two
adjacent values of X and two adjacent values of Y from device memory using the data type float2.
Since the 16 pairs of X (Y) fetched from device memory by a half warp lie in the same 128-byte segment,
the fetches of a half warp are coalesced into a single 128-byte memory transaction. The fetched pairs of X
and Y are added and the sums written to device memory. This write also requires one memory
transaction per half warp. So, two m × m matrices are added using a total of 3m^2/32 128-byte
transactions that result in a total volume of 12m^2 bytes.
3. mul(X, Y, Z) computes Z = X ∗ Y using the kernel code of Figures 8 and 9. Let T and V,
respectively, denote the number of memory transactions and volume for this code when multiplying
two m × m matrices (T = 17m^3/4096 + m^2/16 and V = 9m^3/32 + 4m^2).
__global__ void GPU8 (float *a, float *b, float *c, int n)
{// thread code to compute one column of a 16 x 128 sub-matrix of c
   // use shared memory to hold the transpose of a
   // 16 x 64 sub-matrix of 1 x 4 sub-vectors of a
   __shared__ float as[16][65];
   // aNext, bNext, cNext are this thread's starting offsets into a, b, and c;
   // their computation from blockIdx/threadIdx is not shown in this excerpt
   a += aNext;
   b += bNext;
   c += cNext;

Figure 8: GPU8 (Part a)
elements of W ∗ X as each is computed. Instead, after an element of W ∗ X has been computed, the
corresponding elements of Y and Z are read from device memory, incremented by the computed
element of W ∗ X, and the incremented values written back to device memory. Note that each
element of W ∗ X is computed exactly once. The modified kernel makes T − m^2/16 transactions to
multiply W and X as it does not write out W ∗ X. Additional transactions are made to fetch Y and
Z and write the incremented values. A half warp reads/writes Y and Z using coalesced 64-byte
transactions. The total number of these transactions is m^2/4 (m^2/16 transactions are made to read
or write each of Y and Z). So, the total number of transactions is T + 3m^2/16 and the volume is
V + 12m^2.
   float br0 = b[0];          // four b values from four consecutive rows of b
   float br1 = b[n];
   float br2 = b[nTimes2];
   float br3 = b[nTimes3];
   b += nTimes4;
   #pragma unroll
   for (int k = 0; k < 15; k++)
   {  // accumulate the contribution of these four b rows (and the corresponding
      // four columns of the a tile) into the 16 partial results in cr,
      // prefetching the next four b values as we go
      update2 (&as[k][0], br0, cr); br0 = b[0];
      update2 (&as[k][1], br1, cr); br1 = b[n];
      update2 (&as[k][2], br2, cr); br2 = b[nTimes2];
      update2 (&as[k][3], br3, cr); br3 = b[nTimes3];
      b += nTimes4;
   }
   a4 += 16;
   __syncthreads(); // wait for computation to complete
}

Figure 9: GPU8 (Part b)
6. mulStoreDec(W, X, Y, Z) computes (Y, Z−) = W ∗ X. Again, this is done by modifying the matrix
multiply kernel so that after it stores a computed element of W ∗ X to the appropriate device
memory location for Y, it reads the corresponding element of Z from device memory, decrements
this element of Z by the value of the just computed element of Y, and stores the decremented element
of Z in device memory. In addition to the transactions (T) made to compute and store W ∗ X,
the modified kernel fetches and writes Z using m^2/8 64-byte transactions. So, the modified kernel
makes a total of T + m^2/8 transactions and generates a volume of V + 8m^2.
__global__ void add (float *d_A, float *d_B, float *d_C, int widthA, int widthB, int widthC)
{
   int startA = blockIdx.x*64 + threadIdx.x*2 + (blockIdx.y*8 + threadIdx.y)*widthA;
   int startB = blockIdx.x*64 + threadIdx.x*2 + (blockIdx.y*8 + threadIdx.y)*widthB;
   int startC = blockIdx.x*64 + threadIdx.x*2 + (blockIdx.y*8 + threadIdx.y)*widthC;
   // each thread reads two adjacent elements of A and of B as a float2 (coalesced)
   float2 tempA = *((float2 *) (d_A + startA));
   float2 tempB = *((float2 *) (d_B + startB));
   tempA.x += tempB.x;
   tempA.y += tempB.y;
   // a single coalesced float2 store writes the two sums to C
   *((float2 *) (d_C + startC)) = tempA;
}

Figure 10: add kernel
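A launch configuration consistent with this indexing (one possibility; each 32 × 8 thread block covers a 64 × 8 tile, two adjacent floats per thread) is:

   dim3 blockDim(32, 8);
   dim3 gridDim(m / 64, m / 8);
   add<<<gridDim, blockDim>>>(d_A, d_B, d_C, m, m, m);   // d_C = d_A + d_B for m x m matrices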
8. mulAdd(W, X, Y, Z) computes Z = W ∗ X + Y. This kernel, in addition to doing all the work done
by the matrix multiply kernel, needs to fetch Y from device memory. This fetching is done using
m^2/16 64-byte transactions. So, the total number of transactions is T + m^2/16 and the volume is
V + 4m^2.
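The fused kernels above are obtained by modifying GPU8’s store phase. The kernel below is only a simplified sketch of that idea for mulAdd (one thread per output element, no shared-memory tiling, hypothetical name mulAddNaive); it shows how the epilogue fetches Y once and writes Z once per element:

__global__ void mulAddNaive(const float *W, const float *X, const float *Y,
                            float *Z, int m)
{// naive sketch of Z = W * X + Y for m x m row-major matrices; NOT the
 // optimized GPU8-based kernel used in this paper
   int col = blockIdx.x * blockDim.x + threadIdx.x;
   int row = blockIdx.y * blockDim.y + threadIdx.y;
   if (row >= m || col >= m) return;
   float sum = 0.0f;
   for (int k = 0; k < m; k++)                  // one element of W * X
      sum += W[row * m + k] * X[k * m + col];
   Z[row * m + col] = sum + Y[row * m + col];   // fused: read Y once, write Z once
}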
6 One-Level Adaptation
6.1 One-Level Strassen
In a one-level implementation of Strassen’s algorithm and Winograd’s variant, the 7 matrix products M1
through M7 are computed by a direct application of GPU8 (i.e., Strassen’s and Winograd’s equations
are not applied recursively).
Kernel          Transactions                    Volume
add             3m^2/32                         12m^2
sub             3m^2/32                         12m^2
mul             T = 17m^3/4096 + m^2/16         V = 9m^3/32 + 4m^2
mulIncInc       T + 3m^2/16                     V + 12m^2
mulIncDec       T + 3m^2/16                     V + 12m^2
mulStoreDec     T + m^2/8                       V + 8m^2
mulStoreInc     T + m^2/8                       V + 8m^2
mulAdd          T + m^2/16                      V + 4m^2
mulIncIncInc    T + 5m^2/16                     V + 20m^2
mulSubInc       T + 3m^2/16                     V + 12m^2

Figure 11: Device-memory transactions and volume for the GPU kernels on m × m matrices
Figure 12 gives the sequence of kernel calls in a one-level implementation
of Strassen’s method. We refer to the resulting program as one-level Strassen. The one-level GPU
implementation of Strassen’s method invokes the add and sub kernels 10 times, the mul and mulIncInc
kernels twice each, and the mulStoreDec, mulStoreInc, and mulIncDec kernels once each. Using the
transaction and volume data for each kernel (Figure 11), we determine the total transaction count to
be 7T + 7m^2/4 and the total volume to be 7V + 172m^2, where T = 17m^3/4096 + m^2/16 and V =
9m^3/32 + 4m^2. When multiplying n × n matrices, the kernels are invoked with m = n/2. So, the total
number of transactions is 119n^3/32768 + 35n^2/64 and the volume is 63n^3/256 + 50n^2.
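These totals follow directly by substituting m = n/2:

$$7T + \tfrac{7}{4}m^2\Big|_{m=n/2} = \tfrac{119}{32768}n^3 + \tfrac{7}{64}n^2 + \tfrac{28}{64}n^2 = \tfrac{119}{32768}n^3 + \tfrac{35}{64}n^2, \qquad 7V + 172m^2\Big|_{m=n/2} = \tfrac{63}{256}n^3 + 7n^2 + 43n^2 = \tfrac{63}{256}n^3 + 50n^2.$$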
Step Computation GPU Kernel
1 C12 = A21 − A11 sub(A21 , A11 , C12 )
2 C21 = B11 + B12 add(B11 , B12 , C21 )
3 C22 = C12 ∗ C21 mul(C12 , C21 , C22 )
4 C12 = A12 − A22 sub(A12 , A22 , C12 )
5 C21 = B21 + B22 add(B21 , B22 , C21 )
6 C11 = C12 ∗ C21 mul(C12 , C21 , C11 )
7 C12 = A11 + A22 add(A11 , A22 , C12 )
8 C21 = B11 + B22 add(B11 , B22 , C21 )
9 T1 = C12 ∗ C21
10 C11 = T1 + C11
11 C22 = T1 + C22 mulIncInc(C12 , C21 , C11 , C22 )
12 T2 = A21 + A22 add(A21 , A22 , T2 )
13 C21 = T2 ∗ B11
14 C22 = C22 − C21 mulStoreDec(T2 , B11 , C21 , C22 )
15 T1 = B21 − B11 sub(B21 , B11 , T1 )
16 T2 = A22 ∗ T1
17 C21 = C21 + T2
18 C11 = C11 + T2 mulIncInc(A22 , T1 , C21 , C11 )
19 T1 = B12 − B22 sub(B12 , B22 , T1 )
20 C12 = A11 ∗ T1
21 C22 = C22 + C12 mulStoreInc(A11 , T1 , C12 , C22 )
22 T2 = A11 + A12 add(A11 , A12 , T2 )
23 T1 = T2 ∗ B22
24 C12 = C12 + T1
25 C11 = C11 − T1 mulIncDec(T2 , B22 , C12 , C11 )

Figure 12: GPU kernels in the one-level GPU implementation of Strassen’s method
Y and Z from device memory, increment X and Y, and then write the incremented X and Y to device
memory. Hence the kernel reads Z only once, whereas incrementing X and Y using the two separate
steps X += Z and Y += Z would read Z twice.
When τ2 < n ≤ 2τ2, the execution of Figure 16 is referred to as a two-level Strassen multiplication.
The number of arithmetics, A(2, n), in a two-level multiplication is 7A(1, n/2) + 18ADD(n/2),
where A(1, n/2) is the number of arithmetics needed in a one-level multiplication of n/2 × n/2 matrices
and ADD(n/2) is the number of arithmetics needed to add two n/2 × n/2 matrices. So, A(2, n) =
7(7(2(n/4)^3 − (n/4)^2) + 18ADD(n/4)) + 18ADD(n/2) = 49n^3/32 + 149n^2/16. For the number of transactions,
T(2, n), we see that a two-level multiplication does 12 adds/subtracts/increments/decrements of
n/2 × n/2 matrices with each requiring 3(n/2)^2/32 = 3n^2/128 transactions. The ++ and −+ operations
each make 5n^2/128 transactions (this is a reduction of n^2/128 over doing two += or one += and one
−= operation). Each of the 7 multiply operations multiplies two n/2 × n/2 matrices using a one-level
multiply that does 119(n/2)^3/32768 + 35(n/2)^2/64 transactions. So, T(2, n) = 833n^3/262144 + 347n^2/256.
Using a similar analysis, we see that the volume, V(2, n), is 441n^3/2048 + 277n^2/2.
When 2^(k−2)τ2 < n ≤ 2^(k−1)τ2, a k-level execution of Strassen occurs.
Step Computation GPU Kernel
1 T1 = A11 − A21 sub(A11 , A21 , T1 )
2 T2 = B22 − B12 sub(B22 , B12 , T2 )
3 C21 = T1 ∗ T2 mul(T1 , T2 , C21 )
4 T1 = A21 + A22 add(A21 , A22 , T1 )
5 T2 = B12 − B11 sub(B12 , B11 , T2 )
6 C22 = T1 ∗ T2 mul(T1 , T2 , C22 )
7 T1 = T1 − A11 sub(T1 , A11 , T1 )
8 T2 = B22 − T2 sub(B22 , T2 , T2 )
9 C11 = T1 ∗ T2 mul(T1 , T2 , C11 )
10 T1 = A12 − T1 sub(A12 , T1 , T1 )
11 C12 = T1 ∗ B22
12 C12 = C22 + C12 mulAdd(T1 , B22 , C22 , C12 )
13 T1 = A11 ∗ B11
14 C11 = C11 + T1
15 C12 = C11 + C12
16 C11 = C11 + C21 mulIncIncInc(A11 , B11 , T1 , C21 , C11 , C12 )
17 T2 = T2 − B21 sub(T2 , B21 , T2 )
18 C21 = A22 ∗ T2
19 C21 = C11 − C21
20 C22 = C11 + C22 mulSubInc(A22 , T2 , C11 , C21 , C22 )
21 C11 = A12 ∗ B21
22 C11 = T1 + C11 mulAdd(A12 , B21 , T1 , C11 )
Figure 13: GPU kernels in Douglas et al.’s [8] implementation of Winograd variant
Method      Arithmetics       Transactions                Volume
Winograd    7n^3/4 + 2n^2     119n^3/32768 + 29n^2/64     63n^3/256 + 41n^2
Strassen(A, B, C, n) {
   if (n <= τ1) compute C = A ∗ B using GPU8;
   else if (n <= τ2) compute C = A ∗ B using Figure 12;
   else {
      C12 = A21 − A11; C21 = B11 + B12; Strassen(C12, C21, C22, n/2);  // M6
      C12 = A12 − A22; C21 = B21 + B22; Strassen(C12, C21, C11, n/2);  // M7
      C12 = A11 + A22; C21 = B11 + B22; Strassen(C12, C21, T1, n/2);   // M1
      (C11+, C22+) = T1; T2 = A21 + A22; Strassen(T2, B11, C21, n/2);  // M2
      C22 −= C21; T1 = B21 − B11; Strassen(A22, T1, T2, n/2);          // M4
      (C11+, C21+) = T2; T1 = B12 − B22; Strassen(A11, T1, C12, n/2);  // M3
      C22 += C12; T2 = A11 + A12; Strassen(T2, B22, T1, n/2);          // M5
      (C11−, C12+) = T1;
   }
}

Figure 16: Multilevel GPU implementation of Strassen’s method
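For this k-level execution, following the same per-level accounting as in the two-level analysis above (each level performs 12 add/subtract kernels and 3 combined increment/decrement kernels on n/2 × n/2 matrices, in addition to 7 (k−1)-level multiplications), the counts satisfy

$$A(k,n) = 7A(k-1,\,n/2) + \tfrac{9}{2}n^2,\qquad T(k,n) = 7T(k-1,\,n/2) + \tfrac{51}{128}n^2,\qquad V(k,n) = 7V(k-1,\,n/2) + 51n^2,$$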
where A(1, n), T(1, n), and V(1, n) are for a 1-level execution and are given in Figure 14.
Figures 17 through 19 give the values of A, T, and V for k = 2, 3, and 4. Figures 20 through 22 give
the percent reduction in arithmetics, transactions, and volume relative to GPU8 for k = 2, 3, and 4.
Based on these numbers, we expect the two-level Strassen algorithm to run about 20% faster than GPU8
when n = 16384 (this would correspond to τ2 = 8192); we expect the three-level Strassen algorithm to run
26% to 33% faster than GPU8; and the 4-level version to run 29% to 41% faster (depending on whether
arithmetics, transactions, or volume dominates run time).
Method      Arithmetics                      Transactions                               Volume
GPU8        2n^3 − n^2                       17n^3/4096 + n^2/16                        9n^3/32 + 4n^2
Strassen    2401n^3/2048 + 10469n^2/256      40817n^3/16777216 + 21491n^2/4096          21609n^3/131072 + 18061n^2/32
Winograd    2401n^3/2048 + 2081n^2/64        40817n^3/16777216 + 17573n^2/4096          21609n^3/131072 + 29315n^2/64
Winograd(A, B, C, n)
{
   if (n <= τ1) compute C = A ∗ B using GPU8;
   else if (n <= τ2) compute C = A ∗ B using Figure 13;
   else {
      T1 = A11 − A21; T2 = B22 − B12; Winograd(T1, T2, C21, n/2);         // M4
      T1 = A21 + A22; T2 = B12 − B11; Winograd(T1, T2, C22, n/2);         // M5
      T1 −= A11; T2 = B22 − T2; Winograd(T1, T2, C11, n/2);               // M1
      T1 = A12 − T1; Winograd(T1, B22, C12, n/2);                         // M6
      C12 += C22; Winograd(A11, B11, T1, n/2);                            // M2
      (C12, C11) = (C11 + C12 + T1, C11 + C21 + T1);
      T2 −= B21; Winograd(A22, T2, C21, n/2);                             // M7
      (C21, C22) = (C11 − C21, C11 + C22); Winograd(A12, B21, C11, n/2);  // M3
      C11 += T1;
   }
}
outermost level is therefore 41. So, for this code, we see that:
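each level of Winograd contributes 15 quadrant-sized matrix add/subtracts (15(n/2)^2 arithmetics) and 41 quadrant-sized device-memory reads and writes, each such access costing (n/2)^2/32 transactions and 4(n/2)^2 bytes, so, as a sketch under this accounting,

$$A_W(k,n) = 7A_W(k-1,\,n/2) + \tfrac{15}{4}n^2,\qquad T_W(k,n) = 7T_W(k-1,\,n/2) + \tfrac{41}{128}n^2,\qquad V_W(k,n) = 7V_W(k-1,\,n/2) + 41n^2$$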
for k > 1 and A(1, n), T (1, n), and V (1, n) are as in Figure 15.
Figures 17 through 19 and Figures 20 through 22, respectively, give the values of A, T, and V and
the percent reduction in these quantities relative to GPU8 for k = 2, 3, and 4. The expected speedup of
Winograd relative to GPU8 is slightly higher than for Strassen.
8 Experimental Results
8.1 Single Precision Experiments
We programmed several versions of GPU8, Strassen, Winograd, and sgemm using CUDA and measured
their run time as well as accuracy on a Tesla C1060 GPU. The different versions of each algorithm varied
in their use of texture memory for the input matrices A and B. Because of the limited availability of
texture memory, GPU8 and sgemm can be readily adapted to use texture memory only when n < 16384.
For larger values of n, it is necessary to write a blocked version of these algorithms, invoking the blocked
version using texture memory for the smaller sized A and B blocks to be multiplied. Our experiments
with the blocked version of sgemm, for example, resulted in a very small reduction in run time from the
use of texture memory. For example, when n = 16384, our texture memory versions of sgemm yielded
best performance using blocks of size 8192 × 8192 and designating only the blocks of A as texture blocks.
The measured reduction in time was about 0.6% relative to the non-blocked sgemm code. Because of this
very marginal reduction in run time even for the largest matrix size we used in our experiments, we do not
report further on the blocked texture memory versions of GPU8 and sgemm. Strassen and Winograd,
on the other hand, are well suited for texture memory as they recursively and naturally decompose
matrices into smaller submatrices until the threshold value τ2 is reached.
Algorithm    τ2      n = 2048   n = 4096   n = 8192   n = 16384
sgemm        -       0.048      0.373      2.966      23.699
GPU8         -       0.046      0.361      2.875      22.971
Strassen     4096    0.046      0.329      2.344      16.561
tStrassen    2048    0.044      0.320      2.276      16.107
Winograd     4096    0.046      0.328      2.329      16.425
tWinograd    2048    0.044      0.318      2.243      15.846

Run times (seconds) for single-precision multiplication of n × n matrices
So long as τ2 ≤ 16384, the pairs of matrices to be multiplied by GPU8 using the one-level kernels at the lowest level of recursion
may be designated as texture matrices. Again, our experiments showed best performance when only the
first matrix in the pair to be multiplied was designated as texture. Hence, in the following, tStrassen
and tWinograd refer to versions of Strassen and Winograd in which, when GPU8 is invoked by the
one-level code used when the matrix size drops to τ2, the first matrix of each pair to be multiplied by
GPU8 is designated as texture (the syntax of the GPU8 code is correspondingly modified to work with
its first matrix being texture). For our experiments, we set τ1 = τ2/2.
8.1.2 Accuracy
The primary reason Strassen’s algorithm has not found wide application is that it is less numerically
stable than the classical O(n^3) algorithm [9]. We assess the numerical accuracy of Strassen’s algorithm
and Winograd’s variant using the test matrix used in [13]:
A = I + uv^T        B = I − (1/(1 + v^T u)) uv^T        C = I
where I is the Kronecker delta matrix or simply the identity matrix (i.e., a matrix with 1s on the diagonal
and 0 elsewhere), and the vectors u and v are as below:
u_i = 1/(N + 1 − i)        v_i = √i        i = 1, · · · , N
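Indeed, the rank-one terms cancel exactly:

$$AB = (I + uv^T)\left(I - \frac{uv^T}{1+v^Tu}\right) = I + uv^T - \frac{uv^T + u(v^Tu)v^T}{1+v^Tu} = I + uv^T - uv^T = I.$$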
[Figure: run times (seconds) vs. N for sgemm, GPU8, Strassen, tStrassen, Winograd, and tWinograd]
Notice that although the product of A and B (i.e., I) is (theoretically) independent of u_i and v_i, there
is some dependence, in practice, because of numerical errors introduced in the initialization of A and B
on the host CPU during the computation of uv^T and the floating-point divisions. Figure 30 gives the
maximum absolute difference between an element of C as computed by each of our algorithms and the
ground truth I and Figure 31 gives the average of the absolute differences. For comparison purposes, we
include also the errors in the results obtained using the classical O(n^3) matrix multiplication algorithm
on the host CPU. For the reported errors, we used τ2 = 2048 and 4096. Since the use of texture memory
does not impact accuracy, Figures 30 and 31 do not explicitly show error measurements for tStrassen
and tWinograd (the errors, respectively, are the same as for Strassen and Winograd). The maximum
and average errors for the classical CPU algorithm, sgemm, and GPU8 algorithms are almost the same.
However, the errors for Strassen and Winograd are substantially larger than those for the classical
algorithm, sgemm and GPU8 when n > τ1 = τ2/2 (when n ≤ τ1 = τ2/2, Strassen and Winograd
reduce to GPU8). In fact, when n = 16384 and τ2 = 2048, the maximum error for Strassen is 200 times
that for the classical algorithm, sgemm and GPU8 while the average error is 424 times as much. The
corresponding ratios for Winograd are 1615 and 5151. We note also that when n = 16384 and τ2 = 2048,
the maximum error for Winograd is about 7.6 times that for Strassen and the average error is about 12 times as much.
[Figure: actual speedup in time and reduction in arithmetics, transactions, and volume (%) vs. N]
Algorithm    0-level   1-level   2-level   3-level   4-level
Strassen     3.9e-4    3.3e-3    3.1e-2    5.8e-2    8.3e-2
Winograd     3.9e-4    1.4e-3    9.7e-3    1.6e-1    6.3e-1
and in a 2-level execution, τ2 < n ≤ 2τ2 . A 0-level execution occurs when n ≤ τ1 = τ2 /2. Figures 32
through 35 give the maximum and average errors as a function of the level of the execution for the case
n = 16384 and Figures 36 through 39 give the run time and reduction in run time relative to sgemm
and GPU8. As expected, the errors and speedup (reduction in run time) increase with the number of
levels. For example, the 1-level version of Strassen achieves almost a 15% speedup relative to sgemm
at the expense of an almost 13 fold increase in the maximum error and an almost 17 fold increase in the
average error while the 4-level version achieves a speedup of almost 29% at a cost of an almost 213 fold
increase in the maximum error and an almost 425 fold increase in the average error.
[Figure: maximum errors of Strassen and Winograd vs. number of levels]
[Figure: average errors of Strassen and Winograd vs. number of levels]
[Figure: speedup (%) vs. number of levels for Strassen, tStrassen, Winograd, and tWinograd]
the speedup attained by dStrassen was 20.2% and that attained by dWinograd was 21%. These compare
with speedups of 21% and 21.5% attained by Strassen and Winograd relative to sgemm when n = 8192.
[Figure: double-precision run times (seconds) for dgemm, Strassen, and Winograd at N = 4096 and 8192]
[Figure: speedup over dgemm (%) at N = 4096 and 8192]
To assess the accuracy of computation in double precision mode, we used the same test matrix as used
in Section 8.1.2. Figure 43 gives the maximum and average errors in the computed product matrix when
n = 8192. While a double-precision computation using the classical matrix multiplication algorithm on
the CPU has the same error characteristics as dgemm, dStrassen and dWinograd have errors that are
an order of magnitude higher; the errors using dStrassen are about half those using dWinograd.
We conducted an additional experiment to gauge the accuracy of the double precision algorithms. In
this experiment, we generated 10 different 8192 × 8192 matrices with elements randomly selected from
the range [−1, 1]. The maximum and average errors for each computation were computed relative to
the results obtained by the classical matrix multiply algorithm on the CPU and then normalized by the
average of the absolute values of the elements computed by the classical CPU algorithm. Figure 44 gives
            O(n^3) on CPU   dgemm     dStrassen   dWinograd
Maximum     6.5e-13         6.5e-13   4.3e-12     7.5e-12
Average     1.2e-16         1.2e-16   1.3e-15     3.1e-15
Figure 43: Errors for test matrix of Section 8.1.2 when n = 8192 and τ2 = 4096
Figure 44: Normalized maximum and average errors for ten random matrices, n = 8192 and τ2 = 4096
the maximum of the normalized maximum errors and the average of the normalized average errors.
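In symbols, for each trial the normalized maximum and average errors are

$$\frac{\max_{i,j}|C_{ij}-\hat C_{ij}|}{\operatorname{avg}_{i,j}|\hat C_{ij}|} \quad\text{and}\quad \frac{\operatorname{avg}_{i,j}|C_{ij}-\hat C_{ij}|}{\operatorname{avg}_{i,j}|\hat C_{ij}|},$$

where Ĉ denotes the product computed by the classical CPU algorithm; Figure 44 then reports the maximum of the former and the average of the latter over the ten trials.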
9 Conclusion
We have developed efficient GPU implementations of Strassen’s and Winograd’s matrix multiplication
algorithms. Our experiments indicate that for single-precision arithmetic a speedup of 32% is achieved by
Strassen’s algorithm while Winograd’s variant achieves a speedup of 33% relative to the sgemm code in
CUBLAS when multiplying 16384 × 16384 matrices. Our double-precision implementations of Strassen’s
and Winograd’s algorithms, respectively, achieve a speedup of 20.2% and 21% relative to dgemm when
the matrix size n is 8192. These speedups, however, come at a significant cost in the accuracy of the
computed result. The maximum numerical errors introduced by Strassen’s and Winograd’s algorithms
are about 2 orders of magnitude higher than those for sgemm when n = 16384 and about 1 order of
magnitude higher than for dgemm when n = 8192. The average numerical errors introduced by Strassen’s
and Winograd’s algorithms are, respectively, 2 and 3 orders of magnitude higher than those for sgemm
when n = 16384 and about 1 order of magnitude higher than for dgemm when n = 8192. Whether the
loss in accuracy is acceptable or not will depend on the application. We have analyzed the arithmetic,
transaction and volume complexity of the various matrix multiplication algorithms considered in this
paper (single-precision versions). Our experiments indicate that speedup most closely follows volume.
References
[1] D. Bailey, K. Lee, and H. Simon, Using Strassen’s algorithm to accelerate the solution of linear systems, Journal of Supercomputing, 4, 357-371, 1990.

[2] http://icl.cs.utk.edu/magma/

[3] B. Boyer, C. Pernet, and W. Zhou, Memory efficient scheduling of Strassen-Winograd’s matrix multiplication algorithm, ACM ISSAC, 2009.

[5] D. Coppersmith and S. Winograd, Matrix multiplication via arithmetic progressions, Journal of Symbolic Computation, 9, 3, 251-280, 1990.

[7] NVIDIA CUDA Programming Guide, Version 3.0, 2010, http://developer.nvidia.com/object/gpucomputing.html

[8] C. Douglas, M. Heroux, G. Slishman, and R. Smith, GEMMW: A portable level 3 BLAS Winograd variant of Strassen’s matrix-multiply algorithm, Journal of Computational Physics, 110, 1-10, 1994.

[9] N. Higham, Exploiting fast matrix multiplication within the level 3 BLAS, ACM Transactions on Mathematical Software, 16(4), 352-368, 1990.

[11] S. Huss-Lederman, E. Jacobson, J. Johnson, A. Tsao, and T. Turnbull, Strassen’s algorithm for matrix multiplication: Modeling, analysis, and implementation, CCS-TR-96-17, Center for Computing Sciences, 1996.

[12] J. Li, S. Ranka, and S. Sahni, GPU matrix multiplication, chapter in Handbook on Multicore Computing (Editor: S. Rajasekaran), Chapman Hall, 2011, to appear.

[13] I. Kaporin, A practical algorithm for faster matrix multiplication, Numerical Linear Algebra with Applications, 6, 687-700, 1999.

[14] S. Robinson, Toward an optimal algorithm for matrix multiplication, SIAM News, 38, 9, 2005.

[15] S. Sahni, Data Structures, Algorithms, and Applications in C++, Second Edition, Silicon Press, NJ, 2005.

[16] N. Satish, M. Harris, and M. Garland, Designing efficient sorting algorithms for manycore GPUs, IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2009.

[17] V. Strassen, Gaussian elimination is not optimal, Numerische Mathematik, 13, 354-356, 1969.

[18] V. Volkov and J. Demmel, Benchmarking GPUs to tune dense linear algebra, Supercomputing, 2008.

[20] S. Winograd, On multiplication of 2 × 2 matrices, Linear Algebra and its Applications, 4, 381-388, 1971.

[21] Y. Won and S. Sahni, Hypercube-to-host sorting, Journal of Supercomputing, 3, 41-61, 1989.