Abstract
We provide efficient single- and double-precision GPU (Graphics Processing Unit) implementa-
tions of Strassen’s matrix multiplication algorithm as well as of Winograd’s variant of this algorithm.
The single-precision implementations of these two algorithms are compared analytically using the
arithmetic count, device-memory transactions, and device memory to multiprocessor data volume
metrics. Our analysis indicates that, for 16384 × 16384 matrices, our single-precision implementation
of Strassen’s algorithm limited to four levels of recursion reduces the number of arithmetics by 41.3%,
the number of transactions by 33.7%, and the volume by 29.2% relative to the best known GPU
implementation of the classical n3 matrix multiplication algorithm. The corresponding reductions
achieved by Winograd’s variant are 41.3%, 35.1%, and 31.5%. Experimental results obtained using
an NVIDIA C1060 GPU indicate a speedup of 32% for Strassen’s 4-level implementation and 33% for
Winograd’s variant relative to the sgemm code in CUBLAS 3.0 when multiplying 16384 × 16384 ma-
trices. Our double-precision implementations of Strassen’s and Winograd’s algorithms, respectively,
achieve a speedup of 20.2% and 21% relative to dgemm when the matrix size n is 8192. The maximum
numerical errors introduced by Strassen’s and Winograd’s algorithms are about 2 orders of magnitude
higher than those for sgemm when n = 16384 and about 1 order of magnitude higher than for dgemm
when n = 8192. The average numerical errors introduced by Strassen’s and Winograd’s algorithms are,
respectively, 2 and 3 orders of magnitude higher than those for sgemm when n = 16384 and about 1
order of magnitude higher than for dgemm when n = 8192.
Keywords: GPU, CUDA, matrix multiplication, Strassen’s algorithm, Winograd’s
variant, accuracy
1 Introduction
Matrix multiplication is an integral component of the CUDA (Compute Unified Device Architecture)
BLAS library [2] and much effort has been expended in obtaining an efficient CUDA implementation.
The current implementation in the CUDA BLAS library is based on an algorithm due to Volkov and
Demmel [18]. A further 3% reduction (on the NVIDIA Tesla C1060) in run time is achieved by the
algorithm GPU8 [12]. Li, Ranka, and Sahni [12] provide a step-by-step development of efficient GPU
matrix multiplication algorithms beginning with the classical three-loop O(n^3) single-core algorithm to
multiply two n × n matrices. Although significant effort has been expended to obtain efficient GPU
algorithms for matrix multiplication based on the classical O(n^3) single-core algorithm, there appears to
be no work toward obtaining efficient GPU implementations of any of the single-core matrix multiplication algorithms
whose complexity is less than O(n^3). Of these latter lower complexity algorithms, Strassen’s original
O(n^2.81) algorithm [17] and Winograd’s variant [20] of this algorithm, whose asymptotic complexity is
also O(n^2.81), are considered the most practical. Hence, we focus on these two algorithms in this paper.

* This work was supported, in part, by the National Science Foundation under grants CNS0829916, CNS0905308, CCF0903430, NETS 0963812, and NETS 1115184.

Figure 1: Block decomposition of C = A × B: [C11 C12; C21 C22] = [A11 A12; A21 A22] × [B11 B12; B21 B22]
We note that the asymptotically fastest matrix multiplication algorithm at this time has a complexity of
O(n^2.38) [5] and it is believed that “an optimal algorithm for matrix multiplication will run in essentially
O(n^2) time” [14].
Both Strassen’s algorithm and Winograd’s variant compute the product C of two matrices A and B by
first decomposing each matrix into 4 roughly equal sized blocks as in Figure 1. Strassen’s algorithm [17]
computes C by performing 7 matrix multiplications and 18 add/subtracts using the following equations:

M1 = (A11 + A22) ∗ (B11 + B22)        C11 = M1 + M4 − M5 + M7
M2 = (A21 + A22) ∗ B11                C12 = M3 + M5
M3 = A11 ∗ (B12 − B22)                C21 = M2 + M4
M4 = A22 ∗ (B21 − B11)                C22 = M1 − M2 + M3 + M6
M5 = (A11 + A12) ∗ B22
M6 = (A21 − A11) ∗ (B11 + B12)
M7 = (A12 − A22) ∗ (B21 + B22)

When this block decomposition is applied recursively until the block dimensions reach (or fall below)
a threshold value (say τ), the arithmetic complexity of Strassen’s algorithm becomes O(n^2.81).
Winograd’s variant of Strassen’s method uses the following equations to compute C with 7 matrix
multiplies and 15 add/subtracts [8]:
S1 = A21 + A22 M1 = S2 ∗ S6 V1 = M1 + M2
S2 = S1 − A11 M2 = A11 ∗ B11 V2 = V1 + M4
S3 = A11 − A21 M3 = A12 ∗ B21 C11 = M2 + M3
S4 = A12 − S2 M4 = S3 ∗ S7 C12 = V1 + M5 + M6
S5 = B12 − B11 M5 = S1 ∗ S5 C21 = V2 − M7
S6 = B22 − S5 M6 = S4 ∗ B22 C22 = V2 + M5
S7 = B22 − B12 M7 = A22 ∗ S8
S8 = S6 − B21
Although the recursive application of Winograd’s variant also results in an asymptotic complexity of
O(n^2.81), the reduction in the number of matrix add/subtracts from 18 to 15 manifests itself as a slightly
smaller measured run time in practice.
While there appears to be no GPU implementation of either Strassen’s algorithm or Winograd’s variant,
both have been implemented for other architectures. For example, Bailey, Lee, and Simon [1] describe
an implementation of Strassen’s algorithm for the CRAY-2 and CRAY Y-MP. This implementation uses
three temporary (scratch) matrices at each level of the recursion. The total space required by these
temporary matrices is at most n^2. However, the computation can be done using 2 temporaries, T1 and
T2, at each level, using the steps given in Figure 2; this reduces the space required by temporary matrices
to at most 2n^2/3.
Step Computation Comment
1 C12 = A21 − A11
2 C21 = B11 + B12
3 C22 = C12 ∗ C21 M6
4 C12 = A12 − A22
5 C21 = B21 + B22
6 C11 = C12 ∗ C21 M7
7 C12 = A11 + A22
8 C21 = B11 + B22
9 T1 = C12 ∗ C21 M1
10 C11 = T1 + C11 M1 + M7
11 C22 = T1 + C22 M1 + M6
12 T2 = A21 + A22
13 C21 = T2 ∗ B11 M2
14 C22 = C22 − C21 M1 − M2 + M6
15 T1 = B21 − B11
16 T2 = A22 ∗ T1 M4
17 C21 = C21 + T2 M2 + M4
18 C11 = C11 + T2 M1 + M4 + M7
19 T1 = B12 − B22
20 C12 = A11 ∗ T1 M3
21 C22 = C22 + C12 M1 − M2 + M3 + M6
22 T2 = A11 + A12
23 T1 = T2 ∗ B22 M5
24 C12 = C12 + T1 M3 + M5
25 C11 = C11 − T1 M1 + M4 − M5 + M7

Figure 2: Strassen’s algorithm using two temporary matrices T1 and T2
Douglas et al. [8] provide an implementation of Winograd’s variant that uses two temporary matrices
at each level of the recursion. So, this implementation, which is given in Figure 3, also uses at most 2n^2/3
memory for temporary matrices. Douglas et al. [8] report on the performance of their implementation
on a variety of serial and parallel computers.
Huss-Lederman et al. [10, 11] describe two implementations of Winograd’s variant. The first uses two
temporary matrices at each level of the recursion and is identical to the implementation of Douglas et al.
[8] (Figure 3). The second implementation uses 3 temporaries at each level of the recursion. This second
implementation, however, is recommended only for the case when we are using the Winograd variant to
do a multiply-accumulate (i.e., C = αAB +βC) and not when we are doing a straight multiply (C = AB)
as in this paper. So, we do not consider this implementation further in this paper. Boyer et al. [3] show
how to implement Winograd’s variant using no temporary matrix. They provide two implementations.
The first does not increase the number of arithmetic operations but overwrites the input matrices A and
B. Since we do not permit overwriting of the input matrices, we do not consider this implementation.
Although the second in-place implementation does not overwrite the input matrices, it increases the
number of arithmetics by a constant factor. So, we do not consider this implementation either.
Step Computation Comment
1 T1 = A11 − A21 S3
2 T2 = B22 − B12 S7
3 C21 = T1 ∗ T2 M4
4 T1 = A21 + A22 S1
5 T2 = B12 − B11 S5
6 C22 = T1 ∗ T2 M5
7 T1 = T1 − A11 S2
8 T2 = B22 − T2 S6
9 C11 = T1 ∗ T2 M1
10 T1 = A12 − T1 S4
11 C12 = T1 ∗ B22 M6
12 C12 = C22 + C12 M5 + M6
13 T1 = A11 ∗ B11 M2
14 C11 = C11 + T1 V1
15 C12 = C11 + C12 V1 + M5 + M6
16 C11 = C11 + C21 V2
17 T2 = T2 − B21 S8
18 C21 = A22 ∗ T2 M7
19 C21 = C11 − C21 V2 − M7
20 C22 = C11 + C22 V2 + M5
21 C11 = A12 ∗ B21 M3
22 C11 = T1 + C11 M2 + M3

Figure 3: Douglas et al.’s [8] implementation of Winograd’s variant using two temporary matrices
The remainder of this paper is organized as follows. In Section 2, we describe the architecture of the
NVIDIA Tesla C1060 GPU. The CUDA programming model is described in Section 3 and the fastest
O(n^3) GPU matrix multiplication algorithm GPU8 [12] is described in Section 4. Section 5 gives the
basic GPU kernels used in our GPU adaptations of Strassen’s algorithm and Winograd’s variant and also
analyzes these kernels for their device-memory transaction and volume complexity. A one-level GPU
implementation of Strassen’s algorithm and Winograd’s variant (i.e., an implementation that does not
apply Strassen’s and Winograd’s equations recursively) is given in Section 6 and the general multilevel
recursive implementation is given in Section 7. Experimental results for single- and double-precision
implementations of Strassen’s and Winograd’s algorithms are presented in Section 8. We conclude in
Section 9.
Throughout this paper, we assume that n is a power of 2. Adaptations to other values of n may be
done using methods such as padding and peeling [10, 11].
2 GPU Architecture
NVIDIA’s Tesla C1060 GPU, Figure 5, is an example of NVIDIA’s general purpose parallel computing
architecture CUDA (Compute Unified Device Architecture) [16]. Figure 5 is a simplified version of
Figure 4 with N = 30 and M = 8. The C1060 comprises 30 streaming multiprocessors (SMs) and each
SM comprises 8 scalar processors (SPs), 16KB of on-chip shared memory, and 16,384 32-bit registers.
Figure 4: NVIDIA’s GPU hardware model [7]

Each SP has its own integer and single-precision floating point units. Each SM has 1 double-precision
floating-point unit and 2 single-precision transcendental function (special function, SF) units that are
shared by the 8 SPs in the SM. The 240 SPs of a Tesla C1060 share 4GB of off-chip memory referred
to as device or global memory [7]. A C1060 has a peak performance of 933 GFlops of single-precision
floating-point operations and 78 GFlops of double-precision operations. The peak of 933GFlops is for the
case when Multiply-Add (MADD) instructions are dual issued with special function (SF) instructions. In
the absence of SF instructions, the peak is 622GFlops (MADDs only) [19]. The C1060 consumes 188W
of power. The architecture of the NVIDIA Tesla C2050 (also known as Fermi) corresponds to Figure 4
with N = 14 and M = 32. So, a C2050 has 14 SMs and each SM has 32 SPs giving the C2050 a total of
448 SPs or cores. Although each SP of a C2050 has its own integer, single- and double-precision units,
the 32 SPs of an SM share 4 single-precision transcendental function units. An SM has 64KB of on-chip
memory that can be “configured as 48KB of shared memory with 16KB of L1 cache (default setting)
or as 16KB of shared memory with 48KB of L1 cache” [7]. Additionally, there are 32K 32-bit registers
per SM and 3GB of off-chip device/global memory that is shared by all 14 SMs. The peak performance
of a C2050 is 1,288 GFlops (or 1.288TFlops) of single-precision operations and 515GFlops of double-
precision operations and the power consumption is 238W [4]. Once again, the peak of 1,288GFlops
requires that MADDs and SF instructions be dual issued. When there are MADDs alone, the peak
single-precision rate is 1.03 TFlops. Notice that the ratio of power consumption to peak single-precision
GFlop rate is 0.2W/GFlop for the C1060 and 0.18W/GFlop for the C2050. The corresponding ratio for
double-precision operations is 2.4W/GFlop for the C1060 and 0.46W/GFlop for the C2050. In NVIDIA
parlance, the C1060 has compute capability 1.3 while the compute capability of the C2050 is 2.0.
A Tesla GPU is packaged as a double-wide PCIe card (Figure 6) and using an appropriate motherboard
and a sufficiently large power supply, one can install up to 4 GPUs on the same motherboard. In this
paper, we focus on single GPU computation.
3 Programming Model
At a high-level, a GPU uses the master-slave programming model [21] in which the GPU operates as
a slave under the control of a master or host processor. In our experimental setup for single-precision
matrix multiplication, the master or host is a 2.8GHz Xeon quad core processor and the GPU is the
NVIDIA Tesla C1060. For double-precision matrix multiplication, the host is a XXX six core processor
and the GPU is the NVIDIA Tesla C2050. We describe the programming model making explicit reference
to the single-precision setup only. Programming in the master-slave model requires us to write a program
that runs on the master processor (in our case the Xeon). This master program sends data to the slave(s)
(in our case a single C1060 GPU), invokes a kernel or function that runs on the slave(s) and processes
this sent data, and finally receives the results back from the slave. This process of sending data to the
slave, executing a slave kernel, and receiving the computed results may be repeated several times by the
master program. In CUDA, the host/master and GPU/slave codes may be written in C. CUDA provides
extensions to C to allow for data transfer to/from device memory and for kernel/slave code to access
registers, shared memory, and device memory.
At another level, GPUs use the SIMT (single instruction multiple thread) programming model in
which the GPU accomplishes a computational task using thousands of light weight threads. The threads
are grouped into blocks and the blocks are organized as a grid. While a block of threads may be 1-,
2-, or 3-dimensional, the grid of blocks may only be 1- or 2-dimensional. Kernel invocation requires the
specification of the block and grid dimensions along with any parameters the kernel may have. This is
illustrated below for a matrix multiply kernel MatrixMultiply that has the parameters a, b, c, and n,
where a, b, and c are pointers to the start of the row-major representation of n × n matrices and the
kernel computes c = a ∗ b.
MatrixMultiply<<<GridDimensions, BlockDimensions>>>(a,b,c,n)
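For concreteness, a host-side launch might look as follows (the 16 × 16 tile size and the one-thread-per-output-element mapping are illustrative assumptions, not the configuration used by GPU8 or sgemm):

   dim3 blockDim(16, 16);                 // a 16 x 16 block of threads
   dim3 gridDim(n / 16, n / 16);          // one block per 16 x 16 tile of c
   MatrixMultiply<<<gridDim, blockDim>>>(a, b, c, n);
   cudaThreadSynchronize();               // block the host until the kernel completes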
A GPU has a block scheduler that dynamically assigns thread blocks to SMs. Since all the threads of
a thread block are assigned to the same SM, the threads of a block may communicate with one another
via the shared memory of an SM. Further, the resources needed by a block of threads (e.g., registers
and shared memory) should be sufficiently small that a block can be run on an SM. The block scheduler
assigns more than 1 block to run concurrently on an SM when the combined resources needed by the
assigned blocks do not exceed the resources available to an SM. However, since CUDA provides no
mechanism to specify a subset of blocks that are to be co-scheduled on an SM, threads of different blocks
can communicate only via the device memory.
Once a block is assigned to an SM, its threads are scheduled to execute on the SM’s SPs by the SM’s
warp scheduler. The warp scheduler divides the threads of the blocks assigned to an SM into warps of
32 consecutively indexed threads from the same block. Multidimensional thread indexes are serialized in
row-major order for partitioning into warps. So, a block that has 128 threads is partitioned into 4 warps.
Every thread currently assigned to an SM has its own instruction counter and set of registers. The warp
scheduler selects a warp of ready threads for execution. If the instruction counters for all threads in the
selected warp are the same, all 32 threads execute in 1 step. On a GPU with compute capability 2.0,
each SM has 32 cores and so all 32 threads can perform their common instruction in parallel, provided,
of course, this common instruction is an integer or floating point operation. On a GPU with compute
capability 1.3, each SM has 8 SPs and so a warp can execute the common instruction for only 8 threads
in parallel. Hence, when the compute capability is 1.3, the GPU takes 4 rounds of parallel execution
to execute the common instruction for all 32 threads of a warp. When the instruction counters of the
threads of a warp are not the same, the GPU executes the different instructions serially. Note that the
instruction counters may become different as the result of “data dependent conditional branches” in a
kernel [7]. When the compute capability is 1.3, the device-memory accesses of a half warp are coalesced
into a single transaction when the data being accessed lie in the same 128-byte segment of device memory.
The transaction size is reduced to 64 bytes in case the accessed data are in the same 64-byte segment
and to 32 bytes when they are in the same 32-byte segment. The transaction size determines the volume
of data moved.
An SM’s warp scheduler is able to hide much of the 400 to 600 cycle latency of a device-memory
access by executing warps that are ready to do arithmetics while other warps wait for device-memory
accesses to complete. So, the performance of code that makes many accesses to device memory can often
be improved by optimizing it to increase the number of warps scheduled on an SM. This optimization
could involve increasing the number of threads per block and/or reducing the shared memory and register
utilization of a block to enable the scheduling of a larger number of blocks on an SM.
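As a hypothetical example of this trade-off on the C1060 (the block parameters below are illustrative, not those of any kernel in this paper): a block of 256 threads that uses 16 registers per thread and 4KB of shared memory needs 256 × 16 = 4,096 registers and 4KB of shared memory, so up to 4 such blocks (1,024 threads, or 32 warps) can be resident on an SM within its 16,384 registers and 16KB of shared memory. If the same block instead used 40 registers per thread, 256 × 40 = 10,240 registers would allow only one resident block (8 warps), leaving the warp scheduler far fewer warps with which to hide device-memory latency.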
__device__ void update2(float *a, float b, float *c)
{// add b times one column of the transposed a tile into the 16 partial
 // results held in the calling thread's registers c[0..15]
   for (int i = 0; i < 16; i++)
      c[i] += a[i * 4] * b;   // consecutive column elements are 4 floats apart
}
using two 128-byte transactions. To accomplish this, each thread reads a 1 × 4 sub-matrix of a using
the data type float4. The 16 × 64 a sub-matrix that is input from device memory may be viewed as a
16 × 16 matrix in which each element is a 1 × 4 vector. The transpose of this 16 × 16 matrix of vectors
is stored in the array as[16][65] with each 1 × 4 vector using four adjacent elements of a row of as. This
mapping ensures that the 16 elements in each column of the 16 × 64 sub-matrix of a that is input from
device memory are stored in different banks of shared memory. So, the writes to shared memory done
by a half warp of GPU8 are conflict free. Further, by storing the transpose of a 16 × 16 matrix of 1 × 4
vectors rather than the transpose of a 16 × 64 matrix of scalars, GPU8 is able to do the writes to shared
memory using float4s rather than floats as would otherwise be the case. This reduces the time to
write to shared memory.
The number of half warps is n^2/256. In each iteration of the for i loop, a half warp makes 4 128-byte
transactions to read in a values and 64 64-byte transactions to read in b values. Thus, GPU8 makes
n^3/4096 128-byte device-memory transactions on a and n^3/256 64-byte transactions on b. Additionally,
n^2/16 64-byte transactions are made on c. Each transaction has 100% utilization. So, the total
number of transactions is 17n^3/4096 + n^2/16 and the volume is 9n^3/32 + 4n^2. By comparison, the
number of transactions and volume for the sgemm code in CUBLAS 3.0 [2] are 5n^3/1024 + n^2/16 and
5n^3/16 + 4n^2, respectively.
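Comparing leading terms shows the advantage of GPU8 over sgemm for large n:

$$\frac{17n^3/4096}{5n^3/1024} = \frac{17}{20} = 0.85, \qquad \frac{9n^3/32}{5n^3/16} = \frac{9}{10} = 0.90,$$

i.e., GPU8 asymptotically makes about 15% fewer device-memory transactions and generates about 10% less data volume than the CUBLAS 3.0 sgemm.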
1. add(X, Y, Z) computes Z = X + Y using the kernel code of Figure 10. Each thread fetches two
adjacent values of X and two adjacent values of Y from device memory using the data type float2.
Since the 16 pairs of X (Y) fetched from device memory by a half warp lie in the same 128-byte segment,
the fetches of a half warp are coalesced into a single 128-byte memory transaction. The fetched pairs of X
and Y are added and the sums written to device memory. This write also requires one memory
transaction per half warp. So, two m × m matrices are added using a total of 3m^2/32 128-byte
transactions that result in a total volume of 12m^2 bytes.
3. mul(X, Y, Z) computes Z = X ∗ Y using the kernel code of Figures 8 and 9. Let T and V,
respectively, denote the number of memory transactions and volume for this code when multiplying
two m × m matrices (T = 17m^3/4096 + m^2/16 and V = 9m^3/32 + 4m^2).
__global__ void GPU8 (float *a, float *b, float *c, int n)
{// thread code to compute one column of a 16 x 128 sub-matrix of c
   // use shared memory to hold the transpose of a
   // 16 x 64 sub-matrix of 1 x 4 sub-vectors of a
   __shared__ float as[16][65];
   // aNext, bNext, cNext are this thread's starting offsets into a, b, and c;
   // their computation from blockIdx/threadIdx is not shown in this excerpt
   a += aNext;
   b += bNext;
   c += cNext;

Figure 8: GPU8 (Part a)
elements of W ∗ X as each is computed. Instead, after an element of W ∗ X has been computed, the
corresponding elements of Y and Z are read from device memory, incremented by the computed
element of W ∗ X, and the incremented values written back to device memory. Note that each
element of W ∗ X is computed exactly once. The modified kernel makes T − m^2/16 transactions to
multiply W and X as it does not write out W ∗ X. Additional transactions are made to fetch Y and
Z and write the incremented values. A half warp reads/writes Y and Z using coalesced 64-byte
transactions. The total number of these transactions is m^2/4 (m^2/16 transactions are made to read
or write each of Y and Z). So, the total number of transactions is T + 3m^2/16 and the volume is
V + 12m^2.
   float br0 = b[0];          // four b values from four consecutive rows of b
   float br1 = b[n];
   float br2 = b[nTimes2];
   float br3 = b[nTimes3];
   b += nTimes4;
   #pragma unroll
   for (int k = 0; k < 15; k++)
   {  // accumulate the contribution of these four b rows (and the corresponding
      // four columns of the a tile) into the 16 partial results in cr,
      // prefetching the next four b values as we go
      update2 (&as[k][0], br0, cr); br0 = b[0];
      update2 (&as[k][1], br1, cr); br1 = b[n];
      update2 (&as[k][2], br2, cr); br2 = b[nTimes2];
      update2 (&as[k][3], br3, cr); br3 = b[nTimes3];
      b += nTimes4;
   }
   a4 += 16;
   __syncthreads(); // wait for computation to complete
}

Figure 9: GPU8 (Part b)
6. mulStoreDec(W, X, Y, Z) computes (Y, Z−) = W ∗ X. Again, this is done by modifying the matrix
multiply kernel so that after it stores a computed element of W ∗ X to the appropriate device
memory location for Y, it reads the corresponding element of Z from device memory, decrements
this element of Z by the value of the just computed element of Y, and stores the decremented element
of Z in device memory. In addition to the transactions (T) made to compute and store W ∗ X,
the modified kernel fetches and writes Z using m^2/8 64-byte transactions. So, the modified kernel
makes a total of T + m^2/8 transactions and generates a volume of V + 8m^2.
__global__ void add (float *d_A, float *d_B, float *d_C, int widthA, int widthB, int widthC)
{
   int startA = blockIdx.x*64 + threadIdx.x*2 + (blockIdx.y*8 + threadIdx.y)*widthA;
   int startB = blockIdx.x*64 + threadIdx.x*2 + (blockIdx.y*8 + threadIdx.y)*widthB;
   int startC = blockIdx.x*64 + threadIdx.x*2 + (blockIdx.y*8 + threadIdx.y)*widthC;
   // each thread reads two adjacent elements of A and of B as a float2 (coalesced)
   float2 tempA = *((float2 *) (d_A + startA));
   float2 tempB = *((float2 *) (d_B + startB));
   tempA.x += tempB.x;
   tempA.y += tempB.y;
   // a single coalesced float2 store writes the two sums to C
   *((float2 *) (d_C + startC)) = tempA;
}

Figure 10: add kernel
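A launch configuration consistent with this indexing (one possibility; each 32 × 8 thread block covers a 64 × 8 tile, two adjacent floats per thread) is:

   dim3 blockDim(32, 8);
   dim3 gridDim(m / 64, m / 8);
   add<<<gridDim, blockDim>>>(d_A, d_B, d_C, m, m, m);   // d_C = d_A + d_B for m x m matrices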
8. mulAdd(W, X, Y, Z) computes Z = W ∗ X + Y. This kernel, in addition to doing all the work done
by the matrix multiply kernel, needs to fetch Y from device memory. This fetching is done using
m^2/16 64-byte transactions. So, the total number of transactions is T + m^2/16 and the volume is
V + 4m^2.
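The fused kernels above are obtained by modifying GPU8’s store phase. The kernel below is only a simplified sketch of that idea for mulAdd (one thread per output element, no shared-memory tiling, hypothetical name mulAddNaive); it shows how the epilogue fetches Y once and writes Z once per element:

__global__ void mulAddNaive(const float *W, const float *X, const float *Y,
                            float *Z, int m)
{// naive sketch of Z = W * X + Y for m x m row-major matrices; NOT the
 // optimized GPU8-based kernel used in this paper
   int col = blockIdx.x * blockDim.x + threadIdx.x;
   int row = blockIdx.y * blockDim.y + threadIdx.y;
   if (row >= m || col >= m) return;
   float sum = 0.0f;
   for (int k = 0; k < m; k++)                  // one element of W * X
      sum += W[row * m + k] * X[k * m + col];
   Z[row * m + col] = sum + Y[row * m + col];   // fused: read Y once, write Z once
}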
6 One-Level Adaptation
6.1 One-Level Strassen
In a one-level implementation of Strassen’s algorithm and Winograd’s variant, the 7 matrix products M1
through M7 are computed by a direct application of GPU8 (i.e., Strassen’s and Winograd’s equations
are not applied recursively).
Kernel          Transactions                    Volume
add             3m^2/32                         12m^2
sub             3m^2/32                         12m^2
mul             T = 17m^3/4096 + m^2/16         V = 9m^3/32 + 4m^2
mulIncInc       T + 3m^2/16                     V + 12m^2
mulIncDec       T + 3m^2/16                     V + 12m^2
mulStoreDec     T + m^2/8                       V + 8m^2
mulStoreInc     T + m^2/8                       V + 8m^2
mulAdd          T + m^2/16                      V + 4m^2
mulIncIncInc    T + 5m^2/16                     V + 20m^2
mulSubInc       T + 3m^2/16                     V + 12m^2

Figure 11: Device-memory transactions and volume for the GPU kernels on m × m matrices
Figure 12 gives the sequence of kernel calls in a one-level implementation
of Strassen’s method. We refer to the resulting program as one-level Strassen. The one-level GPU
implementation of Strassen’s method invokes the add and sub kernels 10 times, the mul and mulIncInc
kernels twice each, and the mulStoreDec, mulStoreInc, and mulIncDec kernels once each. Using the
transaction and volume data for each kernel (Figure 11), we determine the total transaction count to
be 7T + 7m^2/4 and the total volume to be 7V + 172m^2, where T = 17m^3/4096 + m^2/16 and V =
9m^3/32 + 4m^2. When multiplying n × n matrices, the kernels are invoked with m = n/2. So, the total
number of transactions is 119n^3/32768 + 35n^2/64 and the volume is 63n^3/256 + 50n^2.
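These totals follow directly by substituting m = n/2:

$$7T + \tfrac{7}{4}m^2\Big|_{m=n/2} = \tfrac{119}{32768}n^3 + \tfrac{7}{64}n^2 + \tfrac{28}{64}n^2 = \tfrac{119}{32768}n^3 + \tfrac{35}{64}n^2, \qquad 7V + 172m^2\Big|_{m=n/2} = \tfrac{63}{256}n^3 + 7n^2 + 43n^2 = \tfrac{63}{256}n^3 + 50n^2.$$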
Step Computation GPU Kernel
1 C12 = A21 − A11 sub(A21 , A11 , C12 )
2 C21 = B11 + B12 add(B11 , B12 , C21 )
3 C22 = C12 ∗ C21 mul(C12 , C21 , C22 )
4 C12 = A12 − A22 sub(A12 , A22 , C12 )
5 C21 = B21 + B22 add(B21 , B22 , C21 )
6 C11 = C12 ∗ C21 mul(C12 , C21 , C11 )
7 C12 = A11 + A22 add(A11 , A22 , C12 )
8 C21 = B11 + B22 add(B11 , B22 , C21 )
9 T1 = C12 ∗ C21
10 C11 = T1 + C11
11 C22 = T1 + C22 mulIncInc(C12 , C21 , C11 , C22 )
12 T2 = A21 + A22 add(A21 , A22 , T2 )
13 C21 = T2 ∗ B11
14 C22 = C22 − C21 mulStoreDec(T2 , B11 , C21 , C22 )
15 T1 = B21 − B11 sub(B21 , B11 , T1 )
16 T2 = A22 ∗ T1
17 C21 = C21 + T2
18 C11 = C11 + T2 mulIncInc(A22 , T1 , C21 , C11 )
19 T1 = B12 − B22 sub(B12 , B22 , T1 )
20 C12 = A11 ∗ T1
21 C22 = C22 + C12 mulStoreInc(A11 , T1 , C12 , C22 )
22 T2 = A11 + A12 add(A11 , A12 , T2 )
23 T1 = T2 ∗ B22
24 C12 = C12 + T1
25 C11 = C11 − T1 mulIncDec(T2 , B22 , C12 , C11 )

Figure 12: GPU kernels in the one-level GPU implementation of Strassen’s method
Y and Z from device memory, increment X and Y, and then write the incremented X and Y to device
memory. Hence the kernel reads Z only once, whereas incrementing X and Y using the two separate
steps X += Z and Y += Z would read Z twice.
When τ2 < n ≤ 2τ2, the execution of Figure 16 is referred to as a two-level Strassen multiplication.
The number of arithmetics, A(2, n), in a two-level multiplication is 7A(1, n/2) + 18ADD(n/2),
where A(1, n/2) is the number of arithmetics needed in a one-level multiplication of n/2 × n/2 matrices
and ADD(n/2) is the number of arithmetics needed to add two n/2 × n/2 matrices. So, A(2, n) =
7(7(2(n/4)^3 − (n/4)^2) + 18ADD(n/4)) + 18ADD(n/2) = 49n^3/32 + 149n^2/16. For the number of transactions,
T(2, n), we see that a two-level multiplication does 12 adds/subtracts/increments/decrements of
n/2 × n/2 matrices with each requiring 3(n/2)^2/32 = 3n^2/128 transactions. The ++ and −+ operations
each make 5n^2/128 transactions (this is a reduction of n^2/128 over doing two += or one += and one
−= operation). Each of the 7 multiply operations multiplies two n/2 × n/2 matrices using a one-level
multiply that does 119(n/2)^3/32768 + 35(n/2)^2/64 transactions. So, T(2, n) = 833n^3/262144 + 347n^2/256.
Using a similar analysis, we see that the volume, V(2, n), is 441n^3/2048 + 277n^2/2.
When 2^(k−2)τ2 < n ≤ 2^(k−1)τ2, a k-level execution of Strassen occurs.
Step Computation GPU Kernel
1 T1 = A11 − A21 sub(A11 , A21 , T1 )
2 T2 = B22 − B12 sub(B22 , B12 , T2 )
3 C21 = T1 ∗ T2 mul(T1 , T2 , C21 )
4 T1 = A21 + A22 add(A21 , A22 , T1 )
5 T2 = B12 − B11 sub(B12 , B11 , T2 )
6 C22 = T1 ∗ T2 mul(T1 , T2 , C22 )
7 T1 = T1 − A11 sub(T1 , A11 , T1 )
8 T2 = B22 − T2 sub(B22 , T2 , T2 )
9 C11 = T1 ∗ T2 mul(T1 , T2 , C11 )
10 T1 = A12 − T1 sub(A12 , T1 , T1 )
11 C12 = T1 ∗ B22
12 C12 = C22 + C12 mulAdd(T1 , B22 , C22 , C12 )
13 T1 = A11 ∗ B11
14 C11 = C11 + T1
15 C12 = C11 + C12
16 C11 = C11 + C21 mulIncIncInc(A11 , B11 , T1 , C21 , C11 , C12 )
17 T2 = T2 − B21 sub(T2 , B21 , T2 )
18 C21 = A22 ∗ T2
19 C21 = C11 − C21
20 C22 = C11 + C22 mulSubInc(A22 , T2 , C11 , C21 , C22 )
21 C11 = A12 ∗ B21
22 C11 = T1 + C11 mulAdd(A12 , B21 , T1 , C11 )
Figure 13: GPU kernels in Douglas et al.’s [8] implementation of Winograd variant
Method      Arithmetics       Transactions                Volume
Winograd    7n^3/4 + 2n^2     119n^3/32768 + 29n^2/64     63n^3/256 + 41n^2
Strassen(A, B, C, n) {
   if (n <= τ1) compute C = A ∗ B using GPU8;
   else if (n <= τ2) compute C = A ∗ B using Figure 12;
   else {
      C12 = A21 − A11; C21 = B11 + B12; Strassen(C12, C21, C22, n/2);  // M6
      C12 = A12 − A22; C21 = B21 + B22; Strassen(C12, C21, C11, n/2);  // M7
      C12 = A11 + A22; C21 = B11 + B22; Strassen(C12, C21, T1, n/2);   // M1
      (C11+, C22+) = T1; T2 = A21 + A22; Strassen(T2, B11, C21, n/2);  // M2
      C22 −= C21; T1 = B21 − B11; Strassen(A22, T1, T2, n/2);          // M4
      (C11+, C21+) = T2; T1 = B12 − B22; Strassen(A11, T1, C12, n/2);  // M3
      C22 += C12; T2 = A11 + A12; Strassen(T2, B22, T1, n/2);          // M5
      (C11−, C12+) = T1;
   }
}

Figure 16: Multilevel GPU implementation of Strassen’s method
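For this k-level execution, following the same per-level accounting as in the two-level analysis above (each level performs 12 add/subtract kernels and 3 combined increment/decrement kernels on n/2 × n/2 matrices, in addition to 7 (k−1)-level multiplications), the counts satisfy

$$A(k,n) = 7A(k-1,\,n/2) + \tfrac{9}{2}n^2,\qquad T(k,n) = 7T(k-1,\,n/2) + \tfrac{51}{128}n^2,\qquad V(k,n) = 7V(k-1,\,n/2) + 51n^2,$$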
where A(1, n), T(1, n), and V(1, n) are for a 1-level execution and are given in Figure 14.
Figures 17 through 19 give the values of A, T, and V for k = 2, 3, and 4. Figures 20 through 22 give
the percent reduction in arithmetics, transactions, and volume relative to GPU8 for k = 2, 3, and 4.
Based on these numbers, we expect the two-level Strassen algorithm to run about 20% faster than GPU8
when n = 16384 (this would correspond to τ2 = 8192); we expect the three-level Strassen algorithm to run
26% to 33% faster than GPU8; and the 4-level version to run 29% to 41% faster (depending on whether
arithmetics, transactions, or volume dominates run time).
Method      Arithmetics                      Transactions                               Volume
GPU8        2n^3 − n^2                       17n^3/4096 + n^2/16                        9n^3/32 + 4n^2
Strassen    2401n^3/2048 + 10469n^2/256      40817n^3/16777216 + 21491n^2/4096          21609n^3/131072 + 18061n^2/32
Winograd    2401n^3/2048 + 2081n^2/64        40817n^3/16777216 + 17573n^2/4096          21609n^3/131072 + 29315n^2/64
Winograd(A, B, C, n)
{
   if (n <= τ1) compute C = A ∗ B using GPU8;
   else if (n <= τ2) compute C = A ∗ B using Figure 13;
   else {
      T1 = A11 − A21; T2 = B22 − B12; Winograd(T1, T2, C21, n/2);         // M4
      T1 = A21 + A22; T2 = B12 − B11; Winograd(T1, T2, C22, n/2);         // M5
      T1 −= A11; T2 = B22 − T2; Winograd(T1, T2, C11, n/2);               // M1
      T1 = A12 − T1; Winograd(T1, B22, C12, n/2);                         // M6
      C12 += C22; Winograd(A11, B11, T1, n/2);                            // M2
      (C12, C11) = (C11 + C12 + T1, C11 + C21 + T1);
      T2 −= B21; Winograd(A22, T2, C21, n/2);                             // M7
      (C21, C22) = (C11 − C21, C11 + C22); Winograd(A12, B21, C11, n/2);  // M3
      C11 += T1;
   }
}
outermost level is therefore 41. So, for this code, we see that:
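each level of Winograd contributes 15 quadrant-sized matrix add/subtracts (15(n/2)^2 arithmetics) and 41 quadrant-sized device-memory reads and writes, each such access costing (n/2)^2/32 transactions and 4(n/2)^2 bytes, so, as a sketch under this accounting,

$$A_W(k,n) = 7A_W(k-1,\,n/2) + \tfrac{15}{4}n^2,\qquad T_W(k,n) = 7T_W(k-1,\,n/2) + \tfrac{41}{128}n^2,\qquad V_W(k,n) = 7V_W(k-1,\,n/2) + 41n^2$$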
for k > 1 and A(1, n), T (1, n), and V (1, n) are as in Figure 15.
Figures 17 through 19 and Figures 20 through 22, respectively, give the values of A, T, and V and
the percent reduction in these quantities relative to GPU8 for k = 2, 3, and 4. The expected speedup of
Winograd relative to GPU8 is slightly higher than for Strassen.
8 Experimental Results
8.1 Single Precision Experiments
We programmed several versions of GPU8, Strassen, Winograd, and sgemm using CUDA and measured
their run time as well as accuracy on a Tesla C1060 GPU. The different versions of each algorithm varied
in their use of texture memory for the input matrices A and B. Because of the limited availability of
texture memory, GPU8 and sgemm can be readily adapted to use texture memory only when n < 16384.
For larger values of n, it is necessary to write a blocked version of these algorithms, invoking the blocked
version using texture memory for the smaller sized A and B blocks to be multiplied. Our experiments
with the blocked version of sgemm, for example, resulted in a very small reduction in run time from the
use of texture memory. For example, when n = 16384, our texture memory versions of sgemm yielded
best performance using blocks of size 8192 × 8192 and designating only the blocks of A as texture blocks.
The measured reduction in time was about 0.6% relative to the non-blocked sgemm code. Because of this
very marginal reduction in run time even for the largest matrix size we used in our experiments, we do not
report further on the blocked texture memory versions of GPU8 and sgemm. Strassen and Winograd,
on the other hand, are well suited for texture memory as they recursively and naturally decompose
matrices into smaller submatrices until the threshold value τ2 is reached.
Algorithm    τ2      n = 2048   n = 4096   n = 8192   n = 16384
sgemm        -       0.048      0.373      2.966      23.699
GPU8         -       0.046      0.361      2.875      22.971
Strassen     4096    0.046      0.329      2.344      16.561
tStrassen    2048    0.044      0.320      2.276      16.107
Winograd     4096    0.046      0.328      2.329      16.425
tWinograd    2048    0.044      0.318      2.243      15.846

Run times (seconds) for single-precision multiplication of n × n matrices
So long as τ2 ≤ 16384, the pairs of matrices to be multiplied by GPU8 using the one-level kernels at the lowest level of recursion
may be designated as texture matrices. Again, our experiments showed best performance when only the
first matrix in the pair to be multiplied was designated as texture. Hence, in the following, tStrassen
and tWinograd refer to versions of Strassen and Winograd in which, when GPU8 is invoked by the
one-level code used when the matrix size drops to τ2, the first matrix of each pair to be multiplied by
GPU8 is designated as texture (the syntax of the GPU8 code is correspondingly modified to work with
its first matrix being texture). For our experiments, we set τ1 = τ2/2.
8.1.2 Accuracy
The primary reason Strassen’s algorithm has not found wide application is that it is less numerically
stable than the classical O(n^3) algorithm [9]. We assess the numerical accuracy of Strassen’s algorithm
and Winograd’s variant using the test matrix used in [13]:
A = I + uv^T        B = I − (1/(1 + v^T u)) uv^T        C = I
where I is the Kronecker delta matrix or simply the identity matrix (i.e., a matrix with 1s on the diagonal
and 0 elsewhere), and the vectors u and v are as below:
u_i = 1/(N + 1 − i)        v_i = √i        i = 1, · · · , N
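Indeed, the rank-one terms cancel exactly:

$$AB = (I + uv^T)\left(I - \frac{uv^T}{1+v^Tu}\right) = I + uv^T - \frac{uv^T + u(v^Tu)v^T}{1+v^Tu} = I + uv^T - uv^T = I.$$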
[Figure: run times (seconds) vs. N for sgemm, GPU8, Strassen, tStrassen, Winograd, and tWinograd]
Notice that although the product of A and B (i.e., I) is (theoretically) independent of u_i and v_i, there
is some dependence, in practice, because of numerical errors introduced in the initialization of A and B
on the host CPU during the computation of uv^T and the floating-point divisions. Figure 30 gives the
maximum absolute difference between an element of C as computed by each of our algorithms and the
ground truth I and Figure 31 gives the average of the absolute differences. For comparison purposes, we
include also the errors in the results obtained using the classical O(n^3) matrix multiplication algorithm
on the host CPU. For the reported errors, we used τ2 = 2048 and 4096. Since the use of texture memory
does not impact accuracy, Figures 30 and 31 do not explicitly show error measurements for tStrassen
and tWinograd (the errors, respectively, are the same as for Strassen and Winograd). The maximum
and average errors for the classical CPU algorithm, sgemm, and GPU8 algorithms are almost the same.
However, the errors for Strassen and Winograd are substantially larger than those for the classical
algorithm, sgemm and GPU8 when n > τ1 = τ2/2 (when n ≤ τ1 = τ2/2, Strassen and Winograd
reduce to GPU8). In fact, when n = 16384 and τ2 = 2048, the maximum error for Strassen is 200 times
that for the classical algorithm, sgemm and GPU8 while the average error is 424 times as much. The
corresponding ratios for Winograd are 1615 and 5151. We note also that when n = 16384 and τ2 = 2048,
the maximum error for Winograd is about 7.6 times that for Strassen and the average error is about 12 times as much.
[Figure: actual speedup in time and reduction in arithmetics, transactions, and volume (%) vs. N]
Algorithm    0-level   1-level   2-level   3-level   4-level
Strassen     3.9e-4    3.3e-3    3.1e-2    5.8e-2    8.3e-2
Winograd     3.9e-4    1.4e-3    9.7e-3    1.6e-1    6.3e-1
and in a 2-level execution, τ2 < n ≤ 2τ2 . A 0-level execution occurs when n ≤ τ1 = τ2 /2. Figures 32
through 35 give the maximum and average errors as a function of the level of the execution for the case
n = 16384 and Figures 36 through 39 give the run time and reduction in run time relative to sgemm
and GPU8. As expected, the errors and speedup (reduction in run time) increase with the number of
levels. For example, the 1-level version of Strassen achieves almost a 15% speedup relative to sgemm
at the expense of an almost 13 fold increase in the maximum error and an almost 17 fold increase in the
average error while the 4-level version achieves a speedup of almost 29% at a cost of an almost 213 fold
increase in the maximum error and an almost 425 fold increase in the average error.
[Figure: maximum errors of Strassen and Winograd vs. number of levels]
[Figure: average errors of Strassen and Winograd vs. number of levels]
[Figure: speedup (%) vs. number of levels for Strassen, tStrassen, Winograd, and tWinograd]
the speedup attained by dStrassen was 20.2% and that attained by dWinograd was 21%. These compare
with speedups of 21% and 21.5% attained by Strassen and Winograd relative to sgemm when n = 8192.
[Figure: double-precision run times (seconds) for dgemm, Strassen, and Winograd at N = 4096 and 8192]
[Figure: speedup over dgemm (%) at N = 4096 and 8192]
To assess the accuracy of computation in double precision mode, we used the same test matrix as used
in Section 8.1.2. Figure 43 gives the maximum and average errors in the computed product matrix when
n = 8192. While a double-precision computation using the classical matrix multiplication algorithm on
the CPU has the same error characteristics as dgemm, dStrassen and dWinograd have errors that are
an order of magnitude higher; the errors using dStrassen are about half those using dWinograd.
We conducted an additional experiment to gauge the accuracy of the double precision algorithms. In
this experiment, we generated 10 different 8192 × 8192 matrices with elements randomly selected from
the range [−1, 1]. The maximum and average errors for each computation were computed relative to
the results obtained by the classical matrix multiply algorithm on the CPU and then normalized by the
average of the absolute values of the elements computed by the classical CPU algorithm. Figure 44 gives
            O(n^3) on CPU   dgemm     dStrassen   dWinograd
Maximum     6.5e-13         6.5e-13   4.3e-12     7.5e-12
Average     1.2e-16         1.2e-16   1.3e-15     3.1e-15
Figure 43: Errors for test matrix of Section 8.1.2 when n = 8192 and τ2 = 4096
Figure 44: Normalized maximum and average errors for ten random matrices, n = 8192 and τ2 = 4096
the maximum of the normalized maximum errors and the average of the normalized average errors.
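In symbols, for each trial the normalized maximum and average errors are

$$\frac{\max_{i,j}|C_{ij}-\hat C_{ij}|}{\operatorname{avg}_{i,j}|\hat C_{ij}|} \quad\text{and}\quad \frac{\operatorname{avg}_{i,j}|C_{ij}-\hat C_{ij}|}{\operatorname{avg}_{i,j}|\hat C_{ij}|},$$

where Ĉ denotes the product computed by the classical CPU algorithm; Figure 44 then reports the maximum of the former and the average of the latter over the ten trials.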
9 Conclusion
We have developed efficient GPU implementations of Strassen’s and Winograd’s matrix multiplication
algorithms. Our experiments indicate that for single-precision arithmetic a speedup of 32% is achieved by
Strassen’s algorithm while Winograd’s variant achieves a speedup of 33% relative to the sgemm code in
CUBLAS when multiplying 16384 × 16384 matrices. Our double-precision implementations of Strassen’s
and Winograd’s algorithms, respectively, achieve a speedup of 20.2% and 21% relative to dgemm when
the matrix size n is 8192. These speedups, however, come at a significant cost in the accuracy of the
computed result. The maximum numerical errors introduced by Strassen’s and Winograd’s algorithms
are about 2 orders of magnitude higher than those for sgemm when n = 16384 and about 1 order of
magnitude higher than for dgemm when n = 8192. The average numerical errors introduced by Strassen’s
and Winograd’s algorithms are, respectively, 2 and 3 orders of magnitude higher than those for sgemm
when n = 16384 and about 1 order of magnitude higher than for dgemm when n = 8192. Whether the
loss in accuracy is acceptable or not will depend on the application. We have analyzed the arithmetic,
transaction and volume complexity of the various matrix multiplication algorithms considered in this
paper (single-precision versions). Our experiments indicate that speedup most closely follows volume.
References
[1] D. Bailey, K. Lee, and H. Simon, Using Strassen’s algorithm to accelerate the solution of linear systems, Journal of Supercomputing, 4, 357-371, 1990.

[2] http://icl.cs.utk.edu/magma/

[3] B. Boyer, C. Pernet, and W. Zhou, Memory efficient scheduling of Strassen-Winograd’s matrix multiplication algorithm, ACM ISSAC, 2009.

[5] D. Coppersmith and S. Winograd, Matrix multiplication via arithmetic progressions, Journal of Symbolic Computation, 9, 3, 251-280, 1990.

[7] NVIDIA CUDA Programming Guide, Version 3.0, 2010, http://developer.nvidia.com/object/gpucomputing.html

[8] C. Douglas, M. Heroux, G. Slishman, and R. Smith, GEMMW: A portable level 3 BLAS Winograd variant of Strassen’s matrix-multiply algorithm, Journal of Computational Physics, 110, 1-10, 1994.

[9] N. Higham, Exploiting fast matrix multiplication within the level 3 BLAS, ACM Transactions on Mathematical Software, 16(4), 352-368, 1990.

[11] S. Huss-Lederman, E. Jacobson, J. Johnson, A. Tsao, and T. Turnbull, Strassen’s algorithm for matrix multiplication: Modeling, analysis, and implementation, CCS-TR-96-17, Center for Computing Sciences, 1996.

[12] J. Li, S. Ranka, and S. Sahni, GPU matrix multiplication, chapter in Handbook on Multicore Computing (Editor: S. Rajasekaran), Chapman Hall, 2011, to appear.

[13] I. Kaporin, A practical algorithm for faster matrix multiplication, Numerical Linear Algebra with Applications, 6, 687-700, 1999.

[14] S. Robinson, Toward an optimal algorithm for matrix multiplication, SIAM News, 38, 9, 2005.

[15] S. Sahni, Data Structures, Algorithms, and Applications in C++, Second Edition, Silicon Press, NJ, 2005.

[16] N. Satish, M. Harris, and M. Garland, Designing efficient sorting algorithms for manycore GPUs, IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2009.

[17] V. Strassen, Gaussian elimination is not optimal, Numerische Mathematik, 13, 354-356, 1969.

[18] V. Volkov and J. Demmel, Benchmarking GPUs to tune dense linear algebra, Supercomputing, 2008.

[20] S. Winograd, On multiplication of 2 × 2 matrices, Linear Algebra and its Applications, 4, 381-388, 1971.

[21] Y. Won and S. Sahni, Hypercube-to-host sorting, Journal of Supercomputing, 3, 41-61, 1989.