The first error theory and bounds for Fast Matrix Multiplication based on the Strassen-Winograd algorithms (FastMMW) were formulated in the 1970s. The theory introduced the concept now known as weakly-stable error analysis, where the error bounds must use matrix norms instead of component-wise bounds. While the theory debunked the instability myth by using matrix scaling and a clean and simple analysis, its bounds are available only as properties of the whole matrices: they are too coarse, pessimistic, at times used to suggest instability, and not used for algorithm optimization. We build on top of the original theory in order to reformulate the bounds: we show that tighter norm-wise and component-wise bounds are achievable by orthogonal algorithm optimizations. To achieve even better discrimination and to circumvent the use of norm bounds, we develop an error theory using concepts from communication theory and statistics: we investigate lower and upper bounds, we estimate the practical bounds, and we investigate the algorithmic nature of the error for the class of random matrices. The theory and tools are not limited to random matrices, and we foresee further investigations of different matrix classes and algorithms. We propose new and more accurate algorithms. We show, theoretically and empirically, that we can improve the maximum absolute error of any FastMMW algorithm by 10%-20% per recursion (we reduce the error by half for 4 recursions). Our theory and practice, in turn, provide a kick start for the development of hybrid algorithms as accurate as the vendor GEMM implementation, and in certain cases even more accurate for random matrices.
We are interested in the analysis, design, and implementation of the algorithms known as Fast Matrix Multiplication based on the variants of Strassen and Winograd (i.e., Fast Matrix Multiplication by Winograd's algorithms, FastMMW, see [
The FastMMWs are used and investigated in several contexts, and there is a wealth of related work; furthermore, with every new architecture there are new implementations, and thus new results. For example, we have made a small contribution in the exploration of a large set of architectures. In practice, performance is the main reason for the FastMMW proposal, and we often hear that accuracy is the main concern. Here, we address the accuracy of FastMMW for any type of matrix, and we provide new tools to handle the class of matrices broadly called random.
We show that using the theory developed by Brent [
In our work, the design of kernel algorithms, we struggle to provide performance and accuracy estimates for our algorithms. We struggle because we do not always know the context where our algorithms will be applied. The use of random matrices for such estimates is common. Random matrices are useful tools to describe the range (i.e., in the range [0, 1]) and the values of matrices without a clear pattern (e.g., with normal distribution); that is, random. This class of matrices is special: they have full rank and they are dense. Being without pattern makes them unlikely benchmarks, in the common sense of the term; however, they are ideal for the testing of algorithms. There are a few good reasons why researchers, like us, entertain testing with these special matrices; in the following we share three.
First, there is a known-unknown effect. If we design a new algorithm and we want to test its performance and accuracy, it is plainly impossible to test every possible matrix. We know that we do not know what matrices will be used, so we substitute our unknown matrices with something random. It is an understandable and common misdemeanor.
Second, there is a practical appeal. They are easy to generate with different statistical properties: uniformly distributed in an interval such as [−1, 1], or Normal with a specific mean and variance, to give two common continuous distributions (e.g., a user guide of random matrices [
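As a minimal illustration of how easily such inputs are produced (our sketch, using NumPy's generators, not part of the FastMMW package):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 512

A_uniform = rng.uniform(-1.0, 1.0, size=(n, n))        # uniform in [-1, 1]
A_positive = rng.uniform(0.0, 1.0, size=(n, n))        # uniform in [0, 1]
A_normal = rng.normal(loc=0.0, scale=1.0, size=(n, n)) # Normal(0, 1)
```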
Third, there is a statistical appeal. We know and control the statistical features of the input, and we can measure the statistical properties of the output quite easily. We can use such information to derive the so-called transfer function of the algorithms; we provide a formal definition starting from Section 4. The random matrix is a tool for the computation of the transfer function. If we obtain an adequate transfer function, we can estimate statistical properties of the output. Among these properties are the distribution range of the maximum absolute error and its location in the result matrix: the maximum error is the most common measure to estimate the accuracy of an algorithm. We are interested in a component-wise transfer function so as to estimate component-wise error bounds.
The transfer function has specific properties that will allow us to extend the weak-stability error analysis and our optimizations to random matrices and to obtain component-wise bounds.
We can state our contribution in one paragraph. We present a methodology that improves the FastMMW error bounds and provides the estimate and measure of the error for each component of the result matrix for random matrices. This gives us a complete error-analysis tool set: we model the error, measure the error (by experiments in IEEE-754 single-precision floating-point arithmetic), validate the model, and ultimately design and implement more accurate algorithms. These algorithms enrich the FastMMW software package. In turn, all performance tables, plots, and other graphical presentations are drawn automatically using the Python FastMMW package. The self-contained software will help any independent validation and reproduction of all the following results; ultimately, it will simplify the exploration of new algorithms in the future and the transfer of FastMM algorithms to a larger audience.
We organize the paper as follows. In Section 2, we introduce the theory of weak stability, the most successful error analysis to date, together with the relevant references, a description of the main ideas, and our main difference. In Section 3, we bridge the error analysis with tools used in linear time-series analysis. In Section 4, we introduce and formalize the transfer function. We present a complexity analysis using the transfer function in Section 5. Using this analysis, we propose optimizations in Section 6. In Section 7, we show the practical relation between the transfer-function complexity and weak stability. We draw our conclusions in Section 8.
An algorithm is a constructive way of representing and computing a function. The output of an algorithm is often an approximation of the true result, due to the approximation of the data and of the intermediary states of the computation. Stability analysis is the ability to quantify the maximum or the expected error an algorithm will introduce during and at the end of its computation.
The most natural way to quantify an error is by estimating the difference between what we should compute and what we compute instead. So if the ideal computation with ideal representation of the inputs and outputs is a matrix $C = AB$, what we obtain in practice is simply the computation $\hat{C} = \mathrm{fl}(AB)$ in finite-precision arithmetic (including any initial error). The component-wise comparison between $C$ and $\hat{C}$ is the matrix

\[ E = |C - \hat{C}|, \]

representing the absolute difference of the two matrices, where the equality is meant component-wise: $e_{ij} = |c_{ij} - \hat{c}_{ij}|$.
We know that for any conventional Matrix Multiplication (MM) computing each component with $n$ products and $n-1$ additions, the classic component-wise bound holds:

\[ |C - \hat{C}| \le n\, u\, |A|\,|B| + O(u^2), \qquad (1) \]

where $u$ is the unit roundoff and $|A|$ denotes the matrix of absolute values.
The interpretation is simple and important. Given any component error $e_{ij}$, it is bounded by the same expression $n\,u\,(|A|\,|B|)_{ij}$: the bound has the same form for every component, so the error of the conventional MM is expected to be uniformly distributed across the result matrix.
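For instance, a back-of-the-envelope instance of this bound (assuming IEEE-754 single precision, $u = 2^{-24} \approx 6 \times 10^{-8}$):

\[ n = 1000: \qquad e_{ij} \;\le\; n\,u\,(|A|\,|B|)_{ij} \;\approx\; 6 \times 10^{-5}\,(|A|\,|B|)_{ij}, \]

the same form for every component $(i,j)$ of the result.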
The first bounds for the FastMM accuracy are by Brent [
For both Strassen's and Winograd's algorithms (for the latter Higham provides a complete analysis [...]), the norm-wise bound has the form

\[ \|C - \hat{C}\| \le \left[ \left(\frac{n}{n_0}\right)^{\log_2 K} (n_0^2 + \alpha n_0) - \alpha n \right] u\, \|A\|\, \|B\| + O(u^2), \qquad (2) \]

where $n_0$ is the problem size at which the recursion stops and a conventional MM takes over, and $K$ and $\alpha$ are algorithm-specific constants ($K = 12$ and $\alpha = 5$ for Strassen; $K = 18$ and $\alpha = 6$ for the Winograd variant). Intuitively, the ratio $n/n_0$ counts the number of recursion levels, and the factor $(n/n_0)^{\log_2 K}$ quantifies how much the error bound grows per level. Without loss of generality and to simplify the equation, consider the matrices scaled so that $\|A\| = \|B\| = 1$, so the bound reduces to the constant in front of $u$. For example, for one level of recursion of Strassen's algorithm, $n = 2 n_0$ and the constant is $12(n_0^2 + 5 n_0) - 10 n_0 = 12 n_0^2 + 50 n_0$.
We understand from this equation that we can apply two different optimizations: we improve the leaf computation [
Miller [
Miller infers that Brent's bounds are the best we can obtain for FastMM because Equation (2) satisfies the form of Equation (3). In practice, Miller's argument is to introduce and evaluate different ways to compute bounds for bilinear forms (FastMM), and he introduces the terminology known today: Brent stability and Restricted Brent stability (in honor of the original author), and the more common terms of Weak and Strong Stability. In the literature and in this paper, by weak stability we mean Brent stability (norm-wise bounds).
In Section 4, we shall present graphical tools and bounds that will make obvious Miller’s error bounds [
Bini and Lotti [
They estimate the error as a multiplicative factor for each quadrant, but they do not care about the quadrants' order. They call it the stability vector, and we represent the error location by means of a matrix:
The component/matrix
This is a fundamental idea that we shall expand further in this paper for the design of more accurate algorithms. For example, the Winograd algorithm, as presented by Higham ([
The stability factor of an algorithm is the maximum of the stability vector:
and we can see that
This classification and the methodology are powerful, but there are a few points that we are going to expand and use:
・ The stability error is a function of the algorithm representation (matrix forms) instead of code or experimental data. We shall introduce an empirical and theoretical measure, the transfer function, defined for any algorithm, to cope with experimental error analysis, to enable graphical tools, and to overcome the inherently coarse grain of the error bounds.
・ We tailor our tools to random matrices. However, the theory developed by Bini and Lotti justifies the same optimization on any matrix.
The last point is quite important and we expand it here. This section derives from the Bini and Lotti framework, but it is completely original. Their classification and bounds are based on the recursive nature of the stability vector, kept unchanged, and thus exploit the worst-case scenario. We can improve the bounds if we improve the algorithms. Consider the addition of two result matrices by the Strassen algorithm and their identical stability vector
Obviously, if we add them together the stability factor is additive, because the computations are independent and because we estimate the worst case (i.e., if we add two matrices that have been computed using Strassen's algorithm, the resulting stability factor, the maximum expected error, is the sum of the stability factors).
However, if we have a way to permute the computation so that the stability vector is rotated one shift clockwise (this is always possible as we show in Section 5) the error estimate is different:
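A schematic illustration with a made-up stability vector (the actual per-quadrant factors come from the Bini and Lotti framework): take $s = (4, 3, 3, 2)$, one factor per quadrant. Then

\[ s + s = (8, 6, 6, 4) \;\Rightarrow\; \text{stability factor } 8, \qquad s + \operatorname{rot}(s) = (4{+}2,\; 3{+}4,\; 3{+}3,\; 2{+}3) = (6, 7, 6, 5) \;\Rightarrow\; \text{stability factor } 7, \]

so combining the straight and rotated computations yields a smaller worst-case factor than the straight sum.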
As we may appreciate from the Strassen algorithm, the error lies on a diagonal; we may take advantage of the specific layout of the error to write better algorithms. If we do nothing and we perform two levels of recursion, the stability vector will be:
This means that the stability factor is
If we exploit the location of the error using what we call orthogonal algorithms (see Section 6), we can obtain the following stability vector:
This means that the stability factor is
Bini and Lotti provide several classes of algorithms for the Winograd variant: the most accurate has stability vector
and for two recursions
There are actually three orthogonal permutations to be applied to have any advantage
This means that the stability factor is
Notice that all previous stability vectors are computed automatically using the matrix notation introduced by the original authors. The permutations, too, are applied in matrix notation and automatically. Once the set of matrices (i.e., the algorithm) is specified, we can compute the stability vector with and without orthogonal permutations for any recursion level. Bottom line, this section presents a constructive proof of how to write more accurate FastMM algorithms based on bilinear techniques using permutations. It also shows that we can actually write component-wise bounds.
However, further discussion of the automatic generation of stability vectors is beyond the scope of this paper. When we present our code-generation tools, which take the matrix form and generate code, we will provide more details and the mathematical notation showing how this can be achieved using Kronecker and matrix products.
The intuition behind Miller’s result is that the components of the error
The easiest way to introduce this connection and its implications is by describing a common experiment. Choose a reference MM, for example, DGEMM (double-precision General Matrix Multiplication, see [
・ Let us choose the number of iterations T, say 100, and a dimension
・ Per iteration
・ We compute
This is a very common experiment to estimate the maximum error (or maximum relative error) given a reference and an experimental algorithm. For example, we record only
・ worst case estimate
・ empirical distribution of the
Often the input matrices have some statistical properties, but they can come from benchmarks as well. In practice, we use the experiments above to estimate parameters such as
For large
Then each series
・ The estimate of the mean: $\hat{\mu}_{ij} = \frac{1}{T}\sum_{t=1}^{T} e_{ij,t}$. Of course, $\hat{\mu}_{ij}$ converges to the true mean $\mu_{ij}$ as $T$ grows.
・ The estimate of the variance: $\hat{\sigma}_{ij}^2 = \frac{1}{T-1}\sum_{t=1}^{T} \left(e_{ij,t} - \hat{\mu}_{ij}\right)^2$. Of course, $\hat{\sigma}_{ij}^2$ converges to the true variance $\sigma_{ij}^2$ as $T$ grows.
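A minimal sketch of this experiment and of the estimators above (our illustration, assuming NumPy; `error_series` is a hypothetical helper, and the routine under test here is the plain single-precision MM, not the FastMMW package API):

```python
import numpy as np

def error_series(experimental_mm, n=100, T=100, seed=0):
    """Collect the component-wise absolute error over T random trials.

    Reference: the product accumulated in double precision.
    Experimental: `experimental_mm` applied to single-precision copies.
    """
    rng = np.random.default_rng(seed)
    errors = np.empty((T, n, n))
    for t in range(T):
        A = rng.uniform(-1.0, 1.0, (n, n))
        B = rng.uniform(-1.0, 1.0, (n, n))
        C_ref = A @ B                                   # double-precision reference
        C_exp = experimental_mm(A.astype(np.float32), B.astype(np.float32))
        errors[t] = np.abs(C_ref - C_exp.astype(np.float64))
    return errors

# Per-component sample mean and variance: an empirical transfer function.
E = error_series(lambda A, B: A @ B)     # plain single-precision MM as the trial algorithm
mu_hat = E.mean(axis=0)                  # estimate of the mean, one value per component
var_hat = E.var(axis=0, ddof=1)          # estimate of the variance (the heat map)
max_err = E.max(axis=(1, 2))             # per-trial maximum absolute error
```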
If the assumptions about the independent nature of the error hold, and if the sample size $T$ is large enough, then by the central limit theorem each component error follows approximately a normal distribution:

\[ e_{ij} \sim N(\mu_{ij}, \sigma_{ij}^2), \]

where $N(\mu, \sigma^2)$ denotes the normal distribution with mean $\mu$ and variance $\sigma^2$. This is in contrast to the norm-wise notation providing a single bound instead of a matrix bound. In practice, where there is a large variance, there is a large error. Also, where there is a small variance, likely, there is a small error.
Furthermore, we can infer a relation across components of
Of course, Equation (6) has such a small and clean probability because we use the normal distribution of the output
In practice, the matrix
² Also,
One of the best ways to showcase the power of the transfer function is by presenting the matrix
We can take a small problem where matrices are of size
In the previous section, we discussed that the transfer function has meaning for large
It is simple to appreciate the close relationship between the heat map and the stability vector. We recall the stability vector is:
the transfer function shown as a heat map is a graphical representation of the stability vector and the error distribution; see
The nature of the recursive algorithm is captured by the transfer function quite beautifully: Figures 3 and 4.
Now, we have a clear picture of Miller's bounds and of the non-uniform distribution of the error that cannot be modeled by Equation (1). Here, we put to use the recursive division, Bini and Lotti's framework, and the line of thought of the proof of Equation (2). We know that applying a single recursion of the Strassen's algorithm the
magnitudes. The proof uses the maximum among the errors in order to write a recurrence equation and to provide a solution. Notice that if we use the Bini and Lotti framework, we can directly compute the solution of the recurrence equation for each quadrant and then for each component. The heat map is a consistent estimate of such a computation (as we did in Section 2.1). The heat map is a clear picture for one recursive step, which seems obvious now; however, it pictures concise, coherent, and beautiful information for 2 or more recursions, and it is a beautiful example of a fractal.
The transfer function is a way to represent and compute the point-wise root-mean-square error, and this is a common theme in several previous publications: for example, the original work presented by Welch [
There are also disconnections between the transfer function, the direction of the error, and Equation (2). Here we investigate the hidden differences before delving into the commonalities.
In our experience, the error of FastMM is connected to the sign/range of the matrices in the sense that versions of Winograd’s are known to be as accurate as
algorithm is less accurate for positive matrices; this will affect the accuracy of FastMM because
Obviously, we wonder whether or not any range change of the input matrices will affect the transfer function. For example, instead of using matrices in the range
1. Let us start by considering the effects on the
This simple observation explains why fast algorithms have the opportunity to be as accurate as regular algorithms for random positive matrices (i.e., they resolve to use matrices in the [−1, 1] range).
2. The transfer-function shape changes for FastMMs. In
Note that SWOPT magnifies the error onto two quadrants instead of three. SWOPT's transfer function is similar to the transfer function of SSTRA, though the maximum error differs, especially for large recursion depths. Interestingly, SW is as accurate as GSGEMM for recursion depths of three or less. For deeper recursions, SW loses its edge in accuracy. We will provide an explanation in the following section, where we introduce a complexity theory based on transfer functions.
Let us introduce a few definitions useful for the notation, for the error complexity and, finally, for the design of more accurate algorithms. These notations stem from the stability vectors previously introduced. Let us consider the error matrix
Of course, the transfer function is a matrix, and we identify the error direction in a transfer function using matrix notation and sub-matrices as in Equation (4). We can summarize a transfer function by the hot submatrices of the error matrix.
・ The transfer function of SSTRA algorithm will be identified as
・ The SWOPT and SW algorithms have the same transfer function and we identify it as
・ We can identify the GSGEMM algorithm transfer function as
For example, if we have the addition of two matrices such as
For example, if
This is true because we add the component-wise variances and
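In symbols, under the independence assumption and in our notation: if the result is the sum of two independently computed matrices, the component-wise variances add,

\[ C = C' + C'' \quad\Rightarrow\quad \sigma_{ij}^{2} = \sigma_{ij}'^{2} + \sigma_{ij}''^{2}, \]

which is how the transfer functions of the two computations combine.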
dering the statistical meaning. It is also a function of the problem size and of the range of the matrices (because these affect the basic assumption about the error distribution).
The transfer function, together with its addition, satisfies the following properties:
・ Closure:
・ Associativity
・ Commutativity
・ (almost) Identity element
・ Orthogonal (or Inverse) element
Theoretically, there is an identity element: the zero matrix, or the transfer function of the ideal computation. Here, we rather introduce the almost-identity element because it corresponds to a real computation and it takes the role of the classic
Again, if we restrict the family of algorithms, there may be no transfer function orthogonal to a given transfer function. For example, for the Winograd algorithm with transfer function
In practice, if
which emphasizes that the shape of the transfer function stays the same and the intensity of the variance should double.
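That is, writing $T$ for the transfer function of an algorithm (a sketch in our notation):

\[ T \oplus T = 2\,T \qquad\text{versus}\qquad T \oplus T^{\perp}, \]

where $\oplus$ denotes the component-wise combination above: adding the same algorithm to itself doubles the intensity without changing the shape, while combining with the orthogonal element $T^{\perp}$ spreads the variance across quadrants instead of stacking it.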
Let us consider the Strassen’s algorithm as presented in
The left column in
We present a solution for the above error complexity in Equation (9) where
From the experimental results presented previously, we see that this is quite adequate, with a simple explanation. Notice that FastMMs are more accurate in absolute terms for matrices in the range [−1, 1]. However, they grow more slowly for positive matrices. We present the analysis for the Winograd variants in Appendix 8.
The Strassen algorithm has a very distinctive direction of the transfer function
The Strassen algorithm computes the following matrix computation:
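For reference, the standard one-level Strassen formulas in $2 \times 2$ block notation:

\[
\begin{aligned}
M_1 &= (A_{11}+A_{22})(B_{11}+B_{22}), & M_2 &= (A_{21}+A_{22})B_{11},\\
M_3 &= A_{11}(B_{12}-B_{22}), & M_4 &= A_{22}(B_{21}-B_{11}),\\
M_5 &= (A_{11}+A_{12})B_{22}, & M_6 &= (A_{21}-A_{11})(B_{11}+B_{12}),\\
M_7 &= (A_{12}-A_{22})(B_{21}+B_{22}), & &\\
C_{11} &= M_1+M_4-M_5+M_7, & C_{12} &= M_3+M_5,\\
C_{21} &= M_2+M_4, & C_{22} &= M_1-M_2+M_3+M_6 .
\end{aligned}
\]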
Its orthogonal is the following:
The permutation is logical, and we do not really need to move data around. If one recursion level is applied, and if we repeat the same bound estimation as in
recursion bounds are identical as in Equation (9).
Now we can do something interesting. Consider a matrix result
and, more importantly, a small error increase.
The coefficient can be estimated as in the following example.
So let us take again the Strassen algorithm and introduce a permutation instruction that allows us to switch on
and off the orthogonal algorithm between recursion levels.
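To make the switching concrete, here is a minimal recursive sketch in Python (assuming NumPy and power-of-two sizes; the `orthogonal` flag and the particular quadrant swap are our illustration of a logical permutation, not the FastMMW package API):

```python
import numpy as np

def strassen(A, B, leaf=64, orthogonal=False):
    """One Strassen recursion per call, alternating the regular and the
    orthogonal (quadrant-permuted) variant at successive levels."""
    n = A.shape[0]
    if n <= leaf:
        return A @ B                       # leaf computation: conventional MM
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    if orthogonal:
        # Logical permutation of both operands (no data movement is needed in
        # an in-place implementation): the exact product is unchanged, but the
        # rounding-error pattern of this level lands on different quadrants.
        A11, A12, A21, A22 = A22, A21, A12, A11
        B11, B12, B21, B22 = B22, B21, B12, B11
    nxt = not orthogonal                   # switch variant at the next level
    M1 = strassen(A11 + A22, B11 + B22, leaf, nxt)
    M2 = strassen(A21 + A22, B11, leaf, nxt)
    M3 = strassen(A11, B12 - B22, leaf, nxt)
    M4 = strassen(A22, B21 - B11, leaf, nxt)
    M5 = strassen(A11 + A12, B22, leaf, nxt)
    M6 = strassen(A21 - A11, B11 + B12, leaf, nxt)
    M7 = strassen(A12 - A22, B21 + B22, leaf, nxt)
    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6
    if orthogonal:
        # Undo the logical permutation on the result quadrants.
        return np.block([[C22, C21], [C12, C11]])
    return np.block([[C11, C12], [C21, C22]])
```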
If we write the recurrence equation as we did in Equation (8) and we solve it to estimate the magnitude, then we should explicitly introduce the coefficient
For the Winograd variants such as SW and SWOPT, the orthogonal transformation is a little more complicated:
Once again, the permutations are logical only, there is no data movement.
ORTHOGONAL 1, 2
ORTHOGONAL 0, 3
ORTHOGONAL 1, 2
ORTHOGONAL 0, 3
In Section 2.1, we introduced an algorithm that actually applies all four possible directions: the regular one, the two above, and the following:
We call this algorithm
Note. We applied these same permutations to Strassen and Winograd stability vectors in Section 2.1 to lower the coefficient
[Algorithm listing: 21 operations with orthogonal permutations interleaved — ORTHOGONAL 0, 2, 3 after operation 2; ORTHOGONAL 0, 1, 3 after operation 5; ORTHOGONAL 0, 1, 2 after operation 7; ORTHOGONAL 0, 1, 3 after operation 13; ORTHOGONAL 0, 2, 3 after operation 15; ORTHOGONAL 0, 1, 3 after operation 18.]
matrix input [0, 1]).
How will it work in practice? In short, the orthogonal algorithm improves the transfer function significantly, which improves the maximum error as well. In the following, we summarize the error in Figures 8, 9, 10, 11 and 12.
We can see that the error direction changes dramatically, and the transfer functions of the fast algorithms get closer to that of the regular MM. From the simple theory we developed, we understand that we cannot achieve a truly uniform distribution by using the orthogonal algorithm transformation. What we can do is attenuate the effect of the recursion concentrating the error into a specific location, so as to avoid the overlap of large errors close to the same geographical location.
A curiosity: the SWOPT orthogonal algorithm has a heat spot clearly defined on the right side of the result matrix. Such a biased error may suggest that this part of the matrix could be computed separately: for 5 recursions, we may have to recompute a very small matrix. In case the reader is wondering about any relation with the permutations introduced in [, those permutations do not change the error direction and thus the error; we tested them and do not report the results here.
In this section, we will wrap up the experimental results by showing the properties of fast algorithms using relatively small matrix sizes. The goal is to compare what we can predict using transfer function versus maximum error.
We present different views of the error, and we start by showing the maximum error and the maximum transfer function; here we may use the term heat as a short-hand for the transfer function.
In
For every algorithm and matrix range, the heat and the maximum error are consistent measures of each other, and in particular we show that the orthogonal permutation always improves both. We also present the ratio between the maximum error and the maximum heat, to provide the multiplicative factor to
We can appreciate quantitatively that permutation algorithms reduce the heat and the maximum error by half.
Maximum Heat vs. Maximum Error Location.
There is a correlation between the values of the maximum error and the maximum heat. The correlation is used to show that we can design better algorithms. Here, we address the geographical correlation: we show that the transfer function maps the most likely locations for the error.
In Figures 13 and 14, we present the heat map for the maximum error for all algorithms for matrices of size
The goal of the orthogonal permutation is to change the pattern of the error in the sub-computations in such a way as to avoid their maximum contribution. As a result, we spread the error across the result matrix. Alternatively, we can guide the distribution of the error to target any part of the result matrix; this could be
Each cell reports: maximum error / maximum heat / their ratio.

Size | GSGEMM | SSTRA | SW | SWOPT
---|---|---|---|---
20 | 2.25e-06/3.70e-07/6.08 | 2.25e-06/3.70e-07/6.08 | 2.25e-06/3.70e-07/6.08 | 2.25e-06/3.70e-07/6.08 |
21 | 2.45e-06/3.95e-07/6.21 | 4.39e-06/7.36e-07/5.96 | 2.49e-06/3.80e-07/6.56 | 3.99e-06/6.60e-07/6.05 |
42 | 6.72e-06/1.09e-06/6.16 | 1.83e-05/3.32e-06/5.50 | 6.75e-06/1.05e-06/6.40 | 1.50e-05/2.83e-06/5.29 |
50 | 8.20e-06/1.35e-06/6.06 | 2.47e-05/3.95e-06/6.24 | 8.03e-06/1.21e-06/6.65 | 1.88e-05/3.41e-06/5.50 |
64 | 1.30e-05/1.91e-06/6.81 | 3.43e-05/5.48e-06/6.25 | 1.08e-05/1.63e-06/6.64 | 3.04e-05/4.96e-06/6.14 |
70 | 1.38e-05/2.25e-06/6.15 | 3.97e-05/6.51e-06/6.09 | 1.07e-05/1.82e-06/5.86 | 3.14e-05/5.50e-06/5.70 |
86 | 1.88e-05/3.16e-06/5.94 | 7.81e-05/1.51e-05/5.16 | 2.19e-05/3.61e-06/6.05 | 6.49e-05/1.20e-05/5.41 |
90 | 1.93e-05/3.36e-06/5.74 | 8.84e-05/1.63e-05/5.43 | 2.44e-05/3.80e-06/6.41 | 7.13e-05/1.53e-05/4.65 |
100 | 2.12e-05/3.80e-06/5.57 | 9.22e-05/1.76e-05/5.24 | 2.40e-05/3.62e-06/6.62 | 7.31e-05/1.44e-05/5.09 |
120 | 3.10e-05/4.71e-06/6.58 | 1.20e-04/2.15e-05/5.56 | 3.10e-05/4.36e-06/7.11 | 9.84e-05/1.85e-05/5.32 |
150 | 4.23e-05/7.19e-06/5.88 | 1.72e-04/3.23e-05/5.32 | 3.92e-05/6.11e-06/6.42 | 1.38e-04/2.58e-05/5.34 |
175 | 5.54e-05/9.17e-06/6.04 | 3.75e-04/6.81e-05/5.51 | 1.26e-04/2.15e-05/5.86 | 4.25e-04/8.11e-05/5.24 |
200 | 6.13e-05/1.08e-05/5.69 | 4.17e-04/7.89e-05/5.28 | 8.54e-05/1.19e-05/7.18 | 3.16e-04/6.15e-05/5.14 |
250 | 4.72e-05/7.34e-06/6.43 | 5.93e-04/1.03e-04/5.79 | 1.28e-04/2.16e-05/5.91 | 4.41e-04/9.89e-05/4.46 |
300 | 7.16e-05/1.05e-05/6.79 | 8.22e-04/1.45e-04/5.67 | 1.49e-04/1.78e-05/8.38 | 5.62e-04/1.09e-04/5.17 |
350 | 7.92e-05/1.33e-05/5.96 | 1.73e-03/3.06e-04/5.67 | 4.06e-04/6.95e-05/5.84 | 1.51e-03/3.39e-04/4.45 |
400 | 9.57e-05/1.55e-05/6.16 | 2.06e-03/3.53e-04/5.85 | 3.76e-04/4.40e-05/8.53 | 1.29e-03/2.58e-04/5.01 |
Size | SSTRA-Permute | SW-4Permute | SW-Permute | SWOPT-Permute
---|---|---|---|---
20 | 2.25e-06/3.70e-07/6.08 | 2.25e-06/3.70e-07/6.08 | 2.25e-06/3.70e-07/6.08 | 2.25e-06/3.70e-07/6.08 |
21 | 4.39e-06/7.36e-07/5.96 | 2.49e-06/3.80e-07/6.56 | 2.49e-06/3.80e-07/6.56 | 3.99e-06/6.60e-07/6.05 |
42 | 1.53e-05/3.03e-06/5.06 | 6.29e-06/9.95e-07/6.32 | 6.74e-06/1.06e-06/6.37 | 1.37e-05/2.78e-06/4.94 |
50 | 1.89e-05/3.61e-06/5.25 | 7.49e-06/1.14e-06/6.57 | 7.08e-06/1.20e-06/5.90 | 1.70e-05/3.33e-06/5.11 |
64 | 2.73e-05/5.01e-06/5.45 | 1.05e-05/1.53e-06/6.84 | 9.35e-06/1.63e-06/5.74 | 2.69e-05/4.82e-06/5.58 |
70 | 3.28e-05/5.95e-06/5.51 | 1.07e-05/1.71e-06/6.27 | 1.01e-05/1.81e-06/5.57 | 2.98e-05/5.39e-06/5.53 |
86 | 6.49e-05/1.27e-05/5.13 | 1.90e-05/2.93e-06/6.48 | 2.12e-05/3.11e-06/6.80 | 5.67e-05/1.14e-05/4.97 |
90 | 6.81e-05/1.36e-05/5.02 | 1.99e-05/3.02e-06/6.60 | 1.91e-05/3.25e-06/5.88 | 6.47e-05/1.26e-05/5.11 |
100 | 8.60e-05/1.48e-05/5.83 | 2.25e-05/3.22e-06/6.96 | 2.11e-05/3.48e-06/6.06 | 6.76e-05/1.38e-05/4.90 |
120 | 9.35e-05/1.80e-05/5.19 | 2.49e-05/3.88e-06/6.43 | 2.40e-05/4.18e-06/5.74 | 9.26e-05/1.76e-05/5.26 |
150 | 1.43e-04/2.71e-05/5.28 | 3.20e-05/5.11e-06/6.28 | 3.22e-05/5.40e-06/5.95 | 1.34e-04/2.46e-05/5.46 |
175 | 2.62e-04/5.29e-05/4.95 | 5.97e-05/9.97e-06/5.99 | 7.48e-05/1.42e-05/5.28 | 2.25e-04/4.72e-05/4.77 |
200 | 3.04e-04/6.07e-05/5.00 | 6.52e-05/1.03e-05/6.31 | 6.78e-05/1.12e-05/6.06 | 2.68e-04/5.66e-05/4.74 |
250 | 4.23e-04/7.84e-05/5.40 | 8.66e-05/1.27e-05/6.84 | 8.88e-05/1.52e-05/5.86 | 3.77e-04/7.21e-05/5.23 |
300 | 6.02e-04/1.12e-04/5.36 | 1.02e-04/1.51e-05/6.75 | 1.01e-04/1.67e-05/6.06 | 5.43e-04/1.01e-04/5.37 |
350 | 1.15e-03/2.17e-04/5.32 | 2.26e-04/3.39e-05/6.66 | 2.38e-04/4.32e-05/5.52 | 9.68e-04/1.96e-04/4.95 |
400 | 1.24e-03/2.50e-04/4.94 | 2.66e-04/3.82e-05/6.97 | 2.68e-04/4.10e-05/6.53 | 1.25e-03/2.34e-04/5.34 |
Each cell again reports: maximum error / maximum heat / their ratio.

Size | GSGEMM | SSTRA | SW | SWOPT
---|---|---|---|---
20 | 1.57e-06/1.45e-07/10.86 | 1.57e-06/1.45e-07/10.86 | 1.57e-06/1.45e-07/10.86 | 1.57e-06/1.45e-07/10.86 |
21 | 1.68e-06/1.50e-07/11.17 | 2.35e-06/3.16e-07/7.43 | 4.23e-06/3.88e-07/10.92 | 3.58e-06/3.85e-07/9.29 |
42 | 3.75e-06/2.85e-07/13.18 | 7.30e-06/1.11e-06/6.58 | 2.18e-05/1.66e-06/13.13 | 1.81e-05/1.66e-06/10.94 |
50 | 4.06e-06/3.36e-07/12.10 | 8.25e-06/1.26e-06/6.52 | 2.05e-05/1.90e-06/10.80 | 2.19e-05/1.90e-06/11.51 |
64 | 5.71e-06/4.22e-07/13.53 | 1.07e-05/1.52e-06/7.03 | 2.14e-05/2.29e-06/9.35 | 2.47e-05/2.29e-06/10.76 |
70 | 6.37e-06/4.61e-07/13.84 | 1.17e-05/1.66e-06/7.05 | 2.46e-05/2.48e-06/9.92 | 3.06e-05/2.48e-06/12.33 |
86 | 8.97e-06/5.61e-07/16.00 | 2.36e-05/3.87e-06/6.10 | 6.76e-05/7.20e-06/9.39 | 7.15e-05/7.13e-06/10.03 |
90 | 8.28e-06/5.85e-07/14.17 | 2.45e-05/4.12e-06/5.96 | 7.70e-05/7.50e-06/10.27 | 6.08e-05/7.34e-06/8.28 |
100 | 1.03e-05/6.48e-07/15.94 | 2.89e-05/4.38e-06/6.61 | 7.36e-05/8.11e-06/9.07 | 6.92e-05/8.05e-06/8.60 |
120 | 1.26e-05/7.84e-07/16.09 | 2.97e-05/5.02e-06/5.92 | 9.15e-05/9.25e-06/9.89 | 8.74e-05/9.30e-06/9.39 |
150 | 1.60e-05/9.62e-07/16.65 | 3.99e-05/6.05e-06/6.59 | 9.86e-05/1.12e-05/8.78 | 1.01e-04/1.11e-05/9.06 |
175 | 1.96e-05/1.13e-06/17.39 | 8.19e-05/1.35e-05/6.07 | 2.64e-04/3.07e-05/8.60 | 2.98e-04/3.05e-05/9.78 |
200 | 2.45e-05/1.28e-06/19.10 | 9.72e-05/1.53e-05/6.36 | 3.01e-04/3.43e-05/8.76 | 3.32e-04/3.41e-05/9.76 |
250 | 1.36e-05/1.15e-06/11.82 | 1.05e-04/1.81e-05/5.81 | 3.49e-04/4.12e-05/8.48 | 3.28e-04/4.05e-05/8.10 |
300 | 1.78e-05/1.37e-06/13.05 | 1.26e-04/2.09e-05/6.04 | 4.24e-04/4.73e-05/8.95 | 4.13e-04/4.75e-05/8.69 |
350 | 2.62e-05/1.59e-06/16.43 | 2.64e-04/4.66e-05/5.68 | 1.11e-03/1.31e-04/8.42 | 9.40e-04/1.30e-04/7.23 |
400 | 2.63e-05/1.81e-06/14.48 | 2.99e-04/5.29e-05/5.66 | 1.14e-03/1.45e-04/7.86 | 1.20e-03/1.45e-04/8.29 |
Size | SSTRA-Permute | SW-4Permute | SW-Permute | SWOPT-Permute
---|---|---|---|---
20 | 1.57e-06/1.45e-07/10.86 | 1.57e-06/1.45e-07/10.86 | 1.57e-06/1.45e-07/10.86 | 1.57e-06/1.45e-07/10.86 |
21 | 2.35e-06/3.16e-07/7.43 | 4.23e-06/3.88e-07/10.92 | 4.23e-06/3.88e-07/10.92 | 3.58e-06/3.85e-07/9.29 |
42 | 6.85e-06/9.23e-07/7.42 | 1.47e-05/1.59e-06/9.24 | 1.62e-05/1.63e-06/9.94 | 1.45e-05/1.64e-06/8.81 |
50 | 7.85e-06/1.05e-06/7.47 | 1.45e-05/1.82e-06/7.95 | 1.71e-05/1.87e-06/9.11 | 1.69e-05/1.88e-06/9.00 |
64 | 9.59e-06/1.24e-06/7.74 | 2.28e-05/2.20e-06/10.37 | 2.21e-05/2.27e-06/9.74 | 1.96e-05/2.29e-06/8.57 |
70 | 1.00e-05/1.38e-06/7.23 | 2.18e-05/2.41e-06/9.04 | 2.97e-05/2.48e-06/11.96 | 2.32e-05/2.47e-06/9.37 |
86 | 1.79e-05/2.87e-06/6.23 | 5.31e-05/6.68e-06/7.95 | 5.59e-05/7.07e-06/7.91 | 6.00e-05/7.01e-06/8.56 |
90 | 2.11e-05/2.99e-06/7.05 | 5.36e-05/6.98e-06/7.68 | 5.41e-05/7.34e-06/7.37 | 5.65e-05/7.26e-06/7.78 |
100 | 2.01e-05/2.99e-06/6.72 | 5.98e-05/7.50e-06/7.97 | 6.22e-05/7.95e-06/7.83 | 8.18e-05/7.93e-06/10.33 |
120 | 2.61e-05/3.33e-06/7.82 | 6.89e-05/8.60e-06/8.01 | 6.97e-05/9.08e-06/7.68 | 7.41e-05/9.13e-06/8.11 |
150 | 2.96e-05/4.47e-06/6.61 | 8.31e-05/1.05e-05/7.93 | 9.04e-05/1.10e-05/8.22 | 9.52e-05/1.10e-05/8.67 |
175 | 5.04e-05/7.64e-06/6.60 | 1.86e-04/2.77e-05/6.70 | 2.27e-04/3.00e-05/7.56 | 2.13e-04/3.00e-05/7.09 |
200 | 6.09e-05/8.48e-06/7.18 | 2.57e-04/3.10e-05/8.28 | 2.38e-04/3.35e-05/7.10 | 2.30e-04/3.35e-05/6.86 |
250 | 6.63e-05/1.13e-05/5.86 | 2.64e-04/3.73e-05/7.07 | 3.10e-04/4.04e-05/7.69 | 3.14e-04/4.03e-05/7.79 |
300 | 8.26e-05/1.27e-05/6.51 | 3.05e-04/4.31e-05/7.08 | 3.76e-04/4.67e-05/8.06 | 3.91e-04/4.67e-05/8.38 |
350 | 1.32e-04/2.41e-05/5.47 | 8.76e-04/1.14e-04/7.67 | 8.67e-04/1.27e-04/6.81 | 8.70e-04/1.28e-04/6.82 |
400 | 1.45e-04/2.40e-05/6.07 | 8.05e-04/1.28e-04/6.28 | 9.91e-04/1.43e-04/6.93 | 9.81e-04/1.42e-04/6.90 |
invaluable in case we know where the maximum accuracy is needed. This tailoring of the algorithm to a result-accuracy goal is novel and powerful; in contrast, this is not possible using regular matrix multiplications because of their uniformly distributed error.
Brent’s Connection.
Now we show that the error is a function of the algorithm. Let us start from Equation (2), which we present here again.
In
hand side of Equation (15) by
divide the LHS by
For comparison purposes, we also show the
As a function of the range of the input we have different factors. For the range [−1, 1] we have clear factors:
So we can see that even if we use the standard way to measure the error and the standard bounds, we correctly reproduce what we already know about the algorithms, and we show that the orthogonal permutation affects the maximum error.
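A sketch of how such factors can be tabulated (our illustration, assuming NumPy, the single-precision unit roundoff $u = 2^{-24}$, and the max norm $\|M\| = \max_{ij} |m_{ij}|$ as in Equation (2)):

```python
import numpy as np

def brent_factor(C_ref, C_hat, A, B, u=2.0**-24):
    """Normalized error max|C_ref - C_hat| / (u * ||A|| * ||B||),
    using the max norm, i.e. the constant in front of u in Equation (2)."""
    norm = lambda M: np.abs(M).max()
    diff = C_ref - np.asarray(C_hat, dtype=np.float64)
    return norm(diff) / (u * norm(A) * norm(B))
```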
In
Recursion Connection.
In this work, we introduce the transfer function to estimate the recursive effect on the error, so as to create a different recurrence equation to solve. Our goal was to achieve a simplified bound such as in Equation (17)
Even simpler using
The bound in Equation (18) is simpler to explain to any developer because we quantify the intuitive idea that more recursive calls will increase the error: the multiplicative factor is specific to the algorithm and a constant at each recursion step.
Both equations provide a means for comparing different algorithms and their accuracy. We can actually plug in GSGEMM, which should provide a practical and theoretical lower bound ($X = 2$).
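In other words, our reading of the simplified bound: if each recursion level multiplies the error by an algorithm-specific constant $X$, the recurrence and its solution are

\[ e_k = X\, e_{k-1} \quad\Rightarrow\quad e_k = X^{k}\, e_0, \qquad k = \log_2 \frac{n}{n_0}, \]

and plugging in $X = 2$ for GSGEMM recovers the linear-in-$n$ growth of the component-wise bound of Equation (1).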
In
We notice that the transfer function and the maximum error provide very similar estimates of
The recurrence solutions presented in Equations (9), (12), (19), and (20) are consistent estimates of the bounds; that is, they allow comparing the accuracy of the algorithms. However, they seem to overestimate the measured errors.
In summary, from the theory of weak stability we introduce the family of orthogonal algorithms for the FastMMW algorithms. The combination of the regular and orthogonal algorithms allows us to write more accurate FastMMW algorithms. We show this theoretically by deriving better error bounds. We extend the error bounds, and the way we can compute error bounds, so that we can model corner cases: we introduce the transfer function. In fact, for the family of random matrices, the weak-stability bounds cannot capture the idiosyncrasies that appear when positive random matrices are used as operands.
Recalling conversations we had about error analysis, we now understand better why Winograd's algorithms are viewed with suspicion by some, even for positive matrices. In contrast, we have always found our Winograd implementation quite accurate for positive matrices. The misunderstanding is related to the assumption that all fast algorithms have the same error-analysis properties. In this work, we show that we can actually estimate and expect accuracy: this is a property of the algorithms, their implementation, and the way we use them. Hopefully, a better understanding of these algorithms will provide adequate standards and error estimates that guide experiments and data collection, so as to make sense of large errors in experimental results (e.g., [
We have the opportunity to open a new chapter and create new interest in this beautiful field.
This work stems from a question raised during a conversation with Matthew Badin, Alexandru Nicolau, and Michael Dillencourt. The question was: where is the error located? Once we answered the question by means of the transfer function, we wanted to reduce the error by randomizing the error pattern, in particular by permutations. Marco Bodrato had the idea of permuting the computation among recursion calls. We revisited the permutations used by David Wise and checked the original use of the permutations. Wise's permutations did not help because they are symmetric and they just reverse the order of the computation. Random permutations involving the result matrix did change it by disrupting the pattern. The randomization and the permutations provided almost the same distribution as the original matrix multiplication; Richard Brent guided us in making sense of this preliminary result. At this stage, we had a randomized algorithm. This helped the maximum error a little. So instead of applying random permutations, we tried to understand which computations and permutations we could use systematically. The orthogonal permutations crystallized and were applied to random matrices: we achieved a better transfer function and a better error. Nicholas Higham asked whether or not this approach could be extended to general matrices and thus provide better bounds. The answer was yes, thanks to the theory developed by Dario Bini and Grazia Lotti in their original work: orthogonal algorithms are applied to the stability vectors, thus reducing the asymptotic stability factor. We shared the preliminary draft of this work with all of the above, and we thank them for being our sounding board, our reference, and our standard. Especially, we thank Richard Brent for his feedback, moral support, and clean-up of the earlier drafts.
SWOPT
If we take the expected transfer function from the experiments, then we can explain the transfer function for matrices in the range [−1, 1] very nicely. If we take the minimum and the maximum of the transfer function, we have a ratio as in
However, for matrices in the range [0, 1] and for the algorithm SWOPT, the addition of the transfer functions related to the matrix
The computation of
The computation of
¹ The statistical properties of the error will not necessarily follow the statistical properties of the input matrices.
SW as accurate as SGEMM in theory.
Let us consider the algorithm deployed in SW and presented in
The error we commit in the mixed-sign leaf MMs is smaller than for positive leaf MMs. It is common knowledge that the Winograd algorithm is more accurate because there is no true subtraction of the matrix products. Actually, an algorithm with only additions of the matrix products will require subtractions in the inputs of the matrix products.
If we want to create a recurrence equation to estimate the transfer functions:
With a factor of 2.5 and for a small number of recursions, SW is very close to the regular MM. We can forecast, and we show in practice, that for up to three recursions SW is as accurate as the regular MM, and that it loses its edge for deeper recursions.
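To make the forecast concrete (our arithmetic, using the factor 2.5 for SW against 2 for the regular MM):

\[ \frac{e_k^{\mathrm{SW}}}{e_k^{\mathrm{MM}}} \approx \left(\frac{2.5}{2}\right)^{k} = 1.25^{k}, \qquad 1.25^{3} \approx 1.95, \quad 1.25^{4} \approx 2.44, \]

so for up to three recursions SW stays within a factor of about two of the regular MM, consistent with the measurements in Section 4.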
SW-4Permute: the four-permutation algorithm.