Tuesday, June 21, 2016

#GSoC 2016 Report I

Hi everyone!

As was mentioned in my previous blog post, one of the first tasks is to implement tilings and interchangings of specific loops based on the algorithm used to optimize matrix multiplication and described in "Analytical Modeling is Enough for High Performance BLIS" [1]. It was decided to do it step-by-step. This blog post is devoted to its description.

Let's assume that matrix multiplication has the following form and matrices A, B and C stored into row-major order, in contrast to column-major order used in [1].

Loop 1 for i = 0, …, UpperBound1 - 1 in steps of 1
Loop 2   for j = 0, …, UpperBound2 - 1 in steps of 1
Loop 3     for k = 0, …, UpperBound3 - 1 in steps of 1
                   C[i][j] += A[i][k] * B[k][j]
                 endfor
               endfor
              endfor

In this case, the optimized code could look like this,

Loop 1 for ic = 0, ..., UpperBound1 − 1 in steps of mc
Loop 2   for pc = 0, ..., UpperBound3 − 1 in steps of kc
                 Ac = Pack(A(ic : ic + mc − 1, pc : pc + kc − 1))
Loop 3     for jc = 0, ..., UpperBound2 − 1 in steps of nc
                   Bc = Pack(B(pc : pc + kc − 1, jc : jc + nc − 1))
Loop 4       for ir = 0, ..., mc − 1 in steps of mr
Loop 5         for jr = 0, ..., nc − 1 in steps of nr
Loop 6           for pr = 0, ..., kc − 1 in steps of 1
                         Cc(ir :ir +mr  1,jr :jr +nr  1) += Ac (ir : ir + mr   1, pr ) * Bc(pr, jr:jr+nr 1)
                       endfor
                    endfor
                  endfor
                endfor
              endfor
            endfor

where Loop 4 and Loop 5 form a macro-kernel around Loop 6, which contains rank-1 updates and forms a micro-kernel.

The first step was to create a micro-kernel. To do it, we tile the first two loops of the matrix multiplication and consequently unroll them. In the result we get this code:

Loop 1 for ir = 0, …, UpperBound1 - 1 in steps of mr
Loop 2   for jr = 0, …, UpperBound2 - 1 in steps of nr
Loop 3     for k = 0, …, UpperBound3 - 1 in steps of 1
                   C[i][j] += A[i][k] * B[k][j]
                   …
                   C[i][j + nr - 1] += A[i][k] * B[k][j + nr - 1]
                   …
                   C[i + mr - 1][j + nr - 1] += A[i + mr - 1][k] * B[k][j + nr - 1]
                 endfor
               endfor
             endfor

If auto-vectorization is enabled in LLVM, the body of Loop 3 may be replaced with rank-1 updates and we'll get a mirco-kernel. It's utilized in [2].

The second step was the creation of a macro-kernel by applying a combination of tiling of the first three loops and interchanging Loop 3 and Loop 2. This approach was implemented in [3].

However, without the optimal parameter values creation of the described kernels could lead to not highly optimized implementations or even increase execution time of generated code. To address the issue, we use the optimization model for mapping the GEMM algorithm described in [1].

It requires information about the following parameters of a target architecture:

1. Size of double-precision floating-point number.

2. Number of double-precision floating-point numbers that can be hold by a vector register.

3. Throughput of vector instructions per clock cycle.

4. Latency of instructions (i.e., the minimum number of cycles between the issuance of two dependent consecutive instructions).

5. Paramaters of cache levels (size of cache lines, associativity degrees, size, number of cache sets).

The first two parameters can be obtained from the TargetTransformInfo class. Therefore, since the product of a number of cache sets and a size of cache lines is a size divided by an associativity degree, we should determine only the four parameters. At the moment they can be specified as command line parameters.

Even if we able to get close-to-peak performance of matrix multiplication, we have to determine that the code under consideration contains matrix multiplication. To address the issue, we specified the following class of statements, which contains matrix multiplication:

1. There are exactly three input dimensions
2. All memory accesses of the statement will have stride 0 or 1, if we interchange loops (switch the variable used in the inner loop to the outer loop)
3. All memory accesses of the statement except from the last one, are read memory access and the last one is write memory access
4. All subscripts of the last memory access of the statement don't contain the variable used in the inner loop

The larger class was introduced in [4] to simplify its determination and to make a step toward generalization, which is planned for the future.

Refs.:

[1] - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf
[2] - http://reviews.llvm.org/D21140
[3] - http://reviews.llvm.org/D21491
[4] - http://reviews.llvm.org/D20575

My GSOC 2016 Proposal: Improvement of vectorization process in Polly

Hi everyone!

At the moment I'm participating in Google Summer of Code 2016. My next blog posts will be devoted to the updates of the project after reaching the mile stones. Below is the accepted proposal.

Title: Improvement of vectorization process in Polly
Abstract
Polly can perform classical loop transformations, exploit OpenMP level parallelism, expose SIMDization opportunities. However, due to the lack of a machine-specific performance model and missing optimizations, these transformations sometimes lead to compile and execution time regressions, and the generated code is at least one order of magnitude off in comparison to the corresponding vendor implementations. The goal of the project is to reduce such influence through implementation of optimizations aimed to produce code compatible with the best implementations of BLAS and an attempt to avoid vectorization of loops, when it is not profitable for the target architecture. It could be a step in transformation of Polly into an optimization pass used in standard -O3 optimizations.
1 Summary
A Polly framework is applied to a broad class of programs that sometimes causes a regression and is not reasonable. Our project aims to reduce such influence, which could be a step in transformation of Polly into an optimization pass used in standard -O3 optimizations.
2 The project
Although Polly performs classical loop transformations, especially tiling and loop fusion to improve data-locality, it can also be used to expose SIMDization opportunities. To do so, we can utilize, for example, a prevectorization, the choice of a possible outer loop that is strip-mined to the innermost level to enable inner-loop vectorization. In certain cases, it could give 50.31% of execution-time improvement [1].
However, tests such as the SingleSource/Benchmarks/Polybench/linear-algebra/ solvers/lu/lu from the LLVM test-suite show that sometimes these transformations are not reasonable and only result in the compile time regression. Furthermore, the result of SingleSource/Benchmarks/Polybench/medley/floyd-warshall/floyd-warshall stated in [2] demonstrates that there are cases when utilization of Polly leads to the execution time regression.
Even if the issues mentioned above are solved, Polly is about 3x off in comparison to the vendor implementations of BLAS. For example, let us consider the GEMM operation of the form C := AB + C, where C, A, and B are matrices of sizes 1056 × 1056, 1056 × 1024, and 1024 × 1056, respectively. A corresponding code compiled with r262915 of Clang takes about 2.02. Polly and its prevectorization help to attain 0.26. However, usage of BLAS routines such as a dgemm from the Intel MKL even in sequential mode can reduce this number to 0.086.
The goal of our project is to improve the vectorization process in Polly by working in two directions: first, reduction of compile and execution time regressions; second, reduction of execution time in situations when speedup is already achieved. In the first case, it is planned to estimate loops and determine their optimal vectorization factors based on the LLVM vectorization cost model in order to expose SIMDization opportunities only in those loops, which are estimated as profitable. In the second case, the best implementations of the GEMM operation will be considered in order to find and implement things missed in both default optimizations of Polly and transformation to expose SIMDization opportunities. Furthermore, we’ll add a new mode that uses the same Polly tool flow, processes statements, which contain only matrix multiplication, and helps reduce the regressions.
3 Benefits for Polly
  • A new mode of Polly could be interested for people who would like to use Polly and worry about compile and execution time regressions.
  • Specific optimization of the GEMM operation along with determination of optimal vectorization factors of loop will reduce the execution time of code generated by Polly.
  • Vectorization of loops, which are profitable according to the LLVM vectorization cost model, will reduce the average compile time of Polly.
  • Determination of loops, which are not profitable according to the LLVM vectorization cost model, allow to optimize them with different optimization strategies (e.g., register tiling), which could probably produce speed up.
  • Implementations of things related to the effective optimization of the GEMM operation (i.e., a packing transformation) could probably improve default optimization sequence of Polly.
  • Even though Polly is already able to speed up compute kernels significantly, when comparing to the best BLAS routines we still are at least one order of magnitude off [3]. This project could be a step toward obtaining fast BLAS kernels, which are competitive with vendor math libraries.
4 Details about the project
Performance of single-threaded GEMM with different implementations
The GEMM operation computes C := αAB + βC, where C, A, and B are matrices of sizes m × n, m × k, and k × n, respectively. For simplicity, we will assume that α = β = 1. In case, for example, of m = n = 1056 and k = 1024, a corresponding implementation [4] compiled with r262915 of Clang with following options takes about 2.02 on the Intel Core i7-3820 SandyBridge.
-O3 -I /utilities/ /utilities/polybench.c -DPOLYBENCH_TIME  -DPOLYBENCH_USE_SCALAR_LB  -march=native
However, a code based on the BLIS framework [5], the Intel MKL or the OpenBLAS can reduce this number even in case of a sequential implementation and take 0.088, 0.086 and 0.0865, respectively.
The following figure shows the result of evaluations of these implementations on the Intel Core i7-3820 SandyBridge, for different values of m, n, k.



Porting to Polly
Polly is another way to optimize [4], which doesn’t require usage of external libraries and helps to get the results presented in the following figure that shows a performance gap between code optimized with Polly and implementations stated above.
Consideration of an algorithm from [6], which is used to perform the GEMM operation in the BLIS framework, and its implementation written in C [7] for Intel Core i7-3820 SandyBridge reveals the following things, which could be added to Polly to try to achieve the same performance:
  • Tilings and interchanges of specific loops can produce loop nests which are similar to those presented in [6]. Furthermore, [6] describes an algorithm of determination of tiling parameters based on the following properties of the target architecture: sizes of cache lines, associativity degrees and sizes of caches, parameters of fused multiply-add instructions. Consequently, an implementation of matrix multiplication can be tuned without access to the target machine.

However, these transformations can not always improve generated code on its own. For example, a JSoP, which can be found on the following link [8], helps to generate a loop nest, which is similar to the one stated in [6], but increases the execution time to 0.62. It reveals the next missing piece, a packing transformation.

  • A packing transformation can be described as a transformation, which replaces accesses to a tile of a matrix with accesses to a temporary block of consecutive memory locations that contain elements of the replaced tile. For example, its utilization helps to reduce execution time from 0.78 to 0.27.

Although Polly doesn’t perform the transformation at the moment, it can be modeled using the memory address modification support.

  • Further improvements can be achieved taking into account information about SIMD registers usage during matrix multiplication from [6]. For example, micro-tiles Cr and br, the result of tiling of loops 5 and 6 from [6], can be loaded from memory, employed in the multiplication and only after this stored. This helps to reduce the execution time from 0.27 to 0.10.

The following figure shows the result of evaluations of the implementation of [6] written in C on the Intel Core i7-3820 SandyBridge, for different values of m, n, k.


It shows that there is still a gap between the mentioned implementation written in C and those based on the BLIS framework, the Intel MKL or the OpenBLAS, which use inline assembly. A possible way to attain the same performance is an improvement of instruction selection as well as register allocation, which could probably influence on the translation from LLVM-IR.

Of course, Polly is used to optimize a broader class of programs, which includes the matrix multiplication. Sometimes this causes a regression (for example, in case of SingleSource/Benchmarks/Polybench/medley/floyd-warshall/floyd-warshall from the LLVM test suite [2]) and is not reasonable (for example, in case of SingleSource/Benchmarks/Polybench/linear-algebra/solvers/lu/lu from the LLVM test suite [1]). In this project we aim to add a new mode that uses the same Polly tool flow, processes statements, which contain only matrix multiplication, and helps reduce the regression. It can also be used by default optimization sequence of Polly to handle matrix multiplication in a special way.

Determination of profitability of vectorization

Although, in certain cases, vectorization could give 50.31% of execution-time improvement of generated code [1].  Tests such as the SingleSource/Benchmarks/Polybench/ linear-algebra/solvers/lu/lu from the LLVM test-suite show that sometimes these transformations are not reasonable and only result in the compile time regression [2].  Furthermore, let us consider work of the prevectorization, a main transformation of vectorization in Polly. The prevectorization is the choice of a possible outer loop that is strip-mined to the innermost level to enable inner-loop vectorization. The goal of this transformation implemented in Polly is to create a trivially vectorizable loop. This means a parallel loop at the innermost level that has a constant number of iterations corresponding to the target vector width and, therefore, can be easily vectorized by LLVM’s inner loop vectorizer (or Polly's simple vectorizer). However, its implementation has a drawback, because it always uses a vectorization factor of four, even in case the vector registers of a target architecture can’t hold these number of elements or would allow even wider vector operations

To address these issues, we aim to implement a determination of vectorization factors and profitability of vectorization based on the LLVM vectorization cost model. It uses an algorithm, which can be roughly described as follows:

// A vectorization factor. In the end its value will be equal to one, if vectorization is not profitable.
unsigned VF;
TargetTransformInfo TTI;
unsigned WidestRegister = TTI.getRegisterBitWidth(true);
// Traverse each instruction in the loop and examine only Loads, Stores and PHINodes in order to determine the minimal and the maximal sizes in bits of used types.
std::tie(SmallestType, WidestType) = getSmallestAndWidestTypes();

unsigned MaxVectorSize = WidestRegister / WidestType;
unsigned NewMaxVectorSize = WidestRegister / SmallestType;
// Select the largest VF which doesn't require more registers than existing ones.
for (unsigned VS = MaxVectorSize; VS <= NewMaxVectorSize; VS *= 2)
 if (calculateRegisterUsage(VS) < TargetNumRegisters)
   VF = VS;

// Computation of expectedCost based on cost of every instruction from the loop computed in LoopVectorizationCostModel::getInstructionCost.
float Cost = expectedCost(1);
for (unsigned i=2; i <= VF; i*=2) {
 float VectorCost = expectedCost(i) / (float)i;
 if (VectorCost < Cost) {
   Cost = VectorCost;
   Width = i;
 }
}

It is planned to apply the algorithm in the ScheduleTreeOptimizer::optimizeBand to call the ScheduleTreeOptimizer::prevectSchedBand only in case it is profitable.

Timeline

  • 23 May – 29 May: Get more familiar with things missed in Polly to attain the best implementation of the gemm.
  • 30 May – 5 June: Get more familiar with the LLVM vectorization cost model.
  • 6 June – 19 June: Implement tilings and interchanges of specific loops based on the algorithm presented in [6].
  • 20 June – 27 June: Vacation.
  • 28 July – 3 July: Implementation of the packing transformation.
  • 4 July – 10 July: Testing and fixing bugs.
  • 11 July – 24 July:  Add generation of specific code, which takes into account information about SIMD registers.
  • 25 July - 31 July: Testing and fixing bugs.
  • 1 August - 7 August: Implement a determination of vectorization factors and profitability of vectorization based on the LLVM vectorization cost model.
  • 8 August - 23 August: Testing and fixing bugs.

5 Required deliverables

  • Add a mode that uses the same Polly tool flow and processes statements, which contain only matrix multiplication, in a special way
  • Implement determination of vectorization factors and profitability of vectorization.
  • Pass regression tests
  • Add new tests to testsuite

6 Nice to have

According to advancement in work on the integration and results, we want to:
  • Extend a new mode to hold statements, which contain different code along with matrix multiplication.
  • Implement a heuristic to choose which dimension to prevectorize, if they are multiple. At the moment Polly always uses the first coincident dimension.

About me

I am a first year P.h.D. student at Ural Federal University majoring in "Mathematical modelling, numerical methods, and software complexes". At the moment my research focuses on a machine-specific performance modelling and polyhedral optimizations.

I have been interested for years in compiler development and, in particular, in polyhedral compilation. Along with academic experience in writing compilers using ANTLR, Flex, Bison, I have experience of working in the LLVM Compiler Infrastructure and the GNU Compiler Collection. I successfully completed the project "Integration of ISL code generator into Graphite" during the participation in the GSoC 2014 in the GNU Compiler Collection. Its description and code samples can be found on the following link [9]. As part of my research, I have added new features to Polly and its vectorization process [10], [11], [12]. I have also fixed a bug of LICM [13], [14] fixed the following bugs of Polly [15], [16], [17], [18], [19], [20].

I will be available to work full time (40 hours per week) on this project.

Ref

[1] - https://drive.google.com/file/d/0B2Wloo-931AoQ1FRa0ZVbWdzc28/view?usp=sharing
[2] - https://drive.google.com/file/d/0B2Wloo-931AoLWFjeVZMbDF3UE0/view?usp=sharing
[3] - http://polly.llvm.org/projects.html
[4] - https://drive.google.com/file/d/0B2Wloo-931AoRTlVUVRKb1h6TFk/view?usp=sharing
[5] - https://github.com/flame/blis
[6] - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf
[7] - https://drive.google.com/file/d/0B2Wloo-931AoUUU1T2ZLTDFHNFk/view?usp=sharing
[8] - https://drive.google.com/file/d/0B2Wloo-931AoR0tYOVJDbXRPMTQ/view?usp=sharing
[9] - https://www.google-melange.com/gsoc/project/details/google/gsoc2014/groman/5639274879778816
[10] - http://reviews.llvm.org/D13779
[11] - http://reviews.llvm.org/D14491
[12] - http://repo.or.cz/polly-mirror.git/commit/4d9f318d4114d64062f38b0c7efd9e0ef647cc8f
[13] - https://llvm.org/bugs/show_bug.cgi?id=23077
[14] - http://reviews.llvm.org/D16753
[15] - http://repo.or.cz/polly-mirror.git/commit/defbd21a86ca0c3b595096720a49428c014b2c55
[16] - http://reviews.llvm.org/D15563
[17] - https://llvm.org/bugs/show_bug.cgi?id=25759
[18] - https://llvm.org/bugs/show_bug.cgi?id=19422
[19] - https://llvm.org/bugs/show_bug.cgi?id=25760
[20] - http://repo.or.cz/polly-mirror.git/commit/f95553beea5c274749d3ddad61c4e7c06f6dafaf