Wednesday, August 31, 2016

#GSoC 2016 Final report

Hi everyone!

I have successfully completed the Google Summer of Code 2016, and below is a copy of my final report.

Dear LLVM contributors,

During the Google Summer of Code 2016, I worked on the "Improvement of vectorization process in Polly" project.

I planned to add a mode that uses the same Polly tool flow and handles, in a special way, statements that contain only matrix multiplication of the following form:

C := AB + C, where C, A, and B are matrices of sizes m × n, m × k, and k × n, respectively.
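
In C, this computation corresponds to a kernel of the following form (a minimal illustration; the dimensions m, n, k and the arrays are assumed to be declared elsewhere):

for (int i = 0; i < m; i++)
  for (int j = 0; j < n; j++)
    for (int p = 0; p < k; p++)
      C[i][j] += A[i][p] * B[p][j];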

It was also planned to implement the determination of vectorization factors and of the profitability of vectorization.

We decided to adjust the goals and focus only on the optimization of matrix multiplication and its generalization. As a result, we optimize a class of statements (see [1] for further details) that also contains matrix multiplication.

The results of evaluating the corresponding code, compiled with r279395 of Clang, on the Intel Core i7-3820 SandyBridge for different values of m, n, k and different compiler options can be found at the following link [2].

According to the figure [2], the algorithm for computing the optimal values of the matrix multiplication parameters can be improved. One possibility is a new algorithm for computing the parameter Mc (see [1] for further details); if, for example, we manually set Mc equal to Nc, we get the following results [3]. This improvement is planned for the future.

The class also contains general matrix multiplication of the form C := αAB + βC, where C, A, and B are matrices of sizes m × n, m × k, and k × n, respectively. For example, in the case of m = n = 1056, k = 1024, α = 32412, and β = 2123, the corresponding code compiled with r278666 of Clang with the following options

clang -O3 gemm.c -I utilities/ utilities/polybench.c -DPOLYBENCH_TIME -march=native -mllvm -polly -mllvm -polly-pattern-matching-based-opts=true -DPOLYBENCH_USE_SCALAR_LB -mllvm -polly-target-cache-level-associativity=8,8 -mllvm -polly-target-cache-level-sizes=32768,262144 -mllvm -polly-target-througput-vector-fma=1 -mllvm -polly-target-latency-vector-fma=8

takes about 0.30 seconds on the Intel Core i7-3820 SandyBridge. If the described optimization is disabled, it takes about 0.75 seconds, and about 2.1 seconds if Polly is not used.

Furthermore, the optimization also applies to the following examples of code:

Loop 1 for i = 0, …, UpperBound1 - 1 in steps of 1
Loop 2   for j = 0, …, UpperBound2 - 1 in steps of 1
Loop 3     for k = 0, …, UpperBound3 - 1 in steps of 1
             C[i][j] +=  32412 Op1 B[k][j]
           endfor
         endfor
       endfor

Loop 1 for i = 0, …, UpperBound1 - 1 in steps of 1
Loop 2   for j = 0, …, UpperBound2 - 1 in steps of 1
Loop 3     for k = 0, …, UpperBound3 - 1 in steps of 1
             C[i][j] = C[i][j] Op1 A[i][k] Op2 B[k][j]
           endfor
         endfor
       endfor

where Opi is an operation from the set {+, -, /, *} (see [4], [5], [6], [7] for the results of their evaluations).

The described optimization is implemented in [8], [9], [10], [11], [12], [13], [14], which are already committed, and in [15], which is under review. For further details about the project (what code got merged, what code did not get merged, and what is left to do), please see my blog posts [1], [16].

I am planning to continue working on the project as part of my Ph.D. thesis and implement the following:

1.  Prefetching of certain cache lines

The BLIS implementation of matrix-matrix multiplication prefetches only the first cache line before traversing a given interval of memory, which suggests that the implementation relies on hardware prefetching for the subsequent lines [17]. Consequently, there is a possibility that employing prefetching could improve the described optimization.
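
For illustration, prefetching the first cache line of an interval can be expressed with the __builtin_prefetch intrinsic available in GCC and Clang (a sketch; Ptr is a hypothetical pointer to the start of the interval):

/* Prefetch only the first cache line of the interval; the hardware
   prefetcher is expected to fetch the subsequent lines.  */
__builtin_prefetch (Ptr);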

2.  Reduction of the number of target-architecture parameters that need to be specified as command-line parameters

At the moment, to be able to use the implemented optimization, one should specify parameters of the target architecture such as, for example, the throughput of vector instructions per clock cycle (see [1] for further details). Reducing the number of such parameters by using the target-architecture information already available in LLVM (e.g., TargetTransformInfo::getArithmeticInstrCost) would be preferable; a sketch of such a query follows.
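
For example, such a query against LLVM's target description could look as follows (an illustrative sketch rather than Polly's actual code; TTI is an available TargetTransformInfo instance, and VecTy is a hypothetical vector type built for the target's preferred width):

// Ask the target how expensive a vector floating-point multiplication is,
// instead of requiring the user to pass this value on the command line.
int MulCost = TTI.getArithmeticInstrCost(Instruction::FMul, VecTy);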

3. Generalization of the determination of the transformation parameters

At the moment, for this purpose we use the algorithm described in "Analytical Modeling is Enough for High Performance BLIS" [1], which was designed to optimize a subclass of the class and could cause performance regressions if applied to all statements of the class. We could try to create a new algorithm specialized for the whole class.

I would like to thank Michael Kruse, Albert Cohen, Johannes Doerfert and the LLVM community for their comments, reviews, ideas and the opportunity to work on this project.

I would like to especially thank Tobias Grosser for his encouragement and excellent support.

Refs.:

[1] - http://romangareev.blogspot.ru/2016/06/gsoc-2016-report-i.html
[2] - https://drive.google.com/file/d/0B2Wloo-931AoVTRINXp0dDZPSjQ/view?usp=sharing
[3] - https://drive.google.com/file/d/0B2Wloo-931AoWG9jTjg3aEVxRG8/view?usp=sharing
[4] - https://docs.google.com/spreadsheets/d/1RtioBokRePdX2zdFxGxOxTbtY_WnI4moYFCfbujr7S8/pubhtml?gid=0&single=true
[5] - https://docs.google.com/spreadsheets/d/1hUtXH9JsgYUhHCdydVtE7ayNupc9ViZuAG8z_JKevP0/pubhtml?gid=0&single=true
[6] - https://docs.google.com/spreadsheets/d/1NQHqCuhc1A8pVJtN3FuTHrQ2gcyXUYs5T2L8tMXqLE4/pubhtml?gid=0&single=true
[7] - https://docs.google.com/spreadsheets/d/1nr4qZpG5V--Cq5cTLX1TLA7OpfdFqhIGr6-IFuoq1yw/pubhtml?gid=0&single=true
[8] - http://reviews.llvm.org/D21140
[9] - http://reviews.llvm.org/D21491
[10] - http://reviews.llvm.org/D20575
[11] - https://reviews.llvm.org/D22187
[12] - https://reviews.llvm.org/D22828
[13] - https://reviews.llvm.org/D23740
[14] - https://reviews.llvm.org/D23759
[15] - https://reviews.llvm.org/D23260
[16] - http://romangareev.blogspot.ru/2016/08/gsoc-2016-report-ii.html
[17] - http://lists.llvm.org/pipermail/llvm-dev/2016-May/100229.html

Monday, August 22, 2016

#GSoC 2016 Report II

Hi everyone!

As was mentioned previously, my next task was to implement the packing transformation. It can be described as a data-layout transformation that requires introducing a new array, copying data to it, and changing memory access locations to reference that array (see [1], p. 6 for further details).
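
For illustration, packing a kc × nr tile of B into a contiguous buffer could look as follows (a simplified sketch, not the code Polly generates; the names Bc, kc and nr follow [1], and ldb is the leading dimension of B):

/* Copy a kc x nr tile of B (row-major, leading dimension ldb) into the
   contiguous buffer Bc, so that subsequent accesses to the tile touch
   consecutive memory locations.  */
static void pack_B_tile (int kc, int nr, const double *B, int ldb, double *Bc)
{
  for (int p = 0; p < kc; p++)
    for (int j = 0; j < nr; j++)
      *Bc++ = B[p * ldb + j];
}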

At the time the packing transformation was implemented, Polly supported only one data-layout transformation, namely changing memory access locations. The remaining tasks were to implement the creation of empty arrays and the copying of data to them. However, it was first required to identify the locations that should be changed.

For the following code, which could correspond to matrix-matrix multiplication,

Loop 1 for i = 0, …, UpperBound1 - 1 in steps of 1
Loop 2   for j = 0, …, UpperBound2 - 1 in steps of 1
Loop 3     for k = 0, …, UpperBound3 - 1 in steps of 1
             C[i][j] += A[i][k] * B[k][j]
           endfor
         endfor
       endfor

the following access functions can be built:

S[i, j, k]->A[i][k], S[i, j, k]->B[k][j].

According to the algorithm used to optimize matrix multiplication, described in "Analytical Modeling is Enough for High Performance BLIS" [1], they should be changed to the following access relations:

S[O0, O1, O2, O3, O4, O5, O6, O7, O8]->Ac[O5 + K * O3, O6] and

S[O0, O1, O2, O3, O4, O5, O6, O7, O8]->Bc[O5 + K * O4, O7], respectively.

The transformation helps to access the elements of the operands of a matrix multiplication after the creation of the BLIS micro- and macro-kernels. The described approach is implemented in [3] and is already available from the main repository.

The next task was to create an empty array. To do this, we use an object of the ScopArrayInfo class, which is designed to store information (e.g., the type of the elements and the dimension sizes) about arrays in the SCoP (see [2] for further details). In our case, the array is not originally part of the SCoP. Furthermore, the array is allocated only during the CodeGeneration pass, which takes the mathematical model derived and optimized by Polly and translates it back to LLVM-IR. The described approach is implemented in [3] and is already available from the main repository.
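
A rough sketch of the idea follows (the helper name and signature are modeled on Polly's code but should be treated as illustrative; S is the current Scop, Builder an available IRBuilder, and Mc, Kc the dimension sizes):

// Describe a new array Packed_A of doubles with dimension sizes Mc x Kc
// without allocating it; the allocation itself happens later, in the
// CodeGeneration pass.
const ScopArrayInfo *PackedA =
    S.createScopArrayInfo(Builder.getDoubleTy(), "Packed_A", {Mc, Kc});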

The last task was to copy data to the newly created arrays according to the packing transformation.

To implement this, we decided to introduce a new type of SCoP statement that has one read memory access and one write memory access (in this order). In particular, data is loaded from the location described by the read memory access and written to the location described by the write memory access. To describe the domain of such a statement, we use extension nodes of the integer set library [4].
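
For illustration (a simplified, assumed form; the exact relations are generated by Polly), a copy statement that packs elements of A into Ac could carry the pair of accesses CopyStmt[i, k] -> A[i][k] (read) and CopyStmt[i, k] -> Ac[i, k] (write), so that executing all the points of its domain copies the whole tile.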

The described approach is implemented in [5]. It yields the following execution-time improvements on the benchmarks from PolyBench [6] that correspond to matrix multiplications:

Performance Improvements - Execution Time

Benchmark                                                            Δ        Previous  Current
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/3mm/3mm     -70.59%  7.1680    2.1080
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/2mm/2mm     -68.85%  4.7640    1.4840
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/gemm/gemm   -66.61%  2.3720    0.7920

The results obtained from the Polly buildbot perf-x86_64-penryn-O3-polly are available at the following link [7].

Although this is already implemented, we continue to improve the patch and the related Polly infrastructure. In particular, we are planning to implement the ability to introduce SCoP statements via the JSCoP, which helps to test new optimizations without any knowledge of compiler internals [8]. During this GSoC we have already extended the JSCoP to allow the user to declare new arrays and to reference these arrays from access expressions. This is implemented in [9] and is already available from the main repository.

Refs.:

[1] - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf
[2] - http://llvm.org/viewvc/llvm-project/polly/trunk/include/polly/ScopInfo.h?view=markup
[3] - https://reviews.llvm.org/D22187
[4] - http://isl.gforge.inria.fr
[5] - https://reviews.llvm.org/D23260
[6] - http://web.cs.ucla.edu/~pouchet/software/polybench/
[7] - https://gareevroman.github.io
[8] - http://www.infosun.fim.uni-passau.de/cl/arbeiten/grosser-d.pdf
[9] - https://reviews.llvm.org/D22828

Tuesday, June 21, 2016

#GSoC 2016 Report I

Hi everyone!

As was mentioned in my previous blog post, one of my first tasks was to implement tilings and interchanges of specific loops based on the algorithm used to optimize matrix multiplication, described in "Analytical Modeling is Enough for High Performance BLIS" [1]. It was decided to do this step by step, and this blog post describes those steps.

Let's assume that the matrix multiplication has the following form and that the matrices A, B and C are stored in row-major order, in contrast to the column-major order used in [1].

Loop 1 for i = 0, …, UpperBound1 - 1 in steps of 1
Loop 2   for j = 0, …, UpperBound2 - 1 in steps of 1
Loop 3     for k = 0, …, UpperBound3 - 1 in steps of 1
             C[i][j] += A[i][k] * B[k][j]
           endfor
         endfor
       endfor

In this case, the optimized code could look like this,

Loop 1 for ic = 0, ..., UpperBound1 − 1 in steps of mc
Loop 2   for pc = 0, ..., UpperBound3 − 1 in steps of kc
           Ac = Pack(A(ic : ic + mc − 1, pc : pc + kc − 1))
Loop 3     for jc = 0, ..., UpperBound2 − 1 in steps of nc
             Bc = Pack(B(pc : pc + kc − 1, jc : jc + nc − 1))
Loop 4       for ir = 0, ..., mc − 1 in steps of mr
Loop 5         for jr = 0, ..., nc − 1 in steps of nr
Loop 6           for pr = 0, ..., kc − 1 in steps of 1
                   Cc(ir : ir + mr − 1, jr : jr + nr − 1) += Ac(ir : ir + mr − 1, pr) * Bc(pr, jr : jr + nr − 1)
                 endfor
               endfor
             endfor
           endfor
         endfor
       endfor

where Loop 4 and Loop 5 form a macro-kernel around Loop 6, which contains rank-1 updates and forms a micro-kernel.
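
A rank-1 update means that one iteration of Loop 6 adds the outer product of a column of Ac and a row of Bc to the mr × nr micro-tile of C. As a sketch in C (a denotes a column of Ac and b a row of Bc; the names are illustrative):

/* One rank-1 update of the mr x nr micro-tile Cr: Cr += a * b^T.  */
for (int ii = 0; ii < mr; ii++)
  for (int jj = 0; jj < nr; jj++)
    Cr[ii][jj] += a[ii] * b[jj];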

The first step was to create a micro-kernel. To do this, we tile the first two loops of the matrix multiplication and subsequently unroll them. As a result, we get this code:

Loop 1 for ir = 0, …, UpperBound1 - 1 in steps of mr
Loop 2   for jr = 0, …, UpperBound2 - 1 in steps of nr
Loop 3     for k = 0, …, UpperBound3 - 1 in steps of 1
             C[ir][jr] += A[ir][k] * B[k][jr]
             …
             C[ir][jr + nr - 1] += A[ir][k] * B[k][jr + nr - 1]
             …
             C[ir + mr - 1][jr + nr - 1] += A[ir + mr - 1][k] * B[k][jr + nr - 1]
           endfor
         endfor
       endfor

If auto-vectorization is enabled in LLVM, the body of Loop 3 can be replaced with rank-1 updates, and we get a micro-kernel. This is utilized in [2].

The second step was the creation of a macro-kernel, by tiling the first three loops and interchanging Loop 3 and Loop 2. This approach was implemented in [3].

However, without optimal parameter values, the creation of the described kernels could produce poorly optimized implementations or even increase the execution time of the generated code. To address this issue, we use the optimization model for mapping the GEMM algorithm described in [1].

It requires information about the following parameters of a target architecture:

1. Size of a double-precision floating-point number.

2. Number of double-precision floating-point numbers that can be held in a vector register.

3. Throughput of vector instructions per clock cycle.

4. Latency of instructions (i.e., the minimum number of cycles between the issuance of two dependent consecutive instructions).

5. Parameters of the cache levels (cache-line size, associativity degree, size, and number of cache sets).

The first two parameters can be obtained from the TargetTransformInfo class. Furthermore, since the product of the number of cache sets and the cache-line size is the cache size divided by the associativity degree, the number of cache sets does not have to be specified separately, and only the remaining four parameters have to be determined. At the moment, they can be specified as command-line parameters.
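
For illustration, deriving the number of cache sets from the parameters that are already specified could look as follows (a minimal sketch with hypothetical names, not Polly's actual code):

// CacheSize = LineSize * Associativity * NumSets, hence:
static unsigned getNumberOfCacheSets(unsigned CacheSize, unsigned LineSize,
                                     unsigned Associativity) {
  return CacheSize / (LineSize * Associativity);
}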

Even if we are able to attain close-to-peak performance of matrix multiplication, we still have to determine that the code under consideration contains a matrix multiplication. To address this issue, we specified the following class of statements, which contains matrix multiplication (an example satisfying all four conditions is shown after the list):

1. There are exactly three input dimensions.
2. All memory accesses of the statement have stride 0 or 1 if we interchange loops (switch the variable used in the inner loop to the outer loop).
3. All memory accesses of the statement, except the last one, are read accesses, and the last one is a write access.
4. No subscript of the last memory access of the statement contains the variable used in the inner loop.
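
For example, the canonical matrix multiplication statement satisfies all four conditions:

for (int i = 0; i < M; i++)
  for (int j = 0; j < N; j++)
    for (int k = 0; k < K; k++)
      /* Three input dimensions (i, j, k); the reads A[i][k] and B[k][j]
         have stride 0 or 1 after a suitable interchange; the only write,
         C[i][j], comes last; and its subscripts do not contain the
         innermost variable k.  */
      C[i][j] += A[i][k] * B[k][j];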

A larger class was introduced in [4] to simplify its detection and to make a step toward the generalization planned for the future.

Refs.:

[1] - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf
[2] - http://reviews.llvm.org/D21140
[3] - http://reviews.llvm.org/D21491
[4] - http://reviews.llvm.org/D20575

My GSOC 2016 Proposal: Improvement of vectorization process in Polly

Hi everyone!

At the moment I'm participating in Google Summer of Code 2016. My next blog posts will be devoted to updates on the project as its milestones are reached. Below is the accepted proposal.

Title: Improvement of vectorization process in Polly
Abstract
Polly can perform classical loop transformations, exploit OpenMP-level parallelism, and expose SIMDization opportunities. However, due to the lack of a machine-specific performance model and missing optimizations, these transformations sometimes lead to compile-time and execution-time regressions, and the generated code is at least one order of magnitude off in comparison to the corresponding vendor implementations. The goal of the project is to reduce this influence through the implementation of optimizations aimed at producing code compatible with the best implementations of BLAS, and through an attempt to avoid vectorization of loops when it is not profitable for the target architecture. This could be a step in the transformation of Polly into an optimization pass used in standard -O3 optimizations.
1 Summary
The Polly framework is applied to a broad class of programs, which sometimes causes regressions and is not always reasonable. Our project aims to reduce this influence, which could be a step in the transformation of Polly into an optimization pass used in standard -O3 optimizations.
2 The project
Although Polly performs classical loop transformations, especially tiling and loop fusion to improve data locality, it can also be used to expose SIMDization opportunities. To do so, we can utilize, for example, prevectorization: the choice of a possible outer loop that is strip-mined to the innermost level to enable inner-loop vectorization. In certain cases, this can give a 50.31% execution-time improvement [1].
However, tests such as SingleSource/Benchmarks/Polybench/linear-algebra/solvers/lu/lu from the LLVM test-suite show that sometimes these transformations are not reasonable and only result in compile-time regressions. Furthermore, the result for SingleSource/Benchmarks/Polybench/medley/floyd-warshall/floyd-warshall stated in [2] demonstrates that there are cases when utilization of Polly leads to execution-time regressions.
Even if the issues mentioned above are solved, Polly is about 3x off in comparison to the vendor implementations of BLAS. For example, let us consider the GEMM operation of the form C := AB + C, where C, A, and B are matrices of sizes 1056 × 1056, 1056 × 1024, and 1024 × 1056, respectively. The corresponding code compiled with r262915 of Clang takes about 2.02 seconds. Polly and its prevectorization help to attain 0.26 seconds. However, the usage of BLAS routines such as dgemm from the Intel MKL, even in sequential mode, can reduce this number to 0.086 seconds.
The goal of our project is to improve the vectorization process in Polly by working in two directions: first, the reduction of compile-time and execution-time regressions; second, the reduction of execution time in situations where a speedup is already achieved. In the first case, it is planned to estimate loops and determine their optimal vectorization factors based on the LLVM vectorization cost model, in order to expose SIMDization opportunities only in those loops that are estimated to be profitable. In the second case, the best implementations of the GEMM operation will be considered in order to find and implement things missing from both the default optimizations of Polly and the transformation to expose SIMDization opportunities. Furthermore, we'll add a new mode that uses the same Polly tool flow, processes statements that contain only matrix multiplication, and helps reduce the regressions.
3 Benefits for Polly
  • A new mode of Polly could be of interest to people who would like to use Polly but worry about compile-time and execution-time regressions.
  • Specific optimization of the GEMM operation, along with the determination of optimal vectorization factors of loops, will reduce the execution time of code generated by Polly.
  • Vectorizing only the loops that are profitable according to the LLVM vectorization cost model will reduce the average compile time of Polly.
  • Determining the loops that are not profitable according to the LLVM vectorization cost model allows optimizing them with different strategies (e.g., register tiling), which could produce speedups.
  • Implementation of the things related to the effective optimization of the GEMM operation (i.e., a packing transformation) could improve the default optimization sequence of Polly.
  • Even though Polly is already able to speed up compute kernels significantly, when compared to the best BLAS routines we are still at least one order of magnitude off [3]. This project could be a step toward obtaining fast BLAS kernels that are competitive with vendor math libraries.
4 Details about the project
Performance of single-threaded GEMM with different implementations
The GEMM operation computes C := αAB + βC, where C, A, and B are matrices of sizes m × n, m × k, and k × n, respectively. For simplicity, we will assume that α = β = 1. In the case, for example, of m = n = 1056 and k = 1024, a corresponding implementation [4] compiled with r262915 of Clang with the following options takes about 2.02 seconds on the Intel Core i7-3820 SandyBridge:
-O3 -I /utilities/ /utilities/polybench.c -DPOLYBENCH_TIME  -DPOLYBENCH_USE_SCALAR_LB  -march=native
However, code based on the BLIS framework [5], the Intel MKL or the OpenBLAS can reduce this number even with a sequential implementation, taking 0.088, 0.086 and 0.0865 seconds, respectively.
The following figure shows the results of evaluating these implementations on the Intel Core i7-3820 SandyBridge, for different values of m, n, k.

[Figure: single-threaded GEMM performance of the BLIS-based, Intel MKL and OpenBLAS implementations]

Porting to Polly
Polly is another way to optimize [4]: it doesn't require the usage of external libraries and helps to obtain the results presented in the following figure, which shows the performance gap between code optimized with Polly and the implementations stated above.
Consideration of the algorithm from [6], which is used to perform the GEMM operation in the BLIS framework, and of its implementation written in C [7] for the Intel Core i7-3820 SandyBridge reveals the following things, which could be added to Polly to try to achieve the same performance:
  • Tilings and interchanges of specific loops can produce loop nests similar to those presented in [6]. Furthermore, [6] describes an algorithm for determining the tiling parameters based on the following properties of the target architecture: cache-line sizes, cache associativity degrees and sizes, and parameters of fused multiply-add instructions. Consequently, an implementation of matrix multiplication can be tuned without access to the target machine.

However, these transformations cannot always improve the generated code on their own. For example, a JSCoP, which can be found at the following link [8], helps to generate a loop nest similar to the one stated in [6], but increases the execution time to 0.62 seconds. This reveals the next missing piece: a packing transformation.

  • A packing transformation can be described as a transformation that replaces accesses to a tile of a matrix with accesses to a temporary block of consecutive memory locations that contains the elements of the replaced tile. For example, its utilization helps to reduce the execution time from 0.78 to 0.27 seconds.

Although Polly doesn't perform the transformation at the moment, it can be modeled using the memory address modification support.

  • Further improvements can be achieved by taking into account the information about SIMD register usage during matrix multiplication from [6]. For example, the micro-tiles Cr and Br, the result of tiling loops 5 and 6 from [6], can be loaded from memory, employed in the multiplication, and only after that stored back; a sketch follows below. This helps to reduce the execution time from 0.27 to 0.10 seconds.
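
As an illustration of this idea (a simplified sketch with assumed names, not the code Polly generates; MR and NR are the micro-tile sizes and kc the depth of the pr loop):

/* Keep the MR x NR micro-tile of C in registers across the whole pr loop
   instead of loading and storing it in every iteration.  */
double c[MR][NR];
/* ... load the micro-tile of C into c once ... */
for (int p = 0; p < kc; p++)      /* Loop 6: one rank-1 update per iteration */
  for (int ii = 0; ii < MR; ii++)
    for (int jj = 0; jj < NR; jj++)
      c[ii][jj] += Ac[ii][p] * Bc[p][jj];
/* ... store the micro-tile of C back once ... */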

The following figure shows the results of evaluating the implementation of [6] written in C on the Intel Core i7-3820 SandyBridge, for different values of m, n, k.

[Figure: performance of the C implementation of the algorithm from [6]]

It shows that there is still a gap between the mentioned implementation written in C and those based on the BLIS framework, the Intel MKL or the OpenBLAS, which use inline assembly. A possible way to attain the same performance is to improve instruction selection and register allocation, which could influence the translation from LLVM-IR.

Of course, Polly is used to optimize a broader class of programs that includes matrix multiplication. Sometimes this causes regressions (for example, in the case of SingleSource/Benchmarks/Polybench/medley/floyd-warshall/floyd-warshall from the LLVM test suite [2]) and is not reasonable (for example, in the case of SingleSource/Benchmarks/Polybench/linear-algebra/solvers/lu/lu from the LLVM test suite [1]). In this project we aim to add a new mode that uses the same Polly tool flow, processes statements that contain only matrix multiplication, and helps reduce the regressions. It can also be used by the default optimization sequence of Polly to handle matrix multiplication in a special way.

Determination of profitability of vectorization

Although, in certain cases, vectorization can give a 50.31% execution-time improvement of the generated code [1], tests such as SingleSource/Benchmarks/Polybench/linear-algebra/solvers/lu/lu from the LLVM test-suite show that sometimes these transformations are not reasonable and only result in compile-time regressions [2]. Furthermore, let us consider the work of prevectorization, the main vectorization transformation in Polly. Prevectorization is the choice of a possible outer loop that is strip-mined to the innermost level to enable inner-loop vectorization. The goal of this transformation, as implemented in Polly, is to create a trivially vectorizable loop: a parallel loop at the innermost level that has a constant number of iterations corresponding to the target vector width and can therefore be easily vectorized by LLVM's inner-loop vectorizer (or Polly's simple vectorizer). However, the implementation has a drawback: it always uses a vectorization factor of four, even when the vector registers of the target architecture cannot hold that many elements or would allow even wider vector operations.
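
For illustration, strip-mining a parallel loop by the vector width (assumed here to be four) produces an innermost loop with a constant trip count that the inner-loop vectorizer handles trivially; N is assumed to be a multiple of four for brevity:

/* Before prevectorization: a parallel loop over i.  */
for (int i = 0; i < N; i++)
  A[i] = B[i] + C[i];

/* After prevectorization with a vectorization factor of four: the innermost
   loop has exactly four iterations and is trivially vectorizable.  */
for (int i = 0; i < N; i += 4)
  for (int v = i; v < i + 4; v++)
    A[v] = B[v] + C[v];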

To address these issues, we aim to implement the determination of vectorization factors and of the profitability of vectorization based on the LLVM vectorization cost model. The cost model uses an algorithm that can be roughly described as follows:

// A vectorization factor. In the end, its value will be equal to one if
// vectorization is not profitable.
unsigned VF = 1;
TargetTransformInfo TTI;
unsigned WidestRegister = TTI.getRegisterBitWidth(true);
// Traverse each instruction in the loop, examining only loads, stores and
// PHI nodes, to determine the smallest and the widest type sizes in bits.
unsigned SmallestType, WidestType;
std::tie(SmallestType, WidestType) = getSmallestAndWidestTypes();

unsigned MaxVectorSize = WidestRegister / WidestType;
unsigned NewMaxVectorSize = WidestRegister / SmallestType;
// Select the largest VF that doesn't require more registers than the
// target provides.
for (unsigned VS = MaxVectorSize; VS <= NewMaxVectorSize; VS *= 2)
  if (calculateRegisterUsage(VS) < TargetNumRegisters)
    VF = VS;

// Compute the expected cost of every candidate width, based on the cost of
// each instruction in the loop as computed by
// LoopVectorizationCostModel::getInstructionCost, and pick the cheapest one.
unsigned Width = 1;
float Cost = expectedCost(1);
for (unsigned i = 2; i <= VF; i *= 2) {
  float VectorCost = expectedCost(i) / (float)i;
  if (VectorCost < Cost) {
    Cost = VectorCost;
    Width = i;
  }
}

It is planned to apply this algorithm in ScheduleTreeOptimizer::optimizeBand, in order to call ScheduleTreeOptimizer::prevectSchedBand only when it is profitable.

Timeline

  • 23 May – 29 May: Get more familiar with the things missing in Polly to attain the best implementations of GEMM.
  • 30 May – 5 June: Get more familiar with the LLVM vectorization cost model.
  • 6 June – 19 June: Implement tilings and interchanges of specific loops based on the algorithm presented in [6].
  • 20 June – 27 June: Vacation.
  • 28 June – 3 July: Implementation of the packing transformation.
  • 4 July – 10 July: Testing and fixing bugs.
  • 11 July – 24 July:  Add generation of specific code, which takes into account information about SIMD registers.
  • 25 July - 31 July: Testing and fixing bugs.
  • 1 August - 7 August: Implement a determination of vectorization factors and profitability of vectorization based on the LLVM vectorization cost model.
  • 8 August - 23 August: Testing and fixing bugs.

5 Required deliverables

  • Add a mode that uses the same Polly tool flow and processes statements that contain only matrix multiplication in a special way.
  • Implement the determination of vectorization factors and of the profitability of vectorization.
  • Pass the regression tests.
  • Add new tests to the test suite.

6 Nice to have

Depending on the progress of the integration work and on the results, we want to:
  • Extend the new mode to handle statements that contain other code along with matrix multiplication.
  • Implement a heuristic to choose which dimension to prevectorize when there are several. At the moment Polly always uses the first coincident dimension.

About me

I am a first-year Ph.D. student at Ural Federal University majoring in "Mathematical modelling, numerical methods, and software complexes". At the moment my research focuses on machine-specific performance modelling and polyhedral optimizations.

I have been interested in compiler development, and in particular in polyhedral compilation, for years. Along with academic experience in writing compilers using ANTLR, Flex, and Bison, I have experience working on the LLVM Compiler Infrastructure and the GNU Compiler Collection. I successfully completed the project "Integration of ISL code generator into Graphite" during GSoC 2014 in the GNU Compiler Collection; its description and code samples can be found at the following link [9]. As part of my research, I have added new features to Polly and its vectorization process [10], [11], [12]. I have also fixed a bug in LICM [13], [14] and the following bugs in Polly [15], [16], [17], [18], [19], [20].

I will be available to work full time (40 hours per week) on this project.

Refs.:

[1] - https://drive.google.com/file/d/0B2Wloo-931AoQ1FRa0ZVbWdzc28/view?usp=sharing
[2] - https://drive.google.com/file/d/0B2Wloo-931AoLWFjeVZMbDF3UE0/view?usp=sharing
[3] - http://polly.llvm.org/projects.html
[4] - https://drive.google.com/file/d/0B2Wloo-931AoRTlVUVRKb1h6TFk/view?usp=sharing
[5] - https://github.com/flame/blis
[6] - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf
[7] - https://drive.google.com/file/d/0B2Wloo-931AoUUU1T2ZLTDFHNFk/view?usp=sharing
[8] - https://drive.google.com/file/d/0B2Wloo-931AoR0tYOVJDbXRPMTQ/view?usp=sharing
[9] - https://www.google-melange.com/gsoc/project/details/google/gsoc2014/groman/5639274879778816
[10] - http://reviews.llvm.org/D13779
[11] - http://reviews.llvm.org/D14491
[12] - http://repo.or.cz/polly-mirror.git/commit/4d9f318d4114d64062f38b0c7efd9e0ef647cc8f
[13] - https://llvm.org/bugs/show_bug.cgi?id=23077
[14] - http://reviews.llvm.org/D16753
[15] - http://repo.or.cz/polly-mirror.git/commit/defbd21a86ca0c3b595096720a49428c014b2c55
[16] - http://reviews.llvm.org/D15563
[17] - https://llvm.org/bugs/show_bug.cgi?id=25759
[18] - https://llvm.org/bugs/show_bug.cgi?id=19422
[19] - https://llvm.org/bugs/show_bug.cgi?id=25760
[20] - http://repo.or.cz/polly-mirror.git/commit/f95553beea5c274749d3ddad61c4e7c06f6dafaf

Saturday, May 31, 2014

#GSoC Report II

Hi everyone!

As I told you before, my second task is to get more familiar with CLooG generation. In this post, I'll walk through an example.

If we give the following code as an input to Graphite
#include <stdio.h>
#define N 16
 
int
main1 (int n, int *a)
{
  int i, j;

  for (i = 0; i < n - 1; i++)
    for (j = 0; j < n; j++)
      a[j] = i + n;

  for (j = 0; j < n; j++)
    if (a[j] != i + n - 1)
      __builtin_abort ();

  return 0;
}

int
main ()
{
  int a[N];
  main1 (N, a);
  return 0;
} 

Graphite will outline the following SCoP:

Schedule of the basic block: [n, P_37] -> { S_5[i0, i1] -> [0, i0, 0, i1, 0] }

Domain of the basic block: [n, P_37] -> { S_5[i0, i1] : exists (e0 = [(-2 + n)/4294967296], e1 = [(-1 + n)/4294967296]: i0 >= 0 and 4294967296e0 <= -2 + n and 4294967296e0 >= -4294967297 + n and 4294967296e0 <= -2 + n - i0 and i0 <= 2147483645 and i1 >= 0 and 4294967296e1 <= -1 + n and 4294967296e1 >= -4294967296 + n and 4294967296e1 <= -1 + n - i1 and i1 <= 2147483646 and n >= 1) }

Let's generate a CloogInput:
static CloogInput *
generate_cloog_input (scop_p scop)
{
  CloogUnionDomain *union_domain;
  CloogInput *cloog_input;
  CloogDomain *context;

  union_domain = build_cloog_union_domain (scop);
  context = cloog_domain_from_isl_set (isl_set_copy (scop->context));
  cloog_input = cloog_input_alloc (context, union_domain);

  return cloog_input;
}

A CloogInput structure represents the input to CLooG. It is essentially a CloogUnionDomain together with a context CloogDomain, where CloogDomain is a type representing a polyhedral domain (a union of polyhedra). generate_cloog_input() converts the SCoP's context into the type expected by CloogInput. build_cloog_union_domain() generates the CloogUnionDomain, which collects the domains and scattering functions of all basic blocks in the given SCoP.

static CloogUnionDomain *
build_cloog_union_domain (scop_p scop)
{
  int i;
  poly_bb_p pbb;
  CloogUnionDomain *union_domain =
    cloog_union_domain_alloc (scop_nb_params (scop));

  FOR_EACH_VEC_ELT (SCOP_BBS (scop), i, pbb)
    {
      CloogDomain *domain;
      CloogScattering *scattering;

      /* Dead code elimination: when the domain of a PBB is empty,
         don't generate code for the PBB.  */
      if (isl_set_is_empty (pbb->domain))
        continue;

      domain = cloog_domain_from_isl_set (isl_set_copy (pbb->domain));
      scattering = cloog_scattering_from_isl_map
        (isl_map_copy (pbb->transformed));

      union_domain = cloog_union_domain_add_domain (union_domain, "", domain,
                                                    scattering, pbb);
    }

  return union_domain;
}

Let's pass the previously generated cloog_input to the CLooG code generator:
static struct clast_stmt *
scop_to_clast (scop_p scop)
{
  CloogInput *cloog_input;
  struct clast_stmt *clast;
  CloogOptions *options =  cloog_options_malloc (cloog_state);
  options->language = CLOOG_LANGUAGE_C;
  cloog_input = generate_cloog_input (scop);

  clast = cloog_clast_create_from_input (cloog_input, options);

  cloog_options_free (options);
  return clast;
}

We can produce the following dump now:

for (c2=0;c2<=M-2;c2++) {
  for (c4=0;c4<=M-1;c4++) {
    (c2,c4);
  }
}

In CLooG, an AST constructed by cloog_clast_create_from_input has the type clast_stmt, which represents a linked list of statements and allows us to traverse the AST. Let's do this:
static void
translate_clast (CloogOptions *options, int indent,
                 FILE *file, struct clast_stmt *);


static void
translate_clast_for (CloogOptions *options, int indent,
                     FILE *file, struct clast_for *stmt)
{
  fprintf (file, "%*s", indent, "");
  fprintf(file, "for (");
  if (stmt->LB)
    {
      fprintf(file, "%s=LB", stmt->iterator);
    }
  fprintf(file,";");
  if (stmt->UB)
    {
      fprintf (file,"%s<=UB", stmt->iterator);
    }
  if (cloog_int_gt_si(stmt->stride, 1))
    {
      fprintf (file,";%s+=", stmt->iterator);
      cloog_int_print (file, stmt->stride);
      fprintf (file, ") \n");
    }
  else
    fprintf(file, ";%s++) \n", stmt->iterator);
  translate_clast (options, indent + 2, file, stmt->body);
}

static void
translate_clast_guard (CloogOptions *options, int indent,
                       FILE *file, struct clast_guard *stmt)
{
  fprintf (file, "%*s", indent, "");
  fprintf(file,"if ");
  if (stmt->n > 1)
    fprintf(file, "(");
  int i;
  for (i = 0; i < stmt->n; i++)
  {
    if (i > 0)
    {
      fprintf(file, " && ");
    }
    fprintf(file, "(eq)");
  }
  if (stmt->n > 1)
    fprintf(file, ")");
  translate_clast (options, indent + 2, file, stmt->then);
}

static void
translate_clast_user (int indent, FILE *file,  struct clast_stmt *stmt)
{
  fprintf (file, "%*s", indent, "");
  print_clast_stmt (file, stmt);
}

static void
translate_clast_assignment (int indent, FILE *file,  struct clast_stmt *stmt)
{
  fprintf (file, "%*s", indent, "");
  print_clast_stmt (file, stmt);
}


static void
translate_clast (CloogOptions *options, int indent,
                 FILE *file, struct clast_stmt *stmt)
{
  if (!stmt)
    return;

  if (CLAST_STMT_IS_A (stmt, stmt_root))
    ; /* Do nothing.  */

  else if (CLAST_STMT_IS_A (stmt, stmt_user))
    translate_clast_user (indent, file, stmt);

  else if (CLAST_STMT_IS_A (stmt, stmt_for))
    translate_clast_for (options, indent, file, (struct clast_for *) stmt);

  else if (CLAST_STMT_IS_A (stmt, stmt_guard))
    translate_clast_guard (options, indent, file, (struct clast_guard *) stmt);

  else if (CLAST_STMT_IS_A (stmt, stmt_block))
    translate_clast (options, indent, file, ((struct clast_block *) stmt)->body);

  else if (CLAST_STMT_IS_A (stmt, stmt_ass))
    translate_clast_assignment (indent, file, stmt);
  else
    gcc_unreachable ();

  translate_clast (options, indent, file, stmt->next);
}
Finally, we get the following:

for (c2=LB;c2<=UB;c2++) 
  for (c4=LB;c4<=UB;c4++) 
    (c2,c4);

Sunday, May 25, 2014

#GSoC Report I

Hi everyone!

As I told you before, my first task is to get more familiar with ISL AST generation. In this post, I'll describe my solution.

First of all, I added a new command-line flag (fgraphite-isl-code-gen) to GCC. This was implemented by adding the following code to gcc/common.opt:
fgraphite-isl-code-gen
Common Report Var(flag_graphite_isl_code_gen) Optimization
Enable isl code generator in Graphite
The flag fgraphite-isl-code-gen will be used to guard the generation of the new code in parallel with the existing one, when necessary:
if (flag_graphite_isl_code_gen)
  {
    //generation of new code 
  }
Let's consider the following code: 
#include <stdio.h>
#define N 16
 
int
main1 (int n, int *a)
{
  int i, j;

  for (i = 0; i < n - 1; i++)
    for (j = 0; j < n; j++)
      a[j] = i + n;

  for (j = 0; j < n; j++)
    if (a[j] != i + n - 1)
      __builtin_abort ();

  return 0;
}

int
main ()
{
  int a[N];
  main1 (N, a);
  return 0;
} 
Graphite outlines the following SCoP from it: 

Schedule of the basic block: [n, P_37] -> { S_5[i0, i1] -> [0, i0, 0, i1, 0] }

Domain of the basic block: [n, P_37] -> { S_5[i0, i1] : exists (e0 = [(-2 + n)/4294967296], e1 = [(-1 + n)/4294967296]: i0 >= 0 and 4294967296e0 <= -2 + n and 4294967296e0 >= -4294967297 + n and 4294967296e0 <= -2 + n - i0 and i0 <= 2147483645 and i1 >= 0 and 4294967296e1 <= -1 + n and 4294967296e1 >= -4294967296 + n and 4294967296e1 <= -1 + n - i1 and i1 <= 2147483646 and n >= 1) }

After adding the new flag, I set up the generation of an isl_union_map, which specifies the order used to visit the elements of a domain.

static isl_union_map *
generate_isl_schedule (scop_p scop)
{
  int i;
  poly_bb_p pbb;
  isl_union_map *schedule_isl =
    isl_union_map_empty (isl_set_get_space (scop->context));

  FOR_EACH_VEC_ELT (SCOP_BBS (scop), i, pbb)
    {
      /* Dead code elimination: when the domain of a PBB is empty,
         don't generate code for the PBB.  */
      if (isl_set_is_empty (pbb->domain))
        continue;

      isl_map *bb_schedule = isl_map_copy (pbb->transformed);
      bb_schedule = isl_map_intersect_domain (bb_schedule,
                                              isl_set_copy (pbb->domain));
      schedule_isl =
        isl_union_map_union (schedule_isl,
                             isl_union_map_from_map (bb_schedule));
    }
  return schedule_isl;
}
You can see that this function returns the union of all maps constructed by intersecting the scattering functions with the corresponding domains of all basic blocks in the given SCoP.

In our case we get the following isl_union_map:

[n, P_37] -> { S_5[i0, i1] -> [0, i0, 0, i1, 0] : exists (e0 = [(-2 + n)/4294967296], e1 = [(-1 + n)/4294967296]: i0 >= 0 and 4294967296e0 <= -2 + n and 4294967296e0 >= -4294967297 + n and 4294967296e0 <= -2 + n - i0 and i0 <= 2147483645 and i1 >= 0 and 4294967296e1 <= -1 + n and 4294967296e1 >= -4294967296 + n and 4294967296e1 <= -1 + n - i1 and i1 <= 2147483646 and n >= 1) }


After this, I set up the generation of an isl_ast_build, which specifies the constraints on the parameters.
static isl_ast_build *
generate_isl_context (scop_p scop)
{
  isl_set * context_isl = isl_set_params (isl_set_copy (scop->context));
  return isl_ast_build_from_context (context_isl);
}
Unfortunately, I haven't found a way to dump an isl_ast_build in ISL. That's why I'll provide a dump of the universe set generated from our SCoP, which is used by isl_ast_build_from_context to create the isl_ast_build.
[n, P_37] -> { : n <= 2147483647 and n >= 2 and P_37 <= 2147483647 and P_37 >= -2147483648 }


After this, I finally set up the generation of an ISL AST.
static isl_ast_node *
scop_to_isl_ast (scop_p scop)
{
  isl_union_map *schedule_isl = generate_isl_schedule (scop);
  isl_ast_build *context_isl = generate_isl_context (scop);
  isl_ast_node *ast_isl = isl_ast_build_ast_from_schedule (context_isl,
          schedule_isl);
  isl_ast_build_free (context_isl);
  return ast_isl;
}
Let's consider a dump of the ISL AST generated in our case.

for (int c1 = 0; c1 < n - 1; c1 += 1)
  for (int c3 = 0; c3 < n; c3 += 1)
    S_5(c1, c3);

Now we can compare it with the CLooG output, which is also generated from our SCoP.

for (scat_1=0;scat_1<=n_7-2;scat_1++) {
  for (scat_3=0;scat_3<=n_7-1;scat_3++) {
    (scat_1,scat_3);
  }
}

You can see that they are generally similar.


After this generation, I traversed the ISL AST with the following code:
void
print_indent (int indent, isl_printer *prn)
{
  int i;
  for (i = 0; i < indent; i++)
    {
      prn = isl_printer_print_str (prn, " ");
    }
}

static void
translate_isl_ast (int indent, __isl_keep isl_ast_node *node, 
                   isl_printer *prn);

/* Translates an isl_ast_node_for NODE_FOR to gimple.
   TODO: replace output to FILE with translation
   to GCC representation. */

static void
translate_isl_ast_node_for (int indent, __isl_keep isl_ast_node *node_for,
                            isl_printer *prn)
{
  isl_id *id;
  const char *name;
  const char *type;
  type = isl_options_get_ast_iterator_type (isl_printer_get_ctx (prn));
  isl_ast_expr *for_iterator = isl_ast_node_for_get_iterator (node_for);
  /* Extract the iterator's name before releasing the expression.  */
  id = isl_ast_expr_get_id (for_iterator);
  isl_ast_expr_free (for_iterator);
  name = isl_id_get_name (id);
  isl_id_free (id);
  prn = isl_printer_print_str (prn, "\n");
  print_indent (indent, prn);
  prn = isl_printer_print_str (prn, "for (");
  prn = isl_printer_print_str (prn, type);
  prn = isl_printer_print_str (prn, " ");
  prn = isl_printer_print_str (prn, name);
  prn = isl_printer_print_str (prn, " = ");
  isl_ast_expr *for_init = isl_ast_node_for_get_init (node_for);
  prn = isl_printer_print_ast_expr (prn, for_init);
  isl_ast_expr_free (for_init);
  prn = isl_printer_print_str (prn, "; ");
  isl_ast_expr *for_cond = isl_ast_node_for_get_cond (node_for);
  prn = isl_printer_print_ast_expr (prn, for_cond);
  isl_ast_expr_free (for_cond);
  prn = isl_printer_print_str (prn, "; ");
  prn = isl_printer_print_str (prn, name);
  prn = isl_printer_print_str (prn, " += ");
  isl_ast_expr *for_inc = isl_ast_node_for_get_inc (node_for);
  prn = isl_printer_print_ast_expr (prn, for_inc);
  isl_ast_expr_free (for_inc);
  prn = isl_printer_print_str (prn, ")");
  isl_ast_node *for_body = isl_ast_node_for_get_body (node_for);
  translate_isl_ast (indent + 2, for_body, prn);
  isl_ast_node_free (for_body);
}

/* Translates an isl_ast_node_if NODE_IF to gimple.
   TODO: replace output to FILE with translation
   to GCC representation. */

static void
translate_isl_ast_node_if (int indent, __isl_keep isl_ast_node *node_if,
                           isl_printer *prn)
{
  prn = isl_printer_print_str (prn, "\n");
  print_indent (indent, prn);
  prn = isl_printer_print_str (prn, "if (");
  isl_ast_expr *if_cond = isl_ast_node_if_get_cond (node_if);
  prn = isl_printer_print_ast_expr (prn, if_cond);
  isl_ast_expr_free (if_cond);
  prn = isl_printer_print_str (prn, ")");
  isl_ast_node *if_then = isl_ast_node_if_get_then (node_if);
  translate_isl_ast (indent + 2, if_then, prn);
  isl_ast_node_free (if_then);
  if (isl_ast_node_if_has_else (node_if))
    {
      prn = isl_printer_print_str (prn, "else");
      isl_ast_node *if_else = isl_ast_node_if_get_else(node_if);
      translate_isl_ast (indent + 2, if_else, prn);
      isl_ast_node_free (if_else);
    }
}

/* Translates an isl_ast_node_user NODE_USER to gimple.
   TODO: replace output to FILE with translation
   to GCC representation. */

static void
translate_isl_ast_node_user (int indent, __isl_keep isl_ast_node *node_user,
                             isl_printer *prn)
{
  prn = isl_printer_print_str (prn, "\n");
  prn = isl_printer_set_indent (prn, indent);
  prn = isl_printer_print_ast_node (prn, node_user);
}

/* Translates an isl_ast_node_block NODE_BLOCK to gimple.
   TODO: replace output to FILE with translation
   to GCC representation. */

static void
translate_isl_ast_node_block (int indent, __isl_keep isl_ast_node *node_block,
                              isl_printer *prn)
{
  isl_ast_node_list *node_list = isl_ast_node_block_get_children (node_block);
  int i;
  for (i = 0; i < isl_ast_node_list_n_ast_node (node_list); i++)
    {
      translate_isl_ast (indent + 2,
                         isl_ast_node_list_get_ast_node (node_list, i), prn);
    }
  isl_ast_node_list_free (node_list);
}

/* Translates a ISL AST node NODE to GCC representation in the
   context of a SESE.
   TODO: replace output to FILE with translation
   to GCC representation. */

static void
translate_isl_ast (int indent, __isl_keep isl_ast_node *node, isl_printer *prn)
{
  switch (isl_ast_node_get_type (node))
    {
    case isl_ast_node_error:
      gcc_unreachable ();

    case isl_ast_node_for:
      translate_isl_ast_node_for (indent, node, prn);
      return;

    case isl_ast_node_if:
      translate_isl_ast_node_if (indent, node, prn);
      return;

    case isl_ast_node_user:
      translate_isl_ast_node_user (indent, node, prn);
      return;

    case isl_ast_node_block:
      translate_isl_ast_node_block (indent, node, prn);
      return;

    default:
      gcc_unreachable ();
    }
}

/* Prints NODE to FILE.  */

void
print_isl_ast_node (FILE *file, __isl_keep isl_ast_node *node, 
                    __isl_keep isl_ctx *ctx)
{
  isl_printer *prn = isl_printer_to_file (ctx, file);
  prn = isl_printer_set_output_format (prn, ISL_FORMAT_C);
  prn = isl_printer_print_ast_node (prn, node);
  isl_printer_free (prn);
}
You can verify that the traversal is correct by comparing its output with the dump of the ISL AST.

for (int c1 = 0; c1 < n - 1; c1 += 1)
  for (int c3 = 0; c3 < n; c3 += 1)
    S_5(c1, c3);


You can find all the mentioned changes in the following patch: http://gcc.1065356.n5.nabble.com/attachment/1037729/0/patch