How does HPCG relate to other benchmarks?

We do not intend to eliminate HPL. HPCG will provide an alternative ranking of the TOP500 machines. We expect that HPCG will take several years to both mature and emerge as a widely visible metric.

If the reference version of HPCG is used for performance analysis, the fraction of time spent in the (unoptimized) sparse kernels (in particular ComputeSPMV and ComputeSYMGS) will be very high, and HPCG performance will be dominated by memory system performance. In this case, for computer systems with good reduction networks, or for HPCG runs using few MPI processes, the benchmark will give rankings that are very similar to STREAM.

However, this is true for many benchmarks. If HPL were executed with reference Fortran computational kernels (Basic Linear Algebra Subprograms), HPL would also look like a STREAM benchmark.

Warnings have been added to HPCG reports to let the benchmarker know that performance will be suboptimal when using reference kernels.

Even after optimization of HPCG, its overall performance will still be heavily influenced by memory system performance, but not solely by how fast data streams from memory. Memory latency, synchronous global and neighborhood collectives, and thread control transfer, all of which have strong dependencies on memory system performance, are important factors in the performance of an optimized version of HPCG.

NAS PB CG uses a random sparsity pattern, which naturally leads to a two-dimensional distribution of the matrix for optimality. This results in computation and communication patterns that are non-physical. Another difference is the lack of preconditioning, which means NAS PB CG cannot show the effects of a local triangular solve. The options for introducing such a preconditioning component are limited due to the non-physical sparsity pattern.

For axpy-like and ddot-like operations, the vector length is the size of the submatrix on each MPI process, typically 1M or more.

For the SpMV, the vector lengths in the reference kernel average about 25, with indexed reads (gathers) and summation into a single value. This is the inner loop of the sparse-row-format matrix-vector product. The basic pseudocode is:

`for (i=0; i<nrow; ++i) { // Number of rows on this MPI process (big)`

`  double sum = 0.0;`

`  for (j=ptr[i]; j<ptr[i+1]; ++j)`

`    sum += A[j] * x[col[j]]; // This loop is on average length 25`

`  y[i] = sum;`

`}`
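For readers who want to compile the loop above, here is a minimal C++ sketch of the same sparse-row SpMV. The `CsrMatrix` struct and the `tridiagonal` helper are hypothetical names introduced only for this illustration; they are not HPCG's own data structures.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CSR container for illustration; not HPCG's own matrix type.
struct CsrMatrix {
  std::size_t nrow = 0;
  std::vector<std::size_t> ptr;  // row i owns entries [ptr[i], ptr[i+1])
  std::vector<std::size_t> col;  // column index of each stored entry
  std::vector<double> val;       // value of each stored entry
};

// y = A * x: one short gather-and-reduce loop per row.
std::vector<double> spmv(const CsrMatrix& A, const std::vector<double>& x) {
  assert(A.ptr.size() == A.nrow + 1);
  std::vector<double> y(A.nrow, 0.0);
  for (std::size_t i = 0; i < A.nrow; ++i) {
    double sum = 0.0;
    for (std::size_t j = A.ptr[i]; j < A.ptr[i + 1]; ++j) {
      sum += A.val[j] * x[A.col[j]];  // indexed read (gather) from x
    }
    y[i] = sum;
  }
  return y;
}

// Tridiagonal (-1, 2, -1) matrix, used here only as small sample data.
CsrMatrix tridiagonal(std::size_t n) {
  CsrMatrix A;
  A.nrow = n;
  A.ptr.push_back(0);
  for (std::size_t i = 0; i < n; ++i) {
    if (i > 0) { A.col.push_back(i - 1); A.val.push_back(-1.0); }
    A.col.push_back(i);
    A.val.push_back(2.0);
    if (i + 1 < n) { A.col.push_back(i + 1); A.val.push_back(-1.0); }
    A.ptr.push_back(A.col.size());
  }
  return A;
}
```

Calling `spmv(tridiagonal(4), x)` with `x` of length 4 exercises the kernel on a tiny matrix. In the benchmark itself the rows average about 25 entries, which is why the inner loop is short.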

For the SymGS, the vector length is on average 12, with the same indexed read pattern, except that the loop does not naturally vectorize because the SymGS operation is recursive, like a triangular solve.
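The recursion can be seen in a small sketch of one symmetric Gauss-Seidel sweep. This is an illustration under assumptions, not the benchmark's kernel: the function name `symgs_sweep`, the plain CSR arrays, and the separate `diag` array are all choices made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One symmetric Gauss-Seidel sweep (forward, then backward) for A x = r.
// The CSR arrays and the separate diagonal copy are an assumed layout.
void symgs_sweep(const std::vector<std::size_t>& ptr,
                 const std::vector<std::size_t>& col,
                 const std::vector<double>& val,
                 const std::vector<double>& diag,
                 const std::vector<double>& r,
                 std::vector<double>& x) {
  const std::size_t nrow = diag.size();
  assert(ptr.size() == nrow + 1 && r.size() == nrow);

  // Relax row i using the newest available entries of x.
  auto relax = [&](std::size_t i) {
    double sum = r[i];
    for (std::size_t j = ptr[i]; j < ptr[i + 1]; ++j) {
      sum -= val[j] * x[col[j]];  // same gather pattern as SpMV
    }
    sum += diag[i] * x[i];  // the loop also subtracted the diagonal term
    x[i] = sum / diag[i];   // written before the next row reads it
  };

  for (std::size_t i = 0; i < nrow; ++i) relax(i);  // forward sweep
  for (std::size_t i = nrow; i-- > 0;) relax(i);    // backward sweep
}
```

Because row `i` reads values of `x` that earlier rows in the same sweep have just written, the outer loop carries a dependency from one row to the next, which is what prevents straightforward vectorization across rows.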

Optimized implementations of SpMV tend to re-order the matrix data structure so that the SpMV loops are still indexed reads, but are of length nrow (same as axpy), using, for example, the Jagged Diagonal or Ellpack format.
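As a rough sketch of the idea, the following C++ converts sparse-row data to an ELLPACK-style layout and multiplies with an inner loop over rows. The `EllMatrix` struct, the function names, and the padding convention (value 0 pointing at the row's own column) are assumptions for this example; real optimized implementations differ in detail.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical ELLPACK container: every row is padded to `width` entries and
// entries are stored slot by slot, so slot k of all rows is contiguous.
struct EllMatrix {
  std::size_t nrow = 0;
  std::size_t width = 0;
  std::vector<std::size_t> col;  // col[k * nrow + i]
  std::vector<double> val;       // val[k * nrow + i]; padding holds 0.0
};

EllMatrix csr_to_ell(const std::vector<std::size_t>& ptr,
                     const std::vector<std::size_t>& col,
                     const std::vector<double>& val) {
  assert(!ptr.empty());
  EllMatrix E;
  E.nrow = ptr.size() - 1;
  for (std::size_t i = 0; i < E.nrow; ++i) {
    E.width = std::max(E.width, ptr[i + 1] - ptr[i]);
  }
  E.col.resize(E.width * E.nrow);
  E.val.assign(E.width * E.nrow, 0.0);
  for (std::size_t i = 0; i < E.nrow; ++i) {
    for (std::size_t k = 0; k < E.width; ++k) {
      const std::size_t j = ptr[i] + k;
      const bool stored = j < ptr[i + 1];
      // Padding points at column i (assumes at least nrow columns), value 0.
      E.col[k * E.nrow + i] = stored ? col[j] : i;
      if (stored) E.val[k * E.nrow + i] = val[j];
    }
  }
  return E;
}

std::vector<double> ell_spmv(const EllMatrix& E, const std::vector<double>& x) {
  std::vector<double> y(E.nrow, 0.0);
  for (std::size_t k = 0; k < E.width; ++k) {
    for (std::size_t i = 0; i < E.nrow; ++i) {  // inner loop has length nrow
      y[i] += E.val[k * E.nrow + i] * x[E.col[k * E.nrow + i]];
    }
  }
  return y;
}
```

The reads from `x` are still gathers, but the loop that a compiler or vector unit sees now runs over all rows instead of the roughly 25 entries of a single row.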

Optimized implementations of SymGS are also reordered to get longer vector lengths, but these are typically a fraction of nrow, or much smaller, depending on whether the implementation targets a GPU or a CPU. The GPU approach uses multicoloring, so that vector lengths are approximately nrow/8. The CPU approach uses level scheduling, where vector lengths vary a lot but are typically in the range of 15 to 100. The GPU approach takes more iterations, which is penalized by HPCG. All optimized SymGS approaches use indexed reads.
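The two reordering strategies can be sketched as follows. Both functions are simplified illustrations with hypothetical names: `greedy_coloring` is a plain greedy heuristic that assumes a structurally symmetric matrix, and `forward_levels` computes levels for the forward sweep only. Neither is taken from any particular optimized HPCG implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Greedy multicoloring of the matrix graph: rows with the same color are not
// directly coupled, so one color can be relaxed as a single long vector loop.
std::vector<int> greedy_coloring(const std::vector<std::size_t>& ptr,
                                 const std::vector<std::size_t>& col) {
  assert(!ptr.empty());
  const std::size_t nrow = ptr.size() - 1;
  std::vector<int> color(nrow, -1);
  int ncolors = 0;
  for (std::size_t i = 0; i < nrow; ++i) {
    std::vector<char> used(static_cast<std::size_t>(ncolors) + 1, 0);
    for (std::size_t j = ptr[i]; j < ptr[i + 1]; ++j) {
      const std::size_t c = col[j];
      if (c < nrow && c != i && color[c] >= 0) used[color[c]] = 1;
    }
    int k = 0;
    while (used[k]) ++k;  // slot ncolors is always free, so this terminates
    color[i] = k;
    ncolors = std::max(ncolors, k + 1);
  }
  return color;
}

// Level scheduling for the forward sweep: a row's level is one more than the
// highest level among the earlier rows it reads. Rows in one level are
// independent of each other and can be processed together.
std::vector<int> forward_levels(const std::vector<std::size_t>& ptr,
                                const std::vector<std::size_t>& col) {
  assert(!ptr.empty());
  const std::size_t nrow = ptr.size() - 1;
  std::vector<int> level(nrow, 0);
  for (std::size_t i = 0; i < nrow; ++i) {
    for (std::size_t j = ptr[i]; j < ptr[i + 1]; ++j) {
      const std::size_t c = col[j];
      if (c < i) level[i] = std::max(level[i], level[c] + 1);
    }
  }
  return level;
}
```

Coloring changes the order in which rows are relaxed, which is consistent with the extra iterations noted above, while level scheduling keeps the original dependencies and so yields shorter, more irregular groups of rows.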

What could be done to make HPCG run more optimally on your system, and what are the allowed optimizations?

Yes, it is permitted to use a custom ordering of the grid points. This is facilitated by the function `OptimizeProblem()` and the `optimizationData` members of various data structures.
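The general pattern of hanging a custom ordering off an opaque pointer can be sketched as below. Everything here is a hypothetical stand-in: `MatrixLike`, `OrderingData`, and the three functions are invented for illustration and do not reproduce HPCG's actual types or the signature of `OptimizeProblem()`. Only the member name `optimizationData` is borrowed from the text above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a benchmark data structure that carries an
// opaque pointer for user optimization data. Not HPCG's actual type.
struct MatrixLike {
  std::size_t nrow = 0;
  void* optimizationData = nullptr;
};

// Hypothetical payload: a custom ordering of the local grid points.
struct OrderingData {
  std::vector<std::size_t> newToOld;  // newToOld[k] = original row at position k
};

// Build an ordering that groups rows by color and attach it to the matrix.
void attach_color_ordering(MatrixLike& A, const std::vector<int>& color) {
  assert(color.size() == A.nrow);
  auto* data = new OrderingData;
  data->newToOld.resize(A.nrow);
  std::iota(data->newToOld.begin(), data->newToOld.end(), std::size_t{0});
  // Stable sort keeps the original relative order inside each color.
  std::stable_sort(data->newToOld.begin(), data->newToOld.end(),
                   [&](std::size_t a, std::size_t b) { return color[a] < color[b]; });
  A.optimizationData = data;
}

// Kernels that know about the payload cast the opaque pointer back.
const OrderingData* get_ordering(const MatrixLike& A) {
  return static_cast<const OrderingData*>(A.optimizationData);
}

// The owner of the payload is responsible for releasing it.
void release_ordering(MatrixLike& A) {
  delete static_cast<OrderingData*>(A.optimizationData);
  A.optimizationData = nullptr;
}
```

The design point is that the generic data structure stays unchanged while optimized kernels find their reordered data through the opaque pointer.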

It is not permitted to change the preconditioner, but it is allowed to change the ordering of the matrix to facilitate parallel preconditioning.

HPCG can be run in just a few minutes from start to finish. However, official runs must be at least 1800 seconds (30 minutes) as reported in the output file. The Quick Path option is an exception for machines that are in production mode prior to broad availability of an optimized version of HPCG 3.0 for a given platform. In this situation (which should be confirmed by sending a note to the HPCG Benchmark owners), the Quick Path option can be invoked by setting the run-time parameter equal to 0 (zero).

A valid run must also execute a problem size that is large enough so that data arrays accessed in the CG iteration loop do not fit in the cache of the device in a way that would be unrealistic in a real application setting. Presently this restriction means that the problem size should be large enough to occupy a significant fraction of "main memory", at least 1/4 of the total.

Future memory system architectures may require restatement of the specific memory size requirements. But the guiding principle will always be that the problem size should reflect what would be reasonable for a real sparse iterative solver.

We are aware of many variants of the CG algorithm and their benefits for particular matrices. At the same time, we strive for simplicity of the reference implementation and permit only selected optimizations that allow the results to remain representative of a wide range of CG variants.