For axpy-like and ddot-like operations, the vector length is the
local vector size, i.e., the number of rows of the submatrix owned
by each MPI process, typically 1 million or more.
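These two kernels can be sketched as follows; the function and array names here are illustrative, not taken from the benchmark source, and `n` stands for the local vector length.

```c
#include <stddef.h>

/* axpy: y = alpha*x + y over the local vector.
   Long, stride-1, trivially vectorizable loop. */
void axpy(size_t n, double alpha, const double *x, double *y) {
  for (size_t i = 0; i < n; ++i)
    y[i] += alpha * x[i];
}

/* ddot: local dot product x'*y.
   A reduction over the full local length; in the MPI code this
   would be followed by an MPI_Allreduce across processes. */
double ddot(size_t n, const double *x, const double *y) {
  double sum = 0.0;
  for (size_t i = 0; i < n; ++i)
    sum += x[i] * y[i];
  return sum;
}
```

Both loops read memory with unit stride and have no loop-carried dependence other than the dot-product reduction, which is why their performance is governed almost entirely by memory bandwidth.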

For the SpMV, the vector lengths in the reference kernel average
about 25, with indexed reads (gathers) and summation into a single
value. This is the inner loop of the compressed sparse row (CSR)
matrix-vector product. The basic pseudocode is:

```c
for (i=0; i<nrow; ++i) {      // Number of rows on this MPI process (big)
  double sum = 0.0;
  for (j=ptr[i]; j<ptr[i+1]; ++j)
    sum += A[j] * x[col[j]];  // This loop is on average length 25
  y[i] = sum;
}
```
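The same loop can be made concrete as a small self-contained routine; the 3x3 matrix in the usage example is illustrative only.

```c
/* CSR sparse matrix-vector product y = A*x, matching the pseudocode
   above. ptr[i]..ptr[i+1] delimits the nonzeros of row i; col[j]
   holds their column indices. */
void spmv_csr(int nrow, const int *ptr, const int *col,
              const double *A, const double *x, double *y) {
  for (int i = 0; i < nrow; ++i) {
    double sum = 0.0;
    for (int j = ptr[i]; j < ptr[i+1]; ++j)
      sum += A[j] * x[col[j]];  /* indexed read (gather) from x */
    y[i] = sum;
  }
}
```

The gather `x[col[j]]` is the part that defeats simple unit-stride vectorization, and the short inner trip count (about 25 in HPCG) limits how much a compiler can recover by vectorizing it directly.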

For the SymGS, the vector lengths average about 12, with the same
indexed read pattern, except that the loop does not naturally
vectorize: the SymGS operation is recursive, like a triangular solve,
since each row update reads values produced by earlier rows in the
same sweep.
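The recursion is visible in a sketch of the forward sweep; the array names and the separately stored diagonal `diag` are assumptions for illustration, not the benchmark's exact data structure.

```c
/* Forward sweep of a Gauss-Seidel smoother on a CSR matrix.
   x[i] depends on x[col[j]] values that earlier iterations of the
   i-loop may have just updated, so the outer loop cannot be
   vectorized or parallelized as written. */
void symgs_forward(int nrow, const int *ptr, const int *col,
                   const double *A, const double *diag,
                   const double *r, double *x) {
  for (int i = 0; i < nrow; ++i) {
    double sum = r[i];
    for (int j = ptr[i]; j < ptr[i+1]; ++j)
      sum -= A[j] * x[col[j]];  /* average length ~12 in HPCG */
    sum += diag[i] * x[i];      /* add back the diagonal term */
    x[i] = sum / diag[i];
  }
}
```

A backward sweep (i from nrow-1 down to 0) with the same body completes the symmetric Gauss-Seidel step.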

Optimized implementations of SpMV tend to reorder the matrix data
structure so that the SpMV loops still use indexed reads but are of
length nrow (the same as axpy), using, for example, the Jagged
Diagonal (JDS) or ELLPACK format.
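An ELLPACK-style sketch shows how the reordering turns the short row loop into a long vector loop; the layout details (zero padding, column-major storage) are the standard ELLPACK convention, and the names are illustrative.

```c
/* ELLPACK-style SpMV: each row is padded to maxnz entries (padding
   uses value 0.0 and any valid column index), and A and col are
   stored column-major, so the inner i-loop is a long stride-1,
   vectorizable indexed-read loop of length nrow. */
void spmv_ell(int nrow, int maxnz, const int *col,
              const double *A, const double *x, double *y) {
  for (int i = 0; i < nrow; ++i) y[i] = 0.0;
  for (int k = 0; k < maxnz; ++k)     /* short outer loop (~maxnz) */
    for (int i = 0; i < nrow; ++i)    /* long vector loop (nrow) */
      y[i] += A[k*nrow + i] * x[col[k*nrow + i]];
}
```

The trade-off is padding: rows with fewer than maxnz nonzeros store explicit zeros, so memory traffic grows if the nonzero counts per row vary widely.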

Optimized implementations of SymGS are also reordered to obtain
longer vector lengths, but these are typically a fraction of nrow,
or much smaller, depending on whether the target is a GPU or a CPU.
The GPU approach uses multicoloring, so vector lengths are
approximately nrow/8 (one color's worth of rows). The CPU approach
uses level scheduling, where vector lengths vary widely but are
typically in the range of 15 to 100. The multicoloring approach
weakens the smoother, so the solver takes more iterations, which
HPCG penalizes. All optimized SymGS approaches use indexed reads.
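A multicolor sweep can be sketched as below; the color data layout (`color_ptr`, `rows_of_color`) is an assumed structure for illustration. Rows that share a nonzero are given different colors, so all rows of one color can be updated independently.

```c
/* Multicolor Gauss-Seidel forward sweep (sketch). Rows are grouped
   by color; rows of the same color have no nonzero connecting them,
   so the k-loop over one color is independent work of roughly
   nrow/ncolors rows and can be vectorized or run on a GPU. */
void symgs_multicolor(int ncolors, const int *color_ptr,
                      const int *rows_of_color,
                      const int *ptr, const int *col,
                      const double *A, const double *diag,
                      const double *r, double *x) {
  for (int c = 0; c < ncolors; ++c) {            /* sequential over colors */
    for (int k = color_ptr[c]; k < color_ptr[c+1]; ++k) { /* parallel */
      int i = rows_of_color[k];
      double sum = r[i];
      for (int j = ptr[i]; j < ptr[i+1]; ++j)
        sum -= A[j] * x[col[j]];
      sum += diag[i] * x[i];
      x[i] = sum / diag[i];
    }
  }
}
```

Because updates within a color ignore each other's new values, the reordered sweep is a mathematically different (weaker) smoother than the natural-order sweep, which is why the iteration count rises.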