HPCCG

A simple conjugate gradient benchmark code for a 3D chimney domain on an arbitrary number of processors.


Problem Size Discussion

From the application README:

Suggested: data sizes should range from 25% up to 75% of total system memory.

With nx = ny = nz and n = nx * ny * nz, total memory per MPI rank is 720 * n bytes for the 27-point stencil and 240 * n bytes for the 7-point stencil.
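
For the 256 x 256 x 256 run analyzed below this works out to about 11.25 GiB per rank with the 27-point stencil. A minimal sketch of that arithmetic (plain C++, not HPCCG code):

#include <cstdio>

// Evaluate the per-rank memory formula quoted above for nx = ny = nz = 256.
int main() {
    const long nx = 256, ny = 256, nz = 256;       // the Run_Parameters used below
    const long n  = nx * ny * nz;                  // 16,777,216 rows
    const double gib = 1024.0 * 1024.0 * 1024.0;
    std::printf("27-pt stencil: %.2f GiB per rank\n", 720.0 * n / gib);  // ~11.25 GiB
    std::printf(" 7-pt stencil: %.2f GiB per rank\n", 240.0 * n / gib);  // ~3.75 GiB
    return 0;
}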

Additional details can be found in the application README.


Analysis

On the Skylake machine used for this analysis, HPCCG sustains 70-80% of DRAM bandwidth despite memory latency issues caused by an indirect memory access (the x[cur_inds[j]] gather in HPC_sparsemv, shown below).


Parameters

Compiler = icc (ICC) 18.0.1 20171018
Build_Flags = -g -O3 -march=native -ftree-vectorize -qopenmp -DUSING_OMP
Run_Parameters = 256 256 256 (nx ny nz)

Scaling


Performance Improvement

Threads     2        4        8        16       32       56       112
Speed Up    1.95X    1.92X    1.95X    1.40X    1.34X    1.01X    0.95X

Hit Locations


FLOPS

Double Precision    Scalar       128B Packed    256B Packed    512B Packed    Total FLOPS    GFLOPS/sec
PMU                 3.520e+10    6.790e+10      2.550e+09      0.000e+00      1.812e+11      2.136e+01
SDE                 3.500e+10    6.748e+10      2.525e+09      0.000e+00      1.801e+11      2.123e+01
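
The Total FLOPS column is consistent with weighting each double-precision instruction count by its vector width: 1 FLOP per scalar instruction, 2 per 128-bit packed, 4 per 256-bit packed, and 8 per 512-bit packed. A short check against the SDE row:

#include <cstdio>

// Reconstruct Total FLOPS from the SDE instruction counts in the table above.
int main() {
    const double scalar = 3.500e10;   // scalar double-precision instructions
    const double p128   = 6.748e10;   // 128-bit packed (2 doubles each)
    const double p256   = 2.525e9;    // 256-bit packed (4 doubles each)
    const double p512   = 0.0;        // 512-bit packed (8 doubles each)
    const double total  = scalar + 2.0 * p128 + 4.0 * p256 + 8.0 * p512;
    std::printf("Total FLOPS: %.3e\n", total);   // ~1.80e+11, matching the table
    return 0;
}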

Intel Software Development Emulator

Intel SDE               HPCCG
Arithmetic Intensity    0.103
FLOPS per Inst          0.442
FLOPS per FP Inst       1.71
Bytes per Load Inst     7.96
Bytes per Store Inst    7.52
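
Assuming the usual SDE-based definitions (arithmetic intensity = total FLOPs divided by total bytes loaded and stored, FP instructions = the scalar plus packed double-precision counts above), these figures can be cross-checked as follows; this is illustrative arithmetic, not part of the measurement tooling:

#include <cstdio>

// Cross-check the derived SDE metrics against the raw counts above.
int main() {
    const double total_flops = 1.801e11;                       // SDE Total FLOPS
    const double ai          = 0.103;                          // FLOPs per byte
    const double fp_insts    = 3.500e10 + 6.748e10 + 2.525e9;  // scalar + packed DP

    std::printf("Implied load/store traffic: %.3e bytes\n", total_flops / ai);
    std::printf("FLOPS per FP Inst:          %.2f\n", total_flops / fp_insts);  // ~1.71
    return 0;
}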

Roofline – Intel(R) Xeon(R) Platinum 8180M CPU

112 Threads – 56 Cores – 3200.0 MHz
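
The roofline model bounds attainable performance by min(peak FLOP rate, arithmetic intensity x peak memory bandwidth). Below is a minimal sketch of that bound at HPCCG's measured intensity; the peak figures are placeholder assumptions for a 2-socket 8180M node, not measurements from this study:

#include <algorithm>
#include <cstdio>

// Classic roofline bound: attainable = min(P_peak, AI * BW_peak).
int main() {
    const double ai            = 0.103;   // arithmetic intensity from the SDE table
    const double peak_gflops   = 5734.0;  // assumed: 56 cores * 3.2 GHz * 32 DP FLOP/cycle
    const double peak_dram_gbs = 256.0;   // assumed: ~128 GB/s per socket, 2 sockets
    const double bound = std::min(peak_gflops, ai * peak_dram_gbs);
    std::printf("Roofline bound at AI = %.3f: %.1f GFLOP/s\n", ai, bound);  // ~26 GFLOP/s
    // The measured 21.4 GFLOP/s sits just under this memory-bandwidth ceiling.
    return 0;
}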


UOPS Executed


Experiment Aggregate Metrics

Threads (Time)   IPC per Core   Loads per Cycle   L1 Hits per Cycle   L1 Miss Ratio   L2 Miss Ratio   L3 Miss Ratio   L2 B/W Utilized   L3 B/W Utilized   DRAM B/W Utilized
1 (100.0%)       1.35           0.70              0.59                2.44%           43.36%          91.79%          11.18%            38.09%            42.03%
56 (100.0%)      0.42           0.20              0.17                2.04%           44.08%          91.52%          2.71%             24.03%            65.88%
112 (100.0%)     0.46           0.12              0.10                2.12%           43.94%          85.94%          2.73%             22.06%            59.98%

HPC_sparsemv.cpp

Threads (Time)   IPC per Core   Loads per Cycle   L1 Hits per Cycle   L1 Miss Ratio   L2 Miss Ratio   L3 Miss Ratio   L2 B/W Utilized   L3 B/W Utilized   DRAM B/W Utilized
1 (83.1%)        1.39           0.77              0.68                1.86%           42.24%          72.59%          11.36%            38.16%            42.23%
56 (70.4%)       0.47           0.25              0.22                1.74%           42.51%          68.42%          3.27%             28.55%            78.07%
112 (65.5%)      0.48           0.13              0.12                2.19%           42.56%          75.15%          3.51%             28.05%            76.01%
66 int HPC_sparsemv( HPC_Sparse_Matrix *A,
67                   const double * const x, double * const y)
68 {
69
70   const int nrow = (const int) A->local_nrow;
71
72   #ifdef USING_OMP
73   #pragma omp parallel for
74   #endif
75   for (int i=0; i< nrow; i++)    // one sparse row per iteration
76   {
77     double sum = 0.0;
78     const double * const cur_vals =
79       (const double * const) A->ptr_to_vals_in_row[i];   // nonzero values of row i
80
81     const int * const cur_inds =
82       (const int * const) A->ptr_to_inds_in_row[i];      // column indices of row i
83
84     const int cur_nnz = (const int) A->nnz_in_row[i];    // nonzeros in row i
85
Metrics for the inner loop below:

Threads (Time)   IPC per Core   Loads per Cycle   L1 Hits per Cycle   L1 Miss Ratio   L2 Miss Ratio   L3 Miss Ratio   L2 B/W Utilized   L3 B/W Utilized   DRAM B/W Utilized
1 (64.5%)        1.32           0.62              0.55                1.92%           43.27%          71.10%          10.02%            37.62%            39.92%
56 (63.7%)       0.41           0.20              0.17                1.78%           44.29%          63.23%          2.87%             28.25%            75.00%
112 (60.1%)      0.47           0.13              0.11                2.20%           43.91%          74.23%          3.28%             27.66%            73.94%
86     for (int j=0; j< cur_nnz; j++)
87       sum += cur_vals[j]*x[cur_inds[j]];   // indirect (gather) access into x
88     y[i] = sum;
89   }
90   return(0);
91 }
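
To experiment with this access pattern outside the benchmark, the sketch below uses a simplified stand-in for HPC_Sparse_Matrix containing only the fields the kernel touches (local_nrow, nnz_in_row, ptr_to_vals_in_row, ptr_to_inds_in_row); it is an illustration, not HPCCG's actual data structure or setup code.

#include <cstdio>
#include <vector>

// Simplified stand-in for the parts of HPC_Sparse_Matrix used above:
// per-row pointers into value/index storage plus a per-row nonzero count.
struct ToySparseMatrix {
  int local_nrow;
  std::vector<int>     nnz_in_row;
  std::vector<double*> ptr_to_vals_in_row;
  std::vector<int*>    ptr_to_inds_in_row;
};

// Same loop structure as HPC_sparsemv: accumulate vals[j] * x[inds[j]] per row,
// where x is read through an indirect index (the gather noted in the Analysis).
void toy_sparsemv(const ToySparseMatrix &A, const double *x, double *y) {
  for (int i = 0; i < A.local_nrow; i++) {
    double sum = 0.0;
    const double *cur_vals = A.ptr_to_vals_in_row[i];
    const int    *cur_inds = A.ptr_to_inds_in_row[i];
    for (int j = 0; j < A.nnz_in_row[i]; j++)
      sum += cur_vals[j] * x[cur_inds[j]];
    y[i] = sum;
  }
}

int main() {
  // Tiny 3x3 diagonal example: one nonzero per row.
  double vals[3] = {1.0, 2.0, 3.0};
  int    inds[3] = {0, 1, 2};

  ToySparseMatrix A;
  A.local_nrow         = 3;
  A.nnz_in_row         = {1, 1, 1};
  A.ptr_to_vals_in_row = {&vals[0], &vals[1], &vals[2]};
  A.ptr_to_inds_in_row = {&inds[0], &inds[1], &inds[2]};

  double x[3] = {1.0, 1.0, 1.0}, y[3];
  toy_sparsemv(A, x, y);
  std::printf("y = %g %g %g\n", y[0], y[1], y[2]);   // prints: y = 1 2 3
  return 0;
}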