HPCCG

A simple conjugate gradient benchmark code for a 3D chimney
domain on an arbitrary number of processors.


Problem Size

From the application README

Suggested: data sizes ranging from 25% up to 75% of total system memory.

With nx = ny = nz and n = nx * ny * nz:
Total memory per MPI rank: 720 * n bytes for the 27-pt stencil, 240 * n bytes for the 7-pt stencil.

Additional details are in the application README.
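
For the run parameters used below (nx = ny = nz = 256), here is a minimal sketch of the per-rank memory estimate, using only the byte factors quoted above; the helper program is illustrative and not part of HPCCG:

#include <cstdio>

int main() {
    // Run parameters from the Parameters section below.
    const long long nx = 256, ny = 256, nz = 256;
    const long long n  = nx * ny * nz;              // 16,777,216 rows per rank
    const double gib   = 1024.0 * 1024.0 * 1024.0;
    // README formula: 720*n bytes (27-pt stencil), 240*n bytes (7-pt stencil).
    std::printf("27-pt stencil: %.2f GiB per MPI rank\n", 720.0 * n / gib);  // 11.25 GiB
    std::printf(" 7-pt stencil: %.2f GiB per MPI rank\n", 240.0 * n / gib);  //  3.75 GiB
    return 0;
}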


Parameters

Compiler = icc (ICC) 18.0.1 20171018
Build_Flags = -g -O3 -march=native -ftree-vectorize -qopenmp -DUSING_OMP
Run_Parameters = 256 256 256

Scaling

Performance Improvement

Threads 2 4 8 16 32 56 112
Speed Up 1.95X 1.92X 1.95X 1.40X 1.34X 1.01X 0.95X

Hit Locations


FLOPS

Double Precision  Scalar  128-bit Packed  256-bit Packed  512-bit Packed  Total FLOPS  GFLOPS/sec
PMU 3.520e+10 6.790e+10 2.550e+09 0.000e+00 1.812e+11 2.136e+01
SDE 3.500e+10 6.748e+10 2.525e+09 0.000e+00 1.801e+11 2.123e+01
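
The Total FLOPS column is consistent with weighting each double-precision count by its SIMD width: 1 element for scalar, 2 for 128-bit packed, 4 for 256-bit, and 8 for 512-bit. A minimal sketch of that reduction using the PMU row above; the weighting is the standard interpretation for these counters, assumed here rather than taken from the collection tooling:

#include <cstdio>

int main() {
    // Double-precision instruction counts from the PMU row of the table above.
    const double scalar    = 3.520e10;   // 1 FLOP per scalar instruction
    const double packed128 = 6.790e10;   // 2 doubles per 128-bit instruction
    const double packed256 = 2.550e9;    // 4 doubles per 256-bit instruction
    const double packed512 = 0.0;        // 8 doubles per 512-bit instruction
    const double total = scalar + 2*packed128 + 4*packed256 + 8*packed512;
    std::printf("Total FLOPS: %.3e\n", total);   // prints ~1.812e+11, matching the table
    return 0;
}

Dividing Total FLOPS by the reported GFLOPS/sec also implies an elapsed time of roughly 8.5 seconds for this run.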

Intel Software Development Emulator

Intel SDE HPCCG
Arithmetic Intensity 0.103
FLOPS per Inst 0.442
FLOPS per FP Inst 1.71
Bytes per Load Inst 7.96
Bytes per Store Inst 7.52

Roofline – Intel(R) Xeon(R) Platinum 8180M CPU

112 Threads – 56 Cores – 3200.0 MHz

GB/sec L1 B/W L2 B/W L3 B/W DRAM B/W
1 Thread 159.33 91.42 47.08 21.27
56 Threads 9816.2 5579.1 1050.00* 198.4
112 Threads 9912.56 5573.58 1050.00* 203.13

* ERT was unable to recognize the L3 bandwidth plateau (very short); estimate taken from the graph.
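
Combined with the SDE arithmetic intensity of 0.103 FLOPS/byte, these ceilings place the code firmly in the bandwidth-limited region of the roofline: 0.103 * 198.4 GB/s gives about 20 GFLOPS at 56 threads (about 21 GFLOPS at 112 threads using 203.13 GB/s), in line with the measured 21.2–21.4 GFLOPS/sec from the FLOPS table above. This check assumes SDE's total load/store traffic as the byte count and the ERT DRAM numbers as the ceiling; any reuse out of cache only raises the attainable rate.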



UOPS Executed


Experiment Aggregate Metrics

Threads (Time) IPC per Core Loads per Cycle L1 Hits per Cycle L1 Miss Ratio L2 Miss Ratio L3 Miss Ratio L2 B/W Utilized L3 B/W Utilized DRAM B/W Utilized
1 (100.0%) 1.35 0.70 0.59 2.44% 43.36% 91.79% 11.18% 38.09% 42.03%
56 (100.0%) 0.42 0.20 0.17 2.04% 44.08% 91.52% 2.71% 24.03% 65.88%
112 (100.0%) 0.46 0.12 0.10 2.12% 43.94% 85.94% 2.73% 22.06% 59.98%

HPC_sparsemv.cpp

Threads (Time) IPC per Core Loads per Cycle L1 Hits per Cycle L1 Miss Ratio L2 Miss Ratio L3 Miss Ratio L2 B/W Utilized L3 B/W Utilized DRAM B/W Utilized
1 (83.1%) 1.39 0.77 0.68 1.86% 42.24% 72.59% 11.36% 38.16% 42.23%
56 (70.4%) 0.47 0.25 0.22 1.74% 42.51% 68.42% 3.27% 28.55% 78.07%
112 (65.5%) 0.48 0.13 0.12 2.19% 42.56% 75.15% 3.51% 28.05% 76.01%
66 int HPC_sparsemv( HPC_Sparse_Matrix *A,
67 const double * const x, double * const y)
68 {
69
70   const int nrow = (const int) A->local_nrow;
71
72   #ifdef USING_OMP
73   #pragma omp parallel for
74   #endif
75   for (int i=0; i< nrow; i++) 
76   { 
77     double sum = 0.0; 
78     const double * const cur_vals = 
79     (const double * const) A->ptr_to_vals_in_row[i];
80
81     const int * const cur_inds =
82     (const int * const) A->ptr_to_inds_in_row[i];
83
84     const int cur_nnz = (const int) A->nnz_in_row[i];
85
Threads (Time) IPC per Core Loads per Cycle L1 Hits per Cycle L1 Miss Ratio L2 Miss Ratio L3 Miss Ratio L2 B/W Utilized L3 B/W Utilized DRAM B/W Utilized
1 (64.5%) 1.32 0.62 0.55 1.92% 43.27% 71.10% 10.02% 37.62% 39.92%
56 (63.7%) 0.41 0.20 0.17 1.78% 44.29% 63.23% 2.87% 28.25% 75.00%
112 (60.1%) 0.47 0.13 0.11 2.20% 43.91% 74.23% 3.28% 27.66% 73.94%
86     for (int j=0; j< cur_nnz; j++)
87       sum += cur_vals[j]*x[cur_inds[j]];
88     y[i] = sum;
89   }
90   return(0);
91 }
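
The measured arithmetic intensity follows directly from the data types in this loop. Each nonzero costs one multiply and one add (2 FLOPS) while loading an 8-byte double from cur_vals, a 4-byte int from cur_inds, and an 8-byte element of x through the gather x[cur_inds[j]]: roughly 2 / (8 + 4 + 8) = 0.10 FLOPS per byte of load traffic, in line with the SDE value of 0.103 (the per-row store to y is negligible against the ~27 nonzeros per row of the 27-pt stencil). Because the matrix values and column indices are streamed with no reuse, the loop is bound by memory bandwidth, which matches the high DRAM B/W utilization and low per-core IPC in the tables above.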