A simple conjugate gradient benchmark code for a 3D chimney domain on an arbitrary number of processors.
Problem Size Discussion
From the application README
Suggested: data sizes should range from 25% up to 75% of total system memory.

With nx = ny = nz and n = nx * ny * nz, the total memory per MPI rank is 720 * n bytes for the 27-point stencil and 240 * n bytes for the 7-point stencil.
Additional details in application README
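As a worked example of this sizing rule, the short sketch below computes the per-rank footprint for a cubic subdomain. The function name is illustrative only; nx = 256 is taken from the run parameters listed under Parameters.

```c++
// Minimal sketch: per-rank memory estimate for HPCCG's cubic subdomain.
// The 720 and 240 bytes-per-row constants come from the application README;
// the function name and example size are illustrative only.
#include <cstdint>
#include <cstdio>

std::uint64_t bytes_per_rank(std::uint64_t nx, bool stencil27) {
    const std::uint64_t n = nx * nx * nx;        // n = nx * ny * nz with nx = ny = nz
    return (stencil27 ? 720u : 240u) * n;        // bytes per MPI rank
}

int main() {
    const std::uint64_t nx = 256;                // matches Run_Parameters = 256 256 256
    std::printf("27-pt stencil: %.2f GB per rank\n", bytes_per_rank(nx, true) / 1e9);
    std::printf(" 7-pt stencil: %.2f GB per rank\n", bytes_per_rank(nx, false) / 1e9);
    return 0;
}
```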
Analysis
On the Skylake machine used for this analysis, HPCCG sustains 70-80% of the DRAM bandwidth despite memory latency issues caused by an indirect memory access (the x[cur_inds[j]] gather in the sparse matrix-vector kernel shown under Hit Locations).
Parameters
Compiler = icc (ICC) 18.0.1 20171018
Build_Flags = -g -O3 -march=native -ftree-vectorize -qopenmp -DUSING_OMP
Run_Parameters = 256 256 256
Scaling
Performance Improvement
| Threads  | 2     | 4     | 8     | 16    | 32    | 56    | 112   |
|----------|-------|-------|-------|-------|-------|-------|-------|
| Speed Up | 1.95X | 1.92X | 1.95X | 1.40X | 1.34X | 1.01X | 0.95X |
Hit Locations
FLOPS
| Double Precision | Scalar    | 128B Packed | 256B Packed | 512B Packed | Total FLOPS | GFLOPS/sec |
|------------------|-----------|-------------|-------------|-------------|-------------|------------|
| PMU              | 3.520e+10 | 6.790e+10   | 2.550e+09   | 0.000e+00   | 1.812e+11   | 2.136e+01  |
| SDE              | 3.500e+10 | 6.748e+10   | 2.525e+09   | 0.000e+00   | 1.801e+11   | 2.123e+01  |
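The Total FLOPS column is the weighted sum of the instruction counts, with each 128B, 256B, and 512B packed double-precision instruction contributing 2, 4, and 8 FLOPs respectively. A minimal sketch reproducing the SDE row (variable names are illustrative):

```c++
// Minimal sketch: total double-precision FLOPs from instruction counts,
// weighting each packed width by the FLOPs it performs per instruction.
// Counts are the SDE row of the table above.
#include <cstdio>

int main() {
    const double scalar    = 3.500e+10;  // 1 FLOP per instruction
    const double packed128 = 6.748e+10;  // 2 DP FLOPs per instruction
    const double packed256 = 2.525e+09;  // 4 DP FLOPs per instruction
    const double packed512 = 0.0;        // 8 DP FLOPs per instruction

    const double total = scalar + 2 * packed128 + 4 * packed256 + 8 * packed512;
    std::printf("Total FLOPS: %.3e\n", total);   // ~1.801e+11, matching the SDE row
    return 0;
}
```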
Intel Software Development Emulator
| Intel SDE            | HPCCG |
|----------------------|-------|
| Arithmetic Intensity | 0.103 |
| FLOPS per Inst       | 0.442 |
| FLOPS per FP Inst    | 1.71  |
| Bytes per Load Inst  | 7.96  |
| Bytes per Store Inst | 7.52  |
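As a rough guide to how these derived metrics relate to the raw counts, the sketch below recomputes FLOPS per FP Inst from the SDE instruction counts above; the total-instruction and total-byte inputs needed for the remaining ratios are placeholders, since those raw counts are not reproduced in this report.

```c++
// Minimal sketch of how the derived SDE metrics relate to raw counts.
// The FP-instruction counts come from the SDE row of the FLOPS table above;
// total_insts and total_bytes are placeholders for SDE counts not shown here.
#include <cstdio>

int main() {
    const double scalar = 3.500e+10, packed128 = 6.748e+10, packed256 = 2.525e+09;
    const double total_flops = scalar + 2 * packed128 + 4 * packed256;  // ~1.801e+11
    const double fp_insts    = scalar + packed128 + packed256;          // scalar + packed DP

    std::printf("FLOPS per FP Inst: %.2f\n", total_flops / fp_insts);   // ~1.71, as tabulated

    const double total_insts = 0.0;  // placeholder: total retired instructions
    const double total_bytes = 0.0;  // placeholder: bytes loaded + stored
    if (total_insts > 0) std::printf("FLOPS per Inst: %.3f\n", total_flops / total_insts);
    if (total_bytes > 0) std::printf("Arithmetic Intensity: %.3f\n", total_flops / total_bytes);
    return 0;
}
```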
Roofline – Intel(R) Xeon(R) Platinum 8180M CPU, 112 Threads, 56 Cores, 3200.0 MHz
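For context, the roofline bound at this arithmetic intensity can be estimated as min(peak GFLOP/s, AI × peak DRAM bandwidth). The peak values in the sketch below are rough assumptions for a two-socket Xeon Platinum 8180M node, not measurements from this report.

```c++
// Minimal sketch of the roofline bound: attainable = min(peak_flops, AI * peak_bw).
// The peak compute and DRAM bandwidth values are rough assumptions for a
// two-socket Xeon Platinum 8180M node, not values taken from this report.
#include <algorithm>
#include <cstdio>

int main() {
    const double ai          = 0.103;   // arithmetic intensity from the SDE table (FLOPs/byte)
    const double peak_gflops = 2000.0;  // assumed double-precision peak, GFLOP/s
    const double peak_bw_gbs = 230.0;   // assumed peak DRAM bandwidth, GB/s

    const double attainable = std::min(peak_gflops, ai * peak_bw_gbs);
    std::printf("Roofline bound at AI=%.3f: %.1f GFLOP/s\n", ai, attainable);  // ~23.7 GFLOP/s
    std::printf("Measured (PMU): 21.4 GFLOP/s\n");
    return 0;
}
```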
UOPS Executed
Experiment Aggregate Metrics
| Threads (Time) | IPC per Core | Loads per Cycle | L1 Hits per Cycle | L1 Miss Ratio | L2 Miss Ratio | L3 Miss Ratio | L2 B/W Utilized | L3 B/W Utilized | DRAM B/W Utilized |
|----------------|--------------|-----------------|-------------------|---------------|---------------|---------------|-----------------|-----------------|-------------------|
| 1 (100.0%)     | 1.35         | 0.70            | 0.59              | 2.44%         | 43.36%        | 91.79%        | 11.18%          | 38.09%          | 42.03%            |
| 56 (100.0%)    | 0.42         | 0.20            | 0.17              | 2.04%         | 44.08%        | 91.52%        | 2.71%           | 24.03%          | 65.88%            |
| 112 (100.0%)   | 0.46         | 0.12            | 0.10              | 2.12%         | 43.94%        | 85.94%        | 2.73%           | 22.06%          | 59.98%            |
HPC_sparsemv.cpp
| Threads (Time) | IPC per Core | Loads per Cycle | L1 Hits per Cycle | L1 Miss Ratio | L2 Miss Ratio | L3 Miss Ratio | L2 B/W Utilized | L3 B/W Utilized | DRAM B/W Utilized |
|----------------|--------------|-----------------|-------------------|---------------|---------------|---------------|-----------------|-----------------|-------------------|
| 1 (83.1%)      | 1.39         | 0.77            | 0.68              | 1.86%         | 42.24%        | 72.59%        | 11.36%          | 38.16%          | 42.23%            |
| 56 (70.4%)     | 0.47         | 0.25            | 0.22              | 1.74%         | 42.51%        | 68.42%        | 3.27%           | 28.55%          | 78.07%            |
| 112 (65.5%)    | 0.48         | 0.13            | 0.12              | 2.19%         | 42.56%        | 75.15%        | 3.51%           | 28.05%          | 76.01%            |
```c++
66 int HPC_sparsemv( HPC_Sparse_Matrix *A,
67                   const double * const x, double * const y)
68 {
69
70   const int nrow = (const int) A->local_nrow;
71
72 #ifdef USING_OMP
73 #pragma omp parallel for
74 #endif
75   for (int i=0; i< nrow; i++)
76     {
77       double sum = 0.0;
78       const double * const cur_vals =
79         (const double * const) A->ptr_to_vals_in_row[i];
80
81       const int * const cur_inds =
82         (const int * const) A->ptr_to_inds_in_row[i];
83
84       const int cur_nnz = (const int) A->nnz_in_row[i];
85
```
| Threads (Time) | IPC per Core | Loads per Cycle | L1 Hits per Cycle | L1 Miss Ratio | L2 Miss Ratio | L3 Miss Ratio | L2 B/W Utilized | L3 B/W Utilized | DRAM B/W Utilized |
|----------------|--------------|-----------------|-------------------|---------------|---------------|---------------|-----------------|-----------------|-------------------|
| 1 (64.5%)      | 1.32         | 0.62            | 0.55              | 1.92%         | 43.27%        | 71.10%        | 10.02%          | 37.62%          | 39.92%            |
| 56 (63.7%)     | 0.41         | 0.20            | 0.17              | 1.78%         | 44.29%        | 63.23%        | 2.87%           | 28.25%          | 75.00%            |
| 112 (60.1%)    | 0.47         | 0.13            | 0.11              | 2.20%         | 43.91%        | 74.23%        | 3.28%           | 27.66%          | 73.94%            |
```c++
86       for (int j=0; j< cur_nnz; j++)
87         sum += cur_vals[j]*x[cur_inds[j]];
88       y[i] = sum;
89     }
90   return(0);
91 }
```
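For readers unfamiliar with the benchmark's data structure, the sketch below is a simplified, self-contained version of the row-pointer layout the kernel above walks; it illustrates the access pattern, including the x[inds[j]] gather responsible for the indirect memory accesses, and is not the actual HPC_Sparse_Matrix definition.

```c++
// Minimal, self-contained sketch of the row-pointer sparse layout that
// HPC_sparsemv walks above: per-row value and column-index arrays, with the
// x[inds[j]] gather that makes the accesses indirect and latency-bound.
// This is an illustration, not the full HPC_Sparse_Matrix from the benchmark.
#include <cstdio>
#include <vector>

struct SimpleSparseMatrix {
    int local_nrow = 0;
    std::vector<int> nnz_in_row;                    // nonzeros per row
    std::vector<std::vector<double>> vals_in_row;   // values, one array per row
    std::vector<std::vector<int>>    inds_in_row;   // column indices, one array per row
};

void sparsemv(const SimpleSparseMatrix& A, const std::vector<double>& x, std::vector<double>& y) {
    for (int i = 0; i < A.local_nrow; ++i) {
        double sum = 0.0;
        for (int j = 0; j < A.nnz_in_row[i]; ++j)
            sum += A.vals_in_row[i][j] * x[A.inds_in_row[i][j]];  // indirect gather on x
        y[i] = sum;
    }
}

int main() {
    // 2x2 identity as a toy example: y = A * x reproduces x.
    SimpleSparseMatrix A;
    A.local_nrow  = 2;
    A.nnz_in_row  = {1, 1};
    A.vals_in_row = {{1.0}, {1.0}};
    A.inds_in_row = {{0}, {1}};

    std::vector<double> x = {3.0, 4.0}, y(2, 0.0);
    sparsemv(A, x, y);
    std::printf("y = [%.1f, %.1f]\n", y[0], y[1]);  // y = [3.0, 4.0]
    return 0;
}
```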