A simple conjugate gradient benchmark code for a 3D chimney domain on an arbitrary number of processors.
Problem Size Discussion
From the application README
Suggested: data sizes should range from 25% up to 75% of total system memory.

With nx = ny = nz and n = nx * ny * nz, the total memory per MPI rank is 720 * n bytes for the 27-point stencil and 240 * n bytes for the 7-point stencil.
Additional details in application README
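As a worked example of this sizing rule, the short sketch below computes the per-rank footprint for a cubic subdomain. The function name is illustrative only; nx = 256 is taken from the run parameters listed under Parameters.

```c++
// Minimal sketch: per-rank memory estimate for HPCCG's cubic subdomain.
// The 720 and 240 bytes-per-row constants come from the application README;
// the function name and example size are illustrative only.
#include <cstdint>
#include <cstdio>

std::uint64_t bytes_per_rank(std::uint64_t nx, bool stencil27) {
    const std::uint64_t n = nx * nx * nx;        // n = nx * ny * nz with nx = ny = nz
    return (stencil27 ? 720u : 240u) * n;        // bytes per MPI rank
}

int main() {
    const std::uint64_t nx = 256;                // matches Run_Parameters = 256 256 256
    std::printf("27-pt stencil: %.2f GB per rank\n", bytes_per_rank(nx, true) / 1e9);
    std::printf(" 7-pt stencil: %.2f GB per rank\n", bytes_per_rank(nx, false) / 1e9);
    return 0;
}
```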
Analysis
On the Skylake machine used for this analysis, HPCCG sustains 70-80% of the DRAM bandwidth despite memory latency issues caused by an indirect memory access (the x[cur_inds[j]] gather in the sparse matrix-vector kernel shown under Hit Locations).
Parameters
Compiler = icc (ICC) 18.0.1 20171018
Build_Flags = -g -O3 -march=native -ftree-vectorize -qopenmp -DUSING_OMP
Run_Parameters = 256 256 256
Scaling
Performance Improvement
| Threads  | 2     | 4     | 8     | 16    | 32    | 56    | 112   |
|----------|-------|-------|-------|-------|-------|-------|-------|
| Speed Up | 1.95X | 1.92X | 1.95X | 1.40X | 1.34X | 1.01X | 0.95X |
Hit Locations
FLOPS
| Double Precision | Scalar    | 128B Packed | 256B Packed | 512B Packed | Total FLOPS | GFLOPS/sec |
|------------------|-----------|-------------|-------------|-------------|-------------|------------|
| PMU              | 3.520e+10 | 6.790e+10   | 2.550e+09   | 0.000e+00   | 1.812e+11   | 2.136e+01  |
| SDE              | 3.500e+10 | 6.748e+10   | 2.525e+09   | 0.000e+00   | 1.801e+11   | 2.123e+01  |
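The Total FLOPS column is the weighted sum of the instruction counts, with each 128B, 256B, and 512B packed double-precision instruction contributing 2, 4, and 8 FLOPs respectively. A minimal sketch reproducing the SDE row (variable names are illustrative):

```c++
// Minimal sketch: total double-precision FLOPs from instruction counts,
// weighting each packed width by the FLOPs it performs per instruction.
// Counts are the SDE row of the table above.
#include <cstdio>

int main() {
    const double scalar    = 3.500e+10;  // 1 FLOP per instruction
    const double packed128 = 6.748e+10;  // 2 DP FLOPs per instruction
    const double packed256 = 2.525e+09;  // 4 DP FLOPs per instruction
    const double packed512 = 0.0;        // 8 DP FLOPs per instruction

    const double total = scalar + 2 * packed128 + 4 * packed256 + 8 * packed512;
    std::printf("Total FLOPS: %.3e\n", total);   // ~1.801e+11, matching the SDE row
    return 0;
}
```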
Intel Software Development Emulator
| Intel SDE            | HPCCG |
|----------------------|-------|
| Arithmetic Intensity | 0.103 |
| FLOPS per Inst       | 0.442 |
| FLOPS per FP Inst    | 1.71  |
| Bytes per Load Inst  | 7.96  |
| Bytes per Store Inst | 7.52  |
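As a rough guide to how these derived metrics relate to the raw counts, the sketch below recomputes FLOPS per FP Inst from the SDE instruction counts above; the total-instruction and total-byte inputs needed for the remaining ratios are placeholders, since those raw counts are not reproduced in this report.

```c++
// Minimal sketch of how the derived SDE metrics relate to raw counts.
// The FP-instruction counts come from the SDE row of the FLOPS table above;
// total_insts and total_bytes are placeholders for SDE counts not shown here.
#include <cstdio>

int main() {
    const double scalar = 3.500e+10, packed128 = 6.748e+10, packed256 = 2.525e+09;
    const double total_flops = scalar + 2 * packed128 + 4 * packed256;  // ~1.801e+11
    const double fp_insts    = scalar + packed128 + packed256;          // scalar + packed DP

    std::printf("FLOPS per FP Inst: %.2f\n", total_flops / fp_insts);   // ~1.71, as tabulated

    const double total_insts = 0.0;  // placeholder: total retired instructions
    const double total_bytes = 0.0;  // placeholder: bytes loaded + stored
    if (total_insts > 0) std::printf("FLOPS per Inst: %.3f\n", total_flops / total_insts);
    if (total_bytes > 0) std::printf("Arithmetic Intensity: %.3f\n", total_flops / total_bytes);
    return 0;
}
```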
Roofline – Intel(R) Xeon(R) Platinum 8180M CPU, 112 Threads, 56 Cores, 3200.0 MHz
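For context, the roofline bound at this arithmetic intensity can be estimated as min(peak GFLOP/s, AI × peak DRAM bandwidth). The peak values in the sketch below are rough assumptions for a two-socket Xeon Platinum 8180M node, not measurements from this report.

```c++
// Minimal sketch of the roofline bound: attainable = min(peak_flops, AI * peak_bw).
// The peak compute and DRAM bandwidth values are rough assumptions for a
// two-socket Xeon Platinum 8180M node, not values taken from this report.
#include <algorithm>
#include <cstdio>

int main() {
    const double ai          = 0.103;   // arithmetic intensity from the SDE table (FLOPs/byte)
    const double peak_gflops = 2000.0;  // assumed double-precision peak, GFLOP/s
    const double peak_bw_gbs = 230.0;   // assumed peak DRAM bandwidth, GB/s

    const double attainable = std::min(peak_gflops, ai * peak_bw_gbs);
    std::printf("Roofline bound at AI=%.3f: %.1f GFLOP/s\n", ai, attainable);  // ~23.7 GFLOP/s
    std::printf("Measured (PMU): 21.4 GFLOP/s\n");
    return 0;
}
```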
UOPS Executed
Experiment Aggregate Metrics
| Threads (Time) | IPC per Core | Loads per Cycle | L1 Hits per Cycle | L1 Miss Ratio | L2 Miss Ratio | L3 Miss Ratio | L2 B/W Utilized | L3 B/W Utilized | DRAM B/W Utilized |
|----------------|--------------|-----------------|-------------------|---------------|---------------|---------------|-----------------|-----------------|-------------------|
| 1 (100.0%)     | 1.35         | 0.70            | 0.59              | 2.44%         | 43.36%        | 91.79%        | 11.18%          | 38.09%          | 42.03%            |
| 56 (100.0%)    | 0.42         | 0.20            | 0.17              | 2.04%         | 44.08%        | 91.52%        | 2.71%           | 24.03%          | 65.88%            |
| 112 (100.0%)   | 0.46         | 0.12            | 0.10              | 2.12%         | 43.94%        | 85.94%        | 2.73%           | 22.06%          | 59.98%            |
HPC_sparsemv.cpp
| Threads (Time) | IPC per Core | Loads per Cycle | L1 Hits per Cycle | L1 Miss Ratio | L2 Miss Ratio | L3 Miss Ratio | L2 B/W Utilized | L3 B/W Utilized | DRAM B/W Utilized |
|----------------|--------------|-----------------|-------------------|---------------|---------------|---------------|-----------------|-----------------|-------------------|
| 1 (83.1%)      | 1.39         | 0.77            | 0.68              | 1.86%         | 42.24%        | 72.59%        | 11.36%          | 38.16%          | 42.23%            |
| 56 (70.4%)     | 0.47         | 0.25            | 0.22              | 1.74%         | 42.51%        | 68.42%        | 3.27%           | 28.55%          | 78.07%            |
| 112 (65.5%)    | 0.48         | 0.13            | 0.12              | 2.19%         | 42.56%        | 75.15%        | 3.51%           | 28.05%          | 76.01%            |
```c++
66 int HPC_sparsemv( HPC_Sparse_Matrix *A,
67                   const double * const x, double * const y)
68 {
69
70   const int nrow = (const int) A->local_nrow;
71
72 #ifdef USING_OMP
73 #pragma omp parallel for
74 #endif
75   for (int i=0; i< nrow; i++)
76     {
77       double sum = 0.0;
78       const double * const cur_vals =
79         (const double * const) A->ptr_to_vals_in_row[i];
80
81       const int * const cur_inds =
82         (const int * const) A->ptr_to_inds_in_row[i];
83
84       const int cur_nnz = (const int) A->nnz_in_row[i];
85
```
| Threads (Time) | IPC per Core | Loads per Cycle | L1 Hits per Cycle | L1 Miss Ratio | L2 Miss Ratio | L3 Miss Ratio | L2 B/W Utilized | L3 B/W Utilized | DRAM B/W Utilized |
|----------------|--------------|-----------------|-------------------|---------------|---------------|---------------|-----------------|-----------------|-------------------|
| 1 (64.5%)      | 1.32         | 0.62            | 0.55              | 1.92%         | 43.27%        | 71.10%        | 10.02%          | 37.62%          | 39.92%            |
| 56 (63.7%)     | 0.41         | 0.20            | 0.17              | 1.78%         | 44.29%        | 63.23%        | 2.87%           | 28.25%          | 75.00%            |
| 112 (60.1%)    | 0.47         | 0.13            | 0.11              | 2.20%         | 43.91%        | 74.23%        | 3.28%           | 27.66%          | 73.94%            |
```c++
86       for (int j=0; j< cur_nnz; j++)
87         sum += cur_vals[j]*x[cur_inds[j]];
88       y[i] = sum;
89     }
90   return(0);
91 }
```
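For readers unfamiliar with the benchmark's data structure, the sketch below is a simplified, self-contained version of the row-pointer layout the kernel above walks; it illustrates the access pattern, including the x[inds[j]] gather responsible for the indirect memory accesses, and is not the actual HPC_Sparse_Matrix definition.

```c++
// Minimal, self-contained sketch of the row-pointer sparse layout that
// HPC_sparsemv walks above: per-row value and column-index arrays, with the
// x[inds[j]] gather that makes the accesses indirect and latency-bound.
// This is an illustration, not the full HPC_Sparse_Matrix from the benchmark.
#include <cstdio>
#include <vector>

struct SimpleSparseMatrix {
    int local_nrow = 0;
    std::vector<int> nnz_in_row;                    // nonzeros per row
    std::vector<std::vector<double>> vals_in_row;   // values, one array per row
    std::vector<std::vector<int>>    inds_in_row;   // column indices, one array per row
};

void sparsemv(const SimpleSparseMatrix& A, const std::vector<double>& x, std::vector<double>& y) {
    for (int i = 0; i < A.local_nrow; ++i) {
        double sum = 0.0;
        for (int j = 0; j < A.nnz_in_row[i]; ++j)
            sum += A.vals_in_row[i][j] * x[A.inds_in_row[i][j]];  // indirect gather on x
        y[i] = sum;
    }
}

int main() {
    // 2x2 identity as a toy example: y = A * x reproduces x.
    SimpleSparseMatrix A;
    A.local_nrow  = 2;
    A.nnz_in_row  = {1, 1};
    A.vals_in_row = {{1.0}, {1.0}};
    A.inds_in_row = {{0}, {1}};

    std::vector<double> x = {3.0, 4.0}, y(2, 0.0);
    sparsemv(A, x, y);
    std::printf("y = [%.1f, %.1f]\n", y[0], y[1]);  // y = [3.0, 4.0]
    return 0;
}
```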