XSBench

From docs/XSBench_Theory:

The XSBench proxy app models the most computationally intensive part of a typical MC transport algorithm – the calculation of macroscopic neutron cross sections – a kernel which accounts for around 85% of the total runtime of OpenMC. The essential computational conditions and tasks of fully featured MC neutron transport codes are retained in the mini-app, without the additional complexity of the full application.


Problem Size and Run Configuration

Version 18:

The work which XSBench proxies is the cross-section lookups, in version 18 this is best controlled by the -p <particles> parameter. Increasing this value will increase the amount of work for the proxy (number of particles X number of lookups per particle).

Problems with 10^9 particles are not unusual for a current run. Scaling the number of particles beyond 10^12 will not likely be meaningful.


Analysis


Build and Run Information – Version 16

Compiler = icc (ICC) 18.0.1 20171018
Build Flags = -g -O3 -march=native -qopenmp
Run Parameters = -s large -g 11303 -G unionized -l 100000000 -t [#ofThreads]

Version 16 of XSBench scaled its workload with number of lookups (-l) instead of number of particles X number of lookups.

Scaling


Performance Improvement

Threads
2
4
8
16
32
56
112
Speed Up 1.63X 2.00X 2.00X 1.97X 1.92X 1.37X 1.13X

Hit Locations


FLOPS

Double Precision
Scalar
128B Packed
256B Packed
512B Packed
Total FLOPS
GFLOPS/sec
PMU 4.720e+10 0.000e+00 2.790e+10 0.000e+00 1.588e+11 1.574e+01
SDE 4.584e+10 0.000e+00 2.222e+10 0.000e+00 1.347e+11 1.328e+01

Intel Software Development Emulator

Intel SDE
XSBench
Arithmetric Intensity 0.0515
FLOPS per Inst 0.286
FLOPS per FP Inst 1.98
Bytes per Load Inst 10.3
Bytes per Store Inst 11.8

Roofline – Intel(R) Xeon(R) Platinum 8180M CPU

112 Threads – 56 – Cores 3200.0 Mhz


UOPS Executed


Experiment Aggregate Metrics

Threads (Time)
IPC per Core
Loads per Cycle
L1 Hits per Cycle
L1 Miss Ratio
L2 Miss Ratio
L3 Miss Ratio
L2 B/W Utilized
L3 B/W Utilized
DRAM B/W Utilized
1 (100.0%) 0.54 0.43 0.34 7.10% 46.89% 68.53% 4.04% 13.22% 14.63%
56 (100.0%) 0.35 0.27 0.22 6.86% 46.25% 66.82% 2.33% 15.89% 55.09%
112 (100.0%) 0.40 0.16 0.13 6.91% 44.55% 67.22% 2.69% 17.60% 58.67%

main omp loop

153         // OpenMP compiler directives - declaring variables
            // as shared or private
154         #pragma omp parallel default(none) \
155         private(i, thread, p_energy, mat, seed) \
156         shared( in, energy_grid, nuclide_grids, \
157                 mats, concs, num_nucs, mype, vhash)
158         {
159                 // Initialize parallel PAPI counters
160                 #ifdef PAPI
                    {...}
167                 #endif
168
169                 double macro_xs_vector[5];
170                 double * xs = (double *) calloc(5, sizeof(double));
171
172                 // Initialize RNG seeds for threads
173                 thread = omp_get_thread_num();
174                 seed   = (thread+1)*19+17;
175
176                 // XS Lookup Loop
177                 #pragma omp for schedule(dynamic)
178                 for( i = 0; i < in.lookups; i++ )
179                 {
180                         // Status text
181                         if( INFO && mype == 0 && 
                                thread == 0 && i % 1000 == 0 )
182                                 printf("\rCalculating XS's... 
                                            (%.0lf%% completed)",
183               (i / ( (double)in.lookups / (double) in.nthreads ))
184                                   / (double) in.nthreads * 100.0);
185
186                         // Randomly pick an energy and material
                            // for the particle
187                         #ifdef VERIFICATION
                            {...}
196                         #endif
197
198                         // debugging
199                         //printf("E = %lf mat = %d\n", p_energy, mat);
200
201                         // This returns the macro_xs_vector, but we're not going
202                         // to do anything with it in this program, so return value
203                         // is written over.
204                         calculate_macro_xs( p_energy, mat, 
                                                in.n_isotopes,
205                                             in.n_gridpoints, 
                                                num_nucs, concs,
206                                             energy_grid,
                                                nuclide_grids, mats,
207                                             macro_xs_vector,
                                                in.grid_type,
                                                in.hash_bins );
208
209                         // Copy results from above function call onto heap
210                         // so that compiler cannot optimize function out
211                         // (only occurs if -flto flag is used)
212                         memcpy(xs, macro_xs_vector, 5*sizeof(double));
213
214                         // Verification hash calculation
215                         // This method provides a consistent hash accross
216                         // architectures and compilers.
217                         #ifdef VERIFICATION
                            {...}
229                         #endif
230                 }
231
232                 // Prints out thread local PAPI counters
233                 #ifdef PAPI
                    {...}
246                 #endif
248         }

calculate_macro_xs

Threads (Time)
IPC per Core
Loads per Cycle
L1 Hits per Cycle
L1 Miss Ratio
L2 Miss Ratio
L3 Miss Ratio
L2 B/W Utilized
L3 B/W Utilized
DRAM B/W Utilized
1 (97.4%) 0.50 0.41 0.33 7.49% 46.83% 68.69% 4.04% 13.31% 14.86%
56 (83.4%) 0.36 0.28 0.23 7.31% 46.53% 67.07% 2.69% 18.55% 64.65%
112 (72.6%) 0.44 0.19 0.15 7.36% 44.90% 67.44% 3.30% 21.73% 72.17%
101 void calculate_macro_xs( double p_energy, int mat, long n_isotopes,
102                          long n_gridpoints, int * restrict num_nucs,
103                          double ** restrict concs,
104                          GridPoint * restrict energy_grid,
105                          NuclideGridPoint ** restrict nuclide_grids,
106                          int ** restrict mats,
107                          double * restrict macro_xs_vector,
                             int grid_type, int hash_bins ){
108         double xs_vector[5];
109         int p_nuc; // the nuclide we are looking up
110         long idx = -1;
111         double conc; // the concentration of the nuclide
                         // in the material
112
113         // cleans out macro_xs_vector
114         for( int k = 0; k < 5; k++ )
115                 macro_xs_vector[k] = 0;
116
117         // If we are using the unionized energy grid (UEG), we only
118         // need to perform 1 binary search per macroscopic lookup.
119         // If we are using the nuclide grid search, it will have to be
120         // done inside of the "calculate_micro_xs" function for each
            // different
121         // nuclide in the material.
122         if( grid_type == UNIONIZED )
123                 idx = grid_search( n_isotopes * n_gridpoints, p_energy,
124                                energy_grid);
125         else if( grid_type == HASH )
126         {
               {...}
129         }
130
131         // Once we find the pointer array on the UEG, we can pull the data
132         // from the respective nuclide grids, as well as the nuclide
133         // concentration data for the material
134         // Each nuclide from the material needs to have 
            // its micro-XS array
135         // looked up & interpolatied (via calculate_micro_xs). Then, the
136         // micro XS is multiplied by the concentration of that nuclide
137         // in the material, and added to the total macro XS array.
138         for( int j = 0; j < num_nucs[mat]; j++ )
139         {
140                 p_nuc = mats[mat][j];
141                 conc = concs[mat][j];
142                 calculate_micro_xs( p_energy, p_nuc, n_isotopes,
143                                     n_gridpoints, energy_grid,
144                                     nuclide_grids, idx, xs_vector,
                                        grid_type, hash_bins );
145                 for( int k = 0; k < 5; k++ )
146                         macro_xs_vector[k] += xs_vector[k] * conc;
147         }
        {...}
155 }

[I] grid_search

Threads (Time)
IPC per Core
Loads per Cycle
L1 Hits per Cycle
L1 Miss Ratio
L2 Miss Ratio
L3 Miss Ratio
L2 B/W Utilized
L3 B/W Utilized
DRAM B/W Utilized
1 (16.3%) 0.29 0.06 0.04 38.37% 38.04% 37.93% 4.48% 12.55% 14.54%
56 (12.8%) 0.23 0.05 0.03 38.00% 38.60% 35.61% 3.22% 19.44% 68.60%
112 (7.9%) 0.41 0.04 0.02 43.16% 39.89% 35.28% 4.74% 28.81% 96.22%
158 // (fixed) binary search for energy on unionized energy grid
159 // returns lower index
160 long grid_search( long n, double quarry, GridPoint * A)
161 {
162         long lowerLimit = 0;
163         long upperLimit = n-1;
164         long examinationPoint;
165         long length = upperLimit - lowerLimit;
166
167         while( length > 1 )
168         {
169                 examinationPoint = lowerLimit + ( length / 2 );
170
171                 if( A[examinationPoint].energy > quarry )
172                         upperLimit = examinationPoint;
173                 else
174                         lowerLimit = examinationPoint;
175
176                 length = upperLimit - lowerLimit;
177         }
178
179         return lowerLimit;
180 }

calculate_micro_xs

Threads (Time)
IPC per Core
Loads per Cycle
L1 Hits per Cycle
L1 Miss Ratio
L2 Miss Ratio
L3 Miss Ratio
L2 B/W Utilized
L3 B/W Utilized
DRAM B/W Utilized
1 (72.2%) 0.49 0.42 0.35 6.11% 49.46% 72.74% 3.43% 11.42% 14.36%
56 (65.5%) 0.34 0.28 0.24 5.83% 48.97% 72.66% 2.37% 16.55% 61.84%
112 (61.0%) 0.42 0.18 0.15 6.39% 45.93% 72.88% 2.95% 19.66% 68.11%
  3 // Calculates the microscopic cross section for a given nuclide & energy
  4 void calculate_micro_xs(   double p_energy, int nuc, long n_isotopes,
  5                            long n_gridpoints,
  6                            GridPoint * restrict energy_grid,
  7                            NuclideGridPoint ** restrict nuclide_grids,
  8                            long idx, double * restrict xs_vector, int grid_type,
                               int hash_bins ){
  9         // Variables
 10         double f;
 11         NuclideGridPoint * low, * high;
 12
 13         // If using only the nuclide grid, we must perform a binary search
 14         // to find the energy location in this particular nuclide's grid.
 15         if( grid_type == NUCLIDE )
 16         {
               {...}
 26         }
 27         else if( grid_type == UNIONIZED) // Unionized Energy Grid - we already know
                                             // the index, no binary search needed.
 28         {
 29                 // pull ptr from energy grid and check to ensure that
 30                 // we're not reading off the end of the nuclide's grid
 31                 if( energy_grid[idx].xs_ptrs[nuc] == n_gridpoints - 1 )
 32                         low = &nuclide_grids[nuc][energy_grid[idx].xs_ptrs[nuc] - 1];
 33                 else
 34                         low = &nuclide_grids[nuc][energy_grid[idx].xs_ptrs[nuc]];
 35         }
 36         else // Hash grid
 37         {
               {...}
 65         }
 66
 67         high = low + 1;
 68
 69         // calculate the re-useable interpolation factor
 70         f = (high->energy - p_energy) / (high->energy - low->energy);
 71
 72         // Total XS
 73         xs_vector[0] = high->total_xs - f * (high->total_xs - low->total_xs);
 74
 75         // Elastic XS
 76         xs_vector[1] = high->elastic_xs - f * (high->elastic_xs
                                                   - low->elastic_xs);
 77
 78         // Absorbtion XS
 79         xs_vector[2] = high->absorbtion_xs - f * (high->absorbtion_xs
                                                      - low->absorbtion_xs);
 80
 81         // Fission XS
 82         xs_vector[3] = high->fission_xs - f * (high->fission_xs
                                                   - low->fission_xs);
 83
 84         // Nu Fission XS
 85         xs_vector[4] = high->nu_fission_xs - f * (high->nu_fission_xs
                                                      - low->nu_fission_xs);
 86
            {...}
 98 }