SNAP

From README.md:

SNAP serves as a proxy application to model the performance of a modern discrete ordinates neutral particle transport application. SNAP may be considered an update to Sweep3D, intended for hybrid computing architectures. It is modeled off the Los Alamos National Laboratory code PARTISN. PARTISN solves the linear Boltzmann transport equation (TE), a governing equation for determining the number of neutral particles (e.g., neutrons and gamma rays) in a multi-dimensional phase space. SNAP itself is not a particle transport application; SNAP incorporates no actual physics in its available data, nor does it use numerical operators specifically designed for particle transport. Rather, SNAP mimics the computational workload, memory requirements, and communication patterns of PARTISN. The equation it solves has been composed to use the same number of operations, use the same data layout, and load elements of the arrays in approximately the same order. Although the equation SNAP solves looks similar to the TE, it has no real world relevance.


Problem Size and Run Configuration

./snap input_file output_file

SNAP uses a Fortran namelist for its input, defaults are provided if variable is not defined in the input file. A full list of input parameters is available in the SNAP user’s manual.


Analysis


Build and Run Information

Compiler = ifort (IFORT) 18.0.1 20171018
Build Flags = -g -O3 -qopenmp -ip -align array32byte -qno-opt-dynamic-align \
              -fno-fnalias -fp-model fast -fp-speculation fast -xcore-avx2
Run Parameters = input_file

input_file

! Input from namelist
&invar
  nthreads=72
  nnested=1
  npey=1
  npez=1
  ndimen=3
  nx=20
  lx=0.02
  ny=20
  ly=0.02
  nz=12
  lz=0.012
  ichunk=10
  nmom=4
  nang=80
  ng=72
  mat_opt=1
  src_opt=1
  timedep=1
  it_det=0
  tf=0.01
  nsteps=10
  iitm=5
  oitm=100
  epsi=1.E-4
  fluxp=0
  scatp=0
  fixup=0
  soloutp=1
  angcpy=2
/

Scaling


Intel Software Development Emulator

SDE Metrics
SNAP
Arithmetic Intensity 0.11
Bytes per Load Inst 25.57
Bytes per Store Inst 24.39

Roofline – Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz

72 Threads – 36 – Cores 2300.0 Mhz


Experiment Aggregate Metrics

Threads (Time)
IPC per Core
Loads per Cycle
L1 Hits per Cycle
L1 Miss Ratio
L2 Miss Ratio
L3 Miss Ratio
L2 B/W Utilized
L3 B/W Utilized
DRAM B/W Utilized
1 (100.0%) 1.98 0.85 1.03 4.03% 22.66% 36.49% 18.24% 13.75% 1.39%
36 (100.0%) 0.79 0.33 0.39 3.19% 20.79% 32.82% 10.54% 8.49% 4.92%
72 (100.0%) 0.96 0.17 0.20 8.31% 16.93% 21.85% 14.55% 7.57% 1.03%

SUBROUTINE dim3_sweep

Data for entire subroutine (only data members and outer loop structure show)

Threads (Time)
IPC per Core
Loads per Cycle
L1 Hits per Cycle
L1 Miss Ratio
L2 Miss Ratio
L3 Miss Ratio
L2 B/W Utilized
L3 B/W Utilized
DRAM B/W Utilized
1 (80.9%) 1.82 0.90 1.06 4.37% 21.55% 35.76% 20.79% 14.33% 1.38%
36 (65.9%) 0.56 0.29 0.33 4.88% 19.91% 37.08% 14.40% 10.70% 7.13%
72 (66.0%) 0.57 0.13 0.14 14.73% 16.13% 24.45% 20.08% 9.58% 1.53%
 33   SUBROUTINE dim3_sweep ( ich, id, d1, d2, d3, d4, jd, kd, oct, g, t,  &
 34     iop, reqs, szreq, psii, psij, psik, qtot, ec, vdelt, ptr_in,       &
 35     ptr_out, dinv, flux0, fluxm, jb_in, jb_out, kb_in, kb_out, wmu,    &
 36     weta, wxi, flkx, flky, flkz, t_xs, fmin, fmax )
 37
 38 !-----------------------------------------------------------------------
 39 !
 40 ! 3-D slab mesh sweeper.
 41 !
 42 !-----------------------------------------------------------------------
 43
 44     INTEGER(i_knd), INTENT(IN) :: ich, id, d1, d2, d3, d4, jd, kd, oct,&
 45       g, t, iop, szreq
 46
 47     INTEGER(i_knd), DIMENSION(szreq), INTENT(INOUT) :: reqs
 48
 49     REAL(r_knd), INTENT(IN) :: vdelt
 50
 51     REAL(r_knd), INTENT(INOUT) :: fmin, fmax
 52
 53     REAL(r_knd), DIMENSION(nang), INTENT(IN) :: wmu, weta, wxi
 54
 55     REAL(r_knd), DIMENSION(nang,cmom), INTENT(IN) :: ec
 56
 57     REAL(r_knd), DIMENSION(nang,ny,nz), INTENT(INOUT) :: psii
 58
 59     REAL(r_knd), DIMENSION(nang,ichunk,nz), INTENT(INOUT) :: psij,     &
 60       jb_in, jb_out
 61
 62     REAL(r_knd), DIMENSION(nang,ichunk,ny), INTENT(INOUT) :: psik,     &
 63       kb_in, kb_out
 64
 65     REAL(r_knd), DIMENSION(nx,ny,nz), INTENT(IN) :: t_xs
 66
 67     REAL(r_knd), DIMENSION(nx,ny,nz), INTENT(INOUT) :: flux0
 68
 69     REAL(r_knd), DIMENSION(nx+1,ny,nz), INTENT(INOUT) :: flkx
 70
 71     REAL(r_knd), DIMENSION(nx,ny+1,nz), INTENT(INOUT) :: flky
 72
 73     REAL(r_knd), DIMENSION(nx,ny,nz+1), INTENT(INOUT) :: flkz
 74
 75     REAL(r_knd), DIMENSION(nang,ichunk,ny,nz), INTENT(IN) :: dinv
 76
 77     REAL(r_knd), DIMENSION(cmom-1,nx,ny,nz), INTENT(INOUT) :: fluxm
 78
 79     REAL(r_knd), DIMENSION(cmom,ichunk,ny,nz), INTENT(IN) :: qtot
 80
 81     REAL(r_knd), DIMENSION(d1,d2,d3,d4), INTENT(IN) :: ptr_in
 82
 83     REAL(r_knd), DIMENSION(d1,d2,d3,d4), INTENT(OUT) :: ptr_out
 84 !_______________________________________________________________________
 85 !
 86 !   Local variables
 87 !_______________________________________________________________________
 88
 89     INTEGER(i_knd) :: ist, iclo, ichi, jst, jlo, jhi, kst, klo, khi, k,&
 90       j, ic, i, l, ibl, ibr, ibb, ibt, ibf, ibk
 91
 92     LOGICAL(l_knd) :: receive
 93
 94     REAL(r_knd) :: sum_hv
 95
 96     REAL(r_knd), DIMENSION(nang) :: psi, pc, den
 97
 98     REAL(r_knd), DIMENSION(nang,4) :: hv, fxhv
        {...}
135 !_______________________________________________________________________
136 !
137 !   Loop over the cells using bounds/stride above
138 !_______________________________________________________________________
139
140     k_loop:  DO k = klo, khi, kst
141     j_loop:  DO j = jlo, jhi, jst
142     ic_loop: DO ic = iclo, ichi, ist

3 Most costly loops shown below are within these loops


loop at dim3_sweep.f90: 168

Threads (Time)
IPC per Core
Loads per Cycle
L1 Hits per Cycle
L1 Miss Ratio
L2 Miss Ratio
L3 Miss Ratio
L2 B/W Utilized
L3 B/W Utilized
DRAM B/W Utilized
1 (15.5%) 2.41 1.28 1.71 0.36% 15.81% 14.23% 2.94% 2.92% 0.03%
36 (4.9%) 1.81 0.97 1.30 0.62% 15.53% 37.66% 6.78% 6.40% 1.28%
72 (4.6%) 1.22 0.48 0.61 3.37% 15.00% 26.87% 17.50% 7.68% 0.37%
168       DO l = 2, cmom
169         psi = psi + ec(:,l)*qtot(l,ic,j,k)
170       END DO

loop at dim3_sweep.f90: 257

Threads (Time)
IPC per Core
Loads per Cycle
L1 Hits per Cycle
L1 Miss Ratio
L2 Miss Ratio
L3 Miss Ratio
L2 B/W Utilized
L3 B/W Utilized
DRAM B/W Utilized
1 (16.9%) 0.79 0.49 0.46 19.93% 23.50% 38.56% 50.75% 30.93% 5.56%
36 (30.7%) 0.12 0.07 0.07 21.27% 20.34% 36.12% 15.60% 11.76% 11.38%
72 (30.2%) 0.35 0.03 0.02 57.21% 16.96% 21.73% 18.89% 10.38% 2.25%
251 !_______________________________________________________________________
252 !
253 !     Compute initial solution
254 !_______________________________________________________________________
255
256       IF ( vdelt /= zero ) THEN
257         pc = ( psi + psii(:,j,k)*mu*hi + psij(:,ic,k)*eta*hj +         &
258           psik(:,ic,j)*xi*hk + ptr_in(:,i,j,k)*vdelt ) * dinv(:,ic,j,k)
259       ELSE
260         pc = ( psi + psii(:,j,k)*mu*hi + psij(:,ic,k)*eta*hj +         &
261           psik(:,ic,j)*xi*hk ) * dinv(:,ic,j,k)
262       END IF

loop at dim3_sweep.f90: 397

Threads (Time)
IPC per Core
Loads per Cycle
L1 Hits per Cycle
L1 Miss Ratio
L2 Miss Ratio
L3 Miss Ratio
L2 B/W Utilized
L3 B/W Utilized
DRAM B/W Utilized
1 (19.9%) 1.87 0.94 0.94 3.36% 11.88% 36.01% 12.28% 7.70% 0.40%
36 (10.4%) 0.86 0.45 0.45 3.74% 13.00% 58.47% 13.52% 7.07% 4.46%
72 (7.8%) 0.90 0.27 0.25 12.64% 11.57% 47.94% 28.93% 7.82% 1.59%
397         DO l = 1, cmom-1
398           fluxm(l,i,j,k) = fluxm(l,i,j,k) + SUM( ec(:,l+1)*psi )
399         END DO

411 !_______________________________________________________________________
412 !
413 !   Finish the loops
414 !_______________________________________________________________________
415
416     END DO ic_loop
417     END DO j_loop
418     END DO k_loop
419 !__________________________