ASPA

From aspa.pdf

The purpose of ASPA (Adaptive Sampling Proxy Application) is to enable the evaluation of a technique known as adaptive sampling on advanced computer architectures. Adaptive sampling is of interest in simulations involving multiple physical scales, wherein models of individual scales are combined using some form of scale bridging.

Adaptive sampling [Barton2008,Knap2008] attempts to significantly reduce the number of fine-scale evaluations by dynamically constructing a database of fine-scale evaluations and interpolation models. When the response of the fine-scale model is needed at a new point, the database is searched for interpolation models centered at ‘nearby’ points. Assuming that the interpolation models possess error estimators, they can be evaluated to determine if the fine-scale response at the current query point can be obtained to sufficient accuracy simply by interpolation from previously known states. If not, the fine-scale model must be evaluated and the new input/response pair added to the closest interpolation model.


Build and Problem Configuration

ASPA is run by executing:

./aspa point_data.txt value_data.txt

The parameter file aspa.inp is automatically read:

aspa.inp

maxKrigingModelSize             4
maxNumberSearchModels           4
theta                           1.2e3
meanErrorFactor                 1.0
tolerance                       1.0e-7
maxQueryPointModelDistance      1.0e3

Analysis


Build and Run Information

Compiler = icpc (ICC) 18.0.1 20171018
Build_Flags = -g -O3 -march=native -std=c++0x -llapack -lblas
Run_Parameters = point_data.txt value_data.txt

Run on 1 Thread on 1 Node


Intel Software Development Emulator

SDE Metrics
ASPA
Arithmetic Intensity 0.06
Bytes per Load Inst 7.89
Bytes per Store Inst 8.28

Roofline – Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz


ASPA makes use of a database (M-tree database), however the sparse data interpolation (known as kringing) takes the vast of the application cycles. The I/O involved is not significant.

The code that is performing the bulk of the work (81.1%) is in the BLAS library within the following kernels:

BLAS Function
% Cycles
dgemv 59.4%
dgemm 15.3%
dtrsm 5.2%
ddot 1.1%

The calls to the BLAS kernels primarily originate from the following loop in the MultivariateDerivativeKringingModel::getMeanSquaredError:

1462       for (int i = 0; i < valueDimension; ++i) {
1463 
1464         //
1465         // initialize with the self correlation term
1466         //
1467 
1468         errorVector[i] = sigma[i][i];
1469 
1470         //
1471         // add u.(XVX)^-1.u^T contribution
1472         //
1473 
1474         getRow(uRow,
1475                u,
1476                i);
1477 
1478         errorVector[i] += dot(uRow,
1479                               mult(_matrixInverseXVX, uRow));
1480 
1481         //
1482         // add r^T V^-1 r contribution
1483         //
1484 
1485         getColumn(rColumn,
1486                   r,
1487                   i);
1488 
1489         errorVector[i] -= dot(rColumn,
1490                               mult(_matrixInverseV, rColumn));
1491 
1492         //
1493         // apply self-correlation
1494         //
1495 
1496         errorVector[i] *= _sigmaSqr[valueId];
1497 
1498       }

Experiment Aggregate Metrics

IPC per Core
Loads per Cycle
L1 Hits per Cycle
L1 Miss Ratio
L2 Miss Ratio
L3 Miss Ratio
L2 B/W Utilized
L3 B/W Utilized
2.63 0.91 1.06 2.77% 4.61% 6.55% 9.60% 1.14%

DGEMV (Level 2)

IPC per Core
Loads per Cycle
L1 Hits per Cycle
L1 Miss Ratio
L2 Miss Ratio
L3 Miss Ratio
L2 B/W Utilized
L3 B/W Utilized
2.57 1.03 1.02 3.36% 2.02% 8.45% 11.39% 0.48%

DGEMM (Level 3)

IPC per Core
Loads per Cycle
L1 Hits per Cycle
L1 Miss Ratio
L2 Miss Ratio
L3 Miss Ratio
L2 B/W Utilized
L3 B/W Utilized
3.90 1.05 1.54 0.43% 6.71% 12.41% 2.46% 0.74%