In this work, fundamental performance, power, and energy characteristics...
Sparse linear iterative solvers are essential for many large-scale simul...
Comprehending the performance bottlenecks at the core of the intricate h...
Molecular dynamics (MD) simulations provide considerable benefits for th...
We address the communication overhead of distributed sparse matrix-(mult...
This paper studies the utility of using data analytics and machine learn...
The performance of highly parallel applications on distributed-memory sy...
The multiplication of a sparse matrix with a dense vector (SpMV) is a ke...
Automatic code generation is frequently used to create implementations o...
Automatic code generation is frequently used to create implementations o...
Most distributed-memory bulk-synchronous parallel programs in HPC assume...
The A64FX CPU is arguably the most powerful Arm-based processor design t...
Complex applications running on multicore processors show a rich perform...
The A64FX CPU powers the current number one supercomputer on the Top500 ...
Hardware platforms in high performance computing are constantly getting ...
Analytic, first-principles performance modeling of distributed-memory pa...
Useful models of loop kernel runtimes on out-of-order architectures requ...
The symmetric sparse matrix-vector multiplication (SymmSpMV) is an impor...
We describe a universal modeling approach for predicting single- and mul...
Stencil algorithms have been receiving considerable interest in HPC rese...
Analytic, first-principles performance modeling of distributed-memory ap...
Analytic, first-principles performance modeling of distributed-memory ap...
General matrix-matrix multiplications with double-precision real and com...
General matrix-matrix multiplications (GEMM) in vendor-supplied BLAS lib...
Big science initiatives are trying to reconstruct and model the brain by...
An accurate prediction of scheduling and execution of instruction stream...
Chebyshev filter diagonalization is well established in quantum chemistr...
This paper presents refinements to the execution-cache-memory performanc...
Hardware performance monitoring (HPM) is a crucial ingredient of perform...
We introduce PVSC-DTM (Parallel Vectorized Stencil Code for Dirac and To...
We introduce PVSC-DTM, a highly parallel and SIMD-vectorized library and...
This paper presents a survey of architectural features among four genera...
Achieving optimal program performance requires deep insight into the int...
We examine the Xeon Phi, which is based on Intel's Many Integrated Cores...
Sparse matrix-vector multiplication (spMVM) is the most time-consuming k...
Sparse matrix-vector multiplication (spMVM) is the dominant operation in...