MPI is the de facto standard for parallel computation on a cluster of co...
Model checking has become a key tool for gaining confidence in correctne...
Fuzzing is one of the most popular and widely used techniques to find vu...
MANA-2.0 is a scalable, future-proof design for transparent checkpointin...
Checkpoint/restart (C/R) provides fault-tolerant computing capability, e...
The share of the top 500 supercomputers with NVIDIA GPUs is now over 25 ...
This work strives to make formal verification of POSIX multithreaded pro...
Bolted is a new architecture for a bare-metal cloud with the goal of pro...
Bolted is a new architecture for bare-metal clouds that enables tenants ...
Transparently checkpointing MPI for fault tolerance and load balancing i...
Unified Virtual Memory (UVM) was introduced on recent NVIDIA GP...
Existing bare-metal cloud services that provide users with physical node...
Checkpoint-restart is now a mature technology. It allows a user to save ...
Fault tolerance for the upcoming exascale generation has long been an ar...
Providing fault tolerance for long-running GPU-intensive jobs requires a...
InfiniBand is widely used for low-latency, high-throughput cluster compu...
It is common today to deploy complex software inside a virtual machine (...
A new style of temporal debugging is proposed. The new URDB debugger can...