The Study of Transient Faults Propagation in Multithread Applications
Because contemporary Error Correcting Code (ECC) designs occupy a significant fraction of total die area in chip-multiprocessors (CMPs), alternative approaches are sought to address the growing vulnerability of CMP architectures to Single Event Upsets (SEUs) and Multi-Bit Upsets (MBUs). In this paper, we assess the reliability of multithreaded applications running on CMPs and propose an adaptive, application-aware architecture design that tolerates the impact of both SEUs and MBUs across the entire CMP architecture. This work leverages the intrinsic soft-error immunity of Spin-Transfer Torque RAM (STT-RAM) as an alternative to SRAM-based storage and operation components. We target a specific portion of the working set for reallocation to improve the reliability of the CMP architecture design. Instructions in a multithreaded program that are referenced at a high rate while modifying memory the least are ideal candidates to be stored and executed in STT-RAM-based components. We explain why STT-RAM cannot be used globally for all storage and operation components, and we describe the resiliency obtained compared to the baseline setup. In addition, a detailed study of the impact of SEUs and MBUs on multithreaded programs is presented in the Appendix.
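The selection criterion stated in the abstract (high reference rate, lowest memory modification) can be illustrated with a minimal sketch. It is not the paper's method: the `RegionProfile` record, the `select_for_sttram` function, the scoring ratio, the greedy capacity fill, and all profile numbers are hypothetical choices made only to show the idea of ranking candidates for a limited STT-RAM budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionProfile:
    """Hypothetical per-region profile; field names are assumptions."""
    name: str         # label for a block of instructions
    references: int   # how often the block was referenced
    writes: int       # how many memory modifications it performed
    size_bytes: int   # storage footprint of the block


def select_for_sttram(profiles, capacity_bytes):
    """Greedily pick regions with many references and few writes.

    The score references / (1 + writes) is an illustrative assumption,
    not a metric taken from the paper.
    """
    ranked = sorted(
        profiles,
        key=lambda p: p.references / (1 + p.writes),
        reverse=True,
    )
    chosen, used = [], 0
    for p in ranked:
        # Skip any region that would exceed the assumed STT-RAM budget.
        if used + p.size_bytes <= capacity_bytes:
            chosen.append(p.name)
            used += p.size_bytes
    return chosen


# Made-up profile data, for illustration only.
profiles = [
    RegionProfile("hot_loop", references=90_000, writes=10, size_bytes=512),
    RegionProfile("init", references=1, writes=4_000, size_bytes=2048),
    RegionProfile("lookup", references=40_000, writes=0, size_bytes=1024),
    RegionProfile("logger", references=5_000, writes=5_000, size_bytes=256),
]
selected = select_for_sttram(profiles, capacity_bytes=1536)
print(selected)
```

With these invented numbers, the read-heavy, write-light regions fill the budget first, while the write-heavy `logger` and `init` regions stay in the SRAM-based components.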