Demystifying Dependency Bugs in Deep Learning Stack
Recent breakthroughs in deep learning (DL) techniques have stimulated significant growth in the development of DL-enabled applications. These DL applications, built upon a heterogeneous and complex DL stack (e.g., Nvidia GPU, Linux, CUDA driver, Python runtime, and TensorFlow), are subject to software and hardware dependencies across the DL stack. The asynchronous and radical evolution of these dependencies, together with the complex version constraints among them, poses a persistent challenge to dependency management across the entire engineering lifecycle. Developers may introduce dependency bugs (DBs) when selecting, using, and maintaining dependencies. However, the characteristics of DBs in the DL stack are still under-investigated, which hinders practical solutions to dependency management in the DL stack. To fill this gap, this paper presents the first comprehensive study to characterize the symptoms, root causes, and fix patterns of DBs across the whole DL stack, based on 326 DBs collected from StackOverflow posts. For each DB, we first investigate the symptom as well as the lifecycle stage and dependency where the symptom is exposed. Then, we analyze the root cause as well as the lifecycle stage and dependency where the root cause is introduced. Finally, we explore the fix pattern as well as the knowledge sources that are used to fix the DB. Our findings shed light on implications for dependency management, e.g., constructing a dependency knowledge graph for the entire DL stack, recommending dependencies, detecting, localizing, and fixing dependency bugs, and upgrading and migrating dependencies.