Orloj: Predictably Serving Unpredictable DNNs

08/31/2022
by   Peifeng Yu, et al.

Existing DNN serving solutions can provide tight latency SLOs while maintaining high throughput via careful scheduling of incoming requests, whose execution times are assumed to be highly predictable and data-independent. However, the execution times of inference requests to emerging dynamic DNNs – e.g., popular natural language processing (NLP) models and computer vision (CV) models that skip layers – are data-dependent. These models perform poorly when served using existing solutions because request execution times vary widely with the input – the longest request in a batch inflates the execution times of the shorter ones, causing SLO misses in the absence of careful batching. In this paper, we present Orloj, a dynamic DNN serving system that captures this variance using empirical distributions of expected request execution times, and then efficiently batches and schedules requests without knowing any request's precise execution time. Orloj significantly outperforms state-of-the-art serving solutions for high-variance dynamic DNN workloads by 51–80 relaxed SLO settings. For well-studied static DNN workloads, Orloj performs comparably to the state of the art.
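
To make the batching effect concrete, here is a minimal sketch and not Orloj's actual algorithm. It assumes a batch takes as long as its slowest request and that requests are independent draws from an empirical sample of past execution times. The function names and the sample values are hypothetical.

```python
def expected_batch_time(samples, batch_size):
    """Expected max of `batch_size` i.i.d. draws from the empirical distribution.

    Assumption: a batch finishes when its slowest request finishes.
    """
    xs = sorted(samples)
    n = len(xs)
    total = 0.0
    for i, x in enumerate(xs, start=1):
        # P(max == x_i) = (i/n)^k - ((i-1)/n)^k under the empirical CDF
        total += x * ((i / n) ** batch_size - ((i - 1) / n) ** batch_size)
    return total


def largest_batch_within(samples, budget, max_batch):
    """Largest batch size whose expected time fits the budget (0 if none fits)."""
    best = 0
    for k in range(1, max_batch + 1):
        if expected_batch_time(samples, k) <= budget:
            best = k
    return best


# Hypothetical past execution times (ms) with one long-running outlier.
history_ms = [10, 20, 30, 100]
print(expected_batch_time(history_ms, 1))       # mean of the samples
print(expected_batch_time(history_ms, 2))       # grows as the batch widens
print(largest_batch_within(history_ms, 60, 8))
```

With a single outlier in the history, the expected batch time rises quickly with batch size, which is why a scheduler that only knows the distribution still has to be careful about how many requests it groups together.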
