Intelligent Anomaly Detection and Mitigation in Data Centers
Data centers play a key role in today's Internet. Cloud applications are mainly hosted on multi-tenant warehouse-scale data centers. Anomalies pose a serious threat to data centers' operations. If not controlled properly, a simple anomaly can spread throughout the data center, resulting in a cascading failure. Amazon AWS had been affected by such incidents recently. Although some solutions are proposed to detect anomalies and prevent cascading failures, they mainly rely on application-specific metrics and case-based diagnosis to detect the anomalies. Given the variety of applications on a multi-tenant data center, proposed solutions are not capable of detecting anomalies in a timely manner. In this paper we design an application-agnostic anomaly detection scheme. More specifically, our design uses a highly distributed data mining scheme over network-level traffic metrics to detect anomalies. Once anomalies are detected, simple actions are taken to mitigate the damage. This ensures that errors are confined and prevents cascading failures before administrators intervene.
READ FULL TEXT