Learning-based anomaly detection has proven to be an effective black-box technique for detecting unknown attacks. However, its effectiveness depends crucially on both the quality and the completeness of the training data. Unfortunately, in most cases, the traffic to the system protected by an anomaly detector (e.g., a web application or daemon process) is not uniformly distributed. As a result, some components (e.g., authentication, payments, or content publishing) might not be exercised often enough to train an anomaly detection system within a reasonable time frame. This problem is particularly important in real-world settings, where anomaly detection systems are deployed with little or no manual configuration and are expected to learn the normal behavior of a system automatically in order to detect or block attacks.
In this work, we first demonstrate that the features used to train a learning-based detector can be semantically grouped, and that features of the same group tend to induce similar models. We therefore propose to address local training data deficiencies by using clustering techniques to construct a knowledge base of well-trained models that can be substituted for models that are undertrained. Our approach, which is independent of the particular type of anomaly detector employed, is validated on the realistic case of a learning-based system protecting a pool of web servers running several web applications, such as blogs, forums, or web services. We run our experiments on a real-world data set containing over 58 million HTTP requests to more than 36,000 distinct web application components. The results show that, with the proposed solution, effective attack detection is achievable even with scarce training data.
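The idea of falling back on a knowledge base of well-trained models can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not taken from the evaluated system: the `Model` record, the two-value `profile` summary, the `MIN_SAMPLES` and `RADIUS` thresholds, and the simple greedy (leader) clustering that stands in for whatever clustering technique is actually employed.

```python
from dataclasses import dataclass
from math import dist

MIN_SAMPLES = 100  # hypothetical threshold for calling a model well trained
RADIUS = 1.0       # hypothetical maximum distance to join an existing cluster


@dataclass
class Model:
    name: str
    n_samples: int   # number of training samples observed so far
    profile: tuple   # hypothetical summary, e.g. (mean length, digit ratio)


def centroid(profiles):
    """Component-wise mean of a list of equally sized profiles."""
    return tuple(sum(col) / len(col) for col in zip(*profiles))


def build_knowledge_base(models):
    """Greedily cluster well-trained profiles and return one centroid per cluster."""
    clusters = []
    for m in models:
        if m.n_samples < MIN_SAMPLES:
            continue  # undertrained models do not contribute to the knowledge base
        for members in clusters:
            if dist(centroid(members), m.profile) <= RADIUS:
                members.append(m.profile)
                break
        else:
            clusters.append([m.profile])
    return [centroid(members) for members in clusters]


def effective_profile(model, knowledge_base):
    """Use the model's own profile if well trained, else the nearest centroid."""
    if model.n_samples >= MIN_SAMPLES or not knowledge_base:
        return model.profile
    return min(knowledge_base, key=lambda c: dist(c, model.profile))


models = [
    Model("id", 500, (4.0, 1.0)),
    Model("uid", 400, (4.5, 1.0)),
    Model("comment", 900, (80.0, 0.1)),
    Model("order_id", 3, (4.2, 0.9)),  # too few samples to be trusted
]
kb = build_knowledge_base(models)
substitute = effective_profile(models[3], kb)
```

In this toy data, the rarely exercised `order_id` parameter borrows the centroid of the cluster formed by the two well-trained identifier-like parameters, which mirrors the observation that semantically similar features tend to induce similar models.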