As more and more Internet-based attacks arise, organizations are responding by deploying an assortment of security products that generate situational intelligence in the form of logs. These logs often contain high volumes of interesting and useful information about activities in the network, and are among the first data sources that information security specialists consult when they suspect that an attack has taken place. However, security products often come from a patchwork of vendors, and are inconsistently installed and administered. They generate logs whose formats differ widely and that are often incomplete, mutually contradictory, and very large in volume. Hence, although this collected information is useful, it is often dirty.
We present a novel system, Beehive, that attacks the problem of automatically mining and extracting knowledge from the dirty log data produced by a wide variety of security products in a large enterprise. We improve on signature-based approaches to detecting security incidents and instead identify suspicious host behaviors that Beehive reports as potential security incidents. These incidents can then be further analyzed by incident response teams to determine whether a policy violation or attack has occurred. We have evaluated Beehive on the log data collected in a large enterprise, EMC, over a period of two weeks. We compare the incidents identified by Beehive against enterprise Security Operations Center reports, antivirus software alerts, and feedback from enterprise security specialists. We show that Beehive is able to identify malicious events and policy violations which would otherwise go undetected.