From a logging perspective, in both classic and big data scenarios, anomalies are events that occur very rarely in a dataset. Malicious events share a similar trait: if most of the infrastructure is well secured, they are also infrequent compared to mainstream log events and activities.
“All Bourbon is Whiskey, but not all Whiskey is Bourbon” is perhaps the best analogy for the relationship between anomalies and malicious events: all malicious events are anomalies, but not all anomalies are malicious events.
Starting from this premise, the direct approach is to identify the anomalies in a dataset and then search for potential malicious activity within that subset. For log sources that generate up to several million events each hour, the anomaly subset can be as small as a few dozen events, which drastically reduces the investigation space.
Our proposed framework is intended to be generic enough to apply to most log types where there is a high (but not necessarily obvious) correlation between the observations.
The framework consists of two main modules:
1. Transforming the data from its raw log form (categorical values/strings/text) into a vectorized (numerical) form using a generic deep-learning approach;
2. Applying anomaly detection to the vectorized data to generate the list of extremely rare events that occur in the dataset.
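The two modules above can be illustrated with a minimal, standard-library-only sketch. It is not the Tripod model and not one of the anomaly detection algorithms from the talk: as a stand-in assumption, it hashes character trigrams into a fixed-length vector instead of learning a latent representation, and it scores rarity as the mean distance to the nearest neighbours. The names `vectorize`, `anomaly_scores`, `top_anomalies` and the parameters `dim`, `k`, and `n` are hypothetical and chosen only for illustration.

```python
import hashlib
import math


def vectorize(line, dim=256):
    """Module 1 (stand-in): turn a log line into a numeric vector.

    Character trigrams are hashed into `dim` buckets and the result is
    L2-normalized. A learned model would replace this step.
    """
    vec = [0.0] * dim
    padded = "  " + line.lower() + "  "
    for i in range(len(padded) - 2):
        gram = padded[i:i + 3]
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def anomaly_scores(vectors, k=3):
    """Module 2 (stand-in): mean Euclidean distance to the k nearest
    neighbours. A higher score means the event is more isolated."""
    scores = []
    for i, v in enumerate(vectors):
        dists = sorted(
            math.dist(v, w) for j, w in enumerate(vectors) if j != i
        )
        nearest = dists[:k]
        scores.append(sum(nearest) / len(nearest))
    return scores


def top_anomalies(lines, n=1, k=3):
    """Run both modules and return the n highest-scoring log lines."""
    vectors = [vectorize(line) for line in lines]
    scores = anomaly_scores(vectors, k=k)
    ranked = sorted(zip(scores, lines), key=lambda p: p[0], reverse=True)
    return [line for _, line in ranked[:n]]
```

Calling `top_anomalies(log_lines, n=10)` on a list of raw log strings returns the ten most isolated entries as candidates for manual review. The pairwise distance computation is quadratic in the number of events, so this sketch is only suitable for small samples; at millions of events per hour a scalable detector would be needed.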
We will discuss the anomaly detection algorithms we experimented with, as well as how we used Tripod, a recently released open-source machine learning tool from Adobe that we helped develop. We will also demonstrate how unsupervised learning can be used to compute latent representations for sequences (log entries). More about this project can be found at https://github.com/adobe/tripod.
The presentation will cover all the steps needed to train the Tripod model, generate the anomaly model, and extract anomalies for reporting purposes.