Machine learning (ML) can be a powerful tool for augmenting the detection efficacy of a cybersecurity solution. Using it effectively means first cutting through the hype and understanding the tangible steps needed to build models with it.
The vast majority of enterprise security solutions – from antivirus applications to firewalls to intrusion detection and prevention systems – use (or at least claim to use) ML to detect threats that traditional approaches can’t, in many cases because such threats unfold faster or on a much larger scale than a traditional security solution can process.
Unfortunately, ML’s widespread adoption has created excessive hype and confusion around what it can actually do. ML isn’t a one-size-fits-all type of technology; merely employing it within a threat detection engine doesn’t guarantee that your security posture will be vastly improved. In fact, if used improperly, ML results can be detrimental to that security posture, driving up both the noise a solution generates and the rate of false positives and false discoveries.
Because ML can be a double-edged sword under the hood of a security platform, this post covers the fundamentals of ML approaches that affect detection efficacy.
The term “machine learning” is something of a misnomer. The machine isn’t actually “learning” in the complex manner that humans do (at least, not yet). ML – as it exists today – is the practice of harnessing modern computational power to build mathematical and statistical models that explain patterns in data. This holds true for ML within the context of cybersecurity: we build models on security data to find patterns that indicate compromise of your infrastructure or environment.
In no way should ML be treated as magic. Many of its mathematical underpinnings are old: Bayesian statistics was developed centuries ago, and neural networks date back to the 1940s. Modern advances in computing, including graphics processing units (GPUs), have made it possible to process vastly larger data sets and build models that capture richer sets of behaviors than ever before. The mathematics behind these ML models has withstood the test of time, with more recent improvements and refinements coming out of academia.
ML can only be as good as the data you feed it
Although there are several ML algorithms designed to classify and predict threats, the effectiveness of those models depends heavily on the volume and quality of data that are fed into them.
The data set itself needs to have predictive power. For example, consider the task of building a basic model to predict whether a file operation exhibits malicious behavior or not. What are some relevant data points that may provide us with information on whether this behavior is indeed malicious? There are some basic indicators that we may wish to collect and examine:
whether or not the file is a valid executable
whether or not the file is located within a protected system folder
the permissions that apply to the file
whether or not the file references system libraries
whether or not the file has a valid digital signature
Each of these indicators is likely to have direct predictive power: on its own, it helps determine whether a particular file operation is malicious. Other features, like file size, may have more indirect predictive power. By itself, file size tells us nothing about whether a file modification was malicious, but viewed in the context of multiple other indicators, it helps us better understand the type of malicious behavior at play.
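To make the list of indicators concrete, here is a minimal sketch of how one file operation might be encoded as a numeric feature vector. The class name, field names, and the choice to reduce permissions to a single "world-writable" flag are all assumptions made for illustration; a real pipeline would likely collect many more features.

```python
import stat
from dataclasses import dataclass


@dataclass(frozen=True)
class FileOperationFeatures:
    """Hypothetical record of the indicators above for one file operation."""
    is_valid_executable: bool
    in_protected_system_folder: bool
    permission_mode: int            # POSIX-style mode bits, e.g. 0o755
    references_system_libraries: bool
    has_valid_signature: bool
    size_bytes: int                 # indirect indicator; weak on its own

    def to_vector(self) -> list:
        """Flatten the record into numbers a classifier can consume."""
        # Assumption: only the "writable by others" bit is kept from the mode.
        world_writable = bool(self.permission_mode & stat.S_IWOTH)
        return [
            float(self.is_valid_executable),
            float(self.in_protected_system_folder),
            float(world_writable),
            float(self.references_system_libraries),
            float(self.has_valid_signature),
            float(self.size_bytes),
        ]


# Illustrative values only, not taken from real telemetry.
sample = FileOperationFeatures(
    is_valid_executable=True,
    in_protected_system_folder=True,
    permission_mode=0o777,
    references_system_libraries=True,
    has_valid_signature=False,
    size_bytes=4_200_000,
)
print(sample.to_vector())
```

Keeping the direct indicators and the indirect one (size) in the same vector is what lets a model weigh them together rather than one at a time.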
Let’s look at ransomware, for example. Ransomware files tend to be significantly larger than standard binaries. If your enterprise builds or uses legitimate binaries similar in size to a typical ransomware variant, then scanning and filtering binaries according to file size alone won’t protect you against a threat like this. But file size coupled with other information, such as file entropy, can provide critical context that enables more confident ransomware detection.
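File entropy is commonly measured as the Shannon entropy of the file's bytes: encrypted or compressed content sits close to the maximum of 8 bits per byte, while plain text and most code sit noticeably lower. The sketch below shows one way to compute it with the Python standard library; the function name is an assumption for illustration.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # how often each byte value occurs
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# A run of identical bytes carries no information per byte ...
low = shannon_entropy(b"\x00" * 1024)
# ... while a buffer in which every byte value is equally frequent is maximal.
high = shannon_entropy(bytes(range(256)) * 4)
print(low, high)
```

On its own, a high entropy value is as ambiguous as a large file size, since legitimate archives and packed installers score high too. It becomes informative when read alongside the other indicators.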
ML classifiers such as random forests are well suited to operating in multidimensional spaces. They can find patterns like these and classify them with a lower false positive rate than you would get by applying independent rules to each indicator. When the number of indicators runs into the thousands and beyond, exploring these relationships and building classifiers around them becomes an issue of scale. In this case, ML can quickly find correlations among features that would be nearly impossible to find manually.
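The difference between independent rules and a joint decision can be seen on a toy example. The sketch below is not a random forest; it hand-writes a two-feature rule as a stand-in for the kind of combined boundary a classifier might learn, and compares it against per-indicator rules. The sample values and both thresholds are made up for illustration.

```python
# Toy, hand-made sample: (size in MB, byte entropy in bits, is_malicious).
SAMPLES = [
    (0.3, 5.1, False),
    (0.8, 6.0, False),
    (5.0, 5.5, False),   # large but ordinary legitimate binary
    (6.2, 6.1, False),   # large but ordinary legitimate binary
    (0.5, 7.6, False),   # small compressed file
    (1.2, 5.9, False),
    (0.4, 7.8, False),   # small compressed file
    (2.0, 6.3, False),
    (4.8, 7.7, True),
    (5.5, 7.9, True),
    (3.9, 7.5, True),
    (6.0, 7.8, True),
]

SIZE_MB_THRESHOLD = 3.0    # hypothetical cut-off
ENTROPY_THRESHOLD = 7.2    # hypothetical cut-off


def independent_rules(size_mb, entropy):
    """Alert if EITHER indicator crosses its own threshold."""
    return size_mb >= SIZE_MB_THRESHOLD or entropy >= ENTROPY_THRESHOLD


def joint_rule(size_mb, entropy):
    """Alert only if BOTH indicators are elevated together."""
    return size_mb >= SIZE_MB_THRESHOLD and entropy >= ENTROPY_THRESHOLD


def rates(rule):
    """Return (false positive rate, true positive rate) of a rule on SAMPLES."""
    fp = sum(1 for s, e, bad in SAMPLES if not bad and rule(s, e))
    tn = sum(1 for s, e, bad in SAMPLES if not bad and not rule(s, e))
    tp = sum(1 for s, e, bad in SAMPLES if bad and rule(s, e))
    fn = sum(1 for s, e, bad in SAMPLES if bad and not rule(s, e))
    return fp / (fp + tn), tp / (tp + fn)


independent_fpr, independent_tpr = rates(independent_rules)
joint_fpr, joint_tpr = rates(joint_rule)
print("independent rules: FPR", independent_fpr, "TPR", independent_tpr)
print("joint rule:        FPR", joint_fpr, "TPR", joint_tpr)
```

Real data is never this clean, and with thousands of indicators nobody can write the joint rule by hand. That is the gap a trained classifier is meant to fill.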
Beware of spurious correlations
There is a critical principle to consider when using ML to identify relationships between several indicators: correlation doesn’t necessarily imply causation. This is especially important when the number of indicators is very large and many of them have little predictive power. The result can be a model that appears highly accurate in training but is nothing more than an artifact of the data set used to train it, and is not representative of the overall relationship you’re trying to model (“selection bias”). Author and statistics enthusiast Tyler Vigen maintains a website with several such examples, in which two trends are logically independent but show high statistical correlation. These models will almost invariably fall apart when they are tested against new data.
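A small simulation shows how easily this happens. The sketch below generates features that are pure random noise, picks the one that correlates best with the labels on a small training sample, and then re-measures that same feature on fresh data. The feature count, sample sizes, and seed are arbitrary choices for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)
N_FEATURES = 1000   # many candidate indicators, none of them meaningful
N_TRAIN = 20        # small training sample
N_FRESH = 2000      # larger sample of new, unseen data

# Alternating 0/1 labels stand in for benign/malicious.
train_labels = [i % 2 for i in range(N_TRAIN)]
train_features = [
    [rng.gauss(0, 1) for _ in range(N_TRAIN)] for _ in range(N_FEATURES)
]

# Select the noise feature that happens to track the labels best.
train_scores = [abs(pearson(col, train_labels)) for col in train_features]
best = max(range(N_FEATURES), key=train_scores.__getitem__)
train_r = train_scores[best]

# New draws of that same feature, which is still noise, against new labels.
fresh_labels = [i % 2 for i in range(N_FRESH)]
fresh_values = [rng.gauss(0, 1) for _ in range(N_FRESH)]
fresh_r = abs(pearson(fresh_values, fresh_labels))

print(f"best of {N_FEATURES} noise features on training data: |r| = {train_r:.2f}")
print(f"same feature on fresh data:                      |r| = {fresh_r:.2f}")
```

With enough candidate features and few enough samples, some feature will look predictive by chance alone, which is why evaluating on data the model has not seen is a necessary check.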
Any ML model or system needs to be backed by solid data science work to vet not just the model, but also the data that it is trained on. ML models can be updated to use new data that can improve efficacy in real time. However, it’s important to note that with an ever-changing security landscape, the underlying data sources used to build models will invariably change as well, and certain features that are currently strong indicators of a security event may not continue to be so in the future.
It’s important to stay on top of developments in cybersecurity that may dictate how the data used to generate ML models needs to change, so that those models continue to deliver a high degree of accuracy with a minimal false discovery rate.
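One way to notice when a model is going stale is to track these rates per time window. The sketch below computes accuracy, false discovery rate (the share of alerts that were wrong), and false positive rate (the share of benign events that were flagged) from confusion-matrix counts. The function name, window labels, and counts are hypothetical.

```python
def detection_rates(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute accuracy, false discovery rate, and false positive rate.

    tp/fp/tn/fn are counts of true positives, false positives,
    true negatives, and false negatives in one evaluation window.
    """
    total = tp + fp + tn + fn
    alerts = tp + fp        # everything the model flagged
    negatives = fp + tn     # everything that was actually benign
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "false_discovery_rate": fp / alerts if alerts else 0.0,
        "false_positive_rate": fp / negatives if negatives else 0.0,
    }


# Hypothetical counts for two consecutive evaluation windows: (tp, fp, tn, fn).
WINDOWS = {
    "window_1": (90, 10, 890, 10),
    "window_2": (60, 40, 860, 40),
}

for name, counts in WINDOWS.items():
    print(name, detection_rates(*counts))
```

In this made-up example accuracy only slips from 0.98 to 0.92, while the false discovery rate climbs from 0.1 to 0.4. Because benign events vastly outnumber malicious ones, accuracy alone can hide a sharp rise in wasted alerts.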