The Pursuit of Normal: Alerting on Anomalies using Splunk

It seems strange to talk about normal right now considering that, at the time of writing, a lot of the world is under quarantine. Yet in security, normal is something that is important to know.

When creating alerts or analysing logs, you want to be notified when something is not normal. To do that, you first need to know what normal is.

Searching for anomalies relies on creating a baseline of previous events and alerting on outliers from this baseline.

This is true for all SIEMs and log analysis tools, but the one this article will concentrate on is Splunk.

One method to map your normal and look for anomalies in Splunk is to use a weighted moving average.

Weighted Moving Average

A Weighted Moving Average puts more weight on recent data and less on past data. This means that the search can be more sensitive to peaks and troughs in the data, and alerts can be created when the actual data deviates from the moving average by too much.
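
As a rough illustration of how the weighting works (assuming the linear weighting scheme commonly used for a weighted moving average, where the most recent bucket carries the most weight), a wma3 over hourly event counts of 10, 20 and 40 would be:

wma3 = (1 x 10 + 2 x 20 + 3 x 40) / (1 + 2 + 3) = 170 / 6 ≈ 28.3

A simple unweighted average of the same three hours is roughly 23.3, so the weighted value sits noticeably closer to the most recent hour.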

Here is an example of the Splunk visualisation for a WMA search through some auth logs:

This is the Splunk search that resulted in this chart:

index=<<index>> earliest=-7d AND <<search criteria>>
| timechart span=1h count(<<event to search by>>) AS events 
| trendline wma3(events) AS wma_events 
| eval anomaly=if(events > 2 * wma_events, "spike", "good")

The important parts of this search query are described below:

Span

| timechart span=1h count(<<event to search by>>) AS events

This is the timeframe that the events will be grouped into. I have usually found that 1 hour is the most effective span, particularly when the number of events varies between peak and off-peak times.
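
If 1 hour turns out to be too coarse (or too fine) for a particular data source, only the span value needs to change. As a sketch, a hypothetical 15-minute span for noisier, higher-volume data would look like this:

index=<<index>> earliest=-7d AND <<search criteria>>
| timechart span=15m count(<<event to search by>>) AS events
| trendline wma3(events) AS wma_events
| eval anomaly=if(events > 2 * wma_events, "spike", "good")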

WMA Events

wma3(events)

The number of data points used for the weighted moving average (the 3 in wma3) is important for getting an accurate calculation of previous events and a useful prediction of what the actual events should look like. I have found that the fewer data points the average looks at, the more closely your average line tracks the actual data.

Below is the same search as above, but using wma10(events) instead of wma3(events).

The larger number of data points used for the average has caused the average line to lag much further behind the actual line of events.
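
To see the effect side by side, both averages can be calculated in the same search and charted together (the index, criteria and event field are placeholders, as before):

index=<<index>> earliest=-7d AND <<search criteria>>
| timechart span=1h count(<<event to search by>>) AS events
| trendline wma3(events) AS wma3_events wma10(events) AS wma10_events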

Sensitivity

| eval anomaly=if(events > 2 * wma_events, "spike", "good")

The sensitivity is the acceptable difference between the calculated average and the actual data. In this case we have if(events > 2 * wma_events, so an hour is counted as a spike when the actual number of events in that hour is more than 2 x the average events.

You will need to judge the most effective sensitivity for the data you are using. The best way to do this is to look at the behaviour of the data over a period of 7 or more days to find what is normal. For data that ‘zig-zags’ up and down (like auth logs), a higher sensitivity (a larger multiplier) is required to avoid false positives while still capturing genuinely anomalous events.

For data that stays static, a lower sensitivity can be used. Take the example below, which shows new processes starting on a device. Because the events happen regularly, anything that strays off this line may be a suspicious event. In this case, the sensitivity is set to 1.3 x the average number of events: (events > 1.3 * wma_events, "spike", "good")
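
Put together, the full search for that kind of static data might look like the sketch below (placeholders as in the earlier examples):

index=<<index>> earliest=-7d AND <<search criteria>>
| timechart span=1h count(<<event to search by>>) AS events
| trendline wma3(events) AS wma_events
| eval anomaly=if(events > 1.3 * wma_events, "spike", "good")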

Spikes and Troughs

Although most of the time you are looking for spikes in data, sometimes you may want to look at troughs as an indicator that something is abnormal. This can be done as well:

index=<<index>> earliest=-7d AND <<search criteria>> 
| timechart span=1h count(<<event to search by>>) AS events 
| trendline wma3(events) AS wma_events 
| eval anomaly=if(events > 2 * wma_events, "spike", if(events < wma_events / 2, "trough","good"))

In this case, | eval anomaly=if(events > 2 * wma_events, "spike", if(events < wma_events / 2, "trough","good")) will set the value of anomaly to "spike" if the actual events are more than 2 x the average events, and to "trough" if the actual events are less than the average events ÷ 2. These values can then be used in alerting.

You can also have different sensitivities for spikes and troughs.
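
For example, a spike threshold of 2 x the average combined with a (hypothetical) tighter trough threshold of the average ÷ 1.5 would look like this:

| eval anomaly=if(events > 2 * wma_events, "spike", if(events < wma_events / 1.5, "trough", "good"))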

Search Templates for WMA Searching

Alert search looking for spikes:

index=<<index>> earliest=-7d AND <<search criteria>> 
| timechart span=1h count(<<events to look for>>) AS events 
| trendline wma<x>(events) AS wma_events 
| eval anomaly=if(events > <<sensitivity>> * wma_events, "spike", "good") 
| where anomaly="spike"

Alert query looking for spikes and troughs:

index=<<index>> earliest=-7d AND <<search criteria>> 
| timechart span=1h count(<<events to look for>>) AS events 
| trendline wma<x>(events) AS wma_events 
| eval anomaly=if(events > <<sensitivity>> * wma_events, "spike", if(events < wma_events / <<sensitivity>>, "trough","good")) 
| where anomaly="spike" OR anomaly="trough"
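
One thing to be aware of when turning these searches into alerts: because the search looks back 7 days, any spike in that window will match, not just the most recent one. One way to restrict the alert to the latest hourly bucket is to keep only the last timechart row before filtering, for example:

index=<<index>> earliest=-7d AND <<search criteria>>
| timechart span=1h count(<<events to look for>>) AS events
| trendline wma<x>(events) AS wma_events
| eval anomaly=if(events > <<sensitivity>> * wma_events, "spike", "good")
| tail 1
| where anomaly="spike"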