Detection and Response is all about data. Analysts collect many billions of logs every single day and store them, searching through the noise for some signal that might indicate malicious behavior. What has become obvious is that this collection of data is not slowing down at all - we’re instrumenting more services and systems, companies are expanding their own asset inventories, and new data sources are being pulled in from the cloud. Data growth, year after year, is rapid.

One thing that bothers me about the existing state of the art in modern SIEMs is that they punish you for having a lot of data. Between growing storage costs, licensing fees, and slower querying, you can expect your SIEM experience to degrade over time, not improve. This is simply unacceptable to me - fighting your data is demoralizing and wasteful, and it’s a problem that will only get worse as your org scales up.

In this post I’m going to cover a few areas where Grapl far exceeds the existing SIEM state of the art and aims to make this Sisyphean fight against data a thing of the past.

Storage

Perhaps the most painful constraint that SIEMs impose on customers is the cost of data storage. Data storage in a SIEM is effectively linear - every log is stored in full, and so if you send N logs up it takes O(N) bytes of space.

A significant part of why SIEM storage scales linearly is that SIEMs work with unstructured data - a SIEM generally cannot say things like “Oh, this field and that field are equivalent, so I can just store one copy”.

One of the greatest pains I hear when talking to others in my field is the burden of data storage. It is not uncommon for companies to spend millions, or even tens of millions, on data storage - both on physical capacity planning and on licensing fees.

This massive cost means that even a relatively small IR team can end up consuming a disproportionate amount of the security budget, just to collect the data that’s needed for other work.

Grapl aims to significantly improve upon this state. Grapl works with structured data (after an explicit parsing stage), and by leveraging a concept of ‘identity’, data storage grows closer to log(N) in most cases.

Let’s look at two Sysmon logs. Both relate to the same entity - the process with pid 1772 and GUID {331D737B-28FF-5C0B-0000-001081250F00}.

    <Event
        xmlns='http://schemas.microsoft.com/win/2004/08/events/event'>
        <System>
            <Provider Name='Microsoft-Windows-Sysmon' Guid='{5770385F-C22A-43E0-BF4C-06F5698FFBD9}'/>
            <EventID>2</EventID>
            <Version>4</Version>
            <Level>4</Level>
            <Task>2</Task>
            <Opcode>0</Opcode>
            <Keywords>0x8000000000000000</Keywords>
            <TimeCreated SystemTime='2018-12-08T20:37:53.775868800Z'/>
            <EventRecordID>10</EventRecordID>
            <Correlation/>
            <Execution ProcessID='5324' ThreadID='2928'/>
            <Channel>Microsoft-Windows-Sysmon/Operational</Channel>
            <Computer>DESKTOP-34EOTDT</Computer>
            <Security UserID='S-1-5-18'/>
        </System>
        <EventData>
            <Data Name='RuleName'></Data>
            <Data Name='UtcTime'>2018-12-08 20:37:53.763</Data>
            <Data Name='ProcessGuid'>{331D737B-28FF-5C0B-0000-001081250F00}</Data>
            <Data Name='ProcessId'>1772</Data>
            <Data Name='Image'>C:\Program Files (x86)\Google\Chrome\Application\chrome.exe</Data>
            <Data Name='TargetFilename'>C:\Users\andy\AppData\Local\Google\Chrome\User Data\Default\e46787f2-8ec3-46f9-b245-000fe5f85fa6.tmp</Data>
            <Data Name='CreationUtcTime'>2018-12-08 02:14:24.177</Data>
            <Data Name='PreviousCreationUtcTime'>2018-12-08 20:37:53.747</Data>
        </EventData>
    </Event>
    <Event
        xmlns='http://schemas.microsoft.com/win/2004/08/events/event'>
        <System>
            <Provider Name='Microsoft-Windows-Sysmon' Guid='{5770385F-C22A-43E0-BF4C-06F5698FFBD9}'/>
            <EventID>2</EventID>
            <Version>4</Version>
            <Level>4</Level>
            <Task>2</Task>
            <Opcode>0</Opcode>
            <Keywords>0x8000000000000000</Keywords>
            <TimeCreated SystemTime='2018-12-08T20:38:17.621228500Z'/>
            <EventRecordID>19</EventRecordID>
            <Correlation/>
            <Execution ProcessID='5324' ThreadID='2928'/>
            <Channel>Microsoft-Windows-Sysmon/Operational</Channel>
            <Computer>DESKTOP-34EOTDT</Computer>
            <Security UserID='S-1-5-18'/>
        </System>
        <EventData>
            <Data Name='RuleName'></Data>
            <Data Name='UtcTime'>2018-12-08 20:38:17.606</Data>
            <Data Name='ProcessGuid'>{331D737B-28FF-5C0B-0000-001081250F00}</Data>
            <Data Name='ProcessId'>1772</Data>
            <Data Name='Image'>C:\Program Files (x86)\Google\Chrome\Application\chrome.exe</Data>
            <Data Name='TargetFilename'>C:\Users\andy\AppData\Local\Google\Chrome\User Data\Default\daa42f83-e6b5-4528-a7ed-e0778b91783f.tmp</Data>
            <Data Name='CreationUtcTime'>2018-12-08 02:14:24.177</Data>
            <Data Name='PreviousCreationUtcTime'>2018-12-08 20:38:17.591</Data>
        </EventData>
    </Event>

The process, Chrome, is operating on two distinct cache files. These sorts of operations happen extremely frequently, to the point where your config may even whitelist out the directory entirely.

There’s clearly a ton of redundancy between these two logs - the process pid, image path, process GUID, and so on are repeated in every single log, wasting hundreds of bytes with each additional log.

I have a 16MB dump of Sysmon logs from a virtual machine. Let’s quickly throw away every line that appears more than once and count what’s left:

    cat ./events.xml | sort | uniq -u | wc -c
    > 1351536

We can see that the actual unique information in this 16MB file is closer to ~1.3MB - less than 10% of the original data is information we actually care about. That’s an order of magnitude reduction!
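
For reference, here’s the same measurement as a quick Python sketch (assuming the dump is saved as events.xml):

    from collections import Counter

    # Count the bytes taken up by lines that appear exactly once in the dump,
    # mirroring the `sort | uniq -u | wc -c` pipeline above.
    with open("events.xml", "rb") as f:
        line_counts = Counter(f.readlines())

    unique_bytes = sum(len(line) for line, count in line_counts.items() if count == 1)
    print(unique_bytes)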

This ‘unique lines’ approach is very similar to how Grapl works - Grapl uses information in logs, such as pids, paths, or timestamps, to determine a canonical identity for each entity, called a node key. This is not unlike Sysmon’s Process GUIDs, but computed entirely server-side. Grapl then coalesces the information for each entity, throwing out redundant data and storing only what is unique.
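
To make that concrete, here is a minimal, hypothetical sketch of identity-based coalescing. The key derivation and the choice of which fields count as ‘identity’ are purely illustrative - this is not Grapl’s actual pipeline:

    import hashlib

    # Hypothetical: derive a canonical identity (a 'node key') for the entity a
    # parsed event refers to. Here we lean on Sysmon's ProcessGuid plus the
    # host name purely for illustration.
    def node_key(event: dict) -> str:
        identity = f"{event['Computer']}|{event['ProcessGuid']}"
        return hashlib.sha256(identity.encode()).hexdigest()

    # Fields that describe the entity itself and only need to be stored once.
    STATIC_FIELDS = {"Computer", "ProcessGuid", "ProcessId", "Image"}

    def coalesce(store: dict, event: dict) -> None:
        node = store.setdefault(node_key(event), {"events": []})
        # Identity/static fields are written once per node...
        for field in STATIC_FIELDS & event.keys():
            node[field] = event[field]
        # ...while only the per-event, non-redundant details are appended.
        node["events"].append({k: v for k, v in event.items() if k not in STATIC_FIELDS})

Feeding both of the Sysmon events above through this would produce a single node holding one copy of the shared fields and two small per-event records (the target filenames and timestamps).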

The end result is that Grapl’s storage does not grow linearly with the logs you send up - it grows with the unique information in those logs, which in practice is closer to a logarithmic growth rate. The first log for a process creation will likely contain mostly unique information, but each subsequent action by that process adds considerably less.

Analyzers

Most SIEM alerting works via a scheduled search. Every N minutes your search runs over M minutes of data (where N and M are often the same).

Each of these searches is, more or less, O(N) in the data it scans. So if you’re searching over the last 10 minutes of data today and your search runs in X seconds, then next year, when your data volume has doubled, that same search will take roughly 2X seconds.
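
As a toy model of that pattern (the field names are made up, not any vendor’s API), every run walks all of the events that fall inside the window, so runtime scales directly with data volume:

    from datetime import datetime, timedelta, timezone

    # Toy scheduled search: scan the last `window` of events for a suspicious
    # parent process. The cost is one full pass over the window's data - O(N).
    def scheduled_search(events: list, window: timedelta = timedelta(minutes=10)) -> list:
        cutoff = datetime.now(timezone.utc) - window
        return [
            e for e in events
            if e["timestamp"] >= cutoff and e.get("parent_process_name") == "winword.exe"
        ]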

What’s worse is that join performance in a traditional SIEM is closer to exponential, making joins effectively pointless at scale. To put this into perspective, here is an excerpt from Splunk’s documentation on subsearches (which joins leverage):

“Additionally, by default subsearches return a maximum of 10,000 results and have a maximum runtime of 60 seconds. In large production environments it is quite possible that the subsearch in this example will timeout before it completes.” (source)

When using subsearches you have to ensure that you send in a bounded amount of data, or your search may be truncated. And I’m not just picking on Splunk - this is fundamental to the way traditional SIEMs work.

Grapl’s searches, what it refers to as Analyzers, have two important properties:

  • They are real time
  • Search complexity grows based on the query, not the data

In practice this means that Analyzer execution is effectively constant time: an analyzer that executes in X seconds today will still execute in ~X seconds next year, even if your data size has increased dramatically.

Here is a search for a suspicious execution based on a ‘winword.exe’ parent process:

    def analyzer(client: DgraphClient, node: NodeView, sender: Any):
        process = node.as_process_view()
        if not process:
            return

        p = (
            ProcessQuery()
            .with_process_name(eq="winword.exe")
            .with_children(ProcessQuery())
            .query_first(client, contains_node_key=process.node_key)
        )

Note the contains_node_key=process.node_key argument passed to query_first: it tells the query builder to create a subgraph search that matches the described pattern wherever that node_key exists in the matched graph.

Under the hood, it is as if two separate queries are generated:

        ProcessQuery() 
        .with_process_name(eq="winword.exe") 
        .with_node_key(eq=process.node_key)
        .with_children(ProcessQuery()) 

and

        ProcessQuery() 
        .with_process_name(eq="winword.exe") 
        .with_children(
            ProcessQuery()
            .with_node_key(eq=process.node_key)
        ) 

In this case there are at most 4 operations, ever, and they can even execute in parallel thanks to the Dgraph backend. Even with trillions of nodes this query should always take roughly the same amount of time.

Those 4 operations are all key-based lookups (even the edge traversal), and as such they’re constant time.
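
For a rough idea of what one of those lookups might look like against the Dgraph backend directly, here is a sketch using pydgraph - the predicate names, the children edge, and the index assumptions are mine for illustration, not Grapl’s actual schema:

    import json
    import pydgraph

    # Assumes `node_key` and `process_name` predicates exist with exact-match
    # indexes, so eq() resolves via the index rather than scanning the graph.
    QUERY = """
    query q($key: string) {
      q(func: eq(node_key, $key)) @filter(eq(process_name, "winword.exe")) {
        uid
        process_name
        children {
          uid
          process_name
        }
      }
    }
    """

    def lookup(client: pydgraph.DgraphClient, node_key: str) -> list:
        txn = client.txn(read_only=True)
        try:
            res = txn.query(QUERY, variables={"$key": node_key})
            return json.loads(res.json).get("q", [])
        finally:
            txn.discard()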

Engagements

In a SIEM-based workflow, upon receiving an alert you will first open up some kind of search window - say, the last 8 hours - and all of your searches will run across that 8-hour period of data.

As the scope of your investigation grows, so does your search window, and your search performance degrades accordingly. Going from an 8-hour window to a 16-hour window will at least double your search times.

It is not at all uncommon for investigations to span weeks, months, or even years’ worth of data. Malware commonly schedules its execution days or weeks after the initial payload lands, for example, or a very old vuln/exposure may be reported and you want to validate that it was never exploited.

Once again the SIEM has put us in a position of fighting with our data. We want the largest search window possible so that we can capture the full scope of an attack, but the shortest search window possible so that our searches stay fast. This is the sort of trade-off that I find particularly demoralizing.

Grapl throws search windows out entirely. You start an engagement with some suspect node, and from there you expand it. Each expansion operation is constant time. This is done through a Python library provided by Grapl, and can be executed in an AWS SageMaker notebook.

As an example, you may want to go from a process to its parent process.

    suspect_process = engagement.get_process("...")
    suspect_parent = suspect_process.get_parent()

It would not matter if suspect_process and suspect_parent had executed weeks apart - each lookup takes the same amount of time.

This is leveraging the same techniques as the Analyzers, generating optimized queries under the hood that act as key lookups.
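
Because each hop is just another key lookup, chaining expansions stays cheap. Here’s a small sketch, assuming only the get_parent() call shown above (walk_ancestry and max_depth are mine, not part of Grapl’s library):

    # Walk up the process tree from a suspect node. Each get_parent() call is a
    # key lookup, so the cost grows with the number of hops, not with how much
    # data the cluster holds or how far apart in time the processes executed.
    def walk_ancestry(process, max_depth: int = 10) -> list:
        lineage = [process]
        current = process
        for _ in range(max_depth):
            parent = current.get_parent()
            if not parent:
                break
            lineage.append(parent)
            current = parent
        return lineage

    ancestry = walk_ancestry(suspect_process)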

Conclusion

By leveraging techniques like identity and focusing on constant-time operations, Grapl can provide orders-of-magnitude improvements in storage and performance over the existing state of the art. Organizations should never feel like they have to fight with their data, or worry about their log volume because of absurd licensing fees and storage costs.

The improvements that Grapl makes don’t just represent a “2x” or “10x” speedup - they fundamentally change runtime characteristics, turning operations that are linear or exponential in a SIEM into operations that are logarithmic or even constant time.

Grapl is free, open source, and promises to make Detection and Response a radically better experience for detection engineers and incident responders.

Github: https://github.com/insanitybit/grapl




Published

17 August 2019
