Queries in Code
A detection and response (D&R) team’s attack signature queries are vital to their success, providing insight into suspicious behaviors occurring in their environment. Writing searches that can capture complex attacker behaviors, and ensuring that these searches are correct, are important responsibilities for a successful D&R team.
Grapl takes a fairly different approach to building these queries than other tools on the market, such as Splunk. Whereas Splunk has its own domain-specific language (DSL), SplunkQL, Grapl instead leverages Python - one of the most popular programming languages in the world.
I believe that there are numerous D&R use cases where a programming language like Python has significant advantages over domain-specific languages like SplunkQL.
While I will be discussing the usage of Python in comparison to SplunkQL, it’s worth noting that almost any project like Splunk takes the same DSL-based approach. I chose Splunk only because I have the most experience with it.
SplunkQL
The current state of the art for Detection and Response is the SIEM - products like Splunk or ElastAlert, which perform log management and orchestration, and provide a system for correlation and alerting.
These systems almost exclusively leverage their own query languages. Splunk, for example, has the Search Processing Language (SPL), which this post refers to as SplunkQL. Here is an example of a Splunk query:
index=wineventlog source=WinEventLog:Security
EventCode="4624"
Logon_Type="2" OR Logon_Type="10"
| fillnull value=* Source_Network_Address
| stats count by host Source_Network_Address Logon_Type user
| eval bar="("+count+") "+Source_Network_Address
| eval bar_host="("+count+") "+host
| stats list(bar) values(bar_host) by user Logon_Type
https://gosplunk.com/windows-rdp-sessions/
Notably, there are some specialized functions like stats with a by clause, you can bind information to a name using eval, and aggregate data using list or values. The language is really powerful in many ways.
There’s also no branching - instead, we write declarative statements such as Logon_Type="2", and filter out results that do not match. We have no function calls and the ability to abstract or compose searches is very limited.
Certain commands are also restricted in some ways; special commands like makeresults or inputlookup must be first in your search, and one cannot precede the other. There are some hidden magical rules like this in SplunkQL that aren’t always obvious, and can limit flexibility.
Python
Python is a much more typical, standard programming language. It has classes, functions, if statements, loops, libraries, and other constructs you’d expect.
In Grapl, which uses Python, a query looks something like this:
child = Process() \
    .with_image_name(contains="svchost.exe")
parent = Process() \
    .with_image_name(contains=Not("services.exe")) \
    .with_image_name(contains=Not("smss.exe"))
query = parent.with_child(child).to_query()
Process is a class that we instantiate, and use to describe what kind of processes in our graph we want to match against. We call methods like with_image_name to describe attributes of the process, or with_child to describe relationships between processes.
Ignoring the graph-based approach here, which allows a clear way to show relationships between entities, we can see that there’s a lot of abstraction. We don’t see the underlying generated query and we don’t know the internal mechanics of Process, which means we’re free to change those underlying details in the future.
Python is more of an imperative, object-oriented language (though it’s flexible enough to fit many paradigms), unlike Splunk’s purely declarative query language.
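To make the abstraction point concrete, here is a minimal sketch of how a chainable query builder of this shape could work. It is a toy stand-in, not Grapl's implementation: the class internals and the query text it emits are invented purely for illustration.

```python
from typing import List, Optional, Union


class Not:
    """Marks a value that must NOT appear in the attribute."""

    def __init__(self, value: str) -> None:
        self.value = value


class Process:
    """Toy query builder (assumption: not how Grapl's Process works)."""

    def __init__(self) -> None:
        self._filters: List[str] = []
        self._child: Optional["Process"] = None

    def with_image_name(self, contains: Union[str, Not]) -> "Process":
        if isinstance(contains, Not):
            self._filters.append('NOT image_name CONTAINS "{}"'.format(contains.value))
        else:
            self._filters.append('image_name CONTAINS "{}"'.format(contains))
        return self  # returning self is what makes method chaining possible

    def with_child(self, child: "Process") -> "Process":
        self._child = child
        return self

    def to_query(self) -> str:
        # The output syntax is made up; callers never depend on it.
        query = "process(" + " AND ".join(self._filters) + ")"
        if self._child is not None:
            query += " -> child " + self._child.to_query()
        return query


child = Process().with_image_name(contains="svchost.exe")
parent = (
    Process()
    .with_image_name(contains=Not("services.exe"))
    .with_image_name(contains=Not("smss.exe"))
)
query = parent.with_child(child).to_query()
```

Because callers only touch with_image_name, with_child, and to_query, the string built inside to_query could be swapped for a different backend query format without changing any signature code.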
Composition, Abstraction, and Control Flow
Composition and abstraction are fundamentals of software development. The ability to compose different computations, while abstracting away irrelevant details, is what allows us to write clean, clear, maintainable code.
As I mentioned before, query languages like SplunkQL have a hard time here. There are macros, which can expand to Splunk queries, and you can technically call other searches from within your search, but this is complex, and those are really the only tools available.
Python, on the other hand, has great tools for abstractions.
child = Process() \
    .with_image_name(contains="svchost.exe")
We don’t have to worry about how Process is implemented; it exposes a natural interface and we make use of it.
We could compose multiple Processes together into a ParentChildPair if we wanted to, or move some of the logic into another function.
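As a rough sketch of that idea, the parent/child pattern can be pulled into a small reusable function. The ProcessPattern and ParentChildPair classes and the unexpected_parent helper below are hypothetical names chosen for illustration; they are not part of Grapl.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProcessPattern:
    """Hypothetical description of a process to match (not Grapl's API)."""
    name_contains: List[str] = field(default_factory=list)
    name_excludes: List[str] = field(default_factory=list)


@dataclass
class ParentChildPair:
    """Two process patterns joined by a parent/child relationship."""
    parent: ProcessPattern
    child: ProcessPattern


def unexpected_parent(child_name: str, expected_parents: List[str]) -> ParentChildPair:
    """Describe child_name running under any parent not in expected_parents."""
    return ParentChildPair(
        parent=ProcessPattern(name_excludes=list(expected_parents)),
        child=ProcessPattern(name_contains=[child_name]),
    )


# The same helper now expresses two different signatures.
svchost = unexpected_parent("svchost.exe", ["services.exe", "smss.exe"])
lsass = unexpected_parent("lsass.exe", ["wininit.exe"])
```

Each new signature of this shape becomes a one-line call, and a fix to the helper applies to every signature that uses it.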
A common problem I’ve had in Splunk is expressing all of my logic in one query, without the use of control flow primitives. Python makes this easy.
def signature_graph() -> str:
    child = Process() \
        .with_image_name(contains="svchost.exe")
    parent = Process() \
        .with_image_name(contains=Not("services.exe"))
    return parent.with_child(child).to_query()

for hit in execute_analyzer(signature_graph):
    if not check_hit_against_whitelist(hit):
        output(hit)
    else:
        debug_log("Whitelisted hit: {}".format(hit))
Here we see a case where control flow and abstraction are used to build a query that was easy to write and is still easy to read.
We filter results from our signature matching using the check_hit_against_whitelist function, but the details of that function are abstracted away - maybe we hit a database, or reach back out to the master graph, or any other implementation. This keeps our whitelisting logic simple, and easy to change in the future.
Branching allows the code to not just filter out whitelisted hits, but to also execute code based on whether a hit is whitelisted or not. In the event that we do get a whitelisted hit, we’re going to log some information, and then continue.
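One possible shape for the whitelist check is sketched below, using an in-memory set. The hit format (a dict with parent and child keys), the whitelist contents, and the filter_hits helper are all assumptions made for the example; a real implementation might query a database instead.

```python
import logging
from typing import Dict, Iterable, Iterator, Set, Tuple

logger = logging.getLogger("analyzer")

Hit = Dict[str, str]

# Hypothetical whitelist of (parent, child) pairs known to be benign.
WHITELIST: Set[Tuple[str, str]] = {("updater.exe", "svchost.exe")}


def check_hit_against_whitelist(hit: Hit) -> bool:
    """Return True when the hit matches a known-benign pair."""
    return (hit["parent"], hit["child"]) in WHITELIST


def filter_hits(hits: Iterable[Hit]) -> Iterator[Hit]:
    """Yield hits that are not whitelisted, logging the ones that are."""
    for hit in hits:
        if check_hit_against_whitelist(hit):
            logger.debug("Whitelisted hit: %s", hit)
            continue
        yield hit


hits = [
    {"parent": "updater.exe", "child": "svchost.exe"},
    {"parent": "winword.exe", "child": "svchost.exe"},
]
remaining = list(filter_hits(hits))
```

Swapping the set lookup for a database call would only change check_hit_against_whitelist; the loop that consumes it stays the same.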
Debugging
Debugging a Splunk search can be really difficult. For one thing, there’s no easy way to just log out various steps or data. Sometimes things just stop (like if you stats by null) and you don’t know why - the easiest way to figure it out is usually to start cutting your search in half, rerun it, and inspect the output. This is a tedious process.
Python makes things way simpler here. For one thing, print debugging is trivial - you can inject log points anywhere into your code, as we see in the example in the Composition, Abstraction, and Control Flow section.
Python also provides standard debugging support using breakpoints. You can actually attach to the Python interpreter and step through code, inspecting variables as you go.
pdb, the debugger that ships with Python’s standard library, is what I’ve used to do this in the past when debugging more complex problems.
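As a small illustration, the sketch below logs the events that would otherwise vanish when grouping by a missing field, and marks the spot where you could drop into pdb. The event format and the count_by_host function are invented for the example.

```python
import logging
from typing import Dict, Iterable

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("analyzer.debug")


def count_by_host(events: Iterable[Dict[str, str]]) -> Dict[str, int]:
    """Count events per host, logging any event that has no host field."""
    counts: Dict[str, int] = {}
    for event in events:
        host = event.get("host")
        if host is None:
            # A log point: we see exactly which event was skipped and why.
            logger.debug("Skipping event without host: %r", event)
            # breakpoint()  # uncomment to pause here and inspect in pdb
            continue
        counts[host] = counts.get(host, 0) + 1
    return counts


events = [{"host": "web-1"}, {"user": "alice"}, {"host": "web-1"}]
counts = count_by_host(events)
```

The built-in breakpoint() call (Python 3.7 and later) starts pdb at that line by default, so you can inspect event and counts interactively instead of bisecting the query.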
Version Control
Searches are code, and they require an adherence to standards just as code does. Version control is one of the mechanisms that almost every mature software project uses to enforce their standard of quality.
When your searches live in code it makes management much simpler. Splunk’s searches generally live in a flat file, with the interface to the file being the GUI - this makes management of searches difficult if you want to do it in a way that isn’t the default.
Again, using a more standardized tool pays off. Python makes it easy to follow standard best practices here, as it’s extremely common for Python codebases to be backed by a version control system. The intended practice for Grapl is to keep all of your queries in a repository, and then use a Git hook to sign and deploy them to the analyzer S3 bucket.
This allows enforcing code reviews, linting, etc., and only releasing when your Git hooks have passed and your queries meet your quality bar.
Testing
DSLs are often very frontloaded in power, having lots of specialized functions for their designated use case. They usually lack power in other areas, such as tooling.
In particular, if you search around for how to test your SplunkQL searches, you might be disappointed. It’s definitely possible, but it isn’t a natively supported concept, and you’re probably going to be home-growing whatever solution you come up with. If you want to get closer to best practices, such as rerunning tests on every change, and blocking changes if tests fail, you’ll be spending a lot of time building your own system.
Contrast this with Python, where testing is provided by the standard library. There’s mocking, patching, and support from all major Continuous Integration (CI) services.
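import unittest

# Illustrative names: the analyzer and the graph helpers are assumed to be
# importable from your own modules (test_helpers is a placeholder).
from my_attack_analyzer import my_attack_analyzer
from test_helpers import init_local_mg, add_attack_signature, add_benign_graph


class TestAttackSignature(unittest.TestCase):
    def setUp(self):
        self.master_graph = init_local_mg()
        add_attack_signature(self.master_graph)

    def test_hit(self):
        self.assertTrue(my_attack_analyzer(self.master_graph))
        # Assert other properties of the response

    def test_miss(self):
        # Clear our master_graph
        self.master_graph.clear()
        add_benign_graph(self.master_graph)
        self.assertIsNone(my_attack_analyzer(self.master_graph))


if __name__ == '__main__':
    unittest.main()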
This is a strawman example of how one might create a positive or negative test case for a Python-based Analyzer query. This approach demonstrates simple, standard practices for testing - we could easily integrate this into our CI pipeline just like any other codebase.
Even with a very basic test like this you can ensure that your alert is functional, and Python makes it easy to build much more powerful alerts, and guide your testing through coverage or other metrics.
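Mocking deserves a quick illustration too. The sketch below uses unittest.mock from the standard library to replace the graph entirely, so the test needs no running backend. The svchost_analyzer function and its run_query call are hypothetical and only stand in for a real analyzer.

```python
import unittest
from unittest import mock


def svchost_analyzer(graph):
    """Hypothetical analyzer: return the first hit, or None if nothing matches."""
    hits = graph.run_query("unusual svchost parent")
    return hits[0] if hits else None


class TestSvchostAnalyzer(unittest.TestCase):
    def test_hit(self):
        graph = mock.Mock()
        graph.run_query.return_value = [{"parent": "winword.exe"}]
        self.assertEqual(svchost_analyzer(graph), {"parent": "winword.exe"})
        # The mock also records how it was called.
        graph.run_query.assert_called_once_with("unusual svchost parent")

    def test_miss(self):
        graph = mock.Mock()
        graph.run_query.return_value = []
        self.assertIsNone(svchost_analyzer(graph))
```

Saved in a test module, these cases run with python -m unittest, which is the same command a CI service would invoke.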
Static Validation
Part of ensuring code correctness is static validation - linters and type systems being the big two.
Splunk provides the appinspect app, which has some predefined rules for ensuring the basics of a good Splunk search - it’s essentially a linter.
Python has a ton of linters, as well as an optional static type system.
You can find more information about linters from pylint.org - there are incredibly powerful and capable linters. For example, pyreverse allows you to generate UML diagrams out of your Python code. And of course you have your bases covered for things like line length, variable name standards, incorrect interface implementations, etc.
mypy, the Python type checker, can also help you ensure correctness of your searches. Grapl’s Analyzers use mypy types heavily, which helps avoid errors like accidentally using a None value. Contrast this with Splunk where fields can very easily be undefined or null, and lead to silently dropped events.
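Here is a minimal sketch of the kind of None-related mistake a type checker catches. The function names and the event format are made up for the example; the field name is borrowed from the Splunk query earlier in the post.

```python
from typing import Dict, Optional


def source_address(event: Dict[str, str]) -> Optional[str]:
    """Return the source address, or None when the field is missing."""
    return event.get("Source_Network_Address")


def describe(event: Dict[str, str]) -> str:
    address = source_address(event)
    # If this check is removed, mypy flags the .strip() call below,
    # because address may be None at that point.
    if address is None:
        return "unknown source"
    return address.strip()
```

The Optional[str] return type forces every caller to decide what a missing field means, rather than letting the event silently disappear.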
Libraries
Python is famous for its huge ecosystem of libraries - there’s no need to reinvent the wheel. Between the standard library and PyPI you should have everything you need to build arbitrarily powerful searches.
The data science communities, as well as the security community, have really centered on Python over the last decade or so, building helpful tools like:
- scipy - Statistical functions and common analytics tools
- sklearn - A simple, well-documented ML library
- tensorflow - A powerful ML library, driving projects like AlphaGo
- scapy - A library for packet inspection
- pefile - A library for interacting with PE files
- beautifulsoup - Not directly a security tool, but definitely one that a lot of security researchers use. beautifulsoup provides a simple interface for interacting with HTML, helpful for analysis of webpages for suspect content.
Python provides the best in class ecosystem for analyzing data.
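Even the standard library alone goes a long way. The sketch below flags hosts whose logon counts are far from the median, using only the statistics module; the data, the unusual_hosts function, and the 3.5 threshold are illustrative assumptions, and a library like scipy offers far richer tools for the same job.

```python
from statistics import median
from typing import Dict, List


def unusual_hosts(counts: Dict[str, int], threshold: float = 3.5) -> List[str]:
    """Return hosts whose count is an outlier by median absolute deviation."""
    values = list(counts.values())
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return []  # no spread at all, so nothing can be called an outlier
    # 0.6745 scales the deviation so it is comparable to a z-score.
    return [
        host
        for host, value in counts.items()
        if 0.6745 * abs(value - center) / mad > threshold
    ]


# Made-up daily logon counts per host.
logons = {"web-1": 12, "web-2": 9, "db-1": 11, "build-1": 10, "kiosk-7": 95}
outliers = unusual_hosts(logons)
```

The median-based score is used here instead of a mean and standard deviation because a single extreme value in a small sample inflates the standard deviation enough to hide itself.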
Conclusion
I want to be clear that I’m not picking on Splunk here - I used it as the example because I know it best, but virtually every system I’ve run across suffers from the same exact problems. I believe that using a typical, powerful programming language like Python solves many of these problems.
This is why I’ve chosen Python as the first language that Grapl supports for its Analyzer library.
My hope is that I can help analysts build better attack signatures faster, reduce noise, increase signal, express more powerful TTPs and anomalies in their alerts, all while ensuring that their queries are maintainable, readable, and correct.
If you’re interested in learning more about Grapl, please feel free to reach out to me either on Twitter or via the GitHub repo.
https://twitter.com/InsanityBit
https://github.com/insanitybit/grapl