InsanityBit2023-12-23T21:04:05+00:00http://insanitybit.github.ioInsanityBitinsanitybit@gmail.comMitigations Without Modeling2023-12-23T00:00:00+00:00http://insanitybit.github.io/2023/12/23/mitigations-without-modeling<h1 id="mitigations-without-modeling">Mitigations without Modeling</h1>
<p>Something that has been brought up for years is that mitigation techniques should not be built without a threat model. I agree with this premise, mostly at least, but I wanted to consider the alternative argument: if I truly believed that mitigations should be built without modeling, what would that look like?</p>
<p>I believe that the argument would look like this:
By implementing mitigations without threat models we can address unknown threats before they occur. Mitigations can be built without modeling but still be based on principles.</p>
<p>PaX Team published numerous documents about mitigations and threat models. It’s worth reading <a href="https://pax.grsecurity.net/docs/">all of them</a>, but a common theme is to define exploit primitives and vulnerability classes and then to discuss what mitigations for those would look like. This is how you build mitigations with clear threat modeling.</p>
<p>In contrast, OpenBSD has been adding mitigations that arguably do not have a threat model. There is no rigid definition of attack primitives and how these mitigations interfere with those primitives - there is definitely <em>some</em> discussion of primitives and how this would interfere with attackers, but it’s a bit more handwavy, closer to “hopefully this will be annoying to attackers”.</p>
<p>What OpenBSD is doing could be seen as “bad”, but a lot of the argument is “this <em>could</em> end up making things harder for attackers”. What OpenBSD is doing is building mitigations based on principles, which are broad, rather than models, which are rigid.</p>
<p>We can imagine building existing mitigations from a principle. For example, the Principle of Least Privilege would be enough for us to invent W^X (memory that is writable or executable, never both), even if we didn’t yet have a threat model to guide us.</p>
<p>One can continue to build principled systems without ever thinking about the actual implications. They might be useless, or, when some new unforeseen threat changes the existing threat model, we might have already addressed it without even trying.</p>
<p>That is to say, systems which are designed in a principled way have the <em>potential</em> to be safer against unknown threats, whereas systems designed in a modeled way have known value for existing threats.</p>
<p>“Potential for unknown threats” is, I suspect for many, far less compelling than “definitive value against known threats”. If you have to pick one I think anyone would agree to pick the latter. But that doesn’t mean that the former isn’t a valuable way to build systems.</p>
<p>Certainly, the ideal system is built in a principled way and with mitigations for known techniques.</p>
Firefox2023-12-05T00:00:00+00:00http://insanitybit.github.io/2023/12/05/firefox<h1 id="saving-firefox">Saving Firefox</h1>
<p>So once again the topic of Firefox’s decline has shown up on Hacker News. I want to give my opinion on the topic as a long-time Firefox user who switched to Chrome about a decade ago. I’m not an expert on browsers, and really I don’t think my opinion is that worthwhile - my relevant credentials are “someone who has thought a bit about browsers and one of billions who uses a browser”.</p>
<p>So, with that said, here’s some stuff.</p>
<p><strong>Why do I use Chrome? Why do others?</strong></p>
<p>The main reason I switched to Chrome is pretty straightforward. It was, by far, the safest browser on the market. It wasn’t even close. When Chrome got decent market share the entire web had a shift in its threat landscape. I personally had a computer basically destroyed by a Java 0-click drive by as a teenager and I was not happy about it.</p>
<p>Besides introducing the now-commonplace “auto updater”, Chrome implemented Click To Play for Java and Flash <em>and</em> it sandboxed Flash, so you could run it and still be safe from all but the advanced attackers. Again, this was an absolute game changer. Black Hole and Poison Ivy, the big pay-to-play exploit kits at the time, would detect if you were a Chrome user and fall back to trying to just convince you to download and run malware. Drive-By exploits went from being pretty commonplace to basically dead - when was the last time you heard about a massive, publicly used, spray-and-pray 0-click exploit against users browsing the internet? It just doesn’t happen anymore, attacks are far more targeted because they are <em>radically</em> more expensive.</p>
<p>So, yeah, I chose to move to Chrome.</p>
<p>Why did I stick with Chrome, and why do I suspect many others do? Well…. at least for me, that’s pretty simple - it’s what I, and many others, use at work.</p>
<p>There’s one major reason why companies centralize on Chrome, other than that it’s the most common browser: SOC2 and compliance.</p>
<p>The vast majority of companies need to attain <em>some</em> kind of certification, such as SOC2. One thing that you’ll need to do in that process is answer questions like “how do you make sure that your computers are patched? how do you authenticate clients?”. These are good questions for every company to answer, but many companies don’t have a choice in the matter - they have to answer them.</p>
<p>First of all, answering questions about <em>one</em> browser is much easier than answering questions about <em>two</em> or arbitrary numbers of browsers. Just by saying “use only this browser” a company now only has to monitor a single User Agent to determine if the client is up to date. They only have to track CVEs for one browser. They only have to <em>manage</em> one browser. So then, the question is which browser? Well, obviously many will choose whatever is the majority - if you have to pick one, pick the one that most people use, right? Right.</p>
<p>But, also, there’s real merit to choosing Chrome for SOC2/Compliance, as well as actual security.</p>
<ol>
<li>IT can manage your Chrome profile, enforcing versioning and extension policies.</li>
<li>IT can enforce Endpoint Verification when you SSO, if you have GSuite as your SSO provider.</li>
</ol>
<p>I’m not saying every company is solving their SOC2 problems this way, but I bet a lot of them are. And you don’t need all of them to, just enough that Chrome becomes the “obvious” choice for the others who just need to pick whatever’s popular.</p>
<p>So, if you’re going to work every day and using Chrome, do you want to come home and use Firefox? Maybe you do, but I think the vast majority of people would prefer to use the same browser that they use every single day for work at home as well - after all, even minor UX differences between the two will be painful.</p>
<p><strong>Lack of Motivation</strong>
The reality is that Firefox’s message hasn’t been very compelling for a long time.</p>
<ol>
<li>
<p>Firefox is funded almost exclusively by Google. So they can say “wow Google is so evil!” but idk, it just isn’t really doing it for me when you’re taking their money. What’s the long term plan here? Firefox suddenly takes a ton of market share, and Google just pays for that? It’s kind of shocking that Google still bothers to pay Firefox as much as they do given the lack of market share. I don’t think Mozilla is the basket for my eggs, personally.</p>
</li>
<li>
<p>I’m not convinced that Mozilla is the horse to bet on. Brave seems far more interesting to me. I’m not going to try to plug Brave, but integrating TOR (and donating lots of TOR nodes) is actually a genius move that <em>seriously</em> moves the needle with regards to privacy. Brave is also willing to actually answer the question “how would the internet look if we removed advertising?” and whether you’re happy with their answer or not, Mozilla wouldn’t exist without ad revenue at all. Mozilla can safely target user privacy while its market share dwindles, but the reality is that ads still pay their bills. I know Mozilla sort of half tried to do something about this and failed, I can’t even remember what that was called, but it was some sorta alt-funding for the web idea. I don’t know, try harder.</p>
</li>
<li>
<p>A lot of Privacy conversations are… bad. People used to talk about all of these things Chrome did that just do not matter. One example is that people freaked out about LLMNR poisoning detection - Chrome sends out a bunch of LLMNR packets to your local broadcast network when it starts up (or it used to at least, idk if it still does). So people would see this in wireshark and be like “oh my god Chrome is sending out packets when I didn’t do anything”. The thing is, the packets never even left your local network, and they were there <em>to protect you -</em> if those LLMNR requests received responses it means someone on your network is fucking with you. This is just one example of literally well over a dozen where “omg Chrome is talking over the network” was just… not a problem at all. “Chrome collects every website you visit” welllllll, uh, kinda?</p>
<p>The Google Safebrowsing API collects partial hashes of websites you visit and, if the partial hash collides with a suspicious site, the full hash is provided (which obviously can be reversed on Google’s end). Is that really bad? I mean it’s not great, but how would you have implemented that feature? It’s actually avoiding sending the full hash in the vast majority of cases, only a partial hash, which I don’t think is nearly as easy to reverse (and certainly always leaves plausible deniability). I think V3, which requires an Opt In (for enhanced protection) actually collects full URLs in order for Google to visit the site, analyze it dynamically, and then make a judgment - hey, look, that’s definitely not great for privacy… but it’s <em>opt in</em> and it’s also clearly a win for security (even if we say it’s a loss for privacy!). So, idk, is Chrome super evil?</p>
<p>But wait, Manifest V3! OK, yeah, Manifest V3 is kind of annoying, although I think for reasons that are not exactly mainstream. I wish that Chrome would handle it differently and I’ll elaborate further in a minute. I believe that the problem of malicious extensions is legitimate and needs to be addressed - extensions absolutely have too many privileges. I’ve been told by others (experts) that they’ve encountered malicious/spam extensions and they track this sort of thing and believe it’s a real issue. Manifest V3 is an attempt to limit a malicious extension’s abilities. Many people have been saying, for years, that this is evil and will break ad-blockers. The reality is not that straightforward - Google has implemented <em>many</em> changes in Manifest V3 expressly to allow for adblockers to continue. Really, what annoys me about Manifest V3 the most is that Google isn’t coming out and giving us more data on malicious extensions and how V3 will stop them. Like, just do that please, it would be super interesting.</p>
<p>Manifest V3 is in no way the “death” of adblockers, as far as I can tell. Adblockers will still be largely functional, and certainly 30,000 dynamic rules should be enough to block <em>Google’s</em> trackers, right? Again, let me know if I’m just totally off here. In fact, here’s a version of uBlock Origin that is using Manifest V3 features exclusively: https://github.com/gorhill/uBlock/commit/a559f5f2715c58fea4de09330cf3d06194ccc897</p>
<p>The point is that these discussions tend to be extremely hyperbolic, low in technical content, and refuse to acknowledge that there are real issues with the way things work right now. I could continue on and discuss WEI but this section is already obscenely long (and I cut out a whole bunch of extra V3 content about the APIs just to keep it shorter!).</p>
</li>
</ol>
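<p>To make the Safe Browsing point above a bit more concrete, here is a rough Rust sketch of the partial-hash flow as I described it. This is a model of the idea, not the real Safe Browsing protocol: <code class="language-plaintext highlighter-rouge">Server</code>, <code class="language-plaintext highlighter-rouge">check</code>, and <code class="language-plaintext highlighter-rouge">Outcome</code> are names I made up, the 32-bit prefix is an assumption, and <code class="language-plaintext highlighter-rouge">DefaultHasher</code> is only a std-only stand-in for a real cryptographic hash.</p>

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Stand-in for a cryptographic hash. DefaultHasher is NOT cryptographic;
/// it is used here only so the sketch needs nothing outside std.
fn full_hash(url: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    url.hash(&mut hasher);
    hasher.finish()
}

/// The "partial hash": only the top 32 bits of the full hash.
fn prefix(hash: u64) -> u32 {
    (hash >> 32) as u32
}

/// Hypothetical server holding full hashes of known-bad URLs.
struct Server {
    bad_hashes: HashSet<u64>,
}

impl Server {
    fn new(bad_urls: &[&str]) -> Self {
        Server {
            bad_hashes: bad_urls.iter().map(|url| full_hash(url)).collect(),
        }
    }

    /// First round trip: the server sees only a prefix.
    fn prefix_collides(&self, seen: u32) -> bool {
        self.bad_hashes.iter().any(|hash| prefix(*hash) == seen)
    }

    /// Second round trip, only reached after a prefix collision.
    fn is_bad(&self, hash: u64) -> bool {
        self.bad_hashes.contains(&hash)
    }
}

/// What the client ended up disclosing for one URL.
#[derive(Debug, PartialEq)]
enum Outcome {
    /// Only the prefix left the machine (the common case).
    PrefixOnly,
    /// The prefix collided, so the full hash was sent as well.
    FullHashSent { flagged: bool },
}

fn check(server: &Server, url: &str) -> Outcome {
    let hash = full_hash(url);
    if !server.prefix_collides(prefix(hash)) {
        return Outcome::PrefixOnly;
    }
    Outcome::FullHashSent { flagged: server.is_bad(hash) }
}

fn main() {
    let server = Server::new(&["http://malware.example/payload"]);
    println!("{:?}", check(&server, "http://malware.example/payload"));
    println!("{:?}", check(&server, "http://benign.example/"));
}
```

<p>The privacy argument lives in the early return: unless the prefix collides with something on the list, the full hash never leaves the client.</p>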
<p>Suffice it to say that, while I am a privacy advocate, I am just not swayed by the majority of conversations I have about Chrome versus Firefox.</p>
<p><strong>So what’s to be done?</strong></p>
<p>Well, it may surprise you to learn that my preference is actually for Firefox to “win”. Or, at least, for Chrome to “lose”. I don’t like a browser monopoly and as much as I think the privacy conversation around Chrome is mostly noise, Google’s interests in the web don’t align well with mine, longer term. If Google diversified more and stopped being an ad company that could change, but while I think advertising should always be a <em>part</em> of the internet it should never be the sole driver of it - and Google as it exists today will always benefit from a web that is driven exclusively by advertising.</p>
<p>Anyways, if I had to put a plan together for Mozilla, here’s what I’d do. (Of course, if I were actually the CEO I’d spend a hell of a lot of time talking to internal teams about what to do, but I’m not getting paid millions of dollars a year, unlike someone else….)</p>
<ol>
<li>
<p>Focus on enterprise. There should be a trivial way for companies to manage Firefox in an enterprise, integrate it with their SSO provider (GSuite, Okta, O365), and answer key compliance questions using Firefox. Firefox already has an LTS, which is cool and helpful. You can also manage it via GPO, but I’m talking about a web interface with integrations to other service providers, not just “get it installed”. IDK, maybe that exists? Mozilla marketing is terrible, see my next point…</p>
</li>
<li>
<p>Firefox should emphasize other features that align well with the organization. Do you know how many ads I see a day for ExpressVPN and NordVPN? Dozens. They advertise on Youtube like <em>crazy</em> - a perfect demographic for Firefox users, in my opinion. Do you know how often I hear about Mozilla’s VPN? Literally never. I had to DDG it to make sure it hadn’t shut down or something. Mozilla needs to put tertiary integrations like a VPN <em>front and center</em>. Opera has been doing a very good job of this lately with OperaGX - meeting their users where they are and getting their brand out there.</p>
</li>
<li>
<p>Honestly, fire the CEO. Absolute disaster and an abject, repeated failure. The board needs to get serious and get them out of there yesterday. I don’t think anything else matters more than this.</p>
</li>
<li>
<p>Focus on core values. That means privacy (VPN, TOR), security, and performance. Firefox is in such an interesting position. Mozilla has Rust (and then fucked up incredibly by firing the entire team - again, fucking fire that joke of a CEO) and a unique engine to compete with. These are <em>assets</em>. Chrome is suffering from 0-day exploits very consistently now, it’s a real problem; a browser with significant use of a memory safe language would be a major marketing tool for users <em>and</em> organizations. And as for performance, things may have changed a lot over the last decade, but plenty of websites are still damn slow - I find it hard to believe that there isn’t more work to be done on performance. I remember when Mozilla <a href="https://hacks.mozilla.org/2017/08/inside-a-super-fast-css-engine-quantum-css-aka-stylo/">put out this blog post</a> and people <em>lost their minds</em> at how good it was - both in terms of the focused efforts and the way the content was presented. Of course, as I recall, the CEO fired the authors. Brilliant.</p>
</li>
</ol>
<p>Anyways, I could go on and on about how the CEO of Mozilla absolutely has to be fired, but I won’t bother. This post is already way longer than I had intended. I think that Firefox has a path to success and I’d like to see it do so. Until then, switching to Firefox feels like a purely symbolic gesture with zero impact - me choosing Firefox won’t change the fact that companies aren’t, the fact that their marketing is disastrous, that their CEO is aggressively unfit for the role, etc.</p>
<p>I also want to note that it’s OK to disagree. I tried to make this clear in my first paragraph - I’m not an expert. I might make different value judgments, but I also might just be <em>wrong</em>. I’m also definitely <em>not</em> advocating that you switch to Chrome or something like that - if anything, it’s the opposite. Go make that symbolic gesture if you want to, or hell, use Firefox because for you it’s the better browser, <strong>by all means.</strong> I have just seen this conversation go on for years and I feel like throwing some of my thoughts out there.</p>
<p>If I’ve made an incorrect statement here or you think I’ve missed something important you can point it out to me, I would be happy to learn more.</p>
The Cognitive Burden of Garbage Collection vs Move Semantics2023-06-09T00:00:00+00:00http://insanitybit.github.io/2023/06/09/Java-GC-Rust
<h1 id="the-cognitive-burden-of-garbage-collection-vs-move-semantics">The Cognitive Burden of Garbage Collection vs Move Semantics</h1>
<p>Many people feel that Rust’s borrow checker introduces too much cognitive overhead, and that it must therefore reduce productivity. This is something I strongly disagree with. In fact, I would argue that it <em>reduces</em> cognitive overhead by unifying memory and resource management.</p>
<p>In Garbage Collected languages there is far more manual, error-prone work that the developer is responsible for because, somewhat (IMO very) unintuitively, the only garbage that a GC handles is memory.</p>
<p>This post is going to demonstrate this using Java, but other languages with a GC like Python and Go have this same problem, just replace <code class="language-plaintext highlighter-rouge">try-with-resources</code> with <code class="language-plaintext highlighter-rouge">defer</code> for Go and <code class="language-plaintext highlighter-rouge">with</code> for Python.</p>
<h2 id="memory-is-not-the-only-resource"><strong>Memory is not the Only Resource</strong></h2>
<p>Any non trivial program is almost certainly going to do more than handle just memory - file handles, network connections, database clients, threads, etc are all a part of any program’s bag of resources to manage. Errors related to managing these resources are problematic - I know I’ve certainly run into my fair share of resource leaks leading to a process crashing in the middle of the night.</p>
<p>Java, like many other languages, uses a garbage collector (GC) to automatically manage memory. While this eliminates the pitfalls of manual memory management, it also means that we have two methods of managing resources - one for memory, one for the rest. And it’s not always clear (certainly it’s very <em>unclear</em> without an IDE, such as when performing a code review in GitHub) which method is appropriate to use.</p>
<p>For <em>non-memory resources</em>, Java provides constructs like <code class="language-plaintext highlighter-rouge">try-with-resources</code> and <code class="language-plaintext highlighter-rouge">AutoCloseable</code>, but the developer has to know when and where to use them. And that’s the crux of the issue: it’s not always clear who is responsible for cleaning up these resources, leading to potential confusion, errors, and certainly leading to cognitive overhead.</p>
<h2 id="complex-resource-management-in-java">Complex Resource Management in Java</h2>
<p>Consider a simple scenario in Java where you establish an HTTP connection, provide that connection to a gRPC channel, and then provide that channel to a gRPC Client:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>HttpClient httpClient = HttpClient.newHttpClient();
try(
GrpcChannel grpcChannel = GrpcChannel.builder(httpClient).build();
GrpcClient grpcClient = GrpcClient.builder(grpcChannel).build();
) {
// Use the gRPC client here
}
</code></pre></div></div>
<p>As humans we can see that this HttpClient is only used in one place - the GrpcChannel. We’ve provided it as an argument and never need to use it again. If the HttpClient had been a memory-only construct we could assume this all works just fine - after all, the Garbage Collector would have no problem understanding that it is never used again.</p>
<p>But HttpClient isn’t memory. Despite GrpcChannel and GrpcClient owning that resource they have no way of releasing it. We have a leak here.</p>
<p>Now, your IDE <em>may</em> help here by pointing out that HttpClient implements AutoCloseable. Certainly IntelliJ seems to do well here - but that doesn’t change the fact that the developer is forced to manage this situation.</p>
<p>You could remedy this situation by moving the HttpClient into the try-with-resources, but this mismatch of resource handling is complex and involves a lot of boilerplate to deal with. And this is just a simple, contrived example - consider cases where a class holds onto an AutoCloseable field, complicating ownership further.</p>
<p>The inexpressible ownership is burdensome.</p>
<h2 id="contrast-to-rust"><strong>Contrast to Rust</strong></h2>
<p>In contrast, Rust introduces a unified approach to managing resources through its ownership model and RAII principles. When an object goes out of scope, Rust automatically cleans up the resources associated with it, both memory and non-memory. Here’s how the same scenario could look in Rust:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>let http_client = HttpClient::new();
let grpc_channel = GrpcChannel::new(http_client);
let grpc_client = GrpcClient::new(grpc_channel);
// Use the gRPC client here
// Once out of scope, Rust automatically cleans up grpc_client, and all the resources leading up to it - whether they're memory or not
</code></pre></div></div>
<p>Rust’s approach provides clear ownership of resources, making it apparent who is responsible for releasing them. Furthermore, resources are automatically released when they go out of scope, ensuring no resource leaks. This eliminates the aforementioned complexity associated with managing non-memory resources in languages with a GC.</p>
<p>We don’t even have to define the scope of our variables - it can change based on the usage. If we returned a variable we’d be saying “ok caller, I give you ownership”, if we didn’t move it anywhere we’d retain it, and, as we see here, if we do move it, we move the responsibility with it.</p>
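<p>Here is a small runnable sketch of that ownership chain. The three types are hypothetical stand-ins for the ones in the snippet above (not a real gRPC library), and each one just records its cleanup in a shared log so the release order is visible.</p>

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Shared log used to observe the order in which resources are released.
type Log = Rc<RefCell<Vec<&'static str>>>;

// Hypothetical stand-ins for the types in the post; each owns the previous one.
struct HttpClient {
    log: Log,
}
struct GrpcChannel {
    _http: HttpClient,
    log: Log,
}
struct GrpcClient {
    _channel: GrpcChannel,
    log: Log,
}

impl Drop for HttpClient {
    fn drop(&mut self) {
        self.log.borrow_mut().push("http client closed");
    }
}
impl Drop for GrpcChannel {
    fn drop(&mut self) {
        self.log.borrow_mut().push("grpc channel closed");
    }
}
impl Drop for GrpcClient {
    fn drop(&mut self) {
        self.log.borrow_mut().push("grpc client closed");
    }
}

/// Builds the chain and returns it: ownership of everything moves to the caller.
fn build_client(log: &Log) -> GrpcClient {
    let http = HttpClient { log: log.clone() };
    let channel = GrpcChannel { _http: http, log: log.clone() }; // `http` is moved here
    GrpcClient { _channel: channel, log: log.clone() }
}

fn run() -> Vec<&'static str> {
    let log: Log = Rc::new(RefCell::new(Vec::new()));
    {
        let _client = build_client(&log);
        // Use the gRPC client here.
    } // `_client` goes out of scope: the whole chain is released, no `close()` calls.
    let events = log.borrow().clone();
    events
}

fn main() {
    for event in run() {
        println!("{}", event);
    }
}
```

<p>Nothing in <code class="language-plaintext highlighter-rouge">run</code> mentions the HTTP client at all, yet it is released along with its owner. The owner’s <code class="language-plaintext highlighter-rouge">drop</code> runs first and then its fields are dropped, so the log reads client, channel, HTTP client.</p>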
<p>This <em>unified</em> approach introduces far less cognitive overhead. There is one way to do things, it’s handled for you, and you have one, flexible system for changing ownership.</p>
<h2 id="conclusion"><strong>Conclusion</strong></h2>
<p>My personal opinion, and my experience as well, has been that many tools like GCs that try to help you actually introduce far more complexity - but I constantly encounter statements like “Rust just won’t be as productive as a GC’d language” that flat out aren’t the case for me or other Rust devs I know.</p>
<p>GC is just one example. Languages that try to hide pointer semantics for you also introduce a massive amount of complexity, often hiding copies from you in a way that makes you ask the question “if I mutate this thing is it just mutating my copy or someone else’s?” - the subject of another post, I think.</p>
<p>Now I can assume that I’m going to get some responses like:
“OK but it’s not that hard, you just learn to do it, or use tools to catch these problems”.</p>
<p>And that’s fine - I find it to be this annoying extra task when writing code where I have to think “wait is this thing supposed to be closed? k, I’ll add it to my ever-growing try-with-resources” but it’s not like I can’t write Java.</p>
<p>The point isn’t “Java is unusable”, it’s that the solutions that GC brings have their own cognitive burden and that, in my experience, when something tries to do things for you automatically, unless it can do <em>everything</em> for you automatically, it’s going to end up making things worse.</p>
<p>Anyway that’s it.</p>
Forged Capabilities in Rust2022-05-11T00:00:00+00:00http://insanitybit.github.io/2022/05/11/Forged-Caps-Rust
<p>I had some thoughts about capabilities. This is a very half assed post that I wrote up to get my thoughts out so that I could go back to work.</p>
<p>OK so capabilities are cool. They’re essentially named tokens that, if you possess, give you some right. They can be delegated by telling someone about that name, which makes them very powerful.</p>
<p>The “security” of a capability is enforced by the inability to forge one - as an example, imagine I host a sensitive document on S3 at <code class="language-plaintext highlighter-rouge">public-bucket/<uuid></code>. If you don’t have the “list” IAM permission, that uuid acts as a capability - its name connotes permission. I can tell you the name, and now you have permission. If you could “forge” the name by guessing it, it wouldn’t protect me at all.</p>
<p>So that’s capabilities. What about Rust?</p>
<p>Well, let’s imagine a capability system.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pub cap Io;
fn reads_file(path: &str) -> String
requires: Io
{
std::fs::read_to_string(path).unwrap()
}
</code></pre></div></div>
<p>I’ve declared a new public capability, <code class="language-plaintext highlighter-rouge">Io</code>, and I’ve required that capability for <code class="language-plaintext highlighter-rouge">reads_file</code>.</p>
<p>Here’s what calling that looks like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>fn read_config() -> String
requires: Io
{
reads_file("config.toml")
}
fn main()
requires: Arbitrary
{
dbg!(read_config());
}
</code></pre></div></div>
<p>Notably, <code class="language-plaintext highlighter-rouge">read_config</code> doesn’t have to pass anything into <code class="language-plaintext highlighter-rouge">reads_file</code> - so long as it “requires” the Io capability, <code class="language-plaintext highlighter-rouge">reads_file</code> is callable. In essence, ‘read_config’ is delegating Io implicitly.</p>
<p>Further, while main may have the Arbitrary capability, which denotes “all” capabilities, once you’re in <code class="language-plaintext highlighter-rouge">read_config</code> you drop everything except for what’s
<code class="language-plaintext highlighter-rouge">require</code>d. In this way capabilities are automatically narrowed throughout your program.</p>
<p>More on that ‘Arbitrary’ later.</p>
<p>You might be thinking “wow that looks familiar”. This is just a generalized ‘unsafe’, I think.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pub cap Unsafe;
fn does_unsafe_things() requires: Unsafe {}
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">unsafe</code> is a capability in a sense, but it’s <em>forgeable</em>. And that actually is a super important property. It lets us write unsafe code and wrap it in safe code - otherwise all of Rust would be ‘unsafe’.</p>
<p>So let’s take a look at our code again, but with forgery.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>fn read_config() -> String
forges: Io
{
reads_file("config.toml")
}
</code></pre></div></div>
<p>A new keyword appears - <code class="language-plaintext highlighter-rouge">forges</code>. This is a declaration of capabilities that we don’t inherit from the caller, instead we magically forge Io out of nowhere, and we can now use that capability.</p>
<p>Forgery, like wrapping any <code class="language-plaintext highlighter-rouge">unsafe</code> function in <code class="language-plaintext highlighter-rouge">safe</code>, would have to be heavily scrutinized. Are you <em>sure</em> you have that capability?</p>
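<p>None of the syntax above exists, but you can approximate the idea in today’s Rust with a zero-sized token type. This is only a sketch under that assumption: the <code class="language-plaintext highlighter-rouge">caps</code> module and <code class="language-plaintext highlighter-rouge">Io::forge</code> are made-up names, and unlike the imagined <code class="language-plaintext highlighter-rouge">requires</code> clause the token has to be passed explicitly rather than delegated implicitly.</p>

```rust
mod caps {
    /// Zero-sized capability token. The private field means code outside
    /// this module cannot build one with a struct literal.
    pub struct Io(());

    impl Io {
        /// The escape hatch, playing the role of `forges: Io`.
        /// It is `unsafe` so every forgery site is easy to find and audit.
        pub unsafe fn forge() -> Io {
            Io(())
        }
    }
}

use caps::Io;

/// `requires: Io` becomes an explicit token parameter.
fn reads_file(_io: &Io, path: &str) -> std::io::Result<String> {
    std::fs::read_to_string(path)
}

/// Delegation: the caller's token is handed down.
fn read_config(io: &Io, path: &str) -> std::io::Result<String> {
    reads_file(io, path)
}

/// Forgery: nothing is inherited from the caller, the token is conjured here.
fn read_config_forged(path: &str) -> std::io::Result<String> {
    // Reviewers should ask: are we *sure* this function may do I/O?
    let io = unsafe { Io::forge() };
    reads_file(&io, path)
}

fn main() {
    // `main` is where the program's initial capabilities come from.
    let io = unsafe { Io::forge() };
    println!("{}", read_config(&io, "config.toml").is_ok());
    println!("{}", read_config_forged("config.toml").is_ok());
}
```

<p>A function that takes no <code class="language-plaintext highlighter-rouge">Io</code> token and contains no <code class="language-plaintext highlighter-rouge">unsafe</code> block simply cannot call <code class="language-plaintext highlighter-rouge">reads_file</code>, which is roughly the narrowing behavior described earlier.</p>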
<p>Forgery also makes capabilities backwards compatible, right? Functions before the next edition would all just forge their capabilities. I would suggest a special capability, <code class="language-plaintext highlighter-rouge">Arbitrary</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>fn read_config() -> String
forges: Arbitrary
{
reads_file("config.toml")
}
</code></pre></div></div>
<p>By default all past editions would forge <code class="language-plaintext highlighter-rouge">Arbitrary</code>, which itself would encompass all capabilities. In new editions, you’d have to declare your capabilities everywhere.</p>
<p>There are some pretty obvious questions though. Are capabilities polymorphic? If <code class="language-plaintext highlighter-rouge">read_config</code> takes R: Read, do I need a capability to read from a backing file? Honestly, idk.</p>
<p>How do we compose capabilities? Maybe</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cap Io: IoRead + IoWrite;
</code></pre></div></div>
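<p>One way to play with composition in existing Rust is marker traits with a supertrait bound. Again, this is just a sketch: the <code class="language-plaintext highlighter-rouge">ReadOnly</code> and <code class="language-plaintext highlighter-rouge">Full</code> tokens are hypothetical, and the functions operate on an in-memory buffer so that nothing here depends on the filesystem.</p>

```rust
/// Marker traits standing in for `cap IoRead;` and `cap IoWrite;`.
trait IoRead {}
trait IoWrite {}

/// `cap Io: IoRead + IoWrite` expressed as a trait with supertraits,
/// implemented automatically for any token that holds both.
trait Io: IoRead + IoWrite {}
impl<T: IoRead + IoWrite> Io for T {}

/// Hypothetical capability tokens.
struct ReadOnly;
struct Full;

impl IoRead for ReadOnly {}
impl IoRead for Full {}
impl IoWrite for Full {}

/// Requires only the read capability.
fn read_len<C: IoRead>(_cap: &C, data: &[u8]) -> usize {
    data.len()
}

/// Requires only the write capability.
fn append<C: IoWrite>(_cap: &C, buf: &mut Vec<u8>, byte: u8) {
    buf.push(byte);
}

/// Requires the composed capability and delegates the narrower halves.
fn append_then_read<C: Io>(cap: &C, buf: &mut Vec<u8>) -> usize {
    append(cap, buf, 0);
    read_len(cap, buf.as_slice())
}

fn main() {
    let mut buf = vec![1u8, 2];
    println!("{}", append_then_read(&Full, &mut buf));
    println!("{}", read_len(&ReadOnly, &buf));
    // append(&ReadOnly, &mut buf, 9); // would not compile: ReadOnly lacks IoWrite
}
```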
<p>What if we want capabilities to be parameterized? Like <code class="language-plaintext highlighter-rouge">cap Io<&Path></code> ? The short answer is, I have no idea, I told you this was just to get my thoughts out.</p>
<p>Anyway, some scattered thoughts here, since people have been talking about language capabilities. Koka is doing some cool stuff in this area, but I haven’t dug into it enough.</p>
<p>There are some cool implications here. In theory, if you know all of a program’s capabilities you can generate sandboxes for it at compile time. You can force callers to abide by arbitrary constraints. You could even have caps like “Panics” etc.</p>
Supply Chain Thoughts2022-05-10T00:00:00+00:00http://insanitybit.github.io/2022/05/10/supply-chain-thoughts
<h1 id="supply-chain-thoughts">Supply Chain Thoughts</h1>
<p>So there was a malicious crates.io package. No surprise there, malicious packages have been a thing for years and it was only a matter of time before one was discovered targeting Rust.</p>
<p>So what can be done about that? A number of things. (tl;dr at bottom, although tbh this is a short post)</p>
<h4 id="reduce-viable-crate-names">Reduce viable crate names.</h4>
<p>The vast majority of typosquatting falls under two categories.</p>
<ol>
<li>“I thought decimal was spelled decimel, or I otherwise mistyped it”</li>
<li>“I expected there to be a package called ‘collections’ so I just tried adding it”</li>
</ol>
<p>(1) is trivial to solve. Enforce a minimum edit distance between crate names. Obviously all current crates would be ‘grandfathered’ and not have to deal with this, but this means that someone has to typo twice instead of once, in exactly the right way, which is radically less likely. The crates.io team can also monitor for names within a slightly larger edit distance threshold, publishing a log of “this crate was published with an edit distance of N” - a feed like that could be ingested and monitored by the public.</p>
<p>(2) is a bit harder to solve. It basically requires reserving some names. “Thankfully” users have been… graciously reserving every common name on crates.io anyways in protest of the lack of namespaces. So, uh, thanks?</p>
<p>Anyway, I suggest that crate authors go ahead and reserve names. Everyone always calls my company ‘graphl’ by accident so I went ahead and created a ‘graphl’ repo to redirect users - I’d just suggest this sort of thing as good hygiene.</p>
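<p>As a rough illustration of the edit distance check in (1), here is a plain Levenshtein implementation plus a hypothetical <code class="language-plaintext highlighter-rouge">conflicting_names</code> helper. The helper and its threshold are my assumptions; this is not how crates.io works today.</p>

```rust
/// Levenshtein distance between two names, using two rolling rows.
fn edit_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] = distance between the first i chars of `a` and first j chars of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0; b.len() + 1];
    for (i, ca) in a.iter().enumerate() {
        curr[0] = i + 1;
        for (j, cb) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr[j + 1] = substitution.min(deletion).min(insertion);
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

/// Hypothetical registry rule: a new name conflicts with every existing name
/// that is closer than `min_distance` edits away.
fn conflicting_names<'a>(
    candidate: &str,
    existing: &[&'a str],
    min_distance: usize,
) -> Vec<&'a str> {
    existing
        .iter()
        .copied()
        .filter(|name| edit_distance(candidate, name) < min_distance)
        .collect()
}

fn main() {
    let existing = ["decimal", "serde"];
    println!("{}", edit_distance("decimel", "decimal"));
    println!("{:?}", conflicting_names("decimel", &existing, 2));
}
```

<p>With a minimum distance of 2, ‘decimel’ would be rejected because it is a single substitution away from ‘decimal’.</p>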
<h4 id="reduce-impact-of-malicious-builds">Reduce Impact of Malicious Builds</h4>
<p>It’s been discussed before. A <code class="language-plaintext highlighter-rouge">build.rs</code> should have to state its requirements and cargo should enforce those. This is what browser extensions do, this is what every app does. You expose capabilities, you require a manifest, you lock that manifest, and you alert users to changes in that manifest.</p>
<p>This is RFC worthy and likely requires lots of discussion on what it would look like, but I would suggest looking at how browsers have been doing this.</p>
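<p>To give a feel for the “lock the manifest and alert on changes” part, here is a sketch of what the comparison could look like. Everything in it is hypothetical: cargo has no such manifest today, and the <code class="language-plaintext highlighter-rouge">BuildPermission</code> variants are just examples I picked.</p>

```rust
use std::collections::BTreeSet;

/// Hypothetical permissions a build script could declare.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum BuildPermission {
    ReadEnv,
    WriteOutDir,
    RunProcess,
    Network,
}

/// What a crate's build.rs says it needs, as recorded in a lock file.
#[derive(Debug, Clone, Default)]
struct BuildManifest {
    permissions: BTreeSet<BuildPermission>,
}

impl BuildManifest {
    fn new(permissions: &[BuildPermission]) -> Self {
        BuildManifest {
            permissions: permissions.iter().copied().collect(),
        }
    }

    /// Permissions requested now that were not in the locked manifest.
    /// A non-empty result is what the user should be alerted about.
    fn newly_requested(&self, locked: &BuildManifest) -> Vec<BuildPermission> {
        self.permissions
            .difference(&locked.permissions)
            .copied()
            .collect()
    }

    /// Enforcement point: an action is allowed only if it was declared.
    fn allows(&self, action: BuildPermission) -> bool {
        self.permissions.contains(&action)
    }
}

fn main() {
    use BuildPermission::*;
    let locked = BuildManifest::new(&[ReadEnv, WriteOutDir]);
    let updated = BuildManifest::new(&[ReadEnv, WriteOutDir, RunProcess, Network]);
    println!("new permissions: {:?}", updated.newly_requested(&locked));
    println!("locked manifest allows network: {}", locked.allows(Network));
}
```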
<p>Of course, the obvious caveat is “but if your build.rs can modify your code, the attacker is in prod!”. Yes. Counter-intuitively, that can be far less impactful than an attacker in my build system.</p>
<p>First of all, we have great tools for restricting production services. If an attacker owns a production service via a compromised package, chances are they can’t even get a reverse shell - most services require no egress traffic to the internet. With containers being so trivial they’re also bound to a set of namespaces and likely can’t access things like keys on the host.</p>
<p>In a CI/CD environment things are not so simple. Builds often reach out to the internet, and setting up mirrors is not always straightforward. Further, build environments will often have a lot of credentials.</p>
<p>I’m not happy about an attacker who can mess with my production binaries, but that’s a threat I already have to consider since I already assume RCE in these services - it frankly adds very little to my threat model. Obviously I have to care about code execution in my build environment too - like my tests executing, but again, I have way more tools for dealing with that sort of thing.</p>
<p>I guess the short version is that it’s really easy to sandbox runtime code because I control the vast majority of behaviors, and it’s really hard for me to sandbox build code because I control almost none of the behaviors.</p>
<h4 id="the-update-framework">The Update Framework</h4>
<p>Lastly, we should have package signing. There are a number of ways this improves things. Most obviously, if crates.io gets owned the attacker can’t just modify crates and own everyone else.</p>
<p>TUF has great properties like multiple signing parties, which means I can also have my CI/CD pipeline sign packages, which means even if my laptop is owned I can leverage all of my various branch controls as well - this is great, it gives me a way to compose all of my security controls.</p>
<p>I don’t really feel like digging into the virtues of package signing, it’s been discussed a million times.</p>
<h4 id="ok-but-what-do-i-do-now">OK but what do I do now?</h4>
<p>Yeah, good question. I guess there are a few things.</p>
<ol>
<li>Sandbox your runtime services. One of the best wins you can get is removing access to the public internet for them - highly recommend.</li>
<li>Run your builds in stages. So like, first vendor dependencies, then disable networking to the public internet. Run builds in Docker. Run tests in a separate, limited environment.</li>
<li>Limit exposure to secrets: only run CI/CD tasks that include secrets on code that has already been reviewed, that has passed tests, etc.</li>
<li>Maybe consider cargo-crev? Honestly, I have looked at it, and I want to use it but haven’t had time.</li>
<li>Advocate for the mitigations above.</li>
</ol>
<h4 id="tldr">tl;dr</h4>
<ul>
<li>We can kill typosquatting with no breakage, no complex systems, etc, with a basic edit distance check - please do this</li>
<li>We should start the process of figuring out how to sandbox builds</li>
<li>We should get The Update Framework implemented</li>
</ul>
<p>Please at least do the typosquatting thing. Happy to chat more about it, or even discuss funding, or whatever - I’ve been asking for the typosquatting thing for years, idk where I’m supposed to suggest these things.</p>
Static Intersections With Pytype2020-07-19T00:00:00+00:00http://insanitybit.github.io/2020/07/19/intersection-types-in-python
<h1 id="static-intersections-with-pytype">Static Intersections With Pytype</h1>
<p>Python has had a static type system for quite some time as initially defined in <a href="https://www.python.org/dev/peps/pep-0484/">PEP 484</a> (with additions in future PEPs). Types allow one to statically verify various aspects of a program - that a value conforms to some set of constraints (methods, properties, assertions).</p>
<p>The most popular and well known type checker for Python has been mypy, but an interesting quality of Python is that, because types are so separate from the language and runtime, there are <em>multiple</em> competing type systems. This means that one can run multiple type checkers, each with their own strengths and weaknesses, on a single Python project. I’m aware of 3 different type checkers for Python at this time.</p>
<p>This has been extremely useful for me due to mypy <a href="https://github.com/python/typing/issues/213">lacking intersection types</a>, a feature that I very much miss from Rust where these are trivially implemented with traits.</p>
<p>An intersection type is simple - it is a type that implements interface A <strong>and</strong> interface B at the same time. This is really helpful when you want to extend an existing type that you don’t “own” (ie: comes from a library) - you can just attach a new interface to it. In Rust this is as simple as importing a trait for that type, in Python things are not so simple.</p>
<p>The goal I had was to write code like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">class</span> <span class="nc">A</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">foo</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">print</span><span class="p">(</span><span class="s">'foo'</span><span class="p">)</span>
<span class="k">class</span> <span class="nc">B</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">bar</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">print</span><span class="p">(</span><span class="s">'bar'</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">extend_a</span><span class="p">(</span><span class="n">b</span><span class="p">:</span> <span class="n">Type</span><span class="p">[</span><span class="n">_</span><span class="p">])</span> <span class="o">-></span> <span class="n">Type</span><span class="p">[</span><span class="n">A</span> <span class="o">+</span> <span class="n">_</span><span class="p">]:</span> <span class="c1"># (This is fake mypy)
</span> <span class="k">pass</span> <span class="c1"># ...
</span>
<span class="n">A</span> <span class="o">=</span> <span class="n">extend_a</span><span class="p">(</span><span class="n">B</span><span class="p">)</span> <span class="c1"># Type[A + B]
</span> <span class="n">A</span><span class="p">().</span><span class="n">foo</span><span class="p">()</span>
<span class="n">A</span><span class="p">().</span><span class="n">bar</span><span class="p">()</span>
<span class="c1"># A().baz() # This should not type check!
</span></code></pre></div></div>
<p>This code is possible to express at runtime fairly easily. With Python very little isn’t possible; after all, we can just monkey patch the methods of one class directly onto the other, or form a metaclass from the two classes we wish to combine.</p>
<p>Here’s the metaclass implementation. If you run this code with the above class definitions it will print ‘foo’ followed by ‘bar’. A pretty crazy power of Python’s metaclasses.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">def</span> <span class="nf">extend_a</span><span class="p">(</span><span class="n">b</span><span class="p">):</span>
<span class="k">return</span> <span class="nb">type</span><span class="p">(</span><span class="s">'A'</span><span class="p">,</span> <span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">b</span><span class="p">),</span> <span class="p">{})</span>
<span class="n">A</span> <span class="o">=</span> <span class="n">extend_a</span><span class="p">(</span><span class="n">B</span><span class="p">)</span>
<span class="n">A</span><span class="p">().</span><span class="n">foo</span><span class="p">()</span>
<span class="n">A</span><span class="p">().</span><span class="n">bar</span><span class="p">()</span>
</code></pre></div></div>
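<p>For comparison, the monkey patching route mentioned earlier might look something like the sketch below. The <code class="language-plaintext highlighter-rouge">patch_onto</code> helper is a name made up for this example, and the methods return strings rather than printing so the result is easy to check.</p>

```python
class A:
    def foo(self):
        return 'foo'


class B:
    def bar(self):
        return 'bar'


def patch_onto(target, source):
    """Copy the plain (non-dunder) methods of `source` onto `target` in place."""
    for name, value in vars(source).items():
        if callable(value) and not name.startswith('__'):
            setattr(target, name, value)
    return target


patch_onto(A, B)
```

<p>After the call, instances of <code class="language-plaintext highlighter-rouge">A</code> respond to both <code class="language-plaintext highlighter-rouge">foo</code> and <code class="language-plaintext highlighter-rouge">bar</code>, but unlike the <code class="language-plaintext highlighter-rouge">type(...)</code> version this mutates <code class="language-plaintext highlighter-rouge">A</code> itself and does not make it a subclass of <code class="language-plaintext highlighter-rouge">B</code>.</p>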
<p>Metaclasses give us this incredible power to extend one type with another, but there’s no way to express this in a way that mypy can understand. There is no way, in mypy, to ‘name’ the class <code class="language-plaintext highlighter-rouge">A + B</code> and so we can not write the annotation that would allow mypy to know that we can call <code class="language-plaintext highlighter-rouge">foo</code> and <code class="language-plaintext highlighter-rouge">bar</code> on A.</p>
<p>Thankfully there’s <code class="language-plaintext highlighter-rouge">pytype</code>, a type system from Google that takes a fairly different approach from mypy. Whereas mypy relies heavily on the definitions of types in various places, pytype is driven from type inference, only using PEP 484 annotations as assertions that must be upheld during inference.</p>
<p>Here’s the example that pytype provides in their <a href="https://github.com/google/pytype">repo</a>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">List</span>
<span class="k">def</span> <span class="nf">get_list</span><span class="p">()</span> <span class="o">-></span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">]:</span>
<span class="n">lst</span> <span class="o">=</span> <span class="p">[</span><span class="s">"PyCon"</span><span class="p">]</span>
<span class="n">lst</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="mi">2019</span><span class="p">)</span>
<span class="k">return</span> <span class="p">[</span><span class="nb">str</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">lst</span><span class="p">]</span>
<span class="c1"># mypy: line 4: error: Argument 1 to "append" of "list" has
</span> <span class="c1"># incompatible type "int"; expected "str"
</span></code></pre></div></div>
<p>As we can see, pytype looks a lot further than just the type annotations, or at just declarations. It leverages contextual information to determine types.</p>
<p>Incredibly, pytype is even able to infer metaclasses!</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">def</span> <span class="nf">extend_a</span><span class="p">(</span><span class="n">b</span><span class="p">):</span> <span class="c1"># (pytype will infer the types!)
</span> <span class="k">return</span> <span class="nb">type</span><span class="p">(</span><span class="s">'A'</span><span class="p">,</span> <span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">b</span><span class="p">),</span> <span class="p">{})</span>
<span class="n">A</span> <span class="o">=</span> <span class="n">extend_a</span><span class="p">(</span><span class="n">B</span><span class="p">)</span>
<span class="n">A</span><span class="p">().</span><span class="n">foo</span><span class="p">()</span>
<span class="n">A</span><span class="p">().</span><span class="n">bar</span><span class="p">()</span>
</code></pre></div></div>
<p>This code actually, incredibly, type checks with <code class="language-plaintext highlighter-rouge">pytype</code>!</p>
<p><code class="language-plaintext highlighter-rouge">pytype</code> is able to generate a <code class="language-plaintext highlighter-rouge">.pyi</code> like the following:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">class</span> <span class="nc">A</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">B</span><span class="p">):</span> <span class="p">...</span>
<span class="k">class</span> <span class="nc">B</span><span class="p">:</span>
<span class="k">def</span> <span class="nf">bar</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="bp">None</span><span class="p">:</span> <span class="p">...</span>
<span class="k">def</span> <span class="nf">extend_a</span><span class="p">(</span><span class="n">b</span><span class="p">)</span> <span class="o">-></span> <span class="nb">type</span><span class="p">:</span> <span class="p">...</span>
</code></pre></div></div>
<p>It’s a bit strange - A appears to be a recursive type on itself… but it works. And we can even compose multiple types:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">class</span> <span class="nc">A</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">foo</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">print</span><span class="p">(</span><span class="s">'foo'</span><span class="p">)</span>
<span class="k">class</span> <span class="nc">B</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">bar</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">print</span><span class="p">(</span><span class="s">'bar'</span><span class="p">)</span>
<span class="k">class</span> <span class="nc">C</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">baz</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">print</span><span class="p">(</span><span class="s">'baz'</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">extend_a</span><span class="p">(</span><span class="o">*</span><span class="n">b</span><span class="p">):</span> <span class="c1"># (pytype will infer the return type!)
</span> <span class="k">return</span> <span class="nb">type</span><span class="p">(</span><span class="s">'A'</span><span class="p">,</span> <span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="o">*</span><span class="n">b</span><span class="p">),</span> <span class="p">{})</span>
<span class="n">A</span> <span class="o">=</span> <span class="n">extend_a</span><span class="p">(</span><span class="n">B</span><span class="p">,</span> <span class="n">C</span><span class="p">)</span>
<span class="n">A</span><span class="p">().</span><span class="n">foo</span><span class="p">()</span>
<span class="n">A</span><span class="p">().</span><span class="n">bar</span><span class="p">()</span>
<span class="n">A</span><span class="p">().</span><span class="n">baz</span><span class="p">()</span>
</code></pre></div></div>
<p>Generating a <code class="language-plaintext highlighter-rouge">.pyi</code> of:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class A(A, B, C): ...
class B:
def bar(self) -> None: ...
class C:
def baz(self) -> None: ...
def extend_a(*b) -> type: ...
</code></pre></div></div>
<p>And just to prove it really works let’s call a method that doesn’t exist on our combined A:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">A</span><span class="p">().</span><span class="n">bop</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> No attribute 'bop' on A [attribute-error]
</code></pre></div></div>
<p>This is pretty incredible! It’s not <em>quite</em> as powerful as a trait system like Rust’s where we can just import a trait and start using methods from it on a type, but this gets us fairly close. We can attach interfaces at runtime to a type but reason about it statically.</p>
<p>One caveat is you’ll probably want to have mypy ignore that file. If we run mypy against this code, here’s what we get:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> scratch.py:19: error: Cannot assign to a type
scratch.py:22: error: "A" has no attribute "bar"
scratch.py:23: error: "A" has no attribute "baz"
</code></pre></div></div>
<p>Eventually, if mypy gets intersection types, this will be possible with just a single tool. But for now this is not an unreasonable workaround.</p>
<p>I think it’s fascinating that Python has multiple implementations of a type checker, allowing for them to grow in different directions, and even for one part of a project to type check with one tool but another part with another tool - it’s something I have never experienced with another language.</p>
<p>Pytype’s technique of leveraging inference is obviously very powerful. What I’d love to see is something that allows me to leverage that a bit more explicitly, giving me the ability to get that fake mypy code I’d written to work.</p>
Rust 20202019-10-30T00:00:00+00:00http://insanitybit.github.io/2019/10/30/rust-2020-wishlist---stabilize-the-ecosystem
<p>I’ve been using Rust since just before it hit 1.0. Since that time the language has gotten considerably better;
non-lexical lifetimes, impl Trait, async/await, compiler performance and error improvements, and more.</p>
<p>In 2019 the big focus was async/await, or at least as an outsider that is how it has appeared. The end result
looks like it will deliver what we’ve all been waiting for - efficient async code that works well with Rust’s
borrow checker.</p>
<p>As 2019 is coming to a close and async/await is stabilizing, there is the obvious question - what next?</p>
<p>Many have suggested revisiting the 2019 roadmap. Reddit user 0b_0101_001_1010 <a href="https://www.reddit.com/r/rust/comments/dorinl/a_call_for_blogs_2020/f5pv3c3/">summarizes it well in this post</a>.</p>
<p>This would include taking on features like specialization, const generics, and generic associated types.</p>
<p>To be frank, in the last two years I have not tracked Rust’s RFCs or development nearly as much. My time goes
to Grapl, to ensuring it works as best as it can, that the features I need can be implemented quickly. I do
not know the specific features that I want stabilized - Rust, the language, feels quite close to where I want it
to be.</p>
<p>GAT, async/await, const generics, etc, will all improve my development quality of life but I couldn’t really say
“with these features I will finally be productive”.</p>
<p>Instead, my productivity tends to come down to two things:</p>
<ul>
<li><strong>High quality</strong> and <strong>stable</strong> libraries for the work I’m doing</li>
<li>Fast iteration cycles</li>
</ul>
<p>Both of these are still lacking in Rust. Maybe that’s because of a missing feature - GAT, async/await, whatever. I
won’t guess which.</p>
<p>I remember earlier in the Rust days when custom Derive was still unstable. Constant breaking changes in libraries,
tons of churn - productivity seemed impossible to me. We’re past those days, thankfully, but the same pains are
still my primary pains.</p>
<p>Libraries that are <1.0.0 often make major breaking changes, as well as major bug fixes, leaving me with the impossible
choice of having to use a broken version or break all of my code for the new version.</p>
<p>I want to see this change. I want whatever it is that is keeping libraries from stabilizing to be solved.</p>
<p>One big library for me, which can be quite painful to upgrade at times, is Rusoto. This is of no fault of the Rusoto
developers - it’s a massive project - but I’d like to better understand what those developers need in order to hit
a 1.0, and how the Rust language (or surrounding ecosystem) can satisfy those needs. It seems to me that at least
part of that is going to be solved with async/await - but I’m unsure.</p>
<p>Stability and quality are often transitive. A core dependency may be unstable, therefore all crates with that dependency
are unstable. I’d like to see those identified and tackled.</p>
<p>With new features always around the corner I think library developers may be reluctant to stabilize - though that
is just a guess.</p>
<p>In terms of iteration cycle, compile times are still quite slow, and so is my IntelliJ autocompletion.</p>
<p>I won’t prescribe solutions, I don’t know them. But these are the pain points and areas that I think would help me be
productive as I continue to build a project that is primarily Rust.</p>
How Grapl Avoids Fighting Data2019-08-17T00:00:00+00:00http://insanitybit.github.io/2019/08/17/how-grapl-avoids-fighting-data
<p>Detection and Response is all about data. Analysts collect many billions of logs every single day and store them, searching through the noise for some signal that might indicate malicious behavior. What has become obvious is that this collection of data is not slowing down at all - we’re instrumenting more services and systems all while companies are expanding their own asset inventories, or pulling in new data sources from the cloud. Data volume, year after year, is increasing rapidly.</p>
<p>One thing that bothers me about existing state of the art in modern SIEMs is that they punish you for having a lot of data. With increasing data storage costs, licensing fees, and slower querying you can expect your SIEM experience to degrade over time, not get better. This is simply unacceptable to me - this fighting of data is demoralizing and wasteful, and it’s a problem that will only get worse as your org scales up.</p>
<p>In this post I’m going to cover a few areas where Grapl far exceeds the existing SIEM state of the art and aims to make this Sisyphean fight against data a thing of the past.</p>
<h3 id="storage">Storage</h3>
<p>Perhaps the most painful constraint that SIEMs impose on customers is the cost of data storage. Data storage in a SIEM is effectively linear - every log is stored in full, and so if you send N logs up it takes O(N) bytes of space.</p>
<p>A significant part of why SIEMs scale storage linearly is because they work with unstructured data - a SIEM can not generally say things like “Oh, this field and that field are equivalent so I can just store one copy”.</p>
<p>One of the greatest pains I hear when talking to others in my field is that they have huge burdens around data storage. It is not uncommon for companies to spend millions or even tens of millions on data storage - both through physical capacity planning as well as licensing fees.</p>
<p>This massive cost means that even a relatively small IR team can be spending a disproportionate amount of security budget, just to collect the data that’s needed for other work.</p>
<p>Grapl aims to significantly improve upon this state. Grapl works with structured data (after an explicit parsing stage), and by leveraging a concept of ‘identity’, data storage grows closer to log(N) in most cases.</p>
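<p>Let’s look at two Sysmon logs. Both relate to the same entity - the process with pid 1772 and GUID <code class="language-plaintext highlighter-rouge">{331D737B-28FF-5C0B-0000-001081250F00}</code>. (The <code class="language-plaintext highlighter-rouge">ProcessID='5324'</code> on the <code class="language-plaintext highlighter-rouge">Execution</code> element is the process that wrote the event, not the process the event describes.)</p>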
<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="nt"><Event</span>
<span class="na">xmlns=</span><span class="s">'http://schemas.microsoft.com/win/2004/08/events/event'</span><span class="nt">></span>
<span class="nt"><System></span>
<span class="nt"><Provider</span> <span class="na">Name=</span><span class="s">'Microsoft-Windows-Sysmon'</span> <span class="na">Guid=</span><span class="s">'{5770385F-C22A-43E0-BF4C-06F5698FFBD9}'</span><span class="nt">/></span>
<span class="nt"><EventID></span>2<span class="nt"></EventID></span>
<span class="nt"><Version></span>4<span class="nt"></Version></span>
<span class="nt"><Level></span>4<span class="nt"></Level></span>
<span class="nt"><Task></span>2<span class="nt"></Task></span>
<span class="nt"><Opcode></span>0<span class="nt"></Opcode></span>
<span class="nt"><Keywords></span>0x8000000000000000<span class="nt"></Keywords></span>
<span class="nt"><TimeCreated</span> <span class="na">SystemTime=</span><span class="s">'2018-12-08T20:37:53.775868800Z'</span><span class="nt">/></span>
<span class="nt"><EventRecordID></span>10<span class="nt"></EventRecordID></span>
<span class="nt"><Correlation/></span>
<span class="nt"><Execution</span> <span class="na">ProcessID=</span><span class="s">'5324'</span> <span class="na">ThreadID=</span><span class="s">'2928'</span><span class="nt">/></span>
<span class="nt"><Channel></span>Microsoft-Windows-Sysmon/Operational<span class="nt"></Channel></span>
<span class="nt"><Computer></span>DESKTOP-34EOTDT<span class="nt"></Computer></span>
<span class="nt"><Security</span> <span class="na">UserID=</span><span class="s">'S-1-5-18'</span><span class="nt">/></span>
<span class="nt"></System></span>
<span class="nt"><EventData></span>
<span class="nt"><Data</span> <span class="na">Name=</span><span class="s">'RuleName'</span><span class="nt">></Data></span>
<span class="nt"><Data</span> <span class="na">Name=</span><span class="s">'UtcTime'</span><span class="nt">></span>2018-12-08 20:37:53.763<span class="nt"></Data></span>
<span class="nt"><Data</span> <span class="na">Name=</span><span class="s">'ProcessGuid'</span><span class="nt">></span>{331D737B-28FF-5C0B-0000-001081250F00}<span class="nt"></Data></span>
<span class="nt"><Data</span> <span class="na">Name=</span><span class="s">'ProcessId'</span><span class="nt">></span>1772<span class="nt"></Data></span>
<span class="nt"><Data</span> <span class="na">Name=</span><span class="s">'Image'</span><span class="nt">></span>C:\Program Files (x86)\Google\Chrome\Application\chrome.exe<span class="nt"></Data></span>
<span class="nt"><Data</span> <span class="na">Name=</span><span class="s">'TargetFilename'</span><span class="nt">></span>C:\Users\andy\AppData\Local\Google\Chrome\User Data\Default\e46787f2-8ec3-46f9-b245-000fe5f85fa6.tmp<span class="nt"></Data></span>
<span class="nt"><Data</span> <span class="na">Name=</span><span class="s">'CreationUtcTime'</span><span class="nt">></span>2018-12-08 02:14:24.177<span class="nt"></Data></span>
<span class="nt"><Data</span> <span class="na">Name=</span><span class="s">'PreviousCreationUtcTime'</span><span class="nt">></span>2018-12-08 20:37:53.747<span class="nt"></Data></span>
<span class="nt"></EventData></span>
<span class="nt"></Event></span>
<span class="nt"><Event</span>
<span class="na">xmlns=</span><span class="s">'http://schemas.microsoft.com/win/2004/08/events/event'</span><span class="nt">></span>
<span class="nt"><System></span>
<span class="nt"><Provider</span> <span class="na">Name=</span><span class="s">'Microsoft-Windows-Sysmon'</span> <span class="na">Guid=</span><span class="s">'{5770385F-C22A-43E0-BF4C-06F5698FFBD9}'</span><span class="nt">/></span>
<span class="nt"><EventID></span>2<span class="nt"></EventID></span>
<span class="nt"><Version></span>4<span class="nt"></Version></span>
<span class="nt"><Level></span>4<span class="nt"></Level></span>
<span class="nt"><Task></span>2<span class="nt"></Task></span>
<span class="nt"><Opcode></span>0<span class="nt"></Opcode></span>
<span class="nt"><Keywords></span>0x8000000000000000<span class="nt"></Keywords></span>
<span class="nt"><TimeCreated</span> <span class="na">SystemTime=</span><span class="s">'2018-12-08T20:38:17.621228500Z'</span><span class="nt">/></span>
<span class="nt"><EventRecordID></span>19<span class="nt"></EventRecordID></span>
<span class="nt"><Correlation/></span>
<span class="nt"><Execution</span> <span class="na">ProcessID=</span><span class="s">'5324'</span> <span class="na">ThreadID=</span><span class="s">'2928'</span><span class="nt">/></span>
<span class="nt"><Channel></span>Microsoft-Windows-Sysmon/Operational<span class="nt"></Channel></span>
<span class="nt"><Computer></span>DESKTOP-34EOTDT<span class="nt"></Computer></span>
<span class="nt"><Security</span> <span class="na">UserID=</span><span class="s">'S-1-5-18'</span><span class="nt">/></span>
<span class="nt"></System></span>
<span class="nt"><EventData></span>
<span class="nt"><Data</span> <span class="na">Name=</span><span class="s">'RuleName'</span><span class="nt">></Data></span>
<span class="nt"><Data</span> <span class="na">Name=</span><span class="s">'UtcTime'</span><span class="nt">></span>2018-12-08 20:38:17.606<span class="nt"></Data></span>
<span class="nt"><Data</span> <span class="na">Name=</span><span class="s">'ProcessGuid'</span><span class="nt">></span>{331D737B-28FF-5C0B-0000-001081250F00}<span class="nt"></Data></span>
<span class="nt"><Data</span> <span class="na">Name=</span><span class="s">'ProcessId'</span><span class="nt">></span>1772<span class="nt"></Data></span>
<span class="nt"><Data</span> <span class="na">Name=</span><span class="s">'Image'</span><span class="nt">></span>C:\Program Files (x86)\Google\Chrome\Application\chrome.exe<span class="nt"></Data></span>
<span class="nt"><Data</span> <span class="na">Name=</span><span class="s">'TargetFilename'</span><span class="nt">></span>C:\Users\andy\AppData\Local\Google\Chrome\User Data\Default\daa42f83-e6b5-4528-a7ed-e0778b91783f.tmp<span class="nt"></Data></span>
<span class="nt"><Data</span> <span class="na">Name=</span><span class="s">'CreationUtcTime'</span><span class="nt">></span>2018-12-08 02:14:24.177<span class="nt"></Data></span>
<span class="nt"><Data</span> <span class="na">Name=</span><span class="s">'PreviousCreationUtcTime'</span><span class="nt">></span>2018-12-08 20:38:17.591<span class="nt"></Data></span>
<span class="nt"></EventData></span>
<span class="nt"></Event></span>
</code></pre></div></div>
<p>The process, Chrome, is operating on two distinct cache files. These sorts of operations happen extremely frequently, to the point where your config may even whitelist out the directory entirely.</p>
<p>There’s clearly a ton of redundancy between these two logs - the process pid, image name, process guid, command line, etc, will be repeated in every single log, wasting hundreds of bytes for every additional log.</p>
<p>I have a 16MB dump of Sysmon logs from a virtual machine. Let’s quickly scrape out all of the lines that aren’t completely unique:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> cat ./events.xml | sort | uniq -u | wc -c
> 1351536
</code></pre></div></div>
<p>We can see that the actual unique information in this 16MB file is closer to ~1.3MB. <strong>That’s less than 10% of the original data that we actually care about - an order of magnitude data reduction!</strong></p>
<p>This ‘unique’ approach is actually very similar to how Grapl works - it takes information in logs, such as pids, paths, or timestamps, to determine a canonical identity, called a node key. This is not unlike Sysmon’s Process GUIDs, but entirely server-side. Grapl then coalesces the information for each entity, throwing out redundant information, and storing only the unique information.</p>
<p>The end result is that Grapl’s storage does not grow linearly with logs you send up but instead it’s linear with the unique information sent up - practically, <strong>this will be closer to a logarithmic growth rate</strong>. The first log for a process create will likely contain mostly unique information, but for all subsequent actions by that process the information stored will decrease considerably.</p>
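<p>A toy version of that coalescing step is sketched below. It is not Grapl’s implementation: the <code class="language-plaintext highlighter-rouge">node_key</code> here simply reuses the host name and Sysmon’s <code class="language-plaintext highlighter-rouge">ProcessGuid</code> as a stand-in for the server-side identity described above, and events are assumed to be already parsed into flat dictionaries.</p>

```python
from typing import Dict, Iterable, Set, Tuple

Event = Dict[str, str]
NodeKey = Tuple[str, str]


def node_key(event: Event) -> NodeKey:
    # Assumption: host + ProcessGuid stands in for a real, server-derived node key.
    return (event["Computer"], event["ProcessGuid"])


def coalesce(events: Iterable[Event]) -> Dict[NodeKey, Dict[str, Set[str]]]:
    """Merge events per entity, keeping each distinct field value only once."""
    nodes: Dict[NodeKey, Dict[str, Set[str]]] = {}
    for event in events:
        properties = nodes.setdefault(node_key(event), {})
        for field, value in event.items():
            properties.setdefault(field, set()).add(value)
    return nodes


def stored_values(nodes: Dict[NodeKey, Dict[str, Set[str]]]) -> int:
    """Count how many field values are actually kept after coalescing."""
    return sum(len(values) for props in nodes.values() for values in props.values())
```

<p>Fed the two Chrome events above (reduced to computer, GUID, pid, image and target filename), this keeps a single node whose image and pid are stored once and whose target filename has two entries, so six values are kept instead of ten.</p>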
<h3 id="analyzers">Analyzers</h3>
<p>Most SIEM alerting works via a scheduled search. Every N minutes your search runs over M minutes of data (where N and M are often the same).</p>
<p>Each of these searches is, more or less, O(N). So if you’re searching over the last 10 minutes of data today, and your search runs in X seconds, then next year when your data volume has doubled your search will run in roughly 2X seconds.</p>
<p>What’s worse is that join performance in a traditional SIEM is going to be along the lines of exponential, making joins effectively pointless. To put this into perspective, here is an excerpt from Splunk’s documentation on subsearches (which joins leverage):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Additionally, by default subsearches return a maximum of 10,000 results and have a maximum runtime of 60 seconds. In large production environments it is quite possible that the subsearch in this example will timeout before it completes.
</code></pre></div></div>
<p>When using subsearches you have to ensure that you send a bounded amount of data in, or your search may be truncated. And I’m not just picking on Splunk; this is fundamental to the way that traditional SIEMs work.</p>
<p>Grapl’s searches, what it refers to as Analyzers, have two important properties:</p>
<ul>
<li>They are real time</li>
<li>Search complexity grows based on the query, not the data</li>
</ul>
<p>What this ends up meaning is that <strong>Analyzer execution is effectively constant time</strong>, which is to say that your analyzers that execute in X seconds today will execute in ~X seconds next year, even if your data size has increased dramatically.</p>
<p>Here is a search for a suspicious execution based on a ‘winword.exe’ parent process:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">analyzer</span><span class="p">(</span><span class="n">client</span><span class="p">:</span> <span class="n">DgraphClient</span><span class="p">,</span> <span class="n">node</span><span class="p">:</span> <span class="n">NodeView</span><span class="p">,</span> <span class="n">sender</span><span class="p">:</span> <span class="n">Any</span><span class="p">):</span>
    <span class="n">process</span> <span class="o">=</span> <span class="n">node</span><span class="p">.</span><span class="n">as_process_view</span><span class="p">()</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="n">process</span><span class="p">:</span> <span class="k">return</span>
    <span class="n">p</span> <span class="o">=</span> <span class="p">(</span>
        <span class="n">ProcessQuery</span><span class="p">()</span>
        <span class="p">.</span><span class="n">with_process_name</span><span class="p">(</span><span class="n">eq</span><span class="o">=</span><span class="s">"winword.exe"</span><span class="p">)</span>
        <span class="p">.</span><span class="n">with_children</span><span class="p">(</span><span class="n">ProcessQuery</span><span class="p">())</span>
        <span class="p">.</span><span class="n">query_first</span><span class="p">(</span><span class="n">client</span><span class="p">,</span> <span class="n">contains_node_key</span><span class="o">=</span><span class="n">process</span><span class="p">.</span><span class="n">node_key</span><span class="p">)</span>
    <span class="p">)</span>
</code></pre></div></div>
<p>Note the <code class="language-plaintext highlighter-rouge">contains_node_key=process.node_key</code> on line <code class="language-plaintext highlighter-rouge">8</code>: this tells the query builder to search for a subgraph that matches the described pattern and contains that node_key somewhere in the matched graph.</p>
<p>Under the hood it is as if it generates these two separate queries:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">ProcessQuery</span><span class="p">()</span>
<span class="p">.</span><span class="n">with_process_name</span><span class="p">(</span><span class="n">eq</span><span class="o">=</span><span class="s">"winword.exe"</span><span class="p">)</span>
<span class="p">.</span><span class="n">with_node_key</span><span class="p">(</span><span class="n">eq</span><span class="o">=</span><span class="n">process</span><span class="p">.</span><span class="n">node_key</span><span class="p">)</span>
<span class="p">.</span><span class="n">with_children</span><span class="p">(</span><span class="n">ProcessQuery</span><span class="p">())</span>
</code></pre></div></div>
<p>and</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">ProcessQuery</span><span class="p">()</span>
<span class="p">.</span><span class="n">with_process_name</span><span class="p">(</span><span class="n">eq</span><span class="o">=</span><span class="s">"winword.exe"</span><span class="p">)</span>
<span class="p">.</span><span class="n">with_children</span><span class="p">(</span>
<span class="n">ProcessQuery</span><span class="p">()</span>
<span class="p">.</span><span class="n">with_node_key</span><span class="p">(</span><span class="n">eq</span><span class="o">=</span><span class="n">process</span><span class="p">.</span><span class="n">node_key</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div></div>
<p>In this case there are <strong>at most 4 operations ever</strong>, and they can even execute in parallel thanks to the Dgraph backend. Even with trillions of nodes this query should always take roughly the same amount of time.</p>
<p>Those 4 operations are all key-based lookups (even the edge traversal), and as such they’re constant time.</p>
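<p>As a rough illustration of why key-based lookups do not depend on data volume, here is a sketch that uses plain dictionaries as a stand-in for the graph store. The dictionaries and the <code class="language-plaintext highlighter-rouge">word_with_child</code> function are hypothetical and only mimic the shape of the two generated queries; this is not how Dgraph is implemented.</p>

```python
from typing import Dict, List, Optional

# Hypothetical in-memory stand-in for the graph store. Every access below is
# a dictionary lookup by node key, never a scan over all nodes.
process_names: Dict[str, str] = {"k1": "winword.exe", "k2": "powershell.exe"}
children: Dict[str, List[str]] = {"k1": ["k2"]}
parent: Dict[str, str] = {"k2": "k1"}


def word_with_child(node_key: str) -> Optional[str]:
    """Return the key of the winword.exe parent if node_key takes part in a
    'winword.exe with a child process' subgraph, otherwise None."""
    # Query 1: the node is the winword.exe parent itself (two lookups).
    if process_names.get(node_key) == "winword.exe" and children.get(node_key):
        return node_key
    # Query 2: the node is the child; follow the edge to its parent (two lookups).
    parent_key = parent.get(node_key)
    if parent_key is not None and process_names.get(parent_key) == "winword.exe":
        return parent_key
    return None


print(word_with_child("k1"), word_with_child("k2"), word_with_child("unknown"))
```

<p>However many entries the dictionaries hold, the function performs at most four lookups, which is the property the Analyzer relies on.</p>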
<h3 id="engagements">Engagements</h3>
<p>In a SIEM-based workflow, upon receiving an alert, you will first open up some kind of search window - say, the last 8 hours, and all searches will run across that 8 hour period of data.</p>
<p>As the scope of your investigation grows, so does your search window, and your searches slow down with it. Going from an 8 hour window to a 16 hour window will at least double your search times.</p>
<p>It is not at all uncommon for investigations to span weeks, months, or even years’ worth of data. Malware commonly schedules its execution for days or weeks after the initial payload lands, for example, or you may have had a very old vuln/exposure reported and want to validate that it wasn’t exploited.</p>
<p>Once again the SIEM has put us in a position of fighting with our data. We want the largest search window possible so that we can capture the full scope of an attack, but we want the shortest search window possible so that we can optimize our searches. This is the sort of trade off that I find particularly demoralizing.</p>
<p>Grapl throws search windows out entirely. You start off an engagement with some suspect node, and from there you expand that node. Each expansion operation is constant time. This is done through a Python library provided by Grapl, and can be executed in an AWS SageMaker Notebook.</p>
<p>As an example, you may want to go from a process to its parent process.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">suspect_process</span> <span class="o">=</span> <span class="n">engagement</span><span class="p">.</span><span class="n">get_process</span><span class="p">(</span><span class="s">"..."</span><span class="p">)</span>
<span class="n">suspect_parent</span> <span class="o">=</span> <span class="n">suspect_process</span><span class="p">.</span><span class="n">get_parent</span><span class="p">()</span>
</code></pre></div></div>
<p>It would not matter if <code class="language-plaintext highlighter-rouge">suspect_process</code> and <code class="language-plaintext highlighter-rouge">suspect_parent</code> were executed weeks apart, <strong>the operation always takes the same amount of time</strong>.</p>
<p>This is leveraging the same techniques as the Analyzers, generating optimized queries under the hood that act as key lookups.</p>
<h3 id="conclusion">Conclusion</h3>
<p>By leveraging techniques like identification and focusing on constant time operations Grapl can provide literally orders of magnitude better storage and performance than existing state of the art solutions. Organizations should never feel like they have to fight with their data, or worry about their log volume due to absurd licensing fees and storage costs.</p>
<p>The improvements that Grapl makes don’t just represent a “2x” or “10x” speedup, they fundamentally change runtime performance attributes, turning operations that are linear or exponential in a SIEM into operations that are logarithmic or even constant time.</p>
<p>Grapl is free, open source, and promises to make Detection and Response a radically better experience for detection engineers and incident responders.</p>
<p>Github: https://github.com/insanitybit/grapl
Twitter: https://twitter.com/home</p>
Grapl’s Detection Story - Graph Analyzers, Risk, and Lenses2019-06-11T00:00:00+00:00http://insanitybit.github.io/2019/06/11/Grapl-s-DetectionStory-GraphAnalyzers-Risk-and-Lenses
<p>Grapl is a Graph based detection and response platform, but what does this workflow actually look like? What does Grapl do differently, and how does it all fit together?</p>
<p>Grapl does a ton of work to get you the data you need in the best format for analysis, and provide the tools you need to understand your environment; it provides your logs with identity, it combines them together into a concise format, and it links them together into a graph that exposes their relationships.</p>
<p>In this post I want to focus on some of the features I’ve been working on lately - the new Analyzer library, risk based alerting, and lenses.</p>
<h2 id="analyzers">Analyzers</h2>
<p>Analyzers provide the first tier of what I call “local correlation” - it’s where we define things like TTPs or interesting, connected patterns in our master graph of events. “Local” means that the detection can be represented through a single connected graph.</p>
<p>Analyzers can be quite simple:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">ProcessQuery</span><span class="p">().</span><span class="n">with_process_name</span><span class="p">(</span><span class="n">eq</span><span class="o">=</span><span class="s">"evil.exe"</span><span class="p">)</span>
</code></pre></div></div>
<p>Here we have a query for any process with the process name <code class="language-plaintext highlighter-rouge">evil.exe</code>. At a minimum this gives us the basic powers of what most log based systems do - we can do querying with regexes across various fields.</p>
<p>Where the real power comes in is when you want to look at <em>behaviors</em>.</p>
<p>Here is a look at abuse of the CMSTP process, inspired by https://attack.mitre.org/techniques/T1191/</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">ProcessQuery</span><span class="p">()</span>
<span class="p">.</span><span class="n">with_process_name</span><span class="p">(</span><span class="n">ends_with</span><span class="o">=</span><span class="s">"CMSTP.exe"</span><span class="p">)</span>
<span class="p">.</span><span class="n">with_read_files</span><span class="p">(</span>
<span class="n">FileQuery</span><span class="p">().</span><span class="n">with_file_ext</span><span class="p">(</span><span class="n">eq</span><span class="o">=</span><span class="s">".inf"</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div></div>
<p>With this signature we can track all reads to <code class="language-plaintext highlighter-rouge">.inf</code> files from CMSTP.exe.</p>
<p>Further refinement of the alert can take a count of the combination of <code class="language-plaintext highlighter-rouge">CMSTP.exe</code> and the <code class="language-plaintext highlighter-rouge">.inf</code> file, and output if the combination has been seen zero or one times. This way, even if your environment has legitimate executions of CMSTP.exe, you can take advantage of the attacker’s <code class="language-plaintext highlighter-rouge">.inf</code> file being non-standard.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">p</span> <span class="o">=</span> <span class="p">(</span>
<span class="n">ProcessQuery</span><span class="p">()</span>
<span class="p">.</span><span class="n">with_process_name</span><span class="p">(</span><span class="n">ends_with</span><span class="o">=</span><span class="s">"CMSTP.exe"</span><span class="p">)</span>
<span class="p">.</span><span class="n">with_read_files</span><span class="p">(</span>
<span class="n">FileQuery</span><span class="p">().</span><span class="n">with_file_path</span><span class="p">().</span><span class="n">with_file_ext</span><span class="p">(</span><span class="n">eq</span><span class="o">=</span><span class="s">".inf"</span><span class="p">)</span>
<span class="p">).</span><span class="n">query_first</span><span class="p">(</span><span class="n">dgraph_client</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">count</span> <span class="o">=</span> <span class="p">(</span>
<span class="n">ProcessFileCounter</span><span class="p">(</span><span class="n">dgraph_client</span><span class="p">)</span>
<span class="p">.</span><span class="n">count</span><span class="p">(</span>
<span class="n">process_name</span><span class="o">=</span><span class="n">p</span><span class="p">.</span><span class="n">process_name</span><span class="p">,</span>
<span class="n">file_name</span><span class="o">=</span><span class="n">p</span><span class="p">.</span><span class="n">read_files</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">file_path</span><span class="p">,</span>
<span class="p">)</span>
<span class="p">)</span>
<span class="k">if</span> <span class="n">count</span> <span class="o"><=</span> <span class="n">Seen</span><span class="p">.</span><span class="n">Once</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"Unique CMSTP.exe with .inf combination"</span><span class="p">)</span>
</code></pre></div></div>
<p>Grapl’s Python and Graph based signatures allow expression of complex behaviors and can track those behaviors over time using counters. The combination allows anyone to write high fidelity alerts quickly.</p>
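<p>For intuition about what a counter like this does, here is a small in-memory sketch. The <code class="language-plaintext highlighter-rouge">InMemoryProcessFileCounter</code> class, its <code class="language-plaintext highlighter-rouge">record</code> method, and the <code class="language-plaintext highlighter-rouge">Seen.Many</code> bucket are assumptions for illustration; Grapl’s real counter is backed by its graph database and may be shaped differently.</p>

```python
from collections import Counter
from enum import IntEnum


class Seen(IntEnum):
    # Hypothetical buckets; an IntEnum so that `count <= Seen.Once` works.
    Never = 0
    Once = 1
    Many = 2


class InMemoryProcessFileCounter:
    """Counts how often a (process name, file path) combination was observed."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def record(self, process_name: str, file_name: str) -> None:
        self._counts[(process_name, file_name)] += 1

    def count(self, process_name: str, file_name: str) -> Seen:
        # Anything observed twice or more collapses into the Many bucket.
        seen = self._counts[(process_name, file_name)]
        return Seen(min(seen, Seen.Many))


counter = InMemoryProcessFileCounter()
first = counter.count("CMSTP.exe", "C:\\Temp\\payload.inf")
counter.record("CMSTP.exe", "C:\\Temp\\payload.inf")
second = counter.count("CMSTP.exe", "C:\\Temp\\payload.inf")
print(first.name, second.name)
```

<p>Bucketing the raw count keeps the analyzer logic readable: the signature only cares whether a combination is new, rare, or common.</p>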
<h2 id="risk">Risk</h2>
<p>Of course, the reality of detection is that it’s impossible to say, in the general case, that anything is bad. It’s absurd what users will do - especially at a tech company, when you’ve got developers debugging systems in all sorts of ways.</p>
<p>Treating signatures as binary statements of badness is going to leave you in a bad situation - you’ll either be so inundated with triage that you never get anything done, or you’ll never manage to push signatures out because they have too many false positives.</p>
<p>The graph based approach is extraordinarily powerful and can help you build alerts with powerful whitelisting, but even still, these signatures are heuristics.</p>
<p>This is why Grapl provides a concept of <em>risk</em>. Risk is just a number indicating how suspicious you think this behavior is. Known malware executing? Maybe that’s a risk of 180. Unique parent child process? Maybe that’s closer to 50. The numbers are made up, it’s the relative distance that matters.</p>
<p>Let’s look at our previous example, modified to include risk:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">p</span> <span class="o">=</span> <span class="p">(</span>
<span class="n">ProcessQuery</span><span class="p">()</span>
<span class="p">.</span><span class="n">with_process_name</span><span class="p">(</span><span class="n">ends_with</span><span class="o">=</span><span class="s">"CMSTP.exe"</span><span class="p">)</span>
<span class="p">.</span><span class="n">with_read_files</span><span class="p">(</span>
<span class="n">FileQuery</span><span class="p">().</span><span class="n">with_file_path</span><span class="p">().</span><span class="n">with_file_ext</span><span class="p">(</span><span class="n">eq</span><span class="o">=</span><span class="s">".inf"</span><span class="p">)</span>
<span class="p">).</span><span class="n">query_first</span><span class="p">(</span><span class="n">dgraph_client</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">count</span> <span class="o">=</span> <span class="p">(</span>
<span class="n">ProcessFileCounter</span><span class="p">(</span><span class="n">dgraph_client</span><span class="p">)</span>
<span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="n">process_name</span><span class="o">=</span><span class="n">p</span><span class="p">.</span><span class="n">process_name</span><span class="p">,</span> <span class="n">file_name</span><span class="o">=</span><span class="n">p</span><span class="p">.</span><span class="n">read_files</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">file_path</span><span class="p">)</span>
<span class="p">)</span>
<span class="k">if</span> <span class="n">count</span> <span class="o">==</span> <span class="n">Seen</span><span class="p">.</span><span class="n">Never</span><span class="p">:</span>
    <span class="n">output</span><span class="p">(</span>
        <span class="n">suspicious_graph</span><span class="o">=</span><span class="n">p</span><span class="p">,</span>
        <span class="n">risk</span><span class="o">=</span><span class="mi">150</span><span class="p">,</span>
    <span class="p">)</span>
</code></pre></div></div>
<p>All that is needed is to add a score, stating that this is a “150” level risk.</p>
<p>Grapl leveraging Python means we can really easily express more dynamic scoring. For example,</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">p</span> <span class="o">=</span> <span class="p">(</span>
<span class="n">ProcessQuery</span><span class="p">()</span>
<span class="p">.</span><span class="n">with_process_name</span><span class="p">(</span><span class="n">ends_with</span><span class="o">=</span><span class="s">"CMSTP.exe"</span><span class="p">)</span>
<span class="p">.</span><span class="n">with_read_files</span><span class="p">(</span>
<span class="n">FileQuery</span><span class="p">().</span><span class="n">with_file_path</span><span class="p">().</span><span class="n">with_file_ext</span><span class="p">(</span><span class="n">eq</span><span class="o">=</span><span class="s">".inf"</span><span class="p">)</span>
<span class="p">).</span><span class="n">query_first</span><span class="p">(</span><span class="n">dgraph_client</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">count</span> <span class="o">=</span> <span class="p">(</span>
<span class="n">ProcessFileCounter</span><span class="p">(</span><span class="n">dgraph_client</span><span class="p">)</span>
<span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="n">process_name</span><span class="o">=</span><span class="n">p</span><span class="p">.</span><span class="n">process_name</span><span class="p">,</span> <span class="n">file_name</span><span class="o">=</span><span class="n">p</span><span class="p">.</span><span class="n">read_files</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">file_path</span><span class="p">)</span>
<span class="p">)</span>
<span class="c1"># Unique is extra scary - risk is 150
</span><span class="k">if</span> <span class="n">count</span> <span class="o">==</span> <span class="n">Seen</span><span class="p">.</span><span class="n">Never</span><span class="p">:</span>
    <span class="n">output</span><span class="p">(</span>
        <span class="n">suspicious_graph</span><span class="o">=</span><span class="n">p</span><span class="p">,</span>
        <span class="n">risk</span><span class="o">=</span><span class="mi">150</span><span class="p">,</span>
    <span class="p">)</span>
<span class="c1"># If we've seen it once that's still sketchy - risk is 120
</span><span class="k">elif</span> <span class="n">count</span> <span class="o">==</span> <span class="n">Seen</span><span class="p">.</span><span class="n">Once</span><span class="p">:</span>
    <span class="n">output</span><span class="p">(</span>
        <span class="n">suspicious_graph</span><span class="o">=</span><span class="n">p</span><span class="p">,</span>
        <span class="n">risk</span><span class="o">=</span><span class="mi">120</span><span class="p">,</span>
    <span class="p">)</span>
<span class="c1"># If we've seen it more than once it *might* be sketchy, but not worth
</span><span class="c1"># raising alarms over, let's drop risk down to 20
</span><span class="k">else</span><span class="p">:</span>
    <span class="n">output</span><span class="p">(</span>
        <span class="n">suspicious_graph</span><span class="o">=</span><span class="n">p</span><span class="p">,</span>
        <span class="n">risk</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span>
    <span class="p">)</span>
</code></pre></div></div>
<p>We can pull in peripheral information, such as the count of the combination of process name and filename, and use that to determine a score. Maybe CMSTP.exe is actually something we see in the environment sometimes, so if it’s a file we’ve seen a lot, it <em>could</em> be bad, but we’ll drop the score a lot.</p>
<p>Risk is so powerful because you can throw <em>everything</em> into it. If you’ve ever wanted to write an alert but just couldn’t cut the false positives down, risk probably could have helped you.</p>
<p>It is too often the case that the signatures that are most likely to catch an attacker are too noisy to investigate every time - attach a risk to it, and now you can sort it across other risks in the environment.</p>
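<p>Sorting by risk can be as simple as the sketch below. The <code class="language-plaintext highlighter-rouge">Hit</code> tuple and the analyzer names are made up for illustration and are not part of Grapl’s API.</p>

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    analyzer: str  # hypothetical analyzer name
    risk: int


hits: List[Hit] = [
    Hit("Noisy but interesting signature", 5),
    Hit("Unique CMSTP.exe with .inf combination", 150),
    Hit("CMSTP.exe with .inf seen once before", 120),
]

# Highest risk first: noisy signatures still fire, they just sink in the queue.
triage_queue = sorted(hits, key=lambda hit: hit.risk, reverse=True)
for hit in triage_queue:
    print(hit.risk, hit.analyzer)
```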
<h2 id="lenses">Lenses</h2>
<p>Of course, Grapl is a graph based system, and the real power of its analyzers and risks lies in that approach. Analyzers provide us with <em>local</em> correlation - we can see a process with a direct read connection to a file. But what if another analyzer had found a suspicious pattern elsewhere in that process tree? It would be great if we could do correlation even across disconnected graphs.</p>
<p>This is where <em>lenses</em> come in. A lens is a way to view groups of local correlations through some focal point - in Grapl’s case, the currently supported focal point is the asset lens. An asset would be someone’s laptop, or a server, so an asset lens would allow us to see all of the risks associated with, for example, various suspicious activities on a user’s laptop.</p>
<p>Consider the situation of Microsoft Word or Excel executing a child process.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">p</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">ProcessQuery</span><span class="p">()</span>
    <span class="p">.</span><span class="n">with_process_name</span><span class="p">(</span><span class="n">eq</span><span class="o">=</span><span class="p">[</span><span class="s">"winword.exe"</span><span class="p">,</span> <span class="s">"excel.exe"</span><span class="p">])</span>
    <span class="p">.</span><span class="n">with_children</span><span class="p">(</span>
        <span class="n">ProcessQuery</span><span class="p">()</span>
    <span class="p">)</span>
    <span class="p">.</span><span class="n">query_first</span><span class="p">(</span><span class="n">dgraph_client</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">output</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">risk</span><span class="o">=</span><span class="mi">120</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="https://s3.amazonaws.com/media-p.slid.es/uploads/650602/images/6244701/word_payload.png" alt="" /></p>
<p>Well, there’s probably more to that story, right? A file must have been read to execute a macro, or something along those lines.</p>
<p>Maybe on that same asset we have a low risk signature, looking for files downloaded from common browsers.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">p</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">ProcessQuery</span><span class="p">()</span>
    <span class="p">.</span><span class="n">with_process_name</span><span class="p">(</span><span class="n">eq</span><span class="o">=</span><span class="p">[</span><span class="s">"chrome.exe"</span><span class="p">,</span> <span class="s">"firefox.exe"</span><span class="p">,</span> <span class="s">"iexplore.exe"</span><span class="p">])</span>
    <span class="p">.</span><span class="n">created_files</span><span class="p">(</span><span class="n">FileQuery</span><span class="p">())</span>
    <span class="p">.</span><span class="n">query_first</span><span class="p">(</span><span class="n">dgraph_client</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">output</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">risk</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="https://s3.amazonaws.com/media-p.slid.es/uploads/650602/images/6244690/chrome_mal.png" alt="" /></p>
<p>This is an incredibly low risk signature. Users download files <em>all the time</em>. But this is where non-local correlation comes in.</p>
<p>Both of these signatures triggered for the same <em>asset</em>, and so we can view them through that <em>lens</em>.</p>
<p><img src="https://s3.amazonaws.com/media-p.slid.es/uploads/650602/images/6244695/lens.png" alt="" /></p>
<p>We can now see a way to correlate these isolated subgraphs - when investigating, you can just start connecting the paths between these nodes.</p>
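<p>Conceptually, an asset lens is a grouping of analyzer hits by the asset they fired on. The sketch below assumes a hypothetical <code class="language-plaintext highlighter-rouge">AnalyzerHit</code> record with made-up asset names; Grapl stores lenses in its graph rather than in Python dictionaries.</p>

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class AnalyzerHit(NamedTuple):
    asset: str     # hypothetical asset identifier, e.g. a hostname
    analyzer: str
    risk: int


def asset_lenses(hits: List[AnalyzerHit]) -> Dict[str, List[AnalyzerHit]]:
    """Group otherwise disconnected analyzer hits by the asset they occurred on."""
    lenses: Dict[str, List[AnalyzerHit]] = defaultdict(list)
    for hit in hits:
        lenses[hit.asset].append(hit)
    return dict(lenses)


hits = [
    AnalyzerHit("laptop-17", "Word With Child Process", 120),
    AnalyzerHit("laptop-17", "Browser Created File", 5),
    AnalyzerHit("server-02", "Browser Created File", 5),
]
lenses = asset_lenses(hits)
for asset, asset_hits in lenses.items():
    print(asset, [hit.analyzer for hit in asset_hits])
```

<p>On its own the browser download on <code class="language-plaintext highlighter-rouge">laptop-17</code> is noise; seen through the same lens as the Word signature it becomes a lead worth following.</p>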
<p>Let’s create another low-risk analyzer. We’ll call this one:
“Commonly Targeted Application - Unique File Read”.</p>
<p>Certain applications are targeted a lot - Word, Excel, PDF readers, and similar software. These applications are often targeted through malicious file reads - for example, an attacker will convince a user to open a malicious PDF, exploit Adobe Reader, and take over their computer. Further, we can assume that the user downloaded the file from the browser.</p>
<p>So let’s build this signature out.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">common_targets</span> <span class="o">=</span> <span class="p">[</span><span class="s">"winword.exe"</span><span class="p">,</span> <span class="s">"excel.exe"</span><span class="p">,</span> <span class="s">"adobereader.exe"</span><span class="p">]</span>
<span class="n">p</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">ProcessQuery</span><span class="p">()</span>
    <span class="p">.</span><span class="n">with_process_name</span><span class="p">(</span><span class="n">eq</span><span class="o">=</span><span class="n">common_targets</span><span class="p">)</span>
    <span class="p">.</span><span class="n">with_read_files</span><span class="p">(</span>
        <span class="n">FileQuery</span><span class="p">()</span>
        <span class="p">.</span><span class="n">created_by</span><span class="p">(</span>
            <span class="n">ProcessQuery</span><span class="p">().</span><span class="n">with_process_name</span><span class="p">(</span><span class="n">eq</span><span class="o">=</span><span class="p">[</span><span class="s">"chrome.exe"</span><span class="p">,</span> <span class="s">"firefox.exe"</span><span class="p">])</span>
        <span class="p">)</span>
    <span class="p">)</span>
    <span class="p">.</span><span class="n">query_first</span><span class="p">(</span><span class="n">dgraph_client</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">output</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">risk</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
</code></pre></div></div>
<p>We can give this a risk of 10.</p>
<p>Now our lens shows us a new connection.</p>
<p><img src="https://s3.amazonaws.com/media-p.slid.es/uploads/650602/images/6244696/correlated_lens.png" alt="" /></p>
<p>We have a pretty compelling story here for the attack - pretty easy to see what’s going on.</p>
<p>But more importantly, we have <em>multiple overlapping risks</em> within a lens. So let’s make those risks explicit.</p>
<p><img src="https://s3.amazonaws.com/media-p.slid.es/uploads/650602/images/6244697/lens_with_risk.png" alt="" /></p>
<p>What’s important to note here is that we have multiple <em>distinct risks</em> that are correlating both <em>locally and non-locally</em>.</p>
<p>The three risks are:</p>
<ul>
<li>“Browser Created File”</li>
<li>“Word With Child Process”</li>
<li>“Commonly Targeted App Read Browser Created File”</li>
</ul>
<p>The local correlation is where the risks overlap - the <code class="language-plaintext highlighter-rouge">word.exe</code> node has edges to two distinct risks. The non-local correlation is where the risks don’t overlap, but the lens allows us to see them together - Browser Created File, for example.</p>
<p>When a node has multiple risks, we get something like:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">risk_sum</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">node_risks</span><span class="p">)</span>
<span class="n">risk_sum</span> <span class="o">+=</span> <span class="n">risk_sum</span> <span class="o">*</span> <span class="p">(</span><span class="mf">0.10</span> <span class="o">*</span> <span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">node_risks</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">))</span>
</code></pre></div></div>
<p>Essentially, if two nodes correlate, risk is increased 10%. If three nodes correlate, 20%.</p>
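<p>As a worked example, here is that rule as a small function, reading it as “add 10% for every risk beyond the first”. The <code class="language-plaintext highlighter-rouge">combined_risk</code> name and the <code class="language-plaintext highlighter-rouge">bonus_per_extra_risk</code> parameter are my own for this sketch, and the input values are the example risks used earlier in this post.</p>

```python
from typing import Sequence


def combined_risk(node_risks: Sequence[int], bonus_per_extra_risk: float = 0.10) -> float:
    """Sum the risks on a node, then add a bonus for each risk beyond the first."""
    risk_sum = float(sum(node_risks))
    extra_risks = max(len(node_risks) - 1, 0)
    return risk_sum + risk_sum * (bonus_per_extra_risk * extra_risks)


# One risk: no bonus. Two risks: +10%. Three risks: +20%.
single = combined_risk([120])
double = combined_risk([120, 10])
triple = combined_risk([120, 10, 5])
print(single, round(double, 2), round(triple, 2))
```

<p>With risks of 120 and 10 on the same node, the sum of 130 gains a 10% bonus and comes out at roughly 143; adding a third risk of 5 gives 135 plus 20%, or roughly 162.</p>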
<h2 id="future">Future</h2>
<p>Lens views of your assets’ risk are a powerful concept, but they can go so much further. We can create arbitrary lenses to view your environment. A lens for users would track actions attributed to a user, regardless of which assets the actions occurred on. A lens for the kill chain would take attack signatures that map to the kill chain and provide a lens to correlate across them.</p>
<p>Lens-based correlation is also a great example of how graphs apply to different areas of Detection and Response. Not only do graph based signatures let us express powerful attack signatures, but because the signatures output graphs we can trivially connect the outputs together, giving us an almost arbitrarily powerful tool for correlation.</p>
<p>I also highly recommend <a href="https://www.youtube.com/watch?v=SdXosDrna-A">this conference talk</a> by <a href="https://twitter.com/four">@four</a>. Lenses and risk based signatures are inspired by this talk.</p>
<p>If you’re interested in talking more about Grapl, check out the project or reach out - I’m always interested in hearing thoughts about the project.</p>
<p>Github: https://github.com/insanitybit/grapl
Twitter: <a href="https://twitter.com/InsanityBit">@insanitybit</a></p>
Queries in Code2019-05-20T00:00:00+00:00http://insanitybit.github.io/2019/05/20/queries-in-code
<p>A detection and response (D&R) team’s attack signature queries are vital to their success, providing insight into suspicious behaviors occurring in their environment. Writing searches that can capture complex attacker behaviors, and ensuring that these searches are correct, are important responsibilities for a successful D&R team.</p>
<p><a href="https://github.com/insanitybit/grapl">Grapl</a> takes a fairly different approach to building these queries than other tools in the market, such as Splunk. Whereas Splunk has its own domain specific language (DSL), SplunkQL, Grapl instead leverages Python - one of the most popular programming languages in the world.</p>
<p>I believe that there are numerous D&R use cases where a programming language like Python has significant advantages over domain specific languages like SplunkQL.</p>
<p>While I will be discussing the usage of Python in comparison to SplunkQL, it’s worth noting that almost any project like Splunk takes the same DSL based approach. I only chose Splunk because I have the most experience with it.</p>
<h2 id="splunkql">SplunkQL</h2>
<p>The current state of the art for Detection and Response is the SIEM - products like Splunk, or ElastAlert, which perform log management, orchestration, and provide a system for correlation and alerting.</p>
<p>These systems almost exclusively leverage their own query languages. Splunk, for example, has its Search Processing Language (SPL), which I refer to as SplunkQL throughout this post. Here is an example of a Splunk query:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> index=wineventlog source=WinEventLog:Security
EventCode="4624"
Logon_Type="2" OR Logon_Type="10"
| fillnull value=* Source_Network_Address
| stats count by host Source_Network_Address Logon_Type user
| eval bar="("+count+") "+Source_Network_Address
| eval bar_host="("+count+") "+host
| stats list(bar) values(bar_host) by user Logon_Type
</code></pre></div></div>
<p>https://gosplunk.com/windows-rdp-sessions/</p>
<p>Notably, there are specialized functions like <code class="language-plaintext highlighter-rouge">stats</code> with a <code class="language-plaintext highlighter-rouge">by</code> clause; you can bind information to a name using <code class="language-plaintext highlighter-rouge">eval</code>, and aggregate data using <code class="language-plaintext highlighter-rouge">list</code> or <code class="language-plaintext highlighter-rouge">values</code>. The language is really powerful in many ways.</p>
<p>There’s also no branching - instead, we write declarative statements such as <code class="language-plaintext highlighter-rouge">Logon_Type="2"</code>, and filter out results that do not match. We have no function calls, and the ability to abstract or compose searches is very limited.</p>
<p>Certain commands are also restricted in some ways; special commands like <code class="language-plaintext highlighter-rouge">makeresults</code> or <code class="language-plaintext highlighter-rouge">inputlookup</code> must be first in your search, and one cannot precede the other. There are some hidden magical rules like this in SplunkQL that aren’t always obvious, and they can limit flexibility.</p>
<h2 id="python">Python</h2>
<p>Python is a much more typical, standard programming language. It has classes, functions, if statements, loops, libraries, and other constructs you’d expect.</p>
<p>In Grapl, which uses Python, a query looks something like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">child</span> <span class="o">=</span> <span class="n">Process</span><span class="p">()</span> \
<span class="p">.</span><span class="n">with_image_name</span><span class="p">(</span><span class="n">contains</span><span class="o">=</span><span class="s">"svchost.exe"</span><span class="p">)</span>
<span class="n">parent</span> <span class="o">=</span> <span class="n">Process</span><span class="p">()</span> \
<span class="p">.</span><span class="n">with_image_name</span><span class="p">(</span><span class="n">contains</span><span class="o">=</span><span class="n">Not</span><span class="p">(</span><span class="s">"services.exe"</span><span class="p">))</span> \
<span class="p">.</span><span class="n">with_image_name</span><span class="p">(</span><span class="n">contains</span><span class="o">=</span><span class="n">Not</span><span class="p">(</span><span class="s">"smss.exe"</span><span class="p">))</span>
<span class="n">query</span> <span class="o">=</span> <span class="n">parent</span><span class="p">.</span><span class="n">with_child</span><span class="p">(</span><span class="n">child</span><span class="p">).</span><span class="n">to_query</span><span class="p">()</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">Process</code> is a class that we instantiate, and use to describe what kind of processes in our graph we want to match against. We call methods like <code class="language-plaintext highlighter-rouge">with_image_name</code> to describe attributes of the process, or <code class="language-plaintext highlighter-rouge">with_child</code> to describe relationships between processes.</p>
<p>Ignoring the graph based approach here, which allows a clear way to show relationships between entities, we can see that there’s a lot of <em>abstraction</em>. We don’t see the underlying generated query and we don’t know the internal mechanics of Process, which means we’re free to change those underlying details in the future.</p>
<p>Python is more of an imperative, object oriented language (though it’s flexible enough to fit many paradigms), unlike Splunk’s purely declarative query language.</p>
<h2 id="composition-abstraction-and-control-flow">Composition, Abstraction, and Control Flow</h2>
<p>Composition and abstraction are fundamentals of software development. The ability to compose different computations, while abstracting away irrelevant details, is what allows us to write clean, clear, maintainable code.</p>
<p>As I mentioned before, query languages like SplunkQL have a hard time here. There are macros, which can expand to Splunk queries, and you can technically call other searches from within your search but this is complex, and those are really the only tools available.</p>
<p>Python, on the other hand, has great tools for abstractions.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">child</span> <span class="o">=</span> <span class="n">Process</span><span class="p">()</span> \
<span class="p">.</span><span class="n">with_image_name</span><span class="p">(</span><span class="n">contains</span><span class="o">=</span><span class="s">"svchost.exe"</span><span class="p">)</span>
</code></pre></div></div>
<p>We don’t have to worry about how <code class="language-plaintext highlighter-rouge">Process</code> is implemented; it exposes a natural interface and we make use of it.</p>
<p>We could compose multiple Processes together, into a <code class="language-plaintext highlighter-rouge">ParentChildPair</code> if we wanted to, or move some of the logic into another function.</p>
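<p>As a rough sketch of that kind of composition, the snippet below wraps the parent/child pattern in a helper function. The <code class="language-plaintext highlighter-rouge">Process</code> class here is a tiny stand-in written only so the example runs on its own; it is not Grapl’s implementation, and <code class="language-plaintext highlighter-rouge">suspicious_parent_child</code> is a hypothetical helper name.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Process:
    """Stand-in for illustration only; not Grapl's real Process class."""
    image_name_contains: List[str] = field(default_factory=list)
    child: Optional["Process"] = None

    def with_image_name(self, contains: str) -> "Process":
        self.image_name_contains.append(contains)
        return self

    def with_child(self, child: "Process") -> "Process":
        self.child = child
        return self


def suspicious_parent_child(child_name: str, parent_name: str) -> Process:
    """Reusable building block: a parent process with a specific child."""
    child = Process().with_image_name(contains=child_name)
    return Process().with_image_name(contains=parent_name).with_child(child)


# The same helper composes into several signatures without repeating logic.
word_spawns_shell = suspicious_parent_child("cmd.exe", "winword.exe")
excel_spawns_shell = suspicious_parent_child("cmd.exe", "excel.exe")
</code></pre></div></div>
<p>Each new signature is then a one-line call, and a change to how the pair is built only has to be made in one place.</p>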
<p>A common problem I’ve had in Splunk is expressing all of my logic in one query, without the use of control flow primitives. Python makes this easy.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="k">def</span> <span class="nf">signature_graph</span><span class="p">()</span> <span class="o">-></span> <span class="nb">str</span><span class="p">:</span>
    <span class="n">child</span> <span class="o">=</span> <span class="n">Process</span><span class="p">()</span> \
        <span class="p">.</span><span class="n">with_image_name</span><span class="p">(</span><span class="n">contains</span><span class="o">=</span><span class="s">"svchost.exe"</span><span class="p">)</span>
    <span class="n">parent</span> <span class="o">=</span> <span class="n">Process</span><span class="p">()</span> \
        <span class="p">.</span><span class="n">with_image_name</span><span class="p">(</span><span class="n">contains</span><span class="o">=</span><span class="n">Not</span><span class="p">(</span><span class="s">"services.exe"</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">parent</span><span class="p">.</span><span class="n">with_child</span><span class="p">(</span><span class="n">child</span><span class="p">).</span><span class="n">to_query</span><span class="p">()</span>

<span class="k">for</span> <span class="n">hit</span> <span class="ow">in</span> <span class="n">execute_analyzer</span><span class="p">(</span><span class="n">signature_graph</span><span class="p">):</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="n">check_hit_against_whitelist</span><span class="p">(</span><span class="n">hit</span><span class="p">):</span>
        <span class="n">output</span><span class="p">(</span><span class="n">hit</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">debug_log</span><span class="p">(</span><span class="s">"Whitelisted hit: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">hit</span><span class="p">))</span>
</code></pre></div></div>
<p>Here we see a case where control flow and abstraction are used to build a query that was easy to write and is still easy to read.</p>
<p>We filter results from our signature matching using the <code class="language-plaintext highlighter-rouge">check_hit_against_whitelist</code> function, but the details of that function are abstracted away - maybe we hit a database, or reach back out to the master graph, or any other implementation. This keeps our whitelisting logic simple, and easy to change in the future.</p>
<p>Branching allows the code to not just filter out whitelisted hits, but to also execute code based on whether it is whitelisted or not. In the event that we do get a whitelisted event, we’re going to log some information, and then continue.</p>
<h2 id="debugging">Debugging</h2>
<p>Debugging a Splunk search can be really difficult. For one thing, there’s no easy way to just log out various steps or data. Sometimes things just stop (like if you <code class="language-plaintext highlighter-rouge">stats</code> by <code class="language-plaintext highlighter-rouge">null</code>) and you don’t know why - the easiest way to figure it out is usually to start cutting your search in half, rerun it, and inspect the output. This is a tedious process.</p>
<p>Python makes things way simpler here. For one thing, print debugging is trivial - you can inject log points anywhere into your code, as we see in the example in the <code class="language-plaintext highlighter-rouge">Composition, Abstraction, and Control Flow</code> section.</p>
<p>Python also provides standard debugging support using breakpoints. You can actually attach to the Python interpreter and step through code, inspecting variables as you go.</p>
<p><a href="https://docs.python.org/3/library/pdb.html">The PDB tool</a> is what I’ve used to do this in the past when debugging more complex problems.</p>
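<p>As a minimal sketch of what those log points can look like, the snippet below uses the standard <code class="language-plaintext highlighter-rouge">logging</code> module. The <code class="language-plaintext highlighter-rouge">filter_hits</code> function and the shape of a hit (a dict with an <code class="language-plaintext highlighter-rouge">image_name</code> key) are assumptions made for illustration, not part of Grapl.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import logging

logger = logging.getLogger("analyzer")


def filter_hits(hits, whitelist):
    """Return hits whose image name is not whitelisted, logging each decision."""
    kept = []
    for hit in hits:
        logger.debug("inspecting hit: %r", hit)
        if hit["image_name"] in whitelist:
            logger.debug("whitelisted, skipping: %s", hit["image_name"])
            continue
        kept.append(hit)
    logger.debug("kept %d of %d hits", len(kept), len(hits))
    return kept


if __name__ == "__main__":
    # Turn on debug output only when running the script directly.
    logging.basicConfig(level=logging.DEBUG)
    sample = [{"image_name": "svchost.exe"}, {"image_name": "evil.exe"}]
    print(filter_hits(sample, whitelist={"svchost.exe"}))
</code></pre></div></div>
<p>When logging isn’t enough, placing a call to the built-in <code class="language-plaintext highlighter-rouge">breakpoint()</code> (Python 3.7 and later) inside the loop drops you into pdb at that point.</p>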
<h2 id="version-control">Version Control</h2>
<p>Searches are code, and they require an adherence to standards just as code does. Version control is one of the mechanisms that almost every mature software project uses to enforce their standard of quality.</p>
<p>When your searches live in code it makes management much simpler. Splunk’s searches generally live in a flat file, with the interface to the file being the GUI - this makes management of searches difficult if you want to do it in a way that isn’t the default.</p>
<p>Again, using a more standardized tool pays off. Python makes it easy to follow standard best practices here, as it’s extremely common for Python codebases to be backed by a version control system. The intended practice for Grapl is to keep all of your queries in a repository, and then use a githook to sign and deploy them to the analyzer S3 bucket.</p>
<p>This allows enforcing code reviews, linting, etc, and only releasing when your githooks have passed and your queries meet your quality bar.</p>
<h2 id="testing">Testing</h2>
<p>DSLs are often very frontloaded in power, having lots of specialized functions for their designated use case. They usually lack power in other areas, such as tooling.</p>
<p>In particular, if you search around for how to test your SplunkQL searches, you might be disappointed. It’s definitely <em>possible</em>, but it isn’t a natively supported concept, and you’re probably going to be home-growing whatever solution you come up with. If you want to get closer to best practices, such as rerunning tests on every change, and blocking changes if tests fail, you’ll be spending a lot of time building your own system.</p>
<p>Contrast this with Python, where testing is provided by the standard library. There’s mocking, patching, and support from all major Continuous Integration (CI) services.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">unittest</span>

<span class="c1"># Assumes the module exposes a callable of the same name.</span>
<span class="kn">from</span> <span class="nn">my_attack_analyzer</span> <span class="kn">import</span> <span class="n">my_attack_analyzer</span>

<span class="c1"># init_local_mg, add_attack_signature and add_benign_graph are
# placeholder helpers, assumed to be defined elsewhere.</span>


<span class="k">class</span> <span class="nc">TestAttackSignature</span><span class="p">(</span><span class="n">unittest</span><span class="p">.</span><span class="n">TestCase</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">setUp</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">master_graph</span> <span class="o">=</span> <span class="n">init_local_mg</span><span class="p">()</span>
        <span class="n">add_attack_signature</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">master_graph</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">test_hit</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">assert</span> <span class="n">my_attack_analyzer</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">master_graph</span><span class="p">)</span>
        <span class="c1"># Assert other properties of the response</span>

    <span class="k">def</span> <span class="nf">test_miss</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="c1"># Clear our master_graph</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">master_graph</span><span class="p">.</span><span class="n">clear</span><span class="p">()</span>
        <span class="n">add_benign_graph</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">master_graph</span><span class="p">)</span>
        <span class="k">assert</span> <span class="n">my_attack_analyzer</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">master_graph</span><span class="p">)</span> <span class="ow">is</span> <span class="bp">None</span>


<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">'__main__'</span><span class="p">:</span>
    <span class="n">unittest</span><span class="p">.</span><span class="n">main</span><span class="p">()</span>
</code></pre></div></div>
<p>This is a simplified example of how one might create a positive or negative test case for a Python based Analyzer query. This approach demonstrates simple, standard practices for testing - we could easily integrate this into our CI pipeline just like any other codebase.</p>
<p>Even with a very basic test like this you can ensure that your alert is functional, and Python makes it easy to build much more powerful alerts, and guide your testing through coverage or other metrics.</p>
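<p>The mocking support mentioned above comes from <code class="language-plaintext highlighter-rouge">unittest.mock</code> in the standard library. The sketch below is one way it might be used to test alerting logic without a real whitelist backend; <code class="language-plaintext highlighter-rouge">should_alert</code> and its <code class="language-plaintext highlighter-rouge">lookup</code> parameter are hypothetical names invented for this example.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import unittest
from unittest import mock


def should_alert(image_name, lookup):
    """Alert only when the whitelist lookup says the name is not known."""
    return not lookup(image_name)


class TestShouldAlert(unittest.TestCase):
    def test_whitelisted_name_is_suppressed(self):
        # The mock stands in for a database-backed whitelist lookup.
        lookup = mock.Mock(return_value=True)
        self.assertFalse(should_alert("svchost.exe", lookup))
        lookup.assert_called_once_with("svchost.exe")

    def test_unknown_name_alerts(self):
        lookup = mock.Mock(return_value=False)
        self.assertTrue(should_alert("evil.exe", lookup))


if __name__ == "__main__":
    unittest.main()
</code></pre></div></div>
<p>Passing the lookup in as a parameter keeps the test independent of wherever the real whitelist lives.</p>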
<h2 id="static-validation">Static Validation</h2>
<p>Part of ensuring code correctness is static validation - linters and type systems being the big two.</p>
<p>Splunk provides the <a href="http://dev.splunk.com/view/appinspect/SP-CAAAFAM">appinspect</a> app, which has some predefined rules for ensuring the basics of a good Splunk search - it’s essentially a linter.</p>
<p>Python has a ton of linters, as well as an optional static type system.</p>
<p>You can find more information at <a href="https://www.pylint.org/">pylint.org</a> - Pylint is an incredibly powerful and capable linter. It also ships with <code class="language-plaintext highlighter-rouge">pyreverse</code>, which generates UML diagrams from your Python code. And of course you have your bases covered for things like line length, variable name standards, incorrect interface implementations, etc.</p>
<p><a href="http://mypy-lang.org/">mypy</a>, the Python type checker, can also help you ensure correctness of your searches. Grapl’s Analyzers use mypy types heavily, which helps avoid errors like accidentally using a None value. Contrast this with Splunk where fields can very easily be undefined or null, and lead to silently dropped events.</p>
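<p>To make the None case concrete, here is a small sketch of the kind of mistake mypy can catch. The event layout and both function names are assumptions for illustration and are not taken from Grapl’s Analyzer library.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from typing import Optional


def parent_image_name(event: dict) -> Optional[str]:
    """Return the parent's image name, or None when the field is missing."""
    parent = event.get("parent")
    if parent is None:
        return None
    return parent.get("image_name")


def is_suspicious(event: dict) -> bool:
    name = parent_image_name(event)
    # Without this check, mypy reports an error on name.lower(),
    # because name may be None.
    if name is None:
        return False
    return "services.exe" not in name.lower()
</code></pre></div></div>
<p>Removing the <code class="language-plaintext highlighter-rouge">if name is None</code> branch turns a potential runtime crash into an error reported by mypy before the query is ever deployed.</p>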
<h2 id="libraries">Libraries</h2>
<p>Python is famous for its huge ecosystem of libraries - there’s no need to reinvent the wheel. Between the standard library and <a href="https://pypi.org">PyPI</a> you should have everything you need to build arbitrarily powerful searches.</p>
<p>The data science communities, as well as the security community, have really centered on Python over the last decade or so, building helpful tools like:</p>
<ul>
<li><a href="https://www.scipy.org/">scipy</a> - Statistical functions and common analytics tools</li>
<li><a href="https://pypi.org/project/sklearn/">sklearn</a> - A simple, well documented ML library</li>
<li><a href="https://www.tensorflow.org/">tensorflow</a> - A powerful ML library, driving projects like <a href="https://deepmind.com/research/alphago/">AlphaGo</a></li>
<li><a href="https://github.com/secdev/scapy">scapy</a> - A library for packet inspection</li>
<li><a href="https://github.com/erocarrera/pefile">pefile</a> - A library for interacting with PE files</li>
<li><a href="https://www.crummy.com/software/BeautifulSoup/">beautifulsoup</a> - Not directly a security tool, but definitely one that a lot of security researchers use. beautifulsoup provides a simple interface for interacting with HTML, helpful for analysis of webpages for suspect content.</li>
</ul>
<p>Python provides a best-in-class ecosystem for analyzing data.</p>
<h2 id="conclusion">Conclusion</h2>
<p>I want to be clear that I’m not picking on Splunk here - I used it as the example because I know it best, but virtually every system I’ve run across suffers from the same exact problems. I believe that using a typical, powerful programming language like Python solves many of these problems.</p>
<p>This is why I’ve chosen Python as the first language that Grapl supports for its Analyzer library.</p>
<p>My hope is that I can help analysts build better attack signatures faster, reduce noise, increase signal, express more powerful TTPs and anomalies in their alerts, all while ensuring that their queries are maintainable, readable, and correct.</p>
<p>If you’re interested in learning more about Grapl, please feel free to reach out to me either on Twitter or via the github repo.</p>
<p><a href="https://twitter.com/InsanityBit">https://twitter.com/InsanityBit</a></p>
<p><a href="https://github.com/insanitybit/grapl">https://github.com/insanitybit/grapl</a></p>
Grapl Six Months Later, And The Future2019-05-15T00:00:00+00:00http://insanitybit.github.io/2019/05/15/grapl-six-months-later-and-the-future
<p><a href="https://insanitybit.github.io/2018/10/20/grapl-a-graph-platform-for-detection-forensics-and-incident-response">I released the first Alpha version of Grapl in mid October, 2018.</a> At that point Grapl was already over a year old, though development really started ramping up in the months leading up to that release. Months later, I spoke about Grapl at kernelcon, and <a href="https://insanitybit.github.io/2019/03/09/grapl">transcribed the state of Grapl at the time here</a>.</p>
<p>When Grapl was first released it was already a powerful system, albeit with some rough edges.</p>
<ul>
<li>Grapl’s slowest and most complex service, the <code class="language-plaintext highlighter-rouge">node-identifier</code>, was built off of MySQL with a very complex codebase, and could not scale.</li>
<li>The DGraph cluster in Grapl was built on EC2, with low availability and uptime.</li>
<li>Writing Analyzers required writing raw DGraph queries.</li>
<li>Engagements barely existed - they could be created, but not manipulated.</li>
</ul>
<p>I’m going to go over what has changed, what Grapl looks like today, and what the next few months hold.</p>
<h2 id="node-identification">Node Identification</h2>
<p>If you’re familiar with the instrumentation tool <a href="https://docs.microsoft.com/en-us/sysinternals/downloads/sysmon">Sysmon</a> you’ll know that it provides a <code class="language-plaintext highlighter-rouge">ProcessGuid</code> construct - a unique identifier for every process that won’t collide the way that normal <code class="language-plaintext highlighter-rouge">pid</code>s do. The node-identifier service provides a very similar construct - a unique, canonical identifier for all process and file nodes in Grapl.</p>
<p>Grapl’s node identification process does not rely on host-based instrumentation. Even if your instrumentation tools do not provide a canonical ID for processes, Grapl will be able to determine one - and it does this for both processes and files. It does so by taking the <code class="language-plaintext highlighter-rouge">pid</code> and the <code class="language-plaintext highlighter-rouge">timestamp</code> of the event, and creating timelines for each host using that information. When a node comes in, we look up where it fits into the timeline, and either create a new ID or take the timeline’s existing ID.</p>
<p>There’s a lot more complexity to this than you might expect - Grapl aims to handle cases where logs come out of order, are heavily delayed, or are otherwise dropped, all while also ensuring that progress is being made.</p>
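<p>The core idea of the timeline lookup can be sketched in a few lines. This is a heavily simplified, in-memory illustration and not Grapl’s actual node-identifier: it assumes creation events are flagged explicitly, it ignores process termination, dropped logs, and persistence, and every name in it is hypothetical.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import bisect
import uuid
from collections import defaultdict


class Timeline:
    """Identities for one (host, pid) pair, ordered by creation time."""

    def __init__(self):
        self.created_at = []  # sorted creation timestamps
        self.node_ids = []    # node_ids[i] belongs to created_at[i]

    def add_creation(self, timestamp):
        """Record a process-creation event and return its new ID."""
        index = bisect.bisect_right(self.created_at, timestamp)
        node_id = str(uuid.uuid4())
        # Inserting at the sorted position tolerates out-of-order arrival.
        self.created_at.insert(index, timestamp)
        self.node_ids.insert(index, node_id)
        return node_id

    def lookup(self, timestamp):
        """Return the ID of the latest creation at or before timestamp."""
        index = bisect.bisect_right(self.created_at, timestamp)
        return self.node_ids[index - 1] if index > 0 else None


timelines = defaultdict(Timeline)


def identify(host, pid, timestamp, is_creation=False):
    """Create a new ID for creation events, else reuse the timeline's ID."""
    timeline = timelines[(host, pid)]
    if is_creation:
        return timeline.add_creation(timestamp)
    return timeline.lookup(timestamp)
</code></pre></div></div>
<p>With this layout, a reused <code class="language-plaintext highlighter-rouge">pid</code> on the same host resolves to different IDs depending on when the event happened, which is the property the canonical identifier needs.</p>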
<p>The old node-identifier leveraged MySQL for managing identities but there were some problems with that approach.</p>
<p>To start with, all Grapl services are AWS Lambdas, and are built to scale horizontally. MySQL is not built this way - it scales vertically, and you can add read replicas for horizontal scaling of reads. Grapl’s workload doesn’t work well with this - it’s an extremely write-heavy workload as we need to constantly be updating timelines and identities.</p>
<p>On top of that, RDS, the AWS managed database service, limits the number of active connections to MySQL. I was spending money to scale the database vertically just so I could get more connections.</p>
<p>Lastly, I was overusing transactions because it was difficult for me to express my queries using SQL. The table structure didn’t match what was ultimately a simple model - a key for <code class="language-plaintext highlighter-rouge">host + pid</code> and the ability to search by <code class="language-plaintext highlighter-rouge">timestamp</code>.</p>
<p>The service was slow and expensive, and the code was very complex.</p>
<p>In the last few months I’ve rewritten the service to use DynamoDB, AWS’s managed, horizontally scalable NoSQL database. DynamoDB provides a table construct that matches my use case very well - a table’s primary key can be composed of a partition key and a sort key, which means I can use the <code class="language-plaintext highlighter-rouge">host + pid</code> as the partition key and the <code class="language-plaintext highlighter-rouge">timestamp</code> as the sort key.</p>
<p>I also don’t have to worry about too many connections, or holding transactions to Dynamo. Transactions to DynamoDB are tiny, and are an edge case, unlike with MySQL where they were large and always required.</p>
<p>The code is much simpler and performance has improved as well. This was the last service in Grapl that was not horizontally scalable, so this represents a significant milestone.</p>
<h2 id="clustered-dgraph">Clustered DGraph</h2>
<p>Grapl has aimed to be as easy to manage as possible from day one. It leverages AWS Lambdas or other managed services wherever possible. DGraph, however, is not an AWS provided service, and when Grapl was released it was required that you manage the cluster - including the underlying OS.</p>
<p>Today, DGraph is deployed to AWS Fargate. Fargate is a launch type for the AWS Elastic Container Service (ECS) - essentially, it’s container orchestration where AWS manages the underlying hardware as well as the operating system and service discovery.</p>
<p>This has also greatly simplified the deployment of Grapl. There is no need to SSH to any systems in order to bring up the DGraph instances, no need to worry about setting up DNS resolution, and this change, along with a few others, has led to Grapl requiring only a single parameter to be deployed.</p>
<p>By default Grapl will set up a highly available DGraph cluster with 3 DGraph Zeroes and 5 DGraph Alphas.</p>
<p>The node-identifier rewrite and DGraph clustering were the final pieces in the Grapl performance story. There’s plenty of low-hanging fruit left for improving performance, but these were the fundamental architectural improvements that will unlock Grapl’s ability to scale to any workload.</p>
<h2 id="analyzer-library">Analyzer Library</h2>
<p>Previously, Grapl Analyzers required writing raw DGraph queries. One of the early goals of Grapl was to help users to <em>not</em> have to learn another bespoke query language, and instead to leverage the widespread knowledge of Python.</p>
<p>This is now close to being fully realized, with the Grapl Analyzer library providing a simple Python wrapper around the DGraph query language, tuned for Grapl’s use cases.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">signature_graph</span><span class="p">(</span><span class="n">node_key</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-></span> <span class="nb">str</span><span class="p">:</span>
    <span class="n">child</span> <span class="o">=</span> <span class="n">Process</span><span class="p">()</span> \
        <span class="p">.</span><span class="n">with_image_name</span><span class="p">(</span><span class="n">contains</span><span class="o">=</span><span class="s">"svchost.exe"</span><span class="p">)</span> \
        <span class="p">.</span><span class="n">with_node_key</span><span class="p">(</span><span class="n">eq</span><span class="o">=</span><span class="n">node_key</span><span class="p">)</span>
    <span class="n">parent</span> <span class="o">=</span> <span class="n">Process</span><span class="p">()</span> \
        <span class="p">.</span><span class="n">with_image_name</span><span class="p">(</span><span class="n">contains</span><span class="o">=</span><span class="n">Not</span><span class="p">(</span><span class="s">"services.exe"</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">parent</span><span class="p">.</span><span class="n">with_child</span><span class="p">(</span><span class="n">child</span><span class="p">).</span><span class="n">to_query</span><span class="p">()</span>
</code></pre></div></div>
<p>Expressing relationships and constraints on your search is intuitive and simple - you don’t really need to be a Python expert to write these basic signatures.</p>
<p>The use of Python libraries opens up tons of possibilities, way too many to go into detail in this post. To name a few:</p>
<ul>
<li>Code review your alerts</li>
<li>Write tests, integrate into CI</li>
<li>Build abstractions, reuse logic, and generally follow best practices for building and maintaining software</li>
</ul>
<p>The Analyzer library has a fair amount of work left but it’s showing a lot of promise already.</p>
<h2 id="engagements">Engagements</h2>
<p>Grapl was previously capable of creating engagements but there was no method for actually working with them - you could only view the engagements through the DGraph interface. Recently I finished building the proof of concept for the Grapl Engagement UX, which leverages Jupyter Notebooks and a <a href="https://d3js.org/">D3.js</a> based UI.</p>
<p>The intended UX, for now, is that you’ll have two browser windows open. One that holds a live updating visualization of your engagement graph, and one for your Jupyter Notebook, which you’ll use to mutate the graph - adding the relevant nodes, and expanding the graph to represent the scope of the attacker behavior.</p>
<p><img src="https://paper-attachments.dropbox.com/s_940589D1D2FD85DE77E286B547D677FB6771FE927CA2782B71831752F962ADC6_1557943758641_engagementui.gif" alt="" />
<img src="https://paper-attachments.dropbox.com/s_940589D1D2FD85DE77E286B547D677FB6771FE927CA2782B71831752F962ADC6_1557943777704_engagementbotebook.png" alt="" /></p>
<p>I’ve used this approach to investigate some custom malware that I wrote and it’s surprisingly ergonomic despite being a feature with relatively little development time. I’m confident that with a bit more work I can make this into one of the best investigation workflows.</p>
<p>Jupyter Notebooks also hold a ton of potential for drastically improving common response workflows - I can build abstractions that automate common operations, such as enumerating child processes, filtering out known-good binaries, and more.</p>
<p>And at the end of every investigation you get two powerful artifacts - a visualization of attacker scope, and a record, in code, of your investigation steps.</p>
<h2 id="future-work---towards-beta">Future Work - Towards Beta</h2>
<p>Somehow I have actually managed to get Grapl to a state where it feels very nearly done. There is only a single feature that I intend to add before Grapl hits beta, which will indicate a commitment to minimal backwards incompatible changes, and a focus on stability and documentation over features.</p>
<p>The final feature for Grapl’s Alpha phase is to implement Risk Based Alerting.</p>
<p>As Grapl exists today you can do powerful local correlation to pull out individual or even composite attacker behaviors. Analysts can express their attack signatures as more generalized patterns using the Graph constructs, and drive false positive rates way down without sacrificing signal. This graph based approach is a significant improvement over a raw log based approach.</p>
<p>That said, the reality is that attacker behaviors are rarely expressible, with total confidence, using only a single signature - even if that signature correlates a lot of related data. We need non-local correlation and a concept of risk and priority so that instead of chasing false positives we can automatically prioritize where we focus our time.</p>
<h3 id="local-vs-non-local-correlation">Local vs Non-Local Correlation</h3>
<p>The fundamental difference between local correlation and non-local correlation is how connected the signatures are.</p>
<p>An example of local correlation is:</p>
<ul>
<li>Process X created a file Y and executed it as a child process Z.</li>
</ul>
<p>All three of the nodes involved are highly connected. The subgraph describes a single behavior, or connected cluster of behaviors.</p>
<p>Non-local correlation would be something more along the lines of:</p>
<ul>
<li>On asset M, process X created a file Y and executed it as a child process Z</li>
<li>On asset M, process A attempted to modify the system hosts file</li>
<li>On asset M, process B deleted the binary that it executed from</li>
</ul>
<p>These are a series of local correlations, with only a single entity in common - the asset. By viewing the disjoint, local correlations as a grouping under an asset we can better understand that asset’s risk.</p>
<p>The asset is the ‘lens’ we use to view our non-local correlations - it allows us to cut a grouping of local correlations out, and view them together.</p>
<p>I intend to add an Asset node to Grapl and to ensure that every Analyzer provides a risk score alongside it. Then, when triaging, you can simply sort by “riskiest asset”. The graph based approach also means that, despite the correlations being somewhat disconnected, you can trivially identify the connections between them - this makes it easy to see if it’s just a series of benign events on an asset or if there are insidious connections between them.</p>
<p>Grapl’s ability to provide extremely powerful local correlation primitives alongside this non-local correlation should make prioritizing your triage trivial - your risky assets will form a prioritized list, and you’ll just pull from the top.</p>
<h2 id="conclusion">Conclusion</h2>
<p>In the very near future Grapl will be a polished, stable, efficient system that solves real problems for real teams. Grapl has been built with an extreme focus on solving real world issues - I genuinely believe that teams will benefit greatly from adopting this tool or at least the underlying approaches.</p>
<p>If you have any interest in the project, please feel free to reach out either on Twitter or Github:</p>
<p>https://github.com/insanitybit/grapl</p>
<p>https://twitter.com/InsanityBit</p>
<h1>Grapl - A Graph Platform For Detection and Response</h1>
<p>2019-03-09 - <a href="http://insanitybit.github.io/2019/03/09/grapl">http://insanitybit.github.io/2019/03/09/grapl</a></p>
<p>(This blog post is transcribed from a conference talk)
<a href="https://slides.com/colinwa/grapl-a-graph-platform-for-detection-and-response-5/#/">(Original slides)</a></p>
<p>Github: https://github.com/insanitybit/grapl</p>
<p>Twitter: https://twitter.com/InsanityBit</p>
<p>Grapl is an open source platform for Detection and Response (D&R). The position that Grapl takes is that Graphs provide a more natural experience than raw logs for many common D&R use cases.</p>
<p>A graph is a data structure - like a linked list, or a hashmap. Graphs are composed of “nodes” and “edges” - nodes are usually analogous to “entities” and edges denote the relationships between nodes.</p>
<p><img src="https://d2mxuefqeaa7sj.cloudfront.net/s_A34F006A82484ABC31C78D15F90D61A26E825A6B4715DF2C484A9992F8A03A0F_1551939201764_Screenshot+from+2019-03-06+22-13-07.png" alt="" /></p>
<p>Graphs are a very powerful data structure. Even from a visualization of a graph with no properties or labels, as above, we can derive information; I can see, for example, that the purple node has a relationship with the other two nodes.</p>
<p>The ability to encode relationships into their structure is one of the features of graphs that makes them so appealing for so many workloads.</p>
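<p>For readers who haven’t worked with graphs directly, a minimal sketch of the data structure might look like the following. The node names and the <code>related_to</code> label are invented for illustration; this is not how Grapl stores its graph.</p>

```python
class Graph:
    """A tiny directed graph: nodes are strings, edges are labeled pairs."""

    def __init__(self):
        self.edges = {}  # node -> [(label, neighbor), ...]

    def add_edge(self, src, label, dst):
        self.edges.setdefault(src, []).append((label, dst))
        self.edges.setdefault(dst, [])  # register dst as a node as well

    def neighbors(self, node):
        return [dst for _, dst in self.edges.get(node, [])]


graph = Graph()
graph.add_edge("purple", "related_to", "green")
graph.add_edge("purple", "related_to", "blue")
```

<p>Asking for <code>graph.neighbors("purple")</code> walks the relationships directly, with no searching or joining involved.</p>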
<p>Companies like Facebook are all about relationships - that’s the basis of social media, after all. If Facebook wants to target a given user for some ad, they can leverage graphs to do so by looking at what a user’s relationships are. A user may have liked a post, or opened an app - or you can look at their friends and their behaviors, building up a model for what this user may be in the market for.</p>
<p>Graphs have become <a href="https://developers.facebook.com/docs/graph-api/">a central part of Facebook’s API</a>:</p>
<p>“The Graph API is the primary way for apps to read and write to the Facebook social graph. All of our SDKs and products interact with the Graph API in some way, and our other APIs are extensions of the Graph API, so understanding how the Graph API works is crucial.”</p>
<p><img src="https://techcrunch.com/wp-content/uploads/2015/04/facebook-api.png?w=680" alt="Facebook’s graph based approach to understanding user interests." /></p>
<p>Google is another company that has invested considerably in graphs. Google’s Knowledge Graph is what allows them to move past a plaintext search - they encode semantic information into their graph, allowing you to make a search for a term like “Apple Location” and get results relating to Cupertino, even for pages that don’t have the word “Apple” in them.</p>
<p><img src="https://d2mxuefqeaa7sj.cloudfront.net/s_A34F006A82484ABC31C78D15F90D61A26E825A6B4715DF2C484A9992F8A03A0F_1551940633588_Screenshot+from+2019-03-06+22-36-59.png" alt="" /></p>
<p>TensorFlow, another Google project, is built using <a href="https://en.wikipedia.org/wiki/Dataflow_programming">Dataflow Programming</a> - a way to compile programs into graphs. TensorFlow’s claim to fame was its victory against the top Go board game champions, something that many thought was decades off.</p>
<p><img src="https://i2.wp.com/sourcedexter.com/wp-content/uploads/2017/04/TensorFlow-graph1.jpg?ssl=1" alt="" /></p>
<p><strong>Graphs in Security</strong>
Graph based approaches have started to gain traction in the security space. John Lambert, a Distinguished Engineer at Microsoft, stated:</p>
<p><img src="https://d2mxuefqeaa7sj.cloudfront.net/s_A34F006A82484ABC31C78D15F90D61A26E825A6B4715DF2C484A9992F8A03A0F_1552168948498_Screenshot+from+2019-03-09+14-01-32.png" alt="" /></p>
<p><a href="https://twitter.com/johnlatwc/status/1059841882086232065">https://twitter.com/johnlatwc/status/1059841882086232065</a></p>
<p>As Lambert expresses in his post, Defenders tend to work with lists of information - lists of assets, possibly with labels such as criticality. Attackers, on the other hand, work with graphs - they land on a box and start traversing the network. The inability of lists to properly express relationships makes it hard to map the defender’s tools to the attacker’s approach. A list-based defender may think too locally about an asset, not understanding the scope of the trust relationships within their network.</p>
<p>The tool BloodHound uses a graph-based approach to demonstrate implicit trust relationships within Active Directory. They even package custom graph algorithms to help you determine attack routes that you should be prioritizing - such as the ‘Shortest Path To Domain Admins’ query.</p>
<p><img src="https://blog.stealthbits.com/wp-content/uploads/2017/03/BloodHound-Attack-Graph.png" alt="" /></p>
<p>Graph based thinking can go well beyond assets and users.</p>
<p><a href="https://github.com/duo-labs/cloudmapper">CloudMapper</a> is a tool for exploring trust relationships in AWS environments, primarily for determining potential holes in policies.</p>
<p><img src="https://raw.githubusercontent.com/duo-labs/cloudmapper/master/docs/images/ideal_layout.png" alt="" /></p>
<p><strong>Graphs for D&R</strong>
The graph based approach has been shown to be an effective, natural fit for a diverse problem space, and there’s a great case to be made that it fits well with Detection and Response work.</p>
<p>The core primitive for D&R is currently the log, and we index massive lists of logs. Then, we write alerts based on the logs in our environments and comb through them line by line, attempting to pivot off of the data we find.</p>
<p>A list of logs might show:</p>
<p><img src="https://d2mxuefqeaa7sj.cloudfront.net/s_A34F006A82484ABC31C78D15F90D61A26E825A6B4715DF2C484A9992F8A03A0F_1551942907936_Screenshot+from+2019-03-06+23-14-54.png" alt="" /></p>
<p>Looking at these logs we can see that they are connected - though it may not be immediately obvious. The <code class="language-plaintext highlighter-rouge">pid</code> and <code class="language-plaintext highlighter-rouge">ppid</code> fields match them up, and in such a small window of time we can probably rule out pid collisions.</p>
<p>Implicit within this set of logs is a graph.</p>
<p><img src="https://d2mxuefqeaa7sj.cloudfront.net/s_A34F006A82484ABC31C78D15F90D61A26E825A6B4715DF2C484A9992F8A03A0F_1551943074284_Screenshot+from+2019-03-06+23-17-35.png" alt="" /></p>
<p>When these logs are structured as graphs the relationships are impossible to miss - it’s pretty easy to look at the graph and understand what’s going on.</p>
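<p>The step from logs to a graph can be sketched in a few lines. The log records below are invented for illustration (they are not the logs from the screenshot), and the sketch assumes, as above, that the window is small enough that pids do not collide.</p>

```python
# Hypothetical process-start logs; only pid/ppid tie them together.
logs = [
    {"pid": 100, "ppid": 1, "image_name": "explorer.exe"},
    {"pid": 250, "ppid": 100, "image_name": "winword.exe"},
    {"pid": 300, "ppid": 250, "image_name": "powershell.exe"},
]


def parent_child_edges(logs):
    """Make the implicit pid/ppid relationship explicit as (parent, child) edges."""
    by_pid = {log["pid"]: log for log in logs}
    edges = []
    for log in logs:
        parent = by_pid.get(log["ppid"])
        if parent is not None:  # the parent may not appear in this batch of logs
            edges.append((parent["image_name"], log["image_name"]))
    return edges


edges = parent_child_edges(logs)
```

<p>Once the edges exist, the chain from <code>explorer.exe</code> through <code>winword.exe</code> to <code>powershell.exe</code> is a path you can follow rather than a join you have to notice.</p>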
<p><strong>Grapl</strong>
Grapl aims to provide exactly this sort of graph abstraction, designed specifically for D&R. Grapl will take security relevant logs and convert them into graphs, forming a giant ‘Master Graph’ representing the actions across your environments. This Master Graph lives in DGraph, which provides us a language for querying the data.</p>
<p><img src="https://d2mxuefqeaa7sj.cloudfront.net/s_A34F006A82484ABC31C78D15F90D61A26E825A6B4715DF2C484A9992F8A03A0F_1552001462405_Screen+Shot+2019-03-07+at+3.30.48+PM.png" alt="" /></p>
<p>Grapl is an attempt to explore Detection and Response given a graph primitive instead of a log primitive. It works by taking logs that you send it (currently supporting Sysmon and a custom JSON format) and parsing those logs out into a subgraph representation.</p>
<p>Grapl then determines the ‘identity’ for each node (eg: “we have a pid, we have a timestamp, what’s the ID for this process node”).</p>
<p>Using this identity Grapl can then pin up this subgraph into the master graph. This master graph will represent all of the relationships across your environment.</p>
<p>Analyzers, the ‘attacker signatures’ for Grapl, are then executed against the graph. These analyzers can query the master graph for suspicious patterns.</p>
<p>When these analyzers find a sketchy subgraph Grapl will generate an Engagement - this is where your investigation will begin.</p>
<p><strong>Identity</strong></p>
<p>By joining data together semantically it becomes possible to easily reason about non-local attributes of your data. A log for a process event tells you only one piece of information whereas a Process Node with an identity can tell you about the history of a process, and its behavior over time. It can take multiple logs to describe a process starting, reading a file, making a connection, and terminating, but we can represent that data far more effectively as a single node.</p>
<p>Nodes in a graph can also provide identity, which is a very powerful construct. With logs you have information about entities spread across many places - identity allows looking up all of the information for an entity in one place, its node.</p>
<p><img src="https://d2mxuefqeaa7sj.cloudfront.net/s_A34F006A82484ABC31C78D15F90D61A26E825A6B4715DF2C484A9992F8A03A0F_1552108764388_Screenshot+from+2019-03-08+21-18-08.png" alt="" /></p>
<p>Grapl identifies nodes by creating a timeline of state changing events. A creation event, such as a process creation, will mark the beginning of a new “session”. If we have logs come in that have “seen” a pid, we’ll go to that pid’s session timeline and find the process creation event that is closest before the “seen” time.</p>
<p><img src="https://d2mxuefqeaa7sj.cloudfront.net/s_A34F006A82484ABC31C78D15F90D61A26E825A6B4715DF2C484A9992F8A03A0F_1552163142776_Screenshot+from+2019-03-09+12-22-57.png" alt="" /></p>
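<p>A simplified sketch of that lookup, assuming one sorted timeline of creation timestamps per pid, is below. The class and method names are my own for this example and do not mirror Grapl’s actual implementation.</p>

```python
import bisect


class SessionTimeline:
    """Creation timestamps for one pid on one asset, kept sorted."""

    def __init__(self):
        self.creation_times = []

    def add_creation(self, timestamp):
        bisect.insort(self.creation_times, timestamp)

    def session_for(self, seen_time):
        """Return the creation time closest before (or at) seen_time, or None."""
        index = bisect.bisect_right(self.creation_times, seen_time)
        if index == 0:
            return None  # no creation event known before this time
        return self.creation_times[index - 1]


timeline = SessionTimeline()
for created_at in (1000, 5000, 9000):
    timeline.add_creation(created_at)
```

<p>A log that “saw” the pid at time 5200 resolves to the session created at 5000, while a log from before the first known creation resolves to nothing.</p>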
<p>In some cases, such as for processes that start up very early or for static files, we may not have creation events. In this case we have to guess at what the process ID is.</p>
<p><img src="https://d2mxuefqeaa7sj.cloudfront.net/s_A34F006A82484ABC31C78D15F90D61A26E825A6B4715DF2C484A9992F8A03A0F_1552163129421_Screenshot+from+2019-03-09+12-23-22.png" alt="" /></p>
<p>Guesses will also have to propagate. The algorithm has room for improvements but my experience is that it tends to guess things correctly.</p>
<p>Grapl will also assume that if two of the same pid are seen within a small window of time that they are the same, and won’t look over the entire timeline of sessions to figure it out - this should match reality pretty well, pid collisions don’t occur too often.</p>
<p><img src="https://d2mxuefqeaa7sj.cloudfront.net/s_A34F006A82484ABC31C78D15F90D61A26E825A6B4715DF2C484A9992F8A03A0F_1552163150870_Screenshot+from+2019-03-09+12-23-35.png" alt="" /></p>
<p><strong>Analyzers</strong></p>
<p>Logs generally describe an action, and some properties of that action. In some cases, with really powerful logging solutions like Sysmon, we can get a few relationships as well - such as a parent process ID.</p>
<p>Logs, especially logs from Sysmon, can power a lot of great alerts.</p>
<p><img src="https://d2mxuefqeaa7sj.cloudfront.net/s_A34F006A82484ABC31C78D15F90D61A26E825A6B4715DF2C484A9992F8A03A0F_1552002210493_Screen+Shot+2019-03-07+at+3.34.24+PM.png" alt="" /></p>
<p>Not every log is like Sysmon though - the relationships are often implicit. And while Sysmon may pull some relationships in, it’s only ever one layer.</p>
<p>With the average log the relationships are only ever implicit. This can lead to thinking about actions in isolation, and not as part of a chain of events, or a sum of properties that may be spread across many logs.</p>
<p>Looking at the following logs in isolation, it may not be obvious that this is malicious behavior. Word spawning is benign, and powershell spawning is also often benign - certainly you could not write an alert on these logs alone in most environments.</p>
<p><img src="https://d2mxuefqeaa7sj.cloudfront.net/s_A34F006A82484ABC31C78D15F90D61A26E825A6B4715DF2C484A9992F8A03A0F_1552002680190_Screen+Shot+2019-03-07+at+3.50.00+PM.png" alt="" /></p>
<p>When we pull out these relationships to form a graph the behavior is much more obviously malicious. We can move from writing an alert against properties, or individual events, and start writing alerts on behaviors and relationships.</p>
<p><img src="https://d2mxuefqeaa7sj.cloudfront.net/s_A34F006A82484ABC31C78D15F90D61A26E825A6B4715DF2C484A9992F8A03A0F_1552002841580_Screen+Shot+2019-03-07+at+3.51.27+PM.png" alt="" /></p>
<p>When we view our system as a graph, attacker signatures are obvious. Observing non-local properties of a process or file becomes much simpler with a graph.</p>
<p><img src="https://d2mxuefqeaa7sj.cloudfront.net/s_A34F006A82484ABC31C78D15F90D61A26E825A6B4715DF2C484A9992F8A03A0F_1552014429678_Screenshot+from+2019-03-07+19-06-58.png" alt="" /></p>
<p>Alerts in Grapl are simply Python files. This provides maximum flexibility for alerts - you aren’t constrained by a DSL or query language. If you want to run multiple queries to express your alert, with intermediary logic, Python makes it trivial.</p>
<p>Python is also the language of choice for data scientists, which means the library ecosystem for working with data is best in class. With Python it’s possible to integrate other APIs or database queries into your alerts.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">signature_graph</span><span class="p">()</span> <span class="o">-></span> <span class="nb">str</span><span class="p">:</span>
    <span class="n">child</span> <span class="o">=</span> <span class="n">Process</span><span class="p">()</span> \
        <span class="p">.</span><span class="n">with_image_name</span><span class="p">(</span><span class="n">contains</span><span class="o">=</span><span class="s">"svchost.exe"</span><span class="p">)</span> \
        <span class="p">.</span><span class="n">with_node_key</span><span class="p">(</span><span class="n">eq</span><span class="o">=</span><span class="s">'$a'</span><span class="p">)</span>
    <span class="n">parent</span> <span class="o">=</span> <span class="n">Process</span><span class="p">()</span> \
        <span class="p">.</span><span class="n">with_image_name</span><span class="p">(</span><span class="n">contains</span><span class="o">=</span><span class="n">Not</span><span class="p">(</span><span class="s">"services.exe"</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">parent</span><span class="p">.</span><span class="n">with_child</span><span class="p">(</span><span class="n">child</span><span class="p">).</span><span class="n">to_query</span><span class="p">()</span>

<span class="k">def</span> <span class="nf">_analyzer</span><span class="p">(</span><span class="n">client</span><span class="p">:</span> <span class="n">DgraphClient</span><span class="p">,</span> <span class="n">graph</span><span class="p">:</span> <span class="n">Subgraph</span><span class="p">,</span> <span class="n">sender</span><span class="p">:</span> <span class="n">Connection</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">node_key</span> <span class="ow">in</span> <span class="n">graph</span><span class="p">.</span><span class="n">subgraph</span><span class="p">.</span><span class="n">nodes</span><span class="p">:</span>
        <span class="n">res</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="n">signature_graph</span><span class="p">(),</span> <span class="n">variables</span><span class="o">=</span><span class="p">{</span><span class="s">'$a'</span><span class="p">:</span> <span class="n">node_key</span><span class="p">})</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="p">(</span><span class="n">res</span> <span class="ow">and</span> <span class="n">res</span><span class="p">.</span><span class="n">json</span><span class="p">):</span>
            <span class="k">print</span><span class="p">(</span><span class="s">'res was empty'</span><span class="p">)</span>
            <span class="k">continue</span>
        <span class="n">res</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">loads</span><span class="p">(</span><span class="n">res</span><span class="p">.</span><span class="n">json</span><span class="p">)</span>
        <span class="k">if</span> <span class="p">[</span><span class="n">sender</span><span class="p">.</span><span class="n">send</span><span class="p">(</span><span class="n">make_hit</span><span class="p">(</span><span class="n">match</span><span class="p">))</span> <span class="k">for</span> <span class="n">match</span> <span class="ow">in</span> <span class="n">res</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'q0'</span><span class="p">,</span> <span class="p">[])]:</span>
            <span class="k">print</span><span class="p">(</span><span class="s">'Got a hit for {}'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">node_key</span><span class="p">))</span>
    <span class="n">sender</span><span class="p">.</span><span class="n">send</span><span class="p">(</span><span class="n">ExecutionComplete</span><span class="p">())</span>
</code></pre></div></div>
<p>One of the best parts about using Python for alert logic is the ability to write powerful tests. You can patch and mock, set up local infrastructure, write positive and negative tests, etc - all within the stdlib. Python also has some powerful testing tools such as <a href="https://hypothesis.readthedocs.io/en/latest/"><strong>Hypothesis</strong></a>. Using Python should make it easy to integrate your tests into a Continuous Integration pipeline.</p>
<p>The value of writing correct alerts that age well can’t be overstated. Testing is a huge boon - but there are also linters, style enforcement, code review, type annotations, and tons of other ways to ensure that no matter who’s modifying your alerts they still function correctly.</p>
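<p>As a sketch of what such a test could look like using only the stdlib, here is a positive and a negative test against a deliberately simplified analyzer. The <code>suspicious_svchost</code> function and its client/sender interfaces are stand-ins invented for this example, not Grapl’s real analyzer API.</p>

```python
import unittest
from unittest import mock


def suspicious_svchost(client, node_key, sender):
    """Hypothetical analyzer: forward every match for node_key to sender."""
    matches = client.query(node_key)
    for match in matches:
        sender.send(match)
    return len(matches)


class SuspiciousSvchostTest(unittest.TestCase):
    def test_hit_is_forwarded(self):
        # Positive case: the mocked database returns one matching subgraph.
        client = mock.Mock()
        client.query.return_value = [{"image_name": "svchost.exe"}]
        sender = mock.Mock()

        count = suspicious_svchost(client, "node-1", sender)

        self.assertEqual(count, 1)
        sender.send.assert_called_once_with({"image_name": "svchost.exe"})

    def test_no_match_sends_nothing(self):
        # Negative case: no matches means nothing is sent downstream.
        client = mock.Mock()
        client.query.return_value = []
        sender = mock.Mock()

        self.assertEqual(suspicious_svchost(client, "node-2", sender), 0)
        sender.send.assert_not_called()
```

<p>Saved to a file, these tests can be run with <code>python -m unittest</code>, which is all a CI job would need to invoke.</p>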
<p><img src="https://d2mxuefqeaa7sj.cloudfront.net/s_A34F006A82484ABC31C78D15F90D61A26E825A6B4715DF2C484A9992F8A03A0F_1552073881476_Screen+Shot+2019-03-08+at+10.45.46+AM.png" alt="" /></p>
<p><strong>Investigations</strong></p>
<p>When investigating an attacker using a log based approach there are some significant caveats.</p>
<p>Given a log that triggers an alert you’ll begin searching for fields related to that log, building up a timeline of attacker related events.</p>
<p>During our investigation we want to pivot off of relationships. We’ll start at the top layer - the initial alert, and use its fields as implicit joins. Pivoting off of the ‘pid’ and ‘ppid’ may yield some results.</p>
<p>We may also search for the <code class="language-plaintext highlighter-rouge">hash</code> - but these are all implicit relationships, not real ones, so we get nothing back (wasting time searching over all of our logs in the search window).</p>
<p>We may also search for the <code class="language-plaintext highlighter-rouge">image_name</code> - it would be great to know what created the suspicious file. Unfortunately, no results again.</p>
<p><img src="https://d2mxuefqeaa7sj.cloudfront.net/s_A34F006A82484ABC31C78D15F90D61A26E825A6B4715DF2C484A9992F8A03A0F_1552075843638_Screen+Shot+2019-03-08+at+12.10.14+PM.png" alt="" /></p>
<p>Pivoting off of the <code class="language-plaintext highlighter-rouge">image_name</code> would be really nice, so we can extend our search window back further. This will slow searches down, but it can be necessary to get logs we need.</p>
<p>We’ve now pulled in some logs relating to <code class="language-plaintext highlighter-rouge">image_name</code>, but we also see a log relating to the <code class="language-plaintext highlighter-rouge">pid</code> - but it’s not the process we care about. In my experience, once you’re expanding your search window >12 hours you are extremely likely to run into a pid collision. So now we’ll have to deal with removing those logs as they are not relevant.</p>
<p>This log based workflow can be completely fine, especially for short investigations. When it comes to longer investigations, the inability to understand your pivot points, the need to re-search the entire timeline, and the lack of identity can slow things down a lot.</p>
<p><img src="https://d2mxuefqeaa7sj.cloudfront.net/s_A34F006A82484ABC31C78D15F90D61A26E825A6B4715DF2C484A9992F8A03A0F_1552074130025_Screen+Shot+2019-03-08+at+11.41.20+AM.png" alt="" /></p>
<p>Because the graph data structure encodes relationships, which are our pivot points, directly into its structure, it solves a lot of these issues. When looking at a process we don’t need to ‘search’ for what that process has done - we have all of that encompassed in the node and in its relationships.</p>
<p>We don’t have to wonder about our edges - whether we can pivot off of them or not is determined by their existence; if an edge exists, we can traverse it.</p>
<p>There’s no need to expand search windows - we don’t have to care about when a parent process did something, we can see what it has done in totality.</p>
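<p>To illustrate pivoting by traversal rather than by search, here is a small sketch over an invented adjacency map. The node names are hypothetical and the structure is far simpler than Grapl’s master graph, but it shows that nothing here depends on a time window.</p>

```python
from collections import deque

# Hypothetical edges: each node maps to the nodes it created or connected to.
edges = {
    "winword.exe": ["powershell.exe"],
    "powershell.exe": ["payload.exe", "10.0.0.5:443"],
    "payload.exe": [],
    "10.0.0.5:443": [],
}


def expand(edges, start):
    """Breadth-first walk: every node reachable from start, in visit order."""
    seen = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in edges.get(node, []):
            if neighbor not in seen:
                seen.append(neighbor)
                queue.append(neighbor)
    return seen


reachable = expand(edges, "winword.exe")
```

<p>If an edge exists it gets followed; if it doesn’t, there is nothing to search for.</p>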
<p><img src="https://d2mxuefqeaa7sj.cloudfront.net/s_A34F006A82484ABC31C78D15F90D61A26E825A6B4715DF2C484A9992F8A03A0F_1552078481025_Screen+Shot+2019-03-08+at+12.54.25+PM.png" alt="" /></p>
<p>Grapl provides an Engagement construct for performing investigations - it’s a Python class that you can load up in an <a href="https://sagemaker-workshop.com/prerequisites/jupyter.html">AWS hosted Jupyter Notebook</a>.</p>
<p>Engagements are still very alpha, and the code below is not fully implemented.</p>
<p>When an Analyzer fires Grapl will store whatever subgraph it triggered on in the Engagement Graph - a separate graph database instance to hold engagements.</p>
<p>After instantiating a view of the Engagement in our Notebook we can start expanding the graph. Our pivot points are always observable and pivoting is efficient and trivial.</p>
<p>An investigation is complete when you have a complete subgraph representation of attacker behavior. The notebook you’ve used will act as a record of your investigation - you can incorporate these into libraries or runbooks, and use them as training material for tabletops.</p>
<p>There’s no supported visual display for the engagement graphs right now, but my hope is to build a live updated visualization that can be displayed in a separate browser window. This would allow for a separation of your searches and the state of the engagement - I think this will make it easier for multiple people to work on an engagement in the future.</p>
<p>In the future, because engagements are themselves a graph, it should be possible to merge engagements together when they overlap. Or, otherwise, you could have temporary engagement graphs that don’t alert on their own, but only if a correlation threshold is met.</p>
<p><img src="https://d2mxuefqeaa7sj.cloudfront.net/s_A34F006A82484ABC31C78D15F90D61A26E825A6B4715DF2C484A9992F8A03A0F_1552105888271_Screenshot+from+2019-03-08+20-31-07.png" alt="" /></p>
<p><strong>Grapl as a Platform</strong>
I’ve described Grapl as a platform - what that means is that it isn’t a black box. Grapl is a collection of services hosted on AWS and libraries for integrating into them. Every single service that makes up Grapl works by emitting and receiving events.</p>
<p><img src="https://d2mxuefqeaa7sj.cloudfront.net/s_A34F006A82484ABC31C78D15F90D61A26E825A6B4715DF2C484A9992F8A03A0F_1552107160853_Screenshot+from+2019-03-08+20-52-28.png" alt="" /></p>
<p>All events are multiconsumer, so extending Grapl is as easy as subscribing to those events. If a parser for a new log format needs to be added, it’s a matter of subscribing to publishes on the “raw-log” bucket and then emitting messages to the “unidentified-subgraph” bucket.</p>
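<p>The shape of that extension point can be sketched with an in-memory event bus. This is only a stand-in: the <code>EventBus</code> class and the comma-separated log format are assumptions for illustration, and in Grapl itself the events come from AWS services rather than from a Python object.</p>

```python
from collections import defaultdict


class EventBus:
    """In-memory stand-in for bucket events; every subscriber sees every event."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in list(self.subscribers[topic]):
            handler(event)


bus = EventBus()
subgraphs = []
archive = []


def my_format_parser(raw_log):
    """Hypothetical parser for a new 'pid,image_name' log format."""
    pid, image_name = raw_log.split(",")
    bus.publish("unidentified-subgraph", {"pid": int(pid), "image_name": image_name})


bus.subscribe("raw-log", my_format_parser)
bus.subscribe("raw-log", archive.append)  # a second, independent consumer
bus.subscribe("unidentified-subgraph", subgraphs.append)

bus.publish("raw-log", "300,powershell.exe")
```

<p>Adding the second consumer required no change to the parser, which is the property that makes the platform extensible.</p>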
<p><img src="https://d2mxuefqeaa7sj.cloudfront.net/s_A34F006A82484ABC31C78D15F90D61A26E825A6B4715DF2C484A9992F8A03A0F_1552107173910_Screenshot+from+2019-03-08+20-52-44.png" alt="" /></p>
<p>Building parsers for Grapl is best done in Rust. I built most of Grapl in Rust for a few reasons - the simplest is performance; Grapl has to be able to consume a lot of logs, and Rust is the ideal language for that task. It’s memory safe and it’s <a href="https://github.com/serde-rs/json-benchmark">extremely efficient</a>.</p>
<p>When building a parser the goal is to create a graph that represents the relationships between any entities described in the log. Even if all you have is a parent process id, you can describe that as a node - Grapl will figure out exactly <em>which</em> node that pid is referring to, and link everything up.</p>
<p>We also have to provide the “state” of the node for certain node types, like Process or File, which can have transient states (“Created”, “Already exists”, “Terminated/Deleted”). This helps Grapl to identify the node.</p>
<p><img src="https://d2mxuefqeaa7sj.cloudfront.net/s_A34F006A82484ABC31C78D15F90D61A26E825A6B4715DF2C484A9992F8A03A0F_1552107598525_Screenshot+from+2019-03-08+20-59-47.png" alt="" /></p>
<p><img src="https://d2mxuefqeaa7sj.cloudfront.net/s_A34F006A82484ABC31C78D15F90D61A26E825A6B4715DF2C484A9992F8A03A0F_1552107839744_Screenshot+from+2019-03-08+21-03-45.png" alt="" /></p>
<p>Once the nodes have been parsed out they need to be linked up in a graph. And that’s it - a bit of boilerplate around event handling, and you have a subgraph parser.</p>
<p><img src="https://d2mxuefqeaa7sj.cloudfront.net/s_A34F006A82484ABC31C78D15F90D61A26E825A6B4715DF2C484A9992F8A03A0F_1552107874491_Screenshot+from+2019-03-08+21-04-20.png" alt="" /></p>
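<p>The real parsers are written in Rust, as shown in the screenshots. Purely to illustrate the idea in the language used for the other examples here, a parser for a made-up JSON process-start log might look like this in Python. The field names, the <code>children</code> edge label, and the dictionary layout are all assumptions rather than Grapl’s actual types.</p>

```python
import json


def parse_process_start(raw):
    """Turn one hypothetical JSON process-start log into a small subgraph."""
    log = json.loads(raw)
    child = {
        "type": "Process",
        "pid": log["pid"],
        "image_name": log["image_name"],
        "timestamp": log["timestamp"],
        "state": "Created",
    }
    # All we know about the parent is its pid; it already existed at this time.
    parent = {
        "type": "Process",
        "pid": log["ppid"],
        "timestamp": log["timestamp"],
        "state": "Already exists",
    }
    edges = [(parent["pid"], "children", child["pid"])]
    return {"nodes": [parent, child], "edges": edges}


subgraph = parse_process_start(
    '{"pid": 300, "ppid": 250, "image_name": "powershell.exe", "timestamp": 1552000000}'
)
```

<p>Note that the parent node carries nothing but a pid, a timestamp, and a state - resolving which process that actually is would be left to the identification step.</p>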
<p><strong>Future</strong></p>
<p>Grapl has gone from a white board drawing to being able to build full graphs of system activity from my lab environment, but I have a lot of future work planned.</p>
<p><img src="https://d2mxuefqeaa7sj.cloudfront.net/s_A34F006A82484ABC31C78D15F90D61A26E825A6B4715DF2C484A9992F8A03A0F_1552108094940_Screenshot+from+2019-03-08+21-07-52.png" alt="" /></p>
<p><img src="https://d2mxuefqeaa7sj.cloudfront.net/s_A34F006A82484ABC31C78D15F90D61A26E825A6B4715DF2C484A9992F8A03A0F_1552108610925_Screenshot+from+2019-03-08+21-16-25.png" alt="" /></p>
<p>One thing I haven’t built out for Grapl is a data management system. Grapl can store data very efficiently but there’s no cleanup of old nodes. I intend to solve this problem soon, but it isn’t straightforward.</p>
<p><img src="https://d2mxuefqeaa7sj.cloudfront.net/s_A34F006A82484ABC31C78D15F90D61A26E825A6B4715DF2C484A9992F8A03A0F_1552108906369_netowrk.png" alt="" /></p>
<p>Grapl has solid support for Process and File nodes, and preliminary support for networking. I’m eager to get network support in as soon as possible in order to better capture lateral movement and C2. In the future I’d like to also explore modeling users and assets - I think this information is often important during the triage phase, and it opens up some interesting analytics approaches like cohort analysis.</p>
<p><img src="https://d2mxuefqeaa7sj.cloudfront.net/s_A34F006A82484ABC31C78D15F90D61A26E825A6B4715DF2C484A9992F8A03A0F_1552159300358_Screenshot+from+2019-03-09+11-21-16.png" alt="" /></p>
<p>There are many research papers around graphs - even in the security detection space. One paper in particular describes an approach for detecting the components of what it calls a “Causal Graph” that are likely to belong to an attacker. Their tool, PRIOTRACKER, demonstrates an approach that expands a graph based on commonality features while trying to minimize fanout - ideally scoping exclusively attacker behavior.</p>
<p>This sort of approach is usually pretty expensive, even for a system like the one in the paper that is explicitly optimized for this work. I think this approach would actually be ideal as a way to automatically scope an engagement. When engagements are created the system could auto-scope the connected graph and by the time the Jupyter Notebook is opened a portion of the investigation could be complete.</p>
<p>When working with a powerful abstraction like a Graph there’s a lot of opportunity to implement more advanced techniques.</p>
<p><img src="https://d2mxuefqeaa7sj.cloudfront.net/s_A34F006A82484ABC31C78D15F90D61A26E825A6B4715DF2C484A9992F8A03A0F_1552159733616_Screenshot+from+2019-03-09+11-26-28.png" alt="https://www.princeton.edu/~pmittal/publications/priotracker-ndss18" /></p>
<p>Grapl currently supports a custom JSON log format, as well as Sysmon logs. I think that between these two formats you can generally get most data up into Grapl, but I’d love to have native support for OSQuery or other open source logging solutions.</p>
<p>Grapl also has no native display, and relies on dgraph’s visualization, which isn’t tuned for engagements. Graphistry has beautiful display features and is designed with similar workflows in mind - an integration seems worth pursuing.</p>
<p><img src="https://d2mxuefqeaa7sj.cloudfront.net/s_A34F006A82484ABC31C78D15F90D61A26E825A6B4715DF2C484A9992F8A03A0F_1552161442988_Screenshot+from+2019-03-09+11-57-10.png" alt="" /></p>
<p><strong>Setting up Grapl</strong></p>
<p>Grapl is intended to be very easy to set up and operate. Almost all of Grapl is built on managed services, and can be set up using a single deploy script.</p>
<p>Clone grapl and install the aws cloud development kit:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ npm i -g aws-cdk
$ git clone git@github.com:insanitybit/grapl.git
</code></pre></div></div>
<p>Create a <code class="language-plaintext highlighter-rouge">.env</code> file in the <code class="language-plaintext highlighter-rouge">grapl-cdk</code> folder with the following fields:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cd ./grapl/grapl-cdk/
$ <your editor> ./.env
HISTORY_DB_USERNAME=username
HISTORY_DB_PASSWORD=password
BUCKET_PREFIX="<unique bucket prefix>"
GRAPH_DB_KEY_NAME="<name of ssh key>"
</code></pre></div></div>
<p>Then it’s a single command to deploy:
<code class="language-plaintext highlighter-rouge">./deploy_all.sh</code></p>
<p>Wait a few minutes and Grapl’s infrastructure and services will be deployed.</p>
<p>After Grapl’s core services are set up you can SSH using the SSH key as named in the .env file.</p>
<p><a href="https://docs.dgraph.io/deploy/#install-dgraph">Then just follow the DGraph deploy steps here.</a></p>
<p>The deployment isn’t necessarily built for a production, scalable system, but it’s a base install that can provide a playground to work with Grapl.</p>
<h1>Deploying Grapl With AWS CDK</h1>
<p>2018-11-05 - <a href="http://insanitybit.github.io/2018/11/05/deploying-grapl-with-aws-cdk">http://insanitybit.github.io/2018/11/05/deploying-grapl-with-aws-cdk</a></p>
<p>Over the last few months I’ve been working on <a href="https://github.com/insanitybit/grapl">Grapl, a platform for DFIR</a>
built largely around graph structures.</p>
<p>I wanted Grapl to be trivial to deploy, both because it would ease others’
work to get started with it, and because it’ll make my test cycle a lot
faster.</p>
<p>Grapl consists of around 7 Lambdas, some S3 buckets, SNS topics, SQS
queues, and the connections and policies between them. Configuring
and changing these in the console quickly became untenable. I evaluated
two projects to make my life easier here, with the goal of having a
near-one-command deployment.</p>
<h3 id="terraform">Terraform</h3>
<p>The first project I started with was Terraform. Terraform is developed by
HashiCorp, and is basically a DSL, written in HCL, for describing infrastructure
much as you would in a CloudFormation template.</p>
<p>I found HCL to mostly be unbearable. It’s very ‘stringy’ and weird - it
definitely feels like a DSL, and not like a typical programming language,
which I don’t really see the appeal of since I spend 99% of my time using
typical languages.</p>
<p>Here’s an example:</p>
<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Create a subnet to launch our instances into</span>
<span class="nx">resource</span> <span class="s2">"aws_subnet"</span> <span class="s2">"default"</span> <span class="p">{</span>
<span class="nx">vpc_id</span> <span class="p">=</span> <span class="s2">"${aws_vpc.default.id}"</span>
<span class="nx">cidr_block</span> <span class="p">=</span> <span class="s2">"10.0.1.0/24"</span>
<span class="nx">map_public_ip_on_launch</span> <span class="p">=</span> <span class="kc">true</span>
<span class="p">}</span>
</code></pre></div></div>
<p><a href="https://github.com/terraform-providers/terraform-provider-aws/blob/master/examples/two-tier/main.tf">This</a> defines a VPC subnet resource, referencing a VPC by id through string interpolation.</p>
<p>On top of that Terraform does very little to help you out. Everything has
to be defined - policies, subnets, etc. Why? Why must I state that my
Queue should be publishable to by my SNS Topic? And then I also have to
define that my SNS topic can publish to my Queue? It felt so redundant
and easy to get wrong - I had to understand CloudFormation and AWS policies.</p>
<p>I ditched Terraform and, for a while, just did things with the console.</p>
<h3 id="aws-cdk">aws-cdk</h3>
<p>Later I found out about <a href="https://awslabs.github.io/aws-cdk/">aws-cdk</a>, a new approach to configuring AWS resources
through code.</p>
<p>The nice thing about CDK is that it <em>is not a DSL</em>. It’s a library that I
can use from various programming languages.</p>
<p>By leveraging real languages I can use the tools I’m used to - classes,
functions, generics, loops, branches, etc. I didn’t have to learn the
CDK way, I just approached it as I would any problem.</p>
<p>What this meant was I could move really fast, building my configurations
in a way that was very readable and, at least to some extent, DRY.</p>
<div class="language-typescript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">function</span> <span class="nx">subscribe_lambda_to_queue</span><span class="p">(</span><span class="nx">stack</span><span class="p">:</span> <span class="nx">cdk</span><span class="p">.</span><span class="nx">Stack</span><span class="p">,</span> <span class="nx">id</span><span class="p">:</span> <span class="kr">string</span><span class="p">,</span> <span class="nx">fn</span><span class="p">:</span> <span class="nx">lambda</span><span class="p">.</span><span class="nb">Function</span><span class="p">,</span> <span class="nx">queue</span><span class="p">:</span> <span class="nx">sqs</span><span class="p">.</span><span class="nx">Queue</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// TODO: Build the S3 Endpoint and allow traffic only through that endpoint</span>
<span class="k">new</span> <span class="nx">lambda</span><span class="p">.</span><span class="nx">cloudformation</span><span class="p">.</span><span class="nx">EventSourceMappingResource</span><span class="p">(</span><span class="nx">stack</span><span class="p">,</span> <span class="nx">id</span> <span class="o">+</span> <span class="dl">'</span><span class="s1">Events</span><span class="dl">'</span><span class="p">,</span> <span class="p">{</span>
<span class="na">functionName</span><span class="p">:</span> <span class="nx">fn</span><span class="p">.</span><span class="nx">functionName</span><span class="p">,</span>
<span class="na">eventSourceArn</span><span class="p">:</span> <span class="nx">queue</span><span class="p">.</span><span class="nx">queueArn</span>
<span class="p">});</span>
<span class="nx">fn</span><span class="p">.</span><span class="nx">addToRolePolicy</span><span class="p">(</span><span class="k">new</span> <span class="nx">cdk</span><span class="p">.</span><span class="nx">PolicyStatement</span><span class="p">()</span>
<span class="p">.</span><span class="nx">addAction</span><span class="p">(</span><span class="dl">'</span><span class="s1">sqs:ReceiveMessage</span><span class="dl">'</span><span class="p">)</span>
<span class="p">.</span><span class="nx">addAction</span><span class="p">(</span><span class="dl">'</span><span class="s1">sqs:DeleteMessage</span><span class="dl">'</span><span class="p">)</span>
<span class="p">.</span><span class="nx">addAction</span><span class="p">(</span><span class="dl">'</span><span class="s1">sqs:GetQueueAttributes</span><span class="dl">'</span><span class="p">)</span>
<span class="p">.</span><span class="nx">addAction</span><span class="p">(</span><span class="dl">'</span><span class="s1">sqs:*</span><span class="dl">'</span><span class="p">)</span>
<span class="p">.</span><span class="nx">addResource</span><span class="p">(</span><span class="nx">queue</span><span class="p">.</span><span class="nx">queueArn</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Here’s a function I wrote to factor out some common logic I had - subscribing
my lambdas to an SQS Queue.</p>
<p>Note that in this case I had to add the policy to my lambda. This is actually
atypical - cdk will, in almost every case, generate these for you. I only had
to do so here because it’s a very young library and they haven’t automated this yet.</p>
<p>But consider this code:</p>
<div class="language-typescript highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="kd">const</span> <span class="nx">event_producer</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">sns</span><span class="p">.</span><span class="nx">Topic</span><span class="p">(</span><span class="k">this</span><span class="p">,</span> <span class="dl">"</span><span class="s2">ProducerName</span><span class="dl">"</span><span class="p">);</span>
<span class="kd">const</span> <span class="nx">graph_merger_queue</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">sqs</span><span class="p">.</span><span class="nx">Queue</span><span class="p">(</span><span class="k">this</span><span class="p">,</span> <span class="dl">"</span><span class="s2">QueueName</span><span class="dl">"</span><span class="p">);</span>
<span class="nx">event_producer</span><span class="p">.</span><span class="nx">subscribeQueue</span><span class="p">(</span><span class="nx">graph_merger_queue</span><span class="p">);</span>
</code></pre></div></div>
<p>This code defines a Topic and a Queue, and subscribes the Queue to the
Topic. I don’t need to define any policy - it’s obvious what it should be
(allow the topic to send messages to the queue), so cdk just does it for me.</p>
<p>I felt confident that cdk would build me policies that are least privilege
by default, and that I couldn’t accidentally mess them up.</p>
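<p>To make “it’s obvious what it should be” concrete, here’s a rough sketch of the kind of queue policy a topic-to-queue subscription implies. This is a toy model and not the aws-cdk API: the <code class="language-plaintext highlighter-rouge">Topic</code>, <code class="language-plaintext highlighter-rouge">Queue</code>, and <code class="language-plaintext highlighter-rouge">PolicyStatement</code> types, the <code class="language-plaintext highlighter-rouge">subscribeQueue</code> function, and the ARNs are all made up for illustration. The statement itself follows the usual AWS shape for letting SNS deliver to SQS.</p>

```typescript
// Toy model for illustration only; these types are NOT the aws-cdk API.
interface PolicyStatement {
  effect: "Allow";
  principal: { service: string };
  actions: string[];
  resource: string;
  condition: { arnEquals: { [key: string]: string } };
}

interface Topic {
  arn: string;
}

interface Queue {
  arn: string;
  policy: PolicyStatement[];
}

// Subscribing implies exactly one permission: this topic, and only this
// topic, may send messages to this queue.
function subscribeQueue(topic: Topic, queue: Queue): void {
  queue.policy.push({
    effect: "Allow",
    principal: { service: "sns.amazonaws.com" },
    actions: ["sqs:SendMessage"],
    resource: queue.arn,
    condition: { arnEquals: { "aws:SourceArn": topic.arn } },
  });
}

// Made-up ARNs for the example.
const exampleTopic: Topic = {
  arn: "arn:aws:sns:us-east-1:123456789012:ProducerName",
};
const exampleQueue: Queue = {
  arn: "arn:aws:sqs:us-east-1:123456789012:QueueName",
  policy: [],
};

subscribeQueue(exampleTopic, exampleQueue);
```

<p>The point is that the relationship (“this queue is subscribed to this topic”) already carries everything needed to write the narrowest possible statement, so there’s no reason for me to write it by hand.</p>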
<p>I also use a database, and I wanted my db username and password to be
stored in environment variables. Because I was using TypeScript this
was trivial - just npm install <code class="language-plaintext highlighter-rouge">node-env-file</code> and use it.</p>
<div class="language-typescript highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="kd">const</span> <span class="nx">history_db</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">HistoryDb</span><span class="p">(</span>
<span class="k">this</span><span class="p">,</span>
<span class="dl">'</span><span class="s1">history-db</span><span class="dl">'</span><span class="p">,</span>
<span class="nx">network</span><span class="p">.</span><span class="nx">grapl_vpc</span><span class="p">,</span>
<span class="k">new</span> <span class="nx">cdk</span><span class="p">.</span><span class="nx">Token</span><span class="p">(</span><span class="nx">process</span><span class="p">.</span><span class="nx">env</span><span class="p">.</span><span class="nx">HISTORY_DB_USERNAME</span><span class="p">),</span>
<span class="k">new</span> <span class="nx">cdk</span><span class="p">.</span><span class="nx">Token</span><span class="p">(</span><span class="nx">process</span><span class="p">.</span><span class="nx">env</span><span class="p">.</span><span class="nx">HISTORY_DB_PASSWORD</span><span class="p">)</span>
<span class="p">);</span>
</code></pre></div></div>
<p>I pass my custom HistoryDb class the necessary information, pulling
the credentials from my .env, and I’m done.</p>
<p>Similarly, I pass these credentials to my lambdas that need to access
the history db (I intend to move to KMS later).</p>
<div class="language-typescript highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="kd">let</span> <span class="nx">node_identity_mapper</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">lambda</span><span class="p">.</span><span class="nb">Function</span><span class="p">(</span>
<span class="k">this</span><span class="p">,</span> <span class="dl">'</span><span class="s1">node-identity-mapper</span><span class="dl">'</span><span class="p">,</span> <span class="p">{</span>
<span class="na">runtime</span><span class="p">:</span> <span class="nx">lambda</span><span class="p">.</span><span class="nx">Runtime</span><span class="p">.</span><span class="nx">Go1x</span><span class="p">,</span>
<span class="na">handler</span><span class="p">:</span> <span class="dl">'</span><span class="s1">node-identity-mapper</span><span class="dl">'</span><span class="p">,</span>
<span class="na">code</span><span class="p">:</span> <span class="nx">lambda</span><span class="p">.</span><span class="nx">Code</span><span class="p">.</span><span class="nx">file</span><span class="p">(</span><span class="dl">'</span><span class="s1">./node-identity-mapper.zip</span><span class="dl">'</span><span class="p">),</span>
<span class="na">vpc</span><span class="p">:</span> <span class="nx">vpc</span><span class="p">,</span>
<span class="na">environment</span><span class="p">:</span> <span class="p">{</span>
<span class="dl">"</span><span class="s2">HISTORY_DB_USERNAME</span><span class="dl">"</span><span class="p">:</span> <span class="nx">process</span><span class="p">.</span><span class="nx">env</span><span class="p">.</span><span class="nx">HISTORY_DB_USERNAME</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">HISTORY_DB_PASSWORD</span><span class="dl">"</span><span class="p">:</span> <span class="nx">process</span><span class="p">.</span><span class="nx">env</span><span class="p">.</span><span class="nx">HISTORY_DB_PASSWORD</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">BUCKET_PREFIX</span><span class="dl">"</span><span class="p">:</span> <span class="nx">process</span><span class="p">.</span><span class="nx">env</span><span class="p">.</span><span class="nx">BUCKET_PREFIX</span>
<span class="p">},</span>
<span class="na">timeout</span><span class="p">:</span> <span class="mi">30</span>
<span class="p">}</span>
<span class="p">);</span>
</code></pre></div></div>
<p>This same approach allows me to deal with the fact that S3 buckets are
global. If someone else wants to set up Grapl they just provide a unique
prefix in the .env file and all services will be made aware of it. Easy.</p>
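<p>Since everything hangs off of those <code class="language-plaintext highlighter-rouge">.env</code> values, it can be worth checking them before building any stacks. Here’s a minimal sketch, assuming a plain key/value object stands in for the loaded environment. The helpers <code class="language-plaintext highlighter-rouge">missingKeys</code> and <code class="language-plaintext highlighter-rouge">bucketName</code> are hypothetical, as is the <code class="language-plaintext highlighter-rouge">&lt;prefix&gt;-&lt;suffix&gt;</code> naming scheme; the names Grapl actually generates may differ.</p>

```typescript
// Hypothetical helpers for illustration; not part of Grapl, aws-cdk,
// or node-env-file.
type Env = { [key: string]: string | undefined };

const REQUIRED_KEYS = ["HISTORY_DB_USERNAME", "HISTORY_DB_PASSWORD", "BUCKET_PREFIX"];

// Return the required keys that are unset or empty.
function missingKeys(env: Env, required: string[]): string[] {
  const missing: string[] = [];
  for (const key of required) {
    const value = env[key];
    if (value === undefined || value === "") {
      missing.push(key);
    }
  }
  return missing;
}

// Build "<prefix>-<suffix>" (an assumed scheme) and check it against the
// basic S3 naming rules: 3-63 characters, lowercase letters, digits and
// hyphens, starting and ending with a letter or digit. Dots are also legal
// in S3 names but are left out of this sketch.
function bucketName(prefix: string, suffix: string): string {
  const candidate = prefix + "-" + suffix;
  if (!/^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$/.test(candidate)) {
    throw new Error("Invalid S3 bucket name: " + candidate);
  }
  return candidate;
}

// A CDK app would pass process.env here after loading the .env file.
const sampleEnv: Env = {
  HISTORY_DB_USERNAME: "username",
  HISTORY_DB_PASSWORD: "password",
  BUCKET_PREFIX: "my-unique-prefix",
};

const sampleMissing = missingKeys(sampleEnv, REQUIRED_KEYS);
const rawLogBucket = bucketName(sampleEnv["BUCKET_PREFIX"] || "", "raw-log");
```

<p>Failing early on a missing or invalid prefix is a lot cheaper than finding out halfway through a CloudFormation deploy.</p>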
<p>CDK is still early days but I really couldn’t recommend it more. Deploying
Grapl is practically trivial and adding new CloudFormation stacks, or
modifying existing ones, has been incredibly smooth.</p>
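<h1 id="introducing-grapl">Introducing Grapl</h1>
<p>2018-10-20T00:00:00+00:00 <a href="http://insanitybit.github.io/2018/10/20/grapl-a-graph-platform-for-detection-forensics-and-incident-response">http://insanitybit.github.io/2018/10/20/grapl-a-graph-platform-for-detection-forensics-and-incident-response</a></p>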
<p>Oftentimes when chasing down an alert I find myself asking the same questions:</p>
<ul>
<li>I know that a process is malicious, did it spawn any children?</li>
<li>The process spawned children, what did they do?</li>
<li>What spawned that first process?</li>
<li>What created the binary file for the malicious process?</li>
<li>Did these processes interact with the file system?</li>
<li>Did they interact with the network?</li>
</ul>
<p>Answering these questions can be costly. It often involves manually running linear
searches over logs. This can get very convoluted - if a process A has children B and C,
now I need to perform separate searches to understand B and C further. Each branch
adds significant cognitive overhead, and searches over logs can be very slow.</p>
<p>In some cases I even want to model my alerts this way - not just alerting off of
discrete events, such as a process spawning, but combined events, such as a process
with specific attributes spawning a child with specific attributes.</p>
<p>Some examples of signatures that would require more than a single event might be:</p>
<ul>
<li>Word executing child processes, indicating that a malicious macro has executed</li>
<li>A process X spawning a child Y where we’ve never seen X have a relationship with Y
before. (Why is my Java service executing <code class="language-plaintext highlighter-rouge">/bin/bash</code>?)</li>
<li>A process writes to a sensitive file, but not one of the many processes that are
known to do so</li>
</ul>
<p>Given most log sources, where individual logs represent discrete events, writing any of
these alerts requires complex joining logic, handling pid collisions, and a lot of
time and compute, with the compute growing exponentially with the depth of my searches.</p>
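<p>As a rough illustration of the second signature above (a parent/child relationship we’ve never seen before), here’s a sketch that works over flat process-start events. The <code class="language-plaintext highlighter-rouge">ProcessStart</code> shape and the <code class="language-plaintext highlighter-rouge">novelPairs</code> function are hypothetical, and it assumes the parent’s image path has already been joined onto each event, which is exactly the expensive part when all you have is raw logs.</p>

```typescript
// Hypothetical event shape: assumes the parent image has already been
// resolved and attached to each process-start event.
interface ProcessStart {
  parentImage: string;
  childImage: string;
}

function pairKey(ev: ProcessStart): string {
  return ev.parentImage + " -> " + ev.childImage;
}

// Return incoming events whose (parent, child) pair does not appear in the
// baseline. Each new pair is reported once.
function novelPairs(baseline: ProcessStart[], incoming: ProcessStart[]): ProcessStart[] {
  const seen: { [pair: string]: boolean } = {};
  for (const ev of baseline) {
    seen[pairKey(ev)] = true;
  }
  const novel: ProcessStart[] = [];
  for (const ev of incoming) {
    const key = pairKey(ev);
    if (seen[key] !== true) {
      seen[key] = true;
      novel.push(ev);
    }
  }
  return novel;
}

const baselineEvents: ProcessStart[] = [
  { parentImage: "/usr/bin/java", childImage: "/usr/bin/java" },
];
const incomingEvents: ProcessStart[] = [
  { parentImage: "/usr/bin/java", childImage: "/usr/bin/java" },
  { parentImage: "/usr/bin/java", childImage: "/bin/bash" },
];

// Only the java -> bash event is new relative to the baseline.
const flaggedEvents = novelPairs(baselineEvents, incomingEvents);
```

<p>The check itself is trivial. The hard part is getting the data into a shape where the parent and child sit side by side, which is what a graph gives you for free.</p>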
<h2 id="grapl">Grapl</h2>
<p>What I want is a way to answer all of those questions in a single operation. I want
to take an alert and be able to immediately see everything interesting about the
components of that alert. I want to write signatures that work across multiple
events and I want that to be elegant and efficient.</p>
<p>I’m building <a href="https://github.com/insanitybit/grapl">Grapl</a> to optimize for these use cases.</p>
<h3 id="how-it-works">How it works</h3>
<p>Grapl works by ingesting logs, such as a process creation event, and producing a graph
representation of that log. These graphs are later merged into the master graph.</p>
<p>Issues like pid collisions are handled automatically.</p>
<p>As an example, a log like this:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"pid"</span><span class="p">:</span><span class="w"> </span><span class="mi">5</span><span class="p">,</span><span class="w">
</span><span class="nl">"ppid"</span><span class="p">:</span><span class="w"> </span><span class="mi">4</span><span class="p">,</span><span class="w">
</span><span class="nl">"image_path"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/home/downloads/payload.exe"</span><span class="p">,</span><span class="w">
</span><span class="nl">"create_time"</span><span class="p">:</span><span class="w"> </span><span class="mi">1540071808</span><span class="p">,</span><span class="w">
</span><span class="nl">"sourcetype"</span><span class="p">:</span><span class="w"> </span><span class="s2">"PROCESS_START"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>will create a subgraph with a newly created process node, an edge to some pre-existing
process node with pid ‘4’, and an edge to some pre-existing file node with path “/home/downloads/payload.exe”.</p>
<p>These subgraphs get added to the master graph, creating the new process node, connecting it
to its parent process’s node, and its binary’s node.</p>
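<p>Here’s a minimal sketch of that log-to-subgraph step. It is not Grapl’s actual implementation (which is written in Rust): the node keys, the edge labels <code class="language-plaintext highlighter-rouge">children</code> and <code class="language-plaintext highlighter-rouge">bin_file</code>, and the <code class="language-plaintext highlighter-rouge">subgraphFromProcessStart</code> function are assumptions for illustration, and keying nodes on the raw pid ignores pid collisions entirely.</p>

```typescript
interface ProcessStartLog {
  pid: number;
  ppid: number;
  image_path: string;
  create_time: number;
  sourcetype: string;
}

interface GraphNode {
  key: string;
  kind: "process" | "file";
  properties: { [name: string]: string | number };
}

interface GraphEdge {
  from: string;
  to: string;
  label: string;
}

interface Subgraph {
  nodes: GraphNode[];
  edges: GraphEdge[];
}

// Turn one PROCESS_START log into three nodes and two edges.
// Keys and edge labels are hypothetical.
function subgraphFromProcessStart(log: ProcessStartLog): Subgraph {
  const child: GraphNode = {
    key: "process:" + log.pid,
    kind: "process",
    properties: { pid: log.pid, create_time: log.create_time },
  };
  const parentNode: GraphNode = {
    key: "process:" + log.ppid,
    kind: "process",
    properties: { pid: log.ppid },
  };
  const binary: GraphNode = {
    key: "file:" + log.image_path,
    kind: "file",
    properties: { path: log.image_path },
  };
  return {
    nodes: [child, parentNode, binary],
    edges: [
      { from: parentNode.key, to: child.key, label: "children" },
      { from: child.key, to: binary.key, label: "bin_file" },
    ],
  };
}

const sampleLog: ProcessStartLog = {
  pid: 5,
  ppid: 4,
  image_path: "/home/downloads/payload.exe",
  create_time: 1540071808,
  sourcetype: "PROCESS_START",
};

const sampleSubgraph = subgraphFromProcessStart(sampleLog);
```

<p>Merging is then a matter of upserting each node by key and adding the edges, so the parent and file nodes are reused if they already exist in the master graph.</p>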
<p>Expanding your investigation from a single node is trivial. If I want to see everything
a process did, it’s as simple as:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
q(func: eq(pid,5))
@filter( eq(asset_id, "asset_zzd"))
@recurse(depth:10) {
expand(_all_),
}
}
</code></pre></div></div>
<p>(This is using DGraph’s query language, <a href="https://docs.dgraph.io/query-language/">GraphQL+-</a>)</p>
<p>This will find any nodes with pid=5 on the asset with id ‘asset_zzd’ and recursively
expand its edges.</p>
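<p>Conceptually, the recursive expansion is a depth-limited breadth-first traversal. The sketch below shows the idea over a plain adjacency list. It is not how DGraph executes the query, and the <code class="language-plaintext highlighter-rouge">expand</code> function and the node keys are made up for the example.</p>

```typescript
// Adjacency list: node key -> keys of the nodes it has edges to.
type Adjacency = { [nodeKey: string]: string[] };

// Breadth-first expansion from `start`, following edges at most
// `maxDepth` hops. Returns node keys in the order they were reached.
function expand(graph: Adjacency, start: string, maxDepth: number): string[] {
  const visited: { [key: string]: boolean } = {};
  visited[start] = true;
  const order: string[] = [start];
  let frontier: string[] = [start];
  for (let depth = 0; depth < maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const key of frontier) {
      for (const neighbor of graph[key] || []) {
        if (visited[neighbor] !== true) {
          visited[neighbor] = true;
          order.push(neighbor);
          next.push(neighbor);
        }
      }
    }
    frontier = next;
  }
  return order;
}

// A made-up chain of process nodes: p4 spawned p5, which spawned p6, ...
const sampleChain: Adjacency = {
  p4: ["p5"],
  p5: ["p6"],
  p6: ["p7"],
};

const twoHops = expand(sampleChain, "p4", 2); // ["p4", "p5", "p6"]
const allHops = expand(sampleChain, "p4", 10); // ["p4", "p5", "p6", "p7"]
```

<p>The depth limit is what keeps this cheap: the traversal stops after a fixed number of hops no matter how large the master graph is.</p>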
<p>Given only a single event we can go from this:</p>
<p><img src="https://raw.githubusercontent.com/insanitybit/grapl/master/images/unexpanded_payload.png" alt="unexpanded_payload" /></p>
<p>to this: (unfortunately filenames are not listed on nodes)</p>
<p><img src="https://raw.githubusercontent.com/insanitybit/grapl/master/images/expanded_payload.png" alt="expanded_payload" /></p>
<p>This is a single, simple operation that executed in milliseconds. I now have much of
the context I would want when investigating a suspicious process.</p>
<p>With a single query I can now understand quite a lot about the event:</p>
<ul>
<li>chrome.exe created a file</li>
<li>word.exe read the file created by chrome.exe</li>
<li>word.exe created a file, payload.exe</li>
<li>payload.exe was executed by word.exe</li>
</ul>
<p>If payload.exe had spawned other children, or read other files, I’d see it. (Note that
the logs used to generate these graphs are fabricated)</p>
<h3 id="getting-started">Getting Started</h3>
<p>Setting up Grapl is mostly automated, with a few manual pain points.</p>
<p><a href="https://github.com/insanitybit/grapl/#setting-up-grapl">Instructions are here.</a></p>
<p>Once Grapl is deployed you can send up JSON encoded logs to your <code class="language-plaintext highlighter-rouge">raw-log</code> S3
bucket. The rest should just work.</p>
<h3 id="current-state">Current State</h3>
<p>Grapl is a very young project. Currently the best supported features are:</p>
<ul>
<li>Parsing logs into graphs</li>
<li>Creating ‘identities’ for nodes (to handle pid collisions)</li>
<li>Merging generated graphs into the master graph</li>
</ul>
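<p>To give a feel for what an ‘identity’ means here, this is one plausible way to handle pid collisions: treat each process creation as a new session for that (asset, pid) pair, and resolve later events to whichever session had started most recently at the event’s timestamp. This is an assumption for illustration, not a description of how Grapl actually does it, and <code class="language-plaintext highlighter-rouge">IdentityTable</code> is a hypothetical name.</p>

```typescript
interface ProcessSession {
  createTime: number;
  nodeKey: string;
}

// Hypothetical identity table; not Grapl's real implementation.
class IdentityTable {
  private sessions: { [assetAndPid: string]: ProcessSession[] } = {};
  private nextId = 0;

  // Record a process creation and mint a stable node key for it.
  recordStart(assetId: string, pid: number, createTime: number): string {
    const slot = assetId + "|" + pid;
    const nodeKey = "process-" + this.nextId;
    this.nextId += 1;
    const list = this.sessions[slot] || [];
    list.push({ createTime: createTime, nodeKey: nodeKey });
    list.sort((a, b) => a.createTime - b.createTime);
    this.sessions[slot] = list;
    return nodeKey;
  }

  // Resolve an event to the latest session that started at or before
  // `timestamp`, or undefined if no such session is known.
  resolve(assetId: string, pid: number, timestamp: number): string | undefined {
    const list = this.sessions[assetId + "|" + pid] || [];
    let match: string | undefined = undefined;
    for (const session of list) {
      if (session.createTime <= timestamp) {
        match = session.nodeKey;
      }
    }
    return match;
  }
}

const identities = new IdentityTable();
const firstKey = identities.recordStart("asset_zzd", 5, 1540071808);
// The OS reuses pid 5 for an unrelated process later on.
const secondKey = identities.recordStart("asset_zzd", 5, 1540075000);

const earlyEvent = identities.resolve("asset_zzd", 5, 1540072000); // firstKey
const lateEvent = identities.resolve("asset_zzd", 5, 1540080000); // secondKey
```

<p>Once every event maps to a node key rather than a raw pid, two processes that happened to share a pid end up as separate nodes in the graph.</p>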
<p>Grapl is in an ‘alpha’ release state. There may be major architectural changes
and rewrites. Data that goes through Grapl may not be valid for future versions.</p>
<p>There’s a lot more I want to build, some of which is already decently far
along.</p>
<h2 id="future">Future</h2>
<h3 id="networking-users-and-assets">Networking, Users, and Assets</h3>
<p>Grapl only supports files and processes right now, and <code class="language-plaintext highlighter-rouge">asset_id</code>s are opaque
identifiers.</p>
<p>I want to be able to answer more questions:</p>
<ul>
<li>
<p>Did a process SSH to another system? What subsequent processes
were executed?</p>
</li>
<li>
<p>What IPs has a process connected to?</p>
</li>
<li>
<p>What domains has a process resolved?</p>
</li>
<li>
<p>Which processes executed on a given asset, under a given user?</p>
</li>
</ul>
<p>I have ongoing work to model the data necessary to answer these questions. When
I’m done it will be as easy to answer these questions as it was to answer the
others - one simple operation.</p>
<h3 id="subgraph-signatures">Subgraph Signatures</h3>
<p>So far Grapl mostly supports manual investigation, but it has no system for
writing alerts. As I mentioned above, there are at least some attacker signatures
best described by the combination of events, not singular events.</p>
<p>In order to support this I intend to allow signatures to be stored by an analyst
and subsequently executed each time the master graph is updated. This should give
real time alerting with graph queries.</p>
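<p>Here’s a sketch of what “stored signatures executed on each update” could look like. None of this exists in Grapl yet: the <code class="language-plaintext highlighter-rouge">Signature</code> interface, the <code class="language-plaintext highlighter-rouge">runSignatures</code> function, and the graph shape are all hypothetical, and the Word signature is a simplified version of the macro example from earlier.</p>

```typescript
interface ProcessNode {
  key: string;
  imageName: string;
  children: string[];
}

type ProcessGraph = { [key: string]: ProcessNode };

// Hypothetical signature shape: a name plus a predicate over the graph,
// evaluated at a node that was just created or updated.
interface Signature {
  name: string;
  matches: (graph: ProcessGraph, changedKey: string) => boolean;
}

// Fires when a word.exe node has at least one child process.
const wordSpawnsChild: Signature = {
  name: "word-spawns-child",
  matches: (graph, changedKey) => {
    const node = graph[changedKey];
    return node !== undefined && node.imageName === "word.exe" && node.children.length > 0;
  },
};

// Run every stored signature against every node touched by an update.
function runSignatures(graph: ProcessGraph, changedKeys: string[], signatures: Signature[]): string[] {
  const alerts: string[] = [];
  for (const key of changedKeys) {
    for (const sig of signatures) {
      if (sig.matches(graph, key)) {
        alerts.push(sig.name + ":" + key);
      }
    }
  }
  return alerts;
}

const sampleProcessGraph: ProcessGraph = {
  "proc-word": { key: "proc-word", imageName: "word.exe", children: ["proc-payload"] },
  "proc-payload": { key: "proc-payload", imageName: "payload.exe", children: [] },
};

// Yields ["word-spawns-child:proc-word"].
const sampleAlerts = runSignatures(
  sampleProcessGraph,
  ["proc-word", "proc-payload"],
  [wordSpawnsChild]
);
```

<p>Only evaluating signatures at the nodes an update touched is what would keep this close to real time, rather than rescanning the whole master graph.</p>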
<p>The current state for this is not ideal. There is actually a single analyzer
in the Grapl repository that will scan for malicious word macros, but the
process of creating analyzers is overly painful, requiring a separate lambda
for each signature.</p>
<p>The next steps here are to:</p>
<ul>
<li>Provide a friendly DSL for writing signatures</li>
<li>Remove the need for separate lambdas, and just provide one lambda that pulls
down the signatures and executes them against the graph</li>
</ul>
<h3 id="automated-scoping">Automated Scoping</h3>
<p>As demonstrated, Grapl can trivially expand graphs. There’s no need for Grapl
to understand the details of a signature in order to provide basic context; it
only needs to expand the graph.</p>
<p>When analyzers are built, and as they output signature matches, the
<code class="language-plaintext highlighter-rouge">engagement-creation-service</code> will automatically expand the graph around the
match and create an engagement - a separate graph with a unique key, which
you can interact with to add and remove more nodes.</p>
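<p>Roughly, I picture an engagement as something like the sketch below: a keyed set of nodes seeded by expanding around a match, which an analyst can then grow or prune. The <code class="language-plaintext highlighter-rouge">Engagement</code> class and <code class="language-plaintext highlighter-rouge">createEngagement</code> function are hypothetical, the counter-based key is a stand-in for a real unique id, and the expansion step is passed in as a function to keep the example self-contained.</p>

```typescript
// Hypothetical engagement: a uniquely keyed working set of node keys.
class Engagement {
  readonly key: string;
  private nodeKeys: { [key: string]: boolean } = {};

  constructor(key: string, initial: string[]) {
    this.key = key;
    for (const nodeKey of initial) {
      this.nodeKeys[nodeKey] = true;
    }
  }

  add(nodeKey: string): void {
    this.nodeKeys[nodeKey] = true;
  }

  remove(nodeKey: string): void {
    delete this.nodeKeys[nodeKey];
  }

  // Node keys currently in the engagement, sorted for stable output.
  nodes(): string[] {
    return Object.keys(this.nodeKeys).sort();
  }
}

// A simple counter stands in for a real unique id generator.
let engagementCounter = 0;

// Seed a new engagement by expanding the graph around a signature match.
function createEngagement(
  matchedKey: string,
  expandAround: (key: string) => string[]
): Engagement {
  engagementCounter += 1;
  return new Engagement("engagement-" + engagementCounter, expandAround(matchedKey));
}

const sampleEngagement = createEngagement("proc-word", (key) => [
  key,
  "proc-payload",
  "file-payload",
]);
sampleEngagement.add("proc-chrome");
sampleEngagement.remove("file-payload");

// ["proc-chrome", "proc-payload", "proc-word"]
const engagementNodes = sampleEngagement.nodes();
```

<p>Keeping the engagement separate from the master graph means an investigation can add or drop nodes freely without touching the underlying data.</p>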
<p>Engagements will be the main way to interact with Grapl, with future plans to
provide a Python SDK so you can script your engagements further.</p>
<h3 id="contributing">Contributing</h3>
<p>The vast majority of Grapl is written in the Rust programming language, with
the Analyzers and Engagement SDK being written in Python.</p>
<p>If you don’t know Rust or Python, don’t worry. I’d be happy to help anyone get
ramped up with either language.</p>
<p>If you’re interested in contributing, if you have feedback or questions, please
feel free to open an issue or <a href="https://github.com/insanitybit/grapl/issues">start working on an existing one</a>.</p>
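<h1 id="more-rust-actors">More Rust Actors</h1>
<p>2017-10-09T00:00:00+00:00 <a href="http://insanitybit.github.io/2017/10/09/more-rust-actors">http://insanitybit.github.io/2017/10/09/more-rust-actors</a></p>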
<p>I’ve been continuing work on my <a href="https://github.com/insanitybit/derive_aktor">rust actor library</a>, which macro-magically
turns synchronous structs into asynchronous actors. (Note that all examples
are using branches of that project and not Master).</p>
<p>For example, here’s a struct that stores a closure and, upon its ‘complete’ method being called, executes that closure.</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="k">pub</span> <span class="k">struct</span> <span class="n">CompletionHandler</span><span class="o"><</span><span class="n">F</span><span class="o">></span>
<span class="k">where</span>
<span class="n">F</span><span class="p">:</span> <span class="nf">Fn</span><span class="p">()</span> <span class="o">+</span> <span class="nb">Send</span> <span class="o">+</span> <span class="n">Sync</span> <span class="o">+</span> <span class="nv">'static</span><span class="p">,</span>
<span class="p">{</span>
<span class="n">self_ref</span><span class="p">:</span> <span class="n">CompletionHandlerActor</span><span class="p">,</span>
<span class="n">system</span><span class="p">:</span> <span class="n">SystemActor</span><span class="p">,</span>
<span class="n">f</span><span class="p">:</span> <span class="n">F</span><span class="p">,</span>
<span class="p">}</span>
<span class="nd">#[derive_actor]</span>
<span class="k">impl</span><span class="o"><</span><span class="n">F</span><span class="o">></span> <span class="n">CompletionHandler</span><span class="o"><</span><span class="n">F</span><span class="o">></span>
<span class="k">where</span>
<span class="n">F</span><span class="p">:</span> <span class="nf">Fn</span><span class="p">()</span> <span class="o">+</span> <span class="nb">Send</span> <span class="o">+</span> <span class="n">Sync</span> <span class="o">+</span> <span class="nv">'static</span><span class="p">,</span>
<span class="p">{</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">complete</span><span class="p">(</span><span class="o">&</span><span class="k">self</span><span class="p">)</span> <span class="p">{</span>
<span class="p">(</span><span class="k">self</span><span class="py">.f</span><span class="p">)();</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Note the <code class="language-plaintext highlighter-rouge">#[derive_actor]</code> bit. This generates (some bits omitted) something like this:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="k">pub</span> <span class="k">struct</span> <span class="n">CompletionHandlerActor</span> <span class="p">{</span>
<span class="n">sender</span><span class="p">:</span> <span class="p">::</span><span class="nn">channel</span><span class="p">::</span><span class="n">Sender</span><span class="o"><</span><span class="n">CompletionHandlerSystemMessage</span><span class="o">></span><span class="p">,</span>
<span class="n">ref_count</span><span class="p">:</span> <span class="p">::</span><span class="nn">std</span><span class="p">::</span><span class="nn">sync</span><span class="p">::</span><span class="nb">Arc</span><span class="o"><</span><span class="p">()</span><span class="o">></span><span class="p">,</span>
<span class="k">pub</span> <span class="n">id</span><span class="p">:</span> <span class="p">::</span><span class="nn">std</span><span class="p">::</span><span class="nn">sync</span><span class="p">::</span><span class="nb">Arc</span><span class="o"><</span><span class="nb">String</span><span class="o">></span><span class="p">,</span>
<span class="p">}</span>
<span class="k">impl</span> <span class="n">CompletionHandlerActor</span> <span class="p">{</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">complete</span><span class="p">(</span><span class="o">&</span><span class="k">self</span><span class="p">)</span> <span class="p">{</span>
<span class="k">let</span> <span class="n">msg</span> <span class="o">=</span> <span class="nn">CompletionHandlerMessage</span><span class="p">::</span><span class="n">CompleteVariant</span> <span class="p">{};</span>
<span class="k">let</span> <span class="n">msg</span> <span class="o">=</span> <span class="nn">CompletionHandlerSystemMessage</span><span class="p">::</span><span class="nf">Inner</span><span class="p">(</span><span class="n">msg</span><span class="p">);</span>
<span class="k">let</span> <span class="mi">_</span> <span class="o">=</span> <span class="k">self</span><span class="py">.sender</span><span class="nf">.send</span><span class="p">(</span><span class="n">msg</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>I didn’t have a good sense of how this would actually be in practice. I’ve used a similar approach elsewhere but without
the macro, and I diverged from the patterns that the macro enforced (a single queue per actor, for example). My last
project that used actors also became fairly large in terms of just lines of code due to all of the ‘actor’ boilerplate
everywhere - I was very curious to see if the macro helped here.</p>
<h3 id="spam-detection">Spam Detection</h3>
<p>I set out to build a simple program. In my talk at Rustconf Suchin and I brought up where Rust was particularly strong
and not much has changed since then. Rust is great for feature extraction and cleaning, not as great for investigation,
with (unmet) potential for modelling. This meant I’d need some not-rust code. Since actors model everything as services
they felt well suited for a project that would involve multiple languages.</p>
<p>The basic premise was to use machine learning, specifically around NLP to start with, to determine if a given email
is malicious.</p>
<p><a href="https://github.com/insanitybit/spam_detection">You can see the code for it here</a> (note that it is not complete, every prediction is hardcoded to ‘false’)</p>
<h4 id="actor-construction">Actor Construction</h4>
<p>As I built the service I developed the derive_actor library further. One of the first changes was changing how actors
are constructed. Any <code class="language-plaintext highlighter-rouge">StructNameActor::new(args)</code> will require a closure that is used to initialize the struct.</p>
<p>Here’s how you initialize a PythonModelActor - an actor that interfaces with a background Python service.</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">let</span> <span class="n">python_model</span> <span class="o">=</span>
<span class="k">move</span> <span class="p">|</span><span class="n">self_ref</span><span class="p">,</span> <span class="n">system</span><span class="p">|</span> <span class="nn">PythonModel</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="n">self_ref</span><span class="p">,</span> <span class="n">system</span><span class="p">,</span> <span class="s">"./model_service/service/prediction_service.py"</span><span class="nf">.into</span><span class="p">());</span>
<span class="k">let</span> <span class="n">python_model_actor</span> <span class="o">=</span> <span class="nn">PythonModelActor</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="n">python_model</span><span class="p">,</span> <span class="n">system</span><span class="nf">.clone</span><span class="p">(),</span> <span class="nn">Duration</span><span class="p">::</span><span class="nf">from_secs</span><span class="p">(</span><span class="mi">30</span><span class="p">));</span>
</code></pre></div></div>
<p>We create our <code class="language-plaintext highlighter-rouge">python_model</code>, which is a closure that takes a PythonModelActor and a SystemActor, and returns a PythonModel.
PythonModelActor::new uses this to construct and route messages to the PythonModel.</p>
<p>With Dependency Injection by Construction, starting up actors/services ends up looking a lot like this:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">let</span> <span class="n">prediction_cache</span> <span class="o">=</span>
<span class="k">move</span> <span class="p">|</span><span class="n">self_ref</span><span class="p">,</span> <span class="n">system</span><span class="p">|</span> <span class="nn">PredictionCache</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="n">self_ref</span><span class="p">,</span> <span class="n">system</span><span class="p">);</span>
<span class="k">let</span> <span class="n">prediction_cache</span> <span class="o">=</span> <span class="nn">PredictionCacheActor</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="n">prediction_cache</span><span class="p">,</span> <span class="n">system</span><span class="nf">.clone</span><span class="p">(),</span> <span class="nn">Duration</span><span class="p">::</span><span class="nf">from_secs</span><span class="p">(</span><span class="mi">30</span><span class="p">));</span>
<span class="k">let</span> <span class="n">mail_parser</span> <span class="o">=</span>
<span class="k">move</span> <span class="p">|</span><span class="n">self_ref</span><span class="p">,</span> <span class="n">system</span><span class="p">|</span> <span class="nn">MailParser</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="n">self_ref</span><span class="p">,</span> <span class="n">system</span><span class="p">);</span>
<span class="k">let</span> <span class="n">mail_parser</span> <span class="o">=</span> <span class="nn">MailParserActor</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="n">mail_parser</span><span class="p">,</span> <span class="n">system</span><span class="nf">.clone</span><span class="p">(),</span> <span class="nn">Duration</span><span class="p">::</span><span class="nf">from_secs</span><span class="p">(</span><span class="mi">30</span><span class="p">));</span>
<span class="k">let</span> <span class="n">sentiment_analyzer</span> <span class="o">=</span>
<span class="k">move</span> <span class="p">|</span><span class="n">self_ref</span><span class="p">,</span> <span class="n">system</span><span class="p">|</span> <span class="nn">SentimentAnalyzer</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="n">self_ref</span><span class="p">,</span> <span class="n">system</span><span class="p">);</span>
<span class="k">let</span> <span class="n">sentiment_analyzer</span> <span class="o">=</span> <span class="nn">SentimentAnalyzerActor</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="n">sentiment_analyzer</span><span class="p">,</span> <span class="n">system</span><span class="nf">.clone</span><span class="p">(),</span> <span class="nn">Duration</span><span class="p">::</span><span class="nf">from_secs</span><span class="p">(</span><span class="mi">30</span><span class="p">));</span>
<span class="k">let</span> <span class="n">python_model</span> <span class="o">=</span>
<span class="k">move</span> <span class="p">|</span><span class="n">self_ref</span><span class="p">,</span> <span class="n">system</span><span class="p">|</span> <span class="nn">PythonModel</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="n">self_ref</span><span class="p">,</span> <span class="n">system</span><span class="p">,</span> <span class="s">"./model_service/service/prediction_service.py"</span><span class="nf">.into</span><span class="p">());</span>
<span class="k">let</span> <span class="n">python_model</span> <span class="o">=</span> <span class="nn">PythonModelActor</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="n">python_model</span><span class="p">,</span> <span class="n">system</span><span class="nf">.clone</span><span class="p">(),</span> <span class="nn">Duration</span><span class="p">::</span><span class="nf">from_secs</span><span class="p">(</span><span class="mi">30</span><span class="p">));</span>
<span class="k">let</span> <span class="n">model</span> <span class="o">=</span>
<span class="k">move</span> <span class="p">|</span><span class="n">self_ref</span><span class="p">,</span> <span class="n">system</span><span class="p">|</span> <span class="nn">Model</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="n">self_ref</span><span class="p">,</span> <span class="n">system</span><span class="p">,</span> <span class="n">python_model</span><span class="nf">.clone</span><span class="p">());</span>
<span class="k">let</span> <span class="n">model</span> <span class="o">=</span> <span class="nn">ModelActor</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">system</span><span class="nf">.clone</span><span class="p">(),</span> <span class="nn">Duration</span><span class="p">::</span><span class="nf">from_secs</span><span class="p">(</span><span class="mi">30</span><span class="p">));</span>
<span class="k">let</span> <span class="n">extractor</span> <span class="o">=</span>
<span class="k">move</span> <span class="p">|</span><span class="n">self_ref</span><span class="p">,</span> <span class="n">system</span><span class="p">|</span>
<span class="nn">FeatureExtractionManager</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="n">mail_parser</span><span class="nf">.clone</span><span class="p">(),</span> <span class="n">sentiment_analyzer</span><span class="nf">.clone</span><span class="p">(),</span> <span class="n">self_ref</span><span class="p">,</span> <span class="n">system</span><span class="p">);</span>
<span class="k">let</span> <span class="n">extractor</span> <span class="o">=</span> <span class="nn">FeatureExtractionManagerActor</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="n">extractor</span><span class="p">,</span> <span class="n">system</span><span class="nf">.clone</span><span class="p">(),</span> <span class="nn">Duration</span><span class="p">::</span><span class="nf">from_secs</span><span class="p">(</span><span class="mi">30</span><span class="p">));</span>
<span class="k">let</span> <span class="n">service</span> <span class="o">=</span>
<span class="k">move</span> <span class="p">|</span><span class="n">self_ref</span><span class="p">,</span> <span class="n">system</span><span class="p">|</span> <span class="nn">SpamDetectionService</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span>
<span class="n">prediction_cache</span><span class="nf">.clone</span><span class="p">(),</span>
<span class="n">extractor</span><span class="nf">.clone</span><span class="p">(),</span>
<span class="n">model</span><span class="nf">.clone</span><span class="p">(),</span>
<span class="n">self_ref</span><span class="p">,</span>
<span class="n">system</span>
<span class="p">);</span>
<span class="nn">SpamDetectionServiceActor</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="n">service</span><span class="p">,</span> <span class="n">system</span><span class="nf">.clone</span><span class="p">(),</span> <span class="nn">Duration</span><span class="p">::</span><span class="nf">from_secs</span><span class="p">(</span><span class="mi">30</span><span class="p">))</span>
</code></pre></div></div>
<p>It almost makes me miss Spring.</p>
<h4 id="panic-handling">Panic Handling</h4>
<p>I also realized that I wanted to handle panics. A typical Rust service, such as one built on Hyper, would likely handle
panics at the service’s top level. But since Actors are their own little services they need to be able to handle panics
individually. So every struct that derives an Actor must now implement an on_error method. Error handling is still very
much something I’m trying to work through.</p>
<p>Here’s an example of on_error being used in my SentimentAnalyzerActor.</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="k">fn</span> <span class="n">on_error</span><span class="o"><</span><span class="n">T</span><span class="o">></span><span class="p">(</span><span class="o">&</span><span class="k">mut</span> <span class="k">self</span><span class="p">,</span>
<span class="n">err</span><span class="p">:</span> <span class="nb">Box</span><span class="o"><</span><span class="nn">std</span><span class="p">::</span><span class="nn">any</span><span class="p">::</span><span class="n">Any</span> <span class="o">+</span> <span class="nb">Send</span><span class="o">></span><span class="p">,</span>
<span class="n">msg</span><span class="p">:</span> <span class="n">SentimentAnalyzerMessage</span><span class="p">,</span>
<span class="n">t</span><span class="p">:</span> <span class="nb">Arc</span><span class="o"><</span><span class="n">T</span><span class="o">></span><span class="p">)</span>
<span class="k">where</span> <span class="n">T</span><span class="p">:</span> <span class="nf">Fn</span><span class="p">(</span><span class="n">SentimentAnalyzerActor</span><span class="p">,</span> <span class="n">SystemActor</span><span class="p">)</span> <span class="k">-></span> <span class="n">SentimentAnalyzer</span> <span class="o">+</span> <span class="nb">Send</span> <span class="o">+</span> <span class="n">Sync</span> <span class="o">+</span> <span class="nv">'static</span>
<span class="p">{</span>
<span class="k">match</span> <span class="n">msg</span> <span class="p">{</span>
<span class="nn">SentimentAnalyzerMessage</span><span class="p">::</span><span class="n">AnalyzeVariant</span><span class="p">{</span>
<span class="n">phrase</span><span class="p">,</span> <span class="n">res</span>
<span class="p">}</span> <span class="k">=></span> <span class="p">{</span>
<span class="nf">res</span><span class="p">(</span><span class="nf">Err</span><span class="p">(</span>
<span class="nn">ErrorKind</span><span class="p">::</span><span class="nf">UnrecoverableError</span><span class="p">(</span>
<span class="s">"An unexpected error occurred in sentiment analyzer"</span><span class="nf">.into</span><span class="p">())</span><span class="nf">.into</span><span class="p">())</span>
<span class="p">);</span>
<span class="p">},</span>
<span class="p">};</span>
<span class="p">}</span>
</code></pre></div></div>
<p>I wasn’t really sure what information would be useful in an on_error method so I just went with as much as I could.</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">err</code> - This is the value that was produced by the panic</li>
<li><code class="language-plaintext highlighter-rouge">msg</code> - This is the message that caused the panic</li>
<li><code class="language-plaintext highlighter-rouge">t</code> - This is the closure that was originally used to construct the SentimentAnalyzer</li>
</ul>
<p>In this case all I do is send ‘res’ (a callback) an UnrecoverableError message, which will instruct the service not to
retry processing for this message. I could also inspect <code class="language-plaintext highlighter-rouge">err</code> to see if it’s an error I could recover from, but this
suited me fine.</p>
<p>I’ve considered whether actors should even continue to live if an error has occurred. So far, they do - the actor’s loop
will continue.</p>
<p>Passing <code class="language-plaintext highlighter-rouge">msg</code> to on_error also means that all Message types must be Clone. This is usually trivial to implement since everything tends to be
Arc’d, but it’s worth keeping in mind. <em>Every</em> message is cloned, whether you have an error or not. This kiiiinda sucks,
because it means handling errors impacts the performance of non-erroneous messages. But I’ve avoided thinking about it,
so it isn’t a problem yet.</p>
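<p>To make the cost concrete, here is a minimal sketch of one way an actor loop could catch a panic and hand the offending message to on_error. It is not the code the derive_actor macro generates: <code class="language-plaintext highlighter-rouge">Message</code>, <code class="language-plaintext highlighter-rouge">Analyzer</code> and <code class="language-plaintext highlighter-rouge">route</code> are hypothetical names, the on_error signature is simplified, and it is written for current stable Rust (hence <code class="language-plaintext highlighter-rouge">dyn</code>). The point is only that the clone has to happen before the handler runs, so it is paid for on every message.</p>

```rust
use std::any::Any;
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical message type. It must be `Clone` so that a copy
/// survives a panic inside the handler.
#[derive(Clone, Debug, PartialEq)]
pub enum Message {
    Analyze(String),
}

/// Hypothetical actor state.
pub struct Analyzer {
    pub handled: usize,
    pub errors: Vec<Message>,
}

impl Analyzer {
    /// The "real" handler; panics on an empty phrase to simulate a bug.
    fn handle(&mut self, msg: Message) {
        match msg {
            Message::Analyze(phrase) => {
                if phrase.is_empty() {
                    panic!("empty phrase");
                }
                self.handled += 1;
            }
        }
    }

    /// Simplified on_error: receives the panic payload and the message
    /// that caused it, and here just records the message.
    fn on_error(&mut self, _err: Box<dyn Any + Send>, msg: Message) {
        self.errors.push(msg);
    }

    /// One iteration of the actor loop.
    pub fn route(&mut self, msg: Message) {
        // Cloned up front, for every message, because `handle` consumes `msg`.
        let backup = msg.clone();
        let outcome = panic::catch_unwind(AssertUnwindSafe(|| self.handle(msg)));
        if let Err(err) = outcome {
            // The actor keeps living; only this message is reported as failed.
            self.on_error(err, backup);
        }
    }
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">route</code> with a good message, then a panicking one, then another good one leaves the actor alive with two handled messages and one recorded error.</p>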
<p>To ensure that my service works in the presence of unrecoverable failures I’ve started adding these to actors:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">type</span> <span class="n">SentimentResponse</span> <span class="o">=</span> <span class="nb">Arc</span><span class="o"><</span><span class="nf">Fn</span><span class="p">(</span><span class="n">Result</span><span class="o"><</span><span class="n">Analysis</span><span class="o">></span><span class="p">)</span> <span class="o">+</span> <span class="nb">Send</span> <span class="o">+</span> <span class="n">Sync</span> <span class="o">+</span> <span class="nv">'static</span><span class="o">></span><span class="p">;</span>
<span class="nd">#[derive_actor]</span>
<span class="k">impl</span> <span class="n">SentimentAnalyzer</span> <span class="p">{</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">analyze</span><span class="p">(</span><span class="o">&</span><span class="k">self</span><span class="p">,</span> <span class="n">phrase</span><span class="p">:</span> <span class="nb">String</span><span class="p">,</span> <span class="n">res</span><span class="p">:</span> <span class="n">SentimentResponse</span><span class="p">)</span> <span class="p">{</span>
<span class="nd">random_panic!</span><span class="p">(</span><span class="mi">10</span><span class="p">);</span>
<span class="nd">random_latency!</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">20</span><span class="p">);</span>
<span class="k">let</span> <span class="n">analysis</span> <span class="o">=</span> <span class="nf">analyze</span><span class="p">(</span><span class="n">phrase</span><span class="p">);</span>
<span class="nf">res</span><span class="p">(</span><span class="nf">Ok</span><span class="p">(</span><span class="n">analysis</span><span class="p">));</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>random_panic! is a 1 in <code class="language-plaintext highlighter-rouge">n</code> chance to panic. random_latency! is a 1 in <code class="language-plaintext highlighter-rouge">n</code> chance to sleep for <code class="language-plaintext highlighter-rouge">k</code> milliseconds.</p>
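<p>The macros themselves aren’t shown in this post, so the following is only a sketch of how they might be written. The helper <code class="language-plaintext highlighter-rouge">one_in</code> and the function <code class="language-plaintext highlighter-rouge">flaky_len</code> are hypothetical, and the randomness comes from the standard library’s randomly seeded hasher purely to avoid pulling in an external crate.</p>

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};

/// Returns true roughly once every `n` calls (always when `n` is 1).
/// Hypothetical helper: each `RandomState` is seeded differently, so the
/// hash of empty input serves as a cheap random number.
pub fn one_in(n: u64) -> bool {
    assert!(n > 0, "n must be at least 1");
    let roll = RandomState::new().build_hasher().finish();
    roll % n == 0
}

/// Panics with a 1 in `n` chance.
macro_rules! random_panic {
    ($n:expr) => {{
        let n: u64 = $n;
        if one_in(n) {
            panic!("random_panic! fired (1 in {} chance)", n);
        }
    }};
}

/// Sleeps for `millis` milliseconds with a 1 in `n` chance.
macro_rules! random_latency {
    ($n:expr, $millis:expr) => {{
        let n: u64 = $n;
        if one_in(n) {
            std::thread::sleep(std::time::Duration::from_millis($millis));
        }
    }};
}

/// Hypothetical handler body showing where the macros would sit.
pub fn flaky_len(phrase: &str) -> usize {
    random_panic!(10);
    random_latency!(10, 20);
    phrase.len()
}
```

<p>Passing <code class="language-plaintext highlighter-rouge">1</code> as <code class="language-plaintext highlighter-rouge">n</code> makes either macro fire every time, which is handy for checking that on_error is actually wired up.</p>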
<p>You’ll notice I type alias’d the SentimentResponse - this is partly due to it being a large type, but mostly due to my
macro being unable to handle types with multiple segments, e.g. <code class="language-plaintext highlighter-rouge">Box&lt;T&gt;</code> resolves to <code class="language-plaintext highlighter-rouge">Box</code> in generated code. I haven’t felt
the need to fix this yet because it’s actually forced me to avoid vague types like <code class="language-plaintext highlighter-rouge">Box&lt;u8&gt;</code> and write things like:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">type</span> <span class="n">EmailBytes</span> <span class="o">=</span> <span class="nb">Vec</span><span class="o"><</span><span class="nb">u8</span><span class="o">></span><span class="p">;</span>
</code></pre></div></div>
<p>This comes in particularly handy when I need to export that type and change it later.</p>
<h4 id="assume-failure">Assume Failure</h4>
<p>One of the nice things with actors is that they can be treated as entirely separate services, even moved to another process
or system. This boundary forces you to assume that the actor may fail. I may not get a response back, I may get an error
back.</p>
<p>SentimentAnalyzer.analyze(..) can never fail, other than bugs - there’s no IO going on and the crate I use doesn’t
expose a Result. Still, my SentimentAnalyzerActor’s SentimentResponse contains a Result. I found that using Result
for all Actor callbacks led to life being easier later, as code changed and suddenly there were error conditions. If I
wanted to move SentimentAnalyzer to another process none of the code that interacts with it would have to change.</p>
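<p>As a rough illustration of that last point, the sketch below uses one callback type for two implementations: an in-process one that cannot fail and a stand-in for a remote one that can. Everything here is hypothetical (<code class="language-plaintext highlighter-rouge">Analysis</code>, the toy scoring, <code class="language-plaintext highlighter-rouge">analyze_local</code>, <code class="language-plaintext highlighter-rouge">analyze_remote</code>, <code class="language-plaintext highlighter-rouge">collect</code>, and the use of <code class="language-plaintext highlighter-rouge">String</code> as the error type); the only thing it tries to show is that the caller’s callback is identical in both cases.</p>

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical analysis result.
#[derive(Debug, Clone, PartialEq)]
pub struct Analysis {
    pub score: i32,
}

/// Callback type that carries a `Result` even though the local
/// implementation never produces an `Err`.
pub type SentimentResponse = Arc<dyn Fn(Result<Analysis, String>) + Send + Sync + 'static>;

/// In-process version: no IO, cannot fail, still answers with `Ok`.
/// The scoring is a toy stand-in, not a real sentiment algorithm.
pub fn analyze_local(phrase: &str, res: SentimentResponse) {
    let good = phrase.matches("good").count() as i32;
    let bad = phrase.matches("bad").count() as i32;
    res(Ok(Analysis { score: good - bad }));
}

/// Stand-in for a version that lives in another process, where the
/// call itself can fail.
pub fn analyze_remote(_phrase: &str, res: SentimentResponse) {
    res(Err("connection refused".to_string()));
}

/// Caller-side callback; it does not care which implementation answers.
pub fn collect(log: Arc<Mutex<Vec<String>>>) -> SentimentResponse {
    Arc::new(move |outcome: Result<Analysis, String>| {
        let line = match outcome {
            Ok(analysis) => format!("score {}", analysis.score),
            Err(e) => format!("failed: {}", e),
        };
        log.lock().unwrap().push(line);
    })
}
```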
<h4 id="separate-processes">Separate Processes</h4>
<p>I mentioned earlier that I’d have a Python service doing some of the machine learning work. The code is trivial using
Flask:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">app</span><span class="p">.</span><span class="n">route</span><span class="p">(</span><span class="s">'/predict/<string:csv_features>'</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="n">csv_features</span><span class="p">):</span>
    <span class="k">print</span><span class="p">(</span><span class="n">csv_features</span><span class="p">)</span>
    <span class="n">csv_features</span> <span class="o">=</span> <span class="n">StringIO</span><span class="p">(</span><span class="n">csv_features</span><span class="p">)</span>
    <span class="n">features</span><span class="p">:</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">csv_features</span><span class="p">,</span> <span class="n">names</span><span class="o">=</span><span class="p">[</span><span class="s">'a'</span><span class="p">,</span><span class="s">'b'</span><span class="p">,</span><span class="s">'c'</span><span class="p">])</span>
    <span class="n">p</span> <span class="o">=</span> <span class="n">forest</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">features</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="n">p</span><span class="p">)</span>
    <span class="k">return</span> <span class="nb">str</span><span class="p">(</span><span class="n">p</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>

<span class="o">@</span><span class="n">app</span><span class="p">.</span><span class="n">route</span><span class="p">(</span><span class="s">'/health_check'</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">health_check</span><span class="p">():</span>
    <span class="k">return</span> <span class="s">"UP"</span>
</code></pre></div></div>
<p>This simple API gets an Actor mirror on the Rust side, in the PythonModel struct.</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">pub</span> <span class="k">struct</span> <span class="n">PythonModel</span> <span class="p">{</span>
<span class="n">self_ref</span><span class="p">:</span> <span class="n">PythonModelActor</span><span class="p">,</span>
<span class="n">system</span><span class="p">:</span> <span class="n">SystemActor</span><span class="p">,</span>
<span class="n">python</span><span class="p">:</span> <span class="n">Child</span><span class="p">,</span>
<span class="n">client</span><span class="p">:</span> <span class="n">Client</span><span class="p">,</span>
<span class="n">port</span><span class="p">:</span> <span class="nb">u16</span><span class="p">,</span>
<span class="n">path</span><span class="p">:</span> <span class="n">PathBuf</span>
<span class="p">}</span>
<span class="nd">#[derive_actor]</span>
<span class="k">impl</span> <span class="n">PythonModel</span> <span class="p">{</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">predict</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span> <span class="k">self</span><span class="p">,</span> <span class="n">features</span><span class="p">:</span> <span class="n">Features</span><span class="p">,</span> <span class="n">res</span><span class="p">:</span> <span class="n">Prediction</span><span class="p">)</span> <span class="p">{</span>
<span class="nd">println!</span><span class="p">(</span><span class="s">"python pred"</span><span class="p">);</span>
<span class="k">if</span> <span class="k">let</span> <span class="nf">Err</span><span class="p">(</span><span class="n">e</span><span class="p">)</span> <span class="o">=</span> <span class="k">self</span><span class="py">.client</span><span class="nf">.get</span><span class="p">(</span><span class="o">&</span><span class="nd">format!</span><span class="p">(</span><span class="s">"http://127.0.0.1:{}/predict/1,2,3"</span><span class="p">,</span> <span class="k">self</span><span class="py">.port</span><span class="p">))</span><span class="nf">.send</span><span class="p">()</span> <span class="p">{</span>
<span class="nf">res</span><span class="p">(</span><span class="nf">Err</span><span class="p">(</span><span class="nn">ErrorKind</span><span class="p">::</span><span class="nf">RecoverableError</span><span class="p">(</span>
<span class="nd">format!</span><span class="p">(</span><span class="s">"Failed to predict {}"</span><span class="p">,</span> <span class="n">e</span><span class="p">)</span><span class="nf">.into</span><span class="p">())</span>
<span class="nf">.into</span><span class="p">()));</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="nf">res</span><span class="p">(</span><span class="nf">Ok</span><span class="p">(</span><span class="k">false</span><span class="p">));</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The code is clearly incomplete as I only check for errors, otherwise returning a hardcoded prediction. But looking at
the code it’s very simple to model the separate Python process as an actor in the Rust code. This is a big win because
I’m far more comfortable with Pandas and sklearn than anything in Rust.</p>
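<p>One way the hardcoded <code class="language-plaintext highlighter-rouge">1,2,3</code> path and the hardcoded <code class="language-plaintext highlighter-rouge">false</code> could eventually be replaced is sketched below. The <code class="language-plaintext highlighter-rouge">Features</code> struct here is a guess (three numeric fields mirroring the <code class="language-plaintext highlighter-rouge">a</code>, <code class="language-plaintext highlighter-rouge">b</code>, <code class="language-plaintext highlighter-rouge">c</code> column names on the Flask side), and the set of response bodies accepted by <code class="language-plaintext highlighter-rouge">parse_prediction</code> is an assumption about what the model’s labels look like once passed through Python’s <code class="language-plaintext highlighter-rouge">str()</code>. Neither function touches the network, so they can be dropped in front of whatever HTTP client is in use.</p>

```rust
/// Hypothetical feature vector; the fields mirror the column names
/// `a`, `b`, `c` used by the Flask handler.
#[derive(Debug, Clone, PartialEq)]
pub struct Features {
    pub a: f64,
    pub b: f64,
    pub c: f64,
}

/// Builds the URL for the `/predict/<csv_features>` route.
pub fn predict_url(port: u16, features: &Features) -> String {
    format!(
        "http://127.0.0.1:{}/predict/{},{},{}",
        port, features.a, features.b, features.c
    )
}

/// Turns the response body into a prediction.
/// Assumption: the label is a boolean or a 0/1 class rendered by `str()`.
pub fn parse_prediction(body: &str) -> Result<bool, String> {
    match body.trim() {
        "1" | "1.0" | "True" => Ok(true),
        "0" | "0.0" | "False" => Ok(false),
        other => Err(format!("unexpected prediction body: {:?}", other)),
    }
}
```

<p>An unexpected body becomes an <code class="language-plaintext highlighter-rouge">Err</code>, which fits the pattern of reporting every failure through the callback’s Result.</p>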
<h4 id="callbacks">Callbacks</h4>
<p>The callback aspect of actors is clearly prevalent in the codebase. If ActorA sends ActorB a message, because actors are
strongly typed, ActorB does not know how to respond. ActorA could implement an interface and send itself to ActorB but
generic actors aren’t very pretty right now and it seems silly to have to implement ‘HandlesXResponse’ and
‘HandlesYResponse’ 1000 times. Instead, Callbacks serve as a translation mechanism for sending messages to where they
need to go.</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">#[derive_actor]</span>
<span class="k">impl</span> <span class="n">SpamDetectionService</span> <span class="p">{</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">predict_with_cache</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span> <span class="k">self</span><span class="p">,</span> <span class="n">email</span><span class="p">:</span> <span class="n">EmailBytes</span><span class="p">,</span> <span class="n">res</span><span class="p">:</span> <span class="n">PredictionResult</span><span class="p">)</span> <span class="p">{</span>
<span class="k">let</span> <span class="n">self_ref</span> <span class="o">=</span> <span class="k">self</span><span class="py">.self_ref</span><span class="nf">.clone</span><span class="p">();</span>
<span class="k">let</span> <span class="n">res</span> <span class="o">=</span> <span class="n">res</span><span class="nf">.clone</span><span class="p">();</span>
<span class="k">let</span> <span class="n">email</span> <span class="o">=</span> <span class="n">email</span><span class="nf">.clone</span><span class="p">();</span>
<span class="k">let</span> <span class="n">hash</span> <span class="o">=</span> <span class="nn">SpamDetectionService</span><span class="p">::</span><span class="nf">hash_email</span><span class="p">(</span><span class="n">email</span><span class="nf">.clone</span><span class="p">());</span>
<span class="k">self</span><span class="py">.prediction_cache</span>
<span class="nf">.get</span><span class="p">(</span><span class="n">hash</span><span class="p">,</span> <span class="nn">std</span><span class="p">::</span><span class="nn">sync</span><span class="p">::</span><span class="nn">Arc</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="k">move</span> <span class="p">|</span><span class="n">cache_res</span><span class="p">|</span> <span class="p">{</span>
<span class="k">match</span> <span class="n">cache_res</span> <span class="p">{</span>
<span class="nf">Ok</span><span class="p">(</span><span class="nf">Some</span><span class="p">(</span><span class="n">hit</span><span class="p">))</span> <span class="k">=></span> <span class="p">{</span>
<span class="nd">println!</span><span class="p">(</span><span class="s">"pred cache hit"</span><span class="p">);</span>
<span class="nf">res</span><span class="p">(</span><span class="nf">Ok</span><span class="p">(</span><span class="n">hit</span><span class="p">));</span>
<span class="p">}</span>
<span class="mi">_</span> <span class="k">=></span> <span class="n">self_ref</span><span class="nf">.clone</span><span class="p">()</span><span class="nf">.predict</span><span class="p">(</span><span class="n">email</span><span class="nf">.clone</span><span class="p">(),</span> <span class="n">res</span><span class="nf">.clone</span><span class="p">())</span>
<span class="p">};</span>
<span class="p">}));</span>
<span class="p">}</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">predict</span><span class="p">(</span><span class="o">&</span><span class="k">self</span><span class="p">,</span> <span class="n">email</span><span class="p">:</span> <span class="n">EmailBytes</span><span class="p">,</span> <span class="n">res</span><span class="p">:</span> <span class="n">PredictionResult</span><span class="p">)</span> <span class="p">{</span>
<span class="k">let</span> <span class="n">self_ref</span> <span class="o">=</span> <span class="k">self</span><span class="py">.self_ref</span><span class="nf">.clone</span><span class="p">();</span>
<span class="k">let</span> <span class="n">model</span> <span class="o">=</span> <span class="k">self</span><span class="py">.model</span><span class="nf">.clone</span><span class="p">();</span>
<span class="k">self</span><span class="py">.extractor</span><span class="nf">.extract</span><span class="p">(</span><span class="n">email</span><span class="p">,</span> <span class="nn">std</span><span class="p">::</span><span class="nn">sync</span><span class="p">::</span><span class="nn">Arc</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="k">move</span> <span class="p">|</span><span class="n">features</span><span class="p">|</span> <span class="p">{</span>
<span class="k">match</span> <span class="n">features</span> <span class="p">{</span>
<span class="nf">Ok</span><span class="p">(</span><span class="n">data</span><span class="p">)</span> <span class="k">=></span> <span class="p">{</span>
<span class="n">model</span><span class="nf">.clone</span><span class="p">()</span><span class="nf">.predict</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">res</span><span class="nf">.clone</span><span class="p">())</span>
<span class="p">}</span>
<span class="nf">Err</span><span class="p">(</span><span class="n">e</span><span class="p">)</span> <span class="k">=></span> <span class="p">{</span>
<span class="nf">res</span><span class="p">(</span><span class="nf">Err</span><span class="p">(</span><span class="n">e</span><span class="p">));</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="p">}));</span>
<span class="p">}</span>
<span class="c">///</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The code above uses callbacks to route a message. The SpamDetectionService queries the PredictionCacheActor,
and then routes its response. If we get a hit we short circuit the computation, sending <code class="language-plaintext highlighter-rouge">res</code> the immediate result.</p>
<p>Otherwise, we use a reference to our SpamDetectionServiceActor (<code class="language-plaintext highlighter-rouge">self_ref</code>) and send it a message to perform the prediction.</p>
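<p>For readers who want to see the routing idea without the derive_actor macro, here is a stripped-down sketch built on a plain thread and a standard library channel. All names in it (<code class="language-plaintext highlighter-rouge">CacheActor</code>, <code class="language-plaintext highlighter-rouge">CacheMessage</code>, <code class="language-plaintext highlighter-rouge">Compute</code>, and this free-function version of <code class="language-plaintext highlighter-rouge">predict_with_cache</code>) are hypothetical, and a closure stands in for sending a message back to <code class="language-plaintext highlighter-rouge">self_ref</code>.</p>

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Sender};
use std::sync::Arc;
use std::thread;

pub type CacheResponse = Arc<dyn Fn(Option<bool>) + Send + Sync + 'static>;
pub type PredictionResult = Arc<dyn Fn(Result<bool, String>) + Send + Sync + 'static>;
/// Stand-in for "send the service a predict message".
pub type Compute = Arc<dyn Fn(u64, PredictionResult) + Send + Sync + 'static>;

/// Messages understood by the hypothetical cache actor.
pub enum CacheMessage {
    Get { key: u64, res: CacheResponse },
    Set { key: u64, value: bool },
}

/// Handle to the cache actor; cloning it clones the channel sender.
#[derive(Clone)]
pub struct CacheActor {
    sender: Sender<CacheMessage>,
}

impl CacheActor {
    pub fn new() -> CacheActor {
        let (sender, receiver) = channel();
        thread::spawn(move || {
            let mut entries: HashMap<u64, bool> = HashMap::new();
            // The loop ends once every handle (sender) has been dropped.
            for msg in receiver {
                match msg {
                    CacheMessage::Get { key, res } => res(entries.get(&key).cloned()),
                    CacheMessage::Set { key, value } => {
                        entries.insert(key, value);
                    }
                }
            }
        });
        CacheActor { sender }
    }

    pub fn get(&self, key: u64, res: CacheResponse) {
        let _ = self.sender.send(CacheMessage::Get { key, res });
    }

    pub fn set(&self, key: u64, value: bool) {
        let _ = self.sender.send(CacheMessage::Set { key, value });
    }
}

/// The callback is only a route: a hit goes straight to `res`,
/// a miss is forwarded to `compute` along with the same `res`.
pub fn predict_with_cache(cache: &CacheActor, key: u64, compute: Compute, res: PredictionResult) {
    cache.get(
        key,
        Arc::new(move |hit: Option<bool>| match hit {
            Some(value) => res(Ok(value)),
            None => compute(key, res.clone()),
        }),
    );
}
```

<p>The caller never blocks on the cache; whichever path is taken, the answer eventually arrives through <code class="language-plaintext highlighter-rouge">res</code>.</p>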
<p>Closures can get a bit large. I tried to avoid any logic in them, only using them as ‘routes’ or translations between actors.
If I had a nested closure I made an effort to refactor.</p>
<p>Here’s an example of a larger closure, though as you can see its only goal is to route a message:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">worker</span><span class="nf">.predict</span><span class="p">(</span><span class="n">work</span><span class="p">,</span> <span class="nn">Arc</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="k">move</span> <span class="p">|</span><span class="n">p</span><span class="p">|</span> <span class="p">{</span>
<span class="k">match</span> <span class="n">p</span> <span class="p">{</span>
<span class="nf">Ok</span><span class="p">(</span><span class="n">p</span><span class="p">)</span> <span class="k">=></span> <span class="p">{</span>
<span class="n">completion_handler</span><span class="nf">.success</span><span class="p">();</span>
<span class="p">}</span>
<span class="nf">Err</span><span class="p">(</span><span class="k">ref</span> <span class="n">e</span><span class="p">)</span> <span class="k">=></span> <span class="p">{</span>
<span class="k">match</span> <span class="o">*</span><span class="n">e</span><span class="nf">.kind</span><span class="p">()</span> <span class="p">{</span>
<span class="nn">ErrorKind</span><span class="p">::</span><span class="nf">RecoverableError</span><span class="p">(</span><span class="k">ref</span> <span class="n">e</span><span class="p">)</span> <span class="k">=></span> <span class="p">{</span>
<span class="n">completion_handler</span>
<span class="nf">.retry</span><span class="p">(</span><span class="nn">Arc</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="nn">ErrorKind</span><span class="p">::</span><span class="nf">RecoverableError</span><span class="p">(</span><span class="n">e</span><span class="nf">.to_owned</span><span class="p">()</span><span class="nf">.into</span><span class="p">())));</span>
<span class="p">}</span>
<span class="nn">ErrorKind</span><span class="p">::</span><span class="nf">UnrecoverableError</span><span class="p">(</span><span class="k">ref</span> <span class="n">e</span><span class="p">)</span> <span class="k">=></span> <span class="p">{</span>
<span class="n">completion_handler</span>
<span class="nf">.abort</span><span class="p">(</span><span class="nn">Arc</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="nn">ErrorKind</span><span class="p">::</span><span class="nf">UnrecoverableError</span><span class="p">(</span><span class="n">e</span><span class="nf">.to_owned</span><span class="p">()</span><span class="nf">.into</span><span class="p">())));</span>
<span class="p">}</span>
<span class="nn">ErrorKind</span><span class="p">::</span><span class="nf">Msg</span><span class="p">(</span><span class="k">ref</span> <span class="n">e</span><span class="p">)</span> <span class="k">=></span> <span class="p">{</span>
<span class="n">completion_handler</span>
<span class="nf">.retry</span><span class="p">(</span><span class="nn">Arc</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="n">e</span><span class="nf">.as_str</span><span class="p">()</span><span class="nf">.into</span><span class="p">()));</span>
<span class="p">}</span>
<span class="mi">_</span> <span class="k">=></span> <span class="p">{</span>
<span class="n">completion_handler</span>
<span class="nf">.retry</span><span class="p">(</span><span class="nn">Arc</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="s">"An unknown error occurred"</span><span class="nf">.into</span><span class="p">()));</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="nf">res</span><span class="p">(</span><span class="n">p</span><span class="p">);</span>
<span class="p">})</span>
<span class="p">);</span>
</code></pre></div></div>
<p>I found it fairly easy to keep closures small. If one grew too large, that was usually a quick signal that I had some other
issue with the structure of my codebase.</p>
<h4 id="work-stealing-back-pressure">Work Stealing, Back Pressure</h4>
<p>One thing I ran into early on was that I was pushing data to the service faster than it could handle it. I found a few
patterns for dealing with that, mostly based on work stealing and <code class="language-plaintext highlighter-rouge">completion handlers</code>. Whatever produces work for
the SpamDetectionServiceActor also provides a CompletionHandlerActor. This actor is very simple:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">pub</span> <span class="k">struct</span> <span class="n">CompletionHandler</span><span class="o"><</span><span class="n">F</span><span class="o">></span>
<span class="k">where</span> <span class="n">F</span><span class="p">:</span> <span class="nf">Fn</span><span class="p">(</span><span class="n">CompletionStatus</span><span class="p">)</span> <span class="o">+</span> <span class="nb">Send</span> <span class="o">+</span> <span class="n">Sync</span> <span class="o">+</span> <span class="nv">'static</span>
<span class="p">{</span>
<span class="n">self_ref</span><span class="p">:</span> <span class="n">CompletionHandlerActor</span><span class="p">,</span>
<span class="n">system</span><span class="p">:</span> <span class="n">SystemActor</span><span class="p">,</span>
<span class="n">f</span><span class="p">:</span> <span class="n">F</span><span class="p">,</span>
<span class="n">tries</span><span class="p">:</span> <span class="nb">usize</span>
<span class="p">}</span>
<span class="nd">#[derive(Debug,</span> <span class="nd">Clone)]</span>
<span class="k">pub</span> <span class="k">enum</span> <span class="n">CompletionStatus</span> <span class="p">{</span>
<span class="c">/// Processed successfully</span>
<span class="nb">Success</span><span class="p">,</span>
<span class="c">/// A transient error occurred</span>
<span class="nf">Retry</span><span class="p">(</span><span class="n">CloneableError</span><span class="p">,</span> <span class="nb">usize</span><span class="p">),</span>
<span class="c">/// An unrecoverable Error occurred</span>
<span class="nf">Abort</span><span class="p">(</span><span class="n">CloneableError</span><span class="p">)</span>
<span class="p">}</span>
<span class="nd">#[derive_actor]</span>
<span class="k">impl</span><span class="o"><</span><span class="n">F</span><span class="o">></span> <span class="n">CompletionHandler</span><span class="o"><</span><span class="n">F</span><span class="o">></span>
<span class="k">where</span> <span class="n">F</span><span class="p">:</span> <span class="nf">Fn</span><span class="p">(</span><span class="n">CompletionStatus</span><span class="p">)</span> <span class="o">+</span> <span class="nb">Send</span> <span class="o">+</span> <span class="n">Sync</span> <span class="o">+</span> <span class="nv">'static</span>
<span class="p">{</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">success</span><span class="p">(</span><span class="o">&</span><span class="k">self</span><span class="p">)</span> <span class="p">{</span>
<span class="p">(</span><span class="k">self</span><span class="py">.f</span><span class="p">)(</span><span class="nn">CompletionStatus</span><span class="p">::</span><span class="nb">Success</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">retry</span><span class="p">(</span><span class="o">&</span><span class="k">self</span><span class="p">,</span> <span class="n">e</span><span class="p">:</span> <span class="n">CloneableError</span><span class="p">)</span> <span class="p">{</span>
<span class="p">(</span><span class="k">self</span><span class="py">.f</span><span class="p">)(</span><span class="nn">CompletionStatus</span><span class="p">::</span><span class="nf">Retry</span><span class="p">(</span><span class="n">e</span><span class="p">,</span> <span class="k">self</span><span class="py">.tries</span> <span class="o">+</span> <span class="mi">1</span><span class="p">));</span>
<span class="p">}</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">abort</span><span class="p">(</span><span class="o">&</span><span class="k">self</span><span class="p">,</span> <span class="n">e</span><span class="p">:</span> <span class="n">CloneableError</span><span class="p">)</span> <span class="p">{</span>
<span class="p">(</span><span class="k">self</span><span class="py">.f</span><span class="p">)(</span><span class="nn">CompletionStatus</span><span class="p">::</span><span class="nf">Abort</span><span class="p">(</span><span class="n">e</span><span class="p">));</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>When work is completed, the handler forwards the result to the producer. The producer can then decide whether to schedule new
work for an available actor, retry the last message, or give up.</p>
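<p>As a rough sketch of that decision on the producer side (not the code from my service): <code class="language-plaintext highlighter-rouge">NextStep</code>, <code class="language-plaintext highlighter-rouge">next_step</code>, and <code class="language-plaintext highlighter-rouge">MAX_TRIES</code> are hypothetical names, the retry cap of 3 is an arbitrary assumption, and a plain <code class="language-plaintext highlighter-rouge">String</code> stands in for <code class="language-plaintext highlighter-rouge">CloneableError</code> so the example compiles on its own.</p>

```rust
/// Mirrors the CompletionStatus enum above, with a String standing in
/// for CloneableError (an assumption to keep this self-contained).
#[derive(Debug, Clone, PartialEq)]
pub enum CompletionStatus {
    /// Processed successfully
    Success,
    /// A transient error occurred; the usize is the attempt count so far
    Retry(String, usize),
    /// An unrecoverable error occurred
    Abort(String),
}

/// What the producer does next (hypothetical type, not from the service).
#[derive(Debug, PartialEq)]
pub enum NextStep {
    ScheduleNew,
    RetryLast,
    GiveUp,
}

/// Hypothetical retry cap; pick whatever fits the workload.
pub const MAX_TRIES: usize = 3;

/// Maps a completion status to the producer's next action.
/// Because the producer only acts when a completion arrives, it never
/// has more work in flight than the workers have acknowledged.
pub fn next_step(status: &CompletionStatus) -> NextStep {
    match status {
        CompletionStatus::Success => NextStep::ScheduleNew,
        CompletionStatus::Retry(_, tries) if *tries < MAX_TRIES => NextStep::RetryLast,
        CompletionStatus::Retry(_, _) | CompletionStatus::Abort(_) => NextStep::GiveUp,
    }
}
```

<p>The closure handed to the completion handler could call something like <code class="language-plaintext highlighter-rouge">next_step</code> and then message the producer accordingly.</p>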
<h4 id="type-errors">Type Errors</h4>
<p>There are definitely some sore spots when using this library.</p>
<p>First, and probably most painful, is fixing type errors. When I get a type error the message <em>comes from nowhere</em>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>error[E0618]: expected function, found `()`
--> src/main.rs:45:1
|
45 | #[derive_actor]
| ^^^^^^^^^^^^^^^
</code></pre></div></div>
<p>Not great. A combination of cargo-expand, disabling the macro, and walking away from the computer a lot is what I found
most effective for fixing errors given this limitation.</p>
<h4 id="ide-support">IDE Support</h4>
<p>Also, there is no IntelliJ completion for actors. So if I write <code class="language-plaintext highlighter-rouge">some_actor.f</code>, IntelliJ won’t help me complete it to
<code class="language-plaintext highlighter-rouge">some_actor.foo()</code>. I found that first writing code with a non-actor version, then just adding Actor to the end of the type
name where it’s used, actually worked well, but that’s hardly a smooth experience.</p>
<h4 id="scheduling">Scheduling</h4>
<p>I’ve referenced the SystemActor a few times. That actor, on some branches of code, is the scheduler. It’s used to spawn
actors using Rust’s futures crate. I have a basic proof of concept working but it’s not nearly ready. This means that, for now,
all of my actors are based on OS threads. I haven’t had any issues with this: performance is fine, and I can handle thousands
of emails in seconds, even with queries out to a Python service, hardcoded latency in some actors, and failure conditions.</p>
<p>But it’s not ideal.</p>
<h4 id="overall">Overall</h4>
<p>I’ve really enjoyed working on the service. Modelling individual components as actors has allowed me to refactor,
parallelize, and ensure stability in the code. It’s not ideal and it’s not production ready but I’m having fun, and I’m
getting a lot of work done on the service.</p>
Building a Microservice in Rust (With Actors)2017-07-10T00:00:00+00:00http://insanitybit.github.io/2017/07/10/building-a-microservice-in-rust
<p>Recently we had to solve a problem at work: we wanted to use AWS SNS topics to communicate between some of our services, but we also needed to selectively delay visibility/processing of specific messages.</p>
<p>In order to get delayed messages while still using SNS we developed a simple microservice that listens to an SQS Queue, grabs messages, and
publishes the message to the appropriate topic. Because SQS allows timeouts we can have producers set the appropriate timeout on messages and get exactly what we need.</p>
<p>The service works fairly well, but I like excuses to write Rust, so I
came up with an idealized version and started rewriting.</p>
<p>Just so you aren’t disappointed at the end: no, this service is not in production.</p>
<p><a href="https://github.com/insanitybit/queue-delay-app">Here is the Rust code for the project in its current state.</a></p>
<h1 id="service-goals">Service Goals</h1>
<ul>
<li>
<p>Never lose messages, avoid duplicating messages.</p>
</li>
<li>
<p>Attempt to use bulk APIs where possible to cut down on cost, and to improve performance, since this program will spend most of its time in IO.</p>
</li>
<li>
<p>Process as many messages as quickly as possible</p>
</li>
<li>
<p>Remain stable under unexpected conditions</p>
</li>
</ul>
<p>Ideally the code should also be reasonable and clean enough that, if I ever tried to get my company to deploy it, my coworkers would have an easy time understanding how it works. I care less about that part though, since I am unlikely to push this to production.</p>
<h1 id="actors-in-rust">Actors In Rust</h1>
<p>I decided early on that I would structure my code using actors. The work is mostly IO bound and I wanted to try to understand what patterns I would end up with while writing actor-oriented Rust code.</p>
<p>I took some of my code from my <a href="https://github.com/insanitybit/derive_aktor">actor crate</a> and based the patterns off of there to start.</p>
<p>Essentially every structure with some business logic had an actor ‘wrapper’ for it that would just pass messages along to it. The API was typed and made concurrency trivial.</p>
<p>So we would have our business logic structure + impl:
(Note that the code below is slightly abridged)</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="nd">#[derive(Clone)]</span>
<span class="k">pub</span> <span class="k">struct</span> <span class="n">MessageDeleter</span><span class="o"><</span><span class="n">SQ</span><span class="o">></span>
<span class="k">where</span> <span class="n">SQ</span><span class="p">:</span> <span class="n">Sqs</span> <span class="o">+</span> <span class="nb">Send</span> <span class="o">+</span> <span class="n">Sync</span> <span class="o">+</span> <span class="nv">'static</span><span class="p">,</span>
<span class="p">{</span>
<span class="n">sqs_client</span><span class="p">:</span> <span class="nb">Arc</span><span class="o"><</span><span class="n">SQ</span><span class="o">></span><span class="p">,</span>
<span class="n">queue_url</span><span class="p">:</span> <span class="nb">String</span><span class="p">,</span>
<span class="p">}</span>
<span class="k">impl</span><span class="o"><</span><span class="n">SQ</span><span class="o">></span> <span class="n">MessageDeleter</span><span class="o"><</span><span class="n">SQ</span><span class="o">></span>
<span class="k">where</span> <span class="n">SQ</span><span class="p">:</span> <span class="n">Sqs</span> <span class="o">+</span> <span class="nb">Send</span> <span class="o">+</span> <span class="n">Sync</span> <span class="o">+</span> <span class="nv">'static</span><span class="p">,</span>
<span class="p">{</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">new</span><span class="p">(</span><span class="n">sqs_client</span><span class="p">:</span> <span class="nb">Arc</span><span class="o"><</span><span class="n">SQ</span><span class="o">></span><span class="p">,</span> <span class="n">queue_url</span><span class="p">:</span> <span class="nb">String</span><span class="p">)</span> <span class="k">-></span> <span class="n">MessageDeleter</span><span class="o"><</span><span class="n">SQ</span><span class="o">></span> <span class="p">{</span>
<span class="n">MessageDeleter</span> <span class="p">{</span>
<span class="n">sqs_client</span><span class="p">,</span>
<span class="n">queue_url</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">delete_messages</span><span class="p">(</span><span class="o">&</span><span class="k">self</span><span class="p">,</span> <span class="n">receipts</span><span class="p">:</span> <span class="nb">Vec</span><span class="o"><</span><span class="p">(</span><span class="nb">String</span><span class="p">,</span> <span class="n">Instant</span><span class="p">)</span><span class="o">></span><span class="p">)</span> <span class="p">{</span>
<span class="k">let</span> <span class="n">msg_count</span> <span class="o">=</span> <span class="n">receipts</span><span class="nf">.len</span><span class="p">();</span>
<span class="nd">println!</span><span class="p">(</span><span class="s">"Deleting {} messages"</span><span class="p">,</span> <span class="n">msg_count</span><span class="p">);</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">receipt_init_map</span> <span class="o">=</span> <span class="nn">HashMap</span><span class="p">::</span><span class="nf">new</span><span class="p">();</span>
<span class="k">for</span> <span class="p">(</span><span class="n">receipt</span><span class="p">,</span> <span class="n">time</span><span class="p">)</span> <span class="n">in</span> <span class="n">receipts</span> <span class="p">{</span>
<span class="n">receipt_init_map</span><span class="nf">.insert</span><span class="p">(</span><span class="n">receipt</span><span class="p">,</span> <span class="n">time</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">let</span> <span class="n">entries</span> <span class="o">=</span> <span class="n">receipt_init_map</span><span class="nf">.keys</span><span class="p">()</span><span class="nf">.map</span><span class="p">(|</span><span class="n">r</span><span class="p">|</span> <span class="p">{</span>
<span class="n">DeleteMessageBatchRequestEntry</span> <span class="p">{</span>
<span class="n">id</span><span class="p">:</span> <span class="nd">format!</span><span class="p">(</span><span class="s">"{}"</span><span class="p">,</span> <span class="nn">uuid</span><span class="p">::</span><span class="nn">Uuid</span><span class="p">::</span><span class="nf">new_v4</span><span class="p">()),</span>
<span class="n">receipt_handle</span><span class="p">:</span> <span class="n">r</span><span class="nf">.to_owned</span><span class="p">()</span>
<span class="p">}</span>
<span class="p">})</span><span class="nf">.collect</span><span class="p">();</span>
<span class="k">let</span> <span class="n">req</span> <span class="o">=</span> <span class="n">DeleteMessageBatchRequest</span> <span class="p">{</span>
<span class="n">entries</span><span class="p">,</span>
<span class="n">queue_url</span><span class="p">:</span> <span class="k">self</span><span class="py">.queue_url</span><span class="nf">.clone</span><span class="p">()</span>
<span class="p">};</span>
<span class="k">match</span> <span class="k">self</span><span class="py">.sqs_client</span><span class="nf">.delete_message_batch</span><span class="p">(</span><span class="o">&</span><span class="n">req</span><span class="p">)</span> <span class="p">{</span>
<span class="nf">Ok</span><span class="p">(</span><span class="n">res</span><span class="p">)</span> <span class="k">=></span> <span class="p">{</span>
<span class="nd">unimplemented!</span><span class="p">()</span>
<span class="p">},</span>
<span class="nf">Err</span><span class="p">(</span><span class="n">e</span><span class="p">)</span> <span class="k">=></span> <span class="p">{</span>
<span class="nd">println!</span><span class="p">(</span><span class="s">"Failed to delete {} messages: {}"</span><span class="p">,</span> <span class="n">msg_count</span><span class="p">,</span> <span class="n">e</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>A message for communicating to it:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">pub</span> <span class="k">enum</span> <span class="n">MessageDeleterMessage</span> <span class="p">{</span>
<span class="n">DeleteMessages</span> <span class="p">{</span>
<span class="n">receipts</span><span class="p">:</span> <span class="nb">Vec</span><span class="o"><</span><span class="p">(</span><span class="nb">String</span><span class="p">,</span> <span class="n">Instant</span><span class="p">)</span><span class="o">></span><span class="p">,</span>
<span class="p">},</span>
<span class="p">}</span>
</code></pre></div></div>
<p>A ‘route_msg’ function to destructure the message and pass it to the proper method:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">pub</span> <span class="k">fn</span> <span class="nf">route_msg</span><span class="p">(</span><span class="o">&</span><span class="k">self</span><span class="p">,</span> <span class="n">msg</span><span class="p">:</span> <span class="n">MessageDeleterMessage</span><span class="p">)</span> <span class="p">{</span>
<span class="k">match</span> <span class="n">msg</span> <span class="p">{</span>
<span class="nn">MessageDeleterMessage</span><span class="p">::</span><span class="n">DeleteMessages</span> <span class="p">{</span> <span class="n">receipts</span> <span class="p">}</span> <span class="k">=></span> <span class="k">self</span><span class="nf">.delete_messages</span><span class="p">(</span><span class="n">receipts</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>And then the actor interface, which packs function arguments into messages and passes them along.</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="nd">#[derive(Clone)]</span>
<span class="k">pub</span> <span class="k">struct</span> <span class="n">MessageDeleterActor</span> <span class="p">{</span>
<span class="n">sender</span><span class="p">:</span> <span class="n">Sender</span><span class="o"><</span><span class="n">MessageDeleterMessage</span><span class="o">></span><span class="p">,</span>
<span class="n">receiver</span><span class="p">:</span> <span class="n">Receiver</span><span class="o"><</span><span class="n">MessageDeleterMessage</span><span class="o">></span><span class="p">,</span>
<span class="n">id</span><span class="p">:</span> <span class="nb">String</span><span class="p">,</span>
<span class="p">}</span>
<span class="k">impl</span> <span class="n">MessageDeleterActor</span> <span class="p">{</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="n">new</span><span class="o"><</span><span class="n">SQ</span><span class="o">></span><span class="p">(</span><span class="n">actor</span><span class="p">:</span> <span class="n">MessageDeleter</span><span class="o"><</span><span class="n">SQ</span><span class="o">></span><span class="p">)</span> <span class="k">-></span> <span class="n">MessageDeleterActor</span>
<span class="k">where</span> <span class="n">SQ</span><span class="p">:</span> <span class="n">Sqs</span> <span class="o">+</span> <span class="nb">Send</span> <span class="o">+</span> <span class="n">Sync</span> <span class="o">+</span> <span class="nv">'static</span><span class="p">,</span>
<span class="p">{</span>
<span class="k">let</span> <span class="p">(</span><span class="n">sender</span><span class="p">,</span> <span class="n">receiver</span><span class="p">)</span> <span class="o">=</span> <span class="nf">unbounded</span><span class="p">();</span>
<span class="k">let</span> <span class="n">id</span> <span class="o">=</span> <span class="nn">uuid</span><span class="p">::</span><span class="nn">Uuid</span><span class="p">::</span><span class="nf">new_v4</span><span class="p">()</span><span class="nf">.to_string</span><span class="p">();</span>
<span class="k">let</span> <span class="n">recvr</span> <span class="o">=</span> <span class="n">receiver</span><span class="nf">.clone</span><span class="p">();</span>
<span class="nn">thread</span><span class="p">::</span><span class="nf">spawn</span><span class="p">(</span>
<span class="k">move</span> <span class="p">||</span> <span class="p">{</span>
<span class="k">loop</span> <span class="p">{</span>
<span class="k">if</span> <span class="n">recvr</span><span class="nf">.len</span><span class="p">()</span> <span class="o">></span> <span class="mi">10</span> <span class="p">{</span>
<span class="nd">println!</span><span class="p">(</span><span class="s">"MessageDeleterActor queue len {}"</span><span class="p">,</span> <span class="n">recvr</span><span class="nf">.len</span><span class="p">());</span>
<span class="p">}</span>
<span class="k">match</span> <span class="n">recvr</span><span class="nf">.recv_timeout</span><span class="p">(</span><span class="nn">Duration</span><span class="p">::</span><span class="nf">from_secs</span><span class="p">(</span><span class="mi">60</span><span class="p">))</span> <span class="p">{</span>
<span class="nf">Ok</span><span class="p">(</span><span class="n">msg</span><span class="p">)</span> <span class="k">=></span> <span class="p">{</span>
<span class="n">actor</span><span class="nf">.route_msg</span><span class="p">(</span><span class="n">msg</span><span class="p">);</span>
<span class="k">continue</span>
<span class="p">}</span>
<span class="nf">Err</span><span class="p">(</span><span class="nn">RecvTimeoutError</span><span class="p">::</span><span class="n">Disconnected</span><span class="p">)</span> <span class="k">=></span> <span class="p">{</span>
<span class="k">break</span>
<span class="p">}</span>
<span class="nf">Err</span><span class="p">(</span><span class="nn">RecvTimeoutError</span><span class="p">::</span><span class="n">Timeout</span><span class="p">)</span> <span class="k">=></span> <span class="p">{</span>
<span class="k">continue</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">});</span>
<span class="n">MessageDeleterActor</span> <span class="p">{</span>
<span class="n">sender</span><span class="p">:</span> <span class="n">sender</span><span class="p">,</span>
<span class="n">receiver</span><span class="p">:</span> <span class="n">receiver</span><span class="p">,</span>
<span class="n">id</span><span class="p">:</span> <span class="n">id</span><span class="p">,</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">delete_messages</span><span class="p">(</span><span class="o">&</span><span class="k">self</span><span class="p">,</span> <span class="n">receipts</span><span class="p">:</span> <span class="nb">Vec</span><span class="o"><</span><span class="p">(</span><span class="nb">String</span><span class="p">,</span> <span class="n">Instant</span><span class="p">)</span><span class="o">></span><span class="p">)</span> <span class="p">{</span>
<span class="k">self</span><span class="py">.sender</span><span class="nf">.send</span><span class="p">(</span>
<span class="nn">MessageDeleterMessage</span><span class="p">::</span><span class="n">DeleteMessages</span> <span class="p">{</span> <span class="n">receipts</span> <span class="p">}</span>
<span class="p">)</span><span class="nf">.unwrap</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>And you have a pretty sweet, typed, async interface to your actor.</p>
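<p>To see the whole pattern in one small piece, here is a stripped-down sketch using only the standard library. It is not code from the service: <code class="language-plaintext highlighter-rouge">Counter</code>, <code class="language-plaintext highlighter-rouge">CounterMessage</code>, and <code class="language-plaintext highlighter-rouge">CounterActor</code> are made-up names, it uses <code class="language-plaintext highlighter-rouge">std::sync::mpsc</code> rather than the channel type above, and the reply channel in <code class="language-plaintext highlighter-rouge">Get</code> is an addition so the result can be observed.</p>

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, JoinHandle};

/// Business logic: a stand-in for something like MessageDeleter.
pub struct Counter {
    total: u64,
}

impl Counter {
    pub fn new() -> Counter {
        Counter { total: 0 }
    }

    pub fn add(&mut self, n: u64) {
        self.total += n;
    }
}

/// One variant per method exposed through the actor.
pub enum CounterMessage {
    Add { n: u64 },
    Get { reply: Sender<u64> },
}

/// The typed handle callers hold; clones talk to the same actor thread.
#[derive(Clone)]
pub struct CounterActor {
    sender: Sender<CounterMessage>,
}

impl CounterActor {
    /// Moves the business logic onto its own thread and returns a handle to it.
    pub fn new(mut actor: Counter) -> (CounterActor, JoinHandle<()>) {
        let (sender, receiver) = channel();
        let handle = thread::spawn(move || {
            // recv() returns Err once every sender is dropped, ending the loop.
            while let Ok(msg) = receiver.recv() {
                match msg {
                    CounterMessage::Add { n } => actor.add(n),
                    CounterMessage::Get { reply } => {
                        // The caller may have gone away; ignore a failed reply.
                        let _ = reply.send(actor.total);
                    }
                }
            }
        });
        (CounterActor { sender }, handle)
    }

    /// Fire-and-forget: packs the argument into a message and returns.
    pub fn add(&self, n: u64) {
        self.sender.send(CounterMessage::Add { n }).unwrap();
    }

    /// Request/response: blocks until the actor answers.
    pub fn get(&self) -> u64 {
        let (reply, rx) = channel();
        self.sender.send(CounterMessage::Get { reply }).unwrap();
        rx.recv().unwrap()
    }
}
```

<p>Callers only ever see <code class="language-plaintext highlighter-rouge">add</code> and <code class="language-plaintext highlighter-rouge">get</code>; the message enum and the thread are implementation details, which is what makes swapping the scheduling strategy later plausible.</p>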
<p>Still, there’s a <em>lot</em> of boilerplate involved, which is why my actor library attempts to generate most of the above with a macro. Unfortunately that library is not in a place for me to use it right now, and it’s breaking too often, so I used it to generate the boilerplate with ‘cargo expand’ and copy/pasted stuff over, making fixes where necessary.</p>
<p>At one point my actors used ‘fibers’ from the fibers crate, but I ran into performance problems and stability issues. I really wanted my actors to be cheap, so I attempted to use the futures crate, but I just couldn’t get it working.</p>
<h1 id="work-stealing-actors">Work Stealing Actors</h1>
<p>In order to provide work stealing/pooled actors, I added a function to pooled actors, called “from_queue”. Essentially this function acts exactly like ‘new’, except that the caller also passes in the queue. So you could have many actors sharing the same queue.</p>
<p>Then a single ‘Broker’ actor would feed messages into that queue.</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">#[derive(Clone)]</span>
<span class="k">pub</span> <span class="k">struct</span> <span class="n">MessageDeleterBroker</span>
<span class="p">{</span>
<span class="n">workers</span><span class="p">:</span> <span class="nb">Vec</span><span class="o"><</span><span class="n">MessageDeleterActor</span><span class="o">></span><span class="p">,</span>
<span class="n">sender</span><span class="p">:</span> <span class="n">Sender</span><span class="o"><</span><span class="n">MessageDeleterMessage</span><span class="o">></span><span class="p">,</span>
<span class="n">id</span><span class="p">:</span> <span class="nb">String</span>
<span class="p">}</span>
<span class="k">impl</span> <span class="n">MessageDeleterBroker</span>
<span class="p">{</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="n">new</span><span class="o"><</span><span class="n">T</span><span class="p">,</span> <span class="n">F</span><span class="p">,</span> <span class="n">SQ</span><span class="o">></span><span class="p">(</span><span class="n">new</span><span class="p">:</span> <span class="n">F</span><span class="p">,</span>
<span class="n">worker_count</span><span class="p">:</span> <span class="nb">usize</span><span class="p">,</span>
<span class="n">max_queue_depth</span><span class="p">:</span> <span class="n">T</span><span class="p">)</span>
<span class="k">-></span> <span class="n">MessageDeleterBroker</span>
<span class="k">where</span> <span class="n">F</span><span class="p">:</span> <span class="nf">Fn</span><span class="p">(</span><span class="n">MessageDeleterActor</span><span class="p">)</span> <span class="k">-></span> <span class="n">MessageDeleter</span><span class="o"><</span><span class="n">SQ</span><span class="o">></span><span class="p">,</span>
<span class="n">T</span><span class="p">:</span> <span class="n">Into</span><span class="o"><</span><span class="nb">Option</span><span class="o"><</span><span class="nb">usize</span><span class="o">>></span><span class="p">,</span>
<span class="n">SQ</span><span class="p">:</span> <span class="n">Sqs</span> <span class="o">+</span> <span class="nb">Send</span> <span class="o">+</span> <span class="n">Sync</span> <span class="o">+</span> <span class="nv">'static</span><span class="p">,</span>
<span class="p">{</span>
<span class="k">let</span> <span class="n">id</span> <span class="o">=</span> <span class="nn">uuid</span><span class="p">::</span><span class="nn">Uuid</span><span class="p">::</span><span class="nf">new_v4</span><span class="p">()</span><span class="nf">.to_string</span><span class="p">();</span>
<span class="k">let</span> <span class="p">(</span><span class="n">sender</span><span class="p">,</span> <span class="n">receiver</span><span class="p">)</span> <span class="o">=</span> <span class="n">max_queue_depth</span><span class="nf">.into</span><span class="p">()</span><span class="nf">.map_or</span><span class="p">(</span><span class="nf">unbounded</span><span class="p">(),</span> <span class="n">channel</span><span class="p">);</span>
<span class="k">let</span> <span class="n">workers</span> <span class="o">=</span> <span class="p">(</span><span class="mi">0</span><span class="o">..</span><span class="n">worker_count</span><span class="p">)</span>
<span class="nf">.map</span><span class="p">(|</span><span class="mi">_</span><span class="p">|</span> <span class="nn">MessageDeleterActor</span><span class="p">::</span><span class="nf">from_queue</span><span class="p">(</span><span class="o">&</span><span class="n">new</span><span class="p">,</span> <span class="n">sender</span><span class="nf">.clone</span><span class="p">(),</span> <span class="n">receiver</span><span class="nf">.clone</span><span class="p">()))</span>
<span class="nf">.collect</span><span class="p">();</span>
<span class="n">MessageDeleterBroker</span> <span class="p">{</span>
<span class="n">workers</span><span class="p">,</span>
<span class="n">sender</span><span class="p">,</span>
<span class="n">id</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">delete_messages</span><span class="p">(</span><span class="o">&</span><span class="k">self</span><span class="p">,</span> <span class="n">receipts</span><span class="p">:</span> <span class="nb">Vec</span><span class="o"><</span><span class="p">(</span><span class="nb">String</span><span class="p">,</span> <span class="n">Instant</span><span class="p">)</span><span class="o">></span><span class="p">)</span> <span class="p">{</span>
<span class="k">self</span><span class="py">.sender</span><span class="nf">.send</span><span class="p">(</span>
<span class="nn">MessageDeleterMessage</span><span class="p">::</span><span class="n">DeleteMessages</span> <span class="p">{</span> <span class="n">receipts</span> <span class="p">}</span>
<span class="p">)</span><span class="nf">.unwrap</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The broker just generates a bunch of MessageDeleterActor structs using a shared queue among all of them. It then provides an identical interface to its workers.</p>
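<p>The shared-queue idea can be sketched with only the standard library. This is an illustration under assumptions, not the broker above: <code class="language-plaintext highlighter-rouge">spawn_pool</code> is a hypothetical helper, the “work” is just doubling a number, and because <code class="language-plaintext highlighter-rouge">std::sync::mpsc::Receiver</code> cannot be cloned the workers share it behind an <code class="language-plaintext highlighter-rouge">Arc&lt;Mutex&lt;_&gt;&gt;</code> instead of cloning the receiver as the code above does.</p>

```rust
use std::sync::mpsc::{channel, Sender};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

/// Spawns `worker_count` threads that all pull jobs from one queue.
/// Each worker reports (worker_index, job * 2) on `results`.
pub fn spawn_pool(
    worker_count: usize,
    results: Sender<(usize, u32)>,
) -> (Sender<u32>, Vec<JoinHandle<()>>) {
    let (sender, receiver) = channel::<u32>();
    // std's Receiver is not Clone, so the workers share it behind a mutex.
    let receiver = Arc::new(Mutex::new(receiver));

    let handles = (0..worker_count)
        .map(|idx| {
            let receiver = Arc::clone(&receiver);
            let results = results.clone();
            thread::spawn(move || loop {
                // The lock guard is a temporary, so it is released as soon
                // as one job has been taken off the queue.
                let job = receiver.lock().unwrap().recv();
                match job {
                    // Whichever worker is free takes the next job.
                    Ok(n) => results.send((idx, n * 2)).unwrap(),
                    // Every sender was dropped: no more work will arrive.
                    Err(_) => break,
                }
            })
        })
        .collect();

    (sender, handles)
}
```

<p>Which worker handles which job is not deterministic, which is the point: idle workers take whatever is next rather than waiting on a queue of their own.</p>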
<h1 id="buffering-actors">Buffering Actors</h1>
<p>Lastly, I needed a way to buffer deletes. In order to do this I stuck the broker behind a buffer actor. The buffer actor will receive delete requests, store them, and then flush its buffer when it hits its maximum capacity. A separate ‘BufferFlusherActor’ will periodically send flush commands so as to ensure that messages get deleted in a timely manner, even under low load.</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">pub</span> <span class="k">struct</span> <span class="n">MessageDeleteBuffer</span> <span class="p">{</span>
<span class="n">deleter_broker</span><span class="p">:</span> <span class="n">MessageDeleterBroker</span><span class="p">,</span>
<span class="n">buffer</span><span class="p">:</span> <span class="n">ArrayVec</span><span class="o"><</span><span class="p">[(</span><span class="nb">String</span><span class="p">,</span> <span class="n">Instant</span><span class="p">);</span> <span class="mi">10</span><span class="p">]</span><span class="o">></span><span class="p">,</span>
<span class="n">flush_period</span><span class="p">:</span> <span class="n">Duration</span>
<span class="p">}</span>
<span class="k">impl</span> <span class="n">MessageDeleteBuffer</span> <span class="p">{</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">new</span><span class="p">(</span><span class="n">deleter_broker</span><span class="p">:</span> <span class="n">MessageDeleterBroker</span><span class="p">,</span> <span class="n">flush_period</span><span class="p">:</span> <span class="nb">u8</span><span class="p">)</span> <span class="k">-></span> <span class="n">MessageDeleteBuffer</span>
<span class="p">{</span>
<span class="n">MessageDeleteBuffer</span> <span class="p">{</span>
<span class="n">deleter_broker</span><span class="p">:</span> <span class="n">deleter_broker</span><span class="p">,</span>
<span class="n">buffer</span><span class="p">:</span> <span class="nn">ArrayVec</span><span class="p">::</span><span class="nf">new</span><span class="p">(),</span>
<span class="n">flush_period</span><span class="p">:</span> <span class="nn">Duration</span><span class="p">::</span><span class="nf">from_secs</span><span class="p">(</span><span class="n">flush_period</span> <span class="k">as</span> <span class="nb">u64</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">delete_message</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span> <span class="k">self</span><span class="p">,</span> <span class="n">receipt</span><span class="p">:</span> <span class="nb">String</span><span class="p">,</span> <span class="n">init_time</span><span class="p">:</span> <span class="n">Instant</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="k">self</span><span class="py">.buffer</span><span class="nf">.is_full</span><span class="p">()</span> <span class="p">{</span>
<span class="nd">println!</span><span class="p">(</span><span class="s">"MessageDeleteBuffer buffer full. Flushing."</span><span class="p">);</span>
<span class="k">self</span><span class="nf">.flush</span><span class="p">();</span>
<span class="p">}</span>
<span class="k">self</span><span class="py">.buffer</span><span class="nf">.push</span><span class="p">((</span><span class="n">receipt</span><span class="p">,</span> <span class="n">init_time</span><span class="p">));</span>
<span class="p">}</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">flush</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span> <span class="k">self</span><span class="p">)</span> <span class="p">{</span>
<span class="k">self</span><span class="py">.deleter_broker</span><span class="nf">.delete_messages</span><span class="p">(</span><span class="nn">Vec</span><span class="p">::</span><span class="nf">from</span><span class="p">(</span><span class="k">self</span><span class="py">.buffer</span><span class="nf">.as_ref</span><span class="p">()));</span>
<span class="k">self</span><span class="py">.buffer</span><span class="nf">.clear</span><span class="p">();</span>
<span class="p">}</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">on_timeout</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span> <span class="k">self</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="k">self</span><span class="py">.buffer</span><span class="nf">.len</span><span class="p">()</span> <span class="o">!=</span> <span class="mi">0</span> <span class="p">{</span>
<span class="nd">println!</span><span class="p">(</span><span class="s">"MessageDeleteBuffer timeout. Flushing {} messages."</span><span class="p">,</span> <span class="k">self</span><span class="py">.buffer</span><span class="nf">.len</span><span class="p">());</span>
<span class="k">self</span><span class="nf">.flush</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">pub</span> <span class="k">enum</span> <span class="n">MessageDeleteBufferMessage</span> <span class="p">{</span>
<span class="n">Delete</span> <span class="p">{</span>
<span class="n">receipt</span><span class="p">:</span> <span class="nb">String</span><span class="p">,</span>
<span class="n">init_time</span><span class="p">:</span> <span class="n">Instant</span>
<span class="p">},</span>
<span class="n">Flush</span> <span class="p">{},</span>
<span class="n">OnTimeout</span> <span class="p">{},</span>
<span class="p">}</span>
<span class="nd">#[derive(Clone)]</span>
<span class="k">pub</span> <span class="k">struct</span> <span class="n">MessageDeleteBufferActor</span> <span class="p">{</span>
<span class="n">sender</span><span class="p">:</span> <span class="n">Sender</span><span class="o"><</span><span class="n">MessageDeleteBufferMessage</span><span class="o">></span><span class="p">,</span>
<span class="n">receiver</span><span class="p">:</span> <span class="n">Receiver</span><span class="o"><</span><span class="n">MessageDeleteBufferMessage</span><span class="o">></span><span class="p">,</span>
<span class="n">id</span><span class="p">:</span> <span class="nb">String</span><span class="p">,</span>
<span class="p">}</span>
<span class="k">impl</span> <span class="n">MessageDeleteBufferActor</span> <span class="p">{</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">new</span>
<span class="p">(</span><span class="n">actor</span><span class="p">:</span> <span class="n">MessageDeleteBuffer</span><span class="p">)</span>
<span class="k">-></span> <span class="n">MessageDeleteBufferActor</span>
<span class="p">{</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">actor</span> <span class="o">=</span> <span class="n">actor</span><span class="p">;</span>
<span class="k">let</span> <span class="p">(</span><span class="n">sender</span><span class="p">,</span> <span class="n">receiver</span><span class="p">)</span> <span class="o">=</span> <span class="nf">unbounded</span><span class="p">();</span>
<span class="k">let</span> <span class="n">id</span> <span class="o">=</span> <span class="nn">uuid</span><span class="p">::</span><span class="nn">Uuid</span><span class="p">::</span><span class="nf">new_v4</span><span class="p">()</span><span class="nf">.to_string</span><span class="p">();</span>
<span class="k">let</span> <span class="n">recvr</span> <span class="o">=</span> <span class="n">receiver</span><span class="nf">.clone</span><span class="p">();</span>
<span class="nn">thread</span><span class="p">::</span><span class="nf">spawn</span><span class="p">(</span>
<span class="k">move</span> <span class="p">||</span> <span class="p">{</span>
<span class="k">loop</span> <span class="p">{</span>
<span class="k">if</span> <span class="n">recvr</span><span class="nf">.len</span><span class="p">()</span> <span class="o">></span> <span class="mi">10</span> <span class="p">{</span>
<span class="nd">println!</span><span class="p">(</span><span class="s">"MessageDeleteBufferActor queue len {}"</span><span class="p">,</span> <span class="n">recvr</span><span class="nf">.len</span><span class="p">());</span>
<span class="p">}</span>
<span class="k">match</span> <span class="n">recvr</span><span class="nf">.recv_timeout</span><span class="p">(</span><span class="nn">Duration</span><span class="p">::</span><span class="nf">from_secs</span><span class="p">(</span><span class="mi">60</span><span class="p">))</span> <span class="p">{</span>
<span class="nf">Ok</span><span class="p">(</span><span class="n">msg</span><span class="p">)</span> <span class="k">=></span> <span class="p">{</span>
<span class="n">actor</span><span class="nf">.route_msg</span><span class="p">(</span><span class="n">msg</span><span class="p">);</span>
<span class="k">continue</span>
<span class="p">}</span>
<span class="nf">Err</span><span class="p">(</span><span class="nn">RecvTimeoutError</span><span class="p">::</span><span class="n">Disconnected</span><span class="p">)</span> <span class="k">=></span> <span class="p">{</span>
<span class="k">break</span>
<span class="p">}</span>
<span class="nf">Err</span><span class="p">(</span><span class="nn">RecvTimeoutError</span><span class="p">::</span><span class="n">Timeout</span><span class="p">)</span> <span class="k">=></span> <span class="p">{</span>
<span class="nd">println!</span><span class="p">(</span><span class="s">"MessageDeleteBufferActor Haven't received a message in 60 seconds"</span><span class="p">);</span>
<span class="k">continue</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">});</span>
<span class="n">MessageDeleteBufferActor</span> <span class="p">{</span>
<span class="n">sender</span><span class="p">:</span> <span class="n">sender</span><span class="p">,</span>
<span class="n">receiver</span><span class="p">:</span> <span class="n">receiver</span><span class="p">,</span>
<span class="n">id</span><span class="p">:</span> <span class="n">id</span><span class="p">,</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">delete_message</span><span class="p">(</span><span class="o">&</span><span class="k">self</span><span class="p">,</span> <span class="n">receipt</span><span class="p">:</span> <span class="nb">String</span><span class="p">,</span> <span class="n">init_time</span><span class="p">:</span> <span class="n">Instant</span><span class="p">)</span> <span class="p">{</span>
<span class="k">let</span> <span class="n">msg</span> <span class="o">=</span> <span class="nn">MessageDeleteBufferMessage</span><span class="p">::</span><span class="n">Delete</span> <span class="p">{</span>
<span class="n">receipt</span><span class="p">,</span>
<span class="n">init_time</span>
<span class="p">};</span>
<span class="k">self</span><span class="py">.sender</span><span class="nf">.send</span><span class="p">(</span><span class="n">msg</span><span class="p">)</span><span class="nf">.expect</span><span class="p">(</span><span class="s">"All receivers have died."</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">flush</span><span class="p">(</span><span class="o">&</span><span class="k">self</span><span class="p">)</span> <span class="p">{</span>
<span class="k">let</span> <span class="n">msg</span> <span class="o">=</span> <span class="nn">MessageDeleteBufferMessage</span><span class="p">::</span><span class="n">Flush</span> <span class="p">{};</span>
<span class="k">self</span><span class="py">.sender</span><span class="nf">.send</span><span class="p">(</span><span class="n">msg</span><span class="p">)</span><span class="nf">.expect</span><span class="p">(</span><span class="s">"All receivers have died."</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">on_timeout</span><span class="p">(</span><span class="o">&</span><span class="k">self</span><span class="p">)</span> <span class="p">{</span>
<span class="k">let</span> <span class="n">msg</span> <span class="o">=</span> <span class="nn">MessageDeleteBufferMessage</span><span class="p">::</span><span class="n">OnTimeout</span> <span class="p">{};</span>
<span class="k">self</span><span class="py">.sender</span><span class="nf">.send</span><span class="p">(</span><span class="n">msg</span><span class="p">)</span><span class="nf">.expect</span><span class="p">(</span><span class="s">"All receivers have died."</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">impl</span> <span class="n">MessageDeleteBuffer</span>
<span class="p">{</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">route_msg</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span> <span class="k">self</span><span class="p">,</span> <span class="n">msg</span><span class="p">:</span> <span class="n">MessageDeleteBufferMessage</span><span class="p">)</span> <span class="p">{</span>
<span class="k">match</span> <span class="n">msg</span> <span class="p">{</span>
<span class="nn">MessageDeleteBufferMessage</span><span class="p">::</span><span class="n">Delete</span> <span class="p">{</span>
<span class="n">receipt</span><span class="p">,</span>
<span class="n">init_time</span>
<span class="p">}</span> <span class="k">=></span> <span class="p">{</span>
<span class="k">self</span><span class="nf">.delete_message</span><span class="p">(</span><span class="n">receipt</span><span class="p">,</span> <span class="n">init_time</span><span class="p">)</span>
<span class="p">}</span>
<span class="nn">MessageDeleteBufferMessage</span><span class="p">::</span><span class="n">Flush</span> <span class="p">{}</span> <span class="k">=></span> <span class="k">self</span><span class="nf">.flush</span><span class="p">(),</span>
<span class="nn">MessageDeleteBufferMessage</span><span class="p">::</span><span class="n">OnTimeout</span> <span class="p">{}</span> <span class="k">=></span> <span class="k">self</span><span class="nf">.on_timeout</span><span class="p">(),</span>
<span class="p">};</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>In my tests, under load, the Buffer almost always flushes a full 10 messages. There’s room to improve the buffer strategy, such as resetting the Flusher’s timer whenever we flush due to a full buffer, but this is already a massive improvement over using the non-bulk API.</p>
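<p>As a rough sketch of the timer-reset idea, the buffer below pushes its own deadline forward every time it flushes, so a flush caused by a full buffer also delays the next timed flush. The names (DeleteBuffer, on_tick, flushed) are hypothetical and not part of the service, and the bulk delete call is replaced by recording each batch. The current time is passed in explicitly to keep the logic easy to test.</p>

```rust
use std::time::{Duration, Instant};

/// Hypothetical buffer that flushes when it is full or when its deadline passes.
pub struct DeleteBuffer {
    items: Vec<String>,
    capacity: usize,
    flush_period: Duration,
    next_timeout: Instant,
    /// Batches handed off so far; stands in for the bulk delete call.
    pub flushed: Vec<Vec<String>>,
}

impl DeleteBuffer {
    pub fn new(capacity: usize, flush_period: Duration, now: Instant) -> Self {
        DeleteBuffer {
            items: Vec::with_capacity(capacity),
            capacity,
            flush_period,
            next_timeout: now + flush_period,
            flushed: Vec::new(),
        }
    }

    /// Flush first if the buffer is already full, then store the receipt.
    pub fn push(&mut self, receipt: String, now: Instant) {
        if self.items.len() >= self.capacity {
            self.flush(now);
        }
        self.items.push(receipt);
    }

    /// Hand off the current batch (if any) and reset the deadline.
    pub fn flush(&mut self, now: Instant) {
        if !self.items.is_empty() {
            let batch = std::mem::take(&mut self.items);
            self.flushed.push(batch);
        }
        // Any flush, timed or not, pushes the next timed flush back.
        self.next_timeout = now + self.flush_period;
    }

    /// Called periodically by a timer; flushes only once the deadline has passed.
    pub fn on_tick(&mut self, now: Instant) {
        if now >= self.next_timeout {
            self.flush(now);
        }
    }
}
```

<p>With this shape, a burst that fills the buffer does not get followed by a near-empty timed flush a moment later.</p>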
<p>The rest of the code uses very similar patterns. Some other particularly interesting patterns are my MessageStateManager, which is an actor that maintains a message’s state in SQS (Invisible, Deleted, etc) as well as the AutoScalingActor, which will emit messages to brokers letting them know when to scale up or down.</p>
<h1 id="the-good-stuff">The Good Stuff</h1>
<p>The design has worked very well in my tests, and rust facilitated that design. With some work I would have a lot of confidence in the correctness and stability of my code.</p>
<p>The actual processing of data is fast, and because of Rust’s explicit nature, if I ever need to make it faster I can easily audit for low-hanging fruit by looking for .clone() calls.</p>
<p>There’s room for optimization around message passing. Rust’s strong type system should eventually let me pass messages with zero copying.</p>
<p>Serde is awesome. I used it for deserialization and it was a breeze.</p>
<p>Actors in rust actually worked pretty well, I’m happy with how the code looks and feels.</p>
<p>The community was very helpful. I got a lot of help from a lot of people, usually within minutes of asking for help.</p>
<h1 id="problems">Problems</h1>
<p>One problem I ran into with this was enforcing ordering of events. I used to not guarantee that message deletes happened before the cancellation of timeouts, which led to errors when trying to extend the timeout of a deleted message.</p>
<p>That’s where the MessageStateManager came in - when a message was done being processed, you told the state manager, and it enforced the ordering of events. This is all based on causal ordering - the guarantee that if I place a message A on a queue, and then message B, A will arrive before B. This allowed me to ensure that as long as I handled visibility timeouts and deletes via the same queue (or series of queues) I could guarantee ordering of events.</p>
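<p>The ordering argument can be sketched with a single std channel. This is only an illustration under assumptions, not the real MessageStateManager: the StateEvent enum, spawn_state_manager and the log strings are all hypothetical. Because one sender feeds one queue, the consumer sees a delete before any extension that was sent after it, and can skip the late extension.</p>

```rust
use std::collections::HashSet;
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical events for a message's visibility state.
pub enum StateEvent {
    Extend(String),
    Delete(String),
}

/// Spawns a consumer that applies events strictly in arrival order and
/// returns a log of what it did once every sender has been dropped.
pub fn spawn_state_manager() -> (Sender<StateEvent>, JoinHandle<Vec<String>>) {
    let (tx, rx) = mpsc::channel();
    let handle = thread::spawn(move || {
        let mut deleted = HashSet::new();
        let mut log = Vec::new();
        for event in rx {
            match event {
                StateEvent::Delete(receipt) => {
                    log.push(format!("delete {}", receipt));
                    deleted.insert(receipt);
                }
                StateEvent::Extend(receipt) => {
                    if deleted.contains(&receipt) {
                        // The delete arrived first, so extending would be an error.
                        log.push(format!("skip extend {}", receipt));
                    } else {
                        log.push(format!("extend {}", receipt));
                    }
                }
            }
        }
        log
    });
    (tx, handle)
}
```

<p>The guarantee only holds for events that travel through the same queue from the same sender; two independent queues would bring the race back.</p>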
<p>This is simply a pattern that I had to internalize, not really any fault of Rust. I have considered how to avoid this issue, and it likely just means writing a spec for the service beforehand and then using types to enforce it - there are at least a few types that can be broken out into session types.</p>
<p>I also ran into a few problems with the AWS SDK in rust - rusoto. While I’m super happy to see the progress that it’s made over the course of its development it’s still got a few issues that need sorting out before I could use it in production.</p>
<p>In particular, the clients don’t support timeouts or automatic retry policies. So sometimes calls to the services will take 20+ seconds and then time out. I actually created a macro to provide timeouts, but it’s a hack and I’d really want native support.</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">macro_rules!</span> <span class="n">timeout_ms</span> <span class="p">{</span>
<span class="p">(</span><span class="nv">$pool:expr</span><span class="p">,</span> <span class="nv">$closure:expr</span><span class="p">,</span> <span class="nv">$dur:expr</span><span class="p">,</span> <span class="nv">$timer:expr</span><span class="p">)</span> <span class="k">=></span> <span class="p">{</span>
<span class="p">{</span>
<span class="k">let</span> <span class="n">timeout</span> <span class="o">=</span> <span class="nv">$timer</span><span class="nf">.sleep</span><span class="p">(</span><span class="nn">Duration</span><span class="p">::</span><span class="nf">from_millis</span><span class="p">(</span><span class="nv">$dur</span><span class="p">))</span>
<span class="nf">.then</span><span class="p">(|</span><span class="mi">_</span><span class="p">|</span> <span class="nf">Err</span><span class="p">(()));</span>
<span class="k">let</span> <span class="n">value</span> <span class="o">=</span> <span class="nv">$pool</span><span class="nf">.spawn_fn</span><span class="p">(</span><span class="nv">$closure</span><span class="p">);</span>
<span class="k">let</span> <span class="n">value_or_timeout</span> <span class="o">=</span> <span class="n">timeout</span><span class="nf">.select</span><span class="p">(</span><span class="n">value</span><span class="p">)</span><span class="nf">.map</span><span class="p">(|(</span><span class="n">win</span><span class="p">,</span> <span class="mi">_</span><span class="p">)|</span> <span class="n">win</span><span class="p">);</span>
<span class="n">value_or_timeout</span><span class="nf">.wait</span><span class="p">()</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="p">}</span>
</code></pre></div></div>
<p>This macro just takes a function and executes it or times out. The problem is that, as far as I am aware, it isn’t able to actually stop the execution of the function after it’s timed out. So I could time out, and then the call could succeed later. This is definitely not acceptable for an SNS publish, as we want to avoid double publishing as much as possible.</p>
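<p>The same limitation shows up in a std-only version of the idea. The sketch below is not the macro above and does not use futures or rusoto; run_with_timeout and slow_publish are hypothetical names. The caller stops waiting after the limit, but the spawned work is never cancelled, so its side effect still happens later.</p>

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{self, RecvTimeoutError};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Runs `work` on its own thread and waits at most `limit` for the result.
/// On timeout the thread is NOT stopped; it keeps running to completion.
pub fn run_with_timeout<T, F>(work: F, limit: Duration) -> Result<T, RecvTimeoutError>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone if the caller timed out; ignore that.
        let _ = tx.send(work());
    });
    rx.recv_timeout(limit)
}

/// Stand-in for a slow publish call: sleeps, then records that it "published".
pub fn slow_publish(
    published: Arc<AtomicBool>,
    delay: Duration,
) -> impl FnOnce() + Send + 'static {
    move || {
        thread::sleep(delay);
        published.store(true, Ordering::SeqCst);
    }
}
```

<p>If the caller retries after the timeout, the original call and the retry can both succeed, which is exactly the double publish to avoid.</p>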
<p>This is honestly the number one blocker - and thankfully there’s progress being made. In particular, hyper 0.11 support in rusoto will allow some of these features to be built.</p>
<p>Ultimately, however, I was quite happy with the experience. I’m going to continue to improve the application at least in part because it’s been so much fun coming up with fun patterns in rust, particularly with actors.</p>
<p>I’m surprised by how close Rust is to being production ready for my use case. Some work on Rusoto, and ideally some better async support in general, and I could start making the case to productionalize this.</p>
Derive Actor2017-05-07T00:00:00+00:00http://insanitybit.github.io/2017/05/07/derive-actor
<p>Recently I’ve been trying to extend my actor library, <a href="https://github.com/insanitybit/aktors">Aktors</a>, to deal
with type safety. The current version relies heavily on the Any type, which has
two serious problems:</p>
<ul>
<li>It means that your code is way slower than it needs to be - Any means dynamic
dispatch all over the place</li>
<li>You lose static type safety - you can send an actor a message that it cannot
handle</li>
</ul>
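<p>To make the trade-off concrete, here is a minimal sketch comparing the two styles. It is not the Aktors API; handle_any, LoggerMessage and handle_typed are hypothetical names. The Any-based handler accepts any boxed value and only discovers a wrong type at runtime, while the enum-based handler cannot even be called with a message it does not know.</p>

```rust
use std::any::Any;

/// Any-based handler: accepts anything, and a wrong type is only noticed at runtime.
pub fn handle_any(msg: Box<dyn Any + Send>) -> Option<String> {
    match msg.downcast::<String>() {
        Ok(line) => Some(format!("log: {}", line)),
        // The message was not a String; the caller only finds out now.
        Err(_) => None,
    }
}

/// Typed alternative: every message the logger understands is a variant.
pub enum LoggerMessage {
    Info(String),
    Error(String),
}

/// The match is exhaustive, so an unhandled message is a compile error.
pub fn handle_typed(msg: LoggerMessage) -> String {
    match msg {
        LoggerMessage::Info(line) => format!("info: {}", line),
        LoggerMessage::Error(line) => format!("error: {}", line),
    }
}
```

<p>Passing a number to handle_any compiles and silently returns None, whereas there is no way to build a LoggerMessage out of a number at all.</p>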
<p>What we want in an actor framework is the ability to write simple, sequential code
and have concurrency ‘just happen’. We don’t want to sacrifice performance, and
we don’t want to sacrifice type safety.</p>
<p>(note: If you are interested in working on the project, feel free to reach out to me,
I’d be happy to walk anyone through the code and split some work out)</p>
<h1 id="current-state">Current State</h1>
<p>To provide that experience I created <a href="https://github.com/insanitybit/derive_aktor">derive_aktor</a>.</p>
<p>This project provides a macro that you can apply to your structure, and generates
all of the necessary types and functions such that you can work with an Actor version
of your structure fairly seamlessly, without having to think (much) about concurrency.</p>
<p>For example:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">pub</span> <span class="k">struct</span> <span class="n">PrintLogger</span> <span class="p">{</span>
<span class="p">}</span>
<span class="nd">#[derive_actor]</span>
<span class="k">impl</span> <span class="n">PrintLogger</span> <span class="p">{</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="n">info</span><span class="o"><</span><span class="n">T</span><span class="p">:</span> <span class="n">Debug</span> <span class="o">+</span> <span class="nb">Send</span> <span class="o">+</span> <span class="nv">'static</span><span class="o">></span><span class="p">(</span><span class="o">&</span><span class="k">self</span><span class="p">,</span> <span class="n">data</span><span class="p">:</span> <span class="n">T</span><span class="p">)</span> <span class="p">{</span>
<span class="nd">println!</span><span class="p">(</span><span class="s">"{:?}"</span><span class="p">,</span> <span class="n">data</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="n">error</span><span class="o"><</span><span class="n">T</span><span class="p">:</span> <span class="n">Debug</span> <span class="o">+</span> <span class="nb">Send</span> <span class="o">+</span> <span class="nv">'static</span><span class="o">></span><span class="p">(</span><span class="o">&</span><span class="k">self</span><span class="p">,</span> <span class="n">data</span><span class="p">:</span> <span class="n">T</span><span class="p">)</span> <span class="p">{</span>
<span class="nd">println!</span><span class="p">(</span><span class="s">"{:?}"</span><span class="p">,</span> <span class="n">data</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The above code will generate a new impl for PrintLogger, with the following method:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">impl</span> <span class="n">PrintLogger</span> <span class="p">{</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="n">route_msg</span><span class="o"><</span><span class="n">InfoT</span><span class="p">:</span> <span class="n">Debug</span> <span class="o">+</span> <span class="nb">Send</span> <span class="o">+</span> <span class="nv">'static</span><span class="p">,</span> <span class="n">ErrorT</span><span class="p">:</span> <span class="n">Debug</span> <span class="o">+</span> <span class="nb">Send</span> <span class="o">+</span>
<span class="nv">'static</span><span class="o">></span><span class="p">(</span><span class="o">&</span><span class="k">mut</span> <span class="k">self</span><span class="p">,</span>
<span class="n">msg</span><span class="p">:</span> <span class="n">PrintLoggerMessage</span><span class="o"><</span><span class="n">InfoT</span><span class="p">,</span> <span class="n">ErrorT</span><span class="o">></span><span class="p">)</span> <span class="p">{</span>
<span class="k">match</span> <span class="n">msg</span> <span class="p">{</span>
<span class="nn">PrintLoggerMessage</span><span class="p">::</span><span class="n">InfoVariant</span> <span class="p">{</span> <span class="n">data</span><span class="p">:</span> <span class="n">data</span> <span class="p">}</span> <span class="k">=></span> <span class="k">self</span><span class="nf">.info</span><span class="p">(</span><span class="n">data</span><span class="p">),</span>
<span class="nn">PrintLoggerMessage</span><span class="p">::</span><span class="n">ErrorVariant</span> <span class="p">{</span> <span class="n">data</span><span class="p">:</span> <span class="n">data</span> <span class="p">}</span> <span class="k">=></span> <span class="k">self</span><span class="nf">.error</span><span class="p">(</span><span class="n">data</span><span class="p">),</span>
<span class="p">};</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>as well as a PrintLoggerActor that we will interact with directly.</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">impl</span><span class="o"><</span><span class="n">InfoT</span><span class="p">:</span> <span class="n">Debug</span> <span class="o">+</span> <span class="nb">Send</span> <span class="o">+</span> <span class="nv">'static</span><span class="p">,</span> <span class="n">ErrorT</span><span class="p">:</span> <span class="n">Debug</span> <span class="o">+</span> <span class="nb">Send</span> <span class="o">+</span> <span class="nv">'static</span><span class="o">></span> <span class="n">PrintLoggerActor</span><span class="o"><</span><span class="n">InfoT</span><span class="p">,</span>
<span class="n">ErrorT</span><span class="o">></span> <span class="p">{</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="n">new</span><span class="o"><</span><span class="n">H</span><span class="p">:</span> <span class="nb">Send</span> <span class="o">+</span> <span class="nn">fibers</span><span class="p">::</span><span class="n">Spawn</span> <span class="o">+</span> <span class="n">Clone</span> <span class="o">+</span> <span class="nv">'static</span><span class="o">></span><span class="p">(</span><span class="n">handle</span><span class="p">:</span> <span class="n">H</span><span class="p">,</span>
<span class="n">actor</span><span class="p">:</span> <span class="n">PrintLogger</span><span class="p">)</span>
<span class="k">-></span> <span class="n">PrintLoggerActor</span><span class="o"><</span><span class="n">InfoT</span><span class="p">,</span> <span class="n">ErrorT</span><span class="o">></span> <span class="p">{</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">actor</span> <span class="o">=</span> <span class="n">actor</span><span class="p">;</span>
<span class="k">let</span> <span class="p">(</span><span class="n">sender</span><span class="p">,</span> <span class="n">receiver</span><span class="p">)</span> <span class="o">=</span> <span class="nf">unbounded</span><span class="p">();</span>
<span class="k">let</span> <span class="n">id</span> <span class="o">=</span> <span class="s">"random string"</span><span class="nf">.to_owned</span><span class="p">();</span>
<span class="k">let</span> <span class="n">recvr</span> <span class="o">=</span> <span class="n">receiver</span><span class="nf">.clone</span><span class="p">();</span>
<span class="n">handle</span><span class="nf">.spawn</span><span class="p">(</span><span class="nn">futures</span><span class="p">::</span><span class="nf">lazy</span><span class="p">(</span><span class="k">move</span> <span class="p">||</span> <span class="p">{</span>
<span class="nf">loop_fn</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="k">move</span> <span class="p">|</span><span class="mi">_</span><span class="p">|</span> <span class="k">match</span> <span class="n">recvr</span><span class="nf">.try_recv</span><span class="p">()</span> <span class="p">{</span>
<span class="nf">Ok</span><span class="p">(</span><span class="n">msg</span><span class="p">)</span> <span class="k">=></span> <span class="p">{</span>
<span class="n">actor</span><span class="nf">.route_msg</span><span class="p">(</span><span class="n">msg</span><span class="p">);</span>
<span class="nn">Ok</span><span class="p">::</span><span class="o"><</span><span class="mi">_</span><span class="p">,</span> <span class="mi">_</span><span class="o">></span><span class="p">(</span><span class="nn">futures</span><span class="p">::</span><span class="nn">future</span><span class="p">::</span><span class="nn">Loop</span><span class="p">::</span><span class="nf">Continue</span><span class="p">(</span><span class="mi">0</span><span class="p">))</span>
<span class="p">}</span>
<span class="nf">Err</span><span class="p">(</span><span class="nn">TryRecvError</span><span class="p">::</span><span class="n">Disconnected</span><span class="p">)</span> <span class="k">=></span> <span class="nn">Ok</span><span class="p">::</span><span class="o"><</span><span class="mi">_</span><span class="p">,</span> <span class="mi">_</span><span class="o">></span><span class="p">(</span><span class="nn">futures</span><span class="p">::</span><span class="nn">future</span><span class="p">::</span><span class="nn">Loop</span><span class="p">::</span><span class="nf">Break</span><span class="p">(())),</span>
<span class="nf">Err</span><span class="p">(</span><span class="nn">TryRecvError</span><span class="p">::</span><span class="n">Empty</span><span class="p">)</span> <span class="k">=></span> <span class="nn">Ok</span><span class="p">::</span><span class="o"><</span><span class="mi">_</span><span class="p">,</span> <span class="mi">_</span><span class="o">></span><span class="p">(</span><span class="nn">futures</span><span class="p">::</span><span class="nn">future</span><span class="p">::</span><span class="nn">Loop</span><span class="p">::</span><span class="nf">Continue</span><span class="p">(</span><span class="mi">0</span><span class="p">)),</span>
<span class="p">})</span>
<span class="p">}));</span>
<span class="n">PrintLoggerActor</span> <span class="p">{</span>
<span class="n">sender</span><span class="p">:</span> <span class="n">sender</span><span class="p">,</span>
<span class="n">receiver</span><span class="p">:</span> <span class="n">receiver</span><span class="p">,</span>
<span class="n">id</span><span class="p">:</span> <span class="n">id</span><span class="p">,</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">info</span><span class="p">(</span><span class="o">&</span><span class="k">self</span><span class="p">,</span> <span class="n">data</span><span class="p">:</span> <span class="n">InfoT</span><span class="p">)</span> <span class="p">{</span>
<span class="k">let</span> <span class="n">msg</span> <span class="o">=</span> <span class="nn">PrintLoggerMessage</span><span class="p">::</span><span class="n">InfoVariant</span> <span class="p">{</span> <span class="n">data</span><span class="p">:</span> <span class="n">data</span> <span class="p">};</span>
<span class="k">self</span><span class="py">.sender</span><span class="nf">.send</span><span class="p">(</span><span class="n">msg</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">error</span><span class="p">(</span><span class="o">&</span><span class="k">self</span><span class="p">,</span> <span class="n">data</span><span class="p">:</span> <span class="n">ErrorT</span><span class="p">)</span> <span class="p">{</span>
<span class="k">let</span> <span class="n">msg</span> <span class="o">=</span> <span class="nn">PrintLoggerMessage</span><span class="p">::</span><span class="n">ErrorVariant</span> <span class="p">{</span> <span class="n">data</span><span class="p">:</span> <span class="n">data</span> <span class="p">};</span>
<span class="k">self</span><span class="py">.sender</span><span class="nf">.send</span><span class="p">(</span><span class="n">msg</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The above code is a little complicated. The end result is three methods:</p>
<ul>
<li>A ‘new’ constructor, which takes an executor handle (anything implementing fibers::Spawn in the code above) and the PrintLogger
actor.</li>
<li>Actor versions of the ‘info’ and ‘error’ methods, which pack the arguments into
a message and send them off to the underlying PrintLogger’s route_msg function.</li>
</ul>
<p>This means you can work with an actor that has an identical interface
to your underlying structure - same method names, same types - but all functions
are non-blocking.</p>
<h1 id="future-what-about-return-values">Future: What about return values?</h1>
<p>Imagine a function with a similar structure as the following:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">pub</span> <span class="k">fn</span> <span class="nf">try_log</span><span class="p">(</span><span class="o">&</span><span class="k">self</span><span class="p">,</span> <span class="n">data</span><span class="p">:</span> <span class="nb">String</span><span class="p">)</span> <span class="k">-></span> <span class="n">Result</span><span class="o"><</span><span class="p">(),</span> <span class="p">()</span><span class="o">></span> <span class="p">{</span>
<span class="nf">Ok</span><span class="p">(())</span>
<span class="p">}</span>
</code></pre></div></div>
<p>We may want to handle the potential error case in that function. Currently,
you have to pass in an actor that can handle the error. One of the next steps
in the library is to provide a more general way to handle results of functions.</p>
<p>Specifically, that function would generate an enum variant like this:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="nn">PrintLogger</span><span class="p">::</span><span class="n">TryLog</span> <span class="p">{</span>
<span class="n">data</span><span class="p">:</span> <span class="nb">String</span><span class="p">,</span>
<span class="n">ret</span><span class="p">:</span> <span class="nf">Fn</span><span class="p">(</span><span class="n">Result</span><span class="o"><</span><span class="p">(),</span> <span class="p">()</span><span class="o">></span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>
<p>And the route_msg for that variant would look like this:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">match</span> <span class="n">msg</span> <span class="p">{</span>
<span class="nn">PrintLogger</span><span class="p">::</span><span class="n">TryLog</span> <span class="p">{</span><span class="n">data</span><span class="p">:</span> <span class="n">data</span><span class="p">,</span> <span class="n">ret</span><span class="p">:</span> <span class="n">ret</span> <span class="p">}</span> <span class="k">=></span> <span class="nf">ret</span><span class="p">(</span><span class="k">self</span><span class="nf">.try_log</span><span class="p">(</span><span class="n">data</span><span class="p">))</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The return value of try_log will be handed to the closure. You can then send the
value to any actors within the scope:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">let</span> <span class="n">error_handler</span> <span class="o">=</span> <span class="nn">ErrorHandlerActor</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="n">handle</span><span class="p">,</span> <span class="n">err_handler</span><span class="p">);</span>
<span class="k">let</span> <span class="n">print_logger</span> <span class="o">=</span> <span class="nn">PrintLoggerActor</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="n">handle</span><span class="p">,</span> <span class="n">logger</span><span class="p">);</span>
<span class="n">print_logger</span><span class="nf">.try_log</span><span class="p">(</span><span class="s">"logline"</span><span class="nf">.to_owned</span><span class="p">(),</span> <span class="p">|</span><span class="n">res</span><span class="p">|</span> <span class="p">{</span>
<span class="k">if</span> <span class="k">let</span> <span class="nf">Err</span><span class="p">(</span><span class="n">e</span><span class="p">)</span> <span class="o">=</span> <span class="n">res</span> <span class="p">{</span>
<span class="n">error_handler</span><span class="nf">.handle</span><span class="p">(</span><span class="n">e</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">})</span>
</code></pre></div></div>
<p>The above code is fairly representative of what I’m hoping to generate.</p>
<h1 id="future-trait-actors">Future: Trait Actors</h1>
<p>Another future improvement will be Trait Actors. In the same way that derive_actor
provides an Actor interface for a structure, it can provide one for a Trait.</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">pub</span> <span class="k">trait</span> <span class="n">Logger</span> <span class="p">{</span>
<span class="k">fn</span> <span class="nf">info</span><span class="p">(</span><span class="o">&</span><span class="k">self</span><span class="p">,</span> <span class="n">data</span><span class="p">:</span> <span class="nb">String</span><span class="p">);</span>
<span class="k">fn</span> <span class="nf">error</span><span class="p">(</span><span class="o">&</span><span class="k">self</span><span class="p">,</span> <span class="n">data</span><span class="p">:</span> <span class="nb">String</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>This should generate an ActorLogger that will route messages to an underlying
Logger trait object.</p>
Threat Modeling Firefox2017-03-19T00:00:00+00:00http://insanitybit.github.io/2017/03/19/threat-model-firefox
<p>(Don’t want to read? Jump to the end - there’s a tldr)</p>
<p>Recently a <a href="https://palant.de/2018/03/10/master-password-in-firefox-or-thunderbird-do-not-bother">post</a> was written about Firefox’s local password database encryption.</p>
<p>The post is appropriately titled <code class="language-plaintext highlighter-rouge">Master password in Firefox or Thunderbird? Do not bother!</code>, which I’ll certainly be
getting back to soon.</p>
<p>The tl;dr is that Firefox will SHA1 your master password and use that to encrypt your local database. This
is purported to be bad because a single round of SHA1 is extremely fast to calculate so weak passwords
will be trivial to bruteforce.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>converts a password into an encryption key by means of applying SHA-1 hashing to a string consisting
of a random salt and your actual master password.
Anybody who ever designed a login function on a website will likely see the red flag here.
[..]
The problem here is: GPUs are extremely good at calculating SHA-1 hashes.
</code></pre></div></div>
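<p>To make the quoted concern concrete, here is a minimal sketch of what “SHA-1 over a salt plus the master password” means for someone guessing offline. The function names are mine, and PBKDF2 is only a stand-in for “something slow”; this is not Firefox’s actual code.</p>

```python
import hashlib
from typing import Iterable, Optional


def sha1_key(salt: bytes, master_password: str) -> bytes:
    """Single SHA-1 pass over salt + password, as the quoted post describes."""
    return hashlib.sha1(salt + master_password.encode("utf-8")).digest()


def stretched_key(salt: bytes, master_password: str, rounds: int = 100000) -> bytes:
    """Deliberately slow derivation (PBKDF2-HMAC-SHA256), shown only for contrast."""
    return hashlib.pbkdf2_hmac("sha256", master_password.encode("utf-8"), salt, rounds)


def guess_password(salt: bytes, target_key: bytes,
                   candidates: Iterable[str]) -> Optional[str]:
    """Offline guessing: one cheap hash per candidate, compared to the stolen key."""
    for candidate in candidates:
        if sha1_key(salt, candidate) == target_key:
            return candidate
    return None


if __name__ == "__main__":
    salt = bytes(16)  # fixed salt so the demo is repeatable
    stolen = sha1_key(salt, "hunter2")
    print(guess_password(salt, stolen, ["password", "letmein", "hunter2"]))
```

<p>Each guess costs the attacker one hash. Swapping in something like <code class="language-plaintext highlighter-rouge">stretched_key</code> multiplies that cost by the round count, which is the whole argument for a slow hash.</p>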
<p>There is an interesting point made here - we certainly know that using a single round of SHA1
for web auth is <em>bad</em>. But can we then assume that the same thing applies to this situation?</p>
<p>Fundamentally what the author is doing here is applying a threat model for websites to a local
database. There actually is no other evidence that what Firefox is doing is bad, just some numbers
about bruteforcing - really, the argument hinges on the threat model.</p>
<p>In my opinion this is simply a case of poor threat modeling.</p>
<h3 id="lets-do-some-threat-modeling">Let’s Do Some Threat Modeling!</h3>
<p>Let’s take a step back and do a few exercises. Let’s imagine we are in charge of keeping Firefox
users safe, we are <em>Team MozSec</em>; so who are we worried about?</p>
<ul>
<li>Attackers who control malicious webpages</li>
<li>Attackers who control malicious 3rd party webpages</li>
<li>Attackers who abuse our various services (sync?)</li>
</ul>
<p>If I were in such a position those would be my really big ones. Only the first two really apply
to this situation.</p>
<p>Now let’s ask what exploitation of this ‘vulnerability’, weak hashing, would take.</p>
<p>The attacker:</p>
<ul>
<li>Must be able to exfiltrate the encrypted database file</li>
<li>Can’t just access Firefox’s memory to dump the passwords</li>
<li>Can’t just keylog the user/ Firefox</li>
<li>Is not interested in higher value data, such as session tokens</li>
</ul>
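<p>The four conditions above can be read as a single predicate. This is just a sketch to make the reasoning explicit; the capability flags and the example attackers are hypothetical names I made up, not anything from Mozilla.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attacker:
    """Hypothetical capability flags, named for this example only."""
    can_read_password_db: bool
    can_read_firefox_memory: bool
    can_keylog: bool
    wants_session_tokens: bool


def weak_hash_matters(attacker: Attacker) -> bool:
    """True only when all four conditions from the list hold at once."""
    return (
        attacker.can_read_password_db
        and not attacker.can_read_firefox_memory
        and not attacker.can_keylog
        and not attacker.wants_session_tokens
    )


# A malicious webpage cannot read the database file at all.
malicious_page = Attacker(False, False, False, True)
# Local malware can read the file, but it has easier options than bruteforcing.
local_malware = Attacker(True, True, True, True)
# The contrived case: a badly sandboxed app that can only read the file.
leaky_sandboxed_app = Attacker(True, False, False, False)
```

<p>Only the contrived third attacker makes the predicate true, which is the point of the exercise.</p>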
<p>Let’s take our threat model - attacker controls malicious website - and try to see if we can
come up with an attacker…</p>
<p>Can you? Because I can’t. An attacker with such insane limitations and requirements simply doesn’t
exist in my mind. Perhaps an attacker who compromises a terribly configured sandboxed application
that somehow has read access to the encrypted database file…. I’m not too worried about that.</p>
<p>Within Mozilla’s threat model, as I would design it, there is no reasonable attacker we can imagine.</p>
<p>What would an attacker who <em>can</em> exploit this vulnerability look like? I’d say there are two:</p>
<ul>
<li>An attacker has execution on the local system</li>
<li>An attacker has unlimited offline access to the system</li>
</ul>
<p>So now we’ve got two attackers who are outside of our threat model, but let’s think it through.</p>
<h3 id="local-execution">Local Execution</h3>
<p>OK, so we have an attacker who can locally execute. They see the encrypted file is using
super-hash, which uses a trillion rounds of argon2, or whatever. “Well, screw that” our
attacker says - they know better than to attempt bruteforcing.</p>
<p>What does the attacker do? Well, they have many, many options.</p>
<p>The most likely attacker (who has the mission of compromising your websites) will just grab
existing Firefox session tokens. These should exist offline as well as online (because you
can close the browser, open it, and still be logged in, right?) so they grab those. They
use them on your local system to perform requests to websites, get the info they need, etc.</p>
<p>OK, well that was easy. Let’s imagine you clear all of your sessions because you’re nuts.</p>
<p>“Huh, ok” our attacker says. They sit around, wait for you to open Firefox, and dump the
passwords straight from memory. Or just hijack the live sessions.</p>
<p>I am unconvinced, given this straw man scenario I’ve whipped up, that super-hash is making
me safer.</p>
<h3 id="unlimited-physical-access">Unlimited Physical Access</h3>
<p>You’re fast asleep - computer turned off. An attacker takes your laptop.</p>
<p>They scrape the database off of the hard drive. “Damn” the attacker says - “super-hash!”.</p>
<p>So they grab all of your other files and data, sessions, cookies, extension info, etc and leave.</p>
<p>Well, that sucks, but it is in fact true that super-hash saved the password database here.</p>
<p>We’re well outside of the reasonable threat model I’ve set out, but super-hash had some impact.</p>
<p>Let’s further imagine that now local attackers, such as the one described here, <em>are</em> in our threat
model. Is ‘encrypt this one set of data’ a good mitigation? Are users really safer?</p>
<p>Really, local attackers with this sort of access are more reasonably in the threat model of your
operating system and hardware manufacturer. And they have a great solution for you! It’s called
Full Disk Encryption and you probably have it enabled if you’ve bought your computer in the last
two years - both Windows and OSX enable this by default.</p>
<h3 id="so-why-encrypt-at-all">So Why Encrypt At All?</h3>
<p><code class="language-plaintext highlighter-rouge">Master password in Firefox or Thunderbird? Do not bother!</code></p>
<p>This is the title of the post I’m responding to, and I really think it’s the best part.</p>
<p>Why bother at all indeed - we’ve shown that encryption of the password db really isn’t
meaningful at all, and that the issue is better solved elsewhere.</p>
<p>Turn on the encryption if you like, or don’t; I’d say it makes absolutely no difference,
regardless of your hashing algorithm.</p>
<h3 id="anyways">Anyways</h3>
<p>Would super-hash, or bcrypt, or argon2, or anything else make users safer? Is it
raising the bar for the attackers? Is it a problem best solved by Firefox? Is it worth the
issue of users believing they are safer than they are?</p>
<p>In my opinion the answer to all of these questions is a clear “no”.</p>
<p>Threat modeling is an extremely important part of security - all good mitigation techniques
start with a threat model. Otherwise what we end up with are a bunch of trivial-to-bypass
half measures, a poor understanding of where we’re weak, and users who are confused about
how safe they really are.</p>
<p>I wrote this up in a few minutes via github - apologies for what I’m sure is terrible
formatting on a rushed post. I want to be clear that what I’ve done here lacks rigor - this is
more an off the cuff way to do a risk assessment. This is not really a great representation of
threat modeling (and actually conflates some aspects of risk assessment with having a threat model),
but I hope it makes it clear that asking these sorts of questions is important.</p>
<h3 id="tldr">TL;DR</h3>
<ul>
<li>Comparing local password storage to web-auth is erroneous; they are entirely different worlds.</li>
<li>Comparing local password storage to password managers that sync is erroneous.</li>
<li>Bcrypt would make no difference for any attacker reasonably in Firefox’s threat model.</li>
<li>People are convinced that Bcrypt would make them safer, which is why implementing it would be irresponsible.</li>
<li>If you are concerned about offline attackers, or attackers running as separate users, consult your OS and hardware vendors,
whose threat models certainly include that attacker.</li>
</ul>
Golang and Rustlang Memory Safety2016-12-28T00:00:00+00:00http://insanitybit.github.io/2016/12/28/golang-and-rustlang-memory-safety
<p>I recently read an excellent blog post by Scott Piper about a tool he has released
called <a href="https://t.co/RJFLf2IemW">Serene</a>. The tool analyzes a binary to see if it
has been compiled with security mitigation techniques - essentially a sanity check
for best practices. As I was reading the post I came across this quote:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Anything compiled with Golang will not have ASLR/PIE. This is a decision by the language creators as Golang is a secure language, but if the process imports a C library, it exposes itself to possible issues. As such, I didn't want to skip Golang binaries.
</code></pre></div></div>
<p>I was pretty shocked - this seems like a huge oversight. Scott referred me to
a quote by the author (and a quick rundown of Golang security that he’d written
about) <a href="http://0xdabbad00.com/2015/04/12/looking_for_security_trouble_spots_in_go_code/#unfixed-issues">here</a>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"Address space randomization is an OS-level workaround for a language-level problem, namely that simple C programs tend to be full of exploitable buffer overflows. Go fixes this at the language level, with bounds-checked arrays and slices and no dangling pointers, which makes the OS-level workaround much less important. In return, we receive the incredible debuggability of deterministic address space layout. I would not give that up lightly."
</code></pre></div></div>
<p>EDIT: Note that Golang does in fact support ASLR/ PIE on Linux, though it is not
enabled by default.
See this <a href="https://github.com/golang/go/blob/master/src/cmd/go/build.go#L385-L386">snippet</a></p>
<p>Thanks to Shawn Webb (@lattera) for pointing this out.</p>
<p>Essentially, the argument is that because Golang is “memory safe”, there is no need for a defense in
depth approach involving mitigation techniques. The cost cited for ASLR is the loss of
the debuggability that a deterministic address space layout provides.</p>
<p>But is Go even memory safe? That’s a bit of a sticky definition. Go-the-language
is memory safe… I guess. But Go programs compiled with the standard compiler are
not. Go has data races, a design choice made for performance reasons. This means
that you can write code with security vulnerabilities in Go.</p>
<p>What’s worse is that the decision to exclude ASLR has doomed these vulnerabilities
to be much more easily exploitable.</p>
<p>A great blog post by stalkr shows some proof of concepts for exploitable Go code
<a href="http://blog.stalkr.net/2015/04/golang-data-races-to-break-memory-safety.html">here</a>.</p>
<p>As stalkr mentions, he doesn’t know of a situation where such a vulnerability exists
in the wild.</p>
<p>However, it is more likely that the lack of ASLR in Golang binaries will make exploiting C/C++ code,
loaded dynamically via CGO, easier. The improved debuggability hardly seems worth
it to me - one could simply disable ASLR for debug builds, though I have rarely
heard complaints about ASLR for debugging.</p>
<p>Beyond that, I would expect more soundness issues and vulnerabilities to exist in
Go code than we currently know about, which is exactly why defense in depth is so
important. And, of course, one can drop into unsafe in Golang as well.</p>
<p>Rust, thankfully, takes what I would consider a much saner approach. While it still
maintains a similar “The language is safe even if the implementation isn’t” attitude,
it also makes use of defense in depth measures.</p>
<p>While looking into the <a href="https://underhanded.rs/blog/2016/12/15/underhanded-rust.en-US.html">Underhanded Rust Contest</a> I had a look at the
current soundness issues in rust. Unlike Go, these were very easy to find and
they were all nicely labeled as soundness issues.</p>
<p>Within a few minutes I had found a few solid candidates for exploitability, and I
narrowed it down to one that seemed particularly inconspicuous.</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="c">// I modified the code a bit while playing around for the contest.</span>
<span class="c">// Issue with original code here: https://github.com/rust-lang/rust/issues/29723</span>
<span class="k">fn</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
<span class="k">let</span> <span class="n">foo</span> <span class="o">=</span> <span class="nn">String</span><span class="p">::</span><span class="nf">from</span><span class="p">(</span><span class="s">"FOO"</span><span class="p">);</span>
<span class="k">let</span> <span class="n">foo</span> <span class="o">=</span> <span class="k">match</span> <span class="mi">0</span> <span class="p">{</span>
<span class="mi">0</span> <span class="k">if</span> <span class="p">{</span>
<span class="nf">some_func</span><span class="p">(</span><span class="n">foo</span><span class="p">)</span> <span class="c">// foo is freed here</span>
<span class="p">}</span> <span class="k">=></span> <span class="nd">unreachable!</span><span class="p">(),</span>
<span class="mi">_</span> <span class="k">=></span> <span class="p">{</span>
<span class="c">// Use After Free - we return freed memory</span>
<span class="n">foo</span>
<span class="p">}</span>
<span class="p">};</span>
<span class="nd">println!</span><span class="p">(</span><span class="s">"{:#?}"</span><span class="p">,</span> <span class="n">foo</span><span class="p">);</span> <span class="c">// And here we access the invalid memory</span>
<span class="p">}</span>
<span class="k">fn</span> <span class="nf">some_func</span><span class="p">(</span><span class="n">foo</span><span class="p">:</span> <span class="nb">String</span><span class="p">)</span> <span class="k">-></span> <span class="nb">bool</span> <span class="p">{</span>
<span class="k">drop</span><span class="p">(</span><span class="n">foo</span><span class="p">);</span>
<span class="k">false</span>
<span class="p">}</span>
</code></pre></div></div>
<p>What we have here is a use after free vulnerability. This will print garbage,
or panic. This is an issue with how rust’s current borrow semantics work with
match statements.</p>
<p>The vulnerability is here:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="mi">0</span> <span class="k">if</span> <span class="p">{</span>
<span class="nf">some_func</span><span class="p">(</span><span class="n">foo</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Effectively, this branch succeeds if some_func returns true. It does not, so the
branch does not succeed. However, ‘foo’ is freed in some_func, leaving it invalid.</p>
<p>Despite that, we can use the value in the other branch, returning it, and then
accessing it later.</p>
<p>This may seem a bit contrived, and I know of no place where this code exists, but
I thought it was an ideal candidate for the underhanded contest because the ‘free’
is hidden elsewhere and it may not be obvious.</p>
<p>Of course, I then realized that rustc compiles rust programs with ‘the works’. Any
rust program has full RELRO, NX, ASLR/ PIE, and (I believe) safestack.</p>
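<p>If you want to sanity check the PIE part of that claim for a given ELF binary, a rough sketch is to read the <code class="language-plaintext highlighter-rouge">e_type</code> field of the ELF header. The helper names below are my own, and this only looks at PIE; it says nothing about RELRO, NX or safestack.</p>

```python
import struct

ET_EXEC = 2  # fixed-address executable
ET_DYN = 3   # shared object; PIE executables are also marked with this type


def elf_type(header: bytes) -> int:
    """Return e_type from the first 18 bytes of an ELF file."""
    if len(header) < 18 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF header")
    # EI_DATA (byte 5 of e_ident): 1 = little-endian, 2 = big-endian
    byte_order = "<" if header[5] == 1 else ">"
    # e_type is a 16-bit field right after the 16-byte e_ident array
    (e_type,) = struct.unpack_from(byte_order + "H", header, 16)
    return e_type


def looks_position_independent(header: bytes) -> bool:
    """ET_DYN means the image can be loaded at a randomized base address."""
    return elf_type(header) == ET_DYN


def header_of(path: str) -> bytes:
    """Read just enough of a file on disk to pass to the functions above."""
    with open(path, "rb") as f:
        return f.read(18)
```

<p>Something like <code class="language-plaintext highlighter-rouge">looks_position_independent(header_of("./target/release/myprog"))</code> would be the usage, where the path is a placeholder. Shared libraries are <code class="language-plaintext highlighter-rouge">ET_DYN</code> too, so treat this as a quick check rather than a full audit.</p>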
<p>These mitigations would make it considerably more difficult to make a reliable
exploit against this or other vulnerabilities in rust code. For the contest, if
I get around to writing an exploit for this, I will definitely not bother trying
to get around them and instead I’ll take a hit to points and disable ASLR.</p>
<p>This defense in depth attitude is, in my opinion, exactly the right way to go. Rust
programs don’t have to rely entirely on the memory safety guarantees of the language,
which is critical since rust allows explicitly unsafe code, FFI, etc.</p>
<p>Security is about so much more than language level security, or even memory safety.
It is fundamentally an ongoing process with moving goals. Saying “Well, we’ve solved
those problems in one place, so we can stop there” is a dangerous attitude.</p>
<p>This isn’t to say that Go is better or worse than Rust. This isn’t to say that either
language is unlikely or likely to have vulnerable code out in the wild. I think it
is much more about how the two languages approach security.</p>
Using Types For Better Python2016-07-30T00:00:00+00:00http://insanitybit.github.io/2016/07/30/using-types-for-better-python
<p>I’ve written Python for some time now and it’s been a go-to for quick scripts
on many occasions. It’s not particularly fast, and I’m not a fan of Python for
large projects, but it is definitely easy to use. There’s a massive ecosystem for
libraries, an easy to use std lib, a REPL, and other attributes that make it
simple to jump into quickly.</p>
<p>The reason I prefer to not write large projects in Python is because it is
prohibitively difficult for me to reason about Python code. Shared mutable state
and, in particular, dynamic types, are a burden to me - I can not quite know what
a function really does, what a variable really is, or how the program will behave.</p>
<p>While I can test the code, and I can be reasonably sure that it works in some
way, the true cost here is in development. As a code base grows I have to increasingly
rely on the code being documented well to be able to jump in and start using it
or making changes. Types, for me, are like free documentation that alleviates
the need for boilerplate tests and comments.</p>
<p>And so, naturally, I’ve been eager to try out MyPy - a static type checker for
Python that features a familiar type system and type inference. If you’re coming
from a language with types it should be fairly quick to get started.</p>
<p>Essentially, through some type annotations in your Python code, you can use MyPy
for static analysis.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Optional</span>

<span class="k">def</span> <span class="nf">typed</span><span class="p">(</span><span class="n">x</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-></span> <span class="n">Optional</span><span class="p">[</span><span class="nb">str</span><span class="p">]:</span>
    <span class="k">if</span> <span class="n">x</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
        <span class="k">return</span> <span class="s">"Zero"</span>
    <span class="k">elif</span> <span class="n">x</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span>
        <span class="k">return</span> <span class="s">"One"</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">return</span> <span class="bp">None</span>
</code></pre></div></div>
<p>The above function takes an integer and returns the union of a ‘str’ and None.</p>
<p>To get started I needed to install the magicpython tool for Atom, since the
default language syntax highlighting breaks with the Python 3.5+ type annotation syntax.</p>
<p><a href="https://atom.io/packages/magicpython">You can find magicpython here</a></p>
<p>I then installed the mypy-lang package through pip:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip install --user mypy-lang
</code></pre></div></div>
<p>And looked online for any tidbits on how to use it.</p>
<p>It turns out that mypy has solid documentation, and it was very easy to find a few
extra, interesting settings.</p>
<h1 id="the-optional-type">The Optional Type</h1>
<p>In languages like Python there exists a ‘null’ type. In Go this is ‘nil’, Python
has None, Java has ‘null’, the name changes but the meaning is generally the same.</p>
<p>A variable is implicitly always its type, or null. So while Java may tell you
that you’re dealing with a String, you may not be - and the same goes for Python.</p>
<p>MyPy has a way of dealing with this - the Optional type. This is not much different
from how many languages deal with nullable types. Instead of having the None type
implicitly in every other type, we make it explicit.</p>
<p>MyPy will then try to ensure that you properly check this Optional[T] type before
you try to treat it like its concrete variants - T or None.</p>
<p><a href="http://mypy.readthedocs.io/en/latest/kinds_of_types.html#strict-optional">From the MyPy documentation</a>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Optional</span>
<span class="k">def</span> <span class="nf">f</span><span class="p">(</span><span class="n">x</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">int</span><span class="p">])</span> <span class="o">-></span> <span class="nb">int</span><span class="p">:</span>
<span class="k">if</span> <span class="n">x</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
<span class="k">return</span> <span class="mi">0</span>
<span class="k">else</span><span class="p">:</span>
<span class="c1"># The inferred type of x is just int here.
</span> <span class="k">return</span> <span class="n">x</span> <span class="o">+</span> <span class="mi">1</span>
</code></pre></div></div>
<p>If you do not perform the check against None before accessing the Optional value
you will encounter an error before the program runs.</p>
<p>To enable this checking just add --strict-optional to your mypy command.</p>
<p>Note that this is an experimental flag and there exist loopholes.</p>
<h1 id="silent-imports">Silent Imports</h1>
<p>Unfortunately, much of the code you may encounter will not include types. This is
something I’ll be talking about a bit later in this post but essentially you have
two options:</p>
<ul>
<li>Write typed wrappers for the untyped code</li>
</ul>
<p>Essentially if you know that a function takes T and returns U, write a wrapper
function that provides the annotations.</p>
<ul>
<li>Use --silent-imports</li>
</ul>
<p>The silent imports flag will disable automatic type checking of imported modules.</p>
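<p>The first option, a typed wrapper, might look something like this. <code class="language-plaintext highlighter-rouge">legacy_parse_port</code> is a made-up stand-in for whatever untyped third party function you are dealing with.</p>

```python
def legacy_parse_port(value):
    # Stand-in for third-party code that ships without annotations.
    return int(value)


def parse_port(value: str) -> int:
    """Typed wrapper: callers see str -> int instead of an untyped function."""
    result = legacy_parse_port(value)
    # The untyped function could return anything, so check before promising an int.
    if not isinstance(result, int):
        raise TypeError("legacy_parse_port returned a non-int")
    return result
```

<p>The rest of the code base then only ever calls <code class="language-plaintext highlighter-rouge">parse_port</code>, so the untyped surface is confined to one place.</p>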
<h1 id="disallow-untyped-calls-and-defs">Disallow Untyped Calls and Defs</h1>
<p>From the MyPy blog:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>--disallow-untyped-defs generates errors for functions without type annotations.
Consider using this if you tend to forget to annotate some functions.
--disallow-untyped-calls causes mypy to complain about calls to untyped functions.
This is a boon for static typing purists, together with --disallow-untyped-defs :-)
</code></pre></div></div>
<p>Both of these flags are very useful for larger codebases where multiple developers
may be working on the code. It’s a strict enforcement of writing typed code.</p>
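<p>As a small illustration of what those two flags are aimed at (the function names here are invented for the example):</p>

```python
def shout(message):
    # No annotations: the kind of definition --disallow-untyped-defs reports.
    return message.upper()


def shout_typed(message: str) -> str:
    # Fully annotated, so --disallow-untyped-defs has nothing to say.
    return message.upper()


def banner() -> str:
    # Calling the untyped shout() from typed code is what
    # --disallow-untyped-calls is meant to flag.
    return shout("ready")
```

<p>The code runs the same either way; the flags only change what the checker complains about.</p>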
<p>In the end I wrote an alias:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>alias smypy='python3.5 -m mypy --strict-optional --disallow-untyped-defs --disallow-untyped-calls\
--check-untyped-defs'
</code></pre></div></div>
<p>To analyze a program with these options I just need to run</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>smypy ./prog.py
</code></pre></div></div>
<p>Here’s a simple program that demonstrates how to use type annotations:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">random</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Iterable</span><span class="p">,</span> <span class="n">Optional</span>

<span class="k">def</span> <span class="nf">get_rand_list</span><span class="p">(</span><span class="n">length</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-></span> <span class="n">Iterable</span><span class="p">[</span><span class="nb">int</span><span class="p">]:</span>
    <span class="k">return</span> <span class="p">[</span><span class="nb">int</span><span class="p">(</span><span class="mi">10</span><span class="o">*</span><span class="n">random</span><span class="p">.</span><span class="n">random</span><span class="p">())</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">length</span><span class="p">)]</span>

<span class="k">def</span> <span class="nf">index_of</span><span class="p">(</span><span class="n">li</span><span class="p">:</span> <span class="n">Iterable</span><span class="p">[</span><span class="nb">int</span><span class="p">],</span> <span class="n">val</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-></span> <span class="n">Optional</span><span class="p">[</span><span class="nb">int</span><span class="p">]:</span>
    <span class="k">for</span> <span class="n">index</span><span class="p">,</span> <span class="n">item</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">li</span><span class="p">):</span>
        <span class="k">if</span> <span class="n">item</span> <span class="o">==</span> <span class="n">val</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">index</span>
    <span class="k">return</span> <span class="bp">None</span>

<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">'__main__'</span><span class="p">:</span>
    <span class="n">r_list</span> <span class="o">=</span> <span class="n">get_rand_list</span><span class="p">(</span><span class="mi">100</span><span class="p">)</span> <span class="c1"># Type is inferred to be Iterable[int]
</span>
    <span class="n">opt_ix</span> <span class="o">=</span> <span class="n">index_of</span><span class="p">(</span><span class="n">r_list</span><span class="p">,</span> <span class="mi">19</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="nb">hex</span><span class="p">(</span><span class="n">opt_ix</span><span class="p">))</span>
</code></pre></div></div>
<p>A list of random numbers between 0 and 9 is generated.</p>
<p>We get the index of a value in that list.</p>
<p>We convert that value into hex and print it out.</p>
<p>But it will crash on hex(opt_ix) because 19 is never going to be in the list,
and applying hex to None raises an exception. With MyPy this code does not pass
the type checker:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>example.py:23: error: Argument 1 to "hex" has incompatible type "Optional[int]"; expected "int"
</code></pre></div></div>
<p>Instead we have to change the code to look like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">if</span> <span class="n">opt_ix</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="nb">hex</span><span class="p">(</span><span class="n">opt_ix</span><span class="p">))</span>
</code></pre></div></div>
<p>MyPy doesn’t complain and the program will no longer crash.</p>
<p>It’s a trivial example but in a large code base it can be easy, especially when
working with third party libraries, to not realize that a return value may be of
multiple types.</p>
<p>MyPy, to me, eases this burden significantly.</p>
<p>That said, it’s not perfect. Type annotations provide some nice linting and
documentation but they can’t actually impact the code that runs. Many languages
with static type systems are able to do far more with their types, leading to
faster, safer code. With Python we can only scratch the surface - but even with
these annotations not being as powerful as other type systems I think they’re well
worth implementing in your code.</p>
<p>Note that Python type annotations are cross-version. There is a compatible syntax
for Python 2.x+, using comment blocks.</p>
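<p>For reference, the comment-based form looks roughly like this. It is the earlier <code class="language-plaintext highlighter-rouge">typed</code> function rewritten with a PEP 484 type comment; on Python 2 I am assuming the <code class="language-plaintext highlighter-rouge">typing</code> backport is installed so that the import works.</p>

```python
from typing import Optional


def typed(x):
    # type: (int) -> Optional[str]
    # The signature lives in the comment above, so the def itself
    # stays valid syntax for interpreters without annotation support.
    if x == 0:
        return "Zero"
    elif x == 1:
        return "One"
    return None
```

<p>The interpreter ignores the comment entirely; only the type checker reads it.</p>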