Tracee Roadmap update March 23 #2890

itaysk · 2023-03-21T09:46:14Z

itaysk
Mar 21, 2023
Maintainer

Over the past few months Tracee has been undergoing some fundamental improvements. I wanted to add some context to this, explain the motivation behind it, and lay out the plan for what's to come.

TL;DR: here's the high level plan:

Refining the use case
Unified architecture
User-friendly observability events
Everything is an event
Scopes
Policies and Actions

Use case

Tracee was designed around a use case that looked like this: Tracee detects suspicious or malicious behavior at runtime. You install it on your server or cluster, and when it detects "something bad" it alerts you.

While Tracee did deliver on that premise, there were still some gaps relative to users' expectations:

Tracee is built from a tracing component and a behavioral detection component. Users wanted to leverage tracing capabilities directly to collect raw system events and process them elsewhere (aka "security observability"). This was technically possible to achieve but hidden from the main user experience.
Putting the emphasis on detections (aka behavioral signatures) creates an expectation of complete security content. Tracee does have an impressive and growing library of behavioral security detections, but our focus in open source is building a great engine first, not providing exhaustive security detection coverage.
Detection alerts don't carry sufficient context; the raw events around the incident are needed to complete the picture. Combining detections with related events was tricky in the previous architecture.
Defining the user experience as incident-driven means that most of the time after installing Tracee it stays silent (hopefully there's nothing noteworthy to report), and that experience impedes the onboarding journey and value realization. Ideally users would install Tracee and immediately start getting feedback and value, even before "something bad" happens.

Given the above, we've modified Tracee's primary use case and mission statement: Tracee uses eBPF to tap into your system and expose hundreds of events that help you understand how your system behaves.

Note that the focus on behavioral detections is replaced with access to many events which can be useful in different scenarios. The goal is to give you tools that help you understand how your system behaves, not to do the understanding for you.

Architecture

Tracee was made of two distinct components (binaries) that worked together: tracee-ebpf generated low-level tracing data, which was passed to tracee-rules to find "bad" patterns in it. To support the previously discussed use case, when you "ran Tracee" (i.e., docker run tracee / helm install tracee), it would orchestrate the two components internally so that the output was only security detections.

The motivations for this design were:

tracee-rules might be used independently; for example, if you feed it Kubernetes audit events, you don't need the eBPF baggage.
tracee-ebpf can be used independently for other use cases such as troubleshooting, debugging, forensic investigations, etc.
Support distributed deployment for scalability.
Security-wise, it is safer to separate the component that deals with the kernel (tracee-ebpf) from the component that deals with user-generated content (tracee-rules).

Over time we experienced some hardships in maintaining this architecture:

Orchestrating the two binaries inside the container is not trivial (synchronization, administration).
Passing data across process boundaries created more performance challenges (serialization, communication, buffering).
Leaky abstractions from wrapping two individual binaries under one front end (configuration, metrics, endpoints).

These architectural challenges, added to the previously discussed user experience challenges, led us to move toward a "unified architecture". The new architecture has a single binary: tracee. When executed, it immediately emits events, including simple observability and security insights. It has a straightforward configuration that speaks to the use case described. Internally the code is split into two libraries for code hygiene, but to the user it's just one tool.

User-friendly observability events

Tracee's new experience allows and encourages the user to trace any system activity events (regardless of security insights). You could always trace system calls with Tracee, which is a good starting point, but we believe syscalls are not ideal:

Syscalls are a low-level technical concept in operating system design, and therefore they do not always represent higher-level application interactions well. For example, a network connection involves several interrelated syscalls.
The operating system might provide several syscalls to accomplish the same goal, which makes it harder to cover basic scenarios. For example, opening a file can be done in several ways.
Syscalls are not completely safe for security use cases, as we discussed before.
Syscall arguments are untreated user-defined values, which do not make for actionable information. For example, a path argument might be a file descriptor, a symlink, or a relative path, which makes it unreliable and inconsistent.

For these reasons, Tracee's signature detections never used syscalls, and instead relied on an alternative kprobe-based framework that produced more robust events for security detections. We want not only to expose these previously internal raw events, but to make them a spotlight feature that shapes Tracee's new experience.

To address this, we are introducing a higher-level set of events that represents commonly useful system activity events in a user-friendly, reliable and consistent way. These will address all common observability questions while abstracting away the details. This will include:

Workload lifecycle, for example, process started, container created.
Network activity (including protocol parsing), for example, DNS request.
File activity, for example, file was modified.

These new events will be the foundation for tracing with Tracee, as well as for building security detections on top of them.

Everything is an event

In the previous experience, tracee-ebpf emitted structured tracing data, while tracee-rules emitted semi-structured security alerts. To support the new desired experience we are moving to what we refer to as "everything is an event". With this approach, every kind of information emitted by Tracee will be logged as an event entity that follows a common data model.
This includes syscalls, system observability events, internal security events that previously were not exposed to the user, and most importantly, tracee-rules' behavioral signature detections. All of the above will now follow the same event model and be consumed in the same way. In addition, user-defined custom events follow the same pattern, and forensic capture will too.
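
To make the "common data model" idea concrete, here is a minimal sketch in Go. The Event struct, its field names, and the event names are assumptions made up for illustration; they are not Tracee's actual event schema. The point is only that a raw syscall and a detection share one envelope and go through the same encoding path.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Event is a hypothetical common envelope. The field names are
// illustrative assumptions, not Tracee's real schema.
type Event struct {
	Name     string            `json:"name"`     // e.g. "openat" or a detection name
	Kind     string            `json:"kind"`     // "syscall", "observability", "detection", ...
	Workload string            `json:"workload"` // container or process the event belongs to
	Args     map[string]string `json:"args"`     // event-specific payload
}

// Encode serializes any event the same way, regardless of its kind.
func Encode(e Event) (string, error) {
	b, err := json.Marshal(e)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	events := []Event{
		{Name: "openat", Kind: "syscall", Workload: "web",
			Args: map[string]string{"pathname": "/etc/passwd"}},
		{Name: "suspicious_file_read", Kind: "detection", Workload: "web",
			Args: map[string]string{"trigger": "openat"}},
	}
	// Both kinds are emitted through the same code path.
	for _, e := range events {
		line, err := Encode(e)
		if err != nil {
			fmt.Println("encode error:", err)
			continue
		}
		fmt.Println(line)
	}
}
```

A consumer of such a stream would only need one parser and could tell detections from raw events by a single field.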

To support this notion, the event schema might evolve and the UX options around it might change as we implement this approach.

Scopes

Tracee has a very rich and powerful filtering mechanism that lets you target specific workloads and scenarios, but filters could not be applied granularly to specific workloads or events. For example, you could not ignore some events only for specific workloads, or filter by an event argument that is shared with other events. This kind of requirement is especially important for performance and efficiency optimizations.

As part of the new experience we are also introducing the "multi-scope" architecture for Tracee. This essentially lets you group a set of filters into a "scope", and there can be multiple scopes handled by the same Tracee instance. For example, if I want to ignore some events for some workloads but not others, I could create a scope that applies only to those relevant workloads (by using Tracee's existing powerful filters), and set the selected events for that scope (by using Tracee's event selection). A scope is like a bubble of tracing configuration that gives you fine control over what is being traced, and it is critical for real-world advanced applications of runtime security.
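
As a rough illustration of how several scopes could coexist in one instance, the sketch below models a scope as a workload filter plus an event selection. The Scope type, the set helper, and the container and event names are all hypothetical and are not taken from Tracee's code or configuration.

```go
package main

import "fmt"

// Scope is a hypothetical grouping of a workload filter and an event
// selection. It is an illustration, not Tracee's implementation.
type Scope struct {
	Name       string
	Containers map[string]bool // workload filter: container names in scope
	Events     map[string]bool // events selected for this scope
}

// set builds a lookup set from the given names.
func set(names ...string) map[string]bool {
	m := make(map[string]bool, len(names))
	for _, n := range names {
		m[n] = true
	}
	return m
}

// Matches reports whether the scope selects this event for this container.
func (s Scope) Matches(container, event string) bool {
	return s.Containers[container] && s.Events[event]
}

// MatchingScopes returns the names of all scopes that select the event.
func MatchingScopes(scopes []Scope, container, event string) []string {
	var names []string
	for _, s := range scopes {
		if s.Matches(container, event) {
			names = append(names, s.Name)
		}
	}
	return names
}

func main() {
	scopes := []Scope{
		{Name: "web-network", Containers: set("web"), Events: set("dns_request")},
		{Name: "db-files", Containers: set("db"), Events: set("file_modified", "dns_request")},
	}
	// The same event is kept for one workload and ignored for another.
	fmt.Println(MatchingScopes(scopes, "web", "dns_request"))
	fmt.Println(MatchingScopes(scopes, "web", "file_modified"))
}
```

Here file_modified is dropped for the web container but still collected for db, which is the per-workload granularity described above.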

Policies

While Tracee already has an impressive and expanding feature set, when we considered advanced applications of runtime security in production servers or clusters, we understood that the mechanisms used to employ these features affect the experience as well. The Command Line Interface (CLI) has its place as a basic experience, but for those advanced scenarios we want to provide a more scalable model for thinking about runtime security, one based on declarative policies.

A policy is a declarative configuration that binds a "scope" to "events". The policy tells the story of "for these workloads (scope), collect this information (events), in this way (event filters, outputs)".
In addition, it makes room to define an "action" to take when a policy condition is matched, opening the door to more interesting use cases in runtime security, such as enforcement.
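
The "for these workloads, collect these events, then take this action" story can be sketched as a small data structure. This is a minimal sketch in Go under stated assumptions: the Policy type, its fields, the Actions function, and the action names "log" and "alert" are hypothetical, and the real policy format may look quite different.

```go
package main

import "fmt"

// Policy is a hypothetical declarative rule: for these workloads (scope),
// collect these events, and take this action on a match. It is not
// Tracee's actual policy format.
type Policy struct {
	Name      string
	Workloads []string // scope: which workloads the policy applies to
	Events    []string // events: what to collect for that scope
	Action    string   // action on match, e.g. "log" or "alert" (illustrative)
}

// contains reports whether list holds v.
func contains(list []string, v string) bool {
	for _, item := range list {
		if item == v {
			return true
		}
	}
	return false
}

// Actions returns the action of every policy whose scope and event
// selection both match the observed event, in policy order.
func Actions(policies []Policy, workload, event string) []string {
	var actions []string
	for _, p := range policies {
		if contains(p.Workloads, workload) && contains(p.Events, event) {
			actions = append(actions, p.Action)
		}
	}
	return actions
}

func main() {
	policies := []Policy{
		{Name: "audit-web", Workloads: []string{"web"},
			Events: []string{"dns_request", "file_modified"}, Action: "log"},
		{Name: "protect-db", Workloads: []string{"db"},
			Events: []string{"file_modified"}, Action: "alert"},
	}
	fmt.Println(Actions(policies, "db", "file_modified"))
	fmt.Println(Actions(policies, "db", "dns_request"))
}
```

Because the policy is plain data, it could equally be written in a declarative file and loaded at startup, which is what makes the model a natural fit for declarative deployment tooling.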

Being a declarative model, the policy also ties in well with the Kubernetes deployment story, where we expect policies to take a central role in the UX, perhaps as a Kubernetes CRD.

Summary

We are reimagining Tracee based on your feedback and experience. Here's the high level plan:

Refining the use case
Unified architecture
User-friendly observability events
Everything is an event
Scopes
Policies and Actions

This is a high level roadmap that will be broken into deliverable work items and delivered incrementally over the next releases (some of it has already made it out by now).
We hope this excites you as much as it excites us, and we welcome any feedback, in this discussion or in relevant GitHub issues.
