
feat: batch audit logs and ensure they are not lost #1080

Open
de-sh wants to merge 13 commits into main

Conversation

@de-sh (Contributor) commented Jan 7, 2025

Fixes #XXXX.

Description

Collects audit logs into batches and pushes them as configured, based on time/size thresholds, ensuring logs are persisted to a file when the network is down. NOTE: this works in a near-synchronous manner, where logs will not be accepted if the write to disk is blocked or if the direct push over the network is blocked. This may lead to a situation where requests do not respond promptly as they try to send audit logs, but this is a known cost.
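For readers unfamiliar with the flow, here is a rough sketch of the batch-and-flush strategy described above. All names (`BATCH_SIZE`, `FLUSH_INTERVAL`, `push_to_endpoint`, `persist_to_disk`) are illustrative assumptions, not the PR's actual identifiers; the real logic lives in src/audit/logger.rs and reads its thresholds from configuration.

```rust
use std::time::Duration;

use serde_json::Value;
use tokio::{select, sync::mpsc::Receiver, time::interval};

// Hypothetical thresholds; the real values come from CLI/env configuration.
const BATCH_SIZE: usize = 100;
const FLUSH_INTERVAL: Duration = Duration::from_secs(5);

async fn audit_log_loop(mut audit_log_rx: Receiver<Value>) {
    let mut batch: Vec<Value> = Vec::with_capacity(BATCH_SIZE);
    let mut ticker = interval(FLUSH_INTERVAL);

    loop {
        select! {
            // Size threshold: flush as soon as the batch is full.
            Some(log) = audit_log_rx.recv() => {
                batch.push(log);
                if batch.len() >= BATCH_SIZE {
                    flush(&mut batch).await;
                }
            }
            // Time threshold: flush whatever has accumulated when the timer fires.
            _ = ticker.tick() => {
                if !batch.is_empty() {
                    flush(&mut batch).await;
                }
            }
        }
    }
}

async fn flush(batch: &mut Vec<Value>) {
    // Try the configured log_endpoint first; if the push fails because the
    // network is down or slow, persist the batch to a local file so it is not lost.
    if push_to_endpoint(batch).await.is_err() {
        persist_to_disk(batch).await;
    }
    batch.clear();
}

async fn push_to_endpoint(_batch: &[Value]) -> Result<(), std::io::Error> {
    // placeholder for the HTTP POST to the remote Parseable instance
    Ok(())
}

async fn persist_to_disk(_batch: &[Value]) {
    // placeholder for appending the serialized batch to the on-disk audit log file
}
```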


This PR has:

  • been tested to ensure log ingestion and log query works.
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added documentation for new or modified features or behaviors.

@coveralls commented Jan 7, 2025

Pull Request Test Coverage Report for Build 12675531641

Details

  • 0 of 197 (0.0%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.1%) to 12.286%

Changes Missing Coverage    Covered Lines    Changed/Added Lines    %
src/audit/builder.rs        0                7                      0.0%
src/audit/types.rs          0                7                      0.0%
src/cli.rs                  0                41                     0.0%
src/audit/logger.rs         0                142                    0.0%

Totals
Change from base Build 12672350759: -0.1%
Covered Lines: 2415
Relevant Lines: 19657

💛 - Coveralls

@hippalus left a comment

LGTM! The implementation looks solid, but I have some questions about the architectural approach:

Audit Log Architecture:

  1. What drove the decision to use another Parseable instance for audit logs rather than local storage?
  2. Do we have plans for a dedicated audit logging solution in the future?

Deployment Considerations:
For multi-node setups (e.g., 3 ingest + 1 query nodes):

  1. What's the recommended log_endpoint configuration?
  2. Which node should own audit log collection?
  3. How do we prevent circular logging dependencies?

Performance & Scalability:

  1. In high-traffic scenarios, how do we handle the additional network load from audit logging?
  2. Have we considered the impact of inter-service dependencies on system reliability?

Would appreciate your insights on these points to better understand the architectural vision.

src/audit/logger.rs
.open(log_file_path)
.await
.expect("Failed to open audit log file");
let buf = serde_json::to_vec(&logs_to_send).expect("Failed to serialize audit logs");

It may look like this can never happen, but unwrap() can cause a panic in production. It would be nice to have proper error handling with context.

@de-sh (Contributor, Author) replied:

This should not be an issue in most places, as serialization is not a problem for these types. But we still use expects to document why it is safe to assume they won't fail.
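As a side note for this thread, here is one possible shape for the error handling with context that the reviewer suggests, assuming the caller can propagate a Result. Function and variable names are illustrative only, not the PR's actual code:

```rust
use std::error::Error;

use tokio::{fs::OpenOptions, io::AsyncWriteExt};

// Append the serialized batch to the audit log file, returning a contextual
// error instead of panicking when the file or serialization fails.
async fn persist_batch(
    log_file_path: &std::path::Path,
    logs_to_send: &[serde_json::Value],
) -> Result<(), Box<dyn Error + Send + Sync>> {
    let mut file = OpenOptions::new()
        .create(true)
        .append(true)
        .open(log_file_path)
        .await
        .map_err(|e| format!("failed to open audit log file {log_file_path:?}: {e}"))?;

    let buf = serde_json::to_vec(logs_to_send)
        .map_err(|e| format!("failed to serialize audit logs: {e}"))?;

    file.write_all(&buf)
        .await
        .map_err(|e| format!("failed to write audit logs to disk: {e}"))?;

    Ok(())
}
```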

src/audit/logger.rs (outdated)
src/audit/logger.rs (outdated)
}

// setup the audit log channel
let (audit_log_tx, mut audit_log_rx) = channel(0);

Why is the initial size 0?

@de-sh (Contributor, Author) commented Jan 9, 2025

The idea was to create a "Glienicke Bridge" so that we block at the request level every time the underlying audit logger is unable to keep up.
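For illustration, a minimal sketch of that backpressure idea using a bounded channel. Note that tokio's mpsc requires a capacity of at least 1, so a true zero-capacity rendezvous channel would need a crate such as flume or async-channel; the names below are illustrative, not the PR's actual code:

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Bounded channel: when it is full, `send(...).await` suspends the caller.
    let (audit_log_tx, mut audit_log_rx) = mpsc::channel::<String>(1);

    // Logger task: the only consumer. If it stalls on a slow disk or network,
    // every request handler awaiting `send` is held back until it catches up.
    tokio::spawn(async move {
        while let Some(log) = audit_log_rx.recv().await {
            // push to the remote endpoint or persist to disk here
            println!("flushed: {log}");
        }
    });

    // Request-handler side: this await completes only once the logger has capacity.
    audit_log_tx
        .send("audit event".to_string())
        .await
        .expect("audit logger task has exited");
}
```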

src/audit/logger.rs (outdated)
@de-sh (Contributor, Author) commented Jan 9, 2025

Thank you for the comments @hippalus. Let me note that audit logging is not to be considered beta yet; the feature is still in development and will need a lot of work before we can consider it ready for production.

With regards to your questions:

LGTM! The implementation looks solid, but I have some questions about the architectural approach:

Audit Log Architecture:

1. What drove the decision to use another Parseable instance for audit logs rather than local storage?

It is an industry practice to send audit logs to an external system; we felt dogfooding would be the lowest-friction option for doing this.

2. Do we have plans for a dedicated audit logging solution in the future?

Parseable already fulfills most of this need, and we don't see any reason to build on anything different. Please let us know of any alternatives; we can document them and plan out how to build support. To be clear, logs should be sent to an instance outside of the cluster.

Deployment Considerations: For multi-node setups (e.g., 3 ingest + 1 query nodes):

1. What's the recommended log_endpoint configuration?

The log_endpoint should be a separate cluster/standalone instance of Parseable.

2. Which node should own audit log collection?

This should not matter in my opinion; in a multi-node setup the expectation is that a load balancer makes that decision, and logs end up in the same stream regardless of which node receives them.

3. How do we prevent circular logging dependencies?

Circular logs are not expected, as pointed out above.

Performance & Scalability:

1. In high-traffic scenarios, how do we handle the additional network load from audit logging?

We are still testing the feature, and based on our learnings so far, batching, writing to file on network slow-down, and falling back to file when the network is down seemed to be the right decisions for handling high-frequency use-cases.

2. Have we considered the impact of inter-service dependencies on system reliability?

What do you mean by "inter-service dependencies"?

Would appreciate your insights on these points to better understand the architectural vision.

@hippalus commented Jan 9, 2025

Thank you for the clarification @de-sh. I highlighted these questions because I couldn't find explicit documentation or configuration descriptions in the code addressing these architectural decisions. The implementation looks quite solid.

I'm particularly looking forward to seeing the Kafka connector integration. Regarding inter-service dependencies, as you've well explained, we can scale Parseable instances behind a load balancer for audit logging purposes. It's great to see this pathway being established.

IMHO: It would be nice to add more documentation around these architectural decisions directly in the code, making it easier for future contributors to understand the design choices. It would also be nice to add a docker-compose example.
