-
Notifications
You must be signed in to change notification settings - Fork 63
Prototype audit log improvements + coverage test #9467
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| @@ -0,0 +1,123 @@ | |||
| Mutating endpoints without audit logging: | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file would be empty or nearly empty in the complete version of this PR.
| .context | ||
| .external_latencies | ||
| .instrument_dropshot_handler(&rqctx, handler) | ||
| }) |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
test for verifying audit log coverage
c07a3ff to
79c0228
Compare
| nexus.scim_idp_create_token(&opctx, &silo_lookup).await?; | ||
| Ok(HttpResponseCreated(token)) | ||
| }) | ||
| .await |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ergonomics comparison here is.... favorable.
inickles
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of what I have below was discussed offline, but I'll mention here for posterity.
The callsite ergonomics look better, though I'm wondering if what's here will be the full story considering I think we'll want more than a top-level resource_id field for the audit event (which I think it's what's proposed here?). I think there could be a number of things we'd want to include an audit event, and that each endpoint will need to define a what it wants to include in the audit log, which could be a subset of the elements returned in the response body itself. And so I think that means audit_and_time would need more than just a handler, or perhaps the handler types must implement functions that define what that audit log response body would be?
I don't think entire request or response bodies should be automatically added to the audit log, as some data isn't relevant or helpful to an audit log, such as block data, which could blow up the size of the log.
For example, a single API call to create an instance can be used to also create new disk and network interface, and ideally all of those new created resource IDs are in the audit log.
I could see similar for request parameters as well, mostly for the cases where interesting request params are in POST body data and not available in the URL. Though, for the sake of parsing, I think I would prefer to for some of that data to also be available in the audit event body that I could parse with just JSON and not also have to parse a URL.
I think it's common for API logs to do this kind of thing, include a response body element that has response data specific to the endpoint.
| // so names don't actually identify things uniquely the way IDs do. So we may | ||
| // end up needing to record the ID for delete or update operations as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point.
| // TODO: pass resource_id to audit_log_entry_complete once | ||
| // the schema supports it | ||
| let _resource_id = | ||
| result.as_ref().ok().and_then(|r| r.resource_id()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea being that resource_id is a top-level field of an audit event?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was the idea here. Another way would be to have a general concept of response summary, like we talked about, probably as a JSON column so it could be free-form, and that would have the ID in it. The problem there is that the summary column would have no guaranteed schema because different things might get different summaries, and if the ID is going to be especially load-bearing, it would probably make sense to give it its own column. Especially if we're recording it for deletes and updates too, like my code comment mentions. In that case nearly every audit log entry would have a resource ID associated with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though something that just occurred to me is that it is not necessarily always obvious what the ID is of. If you're doing a project update, then it's obvious that it would be the ID of the project. But if you update the firewall rules on a VPC, the ID would be that of the VPC because there is no ID associated with the firewall rules specifically.
I wonder if I should include a resource type string indicating what kind of ID it is. That is kind of painful to think about, but we do probably have a string like that around for every resource (see ResourceType below). I'm going to have to do a kind of audit of all the endpoints to see if problematic cases are the rule or the exception.
omicron/common/src/api/external/mod.rs
Lines 933 to 1013 in 1c69391
| #[display(style = "kebab-case")] | |
| pub enum ResourceType { | |
| AddressLot, | |
| AddressLotBlock, | |
| AffinityGroup, | |
| AffinityGroupMember, | |
| Alert, | |
| AlertReceiver, | |
| AllowList, | |
| AntiAffinityGroup, | |
| AntiAffinityGroupMember, | |
| AuditLogEntry, | |
| BackgroundTask, | |
| BgpAnnounceSet, | |
| BgpConfig, | |
| Blueprint, | |
| Certificate, | |
| ConsoleSession, | |
| Dataset, | |
| DeviceAccessToken, | |
| DeviceAuthRequest, | |
| Disk, | |
| Fleet, | |
| FloatingIp, | |
| IdentityProvider, | |
| Image, | |
| Instance, | |
| InstanceNetworkInterface, | |
| InternetGateway, | |
| InternetGatewayIpAddress, | |
| InternetGatewayIpPool, | |
| IpPool, | |
| IpPoolResource, | |
| LldpLinkConfig, | |
| LoopbackAddress, | |
| MetricProducer, | |
| MulticastGroup, | |
| MulticastGroupMember, | |
| NatEntry, | |
| Oximeter, | |
| PhysicalDisk, | |
| Probe, | |
| ProbeNetworkInterface, | |
| Project, | |
| ProjectImage, | |
| Rack, | |
| RoleBuiltin, | |
| RouterRoute, | |
| SagaDbg, | |
| SamlIdentityProvider, | |
| ScimClientBearerToken, | |
| Service, | |
| ServiceNetworkInterface, | |
| Silo, | |
| SiloAuthSettings, | |
| SiloGroup, | |
| SiloImage, | |
| SiloQuotas, | |
| SiloUser, | |
| Sled, | |
| SledInstance, | |
| SledLedger, | |
| Snapshot, | |
| SshKey, | |
| SupportBundle, | |
| Switch, | |
| SwitchPort, | |
| SwitchPortSettings, | |
| TufArtifact, | |
| TufRepo, | |
| TufTrustRoot, | |
| UserBuiltin, | |
| Vmm, | |
| Volume, | |
| Vpc, | |
| VpcFirewallRule, | |
| VpcRouter, | |
| VpcSubnet, | |
| WebhookSecret, | |
| Zpool, | |
| } |
This is a draft because I want to get reactions to the approach. Would address #8820, #8811, and #8819. The idea is:
audit_and_timethata. Wraps up the two annoying audit log calls — init and complete — into a single function that takes the handler logic as a callback
b. Combines that with the latency timing function, because it's ugly to do both
VERIFY_ENDPOINTSlist to make sure every non-get endpoint gets an audit log entry when you call itThe only added functionality here is the use of a trait
MaybeHasResourceId(could rename toAuditResponseor something) to extract the ID of created resources so we can log them. Note that this PR doesn't do any of the DB work yet to actually store that ID — I want to make sure people like the callsite ergonomics first.Concerns and future plans