@jmg-duarte jmg-duarte commented Dec 1, 2025

Description

We're getting autopilot errors because ANALYZE is being run against our read replica. This PR disables ANALYZE for the read replica only.

2025-12-01T03:02:50.723Z ERROR database_metrics: autopilot::database: failed to update large tables stats err=Database(PgDatabaseError { severity: Error, code: "25006", message: "cannot execute ANALYZE during recovery", detail: None, hint: None, position: None, where: None, schema: None, table: None, column: None, data_type: None, constraint: None, file: Some("utility.c"), line: Some(455), routine: Some("PreventCommandDuringRecovery") })

https://aws-es.cow.fi/_dashboards/app/discover#/context/86e4a5a0-4e4b-11ef-85c5-3946a99ed1a7/Lhrc15oBNcYyVCDI7-L5?_g=(filters:!())&_a=(columns:!(timestamp,log,log_level,kubernetes.pod_name),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!t,index:'86e4a5a0-4e4b-11ef-85c5-3946a99ed1a7',key:kubernetes.container_name,negate:!f,params:(query:polygon-autopilot-prod),type:phrase),query:(match_phrase:(kubernetes.container_name:polygon-autopilot-prod))),('$state':(store:appState),meta:(alias:!n,disabled:!t,index:'86e4a5a0-4e4b-11ef-85c5-3946a99ed1a7',key:log_level,negate:!f,params:!(ERROR,FATAL),type:phrases,value:'ERROR,%20FATAL'),query:(bool:(minimum_should_match:1,should:!((match_phrase:(log_level:ERROR)),(match_phrase:(log_level:FATAL))))))))

The following types of administration commands are not accepted during recovery mode:

Data Definition Language (DDL): e.g., CREATE INDEX

Privilege and Ownership: GRANT, REVOKE, REASSIGN

Maintenance commands: ANALYZE, VACUUM, CLUSTER, REINDEX

Again, note that some of these commands are actually allowed during "read only" mode transactions on the primary.

As a result, you cannot create additional indexes that exist solely on the standby, nor statistics that exist solely on the standby. If these administration commands are needed, they should be executed on the primary, and eventually those changes will propagate to the standby.

https://www.postgresql.org/docs/current/hot-standby.html
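
As an illustration of the restriction quoted above, a service can ask the server whether it is a standby with the built-in `pg_is_in_recovery()` function, or recognise SQLSTATE `25006` (`read_only_sql_transaction`, the code in the log line above) when a command is rejected. The sketch below is a minimal, driver-free model of the second option. `AnalyzeOutcome` and `classify_analyze_result` are hypothetical names, and a plain `String` SQLSTATE stands in for the real driver error type; none of this is taken from the PR's diff.

```rust
/// Built-in PostgreSQL function: returns true while the server is in recovery
/// (for example, a hot standby / read replica).
const IS_IN_RECOVERY_QUERY: &str = "SELECT pg_is_in_recovery()";

/// SQLSTATE reported in the log above when ANALYZE is rejected during recovery.
const READ_ONLY_SQL_TRANSACTION: &str = "25006";

/// Hypothetical outcome type for one ANALYZE attempt.
#[derive(Debug, PartialEq)]
enum AnalyzeOutcome {
    Done,
    SkippedOnStandby,
    Failed(String),
}

/// Maps the result of an ANALYZE attempt to an outcome.
/// Assumption: the error is represented only by its SQLSTATE string.
fn classify_analyze_result(result: Result<(), String>) -> AnalyzeOutcome {
    match result {
        Ok(()) => AnalyzeOutcome::Done,
        // Rejected because the server is a standby: expected, not a real failure.
        Err(code) if code == READ_ONLY_SQL_TRANSACTION => AnalyzeOutcome::SkippedOnStandby,
        // Anything else is still a genuine error worth logging.
        Err(code) => AnalyzeOutcome::Failed(code),
    }
}
```

Classifying the error only reduces log noise. It does not produce statistics on the standby, which, per the documentation quoted above, have to come from the primary.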

Changes

  • Check whether the autopilot is connected to a read replica and, if so, do not issue ANALYZE commands
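
A minimal sketch of the shape of this change, assuming the connection used by the metrics task is chosen together with a flag saying whether ANALYZE is allowed on it. `Pool`, `metrics_pool`, `analyze_statements`, and the `db_read` parameter are hypothetical stand-ins for illustration; only `db_write` appears in the diff.

```rust
/// Stand-in for a real connection pool; only the URL is modelled here.
#[derive(Clone, Debug, PartialEq)]
struct Pool {
    url: String,
}

/// Picks the pool for the metrics task and reports whether ANALYZE may be
/// issued on it. A configured read replica is preferred, but it cannot run
/// ANALYZE, so the flag is false in that case.
fn metrics_pool(db_write: &Pool, db_read: Option<&Pool>) -> (Pool, bool) {
    if let Some(read) = db_read {
        (read.clone(), false)
    } else {
        (db_write.clone(), true)
    }
}

/// Returns the ANALYZE statements the task would issue for the given tables,
/// or nothing at all when ANALYZE is not allowed on the chosen connection.
fn analyze_statements(can_analyze: bool, tables: &[&str]) -> Vec<String> {
    if !can_analyze {
        return Vec::new();
    }
    tables.iter().map(|table| format!("ANALYZE {table}")).collect()
}
```

With a read replica configured, `metrics_pool` returns the replica and `false`, so `analyze_statements` yields an empty list and no command reaches the standby.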

How to test

Tested in staging. The change was issued around 10:40. Because the task runs ANALYZE first and then sleeps, the absence of errors after that time means no ANALYZE command was issued.

@jmg-duarte jmg-duarte marked this pull request as ready for review December 3, 2025 11:21
@jmg-duarte jmg-duarte requested a review from a team as a code owner December 3, 2025 11:21

@MartinquaXD MartinquaXD left a comment

For context, this logic is only used to make table sizes show up in Grafana. Since it was added, we introduced a Postgres metrics exporter. A better solution might be to drop this logic from the services altogether and solve the issue in infra instead. That should only happen if it is very easy to do, since these metrics are low priority overall.

     } else {
-        db_write.clone()
+        (db_write.clone(), true)
     };

Since the read replica is now configured everywhere, this change will cause the metrics to stop being updated. As it stands, this seems like a footgun.
If it works with the write_db, we should always pass that instead of silently "disabling" this functionality when the read_db is configured.
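
One way the alternative described here could look, as a rough sketch only: ANALYZE always goes to the write database, while read-only queries may still use the replica when one is configured. `Pool`, `MetricsDbs`, and `metrics_dbs` are hypothetical names, not code from this PR.

```rust
/// Stand-in for a real connection pool; only the URL is modelled here.
#[derive(Clone, Debug, PartialEq)]
struct Pool {
    url: String,
}

/// The two connections a metrics task would hold under this alternative.
#[derive(Debug, PartialEq)]
struct MetricsDbs {
    /// Always the primary, because ANALYZE is rejected during recovery.
    analyze: Pool,
    /// The read replica if configured, otherwise the primary.
    read: Pool,
}

/// Builds the pair of pools without ever disabling ANALYZE.
fn metrics_dbs(db_write: &Pool, db_read: Option<&Pool>) -> MetricsDbs {
    MetricsDbs {
        analyze: db_write.clone(),
        read: db_read.unwrap_or(db_write).clone(),
    }
}
```

Under this shape the metrics keep updating regardless of configuration, and statistics refreshed on the primary eventually propagate to the standby, as the hot-standby documentation linked in the description notes.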
