pagerduty_service_event_rule is unreliable, 404 Not Found during apply #428

Closed
idsvandermolen opened this issue Dec 3, 2021 · 7 comments

Comments

@idsvandermolen

idsvandermolen commented Dec 3, 2021

Terraform Version

Run terraform -v to show the version. If you are not running the latest version of Terraform, please upgrade because your issue may have already been fixed.

Terraform v1.0.11
on darwin_amd64
+ provider registry.terraform.io/hashicorp/null v3.1.0
+ provider registry.terraform.io/pagerduty/pagerduty v2.1.1

Affected Resource(s)

Please list the resources as a list, for example:

  • pagerduty_service
  • pagerduty_service_event_rule

If this issue appears to affect multiple resources, it may be an issue with Terraform's core, so please mention this.

Code

We have a service module with main.tf like this:

#
# Set up a service with some defaults:
# - a relation to an escalation policy
# - a service_event_rule for critical events
# - a service_event_rule for warning events
# - a relation to a business service
#

resource "pagerduty_service" "service" {
  name                    = var.name
  description             = var.description
  auto_resolve_timeout    = "null"
  acknowledgement_timeout = 1800
  escalation_policy       = var.escalation_policy_id
  alert_creation          = "create_alerts_and_incidents"

  incident_urgency_rule {
    type    = "constant"
    urgency = "severity_based"
  }
}

resource "pagerduty_service_event_rule" "critical" {
  service  = pagerduty_service.service.id
  position = 0

  conditions {
    operator = "and"
    subconditions {
      operator = "equals"
      parameter {
        path  = "severity"
        value = "critical"
      }
    }
  }

  actions {
    priority {
      value = data.pagerduty_priority.p2.id
    }
  }
}

resource "pagerduty_service_event_rule" "warning" {
  service  = pagerduty_service.service.id
  position = 1

  conditions {
    operator = "and"
    subconditions {
      operator = "equals"
      parameter {
        path  = "severity"
        value = "warning"
      }
    }
  }

  actions {
    priority {
      value = data.pagerduty_priority.p4.id
    }
    suppress {
      value = true
    }
  }
}

resource "pagerduty_service_dependency" "business-service" {
  dependency {
    dependent_service {
      id   = var.business_service_id
      type = "business_service"
    }
    supporting_service {
      id   = pagerduty_service.service.id
      type = "service"
    }
  }
}

And then call the module like this:

module "translations-service" {
  source               = "./modules/service"
  name                 = "Translations"
  description          = "Translations MicroService"
  escalation_policy_id = pagerduty_escalation_policy.dummy.id
  business_service_id  = pagerduty_business_service.capabilities.id
}

Expected Behavior

When deploying larger changes we expect them to succeed.

Actual Behavior

What actually happened?
During the first apply we see messages about resources still being created, and then it fails with a 404 Not Found:

module.translation-keys-service.pagerduty_service_event_rule.critical: Still creating... [2m0s elapsed]
module.translations-service.pagerduty_service_event_rule.critical: Still creating... [2m0s elapsed]

Error: GET API call to https://api.eu.pagerduty.com/services/PTKNIE0/rules/9bd3764b-8aa3-430c-a96c-6e709a4fbedc failed 404 Not Found. Code: 0, Errors: <nil>, Message: Rule Not Found

  with module.translations-service.pagerduty_service_event_rule.critical,
  on modules/service/main.tf line 23, in resource "pagerduty_service_event_rule" "critical":
  23: resource "pagerduty_service_event_rule" "critical" {


Error: GET API call to https://api.eu.pagerduty.com/services/PB5JK5D/rules/43a5417d-efb3-4224-8e4a-44167c35ee41 failed 404 Not Found. Code: 0, Errors: <nil>, Message: Rule Not Found

  with module.translation-keys-service.pagerduty_service_event_rule.critical,
  on modules/service/main.tf line 23, in resource "pagerduty_service_event_rule" "critical":
  23: resource "pagerduty_service_event_rule" "critical" {


Error: Error updating service event rule 377c56ed-bf5b-4eea-b1ac-35c19f3deebd position 0 needs to be 1

  with module.translations-service.pagerduty_service_event_rule.warning,
  on modules/service/main.tf line 45, in resource "pagerduty_service_event_rule" "warning":
  45: resource "pagerduty_service_event_rule" "warning" {


Error: Error updating service event rule db05de09-a9fb-4292-8561-3b0f5a03e940 position 0 needs to be 1

  with module.translation-keys-service.pagerduty_service_event_rule.warning,
  on modules/service/main.tf line 45, in resource "pagerduty_service_event_rule" "warning":
  45: resource "pagerduty_service_event_rule" "warning" {

Error: Process completed with exit code 1.

Running terraform plan afterwards fails during refresh with the same 404 Not Found error. The workaround is to remove the "not found" resources from the Terraform state with terraform state rm <resource> and apply again.

Note: we create the pagerduty_service and its accompanying service_event_rules in the same terraform apply. There might be a race condition where the Terraform PagerDuty provider does not handle the service => service_event_rule dependency correctly.
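
The state cleanup described above can be scripted. This is a minimal sketch, assuming a POSIX shell and that the stale resource addresses are copied by hand from the error output. forget_stale_rules is a hypothetical helper name, not part of Terraform or the provider:

```shell
# Hypothetical helper (not from the thread): forget the given resource
# addresses so the next plan no longer tries to refresh rules that the
# API reports as missing.
forget_stale_rules() {
  for addr in "$@"; do
    # "terraform state rm" only edits the state; it deletes nothing in PagerDuty.
    terraform state rm "$addr" || return 1
  done
}
```

For the run shown above, it would be called with the two addresses from the 404 errors, 'module.translations-service.pagerduty_service_event_rule.critical' and 'module.translation-keys-service.pagerduty_service_event_rule.critical', followed by another terraform apply.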

Steps to Reproduce

Please list the steps required to reproduce the issue, for example:

  1. terraform apply

Important Factoids

Is there anything atypical about your accounts that we should know? For example: Running in EC2 Classic? Custom version of OpenStack? Tight ACLs?

References

Are there any other GitHub issues (open or closed) or Pull Requests that should be linked here? For example:

@jjm
Contributor

jjm commented Dec 7, 2021

I noticed the same thing, and the position was not stable either. One workaround I found was to make each service event rule depend on the previous one so that they are created in the correct order.

@idsvandermolen
Author

> I noticed the same thing, and the position was not stable either. One workaround I found was to make each service event rule depend on the previous one so that they are created in the correct order.

Thanks, that helps ensure you don't have to apply multiple times to get the rule order right. However, it doesn't solve the race condition where Terraform thinks something has been deployed and adds it to the state file, while the PagerDuty API doesn't know about the resource and returns a 404.

@jjm
Contributor

jjm commented Dec 9, 2021

@idsvandermolen very true. A release may have fixed the race condition, as I'm no longer seeing these errors and the PR you linked to has been merged.

@jbfavre
Contributor

jbfavre commented Dec 10, 2021

@jjm
I ran into this issue as well, and am still seeing it with the latest provider release (v2.2.0).

Running terraform apply with TF_LOG=debug shows that the API acknowledges the resource creation, but further API calls result in Resource not found errors.

In my opinion, this is more a PagerDuty API bug than a Terraform provider one.
I've opened a case with PagerDuty support, providing the full debug log and request IDs so that they can investigate this issue.

@jbfavre
Contributor

jbfavre commented Jan 5, 2022

👋
Got an answer from PagerDuty support:

We've looked into this and determined that the bug results from Terraform attempting to process the event_rule calls concurrently. We are looking into changes to either the API protocol or the Terraform integration code still.

BUT, we do have a work around that should work: simply disabling parallel processing of commands in Terraform.

This is controlled by the variable parallelism, which defaults to the value of 10. You can disable concurrent processing by setting it to 1.

parallelism = 1

(If you'd like you can see more at "Walking the Graph" in Terraform's documentation here.)

I'm currently setting up the workaround and will keep this issue updated.
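
A note for readers trying this workaround: in Terraform, parallelism is a command-line option (-parallelism=n, default 10) rather than something set in a .tf file, so the "parallelism = 1" line above is best read as shorthand. This is a minimal sketch, assuming a POSIX shell, that applies the setting through Terraform's TF_CLI_ARGS_apply environment variable:

```shell
# Add -parallelism=1 to every "terraform apply" started from this shell.
# Terraform itself reads TF_CLI_ARGS_apply; nothing here is PagerDuty-specific.
export TF_CLI_ARGS_apply="-parallelism=1"

# One-off equivalent without the environment variable:
#   terraform apply -parallelism=1
```

The environment variable is convenient in CI, where the apply command is often fixed. The flag form is easier for a single manual run. Either way, the whole apply is serialized, not just the rule creation.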

@stmcallister
Contributor

stmcallister commented Jan 12, 2022

Hello! I have connected with the Engineering team and can confirm that they recommended slowing down the requests for creating rules, as @jbfavre mentioned above.

An approach I just tested with the code above is to add a depends_on field to the warning rule. That way the creation of that rule waits for the critical rule to be created before beginning its request. The benefit of this approach over setting parallelism is that you're only slowing down rule creation and not the whole terraform apply process.

Here's an example of using the depends_on field with the code above.

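resource "pagerduty_service_event_rule" "warning" {
  service  = pagerduty_service.service.id
  position = 1

  # Wait for the critical rule so the two create requests do not run concurrently.
  depends_on = [
    pagerduty_service_event_rule.critical
  ]

  # ... conditions and actions blocks unchanged from the module above ...
}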

stmcallister pushed a commit that referenced this issue Jan 19, 2022
Fix Service Event Rules Tests...and Updated the Version of TF SDK
@gsreynolds
Member

Rules are being deprecated and replaced
