
2021 09 18 (Saturday) Deployment

Mike Marcotte edited this page Sep 22, 2021 · 5 revisions

General Notes

This deployment includes new features from Batch 8 and re-attempts a previous deployment that included a few bug fixes.

That previous deployment failed because one segment had become overloaded with records, and the lambda that migrated those records timed out at 15 minutes. We wrote a script to move those old and unused records into partitions based on the month they occurred. This will let us use them in the future for pagination and avoids a hot partition. This also had an unexpected bonus: our glue jobs from the Production account to the Staging account took only 2 hours and 16 minutes, down from 7 hours and 54 minutes on the most recent attempt.
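The repartitioning idea can be sketched as follows. This is a minimal illustration, not the actual schema or script; the key format is a hypothetical assumption:

```python
from datetime import datetime

def month_partition_key(record_pk: str, timestamp: str) -> str:
    """Derive a month-based partition key (hypothetical format) from a
    record's ISO-8601 timestamp, so old records spread across many
    partitions instead of piling up in one hot segment."""
    month = datetime.fromisoformat(timestamp).strftime("%Y-%m")
    return f"{record_pk}|{month}"

# Records created in different months land in different partitions.
print(month_partition_key("case|101-21", "2021-09-18T22:00:00"))  # case|101-21|2021-09
```

Spreading writes by month also makes it cheap to page through a single month's records later, which is the pagination use case mentioned above.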

One thing we observed while deploying this in the staging environments is that the first deploy will fail: a zip file is required to be available, but it is the deployment itself that builds and prepares that file. A script can then be run to make the file available to the deployment. It's somewhat tedious; because this deployment involves a migration, you also need to re-delete the resources (DynamoDB and Elasticsearch) that Terraform created.

A new feature of this deployment is readonly smoketests. In order for these to function, we needed to specify the USTC_ADMIN_USER in addition to the USTC_ADMIN_PASS in the CircleCI environment variables.
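A pre-flight check like the following would catch the missing variable before the smoketests run; this is a hypothetical sketch, not part of the actual smoketest script:

```python
import os

# Credentials the readonly smoketests need; fail fast if any are absent
# from the environment (e.g. CircleCI project settings).
REQUIRED = ("USTC_ADMIN_USER", "USTC_ADMIN_PASS")

def missing_env_vars(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]

print(missing_env_vars({"USTC_ADMIN_PASS": "secret"}))  # ['USTC_ADMIN_USER']
```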

Until we can accomplish a more seamless user experience for deployments, we will be performing deployments on weekends or after midnight on weekdays.

Bugfixes

Feature Stories

Timeline

  • 22:00 - deleted efcms-prod-beta table (full of records from the previous migration attempt), east & west

  • 22:02 - deleted efcms-search-prod-beta cluster

  • 22:03 - opened the Pull Request

  • 22:05 - ran the account-specific deploy

    NOTE: Using 8 nodes in prod for the Kibana cluster until we make snapshots and are comfortable running without all of the logs to search upon.

  • 22:18 - added USTC_ADMIN_USER and USTC_ADMIN_PASS environment variables to CircleCI

  • 22:46 - ran ./docker-to-ecr.sh latest

  • 22:47 - account specific finished! ✅

  • 22:47 - ran the env-specific deploy

  • 22:49 - need to add feature flag for maintenance mode

    $ ./scripts/set-maintenance-mode.sh false prod
    
  • 22:51 - merged PR ✅; CircleCI build started (this is expected to fail at the deploy step)

  • 23:28 - deploy failed

    │ Error: failed getting S3 Bucket (*********************.efcms.****.us-east-1.lambdas) Object (maintenance_notify_blue.js.zip): NotFound: Not Found
    │ 	status code: 404, request id: H5JCKNCXRYRPWE4T, host id: XtGQYbjtV15imEL4gr2IxB1LKnFwyN6+01S+yIulSkuUmp2Dxan3zTVeIQri7rr/Gf5nIQPuthM=
    │ 
    │   with module.ef-cms_apis.data.aws_s3_bucket_object.maintenance_notify_blue_east_object,
    │   on ../template/main-east.tf line 182, in data "aws_s3_bucket_object" "maintenance_notify_blue_east_object":
    │  182: data "aws_s3_bucket_object" "maintenance_notify_blue_east_object" {
    │ 
    ╵
    ╷
    │ Error: failed getting S3 Bucket (*********************.efcms.****.us-west-1.lambdas) Object (maintenance_notify_blue.js.zip): NotFound: Not Found
    │ 	status code: 404, request id: T7X3ZB5AHFNNVFHC, host id: pqlwSMjvpbpeTO4f9RO4lVLpQ711FjGd5cj1LVJPAvqe53jmVtTU7xLhZXzUt4gCAcoh6D53/AE=
    │ 
    │   with module.ef-cms_apis.data.aws_s3_bucket_object.maintenance_notify_blue_west_object,
    │   on ../template/main-west.tf line 137, in data "aws_s3_bucket_object" "maintenance_notify_blue_west_object":
    │  137: data "aws_s3_bucket_object" "maintenance_notify_blue_west_object" {
    │ 
    
    
  • 23:28 - need to run maintenance script

    $ ./setup-s3-maintenance-file.sh prod
    copy: s3://dawson.ustaxcourt.gov.efcms.prod.us-east-1.lambdas/maintenance_notify_green.js.zip to s3://dawson.ustaxcourt.gov.efcms.prod.us-east-1.lambdas/maintenance_notify_blue.js.zip
    copy: s3://dawson.ustaxcourt.gov.efcms.prod.us-west-1.lambdas/maintenance_notify_green.js.zip to s3://dawson.ustaxcourt.gov.efcms.prod.us-west-1.lambdas/maintenance_notify_blue.js.zip
    
  • 23:29 - delete DynamoDB East & West tables

  • 23:36 - retry from failed

  • 23:59 - confirm that the deploy table was updated correctly, and the migration will be alpha => beta

  • 00:10 - migration started

  • 01:12 - migration finished! ✅

  • 01:15 - re-indexing begins 📈

  • 01:45 - updated admin pass and CircleCI env var

  • 01:50 - re-ran from failed for readonly smoketests

    NOTE: Observed that the USTC_ADMIN_USER was enabled after failure. Need to gracefully handle a failure and disable the admin account.

  • 01:55 - adjusted admin pass again

  • 01:57 - re-ran smoketests

  • 02:03 - maintenance:disengage

  • 02:11 - they pass!

  • 02:11 - observed the USTC_ADMIN_USER is disabled, as is the testAdmissionsClerk account.

  • 02:15 - checked the health of the migration; indexing rate is 80k

    ┌─────────┬───────────────────────┬────────────┬───────────┬──────────┐
    │ (index) │       indexName       │ countAlpha │ countBeta │   diff   │
    ├─────────┼───────────────────────┼────────────┼───────────┼──────────┤
    │    0    │     'efcms-case'      │  1999766   │  532582   │ 1467184  │
    │    1    │ 'efcms-case-deadline' │   17580    │    527    │  17053   │
    │    2    │ 'efcms-docket-entry'  │  18344494  │  5653610  │ 12690884 │
    │    3    │    'efcms-message'    │   343346   │   93030   │  250316  │
    │    4    │     'efcms-user'      │  2019112   │  341324   │ 1677788  │
    │    5    │   'efcms-user-case'   │  1995026   │   53493   │ 1941533  │
    │    6    │   'efcms-work-item'   │   891996   │  174096   │  717900  │
    └─────────┴───────────────────────┴────────────┴───────────┴──────────┘
    Total Difference: 18762658 (6848662/25611320) 26.74% 
    
  • 03:00 - re-indexing is progressing

    ┌─────────┬───────────────────────┬────────────┬───────────┬─────────┐
    │ (index) │       indexName       │ countAlpha │ countBeta │  diff   │
    ├─────────┼───────────────────────┼────────────┼───────────┼─────────┤
    │    0    │     'efcms-case'      │  1999766   │  997323   │ 1002443 │
    │    1    │ 'efcms-case-deadline' │   17580    │   3985    │  13595  │
    │    2    │ 'efcms-docket-entry'  │  18344494  │  9907253  │ 8437241 │
    │    3    │    'efcms-message'    │   343346   │  123550   │ 219796  │
    │    4    │     'efcms-user'      │  2019114   │  1012755  │ 1006359 │
    │    5    │   'efcms-user-case'   │  1995026   │  429675   │ 1565351 │
    │    6    │   'efcms-work-item'   │   891996   │  339182   │ 552814  │
    └─────────┴───────────────────────┴────────────┴───────────┴─────────┘
    Total Difference: 12797599 (12813723/25611322) 50.03% 
    
  • 04:45 - re-indexing is continuing

    ┌─────────┬───────────────────────┬────────────┬───────────┬────────┐
    │ (index) │       indexName       │ countAlpha │ countBeta │  diff  │
    ├─────────┼───────────────────────┼────────────┼───────────┼────────┤
    │    0    │     'efcms-case'      │  1999766   │  1841790  │ 157976 │
    │    1    │ 'efcms-case-deadline' │   17580    │   8170    │  9410  │
    │    2    │ 'efcms-docket-entry'  │  18344494  │ 17687738  │ 656756 │
    │    3    │    'efcms-message'    │   343346   │  259407   │ 83939  │
    │    4    │     'efcms-user'      │  2019114   │  1490314  │ 528800 │
    │    5    │   'efcms-user-case'   │  1995026   │  1682098  │ 312928 │
    │    6    │   'efcms-work-item'   │   891996   │  679024   │ 212972 │
    └─────────┴───────────────────────┴────────────┴───────────┴────────┘
    Total Difference: 1962781 (23648541/25611322) 92.33% 
    

05:11 - Indexing is complete, but there appear to have been errors, which is why it took so long. There must be something up with the 6 missing user cases. I'm calling the deployment done for now, and will investigate tomorrow.

┌─────────┬───────────────────────┬────────────┬───────────┬──────┐
│ (index) │       indexName       │ countAlpha │ countBeta │ diff │
├─────────┼───────────────────────┼────────────┼───────────┼──────┤
│    0    │     'efcms-case'      │  1999766   │  1999766  │  0   │
│    1    │ 'efcms-case-deadline' │   17580    │   17580   │  0   │
│    2    │ 'efcms-docket-entry'  │  18344494  │ 18344494  │  0   │
│    3    │    'efcms-message'    │   343346   │  343346   │  0   │
│    4    │     'efcms-user'      │  2019114   │  2019114  │  0   │
│    5    │   'efcms-user-case'   │  1995026   │  1995020  │  6   │
│    6    │   'efcms-work-item'   │   891996   │  891996   │  0   │
└─────────┴───────────────────────┴────────────┴───────────┴──────┘
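The progress percentages reported during the night come from comparing total document counts between the source (alpha) and destination (beta) clusters. A minimal reproduction of the arithmetic, using the totals from the 02:15 snapshot above:

```python
# Totals across all seven indices at the 02:15 check (from the table above).
count_alpha_total = 25611320  # documents on the source cluster
count_beta_total = 6848662    # documents migrated to the destination so far

pct = count_beta_total / count_alpha_total * 100
print(f"{pct:.2f}%")  # 26.74%, matching the figure reported above
```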

Conclusion

We experienced a few hiccups in tonight's deployment, and further investigation is required. An outline of the items follows.

Missing records in destination cluster

Six records were missing from the efcms-user-case index on the destination cluster. So, I created an Elasticsearch script to identify them by querying that index on both clusters and finding the records that did not exist on the destination. These were the six records:
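The core of that script is a set difference over the document IDs from each cluster. A sketch of just the diff logic follows; the real script pulled the IDs from Elasticsearch, whereas here both sets are inlined with made-up IDs for illustration:

```python
def missing_on_destination(source_ids, destination_ids):
    """Return IDs present on the source (alpha) cluster but absent
    from the destination (beta) cluster, in sorted order."""
    return sorted(set(source_ids) - set(destination_ids))

# Hypothetical IDs standing in for the pk/sk pairs fetched from each cluster.
alpha = {"user|aaa case|1-21", "user|bbb case|2-21", "user|ccc case|3-21"}
beta = {"user|aaa case|1-21", "user|ccc case|3-21"}
print(missing_on_destination(alpha, beta))  # ['user|bbb case|2-21']
```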

user|6e9acd85-ea2d-40a3-9c66-d0c982eafdcf	case|7275-19
user|cc90d791-6224-4bab-bf70-1c43327807a0	case|12103-19
user|cc90d791-6224-4bab-bf70-1c43327807a0	case|12106-19
user|f6ff6e98-4d8f-4695-8ccc-36a6afefb460	case|16612-21
user|be5a7e6f-c734-4e59-9890-02e341ed3e4d	case|16960-21
user|5c7824f5-0120-4df3-924d-fd2de169190b	case|15511-21

I took a look in DynamoDB and found records for the first three in the source table, but could not find records for the last three. 🤔 All of the user records are IRS Practitioners.

So, I looked more closely at the application logs and was able to identify that these IRS Practitioners had recently been removed from these cases. It would seem that the operation did not properly remove the mapping records from Elasticsearch after removing the records from DynamoDB:

  • Sep 10, 2021 @ 15:50:09.843 - /case-parties/16612-21/counsel/f6ff6e98-4d8f-4695-8ccc-36a6afefb460 DELETE
  • Sep 17, 2021 @ 15:29:58.983 - /case-parties/16960-21/counsel/be5a7e6f-c734-4e59-9890-02e341ed3e4d DELETE
  • Sep 10, 2021 @ 15:09:36.463 - /case-parties/15511-21/counsel/5c7824f5-0120-4df3-924d-fd2de169190b DELETE

And these three, where the record exists in the source table but not the destination table, were removed on Sunday or Monday, which explains why they were missing from the destination when running the query on Monday afternoon:

  • Sep 20, 2021 @ 10:37:05.425 - /case-parties/7275-19/counsel/6e9acd85-ea2d-40a3-9c66-d0c982eafdcf DELETE
  • Sep 20, 2021 @ 07:14:04.308 - /case-parties/12103-19/counsel/cc90d791-6224-4bab-bf70-1c43327807a0 DELETE
  • Sep 19, 2021 @ 15:08:34.391 - /case-parties/12106-19/counsel/cc90d791-6224-4bab-bf70-1c43327807a0 DELETE

So, all of the records did get migrated, eventually! We still need to investigate whether we are properly and reliably removing User Case records from Elasticsearch.

Sluggish re-indexing

Re-indexing appeared to really slow down at around 3am. Compared with previous deployments that involved a blue-green migration, this one took about twice as long.

Usual Deployment:

normal-migration-reindexing

Saturday's deployment:

slow-reindexing

Action Items

  • Try to replicate removing an IRS Practitioner from a case and see that the record doesn't get removed from Elasticsearch
  • Work with AWS Support to identify issues during re-indexing.
  • Add story to make sure that the USTC_ADMIN_USER gets disabled if the readonly smoketest script fails.
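The last action item above could be handled with a try/finally guard around the smoketests, so the admin account is disabled even when they fail. This is a hypothetical sketch with illustrative function names, not the actual smoketest code:

```python
def run_readonly_smoketests_safely(run_smoketests, disable_admin_user):
    """Run the smoketests and guarantee the admin account is disabled
    afterward, whether the tests pass or raise."""
    try:
        return run_smoketests()
    finally:
        disable_admin_user()  # runs on success AND on failure

# Demonstrate with stand-ins: the smoketests fail, yet disable still runs.
calls = []
def fake_smoketests():
    calls.append("smoketests")
    raise RuntimeError("smoketest failure")

try:
    run_readonly_smoketests_safely(fake_smoketests, lambda: calls.append("disabled"))
except RuntimeError:
    pass

print(calls)  # ['smoketests', 'disabled']
```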