2021 10 23 (Saturday) Deployment
This deployment mainly consists of the latest batch of work from Flexion. See the stories below.
Additionally, it commits the change to add an additional replica shard to our Elasticsearch cluster for each index. This will improve performance and resiliency.
We are performing this update after hours, expecting it to conclude between 1am and 2am, as we observe a low level of activity at this time. We will notify any Court Staff who are logged in to save their work and log out as the deployment completes.
- https://github.com/flexion/ef-cms/issues/9009
- https://github.com/flexion/ef-cms/issues/9019
- https://github.com/ustaxcourt/ef-cms/issues/1739
- https://github.com/flexion/ef-cms/issues/8896
- https://github.com/flexion/ef-cms/issues/8919
- https://github.com/flexion/ef-cms/issues/8918
- https://github.com/flexion/ef-cms/issues/7704
While deploying in Court environments, we observed that the script that waits until reindexing is complete was getting confused by the additional replica. It appears that the stats API counts the total number of documents multiplied by the number of shard copies. By adding a replica, we increased that amount by 50%. So, we created a bug to track this, and a fix to use the count API instead.
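To illustrate why the two APIs disagree, here is a minimal sketch rather than the actual deployment script. The response bodies are hand-built to mirror the shape of Elasticsearch's index stats response (`_all.total.docs.count`, summed over primaries and replicas) and count response (`count`). The figures are the `efcms-case` numbers from the index summary in this log. The helper names and the `primaries` values are assumptions for illustration.

```javascript
// Documents summed over every shard copy (primaries plus replicas),
// as found in an index stats response body.
function docsAcrossAllCopies(statsBody) {
  return statsBody._all.total.docs.count;
}

// Documents as reported by a count response body: each document once.
function docsCountedOnce(countBody) {
  return countBody.count;
}

// Hypothetical response bodies built from the efcms-case figures in this log.
// The `primaries` values are an assumption; only the totals appear in the log.
const alphaStats = {
  _all: {
    primaries: { docs: { count: 1004645 } },
    total: { docs: { count: 3013935 } },
  },
};
const betaStats = {
  _all: {
    primaries: { docs: { count: 1004645 } },
    total: { docs: { count: 2009290 } },
  },
};
const alphaCount = { count: 1004645 };
const betaCount = { count: 1004645 };

// Comparing totals across copies reports a large difference even though
// both clusters hold the same documents.
const statsDiff = docsAcrossAllCopies(alphaStats) - docsAcrossAllCopies(betaStats);

// Comparing count responses is unaffected by the number of replicas.
const countDiff = docsCountedOnce(alphaCount) - docsCountedOnce(betaCount);

console.log({ statsDiff, countDiff });
```

With these inputs the stats-based difference is 1004645 while the count-based difference is 0, which is why a check built on the stats totals could never see the two clusters as equal.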
- 22:14 - Created the Pull Request
- 22:15 - Ran script to set up boolean values in prod deploy table
$ ./scripts/update-deploy-string-to-boolean.sh prod
- 22:17 - Ensure ES and DynamoDB tables are ready for a Migration
- 22:20 - Ran Docker to ECR script
$ ./docker-to-ecr.sh latest
- 22:21 - Tests pass
- 22:22 - Merged the PR (CircleCI Build)
- 22:35 - Tests pass; deploy step starts
- 22:40 - Observed deploy table looks correct, and migrate flag is true, source table: `beta`, destination table: `alpha`.
- 23:00 - Deploy step completes
- 23:00 - Migration starts. 🤞
- 00:06 - Migration completes successfully
- 02:58 - Reindexing appears to be complete based off of the earlier observations:
## prod Index Summary
┌─────────┬───────────────────────┬────────────┬───────────┬─────────┐
│ (index) │       indexName       │ countAlpha │ countBeta │  diff   │
├─────────┼───────────────────────┼────────────┼───────────┼─────────┤
│    0    │     'efcms-case'      │  3013935   │  2009290  │ 1004645 │
│    1    │ 'efcms-case-deadline' │   27384    │   18266   │  9118   │
│    2    │ 'efcms-docket-entry'  │  27667143  │ 18444764  │ 9222379 │
│    3    │    'efcms-message'    │   592368   │  394912   │ 197456  │
│    4    │     'efcms-user'      │   481410   │  320940   │ 160470  │
│    5    │   'efcms-work-item'   │  1587057   │  1058038  │ 529019  │
└─────────┴───────────────────────┴────────────┴───────────┴─────────┘
With the updated script:
┌─────────┬───────────────────────┬────────────┬───────────┬──────┐
│ (index) │       indexName       │ countAlpha │ countBeta │ diff │
├─────────┼───────────────────────┼────────────┼───────────┼──────┤
│    0    │     'efcms-case'      │  1004645   │  1004645  │  0   │
│    1    │ 'efcms-case-deadline' │    9128    │   9133    │  5   │
│    2    │ 'efcms-docket-entry'  │  9222381   │  9222382  │  1   │
│    3    │    'efcms-message'    │   197456   │  197456   │  0   │
│    4    │     'efcms-user'      │   160470   │  160470   │  0   │
│    5    │   'efcms-work-item'   │   529019   │  529019   │  0   │
└─────────┴───────────────────────┴────────────┴───────────┴──────┘
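The relationship between the two summaries can be checked with a little arithmetic. The sketch below takes the stats-based figures from the first summary and divides each by an assumed number of shard copies: 3 for alpha and 2 for beta. These copy counts are not stated anywhere in this log; they are inferred from the ratios in the tables. The `COPIES` constant and `normalize` helper are hypothetical names.

```javascript
// Assumed number of copies of each document per cluster (inferred, not confirmed).
const COPIES = { alpha: 3, beta: 2 };

// Stats-based figures copied from the first index summary in this log.
const statsSummary = [
  { indexName: 'efcms-case', countAlpha: 3013935, countBeta: 2009290 },
  { indexName: 'efcms-case-deadline', countAlpha: 27384, countBeta: 18266 },
  { indexName: 'efcms-docket-entry', countAlpha: 27667143, countBeta: 18444764 },
  { indexName: 'efcms-message', countAlpha: 592368, countBeta: 394912 },
  { indexName: 'efcms-user', countAlpha: 481410, countBeta: 320940 },
  { indexName: 'efcms-work-item', countAlpha: 1587057, countBeta: 1058038 },
];

// Divide each figure by the assumed copy count and recompute the difference.
function normalize(row) {
  const countAlpha = row.countAlpha / COPIES.alpha;
  const countBeta = row.countBeta / COPIES.beta;
  return {
    indexName: row.indexName,
    countAlpha,
    countBeta,
    diff: Math.abs(countAlpha - countBeta),
  };
}

const normalized = statsSummary.map(normalize);
console.table(normalized);
```

Under that assumption, every normalized figure matches the corresponding entry in the count-based summary, including the differences of 5 for `efcms-case-deadline` and 1 for `efcms-docket-entry`.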
- 03:02 - Manually continuing the deployment
- 03:03 - Running script to figure out what the missing docket entry is:
$ node shared/admin-tools/elasticsearch/determine-difference-es-index.js prod beta efcms-docket-entry
- 03:08 - Smoketests pass! Observed that `USTC_ADMIN_USER` is disabled.
- 03:13 - Switch colors...
- 03:16 - Disabled `blue` api custom domains east & west
Things are looking good. Investigating the docket entry and case deadlines that are missing from the destination cluster. 🤔
I’m having a hard time figuring out which document is missing: my query to calculate the delta keeps timing out because the docket entry index is so large.
$ node shared/admin-tools/elasticsearch/determine-difference-es-index.js prod beta efcms-docket-entry
efcms-search-prod-alpha
events.js:292
throw er; // Unhandled 'error' event
^
Error: read ECONNRESET
at TCP.onStreamRead (internal/stream_base_commons.js:209:20)
Emitted 'error' event on ClientRequest instance at:
at Socket.socketErrorListener (_http_client.js:469:9)
at Socket.emit (events.js:315:20)
at Socket.EventEmitter.emit (domain.js:467:12)
at emitErrorNT (internal/streams/destroy.js:106:8)
at emitErrorCloseNT (internal/streams/destroy.js:74:3)
at processTicksAndRejections (internal/process/task_queues.js:80:21) {
errno: -54,
code: 'ECONNRESET',
syscall: 'read'
}
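One way to avoid a single huge query is to walk both indices in sorted order and compare identifiers as they stream past. The sketch below is not how `determine-difference-es-index.js` works (its implementation is not shown in this log). It only illustrates the merge-style comparison, under the assumption that each cluster can supply its document identifiers in the same ascending order, for example from paginated searches. The function name `missingFromDestination` is hypothetical.

```javascript
// Yields ids present in `source` but absent from `destination`.
// Both inputs must be iterables of ids sorted ascending under the same ordering.
// Only one id from each side is held in memory at a time.
function* missingFromDestination(source, destination) {
  const dest = destination[Symbol.iterator]();
  let d = dest.next();
  for (const id of source) {
    // Skip destination ids that sort before the current source id.
    while (!d.done && d.value < id) {
      d = dest.next();
    }
    // The source id is missing if the destination is exhausted
    // or its current id is not the same one.
    if (d.done || d.value !== id) {
      yield id;
    }
  }
}

// Small in-memory example; real inputs would be fed page by page.
const sourceIds = ['a', 'b', 'c', 'd'];
const destinationIds = ['a', 'c', 'e'];
const missing = [...missingFromDestination(sourceIds, destinationIds)];
console.log(missing);
```

Because the comparison is a single forward pass, each page of results can be discarded once it has been consumed, so no individual request needs to cover the whole index.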
However, the case deadline records are another example of https://github.com/flexion/ef-cms/issues/9009. The records don’t exist in DynamoDB (either source or destination). At some point, these records should have been removed from the source cluster, yet they continue to linger. Something must be intermittently failing when deleting these records from the cluster (and perhaps when indexing them, too). The fix put forth for 9009 so far was a significant refactor that deprecated the `efcms-user-case` index and stopped indexing unwanted records into the `efcms-user` index. It appears the underlying problem, where some delete requests fail, still persists.