Convert 2017 & 2018 API services from EC2 to Fargate #244
I've been working the problem of making Fargate work in our existing CF stack (#238) and have it to the point where it appears that the container deploys and starts up as an ECS task, and is able to configure itself using remote resources (e.g. SSM parameters). Next is to come at it from the opposite angle: can we host a known-good container as a Fargate task and have it successfully respond to outside requests? To weave these two together, the best approach is to try converting an existing EC2-based task to Fargate. I've picked the Budget 2017 API container because (a) I know how it's supposed to work, having been on the team, and (b) it's very unlikely to be getting traffic, so downtime is at its most tolerable. I've adapted what I believe to be a working Fargate template to the Budget-Service, and at this point it appears that once the container is up and running, the ALB's health checks are failing with a 400 error.
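For reference, the Fargate-specific pieces of the adapted task template boil down to properties like the following. This is only a sketch of the pattern, not our exact template - the logical IDs, family name, and parameter names are placeholders:

```yaml
# Sketch of the Fargate-specific task definition properties (placeholder names).
BudgetTaskDefinition:
  Type: AWS::ECS::TaskDefinition
  Properties:
    Family: budget-service
    RequiresCompatibilities:
      - FARGATE               # run the task on Fargate rather than EC2 container instances
    NetworkMode: awsvpc       # Fargate tasks must use awsvpc networking (each task gets its own ENI)
    Cpu: "256"
    Memory: "512"
    ExecutionRoleArn: !Ref TaskExecutionRoleArn   # allows ECS to pull the image and write logs
    ContainerDefinitions:
      - Name: budget-service
        Image: !Ref ContainerImage
        PortMappings:
          - ContainerPort: 8000   # the port the containerized app listens on
```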
I had a theory that this was a 400 error due to the Security Groups configuration only allowing access to ports 80 and 443, but not to the port (8000) on which the containerized app is listening - and to which the Health Check must be able to connect to verify that the container is ready to take requests. The Security Group was explicitly configured in the Fargate examples I modelled this after (but I didn't know why; I just thought "there must be some reason why Fargate requires this"). So I tried commenting out all references to the Security Group in the master and task templates. That still didn't result in a container that was deemed "healthy", even though it was deployed into the cluster and even though, according to the CloudWatch logs, the application in the container has completed all its needed startup (i.e. I can see no errors in the logs, and I can see that the app is running). So I'm going to dig deeper into the Security Group configuration and ensure it's explicitly allowing incoming traffic on the port(s) that the container is configured to listen on.
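Concretely, the rule I have in mind looks something like this. It's only a sketch - the logical IDs (FargateSecurityGroup, LoadBalancerSecurityGroup) are placeholders, not necessarily what our templates call these resources:

```yaml
# Let the load balancer reach the container port so health checks can connect.
ContainerPortIngress:
  Type: AWS::EC2::SecurityGroupIngress
  Properties:
    GroupId: !Ref FargateSecurityGroup              # SG attached to the Fargate task's ENI
    IpProtocol: tcp
    FromPort: 8000
    ToPort: 8000
    SourceSecurityGroupId: !Ref LoadBalancerSecurityGroup   # health checks originate from the ALB
```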
Note: I'm currently working from this experimental branch (i.e. I don't expect to PR this or merge it to master).
I've uncommented the Security Group references, generated a new security group, and explicitly granted access to a set of additional port combinations (from:to) - and we're back to the original issue. The Security Group is getting created just as it was last week, but the extra port combos don't solve the "400" problem. The very first hit on that error message leads me back to the article that got me started down this road, so I'll take another stab at other possibilities.
Assumption: "reason Health checks failed with these codes: [400]" indicates that some listener in the cluster responded with an HTTP 400 response code ("Bad Request"). We can eliminate the container application itself, since I can verify from CloudWatch logs that the container never records an incoming HTTP request - the final entries in the CloudWatch log for each instance of this "unhealthy" container are the same as for a healthy instance of the same container, with no sign of the request-logging entries that would appear if an HTTP request had actually reached the app.
SecurityGroup? OK, so I've been messing with the SecurityGroup assigned to the task to see whether it was too restrictive. After a number of iterations I finally opened it up to all protocols and IPs, and we're still getting the same health check failure.
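For the record, "all protocols and IPs" amounts to an ingress rule roughly like this - a diagnostic step only, with a placeholder logical ID:

```yaml
# Temporary diagnostic rule: allow everything in, to rule out the Security Group entirely.
AllTrafficIngress:
  Type: AWS::EC2::SecurityGroupIngress
  Properties:
    GroupId: !Ref FargateSecurityGroup
    IpProtocol: "-1"        # -1 means all protocols
    CidrIp: 0.0.0.0/0       # any source IP
```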
And I'm still seeing no evidence in CloudWatch that the container app is receiving any requests. A couple of possibilities remain, which I'll work through below.
Subnets? Here's another thought after trolling through the templates and looking at the things that aren't the same as the EC2 services: what about the subnets to which the Fargate task is deployed? They're specified in the master yaml, and it wasn't immediately clear whether they're the same subnets used for the EC2 tasks or a different set. Digging through ECS-cluster.yaml and its params from master.yaml, it is in fact the same subnets.
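For context, the pass-through pattern looks roughly like this in master.yaml. The logical IDs and template filename below are illustrative; the point is that the Fargate service stack receives the same Subnets value that ECS-cluster.yaml already consumes:

```yaml
# Illustrative nested-stack wiring in master.yaml: the Fargate service stack receives
# the same subnet list that the EC2 cluster stack already gets.
BudgetFargateService:
  Type: AWS::CloudFormation::Stack
  Properties:
    TemplateURL: !Sub https://s3.amazonaws.com/${TemplateBucket}/2017-budget-api.yaml
    Parameters:
      Subnets: !GetAtt VPC.Outputs.Subnets   # identical value passed to ECS-cluster.yaml
```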
Network? Next suspect on the list is the network section of the Service definition. That bit is specific to the Fargate/awsvpc setup and has no counterpart in the EC2-based service definitions, so it's worth a closer look.
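For anyone following along, the generic shape of a Fargate service's network block is roughly the following; this is a sketch rather than our exact template, and the parameter names are assumptions:

```yaml
# Generic shape of a Fargate service's network settings (placeholder names).
BudgetService:
  Type: AWS::ECS::Service
  Properties:
    Cluster: !Ref Cluster
    LaunchType: FARGATE
    NetworkConfiguration:
      AwsvpcConfiguration:
        AssignPublicIp: ENABLED          # needed if the task must reach ECR/SSM without a NAT gateway
        SecurityGroups:
          - !Ref FargateSecurityGroup
        Subnets: !Ref Subnets            # the subnets the task's ENI will be placed in
```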
Network Interfaces? Finally, in thoroughly crawling the EC2 resources, I came across this page, which makes me wonder (a) is there a way to find out which network interface & subnet are connected to the Fargate task, and (b) do we have it hooked up correctly? (Just because things look fine to the eyeball when comparing the template content to the EC2 tasks doesn't make it so.)
Override the health check? This article gave me a crazy idea: what if (even just temporarily, to get one level deeper into the cluster & logs) we told the Health Check to accept 400 as "healthy"? Edit: that was unexpected - it seems to be working, though not in any way I can fully explain.
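In target-group terms, the temporary override amounts to widening the health check's Matcher. A sketch - the health check path and logical ID are assumptions, not necessarily our template's values:

```yaml
# Temporarily treat HTTP 400 as a passing health check so the task stays in service.
TargetGroup:
  Type: AWS::ElasticLoadBalancingV2::TargetGroup
  Properties:
    VpcId: !Ref VpcId
    Port: 8000
    Protocol: HTTP
    TargetType: ip                 # Fargate (awsvpc) tasks register by IP, not instance ID
    HealthCheckPath: /budget       # assumed path; adjust to the service's actual health endpoint
    Matcher:
      HttpCode: "200,400"          # accept 400 alongside 200 (temporary workaround)
```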
By all measures, the cluster is successfully sending the /budget requests to a container that runs the Budget API, but I cannot see any confirmation in CloudWatch that the Fargate task that's healthy and running is the one that's responding to those requests. In the past we've run into a situation like this, and though I can't remember the details precisely, I do recall that the lesson was "don't trust that you're running the container you think you are until you can track it down and prove it".
Update on the lack of CloudWatch logs for the Budget task via Fargate: I've redeployed the EC2-based Budget task and confirmed that it also doesn't show any incoming requests in the CloudWatch logs, so this may not be indicative of a problem. After reviewing which Django apps do show incoming requests in their CW logs, the next step is to deploy Emergency Response in Fargate and verify that I'm seeing requests logged in CloudWatch. If that looks good, it'll be time to start setting up the 2019 API containers (though in a way that will allow Django apps returning 400 to get scheduled into service, which is a piece of Tech Debt we're liable to forget if I don't log it soon).
OK I think we nailed it:
Both the 2017 Budget and 2017 Emergency Response services are now deploying and remaining healthy on Fargate. This PR enabled the whole shebang. Migrating the remaining EC2-based ECS services should thus just be a matter of copying the pattern established with these two.
Status of this effort: the 2017 Emergency Services container is in good shape. It survived even the refactoring effort that is underway. However, the 2017 Budget container was not so lucky: the container is deemed unhealthy by the ALB under that refactored deploy. Referencing @DingoEatingFuzz 's recent document https://github.com/hackoregon/civic-devops/blob/master/docs/HOWTO-Update-an-API-repo-to-use-Fargate.md, the most likely culprit is some out-of-sync'ness with the recent changes that Michael made to Emergency Services in PRs 124-126, as well as the direct commits on July 1, 2019.

HOWEVER, it's also important to note that the Budget container eventually stabilizes and starts answering healthily when deployed via the current 2017-budget-api.yaml. Thus, there's some difference between the 2017-budget-api.yaml (and its pass-ins from master.yaml) and the 2017-fargate-api.yaml (and its associated pass-ins) that causes the same container image to succeed with the former and fail with the latter.
Further status on updating the Budget container: along with the fixes made to emergency-service, there have been a number of changes made in a branch on the Budget repo to (a) figure out how the Travis sequence works and (b) get the build and test steps working again. What I'm finding is that the Emergency Services container doesn't actually leverage SSM parameters to pass its build & test, which explains why it still deploys to ECR without actually having permission to read from SSM. Additional changes I've had to make that weren't recently done for the Emergency container (but may have been present for longer) include the following:
That IAM user will need to be augmented with a policy that allows access to the relevant SSM parameters.
(See #114 (comment) for the first time we set up such a policy.)
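As a sketch of what that policy augmentation might look like - the user logical ID and parameter path here are assumptions, not the actual values we use:

```yaml
# Grant the deploy/build IAM user read access to the service's SSM parameters.
BudgetSsmReadPolicy:
  Type: AWS::IAM::Policy
  Properties:
    PolicyName: budget-ssm-read
    Users:
      - !Ref TravisDeployUser                     # placeholder for the existing IAM user
    PolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Action:
            - ssm:GetParameter
            - ssm:GetParameters
            - ssm:GetParametersByPath
          Resource: !Sub arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:parameter/budget/*
```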
With PR 173 (hackoregon/team-budget#173) in the Team Budget repo, we have successfully refactored to get the build and test steps working again (even with the new SSM world view of configuration management), and to get the validated container image published to ECR. However, there is an error during the deploy stage that prevents Travis from successfully getting the new Team Budget container image deployed to ECS.
I don't recall ever seeing this error before, and while the usual suspect is AWS IAM policies, we'll have to do some research to figure out how to fully CD again. OTOH, with a new container image successfully deployed into ECR, we can manually update the CF cluster and see how well that container image behaves in ECS.
The ClientException error turns out to be a predictable, if completely forgotten, detail.
If things with the next 2017 API containers go really haywire, read up on #158 and see if there are other corrections needed.
Summary
Migrate all existing containers to Fargate
Definition of Done
All existing containers are running as Fargate tasks in the existing infrastructure.