Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature request] EC2 Fleet provider #394

Open
riywo opened this issue Jul 30, 2023 · 14 comments · May be fixed by #518
Open

[Feature request] EC2 Fleet provider #394

riywo opened this issue Jul 30, 2023 · 14 comments · May be fixed by #518

Comments

@riywo
Copy link

riywo commented Jul 30, 2023

The current EC2 provider is awesome. I'm just wondering we could implement a similar provider but using EC2 Fleet as an actual provisioner. It is neat because it can configure multiple instance types and purchase options with built-in strategies and would be super useful to find the lowest spot price without complicated logic, for example.

If you don't have any plan yet. I could work on it by implementing a new provider like Ec2FleetProvider but trying to align with the props of Ec2Provider as much as possible so that users can easily switch from one to the other.

Let me know your opinions!

@kichik
Copy link
Member

kichik commented Jul 30, 2023

Last time I tried to use EC2 Fleet I found this:

// we can't use fleets because they don't let us override user data, security groups or even disk size

If you find a good way to get around those issues, I'd be happy to merge a PR for it. We might be able to do it with a Lambda function. I have been trying to avoid those to keep the Step Function simple. But we have so many Lambda functions around, how bad can one more be?

@cynicaljoy
Copy link

cynicaljoy commented Nov 15, 2023

@kichik you can define a Launch Template with User Data, Security Groups, Disks, etc and then use that Launch Template in the Fleet Request.

Sorry, I don't have time currently to work up a full PR, but something like this:

     new stepfunctions_tasks.CallAwsService(this, "FleetRequest", {
      comment: "FleetRequest",
      integrationPattern: IntegrationPattern.WAIT_FOR_TASK_TOKEN,
      service: 'ec2',
      action: 'fleetRequest',
      heartbeatTimeout: stepfunctions.Timeout.duration(Duration.minutes(10)),
      iamResources: ['*'],
      parameters: {
        TargetCapacitySpecification: {
          TotalTargetCapacity: 1,
          OnDemandTargetCapacity: 0,
          SpotTargetCapacity: 1,
          DefaultTargetCapacityType: 'spot',
        },
        LaunchTemplateConfigs: this.subnets.map((subnet, _) => {
          return {
            LaunchTemplateSpecification: {
              LaunchTemplateId: this.ami.launchTemplate.launchTemplateId,
              Version: this.ami.launchTemplate.versionNumber.toString(),
            },
            Overrides: {
              InstanceType: this.instanceType.toString(),
              IamInstanceProfile: {
                Arn: instanceProfile.attrArn,
              },
              InstanceInitiatedShutdownBehavior: ec2.InstanceInitiatedShutdownBehavior.TERMINATE,
              SubnetId: subnet.subnetId,
              MetadataOptions: {
                HttpTokens: 'required',
              },
              SecurityGroupIds: this.securityGroups.map(sg => sg.securityGroupId),
              BlockDeviceMappings: [{
                DeviceName: rootDeviceResource.ref,
                Ebs: {
                  DeleteOnTermination: true,
                  VolumeSize: this.storageSize.toGibibytes(),
                },
              }],
              UserData: stepfunctions.JsonPath.format(
                stepfunctions.JsonPath.stringAt('$.ec2.userdataTemplate'),
                ...params,
              ),
            }
          }
        }),
      },
    });

@kichik
Copy link
Member

kichik commented Nov 18, 2023

@cynicaljoy where do you see UserData being overridable? That's exactly the missing part. That's where we put all the runner configuration.

@cynicaljoy
Copy link

@kichik err sorry, napkin code. I'm translating this from a Go utility service I have. How I'm handling that is I first create a new version of the LaunchTemplate w/the updated UserData and then I use that version of the Launch Template in the Fleet request. Then I cleanup the oldest version of the LaunchTemplate to ensure that we don't grow the LaunchTemplate versions beyond the account limit which is 1000 by default.

@kichik
Copy link
Member

kichik commented Nov 21, 2023

I don't love creating a new resource for every runner. Combined with the idle reaper, that may cause launch templates to be left behind and reach that 1000 limit. It can also cause issues with deployments as we will be modifying resources created by CloudFormation.

The "cleanest" method (and those quotes are working real hard) I could think of so far was to create a queue for each fleet runner type. The instances could then pull their next runner task from the queue when they boot up. It will take real careful work to avoid having old jobs remain in the queue. What will take them off the queue if EC2 fails to launch, for example?

@cynicaljoy
Copy link

what about making the user data more idempotent? rather than the params being populated into the template and added to the launch template, those values are added to param store or secrets manager w/the task token as part of the key and the task token is added to the resource tags. then the idle reaper can clean up.

@kichik
Copy link
Member

kichik commented Nov 21, 2023

The problem with param store/secrets manager is that we need to be able to spin up multiple instances at the same time. They will override each other.

@cynicaljoy
Copy link

Am I looking in the right place? (PowerShell below that too of course).

I was thinking the workflow would be to create unique secret item (whether Secrets Manager or Parameter Store): e.g./cdk-runners/$ownerPath/$repoPath/$TASK_TOKEN/ (maybe more to ensure idempotence) and the variable parts would be Resource Tags. So the UserData would be updated to read the Resource Tags to know where to pull that runs params. Then the secret store would be deleted at the end of the user data.

Of course, this would be mean some additional required IAM scope and the secret item path should be configurable.

@kichik
Copy link
Member

kichik commented Nov 25, 2023

Am I looking in the right place? (PowerShell below that too of course).

Yes.

I was thinking the workflow would be to create unique secret item (whether Secrets Manager or Parameter Store): e.g./cdk-runners/$ownerPath/$repoPath/$TASK_TOKEN/ (maybe more to ensure idempotence) and the variable parts would be Resource Tags. So the UserData would be updated to read the Resource Tags to know where to pull that runs params. Then the secret store would be deleted at the end of the user data.

Of course, this would be mean some additional required IAM scope and the secret item path should be configurable.

Any resource being created by the step function (workflow) will pose a problem as it is implemented right now. Idle reaper will abort step functions which means created resources can be left behind.

CreateFleet does allow overriding instance resource tags for instant request type. So that could work. We can put the variables in tags and read them from IMDS.

@kichik kichik linked a pull request Mar 8, 2024 that will close this issue
@choco-nishant
Copy link

Any update on this issue? We really need multiple instance types and purchase options for our runners to optimise the cost using spot instances.

@kichik
Copy link
Member

kichik commented Aug 1, 2024

@choco-nishant #518 is working IIRC. It's just limited to I believe 1 hour jobs. I haven't had a chance yet to fix that. But if one hour is enough for you, you can already use that branch.

@choco-nishant
Copy link

Thanks. I'll try it out since we already set a 60-minute timeout for the jobs.

@choco-nishant
Copy link

@kichik we are using ECS runners and as per my understanding the fleet branch only supports EC2 runners, right?

@kichik
Copy link
Member

kichik commented Sep 26, 2024

@kichik we are using ECS runners and as per my understanding the fleet branch only supports EC2 runners, right?

Yes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants