Merged
32 changes: 32 additions & 0 deletions packages/@aws-cdk/aws-sagemaker-alpha/README.md
@@ -214,6 +214,38 @@ const endpointConfig = new sagemaker.EndpointConfig(this, 'EndpointConfig', {
});
```

### Serverless Inference

Amazon SageMaker Serverless Inference is a purpose-built inference option that makes it easy for you to deploy and scale ML models. Serverless endpoints automatically launch compute resources and scale them in and out depending on traffic, eliminating the need to choose instance types or manage scaling policies. For more information, see [SageMaker Serverless Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html).

Review comment (Contributor): We can add the link to the doc for further reference: SageMaker Serverless Inference.

To create a serverless endpoint configuration, use the `serverlessProductionVariant` property:

```typescript
import * as sagemaker from '@aws-cdk/aws-sagemaker-alpha';

declare const model: sagemaker.Model;

const endpointConfig = new sagemaker.EndpointConfig(this, 'ServerlessEndpointConfig', {
  serverlessProductionVariant: {
    model: model,
    variantName: 'serverlessVariant',
    maxConcurrency: 10,
    memorySizeInMB: 2048,
    provisionedConcurrency: 5, // optional
  },
});
```

Serverless inference is ideal for workloads with intermittent or unpredictable traffic patterns. You can configure:

- `maxConcurrency`: Maximum number of concurrent invocations (1-200)
- `memorySizeInMB`: Memory allocation in 1 GB increments (1024, 2048, 3072, 4096, 5120, or 6144 MB)
- `provisionedConcurrency`: Optional pre-warmed capacity to reduce cold starts (1-200, and no greater than `maxConcurrency`)

**Note**: Provisioned concurrency incurs charges even when the endpoint is not processing requests. Use it only when you need to minimize cold start latency.

You cannot mix serverless and instance-based variants in the same endpoint configuration.
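
The mutual-exclusivity rule can also be expressed in caller code at the type level. The following is only a minimal sketch under stated assumptions: `InstanceVariantSketch`, `ServerlessVariantSketch`, `VariantChoice`, and `variantKind` are hypothetical names that do not exist in the library, and the construct itself enforces the rule with a runtime check rather than with types.

```typescript
// Hypothetical, simplified shapes; the real props carry more fields (model, weight, ...).
interface InstanceVariantSketch {
  variantName: string;
}
interface ServerlessVariantSketch {
  variantName: string;
  maxConcurrency: number;
  memorySizeInMB: number;
}

// Each branch forbids the other property by typing it as `never`.
type InstanceChoice = {
  instanceProductionVariants: InstanceVariantSketch[];
  serverlessProductionVariant?: never;
};
type ServerlessChoice = {
  serverlessProductionVariant: ServerlessVariantSketch;
  instanceProductionVariants?: never;
};
type VariantChoice = InstanceChoice | ServerlessChoice;

// Report which kind of variant a choice carries.
function variantKind(choice: VariantChoice): 'instance' | 'serverless' {
  return choice.serverlessProductionVariant !== undefined ? 'serverless' : 'instance';
}

const serverlessOnly: VariantChoice = {
  serverlessProductionVariant: { variantName: 'serverlessVariant', maxConcurrency: 10, memorySizeInMB: 2048 },
};
```

With these types, an object literal that supplies both properties fails to compile, which surfaces the mistake before synthesis.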

### Endpoint

When you create an endpoint from an `EndpointConfig`, Amazon SageMaker launches the ML compute
193 changes: 188 additions & 5 deletions packages/@aws-cdk/aws-sagemaker-alpha/lib/endpoint-config.ts
@@ -75,6 +75,31 @@ export interface InstanceProductionVariantProps extends ProductionVariantProps {
readonly instanceType?: InstanceType;
}

/**
* Construction properties for a serverless production variant.
*/
export interface ServerlessProductionVariantProps extends ProductionVariantProps {
/**
* The maximum number of concurrent invocations your serverless endpoint can process.
*
* Valid range: 1-200
*/
readonly maxConcurrency: number;
/**
* The memory size of your serverless endpoint. Valid values are in 1 GB increments:
* 1024 MB, 2048 MB, 3072 MB, 4096 MB, 5120 MB, or 6144 MB.
*/
readonly memorySizeInMB: number;
/**
* The number of concurrent invocations that are provisioned and ready to respond to your endpoint.
*
* Valid range: 1-200, must be less than or equal to maxConcurrency.
*
* @default - none
*/
readonly provisionedConcurrency?: number;
}

/**
* Represents common attributes of all production variant types (e.g., instance, serverless) once
* associated to an EndpointConfig.
@@ -119,6 +144,26 @@ export interface InstanceProductionVariant extends ProductionVariant {
readonly instanceType: InstanceType;
}

/**
* Represents a serverless production variant that has been associated with an EndpointConfig.
*
* @internal
*/
interface ServerlessProductionVariant extends ProductionVariant {
/**
* The maximum number of concurrent invocations your serverless endpoint can process.
*/
readonly maxConcurrency: number;
/**
* The memory size of your serverless endpoint.
*/
readonly memorySizeInMB: number;
/**
* The number of concurrent invocations that are provisioned and ready to respond to your endpoint.
*/
readonly provisionedConcurrency?: number;
}

/**
* Construction properties for a SageMaker EndpointConfig.
*/
@@ -142,9 +187,21 @@ export interface EndpointConfigProps {
* A list of instance production variants. You can always add more variants later by calling
* `EndpointConfig#addInstanceProductionVariant`.
*
* Cannot be specified if `serverlessProductionVariant` is specified.
*
* @default - none
*/
readonly instanceProductionVariants?: InstanceProductionVariantProps[];

/**
* A serverless production variant. Serverless endpoints automatically launch compute resources
* and scale them in and out depending on traffic.
*
* Cannot be specified if `instanceProductionVariants` is specified.
*
* @default - none
*/
readonly serverlessProductionVariant?: ServerlessProductionVariantProps;
}

/**
@@ -207,6 +264,7 @@ export class EndpointConfig extends cdk.Resource implements IEndpointConfig {
public readonly endpointConfigName: string;

private readonly instanceProductionVariantsByName: { [key: string]: InstanceProductionVariant } = {};
private serverlessProductionVariant?: ServerlessProductionVariant;

constructor(scope: Construct, id: string, props: EndpointConfigProps = {}) {
super(scope, id, {
@@ -215,13 +273,22 @@ export class EndpointConfig extends cdk.Resource implements IEndpointConfig {
// Enhanced CDK Analytics Telemetry
addConstructMetadata(this, props);

// Validate mutual exclusivity
if (props.instanceProductionVariants && props.serverlessProductionVariant) {
throw new Error('Cannot specify both instanceProductionVariants and serverlessProductionVariant. Choose one variant type.');
Review comment (Contributor): I wasn't able to find any documentation that says `instanceProductionVariants` and `serverlessProductionVariant` cannot be used simultaneously for a single endpoint. Could you please provide the source for this restriction?

Reply (Contributor Author): Instance-based deployment and serverless deployment should not exist at the same time. References: Amazon SageMaker Deploy Model, and AWS::SageMaker::EndpointConfig ProductionVariant.

}

(props.instanceProductionVariants || []).map(p => this.addInstanceProductionVariant(p));

if (props.serverlessProductionVariant) {
this.addServerlessProductionVariant(props.serverlessProductionVariant);
}

// create the endpoint configuration resource
const endpointConfig = new CfnEndpointConfig(this, 'EndpointConfig', {
kmsKeyId: (props.encryptionKey) ? props.encryptionKey.keyRef.keyArn : undefined,
endpointConfigName: this.physicalName,
- productionVariants: cdk.Lazy.any({ produce: () => this.renderInstanceProductionVariants() }),
+ productionVariants: cdk.Lazy.any({ produce: () => this.renderProductionVariants() }),
});
this.endpointConfigName = this.getResourceNameAttribute(endpointConfig.attrEndpointConfigName);
this.endpointConfigArn = this.getResourceArnAttribute(endpointConfig.ref, {
@@ -238,6 +305,9 @@ export class EndpointConfig extends cdk.Resource implements IEndpointConfig {
*/
@MethodMetadata()
public addInstanceProductionVariant(props: InstanceProductionVariantProps): void {
if (this.serverlessProductionVariant) {
throw new Error('Cannot add instance production variant when serverless production variant is already configured');
}
if (props.variantName in this.instanceProductionVariantsByName) {
throw new Error(`There is already a Production Variant with name '${props.variantName}'`);
}
@@ -252,6 +322,30 @@ export class EndpointConfig extends cdk.Resource implements IEndpointConfig {
};
}

/**
* Add serverless production variant to the endpoint configuration.
*
* @param props The properties of a serverless production variant to add.
*/
@MethodMetadata()
public addServerlessProductionVariant(props: ServerlessProductionVariantProps): void {
if (Object.keys(this.instanceProductionVariantsByName).length > 0) {
throw new Error('Cannot add serverless production variant when instance production variants are already configured');
}
if (this.serverlessProductionVariant) {
throw new Error('Cannot add more than one serverless production variant per endpoint configuration');
}
this.validateServerlessProductionVariantProps(props);
this.serverlessProductionVariant = {
initialVariantWeight: props.initialVariantWeight ?? 1.0, // `??` keeps an explicit weight of 0
maxConcurrency: props.maxConcurrency,
memorySizeInMB: props.memorySizeInMB,
modelName: props.model.modelName,
provisionedConcurrency: props.provisionedConcurrency,
variantName: props.variantName,
};
}
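
The validator only rejects negative weights, so a weight of 0 is an accepted input, and how the default of 1.0 is applied decides whether that 0 survives. The sketch below compares the two common defaulting operators; `weightWithOr` and `weightWithNullish` are hypothetical helpers written for illustration, not part of the construct.

```typescript
// Hypothetical helpers comparing two ways of defaulting an optional weight.

// `||` falls back for every falsy value, so an explicit 0 becomes 1.0.
function weightWithOr(weight?: number): number {
  return weight || 1.0;
}

// `??` falls back only for undefined or null, so an explicit 0 is kept.
function weightWithNullish(weight?: number): number {
  return weight ?? 1.0;
}

const orResult = weightWithOr(0);
const nullishResult = weightWithNullish(0);
```

If a zero-weight variant should be expressible, the nullish form is the one that preserves it.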

/**
* Get instance production variants associated with endpoint configuration.
*
@@ -276,10 +370,20 @@ export class EndpointConfig extends cdk.Resource implements IEndpointConfig {
}

private validateProductionVariants(): void {
- // validate number of production variants
- if (this._instanceProductionVariants.length < 1) {
+ const hasServerlessVariant = this.serverlessProductionVariant !== undefined;
+
+ // validate at least one production variant
+ if (this._instanceProductionVariants.length === 0 && !hasServerlessVariant) {
  throw new Error('Must configure at least 1 production variant');
- } else if (this._instanceProductionVariants.length > 10) {
+ }
+
+ // validate mutual exclusivity
+ if (this._instanceProductionVariants.length > 0 && hasServerlessVariant) {
+ throw new Error('Cannot configure both instance and serverless production variants');
+ }
+
+ // validate instance variant limits
+ if (this._instanceProductionVariants.length > 10) {
throw new Error('Can\'t have more than 10 production variants');
}
}
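
The ordering rules enforced by the two `add...` methods, together with the synthesis-time count checks, can be sketched outside CDK as a small bookkeeping class. This is an illustrative stand-in under stated assumptions, not the construct: `VariantBookkeepingSketch` and its method names are hypothetical, and variants are reduced to their names.

```typescript
// Hypothetical stand-in for the variant bookkeeping; not the construct itself.
class VariantBookkeepingSketch {
  private readonly instanceNames: string[] = [];
  private serverlessName: string | undefined = undefined;

  // Instance variants are refused once a serverless variant exists.
  addInstance(variantName: string): void {
    if (this.serverlessName !== undefined) {
      throw new Error('Cannot add instance variant: serverless variant already configured');
    }
    if (this.instanceNames.indexOf(variantName) !== -1) {
      throw new Error(`There is already a variant named '${variantName}'`);
    }
    this.instanceNames.push(variantName);
  }

  // A serverless variant is refused if any variant of either kind exists.
  addServerless(variantName: string): void {
    if (this.instanceNames.length > 0) {
      throw new Error('Cannot add serverless variant: instance variants already configured');
    }
    if (this.serverlessName !== undefined) {
      throw new Error('Cannot add more than one serverless variant');
    }
    this.serverlessName = variantName;
  }

  // Synthesis-time check: at least one variant, at most 10 instance variants.
  validate(): void {
    if (this.instanceNames.length === 0 && this.serverlessName === undefined) {
      throw new Error('Must configure at least 1 production variant');
    }
    if (this.instanceNames.length > 10) {
      throw new Error('Cannot have more than 10 production variants');
    }
  }
}
```

Checking at both add time and validation time means a conflict is reported at the call that introduces it, while the count checks run once the full set of variants is known.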
@@ -310,11 +414,69 @@ export class EndpointConfig extends cdk.Resource implements IEndpointConfig {
}
}

private validateServerlessProductionVariantProps(props: ServerlessProductionVariantProps): void {
const errors: string[] = [];

// check variant weight is not negative
if (props.initialVariantWeight && props.initialVariantWeight < 0) {
errors.push('Cannot have negative variant weight');
}

// check maxConcurrency range
if (props.maxConcurrency < 1 || props.maxConcurrency > 200) {
errors.push('maxConcurrency must be between 1 and 200');
}

// check memorySizeInMB valid values (1GB increments from 1024 to 6144)
const validMemorySizes = [1024, 2048, 3072, 4096, 5120, 6144];
if (!validMemorySizes.includes(props.memorySizeInMB)) {
errors.push(`memorySizeInMB must be one of: ${validMemorySizes.join(', ')} MB`);
}

// check provisionedConcurrency range and relationship to maxConcurrency
if (props.provisionedConcurrency !== undefined) {
if (props.provisionedConcurrency < 1 || props.provisionedConcurrency > 200) {
errors.push('provisionedConcurrency must be between 1 and 200');
}
if (props.provisionedConcurrency > props.maxConcurrency) {
errors.push('provisionedConcurrency cannot be greater than maxConcurrency');
}
}

// check environment compatibility with model
const model = props.model;
if (!sameEnv(model.env.account, this.env.account)) {
errors.push(`Cannot use model in account ${model.env.account} for endpoint configuration in account ${this.env.account}`);
} else if (!sameEnv(model.env.region, this.env.region)) {
errors.push(`Cannot use model in region ${model.env.region} for endpoint configuration in region ${this.env.region}`);
}

if (errors.length > 0) {
throw new Error(`Invalid Serverless Production Variant Props: ${errors.join(EOL)}`);
}
}
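
The validator above collects every problem before throwing, so a caller sees all invalid settings in one error rather than fixing them one at a time. A minimal standalone sketch of that pattern follows, restricted to the three numeric settings; `ServerlessSettingsSketch` and `collectServerlessSettingErrors` are hypothetical names, and the model environment check is left out.

```typescript
// Hypothetical, CDK-free shape holding only the numeric serverless settings.
interface ServerlessSettingsSketch {
  maxConcurrency: number;
  memorySizeInMB: number;
  provisionedConcurrency?: number;
}

const VALID_MEMORY_SIZES_MB = [1024, 2048, 3072, 4096, 5120, 6144];

// Return every violated rule instead of stopping at the first one.
function collectServerlessSettingErrors(s: ServerlessSettingsSketch): string[] {
  const errors: string[] = [];
  if (s.maxConcurrency < 1 || s.maxConcurrency > 200) {
    errors.push('maxConcurrency must be between 1 and 200');
  }
  if (VALID_MEMORY_SIZES_MB.indexOf(s.memorySizeInMB) === -1) {
    errors.push(`memorySizeInMB must be one of: ${VALID_MEMORY_SIZES_MB.join(', ')} MB`);
  }
  if (s.provisionedConcurrency !== undefined) {
    if (s.provisionedConcurrency < 1 || s.provisionedConcurrency > 200) {
      errors.push('provisionedConcurrency must be between 1 and 200');
    }
    if (s.provisionedConcurrency > s.maxConcurrency) {
      errors.push('provisionedConcurrency cannot be greater than maxConcurrency');
    }
  }
  return errors;
}

// Two rules are violated here: the memory size and provisionedConcurrency > maxConcurrency.
const settingProblems = collectServerlessSettingErrors({
  maxConcurrency: 5,
  memorySizeInMB: 1500,
  provisionedConcurrency: 8,
});
```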

/**
* Render the list of production variants (instance or serverless).
*/
private renderProductionVariants(): CfnEndpointConfig.ProductionVariantProperty[] {
this.validateProductionVariants();

if (this.serverlessProductionVariant) {
return this.renderServerlessProductionVariant();
} else {
return this.renderInstanceProductionVariants();
}
}

/**
* Render the list of instance production variants.
*/
private renderInstanceProductionVariants(): CfnEndpointConfig.ProductionVariantProperty[] {
- this.validateProductionVariants();
+ if (this._instanceProductionVariants.length === 0) {
+ throw new Error('renderInstanceProductionVariants called but no instance variants are configured');
+ }

return this._instanceProductionVariants.map( v => ({
Review comment (Contributor): We can add a validation here. If the instanceProductionVariant is empty, we can throw an error.

acceleratorType: v.acceleratorType?.toString(),
initialInstanceCount: v.initialInstanceCount,
@@ -324,4 +486,25 @@ export class EndpointConfig extends cdk.Resource implements IEndpointConfig {
variantName: v.variantName,
}) );
}

/**
* Render the serverless production variant.
*/
private renderServerlessProductionVariant(): CfnEndpointConfig.ProductionVariantProperty[] {
if (!this.serverlessProductionVariant) {
Review comment (Contributor): We should throw an error in this case. The design is to call renderServerlessProductionVariant only when serverlessProductionVariant is defined, so an undefined serverlessProductionVariant should be treated as an error.

throw new Error('renderServerlessProductionVariant called but no serverless variant is configured');
}

const variant = this.serverlessProductionVariant;
return [{
initialVariantWeight: variant.initialVariantWeight,
modelName: variant.modelName,
variantName: variant.variantName,
serverlessConfig: {
maxConcurrency: variant.maxConcurrency,
memorySizeInMb: variant.memorySizeInMB,
provisionedConcurrency: variant.provisionedConcurrency,
},
}];
}
}
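
For orientation, the rendered variant ends up as a single-element `ProductionVariants` list in the synthesized template. The sketch below shows roughly what that mapping looks like in plain TypeScript. The PascalCase property names are assumed to follow the `AWS::SageMaker::EndpointConfig` resource and should be checked against the CloudFormation reference; `ServerlessVariantStateSketch` and `toTemplateVariantSketch` are hypothetical names, not library APIs.

```typescript
// Hypothetical plain-data view of a stored serverless variant.
interface ServerlessVariantStateSketch {
  variantName: string;
  modelName: string;
  initialVariantWeight: number;
  maxConcurrency: number;
  memorySizeInMB: number;
  provisionedConcurrency?: number;
}

// Map the stored variant to an assumed template shape, omitting unset optional values.
function toTemplateVariantSketch(v: ServerlessVariantStateSketch): { [key: string]: unknown } {
  const serverlessConfig: { [key: string]: unknown } = {
    MaxConcurrency: v.maxConcurrency,
    MemorySizeInMB: v.memorySizeInMB,
  };
  if (v.provisionedConcurrency !== undefined) {
    serverlessConfig['ProvisionedConcurrency'] = v.provisionedConcurrency;
  }
  return {
    InitialVariantWeight: v.initialVariantWeight,
    ModelName: v.modelName,
    VariantName: v.variantName,
    ServerlessConfig: serverlessConfig,
  };
}

// Example input mirroring the README snippet; the model name is a placeholder.
const renderedVariantSketch = toTemplateVariantSketch({
  variantName: 'serverlessVariant',
  modelName: 'example-model',
  initialVariantWeight: 1.0,
  maxConcurrency: 10,
  memorySizeInMB: 2048,
  provisionedConcurrency: 5,
});
```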