feat(v2): Use AWS CRT as the default SDK client #1541

jreijn · 2023-12-22T09:03:45Z

Description of changes:

This PR changes the default SDK HTTP client from UrlConnectionClient to the AWS CRT client. The CRT client is a lightweight client which should reduce the cold start and latency of lambda functions.

Checklist

Meet tenets criteria
Update tests
Update docs
PR title follows conventional commit semantics

Breaking change checklist

RFC issue #:

Migration process documented
Implement warnings (if it can live side by side)

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

jeromevdl · 2023-12-22T10:31:05Z

Thanks @jreijn, looks great! Thanks to the new sync API, it's not so much change, that's a good news.

Question/ask: could you do a load test on the parameter sample (maybe the easiest to do) before and after CRT? Feel free to say no and I'll do it later.

jeromevdl · 2023-12-22T10:32:14Z

Build is failing, problem in DynamoDBWriter in batch module.

jreijn · 2023-12-22T10:59:13Z

@jeromevdl thanks! Yes the change is quite simple now. I'll look into the issues. I've just setup a dev linux box to see if I can run the tests locally as it seems to fail on my MacBook book due to some docker issue.

scottgerring · 2023-12-22T11:17:46Z

@jeromevdl thanks! Yes the change is quite simple now. I'll look into the issues. I've just setup a dev linux box to see if I can run the tests locally as it seems to fail on my MacBook book due to some docker issue.

It's the test in the CDK example, which is autogenerated and doesn't really test anything. I'd like to remove it in #1542

jreijn · 2023-12-22T14:34:50Z

Thanks @jreijn, looks great! Thanks to the new sync API, it's not so much change, that's a good news.

Question/ask: could you do a load test on the parameter sample (maybe the easiest to do) before and after CRT? Feel free to say no and I'll do it later.

@jeromevdl Do we already have something in place for doing the load test? Or should I just run something on my own?

…a execution

sonarcloud · 2023-12-22T22:41:00Z

Quality Gate passed

Kudos, no new issues were introduced!

0 New issues
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

jreijn · 2023-12-22T23:47:48Z

@jeromevdl I've ran a load test with artillery on the parameters module and the results are mentioned below.

I've used the parameters example with 1 slight improvement (tiered compilation) against the current v1 (main) branch with the connection client and this PR branch with the aws crt client. It's using 512MB. I've created two reports:

Duration
Memory

I used CW Logs insights with the following 2 queries:

filter @type="REPORT"
| fields greatest(@initDuration, 0) + @duration as duration, ispresent(@initDuration) as coldStart
| stats count(*) as count, pct(duration, 50) as p50, pct(duration, 90) as p90, pct(duration, 99) as p99, max(duration) as max by coldStart

filter @type = "REPORT"
| stats max(@memorySize / 1000 / 1000) as provisionedMemoryMB,
  min(@maxMemoryUsed / 1000 / 1000) as smallestMemoryRequestMB,
  avg(@maxMemoryUsed / 1000 / 1000) as avgMemoryUsedMB,
  max(@maxMemoryUsed / 1000 / 1000) as maxMemoryUsedMB,
  provisionedMemoryMB - maxMemoryUsedMB as overProvisionedMB

Duration

For the main branch (v1) while using the url-connection-client I get the following results:

coldStart	count	p50	p90	p99	max
0	17909	3.1752	4.9216	31.5072	12609.12
1	91	3670.57	6515.75	7931.03	7931.03

For the v2 branch while using the aws-crt-client I get the following results:

coldStart	count	p50	p90	p99	max
0	17902	3.125	4.7294	24.0299	12774.03
1	98	4634.31	8409.65	9343.73	9343.73

Memory

v1 with url-connection-client

provisionedMemoryMB	smallestMemoryRequestMB	avgMemoryUsedMB	maxMemoryUsedMB	overProvisionedMB
512	107	117.3101	183	329

v2 with aws-crt-client

provisionedMemoryMB	smallestMemoryRequestMB	avgMemoryUsedMB	maxMemoryUsedMB	overProvisionedMB
512	117	128.9981	188	324

I expected different results and I noticed that the example was not functioning due to some changes in the v2 branch which I fixed in a commit. There are still some messages during a coldstart for the LambdaJsonLayout of the log4j configuration and a mention about a missing version, which I will try to look at somewhere over the next couple of days.

I also see that the parameters example is using an amazon API call to https://checkip.amazonaws.com. Not sure if that impacts the results, but will try to analyse further.

As the crt sync client should also work with the v1 branch (by providing a client) I'll also try and test that to see if changes the v2 impact the results.

jeromevdl · 2023-12-23T08:22:02Z

Thanks for the tests. I didn't expect such results... It's disappointing. @msailes @scottgerring any thoughts? Do you think the asynchronous version would be better?

msailes · 2023-12-23T08:36:22Z

I don't think this is expected, one to ask Zoe Wang, when she's next working.

jreijn · 2023-12-28T11:09:26Z

I've also tested with the sync crt client in the v1 branch and I see similar behavior. It makes me wonder about the impact compared to the url connection client. In the announcement post on the aws blog for the crt client we see the comparison to the netty client which is a more heavy weight client. Looking forward to some insights from the team that worked on the aws crt client.

scottgerring · 2023-12-28T11:36:55Z

@jreijn thanks for the extensive testing!
I think we must be missing something. We will loop Zoe in once she's back on deck after the break. Failing that we probably need to hook up Codeguru to it and get some profiling data out.

jreijn · 2023-12-28T14:21:51Z

@scottgerring great suggestion for code guru. I'll see if I can find some over the next week to hook it up and find out what I can see about it.

scottgerring · 2023-12-29T14:18:11Z

It seems like this PR will also resolve #1544 !

scottgerring · 2023-12-29T14:22:47Z

I've used this as an opportunity to have a play with CodeGuru.

V2 current branch - No CRT Client - Java 11 Runtime

CRT version @jreijn branch - Java 11 Runtime

I don't think the above ⏫ is that useful - 1/ you can see in the CRT chart, the frames are all to do with fetching the checkmyip website, and 2/ the CRT usage is performed once, in the class constructor, not per-request. I am putting a branch together to make the runtime depend more on AWS SDK usage.

jreijn · 2023-12-29T14:25:00Z

It seems like this PR will also resolve #1544 !

Yes. I wasn't sure if the issue was known so solved it in this PR for now.

jreijn · 2024-01-05T14:47:33Z

Thanks @scottgerring . I'll keep an eye on the issue. I'm still enjoying some time off so if something changes I'll chime in again.

jreijn · 2024-01-05T15:29:03Z

@jreijn / @jeromevdl in the meantime, I suggest we do something like this:

1/ agree on a more indicative test. How do you both feel about this commit I pushed on the branch I referenced? It removes the call to the external website, and initializes the CRT client in the handler itself (so it happens per-request).

2/ Re-use Jeroen's existing cloudwatch cloudstart/warm-start based analysis process above with this test for v2 and the crt branches. I think this actually captures exactly what we want to see, but calling out to the external website muddies the waters.

At this point we'll have an idea on what the performance impact is for both cold/warm scenarios.

What do you both think?

@scottgerring i just noticed I missed this comment.

With regards to:

1 I fully agree. The remote call makes the test less precise due to the latency from the remote call.

2 sounds like a plan. If you want me to incorporate the changes and try again let me know!

zoewangg · 2024-01-09T00:34:29Z

Hi @jreijn, I'm on the AWS SDK for Java team 🙂. We performed tests to compare HTTP clients coldstart latency on Lambda and from our tests, switching to AWS CRT sync HTTP client from UrlConnectionHttpClient reduced 35% startup latency. AWS CRT sync has slight higher startup latency (6%) compared to AWS CRT async HTTP client and we'll look into that. The Lambda memory setting was 512MB and we were just sending one single ListBuckets request with S3 SDK. LMK if you have any questions and I'm interested to see the new test result!

jreijn · 2024-01-09T13:31:18Z

Thanks @zoewangg ! I'll try to make some time today or tomorrow to take it for another test run and see if we can see similar results. Was there a specific java version you used in your tests?

zoewangg · 2024-01-12T00:20:12Z

I was using Java 8.

scottgerring · 2024-01-16T09:26:40Z

Hey @jreijn sorry for the slow reply - took some time off over the break with the family!

If you have the time and the inclination that would of course be great! I'm aware this one is getting a bit more fiddly than initially expected - @jeromevdl and I can also jump in put some time into it if you'd like

jreijn · 2024-01-16T17:32:59Z

@scottgerring i hope you had some wonderful time with the family. I don't mind spending some more time, but it's good to share experiences. I did a test 2 days ago where I incorporated most of your changes (from the branch) and ran a new test. The results did not differ that much and I'm planning to do another run tomorrow before I post the results here just to check for any mistakes I might have made. I'll send an update tomorrow!

scottgerring · 2024-01-19T08:56:32Z

Hey @jreijn sounds good (although a bit surprised about the results). Looking forward to seeing the numbers.

…eliable during testing.

jreijn · 2024-01-24T13:07:22Z

I did a retest today to see the difference. I've incorporated some your changes @scottgerring into this PR for clarity. I'm a bit surprised by the memory results.

Duration v2-with-crt-client (AWS SDK 2.22.0)

coldStart	count	p50	p90	p99	max
0	2950	1.7087	2.3787	7.0289	122.65
1	50	4555.15	4711.33	4802.67	4802.67

Memory usage (AWS SDK 2.22.0)

provisionedMemoryMB	smallestMemoryRequestMB	avgMemoryUsedMB	maxMemoryUsedMB	overProvisionedMB
512	173	177.8377	179	333

Duration (AWS SDK 2.23.9)

coldStart	count	p50	p90	p99	max
0	2745	1.5186	2.1981	6.2845	42.02
1	31	3069.25	3247.63	3306.67	3306.67

Memory (AWS SDK 2.23.9)

provisionedMemoryMB	smallestMemoryRequestMB	avgMemoryUsedMB	maxMemoryUsedMB	overProvisionedMB
512	169	173.5416	175	337

I'll try to create a heap dump of the JVM to analyse the what's actually in the heap and post an update later.

Update

I've analyzed a heap dump of about (28 MB). The amount of live objects is very small. Not sure how this relates to the 169 MB reported by Lambda itself. @scottgerring @jeromevdl any ideas?

scottgerring · 2024-01-25T06:51:06Z

Hey @jreijn great - thanks for the dive deep!

I did a retest today to see the difference.

So .... we can see the remote call latency is down, but the overall picture hasn't really changed - the CRT is still slower both with and without cold starts. Memory usage seems similar.

We'll reach out to the team again on our side, it still feels like we are missing something!

Would you be able to give us the exact lambda configuration you are using - region, runtime, size?

I've analyzed a heap dump of about (28 MB). The amount of live objects is very small. Not sure how this relates to the 169 MB reported by Lambda itself. @scottgerring @jeromevdl any ideas?

I believe the lambda-reported usage will be the "OS view" and include the java runtime as well, not just the JVM-allocated heap. I would guess this accounts for the discrepancy.

sonarcloud · 2024-01-25T09:05:10Z

Quality Gate passed

Kudos, no new issues were introduced!

0 New issues
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

jreijn · 2024-01-25T09:05:27Z

I was using the sam template for deployment.

region: eu-west-1
runtime: java11
memory size: 512MB

scottgerring · 2024-02-15T16:23:03Z

Hey @jreijn ,

Once again apologies for the high turnaround here. @jeromevdl , @zoewangg and I had an impromptu chat about this to see if there was anything obvious to point at, and came up empty-handed. I'd like to take some time to sit with this myself and dive into it - i've set time aside tomorrow for this. I will loop back with findings and hopefully some reproducible unit of testing we can use to measure these and other perf-impacting decisions for PT.

scottgerring · 2024-02-16T15:18:59Z

@jreijn , @jeromevdl : I think i've gotten to the bottom of it! It's the size of the output JAR; using the CRT client by default includes a pile of sos for every platform it supports, which results in a function JAR that is 2.3x the size of the non-CRT jar.

Detailed numbers here, testing bits here (i'll clean this up a bit so others can use it)

Stripping back (removing all the shared objects for unneeded platforms) the JAR to the minimum reduces cold start time below the URLHttpClient:

CRT-Stripped Avg: 4.45. (lambda size = 15.9 MB)
CRT Normal Avg: 6.19, (lambda size = 32.73 MB)
URL Http Client: 6.49 (lambda size = 13.915 MB)

CRT-normal and URL client are very close to the same. This is close to what what you measured, @jreijn - yours are a little higher, mine are a little lower.

I note that in the maven repo, there are jar variants for all the different combos of arch/runtime. Turns out - you can use the "classifier" to get the right one - details here.

I haven't played with this yet, but it seems like there will be an onus on us to document this clearly in Powertools as part of the "getting started" guide - otherwise simply including the CRT as the default means that v2 may be slower than v1 because of all the unneeded libraries.

jeromevdl · 2024-02-16T16:17:11Z

@scottgerring thanks for your great tests.

Looking at the jar sizes in maven, it's clear that the generic crt is quite big compared to the others 17 Mb vs ~2 Mb for the specific versions. Did you test one of these specific versions ?

If I'm not mistaken, we need either:

<dependency>
    <groupId>software.amazon.awssdk.crt</groupId>
    <artifactId>aws-crt</artifactId>
    <version>0.29.10</version>
    <classifier>linux-aarch_64</classifier> <!-- arm64 -->
</dependency>

Or:

<dependency>
    <groupId>software.amazon.awssdk.crt</groupId>
    <artifactId>aws-crt</artifactId>
    <version>0.29.10</version>
    <classifier>linux-x86_64</classifier>  <!-- x86_64 -->
</dependency>

Problem is that we can't set this in Powertools, so we need to document it... In PT, we can use

<dependency>
    <groupId>software.amazon.awssdk.crt</groupId>
    <artifactId>aws-crt</artifactId>
    <version>0.29.10</version>
    <scope>provided</scope>
</dependency>

I don't really like the developer experience, but if we can get better performance, that's probably worth it.

scottgerring · 2024-02-19T07:27:15Z

If I'm not mistaken, we need either:

not directly; i recreated a specific variant by hand by stripping the "full" JAR, before I realised you can use the classifier. We need to test that and make sure it works. I think the only trick is going to be ensuring that maven does what we think it will!

jeromevdl · 2024-02-19T08:52:36Z

I think the only trick is going to be ensuring that maven does what we think it will!

Maven will do what it's supposed to do if you specify the classifier... Test it but it should work.

Wondering if we could simplify this somehow, rather than just documenting... We'll build a powertools-maven-plugin that reads in your IaC to evaluate the architecture, the target java runtime and choose the best dependencies ;)

jreijn · 2024-02-19T16:17:37Z

@scottgerring wow nice find! I was curious if the package size had any impact after reading this post, but I did not get to it yet.

In my local branch I made some changes and marked the aws-crt-client in the PT parameters module to scope provided ike @jeromevdl suggested and included the aws-crt-client and the aws-crt library the with classifier in the example project, but I was not very pleased by the developer experience.

This is mainly because the following observations:

you need to explicitly set the aws http client in your project (parameters example) by adding the dependencies . Of course we can document this, but it's not a seamless experience with just adding a single dependency (pt-module) to a project and setting the aspectj plugin.
you need to make sure you keep the version of the aws-crt library in combination with classifier in sync with the version used by the aws-crt-client. In your project you will need to exclude the aws-crt transitive dependency from the aws-crt-client like and include the correct aws-crt dependency (runtime specific):

        <dependency>
            <groupId>software.amazon.awssdk.crt</groupId>
            <artifactId>aws-crt</artifactId>
            <version>0.29.9</version>
            <classifier>linux-x86_64</classifier>
        </dependency>

        <dependency>
            <groupId>software.amazon.awssdk</groupId>
            <artifactId>aws-crt-client</artifactId>
            <version>2.23.21</version>
            <exclusions>
                <exclusion>
                    <groupId>software.amazon.awssdk.crt</groupId>
                    <artifactId>aws-crt</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

otherwise you can/will end up with both versions on the class path (I noticed this after the sam build).

If your local environment is different from the target runtime environment this will become a bit of an hassle and you will probably need multiple maven profiles to be able to locally test your function. (I have a Macbook) and the lambda is targeted to run on linux.

It makes me wonder if we should use the CRT client as the default client, as overall the developer experience will be less. Perhaps it's better to leave the url connection client as the default as it offers quite a good performance over dev experience . We could add documentation on how to use the CRT client for the local/target runtime for the users that want to try and squeeze the most out of their function, but it will be an explicit decision made by the developer to take on the extra burden of the local development vs remote and the dependency synch.

Now for some good news.

I did another run with the classifier and I see some very promising results compared to the previous benchmark(s).

The following table is for the v2 crt client with classifier and latest version of the sdk (2.23.21) and crt library (0.29.9) :

coldStart	count	p50	p90	p99	avg	min	max
0	2967	1.5493	2.1589	5.5797	1.7932	1.07	40.57
1	33	3170.56	3310.47	3433.73	3097.1491	2507.71	3433.73

jeromevdl · 2024-02-20T07:36:40Z

Thanks @jreijn. I agree the devX is quite bad and we probably should not do this by default but document it so that people who want better performance can do it...
@scottgerring @msailes wdyt?

scottgerring · 2024-02-20T07:44:09Z

Yeah, the developer experience isn't great. I didn't even consider this:

probably need multiple maven profiles to be able to locally test your function. (I have a Macbook)

... plus you often want flexibility to mix ARM64/AMD64 on the AWS side.

Considering our tenets 1/ Keep it lean and 2/ Work backwards from the community ... I think it is clear we should document how to use the CRT client (including what we've learnt here in terms of JAR-size minimization) and 2/ not use the CRT by default. The extra complexity it pushes onto the developer at the start of the journey outweighs the benefit in our case IMHO.

Open question: Is UrlConnectionHttpClient the best choice for us? Looking at this flowchart it seems like it is - minimize latency over throughput, minimize bundle size - but I think now is a good opportunity to make sure we're all on the same page @jreijn @jeromevdl @msailes !

msailes · 2024-02-20T08:56:02Z

I agree that we should leave CRT for now. At best we document how to use it as an optimization.

scottgerring · 2024-02-21T18:36:12Z

@jreijn - FYI - we have passed over our collective findings to the CRT team, who mentioned they have been thinking about this. We will loop back when we have any further interesting information to share!

In the meantime, would you like to work on the documentation improvements? I appreciate this isn't as glamorous as code, and we can certainly pick it up on our side if not :)

jreijn · 2024-02-21T18:58:22Z

@scottgerring great! Yes, I was already looking into adding some documentation. I will probably do that in a separate PR if that's ok.

scottgerring · 2024-02-26T08:11:31Z

@jreijn sounds good! There's a bit of accumulated history on this one. I'll close this and link the findings on the issue.

jreijn added 2 commits December 22, 2023 09:41

Migrate the large message module to use aws crt client

ba2e663

Migrate idempotency to use the aws crt client

45a0e95

pull-request-size bot added the size/M label Dec 22, 2023

scottgerring changed the title ~~refactor(v2) Use AWS CRT as the default SDK client~~ feat: (v2) Use AWS CRT as the default SDK client Dec 22, 2023

scottgerring added this to the v2 milestone Dec 22, 2023

scottgerring added the v2 Version 2 label Dec 22, 2023

jreijn added 3 commits December 22, 2023 10:07

Migrate the cloudformation module to use the aws crt client

89ab1f6

Migrate the parameters module to use the aws crt client

3805dcd

Migrate the end 2 end tests to use the aws crt client

7475bed

pull-request-size bot added size/L and removed size/M labels Dec 22, 2023

Migrate batch example to use the aws crt client

e4403a4

jeromevdl changed the title ~~feat: (v2) Use AWS CRT as the default SDK client~~ feat(v2): Use AWS CRT as the default SDK client Dec 22, 2023

Migrate batch example to use the aws crt client

65d129e

Fix issue with missing transformer manager resulting in failing lambd…

8d39247

…a execution

Remove logging and remote call checkip service to make example more r…

24a60bc

…eliable during testing.

jreijn added 3 commits January 25, 2024 09:59

Add explicit timeout and set Java tools for Java 11

91c1166

Bump SDK version to 2.23.9

bd2d21c

Fix description of template

7a9dffc

scottgerring closed this Feb 26, 2024

scottgerring mentioned this pull request Feb 26, 2024

Move SDK http client to CRT #1092

Closed

feat(v2): Use AWS CRT as the default SDK client #1541

feat(v2): Use AWS CRT as the default SDK client #1541

Conversation

jreijn commented Dec 22, 2023 • edited Loading

Description of changes:

Breaking change checklist

jeromevdl commented Dec 22, 2023

jeromevdl commented Dec 22, 2023

jreijn commented Dec 22, 2023

scottgerring commented Dec 22, 2023

jreijn commented Dec 22, 2023

sonarcloud bot commented Dec 22, 2023

Quality Gate passed

jreijn commented Dec 22, 2023 • edited Loading

Duration

Memory

jeromevdl commented Dec 23, 2023

msailes commented Dec 23, 2023

jreijn commented Dec 28, 2023

scottgerring commented Dec 28, 2023

jreijn commented Dec 28, 2023

scottgerring commented Dec 29, 2023

scottgerring commented Dec 29, 2023 • edited Loading

jreijn commented Dec 29, 2023

jreijn commented Jan 5, 2024

jreijn commented Jan 5, 2024

zoewangg commented Jan 9, 2024

jreijn commented Jan 9, 2024

zoewangg commented Jan 12, 2024

scottgerring commented Jan 16, 2024

jreijn commented Jan 16, 2024

scottgerring commented Jan 19, 2024

jreijn commented Jan 24, 2024 • edited Loading

scottgerring commented Jan 25, 2024

sonarcloud bot commented Jan 25, 2024

Quality Gate passed

jreijn commented Jan 25, 2024

scottgerring commented Feb 15, 2024

scottgerring commented Feb 16, 2024

jeromevdl commented Feb 16, 2024 • edited Loading

scottgerring commented Feb 19, 2024 • edited Loading

jeromevdl commented Feb 19, 2024

jreijn commented Feb 19, 2024

Now for some good news.

jeromevdl commented Feb 20, 2024

scottgerring commented Feb 20, 2024 • edited Loading

msailes commented Feb 20, 2024

scottgerring commented Feb 21, 2024

jreijn commented Feb 21, 2024

scottgerring commented Feb 26, 2024

jreijn commented Dec 22, 2023 •

edited

Loading

jreijn commented Dec 22, 2023 •

edited

Loading

scottgerring commented Dec 29, 2023 •

edited

Loading

jreijn commented Jan 24, 2024 •

edited

Loading

jeromevdl commented Feb 16, 2024 •

edited

Loading

scottgerring commented Feb 19, 2024 •

edited

Loading

scottgerring commented Feb 20, 2024 •

edited

Loading