Analytics Accelerator Library for Amazon S3

Analytics Accelerator Library for Amazon S3 is an open source library that accelerates data access from client applications to Amazon S3.

With this library, you can:

  • Lower processing times and compute costs for data analytics workloads.
  • Implement S3 best practices for performance.
  • Utilize optimizations specific to Apache Parquet files, such as pre-fetching metadata located in the footer of the object and predictive column pre-fetching.
  • Improve the price performance for your data analytics applications, such as workloads based on Apache Spark.

Current status

Analytics Accelerator Library for Amazon S3 is currently an alpha release and should not be used in production. We're especially interested in early feedback on features, performance, and compatibility. Please send feedback by opening a GitHub issue.

Getting Started

The alpha version of the library provides an interface for a seekable input stream. It is currently being integrated and tested with the Apache Hadoop S3A client.

To get started, add the library dependency from Maven to your project:

    <dependency>
      <groupId>software.amazon.s3.analyticsaccelerator</groupId>
      <artifactId>analyticsaccelerator-s3</artifactId>
      <version>0.0.1</version>
      <scope>compile</scope>
    </dependency>

Then, initialize the library's S3SeekableInputStreamFactory:

S3AsyncClient crtClient = S3CrtAsyncClient.builder().maxConcurrency(600).build();
S3SeekableInputStreamFactory s3SeekableInputStreamFactory = new S3SeekableInputStreamFactory(
        new S3SdkObjectClient(crtClient), S3SeekableInputStreamConfiguration.DEFAULT);

Note: The S3SeekableInputStreamFactory can be initialized with either the standard S3AsyncClient or the S3 CRT client. We recommend the S3 CRT client because of its enhanced connection pool management and higher download throughput. For either client, configure a high concurrency value to fully benefit from the library's optimizations, since the library makes multiple parallel requests to S3 to prefetch data asynchronously. For the Java S3AsyncClient, you can increase the maximum number of connections as follows:

NettyNioAsyncHttpClient.Builder httpClientBuilder =
        NettyNioAsyncHttpClient.builder()
                .maxConcurrency(600);

S3AsyncClient s3AsyncClient = S3AsyncClient.builder().httpClientBuilder(httpClientBuilder).build();
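The factory is then constructed in the same way as with the CRT client; a minimal sketch reusing the s3AsyncClient built above:

S3SeekableInputStreamFactory s3SeekableInputStreamFactory = new S3SeekableInputStreamFactory(
        new S3SdkObjectClient(s3AsyncClient), S3SeekableInputStreamConfiguration.DEFAULT);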

To open a stream:

S3SeekableInputStream s3SeekableInputStream = s3SeekableInputStreamFactory.createStream(S3URI.of(bucket, key));

For more details on the usage of this stream, please refer to the SeekableInputStream interface.
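As a quick, hedged usage sketch (seek and read follow the usual seekable-stream pattern; verify the exact signatures against the SeekableInputStream interface):

s3SeekableInputStream.seek(1024);      // move to byte offset 1024 in the object
byte[] buffer = new byte[4096];
int bytesRead = s3SeekableInputStream.read(buffer, 0, buffer.length); // read up to 4 KB
s3SeekableInputStream.close();         // close the stream when done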

When the S3SeekableInputStreamFactory is no longer required to create new streams, close it to free the resources (e.g., caches for prefetched data) held by the factory.

s3SeekableInputStreamFactory.close();
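Because both the factory and the streams it creates hold resources, you may prefer try-with-resources. This is a sketch that assumes both types implement AutoCloseable, as their close() methods suggest; verify against the library's API:

// Assumption: S3SeekableInputStreamFactory and S3SeekableInputStream both
// implement AutoCloseable, as their close() methods suggest.
try (S3SeekableInputStreamFactory factory = new S3SeekableInputStreamFactory(
             new S3SdkObjectClient(crtClient), S3SeekableInputStreamConfiguration.DEFAULT);
     S3SeekableInputStream stream = factory.createStream(S3URI.of(bucket, key))) {
    // read from the stream here
}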

Summary of Optimizations

Analytics Accelerator Library for Amazon S3 accelerates read performance of objects stored in Amazon S3 by integrating the AWS Common Runtime (CRT) libraries and implementing optimizations specific to Apache Parquet files. The AWS CRT is a software library built for interacting with AWS services that implements best-practice performance design patterns, including timeouts, retries, and automatic request parallelization for high throughput. The S3SeekableInputStreamFactory can be used to initialize streams for all file types, so they benefit from the library's read optimizations on top of the benefits of the CRT.

These optimizations are:

  • Sequential prefetching - The library detects sequential read patterns and prefetches data ahead of the current read position to reduce latency (see the sketch after this list).
  • Small object prefetching - The library prefetches the full object when the object is smaller than 3 MB, minimizing the number of read operations.
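As an illustration of the read pattern that sequential prefetching targets, consider a plain forward scan (the bucket and key below are hypothetical):

// Hypothetical object; any forward-only scan exhibits the sequential pattern
// that the library detects and prefetches ahead of.
S3SeekableInputStream in =
        s3SeekableInputStreamFactory.createStream(S3URI.of("my-bucket", "logs/part-00000.csv"));
byte[] chunk = new byte[8192];
int n;
while ((n = in.read(chunk, 0, chunk.length)) != -1) {
    // process the n bytes just read
}
in.close();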

When the object key ends with the file extension .parquet or .par, the library applies the following Apache Parquet-specific optimizations:

  • Parquet footer caching - As soon as a stream to a Parquet object is opened, the library reads a configurable tail of the object (1 MB by default) and caches it in memory. This prevents the multiple small GET requests to the tail of the file that would otherwise be needed for the Parquet metadata, page index, and bloom filter structures.
  • Predictive column prefetching - The library uses Parquet metadata to track recently read columns. When a subsequent Parquet file containing those columns is opened, the library prefetches them. For example, if columns x and y are read from A.parquet, and B.parquet is then opened and also contains columns named x and y, the library prefetches them asynchronously (see the sketch after this list).
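A short sketch of how these optimizations are triggered; the file names below are hypothetical, and nothing beyond the key's extension is required:

// Opening a key ending in .parquet triggers footer caching right away.
S3SeekableInputStream first =
        s3SeekableInputStreamFactory.createStream(S3URI.of("my-bucket", "warehouse/A.parquet"));
// ... a Parquet reader consumes columns x and y through this stream ...
first.close();

// Opening B.parquet afterwards: if it also contains columns x and y, the
// library prefetches those columns asynchronously based on the tracked history.
S3SeekableInputStream second =
        s3SeekableInputStreamFactory.createStream(S3URI.of("my-bucket", "warehouse/B.parquet"));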

Benchmark Results

Benchmarking Results -- November 25, 2024

The current benchmarking results are provided for reference only. Note that the performance of these queries can be affected by a variety of factors, including compute and storage variability, cluster configuration, and choice of compute. All results presented have a margin of error of up to 3%.

To establish the performance impact of changes, we rely on a benchmark derived from the industry-standard TPC-DS benchmark at a 3 TB scale. It is important to note that our TPC-DS-derived benchmark results are not directly comparable with official TPC-DS benchmark results. We also found that the sizing of Apache Parquet files and the partitioning of the dataset have a substantive impact on workload performance. As a result, we have created several versions of the test dataset with a focus on different object sizes, ranging from a few MiB to tens of GiB, as well as various partitioning approaches.

On S3A, we have observed a total suite execution acceleration between 10% and 27%, with some queries showing a speed-up of up to 40%.

Known issue: We are currently observing a regression of up to 8% on queries similar to Q44 in issue 173. We have determined that the root cause is data over-read caused by overly eager columnar prefetching when all of the following are true:

  1. The query is filtering on dictionary encoded columns.
  2. The query is selective, and most objects do not contain the required data.
  3. The query operates on a multi-GB dataset.

We are actively working on this issue. You can track progress on the issue 173 page.

The remaining TPC-DS queries show no regressions within the specified margin of error.

Contributions

We welcome contributions to Analytics Accelerator Library for Amazon S3! Please see the contributing guidelines for more information on how to report bugs, build from source code, or submit pull requests.

Security

If you discover a potential security issue in this project we ask that you notify Amazon Web Services (AWS) Security via our vulnerability reporting page. Please do not create a public GitHub issue.

License

Analytics Accelerator Library for Amazon S3 is licensed under the Apache-2.0 license. The pull request template will ask you to confirm the licensing of your contribution and to agree to the Developer Certificate of Origin (DCO).
