bluishglc/apache-hudi-core-conceptions

Apache Hudi Core Conceptions

A set of notebooks that explore and explain core concepts of Apache Hudi, such as file layouts, file sizing, compaction, and clustering.

① The notebooks use a public dataset, amazon-reviews-pds, located at s3://amazon-reviews-pds. It is accessible from AWS global regions; users in AWS China regions, or without an AWS account, can download it locally with an S3 client tool.

② The notebooks run on Amazon EMR Studio, a managed notebook service for Amazon EMR. If you have no AWS account, you can adapt the notebooks to any notebook environment that supports a Spark kernel.

③ The recommended configuration for the Spark cluster is 32 vCores and 120 GB of memory or higher; the master node must have more than 100 GB of free disk space.
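The disk-space recommendation in ③ can be checked before running the notebooks. The following is a minimal sketch, not part of the notebooks themselves: the function names and the choice of `/` as the path to inspect are hypothetical, and the check only looks at the node where the code runs (for example, the master node).

```python
import shutil

REQUIRED_FREE_GB = 100  # threshold taken from the recommendation above


def free_disk_gb(path: str = "/") -> float:
    """Return the free space on the filesystem containing `path`, in GB (10^9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def has_enough_disk(path: str = "/", required_gb: float = REQUIRED_FREE_GB) -> bool:
    """True when the filesystem containing `path` has at least `required_gb` GB free."""
    return free_disk_gb(path) >= required_gb


if __name__ == "__main__":
    # "/" is an assumption; point this at the volume where the data will be stored.
    print(f"free: {free_disk_gb():.1f} GB, enough: {has_enough_disk()}")
```

If the data lives on a separate volume, pass that mount point instead of `/`.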


Update Notes

@2023-08-22: The public dataset "amazon-reviews-pds" at s3://amazon-reviews-pds was recently closed. You can download the raw data from https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/, but its format and schema differ from the original Parquet files on s3://amazon-reviews-pds, so you need to clean and reformat the raw data yourself.
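As a starting point for that cleanup, the sketch below maps one raw JSON-lines record onto a subset of the original columns. It is only an illustration under stated assumptions: the raw field names (`reviewerID`, `asin`, `overall`, `vote`, `verified`, `summary`, `reviewText`, `unixReviewTime`) and the target column names (`customer_id`, `product_id`, `star_rating`, and so on) are assumptions about the two schemas and should be verified against the files you actually download. The helper names `to_pds_row` and `convert_lines` are hypothetical.

```python
import json
from datetime import datetime, timezone


def to_pds_row(raw: dict, product_category: str) -> dict:
    """Map one raw review record onto a subset of the assumed original columns."""
    # "vote" is assumed to be a string such as "1,234" and to be absent when zero.
    votes = int(str(raw.get("vote", "0")).replace(",", ""))
    # "unixReviewTime" is assumed to be seconds since the epoch (UTC).
    review_date = datetime.fromtimestamp(raw["unixReviewTime"], tz=timezone.utc).date()
    return {
        "customer_id": raw["reviewerID"],
        "product_id": raw["asin"],
        "star_rating": int(raw["overall"]),
        "helpful_votes": votes,
        "verified_purchase": "Y" if raw.get("verified") else "N",
        "review_headline": raw.get("summary", ""),
        "review_body": raw.get("reviewText", ""),
        "review_date": review_date.isoformat(),
        "year": review_date.year,
        "product_category": product_category,
    }


def convert_lines(lines, product_category: str):
    """Yield one mapped row per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield to_pds_row(json.loads(line), product_category)
```

The resulting rows are plain dictionaries, so they can be loaded into a Spark DataFrame and written out as Parquet; columns of the original schema that have no counterpart in the raw data are left out here.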
