Apache Hudi Core Conceptions

A set of notebooks that explore and explain the core concepts of Apache Hudi, such as file layouts, file sizing, compaction, and clustering.

① The notebooks use a public dataset, amazon-reviews-pds, located at s3://amazon-reviews-pds. It is accessible from AWS global regions; users in AWS China regions or outside AWS can download it to local storage with S3 client tools.

② The notebooks run on Amazon EMR Studio, a managed notebook service for Amazon EMR. If you do not have an AWS account, you can adapt the notebooks to any notebook environment that supports a Spark kernel.

③ The recommended Spark cluster configuration is 32 vCores and 120 GB of memory or more; the master node must have at least 100 GB of free disk space.
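For readers adapting the notebooks to a non-EMR environment (note ② above), the following is a minimal sketch of the Spark settings a plain PySpark session typically needs before it can read and write Hudi tables. The helper names `hudi_spark_conf` and `build_session` are hypothetical, and the bundle coordinate in the usage line is only an example; it must match the Spark and Scala versions of your own cluster.

```python
def hudi_spark_conf(hudi_bundle: str) -> dict:
    """Return Spark settings commonly used to enable Hudi in a Spark session.

    hudi_bundle is a Maven coordinate for a Hudi Spark bundle; choosing the
    right one for your Spark/Scala version is left to the reader.
    """
    return {
        # Pull the Hudi bundle from Maven when the session starts.
        "spark.jars.packages": hudi_bundle,
        # Hudi's Spark guide asks for Kryo serialization.
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        # Enables Hudi's Spark SQL extensions.
        "spark.sql.extensions": "org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
    }


def build_session(app_name: str, hudi_bundle: str):
    """Create a SparkSession with the settings above (requires pyspark)."""
    # Imported lazily so hudi_spark_conf() can be used without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in hudi_spark_conf(hudi_bundle).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

In a notebook this might be called as `spark = build_session("hudi-notebooks", "org.apache.hudi:hudi-spark3.3-bundle_2.12:0.13.1")`. On EMR the Hudi jars are already provided by the platform, so this step is only relevant elsewhere.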

Update Notes

@2023-08-22: The public dataset "amazon-reviews-pds" at s3://amazon-reviews-pds was recently taken offline. You can download the raw data from https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/, but its format and schema differ from the original Parquet files on s3://amazon-reviews-pds, so you need to clean and reformat the raw data yourself.
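The cleaning step mentioned above could start from something like the sketch below. It assumes the raw files are JSON lines with fields such as `reviewerID`, `asin`, `overall`, `vote`, `verified`, `summary`, `reviewText` and `unixReviewTime`, and that the target columns follow the original amazon-reviews-pds naming (`customer_id`, `product_id`, `star_rating`, and so on). Both field lists are assumptions to verify against the actual files and notebooks, and `to_pds_like` is a hypothetical helper, not part of this repository.

```python
import json
from datetime import datetime, timezone


def to_pds_like(raw: dict, category: str) -> dict:
    """Map one raw review record to a row shaped like the original dataset.

    Only a subset of columns is produced; the field names on both sides
    are assumptions and may need adjusting.
    """
    # Vote counts may be stored as strings with thousands separators, e.g. "1,024".
    votes = str(raw.get("vote", "0")).replace(",", "")
    review_date = datetime.fromtimestamp(raw["unixReviewTime"], tz=timezone.utc).date()
    return {
        "customer_id": raw["reviewerID"],
        "product_id": raw["asin"],
        "product_category": category,
        "star_rating": int(raw["overall"]),
        "helpful_votes": int(votes),
        "verified_purchase": "Y" if raw.get("verified") else "N",
        "review_headline": raw.get("summary", ""),
        "review_body": raw.get("reviewText", ""),
        "review_date": review_date.isoformat(),
        "year": review_date.year,
    }


def convert_lines(lines, category: str):
    """Yield converted rows from an iterable of JSON-lines strings, skipping blanks."""
    for line in lines:
        if line.strip():
            yield to_pds_like(json.loads(line), category)
```

The resulting rows can then be loaded into a Spark DataFrame and written out as Parquet, so that the later notebooks see a layout close to the one they were written against.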

.ipynb_checkpoints
1-data-preparation.ipynb
2-cow-file-layouts-file-sizing.ipynb
3-mor-file-layouts-file-sizing.ipynb
4-mor-compaction.ipynb
5-cow-clustering.ipynb
6-cow-bloom-index.ipynb
7-cow-bucket-index.ipynb
README.md
hudi-stat.sh

bluishglc/apache-hudi-core-conceptions

Folders and files

Latest commit

History

Repository files navigation

Apache Hudi Core Conceptions

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages