diff --git a/website/docs/streaming-lakehouse/integrate-data-lakes/catalogs/glue.md b/website/docs/streaming-lakehouse/integrate-data-lakes/catalogs/glue.md new file mode 100644 index 0000000000..bdc2c757d8 --- /dev/null +++ b/website/docs/streaming-lakehouse/integrate-data-lakes/catalogs/glue.md @@ -0,0 +1,167 @@ +--- +title: AWS Glue +sidebar_position: 2 +--- + +# AWS Glue + +## Introduction + +[AWS Glue](https://aws.amazon.com/glue/) is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. AWS Glue includes a central metadata repository, known as the AWS Glue Data Catalog, which is fully compatible with Apache Iceberg. + +This guide explains how to configure Fluss to use AWS Glue as its Iceberg catalog. For general Iceberg integration details (table mapping, data types, limitations), see [Iceberg](../formats/iceberg.md). + +## How It Works + +When Fluss is configured with AWS Glue as its Iceberg catalog: + +1. Fluss creates and manages Iceberg database and table metadata directly within the AWS Glue Data Catalog. +2. The [tiering service](maintenance/tiered-storage/lakehouse-storage.md#start-the-datalake-tiering-service) writes data files to Amazon S3 and commits snapshots to the Glue Data Catalog. +3. Any AWS native or external query engine (such as Amazon Athena, Amazon EMR, AWS Glue Jobs, Snowflake, Trino, Flink, or Spark) can discover and query the tiered tables through AWS Glue. + +## Prerequisites + +### AWS IAM Permissions + +Fluss and the tiering service require appropriate IAM permissions to interact with AWS Glue and S3. 
Below is a minimal IAM policy template: + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Action": [ + "glue:CreateDatabase", + "glue:GetDatabase", + "glue:GetDatabases", + "glue:UpdateDatabase", + "glue:DeleteDatabase", + "glue:CreateTable", + "glue:GetTable", + "glue:GetTables", + "glue:UpdateTable", + "glue:DeleteTable" + ], + "Resource": [ + "arn:aws:glue:::catalog", + "arn:aws:glue:::database/*", + "arn:aws:glue:::table/*" + ] + }, + { + "Effect": "Allow", + "Action": [ + "s3:GetObject", + "s3:PutObject", + "s3:DeleteObject", + "s3:ListBucket" + ], + "Resource": [ + "arn:aws:s3:::", + "arn:aws:s3:::/*" + ] + } + ] +} +``` + +### Prepare Required JARs + +You must place the required Iceberg AWS integration JARs into the classpath of both Fluss and the Flink tiering service. + +#### For Fluss Servers (Coordinator & Tablet Servers) + +Place the following JARs in the `FLUSS_HOME/plugins/iceberg/` directory: + +1. **Iceberg AWS Integration**: [iceberg-aws-1.10.1.jar](https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-aws/1.10.1/iceberg-aws-1.10.1.jar) +2. **AWS SDK Bundle**: [iceberg-aws-bundle-1.10.1.jar](https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-aws-bundle/1.10.1/iceberg-aws-bundle-1.10.1.jar) +3. **Failsafe**: [failsafe-3.3.2.jar](https://repo1.maven.org/maven2/dev/failsafe/failsafe/3.3.2/failsafe-3.3.2.jar) + +#### For the Flink Tiering Service + +Place the same three JARs into the `${FLINK_HOME}/lib` directory. + +## Configure Fluss with AWS Glue + +### Cluster Configuration + +Add the following configuration parameters to your `server.yaml`: + +```yaml +datalake.format: iceberg +datalake.iceberg.type: glue +datalake.iceberg.warehouse: s3:/// +datalake.iceberg.client.region: +datalake.iceberg.io-impl: org.apache.iceberg.aws.s3.S3FileIO +``` + +:::tip +Fluss strips the `datalake.iceberg.` prefix and passes the remaining properties to the Iceberg Glue catalog client. 
You can configure any additional [Iceberg AWS integration properties](https://iceberg.apache.org/docs/1.10.1/aws/) (such as S3 endpoint overrides or credentials provider configurations) using this prefix. +::: + +#### Authentication Methods + +**1. IAM Role / Default Credentials Provider Chain (Recommended)** + +If Fluss and Flink are running in an AWS environment (e.g., EKS, ECS, or EC2) with attached IAM roles, you do not need to configure credentials in `server.yaml`. The catalog will automatically resolve credentials using the default provider chain. + +**2. Static Credentials (Not Recommended for Production)** + +If you need to supply static credentials for testing, add them to `server.yaml`: + +```yaml +datalake.iceberg.s3.access-key-id: +datalake.iceberg.s3.secret-access-key: +``` + +### Start Tiering Service + +Follow the [Iceberg tiering service setup](../formats/iceberg.md#start-tiering-service-to-iceberg) instructions to start the tiering service. Provide the Glue catalog parameters when launching the Flink tiering job: + +```bash +${FLINK_HOME}/bin/flink run /path/to/fluss-flink-tiering-$FLUSS_VERSION$.jar \ + --fluss.bootstrap.servers :9123 \ + --datalake.format iceberg \ + --datalake.iceberg.type glue \ + --datalake.iceberg.warehouse s3:/// \ + --datalake.iceberg.client.region \ + --datalake.iceberg.io-impl org.apache.iceberg.aws.s3.S3FileIO +``` + +## Usage Example + +### Create a Datalake-Enabled Table + +Connect to Fluss via Flink SQL and create a table with data lake enabled: + +```sql title="Flink SQL" +USE CATALOG fluss_catalog; + +CREATE TABLE customer_orders ( + `order_id` BIGINT, + `customer_name` STRING, + `total_amount` DECIMAL(15, 2), + `order_date` STRING, + PRIMARY KEY (`order_id`) NOT ENFORCED +) WITH ( + 'table.datalake.enabled' = 'true', + 'table.datalake.freshness' = '30s' +); +``` + +Fluss will automatically provision the database (if it does not exist) and create the corresponding Iceberg table within the AWS Glue Data Catalog. 
Once data is ingested and the tiering service runs, the Parquet data files are stored in S3. + +### Query Data with Athena + +Because the table metadata is stored in the Glue Data Catalog, you can query your tiered data directly in Amazon Athena: + +```sql title="Athena SQL" +SELECT * FROM fluss_database.customer_orders; +``` + +## Further Reading + +- [Iceberg Integration](../formats/iceberg.md) - Table mapping, data types, and configurations. +- [Lakehouse Storage](maintenance/tiered-storage/lakehouse-storage.md) - General tiered storage overview. +- [Iceberg AWS Integration Docs](https://iceberg.apache.org/docs/latest/aws/) - Detailed properties for Glue and S3 configs. diff --git a/website/docs/streaming-lakehouse/integrate-data-lakes/catalogs/hive.md b/website/docs/streaming-lakehouse/integrate-data-lakes/catalogs/hive.md new file mode 100644 index 0000000000..0f88b3a00b --- /dev/null +++ b/website/docs/streaming-lakehouse/integrate-data-lakes/catalogs/hive.md @@ -0,0 +1,125 @@ +--- +title: Hive Metastore +sidebar_position: 3 +--- + +# Hive Metastore + +## Introduction + +The **Hive Metastore (HMS)** is a central metadata repository commonly used in Apache Hadoop and other big data ecosystems to store schema and metadata information for tables. Apache Iceberg provides native integration with Hive Metastore, storing Iceberg table names and metadata locations directly within HMS. + +This guide explains how to configure Fluss to use Hive Metastore as its Iceberg catalog. For general Iceberg integration details (table mapping, data types, limitations), see [Iceberg](../formats/iceberg.md). + +## How It Works + +When Fluss is configured with Hive Metastore as its Iceberg catalog: + +1. Fluss manages Iceberg databases and tables via the HMS Thrift API. +2. The [tiering service](maintenance/tiered-storage/lakehouse-storage.md#start-the-datalake-tiering-service) writes Parquet data files to HDFS (or S3/OSS) and commits table snapshots via the Hive Metastore client. +3. 
Other query engines (such as Spark, Trino, Flink, and StarRocks) configured with a Hive catalog can discover and query these Iceberg tables directly from HMS. + +## Prerequisites + +### Running Hive Metastore + +Ensure you have a running Hive Metastore service. By default, HMS listens on Thrift port `9083` (e.g., `thrift://:9083`). + +### Prepare Required JARs + +Because the Hive catalog implementation is not bundled in `iceberg-core`, you must supply the Hive catalog and Hadoop client dependencies. + +#### For Fluss Servers (Coordinator & Tablet Servers) + +Download and place the following JARs in the `FLUSS_HOME/plugins/iceberg/` directory: + +1. **Iceberg Hive Runtime**: [iceberg-hive-runtime-1.10.1.jar](https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-hive-runtime/1.10.1/iceberg-hive-runtime-1.10.1.jar) +2. **Pre-bundled Hadoop JAR** (if not using an existing Hadoop environment): [hadoop-apache-3.3.5-2.jar](https://repo1.maven.org/maven2/io/trino/hadoop/hadoop-apache/3.3.5-2/hadoop-apache-3.3.5-2.jar) + +#### For the Flink Tiering Service + +Place the same JAR files in the `${FLINK_HOME}/lib` directory. + +### Hadoop Classpath Configuration + +Both Fluss and Flink must be able to load the Hadoop classes and Hadoop-related configuration files (e.g., `core-site.xml`, `hdfs-site.xml`) in order to resolve HDFS file paths. + +**Option 1: Export Hadoop Environment Classpath (Recommended)** + +Export `HADOOP_CLASSPATH` before launching Fluss servers and Flink: + +```bash +export HADOOP_CLASSPATH=`hadoop classpath` +``` + +**Option 2: Place Hadoop XML Configs** + +Ensure that your `core-site.xml` and `hdfs-site.xml` files are copied to a directory on the configuration classpath of both Fluss and Flink. 
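For Option 2, a quick pre-flight check can catch a missing file before the servers start. The following is a minimal sketch, not part of Fluss, Flink, or Hadoop: `check_hadoop_conf` is a hypothetical helper name, and the directory you pass to it is whatever configuration directory your deployment actually uses.

```bash
# Hypothetical helper (an assumption for illustration, not shipped with
# Fluss, Flink, or Hadoop): returns 0 only if both Hadoop config files
# exist in the given directory, and reports each missing file on stderr.
check_hadoop_conf() {
  conf_dir="$1"
  missing=0
  for name in core-site.xml hdfs-site.xml; do
    if [ ! -f "$conf_dir/$name" ]; then
      echo "missing: $conf_dir/$name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call; the path is a placeholder for your own config directory:
#   check_hadoop_conf /path/to/conf
```

Run the check once against the Fluss configuration directory and once against the Flink one, since both processes need to see the files.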
+ +## Configure Fluss with Hive Metastore + +### Cluster Configuration + +Add the following configuration parameters to your `server.yaml`: + +```yaml +datalake.format: iceberg +datalake.iceberg.type: hive +datalake.iceberg.uri: thrift://:9083 +datalake.iceberg.warehouse: hdfs://:9000/user/hive/warehouse +``` + +:::note +If your Hive warehouse is located on cloud object storage (like Amazon S3 or Aliyun OSS), set `datalake.iceberg.warehouse` to the corresponding cloud URI (e.g., `s3:///warehouse`) and configure the required filesystem integration. See [AWS Glue](glue.md) for AWS credentials setup. +::: + +### Start Tiering Service + +Follow the [Iceberg tiering service setup](../formats/iceberg.md#start-tiering-service-to-iceberg) instructions to prepare the environment. Launch the Flink tiering job with Hive Metastore catalog configurations: + +```bash +${FLINK_HOME}/bin/flink run /path/to/fluss-flink-tiering-$FLUSS_VERSION$.jar \ + --fluss.bootstrap.servers :9123 \ + --datalake.format iceberg \ + --datalake.iceberg.type hive \ + --datalake.iceberg.uri thrift://:9083 \ + --datalake.iceberg.warehouse hdfs://:9000/user/hive/warehouse +``` + +## Usage Example + +### Create a Datalake-Enabled Table + +Create a Fluss table with data lake tiering enabled using Flink SQL: + +```sql title="Flink SQL" +USE CATALOG fluss_catalog; + +CREATE TABLE daily_events ( + `event_id` BIGINT, + `event_type` STRING, + `severity` STRING, + `event_date` STRING, + PRIMARY KEY (`event_id`) NOT ENFORCED +) WITH ( + 'table.datalake.enabled' = 'true', + 'table.datalake.freshness' = '30s' +); +``` + +Fluss will create a corresponding Iceberg table inside the Hive Metastore under the database matching your Fluss namespace. The metadata will point to the HDFS warehouse directory. 
+ +### Query Data with Spark + +Since HMS manages the metadata, you can register HMS as an Iceberg catalog in Apache Spark and query the tiered table immediately: + +```sql title="Spark SQL" +-- Query the tiered Iceberg table from Hive Metastore catalog +SELECT * FROM hive_catalog.fluss_database.daily_events; +``` + +## Further Reading + +- [Iceberg Integration](../formats/iceberg.md) - Table mapping, data types, and configurations. +- [Lakehouse Storage](maintenance/tiered-storage/lakehouse-storage.md) - General tiered storage overview. +- [Iceberg Hive Catalog Docs](https://iceberg.apache.org/docs/latest/hive/) - Official Iceberg Hive documentation. diff --git a/website/docs/streaming-lakehouse/integrate-data-lakes/formats/iceberg.md b/website/docs/streaming-lakehouse/integrate-data-lakes/formats/iceberg.md index bb54356c1f..e54577d5d7 100644 --- a/website/docs/streaming-lakehouse/integrate-data-lakes/formats/iceberg.md +++ b/website/docs/streaming-lakehouse/integrate-data-lakes/formats/iceberg.md @@ -58,10 +58,10 @@ This approach enables passing custom configurations for Iceberg catalog initiali Fluss supports all Iceberg-compatible catalog types: **Built-in Catalog Types:** -- `hive` - Hive Metastore catalog +- `hive` - Hive Metastore catalog (see [Hive Metastore](../catalogs/hive.md)) - `hadoop` - Hadoop catalog -- `rest` - REST catalog -- `glue` - AWS Glue catalog +- `rest` - REST catalog (see [Lakekeeper](../catalogs/lakekeeper.md) for a REST catalog example) +- `glue` - AWS Glue catalog (see [AWS Glue](../catalogs/glue.md)) - `nessie` - Nessie catalog - `jdbc` - JDBC catalog