Docs: new index page #366

Merged
merged 10 commits into from
Aug 30, 2023
4 changes: 4 additions & 0 deletions docs/modules/hive/images/hive_overview.drawio.svg
2 changes: 1 addition & 1 deletion docs/modules/hive/pages/getting_started/first_steps.adoc
Original file line number Diff line number Diff line change
@@ -72,4 +72,4 @@ For further testing we recommend to use e.g. the python https://github.com/quint

== What's next

Have a look at the xref:usage.adoc[] page to find out more about the features of the Operator.
Have a look at the xref:usage-guide/index.adoc[usage guide] to find out more about the features of the Operator.
61 changes: 41 additions & 20 deletions docs/modules/hive/pages/index.adoc
@@ -1,32 +1,53 @@
= Stackable Operator for Apache Hive
:description: The Stackable Operator for Apache Hive is a Kubernetes operator that can manage Apache Hive metastores. Learn about its features, resources, dependencies and demos, and see the list of supported Hive versions.
:keywords: Stackable Operator, Hadoop, Apache Hive, Kubernetes, k8s, operator, engineer, big data, metadata, storage, query

This is an operator for Kubernetes that can manage https://hive.apache.org[Apache Hive] metastores.
The Apache Hive metastore (HMS) stores information on the location of tables and partitions in file and blob storages such as HDFS and S3.
This is an operator for Kubernetes that can manage https://hive.apache.org[Apache Hive] metastores.
The Apache Hive metastore (HMS) was originally developed as part of Apache Hive. It stores information on the location of tables and partitions in file and blob storages such as xref:hdfs:index.adoc[Apache HDFS] and S3, and is now also used by tools other than Hive to access tables stored in files.
This Operator does not support deploying Hive itself, but xref:trino:index.adoc[Trino] is recommended as an alternative query engine.

Only the metastore is supported, not Hive itself.
There are several reasons why running Hive on Kubernetes may not be an optimal solution.
The most obvious reason is that Hive requires YARN as an execution framework, and YARN assumes much of the same role as Kubernetes - i.e. assigning resources.
For this reason we provide xref:trino:index.adoc[Trino] as a query engine in the Stackable Data Platform instead of Hive. Trino still uses the Hive Metastore, hence the inclusion of this operator as well.
There are multiple tools that can use the HMS:

* HiveServer2
** This is the "original" tool using the HMS.
** It offers an endpoint, where you can submit HiveQL (similar to SQL) queries.
** It needs an execution engine, e.g. YARN or Spark.
*** This operator does not support running HiveServer2 because of the complexity of operating YARN on Kubernetes. YARN is a resource manager and is not meant to run on Kubernetes, which already manages its own resources.
*** We offer Trino as an (often drop-in) replacement (see below).
* Trino
** Takes SQL queries and executes them against the tables, whose metadata are stored in HMS.
** It should offer all the capabilities Hive offers including a lot of additional functionality, such as connections to other data sources.
* Spark
** Takes SQL or programmatic jobs and executes them against the tables, whose metadata are stored in HMS.
* And others
== Getting started

Follow the xref:getting_started/index.adoc[Getting started guide] to install the Stackable Hive Operator and its dependencies. It walks you through setting up a Hive metastore and connecting it to a demo PostgreSQL database and a MinIO instance to store data in.

Afterwards you can consult the xref:usage-guide/index.adoc[] to learn more about tailoring your Hive metastore configuration to your needs, or have a look at the <<demos, demos>> for some example setups with either xref:trino:index.adoc[Trino] or xref:spark-k8s:index.adoc[Spark].

== Operator model

The Operator manages the _HiveCluster_ custom resource. The cluster implements a single `metastore` xref:home:concepts:roles-and-role-groups.adoc[role].

image::hive_overview.drawio.svg[A diagram depicting the Kubernetes resources created by the Stackable Operator for Apache Hive]

For every role group the Operator creates a ConfigMap and StatefulSet which can have multiple replicas (Pods). Every role group is accessible through its own Service, and there is a Service for the whole cluster.

The Operator creates a xref:concepts:service_discovery.adoc[service discovery ConfigMap] for the Hive metastore instance. The discovery ConfigMap contains information on how to connect to the HMS.

== Dependencies

The Stackable Operator for Apache Hive depends on the Stackable xref:commons-operator:index.adoc[commons] and xref:secret-operator:index.adoc[secret] operators.

== Required external component: An SQL database

The Hive metastore requires a database to store metadata.
Consult the xref:required-external-components.adoc[required external components page] for an overview of the supported databases and minimum supported versions.

== [[demos]]Demos

Three demos make use of the Hive metastore.

The xref:stackablectl::demos/spark-k8s-anomaly-detection-taxi-data.adoc[] and xref:stackablectl::demos/trino-taxi-data.adoc[] use the HMS to store metadata information about taxi data. The first demo then analyzes the data using xref:spark-k8s:index.adoc[Apache Spark] and the second one using xref:trino:index.adoc[Trino].

The xref:stackablectl::demos/data-lakehouse-iceberg-trino-spark.adoc[] demo is the biggest demo available. It uses both Spark and Trino for analysis.

== Why is the Hive query engine not supported?

Only the metastore is supported, not Hive itself.
There are several reasons why running Hive on Kubernetes may not be an optimal solution.
The most obvious reason is that Hive requires YARN as an execution framework, and YARN assumes much of the same role as Kubernetes - i.e. assigning resources.
For this reason we provide xref:trino:index.adoc[Trino] as a query engine in the Stackable Data Platform instead of Hive. Trino still uses the Hive Metastore, hence the inclusion of this operator as well. Trino should offer all the capabilities Hive offers including a lot of additional functionality, such as connections to other data sources.

Additionally, tables in the HMS can also be accessed from xref:spark-k8s:index.adoc[Apache Spark].

== Supported Versions

The Stackable Operator for Apache Hive currently supports the following versions of Hive:
@@ -1,4 +1,4 @@

= Cluster Operation
= Cluster operation

Hive installations can be configured with different cluster operations like pausing reconciliation or stopping the cluster. See xref:concepts:cluster_operations.adoc[cluster operations] for more details.
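
A minimal sketch of what this can look like on a HiveCluster, assuming the `clusterOperation` fields described on the linked concepts page (`reconciliationPaused` and `stopped`). Treat the field names and the values shown here as assumptions and check that page for the authoritative definition:

[source,yaml]
----
spec:
  clusterOperation:
    reconciliationPaused: false # assumption: set to true to pause reconciliation
    stopped: false # assumption: set to true to stop the cluster
----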
@@ -0,0 +1,92 @@
= Configuration & environment overrides

The cluster definition also supports overriding configuration properties and environment variables, either per role or per role group, where the more specific override (role group) has precedence over the less specific one (role).

IMPORTANT: Overriding certain properties that are set by the operator (such as the HTTP port) can interfere with the operator and lead to problems.

== Configuration Properties

For a role or role group, at the same level as `config`, you can specify `configOverrides` for the following files:

- `hive-site.xml`
- `security.properties`

For example, if you want to set the `datanucleus.connectionPool.maxPoolSize` for the metastore to 20, adapt the `metastore` section of the cluster resource like so:

[source,yaml]
----
metastore:
  roleGroups:
    default:
      config: [...]
      configOverrides:
        hive-site.xml:
          datanucleus.connectionPool.maxPoolSize: "20"
      replicas: 1
----

Just as for the `config`, it is possible to specify this at role level as well:

[source,yaml]
----
metastore:
  configOverrides:
    hive-site.xml:
      datanucleus.connectionPool.maxPoolSize: "20"
  roleGroups:
    default:
      config: [...]
      replicas: 1
----
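
As a combined illustration of the precedence rule, the sketch below sets the same property at both levels. The values `"10"` and `"20"` are chosen for illustration only. Going by the rule stated at the top of this page, the `default` role group should end up with `"20"`, while a role group without its own override would inherit `"10"` from the role:

[source,yaml]
----
metastore:
  configOverrides:
    hive-site.xml:
      datanucleus.connectionPool.maxPoolSize: "10" # role level, applies to all role groups
  roleGroups:
    default:
      configOverrides:
        hive-site.xml:
          datanucleus.connectionPool.maxPoolSize: "20" # role group level, takes precedence
      replicas: 1
----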

All override property values must be strings. The properties will be formatted and escaped correctly into the XML file.

For a full list of configuration options we refer to the Hive https://cwiki.apache.org/confluence/display/hive/configuration+properties[Configuration Reference].

== The security.properties file

The `security.properties` file is used to configure JVM security properties. It is very seldom that users need to tweak any of these, but there is one use-case that stands out, and that users need to be aware of: the JVM DNS cache.

The JVM manages its own cache of successfully resolved host names as well as a cache of host names that cannot be resolved. Some products of the Stackable platform are very sensitive to the contents of these caches, and their performance is heavily affected by them. As of version 3.1.3, Apache Hive performs poorly if the positive cache is disabled. To cache resolved host names, you can configure the TTL of entries in the positive cache like this:

[source,yaml]
----
metastore:
  configOverrides:
    security.properties:
      networkaddress.cache.ttl: "30"
      networkaddress.cache.negative.ttl: "0"
----

NOTE: The operator configures DNS caching by default as shown in the example above.

For details on JVM security, see https://docs.oracle.com/en/java/javase/11/security/java-security-overview1.html


== Environment Variables

In a similar fashion, environment variables can be (over)written. For example per role group:

[source,yaml]
----
metastore:
  roleGroups:
    default:
      config: {}
      envOverrides:
        MY_ENV_VAR: "MY_VALUE"
      replicas: 1
----

or per role:

[source,yaml]
----
metastore:
  envOverrides:
    MY_ENV_VAR: "MY_VALUE"
  roleGroups:
    default:
      config: {}
      replicas: 1
----
37 changes: 37 additions & 0 deletions docs/modules/hive/pages/usage-guide/data-storage.adoc
@@ -0,0 +1,37 @@
= Data storage backends

Hive does not store data, only metadata. It can store metadata about data stored in various places. The Stackable Operator currently supports S3 and HDFS.

== [[s3]]S3 support

Hive supports creating tables in S3 compatible object stores.
To use this feature you need to provide connection details for the object store using the xref:concepts:s3.adoc[S3Connection] in the top level `clusterConfig`.

An example usage can look like this:

[source,yaml]
----
clusterConfig:
  s3:
    inline:
      host: minio
      port: 9000
      accessStyle: Path
      credentials:
        secretClass: simple-hive-s3-secret-class
----


== [[hdfs]]Apache HDFS support

As well as S3, Hive also supports creating tables in HDFS.
You can add the HDFS connection in the top level `clusterConfig` as follows:

[source,yaml]
----
clusterConfig:
  hdfs:
    configMap: my-hdfs-cluster # Name of the HdfsCluster
----

Read about the xref:hdfs:index.adoc[Stackable Operator for Apache HDFS] to learn more about setting up HDFS.
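
For context, the HDFS fragment above could sit in a complete HiveCluster manifest roughly as follows. This is a sketch only: the cluster name `simple-hive-hdfs` is hypothetical, `my-hdfs-cluster` is assumed to be an existing HdfsCluster, and the Derby database settings are borrowed from the Derby example purely for brevity.

[source,yaml]
----
---
apiVersion: hive.stackable.tech/v1alpha1
kind: HiveCluster
metadata:
  name: simple-hive-hdfs # hypothetical name
spec:
  image:
    productVersion: 3.1.3
  clusterConfig:
    database: # Derby settings reused from the Derby example, not suitable for production
      connString: jdbc:derby:;databaseName=/tmp/metastore_db;create=true
      user: APP
      password: mine
      dbType: derby
    hdfs:
      configMap: my-hdfs-cluster # assumed to be an existing HdfsCluster
  metastore:
    roleGroups:
      default:
        replicas: 1
----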
142 changes: 142 additions & 0 deletions docs/modules/hive/pages/usage-guide/derby-example.adoc
@@ -0,0 +1,142 @@

= Derby example

Please note that the version you need to specify is not only the version of Apache Hive which you want to roll out, but has to be amended with a Stackable version as shown.
This Stackable version is the version of the underlying container image which is used to execute the processes.
For a list of available versions please check our https://repo.stackable.tech/#browse/browse:docker:v2%2Fstackable%2Fhive%2Ftags[image registry].
It should generally be safe to simply use the latest image version that is available.

.Create a single node Apache Hive Metastore cluster using Derby:
[source,yaml]
----
---
apiVersion: hive.stackable.tech/v1alpha1
kind: HiveCluster
metadata:
  name: simple-hive-derby
spec:
  image:
    productVersion: 3.1.3
  clusterConfig:
    database:
      connString: jdbc:derby:;databaseName=/tmp/metastore_db;create=true
      user: APP
      password: mine
      dbType: derby
  metastore:
    roleGroups:
      default:
        replicas: 1
----

WARNING: You should not use the `Derby` database with more than one replica or in production. Derby stores data locally, so the data is not shared between different metastore Pods and is lost after Pod restarts.

To create a single node Apache Hive Metastore cluster with Derby and S3 access, first deploy MinIO (or use any available S3 bucket):
[source,bash]
----
helm install minio \
minio \
--repo https://charts.bitnami.com/bitnami \
--set auth.rootUser=minio-access-key \
--set auth.rootPassword=minio-secret-key
----

To upload data to MinIO, we need a port-forward to access the web UI.
[source,bash]
----
kubectl port-forward service/minio 9001
----
Then, connect to localhost:9001 and log in with the user `minio-access-key` and password `minio-secret-key`. Create a bucket and upload data.

Deploy the Hive cluster:
[source,yaml]
----
---
apiVersion: hive.stackable.tech/v1alpha1
kind: HiveCluster
metadata:
  name: simple-hive-derby
spec:
  image:
    productVersion: 3.1.3
  clusterConfig:
    database:
      connString: jdbc:derby:;databaseName=/stackable/metastore_db;create=true
      user: APP
      password: mine
      dbType: derby
    s3:
      inline:
        host: minio
        port: 9000
        accessStyle: Path
        credentials:
          secretClass: simple-hive-s3-secret-class
  metastore:
    roleGroups:
      default:
        replicas: 1
---
apiVersion: secrets.stackable.tech/v1alpha1
kind: SecretClass
metadata:
  name: simple-hive-s3-secret-class
spec:
  backend:
    k8sSearch:
      searchNamespace:
        pod: {}
---
apiVersion: v1
kind: Secret
metadata:
  name: simple-hive-s3-secret
  labels:
    secrets.stackable.tech/class: simple-hive-s3-secret-class
stringData:
  accessKey: minio-access-key
  secretKey: minio-secret-key
----


To create a single node Apache Hive Metastore using PostgreSQL, deploy a PostgreSQL instance via helm.

[sidebar]
PostgreSQL introduced a new way to encrypt its passwords in version 10.
This is called `scram-sha-256` and has been the default as of PostgreSQL 14.
Unfortunately, Hive up until the latest 3.3.x version ships with JDBC drivers that do https://wiki.postgresql.org/wiki/List_of_drivers[_not_ support] this method.
You might see an error message like this:
`The authentication type 10 is not supported.`
If this is the case please either use an older PostgreSQL version or change its https://www.postgresql.org/docs/current/runtime-config-connection.html#GUC-PASSWORD-ENCRYPTION[`password_encryption`] setting to `md5`.

This installs PostgreSQL in version 10 to work around the issue mentioned above:
[source,bash]
----
helm install hive bitnami/postgresql --version=12.1.5 \
--set postgresqlUsername=hive \
--set postgresqlPassword=hive \
--set postgresqlDatabase=hive
----

.Create Hive Metastore using a PostgreSQL database
[source,yaml]
----
apiVersion: hive.stackable.tech/v1alpha1
kind: HiveCluster
metadata:
  name: simple-hive-postgres
spec:
  image:
    productVersion: 3.1.3
  clusterConfig:
    database:
      connString: jdbc:postgresql://hive-postgresql.default.svc.cluster.local:5432/hive
      user: hive
      password: hive
      dbType: postgres
  metastore:
    roleGroups:
      default:
        replicas: 1
----

4 changes: 4 additions & 0 deletions docs/modules/hive/pages/usage-guide/index.adoc
@@ -0,0 +1,4 @@
= Usage guide
:page-aliases: usage.adoc

This section helps you use and configure the Stackable Operator for Apache Hive in various ways. You should already be familiar with how to set up a basic instance. If not, follow the xref:getting_started/index.adoc[] guide to learn how to set up a basic instance with all the required dependencies.
18 changes: 18 additions & 0 deletions docs/modules/hive/pages/usage-guide/logging.adoc
@@ -0,0 +1,18 @@
= Log aggregation

The logs can be forwarded to a Vector log aggregator by providing a discovery
ConfigMap for the aggregator and by enabling the log agent:

[source,yaml]
----
spec:
  clusterConfig:
    vectorAggregatorConfigMapName: vector-aggregator-discovery
  metastore:
    config:
      logging:
        enableVectorAgent: true
----

Further information on how to configure logging can be found in
xref:concepts:logging.adoc[].
4 changes: 4 additions & 0 deletions docs/modules/hive/pages/usage-guide/monitoring.adoc
@@ -0,0 +1,4 @@
= Monitoring

The managed Hive instances are automatically configured to export Prometheus metrics. See
xref:operators:monitoring.adoc[] for more details.