docs: wording improvements, better install instructions #470

Merged · 3 commits · Sep 30, 2024
Changes from 2 commits
15 changes: 8 additions & 7 deletions docs/modules/spark-k8s/pages/getting_started/first_steps.adoc
@@ -8,7 +8,7 @@ Afterwards you can <<_verify_that_it_works, verify that it works>> by looking at

A Spark application is made up of three components:

* Job: this will build a `spark-submit` command from the resource, passing this to internal spark code together with templates for building the driver and executor pods
* Job: this builds a `spark-submit` command from the resource, passing this to internal spark code together with templates for building the driver and executor pods
* Driver: the driver starts the designated number of executors and removes them when the job is completed.
* Executor(s): responsible for executing the job itself

@@ -25,20 +25,21 @@ Where:
* `spec.version`: SparkApplication version (1.0). This can be freely set by the user and is added by the operator as a label to all workload resources created by the application.
* `spec.sparkImage`: the image used by the job, driver and executor pods. This can be a custom image built by the user or an official Stackable image. Available official images are listed in the Stackable https://repo.stackable.tech/#browse/browse:docker:v2%2Fstackable%spark-k8s%2Ftags[image registry].
* `spec.mode`: only `cluster` is currently supported
* `spec.mainApplicationFile`: the artifact (Java, Scala or Python) that forms the basis of the Spark job. This path is relative to the image, so in this case we are running an example python script (that calculates the value of pi): it is bundled with the Spark code and therefore already present in the job image
* `spec.mainApplicationFile`: the artifact (Java, Scala or Python) that forms the basis of the Spark job.
This path is relative to the image, so in this case an example python script (which calculates the value of pi) is run: it is bundled with the Spark code and therefore already present in the job image
* `spec.driver`: driver-specific settings.
* `spec.executor`: executor-specific settings.
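
For orientation, the fields above fit together roughly as in the following condensed sketch (the image tag and the path of the bundled pi example are assumptions and may differ from the example used in this guide):

[source,yaml]
----
---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: pyspark-pi
spec:
  version: "1.0"      # freely chosen, added as a label to all workload resources
  sparkImage: docker.stackable.tech/stackable/spark-k8s:3.5.0-stackable0.0.0-dev  # assumed tag
  mode: cluster       # only cluster mode is currently supported
  mainApplicationFile: local:///stackable/spark/examples/src/main/python/pi.py    # assumed path of the bundled example
  driver: {}          # driver-specific settings
  executor:
    instances: 3      # executor-specific settings
----
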

== Verify that it works

As mentioned above, the SparkApplication that has just been created will build a `spark-submit` command and pass it to the driver Pod, which in turn will create executor Pods that run for the duration of the job before being clean up.
A running process will look like this:
As mentioned above, the SparkApplication that has just been created builds a `spark-submit` command and passes it to the driver Pod, which in turn creates executor Pods that run for the duration of the job before being cleaned up.
A running process looks like this:

image::getting_started/spark_running.png[Spark job]

* `pyspark-pi-xxxx`: this is the initializing job that creates the spark-submit command (named as `metadata.name` with a unique suffix)
* `pyspark-pi-xxxxxxx-driver`: the driver pod that drives the execution
* `pythonpi-xxxxxxxxx-exec-x`: the set of executors started by the driver (in our example `spec.executor.instances` was set to 3 which is why we have 3 executors)
* `pythonpi-xxxxxxxxx-exec-x`: the set of executors started by the driver (in our example `spec.executor.instances` was set to 3 which is why 3 executors are running)

Job progress can be followed by issuing this command:

@@ -48,11 +49,11 @@ include::example$getting_started/getting_started.sh[tag=wait-for-job]

When the job completes, the driver cleans up the executor.
The initial job is persisted for several minutes before being removed.
The completed state will look like this:
The completed state looks like this:

image::getting_started/spark_complete.png[Completed job]

The driver logs can be inspected for more information about the results of the job.
In this case we expect to find the results of our (approximate!) pi calculation:
In this case, the result of our (approximate!) pi calculation can be found:

image::getting_started/spark_log.png[Driver log]
6 changes: 3 additions & 3 deletions docs/modules/spark-k8s/pages/getting_started/index.adoc
@@ -1,11 +1,11 @@
= Getting started

This guide will get you started with Spark using the Stackable Operator for Apache Spark.
It will guide you through the installation of the Operator and its dependencies, executing your first Spark job and reviewing its result.
This guide gets you started with Spark using the Stackable operator for Apache Spark.
It guides you through the installation of the operator and its dependencies, executing your first Spark job and reviewing its result.

== Prerequisites

You will need:
You need:

* a Kubernetes cluster
* kubectl
34 changes: 19 additions & 15 deletions docs/modules/spark-k8s/pages/getting_started/installation.adoc
@@ -1,7 +1,7 @@
= Installation
:description: Learn how to set up Spark with the Stackable Operator, from installation to running your first job, including prerequisites and resource recommendations.

On this page you will install the Stackable Spark-on-Kubernetes operator as well as the commons, secret and listener operators
Install the Stackable Spark operator as well as the commons, secret and listener operators
which are required by all Stackable operators.

== Dependencies
@@ -18,24 +18,26 @@

== Stackable Operators

There are 2 ways to install Stackable operators
There are multiple ways to install the Stackable Operator for Apache Spark.
xref:management:stackablectl:index.adoc[] is the preferred way, but Helm is also supported.

OpenShift users may prefer installing the operator from the RedHat Certified Operator catalog using the OpenShift web console.

. Using xref:management:stackablectl:index.adoc[]
. Using a Helm chart

=== stackablectl

`stackablectl` is the command line tool to interact with Stackable operators and our recommended way to install Operators.
[tabs]
====
stackablectl::
+
--
`stackablectl` is the command line tool to interact with Stackable operators and our recommended way to install operators.
Follow the xref:management:stackablectl:installation.adoc[installation steps] for your platform.

After you have installed `stackablectl` run the following command to install the Spark-k8s operator:
After you have installed `stackablectl`, run the following command to install the Spark operator:

[source,bash]
----
include::example$getting_started/getting_started.sh[tag=stackablectl-install-operators]
----

The tool will show
The tool shows

[source]
----
@@ -44,24 +46,26 @@

TIP: Consult the xref:management:stackablectl:quickstart.adoc[] to learn more about how to use stackablectl.
For example, you can use the `--cluster kind` flag to create a Kubernetes cluster with link:https://kind.sigs.k8s.io/[kind].
--

=== Helm

You can also use Helm to install the operator.
Helm::
+
--
Add the Stackable Helm repository:
[source,bash]
----
include::example$getting_started/getting_started.sh[tag=helm-add-repo]
----

Then install the Stackable Operators:
Install the Stackable Operators:
[source,bash]
----
include::example$getting_started/getting_started.sh[tag=helm-install-operators]
----

Helm will deploy the operators in a Kubernetes Deployment and apply the CRDs for the SparkApplication (as well as the CRDs for the required operators).
You are now ready to create a Spark job.
--
====

== What's next

4 changes: 2 additions & 2 deletions docs/modules/spark-k8s/pages/index.adoc
@@ -22,7 +22,7 @@ Its in-memory processing and fault-tolerant architecture make it ideal for a var
== Getting started

Follow the xref:getting_started/index.adoc[] guide to get started with Apache Spark using the Stackable operator.
The guide will lead you through the installation of the operator and running your first Spark application on Kubernetes.
The guide leads you through the installation of the operator and running your first Spark application on Kubernetes.

== How the operator works

@@ -62,7 +62,7 @@ A ConfigMap supplies the necessary configuration, and there is a service to conn
The {spark-rbac}[Spark-Kubernetes RBAC documentation] describes what is needed for `spark-submit` jobs to run successfully:
minimally a role/cluster-role to allow the driver pod to create and manage executor pods.

However, to add security each `spark-submit` job launched by the operator will be assigned its own ServiceAccount.
However, to add security, each `spark-submit` job launched by the operator is assigned its own ServiceAccount.

During the operator installation, a cluster role named `spark-k8s-clusterrole` is created with pre-defined permissions.

@@ -10,7 +10,7 @@ This operator accepts the following command line parameters:

*Multiple values:* false

The operator will **only** watch for resources in the provided namespace `test`:
The operator **only** watches for resources in the provided namespace `test`:

[source]
----
@@ -10,7 +10,7 @@ This operator accepts the following environment variables:

*Multiple values:* false

The operator will **only** watch for resources in the provided namespace `test`:
The operator **only** watches for resources in the provided namespace `test`:

[source]
----
12 changes: 6 additions & 6 deletions docs/modules/spark-k8s/pages/usage-guide/examples.adoc
@@ -4,7 +4,8 @@
The following examples have these `spec` fields in common:

* `version`: the current version is "1.0"
* `sparkImage`: the docker image that will be used by job, driver and executor pods. This can be provided by the user.
* `sparkImage`: the docker image that is used by job, driver and executor pods.
This can be provided by the user.
* `mode`: only `cluster` is currently supported
* `mainApplicationFile`: the artifact (Java, Scala or Python) that forms the basis of the Spark job.
* `args`: the arguments passed directly to the application. In the examples below this is, for example, the input path for part of the public New York taxi dataset.
@@ -22,10 +23,10 @@ Job-specific settings are annotated below.
include::example$example-sparkapp-image.yaml[]
----

<1> Job image: this contains the job artifact that will be retrieved from the volume mount backed by the PVC
<1> Job image: this contains the job artifact that is retrieved from the volume mount backed by the PVC
<2> Job python artifact (local)
<3> Job argument (external)
<4> List of python job requirements: these will be installed in the pods via `pip`
<4> List of python job requirements: these are installed in the Pods via `pip`.
<5> Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in an S3 store)
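
As a sketch of how callout <4> is expressed in the SparkApplication (the package shown is only an example; check the field layout against the current CRD):

[source,yaml]
----
spec:
  deps:
    requirements:
      - tabulate==0.8.9   # example Python package installed into the Pods via pip
----
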

== JVM (Scala): externally located artifact and dataset
@@ -34,7 +35,6 @@
----
include::example$example-sparkapp-pvc.yaml[]
----

<1> Job artifact located on S3.
<2> Job main class
<3> Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in an S3 store, accessed without credentials)
@@ -70,5 +70,5 @@ include::example$example-sparkapp-configmap.yaml[]
<3> Job scala artifact that requires an input argument
<4> The volume backed by the configuration map
<5> The expected job argument, accessed via the mounted configuration map file
<6> The name of the volume backed by the configuration map that will be mounted to the driver/executor
<7> The mount location of the volume (this will contain a file `/arguments/job-args.txt`)
<6> The name of the volume backed by the configuration map that is mounted to the driver/executor
<7> The mount location of the volume (this contains a file `/arguments/job-args.txt`)
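
Sketched out, callouts <4> to <7> boil down to a ConfigMap holding the argument file, which is then mounted into the driver and executor Pods (the name and argument value below are hypothetical):

[source,yaml]
----
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: cm-job-arguments              # hypothetical name, referenced by the volume
data:
  job-args.txt: |
    s3a://my-bucket/input/data.csv    # hypothetical job argument
----

The SparkApplication declares a volume backed by this ConfigMap and mounts it at `/arguments`, so the job reads its argument from `/arguments/job-args.txt`.
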
25 changes: 14 additions & 11 deletions docs/modules/spark-k8s/pages/usage-guide/history-server.adoc
@@ -2,10 +2,8 @@
:description: Set up Spark History Server on Kubernetes to access Spark logs via S3, with configuration for cleanups and web UI access details.
:page-aliases: history_server.adoc

== Overview

The Stackable Spark-on-Kubernetes operator runs Apache Spark workloads in a Kubernetes cluster, whereby driver- and executor-pods are created for the duration of the job and then terminated.
One or more Spark History Server instances can be deployed independently of SparkApplication jobs and used as an end-point for spark logging, so that job information can be viewed once the job pods are no longer available.
One or more Spark History Server instances can be deployed independently of SparkApplication jobs and used as an endpoint for Spark logging, so that job information can be viewed once the job pods are no longer available.

== Deployment

@@ -14,25 +12,30 @@ The event logs are loaded from an S3 bucket named `spark-logs` and the folder `e
The credentials for this bucket are provided by the secret class `s3-credentials-class`.
For more details on how the Stackable Data Platform manages S3 resources see the xref:concepts:s3.adoc[S3 resources] page.


[source,yaml]
----
include::example$example-history-server.yaml[]
----

<1> The location of the event logs. Must be a S3 bucket. Future implementations might add support for other shared filesystems such as HDFS.
<2> Folder within the S3 bucket where the log files are located. This folder is required and must exist before setting up the history server.
<1> The location of the event logs.
Must be an S3 bucket.
Future implementations might add support for other shared filesystems such as HDFS.
<2> Directory within the S3 bucket where the log files are located.
This directory is required and must exist before setting up the history server.
<3> The S3 bucket definition, here provided in-line.
<4> Additional history server configuration properties can be provided here as a map. For possible properties see: https://spark.apache.org/docs/latest/monitoring.html#spark-history-server-configuration-options
<5> This deployment has only one Pod. Multiple history servers can be started, all reading the same event logs by increasing the replica count.
<6> This history server will automatically clean up old log files by using default properties. You can change any of these by using the `sparkConf` map.
<4> Additional history server configuration properties can be provided here as a map.
For possible properties see: https://spark.apache.org/docs/latest/monitoring.html#spark-history-server-configuration-options
<5> This deployment has only one Pod.
Multiple history servers can be started, all reading the same event logs by increasing the replica count.
<6> This history server automatically cleans up old log files by using default properties.
Change any of these by using the `sparkConf` map.

NOTE: Only one role group can have scheduled cleanups enabled (`cleaner: true`), and this role group cannot have more than 1 replica.
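
A hedged sketch of such a role group, assuming the history server role is named `nodes` and the cleaner flag lives in the role group `config`:

[source,yaml]
----
  nodes:
    roleGroups:
      cleaner:
        replicas: 1       # must not exceed 1 when cleanups are enabled
        config:
          cleaner: true   # only one role group may enable this
----
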

The secret with S3 credentials must contain at least the following two keys:

* `accessKey` - the access key of a user with read and write access to the event log bucket.
* `secretKey` - the secret key of a user with read and write access to the event log bucket.
* `accessKey` -- the access key of a user with read and write access to the event log bucket.
* `secretKey` -- the secret key of a user with read and write access to the event log bucket.

Any other entries of the Secret are ignored by the operator.
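
For illustration, such a Secret might look like the sketch below; the name and the SecretClass label are assumptions about how `s3-credentials-class` finds its credentials:

[source,yaml]
----
---
apiVersion: v1
kind: Secret
metadata:
  name: history-credentials                             # hypothetical name
  labels:
    secrets.stackable.tech/class: s3-credentials-class  # assumed matching label
stringData:
  accessKey: YOUR_ACCESS_KEY
  secretKey: YOUR_SECRET_KEY
----
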

7 changes: 5 additions & 2 deletions docs/modules/spark-k8s/pages/usage-guide/listenerclass.adoc
@@ -1,8 +1,11 @@
= Service exposition with ListenerClasses

The Spark Operator deploys SparkApplications, and does not offer a UI or other API, so no services are exposed. However, the Operator can also deploy HistoryServers, which do offer a UI and API. The Operator deploys a service called `<name>-historyserver` (where `<name>` is the name of the HistoryServer) through which HistoryServer can be reached.
The Spark operator deploys SparkApplications, and does not offer a UI or other API, so no services are exposed.
However, the operator can also deploy HistoryServers, which do offer a UI and API.
The operator deploys a service called `<name>-historyserver` (where `<name>` is the name of the HistoryServer) through which HistoryServer can be reached.

This service can have three different types: `cluster-internal`, `external-unstable` and `external-stable`. Read more about the types in the xref:concepts:service-exposition.adoc[service exposition] documentation at platform level.
This service can have three different types: `cluster-internal`, `external-unstable` and `external-stable`.
Read more about the types in the xref:concepts:service-exposition.adoc[service exposition] documentation at platform level.

This is how the ListenerClass is configured:

10 changes: 6 additions & 4 deletions docs/modules/spark-k8s/pages/usage-guide/logging.adoc
@@ -1,6 +1,8 @@
= Logging

The Spark operator installs a https://vector.dev/docs/setup/deployment/roles/#agent[vector agent] as a side-car container in every application Pod except the `job` Pod that runs `spark-submit`. It also configures the logging framework to output logs in XML format. This is the same https://logging.apache.org/log4j/2.x/manual/layouts.html#XMLLayout[format] used across all Stackable products and it enables the https://vector.dev/docs/setup/deployment/roles/#aggregator[vector aggregator] to collect logs across the entire platform.
The Spark operator installs a https://vector.dev/docs/setup/deployment/roles/#agent[vector agent] as a side-car container in every application Pod except the `job` Pod that runs `spark-submit`.
It also configures the logging framework to output logs in XML format.
This is the same https://logging.apache.org/log4j/2.x/manual/layouts.html#XMLLayout[format] used across all Stackable products, and it enables the https://vector.dev/docs/setup/deployment/roles/#aggregator[vector aggregator] to collect logs across the entire platform.


It is the user's responsibility to install and configure the vector aggregator, but the agents can discover the aggregator automatically using a discovery ConfigMap as described in the xref:concepts:logging.adoc[logging concepts].

@@ -35,12 +37,12 @@
level: INFO
...
----
<1> Name of a ConfigMap that referenced the vector aggregator. See example below.
<1> Name of a ConfigMap that references the vector aggregator.
See example below.
<2> Enable the vector agent in the history pod.
<3> Configure log levels for file and console outputs.
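
Taken together, the callouts above correspond to a configuration roughly like this sketch (the ConfigMap name, role name and container key are assumptions):

[source,yaml]
----
spec:
  vectorAggregatorConfigMapName: vector-aggregator-discovery  # assumed name of the discovery ConfigMap
  nodes:
    config:
      logging:
        enableVectorAgent: true      # side-car vector agent in the history pod
        containers:
          spark-history:             # container key is an assumption
            console:
              level: INFO
            file:
              level: INFO
----
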

Example vector aggregator configuration.

.Example vector aggregator configuration
[source,yaml]
----
---
@@ -1,7 +1,10 @@
= Spark Applications
= Spark applications

Spark applications are submitted to the Spark Operator as SparkApplication resources. These resources are used to define the configuration of the Spark job, including the image to use, the main application file, and the number of executors to start.
Spark applications are submitted to the Spark Operator as SparkApplication resources.
These resources are used to define the configuration of the Spark job, including the image to use, the main application file, and the number of executors to start.

Upon creation, the application's status set to `Unknown`. As the operator creates the necessary resources, the status of the application transitions through different phases that reflect the phase of the driver Pod. A successful application will eventually reach the `Succeeded` phase.
Upon creation, the application's status is set to `Unknown`.
As the operator creates the necessary resources, the status of the application transitions through different phases that reflect the phase of the driver Pod. A successful application eventually reaches the `Succeeded` phase.
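
As a rough illustration (the exact status layout is an assumption, not taken from the CRD), a finished application might report something like:

[source,yaml]
----
status:
  phase: Succeeded   # assumed field name; phases mirror the driver Pod's phase
----
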

NOTE: The operator will never reconcile an application once it has been created. To resubmit an application, a new SparkApplication resource must be created.
NOTE: The operator never reconciles an application once it has been created.
To resubmit an application, a new SparkApplication resource must be created.