diff --git a/docs/modules/spark-k8s/pages/getting_started/first_steps.adoc b/docs/modules/spark-k8s/pages/getting_started/first_steps.adoc
index 675965a1..4fe84603 100644
--- a/docs/modules/spark-k8s/pages/getting_started/first_steps.adoc
+++ b/docs/modules/spark-k8s/pages/getting_started/first_steps.adoc
@@ -8,7 +8,7 @@ Afterwards you can <<_verify_that_it_works, verify that it works>> by looking at
 A Spark application is made of up three components:
 
-* Job: this will build a `spark-submit` command from the resource, passing this to internal spark code together with templates for building the driver and executor pods
+* Job: this builds a `spark-submit` command from the resource, passing this to internal spark code together with templates for building the driver and executor pods
 * Driver: the driver starts the designated number of executors and removes them when the job is completed.
 * Executor(s): responsible for executing the job itself
@@ -25,20 +25,21 @@ Where:
 * `spec.version`: SparkApplication version (1.0). This can be freely set by the users and is added by the operator as label to all workload resources created by the application.
 * `spec.sparkImage`: the image used by the job, driver and executor pods. This can be a custom image built by the user or an official Stackable image. Available official images are listed in the Stackable https://repo.stackable.tech/#browse/browse:docker:v2%2Fstackable%spark-k8s%2Ftags[image registry].
 * `spec.mode`: only `cluster` is currently supported
-* `spec.mainApplicationFile`: the artifact (Java, Scala or Python) that forms the basis of the Spark job. This path is relative to the image, so in this case we are running an example python script (that calculates the value of pi): it is bundled with the Spark code and therefore already present in the job image
+* `spec.mainApplicationFile`: the artifact (Java, Scala or Python) that forms the basis of the Spark job.
+  This path is relative to the image, so in this case an example Python script (that calculates the value of pi) is run: it is bundled with the Spark code and therefore already present in the job image
 * `spec.driver`: driver-specific settings.
 * `spec.executor`: executor-specific settings.
 
 == Verify that it works
 
-As mentioned above, the SparkApplication that has just been created will build a `spark-submit` command and pass it to the driver Pod, which in turn will create executor Pods that run for the duration of the job before being clean up.
-A running process will look like this:
+As mentioned above, the SparkApplication that has just been created builds a `spark-submit` command and passes it to the driver Pod, which in turn creates executor Pods that run for the duration of the job before being cleaned up.
+A running process looks like this:
 
 image::getting_started/spark_running.png[Spark job]
 
 * `pyspark-pi-xxxx`: this is the initializing job that creates the spark-submit command (named as `metadata.name` with a unique suffix)
 * `pyspark-pi-xxxxxxx-driver`: the driver pod that drives the execution
-* `pythonpi-xxxxxxxxx-exec-x`: the set of executors started by the driver (in our example `spec.executor.instances` was set to 3 which is why we have 3 executors)
+* `pythonpi-xxxxxxxxx-exec-x`: the set of executors started by the driver (in the example `spec.executor.instances` was set to 3, which is why 3 executors are running)
 
 Job progress can be followed by issuing this command:
 
@@ -48,11 +49,11 @@ include::example$getting_started/getting_started.sh[tag=wait-for-job]
 When the job completes the driver cleans up the executor.
 The initial job is persisted for several minutes before being removed.
-The completed state will look like this:
+The completed state looks like this:
 
 image::getting_started/spark_complete.png[Completed job]
 
 The driver logs can be inspected for more information about the results of the job.
-In this case we expect to find the results of our (approximate!) pi calculation:
+In this case the result of the (approximate!) pi calculation can be found:
 
 image::getting_started/spark_log.png[Driver log]
diff --git a/docs/modules/spark-k8s/pages/getting_started/index.adoc b/docs/modules/spark-k8s/pages/getting_started/index.adoc
index 1754249e..ed7a05b7 100644
--- a/docs/modules/spark-k8s/pages/getting_started/index.adoc
+++ b/docs/modules/spark-k8s/pages/getting_started/index.adoc
@@ -1,11 +1,11 @@
 = Getting started
 
-This guide will get you started with Spark using the Stackable Operator for Apache Spark.
-It will guide you through the installation of the Operator and its dependencies, executing your first Spark job and reviewing its result.
+This guide gets you started with Spark using the Stackable operator for Apache Spark.
+It guides you through the installation of the operator and its dependencies, executing your first Spark job and reviewing its result.
 
 == Prerequisites
 
-You will need:
+You need:
 
 * a Kubernetes cluster
 * kubectl
diff --git a/docs/modules/spark-k8s/pages/getting_started/installation.adoc b/docs/modules/spark-k8s/pages/getting_started/installation.adoc
index c12e629c..f95893c8 100644
--- a/docs/modules/spark-k8s/pages/getting_started/installation.adoc
+++ b/docs/modules/spark-k8s/pages/getting_started/installation.adoc
@@ -1,7 +1,7 @@
 = Installation
 :description: Learn how to set up Spark with the Stackable Operator, from installation to running your first job, including prerequisites and resource recommendations.
 
-On this page you will install the Stackable Spark-on-Kubernetes operator as well as the commons, secret and listener operators
+Install the Stackable Spark operator as well as the commons, secret and listener operators
 which are required by all Stackable operators.
 
 == Dependencies
@@ -18,24 +18,26 @@ More information about the different ways to define Spark jobs and their depende
 
 == Stackable Operators
 
-There are 2 ways to install Stackable operators
+There are multiple ways to install the Stackable operator for Apache Spark.
+xref:management:stackablectl:index.adoc[] is the preferred way, but Helm is also supported.
+OpenShift users may prefer installing the operator from the Red Hat Certified Operator catalog using the OpenShift web console.
 
-. Using xref:management:stackablectl:index.adoc[]
-. Using a Helm chart
-
-=== stackablectl
-
-`stackablectl` is the command line tool to interact with Stackable operators and our recommended way to install Operators.
+[tabs]
+====
+stackablectl::
++
+--
+`stackablectl` is the command line tool to interact with Stackable operators and our recommended way to install operators.
 Follow the xref:management:stackablectl:installation.adoc[installation steps] for your platform.
 
-After you have installed `stackablectl` run the following command to install the Spark-k8s operator:
+After you have installed `stackablectl`, run the following command to install the Spark operator:
 
 [source,bash]
 ----
 include::example$getting_started/getting_started.sh[tag=stackablectl-install-operators]
 ----
 
-The tool will show
+The tool shows
 
 [source]
 ----
 include::example$getting_started/install_output.txt[]
 ----
 
 TIP: Consult the xref:management:stackablectl:quickstart.adoc[] to learn more about how to use stackablectl.
 For example, you can use the `--cluster kind` flag to create a Kubernetes cluster with link:https://kind.sigs.k8s.io/[kind].
+--
 
-=== Helm
-
-You can also use Helm to install the operator.
+Helm::
++
+--
 Add the Stackable Helm repository:
 
 [source,bash]
 ----
 include::example$getting_started/getting_started.sh[tag=helm-add-repo]
 ----
 
-Then install the Stackable Operators:
+Install the Stackable Operators:
 
 [source,bash]
 ----
 include::example$getting_started/getting_started.sh[tag=helm-install-operators]
 ----
 
 Helm will deploy the operators in a Kubernetes Deployment and apply the CRDs for the SparkApplication (as well as the CRDs for the required operators).
-You are now ready to create a Spark job.
+--
+====
 
 == What's next
diff --git a/docs/modules/spark-k8s/pages/index.adoc b/docs/modules/spark-k8s/pages/index.adoc
index 2239e220..eb41ac98 100644
--- a/docs/modules/spark-k8s/pages/index.adoc
+++ b/docs/modules/spark-k8s/pages/index.adoc
@@ -22,7 +22,7 @@ Its in-memory processing and fault-tolerant architecture make it ideal for a var
 == Getting started
 
 Follow the xref:getting_started/index.adoc[] guide to get started with Apache Spark using the Stackable operator.
-The guide will lead you through the installation of the operator and running your first Spark application on Kubernetes.
+The guide leads you through the installation of the operator and running your first Spark application on Kubernetes.
 
 == How the operator works
@@ -62,7 +62,7 @@ A ConfigMap supplies the necessary configuration, and there is a service to conn
 The {spark-rbac}[Spark-Kubernetes RBAC documentation] describes what is needed for `spark-submit` jobs to run successfully:
 minimally a role/cluster-role to allow the driver pod to create and manage executor pods.
 
-However, to add security each `spark-submit` job launched by the operator will be assigned its own ServiceAccount.
+However, to add security, each `spark-submit` job launched by the operator is assigned its own ServiceAccount.
 
 During the operator installation, a cluster role named `spark-k8s-clusterrole` is created with pre-defined permissions.
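+
+To verify this on a running cluster, the created RBAC objects can be inspected with standard `kubectl` commands (an illustrative sketch, not part of the operator tooling; ServiceAccount names depend on the jobs you have submitted):
+
+[source,bash]
+----
+# Show the pre-defined permissions of the cluster role created by the operator
+kubectl describe clusterrole spark-k8s-clusterrole
+
+# List the ServiceAccounts in the namespace where jobs were submitted
+kubectl get serviceaccounts
+----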
diff --git a/docs/modules/spark-k8s/pages/reference/commandline-parameters.adoc b/docs/modules/spark-k8s/pages/reference/commandline-parameters.adoc
index e2b7ff8f..b52765c7 100644
--- a/docs/modules/spark-k8s/pages/reference/commandline-parameters.adoc
+++ b/docs/modules/spark-k8s/pages/reference/commandline-parameters.adoc
@@ -10,7 +10,7 @@ This operator accepts the following command line parameters:
 
 *Multiple values:* false
 
-The operator will **only** watch for resources in the provided namespace `test`:
+The operator **only** watches for resources in the provided namespace `test`:
 
 [source]
 ----
diff --git a/docs/modules/spark-k8s/pages/reference/environment-variables.adoc b/docs/modules/spark-k8s/pages/reference/environment-variables.adoc
index 71fd0eac..8743497d 100644
--- a/docs/modules/spark-k8s/pages/reference/environment-variables.adoc
+++ b/docs/modules/spark-k8s/pages/reference/environment-variables.adoc
@@ -10,7 +10,7 @@ This operator accepts the following environment variables:
 
 *Multiple values:* false
 
-The operator will **only** watch for resources in the provided namespace `test`:
+The operator **only** watches for resources in the provided namespace `test`:
 
 [source]
 ----
diff --git a/docs/modules/spark-k8s/pages/usage-guide/examples.adoc b/docs/modules/spark-k8s/pages/usage-guide/examples.adoc
index 81b220b7..8e9211e1 100644
--- a/docs/modules/spark-k8s/pages/usage-guide/examples.adoc
+++ b/docs/modules/spark-k8s/pages/usage-guide/examples.adoc
@@ -4,7 +4,8 @@ The following examples have the following `spec` fields in common:
 
 * `version`: the current version is "1.0"
-* `sparkImage`: the docker image that will be used by job, driver and executor pods. This can be provided by the user.
+* `sparkImage`: the docker image that is used by job, driver and executor pods.
+  This can be provided by the user.
 * `mode`: only `cluster` is currently supported
 * `mainApplicationFile`: the artifact (Java, Scala or Python) that forms the basis of the Spark job.
 * `args`: these are the arguments passed directly to the application. In the examples below it is e.g. the input path for part of the public New York taxi dataset.
@@ -22,10 +23,10 @@ Job-specific settings are annotated below.
 include::example$example-sparkapp-image.yaml[]
 ----
 
-<1> Job image: this contains the job artifact that will be retrieved from the volume mount backed by the PVC
+<1> Job image: this contains the job artifact that is retrieved from the volume mount backed by the PVC
 <2> Job python artifact (local)
 <3> Job argument (external)
-<4> List of python job requirements: these will be installed in the pods via `pip`
+<4> List of python job requirements: these are installed in the Pods via `pip`.
 <5> Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in an S3 store)
 
 == JVM (Scala): externally located artifact and dataset
 
 [source,yaml]
 ----
 include::example$example-sparkapp-pvc.yaml[]
 ----
-
 <1> Job artifact located on S3.
 <2> Job main class
 <3> Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case, in an S3 store, accessed without credentials)
@@ -70,5 +70,5 @@ include::example$example-sparkapp-configmap.yaml[]
 <3> Job scala artifact that requires an input argument
 <4> The volume backed by the configuration map
 <5> The expected job argument, accessed via the mounted configuration map file
-<6> The name of the volume backed by the configuration map that will be mounted to the driver/executor
-<7> The mount location of the volume (this will contain a file `/arguments/job-args.txt`)
+<6> The name of the volume backed by the configuration map that is mounted to the driver/executor
+<7> The mount location of the volume (this contains a file `/arguments/job-args.txt`)
diff --git a/docs/modules/spark-k8s/pages/usage-guide/history-server.adoc b/docs/modules/spark-k8s/pages/usage-guide/history-server.adoc
index 6f4a0ffc..8f7670ff 100644
--- a/docs/modules/spark-k8s/pages/usage-guide/history-server.adoc
+++ b/docs/modules/spark-k8s/pages/usage-guide/history-server.adoc
@@ -2,10 +2,8 @@
 :description: Set up Spark History Server on Kubernetes to access Spark logs via S3, with configuration for cleanups and web UI access details.
 :page-aliases: history_server.adoc
 
-== Overview
-
 The Stackable Spark-on-Kubernetes operator runs Apache Spark workloads in a Kubernetes cluster, whereby driver- and executor-pods are created for the duration of the job and then terminated.
-One or more Spark History Server instances can be deployed independently of SparkApplication jobs and used as an end-point for spark logging, so that job information can be viewed once the job pods are no longer available.
+One or more Spark History Server instances can be deployed independently of SparkApplication jobs and used as an endpoint for Spark logging, so that job information can be viewed once the job pods are no longer available.
 
 == Deployment
 
@@ -14,25 +12,30 @@ The event logs are loaded from an S3 bucket named `spark-logs` and the folder `e
 The credentials for this bucket are provided by the secret class `s3-credentials-class`.
 For more details on how the Stackable Data Platform manages S3 resources see the xref:concepts:s3.adoc[S3 resources] page.
 
-
 [source,yaml]
 ----
 include::example$example-history-server.yaml[]
 ----
-<1> The location of the event logs. Must be a S3 bucket. Future implementations might add support for other shared filesystems such as HDFS.
-<2> Folder within the S3 bucket where the log files are located. This folder is required and must exist before setting up the history server.
+<1> The location of the event logs.
+  Must be an S3 bucket.
+  Future implementations might add support for other shared filesystems such as HDFS.
+<2> Directory within the S3 bucket where the log files are located.
+  This directory is required and must exist before setting up the history server.
 <3> The S3 bucket definition, here provided in-line.
-<4> Additional history server configuration properties can be provided here as a map. For possible properties see: https://spark.apache.org/docs/latest/monitoring.html#spark-history-server-configuration-options
-<5> This deployment has only one Pod. Multiple history servers can be started, all reading the same event logs by increasing the replica count.
-<6> This history server will automatically clean up old log files by using default properties. You can change any of these by using the `sparkConf` map.
+<4> Additional history server configuration properties can be provided here as a map.
+  For possible properties see: https://spark.apache.org/docs/latest/monitoring.html#spark-history-server-configuration-options
+<5> This deployment has only one Pod.
+  Multiple history servers can be started, all reading the same event logs by increasing the replica count.
+<6> This history server automatically cleans up old log files by using default properties.
+  Change any of these by using the `sparkConf` map.
 
 NOTE: Only one role group can have scheduled cleanups enabled (`cleaner: true`) and this role group cannot have more than 1 replica.
 
 The secret with S3 credentials must contain at least the following two keys:
 
-* `accessKey` - the access key of a user with read and write access to the event log bucket.
-* `secretKey` - the secret key of a user with read and write access to the event log bucket.
+* `accessKey` -- the access key of a user with read and write access to the event log bucket.
+* `secretKey` -- the secret key of a user with read and write access to the event log bucket.
 
 Any other entries of the Secret are ignored by the operator.
diff --git a/docs/modules/spark-k8s/pages/usage-guide/listenerclass.adoc b/docs/modules/spark-k8s/pages/usage-guide/listenerclass.adoc
index 99afcbef..7df222d0 100644
--- a/docs/modules/spark-k8s/pages/usage-guide/listenerclass.adoc
+++ b/docs/modules/spark-k8s/pages/usage-guide/listenerclass.adoc
@@ -1,8 +1,11 @@
 = Service exposition with ListenerClasses
 
-The Spark Operator deploys SparkApplications, and does not offer a UI or other API, so no services are exposed. However, the Operator can also deploy HistoryServers, which do offer a UI and API. The Operator deploys a service called `-historyserver` (where `` is the name of the HistoryServer) through which HistoryServer can be reached.
+The Spark operator deploys SparkApplications, and does not offer a UI or other API, so no services are exposed.
+However, the operator can also deploy HistoryServers, which do offer a UI and API.
+The operator deploys a service called `<name>-historyserver` (where `<name>` is the name of the Spark application) through which the HistoryServer can be reached.
 
-This service can have three different types: `cluster-internal`, `external-unstable` and `external-stable`. Read more about the types in the xref:concepts:service-exposition.adoc[service exposition] documentation at platform level.
+This service can have three different types: `cluster-internal`, `external-unstable` and `external-stable`.
+Read more about the types in the xref:concepts:service-exposition.adoc[service exposition] documentation at platform level.
 
 This is how the ListenerClass is configured:
 
diff --git a/docs/modules/spark-k8s/pages/usage-guide/logging.adoc b/docs/modules/spark-k8s/pages/usage-guide/logging.adoc
index 60166aae..9b2d2bf1 100644
--- a/docs/modules/spark-k8s/pages/usage-guide/logging.adoc
+++ b/docs/modules/spark-k8s/pages/usage-guide/logging.adoc
@@ -1,6 +1,8 @@
 = Logging
 
-The Spark operator installs a https://vector.dev/docs/setup/deployment/roles/#agent[vector agent] as a side-car container in every application Pod except the `job` Pod that runs `spark-submit`. It also configures the logging framework to output logs in XML format. This is the same https://logging.apache.org/log4j/2.x/manual/layouts.html#XMLLayout[format] used across all Stackable products and it enables the https://vector.dev/docs/setup/deployment/roles/#aggregator[vector aggregator] to collect logs across the entire platform.
+The Spark operator installs a https://vector.dev/docs/setup/deployment/roles/#agent[vector agent] as a side-car container in every application Pod except the `job` Pod that runs `spark-submit`.
+It also configures the logging framework to output logs in XML format.
+This is the same https://logging.apache.org/log4j/2.x/manual/layouts.html#XMLLayout[format] used across all Stackable products, and it enables the https://vector.dev/docs/setup/deployment/roles/#aggregator[vector aggregator] to collect logs across the entire platform.
 
 It is the user's responsibility to install and configure the vector aggregator, but the agents can discover the aggregator automatically using a discovery ConfigMap as described in the xref:concepts:logging.adoc[logging concepts].
 
@@ -35,12 +37,12 @@ spec:
 level: INFO
 ...
 ----
-<1> Name of a ConfigMap that referenced the vector aggregator. See example below.
+<1> Name of a ConfigMap that references the vector aggregator.
+  See example below.
 <2> Enable the vector agent in the history pod.
 <3> Configure log levels for file and console outputs.
 
-Example vector aggregator configuration.
-
+.Example vector aggregator configuration
 [source,yaml]
 ----
 ---
diff --git a/docs/modules/spark-k8s/pages/usage-guide/operations/applications.adoc b/docs/modules/spark-k8s/pages/usage-guide/operations/applications.adoc
index ab38f728..cc85b738 100644
--- a/docs/modules/spark-k8s/pages/usage-guide/operations/applications.adoc
+++ b/docs/modules/spark-k8s/pages/usage-guide/operations/applications.adoc
@@ -1,7 +1,10 @@
-= Spark Applications
+= Spark applications
 
-Spark applications are submitted to the Spark Operator as SparkApplication resources. These resources are used to define the configuration of the Spark job, including the image to use, the main application file, and the number of executors to start.
+Spark applications are submitted to the Spark operator as SparkApplication resources.
+These resources are used to define the configuration of the Spark job, including the image to use, the main application file, and the number of executors to start.
 
-Upon creation, the application's status set to `Unknown`. As the operator creates the necessary resources, the status of the application transitions through different phases that reflect the phase of the driver Pod. A successful application will eventually reach the `Succeeded` phase.
+Upon creation, the application's status is set to `Unknown`.
+As the operator creates the necessary resources, the status of the application transitions through different phases that reflect the phase of the driver Pod.
+A successful application eventually reaches the `Succeeded` phase.
 
-NOTE: The operator will never reconcile an application once it has been created. To resubmit an application, a new SparkApplication resource must be created.
+NOTE: The operator never reconciles an application once it has been created.
+To resubmit an application, a new SparkApplication resource must be created.
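+
+A minimal sketch of such a resubmission (the application name `pyspark-pi` and the manifest file name are placeholders for your own application):
+
+[source,bash]
+----
+# Delete the finished SparkApplication, then apply its manifest again to resubmit it
+kubectl delete sparkapplication pyspark-pi
+kubectl apply -f pyspark-pi.yaml
+----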
diff --git a/docs/modules/spark-k8s/pages/usage-guide/resources.adoc b/docs/modules/spark-k8s/pages/usage-guide/resources.adoc
index a2daecb7..bec9bf7f 100644
--- a/docs/modules/spark-k8s/pages/usage-guide/resources.adoc
+++ b/docs/modules/spark-k8s/pages/usage-guide/resources.adoc
@@ -58,7 +58,7 @@ To illustrate resource configuration consider the use-case where resources are d
 
 === CPU
 
-CPU request and limit will be used as defined in the custom resource resulting in the following:
+CPU request and limit are used as defined in the custom resource, resulting in the following:
 
 |===
@@ -87,7 +87,10 @@ Task parallelism (the number of tasks an executor can run concurrently), is dete
 
 === Memory
 
-Memory values are not rounded as is the case with CPU. Values for `spark.{driver|executor}.memory}` - this is the amount of memory to use for the driver process (i.e. where SparkContext is initialized) and executor processes respectively - are passed to Spark in such as a way that the overheads added by Spark are already implicitly declared: this overhead will be applied using a factor of 0.1 (JVM jobs) or 0.4 (non-JVM jobs), being not less than 384MB, the minimum overhead applied by Spark. Once the overhead is applied, the effective value is the one defined by the user. This keeps the values transparent: what is defined in the CRD is what is actually provisioned for the process.
+Memory values are not rounded as is the case with CPU.
+Values for `spark.{driver|executor}.memory` -- this is the amount of memory to use for the driver process (i.e. where SparkContext is initialized) and executor processes respectively -- are passed to Spark in such a way that the overheads added by Spark are already implicitly declared: this overhead is applied using a factor of 0.1 (JVM jobs) or 0.4 (non-JVM jobs), but not less than 384MB, the minimum overhead applied by Spark.
+Once the overhead is applied, the effective value is the one defined by the user.
+This keeps the values transparent: what is defined in the CRD is what is actually provisioned for the process.
 
 An alternative is to do define the spark.conf settings explicitly and then let Spark apply the overheads to those values.
 
@@ -125,7 +128,7 @@ A SparkApplication defines the following resources:
 ...
 ----
 
-This will result in the following Pod definitions:
+This results in the following Pod definitions:
 
 For the job:
diff --git a/docs/modules/spark-k8s/pages/usage-guide/s3.adoc b/docs/modules/spark-k8s/pages/usage-guide/s3.adoc
index e4b93c40..19934e7f 100644
--- a/docs/modules/spark-k8s/pages/usage-guide/s3.adoc
+++ b/docs/modules/spark-k8s/pages/usage-guide/s3.adoc
@@ -2,6 +2,7 @@
 :description: Learn how to configure S3 access in SparkApplications using inline credentials or external resources, including TLS for secure connections.
 
 You can specify S3 connection details directly inside the SparkApplication specification or by referring to an external S3Bucket custom resource.
+Refer to the xref:concepts:s3.adoc[S3 concept documentation] for general information about S3 resources on the Stackable Data Platform.
 
 == S3 access using credentials
 
@@ -20,7 +21,7 @@ s3connection: # <1>
 <1> Entry point for the S3 connection configuration.
 <2> Connection host.
 <3> Optional connection port.
-<4> Name of the `Secret` object expected to contain the following keys: `accessKey` and `secretKey`
+<4> Name of the Secret object expected to contain the following keys: `accessKey` and `secretKey`.
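+
+A minimal sketch of a matching Secret (the name and the values are placeholders; depending on your setup the Secret may additionally have to be tied to a SecretClass):
+
+[source,yaml]
+----
+apiVersion: v1
+kind: Secret
+metadata:
+  name: s3-credentials  # placeholder, must match the name referenced above
+stringData:
+  accessKey: minio-access-key  # example value
+  secretKey: minio-secret-key  # example value
+----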
 It is also possible to configure the connection details as a separate Kubernetes resource and only refer to that object from the SparkApplication like this:
 
@@ -52,7 +53,8 @@ This has the advantage that one connection configuration can be shared across Sp
 
 == S3 access with TLS
 
-A custom certificate can also be used for S3 access. In the example below, a Secret containing a custom certificate is referenced, which will used a to create a custom truststore which is used by Spark for S3-bucket access:
+A custom certificate can be used for S3 access.
+In the example below, a Secret containing a custom certificate is referenced, which is used to create a custom truststore for Spark to access the S3 bucket:
 
 [source,yaml]
 ----
@@ -73,8 +75,9 @@ spec:
 caCert:
   secretClass: minio-tls-certificates # <2>
 ----
-<1> Name of the `Secret` object expected to contain the following keys: `accessKey` and `secretKey` (as in the previous example).
-<2> Name of the `Secret` object containing the custom certificate. The certificate should comprise the 3 files named as shown below:
+<1> Name of the Secret object expected to contain the following keys: `accessKey` and `secretKey` (as in the previous example).
+<2> Name of the Secret object containing the custom certificate.
+  The certificate should comprise the 3 files named as shown below:
 
 [source,yaml]
 ----