
DOC-2674 added support for spark connector #818


Merged
5 changes: 5 additions & 0 deletions modules/data-loading/pages/load-from-spark-dataframe.adoc

include::partial$spark/jdbc-deprecation.adoc[]

[NOTE]
====
Starting with version *4.1.3*, TigerGraph's Spark Connector supports `OAuth2` authentication, enabling it to request JWT tokens from third-party Identity Providers (IdPs) such as `Azure AD` and `Auth0`. This improves security and token management for Spark jobs.
====

== Prerequisites

include::partial$spark/prerequisites.adoc[]
35 changes: 34 additions & 1 deletion modules/data-loading/partials/spark/common-options.adoc
The log will be printed to stderr unless `log.file` is set.
| (none)
| The log file name pattern, e.g., "/tmp/tigergraph-spark-connector.log". It requires setting `log.level` first.
| Logging

| `oauth2.url`
| (none)
a| The URL to which the Client Credentials grant request is sent to retrieve the access token. This URL is provided by the Identity Provider (IdP), such as Azure AD or Auth0.

For *Azure AD:*
`https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token`

For *Auth0:*
`https://{yourDomain}/oauth/token`

*Note:* Use the gadmin command to set `Security.JWT.RSA.PublicKey` in advance so the server can verify the JWT tokens.
| Authentication

| `oauth2.parameters`
| (none)
a| A *stringified JSON* carrying the parameters used to retrieve the access token. The required parameters vary by IdP.

For *Azure AD:*

[source,json]
----
{
  "client_id": "{client_id}",
  "client_secret": "{client_secret}",
  "scope": "https://storage.azure.com/.default"
}
----

For *Auth0:*

[source,json]
----
{
  "client_id": "{client_id}",
  "client_secret": "{client_secret}",
  "audience": "{api_identifier}"
}
----
| Authentication
|===
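The two OAuth2 options above are typically set together. A minimal sketch of an Azure AD configuration, where every `<...>` value is a hypothetical placeholder, not something from this document:

[source,scala]
----
// Sketch only: OAuth2 via Azure AD; replace all <...> placeholders.
val tgOAuthOptions = Map(
  "url" -> "https://<tg-host>:14240",
  "oauth2.url" -> "https://login.microsoftonline.com/<tenant_id>/oauth2/v2.0/token",
  "oauth2.parameters" ->
    """{"client_id": "<client_id>", "client_secret": "<client_secret>", "scope": "https://storage.azure.com/.default"}"""
)
----

The connector then requests and refreshes the JWT automatically using these two values.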
28 changes: 19 additions & 9 deletions modules/data-loading/partials/spark/pre-def-options.adoc
[source,scala]
----
val tgOptions = Map(
  "url" -> "https://<hostname>:14240",
  "username" -> "<username>",
  "password" -> "<password>"
)
----

*Some helpful tips*

* **Connection options object**

To simplify configuration, you can bundle the options related to database connection in a data object (e.g., `tgOptions`) and then use `.options(tgOptions)` when running the connector. This approach helps keep user credentials separate from connection commands.

* **High Availability**

To ensure fault tolerance, provide the URLs of all nodes in the `url` option for the connector:

[source,scala]
----
"url" -> "https://m1:14240,https://m2:14240,https://m3:14240,https://m4:14240",
----
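With the options bundled in `tgOptions`, a read can pass the whole map at once. A minimal sketch, assuming the connector's data source name `tigergraph` and an existing `SparkSession` named `spark` (verify both against your connector version):

[source,scala]
----
// Sketch: pass the bundled connection options to a DataFrame read.
val df = spark.read
  .format("tigergraph")
  .options(tgOptions)
  .load()
----

Keeping credentials in one map also makes it easy to swap authentication methods without touching the read logic.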

=== Authentication Methods

We recommend using *OAuth2* for authentication. Our implementation supports automatic token requests and refresh, and is more secure than *username/password*, offering both convenience and security for batch and streaming jobs.

The Spark connector now supports the following authentication methods:

* `oauth2`: Recommended for both batch and streaming jobs. The token is automatically requested and refreshed using `oauth2.url` and `oauth2.parameters`.
* `token`: Suitable for one-time Spark jobs, but does not support token refresh.
* `username/password`: Works for both batch and streaming jobs, but is considered an older, less secure method. The token is automatically requested and refreshed.
* `secret`: Also works for batch and streaming jobs; the token is managed using the GSQL secret, but this is less secure than OAuth2.
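Each method corresponds to its own connector options. A hedged sketch of the four styles, with `<...>` placeholders that are illustrations, not values from this document:

[source,scala]
----
// One options map per authentication method; merge one of these into tgOptions.
val oauth2Auth = Map(
  "oauth2.url" -> "https://<idp-domain>/oauth/token",
  "oauth2.parameters" -> """{"client_id": "<id>", "client_secret": "<secret>", "audience": "<api_identifier>"}"""
)
val tokenAuth  = Map("token" -> "<jwt_token>")                        // one-time jobs; no refresh
val basicAuth  = Map("username" -> "<user>", "password" -> "<pass>")  // auto-refreshed, less secure
val secretAuth = Map("secret" -> "<gsql_secret>")                     // token managed via GSQL secret
----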