Commit b47d29a

[SPARK-42445][R] Fix SparkR install.spark function
### What changes were proposed in this pull request?

This PR fixes the `SparkR` `install.spark` function.

```
$ curl -LO https://dist.apache.org/repos/dist/dev/spark/v3.3.2-rc1-bin/SparkR_3.3.2.tar.gz
$ R CMD INSTALL SparkR_3.3.2.tar.gz
$ R

R version 4.2.1 (2022-06-23) -- "Funny-Looking Kid"
Copyright (C) 2022 The R Foundation for Statistical Computing
Platform: aarch64-apple-darwin20 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(SparkR)

Attaching package: ‘SparkR’

The following objects are masked from ‘package:stats’:

    cov, filter, lag, na.omit, predict, sd, var, window

The following objects are masked from ‘package:base’:

    as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
    rank, rbind, sample, startsWith, subset, summary, transform, union

> install.spark()
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided. Looking for preferred site from apache website...
Preferred mirror site found: https://dlcdn.apache.org/spark
Downloading spark-3.3.2 for Hadoop 2.7 from:
- https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop2.7.tgz
trying URL 'https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop2.7.tgz'
simpleWarning in download.file(remotePath, localPath): downloaded length 0 != reported length 196

> install.spark(hadoopVersion="3")
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided. Looking for preferred site from apache website...
Preferred mirror site found: https://dlcdn.apache.org/spark
Downloading spark-3.3.2 for Hadoop 3 from:
- https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-3.tgz
trying URL 'https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-3.tgz'
simpleWarning in download.file(remotePath, localPath): downloaded length 0 != reported length 196
```

Note that this is a regression in Spark 3.3.0 and not a blocker for the ongoing Spark 3.3.2 RC vote.

### Why are the changes needed?

https://spark.apache.org/docs/latest/api/R/reference/install.spark.html#ref-usage

![Screenshot 2023-02-14 at 10 07 49 PM](https://user-images.githubusercontent.com/9700541/218946460-ab7eab1b-65ae-4cb2-bc7c-5810ad359ac9.png)

First, the existing Spark 2.0.0 link is broken.

- https://spark.apache.org/docs/latest/api/R/reference/install.spark.html#details
- http://apache.osuosl.org/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz (Broken)

Second, Spark 3.3.0 changed the Hadoop suffix pattern in the distribution file names, so the function fails with the errors shown above.

- http://archive.apache.org/dist/spark/spark-3.2.3/spark-3.2.3-bin-hadoop2.7.tgz (Old Pattern)
- http://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.tgz (New Pattern)

### Does this PR introduce _any_ user-facing change?

No. This fixes a bug so that `install.spark` behaves like Spark 3.2.3 and older versions.

### How was this patch tested?

Passes the CI and manual testing. Note that the link pattern below is correct even though the download fails, because 3.5.0 is not published yet.

```
$ NO_MANUAL=1 ./dev/make-distribution.sh --r
$ R CMD INSTALL R/SparkR_3.5.0-SNAPSHOT.tar.gz
$ R
> library(SparkR)
> install.spark()
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided. Looking for preferred site from apache website...
Preferred mirror site found: https://dlcdn.apache.org/spark
Downloading spark-3.5.0 for Hadoop 3 from:
- https://dlcdn.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
trying URL 'https://dlcdn.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz'
simpleWarning in download.file(remotePath, localPath): downloaded length 0 != reported length 196
```

Closes apache#40031 from dongjoon-hyun/SPARK-42445.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
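For reference, the download URL that `install.spark` builds is just the mirror, the Spark version directory, and the versioned tarball name joined together. A minimal shell sketch of that composition (illustrative values taken from the log above; `install.spark` constructs the equivalent string in R):

```shell
# Compose the Spark download URL the way install.spark does
# (values here are illustrative, matching the 3.3.2 log above).
mirror="https://dlcdn.apache.org/spark"
spark_version="3.3.2"
hadoop_suffix="hadoop3"   # Spark 3.3.0+ naming; 3.2.x and older used e.g. hadoop2.7

url="${mirror}/spark-${spark_version}/spark-${spark_version}-bin-${hadoop_suffix}.tgz"
echo "$url"
# prints: https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz
```

The bug report above shows what happens when the suffix is built with the old pattern: the composed URL (`…-bin-3.tgz` or `…-bin-hadoop2.7.tgz`) does not exist on the mirror and the download returns length 0.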
1 parent 9843c7c commit b47d29a

File tree

1 file changed (+7, -8 lines)

R/pkg/R/install.R

```diff
@@ -29,19 +29,18 @@
 #' \code{mirrorUrl} specifies the remote path to a Spark folder. It is followed by a subfolder
 #' named after the Spark version (that corresponds to SparkR), and then the tar filename.
 #' The filename is composed of four parts, i.e. [Spark version]-bin-[Hadoop version].tgz.
-#' For example, the full path for a Spark 2.0.0 package for Hadoop 2.7 from
-#' \code{http://apache.osuosl.org} has path:
-#' \code{http://apache.osuosl.org/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz}.
+#' For example, the full path for a Spark 3.3.1 package from
+#' \code{https://archive.apache.org} has path:
+#' \code{http://archive.apache.org/dist/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz}.
 #' For \code{hadoopVersion = "without"}, [Hadoop version] in the filename is then
 #' \code{without-hadoop}.
 #'
-#' @param hadoopVersion Version of Hadoop to install. Default is \code{"2.7"}. It can take other
-#' version number in the format of "x.y" where x and y are integer.
+#' @param hadoopVersion Version of Hadoop to install. Default is \code{"3"}.
 #' If \code{hadoopVersion = "without"}, "Hadoop free" build is installed.
 #' See
 #' \href{https://spark.apache.org/docs/latest/hadoop-provided.html}{
 #' "Hadoop Free" Build} for more information.
-#' Other patched version names can also be used, e.g. \code{"cdh4"}
+#' Other patched version names can also be used.
 #' @param mirrorUrl base URL of the repositories to use. The directory layout should follow
 #' \href{https://www.apache.org/dyn/closer.lua/spark/}{Apache mirrors}.
 #' @param localDir a local directory where Spark is installed. The directory contains
@@ -65,7 +64,7 @@
 #' @note install.spark since 2.1.0
 #' @seealso See available Hadoop versions:
 #' \href{https://spark.apache.org/downloads.html}{Apache Spark}
-install.spark <- function(hadoopVersion = "2.7", mirrorUrl = NULL,
+install.spark <- function(hadoopVersion = "3", mirrorUrl = NULL,
                           localDir = NULL, overwrite = FALSE) {
   sparkHome <- Sys.getenv("SPARK_HOME")
   if (isSparkRShell()) {
@@ -251,7 +250,7 @@ defaultMirrorUrl <- function() {
 hadoopVersionName <- function(hadoopVersion) {
   if (hadoopVersion == "without") {
     "without-hadoop"
-  } else if (grepl("^[0-9]+\\.[0-9]+$", hadoopVersion, perl = TRUE)) {
+  } else if (grepl("^[0-9]+$", hadoopVersion, perl = TRUE)) {
     paste0("hadoop", hadoopVersion)
   } else {
     hadoopVersion
```
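The behavioral change in `hadoopVersionName` can be sketched in shell as follows. This is a rough re-implementation for illustration only, not the R code itself: the patched regex accepts a bare major version like `"3"` instead of the old `"x.y"` form.

```shell
# Rough shell equivalent of the patched hadoopVersionName logic:
# "without" maps to without-hadoop, a bare integer like "3" gets a
# "hadoop" prefix, and anything else passes through unchanged.
hadoop_version_name() {
  case "$1" in
    without) echo "without-hadoop" ;;
    *)
      if printf '%s' "$1" | grep -Eq '^[0-9]+$'; then
        echo "hadoop$1"
      else
        echo "$1"
      fi
      ;;
  esac
}

hadoop_version_name 3        # prints: hadoop3
hadoop_version_name without  # prints: without-hadoop
hadoop_version_name 2.7      # prints: 2.7 (the old "x.y" form no longer matches)
```

With the old regex `^[0-9]+\.[0-9]+$`, the new-style default `"3"` fell through to the pass-through branch, producing the nonexistent `spark-3.3.2-bin-3.tgz` filename seen in the bug report.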
