Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade to Spark 3.2.x #6

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

srikar
Copy link

@srikar srikar commented Mar 19, 2022

PR Details

Issue

Fixes #5

Contains

Upgrade to Spark 3.2.1
Upgrade to Scala version to be the latest patch version of 2.12.x (2.12.4 -> 2.12.15)
Fix to a typo in README referencing incorrect version of sbt

Test with Colima

Environment

$ sw_vers
ProductName:	macOS
ProductVersion:	12.2.1
BuildVersion:	21D62
$ ./batect --docker-host=unix://$HOME/.colima/docker.sock --version
Batect version:    0.74.0
Built:             2021-09-05 11:50:34 +0000
Built from commit: 882f6c4ba18ab5385373185e0c9e75f928f980ea (commit date: 2021-09-05 20:20:09 +1000)
JVM version:       Eclipse Adoptium OpenJDK 64-Bit Server VM 11.0.13
OS version:        Mac OS X 12.2.1 x86_64
Docker version:    20.10.11 (API version: 1.41, minimum supported API version: 1.12, commit: 847da184ad5048b27f5bdf9d53d070f731b43180, operating system: 'linux', experimental: false)
Git version:       2.34.0
...
...
(global-anaconda3-2020.07) srikars-mbp-2:dataengineer-transformations-scala sayilava$ colima version
colima version 0.3.4
git commit: 5a4a70481ca8d1e794677f22524e3c1b79a9b4ae

runtime: docker
arch: x86_64
client: v20.10.13
server: v20.10.11

Commands run

./batect --docker-host=unix://$HOME/.colima/docker.sock unit-test
./batect --docker-host=unix://$HOME/.colima/docker.sock style-checks
INPUT_FILE_PATH="src/test/resources/data/words.txt" JOB=thoughtworks.wordcount.WordCount ./batect --docker-host=unix://$HOME/.colima/docker.sock run-job
INPUT_FILE_PATH="src/test/resources/data/citibike.csv" JOB=thoughtworks.ingest.DailyDriver ./batect --docker-host=unix://$HOME/.colima/docker.sock run-job
INPUT_FILE_PATH=${output_parquest_ingest} JOB=thoughtworks.citibike.CitibikeTransformer ./batect --docker-host=unix://$HOME/.colima/docker.sock run-job

Status

All commands ran successfully

Remaining

Copy link
Contributor

@svishal9 svishal9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @srikar , it looks great, I have some minor points for discussion.

@@ -4,14 +4,14 @@ FROM python:$PYTHON_VERSION
USER root
WORKDIR /opt
RUN wget -c --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u131-b11/d54c1d3a095b4ff2b6607d096fa80163/jdk-8u131-linux-x64.tar.gz && \
wget https://downloads.lightbend.com/scala/2.13.5/scala-2.13.5.tgz && \
wget https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz && \
wget https://downloads.lightbend.com/scala/2.12.15/scala-2.12.15.tgz && \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we downgrading version of scala?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are inconsistent versions of scala in our codebase. I was just making them consistent. I am happy to bump it up to 2.13.x if we want to do that @svishal9 @sambhav13

Copy link

@ktiwary-tw ktiwary-tw Jul 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would vote for keeping Scala version as 2.13.x since that's the version used for Spark 3.x

wget https://github.com/sbt/sbt/releases/download/v1.3.0/sbt-1.3.0.tgz
RUN tar xzf jdk-8u131-linux-x64.tar.gz && \
tar xvf scala-2.13.5.tgz && \
tar xvf spark-3.1.1-bin-hadoop3.2.tgz && \
tar xvf scala-2.12.15.tgz && \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we downgrading version of scala?

tar xvf sbt-1.3.0.tgz
ENV PATH="/opt/jdk1.8.0_131/bin:/opt/scala-2.13.5/bin:/opt/spark-3.1.1-bin-hadoop3.2/bin:/opt/sbt/bin:/opt/sbt/bin$PATH"
ENV PATH="/opt/jdk1.8.0_131/bin:/opt/scala-2.12.15/bin:/opt/spark-3.2.1-bin-hadoop3.2/bin:/opt/sbt/bin:/opt/sbt/bin$PATH"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we downgrading version of scala?

@@ -69,7 +69,7 @@ A single `*.csv` file containing data similar to:
#### Run the job

```bash
INPUT_FILE_PATH="src/test/resources/data/words.txt" JOB=thoughtworks.wordcount.WordCount ./batect run-job
INPUT_FILE_PATH="src/test/resources/data/words.txt" OUTPUT_PATH="output" JOB=thoughtworks.wordcount.WordCount ./batect run-job
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make it implicit?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was implicit earlier. However because the output of one stage was the input for the other, the implicit path didn't work. And we had to call this out explicitly.

@@ -103,7 +103,7 @@ Historical bike ride `*.csv` file:
##### Run the job

```bash
INPUT_FILE_PATH="src/test/resources/data/citibike.csv" JOB=thoughtworks.ingest.DailyDriver ./batect run-job
INPUT_FILE_PATH="src/test/resources/data/citibike.csv" OUTPUT_PATH="output_parquet_ingest" JOB=thoughtworks.ingest.DailyDriver ./batect run-job
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make it implicit ? like a pattern?

@@ -132,7 +132,7 @@ Historical bike ride `*.parquet` files
##### Run the job

```bash
INPUT_FILE_PATH=${output_parquest_ingest} JOB=thoughtworks.citibike.CitibikeTransformer ./batect run-job
INPUT_FILE_PATH="output_parquet_ingest" OUTPUT_PATH="output" JOB=thoughtworks.citibike.CitibikeTransformer ./batect run-job
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

@@ -1,14 +1,14 @@

scalaVersion := "2.12.4"
scalaVersion := "2.12.15"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is interesting, so we installed 2.13 in previous commits but we were using 2.12?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We see the inconsistent versions of scala here. The build file originally contained 2.12.4. And the docker container was being built with 2.13.5.

I just brought them both down to 2.12.latest. I chose 2.12 as it had been around longer and still actively being used (unlike Spark 2.x)

@@ -2,7 +2,6 @@

set -e

OUTPUT_PATH="./output"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can w have a pattern instead of removing it?


val sparkVersion = "2.4.0"
val sparkVersion = "3.2.1"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that Spark 3.3 is already released and this PR isn't merged, would it be worthwhile to upgrade to Spark3.3.0?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Upgrade Spark
3 participants