
[Feature Request] delta-spark's pyspark dependency is too large #3789

Open · melin opened this issue Oct 22, 2024 · 1 comment
Labels: enhancement (New feature or request)
melin commented Oct 22, 2024

Installing the delta-spark Python package on the Spark image pulls in pyspark as a dependency, which means downloading pyspark.zip at more than 370 MB. Can I avoid increasing the size of the Spark image?

FROM spark:3.5.3-scala2.12-java11-ubuntu

USER root

# Install Python and pip, then clean the apt cache.
RUN set -ex; \
    apt-get update; \
    apt-get install -y python3 python3-pip; \
    rm -rf /var/lib/apt/lists/*

# This pulls in pyspark as a transitive dependency of delta-spark.
RUN pip install requests aspectlib delta-spark

# Copy the AspectJ weaver and the job-server jars into the Spark distribution.
ADD build/docker/aspectjweaver-1.9.22.1.jar /opt/spark/

ADD build/docker/jars/ \
    build/docker/datatunnel-3.5.0/ \
    spark-jobserver-driver/target/spark-jobserver-driver-3.5.0.jar \
    spark-jobserver-extensions/target/spark-jobserver-extensions-3.5.0.jar /opt/spark/jars/

USER spark
melin added the enhancement label Oct 22, 2024
Pshak-20000 commented

Hi,
To minimize the size of the Spark image while adding delta-spark, a few options to consider:

1. Use a lighter base image.
2. Install only the necessary dependencies instead of the entire pyspark wheel (see the sketch after this list).
3. Use a multi-stage build so that only the essential files end up in the final image.
4. Clean up temporary files and package caches after each installation step.
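
For the second point specifically: the official spark image already ships the Spark Python bindings under /opt/spark/python, so the pyspark wheel that delta-spark declares as a dependency is redundant there. A minimal sketch of that approach (untested; the py4j zip name and the importlib_metadata requirement are assumptions based on Spark 3.5.3 and delta-spark's published metadata, so verify both against your image):

FROM spark:3.5.3-scala2.12-java11-ubuntu

USER root

RUN set -ex; \
    apt-get update; \
    apt-get install -y python3 python3-pip; \
    rm -rf /var/lib/apt/lists/*

# Install delta-spark without its declared dependencies so pip never
# downloads the pyspark wheel; importlib_metadata is the one other
# dependency delta-spark declares, so add it back explicitly.
RUN pip install --no-cache-dir --no-deps delta-spark; \
    pip install --no-cache-dir requests aspectlib importlib_metadata

# Reuse the Python bindings already shipped in the image instead of the
# pip wheel. The py4j zip name below matches Spark 3.5.x; adjust it if
# the base image changes.
ENV PYTHONPATH=/opt/spark/python:/opt/spark/python/lib/py4j-0.10.9.7-src.zip

USER spark

With --no-deps, pip skips the 370 MB pyspark download entirely; the trade-off is keeping the PYTHONPATH entries in sync with the Spark version baked into the base image.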
