feat(hive): Only build and ship Hive metastore #619

Merged (14 commits) on Apr 12, 2024
7 changes: 7 additions & 0 deletions CHANGELOG.md
@@ -9,19 +9,26 @@ All notable changes to this project will be documented in this file.
- Build all `0.0.0-dev` product images as multi-arch and push them to Nexus and Harbor.
SBOMs are also generated and everything is signed ([#614]).

### Changed

- hive: Only build and ship Hive metastore. This reduces the image size from `2.63GB` to `1.9GB` and should also reduce the number of dependencies ([#619]).

### Fixed

- superset: Let Superset 3.1.0 build on ARM by adding `make` and `diffutils` ([#611]).
- airflow: Let Airflow 2.8.x and 2.9.x build on ARM by adding `make` and `diffutils` ([#612]).
- python: Fix the `python:3.11` manifest list by adding a proper hash ([#613]).
- trino-cli: Include the trino-cli in the CI build process ([#614]).
- hive: Fix compilation on ARM by back-porting [HIVE-21939](https://issues.apache.org/jira/browse/HIVE-21939) from [this](https://github.com/apache/hive/commit/2baf21bb55fcf33d8522444c78a8d8cab60e7415) commit ([#617]).
- hive: Fix compilation on ARM in CI as well ([#619]).
- hive: Fix compilation for x86 in CI by lowering disk usage to prevent the disk from running full ([#619]).

[#611]: https://github.com/stackabletech/docker-images/pull/611
[#612]: https://github.com/stackabletech/docker-images/pull/612
[#613]: https://github.com/stackabletech/docker-images/pull/613
[#614]: https://github.com/stackabletech/docker-images/pull/614
[#617]: https://github.com/stackabletech/docker-images/pull/617
[#619]: https://github.com/stackabletech/docker-images/pull/619

## [24.3.0] - 2024-03-20

4 changes: 4 additions & 0 deletions conf.py
@@ -143,6 +143,10 @@
"java-base": "1.8.0",
"hadoop": "3.3.4",
"jackson_dataformat_xml": "2.12.3",
# Normally Hive 3.1.3 ships with "postgresql-9.4.1208.jre7.jar", but as this is so old it only supports
# MD5-based authentication. Because of this, it does not work against more recent PostgreSQL versions.
# See https://github.com/stackabletech/hive-operator/issues/170 for details.
"postgres_driver": "42.7.2",
"aws_java_sdk_bundle": "1.12.262",
"azure_storage": "7.0.1",
"azure_keyvault_core": "1.0.0",
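This pinned version reaches hive/Dockerfile as the POSTGRES_DRIVER build argument. A rough hand-run equivalent of that wiring (a sketch only: the repository's build tooling derives the --build-arg values from conf.py, and the remaining arguments are omitted here):

    docker build -f hive/Dockerfile \
        --build-arg PRODUCT=3.1.3 \
        --build-arg POSTGRES_DRIVER=42.7.2 \
        .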
27 changes: 19 additions & 8 deletions hadoop/Dockerfile
@@ -17,7 +17,16 @@
RUN microdnf update && \
microdnf install \
# Required for Hadoop build
cmake cyrus-sasl-devel fuse-devel gcc gcc-c++ maven openssl-devel tar xz git \
cmake \
cyrus-sasl-devel \
fuse-devel \
gcc \
gcc-c++ \
git \
maven \
openssl-devel \
tar \
xz \
# Required for log4shell.sh
unzip zip && \
microdnf clean all
@@ -26,8 +35,6 @@

COPY hadoop/stackable /stackable

# Build from source to enable FUSE module, and to apply custom patches.
RUN curl --fail -L "https://repo.stackable.tech/repository/packages/hadoop/hadoop-${PRODUCT}-src.tar.gz" | tar -xzC .

# The symlink from JMX Exporter 0.16.1 to the versionless link exists because old HDFS Operators (up until and including 23.7) used to hardcode
# the version of JMX Exporter like this: "-javaagent:/stackable/jmx/jmx_prometheus_javaagent-0.16.1.jar"
@@ -52,20 +59,24 @@
tar xzf /opt/protobuf.tar.gz --strip-components 1 --no-same-owner && \
./configure --prefix=/opt/protobuf && \
make "-j$(nproc)" && \
make install
make install && \
rm -rf /opt/protobuf-src

ENV PROTOBUF_HOME /opt/protobuf
ENV PATH "${PATH}:/opt/protobuf/bin"

WORKDIR /stackable
RUN patches/apply_patches.sh ${PRODUCT}

WORKDIR /stackable/hadoop-${PRODUCT}-src
# Hadoop Pipes requires libtirpc to build, whose headers are not packaged in RedHat UBI, so skip building this module
RUN mvn clean package -Pdist,native -pl '!hadoop-tools/hadoop-pipes' -Drequire.fuse=true -DskipTests -Dmaven.javadoc.skip=true && \
# Build from source to enable FUSE module, and to apply custom patches.
RUN curl --fail -L "https://repo.stackable.tech/repository/packages/hadoop/hadoop-${PRODUCT}-src.tar.gz" | tar -xzC . && \

[hadolint warning] hadoop/Dockerfile#L72 DL3003: Use WORKDIR to switch to a directory (https://github.com/hadolint/hadolint/wiki/DL3003)
patches/apply_patches.sh ${PRODUCT} && \
cd hadoop-${PRODUCT}-src && \
mvn clean package -Pdist,native -pl '!hadoop-tools/hadoop-pipes' -Drequire.fuse=true -DskipTests -Dmaven.javadoc.skip=true && \
cp -r hadoop-dist/target/hadoop-${PRODUCT} /stackable/hadoop-${PRODUCT} && \
# HDFS fuse-dfs is not part of the regular dist output, so we need to copy it in ourselves
cp hadoop-hdfs-project/hadoop-hdfs-native-client/target/main/native/fuse-dfs/fuse_dfs /stackable/hadoop-${PRODUCT}/bin
cp hadoop-hdfs-project/hadoop-hdfs-native-client/target/main/native/fuse-dfs/fuse_dfs /stackable/hadoop-${PRODUCT}/bin && \
rm -rf /stackable/hadoop-${PRODUCT}-src
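
For context on the DL3003 warning: hadolint wants directory changes expressed as WORKDIR instructions instead of cd inside a RUN. A compliant sketch of the same steps (illustrative only; splitting the build across layers means the rm -rf cleanup above would no longer shrink the image, which is why the single RUN with cd is used):

    WORKDIR /stackable
    RUN curl --fail -L "https://repo.stackable.tech/repository/packages/hadoop/hadoop-${PRODUCT}-src.tar.gz" | tar -xzC . && \
        patches/apply_patches.sh ${PRODUCT}
    # DL3003-compliant: switch directories via WORKDIR instead of cd
    WORKDIR /stackable/hadoop-${PRODUCT}-src
    RUN mvn clean package -Pdist,native -pl '!hadoop-tools/hadoop-pipes' -Drequire.fuse=true -DskipTests -Dmaven.javadoc.skip=true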

# ===
# Mitigation for CVE-2021-44228 (Log4Shell)
73 changes: 33 additions & 40 deletions hive/Dockerfile
@@ -11,6 +11,7 @@
ARG HADOOP
ARG JMX_EXPORTER
ARG JACKSON_DATAFORMAT_XML
ARG POSTGRES_DRIVER
ARG AWS_JAVA_SDK_BUNDLE
ARG AZURE_STORAGE
ARG AZURE_KEYVAULT_CORE
@@ -31,41 +32,32 @@
USER stackable
WORKDIR /stackable

RUN curl --fail -L "https://repo.stackable.tech/repository/packages/hive/apache-hive-${PRODUCT}-src.tar.gz" | tar -xzC .
RUN chmod +x patches/apply_patches.sh
RUN patches/apply_patches.sh ${PRODUCT}
RUN cd /stackable/apache-hive-${PRODUCT}-src/ && \
mvn clean package -DskipTests -Pdist
RUN cd /stackable/apache-hive-${PRODUCT}-src/ && \
tar -xzf packaging/target/apache-hive-${PRODUCT}-bin.tar.gz -C /stackable && \
mv /stackable/apache-hive-${PRODUCT}-bin /stackable/apache-hive-${PRODUCT} && \
ln -s /stackable/apache-hive-${PRODUCT}/ /stackable/hive && \
cp /stackable/bin/start-metastore /stackable/hive/bin
RUN curl --fail -L "https://repo.stackable.tech/repository/packages/hive/apache-hive-${PRODUCT}-src.tar.gz" | tar -xzC . && \

[hadolint warning] hive/Dockerfile#L35 DL3003: Use WORKDIR to switch to a directory (https://github.com/hadolint/hadolint/wiki/DL3003)
patches/apply_patches.sh ${PRODUCT} && \
cd /stackable/apache-hive-${PRODUCT}-src/ && \
mvn clean package -DskipTests --projects standalone-metastore && \
mv standalone-metastore/target/apache-hive-metastore-${PRODUCT}-bin/apache-hive-metastore-${PRODUCT}-bin /stackable && \
ln -s /stackable/apache-hive-metastore-${PRODUCT}-bin/ /stackable/hive-metastore && \
cp /stackable/hive-metastore/bin/start-metastore /stackable/hive-metastore/bin/start-metastore.bak && \
cp /stackable/bin/start-metastore /stackable/hive-metastore/bin && \
rm -rf /stackable/apache-hive-${PRODUCT}-src

COPY --chown=stackable:stackable --from=hadoop-builder /stackable/hadoop /stackable/hadoop

# TODO: Remove hardcoded _new_ version
# Replace the old (postgresql-9.4.1208.jre7.jar) postgresql JDBC driver with a newer one, as the old one only supports MD5-based authentication.
# Because of this, the contained driver version does not work against more recent PostgreSQL versions.
# See https://github.com/stackabletech/hive-operator/issues/170 for details.
# Note: We hardcode the versions here to make sure this replacement will be removed once Hive ships with a more recent driver
# version as the "rm" statement will fail.
RUN rm /stackable/apache-hive-${PRODUCT}/lib/postgresql-9.4.1208.jre7.jar && \
curl --fail -L https://repo.stackable.tech/repository/packages/pgjdbc/postgresql-42.7.2.jar -o /stackable/hive/lib/postgresql-42.7.2.jar


COPY --link --from=hadoop-builder /stackable/hadoop /stackable/hadoop
# Add a PostgreSQL driver, as this is the primary persistence backend in use
RUN curl --fail -L https://repo.stackable.tech/repository/packages/pgjdbc/postgresql-${POSTGRES_DRIVER}.jar -o /stackable/hive-metastore/lib/postgresql-${POSTGRES_DRIVER}.jar

# The next two sections for S3 and Azure use hardcoded version numbers on purpose instead of wildcards
# This way the build will fail should one of the files not be available anymore in a later Hadoop version!

# Add S3 Support for Hive (support for s3a://)
RUN cp /stackable/hadoop/share/hadoop/tools/lib/hadoop-aws-${HADOOP}.jar /stackable/hive/lib/
RUN cp /stackable/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-${AWS_JAVA_SDK_BUNDLE}.jar /stackable/hive/lib/
RUN cp /stackable/hadoop/share/hadoop/tools/lib/hadoop-aws-${HADOOP}.jar /stackable/hive-metastore/lib/

[hadolint notice] hive/Dockerfile#L54 DL3059: Multiple consecutive `RUN` instructions. Consider consolidation. (https://github.com/hadolint/hadolint/wiki/DL3059)
RUN cp /stackable/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-${AWS_JAVA_SDK_BUNDLE}.jar /stackable/hive-metastore/lib/

[hadolint notice] hive/Dockerfile#L55 DL3059: Multiple consecutive `RUN` instructions. Consider consolidation. (https://github.com/hadolint/hadolint/wiki/DL3059)

# Add Azure ABFS support (support for abfs://)
RUN cp /stackable/hadoop/share/hadoop/tools/lib/hadoop-azure-${HADOOP}.jar /stackable/hive/lib/
RUN cp /stackable/hadoop/share/hadoop/tools/lib/azure-storage-${AZURE_STORAGE}.jar /stackable/hive/lib/
RUN cp /stackable/hadoop/share/hadoop/tools/lib/azure-keyvault-core-${AZURE_KEYVAULT_CORE}.jar /stackable/hive/lib/
RUN cp /stackable/hadoop/share/hadoop/tools/lib/hadoop-azure-${HADOOP}.jar /stackable/hive-metastore/lib/

[hadolint notice] hive/Dockerfile#L58 DL3059: Multiple consecutive `RUN` instructions. Consider consolidation. (https://github.com/hadolint/hadolint/wiki/DL3059)
RUN cp /stackable/hadoop/share/hadoop/tools/lib/azure-storage-${AZURE_STORAGE}.jar /stackable/hive-metastore/lib/

[hadolint notice] hive/Dockerfile#L59 DL3059: Multiple consecutive `RUN` instructions. Consider consolidation. (https://github.com/hadolint/hadolint/wiki/DL3059)
RUN cp /stackable/hadoop/share/hadoop/tools/lib/azure-keyvault-core-${AZURE_KEYVAULT_CORE}.jar /stackable/hive-metastore/lib/

[hadolint notice] hive/Dockerfile#L60 DL3059: Multiple consecutive `RUN` instructions. Consider consolidation. (https://github.com/hadolint/hadolint/wiki/DL3059)

# The symlink from JMX Exporter 0.16.1 to the versionless link exists because old HDFS Operators (up until and including 23.7) used to hardcode
# the version of JMX Exporter like this: "-javaagent:/stackable/jmx/jmx_prometheus_javaagent-0.16.1.jar"
@@ -78,16 +70,16 @@
ln -s /stackable/jmx/jmx_prometheus_javaagent.jar /stackable/jmx/jmx_prometheus_javaagent-0.16.1.jar

# Logging
RUN rm /stackable/hive/lib/log4j-slf4j-impl* && \
curl --fail -L https://repo.stackable.tech/repository/packages/jackson-dataformat-xml/jackson-dataformat-xml-${JACKSON_DATAFORMAT_XML}.jar -o /stackable/hive/lib/jackson-dataformat-xml-${JACKSON_DATAFORMAT_XML}.jar
RUN rm /stackable/hive-metastore/lib/log4j-slf4j-impl* && \
curl --fail -L https://repo.stackable.tech/repository/packages/jackson-dataformat-xml/jackson-dataformat-xml-${JACKSON_DATAFORMAT_XML}.jar -o /stackable/hive-metastore/lib/jackson-dataformat-xml-${JACKSON_DATAFORMAT_XML}.jar

# ===
# For earlier versions this script removes the .class file that contains the
# vulnerable code.
# TODO: This can be restricted to target only versions which do not honor the environment
# variable that has been set above, but this has not currently been implemented
COPY shared/log4shell.sh /bin
RUN /bin/log4shell.sh /stackable/apache-hive-${PRODUCT}
RUN /bin/log4shell.sh /stackable/apache-hive-metastore-${PRODUCT}-bin/

# Ensure no vulnerable files are left over
# This will currently report vulnerable files being present, as it also alerts on
@@ -96,7 +88,8 @@
COPY shared/log4shell_1.6.1-log4shell_Linux_x86_64 /bin/log4shell_scanner_x86_64
COPY shared/log4shell_1.6.1-log4shell_Linux_aarch64 /bin/log4shell_scanner_aarch64
COPY shared/log4shell_scanner /bin/log4shell_scanner
RUN /bin/log4shell_scanner s /stackable/apache-hive-${PRODUCT}
# log4shell_scanner does not work on symlinks!
RUN /bin/log4shell_scanner s /stackable/apache-hive-metastore-${PRODUCT}-bin/
# ===

# syntax=docker/dockerfile:1@sha256:ac85f380a63b13dfcefa89046420e1781752bab202122f8f50032edf31be0021
@@ -106,12 +99,12 @@
ARG HADOOP
ARG RELEASE

LABEL name="Apache Hive" \
LABEL name="Apache Hive metastore" \
maintainer="[email protected]" \
vendor="Stackable GmbH" \
version="${PRODUCT}" \
release="${RELEASE}" \
summary="The Stackable image for Apache Hive." \
summary="The Stackable image for Apache Hive metastore." \
description="This image is deployed by the Stackable Operator for Apache Hive."

RUN microdnf update && \
@@ -122,15 +115,15 @@
USER stackable
WORKDIR /stackable

COPY --link --from=builder /stackable/apache-hive-${PRODUCT} /stackable/apache-hive-${PRODUCT}
RUN ln -s /stackable/apache-hive-${PRODUCT}/ /stackable/hive
COPY --chown=stackable:stackable --from=builder /stackable/apache-hive-metastore-${PRODUCT}-bin /stackable/apache-hive-metastore-${PRODUCT}-bin
RUN ln -s /stackable/apache-hive-metastore-${PRODUCT}-bin/ /stackable/hive-metastore

# It is useful to see which version of Hadoop is used at a glance,
# therefore the full name is used here
COPY --link --from=builder /stackable/hadoop /stackable/hadoop-${HADOOP}
COPY --chown=stackable:stackable --from=builder /stackable/hadoop /stackable/hadoop-${HADOOP}
RUN ln -s /stackable/hadoop-${HADOOP}/ /stackable/hadoop

COPY --link --from=builder /stackable/jmx /stackable/jmx
COPY --chown=stackable:stackable --from=builder /stackable/jmx /stackable/jmx
COPY hive/licenses /licenses

# Mitigation for CVE-2021-44228 (Log4Shell)
@@ -139,8 +132,8 @@
ENV LOG4J_FORMAT_MSG_NO_LOOKUPS=true

ENV HADOOP_HOME=/stackable/hadoop
ENV HIVE_HOME=/stackable/hive
ENV PATH="${PATH}":/stackable/hadoop/bin:/stackable/hive/bin
ENV HIVE_HOME=/stackable/hive-metastore
ENV PATH="${PATH}":/stackable/hadoop/bin:/stackable/hive-metastore/bin

WORKDIR /stackable/hive
CMD ["./bin/start-metastore", "--config", "conf", "--hive-bin-dir", "bin", "--db-type", "derby"]
WORKDIR /stackable/hive-metastore
# The start command is set by the operator to something like "bin/start-metastore --config /stackable/config --db-type postgres --hive-bin-dir bin"
12 changes: 6 additions & 6 deletions hive/stackable/bin/start-metastore
@@ -5,10 +5,10 @@
# Usage: start-metastore <options>
# Options:
# --help
# --config <path-to-hadoop-conf-folder>
# --config <path-to-hadoop-conf-folder>
# --db-type <db>
# --hive-bin-dir <path>
#
#
# Checks if the metastore database schema is initialized. If so it starts the metastore,
# otherwise it tries to initialize the schema first.
#
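#
# Example invocation (values mirror what the operator passes, per the comment at
# the end of hive/Dockerfile; paths are illustrative):
#   start-metastore --config /stackable/config --db-type postgres --hive-bin-dir bin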
@@ -22,7 +22,7 @@ HIVE_BIN_DIR=""
function parse_args {
while true; do
echo "processing arg $1"
case $1 in
case $1 in
--db-type)
shift
DB_TYPE=$1
@@ -71,14 +71,14 @@ function parse_args {
}

function init_schema {
if ! $HIVE_BIN_DIR/hive --config $CONF_DIR --service schemaTool -dbType $DB_TYPE -validate ; then
if ! $HIVE_BIN_DIR/base --config $CONF_DIR --service schemaTool -dbType $DB_TYPE -validate ; then
echo "No valid schema found, initializing schema ..."
$HIVE_BIN_DIR/hive --config $CONF_DIR --service schemaTool -dbType $DB_TYPE -initSchema || exit 1
$HIVE_BIN_DIR/base --config $CONF_DIR --service schemaTool -dbType $DB_TYPE -initSchema || exit 1
fi
}

function start_metastore {
$HIVE_BIN_DIR/hive --config $CONF_DIR --service metastore
$HIVE_BIN_DIR/base --config $CONF_DIR --service metastore
}

function main {
@@ -26,7 +26,7 @@ index e36f1e64f0..6007b7961b 100644
+ <protobuf-exc.version>2.6.1</protobuf-exc.version>
<sqlline.version>1.3.0</sqlline.version>
<storage-api.version>2.7.0</storage-api.version>

@@ -443,6 +446,20 @@
</plugins>
</build>
@@ -57,6 +57,6 @@ index e36f1e64f0..6007b7961b 100644
<addSources>none</addSources>
<inputDirectories>
<include>${basedir}/src/main/protobuf/org/apache/hadoop/hive/metastore</include>
--
--
2.43.0

Empty file modified hive/stackable/patches/apply_patches.sh
100644 → 100755
Empty file.