[Question] How to make this hive/presto setup query parquet on AWS S3 #17

leletan opened this issue Jul 2, 2018 · 9 comments

leletan commented Jul 2, 2018

I simply tried adding the AWS Hadoop jar to the path in each docker image, but it did not seem to work. Any advice?

Thanks in advance.

leletan commented Jul 2, 2018

I tried to create an external table:

CREATE EXTERNAL TABLE IF NOT EXISTS aaa (
event_id string,
make string)
STORED AS PARQUET
LOCATION 's3://<access_key>:<secret_key>@<some_bucket>/<some_path>';

But it ended up with:

Error: org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Got exception: java.io.IOException /<some_path> doesn't exist)

I used the same keys to look up the files at that path in S3 with the AWS CLI, and they were there.
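The answers below converge on the S3A connector, where the credentials live in the Hadoop configuration rather than in the LOCATION URI. A minimal sketch of the same DDL, assuming the hadoop-aws and aws-java-sdk jars are on Hive's classpath and fs.s3a.access.key / fs.s3a.secret.key are configured (bucket and path are placeholders):

-- Same table, but with the s3a:// scheme and no keys in the URI
CREATE EXTERNAL TABLE IF NOT EXISTS aaa (
  event_id string,
  make string)
STORED AS PARQUET
LOCATION 's3a://<some_bucket>/<some_path>';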

leletan commented Jul 2, 2018

We also have a schema / external table on AWS Athena for this folder, and it works fine.

GrigorievNick commented Aug 12, 2019

I added the hadoop-aws Maven dependency:

FROM maven:3.6.1-jdk-8 AS DEPENDENCY_BUILD
ADD pom.xml pom.xml
RUN mvn dependency:copy-dependencies -DoutputDirectory=/external-jars/

FROM bde2020/hive:2.3.2-postgresql-metastore

COPY --from=DEPENDENCY_BUILD /external-jars/ /external-jars/
And the pom.xml:

<?xml version="1.0" encoding="UTF-8"?>

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.datainfra.hive</groupId>
    <artifactId>hive</artifactId>
    <version>1.0-SNAPSHOT</version>
    <name>hive-metastore-external-dependency</name>

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-aws</artifactId>
            <version>2.7.7</version>
        </dependency>
        <dependency>
            <groupId>com.amazonaws</groupId>
            <artifactId>aws-java-sdk</artifactId>
            <version>1.7.4</version>
        </dependency>
    </dependencies>
</project>

And set the environment variable:

HIVE_AUX_JARS_PATH=/external-jars/
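A minimal build sketch for the image above, run from the folder containing the Dockerfile and pom.xml (the image tag is a placeholder):

# builds the two-stage image; the first stage downloads the jars into /external-jars/
docker build -t hive-with-s3:2.3.2 .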

shveikus commented Mar 2, 2020

Yep, rebuilding the 'bde2020/hive:2.3.2-postgresql-metastore' container with the AWS jar dependencies added, and swapping it in for hive-server and hive-metastore, worked pretty well for me, as @GrigorievNick recommended.
Just a few additions: if you want to avoid putting S3 credentials in the URI when you create Hive tables, add the following to your hadoop-hive.env:

CORE_CONF_fs_s3a_aws_credentials_provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
CORE_CONF_fs_s3a_access_key=***
CORE_CONF_fs_s3a_secret_key=***

More info in the docs.
It worked for me with MinIO object storage; I expect it will be the same with S3.
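For MinIO (and other non-AWS endpoints), the S3A connector also has to be pointed at the right endpoint. A sketch of two more hadoop-hive.env lines, following the same CORE_CONF_ naming pattern as above (the endpoint value is a placeholder, and fs.s3a.path.style.access may require a newer hadoop-aws than 2.7.x):

CORE_CONF_fs_s3a_endpoint=http://minio:9000
CORE_CONF_fs_s3a_path_style_access=true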

@Ravipillala

@shveikus @GrigorievNick Can you provide detailed steps to make this work? I'm new to Docker. I tried adding the following to the Dockerfile after creating pom.xml in the same folder, but no luck.

GrigorievNick commented May 17, 2020

Hi @Ravipillala,
I don't remember all the details, but here is the Dockerfile that I use:

#FROM fabric8/java-alpine-openjdk8-jdk:1.6.3
#FROM adoptopenjdk/openjdk11:alpine-slim
FROM maven:3.6.1-jdk-8 AS DEPENDENCY_BUILD
ADD pom.xml pom.xml
RUN mvn dependency:copy-dependencies -DoutputDirectory=/external-jars/

FROM bde2020/hive:2.3.2-postgresql-metastore

COPY --from=DEPENDENCY_BUILD /external-jars/ /external-jars/

And the pom.xml:

<?xml version="1.0" encoding="UTF-8"?>

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.appsflyer.datainfra.hive</groupId>
    <artifactId>hive</artifactId>
    <version>1.0-SNAPSHOT</version>
    <name>hive-metastore-external-dependency</name>

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-aws</artifactId>
            <version>2.7.7</version>
        </dependency>
        <dependency>
            <groupId>com.amazonaws</groupId>
            <artifactId>aws-java-sdk</artifactId>
            <version>1.7.4</version>
        </dependency>
    </dependencies>
</project>

And the build command:

docker build -t artifactory.com:5000/mykola.hryhoriev/hive:2.3.2-postgresql-metastore hivemetastore/

I'm personally deploying this to k8s, so I'll add the YAML; you can see the env variables and configs there.

apiVersion: v1
data:
  HIVE_AUX_JARS_PATH: /external-jars/
  HIVE_SITE_CONF_datanucleus_autoCreateSchema: "false"
  HIVE_SITE_CONF_fs_s3a_access_key: <enter your key here>
  HIVE_SITE_CONF_fs_s3a_secret_key: <enter your secret here>
  HIVE_SITE_CONF_hive_metastore_uris: thrift://hive-metastore:9083
  HIVE_SITE_CONF_hive_metastore_warehouse_dir: s3a://stage-presto-iceberg-metadata/iceberg-catalog
  HIVE_SITE_CONF_javax_jdo_option_ConnectionDriverName: org.postgresql.Driver
  HIVE_SITE_CONF_javax_jdo_option_ConnectionPassword: hive
  HIVE_SITE_CONF_javax_jdo_option_ConnectionURL: jdbc:postgresql://hive-metastore-postgresql/metastore
  HIVE_SITE_CONF_javax_jdo_option_ConnectionUserName: hive
kind: ConfigMap
metadata:
  name: hive-env
  namespace: bigdata
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hive-metastore
  namespace: bigdata
spec:
  selector:
    matchLabels:
      app: hive-metastore
  replicas: 1
  template:
    metadata:
      labels:
        app: hive-metastore
    spec:
      nodeSelector:
        kubernetes.io/lifecycle: normal
      containers:
        - name: hive-metastore
          image: artifactory.com:5000/mykola.hryhoriev/hive:2.3.2-postgresql-metastore
          args:
            - /opt/hive/bin/hive --service metastore
          imagePullPolicy: Always
          env:
            - name: SERVICE_PRECONDITION
              value: hive-metastore-postgresql:5432
          envFrom:
            - configMapRef:
                name: hive-env
          ports:
            - containerPort: 9083
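A usage sketch, assuming the ConfigMap and Deployment above are saved in one file (the file name is a placeholder; the namespace is already set in the metadata):

kubectl apply -f hive-metastore.yaml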

@shveikus

Hi @Ravipillala, you can try this:

  1. Create a folder and put a Dockerfile there:

FROM maven:3.6.1-jdk-8 AS DEPENDENCY_BUILD
ADD pom.xml pom.xml
RUN mvn dependency:copy-dependencies -DoutputDirectory=/external-jars/

FROM bde2020/hive:2.3.2-postgresql-metastore

COPY --from=DEPENDENCY_BUILD /external-jars/ /external-jars/

and a pom.xml (this is my own):


<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.datainfra.hive</groupId>
    <artifactId>hive</artifactId>
    <version>1.0-SNAPSHOT</version>
    <name>hive-metastore-external-dependency</name>

    <dependencies>
       <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-mapreduce-client-shuffle</artifactId>
          <version>2.7.4</version>
        </dependency>
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
          <version>2.7.4</version>
        </dependency>
       <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-client</artifactId>
          <version>2.7.4</version>
        </dependency>
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-aws</artifactId>
          <version>2.7.4</version>
        </dependency>
        <dependency>
            <groupId>com.amazonaws</groupId>
            <artifactId>aws-java-sdk</artifactId>
            <version>1.7.4</version>
        </dependency>
        <dependency>
          <groupId>org.keedio.openx.data</groupId>
          <artifactId>json-serde</artifactId>
          <version>1.3.7.3</version>
        </dependency>
    </dependencies>
  </project>
  2. After that, build a new Docker image with your tag:
docker build . -t my-hive-aws:0.1
  3. Then replace all the stock images with yours, something like this:
  hive-server:
    image: my-hive-aws:0.1
    env_file:
      - ./hadoop-hive.env
    environment:
      HIVE_CORE_CONF_javax_jdo_option_ConnectionURL: "jdbc:postgresql://hive-metastore/metastore"
      SERVICE_PRECONDITION: "hive-metastore:9083"
    ports:
      - "10000:10000"
  4. Then add credentials to your hadoop-hive.env as I mentioned above.

Hope this is useful.
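A quick smoke test of the rebuilt setup, assuming the hive-server service name and port from the compose snippet above and the /opt/hive install path from the k8s YAML (bucket and path are placeholders):

docker-compose exec hive-server /opt/hive/bin/beeline -u jdbc:hive2://localhost:10000

-- then, inside beeline, with no credentials in the URI:
CREATE EXTERNAL TABLE IF NOT EXISTS smoke_test (event_id string)
STORED AS PARQUET
LOCATION 's3a://<some_bucket>/<some_path>';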

zhenik commented May 18, 2020

Hi,
sorry for going off-topic. @Ravipillala, here is a working PoC of hive + s3 + presto: https://github.com/zhenik-poc/big-data-stack-practice

@Ravipillala

Thank you
