Compiling #2

Open
saj9191 opened this issue Jul 18, 2018 · 11 comments

Comments

saj9191 commented Jul 18, 2018

Hello,
I'm trying to install Spark on Lambda. When I run

./dev/make-distribution.sh --name spark-lambda-2.1.0 --tgz -Phive -Phadoop-2.7 -Dhadoop.version=2.6.0-qds-0.4.13 -DskipTests

the spark-launcher module fails to build and I get the following error.

[ERROR] Failed to execute goal on project spark-launcher_2.11: Could not resolve dependencies for project org.apache.spark:spark-launcher_2.11:jar:2.1.0: Failure to find com.hadoop.gplcompression:hadoop-lzo:jar:0.4.19 in https://repo1.maven.org/maven2 was cached in the local repository, resolution will not be reattempted until the update interval of central has elapsed or updates are forced -> [Help 1]

I tried to explicitly add hadoop-lzo as a dependency in the launcher pom.xml, but I still get the same error. Is there something I need to download or change to get this to work?
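
For context, the "resolution will not be reattempted" part of the error means Maven has cached the failed download in the local repository. Below is a minimal sketch of two standard ways to force a retry (assuming make-distribution.sh passes extra flags straight through to mvn); neither helps on its own while the artifact is missing from the configured repositories:

# Option 1: pass -U to force Maven to re-check previously failed downloads
./dev/make-distribution.sh --name spark-lambda-2.1.0 --tgz -Phive -Phadoop-2.7 -Dhadoop.version=2.6.0-qds-0.4.13 -DskipTests -U

# Option 2: drop the cached failure markers for the artifact from the local repository
rm -rf ~/.m2/repository/com/hadoop/gplcompression/hadoop-lzo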

Thanks!

@venkata91 (Contributor)

Hi saj9191,

It seems like something changed on our side where we keep the Maven artifacts. We'll fix it and update you here. Thanks for trying it out, and sorry for the inconvenience.

faromero commented Sep 2, 2018

I am having the same issue (I also tried adding the hadoop-lzo dependency manually to pom.xml, with no success). Have there been any updates on resolving this?

@venkata91 (Contributor)

We have also been hitting this issue recently. I will get back with a fix soon and post it here. Thanks for taking the time to try it out.

faromero commented Sep 4, 2018

I believe I have found a solution:
In spark-on-lambda/common/network-common/pom.xml, add the following dependency (as suggested previously):

<dependency>
  <groupId>com.hadoop.gplcompression</groupId>
  <artifactId>hadoop-lzo</artifactId>
  <version>0.4.19</version>
</dependency>

Then, in spark-on-lambda/pom.xml, add the following repository (which hosts hadoop-lzo):

<repository>
  <id>twitter</id>
  <name>Twitter Repository</name>
  <url>http://maven.twttr.com</url>
</repository>

After this, I ran the make-distribution.sh command from your README and was able to build it all the way through.
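
As a quick sanity check (a sketch, not something from this thread), you can confirm the artifact is resolvable from that repository before kicking off the full build, using the standard maven-dependency-plugin get goal:

# Pull hadoop-lzo 0.4.19 into the local repo directly from the Twitter repository
mvn dependency:get \
  -Dartifact=com.hadoop.gplcompression:hadoop-lzo:0.4.19 \
  -DremoteRepositories=twitter::default::http://maven.twttr.com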

@venkata91 (Contributor)

Nice workaround! Let me also try it and post an update.

@venkata91 (Contributor)

Also, may I ask what your use case is, or are you just trying it out?

faromero commented Sep 4, 2018

Thanks for working to update it!

We are working on a research project on using Lambda for what we call "interactive massively parallel" applications, and wanted to compare Spark-on-Lambda to the current state of the art, as well as to our own work!

By the way, from your blog post, do you have the data available that you used for sorting 100 GB in under 10 minutes?

@venkata91 (Contributor)

Interesting! Can you elaborate a bit more on that? By the way, the data is generated with the TeraGen utility from https://github.com/ehiggs/spark-terasort, which you can use to produce your own input.
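
For anyone following along, a rough sketch of generating the input with that tool; the TeraGen class and jar name below follow the spark-terasort README, and the 100g size and s3a:// output URI are placeholders:

# Build spark-terasort, then generate ~100 GB of TeraSort input
git clone https://github.com/ehiggs/spark-terasort && cd spark-terasort
mvn package -DskipTests
spark-submit --class com.github.ehiggs.spark.terasort.TeraGen \
  target/spark-terasort-*-jar-with-dependencies.jar \
  100g s3a://your-bucket/terasort-input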

faromero commented Sep 4, 2018

You can view our work here: we call it gg, and while it was originally intended for compilation, it now supports general-purpose applications (as simple as sorting and as complex as video encoding). Let me know if you have any questions about it (it can be in a different forum instead of this issue thread).

I will try to run your sorting example and let you know if I have any issues!

@venkata91 (Contributor)

Another, easier workaround is to remove the pom.xml additions, basically reverting the commit "Fix pom.xml to have the other Qubole repository location having 2.6.0... (2ca6c68)".

Build your package using this command: ./dev/make-distribution.sh --name spark-lambda-2.1.0 --tgz -Phive -Phadoop-2.7 -DskipTests

And finally, add the jars below to the classpath before starting spark-shell (see the sketch after the reference link):

1. wget http://central.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar
2. wget http://central.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.3/hadoop-aws-2.7.3.jar

Refer here - https://markobigdata.com/2017/04/23/manipulating-files-from-s3-with-apache-spark/
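
A short sketch of that classpath step; the --jars flag is standard spark-shell usage, and the jar paths assume they were downloaded into the current directory:

wget http://central.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar
wget http://central.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.3/hadoop-aws-2.7.3.jar

# Start spark-shell with both jars on the classpath
./bin/spark-shell --jars aws-java-sdk-1.7.4.jar,hadoop-aws-2.7.3.jar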

@webroboteu

Hi venkata91, I wrote you an email. I'm looking for an advisor for my startup, a Spark-based web scraping service. The idea is to use this serverless computation approach, but I'm having problems. As soon as you have time, I would like to discuss it further.

venkata91 mentioned this issue May 29, 2019