You may use this method to write and test your Hadoop program locally, without configuring a Hadoop environment on your own machine or using the clusters. This tutorial is based on Hadoop: Running and Debugging MapReduce Programs Locally with IntelliJ and Maven (no need to set up Hadoop or HDFS), How-to: Create an IntelliJ IDEA Project for Apache Hadoop, and Developing Hadoop Mapreduce Application within Intellij IDEA on Windows 10.
- IntelliJ IDEA
- JDK (Java 8 or 11 for Hadoop 3.3.x)
- Linux or Mac OS
In IntelliJ, choose Create New Project, select Maven in the pop-up window, select your JDK, then click Next.
Fill in your GroupId and ArtifactId, then click Next.
Set your Project name and Project location, then click Finish.
The pom.xml should open automatically; if not, find it in the Project browser. It is recommended to click the Enable Auto-Import option in the popup dialog in the lower right corner, but you can still manually Import Changes any time you modify pom.xml.
Paste the following two blocks before the closing </project> tag.
<repositories>
<repository>
<id>apache</id>
<url>https://repo.maven.apache.org/maven2</url>
</repository>
</repositories>
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>3.3.1</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>3.3.1</version>
</dependency>
</dependencies>
The final pom.xml will look like the following:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>cs242</groupId>
<artifactId>wordcount</artifactId>
<version>1.0-SNAPSHOT</version>
<repositories>
<repository>
<id>apache</id>
<url>https://repo.maven.apache.org/maven2</url>
</repository>
</repositories>
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>3.3.1</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>3.3.1</version>
</dependency>
</dependencies>
</project>
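At this point you can optionally check that Maven resolves the Hadoop dependencies by compiling the (still empty) project from a terminal in the project root:
mvn compile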
Under Project→src→main→java, create a new Java Class named WordCount.
Paste the Java code into WordCount.java (this is from the original Hadoop MapReduce Tutorial).
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private final Text word = new Text();
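    // map() is called once for each line of input; it splits the line into
    // tokens and emits (word, 1) for every token.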
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
public static class IntSumReducer
extends Reducer<Text, IntWritable, Text, IntWritable> {
private IntWritable result = new IntWritable();
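    // reduce() is called once per unique word with all of that word's counts;
    // it emits (word, total).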
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
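    // With no cluster configuration on the classpath, Hadoop falls back to
    // the LocalJobRunner, so the job runs entirely in-process.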
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
This WordCount program scans all text files in the folder given by the first argument and outputs the frequencies of all words into a folder given by the second argument.
Create a folder named input under the project's root folder (at the same level as src) and put some text files inside it.
Then set the two program arguments. Select Run→Edit Configurations.
Add a new Application configuration, set the Name to WordCount, set the Main class to WordCount, and set Program arguments to input output. This allows the program to read from the input folder and save the result to the output folder. Do not create the output folder yourself: Hadoop will create it automatically, and will raise an exception if it already exists.
Select Run→Run 'WordCount' to run the Hadoop program. If you re-run the program, delete the output folder before each run.
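If you would rather not delete the output folder by hand, one optional tweak (not part of the original tutorial) is to remove it at the start of main(), right after creating conf, using Hadoop's FileSystem API:
// Optional: delete the output directory if it exists, so re-runs do not fail.
// Requires one extra import, org.apache.hadoop.fs.FileSystem (Path is already imported).
FileSystem fs = FileSystem.get(conf);
Path outputDir = new Path(args[1]);
if (fs.exists(outputDir)) {
    fs.delete(outputDir, true); // true = delete recursively
}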
Results are saved in output/part-r-00000.
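For example, if input contains a single text file with the two lines below (a made-up sample):
Hello Hadoop
Hello World
then output/part-r-00000 will contain one tab-separated count per word, sorted by key:
Hadoop	1
Hello	2
World	1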
You can build a single jar file containing your program and all necessary dependencies (e.g., the Hadoop libraries) so you can transfer the jar file to another machine and run it there.
Add the following block to pom.xml. The build block should be at the same level as the repositories and dependencies blocks.
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>8</source>
<target>8</target>
</configuration>
</plugin>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
<configuration>
<archive>
<manifest>
<!-- Path to your main class, include package path if needed -->
<mainClass>WordCount</mainClass>
</manifest>
</archive>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
</plugin>
</plugins>
</build>
Then, in a terminal, cd to the folder that contains the pom.xml file and execute the following command:
mvn package
This will build wordcount-1.0-SNAPSHOT-jar-with-dependencies.jar under the target folder. To run your program, execute the following command:
java -jar target/wordcount-1.0-SNAPSHOT-jar-with-dependencies.jar input output
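If the machine also has its own Hadoop installation (an assumption; it is not required for the java -jar command above), the same jar can be submitted through the standard hadoop launcher, which picks up the main class from the jar's manifest:
hadoop jar target/wordcount-1.0-SNAPSHOT-jar-with-dependencies.jar input output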
You may encounter errors such as the following when running the Hadoop program on Windows:
ERROR security.UserGroupInformation: PriviledgedActionException as ...
This happens because the Hadoop library tries to access Unix-specific paths (e.g. /tmp) and run Unix-specific functions (e.g. setting folder permissions to 0700). There are some possible solutions, for example overriding Hadoop's configuration to set a correct temporary folder, or applying patches to the Hadoop library to bypass the Unix-specific functions. However, it is still highly recommended to run Hadoop programs on Linux or Mac OS.
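As a rough sketch of the first approach (C:/hadoop-tmp is an arbitrary example path, and this may not resolve every Windows incompatibility), you can redirect Hadoop's scratch space to a writable Windows directory before creating the job in main():
// Possible workaround on Windows: redirect Hadoop's temporary directory.
// "C:/hadoop-tmp" is an example path; create it or substitute your own.
Configuration conf = new Configuration();
conf.set("hadoop.tmp.dir", "C:/hadoop-tmp");
Job job = Job.getInstance(conf, "word count");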