
Hadoop configuration

amatteini edited this page Nov 7, 2012 · 2 revisions

Machine setup

Set the open files limit (ulimit -n) to 8192. Ubuntu Linux has a default limit of 1024 open files.
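
To confirm that the limit is actually in effect for the user running Hadoop, a quick check along these lines may help. This is only a sketch: the helper name `open_files_limit_ok` is made up for illustration, and it inspects the soft limit of the current process, which may differ from what the Hadoop daemons see if they are started from another shell or service.

```python
import resource

# Value recommended on this page for the open files limit.
REQUIRED_OPEN_FILES = 8192


def open_files_limit_ok(required=REQUIRED_OPEN_FILES):
    """Return True if the soft open-files limit of this process is at least `required`."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # RLIM_INFINITY means the limit is unbounded, which always satisfies the requirement.
    return soft == resource.RLIM_INFINITY or soft >= required


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("soft limit:", soft, "hard limit:", hard)
    print("meets requirement:", open_files_limit_ok())
```

Run it as the same user, and from the same environment, that starts the Hadoop processes.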

Cluster setup

hdfs-site.xml

<!-- A Hadoop HDFS datanode has an upper bound on the number of files that it will serve at any one time.
     The upper bound parameter is called xcievers (yes, this is misspelled).
     Be sure to restart HDFS after changing this setting.
     Without this setting in place, failures can look strange.
     Eventually the datanode logs complain that the xcievers limit was exceeded,
     but in the run-up to that, one symptom is complaints about missing blocks. -->
<property>
   <name>dfs.datanode.max.xcievers</name>
   <value>4096</value>
</property>

mapred-site.xml

<!-- The minimum size chunk that map input should be split into -->
<property>
  <name>mapred.min.split.size</name>
  <value>268435456</value> <!-- 256 MB-->
</property>

<!-- Output compression -->
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>

<!-- Reuse a JVM across multiple tasks of the same job (-1 means no limit on the number of tasks per JVM) -->
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
</property>

<!-- Number of reduce tasks --> 
<property>
  <name>mapred.reduce.tasks</name>
  <value>6</value>
</property>

<!-- Heap size for child JVMs -->
<property>
  <name>mapred.map.child.java.opts</name>
  <value>-Xmx1G</value>
</property>
<property>
  <name>mapred.reduce.child.java.opts</name>
  <value>-Xmx1G</value>
</property>

<!-- Maximum number of map/reduce tasks run simultaneously on a TaskTracker. The default value is 2 -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>
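
The numbers in mapred-site.xml above can be sanity-checked with a little arithmetic. The sketch below is illustrative only and its function names are hypothetical. It assumes the input file is splittable, treats the split count as a rough upper bound, and counts only the child JVM heaps (-Xmx), not JVM overhead or the DataNode and TaskTracker daemons themselves.

```python
import math

MIB = 1024 * 1024


def mib_to_bytes(mib):
    """Convert mebibytes to bytes, as expected by mapred.min.split.size."""
    return mib * MIB


def max_input_splits(input_bytes, min_split_bytes):
    """Rough upper bound on the number of splits for one splittable input file."""
    return math.ceil(input_bytes / min_split_bytes)


def child_heap_per_node_gb(map_slots, reduce_slots, map_heap_gb, reduce_heap_gb):
    """Worst-case total child JVM heap on one TaskTracker when every slot is busy."""
    return map_slots * map_heap_gb + reduce_slots * reduce_heap_gb


if __name__ == "__main__":
    print(mib_to_bytes(256))                                    # 268435456
    print(max_input_splits(10 * 1024 * MIB, mib_to_bytes(256)))  # 40
    print(child_heap_per_node_gb(4, 2, 1, 1))                    # 6
```

With the values on this page, a node with 4 map slots and 2 reduce slots at -Xmx1G each needs up to 6 GB of heap for child tasks, so the machine should have more physical memory than that.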

Specific configuration

The following configuration has been used for the 3690M triples test:
