Data … as usual

All things about data by Laurent Leturgez

Category Archives: hadoop

Install a Standalone Spark Environment on Oracle Linux 7

Posted by Laurent on September 14, 2017

Spark is one of the trendiest projects in the Apache Software Foundation.

Until now, I mostly used it directly on Hadoop clusters, but every time I wanted to play with Spark without a complete Hadoop cluster, or to test some basic pieces of code, it was hard to do, especially on my laptop! Running a 3-node CDH cluster on a laptop requires a lot of CPU and memory.

So in this post, I describe how to set up a small Linux virtual machine and install the latest Spark version in standalone mode.

First of all, you need a fully operational Linux box. I chose Oracle Enterprise Linux 7.4 with the 3.8.13-118 UEK kernel.

[spark@spark ~]$ sudo uname -r
3.8.13-118.19.4.el7uek.x86_64

Once the box is installed and configured, you need to install Java. In my case, I installed the Java SE 8 JDK:

[spark@spark ~]$ sudo yum localinstall /home/spark/jdk-8u121-linux-x64.rpm -y
[spark@spark ~]$ java -version
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)

Then, create the directories required for the Spark installation and download the sources (if you need another version of Spark, you will find it at this URL: https://spark.apache.org/downloads.html):

[spark@spark ~]$ sudo mkdir /usr/local/share/spark
[spark@spark ~]$ sudo chown spark:spark /usr/local/share/spark
[spark@spark ~]$ curl -O https://d3kbcqa49mib13.cloudfront.net/spark-2.2.0.tgz
[spark@spark ~]$ tar -xvzf spark-2.2.0.tgz -C /usr/local/share/spark/
[spark@spark ~]$ cd /usr/local/share/spark/spark-2.2.0/

If you are behind a proxy server, you have to create a settings.xml file in the $HOME/.m2 directory (you’ll probably have to create the directory too). You have to do it even if you have set the http_proxy variable in your environment, because Maven, which is used during the build process, does not use that variable.

Below, you’ll see what my settings.xml file looks like:

[spark@spark ~]$ cat ~/.m2/settings.xml
<settings>
 <proxies>
 <proxy>
 <id>example-proxy</id>
 <active>true</active>
 <protocol>http</protocol>
 <host>10.239.9.20</host>
 <port>80</port>
 </proxy>
 </proxies>
</settings>
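
If you already have a proxy URL at hand, the file above can be generated instead of typed. The following is only a sketch in plain Java: the class name MavenProxySettings is mine, it assumes the proxy is given as a full URL with a scheme (for example the value of http_proxy), and it falls back to the proxy address shown above when the variable is not set.

```java
import java.net.URI;

final class MavenProxySettings {
    // Builds a minimal settings.xml body from a proxy URL such as http://10.239.9.20:80.
    // Assumption: the URL carries a scheme; a bare "host:port" is not handled here.
    static String fromProxyUrl(String proxyUrl) {
        URI uri = URI.create(proxyUrl);
        String scheme = uri.getScheme() == null ? "http" : uri.getScheme();
        // When the URL has no explicit port, use the usual default for the scheme.
        int port = uri.getPort() != -1 ? uri.getPort() : ("https".equals(scheme) ? 443 : 80);
        return String.join("\n",
            "<settings>",
            " <proxies>",
            " <proxy>",
            " <id>example-proxy</id>",
            " <active>true</active>",
            " <protocol>" + scheme + "</protocol>",
            " <host>" + uri.getHost() + "</host>",
            " <port>" + port + "</port>",
            " </proxy>",
            " </proxies>",
            "</settings>");
    }

    public static void main(String[] args) {
        String url = System.getenv("http_proxy");
        if (url == null || url.isEmpty()) {
            url = "http://10.239.9.20:80"; // the proxy used in this post
        }
        System.out.println(fromProxyUrl(url));
    }
}
```

Redirect the output to ~/.m2/settings.xml once you have checked that it looks right.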

Then, you are ready to configure the Maven environment and launch the build:

[spark@spark ~]$ cd /usr/local/share/spark/spark-2.2.0/
[spark@spark spark-2.2.0]$ export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
[spark@spark spark-2.2.0]$ ./build/mvn -DskipTests clean package

At the end of the process, a summary report is printed.

[spark@spark spark-2.2.0]$ ./build/mvn -DskipTests clean package

.../...

[INFO] Replacing original artifact with shaded artifact.
[INFO] Replacing /usr/local/share/spark/spark-2.2.0/external/kafka-0-10-assembly/target/spark-streaming-kafka-0-10-assembly_2.11-2.2.0.jar with /usr/local/share/spark/spark-2.2.0/external/kafka-0-10-assembly/target/spark-streaming-kafka-0-10-assembly_2.11-2.2.0-shaded.jar
[INFO] Dependency-reduced POM written at: /usr/local/share/spark/spark-2.2.0/external/kafka-0-10-assembly/dependency-reduced-pom.xml
[INFO]
[INFO] --- maven-source-plugin:3.0.1:jar-no-fork (create-source-jar) @ spark-streaming-kafka-0-10-assembly_2.11 ---
[INFO] Building jar: /usr/local/share/spark/spark-2.2.0/external/kafka-0-10-assembly/target/spark-streaming-kafka-0-10-assembly_2.11-2.2.0-sources.jar
[INFO]
[INFO] --- maven-source-plugin:3.0.1:test-jar-no-fork (create-source-jar) @ spark-streaming-kafka-0-10-assembly_2.11 ---
[INFO] Building jar: /usr/local/share/spark/spark-2.2.0/external/kafka-0-10-assembly/target/spark-streaming-kafka-0-10-assembly_2.11-2.2.0-test-sources.jar
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [01:04 min]
[INFO] Spark Project Tags ................................. SUCCESS [ 26.598 s]
[INFO] Spark Project Sketch ............................... SUCCESS [ 6.316 s]
[INFO] Spark Project Networking ........................... SUCCESS [ 17.129 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 6.836 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [ 9.039 s]
[INFO] Spark Project Launcher ............................. SUCCESS [ 21.286 s]
[INFO] Spark Project Core ................................. SUCCESS [02:24 min]
[INFO] Spark Project ML Local Library ..................... SUCCESS [ 20.021 s]
[INFO] Spark Project GraphX ............................... SUCCESS [ 13.117 s]
[INFO] Spark Project Streaming ............................ SUCCESS [ 33.581 s]
[INFO] Spark Project Catalyst ............................. SUCCESS [01:22 min]
[INFO] Spark Project SQL .................................. SUCCESS [02:56 min]
[INFO] Spark Project ML Library ........................... SUCCESS [02:08 min]
[INFO] Spark Project Tools ................................ SUCCESS [ 3.084 s]
[INFO] Spark Project Hive ................................. SUCCESS [ 51.106 s]
[INFO] Spark Project REPL ................................. SUCCESS [ 4.365 s]
[INFO] Spark Project Assembly ............................. SUCCESS [ 2.109 s]
[INFO] Spark Project External Flume Sink .................. SUCCESS [ 8.062 s]
[INFO] Spark Project External Flume ....................... SUCCESS [ 9.350 s]
[INFO] Spark Project External Flume Assembly .............. SUCCESS [ 2.087 s]
[INFO] Spark Integration for Kafka 0.8 .................... SUCCESS [ 12.043 s]
[INFO] Kafka 0.10 Source for Structured Streaming ......... SUCCESS [ 12.758 s]
[INFO] Spark Project Examples ............................. SUCCESS [ 19.236 s]
[INFO] Spark Project External Kafka Assembly .............. SUCCESS [ 5.637 s]
[INFO] Spark Integration for Kafka 0.10 ................... SUCCESS [ 9.345 s]
[INFO] Spark Integration for Kafka 0.10 Assembly .......... SUCCESS [ 3.909 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 14:54 min
[INFO] Finished at: 2017-09-14T12:22:31+02:00
[INFO] Final Memory: 86M/896M
[INFO] ------------------------------------------------------------------------

At this point, if you run some scripts, you will get an error because, even though Spark is installed in standalone mode, it still needs the Hadoop libraries.

That is easy to fix: we just have to download Hadoop and configure our environment as shown below. (Download the Hadoop version you need. I chose 2.8, which is the latest stable release of Hadoop 2; I did not test with Hadoop 3 as it is still in beta.)

[spark@spark ~]$ cd /usr/local/share/
[spark@spark share]$ sudo mkdir hadoop
[spark@spark share]$ sudo chown spark:spark hadoop/
[spark@spark share]$ cd hadoop/
[spark@spark hadoop]$ curl -O http://apache.mirrors.ovh.net/ftp.apache.org/dist/hadoop/common/hadoop-2.8.1/hadoop-2.8.1.tar.gz
[spark@spark hadoop]$ tar -xzf hadoop-2.8.1.tar.gz
[spark@spark hadoop]$ cat >> ~/.bashrc <<'EOF'
export HADOOP_HOME=/usr/local/share/hadoop/hadoop-2.8.1
export LD_LIBRARY_PATH=${HADOOP_HOME}/lib/native:${LD_LIBRARY_PATH}
export SPARK_HOME=/usr/local/share/spark/spark-2.2.0
export PATH=${SPARK_HOME}/bin:${PATH}
EOF
[spark@spark hadoop]$ . ~/.bashrc
[spark@spark hadoop]$ env | egrep 'HADOOP|PATH|SPARK'
SPARK_HOME=/usr/local/share/spark/spark-2.2.0
HADOOP_HOME=/usr/local/share/hadoop/hadoop-2.8.1
LD_LIBRARY_PATH=/usr/local/share/hadoop/hadoop-2.8.1/lib/native:/usr/local/share/hadoop/hadoop-2.8.1/lib/native:
PATH=/usr/local/share/spark/spark-2.2.0/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/spark/.local/bin:/home/spark/bin

Now, we can run the SparkPi example:

[spark@spark ~]$ run-example SparkPi 500
Pi is roughly 3.141360702827214
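
For the curious, here is roughly what that example computes. This is a minimal plain-Java sketch, not the code shipped with Spark: it assumes the usual Monte Carlo approach (random points in a square, counting those that fall inside the unit circle), run in a single thread. The class name LocalPi, the sample count and the seed are mine, for illustration only.

```java
import java.util.Random;

final class LocalPi {
    // Estimates pi by sampling points uniformly in the square [-1, 1] x [-1, 1]
    // and counting how many fall inside the unit circle.
    // The ratio inside/samples approximates pi/4.
    static double estimate(long samples, long seed) {
        Random rnd = new Random(seed);
        long inside = 0;
        for (long i = 0; i < samples; i++) {
            double x = rnd.nextDouble() * 2 - 1;
            double y = rnd.nextDouble() * 2 - 1;
            if (x * x + y * y <= 1) {
                inside++;
            }
        }
        return 4.0 * inside / samples;
    }

    public static void main(String[] args) {
        System.out.println("Pi is roughly " + estimate(1_000_000L, 42L));
    }
}
```

Spark spreads the same kind of sampling over several tasks and adds up the hits; the argument given to run-example (500 here) controls how the work is split.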

Note: If you want to remove all those crappy INFO messages in the output, run the command below to configure log4j properties:

[spark@spark hadoop]$ cd $SPARK_HOME/conf
[spark@spark conf]$ sed 's/log4j\.rootCategory=INFO, console/log4j\.rootCategory=WARN, console/g' log4j.properties.template > log4j.properties

That’s it: you are now ready to run your code on Spark. Below is a sample written in Scala that creates a DataFrame from an Oracle JDBC data source and runs a groupBy on it.

[spark@spark ~]$ spark-shell --driver-class-path ojdbc7.jar --jars ojdbc7.jar
Spark context Web UI available at http://192.168.99.14:4040
Spark context available as 'sc' (master = local[*], app id = local-1505397247969).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121)
Type in expressions to have them evaluated.
Type :help for more information.

scala> :load jdbc_sample.scala
Loading jdbc_sample.scala...
import java.util.Properties
connProps: java.util.Properties = {}
res0: Object = null
res1: Object = null
df: org.apache.spark.sql.DataFrame = [PROD_ID: decimal(6,0), PROD_NAME: string ... 20 more fields]

scala> df.printSchema
root
 |-- PROD_ID: decimal(6,0) (nullable = false)
 |-- PROD_NAME: string (nullable = false)
 |-- PROD_DESC: string (nullable = false)
 |-- PROD_SUBCATEGORY: string (nullable = false)
 |-- PROD_SUBCATEGORY_ID: decimal(38,10) (nullable = false)
 |-- PROD_SUBCATEGORY_DESC: string (nullable = false)
 |-- PROD_CATEGORY: string (nullable = false)
 |-- PROD_CATEGORY_ID: decimal(38,10) (nullable = false)
 |-- PROD_CATEGORY_DESC: string (nullable = false)
 |-- PROD_WEIGHT_CLASS: decimal(3,0) (nullable = false)
 |-- PROD_UNIT_OF_MEASURE: string (nullable = true)
 |-- PROD_PACK_SIZE: string (nullable = false)
 |-- SUPPLIER_ID: decimal(6,0) (nullable = false)
 |-- PROD_STATUS: string (nullable = false)
 |-- PROD_LIST_PRICE: decimal(8,2) (nullable = false)
 |-- PROD_MIN_PRICE: decimal(8,2) (nullable = false)
 |-- PROD_TOTAL: string (nullable = false)
 |-- PROD_TOTAL_ID: decimal(38,10) (nullable = false)
 |-- PROD_SRC_ID: decimal(38,10) (nullable = true)
 |-- PROD_EFF_FROM: timestamp (nullable = true)
 |-- PROD_EFF_TO: timestamp (nullable = true)
 |-- PROD_VALID: string (nullable = true)

scala> df.groupBy("PROD_CATEGORY").count.show
+--------------------+-----+
|       PROD_CATEGORY|count|
+--------------------+-----+
|      Software/Other|   26|
|               Photo|   10|
|         Electronics|   13|
|Peripherals and A...|   21|
|            Hardware|    2|
+--------------------+-----+

And … that’s it … have fun with Spark 😉


Install a simple hadoop cluster for testing

Posted by Laurent on July 2, 2013

Today, I won’t write an Oracle-related post, but a post about Hadoop. If you don’t have a full rack of servers in your cellar and you want to write a map/reduce algorithm and test it, you will need a simple cluster.

In my case, I installed Hadoop on a 3-node cluster (bigdata1, bigdata2 and bigdata3). Each node is an Oracle Enterprise Linux 6 box with:

2 GB RAM
a 30 GB system disk
2 disks mounted on /mnt/hdfs/1 and /mnt/hdfs/2 for storing HDFS data. Each disk is 10 GB.

My master node will be bigdata1, with 2 slave nodes: bigdata2 and bigdata3.

Each box has java installed:

[oracle@bigdata1 ~]$ java -version
java version "1.7.0_11"
Java(TM) SE Runtime Environment (build 1.7.0_11-b21)
Java HotSpot(TM) 64-Bit Server VM (build 23.6-b04, mixed mode)

First, you need to download the latest stable Hadoop tarball: http://hadoop.apache.org/releases.html#Download. In my case, I downloaded Hadoop 1.1.2.

After this, you need to untar the file on each server:

[oracle@bigdata3 ~]$ pwd
/home/oracle
[oracle@bigdata3 ~]$ tar -xvzf hadoop-1.1.2.tar.gz

On the master node, edit conf/core-site.xml to configure the namenode:

[oracle@bigdata1 hadoop-1.1.2]$ cat conf/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
  <name>fs.default.name</name>
    <value>hdfs://bigdata1:8020</value>
    </property>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/mnt/hdfs/1/hadoop/dfs,/mnt/hdfs/2/hadoop/dfs</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.checkpoint.dir</name>
  <value>/mnt/hdfs/1/hadoop/dfs/namesecondary,/mnt/hdfs/2/hadoop/dfs/namesecondary</value>
  <description>Determines where on the local filesystem the DFS secondary
               name node should store the temporary images to merge.
               If this is a comma-delimited list of directories then the image is
               replicated in all of the directories for redundancy.
  </description>
</property>
</configuration>

An important parameter is hadoop.tmp.dir. Since we want to set up a simple cluster for testing, we would like to keep this configuration as simple as possible. If you look at the documentation pages for the config files (core, hdfs and mapreduce), you will see that most parameters (dfs.data.dir, mapred.system.dir, etc.) are derived from the hadoop.tmp.dir parameter. This keeps our config very simple.

More information about this file here: http://hadoop.apache.org/docs/r1.1.2/core-default.html
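
To make the derivation concrete, here is a small sketch in plain Java (no Hadoop dependency) that expands a few defaults from a given hadoop.tmp.dir. The default values are the ones I read in the Hadoop 1.x *-default.xml pages, so treat them as an assumption and check the pages linked in this post for your version. The class name TmpDirDefaults and the single-directory base /mnt/hdfs/1/hadoop are only there for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class TmpDirDefaults {
    // Default values as I read them in the Hadoop 1.x *-default.xml files (assumption).
    static Map<String, String> defaults() {
        Map<String, String> d = new LinkedHashMap<String, String>();
        d.put("dfs.name.dir", "${hadoop.tmp.dir}/dfs/name");
        d.put("dfs.data.dir", "${hadoop.tmp.dir}/dfs/data");
        d.put("mapred.system.dir", "${hadoop.tmp.dir}/mapred/system");
        d.put("mapred.local.dir", "${hadoop.tmp.dir}/mapred/local");
        return d;
    }

    // Replaces the ${hadoop.tmp.dir} placeholder textually, nothing more.
    static Map<String, String> resolve(String tmpDir) {
        Map<String, String> resolved = new LinkedHashMap<String, String>();
        for (Map.Entry<String, String> e : defaults().entrySet()) {
            resolved.put(e.getKey(), e.getValue().replace("${hadoop.tmp.dir}", tmpDir));
        }
        return resolved;
    }

    public static void main(String[] args) {
        // Hypothetical single-directory base, used only for this illustration.
        for (Map.Entry<String, String> e : resolve("/mnt/hdfs/1/hadoop").entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

As far as I can tell the substitution is purely textual, so with a comma-separated hadoop.tmp.dir the suffix ends up appended after the last entry only. It may be worth checking the resolved directories on your nodes.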

Next, you have to configure the conf/hdfs-site.xml file (HDFS config). This is a basic config file with the location of the name table on the local filesystem and the block replication factor:

[oracle@bigdata1 hadoop-1.1.2]$ cat conf/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
  <name>dfs.name.dir</name>
  <value>/mnt/hdfs/1/hadoop/dfs/name,/mnt/hdfs/2/hadoop/dfs/name</value>
</property>

<property>
  <name>dfs.replication</name>
      <value>2</value>
      <description>Default block replication.
                   The actual number of replications can be specified when the file is created.
                   The default is used if replication is not specified in create time.
      </description>
</property>
</configuration>

Other parameters are documented here: http://hadoop.apache.org/docs/r1.1.2/hdfs-default.html
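
A quick back-of-the-envelope calculation shows what a replication factor of 2 means for such a small cluster. The sketch below is plain Java arithmetic under simple assumptions: three datanodes with two 10 GB disks each (bigdata1 also stores data in this setup), no space reserved for anything else, and a file of 84046293 bytes, which is the size of the sample file loaded later in this post. The class and method names are mine.

```java
final class ReplicationMath {
    // Raw bytes consumed on the datanodes by a file of the given logical size.
    static long rawBytesFor(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    // Logical capacity left once every block is stored `replication` times.
    // Ignores filesystem overhead and non-HDFS usage (simplifying assumption).
    static long usableBytes(int nodes, int disksPerNode, long bytesPerDisk, int replication) {
        return (long) nodes * disksPerNode * bytesPerDisk / replication;
    }

    public static void main(String[] args) {
        long tenGiB = 10L << 30;
        System.out.println("usable bytes: " + usableBytes(3, 2, tenGiB, 2));
        System.out.println("raw bytes for the sample file: " + rawBytesFor(84046293L, 2));
    }
}
```

In other words, the six 10 GB disks hold roughly 30 GB of distinct data, and every file costs twice its size on disk.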

Finally, we need to configure the mapreduce config file (located in HADOOP_HOME/conf/mapred-site.xml).

In my case, I decided to configure only one parameter which is the location of the JobTracker (usually located on the namenode).

[oracle@bigdata1 hadoop-1.1.2]$ cat conf/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
    <name>mapred.job.tracker</name>
    <value>bigdata1:9001</value>
</property>
</configuration>

You will find more information about mapreduce configuration parameters here: http://hadoop.apache.org/docs/r1.1.2/mapred-default.html

The next step is to specify which server will act as the namenode and which servers will act as datanodes.

To do this, we first have to configure name resolution correctly: each node must resolve all the node names in the cluster:

[oracle@bigdata1 hadoop-1.1.2]$ cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
192.168.99.40 bigdata1.localdomain bigdata1
192.168.99.41 bigdata2.localdomain bigdata2
192.168.99.42 bigdata3.localdomain bigdata3

The next step is to edit the masters and slaves files:

[oracle@bigdata1 hadoop-1.1.2]$ cat conf/masters
bigdata1
[oracle@bigdata1 hadoop-1.1.2]$ cat conf/slaves
bigdata1
bigdata2
bigdata3

Ok, now our system is configured as we want … simple !

Next step is to configure the Hadoop environment. This can be done in the HADOOP_HOME/conf/hadoop-env.sh file:

[oracle@bigdata1 hadoop-1.1.2]$ cat conf/hadoop-env.sh | sed -e '/^#/d'

export JAVA_HOME=/usr/java/jdk1.7.0_11
export HADOOP_HEAPSIZE=1024
export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_NAMENODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_SECONDARYNAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_DATANODE_OPTS"
export HADOOP_BALANCER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_BALANCER_OPTS"
export HADOOP_JOBTRACKER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_JOBTRACKER_OPTS"
export HADOOP_LOG_DIR=/home/oracle/hadoop-1.1.2/log

Each parameter is documented in the file.

Don’t forget to copy each config file to every cluster node.

Next step is to create the directories for the name table location:

[oracle@bigdata1 hadoop-1.1.2]$ mkdir -p /mnt/hdfs/1/hadoop/dfs/name
[oracle@bigdata1 hadoop-1.1.2]$ mkdir -p /mnt/hdfs/2/hadoop/dfs/name

Now, we can format the namenode:

[oracle@bigdata1 hadoop-1.1.2]$ /home/oracle/hadoop-1.1.2/bin/hadoop namenode -format
13/07/02 09:01:22 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = bigdata1.localdomain/192.168.99.40
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 1.1.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.1 -r 1440782; compiled by 'hortonfo' on Thu Jan 31 02:03:24 UTC 2013
************************************************************/
Re-format filesystem in /mnt/hdfs/1/hadoop/dfs/name ? (Y or N) Y
Re-format filesystem in /mnt/hdfs/2/hadoop/dfs/name ? (Y or N) Y
13/07/02 09:01:29 INFO util.GSet: VM type       = 64-bit
13/07/02 09:01:29 INFO util.GSet: 2% max memory = 19.7975 MB
13/07/02 09:01:29 INFO util.GSet: capacity      = 2^21 = 2097152 entries
13/07/02 09:01:29 INFO util.GSet: recommended=2097152, actual=2097152
13/07/02 09:01:29 INFO namenode.FSNamesystem: fsOwner=oracle
13/07/02 09:01:29 INFO namenode.FSNamesystem: supergroup=supergroup
13/07/02 09:01:29 INFO namenode.FSNamesystem: isPermissionEnabled=true
13/07/02 09:01:29 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
13/07/02 09:01:29 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
13/07/02 09:01:29 INFO namenode.NameNode: Caching file names occuring more than 10 times
13/07/02 09:01:30 INFO common.Storage: Image file of size 112 saved in 0 seconds.
13/07/02 09:01:30 INFO namenode.FSEditLog: closing edit log: position=4, editlog=/mnt/hdfs/1/hadoop/dfs/name/current/edits
13/07/02 09:01:30 INFO namenode.FSEditLog: close success: truncate to 4, editlog=/mnt/hdfs/1/hadoop/dfs/name/current/edits
13/07/02 09:01:30 INFO common.Storage: Storage directory /mnt/hdfs/1/hadoop/dfs/name has been successfully formatted.
13/07/02 09:01:30 INFO common.Storage: Image file of size 112 saved in 0 seconds.
13/07/02 09:01:30 INFO namenode.FSEditLog: closing edit log: position=4, editlog=/mnt/hdfs/2/hadoop/dfs/name/current/edits
13/07/02 09:01:30 INFO namenode.FSEditLog: close success: truncate to 4, editlog=/mnt/hdfs/2/hadoop/dfs/name/current/edits
13/07/02 09:01:30 INFO common.Storage: Storage directory /mnt/hdfs/2/hadoop/dfs/name has been successfully formatted.
13/07/02 09:01:30 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at bigdata1.localdomain/192.168.99.40
************************************************************/

And now … the most important step, we will start our cluster. This operation is launched from the namenode:

[oracle@bigdata1 hadoop-1.1.2]$ /home/oracle/hadoop-1.1.2/bin/start-all.sh
starting namenode, logging to /home/oracle/hadoop-1.1.2/log/hadoop-oracle-namenode-bigdata1.localdomain.out
bigdata2: starting datanode, logging to /home/oracle/hadoop-1.1.2/log/hadoop-oracle-datanode-bigdata2.localdomain.out
bigdata3: starting datanode, logging to /home/oracle/hadoop-1.1.2/log/hadoop-oracle-datanode-bigdata3.localdomain.out
bigdata1: starting datanode, logging to /home/oracle/hadoop-1.1.2/log/hadoop-oracle-datanode-bigdata1.localdomain.out
bigdata1: starting secondarynamenode, logging to /home/oracle/hadoop-1.1.2/log/hadoop-oracle-secondarynamenode-bigdata1.localdomain.out
starting jobtracker, logging to /home/oracle/hadoop-1.1.2/log/hadoop-oracle-jobtracker-bigdata1.localdomain.out
bigdata2: starting tasktracker, logging to /home/oracle/hadoop-1.1.2/log/hadoop-oracle-tasktracker-bigdata2.localdomain.out
bigdata3: starting tasktracker, logging to /home/oracle/hadoop-1.1.2/log/hadoop-oracle-tasktracker-bigdata3.localdomain.out
bigdata1: starting tasktracker, logging to /home/oracle/hadoop-1.1.2/log/hadoop-oracle-tasktracker-bigdata1.localdomain.out

OK, everything seems to be fine (you can have a look at the log files to check the health of the processes).

You can use the jps command to check that every process is now running:

On the name node (in my case, it acts as both name node and data node, which is why we find a TaskTracker and a DataNode process):

[oracle@bigdata1 hadoop-1.1.2]$ jps -m
11699 JobTracker
11592 SecondaryNameNode
11832 TaskTracker
11964 Jps -m
11467 DataNode
11345 NameNode

On the data nodes:

[oracle@bigdata2 log]$ jps -m
6675 DataNode
6788 TaskTracker
6866 Jps -m

[oracle@bigdata3 conf]$ jps -m
6856 Jps -m
6770 TaskTracker
6657 DataNode

We can now push big files to HDFS and launch map/reduce jobs on them.

First, we need to create directories:

[oracle@bigdata1 hadoop-1.1.2]$ /home/oracle/hadoop-1.1.2/bin/hadoop dfs -mkdir input
[oracle@bigdata1 hadoop-1.1.2]$ /home/oracle/hadoop-1.1.2/bin/hadoop dfs -mkdir output
[oracle@bigdata1 hadoop-1.1.2]$ /home/oracle/hadoop-1.1.2/bin/hadoop dfs -ls
Found 2 items
drwxr-xr-x - oracle supergroup 0 2013-07-02 09:35 /user/oracle/input
drwxr-xr-x - oracle supergroup 0 2013-07-02 09:36 /user/oracle/output

Next, we will copy a sample file that contains information about artists and tracks (see a sample below):

[oracle@bigdata1 hadoop-1.1.2]$ /home/oracle/hadoop-1.1.2/bin/hadoop dfs -copyFromLocal /tmp/unique_tracks.txt /user/oracle/input
[oracle@bigdata1 hadoop-1.1.2]$ head -50 /tmp/unique_tracks.txt
TRMMMHY12903CB53F1<SEP>SOPMIYT12A6D4F851E<SEP>Joseph Locke<SEP>Goodbye
TRMMMML128F4280EE9<SEP>SOJCFMH12A8C13B0C2<SEP>The Sun Harbor's Chorus-Documentary Recordings<SEP>Mama_ mama can't you see ?
TRMMMNS128F93548E1<SEP>SOYGNWH12AB018191E<SEP>3 Gars Su'l Sofa<SEP>L'antarctique
TRMMMXJ12903CBF111<SEP>SOLJTLX12AB01890ED<SEP>Jorge Negrete<SEP>El hijo del pueblo
TRMMMCJ128F930BFF8<SEP>SOQQESG12A58A7AA28<SEP>Danny Diablo<SEP>Cold Beer feat. Prince Metropolitan
TRMMMBW128F4260CAE<SEP>SOMPVQB12A8C1379BB<SEP>Tiger Lou<SEP>Pilots
TRMMMXI128F4285A3F<SEP>SOGPCJI12A8C13CCA0<SEP>Waldemar Bastos<SEP>N Gana
TRMMMKI128F931D80D<SEP>SOSDCFG12AB0184647<SEP>Lena Philipsson<SEP>006
TRMMMUT128F42646E8<SEP>SOBARPM12A8C133DFF<SEP>Shawn Colvin<SEP>(Looking For) The Heart Of Saturday
TRMMMQY128F92F0EA3<SEP>SOKOVRQ12A8C142811<SEP>Dying Fetus<SEP>Ethos of Coercion
.../...

[oracle@bigdata1 hadoop-1.1.2]$ /home/oracle/hadoop-1.1.2/bin/hadoop dfs -du
Found 2 items
84046293    hdfs://bigdata1:8020/user/oracle/input
0           hdfs://bigdata1:8020/user/oracle/output

Finally, we can create our MapReduce program. In my case, I wrote a simple counter of artists (the artist is the third field of each line):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class ArtistCounter {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

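    // Input lines have four <SEP>-separated fields; the artist is the third one.
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      String[] res = value.toString().split("<SEP>");
      // Skip malformed lines that have no artist field.
      if (res.length > 2) {
        word.set(res[2].toLowerCase());
        context.write(word, one);
      }
    }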
  }

  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "Laurent's Singer counter");

    job.setJarByClass(ArtistCounter.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
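
Before compiling against Hadoop, the parsing logic can be checked locally. The sketch below is not part of the job: it is a plain-Java approximation of what the mapper and the reducer do together, run on two lines of the sample file. The class name ArtistCountLocal is mine, and skipping lines without a third field is a choice made for this sketch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ArtistCountLocal {
    // Mirrors the map step: the artist is the third <SEP>-separated field, lower-cased.
    // Returns null for lines that do not have that field.
    static String artistOf(String line) {
        String[] fields = line.split("<SEP>");
        return fields.length > 2 ? fields[2].toLowerCase() : null;
    }

    // Mirrors the reduce step: adds one per line for each artist.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String line : lines) {
            String artist = artistOf(line);
            if (artist == null) {
                continue;
            }
            Integer current = counts.get(artist);
            counts.put(artist, current == null ? 1 : current + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "TRMMMHY12903CB53F1<SEP>SOPMIYT12A6D4F851E<SEP>Joseph Locke<SEP>Goodbye",
            "TRMMMBW128F4260CAE<SEP>SOMPVQB12A8C1379BB<SEP>Tiger Lou<SEP>Pilots");
        System.out.println(count(sample));
    }
}
```

Lower-casing the artist name means that two spellings differing only by case are counted together, which is also what the Hadoop job does.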

Next step, compile …

[oracle@bigdata1 java]$ echo $CLASSPATH
/home/oracle/hadoop-1.1.2/libexec/../conf:/usr/java/jdk1.7.0_11/lib/tools.jar:/home/oracle/hadoop-1.1.2/libexec/..:/home/oracle/hadoop-1.1.2/libexec/../hadoop-core-1.1.2.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/asm-3.2.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/aspectjrt-1.6.11.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/aspectjtools-1.6.11.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-beanutils-1.7.0.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-beanutils-core-1.8.0.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-cli-1.2.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-codec-1.4.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-collections-3.2.1.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-configuration-1.6.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-daemon-1.0.1.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-digester-1.8.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-el-1.0.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-httpclient-3.0.1.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-io-2.1.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-lang-2.4.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-logging-1.1.2.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-logging-api-1.0.4.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-math-2.1.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-net-3.1.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/core-3.1.1.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/hadoop-capacity-scheduler-1.1.2.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/hadoop-fairscheduler-1.1.2.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/hadoop-thriftfs-1.1.2.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/hsqldb-1.8.0.10.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jackson-core-asl-1.8.8.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jackson-mapper-asl-1.8.8.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jasper-compiler-5.5.12.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jasper-runtime-5.5.12.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jdeb-0.8.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jersey-core-1.8.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jersey-json-1.8.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jersey-server-1.8.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jets3t-0.6.1.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jetty-6.1.26.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jetty-util-6.1.26.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jsch-0.1.42.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/junit-4.5.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/kfs-0.2.2.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/log4j-1.2.15.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/mockito-all-1.8.5.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/oro-2.0.8.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/servlet-api-2.5-20081211.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/slf4j-api-1.4.3.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/slf4j-log4j12-1.4.3.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/xmlenc-0.52.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jsp-2.1/jsp-2.1.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jsp-2.1/jsp-api-2.1.jar:/home/oracle/hadoop-1.1.2/hadoop-ant-1.1.2.jar:/home/oracle/hadoop-1.1.2/hadoop-client-1.1.2.jar:/home/oracle/hadoop-1.1.2/hadoop-core-1.1.2.jar:/home/oracle/hadoop-1.1.2/hadoop-examples-1.1.2.jar:/home/oracle/hadoop-1.1.2/hadoop-minicluster-1.1.2.jar:/home/oracle/hadoop-1.1.2/hadoop-test-1.1.2.jar:/home/oracle/hadoop-1.1.2/hadoop-tools-1.1.2.jar

[oracle@bigdata1 java]$ javac -cp $CLASSPATH:. ArtistCounter.java

Next … packaging

[oracle@bigdata1 java]$ jar -cvf ArtistCounter.jar ArtistCounter*.class
added manifest
adding: ArtistCounter.class(in = 1489) (out= 813)(deflated 45%)
adding: ArtistCounter$IntSumReducer.class(in = 1751) (out= 742)(deflated 57%)
adding: ArtistCounter$TokenizerMapper.class(in = 1721) (out= 718)(deflated 58%)

And now, we can run this small test program in our hadoop cluster:

[oracle@bigdata1 java]$ /home/oracle/hadoop-1.1.2/bin/hadoop jar ArtistCounter.jar ArtistCounter /user/oracle/input /user/oracle/output/laurent
13/07/02 09:56:08 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/07/02 09:56:08 INFO input.FileInputFormat: Total input paths to process : 1
13/07/02 09:56:08 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/07/02 09:56:08 WARN snappy.LoadSnappy: Snappy native library not loaded
13/07/02 09:56:08 INFO mapred.JobClient: Running job: job_201307020925_0001
13/07/02 09:56:09 INFO mapred.JobClient:  map 0% reduce 0%
13/07/02 09:56:20 INFO mapred.JobClient:  map 50% reduce 0%
13/07/02 09:56:22 INFO mapred.JobClient:  map 100% reduce 0%
13/07/02 09:56:28 INFO mapred.JobClient:  map 100% reduce 33%
13/07/02 09:56:31 INFO mapred.JobClient:  map 100% reduce 100%
13/07/02 09:56:32 INFO mapred.JobClient: Job complete: job_201307020925_0001
13/07/02 09:56:32 INFO mapred.JobClient: Counters: 29
13/07/02 09:56:32 INFO mapred.JobClient:   Job Counters
13/07/02 09:56:32 INFO mapred.JobClient:     Launched reduce tasks=1
13/07/02 09:56:32 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=18647
13/07/02 09:56:32 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/07/02 09:56:32 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/07/02 09:56:32 INFO mapred.JobClient:     Launched map tasks=2
13/07/02 09:56:32 INFO mapred.JobClient:     Data-local map tasks=2
13/07/02 09:56:32 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=10862
13/07/02 09:56:32 INFO mapred.JobClient:   File Output Format Counters
13/07/02 09:56:32 INFO mapred.JobClient:     Bytes Written=1651492
13/07/02 09:56:32 INFO mapred.JobClient:   FileSystemCounters
13/07/02 09:56:32 INFO mapred.JobClient:     FILE_BYTES_READ=6240405
13/07/02 09:56:32 INFO mapred.JobClient:     HDFS_BYTES_READ=84050631
13/07/02 09:56:32 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=9124002
13/07/02 09:56:32 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1651492
13/07/02 09:56:32 INFO mapred.JobClient:   File Input Format Counters
13/07/02 09:56:32 INFO mapred.JobClient:     Bytes Read=84050389
13/07/02 09:56:32 INFO mapred.JobClient:   Map-Reduce Framework
13/07/02 09:56:32 INFO mapred.JobClient:     Map output materialized bytes=2724501
13/07/02 09:56:32 INFO mapred.JobClient:     Map input records=1000000
13/07/02 09:56:32 INFO mapred.JobClient:     Reduce shuffle bytes=2724501
13/07/02 09:56:32 INFO mapred.JobClient:     Spilled Records=370725
13/07/02 09:56:32 INFO mapred.JobClient:     Map output bytes=18534685
13/07/02 09:56:32 INFO mapred.JobClient:     Total committed heap usage (bytes)=270082048
13/07/02 09:56:32 INFO mapred.JobClient:     CPU time spent (ms)=10870
13/07/02 09:56:32 INFO mapred.JobClient:     Combine input records=1150423
13/07/02 09:56:32 INFO mapred.JobClient:     SPLIT_RAW_BYTES=242
13/07/02 09:56:32 INFO mapred.JobClient:     Reduce input records=110151
13/07/02 09:56:32 INFO mapred.JobClient:     Reduce input groups=72663
13/07/02 09:56:32 INFO mapred.JobClient:     Combine output records=260574
13/07/02 09:56:32 INFO mapred.JobClient:     Physical memory (bytes) snapshot=453685248
13/07/02 09:56:32 INFO mapred.JobClient:     Reduce output records=72663
13/07/02 09:56:32 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=3173515264
13/07/02 09:56:32 INFO mapred.JobClient:     Map output records=1000000

As a last step, we can look at the content of the result file and check how many albums the MapReduce job found for a given artist:

[oracle@bigdata1 java]$ /home/oracle/hadoop-1.1.2/bin/hadoop dfs -ls /user/oracle/output/laurent
Found 3 items
-rw-r--r--   3 oracle supergroup          0 2013-07-02 09:56 /user/oracle/output/laurent/_SUCCESS
drwxr-xr-x   - oracle supergroup          0 2013-07-02 09:56 /user/oracle/output/laurent/_logs
-rw-r--r--   3 oracle supergroup    1651492 2013-07-02 09:56 /user/oracle/output/laurent/part-r-00000

[oracle@bigdata1 java]$ /home/oracle/hadoop-1.1.2/bin/hadoop dfs -cat /user/oracle/output/laurent/part-r-00000 | grep 'pink floyd'
pink floyd      117
pink floyd tribute      10
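
To go a bit further than a grep, the result file can also be post-processed with a few lines of Java. The sketch below is only an illustration and is not part of the ArtistCounter job: the class name `ArtistCounts` and the method `topN` are hypothetical, and it assumes each line of part-r-00000 has the form artist, a tab, then the count (the default key/value separator of Hadoop's text output). It avoids lambdas so that it compiles with a JDK 7 like the one used in this post.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of the original ArtistCounter job).
// Assumes every result line looks like: artist<TAB>count
class ArtistCounts {

    // Parses the result lines and returns the n artists with the highest counts.
    static List<Map.Entry<String, Integer>> topN(List<String> lines, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>();
        for (String line : lines) {
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                continue; // skip lines that are not key<TAB>value
            }
            String artist = line.substring(0, tab);
            Integer count = Integer.valueOf(line.substring(tab + 1).trim());
            entries.add(new SimpleEntry<String, Integer>(artist, count));
        }
        // Sort by count, highest first.
        Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            @Override
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                return Integer.compare(b.getValue(), a.getValue());
            }
        });
        return entries.subList(0, Math.min(n, entries.size()));
    }

    public static void main(String[] args) {
        // Sample lines taken from the grep output above.
        List<String> sample = Arrays.asList(
                "pink floyd\t117",
                "pink floyd tribute\t10");
        for (Map.Entry<String, Integer> e : topN(sample, 10)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

In practice, the lines would come from the part-r-00000 file once it has been copied out of HDFS, instead of the hard-coded sample used in `main`.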

hadoop, java, Linux, Mapreduce
