Data … as usual

All things about data by Laurent Leturgez

Monthly Archives: July 2013

Where are X11 libs after a Solaris 11 fresh install

Leave a comment Posted by Laurent on July 25, 2013

If you already installed Solaris 11 (for example in a Virtual Box VM), you may want to install X11 libs.

To do this, you have to install this bunch of packages:

root@s11:~# pkg install SUNWarc SUNWbtool SUNWhea SUNWlibms SUNWmfrun SUNWxorg-client-programs SUNWxorg-clientlibs SUNWxwfsw pkg://solaris/SUNWxwplt truetype/fonts-core

           Packages to install: 70
       Create boot environment: No
Create backup boot environment: No
            Services to change:  5

DOWNLOAD                                  PKGS       FILES    XFER (MB)
Completed                                70/70   5344/5344    41.2/41.2

PHASE                                        ACTIONS
Install Phase                              7322/7322

PHASE                                          ITEMS
Package State Update Phase                     70/70
Image State Update Phase                         2/2

Solaris 11, VirtualBox

Write SQL statements on internal structures in multitenant databases.

Leave a comment Posted by Laurent on July 8, 2013

In a multitenant configuration, if you are querying X$ structures and fixed tables (OBJ$, TAB$ etc.), you will face differents behaviour depending on the fact you are connected to the root container or to a pluggable database.

If you are connected to a root container (CDB$ROOT)
- In fixed tables, for example OBJ$, there’s no CON_ID column defined in this table so this will give you all objects for your root container, and not for all containers.

SQL> show con_name
CON_NAME
------------------------------
CDB$ROOT
SQL> desc obj$
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 OBJ#                                      NOT NULL NUMBER
 DATAOBJ#                                           NUMBER
 OWNER#                                    NOT NULL NUMBER
 NAME                                      NOT NULL VARCHAR2(128)
 NAMESPACE                                 NOT NULL NUMBER
 SUBNAME                                            VARCHAR2(128)
 TYPE#                                     NOT NULL NUMBER
 CTIME                                     NOT NULL DATE
 MTIME                                     NOT NULL DATE
 STIME                                     NOT NULL DATE
 STATUS                                    NOT NULL NUMBER
 REMOTEOWNER                                        VARCHAR2(128)
 LINKNAME                                           VARCHAR2(128)
 FLAGS                                              NUMBER
 OID$                                               RAW(16)
 SPARE1                                             NUMBER
 SPARE2                                             NUMBER
 SPARE3                                             NUMBER
 SPARE4                                             VARCHAR2(1000)
 SPARE5                                             VARCHAR2(1000)
 SPARE6                                             DATE
 SIGNATURE                                          RAW(16)
 SPARE7                                             NUMBER
 SPARE8                                             NUMBER
 SPARE9                                             NUMBER

X$ structures will give you information for all containers (ROOT$CDB, SEED and and all pluggable database). And there’s a CON_ID column defined in all those structures.

SQL> show con_name

CON_NAME
------------------------------
CDB$ROOT

SQL> desc x$ksppi
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 ADDR                                               RAW(8)
 INDX                                               NUMBER
 INST_ID                                            NUMBER
 CON_ID                                             NUMBER
 KSPPINM                                            VARCHAR2(80)
 KSPPITY                                            NUMBER
 KSPPDESC                                           VARCHAR2(255)
 KSPPIFLG                                           NUMBER
 KSPPILRMFLG                                        NUMBER
 KSPPIHASH                                          NUMBER

SQL> select c.name,KSPPINM
  2  from x$ksppi x, v$containers c
  3  where x.con_id=c.con_id and  KSPPINM='open_cursors';

NAME                           KSPPINM
------------------------------ ------------------------------
CDB$ROOT                       open_cursors
PDB$SEED                       open_cursors
PDB1                           open_cursors
PDB2                           open_cursors

If you are connected to a PDB
- In fixed tables, for example OBJ$, there’s still no CON_ID, so this will give you all objects on your PDB.
- In X$ structures, there’s a CON_ID column, but if you are connected to a PDB, you will see only the data related to this PDB.

To summarize, static fixed tables contain data for the actual container (CDB$ROOT or PDB), but X$ structures contain data for all containers if you are connected to the CDB$ROOT container, and contain data related to the PDB you are connected to, in case of a PDB.

If you disassemble static views like CDB_TABLES, you will see a new function used to have data for all containers. This is CDB$VIEW function :

SQL> show con_id

CON_ID
------------------------------
1
SQL> show con_name

CON_NAME
------------------------------
CDB$ROOT
SQL> select count(*) from obj$;

  COUNT(*)
----------
     91005

SQL> select con_id,count(*) from CDB$VIEW("SYS"."OBJ$") group by con_id order by 1;

    CON_ID   COUNT(*)
---------- ----------
         1      91005
         2      90708
         3      90960
         4      90948

Ok, now we have the information for all our containers. But if we want to join an X$ Structure and a CDB$VIEW transformed object, time for execute this is too long:

select x.con_id,o.name,count(*)
from x$bh x, (select name,dataobj#,con_id from CDB$VIEW("SYS"."OBJ$")) o
where x.con_id=o.con_id
and o.dataobj#=x.obj
and o.name like 'T_PDB%'
group by x.con_id,o.name
/
... never ends :(

To execute it in a better time, I used query factorization with a WITH block, and forced materialization of it:

SQL> with o as (select /*+ MATERIALIZE */ name,dataobj#,con_id from CDB$VIEW("SYS"."OBJ$"))
  2  select x.con_id,o.name,count(*)
  3  from x$bh x,o
  4  where x.con_id=o.con_id
  5  and o.dataobj#=x.obj
  6  and o.name like 'T_PDB%'
  7  group by x.con_id,o.name
  8  /

    CON_ID NAME                   COUNT(*)
---------- -------------------- ----------
         4 T_PDB2                        9
         3 T_PDB1                        9

Elapsed: 00:00:02.40

12c, CBO, Internals, Oracle, SQL

Install a simple hadoop cluster for testing

7 Comments Posted by Laurent on July 2, 2013

Today, I won’t write an oracle related post, but a post about hadoop. If you don’t have a full rack of servers in your cellar, and you want to write map/reduce algorithm and test it, you will need a simple cluster.

In my case, I installed hadoop on a 3 nodes cluster (bigdata1, bigdata2 and bigdata3). Each node is an Oracle Enterprise Linux 6 box with :

2 Gb RAM
a 30Gb system disk
2 disks mounted on /mnt/hdfs/1 and /mnt/hdfs/2 for storing HDFS datas. Each disk is 10Gb.

My master node will be bigdata1, with 2 slaves nodes: bigdata2 and bigdata3.

Each box has java installed:

[oracle@bigdata1 ~]$ java -version
java version "1.7.0_11"
Java(TM) SE Runtime Environment (build 1.7.0_11-b21)
Java HotSpot(TM) 64-Bit Server VM (build 23.6-b04, mixed mode)

First, you need to download last stable hadoop tarball: http://hadoop.apache.org/releases.html#Download. In my case, I downloaded Hadoop 1.1.2.

After this, you need to untar the file on each servers:

[oracle@bigdata3 ~]$ pwd
/home/oracle
[oracle@bigdata3 ~]$ tar -xvzf hadoop-1.1.2.tar.gz

On the master node, edit conf/core-site.xml to configure the namenode:

[oracle@bigdata1 hadoop-1.1.2]$ cat conf/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
  <name>fs.default.name</name>
    <value>hdfs://bigdata1:8020</value>
    </property>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/mnt/hdfs/1/hadoop/dfs,/mnt/hdfs/2/hadoop/dfs</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.checkpoint.dir</name>
  <value>/mnt/hdfs/1/hadoop/dfs/namesecondary,/mnt/hdfs/2/hadoop/dfs/namesecondary</value>
  <description>Determines where on the local filesystem the DFS secondary
               name node should store the temporary images to merge.
               If this is a comma-delimited list of directories then the image is
               replicated in all of the directories for redundancy.
  </description>
</property>
</configuration>

An important parameter is hadoop.tmp.dir. Indeed, as we want to setup a simple cluster for testing, we would like to keep this configuration as simple as possible. If you have a look to the doc pages related to config file (core, hdfs and mapreduce), you will see that most of the parameter (dfs.data.dir, mapred.system.dir etc. are derived from hadoop.tmp.dir parameter). This will keep our config very simple.

More information about this file here: http://hadoop.apache.org/docs/r1.1.2/core-default.html

Next, you have to configure the conf/hdfs-site.xml file (hdfs config). This is a basic config file with location of the name table on the local fs, and the number block replication ways:

[oracle@bigdata1 hadoop-1.1.2]$ cat conf/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
  <name>dfs.name.dir</name>
  <value>/mnt/hdfs/1/hadoop/dfs/name,/mnt/hdfs/2/hadoop/dfs/name</value>
</property>

<property>
  <name>dfs.replication</name>
      <value>2</value>
      <description>Default block replication.
                   The actual number of replications can be specified when the file is created.
                   The default is used if replication is not specified in create time.
      </description>
</property>
</configuration>

Other parameters are documented here: http://hadoop.apache.org/docs/r1.1.2/hdfs-default.html

Finally, we need to configure the mapreduce config file (located in HADOOP_HOME/conf/mapred-site.xml).

In my case, I decided to configure only one parameter which is the location of the JobTracker (usually located on the namenode).

[oracle@bigdata1 hadoop-1.1.2]$ cat conf/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
    <name>mapred.job.tracker</name>
    <value>bigdata1:9001</value>
</property>
</configuration>

You will find more information about mapreduce configuration parameters here: http://hadoop.apache.org/docs/r1.1.2/mapred-default.html

The next step is to mention which server will act as a namenode, and which servers will act as datanode.

To do this, we have to configure name resolution correctly, each node must resolve all the node names in the cluster:

[oracle@bigdata1 hadoop-1.1.2]$ cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
192.168.99.40 bigdata1.localdomain bigdata1
192.168.99.41 bigdata2.localdomain bigdata2
192.168.99.42 bigdata3.localdomain bigdata3
The next step, is to edit masters and slaves files:
[oracle@bigdata1 hadoop-1.1.2]$ cat conf/masters
bigdata1
[oracle@bigdata1 hadoop-1.1.2]$ cat conf/slaves
bigdata1
bigdata2
bigdata3

Ok, now our system is configured as we want … simple !

Next step is to configure the hadoop environment. This can be done in the HADOOP_HOME/conf/hadoop-env.sh file.
[oracle@bigdata1 hadoop-1.1.2]$ cat conf/hadoop-env.sh | sed -e ‘/^#/d’

export JAVA_HOME=/usr/java/jdk1.7.0_11
export HADOOP_HEAPSIZE=1024
export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_NAMENODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_SECONDARYNAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_DATANODE_OPTS"
export HADOOP_BALANCER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_BALANCER_OPTS"
export HADOOP_JOBTRACKER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_JOBTRACKER_OPTS"
export HADOOP_LOG_DIR=/home/oracle/hadoop-1.1.2/log

Each parameter is documented in the file.

Don’t forget to copy each config file on each cluster node.

Next step is to create directory for the name table location:

[oracle@bigdata1 hadoop-1.1.2]$ mkdir -p /mnt/hdfs/1/hadoop/dfs/name
[oracle@bigdata1 hadoop-1.1.2]$ mkdir -p /mnt/hdfs/2/hadoop/dfs/name

Now, we can format the namenode:

[oracle@bigdata1 hadoop-1.1.2]$ /home/oracle/hadoop-1.1.2/bin/hadoop namenode -format
13/07/02 09:01:22 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = bigdata1.localdomain/192.168.99.40
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 1.1.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.1 -r 1440782; compiled by 'hortonfo' on Thu Jan 31 02:03:24 UTC 2013
************************************************************/
Re-format filesystem in /mnt/hdfs/1/hadoop/dfs/name ? (Y or N) Y
Re-format filesystem in /mnt/hdfs/2/hadoop/dfs/name ? (Y or N) Y
13/07/02 09:01:29 INFO util.GSet: VM type       = 64-bit
13/07/02 09:01:29 INFO util.GSet: 2% max memory = 19.7975 MB
13/07/02 09:01:29 INFO util.GSet: capacity      = 2^21 = 2097152 entries
13/07/02 09:01:29 INFO util.GSet: recommended=2097152, actual=2097152
13/07/02 09:01:29 INFO namenode.FSNamesystem: fsOwner=oracle
13/07/02 09:01:29 INFO namenode.FSNamesystem: supergroup=supergroup
13/07/02 09:01:29 INFO namenode.FSNamesystem: isPermissionEnabled=true
13/07/02 09:01:29 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
13/07/02 09:01:29 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
13/07/02 09:01:29 INFO namenode.NameNode: Caching file names occuring more than 10 times
13/07/02 09:01:30 INFO common.Storage: Image file of size 112 saved in 0 seconds.
13/07/02 09:01:30 INFO namenode.FSEditLog: closing edit log: position=4, editlog=/mnt/hdfs/1/hadoop/dfs/name/current/edits
13/07/02 09:01:30 INFO namenode.FSEditLog: close success: truncate to 4, editlog=/mnt/hdfs/1/hadoop/dfs/name/current/edits
13/07/02 09:01:30 INFO common.Storage: Storage directory /mnt/hdfs/1/hadoop/dfs/name has been successfully formatted.
13/07/02 09:01:30 INFO common.Storage: Image file of size 112 saved in 0 seconds.
13/07/02 09:01:30 INFO namenode.FSEditLog: closing edit log: position=4, editlog=/mnt/hdfs/2/hadoop/dfs/name/current/edits
13/07/02 09:01:30 INFO namenode.FSEditLog: close success: truncate to 4, editlog=/mnt/hdfs/2/hadoop/dfs/name/current/edits
13/07/02 09:01:30 INFO common.Storage: Storage directory /mnt/hdfs/2/hadoop/dfs/name has been successfully formatted.
13/07/02 09:01:30 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at bigdata1.localdomain/192.168.99.40
************************************************************/

And now … the most important step, we will start our cluster. This operation is launched from the namenode:

[oracle@bigdata1 hadoop-1.1.2]$ /home/oracle/hadoop-1.1.2/bin/start-all.sh
starting namenode, logging to /home/oracle/hadoop-1.1.2/log/hadoop-oracle-namenode-bigdata1.localdomain.out
bigdata2: starting datanode, logging to /home/oracle/hadoop-1.1.2/log/hadoop-oracle-datanode-bigdata2.localdomain.out
bigdata3: starting datanode, logging to /home/oracle/hadoop-1.1.2/log/hadoop-oracle-datanode-bigdata3.localdomain.out
bigdata1: starting datanode, logging to /home/oracle/hadoop-1.1.2/log/hadoop-oracle-datanode-bigdata1.localdomain.out
bigdata1: starting secondarynamenode, logging to /home/oracle/hadoop-1.1.2/log/hadoop-oracle-secondarynamenode-bigdata1.localdomain.out
starting jobtracker, logging to /home/oracle/hadoop-1.1.2/log/hadoop-oracle-jobtracker-bigdata1.localdomain.out
bigdata2: starting tasktracker, logging to /home/oracle/hadoop-1.1.2/log/hadoop-oracle-tasktracker-bigdata2.localdomain.out
bigdata3: starting tasktracker, logging to /home/oracle/hadoop-1.1.2/log/hadoop-oracle-tasktracker-bigdata3.localdomain.out
bigdata1: starting tasktracker, logging to /home/oracle/hadoop-1.1.2/log/hadoop-oracle-tasktracker-bigdata1.localdomain.out

Ok everything seems to be fine (you can have a look to the log files to check processes health).

You can use to jpm command to check that every process is now launched:

On the name node (in my case, it acts as name node and data node, that’s why we find a TaskTracker and a DataNode process)

[oracle@bigdata1 hadoop-1.1.2]$ jps -m
11699 JobTracker
11592 SecondaryNameNode
11832 TaskTracker
11964 Jps -m
11467 DataNode
11345 NameNode

On the data nodes

[oracle@bigdata2 log]$ jps -m
6675 DataNode
6788 TaskTracker
6866 Jps -m

[oracle@bigdata3 conf]$ jps -m
6856 Jps -m
6770 TaskTracker
6657 DataNode

We can now, push big files on hdfs to launch map/reduce on it.

First, we need to create directories:

[oracle@bigdata1 hadoop-1.1.2]$ /home/oracle/hadoop-1.1.2/bin/hadoop dfs -mkdir input
[oracle@bigdata1 hadoop-1.1.2]$ /home/oracle/hadoop-1.1.2/bin/hadoop dfs -mkdir output
[oracle@bigdata1 hadoop-1.1.2]$ /home/oracle/hadoop-1.1.2/bin/hadoop dfs -ls
Found 2 items
drwxr-xr-x - oracle supergroup 0 2013-07-02 09:35 /user/oracle/input
drwxr-xr-x - oracle supergroup 0 2013-07-02 09:36 /user/oracle/output

Next, we will copy a sample file that contains information about artists and albums (see a sample above)

[oracle@bigdata1 hadoop-1.1.2]$ /home/oracle/hadoop-1.1.2/bin/hadoop dfs -copyFromLocal /tmp/unique_tracks.txt /user/oracle/input
[oracle@bigdata1 hadoop-1.1.2]$ head -50 /tmp/unique_tracks.txt
TRMMMHY12903CB53F1<SEP>SOPMIYT12A6D4F851E<SEP>Joseph Locke<SEP>Goodbye
TRMMMML128F4280EE9<SEP>SOJCFMH12A8C13B0C2<SEP>The Sun Harbor's Chorus-Documentary Recordings<SEP>Mama_ mama can't you see ?
TRMMMNS128F93548E1<SEP>SOYGNWH12AB018191E<SEP>3 Gars Su'l Sofa<SEP>L'antarctique
TRMMMXJ12903CBF111<SEP>SOLJTLX12AB01890ED<SEP>Jorge Negrete<SEP>El hijo del pueblo
TRMMMCJ128F930BFF8<SEP>SOQQESG12A58A7AA28<SEP>Danny Diablo<SEP>Cold Beer feat. Prince Metropolitan
TRMMMBW128F4260CAE<SEP>SOMPVQB12A8C1379BB<SEP>Tiger Lou<SEP>Pilots
TRMMMXI128F4285A3F<SEP>SOGPCJI12A8C13CCA0<SEP>Waldemar Bastos<SEP>N Gana
TRMMMKI128F931D80D<SEP>SOSDCFG12AB0184647<SEP>Lena Philipsson<SEP>006
TRMMMUT128F42646E8<SEP>SOBARPM12A8C133DFF<SEP>Shawn Colvin<SEP>(Looking For) The Heart Of Saturday
TRMMMQY128F92F0EA3<SEP>SOKOVRQ12A8C142811<SEP>Dying Fetus<SEP>Ethos of Coercion
.../...

[oracle@bigdata1 hadoop-1.1.2]$ /home/oracle/hadoop-1.1.2/bin/hadoop dfs -du
Found 2 items
84046293    hdfs://bigdata1:8020/user/oracle/input
0           hdfs://bigdata1:8020/user/oracle/output

Finally, we can create our MapReduce program. In my case, I used a simple counter of artists (Third field):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class ArtistCounter {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      String[] res = value.toString().split("<SEP>");
        word.set(res[2].toLowerCase());
        context.write(word, one);
    }
  }

  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "Laurent's Singer counter");

    job.setJarByClass(ArtistCounter.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Next step, compile …

[oracle@bigdata1 java]$ echo $CLASSPATH
/home/oracle/hadoop-1.1.2/libexec/../conf:/usr/java/jdk1.7.0_11/lib/tools.jar:/home/oracle/hadoop-1.1.2/libexec/..:/home/oracle/hadoop-1.1.2/libexec/../hadoop-core-1.1.2.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/asm-3.2.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/aspectjrt-1.6.11.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/aspectjtools-1.6.11.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-beanutils-1.7.0.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-beanutils-core-1.8.0.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-cli-1.2.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-codec-1.4.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-collections-3.2.1.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-configuration-1.6.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-daemon-1.0.1.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-digester-1.8.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-el-1.0.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-httpclient-3.0.1.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-io-2.1.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-lang-2.4.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-logging-1.1.2.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-logging-api-1.0.4.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-math-2.1.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-net-3.1.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/core-3.1.1.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/hadoop-capacity-scheduler-1.1.2.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/hadoop-fairscheduler-1.1.2.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/hadoop-thriftfs-1.1.2.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/hsqldb-1.8.0.10.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jackson-core-asl-1.8.8.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jackson-mapper-asl-1.8.8.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jasper-compiler-5.5.12.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jasper-runtime-5.5.12.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jdeb-0.8.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jersey-core-1.8.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jersey-json-1.8.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jersey-server-1.8.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jets3t-0.6.1.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jetty-6.1.26.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jetty-util-6.1.26.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jsch-0.1.42.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/junit-4.5.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/kfs-0.2.2.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/log4j-1.2.15.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/mockito-all-1.8.5.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/oro-2.0.8.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/servlet-api-2.5-20081211.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/slf4j-api-1.4.3.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/slf4j-log4j12-1.4.3.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/xmlenc-0.52.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jsp-2.1/jsp-2.1.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jsp-2.1/jsp-api-2.1.jar:/home/oracle/hadoop-1.1.2/hadoop-ant-1.1.2.jar:/home/oracle/hadoop-1.1.2/hadoop-client-1.1.2.jar:/home/oracle/hadoop-1.1.2/hadoop-core-1.1.2.jar:/home/oracle/hadoop-1.1.2/hadoop-examples-1.1.2.jar:/home/oracle/hadoop-1.1.2/hadoop-minicluster-1.1.2.jar:/home/oracle/hadoop-1.1.2/hadoop-test-1.1.2.jar:/home/oracle/hadoop-1.1.2/hadoop-tools-1.1.2.jar

[oracle@bigdata1 java]$ javac -cp $CLASSPATH:. ArtistCounter.java

Next … packaging

[oracle@bigdata1 java]$ jar -cvf ArtistCounter.jar ArtistCounter*.class
added manifest
adding: ArtistCounter.class(in = 1489) (out= 813)(deflated 45%)
adding: ArtistCounter$IntSumReducer.class(in = 1751) (out= 742)(deflated 57%)
adding: ArtistCounter$TokenizerMapper.class(in = 1721) (out= 718)(deflated 58%)

And now, we can run this small test program in our hadoop cluster:

[oracle@bigdata1 java]$ echo $CLASSPATH
/home/oracle/hadoop-1.1.2/libexec/../conf:/usr/java/jdk1.7.0_11/lib/tools.jar:/home/oracle/hadoop-1.1.2/libexec/..:/home/oracle/hadoop-1.1.2/libexec/../hadoop-core-1.1.2.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/asm-3.2.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/aspectjrt-1.6.11.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/aspectjtools-1.6.11.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-beanutils-1.7.0.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-beanutils-core-1.8.0.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-cli-1.2.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-codec-1.4.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-collections-3.2.1.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-configuration-1.6.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-daemon-1.0.1.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-digester-1.8.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-el-1.0.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-httpclient-3.0.1.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-io-2.1.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-lang-2.4.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-logging-1.1.2.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-logging-api-1.0.4.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-math-2.1.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/commons-net-3.1.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/core-3.1.1.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/hadoop-capacity-scheduler-1.1.2.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/hadoop-fairscheduler-1.1.2.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/hadoop-thriftfs-1.1.2.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/hsqldb-1.8.0.10.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jackson-core-asl-1.8.8.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jackson-mapper-asl-1.8.8.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jasper-compiler-5.5.12.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jasper-runtime-5.5.12.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jdeb-0.8.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jersey-core-1.8.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jersey-json-1.8.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jersey-server-1.8.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jets3t-0.6.1.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jetty-6.1.26.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jetty-util-6.1.26.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jsch-0.1.42.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/junit-4.5.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/kfs-0.2.2.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/log4j-1.2.15.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/mockito-all-1.8.5.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/oro-2.0.8.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/servlet-api-2.5-20081211.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/slf4j-api-1.4.3.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/slf4j-log4j12-1.4.3.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/xmlenc-0.52.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jsp-2.1/jsp-2.1.jar:/home/oracle/hadoop-1.1.2/libexec/../lib/jsp-2.1/jsp-api-2.1.jar:/home/oracle/hadoop-1.1.2/hadoop-ant-1.1.2.jar:/home/oracle/hadoop-1.1.2/hadoop-client-1.1.2.jar:/home/oracle/hadoop-1.1.2/hadoop-core-1.1.2.jar:/home/oracle/hadoop-1.1.2/hadoop-examples-1.1.2.jar:/home/oracle/hadoop-1.1.2/hadoop-minicluster-1.1.2.jar:/home/oracle/hadoop-1.1.2/hadoop-test-1.1.2.jar:/home/oracle/hadoop-1.1.2/hadoop-tools-1.1.2.jar

[oracle@bigdata1 java]$ /home/oracle/hadoop-1.1.2/bin/hadoop jar ArtistCounter.jar ArtistCounter /user/oracle/input /user/oracle/output/laurent
13/07/02 09:56:08 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/07/02 09:56:08 INFO input.FileInputFormat: Total input paths to process : 1
13/07/02 09:56:08 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/07/02 09:56:08 WARN snappy.LoadSnappy: Snappy native library not loaded
13/07/02 09:56:08 INFO mapred.JobClient: Running job: job_201307020925_0001
13/07/02 09:56:09 INFO mapred.JobClient:  map 0% reduce 0%
13/07/02 09:56:20 INFO mapred.JobClient:  map 50% reduce 0%
13/07/02 09:56:22 INFO mapred.JobClient:  map 100% reduce 0%
13/07/02 09:56:28 INFO mapred.JobClient:  map 100% reduce 33%
13/07/02 09:56:31 INFO mapred.JobClient:  map 100% reduce 100%
13/07/02 09:56:32 INFO mapred.JobClient: Job complete: job_201307020925_0001
13/07/02 09:56:32 INFO mapred.JobClient: Counters: 29
13/07/02 09:56:32 INFO mapred.JobClient:   Job Counters
13/07/02 09:56:32 INFO mapred.JobClient:     Launched reduce tasks=1
13/07/02 09:56:32 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=18647
13/07/02 09:56:32 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/07/02 09:56:32 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/07/02 09:56:32 INFO mapred.JobClient:     Launched map tasks=2
13/07/02 09:56:32 INFO mapred.JobClient:     Data-local map tasks=2
13/07/02 09:56:32 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=10862
13/07/02 09:56:32 INFO mapred.JobClient:   File Output Format Counters
13/07/02 09:56:32 INFO mapred.JobClient:     Bytes Written=1651492
13/07/02 09:56:32 INFO mapred.JobClient:   FileSystemCounters
13/07/02 09:56:32 INFO mapred.JobClient:     FILE_BYTES_READ=6240405
13/07/02 09:56:32 INFO mapred.JobClient:     HDFS_BYTES_READ=84050631
13/07/02 09:56:32 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=9124002
13/07/02 09:56:32 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1651492
13/07/02 09:56:32 INFO mapred.JobClient:   File Input Format Counters
13/07/02 09:56:32 INFO mapred.JobClient:     Bytes Read=84050389
13/07/02 09:56:32 INFO mapred.JobClient:   Map-Reduce Framework
13/07/02 09:56:32 INFO mapred.JobClient:     Map output materialized bytes=2724501
13/07/02 09:56:32 INFO mapred.JobClient:     Map input records=1000000
13/07/02 09:56:32 INFO mapred.JobClient:     Reduce shuffle bytes=2724501
13/07/02 09:56:32 INFO mapred.JobClient:     Spilled Records=370725
13/07/02 09:56:32 INFO mapred.JobClient:     Map output bytes=18534685
13/07/02 09:56:32 INFO mapred.JobClient:     Total committed heap usage (bytes)=270082048
13/07/02 09:56:32 INFO mapred.JobClient:     CPU time spent (ms)=10870
13/07/02 09:56:32 INFO mapred.JobClient:     Combine input records=1150423
13/07/02 09:56:32 INFO mapred.JobClient:     SPLIT_RAW_BYTES=242
13/07/02 09:56:32 INFO mapred.JobClient:     Reduce input records=110151
13/07/02 09:56:32 INFO mapred.JobClient:     Reduce input groups=72663
13/07/02 09:56:32 INFO mapred.JobClient:     Combine output records=260574
13/07/02 09:56:32 INFO mapred.JobClient:     Physical memory (bytes) snapshot=453685248
13/07/02 09:56:32 INFO mapred.JobClient:     Reduce output records=72663
13/07/02 09:56:32 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=3173515264
13/07/02 09:56:32 INFO mapred.JobClient:     Map output records=1000000

Last step, we can see the content of the result file, and counting how many album did the mapreducer find in the file:

[oracle@bigdata1 java]$ /home/oracle/hadoop-1.1.2/bin/hadoop dfs -ls /user/oracle/output/laurent
Found 3 items
-rw-r--r--   3 oracle supergroup          0 2013-07-02 09:56 /user/oracle/output/laurent/_SUCCESS
drwxr-xr-x   - oracle supergroup          0 2013-07-02 09:56 /user/oracle/output/laurent/_logs
-rw-r--r--   3 oracle supergroup    1651492 2013-07-02 09:56 /user/oracle/output/laurent/part-r-00000

[oracle@bigdata1 java]$ /home/oracle/hadoop-1.1.2/bin/hadoop dfs -cat /user/oracle/output/laurent/part-r-00000 | grep 'pink floyd'
pink floyd      117
pink floyd tribute      10

hadoop, java, Linux, Mapreduce

RSS feed

The views expressed on this blog are my own and do not reflect the views of the company(ies) I work (or have worked for) neither Oracle Corporation. The opinions expressed by visitors on this blog are theirs, not mine.

The information in this blog is written based on personal experiences. You are free to use the information on this blog but I am not responsible and will not compensate to you if you ever happen to suffer a loss/inconvenience/damage because of/while making use of this information.

Data … as usual

Monthly Archives: July 2013

Where are X11 libs after a Solaris 11 fresh install

Write SQL statements on internal structures in multitenant databases.

Install a simple hadoop cluster for testing

Follow me on Twitter

Blogroll

Archives

DISCLAIMER

Blog Stats