About Me

I love Java-related technologies. Recently I have been researching Enterprise Integration (SOA and Messaging), Mobility, and Big Data. I have been working with Java-related technologies as a Software Architect, Enterprise Architect, and Software Developer/Engineer for over 11 years. Currently, I am working as a Senior Consultant at VMware Inc.

Tuesday, November 20, 2012

Spring bean lifecycle

1 Spring instantiates the bean.
2 Spring injects values and bean references into the bean's properties.
3 If the bean implements BeanNameAware, Spring passes the bean's ID to the setBeanName() method.
4 If the bean implements BeanFactoryAware, Spring calls the setBeanFactory() method, passing in the bean factory itself.
5 If the bean implements ApplicationContextAware, Spring calls the setApplicationContext() method, passing in a reference to the enclosing application context.
6 If any beans implement the BeanPostProcessor interface, Spring calls their postProcessBeforeInitialization() method.
7 If the bean implements the InitializingBean interface, Spring calls its afterPropertiesSet() method. Similarly, if the bean was declared with an init-method, the specified initialization method is called.
8 If any beans implement BeanPostProcessor, Spring calls their postProcessAfterInitialization() method.
9 At this point, the bean is ready to be used by the application and remains in the application context until the context is destroyed.
10 If the bean implements the DisposableBean interface, Spring calls its destroy() method. Likewise, if the bean was declared with a destroy-method, the specified method is called.
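For reference, here is a minimal sketch of a bean that participates in steps 3, 7, and 10 by implementing the corresponding callback interfaces (the class name AuditService is made up for illustration):

import org.springframework.beans.factory.BeanNameAware;
import org.springframework.beans.factory.DisposableBean;
import org.springframework.beans.factory.InitializingBean;

// Illustrative bean that hooks into a few of the lifecycle callbacks listed above.
public class AuditService implements BeanNameAware, InitializingBean, DisposableBean {

    private String beanName;

    // Step 3: Spring passes in the bean's ID as declared in the configuration.
    public void setBeanName(String name) {
        this.beanName = name;
    }

    // Step 7: called after all properties have been injected.
    public void afterPropertiesSet() {
        System.out.println("Bean '" + beanName + "' initialized");
    }

    // Step 10: called when the application context is destroyed.
    public void destroy() {
        System.out.println("Bean '" + beanName + "' destroyed");
    }
}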

Wednesday, November 14, 2012

Hadoop (setup fully-distributed/multiple nodes mode)

Coming!

Hadoop (setup in standalone)

Pre-setup
1) Install Hadoop
2) Set up environment variables
  • JAVA_HOME
  • HADOOP_HOME

Set up SSH for a Hadoop cluster

1) Define a common account
Create a dedicated user-level account (no administrative privileges) for running Hadoop on all nodes. Assume it is "hadoopUser".

2) Generate an SSH key pair
Execute "ssh-keygen -t rsa" and follow the prompts for any additional input.
The generated keys are stored in the location you specify.

3) Distribute the public key to all nodes (master and slaves)
scp <the location of your public key> hadoopUser@<hostname>:<new location>/master_key

on the target host, execute the following commands
$ mkdir ~/.ssh
$ chmod 700 ~/.ssh
$ mv ~/master_key ~/.ssh/authorized_keys
$ chmod 600 ~/.ssh/authorized_keys

4) Hadoop configuration
cd $HADOOP_HOME
In "hadoop-env.sh" add "export JAVA_HOME=/usr/share/jdk"

5) For standalone mode, the three main configuration files should be empty (just an empty <configuration/> element):
  1. core-site.xml
  2. hdfs-site.xml
  3. mapred-site.xml
In this mode Hadoop runs completely on the local machine and does not launch any of the Hadoop daemons. A quick way to verify this is shown below.
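A small sketch using Hadoop's Configuration API (0.20-era property names; the class name CheckStandalone is made up) that prints the defaults an empty configuration resolves to:

import org.apache.hadoop.conf.Configuration;

// Sketch: with empty core-site.xml and mapred-site.xml, Hadoop falls back to its
// built-in defaults -- the local filesystem and the local (in-process) job runner.
public class CheckStandalone {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Expected in standalone mode: file:///
        System.out.println("fs.default.name    = " + conf.get("fs.default.name"));
        // Expected in standalone mode: local
        System.out.println("mapred.job.tracker = " + conf.get("mapred.job.tracker"));
    }
}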

Tuesday, November 13, 2012

Hadoop clustering components

A Hadoop cluster is composed of the following daemons, running on a single server or spread across multiple servers:
  • NameNode -- keeps track of the file metadata: which files are in the system and how each file is broken down into blocks
  • DataNode -- stores and serves the actual HDFS data blocks and constantly reports to the NameNode so that the block metadata stays up to date
  • Secondary NameNode -- assistant daemon for monitoring the state of the cluster's HDFS. It communicates with the NameNode to take snapshots of the HDFS metadata at intervals defined by the cluster configuration
  • JobTracker -- liaison between the application and Hadoop. It determines the execution plan by deciding which files to process, assigns nodes to different tasks, and monitors all tasks as they run (see the sketch below)
    • one per Hadoop cluster
    • automatically relaunches failed tasks
    • oversees the overall execution of a MapReduce job
  • TaskTracker -- slave to the JobTracker
    • executes the individual tasks that the JobTracker assigns
    • one per slave node
    • able to spawn multiple map or reduce tasks in parallel
    • sends heartbeats to the JobTracker
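A minimal sketch of the JobTracker/TaskTracker relationship seen from the client side, using the old 0.20 mapred API (the class name ClusterInfo is made up): it asks the JobTracker how many TaskTrackers are alive and how many task slots they offer.

import org.apache.hadoop.mapred.ClusterStatus;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Sketch: query the JobTracker for live TaskTrackers and available task slots.
public class ClusterInfo {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();            // reads mapred-site.xml from the classpath
        JobClient client = new JobClient(conf);  // connects to the JobTracker named there
        ClusterStatus status = client.getClusterStatus();
        System.out.println("Live TaskTrackers : " + status.getTaskTrackers());
        System.out.println("Max map tasks     : " + status.getMaxMapTasks());
        System.out.println("Max reduce tasks  : " + status.getMaxReduceTasks());
    }
}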

Hadoop commands

Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
  • namenode -format -- format the DFS filesystem
  • secondarynamenode -- run the DFS secondary namenode
  • namenode -- run the DFS namenode
  • datanode -- run a DFS datanode
  • dfsadmin -- run a DFS admin client
  • fsck -- run a DFS filesystem checking utility
  • fs -- run a generic filesystem user client
  • balancer -- run a cluster balancing utility
  • jobtracker -- run the MapReduce job tracker node
  • pipes -- run a Pipes job
  • tasktracker -- run a MapReduce task tracker node
  • job -- manipulate MapReduce jobs
  • version -- print the version
  • jar <jar> -- run a jar file (see the driver sketch below)
  • distcp <srcurl> <desturl> -- copy files or directories recursively
  • archive -archiveName NAME <src>* <dest> -- create a Hadoop archive
  • daemonlog -- get/set the log level for each daemon
  • CLASSNAME -- run the class named CLASSNAME
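The jar and CLASSNAME commands are how your own code reaches the cluster. Below is a minimal sketch of a driver written against the Tool/ToolRunner pattern (the class name MyDriver and the jar name are made up), so that generic options such as -conf, -D, and -fs are parsed by Hadoop before run() is invoked:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Sketch: a minimal driver usable with "hadoop jar" or "hadoop CLASSNAME".
public class MyDriver extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        // A real driver would build and submit a MapReduce job here; this sketch
        // just echoes one resolved property to show that the configuration is live.
        System.out.println("mapred.job.tracker = " + conf.get("mapred.job.tracker"));
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new MyDriver(), args));
    }
}

Packaged into a jar, it can be launched as "hadoop jar mydriver.jar MyDriver", or, if the class is already on the Hadoop classpath, simply as "hadoop MyDriver".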

Hadoop (software stack)

Currently there are nine sub-projects in Hadoop:
  • Common - common code
  • Avro - serialization and RPC
  • MapReduce - computation
  • HDFS - storage
  • Pig - data flow language
  • Hive - data warehousing and query language
  • HBase - column-oriented database
  • ZooKeeper - coordination service
  • Chukwa - data collection and analysis

Sunday, November 11, 2012

hadoop useful CLI commands

$ hadoop fs -ls /                           (list all files in HDFS root directory)
$ hadoop job -list                         (find all running MapReduce jobs)
$ for svc in /etc/init.d/hadoop-0.20-*; do sudo $svc start; done    (Start up hadoop cluster)
$ for svc in /etc/init.d/hadoop-0.20-*; do sudo $svc stop; done    (Stop cluster)

(HDFS commands)
http://hadoop.apache.org/docs/r1.0.0/file_system_shell.html

(MapReduce commands)
http://hadoop.apache.org/docs/r1.0.0/commands_manual.html#job
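The first listing above ("hadoop fs -ls /") can also be done programmatically through the FileSystem API; a minimal sketch (the class name ListRoot is made up) follows:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: programmatic equivalent of "hadoop fs -ls /".
public class ListRoot {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);       // HDFS when fs.default.name points at the NameNode
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath() + "  " + status.getLen());
        }
    }
}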

hadoop configuration

Configuration
The Hadoop configuration files are contained under /etc/hadoop/conf.

Log
/var/log/hadoop (where all Hadoop daemon log files reside)
hadoop-hadoop-namenode-<HOSTNAME>.log (NameNode logs)

CentOS add user to the sudoers list

Open /etc/sudoers (e.g. with visudo).

Find the line "root    ALL=(ALL)       ALL" and add the following on the next line,
replacing <username> with your username:

"<username>    ALL=(ALL)       ALL"

Friday, November 9, 2012

Hadoop stack (Installation) - redhat

Download Hadoop from http://www.cloudera.com/hadoop

Prerequisites
1) JDK1.6 update 8 or newer

Download and install the “bootstrap” RPM
$ sudo -s
$ wget http://archive.cloudera.com/redhat/cdh/cdh3-repository-1.0-1.noarch.rpm
$ rpm -ivh cdh3-repository-1.0-1.noarch.rpm

Import Cloudera's RPM signing key
$ rpm --import \
http://archive.cloudera.com/redhat/cdh/RPM-GPG-KEY-cloudera

Install the pseudo-distributed RPM package and its dependencies: Pig, Hive, and Snappy
$ yum install hadoop-0.20-conf-pseudo hadoop-0.20-native \
hadoop-pig hadoop-hive

Hadoop limitations

  1. Availability -- master processes are single points of failure (Hadoop 2.x brings HA support for the NameNode and JobTracker to mitigate this issue)
  2. Security -- the security model is disabled by default, and there is no storage or wire-level encryption
    1. can be configured to run with Kerberos (a network authentication protocol)
  3. HDFS -- lacks high availability (the NameNode is a single point of failure)
  4. MapReduce -- a batch-based, shared-nothing architecture; not a good fit for jobs that need real-time data access
  5. Ecosystem version compatibility

HDFS architecture

Wednesday, November 7, 2012