1 Spring instantiates the bean.
2 Spring injects values and bean references into the bean's properties.
3 If the bean implements BeanNameAware, Spring passes the bean's ID to the setBeanName() method.
4 If the bean implements BeanFactoryAware, Spring calls the setBeanFactory() method, passing in the bean factory itself.
5 If the bean implements ApplicationContextAware, Spring will call the setApplicationContext() method, passing in a reference to the enclosing application context.
6 If any of the beans implement the BeanPostProcessor interface, Spring calls their postProcessBeforeInitialization() method.
7 If any beans implement the InitializingBean interface, Spring calls their afterPropertiesSet() method. Similarly, if the bean was declared with an init-method, then the specified initialization method will be called.
8 If there are any beans that implement BeanPostProcessor, Spring will call their postProcessAfterInitialization() method.
9 At this point, the bean is ready to be used by the application and will remain in the application context until the application context is destroyed.
10 If any beans implement the DisposableBean interface, then Spring will call their destroy() method. Likewise, if any bean was declared with a destroy-method, then the specified method will be called.
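As a concrete illustration, here is a minimal sketch of a bean that hooks into a few of these callbacks (the class name AuditService is hypothetical, and the printlns just stand in for real work):

import org.springframework.beans.factory.BeanNameAware;
import org.springframework.beans.factory.DisposableBean;
import org.springframework.beans.factory.InitializingBean;

// Hypothetical bean that participates in steps 3, 7, and 10 above.
public class AuditService implements BeanNameAware, InitializingBean, DisposableBean {

    private String beanName;

    // Step 3: Spring passes in the bean's ID.
    public void setBeanName(String name) {
        this.beanName = name;
    }

    // Step 7: called once values and references have been injected.
    public void afterPropertiesSet() {
        System.out.println(beanName + " initialized");
    }

    // Step 10: called when the application context is destroyed.
    public void destroy() {
        System.out.println(beanName + " destroyed");
    }
}

The same initialization and destruction hooks can also be declared on the bean definition itself via init-method and destroy-method, without implementing any Spring interfaces.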
Wednesday, November 14, 2012
Hadoop (setup in standalone)
Pre-setup
1) Install Hadoop
2) Set up the environment variables
- JAVA_HOME
- HADOOP_HOME
Set up SSH for the Hadoop cluster
1) Define a common account
Create a user-level account with no Hadoop management privileges; assume it is "hadoopUser"
2) Generate an SSH key pair
Execute "ssh-keygen -t rsa" and follow the prompts for any additional input
The public key is stored in the location you specify
3) Distribute the public key to all nodes (master and slaves)
scp <the location of your public key> hadoopUser@<hostname>:<new location>/master_key
On each target host, execute the following commands
$ mkdir ~/.ssh
$ chmod 700 ~/.ssh
$ mv ~/master_key ~/.ssh/authorized_keys
$ chmod 600 ~/.ssh/authorized_keys
4) Hadoop configuration
cd $HADOOP_HOME
In "conf/hadoop-env.sh" add "export JAVA_HOME=/usr/share/jdk"
5) For standalone mode, the 3 main configuration files should be left empty (see the sketch after this list)
- core-site.xml
- hdfs-site.xml
- mapred-site.xml
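With the three site files left empty, Hadoop falls back to its local (standalone) defaults: the local filesystem and an in-process job runner. Here is a minimal sketch to sanity-check this, assuming the Hadoop jars are on the classpath (the class name StandaloneCheck is just for illustration):

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.mapred.JobConf;

// With empty core-site.xml / hdfs-site.xml / mapred-site.xml,
// Hadoop should report its local (standalone) defaults.
public class StandaloneCheck {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();   // loads both the core and mapred config resources
        // Expected to default to "file:///" (local filesystem) when core-site.xml is empty.
        System.out.println("fs.default.name    = " + conf.get("fs.default.name"));
        // Expected to default to "local" (no JobTracker) when mapred-site.xml is empty.
        System.out.println("mapred.job.tracker = " + conf.get("mapred.job.tracker"));
        // Should resolve to the local filesystem in standalone mode.
        FileSystem fs = FileSystem.get(conf);
        System.out.println("filesystem         = " + fs.getUri());
    }
}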
Tuesday, November 13, 2012
Hadoop clustering components
A Hadoop cluster is composed of the following daemons, running on a single server or spread across multiple servers (a small status-check sketch follows this list):
- NameNode -- keeps track of the file metadata: which files are in the system and how each file is broken down into blocks
- DataNode -- stores the actual HDFS data blocks and constantly reports back to the NameNode to keep the block metadata up to date
- Secondary NameNode -- assistant daemon for monitoring the state of the cluster's HDFS. It communicates with the NameNode to take snapshots of the HDFS metadata at intervals defined by the cluster configuration
- JobTracker -- the liaison between your application and Hadoop. It determines the execution plan: which files to process, which nodes to assign to which tasks, and it monitors all tasks as they run
- one per Hadoop cluster
- automatically relaunches failed tasks
- oversees the overall execution of a MapReduce job
- TaskTracker -- slave to the JobTracker
- executes the individual tasks that the JobTracker assigns
- one per slave node
- able to spawn multiple map or reduce tasks in parallel
- sends heartbeats to the JobTracker
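As a rough sketch of how these daemons fit together from a client's point of view, the old org.apache.hadoop.mapred API lets a program ask the JobTracker about its TaskTrackers; this assumes mapred-site.xml on the classpath points at the JobTracker, and the class name ClusterReport is just for illustration:

import org.apache.hadoop.mapred.ClusterStatus;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Asks the JobTracker for a summary of the TaskTrackers it oversees.
public class ClusterReport {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();            // picks up mapred-site.xml from the classpath
        JobClient client = new JobClient(conf);  // connects to the JobTracker
        ClusterStatus status = client.getClusterStatus();
        System.out.println("TaskTrackers      : " + status.getTaskTrackers());
        System.out.println("Running map tasks : " + status.getMapTasks());
        System.out.println("Running reduces   : " + status.getReduceTasks());
        System.out.println("Max map slots     : " + status.getMaxMapTasks());
        System.out.println("Max reduce slots  : " + status.getMaxReduceTasks());
    }
}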
Hadoop commands
Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
- namenode -format format the DFS filesystem
- secondarynamenode run the DFS secondary namenode
- namenode run the DFS namenode
- datanode run a DFS datanode
- dfsadmin run a DFS admin client
- fsck run a DFS filesystem checking utility
- fs run a generic filesystem user client
- balancer run a cluster balancing utility
- jobtracker run the MapReduce job Tracker node
- pipes run a Pipes job
- tasktracker run a MapReduce task Tracker node
- job manipulate MapReduce jobs
- version print the version
- jar <jar> run a jar file
- distcp <srcurl> <desturl> copy files or directories recursively
- archive -archiveName NAME <src>* <dest> create a hadoop archive
- daemonlog get/set the log level for each daemon
- CLASSNAME run the class named CLASSNAME
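The "jar" and "CLASSNAME" entries run user code. A common pattern is to implement Tool so that ToolRunner feeds the generic options (-D, -fs, -jt, ...) into the Configuration before run() is invoked; the sketch below assumes the class is packaged into the jar passed to "hadoop jar", and the class name MyTool is hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Skeleton of a class launched with "hadoop jar <jar> MyTool <args>".
public class MyTool extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();   // already populated with any -D overrides
        System.out.println("fs.default.name = " + conf.get("fs.default.name"));
        // ... set up and submit a MapReduce job here ...
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyTool(), args));
    }
}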
Hadoop (software stack)
Currently there are nine sub-projects under Hadoop
- Common - common code
- Avro - serialization and RPC
- MapReduce - computation
- HDFS - storage
- Pig - data flow language
- Hive - data warehousing and query language
- HBase - column-oriented database
- ZooKeeper - coordination service
- Chukwa - data collection and analysis
Sunday, November 11, 2012
hadoop useful CLI commands
$ hadoop fs -ls / (list all files in the HDFS root directory; a Java equivalent is sketched at the end of this post)
$ hadoop job -list (find all running MapReduce jobs)
$ for svc in /etc/init.d/hadoop-0.20-*; do sudo $svc start; done (Start up hadoop cluster)
$ for svc in /etc/init.d/hadoop-0.20-*; do sudo $svc stop; done (Stop cluster)
(HDFS commands)
http://hadoop.apache.org/docs/r1.0.0/file_system_shell.html
(MapReduce commands)
http://hadoop.apache.org/docs/r1.0.0/commands_manual.html#job
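For reference, a rough Java equivalent of "hadoop fs -ls /", assuming core-site.xml on the classpath points at the NameNode (the class name ListRoot is just for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Lists the entries directly under the HDFS root directory.
public class ListRoot {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            String type = status.isDir() ? "dir " : "file";
            System.out.println(type + "  " + status.getLen() + "  " + status.getPath());
        }
    }
}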
hadoop configuration
Configuration
The Hadoop configs are contained under /etc/hadoop/conf.
Log
/var/log/hadoop (where all Hadoop daemon log files reside)
hadoop-hadoop-namenode-<HOSTNAME>.log (NameNode logs)
CentOS add user to the sudoers list
Edit /etc/sudoers (preferably with visudo)
Find the "root ALL=(ALL) ALL" line and add the following on the next line
replace <username> with your username
"<username> ALL=(ALL) ALL"
Friday, November 9, 2012
Hadoop stack (Installation) - redhat
Download Hadoop from http://www.cloudera.com/hadoop
Prerequisites
1) JDK1.6 update 8 or newer
Download and install the “bootstrap” RPM
$ sudo -s
$ wget http://archive.cloudera.com/redhat/cdh/cdh3-repository-1.0-1.noarch.rpm
$ rpm -ivh cdh3-repository-1.0-1.noarch.rpm
Import Cloudera's RPM signing key
$ rpm --import \
http://archive.cloudera.com/redhat/cdh/RPM-GPG-KEY-cloudera
Install the pseudo-distributed RPM package and its dependencies: Pig, Hive, and Snappy
$ yum install hadoop-0.20-conf-pseudo hadoop-0.20-native \
hadoop-pig hadoop-hive
Hadoop limitations
- Availability -- master processes are single points of failure (Hadoop 2.x brings HA support for the NameNode and JobTracker to mitigate this issue)
- Security -- the security model is disabled by default, and there is no storage- or wire-level encryption
- can be configured to run with Kerberos (a network authentication protocol)
- HDFS -- lacks high availability
- MapReduce -- a "batch-based", "shared-nothing" architecture; not a good fit for jobs that need real-time data access
- Ecosystem version compatibility