Understanding How Hadoop Works through a Single-Machine Pseudo-Distributed Run

In the previous article, "Simulating a Pseudo-Distributed Hadoop Run on a Single Machine", we successfully ran the Hadoop example WordCount in pseudo-distributed mode. The commands used in that process are summarized below:

$ cd ../../cygdrive/g/hadoop-0.16.4
$ bin/hadoop namenode -format
$ bin/start-all.sh
$ bin/hadoop dfs -put ./input input
$ bin/hadoop jar hadoop-0.16.4-examples.jar wordcount input output
$ bin/hadoop dfs -cat output/part-00000
$ bin/stop-all.sh

To deepen our understanding of this process, and to get better acquainted with HDFS, let's run the example again, and after each step examine what work was actually done.
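
Throughout, I assume the pseudo-distributed configuration from the previous article. As a sketch, a minimal conf/hadoop-site.xml for this Hadoop version looks roughly like the following; the NameNode address localhost:9000 matches the logs shown later in this article, while the JobTracker port 9001 is my assumption and may differ in your setup:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>localhost:9000</value>   <!-- NameNode address; 9000 matches the logs below -->
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>   <!-- JobTracker address; 9001 is an assumed port -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>                <!-- single machine, so one replica per block -->
  </property>
</configuration>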

1. Formatting HDFS

The first step is to format HDFS:

$ bin/hadoop namenode -format

Let's look at the information printed during the run and trace through it:

cygpath: cannot create short name of g:\hadoop-0.16.4\logs
08/09/21 17:32:43 INFO dfs.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = cbbd2ce9428e48b/192.168.151.201
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.16.4
STARTUP_MSG:   build = ch-0.16 -r 652614; compiled by 'hadoopqa' on Fri May 2 00:18:12 UTC 2008
************************************************************/
08/09/21 17:32:44 INFO fs.FSNamesystem: fsOwner=SHIYANJUN,None,root,Administrators,Users
08/09/21 17:32:44 INFO fs.FSNamesystem: supergroup=supergroup
08/09/21 17:32:44 INFO fs.FSNamesystem: isPermissionEnabled=true
08/09/21 17:32:45 INFO dfs.Storage: Storage directory \tmp\hadoop-SHIYANJUN\dfs\name has been successfully formatted.
08/09/21 17:32:45 INFO dfs.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at cbbd2ce9428e48b/192.168.151.201
************************************************************/

The first line:

cygpath: cannot create short name of g:\hadoop-0.16.4\logs

This warning means the log output directory could not be found; simply create a logs directory under g:\hadoop-0.16.4 and it goes away.
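
From the Cygwin shell this is just:

$ mkdir /cygdrive/g/hadoop-0.16.4/logs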

The first group of "STARTUP_MSG" lines above reports startup information; you can see that it covers the following:

Startup target: NameNode
Host: cbbd2ce9428e48b/192.168.151.201
Command-line arguments: [-format]
Version: 0.16.4
Build info: ch-0.16 -r 652614; compiled by 'hadoopqa' on Fri May 2 00:18:12 UTC 2008

As you can see, formatting the Hadoop Distributed File System (HDFS) is carried out by starting a NameNode, which performs the initialization work.

Moreover, once the format completes, the NameNode process shuts down again.

The second group shows FS and DFS information, i.e., information about the local file system and the Hadoop distributed file system. Before initializing HDFS, the NameNode inspects the current system configuration, including the FS fsOwner, the supergroup, and whether permission checking is enabled, and then performs the actual HDFS format.

According to the output, the directory \tmp\hadoop-SHIYANJUN\dfs\name has been formatted successfully. Looking at the local file system, you can see the generated DFS layout, as shown here:

[Screenshot: the \tmp\hadoop-SHIYANJUN\dfs\name directory created by the format]

Sure enough, a \tmp\hadoop-SHIYANJUN\dfs\name directory was initialized. Here \tmp\hadoop-SHIYANJUN\dfs is the HDFS storage root, and \tmp\hadoop-SHIYANJUN\dfs\name is the directory belonging to the NameNode. It contains two subdirectories holding several new files: the \current directory contains the four files edits, fsimage, fstime, and VERSION, while the \image directory contains a single fsimage file. All of them are used by the NameNode during the distributed computation later on.
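
You can list this from the Cygwin shell as well (the directory listings in this article suggest /tmp maps to G:\tmp here; adjust the path if your Cygwin mapping differs):

$ ls -R /tmp/hadoop-SHIYANJUN/dfs/name
# expect: current/ with edits, fsimage, fstime, VERSION, and image/ with fsimage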

2. Starting the Hadoop processes

Start the Hadoop processes with:

$ bin/start-all.sh

This startup step does quite a lot of work and takes noticeably longer. The startup output looks like this:

starting namenode, logging to /cygdrive/g/hadoop-0.16.4/bin/../logs/hadoop-SHIYANJUN-namenode-cbbd2ce9428e48b.out
localhost: starting datanode, logging to /cygdrive/g/hadoop-0.16.4/bin/../logs/hadoop-SHIYANJUN-datanode-cbbd2ce9428e48b.out
localhost: starting secondarynamenode, logging to /cygdrive/g/hadoop-0.16.4/bin/../logs/hadoop-SHIYANJUN-secondarynamenode-cbbd2ce9428e48b.out
starting jobtracker, logging to /cygdrive/g/hadoop-0.16.4/bin/../logs/hadoop-SHIYANJUN-jobtracker-cbbd2ce9428e48b.out
localhost: starting tasktracker, logging to /cygdrive/g/hadoop-0.16.4/bin/../logs/hadoop-SHIYANJUN-tasktracker-cbbd2ce9428e48b.out

This output shows that five processes were started: namenode, datanode, secondarynamenode, jobtracker, and tasktracker, and that their startup information is written to log files, generated as shown here:

[Screenshot: the log files generated under g:\hadoop-0.16.4\logs]

The log files correspond exactly to the five processes above, and each process's log file (with the .log extension) has already recorded its startup information. Here is the NameNode log:

2008-09-21 18:10:27,812 INFO org.apache.hadoop.dfs.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = cbbd2ce9428e48b/192.168.151.201
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.16.4
STARTUP_MSG:   build = -r 652614; compiled by 'hadoopqa' on Fri May 2 00:18:12 UTC 2008
************************************************************/
2008-09-21 18:10:28,375 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=9000
2008-09-21 18:10:28,421 INFO org.apache.hadoop.dfs.NameNode: Namenode up at: localhost/127.0.0.1:9000
2008-09-21 18:10:28,437 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null
2008-09-21 18:10:28,640 INFO org.apache.hadoop.dfs.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext
2008-09-21 18:10:30,828 INFO org.apache.hadoop.fs.FSNamesystem: fsOwner=SHIYANJUN,None,root,Administrators,Users
2008-09-21 18:10:30,828 INFO org.apache.hadoop.fs.FSNamesystem: supergroup=supergroup
2008-09-21 18:10:30,828 INFO org.apache.hadoop.fs.FSNamesystem: isPermissionEnabled=true
2008-09-21 18:10:31,062 INFO org.apache.hadoop.fs.FSNamesystem: Finished loading FSImage in 2266 msecs
2008-09-21 18:10:31,078 INFO org.apache.hadoop.fs.FSNamesystem: Leaving safemode after 2282 msecs
2008-09-21 18:10:31,078 INFO org.apache.hadoop.dfs.StateChange: STATE* Network topology has 0 racks and 0 datanodes
2008-09-21 18:10:31,078 INFO org.apache.hadoop.dfs.StateChange: STATE* UnderReplicatedBlocks has 0 blocks
2008-09-21 18:10:31,093 INFO org.apache.hadoop.fs.FSNamesystem: Registered FSNamesystemStatusMBean
2008-09-21 18:10:31,359 INFO org.mortbay.util.Credential: Checking Resource aliases
2008-09-21 18:10:31,546 INFO org.mortbay.http.HttpServer: Version Jetty/5.1.4
2008-09-21 18:10:31,546 INFO org.mortbay.util.Container: Started HttpContext[/static,/static]
2008-09-21 18:10:31,546 INFO org.mortbay.util.Container: Started HttpContext[/logs,/logs]
2008-09-21 18:10:33,015 INFO org.mortbay.util.Container: Started org.mortbay.jetty.servlet.WebApplicationHandler@94cb8b
2008-09-21 18:10:33,281 INFO org.mortbay.util.Container: Started WebApplicationContext[/,/]
2008-09-21 18:10:33,296 INFO org.mortbay.http.SocketListener: Started SocketListener on 0.0.0.0:50070
2008-09-21 18:10:33,296 INFO org.mortbay.util.Container: Started org.mortbay.jetty.Server@15c62bc
2008-09-21 18:10:33,296 INFO org.apache.hadoop.fs.FSNamesystem: Web-server up at: 0.0.0.0:50070
2008-09-21 18:10:33,359 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2008-09-21 18:10:33,390 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 9000: starting
2008-09-21 18:10:33,718 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 9000: starting
2008-09-21 18:10:33,718 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 9000: starting
2008-09-21 18:10:33,718 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 9000: starting
2008-09-21 18:10:33,718 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 9000: starting
2008-09-21 18:10:33,718 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 9000: starting
2008-09-21 18:10:33,718 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 9000: starting
2008-09-21 18:10:33,718 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 9000: starting
2008-09-21 18:10:33,781 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 9000: starting
2008-09-21 18:10:33,968 INFO org.apache.hadoop.ipc.Server: IPC Server handler 8 on 9000: starting
2008-09-21 18:10:33,968 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 9000: starting
2008-09-21 18:10:57,312 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.registerDatanode: node registration from 127.0.0.1:50010 storage DS-1069829945-192.168.151.201-50010-1221991857296
2008-09-21 18:10:57,328 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/127.0.0.1:50010
2008-09-21 18:11:47,250 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedDelete: failed to remove /tmp/hadoop-SHIYANJUN/mapred/system because it does not exist
2008-09-21 18:11:47,250 INFO org.apache.hadoop.fs.FSNamesystem: Number of transactions: 0 Total time for transactions(ms): 0 Number of syncs: 0 SyncTimes(ms): 0
2008-09-21 18:16:27,640 INFO org.apache.hadoop.fs.FSNamesystem: Roll Edit Log from 127.0.0.1
2008-09-21 18:16:27,640 INFO org.apache.hadoop.fs.FSNamesystem: Number of transactions: 5 Total time for transactions(ms): 0 Number of syncs: 3 SyncTimes(ms): 62
2008-09-21 18:16:30,171 INFO org.apache.hadoop.fs.FSNamesystem: Roll FSImage from 127.0.0.1
2008-09-21 18:16:30,171 INFO org.apache.hadoop.fs.FSNamesystem: Number of transactions: 0 Total time for transactions(ms): 0 Number of syncs: 0 SyncTimes(ms): 0

If anything goes wrong during startup, these log files are where to look for the cause.
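
For example, to look at the tail of the NameNode log (the file name embeds your user name and host name, so yours will differ):

$ tail -n 50 logs/hadoop-SHIYANJUN-namenode-cbbd2ce9428e48b.log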

You can also use the following command to see the currently running processes:

$ ps -ef

The output looks like this:

[Screenshot: ps -ef output showing the running processes]

At this point, five Java processes are running.
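
The JDK's jps tool gives a cleaner view than ps -ef, since it lists only Java processes:

$ jps
# expect five entries (PIDs will vary): NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker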

You will also find an in_use.lock file in the tmp\hadoop-SHIYANJUN\dfs\name directory, which means the NameNode has started, is currently in use, and has locked its storage directory.

In addition, a \hadoop-SYSTEM directory has appeared under \tmp; it is Hadoop's system directory. The directory structure generated at this point looks like this:

[Screenshot: directory structure under \tmp after startup, including the new \hadoop-SYSTEM tree]

Here, the tmp\hadoop-SYSTEM\dfs\data directory corresponds to the DataNode, and tmp\hadoop-SYSTEM\dfs\namesecondary corresponds to the secondarynamenode. Some files have also been generated in certain tmp\hadoop-SYSTEM subdirectories, which I won't go into here.
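
You can also ask the NameNode what it knows about the cluster; the dfsadmin report sub-command prints overall capacity and the registered datanodes, and in this pseudo-distributed setup it should list exactly one datanode, at 127.0.0.1:50010 (as seen in the NameNode log above):

$ bin/hadoop dfsadmin -report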

3. Copying the local input data into HDFS

Use the following command:

$ bin/hadoop dfs -put ./input input

Now, besides the two files dncp_block_verification.log.curr and VERSION, the tmp\hadoop-SYSTEM\dfs\data\current directory contains many new files whose names start with "blk". These are the files copied from the local input directory into HDFS, stored as blocks together with their generated metadata. The contents of tmp\hadoop-SYSTEM\dfs\data\current are as follows:

Directory of G:\tmp\hadoop-SYSTEM\dfs\data\current

2008-09-21 18:10    <DIR>          .
2008-09-21 18:10    <DIR>          ..
2008-09-21 18:10               167 VERSION
2008-09-21 18:10               480 dncp_block_verification.log.curr
2008-09-21 18:39                87 blk_7287293315123774920.meta
2008-09-21 18:39            10,109 blk_7287293315123774920
2008-09-21 18:39                23 blk_3670615963974276357.meta
2008-09-21 18:39             1,957 blk_3670615963974276357
2008-09-21 18:39                23 blk_125370523583213471.meta
2008-09-21 18:39             1,987 blk_125370523583213471
2008-09-21 18:39                23 blk_-8983105898459096464.meta
2008-09-21 18:39             1,957 blk_-8983105898459096464
2008-09-21 18:39                23 blk_-6348337313643072566.meta
2008-09-21 18:39             1,985 blk_-6348337313643072566
2008-09-21 18:39                23 blk_-140532538463136818.meta
2008-09-21 18:39             1,957 blk_-140532538463136818
2008-09-21 18:39                23 blk_2961784518440227574.meta
2008-09-21 18:39             1,957 blk_2961784518440227574
              16 File(s)         22,781 bytes
               2 Dir(s) 2,220,326,912 bytes free
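
The same upload can be verified from HDFS's point of view with the DFS shell; the path input is relative to the HDFS home directory of the current user (/user/SHIYANJUN here):

$ bin/hadoop dfs -ls input
# should list the uploaded TXT files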

I had prepared 7 TXT files in the local input directory. When they were copied into HDFS, each TXT file became one block (Block) file in HDFS (that is, in the tmp\hadoop-SYSTEM\dfs\data\current directory), 7 blocks in all, and each block file holds exactly the content of its corresponding local TXT file. For example, opening blk_125370523583213471 shows:

apache apache bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache hash bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash shirdrn bash apache apache bash bash apache apache bash bash shirdrn apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache fax bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache find apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash bash apache apache bash
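
To see the file-to-block mapping without poking around the local data directory, fsck should work in this version (a hedged sketch; flag support may vary slightly across releases):

$ bin/hadoop fsck /user/SHIYANJUN/input -files -blocks -locations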

4. Running the job

Start the job with the following command:

$ bin/hadoop jar hadoop-0.16.4-examples.jar wordcount input output

The execution produces the following output:

08/09/21 18:54:11 INFO mapred.FileInputFormat: Total input paths to process : 7
08/09/21 18:54:13 INFO mapred.JobClient: Running job: job_200809211811_0001
08/09/21 18:54:14 INFO mapred.JobClient: map 0% reduce 0%
08/09/21 18:54:34 INFO mapred.JobClient: map 28% reduce 0%
08/09/21 18:54:45 INFO mapred.JobClient: map 28% reduce 4%
08/09/21 18:54:46 INFO mapred.JobClient: map 42% reduce 4%
08/09/21 18:54:47 INFO mapred.JobClient: map 57% reduce 4%
08/09/21 18:54:52 INFO mapred.JobClient: map 57% reduce 9%
08/09/21 18:54:57 INFO mapred.JobClient: map 85% reduce 19%
08/09/21 18:55:02 INFO mapred.JobClient: map 100% reduce 19%
08/09/21 18:55:07 INFO mapred.JobClient: map 100% reduce 28%
08/09/21 18:55:11 INFO mapred.JobClient: map 100% reduce 100%
08/09/21 18:55:12 INFO mapred.JobClient: Job complete: job_200809211811_0001
08/09/21 18:55:12 INFO mapred.JobClient: Counters: 12
08/09/21 18:55:12 INFO mapred.JobClient:   Job Counters
08/09/21 18:55:12 INFO mapred.JobClient:     Launched map tasks=7
08/09/21 18:55:12 INFO mapred.JobClient:     Launched reduce tasks=1
08/09/21 18:55:12 INFO mapred.JobClient:     Data-local map tasks=7
08/09/21 18:55:12 INFO mapred.JobClient:   Map-Reduce Framework
08/09/21 18:55:12 INFO mapred.JobClient:     Map input records=7
08/09/21 18:55:12 INFO mapred.JobClient:     Map output records=3649
08/09/21 18:55:12 INFO mapred.JobClient:     Map input bytes=21909
08/09/21 18:55:12 INFO mapred.JobClient:     Map output bytes=36511
08/09/21 18:55:12 INFO mapred.JobClient:     Combine input records=3649
08/09/21 18:55:12 INFO mapred.JobClient:     Combine output records=21
08/09/21 18:55:12 INFO mapred.JobClient:     Reduce input groups=7
08/09/21 18:55:12 INFO mapred.JobClient:     Reduce input records=21
08/09/21 18:55:12 INFO mapred.JobClient:     Reduce output records=7
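
While the job runs (or afterwards) you can also query the JobTracker directly, using the job ID printed above; in this version the job sub-command should accept a status query (hedged, as the exact options vary between releases):

$ bin/hadoop job -status job_200809211811_0001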

After the job completes, one more block (Block) file appears in the tmp\hadoop-SYSTEM\dfs\data\current directory; in my case it is blk_6547411606566553711, along with its metadata file blk_6547411606566553711.meta. Opening blk_6547411606566553711 shows the following content:

apache 1826
baketball 1
bash 1813
fax 2
find 1
hash 1
shirdrn 5

There it is: this is the final result.
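
Reading the block file directly only works here because this tiny result fits in a single block; the supported way to fetch results is the DFS shell, either -cat (as in the command summary at the top) or -get to copy them to the local file system:

$ bin/hadoop dfs -get output ./output-local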

During execution, the run information is of course also written to the log files, which are large and detailed, as shown below:

G:\>tree G:\hadoop-0.16.4\logs /A /F
Folder PATH listing
Volume serial number is D275-ECF3
G:\HADOOP-0.16.4\LOGS
|   hadoop-SHIYANJUN-namenode-cbbd2ce9428e48b.out
|   hadoop-SHIYANJUN-namenode-cbbd2ce9428e48b.log
|   hadoop-SHIYANJUN-datanode-cbbd2ce9428e48b.out
|   hadoop-SHIYANJUN-datanode-cbbd2ce9428e48b.log
|   hadoop-SHIYANJUN-secondarynamenode-cbbd2ce9428e48b.out
|   hadoop-SHIYANJUN-jobtracker-cbbd2ce9428e48b.out
|   hadoop-SHIYANJUN-secondarynamenode-cbbd2ce9428e48b.log
|   hadoop-SHIYANJUN-jobtracker-cbbd2ce9428e48b.log
|   hadoop-SHIYANJUN-tasktracker-cbbd2ce9428e48b.out
|   hadoop-SHIYANJUN-tasktracker-cbbd2ce9428e48b.log
|
+---history
|       JobHistory.log
|       1221994453046_job_200809211811_0001
|       job_200809211811_0001_conf.xml
|
\---userlogs
    +---task_200809211811_0001_m_000000_0
    |       stdout
    |       stderr
    |       syslog
    |
    +---task_200809211811_0001_m_000001_0
    |       stdout
    |       stderr
    |       syslog
    |
    +---task_200809211811_0001_r_000000_0
    |       stdout
    |       stderr
    |       syslog
    |
    +---task_200809211811_0001_m_000002_0
    |       stdout
    |       stderr
    |       syslog
    |
    +---task_200809211811_0001_m_000003_0
    |       stdout
    |       stderr
    |       syslog
    |
    +---task_200809211811_0001_m_000004_0
    |       stdout
    |       stderr
    |       syslog
    |
    +---task_200809211811_0001_m_000005_0
    |       stdout
    |       stderr
    |       syslog
    |
    \---task_200809211811_0001_m_000006_0
            stdout
            stderr
            syslog
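
Each per-task directory holds that task attempt's stdout, stderr, and syslog. For example, to read the log of the first map task:

$ cat logs/userlogs/task_200809211811_0001_m_000000_0/syslog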

5. Stopping the Hadoop processes

Use the following command:

$ bin/stop-all.sh

The shutdown proceeds as follows:

stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
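
As a final check, jps should now show no Hadoop daemons, and the NameNode's lock file should be gone once it has shut down cleanly:

$ jps
$ ls /tmp/hadoop-SHIYANJUN/dfs/name
# in_use.lock should no longer be present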
