Hadoop 1.1.1 Cluster Installation
1. Download the installation package
Download the Hadoop installation package, then copy and extract it onto every machine in the cluster.
The machines in the cluster play two roles: masters (management nodes) and slaves (worker nodes). The masters run the two management daemons: the NameNode (file system management) and the JobTracker (job management); the cluster has exactly one of each. Every slave machine runs two daemons: a DataNode (block storage) and a TaskTracker (task management).
The path where Hadoop is installed on each machine is given by the HADOOP_HOME environment variable; normally this variable is set to the same value on all machines.
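A minimal sketch of that convention, assuming an install path of /opt/hadoop-1.1.1 and a bash login shell (both are placeholders, not required by Hadoop):

```sh
# Append to e.g. ~/.bashrc on every node; the install path is an example.
export HADOOP_HOME=/opt/hadoop-1.1.1
export PATH=$PATH:$HADOOP_HOME/bin
```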
2. Configuration
2.1 Configuration files
Hadoop reads its configuration from the files under ${HADOOP_HOME}/conf, chiefly conf/hadoop-env.sh (daemon environment), conf/core-site.xml, conf/hdfs-site.xml, conf/mapred-site.xml and conf/mapred-queue-acls.xml (site settings), plus conf/masters and conf/slaves (host lists).
2.2 Site configuration (settings for the configuration files under conf listed above)
2.2.1 Configuring the Hadoop runtime environment
The daemons' process environment is customized in conf/hadoop-env.sh on every machine; each daemon's JVM options can be set individually with the variables below.
Daemon | Configure Options |
---|---|
NameNode | HADOOP_NAMENODE_OPTS |
DataNode | HADOOP_DATANODE_OPTS |
SecondaryNamenode | HADOOP_SECONDARYNAMENODE_OPTS |
JobTracker | HADOOP_JOBTRACKER_OPTS |
TaskTracker | HADOOP_TASKTRACKER_OPTS |
- HADOOP_LOG_DIR: directory where the daemons' log files are stored.
- HADOOP_HEAPSIZE: maximum heap size, in MB, used by the Hadoop daemons. The default is 1000 MB.
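As a rough sketch, settings like these go into conf/hadoop-env.sh on each machine; the JAVA_HOME path, log directory, heap size and GC flag below are illustrative assumptions, not recommendations:

```sh
# conf/hadoop-env.sh (excerpt); all values are examples.
export JAVA_HOME=/usr/lib/jvm/java-6-sun        # JDK location, required on every node
export HADOOP_LOG_DIR=/var/log/hadoop           # where daemon logs are written
export HADOOP_HEAPSIZE=2000                     # daemon heap size in MB (default 1000)
# Per-daemon JVM options, e.g. GC logging for the NameNode only:
export HADOOP_NAMENODE_OPTS="-verbose:gc ${HADOOP_NAMENODE_OPTS}"
```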
2.2.2 Configuring the Hadoop components
conf/core-site.xml:
Parameter | Value | Notes |
---|---|---|
fs.default.name | URI of NameNode. | hdfs://hostname/ |
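Put together, a minimal conf/core-site.xml might look like the sketch below; the NameNode hostname and port are placeholders:

```sh
# Write a minimal conf/core-site.xml (same contents on every node).
cat > "$HADOOP_HOME/conf/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.com:9000/</value>
  </property>
</configuration>
EOF
```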
conf/hdfs-site.xml:
Parameter | Value | Notes |
---|---|---|
dfs.name.dir | Local path(s) on the NameNode where the namespace and transaction logs are stored | Multiple comma-separated paths may be given; the NameNode files are then replicated to all of them, for redundancy. |
dfs.data.dir | Comma-separated list of local paths where the DataNode stores its blocks | If multiple directories are given, data is stored across all of them. |
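For example, a conf/hdfs-site.xml along these lines; the local paths are assumptions and should point at your own disks:

```sh
# Write conf/hdfs-site.xml; the directories are example paths.
cat > "$HADOOP_HOME/conf/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <!-- two paths, so the NameNode keeps redundant copies -->
    <value>/data1/hadoop/name,/data2/hadoop/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <!-- blocks are spread across both directories/disks -->
    <value>/data1/hadoop/data,/data2/hadoop/data</value>
  </property>
</configuration>
EOF
```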
conf/mapred-site.xml:
Parameter | Value | Notes |
---|---|---|
mapred.job.tracker | Host and port of the JobTracker | host:port pair. |
mapred.system.dir | Path in HDFS where the Map/Reduce framework stores its system files | It lives in the default (HDFS) file system and must be accessible from every machine in the cluster. |
mapred.local.dir | Comma-separated list of local directories where Map/Reduce writes its temporary data | Spreading it over several directories on different devices spreads disk I/O. |
mapred.tasktracker.{map|reduce}.tasks.maximum | Maximum number of map/reduce tasks run simultaneously on a single TaskTracker | Defaults to 2 (2 map tasks and 2 reduce tasks); tune it to match the machine's hardware. |
dfs.hosts / dfs.hosts.exclude | Lists of permitted and excluded DataNodes | If necessary, use these files to control the set of allowed DataNodes. (These two parameters are read by the NameNode and belong in conf/hdfs-site.xml.) |
mapred.hosts / mapred.hosts.exclude | Lists of permitted and excluded TaskTrackers | If necessary, use these files to control the set of allowed TaskTrackers. |
mapred.queue.names | Comma-separated list of queues to which jobs can be submitted | The Map/Reduce system always provides at least one queue, named default, so this value is often just default. Some job schedulers, such as the Capacity Scheduler, support multiple queues; if such a scheduler is used, list all queue names here. Once queues are defined, users submit to a queue by setting mapred.job.queue.name in the job configuration. |
mapred.acls.enabled | Whether to check queue and job ACLs | If true, queue ACLs are checked while submitting and administering jobs, and job ACLs are checked for authorizing view and modification of jobs. Queue ACLs are specified using configuration parameters of the form mapred.queue.queue-name.acl-name, defined below under mapred-queue-acls.xml. |
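A sketch of a corresponding conf/mapred-site.xml; the JobTracker address, paths and task counts are illustrative values to be adjusted per cluster:

```sh
# Write conf/mapred-site.xml; all values below are examples.
cat > "$HADOOP_HOME/conf/mapred-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:9001</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <!-- path inside HDFS, reachable from all nodes -->
    <value>/mapred/system</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <!-- temporary Map/Reduce data, spread over local disks -->
    <value>/data1/mapred/local,/data2/mapred/local</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
</configuration>
EOF
```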
conf/mapred-queue-acls.xml:
Parameter | Value | Notes |
---|---|---|
mapred.queue.queue-name.acl-submit-job | List of users and groups that can submit jobs to the specified queue-name. | The lists of users and groups are both comma-separated lists of names, and the two lists are separated by a blank. Example: user1,user2 group1,group2. If you wish to define only a list of groups, provide a blank at the beginning of the value. |
mapred.queue.queue-name.acl-administer-jobs | List of users and groups that can view job details, change the priority of, or kill jobs that have been submitted to the specified queue-name. | The lists of users and groups are both comma-separated lists of names, and the two lists are separated by a blank. Example: user1,user2 group1,group2. If you wish to define only a list of groups, provide a blank at the beginning of the value. Note that the owner of a job can always change the priority of or kill his/her own job, irrespective of the ACLs. |
The parameters above are typically marked final so that they cannot be overridden by user job configurations, as in the sketch below.
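A minimal sketch of conf/mapred-queue-acls.xml for the default queue, with the properties marked final; the user and group names are placeholders:

```sh
# Write conf/mapred-queue-acls.xml; user/group names are made up.
cat > "$HADOOP_HOME/conf/mapred-queue-acls.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.queue.default.acl-submit-job</name>
    <value>user1,user2 group1</value>
    <final>true</final>  <!-- cannot be overridden by job configurations -->
  </property>
  <property>
    <name>mapred.queue.default.acl-administer-jobs</name>
    <!-- groups only: note the leading blank before the group list -->
    <value> hadoop-admins</value>
    <final>true</final>
  </property>
</configuration>
EOF
```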
2.2.3 Real-world configuration examples
Configuration File | Parameter | Value | Notes |
---|---|---|---|
conf/hdfs-site.xml | dfs.block.size | 134217728 | HDFS blocksize of 128MB for large file-systems. |
conf/hdfs-site.xml | dfs.namenode.handler.count | 40 | More NameNode server threads to handle RPCs from large number of DataNodes. |
conf/mapred-site.xml | mapred.reduce.parallel.copies | 20 | Higher number of parallel copies run by reduces to fetch outputs from very large number of maps. |
conf/mapred-site.xml | mapred.map.child.java.opts | -Xmx512M | Larger heap-size for child jvms of maps. |
conf/mapred-site.xml | mapred.reduce.child.java.opts | -Xmx512M | Larger heap-size for child jvms of reduces. |
conf/core-site.xml | fs.inmemory.size.mb | 200 | Larger amount of memory allocated for the in-memory file-system used to merge map-outputs at the reduces. |
conf/core-site.xml | io.sort.factor | 100 | More streams merged at once while sorting files. |
conf/core-site.xml | io.sort.mb | 200 | Higher memory-limit while sorting data. |
conf/core-site.xml | io.file.buffer.size | 131072 | Size of read/write buffer used in SequenceFiles. |
For larger clusters, some of these values are typically raised further:
Configuration File | Parameter | Value | Notes |
---|---|---|---|
conf/mapred-site.xml | mapred.job.tracker.handler.count | 60 | More JobTracker server threads to handle RPCs from large number of TaskTrackers. |
conf/mapred-site.xml | mapred.reduce.parallel.copies | 50 | |
conf/mapred-site.xml | tasktracker.http.threads | 50 | More worker threads for the TaskTracker's http server. The http server is used by reduces to fetch intermediate map-outputs. |
conf/mapred-site.xml | mapred.map.child.java.opts | -Xmx512M | Larger heap-size for child jvms of maps. |
conf/mapred-site.xml | mapred.reduce.child.java.opts | -Xmx1024M | Larger heap-size for child jvms of reduces. |
2.2.4 Task Controllers
Property | Value | Notes |
---|---|---|
mapred.task.tracker.task-controller | Fully qualified class name of the task controller class | Currently there are two implementations of task controller in the Hadoop system, DefaultTaskController and LinuxTaskController. Refer to the class names mentioned above to determine the value to set for the class of choice. |
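For instance, switching to the LinuxTaskController would mean adding a property like the one below to conf/mapred-site.xml (written to a fragment file here so it can be merged by hand; note that LinuxTaskController additionally requires the setuid task-controller binary and its taskcontroller.cfg, which are not covered here):

```sh
# Sketch: the property element to merge into conf/mapred-site.xml.
cat > /tmp/task-controller.fragment.xml <<'EOF'
  <property>
    <name>mapred.task.tracker.task-controller</name>
    <value>org.apache.hadoop.mapred.LinuxTaskController</value>
  </property>
EOF
```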
2.2.5 IP configuration
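In Hadoop 1.x the cluster's host lists live in conf/masters (the host that runs the SecondaryNameNode) and conf/slaves (the DataNode/TaskTracker hosts); the start/stop scripts in section 3 read the slaves file. A sketch with made-up hostnames:

```sh
# Example host lists; hostnames are placeholders. The master must be able to
# ssh to every listed slave without a password, since the start scripts use ssh.
cat > "$HADOOP_HOME/conf/masters" <<'EOF'
master.example.com
EOF
cat > "$HADOOP_HOME/conf/slaves" <<'EOF'
slave01.example.com
slave02.example.com
EOF
```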
3. Starting the cluster
3.1 MapReduce
3.2 Hadoop Startup
To start a Hadoop cluster you will need to start both the HDFS and Map/Reduce cluster.
1. Format a new distributed filesystem:
$ bin/hadoop namenode -format
2. Start HDFS with the following command, run on the designated NameNode:
$ bin/start-dfs.sh
The bin/start-dfs.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and starts the DataNode daemon on all the listed slaves.
3. Start Map/Reduce with the following command, run on the designated JobTracker:
$ bin/start-mapred.sh
The bin/start-mapred.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the JobTracker and starts the TaskTracker daemon on all the listed slaves.
3.3 Hadoop Shutdown
1. Stop HDFS with the following command, run on the designated NameNode:
$ bin/stop-dfs.sh
The bin/stop-dfs.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and stops the DataNode daemon on all the listed slaves.
2. Stop Map/Reduce with the following command, run on the designated JobTracker:
$ bin/stop-mapred.sh
The bin/stop-mapred.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the JobTracker and stops the TaskTracker daemon on all the listed slaves.