Hadoop 1.1.1 Installation

Hadoop 1.1.1 Cluster Installation

1. Download the Installation Package

      Download the Hadoop installation package, then copy and unpack it onto every machine in the cluster.

       Machines in the cluster play one of two roles: masters (management nodes) and slaves (worker nodes). The masters run the two management daemons: the NameNode (file system management) and the JobTracker (job management); the cluster can have only one of each. Every slave machine runs two daemons: a DataNode (block storage) and a TaskTracker (task management).

       The installation path of Hadoop on each machine is determined by the HADOOP_HOME environment variable; it is usually the same on all machines.
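       A minimal sketch of setting it in ~/.bashrc or /etc/profile on every node, assuming the package was unpacked to /opt/hadoop-1.1.1 (an illustrative path):

           export HADOOP_HOME=/opt/hadoop-1.1.1      # assumed install path, same on every node
           export PATH=$PATH:$HADOOP_HOME/bin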

2. Configuration

     The following sections describe how to configure a Hadoop cluster.

2.1 Configuration Files

     Hadoop has two important kinds of configuration files:
     1. Read-only default configuration: src/core/core-default.xml, src/hdfs/hdfs-default.xml, src/mapred/mapred-default.xml
     2. Site-specific configuration: conf/core-site.xml, conf/hdfs-site.xml, conf/mapred-site.xml
      In addition, the hadoop scripts in the bin directory read conf/hadoop-env.sh for cluster-specific environment settings.

2.2 Site Configuration (settings in the conf files listed above)

     To configure a Hadoop cluster you first need to set up the right environment for the Hadoop processes (configure the Java directory, host names in /etc/hosts, and password-less SSH trust between the nodes).
     The main Hadoop daemons are the NameNode/DataNode pair and the JobTracker/TaskTracker pair.
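     As a rough sketch of the SSH trust setup (run on the master node; the user name and slave host names are assumptions and must already resolve via /etc/hosts):

         $ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa   # generate a passphrase-less key pair
         $ ssh-copy-id hadoop@slave1                  # install the public key on each slave
         $ ssh-copy-id hadoop@slave2
         $ ssh hadoop@slave1 hostname                 # verify that password-less login works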

2.2.1 Configuring the Hadoop Runtime Environment

     You can use the conf/hadoop-env.sh script to set the runtime environment for the Hadoop daemons.
     At a minimum, you must configure the JAVA_HOME environment variable.
     In addition, the HADOOP_*_OPTS variables let you pass daemon-specific options:
Daemon | Configure Option
NameNode | HADOOP_NAMENODE_OPTS
DataNode | HADOOP_DATANODE_OPTS
SecondaryNameNode | HADOOP_SECONDARYNAMENODE_OPTS
JobTracker | HADOOP_JOBTRACKER_OPTS
TaskTracker | HADOOP_TASKTRACKER_OPTS
      For example, to make the NameNode use parallel garbage collection, add the following line to hadoop-env.sh:
                 export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC ${HADOOP_NAMENODE_OPTS}"
      Other useful options:
  • HADOOP_LOG_DIR: the directory where the daemons' log files are stored
  • HADOOP_HEAPSIZE: the maximum heap size to use, in MB, for the Hadoop daemons. The default is 1000 MB.
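      A short hadoop-env.sh sketch combining these settings (the JDK path and log directory are assumptions; adjust them to your machines):

           export JAVA_HOME=/usr/java/jdk1.6.0_45        # assumed JDK location
           export HADOOP_HEAPSIZE=2000                   # daemon heap size in MB (default is 1000)
           export HADOOP_LOG_DIR=/var/log/hadoop         # assumed log directory
           export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC ${HADOOP_NAMENODE_OPTS}"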

2.2.2 Configuring the Hadoop Daemons

    This section highlights some of the important configuration parameters:
   conf/core-site.xml:
   
Parameter | Value | Notes
fs.default.name | URI of the NameNode | hdfs://hostname/
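   For example, a minimal conf/core-site.xml might look like the following ("master" and port 9000 are assumed values for the NameNode host and port):

       <?xml version="1.0"?>
       <configuration>
         <property>
           <name>fs.default.name</name>
           <value>hdfs://master:9000</value>  <!-- URI of the NameNode -->
         </property>
       </configuration>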
   conf/hdfs-site.xml:
Parameter | Value | Notes
dfs.name.dir | Local directories where the NameNode stores its metadata | Multiple comma-separated paths may be given; the name table is then replicated to all of them for redundancy.
dfs.data.dir | Comma-separated local paths where the DataNode stores its blocks | If several paths are given, data is stored in all of the directories.
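   A corresponding conf/hdfs-site.xml sketch (the local paths are assumptions; several comma-separated paths may be given as described above):

       <?xml version="1.0"?>
       <configuration>
         <property>
           <name>dfs.name.dir</name>
           <value>/data1/hdfs/name,/data2/hdfs/name</value>  <!-- replicated NameNode metadata -->
         </property>
         <property>
           <name>dfs.data.dir</name>
           <value>/data1/hdfs/data,/data2/hdfs/data</value>  <!-- DataNode block storage -->
         </property>
       </configuration>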

conf/mapred-site.xml:

Parameter | Value | Notes
mapred.job.tracker | Host and port of the JobTracker | host:port pair.
mapred.system.dir | HDFS directory where the MapReduce framework stores its system files | Lives in the default (HDFS) file system; note that it must be accessible from every machine in the cluster.
mapred.local.dir | Comma-separated local paths where temporary MapReduce data is written | Multiple directories help spread disk I/O.
mapred.tasktracker.{map|reduce}.tasks.maximum | Maximum number of map/reduce tasks run simultaneously on a single TaskTracker | The default is 2 (2 map tasks and 2 reduce tasks); adjust and tune it according to the hardware of your machines.
dfs.hosts / dfs.hosts.exclude | Lists of permitted and excluded DataNodes | If necessary, use these files to control the set of allowed DataNodes. (These two are HDFS parameters and belong in conf/hdfs-site.xml.)
mapred.hosts / mapred.hosts.exclude | Lists of permitted and excluded TaskTrackers | If necessary, use these files to control the set of allowed TaskTrackers.
mapred.queue.names | Comma-separated list of queues to which jobs can be submitted | The MapReduce system always has at least one queue, named default, which is also the usual value of this parameter. Some job schedulers, such as the Capacity Scheduler, support multiple queues; if such a scheduler is used, list all queue names here. Once queues are defined, users submit jobs to a queue through the mapred.job.queue.name property of the job configuration.
mapred.acls.enabled | Whether ACL-based authorization is performed | If true, queue ACLs are checked when jobs are submitted or administered, and job ACLs are checked when jobs are viewed or modified. Queue ACLs are specified with parameters of the form mapred.queue.queue-name.acl-name, defined below under mapred-queue-acls.xml. Job ACLs are described in the Job Authorization documentation.
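A conf/mapred-site.xml sketch covering the most common of these parameters (the host name, paths and task counts are assumptions to adjust per cluster):

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>master:9001</value>                      <!-- assumed JobTracker host:port -->
      </property>
      <property>
        <name>mapred.system.dir</name>
        <value>/mapred/system</value>                   <!-- HDFS path, accessible from all nodes -->
      </property>
      <property>
        <name>mapred.local.dir</name>
        <value>/data1/mapred/local,/data2/mapred/local</value>  <!-- local temp dirs, spreads disk I/O -->
      </property>
      <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>4</value>                                <!-- tune to the node's CPU and RAM -->
      </property>
      <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>2</value>
      </property>
    </configuration>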

conf/mapred-queue-acls.xml:

Parameter | Value | Notes
mapred.queue.queue-name.acl-submit-job | List of users and groups that can submit jobs to the specified queue-name | The user list and the group list are each comma-separated; the two lists are separated by a blank. Example: user1,user2 group1,group2. To specify only a list of groups, start the value with a blank.
mapred.queue.queue-name.acl-administer-jobs | List of users and groups that can view job details, change the priority of, or kill jobs submitted to the specified queue-name | Same format: user1,user2 group1,group2; to specify only groups, start the value with a blank. Note that the owner of a job can always change the priority of or kill his/her own job, irrespective of the ACLs.

The parameters above can be marked final to prevent user programs from overriding them.
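A conf/mapred-queue-acls.xml sketch for the default queue, with the values marked final as suggested above (the user and group names are purely illustrative):

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>mapred.queue.default.acl-submit-job</name>
        <value>user1,user2 group1</value>   <!-- users, a blank, then groups -->
        <final>true</final>                 <!-- user programs cannot override this -->
      </property>
      <property>
        <name>mapred.queue.default.acl-administer-jobs</name>
        <value>admin1 admingroup</value>
        <final>true</final>
      </property>
    </configuration>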

2.2.3 Real-World Configuration Examples

    This section lists configuration values used by benchmark programs, validated on large data sets and large clusters.
   
Configuration File | Parameter | Value | Notes
conf/hdfs-site.xml | dfs.block.size | 134217728 | HDFS block size of 128 MB for large file systems.
conf/hdfs-site.xml | dfs.namenode.handler.count | 40 | More NameNode server threads to handle RPCs from a large number of DataNodes.
conf/mapred-site.xml | mapred.reduce.parallel.copies | 20 | Higher number of parallel copies run by reduces to fetch outputs from a very large number of maps.
conf/mapred-site.xml | mapred.map.child.java.opts | -Xmx512M | Larger heap size for child JVMs of maps.
conf/mapred-site.xml | mapred.reduce.child.java.opts | -Xmx512M | Larger heap size for child JVMs of reduces.
conf/core-site.xml | fs.inmemory.size.mb | 200 | Larger amount of memory allocated for the in-memory file system used to merge map outputs at the reduces.
conf/core-site.xml | io.sort.factor | 100 | More streams merged at once while sorting files.
conf/core-site.xml | io.sort.mb | 200 | Higher memory limit while sorting data.
conf/core-site.xml | io.file.buffer.size | 131072 | Size of the read/write buffer used in SequenceFiles.




A second set of values, used when the benchmarks were run on an even larger cluster:

Configuration File | Parameter | Value | Notes
conf/mapred-site.xml | mapred.job.tracker.handler.count | 60 | More JobTracker server threads to handle RPCs from a large number of TaskTrackers.
conf/mapred-site.xml | mapred.reduce.parallel.copies | 50 |
conf/mapred-site.xml | tasktracker.http.threads | 50 | More worker threads for the TaskTracker's HTTP server, which is used by reduces to fetch intermediate map outputs.
conf/mapred-site.xml | mapred.map.child.java.opts | -Xmx512M | Larger heap size for child JVMs of maps.
conf/mapred-site.xml | mapred.reduce.child.java.opts | -Xmx1024M | Larger heap size for child JVMs of reduces.


2.2.4 Task Controllers

     Task controllers define how the Hadoop MapReduce framework launches and controls map and reduce tasks. They are useful when a cluster needs to customize how its MapReduce tasks are run. This section describes how to configure and use task controllers.
    Hadoop ships with two task controllers:
1. DefaultTaskController: org.apache.hadoop.mapred.DefaultTaskController
   The default task controller; tasks run as the same user that runs the TaskTracker daemon.
2. LinuxTaskController: org.apache.hadoop.mapred.LinuxTaskController
    This task controller, which is supported only on Linux, runs the tasks as the user who submitted the job. It requires these user accounts to be created on the cluster nodes where the tasks are launched. It uses a setuid executable that is included in the Hadoop distribution. The task tracker uses this executable to launch and kill tasks. The setuid executable switches to the user who has submitted the job and launches or kills the tasks. For maximum security, this task controller sets up restricted permissions and user/group ownership of local files and directories used by the tasks such as the job jar files, intermediate files, task log files and distributed cache files. Particularly note that, because of this, except the job owner and tasktracker, no other user can access any of the local files/directories including those localized as part of the distributed cache.
Configuring the task controller:
Property | Value | Notes
mapred.task.tracker.task-controller | Fully qualified class name of the task controller class | There are currently two implementations in Hadoop, DefaultTaskController and LinuxTaskController. Use the class names listed above to set the value for the implementation of your choice.
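For example, the following conf/mapred-site.xml snippet selects the default controller; switching the value to org.apache.hadoop.mapred.LinuxTaskController would run tasks as the submitting user instead:

    <property>
      <name>mapred.task.tracker.task-controller</name>
      <value>org.apache.hadoop.mapred.DefaultTaskController</value>
    </property>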

2.2.5 IP Configuration

    conf/masters lists the master host name or IP address (in Hadoop 1.x this file actually determines where the SecondaryNameNode is started).
    conf/slaves lists the slave nodes, one host name or IP address per line.
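    For example, with the assumed host names master, slave1 and slave2 (already mapped in /etc/hosts):

        conf/masters:
            master

        conf/slaves:
            slave1
            slave2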

3. Starting the Cluster

3.1 MapReduce

    If mapred.jobtracker.restart.recover is enabled and JobHistory logging is turned on, the JobTracker will recover jobs that had not finished when it restarts. mapred.jobtracker.job.history.block.size should be set to a reasonable value for flushing the job history to disk; a typical value is 3 MB.
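    A conf/mapred-site.xml sketch for these settings (3145728 bytes corresponds to the typical 3 MB value mentioned above, assuming the property is given in bytes):

        <property>
          <name>mapred.jobtracker.restart.recover</name>
          <value>true</value>                     <!-- recover unfinished jobs after a restart -->
        </property>
        <property>
          <name>mapred.jobtracker.job.history.block.size</name>
          <value>3145728</value>                  <!-- job history block size written to disk -->
        </property>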

3.2 Hadoop Startup

To start a Hadoop cluster you will need to start both the HDFS and Map/Reduce cluster.

1. Format a new distributed filesystem:
$ bin/hadoop namenode -format

2. Start HDFS with the following command, run on the designated NameNode:
$ bin/start-dfs.sh

The bin/start-dfs.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and starts the DataNode daemon on all the listed slaves.

3. Start Map/Reduce with the following command, run on the designated JobTracker:
$ bin/start-mapred.sh

The bin/start-mapred.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the JobTracker and starts the TaskTracker daemon on all the listed slaves.

3.3 Hadoop Shutdown

1. Stop HDFS with the following command, run on the designated NameNode:
$ bin/stop-dfs.sh

The bin/stop-dfs.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and stops the DataNode daemon on all the listed slaves.

2. Stop Map/Reduce with the following command, run on the designated JobTracker:
$ bin/stop-mapred.sh

The bin/stop-mapred.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the JobTracker and stops the TaskTracker daemon on all the listed slaves.