Hadoop 1.1.1 Installation

Hadoop 1.1.1 Cluster Installation

1. Download the Installation Package

      Download the Hadoop installation package, then copy and unpack it onto every machine in the cluster.

       Machines in the cluster play one of two roles: masters (management nodes) and slaves (worker nodes). The masters run the two management daemons: the NameNode (file system management) and the JobTracker (job management); the cluster can have only one of each. Every slave machine runs two daemons: a DataNode (block storage) and a TaskTracker (task management).

       The installation path of Hadoop on each machine is determined by the HADOOP_HOME environment variable; it is usually the same on all machines.
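       A minimal sketch of setting it in ~/.bashrc or /etc/profile on every node, assuming the package was unpacked to /opt/hadoop-1.1.1 (an illustrative path):

           export HADOOP_HOME=/opt/hadoop-1.1.1      # assumed install path, same on every node
           export PATH=$PATH:$HADOOP_HOME/bin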

2. Configuration

     The following sections describe how to configure a Hadoop cluster.

2.1 Configuration Files

     Hadoop has two important kinds of configuration files:
     1. Read-only default configuration: src/core/core-default.xml, src/hdfs/hdfs-default.xml, src/mapred/mapred-default.xml
     2. Site-specific configuration: conf/core-site.xml, conf/hdfs-site.xml, conf/mapred-site.xml
      In addition, the hadoop scripts in the bin directory read conf/hadoop-env.sh for cluster-specific environment settings.

2.2 Site Configuration (settings in the conf files listed above)

     To configure a Hadoop cluster you first need to set up the right environment for the Hadoop processes (configure the Java directory, host names in /etc/hosts, and password-less SSH trust between the nodes).
     The main Hadoop daemons are the NameNode/DataNode pair and the JobTracker/TaskTracker pair.
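     As a rough sketch of the SSH trust setup (run on the master node; the user name and slave host names are assumptions and must already resolve via /etc/hosts):

         $ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa   # generate a passphrase-less key pair
         $ ssh-copy-id hadoop@slave1                  # install the public key on each slave
         $ ssh-copy-id hadoop@slave2
         $ ssh hadoop@slave1 hostname                 # verify that password-less login works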

2.2.1 Configuring the Hadoop Runtime Environment

     You can use the conf/hadoop-env.sh script to set the runtime environment for the Hadoop daemons.
     At a minimum, you must configure the JAVA_HOME environment variable.
     In addition, the HADOOP_*_OPTS variables let you pass daemon-specific options:
Daemon | Configure Option
NameNode | HADOOP_NAMENODE_OPTS
DataNode | HADOOP_DATANODE_OPTS
SecondaryNameNode | HADOOP_SECONDARYNAMENODE_OPTS
JobTracker | HADOOP_JOBTRACKER_OPTS
TaskTracker | HADOOP_TASKTRACKER_OPTS
      For example, to make the NameNode use parallel garbage collection, add the following line to hadoop-env.sh:
                 export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC ${HADOOP_NAMENODE_OPTS}"
      Other useful options:
  • HADOOP_LOG_DIR: the directory where the daemons' log files are stored
  • HADOOP_HEAPSIZE: the maximum heap size to use, in MB, for the Hadoop daemons. The default is 1000 MB.
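      A short hadoop-env.sh sketch combining these settings (the JDK path and log directory are assumptions; adjust them to your machines):

           export JAVA_HOME=/usr/java/jdk1.6.0_45        # assumed JDK location
           export HADOOP_HEAPSIZE=2000                   # daemon heap size in MB (default is 1000)
           export HADOOP_LOG_DIR=/var/log/hadoop         # assumed log directory
           export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC ${HADOOP_NAMENODE_OPTS}"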

2.2.2 Configuring the Hadoop Daemons

    This section highlights some of the important configuration parameters:
   conf/core-site.xml:
   
Parameter | Value | Notes
fs.default.name | URI of the NameNode | hdfs://hostname/
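   For example, a minimal conf/core-site.xml might look like the following ("master" and port 9000 are assumed values for the NameNode host and port):

       <?xml version="1.0"?>
       <configuration>
         <property>
           <name>fs.default.name</name>
           <value>hdfs://master:9000</value>  <!-- URI of the NameNode -->
         </property>
       </configuration>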
   conf/hdfs-site.xml:
Parameter | Value | Notes
dfs.name.dir | Local directories where the NameNode stores its metadata | Multiple comma-separated paths may be given; the name table is then replicated to all of them for redundancy.
dfs.data.dir | Comma-separated local paths where the DataNode stores its blocks | If several paths are given, data is stored in all of the directories.
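   A corresponding conf/hdfs-site.xml sketch (the local paths are assumptions; several comma-separated paths may be given as described above):

       <?xml version="1.0"?>
       <configuration>
         <property>
           <name>dfs.name.dir</name>
           <value>/data1/hdfs/name,/data2/hdfs/name</value>  <!-- replicated NameNode metadata -->
         </property>
         <property>
           <name>dfs.data.dir</name>
           <value>/data1/hdfs/data,/data2/hdfs/data</value>  <!-- DataNode block storage -->
         </property>
       </configuration>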

conf/mapred-site.xml:

Parameter | Value | Notes
mapred.job.tracker | Host and port of the JobTracker | host:port pair.
mapred.system.dir | HDFS directory where the MapReduce framework stores its system files | Lives in the default (HDFS) file system; note that it must be accessible from every machine in the cluster.
mapred.local.dir | Comma-separated local paths where temporary MapReduce data is written | Multiple directories help spread disk I/O.
mapred.tasktracker.{map|reduce}.tasks.maximum | Maximum number of map/reduce tasks run simultaneously on a single TaskTracker | The default is 2 (2 map tasks and 2 reduce tasks); adjust and tune it according to the hardware of your machines.
dfs.hosts / dfs.hosts.exclude | Lists of permitted and excluded DataNodes | If necessary, use these files to control the set of allowed DataNodes. (These two are HDFS parameters and belong in conf/hdfs-site.xml.)
mapred.hosts / mapred.hosts.exclude | Lists of permitted and excluded TaskTrackers | If necessary, use these files to control the set of allowed TaskTrackers.
mapred.queue.names | Comma-separated list of queues to which jobs can be submitted | The MapReduce system always has at least one queue, named default, which is also the usual value of this parameter. Some job schedulers, such as the Capacity Scheduler, support multiple queues; if such a scheduler is used, list all queue names here. Once queues are defined, users submit jobs to a queue through the mapred.job.queue.name property of the job configuration.
mapred.acls.enabled | Whether ACL-based authorization is performed | If true, queue ACLs are checked when jobs are submitted or administered, and job ACLs are checked when jobs are viewed or modified. Queue ACLs are specified with parameters of the form mapred.queue.queue-name.acl-name, defined below under mapred-queue-acls.xml. Job ACLs are described in the Job Authorization documentation.
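A conf/mapred-site.xml sketch covering the most common of these parameters (the host name, paths and task counts are assumptions to adjust per cluster):

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>master:9001</value>                      <!-- assumed JobTracker host:port -->
      </property>
      <property>
        <name>mapred.system.dir</name>
        <value>/mapred/system</value>                   <!-- HDFS path, accessible from all nodes -->
      </property>
      <property>
        <name>mapred.local.dir</name>
        <value>/data1/mapred/local,/data2/mapred/local</value>  <!-- local temp dirs, spreads disk I/O -->
      </property>
      <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>4</value>                                <!-- tune to the node's CPU and RAM -->
      </property>
      <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>2</value>
      </property>
    </configuration>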

conf/mapred-queue-acls.xml:

Parameter | Value | Notes
mapred.queue.queue-name.acl-submit-job | List of users and groups that can submit jobs to the specified queue-name | The user list and the group list are each comma-separated; the two lists are separated by a blank. Example: user1,user2 group1,group2. To specify only a list of groups, start the value with a blank.
mapred.queue.queue-name.acl-administer-jobs | List of users and groups that can view job details, change the priority of, or kill jobs submitted to the specified queue-name | Same format: user1,user2 group1,group2; to specify only groups, start the value with a blank. Note that the owner of a job can always change the priority of or kill his/her own job, irrespective of the ACLs.

The parameters above can be marked final to prevent user programs from overriding them.
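A conf/mapred-queue-acls.xml sketch for the default queue, with the values marked final as suggested above (the user and group names are purely illustrative):

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>mapred.queue.default.acl-submit-job</name>
        <value>user1,user2 group1</value>   <!-- users, a blank, then groups -->
        <final>true</final>                 <!-- user programs cannot override this -->
      </property>
      <property>
        <name>mapred.queue.default.acl-administer-jobs</name>
        <value>admin1 admingroup</value>
        <final>true</final>
      </property>
    </configuration>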

2.2.3 Real-World Configuration Examples

    This section lists configuration values used by benchmark programs, validated on large data sets and large clusters.
   
Configuration File | Parameter | Value | Notes
conf/hdfs-site.xml | dfs.block.size | 134217728 | HDFS block size of 128 MB for large file systems.
conf/hdfs-site.xml | dfs.namenode.handler.count | 40 | More NameNode server threads to handle RPCs from a large number of DataNodes.
conf/mapred-site.xml | mapred.reduce.parallel.copies | 20 | Higher number of parallel copies run by reduces to fetch outputs from a very large number of maps.
conf/mapred-site.xml | mapred.map.child.java.opts | -Xmx512M | Larger heap size for child JVMs of maps.
conf/mapred-site.xml | mapred.reduce.child.java.opts | -Xmx512M | Larger heap size for child JVMs of reduces.
conf/core-site.xml | fs.inmemory.size.mb | 200 | Larger amount of memory allocated for the in-memory file system used to merge map outputs at the reduces.
conf/core-site.xml | io.sort.factor | 100 | More streams merged at once while sorting files.
conf/core-site.xml | io.sort.mb | 200 | Higher memory limit while sorting data.
conf/core-site.xml | io.file.buffer.size | 131072 | Size of the read/write buffer used in SequenceFiles.




A second set of values, used when the benchmarks were run on an even larger cluster:

Configuration File | Parameter | Value | Notes
conf/mapred-site.xml | mapred.job.tracker.handler.count | 60 | More JobTracker server threads to handle RPCs from a large number of TaskTrackers.
conf/mapred-site.xml | mapred.reduce.parallel.copies | 50 |
conf/mapred-site.xml | tasktracker.http.threads | 50 | More worker threads for the TaskTracker's HTTP server, which is used by reduces to fetch intermediate map outputs.
conf/mapred-site.xml | mapred.map.child.java.opts | -Xmx512M | Larger heap size for child JVMs of maps.
conf/mapred-site.xml | mapred.reduce.child.java.opts | -Xmx1024M | Larger heap size for child JVMs of reduces.


2.2.4 Task Controllers

     Task controllers define how the Hadoop MapReduce framework launches and controls map and reduce tasks. They are useful when a cluster needs to customize how its MapReduce tasks are run. This section describes how to configure and use task controllers.
    Hadoop ships with two task controllers:
1. DefaultTaskController: org.apache.hadoop.mapred.DefaultTaskController
   The default task controller; tasks run as the same user that runs the TaskTracker daemon.
2. LinuxTaskController: org.apache.hadoop.mapred.LinuxTaskController
    This task controller, which is supported only on Linux, runs the tasks as the user who submitted the job. It requires these user accounts to be created on the cluster nodes where the tasks are launched. It uses a setuid executable that is included in the Hadoop distribution. The task tracker uses this executable to launch and kill tasks. The setuid executable switches to the user who has submitted the job and launches or kills the tasks. For maximum security, this task controller sets up restricted permissions and user/group ownership of local files and directories used by the tasks such as the job jar files, intermediate files, task log files and distributed cache files. Particularly note that, because of this, except the job owner and tasktracker, no other user can access any of the local files/directories including those localized as part of the distributed cache.
Configuring the task controller:
Property | Value | Notes
mapred.task.tracker.task-controller | Fully qualified class name of the task controller class | There are currently two implementations in Hadoop, DefaultTaskController and LinuxTaskController. Use the class names listed above to set the value for the implementation of your choice.
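For example, the following conf/mapred-site.xml snippet selects the default controller; switching the value to org.apache.hadoop.mapred.LinuxTaskController would run tasks as the submitting user instead:

    <property>
      <name>mapred.task.tracker.task-controller</name>
      <value>org.apache.hadoop.mapred.DefaultTaskController</value>
    </property>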

2.2.5 IP Configuration

    conf/masters lists the master host name or IP address (in Hadoop 1.x this file actually determines where the SecondaryNameNode is started).
    conf/slaves lists the slave nodes, one host name or IP address per line.
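    For example, with the assumed host names master, slave1 and slave2 (already mapped in /etc/hosts):

        conf/masters:
            master

        conf/slaves:
            slave1
            slave2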

3. Starting the Cluster

3.1 MapReduce

    If mapred.jobtracker.restart.recover is enabled and JobHistory logging is turned on, the JobTracker will recover jobs that had not finished when it restarts. mapred.jobtracker.job.history.block.size should be set to a reasonable value for flushing the job history to disk; a typical value is 3 MB.
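    A conf/mapred-site.xml sketch for these settings (3145728 bytes corresponds to the typical 3 MB value mentioned above, assuming the property is given in bytes):

        <property>
          <name>mapred.jobtracker.restart.recover</name>
          <value>true</value>                     <!-- recover unfinished jobs after a restart -->
        </property>
        <property>
          <name>mapred.jobtracker.job.history.block.size</name>
          <value>3145728</value>                  <!-- job history block size written to disk -->
        </property>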

3.2 Hadoop Startup

To start a Hadoop cluster you will need to start both the HDFS and Map/Reduce cluster.

1. Format a new distributed filesystem:
$ bin/hadoop namenode -format

2. Start HDFS with the following command, run on the designated NameNode:
$ bin/start-dfs.sh

The bin/start-dfs.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and starts the DataNode daemon on all the listed slaves.

3. Start Map/Reduce with the following command, run on the designated JobTracker:
$ bin/start-mapred.sh

The bin/start-mapred.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the JobTracker and starts the TaskTracker daemon on all the listed slaves.

3.3 Hadoop Shutdown

1. Stop HDFS with the following command, run on the designated NameNode:
$ bin/stop-dfs.sh

The bin/stop-dfs.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and stops the DataNode daemon on all the listed slaves.

2. Stop Map/Reduce with the following command, run on the designated JobTracker:
$ bin/stop-mapred.sh

The bin/stop-mapred.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the JobTracker and stops the TaskTracker daemon on all the listed slaves.