I. Hadoop
Hadoop is a distributed system infrastructure developed by the Apache Software Foundation.
It lets users develop distributed programs without having to understand the low-level details of distribution, harnessing the power of a cluster for high-speed computation and storage.
Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault-tolerant and is designed to run on low-cost hardware; it provides high-throughput access to application data, which makes it well suited to applications with very large data sets. HDFS relaxes some POSIX requirements so that data in the file system can be accessed as a stream (streaming access).
The two core components of the Hadoop framework are HDFS and MapReduce: HDFS provides storage for massive data sets, and MapReduce provides the computation over them.
II. Configuring Hadoop in standalone mode
1. Download the installation packages
[root@server1 ~]# ls
hadoop-3.0.3.tar.gz jdk-8u181-linux-x64.tar.gz
2. Create the hadoop user
[root@server1 ~]# useradd hadoop
[root@server1 ~]# id hadoop
uid=1000(hadoop) gid=1000(hadoop) groups=1000(hadoop)
3. Move the packages into the hadoop user's home directory /home/hadoop
[root@server1 ~]# mv * /home/hadoop/
[root@server1 ~]# su - hadoop
[hadoop@server1 ~]$ ls
hadoop-3.0.3.tar.gz jdk-8u181-linux-x64.tar.gz
4. Extract the JDK and create the java symlink
[hadoop@server1 ~]$ tar zxf jdk-8u181-linux-x64.tar.gz
[hadoop@server1 ~]$ ls
hadoop-3.0.3.tar.gz jdk1.8.0_181 jdk-8u181-linux-x64.tar.gz
[hadoop@server1 ~]$ ln -s jdk1.8.0_181/ java
[hadoop@server1 ~]$ ls
hadoop-3.0.3.tar.gz java jdk1.8.0_181 jdk-8u181-linux-x64.tar.gz
5. Extract Hadoop and create the hadoop symlink
[hadoop@server1 ~]$ tar zxf hadoop-3.0.3.tar.gz
[hadoop@server1 ~]$ ln -s hadoop-3.0.3 hadoop
[hadoop@server1 ~]$ ls
hadoop hadoop-3.0.3.tar.gz jdk1.8.0_181
hadoop-3.0.3 java jdk-8u181-linux-x64.tar.gz
[hadoop@server1 ~]$ cd hadoop
[hadoop@server1 hadoop]$ ls
bin include libexec NOTICE.txt sbin
etc lib LICENSE.txt README.txt share
6. Edit the hadoop-env.sh script (the change is sketched after this step)
[hadoop@server1 hadoop]$ cd etc/hadoop/
[hadoop@server1 hadoop]$ vim hadoop-env.sh
[hadoop@server1 hadoop]$ pwd
/home/hadoop/hadoop/etc/hadoop
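The transcript does not show the edit itself. The only change normally required at this point is pointing JAVA_HOME at the JDK installed above; a minimal sketch, assuming the paths used in this article:
export JAVA_HOME=/home/hadoop/java    ## add (or uncomment and edit) this line in hadoop-env.sh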
7. Load the environment variables
[hadoop@server1 ~]$ vim .bash_profile
PATH=$PATH:$HOME/.local/bin:$HOME/bin:$HOME/java/bin
[hadoop@server1 ~]$ source .bash_profile
[hadoop@server1 ~]$ jps
2442 Jps
8. Create the input directory
[hadoop@server1 ~]$ ls
hadoop hadoop-3.0.3.tar.gz jdk1.8.0_181
hadoop-3.0.3 java jdk-8u181-linux-x64.tar.gz
[hadoop@server1 ~]$ cd hadoop
[hadoop@server1 hadoop]$ ls
bin include libexec NOTICE.txt sbin
etc lib LICENSE.txt README.txt share
[hadoop@server1 hadoop]$ mkdir input/
[hadoop@server1 hadoop]$ ls
bin include lib LICENSE.txt README.txt share
etc input libexec NOTICE.txt sbin
9. Copy the data into the input directory
[hadoop@server1 hadoop]$ cp etc/hadoop/*.xml input
[hadoop@server1 hadoop]$ ls input/
capacity-scheduler.xml hdfs-site.xml kms-site.xml
core-site.xml httpfs-site.xml mapred-site.xml
hadoop-policy.xml kms-acls.xml yarn-site.xml
10. Run a MapReduce example job (the bundled grep example counts matches of the regular expression 'dfs[a-z.]+' in the files under input and writes the result to output)
[hadoop@server1 hadoop]$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.3.jar grep input output 'dfs[a-z.]+'
11. When the job finishes, the output directory is created automatically and the results are written into it; it can be inspected as sketched below.
[hadoop@server1 hadoop]$ ls
bin include lib LICENSE.txt output sbin
etc input libexec NOTICE.txt README.txt share
[hadoop@server1 hadoop]$ ls output/
part-r-00000 _SUCCESS
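The result file itself is not shown above. It can be printed with cat; for this grep example run against the stock configuration files the output is typically a single line counting occurrences of dfsadmin, though the exact contents depend on the XML files copied into input:
[hadoop@server1 hadoop]$ cat output/*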
III. Configuring pseudo-distributed mode
1. Edit core-site.xml and hdfs-site.xml (fs.defaultFS points clients at the NameNode address; dfs.replication is set to 1 because there is only one DataNode)
[hadoop@server1 hadoop]$ vim core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://172.25.8.1:9000</value>
</property>
</configuration>
[hadoop@server1 hadoop]$ vim hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
2. Generate an SSH key pair
[hadoop@server1 hadoop]$ cd
[hadoop@server1 ~]$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
97:c4:69:fe:03:09:23:80:29:11:18:31:ef:cf:2a:b0 hadoop@server1
The key's randomart image is:
+--[ RSA 2048]----+
|B= o. |
|ooo . . . |
| .. . o = |
| . . * o |
| . S * |
|. o . o |
|.. o o |
|E . . |
| .. |
+-----------------+
[hadoop@server1 ~]$ logout
3. Set a password for the hadoop user
[root@server1 ~]# passwd hadoop
Changing password for user hadoop.
New password:
BAD PASSWORD: The password is shorter than 8 characters
Retype new password:
passwd: all authentication tokens updated successfully.
4. Set up passwordless SSH
[root@server1 ~]# su - hadoop
Last login: Sat May 18 23:11:49 EDT 2019 on pts/0
[hadoop@server1 ~]$ ssh-copy-id 172.25.8.1
The authenticity of host '172.25.8.1 (172.25.8.1)' can't be established.
ECDSA key fingerprint is 5f:6e:2c:77:cf:fa:0f:af:e6:c0:5f:ac:23:50:8e:e7.
Are you sure you want to continue connecting (yes/no)? yes
/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
hadoop@172.25.8.1's password:
Number of key(s) added: 1
Now try logging into the machine, with: "ssh '172.25.8.1'"
and check to make sure that only the key(s) you wanted were added.
[hadoop@server1 ~]$ ssh-copy-id localhost
The authenticity of host 'localhost (::1)' can't be established.
ECDSA key fingerprint is 5f:6e:2c:77:cf:fa:0f:af:e6:c0:5f:ac:23:50:8e:e7.
Are you sure you want to continue connecting (yes/no)? yes
/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/bin/ssh-copy-id: WARNING: All keys were skipped because they already exist on the remote system.
[hadoop@server1 ~]$ ssh-copy-id server1
The authenticity of host 'server1 (172.25.8.1)' can't be established.
ECDSA key fingerprint is 5f:6e:2c:77:cf:fa:0f:af:e6:c0:5f:ac:23:50:8e:e7.
Are you sure you want to continue connecting (yes/no)? yes
/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/bin/ssh-copy-id: WARNING: All keys were skipped because they already exist on the remote system.
5. Format the NameNode
[hadoop@server1 ~]$ cd hadoop/
[hadoop@server1 hadoop]$ bin/hdfs namenode -format
[hadoop@server1 hadoop]$ ls /tmp/ ## Hadoop keeps its working data under /tmp after formatting
hadoop hadoop-hadoop hadoop-hadoop-namenode.pid hsperfdata_hadoop
6. Start HDFS (a quick verification is sketched after this step)
[hadoop@server1 hadoop]$ sbin/start-dfs.sh
Starting namenodes on [server1]
Starting datanodes
Starting secondary namenodes [server1]
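The daemons are not verified at this point in the original transcript. A quick check on the same node: after start-dfs.sh, jps should list NameNode, DataNode and SecondaryNameNode processes in addition to Jps (the PIDs will of course differ):
[hadoop@server1 hadoop]$ jps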
7. Create the HDFS user directory
[hadoop@server1 hadoop]$ bin/hdfs dfs -mkdir /user
[hadoop@server1 hadoop]$ bin/hdfs dfs -mkdir /user/hadoop
[hadoop@server1 hadoop]$ id
uid=1000(hadoop) gid=1000(hadoop) groups=1000(hadoop)
[hadoop@server1 hadoop]$ bin/hdfs dfs -ls
Browse to 172.25.8.1:9870 to view the NameNode web UI.
8. Upload the input directory to HDFS
[hadoop@server1 hadoop]$ bin/hdfs dfs -put input/
Browse to 172.25.8.1:9870 again to confirm the upload.
9. Delete the local input and output directories created earlier
[hadoop@server1 hadoop]$ rm -fr input/
[hadoop@server1 hadoop]$ rm -fr output/
[hadoop@server1 hadoop]$ ls
bin include libexec logs README.txt share
etc lib LICENSE.txt NOTICE.txt sbin
10. Run a MapReduce job on HDFS (wordcount)
[hadoop@server1 hadoop]$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.3.jar wordcount input output
The output directory is created automatically in HDFS.
11. View the results (wordcount writes one word and its count per line to part-r-00000)
Method 1: view directly in HDFS
[hadoop@server1 hadoop]$ bin/hdfs dfs -cat output/*
Method 2: download to the local filesystem and view
[hadoop@server1 hadoop]$ bin/hdfs dfs -get output
[hadoop@server1 hadoop]$ ls
bin include libexec logs output sbin
etc lib LICENSE.txt NOTICE.txt README.txt share
[hadoop@server1 hadoop]$ cd output/
[hadoop@server1 output]$ ls
part-r-00000 _SUCCESS
IV. Configuring fully distributed mode
[1] Synchronize the nodes (share /home/hadoop over NFS):
server1:
1. Stop HDFS
[hadoop@server1 hadoop]$ sbin/stop-dfs.sh
Stopping namenodes on [server1]
Stopping datanodes
Stopping secondary namenodes [server1]
2. Delete the old data under /tmp (so the reformat starts clean and the nodes stay consistent)
[hadoop@server1 hadoop]$ cd /tmp/
[hadoop@server1 tmp]$ ls
hadoop hadoop-hadoop hsperfdata_hadoop
[hadoop@server1 tmp]$ rm -fr *
[hadoop@server1 tmp]$ cd
[hadoop@server1 ~]$ logout
3. Install nfs-utils
[root@server1 ~]# yum install -y nfs-utils
4. Start rpcbind and enable it at boot
[root@server1 ~]# systemctl start rpcbind
[root@server1 ~]# systemctl enable rpcbind
[root@server1 ~]# systemctl is-enabled rpcbind
indirect
5. Export /home/hadoop via NFS (anonuid/anongid map anonymous access to uid/gid 1000, i.e. the hadoop user)
[root@server1 ~]# vim /etc/exports
/home/hadoop *(rw,anonuid=1000,anongid=1000)
6. Start NFS and verify the export
[root@server1 ~]# systemctl start nfs
[root@server1 ~]# exportfs -v
/home/hadoop <world>(rw,wdelay,root_squash,no_subtree_check,anonuid=1000,anongid=1000,sec=sys,rw,secure,root_squash,no_all_squash)
server2:
1. Install nfs-utils
[root@server2 ~]# yum install -y nfs-utils
2. Create the hadoop user (the uid/gid must match server1 so NFS file ownership lines up)
[root@server2 ~]# useradd hadoop
[root@server2 ~]# id hadoop
uid=1000(hadoop) gid=1000(hadoop) groups=1000(hadoop)
3. Start rpcbind
[root@server2 ~]# systemctl start rpcbind
4. Mount the NFS share (a note on making the mount persistent follows the df output below)
[root@server2 ~]# showmount -e server1
Export list for server1:
/home/hadoop *
[root@server2 ~]# mount 172.25.8.1:/home/hadoop/ /home/hadoop
[root@server2 ~]# df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/rhel-root 17811456 1106308 16705148 7% /
devtmpfs 930892 0 930892 0% /dev
tmpfs 941864 0 941864 0% /dev/shm
tmpfs 941864 8584 933280 1% /run
tmpfs 941864 0 941864 0% /sys/fs/cgroup
/dev/sda1 1038336 141516 896820 14% /boot
tmpfs 188376 0 188376 0% /run/user/0
172.25.8.1:/home/hadoop 17811456 2804224 15007232 16% /home/hadoop
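This mount does not survive a reboot. If persistence is wanted (not part of the original steps), an /etc/fstab entry along these lines could be added on server2 and server3:
172.25.8.1:/home/hadoop  /home/hadoop  nfs  defaults  0 0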
5. Verify that the data is shared
[root@server2 ~]# su - hadoop
[hadoop@server2 ~]$ ls
hadoop hadoop-3.0.3.tar.gz jdk1.8.0_181
hadoop-3.0.3 java jdk-8u181-linux-x64.tar.gz
server3:
1. Install nfs-utils
[root@server3 ~]# yum install -y nfs-utils
2. Create the hadoop user
[root@server3 ~]# useradd hadoop
3. Start rpcbind
[root@server3 ~]# systemctl start rpcbind
4. Mount the NFS share
[root@server3 ~]# mount 172.25.8.1:/home/hadoop/ /home/hadoop/
5. Verify that the data is shared
[root@server3 ~]# su - hadoop
[hadoop@server3 ~]$ ll
total 488256
lrwxrwxrwx 1 hadoop hadoop 12 May 18 23:12 hadoop -> hadoop-3.0.3
drwxr-xr-x 11 hadoop hadoop 175 May 19 02:40 hadoop-3.0.3
-rw-r--r-- 1 root root 314322972 May 18 23:10 hadoop-3.0.3.tar.gz
lrwxrwxrwx 1 hadoop hadoop 13 May 18 23:12 java -> jdk1.8.0_181/
drwxr-xr-x 7 hadoop hadoop 245 Jul 7 2018 jdk1.8.0_181
-rw-r--r-- 1 root root 185646832 May 18 23:10 jdk-8u181-linux-x64.tar.gz
[2] Configure the distributed cluster
1. Passwordless SSH to the worker nodes:
[root@server1 ~]# ssh-copy-id 172.25.8.2
/usr/bin/ssh-copy-id: ERROR: No identities found
[root@server1 ~]# su - hadoop
Last login: Sun May 19 02:17:46 EDT 2019 on pts/0
[hadoop@server1 ~]$ ssh-copy-id 172.25.8.2
The authenticity of host '172.25.8.2 (172.25.8.2)' can't be established.
ECDSA key fingerprint is 5f:6e:2c:77:cf:fa:0f:af:e6:c0:5f:ac:23:50:8e:e7.
Are you sure you want to continue connecting (yes/no)? yes
/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/bin/ssh-copy-id: WARNING: All keys were skipped because they already exist on the remote system.
[hadoop@server1 ~]$ ssh-copy-id 172.25.8.3
The authenticity of host '172.25.8.3 (172.25.8.3)' can't be established.
ECDSA key fingerprint is 5f:6e:2c:77:cf:fa:0f:af:e6:c0:5f:ac:23:50:8e:e7.
Are you sure you want to continue connecting (yes/no)? yes
/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/bin/ssh-copy-id: WARNING: All keys were skipped because they already exist on the remote system.
[hadoop@server1 ~]$ ssh 172.25.8.2
Last login: Sun May 19 03:39:48 2019
[hadoop@server2 ~]$ logout
Connection to 172.25.8.2 closed.
[hadoop@server1 ~]$ ssh 172.25.8.3
Last login: Sun May 19 03:45:45 2019
[hadoop@server3 ~]$ logout
Connection to 172.25.8.3 closed.
2. Configuration files (workers lists the DataNode hosts; dfs.replication is raised to 2 since there are now two DataNodes)
[hadoop@server1 hadoop]$ vim workers
172.25.8.2
172.25.8.3
[hadoop@server1 hadoop]$ vim hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
</configuration>
3. Reformat the NameNode
[hadoop@server1 hadoop]$ cd ..
[hadoop@server1 etc]$ cd ..
[hadoop@server1 hadoop]$ bin/hdfs namenode -format
4. Start HDFS
[hadoop@server1 hadoop]$ sbin/start-dfs.sh
Starting namenodes on [server1]
Starting datanodes
Starting secondary namenodes [server1]
5. Check the running daemons (a cluster-wide check is sketched after the jps output below)
[hadoop@server1 hadoop]$ jps
15456 Jps
15125 NameNode
15311 SecondaryNameNode
[hadoop@server2 ~]$ jps
12342 Jps
12279 DataNode
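jps only shows the daemons on the host where it is run. To confirm from the NameNode that both DataNodes (172.25.8.2 and 172.25.8.3) have registered, a cluster-wide report can also be requested (not shown in the original transcript):
[hadoop@server1 hadoop]$ bin/hdfs dfsadmin -report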
6. Upload data to HDFS (a quick listing check follows the put command below)
[hadoop@server1 hadoop]$ bin/hdfs dfs -mkdir /user
[hadoop@server1 hadoop]$ bin/hdfs dfs -mkdir /user/hadoop
[hadoop@server1 hadoop]$ ls
bin include libexec logs output sbin
etc lib LICENSE.txt NOTICE.txt README.txt share
[hadoop@server1 hadoop]$ bin/hdfs dfs -put etc/hadoop/ input
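To confirm the upload before it is removed in the next step, the HDFS copy can be listed (same paths as above):
[hadoop@server1 hadoop]$ bin/hdfs dfs -ls input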
7. Delete the data from HDFS
[hadoop@server1 hadoop]$ bin/hdfs dfs -rm -r input
Deleted input