This article assumes the Hadoop platform that has already been set up in the lab.
1. Write mapper.py
#!/usr/bin/python
# mapper.py: read lines from stdin and emit "word<TAB>1" for every word
import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        # tab is Hadoop Streaming's default key/value separator
        print '%s\t%s' % (word, 1)
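Before touching the cluster, the mapper can be sanity-checked locally with a shell pipe; the sample sentence below is arbitrary test data, not part of the lab setup:
echo "foo foo quux labs foo bar quux" | /home/hduser/mapper.py
Each word should come out on its own line, followed by a tab and the count 1.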
2. Write reducer.py
#!/usr/bin/python
# reducer.py: sum the counts for each word; input arrives sorted by key
import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    line = line.strip()
    # parse the mapper's "word<TAB>count" output
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # skip lines whose count is not a number
        continue
    # the shuffle phase sorts by key, so equal words arrive consecutively
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print '%s\t%s' % (current_word, current_count)
        current_word = word
        current_count = count

# emit the final word group
if current_word == word:
    print '%s\t%s' % (current_word, current_count)
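The whole job can also be simulated locally, because a plain sort stands in for Hadoop's shuffle phase (same arbitrary sample text as above):
echo "foo foo quux labs foo bar quux" | /home/hduser/mapper.py | sort -k1,1 | /home/hduser/reducer.py
This should print bar 1, foo 3, labs 1, quux 2, each pair tab-separated on its own line.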
3. Upload mapper.py and reducer.py to the /home/hduser directory on HadoopMaster
Make sure both files are executable:
chmod +x /home/hduser/mapper.py
chmod +x /home/hduser/reducer.py
Note: if the following error appears at run time:
/usr/bin/python^M: bad interpreter: No such file or directory
the cause is that files written on Windows use DOS line endings; after uploading them to HadoopMaster (a Linux system), convert them to Unix format:
vi filename      # open the file
:set ff          # show the current file format
:set ff=unix     # set the file format to unix
:wq              # save and quit
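Alternatively, if the dos2unix utility happens to be installed (not guaranteed on every distribution), the same conversion is a one-liner:
dos2unix /home/hduser/mapper.py /home/hduser/reducer.py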
4. Upload the test file to HDFS through Hue
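If the command line is preferred over the Hue web UI, hdfs dfs can upload the file as well; the HDFS path /user/hduser/input and the file name test.txt below are assumptions and should be adjusted to the cluster's layout:
hdfs dfs -mkdir -p /user/hduser/input
hdfs dfs -put test.txt /user/hduser/input/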
5. Switch to the hdfs user and run the hadoop jar command
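A sketch of that command, assuming the streaming jar sits at /usr/lib/hadoop-mapreduce/hadoop-streaming.jar (the exact path differs across Hadoop versions and distributions) and the HDFS paths from step 4:
su - hdfs
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -files /home/hduser/mapper.py,/home/hduser/reducer.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -input /user/hduser/input \
    -output /user/hduser/output
The -files option ships the two scripts to every task node; note that the -output directory must not exist before the job runs.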
6. View the execution result
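The job writes its results to the -output directory as part-* files. Assuming the paths above, they can be listed and inspected with:
hdfs dfs -ls /user/hduser/output
hdfs dfs -cat /user/hduser/output/part-00000
Each line of the output holds a word and its total count, separated by a tab.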