This article assumes the Hadoop platform that has already been set up in the lab.
1. Write mapper.py
#!/usr/bin/python
# mapper.py: read lines from stdin and emit "word<TAB>1" for every word
import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        # tab is Hadoop Streaming's default key/value separator
        print '%s\t%s' % (word, 1)
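Before touching the cluster, the mapper can be sanity-checked locally with a shell pipe; the sample sentence below is arbitrary test data, not part of the lab setup:
echo "foo foo quux labs foo bar quux" | /home/hduser/mapper.py
Each word should come out on its own line, followed by a tab and the count 1.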
2. Write reducer.py
#!/usr/bin/python
# reducer.py: sum the counts for each word; input arrives sorted by key
import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    line = line.strip()
    # parse the mapper's "word<TAB>count" output
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # skip lines whose count is not a number
        continue
    # the shuffle phase sorts by key, so equal words arrive consecutively
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print '%s\t%s' % (current_word, current_count)
        current_word = word
        current_count = count

# emit the final word group
if current_word == word:
    print '%s\t%s' % (current_word, current_count)
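The whole job can also be simulated locally, because a plain sort stands in for Hadoop's shuffle phase (same arbitrary sample text as above):
echo "foo foo quux labs foo bar quux" | /home/hduser/mapper.py | sort -k1,1 | /home/hduser/reducer.py
This should print bar 1, foo 3, labs 1, quux 2, each pair tab-separated on its own line.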
3. Upload mapper.py and reducer.py to the /home/hduser directory on HadoopMaster
Make sure both files are executable:
chmod +x /home/hduser/mapper.py
chmod +x /home/hduser/reducer.py
Note: if the following error appears at run time:
/usr/bin/python^M: bad interpreter: No such file or directory
the cause is that files written on Windows use DOS line endings; after uploading them to HadoopMaster (a Linux system), convert them to Unix format:
vi filename      # open the file
:set ff          # show the current file format
:set ff=unix     # set the file format to unix
:wq              # save and quit
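Alternatively, if the dos2unix utility happens to be installed (not guaranteed on every distribution), the same conversion is a one-liner:
dos2unix /home/hduser/mapper.py /home/hduser/reducer.py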
4. Upload the test file to HDFS through Hue
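If the command line is preferred over the Hue web UI, hdfs dfs can upload the file as well; the HDFS path /user/hduser/input and the file name test.txt below are assumptions and should be adjusted to the cluster's layout:
hdfs dfs -mkdir -p /user/hduser/input
hdfs dfs -put test.txt /user/hduser/input/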
5. Switch to the hdfs user and run the hadoop jar command
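A sketch of that command, assuming the streaming jar sits at /usr/lib/hadoop-mapreduce/hadoop-streaming.jar (the exact path differs across Hadoop versions and distributions) and the HDFS paths from step 4:
su - hdfs
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -files /home/hduser/mapper.py,/home/hduser/reducer.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -input /user/hduser/input \
    -output /user/hduser/output
The -files option ships the two scripts to every task node; note that the -output directory must not exist before the job runs.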
6. View the execution result
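The job writes its results to the -output directory as part-* files. Assuming the paths above, they can be listed and inspected with:
hdfs dfs -ls /user/hduser/output
hdfs dfs -cat /user/hduser/output/part-00000
Each line of the output holds a word and its total count, separated by a tab.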