python - Amazon MapReduce with my own reducer for streaming
I wrote a simple map and reduce program in Python that counts the number of words in each sentence and then groups sentences with the same word count together. For example, suppose sentence 1 has 10 words, sentence 2 has 17 words, and sentence 3 has 10 words. The final result would be:
    10 \t 2
    17 \t 1
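To make the target output concrete, here is a minimal single-process sketch of the same computation, with no Hadoop involved. The function name `sentence_length_histogram` and the sample data are made up for illustration; it assumes "words" means whitespace-separated tokens, as in the mapper.

```python
from collections import Counter


def sentence_length_histogram(lines):
    """Map each word count to the number of lines that have that many words."""
    return Counter(len(line.split()) for line in lines)


# Hypothetical sample: two 10-word sentences and one 17-word sentence.
sample = [
    " ".join(["w"] * 10),
    " ".join(["w"] * 17),
    " ".join(["w"] * 10),
]

for length, count in sorted(sentence_length_histogram(sample).items()):
    print("%s\t%s" % (length, count))
```

This is only useful as a reference result for small inputs; the mapper and reducer split the same logic into two stages so it can run on a large file.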
The mapper function is:

    import sys
    import re

    # Not used by the loop below; kept from the original script.
    pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*")

    for line in sys.stdin:
        # The key is the number of whitespace-separated words on the line.
        word = str(len(line.split()))
        count = str(1)
        print("%s\t%s" % (word, count))
The reducer function is:

    import sys

    current_word = None
    current_count = 0
    word = None

    # Input arrives sorted by key, so equal keys are adjacent.
    for line in sys.stdin:
        line = line.strip()
        word, count = line.split('\t')
        try:
            count = int(count)
            word = int(word)
        except ValueError:
            # Skip lines whose fields are not integers.
            continue
        if current_word == word:
            current_count += count
        else:
            # Compare against None so a key of 0 (an empty line) is still emitted.
            if current_word is not None:
                print("%s\t%s" % (current_word, current_count))
            current_count = count
            current_word = word

    # Emit the last group.
    if current_word is not None:
        print("%s\t%s" % (current_word, current_count))
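The reducer depends on its input being sorted so that equal keys sit next to each other. As a rough sketch of the same idea, and not the code used in the question, the grouping can also be written with `itertools.groupby`. The helper names `parse` and `reduce_sorted` are hypothetical.

```python
from itertools import groupby


def parse(lines):
    """Yield (word_count, count) integer pairs, skipping malformed lines."""
    for line in lines:
        parts = line.strip().split("\t")
        if len(parts) != 2:
            continue
        try:
            yield int(parts[0]), int(parts[1])
        except ValueError:
            continue


def reduce_sorted(lines):
    """Sum counts for each run of adjacent equal keys (input must be sorted)."""
    for key, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield key, sum(count for _, count in group)


# Hypothetical mapper output, already sorted by key.
sorted_mapper_output = ["10\t1\n", "10\t1\n", "17\t1\n"]
for key, total in reduce_sorted(sorted_mapper_output):
    print("%s\t%s" % (key, total))
```

To use this as a streaming reducer, one would pass `sys.stdin` instead of the sample list.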
I first tested it on my local machine with the first 200 lines of the file:

    head -n 200 sentences.txt | python mapper.py | sort | python reducer.py

The results were correct. I then ran it on the Amazon MapReduce streaming service, and it failed at the reducer step. So I changed the print in the mapper function to:
    print("LongValueSum:" + word + "\t" + "1")
This matches the built-in aggregate reducer in the MapReduce streaming service, so in this case I don't need reducer.py at all, and I get the final results for the big file sentences.txt. But I don't know why my own reducer.py failed. Thank you!
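For readers unfamiliar with the aggregate reducer, the following is a rough local stand-in for what it does with `LongValueSum:` records. It is not Hadoop's implementation. It assumes the mapper emits lines shaped like `LongValueSum:key<TAB>value`, and the function name `simulate_long_value_sum` is made up for illustration.

```python
from collections import defaultdict

PREFIX = "LongValueSum:"


def simulate_long_value_sum(lines):
    """Sum the integer values per key for 'LongValueSum:key<TAB>value' lines."""
    totals = defaultdict(int)
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        if not key.startswith(PREFIX):
            continue  # this sketch ignores any other aggregator type
        totals[key[len(PREFIX):]] += int(value)
    return dict(totals)


# Hypothetical mapper output for sentences of 10, 17 and 10 words.
mapper_output = ["LongValueSum:10\t1", "LongValueSum:17\t1", "LongValueSum:10\t1"]
for key, total in sorted(simulate_long_value_sum(mapper_output).items()):
    print("%s\t%s" % (key, total))
```

Unlike the hand-written reducer, this version does not need sorted input because it keeps a running total per key in memory.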
Got it! A "stupid" mistake. When I tested it locally, I ran the scripts with python mapper.py. For MapReduce, the scripts need to be executable on their own, so I added

    #!/usr/bin/env python

at the beginning of each script.
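A quick local check along these lines might have caught the problem before submitting the job. This is a sketch under assumptions: it targets a POSIX system, the helper name `looks_executable` is hypothetical, and the execute-bit check goes beyond what the answer mentions (the answer only names the shebang line).

```python
import os
import stat
import tempfile


def looks_executable(path):
    """Return True if the file starts with '#!' and has an execute bit set."""
    with open(path, "rb") as handle:
        has_shebang = handle.read(2) == b"#!"
    exec_bits = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
    has_exec_bit = bool(os.stat(path).st_mode & exec_bits)
    return has_shebang and has_exec_bit


# Demonstration on a throwaway file standing in for reducer.py.
with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "reducer.py")
    with open(script, "w") as handle:
        handle.write("#!/usr/bin/env python\nprint('ok')\n")
    print(looks_executable(script))  # shebang present, execute bit not yet set
    os.chmod(script, os.stat(script).st_mode | stat.S_IXUSR)
    print(looks_executable(script))  # shebang present and execute bit set
```

Note that the shebang must be the first two bytes of the file: a line such as `# !/usr/bin/env python`, with a space after the `#`, is just a comment.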
python amazon-web-services mapreduce