I am trying to run a mapper for co-occurrence on AWS EMR (Hadoop Streaming), and every map task fails.
Error log:
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
	at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325)
	at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:538)
	at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
	at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:455)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:344)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:169)
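As far as I understand, "subprocess failed with code 1" only says that the mapper2.py process itself exited with an error on the task node; the Java stack trace does not show the real Python traceback, which lands in the task attempt's stderr. A minimal sketch for reproducing the failure locally (not part of the job; sample.txt is a hypothetical file holding a few lines of the actual input data):

# local_test.py, a hypothetical helper, not part of the submitted job.
# Pipes a sample input file through mapper2.py the same way Hadoop Streaming
# does, so the mapper's exit code and Python traceback become visible.
import subprocess

with open("sample.txt", "rb") as sample:  # sample.txt: a few lines of the real input
    result = subprocess.run(["python", "mapper2.py"],
                            stdin=sample,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)

print("mapper exit code:", result.returncode)  # non-zero matches "subprocess failed with code 1"
print(result.stderr.decode())                  # the actual Python traceback, if any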
Syslog:
2019-04-18 00:34:29,518 INFO org.apache.hadoop.yarn.client.RMProxy (main): Connecting to ResourceManager at ip-172-31-1-199.us-east-2.compute.internal/172.31.1.199:8032
2019-04-18 00:34:29,741 INFO org.apache.hadoop.yarn.client.RMProxy (main): Connecting to ResourceManager at ip-172-31-1-199.us-east-2.compute.internal/172.31.1.199:8032
2019-04-18 00:34:30,046 INFO com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem (main): Opening 's3://abhavtwitterdataset/mr/mapper2.py' for reading
2019-04-18 00:34:30,240 INFO com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem (main): Opening 's3://abhavtwitterdataset/mr/reducer.py' for reading
2019-04-18 00:34:30,815 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader (main): Loaded native gpl library
2019-04-18 00:34:30,822 INFO com.hadoop.compression.lzo.LzoCodec (main): Successfully loaded & initialized native-lzo library [hadoop-lzo rev 59c952a855a0301a4f9e1b2736510df04a640bd3]
2019-04-18 00:34:30,919 INFO org.apache.hadoop.mapred.FileInputFormat (main): Total input files to process : 4
2019-04-18 00:34:31,403 INFO org.apache.hadoop.mapreduce.JobSubmitter (main): number of splits:9
2019-04-18 00:34:31,559 INFO org.apache.hadoop.mapreduce.JobSubmitter (main): Submitting tokens for job: job_1555547086751_0002
2019-04-18 00:34:31,802 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl (main): Submitted application application_1555547086751_0002
2019-04-18 00:34:31,876 INFO org.apache.hadoop.mapreduce.Job (main): The url to track the job: http://ip-172-31-1-199.us-east-2.compute.internal:20888/proxy/application_1555547086751_0002/
2019-04-18 00:34:31,878 INFO org.apache.hadoop.mapreduce.Job (main): Running job: job_1555547086751_0002
2019-04-18 00:34:41,076 INFO org.apache.hadoop.mapreduce.Job (main): Job job_1555547086751_0002 running in uber mode : false
2019-04-18 00:34:41,078 INFO org.apache.hadoop.mapreduce.Job (main):  map 0% reduce 0%
2019-04-18 00:34:59,350 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000001_0, Status : FAILED
2019-04-18 00:35:01,388 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000003_0, Status : FAILED
2019-04-18 00:35:09,627 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000000_0, Status : FAILED
2019-04-18 00:35:11,646 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000002_0, Status : FAILED
2019-04-18 00:35:12,657 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000004_0, Status : FAILED
2019-04-18 00:35:13,667 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000005_0, Status : FAILED
2019-04-18 00:35:15,682 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000001_1, Status : FAILED
2019-04-18 00:35:15,683 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000006_0, Status : FAILED
2019-04-18 00:35:30,760 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000005_1, Status : FAILED
2019-04-18 00:35:30,761 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000001_2, Status : FAILED
2019-04-18 00:35:34,782 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000003_1, Status : FAILED
2019-04-18 00:35:39,816 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000000_1, Status : FAILED
2019-04-18 00:35:40,824 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000004_1, Status : FAILED
2019-04-18 00:35:41,829 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000002_1, Status : FAILED
2019-04-18 00:35:45,851 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000006_1, Status : FAILED
2019-04-18 00:35:45,853 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000005_2, Status : FAILED
2019-04-18 00:36:00,941 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000002_2, Status : FAILED
2019-04-18 00:36:00,944 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1555547086751_0002_m_000006_2, Status : FAILED
2019-04-18 00:36:03,957 INFO org.apache.hadoop.mapreduce.Job (main):  map 100% reduce 100%
2019-04-18 00:36:04,966 INFO org.apache.hadoop.mapreduce.Job (main): Job job_1555547086751_0002 failed with state FAILED due to: Task failed task_1555547086751_0002_m_000001
Job failed as tasks failed. failedMaps:1 failedReduces:0
2019-04-18 00:36:05,073 INFO org.apache.hadoop.mapreduce.Job (main): Counters: 17
	Job Counters
		Failed map tasks=19
		Killed map tasks=8
		Killed reduce tasks=3
		Launched map tasks=24
		Other local map tasks=17
		Data-local map tasks=7
		Total time spent by all maps in occupied slots (ms)=20982096
		Total time spent by all reduces in occupied slots (ms)=0
		Total time spent by all map tasks (ms)=437127
		Total time spent by all reduce tasks (ms)=0
		Total vcore-milliseconds taken by all map tasks=437127
		Total vcore-milliseconds taken by all reduce tasks=0
		Total megabyte-milliseconds taken by all map tasks=671427072
		Total megabyte-milliseconds taken by all reduce tasks=0
	Map-Reduce Framework
		CPU time spent (ms)=0
		Physical memory (bytes) snapshot=0
		Virtual memory (bytes) snapshot=0
2019-04-18 00:36:05,074 ERROR org.apache.hadoop.streaming.StreamJob (main): Job not successful!
Co-occurrence mapper code:
#!/usr/bin/env python
"""mapper2.py"""
import re
import sys

import nltk

# Point NLTK at /tmp, which is writable on EMR task nodes, and fetch the
# sentence tokenizer there.
nltk.data.path.append("/tmp")
nltk.download('punkt', download_dir="/tmp")

# Words to skip when forming co-occurrence pairs.
STOPWORDS = {'about', 'all', 'along', 'also', 'an', 'any', 'and', 'are', 'around', 'after', 'according', 'another',
             'already', 'because', 'been', 'being', 'but', 'become', 'can', 'could', 'called',
             'during', 'do', 'dont', 'does', 'doesn', 'did', 'didnt', 'etc', 'for', 'from', 'far',
             'get', 'going', 'had', 'has', 'have', 'he', 'her', 'here', 'him', 'his', 'how',
             'into', 'isnt', 'its', 'just', 'let', 'like', 'may', 'more', 'must', 'most',
             'not', 'now', 'new', 'next', 'one', 'other', 'our', 'out', 'over', 'own', 'put', 'right',
             'say', 'said', 'should', 'she', 'since', 'some', 'still', 'such',
             'take', 'that', 'than', 'the', 'their', 'them', 'then', 'there', 'these',
             'they', 'this', 'those', 'through', 'time', 'told', 'thing',
             'use', 'until', 'via', 'very', 'under',
             'was', 'way', 'were', 'what', 'which', 'when', 'where', 'who', 'why', 'will', 'with', 'would', 'wouldnt',
             'yes', 'you', 'your'}

for line in sys.stdin:
    # Strip URLs.
    line = re.sub(r'http\S+', '', line)
    # Replace "'t" with "t" (e.g. don't -> dont) so contractions match the stopword list.
    line = re.sub(r"'t", 't', line)
    for sentence in nltk.sent_tokenize(line):
        # Replace punctuation with spaces, then split into words.
        sentence = re.sub(r'[^\w\s]', ' ', sentence)
        words = sentence.split()
        for k in range(len(words) - 1):
            # Stemming is disabled; the raw lower-cased token is used.
            l = words[k].lower()
            if l not in STOPWORDS and len(l) > 2 and not l.isdigit():
                # Pair the word with every later word in the same sentence,
                # emitting "left-right<TAB>1" for the reducer to sum.
                for j in words[k + 1:]:
                    r = j.lower()
                    if r in STOPWORDS or l == r or len(r) <= 2 or r.isdigit():
                        continue
                    print("%s\t%s" % (l + "-" + r, 1))