Hadoop: error in shuffle in fetcher#1

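I am trying to run the Hadoop pi example. It runs without any problems on a single node, but on multiple nodes it fails with the error below. Any suggestions would be appreciated.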

mapred-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<!-- In: conf/mapred-site.xml -->
<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>
<property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx2048m</value>
</property>
<property>
    <name>mapred.shuffle.input.buffer.percent</name>
    <value>0.2</value>
  </property>
</configuration>

Console output:

Number of Maps  = 3
Samples per Map = 10
14/10/11 20:34:20 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
14/10/11 20:34:54 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Starting Job
14/10/11 20:34:54 WARN mapreduce.JobSubmitter: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/10/11 20:34:55 INFO input.FileInputFormat: Total input paths to process : 3
14/10/11 20:34:55 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
14/10/11 20:34:55 INFO mapreduce.JobSubmitter: number of splits:3
14/10/11 20:34:55 INFO mapreduce.JobSubmitter: adding the following namenodes' delegation tokens:null
14/10/11 20:34:55 INFO mapreduce.Job: Running job: job_201410112034_0001
14/10/11 20:34:56 INFO mapreduce.Job:  map 0% reduce 0%
14/10/11 20:35:05 INFO mapreduce.Job:  map 33% reduce 0%
14/10/11 20:35:08 INFO mapreduce.Job:  map 100% reduce 0%
14/10/11 20:35:14 INFO mapreduce.Job:  map 100% reduce 11%
14/10/11 20:35:31 INFO mapreduce.Job: Task Id : attempt_201410112034_0001_r_000000_0, Status : FAILED
org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#1
    at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:124)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:362)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
    at org.apache.hadoop.mapred.Child.main(Child.java:211)
Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
    at org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler.checkReducerHealth(ShuffleScheduler.java:253)
    at org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler.copyFailed(ShuffleScheduler.java:187)
    at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:234)
    at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:149)

14/10/11 20:35:32 INFO mapreduce.Job:  map 100% reduce 0%
14/10/11 20:35:41 INFO mapreduce.Job:  map 100% reduce 11%
14/10/11 20:35:49 INFO mapreduce.Job: Task Id : attempt_201410112034_0001_m_000000_0, Status : FAILED
Too many fetch-failures
14/10/11 20:35:49 WARN mapreduce.Job: Error reading task outputhttp://userA:50060/tasklog?plaintext=true&attemptid=attempt_201410112034_0001_m_000000_0&filter=stdout
14/10/11 20:35:49 WARN mapreduce.Job: Error reading task outputhttp://userA:50060/tasklog?plaintext=true&attemptid=attempt_201410112034_0001_m_000000_0&filter=stderr
14/10/11 20:36:13 INFO mapreduce.Job: Task Id : attempt_201410112034_0001_r_000000_1, Status : FAILED
org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#2
    at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:124)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:362)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
    at org.apache.hadoop.mapred.Child.main(Child.java:211)
Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
    at org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler.checkReducerHealth(ShuffleScheduler.java:253)
    at org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler.copyFailed(ShuffleScheduler.java:187)
    at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:234)
    at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:149)

14/10/11 20:36:14 INFO mapreduce.Job:  map 100% reduce 0%
14/10/11 20:36:22 INFO mapreduce.Job: Task Id : attempt_201410112034_0001_m_000001_0, Status : FAILED
Too many fetch-failures
14/10/11 20:36:22 WARN mapreduce.Job: Error reading task outputhttp://userA:50060/tasklog?plaintext=true&attemptid=attempt_201410112034_0001_m_000001_0&filter=stdout
14/10/11 20:36:22 WARN mapreduce.Job: Error reading task outputhttp://userA:50060/tasklog?plaintext=true&attemptid=attempt_201410112034_0001_m_000001_0&filter=stderr
14/10/11 20:36:23 INFO mapreduce.Job:  map 100% reduce 11%
14/10/11 20:36:32 INFO mapreduce.Job:  map 100% reduce 100%
14/10/11 20:36:34 INFO mapreduce.Job: Job complete: job_201410112034_0001
14/10/11 20:36:34 INFO mapreduce.Job: Counters: 33
    FileInputFormatCounters
        BYTES_READ=354
    FileSystemCounters
        FILE_BYTES_READ=72
        FILE_BYTES_WRITTEN=252
        HDFS_BYTES_READ=765
        HDFS_BYTES_WRITTEN=215
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=1
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    Job Counters 
        Data-local map tasks=5
        Total time spent by all maps waiting after reserving slots (ms)=0
        Total time spent by all reduces waiting after reserving slots (ms)=0
        SLOTS_MILLIS_MAPS=11950
        SLOTS_MILLIS_REDUCES=80809
        Launched map tasks=5
        Launched reduce tasks=3
    Map-Reduce Framework
        Combine input records=0
        Combine output records=0
        Failed Shuffles=1
        GC time elapsed (ms)=6
        Map input records=3
        Map output bytes=54
        Map output records=6
        Merged Map outputs=3
        Reduce input groups=2
        Reduce input records=6
        Reduce output records=0
        Reduce shuffle bytes=84
        Shuffled Maps =3
        Spilled Records=12
        SPLIT_RAW_BYTES=411
Job Finished in 100.067 seconds
Estimated value of Pi is 3.60000000000000000000

Answer 1

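One possible cause of this error is that the machines in the Hadoop cluster cannot communicate with each other properly. Every machine should be able to ping every other one: master to slaves, slaves to master, and slaves to each other. Depending on your setup, you may need to edit the /etc/hosts file on each machine so that they can reach each other by hostname.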
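
A quick way to test this from each node is a small script that resolves every cluster hostname and opens a TCP connection to the port the reducers fetch map output from. This is a minimal sketch, not part of the original answer. The host list (master, slave1, slave2) is hypothetical and borrowed from the hosts example below. Port 50060 is an assumption based on the TaskTracker HTTP port that appears in the log URLs (http://userA:50060/...); in MRv1 the reducers' fetchers pull map output from that HTTP server. It checks TCP instead of ICMP ping because ping needs raw sockets. Since the log reports the TaskTracker as userA, that name probably has to resolve on every node as well.

```python
import socket

# Hypothetical host list: replace with the hostnames of your own cluster nodes.
CLUSTER_HOSTS = ["master", "slave1", "slave2"]
# Assumed shuffle port: the TaskTracker HTTP port seen in the log URLs (userA:50060).
SHUFFLE_PORT = 50060


def check_host(host, port=SHUFFLE_PORT, timeout=3.0):
    """Resolve `host` and try a TCP connection to `port`.

    Returns (ip, reachable); ip is None when the name does not resolve.
    """
    try:
        ip = socket.gethostbyname(host)
    except socket.gaierror:
        return None, False
    try:
        # create_connection raises an OSError subclass (refused, unreachable, timeout).
        with socket.create_connection((ip, port), timeout=timeout):
            return ip, True
    except OSError:
        return ip, False


if __name__ == "__main__":
    for name in CLUSTER_HOSTS:
        ip, ok = check_host(name)
        if ip is None:
            print(f"{name}: hostname does not resolve (check /etc/hosts)")
        else:
            state = "reachable" if ok else "NOT reachable"
            print(f"{name} -> {ip}: port {SHUFFLE_PORT} {state}")
```

Run it on every node, master and slaves alike: during the shuffle each reducer may need to fetch from any node that ran a map task.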

For example, /etc/hosts can be configured as follows:

127.0.0.1       localhost
<ipslave1>  slave1
<ipmaster> master
<ipslave2> slave2
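
In this example 127.0.0.1 is used only for localhost, and each cluster hostname maps to the machine's real network address. The following is a minimal sketch, not from the original answer, that checks a hosts file for that pattern: every cluster hostname must have an entry, and none may point at a loopback address (for instance, a distribution's default 127.0.1.1 line that also lists the node's own hostname). The hostname list is hypothetical; the placeholders such as <ipmaster> must be replaced with real addresses, otherwise they are reported as invalid.

```python
import ipaddress

# Hypothetical cluster hostnames, matching the example hosts file above.
CLUSTER_HOSTS = ["master", "slave1", "slave2"]


def parse_hosts(text):
    """Map each hostname or alias to the address on its /etc/hosts line."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding spaces
        if not line:
            continue
        addr, *names = line.split()
        for name in names:
            mapping.setdefault(name, addr)  # keep the first entry for each name
    return mapping


def check_cluster_entries(text, hosts=CLUSTER_HOSTS):
    """Return a list of problems for the cluster hostnames in a hosts file."""
    mapping = parse_hosts(text)
    problems = []
    for name in hosts:
        addr = mapping.get(name)
        if addr is None:
            problems.append(f"{name}: no entry")
            continue
        try:
            ip = ipaddress.ip_address(addr)
        except ValueError:
            problems.append(f"{name}: invalid address {addr}")
            continue
        if ip.is_loopback:
            # 127.x.x.x always means "this machine", so other nodes cannot use it.
            problems.append(f"{name}: maps to loopback address {addr}")
    return problems


if __name__ == "__main__":
    with open("/etc/hosts") as f:
        for problem in check_cluster_entries(f.read()):
            print(problem)
```

An empty result on every node means each one maps the cluster hostnames to non-loopback addresses; it does not prove the addresses are the right ones, so it complements rather than replaces an actual connectivity test.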
