Hadoop Streaming Job Failed (Not Successful) in Python

I'm trying to run a MapReduce job on Hadoop Streaming with Python scripts, and I'm getting the same errors as in "Hadoop Streaming Job failed error in python", but those solutions didn't work for me.

My scripts work fine when I run "cat sample.txt | ./p1mapper.py | sort | ./p1reducer.py"

But when I run the following:

./bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar \
    -input "p1input/*" \
    -output p1output \
    -mapper "python p1mapper.py" \
    -reducer "python p1reducer.py" \
    -file /Users/Tish/Desktop/HW1/p1mapper.py \
    -file /Users/Tish/Desktop/HW1/p1reducer.py

(NB: even if I remove "python" from the -mapper and -reducer arguments, or type the full pathname instead, the result is the same)

This is the output I get:

packageJobJar: [/Users/Tish/Desktop/HW1/p1mapper.py, /Users/Tish/Desktop/CS246/HW1/p1reducer.py, /Users/Tish/Documents/workspace/hadoop-0.20.2/tmp/hadoop-unjar4363616744311424878/] [] /var/folders/Mk/MkDxFxURFZmLg+gkCGdO9U+++TM/-Tmp-/streamjob3714058030803466665.jar tmpDir=null
11/01/18 03:02:52 INFO mapred.FileInputFormat: Total input paths to process : 1
11/01/18 03:02:52 INFO streaming.StreamJob: getLocalDirs(): [tmp/mapred/local]
11/01/18 03:02:52 INFO streaming.StreamJob: Running job: job_201101180237_0005
11/01/18 03:02:52 INFO streaming.StreamJob: To kill this job, run:
11/01/18 03:02:52 INFO streaming.StreamJob: /Users/Tish/Documents/workspace/hadoop-0.20.2/bin/../bin/hadoop job  -Dmapred.job.tracker=localhost:54311 -kill job_201101180237_0005
11/01/18 03:02:52 INFO streaming.StreamJob: Tracking URL: http://www.glassdoor.com:50030/jobdetails.jsp?jobid=job_201101180237_0005
11/01/18 03:02:53 INFO streaming.StreamJob:  map 0%  reduce 0%
11/01/18 03:03:05 INFO streaming.StreamJob:  map 100%  reduce 0%
11/01/18 03:03:44 INFO streaming.StreamJob:  map 50%  reduce 0%
11/01/18 03:03:47 INFO streaming.StreamJob:  map 100%  reduce 100%
11/01/18 03:03:47 INFO streaming.StreamJob: To kill this job, run:
11/01/18 03:03:47 INFO streaming.StreamJob: /Users/Tish/Documents/workspace/hadoop-0.20.2/bin/../bin/hadoop job  -Dmapred.job.tracker=localhost:54311 -kill job_201101180237_0005
11/01/18 03:03:47 INFO streaming.StreamJob: Tracking URL: http://www.glassdoor.com:50030/jobdetails.jsp?jobid=job_201101180237_0005
11/01/18 03:03:47 ERROR streaming.StreamJob: Job not Successful!
11/01/18 03:03:47 INFO streaming.StreamJob: killJob...
Streaming Job Failed!

For each failed/killed task attempt:

Map output lost, rescheduling: getMapOutput(attempt_201101181225_0001_m_000000_0,0) failed :
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_201101181225_0001/attempt_201101181225_0001_m_000000_0/output/file.out.index in any of the configured local directories
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:389)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)
    at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:2887)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:324)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
    at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
    at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
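
The DiskErrorException above means the TaskTracker could not find the intermediate map output under any of the directories configured in mapred.local.dir; note that the log shows getLocalDirs(): [tmp/mapred/local], a relative path. One way I can check that those directories exist and are writable is a small script; this is a minimal sketch, with the candidate list taken from the log above, to be adjusted to whatever mapred.local.dir is actually set to:

#!/usr/bin/env python
# check_local_dirs.py -- a sketch: verify that the directories Hadoop uses
# for intermediate map output exist and are writable. The list below comes
# from the getLocalDirs() line in the log; replace it with whatever your
# mapred.local.dir is actually set to.
import os

candidate_dirs = ['tmp/mapred/local']   # note: a relative path in the log

for d in candidate_dirs:
    path = os.path.abspath(d)
    if not os.path.isdir(path):
        print '%s: missing' % path
    elif not os.access(path, os.W_OK):
        print '%s: not writable' % path
    else:
        print '%s: ok' % path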

Here are my Python scripts: p1mapper.py

#!/usr/bin/env python

import sys
import re

SEQ_LEN = 4

eos = re.compile(r'(?<=[a-zA-Z])\.')   # a period preceded by a letter (sentence end)
ignore = re.compile(r'[\W\d]')         # digits and non-word characters (spaces, punctuation)

for line in sys.stdin:
    # split the line into sentences, then keep only letters
    for sent in eos.split(line):
        sent = ignore.sub('', sent).lower()
        if len(sent) >= SEQ_LEN:
            # emit every overlapping SEQ_LEN-character sequence with a count of 1
            for i in range(len(sent) - SEQ_LEN + 1):
                print '%s 1' % sent[i:i+SEQ_LEN]

p1reducer.py

#!/usr/bin/env python

from operator import itemgetter
import sys

word2count = {}

# accumulate the count for every sequence read from stdin
for line in sys.stdin:
    word, count = line.split(' ', 1)
    try:
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:    # count was not a number, skip the line
        pass

# sort by count, descending
sorted_word2count = sorted(word2count.items(), key=itemgetter(1), reverse=True)

# write the top 3 sequences
for word, count in sorted_word2count[:3]:
    print '%s\t%s' % (word, count)
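
One aside, not necessarily the cause of the failure above: by default Hadoop Streaming treats everything up to the first tab character of a map output line as the key when shuffling, and these scripts emit space-separated pairs, so each whole line becomes the key. A tab-separated variant of the emit and parse lines, keeping the same logic, would look like this (a sketch):

# in p1mapper.py: emit key<TAB>count so Streaming shuffles on the sequence
print '%s\t%d' % (sent[i:i+SEQ_LEN], 1)

# in p1reducer.py: split on the same tab separator
word, count = line.rstrip('\n').split('\t', 1)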

Would really appreciate any help, thanks!

UPDATE:

hdfs-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

mapred-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
  </property>
</configuration>


You are missing a lot of configuration, and you need to define directories and such. See here:

http://wiki.apache.org/hadoop/QuickStart

Distributed operation is just like the pseudo-distributed operation described above, except:

  1. Specify hostname or IP address of the master server in the values for fs.default.name and mapred.job.tracker in conf/hadoop-site.xml. These are specified as host:port pairs.
  2. Specify directories for dfs.name.dir and dfs.data.dir in conf/hadoop-site.xml. These are used to hold distributed filesystem data on the master node and slave nodes respectively. Note that dfs.data.dir may contain a space- or comma-separated list of directory names, so that data may be stored on multiple devices.
  3. Specify mapred.local.dir in conf/hadoop-site.xml. This determines where temporary MapReduce data is written. It also may be a list of directories. (A quick check of these directory properties is sketched after this list.)
  4. Specify mapred.map.tasks and mapred.reduce.tasks in conf/mapred-default.xml. As a rule of thumb, use 10x the number of slave processors for mapred.map.tasks, and 2x the number of slave processors for mapred.reduce.tasks.
  5. List all slave hostnames or IP addresses in your conf/slaves file, one per line, and make sure the jobtracker is in your /etc/hosts file, pointing to your jobtracker node.
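
As a quick way to see which of these properties your running configuration actually sets, here is a minimal sketch; it assumes a 0.20.x layout where the conf/*.xml files sit next to bin/hadoop, and the file names and property list are assumptions to adjust:

#!/usr/bin/env python
# conf_check.py -- a sketch: report which of the properties listed above
# are set in your site config files. File names and the property list are
# assumptions for a Hadoop 0.20.x single-node setup; adjust as needed.
import os
import xml.dom.minidom

CONF_FILES = ['conf/core-site.xml', 'conf/hdfs-site.xml', 'conf/mapred-site.xml']
WANTED = ['fs.default.name', 'mapred.job.tracker',
          'dfs.name.dir', 'dfs.data.dir', 'mapred.local.dir']

found = {}
for conf_file in CONF_FILES:
    if not os.path.exists(conf_file):
        continue
    doc = xml.dom.minidom.parse(conf_file)
    for prop in doc.getElementsByTagName('property'):
        name = prop.getElementsByTagName('name')[0].firstChild.data.strip()
        value_node = prop.getElementsByTagName('value')[0].firstChild
        found[name] = value_node.data.strip() if value_node else ''

for name in WANTED:
    print '%s = %s' % (name, found.get(name, '<NOT SET>'))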


Well, I was stuck on the same problem for two days. The solution that Joe provided in his other post works well for me.

As a solution to your problem I suggest:

1) Follow, blindly and only blindly, the instructions on how to set up a single-node cluster here (I assume you have already done so)

2) If you hit a java.io.IOException: Incompatible namespaceIDs error anywhere (you will find it if you examine the logs), have a look here; a sketch for comparing the two IDs follows at the end of this answer

3) REMOVE ALL THE DOUBLE QUOTES FROM YOUR COMMAND, in your example run

./bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar \
    -input "p1input/*" \
    -output p1output \
    -mapper p1mapper.py \
    -reducer p1reducer.py \
    -file /Users/Tish/Desktop/HW1/p1mapper.py \
    -file /Users/Tish/Desktop/HW1/p1reducer.py

This is ridiculous, but it was the point at which I was stuck for two whole days.
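
For step 2, a quick way to compare the namespaceIDs recorded on the NameNode and DataNode sides is to read their VERSION files. This is a sketch; the paths below are assumed defaults under the Hadoop tmp directory and need adjusting to your dfs.name.dir and dfs.data.dir:

#!/usr/bin/env python
# Compare the namespaceID recorded by the NameNode and the DataNode.
# If the two differ, you are hitting the Incompatible namespaceIDs
# problem from step 2. Paths are assumed defaults; point them at your
# own dfs.name.dir and dfs.data.dir.
import os

version_files = [
    'tmp/dfs/name/current/VERSION',   # NameNode side (assumed path)
    'tmp/dfs/data/current/VERSION',   # DataNode side (assumed path)
]

for path in version_files:
    if not os.path.exists(path):
        print '%s: not found' % path
        continue
    for line in open(path):
        if line.startswith('namespaceID='):
            print '%s: %s' % (path, line.strip())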

