
Nutch on EMR problem reading from S3

Hi, I am trying to run Apache Nutch 1.2 on Amazon's EMR.

To do this I specify an input directory on S3. I get the following error:

Fetcher: java.lang.IllegalArgumentException:
    This file system object (hdfs://ip-11-202-55-144.ec2.internal:9000)
    does not support access to the request path 
    's3n://crawlResults2/segments/20110823155002/crawl_fetch'
    You possibly called FileSystem.get(conf) when you should have called
    FileSystem.get(uri, conf) to obtain a file system supporting your path.

I understand the difference between FileSystem.get(uri, conf) and FileSystem.get(conf). If I were writing this myself I would call FileSystem.get(uri, conf), but I am trying to use existing Nutch code.
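For context, here is a minimal, hypothetical sketch of that difference, using the path from the error above (the class name and the prints are mine, not Nutch code, and actually touching the s3n file system requires AWS credentials to be configured):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsGetExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path segment = new Path("s3n://crawlResults2/segments/20110823155002/crawl_fetch");

        // Bound to the *default* file system (fs.default.name), i.e. HDFS on EMR.
        // Calling e.g. defaultFs.listStatus(segment) is what raises the
        // IllegalArgumentException shown above, because the path's scheme is s3n.
        FileSystem defaultFs = FileSystem.get(conf);

        // Bound to the file system that owns the URI's scheme (s3n here).
        FileSystem s3Fs = FileSystem.get(segment.toUri(), conf);

        // Equivalent, and the usual idiom in Hadoop code:
        FileSystem alsoS3Fs = segment.getFileSystem(conf);

        System.out.println(defaultFs.getUri() + " vs " + s3Fs.getUri());
    }
}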

I asked this question, and someone told me that I needed to modify hadoop-site.xml to include the following properties: fs.default.name, fs.s3.awsAccessKeyId, fs.s3.awsSecretAccessKey. I updated these properties in core-site.xml (hadoop-site.xml does not exist), but that didn't make a difference. Does anyone have any other ideas? Thanks for the help.


Try specifying the following in hadoop-site.xml:

<property>
  <name>fs.default.name</name>
  <!-- a file system URI, e.g. the bucket from the failing path -->
  <value>s3n://crawlResults2</value>
</property>

This tells Hadoop (and therefore Nutch) to use S3 as the default file system.

The properties fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey are only needed when your S3 objects require authentication (an S3 object can be readable by everyone, or only by authenticated users).
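For completeness, a sketch of how those credential properties look in the config file (the values are placeholders, not real keys). Since the failing path uses the s3n:// scheme, Hadoop's native S3 file system reads the s3n-prefixed property names, so it may be worth setting those variants as well:

<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_AWS_ACCESS_KEY_ID</value> <!-- placeholder -->
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_AWS_SECRET_ACCESS_KEY</value> <!-- placeholder -->
</property>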
