I wasn't sure if one was better to use than another, i.e. Java, PHP, or Perl.
The best one is the one you are most comfortable working with.
It doesn't really matter, as long as you are using the right tools to do the job.
You need to consider where you are deploying your application (web versus desktop), the time you want to spend learning a new technology/language, and availability of libraries for parsing RSS and/or XML and/or HTML. The three languages that you named are all good candidates, though.
RSS files are just formatted XML that you obtain over the internet. All you need in a language is that it can make an HTTP request and has a way to parse the XML.
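To make the "HTTP request plus XML parser" point concrete, here is a minimal sketch using only Python's standard library. Python is just my pick for illustration; the same two steps exist in Java, PHP, and Perl. The function names and the sample feed are made up for the example, and it assumes a plain RSS 2.0 layout with `<item>` elements holding `<title>` and `<link>`.

```python
import urllib.request
import xml.etree.ElementTree as ET


def fetch_feed(url, timeout=10):
    """Download the raw feed bytes with a plain HTTP GET."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def parse_items(feed_xml):
    """Return (title, link) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        # findtext returns the default when the child element is missing
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        items.append((title, link))
    return items


# Hypothetical sample feed, so the parsing step can be tried without a network.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First post</title>
      <link>http://example.com/1</link>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>"""

if __name__ == "__main__":
    for title, link in parse_items(SAMPLE_FEED):
        print(title, link)
```

For a live feed you would pass the result of `fetch_feed(some_url)` to `parse_items` instead of the sample string. Real feeds vary (Atom, namespaces, odd encodings), so treat this as a starting point rather than a complete reader.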
The framework code can be in anything, but consider using XSL transforms (or XPath queries) to get the XML into a more palatable format, especially if you're looking for small subsets of the data or individual values.
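As a rough illustration of the XPath idea, the sketch below pulls a small subset out of a feed with a single path expression. It uses Python's `xml.etree.ElementTree`, which only supports a limited subset of XPath; a full XPath or XSLT engine would let you do more. The feed content, the `category` values, and the function name are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed where each item carries a <category> child.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>Release 1.2</title><category>news</category></item>
    <item><title>How to parse XML</title><category>tutorial</category></item>
    <item><title>Release 1.3</title><category>news</category></item>
  </channel>
</rss>"""


def titles_in_category(feed_xml, category):
    """Return the titles of items whose <category> text equals `category`."""
    root = ET.fromstring(feed_xml)
    # [category='...'] keeps only items with a matching <category> child.
    # Note: this naive string formatting breaks if `category` contains a quote.
    path = f"./channel/item[category='{category}']/title"
    return [element.text for element in root.findall(path)]


def channel_title(feed_xml):
    """Return a single value: the text of the channel's own <title>."""
    root = ET.fromstring(feed_xml)
    return root.findtext("./channel/title")


if __name__ == "__main__":
    print(channel_title(SAMPLE_FEED))
    print(titles_in_category(SAMPLE_FEED, "news"))
```

The point is that the selection logic lives in the path string, not in hand-written loops, which is what makes this approach handy when you only want a few values.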
It's hardly "scraping" if the source data was meant to be machine-parsed in the first place. :)
If you are stronger with one particular technology and you have a deadline (or other constraints), then go with that technology, as all three are capable of the job.
If this is not the case, then it comes down to the requirements of the project you are undertaking and whether you want to, and are able to, learn a new technology.
PHP is the most naturally web-based of the three, and a library like Simple HTML DOM Parser (which supports XML as well) gets you quick results while still letting you delve deeper into the complexities of web scraping.
Java has a nice project called Web Harvest, which I have used in the past with good results (although you have to learn a non-standard XML syntax, it's similar to XSLT). Once your system is set up, your web scraping can be easily modified.
Perl is the strongest when it comes to regex (I find Java and especially PHP can become a bit messy when working with regex), and regex is a nice skill to have, so depending on what you want to do with your information this is a reasonable option as well.
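For a feel of what the regex route looks like, here is a small sketch that pulls link targets out of an HTML fragment, such as the body of a feed item's description. It is written in Python rather than Perl purely for illustration, and the pattern and function name are my own assumptions. It only handles double-quoted `href` values with an http or https URL.

```python
import re

# Matches href="http://..." or href="https://..." and captures the URL.
LINK_PATTERN = re.compile(r'href="(https?://[^"]+)"')


def extract_links(html_fragment):
    """Return every double-quoted http(s) href value found in the fragment."""
    return LINK_PATTERN.findall(html_fragment)


if __name__ == "__main__":
    fragment = (
        '<p>See <a href="http://example.com/a">a</a> and '
        '<a href="https://example.com/b">b</a></p>'
    )
    print(extract_links(fragment))
```

Regex works well for picking small, predictable pieces like this out of text. For the feed structure itself, a proper XML parser is usually the less fragile choice.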
If you are writing a server application that needs to run often and aggregate content across a large number of sites, then performance should be a significant criterion for you. This would mean a language capable of processing a large volume of data quickly.
If you just need a program to run occasionally and pick out bits of data from many pages, then you can consider a specialized language. The product TestPlan offers a very simple language that would let you grab RSS content quickly and expose it in a simple fashion.
I've used it in some significant scraping projects. While not blazingly fast, the scripts are extremely easy to maintain.