
How do I write a web scraper in Ruby?

Developer https://www.devze.com 2023-03-04 15:37 Source: Internet

I would like to crawl a popular site (say Quora) that doesn't have an API, get some specific information, and dump it into a nicely formatted file, say a .csv, .txt, or .html :)

E.g. return only a list of the bios of all Quora users whose publicly available information lists the occupation 'UX designer'.

How would I do that in Ruby?

I have a moderate understanding of how Ruby & Rails work. I just completed a Rails app, mostly written by myself, but I am no guru by any stretch of the imagination.

I understand regexes, etc.


Your best bet would be to use Mechanize. It can follow links, submit forms, anything you will need, web client-wise. By the way, don't use regexes to parse HTML. Use an HTML parser.


If you want something more high level, try Wombat, a gem I built on top of Mechanize and Nokogiri. It can parse pages and follow links using a really simple, high-level DSL.


I know the answer has been accepted, but Hpricot is also very popular for parsing HTML.

All you have to do is take a look at the HTML source of the pages and try to find an XPath or CSS expression that matches the desired elements, then use something like:

doc.search("//p[@class='posted']")


Mechanize is awesome. If you're looking to learn something new though, you could take a look at Scrubyt: https://github.com/scrubber/scrubyt. It looks like Mechanize + Hpricot. I've never used it, but it seems interesting.


Nokogiri is great, but I find the output messy to work with. I wrote a Ruby gem to easily create classes off HTML: https://github.com/jassa/hyper_api

The HyperAPI gem uses Nokogiri to parse HTML with CSS selectors.

E.g.

Post = HyperAPI.new_class do
  string title: 'div#title'
  string body: 'div#body'
  string author: '#details .author'
  integer comments_count: '#extra .comment' do
    size
  end
end
# => Post

post = Post.new(html_string)
# => #<Post title: 'Hi there!', body: 'This blog post will talk about...', author: 'Bob', comments_count: 74>