My goal is to find the first result in Google search results and collect the site link, so I built this script:
require 'hpricot'
require 'open-uri'
require 'mechanize'
query = gets.chomp
agent = Mechanize.new
page = agent.get("http://www.google.co.il/")
search_form = page.form_with(:name => "f")
search_form.field_with(:name => "q").value = query.to_s
search_results = agent.submit(search_form)
search_results = search_results.body
doc = Hpricot(search_results)
site = doc.search("a")[16,1]
url = site.to_s
puts url
I get a string like this:
url = <a href="http://en.wikipedia.org/wiki/Gallon" dir="ltr" class="l"><em>Gallon</em> - Wikipedia, the free encyclopedia</a>
But I need only the link (http://en.wikipedia.org/wiki/Gallon), not all the HTML code. How can I do it? I am using these gems:
require 'hpricot'
require 'open-uri'
require 'mechanize'
You can get the value of an attribute like this:
(doc/"a")[16].attributes['href']
but I have to say that the magic number 16 seems brittle.
You are also not supposed to scrape the search results; you should consider using the Custom Search API instead.
Since Mechanize includes Nokogiri, you should skip Hpricot altogether. It will slow your code down unnecessarily, because you are effectively parsing the page twice.
require 'mechanize'
query = gets.chomp
agent = Mechanize.new
page = agent.get("http://www.google.co.il/")
search_form = page.form_with(:name => "f")
search_form.field_with(:name => "q").value = query.to_s
search_results = agent.submit(search_form)
puts search_results.links[16].href
Instead of converting to a string with url = site.to_s,
use url = site[0].attributes['href']
Try using:
site = doc.search("a[@href]")[16,1]
Watir is a reasonable choice for checking the layout of a web page.
require 'rubygems'
require 'watir'
# Launch a browser window and navigate to Google
browser = Watir::Browser.new
browser.goto("http://www.google.co.il/")
# Log to the console whether a link with href = http://en.wikipedia.org/wiki/Gallon is present
puts browser.link(:href, "http://en.wikipedia.org/wiki/Gallon").exists?
Since the input is always going to follow the same format, you could just do:
url.split("href=\"").last.split("\"").first
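As a quick check of the string approach, here is a small sketch run against the anchor from the question. It also shows a regex variant that returns nil when there is no href, which the split version does not handle. The helper name extract_href is hypothetical, and both versions assume the attribute is written with double quotes.

```ruby
# Sample anchor markup copied from the question.
html = '<a href="http://en.wikipedia.org/wiki/Gallon" dir="ltr" class="l">' \
       '<em>Gallon</em> - Wikipedia, the free encyclopedia</a>'

# Hypothetical helper: returns the first double-quoted href value, or nil.
def extract_href(markup)
  match = markup.match(/href="([^"]*)"/)
  match && match[1]
end

# The split approach from the answer above.
split_url = html.split('href="').last.split('"').first

# The regex variant.
regex_url = extract_href(html)

puts split_url
puts regex_url
```

Both only hold up while the markup keeps this exact shape; reading the attribute from the parsed document is the sturdier option.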