I have two URLs (actually more, because Google has Maps, News, Images, etc.). Google organic search:
http://www.google.nl/#hl=nl&biw=1920&bih=965&q=koffie&aq=f&aqi=g10&aql=&oq=&fp=b8a3028139d33c34
and Google Adwords search:
http://www.google.nl/aclk?sa=L&ai=CZYun1fI3TY_hO8aMOrer6aQCmK2m2AGIpdyCFr_g_-RVEAEoCFDytZmR-_____8BYJGkmoWEGMgBAakCkm-p2E6Ttj6qBBlP0O_GI1GZU09CYDd728FmO_QIDea76uyT&num=1&sig=AGiWqtzxvt17KyOWqEkwJ7jVdanxR645tw&adurl=http://ad-emea.doubleclick.net/clk%3B233218340%3B57152064%3Bv
I need a regex that finds "google" in a URL and excludes URLs containing the "aclk?" part, which is only used by Google AdWords. The regex will be used to filter on the HTTP referrer and find only Google organic traffic.
First I tried this regex:
www[.]google[.].{1,}client=|www[.]google[.].{1,}gs_rfai|www[.]google[.].{1,}&prmd|news[.]google[.].{1,}nwshp?|video[.]google|www[.]google[.].{1,}imghp?|www[.]google[.].{1,}imgres|www[.]google[.].{1,}search
This caught 50% of the traffic. At that time we didn't have AdWords running, so it could've caught all traffic. But it didn't.
We want to catch all Google organic URLs and exclude AdWords URLs (the ones with "aclk?").
If you need to separate out the domain name from the rest of the URL, consider using a URL parser. There's one in Ruby's standard library.
Ok, here's some code:
require "uri"
uri = "http://www.google.nl/aclk?sa=L&ai=CZYun1fI3TY_hO8aMOrer6aQCmK2m2AGIpdyCFr_g_-RVEAEoCFDytZmR-_____8BYJGkmoWEGMgBAakCkm-p2E6Ttj6qBBlP0O_GI1GZU09CYDd728FmO_QIDea76uyT&num=1&sig=AGiWqtzxvt17KyOWqEkwJ7jVdanxR645tw&adurl=http://ad-emea.doubleclick.net/clk%3B233218340%3B57152064%3Bv"
puts URI.split(uri).inspect
gives
["http", nil, "www.google.nl", nil, nil, "/aclk", nil, "sa=L&ai=CZYun1fI3TY_hO8aMOrer6aQCmK2m2AGIpdyCFr_g_-RVEAEoCFDytZmR-_____8BYJGkmoWEGMgBAakCkm-p2E
d728FmO_QIDea76uyT&num=1&sig=AGiWqtzxvt17KyOWqEkwJ7jVdanxR645tw&adurl=http://ad-emea.doubleclick.net/clk%3B233218340%3B57152064%3Bv", nil]
If you want the individual parameters, you probably want to call split on the long query string: split on "&" to separate the parameters, then split each of those on "=". Sorry if I'm not too precise here; I didn't fully understand your question.
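As a rough sketch of that parameter splitting, the standard library can also do the "&" and "=" work for you. The helper name query_params and the shortened URL are made up for illustration; URI.parse and URI.decode_www_form are part of Ruby's uri library (Array#to_h assumes Ruby 2.1 or later).

```ruby
require "uri"

# Hypothetical helper (not part of the URI library): returns the query
# parameters of a URL as a Hash of percent-decoded names and values.
def query_params(url)
  query = URI.parse(url).query
  return {} if query.nil?            # URL has no "?..." part
  URI.decode_www_form(query).to_h    # splits on "&" and "=", then decodes
end

# Shortened, made-up variant of the AdWords URL from the question.
params = query_params("http://www.google.nl/aclk?sa=L&num=1&adurl=http://ad-emea.doubleclick.net/clk%3B233218340")
puts params["sa"]     # L
puts params["adurl"]  # http://ad-emea.doubleclick.net/clk;233218340
```

Note that this only looks at the part after "?". The organic URL in the question keeps its parameters after "#", so those would be in the fragment rather than the query.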
The rdoc for URI is at http://www.ruby-doc.org/stdlib/libdoc/uri/rdoc/ . Click on "URI" to see the main documentation.
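If the goal is to tell organic Google referrers from AdWords ones, one possible approach is to parse the referrer and check the host and path separately instead of writing one big regex. This is only a sketch under assumptions: the method name google_organic? is made up, and it assumes that AdWords clicks always use the /aclk path, as in the example URL from the question.

```ruby
require "uri"

# Hypothetical helper: true when the referrer is on a Google host and its
# path is not /aclk (the AdWords click path seen in the question).
def google_organic?(referrer)
  uri = URI.parse(referrer)
  host = uri.host.to_s.downcase
  # Assumption: any host with a "google" label followed by more labels
  # counts, e.g. www.google.nl, news.google.com, www.google.co.uk.
  return false unless host =~ /(\A|\.)google\.[a-z.]+\z/
  uri.path != "/aclk"
rescue URI::InvalidURIError
  false # unparseable referrers are treated as non-Google
end

puts google_organic?("http://www.google.nl/#hl=nl&q=koffie")      # true
puts google_organic?("http://www.google.nl/aclk?sa=L&num=1")      # false
puts google_organic?("http://www.example.com/search?q=google")    # false
```

The host pattern is deliberately loose and would also accept something like google.example.com, so tighten it if that matters for your traffic.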