How can i parse/scrape/crawl sites for specific information?

I've recently been charged with a task that blows my mind. My club wants to go through sites and find people who are doing what we are.

The method in use currently is: go to Wikipedia, get the list of every city (e.g. List of cities in Alabama), go to each of the sites (e.g. Meetup, Facebook, Craigslist, etc.), then execute a search for each keyword, in each city, on every site (e.g. kung-fu, martial arts, etc.).

So 460 cities × 5 sites × 5 keywords = 11,500 different searches = mind-numbing monotony.

I was truly hoping there was an easier way. In searching for an answer, I came across this site (building a web spider) and was thinking this might be the way.

THE QUESTION IS: can I modify some web spider (from that site or any other) to run those searches and return only the results that match a keyword? I don't care if it's a bash script, Python, Ruby, or any other language.

Let me know if any of that was unclear, and sorry if it was a bit verbose.


I wouldn't create a real web crawler for something as simple as this. I think what would suffice is to:

  1. Get the list of cities into a file, say cities.txt (doable manually or scripted; see the sketch after this list)
  2. Figure out what URL patterns each of the target sites uses for its searches.
  3. Write a shell script that makes all the searches and saves the results.
  4. Analyze the data on your hard drive (e.g. figure out which XPaths match results for each of the content providers, and search with them)
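
For step 1, here's a rough sketch of scraping the city list instead of typing it out. The Wikipedia URL and the XPath are guesses at the page's structure, so check the real page and adjust both:

# grab wiki links from the first column of the page's tables as city names
curl -s 'https://en.wikipedia.org/wiki/List_of_cities_and_towns_in_Alabama' \
  | xmllint --html --xpath '//table//tr/td[1]//a' - 2>/dev/null \
  | grep -oP '(?<=>)[^<]+(?=</a>)' \
  | sort -u > cities.txt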

The data acquisition part should be simple with wget:

# read cities line by line so multi-word names like "New Hope" stay intact
while read -r city; do
  for keyword in 'kung-fu' 'martial arts'; do
    # turn spaces into '+' so the values survive as URL query parameters
    k=$(echo "$keyword" | tr ' ' '+')
    c=$(echo "$city" | tr ' ' '+')
    # quote the URLs: an unquoted '&' would background wget mid-command
    wget "http://searchsite1.com/?search=${k}&city=${c}"
    wget "http://searchsite2.com/groups/search?q=${k}+${c}"
  done
done < cities.txt
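
For the analysis in step 4, something along these lines might do. Everything here is an assumption: it supposes you've collected the downloads into a results/ directory, and div[@class="result"] is a placeholder selector, so inspect a few saved pages to find the markup each site actually uses:

# count result nodes in each saved page and print the pages that have any
for f in results/*; do
  hits=$(xmllint --html --xpath 'count(//div[@class="result"])' "$f" 2>/dev/null)
  if [ "${hits:-0}" != "0" ]; then
    echo "$f: $hits matches"
  fi
done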

The other parts require a little figuring out on your own. This is how I'd do it, YMMV.
