How to search for a particular type of web address?

开发者 (Developer) https://www.devze.com 2022-12-21 04:19 Source: web

See these URLs:

http://en.wikipedia.org/wiki/1_(number)

http://en.wikipedia.org/wiki/10_(number)

http://en.wikipedia.org/wiki/100_(number)

http://en.wikipedia.org/wiki/10000_(number)

Is there some way to search for a list of all the pages with this format on the WWW?


I see two problems to solve.

The first one: there is no real central directory of all the URLs in the world, and you will not even find a sitemap on every site you know.

One idea would be to check whether a search engine (Google or another) lets you search at the URL level instead of the content level. You could then generate search queries that return the list of pages matching your regex.

The second one: for certain web services that expose functions as resources, the list of URLs matching a regex may be infinite.

You can use several checks to avoid this, such as capping the number of pages fetched or the link depth followed.

By the way, you are facing the same problem as every search engine: making an inventory of the whole web. No one has ever fully solved this problem.

EDIT: basic web crawler algorithm

take a list of seed sites
for each seed
  fetch and parse the web page returned
  add each link found in the page to the seed list, unless it is already there
  apply some algorithm to index the page under several keywords in a db
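
The pseudocode above can be sketched in Python roughly as follows. This is a minimal sketch, not a production crawler: `crawl`, `LinkExtractor`, the `fetch` callback and the `max_pages` cap are names chosen here for illustration, and the keyword-indexing step is swapped for a plain regex match on each URL, since that is what the question asks about. The visited set and the page cap are one possible form of the checks mentioned for sites with an unbounded URL space.

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, pattern, max_pages=100):
    """Breadth-first crawl from `seeds`; return fetched URLs matching `pattern`.

    `fetch(url)` is a caller-supplied (hypothetical) function that returns the
    page HTML as a string, or None on failure. `max_pages` bounds the number
    of fetches so the crawl always terminates.
    """
    regex = re.compile(pattern)
    queue = deque(seeds)
    seen = set(seeds)
    matches = []
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        html = fetch(url)
        fetched += 1
        if html is None:
            continue
        if regex.fullmatch(url):
            matches.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before deduplicating.
            link, _fragment = urldefrag(urljoin(url, href))
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                queue.append(link)
    return matches
```

In real use `fetch` could wrap `urllib.request.urlopen`; keeping it a parameter lets you try the crawler against an in-memory dict of pages first.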


Usually: grep -E "http://en\.wikipedia\.org/wiki/10*_\(number\)" list_of_urls
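
For comparison, the same filter can be written in Python. This is a small sketch; `NUMBER_PAGE` and `matching_urls` are illustrative names. Note that `10*` only matches 1, 10, 100 and so on, which covers the examples in the question; replacing it with `[0-9]+` would presumably be needed to match pages for any number.

```python
import re

# Same pattern as the grep command: "1" followed by any number of zeros,
# then the literal suffix "_(number)".
NUMBER_PAGE = re.compile(r"http://en\.wikipedia\.org/wiki/10*_\(number\)")


def matching_urls(lines):
    """Return the lines (stripped) that contain a URL matching NUMBER_PAGE."""
    return [line.strip() for line in lines if NUMBER_PAGE.search(line)]
```

Called as `matching_urls(open("list_of_urls"))`, it would read the same file that the grep command uses.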

But if you want to know whether a website serves content at URLs of a given format, you have a few possibilities.

  1. The site has some kind of sitemap from which you can grab your list_of_urls and feed it to grep (for Wikipedia: http://en.wikipedia.org/wiki/Special:AllPages).
  2. You have to build a list of candidate addresses yourself and try them. There is no standard way for an HTTP server to advertise all its pages.
  3. Google's way: crawl the site by following links so that you find all the public pages it has, then search the list you've built.
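
Option 2 might look something like the sketch below. The function names, the `User-Agent` string and the use of a HEAD request are assumptions made for illustration; a server may answer HEAD differently from GET, so treat the result as a hint only.

```python
import urllib.request


def candidate_urls(numbers, base="http://en.wikipedia.org/wiki/"):
    """Build one candidate URL per number, following the pattern in the question."""
    return [f"{base}{n}_(number)" for n in numbers]


def url_exists(url, timeout=10):
    """Return True if a HEAD request for `url` succeeds with a status below 400."""
    request = urllib.request.Request(
        url,
        method="HEAD",
        headers={"User-Agent": "url-probe-sketch/0.1"},  # placeholder value
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except OSError:
        # Covers HTTP errors such as 404, connection failures and timeouts.
        return False


def probe(urls, exists=url_exists):
    """Keep only the URLs for which `exists` reports a live page."""
    return [url for url in urls if exists(url)]
```

For example, `probe(candidate_urls(range(1, 101)))` would try the pages for 1 through 100, one request at a time.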

Also, Google supports the allinurl: and site: operators, which could help you too.
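
As a rough illustration, a query of this kind could be assembled as below and pasted into a browser. `search_url` is a made-up helper, and the sketch uses one `inurl:` term per word rather than `allinurl:`, on the assumption that the single-term form combines more predictably with `site:`.

```python
from urllib.parse import urlencode


def search_url(site, url_words):
    """Build a Google search URL limited to `site`, requiring each word in the page URL."""
    terms = [f"site:{site}"] + [f"inurl:{word}" for word in url_words]
    return "https://www.google.com/search?" + urlencode({"q": " ".join(terms)})
```

`search_url("en.wikipedia.org", ["number"])` builds the query `site:en.wikipedia.org inurl:number`.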
