So I'm writing a program that opens a page, and one of the things it should do is detect how many navigations (menus) the web page has, how long the main navigation is (how many elements), the average text length of the navigation elements, and so on.
Anyway, I have some problems detecting menus. I think there are two ways web navigation is usually coded:
1. <ul><li><a>Home</a></li><li><a>Products</a></li>...</ul>
2. <div><a>Home</a><a>Product</a>...</div>
So if I find one of these structures I know (or should I say "I think") it's navigation. But this is NOT bulletproof; I get a lot of false hits.
So does anyone have a better idea of how to detect navigation on web pages?
There is no universal solution. You need to implement some heuristics. I would try this:
- fetch the site's pages with a recursion limit of 1 (like wget -r -l1 http://example.com/)
- for each internal page, keep the set of internal links on that page
- take the intersection of all the sets.
This way you get the constant set of internal links, which in most cases will be the "menu" of the site.
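The steps above could be sketched roughly as follows. This is only a minimal illustration using the Python standard library; it assumes the pages have already been downloaded (for example with the wget command above) and are passed in as a dict of URL to HTML text. The names LinkCollector, internal_links and common_links are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse, urldefrag


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def internal_links(page_url, html):
    """Return the set of absolute links on the page that stay on the same host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    links = set()
    for href in collector.hrefs:
        # Resolve relative links and drop any #fragment part.
        absolute = urldefrag(urljoin(page_url, href)).url
        if urlparse(absolute).netloc == host:
            links.add(absolute)
    return links


def common_links(pages):
    """pages maps URL -> HTML text (assumed already fetched).

    Returns the links that appear on every page, i.e. the menu candidates.
    """
    sets = [internal_links(url, html) for url, html in pages.items()]
    return set.intersection(*sets) if sets else set()
```

Footer links and logo links also appear on every page, so the result is a set of candidates rather than a guaranteed menu.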
In HTML4 and XHTML there is no standard way of writing menus. In HTML5 you have the <menu> and <nav> tags, but as you have concluded, in earlier versions the generally recommended way is to use an unordered list.
I would probably write a number of tests, and use them all in parallel to try and find the menu, e.g. based on position in the document, structure, and things like id and class attributes (the values of which will often contain "menu").
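One way to combine several such tests is to give each candidate element a score. The sketch below is only an illustration with the Python standard library: the weights, the hint words and the "2 to 15 links" range are arbitrary assumptions, and find_navigation and NavCandidateParser are names invented for this example. It does not check position in the document.

```python
from html.parser import HTMLParser

VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
CONTAINERS = {"nav", "menu", "ul", "ol", "div"}
HINTS = ("nav", "menu")  # substrings looked for in id/class values


class NavCandidateParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []       # currently open elements
        self.candidates = []  # closed containers holding at least 2 links

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return  # void elements never get an end tag
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            # Count the link for every container that is currently open.
            for el in self.stack:
                if el["tag"] in CONTAINERS:
                    el["links"] += 1
        label = f'{attrs.get("id") or ""} {attrs.get("class") or ""}'.lower()
        self.stack.append({"tag": tag, "label": label, "links": 0})

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed <li> etc.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i]["tag"] == tag:
                closed = self.stack[i:]
                del self.stack[i:]
                for el in closed:
                    if el["tag"] in CONTAINERS and el["links"] >= 2:
                        self.candidates.append(el)
                break


def score(el):
    """Add up the individual tests; the weights are arbitrary."""
    s = 0
    if el["tag"] == "nav":
        s += 3
    if any(hint in el["label"] for hint in HINTS):
        s += 2
    if el["tag"] in ("ul", "ol", "menu"):
        s += 1
    if 2 <= el["links"] <= 15:
        s += 1
    return s


def find_navigation(html):
    """Return candidate containers, best score first."""
    parser = NavCandidateParser()
    parser.feed(html)
    return sorted(parser.candidates, key=score, reverse=True)
```

The first entries of the returned list are the most menu-like elements; a threshold on the score would decide how many of them to treat as navigations.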
Don't forget the HTML5 <nav> tag.
Adding to the previous answers, a ul or div with a class or id that includes "nav" is probably what you want too. There is no universal answer, though. Also, keep in mind the possibility of primary and secondary navigation menus (e.g. a top menu and a side menu, or Stack Overflow's two horizontal menus at the top of the page).