How to guess the REAL title of an HTML document?

A lot of people put extremely useless and annoying stuff in their <title> tags, and I'm trying to come up with JavaScript code that extracts only the interesting part.

For example on a Google search you get this document title:

some random search - Google Search

The "Google Search" part is redundant, because you already have this information in the domain name (and the favicon). In this example I only want this part:

some random search


Most site authors probably use the "dash notation", which looks like this:

Site name - Title or

Title - Site name

But if it was that easy I wouldn't be asking here. ;)

There are also some really annoying cases where the title isn't present in the <title> tag at all. (Oh the irony!) Just have a look at this page from the NY Times: Egypt’s Autocrats Exploited Internet’s Weaknesses - NYTimes.com. Whereas the headline of the article actually is: Egypt Leaders Found ‘Off’ Switch for Internet. What the f***, New York Times?

What's the most reliable approach to extract this information under the assumption that we have access to the page's DOM? I think a good starting point would be the <h1> tag, but it isn't reliable. I imagine that there are a lot of authors who don't use it at all or use it multiple times.

Update: The combination of the <title> and <h1> content seems reasonable to me. Thanks to all of you who have suggested it. But what if there is no <h1> tag? I think some (admittedly, bad) authors don't use them and instead just specify the font-size of a <div> or <span>.

I'm currently creating my very first browser extension. (Isn't that nice?) It has a feature that lets you save the current tab, so it should work generally and for as many pages as possible.

Thanks to all of you! :)

Title tags are arbitrary, and so are h1 tags. The best you can really hope for is to tailor your script on a site-by-site basis and hope the site at least does things consistently from page to page. For instance, with SO you can see they do [tag] - [question] - [site], so you can easily split at the hyphen and grab the 2nd element. There's no real "one size fits all" solution. You've got to do the research for the site and find the pattern.

edit:

Based on the response in the comments... IMO a "good enough guess" would involve:

1) Only looking at document.title. As others have mentioned, people can use other things besides h1 tags for the in-page "title", and then you run the risk of looking at something that's not meant to be a title at all.

2) Splitting at a hyphen, pipe or colon. Those are the 3 most common delimiters.

3) If splitting yields 2+ array elements, checking whether the last element appears in the domain (that is, whether indexOf returns something other than -1). If so, use the 2nd to last element. If not, use the last array element.
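A rough sketch of those three steps, in case it helps. The function name guessTitle, its parameters, and the exact delimiter regex are my own assumptions, not anything standard. In a browser you would pass document.title and location.hostname.

```javascript
// Sketch of the split-and-compare heuristic described above.
// `guessTitle` is a hypothetical helper name, not an existing API.
function guessTitle(rawTitle, hostname) {
  // Step 2: split on a hyphen or pipe surrounded by whitespace, or on a
  // colon followed by whitespace, so "e-mail" and "http://" stay intact.
  const parts = rawTitle
    .split(/\s+[-|]\s+|\s*:\s+/)
    .map((part) => part.trim())
    .filter((part) => part.length > 0);

  // Nothing to choose between: return the title as it is.
  if (parts.length < 2) {
    return rawTitle.trim();
  }

  // Step 3: compare the last part with the domain, ignoring case and
  // every character that is not a letter or digit (assumption: this
  // loose comparison is good enough for "Stack Overflow" vs the domain).
  const normalize = (s) => s.toLowerCase().replace(/[^a-z0-9]/g, "");
  const last = normalize(parts[parts.length - 1]);
  const domain = normalize(hostname);
  const lastIsSiteName = last.length > 0 && domain.indexOf(last) !== -1;

  return lastIsSiteName ? parts[parts.length - 2] : parts[parts.length - 1];
}
```

For this page's title and "stackoverflow.com" it returns the question text. It is only a guess, though: for "some random search - Google Search" on "www.google.com" it returns "Google Search", because "googlesearch" does not appear in the hostname.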

On this SO page, finding the common string from <title> and <h1> is an effective solution.

<title>javascript - How to guess the REAL title of an HTML document? - Stack Overflow</title>
<h1>How to guess the REAL title of an HTML document?</h1>
Common string is "How to guess the REAL title of an HTML document?"
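One way to compute that common string is a longest-common-substring search. This is just a sketch; the function name is made up, and in a browser you would feed it document.title and the text of the first h1 (if there is one).

```javascript
// Longest common substring of two strings via dynamic programming.
// `longestCommonSubstring` is an illustrative name, not a built-in.
function longestCommonSubstring(a, b) {
  let bestLength = 0;
  let bestEnd = 0; // exclusive end index of the best match inside `a`

  // prev[j] = length of the common suffix of a.slice(0, i - 1) and b.slice(0, j)
  let prev = new Array(b.length + 1).fill(0);

  for (let i = 1; i <= a.length; i++) {
    const curr = new Array(b.length + 1).fill(0);
    for (let j = 1; j <= b.length; j++) {
      if (a[i - 1] === b[j - 1]) {
        curr[j] = prev[j - 1] + 1;
        if (curr[j] > bestLength) {
          bestLength = curr[j];
          bestEnd = i;
        }
      }
    }
    prev = curr;
  }

  return a.slice(bestEnd - bestLength, bestEnd).trim();
}
```

If the result is very short compared to both inputs, the title and the heading probably don't agree (as in the NY Times example), so you may want a minimum length before trusting it.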

In your first example, you might have enough information in the DOM to determine if it's Site name - Title or Title - Site name. You can look for terms in the URL and in the page text. Quite likely, the Site name will be used more often in the page text than the actual title is. But any such heuristic is going to be less than perfect.
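Here is a minimal sketch of that idea, assuming the title uses " - " as its delimiter. All names are hypothetical, and the hostname bonus is an arbitrary weight I picked for illustration. It treats the part that shows up least often in the page text as the title, and pushes any part found in the hostname towards "site name".

```javascript
// Count non-overlapping occurrences of `needle` in `haystack`.
function countOccurrences(haystack, needle) {
  if (needle.length === 0) return 0;
  let count = 0;
  let from = haystack.indexOf(needle);
  while (from !== -1) {
    count++;
    from = haystack.indexOf(needle, from + needle.length);
  }
  return count;
}

// Arbitrary weight (an assumption): a part found in the hostname is
// treated as if it appeared this many extra times in the page text.
const HOSTNAME_BONUS = 5;

// Returns the part of the title that looks least like a site name.
function pickTitlePart(rawTitle, pageText, hostname) {
  const parts = rawTitle
    .split(" - ")
    .map((part) => part.trim())
    .filter((part) => part.length > 0);
  if (parts.length < 2) return rawTitle.trim();

  const text = pageText.toLowerCase();
  const compactHost = hostname.toLowerCase().replace(/[^a-z0-9]/g, "");

  let best = parts[0];
  let bestScore = Infinity;
  for (const part of parts) {
    const lower = part.toLowerCase();
    let score = countOccurrences(text, lower);
    const compact = lower.replace(/[^a-z0-9]/g, "");
    if (compact.length > 0 && compactHost.includes(compact)) {
      score += HOSTNAME_BONUS;
    }
    // The lowest score wins; on a tie the earlier part is kept.
    if (score < bestScore) {
      bestScore = score;
      best = part;
    }
  }
  return best;
}
```

In a browser you could call it with document.title, document.body.textContent and location.hostname. Expect it to be wrong on pages that repeat the headline a lot.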

Beyond that, you have to resort to heuristic methods that you build up over time from examining many different pages across many different domains. We've done something like this to differentiate between page content and sidebars, ads, and other stuff on HTML pages. It's not 100% reliable in general, but it is very reliable on sites that follow common patterns.

You'll find, as others have pointed out, that h1 tags often (but not always) repeat the title text. But sometimes the designer used a div named "title" or "main_content" or "header" or something else. Or they'll use h2 for the content title.
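Something along these lines could collect a few likely heading elements and keep the one that also appears in the title. It is only a sketch: the selector list is a guess based on the names mentioned above, both function names are made up, and preferring the longest match is my own assumption.

```javascript
// Pick the candidate text that is contained in the title.
// When several match, the longest one wins (an assumption, so that a
// short site name loses against a full headline).
function pickHeading(title, candidates) {
  const haystack = title.toLowerCase();
  let best = null;
  for (const candidate of candidates) {
    // Collapse line breaks and runs of spaces that come from the markup.
    const text = candidate.replace(/\s+/g, " ").trim();
    if (text.length === 0) continue;
    const matches = haystack.includes(text.toLowerCase());
    if (matches && (best === null || text.length > best.length)) {
      best = text;
    }
  }
  return best; // null when no candidate appears in the title
}

// Browser-only helper: gather the text of elements that often hold the
// headline. The selector list is hypothetical; extend it per site.
function collectHeadingCandidates(doc) {
  const selector = "h1, h2, #title, .title, #header, .header";
  return Array.from(doc.querySelectorAll(selector), (el) => el.textContent);
}
```

Usage in a page would be pickHeading(document.title, collectHeadingCandidates(document)). A null result means none of the candidates repeats the title text, and you are back to guessing from the title alone.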

I would suggest that you first work on the simple case. That is, if you see a hyphen (-) in the title, assume that it's either site name - title or title - site name. When you get that working reliably, then look into how you determine if the title is actually representative of the page content.