Given a body of HTML, is there a function someone has already written that will automatically extract, say, the top 10 keywords from a chunk of HTML, excluding any HTML tags (i.e. just the plain text)?
It should ignore common words like "and", "is", "but", etc., and list the most frequent uncommon words.
Example input:
Mary had a <strong>snow</strong> lamb. <img src=lamb.jpg /> The <i>lamb</i> was snow white, it lay in the snow all white.
Output:
Snow (3)
White (2)
Lamb (2)
jQuery is fine!
In short:
1) take the innerHTML of your body;
2) strip all tags with a .replace() (/<[^>]*>/g), replacing each tag with a space so adjacent words don't merge;
3) strip all punctuation and \n (again replacing with a space) so you have a single-line string; do this after step 2, otherwise you remove the < and > that the tag regex relies on;
4) strip all common words (/\band\b/g, /\bbut\b/g, ...). E.g. if your useless words are those with fewer than 4 chars, strip /\b\w{1,3}\b/g;
- now you should have a one-line string (str) without markup and useless words
4a) Optional: if you don't care about WoRdCAse, convert everything to lowercase (str.toLowerCase())
5) split on whitespace (str.split(/\s+/)) to obtain an array (arr)
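6) count the occurrences of each word:
var words = {},
    i = arr.length;
while (i--) { // i-- rather than --i, so that arr[0] is counted too
    var extWord = arr[i];
    if (extWord === '') continue; // skip empty strings left by repeated spaces
    // start at 1 for a new word, otherwise increment the existing count
    words[extWord] = Object.prototype.hasOwnProperty.call(words, extWord) ? words[extWord] + 1 : 1;
}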
7) loop over the words object with for...in to get each key (a single word) and its value (the number of occurrences of that word)
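Putting the steps together, here is a minimal sketch of how they might look as one function. The name topKeywords, the limit parameter and the STOP_WORDS list are my own assumptions for illustration, not an existing library. It strips tags before punctuation and only handles plain ASCII letters and digits, so adjust the regex if your text has accented characters.

```javascript
// Hypothetical stop-word list for illustration; extend it for real content.
var STOP_WORDS = ["a", "all", "and", "but", "had", "in", "is", "it", "lay", "the", "was"];

// Returns up to `limit` [word, count] pairs, most frequent first.
function topKeywords(html, limit) {
  var str = html
    .replace(/<[^>]*>/g, " ")       // strip tags (a space keeps words apart)
    .toLowerCase()                  // ignore case
    .replace(/[^a-z0-9\s]/g, " ");  // strip punctuation (ASCII-only assumption)

  var arr = str.split(/\s+/);
  var words = Object.create(null);  // no inherited keys such as "constructor"

  for (var i = 0; i < arr.length; i++) {
    var extWord = arr[i];
    // skip empty strings and common words
    if (extWord === "" || STOP_WORDS.indexOf(extWord) !== -1) continue;
    words[extWord] = (words[extWord] || 0) + 1;
  }

  return Object.keys(words)
    .sort(function (a, b) {
      // highest count first; ties broken alphabetically for a stable result
      return words[b] - words[a] || (a < b ? -1 : a > b ? 1 : 0);
    })
    .slice(0, limit)
    .map(function (w) { return [w, words[w]]; });
}
```

Called with the example input from the question and a limit of 3, this gives [["snow", 3], ["lamb", 2], ["white", 2]]. In the browser you would pass document.body.innerHTML as the first argument.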
Hope this helps.
A slight modification of the option outlined by Fabrizio, using jQuery:
//grab all text from page
var myDocumentText = $("body").text();
myParseText(myDocumentText);
function myParseText(myText){
    // ... do processing of text in here with your logic to not count "and", "or", etc.
}
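For what it's worth, the body of myParseText could be filled in along these lines. This is only a sketch: the IGNORED set and the choice to return a Map of counts are assumptions of mine, and the regex only matches ASCII letters and apostrophes. Since .text() already returns plain text, no tag stripping is needed here.

```javascript
// Hypothetical set of words to skip; not part of the original answer.
var IGNORED = new Set(["and", "or", "but", "is", "the", "a"]);

// Counts every non-ignored word in a plain-text string.
// Returns a Map of word -> number of occurrences.
function myParseText(myText) {
  var counts = new Map();
  // runs of letters/apostrophes are treated as words (ASCII-only assumption)
  var tokens = myText.toLowerCase().match(/[a-z']+/g) || [];
  tokens.forEach(function (word) {
    if (IGNORED.has(word)) return; // skip common words
    counts.set(word, (counts.get(word) || 0) + 1);
  });
  return counts;
}
```

With jQuery on the page you would call it as myParseText($("body").text()) and then read the counts with counts.get(word) or iterate with counts.forEach.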