
JavaScript: automatically pick keywords from HTML

Source: https://www.devze.com, 2023-01-19 07:58 (reposted from the web)

Given a body of HTML, is there a function someone has already written that will automatically extract, say, the top 10 keywords that appear in a chunk of HTML, excluding any HTML tags (i.e. just the plain text)?

It should ignore common words like "and", "is", "but", etc., and list the most frequent uncommon words.

Example input:

Mary had a <strong>snow</strong> lamb. <img src=lamb.jpg /> The <i>lamb</i> was snow white, it lay in the snow all white.

Output:

Snow (3)
White (2)
Lamb (2)

jQuery is fine!


In short:

1) take the innerHTML of your body;

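2) strip all tags with a .replace() (/<[^>]*>/g); do this before touching punctuation, otherwise the < and > needed to match the tags are already gone;

3) strip all punctuation and \n so you have a single-line string;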

4) strip all common words (/\band\b/g, /\bbut\b/g, ...). E.g. if your useless words are those with fewer than 4 chars, strip /\b\w{1,3}\b/g

  • now you should have a one-line string (str) without markup and useless words

4a) Optional: if you don't care about word case, just transform everything to lowercase (str.toLowerCase())

5) split on whitespace (str.split(/\s+/), so that repeated spaces do not produce empty entries) to obtain an array (arr)

6) count the occurrences of each word:

var words = {},
    i = arr.length;

// i-- visits every index from arr.length - 1 down to 0
while (i--) {
    var extWord = arr[i];
    words[extWord] = words[extWord] ? words[extWord] + 1 : 1;
}

7) loop over the words object with for...in to get each key (a single word) and its value (the number of occurrences of that word)
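
Put together, the steps could look roughly like the sketch below. It is only an illustration: countKeywords is a made-up name, the punctuation filter assumes plain ASCII English text, and it takes the HTML as a string argument instead of reading innerHTML, so it also runs outside a browser.

```javascript
// Sketch of steps 1-7 for an HTML string; countKeywords is a hypothetical name.
function countKeywords(html) {
  var str = html
    .replace(/<[^>]*>/g, ' ')       // strip tags first, while "<" and ">" still exist
    .toLowerCase()                  // optional: ignore word case
    .replace(/[^a-z0-9\s]/g, ' ')   // strip punctuation (ASCII-only assumption)
    .replace(/\b\w{1,3}\b/g, ' ')   // drop "useless" words of 1 to 3 characters
    .replace(/\s+/g, ' ')           // collapse newlines and repeated spaces
    .trim();

  var arr = str === '' ? [] : str.split(' ');
  var words = Object.create(null);  // no inherited keys such as "constructor"
  for (var i = 0; i < arr.length; i++) {
    words[arr[i]] = (words[arr[i]] || 0) + 1;
  }
  return words;
}

var sample = 'Mary had a <strong>snow</strong> lamb. <img src=lamb.jpg /> ' +
  'The <i>lamb</i> was snow white, it lay in the snow all white.';
var counts = countKeywords(sample);
for (var word in counts) {
  console.log(word + ' (' + counts[word] + ')');
}
```

For the sample from the question this counts snow 3 times and lamb and white twice each; "mary" also survives with a count of 1, because the length rule only removes words shorter than 4 characters.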

Hope this helps.


A slight modification of the option outlined by Fabrizio, using jQuery: .text() returns the page text with the tags already removed, so no tag-stripping regex is needed.

// grab all text from the page
var myDocumentText = $("body").text();

myParseText(myDocumentText);

function myParseText(myText) {
    // ... process the text here with your own logic so that "and", "or", etc. are not counted
}
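
One possible way to fill in the body of myParseText is sketched below. The STOP_WORDS list and the limit parameter are assumptions added for illustration and are not part of the answer above; the word pattern only recognises unaccented Latin letters. In a browser the text would come from $("body").text(); here a literal string is used so the sketch runs on its own.

```javascript
// Hypothetical stop-word list; a real one would be much longer.
const STOP_WORDS = new Set(['a', 'all', 'and', 'but', 'had', 'in', 'is', 'it',
  'or', 'the', 'was']);

// Returns up to `limit` [word, count] pairs, most frequent first.
function myParseText(myText, limit) {
  const counts = new Map();
  // Lowercase, then pull out runs of letters; punctuation and digits are skipped.
  const tokens = myText.toLowerCase().match(/[a-z]+/g) || [];
  for (const token of tokens) {
    if (STOP_WORDS.has(token)) continue;  // ignore common words
    counts.set(token, (counts.get(token) || 0) + 1);
  }
  return Array.from(counts)
    .sort((a, b) => b[1] - a[1])          // highest count first
    .slice(0, limit);
}

const text = 'Mary had a snow lamb. The lamb was snow white, ' +
  'it lay in the snow all white.';
const top = myParseText(text, 3);
top.forEach(([word, count]) => console.log(word + ' (' + count + ')'));
```

An explicit stop-word list is used here instead of a length cut-off, since short words such as "cat" can be meaningful keywords while longer ones such as "which" are not.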
