Hi, I was wondering whether anyone could offer some advice on the fastest / most efficient way to compare two arrays of strings in JavaScript.
I am developing a kind of tag cloud based on a user's input - the input being a written piece of text such as a blog article or the like.
I therefore keep an array of words to exclude: is, a, the, etc.
At the moment I am doing the following:
Remove all punctuation from the input string, tokenize it, compare each word to the exclude array and then remove any duplicates.
The comparisons are performed by looping over each item in the exclude array for every word in the input text - this seems kind of brute force and is crashing Internet Explorer on arrays of more than a few hundred words.
I should also mention my exclude list has around 300 items.
Any help would really be appreciated.
Thanks
I'm not sure about the whole approach, but rather than building a huge array then iterating over it, why not put the "keys" into a map-"like" object for easier comparison?
e.g.
var excludes = {};//object
//set keys into the "map"
excludes['bad'] = true;
excludes['words'] = true;
excludes['exclude'] = true;
excludes['all'] = true;
excludes['these'] = true;
Then when you want to compare... just do
var wordsToTest = ['these','are','all','my','words','to','check','for'];
var checkWord;
for (var i = 0; i < wordsToTest.length; i++) {
  checkWord = wordsToTest[i];
  if (excludes[checkWord]) {
    //bad word, ignore...
  } else {
    //good word... do something with it
  }
}
This allows these words through: ['are','my','to','check','for']
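If the exclude list already exists as an array, the map can be built with a loop, and the duplicate removal can happen in the same pass. The following is only a sketch: buildLookup and filterWords are hypothetical helper names, not part of the answer above, and it uses hasOwnProperty so that input words such as "constructor" do not match keys inherited from Object.prototype.

```javascript
// Hypothetical helper: turn an exclude array into a lookup object.
function buildLookup(words) {
  var lookup = {};
  for (var i = 0; i < words.length; i++) {
    lookup[words[i]] = true;
  }
  return lookup;
}

// Hypothetical helper: keep each non-excluded word once, in first-seen order.
function filterWords(words, excludes) {
  var hasOwn = Object.prototype.hasOwnProperty;
  var seen = {};
  var result = [];
  for (var i = 0; i < words.length; i++) {
    var word = words[i];
    // hasOwnProperty avoids false hits on inherited keys such as "constructor"
    if (hasOwn.call(excludes, word) || hasOwn.call(seen, word)) {
      continue;
    }
    seen[word] = true;
    result.push(word);
  }
  return result;
}

var excludeLookup = buildLookup(['bad', 'words', 'exclude', 'all', 'these']);
var kept = filterWords(['these', 'are', 'all', 'my', 'words', 'my', 'constructor'], excludeLookup);
console.log(kept); // ['are', 'my', 'constructor']
```

Each word costs one property lookup instead of a scan over all 300 exclude items, so the work grows with the length of the text rather than with text length times list length.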
It would be worth a try to combine the words into a single regex, and then compare with that. The regex engine's optimizations might allow the search to skip forward through the search text a lot more efficiently than you could do by iterating yourself over separate strings.
You could use a hashing function for strings (I don't know if JS has one, but I'm sure uncle Google can help ;] ). Then you would calculate hashes for all the words in your exclude list and create an array of booleans indexed by those hashes. Then just iterate through the text and check the word hashes against that array.
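A hand-written hash function may not be needed. As a sketch of the same idea, assuming an ES2015 or later engine (so not the older Internet Explorer versions mentioned in the question), the built-in Set gives membership checks that engines typically implement with a hash table. The names excludeSet, inputWords and tags below are made up for illustration.

```javascript
// Assumes an ES2015+ engine where Set is available.
const excludeSet = new Set(['is', 'a', 'the']);
const inputWords = ['the', 'cat', 'is', 'a', 'cat'];

// A second Set removes duplicates; a Set iterates in insertion order.
const uniqueKept = new Set();
for (const word of inputWords) {
  if (!excludeSet.has(word)) {
    uniqueKept.add(word);
  }
}

const tags = Array.from(uniqueKept);
console.log(tags); // ['cat']
```

Unlike an array of booleans indexed by a custom hash, a Set compares the actual strings, so two different words can never be confused by a hash collision.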
I have taken scunliffe's answer and modified it as follows:
var excludes = ['bad','words','exclude','all','these']; //array
Now let's add a function to the Array prototype that checks whether a value is inside an array:
Array.prototype.hasValue = function(value) {
  for (var i = 0; i < this.length; i++) {
    if (this[i] === value) return true;
  }
  return false;
};
Let's test some words:
var wordsToTest = ['these','are','all','my','words','to','check','for'];
var checkWord;
for (var i = 0; i < wordsToTest.length; i++) {
  checkWord = wordsToTest[i];
  if (excludes.hasValue(checkWord)) {
    //is bad word
  } else {
    //is good word
    console.log(checkWord);
  }
}
Output (each word is logged on its own line):
are, my, to, check, for
I'd opt for the regex version:
var text = 'This is a text that contains the words to delete. It has some <b>HTML</b> code in it, and punctuation!';
var deleteWords = ['is', 'a', 'that', 'the', 'to', 'this', 'it', 'in', 'and', 'has'];
// clear punctuation and HTML code: replace tags and non-word characters with spaces
var onlyWordsReg = /<[^>]*>|\W/g;
var onlyWordsText = text.replace(onlyWordsReg, ' ');
// one alternation of whole words, matched globally and case-insensitively
var reg = new RegExp('\\b' + deleteWords.join('\\b|\\b') + '\\b', 'ig');
var cleanText = onlyWordsText.replace(reg, '');
// tokenize after this
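The tokenizing step left open above could look roughly like this. It is a sketch only: extractTags and escapeRegExp are hypothetical names, lowercasing the tokens is an added assumption so that "HTML" and "html" count as one tag, and the escaping step is there in case an exclude word ever contains a regex metacharacter.

```javascript
// Hypothetical helper: escape regex metacharacters in a word.
function escapeRegExp(word) {
  return word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: strip tags and punctuation, drop stop words,
// and return the unique lowercase tokens in first-seen order.
function extractTags(text, stopWords) {
  var escaped = [];
  for (var i = 0; i < stopWords.length; i++) {
    escaped.push(escapeRegExp(stopWords[i]));
  }
  var stopReg = new RegExp('\\b(?:' + escaped.join('|') + ')\\b', 'ig');
  var cleaned = text.replace(/<[^>]*>|\W/g, ' ').replace(stopReg, ' ');
  var tokens = cleaned.toLowerCase().split(/\s+/);
  var seen = {};
  var result = [];
  for (var j = 0; j < tokens.length; j++) {
    var token = tokens[j];
    // skip the empty strings that split() leaves at the edges
    if (token !== '' && !Object.prototype.hasOwnProperty.call(seen, token)) {
      seen[token] = true;
      result.push(token);
    }
  }
  return result;
}

var sampleTags = extractTags(
  'This is a text. It has some <b>HTML</b> code in it!',
  ['is', 'a', 'this', 'it', 'in', 'has']
);
console.log(sampleTags); // ['text', 'some', 'html', 'code']
```

Note that \W and \b only treat ASCII letters, digits and the underscore as word characters, so text with accented or non-Latin characters would need a different character class.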