开发者

Using regex to potentially improve performance of string parsing?

开发者 https://www.devze.com 2023-01-10 22:51 出处:网络
I have a string like this: // string1 horse|cow|goat|zebra| and another string like this: // string2 horse:a,pig:b,cow:z,monkey:g,goat:a,

I have a string like this:

// string1
horse|cow|goat|zebra|

and another string like this:

// string2
horse:a,pig:b,cow:z,monkey:g,goat:a,

my goal is to split string1, then pick out any occurrences of it in string2, to build a histogram. I am currently doing this:

var histogram = {};

var animals = string1.split("|");
for (var i = 0; i < animals.length; i++) {
    var animal = animals[i];
    var animalColon = animal + ":";

    var index = string2.indexOf(animalColon);
    while (index != -1) {
        var indexColon = index + animalColon.length;
        var indexFinal = string2.indexOf(",", indexColon);
        var letter = string2.substring(indexColon, indexFinal);

        if (histogram[letter] == null) {
            histogram[letter] = 1;
        }
        else {
            histogram[letter] = histogram[letter] + 1;
        }
        index = string2.indexOf(animalColon, index + 1);
    }
}
开发者_如何学C

at the end, it might print something like:

// histogram:
a: 2 instances // from { horse, goat }
z: 1 instance  // from { cow }

the above will work, but I have to dp animals.length passes through string2 to check everyone.

Is there a way to use regular expressions to do this parsing - essentially run all tests in parallel, instead of doing multiple passes through? Since string2 is const, it seems that all checks could be done simultaneously (not sure if regexes are implemented like this).

I increased the number of elements in string1 and string2 on the order of thousands of elements and it still runs quite fast, but am worried about slower machines, maintainability, and stuff like that,

Thanks


I'd start by pre-processing your string2, which you say is constant. Working with an object is better than keep searching in the string:

var s = "horse:a,pig:b,cow:z,monkey:g,goat:a";
var hash = {};
var tokens = s.split(',');
for(var i=0;i<tokens.length;i++){
    var a = tokens[i].split(':');
    hash[a[0]] = a[1];
}

Next, when you get the string, you have easier time looking up the letters (you may also want to check for if(letter), if you get a new animal in string1):

var histogram = {};
var string1 = "horse|cow|goat|zebra";
var animals = string1.split("|");
for(var i=0;i<animals.length;i++){
    var letter = hash[animals[i]];
    if (!histogram[letter])
        histogram[letter] = 0;
    histogram[letter]++;
}

As per your question, you could probably abuse regular expression to count the letters, but it isn't parallel, but linear at best, and probably complex enough not to worthwhile.


A few tips that may increase performance:

  • Define all variables once at the beginning of the script
  • Calculate string length once at the beginning of the loop
  • Use strict comparison operator (===) when applicable


For the record, you can use a regex to get the histogram in 3 statements:

var letters = "horse:a,pig:b,cow:z,monkey:g,goat:a";
var string1 = "horse|cow|goat|zebra";

var h = {};
var regex = new RegExp("\\b(?:" + string1 + "):(\\w+)", "ig");
letters.replace(regex, function(g0, g1){h[g1] = (h[g1] || 0) + 1;});

This has many levels of abuse, namely, the use of replace as an iterator (ignoring the result and having side-effects in the callback), and noting that string1 sort-of looks like a regex already, with | as separators, and it doesn't seem to contain other regex meta-characters.

0

精彩评论

暂无评论...
验证码 换一张
取 消