
Any faster, simpler alternative to php preg_match

Developer https://www.devze.com 2023-01-27 20:11 Source: Internet

I am using CakePHP 1.3 and I have a textarea where users submit articles. On submit, I want to scan the article for certain keywords and add the respective tags to the article.

I was thinking of preg_match, but its pattern has to be a string, so I would have to loop through a (big) array of keywords.

Is there an easier way to plug the keywords array into the pattern?

I appreciate all your help.

Thanks.


I suggest treating your array of keywords like a hash table. Lowercase the article text, explode by spaces, then loop through each word of the exploded array. If the word exists in your hash table, push it to a new array while keeping track of the number of times it's been seen.

I ran a quick benchmark comparing regex to hash tables in this scenario. To run it with regex 1000 times, it took 17 seconds. To run it with a hash table 1000 times, it took 0.4 seconds. It should be an O(n+m) process.

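$keywords = array("computer", "dog", "sandwich");
$article = "This is a test using your computer when your dog is being a dog";
// Lowercase the text so matching is case-insensitive, then split on spaces.
$arr = explode(" ", strtolower($article));
$tracker = array(); // keyword => number of occurrences

foreach ($arr as $word) {
    // Only count words that are in the keyword list.
    if (in_array($word, $keywords)) {
        if (isset($tracker[$word])) {
            $tracker[$word]++;
        } else {
            $tracker[$word] = 1;
        }
    }
}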

The $tracker array would then contain: "computer" => 1, "dog" => 2. From there you can apply whatever rule you like to decide which tags to use. Or, if you don't care how many times a keyword appears, you can skip the tracker part and add the tags as the keywords appear.

EDIT: The keyword array may need to be an inverted index array (keyed by the keyword itself) to ensure the fastest lookup. in_array() does a linear search through the array, so with a big keyword list this isn't as fast as it should be. An inverted index array would look like

array("computer" => 1, "dog" => 1, "sandwich" => 1); // "1" can be any value

Then you would use isset($keywords[$word]) instead of in_array() to check whether the word matches a keyword. PHP arrays are hash tables, so a lookup by key is O(1) on average.
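
Here is a rough sketch of the keyed-lookup idea, in case it helps. The function name count_keywords() is made up for this example, and splitting on non-alphanumeric characters (rather than on single spaces) is my own assumption so that "dog." still counts as "dog".

```php
<?php
// Hypothetical helper (not part of CakePHP): counts keyword occurrences
// using array keys for membership tests instead of in_array().
function count_keywords($article, array $keywords)
{
    // array_flip() turns array("computer", "dog") into
    // array("computer" => 0, "dog" => 1), so isset() becomes a hash lookup.
    $lookup = array_flip(array_map('strtolower', $keywords));
    $tracker = array();

    // Assumption: split on anything that is not a letter, digit or apostrophe,
    // so trailing punctuation does not hide a match.
    $words = preg_split("/[^a-z0-9']+/", strtolower($article), -1, PREG_SPLIT_NO_EMPTY);

    foreach ($words as $word) {
        if (isset($lookup[$word])) {
            $tracker[$word] = isset($tracker[$word]) ? $tracker[$word] + 1 : 1;
        }
    }
    return $tracker;
}

$keywords = array("computer", "dog", "sandwich");
$article  = "This is a test using your computer when your dog is being a dog.";
$tracker  = count_keywords($article, $keywords);
print_r($tracker);
```

The lookup table only has to be built once per keyword list, so it can be reused for every submitted article.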


If you don't need the power of regular expressions, you should just use strpos().

You will still need to loop through the array of words, but strpos is much, much faster than preg_match.
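
A minimal sketch of that loop, assuming you only need to know whether each keyword occurs at all. The function name find_keywords_strpos() is invented for illustration, and I used stripos() (the case-insensitive variant) on the assumption that case should not matter for tagging.

```php
<?php
// Hypothetical helper: returns the keywords that occur anywhere in the text.
function find_keywords_strpos($article, array $keywords)
{
    $found = array();
    foreach ($keywords as $keyword) {
        // Compare with !== false: a match at offset 0 returns 0, which is falsy.
        if (stripos($article, $keyword) !== false) {
            $found[] = $keyword;
        }
    }
    return $found;
}

$article = "Dog owners often eat a sandwich at the computer.";
$tags = find_keywords_strpos($article, array("computer", "dog", "sandwich", "cat"));
print_r($tags);
```

Keep in mind that this matches substrings, so a keyword like "cat" would also be found inside "category".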


Of course, you could try matching all the keywords using one single regexp, like /word1|word2|word3/, but I'm not sure it is what you are looking for. And also I think it would be quite heavy and resource-consuming.

Instead, you can try a different approach, such as splitting the text into words and checking whether each word is interesting or not. I would use str_word_count(), with something like:

$text = 'this is my string containing some words, some of the words in this string are duplicated, some others are not.';
$words_freq = array_count_values(str_word_count($text, 1));

This splits the text into words and counts the occurrences of each. The words are the keys of $words_freq (the values are the counts), so you can check a single keyword with isset($words_freq[$keyword]), or get all matches at once with array_intersect(array_keys($words_freq), $my_keywords).

If, as I guess, you don't care about the case of the keywords, you can strtolower() the whole text before splitting it into words.
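
Putting those pieces together, something along these lines should work. The sample text and the $my_keywords list are made up for the example, and array_intersect_key() is just one possible way to keep the counts alongside the matched words.

```php
<?php
$text = 'My dog sat on the computer. The DOG did not eat my sandwich, the other dog did.';
$my_keywords = array('computer', 'dog', 'sandwich', 'cat');

// Lowercase first so "DOG" and "dog" count as the same word.
// With format 1, str_word_count() returns the words as an array.
$words_freq = array_count_values(str_word_count(strtolower($text), 1));

// The words are the keys of $words_freq, so intersect on keys;
// array_flip() turns the keyword list into keys as well.
$matches = array_intersect_key($words_freq, array_flip($my_keywords));
print_r($matches);
```

The result keeps the order in which the words first appear in the text, with each matched keyword mapped to its count.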

Of course, the only way to determine which approach is best is to set up some tests, running the various search functions against some "representative" and fairly long text and measuring the execution time and resource usage (try microtime(TRUE) and memory_get_peak_usage() to benchmark this).

EDIT: I cleaned up the code a bit and added a missing semicolon :)
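
As a starting point, a very small timing harness could look like the sketch below. The benchmark() function, the sample text, and the iteration count are all assumptions for illustration, and closures require PHP 5.3 or later. Timings will differ per machine, so treat the printed numbers as relative only.

```php
<?php
// Hypothetical harness: runs a callable $iterations times and reports
// the elapsed wall-clock time plus the last return value.
function benchmark($callback, $iterations)
{
    $result = null;
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $result = call_user_func($callback);
    }
    return array('seconds' => microtime(true) - $start, 'result' => $result);
}

$keywords = array('computer', 'dog', 'sandwich');
$text = str_repeat('This is a test using your computer when your dog is being a dog. ', 200);

$lookup  = array_flip($keywords);
$pattern = '/\b(?:' . implode('|', array_map('preg_quote', $keywords)) . ')\b/';

// Candidate 1: split into words, then test each word with a keyed lookup.
$hash = benchmark(function () use ($text, $lookup) {
    $count = 0;
    foreach (str_word_count($text, 1) as $word) {
        if (isset($lookup[$word])) {
            $count++;
        }
    }
    return $count;
}, 100);

// Candidate 2: one alternation regex over the whole text.
$regex = benchmark(function () use ($text, $pattern) {
    return preg_match_all($pattern, $text, $m);
}, 100);

printf("hash lookup: %.4fs, regex: %.4fs, peak memory: %d bytes\n",
    $hash['seconds'], $regex['seconds'], memory_get_peak_usage());
```

Both candidates should report the same number of hits; if they don't, the comparison isn't measuring the same thing.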


If you want to look for multiple words from an array, then combine said array into a regular expression:

 $regex_array = implode("|", array_map("preg_quote", $array));
 preg_match_all("/($regex_array)/", $src, $tags);

This converts your array into /(word|word|word|word|word|...)/. The array_map and preg_quote part is optional, only needed if $array might contain regex special characters. Note that preg_quote() only escapes the / delimiter if you pass it as the second argument.

Avoid strpos and loops for this case. preg_match is faster when searching for alternatives.
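
A slightly fuller sketch of the single-regex approach follows. The function name find_keywords_regex() is hypothetical, and the \b word boundaries and the i flag are my own additions, on the assumption that you want whole-word, case-insensitive matches.

```php
<?php
// Hypothetical helper: finds keywords with one alternation pattern
// and returns keyword => number of occurrences.
function find_keywords_regex($src, array $keywords)
{
    $escaped = array();
    foreach ($keywords as $keyword) {
        // Pass the delimiter so a "/" inside a keyword is escaped too.
        $escaped[] = preg_quote($keyword, '/');
    }
    // \b restricts matches to whole words; the i flag ignores case.
    $pattern = '/\b(?:' . implode('|', $escaped) . ')\b/i';

    preg_match_all($pattern, $src, $matches);
    // $matches[0] holds every full match; normalise case and count them.
    return array_count_values(array_map('strtolower', $matches[0]));
}

$src  = "Your Dog sat on the computer while the dog ate a sandwich.";
$tags = find_keywords_regex($src, array("computer", "dog", "sandwich", "tcp/ip"));
print_r($tags);
```

The keys of the returned array are the tags to add. Make sure the keyword list is not empty, since an empty alternation would match at every word boundary.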


strtr()

If given two arguments, the second should be an array in the form array('from' => 'to', ...). The return value is a string where all the occurrences of the array keys have been replaced by the corresponding values. The longest keys will be tried first. Once a substring has been replaced, its new value will not be searched again.
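
To illustrate the two rules quoted above (longest keys first, no re-scanning of replaced text), here is a small made-up example. Note that strtr() rewrites the text rather than reporting which keys matched, so it suits marking keywords up better than collecting tags.

```php
<?php
$replacements = array(
    'dog'   => '[dog]',
    'dogs'  => '[dogs]',
    // This key only appears in already-replaced output, so it is never applied.
    '[dog]' => 'never used',
);

// "dogs" wins over "dog" at the start of the string because it is longer.
$result = strtr('dogs chase a dog', $replacements);
echo $result, "\n";
```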


Add tags manually? Just like we add tags here at SO.
