开发者

Wildcard string matching

开发者 https://www.devze.com 2022-12-17 08:09 出处:网络
What is the most efficient wildcard string matching algorithm? I am asking only about an idea, it is not necessary to provide actual code.

What is the most efficient wildcard string matching algorithm? I am asking only about an idea, it is not necessary to provide actual code.

I'm thinking that such algorithm can be built with sorted suffix arrays, whic开发者_运维知识库h can yield performance of O(log(n)).

Am I correct?

Edited:

I mean patterns like "A*B", "*sip*" or "A?B" where star means any number of symbols and question mark means single symbol.


There is a paper covering the fastest options here http://swtch.com/~rsc/regexp/regexp1.html in particular it allows you avoid naive algorithms that become pathologically slow when long patterns are used.

It covers generic regular expressions but you can limit your implementation to the subset you require.


Hm, I think that normal pattern matching rules would apply here. Usually, since you have a stream of data and short patterns, you would not need to implement something more efficient than linear. However, the longer the pattern gets, the more room there is for optimization.

What kind of wildcard do you have in mind? a one-character-wildcard (e.g. . in regex), or a multiple-character-wildcard (e.g. .*)? Are there limitations? What is the expected pattern length, and do you have random or serial access to the data to be checked?


I was looking for a simple wildcard matching algorithm which runs in polynomial time. E.g. this one is simple, but doesn't run in polynomial time when the pattern contains many stars (*): http://www.codeproject.com/Articles/188256/A-Simple-Wildcard-Matching-Function Below is the code which uses dynamic programming to reduce the time complexity to O(n*m) where n is the length of the text and m is the length of the pattern.

#include <string>
#include <vector>
#include <algorithm>
using namespace std;

const int UNKNOWN = -1;
const int NOMATCH = 0;
const int MATCHES = 1;

class Wildcard {
    string _text;
    string _pattern;
    vector<vector<int>> _mf;
    int F(int n, int m) {
        if (_mf[n][m] >= 0) return _mf[n][m];
        if (n == 0 && m == 0) {
            _mf[n][m] = MATCHES;
            return _mf[n][m];
        }
        if (n > 0 && m == 0) {
            _mf[n][m] = NOMATCH;
            return _mf[n][m];
        }
        // m > 0
        int ans = NOMATCH;
        if (_pattern[m - 1] == '*') {
            ans = max(ans, F(n, m-1));
            if (n > 0) {
                ans = max(ans, F(n - 1, m));
            }
        }
        if (n > 0) {
            if (_pattern[m - 1] == '?' || _pattern[m - 1] == _text[n - 1]) {
                ans = max(ans, F(n - 1, m - 1));
            }
        }
        _mf[n][m] = ans;
        return _mf[n][m];
    }
public:
    bool match(string text, string pattern) {
        _text = text;
        _pattern = pattern;
        _mf.clear();
        for (int i = 0; i <= _text.size(); i++) {
            _mf.push_back(vector<int>());
            for (int j = 0; j <= _pattern.size(); j++) {
                _mf[i].push_back(UNKNOWN); // not calculated
            }
        }
        int ans = F(_text.size(), _pattern.size());
        return ans == MATCHES;
    }
};


If your pattern can contain only * wild cards, then the trivial implementation is as fast as possible. The main thing to realize in this case, is that you only need to look for each card once (card = segment between stars).

Here is an implementation (supporting only * wild cards):

#include <cstddef>
#include <cstring>
#include <algorithm>
#include <string>
#include <vector>
#include <iostream>

using namespace std;

class wildcard_pattern {
public:
    explicit wildcard_pattern(const string& text);

    bool match(const char* begin, const char* end) const;

    bool match(const char* c_str) const;

private:
    string m_text;

    struct card {
        size_t m_offset, m_size;
        card(size_t begin, size_t end);
    };

    // Must contain at least one card. The first, and the last card
    // may be empty strings. All other cards must be non-empty. If
    // there is exactly one card, the pattern matches a string if, and
    // only if the string is equal to the card. Otherwise, the first
    // card must be a prefix of the string, and the last card must be
    // a suffix.
    vector<card> m_cards;
};


wildcard_pattern::wildcard_pattern(const string& text):
    m_text(text)
{
    size_t pos = m_text.find('*');
    if (pos == string::npos) {
        m_cards.push_back(card(0, m_text.size()));
        return;
    }
    m_cards.push_back(card(0, pos));
    ++pos;
    for (;;) {
        size_t pos_2 = m_text.find('*', pos);
        if (pos_2 == string::npos)
            break;
        if (pos_2 != pos)
            m_cards.push_back(card(pos, pos_2));
        pos = pos_2 + 1;
    }
    m_cards.push_back(card(pos, m_text.size()));
}

bool wildcard_pattern::match(const char* begin, const char* end) const
{
    const char* begin_2 = begin;
    const char* end_2   = end;

    size_t num_cards = m_cards.size();
    typedef string::const_iterator str_iter;

    // Check anchored prefix card
    {
        const card& card = m_cards.front();
        if (size_t(end_2 - begin_2) < card.m_size)
            return false;
        str_iter card_begin = m_text.begin() + card.m_offset;
        if (!equal(begin_2, begin_2 + card.m_size, card_begin))
            return false;
        begin_2 += card.m_size;
    }

    if (num_cards == 1)
        return begin_2 == end_2;

    // Check anchored suffix card
    {
        const card& card = m_cards.back();
        if (size_t(end_2 - begin_2) < card.m_size)
            return false;
        str_iter card_begin = m_text.begin() + card.m_offset;
        if (!equal(end_2 - card.m_size, end_2, card_begin))
            return false;
        end_2 -= card.m_size;
    }

    // Check unanchored infix cards
    for (size_t i = 1; i != num_cards-1; ++i) {
        const card& card = m_cards[i];
        str_iter card_begin = m_text.begin() + card.m_offset;
        str_iter card_end   = card_begin + card.m_size;
        begin_2 = search(begin_2, end_2, card_begin, card_end);
        if (begin_2 == end_2)
            return false;
        begin_2 += card.m_size;
    }

    return true;
}

inline bool wildcard_pattern::match(const char* c_str) const
{
    const char* begin = c_str;
    const char* end   = begin + strlen(c_str);
    return match(begin, end);
}

inline wildcard_pattern::card::card(size_t begin, size_t end)
{
    m_offset = begin;
    m_size   = end - begin;
}


int main(int, const char* argv[])
{
    wildcard_pattern pat(argv[1]);
    cout << pat.match(argv[2]) << endl;
}


The performance will not just depend on the length of the string to search but also on the number (and type) of wildcards in the query string. If you are allowed to use a * which matches any number of characters, up to and including the entire document, and you can have any number of stars, this will put some limits on what is possible to get.

If you can determine a match some string foo in O(f(n)) time, then the query foo_0*foo_1*foo_2*...*foo_m will take O(m*f(n)) time where m is the number of * wildcards.


Depending on the wildcard "language", I'd (probably) simply write a wildcard->regexp compiler and use the (commonly provided) regexp engine to do the actual matching. It's a bit lazy, but unless there's an actual performance problem doing it that way, it'd be fast enough.


You could convert your wildcard query into a regular expression and use that to match; an RE can always be transformed into a DFA (discrete finite automaton) and those are efficient (lineair time) and a small constant.


O(n log m) is correct. See http://www.cs.bris.ac.uk/Publications/Papers/2000602.pdf

Hope this helps...


This is a simple implementation in C that makes use of pointers and attempts to do a single traverse of the string, and wildcard pattern where possible to gain the most efficiency in the average simple case. Note it also avoids any functions that you would use for strings like "length()" (may differ depending on the language) that may traverse the string and add unwanted computation time.

#include <stdio.h>

_Bool wildcard_strcmp(char *line, char *pattern)
{
    _Bool wildcard = 0;
    char *placeholder;

    do
    {
        if ((*pattern == *line) || (*pattern == '?'))
        {
            line++;
            pattern++;
        }
        else if (*pattern == '*')
        {
            if (*(++pattern) == '\0')
            {
                return 1;
            }
            wildcard = 1;
        }
        else if (wildcard)
        {
            if (pattern == placeholder)
            {
                line++;
            }
            else
            {
                pattern = placeholder;
            }
        } 
        else
        {
            return 0;
        }
    } while (*line);

    if (*pattern == '\0')
    {
        return 1;
    }
    else
    {
        return 0;
    }
}

int main()
{
    char string[200] = "foobarfoobar";
    char pattern[200] = "fo?*barfoo*";

    if (wildcard_strcmp(string, pattern))
    {
        printf("Match\n");
    }
    else
    {
        printf("No Match\n");
    }

    return 0;
}
0

精彩评论

暂无评论...
验证码 换一张
取 消