
regular expressions: find every word that appears exactly one time in my document

Developer https://www.devze.com 2023-03-25 17:43 Source: Internet
Trying to learn regular expressions. As practice, I'm trying to find every word that appears exactly once in my document. In linguistics this is called a hapax legomenon (http://en.wikipedia.org/wiki/Hapax_legomenon).

So I thought the following expression would give me the desired result:

\w{1}

But this doesn't work. \w matches a single character, not a whole word. Also, it does not appear to be giving me characters that appear only once (it actually returns 25873 matches, which I assume are all the alphanumeric characters). Can someone give me an example of how to find hapax legomena with a regular expression?
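
To see why \w{1} returns so many matches, here is a minimal Python sketch. The sample text is made up for illustration; the point is only that \w{1} is equivalent to \w and matches one word character at a time, while \w+ matches a whole run of them.

```python
import re

text = "goat goat bird"  # hypothetical sample text, not the asker's document

# \w{1} is the same as \w: every match is a single word character.
single_chars = re.findall(r"\w{1}", text)

# \w+ matches a run of one or more word characters, i.e. a whole word.
whole_words = re.findall(r"\w+", text)

print(single_chars)  # one entry per letter
print(whole_words)   # one entry per word, repeats included
```

Neither pattern says anything about how often a word occurs; the quantifier only controls how many characters a single match consumes.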


If you're trying to do this as a learning exercise, you picked a very hard problem :)

First of all, here is the solution:

\b(\w+)\b(?<!\b\1\b.*\b\1\b)(?!.*\b\1\b)

Now, here is the explanation:

  • We want to match a word. This is \b\w+\b - a run of one or more (+) word characters (\w), with a 'word break' (\b) on either side. A word break happens between a word character and a non-word character, so this will match between (e.g.) a word character and a space, or at the beginning and the end of the string. We also capture the word into a backreference by using parentheses ((...)). This means we can refer to the match itself later on.

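  • Next, we want to exclude the possibility that this word has already appeared in the string. This is done by using a negative lookbehind - (?<! ... ). A negative lookbehind fails if its contents match the text ending at the current position. So we want the match to fail if the word we have matched has already appeared. We do this by using a backreference (\1) to the already captured word. The pattern inside the lookbehind is \b\1\b.*\b\1\b - two copies of the current word, separated by any amount of text (.*). Because the lookbehind ends right after the word we have just matched, the second copy is that word itself and the first copy is an earlier occurrence. Note that .* inside a lookbehind requires an engine that supports variable-length lookbehind, such as .NET.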

  • Finally, we don't want to match if there is another copy of this word anywhere in the rest of the string. We do this by using negative lookahead - (?! ... ). Negative lookaheads don't match if their contents match at this point in the string. We want to match the current word after any amount of string, so we use (.*\b\1\b). Since . does not match a newline by default, a multi-line document needs RegexOptions.Singleline (in .NET) so that both assertions can see the whole text.

Here is an example (using C#):

var s = "goat goat leopard bird leopard horse";

foreach (Match m in Regex.Matches(s, @"\b(\w+)\b(?<!\b\1\b.*\b\1\b)(?!.*\b\1\b)"))
    Console.WriteLine(m.Value);

Output:

bird
horse
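
For engines without variable-length lookbehind, the same two checks can be split up. The following is a rough Python sketch, not the pattern above: Python's re module rejects .* inside a lookbehind, so the regex keeps only the negative lookahead and the "has it appeared before?" test is done in ordinary code. The function name hapaxes_two_step and the use of re.DOTALL (so that . crosses line breaks) are assumptions made for this illustration.

```python
import re

def hapaxes_two_step(text):
    """Return the words that occur exactly once, in order of appearance."""
    # Step 1: lookahead only. This matches a word that has no later copy,
    # i.e. the last occurrence of every distinct word.
    last_occurrence = re.compile(r"\b(\w+)\b(?!.*\b\1\b)", re.DOTALL)
    result = []
    for m in last_occurrence.finditer(text):
        word = m.group(1)
        # Step 2: stands in for the negative lookbehind. Reject the word
        # if a copy of it also appears somewhere before this match.
        earlier = re.search(r"\b" + re.escape(word) + r"\b", text[:m.start()])
        if earlier is None:
            result.append(word)
    return result

print(hapaxes_two_step("goat goat leopard bird leopard horse"))
# prints ['bird', 'horse']
```

A word survives only if it is both the last and the first occurrence of itself, which is the same condition the lookahead and lookbehind express together.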


It can be done in a single regex if your regex engine supports infinite repetition inside lookbehind assertions (e.g. .NET):

Regex regexObj = new Regex(
    @"(       # Match and capture into backreference no. 1:
     \b       # (from the start of the word)
     \p{L}+   # a succession of letters
     \b       # (to the end of a word).
    )         # End of capturing group.
    (?<=      # Now assert that the preceding text contains:
     ^        # (from the start of the string)
     (?:      # (Start of non-capturing group)
      (?!     #  Assert that we can't match...
       \b\1\b #  the word we've just matched.
      )       #  (End of lookahead assertion)
      .       #  Then match any character.
     )*       # Repeat until...
     \1       # we reach the word we've just matched.
    )         # End of lookbehind assertion.
    # We now know that we have just matched the first instance of that word.
    (?=       # Now look ahead to assert that we can match the following:
     (?:      # (Start of non-capturing group)
      (?!     #  Assert that we can't match again...
       \b\1\b #  the word we've just matched.
      )       #  (End of lookahead assertion)
      .       #  Then match any character.
     )*       # Repeat until...
     $        # the end of the string.
    )         # End of lookahead assertion.", 
    RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);
Match matchResults = regexObj.Match(subjectString);
while (matchResults.Success) {
    // matched text: matchResults.Value
    // match start: matchResults.Index
    // match length: matchResults.Length
    matchResults = matchResults.NextMatch();
} 


If you are trying to match an English word, the best form is:

[a-zA-Z]+

The problem with \w is that it also includes _ and numeric digits 0-9.

If you need to include other characters, you can append them after the Z but before the ]. Or, you might need to normalize the input text first.
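
The difference between the two character classes is easy to see on a small input. This is a Python sketch with made-up sample strings; the apostrophe is just one example of a character you might append to the class.

```python
import re

sample = "snake_case has 42 legs"  # hypothetical sample text

# \w+ treats underscores and digits as part of a word.
word_chars = re.findall(r"\w+", sample)
print(word_chars)    # ['snake_case', 'has', '42', 'legs']

# [a-zA-Z]+ keeps ASCII letters only, so it splits at '_' and drops '42'.
letters_only = re.findall(r"[a-zA-Z]+", sample)
print(letters_only)  # ['snake', 'case', 'has', 'legs']

# Extra characters go after the Z and before the closing bracket.
with_apostrophe = re.findall(r"[a-zA-Z']+", "don't stop")
print(with_apostrophe)  # ["don't", 'stop']
```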

Now, if you want a count of all words, a regex alone can't give you that, and finding only the words that appear once needs engine features such as variable-length lookbehind that many engines lack. In general you'll need to write some counting logic, backed by some sort of in-memory structure (or, for very large texts, a database) to keep track of the counts. After you parse and count the whole text, you can pick out the words that have a count of 1.
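
A minimal Python sketch of that parse-then-count approach, assuming the text fits in memory and that words are runs of ASCII letters. The function name hapaxes_by_count is hypothetical, and the comparison is case-sensitive; lower-casing the text first would be one way to normalize it.

```python
import re
from collections import Counter

def hapaxes_by_count(text):
    """Return the words that occur exactly once, in order of first appearance."""
    # Parse: pull out every run of ASCII letters.
    words = re.findall(r"[a-zA-Z]+", text)
    # Count: Counter is a dict mapping each word to its number of occurrences.
    counts = Counter(words)
    # Select: keep only the words seen exactly once.
    return [word for word, n in counts.items() if n == 1]

print(hapaxes_by_count("goat goat leopard bird leopard horse"))
# prints ['bird', 'horse']
```

This makes a single pass over the text plus a pass over the distinct words, so it stays fast on long documents where a lookaround-heavy regex may rescan the text for every word.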


(\w+){1} will match each word (the {1} is redundant, so \w+ is enough). After that you could always perform the count on the matches.


Higher level solution:

Create an array of your matches:

preg_match_all("/([a-zA-Z]+)/", $text, $matches, PREG_PATTERN_ORDER);

Let PHP count your array elements:

$tmp_array = array_count_values($matches[1]);

Iterate over the tmp array and check the word count:

foreach ($tmp_array as $word => $count) {
    // A count of 1 means the word appears exactly once.
    if ($count === 1) {
        echo $word . "\n";
    }
}


Low level but does what you want:

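Split your text into an array of whitespace-separated tokens (split() was removed in PHP 7, so use preg_split):

$array = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);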

Iterate over that array:

foreach ($array as $word) { ... }

Skip any token that is not made up entirely of letters:

if (!preg_match('/^[a-zA-Z]+$/', $word)) continue;

Add the word to a temporary array as key:

if (!isset($tmp_array[$word])) $tmp_array[$word] = 0;
$tmp_array[$word]++;

After the loop, iterate over the tmp array and check the word count:

foreach ($tmp_array as $word => $count) {
    // A count of 1 means the word appears exactly once.
    if ($count === 1) {
        echo $word . "\n";
    }
}