Utf8 correct regex for CamelCase (WikiWord) in perl

开发者 https://www.devze.com 2023-03-12 02:58 Source: web

There was an earlier question about a CamelCase regex. Combining it with tchrist's post, I'm wondering what the correct UTF-8 CamelCase regex is.

Starting with (brian d foy's) regex:

/
    \b          # start at word boundary
    [A-Z]       # start with upper
    [a-zA-Z]*   # followed by any alpha

    (?:  # non-capturing grouping for alternation precedence
       [a-z][a-zA-Z]*[A-Z]   # next bit is lower, any zero or more, ending with upper
          |                     # or 
       [A-Z][a-zA-Z]*[a-z]   # next bit is upper, any zero or more, ending with lower
    )

    [a-zA-Z]*   # anything that's left
    \b          # end at word boundary
/x

and modifying to:

/
    \b          # start at word boundary
    \p{Uppercase_Letter}     # start with upper
    \p{Alphabetic}*          # followed by any alpha

    (?:  # non-capturing grouping for alternation precedence
       \p{Lowercase_Letter}[a-zA-Z]*\p{Uppercase_Letter}   ### next bit is lower, any zero or more, ending with upper
          |                  # or 
       \p{Uppercase_Letter}[a-zA-Z]*\p{Lowercase_Letter}   ### next bit is upper, any zero or more, ending with lower
    )

    \p{Alphabetic}*          # anything that's left
    \b          # end at word boundary
/x

I have a problem with the lines marked '###': they still use the ASCII-only class [a-zA-Z].
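To make the problem concrete, here is a minimal sketch comparing the pattern as posted with a hypothetical variant that uses \p{Alphabetic}* on the '###' lines (the variant and the test words are my own assumptions, not part of the original question). Because the \p{Alphabetic}* on either side absorbs most non-ASCII letters, the ASCII-only class seems to bite only when a letter that is neither upper nor lower case sits between the case changes:

```perl
use strict;
use warnings;

# The modified pattern as posted: the '###' lines still use [a-zA-Z]*.
my $as_posted = qr/
    \b \p{Uppercase_Letter} \p{Alphabetic}*
    (?: \p{Lowercase_Letter} [a-zA-Z]* \p{Uppercase_Letter}
      | \p{Uppercase_Letter} [a-zA-Z]* \p{Lowercase_Letter}
    )
    \p{Alphabetic}* \b
/x;

# Hypothetical variant: identical, except \p{Alphabetic}* replaces [a-zA-Z]*.
my $all_unicode = qr/
    \b \p{Uppercase_Letter} \p{Alphabetic}*
    (?: \p{Lowercase_Letter} \p{Alphabetic}* \p{Uppercase_Letter}
      | \p{Uppercase_Letter} \p{Alphabetic}* \p{Lowercase_Letter}
    )
    \p{Alphabetic}* \b
/x;

# "Ab", then U+6587 (a caseless CJK letter), then "C".
my $caseless_inside = "Ab\x{6587}C";
# U+00C9 / U+00E9 are E-acute in upper and lower case.
my $accented = "\x{c9}cole\x{c9}t\x{e9}";

my $posted_caseless  = $caseless_inside =~ /\A$as_posted\z/   ? 1 : 0;
my $unicode_caseless = $caseless_inside =~ /\A$all_unicode\z/ ? 1 : 0;
my $posted_accented  = $accented        =~ /\A$as_posted\z/   ? 1 : 0;

print "as posted,   caseless letter inside: $posted_caseless\n";
print "all unicode, caseless letter inside: $unicode_caseless\n";
print "as posted,   accented letters only:  $posted_accented\n";
```

The accented word still matches the posted pattern because the lowercase-to-uppercase change is adjacent, so [a-zA-Z]* can match the empty string there.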

In addition, how should the regex be modified if numbers and the underscore are treated as equivalent to lowercase letters, so that W2X3 is a valid CamelCase word?

Update (following ysth's comment):

In what follows, "any" means "uppercase or lowercase or number or underscore".

The regex should match words such as CamelWord and CaW, built as follows:

start with uppercase letter
optional any
lowercase letter or number or underscore
optional any
upper case letter
optional any
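The list above can be sketched almost literally as a pattern. This is only an illustration under two assumptions of mine: "number" is read as \p{Decimal_Number} (\p{Nd}), and "uppercase"/"lowercase" are read as the \p{Uppercase}/\p{Lowercase} properties. The variable names are hypothetical.

```perl
use strict;
use warnings;

# "any": uppercase, lowercase, number or underscore.
# Assumption: "number" means \p{Nd} (Decimal_Number).
my $any = qr/[\p{Uppercase}\p{Lowercase}\p{Nd}_]/;

# Lowercase letter, number or underscore (all treated as "lowercase").
my $lowerish = qr/[\p{Lowercase}\p{Nd}_]/;

my $camel = qr/
    \b
    \p{Uppercase}    # start with uppercase letter
    $any*            # optional any
    $lowerish        # lowercase letter or number or underscore
    $any*            # optional any
    \p{Uppercase}    # uppercase letter
    $any*            # optional any
    \b
/x;

for my $word (qw(CamelWord CaW W2X3 Camel ABC camelCase)) {
    my $verdict = $word =~ /\A$camel\z/ ? "matches" : "does not match";
    print "$word $verdict\n";
}
```

With this reading, CamelWord, CaW and W2X3 match, while Camel, ABC and camelCase do not.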

Please do not mark this as a duplicate, because it is not one. The original question (and its answers) considered only ASCII.

I really can’t tell what you’re trying to do, but this should be closer to what your original intent seems to have been. I still can’t tell what you mean to do with it, though.

m{
    \b
    \p{Upper}      #  start with uppercase code point (NOT LETTER)

    \w*            #  optional ident chars 

    # note that upper and lower are not related to letters
    (?:  \p{Lower} \w* \p{Upper}
      |  \p{Upper} \w* \p{Lower}
    )

    \w*

    \b
}x
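As a quick way to see what the pattern above does, here is a small sketch that pulls candidate words out of a string. The sample sentence and the variable names are made up for illustration; \x{c9} and \x{e9} are upper- and lowercase E-acute.

```perl
use strict;
use warnings;

binmode(STDOUT, ':encoding(UTF-8)');   # the sample contains non-ASCII letters

# Same pattern as above, stored in a hypothetical variable for reuse.
my $wikiword = qr{
    \b
    \p{Upper}
    \w*
    (?:  \p{Lower} \w* \p{Upper}
      |  \p{Upper} \w* \p{Lower}
    )
    \w*
    \b
}x;

my $text  = "See CamelCase, W2X3 and \x{c9}t\x{e9}\x{c9}t\x{e9}, but not Camel or ABC.";
my @found = $text =~ /($wikiword)/g;

print "$_\n" for @found;
```

This finds CamelCase and the accented word. W2X3 is not found, because the pattern requires at least one \p{Lower} code point and digits are not lowercase; treating digits and the underscore as lowercase would need an explicit class for them.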

Never use [a-z]. And in fact, don’t use \p{Lowercase_Letter} or \p{Ll}, since those are not the same as the more desirable and more correct \p{Lowercase} and \p{Lower}.
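A short sketch of why these differ. The code points are my own picks for illustration: U+2170 (SMALL ROMAN NUMERAL ONE) and U+2160 (ROMAN NUMERAL ONE) both have General_Category Letter_Number, yet carry the Lowercase and Uppercase properties respectively, and U+00E9 (e-acute) is lowercase but outside [a-z].

```perl
use strict;
use warnings;

my $small_roman_one = "\x{2170}";   # Nl, but has the Lowercase property
my $roman_one       = "\x{2160}";   # Nl, but has the Uppercase property
my $e_acute         = "\x{e9}";     # Ll, not in the ASCII range

my %result = (
    lower_property => ($small_roman_one =~ /\A\p{Lower}\z/            ? 1 : 0),
    lower_letter   => ($small_roman_one =~ /\A\p{Lowercase_Letter}\z/ ? 1 : 0),
    upper_property => ($roman_one       =~ /\A\p{Upper}\z/            ? 1 : 0),
    upper_letter   => ($roman_one       =~ /\A\p{Uppercase_Letter}\z/ ? 1 : 0),
    ascii_range    => ($e_acute         =~ /\A[a-z]\z/                ? 1 : 0),
    lower_e_acute  => ($e_acute         =~ /\A\p{Lower}\z/            ? 1 : 0),
);

print "$_ => $result{$_}\n" for sort keys %result;
```

The property forms match the Roman numerals while the general-category forms do not, and [a-z] misses the accented letter that \p{Lower} accepts.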

And remember that \w is really just an alias for

[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Letter_Number}\p{Connector_Punctuation}]
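To illustrate, the sketch below checks one sample code point per property against both \w and the bracketed class above. The sample code points are my own choices, all above U+00FF so that Unicode rules apply without extra pragmas; U+2013 (EN DASH) is included as a counterexample.

```perl
use strict;
use warnings;

my $class = qr/[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Letter_Number}\p{Connector_Punctuation}]/;

my @samples = (
    [ 'Alphabetic U+03B1 (Greek alpha)'          => "\x{3b1}"  ],
    [ 'Mark U+0301 (combining acute)'            => "\x{301}"  ],
    [ 'Decimal_Number U+0663 (Arabic-Indic 3)'   => "\x{663}"  ],
    [ 'Letter_Number U+2160 (Roman numeral one)' => "\x{2160}" ],
    [ 'Connector_Punctuation U+203F (undertie)'  => "\x{203f}" ],
    [ 'Dash_Punctuation U+2013 (en dash)'        => "\x{2013}" ],
);

my %word_char;   # label => 1 if \w matches
my %in_class;    # label => 1 if the bracketed class matches
for my $sample (@samples) {
    my ($label, $char) = @$sample;
    $word_char{$label} = $char =~ /\A\w\z/      ? 1 : 0;
    $in_class{$label}  = $char =~ /\A$class\z/  ? 1 : 0;
    print "$label: \\w=$word_char{$label} class=$in_class{$label}\n";
}
```

The first five samples match both \w and the class; the en dash matches neither.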
