Why is this regex not working for German words?

I am trying to break the following sentence into words and wrap each word in a span.

<p class="german_p big">Das ist ein schönes Armband</p>

I followed this: How to get a word under cursor using JavaScript?

$('p').each(function() {
    var $this = $(this);
    $this.html($this.text().replace(/\b(\w+)\b/g, "<span>$1</span>"));
});

The only problem I am facing is that, after wrapping the words in spans, the resulting HTML looks like this:

<p class="german_p big"><span>Das</span> <span>ist</span> <span>ein</span> <span>sch</span>ö<span>nes</span> <span>Armband</span>.</p>

So schönes is broken into three pieces: sch, ö, and nes. Why is this happening? What would be the correct regex for this?

Unicode in Javascript Regexen

Like Java itself, Javascript doesn't support Unicode in its \w, \d, and \b regex shortcuts. This is (arguably) a bug in Java and Javascript. Even if one manages through casuistry or obstinacy to argue that it is not a bug, it's sure a big gotcha. Kinda bites, really.

The problem is that those popular regex shortcuts only apply to 7-bit ASCII whether in Java or in Javascript. This restriction is painfully 1970s‐ish; it makes absolutely no sense in the 21ˢᵗ century. This blog posting from this past March makes a good argument for fixing this problem in Javascript.

It would be really nice if some public-spirited soul would please add Javascript to this Wikipedia page that compares the support for regex features in various languages.

This page says that Javascript doesn't support any Unicode properties at all. That same site has a table that's a lot more detailed than the Wikipedia page I mention above. For Javascript features, look under its ECMA column.

However, that table is in some cases at least five years out of date, so I can't completely vouch for it. It's a good start, though.

Unicode Support in Other Languages

Ruby, Python, Perl, and PCRE all offer ways to extend \w to mean what it is supposed to mean, but the two J‐thingies do not.

In Java, however, there is a good workaround available. There, you can use \pL to mean any character that has the Unicode General_Category=Letter property. That means you can always emulate a proper \w using [\pL\p{Nd}_].

Indeed, there's even an advantage to writing it that way, because it keeps you aware that you're adding decimal numbers and the underscore character to the character class. With a simple \w, people sometimes forget this is going on.

I don't believe that this workaround is available in Javascript, though. You can also use Unicode properties like those in Perl and PCRE, and in Ruby 1.9, but not in Python.

The only Unicode properties current Java supports are the one- and two-character general properties like \pN and \p{Lu} and the block properties like \p{InAncientSymbols}, but not scripts like \p{IsGreek}, etc.

The future JDK7 will finally get around to adding scripts. Even then Java still won't support most of the Unicode properties, though, not even critical ones like \p{WhiteSpace} or handy ones like \p{Dash} and \p{Quotation_Mark}.

SIGH! To understand just how limited Java's property support is, merely compare it with Perl. Perl supports 1633 Unicode properties as of 2007's 5.10 release, and 2478 of them as of this year's 5.12 release. I haven't counted them for ancient releases, but Perl started supporting Unicode properties back during the last millennium.

Lame as Java is, it's still better than Javascript, because Javascript doesn't support any Unicode properties whatsoCENSOREDever. I'm afraid that Javascript's paltry 7-bit mindset makes it pretty close to unusable for Unicode. This is a tremendously huge gaping hole in the language that's extremely difficult to account for given its target domain.

Sorry 'bout that. ☹

You can also use

/\b([äöüÄÖÜß\w]+)\b/g

instead of

/\b(\w+)\b/g

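in order to handle the umlauts. Note that \b is still ASCII-only here, so this only works for words that begin and end with an ASCII letter; a word such as Öl or Fuß will still be split.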
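A quick check of the umlaut character class, as a minimal sketch. The helper name wrapGermanWords is hypothetical and the inputs are made up for illustration. It suggests the pattern handles umlauts in the middle of a word, while \b still trips on words that start or end with one:

```javascript
// Hypothetical helper: wraps each match of the umlaut-aware pattern in a span.
function wrapGermanWords(text) {
  return text.replace(/\b([äöüÄÖÜß\w]+)\b/g, "<span>$1</span>");
}

// Umlaut in the middle of the word: \b sits next to ASCII letters, so it works.
console.log(wrapGermanWords("ein schönes Armband"));
// <span>ein</span> <span>schönes</span> <span>Armband</span>

// Umlaut at the start: \b does not see a boundary before "Ö".
console.log(wrapGermanWords("Öl"));
// Ö<span>l</span>

// "ß" at the end: \b does not see a boundary after it.
console.log(wrapGermanWords("Fuß"));
// <span>Fu</span>ß
```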

\w only matches A-Z, a-z, 0-9, and _ (underscore).
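To see this in isolation, here is a small sketch that runs without jQuery or a DOM. It applies the question's regex to the plain sentence (without the trailing period) and shows that ö is treated as a non-word character, which is why \b finds a boundary on each side of it:

```javascript
const sentence = "Das ist ein schönes Armband";

// \w is ASCII-only, so a plain "o" matches but "ö" does not.
console.log(/\w/.test("o")); // true
console.log(/\w/.test("ö")); // false

// "ö" is a non-word character, so \b matches both before and after it.
const wrapped = sentence.replace(/\b(\w+)\b/g, "<span>$1</span>");
console.log(wrapped);
// <span>Das</span> <span>ist</span> <span>ein</span> <span>sch</span>ö<span>nes</span> <span>Armband</span>
```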

You could use something like \S+ to match all non-space characters, including non-ASCII characters like ö. This might or might not work depending on how the rest of your string is formatted.
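As a rough sketch of the \S+ approach (the helper name wrapNonSpaceRuns is made up for this example), note how trailing punctuation ends up inside the span, which is the formatting caveat mentioned above:

```javascript
// Hypothetical helper: wraps every run of non-whitespace characters in a span.
// "$&" in the replacement string stands for the whole match.
function wrapNonSpaceRuns(text) {
  return text.replace(/\S+/g, "<span>$&</span>");
}

console.log(wrapNonSpaceRuns("Das ist ein schönes Armband."));
// <span>Das</span> <span>ist</span> <span>ein</span> <span>schönes</span> <span>Armband.</span>
```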

Reference: http://www.javascriptkit.com/javatutors/redev2.shtml

To include all the Latin 1 Supplement characters like äöüßÒÿ you can use:

[\w\u00C0-\u00ff]

However, there are even more funny characters in the Latin Extended-A and Latin Extended-B Unicode blocks, like ČŇů. To include those you can use:

[\w\u00C0-\u024f]
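A sketch of how that character class might be used for the word-wrapping task. The names latinWord and wrapLatinWords are hypothetical. Since the class is matched greedily with +, \b can be dropped altogether, which sidesteps its ASCII-only behaviour:

```javascript
// ASCII \w plus everything from Latin-1 Supplement letters through Latin Extended-B.
const latinWord = /[\w\u00C0-\u024F]+/g;

// Hypothetical helper: wraps each run of "Latin word characters" in a span.
function wrapLatinWords(text) {
  return text.replace(latinWord, "<span>$&</span>");
}

console.log(wrapLatinWords("Öl und Fuß, Čeština."));
// <span>Öl</span> <span>und</span> <span>Fuß</span>, <span>Čeština</span>.
```

One caveat: the range U+00C0 to U+00FF also contains × (U+00D7) and ÷ (U+00F7), which are not letters, so they would be treated as word characters too.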

\w and \b are not Unicode-aware in JavaScript; they only match ASCII word characters and the boundaries between them. If your use cases all allow splitting on whitespace, you can use \s and \S, which are Unicode-aware.

As others note, the \w shortcut is not very useful for non-ASCII characters. If you need to match other text ranges you should use hex* notation (Ref1) (Ref2) for the appropriate range.

* could be hex or octal or unicode, you'll often see these collectively referred to as hex notation.

The \b's will also not work correctly. It is possible to use the XRegExp library's \p{L} token for Unicode support; however, there is still no \b support, so you won't be able to find the word boundaries. It would be nice to provide \b support by doing lookbehinds/lookaheads with \P{L}, as in the following implementation:

http://blog.stevenlevithan.com/archives/mimic-lookbehind-javascript
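For engines that implement ES2018, which postdates this discussion, Unicode property escapes (with the u flag) and lookbehind are available natively, so the \b emulation described above can be sketched without a library. This is only an illustration under that assumption; the names unicodeWord and wrapUnicodeWords are hypothetical, and the definition of a word character (letter, decimal digit, or underscore) is borrowed from the [\pL\p{Nd}_] idea mentioned for Java:

```javascript
// Assumes an ES2018-capable engine (Unicode property escapes and lookbehind).
// A "word character" here is any Unicode letter, decimal digit, or underscore.
// The negative lookbehind and lookahead stand in for a Unicode-aware \b.
const unicodeWord = /(?<![\p{L}\p{Nd}_])[\p{L}\p{Nd}_]+(?![\p{L}\p{Nd}_])/gu;

// Hypothetical helper: wraps each Unicode word in a span.
function wrapUnicodeWords(text) {
  return text.replace(unicodeWord, "<span>$&</span>");
}

console.log(wrapUnicodeWords("Das ist ein schönes Armband."));
// <span>Das</span> <span>ist</span> <span>ein</span> <span>schönes</span> <span>Armband</span>.

console.log(wrapUnicodeWords("Öl"));
// <span>Öl</span>
```

Because the character class is matched greedily, the lookarounds are redundant for this particular replacement; they are kept to show where a boundary check would go in a pattern that needs one.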

While JavaScript's regex shortcuts don't support Unicode natively, you could use this library to work around it: http://xregexp.com/