SED - unable to execute some commands on UTF-8 encoded chars

Developer https://www.devze.com 2023-02-27 11:23 Source: Internet

I got a file that looks like this:

<text top="123" left="45" width="50" height="17" font="8">Måndag</text>

As noted in the topic, this file is encoded in utf-8. When using this command:

cat file | sed 's_.*top="\([0-9][0-9]*\)" left="\([0-9][0-9]*\)".*>\(.*\)<.*_\1 \2 \3_'

it never completes the execution and prints nothing.

However executing a line like this one:

cat file | sed 's/å/FOO/'

gives me a correct output:

<text top="123" left="45" width="50" height="17" font="8">MFOOndag</text>

Is this a bug in sed or is there something wrong with my regex or the way that I'm using it? What I want is a neat way to extract the top, left and content data without involving too many commands.

The easiest way to do this reliably is just to use perl in place of sed:

bash$ perl -CSAD -pe 's/foo/bar/g'

That will allow Unicode in your arguments, your std streams, and all files you process.

Not all seds are built to handle UTF-8. I would look at the source to see if any relevant patches have been applied. FTR, Red Hat-derived seds do handle UTF-8 properly.
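If you have to stay with sed, a workaround worth trying is to force the C locale so that sed works on raw bytes and skips its multibyte handling. This is only a sketch: it assumes the hang comes from sed's multibyte code path, which the question does not confirm, and it replaces the greedy .* around the content with bracket expressions of my own choosing.

```shell
# Sample line from the question.
line='<text top="123" left="45" width="50" height="17" font="8">Måndag</text>'

# LC_ALL=C makes sed treat the input as plain bytes for this one command.
# -n plus the p flag prints only lines where the substitution succeeded.
out=$(printf '%s\n' "$line" |
  LC_ALL=C sed -n 's_.*top="\([0-9][0-9]*\)" left="\([0-9][0-9]*\)"[^>]*>\([^<]*\)<.*_\1 \2 \3_p')

echo "$out"   # prints: 123 45 Måndag
```

Because the bytes of å pass through untouched, the output is still valid UTF-8. The trade-off is that in the C locale a dot matches a single byte, not a whole character, so patterns that count characters would behave differently.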

Try this suggestion. Looks like it could work for you.