awk sed or regex insert substring and change case

Developer (https://www.devze.com), 2023-03-01 00:53. Source: Internet
I am doing some transformations on a tab-separated file in which one column contains a hierarchical identifier like this:

VI.d5.5
VII.b2.1
VII.b2.2
VII.b2.3
VII.c1

I need to transform it to look like the following, inserting an up-cased letter from the second dot group between the first and second:

VI.D.d5.5
VII.B.b2.1
VII.B.b2.2
VII.B.b2.3
VII.C.c1

I know about the \U flag in sed, but I don't know how to apply it only once. For example, the following up-cases both the inserted letter and the original lower-case one, which is not what I want:

echo 'VII.b1.1' | sed -e 's/\([a-h]\)/\U\1.\1/'
VII.B.B1.1

I would welcome any shell (sed, awk, perl, whatever) or vim solution that would allow me to modify this column in place in the tab-separated file.


Have you tried \u instead of \U? According to the sed info page (info sed):

`\U'
     Turn the replacement to uppercase until a `\L' or `\E' is found,

`\u'
     Turn the next character to uppercase,
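
A minimal sketch of the difference, assuming GNU sed (\u and \U are GNU extensions and may be missing from other sed implementations). The function name upcase_once is made up for illustration:

```shell
# Assumes GNU sed: \u upper-cases only the next character of the replacement.
upcase_once() {
    # Capture the first letter in a-h, emit it upper-cased, then a dot,
    # then the captured letter again unchanged.
    sed -e 's/\([a-h]\)/\u\1.\1/'
}

echo 'VII.b1.1' | upcase_once
# with GNU sed this prints: VII.B.b1.1
```

Because \u affects a single character, no \E is needed to end the conversion before the second \1.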


sed -e 's/\.[a-z]/\U&\E&/'

Perl works well too:

perl -pe 's/\.[a-z]/uc($&) . $&/e'


You can’t do that in standard sed(1), because there is no such thing as \u or \U there. Indeed, on all my systems (but one) it fails — and silently, too, alas! I tried the sed version both on my Mac laptop and my Mac desktop, and then I tried it on our Solaris server and on our OpenBSD server. I tried it on the lone AIX box too, and of course it didn’t work there. :(

However, you should be able to do it portably this way, which works on those systems I tested:

% cat /tmp/sample
VI.d5.5
VII.b2.1
VII.b2.2
VII.b2.3
VII.c1

% perl -wpe 's/([^.]+)\.(.)/$1.\u$2.$2/' /tmp/sample 
VI.D.d5.5
VII.B.b2.1
VII.B.b2.2
VII.B.b2.3
VII.C.c1

Not only is that more portable, it’s a lot easier, too.

That should work on any version of Perl released in the last 20 years, including perl4. However, if you’re living on the bleeding edge and so have at least 5.10 installed, then you can do it in this way instead:

% perl -M5.10.0 -wpe 's/[^.]+\.\K(?=(.))/\u$1./' /tmp/sample
VI.D.d5.5
VII.B.b2.1
VII.B.b2.2
VII.B.b2.3
VII.C.c1

That ‑M5.10.0 is just to make sure you really have the 5.10 feature-set available and loaded.

What about Unicode?

Now suppose that your sample data had Unicode in it:

% cat /tmp/sample.utf8
Ⅵ.ð5.5
Ⅷ.ß2.3
Ⅺ.ç1

% uniquote /tmp/sample.utf8 
\N{U+2165}.\N{U+F0}5.5
\N{U+2167}.\N{U+DF}2.3
\N{U+216A}.\N{U+E7}1

% uniquote -v /tmp/sample.utf8
\N{ROMAN NUMERAL SIX}.\N{LATIN SMALL LETTER ETH}5.5
\N{ROMAN NUMERAL EIGHT}.\N{LATIN SMALL LETTER SHARP S}2.3
\N{ROMAN NUMERAL ELEVEN}.\N{LATIN SMALL LETTER C WITH CEDILLA}1

I can guarantee you that you aren’t going to find a version of sed that does the right thing on that data. It will mess up. I went to our sacrificial Linux box, and although the ɢɴᴜsed they use there works on your sample data, it refused to casemap one of those characters in my fancier Unicode dataset, even when I had the locale all set up right. But the perl version still did the right thing.

But with perl, just add the ‑CSD command-line options to tell perl that the datafiles and std{in,out,err} are all in UTF‑8, then run the same commands and you will see something that’s really Qᴜɪᴛᴇ Iɴᴛᴇʀᴇsᴛɪɴɢ:

% perl -CSD -wpe 's/([^.]+)\.(.)/$1.\u$2.$2/' /tmp/sample.utf8
Ⅵ.Ð.ð5.5
Ⅷ.Ss.ß2.3
Ⅺ.Ç.ç1

% perl -CSD -wpe 's/[^.]+\.\K(?=(.))/\u$1./' /tmp/sample.utf8
Ⅵ.Ð.ð5.5
Ⅷ.Ss.ß2.3
Ⅺ.Ç.ç1

% perl -CSD -wpe 's/[^.]+\.\K(?=(.))/\U$1./' /tmp/sample.utf8
Ⅵ.Ð.ð5.5
Ⅷ.SS.ß2.3
Ⅺ.Ç.ç1

As you see, there is a difference between the titlecasing that \u does and the uppercasing that \U does. That’s because the lowercase letter “ß” is “Ss” in titlecase but “SS” in uppercase. Bizarre but true! This sort of thing admittedly happens a lot more with the Greek letters than it does with the Latin ones like we use, but you still want to do it right.
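
To see the two mappings side by side, here is a small sketch that assumes perl is installed and uses the ucfirst and uc functions, which implement \u and \U respectively. The helper name case_forms is only illustrative:

```shell
# Print the titlecase and uppercase forms of each input line.
# -CSD tells perl that stdin/stdout and files are UTF-8, as in the commands above.
case_forms() {
    perl -CSD -ne 'chomp; print ucfirst($_), " ", uc($_), "\n"'
}

# \303\237 are the UTF-8 bytes of U+00DF (LATIN SMALL LETTER SHARP S).
printf '\303\237\n' | case_forms
# prints: Ss SS
```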

Here that is all uniquoted so you can see just which code points we’re talking about:

% perl -CSD -wpe 's/[^.]+\.\K(?=(.))/\u$1./' /tmp/sample.utf8 | uniquote
\N{U+2165}.\N{U+D0}.\N{U+F0}5.5
\N{U+2167}.Ss.\N{U+DF}2.3
\N{U+216A}.\N{U+C7}.\N{U+E7}1

% perl -CSD -wpe 's/[^.]+\.\K(?=(.))/\u$1./' /tmp/sample.utf8 | uniquote -v
\N{ROMAN NUMERAL SIX}.\N{LATIN CAPITAL LETTER ETH}.\N{LATIN SMALL LETTER ETH}5.5
\N{ROMAN NUMERAL EIGHT}.Ss.\N{LATIN SMALL LETTER SHARP S}2.3
\N{ROMAN NUMERAL ELEVEN}.\N{LATIN CAPITAL LETTER C WITH CEDILLA}.\N{LATIN SMALL LETTER C WITH CEDILLA}1

Isn’t that way cool?


Try using \u instead of \U; it turns only the next character to uppercase. If you want to use \U, you have to stop the uppercasing with \E or \L, like this:

's/\([a-h]\)/\U\1\E.\1/'


sed -e 's/\([^.]\+\)\.\(.\)/\1.\u\2\.\2/'

like this:

$ sed -e 's/\([^.]\+\)\.\(.\)/\1.\u\2\.\2/' <<<'VI.d5.5'
VI.D.d5.5


Here's an awk solution. No messy regular expressions needed. Basic idea: split each line on the dot, take the first character of the 2nd field, upper-case it with the toupper() function, and then prepend it (followed by a dot) to the 2nd field.

awk -F"." '{
    ch = toupper(substr($2,1,1))
    $2=ch"."$2
}1' OFS="." file
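
The command above splits the whole line on dots, so it assumes the identifier is the only dotted text on the line. As a hedged sketch for the tab-separated case in the question, the same idea can be limited to one column. The function name insert_upper_in_column and the column number are assumptions made for illustration:

```shell
# Hypothetical helper: $1 is the 1-based index of the identifier column.
insert_upper_in_column() {
    awk -F'\t' -v OFS='\t' -v col="$1" '{
        n = index($col, ".")              # position of the first dot in that column
        if (n > 0 && n < length($col))    # skip values with nothing after the dot
            $col = substr($col, 1, n) toupper(substr($col, n + 1, 1)) "." substr($col, n + 1)
    } 1'
}

printf 'row1\tVI.d5.5\tx\nrow2\tVII.c1\ty\n' | insert_upper_in_column 2
# prints the same rows with column 2 changed to VI.D.d5.5 and VII.C.c1
```

Standard awk does not edit files in place, so one option is to redirect the output to a temporary file and then move it over the original.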