Sed: how to replace the newline (\n) character in text files?

Developer https://www.devze.com 2023-02-28 01:35 Source: the web
I need to fix an error and to replace the second tag </time> with </tags> in an XML file with the following structure:

<time>20260664</time>
<tags>substancesummit ss</time>
<geo>asdsadsa</geo>
<time>20260664</time>
<tags>substancesummit ss</time>
<geo>asdsadsa</geo>

I'm trying to do it with sed, and since I have two </time> closing tags per item, my idea is to replace </time><geo> with </tags><geo>.

However, there is a newline character between the two tags, so I'm using \n, but it doesn't work:

sed 's/time>\n<geo>/tags>\n<geo>/g' old.xml > new.xml

Any help?


You can do that in a single sed command like this:

sed '/<\/time>/I{n;:A;N;h;/<geo>/I!{H;bA};/<geo>/I{g;s/<\/time>/<\/tags>/i}}' file.txt

Testing

If your input file.txt is like this:

<time>20260664</time>
<tags>substancesummit ss
</time>

<Geo>asdsadsa</geo>
<time>30260664</time>
<tags>substancesummit st</timE>
<geo>bsdsadsa</geo>

Then output of above command will be:

<time>20260664</time>
<tags>substancesummit ss
</tags>

<Geo>asdsadsa</geo>
<time>30260664</time>
<tags>substancesummit st</tags>
<geo>bsdsadsa</geo>

It covers multiple newline characters (\r or \n) in any combination between </time> and <geo>.

PS: The sed command above does a case-insensitive search and replace (the I address modifier and the i flag on the s command are GNU sed extensions). If you don't want that, just remove those flags from the command, or let me know.


Use this:

$ sed -n '1h; 1!H; $ {g; s/<\/time>\n<geo>/<\/tags>\n<geo>/g; p;}' file
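
To make the steps of that one-liner easier to follow, here is a small self-contained sketch of the same hold-space idea. The names sample and fix_slurp are made up for illustration. One deliberate difference from the command above: the newline is captured and reused as \1 instead of being written as \n in the replacement, on the assumption that not every sed expands \n there.

```shell
# Sample input from the question: two items, both with the wrong closing tag.
sample() {
  printf '%s\n' \
    '<time>20260664</time>' \
    '<tags>substancesummit ss</time>' \
    '<geo>asdsadsa</geo>' \
    '<time>20260664</time>' \
    '<tags>substancesummit ss</time>' \
    '<geo>asdsadsa</geo>'
}

# fix_slurp (hypothetical name): read stdin, collect all of it, fix it once.
#   1h      first line: copy it into the hold space
#   1!H     every later line: append it to the hold space after a newline
#   ${...}  last line: g copies the hold space back into the pattern space,
#           s repairs every </time> that is directly followed by \n<geo>,
#           p prints the result (-n suppresses all other output)
fix_slurp() {
  sed -n '1h; 1!H; ${g; s/<\/time>\(\n<geo>\)/<\/tags>\1/g; p;}'
}

sample | fix_slurp
```

Because the whole file ends up in memory, this is probably fine for small XML files but may not suit very large ones.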


If there is a character that definitely does not occur in the file, replace every \n with it, do your sed work, and then translate it back. tr works really well for that:

cat old.txt | tr '\n' '#' | sed 's/time>#<geo>/tags>#<geo>/g' | tr '#' '\n' > new.txt

I use # as the replacement character.
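
Since this trick silently corrupts the file if # does occur somewhere in it, a guard may be worth adding. The following is only a sketch: the function name fix_with_placeholder and the temporary file are illustrative and not part of the original answer.

```shell
# fix_with_placeholder (hypothetical name): $1 is the input file,
# the repaired text is written to stdout.
fix_with_placeholder() {
  # Refuse to run if the placeholder already occurs in the data.
  if grep -q '#' "$1"; then
    echo "'#' already occurs in $1; choose another placeholder" >&2
    return 1
  fi
  # Join all lines with '#', fix the tag, then split the lines again.
  tr '\n' '#' < "$1" | sed 's/time>#<geo>/tags>#<geo>/g' | tr '#' '\n'
}

# Example run on a throwaway copy of the sample from the question.
tmp=$(mktemp)
printf '%s\n' \
  '<time>20260664</time>' \
  '<tags>substancesummit ss</time>' \
  '<geo>asdsadsa</geo>' > "$tmp"
fix_with_placeholder "$tmp"
rm -f "$tmp"
```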


sed normally works on one line at a time, and it is a bit harder to make it handle several lines at once, as you are trying to do. Instead, how about fixing the broken lines more directly, with something like this:

/<tags>/ s@</time>@</tags>@

This will replace </time> with </tags> only on lines which also contain <tags>. Note that I used @ instead of / as the delimiter for the substitution command, to avoid a need to escape the slashes in the XML we're trying to replace.
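
As a complete invocation, that script could be wrapped as follows. This is a minimal sketch; the function name fix_tags_line is made up, and it assumes each <tags> element sits on a single line, as in the question's sample.

```shell
# fix_tags_line (hypothetical name): reads XML on stdin and writes the
# repaired XML to stdout. Only lines containing <tags> are touched.
fix_tags_line() {
  sed '/<tags>/ s@</time>@</tags>@'
}

# Example run on the sample from the question.
printf '%s\n' \
  '<time>20260664</time>' \
  '<tags>substancesummit ss</time>' \
  '<geo>asdsadsa</geo>' | fix_tags_line
```

For the files in the question this would be run as fix_tags_line < old.xml > new.xml.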


You can use awk instead:

$ awk -vRS="</geo>" '{gsub(/<\/time>.<geo>/,"</tags>\n<geo>")}1' ORS="</geo>" file
<time>20260664</time>
<tags>substancesummit ss</tags>
<geo>asdsadsa</geo>
<time>20260664</time>
<tags>substancesummit ss</tags>
<geo>asdsadsa</geo>

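First, I can see that </geo> ends each block, so make it the record separator (RS). After that, substitute what is required. Lastly, put </geo> back as the output record separator (ORS). Note that this relies on a multi-character RS, which gawk supports but POSIX leaves unspecified.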


Why don't you sidestep the problem of matching the line breaks? Instead, match the line with the opening <tags> tag and the content after it, up to the (non-)matching </time> tag. Like

# untested, written from scratch
sed 's/<tags>\(.*\)<\/time>/<tags>\1<\/tags>/g' infile > outfile


sed -e 's,<\([^>]*\)>\([^<]*\)</[^>]*>,<\1>\2</\1>,g' tags.xml

Within a single line, this replaces

(opening tag)(content)(closing tag)

with

(opening tag)(content)(closing tag)

where the new closing tag always has the same name as the opening tag.

It can fail if more than one tag-pair is found in the file.

In detail, it searches for a '<', followed by the tag name up to (but not including) the closing '>', followed by the content, which is everything up to the next '<', followed by a closing tag of any name.
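
A quick way to try this substitution is to wrap it in a function and feed it the sample from the question. This is only a sketch and the name match_closing_tags is made up.

```shell
# match_closing_tags (hypothetical name): reads stdin and rewrites each
# closing tag so that it repeats the text of the opening tag before it
# on the same line.
match_closing_tags() {
  sed -e 's,<\([^>]*\)>\([^<]*\)</[^>]*>,<\1>\2</\1>,g'
}

# Example run on the sample from the question.
printf '%s\n' \
  '<time>20260664</time>' \
  '<tags>substancesummit ss</time>' \
  '<geo>asdsadsa</geo>' | match_closing_tags
```

One caveat worth knowing: \1 captures everything between '<' and '>', so an opening tag that carries attributes would have those attributes copied into the closing tag as well. The sample in the question has no attributes, so it is not affected.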
