One more greedy sed question

Developers https://www.devze.com 2023-01-24 20:45 Source: the web
I'm doing an automated download of a number of images using an HTML frame source. So far, so good: sed, wget. Example of the frame source:

<td width="25%" align="center" valign="top"><a href="images/display.htm?concept_Core.jpg"><img border="1" src="t_core.gif" width="120" height="90"><font size="1" face="Verdana"><br>Hyperspace Core<br>(Rob Cunningham)</font></a></td>

So I do this:

sed -n -e 's/^.*htm?\(.*jpg\).*$/\1/p' concept.htm

to get the part which looks like this:

concept_Core.jpg

and then do this:

wget --base=/some/url/concept_Core.jpg

But there is one nasty line. That line is obviously a bug in the site, or whatever it may be. It is wrong, but I can't change it. ;)

<td width="25%" bla bla face="Verdana"><a href="images/display.htm?concept_frigate16.jpg" target="_top"><img bla bla href="images/concept_frigate16.jpg" target="_top"><br>Frigate 16<br>

That is, there are two occurrences of "concept_frigate16.jpg" on one line. And my script gives me

concept_frigate16.jpg" target="_top"><img border="1" src="t_assaultfrigate.gif" width="120" height="90" alt="The '16' in the name may be a Sierra typo."></a><a href="images/concept_frigate16.jpg

You understand why. Sed is greedy and this obviously shows up in this case.

Now the question is, how do I get rid of this corner case? That is, make it non-greedy and make it stop at the FIRST .jpg?


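Use perl, which supports the non-greedy .*? quantifier. In Perl syntax the capture group uses unescaped parentheses, the literal question mark must be escaped, and the captured text is $1:

perl -ne 'print if s/^.*htm\?(.*?jpg).*$/$1/' concept.htm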


You might want to consider changing:

\(.*jpg\)

into:

\([^"]*jpg\)

This should stop your initial search going beyond the end of the first href. Whether that will introduce other problems (for other edge cases) is a little difficult to say given I don't know the full set of inputs.

If it does, you may want to opt for using a real parser rather than regexes. Regexes are a powerful tool but they're not necessarily suited for everything.


Use [^"] instead of . in the regular expression. This matches any character except a double quote, so the match cannot run past the end of the first href value.


sed -n -e 's/^.*htm?\([^"]*jpg\).*$/\1/p'
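
To see the difference between the two sed patterns, both can be run against one line shaped like the problem row. This is a minimal sketch: the sample line is an abbreviated stand-in, not the real page, and the variable names are hypothetical.

```shell
# Abbreviated stand-in for the problem row (two "jpg" occurrences on one line).
line='<td><a href="images/display.htm?concept_frigate16.jpg" target="_top"><img src="t.gif"></a><a href="images/concept_frigate16.jpg" target="_top"><br>Frigate 16<br>'

# Original pattern: .* is greedy, so the capture runs to the LAST "jpg" on the line.
greedy=$(printf '%s\n' "$line" | sed -n -e 's/^.*htm?\(.*jpg\).*$/\1/p')

# Bracket expression: [^"]* cannot cross the closing quote of the first href.
fixed=$(printf '%s\n' "$line" | sed -n -e 's/^.*htm?\([^"]*jpg\).*$/\1/p')

printf 'greedy: %s\nfixed:  %s\n' "$greedy" "$fixed"
```

The fix relies on the file name sitting inside a double-quoted attribute; a page that uses single quotes or no quotes at all would need a different character class.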


GNU grep can do PCRE:

grep -Po '(?<=\.htm\?).*?jpg' concept.htm
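
As a rough sketch of how the grep output could feed the download step, the names can be turned into a URL list. The base URL and the two-line sample file below are assumptions made up for illustration, and grep needs to be a GNU build with PCRE support for -P to work.

```shell
# Hypothetical base URL and a throwaway sample file; neither comes from the question.
base='http://example.com/images'
sample=$(mktemp)
cat > "$sample" <<'EOF'
<td><a href="images/display.htm?concept_Core.jpg"><img src="t_core.gif"></a></td>
<td><a href="images/display.htm?concept_frigate16.jpg" target="_top"><img src="t.gif"></a><a href="images/concept_frigate16.jpg" target="_top"></td>
EOF

# The lookbehind starts the match right after ".htm?"; .*? stops at the first "jpg".
# sort -u drops duplicate names, and sed prefixes each name with the base URL.
urls=$(grep -Po '(?<=\.htm\?).*?jpg' "$sample" | sort -u | sed "s|^|$base/|")
printf '%s\n' "$urls"
rm -f "$sample"

# To download, the list could be piped to wget, which reads URLs from stdin with "-i -":
#   printf '%s\n' "$urls" | wget -i -
```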