I want to retrieve the board name for a 4chan thread using this pattern:
echo $(cat ~/Desktop/test.html | sed -n "s/<title>\(.*\) - />\1</p")
test.html contains:
<link rel="shortcut icon" href="http://static.4chan.org/image/favicon.ico" /><link rel="stylesheet" type="text/css" href="http://static.4chan.org/css/yotsuba.9.css" title="Yotsuba"><link rel="alternate stylesheet" type="text/css" href="http://static.4chan.org/css/yotsublue.9.css" title="Yotsuba B"><link rel="alternate stylesheet" type="text/css" href="http://static.4chan.org/css/futaba.9.css" title="Futaba"><link rel="alternate stylesheet" type="text/css" href="http://static.4chan.org/css/burichan.9.css" title="Burichan"><title>/b/ - Random</title>
I want to match /b/, but instead it just removes "<title>" and "-" like so:
<link rel="shortcut icon" href="http://static.4chan.org/image/favicon.ico" /><link rel="stylesheet" type="text/css" href="http://static.4chan.org/css/yotsuba.9.css" title="Yotsuba"><link rel="alternate stylesheet" type="text/css" href="http://static.4chan.org/css/yotsublue.9.css" title="Yotsuba B"><link rel="alternate stylesheet" type="text/css" href="http://static.4chan.org/css/futaba.9.css" title="Futaba"><link rel="alternate stylesheet" type="text/css" href="http://static.4chan.org/css/burichan.9.css" title="Burichan">>/b/<Random</title>
Why?
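Because that's all you told it to substitute: sed replaces only the part of the line that the pattern matches and leaves the rest untouched. If you want to remove everything from the beginning of the line to the end, the pattern has to cover the whole line, so anchor the ends with ^ and $ and match all the characters in between.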
Something like this:
sed -n "s/.*<title>\([^<>]*\) - .*/\1/p" ~/Desktop/test.html
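To see the difference side by side, here is a minimal sketch that runs both patterns on a shortened line. The sample line and the variable names (line, original, board) are made up for illustration and are not the real contents of test.html; the second command also spells out the ^ and $ anchors, which are optional here because the leading and trailing .* already reach both ends of the line.

```shell
# Shortened, hypothetical stand-in for the single long line in test.html
line='<link rel="stylesheet" href="x.css" title="Yotsuba"><title>/b/ - Random</title>'

# Pattern from the question: only the matched "<title>/b/ - " is replaced,
# so everything before and after it stays in the output
original=$(printf '%s\n' "$line" | sed -n 's/<title>\(.*\) - />\1</p')

# Pattern that covers the whole line: everything outside the capture group
# is part of the match and is therefore dropped by the replacement
board=$(printf '%s\n' "$line" | sed -n 's/^.*<title>\([^<>]*\) - .*$/\1/p')

printf '%s\n' "$original"
printf '%s\n' "$board"
```

The first printf shows the leftover markup around >/b/<, while the second prints only the board name.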
Your problem is that your regular expression doesn't match the beginning of the string (in my case the leading ".*" does this) or the end of the string (again, in my case it's the ".*" at the end).