Ruby Regex: Return just the match

Developer https://www.devze.com 2023-03-21 17:52 Source: web

When I do

puts /<title>(.*?)<\/title>/.match(html)

I get

<h2>foobar</h2>

But I want just

foobar

What's the most elegant method for doing so?

The most elegant way would be to parse HTML with an HTML parser:

require 'nokogiri'

html  = '<title><h2>Pancakes</h2></title>'
doc   = Nokogiri::HTML(html)
title = doc.at('title').text
# title is now 'Pancakes'

If you try to do this with a regular expression, you will probably fail. For example, if you have an <h2> in your <title> what's to prevent you from having something like this:

<title><strong>Where</strong> is <span>pancakes</span> <em>house?</em></title>

Trying to handle something like that with a single regex is going to be ugly, but doc.at('title').text handles it as easily as it handles <title>Pancakes</title> or <title><h2>Pancakes</h2></title>.
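To make the "ugly" part concrete, here is a rough sketch of what a regex-only approach gives you for the nested title above. The tag-stripping gsub is my own naive shortcut, not a recommended technique, and the variable names are just for illustration:

```ruby
# The nested example from above.
html = '<title><strong>Where</strong> is <span>pancakes</span> <em>house?</em></title>'

# The capture group returns everything between the title tags,
# inner markup included.
inner = html[/<title>(.*?)<\/title>/, 1]

# Naive follow-up (an assumption-laden shortcut): delete anything
# shaped like a tag. This happens to work on this input, but it does
# not decode entities and will misfire on comments or on attribute
# values that contain '>'.
text = inner.gsub(/<[^>]*>/, '')

puts inner
puts text
```

It gets the text out of this particular string, but every new shape of input needs another patch to the pattern, which is exactly the work a parser already does for you.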

Regular expressions are great tools but they shouldn't be the only tool in your toolbox.

Something like this will return just the contents of the first capture group instead of the whole match:

html[/<title>(.*?)<\/title>/,1]
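A short sketch of the difference, assuming a made-up sample string for html: puts on a MatchData prints the entire match, while passing a capture index to String#[] (or indexing the MatchData) returns only that group.

```ruby
# Hypothetical sample input, assumed for illustration.
html = '<html><head><title>foobar</title></head></html>'

pattern = /<title>(.*?)<\/title>/

# MatchData#to_s (what puts prints) is the entire match, tags included.
whole = pattern.match(html).to_s

# Index 1 selects the first capture group only.
via_index = html[pattern, 1]

# Equivalent: ask the MatchData for group 1, guarding against no match.
m = pattern.match(html)
via_match = m && m[1]

# When nothing matches, String#[] returns nil instead of raising.
missing = 'no title here'[pattern, 1]

puts whole
puts via_index
```

The String#[] form is handy for one-liners because a missing match just gives nil, so there is no MatchData to check before indexing.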

Maybe you need to tell us more, such as what html might contain. As written, your regex captures the contents of the title element regardless of any internal tags. I think that is the right way to do it, rather than assuming there is one internal tag you want to handle: what would happen if you had two internal tags? This is why everyone is telling you to use an HTML parser, which you really should do.