I've got a string with very unclean HTML. Before I parse it, I want to convert this:
<TABLE><TR><TD width="33%" nowrap=1><font size="1" face="Arial">
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
</font> </TD>
into NE DEK 143
so it is a bit easier to parse. I've got this regular expression (RegexKitLite):
NSString *str = [dataString stringByReplacingOccurrencesOfRegex:@"<TABLE><TR><TD width=\"33%\" nowrap=1><font size=\"1\" face=\"Arial\">(.+?)<\\/font> <\\/TD>(.+?)<TD width=\"33%\" nowrap=1><font size=\"1\" face=\"Arial\">(.+?)<\\/font> <\\/TD>(.+?)<TD width=\"33%\" nowrap=1><font size=\"1\" face=\"Arial\">(.+?)<\\/font> <\\/TD>(.+?)<\\/TR><\\/TABLE>"
withString:@"$1 $3 $5"];
I'm not an expert in regex. Can someone help me out here?
Regards, dodo
Amarghosh and bobince (the winning answerer of the linked question) are generally right about this. However, since you are just sanitising the input, regexps are actually fine here.
First, strip the tags:
Then collapse all extra spaces into one:
s/\s+/ /g
Then remove leading/trailing space:
Then get the values:
^([^ ]+) ([^ ]+) ([^ ]+)$
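The four steps above can be sketched in Python as follows. This is only an illustration, not RegexKitLite code: the sample string assumes the three cells contain NE, DEK and 143 inside the font tags, the tag-stripping pattern `<[^>]*>` is my own assumption, and `extract_cells` is a hypothetical helper name. Tags are replaced with a space rather than with nothing, so that adjacent cells cannot run together.

```python
import re

# Assumed sample: the question's markup with the cell texts NE, DEK and 143
# placed inside the font tags (the exact whitespace is a guess).
html = (
    '<TABLE><TR><TD width="33%" nowrap=1><font size="1" face="Arial">\n'
    'NE\n</font> </TD>\n'
    '<TD width="33%" nowrap=1><font size="1" face="Arial">\n'
    'DEK\n</font> </TD>\n'
    '<TD width="33%" nowrap=1><font size="1" face="Arial">\n'
    '143\n</font> </TD>\n'
    '</TR></TABLE>'
)


def extract_cells(markup):
    """Hypothetical helper: return the three cell values, or None."""
    text = re.sub(r"<[^>]*>", " ", markup)  # 1. strip the tags (assumed pattern)
    text = re.sub(r"\s+", " ", text)        # 2. collapse whitespace runs into one space
    text = text.strip()                     # 3. remove leading/trailing space
    match = re.match(r"^([^ ]+) ([^ ]+) ([^ ]+)$", text)  # 4. get the values
    return match.groups() if match else None


print(extract_cells(html))
```

Note that the last pattern only matches when exactly three space-free values remain, so a cell containing a space would make the helper return None.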
I have a few suspicions about why your regex might fail (without knowing the rules for string escaping in the iPhone SDK): the dot . is used in places where it would have to match newlines, the slash looks like it is escaped unnecessarily, and so on.
However, in your example, the text you're trying to extract is characterized by not being surrounded by tags.
So a search for all occurrences of (?m)^[^<>\r\n]+$
should find all matches.
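Here is a rough Python sketch of both points, the dot not matching newlines and the line-based search. It assumes the cell texts NE, DEK and 143 sit on lines of their own inside the font tags, and it writes the line pattern with a `+` quantifier so that lines longer than one character match. The simplified `font_pattern` is made up for the demonstration and is not the pattern from the question.

```python
import re

# Assumed sample: each cell text sits on its own line between the tags.
html = (
    '<TABLE><TR><TD width="33%" nowrap=1><font size="1" face="Arial">\n'
    'NE\n</font> </TD>\n'
    '<TD width="33%" nowrap=1><font size="1" face="Arial">\n'
    'DEK\n</font> </TD>\n'
    '<TD width="33%" nowrap=1><font size="1" face="Arial">\n'
    '143\n</font> </TD>\n'
    '</TR></TABLE>'
)

# By default the dot does not match a newline, so (.+?) cannot span
# the line breaks around the cell text and the search finds nothing.
font_pattern = r"<font[^>]*>(.+?)</font>"
without_dotall = re.search(font_pattern, html)
with_dotall = re.search(font_pattern, html, re.DOTALL)

# (?m) lets ^ and $ match at every line; the character class rejects
# any line that contains an angle bracket.
line_pattern = re.compile(r"(?m)^[^<>\r\n]+$")
cells = line_pattern.findall(html)

print(without_dotall)                 # no match without DOTALL
print(with_dotall.group(1).strip())   # first cell once the dot may cross newlines
print(cells)
```

The line-based search only works while every value stays on a line without any tags, so it depends on the layout of the input not changing.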
If you are sure of your HTML hierarchy, you can just extract the text enclosed by the font tags:
// C# example (requires using System.Text.RegularExpressions;)
Regex r = new Regex(@"<\s*font((\s+[^<>]*)|(\s*))>(?<desiredText>[^<>]*)<\s*/\s*font\s*>");
foreach (Match m in r.Matches(txt))
    result += m.Groups["desiredText"].Value.Trim();
The result is the text enclosed by the font tags, with whitespace trimmed from both ends of each value.
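For comparison, the same idea as a Python sketch. The sample input is an assumption (cell texts NE, DEK and 143 inside the font tags), Python spells the named group `(?P<desiredText>...)`, and the values are joined with a space here so they do not run together, which goes slightly beyond the C# loop.

```python
import re

# Assumed sample input with the cell texts inside the font tags.
html = (
    '<TABLE><TR><TD width="33%" nowrap=1><font size="1" face="Arial">\n'
    'NE\n</font> </TD>\n'
    '<TD width="33%" nowrap=1><font size="1" face="Arial">\n'
    'DEK\n</font> </TD>\n'
    '<TD width="33%" nowrap=1><font size="1" face="Arial">\n'
    '143\n</font> </TD>\n'
    '</TR></TABLE>'
)

# Opening font tag with optional attributes, then tag-free text,
# then the closing font tag. Matching is case-sensitive.
font_pattern = re.compile(
    r"<\s*font(?:\s+[^<>]*|\s*)>(?P<desiredText>[^<>]*)<\s*/\s*font\s*>"
)

# Trim each captured value, as Trim() does in the C# version.
values = [m.group("desiredText").strip() for m in font_pattern.finditer(html)]
result = " ".join(values)
print(result)
```

Because `[^<>]*` stops at the first angle bracket, a font element that contains nested tags is skipped rather than partially matched.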