I'm trying to parse some data out of a website. The problem is that JavaScript generates the data, so I can't use an HTML parser for it. The string inside the source looks like:
<a href="http:www.domain.compid.php?id=123">
Everything is constant except the id that comes after the =. I don't know how many times the string will occur either. I would appreciate any help, and an explanation of the regex example if possible.
Do you need to save any of it? A blanket regex, href="[^"]+">, will match the entire string. If you need to save a specific part, let me know.
EDIT: To save the id, note the parentheses after id=, which tell the regex to capture that part. Then, to retrieve it, use the Match object's Groups property.
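using System;
using System.Text.RegularExpressions;

string source = "a href=\"http:www.domain.compid.php?id=123\">";
// The parentheses around [^\"]+ capture the id value.
Regex re = new Regex("href=\"[^\"]+id=([^\"]+)\">");
Match match = re.Match(source);
if (match.Success)
{
    // Groups[0] is the whole match; Groups[1] is the first captured group.
    Console.WriteLine("It's a match!\nI found: {0}", match.Groups[0].Value);
    Console.WriteLine("And the id is {0}", match.Groups[1].Value);
}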
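For readers working on the JavaScript side, the same capture-group idea can be sketched roughly as follows. This is only an illustration, assuming the anchor text looks exactly like the sample in the question; the variable names are made up for the example.

```javascript
// Sample anchor copied from the question (an assumption about the real page).
const anchorText = '<a href="http:www.domain.compid.php?id=123">';

// The parentheses around [^"]+ capture everything after id= up to the closing quote.
const idPattern = /href="[^"]+id=([^"]+)">/;

const found = idPattern.exec(anchorText);

// Index 0 holds the whole match, index 1 holds the first captured group.
const wholeMatch = found ? found[0] : null;
const capturedId = found ? found[1] : null;

console.log(wholeMatch);
console.log(capturedId);
```

Index 0 and index 1 of the result play the same role here as Groups[0] and Groups[1] in the C# version.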
EDIT: example using MatchCollection
MatchCollection mc = re.Matches(source);
foreach (Match m in mc)
{
    // Do the same as above, but use "m" instead of "match".
    // There is no need to check m.Success: a Match is only added
    // to the MatchCollection if it actually matched.
    Console.WriteLine("And the id is {0}", m.Groups[1].Value);
}
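Since the question says the number of occurrences is unknown, here is a rough JavaScript counterpart to collecting every match. It is a sketch, not a definitive implementation: extractAllIds and the two-anchor sample string are hypothetical, and it relies on String.prototype.matchAll, which needs a regex with the g flag.

```javascript
// Hypothetical helper: returns every captured id found in the given HTML text.
function extractAllIds(html) {
  // The g flag is required by matchAll; the parentheses capture the id.
  const pattern = /href="[^"]+id=([^"]+)">/g;
  return Array.from(html.matchAll(pattern), (m) => m[1]);
}

// Made-up sample with two anchors in the format shown in the question.
const sampleHtml =
  '<a href="http:www.domain.compid.php?id=123">' +
  '<a href="http:www.domain.compid.php?id=456">';

const allIds = extractAllIds(sampleHtml);
console.log(allIds);
```

As with MatchCollection, only successful matches are produced, so no per-item success check is needed.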
This does the parsing in JavaScript and creates a CSV string:
var re = /<a href="http:www\.domain\.compid\.php\?id=(\d+)">/;
var source = document.body.innerHTML;
var result = "result: ";
var match = re.exec(source);
while (match != null) {
    result += match[1] + ",";
    // Continue searching after the end of the current match.
    source = source.substring(match.index + match[0].length);
    match = re.exec(source);
}
If the HTML content is not used for anything else on the server, it should be sufficient to send only the ids.
EDIT: For performance and reliability, it's probably better to use built-in JavaScript functions (or jQuery) to find the URLs instead of searching the entire content:
var re = /www\.domain\.compid\.php\?id=(\d+)/;
var as = document.getElementsByTagName('a');
var result = "result: ";
for (var i = 0; i < as.length; i++) {
    // getAttribute returns null for anchors without an href; exec then finds no match.
    var match = re.exec(as[i].getAttribute('href'));
    if (match != null) {
        result += match[1] + ",";
    }
}
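The loop above can also be factored into a small function that works on plain href strings, which makes it easy to try outside a browser. This is a minimal sketch under stated assumptions: idsToCsv is a hypothetical name, the example.org entry is made up, and joining with "," is a small variation that avoids the trailing comma.

```javascript
// Hypothetical helper: takes href strings (or null) and returns the ids as CSV.
function idsToCsv(hrefs) {
  const pattern = /www\.domain\.compid\.php\?id=(\d+)/;
  const ids = [];
  for (const href of hrefs) {
    // Skip anchors that have no href attribute at all.
    const m = href ? pattern.exec(href) : null;
    if (m !== null) {
      ids.push(m[1]);
    }
  }
  return ids.join(",");
}

// Made-up input: two matching links, one missing href, one unrelated link.
const csv = idsToCsv([
  "http:www.domain.compid.php?id=123",
  null,
  "http:example.org",
  "http:www.domain.compid.php?id=7",
]);
console.log(csv);
```

In a page, the input could be built with Array.from(document.getElementsByTagName('a'), function (a) { return a.getAttribute('href'); }).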