Following my previous question, I have been trying to parse the href strings out of an HTML file so that I can pass each string to the solution from that question.
This is what I have, but it doesn't work:
void ParseUrls(char* Buffer)
{
    char *begin = Buffer;
    char *end = NULL;
    int total = 0;
    while(strstr(begin, "href=\"") != NULL)
    {
        end = strstr(begin, "</a>");
        if(end != NULL)
        {
            char *url = (char*) malloc (1000 * sizeof(char));
            strncpy(url, begin, 100);
            printf("URL = %s\n", url);
            if(url) free(url);
        }
        total++;
        begin++;
    }
    printf("Total URLs = %d\n", total);
    return;
}
Basically, I need to extract the href value into a string from markup such as:
<a href="http://www.w3schools.com">Visit W3Schools</a>
Any help is appreciated.
There are a lot of things wrong with this code.
1. You increment begin only by one each time around the loop. This means you find the same href over and over again. I think you meant to move begin to after end?
2. The strncpy will normally copy 100 characters (as the HTML will be longer) and so will not nul-terminate the string. You want url[100] = '\0' somewhere.
3. Why do you allocate 1000 characters and use only 100?
4. You search for end starting with begin. This means if there's a </a> before the href="" you'll find that instead.
5. You don't use end for anything.
6. Why don't you search for the terminating quote at the end of the URL?
Given the above issues (and adding the termination of URL) it works OK for me.
Given
"<a href=\"/email_services.php\">Email services</a> "
it prints
URL = <a href="/email_services.php">Email services</a>
URL = a href="/email_services.php">Email services</a>
URL =  href="/email_services.php">Email services</a>
URL = href="/email_services.php">Email services</a>
Total URLs = 4
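To illustrate the points listed earlier, here is a minimal sketch of a loop that searches for the terminating quote and then moves begin past it, so each href is found only once. The function name PrintHrefs and its return value are my own additions for illustration, and it assumes every href value is wrapped in double quotes. It prints straight from the buffer, so no allocation is needed yet.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from the question): prints every double-quoted
   href value found in buffer and returns how many were found. */
int PrintHrefs(const char *buffer)
{
    static const char key[] = "href=\"";
    const size_t key_len = sizeof key - 1;   /* length without the NUL */
    const char *begin = buffer;
    int total = 0;

    /* Re-use the strstr result instead of searching twice. */
    while ((begin = strstr(begin, key)) != NULL) {
        const char *start = begin + key_len;   /* first character of the URL */
        const char *end = strchr(start, '"');  /* terminating quote */
        if (end == NULL)
            break;                             /* unterminated attribute: stop */

        /* %.*s prints exactly end - start characters, so no copy is needed. */
        printf("URL = %.*s\n", (int)(end - start), start);
        total++;

        begin = end + 1;                       /* continue after this URL */
    }
    printf("Total URLs = %d\n", total);
    return total;
}
```

Called on the same test string as above, this prints URL = /email_services.php once, followed by Total URLs = 1.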
For the allocation of space, I think you should keep the result of the strstr of "href=\"" (call this start), and then the size you need is end - start (+1 for the terminating NUL). Allocate that much space, strncpy it across, add the NUL and Robert's your parent's male sibling.
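A rough sketch of that allocation, under a few assumptions of mine: the helper name NextHref and its cursor parameter are hypothetical, start is taken to point just past the href=" prefix, and end is the closing quote. I used memcpy rather than strncpy because the length is already known; either works as long as the NUL is added by hand.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated copy of the next
   double-quoted href value at or after *cursor, or NULL if there is none
   (or if malloc fails). On success *cursor is moved past the closing quote.
   The caller must free the returned string. */
char *NextHref(const char **cursor)
{
    static const char key[] = "href=\"";
    const char *start = strstr(*cursor, key);
    if (start == NULL)
        return NULL;
    start += sizeof key - 1;                 /* skip over href=" itself */

    const char *end = strchr(start, '"');    /* terminating quote */
    if (end == NULL)
        return NULL;

    size_t len = (size_t)(end - start);
    char *url = malloc(len + 1);             /* +1 for the terminating NUL */
    if (url == NULL)
        return NULL;

    memcpy(url, start, len);                 /* copy exactly the URL bytes */
    url[len] = '\0';                         /* add the NUL ourselves */

    *cursor = end + 1;                       /* next search starts after it */
    return url;
}
```

A caller would set a const char * to the start of the buffer and call NextHref(&p) in a loop until it returns NULL, freeing each result after use.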
Also, remember href= isn't unique to anchors. It can appear in some other tags too.
This does not really answer your question about this code, but it would probably be more reliable to use a C library to do this, such as HTMLParser from libxml2.
HTML parsing looks easy, but there are edge cases that make it easier to use something that is known to work than to work through them all yourself.