count and parse all the href links out of a html file

开发者 (Developer) https://www.devze.com 2023-01-28 06:35 Source: web

Related topics: c parsing

Following my previous question I have been trying to parse the href strings out of a html file in order to send that string to the solution of my previous question.

This is what I have, but it doesn't work:

void ParseUrls(char* Buffer)
{
    char *begin = Buffer;
    char *end = NULL;
    int total = 0;

    while(strstr(begin, "href=\"") != NULL)
    {   
        end = strstr(begin, "</a>");
        if(end != NULL)
        {
            char *url = (char*) malloc (1000 * sizeof(char));

            strncpy(url, begin, 100);
            printf("URL = %s\n", url);

            if(url) free(url);
        }

        total++;
        begin++;
    }

    printf("Total URLs = %d\n", total);
    return;
}

Basically, I need to extract the href value into a string from markup such as:

<a href="http://www.w3schools.com">Visit W3Schools</a>

Any help is appreciated.

There are a lot of things wrong with this code.

You increment begin by only one each time around the loop. This means you find the same href over and over again. I think you meant to move begin to after end?
The strncpy will normally copy the full 100 characters (as the HTML will be longer than that) and so will not NUL-terminate the string. You want url[100] = '\0' somewhere.
Why do you allocate 1000 characters and use only 100?
You search for end starting from begin. This means that if there's a </a> before the href=" you'll find that instead.
You don't use end for anything.
Why don't you search for the terminating quote at the end of the URL?
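
The points above could be addressed roughly as follows. This is only a sketch, not the asker's code: the names find_next_href and print_hrefs are made up for illustration, and it assumes attribute values are double-quoted and written exactly as href=" with no spaces around the equals sign.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: finds the next href="..." at or after `from`.
 * On success, stores the URL length in *len and returns a pointer to the
 * first character of the URL (inside the original buffer, not a copy).
 * Returns NULL when there is no further complete href="...". */
static const char *find_next_href(const char *from, size_t *len)
{
    static const char key[] = "href=\"";
    const char *start, *quote;

    assert(from != NULL && len != NULL);

    start = strstr(from, key);
    if (start == NULL)
        return NULL;
    start += sizeof key - 1;          /* skip past href=" */

    quote = strchr(start, '"');       /* terminating quote of the URL */
    if (quote == NULL)
        return NULL;

    *len = (size_t)(quote - start);
    return start;
}

/* Hypothetical driver: prints each URL once and returns the count. */
static int print_hrefs(const char *buffer)
{
    const char *begin = buffer;
    const char *url;
    size_t len;
    int total = 0;

    while ((url = find_next_href(begin, &len)) != NULL) {
        printf("URL = %.*s\n", (int)len, url);
        total++;
        begin = url + len + 1;        /* resume after the closing quote */
    }
    printf("Total URLs = %d\n", total);
    return total;
}
```

Calling print_hrefs on the whole page buffer visits each match once, because begin jumps past the closing quote instead of advancing by a single character. Printing with %.*s avoids the copy, so no allocation is needed just to display the URL.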

With those issues in mind (and after adding the NUL termination of url), the code behaves as expected for me.

Given

"<a href=\"/email_services.php\">Email services</a> "

it prints

URL = <a href="/email_services.php">Email services</a> 
URL = a href="/email_services.php">Email services</a> 
URL =  href="/email_services.php">Email services</a> 
URL = href="/email_services.php">Email services</a> 
Total URLs = 4

For the allocation of space, I think you should keep the result of the strstr of "href=\"" (call this start); the size you need is then end - start (+1 for the terminating NUL). Allocate that much space, strncpy it across, add the NUL and Robert's your parent's male sibling.
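
A minimal sketch of that allocation scheme, under some assumptions: the helper name copy_first_href is hypothetical, start is moved past the href=" prefix so that only the URL is copied, and end is taken to be the next double quote.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a malloc'd copy of the first href value in
 * `html`, or NULL if none is found or allocation fails. Caller must free. */
static char *copy_first_href(const char *html)
{
    const char *start, *end;
    size_t size;
    char *url;

    assert(html != NULL);

    start = strstr(html, "href=\"");
    if (start == NULL)
        return NULL;
    start += strlen("href=\"");       /* first character of the URL */

    end = strchr(start, '"');         /* assumption: value ends at next quote */
    if (end == NULL)
        return NULL;

    size = (size_t)(end - start);
    url = malloc(size + 1);           /* +1 for the terminating NUL */
    if (url == NULL)
        return NULL;

    strncpy(url, start, size);        /* copies exactly `size` characters */
    url[size] = '\0';                 /* strncpy did not write a NUL here */
    return url;
}
```

With the question's example, copy_first_href("<a href=\"http://www.w3schools.com\">Visit W3Schools</a>") would yield a heap string holding http://www.w3schools.com, which the caller releases with free.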

Also, remember href= isn't unique to anchors. It can appear in some other tags too.
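
One way to filter out those other tags is to look backwards from each href= match to the tag that contains it. The helper below is a rough sketch with a made-up name (href_is_in_anchor); it assumes the match pointer lies inside buffer, and it can be fooled by a '<' or '>' inside a quoted attribute value.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: `href` points at an "href=" match inside `buffer`.
 * Returns 1 if the match sits inside an <a ...> tag, otherwise 0. */
static int href_is_in_anchor(const char *buffer, const char *href)
{
    const char *p = href;

    assert(buffer != NULL && href != NULL);

    /* Walk back to the nearest tag delimiter. */
    while (p > buffer && *p != '<' && *p != '>')
        p--;

    /* Hitting '>' first (or nothing) means the match is not inside a tag. */
    if (*p != '<')
        return 0;

    /* Tag name must be exactly "a" (either case), followed by whitespace. */
    return tolower((unsigned char)p[1]) == 'a' && isspace((unsigned char)p[2]);
}
```

For instance, a match inside <link rel="stylesheet" href="s.css"> is rejected, while a match inside <a href="x"> is accepted.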

This does not really answer your question about this code, but it would probably be more reliable to use a C library to do this, such as HTMLParser from libxml2.

HTML parsing looks easy, but there are edge cases that make it easier to use something that is known to work than to work through them all yourself.