How to build a website with PHP that collects articles?

Developer https://www.devze.com 2023-02-02 01:52 Source: Internet
I have a quick question. I'm trying to build a website with PHP that collects articles from different blogs. How would I code this in PHP? Would I need some kind of regex? All I need to do is grab the articles from specific pages. An example is: http://rss.news.yahoo.com/rss/education Can anyone help? Thank you.


You need to write a parser for each and every site. Something like this (it relies on the Kohana framework and an HTML parser class):

class Parser_Article_SarajevoX extends Parser_Article implements Parser_Interface_Article {

    protected static $_url = 'http://www.sarajevo-x.com/';

    public static function factory($url)
    {
        return new Parser_Article_SarajevoX($url);
    }

    protected static function decode($string)
    {
        return iconv('ISO-8859-2', Kohana::$charset, $string);
    }

    /**
     * SarajevoX Article Parser constructor
     *
     * @param   string  article's url or uri
     */
    public function __construct($url)
    {
        $parsed = parse_url($url);

        if ($path = arr::get($parsed, 'path'))
        {
            // make url's and uri's path the same
            $path = trim($path, '/');

            $exploded = explode('/', $path);

            // Compare the element count, not count() of a boolean
            if (count($exploded) == 4)
            {
                list($this->cat_main, $this->cat, $nita, $this->id) = $exploded;
            }
            elseif (count($exploded) == 3)
            {
                list($this->cat, $nita, $this->id) = $exploded;
            }
            else
            {
                // Exception's second argument is an integer code, so build the message directly
                throw new Exception('Path not recognized: '.$url);
            }

            // @todo check if this article is already imported to skip getting HTML

            $html = HTML_Parser::factory(self::$_url.$path);

            $content = $html->find('#content-main .content-bg', 0);

            // @freememory
            $html = NULL;

            $this->title = self::decode($content->find('h1', 0)->innertext);

            // Loop through all inner divs and find the content
            foreach ($content->find('div') as $div)
            {
                switch ($div->class)
                {
                    case 'nadnaslov':

                        $this->suptitle = strip_tags(self::decode($div->innertext));

                    break;
                    case 'uvod':

                        $this->subtitle = strip_tags(self::decode($div->innertext));

                    break;
                    case 'tekst':

                        $pic_wrap = $div->find('div[id="fotka"]', 0);

                        if ($pic_wrap != FALSE)
                        {
                            $this->_pictures[] = array
                            (
                                'url'   =>  self::$_url.trim($pic_wrap->find('img', 0)->src, '/'),
                                'desc'  =>  self::decode($pic_wrap->find('div[id="opisslike"]', 0)->innertext),
                            );

                            // @freememory
                            $pic_wrap   = NULL;
                        }

                        $this->content  = strip_tags(self::decode($div->innertext));

                    break;
                    case 'ad-gallery' :

                        foreach ($div->find('div[id="gallery"] .ad-nav .ad-thumbs ul li a') as $a)
                        {
                            $this->_pictures[] = array
                            (
                                'url'   =>  self::$_url.trim($a->href, '/'),
                                'desc'  =>  self::decode($a->find('img', 0)->alt),
                            );

                            // @freememory
                            $a = NULL;
                        }

                    break;
                }
            }

            echo Kohana::debug($this);

            return;
        }

        // Exception's second argument is an integer code, so build the message directly
        throw new Exception('Path not recognized: '.$url);
    }

}


An RSS feed is XML, so you could use something like xml_parse_into_struct() to begin parsing it. The examples on this page should be good enough to get you going.
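As a rough sketch of that approach, the snippet below runs xml_parse_into_struct() over a small inline RSS document and collects the title and link of each item. The sample feed and the helper name rss_items_from_struct() are made up for illustration; a real feed would be downloaded first.

```php
<?php
// Hypothetical sample feed standing in for a downloaded RSS document.
$rss = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First article</title>
      <link>http://example.com/first</link>
    </item>
    <item>
      <title>Second article</title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>
XML;

/**
 * Collect the child elements of every <item> as an associative array.
 * (Illustrative helper, not part of any library.)
 */
function rss_items_from_struct($xml)
{
    $parser = xml_parser_create();
    // Keep tag names as written instead of upper-casing them.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
    // Skip whitespace-only text between elements.
    xml_parser_set_option($parser, XML_OPTION_SKIP_WHITE, 1);

    if (!xml_parse_into_struct($parser, $xml, $values)) {
        throw new Exception('Could not parse feed XML');
    }

    $items   = array();
    $current = null;

    foreach ($values as $node) {
        if ($node['tag'] === 'item' && $node['type'] === 'open') {
            // Start collecting fields for a new article.
            $current = array();
        } elseif ($node['tag'] === 'item' && $node['type'] === 'close') {
            $items[] = $current;
            $current = null;
        } elseif ($current !== null && $node['type'] === 'complete') {
            // A leaf element inside <item>, e.g. <title> or <link>.
            $current[$node['tag']] = isset($node['value']) ? $node['value'] : '';
        }
    }

    return $items;
}

foreach (rss_items_from_struct($rss) as $item) {
    echo $item['title'], ' - ', $item['link'], "\n";
}
```

The flat array this function produces needs some bookkeeping to tell which elements belong to which item, which is why the loop tracks the open and close entries for each item.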


Most blogs have an associated RSS XML file. The blog page will have a "link" tag in its header pointing to this XML file, so that browsers can let users subscribe to the feed. The RSS file contains the data you need for each blog entry, such as title, description, publish date, and URL. You can use PHP's SimpleXML extension to load the XML into a SimpleXML object and then access each piece of the feed that you need.
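A minimal sketch of those two steps, assuming a standard RSS 2.0 layout (channel, then item elements): find_feed_url() looks for the "link" tag in a page's HTML using DOMDocument, and read_feed() pulls the fields out with SimpleXML. Both function names and the inline sample data are hypothetical, used only so the example runs without network access.

```php
<?php
// Hypothetical blog page and feed used instead of real downloads.
$html = '<html><head><title>Blog</title>'
      . '<link rel="alternate" type="application/rss+xml" href="http://example.com/feed.xml">'
      . '</head><body></body></html>';

$rss = '<rss version="2.0"><channel><title>Blog</title>'
     . '<item><title>Hello</title><description>First post</description>'
     . '<pubDate>Mon, 01 Jan 2024 00:00:00 GMT</pubDate>'
     . '<link>http://example.com/hello</link></item>'
     . '</channel></rss>';

/**
 * Return the RSS feed URL advertised in the page's <link> tag, or NULL.
 */
function find_feed_url($html)
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parse warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('link') as $link) {
        if (strtolower($link->getAttribute('type')) === 'application/rss+xml') {
            return $link->getAttribute('href');
        }
    }

    return null;
}

/**
 * Load RSS XML with SimpleXML and return one array per feed entry.
 */
function read_feed($xml)
{
    $feed = simplexml_load_string($xml);

    if ($feed === false) {
        throw new Exception('Could not parse feed XML');
    }

    $articles = array();

    foreach ($feed->channel->item as $item) {
        // Cast to string to get the text instead of SimpleXMLElement objects.
        $articles[] = array(
            'title'       => (string) $item->title,
            'description' => (string) $item->description,
            'date'        => (string) $item->pubDate,
            'url'         => (string) $item->link,
        );
    }

    return $articles;
}

echo find_feed_url($html), "\n";

foreach (read_feed($rss) as $article) {
    echo $article['date'], ': ', $article['title'], ' (', $article['url'], ")\n";
}
```

For a live site, the HTML and XML strings could come from something like file_get_contents($url), assuming allow_url_fopen is enabled on the server.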
