开发者

Getting title and meta tags from external website

开发者 https://www.devze.com 2023-01-16 07:29 出处:网络
I want to try figure out how to get the <title>A common title</title&g开发者_StackOverflow社区t;

I want to try figure out how to get the

<title>A common title</title&g开发者_StackOverflow社区t;
<meta name="keywords" content="Keywords blabla" />
<meta name="description" content="This is the description" />

Even though if it's arranged in any order, I've heard of the PHP Simple HTML DOM Parser but I don't really want to use it. Is it possible for a solution except using the PHP Simple HTML DOM Parser.

preg_match will not be able to do it if it's invalid HTML?

Can cURL do something like this with preg_match?

Facebook does something like this but it's properly used by using:

<meta property="og:description" content="Description blabla" />

I want something like this so that it is possible when someone posts a link, it should retrieve the title and the meta tags. If there are no meta tags, then it it ignored or the user can set it themselves (but I'll do that later on myself).


This is the way it should be:

function file_get_contents_curl($url)
{
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

    $data = curl_exec($ch);
    curl_close($ch);

    return $data;
}

$html = file_get_contents_curl("http://example.com/");

//parsing begins here:
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');

//get and display what you need:
$title = $nodes->item(0)->nodeValue;

$metas = $doc->getElementsByTagName('meta');

for ($i = 0; $i < $metas->length; $i++)
{
    $meta = $metas->item($i);
    if($meta->getAttribute('name') == 'description')
        $description = $meta->getAttribute('content');
    if($meta->getAttribute('name') == 'keywords')
        $keywords = $meta->getAttribute('content');
}

echo "Title: $title". '<br/><br/>';
echo "Description: $description". '<br/><br/>';
echo "Keywords: $keywords";


<?php
// Assuming the above tags are at www.example.com
$tags = get_meta_tags('http://www.example.com/');

// Notice how the keys are all lowercase now, and
// how . was replaced by _ in the key.
echo $tags['author'];       // name
echo $tags['keywords'];     // php documentation
echo $tags['description'];  // a php manual
echo $tags['geo_position']; // 49.33;-86.59
?>


get_meta_tags will help you with all but the title. To get the title just use a regex.

$url = 'http://some.url.com';
preg_match("/<title>(.+)<\/title>/siU", file_get_contents($url), $matches);
$title = $matches[1];

Hope that helps.


get_meta_tags did not work with title.

Only meta tags with name attributes like

<meta name="description" content="the description">

will be parsed.


Php's native function: get_meta_tags()

http://php.net/manual/en/function.get-meta-tags.php


Shouldnt we be using OG?

The chosen answer is good but doesn't work when a site is redirected (very common!), and doesn't return OG tags, which are the new industry standard. Here's a little function which is a bit more usable in 2018. It tries to get OG tags and falls back to meta tags if it cant them:

function getSiteOG( $url, $specificTags=0 ){
    $doc = new DOMDocument();
    @$doc->loadHTML(file_get_contents($url));
    $res['title'] = $doc->getElementsByTagName('title')->item(0)->nodeValue;

    foreach ($doc->getElementsByTagName('meta') as $m){
        $tag = $m->getAttribute('name') ?: $m->getAttribute('property');
        if(in_array($tag,['description','keywords']) || strpos($tag,'og:')===0) $res[str_replace('og:','',$tag)] = $m->getAttribute('content');
    }
    return $specificTags? array_intersect_key( $res, array_flip($specificTags) ) : $res;
}

How to use it:

/////////////
//SAMPLE USAGE:
print_r(getSiteOG("http://www.stackoverflow.com")); //note the incorrect url

/////////////
//OUTPUT:
Array
(
    [title] => Stack Overflow - Where Developers Learn, Share, & Build Careers
    [description] => Stack Overflow is the largest, most trusted online community for developers to learn, shareâ âtheir programming âknowledge, and build their careers.
    [type] => website
    [url] => https://stackoverflow.com/
    [site_name] => Stack Overflow
    [image] => https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon@2.png?v=73d79a89bded
)


Unfortunately, the built in php function get_meta_tags() requires the name parameter, and certain sites, such as twitter leave that off in favor of the property attribute. This function, using a mix of regex and dom document, will return a keyed array of metatags from a webpage. It checks for the name parameter, then the property parameter. This has been tested on instragram, pinterest and twitter.

/**
 * Extract metatags from a webpage
 */
function extract_tags_from_url($url) {
  $tags = array();

  $ch = curl_init();
  curl_setopt($ch, CURLOPT_HEADER, 0);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
  curl_setopt($ch, CURLOPT_URL, $url);
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

  $contents = curl_exec($ch);
  curl_close($ch);

  if (empty($contents)) {
    return $tags;
  }

  if (preg_match_all('/<meta([^>]+)content="([^>]+)>/', $contents, $matches)) {
    $doc = new DOMDocument();
    $doc->loadHTML('<?xml encoding="utf-8" ?>' . implode($matches[0]));
    $tags = array();
    foreach($doc->getElementsByTagName('meta') as $metaTag) {
      if($metaTag->getAttribute('name') != "") {
        $tags[$metaTag->getAttribute('name')] = $metaTag->getAttribute('content');
      }
      elseif ($metaTag->getAttribute('property') != "") {
        $tags[$metaTag->getAttribute('property')] = $metaTag->getAttribute('content');
      }
    }
  }

  return $tags;
}


A simple function to understand how to retrieve og:tags, title and description, adapt this for yourself

function read_og_tags_as_json($url){


    $ch = curl_init();

    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

    $HTML_DOCUMENT = curl_exec($ch);
    curl_close($ch);

    $doc = new DOMDocument();
    $doc->loadHTML($HTML_DOCUMENT);

    // fecth <title>
    $res['title'] = $doc->getElementsByTagName('title')->item(0)->nodeValue;

    // fetch og:tags
    foreach( $doc->getElementsByTagName('meta') as $m ){

          // if had property
          if( $m->getAttribute('property') ){

              $prop = $m->getAttribute('property');

              // here search only og:tags
              if( preg_match("/og:/i", $prop) ){

                  // get results on an array -> nice for templating
                  $res['og_tags'][] =
                  array( 'property' => $m->getAttribute('property'),
                          'content' => $m->getAttribute('content') );
              }

          }
          // end if had property

          // fetch <meta name="description" ... >
          if( $m->getAttribute('name') == 'description' ){

            $res['description'] = $m->getAttribute('content');

          }


    }
    // end foreach

    // render JSON
    echo json_encode($res, JSON_PRETTY_PRINT |
    JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES);

}

Return for this page (may have more infos) :

{
    "title": "php - Getting title and meta tags from external website - Stack Overflow",
    "og_tags": [
        {
            "property": "og:type",
            "content": "website"
        },
        {
            "property": "og:url",
            "content": "https://stackoverflow.com/questions/3711357/getting-title-and-meta-tags-from-external-website"
        },
        {
            "property": "og:site_name",
            "content": "Stack Overflow"
        },
        {
            "property": "og:image",
            "content": "https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon@2.png?v=73d79a89bded"
        },
        {
            "property": "og:title",
            "content": "Getting title and meta tags from external website"
        },
        {
            "property": "og:description",
            "content": "I want to try figure out how to get the\n\n&lt;title&gt;A common title&lt;/title&gt;\n&lt;meta name=\"keywords\" content=\"Keywords blabla\" /&gt;\n&lt;meta name=\"description\" content=\"This is the descript..."
        }
    ]
}


Your best bet is to bite the bullet use the DOM Parser - it's the 'right way' to do it. In the long run it'll save you more time than it takes to learn how. Parsing HTML with Regex is known to be unreliable and intolerant of special cases.


We use Apache Tika via php (command line utility) with -j for json :

http://tika.apache.org/

<?php
    shell_exec( 'java -jar tika-app-1.4.jar -j http://www.guardian.co.uk/politics/2013/jul/21/tory-strategist-lynton-crosby-lobbying' );
?>

This is a sample output from a random guardian article :

{
   "Content-Encoding":"UTF-8",
   "Content-Length":205599,
   "Content-Type":"text/html; charset\u003dUTF-8",
   "DC.date.issued":"2013-07-21",
   "X-UA-Compatible":"IE\u003dEdge,chrome\u003d1",
   "application-name":"The Guardian",
   "article:author":"http://www.guardian.co.uk/profile/nicholaswatt",
   "article:modified_time":"2013-07-21T22:42:21+01:00",
   "article:published_time":"2013-07-21T22:00:03+01:00",
   "article:section":"Politics",
   "article:tag":[
      "Lynton Crosby",
      "Health policy",
      "NHS",
      "Health",
      "Healthcare industry",
      "Society",
      "Public services policy",
      "Lobbying",
      "Conservatives",
      "David Cameron",
      "Politics",
      "UK news",
      "Business"
   ],
   "content-id":"/politics/2013/jul/21/tory-strategist-lynton-crosby-lobbying",
   "dc:title":"Tory strategist Lynton Crosby in new lobbying row | Politics | The Guardian",
   "description":"Exclusive: Firm he founded, Crosby Textor, advised private healthcare providers how to exploit NHS \u0027failings\u0027",
   "fb:app_id":180444840287,
   "keywords":"Lynton Crosby,Health policy,NHS,Health,Healthcare industry,Society,Public services policy,Lobbying,Conservatives,David Cameron,Politics,UK news,Business,Politics",
   "msapplication-TileColor":"#004983",
   "msapplication-TileImage":"http://static.guim.co.uk/static/a314d63c616d4a06f5ec28ab4fa878a11a692a2a/common/images/favicons/windows_tile_144_b.png",
   "news_keywords":"Lynton Crosby,Health policy,NHS,Health,Healthcare industry,Society,Public services policy,Lobbying,Conservatives,David Cameron,Politics,UK news,Business,Politics",
   "og:description":"Exclusive: Firm he founded, Crosby Textor, advised private healthcare providers how to exploit NHS \u0027failings\u0027",
   "og:image":"https://static-secure.guim.co.uk/sys-images/Guardian/Pix/pixies/2013/7/21/1374433351329/Lynton-Crosby-008.jpg",
   "og:site_name":"the Guardian",
   "og:title":"Tory strategist Lynton Crosby in new lobbying row",
   "og:type":"article",
   "og:url":"http://www.guardian.co.uk/politics/2013/jul/21/tory-strategist-lynton-crosby-lobbying",
   "resourceName":"tory-strategist-lynton-crosby-lobbying",
   "title":"Tory strategist Lynton Crosby in new lobbying row | Politics | The Guardian",
   "twitter:app:id:googleplay":"com.guardian",
   "twitter:app:id:iphone":409128287,
   "twitter:app:name:googleplay":"The Guardian",
   "twitter:app:name:iphone":"The Guardian",
   "twitter:app:url:googleplay":"guardian://www.guardian.co.uk/politics/2013/jul/21/tory-strategist-lynton-crosby-lobbying",
   "twitter:card":"summary_large_image",
   "twitter:site":"@guardian"
}


My solution (adapted from parts of cronoklee's & shamittomar's posts) so I can call it from anywhere and get a JSON return. Can be easily parsed into any content.

<?php
header('Content-type: application/json; charset=UTF-8');

if (!empty($_GET['url']))
{
    file_get_contents_curl($_GET['url']);
}
else
{
    echo "No Valid URL Provided.";
}


function file_get_contents_curl($url)
{
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

    $data = curl_exec($ch);
    curl_close($ch);

    echo json_encode(getSiteOG($data), JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES);
}

function getSiteOG( $OGdata){
    $doc = new DOMDocument();
    @$doc->loadHTML($OGdata);
    $res['title'] = $doc->getElementsByTagName('title')->item(0)->nodeValue;

    foreach ($doc->getElementsByTagName('meta') as $m){
        $tag = $m->getAttribute('name') ?: $m->getAttribute('property');
        if(in_array($tag,['description','keywords']) || strpos($tag,'og:')===0) $res[str_replace('og:','',$tag)] = utf8_decode($m->getAttribute('content'));

    }

    return $res;
}
?>


Get meta tags from url, php function example:

function get_meta_tags ($url){
         $html = load_content ($url,false,"");
         print_r ($html);
         preg_match_all ("/<title>(.*)<\/title>/", $html["content"], $title);
         preg_match_all ("/<meta name=\"description\" content=\"(.*)\"\/>/i", $html["content"], $description);
         preg_match_all ("/<meta name=\"keywords\" content=\"(.*)\"\/>/i", $html["content"], $keywords);
         $res["content"] = @array("title" => $title[1][0], "descritpion" => $description[1][0], "keywords" =>  $keywords[1][0]);
         $res["msg"] = $html["msg"];
         return $res;
}

Example:

print_r (get_meta_tags ("bing.com") );

Get Meta Tags php


Easy and php's in-built function.

http://php.net/manual/en/function.get-meta-tags.php


If you're working with PHP, check out the Pear packages at pear.php.net and see if you find anything useful to you. I've used the RSS packages effectively and it saves a lot of time, provided you can follow how they implement their code via their examples.

Specifically take a look at Sax 3 and see if it will work for your needs. Sax 3 is no longer updated but it might be sufficient.


As it was already said, this can handle the problem:

$url='http://stackoverflow.com/questions/3711357/get-title-and-meta-tags-of-external-site/4640613';
$meta=get_meta_tags($url);
echo $title=$meta['title'];

//php - Get Title and Meta Tags of External site - Stack Overflow


<?php 

// ------------------------------------------------------ 

function curl_get_contents($url) {

    $timeout = 5; 
    $useragent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0'; 

    $ch = curl_init(); 
    curl_setopt($ch, CURLOPT_URL, $url); 
    curl_setopt($ch, CURLOPT_USERAGENT, $useragent); 
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); 
    $data = curl_exec($ch); 
    curl_close($ch); 

    return $data; 
}

// ------------------------------------------------------ 

function fetch_meta_tags($url) { 

    $html = curl_get_contents($url); 
    $mdata = array(); 

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $titlenode = $doc->getElementsByTagName('title'); 
    $title = $titlenode->item(0)->nodeValue;

    $metanodes = $doc->getElementsByTagName('meta'); 
    foreach($metanodes as $node) { 
    $key = $node->getAttribute('name'); 
    $val = $node->getAttribute('content'); 
    if (!empty($key)) { $mdata[$key] = $val; } 
    }

    $res = array($url, $title, $mdata); 

    return $res;
}

// ------------------------------------------------------ 

?>


I made this small composer package based on the top answer: https://github.com/diversen/get-meta-tags

composer require diversen/get-meta-tags

And then:

use diversen\meta;

$m = new meta();

// Simple usage, get's title, description, and keywords by default
$ary = $m->getMeta('https://github.com/diversen/get-meta-tags');
print_r($ary);

// With more params
$ary = $m->getMeta('https://github.com/diversen/get-meta-tags', array ('description' ,'keywords'), $timeout = 10);
print_r($ary);

It requires CURL and DOMDocument, as the top answer - and is built in the way, but has option for setting curl timeout (and for getting all kind of meta tags).


Now a days, most of the sites add meta tags to their sites providing information about their site or any particular article page. Such as news or blog sites.

I have created a Meta API which gives you required meta data ac like OpenGraph, Schema.Org, etc.

Check it out - https://api.sakiv.com/docs


I've got this working a different way and thought I'd share it. Less code than others and found it here. I've added a few things to make it load the page meta that you are on instead of a certain page. I wanted this to copy the default page title and description into the og tags automatically.

For some reason though, whatever way (different scripts) I tried, the page loads super slow online but instant on wamp. Not sure why so I'm probably going with a switch case since the site is not huge.

<?php
$url = 'http://sitename.com'.$_SERVER['REQUEST_URI'];
$fp = fopen($url, 'r');

$content = "";

while(!feof($fp)) {
    $buffer = trim(fgets($fp, 4096));
    $content .= $buffer;
}

$start = '<title>';
$end = '<\/title>';

preg_match("/$start(.*)$end/s", $content, $match);
$title = $match[1];

$metatagarray = get_meta_tags($url);
$description = $metatagarray["description"];

echo "<div><strong>Title:</strong> $title</div>";
echo "<div><strong>Description:</strong> $description</div>";
?>

and in the HTML header

<meta property="og:title" content="<?php echo $title; ?>" />
<meta property="og:description" content="<?php echo $description; ?>" />


Improved answer from @shamittomar above to get the meta tags (or the specified one from html source)

Can be improved further... the difference from php's default get_meta_tags is that it works even when there is unicode string

function getMetaTags($html, $name = null)
{
    $doc = new DOMDocument();
    try {
        @$doc->loadHTML($html);
    } catch (Exception $e) {

    }

    $metas = $doc->getElementsByTagName('meta');

    $data = [];
    for ($i = 0; $i < $metas->length; $i++)
    {
        $meta = $metas->item($i);

        if (!empty($meta->getAttribute('name'))) {
            // will ignore repeating meta tags !!
            $data[$meta->getAttribute('name')] = $meta->getAttribute('content');
        }
    }

    if (!empty($name)) {
        return !empty($data[$name]) ? $data[$name] : false;
    }

    return $data;
}


Here is PHP simple DOM HTML Class two line code to get page META details.

$html = file_get_html($link);
$meat_description = $html->find('head meta[name=description]', 0)->content;
$meat_keywords = $html->find('head meta[name=keywords]', 0)->content;
0

精彩评论

暂无评论...
验证码 换一张
取 消