PHP: finding, replacing, shortening, and prettifying user links with <a> tags, ellipses, and link icons

When a user enters a URL, e.g. http://www.google.com, I would like to be able to parse that text using PHP, find any links, and replace them with <a> tags that include the original URL as an HREF.

In other words, http://www.google.com will become

<a href="http://www.google.com">http://www.google.com</a>

I'd like to be able to do this for all URLs of these forms (with .com interchangeable with any TLD):

http://www.google.com
www.google.com
google.com
docs.google.com

What's the most performant way to do this? I could try writing some really fancy regex, but I doubt that's the best method available to me.

For bonus points, I'd also like to prepend http:// to any URL lacking it, and strip the display text itself down to something of the form http://www.google.com/reallyLongL... and display an external link icon afterwards.

Trying to find links in the format domain.com is going to be a pain in the butt. It would require keeping track of all TLDs and using them in the search.if you didn't, the end of the last sentence I typed and the beginning of this sentence would be a link to http://search.if. Even if you did, .in is a valid TLD and a common word.

I'd recommend telling your users they have to begin links with www. or http:// then write a simple regex to capture them and add the links.
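As a rough sketch of that approach: the function below collects only the links that start with http://, https:// or www., and prepends http:// to the www. ones. The name find_explicit_links and the exact pattern are my own illustration, not something from the question, and trailing punctuation is not handled here.

```php
<?php
// Hypothetical helper: returns every explicit link found in $text,
// normalised so that each one carries a scheme.
function find_explicit_links(string $text): array
{
    // Only match candidates that begin with http://, https:// or www.
    preg_match_all('~\b(?:https?://|www\.)[^\s<>"]+~i', $text, $matches);

    $urls = [];
    foreach ($matches[0] as $candidate) {
        // Prepend http:// when the user typed a bare www. link.
        $urls[] = preg_match('~^https?://~i', $candidate)
            ? $candidate
            : 'http://' . $candidate;
    }
    return $urls;
}
```

Bare names like google.com are deliberately ignored, which is the trade-off being recommended.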

www.google.com

This is not a URL, it's a hostname. It's generally not a good idea to start marking up bare hostnames in arbitrary text, because in the general case any word or sequence of dot-separated words is a perfectly valid hostname. That means you end up with horrible hacks like looking for leading www. (and you'll get questions like “why can I link to www.stackoverflow.com but not stackoverflow.com?”) or trailing TLDs (which gets more and more impractical as more new TLDs are introduced; “why can I link to ncm.com but not ncm.museum?”), and you'll often mark up things that aren't supposed to be links.

I could try writing some really fancy regex

Well I can't see how you'd do it without regex.

The trick is coping with markup. If you can have <, & and " characters in the input, you mustn't let them into HTML output. If your input is plain text, you can do that by calling htmlspecialchars() before applying a simple replacement on a pattern like that in nico's answer.
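A minimal sketch of that escape-first order, assuming the input is plain text. The function name linkify_plain_text and the pattern are illustrative assumptions. Because the text is already escaped when the pattern runs, the URL is made to stop at an escaped <, > or quote rather than at the literal character.

```php
<?php
// Hypothetical helper: escape plain text, then wrap http(s) URLs in <a> tags.
function linkify_plain_text(string $text): string
{
    // Escape first so <, >, & and quotes from the user can never become markup.
    $escaped = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    // A URL runs until whitespace or until an escaped <, >, " or '.
    // &amp; is allowed through, since it is a legitimate part of query strings.
    $pattern = '~\bhttps?://(?:(?!&(?:lt|gt|quot|#0?39);)\S)+~i';

    return preg_replace($pattern, '<a href="$0">$0</a>', $escaped);
}
```

The href and the link text both come from the escaped string, so an & in the query string ends up as &amp; in both places, which is what HTML expects.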

(If the input already contains markup, you've got problems and you'd probably need an HTML parser to determine which bits are markup to avoid adding more markup inside of. Similarly, if you're doing more processing after this, inserting more tags, those steps may have the same difficulty. In ‘bbcode’-like languages this often leads to bugs and security problems.)

Another problem is trailing punctuation. It's common for people to put a full stop, comma, close bracket, exclamation mark etc after a link, which aren't supposed to be part of the link but which are actually valid characters. It's useful to strip these off and not put them in the link. But then you break Wiki links that end in ), so maybe you want to not treat ) as a trailing character if there's a ( in the link, or something like that. This sort of thing can't be done in a simple regex replace, but you can in a replacement callback function.
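Here is one way the callback idea could look. It is only a sketch: the function names, the set of characters treated as trailing punctuation, and the "more ) than (" rule are my own assumptions about what "something like that" might mean.

```php
<?php
// Hypothetical helper: split a matched URL into [url, trailing punctuation].
function split_trailing_punctuation(string $url): array
{
    $trail = '';
    while ($url !== '') {
        $last = substr($url, -1);
        $isPunctuation = strpos('.,!?:', $last) !== false;
        // A ) only counts as trailing when the URL has more ) than (.
        $isExtraParen = $last === ')'
            && substr_count($url, ')') > substr_count($url, '(');
        if (!$isPunctuation && !$isExtraParen) {
            break;
        }
        $trail = $last . $trail;
        $url = substr($url, 0, -1);
    }
    return [$url, $trail];
}

// Hypothetical helper: escape plain text, then link URLs via a callback.
function linkify_with_punctuation(string $text): string
{
    $escaped = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    // Same idea as a simple replace, but the callback can trim the match.
    $pattern = '~\bhttps?://(?:(?!&(?:lt|gt|quot|#0?39);)\S)+~i';

    return preg_replace_callback($pattern, function (array $m): string {
        [$url, $trail] = split_trailing_punctuation($m[0]);
        // The stripped punctuation goes back into the text after the link.
        return '<a href="' . $url . '">' . $url . '</a>' . $trail;
    }, $escaped);
}
```

A semicolon is left out of the punctuation set on purpose: the text is already escaped at this point, and stripping a trailing ; could cut an entity such as &amp; in half.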

HTML Purifier has a built-in linkify function to save you all the headaches.

Its other features are also simply too useful to pass up if you're dealing with any kind of user input that you also have to display.
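For reference, turning this on looks roughly like the fragment below. It assumes HTML Purifier was installed through Composer (package ezyang/htmlpurifier) and that vendor/autoload.php is the right path for your project; the $dirty string is made up. As far as I know the directive only links URLs that carry a scheme such as http://, so bare hostnames stay untouched.

```php
<?php
// Assumption: Composer autoloader; adjust the path to your own setup.
require_once 'vendor/autoload.php';

$config = HTMLPurifier_Config::createDefault();
// Enable the built-in auto-linking of URLs found in text.
$config->set('AutoFormat.Linkify', true);

$purifier = new HTMLPurifier($config);

// Made-up user input: the script tag is removed, the URL becomes a link.
$dirty = 'Visit http://www.google.com <script>alert(1)</script>';
echo $purifier->purify($dirty);
```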

Not so fancy regexps that should work

/\b(https?:\/\/[^\s"<>]+)/i
/\b(www\.[^\s"<>]+)/i

Note that the last two forms in the question (google.com and docs.google.com) would be impossible to match correctly, as you cannot distinguish google.com from something like this.Where I finish one sentence and don't put a space after the full stop.

As for shortening the URLs, having your URL in $url:

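if (strlen($url) > 20) // Or whatever length you like
   {
   $shortURL = htmlspecialchars(substr($url, 0, 20))."&hellip;";
   }
else
   {
   $shortURL = htmlspecialchars($url);
   }

// Escape the URL too, so quotes or angle brackets in it cannot break the markup
echo '<a href="'.htmlspecialchars($url).'">'.$shortURL.'</a>';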

From http://www.exorithm.com/algorithm/view/markup_urls

function markup_urls ($text)
{
  // split the text into words
  $words = preg_split('/([\s\n\r]+)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
  $text = "";

  // iterate through the words
  foreach($words as $word) {

    // chopword = the portion of the word that will be replaced
    $chopword = $word;
    $chopword = preg_replace('/^[^A-Za-z0-9]*/', '', $chopword);

    if ($chopword <> '') {
      // linkword = the text that will replace chopword in the word
      $linkword='';

      // does it start with http://abc. ?
      if (preg_match('/^(http:\/\/)[a-zA-Z0-9_]{2,}.*/', $chopword)) {

        $chopword = preg_replace('/[^A-Za-z0-9\/]*$/', '', $chopword);
        $linkword = '<a href="'.$chopword.'" target="_blank">'.$chopword.'</a>';

      // does it equal abc.def.ghi ?
      } else if (preg_match('/^[a-zA-Z]{2,}\.([a-zA-Z0-9_]+\.)+[a-zA-Z]{2,}(\/.*)?/', $chopword)) {

        $chopword = preg_replace('/[^A-Za-z0-9\/]*$/', '', $chopword);
        $linkword = '<a href="http://'.$chopword.'" target="_blank">'.$chopword.'</a>';

      // does it start with abc@def.ghi ?
      } else if (preg_match('/^[a-zA-Z0-9_\.]+\@([a-zA-Z0-9_]{2,}\.)+[a-zA-Z]{2,}.*/', $chopword)) {

        $chopword = preg_replace('/[^A-Za-z0-9]*$/', '', $chopword);
        $linkword = '<a href="mailto:'.$chopword.'">'.$chopword.'</a>';

      }

      // replace chopword with linkword in word (if linkword was set)
      if ($linkword <> '') {
        $word = str_replace($chopword, $linkword, $word);
      }
    }

    // append the word
    $text = $text.$word;
  }

  return $text;
}

I got this working exactly the way I want here:

<?php

$input = <<<EOF
http://www.example.com/
http://example.com
www.example.com
http://iamanextremely.com/long/link/so/I/will/be/trimmed/down/a/bit/so/i/dont/mess/up/text/wrapping.html
EOF;

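  // Callback for preg_replace_callback(): builds the <a> tag for one matched URL.
  function trimlong($match)
  {
    $url = $match[0];
    $display = $url;
    // The pattern also matches bare www. links, so give those a scheme in the href.
    if ( stripos($url, 'http://') !== 0 ) {
      $url = 'http://'.$url;
    }
    if ( strlen($display) > 30 ) {
      $display = substr($display,0,30)."...";
    }
    return '<a href="'.htmlspecialchars($url).'">'.htmlspecialchars($display).' <img src="http://static.goalscdn.com/img/external-link.gif" height="10" width="11" /></a>';
  }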

$output = preg_replace_callback('#(http://|www\\.)[^\\s<]+[^\\s<,.]#i',
                                 'trimlong',$input);

echo $output;