Regex to split on punctuation excluding URLs

I'm trying to split a string on its punctuation, but the string may contain URLs (which conveniently contain all the typical punctuation marks).

I have a basic working knowledge of RegEx, but not enough to help me out here. This is what I was using when I discovered the problem:

$text[$i] = preg_split('/[\.\?!\-]+/', $post->text);

(this also accounts for multiple consecutive punctuation characters - ellipses, !!!!, ????, ?!?, etc)

How would I split a string on the punctuation while maintaining the integrity of URLs? Thanks!

Edit:

My apologies...an example would be something along the lines of a tweet:

"Blah blah blah? A sentence. Here's a link: http://somelink.com?key=value ."

The results should look something like this:

[0] => "Blah blah blah?"
[1] => "A sentence."
[2] => "Here's a link: http://somelink.com?key=value ."

What you're doing here isn't quite splitting on punctuation, because you're trying to keep the punctuation in one of the split items. You're also attempting to discard the whitespace afterwards, but don't seem to have covered that in your question.

I would tackle this in the following way: split your input string with a regular expression which matches punctuation or a URL, and keep the pieces, including the separators. Then iterate over the items, and for each separator decide whether it was punctuation, in which case you can strip trailing whitespace and move it to the end of the previous item, or a URL, in which case you just join it with the preceding and following items.

In PHP, you can keep the delimiters using something like this:

$text[$i] = preg_split('/([\.\?!\-]+|https?:\/\/\S+)/', $post->text, -1, PREG_SPLIT_DELIM_CAPTURE);

where the PREG_SPLIT_DELIM_CAPTURE flag is explained in the documentation as:

If this flag is set, parenthesized expression in the delimiter pattern will be captured and returned as well.
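The split-then-reassemble approach described above could be sketched roughly as follows. The function name split_keeping_urls is made up for illustration, and the punctuation set is simply the one from the question; treat this as a starting point under those assumptions rather than a finished solution.

```php
<?php
// Hypothetical helper (not from the original answer): split $text into
// sentences, keeping the punctuation and leaving URLs intact.
function split_keeping_urls(string $text): array
{
    // -1 means "no limit"; the flag goes in the fourth argument.
    // With a single capture group, odd indexes of the result are delimiters.
    $parts = preg_split(
        '/(https?:\/\/\S+|[.?!\-]+)/',
        $text,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );

    $sentences = [];
    $current = '';
    foreach ($parts as $i => $part) {
        $current .= $part;
        $isDelimiter = ($i % 2 === 1);
        // A punctuation delimiter closes the current sentence;
        // a URL delimiter is just glued into it.
        if ($isDelimiter && !preg_match('/^https?:/', $part)) {
            $sentences[] = trim($current);
            $current = '';
        }
    }
    // Keep any trailing text that had no closing punctuation.
    if (trim($current) !== '') {
        $sentences[] = trim($current);
    }
    return $sentences;
}
```

On the tweet from the question this yields the three items the asker listed. Two limitations carry over from the pattern itself: a hyphen inside a word still triggers a split, and a URL immediately followed by a full stop swallows that full stop, because \S+ is greedy.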

Is there a pattern that your non-URL punctuation marks follow? In most English sentences, many punctuation marks are followed (or sometimes preceded) by a space character. I don't know what your source text is like, but that MIGHT be a reliable way to do it, because the punctuation marks inside a URL will NOT have a space on either side - although a URL could END with a punctuation mark followed by a space. I guess it depends on the URLs you anticipate as well.
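A minimal sketch of that whitespace heuristic, assuming sentence-ending punctuation is always followed by whitespace. The function name is hypothetical, and the hyphen is deliberately left out of the set here, since a hyphen followed by a space usually isn't a sentence end; that is my assumption, not something stated in the question.

```php
<?php
// Hypothetical helper: split only where sentence punctuation is
// followed by whitespace, so punctuation inside a URL is ignored.
function split_on_spaced_punctuation(string $text): array
{
    // (?<=[.?!]) is a fixed-length (one character) lookbehind, so the
    // punctuation stays attached to the sentence; the whitespace itself
    // is the delimiter and is discarded.
    return preg_split('/(?<=[.?!])\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
}
```

For the example tweet this returns the three expected items with their punctuation intact. As noted above, a URL that ends in a punctuation mark followed by a space would still be treated as the end of a sentence.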

Another approach (if you don't mind doing this in stages) is to remove all of the URLs from the string and then do the rest of your processing on the result. That only works if you don't need the URLs. If you need to preserve the URLs, you can add placeholder strings on either side of each URL, such as ">>>>http://placeholder.com<<<<", and then when you split on punctuation, be sure to exclude any punctuation that occurs between >>>> and <<<<. Afterwards, you would have to remove the >>>> and <<<<.
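One way to stage this in PHP is sketched below. It is a variation on the idea rather than a literal implementation: instead of wrapping the URL, it swaps the whole URL for a numbered token (">>>>0<<<<"), which avoids having to exclude anything while splitting. The function name is hypothetical, and the sketch assumes the input never contains such a token already.

```php
<?php
// Hypothetical helper: hide URLs behind tokens, split, then restore them.
function split_with_placeholders(string $text): array
{
    $urls = [];
    // Swap each URL for a token containing none of the split characters.
    $masked = preg_replace_callback(
        '/https?:\/\/\S+/',
        function (array $m) use (&$urls): string {
            $urls[] = $m[0];
            return '>>>>' . (count($urls) - 1) . '<<<<';
        },
        $text
    );

    // Same punctuation set as in the question; the punctuation is discarded.
    $pieces = preg_split('/[.?!\-]+/', $masked, -1, PREG_SPLIT_NO_EMPTY);

    $restored = [];
    foreach ($pieces as $piece) {
        // Put the original URLs back in place of their tokens.
        $piece = preg_replace_callback(
            '/>>>>(\d+)<<<</',
            function (array $m) use ($urls): string {
                return $urls[(int) $m[1]];
            },
            $piece
        );
        $piece = trim($piece);
        if ($piece !== '') {
            $restored[] = $piece;
        }
    }
    return $restored;
}
```

Unlike the output requested in the question, this drops the punctuation, just as the original preg_split call did.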

This regex produces the example you've given:

/(?<!http[^\s]{0,2048})[\.\?\!\-]+\B/

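It matches your punctuation set unless it is preceded by 'http' and a run of non-whitespace characters, that is, unless it sits inside a URL. The trailing \B prevents a hyphenated word from causing a split. Note that the lookbehind has variable length, which PHP's PCRE-based preg_* functions generally reject, so the pattern may need an engine that allows bounded variable-length lookbehind (such as .NET or Java).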

but...

This input:

Blah blah blah? A sentence. Here's a link: http://somelink.com?key=value.blah blah blah...

won't split the value.blah into two... but I think a URL-matching regex would have the same problem, as 'value.blah' could be part of a valid URL. I think your data, coming from Twitter users, will be very inconsistent and therefore hard to clean up, even if you go for FrustratedWithFormsDes' second suggestion.
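If the lookbehind is a problem in PHP, a similar effect can be had with PCRE's backtracking control verbs. This swaps in a different technique from the regex above: the first alternative consumes a whole URL and then deliberately fails, so only punctuation outside URLs is left to match. The function name is hypothetical, and the same 'value.blah' caveat applies.

```php
<?php
// Hypothetical helper: skip over URLs, split on punctuation elsewhere.
function split_skipping_urls(string $text): array
{
    // Left alternative: match a URL, then (*F) forces failure and (*SKIP)
    // stops the engine from retrying at positions inside that URL.
    // Right alternative: the punctuation set from the question, with the
    // trailing \B so that a hyphen inside a word does not cause a split.
    $pattern = '~https?://\S+(*SKIP)(*F)|[.?!\-]+\B~';
    $pieces = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);

    // Trim whitespace and drop pieces that end up empty.
    $pieces = array_map('trim', $pieces);
    return array_values(array_filter($pieces, function (string $s): bool {
        return $s !== '';
    }));
}
```

As with a plain preg_split, the punctuation itself is dropped from the resulting pieces.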

You can try: