
Parse URLs from a string in ColdFusion

Developer https://www.devze.com 2023-01-07 00:18 Source: web

I need to parse all URLs from a paragraph (string).

E.g.

"check out this site google.com and don't forget to see this too bing.com/maps"

It should return "google.com and bing.com/maps"

I'm currently using this, and it's not perfect:

reMatch("(^|\s)[^\s@]+\.[^\s@\?\/]{2,5}((\?|\/)\S*)?",mystring)

thanks


You need to define more clearly what you consider a URL.

For example, I might use something such as this:

(?:https?:)?(?://)?(?:[\w-]+\.)+[a-z]{2,6}(?::\d+)?(?:/[\w.,-]+)*(?:\?\S+)?

(Use with reMatchNoCase, or plonk (?i) at the front, to ignore case.)

This allows only alphanumerics, underscores, and hyphens in the domain parts (path segments may also contain dots and commas), requires the TLD to be letters only, and only looks for numeric ports.

It might be that this is good enough, or you may need something that looks for more characters, or perhaps you want to trim things like quotes, brackets, etc. off the end of the URL. It depends on the context of what you're doing as to whether you'd like to err towards missing URLs or detecting non-URLs. (I'd probably go for the latter, then potentially run a secondary filter to verify whether something is a URL, but that takes more work, and may not be necessary for what you're doing.)
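As a rough illustration of the "detect generously, then filter" idea, here is a minimal sketch in plain Java (the same java.util.regex engine that ColdFusion can reach through Java). The class name UrlTrimSketch, the findUrls helper, and the set of trailing characters to strip are all assumptions made for this example, not part of the original answer, and the stripping rule would also cut a legitimate trailing bracket, so adjust it to your data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlTrimSketch {
    // The URL expression from this answer, compiled case-insensitively.
    static final Pattern URL = Pattern.compile(
        "(?:https?:)?(?://)?(?:[\\w-]+\\.)+[a-z]{2,6}(?::\\d+)?(?:/[\\w.,-]+)*(?:\\?\\S+)?",
        Pattern.CASE_INSENSITIVE);

    // Assumed secondary filter: punctuation that tends to follow a URL in prose.
    static final Pattern TRAILING = Pattern.compile("[.,;:!?)\\]\"']+$");

    // Hypothetical helper: collect every match, then strip trailing punctuation.
    static List<String> findUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            String candidate = TRAILING.matcher(m.group()).replaceAll("");
            if (!candidate.isEmpty()) {
                urls.add(candidate);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // The raw match for the first URL is "bing.com/maps," because the
        // path part of the expression accepts commas; the filter removes it.
        System.out.println(findUrls("See bing.com/maps, or google.com."));
    }
}
```

The filter runs on each match after the fact, so the main expression stays unchanged and the trimming rule can be tightened or loosened independently.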


Anyhow, the explanation of the above expression is below, hopefully with clear comments to help it make sense. :) (Note that all groups are non-capturing (?:...) since we don't need the individual parts.)

# PROTOCOL
 (?:https?:)?    # optional group of "http:" or "https:"

# SERVER NAME / DOMAIN
 (?://)?         # optional double forward slash
 (?:[\w-]+\.)+   # one or more "word characters" or hyphens, followed by a literal .
                 # grouped together and repeated one or more times
 [a-z]{2,6}      # as many as 6 alphas, but at least 2

# PORT NUMBER
 (?::\d+)?       # an optional group made up of : and one or more digits

# PATH INFO
 (?:/[\w.,-]+)*  # a forward slash then multiple alphanumeric, underscores, or hyphens
                 # or dots or commas (add any other characters as required)
                 # in a group that might occur multiple times (or not at all)

# QUERY STRING
 (?:\?\S+)?      # an optional group containing ? then any non-whitespace
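If you end up calling java.util.regex directly, the commented layout above can be kept inside the pattern itself by compiling with the COMMENTS flag, which ignores whitespace and treats # to end of line as a comment. The sketch below is plain Java under that assumption; the class name CommentedUrlRegex and the findAll helper are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentedUrlRegex {
    // Same expression as above, one part per line. COMMENTS mode ignores the
    // whitespace and the trailing "# ..." notes inside the pattern.
    static final Pattern URL = Pattern.compile(String.join("\n",
        "(?:https?:)?     # protocol: optional http: or https:",
        "(?://)?          # optional double forward slash",
        "(?:[\\w-]+\\.)+  # one or more dot-terminated name parts",
        "[a-z]{2,6}       # TLD: 2 to 6 letters",
        "(?::\\d+)?       # optional numeric port",
        "(?:/[\\w.,-]+)*  # zero or more path segments",
        "(?:\\?\\S+)?     # optional query string"),
        Pattern.COMMENTS | Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: return every match in order of appearance.
    static List<String> findAll(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(findAll(
            "check out this site google.com and don't forget to see this too bing.com/maps"));
    }
}
```

This only works because the expression contains no literal spaces or # characters; either would need escaping in COMMENTS mode.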



Update: To prevent the end of email addresses being matched, we need to use a lookbehind, to ensure that prior to the URL we don't have an @ sign (or anything else unwanted) but without actually including that prior character in the match.

CF's regex engine is Apache ORO, which doesn't support lookbehinds, but we can use java.util.regex, which does, nice and easily with a component I have created.

Using that is as simple as:

<cfset jrex = createObject('component','jre-utils').init('CASE_INSENSITIVE') />
...
<cfset Urls = jrex.match( regex , input ) />

After the createObject, it should basically be like using the built-in re* functions (reFind, reMatch, and so on), but with a slight syntax difference and a different regex engine under the hood.

(If you have any problems or questions with the component, let me know.)


So, on to your excluding emails from URL matching problem:

We can either do a (?<=positive) or (?<!negative) lookbehind, depending on if we want to say "we must have this" or "we must not have this", like so:

(?<=\s) # there must be whitespace before the current position
(?<!@)  # there must NOT be an @ before current position

For this URL example, I would expand either of those examples to:

(?<=\s|^)   # look for whitespace OR start of string

or

(?<![@\w/]) # ensure there is not a @ or / or word character.

Both will work (and can be expanded with more chars), but in different ways, so it simply depends which method you want to do it with.

Put whichever one you like at the start of your expression, and it should no longer match the end of abcd@gmail.com, unless I've screwed something up. :)
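To see how the two lookbehinds differ in practice, here is a small plain-Java sketch that runs the same expression with no prefix, with the positive lookbehind, and with the negative one. The class name LookbehindSketch, the find helper, and the sample sentence are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindSketch {
    // The URL expression from this answer, without any lookbehind.
    static final String BASE =
        "(?:https?:)?(?://)?(?:[\\w-]+\\.)+[a-z]{2,6}(?::\\d+)?(?:/[\\w.,-]+)*(?:\\?\\S+)?";

    // Hypothetical helper: prepend a lookbehind (or nothing) and collect matches.
    static List<String> find(String lookbehind, String text) {
        Pattern p = Pattern.compile(lookbehind + BASE, Pattern.CASE_INSENSITIVE);
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "mail abcd@gmail.com, visit google.com or (bing.com/maps)";

        // No lookbehind: the tail of the email address is picked up too.
        System.out.println(find("", text));

        // Positive: only candidates preceded by whitespace or start of string,
        // so the URL inside the brackets is skipped.
        System.out.println(find("(?<=\\s|^)", text));

        // Negative: anything not preceded by @, /, or a word character,
        // so the bracketed URL is still found.
        System.out.println(find("(?<![@\\w/])", text));
    }
}
```

The positive form is stricter (it misses URLs that follow an opening bracket or quote), while the negative form is more permissive, which is the trade-off described above.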


Update 2:

Here is some sample code which will exclude any email addresses from the match:

<cfset jrex = createObject('component','jre-utils').init('CASE_INSENSITIVE') />

<cfsavecontent variable="SampleInput">
check out this site google.com and don't forget to see this too bing.com/maps
this is an email@somewhere.com which should not be matched
</cfsavecontent>

<cfset FindUrlRegex = '(?<=\s|^)(?:https?:)?(?://)?(?:[\w-]+\.)+[a-z]{2,6}(?::\d+)?(?:/[\w.,-]+)*(?:\?\S+)?' />

<cfset MatchedUrls = jrex.match( FindUrlRegex , SampleInput ) />

<cfdump var="#MatchedUrls#" />

Make sure you have downloaded the jre-utils.cfc from here and put it in an appropriate place (e.g. the same directory as the script running this code).

This step is required because the (?<=...) construct does not work in CF regular expressions.
