I'm looking to learn how to create a regex in ColdFusion that will scan through a large block of HTML text and create a list of items.
The items I want are contained between the following tags:
<span class="findme">The Goods</span>
Thanks for any tips to get this going.
You don't say which version of CF you are using. Since v8 you can use REMatch to get an array of matches:
results = REMatch('(?i)<span[^>]+class="findme"[^>]*>(.+?)</span>', text)
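Use ArrayToList to turn that into a list. Note that REMatch returns each whole match, including the surrounding span tags, not just the captured subexpression. For older versions, use REFindNoCase and Mid() to extract the substrings.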
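As a rough sketch of the REMatch approach (CF8 or later), the snippet below collects the matches, strips the span tags from each one, and joins the results into a list. The sample text and the variable names (matches, items, itemList) are made up for illustration, and the tag-stripping step assumes the matched content contains no nested span tags.

```cfm
<!--- Sample input; in practice this would be your HTML text --->
<cfset text = '<span class="findme">The Goods</span><span class="findme">More Goods</span>'>

<!--- REMatch returns an array of whole matches, tags included --->
<cfset matches = REMatch('(?i)<span[^>]+class="findme"[^>]*>(.+?)</span>', text)>

<!--- Strip the opening and closing span tags from each match --->
<cfset items = ArrayNew(1)>
<cfloop array="#matches#" index="m">
    <cfset ArrayAppend(items, REReplaceNoCase(m, '</?span[^>]*>', '', 'all'))>
</cfloop>

<!--- Join the cleaned items into a comma-delimited list --->
<cfset itemList = ArrayToList(items)>
<cfoutput>#itemList#</cfoutput>
```

If the items themselves can contain commas, pass a different delimiter as the second argument to ArrayToList, or keep working with the array.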
EDIT: To answer your follow-up comment: using REFind to return all matches is quite involved, because the function only returns the FIRST match. This means you have to call REFind repeatedly, passing a new start position each time. Ben Forta has written a UDF which does exactly this and will save you some time.
<!---
Returns all the matches of a regular expression within a string.
NOTE: Updated to allow subexpression selection (rather than whole match)
@param regex Regular expression. (Required)
@param text String to search. (Required)
@param subexnum Sub-expression to extract (Optional)
@return Returns a structure.
@author Ben Forta (ben@forta.com)
@version 1, July 15, 2005
--->
<cffunction name="reFindAll" output="true" returnType="struct">
    <cfargument name="regex" type="string" required="yes">
    <cfargument name="text" type="string" required="yes">
    <cfargument name="subexnum" type="numeric" default="1">
    <!--- Define local variables --->
    <cfset var results=structNew()>
    <cfset var pos=1>
    <cfset var subex="">
    <cfset var done=false>
    <!--- Initialize results structure --->
    <cfset results.len=arraynew(1)>
    <cfset results.pos=arraynew(1)>
    <!--- Loop through text --->
    <cfloop condition="not done">
        <!--- Perform search --->
        <cfset subex=reFind(arguments.regex, arguments.text, pos, true)>
        <!--- Anything matched? --->
        <cfif subex.len[1] is 0>
            <!--- Nothing found, outta here --->
            <cfset done=true>
        <cfelse>
            <!--- Got one, add to arrays --->
            <cfset arrayappend(results.len, subex.len[arguments.subexnum])>
            <cfset arrayappend(results.pos, subex.pos[arguments.subexnum])>
            <!--- Reposition start point --->
            <cfset pos=subex.pos[1]+subex.len[1]>
        </cfif>
    </cfloop>
    <!--- If no matches, add 0 to both arrays --->
    <cfif arraylen(results.len) is 0>
        <cfset arrayappend(results.len, 0)>
        <cfset arrayappend(results.pos, 0)>
    </cfif>
    <!--- and return results --->
    <cfreturn results>
</cffunction>
This gives you the start (pos) and length (len) of each match, so to get each substring use another loop:
<cfset text = '<span class="findme">The Goods</span><span class="findme">More Goods</span>' />
<cfset pattern = '(?i)<span[^>]+class="findme"[^>]*>(.+?)</span>' />
<cfset results = reFindAll(pattern, text, 2) />
<cfloop index="i" from="1" to="#ArrayLen(results.pos)#">
    <cfoutput>match #i#: #Mid(text, results.pos[i], results.len[i])#<br></cfoutput>
</cfloop>
EDIT: Updated reFindAll with subexnum argument. Setting this to 2 will capture the first subexpression. The default value 1 captures the entire match.
Try looking into the possibility of making your HTML work with a regular DOM parser and querying it via XPath instead of hammering this through a regex-based abomination.
- to make HTML input usable, pass it through jTidy (see http://jtidy.riaforge.org/)
- Once you have well-formed XML/XHTML, build an XML document from it
<cfset dom = XmlParse(scrubbedHtml, true)>
- query the XML document using XPath
<cfset result = XmlSearch(dom, "//span[@class='findme']")>
Done.
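To turn the XmlSearch() result into a list, something along these lines should work. It is only a sketch: the xhtml string stands in for the jTidy-cleaned markup and is assumed to be well-formed with no namespace declaration, and the variable names (nodes, items) are illustrative.

```cfm
<!--- Stand-in for the cleaned, well-formed markup (no namespace assumed) --->
<cfset xhtml = '<div><span class="findme">The Goods</span><span class="other">Skip</span><span class="findme">More Goods</span></div>'>

<!--- Parse case-sensitively and select the matching span elements --->
<cfset dom = XmlParse(xhtml, true)>
<cfset nodes = XmlSearch(dom, "//span[@class='findme']")>

<!--- XmlSearch returns an array of XML nodes; collect each node's text --->
<cfset items = ArrayNew(1)>
<cfloop index="i" from="1" to="#ArrayLen(nodes)#">
    <cfset ArrayAppend(items, nodes[i].XmlText)>
</cfloop>

<cfoutput>#ArrayToList(items)#</cfoutput>
```

XmlText only holds the element's own text, so a span that wraps further markup would need extra handling.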
EDIT: ColdFusion's XmlSearch() doesn't have great XML namespace support. If you end up producing XHTML instead of the more recommendable XML, use the following XPath (note the colon): "//:span[@class='findme']" or "//*:span[@class='findme']". See here and here for more info.
See the jTidy API documentation for a complete overview of what jTidy can do.