I have a huge folder filled with XML documents, some of which may break because they contain curly quotes (a.k.a. Microsoft Word quotes or smart quotes). I just want to run a quick check to see what I'm up against. Does anybody know how to grep for them so I can easily find the offenders?
Edit: Here's a simplified example.
<?xml version="1.0" encoding="UTF-8"?>
<items>
<item>Pretend this is a curly quote: '</item>
</items>
Curly quotes have the following Unicode code points and UTF-8 sequences:
Name                                   Code point  UTF-8 sequence
----                                   ----------  --------------
LEFT SINGLE QUOTATION MARK             U+2018      0xE2 0x80 0x98
RIGHT SINGLE QUOTATION MARK            U+2019      0xE2 0x80 0x99
SINGLE LOW-9 QUOTATION MARK            U+201A      0xE2 0x80 0x9A
SINGLE HIGH-REVERSED-9 QUOTATION MARK  U+201B      0xE2 0x80 0x9B
LEFT DOUBLE QUOTATION MARK             U+201C      0xE2 0x80 0x9C
RIGHT DOUBLE QUOTATION MARK            U+201D      0xE2 0x80 0x9D
DOUBLE LOW-9 QUOTATION MARK            U+201E      0xE2 0x80 0x9E
DOUBLE HIGH-REVERSED-9 QUOTATION MARK  U+201F      0xE2 0x80 0x9F
XML is usually stored in UTF-8, so you can search for these byte sequences directly.
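To double-check an entry in the table, you can build a quote from its bytes and dump it back as hex. This is only a minimal sketch, assuming a POSIX shell with printf, od and tr; the helper name utf8_hex is made up for illustration.

```shell
# Hypothetical helper: print the bytes of its argument as lowercase hex.
utf8_hex() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

# printf understands octal escapes: \342 \200 \234 are 0xE2 0x80 0x9C,
# i.e. LEFT DOUBLE QUOTATION MARK (U+201C).
left_double=$(printf '\342\200\234')
utf8_hex "$left_double"   # prints e2809c
echo
```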
You can find files containing one of the UTF-8 sequences dalle mentioned like this (here the left double quotation mark, U+201C):
grep -r -P "\xE2\x80\x9C" .
The -r makes it recursive and the -P tells grep to use Perl-compatible regular expressions.
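All eight sequences in dalle's table share the prefix 0xE2 0x80 and end in a byte from 0x98 to 0x9F, so a single pattern can cover them. A sketch, assuming GNU grep built with PCRE support; forcing LC_ALL=C is my own precaution so the pattern is matched byte by byte (in a UTF-8 locale, -P may treat \xE2 as the character U+00E2 rather than as a raw byte). The function name is hypothetical.

```shell
# Hypothetical helper: list files under directory $1 that contain
# any of the eight curly quotes.
list_curly_quote_files() {
    # -r: recurse, -l: print only the names of matching files.
    LC_ALL=C grep -r -l -P '\xE2\x80[\x98-\x9F]' "$1"
}
```

Calling `list_curly_quote_files .` from the top of the XML folder should print one offending file per line.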
If they're XML documents, you could open one that you know contains the offending quotes to see exactly what they look like in the file (and copy them to the clipboard, if you can't reproduce them easily with your keyboard).
Assuming that your quotes look like „ or ”, you could do something like sed -i.bak 's/[”„]/"/g' file1 file2 ... (on Linux, OS X, or Cygwin on Windows) to quickly substitute the offending quotes with normal quotes, modifying the files in place. Writing -i.bak without a space works with both GNU and BSD sed and keeps a .bak backup of each file; the trailing g replaces every occurrence on a line, not just the first.
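If typing the quote characters is awkward, the same substitution can be written against the byte sequences instead. This is a rough sketch, assuming a POSIX shell and sed; it works as a filter on standard input rather than editing in place, the function name is made up, and mapping the low-9 and reversed variants to straight quotes is an assumption about what you want.

```shell
# Hypothetical filter: replace curly quotes on stdin with straight ones.
straighten_quotes() {
    set --   # start with an empty list of sed arguments
    # 0x98..0x9B (octal 230..233): single curly quotes -> '
    for last in 230 231 232 233; do
        set -- "$@" -e "s/$(printf "\\342\\200\\$last")/'/g"
    done
    # 0x9C..0x9F (octal 234..237): double curly quotes -> "
    for last in 234 235 236 237; do
        set -- "$@" -e "s/$(printf "\\342\\200\\$last")/\"/g"
    done
    # LC_ALL=C makes sed treat the input as plain bytes.
    LC_ALL=C sed "$@"
}
```

For example, `straighten_quotes < in.xml > out.xml`. Inside attribute values a straight double quote may itself need escaping, so it is worth checking the result before replacing the original.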
MIGHT BE A DUPLICATE
I had a situation where users would copy and paste strings from anywhere, and I had to reject any special character except quotes, whether smart, fancy, or straight. For example:
Text | Error
----------------
O*Connor| Yes
O'Connor| No
O’Connor| No
I came up with the solution below for my ColdFusion (CF) code.
<cfif #REFind("[[:punct:],[:digit:]]",textName)# GT 0 >
    <!--- Strip everything except letters, quotes/primes, hyphens and spaces --->
    <cfset temp_name = textName.ReplaceAll(JavaCast( "string", "[^A-Za-z\u2018\u2019\u201A\u201B\u2032\u2035\'\-\ ]" ),JavaCast( "string", "" )) >
    <cfif (len(temp_name) EQ len(textName)) >
        <!--- Nothing was stripped: only quotes or hyphens were found, so do nothing --->
    <cfelse>
        <cfset errormsg = "The text contains a special character">
    </cfif>
</cfif>
Immense help from: http://axonflux.com/handy-regexes-for-smart-quotes
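I am on a Mac, and the built-in grep didn't work for me right away (see neubert's answer). I ended up installing Homebrew's version of GNU grep. At the time that meant the homebrew/dupes tap; on newer Homebrew releases a plain brew install grep should do: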
brew tap homebrew/dupes
brew install homebrew/dupes/grep
Then I could run the commands in a similar fashion:
ggrep -r -P "\xE2\x80\x9C" .
etc.
I ended up combining dalle's and neubert's answers into this script, which runs all of the cases that I currently know about and prints them out.
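A script along these lines could look like the following. It is only a sketch and not the original script, assuming a POSIX shell; it builds each byte sequence with printf and searches for it as a fixed string with grep -F, so it should not depend on -P (I have not verified it against the stock macOS grep). The function name and output format are made up.

```shell
# Hypothetical report: for each curly-quote code point, list the files
# under directory $1 that contain its UTF-8 byte sequence.
report_curly_quotes() {
    dir=$1
    # Each entry is "<last byte in octal>:<code point>".
    for entry in 230:U+2018 231:U+2019 232:U+201A 233:U+201B \
                 234:U+201C 235:U+201D 236:U+201E 237:U+201F; do
        last=${entry%%:*}
        name=${entry#*:}
        pattern=$(printf "\\342\\200\\$last")
        # -r: recurse, -l: file names only, -F: fixed string.
        # "|| true" keeps a no-match exit status from aborting the script.
        files=$(LC_ALL=C grep -r -l -F -e "$pattern" "$dir" || true)
        if [ -n "$files" ]; then
            printf '%s:\n%s\n' "$name" "$files"
        fi
    done
}
```

Running `report_curly_quotes .` prints each code point that occurs, followed by the files containing it; code points with no matches are skipped.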