I'm trying to clean up an XML file so that it contains only valid UTF-8 characters, but I'm having issues with a bullet point. The files have a bullet point in them, and if I remove these characters the rest of the regex replace works fine, but the regex doesn't seem to replace this specific bullet character. Looking at the hex, it is 0x07, which is \u0007 in Unicode, but neither form resolved the error ("hexadecimal value 0x07, is an invalid character").
Here is some of the regex replace code (a VB script in SSIS) I'm using, along with several iterations I've tried. Any help would be greatly appreciated.
XMLString = FileIO.FileSystem.ReadAllText(filelocation)
'Dim rgx As Regex = New Regex("[\x00-\x08\x0B-\x0C\x0E-\x1F\u0000-\u0007]", RegexOptions.None)
'Dim rgx As Regex = New Regex("[^0-9a-zA-Z]", RegexOptions.None)
'Dim rgx As Regex = New Regex("[[:^print:]]", RegexOptions.None)
'Dim rgx As Regex = New Regex("[[:^print:][\u0007]]", RegexOptions.None)
Dim rgx As Regex = New Regex("[^\x09\x0A\x0D\x20-\xD7FF\xE000-\xFFFD\x10000-x10FFFF]", RegexOptions.None)
'Dim rgx As Regex = New Regex("[\x00-\x1F\x7F-\xFF]+", RegexOptions.None)
rgx.Replace(XMLString, "")
Thanks.
One thing you need to know is whether your regular expression is being applied against a string of bytes or a string of characters. (In Perl there is an explicit difference; I'm not too sure about VB, where it's usually controlled by the way you read the data in.) The two points below are not "rules" as such, more good form.
- If running against bytes, you should only use the \xXX escape sequences (and XX can only be 2 "digits").
- If running against characters, you should use the \uXXXX escape sequences. (Some regex flavors accept \x{XXXX} for the same thing, but in .NET \x takes exactly two hex digits, so \xXXXX is not equivalent there.)
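To make the difference between the two escape forms concrete, here is a minimal sketch, assuming the .NET regex engine that a VB script in SSIS would use. The module name `EscapeDemo` is just for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module EscapeDemo
    Sub Main()
        ' \x consumes exactly two hex digits, so "\xD7FF" means
        ' U+00D7 followed by the literal text "FF" (three characters).
        Dim threeChars As String = ChrW(&HD7) & "FF"
        Console.WriteLine(Regex.IsMatch(threeChars, "^\xD7FF$")) ' True

        ' \u consumes exactly four hex digits, so "\uD7FF" is the
        ' single character U+D7FF.
        Dim oneChar As String = ChrW(&HD7FF)
        Console.WriteLine(Regex.IsMatch(oneChar, "^\uD7FF$"))    ' True
        Console.WriteLine(Regex.IsMatch(oneChar, "^\xD7FF$"))    ' False
    End Sub
End Module
```

Read this way, a range such as \x20-\xD7FF inside a character class ends at U+00D7 rather than U+D7FF, and the trailing "FF" becomes two literal class members.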
Looking at your uncommented regex, it seems you're looking at characters. This would imply the file must already be in some valid character encoding (probably one of UTF-8, UTF-16LE, or cp1252). So all this regex is meant to do is strip out valid characters that are not allowed according to the XML spec: http://www.w3.org/TR/xml/#charsets . That approach should be fine, provided the escapes are written in a form your regex engine parses as intended.
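As a rough sketch of that idea in VB.NET, the pattern below removes the BMP code points that the XML 1.0 Char production excludes, written with \u escapes. The names `XmlCharCleaner` and `StripInvalidXmlChars` are hypothetical, and the sketch deliberately leaves surrogate code units alone, so supplementary characters survive but lone surrogates are not cleaned up.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module XmlCharCleaner
    ' Code points XML 1.0 forbids in the BMP: C0 controls other than
    ' tab (U+0009), LF (U+000A) and CR (U+000D), plus U+FFFE and U+FFFF.
    Private ReadOnly InvalidXmlChars As New Regex(
        "[\u0000-\u0008\u000B\u000C\u000E-\u001F\uFFFE\uFFFF]")

    ' Hypothetical helper: returns a copy of input without those characters.
    Function StripInvalidXmlChars(input As String) As String
        Return InvalidXmlChars.Replace(input, "")
    End Function

    Sub Main()
        Dim dirty As String = "<item>" & ChrW(7) & " first point</item>"
        Console.WriteLine(StripInvalidXmlChars(dirty))
        ' Prints: <item> first point</item>
    End Sub
End Module
```

A blacklist of the forbidden ranges is used here instead of a negated whitelist because it stays short and avoids having to express the U+10000 to U+10FFFF range, which a UTF-16 string stores as surrogate pairs.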
But if your string is a stream of bytes, and you are trying to ensure it is valid UTF-8, then that is harder to do with a regex. Other than stripping non-ASCII bytes, I don't know how.
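One alternative to a regex for the byte case, sketched here under the assumption that .NET is available (as it is in an SSIS script task), is to decode with a strict `UTF8Encoding` that throws on malformed input. The helper name `IsValidUtf8` is hypothetical.

```vbnet
Imports System
Imports System.Text

Module Utf8Check
    ' Hypothetical helper: True if the bytes decode as well-formed UTF-8.
    Function IsValidUtf8(bytes As Byte()) As Boolean
        ' Second argument (throwOnInvalidBytes) makes decoding strict.
        Dim strict As New UTF8Encoding(False, True)
        Try
            strict.GetString(bytes)
            Return True
        Catch ex As DecoderFallbackException
            Return False
        End Try
    End Function

    Sub Main()
        ' "A" followed by U+2022 (bullet) encoded as UTF-8: E2 80 A2.
        Console.WriteLine(IsValidUtf8(New Byte() {&H41, &HE2, &H80, &HA2})) ' True
        ' "A" followed by 0x95, the cp1252 bullet, which is a lone
        ' continuation byte in UTF-8.
        Console.WriteLine(IsValidUtf8(New Byte() {&H41, &H95}))            ' False
    End Sub
End Module
```

Note that this only checks the encoding. A byte such as 0x07 is well-formed UTF-8 and still has to be removed separately because XML does not allow it.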
One other point: shouldn't you be setting the Global attribute of your regex before doing the replace? Could this be your problem? Is it fixing the first occurrence but not the whole file?
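For what it's worth, .NET's `Regex.Replace` has no global flag: it replaces every match by default. It does, however, return a new string rather than modifying its argument, so it may be worth checking that the result is assigned somewhere. A minimal sketch (the module and variable names are illustrative only):

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module ReplaceDemo
    Sub Main()
        Dim rgx As New Regex("\u0007")
        Dim xmlString As String = "a" & ChrW(7) & "b" & ChrW(7) & "c"

        ' Result discarded: strings are immutable, so xmlString is unchanged.
        rgx.Replace(xmlString, "")
        Console.WriteLine(xmlString.Length) ' 5

        ' Result assigned: both occurrences are removed in one call.
        xmlString = rgx.Replace(xmlString, "")
        Console.WriteLine(xmlString)        ' abc
    End Sub
End Module
```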
With PowerShell I used the following regex:
-replace "\u2022", "" `
as @Brian Reichle mentioned in your comment