I have XML files that contain invalid character sequences, which cause parsing to fail. They look like &#x10;. To solve the problem, I am escaping them by replacing the whole thing with an escape sequence: &#x10; --> !#~10^. Then, after I am done parsing, I can restore them to what they were.
    import re

    buffersize = 2**16  # 64 KB buffer

    def escape(filename):
        out = open(filename + '_esc', 'w')
        with open(filename, 'r') as f:
            buffer = 'x'  # is there a prettier way to handle the first one?
            while buffer != '':
                buffer = f.read(buffersize)
                out.write(re.sub(r'&#x([a-fA-F0-9]+);', r'!#~\1^', buffer))
        out.close()
The files are very large, so I have to use buffering (mmap gave me a MemoryError). Because the buffer has a fixed size, I run into problems when a buffer boundary happens to fall in the middle of a sequence. Imagine the buffer size is 8, and the file is like:
    123456789
    hello!&#x10;
The buffer will only read hello!&#, allowing &#x10; to slip through the cracks. How do I solve this? I thought of getting more characters if the last few look like they could belong to a character sequence, but the logic I thought of is very ugly.
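The failure described above can be reproduced in a few lines. This is only a sketch, assuming Python 3 and an in-memory string in place of the real file; `escape_chunks` is a hypothetical helper that mirrors the loop in the question.

```python
import io
import re

PATTERN = re.compile(r'&#x([a-fA-F0-9]+);')

def escape_chunks(text, buffersize):
    """Escape each fixed-size chunk independently, like the loop above."""
    src = io.StringIO(text)
    out = []
    while True:
        chunk = src.read(buffersize)
        if not chunk:
            break
        out.append(PATTERN.sub(r'!#~\1^', chunk))
    return ''.join(out)

# The whole sequence fits in one chunk, so it is escaped.
print(escape_chunks('hello!&#x10;', 64))  # hello!!#~10^

# The chunks are 'hello!&#' and 'x10;': neither matches, nothing is escaped.
print(escape_chunks('hello!&#x10;', 8))   # hello!&#x10;
```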
First, don't bother to read and write the file: you can create a file-like object that wraps your open file and processes the data before the parser handles it. Second, your buffering only needs to take care of the ends of the chunks it reads. Here's some working code:
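    import re

    class Wrapped(object):
        def __init__(self, f):
            self.f = f
            self.buffer = ""  # unescaped tail held back from the previous read

        def read(self, size=-1):
            data = self.f.read(size)
            buf = self.buffer + data
            self.buffer = ""
            # If there's an ampersand near the end, hold onto that piece until we
            # have more, to be sure we don't miss one. At end of file (no new
            # data) nothing is held back. The "> 0" test keeps a read from
            # returning "" before EOF, which the parser would take as the end.
            if data:
                last_amp = buf.rfind("&", max(0, len(buf) - 10))
                if last_amp > 0:
                    self.buffer = buf[last_amp:]
                    buf = buf[:last_amp]
            # Escape after splitting, so the held-back tail is never escaped twice.
            buf = buf.replace("!", "!!")
            return re.sub(r"&(#x[0-9a-fA-F]+;)", r"!\1", buf)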
Then in your code, replace this:
    it = ET.iterparse(file(xml, "rb"))
with this:
    it = ET.iterparse(Wrapped(file(xml, "rb")))
Third, I used a substitution replacing "&" with "!", and "!" with "!!", so you can fix them after parsing, and you aren't counting on obscure sequences. This is Stack Overflow data after all, so lots of strange random punctuation could occur naturally.
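Undoing that substitution after parsing could look like the following. This is a sketch rather than part of the original answer: `escape` repeats the two substitutions for the sake of a self-contained round trip, and `unescape` is a hypothetical helper name. Because every literal "!" was doubled, a left-to-right scan can tell a literal "!" (seen as "!!") from an escaped "&" (a single "!" followed by a hex reference body).

```python
import re

def escape(text):
    """Apply the same two substitutions as the wrapper: ! -> !!, &#x..; -> !#x..;"""
    text = text.replace("!", "!!")
    return re.sub(r"&(#x[0-9a-fA-F]+;)", r"!\1", text)

# "!!" is tried first, so a doubled "!" is always consumed as a pair;
# any remaining single "!" in front of "#x...;" stands for an escaped "&".
_UNESCAPE = re.compile(r"!!|!(?=#x[0-9a-fA-F]+;)")

def unescape(text):
    """Reverse escape(): !! -> !, and !#x..; -> &#x..;"""
    return _UNESCAPE.sub(lambda m: "!" if m.group() == "!!" else "&", text)

original = "wow!#x10; &#x10; !!"
assert unescape(escape(original)) == original
```

Note that `unescape` gives back the text of the reference (for example `&#x10;`), not the character it names.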
If your sequence is 6 characters long, you can use buffers with 5 overlapping characters. That way, you can be sure no sequence will ever slip between the buffers.
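Here is an example to help you visualize, with a first buffer that cuts the sequence and a second buffer that starts with the last 5 characters of the first:
    data:           --&#x10;--
    first buffer:   --&#x1
    second buffer:   -&#x10;--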
As for the implementation, just prepend the last 5 characters of the previous buffer to the new buffer:
    buffer = buffer[-5:] + f.read(buffersize)
The only problem is that the concatenation may require a copy of the whole buffer. Another solution, if you have random access to the file, is to rewind a little bit with:
    f.seek(-5, os.SEEK_CUR)
In both cases, you'll have to modify the script slightly to handle the first iteration.
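One way to put the overlap idea into a complete loop is sketched below, assuming Python 3. It differs from the one-liner above in two ways, both of which are my assumptions rather than the answer's method: the tail is held back from the output until the next read, so overlapping characters are not written twice, and the tail starts at the last "&" near the end instead of being a fixed 5 characters, since the hex references can vary in length. `escape_stream` and `max_len` are hypothetical names.

```python
import io
import re

PATTERN = re.compile(r'&#x([a-fA-F0-9]+);')

def escape_stream(src, dst, buffersize=2**16, max_len=10):
    """Copy src to dst, escaping &#x...; sequences even across chunk edges.

    max_len is an assumed upper bound on the length of one sequence
    (10 covers "&#x10FFFF;").
    """
    carry = ''  # tail of the previous buffer, not yet written
    while True:
        chunk = src.read(buffersize)
        buf = carry + chunk
        if not chunk:
            # End of input: nothing more can complete a sequence.
            dst.write(PATTERN.sub(r'!#~\1^', buf))
            break
        # An unfinished sequence is at most max_len - 1 characters long,
        # so only an '&' that close to the end can start one.
        cut = buf.rfind('&', max(0, len(buf) - (max_len - 1)))
        if cut == -1:
            cut = len(buf)
        dst.write(PATTERN.sub(r'!#~\1^', buf[:cut]))
        carry = buf[cut:]

src = io.StringIO('hello!&#x10;')
dst = io.StringIO()
escape_stream(src, dst, buffersize=8)
print(dst.getvalue())  # hello!!#~10^
```

Starting with an empty `carry` also takes care of the first iteration without a dummy initial value.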