I'm trying to convert the lines of an RTF file to a series of Unicode strings, and then do a regex match on each line. (I need them to be Unicode so that I can output them to another file.)
However, my regex match isn't working. I think this is because the lines aren't being converted to Unicode properly.
Here's my code:
import re

usefulLines = []
textData = {}
# the regex pattern for an entry in the db (e.g. SUF 76,22): it's sufficient for us to match on three upper-case characters plus a space
entryPattern = r'^([A-Z]{3})[\s].*$'
f = open('textbase_1a.rtf', 'Ur')
fileLines = f.readlines()
# get the matching line numbers, and store in usefulLines
for i, line in enumerate(fileLines):
    #line = line.decode('utf-16be') # this causes an error: I don't really know what file encoding the RTF file is in...
    line = line.decode('mac_roman')
    print line
    if re.match(entryPattern, line):
        # now retrieve the following lines, all the way up until we get a blank line
        print "match: " + str(i)
        usefulLines.append(i)
At the moment, this prints all the lines but never prints a "match" line, even though some lines should match. Also, the lines are being printed with '/par' at the start, for some reason. When I try writing them to an output file, they look very strange.
Part of the problem is that I don't know what encoding to specify. How can I find this out?
If I use entryPattern = '^.*$'
then I do get matches.
Can anyone help?
You are not actually decoding the RTF file at all. RTF files are not simple text files. A file containing "äöü", for example, contains this:
{\rtf1\ansi\ansicpg1252\deff0\deflang1031{\fonttbl{\f0\fswiss\fcharset0 Arial;}}
{\*\generator Msftedit 5.41.15.1507;}\viewkind4\uc1\pard\f0\fs20\'e4\'f6\'fc\par
}
when opened in a text editor. So the characters "äöü" are stored as the hex escapes \'e4\'f6\'fc, which are to be read as windows-1252 bytes, as declared by \ansicpg1252 at the beginning of the file (äöü = 0xE4 0xF6 0xFC).
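As a rough illustration of the point above, here is a minimal Python 3 sketch that reads the declared code page from the \ansicpg control word and decodes the \'xx escapes with it. The helper names rtf_codepage and decode_hex_escapes are made up for this example, and mapping the number to a "cp" + number codec name is an assumption that holds for Windows code pages such as 1252 but not for every value RTF allows. It only handles the hex escapes and leaves all other markup in place.

```python
import re

def rtf_codepage(rtf_source, default="cp1252"):
    """Guess a Python codec name from the \\ansicpgN control word.

    Assumption: "cp" + N is a valid codec name, which is true for
    Windows code pages such as 1252 but not for every possible N.
    """
    found = re.search(r"\\ansicpg(\d+)", rtf_source)
    return "cp" + found.group(1) if found else default

def decode_hex_escapes(rtf_source, codec):
    """Replace each run of \\'xx escapes with the characters it encodes."""
    def replace_run(match):
        hex_pairs = re.findall(r"\\'([0-9a-fA-F]{2})", match.group(0))
        raw = bytes(int(pair, 16) for pair in hex_pairs)
        return raw.decode(codec)
    return re.sub(r"(?:\\'[0-9a-fA-F]{2})+", replace_run, rtf_source)

# Shortened version of the sample file shown above.
sample = r"{\rtf1\ansi\ansicpg1252\deff0 \'e4\'f6\'fc\par}"
codec = rtf_codepage(sample)
decoded = decode_hex_escapes(sample, codec)
print(codec)    # cp1252
print(decoded)  # the escapes are now the characters äöü
```

Decoding whole runs of escapes at once, instead of one escape at a time, also keeps multi-byte code pages working, where a single character can span two escapes.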
For reading RTF you'll first need something that converts RTF to text (already asked here).
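To see why the pattern from the question never matches on the raw file, here is a small Python 3 sketch. The two sample lines are hypothetical: the first imitates a raw RTF line that starts with a control word, the second is what the same entry might look like after a proper RTF-to-text conversion.

```python
import re

# Same pattern as in the question: three upper-case letters, then whitespace.
entry_pattern = re.compile(r"^([A-Z]{3})\s.*$")

raw_line = r"\par SUF 76,22"   # hypothetical raw RTF line (markup still present)
text_line = "SUF 76,22"        # hypothetical line after conversion to plain text

print(bool(entry_pattern.match(raw_line)))   # False: the line starts with a backslash
print(bool(entry_pattern.match(text_line)))  # True
```

The pattern itself is fine. It only needs to be applied to the converted text, not to the RTF markup.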