Python: 2.6 and 3.1 string matching inconsistencies

I wrote my module in Python 3.1.2, but now I have to validate it for 2.6.4.

I'm not going to post all my code since it may cause confusion.

Brief explanation: I'm writing an XML parser (my first interaction with XML) that creates objects from the XML file. There are a lot of objects, so I have a 'unit test' that manually scans the XML and tries to find a matching object. It will print out anything that doesn't have a match.

I open the XML file and use a simple 'for' loop to read line-by-line through the file. If I match a regular expression for an 'application' (XML has different 'application' nodes), then I add it to my dictionary, d, as the key. I perform a lxml.etree.xpath() query on the title and store it as the value. After I go through the whole thing, I iterate through my dictionary, d, and try to match the key to my value (I have to use the get() method from my 'application' class). Any time a mismatch is found, I print the key and title. Python 3.1.2 has all matching items in the dictionary, so nothing is printed. In 2.6.4, every single value is printed (~600 in all). I can't figure out why my string comparisons aren't working.

Without further ado, here's the relevant code:

    for i in d:
        if i[1:-2] != d[i].get('id'):
            print('X%sX Y%sY' % (i[1:-3], d[i].get('id')))

I slice the strings because the strings are different. Where the key would be "9626-2008olympics_Prod-SH"\n the value would be 9626-2008olympics_Prod-SH, so I have to cut the quotes and newline. I also added the Xs and Ys to the print statements to make sure that there weren't any whitespace issues. Here is an example line of output:

X9626-2008olympics_Prod-SHX Y9626-2008olympics_Prod-SHY

Remember to ignore the Xs and Ys. Those strings are identical. I don't understand why Python2 can't match them.

Edit: So the problem seems to be the way that I am slicing. In Python3,

if i[1:-2] != d[i].get('id'):

this comparison works fine.

In Python2,

if i[1:-3] != d[i].get('id'):

I have to change the offset by one.

Why would strings need different offsets? The only possible thing that I can think of is that Python2 treats a newline as two characters (i.e. '\' + 'n').

Edit 2: Updated with requested repr() information.

I added a small amount of code to produce the repr() info from the "2008olympics" example above. I have not done any slicing. It actually looks like it might not be a unicode issue. There is now a "\r" character. Python2:

'"9626-2008olympics_Prod-SH"\r\n' '9626-2008olympics_Prod-SH'

Python3:

'"9626-2008olympics_Prod-SH"\n' '9626-2008olympics_Prod-SH'

Looks like this file was created/modified on Windows. Is there a way in Python2 to automatically suppress '\r'?

You are printing i[1:-3] but comparing i[1:-2] in the loop.

Very Important Question

Why are you writing code to parse XML when lxml will do all that for you? The point of unit tests is to test your code, not to ensure that the libraries you are using work!

Russell Borogrove is right.

Python 3 reads text files with universal newlines by default, so the line ending arrives as a single character. That's why my offset of [1:-2] worked in 3: I needed to eliminate three characters (", ", and \n).

In Python 2, the line ending is being read as two characters ("\r\n"), meaning I have to eliminate four characters and use [1:-3].
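To make the offset arithmetic concrete, here is a small sketch using the sample key from the question. The two string literals are assumptions standing in for what each interpreter read from the file; they are not taken from the real data set.

```python
# Sample ID from the question; the two keys mimic what each Python version read.
expected = '9626-2008olympics_Prod-SH'
key_lf = '"9626-2008olympics_Prod-SH"\n'      # one-character line ending
key_crlf = '"9626-2008olympics_Prod-SH"\r\n'  # two-character line ending

# [1:-2] drops the opening quote, the closing quote, and "\n".
print(key_lf[1:-2] == expected)

# With "\r\n", [1:-2] leaves the closing quote behind, so [1:-3] is needed.
print(repr(key_crlf[1:-2]))
print(key_crlf[1:-3] == expected)
```

The fixed offsets only work if every line has the same ending, which is why the same slice cannot serve both cases.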

I just added a manual check for the Python major version.

Here is the fixed code:

    import sys

    for i in d:
        # The keys in d contain quotes and a line ending which need
        # to be removed. In v3 the line ending is 1 char and in v2
        # it is 2 chars.
        if sys.version_info[0] < 3:
            if i[1:-3] != d[i].get('id'):
                print('%s %s' % (i[1:-3], d[i].get('id')))
        else:
            if i[1:-2] != d[i].get('id'):
                print('%s %s' % (i[1:-2], d[i].get('id')))

Thanks for the responses everyone! I appreciate your help.

repr() and %r format are your friends ... they show you (for basic types like str/unicode/bytes) exactly what you've got, including type.

Instead of

    print('X%sX Y%sY' % (i[1:-3], d[i].get('id')))

use

    print('%r %r' % (i, d[i].get('id')))

Note leaving off the [1:-3] so that you can see what is in i before you slice it.
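A minimal sketch of the difference, assuming the raw key looks like the sample from the question with a Windows line ending (the literal below is made up for illustration):

```python
key = '"9626-2008olympics_Prod-SH"\r\n'  # assumed raw key with a "\r\n" ending

# %s writes the carriage return straight to the terminal, where it is easy to miss.
print('X%sX' % key)

# %r shows the quotes and both line-ending characters as escape sequences.
shown = '%r' % key
print(shown)
```

With %r the "\r" appears as visible text, so an off-by-one in the slice is much easier to spot.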

Update after comment "You are perfectly right about comparing the wrong slice. However, once I change it, python2.6 works, but python3 has the problem now (i.e. it doesn't match any objects)":

How are you opening the file? (Two answers please, for Python 2 and 3.) Are you running on Windows? Have you tried getting the repr() as I suggested?

Update after actual input finally provided by OP:

If, as it appears, your input file was created on Windows (lines are separated by "\r\n"), you can read Windows and Unix text files portably by using the "universal newlines" option: open('datafile.txt', 'rU') on Python2 (see the documentation for open()). Universal newlines mode is the default in Python3. Note that the Python3 docs say that you can use 'rU' also in Python3; this would save you having to test which Python version you are using.
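As a hedged alternative that also avoids a version check: io.open() exists from Python 2.6 onward and is the same function as the built-in open() in Python 3, and its default newline=None enables universal newlines translation. This swaps io.open() in for the 'rU' flag, which was later deprecated and removed in Python 3.11. The file name and contents below are made up for the demonstration.

```python
import io
import os
import tempfile

# Build a throwaway file with Windows line endings (hypothetical contents).
path = os.path.join(tempfile.mkdtemp(), 'datafile.txt')
with io.open(path, 'wb') as f:
    f.write(b'"9626-2008olympics_Prod-SH"\r\n')

# newline=None (the default) turns on universal newlines translation,
# so "\r\n" in the file is returned as "\n".
with io.open(path, 'r', newline=None) as f:
    lines = f.readlines()

print(repr(lines[0]))
```

Read this way, every line ends in a single "\n" on both versions, so one slice offset would be enough.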

I don't understand what you're doing exactly, but would you try using strip() instead of slicing and see whether it helps?

for i in d:
    stripped = i.strip()
    if stripped != d[i].get('id'):
        print('X%sX Y%sY' % (stripped, d[i].get('id')))
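One caveat: strip() with no arguments removes only whitespace, so the double quotes around the keys shown in the question would remain. A rough sketch that removes both, using a hypothetical helper name normalize_key and the sample key from the question:

```python
def normalize_key(raw):
    """Return raw without its line ending and surrounding double quotes."""
    # rstrip('\r\n') removes any trailing run of "\r" and "\n" characters,
    # so the result is the same for Unix and Windows line endings.
    return raw.rstrip('\r\n').strip('"')


for raw in ('"9626-2008olympics_Prod-SH"\n', '"9626-2008olympics_Prod-SH"\r\n'):
    print(normalize_key(raw))
```

Because the helper does not depend on how many characters the line ending has, the comparison against d[i].get('id') would not need a separate branch per Python version.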