Simple text parsing library

Developer https://www.devze.com 2023-02-13 05:33 Source: web

Related topics: python

I have a method that takes addresses from the web, so they contain many known errors, such as:

123 Awesome St, Pleasantville, NY, Get Directions

Which I want to be:

123 Awesome St, Pleasantville, NY

Is there a web service or Python library that can help with this? It's fine for us to start creating a list of items like ", Get Directions" or a more generalized version of that, but I thought there might be a helper library for this kind of textual analysis.

If the address contains one of those bad strings, walk backwards from it until you find a non-whitespace character. If that character is one of your separators, say "," or ":", drop everything from that character onwards. If it's a different character, drop everything after that character.
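The walk-backwards idea can be sketched roughly as follows. This is only an illustration: the function name strip_bad_string, the separator set, and the list of bad strings are assumptions, not something from the question.

```python
import re

SEPARATORS = ",:"  # assumed separator set; the answer names "," and ":"
BAD_STRINGS = ["get directions", "multiple locations"]  # hypothetical list

def strip_bad_string(address, bad_strings=BAD_STRINGS, separators=SEPARATORS):
    """Cut the address at the first listed bad string that occurs in it."""
    for bad in bad_strings:
        match = re.search(re.escape(bad), address, re.IGNORECASE)
        if match is None:
            continue
        # Walk backwards from the bad string past any whitespace.
        end = match.start()
        while end > 0 and address[end - 1].isspace():
            end -= 1
        # address[end - 1] is now the nearest non-whitespace character.
        if end > 0 and address[end - 1] in separators:
            end -= 1  # it is a separator, so drop it as well
        return address[:end]
    return address  # no bad string found; leave the address untouched

print(strip_bad_string("123 Awesome St, Pleasantville, NY, Get Directions"))
```

Note that this sketch stops at the first bad string in list order, so an address containing several of them may need more than one pass.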

Make a list of known bad strings. Then, you could take that list and use it to build a gigantic regex and use re.sub().

This is a naive solution, and isn't going to be particularly performant, but it does give you a clean way of adding known bad strings, by adding them to a file called .badstrings or similar and building the list from them.

Note that if you make bad choices about what these bad strings are, you will break the algorithm. But it should work for the simple cases you describe in the comments.

EDIT: Something like this is what I mean:

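import re

def sanitize_address(address, regex):
    # Remove every match of the bad-string pattern from the address.
    return regex.sub('', address)

badstrings = ['get directions', 'multiple locations']
# Match the commas/whitespace in front of a bad string, then the bad string
# itself; re.escape keeps any special characters in a bad string literal.
base_regex = r'[,\s]+(' + '|'.join(re.escape(s) for s in badstrings) + ')'
regex = re.compile(base_regex, re.I)  # re.I: case-insensitive matching
address = '123 Awesome St, Pleasantville, NY, Get Directions'
print(sanitize_address(address, regex))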

which outputs:

123 Awesome St, Pleasantville, NY
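Building the list from a .badstrings file, as suggested above, might look something like this. It is a minimal sketch: the helper names load_bad_strings and build_regex and the one-string-per-line file layout are assumptions.

```python
import re

def load_bad_strings(path):
    """Read one bad string per line, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]

def build_regex(bad_strings):
    """Compile one case-insensitive pattern covering every bad string."""
    if not bad_strings:
        # An empty alternation would match bare commas and whitespace.
        raise ValueError("need at least one bad string")
    # re.escape keeps characters such as "." or "(" literal.
    alternatives = "|".join(re.escape(s) for s in bad_strings)
    return re.compile(r"[,\s]+(?:" + alternatives + r")", re.IGNORECASE)

# Typical use, assuming a file named .badstrings exists:
#   regex = build_regex(load_bad_strings(".badstrings"))
#   cleaned = regex.sub("", address)
```

Adding a new known bad string then only means appending a line to the file, with no code change.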

I would say that the task is impossible to do with a high degree of confidence unless the data is in a fixed format, or you have a gigantic address database to make matches against.

You could possibly get away with having a list of countries, and then a rule set per country that you use. The American rule set could include a list of states, cities and postal codes and a pattern to find street addresses. You would then drop anything that isn't a state, city or postal code and doesn't look like a street address.
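As a rough illustration of such a rule set, here is a sketch that splits the text on commas and keeps only the parts a rule recognizes. Everything in US_RULES is a hypothetical, deliberately tiny placeholder; real state and city lists and real patterns would be far larger and more careful.

```python
import re

# Hypothetical, deliberately tiny rule set for illustration only.
US_RULES = {
    "states": {"NY", "NJ", "CA"},
    "cities": {"pleasantville"},
    "postal_code": re.compile(r"\d{5}(?:-\d{4})?"),
    # A number followed by one or more words, e.g. "123 Awesome St".
    "street": re.compile(r"\d+\s+\w+(?:\s+\w+)*"),
}

def keep_address_parts(text, rules=US_RULES):
    """Keep only comma-separated parts that some rule recognizes."""
    kept = []
    for part in (p.strip() for p in text.split(",")):
        if (part.upper() in rules["states"]
                or part.lower() in rules["cities"]
                or rules["postal_code"].fullmatch(part)
                or rules["street"].fullmatch(part)):
            kept.append(part)
    return ", ".join(kept)

print(keep_address_parts("123 Awesome St, Pleasantville, NY, Get Directions"))
```

A part such as "NY 10570", with state and postal code in one component, would be dropped by this sketch, which shows how quickly the rules need to grow.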

You'd still drop things that should be a part of an address though, at least with Swedish addresses, which can include just the name of a farm instead of a street and number. If US countryside addresses are the same, there is just no way to know what is a part of an address and what isn't unless you have access to a database with all US addresses. :-)

Here is a regex that will parse either one. If you have other examples, I can change the regex to handle them:

(?<address>(?:[0-9]+\s+(?:\w+\s?)+)+)[,]\s+(?<city>(?:\w+\s?)+)[,]\s+(?<state>(?:\w+\s?)+)(?:$|[,])

This will even work for addresses in a format similar to mine (1234 North 1234 West, Pleasantville, NY).
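Since the question is about Python, note that Python's standard re module writes named groups as (?P<name>...) rather than (?<name>...). A sketch of the same pattern adapted to that syntax follows; the name ADDRESS_RE and the helper extract_address are made up for this example.

```python
import re

# Same pattern as above, with named groups in Python's (?P<name>...) syntax.
ADDRESS_RE = re.compile(
    r"(?P<address>(?:[0-9]+\s+(?:\w+\s?)+)+)[,]\s+"
    r"(?P<city>(?:\w+\s?)+)[,]\s+"
    r"(?P<state>(?:\w+\s?)+)(?:$|[,])"
)

def extract_address(text):
    """Return "street, city, state", or None if the pattern does not match."""
    match = ADDRESS_RE.search(text)
    if match is None:
        return None
    parts = (match.group("address"), match.group("city"), match.group("state"))
    return ", ".join(p.strip() for p in parts)

print(extract_address("123 Awesome St, Pleasantville, NY, Get Directions"))
print(extract_address("1234 North 1234 West, Pleasantville, NY"))
```

Because the pattern only captures the street, city and state, anything after the state (such as ", Get Directions") is simply left out of the result.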