I need to parse normal mailing addresses in VB.NET. The requirement is that the address be split into two variables. So if the address is 12300 euclid st. then it will be "12300" and "euclid st." in two different variables. Also, if the address is 123 B4 euclid st then "123 B4" and "euclid st". Sometimes the address is 12008 B2 euclid st Apt 12. In this case I want "12008 B2" and "euclid st" in the first two variables and "Apt 12" in a third variable. How can I do this?
Here's a regex solution. For this to work you need to define exactly what you expect the data to look like. Slight variations can render the pattern useless. If more requirements are needed I suggest breaking the problem down and using a combination of splitting, parsing and maybe regex.
I came up with this pattern assuming the address starts with a number, then an optional group of letters followed by digits ("B2"), then the street, and finally an optional Apt/Ste/Unit etc. Given this definition the pattern you can use is:
"^(?<StreetNumber>\d+(?:\s[a-zA-Z]+\d+)?)\s+(?<Street>.+?)\s*(?<Address2>(?:Apt|Ste|Unit).+)?$"
Here's an example with a commented pattern (RegexOptions.IgnorePatternWhitespace required):
' Requires: Imports System.Text.RegularExpressions
Dim inputs() As String = New String() { "12008 Euclid St", "12008 B2 Euclid St", "12008 B2 Euclid St Apt 12", "12345 C8 Euclid Ave Ste #1" }
Dim pattern As String = "^ (?# beginning of line or sentence)" & _
    "(?<StreetNumber>\d+(?:\s[a-zA-Z]+\d+)?) (?# digits then optional space, letters and digits)" & _
    "\s+(?<Street>.+?) (?# at least one space followed by any char at least once)" & _
    "\s*(?<Address2>(?:Apt|Ste|Unit).+)? (?# optional spaces, Apt/Ste/etc. and at least one char)" & _
    "$ (?# end of line or sentence)"
Dim rx As Regex = New Regex(pattern, RegexOptions.IgnorePatternWhitespace)
For Each input As String In inputs
    Dim m As Match = rx.Match(input)
    If m.Success Then
        Dim streetNumber As String = m.Groups("StreetNumber").Value
        Dim street As String = m.Groups("Street").Value
        ' Address2 is an empty string when the optional group did not match
        Dim address2 As String = m.Groups("Address2").Value
        Console.WriteLine("Street Number: {0}", streetNumber)
        Console.WriteLine("Street: {0}", street)
        If address2 <> "" Then Console.WriteLine("Address2: {0}", address2)
        Console.WriteLine()
    End If
Next
To use the pattern directly (without comments) replace the pattern with this:
Dim pattern As String = "^(?<StreetNumber>\d+(?:\s[a-zA-Z]+\d+)?)\s+(?<Street>.+?)\s*(?<Address2>(?:Apt|Ste|Unit).+)?$"
Then remove RegexOptions.IgnorePatternWhitespace from the initialization:
Dim rx As Regex = New Regex(pattern)
I did this years ago in Access Basic. I started from the end and worked toward the beginning. Much, much easier.
Go to USPS and get "Pub. 28 - Postal Addressing Standards" at http://pe.usps.gov/text/pub28/welcome.htm.
Read "Delivery Address Line" at http://pe.usps.gov/text/pub28/pub28c2_012.htm and review its sections.
This guide contains address guidelines, address types, formats, and abbreviations. Appendixes have abbreviations. Extremely helpful.
If I can find my old code, I will post.
Note: while regex is nice, VB.NET's Like operator can be much easier and cleaner to work with in some cases.
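As a rough illustration of the Like operator for this kind of check, here is a minimal sketch. The module and function names are made up for the example, and the assumption that a unit code is exactly one letter plus one digit (as in "B2") is mine, not something from the question.

```vbnet
Imports System

Module LikeSketch
    ' True when the token is made of digits only, e.g. "12300".
    ' In a Like pattern "#" matches one digit and "[!0-9]" matches one non-digit.
    Function IsHouseNumber(token As String) As Boolean
        Return token Like "#*" AndAlso Not (token Like "*[!0-9]*")
    End Function

    ' True when the token is one letter followed by one digit, e.g. "B2".
    ' (Assumed shape of a unit code; widen the pattern if your data differs.)
    Function IsUnitCode(token As String) As Boolean
        Return token Like "[A-Za-z]#"
    End Function

    Sub Main()
        Console.WriteLine(IsHouseNumber("12300"))  ' True
        Console.WriteLine(IsHouseNumber("7b"))     ' False
        Console.WriteLine(IsUnitCode("B2"))        ' True
        Console.WriteLine(IsUnitCode("Euclid"))    ' False
    End Sub
End Module
```

Like only answers yes or no; it does not capture groups, so you would still split the string yourself and use checks like these on each piece.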
With That Said
I did this about 15 years ago, when there was no http://zip4.usps.com/zip4/welcome.jsp API service available. (I only coded enough to validate a few thousand records. It was too costly to hire a vendor to validate such a small number of addresses.) USPS now has "Web Tools" that can do this work for you. I strongly recommend you check out http://www.usps.com/webtools for your needs and try to avoid writing code. Moreover, a vendor might be better suited and more cost-effective for validating a large number of addresses. Ten years ago, I believe it cost a client $2000 to validate 78,000 records.
You should use String.Split and then parse the results using Regex and possibly recombining some later.
This would be difficult to do without adding a delimiter between the sections (i.e. "12008 B2, euclid st; Apt 12"). However, I would probably split the string into tokens using ' ' (space) as a delimiter, then iterate through until you reach the first token with no numbers in it. Recombine the tokens before that one into a single string and place that in your first variable.
Then continue iterating until you hit the "Apt" token, and combine all the tokens prior to that one (after the ones in the first variable) into a string; that is your second variable. The remaining tokens would be combined into the third variable.
There are issues with this approach. It requires the address to be in exactly the format you specified, and all tokens before the street name MUST have a number in them. It's a little quirky, but if the format of the address is well defined it should work.
Note: I know very little about regex, which msp suggested, so I won't suggest it myself, but it could well be suited to just this problem.
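The token walk described above could look something like the sketch below. It is only one way to write it: the names SplitAddress and UnitWords are invented for the example, and the list of unit words (Apt, Ste, Unit) is an assumption you would need to extend for real data.

```vbnet
Imports System
Imports System.Linq
Imports System.Text.RegularExpressions

Module TokenSplitSketch
    ' Hypothetical list of unit designators; extend as needed.
    Private ReadOnly UnitWords As String() = {"APT", "STE", "UNIT"}

    ' Returns {number part, street part, unit part}; the unit part may be empty.
    Function SplitAddress(address As String) As String()
        Dim tokens As String() = address.Split(New Char() {" "c}, StringSplitOptions.RemoveEmptyEntries)

        ' The street starts at the first token that contains no digit.
        Dim streetStart As Integer = tokens.Length
        For i As Integer = 0 To tokens.Length - 1
            If Not Regex.IsMatch(tokens(i), "\d") Then
                streetStart = i
                Exit For
            End If
        Next

        ' The unit starts at the first unit word after the street has begun.
        Dim unitStart As Integer = tokens.Length
        For i As Integer = streetStart + 1 To tokens.Length - 1
            If Array.IndexOf(UnitWords, tokens(i).TrimEnd("."c).ToUpperInvariant()) >= 0 Then
                unitStart = i
                Exit For
            End If
        Next

        Dim numberPart As String = String.Join(" ", tokens.Take(streetStart))
        Dim streetPart As String = String.Join(" ", tokens.Skip(streetStart).Take(unitStart - streetStart))
        Dim unitPart As String = String.Join(" ", tokens.Skip(unitStart))
        Return New String() {numberPart, streetPart, unitPart}
    End Function

    Sub Main()
        Dim parts As String() = SplitAddress("12008 B2 euclid st Apt 12")
        Console.WriteLine(parts(0))  ' 12008 B2
        Console.WriteLine(parts(1))  ' euclid st
        Console.WriteLine(parts(2))  ' Apt 12
    End Sub
End Module
```

The same caveat applies: a street name that itself contains a digit (say "1st St") ends up in the number part, so this only holds while the input follows the format in the question.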
You could spend a lot of time coming up with two regex expressions to try to do this, or you can do (Edit: improved line 6)
Dim i As Integer ' requires Imports System.Text.RegularExpressions
Dim Address As String = "1234 7b Miller Street"
Dim Addresses() As String = Address.Split(" "c)
Dim SecondPartIndex As Integer = 0
For i = 0 To Addresses.Length - 1
    If Regex.IsMatch(Addresses(i), "^\D+$") AndAlso SecondPartIndex = 0 Then ' first token with no digits
        SecondPartIndex = i
    End If
Next
Dim FirstTerm As String = ""
Dim SecondTerm As String = ""
For i = 0 To Addresses.Length - 1
    If i < SecondPartIndex Then
        FirstTerm = String.Format("{0} {1}", FirstTerm, Addresses(i))
    Else
        SecondTerm = String.Format("{0} {1}", SecondTerm, Addresses(i))
    End If
Next
FirstTerm = FirstTerm.Trim
SecondTerm = SecondTerm.Trim
My logic is that the first token to contain no digits is the start of the second part (though the first token is always in the first part). That's probably the best starting point. An advantage of this code over pure regex is that it's easier to allow for exception cases as you find them.
This is best accomplished through address correction software, which will not only parse out the fields but give you information on the overall validity of the address according to the USPS. Such software will provide an API for you to call. Most of the packages that do this are expensive. However, I know of at least one that is rather inexpensive: http://www.semaphorecorp.com
My advice is to begin by adding in some validation on user input.
If that's not a possible solution, you'll need to go through the laborious task of splitting the input and evaluating each element of the array, via something like a Select Case statement, to determine whether it fits your desired "candidate" requirements. Then run through another Select Case statement to further differentiate between elements such as element(0) = "1234", element(1) = "1st", element(2) = "St".
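To make the Select Case idea a bit more concrete, here is a small sketch that sorts tokens into rough categories. The TokenKind enum, its category names and the regex rules are all assumptions for illustration; real addresses will need more cases.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module TokenClassifierSketch
    ' Hypothetical categories, for illustration only.
    Enum TokenKind
        HouseNumber
        OrdinalStreet
        UnitDesignator
        Word
    End Enum

    ' Select Case True runs the first Case whose condition is True.
    Function Classify(token As String) As TokenKind
        Select Case True
            Case Regex.IsMatch(token, "^\d+$")
                Return TokenKind.HouseNumber
            Case Regex.IsMatch(token, "^\d+(st|nd|rd|th)$", RegexOptions.IgnoreCase)
                Return TokenKind.OrdinalStreet
            Case Regex.IsMatch(token, "^(apt|ste|unit)\.?$", RegexOptions.IgnoreCase)
                Return TokenKind.UnitDesignator
            Case Else
                Return TokenKind.Word
        End Select
    End Function

    Sub Main()
        For Each token As String In "1234 1st St".Split(" "c)
            Console.WriteLine("{0}: {1}", token, Classify(token))
        Next
        ' 1234: HouseNumber
        ' 1st: OrdinalStreet
        ' St: Word
    End Sub
End Module
```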
Front-loading user input validation helps to reduce the headache of parsing through the virtually limitless variability of user input. For instance, I could feasibly input something like "Corner of 2nd & 3rd" and the program may not know what to do with it.
However, providing multiple text boxes to allow a user to specify if he or she lives in an apartment, for example, greatly reduces the complexity of the problem. I'd have a text box for Address, in which the number and street name are input, and then an additional textbox that is optional for Apt.
Once you have it split up, you can validate the Address box by ensuring that it starts with a street number, followed by whatever else makes up the street name. Then you can split the value into an array, take the first element as the number and the concatenation of the rest as the street, and you have your address.
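For the separate-text-box case, the "first value plus the rest" split could be as short as the sketch below. SplitNumberAndStreet is a made-up name, and returning Nothing for input that fails validation is just one possible convention.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module TwoFieldSketch
    ' Returns {street number, street name}, or Nothing when the line does not
    ' start with a numeric street number followed by at least one more word.
    Function SplitNumberAndStreet(addressLine As String) As String()
        ' A count of 2 keeps everything after the first space in one piece.
        Dim parts As String() = addressLine.Trim().Split(New Char() {" "c}, 2)
        If parts.Length < 2 OrElse Not Regex.IsMatch(parts(0), "^\d+$") Then
            Return Nothing
        End If
        Return New String() {parts(0), parts(1).Trim()}
    End Function

    Sub Main()
        Dim parts As String() = SplitNumberAndStreet("12300 euclid st.")
        Console.WriteLine(parts(0))  ' 12300
        Console.WriteLine(parts(1))  ' euclid st.
        Console.WriteLine(SplitNumberAndStreet("Corner of 2nd & 3rd") Is Nothing)  ' True
    End Sub
End Module
```

A Nothing result is the point where you would ask the user to correct the entry instead of guessing.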
Hope this helps!