开发者

How can I compare two name strings that are formatted differently in SQL Server?

开发者 https://www.devze.com 2023-01-25 22:32 出处:网络
What w开发者_C百科ould be the best approach for comparing the following set of strings in SQL Server?

What w开发者_C百科ould be the best approach for comparing the following set of strings in SQL Server?

Purpose: The main purpose of this Stored Procedure is to compare a set of input Names of Customers to names that Exist in the Customer database for a specific accounts. If there is a difference between the input name and the name in the Database this should trigger the updating of the Customer Database with the new name information.

Conditions:

Format of Input: FirstName [MiddleName] LastName

Format of Value in Database: LastName, FirstName MiddleName

The complication arises when names like this are presented,

Example:

Input: Dr. John A. Mc Donald

Database: Mc Donald, Dr. John A.

For last names that consist of 2 or more parts what logic would have to be put into place to ensure that the lastname in the input is being compared to the lastname in the database and likewise for the first name and middle name.

I've thought about breaking the database values up into a temp HASH table since I know that everything before the ',' in the database is the last name. I could then check to see if the input contains the lastname and split out the FirstName [MiddleName] from it to perform another comparison to the database for the values that come after the ','.

There is a second part to this however. In the event that the input name has a completely New last name (i.e. if the name in the database is Mary Smith but the updated input name is now Mary Mc Donald). In this case comparing the database value of the last name before the ',' to the input name will result in no match which is correct, but at this point how does the code know where the last name even begins in the input value? How does it know that her Middle name isn't MC and her last name Donald?

Has anyone had to deal with a similar problem like this before? What solutions did you end up going with?

I greatly appreciate your input and ideas.

Thank you.


Realistically, it's extremely computationally difficult (if not impossible) to know if a name like "Mary Jane Evelyn Scott" is first-middle-last1-last2, first1-first2-middle-last, first1-first2-last1-last2, or some other combination... and that's not even getting into cultural considerations...

So personally, I would suggest a change in the data structure (and, correspondingly, the application's input fields). Instead of a single string for name, break it into several fields, e.g.:

FullName{
  title,      //i.e. Dr., Professor, etc.
  firstName,  //or given name
  middleName, //doesn't exist in all countries!
  lastName,   //or surname
  qualifiers  //i.e. Sr., Jr., fils, D.D.S., PE, Ph.D., etc.
}

Then the user could choose that their first name is "Mary", their middle name is "Jane Evelyn", and their last name is "Scott".

UPDATE
Based on your comments, if you must do this entirely in SQL, I'd do something like the following:

  1. Build a table for all possible combinations of "lastname, firstname [middlename]" given an input string "firstname [middlename] lastname"
  2. Run a query based on the join of your original data and all possible orderings.

So, step 1. would take the string "Dr. John A. Mc Donald" and create the table of values:

'Donald, Dr. John A. Mc'
'Mc Donald, Dr. John A.'
'A. Mc Donald, Dr. John'
'John A. Mc Donald, Dr.'

Then step 2. would search for all occurrences of any of those strings in the database.

Assuming MSSQL 2005 or later, step 1. can be achieved using some recursive CTE, and a modification of a method I've used to split CSV strings (found here) (SQL isn't the ideal language for this form of string manipulation...):

declare @str varchar(200)
set @str = 'Dr. John A. Mc Donald'

--Create a numbers table
select [Number] = identity(int)
into #Numbers
from sysobjects s1
    cross join sysobjects s2

create unique clustered index Number_ind on #Numbers(Number) with IGNORE_DUP_KEY

;with nameParts as (
    --Split the name string at the spaces.
    select [ord] = row_number() over(order by Number),
        [part] = substring(fn1, Number, charindex(' ', fn1+' ', Number) - Number)
    from (select @str fn1) s
        join #Numbers n on substring(' '+fn1, Number, 1) = ' '
    where Number<=Len(fn1)+1

),
lastNames as (
    --Build all possible lastName strings.
    select [firstOrd]=ord, [lastOrd]=ord, [lastName]=cast(part as varchar(max))
    from nameParts
    where ord!=1 --remove the case where the whole string is the last name
    UNION ALL
    select firstOrd, p.ord, l.lastName+' '+p.part
    from lastNames l
        join nameParts p on l.lastOrd+1=p.ord
),
firstNames as (
    --Build all possible firstName strings.
    select [firstOrd]=ord, [lastOrd]=ord, [firstName]=cast(part as varchar(max))
    from nameParts
    where ord!=(select max(ord) from nameParts) --remove the case where the whole string is the first name
    UNION ALL
    select p.ord, f.lastOrd, p.part+' '+f.firstName
    from firstNames f
        join nameParts p on f.firstOrd-1 = p.ord
)
--Combine for all possible name strings.
select ln.lastName+', '+fn.firstName
from firstNames fn
    join lastNames ln on fn.lastOrd+1=ln.firstOrd
where fn.firstOrd=1
    and ln.lastOrd = (select max(ord) from nameParts)

drop table #Numbers


Since I had my share of terrible experience with data from third parties, it is almost guaranteed that the input data will contain lots of garbage not following the specified format.
When trying to match data multipart string data like in your case, I preprocessed both input and our data into something I called "normalized string" using the following method.

  1. strip all non-ascii chars (leaving language-specific chars like "č" intact)
  2. compact spaces (replace multiple spaces with single one)
  3. lower case
  4. split into words
  5. remove duplicates
  6. sort alphabetically
  7. join back to string separated by dashes

Using you sample data, this function would produce:

Dr. John A. Mc Donald -> a-donald-dr-john-mc
Mc Donald, Dr. John A.-> a-donald-dr-john-mc

Unfortunaly it's not 100% bulletproof, there are cases where degenerated inputs produce invalid matches.


Your name field is bad in the database. Redesign and get rid of it. If you havea a first name, middlename, lastname, prefix and suffix sttructure, you can hava computed filed that has the structure you are using. But it is a very poor way to store data and your first priority should be to stop using it.

Since you have a common customer Id why aren't you matching on that instead of name?

0

精彩评论

暂无评论...
验证码 换一张
取 消