The best way to join two dissimilar mySQL tables -- planning for django from python_问答_开发者

table a (t_a):
id  name    last    first   email           state   country
0   sklass  klass   steve   sklass@foo.com  in      uk
1   jabid   abid    john    abid@foo.com    ny      us
2   jcolle  colle   john    jcolle@foo.com  wi      us


table b (t_b):
id  sn      given   nick    email           l   c   
0   steven  klass   steve   sklass@foo.com  in  uk
1   john    abid    -       abid_j@foo.com  ny  us
2   johnny  colle   john    jcolle@foo.com  wi  us
3   john    abid    -       abid@foo.com    ny  us

What is listed above is an (abbreviated) column and row mySQL tables. Looking at the two tables it becomes pretty clear that by strictly looking at the values (id's not looked at) and comparing the matching number of values you would get these value matches.

t_a     t_b
0       0
1       3
2       2
-       1

What I ultimately looking to do is to do this in Django — I'm not sure if that matters. In the past I have done this using pure python in which I destroy the old data and just create three new tables. I want to shift away from my implementation (listed below) because the problems I see is that time changes things and people come and go. In the past I have just regenerated the data -- but now I want to keep track of when people leave and don't simply replace (delete) the data. I be开发者_如何学Clieve that by doing a SQL update is more elegant and preserves the history.

I'd like to know how to get this merged answer directly from mySQL (Either a SQL function or the construction of a new table) which merges the data in the following manner. I want to do this using pure SQL (I believe then I can do this in Django). So I am looking for a solution which meets the following criteria:

There is a min_match which defines the minimum number of matches between the two rows of which must be aligned to be considered valid.
While the tables may have different lengths it is a 1-to-1 mapping. In other words many to one may not happen (yet)

Now my background is python and for me the simplest way to do this has always been to do a for loop over the shorter of the two tables, which then does a for loop over the other table looking at the number of matches. In code this looks like this.

t_a = [ ["sklass", "klass", "steve", "sklass@foo.com", "in", "uk", ],
        ["jabid", "abid", "john", "abid@foo.com", "ny", "us", ],
        ["jcolle", "colle", "john", "jcolle@foo.com", "wi", "us", ], ]

t_b = [ ["steven", "klass", "steve", "sklass@foo.com", "in", "uk",],
        ["john", "abid", "abid_j@foo.com", "ny", "us",],
        ["johnny", "colle", "john", "jcolle@foo.com", "wi", "us",],
        ["john", "abid", "abid@foo.com", "ny", "us",], ]

min_match = 3

for person_a in t_a:
    match = 0
    match_pct = 0.0
    match_a_index = t_a.index(person_a)
    for person_b in t_b:
        new_match_count = len(list(set(person_a) & set(person_b)))
        if new_match_count > match:
            match = new_match_count
            match_b_index = t_b.index(person_b)
            match_pct = "%.2f" % (float(new_match_count) / \
              float(len(set(person_a + person_b))) * 100)
    if match >= min_match:
        print match_a_index, match_b_index #, match_pct, match

The comments beg the question why don't you just join on the email address. I don't necessarily know that the values in a column will match. I am certain that values from a given row in t_a will match values for a row in t_b. I want the highest (most probable) match for a given row in t_a to t_b and only if the number of matches is higher than min_match.

You can do this in MySQL directly via a cursor executed through a stored procedure.

DELIMITER $$
CREATE PROCEDURE `proc_name`()
BEGIN
  DECLARE done INT DEFAULT 0;
  DECLARE a_id BIGINT UNSIGNED;
  DECLARE b_id BIGINT UNSIGNED;
  DECLARE x_count INT;

  -- something like the following
  DECLARE cur1 CURSOR FOR SELECT t_a.id, t_b.id FROM t_a, t_b WHERE t_a.email = t_b.email;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;

  SELECT COUNT(*) INTO x_count FROM t_a, t_b WHERE t_a.email = t_b.email;

  IF(x_count > <some_min_value>) THEN

    OPEN cur1;

    REPEAT
      FETCH cur1 INTO a_id, b_id;
      IF NOT done THEN

        -- do something here like update rows, remove rows, etc.
        -- a_id and b_id hold the two id values for the two tables which
        -- I assume to be primary keys

      END IF;
    UNTIL done END REPEAT;

    CLOSE cur1;

  END IF;
END
$$