table a (t_a):
id name last first email state country
0 sklass klass steve sklass@foo.com in uk
1 jabid abid john abid@foo.com ny us
2 jcolle colle john jcolle@foo.com wi us
table b (t_b):
id sn given nick email l c
0 steven klass steve sklass@foo.com in uk
1 john abid - abid_j@foo.com ny us
2 johnny colle john jcolle@foo.com wi us
3 john abid - abid@foo.com ny us
What is listed above is an (abbreviated) column and row mySQL tables. Looking at the two tables it becomes pretty clear that by strictly looking at the values (id's not looked at) and comparing the matching number of values you would get these value matches.
t_a t_b
0 0
1 3
2 2
- 1
What I ultimately looking to do is to do this in Django — I'm not sure if that matters. In the past I have done this using pure python in which I destroy the old data and just create three new tables. I want to shift away from my implementation (listed below) because the problems I see is that time changes things and people come and go. In the past I have just regenerated the data -- but now I want to keep track of when people leave and don't simply replace (delete) the data. I be开发者_如何学Clieve that by doing a SQL update is more elegant and preserves the history.
I'd like to know how to get this merged answer directly from mySQL (Either a SQL function or the construction of a new table) which merges the data in the following manner. I want to do this using pure SQL (I believe then I can do this in Django). So I am looking for a solution which meets the following criteria:
- There is a min_match which defines the minimum number of matches between the two rows of which must be aligned to be considered valid.
- While the tables may have different lengths it is a 1-to-1 mapping. In other words many to one may not happen (yet)
Now my background is python and for me the simplest way to do this has always been to do a for loop over the shorter of the two tables, which then does a for loop over the other table looking at the number of matches. In code this looks like this.
t_a = [ ["sklass", "klass", "steve", "sklass@foo.com", "in", "uk", ],
["jabid", "abid", "john", "abid@foo.com", "ny", "us", ],
["jcolle", "colle", "john", "jcolle@foo.com", "wi", "us", ], ]
t_b = [ ["steven", "klass", "steve", "sklass@foo.com", "in", "uk",],
["john", "abid", "abid_j@foo.com", "ny", "us",],
["johnny", "colle", "john", "jcolle@foo.com", "wi", "us",],
["john", "abid", "abid@foo.com", "ny", "us",], ]
min_match = 3
for person_a in t_a:
match = 0
match_pct = 0.0
match_a_index = t_a.index(person_a)
for person_b in t_b:
new_match_count = len(list(set(person_a) & set(person_b)))
if new_match_count > match:
match = new_match_count
match_b_index = t_b.index(person_b)
match_pct = "%.2f" % (float(new_match_count) / \
float(len(set(person_a + person_b))) * 100)
if match >= min_match:
print match_a_index, match_b_index #, match_pct, match
The comments beg the question why don't you just join on the email address. I don't necessarily know that the values in a column will match. I am certain that values from a given row in t_a will match values for a row in t_b. I want the highest (most probable) match for a given row in t_a to t_b and only if the number of matches is higher than min_match.
You can do this in MySQL directly via a cursor executed through a stored procedure.
DELIMITER $$
CREATE PROCEDURE `proc_name`()
BEGIN
DECLARE done INT DEFAULT 0;
DECLARE a_id BIGINT UNSIGNED;
DECLARE b_id BIGINT UNSIGNED;
DECLARE x_count INT;
-- something like the following
DECLARE cur1 CURSOR FOR SELECT t_a.id, t_b.id FROM t_a, t_b WHERE t_a.email = t_b.email;
DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;
SELECT COUNT(*) INTO x_count FROM t_a, t_b WHERE t_a.email = t_b.email;
IF(x_count > <some_min_value>) THEN
OPEN cur1;
REPEAT
FETCH cur1 INTO a_id, b_id;
IF NOT done THEN
-- do something here like update rows, remove rows, etc.
-- a_id and b_id hold the two id values for the two tables which
-- I assume to be primary keys
END IF;
UNTIL done END REPEAT;
CLOSE cur1;
END IF;
END
$$
精彩评论