开发者

Searching large data set for multiple strings

开发者 https://www.devze.com 2023-03-31 00:55 出处:网络
I am working with two sets of data: Set 1 has 500 members in a single column (all strings) Set 2 has 30,000 members in two columns (tab delimited) where column 1 is a numerical ID number and column 2

I am working with two sets of data:

Set 1 has 500 members in a single column (all strings) Set 2 has 30,000 members in two columns (tab delimited) where column 1 is a numerical ID number and column 2 is a , delimited list of strings (10,000 possibilities).

I need to search column 2 of Set 2 for the strings in set 1 and tag (or isolate) all of the rows in Set 2 with at least 1 match. Hits also need to be very specific (ie, I want to grab "Jake has a tennis ball" but not "I like that Jake has a tennis ball bag"). If it helps, I can figure out all of the strings that I need to specifically avoid.

I began in excel ("if(isnumber(search(") but discovered that there is a limit to the number of nested if statements one can make. I also had some success using grep but I realized that it was isolating lines that I specifically need to avoid (the "I like that Jake has a tennis ball bag" type lines).

I'm starting to think that python is the way to go, but I'm not sure how to code it. Does anyone else have any suggestions?

here is a sample from set 2 (ID's spoofed to protect anonymity) to give a better idea of what I'm working with:

1230    DEVELOPMENTAL DELAY, LANGUAGE DELAY, MOTOR DELAY  
2257    MULTIPLE CONGENITAL ANOMALIES  
2344    MICROCEPHALY, AUTISM SPECTRUM DISORDER, SHORT STATURE  
3342    DEVELOPMENTAL DELAY, SEIZURE DISORDER, ATAXIA  
7651    CONGENITAL ANOMALY, UNSPECIFIED  
7869    FAMILY HISTORY OF AUTISM SPECTRUM DISORDER

In this scenario, I would be searching for the term "AUTISM SPECTRUM DI开发者_如何学JAVASORDER" and I want ID 2344 but not ID 7869.

Thanks for the help!


If you just want an exact match with Set 1 you can do this with lookups. enter this formula in the row 1 of column3 of set 2, where SET1 is the range address of SET1 =IF(ISERROR(MATCH($a1,SET1,0),"Miss","Hit") Then copy the formula down 30000 rows (doubleclick the litle black sqaure in the southeast corner of the cell). (There is also a more efficient variation if you can sort Set 1)


Not sure about how you hold your sets, but somehow you have to transform/import them in some python data structures. It seems that a very simple way to do this is to hold set1 in a python set (or list) and set2 in a dict, with ids as keys and strings as values. Then a simple iteration through keys and appropriate value searching would yield your desired results. This should suffice

set1 = set( ... ) # your 500 members must be here 
set2 = dict( ... ) # your 30000 other ids-strings stuff

# and a simple function to make queries

def handle_query( query_string ):
    for id in set2:
        if query_string in set2[id].split(","):
            yield id   

# now you can call it for every string in set1 (or whatever...)

for string in set1:
    print string, [ i for i in handle_query( string ) ] 

and yes, python is the way to go :)

However, this is not optimal (and scales badly if the dataset is larger), cause you have to iterate over all ids. This is ok for 30000 ids, but "almost perfection" lies in something called the "inverted index", which is a core concept of a search engine. So the optimal way to "hold" your data of set2 is in a sense:

DEVELOPMENTAL DELAY -> 1230, another_id, yet_another_id, ...
LANGUAGE DELAY -> 1230, ...
MOTOR DELAY -> 1230, ...
MULTIPLE CONGENITAL ANOMALIES -> 2257, ...
etc

Now, strings are keys and (a list of) ids are values. So, "strict" string queries should yield all the desired ids... Implementation is left as an exersice to the reader :)

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号