开发者

Record linking data frames with function RLBigDataLinkage in package RecordLinkage in R

开发者 https://www.devze.com 2023-02-27 01:32 出处:网络
I would like to link two data frames that use different numeric keys, but similar strings. Specifically, one data frame uses a numeric key GVKEY and a company name CONML

I would like to link two data frames that use different numeric keys, but similar strings. Specifically, one data frame uses a numeric key GVKEY and a company name CONML

> head(temp.compustat[order(temp.compustat$CONML, decreasing = T), ])
        GVKEY             CONML
225994  13023  ZZZZ Best Co Inc
211017  11696       Zytrex Corp
213816  11951 Zytec Systems Inc
309886  29163        Zytec Corp
373950 129441         Zynex Inc
383184 145228  ZymoGenetics Inc
> dim(temp.compustat)
[1] 31354     2

The other data frame uses a different numeric key companyid and a company name company that may be slightly different than CONML in the first database.

> head(temp.dealscan[ order(temp.dealscan$company, decreasing = T), ])
      companyid            company
70473     18192 Zytec Corp 
32025     16969 Zynaxis Inc
19714     92271 ZYGO Teraoptix Inc
80473     13185 Zygo Corp 
1901      24303 Zycon Corp SDN Bhd 
33993     21219 Zycon Corp

> dim(temp.dealscan)
[1] 85818     2

(I am sorting in reverse, because the DealScan databases has blanks and * at the beginning of some entries). It seems that the function RLBigDataLinkage in package RecordLinkage is the solution, but I can't get even unsupervised linking to work. Here's my error.

> library(RecordLinkage)
> rpairs <- RLBigDataLinkage(dataset1 = temp.compustat, dataset2 = temp.dealscan, exclude = 1, strcmp = 2, strcmpfun = "levenshtein")
> result <- epiClassify(rpairs, threshold.upper = 0.5)
Error in if (max <= min) stop("must have max > min") : 
  missing value where TRUE/FALSE needed
In addition: Warning message:
In nData1 * nData2 : NAs produced by integer overflow

Coincident with me getting started yesterday, I saw this question about using the Levenshtein distance function manually. Is this approach a better option for me? These data frames are fairly large, so it seems that RLBigData should be the correct approach (the package authors recommend it for > 1000 entries). Than开发者_开发问答ks!


This is due to a bug in the package code which is fixed now. The fixed version will be available from the project site on R-Forge (https://r-forge.r-project.org/projects/recordlinkage) probably until tomorrow morning. Note that it depends on version 1.5.4 of the data.table package, which is currently not available on CRAN, but on R-Forge(https://r-forge.r-project.org/projects/datatable/).

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号