using hash to determine whether 2 dataframes are identical (PART 01)

开发者 https://www.devze.com 2023-04-02 05:39 Source: Internet

I created a dataset from the WHO ATC/DDD Index a few months ago and I want to check whether the online database is still unchanged today, so I downloaded it again and tried to use th...

The two datasets (in txt format) can be downloaded here. (I am aware that the files may look unsafe, but I don't know how to generate a dummy dataset that reproduces the issue, so in the end I uploaded the real data.)

And I have written a little script as below:

library(digest)

# Read both snapshots, keeping text columns as character
ddd.old <- read.table("ddd.table.old.txt",header=TRUE,stringsAsFactors=FALSE)
ddd.new <- read.table("ddd.table.new.txt",header=TRUE,stringsAsFactors=FALSE)

# Make sure the ddd column has the same type in both data frames
ddd.old[,"ddd"] <- as.character(ddd.old[,"ddd"])
ddd.new[,"ddd"] <- as.character(ddd.new[,"ddd"])

# Hash each row (apply() hands each row to digest as a character vector)
ddd.old <- data.frame(ddd.old, hash = apply(ddd.old, 1, digest),stringsAsFactors=FALSE)
ddd.new <- data.frame(ddd.new, hash = apply(ddd.new, 1, digest),stringsAsFactors=FALSE)

# Sort both data frames by the row hash
ddd.old <- ddd.old[order(ddd.old[,"hash"]),]
ddd.new <- ddd.new[order(ddd.new[,"hash"]),]

And something really interesting happens when I do the checking:

> table(ddd.old[,"hash"]%in%ddd.new[,"hash"]) #line01

TRUE 
 506 
> table(ddd.new[,"hash"]%in%ddd.old[,"hash"]) #line02

TRUE 
 506 
> digest(ddd.old[,"hash"])==digest(ddd.new[,"hash"]) #line03
[1] TRUE
> digest(ddd.old)==digest(ddd.new) #line04
[1] FALSE

line01 and line02 show that every row in ddd.old can be found in ddd.new, and vice versa.
line03 shows that the hash columns of both data frames are the same.
line04 shows that the two data frames are different.

What happened? Both data frames have identical rows (from line01 and line02) in the same order (from line03), yet they are different (from line04)?

Or have I misunderstood something about digest? Thanks.
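One possible explanation for line04, not verified against the original files: digest() hashes the serialized object, so attributes such as row names are part of the hash. After sorting, each data frame keeps the row names from its original file order, and those can differ even when every cell matches. A minimal sketch with hypothetical toy data (the names old, new, code and the values are made up for illustration):

```r
library(digest)

# Hypothetical toy data (not the WHO files): same rows, different file order
old <- data.frame(code = c("A01", "B02", "C03"), ddd = c("1", "2", "3"),
                  stringsAsFactors = FALSE)
new <- data.frame(code = c("C03", "A01", "B02"), ddd = c("3", "1", "2"),
                  stringsAsFactors = FALSE)

# Sort both by the same key, much as the script above sorts by hash
old <- old[order(old$code), ]
new <- new[order(new$code), ]

# The cell contents now agree column by column
same.cells <- identical(old$code, new$code) && identical(old$ddd, new$ddd)

# Row names still record each row's position in the original file
rownames(old)  # "1" "2" "3"
rownames(new)  # "2" "3" "1"
hash.before <- digest(old) == digest(new)  # FALSE: the row names differ

# Dropping the row names removes that difference
rownames(old) <- NULL
rownames(new) <- NULL
hash.after <- digest(old) == digest(new)
```

If this is what happened, resetting the row names of both data frames before calling digest() on them should make line04 agree with line03.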

Read in data as before.

ddd.old <- read.table("ddd.table.old.txt",header=TRUE,stringsAsFactors=FALSE)
ddd.new <- read.table("ddd.table.new.txt",header=TRUE,stringsAsFactors=FALSE)
ddd.old[,"ddd"] <- as.character(ddd.old[,"ddd"])
ddd.new[,"ddd"] <- as.character(ddd.new[,"ddd"])

Like Marek said, start by checking for differences with all.equal.

all.equal(ddd.old, ddd.new)
[1] "Component 6: 4 string mismatches" 
[2] "Component 8: 24 string mismatches"

So we just need to look at columns 6 and 8.

different.old <- ddd.old[, c(6, 8)]   
different.new <- ddd.new[, c(6, 8)]
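To illustrate how the differing columns can be identified programmatically rather than read off the all.equal message, here is a small sketch on hypothetical toy data (the data frames and column names below are invented, and it assumes both data frames have the same columns in the same order):

```r
# Hypothetical toy data: one cell differs in the second column
old <- data.frame(code = c("A01", "B02"), name = c("x", "y"),
                  stringsAsFactors = FALSE)
new <- data.frame(code = c("A01", "B02"), name = c("x", "z"),
                  stringsAsFactors = FALSE)

# all.equal returns TRUE or a character vector describing the differences
res <- all.equal(old, new)
isTRUE(res)  # FALSE, so the data frames differ

# Compare column by column to find which ones disagree
col.differs <- !mapply(identical, old, new)
which(col.differs)  # index (and name) of each differing column
```

The exact wording of the all.equal message can vary between R versions, so testing with isTRUE() is safer than parsing the text.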

Hash these columns.

hash.old <- apply(different.old, 1, digest)
hash.new <- apply(different.new, 1, digest)

And find the rows where they don't match.

different_rows <- which(hash.old != hash.new)  #which is optional
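A caveat when hashing rows with apply: apply() converts the data frame with as.matrix() first, and when character and numeric columns are mixed, the numeric columns are formatted to a common width. The same row can therefore turn into different strings depending on the other rows in its data frame. The sketch below uses hypothetical toy data to show the effect; it is not taken from the WHO files, where the columns being hashed may all be character already.

```r
# Hypothetical toy data: the row ("a", 1) appears in both data frames
df1 <- data.frame(id = c("a", "b"), n = c(1, 10), stringsAsFactors = FALSE)
df2 <- data.frame(id = c("a", "b"), n = c(1, 2), stringsAsFactors = FALSE)

# as.matrix() is what apply() sees; numeric columns are padded to equal width
row1.df1 <- as.matrix(df1)[1, ]  # n is " 1" (padded to match "10")
row1.df2 <- as.matrix(df2)[1, ]  # n is "1"

# The "same" row is no longer the same character vector,
# so a per-row hash of it would differ as well
identical(row1.df1, row1.df2)    # FALSE
```

Converting every column to character before calling apply(), as the script does for the ddd column, avoids this padding.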

Finally, put the mismatching rows from both datasets side by side.

cbind(different.old[different_rows, ], different.new[different_rows, ])
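When the two data frames are already row-aligned and contain no missing values, the same rows can also be found without hashing, by comparing cells directly. This is an alternative sketch on hypothetical toy data, not the method used above:

```r
# Hypothetical toy data: only row 2 differs
old <- data.frame(code = c("A01", "B02", "C03"), name = c("x", "y", "z"),
                  stringsAsFactors = FALSE)
new <- data.frame(code = c("A01", "B02", "C03"), name = c("x", "Y", "z"),
                  stringsAsFactors = FALSE)

# old != new is a logical matrix; a row differs if any of its cells does
# (assumes no NA values, which would make the comparison NA)
different_rows <- which(rowSums(old != new) > 0)

# Side-by-side view of the rows that changed
cbind(old[different_rows, ], new[different_rows, ])
```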
