Identify duplicate data with a threshold

Developer https://www.devze.com 2023-02-21 04:15 Source: Internet

Related topics: r

I am working with Bluetooth sensor data and need to identify possible duplicate readings for each unique ID. The Bluetooth sensor made a scan every five seconds and may pick up the same device in subsequent scans if the device wasn't moving quickly (e.g. sitting in traffic). There may also be multiple legitimate readings from the same device if that device made a round trip, but those should be separated by several minutes. I can't wrap my head around how to get rid of the duplicate data. I need to calculate a time difference column between consecutive readings whenever the macids match.

The data has the format:

          macid   time
00:03:7A:4D:F3:59  82333
00:03:7A:EF:58:6F 223556
00:03:7A:EF:58:6F 223601
00:03:7A:EF:58:6F 232731
00:03:7A:EF:58:6F 232736
00:05:4F:0B:45:F7 164141

And I need to create:

            macid   time timediff
00:03:7A:4D:F3:59  82333 NA
00:03:7A:EF:58:6F 223556 NA
00:03:7A:EF:58:6F 223601 45
00:03:7A:EF:58:6F 232731 9130
00:03:7A:EF:58:6F 232736 5
00:05:4F:0B:45:F7 164141 NA

My first attempt at this is extremely slow and not really usable:

dedupeIDs <- function (zz) {
  # Order by macid and then time
  zz <- zz[order(zz$macid, zz$time), ]

  # Difference from the previous row; the first row has no predecessor
  zz$timediff <- c(999999, diff(zz$time))

  # Reset the difference wherever the macid differs from the previous row
  for (i in seq_len(nrow(zz))[-1]) {
    if (zz[i, "macid"] != zz[i - 1, "macid"]) {
      zz[i, "timediff"] <- 999999
    }
  }
  return(zz)
}

I'll then be able to filter the data.frame based on the time difference column.
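
As a rough sketch of that filtering step, assuming the timediff column already exists as in the desired output: rows whose gap is at or below a cutoff are treated as duplicates and dropped, while rows with NA (the first reading for a device) are kept. The cutoff of 60 and the names `readings`, `threshold`, and `deduped` are illustrative assumptions, not values taken from the data.

```r
# Desired output table, with timediff already filled in
readings <- data.frame(
  macid = c("00:03:7A:4D:F3:59", "00:03:7A:EF:58:6F", "00:03:7A:EF:58:6F",
            "00:03:7A:EF:58:6F", "00:03:7A:EF:58:6F", "00:05:4F:0B:45:F7"),
  time = c(82333, 223556, 223601, 232731, 232736, 164141),
  timediff = c(NA, NA, 45, 9130, 5, NA)
)

# Hypothetical cutoff: gaps at or below this count as duplicate readings
threshold <- 60

# Keep the first reading per device (NA) and readings after a long enough gap
deduped <- readings[is.na(readings$timediff) | readings$timediff > threshold, ]
```

The `is.na()` test comes first so that NA gaps select the row instead of producing NA in the row index.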

Sample data:

structure(list(macid = structure(c(1L, 2L, 2L, 2L, 2L, 3L),
          .Label = c("00:03:7A:4D:F3:59", "00:03:7A:EF:58:6F", 
                     "00:05:4F:0B:45:F7"), class = "factor"), 
          time = c(82333, 223556, 223601, 232731, 232736, 164141)), 
          .Names = c("macid", "time"), row.names = c(NA, -6L), 
          class = "data.frame")

How about:

x <- structure(list(macid= structure(c(1L, 2L, 2L, 2L, 2L, 3L),
 .Label = c("00:03:7A:4D:F3:59", "00:03:7A:EF:58:6F", "00:05:4F:0B:45:F7"),
 class = "factor"), time = c(82333, 223556, 223601, 232731, 232736, 164141)),
.Names = c("macid", "time"), row.names = c(NA, -6L), class = "data.frame")
# ensure 'x' is ordered by macid, then time
x <- x[order(x$macid, x$time), ]
# add timediff column: gap from the previous reading within each macid
x$timediff <- ave(x$time, x$macid, FUN = function(tm) c(NA, diff(tm)))
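
One caveat, under an assumption the post does not confirm: the time values look like clock times written as HHMMSS (223556 read as 22:35:56). If that is the case, subtracting the raw numbers overstates gaps that cross a minute or hour boundary; 223601 - 223556 gives 45 although the readings would be 5 seconds apart. A minimal sketch that converts to seconds since midnight before differencing follows; `hhmmss_to_sec` is a hypothetical helper, not part of base R.

```r
# Hypothetical helper: HHMMSS stored as a number -> seconds since midnight
hhmmss_to_sec <- function(t) {
  hh <- t %/% 10000
  mm <- (t %/% 100) %% 100
  ss <- t %% 100
  hh * 3600 + mm * 60 + ss
}

x <- data.frame(
  macid = c("00:03:7A:4D:F3:59", "00:03:7A:EF:58:6F", "00:03:7A:EF:58:6F",
            "00:03:7A:EF:58:6F", "00:03:7A:EF:58:6F", "00:05:4F:0B:45:F7"),
  time = c(82333, 223556, 223601, 232731, 232736, 164141)
)

# Order by macid, then time, as before
x <- x[order(x$macid, x$time), ]

# Difference in seconds from the previous reading within each macid
x$secs <- hhmmss_to_sec(x$time)
x$timediff <- ave(x$secs, x$macid, FUN = function(s) c(NA, diff(s)))
```

This sketch does not handle readings that span midnight; a full date-time column would be needed for that.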