
Replace words in R data.frames (Text Mining)


I'm working on a text mining solution with SQL and R.

First I import data into R from my SQL selection, and then I do data mining with it.

Here is what I got:

rawData <- sqlQuery(dwhConnect, sqlString)  # fetch rows from the data warehouse
a <- data.frame(rawData$ENNOTE_NEU)

If I do a

a[[1]][1:3]

you can see the structure:

[1] lorem ipsum li ld ee wö wo di dd
[2] la kdin di da dogs chicken
[3] kd good i need some help 

Now I want to do some data cleaning with my own dictionary. An example would be to replace li with lorem ipsum, and both kd and kdin with kunde.

My problem is how to do that for the whole data frame.

for (i in 1:nrow(a)) {
    a[[1]][i] <- gsub(" kd | kdin ", " kunde ", a[[1]][i])
    a[[1]][i] <- gsub(" li ", " lorem ipsum ", a[[1]][i])
    ...
}

This works, but it is slow for a lot of data.

Is there a better way to do that?


Cheers, The Captain


gsub is vectorised, so you don't need the loop.

a[[1]] <- gsub(" kd | kdin ", " kunde ", a[[1]])

is quicker.


Also, are you sure you want spaces inside your regexes? That way you won't match words at the start or end of lines.
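For example, a minimal sketch using \b word boundaries instead of the literal spaces (this replaces the original patterns and also matches kd at the start or end of a string):

a[[1]] <- gsub("\\bkd\\b|\\bkdin\\b", "kunde", a[[1]])
a[[1]] <- gsub("\\bli\\b", "lorem ipsum", a[[1]])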


Alternative approach: avoid regexes altogether. This works best when you have a lot of different words to search for, because you avoid the text manipulation except for the first time.

a1 <- c("lorem ipsum li ld ee wö wo di dd", "la kdin di da dogs chicken", "kd good i need some help")
x <- strsplit(a1, " ", fixed = TRUE)  # fixed = TRUE avoids regexes, which would be slower

replfxn <- function(vec, word.in, word.out) {
  # replace every token of vec that appears in word.in with word.out
  vec[vec %in% word.in] <- word.out
  vec
}

word.in <- "kdin"
word.out <- "kunde"

replfxn(x[[2]], word.in, word.out)

lapply(x, replfxn, word.in = word.in, word.out = word.out)
[[1]]
[1] "lorem" "ipsum" "li"    "ld"    "ee"    "wö"    "wo"    "di"    "dd"   

[[2]]
[1] "la"      "kunde"   "di"      "da"      "dogs"    "chicken"

[[3]]
[1] "kd"   "good" "i"    "need" "some" "help"

For a large number of words to search over, I'd guess this is faster than regexes. It's also more amenable to data-code separation, since it lends itself to writing a merge or similar function to read in the dictionary from a file rather than embedding it in code.
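For instance, here is a minimal sketch of that data-code separation, assuming a hypothetical two-column file dictionary.csv with columns word.in and word.out:

dict <- read.csv("dictionary.csv", stringsAsFactors = FALSE)  # hypothetical dictionary file
lookup <- setNames(dict$word.out, dict$word.in)  # named vector: word.in -> word.out
x.clean <- lapply(x, function(vec) {
  hits <- vec %in% names(lookup)   # which tokens appear in the dictionary
  vec[hits] <- lookup[vec[hits]]   # swap them for their mapped words
  vec
})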

If you really need it back in the original format (as a space-separated character vector), you can apply a paste to the result.
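For example, reusing x, word.in and word.out from above:

x.repl <- lapply(x, replfxn, word.in = word.in, word.out = word.out)
sapply(x.repl, paste, collapse = " ")  # back to one space-separated string per row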

And here are the timing results. I stand corrected: it looks like gsub is faster!

library(microbenchmark)
microbenchmark(
  gsub(word.in, word.out, a1),
  lapply(x, replfxn, word.in = word.in, word.out = word.out),
  times = 1000
)

                                                         expr    min     lq median       uq    max
1                                 gsub(word.in, word.out, a1)  42772  44484  47905  48761.0 691193
2  lapply(x, replfxn, word.in = word.in, word.out = word.out) 102653 106075 109496 111635.5 970065
