I'm working on a text mining solution with SQL and R.
First I import data into R from my SQL selection, and then I do the data mining work on it.
Here is what I have:
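# query the data warehouse, then keep just the free-text column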
rawData = sqlQuery(dwhConnect,sqlString)
a = data.frame(rawData$ENNOTE_NEU)
If I do
a[[1]][1:3]
you can see the structure:
[1] lorem ipsum li ld ee wö wo di dd
[2] la kdin di da dogs chicken
[3] kd good i need some help
Now I want to do some data cleaning with my own dictionary. An example would be to replace li with lorem ipsum, and kd as well as kdin with kunde.
My problem is how to do that for the whole data frame.
for (i in 1:nrow(a)) {
    a[[1]][i] <- gsub(" kd | kdin ", " kunde ", a[[1]][i])
    a[[1]][i] <- gsub(" li ", " lorem ipsum ", a[[1]][i])
    ...
}
This works, but it is slow for a lot of data.
Is there a better way to do that?
Cheers, The Captain
gsub is vectorised, so you don't need the loop.
a[[1]] <- gsub(" kd | kdin ", " kunde ", a[[1]])
is quicker.
Also, are you sure you want spaces inside your regexes? That way you won't match words at the start or end of lines.
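For example, \\b word boundaries also match at the start and end of a string, so something like this sketch (untested against your real data) would catch those cases too:
a[[1]] <- gsub("\\bkd\\b|\\bkdin\\b", "kunde", a[[1]])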
Alternative approach: avoid regexes altogether. This works best when you have a lot of different words to search, because you'll avoid the text manipulation except for the first time.
a1 <- c("lorem ipsum li ld ee wö wo di dd", "la kdin di da dogs chicken", "kd good i need some help")
x <- strsplit(a1, " ", fixed = TRUE)  # fixed = TRUE avoids the slower regex engine
replfxn <- function(vec, word.in, word.out) {
    vec[vec %in% word.in] <- word.out  # swap out every token that matches word.in
    vec
}
word.in <- "kdin"
word.out <- "kunde"
replfxn(x[[2]], word.in, word.out)
lapply(x, replfxn, word.in = word.in, word.out = word.out)
[[1]]
[1] "lorem" "ipsum" "li" "ld" "ee" "wö" "wo" "di" "dd"
[[2]]
[1] "la" "kunde" "di" "da" "dogs" "chicken"
[[3]]
[1] "kd" "good" "i" "need" "some" "help"
For a large number of words to search over, I'd guess this is faster than regexes. It's also more amenable to data-code separation, since it lends itself to reading the dictionary in from a file (with merge or a similar function) rather than embedding it in code.
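As a minimal sketch of that data-code separation, assuming a hypothetical two-column file dict.csv with columns from and to (rows such as kd,kunde and li,lorem ipsum):
dict <- read.csv("dict.csv", stringsAsFactors = FALSE)  # hypothetical file; columns: from, to
cleaned <- lapply(x, function(vec) {
    idx <- match(vec, dict$from)                    # dictionary row for each token, NA if absent
    vec[!is.na(idx)] <- dict$to[idx[!is.na(idx)]]   # swap in the replacements
    vec
})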
If you really need it back in the original format (as a space-separated character vector), you can apply paste to the result.
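A quick sketch, using the functions defined above:
cleaned <- lapply(x, replfxn, word.in = word.in, word.out = word.out)
sapply(cleaned, paste, collapse = " ")  # back to one space-separated string per row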
And here are the timing results. I stand corrected: it looks like gsub is faster!
library(microbenchmark)
microbenchmark(
    gsub(word.in, word.out, a1),
    lapply(x, replfxn, word.in = word.in, word.out = word.out),
    times = 1000
)
                                                         expr    min     lq median       uq    max
1                                 gsub(word.in, word.out, a1)  42772  44484  47905  48761.0 691193
2  lapply(x, replfxn, word.in = word.in, word.out = word.out) 102653 106075 109496 111635.5 970065