I am working with some US govt data which has a lengthy list of cities and zip codes. After some work, the data is in the following format.
dat1 = data.frame(keyword=c("Bremen", "Brent", "Centreville, AL", "Chelsea, AL", "Bailytown, Alabama", "Calera, Alabama",
"54023", "54024"), tag=c(rep("AlabamCity",2), rep("AlabamaCityST",2), rep("AlabamaCityState",2), rep("AlabamaZipCode",2)))
dat1
However, there are certain keywords which aren't properly working. So in the below example, there are two 'zip codes' which are labeled as 'AlabamaCity' and 'AlabamaCityState'. For some reason, the original data set from the government has several zipcodes which aren't properly grouped with the other zip codes.
dat2 = data.frame(keyword=c("Bremen", "Brent", "50143", "Chelsea, AL", "Bailytown, Alabama", "52348",
"54023", "54024"), tag=c(rep("AlabamCity",2), rep("AlabamaCityST",2), rep("AlabamaCityState",2), rep("AlabamaZipCo开发者_C百科de",2)))
dat2
I wanted to know how I could iterate through the entire list of keywords and delete all the rows with numeric values (they're acctually saved as character values) which don't have a 'AlabamaZipCode' tag. So the previous data should end up looking like.
dat3 = data.frame(keyword=c("Bremen", "Brent", "Chelsea, AL", "Bailytown, Alabama", "54023", "54024"),
tag=c(rep("AlabamCity",2), rep("AlabamaCityST",1), rep("AlabamaCityState",1), rep("AlabamaZipCode",2)))
dat3
The challange seems to be that there are certain numeric values which I want to keep and others which I want to delete. Can anyone help.
I think two grepl expressions should do the trick:
> dat2[ !( grepl("City", dat2$tag) & grepl("^\\d", dat2$keyword) ) , ]
keyword tag
1 Bremen AlabamCity
2 Brent AlabamCity
4 Chelsea, AL AlabamaCityST
5 Bailytown, Alabama AlabamaCityState
7 54023 AlabamaZipCode
8 54024 AlabamaZipCode
You are eliminating the rows where there are digits in keyword
and "City" in tag
It helps to store the data as characters, not factors:
dat2 <- data.frame(keyword=c("Bremen", "Brent", "50143", "Chelsea, AL",
"Bailytown, Alabama", "52348", "54023", "54024"),
tag=c(rep("AlabamCity",2), rep("AlabamaCityST",2),
rep("AlabamaCityState",2), rep("AlabamaZipCode",2)),
stringsAsFactors = FALSE) ## note this bit
Now we can convert keyword
to numeric, and if it isn't a number in character format, we get an NA
:
want <- with(dat2, as.numeric(keyword))
which gives us this:
> (want <- with(dat2, as.numeric(keyword)))
[1] NA NA 50143 NA NA 52348 54023 54024
Warning message:
In eval(expr, envir, enclos) : NAs introduced by coercion
You can ignore the warning or suppress it, but don't use this casually as it can mask problems:
suppressWarnings(want <- with(dat2, as.numeric(keyword)))
The final step is to select the elements of want
that are not NA
and have keyword
equal to "AlabamaZipCode"
, which we do using &
:
(!is.na(want) & (dat2$tag != "AlabamaZipCode"))
That selects the rows we don't want, so we need to negate the above, turning TRUE
to FALSE
and vice versa:
!(!is.na(want) & (dat2$tag != "AlabamaZipCode"))
Putting this together we have:
dat2[!(!is.na(want) & (dat2$tag != "AlabamaZipCode")), ]
which gives:
> dat2[!(!is.na(want) & (dat2$tag != "AlabamaZipCode")), ]
keyword tag
1 Bremen AlabamCity
2 Brent AlabamCity
4 Chelsea, AL AlabamaCityST
5 Bailytown, Alabama AlabamaCityState
7 54023 AlabamaZipCode
8 54024 AlabamaZipCode
Full solution is:
want <- with(dat2, as.numeric(keyword))
dat2[!(!is.na(want) & (dat2$tag != "AlabamaZipCode")), ]
Here's one (slightly convoluted) approach that you may consider. First, create an identity column for each row. This will help with the subsetting. Secondly, create a vector of those id's that match your criteria. Finally, subset those id's out of your final data.
The data you posted defaults to factors, not character data so I've accounted for that. If that's different than your actual data, you'll have to adjust accordingly. Also, when I converted the data to numeric, NAs are generated. A warning message is generated, but we can ignore that for this bit.
#Generate an ID column
dat4$id <- 1:nrow(dat4)
#Create a vector of the id's that match your criteria'
outliers <- dat4[as.character(dat4$tag) != "AlabamaZipCode" & !(is.na(as.numeric(as.character(dat4$keyword)))) , "id"]
subset(dat4, !(id %in% outliers), select = 1:2)
keyword tag
1 Bremen AlabamCity
2 Brent AlabamCity
4 Chelsea, AL AlabamaCityST
5 Bailytown, Alabama AlabamaCityState
7 54023 AlabamaZipCode
8 54024 AlabamaZipCode
Actually, you can shorten all of this to the following and avoid generating the id.
dat4[!(as.character(dat4$tag) != "AlabamaZipCode" & !(is.na(as.numeric(as.character(dat4$keyword))))) , ]
精彩评论