I want to filter rows from a data.frame
based on a logical condition. Let's suppose that I have data frame like
expr_value cell_type
1 5.345618 bj fibroblast
2 5.195871 bj fibroblast
3 5.247274 bj fibroblast
4 5.929771 hesc
5 5.873096 hesc
6 5.665857 hesc
7 6.791656 hips
8 7.133673 hips
9 7.574058 hips
10 7.208041 hips
11 7.402100 hips
12 7.167792 hips
13 7.156971 hips
14 7.197543 hips
15 7.035404 hips
16 7.269474 hips
17 6.715059 hips
18 7.434339 hips
19 6.997586 hips
20 7.619770 hips
21 7.490749 hips
What I want to is to get a new data frame which looks the same but only has the data for one cell_type. E.g. subset / select rows which contains the cell type "hesc":
expr_value cell_type
1 5.929771 hesc
2 5.873096 hesc
3 5.665857 hesc
Or either cell type "bj fibroblast" or "hesc":
expr_value cell_type
1 5.345618 bj fibroblast
2 5.195871 bj fibroblast
3 5.247274 bj fibroblast
4 5.929771 hesc
5 5.873096 hesc
6 5.665857 hesc
Is there any easy way to do this?
I've tried:
expr[expr[2] == 'hesc']
# [1] "5.929771" "5.873096" "5.665857" "hesc" "hesc" "hesc"
if the original data frame is called "expr", but it gives the results in wron开发者_如何学Gog format as you can see.
To select rows according to one 'cell_type' (e.g. 'hesc'), use ==
:
expr[expr$cell_type == "hesc", ]
To select rows according to two or more different 'cell_type', (e.g. either 'hesc' or 'bj fibroblast'), use %in%
:
expr[expr$cell_type %in% c("hesc", "bj fibroblast"), ]
Use subset
(for interactive use)
subset(expr, cell_type == "hesc")
subset(expr, cell_type %in% c("bj fibroblast", "hesc"))
or better dplyr::filter()
filter(expr, cell_type %in% c("bj fibroblast", "hesc"))
The reason expr[expr[2] == 'hesc']
doesn't work is that for a data frame, x[y]
selects columns, not rows. If you want to select rows, change to the syntax x[y,]
instead:
> expr[expr[2] == 'hesc',]
expr_value cell_type
4 5.929771 hesc
5 5.873096 hesc
6 5.665857 hesc
You could use the dplyr
package:
library(dplyr)
filter(expr, cell_type == "hesc")
filter(expr, cell_type == "hesc" | cell_type == "bj fibroblast")
No one seems to have included the which function. It can also prove useful for filtering.
expr[which(expr$cell == 'hesc'),]
This will also handle NAs and drop them from the resulting dataframe.
Running this on a 9840 by 24 dataframe 50000 times, it seems like the which method has a 60% faster run time than the %in% method.
I was working on a dataframe and having no luck with the provided answers, it always returned 0 rows, so I found and used grepl:
df = df[grepl("downlink",df$Transmit.direction),]
Which basically trimmed my dataframe to only the rows that contained "downlink" in the Transmit direction column. P.S. If anyone can guess as to why I'm not seeing the expected behavior, please leave a comment.
Specifically to the original question:
expr[grepl("hesc",expr$cell_type),]
expr[grepl("bj fibroblast|hesc",expr$cell_type),]
Sometimes the column you want to filter may appear in a different position than column index 2 or have a variable name.
In this case, you can simply refer the column name you want to filter as:
columnNameToFilter = "cell_type"
expr[expr[[columnNameToFilter]] == "hesc", ]
This worked like magic for me.
celltype_hesc_bool = expr['cell_type'] == 'hesc'
expr_celltype_hesc = expr[celltype_hesc]
Check this blog post
we can use data.table library
library(data.table)
expr <- data.table(expr)
expr[cell_type == "hesc"]
expr[cell_type %in% c("hesc","fibroblast")]
or filter using %like%
operator for pattern matching
expr[cell_type %like% "hesc"|cell_type %like% "fibroblast"]
精彩评论