开发者

How to create a new column based on an untidy data?

开发者 https://www.devze.com 2022-12-07 21:14 出处:网络
Im working with a dataset where some of the table elements are mixed. Ideally, rows from the location column containing the word "District", i.e., "District 1", "District 2&qu

Im working with a dataset where some of the table elements are mixed. Ideally, rows from the location column containing the word "District", i.e., "District 1", "District 2" etc.., should be in a separate column from the location column.

A snippet of the dataset looks like this:

location     iri length
   <chr>      <dbl>  <dbl>
 1 District 1 NA        NA
 2 S00180CB    4.60  32171
 3 S00186CB    5.17  31286
 4 S00188CB    6.开发者_运维百科02   4742
 5 District 2 NA        NA
 6 S00136CB    4.91   5968
 7 S00285CB    4.33  29340
 8 S00288CB   10.9     141
 9 S00289CB    4.41   9126
10 District 3 NA        NA
11 S00231CB    4.34   6895
12 S00266CB    5.65  18985
13 S00381CB    4.39  13799
14 S00382CB    8.96    124
15 District 4 NA        NA
16 S00303CB    4.11  79599
17 S00311CB    3.17    625
18 District 5 NA        NA
19 S00163CB    3.49  17996
20 S00253CB    2.65    905
21 S00259CB    2.26    905

Code i tried:

library(tidyverse)

df |> 
  mutate(district = ifelse(str_detect(location, "District"), 
                           rep_len("District 1", length.out = n()), "none")

I expected it to come out like this:

district   location   iri length
   <chr>      <chr>    <dbl>  <dbl>
 1 District 1 S00180CB  4.60  32171
 2 District 1 S00186CB  5.17  31286
 3 District 1 S00188CB  6.02   4742
 4 District 2 S00136CB  4.91   5968
 5 District 2 S00285CB  4.33  29340
 6 District 2 S00288CB 10.9     141
 7 District 2 S00289CB  4.41   9126
 8 District 3 S00231CB  4.34   6895
 9 District 3 S00266CB  5.65  18985
10 District 3 S00381CB  4.39  13799
11 District 3 S00382CB  8.96    124
12 District 4 S00303CB  4.11  79599
13 District 4 S00311CB  3.17    625
14 District 5 S00163CB  3.49  17996
15 District 5 S00253CB  2.65    905
16 District 5 S00259CB  2.26    905

but the result did not come out as expected.

Is there a more efficient way to accomplish this?


You could use mutate to create the District column, then fill to replace the missing values, then omit any rows where values are NA.

library(dplyr)
library(tidyr)

df1 %>% 
  mutate(District = ifelse(str_detect(location, "District"), location, NA)) %>%
  fill(District) %>% 
  na.omit()

Result:

   location   iri length   District
2  S00180CB  4.60  32171 District 1
3  S00186CB  5.17  31286 District 1
4  S00188CB  6.02   4742 District 1
6  S00136CB  4.91   5968 District 2
7  S00285CB  4.33  29340 District 2
8  S00288CB 10.90    141 District 2
9  S00289CB  4.41   9126 District 2
11 S00231CB  4.34   6895 District 3
12 S00266CB  5.65  18985 District 3
13 S00381CB  4.39  13799 District 3
14 S00382CB  8.96    124 District 3
16 S00303CB  4.11  79599 District 4
17 S00311CB  3.17    625 District 4
19 S00163CB  3.49  17996 District 5
20 S00253CB  2.65    905 District 5
21 S00259CB  2.26    905 District 5

The data. Note: named df1 not df, which is bad practice.

df1 <- structure(list(location = c("District 1", "S00180CB", "S00186CB", 
"S00188CB", "District 2", "S00136CB", "S00285CB", "S00288CB", 
"S00289CB", "District 3", "S00231CB", "S00266CB", "S00381CB", 
"S00382CB", "District 4", "S00303CB", "S00311CB", "District 5", 
"S00163CB", "S00253CB", "S00259CB"), iri = c(NA, 4.6, 5.17, 6.02, 
NA, 4.91, 4.33, 10.9, 4.41, NA, 4.34, 5.65, 4.39, 8.96, NA, 4.11, 
3.17, NA, 3.49, 2.65, 2.26), length = c(NA, 32171L, 31286L, 4742L, 
NA, 5968L, 29340L, 141L, 9126L, NA, 6895L, 18985L, 13799L, 124L, 
NA, 79599L, 625L, NA, 17996L, 905L, 905L)), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14", "15", "16", "17", "18", "19", "20", "21"))


Here is an alternative dplyr way with grouping and filtering:

library(dplyr)

df %>% 
  group_by(group = cumsum(grepl('District', location))) %>% 
  mutate(District = first(location)) %>% 
  filter(!grepl('District', location)) %>% 
  ungroup() %>% 
  select(-group)
location   iri length District  
   <chr>    <dbl>  <int> <chr>     
 1 S00180CB  4.6   32171 District 1
 2 S00186CB  5.17  31286 District 1
 3 S00188CB  6.02   4742 District 1
 4 S00136CB  4.91   5968 District 2
 5 S00285CB  4.33  29340 District 2
 6 S00288CB 10.9     141 District 2
 7 S00289CB  4.41   9126 District 2
 8 S00231CB  4.34   6895 District 3
 9 S00266CB  5.65  18985 District 3
10 S00381CB  4.39  13799 District 3
11 S00382CB  8.96    124 District 3
12 S00303CB  4.11  79599 District 4
13 S00311CB  3.17    625 District 4
14 S00163CB  3.49  17996 District 5
15 S00253CB  2.65    905 District 5
16 S00259CB  2.26    905 District 5
0

精彩评论

暂无评论...
验证码 换一张
取 消