I have a query with multiple group_by/summarise steps. When I ungroup the data between them, everything works fine, but if I don't, one of the columns is replaced by another.
I would expect the columns not to change. For example, in the examples below, the variable gender should be F or M, and not Group X.
library(dplyr)
library(arrow)
# Create sample dataset
N <- 1000
set.seed(123)
orig_data <- tibble(
  code_group = sample(paste("Group", 1:2), N, replace = TRUE),
  year = sample(2015:2016, N, replace = TRUE),
  gender = sample(c("F", "M"), N, replace = TRUE),
  value = runif(N, 0, 10)
)
write_dataset(orig_data, "example")
# Query and replicate the error
(ds <- open_dataset("example/"))
#> FileSystemDataset with 1 Parquet file
#> code_group: string
#> year: int32
#> gender: string
#> value: double
ds |>
  group_by(year, code_group, gender) |>
  summarise(value = sum(value)) |>
  group_by(code_group, gender) |>
  summarise(value = max(value), NN = n()) |>
  collect()
#> # A tibble: 2 × 4
#> # Groups: code_group [2]
#> code_group gender value NN
#> <chr> <chr> <dbl> <int>
#> 1 Group 1 Group 1 724. 4
#> 2 Group 2 Group 2 661. 4
ERROR: the gender variable has been replaced by the values of the code_group variable.
ds |>
  group_by(year, code_group, gender) |>
  summarise(value = sum(value)) |>
  ungroup() |> #< Added this line...
  group_by(code_group, gender) |>
  summarise(value = max(value), NN = n()) |>
  collect()
#> # A tibble: 4 × 4
#> # Groups: code_group [2]
#> code_group gender value NN
#> <chr> <chr> <dbl> <int>
#> 1 Group 1 F 724. 2
#> 2 Group 2 M 627. 2
#> 3 Group 1 M 658. 2
#> 4 Group 2 F 661. 2
Note that after inserting ungroup() between the group_by/summarise calls, gender is no longer replaced.
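For comparison, here is a minimal sketch of the same two-step aggregation run on the in-memory tibble with plain dplyr, with no arrow involved. It assumes only that dplyr is installed, and it passes `.groups = "drop"` to summarise() as an alternative to the explicit ungroup(). It is meant only to illustrate the result I would expect (gender stays F or M); whether arrow treats `.groups` the same way is not something this sketch checks.

```r
library(dplyr)

# Same sample data as above, kept in memory instead of written to Parquet
N <- 1000
set.seed(123)
orig_data <- tibble(
  code_group = sample(paste("Group", 1:2), N, replace = TRUE),
  year = sample(2015:2016, N, replace = TRUE),
  gender = sample(c("F", "M"), N, replace = TRUE),
  value = runif(N, 0, 10)
)

# First aggregate per year/code_group/gender, then take the max over years.
# .groups = "drop" removes all grouping after each summarise(), so no
# leftover grouping from the first step carries into the second one.
res <- orig_data |>
  group_by(year, code_group, gender) |>
  summarise(value = sum(value), .groups = "drop") |>
  group_by(code_group, gender) |>
  summarise(value = max(value), NN = n(), .groups = "drop")

print(res)
```

Here every code_group/gender combination gets its own row, and gender only ever contains F or M.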
A quick look at the query plan (note node 4, where "gender": code_group):
ds |>
  group_by(year, code_group, gender) |>
  summarise(value = sum(value)) |>
  group_by(code_group, gender) |>
  summarise(value = max(value), NN = n()) |>
  show_query()
#> ExecPlan with 8 nodes:
#> 7:SinkNode{}
#> 6:ProjectNode{projection=[code_group, gender, value, NN]}
#> 5:GroupByNode{keys=["code_group", "gender"], aggregates=[
#> hash_max(value, {skip_nulls=false, min_count=0}),
#> hash_sum(NN, {skip_nulls=true, min_count=1}),
#> ]}
#> 4:ProjectNode{projection=[value, "NN": 1, code_group, "gender": code_group]}
#> 3:ProjectNode{projection=[year, code_group, gender, value]}
#> 2:GroupByNode{keys=["year", "code_group", "gender"], aggregates=[
#> hash_sum(value, {skip_nulls=false, min_count=0}),
#> ]}
#> 1:ProjectNode{projection=[value, year, code_group, gender]}
#> 0:SourceNode{}
Created on 2022-12-07 by the reprex package (v2.0.1)
Am I misunderstanding arrow/dplyr, or is this a bug (and if so, is it in arrow or in dplyr/dbplyr)?