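Subsetting by column values and row-binding with a tidyverse approach

Problem description:

I have a data.frame that I want to subset (by rows) into (overlapping) "batches", and then purrr::map those batches to a function. In the example below, d is the data.frame I want to subset and batch: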

library(tidyverse)  # provides %>%, filter(), nest(), map(), etc. used below

set.seed(19)
n1 <- data.frame(c0 = "N", c1 = rep("A", 4), c2 = rep(c("i", "j"), 2), num = rnorm(4))
n2 <- data.frame(c0 = "N", c1 = rep("B", 6), c2 = rep(c("i", "j"), 3), num = rnorm(3))
y1 <- data.frame(c0 = "Y", c1 = rep("A", 2), c2 = c("i", "j"), num = rnorm(2))
y2 <- data.frame(c0 = "Y", c1 = rep("B", 4), c2 = rep(c("i", "j"), each = 2), num = rnorm(2))

d <- rbind(y1, y2, n1, n2)

Here is d:

#    c0 c1 c2        num
# 1   Y  A  i -0.7447795
# 2   Y  A  j -0.2597870
# 3   Y  B  i -0.1830838
# 4   Y  B  i  0.5186300
# 5   Y  B  j -0.1830838
# 6   Y  B  j  0.5186300
# 7   N  A  i -1.1894537
# 8   N  A  j  0.3885812
# 9   N  A  i -0.3443333
# 10  N  A  j -0.5478961
# 11  N  B  i  0.9806622
# 12  N  B  j -0.2366460
# 13  N  B  i  0.8097397
# 14  N  B  j  0.9806622
# 15  N  B  i -0.2366460
# 16  N  B  j  0.8097397

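The recipe for subsetting is:

  1. Subset by c0 -> gives groups Y and N
  2. Subset the c0 == "N" group by c1 -> gives groups NA and NB
  3. Subset each of NA and NB by c2 -> gives groups NAi, NAj, NBi, NBj
  4. Row-bind N?i with Y?i and N?j with Y?j (where ? is A or B) -> gives the final 4 data subsets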
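The recipe above can be sketched in a few lines. This is only a minimal illustration, not the solution from the post: toy is a made-up data frame with the same columns as d, and n_groups and batches are hypothetical names. It assumes dplyr 0.8 or later, where group_split() is available.

```r
library(dplyr)
library(purrr)

# Made-up data with the same columns as d (values chosen for easy checking)
toy <- data.frame(
  c0  = c("Y", "Y", "N", "N", "N", "N"),
  c1  = c("A", "B", "A", "A", "B", "B"),
  c2  = c("i", "j", "i", "j", "i", "j"),
  num = c(1, 2, 10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# Steps 1-3: keep the N rows and split them by every (c1, c2) combination
n_groups <- toy %>%
  filter(c0 == "N") %>%
  group_by(c1, c2) %>%
  group_split()

# Step 4: to each N group, append all Y rows that share its c2 value
batches <- map(n_groups, function(g) {
  bind_rows(g, filter(toy, c0 == "Y", c2 == g$c2[1]))
})

# One sum per batch, in the order NAi, NAj, NBi, NBj
map_dbl(batches, ~ sum(.x$num))
# 11 22 31 42
```

Because the Y rows are looked up by c2 only, the same Y rows end up in more than one batch, which is what makes the batches overlap.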

In R:

subset.Yi <- d %>% filter(c0 == "Y" & c2 == "i")
subset.Yj <- d %>% filter(c0 == "Y" & c2 == "j")

list(
    d1 = d %>% filter(c0 == "N" & c1 == "A", c2 == "i") %>% rbind(subset.Yi),
    d2 = d %>% filter(c0 == "N" & c1 == "B", c2 == "i") %>% rbind(subset.Yi),
    d3 = d %>% filter(c0 == "N" & c1 == "A", c2 == "j") %>% rbind(subset.Yj),
    d4 = d %>% filter(c0 == "N" & c1 == "B", c2 == "j") %>% rbind(subset.Yj)
) %>%
    tibble::tibble(batches = paste0("batch", seq_along(.)), data = .) -> tmp

If matching on c2 were not important, I could do this:

d %>% filter(., c0 == "N") %>%
    group_by(., c1) %>%
    do(batches = rbind(d[d$c0 == "Y", ], .)) -> tmp  # the comma selects rows, not columns

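But that is not the case. Thanks in advance! BTW, I know this is doable outside the tidyverse, but the rest of my code follows the tidyverse approach and I would like to stay consistent.

Here is a solution that works in this case (although it would be great to see other, possibly more general, approaches from others).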

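tmp <- d %>%
    group_by(c2) %>%
    nest(.key = c2) %>%   # .key is the pre-1.0 tidyr interface; tidyr >= 1.0 deprecates it
    mutate(c2 = map(c2, ~ .x %>%
        filter(., c0 == "N") %>%
        group_by(., c1) %>%
        # one batch per c1 group: the Y rows of this c2 value plus the N rows of the group
        do(batches = bind_rows(
            .x %>% filter(., c0 == "Y") %>% select(-c1),
            (.) %>% select(-c1)))
    ))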
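For readers on a newer tidyr, the same idea can be sketched without .key. This is an assumption-laden variant, not the code from the post: it assumes tidyr 1.0 or later (where nest(data = -c2) is the documented form) and dplyr 0.8 or later, and toy, nested and batch_list are hypothetical names on made-up data.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Made-up data with the same columns as d (values chosen for easy checking)
toy <- data.frame(
  c0  = c("Y", "Y", "N", "N", "N", "N"),
  c1  = c("A", "B", "A", "A", "B", "B"),
  c2  = c("i", "j", "i", "j", "i", "j"),
  num = c(1, 2, 10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# One row per c2 value; all other columns go into the list column `data`
nested <- toy %>% nest(data = -c2)

# Within each c2 value, pair the Y rows with each c1 group of the N rows
batch_list <- nested$data %>%
  map(function(df) {
    y_rows <- filter(df, c0 == "Y")
    df %>%
      filter(c0 == "N") %>%
      group_by(c1) %>%
      group_split() %>%
      map(~ bind_rows(y_rows, .x))
  }) %>%
  unlist(recursive = FALSE)  # flatten one level: a list of 4 data frames

# Sums per batch, in the order NAi, NBi, NAj, NBj
map_dbl(batch_list, ~ sum(.x$num))
# 11 31 22 42
```

The batches come out grouped by c2 first, which is the same ordering as the four sums reported for the original solution.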

tmp here will contain the four subsets. I can then do something like:

tmp %>% unnest(c2) %>% .$batches %>% map(.,~sum(.$num)) %>% unlist 

This gives the column sum of num in each of the 4 subsets:

[1] -1.94302047 1.14452254 -0.08355576 1.62951506 

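Side note: deselecting c1 is technically not necessary here, but because I am row-binding a part of the data frame for which the value of c1 is ignored (see the subsetting recipe above and note the ?), the value of c1 becomes confusing, so I removed it.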