Select first and last row from grouped data

Using dplyr, how do I select the first and last observations (rows) of each group in a single statement?

Data & example

Given a data frame:

df <- data.frame(id=c(1,1,1,2,2,2,3,3,3), 
       stopId=c("a","b","c","a","b","c","a","b","c"), 
       stopSequence=c(1,2,3,3,1,4,3,1,2)) 

I can get the first and last observations from each group using slice, but only with two separate statements:

firstStop <- df %>% 
    group_by(id) %>% 
    arrange(stopSequence) %>% 
    slice(1) %>% 
    ungroup 

lastStop <- df %>% 
    group_by(id) %>% 
    arrange(stopSequence) %>% 
    slice(n()) %>% 
    ungroup 

Can I combine these two statements into one that selects both the first and last observations?

There is probably a faster way:

df %>% 
    group_by(id) %>% 
    arrange(stopSequence) %>% 
    filter(row_number()==1 | row_number()==n()) 
+37

'rownumber() %in% c(1, n())' would avoid the need to run the vector scan twice – MichaelChirico

+5

@MichaelChirico I suspect you omitted an '_'? i.e. 'filter(row_number() %in% c(1, n()))' –
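
For reference, the variant suggested in the comments could look roughly like the sketch below. It reuses the question's data; the result name first_last and the final arrange(id, stopSequence), added only to make the row order predictable, are my own choices and not part of the original answer.

```r
library(dplyr)

# Example data from the question
df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 stopId = c("a", "b", "c", "a", "b", "c", "a", "b", "c"),
                 stopSequence = c(1, 2, 3, 3, 1, 4, 3, 1, 2))

first_last <- df %>%
    arrange(stopSequence) %>%                  # sort first, so each group is ordered by stopSequence
    group_by(id) %>%
    filter(row_number() %in% c(1, n())) %>%    # keep the first and last row of each group
    ungroup() %>%
    arrange(id, stopSequence)                  # assumption: reorder only for readable output
```

A group that contains a single row is returned once here, because filter tests each row only once.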

Something like:

library(dplyr) 

df <- data.frame(id=c(1,1,1,2,2,2,3,3,3), 
       stopId=c("a","b","c","a","b","c","a","b","c"), 
       stopSequence=c(1,2,3,3,1,4,3,1,2)) 

first_last <- function(x) { 
    bind_rows(slice(x, 1), slice(x, n())) 
} 

df %>% 
    group_by(id) %>% 
    arrange(stopSequence) %>% 
    do(first_last(.)) %>% 
    ungroup 

## Source: local data frame [6 x 3] 
## 
##   id stopId stopSequence
## 1  1      a            1
## 2  1      c            3
## 3  2      b            1
## 4  2      c            4
## 5  3      b            1
## 6  3      a            3

With do you can perform pretty much any number of operations on each group, but @jeremycg's answer is far more appropriate for just this task.

+1

Hadn't considered writing a function - certainly an approach for something more complex. – tospig

+1

This seems overcomplicated compared to just using 'slice', like 'df %>% arrange(stopSequence) %>% group_by(id) %>% slice(c(1,n()))' – Frank

+3

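Not disagreeing (and I pointed to jeremycg's as the better answer in the post), but having a 'do' example here might help others when 'slice' won't work (e.g. more complex operations on a group). And you could post your comment as an answer (it's the best one). – hrbrmstr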

Not dplyr, but it's more direct using data.table:

library(data.table) 
setDT(df) 
df[ df[order(id, stopSequence), .I[c(1L,.N)], by=id]$V1 ] 
#    id stopId stopSequence
# 1:  1      a            1
# 2:  1      c            3
# 3:  2      b            1
# 4:  2      c            4
# 5:  3      b            1
# 6:  3      a            3

A more detailed explanation:

# 1) get row numbers of first/last observations from each group 
# * basically, we sort the table by id/stopSequence, then, 
#  grouping by id, name the row numbers of the first/last 
#  observations for each id; since this operation produces 
#  a data.table 
# * .I is data.table shorthand for the row number 
# * here, to be maximally explicit, I've named the variable V1 
#  as row_num to give other readers of my code a clearer 
#  understanding of what operation is producing what variable 
first_last = df[order(id, stopSequence), .(row_num = .I[c(1L,.N)]), by=id] 
idx = first_last$row_num 

# 2) extract rows by number 
df[idx] 

Be sure to check out the Getting Started wiki to get the data.table basics covered.

+1

Or 'df[ df[order(stopSequence), .I[c(1,.N)], keyby=id]$V1 ]'. Seeing 'id' appear twice looks weird to me. – Frank

+0

You can set the keys in the 'setDT' call, so the 'order' call isn't needed here. –

+1

@ArtemKlevtsov - you may not always want to set the keys, though. – SymbolixAU
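
To make the keyed variant from the comments concrete, here is a minimal sketch. It assumes you are happy for df to be re-sorted in place, since setting a key physically reorders the table; the result name first_last is mine, and the .SD[c(1L, .N)] form is borrowed from another answer in this thread.

```r
library(data.table)

# Example data from the question
df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 stopId = c("a", "b", "c", "a", "b", "c", "a", "b", "c"),
                 stopSequence = c(1, 2, 3, 3, 1, 4, 3, 1, 2))

# Convert by reference and key on id, then stopSequence;
# the key sorts the rows, so no separate order() call is needed
setDT(df, key = c("id", "stopSequence"))

# Within each id, take the first and last row of the (now sorted) group
first_last <- df[, .SD[c(1L, .N)], by = id]
```

As SymbolixAU points out, this changes the row order of df itself, which may not be what you want in every pipeline.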

Just for completeness: you can pass slice a vector of indices:

df %>% arrange(stopSequence) %>% group_by(id) %>% slice(c(1,n())) 

which gives

  id stopId stopSequence
1  1      a            1
2  1      c            3
3  2      b            1
4  2      c            4
5  3      b            1
6  3      a            3
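
One caveat worth checking on your own data: as far as I can tell, when a group has only one row, c(1, n()) is c(1, 1), so slice returns that row twice. The sketch below uses a small hypothetical data frame (df1, not from the question) to show this, and wraps the indices in unique() as one possible way around it.

```r
library(dplyr)

# Hypothetical data: id 2 has a single row
df1 <- data.frame(id = c(1, 1, 1, 2),
                  stopSequence = c(2, 1, 3, 5))

# Index vector is c(1, 1) for id 2, so its only row is selected twice
dup <- df1 %>%
    arrange(stopSequence) %>%
    group_by(id) %>%
    slice(c(1, n())) %>%
    ungroup()

# unique() collapses the repeated index for single-row groups
dedup <- df1 %>%
    arrange(stopSequence) %>%
    group_by(id) %>%
    slice(unique(c(1, n()))) %>%
    ungroup()
```

Whether the duplicate is a problem depends on what you do next; if every group is known to have at least two rows, the plain slice(c(1, n())) is enough.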

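I know the question specified dplyr. But since others have already posted solutions using other packages, I decided to have a go with other packages too:

Base package: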

df <- df[with(df, order(id, stopSequence, stopId)), ] 
merge(df[!duplicated(df$id), ], 
     df[!duplicated(df$id, fromLast = TRUE), ], 
     all = TRUE) 

data.table:

df <- setDT(df) 
df[order(id, stopSequence)][, .SD[c(1,.N)], by=id] 

sqldf:

library(sqldf) 
min <- sqldf("SELECT id, stopId, min(stopSequence) AS StopSequence 
     FROM df GROUP BY id 
     ORDER BY id, StopSequence, stopId") 
max <- sqldf("SELECT id, stopId, max(stopSequence) AS StopSequence 
     FROM df GROUP BY id 
     ORDER BY id, StopSequence, stopId") 
sqldf("SELECT * FROM min 
     UNION 
     SELECT * FROM max") 

In one query:

sqldf("SELECT * 
     FROM (SELECT id, stopId, min(stopSequence) AS StopSequence 
       FROM df GROUP BY id 
       ORDER BY id, StopSequence, stopId) 
     UNION 
     SELECT * 
     FROM (SELECT id, stopId, max(stopSequence) AS StopSequence 
       FROM df GROUP BY id 
       ORDER BY id, StopSequence, stopId)") 

Output:

  id stopId StopSequence
1  1      a            1
2  1      c            3
3  2      b            1
4  2      c            4
5  3      a            3
6  3      b            1