Select first and last row from grouped data
Question
Using dplyr, how do I select the top and bottom observations/rows of grouped data in one statement?
Data & Example
Given a data frame:
df <- data.frame(id=c(1,1,1,2,2,2,3,3,3),
stopId=c("a","b","c","a","b","c","a","b","c"),
stopSequence=c(1,2,3,3,1,4,3,1,2))
I can get the top and bottom observations from each group using slice, but only with two separate statements:
firstStop <- df %>%
group_by(id) %>%
arrange(stopSequence) %>%
slice(1) %>%
ungroup
lastStop <- df %>%
group_by(id) %>%
arrange(stopSequence) %>%
slice(n()) %>%
ungroup
Can I combine these two statements into one that selects both the top and bottom observations?
There is probably a faster way:
df %>%
group_by(id) %>%
arrange(stopSequence) %>%
filter(row_number()==1 | row_number()==n())
Something like:
library(dplyr)
df <- data.frame(id=c(1,1,1,2,2,2,3,3,3),
stopId=c("a","b","c","a","b","c","a","b","c"),
stopSequence=c(1,2,3,3,1,4,3,1,2))
first_last <- function(x) {
bind_rows(slice(x, 1), slice(x, n()))
}
df %>%
group_by(id) %>%
arrange(stopSequence) %>%
do(first_last(.)) %>%
ungroup
## Source: local data frame [6 x 3]
##
##   id stopId stopSequence
## 1  1      a            1
## 2  1      c            3
## 3  2      b            1
## 4  2      c            4
## 5  3      b            1
## 6  3      a            3
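With do you can perform pretty much any number of operations on each group, but @jeremycg's answer is far more appropriate for this particular task.
Not dplyr, but it is much more direct using data.table: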
library(data.table)
setDT(df)
df[ df[order(id, stopSequence), .I[c(1L,.N)], by=id]$V1 ]
#    id stopId stopSequence
# 1:  1      a            1
# 2:  1      c            3
# 3:  2      b            1
# 4:  2      c            4
# 5:  3      b            1
# 6:  3      a            3
A more detailed explanation:
# 1) get row numbers of first/last observations from each group
# * basically, we sort the table by id/stopSequence, then,
# grouping by id, name the row numbers of the first/last
# observations for each id; since this operation produces
# a data.table, the row numbers are pulled out of it below
# * .I is data.table shorthand for the row number
# * here, to be maximally explicit, I've named the variable V1
# as row_num to give other readers of my code a clearer
# understanding of what operation is producing what variable
first_last = df[order(id, stopSequence), .(row_num = .I[c(1L,.N)]), by=id]
idx = first_last$row_num
# 2) extract rows by number
df[idx]
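Be sure to check out the Getting Started wiki to get the data.table basics covered.
Or `df[ df[order(stopSequence), .I[c(1,.N)], keyby=id]$V1 ]`. Seeing `id` appear twice looks weird to me. – Frank
You can set keys in the `setDT` call, so the `order` call is not needed here. –
@ArtemKlevtsov - you might not always want to set the keys, though. – SymbolixAU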
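The keying suggestion in the comments above could look roughly like the sketch below. It assumes the `key` argument of `setDT` is used to sort the table by `id` and then `stopSequence` at conversion time; the name `res` is only illustrative.

```r
library(data.table)

df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 stopId = c("a", "b", "c", "a", "b", "c", "a", "b", "c"),
                 stopSequence = c(1, 2, 3, 3, 1, 4, 3, 1, 2))

# Convert by reference and set the key; keying sorts the rows
# by id, then by stopSequence
setDT(df, key = c("id", "stopSequence"))

# The table is already sorted, so no order() call is needed:
# take the first and last row of each group
res <- df[, .SD[c(1L, .N)], by = id]
res
```

As the reply points out, keying reorders the table itself, which may not be wanted everywhere.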
Just for completeness: you can pass slice a vector of indices:
df %>% arrange(stopSequence) %>% group_by(id) %>% slice(c(1,n()))
which gives
  id stopId stopSequence
1  1      a            1
2  1      c            3
3  2      b            1
4  2      c            4
5  3      b            1
6  3      a            3
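One edge case worth keeping in mind with the index-vector form: if a group has only one row, `c(1, n())` is `c(1, 1)`, so that row would be selected twice. A minimal sketch of one way to guard against this, assuming de-duplicating the indices with `unique()` is acceptable (the data frame `df1` is hypothetical, not from the question):

```r
library(dplyr)

# Hypothetical data: id 2 has a single row
df1 <- data.frame(id = c(1, 1, 1, 2),
                  stopSequence = c(2, 1, 3, 5))

res <- df1 %>%
  arrange(stopSequence) %>%
  group_by(id) %>%
  slice(unique(c(1, n()))) %>%  # unique() collapses c(1, 1) to 1
  ungroup()
res
```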
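I know the question specified dplyr. But since others have already posted solutions using other packages, I decided to have a go with other packages too:
Base package: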
df <- df[with(df, order(id, stopSequence, stopId)), ]
merge(df[!duplicated(df$id), ],
df[!duplicated(df$id, fromLast = TRUE), ],
all = TRUE)
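To see what the two `duplicated()` calls contribute, here is a small sketch on hypothetical, already-sorted data (`df2`, `first` and `last` are illustrative names). It also shows that the two logical masks can be combined with `|`, which might serve as an alternative to `merge` when the row order should be preserved:

```r
# Hypothetical, already-sorted data; id 3 has a single row
df2 <- data.frame(id = c(1, 1, 1, 2, 2, 3),
                  stopSequence = c(1, 2, 3, 1, 4, 2))

first <- !duplicated(df2$id)                   # TRUE at the first row of each id
last  <- !duplicated(df2$id, fromLast = TRUE)  # TRUE at the last row of each id

# A single-row group is flagged by both masks but kept only once
res <- df2[first | last, ]
res
```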
data.table:
df <- setDT(df)
df[order(id, stopSequence)][, .SD[c(1,.N)], by=id]
sqldf:
library(sqldf)
min <- sqldf("SELECT id, stopId, min(stopSequence) AS StopSequence
FROM df GROUP BY id
ORDER BY id, StopSequence, stopId")
max <- sqldf("SELECT id, stopId, max(stopSequence) AS StopSequence
FROM df GROUP BY id
ORDER BY id, StopSequence, stopId")
sqldf("SELECT * FROM min
UNION
SELECT * FROM max")
In one query:
sqldf("SELECT *
FROM (SELECT id, stopId, min(stopSequence) AS StopSequence
FROM df GROUP BY id
ORDER BY id, StopSequence, stopId)
UNION
SELECT *
FROM (SELECT id, stopId, max(stopSequence) AS StopSequence
FROM df GROUP BY id
ORDER BY id, StopSequence, stopId)")
Output:
  id stopId StopSequence
1  1      a            1
2  1      c            3
3  2      b            1
4  2      c            4
5  3      a            3
6  3      b            1
`rownumber() %in% c(1, n())` would obviate the need to run the vector scan twice – MichaelChirico
@MichaelChirico I suspect you omitted an `_`? i.e. `filter(row_number() %in% c(1, n()))` –
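Put together, the `%in%` variant from these comments might look like the sketch below, run on the question's data. Sorting by `id` as well as `stopSequence` is an added assumption here, made only so that the result comes back in group order.

```r
library(dplyr)

df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 stopId = c("a", "b", "c", "a", "b", "c", "a", "b", "c"),
                 stopSequence = c(1, 2, 3, 3, 1, 4, 3, 1, 2))

res <- df %>%
  arrange(id, stopSequence) %>%
  group_by(id) %>%
  # one membership test instead of two separate comparisons
  filter(row_number() %in% c(1, n())) %>%
  ungroup()
res
```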