dplyr / data.table: summarise and paste a column only if it is not empty/NA?
Question:
My data is:
Name          House  Street      Apt  City    Postal  Phone
DUMAS PAUL    2030   GREEN ROAD       DESERT  Z0K2K1  999-577-3789
DUNN S               GREEN ROAD       DESERT  Z0K2K1  999-577-3256
FERGUSON BOB         GREEN ROAD       DESERT  Z0K2K1  999-577-3772
FITSCHEN A    3989   GREEN ROAD       DESERT  Z0K2K1  999-577-3556
BLACK CARY    2079   GREEN ROAD       DESERT  Z0K2K1  999-577-3779
BLACK RUTH    2079   GREEN ROAD       DESERT  Z0K2K1  999-577-3779
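I want to compare rows dynamically (the data is sorted by House). If the house number and the rest of the address are equal, I want to concatenate the respective phone numbers with " OR ", concatenate the names with ", AND ", and drop the rows that were merged into the combined row.
I am using: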
data <- data %>%
  group_by(House, Street, Apt, City, Postal) %>%
  summarise(Name = first(paste(Name, collapse = ", AND ")),
            Phone = paste(unique(Phone), collapse = " OR ")) %>%
  ungroup() %>%
  arrange(Street, desc(House)) %>%
  select(colnames(dataset)) %>%
  filter(!Phone %in% dnc$`Home Phone`)
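Problem: with the dplyr code above, rows are also concatenated when House is NA (or blank; I leave my NAs blank) and Apt is NA (or ""), which I do not want. So with the code above, I get: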
Name                        House  Street      Apt  City    Postal  Phone
DUNN S, AND FERGUSON BOB           GREEN ROAD       DESERT  Z0K2K1  9995773256 OR 9995773772
DUMAS PAUL                  2030   GREEN ROAD       DESERT  Z0K2K1  9995773789
BLACK CARY, AND BLACK RUTH  2079   GREEN ROAD       DESERT  Z0K2K1  9995773779
FITSCHEN A                  3989   GREEN ROAD       DESERT  Z0K2K1  9995773556
In the output above, notice that DUNN S and FERGUSON BOB have been merged into one row. I don't want that.
dput (sorry if it is not helpful):
list(structure(list(X__1 = c(NA, NA, NA, NA, NA, NA), Name = c("DUMAS PAUL",
"DUNN S", "FERGUSON BOB", "FITSCHEN A", "BLACK CARY", "BLACK RUTH"
), House = c("2030", NA, NA, "3989", "2079", "2079"), Street = c("GREEN ROAD",
"GREEN ROAD", "GREEN ROAD", "GREEN ROAD", "GREEN ROAD", "GREEN ROAD"
), Apt = c(NA, NA, NA, NA, NA, NA), City = c("DESERT", "DESERT",
"DESERT", "DESERT", "DESERT", "DESERT"), Prov = c("ZK", "ZK",
"ZK", "ZK", "ZK", "ZK"), Postal = c("Z0K2K1", "Z0K2K1", "Z0K2K1",
"Z0K2K1", "Z0K2K1", "Z0K2K1"), Phone = c("999-577-3789", "999-577-3256",
"999-577-3772", "999-577-3556", "999-577-3779", "999-577-3779"
), `Last Appear Date` = c(NA, NA, NA, NA, NA, NA)), .Names = c("X__1",
"Name", "House", "Street", "Apt", "City", "Prov", "Postal", "Phone",
"Last Appear Date"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-6L)))
Thanks
Answer
Inside DT[, {...}, by=] you can write almost anything. In this case, if ... else works (something similar can probably be done with dplyr::do):
library(data.table)
library(magrittr)

DT = as.data.table(data)

DT[,
  if (!(is.na(House) & is.na(Apt)))
    # House or Apt is known: collapse the group into a single row
    .(
      Name  = Name %>% paste(collapse = ", AND "),
      Phone = Phone %>% unique %>% paste(collapse = " OR ")
    )
  else
    # House and Apt are both NA: return the rows unchanged
    .(Name, Phone)
, by=.(House, Street, Apt, City, Postal)]
   House     Street Apt   City Postal                       Name        Phone
1:  2030 GREEN ROAD  NA DESERT Z0K2K1                 DUMAS PAUL 999-577-3789
2:    NA GREEN ROAD  NA DESERT Z0K2K1                     DUNN S 999-577-3256
3:    NA GREEN ROAD  NA DESERT Z0K2K1               FERGUSON BOB 999-577-3772
4:  3989 GREEN ROAD  NA DESERT Z0K2K1                 FITSCHEN A 999-577-3556
5:  2079 GREEN ROAD  NA DESERT Z0K2K1 BLACK CARY, AND BLACK RUTH 999-577-3779
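You don't have to use magrittr here; it is just my preference for the paste parts. You may also want to add a %>% sort step to those pipes, so that the lists of phones and names always come out in increasing order.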
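For what it's worth, here is a rough sketch of the same idea without magrittr and with the sort step added. The sample rows are rebuilt by hand from the question's dput (only the columns used here), and the name res is just a placeholder, so treat it as an illustration rather than a drop-in solution.

```r
library(data.table)

# Sample rows rebuilt from the question's dput (assumption: only these columns matter)
DT <- data.table(
  Name   = c("DUMAS PAUL", "DUNN S", "FERGUSON BOB",
             "FITSCHEN A", "BLACK CARY", "BLACK RUTH"),
  House  = c("2030", NA, NA, "3989", "2079", "2079"),
  Street = "GREEN ROAD",
  Apt    = NA_character_,
  City   = "DESERT",
  Postal = "Z0K2K1",
  Phone  = c("999-577-3789", "999-577-3256", "999-577-3772",
             "999-577-3556", "999-577-3779", "999-577-3779")
)

res <- DT[,
  if (!(is.na(House) && is.na(Apt))) {
    # Address is known: collapse the group to one row, sorted for stable output
    .(Name  = paste(sort(Name), collapse = ", AND "),
      Phone = paste(sort(unique(Phone)), collapse = " OR "))
  } else {
    # No house and no apartment: keep every row of the group as it is
    .(Name = Name, Phone = Phone)
  },
  by = .(House, Street, Apt, City, Postal)]

print(res)
```

Inside j the grouping columns have length one, so && is safe in the condition. The else branch is named explicitly so that both branches return the same column names.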
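Answer
I don't think there is a "pretty" solution to this problem; it is a kind of processing that does not fit the dplyr workflow well. One workaround is to uniquely identify the houses with missing data in some way, so that they are not grouped together. One way to do that is to put "#row_number" into House when it is empty. The rows will then no longer be grouped together, because every empty row gets a different number. Once the processing is done, you can simply replace the values that start with # with an empty string or NA.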
data %>%
  mutate(House = if_else(House == "" | is.na(House), paste0("#", row_number()), House)) %>%
  # does the processing... %>%
  mutate(House = if_else(startsWith(House, "#"), "", House))
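To make the placeholder idea concrete, here is a minimal end-to-end sketch with the processing step filled in from the question's own group_by/summarise. The sample rows are rebuilt by hand from the dput, result is a placeholder name, the placeholders are turned back into NA rather than "", and .groups = "drop" assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Sample rows rebuilt from the question's dput (assumption: only these columns matter)
data <- tibble(
  Name   = c("DUMAS PAUL", "DUNN S", "FERGUSON BOB",
             "FITSCHEN A", "BLACK CARY", "BLACK RUTH"),
  House  = c("2030", NA, NA, "3989", "2079", "2079"),
  Street = "GREEN ROAD",
  Apt    = NA_character_,
  City   = "DESERT",
  Postal = "Z0K2K1",
  Phone  = c("999-577-3789", "999-577-3256", "999-577-3772",
             "999-577-3556", "999-577-3779", "999-577-3779")
)

result <- data %>%
  # Give every row without a house number its own placeholder key
  mutate(House = if_else(House == "" | is.na(House),
                         paste0("#", row_number()), House)) %>%
  # The processing: collapse rows that share the same address
  group_by(House, Street, Apt, City, Postal) %>%
  summarise(Name  = paste(Name, collapse = ", AND "),
            Phone = paste(unique(Phone), collapse = " OR "),
            .groups = "drop") %>%
  # Turn the placeholders back into missing values
  mutate(House = if_else(startsWith(House, "#"), NA_character_, House))

print(result)
```

With this data, DUNN S and FERGUSON BOB stay on separate rows while the two BLACK rows are merged. Note that this only guards against a missing House; if Apt should be checked as well, the first mutate condition would need to include it.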