Filling blank rows based on the previous/next non-empty value

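Problem description:

I have a subscription data frame like the one shown below, with about 1 million unique IDs. The table lists subscription status: when a user starts subscribing to the service, the Status field is 'Sub'; when the user cancels the subscription, it is 'Usub'.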

dat <- data.frame(ID = rep(c("A", "B"), each = 12), Year = "2014", Month = rep(1:12, 2), Status = NA)
dat$Status[4] <- "Sub" 
dat$Status[8] <- "Usub" 
dat$Status[17] <- "Usub" 
dat$Status[21] <- "Sub" 

ID Year Month Status 
A 2014 1  
A 2014 2  
A 2014 3  
A 2014 4 Sub 
A 2014 5  
A 2014 6  
A 2014 7  
A 2014 8 Usub 
A 2014 9  
A 2014 10  
A 2014 11  
A 2014 12  
B 2014 1  
B 2014 2  
B 2014 3  
B 2014 4  
B 2014 5 Usub  
B 2014 6  
B 2014 7  
B 2014 8  
B 2014 9 Sub 
B 2014 10  
B 2014 11  
B 2014 12  
C 2014 1  . 
. . .  . 
. . .  . 

I want to fill the gaps between status updates. The desired output table would look like this:

ID Year Month Status 
A 2014 1 Usub 
A 2014 2 Usub 
A 2014 3 Usub 
A 2014 4 Sub 
A 2014 5 Sub 
A 2014 6 Sub 
A 2014 7 Sub 
A 2014 8 Usub 
A 2014 9 Usub 
A 2014 10 Usub 
A 2014 11 Usub 
A 2014 12 Usub 
B 2014 1 Sub 
B 2014 2 Sub 
B 2014 3 Sub 
B 2014 4 Sub 
B 2014 5 Usub 
B 2014 6 Usub 
B 2014 7 Usub 
B 2014 8 Usub 
B 2014 9 Sub 
B 2014 10 Sub 
B 2014 11 Sub 
B 2014 12 Sub 
C 2014 1  . 
. . .  . 
. . .  . 

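Each ID has at least one status value. If the first status record for an ID is "Usub", then the status for all earlier months is "Sub" (like ID B at 2014/05). Conversely, if the first status record is "Sub", the status for all earlier months is "Usub".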

@MrFlick Last observation carried forward may not work for the first 3 rows of ID = A and the first 4 rows of ID = B. – ohmyan

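@MrFlick The subscription data is incomplete, which means the first non-empty status may not be "Sub"; it could be "Usub", in which case all the earlier rows are actually "Sub". – ohmyan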

@MrFlick As stated in the post, each ID has at least one status value. – ohmyan

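You can generate an alternating sequence that is equivalent to your desired status by taking the cumulative sum of (Status == "Usub") - (Status == "Sub"). This way, every position that should be filled with Sub gets a lower value than the positions that should be filled with Usub. Then use factor, which assigns labels in sorted order of the values, to convert the integer sequence into a factor: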

library(dplyr) 
df %>% group_by(ID) %>% mutate(Status = factor(cumsum((Status == "Usub") - (Status == "Sub")), 
               labels = c("Sub", "Usub"))) 

# ID Year Month Status 
# 1 A 2014  1 Usub 
# 2 A 2014  2 Usub 
# 3 A 2014  3 Usub 
# 4 A 2014  4 Sub 
# 5 A 2014  5 Sub 
# 6 A 2014  6 Sub 
# 7 A 2014  7 Sub 
# 8 A 2014  8 Usub 
# 9 A 2014  9 Usub 
# 10 A 2014 10 Usub 
# 11 A 2014 11 Usub 
# 12 A 2014 12 Usub 
# 13 B 2014  1 Sub 
# 14 B 2014  2 Sub 
# 15 B 2014  3 Sub 
# 16 B 2014  4 Sub 
# 17 B 2014  5 Usub 
# 18 B 2014  6 Usub 
# 19 B 2014  7 Usub 
# 20 B 2014  8 Usub 
# 21 B 2014  9 Sub 
# 22 B 2014 10 Sub 
# 23 B 2014 11 Sub 
# 24 B 2014 12 Sub 

The corresponding data.table approach would be:

library(data.table) 
setDT(df)[, Status := as.character(factor(cumsum((Status == "Usub") - (Status == "Sub")), labels = c("Sub", "Usub"))), .(ID)] 

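You have to convert the new factor back to character here, because := does not allow changing the type of the existing Status column when it is updated.

Data, assuming you have empty strings rather than NA: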

df <- structure(list(ID = c("A", "A", "A", "A", "A", "A", "A", "A", 
"A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B", "B", "B", 
"B", "B", "B"), Year = c("2014", "2014", "2014", "2014", "2014", 
"2014", "2014", "2014", "2014", "2014", "2014", "2014", "2014", 
"2014", "2014", "2014", "2014", "2014", "2014", "2014", "2014", 
"2014", "2014", "2014"), Month = c("1", "2", "3", "4", "5", "6", 
"7", "8", "9", "10", "11", "12", "1", "2", "3", "4", "5", "6", 
"7", "8", "9", "10", "11", "12"), Status = c("", "", "", "Sub", 
"", "", "", "Usub", "", "", "", "", "", "", "", "", "Usub", "", 
"", "", "Sub", "", "", "")), .Names = c("ID", "Year", "Month", 
"Status"), row.names = c(NA, 24L), class = "data.frame") 
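
The question's dat uses NA for the blanks, while the code above expects empty strings: with NA, Status == "Usub" returns NA and cumsum propagates it. A minimal sketch of the conversion, assuming the dat from the question (rebuilt here so the snippet runs on its own; the stringsAsFactors argument is only added for older R versions):

```r
# Rebuild the question's data, where blanks are NA
dat <- data.frame(ID = rep(c("A", "B"), each = 12), Year = "2014",
                  Month = rep(1:12, 2), Status = NA_character_,
                  stringsAsFactors = FALSE)
dat$Status[c(4, 21)] <- "Sub"
dat$Status[c(8, 17)] <- "Usub"

# Replace NA with "" so that Status == "Usub" yields TRUE/FALSE, never NA
df <- dat
df$Status[is.na(df$Status)] <- ""
```

After this step df can be passed to either the dplyr or the data.table version.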

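This is neat! It outputs exactly what I am looking for. However, I still have trouble understanding this operation: (Status == "Usub") - (Status == "Sub"). I am not entirely sure how it works. Would you mind giving more details? Thanks! – ohmyan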

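This assumes 'Usub' and 'Sub' always alternate. By doing the subtraction, you get 1 for every 'Usub' and -1 for every 'Sub', and the cumsum of that sequence gives an alternating sequence of 0 and 1, or of 0 and -1, depending on which of 'Usub' and 'Sub' comes first. The subtraction also ensures that all the empty strings after a Sub carry the lower value: 0 in the former case and -1 in the latter. Then, when you build a factor from it, you know 'Sub' should be the first label, because it corresponds to the lower value as explained above. – Psidom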
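
To make the explanation above concrete, here is a small base R sketch that walks through the intermediate values for ID A only. The variable names step, level and filled are just illustrative and are not part of the answer:

```r
# Status for ID "A" from the question, with "" for the blanks
status <- c("", "", "", "Sub", "", "", "", "Usub", "", "", "", "")

# +1 at every "Usub", -1 at every "Sub", 0 at the blanks
step <- (status == "Usub") - (status == "Sub")
# values: 0 0 0 -1 0 0 0 1 0 0 0 0

# Running total: only two distinct values, and the lower one marks "Sub" periods
level <- cumsum(step)
# values: 0 0 0 -1 -1 -1 -1 0 0 0 0 0

# factor() sorts the distinct values, so the lower one gets the first label
filled <- as.character(factor(level, labels = c("Sub", "Usub")))
filled
```

If the statuses of an ID do not alternate, or an ID has a single status in its first row, the running total can take a number of distinct values other than two, and factor() with two labels then raises an error.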

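# Loop over IDs; assumes each ID has exactly one "Sub" row and one "Usub" row
uniquevector <- unique(dat$ID)
for (i in uniquevector) {
    zzz  <- which(dat$ID == i & dat$Status == "Sub")   # row holding "Sub"
    zzz2 <- which(dat$ID == i & dat$Status == "Usub")  # row holding "Usub"
    zzz3 <- which(dat$ID == i & dat$Month == 12)       # last row of this ID
    zzz4 <- which(dat$ID == i & dat$Month == 1)        # first row of this ID
    # between the two records
    if (zzz2 > zzz) {
        index <- zzz:(zzz2 - 1)
        dat$Status[index] <- "Sub"
    }
    if (zzz2 < zzz) {
        index <- zzz2:(zzz - 1)
        dat$Status[index] <- "Usub"
    }
    # from the later record to the end of the year
    if (zzz3 > zzz2 && zzz < zzz2) {
        index <- zzz2:zzz3
        dat$Status[index] <- "Usub"
    }
    if (zzz2 < zzz && zzz3 > zzz) {
        index <- zzz:zzz3
        dat$Status[index] <- "Sub"
    }
    # from the start of the year to the earlier record (opposite status)
    if (zzz4 < zzz && zzz < zzz2) {
        index <- zzz4:(zzz - 1)
        dat$Status[index] <- "Usub"
    }
    if (zzz4 < zzz2 && zzz2 < zzz) {
        index <- zzz4:(zzz2 - 1)
        dat$Status[index] <- "Sub"
    }
}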

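Another option is to convert the blanks "" to NA and use na.locf from the zoo package to replace each NA with the previous non-NA element. Since this is a grouped operation, we can do it with ave from base R. Because there is no last observation to carry forward for the leading rows of each ID, those rows are filled with the status that differs from the first non-NA value in the group.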

library(zoo) 
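df$Status <- with(df, ave(replace(Status, !nzchar(Status), NA), ID,
    FUN = function(x) {
        # carry the last non-NA status forward within each ID
        x1 <- na.locf(x, na.rm = FALSE)
        # leading NAs get the status that differs from the first observed one
        replace(x1, is.na(x1), setdiff(unique(na.omit(x1)), x1[!is.na(x1)][1]))
    }))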
df$Status 
#[1] "Usub" "Usub" "Usub" "Sub" "Sub" "Sub" "Sub" "Usub" "Usub" "Usub" "Usub" "Usub" "Sub" "Sub" "Sub" "Sub" "Usub" "Usub" "Usub" 
#[20] "Usub" "Sub" "Sub" "Sub" "Sub"
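
For readers who prefer not to depend on zoo, the same carry-forward idea can be sketched in base R. This is only an illustration under stated assumptions: fill_status is a hypothetical helper that is not part of the answer, it hard-codes the two labels "Sub" and "Usub", and it relies on every ID having at least one status, as the question says.

```r
# Hypothetical helper: fill the status vector of a single ID
fill_status <- function(x) {
    x[!nzchar(x)] <- NA                  # treat "" as missing
    seen <- !is.na(x)
    idx <- cumsum(seen)                  # index of the latest observed status; 0 before the first
    first <- x[seen][1]                  # first observed status of this ID
    other <- if (first == "Sub") "Usub" else "Sub"
    # idx 0 (leading rows) maps to the opposite of the first observed status
    c(other, x[seen])[idx + 1]
}

# Same layout as the example data: two IDs, 12 months each, "" for blanks
df <- data.frame(ID = rep(c("A", "B"), each = 12), Status = "",
                 stringsAsFactors = FALSE)
df$Status[c(4, 21)] <- "Sub"
df$Status[c(8, 17)] <- "Usub"

# Apply the helper per ID
df$Status <- ave(df$Status, df$ID, FUN = fill_status)
```

Unlike the setdiff-based version, this sketch does not need both statuses to be present within an ID, at the cost of fixing the label names in the helper.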