Summarizing combination sequences

Question:

I have a sample dataset that I want to summarize by user_id. Each record represents one signup.

> test 
    user_id    time plan 
1  1 2017-06-23 20:00:00 monthly 
2  2 2017-07-20 20:00:00 monthly 
3  3 2017-06-03 20:00:00 monthly 
4  1 2017-07-03 20:00:00 monthly 
5  2 2017-05-11 20:00:00 yearly 
6  3 2017-07-27 20:00:00 yearly 
7  1 2017-05-09 20:00:00 yearly 
8  2 2017-01-15 19:00:00 yearly 
9  3 2017-08-18 20:00:00 yearly 
10  1 2017-01-30 19:00:00 monthly

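Each user has signed up for different plans in a different order (by time). For example, user 1's sequence is monthly-yearly-monthly-monthly, so user 1 has switched twice.

User 2 has yearly-yearly-monthly, so user 2 has switched once.

User 3 went monthly-yearly-yearly, so user 3 has switched once.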

> test[order(test$time),] 
    user_id    time plan 
8  2 2017-01-15 19:00:00 yearly 
10  1 2017-01-30 19:00:00 monthly 
7  1 2017-05-09 20:00:00 yearly 
5  2 2017-05-11 20:00:00 yearly 
3  3 2017-06-03 20:00:00 monthly 
1  1 2017-06-23 20:00:00 monthly 
4  1 2017-07-03 20:00:00 monthly 
2  2 2017-07-20 20:00:00 monthly 
6  3 2017-07-27 20:00:00 yearly 
9  3 2017-08-18 20:00:00 yearly

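My goal is to summarize the combinations of switches. In other words, I want to count how many users went from yearly to monthly, how many went from monthly to yearly, and how many switched plans multiple times. The output for the dataset above might look like this: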

> output 
      type count 
1 monthly-yearly  1 
2 yearly-monthly  1 
3  multiple  1

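How would one group by user_id in R and then reduce each user's sequence to one of the strings multiple, monthly-yearly, or yearly-monthly? Any suggestions or advice would be much appreciated.

The dataset above: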

> dput(test) 
structure(list(user_id = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1), time = structure(c(1498262400, 
1500595200, 1496534400, 1499126400, 1494547200, 1501200000, 1494374400, 
1484524800, 1503100800, 1485820800), class = c("POSIXct", "POSIXt" 
)), plan = c("monthly", "monthly", "monthly", "monthly", "yearly", 
"yearly", "yearly", "yearly", "yearly", "monthly")), .Names = c("user_id", 
"time", "plan"), row.names = c(NA, -10L), class = "data.frame")

Answer

Here is one way to do this using dplyr and the handy rle function (run-length encoding):

library(dplyr) 

output <- test %>% group_by(user_id) %>% #group by id 
     arrange(time) %>%     #sort by date 
     summarise(first=first(plan),switches=length(rle(plan)$values)) %>% 
             #find first plan and number of runs (i.e. switches + 1) 
     mutate(type=ifelse(switches>2,"multiple", 
        ifelse(first=="monthly","monthly-yearly","yearly-monthly"))) %>% 
             #convert these to your three types 
     count(type)      #short for group_by and n() 

output 
      type  n 
      <chr> <int> 
1 monthly-yearly  1 
2  multiple  1 
3 yearly-monthly  1

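Thanks, but what do you mean by "find first plan"? What does the first plan mean?

first is whether that customer's first plan was monthly or yearly. So if the rle length is 2, we know which way the switch must have gone.

Answer

Here is another way: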
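
To make the comment above concrete, here is a minimal sketch of what rle returns for user 1's time-ordered plans. The vector user1_plans is typed in by hand from the question's data, and the variable names are illustrative only.

```r
# User 1's plans in time order, copied by hand from the question's data
user1_plans <- c("monthly", "yearly", "monthly", "monthly")

# rle() collapses consecutive repeats into runs
runs <- rle(user1_plans)
runs$values    # one entry per run
runs$lengths   # how long each run is

# Number of runs; the number of switches is one less
n_runs <- length(runs$values)
n_switches <- n_runs - 1
```

With three runs, this user falls into the multiple bucket. With exactly two runs there was a single switch, so the first plan alone determines its direction.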

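library(data.table) 
setDT(test)   # convert the data.frame to a data.table by reference 

test[order(user_id, time), 
    .(plan = first(plan))                    # keep one row per run 
, by=.(user_id, rleid(user_id, plan))][, 
    if (.N < 3L) paste(plan, collapse="-")   # one or two runs: spell out the pattern 
    else "multiple"                          # three or more runs 
, by=user_id][, 
    .N                                       # count users per pattern 
, by=.(pattern = V1)] 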

#   pattern N 
# 1:  multiple 1 
# 2: yearly-monthly 1 
# 3: monthly-yearly 1

Translated to dplyr, building on @AndrewGustar's answer:

library(dplyr) 

test %>% 
    group_by(user_id) %>% 
    arrange(time) %>% 
    summarise(pattern = 
     if (length(r <- rle(plan)$values) < 3) paste(r, collapse="-") 
     else "multiple" 
    ) %>% 
    count(pattern) 

# # A tibble: 3 x 2 
#   pattern  n 
#   <chr> <int> 
# 1 monthly-yearly  1 
# 2  multiple  1 
# 3 yearly-monthly  1

How it works

To break down how it works, try running it piece by piece, stopping at each %>% or closing ] bracket.

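It...

identifies runs of each value within each group using rleid;
summarizes each user by their sequence of runs, writing "multiple" for any sequence of 3+ runs;
and counts users by these summaries.

It does not rely on the specific plan values.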
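
As a rough illustration of that last point, the following base R sketch (not either answerer's code) applies the same run-collapsing idea to made-up data that includes a hypothetical weekly plan. The function name summarise_patterns and the toy data frame are assumptions introduced only for this example.

```r
# Hypothetical helper: same idea as above, written with base R only
summarise_patterns <- function(df) {
  df <- df[order(df$user_id, df$time), ]          # sort by user, then time
  per_user <- vapply(
    split(df$plan, df$user_id),                   # one plan vector per user
    function(p) {
      runs <- rle(p)$values                       # collapse consecutive repeats
      if (length(runs) < 3) paste(runs, collapse = "-") else "multiple"
    },
    character(1)
  )
  table(pattern = per_user)                       # count users per pattern
}

# Made-up data with a plan value that does not appear in the question
toy <- data.frame(
  user_id = c(1, 1, 2, 2, 2, 3, 3, 3),
  time    = as.POSIXct(c(1, 2, 1, 2, 3, 1, 2, 3), origin = "1970-01-01", tz = "UTC"),
  plan    = c("weekly", "monthly", "yearly", "weekly", "yearly",
              "monthly", "monthly", "weekly"),
  stringsAsFactors = FALSE
)

result <- summarise_patterns(toy)
result
```

Because the pattern label is built by pasting the run values together, no plan name is hard-coded, so a new plan type needs no code change.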
