I churned out a paper for PAKDD 2018: Topic-specific Retweet Count Ranking for Weibo
The title says what the work is about: ranking tweets by retweet count within a specific topic on Weibo.
Abstract:
In this paper, we study topic-specific retweet count ranking problem in Weibo. Two challenges make this task nontrivial. Firstly, traditional methods cannot derive effective feature for tweets, because in topic-specific setting, tweets usually have too many shared contents to distinguish them. We propose a LSTM-embedded autoencoder to generate tweet features with the insight that any different prefixes of tweet text is a possible distinctive feature. Secondly, it is critical to fully catch the meaning of topic in topic-specific setting, but Weibo can provide little information about topic. We leverage real-time news information from Toutiao to enrich the meaning of topic, as more than 85% topics are headline news. We evaluate the proposed components based on ablation methods, and compare the overall solution with a recently-proposed tensor factorization model. Extensive experiments on real Weibo data show the effectiveness and flexibility of our methods.
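As the abstract shows, the main contribution of the paper is how it extracts topic, tweet, and user features. User features exist naturally and need no extra processing. For topic features, Weibo itself provides little information about a topic, so the paper pulls information about the topic from a news site, Toutiao (prior research indicates that 85% of the content on Weibo is news, which is close to what Toutiao carries), and then uses a DAE to extract the topic features. For tweet features, the main problem is that tweets under the same topic are almost identical (the difficulties include many verbatim retweets, a few retweets that add only a sentence or two of personal opinion, and very short texts). The paper uses an LSTM-embedded autoencoder; the main difference from the autoencoder used in machine translation is that this paper cares about feature extraction (the encoder output) rather than the mapping between two languages (the decoder output).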
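The prefix idea can be made concrete with a small sketch. This is not the paper's model: it is only the encoder half of an autoencoder, with random untrained weights, working on characters rather than words, and the class name `TinyLSTMEncoder`, the hidden size, and the sample tweets are all assumptions made for illustration. What it shows is that a recurrent encoder produces one state per prefix, so two tweets that share a long prefix have identical states over the shared part and only diverge once the text differs.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTMEncoder:
    """Untrained single-layer LSTM over characters (hypothetical, illustration only)."""

    def __init__(self, vocab: str, hidden: int = 8, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.hidden = hidden
        self.index = {ch: i for i, ch in enumerate(vocab)}
        in_dim = len(vocab) + hidden  # one-hot input concatenated with previous h
        # One weight matrix and one bias vector per gate:
        # i = input gate, f = forget gate, o = output gate, c = candidate cell.
        self.w = {
            g: [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(hidden)]
            for g in "ifoc"
        }
        self.b = {g: [0.0] * hidden for g in "ifoc"}

    def _gate(self, name, x, activation):
        return [
            activation(sum(w * v for w, v in zip(row, x)) + bias)
            for row, bias in zip(self.w[name], self.b[name])
        ]

    def prefix_states(self, text: str) -> list:
        """Return one hidden state per prefix text[:1], text[:2], ..., text."""
        h = [0.0] * self.hidden
        c = [0.0] * self.hidden
        states = []
        for ch in text:
            one_hot = [0.0] * len(self.index)
            one_hot[self.index[ch]] = 1.0
            x = one_hot + h
            i = self._gate("i", x, sigmoid)
            f = self._gate("f", x, sigmoid)
            o = self._gate("o", x, sigmoid)
            g = self._gate("c", x, math.tanh)
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
            states.append(h)
        return states


# Two made-up tweets under the same topic: a verbatim retweet and one with a comment.
tweet_a = "#topic# great news"
tweet_b = "#topic# great news, I disagree"
vocab = "".join(sorted(set(tweet_a + tweet_b)))

encoder = TinyLSTMEncoder(vocab)
states_a = encoder.prefix_states(tweet_a)
states_b = encoder.prefix_states(tweet_b)

shared = len(tweet_a)
print("identical over shared prefix:", states_b[:shared] == states_a)
print("final states differ:", states_b[-1] != states_a[-1])
```

In a trained autoencoder a decoder would be asked to reconstruct the text from these encoder states, and the encoder output would then be used as the tweet feature; that training step is left out here.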
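The ranking method and the word embedding method used throughout the paper are off-the-shelf, so there is no major contribution on that side.
To summarize, the paper makes three contributions. First, it tackles topic-specific ranking, which few people had worked on before. Second, it proposes methods for extracting tweet and topic features; they are intuitive, but they apply to quite a few scenarios. Third, the proposed methods perform reasonably well.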
I found a write-up about PAKDD 2017:
http://data-mining.philippe-fournier-viger.com/pakdd-2017-conference-brief-report/
2) [Figure: number of accepted long and short papers at PAKDD for the last six years]
5) [Figure: acceptance rate of long and short papers at PAKDD during the last six years]