In this paper, we study the \emph{topic-specific} retweet count ranking problem in Weibo. Two challenges make this task nontrivial. First, traditional methods cannot derive effective features for tweets, because in the topic-specific setting, tweets usually share too much content to be distinguished. We propose an LSTM-embedded autoencoder to generate tweet features, with the insight that any distinct prefix of the tweet text is a possible distinctive feature. Second, it is critical to fully capture the meaning of a topic in the topic-specific setting, but Weibo provides little information about topics. We leverage real-time news information from Toutiao to enrich the meaning of topics, as more than 85\% of topics are headline news. We evaluate the proposed components with ablation studies, and compare the overall solution with a recently proposed tensor factorization model. Extensive experiments on real Weibo data show the effectiveness and flexibility of our methods.

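As the abstract shows, the paper's main contribution lies in its methods for extracting topic, tweet, and user features. User features exist naturally and need little extra processing. For topic features, Weibo itself provides little topic information, so the paper pulls related topic information from news sites such as Toutiao (studies have shown that 85% of the content on Weibo is news, which is close in nature to Toutiao) and then uses a DAE to extract topic features. For tweet features, the main problem is that tweets under the same topic are almost identical: the difficulties include large numbers of verbatim retweets, a minority of retweets that add a few sentences of personal opinion, and very short texts. The paper adopts an LSTM-embedded autoencoder. Its main difference from the autoencoder used in machine translation is that this paper cares about feature extraction (the encoder output) rather than the mapping between two languages (the decoder output).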
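The focus on the encoder output can be illustrated with a small sketch. The code below is not the paper's model: it is a hypothetical, untrained LSTM encoder in plain Python, with the decoder and training omitted. The class name `TinyLSTMEncoder`, the random weights, the hidden size, and the toy tweets are all assumptions made for illustration. It only shows that the encoder's hidden state after each token acts as a feature for that prefix, so near-identical tweets share their prefix features and diverge once the text differs.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTMEncoder:
    """Hypothetical, untrained LSTM encoder (illustration only, no biases)."""

    def __init__(self, vocab, hidden_size=4, seed=0):
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        # Random vectors stand in for pretrained word embeddings.
        self.embed = {
            w: [rng.uniform(-1, 1) for _ in range(hidden_size)]
            for w in sorted(vocab)
        }

        def matrix():
            # Each gate has one weight matrix applied to the concatenation [x; h].
            return [
                [rng.uniform(-0.5, 0.5) for _ in range(2 * hidden_size)]
                for _ in range(hidden_size)
            ]

        self.W = {gate: matrix() for gate in ("i", "f", "o", "g")}

    def _gate(self, name, xh, activation):
        return [
            activation(sum(w * v for w, v in zip(row, xh)))
            for row in self.W[name]
        ]

    def prefix_features(self, tokens):
        """Return one hidden state per prefix tokens[:1], tokens[:2], ..."""
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        states = []
        for tok in tokens:
            xh = self.embed[tok] + h           # concatenate input and hidden state
            i = self._gate("i", xh, sigmoid)   # input gate
            f = self._gate("f", xh, sigmoid)   # forget gate
            o = self._gate("o", xh, sigmoid)   # output gate
            g = self._gate("g", xh, math.tanh) # candidate cell state
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
            states.append(h)
        return states


# Two toy tweets under the same topic: a verbatim retweet and one with a comment.
tweet_a = ["breaking", "news", "about", "topic"]
tweet_b = ["breaking", "news", "about", "topic", "disagree"]

encoder = TinyLSTMEncoder(set(tweet_a + tweet_b))
feats_a = encoder.prefix_features(tweet_a)
feats_b = encoder.prefix_features(tweet_b)

# The shared prefixes yield identical features; the extra token adds a new one.
print(feats_a == feats_b[:len(tweet_a)])
print(len(feats_b) - len(feats_a))
```

In a trained autoencoder, a decoder would be fit to reconstruct the input from these states, and only the encoder states would be kept as tweet features for ranking.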

The ranking method and the word embedding method used throughout the paper are off-the-shelf, so they do not add much as contributions.




