Replacing the "subject" of a sentence with spaCy

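Question:

As part of a thought experiment, I wrote a function in Python that uses spaCy to find the subject of a news article title and then replaces it with a noun of my choice. The problem is that it doesn't work very well, and I'm hoping it can be improved. I don't fully understand spaCy, and the documentation is a bit hard to follow.

First, the code: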

doc = nlp(thetitle)
for text in doc:
    # subject would be
    if text.dep_ == "nsubj":
        subject = text.orth_
    # iobj for indirect object
    if text.dep_ == "iobj":
        indirect_object = text.orth_
    # dobj for direct object
    if text.dep_ == "dobj":
        direct_object = text.orth_
try:
    subject
except NameError:
    if not thetitle:  # if empty title
        thetitle = "cat"
        subject = "cat"
    else:  # if unknown subject
        try:  # do we have a direct object?
            direct_object
        except NameError:
            try:  # do we have an indirect object?
                indirect_object
            except NameError:  # still no??
                subject = random.choice(thetitle.split())
            else:
                subject = indirect_object
        else:
            subject = direct_object
else:
    thecat = "cat"  # do nothing here, everything went okay
newtitle = re.sub(r"\b%s\b" % subject, toreplace, thetitle)
if (newtitle == thetitle):  # if no replacement happened due to regex
    newtitle = thetitle.replace(subject, toreplace)
return newtitle

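The "cat" line is a filler line that doesn't do anything. "thetitle" is a variable holding a random news article title that I pull from an RSS feed. "toreplace" is a variable holding the string that replaces whatever subject is found.

Let's use an example:

"Video Games that Should Be Animated TV Shows - Screen Rant", and here is the displaCy breakdown: https://demos.explosion.ai/displacy/?text=Video%20Games%20that%20Should%20Be%20Animated%20TV%20Shows%20-%20Screen%20Rant&model=en&cpu=1&cph=1

The word the code decided to replace ended up being "that", which isn't even a noun in this sentence. It seems to have come from the random-word fallback, because the code couldn't find a subject, an indirect object, or a direct object. I would have hoped it would find something more like "Video Games" in this example.

I should note that if I take out the last bit (which appears to be the source of the news article) in displaCy: https://demos.explosion.ai/displacy/?text=Video%20Games%20that%20Should%20Be%20Animated%20TV%20Shows&model=en&cpu=1&cph=1 it seems to think "that" is the subject, which is incorrect.

What is a better way to parse this? Should I look for proper nouns first?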

This try/except block doesn't look very Pythonic. What's wrong with initializing to None and then checking?
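For illustration, here is a minimal sketch of the None-initialization pattern this comment seems to suggest. The function name `pick_subject` and the `(label, word)` input format are assumptions made for the example; they are not part of the original code.

```python
def pick_subject(deps):
    """Pick a word to replace from (dependency_label, word) pairs.

    `deps` is a hypothetical input format; with spaCy it could be built
    as ((t.dep_, t.text) for t in doc).
    """
    # Start with None instead of relying on NameError for "not found".
    subject = direct_object = indirect_object = None

    for dep, word in deps:
        if dep == "nsubj":
            subject = word
        elif dep == "dobj":
            direct_object = word
        elif dep == "iobj":
            indirect_object = word

    # `or` returns the first candidate that is not None (or empty),
    # in the same priority order as the original code.
    return subject or direct_object or indirect_object
```

A caller would check the result for None and only then fall back to a random word, which replaces the three nested try/except blocks with a single test.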

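You won't see good results on sentence fragments; your example sentence doesn't even have a predicate.

Re: the try/except block, I based it on a code sample I found that showed how to use spaCy. Is it bad that there's no predicate? Is there something better than spaCy for finding the subject of a sentence fragment? – SpaceMouse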

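Answer

This doesn't directly answer your question, but I think the code below is more readable, because the conditions are explicit and what happens when a condition holds isn't buried in a far-away else clause. The code also handles the case of multiple objects.

As for your question: any natural language processing tool will have a hard time finding the subject (or possibly the topic) of a sentence fragment, because these tools are trained on complete sentences. I'm not even sure such fragments technically have a subject (though I'm no expert). You could try training your own model, but you would have to provide labeled sentences, and I don't know whether such a dataset already exists for sentence fragments.

I'm not entirely sure what you want to achieve, but the common nouns and pronouns probably include the word you want to replace, and the first one to appear is probably the most important.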

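import spacy
import random
import re
from collections import defaultdict

def replace_subj(sentence, nlp):
    # Handle an empty (or whitespace-only) title before parsing anything.
    if not sentence.strip():
        return "cat"

    doc = nlp(sentence)

    # Group token texts by dependency label, keeping every match in order.
    tokens = defaultdict(list)
    for token in doc:
        tokens[token.dep_].append(token.orth_)

    # Prefer the subject, then the direct object, then the indirect object.
    if "nsubj" in tokens:
        subject = tokens["nsubj"][0]
    elif "dobj" in tokens:
        subject = tokens["dobj"][0]
    elif "iobj" in tokens:
        subject = tokens["iobj"][0]
    else:
        # Nothing found: fall back to a random word.
        subject = random.choice(sentence.split())

    # re.escape keeps regex metacharacters in the word from breaking the pattern.
    return re.sub(r"\b{}\b".format(re.escape(subject)), "cat", sentence)

if __name__ == "__main__":
    sentence = """Video Games that Should Be Animated TV Shows - Screen Rant"""

    # "en" is the old shortcut name; spaCy 3+ needs the full package name,
    # e.g. "en_core_web_sm".
    nlp = spacy.load("en")
    print(replace_subj(sentence, nlp))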
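
The idea of looking at nouns and pronouns rather than dependency labels could be sketched roughly as follows. This is only an illustration under some assumptions: the names `NOMINAL_TAGS`, `first_nominal`, `replace_first_nominal` and `run_with_spacy` are made up for the example, and the model name `en_core_web_sm` assumes that package is installed. The only spaCy token attributes it relies on are `text`, `pos_` and `idx`.

```python
# Coarse part-of-speech tags (exposed by spaCy as token.pos_) that are
# treated as candidates for replacement. The choice of tags is an assumption.
NOMINAL_TAGS = {"NOUN", "PROPN", "PRON"}


def first_nominal(tokens):
    """Return the first noun, proper noun or pronoun token, or None."""
    for token in tokens:
        if token.pos_ in NOMINAL_TAGS:
            return token
    return None


def replace_first_nominal(sentence, tokens, replacement):
    """Replace the first nominal token; return the sentence unchanged if none."""
    token = first_nominal(tokens)
    if token is None:
        return sentence
    # token.idx is the token's character offset in the original text,
    # so slicing sidesteps regex escaping entirely.
    start = token.idx
    end = start + len(token.text)
    return sentence[:start] + replacement + sentence[end:]


def run_with_spacy(sentence, replacement="cat", model="en_core_web_sm"):
    """Hypothetical helper: parse with spaCy, then replace."""
    import spacy  # imported here so the functions above work without spaCy

    nlp = spacy.load(model)
    return replace_first_nominal(sentence, nlp(sentence), replacement)
```

This replaces a single token, so for the example title it would at best swap one word of "Video Games", and the result depends on how the model tags the fragment. If whole phrases are wanted, the spans from `doc.noun_chunks` carry `start_char` and `end_char` offsets that could be sliced the same way, though the chunks come from the same parser and may be just as unreliable on fragments.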
