Regex: find repeated words on the same line of a dictionary

Problem description:

I have a dictionary stored in a TXT file in the following format:

house house$casa | casa, vivienda, hogar | edificio, casa | vivienda 

The $ symbol separates the term from its translations.

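I would like to find dictionary words that occur more than once on the same line, using a regular expression in a text editor (e.g. Sublime Text, Notepad++). I do not want a PHP function, because I have to check by hand whether the repeated words should be deleted. In the example above, the regular expression should find house, casa and vivienda. My goal is to obtain the following result: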

house$casa | vivienda, hogar | edificio 

I have tried the following expression, but it does not work properly:

(\b\w+\b)\W+\1 
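
To see why this expression falls short, it can help to run it against the sample line. The sketch below uses Python's `re` module for the comparison. The second pattern, `\b(\w+)\b(?=.*\b\1\b)`, is not from the original post: it is one possible alternative that uses a lookahead to match a word that occurs again anywhere later on the line. It is offered as an assumption that should carry over to editors whose regex engine supports lookahead and backreferences, and it was not verified in any particular editor.

```python
import re

line = "house house$casa | casa, vivienda, hogar | edificio, casa | vivienda"

# The pattern from the question: a word followed directly by itself,
# with only non-word characters in between.
adjacent = re.compile(r'(\b\w+\b)\W+\1')

# Alternative (an assumption, not from the original post): a word that
# occurs again anywhere later on the same line.
anywhere = re.compile(r'\b(\w+)\b(?=.*\b\1\b)')

adjacent_hits = [m.group(1) for m in adjacent.finditer(line)]
anywhere_hits = [m.group(1) for m in anywhere.finditer(line)]

# dict.fromkeys() drops repeated hits while keeping first-seen order
distinct_hits = list(dict.fromkeys(anywhere_hits))

print(adjacent_hits)
print(distinct_hits)
```

On the sample line the original pattern reports only house and casa, because vivienda never appears twice in a row, while the lookahead variant reports house, casa and vivienda. Neither pattern can delete the duplicates while keeping the `|` and `,` structure intact.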

You will not be able to do this with a regular expression alone. Get used to implementing it in a programming language. – Tomalak 2013-05-07 09:09:29

FWIW, here is a rough example of how to do this in Python:

import re 

def distinct_words(block, seen, delim): 
    """ makes a list of words distinct, given a set of words seen earlier """ 

    unique_words = [] 

    for word in re.split(delim, block):
        # skip empty strings produced by leading or trailing delimiters
        if word and word not in seen:
            seen[word] = True
            unique_words.append(word)

    return unique_words 

def process_line(line): 
    """ removes all duplicate words from a dictionary line """ 

    # safeguard 
    if '$' not in line:
        return line

    # split line at the first '$' only, so a stray '$' in a translation cannot break the unpacking
    original, translated = line.split('$', 1)

    # make original words distinct 
    distinct_original = distinct_words(original, {}, r' +') 

    # make translated words distinct, but keep block structure 

    # split the translated part at '|' into blocks 
    # split each block at ', ' into words 
    seen = {} 
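    # strip surrounding whitespace so a trailing space or newline is not treated as part of a word
    blocks = re.split(r'\s*\|\s*', translated.strip())
    distinct_blocks = (distinct_words(block, seen, r', +') for block in blocks)
    # drop blocks that became empty once their duplicates were removed
    distinct_translated = [block for block in distinct_blocks if block]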

    # put everything back together again 
    part_original = ' '.join(distinct_original) 
    part_translated = [', '.join(block) for block in distinct_translated] 
    part_translated = ' | '.join(part_translated) 
    result = part_original + '$' + part_translated 

    return result 

def process_dictionary(filename): 
    """ processes a dictionary text file, modifies the file in place """ 

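    # read without line endings so that '\n'.join() does not double them
    with open(filename, 'r') as infile:
        lines = infile.read().splitlines()
    lines_out = [process_line(line) for line in lines]
    with open(filename, 'w') as outfile:
        outfile.write('\n'.join(lines_out) + '\n')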

Obviously, you would call process_dictionary() like this:

process_dictionary('dict_en_es.txt') 

But for the sake of example, assume you have a single line:

line = "house house$casa | casa, vivienda, hogar | edificio, casa | vivienda" 
line_out = process_line(line) 
print(line_out)

This prints the desired result:

 
house$casa | vivienda, hogar | edificio
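
Since the question asks for a way to review the repeats by hand before deleting anything, a read-only report may be a useful companion to the code above. The following is a minimal sketch, not part of the original answer: `repeated_words` and `report_repeats` are hypothetical helper names, and the sketch assumes the `original$translations` format of the sample line. It counts words separately on each side of the `$`, as the answer does, but it tokenizes with `\w+`, so a multi-word translation would be counted word by word.

```python
import re
from collections import Counter


def repeated_words(line):
    """Return the words that occur more than once on either side of the '$'."""
    # Assumed format: "<original words>$<translations>", as in the sample line.
    original, _, translated = line.partition('$')
    repeated = []
    for part in (original, translated):
        counts = Counter(re.findall(r'\w+', part))
        # Counter keeps first-seen order, so the report follows the line
        repeated.extend(word for word, n in counts.items() if n > 1)
    return repeated


def report_repeats(lines):
    """Yield (line_number, repeated_words) for every line that has repeats."""
    for number, line in enumerate(lines, start=1):
        words = repeated_words(line)
        if words:
            yield number, words


sample = [
    "house house$casa | casa, vivienda, hogar | edificio, casa | vivienda",
    "dog$perro",
]
for number, words in report_repeats(sample):
    print(number, ', '.join(words))
```

For the two sample lines this prints `1 house, casa, vivienda`, which is the list of words the question wants to locate. Nothing is written back to the file, so the dictionary can be inspected first and cleaned afterwards.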