Regex: find words that occur more than once on the same line of a dictionary
Question:
I have a dictionary as a TXT file in the following format:
house house$casa | casa, vivienda, hogar | edificio, casa | vivienda
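The $ sign separates the term from its translations.
I want to use a regular expression in a text editor (e.g. Sublime Text, Notepad++) to find dictionary words that occur several times on the same line. I don't want a PHP function, because I need to check by hand whether those repeated words should be deleted. In the example above, the regex should find house, casa and vivienda. My goal is to get the following result: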
house$casa | vivienda, hogar | edificio
I tried the following expression, but it does not work properly:
(\b\w+\b)\W+\1
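A quick check of this pattern suggests why it falls short. The sketch below runs it in Python against the sample line; it assumes that an editor's regex engine treats this particular pattern the same way, which was not verified here. The variable names are only illustrative.

```python
import re

line = "house house$casa | casa, vivienda, hogar | edificio, casa | vivienda"

# The pattern from the question: a word, some non-word characters,
# then the same text again immediately afterwards.
adjacent_only = re.compile(r'(\b\w+\b)\W+\1')

# findall returns the captured word for every match on the line.
matches = adjacent_only.findall(line)
print(matches)  # ['house', 'casa']
```

Only repeats that directly follow each other are matched, because \W+ cannot skip over other words. That is why vivienda and the third casa are missed. The backreference also has no trailing \b, so a longer word that merely starts with the captured word would match as well.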
Answer:
FWIW, here is a rough example of how to do this in Python:
import re

def distinct_words(block, seen, delim):
    """Return the words of block not yet in seen, and record them in seen."""
    unique_words = []
    for word in re.split(delim, block.strip()):
        # skip empty strings and words that were seen earlier
        if word and word not in seen:
            seen.add(word)
            unique_words.append(word)
    return unique_words

def process_line(line):
    """Remove all duplicate words from a dictionary line."""
    # safeguard
    if '$' not in line:
        return line
    # split line at the first '$'
    original, translated = line.split('$', 1)
    # make original words distinct
    distinct_original = distinct_words(original, set(), r' +')
    # make translated words distinct, but keep block structure:
    # split the translated part at '|' into blocks,
    # split each block at ', ' into words
    seen = set()
    blocks = (
        distinct_words(block, seen, r', +')
        for block in re.split(r'\s*\|\s*', translated)
    )
    # drop blocks that became empty
    distinct_translated = [block for block in blocks if block]
    # put everything back together again
    part_original = ' '.join(distinct_original)
    part_translated = ' | '.join(', '.join(block) for block in distinct_translated)
    return part_original + '$' + part_translated

def process_dictionary(filename):
    """Process a dictionary text file, modifying the file in place."""
    # splitlines() drops the line endings, so they are not doubled on output
    with open(filename, 'r') as f:
        lines = f.read().splitlines()
    lines_out = [process_line(line) for line in lines]
    with open(filename, 'w') as f:
        f.write(''.join(line + '\n' for line in lines_out))
Obviously, you would call process_dictionary() like this:
process_dictionary('dict_en_es.txt')
But for the sake of example, assume you have a single line:
line = "house house$casa | casa, vivienda, hogar | edificio, casa | vivienda"
line_out = process_line(line)
print(line_out)
This prints the desired result:
house$casa | vivienda, hogar | edificio
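Since the question asks to find the repeated words and review them by hand before deleting anything, a report-only variant may be closer to that workflow. The following is a minimal sketch, not part of the code above: the helper name repeated_words is hypothetical, and it assumes that a word counts as repeated no matter which side of the $ it appears on.

```python
import re
from collections import Counter

def repeated_words(line):
    """Return the words that occur more than once on a line, in first-seen order."""
    # \w+ picks out the words and ignores '$', '|', ',' and spaces
    words = re.findall(r'\w+', line)
    counts = Counter(words)
    # Counter keeps insertion order (Python 3.7+), so the result follows the line
    return [word for word, n in counts.items() if n > 1]

line = "house house$casa | casa, vivienda, hogar | edificio, casa | vivienda"
print(repeated_words(line))  # ['house', 'casa', 'vivienda']
```

Nothing is modified here; the list is only meant as a starting point for a manual check.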
You won't be able to do this with regex alone. Get used to implementing it in a programming language. – Tomalak 2013-05-07 09:09:29
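Merely locating the repeated words, as opposed to removing them, may be within reach of a single pattern. The sketch below uses a lookahead with a backreference. It is checked here in Python only, so whether a given editor accepts the same pattern is an assumption. The name repeated_later is illustrative.

```python
import re

line = "house house$casa | casa, vivienda, hogar | edificio, casa | vivienda"

# A whole word that is followed, somewhere later on the same line,
# by the same whole word. '.' does not cross line breaks by default.
repeated_later = re.compile(r'\b(\w+)\b(?=.*\b\1\b)')

found = repeated_later.findall(line)
print(found)  # ['house', 'casa', 'casa', 'vivienda']
```

Every occurrence except the last one of each repeated word is matched, so casa shows up twice. list(dict.fromkeys(found)) would collapse that to one entry per word.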