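Overlapping regex matches
I'm trying to write a regex that returns every string between AUG and (UAG or UGA or UAA) in the following RNA string: AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG, so that all matches are found, including the ones that overlap.
I've tried several regexes and ended up with something like this: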
matches = re.findall('(?=AUG)(\w+)(?=UAG|UGA|UAA)',"AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG")
Can you tell me what's wrong with my regex?
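Doing this with a single regex is actually quite hard, because most use cases don't want overlapping matches. However, you can do it with some simple iteration: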
import re

regex = re.compile(r'(?=AUG)(\w+)(?=UAG|UGA|UAA)')
RNA = 'AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG'
matches = []
pos = 0
while True:
    match = regex.search(RNA, pos)
    if match is None:
        break
    matches.append(match)
    pos = match.start() + 1  # Resume just past this AUG so overlapping matches are found.
for m in matches:
    print(m.group(0))
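This has some problems, though. What would you expect to get back for AUGUAGUGAUAA? Should two strings be returned, or only one? As written, your regex can't even capture UAG on its own, because it keeps matching through UAGUGA and only stops at UAA. To get around that, you could use the ? operator to make the quantifier lazy, but that approach would then be unable to capture the longer substring.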
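To make the greedy versus lazy difference concrete, here is a small check on the AUGUAGUGAUAA example using the pattern from the question. The variable names are only for illustration. Note that because `(?=AUG)` is a lookahead, the captured group includes the AUG itself.

```python
import re

rna = 'AUGUAGUGAUAA'

# Greedy: \w+ runs to the end of the string, then backtracks to the
# last position that is followed by a stop codon.
greedy = re.search(r'(?=AUG)(\w+)(?=UAG|UGA|UAA)', rna).group(1)

# Lazy: \w+? stops at the first position that is followed by a stop codon.
lazy = re.search(r'(?=AUG)(\w+?)(?=UAG|UGA|UAA)', rna).group(1)

print(greedy)  # AUGUAGUGA
print(lazy)    # AUG
```

Either way a single search returns only one of the possible end points, which is why neither quantifier alone gives every match.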
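Perhaps iterating over the string twice is the answer, but what if your RNA sequence contains AUGAUGUAGUGAUAA? What is the correct behavior there?
I would probably favor a regex-free approach that iterates over the string and its substrings: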
RNA = 'AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG'

# Collect everything that follows each AUG start codon.
candidates = []
start = RNA.find('AUG')
while start > -1:
    candidates.append(RNA[start + 3:])
    start = RNA.find('AUG', start + 1)

# Cut each candidate at every stop codon found from index 1 onward.
matches = []
for candidate in candidates:
    for terminator in ['UAG', 'UGA', 'UAA']:
        end = candidate.find(terminator, 1)
        while end > -1:
            matches.append(candidate[:end])
            end = candidate.find(terminator, end + 1)

for match in matches:
    print(match)
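This way you are sure to get all the matches, no matter what.
If you need to keep track of where each match is located, you can change the candidates data structure to hold tuples that also record the start position: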
RNA = 'AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG'

# Store each candidate together with the index where it starts in RNA.
candidates = []
start = RNA.find('AUG')
while start > -1:
    candidates.append((RNA[start + 3:], start + 3))
    start = RNA.find('AUG', start + 1)

# Each match is (start index, end index, text between the codons).
matches = []
for text, offset in candidates:
    for terminator in ['UAG', 'UGA', 'UAA']:
        end = text.find(terminator, 1)
        while end > -1:
            matches.append((offset, offset + end, text[:end]))
            end = text.find(terminator, end + 1)

for match in matches:
    print("%d - %d: %s" % match)
This prints:
7 - 49: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAU
7 - 85: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
7 - 31: UAGCUAACUCAGGUUACAUGGGGA
7 - 72: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
7 - 76: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
7 - 11: UAGC
7 - 66: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
27 - 49: GGGAUGACCCCGCGACUUGGAU
27 - 85: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
27 - 31: GGGA
27 - 72: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
27 - 76: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
27 - 66: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
33 - 49: ACCCCGCGACUUGGAU
33 - 85: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
33 - 72: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
33 - 76: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
33 - 66: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
78 - 85: AUCCGAG
Heck, with three more lines you can even sort them by where they fall in the RNA sequence:
from operator import itemgetter
matches.sort(key=itemgetter(1))
matches.sort(key=itemgetter(0))
Placing that before the final print gives you:
007 - 011: UAGC
007 - 031: UAGCUAACUCAGGUUACAUGGGGA
007 - 049: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAU
007 - 066: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
007 - 072: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
007 - 076: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
007 - 085: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
027 - 031: GGGA
027 - 049: GGGAUGACCCCGCGACUUGGAU
027 - 066: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
027 - 072: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
027 - 076: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
027 - 085: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
033 - 049: ACCCCGCGACUUGGAU
033 - 066: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
033 - 072: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
033 - 076: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
033 - 085: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
078 - 085: AUCCGAG
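As a side note, the two stable sorts can be collapsed into one sort on a compound key. The sketch below uses a small hand-picked subset of the tuples above; the `%03d` format is an assumption inferred from the zero-padded output, since the modified print line is not shown.

```python
from operator import itemgetter

# A few (start, end, text) tuples taken from the output above, deliberately unsorted.
matches = [
    (78, 85, 'AUCCGAG'),
    (7, 31, 'UAGCUAACUCAGGUUACAUGGGGA'),
    (27, 31, 'GGGA'),
    (7, 11, 'UAGC'),
]

# Sort by start position first, then by end position, in a single pass.
ordered = sorted(matches, key=itemgetter(0, 1))

# Assumed format: zero-padded to three digits, as in the listing above.
lines = ["%03d - %03d: %s" % m for m in ordered]
for line in lines:
    print(line)
```

Sorting on `itemgetter(0, 1)` gives the same order as sorting by index 1 and then by index 0, because Python's sort is stable.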
Unfortunately, the re module doesn't support overlapping matches at the moment, but you can easily break the problem down, like this:
import re

s = 'AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG'
matches = []
for m in re.finditer('AUG', s):
    # Look for stop codons only after the end of this start codon.
    for n in re.finditer('UAG|UGA|UAA', s[m.end():]):
        matches.append(s[m.end():m.end() + n.start()])
print(matches)
Maybe I'm wrong, but I think you should do this: 'for n in re.finditer('(UAG)|(UGA)|(UAA)',n.group()):' – FrankieTheKneeMan 2013-04-03 23:43:32
@FrankieTheKneeMan: Not really, but I did make a mistake there. Thanks for letting me know. – 2013-04-03 23:45:35
The regex library allows overlapping matches through its "overlapped" flag. https://pypi.python.org/pypi/regex – ednincer 2015-06-30 06:49:49
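If you think in terms of 'intervals' rather than 'matches', I think you'll find this easier. That is what @ionut-hulub did. As shown below, you can do it in a single pass, but you should use the simpler finditer() approach unless you have so many RNA strings (or they are so long) that you need to avoid redundant passes over the string.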
s = 'AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG'

def intervals(s):
    starts = []  # Index of every AUG seen so far.
    for i in range(len(s) - 2):
        codon = s[i:i + 3]
        if codon == 'AUG':
            starts.append(i)
        elif codon in ('UAG', 'UGA', 'UAA'):
            # Pair this stop codon with every earlier start codon.
            for b in starts:
                yield (b, i)

for interval in intervals(s):
    print(interval)
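The intervals are pairs of indices (where AUG begins, where the stop codon begins), so turning them into substrings takes one more step. The following is a minimal sketch; the helper name `between` is hypothetical, and the filter that drops stop codons overlapping their own AUG (as in AUGA) is an assumption about the desired behavior.

```python
def intervals(s):
    """Yield (start_index, stop_index) for every AUG followed by a stop codon."""
    starts = []
    for i in range(len(s) - 2):
        codon = s[i:i + 3]
        if codon == 'AUG':
            starts.append(i)
        elif codon in ('UAG', 'UGA', 'UAA'):
            for b in starts:
                yield (b, i)

def between(s):
    """Return the text between each AUG and each later stop codon."""
    # Skip pairs where the stop codon begins inside the AUG itself.
    return [s[b + 3:i] for b, i in intervals(s) if i >= b + 3]

print(between('AUGUAGUGAUAA'))  # ['', 'UAG', 'UAGUGA']
```

The empty string appears because UAG immediately follows AUG in this input; filter it out if empty matches are not wanted.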
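I've answered a similar question before: it can't be done with Python regex, afaik. In Perl you can get all possible matches with some tricks. – Qtax 2013-04-03 22:29:58
There is a [new regex module for Python](https://pypi.python.org/pypi/regex) that allows overlapping matches. – ovgolovin 2013-04-03 22:34:31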
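For what it's worth, the standard re module can get partway there if the whole pattern is wrapped in a lookahead with a capturing group. This is only a sketch of that trick, not a full answer to the question: it reports one match per AUG (the shortest one here, because of the lazy quantifier), not every possible end point.

```python
import re

rna = 'AUGAUGUAGUGAUAA'

# The whole pattern sits inside a lookahead, so each match consumes no
# characters and finditer() goes on to try every later start position.
pattern = re.compile(r'(?=AUG(\w*?)(?:UAG|UGA|UAA))')

found = [(m.start(), m.group(1)) for m in pattern.finditer(rna)]
print(found)  # [(0, 'AUG'), (3, '')]
```

Both AUG positions are found even though their matches overlap, but the longer alternatives ending at UGA or UAA still need iteration like the loops shown earlier.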