Regex: matching between two patterns in a multiline string

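Question:

I have a multiline string, and I want a regular expression that grabs whatever lies between two patterns. For example, here I want to match everything between the title and the date: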

s ="""\n#here's a title\n\nhello world!!!\n\nPosted on 11-09-2014 02:32:30""" 
re.findall(r'#.+\n',s)[0][1:-1] # this grabs the title 
Out: "here's a title" 
re.findall(r'Posted on .+\n',s)[0][10:-1] #this grabs the date 
Out: "11-09-2014 02:32:30" 
re.findall(r'^[#\W+]',s) # try to grab everything after the title 
Out: ['\n'] # but it only grabs until the end of line

Answer

>>> s = '''\n#here's a title\n\nhello world!!!\n\nPosted on 11-09-2014 02:32:30''' 
>>> m1 = re.search(r'^#.+$', s, re.MULTILINE) 
>>> m2 = re.search(r'^Posted on ', s, re.MULTILINE) 
>>> m1.end() 
16 
>>> m2.start() 
34 
>>> s[m1.end():m2.start()] 
'\n\nhello world!!!\n\n'

Don't forget to check that m1 and m2 are not None.
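
A minimal sketch of that check, wrapping the two searches in a helper. The function name `text_between` is hypothetical (it is not part of the original answer), and returning None when a marker is missing is just one possible convention.

```python
import re

def text_between(s):
    """Return the text between the '#title' line and the 'Posted on' line, or None."""
    m1 = re.search(r'^#.+$', s, re.MULTILINE)        # the title line
    m2 = re.search(r'^Posted on ', s, re.MULTILINE)  # start of the date line
    if m1 is None or m2 is None:
        # One of the markers is missing, so there is nothing to slice.
        return None
    return s[m1.end():m2.start()]

s = "\n#here's a title\n\nhello world!!!\n\nPosted on 11-09-2014 02:32:30"
between = text_between(s)
missing = text_between("no markers here")
print(repr(between))  # '\n\nhello world!!!\n\n'
print(missing)        # None
```

Calling `.strip()` on the result would drop the surrounding newlines if they are not wanted.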

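The user is using 'findall', which may indicate that the same multiline string can contain multiple matches. In that case, using several separate regular expressions may cause problems. – rhlobo 2014-09-10 23:45:39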

Answer

>>> re.findall(r'\n([^#].*)Posted', s, re.S) 
['\nhello world!!!\n\n']

If you want to exclude the newlines:

>>> re.findall(r'^([^#\n].*?)\n+Posted', s, re.S | re.M) 
['hello world!!!']

Answer

You can match everything with a single regular expression.

>>> s = '''\n#here's a title\n\nhello world!!!\n\nPosted on 11-09-2014 02:32:30''' 
>>> re.search(r'#([^\n]+)\s+([^\n]+)\s+\D+([^\n]+)', s).groups() 
("here's a title", 'hello world!!!', '11-09-2014 02:32:30')

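Answer

You should use parentheses to capture groups:

# (.*?) is non-greedy, so the capture stops before the newlines that precede "Posted on"
result = re.search(r'#[^\n]+\n+(.*?)\n+Posted on .*', s, re.MULTILINE | re.DOTALL)
result.group(1)  # 'hello world!!!'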

Here I used search, but you can still use findall if the same string can contain multiple matches.

If you want to capture the title, the contents, and the date, you can use multiple groups:

result = re.search(r'#([^\n]+)\n+(.*?)\n+Posted on ([^\n]*)', s, re.MULTILINE | re.DOTALL)
result.group(1)  # The title
result.group(2)  # The contents
result.group(3)  # The date

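Capturing all three in the same regex is much better than using one regex per part, especially if your multiline string may contain multiple matches (where "synchronizing" the results of separate findall calls could easily produce wrong title/contents/date combinations).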
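
To illustrate the multiple-match case, here is a small sketch that extracts every post with one pattern and `re.finditer`. The second post in `text` is made up for the example, and the names `post_re` and `posts` are assumptions, not part of the original answer.

```python
import re

# Hypothetical input with two posts; only the first comes from the question.
text = (
    "\n#here's a title\n\nhello world!!!\n\nPosted on 11-09-2014 02:32:30\n"
    "\n#another title\n\nsecond body\nwith two lines\n\nPosted on 12-09-2014 10:00:00\n"
)

# The non-greedy (.*?) keeps each match inside its own post;
# re.DOTALL lets the contents span several lines.
post_re = re.compile(r'#([^\n]+)\n+(.*?)\n+Posted on ([^\n]*)', re.DOTALL)

# Each match yields one (title, contents, date) tuple, so the three
# parts can never get out of step with each other.
posts = [m.groups() for m in post_re.finditer(text)]
for title, contents, date in posts:
    print(title, '|', repr(contents), '|', date)
```

Because each tuple comes from a single match, there is no need to line up the results of three separate findall calls.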

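If you are going to use this expression a lot, consider compiling it once for performance:

# Compile once, then reuse the compiled pattern for every string
regex = re.compile(r'#([^\n]+)\n+(.*?)\n+(Posted on [^\n]*)', re.MULTILINE | re.DOTALL)
# ...
result = regex.search(s)
result = regex.search('another multiline string, ...')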

Answer

Use a group with a non-greedy match (.*?), and give the group a name so it is easier to look up.

>>> s = '\n#here\'s a title\n\nhello world!!!\n\nPosted on 11-09-2014 02:32:30' 
>>> pattern = r'\s*#[\w \']+\n+(?P<content>.*?)\n+Posted on' 
>>> a = re.match(pattern, s, re.M) 
>>> a.group('content') 
'hello world!!!'
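
As a possible variation on the same idea (not the original answerer's code), the sketch below names all three parts and adds `re.DOTALL` so that the contents may span several lines. The group names `title` and `date` and the two-line sample body are assumptions made for illustration.

```python
import re

# Hypothetical sample: same shape as the question's string, but with a two-line body.
s = "\n#here's a title\n\nhello\nworld!!!\n\nPosted on 11-09-2014 02:32:30"

pattern = re.compile(
    r'\s*#(?P<title>[^\n]+)\n+'     # title line after optional leading whitespace
    r'(?P<content>.*?)\n+'          # non-greedy body; may span lines thanks to DOTALL
    r'Posted on (?P<date>[^\n]+)',  # rest of the "Posted on" line
    re.DOTALL,
)

m = pattern.match(s)
# groupdict() maps each group name to its captured text.
fields = m.groupdict() if m else {}
print(fields)
```

Without `re.DOTALL`, `.` does not match a newline, so a body that spans more than one line would make the match fail.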
