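Python regex for repeated punctuation and symbols

Problem description:

I need a regular expression that matches repeated (more than one) punctuation marks and symbols: basically any repeated non-alphanumeric, non-whitespace character, such as ..., ???, !!!, ###, @@@, +++ and so on. It must be the same character repeated, so not a sequence like "!?@".

I tried [^\s\w]+. It covers all the !!!, ???, $$$ cases, but it gives me more than I want, because it also matches sequences like "!@".

Can anyone enlighten me? Thanks.

S/O is for helping with code problems, not for writing your code for you. Look at the documentation for 're' and make an attempt. –

Answer

Try this pattern: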

([.\?#@+,<>%~`!$^&\(\):;])\1+

\1 refers to the first matched group, which is the contents of the parentheses.

You will need to extend the punctuation list as required.
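
As a rough sketch of how this pattern might be used from Python (the sample string and variable names are made up for illustration): because the pattern contains a capturing group, re.findall() would return only the captured single character, so this sketch uses re.finditer() and takes group(0) to get each whole run.

```python
import re

# Same pattern as above: one listed punctuation character,
# followed by one or more repeats of that same character.
pattern = re.compile(r'([.\?#@+,<>%~`!$^&\(\):;])\1+')

sample = "Wait... what??? No!? ok ## done"  # hypothetical input

# group(0) is the whole match; group(1) would be just the single character.
runs = [m.group(0) for m in pattern.finditer(sample)]
print(runs)  # ['...', '???', '##']
```

The mixed sequence "!?" is skipped because \1 must repeat exactly the character that the group captured.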

Not supported in Python, AFAIK. – nhahtdh

That is, '\p{P}' and '\p{S}'. The backreference part is. –

@nhahtdh Updated the answer. –
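
Since the comments above say the built-in re module does not support '\p{P}' and '\p{S}', here is one possible workaround that does not use a regular expression at all. It is a different technique from the answer: it groups consecutive identical characters with itertools.groupby() and checks each character's Unicode general category with unicodedata.category(). The function names and the sample string are made up for illustration.

```python
import itertools
import unicodedata

def is_punct_or_symbol(ch):
    # Unicode general categories start with 'P' for punctuation
    # and 'S' for symbols (e.g. 'Po', 'Sc', 'Sm').
    return unicodedata.category(ch)[0] in ('P', 'S')

def repeated_symbol_runs(text):
    runs = []
    # groupby() yields each maximal run of consecutive identical characters.
    for ch, group in itertools.groupby(text):
        run = ''.join(group)
        if len(run) > 1 and is_punct_or_symbol(ch):
            runs.append(run)
    return runs

sample = "a!!b??c!?d ¿¿ $$"  # hypothetical input
print(repeated_symbol_runs(sample))  # ['!!', '??', '¿¿', '$$']
```

Unlike a fixed character list, this also picks up non-ASCII punctuation such as '¿' without having to list it by hand.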

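Answer

Edit: @Firoze Lafeer posted an answer that does everything with a single regular expression. I will leave this up in case anyone is interested in combining a regular expression with a filter function, but for this problem, using Firoze Lafeer's answer is simpler and faster.

The answer I wrote before seeing Firoze Lafeer's answer is below, unchanged.

A simple regular expression cannot do this. The classic pithy summary is "regular expressions can't count". This is discussed here: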

How to check that a string is a palindrome using regular expressions?

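For a Python solution, I suggest combining a regular expression with a little bit of Python code. The regular expression throws out everything that is not a run of some kind of punctuation, and then the Python code checks for and throws out the false matches (matches that are punctuation but not all the same character).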

import re 
import string 

# Character class to match punctuation. Several punctuation characters
# (such as '-', ']', '^' and '\') are special inside a character class;
# re.escape() puts a backslash in front of them so they are literal.
_char_class_punct = "[" + re.escape(string.punctuation) + "]" 

# Pattern: a punctuation character followed by one or more punctuation characters. 
# Thus, a run of two or more punctuation characters. 
_pat_punct_run = re.compile(_char_class_punct + _char_class_punct + '+') 

def all_same(seq, basis_case=True):
    itr = iter(seq)
    try:
        first = next(itr)
    except StopIteration:
        return basis_case
    return all(x == first for x in itr)

def find_all_punct_runs(text): 
    return [s for s in _pat_punct_run.findall(text) if all_same(s, False)] 


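# alternate version of find_all_punct_runs() using re.finditer();
# it returns a generator, and if both versions are defined in the same
# module this one replaces the one above
def find_all_punct_runs(text):
    matches = (m.group(0) for m in _pat_punct_run.finditer(text))
    return (s for s in matches if all_same(s, False))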

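I wrote all_same() the way I did so that it works just as well on an iterator as on a string. Python's built-in all() returns True for an empty sequence, which is not what we want for this particular use of all_same(), so I added a parameter for the desired basis case and made it default to True to match the behavior of all().

This does as much of the work as possible inside Python's internals (the regex engine or all()), so it should be quite fast. For large input text, you may want to rewrite find_all_punct_runs() to use re.finditer() instead of re.findall(). I have given an example. That example also returns a generator expression rather than a list. You can always force it to make a list: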

lst = list(find_all_punct_runs(text))
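
To make the two steps of this approach concrete, here is a compact sketch on a made-up sample string. It is not the code above: it uses a {2,} quantifier instead of repeating the character class, and len(set(s)) == 1 as a stand-in for all_same(s, False). The names punct_run, raw and same are hypothetical.

```python
import re
import string

# Step 1: the regex finds every run of two or more punctuation characters,
# whether or not the characters in the run are all the same.
punct_run = re.compile("[" + re.escape(string.punctuation) + "]{2,}")

sample = "Really?!  Yes!!! cost: $$$ (see #1) a--b"  # hypothetical input

raw = punct_run.findall(sample)
print(raw)   # ['?!', '!!!', '$$$', '--']

# Step 2: plain Python filters out the runs that mix different characters.
same = [s for s in raw if len(set(s)) == 1]
print(same)  # ['!!!', '$$$', '--']
```

One consequence of filtering after matching is that an input such as "!!!??" is matched as a single mixed run and dropped entirely, whereas a backreference pattern would report "!!!" and "??" separately.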

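'-' and '[' (not sure about Python) and ']' are special inside a character class, and so is '^' at the start. – nhahtdh

Try using 're.escape(string.punctuation)' instead. That will work. (Confirming that it is correct: 'all(re.match('[%s]' % re.escape(string.punctuation), letter) for letter in string.punctuation) == True'.) –

@ChrisMorgan: Wow, that's great. It's obvious what it is doing, and I don't need to worry about whether I got it right. – steveha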
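
The check from the comment above can be written out as a small runnable script. This is only a sketch of that sanity check; the name char_class and the second assertion (that letters, digits and whitespace are not matched) are additions for illustration.

```python
import re
import string

# Character class built from the escaped punctuation string.
char_class = re.compile("[%s]" % re.escape(string.punctuation))

# Every ASCII punctuation character should match the class...
assert all(char_class.match(ch) for ch in string.punctuation)

# ...and letters, digits and whitespace should not.
others = string.ascii_letters + string.digits + string.whitespace
assert not any(char_class.match(ch) for ch in others)

print("character class OK")
```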

Answer

This is how I would do it:

>>> import re
>>> st='non-whitespace characters such as ..., ???, !!!, ###, @@@, +++ and'
>>> reg=r'(([.?#@+])\2{2,})'
>>> print([m.group(0) for m in re.finditer(reg,st)])

or

>>> print([g for g,l in re.findall(reg, st)])

Either one prints:

['...', '???', '###', '@@@', '+++']
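
A short sketch of why the second form unpacks two values and why the backreference is \2 rather than \1: groups are numbered by their opening parenthesis, so group 1 is the whole run and group 2 is the single character, and re.findall() returns one tuple per match when a pattern has more than one group. The shortened sample string here is made up.

```python
import re

reg = r'(([.?#@+])\2{2,})'
st = 'such as ..., ???, !!!'  # hypothetical, shortened input

# With two groups, findall() returns (group 1, group 2) tuples:
# the whole run and the character it is made of.
pairs = re.findall(reg, st)
print(pairs)  # [('...', '.'), ('???', '?')]

# \2{2,} asks for at least two MORE copies, so a run needs three or more
# characters; a two-character run such as '..' is not matched.
print(re.findall(reg, 'a..b'))  # []
```

'!!!' is missing from the output because '!' is not in this answer's character class.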

Answer

I think you are looking for something like this:

[run for run, leadchar in re.findall(r'(([^\w\s])\2+)', yourstring)]

Example:

In : teststr = "4spaces then(*(@^#$&&&&(2((((99999****" 

In : [run for run, leadchar in re.findall(r'(([^\w\s])\2+)',teststr)] 
Out: ['&&&&', '((((', '****']

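This gives you a list of the runs, excluding the 4 spaces in the string as well as sequences like '*(@^'.

If this is not exactly what you want, you could edit your question with a sample string and the exact output you want to see.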
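
One edge case worth knowing with [^\w\s]: the underscore counts as a word character in \w, so runs of underscores are not reported. If they should be, one possible variation (an assumption about what might be wanted, not part of the answer above) is to add '_' as an alternative inside the group. The sample string and variable names are made up.

```python
import re

text = "a___b !!! c__d ###"  # hypothetical input

# The answer's character class: '_' is part of \w, so '___' is skipped.
without_underscore = [m.group(0) for m in re.finditer(r'([^\w\s])\1+', text)]
print(without_underscore)  # ['!!!', '###']

# Variation: explicitly allow '_' as well.
with_underscore = [m.group(0) for m in re.finditer(r'([^\w\s]|_)\1+', text)]
print(with_underscore)  # ['___', '!!!', '__', '###']
```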
