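Extracting key-value pairs from a string with quotes

Problem description:

I'm having trouble writing an 'elegant' parser for this requirement (one that doesn't look like a piece of C breakfast). The input is a string of key-value pairs separated by ',', with each key joined to its value by '='.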

key1=value1,key2=value2

The part that trips me up is that values can be quoted ("), and inside the quotes a ',' does not end the key-value pair.

key1=value1,key2="value2,still_value2"

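This last part makes it tricky to use split or re.split, and I ended up resorting to 'for i in range' loops :(.

Can anyone demonstrate a clean way to do this?

It is OK to assume that quotes appear only in values, and that there is no whitespace or non-alphanumeric characters.

Could you please post the expected output? –

Does the value of 'key2' in the second example include the quotes? That is, in your example, does 'key2' map to '"value2,still_value2"' or to '"\"value2,still_value2\""'? – EvilTak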

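Answer

I would advise against using regular expressions for this task, because the language you want to parse is not regular.

You have a string containing multiple key-value pairs. The best way to parse it is not to match patterns against it, but to tokenize it properly.

The Python standard library has a module called shlex that mimics the parsing done by POSIX shells and provides a lexer implementation that can easily be customized to your needs.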

from shlex import shlex 

def parse_kv_pairs(text, item_sep=",", value_sep="="): 
    """Parse key-value pairs from a shell-like text.""" 
    # initialize a lexer, in POSIX mode (to properly handle escaping) 
    lexer = shlex(text, posix=True) 
    # set ',' as whitespace for the lexer 
    # (the lexer will use this character to separate words) 
    lexer.whitespace = item_sep 
    # include '=' as a word character 
    # (this is done so that the lexer returns a list of key-value pairs) 
    # (if your option key or value contains any unquoted special character, you will need to add it here) 
    lexer.wordchars += value_sep 
    # then we separate option keys and values to build the resulting dictionary 
    # (maxsplit is required to make sure that '=' in value will not be a problem) 
    return dict(word.split(value_sep, maxsplit=1) for word in lexer)

Example run:

parse_kv_pairs(
    'key1=value1,key2=\'value2,still_value2,not_key1="not_value1"\'' 
)

Output:

{'key1': 'value1', 'key2': 'value2,still_value2,not_key1="not_value1"'}

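EDIT: I forgot to add that the reason I usually stick with shlex rather than regular expressions (which are faster in this case) is that it gives you fewer surprises, especially if you later need to allow more possible inputs. I never found a way to properly parse such key-value pairs with regular expressions; there will always be inputs (e.g. A="B=\"1,2,3\"") that trick the engine.

If you do not care about such inputs (or, put another way, if you can ensure that your input follows the definition of a regular language), regular expressions are perfectly fine.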
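As a rough check of the claim above, the following sketch repeats the parse_kv_pairs function from this answer and feeds it the tricky input mentioned in the edit, plus the input from the question. The variable names tricky, result and simple are only illustrative.

```python
from shlex import shlex


def parse_kv_pairs(text, item_sep=",", value_sep="="):
    """Same approach as the answer above: tokenize with shlex, then split."""
    lexer = shlex(text, posix=True)   # POSIX mode handles quotes and escapes
    lexer.whitespace = item_sep       # ',' separates tokens
    lexer.wordchars += value_sep      # keep 'key=value' together as one token
    return dict(word.split(value_sep, maxsplit=1) for word in lexer)


# The tricky input from the edit: escaped quotes, '=' and ',' inside a value.
tricky = r'A="B=\"1,2,3\""'
result = parse_kv_pairs(tricky)
print(result)

# The input from the question: the surrounding quotes are consumed by the lexer.
simple = parse_kv_pairs('key1=value1,key2="value2,still_value2"')
print(simple)
```

In POSIX mode the lexer removes the outer quotes and the backslash escapes, so the values come back unquoted. Note that this sketch inherits the question's assumption that keys and unquoted values contain no whitespace or other special characters.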

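EDIT2: split has a maxsplit argument, which is much cleaner than splitting/slicing/joining. Thanks to @cdlane for his sound input!

I believe 'shlex' is a solid production solution, and this is a fine example of how to shape it to the problem at hand. However, this answer loses all its elegance for me in the return statement: split() the same data twice and then 'join()' to clean up after an excessive split(), just so you can use a dictionary comprehension? How about 'return dict(word.split(value_sep, maxsplit=1) for word in lexer)' – cdlane

Yes, that's much better. I forgot about the 'maxsplit' argument when writing this, and it did become less elegant when I added support for '=' in values. Thanks for your suggestion, I have edited the answer. – pistache

Answer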

I'm not sure that it doesn't look like a piece of C breakfast, or that it's all that elegant :)

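data = {}
original = 'key1=value1,key2="value2,still_value2"'
converted = ''

# Replace every comma outside quotes with a newline, which is assumed
# not to occur in the input.
is_open = False
for c in original:
    if c == ',' and not is_open:
        c = '\n'
    elif c in ('"', "'"):
        is_open = not is_open
    converted += c

# Split on the first '=' only, so a value may itself contain '='.
for item in converted.split('\n'):
    k, v = item.split('=', 1)
    data[k] = v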

Answer

Using some regex magic from "Split a string, respect and preserve quotes", we can do this:

import re 

string = 'key1=value1,key2="value2,still_value2"' 

key_value_pairs = re.findall(r'(?:[^\s,"]|"(?:\\.|[^"])*")+', string) 

for key_value_pair in key_value_pairs: 
    key, value = key_value_pair.split("=")

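Per BioGeek, here is my attempt to guess, I mean interpret, the regex Janne Karila used. The pattern breaks the string on commas but respects double-quoted sections (which may contain commas) in the process. It has two separate alternatives: runs of characters that don't involve quotes, and double-quoted runs of characters, where a double quote ends the run unless it is escaped with a backslash: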

(?:                 # parenthesis for alternation (|), not memory
  [^\s,"]           # any 1 character except white space, comma or quote
  |                 # or
  "(?:\\.|[^"])*"   # a quoted string containing 0 or more characters
                    # other than quotes (unless escaped)
)+                  # one or more of the above
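The annotated pattern can also be run directly. The sketch below compiles the same regular expression with re.VERBOSE so the comments live next to the pattern, then builds a dictionary from the matches. The names PAIR_RE, tokens and pairs are illustrative, and splitting on the first '=' only is an added assumption so that a value may contain '='.

```python
import re

# Same pattern as above, written with re.VERBOSE so it can carry comments.
PAIR_RE = re.compile(r'''
    (?:                  # non-capturing group for the alternation
        [^\s,"]          # one character that is not whitespace, comma or quote
      |                  # or
        "(?:\\.|[^"])*"  # a double-quoted run; an escaped quote does not end it
    )+                   # one or more of the above
''', re.VERBOSE)

text = 'key1=value1,key2="value2,still_value2"'

# Each match is one whole 'key=value' token; the quoted comma stays inside.
tokens = PAIR_RE.findall(text)
print(tokens)

# Split each token on the first '=' only (assumption: keys contain no '=').
pairs = dict(token.split("=", 1) for token in tokens)
print(pairs)
```

Unlike the shlex approach, this keeps the double quotes as part of the value, so they have to be stripped separately if they are not wanted.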

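You could add some explanation of how the regex works. – BioGeek

@BioGeek, I tried to do as you asked, let me know if I succeeded! – cdlane

cdlane, thanks for the explanation! – BioGeek

Answer

I came up with this regex solution: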

import re 
match = re.findall(r'([^=]+)=(("[^"]+")|([^,]+)),?', 'key1=value1,key2=value2,key3="value3,stillvalue3",key4=value4')

which gives the following value for 'match':

[('key1', 'value1', '', 'value1'), ('key2', 'value2', '', 'value2'), ('key3', '"value3,stillvalue3"', '"value3,stillvalue3"', ''), ('key4', 'value4', '', 'value4')]

Then you can use a for loop to get the keys and values:

for m in match: 
    key = m[0] 
    value = m[1]
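If a dictionary is the goal, the loop can be folded into a comprehension. This is only a sketch built on the regex from this answer: the names text and pairs are illustrative, and stripping the double quotes with strip('"') assumes that unquoted values never begin or end with a quote character.

```python
import re

text = 'key1=value1,key2=value2,key3="value3,stillvalue3",key4=value4'
match = re.findall(r'([^=]+)=(("[^"]+")|([^,]+)),?', text)

# Group 1 is the key and group 2 the value; groups 3 and 4 only record
# which alternative matched, so they are ignored here.
pairs = {key: value.strip('"') for key, value, _quoted, _bare in match}
print(pairs)
```

Here the value of key3 comes out as value3,stillvalue3 without the surrounding quotes.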

Answer

Based on several of the other answers, I came up with the following solution:

import re 
import itertools 

data = 'key1=value1,key2="value2,still_value2"' 

# Based on Alan Moore's answer on http://*.com/questions/2785755/how-to-split-but-ignore-separators-in-quoted-strings-in-python 
def split_on_non_quoted_equals(string): 
    return re.split('''=(?=(?:[^'"]|'[^']*'|"[^"]*")*$)''', string) 
def split_on_non_quoted_comma(string): 
    return re.split(''',(?=(?:[^'"]|'[^']*'|"[^"]*")*$)''', string) 

split1 = split_on_non_quoted_equals(data) 
split2 = map(lambda x: split_on_non_quoted_comma(x), split1) 

# 'Unpack' the sublists in to a single list. Based on Alex Martelli's answer on http://*.com/questions/952914/making-a-flat-list-out-of-list-of-lists-in-python 
flattened = [item for sublist in split2 for item in sublist] 

# Convert alternating elements of a list into keys and values of a dictionary. Based on Sven Marnach's answer on http://*.com/questions/6900955/python-convert-list-to-dictionary 
# Python 3 spelling; on Python 2 this function is named itertools.izip_longest
d = dict(itertools.zip_longest(*[iter(flattened)] * 2, fillvalue=""))

The resulting d is the following dictionary:

{'key1': 'value1', 'key2': '"value2,still_value2"'}
