Regex to reformat incorrect JSON data

Problem description:

I have some data that was not saved correctly in an old database. I am moving the system to a new database and reformatting the old data. The old data looks like this:

a:10:{ 
    s:7:"step_no";s:1:"1"; 
    s:9:"YOUR_NAME";s:14:"Firtname Lastname"; 
    s:11:"CITIZENSHIP"; s:7:"Indian"; 
    s:22:"PROPOSE_NAME_BUSINESS1"; s:12:"ABC Limited"; 
    s:22:"PROPOSE_NAME_BUSINESS2"; s:15:"XYZ Investment"; 
    s:22:"PROPOSE_NAME_BUSINESS3";s:0:""; 
    s:22:"PROPOSE_NAME_BUSINESS4";s:0:""; 
    s:23:"PURPOSE_NATURE_BUSINESS";s:15:"Some dummy content"; 
    s:15:"CAPITAL_COMPANY";s:24:"20 Million Capital"; 
    s:14:"ANOTHER_AMOUNT";s:0:""; 
} 

I want the new form to be proper JSON, so that I can read it in Python, like this:

data = { 
    "step_no": "1", 
    "YOUR_NAME":"Firtname Lastname", 
    "CITIZENSHIP":"Indian", 
    "PROPOSE_NAME_BUSINESS1":"ABC Limited", 
    "PROPOSE_NAME_BUSINESS2":"XYZ Investment", 
    "PROPOSE_NAME_BUSINESS3":"", 
    "PROPOSE_NAME_BUSINESS4":"", 
    "PURPOSE_NATURE_BUSINESS":"Some dummy content", 
    "CAPITAL_COMPANY":"20 Million Capital", 
    "ANOTHER_AMOUNT":"" 
} 

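I thought that using a regex to strip out the unneeded parts and reformat the content around the names in caps would work, but I don't know how to go about it.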

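Regex would be the wrong approach here. It isn't needed, and the format is a little more complex than you assume.

Your data is in the PHP serialize format. You can trivially deserialise it in Python with the phpserialize library: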

import phpserialize 
import json 

def fixup_php_arrays(o): 
    if isinstance(o, dict): 
        if isinstance(next(iter(o), None), int): 
            # PHP has no lists, only mappings; produce a list for 
            # a dictionary with integer keys to 'repair' 
            return [fixup_php_arrays(o[i]) for i in range(len(o))] 
        # Any other mapping: keep the keys, recurse into the values 
        return {k: fixup_php_arrays(v) for k, v in o.items()} 
    return o 

# yourdata is the serialized PHP value (a bytes object, as in the demo) 
json.dumps(fixup_php_arrays(phpserialize.loads(yourdata, decode_strings=True))) 

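Note that PHP strings are byte strings, not Unicode text, so in Python 3 especially you would have to decode your key-value pairs after the fact if you want to be able to re-encode them as JSON. The decode_strings=True flag takes care of this for you. The default is UTF-8; pass in an encoding argument to pick a different codec.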
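To see why the decoding matters, here is a minimal sketch that uses only the standard library. The `raw` dictionary is a hand-written stand-in for undecoded deserialiser output (an assumption for illustration, not actual phpserialize output), and `decode_bytes` is a hypothetical helper that does by hand roughly what the flag does for you:

```python
import json

def decode_bytes(o, encoding="utf-8"):
    """Recursively decode bytes keys and values to str (hypothetical helper)."""
    if isinstance(o, bytes):
        return o.decode(encoding)
    if isinstance(o, dict):
        return {decode_bytes(k, encoding): decode_bytes(v, encoding)
                for k, v in o.items()}
    return o

# Hand-written stand-in for undecoded output: bytes keys and bytes values.
raw = {b"step_no": b"1", b"CITIZENSHIP": b"Indian"}

try:
    json.dumps(raw)
    encodable = True
except TypeError:
    # The json module refuses bytes keys and values in Python 3.
    encodable = False

decoded = decode_bytes(raw)
as_json = json.dumps(decoded, sort_keys=True)
```

Here `encodable` ends up False for the raw dictionary, while the decoded copy serialises without complaint.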

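PHP also uses arrays for sequences, so you may first have to convert any decoded dict object with integer keys to a list, which is what the fixup_php_arrays() function does.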
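As a small illustration of that conversion, the sketch below runs the same function on a hand-written nested dictionary. The `nested` value and its "names" key are made up for the example; they stand in for what a deserialised PHP array of business names might look like:

```python
def fixup_php_arrays(o):
    if isinstance(o, dict):
        if isinstance(next(iter(o), None), int):
            # Integer keys: treat the mapping as a PHP list
            return [fixup_php_arrays(o[i]) for i in range(len(o))]
        # Any other mapping: keep the keys, recurse into the values
        return {k: fixup_php_arrays(v) for k, v in o.items()}
    return o

# Hypothetical decoded structure: a string-keyed mapping holding a PHP "list"
nested = {"names": {0: "ABC Limited", 1: "XYZ Investment"}, "step_no": "1"}
fixed = fixup_php_arrays(nested)
# fixed == {"names": ["ABC Limited", "XYZ Investment"], "step_no": "1"}
```

The function only inspects the first key and then indexes 0 through len(o) - 1, so a mapping whose integer keys are not contiguous from zero would raise a KeyError rather than become a list.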

Demo (with repaired data; many of the string lengths were off, and whitespace had been added):

>>> import phpserialize, json 
>>> from pprint import pprint 
>>> data = b'a:10:{s:7:"step_no";s:1:"1";s:9:"YOUR_NAME";s:18:"Firstname Lastname";s:11:"CITIZENSHIP";s:6:"Indian";s:22:"PROPOSE_NAME_BUSINESS1";s:11:"ABC Limited";s:22:"PROPOSE_NAME_BUSINESS2";s:14:"XYZ Investment";s:22:"PROPOSE_NAME_BUSINESS3";s:0:"";s:22:"PROPOSE_NAME_BUSINESS4";s:0:"";s:23:"PURPOSE_NATURE_BUSINESS";s:18:"Some dummy content";s:15:"CAPITAL_COMPANY";s:18:"20 Million Capital";s:14:"ANOTHER_AMOUNT";s:0:"";}' 
>>> pprint(phpserialize.loads(data, decode_strings=True)) 
{'ANOTHER_AMOUNT': '', 
 'CAPITAL_COMPANY': '20 Million Capital', 
 'CITIZENSHIP': 'Indian', 
 'PROPOSE_NAME_BUSINESS1': 'ABC Limited', 
 'PROPOSE_NAME_BUSINESS2': 'XYZ Investment', 
 'PROPOSE_NAME_BUSINESS3': '', 
 'PROPOSE_NAME_BUSINESS4': '', 
 'PURPOSE_NATURE_BUSINESS': 'Some dummy content', 
 'YOUR_NAME': 'Firstname Lastname', 
 'step_no': '1'} 
>>> print(json.dumps(phpserialize.loads(data, decode_strings=True), sort_keys=True, indent=4)) 
{ 
    "ANOTHER_AMOUNT": "", 
    "CAPITAL_COMPANY": "20 Million Capital", 
    "CITIZENSHIP": "Indian", 
    "PROPOSE_NAME_BUSINESS1": "ABC Limited", 
    "PROPOSE_NAME_BUSINESS2": "XYZ Investment", 
    "PROPOSE_NAME_BUSINESS3": "", 
    "PROPOSE_NAME_BUSINESS4": "", 
    "PURPOSE_NATURE_BUSINESS": "Some dummy content", 
    "YOUR_NAME": "Firstname Lastname", 
    "step_no": "1" 
}
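
The demo data had to be repaired first, because each s:N:"..." token must carry the exact byte length of its string. The answer does not say how that repair was done; the following is only one possible sketch for pre-cleaning such input before handing it to phpserialize. `repair_lengths` is a hypothetical helper, and it assumes that no string value contains the sequence `";`, which would confuse the pattern:

```python
import re

def repair_lengths(broken, encoding="utf-8"):
    """Recompute s:N:"..." length prefixes and drop whitespace between tokens.

    Hypothetical helper; assumes no string value contains '";'.
    """
    def fix(match):
        value = match.group(1)
        # PHP counts bytes, not characters, so measure the encoded value
        return 's:%d:"%s";' % (len(value.encode(encoding)), value)

    # Rewrite every string token and swallow the whitespace around it
    repaired = re.sub(r'\s*s:\d+:"(.*?)";\s*', fix, broken, flags=re.DOTALL)
    return repaired.strip()

# Shortened sample in the style of the question's data (lengths partly wrong)
broken = (
    'a:2:{ \n'
    '    s:9:"YOUR_NAME";s:14:"Firstname Lastname"; \n'
    '    s:11:"CITIZENSHIP"; s:7:"Indian"; \n'
    '} '
)
repaired = repair_lengths(broken)
```

The result can then be encoded to bytes and passed to phpserialize.loads() as in the demo. The regex here only patches the damaged length prefixes; the actual parsing is still left to the library.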