How to parse a BIG JSON file in Python
Question:
I am working with a very large dataset and have hit a problem I cannot find an answer to anywhere. I am trying to parse JSON data; here is what I did on a chunk of the full dataset, and it works:
import json
s = set()
with open("data.raw", "r") as f:
    for line in f:
        d = json.loads(line)
The confusing part is that when I apply the same code to my main data (about 200 GB in size), it shows the following error (and it is not an out-of-memory error):
d = json.loads(line)
  File "C:\Users\Sathyanarayanan\AppData\Local\Programs\Python\Python35-32\lib\json\__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "C:\Users\Sathyanarayanan\AppData\Local\Programs\Python\Python35-32\lib\json\decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Users\Sathyanarayanan\AppData\Local\Programs\Python\Python35-32\lib\json\decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 1)
type(f) is TextIOWrapper, if that helps... but f has that same type for the small dataset as well...
Here are a few lines showing the format of my data:
{"MessageType": "SALES.CONTRACTS.SALESTATUSCHANGED", "Event": {"Id": {"Source": 1, "SourceId": "ZGA=-3-1-002-0801743-2330650"}, "RefInfo": {"TenantId": {"Id": "ZGA="}, "UserId": {"Id": "ZMKj"}, "SentUtc": "2013-01-14T20:17:57.9681547", "Source": 1}, "OldStatus": {"Status": 3, "AutoRemoveInfo": null}, "NewStatus": {"Status": 4, "AutoRemoveInfo": null}, "Items": {"Items": [{"Id": {"Id": 1193}, "Sku": {"Sku": "Con BM20"}, "Quantity": 1, "UnitPrice": {"amount": 11.92, "currency": 840}}], "FulfilledItems": []}, "ShippingInfo": {"Carrier": "", "Class": "", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "SaleDate": "2013-01-13T13:39:57", "PendingItems": null, "Kits": null, "Products": null, "OldSaleDate": "0001-01-01T00:00:00", "AdditionalSaleInfo": null}}
{"MessageType": "SALES.CONTRACTS.SALESHIPPINGINFOCHANGED", "Event": {"Id": {"Source": 1, "SourceId": "ZGA=-3-1-002-0801743-2330650"}, "RefInfo": {"TenantId": {"Id": "ZGA="}, "UserId": {"Id": "ZMKj"}, "SentUtc": "2013-01-14T20:17:57.9681547", "Source": 1}, "Status": {"Status": 4, "AutoRemoveInfo": null}, "Items": {"Items": [{"Id": {"Id": 1193}, "Sku": {"Sku": "Con BM20"}, "Quantity": 1, "UnitPrice": {"amount": 11.92, "currency": 840}}], "FulfilledItems": []}, "OldShippingInfo": {"Carrier": "", "Class": "", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "NewShippingInfo": {"Carrier": "USPS", "Class": "FIRST/RECTPARCEL", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "SaleDate": "0001-01-01T00:00:00", "PendingItems": null, "Kits": null, "Products": null, "OldSaleDate": "0001-01-01T00:00:00", "AdditionalSaleInfo": null}}
{"MessageType": "SALES.CONTRACTS.SALECREATED", "Event": {"Id": {"Source": 1, "SourceId": "ZGA=-3-1-002-4851828-6514632"}, "RefInfo": {"TenantId": {"Id": "ZGA="}, "UserId": {"Id": "ZMKj"}, "SentUtc": "2013-01-14T20:17:58.1402505", "Source": 1}, "Status": {"Status": 4, "AutoRemoveInfo": null}, "Items": {"Items": [{"Id": {"Id": 9223372036854775807}, "Sku": {"Sku": "NFL Blanket Seahawks"}, "Quantity": 1, "UnitPrice": {"amount": 22.99, "currency": 840}}], "FulfilledItems": []}, "ShippingInfo": {"Carrier": "USPS", "Class": "FIRST/RECTPARCEL", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "SaleDate": "2013-01-13T15:51:12", "Kits": null, "Products": null, "AdditionalSaleInfo": null}}
{"MessageType": "SALES.CONTRACTS.SALECREATED", "Event": {"Id": {"Source": 1, "SourceId": "ZGA=-3-1-102-3824485-2270645"}, "RefInfo": {"TenantId": {"Id": "ZGA="}, "UserId": {"Id": "ZMKj"}, "SentUtc": "2013-01-14T20:17:58.3436109", "Source": 1}, "Status": {"Status": 4, "AutoRemoveInfo": null}, "Items": {"Items": [{"Id": {"Id": 9223372036854775807}, "Sku": {"Sku": "NFL CD Wallet Chargers"}, "Quantity": 1, "UnitPrice": {"amount": 12.99, "currency": 840}}], "FulfilledItems": []}, "ShippingInfo": {"Carrier": "USPS", "Class": "FIRST/RECTPARCEL", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "SaleDate": "2013-01-12T02:49:58", "Kits": null, "Products": null, "AdditionalSaleInfo": null}}
I have already parsed the first 2000 lines of this JSON and it works perfectly. But when I try the same procedure on the large file, it shows an error on the very first line of the data.
Answer
Here is some simple code to find out which data is not valid JSON, and where it is:
import json

with open("data.raw", "r") as f:
    for i, line in enumerate(f):
        try:
            d = json.loads(line)
        except json.decoder.JSONDecodeError:
            print('Error on line', i + 1, ':\n', repr(line))
Answer
A good solution for reading a big JSON dataset in Python is to use a generator (with yield), because 200 GB is far too large for your memory. If your JSON parser stores the whole file in memory, process the file step by step with an iterator instead, so memory use stays low.
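A minimal sketch of that idea, assuming the file is in JSON Lines format (one object per line, as the sample above suggests). The helper name `iter_json_lines` is hypothetical, not from the original thread:

```python
import json

def iter_json_lines(filename):
    """Lazily yield one parsed object per line; memory use stays
    proportional to the longest line, not the file size."""
    with open(filename, "r") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:  # skip blank lines instead of crashing on them
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError as e:
                raise ValueError(f"Bad JSON on line {lineno}: {e}") from None
```

Because it is a generator, `for obj in iter_json_lines("data.raw"):` starts producing objects immediately without ever holding the whole file in memory.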
You can use an iterative JSON parser with a Pythonic interface: http://pypi.python.org/pypi/ijson/.
But here your file has a .raw extension, so it is not a JSON file.
To read those:
import numpy as np
content = np.fromfile("data.raw", dtype=np.int16, sep="")
But this solution can crash for big files.
If in fact the .raw file looks like a .csv file, then you can create your reader like this:
import csv

def read_big_file(filename):
    # In Python 3, csv.reader needs a text-mode file opened with newline=""
    with open(filename, "r", newline="") as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            yield row
Or like this for a text file:
def read_big_file(filename):
    with open(filename, "r") as _file:
        for line in _file:
            yield line
Use "rb" only if your file is binary.
Usage:

for line in read_big_file(filename):
    <treatment>
    <free memory after a chunk of a given size>
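As a concrete illustration of the loop above (assuming the data is JSON Lines, one object per line, as in the question's sample), one could collect the set of MessageType values. The sample file and its contents here are made up for the demonstration:

```python
import json

def read_big_file(filename):
    """Yield one line at a time without loading the whole file."""
    with open(filename, "r") as _file:
        for line in _file:
            yield line

# Build a small sample file so the example is self-contained.
sample = ('{"MessageType": "SALES.CONTRACTS.SALECREATED", "Event": {}}\n'
          '{"MessageType": "SALES.CONTRACTS.SALESTATUSCHANGED", "Event": {}}\n')
with open("sample.jsonl", "w") as f:
    f.write(sample)

message_types = set()
for line in read_big_file("sample.jsonl"):
    d = json.loads(line)
    message_types.add(d["MessageType"])

print(sorted(message_types))
```

Because each parsed object is discarded after use, only the accumulated set grows, not the memory held by the file contents.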
I can make my answer more precise if you post the first lines of your file.
What should be done with that JSON data? – RomanPerekhrest
Is 'data.raw' a single JSON file, or a file with a JSON object on each line? If the former, use ['json.load'](https://docs.python.org/3.5/library/json.html#json.load) – Will
Your file is not valid JSON. However, it appears to contain valid JSON text on each line. My suggestion would be to fix whatever produced this "JSON" (which is not actually JSON). Short of that, I suppose you could deserialize the objects line by line and accumulate them into a list or something. –
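A sketch of that last suggestion: deserialize line by line, accumulate the valid objects, and record any lines that are not valid JSON for later inspection. The helper name `load_json_lines` is hypothetical, not from the thread:

```python
import json

def load_json_lines(filename):
    """Collect valid objects; record (line number, text) for bad lines."""
    objects, bad_lines = [], []
    with open(filename, "r") as f:
        for i, line in enumerate(f, start=1):
            try:
                objects.append(json.loads(line))
            except json.JSONDecodeError:
                bad_lines.append((i, repr(line)))
    return objects, bad_lines
```

Note that accumulating everything into a list defeats the memory savings for a 200 GB file; this shape is only reasonable once the data has been filtered down to something that fits in RAM.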