Skipping 0xff bytes when using pandas read_csv

Problem description:

I want to read some log files from my boiler, but they are not well formatted.

When I try to use

import pandas 

print(pandas.read_csv('./data/CM120102.CSV', delimiter=';'))

to read the file(s), I get

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 49: invalid start byte

For some reason, the CSV header ends with a null byte.
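A quick way to check which bytes are actually in the file is to open it in binary mode and list the offsets of anything suspicious. The sketch below runs on a made-up sample (`sample` and `find_suspect_bytes` are hypothetical names, and the real file's contents are not reproduced here); it only illustrates why UTF-8 decoding fails on 0xff while a NUL byte alone would not raise this error.

```python
# Hypothetical sample standing in for the real log file: a short header
# followed by two stray 0xff bytes and one NUL byte.
sample = b"a;b;c\xff\xff\x00\n1;2;3\n"


def find_suspect_bytes(raw):
    """Return (offset, value) pairs for NUL bytes and bytes that are never valid in UTF-8."""
    # 0xC0, 0xC1 and 0xF5..0xFF can never occur in well-formed UTF-8.
    # 0x00 is valid UTF-8 but unusual inside a CSV file.
    suspect = {0x00, 0xC0, 0xC1} | set(range(0xF5, 0x100))
    return [(offset, value) for offset, value in enumerate(raw) if value in suspect]


suspects = find_suspect_bytes(sample)
print(suspects)

# Decoding as UTF-8 fails at the first 0xff byte, matching the error message.
try:
    sample.decode("utf-8")
    error_offset = None
except UnicodeDecodeError as exc:
    error_offset = exc.start
print(error_offset)
```

For a real file, the same function could be applied to the result of `open(path, "rb").read()`; the offset it reports should line up with the position named in the `UnicodeDecodeError`.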

https://gist.github.com/Ession/6e5bf67392276048c7bd

http://mathiasjost.com/CM120102.CSV <== this one should work (or rather, not work)

Is there a way to read these files with pandas without fixing them first?

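I cannot reproduce your error with Python 2.7 and pandas 0.16. The file reads fine and prints fine for me.

Did you copy the text, or did you click on the raw file and download it? If I copy the text from the website, I do not get the null byte or the error either. But when I download the file, I get the error.

I downloaded the raw file and opened it in pandas. The NULL bytes may have been lost somewhere. That said, it would be best to fix the files separately and then use pandas if the error persists.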

Answer

I would read the file into a string, then clean it up in Python before passing it to pandas.read_csv. Example code follows.

import re
from io import StringIO  # on Python 2 use: from StringIO import StringIO

import pandas as pd

# Get the data as a Python string. latin-1 maps every byte to a character,
# so the stray 0xff bytes cannot raise a UnicodeDecodeError on Python 3.
with open("CM120102.CSV", "r", encoding="latin-1") as myfile:
    data = myfile.read()

# Munge in Python: keep only the whitelisted characters. This removes the
# garbage (lots of 0xff bytes), but also spaces, commas and minus signs.
data = re.sub(r'[^a-zA-Z0-9_\.;:\n]', '', data)
data = data + '\n'  # the very last newline is missing?
data = re.sub(r';\n', r'\n', data)  # a trailing ; separator on a line is problematic

# Now load the cleaned text into a pandas DataFrame
df = pd.read_csv(StringIO(data), index_col=None, header=0,
                 skipinitialspace=True, sep=';', parse_dates=True)
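The whitelist regex above also discards characters such as spaces and minus signs. If only specific junk bytes need to go, a narrower variant is to delete them at the byte level and hand the result to pandas through an in-memory buffer. This is a minimal sketch under assumptions: `read_csv_skipping_bytes` is a hypothetical helper, the sample data is made up, and the set of bytes to drop (0xff and NUL) is a guess based on the error message and the question.

```python
import io

import pandas as pd


def read_csv_skipping_bytes(raw, junk=b"\xff\x00", **kwargs):
    """Delete every byte listed in `junk` from `raw`, then parse the rest with pandas."""
    # bytes.translate(None, delete) removes the listed bytes and leaves all others untouched.
    cleaned = raw.translate(None, junk)
    return pd.read_csv(io.BytesIO(cleaned), **kwargs)


# Made-up sample: a header polluted with 0xff bytes, then two data rows.
sample = b"time;temp;state\xff\xff\n12:00;55.5;on\n12:01;-3.25;off\n"
df = read_csv_skipping_bytes(sample, sep=";")
print(df)
```

For a file on disk, the bytes would come from `open("CM120102.CSV", "rb").read()`. pandas 1.3 and later also accept `encoding_errors="ignore"` in `read_csv`, which drops undecodable bytes during decoding without a separate cleaning step.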

After changing 'from StringIO import StringIO' to 'from io import StringIO' for Python 3, this works perfectly! Thanks.
