Notes compiled from parts of the pandas Cookbook
pandas Cookbook (http://pandas.pydata.org/pandas-docs/stable/tutorials.html)
1 Reading your data into pandas is pretty much the easiest thing. Even when the encoding is wrong!
1.1 Reading a CSV file
fixed_df = pd.read_csv('../data/bikes.csv', sep=';', encoding='latin1', parse_dates=['Date'], dayfirst=True, index_col='Date')
1.2 Selecting a column (dictionary-style: the column name goes inside the square brackets)
fixed_df['Berri 1']
1.3 Plotting a column
fixed_df['Berri 1'].plot()        # one column
fixed_df.plot(figsize=(15, 10))   # all columns
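The read pattern above can be sketched with a small inline CSV standing in for bikes.csv (the data below is made up; only the read_csv arguments match the notes):

```python
import io
import pandas as pd

# A synthetic stand-in for bikes.csv: ';'-separated, day-first dates.
csv_text = "Date;Berri 1\n01/01/2012;35\n02/01/2012;83\n03/01/2012;135\n"

df = pd.read_csv(io.StringIO(csv_text), sep=';',
                 parse_dates=['Date'], dayfirst=True, index_col='Date')

# dayfirst=True means 02/01/2012 parses as January 2nd, not February 1st.
print(df.index[1])            # 2012-01-02 00:00:00
print(df['Berri 1'].tolist())
```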
2 It’s not totally obvious how to select data from a pandas dataframe. Here we explain the basics (how to take slices and get columns)
2.1 Selecting a column or rows
complaints = pd.read_csv('../data/311-service-requests.csv')
complaints['Complaint Type']       # column name inside the brackets
complaints[:5]                     # first 5 rows of all columns
complaints['Complaint Type'][:5]   # first 5 rows of one column; same as complaints[:5]['Complaint Type']
2.2 Selecting multiple columns
complaints[['Complaint Type', 'Borough']]
complaints[['Complaint Type', 'Borough']][:10]   # first 10 rows
2.3 Counting how often each value appears in a column
complaint_counts = complaints['Complaint Type'].value_counts()
complaint_counts[:10].plot(kind='bar')   # bar chart of the 10 most common types
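A minimal sketch of column selection and value_counts, using a tiny made-up complaints table instead of the 311 CSV:

```python
import pandas as pd

# Invented rows; the column names mirror the 311 dataset used in the notes.
complaints = pd.DataFrame({
    'Complaint Type': ['Noise', 'Heating', 'Noise', 'Noise', 'Plumbing'],
    'Borough': ['QUEENS', 'BRONX', 'BROOKLYN', 'BROOKLYN', 'BRONX'],
})

one_col = complaints['Complaint Type']                 # a Series
two_cols = complaints[['Complaint Type', 'Borough']]   # a DataFrame
counts = complaints['Complaint Type'].value_counts()   # sorted, most common first

print(counts.index[0], counts.iloc[0])   # Noise 3
```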
3 Here we get into serious slicing and dicing and learn how to filter dataframes in complicated ways, really fast.
3.1 Filtering rows by a column value
is_noise = complaints['Complaint Type'] == "Noise - Street/Sidewalk"   # a boolean Series
in_brooklyn = complaints['Borough'] == "BROOKLYN"
complaints[is_noise & in_brooklyn][:5]   # first 5 rows where both conditions are True
3.2 NumPy and Series
pd.Series([1,2,3]).values   # array([1, 2, 3])
arr = np.array([1,2,3])
arr != 2                    # array([ True, False,  True], dtype=bool)
arr[arr != 2]               # array([1, 3]); keeps only the True positions
3.3 Boolean conditions
is_noise = complaints['Complaint Type'] == "Noise - Street/Sidewalk"   # boolean Series
noise_complaints = complaints[is_noise]   # the matching rows
noise_complaints['Borough'].value_counts()
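The boolean-mask workflow can be tried end to end on a small invented frame (the rows below are illustrative only):

```python
import pandas as pd

complaints = pd.DataFrame({
    'Complaint Type': ['Noise - Street/Sidewalk', 'Heating',
                       'Noise - Street/Sidewalk', 'Plumbing'],
    'Borough': ['BROOKLYN', 'BRONX', 'QUEENS', 'BROOKLYN'],
})

is_noise = complaints['Complaint Type'] == 'Noise - Street/Sidewalk'
in_brooklyn = complaints['Borough'] == 'BROOKLYN'

both = complaints[is_noise & in_brooklyn]   # rows where both masks are True
print(len(both))                            # 1
print(complaints[is_noise]['Borough'].value_counts().to_dict())
```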
4 Groupby/aggregate is seriously my favorite thing about pandas and I use it all the time. You should probably read this.
4.1 Extracting day-of-month and weekday from dates; adding a new column to a DataFrame
# berri_bikes.index is a DatetimeIndex [2012-01-01, ..., 2012-11-05]
berri_bikes.index.day       # array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, ..., 4, 5], dtype=int32)
berri_bikes.index.weekday   # array([6, 0, 1, 2, 3, 4, 5, 6, 0, 1, ..., 6, 0], dtype=int32); 0 = Monday
berri_bikes.loc[:, 'weekday'] = berri_bikes.index.weekday   # add a 'weekday' column
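Section 4's blurb is about groupby/aggregate, so this sketch adds the weekday column and then groups on it; berri_bikes here is a hypothetical two-week frame, not the real bike counts:

```python
import pandas as pd

# Hypothetical daily counts indexed by date (a stand-in for berri_bikes).
idx = pd.date_range('2012-01-01', periods=14, freq='D')
berri_bikes = pd.DataFrame({'Berri 1': range(14)}, index=idx)

berri_bikes.loc[:, 'weekday'] = berri_bikes.index.weekday   # 0 = Monday
weekday_counts = berri_bikes.groupby('weekday')['Berri 1'].sum()

print(berri_bikes['weekday'].tolist()[:7])   # [6, 0, 1, 2, 3, 4, 5] (Jan 1, 2012 was a Sunday)
print(weekday_counts.loc[0])                 # 9: Mondays fall on days with values 1 and 8
```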
5 Web scraping with pandas is fun! Here we combine dataframes.
5.1 Reading data from a URL (.read_csv)
url_template = "http://climate.weather.gc.ca/climateData/bulkdata_e.html?format=csv&stationID=5415&Year={year}&Month={month}&timeframe=1"   # needs {year}/{month} placeholders for .format to work
url = url_template.format(month=3, year=2012)     # fetch only March 2012
url = url_template.format(year=year, month=month) # or fill from loop variables to fetch a whole year
weather_mar2012 = pd.read_csv(url, skiprows=15, index_col='Date/Time', parse_dates=True, encoding='latin1', header=0)   # header takes a row number, not True
5.2 Dropping columns or rows that contain NaN (dropna)
weather_mar2012 = weather_mar2012.dropna(axis=1, how='any')   # axis=1: columns, axis=0: rows; how='any': drop if any value is null
5.3 Dropping whole rows or columns
weather_mar2012 = weather_mar2012.drop(['Year', 'Month', 'Day', 'Time', 'Data Quality'], axis=1)   # the list holds the labels to drop; axis=1: columns, axis=0: rows
5.4 Extracting the hour from a timestamp
temperatures = weather_mar2012[[u'Temp (C)']].copy()
temperatures.loc[:,'Hour'] = weather_mar2012.index.hour   # the index holds timestamps like 2012-03-01 01:00:00
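The dropna/drop/hour steps chain together like this; the weather frame below is a two-row stand-in for the real March 2012 download:

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(['2012-03-01 01:00:00', '2012-03-01 02:00:00'])
weather = pd.DataFrame({'Temp (C)': [-5.5, -5.7],
                        'Data Quality': ['', ''],
                        'Wind Chill': [np.nan, np.nan]}, index=idx)

weather = weather.dropna(axis=1, how='any')       # drops 'Wind Chill' (it has NaN values)
weather = weather.drop(['Data Quality'], axis=1)  # drop a column explicitly by name

temperatures = weather[['Temp (C)']].copy()
temperatures.loc[:, 'Hour'] = weather.index.hour
print(temperatures['Hour'].tolist())   # [1, 2]
```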
6 Strings with pandas are great. It has all these vectorized string operations and they’re the best. We will turn a bunch of strings into vectors of numbers in a trice.
6.1 Monthly statistics with .resample
weather_2012['Temp (C)'].resample('M').median()   # monthly median temperature (older pandas spelled this resample('M', how=np.median))
is_snowing.astype(float).resample('M').mean()     # fraction of each month with snow; astype(float) maps True to 1, False to 0
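A self-contained resample sketch using a synthetic 60-day temperature series (the 10-degree cutoff for is_cold is arbitrary, chosen so the monthly fraction is easy to check):

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2012-01-01', periods=60, freq='D')
temp = pd.Series(np.arange(60.0), index=idx)   # fake temperatures 0.0 ... 59.0

monthly_median = temp.resample('M').median()   # one value per calendar month
is_cold = temp < 10                            # boolean Series
cold_fraction = is_cold.astype(float).resample('M').mean()

print(monthly_median.iloc[0])   # 15.0 (median of 0..30 for January)
print(cold_fraction.iloc[0])    # 10 of January's 31 days are below 10
```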
7 Cleaning up messy data is never a joy, but with pandas it’s easier.
7.1 Checking whether the data is clean
requests['Incident Zip'].unique()   # unique() lists every distinct value, making bad entries easy to spot
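A quick illustration of why unique() surfaces messy values; the zip codes below are invented:

```python
import pandas as pd

# Mixed formats and missing values, typical of a messy real-world column.
requests = pd.DataFrame({'Incident Zip': ['11201', '11201.0', 'NO CLUE', None, '10010']})

u = requests['Incident Zip'].unique()
print(u)   # one glance shows the float-style and junk entries that need cleaning
```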
8 Parsing Unix timestamps is confusing at first but it turns out to be really easy
8.1 Converting integer timestamps with to_datetime()
popcon['atime'] = popcon['atime'].astype(int)                 # first cast to int, e.g. 1387295796
popcon['atime'] = pd.to_datetime(popcon['atime'], unit='s')   # 1387295796 becomes 2013-12-17 15:56:36
8.2 Using the '~' operator
nonlibraries = popcon[~popcon['package-name'].str.contains('lib')]   # rows where package-name does NOT contain 'lib'
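Both section-8 tricks in one runnable sketch, with a hypothetical four-row popcon frame:

```python
import pandas as pd

# Invented package names; atime values are Unix timestamps stored as strings.
popcon = pd.DataFrame({'package-name': ['libc6', 'vim', 'libssl', 'git'],
                       'atime': ['1387295796', '1387295797', '1387295798', '1387295799']})

popcon['atime'] = popcon['atime'].astype(int)                 # string -> int
popcon['atime'] = pd.to_datetime(popcon['atime'], unit='s')   # int seconds -> Timestamp
print(popcon['atime'].iloc[0])   # 2013-12-17 15:56:36

nonlibraries = popcon[~popcon['package-name'].str.contains('lib')]
print(nonlibraries['package-name'].tolist())   # ['vim', 'git']
```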
9 Reading data from SQL databases. (SQLite: import sqlite3; MySQL: import MySQLdb)
9.1 Reading data from a SQL database
con = sqlite3.connect("../data/weather_2012.sqlite")   # connect to a SQLite database
con = MySQLdb.connect(host="localhost", db="test")     # or connect to a MySQL database
df = pd.read_sql("SELECT * from weather_2012 LIMIT 3", con, index_col='id')   # index_col makes a database column ('id' here) the index; omit it and pandas builds a fresh default index
9.2 Writing data to a SQL database
weather_df = pd.read_csv('../data/weather_2012.csv')
con = sqlite3.connect("../data/test_db.sqlite")
con.execute("DROP TABLE IF EXISTS weather_2012")
weather_df.to_sql("weather_2012", con)
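The write-then-read round trip can be exercised against an in-memory SQLite database (the rows are made up; ':memory:' stands in for the .sqlite file path):

```python
import sqlite3

import pandas as pd

weather_df = pd.DataFrame({'Temp (C)': [-1.8, -1.8, -1.8],
                           'Weather': ['Fog', 'Fog', 'Freezing Drizzle,Fog']})

con = sqlite3.connect(':memory:')             # in-memory DB instead of a file
con.execute("DROP TABLE IF EXISTS weather_2012")
weather_df.to_sql("weather_2012", con)        # also writes the index as an 'index' column

df = pd.read_sql("SELECT * from weather_2012 LIMIT 3", con, index_col='index')
print(df['Weather'].tolist())
con.close()
```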