Notes compiled from parts of the pandas Cookbook
pandas Cookbook (http://pandas.pydata.org/pandas-docs/stable/tutorials.html)
1 Reading your data into pandas is pretty much the easiest thing. Even when the encoding is wrong!
1.1 Reading a CSV file
fixed_df = pd.read_csv('../data/bikes.csv', sep=';', encoding='latin1', parse_dates=['Date'], dayfirst=True, index_col='Date')
1.2 Selecting a column (dictionary-style: the column name goes inside the square brackets)
fixed_df['Berri 1']
1.3 Plotting a column
fixed_df['Berri 1'].plot()        # one column
fixed_df.plot(figsize=(15, 10))   # all columns
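The read pattern above can be sketched with a small inline CSV standing in for bikes.csv (the data below is made up; only the read_csv arguments match the notes):

```python
import io
import pandas as pd

# A synthetic stand-in for bikes.csv: ';'-separated, day-first dates.
csv_text = "Date;Berri 1\n01/01/2012;35\n02/01/2012;83\n03/01/2012;135\n"

df = pd.read_csv(io.StringIO(csv_text), sep=';',
                 parse_dates=['Date'], dayfirst=True, index_col='Date')

# dayfirst=True means 02/01/2012 parses as January 2nd, not February 1st.
print(df.index[1])            # 2012-01-02 00:00:00
print(df['Berri 1'].tolist())
```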
2 It’s not totally obvious how to select data from a pandas dataframe. Here we explain the basics (how to take slices and get columns)
2.1 Selecting a column or rows
complaints = pd.read_csv('../data/311-service-requests.csv')
complaints['Complaint Type']       # column name inside the brackets
complaints[:5]                     # first 5 rows of all columns
complaints['Complaint Type'][:5]   # first 5 rows of one column; same as complaints[:5]['Complaint Type']
2.2 Selecting multiple columns
complaints[['Complaint Type', 'Borough']]
complaints[['Complaint Type', 'Borough']][:10]   # first 10 rows
2.3 Counting how often each value appears in a column
complaint_counts = complaints['Complaint Type'].value_counts()
complaint_counts[:10].plot(kind='bar')   # bar chart of the 10 most common types
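A minimal sketch of column selection and value_counts, using a tiny made-up complaints table instead of the 311 CSV:

```python
import pandas as pd

# Invented rows; the column names mirror the 311 dataset used in the notes.
complaints = pd.DataFrame({
    'Complaint Type': ['Noise', 'Heating', 'Noise', 'Noise', 'Plumbing'],
    'Borough': ['QUEENS', 'BRONX', 'BROOKLYN', 'BROOKLYN', 'BRONX'],
})

one_col = complaints['Complaint Type']                 # a Series
two_cols = complaints[['Complaint Type', 'Borough']]   # a DataFrame
counts = complaints['Complaint Type'].value_counts()   # sorted, most common first

print(counts.index[0], counts.iloc[0])   # Noise 3
```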
3 Here we get into serious slicing and dicing and learn how to filter dataframes in complicated ways, really fast.
3.1 Filtering rows by a column value
is_noise = complaints['Complaint Type'] == "Noise - Street/Sidewalk"   # a boolean Series
in_brooklyn = complaints['Borough'] == "BROOKLYN"
complaints[is_noise & in_brooklyn][:5]   # first 5 rows where both conditions are True
3.2 NumPy and Series
pd.Series([1,2,3]).values   # array([1, 2, 3])
arr = np.array([1,2,3])
arr != 2                    # array([ True, False,  True], dtype=bool)
arr[arr != 2]               # array([1, 3]); keeps only the True positions
3.3 Boolean conditions
is_noise = complaints['Complaint Type'] == "Noise - Street/Sidewalk"   # boolean Series
noise_complaints = complaints[is_noise]   # the matching rows
noise_complaints['Borough'].value_counts()
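The boolean-mask workflow can be tried end to end on a small invented frame (the rows below are illustrative only):

```python
import pandas as pd

complaints = pd.DataFrame({
    'Complaint Type': ['Noise - Street/Sidewalk', 'Heating',
                       'Noise - Street/Sidewalk', 'Plumbing'],
    'Borough': ['BROOKLYN', 'BRONX', 'QUEENS', 'BROOKLYN'],
})

is_noise = complaints['Complaint Type'] == 'Noise - Street/Sidewalk'
in_brooklyn = complaints['Borough'] == 'BROOKLYN'

both = complaints[is_noise & in_brooklyn]   # rows where both masks are True
print(len(both))                            # 1
print(complaints[is_noise]['Borough'].value_counts().to_dict())
```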
4 Groupby/aggregate is seriously my favorite thing about pandas and I use it all the time. You should probably read this.
4.1 Extracting day-of-month and weekday from dates; adding a new column to a DataFrame
# berri_bikes.index is a DatetimeIndex [2012-01-01, ..., 2012-11-05]
berri_bikes.index.day       # array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, ..., 4, 5], dtype=int32)
berri_bikes.index.weekday   # array([6, 0, 1, 2, 3, 4, 5, 6, 0, 1, ..., 6, 0], dtype=int32); 0 = Monday
berri_bikes.loc[:, 'weekday'] = berri_bikes.index.weekday   # add a 'weekday' column
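Section 4's blurb is about groupby/aggregate, so this sketch adds the weekday column and then groups on it; berri_bikes here is a hypothetical two-week frame, not the real bike counts:

```python
import pandas as pd

# Hypothetical daily counts indexed by date (a stand-in for berri_bikes).
idx = pd.date_range('2012-01-01', periods=14, freq='D')
berri_bikes = pd.DataFrame({'Berri 1': range(14)}, index=idx)

berri_bikes.loc[:, 'weekday'] = berri_bikes.index.weekday   # 0 = Monday
weekday_counts = berri_bikes.groupby('weekday')['Berri 1'].sum()

print(berri_bikes['weekday'].tolist()[:7])   # [6, 0, 1, 2, 3, 4, 5] (Jan 1, 2012 was a Sunday)
print(weekday_counts.loc[0])                 # 9: Mondays fall on days with values 1 and 8
```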
5 Web scraping with pandas is fun! Here we combine dataframes.
5.1 Reading data from a URL (.read_csv)
url_template = "http://climate.weather.gc.ca/climateData/bulkdata_e.html?format=csv&stationID=5415&Year={year}&Month={month}&timeframe=1"   # needs {year}/{month} placeholders for .format to work
url = url_template.format(month=3, year=2012)     # fetch only March 2012
url = url_template.format(year=year, month=month) # or fill from loop variables to fetch a whole year
weather_mar2012 = pd.read_csv(url, skiprows=15, index_col='Date/Time', parse_dates=True, encoding='latin1', header=0)   # header takes a row number, not True
5.2 Dropping columns or rows that contain NaN (dropna)
weather_mar2012 = weather_mar2012.dropna(axis=1, how='any')   # axis=1: columns, axis=0: rows; how='any': drop if any value is null
5.3 Dropping whole rows or columns
weather_mar2012 = weather_mar2012.drop(['Year', 'Month', 'Day', 'Time', 'Data Quality'], axis=1)   # the list holds the labels to drop; axis=1: columns, axis=0: rows
5.4 Extracting the hour from a timestamp
temperatures = weather_mar2012[[u'Temp (C)']].copy()
temperatures.loc[:,'Hour'] = weather_mar2012.index.hour   # the index holds timestamps like 2012-03-01 01:00:00
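The dropna/drop/hour steps chain together like this; the weather frame below is a two-row stand-in for the real March 2012 download:

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(['2012-03-01 01:00:00', '2012-03-01 02:00:00'])
weather = pd.DataFrame({'Temp (C)': [-5.5, -5.7],
                        'Data Quality': ['', ''],
                        'Wind Chill': [np.nan, np.nan]}, index=idx)

weather = weather.dropna(axis=1, how='any')       # drops 'Wind Chill' (it has NaN values)
weather = weather.drop(['Data Quality'], axis=1)  # drop a column explicitly by name

temperatures = weather[['Temp (C)']].copy()
temperatures.loc[:, 'Hour'] = weather.index.hour
print(temperatures['Hour'].tolist())   # [1, 2]
```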
6 Strings with pandas are great. It has all these vectorized string operations and they’re the best. We will turn a bunch of strings into vectors of numbers in a trice.
6.1 Monthly statistics with .resample
weather_2012['Temp (C)'].resample('M').median()   # monthly median temperature (older pandas spelled this resample('M', how=np.median))
is_snowing.astype(float).resample('M').mean()     # fraction of each month with snow; astype(float) maps True to 1, False to 0
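A self-contained resample sketch using a synthetic 60-day temperature series (the 10-degree cutoff for is_cold is arbitrary, chosen so the monthly fraction is easy to check):

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2012-01-01', periods=60, freq='D')
temp = pd.Series(np.arange(60.0), index=idx)   # fake temperatures 0.0 ... 59.0

monthly_median = temp.resample('M').median()   # one value per calendar month
is_cold = temp < 10                            # boolean Series
cold_fraction = is_cold.astype(float).resample('M').mean()

print(monthly_median.iloc[0])   # 15.0 (median of 0..30 for January)
print(cold_fraction.iloc[0])    # 10 of January's 31 days are below 10
```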
7 Cleaning up messy data is never a joy, but with pandas it’s easier.
7.1 Checking whether the data is clean
requests['Incident Zip'].unique()   # unique() lists every distinct value, making bad entries easy to spot
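A quick illustration of why unique() surfaces messy values; the zip codes below are invented:

```python
import pandas as pd

# Mixed formats and missing values, typical of a messy real-world column.
requests = pd.DataFrame({'Incident Zip': ['11201', '11201.0', 'NO CLUE', None, '10010']})

u = requests['Incident Zip'].unique()
print(u)   # one glance shows the float-style and junk entries that need cleaning
```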
8 Parsing Unix timestamps is confusing at first but it turns out to be really easy
8.1 Converting integer timestamps with to_datetime()
popcon['atime'] = popcon['atime'].astype(int)                 # first cast to int, e.g. 1387295796
popcon['atime'] = pd.to_datetime(popcon['atime'], unit='s')   # 1387295796 becomes 2013-12-17 15:56:36
8.2 Using the '~' operator
nonlibraries = popcon[~popcon['package-name'].str.contains('lib')]   # rows where package-name does NOT contain 'lib'
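Both section-8 tricks in one runnable sketch, with a hypothetical four-row popcon frame:

```python
import pandas as pd

# Invented package names; atime values are Unix timestamps stored as strings.
popcon = pd.DataFrame({'package-name': ['libc6', 'vim', 'libssl', 'git'],
                       'atime': ['1387295796', '1387295797', '1387295798', '1387295799']})

popcon['atime'] = popcon['atime'].astype(int)                 # string -> int
popcon['atime'] = pd.to_datetime(popcon['atime'], unit='s')   # int seconds -> Timestamp
print(popcon['atime'].iloc[0])   # 2013-12-17 15:56:36

nonlibraries = popcon[~popcon['package-name'].str.contains('lib')]
print(nonlibraries['package-name'].tolist())   # ['vim', 'git']
```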
9 Reading data from SQL databases. (SQLite: import sqlite3; MySQL: import MySQLdb)
9.1 Reading data from a SQL database
con = sqlite3.connect("../data/weather_2012.sqlite")   # connect to a SQLite database
con = MySQLdb.connect(host="localhost", db="test")     # or connect to a MySQL database
df = pd.read_sql("SELECT * from weather_2012 LIMIT 3", con, index_col='id')   # index_col makes a database column ('id' here) the index; omit it and pandas builds a fresh default index
9.2 Writing data to a SQL database
weather_df = pd.read_csv('../data/weather_2012.csv')
con = sqlite3.connect("../data/test_db.sqlite")
con.execute("DROP TABLE IF EXISTS weather_2012")
weather_df.to_sql("weather_2012", con)
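The write-then-read round trip can be exercised against an in-memory SQLite database (the rows are made up; ':memory:' stands in for the .sqlite file path):

```python
import sqlite3

import pandas as pd

weather_df = pd.DataFrame({'Temp (C)': [-1.8, -1.8, -1.8],
                           'Weather': ['Fog', 'Fog', 'Freezing Drizzle,Fog']})

con = sqlite3.connect(':memory:')             # in-memory DB instead of a file
con.execute("DROP TABLE IF EXISTS weather_2012")
weather_df.to_sql("weather_2012", con)        # also writes the index as an 'index' column

df = pd.read_sql("SELECT * from weather_2012 LIMIT 3", con, index_col='index')
print(df['Weather'].tolist())
con.close()
```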