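A Brief Introduction to NumPy
Note: most of this post is notes taken while reading the book Python for Data Analysis, with some explanations of my own added, kept here for later review.
The most important object in NumPy is the ndarray, a multidimensional array object.
Creating an ndarray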
Method | Description |
---|---|
array | Convert input data (list, tuple, array, or other sequence type) to an ndarray either by inferring a dtype or explicitly specifying a dtype. Copies the input data by default. |
asarray | Convert input to ndarray, but do not copy if the input is already an ndarray |
arange | Like the built-in range but returns an ndarray instead of a list. |
ones, ones_like | Produce an array of all 1’s with the given shape and dtype. ones_like takes another array and produces a ones array of the same shape and dtype. |
zeros, zeros_like | Like ones and ones_like but producing arrays of 0’s instead |
empty, empty_like | Create new arrays by allocating new memory, but do not populate with any values like ones and zeros |
eye, identity | Create a square N x N identity matrix (1’s on the diagonal and 0’s elsewhere) |
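As a small illustration of the table above, the following sketch contrasts array (which copies by default) with asarray (which does not copy when the input is already an ndarray). The variable names are made up for this example.

```python
import numpy as np

base = np.arange(5)              # array([0, 1, 2, 3, 4])

copied = np.array(base)          # np.array copies its input by default
aliased = np.asarray(base)       # np.asarray hands back the same ndarray

base[0] = 99
print(copied[0])                 # 0: the copy is unaffected
print(aliased[0])                # 99: same underlying array
print(aliased is base)           # True

ones = np.ones_like(base, dtype=np.float64)   # same shape as base, dtype overridden
ident = np.eye(3)                             # 3 x 3 identity matrix
print(ones.shape, ones.dtype)    # (5,) float64
print(ident)
```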
Every array has:
shape -- a tuple indicating the size of each dimension
dtype -- an object describing the data type of the array
np.ones() and np.zeros() can both be replaced by np.full(), e.g. a = np.full((3,3),0)
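One detail worth checking when swapping np.zeros for np.full: np.full takes its dtype from the fill value unless a dtype is given, while np.zeros defaults to float64. A minimal sketch (variable names are illustrative):

```python
import numpy as np

z = np.zeros((3, 3))          # float64 by default
f = np.full((3, 3), 0)        # dtype inferred from the fill value: an integer type
g = np.full((3, 3), 0.0)      # a float fill value gives float64

print(z.dtype, f.dtype, g.dtype)
print(np.array_equal(z, f))   # True: same values, different dtypes
```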
np.random.random() # returns a random float in the half-open interval [0.0, 1.0)
NumPy data types
Type | Type code | Description |
---|---|---|
int8, uint8 | i1, u1 | Signed and unsigned 8-bit (1 byte) integer types |
int16, uint16 | i2, u2 | Signed and unsigned 16-bit integer types |
int32, uint32 | i4, u4 | Signed and unsigned 32-bit integer types |
int64, uint64 | i8, u8 | Signed and unsigned 64-bit integer types |
float16 | f2 | Half-precision floating point |
float32 | f4 or f | Standard single-precision floating point. Compatible with C float |
float64 | f8 or d | Standard double-precision floating point. Compatible with C double and Python float object |
float128 | f16 or g | Extended-precision floating point |
complex64, complex128, complex256 | c8, c16, c32 | Complex numbers represented by two 32, 64, or 128 floats, respectively |
bool | ? | Boolean type storing True and False values |
object | O | Python object type |
string_ | S | Fixed-length string type (1 byte per character). For example, to create a string dtype with length 10, use 'S10'. |
unicode_ | U | Fixed-length unicode type (number of bytes platform specific). Same specification semantics as string_ (e.g. 'U10') |
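The type codes in the table can be passed anywhere a dtype is expected. The sketch below checks a few of them against the named types; the arrays are illustrative only.

```python
import numpy as np

a = np.zeros(4, dtype='i4')
b = np.zeros(4, dtype='f8')
c = np.array(['abc', 'de'], dtype='U10')

print(a.dtype == np.int32, a.itemsize)     # True 4
print(b.dtype == np.float64, b.itemsize)   # True 8
print(c.dtype.kind)                        # U

# np.dtype turns a code into a dtype object that can be inspected directly
print(np.dtype('u1').itemsize)             # 1
```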
The data type of an ndarray can be converted with the astype method:
In [12]: arr = np.array([1, 2, 3, 4, 5])
In [13]: arr.dtype
Out[13]: dtype('int64')
In [14]: float_arr = arr.astype(np.float64)
In [15]: float_arr.dtype
Out[15]: dtype('float64')
Strings that represent numbers can be converted to a numeric type:
In [16]: numeric_strings = np.array(['1.25', '-9.6', '42'], dtype=np.bytes_)  # spelled np.string_ before NumPy 2.0
In [17]: numeric_strings.astype(float)
Out[17]: array([ 1.25, -9.6 , 42. ])
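You can also convert using another array's dtype, or specify the type with a type code:
In [18]: int_array = np.arange(10)
    ...: calibers = np.array([.22, .270, .357, .380, .44, .50], dtype=np.float64)
In [19]: int_array.astype(calibers.dtype)
Out[19]: array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
In [20]: empty_uint32 = np.empty(8, dtype='u4')
By default astype creates a new array (a copy of the data), even when the new dtype is the same as the old one.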
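A short sketch of two astype behaviours: converting floats to integers truncates the fractional part, and the result is a separate array even when the dtype does not change. The sample values are made up.

```python
import numpy as np

arr = np.array([3.7, -1.2, 0.5])
ints = arr.astype(np.int32)       # the fractional part is truncated toward zero
print(ints)                       # [ 3 -1  0]

same = arr.astype(arr.dtype)      # same dtype, but still a new array
same[0] = 100.0
print(arr[0])                     # 3.7: the original is unchanged
print(np.shares_memory(arr, same))  # False
```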
The shape attribute and the reshape function of an ndarray
a = np.array([1,2,3])
a.shape # (3,)
# reshape sets the size of each dimension explicitly; its argument is a tuple
a = a.reshape((1,-1)) # 1 row, 3 columns
a = a.reshape((3,-1)) # 3 rows, 1 column
# -1 is a placeholder: NumPy infers that dimension from the total number of elements
a = np.arange(16).reshape((2, 2, 4))
# This produces a 3-D array: the outer array holds 2 elements, each of which is an array
# of 2 elements, each of which is in turn an array of 4 elements
In [51]: a = np.arange(16).reshape((2, 2, 4))
In [52]: a
Out[52]:
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7]],

       [[ 8,  9, 10, 11],
        [12, 13, 14, 15]]])
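The -1 passed to reshape stands for "whatever size makes the total element count match", so it only works when the remaining dimensions divide the array size evenly. A minimal sketch:

```python
import numpy as np

a = np.arange(12)
print(a.reshape((3, -1)).shape)     # (3, 4): 12 / 3 = 4
print(a.reshape((2, -1, 3)).shape)  # (2, 2, 3): 12 / (2 * 3) = 2

failed = False
try:
    a.reshape((5, -1))              # 12 is not divisible by 5
except ValueError as exc:
    failed = True
    print("reshape failed:", exc)
```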
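Indexing
A slice of a NumPy array is a view of the original data, not a copy, so writing to the slice changes the original array. Slicing a Python list, by contrast, produces a copy.
To get a copy of the data you have to ask for one explicitly, e.g. arr[:].copy()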
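The difference between a view and a copy can be seen directly. This sketch writes through a NumPy slice, then repeats the experiment with an explicit copy and with a plain Python list (names are illustrative):

```python
import numpy as np

arr = np.arange(10)
view = arr[5:8]          # a view onto arr's data
view[:] = 12             # writes through to arr
print(arr)               # [ 0  1  2  3  4 12 12 12  8  9]

snapshot = arr[5:8].copy()   # an explicit, independent copy
snapshot[:] = -1
print(arr[5:8])          # [12 12 12]: unchanged

lst = list(range(10))
part = lst[5:8]          # slicing a list copies the elements
part[0] = 12
print(lst[5])            # 5
```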
One-dimensional arrays are simple: slicing works much like slicing a Python list.
In [21]: arr = np.arange(10)
In [22]: arr
Out[22]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [23]: arr[5]
Out[23]: 5
In [24]: arr[5:8]
Out[24]: array([5, 6, 7])
In [25]: arr[5:8] = 12 # broadcasting: the scalar is assigned to every element of the slice (a Python list slice cannot be assigned a scalar)
In [26]: arr
Out[26]: array([ 0, 1, 2, 3, 4, 12, 12, 12, 8, 9])
Higher-dimensional arrays offer more indexing options.
Indexing a 2-D array:
In [28]: arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
In [29]: arr2d
Out[29]:
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
In [30]: arr2d[2]
Out[30]: array([7, 8, 9])
In [31]: arr2d[0][2]
Out[31]: 3
In [32]: arr2d[0,2]
Out[32]: 3
In [34]: a = arr2d[0,2]
In [35]: a.shape
Out[35]: () # zero-dimensional
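Indexing with slices
In [38]: a = arr2d[0,2:3]
In [39]: a
Out[39]: array([3])
In [40]: a.shape
Out[40]: (1,)
In [41]: a = arr2d[0:3,2:3]
In [42]: a
Out[42]:
array([[3],
       [6],
       [9]])
When indexing, each dimension indexed with an integer instead of a slice is dropped, so the result has one dimension fewer.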
>>> a
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
>>> b = a[2,1:3]
>>> b.shape
(2,)
>>> b = a[2:3,1:3]
>>> b.shape
(1, 2)
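The rule can be checked for a few more combinations of integers and slices. This sketch assumes a is the 3 x 4 array built from np.arange(12), which matches the values printed above.

```python
import numpy as np

a = np.arange(12).reshape((3, 4))
print(a[2, 1:3].shape)     # (2,): the integer on axis 0 drops that axis
print(a[2:3, 1:3].shape)   # (1, 2): slices keep both axes
print(a[2, 1].shape)       # (): two integers leave a zero-dimensional scalar
print(a[:, 1].shape)       # (3,): a column comes back as a 1-D array
print(a[:, 1:2].shape)     # (3, 1): a length-1 slice keeps it as a column
```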
Boolean indexing
In [83]: names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
In [84]: data = np.random.randn(7, 4)
In [87]: names == 'Bob' # produces a boolean array
Out[87]: array([ True, False, False, True, False, False, False], dtype=bool)
In [88]: data[names == 'Bob'] # selects the rows where the mask is True; the boolean array must be as long as the axis it indexes
    # the complement can be selected with data[~(names == 'Bob')]
Out[88]:
array([[-0.048 , 0.5433, -0.2349, 1.2792],
       [ 2.1452, 0.8799, -0.0523, 0.0672]])
In [89]: data[names == 'Bob', 2:] # a slice or an integer can select part of the other axis
Out[89]:
array([[-0.2349, 1.2792],
       [-0.0523, 0.0672]])
In [90]: data[names == 'Bob', 3] # integer
Out[90]: array([ 1.2792, 0.0672])
In [93]: mask = (names == 'Bob') | (names == 'Will') # combine masks with | (or) and & (and)
In [94]: mask
Out[94]: array([True, False, True, True, True, False, False], dtype=bool)
In [95]: data[mask]
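Because the data above is random, here is a deterministic sketch of the same ideas. It uses np.arange in place of the random values (an assumption made purely so the output is reproducible) and shows mask negation with ~ and assignment through a mask.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
data = np.arange(28).reshape((7, 4))   # deterministic stand-in for the random data

is_bob = names == 'Bob'
print(data[is_bob])          # rows 0 and 3
print(data[~is_bob].shape)   # (5, 4): ~ negates a boolean mask

data[names != 'Joe'] = 0     # assignment through a mask modifies data in place
print(data[:, 0])            # [ 0  4  0  0  0 20 24]
```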
Fancy indexing: indexing with integer arrays
In [100]: arr = np.empty((8, 4)) # creates an 8 x 4 array
In [101]: for i in range(8): # fill each row with its own index
   .....:     arr[i] = i
In [103]: arr[[4, 3, 0, 6]] # selects rows 4, 3, 0 and 6 in one go
In [104]: arr[[-3, -5, -7]] # negative indices work too; they count from the end, starting at -1
In [105]: arr = np.arange(32).reshape((8, 4))
In [107]: arr[[1, 5, 7, 2], [0, 3, 1, 2]] # selects elements (1, 0), (5, 3), (7, 1), (2, 2); perhaps not what you expected, but that is how it works
Out[107]: array([ 4, 23, 29, 10])
In [108]: arr[[1, 5, 7, 2]][:, [0, 3, 1, 2]] # produces a matrix
Out[108]:
array([[ 4,  7,  5,  6],
       [20, 23, 21, 22],
       [28, 31, 29, 30],
       [ 8, 11,  9, 10]])
In [109]: arr[np.ix_([1, 5, 7, 2], [0, 3, 1, 2])] # np.ix_() gives the same result as arr[[1, 5, 7, 2]][:, [0, 3, 1, 2]]
Fancy indexing always produces a copy of the original array's data.
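The three fancy-indexing behaviours described here (pairwise selection, the rectangular selection built by np.ix_, and the fact that the result is a copy) can be verified in a few lines. The helper names rows, cols and picked are made up for this sketch.

```python
import numpy as np

arr = np.arange(32).reshape((8, 4))
rows, cols = [1, 5, 7, 2], [0, 3, 1, 2]

pairs = arr[rows, cols]               # element-wise pairing of the two index lists
block = arr[np.ix_(rows, cols)]       # every listed row combined with every listed column
print(pairs)                          # [ 4 23 29 10]
print(np.array_equal(block, arr[rows][:, cols]))  # True

picked = arr[[0, 1]]                  # fancy indexing returns a copy
picked[:] = -1
print(arr[0])                         # [0 1 2 3]: the original is untouched
```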
Arithmetic and common functions
+ --> np.add(a,b)
- --> np.subtract(a,b)
* --> np.multiply(a,b)
/ --> np.divide(a,b)
(All of these operate element by element; none of them is a matrix operation.)
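To make the distinction concrete, this sketch compares the element-wise operators with the matrix product on two small arrays chosen for the example:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[10, 20], [30, 40]])

print(np.array_equal(a + b, np.add(a, b)))        # True: the operator and the function agree
print(a * b)        # element-wise product: 10, 40 / 90, 160
print(a.dot(b))     # matrix product: 70, 100 / 150, 220
print((a / b).dtype)  # float64: true division returns floats even for integer inputs
```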
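Add 10 to the second column of a (each of the three statements below does this, so running all three adds 30) # a is the (3,4) array shown above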
>>> a[np.arange(3),1] +=10
>>> a[np.arange(3),[1,1,1]] +=10
>>> a[[0,1,2],[1,1,1]] +=10
Select the values in a that are greater than 10
>>> re = a>10
>>> re
array([[False, True, False, False],
[False, True, False, False],
[False, True, False, True]])
>>> a[re]
array([31, 35, 39, 11])
>>> a[a>10]
array([31, 35, 39, 11])
Common functions
np.sum()
>>> a = np.array([[1,2],[3,4]])
>>> a
array([[1, 2],
[3, 4]])
>>> a.sum()
10
>>> np.sum(a)
10
>>> np.sum(a,axis=0) # sum of each column
array([4, 6])
>>> np.sum(a,axis=1) # sum of each row
array([3, 7])
>>> a.sum(axis=0)
array([4, 6])
>>> a.sum(axis=1)
array([3, 7])
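np.random.uniform() # draws random samples from a uniform distribution
np.tile(array, reps) # repeats the array the number of times given by reps
np.argsort() # sorts, returning the indices rather than the values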
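A brief sketch of the three functions just listed, with small made-up inputs:

```python
import numpy as np

row = np.array([1, 2, 3])
tiled = np.tile(row, (2, 2))   # repeat twice along rows and twice along columns
print(tiled)
# [[1 2 3 1 2 3]
#  [1 2 3 1 2 3]]

scores = np.array([30, 10, 20])
order = np.argsort(scores)     # indices that would sort the array
print(order)                   # [1 2 0]
print(scores[order])           # [10 20 30]

u = np.random.uniform(low=-1.0, high=1.0, size=5)   # 5 floats in [-1.0, 1.0)
print(u.shape)                 # (5,)
```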
T, transpose, swapaxes
1) a.T # attribute
2) np.transpose(a) # NumPy function
3) a.transpose() # array method; without arguments it is the same as a.T
4) a.swapaxes() # swaps two axes
a = np.arange(16).reshape((2, 2, 4)) # a.shape is (2,2,4)
a.T # shape (4,2,2)
a.transpose() # shape (4,2,2)
a.transpose((1,0,2)) # shape (2,2,4)
# The axes of a (2,2,4) array are numbered (0,1,2); passing (1,0,2) swaps the first two.
# In other words, the element at index [i, j, k] of the original ends up at index [j, i, k] of the new array.
a.swapaxes(1,2) # shape (2,4,2); the arguments are again axis numbers
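The index-swapping description of transpose can be checked on a single element. The sketch below also shows that the transposed array shares memory with the original rather than copying it.

```python
import numpy as np

a = np.arange(16).reshape((2, 2, 4))
t = a.transpose((1, 0, 2))

# element [i, j, k] of a ends up at [j, i, k] of t
print(a[0, 1, 3], t[1, 0, 3])             # 7 7
print(a.T.shape, a.swapaxes(1, 2).shape)  # (4, 2, 2) (2, 4, 2)
print(np.shares_memory(a, t))             # True: transposing returns a view
```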
The where conditional function in NumPy
In [147]: arr = np.random.randn(4, 4)
In [149]: np.where(arr > 0, 2, -2) # 2 where arr is positive, -2 elsewhere
In [150]: np.where(arr > 0, 2, arr) # set only positive values to 2
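Since the array above is random, a fixed input makes the behaviour of np.where easier to follow. The values here are chosen only for illustration; the last call shows the one-argument form, which returns the indices where the condition holds.

```python
import numpy as np

arr = np.array([[1.5, -0.3], [-2.0, 0.7]])
signs = np.where(arr > 0, 2, -2)       # 2 where positive, -2 elsewhere
clipped = np.where(arr > 0, 2, arr)    # positives become 2, the rest keep their value
print(signs)
print(clipped)

rows, cols = np.where(arr > 0)         # one argument: indices of the True entries
print(rows, cols)                      # [0 1] [0 1]
```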
The boolean functions any and all
In [162]: bools = np.array([False, False, True, False])
In [163]: bools.any()
Out[163]: True
In [164]: bools.all()
Out[164]: False
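any and all also accept an axis argument, and boolean arrays can be summed to count True values. A small sketch with made-up data:

```python
import numpy as np

arr = np.array([[1, -2, 3], [4, 5, 6]])
positive = arr > 0
print(positive.sum())          # 5: each True counts as 1
print(positive.all(axis=1))    # [False  True]: is every entry in the row positive?
print(positive.any(axis=0))    # [ True  True  True]: is any entry in the column positive?
print(np.array([0, 1, 2]).any())   # True: non-zero values count as True
```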
The sort function
By default sort works along the last axis, which for a 2-D array means each row is sorted; you can also pass the axis explicitly: 1 sorts within rows, 0 sorts within columns.
In [80]: arr = np.random.randn(5, 3)
In [82]: arr.sort() # sorts each row by default
In [83]: arr
Out[83]:
array([[-1.2629921 , -0.75419353,  0.24817741],
       [ 0.5467019 ,  1.46272747,  1.50331672],
       [-1.19504888,  0.61300717,  0.83061943],
       [-1.22133562,  0.49668954,  1.73834466],
       [-2.25860226, -0.90163896, -0.53758088]])
In [84]: arr.sort(0) # columns (0)
In [85]: arr
Out[85]:
array([[-2.25860226, -0.90163896, -0.53758088],
       [-1.2629921 , -0.75419353,  0.24817741],
       [-1.22133562,  0.49668954,  0.83061943],
       [-1.19504888,  0.61300717,  1.50331672],
       [ 0.5467019 ,  1.46272747,  1.73834466]])
In [86]: arr.sort(axis=1) # rows (1)
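Note that the arr.sort method sorts in place, while the top-level np.sort function returns a sorted copy and leaves its input alone. A sketch with a small fixed array:

```python
import numpy as np

arr = np.array([[9, 1, 5], [3, 7, 2]])

by_row = np.sort(arr)            # sorted copy, each row sorted (last axis)
by_col = np.sort(arr, axis=0)    # sorted copy, each column sorted
print(by_row)                    # rows: 1 5 9 / 2 3 7
print(by_col)                    # rows: 3 1 2 / 9 7 5
print(arr[0])                    # [9 1 5]: np.sort did not modify arr

result = arr.sort(axis=1)        # the method sorts in place and returns None
print(result, arr[0])            # None [1 5 9]
```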
unique and some other functions
Method | Description |
---|---|
unique(x) | Compute the sorted, unique elements in x |
intersect1d(x, y) | Compute the sorted, common elements in x and y |
union1d(x, y) | Compute the sorted union of elements |
in1d(x, y) | Compute a boolean array indicating whether each element of x is contained in y |
setdiff1d(x, y) | Set difference, elements in x that are not in y |
setxor1d(x, y) | Set symmetric differences; elements that are in either of the arrays, but not both |
In [89]: names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
In [90]: np.unique(names)
Out[90]: array(['Bob', 'Joe', 'Will'], dtype='<U4')
In [91]: ints = np.array([3, 3, 3, 2, 2, 1, 1, 4, 4])
In [92]: np.unique(ints)
Out[92]: array([1, 2, 3, 4])
In [93]: sorted(set(names))
Out[93]: ['Bob', 'Joe', 'Will']
In [94]: values = np.array([6, 0, 0, 3, 2, 5, 6])
In [95]: np.in1d(values, [2, 3, 6])
Out[95]: array([ True, False, False, True, True, False, True])
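The remaining set functions from the table behave as follows on two small made-up arrays. The membership test here uses np.isin, which the NumPy documentation recommends over in1d for new code.

```python
import numpy as np

x = np.array([1, 2, 3, 3, 4])
y = np.array([3, 4, 5])

print(np.intersect1d(x, y))   # [3 4]
print(np.union1d(x, y))       # [1 2 3 4 5]
print(np.setdiff1d(x, y))     # [1 2]
print(np.setxor1d(x, y))      # [1 2 5]
print(np.isin(x, y))          # [False False  True  True  True]

values, counts = np.unique(x, return_counts=True)
print(values, counts)         # [1 2 3 4] [1 1 2 1]
```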
File operations
In [96]: arr = np.arange(10)
In [98]: np.save('D:/save_arr', arr) # the .npy extension is appended automatically
In [101]: np.load('D:/save_arr.npy')
Out[101]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
np.loadtxt() # loads data from a text file
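A self-contained round trip through both formats might look like the sketch below. It writes into a temporary directory instead of D:/ so it can run anywhere; the file names are arbitrary, and np.savetxt is used here only to create a text file for np.loadtxt to read.

```python
import os
import tempfile
import numpy as np

arr = np.arange(10)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'save_arr')
    np.save(path, arr)                      # writes save_arr.npy
    loaded = np.load(path + '.npy')

    txt = os.path.join(tmp, 'arr.txt')
    np.savetxt(txt, arr.reshape((2, 5)), delimiter=',')
    from_txt = np.loadtxt(txt, delimiter=',')   # loadtxt returns floats by default

print(np.array_equal(arr, loaded))      # True
print(from_txt.shape, from_txt.dtype)   # (2, 5) float64
```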
Common numpy.linalg functions
Matrix multiplication:
1) a.dot(b)
2) np.dot(a,b)
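The notes list only the matrix product here. As a hedged sketch of what else numpy.linalg offers, the example below uses inv, solve and det on a small matrix chosen for the example, together with the @ operator, which performs the same matrix product as dot for 2-D arrays.

```python
import numpy as np

a = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])

print(np.allclose(a.dot(a), a @ a))        # True: @ is the matrix product operator

inv = np.linalg.inv(a)                     # matrix inverse
print(np.allclose(a @ inv, np.eye(2)))     # True

x = np.linalg.solve(a, b)                  # solves a @ x = b
print(np.allclose(a @ x, b))               # True

print(np.linalg.det(a))                    # approximately 5.0 (2*3 - 1*1)
```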
Functions in numpy.random
Function | Description |
---|---|
seed | Seed the random number generator |
permutation | Return a random permutation of a sequence, or return a permuted range |
shuffle | Randomly permute a sequence in place |
rand | Draw samples from a uniform distribution |
randint | Draw random integers from a given low-to-high range |
randn | Draw samples from a normal distribution with mean 0 and standard deviation 1 (MATLAB-like interface) |
binomial | Draw samples from a binomial distribution |
normal | Draw samples from a normal (Gaussian) distribution |
beta | Draw samples from a beta distribution |
chisquare | Draw samples from a chi-square distribution |
gamma | Draw samples from a gamma distribution |
uniform | Draw samples from a uniform [0, 1) distribution |
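A few of the functions in this table in action. Seeding makes the sequence repeatable, which this sketch checks by drawing the same numbers twice; the seed value 0 is arbitrary.

```python
import numpy as np

np.random.seed(0)
a = np.random.rand(3)
np.random.seed(0)
b = np.random.rand(3)
print(np.array_equal(a, b))        # True: same seed, same draws

perm = np.random.permutation(5)    # a shuffled copy of 0..4
print(sorted(perm.tolist()))       # [0, 1, 2, 3, 4]

seq = np.arange(5)
np.random.shuffle(seq)             # shuffles seq in place
draws = np.random.randint(0, 10, size=4)   # integers in [0, 10)
print(seq, draws)
```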
Draw random samples from a normal (Gaussian) distribution:
numpy.random.normal(loc=0.0, scale=1.0, size=None)
np.random.randn(d0, d1, ...) draws from the standard normal distribution, i.e. loc = 0, scale = 1
From the official documentation:
loc : float or array_like of floats
    Mean ("centre") of the distribution.
scale : float or array_like of floats
    Standard deviation (spread or "width") of the distribution.
size : int or tuple of ints, optional
    Output shape. If the given shape is, e.g., ``(m, n, k)``, then ``m * n * k`` samples are drawn. If size is ``None`` (default), a single value is returned if ``loc`` and ``scale`` are both scalars. Otherwise, ``np.broadcast(loc, scale).size`` samples are drawn.
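The effect of the size parameter described in the documentation excerpt can be seen from the shapes of the results. The sketch also shows that randn takes its dimensions as separate arguments rather than as a tuple. All values are illustrative.

```python
import numpy as np

s1 = np.random.normal(loc=0.0, scale=1.0)                # size=None with scalar loc/scale: one value
s2 = np.random.normal(0.0, 1.0, size=(2, 3))             # 2 * 3 = 6 samples, shape (2, 3)
s3 = np.random.normal(loc=[0.0, 10.0], scale=1.0)        # loc is broadcast: one sample per mean
r = np.random.randn(2, 3)                                # dimensions passed as separate arguments

print(np.ndim(s1), s2.shape, s3.shape, r.shape)          # 0 (2, 3) (2,) (2, 3)
```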
# Check the mean and the standard deviation
In [112]: mu, sigma = 0, 0.1
In [113]: s = np.random.normal(mu, sigma, 1000)
In [114]: abs(mu - np.mean(s)) < 0.01
Out[114]: True
In [115]: abs(sigma - np.std(s, ddof=1)) < 0.01 # ddof: delta degrees of freedom; ddof=1 divides by N - 1, which makes the variance estimate unbiased
Out[115]: True
# Compare the samples with the density using matplotlib
In [116]: import matplotlib.pyplot as plt
In [118]: count, bins, ignored = plt.hist(s, 30, density=True)
In [119]: plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) *
     ...:          np.exp( - (bins - mu)**2 / (2 * sigma**2) ),
     ...:          linewidth=2, color='r')
Out[119]: [<matplotlib.lines.Line2D at 0x212dc9f5a90>]
In [120]: plt.show()
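Probability density function of the Gaussian distribution:
$$p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$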
Result: the normalized histogram of s with the density curve drawn over it in red.