pandas Notes

1. factorize()

Official documentation

This method is useful for obtaining a numeric representation of an array when all that matters is identifying distinct values. factorize is available as both a top-level function pandas.factorize(), and as a method Series.factorize() and Index.factorize().

pandas.factorize(values, sort=False, na_sentinel=-1, size_hint=None)
Encode input values as an enumerated type or categorical variable

Parameters:

values:sequence

A 1-D sequence. Sequences that aren’t pandas objects are coerced to ndarrays before factorization.

sort:bool, default False

Sort uniques and shuffle codes to maintain the relationship.

na_sentinel:int or None, default -1

Value to mark “not found”. If None, will not drop the NaN from the uniques of the values.

Changed in version 1.1.2.

size_hint:int, optional

Hint to the hashtable sizer.

Returns:

codes:ndarray

An integer ndarray that’s an indexer into uniques. uniques.take(codes) will have the same values as values.

uniques:ndarray, Index, or Categorical

The unique valid values. When values is Categorical, uniques is a Categorical. When values is some other pandas object, an Index is returned. Otherwise, a 1-D ndarray is returned.
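The relationships described above (codes index into uniques, sort reorders uniques and remaps codes, and the type of uniques depends on the input) can be checked with a small sketch. The sample array is made up for illustration, and only the values and sort parameters are used:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: a plain NumPy array, not a pandas object.
values = np.array(["b", "a", "c", "b"], dtype=object)

# Default: uniques are listed in order of first appearance.
codes, uniques = pd.factorize(values)

# codes is an indexer into uniques, so take() rebuilds the input.
restored = uniques.take(codes)

# sort=True sorts uniques and remaps codes so the relationship still holds.
codes_sorted, uniques_sorted = pd.factorize(values, sort=True)

# For a pandas object such as a Series, uniques comes back as an Index.
series_codes, series_uniques = pd.factorize(pd.Series(values))

print(codes.tolist())           # [0, 1, 2, 0]
print(list(uniques))            # ['b', 'a', 'c']
print(codes_sorted.tolist())    # [1, 0, 2, 1]
print(list(uniques_sorted))     # ['a', 'b', 'c']
print(type(uniques).__name__)   # ndarray
print(isinstance(series_uniques, pd.Index))  # True
```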

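My understanding

factorize maps the nominal (categorical) values in a Series to a set of integers, with identical values mapped to the same integer. Codes are assigned in order of first appearance: values seen earlier get smaller codes, values seen later get larger ones. So the first row of the column gets 0; the second row gets 1 if its value differs from the first, otherwise 0; and so on. The exception is missing values: by default they receive the sentinel -1 and are left out of the uniques. The return value of factorize is a tuple with two elements. The first element is an array holding the integer code for each input value; the second is an Index (when the input is a Series) holding all the distinct values, with no duplicates.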

Python example
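A minimal sketch of the behaviour described in these notes, using a made-up Series of colour names (the data and variable names are illustrative only). It shows first-appearance ordering and the default handling of a missing value:

```python
import pandas as pd

# Hypothetical sample column with one missing value.
colors = pd.Series(["red", "green", "red", "blue", None, "green"])

# Series.factorize() returns a tuple: (codes, uniques).
codes, uniques = colors.factorize()

# "red" is seen first -> 0, "green" -> 1, "blue" -> 2.
# The missing value gets the default sentinel -1.
print(codes.tolist())   # [0, 1, 0, 2, -1, 1]

# uniques holds each distinct non-missing value exactly once.
print(list(uniques))    # ['red', 'green', 'blue']

# Attach the codes back to the data as a new numeric column.
df = pd.DataFrame({"color": colors, "color_code": codes})
print(df)
```

The last step is one common use: keeping the original labels next to their integer codes so the mapping stays easy to inspect.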
