How to generate a pandas DataFrame column of Categorical from a string column?

Question:

I can convert a pandas string column to Categorical, but when I try to insert it as a new DataFrame column it seems to get converted right back to a Series of str:

train['LocationNFactor'] = pd.Categorical.from_array(train['LocationNormalized']) 

>>> type(pd.Categorical.from_array(train['LocationNormalized'])) 
<class 'pandas.core.categorical.Categorical'> 
# however it got converted back to... 
>>> type(train['LocationNFactor'][2]) 
<type 'str'> 
>>> train['LocationNFactor'][2] 
'Hampshire'

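My guess is that this happens because Categorical doesn't map to any numpy dtype, so I have to convert it to some int type and thus lose the factor labels <-> levels association? What is the most elegant workaround to store the levels <-> labels association and keep the ability to convert? (Just store it as a dict like here, and convert manually when needed?) I think Categorical is still not a first-class datatype for DataFrame, unlike R.

(Using pandas 0.10.1, numpy 1.6.2, python 2.7.3, the latest MacPorts versions of everything.)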

Answer

The only workaround I found for pandas before 0.15 is as follows:

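The column must be converted to a Categorical explicitly, but numpy will immediately coerce the levels back to int, losing the factor information.
So store the factor in a global variable outside the DataFrame.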

train_LocationNFactor = pd.Categorical.from_array(train['LocationNormalized']) # default order: alphabetical 

train['LocationNFactor'] = train_LocationNFactor.labels # insert in dataframe
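
The same idea (integer codes inside the frame, the level lookup kept outside it) can also be sketched with `pd.factorize`, which returns the codes and the levels in one call. The small `train` frame below is a hypothetical stand-in for the question's data, not the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the question's `train` frame.
train = pd.DataFrame(
    {"LocationNormalized": ["London", "Hampshire", "Hampshire", "Surrey"]}
)

# sort=True orders the levels alphabetically, like the default noted above.
codes, levels = pd.factorize(train["LocationNormalized"], sort=True)

# Only the integer codes go into the frame; `levels` (an Index) stays outside.
train["LocationNFactor"] = codes

# Recover the string labels from the stored codes when needed.
restored = levels.take(train["LocationNFactor"].values)
```

Here `levels` plays the role of the global factor variable: as long as it is kept around, the integer column can be mapped back to labels.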

[Update: pandas 0.15+ added decent support for Categorical]
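
For readers on pandas 0.15 or later, a minimal sketch of what that support looks like: the column can be converted with `astype("category")` and the dtype is kept after assignment. The `train` frame here is a hypothetical stand-in for the question's data.

```python
import pandas as pd

# Hypothetical stand-in for the question's `train` frame.
train = pd.DataFrame(
    {"LocationNormalized": ["London", "Hampshire", "Hampshire", "Surrey"]}
)

# Convert the string column to a categorical column stored in the frame itself.
train["LocationNFactor"] = train["LocationNormalized"].astype("category")

# The categorical dtype survives assignment; levels and codes are on `.cat`.
dtype_name = str(train["LocationNFactor"].dtype)
levels = list(train["LocationNFactor"].cat.categories)  # unique labels, sorted
codes = list(train["LocationNFactor"].cat.codes)        # integer code per row
```

With this approach the levels <-> labels association travels with the column, so no separate lookup has to be stored outside the DataFrame.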

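Answer

The labels <-> levels mapping is stored in an Index object.

To convert an integer array to a string array: index[integer_array]
To convert a string array to an integer array: index.get_indexer(string_array)

Here is an example: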

In [56]: c = pd.Categorical.from_array(['a', 'b', 'c', 'd', 'e'])
   ....: idx = c.levels

In [57]: idx[[1,2,1,2,3]]
Out[57]: Index([b, c, b, c, d], dtype=object)

In [58]: idx.get_indexer(["a","c","d","e","a"])
Out[58]: array([0, 2, 3, 4, 0])
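
In pandas 0.15 and later, `levels` was renamed to `categories` (and `labels` to `codes`), and `Categorical.from_array` was later deprecated, so the session above may not run on a current install. A rough equivalent, assuming a recent pandas, might look like this:

```python
import pandas as pd

# Build the Categorical directly; `from_array` is not needed on recent pandas.
c = pd.Categorical(["a", "b", "c", "d", "e"])

# The Index of levels, called `levels` in the older API shown above.
idx = c.categories

# Integer codes -> string labels, by indexing into the Index.
labels = idx[[1, 2, 1, 2, 3]]

# String labels -> integer codes, via Index.get_indexer.
codes = idx.get_indexer(["a", "c", "d", "e", "a"])
```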

I know, but the problem here is that it all gets coerced back to str when we assign it to a DataFrame column, as I showed: `train['LocationNFactor'] = pd.Categorical...` – smci 2013-03-12 19:47:59
