Convert the string 2.90K to 2900 or 5.2M to 5200000 in a pandas DataFrame

Problem description:

I need some help working with data in a pandas DataFrame. Any help is most welcome.

I have OHLCV data in CSV format. I have loaded the file into a pandas DataFrame.

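How do I convert the volume column from 2.90K to 2900, or from 5.2M to 5200000? The column can contain thousands written with a K suffix and millions written with an M suffix.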

import pandas as pd 

file_path = '/home/fatjoe/UCHM.csv' 
df = pd.read_csv(file_path, parse_dates=[0], index_col=0) 
df.columns = [ 
"closing_price", 
"opening_price", 
"high_price", 
"low_price", 
"volume", 
"change"] 

df['opening_price'] = df['closing_price'] 
df['opening_price'] = df['opening_price'].shift(-1) 
df = df.replace('-', 0) 
df = df[:-1] 
print(df.head()) 

Console: 
Date 
2016-09-23   0 
2016-09-22  9.60K 
2016-09-21  54.20K 
2016-09-20 115.30K 
2016-09-19  18.90K 
2016-09-16 176.10K 
2016-09-15  31.60K 
2016-09-14  10.00K 
2016-09-13  3.20K 

Assume you have the following DataFrame:

In [30]: df 
Out[30]: 
     Date  Val 
0 2016-09-23  100 
1 2016-09-22 9.60M 
2 2016-09-21 54.20K 
3 2016-09-20 115.30K 
4 2016-09-19 18.90K 
5 2016-09-16 176.10K 
6 2016-09-15 31.60K 
7 2016-09-14 10.00K 
8 2016-09-13 3.20M 

You can do it like this:

In [31]: df.Val = (df.Val.replace(r'[KM]+$', '', regex=True).astype(float) *
    ....:          df.Val.str.extract(r'[\d\.]+([KM]+)', expand=False)
    ....:             .fillna(1)
    ....:             .replace(['K','M'], [10**3, 10**6]).astype(int))

In [32]: df 
Out[32]: 
     Date  Val 
0 2016-09-23  100.0 
1 2016-09-22 9600000.0 
2 2016-09-21 54200.0 
3 2016-09-20 115300.0 
4 2016-09-19 18900.0 
5 2016-09-16 176100.0 
6 2016-09-15 31600.0 
7 2016-09-14 10000.0 
8 2016-09-13 3200000.0 

Explanation:

In [36]: df.Val.replace(r'[KM]+$', '', regex=True).astype(float) 
Out[36]: 
0 100.0 
1  9.6 
2  54.2 
3 115.3 
4  18.9 
5 176.1 
6  31.6 
7  10.0 
8  3.2 
Name: Val, dtype: float64 

In [37]: df.Val.str.extract(r'[\d\.]+([KM]+)', expand=False) 
Out[37]: 
0 NaN 
1  M 
2  K 
3  K 
4  K 
5  K 
6  K 
7  K 
8  M 
Name: Val, dtype: object 

In [38]: df.Val.str.extract(r'[\d\.]+([KM]+)', expand=False).fillna(1) 
Out[38]: 
0 1 
1 M 
2 K 
3 K 
4 K 
5 K 
6 K 
7 K 
8 M 
Name: Val, dtype: object 

In [39]: df.Val.str.extract(r'[\d\.]+([KM]+)', expand=False).fillna(1).replace(['K','M'], [10**3, 10**6]).astype(int) 
Out[39]: 
0   1 
1 1000000 
2  1000 
3  1000 
4  1000 
5  1000 
6  1000 
7  1000 
8 1000000 
Name: Val, dtype: int32 
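
The same steps can also be sketched as a single standalone script. This is only an illustration under stated assumptions, not the answer's own code: the sample frame is a shortened copy of the one above, `multipliers` is a hypothetical lookup table, and the regex is anchored so that one `str.extract` call captures both the number and the optional suffix.

```python
import pandas as pd

# Shortened copy of the example frame from the answer.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2016-09-23", "2016-09-22", "2016-09-21", "2016-09-13"]),
    "Val": ["100", "9.60M", "54.20K", "3.20M"],
})

# Hypothetical lookup table: suffix -> multiplier.
multipliers = {"K": 10**3, "M": 10**6}

# Two capture groups: the numeric part and the optional K/M suffix.
parts = df["Val"].astype(str).str.extract(r"^([\d.]+)([KM]?)$")

number = parts[0].astype(float)
# Values without a suffix are not in the table, so map() yields NaN;
# fill those with a multiplier of 1.
factor = parts[1].map(multipliers).fillna(1).astype(float)

df["Val"] = number * factor
print(df)
```

Using `map` with a dictionary keeps the suffix table in one place, so adding another suffix (for example `B` for billions) would only need a new dictionary entry and a matching change to the character class in the regex.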

Thanks... it works... now let me try to understand what the code is doing

@JosephMNjuguna, you're welcome! I have added a step-by-step explanation to my answer - please check... – MaxU

@MaxU ... now I know how to use regular expressions with pandas ... I've been at it for a few days

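def value_to_float(x):
    # Numbers pass through unchanged.
    if isinstance(x, (int, float)):
        return x
    if 'K' in x:
        # A bare 'K' counts as one thousand.
        if len(x) > 1:
            return float(x.replace('K', '')) * 1000
        return 1000.0
    if 'M' in x:
        # A bare 'M' counts as one million.
        if len(x) > 1:
            return float(x.replace('M', '')) * 1000000
        return 1000000.0
    if 'B' in x:
        return float(x.replace('B', '')) * 1000000000
    # No recognized suffix (including plain numeric strings such as '100'): 0.0.
    return 0.0

# Replace "Insert data series column" with the name of your column.
df["Insert data series column"] = df["Insert data series column"].apply(value_to_float)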
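
For a column like the question's volume, which mixes the integer 0 with suffixed strings, an element-wise conversion with `apply` might look like the sketch below. The names `parse_volume` and `MULTIPLIERS` and the sample values are illustrative assumptions, not part of the answer above; unlike `value_to_float`, this variant parses plain numeric strings as numbers instead of returning 0.0.

```python
import pandas as pd

# Illustrative suffix table; not part of the original answer.
MULTIPLIERS = {"K": 1e3, "M": 1e6, "B": 1e9}

def parse_volume(x):
    """Convert values such as '2.90K' or '5.2M' to floats."""
    if isinstance(x, (int, float)):
        return float(x)
    text = x.strip()
    if text and text[-1] in MULTIPLIERS:
        # Strip the one-letter suffix and scale the remaining number.
        return float(text[:-1]) * MULTIPLIERS[text[-1]]
    # Plain numeric string; anything else raises ValueError.
    return float(text)

# Sample values modeled on the question: an integer 0 mixed with strings.
volume = pd.Series([0, "2.90K", "5.2M", "115.30K", "100"])
converted = volume.apply(parse_volume)
print(converted.tolist())
```

Raising `ValueError` on unrecognized input, rather than silently returning 0.0, is a design choice made here so that malformed rows surface early; whether that suits a given data set is a judgment call.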
