Finding contiguous regions in a pandas Series

Problem description:

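I want to select all regions that have a value greater than 1, provided they are connected to an element with a value above 5. Two values are not connected if they are separated by a 0.

For the following dataset,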

pd.Series(data = [0,2,0,2,3,6,3,0]) 

the output should be

pd.Series(data = [False,False,False,True,True,True,True,False]) 

The second 2 is not adjacent to a value above 5. Could you clarify the definition?


Does this clarify it?


Strictly greater than 1, or >= 1? – FLab

Well, it looks like I have found a one-liner, using the pandas groupby functionality:

import pandas as pd 

ts = pd.Series(data = [0,2,0,2,3,6,3,0]) 

# The flag column identifies the sequences. Each 0 is included in the 
# sequence that follows it, but as the next step shows, that does not matter 
df = pd.concat([ts, (ts==0).cumsum()], axis = 1, keys = ['val', 'flag']) 

# val flag 
#0 0  1 
#1 2  1 
#2 0  2 
#3 2  2 
#4 3  2 
#5 6  2 
#6 3  2 
#7 0  3 

# For each group (having the same flag), I do a boolean AND of two conditions: 
# any value above 5 AND value above 1 (which excludes zeros) 
df.groupby('flag').transform(lambda x: (x>5).any() * x > 1) 

#Out[32]: 
#  val 
#0 False 
#1 False 
#2 False 
#3 True 
#4 True 
#5 True 
#6 True 
#7 False 

In case you are wondering, you can collapse everything into a single line:

ts.groupby((ts==0).cumsum()).transform(lambda x: (x>5).any() * x > 1).astype(bool) 
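
The one-liner leans on operator precedence: `(x>5).any() * x > 1` is evaluated as `((x>5).any() * x) > 1`, so a group without a value above 5 is multiplied by `False` and zeroed out before the comparison. A minimal sketch of the same idea with the steps named and without the lambda follows. The function name `select_connected` and the parameters `low` and `high` are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd

def select_connected(ts, low=1, high=5):
    # Hypothetical helper; `low` and `high` are assumed parameter names.
    # The counter increments at every zero, so each zero starts a new group.
    group_id = (ts == 0).cumsum()
    # Maximum of each group, broadcast back to every row of that group.
    group_max = ts.groupby(group_id).transform("max")
    # Keep values above `low` whose group contains a value above `high`.
    return (ts > low) & (group_max > high)

ts = pd.Series(data=[0, 2, 0, 2, 3, 6, 3, 0])
mask = select_connected(ts)
```

Because both operands are already boolean comparisons, the result should not need the trailing `.astype(bool)`.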

I am still leaving my first approach here for reference:

import itertools 
import pandas as pd 

def flatten(l): 
    # Util function to flatten a list of lists 
    # e.g. [[1], [2,3]] -> [1,2,3] 
    return list(itertools.chain.from_iterable(l)) 

ts = pd.Series(data = [0,2,0,2,3,6,3,0]) 
#Get data as list 
values = ts.values.tolist() 

# From what I understand, the 0s delimit subsequences (so numbers are not 
# connected if separated by a 0) 

# Get location of zeros 
gap_loc = [idx for (idx, el) in enumerate(values) if el==0] 
# Re-create pandas series 
gap_series = pd.Series(False, index = gap_loc) 

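# Get values and locations of the subsequences (i.e. separated by zeros) 
# Note: only elements lying between two zeros are covered, so the series 
# is assumed to start and end with a 0, as in the example 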
valid_loc = [range(prev_gap+1,gap) for prev_gap, gap in zip(gap_loc[:-1],gap_loc[1:])] 
list_seq = [values[prev_gap+1:gap] for prev_gap, gap in zip(gap_loc[:-1],gap_loc[1:])] 
# list_seq = [[2], [2, 3, 6, 3]] 

# Verify your condition 
check_condition = [[el>1 and any(map(lambda x: x>5, sublist)) for el in sublist] 
        for sublist in list_seq] 
# Put results back into a pandas Series 
valid_series = pd.Series(flatten(check_condition), index = flatten(valid_loc)) 

# Put everything together: 
result = pd.concat([gap_series, valid_series], axis = 0).sort_index() 

#result 
#Out[101]: 
#0 False 
#1 False 
#2 False 
#3  True 
#4  True 
#5  True 
#6  True 
#7 False 
#dtype: bool 
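
The list-based approach above can be sketched more compactly with `itertools.groupby`, which splits the values into alternating runs of zeros and non-zeros and therefore does not rely on the series starting and ending with a 0. This is only an illustrative variant, not the answer's original code; the helper name `connected_mask` and the parameters `low` and `high` are assumptions.

```python
import itertools
import pandas as pd

def connected_mask(values, low=1, high=5):
    # Hypothetical helper: flag values > low that lie in a zero-free run
    # containing at least one value > high.
    mask = []
    # groupby yields consecutive runs that share the same key (zero / non-zero)
    for is_zero, run in itertools.groupby(values, key=lambda v: v == 0):
        run = list(run)
        has_peak = (not is_zero) and any(v > high for v in run)
        mask.extend(has_peak and v > low for v in run)
    return mask

ts = pd.Series(data=[0, 2, 0, 2, 3, 6, 3, 0])
result = pd.Series(connected_mask(ts.tolist()), index=ts.index)
```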

You might want to check the new one-liner solution – FLab

I solved it myself in an ugly way, see below. However, I would still like to know whether there is a better way to do this.

import pandas as pd 

test_series = pd.Series(data = [0,2,0,2,3,6,3,0]) 

# Make a boolean DataFrame. 
# Column 0 marks values above 1 but not above 5, and column 1 marks values above 5. 
# The boolean series we are looking for is column 1 after the loop below has modified it. 
bool_df = pd.DataFrame(data= [(test_series>1), (test_series>5)]).T 
bool_df.loc[:,0] = (bool_df.loc[:,0])&(~bool_df.loc[:,1]) 

# Caveat: a streak that touches the start or end of the series is not handled 
# (index 0 is skipped, and a trailing streak runs past the last index). 
k=0 # k counts through the rows where column 0 is True (1 < value <= 5) 
while k < len(bool_df.loc[bool_df.loc[:,0],0]): 
    i = bool_df.loc[bool_df.loc[:,0],0].index[k] # the bool_df index corresponding to k 
    if i > 0: # avoid negative indices 
        if bool_df.loc[i-1,1]: # check whether the previous entry is above 5 or already marked 
            bool_df.loc[i,1] = True 
            k+=1 
        else: 
            j=i 
            while bool_df.loc[j,0]: # find the end of the streak of 1 < value <= 5 
                j+=1 
            # set the whole streak to the value found at its end: True if > 5, False if <= 1 
            bool_df.loc[i:j,1] = bool_df.loc[j,1] 
            k = sum(bool_df.loc[bool_df.loc[:,0],0].index<j) 
    else: 
        k+=1
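
As one possible answer to the "better way" question, the explicit loop can be avoided by computing a per-segment maximum in NumPy. The sketch below is an assumption-laden illustration rather than a recommended implementation: it assumes a non-empty numeric series, and the variable names `group_id` and `group_max` are made up for the example.

```python
import numpy as np
import pandas as pd

ts = pd.Series(data=[0, 2, 0, 2, 3, 6, 3, 0])
values = ts.to_numpy()

# Segment label per element; the label increments at every zero.
group_id = np.cumsum(values == 0)
# One slot per label, initialised with the global minimum so any value can raise it.
group_max = np.full(group_id.max() + 1, values.min())
# Unbuffered in-place maximum: group_max[g] becomes the largest value with label g.
np.maximum.at(group_max, group_id, values)

# A value qualifies if it is above 1 and its segment contains a value above 5.
mask = pd.Series((values > 1) & (group_max[group_id] > 5), index=ts.index)
```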