Optimizing matching in R

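Question:

Hoping someone can help. I have a lot of ortholog mapping to do in R, and it is proving very time-consuming. I have posted an example structure below. The obvious answers, such as iterating row by row (for (i in 1:nrow(df))) with string splitting, or using sapply, have already been tried and are very slow. I am therefore hoping for a vectorized option.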

options(stringsAsFactors = FALSE)

# example accession mapping 
map <- data.frame(source = c("1", "2 4", "3", "4 6 8", "9"), 
        target = c("a b", "c", "d e f", "g", "h i")) 

# example protein list 
df <- data.frame(sourceIDs = c("1 2", "3", "4", "5", "8 9")) 

# now, map df$sourceIDs to map$target 


# expected output 
> matches 
[1] "a b c" "d e f" "g"     ""      "g h i"

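Any help is appreciated!

Answer:

In most cases, the best way to approach this kind of problem is to build data frames with one observation per row.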

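# split each column on spaces (the columns must be character, not factor)
map_split <- lapply(map, strsplit, split = ' ')
# build every source/target pair for each row of the map
long_mappings <- mapply(expand.grid, map_split$source, map_split$target, SIMPLIFY = FALSE)
all_map <- do.call(rbind, long_mappings)
names(all_map) <- c('source', 'target')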

Now all_map looks like this:

   source target
1       1      a
2       1      b
3       2      c
4       4      c
5       3      d
6       3      e
7       3      f
8       4      g
9       6      g
10      8      g
11      9      h
12      9      i
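
To see why this yields one row per source/target pair, here is a small sketch of what expand.grid returns for a single row of the map. The row with sources "4 6 8" and target "g" comes from the example; the values "x", "y", "p" and "q" are made up purely for illustration.

```r
# One row of the example map: sources "4 6 8" all point at target "g".
one_row <- expand.grid(
    source = c("4", "6", "8"),
    target = "g",
    stringsAsFactors = FALSE
)
print(one_row)  # three rows, one per source

# With several sources AND several targets, every combination appears.
# "x", "y", "p", "q" are illustrative values, not taken from the question.
pairs <- expand.grid(
    source = c("x", "y"),
    target = c("p", "q"),
    stringsAsFactors = FALSE
)
nrow(pairs)  # 2 sources * 2 targets
```

Because each map row is expanded independently, a source that occurs in several map rows (such as "4") simply ends up with several rows in all_map.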

Do the same for df...

sourceIDs_split <- strsplit(df$sourceIDs, ' ') 
df_long <- data.frame(
    index = rep(seq_along(sourceIDs_split), lengths(sourceIDs_split)), 
    source = unlist(sourceIDs_split) 
)

Which gives us this for df_long:

  index source
1     1      1
2     1      2
3     2      3
4     3      4
5     4      5
6     5      8
7     5      9

Now the two just need to be merged and collapsed.

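# left join: keep every row of df_long, even sources with no mapping
matches <- merge(df_long, all_map, by = 'source', all.x = TRUE)
# collapse the targets of each original entry; sort() drops the NAs
tapply(
    matches$target,
    matches$index,
    function(x) {
        paste0(sort(x), collapse = ' ')
    }
)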

#       1       2       3       4       5
# "a b c" "d e f"   "c g"      "" "g h i"
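
For reuse, the steps above can be wrapped in a single function. This is only a sketch: the name map_ids and its arguments are hypothetical, and it makes two additions that are not in the answer, namely unique() to drop repeated targets and an explicit factor level set so that entries without any ID still return "".

```r
# Hypothetical helper wrapping the split / merge / collapse steps.
# The name map_ids and its parameters are illustrative only.
map_ids <- function(ids, map_source, map_target, sep = " ") {
    # Long form of the map: one row per (source, target) pair.
    src <- strsplit(as.character(map_source), sep, fixed = TRUE)
    tgt <- strsplit(as.character(map_target), sep, fixed = TRUE)
    all_map <- do.call(rbind, Map(
        function(s, tg) expand.grid(source = s, target = tg,
                                    stringsAsFactors = FALSE),
        src, tgt
    ))

    # Long form of the query: one row per (index, source) pair.
    ids_split <- strsplit(as.character(ids), sep, fixed = TRUE)
    ids_long <- data.frame(
        index = rep(seq_along(ids_split), lengths(ids_split)),
        source = unlist(ids_split),
        stringsAsFactors = FALSE
    )

    # Left join, then collapse the targets of each original entry.
    merged <- merge(ids_long, all_map, by = "source", all.x = TRUE)
    idx <- factor(merged$index, levels = seq_along(ids_split))
    out <- as.vector(tapply(
        merged$target, idx,
        function(x) paste(sort(unique(x)), collapse = sep)
    ))
    out[is.na(out)] <- ""  # entries that contained no IDs at all
    out
}

result <- map_ids(
    ids = c("1 2", "3", "4", "5", "8 9"),
    map_source = c("1", "2 4", "3", "4 6 8", "9"),
    map_target = c("a b", "c", "d e f", "g", "h i")
)
print(result)
```

On the example data this should reproduce the tapply result shown above, returned as a plain character vector in the order of the input. Note that "4" maps to "c g" rather than the "g" listed in the question, because "4" occurs in two rows of the map.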

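`lapply(map, strsplit, split = ' ')` gives me an error. Does this work for you? – CPak

This is a great solution. Thank you. – user8173495

@ChiPak I assumed `options(stringsAsFactors = FALSE)` from the original example. If the columns of `map` are factors, my solution will not work. –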
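
If the columns of `map` do turn out to be factors (the data.frame default before R 4.0.0), one possible workaround is to convert them to character before splitting. The sketch below forces factor columns with stringsAsFactors = TRUE only to reproduce that situation.

```r
# Reproduce the problem case: character columns stored as factors.
map <- data.frame(
    source = c("1", "2 4", "3", "4 6 8", "9"),
    target = c("a b", "c", "d e f", "g", "h i"),
    stringsAsFactors = TRUE
)

# strsplit() rejects factors, so convert every column to character first.
# Assigning into map[] keeps map a data frame.
map[] <- lapply(map, as.character)

# The split from the answer now works unchanged.
map_split <- lapply(map, strsplit, split = " ")
str(map_split$source)
```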
