In R, how do I select rows based on statistics of a column attribute?

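Question:

My table has thousands of rows classified into 400 classes, plus a dozen or so columns. In R, how do I select rows based on statistics of a column attribute?

The ideal result is a table of 400 rows (one row per class), chosen by the maximum value of column "z" and containing all of the original columns.

Here is an example of my data. In this example I only need to extract rows 2, 4, 7, and 8 (marked with *), using R.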

            x          y     z cluster
1   712521.75 3637426.49 19.46      12
2   712520.69 3637426.47 19.66      12 *
3   712518.88 3637426.63 17.37     225
4   712518.40 3637426.48 19.42     225 *
5   712517.11 3637426.51 18.81     225
6   712515.70 3637426.58 17.80      17
7   712514.68 3637426.55 18.16      17 *
8   712513.58 3637426.55 18.23      50 *
9   712512.10 3637426.62 17.24      50
10  712513.93 3637426.88 18.08      50

I have tried many different combinations, including:

tapply(data$z, data$cluster, max)  # returns only the max value and cluster columns
which.max(data$z)                  # returns only the index of the max value in the entire table

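I have also read through the plyr package, but did not find a solution.

Answer

Thanks, everyone, for your help! aggregate() and merge() worked perfectly for me.

One important point: aggregate() returns only one row per cluster, even when several points are tied for the maximum, whereas merge() returns all of the tied points, because they share the same maximum value within a cluster.

That is ideal in this case, because the points are 3D and are not duplicates once the x and y coordinates are taken into account.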
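
The tie behaviour described above can be reproduced on a small data frame. This is only a sketch: the values below are made up for illustration and are not the poster's data.

```r
# Toy data (hypothetical values): cluster 1 has two points tied for the maximum z
df <- data.frame(
  x       = c(1, 2, 3, 4),
  y       = c(5, 6, 7, 8),
  z       = c(10, 10, 7, 9),
  cluster = c(1, 1, 2, 2)
)

# One row per cluster, holding only the maximum z
max_z <- aggregate(z ~ cluster, data = df, FUN = max)

# Joining back on the shared columns (cluster and z) returns every row
# that attains its cluster's maximum, so tied points all appear
max_rows <- merge(max_z, df)

nrow(max_z)     # number of clusters
nrow(max_rows)  # number of rows attaining a cluster maximum
```

If exactly one row per cluster is required, the tied rows would have to be reduced afterwards by some extra rule, which the thread does not specify.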

Here is my solution:

df  <- read.table("data.txt", header=TRUE, sep=",") 
attach(df) 
names(df) 
[1] "Row"   "x"   "y"   "z"   "cluster"

head(df) 
  Row        x       y     z cluster
1   1 712521.8 3637426 19.46     361
2   2 712520.7 3637426 19.66     361
3   3 712518.9 3637427 17.37     147
4   4 712518.4 3637426 19.42     147
5   5 712517.1 3637427 18.81     147
6   6 712515.7 3637427 17.80      42


new_table_a  <- aggregate(z ~ cluster, df, max) # output 400 rows, no duplicates 
new_table_b  <- merge(new_table_a, df)   # output 408 rows, includes duplicates of "z" 

head(new_table_b) 
  cluster     z  Row        x       y
1       1 20.44 6043 712416.2 3637478
2      10 26.09 1138 712458.4 3637511
3     100 19.39 6496 712423.4 3637485
4     101 25.74 2141 712521.2 3637488
5     102 17.33 2320 712508.2 3637484
6     103 21.01 6908 712462.2 3637493

Answer

A very simple approach is to use aggregate and merge:

> merge(aggregate(z ~ cluster, mydf, max), mydf) 
  cluster     z        x       y
1      12 19.66 712520.7 3637426
2      17 18.16 712514.7 3637427
3     225 19.42 712518.4 3637426
4      50 18.23 712513.6 3637427

You can even use the output of your tapply code to get what you need. Just turn it into a data.frame instead of a named vector.

> merge(mydf, data.frame(z = with(mydf, tapply(z, cluster, max)))) 
      z        x       y cluster
1 18.16 712514.7 3637427      17
2 18.23 712513.6 3637427      50
3 19.42 712518.4 3637426     225
4 19.66 712520.7 3637426      12

For a few more options, see the answers to this question.

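Just a warning: make sure you are using 'stats::aggregate' and not 'raster::aggregate'. (Ditto for 'merge'.) Under normal circumstances this is unlikely to be a problem, but one day it may trip you up :-) – 2013-04-29 13:05:55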
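
To make the warning above concrete, here is a minimal sketch of calling both functions with an explicit namespace, so the result does not depend on which packages happen to be attached. The data frame is made up for illustration. Note that in a standard R installation aggregate lives in the stats package while merge lives in base.

```r
# Small made-up data frame (hypothetical values)
df <- data.frame(
  id      = 1:4,
  z       = c(1, 5, 3, 4),
  cluster = c("a", "a", "b", "b")
)

# Explicit namespaces: these calls are unaffected by functions of the
# same name exported by other attached packages
max_z    <- stats::aggregate(z ~ cluster, data = df, FUN = max)
max_rows <- base::merge(max_z, df)

max_rows
```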

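Thank you very much for your help, aggregate() and merge() worked perfectly. The second example using tapply() did not work in my case and produced an unreasonably large table. It is worth mentioning that aggregate() does not keep duplicate values, whereas merge() does. – Inga 2013-04-29 18:01:42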

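@Inga, I don't completely understand your comment here, but my gut says you might want to read the help file for 'merge' to see how to control it better. In particular, the 'by' argument (which specifies which columns of each dataset are used for matching) should be used here to control duplicate values. Good luck, and welcome to Stack Overflow! – A5C1D2H2I1M1N2O1R2T1 2013-04-29 18:35:06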
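
As a sketch of how the 'by' argument changes the result: a data frame built directly from tapply output has only a z column (the cluster labels sit in its row names), so merge matches on z alone. Whether that is what produced the oversized table in the comment above is an assumption, but the effect is easy to show on made-up data in which one cluster's maximum also occurs as an ordinary value in another cluster.

```r
# Made-up data: z = 5 is the maximum of cluster "a" but also occurs in cluster "b"
df <- data.frame(
  id      = 1:5,
  z       = c(5, 3, 5, 9, 2),
  cluster = c("a", "a", "b", "b", "b")
)

max_z <- with(df, tapply(z, cluster, max))  # named by cluster label

# Keep the cluster label as a real column so it can take part in the match
lookup <- data.frame(cluster = names(max_z), z = as.vector(max_z))

# Matching on z alone also picks up the z = 5 row from cluster "b"
by_z <- merge(df, lookup["z"], by = "z")

# Matching on cluster and z keeps only rows that are maxima of their own cluster
by_both <- merge(df, lookup, by = c("cluster", "z"))

nrow(by_z)
nrow(by_both)
```

On real data with many clusters and repeated z values, the z-only match can return far more rows than there are clusters, which is why naming both columns in 'by' is the safer choice.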
