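R - Trimming data from a data frame to balance it

Problem description:

I have a data frame with 2600 entries spread across 249 factor levels (persons). The data set is not well balanced.

I want to remove all entries that belong to a factor level occurring fewer than 5 times. In addition, I want to trim the levels that occur more than 5 times down to 5 entries. So in the end I want a data frame that has fewer entries overall but is balanced with respect to the person factor.

The data set is built as follows: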

library(jpeg) # provides readJPEG()

file_list <- list.files("path/to/image/folder", full.names = TRUE)
# the folder contains 2600 images, which include information about the
# person factor in their file name
count <- length(file_list) # number of images

# basename() drops the directory part, so the split only sees the file name
file_names <- sapply(strsplit(basename(file_list), split = "_"), "[", 1)
person_list <- substr(file_names, 1, 3)
person_class <- as.factor(person_list)

imageWidth <- 320  # uniform pixel width of all images
imageHeight <- 280 # uniform pixel height of all images
variableCount <- imageHeight * imageWidth + 2

images <- as.data.frame(matrix(seq(count), nrow = count, ncol = variableCount))
images[1] <- person_class
images[2] <- eyepos_class # eyepos_class is not defined in this excerpt

for (i in seq_len(count)) {
    img <- readJPEG(file_list[i])
    image <- c(img) # flatten the pixel matrix into a vector
    images[i, 3:variableCount] <- image
}

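So basically I need to get the number of samples per factor level (for example the counts reported by summary(images[1])) and then perform an operation to trim the data. I really don't know how to proceed from here and would appreciate any help.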
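For reference, the per-level counts asked about here can be obtained with table(). The following is a minimal base R sketch on a made-up toy data frame; the level names p01 to p03 and the column V2 are assumptions for illustration, not the real image data:

```r
# Toy stand-in for `images`: V1 plays the role of the person factor.
# Level names and counts are invented for illustration.
toy <- data.frame(
  V1 = factor(rep(c("p01", "p02", "p03"), times = c(7, 5, 3))),
  V2 = seq_len(15)
)

# Number of rows per factor level
counts <- table(toy$V1)

# Levels that occur at least 5 times
keep_levels <- names(counts)[counts >= 5]

# Drop the rare levels, then number the rows within each remaining level
kept <- droplevels(toy[toy$V1 %in% keep_levels, ])
row_in_group <- ave(seq_len(nrow(kept)), kept$V1, FUN = seq_along)

# Keep only the first 5 rows of each level
balanced <- kept[row_in_group <= 5, ]

table(balanced$V1)
```

On the toy data this keeps p01 and p02 with 5 rows each and drops p03 entirely.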

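I know your data isn't small, but to write a good question that is reproducible, which will get you upvotes and answers, please provide a reproducible example that we can copy and paste to recreate your data and your problem. You can use a built-in data set or create your own and include the code you used. –

Well, I did my best to make it reproducible, but it still needs the data set, which is publicly available but slow af to download – 4ndro1d

Answer

An option using data.table: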

library(data.table) 
res <- setDT(images)[, if (.N >= 5) head(.SD, 5), by = V1]
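To check the behaviour without the image data, the same expression can be run on a small made-up table. This is only a sketch; the toy columns and level names are assumptions for illustration:

```r
library(data.table)

# Toy stand-in for `images`; values are invented for illustration.
toy <- data.table(
  V1 = factor(rep(c("p01", "p02", "p03"), times = c(7, 5, 3))),
  V2 = seq_len(15)
)

# For each level of V1: if the group has at least 5 rows, return its
# first 5 rows; otherwise the `if` yields NULL and the group is dropped.
res <- toy[, if (.N >= 5) head(.SD, 5), by = V1]

# Rows per remaining level
res[, .N, by = V1]
```

Here p03, which has only 3 rows, disappears from the result, and the other two levels are cut to 5 rows each.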

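That seems to work (it reduced the data set from 2639 -> 1090 objects). Thanks! – 4ndro1d

Maybe you can tell me why 'plot(res$V1)' works after that, but 'plot(res[1])' gives an error: 'Error in plot.new() : figure margins too large'? Shouldn't that be the same? – 4ndro1d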

@4ndro1d Subsetting in 'data.table' works slightly differently. 'res$V1' is a vector. You can use 'res[[1]]' to get the first column as a vector – akrun
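The difference can be seen on a small made-up table (the values below are assumptions for illustration). In a data.table, a single index inside [ ] selects rows, so res[1] is a one-row table, while res[[1]] extracts the first column:

```r
library(data.table)

# Small invented table for illustration
res <- data.table(V1 = factor(c("a", "a", "b")), V2 = 1:3)

# res[1] is the first ROW, still a data.table with all columns
first_row <- res[1]
is.data.table(first_row)
dim(first_row)

# res[[1]] is the first COLUMN as a plain vector (here a factor),
# the same object that res$V1 returns
first_col <- res[[1]]
identical(first_col, res$V1)
```

With a plain data.frame, res[1] would instead return the first column as a one-column data frame. In the data.table case, plot() receives a one-row table containing every column, which is a plausible cause of the margin error on a table with many thousands of columns.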

Answer

Using dplyr:

library(dplyr) 
group_by(images, V1) %>% # group by the V1 column 
    filter(n() >= 5) %>% # keep only groups with 5 or more rows 
    slice(1:5)   # keep only the first 5 rows in each group

You can assign the result to an ordinary object, for example my_desired_result = group_by(images, ...
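A minimal sketch of that assignment on made-up toy data (the data frame and the name balanced are assumptions for illustration). dplyr verbs return a new object and leave their input untouched, so the result only persists if it is stored:

```r
library(dplyr)

# Toy stand-in for `images`; values are invented for illustration.
toy <- data.frame(
  V1 = factor(rep(c("p01", "p02", "p03"), times = c(7, 5, 3))),
  V2 = seq_len(15)
)

balanced <- toy %>%
  group_by(V1) %>%        # group by the V1 column
  filter(n() >= 5) %>%    # keep only groups with 5 or more rows
  slice(1:5) %>%          # keep only the first 5 rows in each group
  ungroup()               # drop the grouping from the result

nrow(toy)       # the original data frame is unchanged
nrow(balanced)  # the trimmed copy lives in the new object
```

The ungroup() call at the end is optional; it is included so that later operations on balanced are not silently applied per group.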

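Do I have to assign the result to some variable? I tried that, with no result. The data frame doesn't change, and nothing gets stored in the variable either. But it looks very much like what I'm looking for. – 4ndro1d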

Need obj_name –

I tried that without success – 4ndro1d
