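Changing factor levels - Unknown levels in `f` - unable to change levels

Problem description:

I have a factor with many industry names in it. I need to collapse them into broad categories and industries. Because I allowed respondents to answer however they wanted, I have many levels that mean the same thing (e.g. financial services, Financial Services, banking, finance). Since the cases don't match, each one comes out as a separate level, so I want to collapse them using forcats: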

library(forcats)

test <- fct_collapse(PrescreenF$Industry, Finance = c("Banking", 
    "Corporate Finance", "Finance", "Financial", "financial services", 
    "financial services", "Financial Services", "Financial services"), 
    NULL = "H")

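I get a warning saying that "Financial services" is unknown. This is really frustrating, because when I print the vector I can see that the level is there. I tried copying and pasting the exact word from the output, and retyping it, but it seems as if a hidden character is preventing it from being changed.

How do I collapse these values correctly?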

-> test$industry 
Banking 
Corporate Finance 
Finance 
Financial 
financial services 
financial services 
Financial Services 
Financial services

When I try to "revalue", say, the last level, "Financial services", it tells me it is an unknown string.

Edit: output of dput(x$industry)

structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 
4L, 3L, 3L, 3L, 5L, 7L, 8L, 9L, 10L, 11L, 12L, 12L, 13L, 14L, 
15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 16L, 16L, 16L, 16L, 
16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 
16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 17L, 18L, 18L, 18L, 
18L, 19L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 25L, 26L, 27L, 28L 
), .Label = c("", "{\"ImportId\":\"QID8_TEXT\"}", "Finance", 
"Financial ", "Financial services ", "Please indicate the industry you work in (e.g. technology, healthcare etc):", 
"Cleantech", "Delivery", "e-commerce/fashion", "Food", "Food & Bev", 
"Retail", "Service", "tech", "technology", "Technology", "IT, technology", 
"Software", "Technology ", "Tehcnology", "Consulting", "Digital advertising", 
"Education", "Higher education", "Technology, management consulting", 
"University professor; teaching, research and service", "Information Technology and Services", 
"mobile technology"), class = "factor")

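Edit: figured it out. Some of the terms have an extra space at the end. For example, when I call Prescreen$Industry it returns names such as "Banking" and "Corporate Finance", but it doesn't show me that there is a space after the level. Banking is actually "Banking ", with an invisible trailing space that doesn't show up in R's output. How can I make sure this is visible and doesn't happen again?

Can I run a length function on the column? If so, how would that work? PrescreenF$Industry("Banking")?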
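One way to make the hidden spaces visible, sketched here in base R on a small made-up factor (the data and the names `f` and `has_ws` are assumptions for illustration, not the asker's real column): `levels()` returns a character vector, which prints with quotes, and `nchar()` is R's string-length function.

```r
# Hypothetical toy factor; two of the levels carry a trailing space
f <- factor(c("Banking ", "Finance", "Financial services "))

# Printing a factor hides trailing spaces, but levels() returns a
# character vector, which prints in quotes so the spaces show up
print(levels(f))

# nchar() gives the length of each level, including any whitespace
print(nchar(levels(f)))

# Flag the levels that start or end with whitespace
has_ws <- grepl("^\\s|\\s$", levels(f))
print(levels(f)[has_ws])
```

Applied to the real data, the same calls would be run on `levels(PrescreenF$Industry)`.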

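Please share a reproducible example of your data so that we can help solve this.

If there are hidden characters, they are probably whitespace. `stringr::str_trim` can help, but you have to convert the factor to character first, then back to factor. – shea

Can you post the output of `dput(test$industry)` or `dput(head(test, 20))`?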

Answer

If `x` is your data frame:

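library(stringr)

# Work on the underlying strings rather than on the factor itself
x$industry <- as.character(x$industry)
# Strip leading and trailing whitespace
x$industry <- str_trim(x$industry)
# Convert back to a factor; values that are now identical share one level
x$industry <- as.factor(x$industry)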

Then you can go back to fct_collapse() to simplify your factor.
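For completeness, here is a minimal end-to-end sketch on made-up data (the vector `industry` and its values are assumptions that only mimic the levels in the `dput()` output). It uses `fct_relabel()` with base R's `trimws()` as an alternative to the character round trip, then collapses the cleaned levels.

```r
library(forcats)

# Hypothetical toy responses; several carry a trailing space
industry <- factor(c("Finance", "Financial ", "Financial services ",
                     "Banking", "Technology", "Technology "))

# Trim whitespace on the levels; levels that become identical are merged
industry_clean <- fct_relabel(industry, trimws)

# The cleaned names now match exactly, so the collapse finds every level
industry_collapsed <- fct_collapse(
  industry_clean,
  Finance = c("Banking", "Finance", "Financial", "Financial services")
)

print(levels(industry_collapsed))
print(table(industry_collapsed))
```

Trimming before collapsing means the level names passed to `fct_collapse()` can be typed as they appear on screen, without having to reproduce invisible spaces.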
