Factor levels stay the same even after removing a level
Question:
I am working through the Kaggle Titanic machine learning dataset example and have run into the following problem. The error message is:
Error in predict.randomForest(modelFit, newtest) :
Type of predictors in new data do not match that of the training data.
Here is my full code:
#Load the libraries:
library(ggplot2)
library(randomForest)
#Load the data:
set.seed(1)
train <- read.csv("train.csv")
test <- read.csv("test.csv")
gendermodel <- read.csv("gendermodel.csv")
genderclassmodel <- read.csv("genderclassmodel.csv")
#Preprocess the data and feature extraction:
features <- c("Pclass", "Age", "Sex", "Parch", "SibSp", "Fare", "Embarked")
newtrain <- train[,features]
newtest <- test[,features]
newtrain$Embarked[newtrain$Embarked==""] <- "S"
newtrain$Fare[newtrain$Fare == 0] <- median(newtrain$Fare, na.rm=TRUE)
newtrain$Age[is.na(newtrain$Age)] <- -1
newtest$Embarked[newtest$Embarked==""] <- "S"
newtest$Fare[newtest$Fare == 0] <- median(newtest$Fare, na.rm=TRUE)
newtest$Fare <- ifelse(is.na(newtest$Fare), mean(newtest$Fare, na.rm = TRUE), newtest$Fare)
newtest$Age[is.na(newtest$Age)] <- -1
#Model building
modelFit <- randomForest(newtrain, as.factor(train$Survived), ntree = 100, importance = TRUE)
predictedOutput <- data.frame(PassengerID = test$PassengerId)
predictedOutput$Survived <- predict(modelFit, newtest)
write.csv(predictedOutput, file = "TitanicPrediction.csv", row.names=FALSE)
MDA <- importance(modelFit, type=1)
featureImportance <- data.frame(Feature = row.names(MDA), Importance = MDA[,1])
#Plots
g <- ggplot(featureImportance, aes(x=Feature, y=Importance)) + geom_bar(stat="identity") + xlab("Feature") + ylab("Importance") + ggtitle("Feature importance")
ggsave("FeatureImportance.png", p)
I understand what the error message means. When I run str(newtrain) and str(newtest), I get the output below, even after the assignment newtrain$Embarked[newtrain$Embarked==""] <- "S":
str(newtrain)
'data.frame': 891 obs. of 7 variables:
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Age : num 22 38 26 35 35 -1 54 2 27 14 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
> length(which(train$Embarked == ""))
[1] 2
> length(which(newtrain$Embarked == ""))
[1] 0
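When I count the empty Embarked values in train and newtrain, I get the correct output, as shown above, yet the "" level is still listed. I don't know where I am going wrong. Any help is much appreciated. Thanks!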
Answer
After your line
newtrain$Embarked[newtrain$Embarked==""] <- "S"
add:
newtrain$Embarked <- factor(newtrain$Embarked)
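This resets the factor levels of the modified newtrain$Embarked. Assigning "S" to the empty entries changes the values but leaves "" behind as an unused level; calling factor() again rebuilds the levels from the values that are actually present.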
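To see the behaviour in isolation, here is a minimal sketch on a toy vector. The names embarked and embarked_fixed are made up for illustration and stand in for the Embarked column; nothing here reads the Kaggle files.

```r
# Toy stand-in for the Embarked column (hypothetical data, not the Kaggle file).
embarked <- factor(c("S", "C", "", "Q", "S"))

# Overwrite the empty entries, as in the question.
embarked[embarked == ""] <- "S"

sum(embarked == "")   # no empty values remain
levels(embarked)      # but "" is still listed as a level

# Rebuild the factor so its levels reflect only the values present.
embarked_fixed <- factor(embarked)
levels(embarked_fixed)
```

If the training and test columns are both rebuilt this way and contain the same set of values, their levels should line up, which appears to be what predict() is checking for here.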
Also, in the last line of your posted code, p should be g.
Good luck with Kaggle!
It worked! Thanks a lot! :D – Gingerbread
If the problem is factor levels, have you tried droplevels? – aosmith
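A rough sketch of what the droplevels suggestion might look like, assuming a small made-up data frame df in place of newtrain. droplevels() is a base R function that removes unused levels from every factor column of a data frame in one call.

```r
# Hypothetical miniature of newtrain with two factor columns.
df <- data.frame(
  Sex      = factor(c("male", "female", "female")),
  Embarked = factor(c("S", "", "C"))
)

# Replace the empty entry; "" remains as an unused level.
df$Embarked[df$Embarked == ""] <- "S"

# Drop unused levels from all factor columns at once.
df <- droplevels(df)
levels(df$Embarked)
```

This is convenient when several factor columns need cleaning, whereas factor(x) handles one column at a time.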