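R: Is there an equivalent of Stata's codebookout command?
In Stata, I can use the codebookout command to create an Excel workbook that stores the name, label, and storage type of every variable in the current dataset, together with the corresponding values and value labels.
I would like to find an equivalent function in R. So far I have come across the memisc package, which has a function called codebook, but it does not do the same thing as the Stata command.
For example, in Stata the codebook output looks like this (see below; this is what I want):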
Variable Name   Variable Label   Answer Label   Answer Code   Variable Type
hhid            hhid             Open ended                   String
inter_month     inter_month      Open ended                   long
year            year             Open ended                   long
org_unit        org_unit                                      long
                                 Balaka         1
                                 Blantyre       2
                                 Chikwawa       3
                                 Chiradzulu     4
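That is, each column of the data frame is evaluated to produce values for 5 different columns:
- Variable Name: the name of the column
- Variable Label: also the name of the column
- Answer Label: the unique values in the column. If there are no unique values, the variable is treated as open ended
- Answer Code: the number assigned to each category in Answer Label. Blank if the Answer Label is not categorical.
- Variable Type: int, str, long (date)...
Here is my attempt: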
CreateCodebook <- function(dF){
  numbercols <- length(colnames(dF))
  table <- data.frame()
  for (i in 1:length(colnames(dF))){
    AnswerCode <- if (sapply(dF, is.factor)[i]) 1:nrow(unique(dF[i])) else ""
    AnswerLabel <- if (sapply(dF, is.factor)[i]) unique(dF[order(dF[i]),][i]) else "Open ended"
    VariableName <- if (length(AnswerCode) - 1 > 1) c(colnames(dF)[i],
                      rep("",length(AnswerCode) - 1)) else colnames(dF)[i]
    VariableLabel <- if (length(AnswerCode) - 1 > 1) c(colnames(dF)[i],
                      rep("",length(AnswerCode) - 1)) else colnames(dF)[i]
    VariableType <- if (length(AnswerCode) - 1 > 1) c(sapply(dF, class)[i],
                      rep("",length(AnswerCode) - 1)) else sapply(dF, class)[i]
    df = data.frame(VariableName, VariableLabel, AnswerLabel, AnswerCode, VariableType)
    names(df) <- c("Variable Name", "Variable Label", "Variable Type", "Answer Code", "Answer Label")
    table <- rbind(table, df)
  }
  return(table)
}
Unfortunately, I get the following warning messages:
Warning messages:
1: In `[<-.factor`(`*tmp*`, ri, value = 1:3) :
invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, ri, value = 1:2) :
invalid factor level, NA generated
and the output I get has the answer codes and labels scrambled:
Variable Name Variable Label Variable Type Answer Code Answer Label
hhid hhid hhid Open ended character
month month month Open ended integer
year year year Open ended integer
org_unit org_unit org_unit Open ended character
v000 v000 v000 Open ended character
v001 v001 v001 Open ended integer
v002 v002 v002 Open ended integer
v003 v003 v003 Open ended integer
v005 v005 v005 Open ended integer
v006 v006 v006 Open ended integer
v007 v007 v007 Open ended integer
v021 v021 v021 Open ended numeric
2285 v024 v024 central <NA> factor
1 north <NA>
7119 south <NA>
11 v025 v025 rural <NA> factor
1048 v025 v025 urban <NA> factor
district_name district_name district_name Open ended character
coords_x1 coords_x1 coords_x1 Open ended numeric
coords_x2 coords_x2 coords_x2 Open ended numeric
itn_color itn_color itn_color Open ended numeric
piped piped piped Open ended numeric
sanit sanit sanit Open ended numeric
sanit_cd sanit_cd sanit_cd Open ended numeric
water water water Open ended numeric
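I decided to take a stab at this for my own amusement, using the built-in Titanic data set. However, I have a question about one of your definitions: you say "if there are no unique values, it is considered open ended". But every variable of length > 0 has some unique values: do you mean "if every value is unique"? Even that definition does not necessarily work as intended: in the Titanic data set the response is an integer count, and only 22 of the 32 values are unique. I don't think anyone would really want those enumerated, so I tested for factor type instead (if you really want the other rule, you can swap in the commented-out length(u)==length(x) line below).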
## utility function: pad vector with blanks to specified length
pad <- function(x,n,p="") {
  return(c(x,rep(p,n-length(x))))
}
## process a single column
proc_col <- function(x,nm) {
  u <- unique(x)
  ## if (length(u)==length(x)) {
  if (!is.factor(x)) {
    n <- 1
    u <- "open ended"
    cc <- ""
  } else {
    cc <- as.numeric(u)
    n <- length(u)
  }
  dd <- data.frame(`Variable Name`=pad(nm,n),
                   `Variable Label`=pad(nm,n),
                   `Answer Label`=u,
                   `Answer Code`=cc,
                   `Variable Type`=pad(class(x),n),
                   stringsAsFactors=FALSE)
  return(dd)
}
## process all columns
proc_df <- function(x) {
  L <- Map(proc_col,x,names(x))
  dd <- do.call(rbind,L)
  rownames(dd) <- NULL
  return(dd)
}
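The uniqueness figures quoted above are easy to check. The following is only a small sketch for comparing the two candidate rules ("every value is unique" versus "is a factor") on the same data; the helper names `all_unique` and `is_fac` are made up for this illustration.

```r
# Titanic ships with base R (datasets package); flatten the table to a data frame.
xx <- as.data.frame.table(Titanic)

# How many values does the Freq column hold, and how many are distinct?
n_total  <- length(xx$Freq)
n_unique <- length(unique(xx$Freq))
cat(n_total, "values,", n_unique, "unique\n")

# Rule A (hypothetical helper): a column is open ended only if every value is unique.
all_unique <- vapply(xx, function(x) length(unique(x)) == length(x), logical(1))

# Rule B (hypothetical helper): a column is enumerated only if it is a factor.
is_fac <- vapply(xx, is.factor, logical(1))

print(rbind(all_unique, is_fac))
```

Under rule A, Freq would not count as open ended, because its values repeat, so all of its distinct counts would be listed. Rule B avoids that, which is why the factor test is used in proc_col.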
Example:
xx <- as.data.frame.table(Titanic)
proc_df(xx)
##    Variable.Name Variable.Label Answer.Label Answer.Code Variable.Type
## 1          Class          Class          1st           1        factor
## 2                                        2nd           2
## 3                                        3rd           3
## 4                                       Crew           4
## 5            Sex            Sex         Male           1        factor
## 6                                     Female           2
## 7            Age            Age        Child           1        factor
## 8                                      Adult           2
## 9       Survived       Survived           No           1        factor
## 10                                       Yes           2
## 11          Freq           Freq   open ended                   numeric
I didn't leave blank space before the lists of code values and so on, but maybe you can make those adjustments yourself?
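Since codebookout writes an Excel workbook, one low-tech way to get the result into Excel is to write it out as CSV. This is only a sketch using base R: the data frame `cb` is typed in by hand to mimic the layout above, and the file name is a hypothetical placeholder. Writing a real .xlsx file would need an add-on package, which is not shown here.

```r
# A small codebook-shaped data frame, typed in by hand for illustration only.
cb <- data.frame(
  Variable.Name  = c("Sex", "", "Freq"),
  Variable.Label = c("Sex", "", "Freq"),
  Answer.Label   = c("Male", "Female", "open ended"),
  Answer.Code    = c("1", "2", ""),
  Variable.Type  = c("factor", "", "numeric"),
  stringsAsFactors = FALSE
)

# Hypothetical output location; replace with a real path as needed.
out_file <- file.path(tempdir(), "codebook.csv")
write.csv(cb, out_file, row.names = FALSE)

# Read the file back as text to confirm the blank cells survived.
back <- read.csv(out_file, colClasses = "character")
print(back)
```

In practice `cb` would be the result of `proc_df()` on your own data frame.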
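Thank you very much, Ben! I will definitely upvote this and accept the answer. For my own benefit, I also worked out a solution; I was very close, but I was getting a warning message.
Here is my own solution, a bit of a hack: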
CreateCodebook <- function(dF){
  numbercols <- length(colnames(dF))
  table <- data.frame()
  for (i in 1:length(colnames(dF))){
    AnswerCode <- if (sapply(dF, is.factor)[i]) 1:nrow(unique(dF[i])) else ""
    AnswerLabel <- if (sapply(dF, is.factor)[i]) unique(dF[order(dF[i]),][i]) else "Open ended"
    VariableName <- if (length(AnswerCode) > 1) c(colnames(dF)[i],
                      rep("",length(AnswerCode) - 1)) else colnames(dF)[i]
    VariableLabel <- if (length(AnswerCode) > 1) c(colnames(dF)[i],
                      rep("",length(AnswerCode) - 1)) else colnames(dF)[i]
    VariableType <- if (length(AnswerCode) > 1) c(sapply(dF, class)[i],
                      rep("",length(AnswerCode) - 1)) else sapply(dF, class)[i]
    df = data.frame(VariableName, VariableLabel, AnswerLabel, AnswerCode, VariableType, stringsAsFactors = FALSE)
    ## names must follow the same order as the columns passed to data.frame() above
    names(df) <- c("Variable Name", "Variable Label", "Answer Label", "Answer Code", "Variable Type")
    table <- rbind(table, df)
  }
  rownames(table) <- 1:nrow(table)
  return(table)
}
Output:
Variable Name Variable Label Answer Label Answer Code Variable Type
1 brid brid Open ended character
2 month month Open ended integer
3 year year Open ended integer
4 org_unit org_unit Open ended character
5 v000 v000 Open ended character
6 v001 v001 Open ended integer
7 v002 v002 Open ended integer
8 v003 v003 Open ended integer
9 v005 v005 Open ended integer
10 v006 v006 Open ended integer
11 v007 v007 Open ended integer
12 v021 v021 Open ended numeric
13 v024 v024 central 1 factor
14 north 2
15 south 3
16 v025 v025 rural 1 factor
17 urban 2
18 bidx bidx Open ended integer
19 district_name district_name Open ended character
20 coords_x1 coords_x1 Open ended numeric
21 coords_x2 coords_x2 Open ended numeric
22 anc4 anc4 Open ended numeric
23 antimal_48 antimal_48 Open ended numeric
24 carep carep Open ended numeric
25 csec csec Open ended numeric
26 dptv dptv Open ended numeric
27 ebreast ebreast Open ended numeric
28 fans_48 fans_48 Open ended numeric
29 ideliv ideliv Open ended numeric
30 iptp iptp Open ended numeric
31 iron90 iron90 Open ended numeric
32 measlesv measlesv Open ended numeric
33 ors ors Open ended numeric
34 ort ort Open ended numeric
35 pncwm pncwm Open ended numeric
36 sstools sstools Open ended numeric
37 tt tt Open ended numeric
38 vita vita Open ended numeric
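For anyone wondering why adding `stringsAsFactors = FALSE` makes the "invalid factor level" warning go away, here is a minimal sketch of what is probably going on. It assumes an R version before 4.0.0, where data.frame() turned strings into factors by default, so the blank Answer Code of the first rows became a factor whose only level was "". The names `codes` and `codes_chr` are made up for this illustration.

```r
# A factor whose only level is the empty string, mimicking a blank
# Answer Code column that data.frame() has converted to a factor.
codes <- factor(c("", ""))

# Assigning a value that is not an existing level raises the warning
# and would store NA; tryCatch captures the warning text.
res <- tryCatch({
  codes[2] <- 1L
  "no warning"
}, warning = function(w) conditionMessage(w))
print(res)

# A plain character vector has no levels to violate:
# the integer is simply coerced to the string "1".
codes_chr <- c("", "")
codes_chr[2] <- 1L
print(codes_chr)
```

With character columns, rbind() has nothing to reject, so the integer codes 1, 2, 3 are kept instead of being replaced by NA.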
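Can you show how you have tried to answer this so far? You could start by writing some code... (otherwise this is either "find an off-site resource" (off-topic) or "write the code for me" (off-topic)...)
I basically have a data frame (it could be any data frame, it doesn't matter) and I apply codebook to that df. But the output is not what I want.
Sorry, I read too quickly and didn't see that you had mentioned `memisc::codebook` in the original version of your question. Still, I'm afraid that (if you can't make more progress on your own) this question may not be suitable for SO, since you basically need a customized, very specific output.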