Parsing an XML file into a data frame in R
Problem description:
XML data:
<HealthData locale="en_US">
<ExportDate value="2016-06-02 14:05:23 -0400"/>
<Me HKCharacteristicTypeIdentifierDateOfBirth="" HKCharacteristicTypeIdentifierBiologicalSex="HKBiologicalSexNotSet" HKCharacteristicTypeIdentifierBloodType="HKBloodTypeNotSet" HKCharacteristicTypeIdentifierFitzpatrickSkinType="HKFitzpatrickSkinTypeNotSet"/>
<Record type="HKQuantityTypeIdentifierStepCount" sourceName="Ryan Praskievicz iPhone" unit="count" creationDate="2014-10-02 08:30:17 -0400" startDate="2014-09-24 15:07:06 -0400" endDate="2014-09-24 15:07:11 -0400" value="7"/>
<Record type="HKQuantityTypeIdentifierStepCount" sourceName="Ryan Praskievicz iPhone" unit="count" creationDate="2014-10-02 08:30:17 -0400" startDate="2014-09-24 15:12:13 -0400" endDate="2014-09-24 15:12:18 -0400" value="15"/>
<Record type="HKQuantityTypeIdentifierStepCount" sourceName="Ryan Praskievicz iPhone" unit="count" creationDate="2014-10-02 08:30:17 -0400" startDate="2014-09-24 15:17:16 -0400" endDate="2014-09-24 15:17:21 -0400" value="20"/>
</HealthData>
> library(XML)
> doc <- xmlParse("\\pathtoXMLfile")
> list <-xpathApply(doc, "//HealthData/Record", xmlAttrs)
> df <- do.call(rbind.data.frame, list)
> str(df)
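I am trying to take the XML data sample shown above and load it into a data frame with the R code above, using the attribute names of each Record (type, sourceName, unit, endDate, value) as the column headers and the attribute values of each Record (count, 2014-09-24 15:07:11 -0400, 7) as the values of each row.
The call df <- do.call(rbind.data.frame, list) gets close, but it looks like it also binds all of the values into the column headers. If you run View(df) or str(df) you will see what I mean. How can I use the Record attribute names as the column header names?
Thanks, Ryan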
Answer
Consider xpathSApply() to retrieve the attributes, then use t() to transpose the result into a data frame:
library(XML)
xmlstr <- '<?xml version="1.0" encoding="UTF-8"?>
<HealthData locale="en_US">
<ExportDate value="2016-06-02 14:05:23 -0400"/>
<Me HKCharacteristicTypeIdentifierDateOfBirth="" HKCharacteristicTypeIdentifierBiologicalSex="HKBiologicalSexNotSet" HKCharacteristicTypeIdentifierBloodType="HKBloodTypeNotSet" HKCharacteristicTypeIdentifierFitzpatrickSkinType="HKFitzpatrickSkinTypeNotSet"/>
<Record type="HKQuantityTypeIdentifierStepCount" sourceName="Ryan Praskievicz iPhone" unit="count" creationDate="2014-10-02 08:30:17 -0400" startDate="2014-09-24 15:07:06 -0400" endDate="2014-09-24 15:07:11 -0400" value="7"/>
<Record type="HKQuantityTypeIdentifierStepCount" sourceName="Ryan Praskievicz iPhone" unit="count" creationDate="2014-10-02 08:30:17 -0400" startDate="2014-09-24 15:12:13 -0400" endDate="2014-09-24 15:12:18 -0400" value="15"/>
<Record type="HKQuantityTypeIdentifierStepCount" sourceName="Ryan Praskievicz iPhone" unit="count" creationDate="2014-10-02 08:30:17 -0400" startDate="2014-09-24 15:17:16 -0400" endDate="2014-09-24 15:17:21 -0400" value="20"/>
</HealthData>'
xml <- xmlParse(xmlstr)
recordAttribs <- xpathSApply(doc=xml, path="//HealthData/Record", xmlAttrs)
df <- data.frame(t(recordAttribs))
df
#                                type              sourceName  unit
# 1 HKQuantityTypeIdentifierStepCount Ryan Praskievicz iPhone count
# 2 HKQuantityTypeIdentifierStepCount Ryan Praskievicz iPhone count
# 3 HKQuantityTypeIdentifierStepCount Ryan Praskievicz iPhone count
#                creationDate                 startDate                   endDate
# 1 2014-10-02 08:30:17 -0400 2014-09-24 15:07:06 -0400 2014-09-24 15:07:11 -0400
# 2 2014-10-02 08:30:17 -0400 2014-09-24 15:12:13 -0400 2014-09-24 15:12:18 -0400
# 3 2014-10-02 08:30:17 -0400 2014-09-24 15:17:16 -0400 2014-09-24 15:17:21 -0400
#   value
# 1     7
# 2    15
# 3    20
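For comparison with the do.call() attempt in the question, here is a minimal base R sketch of the same idea. The attrs list is a hypothetical stand-in for what xpathApply(..., xmlAttrs) returns (one named character vector per Record), so this is an illustration rather than the answerer's code. Binding with plain rbind instead of rbind.data.frame keeps the attribute names as column names.

```r
# Hypothetical stand-ins for the named vectors xmlAttrs() yields per <Record>
attrs <- list(
  c(type = "HKQuantityTypeIdentifierStepCount", unit = "count", value = "7"),
  c(type = "HKQuantityTypeIdentifierStepCount", unit = "count", value = "15")
)

# rbind() stacks the vectors into a character matrix; the column names
# come from the names of the vectors
m <- do.call(rbind, attrs)

# Convert the matrix to a data frame, keeping strings as plain characters
df <- as.data.frame(m, stringsAsFactors = FALSE)

# XML attributes always arrive as strings, so convert numeric columns explicitly
df$value <- as.numeric(df$value)
```

Running str(df) afterwards should show type and unit as character columns and value as numeric.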
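If some attributes appear in some records but not in others, consider matching each record against a predetermined list of names and filling in NAs for the missing ones. Below are two versions using sapply(): one with a for loop, and one that passes the name list as a second argument: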
# Both versions assume recordAttribs is a list of named character vectors,
# which is what xpathSApply() returns when records differ in their attributes.
# Use one version or the other, not both in sequence.
recordnames <- c("type", "unit", "sourceName", "device", "sourceVersion",
                 "creationDate", "startDate", "endDate", "value")

# FOR LOOP VERSION
recordAttribs <- sapply(recordAttribs, function(i) {
  for (r in recordnames) {
    # i[r] is never NULL for a missing name, so test names(i) instead
    if (!(r %in% names(i))) i[r] <- NA
  }
  i <- i[recordnames]            # REORDER INNER VECTORS
  return(i)
})

# TWO ARGUMENT SAPPLY VERSION
recordAttribs <- sapply(recordAttribs, function(i, r) {
  i[setdiff(r, names(i))] <- NA  # ADD MISSING NAMES AS NA
  i <- i[r]                      # REORDER INNER VECTORS
  return(i)
}, recordnames)

df <- data.frame(t(recordAttribs))
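The padding step can also be tried without any XML at all. The sketch below uses a hypothetical records list and a shortened recordnames vector as stand-ins for the real attribute vectors. It relies on the fact that indexing a named vector by a name it lacks yields NA, so the only remaining work is restoring the labels.

```r
# Hypothetical stand-ins for per-record attribute vectors; the second
# record carries an extra "device" attribute that the first one lacks
records <- list(
  c(type = "StepCount", unit = "count", value = "7"),
  c(type = "StepCount", unit = "count", device = "iPhone", value = "15")
)
recordnames <- c("type", "unit", "device", "value")

# Indexing by name returns NA for absent names; setNames() restores the labels
padded <- lapply(records, function(i) setNames(i[recordnames], recordnames))

# Every vector now has the same length and order, so they bind cleanly
df <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
```

Here the first row should end up with NA in the device column, while the second keeps its value.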
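Answer
Another option is xmlAttrsToDataFrame, which should handle missing attributes. You can also select only the tags that have a particular attribute, such as device: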
XML:::xmlAttrsToDataFrame(xml["//Record"])
XML:::xmlAttrsToDataFrame(xml["//Record[@device]"])
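Because the ::: operator reaches into a package's unexported functions, some readers may prefer not to depend on it. A rough base R approximation, assuming the attribute names are not known in advance, is to take the union of names across all records and pad each record to it. The records list below is a hypothetical stand-in for the real attribute vectors, and this is only a sketch of the idea, not what xmlAttrsToDataFrame does internally.

```r
# Hypothetical stand-ins for per-record attribute vectors
records <- list(
  c(type = "StepCount", value = "7"),
  c(type = "StepCount", device = "iPhone", value = "15")
)

# Union of all attribute names, in order of first appearance
allnames <- unique(unlist(lapply(records, names)))

# Pad every record to the full set of names; absent attributes become NA
rows <- lapply(records, function(i) setNames(i[allnames], allnames))
df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)

# Keep only records that carry a device attribute,
# comparable in spirit to the XPath //Record[@device]
with_device <- df[!is.na(df$device), ]
```

Filtering after the fact is simpler to reason about, while the XPath predicate avoids building rows that will be discarded anyway.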
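This works great too. Thanks!
Thanks, it works perfectly on the test data I provided. When I went back and tried to apply it to the full data set, I realized that some records have 9 attributes rather than 7, and it no longer works. Any ideas?
Do you want to keep only the common attributes, or all of them? Do you know in advance which attributes to keep? – Parfait
Yes, I want to keep all 9 columns, with NAs filling in for the records that have only 7.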