Extracting a deeply nested XML structure
Problem description:
I have the XML file below, which I want to analyze in R. The XML is deeply nested, and its elements have varying numbers of child nodes.
<?xml version="1.0" encoding="UTF-8"?>
<Alert date="20161223_2" type="full">
<Records>
<Person Id="100">
<PersonNameDetails>
<PersonNames id="Name1">
<ReferenceGroup ReferenceGroupCode="ABC"/>
<ReferenceGroup ReferenceGroupCode="DEF"/>
<PersonNameValue>
<FirstName>Carl Bangouvounda</FirstName>
<Surname>Toziz</Surname>
</PersonNameValue>
</PersonNames>
<PersonNames id="Name2">
<ReferenceGroup ReferenceGroupCode="ABC"/>
<ReferenceGroup ReferenceGroupCode="GHI" ReferenceGroupLanguageCode="en"/>
<ReferenceGroup ReferenceGroupCode="JKL"/>
<ReferenceGroup ReferenceGroupCode="MNO"/>
<ReferenceGroup ReferenceGroupCode="DEF"/>
<PersonNameValue>
<FirstName>Tozize</FirstName>
<Surname>Bangouvonda</Surname>
</PersonNameValue>
</PersonNames>
<PersonNames id="Name3">
<ReferenceGroup ReferenceGroupCode="MNO"/>
<PersonNameValue>
<FirstName>Carol</FirstName>
<Surname>Tozize</Surname>
</PersonNameValue>
</PersonNames>
<PersonNames id="Name4">
<ReferenceGroup ReferenceGroupCode="PQR"/>
<ReferenceGroup ReferenceGroupCode="MNO"/>
<PersonNameValue>
<FirstName>Carol</FirstName>
<MiddleName>Bangouvonda</MiddleName>
<Surname>Tozize</Surname>
</PersonNameValue>
</PersonNames>
<PersonNames id="Name5">
<ReferenceGroup ReferenceGroupCode="GHI" ReferenceGroupLanguageCode="en"/>
<ReferenceGroup ReferenceGroupCode="JKL"/>
<ReferenceGroup ReferenceGroupCode="DEF"/>
<PersonNameValue>
<FirstName>Carl Bangouvonda</FirstName>
<Surname>Toziz</Surname>
</PersonNameValue>
</PersonNames>
</PersonNameDetails>
</Person>
</Records>
</Alert>
The expected output is as follows:
-----------------------------------------------------------
Id | id | ReferenceGroup | FirstName | MiddleName | Surname
-----------------------------------------------------------
100 | Name1 | ABC, DEF | Carl Bangouvounda | NA | Toziz
-----------------------------------------------------------
100 | Name2 | ABC, GHI, JKL, MNO, DEF | Tozize | NA | Bangouvonda
-----------------------------------------------------------
100 | Name3 | MNO | Carol | NA | Tozize
-----------------------------------------------------------
100 | Name4 | PQR, MNO | Carol | Bangouvonda | Tozize
-----------------------------------------------------------
100 | Name5 | GHI, JKL, DEF | Carl Bangouvonda | NA | Toziz
-----------------------------------------------------------
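Id is an attribute of the Person element, and everything else comes from PersonNameDetails. I also want to concatenate the ReferenceGroupCode values within the same PersonNames element into a single string.
Following a suggestion, I wrote an XSLT transformation with the code below: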
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" method="xml"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/Alert ">
<xsl:copy>
<xsl:apply-templates select="Records"/>
</xsl:copy>
</xsl:template>
<xsl:template match="Records">
<xsl:apply-templates select="Person"/>
</xsl:template>
<xsl:template match="Person">
<xsl:apply-templates select="PersonNameDetails"/>
</xsl:template>
<xsl:template match="PersonNameDetails">
<xsl:apply-templates select="PersonNames"/>
</xsl:template>
<xsl:template match="PersonNames">
<xsl:apply-templates select="PersonNameValue"/>
</xsl:template>
<xsl:template match="PersonNameValue">
<PersonNameValue>
<Id><xsl:value-of select="ancestor::Person/@Id"/></Id>
<id><xsl:value-of select="ancestor::PersonNames/@id"/></id>
<xsl:copy-of select="FirstName"/>
<MiddleName><xsl:value-of select="MiddleName"/></MiddleName>
<Surname><xsl:value-of select="Surname"/></Surname>
<ReferenceGroupCode><xsl:value-of select="ancestor::PersonNames/ReferenceGroup/@ReferenceGroupCode"/></ReferenceGroupCode>
</PersonNameValue>
</xsl:template>
</xsl:transform>
How can I change the XSLT code so that the ReferenceGroup output becomes
<ReferenceGroupCode>ABC,DEF</ReferenceGroupCode>
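Any help is highly appreciated.
Answer
I'm not sure about XSLT, but you can run XPath on the PersonNames nodes and write a function that handles missing values or multiple values.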
library(XML)

# Parse the file and select every PersonNames node
doc <- xmlParse("<your XML file>")
x <- getNodeSet(doc, "//PersonNames")

# Return NA when nothing matches; otherwise join all matches with ", "
xpath2 <- function(x, ...) {
  y <- xpathSApply(x, ...)
  if (length(y) == 0) NA_character_ else paste(y, collapse = ", ")
}

# One row per PersonNames node; each XPath is relative to that node
y <- data.frame(
  id             = sapply(x, xpath2, ".", xmlGetAttr, "id"),
  ReferenceGroup = sapply(x, xpath2, ".//ReferenceGroup", xmlGetAttr, "ReferenceGroupCode"),
  FirstName      = sapply(x, xpath2, ".//FirstName", xmlValue),
  MiddleName     = sapply(x, xpath2, ".//MiddleName", xmlValue),
  Surname        = sapply(x, xpath2, ".//Surname", xmlValue)
)
id ReferenceGroup FirstName MiddleName Surname
1 Name1 ABC, DEF Carl Bangouvounda <NA> Toziz
2 Name2 ABC, GHI, JKL, MNO, DEF Tozize <NA> Bangouvonda
3 Name3 MNO Carol <NA> Tozize
4 Name4 PQR, MNO Carol Bangouvonda Tozize
5 Name5 GHI, JKL, DEF Carl Bangouvonda <NA> Toziz
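To make the helper's behavior concrete, here is a minimal sketch that runs it on a single PersonNames node parsed from an inline string. The miniature document and the variable names (xml_text, demo_doc, node) are made up for illustration; only xpath2 mirrors the function above.

```r
library(XML)

# Hypothetical miniature input: two reference codes, no MiddleName
xml_text <- '<PersonNames id="Name1">
  <ReferenceGroup ReferenceGroupCode="ABC"/>
  <ReferenceGroup ReferenceGroupCode="DEF"/>
  <PersonNameValue><FirstName>Carl</FirstName><Surname>Toziz</Surname></PersonNameValue>
</PersonNames>'
demo_doc <- xmlParse(xml_text, asText = TRUE)
node <- getNodeSet(demo_doc, "//PersonNames")[[1]]

# Same idea as the helper in the answer: NA for no match, joined string otherwise
xpath2 <- function(x, ...) {
  y <- xpathSApply(x, ...)
  if (length(y) == 0) NA_character_ else paste(y, collapse = ", ")
}

codes  <- xpath2(node, ".//ReferenceGroup", xmlGetAttr, "ReferenceGroupCode")  # several matches
first  <- xpath2(node, ".//FirstName", xmlValue)                               # one match
middle <- xpath2(node, ".//MiddleName", xmlValue)                              # no match
```

Here codes should come back as the single string "ABC, DEF", first as "Carl", and middle as NA, which is what lets every column of the data frame have exactly one value per PersonNames node.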
Maybe add the person ID by counting the number of PersonNames nodes under each Person?
n <- xpathSApply(doc, "//Person/PersonNameDetails", xmlSize)
y$ID <- rep(xpathSApply(doc, "//Person", xmlGetAttr, "Id"), n)
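Note that xmlSize counts every child node of PersonNameDetails, so the repeat counts are only right when PersonNames elements are its only children. A possible alternative, sketched below, is to look the Id up from each PersonNames node directly. It assumes PersonNames always sits exactly two levels below Person (Person > PersonNameDetails > PersonNames), as in the posted XML; the two-person document and the names doc2, nodes and result are invented for the example.

```r
library(XML)

# Hypothetical two-person document; the second Person has a single name entry
xml_text <- '<Records>
  <Person Id="100"><PersonNameDetails>
    <PersonNames id="Name1"/><PersonNames id="Name2"/>
  </PersonNameDetails></Person>
  <Person Id="200"><PersonNameDetails>
    <PersonNames id="Name1"/>
  </PersonNameDetails></Person>
</Records>'
doc2 <- xmlParse(xml_text, asText = TRUE)
nodes <- getNodeSet(doc2, "//PersonNames")

# Walk up two levels (PersonNames -> PersonNameDetails -> Person) and read Id
# Assumption: the nesting depth is fixed, as in the question's XML
person_id <- sapply(nodes, function(n) xmlGetAttr(xmlParent(xmlParent(n)), "Id"))
name_id   <- sapply(nodes, xmlGetAttr, "id")

result <- data.frame(Id = person_id, id = name_id)
```

Because each row reads its own parent, the Id column stays aligned with the name rows even when the persons have different numbers of PersonNames entries.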
I don't want to convert the XML with XSLT. Can you tell me what information you need to solve this XML parsing problem?