How do I validate XML with multiple namespaces in Python?

Problem description:

I'm trying to write some unit tests in Python 2.7 to validate some extensions I've made to the OAI-PMH schema: http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd

The problem I'm running into involves multiple nested namespaces, and is caused by this definition in the XSD mentioned above:

<complexType name="metadataType"> 
    <annotation> 
     <documentation>Metadata must be expressed in XML that complies 
     with another XML Schema (namespace=#other). Metadata must be 
     explicitly qualified in the response.</documentation> 
    </annotation> 
    <sequence> 
     <any namespace="##other" processContents="strict"/> 
    </sequence> 
</complexType> 

Here is a snippet of the code I'm using. It ends with the error shown after the code:

from lxml import etree
import urllib2

query = "http://localhost:8080/OAI-PMH?verb=GetRecord&by_doc_ID=false&metadataPrefix=nsdl_dc&identifier=http://www.purplemath.com/modules/ratio.htm"
# xml_headers is not shown in this snippet; it is assumed to be a dict of
# request headers defined elsewhere in the test module.
schema_file = open("../schemas/OAI/2.0/OAI-PMH.xsd", "r")
schema_doc = etree.parse(schema_file)
oaischema = etree.XMLSchema(schema_doc)

request = urllib2.Request(query, headers=xml_headers)
response = urllib2.urlopen(request)
body = response.read()
response_doc = etree.fromstring(body)

try:
    oaischema.assertValid(response_doc)
except etree.DocumentInvalid as e:
    # Print the response with line numbers so the reported line can be located.
    for line, text in enumerate(body.split("\n"), 1):
        print "{0}\t{1}".format(line, text)
    print(e.message)

AssertionError: http://localhost:8080/OAI-PMH?verb=GetRecord&by_doc_ID=false&metadataPrefix=nsdl_dc&identifier=http://www.purplemath.com/modules/ratio.htm 
Element '{http://www.openarchives.org/OAI/2.0/oai_dc/}oai_dc': No matching global element declaration available, but demanded by the strict wildcard., line 22 

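I understand the error: the schema requires the child element of the metadata element to be strictly validated, and the sample XML does provide such an element.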

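I have already written a validator in Java, and it works. It would be helpful to have this in Python, though, since the rest of the solution I'm building is Python based. To make the Java variant work, I had to make my DocumentFactory namespace aware; otherwise I got the same error. I haven't found any working example in Python that performs this validation correctly.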

Does anyone have an idea how I can get an XML document with multiple nested namespaces, like my sample doc, to validate with Python?

Here is the sample XML document I'm trying to validate:

<?xml version="1.0" encoding="UTF-8"?> 
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" 
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
    xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ 
    http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"> 
    <responseDate>2002-02-08T08:55:46Z</responseDate> 
    <request verb="GetRecord" identifier="oai:arXiv.org:cs/0112017" 
     metadataPrefix="oai_dc">http://arXiv.org/oai2</request> 
    <GetRecord> 
    <record> 
    <header> 
     <identifier>oai:arXiv.org:cs/0112017</identifier> 
     <datestamp>2001-12-14</datestamp> 
     <setSpec>cs</setSpec> 
     <setSpec>math</setSpec> 
    </header> 
    <metadata> 
     <oai_dc:dc 
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" 
    xmlns:dc="http://purl.org/dc/elements/1.1/" 
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
    xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ 
    http://www.openarchives.org/OAI/2.0/oai_dc.xsd"> 
    <dc:title>Using Structural Metadata to Localize Experience of 
      Digital Content</dc:title> 
    <dc:creator>Dushay, Naomi</dc:creator> 
    <dc:subject>Digital Libraries</dc:subject> 
    <dc:description>With the increasing technical sophistication of 
     both information consumers and providers, there is 
     increasing demand for more meaningful experiences of digital 
     information. We present a framework that separates digital 
     object experience, or rendering, from digital object storage 
     and manipulation, so the rendering can be tailored to 
     particular communities of users. 
    </dc:description> 
    <dc:description>Comment: 23 pages including 2 appendices, 
     8 figures</dc:description> 
    <dc:date>2001-12-14</dc:date> 
     </oai_dc:dc> 
    </metadata> 
    </record> 
</GetRecord> 
</OAI-PMH> 
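
Since the strict wildcard needs a declaration for every foreign namespace it meets, it can help to first list which namespaces a response actually declares. The following is a minimal sketch in Python 3 using only the standard library; the `declared_namespaces` helper and the trimmed `SAMPLE` document are illustrative names, not part of the original code.

```python
import io
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the sample response (illustrative only).
SAMPLE = b"""<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord><record><metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Example</dc:title>
    </oai_dc:dc>
  </metadata></record></GetRecord>
</OAI-PMH>"""


def declared_namespaces(xml_bytes):
    """Map each namespace URI declared in the document to its first prefix."""
    found = {}
    # "start-ns" events yield (prefix, uri) pairs; the default namespace
    # is reported with an empty prefix.
    for _event, (prefix, uri) in ET.iterparse(io.BytesIO(xml_bytes),
                                              events=("start-ns",)):
        found.setdefault(uri, prefix)
    return found


namespaces = declared_namespaces(SAMPLE)
for uri, prefix in namespaces.items():
    print(repr(prefix), uri)
```

Each URI listed this way is a namespace for which the validator would presumably need a schema if the element appears under a strict wildcard.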

As best I can tell at this point, it seems to be a bug in libxml2 that is preventing validation of the nested namespaces. – Jim 2011-03-28 23:50:34
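
To narrow down the strict-wildcard behaviour, here is a minimal sketch in Python 3 with lxml. Everything in it is made up for illustration: the namespaces `urn:example:outer` and `urn:example:inner`, the three tiny schema files, and the `load_schema` helper. The "outer" schema stands in for OAI-PMH.xsd and the "inner" one for a metadata schema such as oai_dc.xsd. The sketch compares validating against the outer schema alone with validating against a wrapper schema that imports both. Whether the same wrapper approach carries over to the real OAI-PMH schemas is an assumption, not something verified here.

```python
import os
import tempfile
from lxml import etree

# Hypothetical stand-in schemas, written to a temporary directory below.
SCHEMAS = {
    "outer.xsd": """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="urn:example:outer"
           elementFormDefault="qualified">
  <xs:element name="metadata">
    <xs:complexType>
      <xs:sequence>
        <xs:any namespace="##other" processContents="strict"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
""",
    "inner.xsd": """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="urn:example:inner"
           elementFormDefault="qualified">
  <xs:element name="title" type="xs:string"/>
</xs:schema>
""",
    # The wrapper declares nothing itself; it only imports both namespaces.
    "wrapper.xsd": """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:import namespace="urn:example:outer" schemaLocation="outer.xsd"/>
  <xs:import namespace="urn:example:inner" schemaLocation="inner.xsd"/>
</xs:schema>
""",
}

DOC = etree.fromstring(b"""\
<metadata xmlns="urn:example:outer">
  <i:title xmlns:i="urn:example:inner">hello</i:title>
</metadata>
""")


def load_schema(directory, name):
    """Compile the schema file `name` found in `directory`."""
    return etree.XMLSchema(etree.parse(os.path.join(directory, name)))


with tempfile.TemporaryDirectory() as tmp:
    for name, text in SCHEMAS.items():
        with open(os.path.join(tmp, name), "w") as handle:
            handle.write(text)

    outer_only = load_schema(tmp, "outer.xsd")
    wrapped = load_schema(tmp, "wrapper.xsd")

    outer_only_ok = outer_only.validate(DOC)   # no declaration for inner ns
    outer_only_log = str(outer_only.error_log)
    wrapped_ok = wrapped.validate(DOC)         # both namespaces are known

print("outer schema only:", outer_only_ok)
print(outer_only_log)
print("wrapper schema:", wrapped_ok)
```

If the two results differ, that would suggest the validator is simply missing a global element declaration for the foreign namespace, rather than mishandling namespaces as such.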

I found this in lxml's doc on validation:

>>> schema_root = etree.XML('''\ 
... <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"> 
...  <xsd:element name="a" type="xsd:integer"/> 
... </xsd:schema> 
... ''') 
>>> schema = etree.XMLSchema(schema_root) 

>>> parser = etree.XMLParser(schema = schema) 
>>> root = etree.fromstring("<a>5</a>", parser) 

So perhaps this is what you need? (See the last two lines.)

schema_doc = etree.parse(schema_file) 
oaischema = etree.XMLSchema(schema_doc) 

request = urllib2.Request(query, headers=xml_headers) 
response = urllib2.urlopen(request) 
body = response.read() 
parser = etree.XMLParser(schema = oaischema) 
response_doc = etree.fromstring(body, parser)
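
When a schema is attached to the parser like this, validation failures surface at parse time as `etree.XMLSyntaxError` rather than through `assertValid`. The sketch below (Python 3, reusing the small integer schema from the lxml documentation) shows one way to catch that; `parse_validated` is a hypothetical helper name. Note that the parser uses the same compiled schema, so it may well report the same strict-wildcard error for the OAI-PMH response; this only illustrates how the failure is reported.

```python
from lxml import etree

schema = etree.XMLSchema(etree.XML(
    '<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">'
    '<xsd:element name="a" type="xsd:integer"/>'
    '</xsd:schema>'
))
parser = etree.XMLParser(schema=schema)


def parse_validated(data):
    """Return (root, None) on success, or (None, message) on failure."""
    try:
        return etree.fromstring(data, parser), None
    except etree.XMLSyntaxError as exc:
        # Raised both for malformed XML and for schema violations.
        return None, str(exc)


ok_root, ok_error = parse_validated(b"<a>5</a>")
bad_root, bad_error = parse_validated(b"<a>no int</a>")

print(ok_root.tag, ok_error)
print(bad_root, bad_error)
```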