Parsing metadata properties from ADO.NET Data Services XML in Python
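I want to get some XML into a pandas DataFrame before loading the data into a database table. I have looked at ElementTree and lxml, but the examples are very simple and I can't seem to extrapolate from them to something more complex. I understand XML; I just don't know how to drill down to the part I need. An example is below.
I am after the contents of <m:properties>. So NEW_DATE = 1997-01-02T00:00:00, BC_1YEAR = 5.630000114440918, and so on are what go into the database. Note that BC_1MONTH is NULL and is not written like the other nodes.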
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<feed xml:base="http://data.treasury.gov/Feed.svc/" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata" xmlns="http://www.w3.org/2005/Atom">
<title type="text">DailyTreasuryYieldCurveRateData</title>
<id>http://data.treasury.gov/feed.svc/DailyTreasuryYieldCurveRateData</id>
<updated>2017-10-30T20:31:53Z</updated>
<link rel="self" title="DailyTreasuryYieldCurveRateData" href="DailyTreasuryYieldCurveRateData" />
<entry>
<id>http://data.treasury.gov/Feed.svc/DailyTreasuryYieldCurveRateData(1)</id>
<title type="text"></title>
<updated>2017-10-30T20:31:53Z</updated>
<author>
<name />
</author>
<link rel="edit" title="DailyTreasuryYieldCurveRateDatum" href="DailyTreasuryYieldCurveRateData(1)" />
<category term="TreasuryDataWarehouseModel.DailyTreasuryYieldCurveRateDatum" scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" />
<content type="application/xml">
<m:properties>
<d:Id m:type="Edm.Int32">1</d:Id>
<d:NEW_DATE m:type="Edm.DateTime">1997-01-02T00:00:00</d:NEW_DATE>
<d:BC_1MONTH m:type="Edm.Double" m:null="true" />
<d:BC_3MONTH m:type="Edm.Double">5.190000057220459</d:BC_3MONTH>
<d:BC_6MONTH m:type="Edm.Double">5.3499999046325684</d:BC_6MONTH>
<d:BC_1YEAR m:type="Edm.Double">5.630000114440918</d:BC_1YEAR>
<d:BC_2YEAR m:type="Edm.Double">5.96999979019165</d:BC_2YEAR>
<d:BC_3YEAR m:type="Edm.Double">6.130000114440918</d:BC_3YEAR>
<d:BC_5YEAR m:type="Edm.Double">6.3000001907348633</d:BC_5YEAR>
<d:BC_7YEAR m:type="Edm.Double">6.4499998092651367</d:BC_7YEAR>
<d:BC_10YEAR m:type="Edm.Double">6.5399999618530273</d:BC_10YEAR>
<d:BC_20YEAR m:type="Edm.Double">6.8499999046325684</d:BC_20YEAR>
<d:BC_30YEAR m:type="Edm.Double">6.75</d:BC_30YEAR>
<d:BC_30YEARDISPLAY m:type="Edm.Double">0</d:BC_30YEARDISPLAY>
</m:properties>
</content>
</entry>
</feed>
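If you have a link to a good article on this, that would be appreciated too.
Here is my working code (the error message further down is what I received from Duffy's code):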
import xml.etree.ElementTree as ET
import pandas as pd

xml_data = open('/path/user_agents.xml').read()

def xml2df(xml_data):
    root = ET.XML(xml_data)  # element tree
    all_records = []
    for i, child in enumerate(root):
        record = {}
        for subchild in child:
            record[subchild.tag] = subchild.text
        all_records.append(record)  # one record per child of the root
    return pd.DataFrame(all_records)
Error message:
Traceback (most recent call last):
File "C:/Users/Bob/Desktop/temp/yield curve script.py", line 25, in <module>
xml2dict(xml_data)
File "C:/Users/Bob/Desktop/temp/yield curve script.py", line 13, in xml2dict
root = lxml.etree.parse(xml_file)
File "src\lxml\lxml.etree.pyx", line 3427, in lxml.etree.parse (src\lxml\lxml.etree.c:81100)
File "src\lxml\parser.pxi", line 1811, in lxml.etree._parseDocument (src\lxml\lxml.etree.c:117831)
File "src\lxml\parser.pxi", line 1837, in lxml.etree._parseDocumentFromURL (src\lxml\lxml.etree.c:118178)
File "src\lxml\parser.pxi", line 1741, in lxml.etree._parseDocFromFile (src\lxml\lxml.etree.c:117090)
File "src\lxml\parser.pxi", line 1138, in lxml.etree._BaseParser._parseDocFromFile (src\lxml\lxml.etree.c:111636)
File "src\lxml\parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:105092)
File "src\lxml\parser.pxi", line 706, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:106800)
File "src\lxml\parser.pxi", line 633, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:105611)
OSError: Error reading file '<?xml version="1.0" encoding="utf-8" standalone="yes"?>
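The last line of the traceback suggests what went wrong: `lxml.etree.parse()` expects a filename, URL, or file-like object, and here it appears to have been handed the XML text itself, which it then tried to open as a file. Below is a minimal sketch of the distinction. It uses the standard-library `xml.etree.ElementTree` purely so that it runs without extra packages (an assumption on my part, not what the traceback used); `lxml.etree` has the same `fromstring()` / `parse()` split, though it wants bytes rather than `str` when the text carries an encoding declaration. `SAMPLE` is a made-up stand-in for the feed.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the feed text that was read from disk.
SAMPLE = '<?xml version="1.0" encoding="utf-8"?><feed><entry>1</entry></feed>'

# XML that is already in memory: parse it from the string.
root_from_text = ET.fromstring(SAMPLE)

# parse() wants a path or a file-like object, so wrap the bytes in BytesIO.
root_from_file = ET.parse(io.BytesIO(SAMPLE.encode("utf-8"))).getroot()

# Handing the XML text to parse() makes it look for a file with that name.
try:
    ET.parse(SAMPLE)
    parse_error = None
except OSError as exc:
    parse_error = exc

print(root_from_text.tag, root_from_file.tag, type(parse_error).__name__)
```

So either pass the path to `parse()`, or keep reading the file yourself and switch to `fromstring()`.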
import lxml.etree
import datetime

nsmap = {
    'm': 'http://schemas.microsoft.com/ado/2007/08/dataservices/metadata',
    'd': 'http://schemas.microsoft.com/ado/2007/08/dataservices'
}

# Attribute names in Clark notation: {namespace-uri}localname
m_null = ('{%s}null' % nsmap['m'])
m_type = ('{%s}type' % nsmap['m'])

# Converters keyed by the value of the m:type attribute
type_handlers = {
    'Edm.Double': float,
    'Edm.Int32': int,
    # an explicit format string works on both Python 2 and Python 3
    'Edm.DateTime': lambda s: datetime.datetime.strptime(s, "%Y-%m-%dT%H:%M:%S"),
}

def xml2dict(xml_file):
    # xml_file must be a filename or file-like object, not the XML text itself
    root = lxml.etree.parse(xml_file)
    result = {}
    for properties_el in root.xpath('//m:properties', namespaces=nsmap):
        for child in properties_el:  # iterate over the child elements
            tag = child.tag.split('}', 1)[-1]  # split the namespace off the tag
            if child.attrib.get(m_null) == 'true':
                value = None
            else:
                value = child.text
                type_handler = type_handlers.get(child.attrib.get(m_type))
                if type_handler is not None:
                    value = type_handler(value)
            result[tag] = value
    return result
...correctly returns, for your data:
{'BC_10YEAR': 6.539999961853027,
'BC_1MONTH': None,
'BC_1YEAR': 5.630000114440918,
'BC_20YEAR': 6.849999904632568,
'BC_2YEAR': 5.96999979019165,
'BC_30YEAR': 6.75,
'BC_30YEARDISPLAY': 0.0,
'BC_3MONTH': 5.190000057220459,
'BC_3YEAR': 6.130000114440918,
'BC_5YEAR': 6.300000190734863,
'BC_6MONTH': 5.349999904632568,
'BC_7YEAR': 6.449999809265137,
'Id': 1,
'NEW_DATE': datetime.datetime(1997, 1, 2, 0, 0)}
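Since the real feed holds many `<entry>` elements and the goal is a DataFrame, one dict per entry is probably more useful than a single merged dict. The following is a rough sketch of that idea, not the answer's own method: it uses the standard-library `xml.etree.ElementTree` so it runs without lxml, and the names `feed_to_records`, `CONVERTERS`, and `SAMPLE_FEED` are hypothetical. The first entry is trimmed from the question's XML; the second entry's values are invented placeholders.

```python
import datetime
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "m": "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata",
    "d": "http://schemas.microsoft.com/ado/2007/08/dataservices",
}
M_NULL = "{%s}null" % NS["m"]
M_TYPE = "{%s}type" % NS["m"]

# Converters keyed by the m:type attribute; unknown types stay as strings.
CONVERTERS = {
    "Edm.Int32": int,
    "Edm.Double": float,
    "Edm.DateTime": lambda s: datetime.datetime.strptime(s, "%Y-%m-%dT%H:%M:%S"),
}

def feed_to_records(xml_text):
    """Return one dict per <entry>, keyed by property name without namespace."""
    root = ET.fromstring(xml_text)
    records = []
    for props in root.iterfind("atom:entry/atom:content/m:properties", NS):
        record = {}
        for child in props:
            name = child.tag.split("}", 1)[-1]
            if child.get(M_NULL) == "true" or child.text is None:
                record[name] = None  # m:null="true" or an empty element
            else:
                convert = CONVERTERS.get(child.get(M_TYPE), str)
                record[name] = convert(child.text)
        records.append(record)
    return records

# First entry trimmed from the question; second entry uses made-up values.
SAMPLE_FEED = """<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<feed xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
      xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"
      xmlns="http://www.w3.org/2005/Atom">
  <entry><content type="application/xml"><m:properties>
    <d:Id m:type="Edm.Int32">1</d:Id>
    <d:NEW_DATE m:type="Edm.DateTime">1997-01-02T00:00:00</d:NEW_DATE>
    <d:BC_1MONTH m:type="Edm.Double" m:null="true" />
    <d:BC_3MONTH m:type="Edm.Double">5.190000057220459</d:BC_3MONTH>
  </m:properties></content></entry>
  <entry><content type="application/xml"><m:properties>
    <d:Id m:type="Edm.Int32">2</d:Id>
    <d:NEW_DATE m:type="Edm.DateTime">1997-01-03T00:00:00</d:NEW_DATE>
    <d:BC_1MONTH m:type="Edm.Double" m:null="true" />
    <d:BC_3MONTH m:type="Edm.Double">5.25</d:BC_3MONTH>
  </m:properties></content></entry>
</feed>"""

records = feed_to_records(SAMPLE_FEED)
for record in records:
    print(record)
```

If pandas is installed, passing the list to `pd.DataFrame(records)` should then give one row per entry, with the `None` values treated as missing.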
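Something isn't quite right. On 'root = lxml.etree.parse(xml_file)' I get a message saying "Cannot find reference 'etree' in imported module lxml". I tried pip install lxml, but it says it is already installed. When I run it, it errors out. The XML I posted is only part of the file, although as far as I can tell it is well-formed XML. The whole file is here: https://www.treasury.gov/resource-center/data-chart-center/interest-rates/Pages/TextView.aspx?data=yield (select time period = ALL at the bottom of the select list). –
'import lxml.etree', not just 'import lxml'. –
Yes, I have the import statement right. I just ran it against the posted XML and it failed as well. This is on Python 3, if that helps. I will update the question with the error message. –
How attached are you to using the upstream ElementTree? [lxml.etree](http://lxml.de/tutorial.html) would make this much easier. –
Get '//{http://schemas.microsoft.com/ado/2007/08/dataservices/metadata}properties' and iterate over the child nodes. –
I was going to pass '{'m':'http://schemas.microsoft.com/ado/2007/08/dataservices/metadata'}' as the nsmap, but that works too. :) –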
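The last two comments describe two equivalent ways of addressing a namespaced element: writing the namespace URI out in braces, or using a prefix together with a prefix-to-URI mapping. A small sketch of both, assuming the standard-library `xml.etree.ElementTree` and a made-up minimal document (`DOC`) that reuses the feed's two namespaces. As far as I know, in lxml the braces form works with `find()` / `findall()` / `iterfind()`, while `xpath()` needs the prefix form.

```python
import xml.etree.ElementTree as ET

M_URI = "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"
D_URI = "http://schemas.microsoft.com/ado/2007/08/dataservices"

# Hypothetical minimal document using the same two namespaces as the feed.
DOC = (
    '<content xmlns:m="%s" xmlns:d="%s">'
    "<m:properties><d:Id>1</d:Id><d:BC_30YEAR>6.75</d:BC_30YEAR></m:properties>"
    "</content>" % (M_URI, D_URI)
)
root = ET.fromstring(DOC)

# Option 1: spell out the namespace URI in braces.
by_uri = root.findall(".//{%s}properties" % M_URI)

# Option 2: use a prefix and pass a prefix-to-URI mapping.
by_prefix = root.findall(".//m:properties", {"m": M_URI})

# Both locate the same element; iterate its children and drop the namespace.
names = [child.tag.split("}", 1)[-1] for child in by_uri[0]]
print(names)
```

The prefix in the mapping is local to your code, so it does not have to match the prefix used in the document; only the URI matters.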