Parsing HTML in Python with lxml and XPath
I want to parse values from an HTML form in Python using lxml and XPath.
Here is my HTML data:
<table>
<tr>
<td class="u"><input class="wide" name="record[13][name]" value="exampledomain1.com"></td>
<td class="u">
<select name="record[13][type]">
<option SELECTED value="A" >A</option>
<option value="AAAA" >AAAA</option>
<option value="CNAME" >CNAME</option>
<option value="HINFO" >HINFO</option>
<option value="MX" >MX</option>
<option value="NAPTR" >NAPTR</option>
<option value="NS" >NS</option>
<option value="PTR" >PTR</option>
<option value="SOA" >SOA</option>
<option value="SPF" >SPF</option>
<option value="SRV" >SRV</option>
<option value="SSHFP" >SSHFP</option>
<option value="TXT" >TXT</option>
<option value="RP" >RP</option>
<option value="URL" >URL</option>
<option value="MBOXFW" >MBOXFW</option>
<option value="CURL" >CURL</option>
</select>
</td>
<td class="u"><input class="wide" name="record[13][content]" value='10.10.10.1'></td>
<td class="u"><input class="wide" name="record[14][name]" value="exampledomain2.com"></td>
<td class="u">
<select name="record[14][type]">
<option SELECTED value="CNAME" >A</option>
<option value="AAAA" >AAAA</option>
<option value="CNAME" >CNAME</option>
<option value="HINFO" >HINFO</option>
<option value="MX" >MX</option>
<option value="NAPTR" >NAPTR</option>
<option value="NS" >NS</option>
<option value="PTR" >PTR</option>
<option value="SOA" >SOA</option>
<option value="SPF" >SPF</option>
<option value="SRV" >SRV</option>
<option value="SSHFP" >SSHFP</option>
<option value="TXT" >TXT</option>
<option value="RP" >RP</option>
<option value="URL" >URL</option>
<option value="MBOXFW" >MBOXFW</option>
<option value="CURL" >CURL</option>
</select>
</td>
<td class="u"><input class="wide" name="record[14][content]" value='exampledomain1.com'></td>
<td class="u"><input class="wide" name="record[15][name]" value="exampledomain3.com"></td>
<td class="u">
<select name="record[15][type]">
<option SELECTED value="A" >A</option>
<option value="AAAA" >AAAA</option>
<option value="CNAME" >CNAME</option>
<option value="HINFO" >HINFO</option>
<option value="MX" >MX</option>
<option value="NAPTR" >NAPTR</option>
<option value="NS" >NS</option>
<option value="PTR" >PTR</option>
<option value="SOA" >SOA</option>
<option value="SPF" >SPF</option>
<option value="SRV" >SRV</option>
<option value="SSHFP" >SSHFP</option>
<option value="TXT" >TXT</option>
<option value="RP" >RP</option>
<option value="URL" >URL</option>
<option value="MBOXFW" >MBOXFW</option>
<option value="CURL" >CURL</option>
</select>
</td>
<td class="u"><input class="wide" name="record[15][content]" value='10.10.10.3'></td>
</tr>
</table>
What I want is to parse the values and print them like this:
exampledomain1.com A 10.10.10.1
exampledomain2.com CNAME exampledomain1.com
exampledomain3.com A 10.10.10.3
Here is what I tried:
#!/usr/bin/python
import lxml.html
from lxml import etree
doc = lxml.html.document_fromstring("""Here whole html data""")
txt1 = doc.xpath('//*[@class="wide"]/@value')
txt2 = doc.xpath('//@SELECTED/text()')
print(txt1)
print(txt2)
But it doesn't work the way I want. Any help would be appreciated.
Thanks, everyone.
I fixed the code so that it returns the following, which is very close to what you asked for:
(py26_default)[[email protected] ~]$ python parse.py
exampledomain1.com 10.10.10.1
exampledomain2.com exampledomain1.com
exampledomain3.com 10.10.10.3
(py26_default)[[email protected] ~]$
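I could not retrieve record[13][type] using XPath. There are other ways to iterate through this, but I leave that as an exercise for the OP. Note that I fixed the OP's HTML to include the <table> and <tr> tags.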
from lxml import etree
from lxml.etree import XMLParser

# recover=True lets the XML parser continue past the malformed HTML
parser = XMLParser(ns_clean=True, recover=True)
doc = etree.fromstring("""Here whole html data""", parser)
elem1 = doc.xpath('//input[@name="record[13][name]"]')
# NOTE: I could not retrieve <option SELECTED> with XPath under this XML parser;
# the bare SELECTED attribute has no value, which XML does not allow.
#elem2 = doc.xpath('//select[@name="record[13][type]"]/option[@SELECTED]')
elem3 = doc.xpath('//input[@name="record[13][content]"]')
# Pair each name with the content input at the same position
for idx, val in enumerate(elem1):
    print("%s %s" % (val.attrib['value'], elem3[idx].attrib['value']))
<!-- The (fixed) html source I used -->
<table>
<tr>
<td class="u"><input class="wide" name="record[13][name]" value="exampledomain1.com"></td>
<td class="u">
<select name="record[13][type]">
<option SELECTED value="A" >A</option>
<option value="AAAA" >AAAA</option>
<option value="CNAME" >CNAME</option>
<option value="HINFO" >HINFO</option>
<option value="MX" >MX</option>
<option value="NAPTR" >NAPTR</option>
<option value="NS" >NS</option>
<option value="PTR" >PTR</option>
<option value="SOA" >SOA</option>
<option value="SPF" >SPF</option>
<option value="SRV" >SRV</option>
<option value="SSHFP" >SSHFP</option>
<option value="TXT" >TXT</option>
<option value="RP" >RP</option>
<option value="URL" >URL</option>
<option value="MBOXFW" >MBOXFW</option>
<option value="CURL" >CURL</option>
</select>
</td>
<td class="u"><input class="wide" name="record[13][content]" value='10.10.10.1'></td>
<td class="u"><input class="wide" name="record[13][name]" value="exampledomain2.com"></td>
<td class="u">
<select name="record[13][type]">
<option SELECTED value="CNAME" >A</option>
<option value="AAAA" >AAAA</option>
<option value="CNAME" >CNAME</option>
<option value="HINFO" >HINFO</option>
<option value="MX" >MX</option>
<option value="NAPTR" >NAPTR</option>
<option value="NS" >NS</option>
<option value="PTR" >PTR</option>
<option value="SOA" >SOA</option>
<option value="SPF" >SPF</option>
<option value="SRV" >SRV</option>
<option value="SSHFP" >SSHFP</option>
<option value="TXT" >TXT</option>
<option value="RP" >RP</option>
<option value="URL" >URL</option>
<option value="MBOXFW" >MBOXFW</option>
<option value="CURL" >CURL</option>
</select>
</td>
<td class="u"><input class="wide" name="record[13][content]" value='exampledomain1.com'></td>
<td class="u"><input class="wide" name="record[13][name]" value="exampledomain3.com"></td>
<td class="u">
<select name="record[13][type]">
<option SELECTED value="A" >A</option>
<option value="AAAA" >AAAA</option>
<option value="CNAME" >CNAME</option>
<option value="HINFO" >HINFO</option>
<option value="MX" >MX</option>
<option value="NAPTR" >NAPTR</option>
<option value="NS" >NS</option>
<option value="PTR" >PTR</option>
<option value="SOA" >SOA</option>
<option value="SPF" >SPF</option>
<option value="SRV" >SRV</option>
<option value="SSHFP" >SSHFP</option>
<option value="TXT" >TXT</option>
<option value="RP" >RP</option>
<option value="URL" >URL</option>
<option value="MBOXFW" >MBOXFW</option>
<option value="CURL" >CURL</option>
</select>
</td>
<td class="u"><input class="wide" name="record[13][content]" value='10.10.10.3'></td>
</tr>
</table>
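On the SELECTED attribute: as far as I can tell, the difficulty comes from using an XML parser. When the same markup is parsed with lxml.html, attribute names are lowercased and a bare attribute is kept, so the chosen option can be matched with `option[@selected]`. A minimal sketch, assuming a trimmed-down copy of the markup (the `HTML` constant below is an illustrative stand-in, not the full page):

```python
import lxml.html

# Trimmed stand-in for the question's markup; only a few options are kept.
HTML = """
<table><tr>
<td class="u"><input class="wide" name="record[13][name]" value="exampledomain1.com"></td>
<td class="u">
<select name="record[13][type]">
<option SELECTED value="A" >A</option>
<option value="AAAA" >AAAA</option>
<option value="CNAME" >CNAME</option>
</select>
</td>
<td class="u"><input class="wide" name="record[13][content]" value='10.10.10.1'></td>
</tr></table>
"""

doc = lxml.html.fromstring(HTML)

# The HTML parser stores the attribute name in lowercase, so match @selected.
selected = doc.xpath('//select[@name="record[13][type]"]/option[@selected]/@value')
print(selected)
```

Reading `@value` rather than `text()` matters here, because the option's value and its visible label are not guaranteed to be the same.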
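Hi Mike, the number in name="record[13]..." is different for each of the other DNS records; I have corrected this in the HTML above. So in that case '//input[@name="record[13][name]"]' won't capture the records with different numbers. Can I define a wildcard or a range in it? – Manish 2012-08-01 15:01:42
You can use [lxml's regular expression support](http://*.com/a/2756994/667301) to solve this – 2012-08-01 15:26:26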
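A minimal sketch of that suggestion, assuming lxml's EXSLT regular-expression functions. The `re` prefix, the `EXSLT_RE` mapping name and the trimmed `HTML` sample are illustrative choices, not taken from the thread:

```python
import lxml.html

# Trimmed stand-in for the question's markup with two differently numbered records.
HTML = """
<table><tr>
<td><input class="wide" name="record[13][name]" value="exampledomain1.com"></td>
<td><input class="wide" name="record[13][content]" value="10.10.10.1"></td>
<td><input class="wide" name="record[14][name]" value="exampledomain2.com"></td>
<td><input class="wide" name="record[14][content]" value="exampledomain1.com"></td>
</tr></table>
"""

# Namespace URI of the EXSLT regular-expressions module, bound to the prefix "re".
EXSLT_RE = {"re": "http://exslt.org/regular-expressions"}

doc = lxml.html.fromstring(HTML)

# Match every input whose name looks like record[<digits>][name].
names = doc.xpath(
    r"//input[re:test(@name, '^record\[\d+\]\[name\]$')]/@value",
    namespaces=EXSLT_RE,
)
print(names)
```

The square brackets have to be escaped in the pattern because they would otherwise be read as a character class.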
Thanks Mike, I got it working with regular expressions, but I'm still stuck on getting the SELECTED value – Manish 2012-08-02 16:13:49
# Assumes `tree` was built with lxml.html, e.g. tree = lxml.html.fromstring(html_text)
# name and content are <input> elements, so read their value attribute;
# for the type, take the value of the <option> that carries the selected attribute.
record_13_name = tree.xpath("//input[@name='record[13][name]']/@value")
record_13_type = tree.xpath("//select[@name='record[13][type]']/option[@selected]/@value")
record_13_content = tree.xpath("//input[@name='record[13][content]']/@value")
record_14_name = tree.xpath("//input[@name='record[14][name]']/@value")
record_14_type = tree.xpath("//select[@name='record[14][type]']/option[@selected]/@value")
record_14_content = tree.xpath("//input[@name='record[14][content]']/@value")
record_15_name = tree.xpath("//input[@name='record[15][name]']/@value")
record_15_type = tree.xpath("//select[@name='record[15][type]']/option[@selected]/@value")
record_15_content = tree.xpath("//input[@name='record[15][content]']/@value")
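Hard-coding each record number does not scale to more records. As a rough sketch, the record index can be pulled out of each name attribute and reused to look up the matching type and content. The helper `extract_records`, the `NAME_RE` pattern and the trimmed `HTML` sample are assumptions made for illustration:

```python
import re

import lxml.html

# Trimmed stand-in for the question's markup; only a few options are kept.
HTML = """
<table><tr>
<td><input class="wide" name="record[13][name]" value="exampledomain1.com"></td>
<td><select name="record[13][type]">
<option SELECTED value="A" >A</option>
<option value="CNAME" >CNAME</option>
</select></td>
<td><input class="wide" name="record[13][content]" value='10.10.10.1'></td>
<td><input class="wide" name="record[14][name]" value="exampledomain2.com"></td>
<td><select name="record[14][type]">
<option SELECTED value="CNAME" >A</option>
<option value="AAAA" >AAAA</option>
</select></td>
<td><input class="wide" name="record[14][content]" value='exampledomain1.com'></td>
<td><input class="wide" name="record[15][name]" value="exampledomain3.com"></td>
<td><select name="record[15][type]">
<option SELECTED value="A" >A</option>
<option value="CNAME" >CNAME</option>
</select></td>
<td><input class="wide" name="record[15][content]" value='10.10.10.3'></td>
</tr></table>
"""

# Captures the numeric index from names such as record[13][name].
NAME_RE = re.compile(r"^record\[(\d+)\]\[name\]$")


def extract_records(html):
    """Return a list of (name, type, content) tuples, one per record index."""
    doc = lxml.html.fromstring(html)
    rows = []
    for inp in doc.xpath("//input[@name]"):
        match = NAME_RE.match(inp.get("name"))
        if not match:
            continue  # skip the [content] inputs; they are looked up below
        idx = match.group(1)
        # $n is an XPath variable that lxml fills from the keyword argument.
        rtype = doc.xpath(
            "//select[@name=$n]/option[@selected]/@value",
            n="record[%s][type]" % idx,
        )
        content = doc.xpath(
            "//input[@name=$n]/@value",
            n="record[%s][content]" % idx,
        )
        rows.append((
            inp.get("value"),
            rtype[0] if rtype else "",
            content[0] if content else "",
        ))
    return rows


rows = extract_records(HTML)
for row in rows:
    print(" ".join(row))
```

Passing the name through an XPath variable avoids having to quote the square brackets inside the expression string.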
Running "xmllint --noout" on your HTML reports 7 errors. You should fix them before parsing it. – 2012-07-31 16:33:17
How exactly does it "not work the way you want"? – 2012-07-31 17:11:49
Use BeautifulSoup. It's simple and easy – Surya 2012-08-01 14:55:47