Python: Parsing HTML with BeautifulSoup

Question:

<a href="/watch?gl=US&amp;client=mv-google&amp;hl=en&amp;v=0C_yXOhJxWg">Miss Black OCU 2011</a>

My program reads an HTML file, and the line above is a chunk of that file. I want to use BeautifulSoup in Python to get the text "Miss Black OCU 2011". Any suggestions?

Answer

I suggest looking at the Tag and NavigableString classes.

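from bs4 import BeautifulSoup

text = """<a href="/watch?gl=US&amp;client=mv-google&amp;hl=en&amp;v=0C_yXOhJxWg">Miss Black OCU 2011</a>"""
soup = BeautifulSoup(text, "html.parser")  # name the parser explicitly
print(soup.find('a').text)  # text inside the first <a> tag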
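
To make the Tag and NavigableString suggestion concrete, here is a minimal sketch assuming Python 3 and the bs4 package with the built-in "html.parser". The variable names (link, label, href) are only illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = '<a href="/watch?gl=US&amp;client=mv-google&amp;hl=en&amp;v=0C_yXOhJxWg">Miss Black OCU 2011</a>'
soup = BeautifulSoup(html, "html.parser")

link = soup.find("a")        # a Tag, or None when the page has no <a>
label = link.string          # the tag's single child, a NavigableString
href = link["href"]          # attributes are read like dictionary keys

is_tag = isinstance(link, Tag)
is_navigable = isinstance(label, NavigableString)
link_text = link.get_text()  # plain str with the tag's text content

print(link_text)
```

In a real page it is worth checking that find() did not return None before reading .string or .text.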

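Answer

You can also match the tag by its attributes. If the href attribute follows a pattern such as href="...watch...", you can solve the problem easily with re, Python's regular expression module.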

import re
from bs4 import BeautifulSoup

response = """<a href="/watch?gl=US&amp;client=mv-google&amp;hl=en&amp;v=0C_yXOhJxWg">Miss Black OCU 2011</a>"""
# response could instead be the body of a urlopen response if you search through a whole HTML page
soup = BeautifulSoup(response, "html.parser")
# find the first <a> whose href matches the pattern, then print its text
print(soup.find("a", {"href": re.compile(".*watch.*")}).text)

The output is:

Miss Black OCU 2011

The whole point is to find the right regular expression pattern. For more information about re, see http://docs.python.org/2/library/re.html
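
As a sketch of how the pattern choice plays out on a page with several links, the example below uses find_all with a compiled pattern. It assumes Python 3 and bs4. The second link ("/about") and the names watch_link and titles are hypothetical and added only for illustration. BeautifulSoup applies a compiled pattern with its search() method, so a leading or trailing ".*" is not needed; anchoring with "^" narrows the match to hrefs that start with "/watch?".

```python
import re
from bs4 import BeautifulSoup

# The second link is a made-up example of an href that should NOT match.
html = """
<a href="/watch?gl=US&amp;client=mv-google&amp;hl=en&amp;v=0C_yXOhJxWg">Miss Black OCU 2011</a>
<a href="/about">About</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Match only hrefs that begin with "/watch?".
watch_link = re.compile(r"^/watch\?")

# Collect the text of every matching <a> tag.
titles = [a.get_text(strip=True) for a in soup.find_all("a", href=watch_link)]
print(titles)
```

find_all returns an empty list when nothing matches, which avoids the AttributeError that find(...).text raises when no tag is found.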
