Beautifulsoup无法读取
问题描述:
页
我尝试以下操作:Beautifulsoup无法读取
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup
url = 'http://propaccess.traviscad.org/clientdb/Property.aspx?prop_id=312669'
soup = BeautifulSoup(urlopen(url).read())
print soup
以上print
声明显示如下:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-type" content="text/html;charset=utf-8" />
<title>Travis Property Search</title>
<style type="text/css">
body { text-align: center; padding: 150px; }
h1 { font-size: 50px; }
body { font: 20px Helvetica, sans-serif; color: #333; }
#article { display: block; text-align: left; width: 650px; margin: 0 auto; }
a { color: #dc8100; text-decoration: none; }
a:hover { color: #333; text-decoration: none; }
</style>
</head>
<body>
<div id="article">
<h1>Please try again</h1>
<div>
<p>Sorry for the inconvenience but your session has either timed out or the server is busy handling other requests. You may visit us on the the following website for information, otherwise please retry your search again shortly:<br /><br />
<a href="http://www.traviscad.org/">Travis Central Appraisal District Website</a> </p>
<p><b><a href="http://propaccess.traviscad.org/clientdb/?cid=1">Click here to reload the property search to try again</a></b></p>
</div>
</div>
</body>
</html>
我能够通过同一台计算机上的浏览器但访问URL所以服务器绝对不会阻止我的IP。我不明白我的代码有什么问题?
答
您需要先获取一些cookie,然后才能访问该网址。
虽然这可以用urllib2
和CookieJar
做,我建议requests
:
import requests
from BeautifulSoup import BeautifulSoup
url1 = 'http://propaccess.traviscad.org/clientdb/?cid=1'
url = 'http://propaccess.traviscad.org/clientdb/Property.aspx?prop_id=312669'
ses = requests.Session()
ses.get(url1)
soup = BeautifulSoup(ses.get(url).content)
print soup.prettify()
注意requests
是不是一个标准库,你将不得不英索尔它。 如果你想使用urllib2
:
import urllib2
from cookielib import CookieJar
url1 = 'http://propaccess.traviscad.org/clientdb/?cid=1'
url = 'http://propaccess.traviscad.org/clientdb/Property.aspx?prop_id=312669'
cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.open(url1)
soup = BeautifulSoup(opener.open(url).read())
print soup.prettify()
'从BeautifulSoup进口BeautifulSoup'应该不会是'从BS4进口BeautifulSoup'? –
@MD。 Khairul Basar是的,这就是我通常导入它的方式,但它可以工作。 –
你为什么要导入cookies ..我尝试过的其他例子从来不需要cookies – Zanam