Beautifulsoup无法读取

Beautifulsoup无法读取

问题描述:

我尝试以下操作:Beautifulsoup无法读取

from urllib2 import urlopen 
from BeautifulSoup import BeautifulSoup 
url = 'http://propaccess.traviscad.org/clientdb/Property.aspx?prop_id=312669' 
soup = BeautifulSoup(urlopen(url).read()) 
print soup 

以上print声明显示如下:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" 
     "http://www.w3.org/TR/html4/loose.dtd"> 
<html> 
<head> 
<meta http-equiv="Content-type" content="text/html;charset=utf-8" /> 
<title>Travis Property Search</title> 
<style type="text/css"> 
     body { text-align: center; padding: 150px; } 
     h1 { font-size: 50px; } 
     body { font: 20px Helvetica, sans-serif; color: #333; } 
     #article { display: block; text-align: left; width: 650px; margin: 0 auto; } 
     a { color: #dc8100; text-decoration: none; } 
     a:hover { color: #333; text-decoration: none; } 
    </style> 
</head> 
<body> 
<div id="article"> 
<h1>Please try again</h1> 
<div> 
<p>Sorry for the inconvenience but your session has either timed out or the server is busy handling other requests. You may visit us on the the following website for information, otherwise please retry your search again shortly:<br /><br /> 
<a href="http://www.traviscad.org/">Travis Central Appraisal District Website</a> </p> 
<p><b><a href="http://propaccess.traviscad.org/clientdb/?cid=1">Click here to reload the property search to try again</a></b></p> 
</div> 
</div> 
</body> 
</html> 

我能够通过同一台计算机上的浏览器但访问URL所以服务器绝对不会阻止我的IP。我不明白我的代码有什么问题?

您需要先获取一些cookie,然后才能访问该网址。
虽然这可以用urllib2CookieJar做,我建议requests

import requests 
from BeautifulSoup import BeautifulSoup 

url1 = 'http://propaccess.traviscad.org/clientdb/?cid=1' 
url = 'http://propaccess.traviscad.org/clientdb/Property.aspx?prop_id=312669' 
ses = requests.Session() 
ses.get(url1) 
soup = BeautifulSoup(ses.get(url).content) 
print soup.prettify() 

注意requests是不是一个标准库,你将不得不英索尔它。 如果你想使用urllib2

import urllib2 
from cookielib import CookieJar 

url1 = 'http://propaccess.traviscad.org/clientdb/?cid=1' 
url = 'http://propaccess.traviscad.org/clientdb/Property.aspx?prop_id=312669' 
cj = CookieJar() 
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj)) 
opener.open(url1) 
soup = BeautifulSoup(opener.open(url).read()) 
print soup.prettify() 
+0

'从BeautifulSoup进口BeautifulSoup'应该不会是'从BS4进口BeautifulSoup'? –

+1

@MD。 Khairul Basar是的,这就是我通常导入它的方式,但它可以工作。 –

+0

你为什么要导入cookies ..我尝试过的其他例子从来不需要cookies – Zanam