Unable to retrieve the list of URLs
Question:
I am trying to use the script below. Why does it not retrieve the list of URLs for this site? It works for other sites.
At first I thought the problem was that robots.txt disallowed the crawl, but no error is returned when I run it.
import urllib
from bs4 import BeautifulSoup
import urlparse
import mechanize

url = "https://www.danmurphys.com.au"

br = mechanize.Browser()
br.set_handle_robots(False)
urls = [url]
visited = [url]
print
while len(urls) > 0:
    try:
        br.open(urls[0])
        urls.pop(0)
        for link in br.links():
            #print link
            #print "The base url is :" + link.base_url # just check there is this applicable to all sites.
            #print "The url is: " + link.url # This gives generally just the page name
            new_url = urlparse.urljoin(link.base_url, link.url)
            b1 = urlparse.urlparse(new_url).hostname
            b2 = urlparse.urlparse(new_url).path
            new_url = "http://" + b1 + b2
            if new_url not in visited and urlparse.urlparse(url).hostname in new_url:
                visited.append(new_url)
                urls.append(new_url)
                print new_url
    except:
        print "error"
        urls.pop(0)
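As an aside, the script above is Python 2 (`urlparse` module, `print` statements). The same URL-joining and hostname/path normalization step from the loop can be reproduced in Python 3, where `urlparse` moved into `urllib.parse`; a minimal sketch (the sample page and link values are illustrative, not from the site):

```python
from urllib.parse import urljoin, urlparse

# Hypothetical page and relative link, standing in for link.base_url and link.url.
base_url = "https://www.danmurphys.com.au/index.html"
link = "/shop/product/1"

# Resolve the relative link against the page it was found on.
new_url = urljoin(base_url, link)

# Rebuild the URL from hostname and path, as the question's script does.
parts = urlparse(new_url)
new_url = "http://" + parts.hostname + parts.path
print(new_url)  # http://www.danmurphys.com.au/shop/product/1
```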
Answer:
You need to scrape this URL with something else, for example Scrapy with scrapyJS, or PhantomJS, because the mechanize library does not execute JavaScript. If you read the response body:
r = br.open(urls[0])
html = r.read()
print html
you will see this in the output:
<noscript>Please enable JavaScript to view the page content.</noscript>
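That `<noscript>` fallback is the tell-tale sign of a JavaScript-rendered page. A quick way to detect this case before crawling, using the same BeautifulSoup library the question already imports, is a small heuristic check (the function name and sample HTML below are illustrative; this runs on a literal string, not a live fetch):

```python
from bs4 import BeautifulSoup

def looks_javascript_gated(html):
    """Heuristic: treat a page as JS-gated when its <noscript>
    fallback asks the visitor to enable JavaScript."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all("noscript"):
        if "enable javascript" in tag.get_text().lower():
            return True
    return False

# Sample body resembling what br.open(...).read() returned for this site.
sample = ("<html><body><noscript>Please enable JavaScript "
          "to view the page content.</noscript></body></html>")
print(looks_javascript_gated(sample))  # True
```

If this returns True for a site, hand the URL to a JavaScript-capable scraper instead of mechanize.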