Scrapy generic scraper
Problem description:
I am trying to build a generic scraper with Scrapy - though it seems a bit janky. The idea is that it should take a URL as input and scrape pages only from that particular site, but it seems to wander off to sites such as YouTube. Ideally it would also have a depth option, allowing 1, 2, 3, etc. as the number of links to follow away from the initial page. Any ideas on how to achieve this?
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib
from route import urls
import pickle
import os
import urllib2
import urlparse

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)
    return u" ".join(t.strip() for t in visible_texts)

def getAllUrl(url):
    try:
        page = urllib2.urlopen(url).read()
    except:
        return []
    urlList = []
    try:
        soup = BeautifulSoup(page)
        soup.prettify()
        for anchor in soup.findAll('a', href=True):
            if not 'http://' in anchor['href']:
                if urlparse.urljoin(url, anchor['href']) not in urlList:
                    urlList.append(urlparse.urljoin(url, anchor['href']))
            else:
                if anchor['href'] not in urlList:
                    urlList.append(anchor['href'])
        length = len(urlList)
        return urlList
    except urllib2.HTTPError, e:
        print e

def listAllUrl(url):
    urls_new = list(set(url))
    return urls_new

count = 0
main_url = str(raw_input('Enter the url : '))
url_split = main_url.split('.', 1)
folder_name = url_split[1]
txtfile_split = folder_name.split('.', 1)
txtfile_name = txtfile_split[0]
url = getAllUrl(main_url)
urls_new = listAllUrl(url)
os.makedirs('c:/Scrapy/Extracted/' + folder_name + "/")
for url in urls_new:
    if url.startswith("http") or url.startswith(" "):
        if main_url == url:
            url = url
        else:
            pass
    else:
        url = main_url + url
    if '#' in url:
        new_url = str(url).replace('#', '/')
    else:
        new_url = url
    count = count + 1
    if new_url:
        print "" + str(count) + ">>", new_url
        html = urllib.urlopen(new_url).read()
        page_text_data = text_from_html(html)
        with open("c:/Scrapy/Extracted/" + folder_name + "/" + txtfile_name + ".txt", "a") as myfile:
            myfile.writelines("\n\n" + new_url.encode('utf-8') + "\n\n" + page_text_data.encode('utf-8'))
        path = 'c:/Scrapy/Extracted/' + folder_name + "/"
        filename = "url" + str(count) + ".txt"
        with open(os.path.join(path, filename), 'wb') as temp_file:
            temp_file.write(page_text_data.encode('utf-8'))
            temp_file.close()
    else:
        pass
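The behaviour the question asks for - stay on the starting site and stop after a fixed link depth - can be sketched without any framework as a breadth-first traversal. This is a hypothetical illustration in Python 3 stdlib only; crawl and fetch_links are invented names, and the link-fetching step is injected as a function so the traversal logic can be demonstrated without network access:

```python
from collections import deque
from urllib.parse import urljoin, urlparse

def crawl(start_url, fetch_links, max_depth):
    """Breadth-first crawl: visit pages up to max_depth links away
    from start_url, never leaving start_url's domain.

    fetch_links(url) must return the raw hrefs found on that page;
    in real use it would download and parse the page.
    """
    allowed_netloc = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links any deeper
        for href in fetch_links(url):
            absolute = urljoin(url, href)  # resolve relative links
            if urlparse(absolute).netloc != allowed_netloc:
                continue  # skip offsite links (e.g. youtube.com)
            if absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return visited

# Demo on a tiny in-memory "site": the YouTube link is filtered out,
# and pages more than max_depth clicks away are never reached.
site = {
    "http://example.com/": ["/a", "http://youtube.com/x"],
    "http://example.com/a": ["/b"],
    "http://example.com/b": ["/c"],
}
pages = crawl("http://example.com/", lambda u: site.get(u, []), max_depth=2)
```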
Answer
Your current solution does not involve Scrapy at all. But since you specifically asked for Scrapy, here you go.
Base your spider on the CrawlSpider class. This lets you crawl a given website and specify rules that the navigation must obey.
To disallow offsite requests, use the allowed_domains spider attribute. Alternatively, if you use the CrawlSpider class, you can specify the allow_domains (or, conversely, deny_domains) attribute in the Rule constructor.
To limit the crawl depth, set DEPTH_LIMIT in settings.py.
Answer
You tagged the question with scrapy, but you do not use it at all. I would suggest you actually try it - it is easy, and much easier than developing the same thing yourself. It already has an option to restrict requests to specific domains.