Beautiful Soup - unable to get links from paginated pages
Problem description:
I am unable to scrape the links to the articles on a paginated web page. In addition, I sometimes get a blank screen as my output. I cannot find the problem in my loop. Also, the csv file never gets created.
from pprint import pprint
import requests
from bs4 import BeautifulSoup
import lxml
import csv
import urllib2
def get_url_for_search_key(search_key):
    for i in range(1, 100):
        base_url = 'http://www.thedrum.com/'
        response = requests.get(base_url + 'search?page=%s&query=' + search_key + '&sorted=') % i
        soup = BeautifulSoup(response.content, "lxml")
        results = soup.findAll('a')
        return [url['href'] for url in soup.findAll('a')]

pprint(get_url_for_search_key('artificial intelligence'))

with open('StoreUrl.csv', 'w+') as f:
    f.seek(0)
    f.write('\n'.join(get_url_for_search_key('artificial intelligence')))
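As an editorial side note (not part of the original question or answer): in the snippet above the % i is applied to the requests.get(...) response object rather than to the URL string, which raises a TypeError, and the return inside the loop exits after the first page. A minimal corrected sketch of the same idea, keeping the original assumption of at most 100 pages:

import requests
from bs4 import BeautifulSoup

def get_urls_for_search_key(search_key):
    # Hypothetical rewrite: interpolate the page number and query into the URL string itself
    urls = []
    for i in range(1, 100):
        page_url = 'http://www.thedrum.com/search?page=%s&query=%s&sorted=' % (i, search_key)
        soup = BeautifulSoup(requests.get(page_url).content, "lxml")
        # href=True skips <a> tags that have no href attribute
        urls.extend(a['href'] for a in soup.findAll('a', href=True))
    return urls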
Answer:
Are you sure you only need the first 100 pages? There may be more of them...
Here is my take on your task; it collects the links from all pages by also grabbing the pagination ("Next page") button link:
import requests
from bs4 import BeautifulSoup

base_url = 'http://www.thedrum.com/search?sort=date&query=artificial%20intelligence'
response = requests.get(base_url)
soup = BeautifulSoup(response.content, "lxml")

res = []

while 1:
    results = soup.findAll('a')
    res.append([url['href'] for url in results])

    next_button = soup.find('a', text='Next page')
    if not next_button:
        break
    response = requests.get(next_button['href'])
    soup = BeautifulSoup(response.content, "lxml")
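Two small hardening notes on the loop above (these are assumptions on my part, not part of the original answer): <a> tags without an href attribute would make url['href'] raise a KeyError, and if the "Next page" href happens to be relative, requests.get needs it resolved to an absolute URL first. A minimal sketch:

from urllib.parse import urljoin  # on Python 2: from urlparse import urljoin

def collect_hrefs(soup, page_url):
    # href=True ignores anchor tags without an href attribute (avoids KeyError);
    # urljoin makes a relative link absolute against the current page URL
    return [urljoin(page_url, a['href']) for a in soup.findAll('a', href=True)]

Inside the loop this would become res.append(collect_hrefs(soup, base_url)) and response = requests.get(urljoin(base_url, next_button['href'])).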
Edit: an alternative approach that collects only the article links:
import requests
from bs4 import BeautifulSoup

base_url = 'http://www.thedrum.com/search?sort=date&query=artificial%20intelligence'
response = requests.get(base_url)
soup = BeautifulSoup(response.content, "lxml")

res = []

while 1:
    search_results = soup.find('div', class_='search-results')  # localizing search window with article links
    article_link_tags = search_results.findAll('a')  # ordinary scheme goes further
    res.append([url['href'] for url in article_link_tags])

    next_button = soup.find('a', text='Next page')
    if not next_button:
        break
    response = requests.get(next_button['href'])
    soup = BeautifulSoup(response.content, "lxml")
To print the links, use:
for i in res:
    for j in i:
        print(j)
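The original question also wants the links written to StoreUrl.csv, which this answer does not cover. A minimal sketch (my addition) that flattens res, the list of per-page link lists built above, into a one-column CSV using the csv module already imported in the question:

import csv

with open('StoreUrl.csv', 'w') as f:  # on Python 3, also pass newline=''
    writer = csv.writer(f)
    for page_links in res:
        for link in page_links:
            writer.writerow([link])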
For an initial test I took just the first 100 pages. The problem is that when I try to print the links based on your solution, I see a series of 'None' printed one below the other. – Rrj17
How are you printing them? Please provide the complete code. –
I just used pprint(res.append([url['href'] for url in article_link_tags])) right after the code snippet you provided. I'm not sure whether that's correct. Very confused. – Rrj17
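For reference, the column of 'None' values described in the comments is the expected result of printing the return value of list.append, which is always None; print the list itself (or iterate over it as shown in the answer) instead:

links = ['http://example.com/article']  # hypothetical scraped links
res = []
print(res.append(links))  # list.append returns None, so this prints None
print(res)                # prints [['http://example.com/article']]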