Python BeautifulSoup: printing scraped information in CSV format
Problem description:
I can print the information I pull from the website without any problem. However, when I try to put the street names in one column and the zip codes in another, I run into problems with the CSV file: all I get in the CSV are the names, spread across the columns of each row. Here is my code. I am using Python 2.7.5 and Beautiful Soup 4.
from bs4 import BeautifulSoup
import csv
import urllib2
url="http://www.conakat.com/states/ohio/cities/defiance/road_maps/"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
f = csv.writer(open("Defiance Steets1.csv", "w"))
f.writerow(["Name", "ZipCodes"]) # Write column headers as the first line
links = soup.find_all(['i','a'])
for link in links:
    names = link.contents[0]
    print unicode(names)
    f.writerow(names)
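The stray columns almost certainly come from `f.writerow(names)`: `csv.writer.writerow` expects a sequence of fields, so passing a bare string makes it iterate the string character by character, one column per character. A minimal sketch with an in-memory buffer (no scraping involved) shows the difference:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)

# A bare string is iterated character by character -> one column per char.
writer.writerow("MAIN ST")

# Wrapping the fields in a list gives one column per field, as intended.
writer.writerow(["MAIN ST", "43512"])

print(buf.getvalue())
```

The first row comes out as `M,A,I,N, ,S,T`, the second as `MAIN ST,43512`.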
Answer:
The data you retrieve from that URL contains more i elements than a elements. You have to filter the a elements, and then use Python's zip builtin to build the pairs.
links = soup.find_all('a')
links = [link for link in links
         if link["href"].startswith("http://www.conakat.com/map/?p=")]
zips = soup.find_all('i')
for l, z in zip(links, zips):
    f.writerow((l.contents[0], z.contents[0]))
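The filter-then-zip step can be exercised without touching the network at all. A small sketch, using hypothetical stand-in lists for the scraped link texts and zip codes:

```python
import csv
import io

# Hypothetical stand-ins for the scraped <a> texts and <i> texts.
names = ["1ST ST", "E 1ST ST", "W 1ST ST"]
zips = ["(43512)", "(43512)", "(43512)"]

buf = io.StringIO()
f = csv.writer(buf)
f.writerow(["Name", "ZipCodes"])

# zip() pairs the i-th name with the i-th zip code, so the approach
# relies on the two filtered lists lining up one-to-one; zip stops
# silently at the end of the shorter list if they do not.
for name, zipcode in zip(names, zips):
    f.writerow([name, zipcode])

print(buf.getvalue())
```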
Output:
Name,ZipCodes
1ST ST,(43512)
E 1ST ST,(43512)
W 1ST ST,(43512)
2ND ST,(43512)
E 2ND ST,(43512)
W 2ND ST,(43512)
3 RIVERS CT,(43512)
3RD ST,(43512)
E 3RD ST,(43512)
...
This is exactly what I needed, thank you very much. – Codin
Answer:
Another approach (Python 3) is to take each <a> link, find its next sibling, check whether it is an <i> tag, and extract its value:
from bs4 import BeautifulSoup
import csv
import urllib.request as urllib2
url="http://www.conakat.com/states/ohio/cities/defiance/road_maps/"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
f = csv.writer(open("Defiance Steets1.csv", "w"))
f.writerow(["Name", "ZipCodes"]) # Write column headers as the first line
links = soup.find_all('a')
for link in links:
    i = link.find_next_sibling('i')
    if getattr(i, 'name', None):
        a, i = link.string, i.string
        f.writerow([a, i])
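The sibling-walking logic can be checked against a small inline snippet, assuming bs4 is installed. The markup below is hypothetical and merely mimics the page structure: each street <a> is followed by an <i> holding the zip code, while unrelated links have no such sibling (a plain `is not None` check is used here in place of the `getattr` test above, which does the same job for this case):

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the page structure.
html = ('<a href="/map/?p=1">1ST ST</a> <i>(43512)</i> '
        '<a href="/map/?p=2">2ND ST</a> <i>(43512)</i> '
        '<a href="/about">About</a>')
soup = BeautifulSoup(html, "html.parser")

rows = []
for link in soup.find_all('a'):
    # find_next_sibling returns the first following sibling <i>,
    # or None when the link has no such sibling (e.g. "About").
    i = link.find_next_sibling('i')
    if i is not None:
        rows.append([link.string, i.string])

print(rows)
```

Only the two street links produce rows; the "About" link is skipped.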
It produces:
Name,ZipCodes
1ST ST,(43512)
E 1ST ST,(43512)
W 1ST ST,(43512)
2ND ST,(43512)
E 2ND ST,(43512)
W 2ND ST,(43512)
3 RIVERS CT,(43512)
3RD ST,(43512)
E 3RD ST,(43512)
W 3RD ST,(43512)
...
Your code doesn't show how you obtain the zip codes. Also, you aren't using f.writerow with the names inside the loop. – Vorsprung