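Python BeautifulSoup: writing information out in CSV format

Problem description:

I can print the information I extract from the website without any problem. However, when I try to put the street names in one column and the zip codes in another, I run into problems with the CSV file. All I get in the CSV is the two column names, with everything else spread out in columns across the page. Here is my code. I am using Python 2.7.5 and Beautiful Soup 4.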

from bs4 import BeautifulSoup 
import csv 
import urllib2 

url="http://www.conakat.com/states/ohio/cities/defiance/road_maps/" 

page=urllib2.urlopen(url) 

soup = BeautifulSoup(page.read()) 

f = csv.writer(open("Defiance Steets1.csv", "w")) 
f.writerow(["Name", "ZipCodes"]) # Write column headers as the first line 

links = soup.find_all(['i','a']) 

for link in links: 
    names = link.contents[0] 
    print unicode(names) 

f.writerow(names) 

Your code doesn't show how you get the zip codes. Also, you are not calling f.writerow on names inside the loop. – Vorsprung
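
A side note that may explain the "spread across the columns" symptom: csv.writer.writerow expects a sequence of fields, so a bare string is written one character per column. The sketch below is Python 3, and the sample name "MAIN" and the in-memory buffer are assumptions for illustration, not part of the original code.

```python
import csv
import io

# In-memory buffer standing in for the CSV file (illustrative only).
buf = io.StringIO()
writer = csv.writer(buf)

name = "MAIN"  # hypothetical sample street name

writer.writerow(name)             # bare string: every character becomes a field
writer.writerow([name, "43512"])  # list: one field per list item

rows = buf.getvalue().splitlines()
print(rows[0])  # M,A,I,N
print(rows[1])  # MAIN,43512
```

Wrapping the values in a list or tuple, and calling writerow once per loop iteration, gives one row per street.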

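The data you retrieve from the URL contains more a elements than i elements. You have to filter the a elements and then build the pairs with Python's built-in zip.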

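# Keep only the <a> elements that link to a street map page.
links = soup.find_all('a', href=True)
links = [link for link in links
         if link["href"].startswith("http://www.conakat.com/map/?p=")]
# Each zip code sits in an <i> element.
zips = soup.find_all('i')

# Pair each street name with its zip code and write one row per pair.
for l, z in zip(links, zips):
    f.writerow((l.contents[0], z.contents[0]))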

Output:

Name,ZipCodes 
1ST ST,(43512) 
E 1ST ST,(43512) 
W 1ST ST,(43512) 
2ND ST,(43512) 
E 2ND ST,(43512) 
W 2ND ST,(43512) 
3 RIVERS CT,(43512) 
3RD ST,(43512) 
E 3RD ST,(43512) 
... 
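
To try the filter-then-zip idea without fetching the page, here is a minimal, self-contained Python 3 sketch. The HTML fragment, the "Home" link, and the ?p=1 / ?p=2 query values are made up for illustration; only the URL prefix comes from the answer. It also assumes bs4 is installed and uses the standard-library "html.parser".

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical fragment imitating the page: one unrelated link, then
# street links that are each followed by a zip code in an <i> tag.
html = """
<a href="http://www.conakat.com/">Home</a>
<a href="http://www.conakat.com/map/?p=1">1ST ST</a> <i>(43512)</i>
<a href="http://www.conakat.com/map/?p=2">E 1ST ST</a> <i>(43512)</i>
"""

soup = BeautifulSoup(html, "html.parser")
prefix = "http://www.conakat.com/map/?p="

zips = soup.find_all("i")

# Without filtering, the extra <a> shifts every pair by one position.
unfiltered = [(a.get_text(), i.get_text())
              for a, i in zip(soup.find_all("a"), zips)]

# Keep only street links; .get avoids a KeyError on <a> tags without href.
links = [a for a in soup.find_all("a")
         if a.get("href", "").startswith(prefix)]

buf = io.StringIO()  # stands in for the output file
writer = csv.writer(buf)
writer.writerow(["Name", "ZipCodes"])
for link, zip_tag in zip(links, zips):
    writer.writerow([link.get_text(), zip_tag.get_text()])

rows = buf.getvalue().splitlines()
print(unfiltered)
print(rows)
```

Note that zip stops at the shorter input, so this pairing is only correct when every remaining link has exactly one matching i element in the same order.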

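This is exactly what I needed, thank you very much. – Codin

Another approach (Python 3) is to take each <a> link, find its next sibling, check that it is a tag, and extract its value: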

from bs4 import BeautifulSoup
import csv
import urllib.request as urllib2

url = "http://www.conakat.com/states/ohio/cities/defiance/road_maps/"

page = urllib2.urlopen(url)

soup = BeautifulSoup(page.read())

# In Python 3 the csv module expects the file to be opened with newline="".
with open("Defiance Steets1.csv", "w", newline="") as out:
    f = csv.writer(out)
    f.writerow(["Name", "ZipCodes"])  # Write column headers as the first line

    links = soup.find_all('a')

    for link in links:
        # Look for the next <i> sibling of this link, if there is one.
        i = link.find_next_sibling('i')
        if getattr(i, 'name', None):  # skip links with no <i> sibling
            a, i = link.string, i.string
            f.writerow([a, i])

It produces:

Name,ZipCodes 
1ST ST,(43512) 
E 1ST ST,(43512) 
W 1ST ST,(43512) 
2ND ST,(43512) 
E 2ND ST,(43512) 
W 2ND ST,(43512) 
3 RIVERS CT,(43512) 
3RD ST,(43512) 
E 3RD ST,(43512) 
W 3RD ST,(43512) 
... 

Your approach works well too, thank you. I have a quick question: how would you remove the () from around the zip codes? Thanks – Codin

@Codin: A string ('i.string') is also an iterable, so you can use a slice to strip the first and last characters: 'a, i = link.string, i.string[1:-1]' – Birei
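
A small Python 3 sketch of the slice, next to str.strip as a possible alternative (the alternative is not something suggested in the thread). The value "(43512)" is taken from the output above; the bare "43512" is a made-up case showing how the two differ.

```python
zip_text = "(43512)"  # value as it appears in the output above

sliced = zip_text[1:-1]          # drops the first and last character
stripped = zip_text.strip("()")  # drops leading/trailing parentheses only

print(sliced, stripped)

# Hypothetical value without parentheses: the slice still cuts two
# characters, while strip leaves the string unchanged.
plain = "43512"
print(plain[1:-1], plain.strip("()"))
```

The slice is fine as long as every zip code is wrapped in parentheses; strip is a bit more forgiving if some are not.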

Thanks for your help – Codin