Reading URLs from a file in Python

Problem description:

I can't read the URLs from a txt file. I want to read the URLs from the txt file one by one, open each URL, and extract the page title from its source with a regular expression. Error message:

Traceback (most recent call last):
  File "Mypy.py", line 14, in <module>
    UrlsOpen = urllib2.urlopen(listSplit)
  File "/usr/lib/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 420, in open
    req.timeout = timeout
AttributeError: 'list' object has no attribute 'timeout'

Mypy.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import requests
import urllib2
import threading

UrlListFile = open("Url.txt","r")
UrlListRead = UrlListFile.read()
UrlListFile.close()
listSplit = UrlListRead.split('\r\n')

UrlsOpen = urllib2.urlopen(listSplit)
ReadSource = UrlsOpen.read().decode('utf-8')
regex = '<title.*?>(.+?)</title>'
comp = re.compile(regex)
links = re.findall(comp,ReadSource)
for i in links:
    SaveDataFiles = open("SaveDataMyFile.txt","w")
    SaveDataFiles.write(i)
SaveDataFiles.close()

Could you add a sample of your 'Url.txt' contents? – fievel


@fievel here is my Url.txt: https://i.stack.imgur.com/s81Mt.png –


Could you copy the contents of your Url.txt file and paste it into your question using code formatting? It would make it easier for people to help you debug. – PeterH

When you call urllib2.urlopen(listSplit), listSplit is a list, but urlopen expects a string or Request object. The simple fix is to iterate over listSplit and pass one URL at a time instead of passing the whole list to urlopen.
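A minimal sketch of that fix (assuming each line of Url.txt holds one absolute URL):

for url in listSplit:
    UrlsOpen = urllib2.urlopen(url)  # urlopen gets a single URL string per iteration
    ReadSource = UrlsOpen.read().decode('utf-8')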

Also, re.findall() returns a list of matches for each ReadSource you search. You can handle that in a couple of ways:

I chose to handle it by just building a list of lists,

websites = [[link, link], [link], [link, link, link]]

and iterating over both levels. That lets you do something specific with each website's list of URLs (for example, put each one in a different file).

You could also flatten the websites list so it contains just the links, rather than sub-lists containing links:

links = [link, link, link, link]
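A minimal flattening sketch, using a nested list comprehension over the websites list built in the code below:

links = [link for website in websites for link in website]  # one flat list of all matches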

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import urllib2
from pprint import pprint

UrlListFile = open("Url.txt", "r")
UrlListRead = UrlListFile.read()
UrlListFile.close()
listSplit = UrlListRead.splitlines()  # handles \n, \r\n and \r line endings
pprint(listSplit)

regex = '<title.*?>(.+?)</title>'
comp = re.compile(regex)

websites = []
for url in listSplit:
    UrlsOpen = urllib2.urlopen(url)  # one URL string at a time
    ReadSource = UrlsOpen.read().decode('utf-8')
    websites.append(re.findall(comp, ReadSource))  # one list of titles per site

with open("SaveDataMyFile.txt", "w") as SaveDataFiles:
    for website in websites:
        for link in website:
            pprint(link)
            SaveDataFiles.write(link.encode('utf-8') + '\n')  # one title per line
# the with statement closes the file automatically

Traceback (most recent call last):
  File "Mypy.py", line 14, in <module>
    UrlsOpen = urllib2.urlopen(url)
  File "/usr/lib/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 427, in open
    req = meth(req)
  File "/usr/lib/python2.7/urllib2.py", line 1126, in do_request_
    raise URLError('no host given')
urllib2.URLError: <urlopen error no host given>


I updated the code to handle more kinds of line endings with '.splitlines()' and fixed an encoding error with 'link.encode('utf-8')'. Try the new code. – PeterH
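A quick illustration of why .splitlines() is more forgiving than split('\r\n'), with hypothetical sample data standing in for a file saved with mixed line endings:

text = "http://a.example\r\nhttp://b.example\nhttp://c.example"  # hypothetical contents
print text.split('\r\n')  # ['http://a.example', 'http://b.example\nhttp://c.example']
print text.splitlines()   # ['http://a.example', 'http://b.example', 'http://c.example']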