Counting unique values per unique key in a Python dictionary
Question:
My dictionary looks like this:
yahoo.com|98.136.48.100
yahoo.com|98.136.48.105
yahoo.com|98.136.48.110
yahoo.com|98.136.48.114
yahoo.com|98.136.48.66
yahoo.com|98.136.48.71
yahoo.com|98.136.48.73
yahoo.com|98.136.48.75
yahoo.net|98.136.48.100
g03.msg.vcs0|98.136.48.105
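It contains repeated keys and values. What I want is a final dictionary with unique keys (IPs) and unique values (domains). I already have the code below: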
for dirpath, dirs, files in os.walk(path):
    for filename in fnmatch.filter(files, '*.txt'):
        with open(os.path.join(dirpath, filename)) as f:
            for line in f:
                if line.startswith('.'):
                    ip = line.split('|',1)[1].strip('\n')
                    semi_domain = (line.rsplit('|',1)[0]).split('.',1)[1]
                    d[ip]= semi_domains
                    if ip not in d:
                        key = ip
                        val = [semi_domain]
                        domains_per_ip[key]= val
But this doesn't work. Can someone help me solve this problem?
Answer
Use defaultdict:
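from collections import defaultdict

d = defaultdict(set)  # maps each IP to the set of distinct domains seen for it
with open('somefile.txt') as the_file:
    for line in the_file:
        if line.strip():  # skip blank lines
            # strip the trailing newline so it does not end up in the key
            value, key = line.strip().split('|')
            d[key].add(value)

for k, v in d.items():  # use d.iteritems() in Python 2
    print('{} - {}'.format(k, len(v)))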
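To see what this approach produces on a few of the lines from the question, here is a minimal self-contained sketch that reads from an in-memory list instead of a file. The function name `count_domains_per_ip` and the `sample` list are assumptions made for illustration; they are not part of the answer above.

```python
from collections import defaultdict


def count_domains_per_ip(lines):
    """Return {ip: number of distinct domains} for 'domain|ip' lines."""
    domains_by_ip = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        domain, ip = line.split('|', 1)
        domains_by_ip[ip].add(domain)  # a set ignores repeated domains
    return {ip: len(domains) for ip, domains in domains_by_ip.items()}


# Hypothetical sample: a subset of the lines shown in the question.
sample = [
    'yahoo.com|98.136.48.100',
    'yahoo.com|98.136.48.105',
    'yahoo.net|98.136.48.100',
    'g03.msg.vcs0|98.136.48.105',
]

counts = count_domains_per_ip(sample)
for ip, n in sorted(counts.items()):
    print('{} - {}'.format(ip, n))
# 98.136.48.100 - 2
# 98.136.48.105 - 2
```

Keeping a set per IP means a domain that appears several times for the same IP is only counted once.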
Thanks @Burhan Khalid, it solved my problem – Ounk 2014-10-17 11:35:35
Answer
You can use the zip function to separate the list into IPs and domains, then use set to get the unique items!
>>> f = open('words.txt', 'r').readlines()
>>> zip(*[i.split('|') for i in f])
[('yahoo.com', 'yahoo.com', 'yahoo.com', 'yahoo.com', 'yahoo.com', 'yahoo.com', 'yahoo.com', 'yahoo.com', 'yahoo.net', 'g03.msg.vcs0'), ('98.136.48.100\n', '98.136.48.105\n', '98.136.48.110\n', '98.136.48.114\n', '98.136.48.66\n', '98.136.48.71\n', '98.136.48.73\n', '98.136.48.75\n', '98.136.48.100\n', '98.136.48.105')]
>>> [set(dom) for dom in zip(*[i.split('|') for i in f])]
[set(['yahoo.com', 'g03.msg.vcs0', 'yahoo.net']), set(['98.136.48.71\n', '98.136.48.105\n', '98.136.48.100\n', '98.136.48.105', '98.136.48.114\n', '98.136.48.110\n', '98.136.48.73\n', '98.136.48.66\n', '98.136.48.75\n'])]
Then use len to find the number of unique items! All in a one-line list comprehension:
>>> [len(i) for i in [set(dom) for dom in zip(*[i.split('|') for i in f])]]
[3, 9]
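Note that in the session above the last line of the file has no trailing newline, so '98.136.48.105' and '98.136.48.105\n' are counted as two different values; that is why the second number is 9. A minimal Python 3 sketch of the same zip/set idea that strips each line first might look like the following. The in-memory `sample` list stands in for the file and is an assumption for illustration. This counts unique domains and unique IPs overall, not per key.

```python
# Hypothetical stand-in for the file contents; the last line has no newline.
sample = [
    'yahoo.com|98.136.48.100\n',
    'yahoo.com|98.136.48.105\n',
    'yahoo.com|98.136.48.110\n',
    'yahoo.com|98.136.48.114\n',
    'yahoo.com|98.136.48.66\n',
    'yahoo.com|98.136.48.71\n',
    'yahoo.com|98.136.48.73\n',
    'yahoo.com|98.136.48.75\n',
    'yahoo.net|98.136.48.100\n',
    'g03.msg.vcs0|98.136.48.105',
]

# Strip whitespace before splitting so '...105' and '...105\n' compare equal.
rows = [line.strip().split('|') for line in sample if line.strip()]

# zip(*rows) transposes the rows into one tuple of domains and one of IPs.
domains, ips = zip(*rows)

unique_counts = [len(set(column)) for column in (domains, ips)]
print(unique_counts)  # [3, 8]
```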
Why are you using startswith('.')? – Kasramvd 2014-10-17 10:00:37
What do you mean by *"duplicate keys"*? The keys in a dictionary are already unique. – jonrsharpe 2014-10-17 10:03:18