Removing duplicates
I have a list of tuples that unfortunately contains duplicates, like this:
[(67, u'top-coldestcitiesinamerica'), (66, u'ecofriendlyideastocelebrateindependenceday-phpapp'), (65, u'a-b-c-ca-d-ab-ea-d-c-c'), (64, u'a-b-c-ca-d-ab-ea-d-c-c'), (63, u'alexandre-meybeck-faowhatisclimate-smartagriculture-backgroundopportunitiesandchallenges'), (62, u'ghgemissions'), (61, u'top-coldestcitiesinamerica'), (58, u'infographicthe-stateofdigitaltransformationaltimetergroup'), (57, u'culture'), (55, u'cas-k-ihaveanidea'), (54, u'trendsfor'), (53, u'batteryimpedance'), (52, u'evs-howey-full'), (51, u'bericht'), (49, u'classiccarinsurance'), (47, u'uploaded_file'), (46, u'x_file'), (45, u's-s-main'), (44, u'vehicle-propulsion'), (43, u'x_file')]
The issue is that the element at index 1 of each tuple (0-based indexing) is the entry I want to check for duplicates. So, I can see that:
(67, u'top-coldestcitiesinamerica')
(61, u'top-coldestcitiesinamerica')
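...are duplicates, and I would like to remove one of them (similar to what set does). So in the end I would like a clean list of tuples with no duplicates (i.e. no duplicates in the element at index 1 of the tuple), like this: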
[(67, u'top-coldestcitiesinamerica'), (66, u'ecofriendlyideastocelebrateindependenceday-phpapp'), (65, u'a-b-c-ca-d-ab-ea-d-c-c'), (63, u'alexandre-meybeck-faowhatisclimate-smartagriculture-backgroundopportunitiesandchallenges'), (62, u'ghgemissions'), (58, u'infographicthe-stateofdigitaltransformationaltimetergroup'), (57, u'culture'), (55, u'cas-k-ihaveanidea'), (54, u'trendsfor'), (53, u'batteryimpedance'), (52, u'evs-howey-full'), (51, u'bericht'), (49, u'classiccarinsurance'), (47, u'uploaded_file'), (46, u'x_file'), (45, u's-s-main'), (44, u'vehicle-propulsion')]
How can I achieve this in Python? Thanks!
You can use the set approach from "How do you remove duplicates from a list in Python whilst preserving order?", using x[1] as the unique identifier:
def unique_second_element(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x[1] in seen or seen_add(x[1]))]
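Note that the OrderedDict approach also shown there would work too if you wanted to keep the last occurrence; for first occurrences you would have to reverse the input and then reverse the output again.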
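As a rough illustration of that OrderedDict note, here is a minimal sketch. The function names and the shortened sample list are assumptions made for this example; they are not part of the linked answer.

```python
from collections import OrderedDict

def unique_keep_last(seq):
    # Later tuples overwrite earlier ones stored under the same key,
    # while the key keeps the position of its first insertion.
    return list(OrderedDict((x[1], x) for x in seq).values())

def unique_keep_first(seq):
    # Reverse the input so that first occurrences are stored last,
    # then reverse the result again.
    return unique_keep_last(reversed(seq))[::-1]

# Hypothetical, shortened sample data.
sample = [(67, 'top'), (65, 'abc'), (64, 'abc'), (61, 'top')]

print(unique_keep_last(sample))   # [(61, 'top'), (64, 'abc')]
print(unique_keep_first(sample))  # [(65, 'abc'), (67, 'top')]
```

Note that the reversed variant keeps the first tuple for each key, but places each key where it last appeared in the input, so its ordering can differ from what the set-based function returns.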
You could make this more generic by supporting a key function:
def unique_preserve_order(seq, key=None):
    if key is None:
        key = lambda elem: elem
    seen = set()
    seen_add = seen.add
    augmented = ((key(x), x) for x in seq)
    return [x for k, x in augmented if not (k in seen or seen_add(k))]
Then use:
import operator
unique_preserve_order(yourlist, key=operator.itemgetter(1))
Demo:
>>> def unique_preserve_order(seq, key=None):
...     if key is None:
...         key = lambda elem: elem
...     seen = set()
...     seen_add = seen.add
...     augmented = ((key(x), x) for x in seq)
...     return [x for k, x in augmented if not (k in seen or seen_add(k))]
...
>>> from pprint import pprint
>>> import operator
>>> yourlist = [(67, u'top-coldestcitiesinamerica'), (66, u'ecofriendlyideastocelebrateindependenceday-phpapp'), (65, u'a-b-c-ca-d-ab-ea-d-c-c'), (64, u'a-b-c-ca-d-ab-ea-d-c-c'), (63, u'alexandre-meybeck-faowhatisclimate-smartagriculture-backgroundopportunitiesandchallenges'), (62, u'ghgemissions'), (61, u'top-coldestcitiesinamerica'), (58, u'infographicthe-stateofdigitaltransformationaltimetergroup'), (57, u'culture'), (55, u'cas-k-ihaveanidea'), (54, u'trendsfor'), (53, u'batteryimpedance'), (52, u'evs-howey-full'), (51, u'bericht'), (49, u'classiccarinsurance'), (47, u'uploaded_file'), (46, u'x_file'), (45, u's-s-main'), (44, u'vehicle-propulsion'), (43, u'x_file')]
>>> pprint(unique_preserve_order(yourlist, operator.itemgetter(1)))
[(67, u'top-coldestcitiesinamerica'),
 (66, u'ecofriendlyideastocelebrateindependenceday-phpapp'),
 (65, u'a-b-c-ca-d-ab-ea-d-c-c'),
 (63,
  u'alexandre-meybeck-faowhatisclimate-smartagriculture-backgroundopportunitiesandchallenges'),
 (62, u'ghgemissions'),
 (58, u'infographicthe-stateofdigitaltransformationaltimetergroup'),
 (57, u'culture'),
 (55, u'cas-k-ihaveanidea'),
 (54, u'trendsfor'),
 (53, u'batteryimpedance'),
 (52, u'evs-howey-full'),
 (51, u'bericht'),
 (49, u'classiccarinsurance'),
 (47, u'uploaded_file'),
 (46, u'x_file'),
 (45, u's-s-main'),
 (44, u'vehicle-propulsion')]
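Because key can be any callable, the same function also covers other notions of "duplicate". The sketch below repeats the function so it runs on its own; the sample rows and the case-insensitive key are assumptions for illustration, not something the question asked for.

```python
def unique_preserve_order(seq, key=None):
    if key is None:
        key = lambda elem: elem
    seen = set()
    seen_add = seen.add
    augmented = ((key(x), x) for x in seq)
    return [x for k, x in augmented if not (k in seen or seen_add(k))]

# Hypothetical sample data, not taken from the question.
rows = [(3, 'Culture'), (2, 'culture'), (1, 'Trends'), (0, 'CULTURE')]

# Compare the second element case-insensitively.
print(unique_preserve_order(rows, key=lambda t: t[1].lower()))
# [(3, 'Culture'), (1, 'Trends')]

# Without a key, whole tuples are compared, so nothing is dropped here.
print(unique_preserve_order(rows))
# [(3, 'Culture'), (2, 'culture'), (1, 'Trends'), (0, 'CULTURE')]
```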
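As an alternative answer, you can use itertools.groupby(). It may help if you have a huge list, though it is not as good as the set approach (here l is the input list):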
>>> from itertools import groupby
>>> from operator import itemgetter
>>> [next(g) for _,g in groupby(sorted(l,key=itemgetter(1)),itemgetter(1))]
[(65, u'a-b-c-ca-d-ab-ea-d-c-c'), (63, u'alexandre-meybeck-faowhatisclimate-smartagriculture-backgroundopportunitiesandchallenges'), (53, u'batteryimpedance'), (51, u'bericht'), (55, u'cas-k-ihaveanidea'), (49, u'classiccarinsurance'), (57, u'culture'), (66, u'ecofriendlyideastocelebrateindependenceday-phpapp'), (52, u'evs-howey-full'), (62, u'ghgemissions'), (58, u'infographicthe-stateofdigitaltransformationaltimetergroup'), (45, u's-s-main'), (67, u'top-coldestcitiesinamerica'), (54, u'trendsfor'), (47, u'uploaded_file'), (44, u'vehicle-propulsion'), (46, u'x_file')]
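This kills the order, and the sorting makes it an O(N log N) solution rather than my O(N) approach. – 2015-03-03 14:23:02
@MartijnPieters Unfortunately, yes! But maybe it doesn't matter to the OP! And I mentioned that 'set' is the better recipe! – Kasramvd 2015-03-03 14:24:41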
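If the original order does matter, a groupby-based version can restore it at the cost of a second sort. This is only a sketch of one possible workaround, with a made-up function name and shortened sample data; it remains O(N log N), so the set-based approach is still the simpler choice.

```python
from itertools import groupby
from operator import itemgetter

def unique_groupby_keep_order(seq, key=itemgetter(1)):
    # Pair each item with its original position.
    indexed = list(enumerate(seq))
    # list.sort() is stable, so within each group the earliest item comes first.
    indexed.sort(key=lambda pair: key(pair[1]))
    firsts = [next(group)
              for _, group in groupby(indexed, key=lambda pair: key(pair[1]))]
    # Second sort: put the surviving items back in input order.
    firsts.sort(key=itemgetter(0))
    return [item for _, item in firsts]

# Hypothetical, shortened sample data.
sample = [(67, 'top'), (65, 'abc'), (64, 'abc'), (61, 'top')]
print(unique_groupby_keep_order(sample))  # [(67, 'top'), (65, 'abc')]
```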
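- Define a check_list variable to hold the keys seen so far.
- Iterate over every item in the input list.
- Check whether the item's key is already in check_list.
- If it is not, append the item to the result list and update check_list.
- Print the result.
Code: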
input_list = [(67, u'top-coldestcitiesinamerica'), (66, u'ecofriendlyideastocelebrateindependenceday-phpapp'), (65, u'a-b-c-ca-d-ab-ea-d-c-c'), (64, u'a-b-c-ca-d-ab-ea-d-c-c'), (63, u'alexandre-meybeck-faowhatisclimate-smartagriculture-backgroundopportunitiesandchallenges'), (62, u'ghgemissions'), (61, u'top-coldestcitiesinamerica'), (58, u'infographicthe-stateofdigitaltransformationaltimetergroup'), (57, u'culture'), (55, u'cas-k-ihaveanidea'), (54, u'trendsfor'), (53, u'batteryimpedance'), (52, u'evs-howey-full'), (51, u'bericht'), (49, u'classiccarinsurance'), (47, u'uploaded_file'), (46, u'x_file'), (45, u's-s-main'), (44, u'vehicle-propulsion'), (43, u'x_file')]
check_list = set()
result = []
for i in input_list:
    if i[1] not in check_list:
        result.append(i)
        check_list.add(i[1])
import pprint
pprint.pprint(result)
Output:
$ python task4.py
[(67, u'top-coldestcitiesinamerica'),
 (66, u'ecofriendlyideastocelebrateindependenceday-phpapp'),
 (65, u'a-b-c-ca-d-ab-ea-d-c-c'),
 (63,
  u'alexandre-meybeck-faowhatisclimate-smartagriculture-backgroundopportunitiesandchallenges'),
 (62, u'ghgemissions'),
 (58, u'infographicthe-stateofdigitaltransformationaltimetergroup'),
 (57, u'culture'),
 (55, u'cas-k-ihaveanidea'),
 (54, u'trendsfor'),
 (53, u'batteryimpedance'),
 (52, u'evs-howey-full'),
 (51, u'bericht'),
 (49, u'classiccarinsurance'),
 (47, u'uploaded_file'),
 (46, u'x_file'),
 (45, u's-s-main'),
 (44, u'vehicle-propulsion')]
@MartijnPieters: Apologies. It uses a set now. – 2015-03-03 14:53:19
I took a very plain and simple approach.
lst=[(67, u'top-coldestcitiesinamerica'), (66, u'ecofriendlyideastocelebrateindependenceday-phpapp'), (65, u'a-b-c-ca-d-ab-ea-d-c-c'), (64, u'a-b-c-ca-d-ab-ea-d-c-c'), (63, u'alexandre-meybeck-faowhatisclimate-smartagriculture-backgroundopportunitiesandchallenges'), (62, u'ghgemissions'), (61, u'top-coldestcitiesinamerica'), (58, u'infographicthe-stateofdigitaltransformationaltimetergroup'), (57, u'culture'), (55, u'cas-k-ihaveanidea'), (54, u'trendsfor'), (53, u'batteryimpedance'), (52, u'evs-howey-full'), (51, u'bericht'), (49, u'classiccarinsurance'), (47, u'uploaded_file'), (46, u'x_file'), (45, u's-s-main'), (44, u'vehicle-propulsion'), (43, u'x_file')]
lst2 = []  # empty list to fill with unique tuples
lst_banned = []  # empty list to fill with banned elements
for tup in lst:
    if tup[-1] not in lst_banned:
        lst_banned.append(tup[-1])
        lst2.append(tup)
lst = lst2
del lst2
del lst_banned
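I just saw that a similar answer was posted while I was writing this. Sorry! :) – 2015-03-03 14:29:14
Same comment to you: using a list to track the unique elements is **slow**, because each test takes 'len(lst_banned)' steps. A set lets you test membership in *constant time*. – 2015-03-03 14:29:42
Good point! 'set' is more pythonic... and, I guess, that is also the key to the problem! – 2015-03-03 14:33:27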
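The difference between list and set membership tests can be checked with a rough timing sketch like the one below. The function names and the generated test data are assumptions for illustration, and absolute timings will vary by machine and Python version.

```python
import timeit

def dedupe_with_list(seq):
    seen, result = [], []
    for item in seq:
        if item[1] not in seen:   # scans the list: up to len(seen) steps
            seen.append(item[1])
            result.append(item)
    return result

def dedupe_with_set(seq):
    seen, result = set(), []
    for item in seq:
        if item[1] not in seen:   # hash lookup: constant time on average
            seen.add(item[1])
            result.append(item)
    return result

# Hypothetical data: 2000 tuples with distinct second elements,
# which is the worst case for the list-based version.
data = [(i, 'name-%d' % i) for i in range(2000)]

# Both versions return the same result; only the running time differs.
assert dedupe_with_list(data) == dedupe_with_set(data)

list_time = timeit.timeit(lambda: dedupe_with_list(data), number=5)
set_time = timeit.timeit(lambda: dedupe_with_set(data), number=5)
print('list: %.4fs  set: %.4fs' % (list_time, set_time))
```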
Sorry about the delayed reply - in the end I used your 'unique_second_element' method - works like a charm. Thank you very much! – AJW 2015-03-16 09:36:03