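Python - Extract text from multiple strings across multiple files
Python gurus, I need to extract all the text from List up to URL; a sample of the pattern is below. I would also like the script to loop over every file in a folder.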
.....
.....
<List>Product Line</List>
<URL>http://teamspace.abb.com/sites/Product</URL>
...
...
<List>Contact Number</List>
<URL>https://teamspace.abb.com/sites/Contact</URL>
....
....
Expected output
<List>Product Line</List>
<URL>http://teamspace.abb.com/sites/Product</URL>
<List>Contact Number</List>
<URL>https://teamspace.abb.com/sites/Contact</URL>
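I have written a script that loops over all the files in a folder and extracts every line containing the List keyword, but I cannot get it to include the URL as well. Any help is much appreciated.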
import os

# defining location of parent folder (raw strings keep the backslashes literal)
BASE_DIRECTORY = r'C:\D_Drive\Projects\Test'
output_file = open(r'C:\D_Drive\Projects\Test\Output.txt', 'w')
output = {}
file_list = []

# scanning through sub folders
for (dirpath, dirnames, filenames) in os.walk(BASE_DIRECTORY):
    for f in filenames:
        if 'xml' in str(f):
            e = os.path.join(str(dirpath), str(f))
            file_list.append(e)

# collecting every line that contains <List>, per file
for f in file_list:
    print(f)
    txtfile = open(f, 'r')
    output[f] = []
    for line in txtfile:
        if '<List>' in line:
            output[f].append(line)
    txtfile.close()

# writing the results, one section per file, sorted by path
tabs = []
for tab in output:
    tabs.append(tab)
tabs.sort()
for tab in tabs:
    output_file.write(tab + '\n')
    output_file.write('\n')
    for row in output[tab]:
        output_file.write(row + '')
    output_file.write('\n')
    output_file.write('----------------------------------------------------------\n')
output_file.close()
input()  # keep the console window open
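Your code is mostly correct. The only change needed is to create an iterator over the file, so that when a line matches `<List>` you can pull the line that follows it (the URL) with next(). You could use ElementTree or Beautiful Soup instead, but iterating like this also works when the file is not XML or HTML.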
txtfile = iter(open(f, 'r'))  # change here
output[f] = []
for line in txtfile:
    if '<List>' in line:
        output[f].append(line)
        # and here: take the following line; '' avoids StopIteration at end of file
        output[f].append(next(txtfile, ''))
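For completeness, here is a rough sketch of how that change could fit into the folder loop from the question. The function names `extract_pairs` and `extract_from_folder` are made up for illustration, and it assumes, as in the sample, that the URL always sits on the line directly after its List line.

```python
import os


def extract_pairs(lines):
    """Return each '<List>' line together with the line that follows it."""
    it = iter(lines)  # one shared iterator, so next() also advances the for loop
    found = []
    for line in it:
        if '<List>' in line:
            found.append(line.strip())
            # Assumption: the URL is on the very next line; '' if the input ends here.
            following = next(it, '').strip()
            if following:
                found.append(following)
    return found


def extract_from_folder(base_directory):
    """Map each .xml file path under base_directory to its extracted lines."""
    results = {}
    for dirpath, _dirnames, filenames in os.walk(base_directory):
        for name in filenames:
            if name.lower().endswith('.xml'):
                path = os.path.join(dirpath, name)
                with open(path, 'r') as handle:
                    results[path] = extract_pairs(handle)
    return results
```

Calling `extract_from_folder(BASE_DIRECTORY)` would give a dict you can sort and write out the same way the question's script does.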
Excellent! Thanks a lot – user1902849
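import xml.etree.ElementTree as ET

tree = ET.parse('Product_Workflow.xml')
root = tree.getroot()
with open('Output.txt', 'w') as opfile:
    # pair each <List> element with the <URL> element at the same position
    for l, u in zip(root.iter('List'), root.iter('URL')):
        # encoding='unicode' makes tostring() return str instead of bytes
        opfile.write(ET.tostring(l, encoding='unicode').strip())
        opfile.write('\n')
        opfile.write(ET.tostring(u, encoding='unicode').strip())
        opfile.write('\n')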
Output.txt will then contain:
<List>Emove</List>
<URL>http://teamspace.abb.com/sites/Product</URL>
<List>Asset_KWT</List>
<URL>https://teamspace.slb.com/sites/Contact</URL>
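Since the question also asks about looping over a folder, here is a minimal sketch of the same ElementTree idea applied to every .xml file under a directory. The helper names `serialize_pairs` and `parse_folder` are hypothetical. Note that `zip` pairs elements purely by position, so this assumes every List has a matching URL, and skipping files that fail to parse is my own choice, not something the question specifies.

```python
import os
import xml.etree.ElementTree as ET


def serialize_pairs(root):
    """Return serialized <List> and <URL> elements, interleaved pair by pair."""
    lines = []
    for list_el, url_el in zip(root.iter('List'), root.iter('URL')):
        for el in (list_el, url_el):
            # encoding='unicode' returns str; strip() drops the trailing whitespace tail
            lines.append(ET.tostring(el, encoding='unicode').strip())
    return lines


def parse_folder(base_directory):
    """Map each parseable .xml file under base_directory to its List/URL lines."""
    results = {}
    for dirpath, _dirnames, filenames in os.walk(base_directory):
        for name in filenames:
            if not name.lower().endswith('.xml'):
                continue
            path = os.path.join(dirpath, name)
            try:
                root = ET.parse(path).getroot()
            except ET.ParseError:
                continue  # assumption: skip files that are not well-formed XML
            results[path] = serialize_pairs(root)
    return results
```

This only works if the files really are well-formed XML; for anything looser, the line-based approach is the safer bet.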
Thanks for the information. I will look into the XML element approach. – user1902849
You can use filter or a list comprehension, like this:
tgt = ('URL', 'List')
with open('file') as f:
    # list() is needed on Python 3, where filter() returns a lazy iterator
    print(list(filter(lambda line: any(e in line for e in tgt), f)))
Or:
with open('/tmp/file') as f:
    print([line for line in f if any(e in line for e in tgt)])
Either one prints:
[' <List>Product Line</List>\n', ' <URL>http://teamspace.abb.com/sites/Product</URL>\n', ' <List>Contact Number</List>\n', ' <URL>https://teamspace.abb.com/sites/Contact</URL>\n']
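The printed list keeps the leading spaces and trailing newlines, and matching on the bare words 'URL' and 'List' would also catch any other line that happens to contain them. A possible tightening, as a sketch only (the name `tagged_lines` and the choice to match on opening tags are my assumptions):

```python
TARGET_TAGS = ('<List>', '<URL>')


def tagged_lines(lines, tags=TARGET_TAGS):
    """Return stripped lines that contain any of the given opening tags."""
    return [line.strip() for line in lines if any(tag in line for tag in tags)]
```

Used as `with open('file') as f: print('\n'.join(tagged_lines(f)))`, this should give output in the shape shown in the question.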
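Thanks for your comment, I will take a look at it. – user1902849
The input and the expected output look the same. Try to improve your question – fferri
Why reinvent the wheel? Just use an XML parser such as [ElementTree](https://docs.python.org/2/library/xml.etree.elementtree.html) – dawg
Please fix the indentation. –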