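Extracting data from an HTML page (Python)
I'm trying to extract some data from this page. I want to extract any text between two strings (Item 1A Risk Factors and Item 1B Unresolved Staff Comments). I'm having a hard time coming up with the right regular expression to do this.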
import re
import urllib
import html2text
url = "https://www.sec.gov/Archives/edgar/data/104169/000010416916000079/wmtform10-kx1312016.htm"
html = urllib.urlopen(url).read()
text = html2text.html2text(html)
regex= '(?<=Item 1A Risk Factors)(.*)(?=Item 1B Unresolved)'
match = re.search(regex, text, flags=re.IGNORECASE)
print match
The code above returns 'None'. Any suggestions?
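If you want to use a regex, the code below runs in Python 3.5.2. Try printing htmltext to see the actual value of ITEM 1A in the source, which differs from what you see on the rendered web page (ITEM&#160;1A). Hope this helps.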
import urllib.request
from urllib.error import URLError, HTTPError
import re
import contextlib
mainpage = "https://www.sec.gov/Archives/edgar/data/104169/000010416916000079/wmtform10-kx1312016.htm"
try:
    with contextlib.closing(urllib.request.urlopen(mainpage)) as url:
        htmltext = url.read().decode('utf-8')
        #print(htmltext)
except HTTPError as e:
    print("HTTPError")
except URLError as e:
    print("URLError")
else:
    results = re.findall(r'(?=ITEM\&\#160\;1A\.(.*)(RISK FACTORS))(.*)(?=ITEM\&\#160\;1B\.(.*)(UNRESOLVED))',htmltext)
    print(results)
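As a possible alternative to escaping the entity by hand, here is a minimal sketch that decodes it first. It is not part of the answer above and assumes the source contains ITEM&#160;1A as described. After html.unescape the entity becomes a non-breaking space, which \s matches for str patterns, and re.DOTALL lets . cross line breaks. The sample string is made up for illustration.

```python
import html
import re

# Made-up sample standing in for the downloaded page source.
sample = (
    "ITEM&#160;1A. RISK FACTORS\n"
    "The risks described below...\n"
    "more text\n"
    "ITEM&#160;1B. UNRESOLVED STAFF COMMENTS"
)

# Turn &#160; into the actual non-breaking space character (U+00A0).
decoded = html.unescape(sample)

# \s matches U+00A0 for str patterns; DOTALL lets "." match newlines,
# and the lazy (.*?) stops at the first "ITEM 1B" heading.
pattern = re.compile(
    r"ITEM\s+1A\.\s*RISK\s+FACTORS(.*?)ITEM\s+1B\.\s*UNRESOLVED",
    re.IGNORECASE | re.DOTALL,
)

m = pattern.search(decoded)
section = m.group(1).strip() if m else None
print(section)
```

Without re.DOTALL the same pattern finds nothing in this sample, because the section text spans several lines.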
Thanks! @anonyXmous – kevin
You can just use this.
Find and remove the HTML tags:
<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?</\1\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>
Replace with nothing: ""
Then run this on the resulting string:
1A\s*\.\s*RISK\s+FACTORS(.*?)1B\s*\.\s*UNRESOLVED\s+STAFF\s+COMMENTS
What you want is in capture group 1.
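A rough Python sketch of these two steps follows. One caveat: the tag-removal pattern above uses an atomic group, (?>...), which Python's re module only accepts from version 3.11 on, so this sketch swaps in the standard library's html.parser.HTMLParser for the tag-stripping step. The section pattern is the one given above, with re.DOTALL added. The names TextExtractor and strip_tags and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &#160; for us.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_tags(markup):
    """Return the text content of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


# Section pattern from the answer; DOTALL lets "." span line breaks.
SECTION = re.compile(
    r"1A\s*\.\s*RISK\s+FACTORS(.*?)1B\s*\.\s*UNRESOLVED\s+STAFF\s+COMMENTS",
    re.DOTALL,
)

# Made-up sample standing in for the filing's HTML.
sample_html = (
    "<div><b>ITEM&#160;1A.</b> <b>RISK FACTORS</b></div>"
    "<p>The risks described below could <i>materially</i> affect our business.</p>"
    "<style>p { color: red; }</style>"
    "<div><b>ITEM&#160;1B.</b> <b>UNRESOLVED STAFF COMMENTS</b></div>"
)

text = strip_tags(sample_html)
match = SECTION.search(text)
risk_factors = match.group(1).strip() if match else None
print(risk_factors)
```

Because the pattern starts matching at 1B, group 1 ends with the word ITEM that precedes it. Trim that off if it matters.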
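You can wrap the text in your own application, or
paste the group 1 string into the http://www.regexformat.com app,
then right-click the document and choose context menu -> Other Utilities -> Word Wrap.
Enter a value of about 60 for the maximum line length.
It produces 5k of wrapped text, as shown below (truncated).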
The risks described below could materially and adversely
affect our business, results of operations, financial
condition and liquidity. Our business operations could also
be affected by additional factors that apply to all
companies operating in the U.S. and globally.Strategic
RisksGeneral or macro-economic factors, both domestically
and internationally, may materially adversely affect our
financial performance.General economic conditions, globally
or in one or more of the markets we serve, may adversely
affect our financial performance. Higher interest rates,
lower or higher prices of petroleum products, including
crude oil, natural gas, gasoline, and diesel fuel, higher
costs for electricity and other energy, weakness in the
housing market, inflation, deflation, increased costs of
essential services, such as medical care and utilities,
higher levels of unemployment, decreases in consumer
disposable income, unavailability of consumer credit, higher
consumer debt levels, changes in consumer spending and
shopping patterns, fluctuations in currency exchange rates,
higher tax rates, imposition of new taxes and surcharges,
other changes in tax laws, other regulatory changes, overall
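If you would rather do the wrapping step in Python than in an external tool, the standard library's textwrap module can do it. This is a minimal sketch: group1 is a made-up stand-in for the captured string, and 60 is the line length suggested above.

```python
import textwrap

# Stand-in for the string captured in group 1.
group1 = (
    "The risks described below could materially and adversely "
    "affect our business, results of operations, financial "
    "condition and liquidity."
)

# Break the text at whitespace so no line is longer than 60 characters.
wrapped = textwrap.fill(group1, width=60)
print(wrapped)
```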
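Don't parse HTML with regex? Could you use CSS selectors or XPath with an actual parser instead? – jonrsharpe
The HTML source doesn't contain the strings "Item 1A Risk Factors" and "Item 1B Unresolved". – horcrux
Both "Item 1A Risk Factors" and "Item 1B Unresolved" are in the actual text. That's why I strip the HTML tags first and then try the regex. Hope that makes sense. – kevin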