Extracting data from an HTML page (Python)

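Problem description:

I am trying to extract some data from this page. I want all the text between two strings (Item 1A Risk Factors and Item 1B Unresolved Staff Comments), but I am having trouble coming up with the right regular expression for this.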

import re 
import html2text 

url = "https://www.sec.gov/Archives/edgar/data/104169/000010416916000079/wmtform10-kx1312016.htm" 
html = urllib.urlopen(url).read() 

text = html2text.html2text(html) 

regex= '(?<=Item 1A Risk Factors)(.*)(?=Item 1B Unresolved)' 

match = re.search(regex, text, flags=re.IGNORECASE) 

print match 

The code above returns `None`. Any suggestions?

Don't parse HTML with regular expressions? You could use CSS selectors or XPath with an actual parser instead. – jonrsharpe

The HTML source does not contain the strings “Item 1A Risk Factors” and “Item 1B Unresolved”. – horcrux

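Both “Item 1A Risk Factors” and “Item 1B Unresolved” appear in the rendered text. That is why I strip the HTML tags first and then try a regular expression. I hope that makes sense. – kevin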

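If you want to use a regex, the code below runs under Python 3.5.2. Try printing the downloaded text to see how ITEM 1A actually appears in the source, which differs from what the browser shows: the space is the entity `&#160;`, so the heading reads `ITEM&#160;1A`. Hope this helps.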

import urllib.request 
from urllib.error import URLError, HTTPError 
import re 
import contextlib 

mainpage = "https://www.sec.gov/Archives/edgar/data/104169/000010416916000079/wmtform10-kx1312016.htm" 

try:
    with contextlib.closing(urllib.request.urlopen(mainpage)) as url:
        htmltext = url.read().decode('utf-8')
        #print(htmltext)
except HTTPError as e:
    print("HTTPError")
except URLError as e:
    print("URLError")
else:
    results = re.findall(r'(?=ITEM\&\#160\;1A\.(.*)(RISK FACTORS))(.*)(?=ITEM\&\#160\;1B\.(.*)(UNRESOLVED))',htmltext)
    print(results)
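
As a possible variation on the answer above, the entity can be decoded before matching instead of being spelled out in the pattern. The sketch below is only an illustration: `SAMPLE` is a made-up stand-in for the downloaded source, and `between_headings` is a hypothetical helper name. It relies on `html.unescape` turning `&#160;` into a non-breaking space, which `\s` matches for `str` patterns in Python 3, and on `re.DOTALL` so that `.` can cross line breaks.

```python
import html
import re

# Hypothetical sample standing in for the downloaded page source.
SAMPLE = (
    "<p>ITEM&#160;1A. RISK FACTORS</p>\n"
    "<p>The risks described below could affect our business.</p>\n"
    "<p>ITEM&#160;1B. UNRESOLVED STAFF COMMENTS</p>"
)

# \s+ also matches the non-breaking space produced by html.unescape.
HEADINGS_RE = re.compile(
    r"ITEM\s+1A\.\s*RISK\s+FACTORS(.*?)ITEM\s+1B\.\s*UNRESOLVED",
    re.IGNORECASE | re.DOTALL,
)


def between_headings(source):
    """Return the raw markup between the two item headings, or None."""
    decoded = html.unescape(source)  # &#160; becomes U+00A0
    match = HEADINGS_RE.search(decoded)
    return match.group(1) if match else None


section = between_headings(SAMPLE)
print(section)
```

The captured group still contains markup, so the tags would have to be stripped in a separate step.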

Thanks! @anonyXmous – kevin

You can simply use this.

Find, to strip the HTML tags:

<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?</\1\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>

Replace with nothing: ""

Then run this on the resulting string:

1A\s*\.\s*RISK\s+FACTORS(.*?)1B\s*\.\s*UNRESOLVED\s+STAFF\s+COMMENTS

What you want is in capture group 1.
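
The two steps above (strip the tags, then search the plain text) might be sketched in Python as follows. Note the substitution: the tag-stripping pattern uses an atomic group `(?>...)`, which Python's `re` module accepts only from version 3.11 on, so this sketch swaps in the standard library's `html.parser` for that step. The second pattern is the one from the answer. `TextExtractor`, `strip_tags`, `risk_factors` and `SAMPLE` are hypothetical names and data chosen for illustration.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &#160; for us.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_tags(markup):
    """Return the text content of the markup with all tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


# Pattern taken from the answer; DOTALL lets (.*?) span line breaks.
SECTION_RE = re.compile(
    r"1A\s*\.\s*RISK\s+FACTORS(.*?)1B\s*\.\s*UNRESOLVED\s+STAFF\s+COMMENTS",
    re.DOTALL,
)


def risk_factors(markup):
    """Return capture group 1 from the stripped text, or None."""
    match = SECTION_RE.search(strip_tags(markup))
    return match.group(1).strip() if match else None


# Hypothetical sample standing in for the real filing.
SAMPLE = (
    "<html><head><style>p {color: red}</style></head><body>"
    "<b>ITEM&#160;1A. RISK FACTORS</b>"
    "<p>The risks described below could affect our business.</p>"
    "<b>ITEM&#160;1B. UNRESOLVED STAFF COMMENTS</b>"
    "</body></html>"
)
result = risk_factors(SAMPLE)
print(result)
```

Because the pattern starts its closing anchor at `1B`, the captured group ends with the word `ITEM` from the following heading; trimming it would need an extra step or a slightly longer pattern.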

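You can wrap the text in your own application or with a text tool.

Paste the group 1 string into a document in the http://www.regexformat.com app,
then right-click for the context menu -> Other Utilities -> Word Wrap.
Enter a value of about 60 for the maximum line length.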
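
If the wrapping is done in your own application instead, Python's standard `textwrap` module offers a comparable word wrap. This is a minimal sketch, assuming a width of 60 as suggested above; `wrap_section` is a hypothetical helper and `sample` is just the opening sentence of the extracted section.

```python
import textwrap


def wrap_section(text, width=60):
    """Word-wrap text so that no line exceeds the given width."""
    # Collapse runs of whitespace first so stray newlines and double
    # spaces from the original markup do not leak into the output.
    normalized = " ".join(text.split())
    return textwrap.fill(normalized, width=width)


sample = (
    "The risks described below could materially and adversely "
    "affect our business, results of operations, financial "
    "condition and liquidity."
)
wrapped = wrap_section(sample)
print(wrapped)
```

Line breaks may fall in slightly different places than in the external tool, since the two wrappers need not use identical rules.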

The result is about 5k of wrapped text, as shown below (truncated).

The risks described below could materially and adversely 
affect our business, results of operations, financial 
condition and liquidity. Our business operations could also 
be affected by additional factors that apply to all 
companies operating in the U.S. and globally.Strategic 
RisksGeneral or macro-economic factors, both domestically 
and internationally, may materially adversely affect our 
financial performance.General economic conditions, globally 
or in one or more of the markets we serve, may adversely 
affect our financial performance. Higher interest rates, 
lower or higher prices of petroleum products, including 
crude oil, natural gas, gasoline, and diesel fuel, higher 
costs for electricity and other energy, weakness in the 
housing market, inflation, deflation, increased costs of 
essential services, such as medical care and utilities, 
higher levels of unemployment, decreases in consumer 
disposable income, unavailability of consumer credit, higher 
consumer debt levels, changes in consumer spending and 
shopping patterns, fluctuations in currency exchange rates, 
higher tax rates, imposition of new taxes and surcharges, 
other changes in tax laws, other regulatory changes, overall