Parsing tags with TagSoup in Haskell

Question:

I've been trying to learn how to extract data from HTML files in Haskell, and I've hit a wall. I have no real Haskell experience at all; my prior knowledge comes from Python (and BeautifulSoup for HTML parsing).

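I'm using TagSoup to look through my HTML (it seems to be the recommended choice) and have a basic idea of how it works. Here is the basic segment of my code that is giving me trouble (self-contained, and it prints information for testing):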

import System.IO 
import Network.HTTP 
import Text.HTML.TagSoup 
import Data.List 

main :: IO() 
main = do 
    http <- simpleHTTP (getRequest "http://www.cbssports.com/nba/scoreboard/20130310") >>= getResponseBody 
    let tags = dropWhile (~/= TagOpen "div" []) (parseTags http) 
    done tags where 
     done xs = case xs of 
      [] -> putStrLn $ "\n" 
      _ -> do 
       putStrLn $ show $ head xs 
       done (tail xs)

However, I'm not trying to get to just any "div" tag. I want to drop everything before tags of the following format:

TagOpen "div" [("id","scores-1997830"),("class","scoreBox spanCol2")] 
TagOpen "div" [("id","scores-1997831"),("class","scoreBox spanCol2 lastCol")]

I tried writing:

let tags = dropWhile (~/= TagOpen "div" [("id", "scores-[0-9]+"), ("class", "scoreBox(spanCol[0-9]?)+(lastCol)?")]) (parseTags http)

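But that then looks for the literal text [0-9]+ instead of treating it as a pattern. I haven't figured out a workaround with the Text.Regex.Posix module, and escaping the characters doesn't work. What is the solution here?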

Answer

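~== does not do regular expressions (and neither does ~/=, its negation). You will have to write the matching yourself, something along the lines of: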

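import Data.Maybe (isJust)
import Text.HTML.TagSoup
import Text.Regex (matchRegex, mkRegex)

-- True for an opening <div> whose "id" attribute matches the pattern.
-- (&&) short-circuits, so fromAttrib is only applied to opening tags.
goodTag :: Tag String -> Bool
goodTag tag = tag ~== TagOpen "div" []
    && fromAttrib "id" tag `matches` "scores-[0-9]+"

-- Just a wrapper around Text.Regex.matchRegex
matches :: String -> String -> Bool
matches string regex = isJust $ mkRegex regex `matchRegex` string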
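
To tie this back to the question's dropWhile, here is a rough sketch of how such a predicate could be plugged in. It is not the only way to do it. The inline sampleHtml string is a made-up stand-in for the downloaded page, and anchoring the pattern as ^scores-[0-9]+$ is an assumption about what should count as a match (the unanchored pattern also accepts ids that merely contain such a substring). Text.Regex is taken from the regex-compat package.

```haskell
import Data.Maybe (isJust)
import Text.HTML.TagSoup
import Text.Regex (matchRegex, mkRegex) -- from the regex-compat package

-- True for an opening <div> whose id looks like "scores-<digits>".
-- isTagOpenName is False for non-opening tags, so fromAttrib is
-- never applied to text or closing tags.
goodTag :: Tag String -> Bool
goodTag tag =
  isTagOpenName "div" tag
    && isJust (matchRegex (mkRegex "^scores-[0-9]+$") (fromAttrib "id" tag))

-- Hypothetical sample input standing in for the real scoreboard page.
sampleHtml :: String
sampleHtml =
  "<div id=\"header\">NBA</div>\
  \<div id=\"scores-1997830\" class=\"scoreBox spanCol2\">Game 1</div>"

main :: IO ()
main =
  -- Skip tags until the first one that satisfies goodTag, then print the rest.
  mapM_ print (dropWhile (not . goodTag) (parseTags sampleHtml))
```

With the real page, parseTags sampleHtml would be replaced by parseTags http from the question; the printed list should then start at the first scoreBox div rather than at the first div of any kind.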

How about the line `fromAttrib "id" tag =~ "scores-[0-9]+"`? – 2013-03-17 15:28:15

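Thanks, guys! Both of these work. I'm not sure which one is "better", but since I want to write out as much of the code as possible myself (for learning purposes, don't worry), I'll go with Koterpillar's method for now. Thanks a bunch! – simonsays 2013-03-17 18:36:23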
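
For readers comparing the two suggestions, the =~ variant from the first comment might look roughly like the sketch below. It assumes the regex-posix package for Text.Regex.Posix; the name isScoreDiv and the three test tags are made up for illustration.

```haskell
import Text.HTML.TagSoup
import Text.Regex.Posix ((=~)) -- from the regex-posix package

-- Same idea as goodTag, written with the (=~) operator.
-- (=~) is overloaded on its result type; the Bool annotation
-- selects the plain "does it match?" answer.
isScoreDiv :: Tag String -> Bool
isScoreDiv tag =
  isTagOpenName "div" tag
    && (fromAttrib "id" tag =~ "scores-[0-9]+" :: Bool)

main :: IO ()
main = do
  print (isScoreDiv (TagOpen "div" [("id", "scores-1997830")])) -- True
  print (isScoreDiv (TagOpen "div" [("id", "header")]))         -- False
  print (isScoreDiv (TagText "scores-1"))                       -- False
```

The difference between the two approaches is mostly one of style: =~ avoids the small wrapper function, while the wrapper keeps the regex machinery in one named place.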
