在单个输入中选择多个正确的解析器
我想知道解析多个解析器可以成功的输入的最佳方式。我已经概述了我第一次失败的尝试和一个我希望可以变得更加习惯的不雅的解决方案。在单个输入中选择多个正确的解析器
比如我想莱克斯“号的”,“快速”和“狐狸”从下面的句子翻译成他们自己的数据构造:
"the quick brown fox jumps over the lazy dog".
因此,考虑以下类型的构造函数:
data InterestingWord = Quick | The | Fox deriving Show
data Snippet = Word InterestingWord | Rest String deriving Show
我想解析的输出是:
[Word The,
Rest " ", Word Quick,
Rest " brown ", Word Fox,
Rest " jumped over ", Word The,
Rest " lazy dog"]
这里有两种解决方案:
import Text.Parsec
import Data.Maybe
import Data.Ord
import Data.List
data InterestingWord = Quick | The | Fox deriving Show
data Snippet = Word InterestingWord | Rest String deriving Show
testCase = "the quick brown fox jumped over the lazy dog"
-- Expected output:
-- [Word The,
-- Rest " ", Word Quick,
-- Rest " brown ", Word Fox,
-- Rest " jumped over ", Word The,
-- Rest " lazy dog"]
toString Quick = "quick"
toString The = "the"
toString Fox = "fox"
-- First attempt
-- Return characters upto the intended word along
-- with the word itself
upto word = do
pre <- manyTill anyChar $ lookAhead $ string (toString word)
word' <- try $ string (toString word)
return [Rest pre, Word word]
-- Parsers for the interesting words
parsers = [upto Quick,
upto The,
upto Fox]
-- Try each parser and return its results with the
-- rest of the input.
-- An incorrect result is produced because "choice"
-- picks the first successful parse result.
wordParser = do
snippets <- many $ try $ choice parsers
leftOver <- many anyChar
return $ concat $ snippets ++ [[Rest leftOver]]
-- [Rest "the ",Word Quick,Rest " brown fox jumped over the lazy dog"]
test1 = parseTest wordParser testCase
-- Correct
-- In addition to the characters leading upto the
-- word and the word, the position is also returned
upto' word = do
result <- upto word
pos <- getPosition
return (pos, result)
-- The new parsers
parsers' = [upto' Quick,
upto' The,
upto' Fox]
-- Try each of the given parsers and
-- possibly returning the results and
-- the parser but don't consume
-- input.
tryAll = mapM (\p -> do
r <- optionMaybe $ try (lookAhead p)
case r of
Just result -> return $ Just (p, result)
Nothing -> return $ Nothing
)
-- Pick the parser that has consumed the least.
firstSuccess ps = do
successes <- tryAll ps >>= return . catMaybes
if not (null successes) then
return $ Just (fst $ head (sortBy (comparing (\(_,(pos,_)) -> pos)) successes))
else return $ Nothing
-- Return the parse results for the parser that
-- has consumed the least
wordParser' = do
parser <- firstSuccess parsers'
case parser of
Just p -> do
(_,snippet) <- p
return snippet
Nothing -> parserZero
-- Returns the right result
test2 = parseTest (many wordParser' >>= return . concat) testCase
第一次尝试“测试1”,因为“选择”返回成功,当我真正想要的是同时消耗最少的字符成功的第一个解析器第一分析器不产生所需的输出。这是我接下来的尝试,通过保持输入被解析后的源位置,并使用源位置最低的解析器。
这种情况似乎很普遍,我觉得我错过了一些明显的combinator咒语。谁能提供更好的建议?
谢谢!
-deech
这是不是一个特别普遍的需求,但在这里是一个实现:
import Control.Monad
import "parsec3" Text.Parsec
import Data.Maybe
import Data.List
import Data.Ord
longestParse :: [Parsec String() a] -> Parsec String() a
longestParse parsers = do
allParses <- sequence [lookAhead $ optionMaybe $ try $
liftM2 (,) parse getPosition | parse <- parsers]
-- allParses :: [Maybe (a, SourcePos)]
(bestParse, bestPos) <- case catMaybes allParses of
[] -> fail "No valid parse" -- maybe we can do something better?
successfulParses -> return $ minimumBy (comparing snd) successfulParses
setPosition bestPos
return bestParse
有趣的是,这不是一个常见的用例,并且给出您的答案,我想不是太遥远了。我非常喜欢你如何使用列表理解来过滤出成功的分析。谢谢! – Deech 2012-02-11 14:56:24
据我了解,你要反复分析给你看的第一个有趣的词。目前,您正在解析每个有趣的单词,并检查您找到的哪个有趣单词更接近。
我的建议是定义一个解析器来检查你现在是否在一个有趣的单词(只有一个选择可以成功,所以没有必要做比任何简单的选择更有用的东西)。然后你继续前进,直到第一个解析器成功,当碰到任何有趣的单词时会发生这种情况。这给了你第一个有趣的单词,因为在它包含任何有趣的单词之前你什么都不知道。
是的,这并没有回答确定哪个解析器匹配最短的问题。这避免了这个问题,通过解决实际问题而不关心哪个解析器匹配是最短的。
作为一般观点 - 我不会急于使用Parsec进行NLP解析,它实际上是解析编程语言和结构化文本格式的工具。正在进行的Haskell NLP书似乎直接使用Prelude的“单词”和列表功能 - http://nlpwp.org/book/ – 2012-02-10 18:41:53