category feedParsingCommon needseditdelete

This category is a work in progress






Parsers turn one representation of data into another its representation with the latter usually being more convenient to work with. Needless to say, the representations are rarely isomorphic, so the process can fail.

Recommendations

Most people are using parsec, or attoparsec if they need speed. megaparsec is better than parsec (while still very similar to it), but not yet as widespread. The API of attoparsec is easier than the one of parsec/megaparsec, but error messages are bad.

trifecta is for advanced users – it has highlighting and nice error messages, but it's hard to figure out. If you're not writing a compiler, you probably don't need it. On the other hand, if you are writing a compiler, then you might also look at alex/happy – e.g. GHC's parser is implemented using those.

Some people favor Earley, since it has a more advanced algorithm than most other libraries and so it's easier to express complex parsers in it. However, Earley is still a pretty rare library.

edit description
or press Ctrl+Enter to savemarkdown supported
#
parsec-like
move item up move item down edit item info delete item
Summary edit summary

An unofficial successor of parsec (which hasn't seen any updates in quite some time). Nothing particularly fancy – just a good, modern parsing library.

Summary quit editing summary
Prosedit prosquit editing pros
  • Very easy to use.
    move trait up move trait down edit trait delete trait
  • Error messages are good.
    move trait up move trait down edit trait delete trait
  • Allows to use custom error messages tailored to your domain of interest (that means you can signal errors using your own custom data constructors).
    move trait up move trait down edit trait delete trait
  • The API is largely similar to Parsec, so existing tutorials/code samples could be reused and migration is easy.
    move trait up move trait down edit trait delete trait
  • Works well with Text and custom streams of tokens, such as result of running Alex/Happy.
    move trait up move trait down edit trait delete trait
  • Has special combinators for parsing indentation (good if you're writing a parser for a small programming language or data format like YAML).
    move trait up move trait down edit trait delete trait
  • Has rudimentary error recovery – if a part of a parser fails, you can log a parse error and skip a part of input. Sometimes it's useful.
    move trait up move trait down edit trait delete trait
  • Has special combinator (as of 5.1.0) for debugging that shows what is going on on lower level.
    move trait up move trait down edit trait delete trait
  • Well-tested and robust.
    move trait up move trait down edit trait delete trait

press Ctrl+Enter or Enter to addmarkdown supportededit off
Consedit consquit editing cons
  • Like all parsec-like libraries, it doesn't like left recursion – i.e. if you're parsing 1+2+3, you can't just write something like (in pseudocode) expr = number | (expr '+' number) and expect it to work. See this post for a more detailed explanation.
    move trait up move trait down edit trait delete trait
  • Doesn't have automatic backtracking. This means that if you write expr = add | multiply and the parser for add fails in the middle (e.g. after parsing a single number), it won't try multiply unless you explicitly tell it to. This can be a good thing (saying when you want to backtrack explicitly can lead to better performance and better error messages), but it can still be somewhat annoying.
    move trait up move trait down edit trait delete trait

press Ctrl+Enter or Enter to addmarkdown supportededit off
Ecosystemedit ecosystem
Ecosystemquit editing ecosystemor press Ctrl+Enter to savemarkdown supported
Notes
collapse notesedit notes

Links

  • Parsing CSS with Parsec is a very to-the-point tutorial and I recommend looking at it first – it's possible that after reading it you'll understand how to do parsing without any lengthy explanations. It's about Parsec, not Megaparsec, but the only difference is in combinator names (e.g. many1 should be some).

Imports

import Text.Megaparsec
import Text.Megaparsec.String    -- or Text / Text.Lazy / etc

Additionally, if you're going to use number parsers (e.g. integer):

import Text.Megaparsec.Lexer

A long, very basic example

(Only read it if you were left confused by Parsing CSS with Parsec.)

Parsers work like this:

  • A parser consumes a piece of string and turns it into a value. Parser a returns a value of type a.

  • If you combine parsers with >> or do, they get applied one-by-one (the 2nd parser would start consuming string from the place where the 1st parser stopped).

  • If you combine parsers with <|>, they will be applied to the same pieces of string, but the 2nd parser would only be tried if the 1st parser fails.

Let's say you have something that is either a pair or a triplet of numbers, and has to be enclosed into double parens: ((1,2)) or ((1,2,3)). We'll represent it with Vec:

data Vec = V2 Integer Integer
         | V3 Integer Integer Integer
  deriving Show

Let's write a parser for Vec. First of all, notice that the parens are the same in both cases, and so we can write a special function for parsing something inside double parens:

doubleParens :: Parser a -> Parser a
doubleParens p = between (string "((") (string "))") p

between is a combinator available in Megaparsec. There are many such combinators – you can find the whole list in Text.Megaparsec.Combinator.

Next, parsers for V2 and V3:

v2 :: Parser Vec
v2 = doubleParens $ do
  a <- integer
  char ','
  b <- integer
  return (V2 a b)

v3 :: Parser Vec
v3 = doubleParens $ do
  a <- integer
  char ','
  b <- integer
  char ','
  c <- integer
  return (V3 a b c)

Once we have those 2 parsers, we can combine them:

vec :: Parser Vec
vec = try v2 <|> v3

try means that if v2 consumes some input and then fails, the next parser (i.e. v3) will be tried. By default Megaparsec only tries the next parser if the previous one fails without consuming any input, which gives you more control over how your parser behaves. (There are some parsing libraries that do automatic backtracking (i.e. when a parser fails, even at the last step, they always go back in time and try other parsers), but Megaparsec doesn't do it.)

To actually use the parser, we need parse:

parse
  :: Parser a
  -> String            	    -- Filepath (can be "")
  -> String                 -- String that is being parsed
  -> Either ParseError a	 

(The actual signature is more general because it can also work on Text and so on.)

> parse vec "" "((3,12))"
Right (V2 3 12)

What would happen if the input is bad?

> parse vec "" "1,2))"

Left line 1, column 1:
unexpected '1'
expecting "(("

> parse vec "" "((3,12,))"

Left line 1, column 7:
unexpected ','
expecting "))" or rest of integer

> parse vec "" "((1,a))"

Left line 1, column 5:
unexpected 'a'
expecting integer

By the way, there's a bit more repetition in our parsers that we could eliminate – specifically, parsing the first 2 numbers:

vec :: Parser Vec
vec = doubleParens $ do
  a <- integer
  char ','
  b <- integer
  choice [
    do char ','
       c <- integer
       return (V3 a b c),
    return (V2 a b) ]

Here we used choice, which is like <|> but for many parsers instead of just 2.

collapse notesedit notes
#
attoparsec (Hackage)
parsec-like
move item up move item down edit item info delete item
Summary edit summary

A very fast parsing library for Text and ByteString. Best suited for parsing things that aren't going to be seen by humans (like JSON, binary protocols, and so on). Not that good for parsing e.g. programming languages – for instance, it doesn't even tell you the positions of errors when they happen.

Summary quit editing summary
Prosedit prosquit editing pros
  • Performance (see this for a comparison of sorts). Can be 10× faster than Parsec.
    move trait up move trait down edit trait delete trait
  • Has automatic backtracking, which means that you don't have to figure out where to put try – everything just works.
    move trait up move trait down edit trait delete trait
  • Has a simpler API than parsec/megaparsec.
    move trait up move trait down edit trait delete trait

press Ctrl+Enter or Enter to addmarkdown supportededit off
Consedit consquit editing cons
  • Can't report positions of parsing errors. (And the error messages are generally poor.)
    move trait up move trait down edit trait delete trait
  • Doesn't provide a monad transformer. This means that if you want to do something while parsing (e.g. keep state, or print warnings, or whatever), you can't.
    move trait up move trait down edit trait delete trait
  • Backtracking can't be turned off or limited in scope (i.e. you can't say “if this parser didn't fail then commit to it”). It makes error messages worse and likely hurts performance (but I'm not sure, given that attoparsec is still the fastest library around).
    move trait up move trait down edit trait delete trait

press Ctrl+Enter or Enter to addmarkdown supportededit off
Ecosystemquit editing ecosystemor press Ctrl+Enter to savemarkdown supported
Notes
collapse notesedit notes

Links

Imports

For parsing ByteString:

import Control.Applicative
import Data.Attoparsec.ByteString

For parsing Text:

import Control.Applicative
import Data.Attoparsec.Text

Helpers

This function modifies a parser to print some info about it (namely, what it has consumed, remaining input, and the value it has parsed):

import Debug.Trace
import Data.Attoparsec.Combinators

debug :: Show a => Parser a -> Parser a
debug p = do
  (consumed, a) <- match p
  remaining <- lookAhead takeText
  traceM ("result    : " ++ show a)
  traceM ("consumed  : " ++ show consumed)
  traceM ("remaining : " ++ show remaining)
  return a
collapse notesedit notes
#
trifecta (Hackage)
parsec-like
move item up move item down edit item info delete item
Summary edit summary

A library that is supposed to give you much nicer error messages that the ones of Parsec. Can be hard to figure out, but worth trying if you're writing an interpreter/compiler.

Summary quit editing summary
Prosedit prosquit editing pros
  • Lets you report errors in a manner similar to Clang, with colors and ^~~~~~~~~ and so on, which is very useful when writing e.g. a compiler. (For an example of what Clang does, see here.)
    move trait up move trait down edit trait delete trait
  • Has a module for doing highlighting of parsed text (i.e. you assign labels like Number, Operator, Identifier, etc and you can generate colored text from them).
    move trait up move trait down edit trait delete trait

press Ctrl+Enter or Enter to addmarkdown supportededit off
Consedit consquit editing cons
  • Kinda complicated, doesn't have any tutorials available, and documentation doesn't help at all.
    move trait up move trait down edit trait delete trait
  • Can parse String and ByteString natively, but not Text.
    move trait up move trait down edit trait delete trait
  • Depends on lens and thus by depending on trifecta you pull in half of Hackage too.
    move trait up move trait down edit trait delete trait

press Ctrl+Enter or Enter to addmarkdown supportededit off
Ecosystemedit ecosystem
Ecosystemquit editing ecosystemor press Ctrl+Enter to savemarkdown supported
Notes
collapse notesedit notes

Imports

import Control.Applicative
import Text.Trifecta

Gotchas

If you want to parse Text, either convert it to String or do something like this.

collapse notesedit notes
#
parsec-like
move item up move item down edit item info delete item
Summary edit summary

A small, simple parsing module shipped with GHC by default. Pretty usable if you don't care about getting error messages. Good for writing Read instances.

Summary quit editing summary
Prosedit prosquit editing pros
  • It's in base, so you can use it even when you can't (or don't want to) depend on any parsing library.
    move trait up move trait down edit trait delete trait
  • Non-deterministic – all parse results will be returned. Hence doesn't need try or backtracking, and doesn't leak space. (Left-biased Parsec-like choice is still possible with <++.)
    move trait up move trait down edit trait delete trait
  • Can be used for writing complicated Read instances that are fully compliant with Haskell's precedency parsing requirements (see the ReadPrec module).
    move trait up move trait down edit trait delete trait
  • Can be faster than Parsec (see this benchmark where parsing a simple config file is twice as fast with ReadP).
    move trait up move trait down edit trait delete trait
  • Has a function for using the Read instance as a parser (i.e. readP_to_S reads).
    move trait up move trait down edit trait delete trait

press Ctrl+Enter or Enter to addmarkdown supportededit off
Consedit consquit editing cons
  • Doesn't give any errors whatsoever.
    move trait up move trait down edit trait delete trait
  • Non-determinism everywhere can be annoying if you don't need it (for instance, it's non-trivial to write a greedy many if you need it).
    move trait up move trait down edit trait delete trait
  • Can't parse Text.
    move trait up move trait down edit trait delete trait
  • Doesn't provide any advanced features like monad transformers or state.
    move trait up move trait down edit trait delete trait

press Ctrl+Enter or Enter to addmarkdown supportededit off
Ecosystemedit ecosystem
Ecosystemquit editing ecosystemor press Ctrl+Enter to savemarkdown supported
Notes
collapse notesedit notes

Imports

import Text.ParserCombinators.ReadP

Or if you have Control.Applicative imported, write it like this to avoid clashes:

import Control.Applicative
import Text.ParserCombinators.ReadP hiding (many, optional)

Usage

It might not be obvious how to run a parser, so:

-- Return all possible parses
parse :: ReadP a -> String -> [(a, String)]
parse = readP_to_S

-- Return all possible parses, and additionally require all input to be consumed
parseAll :: ReadP a -> String -> [a]
parseAll p = map fst . readP_to_S (p <* eof)

Gotchas

many is rather inefficient because it's nondeterministic and so all alternatives (0 elements consumed, 1 element consumed, 2 elements consumed, etc) have to be considered:

p = (,) <$> many (satisfy isLetter)
        <*> many (satisfy isUpper)
> map fst $ readP_to_S (p <* eof) "abcXYZ"
[("abc","XYZ"),
 ("abcX","YZ"),
 ("abcXY","Z"),
 ("abcXYZ","")]

If you know you wouldn't need this, use munch or munch1, it will be much faster. (Also note that in this example you'll get ("abcXYZ",""), not ("abc","XYZ"), if you use munch, but this is to be expected – just like with many in Parsec.)

Other combinators are non-deterministic too, which can sometimes lead to unexpected issues.

collapse notesedit notes
#
Earley (Hackage)
other
move item up move item down edit item info delete item
Summary edit summary

write something here!

Summary quit editing summary
Prosedit prosquit editing pros

press Ctrl+Enter or Enter to addmarkdown supportededit off
Consedit consquit editing cons
  • Doesn't have monadic parsing
    move trait up move trait down edit trait delete trait

press Ctrl+Enter or Enter to addmarkdown supportededit off
Ecosystemedit ecosystem
Ecosystemquit editing ecosystemor press Ctrl+Enter to savemarkdown supported
Notes
collapse notesedit notes

Links

Imports and pragmas

{-# LANGUAGE RecursiveDo #-}
import Control.Applicative
import Text.Earley
-- If you want to construct a parser for a language with operators
import Text.Earley.Mixfix

Usage

The type for parsers is Prod:

data Prod
       r    -- A phantom variable, like “s” in “ST s a”
       e    -- Type for names of parsers
       t    -- Type for characters (e.g. Char or Word8 or some Token)
       a    -- Result of the parser

e is usually going to be String even if you're parsing e.g. Text. t will be Char for String and Text, and SomeToken if you have previously lexed/tokenized your input. It's usual to define a type synonym like this:

type Parser r a = Prod r String Char a

Writing parsers with Earley is similar to Parsec, with 2 major differences:

  • You can't use do and monadic parsing – for instance, it's impossible to write a parser that would parse a prime number, while with Parsec it's easy.

  • You're given much less combinators by default and so you have to write many things (between, etc) by yourself.

If a parser depends on itself, it will loop. To avoid that, you have to define recursive parsers in the context of Grammar:

data Term = Number Integer | Add Term Term
  deriving Show

grammar :: Grammar r (Parser r Term)
grammar = mdo
  let number = Number <$> integer
  add <- rule $
    Add <$> term <*> (word "+" *> number)
  term <- rule $
    number <|> add
  return term

(word is the same as string in Parsec.)

As you can see, you can define parsers that depend on each other, but all such parsers have to be marked with rule.

To run a Grammar, use fullParses:

> let (parses, rep) = fullParses (parser grammar) "1+2+3"

> parses
[Add (Add (Number 1) (Number 2))
     (Number 3)]

> rep
Report {position = 5, expected = [], unconsumed = ""})

You can also use allParses to get partial parses as well, or report to see how much of the input the parser can consume without actually getting all parses.

Note that our grammar is left-recursive, but Earley was still able to parse it. The model of parsing you might have if you know Parsec is inapplicable to Earley.

Also note that there can be several parses. For instance, if we changed the definition of add to

  add <- rule $
    Add <$> term <*> (word "+" *> term)

we'd get the following parses:

> fst $ fullParses (parser grammar) "1+2+3"
[ Add (Add (Number 1) (Number 2))
      (Number 3)
, Add (Number 1)
      (Add (Number 2) (Number 3))]

Mixfix parsing

See Text.Earley.Mixfix and this example.

(TODO: write a better example here.)

collapse notesedit notes
#
uu-parsinglib (Hackage)
other
move item up move item down edit item info delete item
Summary edit summary

write something here!

Summary quit editing summary
Prosedit prosquit editing pros

    press Ctrl+Enter or Enter to addmarkdown supportededit off
    Consedit consquit editing cons

      press Ctrl+Enter or Enter to addmarkdown supportededit off
      Ecosystemedit ecosystem
      Ecosystemquit editing ecosystemor press Ctrl+Enter to savemarkdown supported
      Notes
      collapse notesedit notes

      <notes are empty>

      add something!

      #
      alex/happy
      other
      move item up move item down edit item info delete item
      Summary edit summary

      write something here!

      Summary quit editing summary
      Prosedit prosquit editing pros

        press Ctrl+Enter or Enter to addmarkdown supportededit off
        Consedit consquit editing cons

          press Ctrl+Enter or Enter to addmarkdown supportededit off
          Ecosystemedit ecosystem
          Ecosystemquit editing ecosystemor press Ctrl+Enter to savemarkdown supported
          Notes
          collapse notesedit notes

          <notes are empty>

          add something!

          #
          fastparser (Hackage)
          non-incremental
          move item up move item down edit item info delete item
          Summary edit summary

          A very simple, backtracking, fast parser combinator library.

          Do not use fastparser when:

          • performance is not the most pressing concern.
          • you need to parse anything else but strict ByteString.
          • you need to use a battle-tested library (still experimental)
          • you need to parse large inputs that are not easily cut into many smaller pieces that can be parsed independently
          Summary quit editing summary
          Prosedit prosquit editing pros
          • Very, very fast. Measurably faster than attoparsec (36% in this use case)
            move trait up move trait down edit trait delete trait

          press Ctrl+Enter or Enter to addmarkdown supportededit off
          Consedit consquit editing cons
          • only works on strict ByteString
            move trait up move trait down edit trait delete trait
          • lacks many helper functions
            move trait up move trait down edit trait delete trait
          • is not resumable
            move trait up move trait down edit trait delete trait

          press Ctrl+Enter or Enter to addmarkdown supportededit off
          Ecosystemedit ecosystem
          Ecosystemquit editing ecosystemor press Ctrl+Enter to savemarkdown supported
          Notes
          collapse notesedit notes

          <notes are empty>

          add something!