Parsing – Haskell – Aelve Guide

fastparser

2017-10-03T15:23:56Z

fastparser (Hackage)

A very simple, backtracking, fast parser combinator library.

Do not use fastparser when:

performance is not the most pressing concern.
you need to parse anything else but strict ByteString.
you need to use a battle-tested library (still experimental)
you need to parse large inputs that are not easily cut into many smaller pieces that can be parsed independently

Pros

Very, very fast. Measurably faster than attoparsec (36% in this use case)

Cons

only works on strict ByteString

lacks many helper functions

is not resumable

uu-parsinglib

2016-03-10T11:06:40Z

uu-parsinglib (Hackage)

Pros

Cons

megaparsec

2016-03-10T11:06:40Z

megaparsec (Hackage)

An unofficial successor of parsec (which hasn't seen any updates in quite some time). Nothing particularly fancy – just a good, modern parsing library.

Pros

Very easy to use.

Error messages are good.

Allows to use custom error messages tailored to your domain of interest (that means you can signal errors using your own custom data constructors).

The API is largely similar to Parsec, so existing tutorials/code samples could be reused and migration is easy.

Works well with Text and custom streams of tokens, such as result of running Alex/Happy.

Has special combinators for parsing indentation (good if you're writing a parser for a small programming language or data format like YAML).

Has rudimentary error recovery – if a part of a parser fails, you can log a parse error and skip a part of input. Sometimes it's useful.

Has special combinator (as of 5.1.0) for debugging that shows what is going on on lower level.

Well-tested and robust.

Cons

Like all parsec-like libraries, it doesn't like left recursion – i.e. if you're parsing 1+2+3, you can't just write something like (in pseudocode) expr = number | (expr '+' number) and expect it to work. See this post for a more detailed explanation.

Doesn't have automatic backtracking. This means that if you write expr = add | multiply and the parser for add fails in the middle (e.g. after parsing a single number), it won't try multiply unless you explicitly tell it to. This can be a good thing (saying when you want to backtrack explicitly can lead to better performance and better error messages), but it can still be somewhat annoying.

Ecosystem

hspec-megaparsec - utility functions for testing Megaparsec parsers with Hspec.
cassava-megaparsec - Megaparsec parser of CSV files that plays nicely with Cassava.
tagsoup-megaparsec - a Tag token parser and Tag specific parsing combinators.
parser-combinators - lightweight package providing commonly useful parser combinators.

Notes

Links

Parsing CSS with Parsec is a very to-the-point tutorial and I recommend looking at it first – it's possible that after reading it you'll understand how to do parsing without any lengthy explanations. It's about Parsec, not Megaparsec, but the only difference is in combinator names (e.g. many1 should be some).

Imports

import Text.Megaparsec
import Text.Megaparsec.String    -- or Text / Text.Lazy / etc

Additionally, if you're going to use number parsers (e.g. integer):

import Text.Megaparsec.Lexer

A long, very basic example

(Only read it if you were left confused by Parsing CSS with Parsec.)

Parsers work like this:

A parser consumes a piece of string and turns it into a value. Parser a returns a value of type a.
If you combine parsers with >> or do, they get applied one-by-one (the 2nd parser would start consuming string from the place where the 1st parser stopped).
If you combine parsers with <|>, they will be applied to the same pieces of string, but the 2nd parser would only be tried if the 1st parser fails.

Let's say you have something that is either a pair or a triplet of numbers, and has to be enclosed into double parens: ((1,2)) or ((1,2,3)). We'll represent it with Vec:

data Vec = V2 Integer Integer
         | V3 Integer Integer Integer
  deriving Show

Let's write a parser for Vec. First of all, notice that the parens are the same in both cases, and so we can write a special function for parsing something inside double parens:

doubleParens :: Parser a -> Parser a
doubleParens p = between (string "((") (string "))") p

between is a combinator available in Megaparsec. There are many such combinators – you can find the whole list in Text.Megaparsec.Combinator.

Next, parsers for V2 and V3:

v2 :: Parser Vec
v2 = doubleParens $ do
  a <- integer
  char ','
  b <- integer
  return (V2 a b)

v3 :: Parser Vec
v3 = doubleParens $ do
  a <- integer
  char ','
  b <- integer
  char ','
  c <- integer
  return (V3 a b c)

Once we have those 2 parsers, we can combine them:

vec :: Parser Vec
vec = try v2 <|> v3

try means that if v2 consumes some input and then fails, the next parser (i.e. v3) will be tried. By default Megaparsec only tries the next parser if the previous one fails without consuming any input, which gives you more control over how your parser behaves. (There are some parsing libraries that do automatic backtracking (i.e. when a parser fails, even at the last step, they always go back in time and try other parsers), but Megaparsec doesn't do it.)

To actually use the parser, we need parse:

parse
  :: Parser a
  -> String            	    -- Filepath (can be "")
  -> String                 -- String that is being parsed
  -> Either ParseError a

(The actual signature is more general because it can also work on Text and so on.)

> parse vec "" "((3,12))"
Right (V2 3 12)

What would happen if the input is bad?

> parse vec "" "1,2))"

Left line 1, column 1:
unexpected '1'
expecting "(("

> parse vec "" "((3,12,))"

Left line 1, column 7:
unexpected ','
expecting "))" or rest of integer

> parse vec "" "((1,a))"

Left line 1, column 5:
unexpected 'a'
expecting integer

By the way, there's a bit more repetition in our parsers that we could eliminate – specifically, parsing the first 2 numbers:

vec :: Parser Vec
vec = doubleParens $ do
  a <- integer
  char ','
  b <- integer
  choice [
    do char ','
       c <- integer
       return (V3 a b c),
    return (V2 a b) ]

Here we used choice, which is like <|> but for many parsers instead of just 2.

attoparsec

2016-03-10T11:06:40Z

attoparsec (Hackage)

A very fast parsing library for Text and ByteString. Best suited for parsing things that aren't going to be seen by humans (like JSON, binary protocols, and so on). Not that good for parsing e.g. programming languages – for instance, it doesn't even tell you the positions of errors when they happen.

Pros

Performance (see this for a comparison of sorts). Can be 10× faster than Parsec.

Has automatic backtracking, which means that you don't have to figure out where to put try – everything just works.

Has a simpler API than parsec/megaparsec.

Cons

Can't report positions of parsing errors. (And the error messages are generally poor.)

Doesn't provide a monad transformer. This means that if you want to do something while parsing (e.g. keep state, or print warnings, or whatever), you can't.

Backtracking can't be turned off or limited in scope (i.e. you can't say “if this parser didn't fail then commit to it”). It makes error messages worse and likely hurts performance (but I'm not sure, given that attoparsec is still the fastest library around).

Ecosystem

Additional parsers: attoparsec-expr (Parsec-like expression parser), attoparsec-binary, http-attoparsec, aeson (JSON), timeparsers, html-entities, taggy (HTML/XML), css-text, hweblib (HTTP, MIME, URI, ABNF), http-date
Iteratees: attoparsec-iteratee, pipes-attoparsec, conduit-extra, conduit-tokenize-attoparsec, streaming-utils, io-streams
Other: foldl-transduce-attoparsec, hspec-attoparsec, list-t-attoparsec, network-attoparsec, attosplit

Notes

Imports

For parsing ByteString:

import Control.Applicative
import Data.Attoparsec.ByteString

For parsing Text:

import Control.Applicative
import Data.Attoparsec.Text

Helpers

This function modifies a parser to print some info about it (namely, what it has consumed, remaining input, and the value it has parsed):

import Debug.Trace
import Data.Attoparsec.Combinators

debug :: Show a => Parser a -> Parser a
debug p = do
  (consumed, a) <- match p
  remaining <- lookAhead takeText
  traceM ("result    : " ++ show a)
  traceM ("consumed  : " ++ show consumed)
  traceM ("remaining : " ++ show remaining)
  return a

Earley

2016-03-10T11:06:40Z

Earley (Hackage)

Pros

Handles left recursion just fine

Can generate strings given a parser that parses them.

Cons

Doesn't have monadic parsing

Ecosystem

pinchot

Notes

Imports and pragmas

{-# LANGUAGE RecursiveDo #-}

import Control.Applicative
import Text.Earley
-- If you want to construct a parser for a language with operators
import Text.Earley.Mixfix

Usage

The type for parsers is Prod:

data Prod 
       r    -- A phantom variable, like “s” in “ST s a”
       e    -- Type for names of parsers
       t    -- Type for characters (e.g. Char or Word8 or some Token)
       a    -- Result of the parser

e is usually going to be String even if you're parsing e.g. Text. t will be Char for String and Text, and SomeToken if you have previously lexed/tokenized your input. It's usual to define a type synonym like this:

type Parser r a = Prod r String Char a

Writing parsers with Earley is similar to Parsec, with 2 major differences:

You can't use do and monadic parsing – for instance, it's impossible to write a parser that would parse a prime number, while with Parsec it's easy.
You're given much less combinators by default and so you have to write many things (between, etc) by yourself.

If a parser depends on itself, it will loop. To avoid that, you have to define recursive parsers in the context of Grammar:

data Term = Number Integer | Add Term Term
  deriving Show

grammar :: Grammar r (Parser r Term)
grammar = mdo
  let number = Number <$> integer
  add <- rule $
    Add <$> term <*> (word "+" *> number)
  term <- rule $
    number <|> add
  return term

(word is the same as string in Parsec.)

As you can see, you can define parsers that depend on each other, but all such parsers have to be marked with rule.

To run a Grammar, use fullParses:

> let (parses, rep) = fullParses (parser grammar) "1+2+3"

> parses
[Add (Add (Number 1) (Number 2))
     (Number 3)]

> rep
Report {position = 5, expected = [], unconsumed = ""})

You can also use allParses to get partial parses as well, or report to see how much of the input the parser can consume without actually getting all parses.

Note that our grammar is left-recursive, but Earley was still able to parse it. The model of parsing you might have if you know Parsec is inapplicable to Earley.

Also note that there can be several parses. For instance, if we changed the definition of add to

  add <- rule $
    Add <$> term <*> (word "+" *> term)

we'd get the following parses:

> fst $ fullParses (parser grammar) "1+2+3"
[ Add (Add (Number 1) (Number 2))
      (Number 3)
, Add (Number 1)
      (Add (Number 2) (Number 3))]

Mixfix parsing

See Text.Earley.Mixfix and this example.

(TODO: write a better example here.)

trifecta

2016-03-10T11:06:40Z

trifecta (Hackage)

A library that is supposed to give you much nicer error messages that the ones of Parsec. Can be hard to figure out, but worth trying if you're writing an interpreter/compiler.

Pros

Lets you report errors in a manner similar to Clang, with colors and ^~~~~~~~~ and so on, which is very useful when writing e.g. a compiler. (For an example of what Clang does, see here.)

Has a module for doing highlighting of parsed text (i.e. you assign labels like Number, Operator, Identifier, etc and you can generate colored text from them).

Cons

Kinda complicated, doesn't have any tutorials available, and documentation doesn't help at all.

Can parse String and ByteString natively, but not Text.

Depends on lens and thus by depending on trifecta you pull in half of Hackage too.

Ecosystem

indentation

Notes

Imports

import Control.Applicative
import Text.Trifecta

Gotchas

If you want to parse Text, either convert it to String or do something like this.

ReadP

2016-03-10T11:06:40Z

ReadP

A small, simple parsing module shipped with GHC by default. Pretty usable if you don't care about getting error messages. Good for writing Read instances.

Pros

It's in base, so you can use it even when you can't (or don't want to) depend on any parsing library.

Non-deterministic – all parse results will be returned. Hence doesn't need try or backtracking, and doesn't leak space. (Left-biased Parsec-like choice is still possible with <++.)

Can be used for writing complicated Read instances that are fully compliant with Haskell's precedency parsing requirements (see the ReadPrec module).

Can be faster than Parsec (see this benchmark where parsing a simple config file is twice as fast with ReadP).

Has a function for using the Read instance as a parser (i.e. readP_to_S reads).

Cons

Doesn't give any errors whatsoever.

Non-determinism everywhere can be annoying if you don't need it (for instance, it's non-trivial to write a greedy many if you need it).

Can't parse Text.

Doesn't provide any advanced features like monad transformers or state.

Notes

Imports

import Text.ParserCombinators.ReadP

Or if you have Control.Applicative imported, write it like this to avoid clashes:

import Control.Applicative
import Text.ParserCombinators.ReadP hiding (many, optional)

Usage

It might not be obvious how to run a parser, so:

-- Return all possible parses
parse :: ReadP a -> String -> [(a, String)]
parse = readP_to_S

-- Return all possible parses, and additionally require all input to be consumed
parseAll :: ReadP a -> String -> [a]
parseAll p = map fst . readP_to_S (p <* eof)

Gotchas

many is rather inefficient because it's nondeterministic and so all alternatives (0 elements consumed, 1 element consumed, 2 elements consumed, etc) have to be considered:

p = (,) <$> many (satisfy isLetter)
        <*> many (satisfy isUpper)

> map fst $ readP_to_S (p <* eof) "abcXYZ"
[("abc","XYZ"),
 ("abcX","YZ"),
 ("abcXY","Z"),
 ("abcXYZ","")]

If you know you wouldn't need this, use munch or munch1, it will be much faster. (Also note that in this example you'll get ("abcXYZ",""), not ("abc","XYZ"), if you use munch, but this is to be expected – just like with many in Parsec.)

Other combinators are non-deterministic too, which can sometimes lead to unexpected issues.

alex/happy

2016-03-10T11:06:40Z