Regular expressions – Haskell – Aelve Guidehttps://guide.aelve.com/haskell/feed/category/jf3ju3u22016-08-21T22:59:44Zb9j0hxi9regex-type2016-08-21T22:59:44Z<h1> <span class="item-name">regex-type</span>
(<a href="https://hackage.haskell.org/package/regex-type">Hackage</a>)
</h1><h2>Pros</h2><ul></ul><h2>Cons</h2><ul></ul>azxhk1pprelit2016-04-09T14:08:28Z<h1> <span class="item-name">relit</span>
(<a href="https://hackage.haskell.org/package/relit">Hackage</a>)
</h1><p>Not a library, but a quasiquoter for various <code>regex*</code> libraries.</p>
<h2>Pros</h2><ul></ul><h2>Cons</h2><ul></ul>tzjxzagzregex-pderiv2016-04-09T14:08:20Z<h1> <span class="item-name">regex-pderiv</span>
(<a href="https://hackage.haskell.org/package/regex-pderiv">Hackage</a>)
</h1><h2>Pros</h2><ul></ul><h2>Cons</h2><ul></ul>cy31mlzzregex-posix2016-04-09T14:07:23Z<h1> <span class="item-name">regex-posix</span>
(<a href="https://hackage.haskell.org/package/regex-posix">Hackage</a>)
</h1><p>The library is bundled.</p>
<p>Regex flavor: POSIX.</p>
<h2>Pros</h2><ul></ul><h2>Cons</h2><ul></ul><h2>Ecosystem</h2><p><a href="https://hackage.haskell.org/package/lens-regex">lens-regex</a></p>
vpcicemuregex-pcre2016-04-09T14:06:22Z<h1> <span class="item-name">regex-pcre</span>
(<a href="https://hackage.haskell.org/package/regex-pcre">Hackage</a>)
</h1><p>Uses the system PCRE library.</p>
<h2>Pros</h2><ul></ul><h2>Cons</h2><ul></ul>s6qidbvwregex-pcre-builtin2016-04-09T14:06:15Z<h1> <span class="item-name">regex-pcre-builtin</span>
(<a href="https://hackage.haskell.org/package/regex-pcre-builtin">Hackage</a>)
</h1><p>Regex flavor: PCRE.</p>
<h2>Pros</h2><ul></ul><h2>Cons</h2><ul></ul><h2>Ecosystem</h2><p><a href="https://hackage.haskell.org/package/pcre-utils">pcre-utils</a> – provides split/replace</p>
n0057uwjregex-tdfa2016-04-09T14:03:31Z<h1> <span class="item-name">regex-tdfa</span>
(<a href="https://hackage.haskell.org/package/regex-tdfa">Hackage</a>)
</h1><p>Uses <a href="https://hackage.haskell.org/package/regex-base">regex-base</a>. Seems to be the most popular library.</p>
<p>Regex flavor: POSIX.</p>
<h2>Pros</h2><ul><p><li>Handles corner cases <a href="https://wiki.haskell.org/Regex_Posix">better</a> than other POSIX implementations (including glibc and the rest).</li></p><p><li>Doesn't require any installed libraries (since it's written in pure Haskell).</li></p></ul><h2>Cons</h2><ul><p><li>Slightly complicated to use, and documentation isn't particularly good.</li></p></ul><h2>Ecosystem</h2><ul>
<li>
<p><a href="https://hackage.haskell.org/package/regex-tdfa-pipes">regex-tdfa-pipes</a>, <a href="https://hackage.haskell.org/package/regex-tdfa-quasiquoter">regex-tdfa-quasiquoter</a>, <a href="https://hackage.haskell.org/package/regex-tdfa-text">regex-tdfa-text</a></p>
</li>
<li>
<p><a href="https://hackage.haskell.org/package/regex-genex">regex-genex</a> can generate all strings matching some regex, and <a href="https://hackage.haskell.org/package/quickcheck-regex">quickcheck-regex</a> can use that to generate test cases for QuickCheck</p>
</li>
<li>
<p><a href="https://hackage.haskell.org/package/regex-compat-tdfa">regex-compat-tdfa</a> is a wrapper over regex-tdfa with a simple interface</p>
</li>
</ul>
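<h2>Notes</h2><p>Since the interface is described above as slightly complicated, here is a minimal sketch of the usual entry point, the <code>=~</code> operator. The result type you ask for selects what is returned. The function names are hypothetical and only serve the illustration.</p>
<div class="sourceCode"><pre class="sourceCode"><code class="sourceCode">import Text.Regex.TDFA ((=~), getAllTextMatches)

-- Bool: does the pattern match anywhere in the input?
hasNumber :: String -> Bool
hasNumber s = s =~ "[0-9]+"

-- String: the first match, or "" when nothing matches.
firstNumber :: String -> String
firstNumber s = s =~ "[0-9]+"

-- Triple: (text before the match, the match, text after the match).
aroundNumber :: String -> (String, String, String)
aroundNumber s = s =~ "[0-9]+"

-- All matches, unwrapped from the AllTextMatches newtype.
allNumbers :: String -> [String]
allNumbers s = getAllTextMatches (s =~ "[0-9]+")</code></pre></div>
<p>For example, <code>allNumbers "12 + 34 = 55"</code> evaluates to <code>["12","34","55"]</code>.</p>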
prhw67xdregexpr2016-04-09T14:01:52Z<h1> <span class="item-name">regexpr</span>
(<a href="https://hackage.haskell.org/package/regexpr">Hackage</a>)
</h1><h2>Pros</h2><ul></ul><h2>Cons</h2><ul></ul>w6kin9fkregex-applicative2016-04-09T14:00:50Z<h1> <span class="item-name">regex-applicative</span>
(<a href="https://hackage.haskell.org/package/regex-applicative">Hackage</a>)
</h1><p>Regex-like parsing combinators.</p>
<h2>Pros</h2><ul></ul><h2>Cons</h2><ul></ul><h2>Ecosystem</h2><p><a href="https://hackage.haskell.org/package/lexer-applicative">lexer-applicative</a>, <a href="https://hackage.haskell.org/package/regex-applicative-text">regex-applicative-text</a></p>
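<h2>Notes</h2><p>To illustrate what “regex-like parsing combinators” means in practice, here is a small sketch: a regex is an ordinary Haskell value of type <code>RE s a</code> that both matches input and computes a result. The names <code>number</code> and <code>sumExpr</code> are made up for this example.</p>
<div class="sourceCode"><pre class="sourceCode"><code class="sourceCode">import Control.Applicative (some)
import Data.Char (isDigit)
import Text.Regex.Applicative (RE, match, psym, sym)

-- One or more digits, converted to an Int.
number :: RE Char Int
number = fmap read (some (psym isDigit))

-- Two numbers separated by '+', returned as a pair.
sumExpr :: RE Char (Int, Int)
sumExpr = (,) &lt;$&gt; number &lt;* sym '+' &lt;*&gt; number

-- match succeeds only if the regex consumes the whole input.
parseSum :: String -> Maybe (Int, Int)
parseSum = match sumExpr</code></pre></div>
<p>Here <code>parseSum "12+34"</code> gives <code>Just (12,34)</code>, while <code>parseSum "12+"</code> gives <code>Nothing</code>.</p>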
szrn6x33weighted-regexp2016-04-09T13:59:17Z<h1> <span class="item-name">weighted-regexp</span>
(<a href="https://hackage.haskell.org/package/weighted-regexp">Hackage</a>)
</h1><p>See <a href="http://sebfisch.github.io/haskell-regexp">http://sebfisch.github.io/haskell-regexp</a>.</p>
<h2>Pros</h2><ul></ul><h2>Cons</h2><ul></ul><h2>Ecosystem</h2><p><a href="https://hackage.haskell.org/package/regexp-tries">regexp-tries</a></p>
w9lymolvhxt-regex-xmlschema2016-04-09T13:59:05Z<h1> <span class="item-name">hxt-regex-xmlschema</span>
(<a href="https://hackage.haskell.org/package/hxt-regex-xmlschema">Hackage</a>)
</h1><h2>Pros</h2><ul></ul><h2>Cons</h2><ul></ul>ljx3hkhbre22016-04-09T13:56:56Z<h1> <span class="item-name">re2</span>
(<a href="https://hackage.haskell.org/package/re2">Hackage</a>)
</h1><p>Bindings to Google's RE2 library.</p>
<h2>Pros</h2><ul></ul><h2>Cons</h2><ul></ul>x754lm90regexdot2016-04-09T13:55:09Z<h1> <span class="item-name">regexdot</span>
(<a href="https://hackage.haskell.org/package/regexdot">Hackage</a>)
</h1><p>Regex flavor: POSIX.</p>
<h2>Pros</h2><ul><p><li>Works on lists of arbitrary objects.</li></p></ul><h2>Cons</h2><ul></ul><h2>Ecosystem</h2><p><a href="https://hackage.haskell.org/package/regexchar">regexchar</a></p>
ykbn4lidpcre-light2016-04-09T13:46:23Z<h1> <span class="item-name">pcre-light</span>
(<a href="https://hackage.haskell.org/package/pcre-light">Hackage</a>)
</h1><p>Binds to the C PCRE library.</p>
<h2>Pros</h2><ul></ul><h2>Cons</h2><ul><p><li>Only works with bytestrings unless you use a wrapper.</li></p></ul><h2>Ecosystem</h2><ul>
<li><a href="https://hackage.haskell.org/package/regex-easy">regex-easy</a> – a convenience wrapper (TODO: probably useless?)</li>
<li><a href="https://hackage.haskell.org/package/pcre-heavy">pcre-heavy</a> – another convenience wrapper</li>
<li><a href="https://hackage.haskell.org/package/rex">rex</a> – quasiquoter</li>
</ul>
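<h2>Notes</h2><p>A minimal sketch of the core API, assuming <code>OverloadedStrings</code> so that literals can be used as bytestrings. The function name <code>keyValue</code> is hypothetical. Note that both the pattern and the input are <code>ByteString</code>s, as mentioned in the cons above.</p>
<div class="sourceCode"><pre class="sourceCode"><code class="sourceCode">{-# LANGUAGE OverloadedStrings #-}
import Data.ByteString (ByteString)
import Text.Regex.PCRE.Light (compile, match)

-- On success the list holds the whole match followed by the capture groups.
-- The two empty lists are the compile-time and match-time option lists.
keyValue :: ByteString -> Maybe [ByteString]
keyValue input = match (compile "(\\w+)=(\\w+)" []) input []</code></pre></div>
<p>For instance, <code>keyValue "lang=haskell"</code> returns <code>Just ["lang=haskell","lang","haskell"]</code>.</p>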
n2ri8dzxtext-icu2016-04-03T09:50:05Z<h1> <span class="item-name">text-icu</span>
(<a href="https://hackage.haskell.org/package/text-icu">Hackage</a>)
</h1><p>Bindings to <a href="http://site.icu-project.org/">International Components for Unicode</a>, which among other things provides regexes. See <a href="https://hackage.haskell.org/package/text-icu/docs/Data-Text-ICU.html#g:10">this section</a> of the <code>Data.Text.ICU</code> module.</p>
<p>Regex flavor: ICU (Perl-style syntax, close to PCRE but not identical).</p>
<h2>Pros</h2><ul><p><li>Supports Unicode classes – e.g. <code>[:lower:]</code> matches all lowercase letters, not just Latin ones.</li></p><p><li>Allows limiting time/memory used by the matcher.</li></p></ul><h2>Cons</h2><ul><p><li>Requires <a href="http://site.icu-project.org/">ICU</a> installed.</li></p></ul><h2>Ecosystem</h2><p><a href="https://hackage.haskell.org/package/text-regex-replace">text-regex-replace</a></p>
<h2>Notes</h2><h1><span id="item-notes-n2ri8dzx-links"></span>Links</h1><ul>
<li><a href="http://userguide.icu-project.org/strings/regexp#TOC-Regular-Expression-Metacharacters">Documentation on ICU regular expressions</a></li>
<li><a href="https://hackage.haskell.org/package/text-icu/docs/Data-Text-ICU.html#g:10">Documentation on the part of the library that deals with regular expressions</a></li>
</ul>
<h1><span id="item-notes-n2ri8dzx-imports-and-pragmas"></span>Imports and pragmas</h1><div class="sourceCode"><pre class="sourceCode"><code class="sourceCode"><span class="ot">{-# LANGUAGE OverloadedStrings #-}</span></code></pre></div>
<p>It's better to import the module qualified, because some of its functions clash with commonly imported ones (<code>span</code> from Prelude, <code>find</code> from <code>Data.List</code>). Many functions also clash with ones from <code>Data.Text</code>, so don't import it as <code>T</code> either.</p>
<div class="sourceCode"><pre class="sourceCode"><code class="sourceCode"><span class="kw">import qualified</span> <span class="dt">Data.Text.ICU</span> <span class="kw">as</span> <span class="dt">ICU</span></code></pre></div>
<p>If you want replacement as well:</p>
<div class="sourceCode"><pre class="sourceCode"><code class="sourceCode"><span class="co">-- from text-regex-replace</span>
<span class="kw">import qualified</span> <span class="dt">Data.Text.ICU.Replace</span> <span class="kw">as</span> <span class="dt">ICU</span></code></pre></div>
<h1><span id="item-notes-n2ri8dzx-search"></span>Search</h1><p>To search, use <code>findAll</code> (or <code>find</code> if you only need the 1st match):</p>
<div class="sourceCode"><pre class="sourceCode repl"><code class="sourceCode"><span class="fu">></span> ICU.findAll <span class="st">"[0-9]+"</span> <span class="st">"12 + 34 = 55"</span>
[<span class="dt">Match</span> [<span class="st">"12"</span>],<span class="dt">Match</span> [<span class="st">"34"</span>],<span class="dt">Match</span> [<span class="st">"55"</span>]]</code></pre></div>
<p><code>findAll</code> returns a list of <code>Match</code>es. A <code>Match</code> holds information about the matched piece of text, the groups inside the match, and the text occurring between matches.</p>
<p>For example, let's construct a regex that would match a name and a surname: <code>(\p{Lu}\w*) (\p{Lu}\w*)</code> – here <code>\w</code> means “character that can occur inside a word”, and <code>\p{Lu}</code> means “character from Unicode category <code>Lu</code>”, which is “Letter, uppercase”:</p>
<div class="sourceCode"><pre class="sourceCode repl"><code class="sourceCode"><span class="fu">></span> <span class="kw">let</span> regex <span class="fu">=</span> <span class="st">"(\\p{Lu}\\w*) (\\p{Lu}\\w*)"</span>
<span class="fu">></span> <span class="kw">let</span> [zaphod, ford] <span class="fu">=</span> ICU.findAll regex <span class="st">"Zaphod Beeblebrox and Ford Prefect"</span></code></pre></div>
<p>To get the matched text itself, use <code>group 0</code>. It always returns <code>Just</code> for a successful match, but unfortunately the library doesn't provide a way to get the text without unwrapping the <code>Just</code>:</p>
<div class="sourceCode"><pre class="sourceCode repl"><code class="sourceCode"><span class="fu">></span> ICU.group <span class="dv">0</span> ford
<span class="dt">Just</span> <span class="st">"Ford Prefect"</span></code></pre></div>
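<p>If unwrapping <code>Just</code> by hand gets tedious, small helpers can hide it. This is only a sketch; <code>firstMatch</code> and <code>allMatches</code> are hypothetical names, not part of text-icu.</p>
<div class="sourceCode"><pre class="sourceCode"><code class="sourceCode">import Data.Maybe (mapMaybe)
import Data.Text (Text)
import qualified Data.Text.ICU as ICU

-- Text of the first match, or Nothing if the regex doesn't match at all.
firstMatch :: ICU.Regex -> Text -> Maybe Text
firstMatch r t = ICU.find r t >>= ICU.group 0

-- Text of every match; mapMaybe removes the Maybe that group 0 adds.
allMatches :: ICU.Regex -> Text -> [Text]
allMatches r = mapMaybe (ICU.group 0) . ICU.findAll r</code></pre></div>
<p>With <code>OverloadedStrings</code> enabled, <code>allMatches "[0-9]+" "12 + 34 = 55"</code> gives <code>["12","34","55"]</code>.</p>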
<p>You can also use <code>group</code> to get a particular capturing group:</p>
<div class="sourceCode"><pre class="sourceCode repl"><code class="sourceCode"><span class="fu">></span> ICU.group <span class="dv">1</span> ford
<span class="dt">Just</span> <span class="st">"Ford"</span>
<span class="fu">></span> ICU.group <span class="dv">2</span> ford
<span class="dt">Just</span> <span class="st">"Prefect"</span></code></pre></div>
<p><code>span</code> returns the text between the previous match and this match:</p>
<div class="sourceCode"><pre class="sourceCode repl"><code class="sourceCode"><span class="fu">></span> ICU.span ford
<span class="st">" and "</span></code></pre></div>
<p>Finally, you can use <code>prefix</code> and <code>suffix</code> to get the whole string before/after the match:</p>
<div class="sourceCode"><pre class="sourceCode repl"><code class="sourceCode"><span class="fu">></span> ICU.prefix <span class="dv">0</span> ford
<span class="dt">Just</span> <span class="st">"Zaphod Beeblebrox and "</span>
<span class="fu">></span> ICU.suffix <span class="dv">0</span> ford
<span class="dt">Just</span> <span class="st">""</span></code></pre></div>
<h1><span id="item-notes-n2ri8dzx-replacement"></span>Replacement</h1><p>Simple replacement is done with <code>replaceAll</code> (to replace only the 1st match, use <code>replace</code>):</p>
<div class="sourceCode"><pre class="sourceCode repl"><code class="sourceCode"><span class="fu">></span> ICU.replaceAll <span class="st">"[0-9]+"</span> <span class="st">"&lt;num&gt;"</span> <span class="st">"12 + 34 = 55"</span>
<span class="st">"&lt;num&gt; + &lt;num&gt; = &lt;num&gt;"</span></code></pre></div>
<p>Replacement with groups:</p>
<div class="sourceCode"><pre class="sourceCode repl"><code class="sourceCode"><span class="fu">></span> ICU.replaceAll <span class="st">"(.*), (.*)"</span> <span class="st">"$2 $1"</span> <span class="st">"Beeblebrox, Zaphod"</span>
<span class="st">"Zaphod Beeblebrox"</span></code></pre></div>
<p>(To have a literal <code>$</code> in the output, write <code>$$</code> instead of <code>$</code>.)</p>
<h1><span id="item-notes-n2ri8dzx-splitting"></span>Splitting</h1><p>text-icu doesn't export a splitting function, so you have to build one yourself from <code>findAll</code>, <code>span</code> and <code>suffix</code>. Here's one that you could use:</p>
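<div class="sourceCode"><pre class="sourceCode"><code class="sourceCode"><span class="kw">import</span> <span class="dt">Data.Maybe</span> (fromJust)
<span class="kw">import</span> <span class="dt">Data.Text</span> (<span class="dt">Text</span>)

<span class="co">-- Split the text on every match of the regex.</span>
<span class="ot">split ::</span> <span class="dt">ICU.Regex</span> <span class="ot">-></span> <span class="dt">Text</span> <span class="ot">-></span> [<span class="dt">Text</span>]
split r s <span class="fu">=</span> go (ICU.findAll r s)
  <span class="kw">where</span>
    go [] <span class="fu">=</span> [s]  <span class="co">-- no match: the input is the only piece</span>
    go [m] <span class="fu">=</span> [ICU.span m, fromJust (ICU.suffix <span class="dv">0</span> m)]  <span class="co">-- last match: add the tail</span>
    go (m<span class="fu">:</span>ms) <span class="fu">=</span> ICU.span m <span class="fu">:</span> go ms</code></pre></div>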
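<p>As a complete program, the same idea can be sketched without <code>fromJust</code>. The name <code>splitOn</code> is hypothetical, and the sketch assumes (as the function above does) that <code>span</code> of the first match is the text from the start of the input up to that match.</p>
<div class="sourceCode"><pre class="sourceCode"><code class="sourceCode">{-# LANGUAGE OverloadedStrings #-}
import Data.Maybe (fromMaybe)
import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.ICU as ICU

-- Pieces of the input between matches, plus whatever follows the last match.
splitOn :: ICU.Regex -> Text -> [Text]
splitOn r s = case ICU.findAll r s of
  [] -> [s]
  ms -> map ICU.span ms ++ [fromMaybe T.empty (ICU.suffix 0 (last ms))]

-- Split a comma-separated line, tolerating spaces after the commas.
main :: IO ()
main = print (splitOn ",\\s*" "a, b,c")</code></pre></div>
<p>Using <code>fromMaybe</code> instead of <code>fromJust</code> keeps the function total even though <code>suffix 0</code> is expected to succeed for every match.</p>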
<h1><span id="item-notes-n2ri8dzx-regex-settings"></span>Regex settings</h1><p>You can customise the way regexes are applied by using <code>regex</code> and <a href="https://hackage.haskell.org/package/text-icu/docs/Data-Text-ICU.html#t:MatchOption"><code>MatchOption</code></a>. For instance, if you want the matching to be case-insensitive, use <code>CaseInsensitive</code>:</p>
<div class="sourceCode"><pre class="sourceCode repl"><code class="sourceCode"><span class="fu">></span> <span class="kw">let</span> regex <span class="fu">=</span> ICU.regex [<span class="dt">ICU.CaseInsensitive</span>] <span class="st">"xxx_(\\w+)_xxx"</span>
<span class="fu">></span> <span class="kw">let</span> str <span class="fu">=</span> <span class="st">"xxx_Overlord_xxx XXX_dp_ak_XXX"</span>
<span class="fu">></span> mapMaybe (ICU.group <span class="dv">1</span>) (ICU.findAll regex str)
[<span class="st">"Overlord"</span>,<span class="st">"dp_ak"</span>]</code></pre></div>
<p>There are other settings available – look at the docs for <a href="https://hackage.haskell.org/package/text-icu/docs/Data-Text-ICU.html#t:MatchOption"><code>MatchOption</code></a> to see the full list.</p>
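<p>Several options can be combined in one list. The sketch below also uses <code>regex'</code>, which returns an <code>Either</code> instead of throwing on a malformed pattern, and <code>WorkLimit</code>, one of the options behind the time limiting mentioned in the pros. The name <code>compileMaybe</code> and the limit value are assumptions made for this example, not recommendations.</p>
<div class="sourceCode"><pre class="sourceCode"><code class="sourceCode">import Data.Text (Text)
import qualified Data.Text.ICU as ICU

-- Compile a user-supplied pattern; Nothing means the pattern was malformed.
compileMaybe :: Text -> Maybe ICU.Regex
compileMaybe pat =
  either (const Nothing) Just
    (ICU.regex'
       [ ICU.CaseInsensitive   -- ignore case when matching
       , ICU.Multiline         -- ^ and $ also match at line boundaries
       , ICU.WorkLimit 100000  -- arbitrary cap on matcher steps
       ]
       pat)</code></pre></div>
<p>The resulting <code>Regex</code> can be passed to <code>find</code> and <code>findAll</code> exactly like the ones built with <code>regex</code>.</p>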