Strings – Haskell – Aelve Guide

text-utf8

2019-01-07T08:12:56Z

text-utf8 (Hackage)

This is a fork of the text package that uses UTF-8 instead of UTF-16 as its internal representation.

Pros

Cons

intern

2019-01-06T12:13:53Z

intern (Hackage)

An implementation of interned strings (also known as "hash-consing", "symbols" or "atoms"). Every distinct string is kept in memory only once, which is very useful when many of your strings are duplicates. It also provides O(1) equality comparison, since it only needs to compare references rather than contents.
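
To make the idea concrete, here is a minimal sketch of interning written from scratch on top of Data.Map. It is not the API of the intern package: Symbol, Table, emptyTable and internStr are hypothetical names, and a real implementation would use a global table rather than threading one through by hand.

```haskell
import qualified Data.Map.Strict as M

-- | A hypothetical interned string. Only the Int id is compared.
data Symbol = Symbol { symId :: !Int, symName :: String }

instance Eq Symbol where
  a == b = symId a == symId b   -- O(1), never looks at the characters

-- | The intern table: one Symbol per distinct string.
newtype Table = Table (M.Map String Symbol)

emptyTable :: Table
emptyTable = Table M.empty

-- | Return the existing Symbol for a string, or allocate a fresh one.
internStr :: String -> Table -> (Symbol, Table)
internStr s (Table m) =
  case M.lookup s m of
    Just sym -> (sym, Table m)
    Nothing  ->
      let sym = Symbol (M.size m) s
      in (sym, Table (M.insert s sym m))
```

Interning "foo" twice returns the same Symbol both times, so the second copy of the string can be garbage collected.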

Pros

Cons

text-short

2019-01-06T12:03:40Z

text-short (Hackage)

A version of Text with less memory overhead, suitable for keeping a lot of short strings in memory. Implemented as a wrapper over ShortByteString.

The main difference between Text and ShortText is that ShortText uses UTF-8 instead of UTF-16 internally and also doesn't support zero-copy slicing (thereby saving 2 words). Consequently, the memory footprint of a (boxed) ShortText value is 4 words (2 words when unboxed) plus the length of the UTF-8 encoded payload.

Note that unlike ByteString, Text doesn't use pinned memory, so there's no point in switching from Text to ShortText if you want to avoid heap fragmentation – Text already avoids it.
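
As a rough worked example of the footprint formula above (4 words plus the UTF-8 payload), the following sketch estimates the size of a boxed ShortText. approxShortTextBytes is a hypothetical helper, and it assumes the payload isn't rounded up to a word boundary, so treat the numbers as a lower bound.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- | Estimated bytes for a boxed ShortText holding the given text:
-- 4 machine words of overhead plus the UTF-8 encoded payload.
-- Assumption: ignores any padding the runtime adds to the payload.
approxShortTextBytes :: Int      -- ^ word size in bytes (8 on 64-bit)
                     -> T.Text
                     -> Int
approxShortTextBytes wordSize t =
  4 * wordSize + BS.length (TE.encodeUtf8 t)
```

On a 64-bit machine, "hello" comes out at 4 * 8 + 5 = 37 bytes, and a 6-letter Cyrillic word (2 bytes per letter in UTF-8) at 4 * 8 + 12 = 44 bytes.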

Pros

Cons

Ecosystem

text-containers provides memory-dense sets, arrays and associative maps over ShortText values.

bytestring

2016-04-14T10:37:25Z

bytestring (Hackage)

Provides byte arrays, with a fake-string interface in Data.ByteString.Char8.

Only use it if you're working with text with known encoding and you need it to be fast, or when you're working with network protocols. For instance:

aeson doesn't translate JSON to Text before parsing it, but works on raw ByteStrings (and assumes UTF-8)
cassava stores CSV fields as ByteStrings
lucid outputs HTML as a ByteString
http-types uses ByteStrings for headers, URLs, and so on

Pros

The fastest option available. Unlike text, it doesn't do any encoding/decoding under the hood and gives you direct access to the bytes.

Cons

Only suitable for working with ASCII text, unless you take care to handle the encoding yourself (as aeson does). It won't necessarily break: for example, you can still search for a UTF-8 substring in a UTF-8 string even if both are broken from the ByteString point of view, because they are broken the same way. However, it's still very fragile. A better alternative for dealing with UTF-8 (or ASCII) encoded memory is to use text-utf8 or text-short.
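
A small sketch of what "handling the encoding" means in practice. It compares Data.ByteString.Char8.pack with going through Text and encodeUtf8; the names cafe, viaChar8, viaUtf8 and containsAccent are made up for the example.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- "\233" is U+00E9 (e with an acute accent).
cafe :: String
cafe = "caf\233"

-- Char8.pack keeps only the low 8 bits of every Char, so the result
-- is Latin-1 here, not UTF-8.
viaChar8 :: BS.ByteString
viaChar8 = BC.pack cafe

-- Going through Text and encodeUtf8 produces valid UTF-8 (2 bytes for U+00E9).
viaUtf8 :: BS.ByteString
viaUtf8 = TE.encodeUtf8 (T.pack cafe)

-- Substring search on raw bytes works as long as both sides use the
-- same encoding.
containsAccent :: Bool
containsAccent = TE.encodeUtf8 (T.pack "\233") `BS.isInfixOf` viaUtf8
```

BS.unpack viaChar8 is [99,97,102,233], while BS.unpack viaUtf8 is [99,97,102,195,169]; mixing the two in one program is where things go wrong.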

Ecosystem

case-insensitive for case-insensitive comparisons
bytestring-show as a replacement for Show, readable as a replacement for Read
attoparsec is particularly well-suited for parsing ByteStrings
stringsearch for fast searching, replacement, and splitting
utf8-string for basic UTF-8 operations on ByteStrings, e.g. taking first N characters

There are more packages in the entry for bytestring in the “Arrays” category.

Notes

Imports and pragmas

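Data.ByteString.Char8 has the same interface as Data.ByteString, but works with Char instead of Word8, so you don't have to convert bytes to characters yourself (characters above '\255' are truncated to 8 bits):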

import qualified Data.ByteString.Char8 as BC

And to be able to use string literals to construct ByteStrings, enable OverloadedStrings:

{-# LANGUAGE OverloadedStrings #-}
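
Putting the import and the pragma together, here is a minimal sketch of the kind of ASCII protocol work that Char8 is suited for. parseHeader is a hypothetical helper, and real HTTP header parsing has more rules than this.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Char8 as BC

-- | Split a line such as "Host: example.com" into a name and a value.
-- Returns Nothing if there is no colon or the name is empty.
parseHeader :: BC.ByteString -> Maybe (BC.ByteString, BC.ByteString)
parseHeader line =
  case BC.break (== ':') line of
    (name, rest)
      | BC.null name || BC.null rest -> Nothing
      | otherwise ->
          -- drop the colon, then any spaces before the value
          Just (name, BC.dropWhile (== ' ') (BC.drop 1 rest))
```

Thanks to OverloadedStrings, parseHeader "Host: example.com" can be called with a plain string literal.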

String

2016-04-14T10:25:12Z

String

The default Haskell type for strings. Unicode-aware but not particularly clever (slightly less clever than Text). Defined as an ordinary list of characters:

type String = [Char]

It isn't very fast, but it isn't horribly slow either, and lots of libraries work with String instead of Text. If you're not doing web development or writing anything with lots of string processing, you might as well use it.

Even in codebases that use Text all the way, String is still sometimes used for error messages (e.g. a function that returns Either String a).
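
A minimal sketch of that pattern, with String used only for the error message. parseAge is a made-up example function.

```haskell
import Text.Read (readMaybe)

-- | Parse a non-negative age, describing the problem in a String on failure.
parseAge :: String -> Either String Int
parseAge s =
  case readMaybe s of
    Nothing -> Left ("not a number: " ++ show s)
    Just n
      | n < 0     -> Left ("negative age: " ++ show n)
      | otherwise -> Right n
```

The Left messages are meant for humans; if callers need to react to specific failures, a dedicated error type is usually a better fit than String.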

Pros

The most widely used string type.

Bundled with base.

Easy to process manually (because it's just a list of characters).

Cons

Slow, uses lots of memory (being a linked list).

Doesn't support Unicode perfectly: character-by-character operations such as map toUpper can't handle case mappings that change the length of the string.
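
The classic case is the German sharp s, which uppercases to two letters. The sketch below compares map toUpper on a String with toUpper from Data.Text; the string is written with escapes so that it doesn't depend on the source file encoding.

```haskell
import Data.Char (toUpper)
import qualified Data.Text as T

-- "\223" is U+00DF, the German sharp s.
street :: String
street = "stra\223e"

-- Char by Char: U+00DF has no single-character uppercase form,
-- so it is left unchanged.
upperString :: String
upperString = map toUpper street

-- Data.Text.toUpper works on the whole string and may change its length.
upperText :: T.Text
upperText = T.toUpper (T.pack street)
```

Here upperString still contains the sharp s ("STRA\223E"), while upperText is "STRASSE".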

Ecosystem

split can do pretty much anything when it comes to string splitting.
utf8-string for converting to/from UTF-8.
case-insensitive for case-insensitive comparisons.

Notes

Usage

Splitting

You can split strings into words and lines by using words/lines. However, for more options use the split package:

import Data.List.Split

Its documentation is actually pretty good, so it won't be replicated here.

text

2016-04-14T10:25:09Z

text (Hackage)

The type for strings that is most commonly recommended “for production”. Implemented as UTF-16 arrays under the hood (text-2.0 and later use UTF-8 instead).

Comes in strict and lazy variants (Data.Text and Data.Text.Lazy); the latter can be used for processing huge strings in a streaming fashion instead of more explicit approaches like pipes or conduit.

Pros

Fast and uses less memory than String.

Has more utility functions like splitOn, etc. available out of the box.

Better conforms to various Unicode rules about string casing and so on. It still does codepoint-based processing instead of grapheme-based processing, but there are libraries that process Text the right way.

Cons

Can be harder to manipulate if you're used to processing strings as lists (i.e. String).

Before text-2.0, uses UTF-16 internally and thus takes additional time to encode to and decode from UTF-8 (text-2.0 switched to UTF-8). See also text-utf8 or text-short.

Doesn't have O(1) indexing because UTF-16 is a variable-length encoding. Can be annoying if you only process ASCII (or close to ASCII) text, for which O(1) indexing is possible.

Ecosystem

Most parsing packages nowadays support Text, including megaparsec and attoparsec.
To encode/decode Text to UTF-8, UTF-16, or UTF-32, use Data.Text.Encoding. For more encodings, see Data.Text.ICU.Convert.
For a fast alternative to the Show class, see text-show (and additional instances in text-show-instances). For an alternative to the Read class, see readable. Fast show specifically for Double is in double-conversion.
For advanced Unicode handling, see text-icu (which provides ICU bindings). unicode-transforms is a pure Haskell alternative that does only normalization (NFC, NFKC, NFD, NFKD), but with performance comparable to text-icu. text-manipulate has additional functions for working with word boundaries, PascalCasing and snake_casing, acronyms, truncating text intelligently, and so on. text-icu-translit has transliteration.
case-insensitive provides newtypes for strings that should be compared case-insensitively, and text-normal provides newtypes for normalized text.
For using big text literals (like templates) in Haskell sources, see neat-interpolation. For printf-like functionality, see formatting, fmt, or PyF.
Orphan instances: cereal-text, quickcheck-text. Instances for binary are provided since text-1.2.1.
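
For the Data.Text.Encoding item above, here is a short sketch of encoding to UTF-8 and decoding without risking an exception. decodeSafely and roundTrip are hypothetical helper names; encodeUtf8 and decodeUtf8' are the library functions.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- | Decode UTF-8 bytes, turning invalid input into a Left
-- instead of throwing an exception.
decodeSafely :: BS.ByteString -> Either String T.Text
decodeSafely bytes =
  case TE.decodeUtf8' bytes of
    Left err   -> Left (show err)
    Right text -> Right text

-- | Encode to UTF-8 and decode again; always succeeds for valid Text.
roundTrip :: T.Text -> Either String T.Text
roundTrip = decodeSafely . TE.encodeUtf8
```

Plain decodeUtf8 is fine for input you produced yourself; for bytes coming from a file or the network, the primed variant avoids a runtime exception on malformed data.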

Notes

Imports and pragmas

OverloadedStrings lets string literals have type Text:

{-# LANGUAGE OverloadedStrings #-}

Imports are most commonly qualified:

import qualified Data.Text as T
import Data.Text (Text)

import qualified Data.Text.IO as T             -- for putStrLn, etc
import qualified Data.Text.Encoding as T       -- for UTF-8 encoding/decoding

Lazy variant:

import qualified Data.Text.Lazy as T
import Data.Text.Lazy (Text)

import qualified Data.Text.Lazy.IO as T        -- for putStrLn, etc
import qualified Data.Text.Lazy.Encoding as T  -- for UTF-8 encoding/decoding

On GHC older than 8.4, (<>) isn't exported from Prelude, so if you're not using anything like base-prelude, you might want to import Data.Monoid to have concatenation:

import Data.Monoid
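
Put together, a small module using the strict variant might start like the sketch below. greeting and printGreeting are made-up names, and the Data.Monoid import is only needed on older GHC versions (newer ones may warn that it's redundant).

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
import Data.Text (Text)
import qualified Data.Text.IO as T   -- shares the T prefix with Data.Text
import Data.Monoid ((<>))

-- | Build a greeting; the literals are Text thanks to OverloadedStrings.
greeting :: Text -> Text
greeting name = "Hello, " <> T.strip name <> "!"

-- | Print it using the Text version of putStrLn.
printGreeting :: Text -> IO ()
printGreeting = T.putStrLn . greeting
```

Importing the Text type unqualified keeps type signatures short, while everything else stays behind the T prefix and doesn't clash with Prelude.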

Strict and lazy Text

There are 2 text types in the library – both are called Text but one comes from the Data.Text module and the other from Data.Text.Lazy module. They're not compatible (but you can convert between them), and are intended for use in different situations.

Strict Text is an array of characters. Lazy Text is a list (possibly infinite) of arrays of characters, or chunks. It's recommended to use lazy Text for cases where it makes sense to process text in a streaming fashion – for instance, if you have a huge file that you want to read and output as a web page, you could do it like “read a chunk, output a chunk, read a chunk, output a chunk...” – which is what might happen automatically if you use lazy Text correctly.

A rule of thumb is “if you don't ever intend for the string to be in memory only partially, use strict Text”.

To convert lazy Text to strict Text, use toStrict from Data.Text.Lazy; fromStrict goes in the opposite direction. To break a lazy Text into a list of chunks, use toChunks, and for the reverse – fromChunks.
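
The conversions above can be sketched as follows, with the lazy module imported as TL to keep the two Text types apart. The names chunks, lazyDoc and strictDoc are made up for the example.

```haskell
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL

-- Two strict chunks, as they might arrive from a file or a socket.
chunks :: [T.Text]
chunks = map T.pack ["read a chunk, ", "output a chunk"]

-- A lazy Text is built from strict chunks without copying them together.
lazyDoc :: TL.Text
lazyDoc = TL.fromChunks chunks

-- toStrict concatenates all chunks into one contiguous strict Text.
strictDoc :: T.Text
strictDoc = TL.toStrict lazyDoc
```

Note that toStrict forces the whole lazy Text into memory, so it defeats the purpose of streaming if you call it on a huge input.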

Usage

Most functions from Prelude are replicated in Data.Text. The ones that are new are listed below.

Common functions

pack and unpack for converting between String and Text
cons and snoc to prepend/append a character
(<>) from Data.Monoid appends two strings
toLower and toUpper convert to lowercase/uppercase (there's also toTitle)
toCaseFold is used for case-insensitive comparisons: toCaseFold x == toCaseFold y
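
A few of the functions above in use. sameIgnoringCase, exclaim and quote are hypothetical helpers written for this sketch.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import qualified Data.Text as T
import Data.Monoid ((<>))   -- only needed on older GHC versions

-- | Case-insensitive equality via case folding.
sameIgnoringCase :: Text -> Text -> Bool
sameIgnoringCase x y = T.toCaseFold x == T.toCaseFold y

-- | Convert from String and append a single character.
exclaim :: String -> Text
exclaim s = T.snoc (T.pack s) '!'

-- | Wrap a Text in double quotes: cons for the front, (<>) for the back.
quote :: Text -> Text
quote t = T.cons '"' t <> "\""
```

Keep in mind that cons and snoc copy the whole string, so building a long Text one character at a time this way is quadratic.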

Searching

replace x y replaces x by y:

> replace " " "_" "hello world"
"hello_world"

> replace "ofo" "bar" "ofofo"
"barfo"

breakOn splits the string into “before separator” and “after separator” parts, where separator can be a string; breakOnEnd does the same but starts from the end:

> breakOn "::" "a::b::c"
("a", "::b::c")

> breakOnEnd "::" "a::b::c"
("a::b::", "c")

breakOnAll gives you all splitting variants:

> breakOnAll "::" "a::b::c"
[("a", "::b::c"), ("a::b", "::c")]

splitOn splits the string into a list of strings; split breaks on predicate Char -> Bool:

> splitOn "::" "a::b::c"
["a","b","c"]

> split (not . isAlphaNum) "a::b::c"
["a","","b","","c"]

count counts how many times a string occurs in another string (without overlaps).
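
As a slightly larger example of the searching functions, here is a sketch that uses breakOn to parse a key=value line. parseKeyValue is a hypothetical helper, not part of text.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import qualified Data.Text as T

-- | Split "key = value" at the first "=" and trim both halves.
-- Returns Nothing when the line has no "=" at all.
parseKeyValue :: Text -> Maybe (Text, Text)
parseKeyValue line =
  case T.breakOn "=" line of
    (key, rest)
      | T.null rest -> Nothing   -- separator not found
      | otherwise   -> Just (T.strip key, T.strip (T.drop 1 rest))
```

Since breakOn keeps the separator at the start of the second part, it has to be dropped explicitly. For count, non-overlapping means that count "aa" "aaaa" is 2, not 3.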

Cutting strings

take and takeEnd take N characters from the beginning/end, drop and dropEnd remove them.

takeWhile, takeWhileEnd, dropWhile and dropWhileEnd exist as well. dropAround strips characters from both sides of the string.

strip, stripStart and stripEnd strip whitespace specifically.

stripPrefix and stripSuffix remove a particular prefix/suffix (or return Nothing if it's absent). commonPrefixes takes two strings and returns their longest common prefix along with the remainder of each string (or Nothing if they have no common prefix).

chunksOf splits a string into chunks of length N.
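
A short sketch of some of the cutting functions in use. dropV, sharedPrefix and unquote are made-up helper names.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Maybe (fromMaybe)
import Data.Text (Text)
import qualified Data.Text as T

-- | Drop a leading "v" from a version tag, if there is one.
dropV :: Text -> Text
dropV tag = fromMaybe tag (T.stripPrefix "v" tag)

-- | The longest common prefix of two strings ("" if there is none).
sharedPrefix :: Text -> Text -> Text
sharedPrefix a b = maybe "" (\(prefix, _, _) -> prefix) (T.commonPrefixes a b)

-- | Remove double quotes from both ends.
unquote :: Text -> Text
unquote = T.dropAround (== '"')
```

For instance, dropV "v1.2" gives "1.2", sharedPrefix "foobar" "fooquux" gives "foo", and chunksOf 2 "abcde" gives ["ab","cd","e"].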

Transformations

justifyRight and justifyLeft add characters to the beginning/end of the string until it reaches certain length:

> justifyRight 7 '_' "foo"
"____foo"

> justifyLeft 7 '_' "foo"
"foo____"

center adds the character to both sides equally, breaking ties in favor of the left side:

> center 7 '_' "foo"
"__foo__"

> center 8 '_' "foo"
"___foo__"

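The justify functions are handy for lining up plain-text output. The sketch below formats a label and a number as one row of a dotted table; formatRow is a hypothetical helper and the column widths (8 and 4) are arbitrary.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import qualified Data.Text as T
import Data.Monoid ((<>))   -- only needed on older GHC versions

-- | Left-align the label in 8 columns and right-align the number in 4,
-- filling the gaps with dots.
formatRow :: Text -> Int -> Text
formatRow label n =
  T.justifyLeft 8 '.' label <> T.justifyRight 4 '.' (T.pack (show n))
```

The justify functions only pad and never truncate, so a label longer than 8 characters makes the row wider instead of being cut off.
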
Optimisation

TODO: mention copy, Builder, explain how fusion works, etc.

FAQ

Where is elem?

It's been removed from text because you can use isInfixOf to do the same thing.
Thanks to rewrite rules, T.isInfixOf "c" or T.isInfixOf (T.singleton c) will be as fast as elem.
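
A minimal sketch of the replacement. hasChar and hasCharAny are hypothetical names; the second version uses any from Data.Text, which is another way to get the same result.

```haskell
import qualified Data.Text as T

-- | elem for Text, written with isInfixOf as suggested above.
hasChar :: Char -> T.Text -> Bool
hasChar c = T.isInfixOf (T.singleton c)

-- | The same check expressed with a predicate.
hasCharAny :: Char -> T.Text -> Bool
hasCharAny c = T.any (== c)
```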
