Common-case recommendations

String is the standard string type from Prelude, is widely used, and is rather easy to work with, as it's simply a synonym for [Char] and all list functions work with String out of the box. This representation is very convenient, but is neither fast nor memory-efficient. In addition, it doesn't fully conform to complicated Unicode rules regarding casing, string comparison, etc.

If you're working with strings a lot, you would be better off with Text from the text package, which is also very widely used in the Haskell ecosystem, has more text-specific utilities available, and is faster. If your strings are Unicode-heavy (e.g. names or addresses) and you must process them correctly, text-icu or unicode-transforms will be indispensable.

NB: Some people advocate for using Text instead of String in all cases, but if you're a beginner, String might be a better choice because all list operations apply to it. Even if you're not a beginner, you're still likely to use String in some places (when defining Show, for instance, or when working with exceptions or logging). Don't try to run away from String everywhere; sometimes it's not worth it.

Rare-case recommendations

If you need speed and you're willing to do the decoding yourself, or if you're working with network protocols, you should consider bytestring. (The networking ecosystem in Haskell mostly uses bytestring.)

If you have lots of small strings, you can switch to text-short, which has less memory overhead than Text.

If you have lots of small strings and you anticipate that many of them will be the same (e.g. if you're writing a compiler), you can use interned strings with something like intern, or roll your own like GHC did with its FastString.

text (Hackage)
Summary

The type for strings that is most commonly recommended “for production”. Implemented as packed arrays under the hood: UTF-16 in versions before text-2.0, UTF-8 in text-2.0 and later.

Comes in strict and lazy variants (Data.Text and Data.Text.Lazy); the latter can be used for processing huge strings in a streaming fashion instead of more explicit approaches like pipes or conduit.

Pros
  • Fast and uses less memory than String.
  • Has more utility functions like splitOn, etc. available out of the box.
  • Better conforms to various Unicode rules about string casing and so on. Processing is still codepoint-based rather than grapheme-based, but there are libraries that process Text the right way.

Cons
  • Can be harder to manipulate if you're used to processing strings as lists (i.e. String).
  • Versions before text-2.0 use UTF-16 internally and thus take additional time to encode/decode from UTF-8 (text-2.0 and later store UTF-8). See also text-utf8 or text-short.
  • Doesn't have O(1) indexing because the internal encoding (UTF-16, or UTF-8 since text-2.0) is variable-length. Can be annoying if you only process ASCII (or close to ASCII) text, for which O(1) indexing is possible.

Notes

Imports and pragmas

OverloadedStrings lets string literals have type Text:

{-# LANGUAGE OverloadedStrings #-}

Imports are most commonly qualified:

import qualified Data.Text as T
import Data.Text (Text)

import qualified Data.Text.IO as T             -- for putStrLn, etc
import qualified Data.Text.Encoding as T       -- for UTF8 encoding/decoding

Lazy variant:

import qualified Data.Text.Lazy as T
import Data.Text.Lazy (Text)

import qualified Data.Text.Lazy.IO as T        -- for putStrLn, etc
import qualified Data.Text.Lazy.Encoding as T  -- for UTF8 encoding/decoding

If you're on a GHC older than 8.4 (where (<>) isn't yet exported from Prelude) and you're not using anything like base-prelude, you might want to import Data.Monoid to have concatenation:

import Data.Monoid
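
As a minimal sketch of how the pragma and the qualified imports fit together (the names greeting and printGreeting are made up for this example, and GHC 8.4 or later is assumed so that (<>) comes from Prelude):

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text as T
import Data.Text (Text)
import qualified Data.Text.IO as T

-- 'greeting' and 'printGreeting' are hypothetical names for this example.
-- Thanks to OverloadedStrings, the literals below have type Text, not String.
greeting :: Text -> Text
greeting name = "Hello, " <> T.strip name <> "!"

-- T.putStrLn here is the Text version from Data.Text.IO.
printGreeting :: Text -> IO ()
printGreeting = T.putStrLn . greeting
```

Importing both Data.Text and Data.Text.IO under the same alias T works because their exported names don't overlap.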

Strict and lazy Text

The library has two text types, both called Text: one comes from the Data.Text module and the other from Data.Text.Lazy. They're not compatible (but you can convert between them), and are intended for use in different situations.

Strict Text is an array of characters. Lazy Text is a list (possibly infinite) of arrays of characters, or chunks. It's recommended to use lazy Text for cases where it makes sense to process text in a streaming fashion – for instance, if you have a huge file that you want to read and output as a web page, you could do it like “read a chunk, output a chunk, read a chunk, output a chunk...” – which is what might happen automatically if you use lazy Text correctly.

A rule of thumb is “if you don't ever intend for the string to be in memory only partially, use strict Text”.

To convert lazy Text to strict Text, use toStrict from Data.Text.Lazy; fromStrict goes in the opposite direction. To break a lazy Text into a list of chunks, use toChunks, and for the reverse – fromChunks.
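
The conversions above can be sketched as follows. The TL alias and the value names are choices made for this example only; the alias lets both variants be used in one module.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text as T
import qualified Data.Text.Lazy as TL

strictGreeting :: T.Text
strictGreeting = "hello"

-- fromStrict turns a strict Text into a lazy one.
lazyGreeting :: TL.Text
lazyGreeting = TL.fromStrict strictGreeting

-- fromChunks builds a lazy Text out of strict chunks;
-- toStrict then concatenates all chunks into one strict Text.
joined :: T.Text
joined = TL.toStrict (TL.fromChunks ["read a chunk, ", "output a chunk"])

-- toChunks exposes the strict chunks of a lazy Text again.
chunks :: [T.Text]
chunks = TL.toChunks (TL.fromChunks ["ab", "cd"])
```

Note that toStrict has to bring the whole string into memory, so it defeats the purpose of streaming if applied to a huge lazy Text.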

Usage

Most list functions from Prelude are replicated in Data.Text. The ones that are new are described below.

Common functions

  • pack and unpack for converting between String and Text
  • cons and snoc to prepend/append a character
  • (<>) from Data.Monoid appends two strings
  • toLower and toUpper convert to lower/uppercase (there's also toTitle)
  • toCaseFold is used for case-insensitive comparisons: toCaseFold x == toCaseFold y
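
A short sketch of the functions listed above; all value names are made up for illustration. It also shows one case where Text's casing differs from mapping Data.Char.toUpper over a String: the German eszett (written here as the escape \223) uppercases to two characters.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text as T
import Data.Text (Text)
import Data.Char (toUpper)

-- pack and unpack convert between String and Text.
roundTrip :: String -> String
roundTrip = T.unpack . T.pack

-- cons prepends a character, snoc appends one.
bracketed :: Text -> Text
bracketed t = T.cons '[' (T.snoc t ']')

-- Case-insensitive equality via toCaseFold.
equalIgnoringCase :: Text -> Text -> Bool
equalIgnoringCase x y = T.toCaseFold x == T.toCaseFold y

-- T.toUpper may change the length of the string: eszett becomes "SS".
shoutText :: Text
shoutText = T.toUpper "stra\223e"      -- "STRASSE"

-- Char-by-Char conversion can't do that, so the eszett is left as it is.
shoutString :: String
shoutString = map toUpper "stra\223e"
```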

Searching

replace x y replaces x by y:

> replace " " "_" "hello world"
"hello_world"

> replace "ofo" "bar" "ofofo"
"barfo"

breakOn splits the string into “before separator” and “after separator” parts, where separator can be a string; breakOnEnd does the same but starts from the end:

> breakOn "::" "a::b::c"
("a", "::b::c")

> breakOnEnd "::" "a::b::c"
("a::b::", "c")

breakOnAll gives you all splitting variants:

> breakOnAll "::" "a::b::c"
[("a", "::b::c"), ("a::b", "::c")]

splitOn splits the string into a list of strings; split breaks on predicate Char -> Bool:

> splitOn "::" "a::b::c"
["a","b","c"]

> split (not . isAlphaNum) "a::b::c"
["a","","b","","c"]

count counts how many times a string occurs in another string (without overlaps).
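
For instance (a small sketch; the value names are invented for this example):

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text as T

-- Two separators in the string.
separators :: Int
separators = T.count "::" "a::b::c"      -- 2

-- Occurrences are counted without overlaps, so this is 2 rather than 3.
nonOverlapping :: Int
nonOverlapping = T.count "aa" "aaaa"     -- 2

-- splitOn yields one more piece than there are separators.
pieceCount :: Int
pieceCount = length (T.splitOn "::" "a::b::c")
```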

Cutting strings

take and takeEnd take N characters from the beginning/end; drop and dropEnd remove them.

takeWhile, takeWhileEnd, dropWhile and dropWhileEnd exist as well. dropAround strips characters matching a predicate from both sides of the string.

strip, stripStart and stripEnd strip whitespace specifically.

stripPrefix and stripSuffix remove some particular prefix/suffix (or return Nothing). commonPrefixes takes two strings and, if they share a non-empty prefix, returns the longest common prefix together with the remainders of both strings.

chunksOf splits a string into chunks of length N.
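
A few of these functions in action, as a sketch with made-up value names; the comments show the expected results:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text as T
import Data.Text (Text)

firstThree, lastThree, withoutLastThree :: Text
firstThree       = T.take 3 "haskell"          -- "has"
lastThree        = T.takeEnd 3 "haskell"       -- "ell"
withoutLastThree = T.dropEnd 3 "haskell"       -- "hask"

-- strip removes whitespace (including newlines) from both ends.
trimmed :: Text
trimmed = T.strip "  hi \n"                    -- "hi"

-- dropAround takes a predicate on characters.
unquoted :: Text
unquoted = T.dropAround (== '"') "\"quoted\""  -- "quoted"

withoutPrefix, missingPrefix :: Maybe Text
withoutPrefix = T.stripPrefix "foo" "foobar"   -- Just "bar"
missingPrefix = T.stripPrefix "baz" "foobar"   -- Nothing

-- The common prefix, then what is left of each argument.
shared :: Maybe (Text, Text, Text)
shared = T.commonPrefixes "foobar" "fooquux"   -- Just ("foo","bar","quux")

-- The last chunk may be shorter than N.
pieces :: [Text]
pieces = T.chunksOf 3 "abcdefgh"               -- ["abc","def","gh"]
```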

Transformations

justifyRight and justifyLeft add characters to the beginning/end of the string until it reaches a certain length:

> justifyRight 7 '_' "foo"
"____foo"

> justifyLeft 7 '_' "foo"
"foo____"

center adds the character to both sides equally, breaking ties in favor of the left side:

> center 7 '_' "foo"
"__foo__"

> center 8 '_' "foo"
"___foo__"

Optimisation

TODO: mention copy, Builder, explain how fusion works, etc.

FAQ

  • Where is elem?

    It's been removed from text because you can use isInfixOf to do the same thing.
    Thanks to rewrite rules, T.isInfixOf "c" or T.isInfixOf (T.singleton c) will be as fast as elem.
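
    A minimal sketch of such a replacement, assuming you want a function with the same argument order as the list elem (containsChar is a hypothetical name, not part of text):

```haskell
import qualified Data.Text as T

-- 'containsChar' is a hypothetical helper: membership test for a single
-- character, expressed through isInfixOf as suggested above.
containsChar :: Char -> T.Text -> Bool
containsChar c = T.isInfixOf (T.singleton c)
```

    For example, containsChar 'c' (T.pack "abc") is True.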

String

Summary

The default Haskell type for strings. Unicode-aware but not particularly clever (slightly less clever than Text). Defined as an ordinary list of characters:

type String = [Char]

Isn't very fast, but isn't horribly slow either, and lots of libraries work with String instead of Text, so if you're not doing web dev and not writing anything with lots of string processing, you might just as well use it.

Even in codebases that use Text all the way, String is still sometimes used for error messages (e.g. a function that returns Either String a).

Pros
  • The most widely used string type.
  • Bundled with base.
  • Easy to process manually (because it's just a list of characters).

Cons
  • Slow, uses lots of memory (being a linked list).
  • Doesn't support Unicode perfectly (if you do something like map toUpper, for instance).

Ecosystem
  • split can do pretty much anything when it comes to string splitting.

  • utf8-string for converting to/from UTF8.

  • case-insensitive for case-insensitive comparisons.

Notes

Usage

Splitting

You can split strings into words and lines by using words/lines. However, for more options use the split package:

import Data.List.Split

Its documentation is actually pretty good, so it won't be replicated here.
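
The Prelude functions mentioned above behave roughly as in this sketch (value names are invented for the example; nothing from the split package is used here):

```haskell
-- words splits on runs of whitespace and never produces empty strings.
tokens :: [String]
tokens = words "  split   these words "     -- ["split","these","words"]

-- lines splits on newlines; a trailing newline doesn't add an empty line,
-- but an empty line in the middle is kept.
rows :: [String]
rows = lines "first\n\nthird\n"             -- ["first","","third"]

-- unwords and unlines go in the opposite direction.
rejoined :: String
rejoined = unwords tokens                   -- "split these words"
```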

bytestring (Hackage)
Summary

Provides byte arrays, with a fake-string interface in Data.ByteString.Char8.

Only use it if you're working with text with known encoding and you need it to be fast, or when you're working with network protocols. For instance:

  • aeson doesn't translate JSON to Text before parsing it, but works on raw ByteStrings (and assumes UTF-8)
  • cassava stores CSV fields as ByteStrings
  • lucid outputs HTML as a ByteString
  • http-types uses ByteStrings for headers, URLs, and so on

Pros
  • The fastest option available. Unlike text, it doesn't do any encoding/decoding under the hood and gives you direct access to the bytes.

Cons
  • Only suitable for working with ASCII text, unless you take care to handle the encoding (as aeson does, for example). It won't necessarily break – e.g. you can still search for a UTF-8 substring in a UTF-8 string even if both are broken from the ByteString point of view, because they are broken the same way. However, it's still very fragile. A better alternative for dealing with UTF-8 (or ASCII) encoded memory is to use text-utf8 or text-short.

Ecosystem

There are more packages in the entry for bytestring in the “Arrays” category.

Notes

Imports and pragmas

Data.ByteString.Char8 has the same API as Data.ByteString but works in terms of Char instead of Word8, so you don't have to convert bytes to characters yourself:

import qualified Data.ByteString.Char8 as BC

And to be able to use string literals to construct ByteStrings, enable OverloadedStrings:

{-# LANGUAGE OverloadedStrings #-}
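
To illustrate why the encoding needs care, here is a small sketch comparing the Char8 interface with explicit UTF-8 encoding through Data.Text.Encoding. The value names are made up for this example, and the string "h\233llo" is "héllo" written with an escape.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

sample :: String
sample = "h\233llo"

-- Char8.pack keeps only the low 8 bits of every Char: one byte per
-- character, which is not UTF-8 for anything outside ASCII.
truncatedBytes :: BS.ByteString
truncatedBytes = BC.pack sample

-- Going through Text gives real UTF-8; the accented letter takes two bytes.
utf8Bytes :: BS.ByteString
utf8Bytes = TE.encodeUtf8 (T.pack sample)

-- decodeUtf8 throws on invalid input; decodeUtf8' returns an Either instead.
decoded :: T.Text
decoded = TE.decodeUtf8 utf8Bytes
```

Here BS.length truncatedBytes is 5 while BS.length utf8Bytes is 6, so the two byte strings are not interchangeable.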

text-short (Hackage)

Summary

A version of Text with less memory overhead, suitable for keeping a lot of short strings in memory. Implemented as a wrapper over ShortByteString.

The main difference between Text and ShortText is that ShortText uses UTF-8 internally (Text used UTF-16 before text-2.0) and doesn't support zero-copy slicing (thereby saving 2 words). Consequently, the memory footprint of a (boxed) ShortText value is 4 words (2 words when unboxed) plus the length of the UTF-8 encoded payload.

Note that unlike ByteString, Text doesn't use pinned memory, so there's no point in switching from Text to ShortText if you want to avoid heap fragmentation – Text already avoids it.

Ecosystem
  • text-containers provides memory-dense sets, arrays and associative maps over ShortText values.

text-utf8 (Hackage)

Summary

A fork of the text package that uses UTF-8 instead of UTF-16 as its internal representation.

intern (Hackage)

Summary

An implementation of interned strings (also known as "hash-consing", "symbols" or "atoms"). Every distinct string will only be kept in memory once, which is very useful when many of your strings are duplicates. Also provides O(1) string comparison, since it can be done simply by looking at the references.
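
To make the idea concrete, here is a toy sketch of interning built on Data.Map from containers. It is not how the intern package (or GHC's FastString) is implemented, and every name in it (Symbol, InternTable, emptyTable, internString) is hypothetical; it only illustrates why comparing interned strings can be cheap.

```haskell
import qualified Data.Map.Strict as M

-- A Symbol is just a small integer standing in for a string, so comparing
-- two Symbols is a single Int comparison instead of a string comparison.
newtype Symbol = Symbol Int
  deriving (Eq, Ord, Show)

-- The table remembers which Symbol was handed out for each string.
data InternTable = InternTable
  { nextId  :: !Int
  , symbols :: !(M.Map String Symbol)
  }

emptyTable :: InternTable
emptyTable = InternTable 0 M.empty

-- Return the existing Symbol for a known string, or allocate a fresh one.
internString :: String -> InternTable -> (Symbol, InternTable)
internString s tbl@(InternTable n m) =
  case M.lookup s m of
    Just sym -> (sym, tbl)
    Nothing  ->
      let sym = Symbol n
      in (sym, InternTable (n + 1) (M.insert s sym m))
```

Interning the same string twice yields equal Symbols, and each distinct string is stored in the table only once. A real library hides the table behind a global or monadic interface rather than threading it by hand.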
