Strings
Basics
Common-case recommendations
String is the standard string type from Prelude. It is widely used and rather easy to work with, since it's simply a synonym for [Char], so all list functions work with String out of the box. This representation is very convenient, but is neither fast nor memory-efficient. In addition, its functions don't fully conform to the complicated Unicode rules regarding casing, string comparison, etc.
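A minimal sketch of what "all list functions work with String" means in practice. The helper names capitalise and trim are made up for illustration; everything else comes from base.

```haskell
import Data.Char (isSpace, toUpper)
import Data.List (dropWhileEnd, isPrefixOf)

-- | Upper-case the first character. Pattern matching on (c:cs) works
-- because String is just [Char]. (Hypothetical helper, not from a library.)
capitalise :: String -> String
capitalise []     = []
capitalise (c:cs) = toUpper c : cs

-- | Strip leading and trailing whitespace using only list functions.
-- (Hypothetical helper, not from a library.)
trim :: String -> String
trim = dropWhileEnd isSpace . dropWhile isSpace

main :: IO ()
main = do
  putStrLn (capitalise "hello")           -- Hello
  print (words "  three little  words ")  -- ["three","little","words"]
  print (length (trim "  padded  "))      -- 6
  print ("has" `isPrefixOf` "haskell")    -- True
```

The convenience comes at a price: every character is a separate list cell, which is why this representation is neither fast nor compact.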
If you're working with strings a lot, you would be better off with Text from the text package, which is also very widely used in the Haskell ecosystem, has more text-specific utilities available, and is faster. If your strings are Unicode-heavy (e.g. names or addresses) and you must process them correctly, text-icu or unicode-transforms will be indispensable.
NB: Some people advocate for using Text instead of String in all cases, but if you're a beginner, String might be a better choice because all list operations apply to it. Even if you're not a beginner, you're still likely to use String in some places (when defining Show instances, for example, or when working with exceptions or logging). Don't try to avoid String everywhere; sometimes it's not worth the effort.
Rare-case recommendations
If you need speed and you're willing to do the decoding yourself, or if you're working with network protocols, you should consider bytestring. (For some reason the networking ecosystem in Haskell mostly uses bytestring.)
If you have lots of small strings, you can switch to text-short, which has less memory overhead than Text.
If you have lots of small strings and you anticipate that many of them will be the same (e.g. if you're writing a compiler), you can use interned strings with something like intern, or roll your own like GHC did with its FastString.
Text is the type for strings that is most commonly recommended “for production”. Versions before text-2.0 store the data as UTF-16 arrays under the hood; text-2.0 and later use UTF-8. It comes in strict and lazy variants (Data.Text and Data.Text.Lazy); the latter can be used for processing huge strings in a streaming fashion instead of more explicit approaches like pipes or conduit.
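As a rough illustration of the strict/lazy split, here is a sketch that counts non-blank lines of a lazy Text value. The function name countNonBlank is an assumption made for this example; the library calls are from Data.Text.Lazy in the text package.

```haskell
import qualified Data.Text      as T
import qualified Data.Text.Lazy as TL

-- | Count lines that contain something other than whitespace.
-- On lazy Text the input is consumed chunk by chunk, so the whole
-- string does not have to be in memory at once.
-- (Hypothetical helper, not part of the text package.)
countNonBlank :: TL.Text -> Int
countNonBlank = length . filter (not . TL.null . TL.strip) . TL.lines

-- | Convert between the two variants when an API needs the other one.
strictToLazy :: T.Text -> TL.Text
strictToLazy = TL.fromStrict

lazyToStrict :: TL.Text -> T.Text
lazyToStrict = TL.toStrict

main :: IO ()
main = print (countNonBlank (TL.pack "a\n\n b \n"))  -- 2
```

For real streaming, the input would typically come from something like Data.Text.Lazy.IO.readFile, which reads the file lazily.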
Pros
- Fast and uses less memory than String.
- Has more utility functions, like splitOn, available out of the box.
- Better conforms to various Unicode rules about string casing and so on. It still does codepoint-based rather than grapheme-based processing, but there are libraries that process Text the right way.
Cons
- Can be harder to manipulate if you're used to processing strings as lists (i.e. String).
- Versions before text-2.0 use UTF-16 internally and thus take additional time to encode to or decode from UTF-8 (text-2.0 and later store UTF-8 directly). See also text-utf8 or text-short.
- Doesn't have O(1) indexing because the internal encoding (UTF-16, or UTF-8 since text-2.0) is variable-length. This can be annoying if you only process ASCII (or close to ASCII) text, for which O(1) indexing is possible.
Notes
- Most parsing packages nowadays support Text, including megaparsec and attoparsec.
- To encode Text to (or decode it from) UTF-8, UTF-16, or UTF-32, use Data.Text.Encoding. For more encodings, see Data.Text.ICU.Convert.
- For a fast alternative to the Show class, see text-show (and additional instances in text-show-instances). For an alternative to the Read class, see readable. A fast show specifically for Double is in double-conversion.
- For advanced Unicode handling, see text-icu (which provides ICU bindings). unicode-transforms is a pure Haskell alternative that does only normalization (NFC, NFKC, NFD, NFKD), but with performance comparable to text-icu. text-manipulate has additional functions for working with word boundaries, PascalCasing and snake_casing, acronyms, truncating text intelligently, and so on. text-icu-translit has transliteration.
- case-insensitive provides newtypes for strings that should be compared case-insensitively, and text-normal provides newtypes for normalized text.
- For using big text literals (like templates) in Haskell sources, see neat-interpolation. For printf-like functionality, see formatting, fmt, or PyF.
- Orphan instances: cereal-text, quickcheck-text. Instances for binary are provided since text-1.2.1.
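To make the splitOn and Data.Text.Encoding remarks concrete, here is a small sketch. The names fields and decodeUtf8Safe are made up for this example; the calls they wrap (T.splitOn, T.strip, TE.encodeUtf8, TE.decodeUtf8') are from the text package.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString    as BS
import           Data.Text          (Text)
import qualified Data.Text          as T
import qualified Data.Text.Encoding as TE

-- | Split a comma-separated line into fields and trim each one.
-- (Hypothetical helper, not part of the text package.)
fields :: Text -> [Text]
fields = map T.strip . T.splitOn ","

-- | Decode UTF-8 bytes, returning the error as a value instead of
-- throwing an exception. (Hypothetical helper.)
decodeUtf8Safe :: BS.ByteString -> Either String Text
decodeUtf8Safe = either (Left . show) Right . TE.decodeUtf8'

main :: IO ()
main = do
  print (fields "name, street ,city")  -- ["name","street","city"]
  -- Encoding and then decoding valid text gives the original back.
  print (decodeUtf8Safe (TE.encodeUtf8 "Zürich") == Right "Zürich")
  -- A lone 0xFF byte is not valid UTF-8, so this yields a Left.
  print (decodeUtf8Safe (BS.pack [0xff]))
```

Using decodeUtf8' rather than decodeUtf8 is a design choice for this sketch: the latter throws on malformed input, which is rarely what you want for data coming from outside the program.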
String is the default Haskell type for strings. It is Unicode-aware but not particularly clever (slightly less clever than Text). It is defined as an ordinary list of characters:
type String = [Char]
It isn't very fast, but it isn't horribly slow either, and lots of libraries work with String instead of Text, so if you're not doing web dev and not writing anything with lots of string processing, you might just as well use it.
Even in codebases that use Text all the way, String is still sometimes used for error messages (e.g. a function that returns Either String a).
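A sketch of the Either String a pattern mentioned above. The function parsePort and its error messages are invented for illustration; only readMaybe comes from base.

```haskell
import Text.Read (readMaybe)

-- | Parse a TCP port number, describing any failure in a plain String.
-- (Hypothetical example function.)
parsePort :: String -> Either String Int
parsePort s = case readMaybe s of
  Nothing -> Left ("not a number: " ++ show s)
  Just n
    | n < 1 || n > 65535 -> Left ("port out of range: " ++ show n)
    | otherwise          -> Right n

main :: IO ()
main = do
  print (parsePort "8080")   -- Right 8080
  print (parsePort "70000")  -- Left "port out of range: 70000"
```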
Notes
- split can do pretty much anything when it comes to string splitting.
- utf8-string for converting to/from UTF-8.
- case-insensitive for case-insensitive comparisons.
bytestring provides byte arrays, with a fake-string interface in Data.ByteString.Char8.
Only use it if you're working with text with a known encoding and you need it to be fast, or when you're working with network protocols. For instance:
- aeson doesn't translate JSON to Text before parsing it, but works on raw ByteStrings (and assumes UTF-8)
- cassava stores CSV fields as ByteStrings
- lucid outputs HTML as a ByteString
- http-types uses ByteStrings for headers, URLs, and so on
Pros
- The fastest option available. Unlike text, it doesn't do any encoding/decoding under the hood and gives you direct access to the bytes.
Cons
- Only suitable for working with ASCII text, unless you take care to handle the encoding (as aeson does, for example). It won't necessarily break: you can still search for a UTF-8 substring in a UTF-8 string even if both are broken from the ByteString point of view, because they are broken in the same way. However, it's still very fragile. A better alternative for dealing with UTF-8 (or ASCII) encoded memory is to use text-utf8 or text-short.
Notes
- case-insensitive for case-insensitive comparisons
- bytestring-show as a replacement for Show, readable as a replacement for Read
- attoparsec is particularly well-suited for parsing ByteStrings
- stringsearch for fast searching, replacement, and splitting
- utf8-string for basic UTF-8 operations on ByteStrings, e.g. taking the first N characters
There are more packages in the entry for bytestring in the “Arrays” category.
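The following sketch shows why ByteString is fragile for non-ASCII text, and why substring search can still work when both sides are UTF-8. The top-level names (naive, byteCount, and so on) are made up for the example; the library functions are from bytestring and text.

```haskell
import qualified Data.ByteString       as BS
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text             as T
import qualified Data.Text.Encoding    as TE

-- | The word "naïve" encoded as UTF-8 ('\239' is the character ï).
naive :: BS.ByteString
naive = TE.encodeUtf8 (T.pack "na\239ve")

-- | ByteString counts bytes: ï takes two bytes in UTF-8, so this is 6.
byteCount :: Int
byteCount = BS.length naive

-- | After decoding, Text counts characters, so this is 5.
charCount :: Int
charCount = T.length (TE.decodeUtf8 naive)

-- | The Char8 interface treats every byte as one character, so the two
-- bytes of ï show up as two unrelated characters.
char8View :: String
char8View = BC.unpack naive

-- | Searching still works as long as needle and haystack are both UTF-8.
containsFragment :: Bool
containsFragment = TE.encodeUtf8 (T.pack "\239v") `BS.isInfixOf` naive

main :: IO ()
main = do
  print (byteCount, charCount)  -- (6,5)
  print char8View               -- "na\195\175ve"
  print containsFragment        -- True
```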
text-short provides ShortText, a version of Text with less memory overhead, suitable for keeping a lot of short strings in memory. It is implemented as a wrapper over ShortByteString.
The main difference between Text and ShortText is that ShortText uses UTF-8 instead of UTF-16 internally and also doesn't support zero-copy slicing (thereby saving 2 words). Consequently, the memory footprint of a (boxed) ShortText value is 4 words (2 words when unboxed) plus the length of the UTF-8 encoded payload.
Note that unlike ByteString, Text doesn't use pinned memory, so there's no point in switching from Text to ShortText if you want to avoid heap fragmentation: Text already avoids it.
Notes
- text-containers provides memory-dense sets, arrays and associative maps over ShortText values.
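A short sketch of moving between Text and ShortText, assuming the fromText, toText, and length functions exported by Data.Text.Short. The wrapper names compact and expand are hypothetical.

```haskell
import qualified Data.Text       as T
import qualified Data.Text.Short as TS

-- | Convert to the compact representation for long-term storage.
-- (Hypothetical wrapper around TS.fromText.)
compact :: T.Text -> TS.ShortText
compact = TS.fromText

-- | Convert back when the richer Text API is needed.
-- (Hypothetical wrapper around TS.toText.)
expand :: TS.ShortText -> T.Text
expand = TS.toText

main :: IO ()
main = do
  let names = map (compact . T.pack) ["alice", "bob"]
  print (map TS.length names)  -- [5,3]
  print (map expand names == map T.pack ["alice", "bob"])  -- True
```

A typical use would be to keep many such values in a long-lived container and only convert to Text at the edges of the program.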
An implementation of interned strings (also known as "hash-consing", "symbols" or "atoms"). Every distinct string will only be kept in memory once, which is very useful when many of your strings are duplicates. Also provides O(1) string comparison, since it can be done simply by looking at the references.
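The idea can be sketched with a hand-rolled, pure intern table. This is not the API of any interning package, just an illustration under simple assumptions: the names Symbol, Table, internString, and internAll are all made up, and the table is an ordinary Map from containers.

```haskell
import qualified Data.Map.Strict as Map

-- | An interned string is represented by a small integer, so comparing
-- two symbols is a single Int comparison. (Hypothetical type.)
newtype Symbol = Symbol Int
  deriving (Eq, Ord, Show)

-- | The intern table: the next free id plus the strings seen so far.
data Table = Table
  { nextId :: !Int
  , known  :: !(Map.Map String Symbol)
  }

emptyTable :: Table
emptyTable = Table 0 Map.empty

-- | Return the existing symbol for a string, or allocate a new one.
internString :: String -> Table -> (Symbol, Table)
internString s t@(Table n m) = case Map.lookup s m of
  Just sym -> (sym, t)
  Nothing  ->
    let sym = Symbol n
    in  (sym, Table (n + 1) (Map.insert s sym m))

-- | Intern a list of strings, threading the table through.
internAll :: [String] -> ([Symbol], Table)
internAll = go emptyTable
  where
    go t []     = ([], t)
    go t (s:ss) =
      let (sym, t')   = internString s t
          (syms, t'') = go t' ss
      in  (sym : syms, t'')

main :: IO ()
main = print (fst (internAll ["x", "y", "x"]))
-- [Symbol 0,Symbol 1,Symbol 0]
```

Real implementations usually hide the table behind a global, mutable store so that callers don't have to thread it around; the pure version here only shows the bookkeeping.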