Copyright	(c) 2020 Composewell Technologies
License	BSD-3-Clause
Maintainer	streamly@composewell.com
Stability	released
Portability	GHC
Safe Haskell	Safe-Inferred
Language	Haskell2010

Streamly.Unicode.Stream

Contents

Setup
Construction (Decoding)
Elimination (Encoding)

Description

Processing Unicode Strings

A Char stream is the canonical representation to process Unicode strings. It can be processed efficiently using regular stream processing operations. A byte stream of Unicode text read from an IO device or from an Array in memory can be decoded into a Char stream using the decoding routines in this module. A String ([Char]) can be converted into a Char stream using fromList. An Array Char can be unfolded into a stream using the array read unfold.
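
As a small illustration of the two routes described above, the following ghci sketch decodes a UTF-8 byte stream into a Char stream and also builds a Char stream from a String. The byte values are illustrative (the three-byte UTF-8 encoding of U+20AC), and the examples run in the IO monad that ghci selects by default.

>>> import qualified Streamly.Data.Fold as Fold
>>> import qualified Streamly.Data.Stream as Stream
>>> import qualified Streamly.Unicode.Stream as Unicode
>>> Stream.fold Fold.toList $ Unicode.decodeUtf8 $ Stream.fromList [0xE2, 0x82, 0xAC]
"\8364"
>>> Stream.fold Fold.length $ Stream.fromList "hello"
5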

Storing Unicode Strings

A stream of Char can be encoded into a byte stream using the encoding routines in this module and then written to IO devices or to arrays in memory.

If you have to store a Char stream in memory, you can fold it into an Array Char using the array write fold. The Array type provides a more compact representation than a list, reducing GC overhead. If space efficiency is a concern, you can apply encodeUtf8' to the Char stream before writing it to an Array, which gives an even more compact representation.
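
The difference between the two storage options can be sketched as follows. This assumes that the write fold mentioned above is available as Array.write from Streamly.Data.Array (the name may differ in other releases); the sample string contains U+00E9, which takes two bytes in UTF-8.

>>> import qualified Streamly.Data.Array as Array
>>> import qualified Streamly.Data.Stream as Stream
>>> import qualified Streamly.Unicode.Stream as Unicode
>>> chars <- Stream.fold Array.write (Stream.fromList "h\233llo")
>>> Array.length chars
5
>>> bytes <- Stream.fold Array.write (Unicode.encodeUtf8' (Stream.fromList "h\233llo"))
>>> Array.length bytes
6

The first array holds five Char elements, while the second holds six Word8 elements, one per UTF-8 encoded byte.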

String Literals

Stream Identity Char and Array Char are instances of IsString and IsList. Therefore, the OverloadedStrings and OverloadedLists extensions can be used for convenience when writing Unicode string literals of these types.
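
A minimal sketch of such literals in ghci, assuming the IsString instances described above. The type annotations select which instance the literal uses.

>>> :set -XOverloadedStrings
>>> import Data.Functor.Identity (Identity, runIdentity)
>>> import qualified Streamly.Data.Array as Array
>>> import qualified Streamly.Data.Fold as Fold
>>> import qualified Streamly.Data.Stream as Stream
>>> runIdentity $ Stream.fold Fold.toList ("hello" :: Stream.Stream Identity Char)
"hello"
>>> Array.length ("hello" :: Array.Array Char)
5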

Idioms

Some simple text processing operations can be represented simply as operations on Char streams. Follow the links for the following idioms:

Pitfalls

Case conversion: Some Unicode characters translate to more than one code point on case conversion. The toUpper and toLower functions in the base package do not handle such characters. Therefore, operations like map toUpper on a character stream or character array may not always perform the correct conversion.
String comparison: In some cases, visually identical strings may have different Unicode representations. Therefore, character streams or character arrays cannot always be compared directly. A normalized comparison may be needed to check string equivalence correctly.
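
Both pitfalls can be seen in a short ghci sketch. The sample strings are illustrative: U+00DF (sharp s) has no single-character uppercase form, so toUpper leaves it unchanged, and the letter U+00E9 can also be written as 'e' followed by the combining acute accent U+0301.

>>> import Data.Char (toUpper)
>>> import qualified Streamly.Data.Fold as Fold
>>> import qualified Streamly.Data.Stream as Stream
>>> Stream.fold Fold.toList $ fmap toUpper $ Stream.fromList "stra\223e"
"STRA\223E"
>>> Stream.fold Fold.length $ Stream.fromList "\233"
1
>>> Stream.fold Fold.length $ Stream.fromList "e\769"
2

The last two streams render identically as text but differ element by element, so a direct comparison reports them as unequal.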

Experimental APIs

Some experimental APIs to conveniently process text using the Array Char representation directly can be found in Streamly.Internal.Unicode.Array.

Synopsis

decodeLatin1 :: Monad m => Stream m Word8 -> Stream m Char
decodeUtf8 :: Monad m => Stream m Word8 -> Stream m Char
decodeUtf8' :: Monad m => Stream m Word8 -> Stream m Char
decodeUtf8Chunks :: MonadIO m => Stream m (Array Word8) -> Stream m Char
encodeLatin1 :: Monad m => Stream m Char -> Stream m Word8
encodeLatin1' :: Monad m => Stream m Char -> Stream m Word8
encodeUtf8 :: Monad m => Stream m Char -> Stream m Word8
encodeUtf8' :: Monad m => Stream m Char -> Stream m Word8
encodeStrings :: MonadIO m => (Stream m Char -> Stream m Word8) -> Stream m String -> Stream m (Array Word8)

Setup

To execute the code examples provided in this module in ghci, please run the following commands first.

>>> :m

>>> import qualified Streamly.Data.Fold as Fold
>>> import qualified Streamly.Data.Stream as Stream
>>> import qualified Streamly.Unicode.Stream as Unicode

For APIs that have not been released yet.

>>> :set -XMagicHash
>>> import qualified Streamly.Internal.Unicode.Stream as Unicode

Construction (Decoding)

decodeLatin1 :: Monad m => Stream m Word8 -> Stream m Char

Decode a stream of bytes to Unicode characters by mapping each byte to a corresponding Unicode Char in 0-255 range.
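
For instance, assuming the imports from the Setup section, each byte below becomes the Char with the same code point (0xE9 is U+00E9):

>>> Stream.fold Fold.toList $ Unicode.decodeLatin1 $ Stream.fromList [0x63, 0x61, 0x66, 0xE9]
"caf\233"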

decodeUtf8 :: Monad m => Stream m Word8 -> Stream m Char

Decode a UTF-8 encoded bytestream to a stream of Unicode characters. Any invalid byte sequence encountered is replaced with the Unicode replacement character U+FFFD.
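
A small sketch of the replacement behavior, assuming the imports from the Setup section. The byte 0xFF can never appear in valid UTF-8, so it is used here as an illustrative invalid input; 65533 is U+FFFD.

>>> Stream.fold Fold.toList $ Unicode.decodeUtf8 $ Stream.fromList [0x61, 0xFF, 0x62]
"a\65533b"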

decodeUtf8' :: Monad m => Stream m Word8 -> Stream m Char

Decode a UTF-8 encoded bytestream to a stream of Unicode characters. The function throws an error if an invalid byte sequence is encountered.
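
One way to observe the failure without aborting the ghci session is to catch it with try, as sketched below. This assumes the imports from the Setup section; the exact exception type and message are not specified here, so the sketch catches SomeException and substitutes a fixed, illustrative message.

>>> import Control.Exception (SomeException, try)
>>> r <- try (Stream.fold Fold.toList $ Unicode.decodeUtf8' $ Stream.fromList [0x61, 0xFF, 0x62]) :: IO (Either SomeException String)
>>> either (const "decode failed") id r
"decode failed"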

decodeUtf8Chunks :: MonadIO m => Stream m (Array Word8) -> Stream m Char

Like decodeUtf8 but for a chunked stream. It may be slightly faster than flattening the stream and then decoding with decodeUtf8.
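
A sketch of decoding a chunked stream, assuming the imports from the Setup section plus Streamly.Data.Array. The chunk boundary is deliberately placed inside the three-byte encoding of U+20AC to illustrate that the chunks are decoded as one continuous byte sequence.

>>> import qualified Streamly.Data.Array as Array
>>> chunks = Stream.fromList [Array.fromList [0xE2, 0x82], Array.fromList [0xAC, 0x21]]
>>> Stream.fold Fold.toList $ Unicode.decodeUtf8Chunks chunks
"\8364!"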

Elimination (Encoding)

encodeLatin1 :: Monad m => Stream m Char -> Stream m Word8

Like encodeLatin1' but silently maps input codepoints beyond 255 to arbitrary Latin1 chars in 0-255 range. No error or exception is thrown when such mapping occurs.

encodeLatin1' :: Monad m => Stream m Char -> Stream m Word8

Encode a stream of Unicode characters to bytes by mapping each character to a byte in 0-255 range. Throws an error if the input stream contains characters beyond 255.
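
A short sketch of both Latin-1 encoders, assuming the imports from the Setup section. The first input stays within the 0-255 range. The second contains U+20AC, for which encodeLatin1 still emits a single byte; its value is unspecified, so only the output length is checked.

>>> Stream.fold Fold.toList $ Unicode.encodeLatin1' $ Stream.fromList "caf\233"
[99,97,102,233]
>>> Stream.fold Fold.length $ Unicode.encodeLatin1 $ Stream.fromList "a\8364"
2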

encodeUtf8 :: Monad m => Stream m Char -> Stream m Word8

Encode a stream of Unicode characters to a UTF-8 encoded bytestream. Any invalid characters (surrogate code points, U+D800-U+DFFF) in the input stream are replaced by the Unicode replacement character U+FFFD.
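
For illustration, assuming the imports from the Setup section: U+20AC encodes to three bytes, and the surrogate code point U+D800 (written '\55296' in Haskell) is emitted as the UTF-8 encoding of U+FFFD.

>>> Stream.fold Fold.toList $ Unicode.encodeUtf8 $ Stream.fromList "\8364"
[226,130,172]
>>> Stream.fold Fold.toList $ Unicode.encodeUtf8 $ Stream.fromList "\55296"
[239,191,189]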

encodeUtf8' :: Monad m => Stream m Char -> Stream m Word8

Encode a stream of Unicode characters to a UTF-8 encoded bytestream. When an invalid character (a surrogate code point, U+D800-U+DFFF) is encountered in the input stream, the function throws an error.
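
The strict behavior can be observed by catching the error with try, as in this sketch. It assumes the imports from the Setup section; the exception type and message are not specified here, so SomeException is caught and replaced by an empty result.

>>> import Control.Exception (SomeException, try)
>>> import Data.Word (Word8)
>>> r <- try (Stream.fold Fold.toList $ Unicode.encodeUtf8' $ Stream.fromList "a\55296") :: IO (Either SomeException [Word8])
>>> either (const []) id r
[]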

encodeStrings :: MonadIO m => (Stream m Char -> Stream m Word8) -> Stream m String -> Stream m (Array Word8)

Encode a stream of String using the supplied encoding scheme. Each string is encoded as an Array Word8.
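
A usage sketch, assuming the imports from the Setup section plus Streamly.Data.Array, with encodeUtf8 supplied as the encoding scheme. Each input string yields one array, shown here as a list of bytes.

>>> import qualified Streamly.Data.Array as Array
>>> arrs <- Stream.fold Fold.toList $ Unicode.encodeStrings Unicode.encodeUtf8 $ Stream.fromList ["ab", "\8364"]
>>> map Array.toList arrs
[[97,98],[226,130,172]]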