distributors-0.3.0.0: Unifying Parsers, Printers & Grammars
Copyright    (C) 2026 - Eitan Chatav
License      BSD-style (see the file LICENSE)
Maintainer   Eitan Chatav <eitan.chatav@gmail.com>
Stability    provisional
Portability  non-portable
Safe Haskell Safe-Inferred
Language     Haskell2010

Control.Lens.Grammar

Description

Synopsis

Regular grammar

type RegGrammar token a = forall p. (Lexical token p, Alternator p) => p a a Source #

A regular grammar may be constructed using Lexical and Alternator combinators. Let's see an example using semantic versioning.

>>> import Numeric.Natural (Natural)
>>> :{
data SemVer = SemVer          -- e.g., 2.1.5-rc.1+build.123
  { major         :: Natural  -- e.g., 1
  , minor         :: Natural  -- e.g., 2
  , patch         :: Natural  -- e.g., 3
  , preRelease    :: [String] -- e.g., "alpha.1", "rc.2"
  , buildMetadata :: [String] -- e.g., "build.123", "20130313144700"
  }
  deriving (Eq, Ord, Show, Read)
:}

We'd like to define an optic _SemVer, corresponding to the constructor pattern SemVer. Unfortunately, we can't use TemplateHaskell to generate it in GHCi, which is used to test this documentation. Normally we would write makeNestedPrisms ''SemVer, but here is the equivalent explicit Haskell code instead. Since SemVer has a single constructor, _SemVer can be an Iso rather than a Prism.

>>> :set -XRecordWildCards
>>> import Control.Lens (Iso', iso)
>>> :{
_SemVer :: Iso' SemVer (Natural, (Natural, (Natural, ([String], [String]))))
_SemVer = iso
  (\SemVer {..} -> (major, (minor, (patch, (preRelease, buildMetadata)))))
  (\(major, (minor, (patch, (preRelease, buildMetadata)))) -> SemVer {..})
:}

Now we can build a RegGrammar for SemVer using the "idiom" style of Applicative parsing, with a couple of modifications.

>>> :{
semverGrammar :: RegGrammar Char SemVer
semverGrammar = _SemVer
  >?  numberG
  >*< terminal "." >* numberG
  >*< terminal "." >* numberG
  >*< option [] (terminal "-" >* identifiersG)
  >*< option [] (terminal "+" >* identifiersG)
  where
    numberG = iso show read >~ someP (asIn @Char DecimalNumber)
    identifiersG = several1 (sepBy (terminal ".")) (someP charG)
    charG = asIn LowercaseLetter
      <|> asIn UppercaseLetter
      <|> asIn DecimalNumber
      <|> token '-'
:}

Instead of using the constructor SemVer with the Functor applicator <$>, we use the optic _SemVer we defined and the Choice applicator >?. We could equally have used the Profunctor applicator >~, since _SemVer is an Iso. A few Alternative combinators like <|> work both Functorially and Profunctorially.

Functorial    Profunctorial
SemVer        _SemVer
<$>           >?
*>            >*
<*            *<
<*>           >*<
<|>           <|>
option        option
choice        choice
many          manyP
some          someP
optional      optionalP
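For instance, where the Functorial idiom would be written (,) <$> word <* comma <*> word, the Profunctorial column gives the following sketch of a hypothetical grammar pairG for a comma-separated pair of lowercase words, using only combinators from the table above.

>>> :{
-- hypothetical example grammar; two lowercase words separated by a comma
pairG :: RegGrammar Char (String, String)
pairG = someP lowerG >*< terminal "," >* someP lowerG
  where lowerG = asIn @Char LowercaseLetter
:}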

You can generate a RegString from a RegGrammar with regstringG.

>>> putStringLn (regstringG semverGrammar)
\p{Nd}+(.\p{Nd}+(.\p{Nd}+((-((\p{Ll}|\p{Lu}|\p{Nd}|-)+(.(\p{Ll}|\p{Lu}|\p{Nd}|-)+)*))?(\+((\p{Ll}|\p{Lu}|\p{Nd}|-)+(.(\p{Ll}|\p{Lu}|\p{Nd}|-)+)*))?)))

You can also generate parsers and printers.

>>> [parsed | (parsed, "") <- parseG semverGrammar "2.1.5-rc.1+build.123"]
[SemVer {major = 2, minor = 1, patch = 5, preRelease = ["rc","1"], buildMetadata = ["build","123"]}]

Parsing unconses tokens left-to-right, from the beginning of a string. Unparsing, on the other hand, snocs tokens left-to-right, to the end of a string.

>>> unparseG semverGrammar (SemVer 1 0 0 ["alpha"] []) "SemVer: " :: Maybe String
Just "SemVer: 1.0.0-alpha"

Printing, on the gripping hand, conses tokens right-to-left, to the beginning of a string.

>>> ($ " is the SemVer.") <$> printG semverGrammar (SemVer 1 2 3 [] []) :: Maybe String
Just "1.2.3 is the SemVer."

Profunctorial combinators give us correct-by-construction invertible parsers. New RegGrammar generators can be defined by writing new instances of Lexical and Alternator.

type Lexical token p = (forall x y. (x ~ (), y ~ ()) => TerminalSymbol token (p x y), forall x y. (x ~ token, y ~ token) => TokenAlgebra token (p x y)) :: Constraint Source #
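Reading the constraint, Lexical token p bundles the TerminalSymbol interface at p () () with the TokenAlgebra interface at p token token, so a Lexical profunctor supports combinators like terminal, token and asIn. For example, a minimal sketch of a hypothetical keyword grammar:

>>> :{
keywordG :: RegGrammar Char ()
keywordG = terminal "let" <|> terminal "where"
:}
>>> regstringG keywordG
"let|where"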

newtype RegString Source #

RegStrings are an embedded domain-specific language of regular expression strings. Since they denote strings, they have a string-like interface.

>>> let rex = fromString "ab|c" :: RegString
>>> putStringLn rex
ab|c
>>> rex
"ab|c"

RegStrings can be generated from RegGrammars with regstringG.

>>> regstringG (terminal "a" >* terminal "b" <|> terminal "c")
"ab|c"

RegStrings are actually stored as an algebraic datatype, RegEx.

>>> runRegString rex
RegExam (Alternate (Terminal "ab") (Terminal "c"))

RegStrings are similar to regular expression strings in many other programming languages. We can use the Matching method =~ to see whether a word matches a pattern.

>>> "ab" =~ rex
True
>>> "c" =~ rex
True
>>> "xyz" =~ rex
False

Like RegGrammars, RegStrings can use all the Lexical combinators. But unlike RegGrammars, which use Monoidal and Alternator combinators, RegStrings use Monoid and KleeneStarAlgebra combinators.

>>> terminal "a" <> terminal "b" >|< terminal "c" :: RegString
"ab|c"
>>> mempty :: RegString
""

Since RegStrings are a KleeneStarAlgebra, they support Kleene quantifiers.

>>> starK rex
"(ab|c)*"
>>> plusK rex
"(ab|c)+"
>>> optK rex
"(ab|c)?"

Like other regular expression languages, RegStrings support character classes.

>>> oneOf "abc" :: RegString
"[abc]"
>>> notOneOf "abc" :: RegString
"[^abc]"

Character classes also express failure, which matches no character or string, and the wildcard, which matches any single character.

>>> zeroK :: RegString
"[]"
>>> anyToken :: RegString
"[^]"

Additional forms of character classes test for a character's GeneralCategory.

>>> asIn LowercaseLetter :: RegString
"\\p{Ll}"
>>> notAsIn Control :: RegString
"\\P{Cc}"

KleeneStarAlgebras support alternation >|<, and the Tokenized combinators are all negatable. However, we'd like to be able to take the intersection of character classes as well. RegStrings can combine characters' tokenClasses using BooleanAlgebra combinators.

>>> tokenClass (notOneOf "abc" >&&< notOneOf "xyz") :: RegString
"[^abcxyz]"
>>> tokenClass (oneOf "abcxyz" >&&< notOneOf "xyz") :: RegString
"[abc]"
>>> tokenClass (notOneOf "#$%" >&&< notAsIn Control) :: RegString
"[^#$%\\P{Cc}]"
>>> tokenClass (allB notAsIn [MathSymbol, Control]) :: RegString
"\\P{Sm|Cc}"
>>> tokenClass (notB (oneOf "xyz")) :: RegString
"[^xyz]"

Ill-formed RegStrings normalize to failure.

>>> fromString ")(" :: RegString
"[]"

Constructors

RegString 

Instances

Instances details
IsString RegString Source # 
Instance details

Defined in Control.Lens.Grammar

Monoid RegString Source # 
Instance details

Defined in Control.Lens.Grammar

Semigroup RegString Source # 
Instance details

Defined in Control.Lens.Grammar

IsList RegString Source # 
Instance details

Defined in Control.Lens.Grammar

Associated Types

type Item RegString #

Read RegString Source # 
Instance details

Defined in Control.Lens.Grammar

Show RegString Source # 
Instance details

Defined in Control.Lens.Grammar

KleeneStarAlgebra RegString Source # 
Instance details

Defined in Control.Lens.Grammar

NonTerminalSymbol RegString Source # 
Instance details

Defined in Control.Lens.Grammar

Eq RegString Source # 
Instance details

Defined in Control.Lens.Grammar

Ord RegString Source # 
Instance details

Defined in Control.Lens.Grammar

Matching String RegString Source # 
Instance details

Defined in Control.Lens.Grammar

Methods

(=~) :: String -> RegString -> Bool Source #

TokenAlgebra Char RegString Source # 
Instance details

Defined in Control.Lens.Grammar

TerminalSymbol Char RegString Source # 
Instance details

Defined in Control.Lens.Grammar

Methods

terminal :: [Char] -> RegString Source #

Tokenized Char RegString Source # 
Instance details

Defined in Control.Lens.Grammar

type Item RegString Source # 
Instance details

Defined in Control.Lens.Grammar

regstringG :: RegGrammar Char a -> RegString Source #

regstringG generates a RegString from a regular grammar. Since context-free Grammars and CtxGrammars aren't necessarily regular, the type system will prevent regstringG from being applied to them.
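For example, applying regstringG to the number sub-grammar from semverGrammar above recovers the character-class expression seen in its output:

>>> regstringG (someP (asIn @Char DecimalNumber))
"\\p{Nd}+"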

regexGrammar :: Grammar Char RegString Source #

regexGrammar is a context-free Grammar for RegStrings. It can't be a RegGrammar, since RegStrings include parenthesization, and balanced parentheses form a context-free, not regular, language.

>>> putStringLn (regbnfG regexGrammar)
{start} = \q{regex}
{alternate} = \q{sequence}(\|\q{sequence})*
{atom} = (\\q\{)\q{char}*\}|\q{char}|\q{char-class}|\(\q{regex}\)
{category} = Ll|Lu|Lt|Lm|Lo|Mn|Mc|Me|Nd|Nl|No|Pc|Pd|Ps|Pe|Pi|Pf|Po|Sm|Sc|Sk|So|Zs|Zl|Zp|Cc|Cf|Cs|Co|Cn
{category-test} = (\\p\{)\q{category}\}|(\\P\{)(\q{category}(\|\q{category})*)\}
{char} = [^\(\)\*\+\?\[\\\]\^\{\|\}\P{Cc}]|\\\q{char-escaped}
{char-any} = \[\^\]
{char-class} = \q{fail}|\q{char-any}|\q{one-of}|\q{not-one-of}|\q{category-test}
{char-control} = NUL|SOH|STX|ETX|EOT|ENQ|ACK|BEL|BS|HT|LF|VT|FF|CR|SO|SI|DLE|DC1|DC2|DC3|DC4|NAK|SYN|ETB|CAN|EM|SUB|ESC|FS|GS|RS|US|DEL|PAD|HOP|BPH|NBH|IND|NEL|SSA|ESA|HTS|HTJ|VTS|PLD|PLU|RI|SS2|SS3|DCS|PU1|PU2|STS|CCH|MW|SPA|EPA|SOS|SGCI|SCI|CSI|ST|OSC|PM|APC
{char-escaped} = [\(\)\*\+\?\[\\\]\^\{\|\}]|\q{char-control}
{expression} = \q{atom}\?|\q{atom}\*|\q{atom}\+|\q{atom}
{fail} = \[\]
{not-one-of} = (\[\^)\q{char}+(\q{category-test}?\])
{one-of} = \[\q{char}+\]
{regex} = \q{alternate}
{sequence} = \q{char}*|\q{expression}*
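Since regexGrammar is a Grammar for RegString, it can also parse a regular expression string back into a RegString; a sketch of the expected round trip:

>>> [rex | (rex, "") <- parseG regexGrammar "ab|c"]
["ab|c"]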

Context-free grammar

type Grammar token a = forall p. (Lexical token p, forall x. BackusNaurForm (p x x), Alternator p) => p a a Source #

Context-free Grammars add two capabilities to RegGrammars, coming from the BackusNaurForm interface:

  • rule abstraction,
  • and general recursion.

regexGrammar and regbnfGrammar are examples of context-free Grammars. Regular expressions are a form of expression algebra. Let's see a similar but simpler example: the algebra of arithmetic expressions over natural numbers.

>>> import Numeric.Natural (Natural)
>>> :{
data Arith
  = Num Natural
  | Add Arith Arith
  | Mul Arith Arith
  deriving (Eq, Ord, Show, Read)
:}

Here are Prisms for the constructor patterns.

>>> import Control.Lens (Prism', prism')
>>> :{
_Num :: Prism' Arith Natural
_Num = prism' Num (\case Num n -> Just n; _ -> Nothing)
_Add, _Mul :: Prism' Arith (Arith, Arith)
_Add = prism' (uncurry Add) (\case Add x y -> Just (x,y); _ -> Nothing)
_Mul = prism' (uncurry Mul) (\case Mul x y -> Just (x,y); _ -> Nothing)
:}

Now we can build a Grammar for Arith by combining "idiom" style with named rules, and tying the recursive loop (caused by parenthesization) with ruleRec.

>>> :{
arithGrammar :: Grammar Char Arith
arithGrammar = ruleRec "arith" sumG
  where
    sumG arith = rule "sum" $
      chain1 Left _Add (sepBy (terminal "+")) (prodG arith)
    prodG arith = rule "product" $
      chain1 Left _Mul (sepBy (terminal "*")) (factorG arith)
    factorG arith = rule "factor" $
      numberG <|> terminal "(" >* arith *< terminal ")"
    numberG = rule "number" $
      _Num . iso show read >? someP (asIn @Char DecimalNumber)
:}

We can generate a RegBnf, parsers, and printers from arithGrammar.

>>> putStringLn (regbnfG arithGrammar)
{start} = \q{arith}
{arith} = \q{sum}
{factor} = \q{number}|\(\q{arith}\)
{number} = \p{Nd}+
{product} = \q{factor}(\*\q{factor})*
{sum} = \q{product}(\+\q{product})*
>>> [x | (x,"") <- parseG arithGrammar "1+2*3+4"]
[Add (Add (Num 1) (Mul (Num 2) (Num 3))) (Num 4)]
>>> unparseG arithGrammar (Add (Num 1) (Mul (Num 2) (Num 3))) "" :: Maybe String
Just "1+2*3"
>>> do pr <- printG arithGrammar (Num 69); return (pr "") :: Maybe String
Just "69"

newtype RegBnf Source #

RegBnfs are an embedded domain-specific language of Backus-Naur forms extended by regular expression strings. Like RegStrings, they have a string-like interface.

>>> let bnf = fromString "{start} = foo|bar" :: RegBnf
>>> putStringLn bnf
{start} = foo|bar
>>> bnf
"{start} = foo|bar"

RegBnfs can be generated from context-free Grammars with regbnfG.

>>> :type regbnfG regbnfGrammar
regbnfG regbnfGrammar :: RegBnf

Like RegStrings, RegBnfs can be constructed using Lexical, Monoid and KleeneStarAlgebra combinators. But they also support BackusNaurForm rules and ruleRecs.
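For instance, here is a sketch of a RegBnf built directly with rule, assuming rule references render with the \q{name} syntax seen in regbnfG output below:

>>> putStringLn (rule "keyword" (terminal "let" >|< terminal "where") :: RegBnf)
{start} = \q{keyword}
{keyword} = let|where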

Constructors

RegBnf 

Instances

Instances details
IsString RegBnf Source # 
Instance details

Defined in Control.Lens.Grammar

Methods

fromString :: String -> RegBnf #

Monoid RegBnf Source # 
Instance details

Defined in Control.Lens.Grammar

Semigroup RegBnf Source # 
Instance details

Defined in Control.Lens.Grammar

IsList RegBnf Source # 
Instance details

Defined in Control.Lens.Grammar

Associated Types

type Item RegBnf #

Read RegBnf Source # 
Instance details

Defined in Control.Lens.Grammar

Show RegBnf Source # 
Instance details

Defined in Control.Lens.Grammar

BackusNaurForm RegBnf Source # 
Instance details

Defined in Control.Lens.Grammar

KleeneStarAlgebra RegBnf Source # 
Instance details

Defined in Control.Lens.Grammar

NonTerminalSymbol RegBnf Source # 
Instance details

Defined in Control.Lens.Grammar

Eq RegBnf Source # 
Instance details

Defined in Control.Lens.Grammar

Methods

(==) :: RegBnf -> RegBnf -> Bool #

(/=) :: RegBnf -> RegBnf -> Bool #

Ord RegBnf Source # 
Instance details

Defined in Control.Lens.Grammar

Matching String RegBnf Source # 
Instance details

Defined in Control.Lens.Grammar

Methods

(=~) :: String -> RegBnf -> Bool Source #

TokenAlgebra Char RegBnf Source # 
Instance details

Defined in Control.Lens.Grammar

TerminalSymbol Char RegBnf Source # 
Instance details

Defined in Control.Lens.Grammar

Methods

terminal :: [Char] -> RegBnf Source #

Tokenized Char RegBnf Source # 
Instance details

Defined in Control.Lens.Grammar

type Item RegBnf Source # 
Instance details

Defined in Control.Lens.Grammar

regbnfG :: Grammar Char a -> RegBnf Source #

regbnfG generates a RegBnf from a context-free Grammar. Since CtxGrammars aren't necessarily context-free, the type system will prevent regbnfG from being applied to a CtxGrammar. It can, however, be applied to a RegGrammar, since every regular grammar is context-free.
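For example, applying regbnfG to the regular grammar from the RegString examples should yield a Backus-Naur form with only a start rule:

>>> putStringLn (regbnfG (terminal "a" >* terminal "b" <|> terminal "c"))
{start} = ab|c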

regbnfGrammar :: Grammar Char RegBnf Source #

regbnfGrammar is a context-free Grammar for RegBnfs. That means that it can generate a self-hosted definition.

>>> putStringLn (regbnfG regbnfGrammar)
{start} = \q{regbnf}
{alternate} = \q{sequence}(\|\q{sequence})*
{atom} = (\\q\{)\q{char}*\}|\q{char}|\q{char-class}|\(\q{regex}\)
{category} = Ll|Lu|Lt|Lm|Lo|Mn|Mc|Me|Nd|Nl|No|Pc|Pd|Ps|Pe|Pi|Pf|Po|Sm|Sc|Sk|So|Zs|Zl|Zp|Cc|Cf|Cs|Co|Cn
{category-test} = (\\p\{)\q{category}\}|(\\P\{)(\q{category}(\|\q{category})*)\}
{char} = [^\(\)\*\+\?\[\\\]\^\{\|\}\P{Cc}]|\\\q{char-escaped}
{char-any} = \[\^\]
{char-class} = \q{fail}|\q{char-any}|\q{one-of}|\q{not-one-of}|\q{category-test}
{char-control} = NUL|SOH|STX|ETX|EOT|ENQ|ACK|BEL|BS|HT|LF|VT|FF|CR|SO|SI|DLE|DC1|DC2|DC3|DC4|NAK|SYN|ETB|CAN|EM|SUB|ESC|FS|GS|RS|US|DEL|PAD|HOP|BPH|NBH|IND|NEL|SSA|ESA|HTS|HTJ|VTS|PLD|PLU|RI|SS2|SS3|DCS|PU1|PU2|STS|CCH|MW|SPA|EPA|SOS|SGCI|SCI|CSI|ST|OSC|PM|APC
{char-escaped} = [\(\)\*\+\?\[\\\]\^\{\|\}]|\q{char-control}
{expression} = \q{atom}\?|\q{atom}\*|\q{atom}\+|\q{atom}
{fail} = \[\]
{not-one-of} = (\[\^)\q{char}+(\q{category-test}?\])
{one-of} = \[\q{char}+\]
{regbnf} = (\{start\} = )\q{regex}(\LF\q{rule})*
{regex} = \q{alternate}
{rule} = \{\q{char}*(\} = )\q{regex}
{sequence} = \q{char}*|\q{expression}*

Context-sensitive grammar

type CtxGrammar token a = forall p. (Lexical token p, forall x. BackusNaurForm (p x x), Alternator p, Filtrator p, Monadic p) => p a a Source #

In addition to context-sensitivity via Monadic combinators, CtxGrammars add general filtration via Filtrator to Grammars.

>>> :{
palindromeG :: CtxGrammar Char String
palindromeG = rule "palindrome" $
  satisfied (\wrd -> reverse wrd == wrd) >?< manyP (anyToken @Char)
:}

The satisfied pattern is used together with the Choice & Cochoice applicator >?< for general filtration. For context-sensitivity, the Monadic interface is used by importing Data.Profunctor.Monadic qualified and using a "bonding" notation which mixes "idiom" style with qualified do-notation. Let's use length-encoded vectors of numbers as an example.

>>> import Numeric.Natural (Natural)
>>> import Control.Lens.Iso (Iso', iso)
>>> :set -XRecordWildCards
>>> :{
data LenVec = LenVec {length :: Natural, vector :: [Natural]}
  deriving (Eq, Ord, Show, Read)
_LenVec :: Iso' LenVec (Natural, [Natural])
_LenVec = iso (\LenVec {..} -> (length, vector)) (\(length, vector) -> LenVec {..})
:}
>>> :set -XQualifiedDo
>>> import qualified Data.Profunctor.Monadic as P
>>> :{
lenvecGrammar :: CtxGrammar Char LenVec
lenvecGrammar = _LenVec >? P.do
  let
    numberG = iso show read >~ someP (asIn @Char DecimalNumber)
    vectorG n = intercalateP n (sepBy (terminal ",")) numberG
  len <- numberG             -- bonds to _LenVec
  terminal ";"               -- doesn't bond
  vectorG (fromIntegral len) -- bonds to _LenVec
:}

The qualified do-notation changes the signature of P.>>=, so that we must apply the constructor pattern _LenVec to the do-block with the >? applicator. Any bound named variable, var <- action, gets "bonded" to the constructor pattern. Any unbound action, except for the last action in the do-block, does not get bonded to the pattern; the last action does get bonded. Any unnamed bound action, _ <- action, also gets bonded to the pattern, but being unnamed means it isn't added to the context. If all bound actions are unnamed, then a CtxGrammar can be rewritten as a Grammar, since it is context-free. We can't generate a RegBnf, since the rules of a CtxGrammar aren't static but dynamic and contextual. We can generate parsers and printers as expected.

>>> [vec | (vec, "") <- parseG lenvecGrammar "3;1,2,3"] :: [LenVec]
[LenVec {length = 3, vector = [1,2,3]}]
>>> [vec | (vec, "") <- parseG lenvecGrammar "0;1,2,3"] :: [LenVec]
[]
>>> [pr "" | pr <- printG lenvecGrammar (LenVec 2 [6,7])] :: [String]
["2;6,7"]
>>> [pr "" | pr <- printG lenvecGrammar (LenVec 200 [100])] :: [String]
[]
>>> [pal | word <- ["racecar", "word"], (pal, "") <- parseG palindromeG word]
["racecar"]

printG Source #

Arguments

:: Cons string string token token 
=> (IsList string, Item string ~ token, Categorized token) 
=> (Alternative m, Monad m, Filterable m) 
=> CtxGrammar token a 
-> a

syntax

-> m (string -> string) 

printG generates a printer from a CtxGrammar. Since both RegGrammars and context-free Grammars are CtxGrammars, the type system will allow printG to be applied to them. Running the printer on a syntax value returns a function that conses tokens at the beginning of an input string, from right to left.
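For instance, a sketch with the one-terminal grammar terminal "hi", mirroring the SemVer example above:

>>> ($ "!") <$> printG (terminal "hi") () :: Maybe String
Just "hi!"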

parseG Source #

Arguments

:: (Cons string string token token, Snoc string string token token) 
=> (IsList string, Item string ~ token, Categorized token) 
=> (Alternative m, Monad m, Filterable m) 
=> CtxGrammar token a 
-> string

input

-> m (a, string) 

parseG generates a parser from a CtxGrammar. Since both RegGrammars and context-free Grammars are CtxGrammars, the type system will allow parseG to be applied to them. Running the parser on an input string unconses tokens from the beginning of the string, from left to right, returning a syntax value and the remaining string.
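For instance, a sketch with the same one-terminal grammar:

>>> parseG (terminal "hi") "hi there" :: Maybe ((), String)
Just (()," there")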

unparseG Source #

Arguments

:: (Cons string string token token, Snoc string string token token) 
=> (IsList string, Item string ~ token, Categorized token) 
=> (Alternative m, Monad m, Filterable m) 
=> CtxGrammar token a 
-> a

syntax

-> string

input

-> m string 

unparseG generates an unparser from a CtxGrammar. Since both RegGrammars and context-free Grammars are CtxGrammars, the type system will allow unparseG to be applied to them. Running the unparser on a syntax value and an input string snocs tokens at the end of the string, from left to right, returning the output string.
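For instance, a sketch continuing the one-terminal grammar:

>>> unparseG (terminal "hi") () "log: " :: Maybe String
Just "log: hi"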

Utility

putStringLn :: (IsList string, Item string ~ Char) => string -> IO () Source #

putStringLn is a utility that generalizes putStrLn to string-like interfaces such as RegString and RegBnf.
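For example, at the plain String type it behaves just like putStrLn:

>>> putStringLn "hello"
hello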