ac-library-hs-1.5.3.0: Data structures and algorithms
Safe HaskellNone
LanguageGHC2021

AtCoder.Extra.AhoCorasick

Description

Aho–Corasick algorithm is a fast dictionary-matching (multi-pattern matching) algorithm.

Example

Expand
>>> import AtCoder.Extra.AhoCorasick qualified as AC
>>> import Data.Vector.Unboxed qualified as VU

Pattern strings must be given as V.Vector (VU.Vector Int):

>>> let patterns = V.fromList [VU.fromList [0, 1], VU.fromList [0, 2], VU.fromList [2, 3, 4]]
>>> let ac = AC.build patterns
>>> AC.size ac
7

The automaton could be run manually with next or nextN:

>>> AC.nextN ac {- empty node -} 0 (VU.fromList [0, 2, 3])
5

match returns a vector of (endPos, patternId):

>>> --                         [.....) pattern 0
>>> --                               [.......) pattern2
>>> AC.match ac $ VU.fromList [0, 1, 2, 3, 4]
[(2,0),(5,2)]

If you need a vector of (startPos, patternId), you must manually map the result:

>>> let f (!end, !patId) = (end - VU.length (patterns V.! patId), patId)
>>> --                                    [.....) pattern 0
>>> --                                          [.......) pattern2
>>> VU.map f . AC.match ac $ VU.fromList [0, 1, 2, 3, 4]
[(0,0),(2,2)]

Note that duplicate patterns are only counted once with match.

Since: 1.5.3.0

Synopsis

Documentation

data AhoCorasick Source #

Aho–Corasick algorithm data.

Since: 1.5.3.0

Constructors

AhoCorasick 

Fields

  • sizeAc :: !Int

    The number of nodes in the trie.

    Since: 1.5.3.0

  • trieAc :: !(Vector (HashMap Int Int))

    A trie (-like directed graph) of input words: Vertex -> (Char -> Vertex).

    Since: 1.5.3.0

  • parentAc :: !(Vector Int)

    Node data of links to parent vertex.

    Since: 1.5.3.0

  • patternAc :: !(Vector Int)

    Node data that represents completed pattern string or nothing (-1).

    Since: 1.5.3.0

  • suffixAc :: !(Vector Int)

    Node data of links to the longest suffix vertex.

    Since: 1.5.3.0

  • outputAc :: !(Vector Int)

    Node data of links to the longest suffix pattern vertex.

    Since: 1.5.3.0

build Source #

Arguments

:: HasCallStack 
=> Vector (Vector Int)

Pattern strings.

-> AhoCorasick

Aho–Corasick automaton based on a trie.

\(O(\sum_i |S_i|)\)

Constraints

  • \(|S_i| > 0\)

Since: 1.5.3.0

size :: HasCallStack => AhoCorasick -> Int Source #

\(O(1)\) Returns the number of nodes in the trie.

Since: 1.5.3.0

next Source #

Arguments

:: HasCallStack 
=> AhoCorasick

The automaton.

-> Int

Current node ID (empty node is 0).

-> Int

Character.

-> Int

Next node ID.

\(O(1)\) Retrieves the next node to visit.

Since: 1.5.3.0

nextN Source #

Arguments

:: HasCallStack 
=> AhoCorasick

The automaton.

-> Int

Current node.

-> Vector Int

String.

-> Int

Resulting node.

\(n\) Applies next N times for a given input string.

Constraints

Since: 1.5.3.0

match :: HasCallStack => AhoCorasick -> Vector Int -> Vector (Int, Int) Source #

\(O(|T|)\) Runs dictionary matching (multi-pattern matching) in linear time and returns a list of (endPos, patId), where [endPos - patLen, endPos) corresponds to the interval of original source slice.

Note that duplicate patterns are counted just once with one of them; if pattern A and B are the same, their appearence is counted as either A or B.

Since: 1.5.3.0