fuzzyset-haskell

Table of Contents
About
A fuzzy string set data structure for approximate string matching. This implementation is based on the Python and JavaScript libraries with the same name:
Install
cabal install fuzzyset
For details, see Hackage docs. This library is also available on Stackage. To install using Stack:
stack install fuzzyset
How to use
Make sure the OverloadedStrings
pragma is enabled. Then there are just three steps:
- Create a set using one of
defaultSet
, mkSet
, or fromList
.
- To add entries, use
add
, addToSet
, or addMany
.
- Then query the set with
get
, getOne
, or getWithMinScore
.
>>> defaultSet `add` "Jurassic Park" `add` "Terminator" `add` "The Matrix" `getOne` "percolator"
Just "Terminator"
>>> defaultSet `add` "Shaggy Rogers" `add` "Fred Jones" `add` "Daphne Blake" `add` "Velma Dinkley" `get` "Shaggy Jones"
[(0.7692307692307693,"Shaggy Rogers"),(0.5,"Fred Jones")]
See here for more examples.
API
Initializing
mkSet :: Size -> Size -> Bool -> FuzzySet
Size
The lower bound of gram sizes to use (inclusive)
Size
The upper bound of gram sizes to use (inclusive)
Bool
Whether to use Levenshtein distance to determine the score
FuzzySet
An empty fuzzy string set
Initialize a FuzzySet.
defaultSet :: FuzzySet
A FuzzySet with the following field values:
{ gramSizeLower = 2
, gramSizeUpper = 3
, useLevenshtein = True
, exactSet = ε
, matchDict = ε
, items = ε }
fromList :: [Text] -> FuzzySet
Create a fuzzy string set with entries from the given list.
fromList = addMany defaultSet
Adding
add :: FuzzySet -> Text -> FuzzySet
FuzzySet
Fuzzy string set to add the entry to
Text
The new entry
FuzzySet
The updated set
Add an entry to the set, or do nothing if a key identical to the provided value already exists in the set.
addToSet :: FuzzySet -> Text -> (FuzzySet, Bool)
FuzzySet
Fuzzy string set to add the entry to
Text
The new entry
(FuzzySet, Bool)
The updated set and a boolean, which will be True
if, and only if, the value was not already in the set.
Add an entry to the set and return a pair with the new set, and a boolean to indicate if a new entry was inserted, or not.
addMany :: FuzzySet -> [Text] -> FuzzySet
FuzzySet
Fuzzy string set to add the entries to
[Text]
A list of new entries
FuzzySet
A new fuzzy string set
Add a list of entries to the set, in one go.
addMany = foldr (flip add)
Retrieving
get :: FuzzySet -> Text -> [(Double, Text)]
FuzzySet
The fuzzy string set to compare the string against
Text
The lookup query
[(Double, Text)]
A list of results (score and matched value pairs)
Try to match the given string against the entries in the set, using a minimum score of 0.33. Return a list of results ordered by similarity score, with the closest match first.
getWithMinScore :: Double -> FuzzySet -> Text -> [(Double, Text)]
Double
A minimum score
FuzzySet
The fuzzy string set to compare the string against
Text
The lookup query
[(Double, Text)]
A list of results (score and matched value pairs)
Try to match the given string against the entries in the set, and return a list of all results with a score greater than or equal to the specified minimum score (i.e., the first argument). The results are ordered by similarity score, with the closest match first.
getOne :: FuzzySet -> Text -> Maybe Text
FuzzySet
The fuzzy string set to compare the string against
Text
The lookup query
Maybe Text
Return Just
the result, if a match was found, otherwise Nothing
Try to match the given string against the entries in the set, and return the closest match, if one is found.
Inspecting
There are also a few functions to inspect the set.
size :: FuzzySet -> Int
Return the number of entries in the set.
>>> size (defaultSet `add` "map" `add` "cap")
2
isEmpty :: FuzzySet -> Bool
Return a boolean to indicate whether the set is empty.
>>> isEmpty (fromList [])
True
values :: FuzzySet -> [Text]
Return the elements of the set.
>>> values (fromList ["bass", "craze", "space", "lace", "daze", "haze", "ace", "maze"])
["space","daze","bass","maze","ace","craze","lace","haze"]
Examples
{-# LANGUAGE OverloadedStrings #-}
module Main where
import Data.FuzzySet
states = [ "Alabama" , "Alaska" , "American Samoa" , "Arizona" , "Arkansas"
, "California" , "Colorado" , "Connecticut" , "Delaware" , "District of Columbia"
, "Florida" , "Georgia" , "Guam" , "Hawaii" , "Idaho"
, "Illinois" , "Indiana" , "Iowa" , "Kansas" , "Kentucky"
, "Louisiana" , "Maine" , "Maryland" , "Massachusetts" , "Michigan"
, "Minnesota" , "Mississippi" , "Missouri" , "Montana" , "Nebraska"
, "Nevada" , "New Hampshire" , "New Jersey" , "New Mexico" , "New York"
, "North Carolina" , "North Dakota" , "Northern Marianas Islands" , "Ohio" , "Oklahoma"
, "Oregon" , "Pennsylvania" , "Puerto Rico" , "Rhode Island" , "South Carolina"
, "South Dakota" , "Tennessee" , "Texas" , "Utah" , "Vermont"
, "Virginia" , "Virgin Islands" , "Washington" , "West Virginia" , "Wisconsin"
, "Wyoming" ]
statesSet = fromList states
main = mapM_ print (get statesSet "Burger Islands")
The output of this program is:
(0.7142857142857143,"Virgin Islands")
(0.5714285714285714,"Rhode Island")
(0.44,"Northern Marianas Islands")
(0.35714285714285715,"Maryland")
Using the same definition of statesSet
from previous example:
>>> get statesSet "Why-oh-me-ing"
[(0.5384615384615384,"Wyoming")]
>>> get statesSet "Connect a cat"
[(0.6923076923076923,"Connecticut")]
>>> get statesSet "Transylvania"
[(0.75,"Pennsylvania"),(0.3333333333333333,"California"),(0.3333333333333333,"Arkansas"),(0.3333333333333333,"Kansas")]
>>> get statesSet "CanOfSauce"
[(0.4,"Kansas")]
>>> get statesSet "Alaska"
[(1.0,"Alaska")]
>>> get statesSet "Alaskanbraskansas"
[(0.47058823529411764,"Arkansas"),(0.35294117647058826,"Kansas"),(0.35294117647058826,"Alaska"),(0.35294117647058826,"Alabama"),(0.35294117647058826,"Nebraska")]