BSD-3-Clause licensed by The Alfred-Margaret authors
This version can be pinned in stack with:alfred-margaret-2.1.1.0@sha256:50ca0049482e6ba8d1b13501593dfe89378e4aa0953a7a2c98ab520dd6f266b5,5048

Alfred–Margaret

Alfred–Margaret is a fast implementation of the Aho–Corasick string searching algorithm in Haskell. It powers many string-related operations in Channable.

The library is designed to work with the text package. It matches directly on the internal UTF-16 representation of Text for efficiency. See the announcement blog post for a deeper dive into Aho–Corasick, and the optimizations that make this library fast.

Alfred–Margaret is named after Alfred Aho and Margaret Corasick.

Performance

Running time to count all matches, in a real-world data set, comparing a Java implementation and a Rust implementation against Alfred–Margaret, and against memcopy to establish a lower bound:

For the full details of this benchmark, see our announcement blog post, which includes more details about the data set, the benchmark setup, and a few things to keep in mind when interpreting this graph.

LLVM

If you are using LLVM instead of the GHC backend, make sure to compare different versions. Using LLVM 9 instead of LLVM 12 with GHC 8.10.7 made a significant dent in benchmark times. For more information, see this issue.

Example

Check if a string contains one of the needles:


import qualified Data.Text.AhoCorasick.Automaton as Aho
import qualified Data.Text.AhoCorasick.Searcher as Searcher

searcher = Searcher.build Aho.CaseSensitive ["tshirt", "shirts", "shorts"]

Searcher.containsAny searcher "short tshirts"
> True

Searcher.containsAny searcher "long shirt"
> False

Searcher.containsAny searcher "Short TSHIRTS"
> False

searcher' = Searcher.build Aho.IgnoreCase ["tshirt", "shirts", "shorts"]

Searcher.containsAny searcher' "Short TSHIRTS"
> True

Sequentially replace many needles:

import Data.Text.AhoCorasick.Automaton (CaseSensitivity (..))
import qualified Data.Text.AhoCorasick.Replacer as Replacer

replacer = Replacer.build CaseSensitive [("tshirt", "banana"), ("shirt", "pear")]

Replacer.run replacer "tshirts for sale"
> "bananas for sale"

Replacer.run replacer "tshirts and shirts for sale"
> "bananas and pears for sale"

Replacer.run replacer "sweatshirts and shirtshirts"
> "sweabananas and shirbananas"

Replacer.run replacer "sweatshirts and shirttshirts"
> "sweabananas and pearbananas"

Get all matches, possibly overlapping:

import qualified Data.Text.AhoCorasick.Automaton as Aho

pairNeedleWithSelf text = (Aho.unpackUtf16 text, text)
automaton = Aho.build $ fmap pairNeedleWithSelf ["tshirt", "shirts", "shorts"]
allMatches = Aho.runText [] (\matches match -> Aho.Step (match : matches))

allMatches automaton "short tshirts"
> [ Match {matchPos = CodeUnitIndex 13, matchValue = "shirts"}
> , Match {matchPos = CodeUnitIndex 12, matchValue = "tshirt"}
> ]

allMatches automaton "sweatshirts and shirtshirts"
> [ Match {matchPos = CodeUnitIndex 27, matchValue = "shirts"}
> , Match {matchPos = CodeUnitIndex 26, matchValue = "tshirt"}
> , Match {matchPos = CodeUnitIndex 22, matchValue = "shirts"}
> , Match {matchPos = CodeUnitIndex 11, matchValue = "shirts"}
> , Match {matchPos = CodeUnitIndex 10, matchValue = "tshirt"}
> ]

License

Alfred–Margaret is licensed under the 3-clause BSD license.

Changes

Changelog

v2.1.1.0 - “Generically Primitive” (2026-04-24)

Tested on GHC 9.14.1.

  • Derive Generic for the exposed data types in Data.Text.AhoCorasick.*, Data.Text.BoyerMoore.*, Data.Text.BoyerMooreCI.* and Data.Text.CaseSensitivity (#65).
  • Replace the internal Data.TypedByteArray wrapper with PrimArray from Data.Primitive; introduces the new internal module Data.Primitive.Extended.
  • Widen containers upper bound to include 0.8.
  • Toolchain: switch development environment and CI to GHC 9.14 and LLVM 19; add direnv support; bump pinned nixpkgs and Nix to 2.24.12.
  • Update tested-with to GHC 9.14.1.

v2.1.0.2 - “All the bounds!” (2024-09-06)

Publishing requires cabal to know the version bounds as well.

v2.1.0.1 - “Higher bounds!” (2024-09-05)

Tested on GHC 9.6.6.

  • Revise dependency bounds (#62 thanks @Bodigrim)
  • Allow using primitive < 0.9 and vector < 0.14 (#59 thanks @rampion)

v2.1.0.0 - “All The Cases!” (2022-08-31)

  • Added a case-insensitive variant of the Boyer-Moore algorithm in the Data.Text.BoyerMooreCI.* modules. (#47)
  • Fixed a bug in the case-insensitive Aho-Corasick replacer where it would replace the wrong section of the haystack when the needle had a different byte-length than the matching part of the haystack. (#47)
  • Allow mapping the payloads of Aho-Corasick automatons. (#46)

v2.0.0.0 - “So Long Surrogates” (2022-05-02)

Switched to text-2.0 which uses UTF-8 encoding internally.

  • Removed Data.Text.Utf8.* modules
  • Replaced Data.Text.AhoCorasick.* and Data.Text.BoyerMoore.* (previously using UTF-16) with the UTF-8 implementation

v1.1.2.0 - “ByteArray Boogaloo” (2022-04-21)

Added UTF-8 implementations on a mock Text type (in Data.Text.Utf8).

  • Added Data.Text.Utf8* modules
  • Moved CaseSensitivity to its own Data.Text.CaseSensitivity module.
  • Added the private module Data.TypedByteArray which contains thin wrappers over ByteArray and MutableByteArray.
  • Replaced uses of Data.Vector.Unboxed.Vector by TypedByteArray.

v1.1.0.0 - “Moore Features” (2020-10-13)

The most notable addition in this release is the implementation of the Boyer-Moore string search algorithm.

Compatibility:

  • Extracted the UTF-16 manipulation functions from Data.Text.AhoCorasick.Automaton into Data.Text.Utf16
  • Changed Data.Text.AhoCorasick.Searcher.Searcher to remember the case sensitivity used for constructing the searcher
  • Removed Data.Text.AhoCorasick.Searcher.containsAnyIgnoreCase, the correct implementation is now chosen by containsAny based on the case sensitivity of the searcher

Other changes:

  • Added Data.Text.AhoCorasick.Splitter for splitting a lot of text using the same needle
  • Added Data.Text.BoyerMoore.Automaton, a UTF-16 implementation of Boyer-Moore
  • Added Data.Text.BoyerMoore.Searcher for searching for multiple needles at once using Boyer-Moore
  • Added Data.Text.BoyerMoore.Replacer for replacing text based on the Boyer-Moore search
  • Added optional FromJSON/ToJSON instances for most types (can be toggled via aeson cabal flag)

v1.0.0.0 - “Initial Release” (2019-03-19)

This is the initial open-source release.

  • Added Data.Text.AhoCorasick.Automaton, a UTF-16 implementation of the Aho-Corasick search algorithm
  • Added Data.Text.AhoCorasick.Searcher, a bulk search abstraction based on Aho-Corasick
  • Added Data.Text.AhoCorasick.Replacer, a bulk replace abstraction based on Aho-Corasick