regex-tdfa
Pure Haskell Tagged DFA Backend for "Text.Regex" (regex-base)
https://wiki.haskell.org/Regular_expressions
| LTS Haskell 24.16: | 1.3.2.5 |
| Stackage Nightly 2025-10-23: | 1.3.2.5 |
| Latest on Hackage: | 1.3.2.5 |
regex-tdfa-1.3.2.5@sha256:b2724328a8b02d6520186aa24d6a0bc926e86206361a76535b22d31ef84b3451,6468Module documentation for 1.3.2.5
- Data
- Data.IntMap
- Data.IntSet
- Text
- Text.Regex
- Text.Regex.TDFA
- Text.Regex
regex-tdfa
This is regex-tdfa which is a pure Haskell regular expression library (for POSIX extended regular expressions) originally written by Christopher Kuklewicz.
The name “tdfa” stands for Tagged-DFA.
Getting started
Importing and using
Declare a dependency on the regex-tdfa library in your .cabal file:
build-depends: regex-tdfa ^>= 1.3.2
In Haskell modules where you need to use regexes import the respective regex-tdfa module:
import Text.Regex.TDFA
Basics
λ> emailRegex = "[a-zA-Z0-9+._-]+@[a-zA-Z-]+\\.[a-z]+"
λ> "my email is [email protected]" =~ emailRegex :: Bool
>>> True
-- non-monadic
<to-match-against> =~ <regex>
-- monadic, uses 'fail' on lack of match
<to-match-against> =~~ <regex>
(=~) and (=~~) are polymorphic in their return type. This is so that
regex-tdfa can pick the most efficient way to give you your result based on
what you need. For instance, if all you want is to check whether the regex
matched or not, there’s no need to allocate a result string. If you only want
the first match, rather than all the matches, then the matching engine can stop
after finding a single hit.
This does mean, though, that you may sometimes have to explicitly specify the type you want, especially if you’re trying things out at the REPL.
Common use cases
Get the first match
-- returns empty string if no match
a =~ b :: String -- or ByteString, or Text...
λ> "alexis-de-tocqueville" =~ "[a-z]+" :: String
>>> "alexis"
λ> "alexis-de-tocqueville" =~ "[[:digit:]]+" :: String
>>> ""
Check if it matched at all
a =~ b :: Bool
λ> "alexis-de-tocqueville" =~ "[a-z]+" :: Bool
>>> True
Get first match + text before/after
-- if no match, will just return whole
-- string in the first element of the tuple
a =~ b :: (String, String, String)
λ> "alexis-de-tocqueville" =~ "de" :: (String, String, String)
>>> ("alexis-", "de", "-tocqueville")
λ> "alexis-de-tocqueville" =~ "kant" :: (String, String, String)
>>> ("alexis-de-tocqueville", "", "")
Get first match + submatches
-- same as above, but also returns a list of /just/ submatches
-- submatch list is empty if regex doesn't match at all
a =~ b :: (String, String, String, [String])
λ> "div[attr=1234]" =~ "div\\[([a-z]+)=([^]]+)\\]"
:: (String, String, String, [String])
>>> ("", "div[attr=1234]", "", ["attr","1234"])
Get all non-overlapping matches
-- can also return Data.Array instead of List
getAllTextMatches (a =~ b) :: [String]
λ> getAllTextMatches ("john anne yifan" =~ "[a-z]+") :: [String]
>>> ["john","anne","yifan"]
λ> getAllTextMatches ("0a0b0" =~ "0[[:lower:]]0") :: [String]
>>> ["0a0"]
Note that "0b0" is not included in the result since it overlaps with "0a0".
Special characters
regex-tdfa only supports a small set of special characters and is much less
featureful than some other regex engines you might be used to, such as PCRE.
\`— Match start of entire text (similar to^in other regex engines)\'— Match end of entire text (similar to$in other regex engines)\<— Match beginning of word\>— Match end of word\b— Match beginning or end of word\B— Match neither beginning nor end of word
While shorthands like \d (for digit) are not recognized, one can use the respective
POSIX character class inside [...]. E.g., [[:digit:][:lower:]_] is short for
[0-9a-z_]. The supported character classes are listed on
Wikipedia
and defined in module
TNFA.
Please also consult a variant of this documentation which is part of the Text.Regex.TDFA haddock, and the original documentation at the Haskell wiki.
Less common stuff
Get match indices
-- can also return Data.Array instead of List
getAllMatches (a =~ b) :: [(Int, Int)] -- (index, length)
λ> getAllMatches ("john anne yifan" =~ "[a-z]+") :: [(Int, Int)]
>>> [(0,4), (5,4), (10,5)]
Get submatch indices
-- match of __entire__ regex is first element, not first capture
-- can also return Data.Array instead of List
getAllSubmatches (a =~ b) :: [(Int, Int)] -- (index, length)
λ> getAllSubmatches ("div[attr=1234]" =~ "div\\[([a-z]+)=([^]]+)\\]")
:: [(Int, Int)]
>>> [(0,14), (4,4), (9,4)]
Replacement
regex-tdfa does not provide find-and-replace.
Avoiding backslashes
If you find yourself writing a lot of regexes, take a look at raw-strings-qq. It’ll let you write regexes without needing to escape all your backslashes.
{-# LANGUAGE QuasiQuotes #-}
import Text.RawString.QQ
import Text.Regex.TDFA
λ> "2 * (3 + 1) / 4" =~ [r|\([^)]+\)|] :: String
>>> "(3 + 1)"
Known bugs and infelicities
-
Regexes with large character classes combined with
{m,n}are very slow and memory-hungry (#3).An example of such a regex is
^[\x0020-\xD7FF]{1,255}$. -
POSIX submatch semantics are broken in some rare cases (#2).
About this package
This was inspired by the algorithm (and Master’s thesis) behind the regular expression library known as TRE or libtre. This was created by Ville Laurikari and tackled the difficult issue of efficient sub-match capture for POSIX regular expressions.
By building on this thesis and adding a few more optimizations, regex-tdfa matching input text of length N should have O(N) runtime, and should have a maximum memory bounded by the pattern size that does not scale with N. It should do this while returning well defined (and correct) values for the parenthesized sub-matches.
Regardless of performance, nearly every single OS and Libra for POSIX regular expressions has bugs in sub-matches. This was detailed on the Regex POSIX Haskell wiki page, and can be demonstrated with the regex-posix-unittest suite of checks. Test regex-tdfa-unittest should show regex-tdfa passing these same checks. I owe my understanding of the correct behvior and many of these unit tests to Glenn Fowler at AT&T (“An Interpretation of the POSIX regex Standard”).
Maintenance history
The original Darcs repository was at code.haskell.org. For a while a fork was maintained by Roman Cheplyaka as regex-tdfa-rc.
Then the repository moved to https://github.com/ChrisKuklewicz/regex-tdfa, which was primarily maintained by Artyom (neongreen).
Finally, maintainership was passed on again and the repository moved to its current location at https://github.com/haskell-hvr/regex-tdfa.
Other related packages
Searching for “tdfa” on hackage finds some related packages (unmaintained as of 2022-07-14).
Document notes
This README was originally written 2016-04-30.
Changes
1.3.2.5
2025-09-29 Andreas Abel
- Bump
cabal-versionto 2.0 to support fieldautogen-modules(#70) - SPECIALIZE pragmas for faster
Textmatching (#72) - Address deprecation warning for
sizeofMutualByteArray#(#71) - Tested with GHC 8.0 - 9.12.2
1.3.2.4
2025-05-09 Andreas Abel
- Fix definition of
[[:graph:]]character class (#66) - Tested with GHC 8.0 - 9.12.2
1.3.2.3
2025-03-02 Andreas Abel
- Drop support for GHC 7
- Allow
containers < 1 - Tested with GHC 8.0 - 9.12.1
1.3.2.2
2023-08-02, Andreas Abel
- Fix return type in
memcpyFFI signature (#52) - Refactor
regexecto avoid partial functionstailand(!0) - Tested with GHC 7.4 - 9.8.1-alpha1
1.3.2.1
2023-05-19, Andreas Abel
- Fix haddock rendering of code examples in top-level documentation (#50)
- Tested with GHC 7.4 - 9.6
1.3.2
2022-07-18, Andreas Abel
- Export
decodePatternSetanddecodeCharacterClassfromText.Regex.TDFA.Pattern(#16) - Extend and correct docs for
Patternmodule - Tested with GHC 7.4 - 9.4
1.3.1.5
2022-07-18, Andreas Abel
1.3.1.4
2022-07-17, Andreas Abel
- Fix parsing of dashes in bracket expressions, e.g.
[-a-z](#1) - Fix a deprecation warning except for on GHC 8.2 (#21)
- Documentation: link
defaultComptOptto its definition (#13) - Verify documentation examples with new
doc-testtestsuite - Tested with GHC 7.4 - 9.4
1.3.1.3
2022-07-14, Andreas Abel
- Fix an
undefinedinShow PatternSet(#37) - Document POSIX character classes (e.g.
[[:digit:]]) in README - Tested with GHC 7.4 - 9.4
1.3.1.2 Revision 1
2022-05-25, Andreas Abel
- Allow
base >= 4.17(GHC 9.4)
1.3.1.2
2022-02-19, Andreas Abel
- No longer rely on the
MonadFailinstance forST(futurebaselibrary change, see #29). - Silence warning
incomplete-uni-patterns(GHC >= 9.2). - Import from
Data.Listexplicitly or qualified (warningcompat-unqualified-imports). - Import from
Control.Monadto allowmtl-2.3in itsrc3incarnation.
1.3.1.1 Revision 3
2022-01-31, Andreas Abel
- Speculatively allow unreleased
mtl-2.3(works with release candidatemtl-2.3-rc4).
1.3.1.1 Revision 2
2021-12-26, Andreas Abel
- Allow
text-2.0.
1.3.1.1 Revision 1
2021-08-12, Andreas Abel
- Compatibility with
base-4.16(GHC 9.2).
1.3.1.1
2021-06-03, Andreas Abel
- Removed extension
NoMonoPatBindsfrom.cabal-file for GHC 9.2 compatibility. - Removed some outdated documentation.
1.3.1.0 Revision 2
2021-02-20, Andreas Abel
- Compatibility with
base-4.15(GHC 9.0) andbytestring-0.11.
1.3.1.0 Revision 1
2020-03-26, phadej
- Compatibility with
base-4.14(GHC 8.10).
1.3.1.0
2019-11-26, hvr
- Merge http://hackage.haskell.org/package/regex-tdfa-text into
regex-tdfa; see https://github.com/haskell-hvr/regex-tdfa/issues/4. - Don’t inject
ghc-options: -O2by default anymore (see #7 for rationale) and introduceforce-O2cabal flag to control the injection ofghc-options: -O2. Note that you can conveniently control optimization levels on a per-package granularity viacabal.projectfiles; see cabal’s user-guide for more details.
1.3.0 Revision 1
2019-11-26, hvr
- Tighten bounds on
base,mtl,parsecand fail.
1.3.0
2019-09-29, Artyom
- Same as 1.2.3.3 release, but in accordance to PVP; see https://github.com/ChrisKuklewicz/regex-tdfa/issues/29.
- Compatibility with GHC 8.8 and regex-base-0.9.4 (h/t @asr).
- Turned
regex-tdfa-unittestinto aregex-tdfatestsuite.
1.2.3.3 (deprecated, not following PVP)
- Compatibility with GHC 8.8 and regex-base-0.9.4 (h/t @asr).
- Turned
regex-tdfa-unittestinto aregex-tdfatestsuite.
1.2.3.2
2019-05-09, Artyom
- Significantly improved documentation (h/t William Yao).
1.2.3.1
2018-06-22, Artyom
- Compatibility with
containers-0.6.
1.2.3
2018-03-10, Artyom
- Added
Semigroupinstances for some types (h/t Herbert Valerio Riedel).
1.2.2
2016-04-28, Artyom
- New maintainer.
- Now we don’t reexport the problematic
Showinstance for functions.
1.2.1
2015-08-29, Chris Kuklewicz
- Updated dependency versions.
1.2.0
2014-02-02, Chris Kuklewicz
- “Almost ghc-7.8” with the array 0.4 changes for
Data.Array.Unsafe
1.1.8
Make ghc-7.0.2 on platform 2011.2.0.0.0 happy
1.1.7
fix url below
1.1.6
Fix bug preventing []] [-] [^]] [^-] (thanks to Maxime Henrion)
1.1.5
try needUniqTags in POr in CorePattern.hs, try toAdvice b for PStar child
1.1.4
fixed
1.1.3
BROKEN after 100 characters the compressOrbit dies!
1.1.2
worked
1.1.1
add gnu escapes
1.1.0
NewDFA code working
1.0.7
make NewDFA directory and String_NC
1.0.6
try NewDFATest_SBS with uncons
1.0.5
use uncons on SBS
1.0.4
try repaired NewDFATest_SBS
- np13: try to improve readability with the
mmcombinator? Yes! - np12: expand
oin the case wheretlookup getNothing? Yes – this is the fix!? - np11: break multi to not look at
oand just returnTrue? Yes !!!! - np10: Peel off
CharMap/IntMapand DFA/DT with pattern matching? No - np9:
INLINEendOf? No - np8: np6 and
NOINLINEendOff? No - np7: just return
True? Fast - np6: comment out ans check? No
- np5: comment out all
Multi0code? No - np4: comment out all
Single0andSinglecode? No - np3:
!offthe multi? No - np2: comment out all Testing code? No
1.0.3
try to alter matchTest to not have the Bool args? No
1.0.2
arg, the prof is fast and the normal slow!
1.0.1
add NewDFATest.hs
0.99.20
go to many vs single?
0.99.19
try for pre-comparison of orbit-logs!
0.99.18
try alternate lazy/strict strategy in NewDFA. Fix offset laziness.
0.99.17
radical removal of flag array and adding of SetVal to handle groups
0.99.16
performance? up to v15
0.99.15
get string with NewDFA testing, unit tests and 1000 random regex pass
0.99.14
start changing to the new real DFA
0.99.13
more cleanup
0.99.12
try to debug 0.99.11: fixed updateWinner
0.99.11
improve above fix and make stuff work better – HAS BUG, along with old TDFA!
0.99.10
fixed ((.?)*)* patterns by changing PStar nullView when mayFirstBeNull
0.99.9
testing changing bestTrans/chooseWith/choose to include enterOrbit/newFlags/(_,True) info
0.99.8
testing changing Maximize to Minimize for Tags, decide (a*)* is canonical problem
0.99.7
Use (PGroup Nothing) in Pattern to decompose PBound
0.99.6
change to nested nonEmpty calls for PBound
0.99.5
remove PNonEmpty constructor
0.99.4
tests pnonempty' = \ p -> POr [ PEmpty, p ] instead of PNonEmpty