Text Metrics
The library provides efficient implementations of various strings metrics.
It works with strict Text
values and returns either Natural
numbers
(because the metrics cannot be negative), or Ratio Natural
values because
returned values are rational non-negative numbers by definition.
The current version of the package implements:
TODO list:
Comparison with the edit-distance
package
There
is edit-distance
package whose scope overlaps with the scope of this package. The differences
are:
-
edit-distance
allows to specify costs for every operation when
calculating Levenshtein distance (insertion, deletion, substitution, and
transposition). This is rarely needed though in real-world applications,
IMO.
-
edit-distance
only provides single Levenshtein distance, text-metrics
aims to provide implementations of most string metrics algorithms.
-
edit-distance
works on Strings
, while text-metrics
works on strict
Text
values.
-
As README.md
of edit-distance
states, “[the algorithms] have been
fairly heavily optimized”, which is apparently true, yet the
text-metrics
is faster for short strings (under 64 characters) and even
faster for longer strings (scales better). How much faster? For short
strings more than ×2.5, and about ×4 for longer strings.
Implementation
All “meat” of the algorithms is written is C in a rather straightforward
way. Levenshtein variants are based on the “iterative algorithm with two
matrix rows” from Wikipedia with additional improvement that we do not copy
current row of distances into previous row, but just swap the pointers
(which is OK, since the arrays have equal length and current row will be
overwritten in the next iteration anyway).
Normalized versions are defined as thin (inlined) Haskell wrappers.
License
Copyright © 2016 Mark Karpov
Distributed under BSD 3 clause license.