unicode-collation
Haskell implementation of the Unicode Collation Algorithm
https://github.com/jgm/unicode-collation
Version on this page: 0.1.3.5
LTS Haskell 22.37: 0.1.3.6@rev:1
Stackage Nightly 2024-10-06: 0.1.3.6@rev:1
Latest on Hackage: 0.1.3.6@rev:1
unicode-collation
Haskell implementation of the Unicode Collation Algorithm.
Motivation
Previously there was no way to do correct Unicode collation (sorting) in Haskell without depending on the C library icu and the barely maintained Haskell wrapper text-icu. This library offers a pure Haskell solution.
Conformance
The library passes all UCA conformance tests.
Localized collations have not been tested as extensively.
Performance
As might be expected, this library is slower than text-icu, which wraps a heavily optimized C library. How much slower depends quite a bit on the input. On a sample of ten thousand random Unicode strings, we get a factor of about 3:
sort a list of 10000 random Texts (en):
5.9 ms ± 487 μs, 22 MB allocated, 899 KB copied
sort same list with text-icu (en):
2.1 ms ± 87 μs, 7.1 MB allocated, 148 KB copied
Performance is worse on a sample drawn from a smaller character set consisting predominantly of composed accented letters, which must be decomposed as part of the algorithm:
sort a list of 10000 Texts (composed latin) (en):
12 ms ± 1.1 ms, 34 MB allocated, 910 KB copied
sort same list with text-icu (en):
2.3 ms ± 56 μs, 7.0 MB allocated, 146 KB copied
Much of the impact here comes from normalization (decomposition). If we use a pre-normalized sample and disable normalization in the collator, it’s much faster:
sort same list but pre-normalized (en-u-kk-false):
5.4 ms ± 168 μs, 19 MB allocated, 909 KB copied
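For reference, here is a minimal sketch of how a collator with normalization disabled can be built from such a tag. It assumes the Text.Collate API with collatorFor and collate, and a Lang string literal via OverloadedStrings:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Text.Collate (collate, collatorFor)
import Data.List (sortBy)
import Data.Text (Text)

-- Collator built from a BCP47 tag; by default input is normalized (NFD).
enCollator = collatorFor "en"

-- The "kk-false" keyword (as in the en-u-kk-false benchmark above)
-- turns normalization off; appropriate only for input that is already NFD.
enPreNormalized = collatorFor "en-u-kk-false"

main :: IO ()
main = do
  print (sortBy (collate enCollator)      (["ab©", "abC", "Abc"] :: [Text]))
  print (sortBy (collate enPreNormalized) (["abc", "ABC", "aBc"] :: [Text]))
```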
On plain ASCII, we get a factor of 3 again:
sort a list of 10000 ASCII Texts (en):
4.6 ms ± 405 μs, 17 MB allocated, 880 KB copied
sort same list with text-icu (en):
1.6 ms ± 114 μs, 6.2 MB allocated, 130 KB copied
Note that this library does incremental normalization, so when strings can mostly be distinguished on the basis of the first two characters, as in the first sample, the impact is much less. On the other hand, performance is much slower on a sample of texts which differ only after the first 32 characters:
sort a list of 10000 random Texts that agree in first 32 chars:
116 ms ± 8.6 ms, 430 MB allocated, 710 KB copied
sort same list with text-icu (en):
3.2 ms ± 251 μs, 8.8 MB allocated, 222 KB copied
However, in the special case where the texts are identical, the algorithm can be short-circuited entirely and sorting is very fast:
sort a list of 10000 identical Texts (en):
877 μs ± 54 μs, 462 KB allocated, 9.7 KB copied
Localized collations
The following localized collations are available; a usage sketch follows the list. For languages not listed here, the root collation is used.
af
ar
as
az
be
bn
ca
cs
cu
cy
da
de-AT-u-co-phonebk
de-u-co-phonebk
dsb
ee
eo
es
es-u-co-trad
et
fa
fi
fi-u-co-phonebk
fil
fo
fr-CA
gu
ha
haw
he
hi
hr
hu
hy
ig
is
ja
kk
kl
kn
ko
kok
lkt
ln
lt
lv
mk
ml
mr
mt
nb
nn
nso
om
or
pa
pl
ro
sa
se
si
si-u-co-dict
sk
sl
sq
sr
sv
sv-u-co-reformed
ta
te
th
tn
to
tr
ug-Cyrl
uk
ur
vi
vo
wae
wo
yo
zh
zh-u-co-big5han
zh-u-co-gb2312
zh-u-co-pinyin
zh-u-co-stroke
zh-u-co-zhuyin
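For illustration, a tailored collation is selected by passing the corresponding BCP47 tag. This is a minimal sketch, again assuming the Text.Collate API with collatorFor and collate and a Lang string literal via OverloadedStrings:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Text.Collate (collate, collatorFor)
import Data.List (sortBy)
import Data.Text (Text)

sample :: [Text]
sample = ["öl", "zebra", "apa"]

main :: IO ()
main = do
  -- "sv" is in the list above: the Swedish tailoring sorts ö after z.
  print (sortBy (collate (collatorFor "sv")) sample)
  -- "en" is not listed, so it falls back to the root collation,
  -- which keeps ö together with o.
  print (sortBy (collate (collatorFor "en")) sample)
```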
Collation reordering (e.g. [reorder Latn Kana Hani]) is not supported.
Data files
Version 13.0.0 of the Unicode data is used: http://www.unicode.org/Public/UCA/13.0.0/
Locale-specific tailorings are derived from the Perl module Unicode::Collate: https://cpan.metacpan.org/authors/id/S/SA/SADAHIRO/Unicode-Collate-1.29.tar.gz
Executable
The package includes an executable component, unicode-collate, which may be used for testing and for collating in scripts. To build it, enable the executable flag. For usage instructions, see unicode-collate --help.
References
- Unicode Technical Standard #35: Unicode Locale Data Markup Language (LDML): http://www.unicode.org/reports/tr35/
- Unicode Technical Standard #10: Unicode Collation Algorithm: https://www.unicode.org/reports/tr10
- Unicode Standard Annex #15: Unicode Normalization Forms: https://unicode.org/reports/tr15/
Changelog
unicode-collation uses PVP Versioning.
0.1.3.5
- Allow text 2.1.
0.1.3.4
- Allow base 4.18.
0.1.3.3
- Allow base 4.17. Closes #12.
0.1.3.2
- Allow text 2.0.
0.1.3.1
- Allow base 4.16 (so the library can compile with ghc 9.2).
- Micro-optimization in normalize; update benchmarks.
0.1.3
- Add collateWithUnpacker (#4). This allows the library to be used with types other than Text. Alternatively we could use a typeclass such as mono-traversable, but this seems a lighter-weight solution and keeps dependencies down.
- Add Text.Collate.Normalize, exporting toNFD. By doing our own normalization, we avoid a dependency on unicode-transforms, and we gain the ability to do normalization incrementally (lazily). This is useful because in practice, the ordering of two strings is very often decided on the basis of one or two initial characters; normalizing the whole string is thus a waste of time.
- Improve benchmark suite, with more varied samples.
- Remove dependency on bytestring-lexing; use Data.Text.Read instead.
- Add internal module Text.Collate.UnicodeData, which generates Unicode data (including canonical combining class data) from data/UnicodeData.txt. Remove data/DerivedCombiningClass.txt, which is no longer needed.
- Remove dependency on filepath.
- Fix getCollationElements behaviour with discontiguous matches (Christian Despres, #5). The getCollationElements function now implements a more or less exact translation of section S2.1 of the main UCA algorithm. Since DUCET does not satisfy well-formedness condition 5, that function cannot rearrange the unblocked non-starters as it was doing previously. We now pass all conformance tests.
- Unit test: skip conformance tests that yield invalid code points, as allowed by the spec (#6). “Implementations that do not weight surrogate code points the same way as reserved code points may filter out such lines in the test cases, before testing for conformance.” Uncomment the commented-out lines in the collation tests.
- Rename internal CombiningClass module -> CanonicalCombiningClass.
- Generalize matchLongestPrefix to Foldable. Rewrite using foldM for clarity.
- Rewrite recursivelyDecompose using a fold.
0.1.2
- API change: Expose collatorOptions and CollatorOptions. Deprecate collatorLang, which is now redundant.
- API change: Export renderSortKey. This renders the sort key in a compact form, used by the CLDR collation tests. A vertical bar is used in place of 0000.
- Remove optCollation from CollatorOptions. Make the Collation a separate parameter of Collator instead. This doesn’t affect the public API but it makes more sense conceptually.
- Avoid spurious FFFFs in sort keys. We were including FFFFs at L4 of sort keys even with NonIgnorable, which is not right, though it should not affect the sort.
- Move VariableWeighting from the Collation module to the Collator module.
- Add a benchmark for texts of length 1.
- Small optimization: don’t generate sort key when strings are equal.
- Executable: add --hex and --verbose options. For testing purposes it is convenient to enter code points manually as hex numbers. --verbose causes diagnostic output to be printed to stderr, including the tailoring used, options, and normalized code points and sort keys.
0.1.1
- API change: Add collatorLang, which reports the Lang used for tailoring (which may be different from the Lang passed to collatorFor, because of fallbacks).
- Fix fallback behavior with lookupLang (#3). Previously lookupLang would let de fall back to de-u-co-phonebk.
- Add --verbose option to executable. This prints the fallback Lang used for tailoring to stderr, to help diagnose issues.
0.1
- Initial release.