symbolize
Efficient global Symbol table, with Garbage Collection.
https://github.com/Qqwy/haskell-symbolize#readme
Stackage Nightly 2025-04-30: | 1.0.3.1 |
Latest on Hackage: | 1.0.3.1 |
symbolize-1.0.3.1@sha256:c7a414d58ab07d29be75c54556d2b4594258f69a94d4d7e66b5324efdd51323b,6348
Module documentation for 1.0.3.1
Symbolize
Haskell library implementing a global Symbol Table, with garbage collection.
Symbols, also known as Atoms or Interned Strings, are a common technique to reduce memory usage and improve performance when using many small strings:
A Symbol represents a string (any Textual
, so String, Text, ShortText, ByteString, ShortByteString, etc.)
Just like ShortText
, ShortByteString
and ByteArray
, a Symbol
has an optimized memory representation,
directly wrapping a primitive ByteArray#
.
Furthermore, a global symbol table keeps track of which values currently exist, ensuring we always deduplicate symbols. This therefore allows us to:
- Check for equality between symbols in constant-time (using pointer equality)
- Calculate the hash in constant-time (using
StableName
) - Keep the memory footprint of repeatedly-seen strings low.
This is very useful if you’re frequently comparing strings
and the same strings might come up many times.
It also makes Symbol a great candidate for a key in e.g. a HashMap
or HashSet
.
The global symbol table is implemented using weak pointers, which means that unused symbols will be garbage collected. As such, you do not need to be concerned about memory leaks (as is the case with many other symbol table implementations).
Symbols are considered ‘the same’ regardless of whether they originate
from a String
, (lazy or strict, normal or short) Data.Text
, (lazy or strict, normal or short) Data.ByteString
etc.
The main advantages of Symbolize over other symbol table implementations are:
- Garbage collection: Symbols which are no longer used are automatically cleaned up.
- Support for any
Textual
type, includingString
, (strict and lazy)Data.Text
, (strict and lazy)Data.ByteString
,ShortText
,ShortByteString
, etc. - Great memory usage:
Symbol
s are simply a (lifted) wrapper around aByteArray#
, which is nicely unpacked by GHC.- The symbol table is an
IntMap
that contains weak pointers to these sameByteArray#
s and their associatedStableName#
s
- Great performance:
unintern
is a simple pointer-dereference- calls to
lookup
are free of atomic memory barriers (and never have to wait on a concurrent thread runningintern
)
- Thread-safe
Basic usage
This module is intended to be imported qualified, e.g.
import Symbolize (Symbol)
import qualified Symbolize
To intern a string, use intern
:
>>> hello = Symbolize.intern "hello"
>>> world = Symbolize.intern "world"
>>> (hello, world)
(Symbolize.intern "hello",Symbolize.intern "world")
Interning supports any Textual
type, so you can also use Data.Text
or Data.ByteString
etc.:
>>> import Data.Text (Text)
>>> niceCheeses = fmap Symbolize.intern (["Roquefort", "Camembert", "Brie"] :: [Text])
>>> niceCheeses
[Symbolize.intern "Roquefort",Symbolize.intern "Camembert",Symbolize.intern "Brie"]
And if you are using OverloadedStrings, you can use the IsString
instance to intern constants:
>>> hello2 = ("hello" :: Symbol)
>>> hello2
Symbolize.intern "hello"
Comparisons between symbols run in O(1) time:
>>> hello == hello2
True
>>> hello == world
False
To get back the textual value of a symbol, use unintern
:
>>> Symbolize.unintern hello
"hello"
If you only want to check whether a string is already interned, use lookup
:
>>> Symbolize.lookup "hello"
Just (Symbolize.intern "hello")
Symbols make great keys for Data.HashMap
and Data.HashSet
.
Hashing them is a no-op and they are guaranteed to be unique:
>>> import qualified Data.Hashable as Hashable
>>> Hashable.hash hello
0
>>> fmap Hashable.hash niceCheeses
[2,3,4]
For introspection, you can look at how many symbols currently exist:
>>> Symbolize.globalSymbolTableSize
5
>>> [unintern (intern (show x)) | x <- [1..5]]
["1","2","3","4","5"]
>>> Symbolize.globalSymbolTableSize
10
Unused symbols will be garbage-collected, so you don’t have to worry about memory leaks:
>>> System.Mem.performGC
>>> Symbolize.globalSymbolTableSize
5
For deeper introspection, you can look at the Show instance of the global symbol table: /(Note that the exact format is subject to change.)/
>>> Symbolize.globalSymbolTable
GlobalSymbolTable { size = 5, symbols = ["Brie","Camembert","Roquefort","hello","world"] }
Changes
Changelog for symbolize
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to the Haskell Package Versioning Policy.
Unreleased
1.0.3.1
- Fix typo in documentation of
globalSymbolTable
1.0.3.0
- Swap internals of the global symbol table.
- Before:
IORef (Data.IntSet (NonEmpty WeakSymbol))
- After:
MVar (HashTable.Dictionary (HashTable.PrimState IO) U.MVector Int V.MVector NonEmptyWeakSymbol)
- Reading/writing to this mutable hashtable from the
vector-hashtables
package scales almost linearly - The
NonEmpty WeakSymbol
is replaced by a monomorphised version of the same, reducing memory overhead by one more word per symbol.
- Reading/writing to this mutable hashtable from the
- Before:
1.0.2.4
- Fix too eager creation of weak pointers, resulting in many needless allocations. This greatly reduces memory pressure and improves speed when inserting many symbols in a short amount of time!
- Only calculate the SipHash hash of an incoming text once when inserting. (c.f. #8)
1.0.2.3
- Swaps the internal usage of lists in the global symbol table for
NonEmpty
lists, since they will never be empty. (PR #5). Thank you, @Bodigrim!
1.0.2.2
- Fixing some typos in the documentation (PR #3). Thank you, @Bodigrim!
1.0.2.1
- Widen dependency bounds, making the package work with the latest version of
containers
andhashable
.
1.0.2.0
- Adds
Textual.toShortTextUnsafe
and relatedinternUnsafe
andinternUnsafe#
; these skip UTF-8 validity checks. Very useful when working with trusted serialized data.- rename
intern##
tointernUnsafe##
(the old name is marked as deprecated.)
- rename
- Cleans up
mkWeakSymbol
to be slightly less sketchy in the presence ofaccursedUnutterablePerformIO
; thanks @fatho!
1.0.1.0
- Add
Data.Data
instance - Add
Data.Binary
instance
1.0.0.4
- Minor documentation improvements
1.0.0.3
- Minor documentation improvements
1.0.0.2
- Minor documentation improvements
1.0.0.1
- Minor documentation improvements
1.0.0.0 - 2025-02-17
Completely overhauled implementation:
- The old implementation used
newtype Symbol = Symbol Word
, adding Weak-pointers to thisWord
.- This was brittle, since those
Word
s would often get inlined, potentially triggering weak-pointer finalization (too) early. - The symbol table had to keep track of mappings in both directions.
- Int also meant that
unintern
required access to the symbol table.
- This was brittle, since those
- The new implementation uses
newtype Symbol# = Symbol# ByteArray#
, and adds weak pointers to this unliftedByteArray#
.- Much safer, this is how
mkWeak
is intended to be used. - The symbol table now only needs to store ‘TextHash -> Weak Symbol’, and is essentially just a ‘weak hashset’. Less than half the memory usage!
- Much faster
unintern
, as it no longer needs access to the symbol table but is a simple pointer dereference.
- Much safer, this is how
0.1.0.1 / 0.1.0.2 - 2023-11-24
Fixes in the README and package description only, for better rendering on Hackage.
0.1.0.0 - 2023-11-24
Initial version