A high-performance HTML tokenizer


BSD3 licensed by Ben Gamari

This package provides a fast and reasonably robust HTML5 tokenizer built upon the attoparsec library. The parsing strategy is based upon the HTML5 parsing specification with few deviations.

The package targets use cases similar to those of the venerable tagsoup library, but is significantly more efficient, achieving parsing speeds of over 50 megabytes per second on modern hardware with typical web documents.

For instance,

    parseTokens "<div><h1 class=widget>Hello World</h1><br/>"
    [TagOpen "div" [],TagOpen "h1" [Attr "class" "widget"],
     ContentText "Hello World",TagClose "h1",TagSelfClose "br" []]
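
Because the result is a flat list of tokens, it can be consumed with ordinary list functions. The following is a minimal, self-contained sketch of that idea. The `Token` and `Attr` types here are local stand-ins that only mirror the constructors visible in the output above (using `String` for simplicity), and `textContent` and `attrValues` are hypothetical helper names; none of this is taken from the package's own API.

```haskell
module Main where

-- Local stand-in types mirroring the constructors shown in the example
-- output. These are assumptions for illustration only, not the package's
-- actual definitions.
data Attr = Attr String String
  deriving (Show, Eq)

data Token
  = TagOpen String [Attr]
  | TagSelfClose String [Attr]
  | TagClose String
  | ContentText String
  deriving (Show, Eq)

-- The token list from the example, written out by hand.
exampleTokens :: [Token]
exampleTokens =
  [ TagOpen "div" []
  , TagOpen "h1" [Attr "class" "widget"]
  , ContentText "Hello World"
  , TagClose "h1"
  , TagSelfClose "br" []
  ]

-- Concatenate all text content, ignoring tags (hypothetical helper).
textContent :: [Token] -> String
textContent ts = concat [t | ContentText t <- ts]

-- Collect every value of the named attribute found on opening or
-- self-closing tags (hypothetical helper).
attrValues :: String -> [Token] -> [String]
attrValues name ts =
  [ v
  | tok <- ts
  , Attr k v <- attrsOf tok
  , k == name
  ]
  where
    attrsOf (TagOpen _ as)      = as
    attrsOf (TagSelfClose _ as) = as
    attrsOf _                   = []

main :: IO ()
main = do
  putStrLn (textContent exampleTokens)      -- prints: Hello World
  print (attrValues "class" exampleTokens)  -- prints: ["widget"]
```

With the real package, the hand-written `exampleTokens` would instead come from `parseTokens`, and the same list-comprehension style should carry over once the stand-in types are swapped for the library's own.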