A high-performance HTML tokenizer http://github.com/bgamari/html-parse
Latest on Hackage: 0.2.0.1
This package provides a fast and reasonably robust HTML5 tokenizer built
upon the attoparsec library. The parsing strategy is based upon the HTML5
parsing specification, with few deviations.
The package targets similar use-cases to the venerable tagsoup library,
but is significantly more efficient, achieving parsing speeds of over 50
megabytes per second on modern hardware and typical web documents.
>>> parseTokens "<div><h1 class=widget>Hello World</h1><br/>"
[TagOpen "div" [],TagOpen "h1" [Attr "class" "widget"],
 ContentText "Hello World",TagClose "h1",TagSelfClose "br" []]
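
Since the tokenizer returns a flat list of tokens, typical consumers fold or filter that list. The following is a minimal sketch of that pattern, not code from the package itself. It assumes the tokenizer is exported from a module named Text.HTML.Parser and that tag names, attribute names, and attribute values are strict Text; neither detail is stated on this page. The helpers textContent and classNames are hypothetical names introduced only for illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import qualified Data.Text as T
import qualified Data.Text.IO as TIO
-- Assumption: the module name is not given on this page.
import Text.HTML.Parser (Attr (..), Token (..), parseTokens)

-- | Hypothetical helper: concatenate the text carried by ContentText tokens.
-- Every other token kind is ignored in this sketch.
textContent :: [Token] -> T.Text
textContent toks = T.concat [t | ContentText t <- toks]

-- | Hypothetical helper: collect the values of "class" attributes found on
-- opening tags. Self-closing tags are not inspected here.
classNames :: [Token] -> [T.Text]
classNames toks = [v | TagOpen _ attrs <- toks, Attr "class" v <- attrs]

main :: IO ()
main = do
  let toks = parseTokens "<div><h1 class=widget>Hello World</h1><br/>"
  TIO.putStrLn (textContent toks)
  print (classNames toks)
```

If the token stream matches the example output above, this should print Hello World followed by ["widget"]. Pattern matching in list comprehensions keeps the helpers short and silently skips tokens that do not match, which suits a tokenizer that makes no attempt to build a tree.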