markup-parse
A markup parser.
https://github.com/tonyday567/markup-parse#readme
Version on this page: 0.1.0.1
LTS Haskell 23.21: 0.1.1.1
Stackage Nightly 2025-05-10: 0.1.1.1
Latest on Hackage: 0.1.1.1
BSD-3-Clause licensed by Tony Day
Maintained by [email protected]
This version can be pinned in stack with:
markup-parse-0.1.0.1@sha256:e0b649ea322f4aadfdf9677f0fc1cfa8752ea7bfb7fda0ce0814c5ac52131ea9,4284
* markup-parse
[[https://hackage.haskell.org/package/markup-parse][https://img.shields.io/hackage/v/markup-parse.svg]]
[[https://github.com/tonyday567/markup-parse/actions?query=workflow%3Ahaskell-ci][https://github.com/tonyday567/markup-parse/workflows/haskell-ci/badge.svg]]
~markup-parse~ parses and prints a subset of common XML & HTML data.
* Development
#+begin_src haskell :results output
:r
:set prompt "> "
:set -Wno-type-defaults
:set -Wno-name-shadowing
:set -XOverloadedStrings
:set -XTemplateHaskell
:set -XQuasiQuotes
import Control.Monad
import MarkupParse
import MarkupParse.FlatParse
import Data.Map.Strict qualified as Map
import MarkupParse.Patch
import Data.ByteString qualified as B
import Data.ByteString.Char8 qualified as C
import Data.Function
import FlatParse.Basic hiding (take)
import Data.String.Interpolate
import Data.TreeDiff
-- import Perf
bs <- B.readFile "other/line.svg"
C.length bs
#+end_src
#+RESULTS:
: Ok, three modules loaded.
: >
: >
: 7554
* Main Pipeline
#+begin_src haskell :results output
:t tokenize Html
:t gather Html
:t Markup Html
:t normalize
:t tokenize Html >=> gather Html >>> fmap (Markup Html >>> normalize) >=> degather >>> fmap (fmap (detokenize Html) >>> mconcat)
:t detokenize Html
#+end_src
#+RESULTS:
: tokenize Html :: ByteString -> These [MarkupWarning] [Token]
: gather Html :: [Token] -> These [MarkupWarning] [Tree Token]
: Markup Html :: [Tree Token] -> Markup
: normalize :: Markup -> Markup
: tokenize Html >=> gather Html >>> fmap (Markup Html >>> normalize) >=> degather >>> fmap (fmap (detokenize Html) >>> mconcat)
: :: ByteString -> These [MarkupWarning] ByteString
: detokenize Html :: Token -> ByteString
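The composed pipeline above mixes ~>=>~ (Kleisli composition, for stages that can emit warnings) with ~>>>~ and ~fmap~ (for pure post-processing). The sketch below shows the same shape on a toy problem. It is not the library's code: ~Warned~, ~tokens~, ~numbers~ and ~pipeline~ are hypothetical names, and the writer tuple from base stands in for ~These [MarkupWarning]~, which, unlike the tuple, can also represent a failure that carries warnings only.

#+begin_src haskell
import Control.Category ((>>>))
import Control.Monad ((>=>))
import Data.Char (isDigit)

-- Hypothetical stand-in for These [MarkupWarning]:
-- a result paired with the warnings accumulated so far.
type Warned a = ([String], a)

-- Split the input into words, warning when there is nothing to split.
tokens :: String -> Warned [String]
tokens s
  | null ws = (["no tokens"], [])
  | otherwise = ([], ws)
  where
    ws = words s

-- Keep the numeric tokens and warn about each token that is dropped.
numbers :: [String] -> Warned [Int]
numbers ts = (["not a number: " <> t | t <- bad], map read good)
  where
    good = filter isNum ts
    bad = filter (not . isNum) ts
    isNum = all isDigit

-- Warning-producing stages are joined with >=>,
-- the pure final step is attached with >>> and fmap.
pipeline :: String -> Warned Int
pipeline = tokens >=> numbers >>> fmap sum

main :: IO ()
main = print (pipeline "1 2 x 3")
#+end_src

Both operators are declared ~infixr 1~, so the expression groups to the right, as ~tokens >=> (numbers >>> fmap sum)~, and warnings from every stage end up concatenated in the final result.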
Round-trip equality: rendering the parsed markup and parsing it again gives back the same value.
#+begin_src haskell :results output
m = markup_ Xml bs
m == (markup_ Xml $ markdown Compact m)
#+end_src
#+RESULTS:
: True
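The check above compares parsed values rather than bytes: the rendered output may differ from the original input, as long as it parses back to the same value. A minimal sketch of that property on a toy format follows. The names ~parse~, ~render~, ~splitOn~ and ~roundTrips~ are hypothetical and are not part of ~markup-parse~.

#+begin_src haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd, intercalate)

-- Hypothetical toy format: comma-separated fields,
-- where whitespace around a field is insignificant.
parse :: String -> [String]
parse = map trim . splitOn ','
  where
    trim = dropWhileEnd isSpace . dropWhile isSpace

-- Render fields compactly, with no padding.
render :: [String] -> String
render = intercalate ","

-- Split a string on every occurrence of a separator character.
splitOn :: Char -> String -> [String]
splitOn c s = case break (== c) s of
  (x, []) -> [x]
  (x, _ : rest) -> x : splitOn c rest

-- The round-trip property: parse . render . parse == parse.
roundTrips :: String -> Bool
roundTrips s = parse (render m) == m
  where
    m = parse s

main :: IO ()
main = print (roundTrips " a , b,c ")
#+end_src

Here ~render (parse " a , b,c ")~ is ~"a,b,c"~, which is not the original string, yet the property still holds because equality is tested after parsing.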
* MarkupParse.Patch
This module does not belong here long-term, but it has been very useful in testing and development.
#+begin_src haskell :results output
show $ ansiWlEditExpr <$> patch [1, 2, 3, 5] [0, 1, 2, 4, 6]
#+end_src
#+RESULTS:
: Just [+0, -3, +4, -5, +6]
* wiki diff test debug
#+begin_src haskell :results output
bs <- B.readFile "other/Parsing - Wikipedia.html"
m = markup_ Html bs
m == (markup_ Html $ markdown Compact m)
#+end_src
#+RESULTS:
: True
* Reference
** Html Standards
[[https://html.spec.whatwg.org/multipage/syntax.html#elements-2:void-elements-2][HTML Standard]]
[[https://developer.mozilla.org/en-US/docs/Glossary/Void_element#self-closing_tags][void elements]]
[[https://stackoverflow.com/questions/3558119/are-non-void-self-closing-tags-valid-in-html5][html - Are (non-void) self-closing tags valid in HTML5? - Stack Overflow]]
[[https://www.w3.org/TR/2017/REC-html52-20171214/syntax.html#tree-construction][HTML 5.2: 8. The HTML syntax]]
** Prior Art
attoparsec-based
https://hackage.haskell.org/package/html-parse
event-based
https://hackage.haskell.org/package/xeno
parsec-based
https://hackage.haskell.org/package/XMLParser
https://hackage.haskell.org/package/hexml
* Performance
Most testing has been done via ~app/speed.hs~.
#+begin_src elisp
(setq haskell-process-args-cabal-repl '("markup-parse:bench:markup-parse-speed"))
#+end_src
#+RESULTS:
| markup-parse:bench:markup-parse-speed |
** Benchmarks
#+begin_src sh :results output
cabal run markup-parse-speed
#+end_src
#+RESULTS:
** Profiling
#+begin_src sh :results output
cabal configure --enable-library-profiling --enable-executable-profiling -fprof-auto -fprof --write-ghc-environment-files=always --enable-benchmarks -O2
#+end_src
This writes the following to ~cabal.project.local~:
#+begin_example
write-ghc-environment-files: always
ignore-project: False
flags: +prof +prof-auto
library-profiling: True
executable-profiling: True
#+end_example
Profiling slowed the main functions significantly:
#+begin_example
./app/speed -n 1000 --best -c +RTS -s -p -hc -l -RTS
label1             label2  old_result  new_result  status
gather             time    2.08e4      3.01e4      degraded
html-parse tokens  time    4.70e5      1.72e6      degraded
html-parse tree    time    2.30e4      3.85e4      degraded
markdown           time    3.51e5      5.70e5      degraded
markup             time    2.10e5      1.05e6      degraded
normalize          time    8.43e4      1.90e5      degraded
tokenize           time    1.94e5      1.02e6      degraded
4,520,989,296 bytes allocated in the heap
2,668,887,592 bytes copied during GC
287,122,272 bytes maximum residency (21 sample(s))
1,572,000 bytes maximum slop
560 MiB total memory in use (0 MiB lost due to fragmentation)
                                 Tot time (elapsed)  Avg pause  Max pause
Gen  0      1073 colls,     0 par    0.471s   0.479s     0.0004s    0.0024s
Gen  1        21 colls,     0 par    2.428s   2.575s     0.1226s    0.3303s
INIT time 0.007s ( 0.008s elapsed)
MUT time 2.142s ( 1.945s elapsed)
GC time 1.904s ( 2.071s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 0.995s ( 0.982s elapsed)
EXIT time 0.026s ( 0.000s elapsed)
Total time 5.074s ( 5.006s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 2,110,654,040 bytes per MUT second
Productivity 61.8% of total user, 58.5% of total elapsed
#+end_example
Changes
0.0.0
Initial split away from chart-svg