cheapskate
Experimental markdown processor.
http://github.com/jgm/cheapskate
LTS Haskell 20.26: 0.1.1.2@rev:1
Stackage Nightly 2022-11-17: 0.1.1.2@rev:1
Latest on Hackage: 0.1.1.2@rev:1
cheapskate-0.1.1.2@sha256:b8ae3cbb826610ea45e6840b7fde0af2c2ea6690cb311edfe9683f61c0a50d96,3072
Module documentation for 0.1.1.2
Cheapskate
Note: This library is unmaintained (by me anyway). I recommend using cmark.
This is an experimental Markdown processor in pure Haskell. (A cheapskate is
always in search of the best markdown.) It aims to process Markdown efficiently
and in the most forgiving possible way. It is about seven times faster than
pandoc and uses a fifth the memory. It is also faster, and considerably more
accurate, than the markdown package on Hackage.
There is no such thing as an invalid Markdown document. Any string of
characters is valid Markdown. So the processor should finish efficiently no
matter what input it gets. Garbage in should not cause an error or exponential
slowdowns. This processor has been tested on many large inputs consisting of
random strings of characters, with performance that is consistently linear with
the input size. (Try make fuzztest.)
Installing
To build, get the Haskell Platform, then:
cabal update && cabal install
This will install both the cheapskate executable and the Haskell library.
A man page can be found in man/man1 in the source.
Usage
As an executable:
cheapskate [FILE*]
As a library:
import Cheapskate
import Data.Text (Text)
import Text.Blaze.Html

-- Parse Markdown with the default options and render it as blaze Html.
toMarkdown :: Text -> Html
toMarkdown = toHtml . markdown def
If the markdown input you are converting comes from an untrusted source
(e.g. a web form), you should always set sanitize to True. This causes the
generated HTML to be filtered through xss-sanitize’s sanitizeBalance function.
Otherwise you risk an XSS attack from raw HTML or a markdown link or image
attribute.
You may also wish to disallow users from entering raw HTML for aesthetic,
rather than security reasons. In that case, set allowRawHtml to False, but let
sanitize stay True, since it still affects attributes coming from markdown
links and images.
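The two options above can be set with ordinary record-update syntax on def. What follows is only a sketch: untrustedOptions and renderUntrusted are hypothetical names introduced for illustration, and the sketch assumes the blaze-html package is available for rendering to a String.

```haskell
import Cheapskate (Options (..), def, markdown)
import qualified Data.Text as T
import Text.Blaze.Html (toHtml)
import Text.Blaze.Html.Renderer.String (renderHtml)

-- Hypothetical name: options for input from an untrusted source.
-- sanitize filters the generated HTML; allowRawHtml = False additionally
-- stops raw HTML in the input from being passed through as HTML.
untrustedOptions :: Options
untrustedOptions = def { sanitize = True, allowRawHtml = False }

-- Hypothetical helper: parse untrusted Markdown and render it to a String.
renderUntrusted :: T.Text -> String
renderUntrusted = renderHtml . toHtml . markdown untrustedOptions
```

For example, renderUntrusted (T.pack "<script>alert(1)</script>") should not yield a live script element, while ordinary Markdown such as *hi* is still rendered as emphasis.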
Manipulating the parsed document
You can manipulate the parsed document before rendering using the walk and
walkM functions. For example, you might want to highlight code blocks using
highlighting-kate:
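import Cheapskate
import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import Text.Blaze.Html
import Text.Blaze.Html.Renderer.Text
import Text.Highlighting.Kate

-- Parse with the default options, highlight every code block, then render.
markdownWithHighlighting :: Text -> Html
markdownWithHighlighting = toHtml . walk addHighlighting . markdown def

-- Replace a code block with a raw HTML block holding its highlighted form.
-- The two Text modules are imported qualified so that Text is unambiguous.
addHighlighting :: Block -> Block
addHighlighting (CodeBlock (CodeAttr lang _) t) =
  HtmlBlock (T.concat $ TL.toChunks
             $ renderHtml $ toHtml
             $ formatHtmlBlock defaultFormatOpts
             $ highlightAs (T.unpack lang) (T.unpack t))
addHighlighting x = x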
Extensions
This processor adds the following Markdown extensions:
Hyperlinked URLs
All absolute URLs are automatically made into hyperlinks, whether inside <> or not.
Fenced code blocks
Fenced code blocks with attributes are allowed. These begin with a line of three or more backticks or tildes, followed by an optional language name and possibly other metadata. They end with a line of backticks or tildes (the same character as started the code block) of at least the length of the starting line.
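The opening and closing rules for fences can be sketched on plain strings. This is an illustration of the rule as stated above, not the library's parser: Fence, openFence, and closesFence are hypothetical names, and the sketch assumes the fence starts at the beginning of the line.

```haskell
import Data.Char (isSpace)

-- | An opening fence: its character, its length, and the trimmed info string.
data Fence = Fence Char Int String
  deriving (Eq, Show)

-- | Strip leading and trailing whitespace.
trim :: String -> String
trim = f . f
  where f = reverse . dropWhile isSpace

-- | Recognise an opening fence: three or more backticks or tildes,
-- followed by an optional language name and other metadata.
openFence :: String -> Maybe Fence
openFence line@(c:_)
  | c `elem` "`~", length run >= 3 = Just (Fence c (length run) (trim rest))
  where (run, rest) = span (== c) line
openFence _ = Nothing

-- | A line closes the fence if it is a run of the same character,
-- at least as long as the opening run, with nothing else but whitespace.
closesFence :: Fence -> String -> Bool
closesFence (Fence c n _) line = length run >= n && all isSpace rest
  where (run, rest) = span (== c) line
```

Under this sketch a block opened with four tildes is not closed by three tildes or by backticks, so shorter fences can appear inside the block as literal text.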
Explicit hard line breaks
A hard line break can be indicated with a backslash before a newline. The standard method of two spaces before a newline also works, but this gives a more “visible” alternative.
Backslash escapes
All ASCII symbols and punctuation marks can be backslash-escaped, not just those with a use in Markdown.
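As a rough illustration of this rule, the sketch below removes the backslash before any ASCII punctuation mark or symbol and leaves every other backslash in place. The function name unescape is hypothetical, and the classification via Data.Char is an assumption about what counts as "symbols and punctuation".

```haskell
import Data.Char (isAscii, isPunctuation, isSymbol)

-- | True for ASCII punctuation marks and symbols such as * _ + ~ $ and `.
escapable :: Char -> Bool
escapable c = isAscii c && (isPunctuation c || isSymbol c)

-- | Drop the backslash in front of every escapable character;
-- a backslash before anything else is kept as a literal backslash.
unescape :: String -> String
unescape ('\\' : c : rest)
  | escapable c = c : unescape rest
unescape (c : rest) = c : unescape rest
unescape [] = []
```

So a backslash before + is consumed even though + has no special meaning in that position, while a backslash before a letter is left alone.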
Revisions
It departs from the markdown syntax document in the following ways:
Intraword emphasis
Underscores cannot be used for word-internal emphasis. This prevents common mistakes with filenames, usernames, and identifiers. Asterisks can still be used if word-internal emphasis is needed.
The exact rule is this: an underscore that appears directly after an alphanumeric character does not begin an emphasized span. (However, an underscore directly before an alphanumeric can end an emphasized span.)
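The first half of that rule is easy to state as code. The sketch below (underscoreOpeners is a hypothetical name) lists the positions of underscores that are allowed to begin an emphasized span; it checks only the preceding character and does not attempt to pair openers with closers.

```haskell
import Data.Char (isAlphaNum)

-- | Zero-based positions of underscores that may begin an emphasized span:
-- those not directly preceded by an alphanumeric character.
underscoreOpeners :: String -> [Int]
underscoreOpeners s =
  [ i
  | (i, prev, c) <- zip3 [0 ..] (Nothing : map Just s) s
  , c == '_'
  , maybe True (not . isAlphaNum) prev
  ]
```

In my_file_name no underscore qualifies, so the text stays literal, whereas the leading underscore of _emph_ does qualify.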
Ordered lists
The starting number of an ordered list is now significant. Other numbers are
ignored, so you can still use 1. for each list item.
In addition to the 1. form, you can use 1) in your ordered lists. A new list
starts if you change the form of the delimiter. So, the following is two lists:
1. one
2. two
1) one
2) two
Bullet lists
A new bullet list starts if you change the bullet marker. So, the following is two consecutive bullet lists:
+ one
+ two
- one
- two
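The rules for ordered and bullet lists can be sketched as a grouping of consecutive list markers. This is an illustration only: Marker, parseMarker, splitLists, and startNumber are hypothetical names, and the sketch assumes the bullet characters *, +, and - at the start of the line.

```haskell
import Data.Char (isDigit)
import Data.List (groupBy)

-- | A bullet marker, or an ordered marker with its number and delimiter.
data Marker = Bullet Char | Ordered Int Char
  deriving (Eq, Show)

-- | Read the list marker at the start of a line, if there is one.
parseMarker :: String -> Maybe Marker
parseMarker (c : ' ' : _) | c `elem` "*+-" = Just (Bullet c)
parseMarker line =
  case span isDigit line of
    (ds@(_ : _), d : ' ' : _) | d `elem` ".)" -> Just (Ordered (read ds) d)
    _ -> Nothing

-- | Two items belong to the same list only if the bullet character
-- or the ordered-list delimiter is unchanged; numbers are ignored.
sameList :: Marker -> Marker -> Bool
sameList (Bullet a) (Bullet b) = a == b
sameList (Ordered _ a) (Ordered _ b) = a == b
sameList _ _ = False

-- | Group the markers of consecutive list items into separate lists.
splitLists :: [String] -> [[Marker]]
splitLists ls = groupBy sameList [m | Just m <- map parseMarker ls]

-- | The significant starting number: that of the first item only.
startNumber :: [Marker] -> Maybe Int
startNumber (Ordered n _ : _) = Just n
startNumber _ = Nothing
```

Applied to the two examples above, splitLists returns two groups in each case, and a list whose items are numbered 3. and 7. starts at 3.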
List separation
Two blank lines break out of a list. This allows you to have consecutive lists:
- one
- two


- one (new list)
The blank lines break out of a list no matter how deeply it is nested:
- one
  - two
    - three


- new top-level list
Indentation of list continuations
Block elements inside list items need not be indented four spaces. If they are indented beyond the bullet or numerical list marker, they will be considered additional blocks inside the list item. So, the following is a list item with two paragraphs:
- one

  two
The amount of indentation required for an indented code block inside a list item depends on the first line of the list item. Generally speaking, code must be indented four spaces past the first non-space character after the list marker. Thus:
- My code

      {code here}

-   My code

        {code here}
The following diagram shows how the first line of a list item divides the following lines into three regions:
- My code
|     |
+-----+
Content to the left of the marked region will not be part of the list item. Content to the right of the marked region will be indented code under the list item. Regular blocks that belong under the list item should start inside the marked region.
When the first line itself contains indented code, this code and subsequent indented code blocks should be indented five spaces past the list marker:
-     { code }

      { more code }
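The two indentation rules for code inside a list item can be sketched as a function from the first line of the item to the column where indented code begins. This is one reading of the rules above rather than the library's implementation; markerWidth and codeColumn are hypothetical names, and tabs are not handled.

```haskell
import Data.Char (isDigit)

-- | Width of the bullet or ordered-list marker at the start of a line, if any.
markerWidth :: String -> Maybe Int
markerWidth (c : ' ' : _) | c `elem` "*+-" = Just 1
markerWidth line =
  case span isDigit line of
    (ds@(_ : _), d : ' ' : _) | d `elem` ".)" -> Just (length ds + 1)
    _ -> Nothing

-- | Zero-based column at which indented code under the list item starts.
-- Normally that is four columns past the first non-space character after
-- the marker; if the first line is itself indented code (five or more
-- spaces after the marker), it is five columns past the marker.
codeColumn :: String -> Maybe Int
codeColumn line = do
  w <- markerWidth line
  let gap = length (takeWhile (== ' ') (drop w line))
  pure (if gap >= 5 then w + 5 else w + gap + 4)
```

For the examples above this gives column 6 for "- My code", column 8 for "-   My code", and column 6 for "-     { code }", matching the indentation shown.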
Raw HTML blocks
Raw HTML blocks work a bit differently than in Markdown.pl.
A raw HTML block starts with a block-level HTML tag (opening or closing), or a
comment start <!-- or end -->, and goes until the next blank line. The whole
block is included as raw HTML. No attempt is made to parse balanced tags. This
means that in the following, the asterisks are literal asterisks:
<div>
*hello*
</div>
while in the following, the asterisks are interpreted as markdown emphasis:
<div>

*hello*

</div>
In the first example, we have a single raw HTML block; in the second, we have two raw HTML blocks with an intervening paragraph. This system gives authors the flexibility to enclose markdown sections in HTML block-level tags if they wish, while also allowing them to include verbatim HTML blocks (taking care that they don’t include any blank lines).
As a consequence of this rule, HTML blocks may not contain blank lines.
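Because a raw HTML block simply runs to the next blank line, its extent can be sketched in a few lines. The names isBlank and takeHtmlBlock are hypothetical, and the sketch assumes the input already starts at a line that opens a raw HTML block.

```haskell
import Data.Char (isSpace)

-- | A line is blank if it contains nothing but whitespace.
isBlank :: String -> Bool
isBlank = all isSpace

-- | Split the lines into the raw HTML block (everything up to the next
-- blank line) and the remaining lines. No tag balancing is attempted.
takeHtmlBlock :: [String] -> ([String], [String])
takeHtmlBlock = break isBlank
```

For the first example above the whole of <div>, *hello*, </div> is one block. For the second, the block is just <div>, and *hello* is left over to be parsed as an ordinary paragraph.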
Clarifications
This implementation resolves the following issues left vague in the markdown syntax document:
Tight vs. loose lists
A list is considered “tight” if (a) it has only one item or there is no blank
space between any two consecutive items, and (b) no item has blank lines as its
immediate children. If a list is “tight,” then list items consisting of a
single paragraph or a paragraph followed by a sublist will be rendered without
<p> tags.
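Conditions (a) and (b) can be written down as a predicate over a simplified model of a list. The Item record and isTight below are hypothetical and exist only to make the definition concrete; the library's own types are different.

```haskell
-- | A simplified list item: just the two facts the definition depends on.
data Item = Item
  { blankBefore :: Bool -- ^ a blank line separates this item from the previous one
  , blankChild  :: Bool -- ^ the item has a blank line as an immediate child
  }

-- | Tight if (a) no blank space separates consecutive items, which holds
-- trivially for a single item, and (b) no item has a blank-line child.
isTight :: [Item] -> Bool
isTight items = noGaps && noBlankChildren
  where
    noGaps = not (any blankBefore (drop 1 items))
    noBlankChildren = not (any blankChild items)
```

The first item's blankBefore flag is ignored, since only gaps between two consecutive items count.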
Sublists
Sublists work like other block elements inside list items; they must be indented past the bullet or numerical list marker (but no more than three spaces past, or they will be interpreted as indented code).
ATX headers
ATX headers must have a space after the initial ###s.
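A minimal sketch of this requirement follows. The name atxHeader is hypothetical, the limit of six levels is an assumption taken from standard Markdown, and optional closing # characters are not stripped.

```haskell
-- | Recognise an ATX header: one to six '#' characters, then a space.
-- Returns the header level and the text after the spaces.
atxHeader :: String -> Maybe (Int, String)
atxHeader line =
  case span (== '#') line of
    (hs, ' ' : rest)
      | not (null hs), length hs <= 6 -> Just (length hs, dropWhile (== ' ') rest)
    _ -> Nothing
```

With this rule a line such as #hashtag is not a header and remains ordinary paragraph text.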
Separation of block quotes
A blank line will end a blockquote. So, the following is a single blockquote:
> hi
>
> there
But this is two blockquotes:
> hi

> there
Blank lines are not required before horizontal rules, blockquotes, lists, code blocks, or headers. They are not required after, either, though in many cases “laziness” will effectively require a blank line after. For example, in
Hello there.
> A quote.
Still a quote.
the “Still a quote.” is part of the block quote, because of laziness (the ability to leave off the > from the beginning of subsequent lines). Laziness also affects lists. However, we can have a code block, ATX header, or horizontal rule between two paragraphs without any blank lines.
Link references
Link references may occur anywhere in the document, even in nested list contexts. They need not be at the outer level.
Tests
The tests subdirectory contains an extensive suite of tests, including all of
John Gruber’s original Markdown tests, plus many of the tests from Michel
Fortin’s mdtest suite. Each test consists of two files with the same basename,
a markdown source and an expected HTML output.
To run the test suite, do
make test
To run only tests that match a regex pattern, do
PATT=Orig make test
Setting the environment variable TIDY=1 will run the expected and actual output
through tidy before comparing them. You can run this test suite on another
markdown processor by doing
PROG=myothermarkdown make test
Benchmarks
To run a crude benchmark comparing cheapskate to pandoc, do make bench. Set the
BENCHPROGS environment variable to compare to other implementations.
License
Copyright © 2012, 2013, 2014 John MacFarlane.
The library is released under the BSD license; see LICENSE for terms.
Some of the test cases are borrowed from Michel Fortin’s mdtest suite and John Gruber’s original markdown test suite.