A deep dive into MyST, Part 2: The MyST-Parser, Docutils and Sphinx


This is a series of blog posts inviting you to join me on a little journey I have been on over the last few weeks to figure out a nice story for MyST inside the Nikola world.

In the previous blog post, we identified some limitations in the MyST-Parser Python API and started to understand why roles and directives were not supported by the API as we expected.

In this post, we will explore the machinery underneath the MyST-Parser, with the goal of deepening our understanding and being able to propose some alternatives that provide the expected support.

Markdown-it-py and its plugins

Coming back to the MyST-Parser default_parser implementation, you can see the parser as a MarkdownIt instance configured with a collection of markdown-it-py plugins:

md = (
    MarkdownIt("commonmark", renderer_cls=renderer_cls)
    .enable("table")
    .use(front_matter_plugin)
    .use(myst_block_plugin)
    .use(myst_role_plugin)
    .use(footnote_plugin)
    .use(wordcount_plugin, per_minute=config.words_per_minute)
    .disable("footnote_inline")
    .disable("footnote_tail")
)

The MarkdownIt class is instantiated with a configuration preset ("commonmark" here) dictating the syntax rules and additional options for the parser and renderer. In addition, several plugins are activated to load a collection of additional syntax rules and render methods into the parser.

The input data is then parsed via nested chains of rules, generating a list (stream) of tokens that will eventually be passed to the renderer to produce the output (plain HTML by default, or a Docutils object, as we will see below).
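
To make the token stream idea concrete, here is a minimal sketch (my own example, not taken from the MyST-Parser source) using a plain markdown-it-py instance; the comments show the top-level token types the "commonmark" preset produces for a simple paragraph:

from markdown_it import MarkdownIt

md = MarkdownIt("commonmark")
tokens = md.parse("Some *emphasised* text")
for token in tokens:
    print(token.type)
# paragraph_open
# inline      (holds the child tokens for the emphasised text)
# paragraph_close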

In the previous post, we highlighted some MyST-specific markdown-it-py plugins, such as the myst_block_plugin and the myst_role_plugin. Let's take a brief look at the latter for the sake of simplicity:

def myst_role_plugin(md: MarkdownIt):
    """Parse ``{role-name}`content```"""
    md.inline.ruler.before("backticks", "myst_role", myst_role)
    md.add_render_rule("myst_role", render_myst_role)

The myst_role_plugin essentially adds a new syntax rule to the parser, which is now able to parse roles from the input, and a new method for the renderer to render the resulting role tokens:

def render_myst_role(self, tokens, idx, options, env):
    token = tokens[idx]
    name = token.meta.get("name", "unknown")
    return (
        '<code class="sphinx-role">'
        f"{{{name}}}[{escapeHtml(token.content)}]"
        "</code>"
    )

We can now understand the output the MyST-Parser Python API gives us when parsing and rendering a simple role input:

In [1]:
from myst_parser.main import to_html
In [2]:
text = """
{emphasis}`content`
"""
In [3]:
to_html(text)
Out[3]:
'<p><code class="sphinx-role">{emphasis}[content]</code></p>\n'

We previously showed the to_html function using the default_parser configured with the HTML renderer (provided by the markdown-it-py RendererHTML class). But what happens when we use the other renderers provided by the MyST-Parser Python API?

The Docutils and Sphinx renderers

Let's first explore the to_docutils function:

In [4]:
from myst_parser.main import to_docutils
In [5]:
text = """
{emphasis}`content`

```{admonition} This is my admonition
This is my note
```
"""
In [6]:
print(to_docutils(text).pformat())
<document source="notset">
    <paragraph>
        <emphasis>
            content
    <admonition classes="admonition-this-is-my-admonition">
        <title>
            This is my admonition
        <paragraph>
            This is my note

The DocutilsRenderer converts the token stream directly into a docutils.document representation of the document, converting roles and directives into docutils nodes if a converter can be found for the given name.

The SphinxRenderer builds on the DocutilsRenderer to add Sphinx-specific nodes, e.g. for cross-referencing between documents.

In [7]:
from myst_parser.main import to_docutils
In [8]:
text = """
{ref}`target`
"""
In [9]:
print(to_docutils(text, in_sphinx_env=True).pformat())
<document source="notset">
    <paragraph>
        <pending_xref refdoc="mock_docname" refdomain="std" refexplicit="False" reftarget="target" reftype="ref" refwarn="True">
            <inline classes="xref std std-ref">
                target

Notice that the Sphinx-specific roles (and directives) need the in_sphinx_env option to be enabled.

The MystParser class

Previously, we presented the MystParser as a Sphinx parser:

from sphinx.parsers import Parser as SphinxParser
...
class MystParser(SphinxParser):
    """Sphinx parser for Markedly Structured Text (MyST)."""

using a set of general and some MyST-specific markdown-it-py plugins (notice the default_parser is configured, by default, with the "sphinx" renderer):

def parse(self, inputstring: str, document: nodes.document) -> None:
    """Parse source text.

    :param inputstring: The source string to parse
    :param document: The root docutils node to add AST elements to
    """
    config = document.settings.env.myst_config
    parser = default_parser(config)
    parser.options["document"] = document
    env = AttrDict()

to parse the input text into a token stream and then render those tokens (via the SphinxRenderer, which is a subclass of the DocutilsRenderer):

tokens = parser.parse(inputstring, env)
if not tokens or tokens[0].type != "front_matter":
    # we always add front matter, so that we can merge it with global keys,
    # specified in the sphinx configuration
    tokens = [Token("front_matter", "", 0, content="{}", map=[0, 0])] + tokens
parser.renderer.render(tokens, parser.options, env)

into a Docutils document representation (via a helper that runs the MystParser within a Sphinx environment):

def parse(app: Sphinx, text: str, docname: str = "index") -> nodes.document:
    """Parse a string as MystMarkdown with Sphinx application."""
    app.env.temp_data["docname"] = docname
    app.env.all_docs[docname] = time.time()
    reader = SphinxStandaloneReader()
    reader.setup(app)
    parser = MystParser()
    parser.set_application(app)
    with sphinx_domains(app.env):
        return publish_doctree(
            text,
            path.join(app.srcdir, docname + ".md"),
            reader=reader,
            parser=parser,
            parser_name="markdown",
            settings_overrides={"env": app.env, "gettext_compact": True},
        )

Finally, those objects are passed to the built-in Sphinx machinery to write the HTML output.
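
In day-to-day usage you rarely touch any of this directly: enabling the extension in your Sphinx project is enough to wire the MystParser into the pipeline described above. A minimal conf.py looks something like this:

# conf.py: activating the MyST-Parser extension registers the MystParser
# for Markdown sources, so sphinx-build runs the parse/render pipeline
# described above on every .md file.
extensions = ["myst_parser"]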

The big questions

This is great! We now understand why we need Sphinx to generate the expected HTML output for roles and directives. Several questions now arise:

  • Do we want to have Sphinx-free support for rendering roles and directives in the MyST-Parser?

    Nikola (among other projects) already has its own machinery (based on Docutils) to build the final HTML output. Getting the docutils object from the Python API would be a super interesting way to easily expose and provide that object to downstream projects (see the sketch after this list)!

    One caveat with this approach would be missing some Sphinx features (i.e. cross-linking between documents) based on custom roles and directives, which we may need to re-implement.

  • Do we want to have docutils-free support for roles and directives in the MyST Python API?

    Docutils is actually what introduces the roles and directives concept (which Sphinx extends), so if we want to go docutils-free, then we will need to re-implement those concepts.

  • Does it make sense to create a docutils alternative in Python? How much of its functionality would need to be replicated? How should it be extended or enhanced?

    There is currently a nice example of an alternative implementation from the Executable Books community, but in the JavaScript world: https://github.com/executablebooks/markdown-it-docutils ;-). Since we do not have Docutils there, it actually makes a lot of sense to write something new. But what is the value/need/place for an alternative implementation in Python? Maybe we do not need the whole Docutils stack... but perhaps we do need some of its core functionality?

    If we decide to write some minimal support, which pieces are we interested in bringing in first? And where should those pieces end up? The markdown-it-docutils package I referenced above is actually a markdown-it (JS) plugin. If we follow that pattern, we should create a new markdown-it-py-docutils plugin, and we are no longer in MyST-Parser territory. But the MyST-Parser does, in fact, contain some directive-parsing functions; we may need to move those towards markdown-it-py, as the JS plugin does. That sounds nice, but... are there any other suitable (simpler) alternatives besides the ones I proposed above?
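
As a quick illustration of the first question, here is a rough, untested sketch of what the "Sphinx-free" path could look like: take the docutils document that to_docutils already returns and let plain Docutils write the HTML, the way a project like Nikola could plug it into its own machinery. I am assuming here that the doctree returned by to_docutils can be fed straight into Docutils' publish_from_doctree:

from docutils.core import publish_from_doctree
from myst_parser.main import to_docutils

# Parse MyST Markdown into a docutils document (no Sphinx involved)
doctree = to_docutils("{emphasis}`content`\n")

# Hand the doctree to plain Docutils to produce the final HTML
html = publish_from_doctree(doctree, writer_name="html5").decode()
print(html)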

Finally, in the MyST community, there are some ongoing discussions about developing a MyST specification that represents what we understand as the MyST language. Having one specification could be interesting and super useful, because the different actors interested in writing a MyST parser for their own space could check their work against that specification. Those actors could be different programming languages, such as JS or Python, or even different flavors within the same language, such as Docutils, Sphinx, or a theoretical markdown-it-py-docutils ;-)

Show time... not really ;-)

OK, again, this is long enough! In the next post I will try to start answering some of these questions showcasing alternative approaches for the Nikola use case.

I hope you enjoyed the ride and I will see you soon with the third part!

Did you like the content? Great!

Or visit my support page for more information.


Btw, don't forget this blog post is an ipynb file itself! So, you can download it from the "Source" link at the top of the post if you want to play with it ;-)
