
<!DOCTYPE html>
<html lang="en" dir="ltr">

<head>
    <meta charset="utf-8">
    <title>simpet</title>
    <meta name="viewport" content="width=device-width, initial-scale=1, viewport-fit=cover">
    <link href="https://fonts.googleapis.com/css?family=Cutive+Mono|IBM+Plex+Mono&display=swap" rel="stylesheet">
    <link rel="stylesheet" type="text/css" href="../css/poststyle.css">
</head>

<body>
    <h1 class='title'><a href="../index.html">simpet</a></h1>
    <article>
        <h1>A static site generator</h1>
<p>When I decided to start blogging, it was mostly to learn, and to remember all the tech things I have learnt over time.</p>
<p>I also want to explore a wide variety of technologies rather than focus on a particular one.</p>
<p>Hence, to start blogging, I obviously needed a static site generator.</p>
<p>Many already exist, Hugo for example, but rewriting one from scratch is typically the kind of exercise I want to throw myself into.</p>
<p>The advantage of a static site is clearly its loading speed: a simple HTML file, combined with a small linked CSS file, and a whole new blog is born.</p>
<p>Anyway, writing this static site generator from scratch is also the perfect excuse to explore a not so widely known technology for manipulating text files.</p>
<h2>Introduction to AWK</h2>
<p>AWK, named after the initials of its creators (Aho, Weinberger and Kernighan), is an old and powerful text-manipulation tool. Syntactically close to C, it is a scripting language for processing text input.</p>
<p>Its <a href="https://en.wikipedia.org/wiki/AWK">Wikipedia page</a> sums up its history nicely.</p>
<p>I thought it would be clever to use it for a site generator, to parse markdown files and generate HTML ones.</p>
<p>However, according to this <a href="https://jamstack.org/generators/">listing</a> of static site generators, someone else already had the same idea.</p>
<p>Hence the following, as well as my code, is heavily inspired by <a href="https://github.com/nuex/zodiac">Zodiac</a> (even though the repo has not been touched for 8 years).</p>
<h2>Parsing markdown</h2>
<p>Following the official <a href="https://daringfireball.net/projects/markdown/syntax">syntax</a> is a good start for a parser.</p>
<p>AWK works as follows: it takes an optional regex pattern and executes the code between braces, like a function, for each line of the text input that matches.</p>
<p>For example:</p>
<pre><code>/^#/ {
    print "&lt;h1&gt;" $0 "&lt;/h1&gt;"
}
</code>
</pre>
<p>Although <code>$n</code> refers to the n-th field of the line (split on a delimiter, as in a CSV), the special <code>$0</code> refers to the whole line.</p>
<p>In this case, for each line starting with <code>#</code>, awk prints (to standard output) <code>&lt;h1&gt; [content of the line] &lt;/h1&gt;</code>.</p>
<p>This is a first step towards parsing markdown headers.</p>
<p>However, trying this, we immediately see that <code>#</code> is part of the whole line, so it also appears in the HTML, where it should not.</p>
<p>AWK has a way to prevent this: it is a complete scripting language, with built-in functions that enable further manipulation.</p>
<p><code>substr</code> does what its name suggests: it returns a substring of its argument.</p>
<pre><code>/^#/ {
    print "&lt;h1&gt;" substr($0, 3) "&lt;/h1&gt;"
}
</code>
</pre>
<p>In the example above, as per the <a href="https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html#index-substr_0028_0029-function">documentation</a>, it returns the substring of <code>$0</code> starting at index 3 (index 1 being <code>#</code> and index 2 the whitespace following it) through the end of the line.</p>
<p>This is better, and we can now generalize it to all header levels. Another function, <code>match</code>, reports the number of characters matched by a regex, which lets the script determine dynamically the depth of the header it is parsing. This length is stored in the global variable <code>RLENGTH</code>:</p>
<pre><code>/^#+ / {
    match($0, /#+ /);
    n = RLENGTH;
    print "&lt;h" (n - 1) "&gt;" substr($0, n + 1) "&lt;/h" (n - 1) "&gt;"
}
</code>
</pre>
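<p>As a quick check, here is a minimal variation of the rule above. It is only a sketch: the variable name <code>level</code> and the clamp to six levels (HTML stops at <code>&lt;h6&gt;</code>) are my assumptions, not part of the original script.</p>
<pre><code># Turn "# Title" .. "###### Title" into &lt;h1&gt; .. &lt;h6&gt;
/^#+ / {
    match($0, /^#+ /)
    level = RLENGTH - 1              # number of "#" characters
    if (level &gt; 6) level = 6         # assumption: clamp to &lt;h6&gt;
    print "&lt;h" level "&gt;" substr($0, RLENGTH + 1) "&lt;/h" level "&gt;"
}
</code>
</pre>
<p>Fed with the two lines <code># Title</code> and <code>### Deep one</code>, this prints <code>&lt;h1&gt;Title&lt;/h1&gt;</code> and <code>&lt;h3&gt;Deep one&lt;/h3&gt;</code>. Lines that do not match the pattern are simply ignored by this rule.</p>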
<p>Reusing this technique to parse the rest proves difficult: lists, for example, are not contained in a single line, so how do we know when to close them with <code>&lt;/ul&gt;</code> or <code>&lt;/ol&gt;</code>?</p>
<h2>Introducing a LIFO stack</h2>
<p>Since, according to the markdown syntax, it is possible to have nested blocks, such as headers and lists within blockquotes, or lists within lists, I came up with the simple idea of tracking the current environment in a stack in AWK.</p>
<p>It turned out to be easy: I only needed a pointer to track the size of the LIFO, a function to push an element, and another one to pop one out:</p>
<pre><code>BEGIN {
    env = "none"
    stack_pointer = 0
    push(env)
}
</code>
</pre>
<pre><code># Function to push a value onto the stack
function push(value) {
    stack_pointer++
    stack[stack_pointer] = value
}
</code>
</pre>
<pre><code># Function to pop a value from the stack (LIFO)
function pop() {
    if (stack_pointer > 0) {
        value = stack[stack_pointer]
        delete stack[stack_pointer]
        stack_pointer--
        return value
    } else {
        return "empty"
    }
}
</code>
</pre>
<p>The stack does not have to be declared explicitly. The values inside the LIFO correspond to the current markdown environment.</p>
<p>This is a handy trick: when I need to close an HTML tag, I put the popped element between a <code>&lt;/</code> and a <code>&gt;</code> instead of keeping a lookup table.</p>
<p>I also use a simple <code>last()</code> function to return the last pushed value without popping it:</p>
<pre><code># Function to get last value in LIFO
function last() {
    return stack[stack_pointer]
}
</code>
</pre>
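<p>To see the three functions working together, here is a small self-contained sketch. It restates them in a compact form and runs entirely in a <code>BEGIN</code> block, so it needs no input; the nesting used in the demo (a list inside a blockquote) is just an illustration.</p>
<pre><code>function push(value) { stack[++stack_pointer] = value }

function pop(    value) {
    if (stack_pointer &gt; 0) {
        value = stack[stack_pointer]
        delete stack[stack_pointer]
        stack_pointer--
        return value
    }
    return "empty"
}

function last() { return stack[stack_pointer] }

BEGIN {
    push("none")
    push("blockquote")
    push("ul")                      # a list nested in a blockquote
    print last()                    # ul
    print "&lt;/" pop() "&gt;"            # &lt;/ul&gt;
    print "&lt;/" pop() "&gt;"            # &lt;/blockquote&gt;
    print last()                    # none
}
</code>
</pre>
<p>The closing tags come out in the reverse order of the pushes, which is exactly what nested HTML needs.</p>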
<p>This way, parsing lists becomes trivial:</p>
<pre><code># Matching unordered lists
/^[-+*] / {
    env = last()
    if (env == "ul" ) {
        # In an unordered list block, print a new item
        print "&lt;li&gt;" substr($0, 3) "&lt;/li&gt;"
    } else {
        # Otherwise, init the unordered list block
        push("ul")
        print "&lt;ul&gt;\n&lt;li&gt;" substr($0, 3) "&lt;/li&gt;"
    }
}
</code>
</pre>
<p>I believe the code is pretty self-explanatory: when the last environment is not <code>ul</code>, we enter that environment.</p>
<p>This translates to pushing it onto the stack.</p>
<p>Otherwise, we are already reading a list, and we only need to add a new item to it.</p>
<h2>Parsing the simple paragraph and ending the parser</h2>
<p>I showed examples of lists and headers, but it works the same way for code blocks, blockquotes, etc. Only the simple paragraph is different: it does not start with a specific character. To match it, we therefore match every line that does not start with a special character.</p>
<p>I have no idea whether this is the best solution, but so far it has proved to work:</p>
<pre><code># Matching a simple paragraph
!/^(#|\*|-|\+|>|`|$|\t|    )/ {
    env = last()
    if (env == "none") {
        # If no block, print a paragraph
        print "&lt;p&gt;" $0 "&lt;/p&gt;"
    } else if (env == "blockquote") {
        print $0
    }
}
</code>
</pre>
<p>Like <code>BEGIN</code>, AWK provides a way to execute code at the very end of the input, with the <code>END</code> keyword.</p>
<p>Naturally, we need to empty the stack and close all HTML tags that might have been left open during parsing.</p>
<p>It is only a while loop that runs until the last environment is "none", as it was initialized:</p>
<pre><code>END {
    env = last()
    while (env != "none") {
        env = pop()
        print "&lt;/" env "&gt;"
        env = last()
    }
}
</code>
</pre>
<p>This way we are able to parse markdown simply and turn it into an HTML file.</p>
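<p>The question raised earlier, when to close a list, is not answered by the rules shown so far. Here is a minimal sketch of one possible answer, reusing the stack. It rests on two assumptions of mine that simplify real markdown: a blank line closes the innermost open block, and a plain paragraph also ends an open list.</p>
<pre><code>function push(value) { stack[++stack_pointer] = value }

function pop(    value) {
    if (stack_pointer &lt; 1) return "empty"
    value = stack[stack_pointer]
    delete stack[stack_pointer]
    stack_pointer--
    return value
}

function last() { return stack[stack_pointer] }

BEGIN { push("none") }

# Unordered list item: open the list if needed, then print the item
/^[-+*] / {
    if (last() != "ul") {
        push("ul")
        print "&lt;ul&gt;"
    }
    print "&lt;li&gt;" substr($0, 3) "&lt;/li&gt;"
    next
}

# Assumption: a blank line closes the innermost open block
/^$/ {
    if (last() != "none") print "&lt;/" pop() "&gt;"
    next
}

# Anything else is treated as a plain paragraph (assumption: it ends an open list)
{
    if (last() == "ul") print "&lt;/" pop() "&gt;"
    print "&lt;p&gt;" $0 "&lt;/p&gt;"
}

# Close whatever is still open at the end of the input
END {
    while (last() != "none") print "&lt;/" pop() "&gt;"
}
</code>
</pre>
<p>The <code>next</code> statements stop AWK from trying the remaining rules on a line that has already been handled, which is what lets the last rule act as a catch-all.</p>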
<h2>Parsing inline features</h2>
<p>So far we have seen a way to parse blocks, but markdown also handles strong, emphasis and links. However, these tags can appear anywhere in a line.</p>
<p>Hence we need to be able to parse these lines independently of the block itself: indeed, a header can contain a strong and a link.</p>
<p>The previously introduced but very useful function <code>match</code> fits this need: it is literally a regex engine, looking for a pattern in a string.</p>
<p>Whenever the pattern is found, two global variables are set:</p>
<ul>
<li>RSTART: the index of the first character of the matched substring</li>
<li>RLENGTH: the length of the matched substring</li>
</ul>
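<p>A tiny sketch makes these two variables concrete. The sample string is made up for the illustration; the program runs in a <code>BEGIN</code> block and needs no input.</p>
<pre><code>BEGIN {
    line = "say *hi* now"
    if (match(line, /\*[^*]+\*/)) {
        print RSTART, RLENGTH                        # 5 4
        print substr(line, RSTART, RLENGTH)          # *hi*
        print substr(line, RSTART + 1, RLENGTH - 2)  # hi
    }
}
</code>
</pre>
<p>Note that both variables describe the whole match, asterisks included, which is why the last call skips one character on each side.</p>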
<p>In what follows, <code>line</code> represents the line processed by the function, as the following <code>while</code> loops are actually part of a single function.</p>
<p>This way, <code>match(line, /\&#42;([^&#42;]+)\&#42;/)</code> matches a string (that does not contain any <code>&#42;</code>) surrounded by two <code>&#42;</code>, which corresponds to emphasized text.</p>
<p>The <code>&#42;</code> are escaped because they are special characters, and the <em>group</em> is delimited by the parentheses.</p>
<p>To match several instances of emphasized text within a line, a simple <code>while</code> loop will do the trick.</p>
<p>We now only have to insert the HTML tags <code>&lt;em&gt;</code> and <code>&lt;/em&gt;</code> at the right places around the matched text, and we are good to go.</p>
<p>We can save the global variables <code>RSTART</code> and <code>RLENGTH</code> for later use, in case they get changed. Using them, we can also extract the matched substrings and rebuild the actual HTML string:</p>
<pre><code>while (match(line, /&#42;([^{{article}}#42;]+)&#42;/)) {
    start = RSTART
    end = RSTART + RLENGTH - 1
    # Build the result: before match, {{article}}lt;em{{article}}gt;, content, {{article}}lt;/em{{article}}gt;, after match
    line = substr(line, 1, start-1) "{{article}}lt;em{{article}}gt;" substr(line, start+1, RLENGTH-2) "{{article}}lt;/em{{article}}gt;" substr(line, end+1)
}
</code>
</pre>
<p>The while loop enables us to repeat this process as many times as this pattern is encountered within the line.</p>
<p>We can now repeat the pattern for all inline features, e.g. strong and code.</p>
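<p>As an illustration, here is a sketch of how the same loop could be repeated for strong and inline code. The function name <code>inline</code> and the order of the passes are my assumptions: strong is handled before emphasis because <code>**text**</code> would otherwise be consumed by the single-asterisk rule.</p>
<pre><code>function inline(line,    start, end) {
    # Strong: "**text**" (must run before the emphasis pass)
    while (match(line, /\*\*[^*]+\*\*/)) {
        start = RSTART
        end = RSTART + RLENGTH - 1
        line = substr(line, 1, start - 1) "&lt;strong&gt;" substr(line, start + 2, RLENGTH - 4) "&lt;/strong&gt;" substr(line, end + 1)
    }
    # Emphasis: "*text*"
    while (match(line, /\*[^*]+\*/)) {
        start = RSTART
        end = RSTART + RLENGTH - 1
        line = substr(line, 1, start - 1) "&lt;em&gt;" substr(line, start + 1, RLENGTH - 2) "&lt;/em&gt;" substr(line, end + 1)
    }
    # Inline code: "`text`"
    while (match(line, /`[^`]+`/)) {
        start = RSTART
        end = RSTART + RLENGTH - 1
        line = substr(line, 1, start - 1) "&lt;code&gt;" substr(line, start + 1, RLENGTH - 2) "&lt;/code&gt;" substr(line, end + 1)
    }
    return line
}

# Demo rule: run every input line through the inline pass
{ print inline($0) }
</code>
</pre>
<p>One known limitation of this sketch: asterisks inside a code span are still turned into tags, since the code pass runs last.</p>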
<p>The case of links is a bit more involved, as we need to match two groups: the link text and the URL itself.</p>
<p>No real issue here: the naive way is to match the whole construct, then look for both the link text and the URL within that match.</p>
<p>This way, <code>match(line, /\[([^]]+)\]\([^)]+\)/)</code> matches a text between <code>[]</code> followed by a text between <code>()</code>: the markdown representation of links.</p>
<p>As above, we store <code>start</code> and <code>end</code>, and also the whole match:</p>
<pre><code>start = RSTART
end = RSTART + RLENGTH - 1
matched = substr(line, RSTART, RLENGTH)
</code>
</pre>
<p>It is possible to apply the <code>match</code> function to this <code>matched</code> string and extract first the text in <code>[]</code>, then the text in <code>()</code>:</p>
<pre><code>if (match(matched, /[([^]]+)]/)) {
    matched_link = substr(matched, RSTART+1, RLENGTH-2)
}
if (match(matched, /([^)]+)/)) {
    matched_url = substr(matched, RSTART+1, RLENGTH-2)
}
</code>
</pre>
<p>As the link text and the URL are stored, using the variables <code>start</code> and <code>end</code>, it is easy to rebuild the HTML line:</p>
<pre><code>line = substr(line, 1, start-1) "&lt;a href=\"" matched_url "\"&gt;" matched_link "&lt;/a&gt;" substr(line, end+1)
</code>
</pre>
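<p>Put together, the link handling could look like the sketch below. The function name <code>links</code> and its local variable names are mine. It also deviates slightly from the steps above: instead of a second regex for the URL, it takes everything after the closing <code>]</code>, which avoids picking up parentheses that appear in the link text.</p>
<pre><code>function links(line,    start, end, matched, text, url) {
    while (match(line, /\[[^]]+\]\([^)]+\)/)) {
        start = RSTART
        end = RSTART + RLENGTH - 1
        matched = substr(line, start, RLENGTH)

        # Link text: what sits between "[" and "]"
        match(matched, /\[[^]]+\]/)
        text = substr(matched, RSTART + 1, RLENGTH - 2)

        # URL: the rest of the match, minus "(" and ")"
        url = substr(matched, RLENGTH + 2, length(matched) - RLENGTH - 2)

        line = substr(line, 1, start - 1) "&lt;a href=\"" url "\"&gt;" text "&lt;/a&gt;" substr(line, end + 1)
    }
    return line
}

# Demo rule: convert the links of every input line
{ print links($0) }
</code>
</pre>
<p>For the input line <code>see [Zodiac](https://github.com/nuex/zodiac) repo</code>, this prints <code>see &lt;a href="https://github.com/nuex/zodiac"&gt;Zodiac&lt;/a&gt; repo</code>.</p>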
<p>The inline parsing function is now complete. All we have to do is apply it systematically to the text within HTML tags, and this finishes the markdown parser.</p>
<p>This, of course, is the first brick of a static site generator, and maybe the most complex one.</p>
<p>We shall see next how to orchestrate this parser to make it an actual site generator.</p>
<p>The code is available in the <a href="https://git.simonpetit.top/simonpetit/top">repo</a>.</p>
    </article>
</body>

</html>