<!DOCTYPE html>
<html lang="fr" dir="ltr">
<head>
<meta charset="utf-8">
<title>simpet</title>
<meta name="viewport" content="width=device-width, initial-scale=1, viewport-fit=cover">
<link href="https://fonts.googleapis.com/css?family=Cutive+Mono|IBM+Plex+Mono&display=swap" rel="stylesheet">
<link rel="stylesheet" type="text/css" href="../css/poststyle.css">
</head>
<body>
<h1 class='title'><a href="../index.html">simpet</a></h1>
<article>
<h1>A static site generator</h1>
<p>When I decided to start blogging, it was mostly a way for me to learn, and to remember the tech I have picked up over time.</p>
<p>I also want to explore a wide variety of technologies rather than focus on a particular one.</p>
<p>To start blogging, I obviously needed a static site generator.</p>
<p>Many already exist, Hugo for example, but rewriting one from scratch is exactly the kind of exercise I like to throw myself into.</p>
<p>The main advantage of a static site is its loading speed: a simple HTML file combined with a small, polished CSS file, and a whole new blog is born.</p>
<p>Writing this static site generator from scratch is also the perfect excuse to explore a not so widely known technology for manipulating text files.</p>
<h2>Introduction to AWK</h2>
<p>AWK, named after the initials of its creators, is an old and powerful text-processing tool. 
Syntactically close to C, it is a scripting language for manipulating text input.</p>
<p>Its <a href="https://en.wikipedia.org/wiki/AWK">Wikipedia page</a> sums up its history nicely.</p>
<p>I thought it would be clever to use it for a site generator, to parse markdown files and generate HTML ones.</p>
<p>However, according to this <a href="https://jamstack.org/generators/">listing</a> of static site generators, someone else has already had the same idea.</p>
<p>Hence what follows, as well as my code, is heavily inspired by <a href="https://github.com/nuex/zodiac">Zodiac</a> (even though the repo has not been touched for 8 years).</p>
<h2>Parsing markdown</h2>
<p>Following the official <a href="https://daringfireball.net/projects/markdown/syntax">syntax</a> is a good start for a parser.</p>
<p>AWK works as follows: it takes an optional regex and a block of code between braces, and runs that block, like a function, on each line of the input that matches the regex (or on every line when no regex is given).</p>
<p>For example:</p>
<pre><code>/^#/ { print "&lt;h1&gt;" $0 "&lt;/h1&gt;" }
</code></pre>
<p>While `$n` refers to the n-th field of the line (split on a delimiter, as in a CSV), the special `$0` refers to the whole line.</p>
<p>In this case, for each line starting with `#`, awk prints (to the standard output) `&lt;h1&gt; [content of the line] &lt;/h1&gt;`.</p>
<p>This is a first step towards parsing markdown headers.</p>
<p>However, trying it out immediately shows that the `#` is part of the line, so it also appears in the HTML, where it should not.</p>
<p>AWK has a way to prevent this: it is a complete scripting language, with built-in functions that enable further manipulation.</p>
<pre><code>/^#/ { print "&lt;h1&gt;" substr($0, 3) "&lt;/h1&gt;" }
</code></pre>
<p>In the example above, as per the <a href="https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html#index-substr_0028_0029-function">documentation</a>, `substr` returns the substring of `$0` starting at index 3 (1 being the `#` and 2 the whitespace following it) up to the end of the 
line.</p>
<p>This is better, and we can now generalize it to all header levels. Another function, `match`, searches a string for a regex and stores the number of characters matched in the global variable `RLENGTH`, which lets the script determine dynamically which header depth it is parsing:</p>
<pre><code>/^#+ / {
    # RLENGTH counts the '#' characters plus the trailing space
    match($0, /#+ /)
    n = RLENGTH
    print "&lt;h" (n - 1) "&gt;" substr($0, n + 1) "&lt;/h" (n - 1) "&gt;"
}
</code></pre>
<p>Reusing this technique for the rest proves difficult: a list, for example, is not contained in a single line, so how do we know when to close it with `&lt;/ul&gt;` or `&lt;/ol&gt;`?</p>
<h2>Introducing a LIFO stack</h2>
<p>Since the markdown syntax allows nested blocks, such as headers and lists within blockquotes, or lists within lists, I came up with the simple idea of tracking the current environment in a stack in AWK.</p>
<p>It turned out to be easy: I only needed a pointer to track the size of the LIFO, a function to push an element, and another one to pop one out:</p>
<pre><code>BEGIN {
    env = "none"
    stack_pointer = 0
    push(env)
}
</code></pre>
<pre><code># Function to push a value onto the stack
function push(value) {
    stack_pointer++
    stack[stack_pointer] = value
}
</code></pre>
<pre><code># Function to pop a value from the stack (LIFO)
# "value" is declared as an extra parameter so that it stays local
function pop(    value) {
    if (stack_pointer &gt; 0) {
        value = stack[stack_pointer]
        delete stack[stack_pointer]
        stack_pointer--
        return value
    } else {
        return "empty"
    }
}
</code></pre>
<p>The stack array does not need to be declared beforehand. 
The values inside the LIFO correspond to the markdown environments currently open, the last one being the current environment.</p>
<p>This is a handy trick: when I need to close an HTML tag, I put the popped element between a `&lt;/` and a `&gt;` instead of keeping a lookup table.</p>
<p>I also use a simple `last()` function that returns the last pushed value without popping it:</p>
<pre><code># Function to get last value in LIFO
function last() {
    return stack[stack_pointer]
}
</code></pre>
<p>This way, parsing lists becomes trivial:</p>
<pre><code># Matching unordered lists
/^[-+*] / {
    env = last()
    if (env == "ul") {
        # In an unordered list block, print a new item
        print "&lt;li&gt;" substr($0, 3) "&lt;/li&gt;"
    } else {
        # Otherwise, init the unordered list block
        push("ul")
        print "&lt;ul&gt; &lt;li&gt;" substr($0, 3) "&lt;/li&gt;"
    }
}
</code></pre>
<p>I believe the code is fairly self-explanatory: when the last environment is not `ul`, we enter that environment.</p>
<p>This translates to pushing it onto the stack.</p>
<p>Otherwise, we are already reading a list, and we only need to add a new element to it.</p>
<h2>Parsing the simple paragraph and ending the parser</h2>
<p>I showed examples of lists and headers, but it works the same way for code blocks, blockquotes, etc. Only the simple paragraph is different: it does not start with a specific character. 
That is, to match it, we match every line that does not start with a special character.</p>
<p>I have no idea whether this is the best solution, but so far it has worked:</p>
<pre><code># Matching a simple paragraph
!/^(#|\*|-|\+|&gt;|`|$| | )/ {
    env = last()
    if (env == "none") {
        # If no block, print a paragraph
        print "&lt;p&gt;" $0 "&lt;/p&gt;"
    } else if (env == "blockquote") {
        print $0
    }
}
</code></pre>
<p>Just as `BEGIN` runs code before any input is read, AWK can run code once all the input has been processed, with the `END` keyword.</p>
<p>Naturally, we need to empty the stack and close all the HTML tags that may have been opened during parsing.</p>
<p>It is just a while loop that runs until the last environment is "none", the value the stack was initialized with:</p>
<pre><code>END {
    env = last()
    while (env != "none") {
        env = pop()
        print "&lt;/" env "&gt;"
        env = last()
    }
}
</code></pre>
<p>This way we are able to parse markdown and turn it into an HTML file.</p>
<p>Of course, I am aware that it lacks emphasis, strong and code within a line of text.</p>
<p>I did implement them, though, and they may be explained in another edit of this post.</p>
<p>Nonetheless, the code can be consulted on <a href="https://github.com/SiwonP/bob">GitHub</a>.</p>
<h2>Parsing inline features</h2>
<p>So far we have seen a way to parse blocks, but markdown also handles strong, emphasis and links. 
However, these tags can appear anywhere in a line.</p>
<p>Hence we need to be able to parse the content of a line separately from the block itself: a header, for instance, can contain both strong text and a link.</p>
<p>A very useful function in awk is `match`: it runs the regex engine, looking for a pattern in a string.</p>
<p>Whenever the pattern is found, two global variables are set:</p>
<ul>
<li>RSTART: the index of the first character of the matched substring</li>
<li>RLENGTH: the length of the matched substring</li>
</ul>
<p>In what follows, `line` is the line being processed by the function, as the `while` loops below are actually part of a single function.</p>
<p>This way, `match(line, /\*([^*]+)\*/)` matches a string surrounded by two `*`, which corresponds to emphasized text.</p>
<p>The `*` are escaped because they are special characters, and the group is inside the parentheses. Note that `RSTART` and `RLENGTH` describe the whole match, both `*` included, so extracting the text means skipping one character on each side.</p>
<p>To match several instances of emphasized text within a line, a simple `while` loop does the trick.</p>
<p>We then only have to insert the `&lt;em&gt;` tags at the right places around the matched text, and we are good to go.</p>
<p>We can save the global variables `RSTART` and `RLENGTH` for later use, in case they change. 
Using them, we can also extract the matched substrings and rebuild the actual HTML string:</p>
<pre><code>while (match(line, /\*([^*]+)\*/)) {
    start = RSTART
    end = RSTART + RLENGTH - 1
    # Build the result: before match, &lt;em&gt;, content, &lt;/em&gt;, after match
    line = substr(line, 1, start-1) "&lt;em&gt;" substr(line, start+1, RLENGTH-2) "&lt;/em&gt;" substr(line, end+1)
}
</code></pre>
<p>The case of links is a bit more involved, as we need to match two groups: the link text and the URL itself.</p>
<p>No real issue here: the naive way is to match the whole construct, then look for both the link text and the URL within that match.</p>
<p>This way, `match(line, /\[([^\]]+)\]\([^\)]+\)/)` matches a text between `[]` followed by a text between `()`: the markdown representation of links.</p>
<p>As above, we store `start` and `end`, and also the whole match:</p>
<pre><code>start = RSTART
end = RSTART + RLENGTH - 1
# Extract from line (not $0), since line may already have been rewritten
matched = substr(line, RSTART, RLENGTH)
</code></pre>
<p>It is possible to apply the `match` function to this `matched` string and extract first the text in `[]`, then the text in `()`:</p>
<pre><code>if (match(matched, /\[([^\]]+)\]/)) {
    matched_link = substr(matched, RSTART+1, RLENGTH-2)
}
if (match(matched, /\([^\)]+\)/)) {
    matched_url = substr(matched, RSTART+1, RLENGTH-2)
}
</code></pre>
<p>With the link text and the URL stored, the variables `start` and `end` make it easy to rebuild the HTML line:</p>
<pre><code>line = substr(line, 1, start-1) "&lt;a href=\"" matched_url "\"&gt;" matched_link "&lt;/a&gt;" substr(line, end+1)
</code></pre>
<p>The inline parsing function is now complete; all we have to do is apply it systematically to the text within HTML tags, and this finishes the markdown parser.</p>
<p>This, of course, is only the first brick of a static site generator, and maybe the most complex one.</p>
<p>We shall see next how to orchestrate this parser to make it an actual site generator.</p>
</article>
</body>
</html>